“Finite-time Analysis of the Multiarmed Bandit Problem”

September 12, 2012 in Multi-Armed Bandit Problem by hundalhh | Permalink

In the paper “Finite-time Analysis of the Multiarmed Bandit Problem“, Auer, Cesa-Bianchi, and Fisher (2002) prove that for the fixed reward distributions, “the optimal logarithmic regret is also achievable uniformly over time, with simple and efficient policies, and for all reward distributions with bounded support.”

The paper discusses the following three ideas:

1) The expected regret for the upper confidence bound algorithm UCB1 is bounded above by

$$\left[ 8 \sum_{i:\mu_i < \mu^*} \left({\log n \over {\mu^* - \mu_i}}\right)\right] + \left( 1 + {\pi^2\over 3}\right)\left(\sum_{j=1}^K (\mu^*-\mu_j) \right)$$

where $\mu_i$ is the mean of the $i$th distribution, $n$ is the number of pulls, and $K$ is the number of arms if the rewards are between 0 and 1.

2) They improve on the coefficient of $\log n$ with a slightly more complex algorithm UCB2.

3) The $\epsilon_n = 1/n$ greedy algorithm has $O(\log n)$ regret.

They conclude the paper with numerical results.

Artificial Intelligence Blog

“Finite-time Analysis of the Multiarmed Bandit Problem”

Categories

Archives

Subscribe to ArtEnt via Email

Artificial Intelligence Blog

“Finite-time Analysis of the Multiarmed Bandit Problem”

Related Posts via Categories

Categories

Archives

Subscribe to ArtEnt via Email