In the paper “Finite-time Analysis of the Multiarmed Bandit Problem“, Auer, Cesa-Bianchi, and Fisher (2002) prove that for the fixed reward distributions, “the optimal logarithmic regret is also achievable uniformly over time, with simple and efficient policies, and for all reward distributions with bounded support.”

The paper discusses the following three ideas:

1) The expected regret for the upper confidence bound algorithm UCB1 is bounded above by

$$\left[ 8 \sum_{i:\mu_i < \mu^*} \left({\log n \over {\mu^* - \mu_i}}\right)\right] + \left( 1 + {\pi^2\over 3}\right)\left(\sum_{j=1}^K (\mu^*-\mu_j) \right)$$

where $\mu_i$ is the mean of the $i$th distribution, $n$ is the number of pulls, and $K$ is the number of arms if the rewards are between 0 and 1.

2) They improve on the coefficient of $\log n$ with a slightly more complex algorithm UCB2.

3) The $\epsilon_n = 1/n$ greedy algorithm has $O(\log n)$ regret.

They conclude the paper with numerical results.

## Related Posts via Categories