“Algorithms for Infinitely Many-Armed Bandits”

In “Algorithms for Infinitely Many-Armed Bandits”, Wang, Audibert, and Munos (2008) describe algorithms for the multi-armed bandit problem when a large, or even infinite, number of arms is available. Their algorithms are designed for the situation where all rewards are contained in $[0,1]$ and “the probability that a new arm is $\epsilon$-optimal is of order $\epsilon^\beta$”. More precisely, there exist real numbers $\mu^*$ and $\beta > 0$ such that the mean reward $\mu$ of a newly drawn arm satisfies

$$P(\mu^* - \mu < \epsilon) = \Theta(\epsilon^\beta) \quad \text{as } \epsilon \to 0.$$
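
For a concrete instance: if arm means are drawn uniformly from $[0,1]$, then $\mu^* = 1$ and

$$P(\mu^* - \mu < \epsilon) = P(\mu > 1 - \epsilon) = \epsilon \quad \text{for } \epsilon \in [0,1],$$

so the condition holds with $\beta = 1$, the case covered by the Berry et al. paper cited below.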

Their algorithm applies UCB to the first $n^{\beta/(\beta+1)}$ arms, and they prove that its total regret is at most of order $n^{\beta/(\beta+1)}\log^2(n)$ if $\beta > 1$, and of order $\sqrt{n}\log^2(n)$ otherwise. Additionally, they prove a lower bound of order $n^{\beta / (\beta + 1)}$ for any algorithm, so for $\beta > 1$ the upper bound is tight up to logarithmic factors. (The case where $\beta = 1$ was explored earlier in “Bandit problems with infinitely many arms” by Berry, Chen, Zame, Heath, and Shepp (1997).)
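
To make the strategy concrete, here is a minimal simulation sketch, not the paper’s exact algorithm with its tuned exploration sequence: it keeps $K = \lceil n^{\beta/(\beta+1)} \rceil$ arms and runs standard UCB1 on them. The uniform arm means (the $\beta = 1$ case with $\mu^* = 1$), Bernoulli rewards, and the UCB1 confidence radius are illustrative assumptions rather than details taken from the paper.

```python
import math
import random

def ucb_on_first_arms(n, beta, seed=0):
    """Sketch: keep K = ceil(n^(beta/(beta+1))) arms, then run UCB1 on them.

    Assumptions for the demo (not from the paper): arm means are
    Uniform(0,1), so beta = 1 and mu* = 1, and rewards are Bernoulli.
    """
    rng = random.Random(seed)
    k = math.ceil(n ** (beta / (beta + 1)))
    means = [rng.random() for _ in range(k)]  # means of the K retained arms

    counts = [0] * k      # number of pulls per arm
    totals = [0.0] * k    # summed rewards per arm
    regret = 0.0          # pseudo-regret measured against mu* = 1

    for t in range(1, n + 1):
        if t <= k:
            arm = t - 1   # initialization: pull each retained arm once
        else:
            # UCB1 index: empirical mean plus a confidence radius
            arm = max(range(k),
                      key=lambda i: totals[i] / counts[i]
                      + math.sqrt(2.0 * math.log(t) / counts[i]))
        reward = 1.0 if rng.random() < means[arm] else 0.0  # Bernoulli draw
        counts[arm] += 1
        totals[arm] += reward
        regret += 1.0 - means[arm]  # expected regret of this pull

    return regret

if __name__ == "__main__":
    for n in [1000, 10000, 100000]:
        print(n, round(ucb_on_first_arms(n, beta=1.0), 1))
```

With $\beta = 1$ the retained subset has about $\sqrt{n}$ arms, and by the theorem above the simulated regret should grow roughly like $\sqrt{n}$, up to logarithmic factors.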