“Analysis of Thompson Sampling for the multi-armed bandit problem”

In “Analysis of Thompson Sampling for the multi-armed bandit problem“, Agrawal and Goyal show that Thompson Sampling (“The basic idea is to choose an arm to play according to its probability of being the best arm.”) has logarithmic regret. More specifically, if there are $n$ bandits and the regret $R$ at time $T$ is defined by

$$R(T) := \sum_{t=1}^T (\mu^* – \mu_{i(t)})$$

where $\mu_i$ is the expected return of the $i$th arm and $\mu^* = \max_{i = 1, \ldots, n} \mu_i$, then

$$E[R(T)] \in O\left(\left(\sum_{i\neq i^*} 1/\Delta_i^2\right)^2 \log T \right)$$

where $\mu(i^*) = \mu^*$ and $\Delta_i = \mu^* – \mu_i$.