Multi-Armed Bandit Problem

You are currently browsing the archive for the Multi-Armed Bandit Problem category.

In “Analysis of Thompson Sampling for the multi-armed bandit problem“, Agrawal and Goyal show that Thompson Sampling (“The basic idea is to choose an arm to play according to its probability of being the best arm.”) has logarithmic regret. More specifically, if there are $n$ bandits and the regret $R$ at time $T$ is defined by

$$R(T) := \sum_{t=1}^T (\mu^* – \mu_{i(t)})$$

where $\mu_i$ is the expected return of the $i$th arm and $\mu^* = \max_{i = 1, \ldots, n} \mu_i$, then

$$E[R(T)] \in O\left(\left(\sum_{i\neq i^*} 1/\Delta_i^2\right)^2 \log T \right)$$

where $\mu(i^*) = \mu^*$ and $\Delta_i = \mu^* – \mu_i$.

In “BOA: The Bayesian Optimization Algorithm“, Pelikan, Goldberg, and Cant´u-Paz introduce an adaptive improvement over genetic optimization algorithms (See also [1]).  They write, “In this paper, an algorithm based on the concepts of genetic algorithms that uses an estimation of a probability distribution of promising solutions in order to generate new candidate solutions is proposed. To estimate the distribution, techniques for modeling multivariate data by Bayesian networks are used.”  and

“The algorithm proposed in this paper is also capable of covering higher order interactions. It uses techniques from the field of modeling data by Bayesian networks in order to estimate the joint distribution of promising solutions. The class of distributions that are considered is identical to the class of conditional distributions used in the FDA. Therefore, the theory of the FDA can be used in order to demonstrate the power of the proposed algorithm to solve decomposable problems. However, unlike the FDA, our algorithm does not require any prior information about the problem.  It discovers the structure of a problem on the fly.”

where FDA refers to the Factorized Distribution Algorithm (Mühlenbein et al., 1998).  The algorithm consists of the following steps

The Bayesian Optimization Algorithm (BOA)
(1) set t ← 0 randomly generate initial population P(0)
(2) select a set of promising strings S(t) from P(t)
(3) construct the network B using a chosen metric and constraints
(4) generate a set of new strings O(t) according to the joint distribution encoded by B
(5) create a new population P(t+1) by replacing some strings from P(t) with O(t)  set t ← t + 1
(6) if the termination criteria are not met, go to (2)

where B is a Bayesian network.

Check out the NIPS 2011 workshop.

NIPS was pretty fantastic this year.  There were a number of breakthroughs in the areas that interest me most:  Markov Decision Processes, Game Theory, Multi-Armed Bandits, and Deep Belief Networks.  Here is the list of papers, workshops, and presentations I found the most interesting or potentially useful:


  1. Representation, Inference and Learning in Structured Statistical Models
  2. Stochastic Search and Optimization
  3. Quantum information and the Brain
  4. Relax and Randomize : From Value to Algorithms  (Great)
  5. Classification with Deep Invariant Scattering Networks
  6. Discriminative Learning of Sum-Product Networks
  7. On the Use of Non-Stationary Policies for Stationary Infinite-Horizon Markov Decision Processes
  8. A Unifying Perspective of Parametric Policy Search Methods for Markov Decision Processes
  9. Regularized Off-Policy TD-Learning
  10. Multi-Stage Multi-Task Feature Learning
  11. Graphical Models via Generalized Linear Models (Great)
  12. No voodoo here! Learning discrete graphical models via inverse covariance estimation (Great)
  13. Gradient Weights help Nonparametric Regressors
  14. Dropout: A simple and effective way to improve neural networks (Great)
  15. Efficient Monte Carlo Counterfactual Regret Minimization in Games with Many Player Actions
  16. A Better Way to Pre-Train Deep Boltzmann Machines
  17. Bayesian Optimization and Decision Making
  18. Practical Bayesian Optimization of Machine Learning Algorithms
  19. Modern Nonparametric Methods in Machine Learning
  20. Deep Learning and Unsupervised Feature Learning
Unfortunately, when you have 30 full day workshops in a two day period, you miss most of them.  I could only attend the three listed above.  There were many other great ones.



“A Product of Experts model (PoE) (Hinton 2002) combines a number of individual component models (the experts) by taking their product and normalizing the result.”  Here’s the Scholarpedia page.

In “Algorithms for Infinitely Many-Armed Bandits”, Wang, Audibert, and Munos (2008) describe some algorithms for the multi-armed bandit problem when a large number or infinitely many arms are available. Their algorithms are designed for the situation where all rewards are contained in $[0,1]$ and “the probability that a new arm is $\epsilon$-optimal is of order $\epsilon^\beta$”. More precisely, there exist real numbers $c, \mu^*,$ and $\beta$ such that the expected value of an unexplored arm $\mu$ obeys

$$P(\mu^* – \mu < \epsilon) < c \epsilon^\beta.$$

They prove that the total regret is at most of order $n^{\beta/(\beta+1)}\log^2(n)$ if $\beta > 1$ and $\log^2(n)\sqrt{n}$ otherwise. Additionally, they prove a lower bound of order $n^{\beta / (\beta + 1)}$ for any algorithm. Their algorithm applies UCB to the first $n^{\beta/(\beta+1)}$ arms. (The case where $\beta = 1$ was explored in “Bandit problems with infinitely many arms” by Berry, Chen, Zame, Heath, and Shepp (1997).)

In “Optimistic Optimization of a Deterministic Function without the Knowledge of its Smoothness“, Remi Munos (2011) develops an optimization algorithm similar to hierarchical optimistic optimization, Lipschitz optimization, the Zooming algorithm (Klienberg, Slivkins, & Upfal 2008), branch and bound, and the upper confidence bound algorithm.  The algorithm does not minimize regret, rather it attempts to maximize $\max_n f(x_n)$ over all the samples $x_n$.  Munos’s first algorithm, Deterministic Optimistic Optimization, requires that the function be smooth with respect to a known semi-metric.  His second algorithm, Simultaneous Optimistic Optimization, does not require knowledge of the smoothness semi-metric.  He proves performance bounds, gives examples for both algorithms, and finishes the paper by comparing the algorithm to the well known DIRECT (DIviding RECTangles) algorithm (Jones, Perttunen, Stuckman 1993).

I was reading the Machine Learning‘s article “Coactive Learning” and they referred to that paper “From Bandits to Experts: On the Value of Side-Observations” by Mannor and Shamir (2011). This paper develops algorithms for the situation where the learner gets information about neighboring bandits after it chooses which bandit arm to pull. Recall that in the mixture of experts situation, the leaner gets to see the results of all the experts (bandits) after choosing which arm to pull.

This 2011 tutorial (Parts 1 and 2) is the best introduction to stochastic and adversarial bandits and details recent advances in research that I have seen. Many of the bandit papers that I have linked to in previous posts are quoted.   Some experimental results and suggested parameter values are given. Part one mentions that regret may be much larger than the expected regret as per the paper “Exploration-exploitation tradeoff using variance estimates in multi-armed bandits” by Audibert, Munos, Szepesvári (2009). It describes the UCB-Horizon algorithm and Hoeffding-based GCL∗ policy which alleviate this problem. Part two is titled “Bandits with large sets of actions” and mostly deals with the case of infinitely many bandits. Many algorithms are described including Hierarchical Optimistic Optimization and a huge bibliography is provided at the end of the tutorial.

In the paper “Finite-time Analysis of the Multiarmed Bandit Problem“, Auer, Cesa-Bianchi, and Fisher (2002) prove that for the fixed reward distributions, “the optimal logarithmic regret is also achievable uniformly over time, with simple and efficient policies, and for all reward distributions with bounded support.”

The paper discusses the following three ideas:

1) The expected regret for the upper confidence bound algorithm UCB1 is bounded above by

$$\left[ 8 \sum_{i:\mu_i < \mu^*} \left({\log n \over {\mu^* - \mu_i}}\right)\right] + \left( 1 + {\pi^2\over 3}\right)\left(\sum_{j=1}^K (\mu^*-\mu_j) \right)$$

where $\mu_i$ is the mean of the $i$th distribution, $n$ is the number of pulls, and $K$ is the number of arms if the rewards are between 0 and 1.

2)  They improve on the coefficient of $\log n$ with a slightly more complex algorithm UCB2.

3)  The $\epsilon_n = 1/n$ greedy algorithm has $O(\log n)$ regret.


They conclude the paper with numerical results.

In “Regret Bounds for Sleeping Experts and Bandits”  Kleinberg, Niculescu-Mizil, and Sharma (2008) propose an algorithm for selecting among a list of experts some of whom may be sleeping.  At each time step $t$, the algorithm ranks the experts in order of preference with a permutation vector $\sigma(t)$.   The first analyzed algorithm is Follow The Awake Leader (FTAL) for the case where all of the expert predictions and rewards are observed.  Next they explore a natural extension of upper confidence bound algorithm for the multi-armed bandit setting.  In the adversarial setting, they prove that the regret is at least $O(\sqrt{t\, K \log K})$ for $K$ bandits at time $t$.

« Older entries § Newer entries »