Multi-Armed Bandit Problem

You are currently browsing the archive for the Multi-Armed Bandit Problem category.

“Regret Analysis of Stochastic and Nonstochastic Multi-armed Bandit Problems” by Sébastie Bubeck and Nicolò Cesa-Bianchi is available in pdf format at

In case you had not gotten the news yet, the Go playing program AlphaGo (developed by the Deep Mind division of Google) has beaten Lee Se-dol who is among the top two or three Go players in the world.  Follow the link below for an informative informal video describing AlphaGo and the victory.

Science magazine has a nice pregame report.

Ted Dunning has a nice post on multi-armed bandits in the “long tail” blog.  He provides some references including the ICML Tutorial and recent work on Thompson sampling.

John Regehr writes this post about a compiler speed optimization technique called “Stochastic Superopimization“.  “Stochastic Superopimization” systematically searches for algorithmic improvement in code using machine learning algorithms similar to multi-armed bandit strategies.  It appears to be related to “Programming by Optimization“.  “Stochastic Superopimization” is more like a very good optimization flag on a compiler.  “Programming by Optimization” is constructing the program in such a fashion that design options are exposed and easily manipulable by an optimization program trying to maximize some performance metric.  The “Programming by Optimization” community seems to mostly use BOA, the Bayesian optimization algorithm (see [1], [2]).  I am hoping to read and write more about both of these ideas later.

In “Generalized Thompson sampling for sequential decision-making and causal inference“, Ortega and Braun (2013) give a short history of Thompson sampling (see [1][2][3][4]) and report on the relationship between intelligent agents, evolutionary game theory, Bayesian inference, KL Divergence, and Thompson sampling.  They develop a generalization of Thompson sampling based on a Bayesian prior over a distribution of environments.  Some numerical results are provided.  Here’s the abstract:

Recently, it has been shown how sampling actions from the predictive distribution over the optimal action—sometimes called Thompson sampling—can be applied to solve sequential adaptive control problems, when the optimal policy is known for each possible environment. The predictive distribution can then be constructed by a Bayesian superposition of the optimal policies weighted by their posterior probability that is updated by Bayesian inference and causal calculus. Here we discuss three important features of this approach. First, we discuss in how far such Thompson sampling can be regarded as a natural consequence of the Bayesian modeling of policy uncertainty. Second, we show how Thompson sampling can be used to study interactions between multiple adaptive agents, thus, opening up an avenue of game-theoretic analysis. Third, we show how Thompson sampling can be applied to infer causal relationships when interacting with an environment in a sequential fashion. In summary, our results suggest that Thompson sampling might not merely be a useful heuristic, but a principled method to address problems of adaptive sequential decision-making and causal inference.

“…we often joke that our job, as the team that builds the experimentation platform, is to tell our clients that their new baby is ugly, …”

Andrew Gelman at Statistical Modeling, Causal Inference, and Social Science pointed me towards the paper “Trustworthy Online Controlled Experiments:  Five Puzzling Outcomes Explained” by Ron Kohavi, Alex Deng, Brian Frasca, Roger Longbotham, Toby Walker, and Ya Xu all of whom seem to be affiliated with Microsoft.

The paper itself recounted five online statistical experiments mostly done at Microsoft that had informative counter-intuitive results:

  • Overall Evaluation Criteria for Bing
  • Click Tracking
  • Initial Effects
  • Experiment Length
  • Carry Over Effects.

The main lessons learned were:

  • Be careful what you wish for. – Short term effects may be diametrically opposed to long-term effects.  Specifically, a high number clicks or queries per session could be indicative of a bug rather than success.  It’s important to choose the right metric.  The authors ended up focusing on “sessions per user” as a metric as opposed to “queries per month” partly due to a bug which increased (in the short-term) queries and revenues while degrading the user’s experience.
  • Initial results are strongly affected by “Primacy and Novelty”. – In the beginning, experienced users may click on a new option just because it is new, not because it’s good.  On the other hand, experienced users may be initially slowed by a new format even if the new format is “better”.
  • If reality is constantly changing, the experiment length may not improve the accuracy of the experiment.  The underlying behavior of the users may change every month.  A short-term experiment may only capture a short-term behavior. Rather than running the experiment for years, the best option may be to run several short-term experiments and adapt the website to the changing behavior as soon as the new behavior is observed.
  • If the same user is presented with the same experiment repeatedly, her reaction to the experiment is a function of the number of times she has been exposed to the experiment.  This effect must be considered when interpreting experimental results.
  • The Poisson Distribution should not be used to model clicks.  They preferred Negative Binomial.

The paper is easy to read, well written, and rather informative. It is especially good for web analytics and for anyone new to experimental statistics.  I found the references below to be especially interesting:


In “An Empirical Evaluation of Thompson Sampling”, Chapelle and Lihong Li (NIPS 2011) compare Upper Confidence Bound (UCB), Gittins, and Thompson methods for the multi-armed bandit problem.  More theoretical analysis has been done for UCB and Gittins, but Thompson sampling is simple (as opposed to Gittins) and seems to work well, sometimes performing significantly better that the other common algorithms. Both synthetic and real world (advertising) results are presented.

Thanks to Nuit Blanche for pointing me towards the presentation by Andrea MontanariCollaborative Filtering: Models and Algorithms” and the associated Deshpande and Montanari paper “Linear Bandits in High Dimension and Recommendation Systems”  (2012).  In the presentation, Montanari reviews Spectral, Gradient Descent, Stochasitc Gradient Descent, Convex Relaxation, and Linear Bandit methods for approximating the standard linear model for recommendation systems and some accuracy guarantees.

Assuming the $j$th movie has features $v_{j1}, v_{j2}, \ldots, v_{jr}$, then the $i$th viewer gives the rating

$R_{ij} = \langle u_i, v_j \rangle +\epsilon_{ij}$

where $u_i$ is an $r$ dimensional vector representing the preferences of $i$th viewer and $\epsilon_{ij}$ is Gaussian noise.

The paper introduces a new Linear Bandit method,  Smooth Explore, better suited for recommendation systems.  Their method is motivated by the three objectives:

  • Constant-optimal cumulative reward,
  • Constant-optimal regret, and 
  • Approximate monotonicity (rewards approximately increase with time).

Smooth Explore estimates the user preferences vectors with a regularized least squares regression.  Proofs of optimality and numerical results are provided.



In “A Survey of Monte Carlo Tree Search Methods“, Browne, Powley, Whitehouse, Lucas, Cowling, Rohlfshagen, Tavener, Perez, Samothrakis, and Colton (2012) wrote an extensive review of the variations of Monte Carlo Tree Search (MCTS) referencing 240 previous papers.  MCTS (specifically upper confidence trees (UCT)) was popularized by its unusual effectiveness in the game Go.  UCT significantly improved computer Go to the point where it is now competitive with professional Go players on small boards, but not on the standard 19×19 board. The paper updates and significantly extends the 2010 survey of MCTS for Go “Current Frontiers in Computer Go” by Rimmel, Teytaud, Lee, Yen, Wang, and Tsai.



“Monte Carlo Tree Search (MCTS) is a recently proposed search method that combines the precision of tree search with the generality of random sampling. It has received considerable interest due to its spectacular success in the difficult problem of computer Go, but has also proved beneficial in a range of other domains. This paper is a survey of the literature to date, intended to provide a snapshot of the state of the art after the first five years of MCTS research. We outline the core algorithm’s derivation, impart some structure on the many variations and enhancements that have been proposed, and summarise the results from the key game and non-game domains to which MCTS methods have been applied. A number of open research questions indicate that the field is ripe for future work.”


“In Section 2, we present central concepts of AI and games, introducing notation and terminology that set the stage for MCTS. In Section 3, the MCTS algorithm and its key components are described in detail. Section 4 summarises the main variations that have been proposed. Section 5 considers enhancements to the tree policy, used to navigate and construct the search tree.  Section 6 considers other enhancements, particularly to simulation and backpropagation steps. Section 7 surveys the key applications to which MCTS has been applied, both in games and in other domains. In Section 8, we summarise the paper to give a snapshot of the state of the art in MCTS research, the strengths and weaknesses of the approach, and open questions for future research.  The paper concludes with two tables that summarise the many variations and enhancements of MCTS and the domains to which they have been applied.”

In “Does Luck Matter More Than Skill?“, Cal Newport writes

<success of a project> = <project potential> x <serendipitous factors>,

where <project potential> is a measure of the rareness and value of your relevant skills, and the value of the serendipitous factors is drawn from something like an exponential distribution.


If you believe that something like this equation is true, then this approach of becoming as good as possible while trying many different projects, maximizes your expected success.

Indeed, we can call this the Schwarzenegger Strategy, as it does a good job of describing his path to stardom. Looking back at his story, notice that he tried to maximize the potential in every project he pursued (always “putting in the reps”). But he also pursued a lot of projects, maximizing the chances that he would occasionally complete one with high serendipity. His breaks, as described above, all required both rare and valuable skills, and luck. And each such project was surrounded in his life by other projects in which things did not turn out so well.

If success is measured in dollars, then I bet the distributions of <serendipitous factors> have fat 1/polynomial tails because there are a lot of people with great skills, but the wealth distribution among self-made billionaires is something like C/earnings^1.7.  For many skills, like probability of hitting a baseball, the amount of skill seems to be proportional to log(practice time) plus a constant.  For other skills, like memorized vocabulary, the amount of skill seems proportional to (study time)^0.8 or the Logarithmic Integral Function.  Mr Newport emphasizes the “rareness” of skill also.  Air is important, but ubiquitous, so no one charges for it despite it’s value.  In baseball, I imagine that increasing your batting average a little bit can increase your value a lot.  I wonder what the formulas for <project potential> are for various skills.  If we could correctly model Newport’s success equation, we could figure out the correct multi-armed bandit strategy for maximizing success.  (Maybe we could call it the Schwarzenegger Bandit Success Formula.) You may even be able to add happiness into the success formula and still get a good bandit strategy for achieving it.

« Older entries