“Mixture of Vector Experts” by Henderson, Shawe-Taylor, and Zerovnik (2005) describes an algorithm for mixing the predictions of several experts. Each expert gives a prediction vector in $\{0,1\}^n$ at each time $t$. Their experimental evidence suggests that their algorithm outperforms a similar componentwise scalar mixture of experts.


In “Gambling in a rigged casino: The adversarial multi-armed bandit problem,” Auer, Cesa-Bianchi, Freund, and Schapire (1998) give an algorithm, similar in spirit to the weighted majority algorithm, for adversarial multi-armed bandits with at worst $O(\sqrt{T K \log K})$ regret for $K$ bandits over $T$ rounds. They also establish a lower bound of $\Omega(\sqrt{T K})$ on the regret in the adversarial case. In Section 9, the authors prove that the algorithm can be applied to two-person finite games to compute the value of the game for both players.

The paper develops four algorithms in sequence, called Hedge, Exp3, Exp3.1, and Exp4, each using the previous one as a subroutine. Hedge is a full-information algorithm that requires observing the rewards of the untried arms after choosing one arm. Exp3 requires only a bound on the maximum expected reward achievable by time $T$, and Exp3.1 operates by guessing that maximal expected reward. Exp4 is a mixture-of-experts algorithm.
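The core of Exp3 is an exponential-weights update on importance-weighted reward estimates, mixed with uniform exploration. Here is a minimal Python sketch of that idea; the function name, parameters, and the `reward_fn` interface are my own illustrative choices, not the paper's notation:

```python
import math
import random

def exp3(n_arms, gamma, reward_fn, horizon):
    """Sketch of an Exp3-style adversarial bandit algorithm.

    gamma: exploration rate in (0, 1].
    reward_fn(arm, t): returns a reward in [0, 1] for the pulled arm.
    """
    weights = [1.0] * n_arms
    total_reward = 0.0
    for t in range(horizon):
        w_sum = sum(weights)
        # Mix the normalized exponential weights with uniform exploration.
        probs = [(1 - gamma) * w / w_sum + gamma / n_arms for w in weights]
        arm = random.choices(range(n_arms), weights=probs)[0]
        reward = reward_fn(arm, t)
        total_reward += reward
        # Importance-weighted estimate: only the pulled arm's weight changes,
        # and dividing by probs[arm] keeps the estimate unbiased.
        estimated = reward / probs[arm]
        weights[arm] *= math.exp(gamma * estimated / n_arms)
    return total_reward
```

On a toy instance where one arm always pays 1, the weight of that arm grows quickly and the algorithm concentrates its pulls there.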

In the related paper “The non-stochastic multi-armed bandit problem” by the same authors, the algorithms are extended and the regret bounds are improved. At the end of this paper, they explore applications to Game Theory.

I’m rather fascinated by the HOO algorithm. It seems to be a combination of simulated annealing, branch and bound, the simplex method, and upper confidence trees. I’m going to code it up and try it on a few problems.

Bubeck’s slides “Continuous Stochastic Optimization,” using Hierarchical Optimistic Optimization (HOO)

*X*-Armed Bandits (HOO)

Sample-Based Planning for Continuous Action Markov Decision Processes

Prediction, Learning, and Games (book by Cesa-Bianchi & Lugosi)

Several multi-armed bandit algorithms with logarithmic regret.

This paper by Kuleshov and Precup gives some practical advice and empirical evidence for multi-armed bandit (MAB) algorithms. They conduct several experiments in the standard MAB setting with the $\epsilon$-greedy, upper confidence bound (UCB), and Boltzmann algorithms. Their time horizon is only 1000 pulls, so, in general, the $\epsilon$-greedy and Boltzmann algorithms do better than the logarithmic-regret algorithms.
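To make the comparison concrete, here is a minimal Python sketch of two of the compared strategies, $\epsilon$-greedy and UCB1, on Bernoulli arms over a short horizon; the arm means, horizon, and function names are illustrative assumptions, not taken from the paper:

```python
import math
import random

def eps_greedy(means, epsilon, horizon, rng):
    """epsilon-greedy on Bernoulli arms: explore uniformly with prob. epsilon."""
    counts = [0] * len(means)
    values = [0.0] * len(means)
    total = 0
    for t in range(horizon):
        if rng.random() < epsilon:
            arm = rng.randrange(len(means))
        else:
            arm = max(range(len(means)), key=lambda a: values[a])
        reward = 1 if rng.random() < means[arm] else 0
        counts[arm] += 1
        values[arm] += (reward - values[arm]) / counts[arm]  # running mean
        total += reward
    return total

def ucb1(means, horizon, rng):
    """UCB1: pull the arm maximizing mean + sqrt(2 ln t / n_pulls)."""
    k = len(means)
    counts = [0] * k
    values = [0.0] * k
    total = 0
    for t in range(horizon):
        if t < k:
            arm = t  # initialize by pulling each arm once
        else:
            arm = max(range(k), key=lambda a: values[a]
                      + math.sqrt(2 * math.log(t) / counts[a]))
        reward = 1 if rng.random() < means[arm] else 0
        counts[arm] += 1
        values[arm] += (reward - values[arm]) / counts[arm]
        total += reward
    return total
```

At a horizon of 1000 pulls both strategies typically find the best arm; the interesting question the paper studies empirically is which one pays less for exploration along the way.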

This PowerPoint presentation by Avrim Blum covers the history of bandit problems, regret, and online learning theory. More PowerPoint presentations from his course can be found at http://www.cs.cmu.edu/~avrim/ML12/.