Gambling in a rigged casino

In “Gambling in a rigged casino: The adversarial multi-armed bandit problem,” Auer, Cesa-Bianchi, Freund, and Schapire (1998) give an algorithm for adversarial multi-armed bandits, related to the weighted majority algorithm, with regret at most $O(\sqrt{T K \log K})$ for $K$ arms over $T$ trials. They also establish a lower bound of $\Omega(\sqrt{T K})$ on the regret in the adversarial case. In Section 9, the authors prove that this algorithm can be applied to two-person finite games to compute the value of the game for both players.
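To make the quantity being bounded concrete: the regret here is the “weak regret” against the single best arm. A standard way to write it (the symbols below are illustrative; the paper works with cumulative gains) is

$$
R_T \;=\; \max_{1 \le i \le K} \sum_{t=1}^{T} x_i(t) \;-\; \mathbb{E}\!\left[\sum_{t=1}^{T} x_{i_t}(t)\right],
$$

where $x_i(t) \in [0,1]$ is the reward of arm $i$ at trial $t$, $i_t$ is the arm pulled, and the expectation is over the algorithm's internal randomization. The $O(\sqrt{T K \log K})$ upper bound is sublinear in $T$, so the average per-trial regret vanishes as $T$ grows.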

The paper develops four algorithms in sequence, called Hedge, Exp3, Exp3.1, and Exp4, each of which uses an earlier one as a subroutine. Hedge is a full-information algorithm: after each trial it observes the rewards of all arms, including the ones it did not choose. Exp3 needs only the reward of the chosen arm, but its tuning requires an upper bound on the best arm's total reward over the $T$ trials; Exp3.1 removes that requirement by guessing the maximal total reward and restarting with a larger guess whenever the current one is exceeded. Exp4 extends Exp3 to bandit problems with expert advice, combining the recommendations of a set of experts. A sketch of Exp3 appears below.
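Here is a minimal Python sketch of Exp3 under the usual assumptions (rewards in $[0,1]$, a fixed exploration parameter $\gamma$); the function and parameter names are mine, not the paper's:

```python
import math
import random

def exp3(K, T, reward_fn, gamma=0.1):
    """Exp3 sketch: exponential weights driven by importance-weighted
    reward estimates, so only the chosen arm's reward is ever observed."""
    weights = [1.0] * K
    total_reward = 0.0
    for t in range(T):
        total_w = sum(weights)
        # Mix the exponential-weights distribution with uniform exploration.
        probs = [(1 - gamma) * w / total_w + gamma / K for w in weights]
        arm = random.choices(range(K), weights=probs)[0]
        x = reward_fn(arm, t)  # adversary's reward for the pulled arm, in [0, 1]
        total_reward += x
        # Dividing by the pull probability makes the estimate unbiased
        # for every arm, even though only one reward was seen.
        x_hat = x / probs[arm]
        weights[arm] *= math.exp(gamma * x_hat / K)
    return total_reward
```

The uniform-mixing term $\gamma/K$ keeps every pull probability bounded away from zero, which prevents the importance-weighted estimates from blowing up.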

In the related paper “The non-stochastic multi-armed bandit problem,” the same authors extend the algorithms and improve the regret bounds, and the paper closes with applications to game theory.