Reinforcement Learning

You are currently browsing the archive for the Reinforcement Learning category.

In “Active Learning Ranking from Pairwise Preferences with Almost Optimal Query Complexity“, Ailon (2011) presents a method for ordering a set based on noisy comparisons of elements.  He proves that his adaptive method approximates the NP Hard optimal solution (Alon 2006) to within $(1+\epsilon)$ times the minimum error after $f(\epsilon^{-1}, n)$ comparisons where $f$ is a polynomial.  Ailon builds on the work of Kenyon-Mathieu and Schudy (2007) who also developed a polynomial time algorithm for approximate ranking.  The Kenyon-Mathieu and Schudy algorithm was not query efficient.

In “Active Learning Literature Survey“, Burr Settles (2010) reviews uncertainty sampling (Lewis and Gale, 1994), margin sampling (Scheffer et al., 2001), entropy sampling, optimal experimental design, query-by-committee (Seung et al., 1992), query-by-boosting, query-by-bagging, expected model change, expected error reduction, expected information gain, variance reduction, and density weighted methods.  He then comments on theoretical and empirical performance of these methods, practical considerations, and related areas of machine learning including: semi-supervised learning, reinforcement learning, and compression.

In “Variance Reduction in Monte-Carlo Tree Search” Veness, Lanctot, and Bowling (2011) “examine the application of some standard techniques for variance reduction in” MonteCarlo Tree Search.  The variance reduction techniques included control variates, common random numbers, and antithetic variates. These techniques were applied to three games: Pig, Can’t Stop, and Dominion.  After reviewing Markov Decision Process and Upper Confidence Trees (UCT), they describe the variance reduction techniques and display charts showing the improvement of UCT for each technique and each game. The simplest game, Pig, gained the most from the variance reduction techniques, but all of the games gained from all of the techniques except no effective antithetic variate was found for Dominion.

In “A Tutorial on the Cross-Entropy Method” De Boer, Kroese, Mannor, and Rubinstein (2005) write

“The CE method was motivated by an adaptive algorithm for estimating probabilities of rare events in complex stochastic networks (Rubinstein, 1997), which involves variance minimization. It was soon realized (Rubinstein, 1999, 2001) that a simple cross-entropy modification of Rubinstein (1997) could be used not only for estimating probabilities of rare events but for solving difficult COPs as well. This is done by translating the “deterministic” optimization problem into a related “stochastic” optimization problem and then using rare event simulation techniques similar to Rubinstein (1997). Several recent applications demonstrate the power of the CE method (Rubinstein, 1999) as a generic and practical tool for solving NP-hard problems.”

and then go on to compare cross-entropy with simulated annealing, tabu search , genetic algorithms, the nested partitioning method, ant colony optimization, rare event estimation algorithms, importance sampling, and algorithms for continuous multi-extremal optimization.

Several applications are listed in the introduction.

“An increasing number of applications is being found for the CE method. Recent publications on applications of the CE method include: buffer allocation (Alon et al., 2005); static simulation models (Homem-de-Mello and Rubinstein, 2002); queueing models of telecommunication systems (de Boer, 2000; de Boer, Kroese, and Rubinstein, 2004); neural computation (Dubin, 2002, 2004); control and navigation (Helvik and Wittner, 2001); DNA sequence alignment (Keith and Kroese, 2002); scheduling (Margolin, 2002, 2004); vehicle routing (Chepuri and Homem-de-Mello, 2005); reinforcement learning (Mannor, Rubinstein, and Gat, 2003; Menache, Mannor, and Shimkin, 2005); project management (Cohen, Golany, and Shtub, 2005); heavy-tail distributions (Asmussen, Kroese, and Rubinstein, 2005); (Kroese and Rubinstein, 2004); CE convergence (Margolin, 2005); network reliability (Hui et al., 2005); repairable systems (Ridder, 2005); and maxcut and bipartition problems (Rubinstein, 2002).”

The rest of the paper describes the basic CE algorithm, its convergence properties, specific applications, and variations.


Tetris is hard, but it is good for your brain. So machine learners have written a number of papers about the application of reinforcement learning to Tetris. Donald Carr (2005) reviews a number of algorithms for Tetris including temporal difference learning, genetic algorithms (Llima, 2005), hand-tuned algorithms specific to Tetris (Dellacherie, Fahey).  Szita and Lörincz (2006) in “Learning Tetris Using the Noisy Cross-Entropy Method” use noise to prevent early suboptimal convergence of the fascinating cross entropy method (similar to particle swarm optimization).  Much like chess, the value of the current state of a Tetris game is estimated with a linear combination of 22 features (for more details, check out the seminal Tetris paper “Feature-Based Methods for Large Scale Dynamic Programming” by Bertsekas and Tsitsiklis 1996.) Their noisy CE method produced solutions almost as good at the best contemporaneous hand-tuned Tetris algorithms.

Newer entries »