In the seminal paper “Gene Selection for Cancer Classification using Support Vector Machines”, Guyon, Weston, Barnhill, and Vapnik (2002) use Recursive Feature Elimination to find the genes that are most predictive of cancer. Recursive Feature Elimination repeatedly ranks the features and eliminates the worst feature until only a small subset of the original set of features remains. Although several feature ranking methods were explored, the main method was a soft margin SVM classifier with which the authors found 8 key colon cancer genes out of 7000.
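The rank-and-eliminate loop can be sketched in a few lines of Python. This is only an illustration under assumptions of my own: the toy data, the function names, and the plain subgradient-descent trainer are invented here, and the ranking score (the squared weight of a linear soft margin SVM) is one common choice rather than a claim about the authors' exact procedure.

```python
import random

def train_linear_svm(X, y, lam=0.01, epochs=100, lr=0.1):
    """Full-batch subgradient descent on the L2-regularized hinge loss.
    A simple stand-in trainer (an assumption), not the paper's solver."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w = [lam * wj for wj in w]
        grad_b = 0.0
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            if margin < 1:  # point is inside the margin or misclassified
                for j in range(d):
                    grad_w[j] -= yi * xi[j] / n
                grad_b -= yi / n
        w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b

def recursive_feature_elimination(X, y, n_keep):
    """Retrain, rank features by squared weight, drop the worst, repeat."""
    active = list(range(len(X[0])))
    while len(active) > n_keep:
        X_sub = [[row[j] for j in active] for row in X]
        w, _ = train_linear_svm(X_sub, y)
        worst = min(range(len(active)), key=lambda k: w[k] ** 2)
        del active[worst]
    return active

def make_toy_data(n=200, d=6, seed=0):
    """Hypothetical data: feature 0 carries the label, the rest are noise."""
    rng = random.Random(seed)
    X, y = [], []
    for _ in range(n):
        label = rng.choice([-1, 1])
        row = [rng.gauss(0, 1) for _ in range(d)]
        row[0] = label * (1.0 + rng.random())
        X.append(row)
        y.append(label)
    return X, y

X, y = make_toy_data()
selected = recursive_feature_elimination(X, y, n_keep=1)
print("surviving feature indices:", selected)
```

Retraining after every elimination is the point of the method: a feature's weight can change once a correlated feature has been removed, so a single ranking pass is not equivalent.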
In “Temporal Autoencoding Restricted Boltzmann Machine”, Hausler and Susemihl explain how to train a deep belief RBM to recognize patterns in sequences of inputs (mostly video). The resulting networks could recognize the patterns in human motion capture or the non-linear dynamics of a bouncing ball.
In “A Survey of Monte Carlo Tree Search Methods”, Browne, Powley, Whitehouse, Lucas, Cowling, Rohlfshagen, Tavener, Perez, Samothrakis, and Colton (2012) wrote an extensive review of the variations of Monte Carlo Tree Search (MCTS), referencing 240 previous papers. MCTS, and specifically the upper confidence trees (UCT) variant, was popularized by its unusual effectiveness in the game of Go. UCT significantly improved computer Go to the point where it is now competitive with professional Go players on small boards, but not on the standard 19×19 board. The paper updates and significantly extends the 2010 survey of MCTS for Go, “Current Frontiers in Computer Go”, by Rimmel, Teytaud, Lee, Yen, Wang, and Tsai.
Abstract
“Monte Carlo Tree Search (MCTS) is a recently proposed search method that combines the precision of tree search with the generality of random sampling. It has received considerable interest due to its spectacular success in the difficult problem of computer Go, but has also proved beneficial in a range of other domains. This paper is a survey of the literature to date, intended to provide a snapshot of the state of the art after the first five years of MCTS research. We outline the core algorithm’s derivation, impart some structure on the many variations and enhancements that have been proposed, and summarise the results from the key game and non-game domains to which MCTS methods have been applied. A number of open research questions indicate that the field is ripe for future work.”
Outline
“In Section 2, we present central concepts of AI and games, introducing notation and terminology that set the stage for MCTS. In Section 3, the MCTS algorithm and its key components are described in detail. Section 4 summarises the main variations that have been proposed. Section 5 considers enhancements to the tree policy, used to navigate and construct the search tree. Section 6 considers other enhancements, particularly to simulation and backpropagation steps. Section 7 surveys the key applications to which MCTS has been applied, both in games and in other domains. In Section 8, we summarise the paper to give a snapshot of the state of the art in MCTS research, the strengths and weaknesses of the approach, and open questions for future research. The paper concludes with two tables that summarise the many variations and enhancements of MCTS and the domains to which they have been applied.”
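To give a feel for the tree policy that UCT uses, here is a minimal sketch of the child-selection step only. It assumes rewards in [0, 1] and the UCB1-style score of mean reward plus 2·C·sqrt(2·ln(n)/n_j); the function names, the `(total_reward, visits)` tuples, and the default constant C = 1/sqrt(2) are my own illustrative choices, and a full MCTS would also need expansion, simulation, and backpropagation.

```python
import math

def uct_score(total_reward, visits, parent_visits, c=1 / math.sqrt(2)):
    """UCB1-style score: average reward plus an exploration bonus."""
    if visits == 0:
        return math.inf  # unvisited children are always tried first
    exploit = total_reward / visits
    explore = 2 * c * math.sqrt(2 * math.log(parent_visits) / visits)
    return exploit + explore

def select_child(children, c=1 / math.sqrt(2)):
    """children is a list of (total_reward, visits); returns the chosen index."""
    parent_visits = sum(visits for _, visits in children)
    return max(
        range(len(children)),
        key=lambda i: uct_score(children[i][0], children[i][1], parent_visits, c),
    )

# Child 0 has the better average (0.8 vs 0.75) but has been visited more often.
stats = [(8.0, 10), (3.0, 4)]
print(select_child(stats, c=0.0))  # pure exploitation picks child 0
print(select_child(stats))         # the exploration bonus favours child 1
```

Setting `c=0.0` turns the rule into greedy selection, which makes the effect of the exploration term easy to see on the same statistics.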
“They’re probably going to render us extinct one day, so we might as well enjoy their servitude, while it lasts.”
http://www.odditycentral.com/travel/the-future-is-now-china-opens-robot-operated-restaurant.html
http://marginalrevolution.com/marginalrevolution/2013/01/the-robot-restaurant.html
Bengio and LeCun created this wonderful video on Deep Neural Networks. Any logical function can be represented by a neural net with 3 layers (one hidden; see e.g. CNF). However, simple 4-level logical functions with a small number of nodes may require a large number of nodes in a 3-layer representation. They point to theorems showing that representing a k-level logical function in a (k-1)-level network can require an exponential number of nodes. They go on to explain denoising autoencoders for the training of deep neural nets.
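A small experiment makes the depth versus size trade-off concrete for plain AND/OR formulas. The parity function is my own choice of example (I am not claiming it is the one used in the video), and the sketch counts terms in an OR-of-ANDs formula (DNF, the dual of CNF), not threshold units. Every pair of parity's true rows differs in an even number of inputs, so no two AND terms can be merged and the shallow formula needs 2^(n-1) terms, while a deep chain of two-input XOR gates needs only n-1 gates.

```python
from itertools import product

def parity(bits):
    """Return 1 when an odd number of inputs are on, else 0."""
    return sum(bits) % 2

def true_rows(f, n):
    """Truth-table rows where f is 1; each row is one AND term of a DNF."""
    return [bits for bits in product((0, 1), repeat=n) if f(bits)]

def any_pair_mergeable(terms):
    """Two full AND terms can merge only if they differ in exactly one input."""
    return any(
        sum(a != b for a, b in zip(s, t)) == 1
        for i, s in enumerate(terms)
        for t in terms[i + 1:]
    )

shallow_sizes = {}
mergeable = {}
for n in range(2, 9):
    terms = true_rows(parity, n)
    shallow_sizes[n] = len(terms)
    mergeable[n] = any_pair_mergeable(terms)
    print(f"n={n}: OR-of-ANDs terms = {len(terms)}, XOR-chain gates = {n - 1}")
```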
Check out Terence Tao’s wonderful post “Matrix identities as derivatives of determinant identities”.
Check out “Recent Algorithms Development and Faster Belief Propagation algorithms” by Igor Carron at the Nuit Blanche blog.
Sean J. Taylor writes a short, critical, amusing article about R, Python or JVM languages, Julia, Stata, SPSS, Matlab, Mathematica, and SAS.
In the seminal paper “Stacked generalization”, David H. Wolpert generalizes the idea of cross-validation.
Suppose you had a data set $(x_i, y_i)$ where $i=1, \ldots, n$, $x_i \in A$, and $y_i \in R$. For cross-validation, you might partition the data set into three subsets, train $k$ classifiers on two of the three subsets, and test the classifiers on the held-out data. You might then select the one of the $k$ classifiers that did best on the held-out data.
Wolpert generalizes this idea by forming an $n$ by $k$ matrix of predictions from the $k$ classifiers. The entry in the $i$th row and $j$th column would contain the prediction on $x_i$ of the $j$th classifier, trained on partitions of the data set that do not include $x_i$. Then Wolpert would train a new classifier using the $i$th row of the matrix as input and trying to match $y_i$ as output. The new classifier $f$ would map $R^k$ into $R$. Let $G_j$ be the result of training the $j$th classifier on all of the data. Then the Wolpert generalized classifier would have the form
$$h(x) = f( G_1(x), G_2(x), \ldots, G_k(x) ).$$
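Here is a minimal sketch of that construction with $k = 2$ and real-valued $x$. Everything specific is an assumption made for illustration: the two base learners (a least-squares line and a 1-nearest-neighbour lookup), the three folds taken by index modulo 3, the toy data, and the choice of $f$ as a least-squares linear combination without an intercept.

```python
import math

def fit_linear(xs, ys):
    """Base learner 1: ordinary least-squares line."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return lambda x: my + slope * (x - mx)

def fit_nearest(xs, ys):
    """Base learner 2: predict the y of the closest training x."""
    pairs = list(zip(xs, ys))
    return lambda x: min(pairs, key=lambda p: abs(p[0] - x))[1]

BASE_LEARNERS = [fit_linear, fit_nearest]

def out_of_fold_matrix(xs, ys, n_folds=3):
    """Z[i][j] = prediction on x_i by learner j trained without x_i's fold."""
    n = len(xs)
    Z = [[0.0] * len(BASE_LEARNERS) for _ in range(n)]
    for fold in range(n_folds):
        train = [i for i in range(n) if i % n_folds != fold]
        held = [i for i in range(n) if i % n_folds == fold]
        tx, ty = [xs[i] for i in train], [ys[i] for i in train]
        for j, fit in enumerate(BASE_LEARNERS):
            g = fit(tx, ty)
            for i in held:
                Z[i][j] = g(xs[i])
    return Z

def fit_meta(Z, ys):
    """f(z) = a0*z0 + a1*z1, solved from the 2x2 normal equations."""
    s00 = sum(z[0] * z[0] for z in Z)
    s01 = sum(z[0] * z[1] for z in Z)
    s11 = sum(z[1] * z[1] for z in Z)
    t0 = sum(z[0] * y for z, y in zip(Z, ys))
    t1 = sum(z[1] * y for z, y in zip(Z, ys))
    det = s00 * s11 - s01 * s01
    a0 = (t0 * s11 - t1 * s01) / det
    a1 = (s00 * t1 - s01 * t0) / det
    return lambda z: a0 * z[0] + a1 * z[1]

def fit_stacked(xs, ys):
    """h(x) = f(G_1(x), G_2(x)), with each G_j retrained on all of the data."""
    f = fit_meta(out_of_fold_matrix(xs, ys), ys)
    G = [fit(xs, ys) for fit in BASE_LEARNERS]
    return lambda x: f([g(x) for g in G])

xs = [i / 10 for i in range(60)]
ys = [0.5 * x + math.sin(x) for x in xs]
Z = out_of_fold_matrix(xs, ys)
h = fit_stacked(xs, ys)
print(h(3.0), ys[30])
```

The important detail is that $f$ is fitted on out-of-fold predictions. If the nearest-neighbour learner were allowed to see $x_i$ while predicting it, its column of the matrix would match $y_i$ exactly and $f$ would learn to trust it far more than it deserves.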
Wolpert actually describes an even more general scheme which could have a large number of layers, much like deep belief networks and autoencoders. The idea of leaving part of the data out is similar to denoising or dropout.
In the paper “Feature-Weighted Linear Stacking”, Sill, Takacs, Mackey, and Lin describe a faster version of stacking used extensively by the second place team in the Netflix Prize contest.
In “The No-U-Turn Sampler: Adaptively Setting Path Lengths in Hamiltonian Monte Carlo”, Homan and Gelman present an improvement to Hamiltonian Monte Carlo, itself a Markov chain Monte Carlo method. Here’s the abstract:
Hamiltonian Monte Carlo (HMC) is a Markov chain Monte Carlo (MCMC) algorithm that avoids the random walk behavior and sensitivity to correlated parameters that plague many MCMC methods by taking a series of steps informed by first-order gradient information. These features allow it to converge to high-dimensional target distributions much more quickly than simpler methods such as random walk Metropolis or Gibbs sampling. However, HMC’s performance is highly sensitive to two user-specified parameters: a step size $\epsilon$ and a desired number of steps L. In particular, if L is too small then the algorithm exhibits undesirable random walk behavior, while if L is too large the algorithm wastes computation. We introduce the No-U-Turn Sampler (NUTS), an extension to HMC that eliminates the need to set a number of steps L. NUTS uses a recursive algorithm to build a set of likely candidate points that spans a wide swath of the target distribution, stopping automatically when it starts to double back and retrace its steps. Empirically, NUTS perform at least as efficiently as and sometimes more efficiently than a well tuned standard HMC method, without requiring user intervention or costly tuning runs. We also derive a method for adapting the step size parameter $\epsilon$ on the fly based on primal-dual averaging. NUTS can thus be used with no hand-tuning at all. NUTS is also suitable for applications such as BUGS-style automatic inference engines that require efficient “turnkey” sampling algorithms.
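To see where the two hand-tuned parameters enter, here is a minimal sketch of plain HMC, not NUTS, on a one-dimensional standard normal target. The target, the function names, and the particular values of the step size and number of steps are assumptions chosen for illustration; NUTS replaces the fixed `n_steps` with its recursive doubling and stopping rule, which is not shown here.

```python
import math
import random

def neg_log_density(q):
    """Potential energy for a standard normal target (an assumed toy target)."""
    return 0.5 * q * q

def grad_neg_log_density(q):
    return q

def leapfrog(q, p, eps, n_steps):
    """Simulate Hamiltonian dynamics for n_steps leapfrog steps of size eps."""
    p -= 0.5 * eps * grad_neg_log_density(q)      # initial half step for momentum
    for step in range(n_steps):
        q += eps * p                              # full step for position
        if step != n_steps - 1:
            p -= eps * grad_neg_log_density(q)    # full step for momentum
    p -= 0.5 * eps * grad_neg_log_density(q)      # final half step for momentum
    return q, p

def hmc_sample(n_samples, eps, n_steps, seed=0):
    rng = random.Random(seed)
    q, samples = 0.0, []
    for _ in range(n_samples):
        p = rng.gauss(0.0, 1.0)                   # resample the momentum
        h_old = neg_log_density(q) + 0.5 * p * p
        q_new, p_new = leapfrog(q, p, eps, n_steps)
        h_new = neg_log_density(q_new) + 0.5 * p_new * p_new
        # Metropolis correction for the leapfrog discretization error.
        if rng.random() < math.exp(min(0.0, h_old - h_new)):
            q = q_new
        samples.append(q)
    return samples

samples = hmc_sample(5000, eps=0.2, n_steps=10)
mean = sum(samples) / len(samples)
var = sum((s - mean) ** 2 for s in samples) / len(samples)
print(f"mean={mean:.3f} var={var:.3f}")
```

Trying a very small `n_steps` (say 1) on the same target shows the random-walk behavior the abstract warns about, while a very large value spends gradient evaluations on trajectories that loop back on themselves.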