February 2013


In the seminal paper “Stacked generalization,” David H. Wolpert generalizes the idea of cross-validation.

Suppose you had a data set $(x_i, y_i)$ where $i = 1, \ldots, n$, $x_i \in A$, and $y_i \in R$. For cross-validation, you might partition the data set into three subsets, train $k$ classifiers on two of the three subsets, and test the classifiers on the held-out data. You might then select one of the $k$ classifiers by choosing the one that did best on the held-out data.

Wolpert generalizes this idea by forming an $n \times k$ matrix of predictions from the $k$ classifiers. The entry in the $i$th row and $j$th column contains the prediction on $x_i$ of the $j$th classifier trained on the partitions of the data set that do not include $x_i$. Wolpert then trains a new classifier that takes the $i$th row of the matrix as input and tries to match $y_i$ as output. This new classifier $f$ maps $R^k$ into $R$. Let $G_j$ be the result of training the $j$th classifier on all of the data. The Wolpert generalized classifier then has the form

$$h(x) = f( G_1(x), G_2(x), \ldots, G_k(x) ).$$
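
For concreteness, here is a minimal sketch of this two-level scheme in Python, assuming scikit-learn. The particular level-0 models, the regression setting, and all of the names are illustrative choices of mine, not Wolpert's:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import cross_val_predict
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.tree import DecisionTreeRegressor
from sklearn.neighbors import KNeighborsRegressor

# Toy data: x_i in R^d, y_i in R.
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

# Level-0 learners: the k classifiers G_1, ..., G_k (regressors here).
level0 = [Ridge(alpha=1.0),
          DecisionTreeRegressor(max_depth=4, random_state=0),
          KNeighborsRegressor(n_neighbors=7)]

# Build the n-by-k matrix of out-of-fold predictions: entry (i, j) is
# the j-th learner's prediction on x_i, trained on folds excluding x_i.
Z = np.column_stack([cross_val_predict(m, X, y, cv=3) for m in level0])

# Level-1 learner f: R^k -> R, trained on the prediction matrix.
f = LinearRegression().fit(Z, y)

# Finally retrain each G_j on all of the data, so that
# h(x) = f(G_1(x), ..., G_k(x)) can be applied to new points.
G = [m.fit(X, y) for m in level0]

def h(x_new):
    z = np.column_stack([g.predict(x_new) for g in G])
    return f.predict(z)

print(h(X[:3]))  # stacked predictions for the first three points
```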

Wolpert actually describes an even more general scheme that can have many layers, much like deep belief networks and autoencoders. The idea of leaving out part of the data is similar to denoising or dropout.

In the paper “Feature-Weighted Linear Stacking,” Sill, Takács, Mackey, and Lin describe a faster version of stacking used extensively by the second-place team in the Netflix Prize contest.

 

In “The No-U-Turn Sampler: Adaptively Setting Path Lengths in Hamiltonian Monte Carlo,” Hoffman and Gelman present an improvement to Hamiltonian Monte Carlo, a Markov chain Monte Carlo method. Here’s the abstract:

Hamiltonian Monte Carlo (HMC) is a Markov chain Monte Carlo (MCMC) algorithm that avoids the random walk behavior and sensitivity to correlated parameters that plague many MCMC methods by taking a series of steps informed by first-order gradient information. These features allow it to converge to high-dimensional target distributions much more quickly than simpler methods such as random walk Metropolis or Gibbs sampling. However, HMC’s performance is highly sensitive to two user-specified parameters: a step size $\epsilon$ and a desired number of steps $L$. In particular, if $L$ is too small then the algorithm exhibits undesirable random walk behavior, while if $L$ is too large the algorithm wastes computation. We introduce the No-U-Turn Sampler (NUTS), an extension to HMC that eliminates the need to set a number of steps $L$. NUTS uses a recursive algorithm to build a set of likely candidate points that spans a wide swath of the target distribution, stopping automatically when it starts to double back and retrace its steps. Empirically, NUTS performs at least as efficiently as and sometimes more efficiently than a well tuned standard HMC method, without requiring user intervention or costly tuning runs. We also derive a method for adapting the step size parameter $\epsilon$ on the fly based on primal-dual averaging. NUTS can thus be used with no hand-tuning at all. NUTS is also suitable for applications such as BUGS-style automatic inference engines that require efficient “turnkey” sampling algorithms.
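
The two parameters NUTS eliminates are easiest to see in plain HMC. Below is a minimal numpy sketch of standard HMC (not NUTS itself) on a bivariate normal target, with the step size $\epsilon$ and the number of leapfrog steps $L$ passed in by hand; all names and choices here are my own illustration, not the paper's code:

```python
import numpy as np

def hmc(logp, grad_logp, x0, eps, L, n_samples, seed=0):
    """Plain HMC; eps (step size) and L (number of leapfrog steps) are
    exactly the two hand-tuned parameters that NUTS eliminates."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    out = []
    for _ in range(n_samples):
        p = rng.standard_normal(x.shape)      # resample momentum
        xn, pn = x.copy(), p.copy()
        pn += 0.5 * eps * grad_logp(xn)       # leapfrog half step
        for _ in range(L - 1):
            xn += eps * pn
            pn += eps * grad_logp(xn)
        xn += eps * pn
        pn += 0.5 * eps * grad_logp(xn)       # final half step
        # Metropolis accept/reject on the change in the Hamiltonian.
        dH = (logp(xn) - 0.5 * pn @ pn) - (logp(x) - 0.5 * p @ p)
        if np.log(rng.random()) < dH:
            x = xn
        out.append(x.copy())
    return np.array(out)

# Standard bivariate normal target.
logp = lambda z: -0.5 * z @ z
grad = lambda z: -z
draws = hmc(logp, grad, x0=np.zeros(2), eps=0.2, L=20, n_samples=5000)
print(draws.mean(axis=0), draws.std(axis=0))  # roughly 0 and 1
```

If $\epsilon$ or $L$ above is badly chosen, the sampler either rejects constantly or wastes gradient evaluations, which is precisely the tuning burden NUTS removes.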

In the short, well-written paper “An Estimate of an Upper Bound for the Entropy of English,” Brown, Stephen Della Pietra, Mercer, Vincent Della Pietra, and Lai (1992) give an estimated upper bound of 1.75 bits per character for the entropy of English. That estimate was somewhat lower than Shannon’s original upper bound of 2.3 bits per character. Along the way, they give nice, simple explanations of entropy and cross-entropy as applied to text. More recently, Montemurro and Zanette (2011) showed that the entropy of all languages is around 3.5 bits per word (see the Wired article and the PLoS ONE paper).
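
To make the entropy and cross-entropy definitions concrete, here is a toy plug-in estimate under a unigram character model. This is my own illustration, not the paper's method; Brown et al. obtain their much tighter bound from a trigram language model and a large corpus, but the underlying "cross-entropy upper-bounds entropy" logic is the same:

```python
import math
from collections import Counter

def unigram_bits_per_char(text):
    """Plug-in entropy estimate under a unigram character model:
    H = -sum_c p(c) log2 p(c). Cross-entropy under any model is
    (asymptotically) an upper bound on the true entropy rate, so a
    better model can only drive the bound lower."""
    counts = Counter(text)
    n = len(text)
    return -sum((k / n) * math.log2(k / n) for k in counts.values())

sample = ("it was the best of times, it was the worst of times, "
          "it was the age of wisdom, it was the age of foolishness")
print(f"{unigram_bits_per_char(sample):.2f} bits/char")
# On large English corpora this comes out near 4 bits/char,
# far above the 1.75-bit bound from the trigram model.
```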

In “Does Luck Matter More Than Skill?”, Cal Newport writes

<success of a project> = <project potential> x <serendipitous factors>,

where <project potential> is a measure of the rareness and value of your relevant skills, and the value of the serendipitous factors is drawn from something like an exponential distribution.

and

If you believe that something like this equation is true, then this approach of becoming as good as possible while trying many different projects maximizes your expected success.

Indeed, we can call this the Schwarzenegger Strategy, as it does a good job of describing his path to stardom. Looking back at his story, notice that he tried to maximize the potential in every project he pursued (always “putting in the reps”). But he also pursued a lot of projects, maximizing the chances that he would occasionally complete one with high serendipity. His breaks, as described above, all required both rare and valuable skills, and luck. And each such project was surrounded in his life by other projects in which things did not turn out so well.

If success is measured in dollars, then I bet the distributions of <serendipitous factors> have fat $1/\mathrm{polynomial}$ tails, because there are a lot of people with great skills, yet the wealth distribution among self-made billionaires is something like $C/\text{earnings}^{1.7}$. For many skills, like the probability of hitting a baseball, the amount of skill seems to be proportional to $\log(\text{practice time})$ plus a constant. For other skills, like memorized vocabulary, the amount of skill seems proportional to $(\text{study time})^{0.8}$ or the logarithmic integral function. Mr. Newport also emphasizes the “rareness” of a skill. Air is important, but ubiquitous, so no one charges for it despite its value. In baseball, I imagine that increasing your batting average a little bit can increase your value a lot. I wonder what the formulas for <project potential> are for various skills. If we could correctly model Newport’s success equation, we could figure out the correct multi-armed bandit strategy for maximizing success, as sketched below. (Maybe we could call it the Schwarzenegger Bandit Success Formula.) You might even be able to add happiness to the success formula and still get a good bandit strategy for achieving it.
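
As a small Monte Carlo sketch of that last point, under assumptions entirely my own (a fixed effort budget split evenly across projects, a $\log(\text{practice})$ skill curve, and exponentially distributed serendipity), the expected best-project success first rises and then falls with the number of projects, which is exactly the exploration/exploitation tradeoff a bandit strategy would navigate:

```python
import numpy as np

rng = np.random.default_rng(0)

def career(n_projects, total_effort=100.0):
    """Split a fixed effort budget across n_projects. Assumptions from
    the post: potential grows like log(practice time), and the
    serendipitous factor is exponentially distributed."""
    effort = total_effort / n_projects
    potential = np.log1p(effort)                 # log(practice) skill curve
    luck = rng.exponential(size=n_projects)      # serendipitous factors
    return np.max(potential * luck)              # best project's success

for k in [1, 2, 5, 10, 25, 50]:
    avg = np.mean([career(k) for _ in range(20000)])
    print(f"{k:3d} projects -> expected best success {avg:.2f}")
```

With these toy assumptions there is an interior optimum: too few projects wastes draws of the luck variable, too many dilutes each project's potential.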
