In the short, well-written paper “An Estimate of an Upper Bound for the Entropy of English”, Brown, Stephen Della Pietra, Mercer, Vincent Della Pietra, and Lai (1992) give an estimated upper bound for English of 1.75 bits per character. That estimate was somewhat lower than Shannon’s original upper bound of 2.3 bits per character. Along the way they give nice, simple explanations of entropy and cross-entropy as applied to text. More recently, Montemurro and Zanette (2011) showed that the entropy associated with word ordering is roughly 3.5 bits per word across a wide range of languages. (See the Wired article and the PLoS ONE paper.)
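To make the entropy/cross-entropy bookkeeping concrete, here is a minimal Python sketch (my own toy example, not from the paper) that estimates bits per character of a test string under a smoothed unigram character model; a model this crude gives a much looser upper bound than the word-level language model the paper uses.

```python
import math
from collections import Counter

def unigram_cross_entropy(train_text, test_text):
    """Estimate bits per character of test_text under a unigram
    character model fit on train_text (with add-one smoothing)."""
    counts = Counter(train_text)
    total = sum(counts.values())
    vocab = set(train_text) | set(test_text)
    def prob(c):
        return (counts.get(c, 0) + 1) / (total + len(vocab))
    # Cross-entropy: average negative log2 probability per character.
    return -sum(math.log2(prob(c)) for c in test_text) / len(test_text)

# Toy strings of my own; real estimates use large corpora.
train = "the quick brown fox jumps over the lazy dog " * 200
test = "pack my box with five dozen liquor jugs"
print(round(unigram_cross_entropy(train, test), 2), "bits per character")
```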
In “Does Luck Matter More Than Skill?”, Cal Newport writes
<success of a project> = <project potential> x <serendipitous factors>,
where <project potential> is a measure of the rareness and value of your relevant skills, and the value of the serendipitous factors is drawn from something like an exponential distribution.
and
If you believe that something like this equation is true, then this approach of becoming as good as possible while trying many different projects, maximizes your expected success.
Indeed, we can call this the Schwarzenegger Strategy, as it does a good job of describing his path to stardom. Looking back at his story, notice that he tried to maximize the potential in every project he pursued (always “putting in the reps”). But he also pursued a lot of projects, maximizing the chances that he would occasionally complete one with high serendipity. His breaks, as described above, all required both rare and valuable skills, and luck. And each such project was surrounded in his life by other projects in which things did not turn out so well.
If success is measured in dollars, then I bet the distribution of <serendipitous factors> has a fat, polynomial (1/x^k) tail: there are a lot of people with great skills, yet the wealth distribution among self-made billionaires looks something like C/earnings^1.7. For many skills, like the probability of hitting a baseball, the amount of skill seems to be proportional to log(practice time) plus a constant. For other skills, like memorized vocabulary, the amount of skill seems proportional to (study time)^0.8 or the logarithmic integral function. Mr. Newport also emphasizes the “rareness” of a skill. Air is important, but ubiquitous, so no one charges for it despite its value. In baseball, I imagine that increasing your batting average a little bit can increase your value a lot. I wonder what the formulas for <project potential> are for various skills. If we could correctly model Newport’s success equation, we could figure out the correct multi-armed bandit strategy for maximizing success. (Maybe we could call it the Schwarzenegger Bandit Success Formula.) You may even be able to add happiness into the success formula and still get a good bandit strategy for achieving it.
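As a toy illustration (not Newport’s model; every distribution, arm name, and constant here is my own assumption), here is a small Python simulation of the success equation with log-shaped skill growth, exponentially distributed serendipity, and an epsilon-greedy bandit choosing which project area to practice next.

```python
import math
import random

random.seed(0)

# Toy version of Newport's equation: each "arm" is a project area where
# skill grows like log(practice time) and each completed project pays
# potential(skill) * serendipity, with serendipity ~ Exponential(1).
ARMS = {"writing": 1.0, "acting": 2.5, "bodybuilding": 1.5}  # hypothetical skill multipliers
practice = {name: 1.0 for name in ARMS}
total_reward = {name: 0.0 for name in ARMS}
pulls = {name: 0 for name in ARMS}

def project_success(name):
    potential = ARMS[name] * math.log(1 + practice[name])  # skill ~ log(practice time)
    serendipity = random.expovariate(1.0)                   # luck with an exponential tail
    return potential * serendipity

def epsilon_greedy(epsilon=0.1):
    # Explore occasionally (and until every arm has been tried once),
    # otherwise exploit the arm with the best average payoff so far.
    if random.random() < epsilon or min(pulls.values()) == 0:
        return random.choice(list(ARMS))
    return max(ARMS, key=lambda a: total_reward[a] / pulls[a])

for t in range(2000):
    arm = epsilon_greedy()
    reward = project_success(arm)
    practice[arm] += 1
    pulls[arm] += 1
    total_reward[arm] += reward

print({a: round(total_reward[a], 1) for a in ARMS})
```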
I really enjoyed Jeremy Kun’s article on Information Distance at Math ∩ Programming.
Check out Ross Rosen’s collection of tools for machine learning with Python.
Caleb Crain wrote:
“The mention of military technology brings me to my last idea. This is the challenge of the robot utopia. You remember the robot utopia. You imagined it when you were in fifth grade, and your juvenile mind first seized with rapture upon the idea of intelligent machines that would perform dull, repetitive tasks yet demand nothing for themselves. In the future, you foresaw, robots would do more and more, and humans less and less. There would be no need for humans to endanger themselves in coal mines or bore themselves on assembly lines. A few people would always be needed to repair and build the robots, and this drudgery of robot supervision would have to be rewarded somehow, but someday robots would surely make wealth so abundant that most people wouldn’t need to work and would be free merely to enjoy and cultivate themselves—by, say, hunting in the morning, fishing in the afternoon, and doing literary criticism after dinner.
Your fifth-grade self was wrong, of course. Robots aren’t altruistic beings; they’re capital investments; and though robots may not ask to be paid, their owners demand a return on their investment. We now live in the robot utopia, which isn’t one. Thanks in large part to computerized mechanization, manufacturing productivity in the past century has increased many times over. Standards of living are higher than they ever were, but we no longer need as many humans to work as we once did. Perhaps not coincidentally, human wages, in America at least, have stagnated since the 1970s. If humans made no more money in the past four decades, where did the wealth created by the higher productivity go? Toward robot wages, as it were. The owners of the robots took the money—that is, the capitalists. Any fifth-grader can see where this leads. At some point society has to choose. Either society accepts the robots’ gift as a general one, and redistributes the wealth that the robots inadvertently concentrate, or society allows the robots to become the exclusive tools of an ever-shrinking elite, increasingly resented, in confused fashion, by the people whom the robots have displaced.”
Srivastava and Schmidhuber have recently written about their first experiences with PowerPlay. PowerPlay is a self-modifying general problem solver that uses algorithmic complexity theory and ideas similar to Levin Search and AIQ/AIXI (see [1], [2], and [3]) to find novel solutions to problems and to invent new “interesting” problems. I was more interested in Hutter’s approximations to AIXI, mainly because it was easy to understand the results when they were applied to simple games (1d-maze, Cheese Maze, Tiger, TicTacToe, Biased Rock-Paper-Scissors, Kuhn Poker, and Partially Observable Pacman), but I look forward to future papers on PowerPlay.
Check out WhiteSwami’s list of subjects to master if you are going to become a machine learner.
David Andrzejewski at Bayes’ Cave wrote up a nice summary of practical machine learning advice from the KDD 2011 paper “Detecting Adversarial Advertisements in the Wild”. I’ve quoted several of the main points from David’s summary below (a toy sketch of the hashing trick plus L1 sparsity follows the list):
- ABE: Always Be Ensemble-ing
- Throw a ton of features at the model and let L1 sparsity figure it out
- Map features with the “hashing trick”
- Handle the class imbalance problem with ranking
- Use a cascade of classifiers
- Make sure the system “still works” as its inputs evolve over time
- Make efficient use of expert effort
- Allow humans to hard-code rules
- Periodically use non-expert evaluations to make sure the system is working
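As a minimal sketch of two of those points together (the hashing trick plus letting L1 sparsity sort out a ton of features), here is a hypothetical scikit-learn example; the toy ad snippets and labels are made up, and the real system is far more elaborate.

```python
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import LogisticRegression

# Made-up labeled "ads": 1 = adversarial, 0 = benign.
texts = [
    "free miracle cure lose weight fast",
    "cheap replica watches click here",
    "local hardware store weekend sale",
    "new restaurant opening downtown",
]
labels = [1, 1, 0, 0]

# Hashing trick: map raw token features into a fixed-size sparse vector
# without keeping an explicit vocabulary in memory.
vectorizer = HashingVectorizer(n_features=2**18, alternate_sign=False)
X = vectorizer.transform(texts)

# L1 penalty drives most of the hashed feature weights to exactly zero,
# letting the model pick out the features that matter.
clf = LogisticRegression(penalty="l1", solver="liblinear", C=10.0)
clf.fit(X, labels)

print("nonzero weights:", (clf.coef_ != 0).sum(), "of", clf.coef_.size)
print(clf.predict(vectorizer.transform(["click here for free watches"])))
```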
According to this graph, high-quality elementary school teachers increase the lifetime earnings of their students by about $200,000 per child.
In “Improving neural networks by preventing co-adaptation of feature detectors”, Hinton, Srivastava, Krizhevsky, Sutskever, and Salakhutdinov answer the question of what happens if, “on each presentation of each training case, each hidden unit is randomly omitted from the network with a probability of 0.5, so a hidden unit cannot rely on other hidden units being present.” This mimics the standard technique of training several neural nets and averaging them, but it is faster. When they applied the “dropout” technique to deep neural nets on the MNIST handwritten digit data set and the TIMIT speech data set, they got robust learning without overfitting. This was one of the main techniques used by the winners of the Merck Molecular Activity Challenge.
Hinton talks about the dropout technique in his video Brains, Sex, and Machine Learning.
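For the flavor of the technique, here is a minimal NumPy sketch of dropout on a single hidden layer; the ReLU units, the sizes, and the test-time scaling by (1 − p) are my own simplifications, not the paper’s exact setup.

```python
import numpy as np

rng = np.random.default_rng(0)

def forward(x, W1, W2, train=True, p_drop=0.5):
    """One hidden layer with dropout on the hidden units.

    During training each hidden unit is zeroed with probability p_drop,
    so no unit can rely on particular other units being present.
    At test time all units are kept and the hidden activations are
    scaled by (1 - p_drop), approximating an average over the many
    "thinned" networks seen during training.
    """
    h = np.maximum(0, x @ W1)              # ReLU hidden layer (my choice)
    if train:
        mask = rng.random(h.shape) > p_drop  # keep each unit with prob 1 - p_drop
        h = h * mask
    else:
        h = h * (1 - p_drop)
    return h @ W2

x = rng.standard_normal((4, 10))           # a small batch of inputs
W1 = rng.standard_normal((10, 32)) * 0.1
W2 = rng.standard_normal((32, 1)) * 0.1
print(forward(x, W1, W2, train=True).shape)   # (4, 1)
```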