In the short, well-written paper “An Estimate of an Upper Bound for the Entropy of English”, Brown, Stephen Della Pietra, Mercer, Vincent Della Pietra, and Lai (1992) give an estimated upper bound for English of 1.75 bits per character. That estimate was somewhat lower than Shannon’s original upper bound of 2.3 bits per character. Along the way they give nice, simple explanations of entropy and cross-entropy as applied to text. More recently, Montemurro and Zanette (2011) showed that the entropy associated with word ordering is roughly 3.5 bits per word across a wide range of languages. (See the Wired article and the PLoS ONE paper.)
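To make the entropy/cross-entropy bookkeeping concrete, here is a minimal Python sketch (my own toy example, not from the paper) that estimates bits per character of a test string under a smoothed unigram character model; a model this crude gives a much looser upper bound than the word-level language model the paper uses.

```python
import math
from collections import Counter

def unigram_cross_entropy(train_text, test_text):
    """Estimate bits per character of test_text under a unigram
    character model fit on train_text (with add-one smoothing)."""
    counts = Counter(train_text)
    total = sum(counts.values())
    vocab = set(train_text) | set(test_text)
    def prob(c):
        return (counts.get(c, 0) + 1) / (total + len(vocab))
    # Cross-entropy: average negative log2 probability per character.
    return -sum(math.log2(prob(c)) for c in test_text) / len(test_text)

# Toy strings of my own; real estimates use large corpora.
train = "the quick brown fox jumps over the lazy dog " * 200
test = "pack my box with five dozen liquor jugs"
print(round(unigram_cross_entropy(train, test), 2), "bits per character")
```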
In “Does Luck Matter More Than Skill?”, Cal Newport writes
<success of a project> = <project potential> x <serendipitous factors>,
where <project potential> is a measure of the rareness and value of your relevant skills, and the value of the serendipitous factors is drawn from something like an exponential distribution.
and
If you believe that something like this equation is true, then this approach of becoming as good as possible while trying many different projects, maximizes your expected success.
Indeed, we can call this the Schwarzenegger Strategy, as it does a good job of describing his path to stardom. Looking back at his story, notice that he tried to maximize the potential in every project he pursued (always “putting in the reps”). But he also pursued a lot of projects, maximizing the chances that he would occasionally complete one with high serendipity. His breaks, as described above, all required both rare and valuable skills, and luck. And each such project was surrounded in his life by other projects in which things did not turn out so well.
If success is measured in dollars, then I bet the distribution of <serendipitous factors> has a fat, polynomial (1/x^k) tail: there are a lot of people with great skills, yet the wealth distribution among self-made billionaires looks something like C/earnings^1.7. For many skills, like the probability of hitting a baseball, the amount of skill seems to be proportional to log(practice time) plus a constant. For other skills, like memorized vocabulary, the amount of skill seems proportional to (study time)^0.8 or the logarithmic integral function. Mr. Newport also emphasizes the “rareness” of a skill. Air is important, but ubiquitous, so no one charges for it despite its value. In baseball, I imagine that increasing your batting average a little bit can increase your value a lot. I wonder what the formulas for <project potential> are for various skills. If we could correctly model Newport’s success equation, we could figure out the correct multi-armed bandit strategy for maximizing success. (Maybe we could call it the Schwarzenegger Bandit Success Formula.) You may even be able to add happiness into the success formula and still get a good bandit strategy for achieving it.
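As a toy illustration (not Newport’s model; every distribution, arm name, and constant here is my own assumption), here is a small Python simulation of the success equation with log-shaped skill growth, exponentially distributed serendipity, and an epsilon-greedy bandit choosing which project area to practice next.

```python
import math
import random

random.seed(0)

# Toy version of Newport's equation: each "arm" is a project area where
# skill grows like log(practice time) and each completed project pays
# potential(skill) * serendipity, with serendipity ~ Exponential(1).
ARMS = {"writing": 1.0, "acting": 2.5, "bodybuilding": 1.5}  # hypothetical skill multipliers
practice = {name: 1.0 for name in ARMS}
total_reward = {name: 0.0 for name in ARMS}
pulls = {name: 0 for name in ARMS}

def project_success(name):
    potential = ARMS[name] * math.log(1 + practice[name])  # skill ~ log(practice time)
    serendipity = random.expovariate(1.0)                   # luck with an exponential tail
    return potential * serendipity

def epsilon_greedy(epsilon=0.1):
    # Explore occasionally (and until every arm has been tried once),
    # otherwise exploit the arm with the best average payoff so far.
    if random.random() < epsilon or min(pulls.values()) == 0:
        return random.choice(list(ARMS))
    return max(ARMS, key=lambda a: total_reward[a] / pulls[a])

for t in range(2000):
    arm = epsilon_greedy()
    reward = project_success(arm)
    practice[arm] += 1
    pulls[arm] += 1
    total_reward[arm] += reward

print({a: round(total_reward[a], 1) for a in ARMS})
```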
I really enjoyed Jeremy Kun’s article on Information Distance at Math ∩ Programming.
Check out Ross Rosen’s collection of tools for machine learning with Python.
Caleb Crain wrote:
“The mention of military technology brings me to my last idea. This is the challenge of the robot utopia. You remember the robot utopia. You imagined it when you were in fifth grade, and your juvenile mind first seized with rapture upon the idea of intelligent machines that would perform dull, repetitive tasks yet demand nothing for themselves. In the future, you foresaw, robots would do more and more, and humans less and less. There would be no need for humans to endanger themselves in coal mines or bore themselves on assembly lines. A few people would always be needed to repair and build the robots, and this drudgery of robot supervision would have to be rewarded somehow, but someday robots would surely make wealth so abundant that most people wouldn’t need to work and would be free merely to enjoy and cultivate themselves—by, say, hunting in the morning, fishing in the afternoon, and doing literary criticism after dinner.
Your fifth-grade self was wrong, of course. Robots aren’t altruistic beings; they’re capital investments; and though robots may not ask to be paid, their owners demand a return on their investment. We now live in the robot utopia, which isn’t one. Thanks in large part to computerized mechanization, manufacturing productivity in the past century has increased many times over. Standards of living are higher than they ever were, but we no longer need as many humans to work as we once did. Perhaps not coincidentally, human wages, in America at least, have stagnated since the 1970s. If humans made no more money in the past four decades, where did the wealth created by the higher productivity go? Toward robot wages, as it were. The owners of the robots took the money—that is, the capitalists. Any fifth-grader can see where this leads. At some point society has to choose. Either society accepts the robots’ gift as a general one, and redistributes the wealth that the robots inadvertently concentrate, or society allows the robots to become the exclusive tools of an ever-shrinking elite, increasingly resented, in confused fashion, by the people whom the robots have displaced.”
Srivastava and Schmidhuber have recently written about their first experiences with PowerPlay. PowerPlay is a self-modifying general problem solver that uses algorithmic complexity theory and ideas similar to Levin Search and AIQ/AIXI (see [1], [2], and [3]) to find novel solutions to problems and to invent new “interesting” problems. I was more interested in Hutter’s approximations to AIXI, mainly because it was easy to understand the results when they were applied to simple games (1d-maze, Cheese Maze, Tiger, TicTacToe, Biased Rock-Paper-Scissors, Kuhn Poker, and Partially Observable Pacman), but I look forward to future papers on PowerPlay.
Check out WhiteSwami’s list of subjects to master if you are going to become a machine learner.
David Andrzejewski at Bayes’ Cave wrote up a nice summary of practical machine learning advice from the KDD 2011 paper “Detecting Adversarial Advertisements in the Wild”. I’ve quoted several of the main points from David’s summary below (a toy sketch of the hashing trick plus L1 sparsity follows the list):
- ABE: Always Be Ensemble-ing
- Throw a ton of features at the model and let L1 sparsity figure it out
- Map features with the “hashing trick”
- Handle the class imbalance problem with ranking
- Use a cascade of classifiers
- Make sure the system “still works” as its inputs evolve over time
- Make efficient use of expert effort
- Allow humans to hard-code rules
- Periodically use non-expert evaluations to make sure the system is working
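As a minimal sketch of two of those points together (the hashing trick plus letting L1 sparsity sort out a ton of features), here is a hypothetical scikit-learn example; the toy ad snippets and labels are made up, and the real system is far more elaborate.

```python
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import LogisticRegression

# Made-up labeled "ads": 1 = adversarial, 0 = benign.
texts = [
    "free miracle cure lose weight fast",
    "cheap replica watches click here",
    "local hardware store weekend sale",
    "new restaurant opening downtown",
]
labels = [1, 1, 0, 0]

# Hashing trick: map raw token features into a fixed-size sparse vector
# without keeping an explicit vocabulary in memory.
vectorizer = HashingVectorizer(n_features=2**18, alternate_sign=False)
X = vectorizer.transform(texts)

# L1 penalty drives most of the hashed feature weights to exactly zero,
# letting the model pick out the features that matter.
clf = LogisticRegression(penalty="l1", solver="liblinear", C=10.0)
clf.fit(X, labels)

print("nonzero weights:", (clf.coef_ != 0).sum(), "of", clf.coef_.size)
print(clf.predict(vectorizer.transform(["click here for free watches"])))
```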
According to this graph, high-quality elementary school teachers increase the lifetime earnings of their students by about $200,000 per child.
In “Improving neural networks by preventing co-adaptation of feature detectors”, Hinton, Srivastava, Krizhevsky, Sutskever, and Salakhutdinov answer the question of what happens if, “on each presentation of each training case, each hidden unit is randomly omitted from the network with a probability of 0.5, so a hidden unit cannot rely on other hidden units being present.” This mimics the standard technique of training several neural nets and averaging them, but it is faster. When they applied the “dropout” technique to deep neural nets on the MNIST handwritten digit data set and the TIMIT speech data set, they got robust learning without overfitting. This was one of the main techniques used by the winners of the Merck Molecular Activity Challenge.
Hinton talks about the dropout technique in his video Brains, Sex, and Machine Learning.
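For the flavor of the technique, here is a minimal NumPy sketch of dropout on a single hidden layer; the ReLU units, the sizes, and the test-time scaling by (1 − p) are my own simplifications, not the paper’s exact setup.

```python
import numpy as np

rng = np.random.default_rng(0)

def forward(x, W1, W2, train=True, p_drop=0.5):
    """One hidden layer with dropout on the hidden units.

    During training each hidden unit is zeroed with probability p_drop,
    so no unit can rely on particular other units being present.
    At test time all units are kept and the hidden activations are
    scaled by (1 - p_drop), approximating an average over the many
    "thinned" networks seen during training.
    """
    h = np.maximum(0, x @ W1)              # ReLU hidden layer (my choice)
    if train:
        mask = rng.random(h.shape) > p_drop  # keep each unit with prob 1 - p_drop
        h = h * mask
    else:
        h = h * (1 - p_drop)
    return h @ W2

x = rng.standard_normal((4, 10))           # a small batch of inputs
W1 = rng.standard_normal((10, 32)) * 0.1
W2 = rng.standard_normal((32, 1)) * 0.1
print(forward(x, W1, W2, train=True).shape)   # (4, 1)
```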