# General ML

You are currently browsing the archive for the General ML category.

## Simple Fact about Maximizing a Gaussian

Over the last few weeks, I’ve been working with some tricky features. Interestingly, I needed to add noise to the features to improve my classifier performance.  I will write a post on these “confounding features” later.  For now, let me just point out the following useful fact.

If

$$f(x, \sigma) = {1\over{\sigma \sqrt{2 \pi}}} \exp{\left(-{{x^2}\over{2 \sigma^2}}\right)},$$

then

$$\max_\sigma f(x,\sigma) = f(x, |x|).$$

So, if you have a Gaussian with mean zero and you want to fatten it to maximize the value of its probability density function at $x$ without changing the mean, set the standard deviation to $|x|$.  (Setting $\partial f/\partial\sigma = 0$ gives $\sigma^2 = x^2$.)
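
A quick numerical check of this fact (a throwaway sketch; the grid and the test point x = 1.7 are arbitrary):

```python
import numpy as np

def gaussian_pdf(x, sigma):
    """Density at x of a zero-mean Gaussian with standard deviation sigma."""
    return np.exp(-x**2 / (2 * sigma**2)) / (sigma * np.sqrt(2 * np.pi))

x = 1.7
sigmas = np.linspace(0.01, 10.0, 100_000)
best_sigma = sigmas[np.argmax(gaussian_pdf(x, sigmas))]
print(best_sigma)  # ~1.7, i.e. |x|

# At sigma = |x| the density simplifies to 1 / (|x| * sqrt(2 * pi * e)).
peak = gaussian_pdf(x, abs(x))
```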

## Six Notions of Abstraction

So I have been learning a little category theory in the hopes that I could use it somehow to teach computers abstraction.

What is abstraction?  Carl and I have been talking about this for years.  Here are some ideas on what abstraction could be:

1a) An average model of a large group of similar things.  For example, the abstraction of a chair might be an idealized chair with a seat, a backrest, four legs, and possibly “stretchers” between the legs to give it strength and stability.

1b) A function in a computer program that is used in several places in the code.  In effect, this function shortens the code.  If the code has length $L_1$ without the function and length $L_2$ with the function, the function body has length $L_C$, and the function is called $N$ times, then typically

$$L_2 \approx L_1 - L_C \cdot N + L_C + N.$$

(Each of the $N$ replaced pieces of length $L_C$ becomes a call of roughly unit length, and the function body of length $L_C$ is added once.)

The function is an abstraction of the pieces of code that it replaced, similar to the way an ideal chair is an abstraction of many chairs.  The function has the common characteristics of the code it replaced.  For example, the following code consists of about 160 characters

// C Code without function

main()
{
    printf("Hello John\n");
    printf("Hello Harry\n");
    printf("Hello Rose\n");
    printf("Hello Carol\n");
    printf("Hello Mr. Thomas\n");
    printf("Hello Mrs. Jones\n");
}

and it could be shortened slightly to about 137 characters by

// C Code with function

void pr(char *s)
{
    printf("Hello %s\n", s);
}

main() {
    pr("John");
    pr("Harry");
    pr("Rose");
    pr("Carol");
    pr("Mr. Thomas");
    pr("Mrs. Jones");
}

Though this example is not in any way practical, it does illustrate how a function can capture commonality.  All of the pieces of code replaced by the pr function had a call to printf, the substring "Hello ", and the newline "\n".  The pr function was, in a sense, the idealized version of the six print statements in the original code.

In category theory, we could refer to the category of strings with arrows going from superstrings to substrings.  In this category, there would be an arrow from the line of code `printf("Hello John\n");` to the string `printf("Hello \n");`.  The categorical coproduct of several strings in this category would be the longest common substring.  So, category theory suggests the substring `printf("Hello \n");` should somehow be included in the abstraction of those six printf statements in the original code.
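
As a quick literal check (a brute-force Python sketch, fine for six short strings), the longest common contiguous substring of the six lines is the shared printf prefix:

```python
def longest_common_substring(strings):
    """Brute force: try substrings of the shortest string, longest first."""
    shortest = min(strings, key=len)
    for length in range(len(shortest), 0, -1):
        for start in range(len(shortest) - length + 1):
            candidate = shortest[start:start + length]
            if all(candidate in s for s in strings):
                return candidate
    return ""

names = ["John", "Harry", "Rose", "Carol", "Mr. Thomas", "Mrs. Jones"]
lines = ['printf("Hello %s\\n");' % name for name in names]
print(longest_common_substring(lines))  # printf("Hello 
```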

2)  A simplifying transformation that changes things into simpler things which are easier to work with.  For example, if you are counting sheep in a pen, you might convert groups of sheep into numbers and then add the numbers to get a total count for the entire pen.  We simplified the counting process by first converting groups of sheep into numbers.  Another example might be solving a corn maze by first flying over it, taking a picture, and then converting the “impassable” rows of corn into lines drawn on a piece of paper.  To solve the corn maze, you first convert it to a pen drawing on paper and then solve the pen drawn maze.

3)  Removing irrelevant information or detail.  If a piece in chess is attacking the opponent’s king, then the opponent must either a) move the king, b) put a piece between the attacker and the king, or c) capture the attacking piece.  If you wanted to list the possible moves for the opponent, you can often delete from the chess board most of the pieces because most of the pieces on the board do not affect the three options above.  In the diagram below, the white pawns, rook, and king and the black pawn on a7 do not affect the legal moves for black.

[Chess diagram: black's king in check]

In a corn maze, the type of corn, the height of the corn stalks, the temperature, or the type of soil do not affect the solution of the corn maze, so that information could be ignored.  Ignoring irrelevant information allows the brain to form a more abstract, simple model of the corn maze.

4) Variables in first order predicate logic.  “If person A is taller than person B and person B is taller than person C, then person A is taller than person C.”  In that statement, person A could be any one of the billions of people on earth.

5) Establishing classes of objects.  “Fruit tends to spoil quickly.”  That statement applies to a huge variety of organic objects.  Abstraction classes can even be put in a hierarchy like Animal ==> Mammal ==> Cat ==> Siamese Cat.

6) Statistical Patterns and Compression.  If a piece of code like zip.exe compresses a corpus of English text files, we could say that the piece of code formed an abstract model of the corpus.  A Lempel–Ziv style compressor (zip itself uses DEFLATE, an LZ77 variant paired with Huffman coding) would notice that

• the letters ‘e’ and ‘t’ appear frequently,
• the letter ‘u’ usually follows ‘q’, and
• the capital letter I frequently appears with spaces on either side.

These patterns are all mini-abstractions or aspects of English.
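
These patterns are easy to verify directly.  A tiny Python illustration (the sample sentence is made up; a real corpus shows the same thing more strongly):

```python
from collections import Counter

text = ("the quick brown fox jumps over the lazy dog while "
        "the quiet queen requests a quote").lower()

# Single-letter frequencies: 'e' and 't' dominate even in this tiny sample.
freq = Counter(c for c in text if c.isalpha())
print(freq.most_common(3))

# Conditional pattern: which letters follow 'q'?
after_q = Counter(b for a, b in zip(text, text[1:]) if a == "q")
print(after_q)  # every 'q' here is followed by 'u'
```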

A very general notion of abstraction could be any function which shortens the compressed form of the data.  That function would be an abstraction of some part of the data.  If your data consisted of 3-dimensional models of chairs, then a compression program could exploit the fact that a chair usually has four legs, a seat, and a backrest.  Such a program might benefit from a function like the one shown below.

// Sketch: each helper appends a description of one chair part and returns
// the updated buffer.  (newDescriptionBuffer and the describe* helpers
// are hypothetical.)
char *DescribeChair(double *featureVector)
{
    char *description = newDescriptionBuffer();

    description = describeLegs(description, featureVector);
    description = describeSeatOnTopOfLegs(description, featureVector);
    description = describeBackRestOnTopOfSeat(description, featureVector);

    return description;
}

The abstract version of a chair would help compress the data describing many chairs.

A compression program for a corpus of chess games would probably include a board evaluation function.  The program could use the evaluation function to suggest the top five most likely moves.  This would allow the program to use less space to compress the moves by using a probability distribution over moves (e.g. with arithmetic coding) as a function of the “goodness” of each move.  Better moves would be more likely than worse moves.  Converting the positions to valuations helps the computer to compress by making it “understand” chess.  If the computer is compressing many chess games, it may even note that some human players were more aggressive or stronger than others.  By classifying the players by skill or behavior, the program achieves better compression.
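
Here is a hedged sketch of that idea in Python (the evaluation scores and the softmax mapping from score to probability are illustrative assumptions, not the method of any particular chess compressor).  An arithmetic coder spends about $-\log_2 p$ bits on a move of probability $p$, so better-evaluated moves cost fewer bits:

```python
import math

def move_distribution(scores, temperature=1.0):
    """Softmax over evaluation scores: better moves get higher probability."""
    exps = [math.exp(s / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical engine evaluations (in pawns) for five candidate moves,
# best move first.
scores = [0.8, 0.5, 0.1, -0.3, -1.0]
probs = move_distribution(scores)
bits = [-math.log2(p) for p in probs]

for s, p, b in zip(scores, probs, bits):
    print(f"eval {s:+.1f}  p = {p:.3f}  cost = {b:.2f} bits")
```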

Well that’s it for now.

Hopefully, sometime in the future, I will be describing how category theory can help us automate those six abstractions and then use the abstractions to generate features for learning algorithms.

(Code highlighting was done by http://hilite.me/.  Thanks!)

## “Randomized Numerical Linear Algebra (RandNLA): Theory and Practice”

Nuit Blanche has a nice summary of the workshop “Randomized Numerical Linear Algebra (RandNLA): Theory and Practice” at FOCS 2012 (the 53rd Annual IEEE Symposium on Foundations of Computer Science).  Faster randomized algorithms for QR, SVD, eigenvalue problems, least squares, and more, using random projections and other techniques, were discussed.
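
For flavor, here is a minimal random-projection sketch in Python (the dimensions and seed are arbitrary): a scaled Gaussian matrix approximately preserves pairwise distances, the Johnson–Lindenstrauss style of guarantee that underlies many RandNLA algorithms:

```python
import numpy as np

rng = np.random.default_rng(0)

# 20 points in 10,000 dimensions, projected down to 400 dimensions.
n, d, k = 20, 10_000, 400
X = rng.standard_normal((n, d))
P = rng.standard_normal((d, k)) / np.sqrt(k)  # scaled Gaussian projection
Y = X @ P

# Pairwise distances are approximately preserved.
orig = np.linalg.norm(X[0] - X[1])
proj = np.linalg.norm(Y[0] - Y[1])
print(orig, proj)
```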

## Erik Brynjolfsson's TED Talk on the “New Machine Age”

Erik Brynjolfsson, director of the MIT Center for Digital Business, gave a great TED talk on the recent changes in world-wide productivity, computers, machine learning, and the “great stagnation”.  There has been a stagnation in GDP per capita for the last 5 years in the United States, Europe, and Japan.  However, growth has continued unabated in India, China, and Brazil.  Economist Tyler Cowen is not particularly optimistic about the current economic situation, which he describes in his book “The Great Stagnation: How America Ate All The Low-Hanging Fruit of Modern History, Got Sick, and Will (Eventually) Get Better” ($4 Kindle edition).  Erik disagrees with the idea that the economy, technology, and science are stagnant.  Instead, we are being confused by what he calls the “great decoupling” of productivity growth and employment/median income.  He believes that we are actually in the early phases of the “new machine age” which will create more wealth than the industrial revolution.  Here are some notes on his TED talk:

• The introduction of electricity did not immediately lead to large productivity growth.  At first, factory managers just replaced existing steam machines with electric machines.  It took thirty years (one generation) before the factories were redesigned with electricity in mind and new work processes were created.  Then productivity growth took off.
• Big general purpose technologies like fire, agriculture, writing, metallurgy, the printing press, the steam engine, telecommunications & electricity, and now computers cause “cascades of complementary innovations”.
• The GDP growth rate in America has been fairly steady at an average of 1.9% per year since 1800.
• Erik graphically compares the 1960-2011 productivity growth caused partly by computers and the 1890-1940 productivity growth caused by electricity.  They appear quite similar.
• Productivity has grown faster (2.4% per year) in the last decade than in any other decade since 1970.  The average productivity growth rate was about 2.3% during the second industrial revolution.
• World GDP in the last decade has grown faster than in any other decade in history.
• New growth has been driven by thoughts and ideas more than physical products.  This growth is hard to measure.  The massive utility of free products like Google Search or Wikipedia is not included in GDP.
• The “new machine age” is “digital, exponential, and combinatorial”.
• Digital goods are perfectly, freely replicable, and they have near zero transportation cost (often moving at great speed, crossing the globe in a minute).
• Computer power has grown exponentially.  In Erik's words, “Computers get better faster than anything else ever.”  (2013 PlayStation = 1996 supercomputer)
• We are not designed by evolution to anticipate exponential trends.  We expect linear trends.
• We now have Big Data and Machine Learning which allow us to analyze everything more deeply.
• He disagrees with the idea of “low hanging fruit” inventions (see e.g. Cowen's book).  He thinks that “each innovation produces building blocks for even more innovations”.
• The digital, exponential, and combinatorial nature of the new machine age will lead to the greatest industrial revolution in history (see e.g. cat transportation at 6:55 in the video).
• Machine Learning may be the most important innovation of this revolution.  (The Watson Jeopardy team improved faster than any human could by using machine learning.)  Watson's technology is now being applied in the legal, banking, and medical fields.
• “The great decoupling” is the decoupling of high productivity growth from employment and median income.
• Jobs are being replaced by computers.
• In 1997, Garry Kasparov was beaten by the supercomputer Deep Blue.  Today a cell phone can beat a chess grandmaster.
• In computer-assisted chess play (Advanced Chess), sometimes the winners are neither computer experts nor grandmasters.  Instead, they were the best at interacting with the computer, guiding its exploration of the possibilities.

So, maybe we will not lose our jobs to computers; rather, we will collaborate with computers to create wonderful things.

## “Is chess the drosophila of artificial intelligence?”

In the article “Is chess the drosophila of artificial intelligence?  A social history of an algorithm”, Nathan Ensmenger writes about the history of artificial intelligence and chess computers as well as providing a detailed description of the main chess algorithm, minimax tree search.  Here are some notes on the article:

• Russian mathematician Alexander Kronrod was credited with first stating that “chess was the drosophila of artificial intelligence” in 1965 to justify the use of expensive computers for playing chess.
• In 1946, Turing suggested that chess playing computers would be an example of a thinking machine, and in 1953 he wrote the first program that could play chess; however, there were no computers around that could run his program.
• In 1950, Claude Shannon wrote the first paper on chess algorithms and proposed two main approaches: type ‘A’ brute force minimax search and type ‘B’ which used heavy heuristics to carefully examine a much smaller number of positions.  Type ‘A’ turned out to be superior for chess and governed the vast majority of computer chess research.  Shannon introduced the terms “ply” and “pruning”.
• Chess minimax algorithms were attractive because they were simple to program and modify, they required only a little memory, and they worked.  Chess was attractive to AI because there were well-known standards for play (Elo chess ratings), and ever improving computing hardware created ever improving, measurable performance in computer chess play, providing some defense against the critics of AI during the AI winter (see e.g. the Lighthill Report).
• Unlike the drosophila in genetics research, the simplistic minimax algorithm never led to improved theoretical understanding in AI and may have even hindered the progress of AI.
• When Deep Blue beat Garry Kasparov, it was capable of “11,380,000,000 floating point operations per second, making it one of the top 300 supercomputers in the entire world at the time.”

Perhaps surprisingly, the article does not discuss the major improvements made to computer chess by Fabien Letouzey (Fruit) and Vasik Rajlich (Rybka) in 2006 and 2007.  Though these programs increased the strength of chess computers by 200 Elo points over one year, they had little or no impact on AI.  I guess 200 Elo points does not seem like much compared to the 2000 Elo gain of computer chess programs between 1960 and today.  The 2000 point gain has been due to both hardware and software improvements but does not seem to stem from advances in AI.  In fact, modern chess programs do not seem to use any modern machine learning algorithms.  (Please write a comment about this if I am wrong!)

## “Machine Learning: A Love Story”

I enjoyed this simple entertaining video by Hilary Mason which is both a history of machine learning and an overview.

## “Matrix Factorizations and the Grammar of Life”

I'm quite excited by the Nuit Blanche post on the papers “Structure Discovery in Nonparametric Regression through Compositional Kernel Search” (Duvenaud, Lloyd, Grosse, Tenenbaum, Ghahramani 2013) and “Exploiting compositionality to explore a large space of model structures” (Grosse, Salakhutdinov, Freeman, and Tenenbaum 2012).  For years my old company Momentum Investment Services, Carl, and I have been looking for fast, systematic ways to search large hypothesis spaces.  We considered context-free grammars as a means of generating hypotheses.  Carl and I did not get anywhere with that research, but now it seems that others have succeeded.  Be sure to look over the article, the blog posts, and the comments.

## “Quantum information and the Brain”

Scott Aaronson gave a wonderful talk at NIPS 2012 called “Quantum information and the Brain”.  I really enjoyed his very concise, intuitive description of quantum mechanics and his sceptical yet fair attitude towards the ideas expressed by Penrose in “The Emperor's New Mind”.

## Notes on “A Few Useful Things to Know about Machine Learning”

Pedro Domingos wrote this wonderful introductory overview of machine learning which teaches twelve short lessons about classifiers:

1. LEARNING = REPRESENTATION + EVALUATION + OPTIMIZATION
2. IT'S GENERALIZATION THAT COUNTS
3. DATA ALONE IS NOT ENOUGH
4. OVERFITTING HAS MANY FACES
5. INTUITION FAILS IN HIGH DIMENSIONS
6. THEORETICAL GUARANTEES ARE NOT WHAT THEY SEEM
7. FEATURE ENGINEERING IS THE KEY
8. MORE DATA BEATS A CLEVERER ALGORITHM
9. LEARN MANY MODELS, NOT JUST ONE
10. SIMPLICITY DOES NOT IMPLY ACCURACY
11. REPRESENTABLE DOES NOT IMPLY LEARNABLE
12. CORRELATION DOES NOT IMPLY CAUSATION

The lessons are augmented with excellent examples, so I highly recommend reading the simply written, short article.  I put together some notes and quotes on and from these lessons which I will share below.

LEARNING = REPRESENTATION + EVALUATION + OPTIMIZATION – Creating a classifier involves three key ideas: representation, an objective function, and optimization.  For example, for $k$-means, the domain is divided up by a Voronoi diagram and each region has a different label or possibly a mixture of labels.  This is the representation.  The objective function does not involve the labels, only the input data: it is the sum of the squared distances from each input datum to the “center” of the region to which it belongs.  The optimization technique is greedy alternation between optimizing the center of each region and the assignment of data to regions.

IT'S GENERALIZATION THAT COUNTS – “The fundamental goal of machine learning is to generalize beyond the examples in the training set.”  Merely learning the training set is both trivial and insufficient.  Often doing well on the training set is a bad indicator of performance outside the training set.  Generalization is the only way to label inputs that are not in the training set.  Unfortunately, Wolpert's “no free lunch” theorems demonstrate that no generalization method can always beat random guessing for learning problems without structure (i.e. Martin-Löf random classification problems do not admit generalization).

OVERFITTING HAS MANY FACES – If we do not have enough data or time to determine the correct representation or the correct number of parameters, then we introduce either bias errors when we use the wrong representation or overfitting errors when we have too many parameters (i.e. the bias-variance trade-off).  The fact that trying too many hypotheses has the same dangers as using too many parameters implies that, “contrary to intuition, a more powerful learner is not necessarily better than a less powerful one” and “strong false assumptions can be better than weak true ones, because a learner with the latter needs more data to avoid overfitting.”  Cross validation, statistical significance tests, minimizing the false discovery rate, and regularization terms prevent most overfitting.  Overfitting can even occur when all the data is correctly labelled.

INTUITION FAILS IN HIGH DIMENSIONS – Most algorithms have more trouble in higher dimensional spaces, and our intuition fails there too.  Nearest neighbor for simple classification problems tends to fail in more than 100 dimensions.  Our intuition, built on experience with 3-dimensional space, fails to grasp facts like: a hypersphere with radius 0.95 has less than 1% of the volume of a hypersphere of radius 1 in a 200-dimensional space.  So for a Gaussian in such a space, the majority of the probability is not near the mean; instead it lies in a “shell” about the mean.  The volume of a hypersphere is much less than that of the smallest hypercube that contains it.  These counter-intuitive facts wreak havoc on several machine learning algorithms.  It is even hard to fix or adjust the algorithms because we often can't visualize the problem in high dimensional spaces.  So dimensionality reduction via manifold learning, PCA, LDA, ICA, and feature selection becomes very important.

THEORETICAL GUARANTEES ARE NOT WHAT THEY SEEM – Domingos repeats the well known argument that if bad classifiers are wrong with probability greater than $\epsilon$ and the number of hypothetical classifiers is $H$, then $n = {1\over\epsilon}(\ln(H) + \ln({1\over\delta}))$ randomly chosen test examples are sufficient to eliminate all of the bad classifiers with probability greater than $1-\delta$.  This works great as long as $H$ is not large, but often $H$ is large.  The number of possible classifiers for $k$ binary features is $2^{2^k}$, so the bound above is sometimes not useful.  The number of hypotheses can also be effectively infinite.  Similarly, if the number of bad classifiers is large compared to the number of good classifiers, then $\delta$ needs to be very small.  Other performance bounds are true only in the limit (e.g. infinite data) or contain unknown constants, so they merely give the user hope rather than a guarantee that the resulting classifier is correct.
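
To get a feel for the numbers in that bound, here is a small Python calculation (the $\epsilon$, $\delta$, and $H$ values are arbitrary choices for illustration):

```python
import math

def sample_bound(H, epsilon, delta):
    """n = (1/epsilon) * (ln(H) + ln(1/delta)) test examples suffice to
    eliminate, with probability at least 1 - delta, every classifier whose
    error rate exceeds epsilon, when there are H candidate classifiers."""
    return math.ceil((math.log(H) + math.log(1.0 / delta)) / epsilon)

# A modest hypothesis space needs a very reasonable number of examples.
print(sample_bound(H=1000, epsilon=0.01, delta=0.05))  # 991

# But with k = 20 binary features, H = 2**(2**20), so ln(H) = 2**20 * ln(2)
# and the bound explodes to tens of millions of examples.
n_big = math.ceil((2**20 * math.log(2) + math.log(1.0 / 0.05)) / 0.01)
print(n_big)
```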

FEATURE ENGINEERING IS THE KEY –  Machine learning is often an iterative process of generating features, choosing algorithms, and analyzing the results.  Creating useful features is often difficult and can require domain knowledge or luck, so “there is ultimately no replacement for the smarts you put into feature engineering.”

LEARN MANY MODELS, NOT JUST ONE – Use ensemble learning including bagging, boosting, bootstrap, stacking, Bayesian optimization, mixture of experts, and plain old averaging/voting.  These ensembles of other well-known methods work and win contests like the Netflix prize.  Bayesian model averaging (BMA) is theoretically optimal, but it tends to heavily weight one of the learners, so it should not be considered ensemble learning.
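
The simplest of these, plain voting, fits in a few lines (a generic Python sketch, not tied to any particular library; the classifier outputs are made up):

```python
from collections import Counter

def majority_vote(predictions):
    """Combine the label lists from several classifiers by majority vote."""
    return [Counter(votes).most_common(1)[0][0] for votes in zip(*predictions)]

# Predicted labels from three hypothetical classifiers on four inputs.
clf_a = [1, 0, 1, 1]
clf_b = [1, 1, 0, 1]
clf_c = [0, 0, 1, 1]
print(majority_vote([clf_a, clf_b, clf_c]))  # [1, 0, 1, 1]
```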

MORE DATA BEATS A CLEVERER ALGORITHM – If you are only going to read one section, read this section.  The title itself is self-explanatory, but Domingos adds several interesting insights in this section like,

“…even though in principle more data means that more complex classiﬁers can be learned, in practice simpler classiﬁers wind up being used, because complex ones take too long to learn.”

“As a rule, it pays to try the simplest learners ﬁrst (e.g., naive Bayes before logistic regression, k-nearest neighbor before support vector machines). More sophisticated learners are seductive, but they are usually harder to use, because they have more knobs you need to turn to get good results, and because their internals are more opaque.”,

“Part of the reason using cleverer algorithms has a smaller payoﬀ than you might expect is that, to a ﬁrst approximation, they all do the same. This is surprising when you consider representations as different as, say, sets of rules and neural networks. But in fact propositional rules are readily encoded as neural networks, and similar relationships hold between other representations. All learners essentially work by grouping nearby examples into the same class; the key difference is in the meaning of “nearby.” “,

“In the end, the biggest bottleneck is not data or CPU cycles, but human cycles. In research papers, learners are typically compared on measures of accuracy and computational cost. But human eﬀort saved and insight gained, although harder to measure, are often more important. This favors learners that produce human-understandable output (e.g., rule sets). And the organizations that make the most of machine learning are those that have in place an infrastructure that makes experimenting with many different learners, data sources and learning problems easy and eﬃcient, and where there is a close collaboration between machine learning experts and application domain ones.”,

and

“Learners can be divided into two major types: those whose representation has a ﬁxed size, like linear classiﬁers, and those whose representation can grow with the data, like decision trees.”

He says the variable size learners can, in theory, learn any class, but they often fall into local optima, require too much time, or fall victim to the curse of dimensionality, so these methods require a lot of experimentation and thinking on the part of the analyst.

Lastly, generative models are more fun.

## “Machine Learning Cheat Sheet (for scikit-learn)”

I love this diagram created by Peekaboo: Andy’s Computer Vision and Machine Learning Blog.