General ML

Pedro Domingos wrote this wonderful introductory overview of machine learning, which teaches twelve short lessons about classifiers:

  1. LEARNING = REPRESENTATION + EVALUATION + OPTIMIZATION
  2. IT’S GENERALIZATION THAT COUNTS
  3. DATA ALONE IS NOT ENOUGH
  4. OVERFITTING HAS MANY FACES
  5. INTUITION FAILS IN HIGH DIMENSIONS
  6. THEORETICAL GUARANTEES ARE NOT WHAT THEY SEEM
  7. FEATURE ENGINEERING IS THE KEY
  8. MORE DATA BEATS A CLEVERER ALGORITHM
  9. LEARN MANY MODELS, NOT JUST ONE
  10. SIMPLICITY DOES NOT IMPLY ACCURACY
  11. REPRESENTABLE DOES NOT IMPLY LEARNABLE
  12. CORRELATION DOES NOT IMPLY CAUSATION

The lessons are augmented with excellent examples, so I highly recommend reading this short, simply written article.

I put together some notes and quotes on and from these lessons, which I share below.

 

LEARNING = REPRESENTATION + EVALUATION + OPTIMIZATION – Creating a classifier involves three key ideas:  representation, an objective function, and optimization.  For example, for $k$-means classification, the domain is divided up by a Voronoi diagram and each region has a different label or possibly a mixture of labels.  This is the representation.  The objective function does not involve the labels, only the input data: it is the sum of the squared distances from each input datum to the “center” of the region to which that datum belongs.  The optimization technique is greedy alternation between optimizing the center of each region and the assignment of data to regions.
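
As a concrete sketch of that decomposition (my own toy NumPy version, not code from the article), the greedy alternation for $k$-means looks like this:

```python
import numpy as np

def kmeans(X, k, iters=100, seed=0):
    """Greedy alternation between assigning data to regions and re-centering the regions."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]  # representation: k region centers
    for _ in range(iters):
        # Assignment step: each datum joins the Voronoi cell of its nearest center.
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each center moves to the mean of the data assigned to it.
        new_centers = centers.copy()
        for j in range(k):
            members = X[labels == j]
            if len(members):
                new_centers[j] = members.mean(axis=0)
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    # Objective function: sum of squared distances from each datum to its region's center.
    objective = ((X - centers[labels]) ** 2).sum()
    return centers, labels, objective
```

Here the set of centers is the representation (via its Voronoi regions), the returned sum of squared distances is the objective, and the alternating loop is the optimizer; classification would then attach a label, or mixture of labels, to each learned region.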

IT’S GENERALIZATION THAT COUNTS – “The fundamental goal of machine learning is to generalize beyond the examples in the training set.”  Merely memorizing the training set is both trivial and insufficient; doing well on the training set is often a poor indicator of performance outside it.  Generalization is the only way to label inputs that are not in the training set.  Unfortunately, Wolpert’s “no free lunch” theorems demonstrate that no generalization method can always beat random guessing on learning problems without structure (i.e. Martin-Löf random classification problems do not admit generalization).

OVERFITTING HAS MANY FACES – If we do not have enough data or time to determine the correct representation or the correct number of parameters, then we introduce either bias errors, by using the wrong representation, or overfitting errors, by using too many parameters (i.e. the bias-variance trade-off).  The fact that trying too many hypotheses carries the same dangers as using too many parameters implies that, “contrary to intuition, a more powerful learner is not necessarily better than a less powerful one,” and that “strong false assumptions can be better than weak true ones, because a learner with the latter needs more data to avoid overfitting.”  Cross validation, statistical significance tests, controlling the false discovery rate, and regularization terms all help combat overfitting.  Overfitting can occur even when all the data is correctly labelled.
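
One way to see the cross-validation/regularization defense in action is the sketch below (synthetic data and scikit-learn assumed; the exact numbers are illustrative): training error keeps rewarding the weakest regularization, while cross-validated error exposes the overfitting.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 30))              # few examples, many features: easy to overfit
y = X[:, 0] + 0.1 * rng.normal(size=60)    # only the first feature actually matters

for alpha in (1e-4, 1e-2, 1.0, 10.0):
    model = Ridge(alpha=alpha)
    # 5-fold cross-validated error (what we should trust)
    cv_mse = -cross_val_score(model, X, y, cv=5,
                              scoring="neg_mean_squared_error").mean()
    # Error on the data the model was fit to (always flatters weak regularization)
    train_mse = ((model.fit(X, y).predict(X) - y) ** 2).mean()
    print(f"alpha={alpha:<6} train MSE={train_mse:.3f}  5-fold CV MSE={cv_mse:.3f}")
```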

INTUITION FAILS IN HIGH DIMENSIONS – Most algorithms have more trouble in higher dimensional spaces, and our intuition fails there too.  Nearest neighbor tends to break down on even simple classification problems beyond about 100 dimensions.  Our intuition, built on experience with three-dimensional space, fails to grasp facts like: in 200 dimensions, a hypersphere of radius 0.95 has less than 1% of the volume of a hypersphere of radius 1.  So for a Gaussian in such a space, the majority of the probability mass is not near the mean; it lies in a “shell” around the mean.  Likewise, the volume of a hypersphere is a tiny fraction of the smallest hypercube that contains it.  These counter-intuitive facts wreak havoc on several machine learning algorithms, and it is hard to fix or adjust an algorithm when we can’t visualize the problem.  So dimensionality reduction via manifold learning, PCA, LDA, ICA, and feature selection becomes very important.
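
Both of those facts are easy to check numerically; here is a quick sketch (the dimension and sample size are arbitrary choices of mine):

```python
import numpy as np

d = 200  # dimension

# Hypersphere volume scales as r**d, so the radius-0.95 sphere holds only
# 0.95**200 of the unit sphere's volume -- far below 1%.
print(f"volume ratio (r=0.95 vs r=1): {0.95 ** d:.2e}")

# For a standard Gaussian in d dimensions, the norm concentrates in a thin
# shell near sqrt(d), not near the mean at the origin.
samples = np.random.default_rng(0).standard_normal((10000, d))
norms = np.linalg.norm(samples, axis=1)
print(f"mean norm = {norms.mean():.1f}, std = {norms.std():.2f}, sqrt(d) = {np.sqrt(d):.1f}")
```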

THEORETICAL GUARANTEES ARE NOT WHAT THEY SEEM  – Domingos repeats the well-known argument that if bad classifiers are wrong with probability greater than $\epsilon$ and the number of hypothetical classifiers is $H$, then $n = {1\over\epsilon}(\ln(H) + \ln({1\over\delta}))$ randomly chosen examples are sufficient to eliminate all of the bad classifiers with probability greater than $1- \delta$.  This works great as long as $H$ is not large, but often $H$ is huge: the number of possible classifiers over $k$ binary features is $2^{2^k}$, so the bound above is often useless, and the number of hypotheses can even be effectively infinite.  Similarly, if the number of bad classifiers is large compared to the number of good classifiers, then $\delta$ needs to be very small.  Other performance bounds hold only in the limit (e.g. infinite data) or contain unknown constants, so they merely give the user hope rather than a guarantee that the resulting classifier is correct.
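
To see how quickly the bound stops being reassuring, plug in numbers; a small sketch of the arithmetic, where the stump-counting example is my own illustration rather than Domingos’:

```python
import math

def examples_needed(ln_H, eps=0.01, delta=0.01):
    """n >= (1/eps) * (ln|H| + ln(1/delta)), the bound discussed above."""
    return (ln_H + math.log(1 / delta)) / eps

# A tiny hypothesis class (say, one decision stump per binary feature and polarity):
k = 20
print(f"{examples_needed(math.log(2 * k)):,.0f}")           # roughly 800 examples

# Every boolean function of k binary features: |H| = 2**(2**k), so ln|H| = (2**k) * ln 2.
print(f"{examples_needed((2 ** k) * math.log(2)):,.0f}")    # about 73 million examples, already at k = 20
```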

FEATURE ENGINEERING IS THE KEY –  Machine learning is often an iterative process of generating features, choosing algorithms, and analyzing the results.  Creating useful features is often difficult and can require domain knowledge or luck, so “there is ultimately no replacement for the smarts you put into feature engineering.”

LEARN MANY MODELS, NOT JUST ONE – Use ensemble learning: bagging, boosting, bootstrap, stacking, Bayesian optimization, mixture of experts, and plain old averaging/voting.  Ensembles of well-known methods work and win contests like the Netflix Prize.  Bayesian model averaging (BMA) is theoretically optimal, but it tends to weight one of the learners very heavily, so it should not be considered ensemble learning.
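
Here is a minimal sketch of the plain averaging/voting end of that list, comparing three standard scikit-learn learners against a hard-vote ensemble of the same three (synthetic data, purely illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

members = [("nb", GaussianNB()),
           ("lr", LogisticRegression(max_iter=1000)),
           ("rf", RandomForestClassifier(n_estimators=100, random_state=0))]

# Each learner alone, then a simple majority vote over all three.
for name, clf in members + [("vote", VotingClassifier(members, voting="hard"))]:
    acc = cross_val_score(clf, X, y, cv=5).mean()
    print(f"{name:>4}: {acc:.3f}")
```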

MORE DATA BEATS A CLEVERER ALGORITHM – If you are only going to read one section, read this one.  The title is self-explanatory, but Domingos adds several interesting insights, like,

“…even though in principle more data means that more complex classifiers can be learned, in practice simpler classifiers wind up being used, because complex ones take too long to learn.”

“As a rule, it pays to try the simplest learners first (e.g., naive Bayes before logistic regression, k-nearest neighbor before support vector machines). More sophisticated learners are seductive, but they are usually harder to use, because they have more knobs you need to turn to get good results, and because their internals are more opaque.”,

“Part of the reason using cleverer algorithms has a smaller payoff than you might expect is that, to a first approximation, they all do the same. This is surprising when you consider representations as different as, say, sets of rules and neural networks. But in fact propositional rules are readily encoded as neural networks, and similar relationships hold between other representations. All learners essentially work by grouping nearby examples into the same class; the key difference is in the meaning of ‘nearby.’”,

“In the end, the biggest bottleneck is not data or CPU cycles, but human cycles. In research papers, learners are typically compared on measures of accuracy and computational cost. But human effort saved and insight gained, although harder to measure, are often more important. This favors learners that produce human-understandable output (e.g., rule sets). And the organizations that make the most of machine learning are those that have in place an infrastructure that makes experimenting with many different learners, data sources and learning problems easy and efficient, and where there is a close collaboration between machine learning experts and application domain ones.”,

and

“Learners can be divided into two major types: those whose representation has a fixed size, like linear classifiers, and those whose representation can grow with the data, like decision trees.”

He says the variable-size learners can, in theory, learn any class, but they often fall into local optima, require too much time, or fall prey to the curse of dimensionality, so these methods require a lot of experimentation and thinking on the part of the analyst.

Lastly, generative models are more fun.

 

I love this diagram created by Peekaboo: Andy’s Computer Vision and Machine Learning Blog.

In the seminal paper “Gene Selection for Cancer Classification using Support Vector Machines”, Guyon, Weston, Barnhill, and Vapnik (2002) use Recursive Feature Elimination to find the genes that are most predictive of cancer. Recursive Feature Elimination repeatedly ranks the features and eliminates the worst feature until only a small subset of the original set of features remains. Although several feature ranking methods were explored, the main method was a soft-margin SVM classifier, with which the authors found 8 key colon cancer genes out of 7000.
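
scikit-learn ships an implementation of the same recursive procedure; the sketch below runs it on synthetic data standing in for a gene-expression matrix (this is not the paper’s data or exact settings, just the general recipe):

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.svm import SVC

# Synthetic stand-in: 1000 features, only a handful actually informative.
X, y = make_classification(n_samples=100, n_features=1000, n_informative=8,
                           n_redundant=0, random_state=0)

# A linear soft-margin SVM ranks features by the magnitude of its weights;
# RFE repeatedly drops the lowest-ranked features until 8 remain.
selector = RFE(estimator=SVC(kernel="linear", C=1.0), n_features_to_select=8, step=0.1)
selector.fit(X, y)
print("selected feature indices:", [i for i, keep in enumerate(selector.support_) if keep])
```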

Check out Ross Rosen’s collection of tools for machine learning with Python.

Check out WhiteSwami’s list of subjects to master if you are going to become a machine learner.

David Andrzejewski at Bayes’ Cave wrote up a nice summary of practical machine learning advice from the KDD 2011 paper “Detecting Adversarial Advertisements in the Wild”.  I’ve quoted below several of the main points from David’s summary:

  • ABE: Always Be Ensemble-ing
  • Throw a ton of features at the model and let L1 sparsity figure it out
  • Map features with the “hashing trick” (see the sketch after this list)
  • Handle the class imbalance problem with ranking
  • Use a cascade of classifiers
  • Make sure the system “still works” as its inputs evolve over time
  • Make efficient use of expert effort
  • Allow humans to hard-code rules
  • Periodically use non-expert evaluations to make sure the system is working
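
For the “hashing trick” bullet above, here is a minimal sketch of the idea (the feature names are hypothetical, and real systems use a stable hash such as MurmurHash; scikit-learn’s FeatureHasher is a production-quality version): hash each sparse feature into one of a fixed number of buckets, so the model’s dimensionality stays bounded no matter how large the raw vocabulary grows.

```python
import numpy as np

def hash_features(tokens, n_buckets=2 ** 18):
    """Map an unbounded vocabulary of sparse string features into a fixed-length vector."""
    x = np.zeros(n_buckets)
    for tok in tokens:
        h = hash(tok)                                # illustrative only; use a stable hash in practice
        sign = 1.0 if (h >> 1) % 2 == 0 else -1.0    # signed hashing reduces collision bias
        x[h % n_buckets] += sign
    return x

# Hypothetical ad features; the output dimensionality is fixed regardless of vocabulary size.
x = hash_features(["landing_page:pharma", "advertiser_id:12345", "word:cheap"])
print(x.shape, int(np.count_nonzero(x)))
```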

Check out the new class (all lectures on-line) advertised on the Machine Learning (Theory) blog.

Check out Bookworm9765’s review of Kurzweil’s book “How to Create a Mind” on Amazon.com.  Here is a snippet:

The core of Kurzweil’s theory is that the brain is made up of pattern-processing units of roughly 100 neurons each, and he suggests that the brain can be understood and simulated primarily by looking at how these Lego-like building blocks are interconnected.

NIPS was pretty fantastic this year.  There were a number of breakthroughs in the areas that interest me most:  Markov Decision Processes, Game Theory, Multi-Armed Bandits, and Deep Belief Networks.  Here is the list of papers, workshops, and presentations I found the most interesting or potentially useful:

 

  1. Representation, Inference and Learning in Structured Statistical Models
  2. Stochastic Search and Optimization
  3. Quantum information and the Brain
  4. Relax and Randomize: From Value to Algorithms  (Great)
  5. Classification with Deep Invariant Scattering Networks
  6. Discriminative Learning of Sum-Product Networks
  7. On the Use of Non-Stationary Policies for Stationary Infinite-Horizon Markov Decision Processes
  8. A Unifying Perspective of Parametric Policy Search Methods for Markov Decision Processes
  9. Regularized Off-Policy TD-Learning
  10. Multi-Stage Multi-Task Feature Learning
  11. Graphical Models via Generalized Linear Models (Great)
  12. No voodoo here! Learning discrete graphical models via inverse covariance estimation (Great)
  13. Gradient Weights help Nonparametric Regressors
  14. Dropout: A simple and effective way to improve neural networks (Great)
  15. Efficient Monte Carlo Counterfactual Regret Minimization in Games with Many Player Actions
  16. A Better Way to Pre-Train Deep Boltzmann Machines
  17. Bayesian Optimization and Decision Making
  18. Practical Bayesian Optimization of Machine Learning Algorithms
  19. Modern Nonparametric Methods in Machine Learning
  20. Deep Learning and Unsupervised Feature Learning

Unfortunately, when you have 30 full-day workshops in a two-day period, you miss most of them.  I could only attend the three listed above.  There were many other great ones.

 

 

A post at the Factual Blog has this cool list of 5 principles for applying machine learning techniques.  The author’s data-centric techniques are:

  • Don’t Ignore the Corners – The “corners” are unusual cases in the data
  • Be Attentive to the Boundaries – If you use a linear discriminant or decision tree, pay special attention to boundary cases.
  • Spend Time on Special Cases – i.e. special cases in the data.
  • Listen to the Data
  • Love Your Data

Another post at the Factual Blog adds this list:

  • Ask for help first.
  • The documentation is your best friend.
  • Know the ecosystem.  (Python, Java/Hadoop/Weka, R, Matlab, …)
  • Machine Learning applications are mostly the boring stuff.  “The majority of the effort is in pre-processing”
  • Save the ML for the problems you can’t think to solve in any other way.
  • Coding in R makes you feel like a ninja.  “The R core library is full of awesome one-liners ….”
