David Andrzejewski at Bayes’ Cave wrote up a nice summary of practical machine learning advice from the KDD 2011 paper “Detecting Advesarial Advertisements in the Wild”.  I’ve quoted below several of the main points from David’s summary:

  • ABE: Always Be Ensemble-ing
  • Throw a ton of features at the model and let L1 sparsity figure it out
  • Map features with the “hashing trick
  • Handle the class imbalance problem with ranking
  • Use a cascade of classifiers
  • make sure the system “still works” as its inputs evolve over time
  • Make efficient use of expert effort
  • Allow humans to hard-code rules
  • periodically use non-expert evaluations to make sure the system is working

According to this graph

Figure 10 From THE LONG-TERM IMPACTS OF TEACHERS: TEACHER VALUE-ADDED AND STUDENT OUTCOMES IN ADULTHOOD by Chetty, Friedman, and Jonah E. Rockoff

high quality elementary school teachers increase the lifetime earnings of their students by about $200,000 per child.

 

In “Improving neural networks by preventing co-adaptation of feature detectors“, Hinton, Srivastava, Krizhevsky, Sutskever, and Salakhutdinov answer the question:  What happens if “On each presentation of each training case, each hidden unit is randomly omitted from the network with a probability of 0.5, so a hidden unit cannot rely on other hidden units being present.”  This mimics the standard technique of training several neural nets and averaging them, but it is faster.  When they applied the “dropout” technique to a deep Boltzmann neural net on the MNIST hand written digit data set and the TIMIT speech data set, they got robust learning without overfitting.  This was one of the main techniques used by the winners of the Merck Molecular Activity Challenge.

Hinton talks about the dropout technique in his video Brains, Sex, and Machine Learning.

“Aaron Swartz (1986-2013)”

In “Analysis of Thompson Sampling for the multi-armed bandit problem“, Agrawal and Goyal show that Thompson Sampling (“The basic idea is to choose an arm to play according to its probability of being the best arm.”) has logarithmic regret. More specifically, if there are $n$ bandits and the regret $R$ at time $T$ is defined by

$$R(T) := \sum_{t=1}^T (\mu^* – \mu_{i(t)})$$

where $\mu_i$ is the expected return of the $i$th arm and $\mu^* = \max_{i = 1, \ldots, n} \mu_i$, then

$$E[R(T)] \in O\left(\left(\sum_{i\neq i^*} 1/\Delta_i^2\right)^2 \log T \right)$$

where $\mu(i^*) = \mu^*$ and $\Delta_i = \mu^* – \mu_i$.

Taleb published an idea is similar to Cotner’s Observation which I will formally state below.
$\ $
If you see a tall person from a distance, you tend to expect that the tall person is male.  In fact, the taller the person is, the more likely it is that the person is male. If you plot histograms of people’s heights, you will find that this intuition is correct.  Cotner once pointed out to me that this intuition is false for many non-Gaussian Distributions.
$\ $

Cotner’s Observation:$\ $

Consider two random variables, $A$ and $B$ with $P(B>A) > 0.5$ and a random variable $X$ which is always equal to $A$ or $B$.  Suppose we flip a fair coin and if the coin is heads we assign $X$ to $A$ and if it is tails we let $X$ equal $B$.   Now if $X$ is large, we tend to think the coin was probably tails because $B$ is usually larger than $A$.  Furthermore, the larger $X$ is the more we think that the coin was probably tails.  This intuition is right for Gaussian Distributions, but it is wrong for many other distributions.  Let $p(x)$ be the probability that the coin was heads given that $X = x$
$$p(x) := P( X=A | X=x).$$
  • If $A$ and $B$ have Gaussian Distributions with equal standard deviations, then $p(x)$ is decreasing.
  • If $A$ and $B$ have Laplace Distributions with equal standard deviations, then $p(x)$ is becomes constant for all $x > E[B]$ and $p(x) < 0.5$ when $x > E[B]$.
  • If $A$ and $B$ have Student T-Distributions with equal standard deviations and equal degrees of freedom, then $p(x)$ will actually increase if $x$ is large enough and $p(x)$ approaches 0.5 as $x$ goes to infinity.
$\ $
Many distributions have a fat tails like the Student T Distributions.

In “BOA: The Bayesian Optimization Algorithm“, Pelikan, Goldberg, and Cant´u-Paz introduce an adaptive improvement over genetic optimization algorithms (See also [1]).  They write, “In this paper, an algorithm based on the concepts of genetic algorithms that uses an estimation of a probability distribution of promising solutions in order to generate new candidate solutions is proposed. To estimate the distribution, techniques for modeling multivariate data by Bayesian networks are used.”  and

“The algorithm proposed in this paper is also capable of covering higher order interactions. It uses techniques from the field of modeling data by Bayesian networks in order to estimate the joint distribution of promising solutions. The class of distributions that are considered is identical to the class of conditional distributions used in the FDA. Therefore, the theory of the FDA can be used in order to demonstrate the power of the proposed algorithm to solve decomposable problems. However, unlike the FDA, our algorithm does not require any prior information about the problem.  It discovers the structure of a problem on the fly.”

where FDA refers to the Factorized Distribution Algorithm (Mühlenbein et al., 1998).  The algorithm consists of the following steps

The Bayesian Optimization Algorithm (BOA)
(1) set t ← 0 randomly generate initial population P(0)
(2) select a set of promising strings S(t) from P(t)
(3) construct the network B using a chosen metric and constraints
(4) generate a set of new strings O(t) according to the joint distribution encoded by B
(5) create a new population P(t+1) by replacing some strings from P(t) with O(t)  set t ← t + 1
(6) if the termination criteria are not met, go to (2)

where B is a Bayesian network.

Check out the NIPS 2011 workshop.

Check out the new class (all lectures on-line) advertised on the Machine Learning (Theory) blog.

Check out the Nuit Blanche posts “Predicting the Future: The Steamrollers” and “Predicting the Future: Randomness and Parsimony” where Igor Carron repeats the well known mantra of Moore’s law that always seems to catch us by surprise. Carron’s remarks on medicine surprised me but also I thought, “I should have guessed that would happen” while reading the articles.

Garvesh Raskutti has some nice slides on Probabilistic Graphical Models and Markov Logic Networks (Richardson & Domingos 2006).  Markov Logic Networks encode first order predicate logic into a Markov Random Field. The resulting networks can be quite large because statements like “for all x, y, and z, x is y’s parent and z is x’s parent imply z is y’s grandparent” require the existence of $2n^2$ nodes in the graph where $n$ is the number of people considered.  The resulting networks are frequently solved by using Gibbs Sampling.  For even more information Pedro Domingos has put an entire course online at

www.cs.washington.edu/homes/pedrod/803

 

« Older entries § Newer entries »