NIPS was pretty fantastic this year.  There were a number of breakthroughs in the areas that interest me most:  Markov Decision Processes, Game Theory, Multi-Armed Bandits, and Deep Belief Networks.  Here is the list of papers, workshops, and presentations I found the most interesting or potentially useful:

 

  1. Representation, Inference and Learning in Structured Statistical Models
  2. Stochastic Search and Optimization
  3. Quantum information and the Brain
  4. Relax and Randomize : From Value to Algorithms  (Great)
  5. Classification with Deep Invariant Scattering Networks
  6. Discriminative Learning of Sum-Product Networks
  7. On the Use of Non-Stationary Policies for Stationary Infinite-Horizon Markov Decision Processes
  8. A Unifying Perspective of Parametric Policy Search Methods for Markov Decision Processes
  9. Regularized Off-Policy TD-Learning
  10. Multi-Stage Multi-Task Feature Learning
  11. Graphical Models via Generalized Linear Models (Great)
  12. No voodoo here! Learning discrete graphical models via inverse covariance estimation (Great)
  13. Gradient Weights help Nonparametric Regressors
  14. Dropout: A simple and effective way to improve neural networks (Great)
  15. Efficient Monte Carlo Counterfactual Regret Minimization in Games with Many Player Actions
  16. A Better Way to Pre-Train Deep Boltzmann Machines
  17. Bayesian Optimization and Decision Making
  18. Practical Bayesian Optimization of Machine Learning Algorithms
  19. Modern Nonparametric Methods in Machine Learning
  20. Deep Learning and Unsupervised Feature Learning
Unfortunately, when you have 30 full day workshops in a two day period, you miss most of them.  I could only attend the three listed above.  There were many other great ones.

 

 

The blog Computational Information Geometry Wonderland pointed me toward the article “k-MLE: A fast algorithm for learning statistical mixture models” by Frank Nielsen (2012).  $k$-means can be viewed as alternating between 1) assigning points to clusters and 2) performing a maximum likelihood estimation (MLE) of the mean of spherical Gaussians clusters (all of which are forced to have the same covariance matrix equal to a scalar multiple of the identity).  If we replace the spherical Gaussian with another set of distributions, we get $k$-MLE.  Nielsen does a remarkably good job of introducing the reader to some complex concepts without requiring anything other than a background in probability and advance calculus.  He explores the relationships between $k$-MLE with exponential families and information geometry.  Along the way he exposes the reader to Bregman divergences,  cross-entropy,  Legendre dualityItakura-Saito divergence, and Burg matrix divergence.

Julia can be written like Malab without typing information and it runs very fast, at nearly the speed of C, because it does runtime type inference and JIT compilation. Underneath it has sophisticated dynamic algebraic typing system which can be manipulated by the programmer (much like Haskell).  Carl sent me a link to this video about how the language achieves this level of type inference and type manipulation.

In “Semantic Hashing“, Salakhutdinov and Hinton (2007) show how to classify documents with binary vectors.  They combine deep learning and graphical models to assign each document a binary vector.  Similar documents can be found by using the L1 difference between the binary vectors.  Here is their abstract.

We show how to learn a deep graphical model of the word-count vectors obtained from a large set of documents. The values of the latent variables in the deepest layer are easy to infer and give a much better representation of each document than Latent Semantic Analysis. When the deepest layer is forced to use a small number of binary variables (e.g. 32), the graphical model performs “semantic hashing”: Documents are mapped to memory addresses in such away that semantically similar documents are located at nearby addresses. Documents similar to a query document can then be found by simply accessing all the addresses that differ by only a few bits from the address of the query document. This way of extending the efficiency of hash-coding to approximate matching is much faster than locality sensitive hashing, which is the fastest current method. By using semantic hashing to filter the documents given to TF-IDF, we achieve higher accuracy than applying TF-IDF to the entire document set.

Here’s a link to the video.  The tiny “Alien” skull below is printed in less than three minutes.

I love this graph.  For the complete article, click here.

“A Product of Experts model (PoE) (Hinton 2002) combines a number of individual component models (the experts) by taking their product and normalizing the result.”  Here’s the Scholarpedia page.

In “Rotation Forest: A New Classifier Ensemble Method“, Rodrıguez and Kuncheva (2006) develop a Random Forest method where the decision boundaries are not of the form   original data feature value <= fixed number. They do this by creating new features by “rotating” the data in the high dimensional feature space before creating a decision tree.  Each decision tree is created with a different rotation of the data, but all of the rotations are generated from Principal Component Analysis. The trees are combined to generate the final classifier.  Here is their abstract

We propose a method for generating classifier ensembles based on feature extraction. To create the training data for a base classifier, the feature set is randomly split into K subsets (K is a parameter of the algorithm) and Principal Component Analysis (PCA) is applied to each subset. All principal components are retained in order to preserve the variability information in the data. Thus, K axis rotations take place to form the new features for a base classifier. The idea of the rotation approach is to encourage simultaneously individual accuracy and diversity within the ensemble. Diversity is promoted through the feature extraction for each base classifier. Decision trees were chosen here because they are sensitive to rotation of the feature axes, hence the name “forest.” Accuracy is sought by keeping all principal components and also using the whole data set to train each base classifier. Using WEKA, we examined the Rotation Forest ensemble on a random selection of 33 benchmark data sets from the UCI repository and compared it with Bagging, AdaBoost, and Random Forest. The results were favorable to Rotation Forest and prompted an investigation into diversity-accuracy landscape of the ensemble models. Diversity-error diagrams revealed that Rotation Forest ensembles construct individual classifiers which are more accurate than these in AdaBoost and Random Forest, and more diverse than these in Bagging, sometimes more accurate as well.

At NIPS yesterday, James Spall gave a nice overview of stochastic optimization. Stochastic optimization is the process of finding the minimum of a function $f(x)$ where are measurements or samples of the function are noisy.  He stressed that the no free lunch theorems (Wolpert & Macready 1995) limit the efficiency of any global minimization problem if there are no restrictions on $f$.  He described in detail the Simultaneous Perturbation Stochastic Approximation (SPSA) method which appears to be a great method for optimizing with noisy measurements. The basic is idea is that you don’t need to approximate the gradient by making $p+1$ measurements in a $p$ dimensional domain.  Instead, you sample $f$ at two nearby randomly generated points and make a nearly unbiased estimate of the gradient from those two measurements.   There is also way to form estimates of the Hessian with just 4 samples which leads to a stochastic algorithm similar to Newton-Raphson.  None of these methods can converge faster than $O(1/\sqrt{n})$ due to the noise, but they may be very useful as robust semi-global optimizers for functions with lots of local minima or high frequencies.

 at Factual Blog has this cool list of 5 principles for Applying Machine Learning Techniques.  His datacentric techniques are:

  • Don’t Ignore the Corners – The “Corners” are unusual cases in the Data
  • Be Attentive to the Boundaries – If you use a linear discriminant or decision tree, pay special attention to boundary cases.
  • Spend Time on Special Cases – i.e. special cases in the data.
  • Listen to the Data
  • Love Your Data
 also at Factual Blog adds this list:
  • Ask for help first.
  • The documentation is your best friend.
  • Know the ecosystem.  (Python, Java/Hadoop/Weka, R, Malab, …)
  • Machine Learning applications are mostly the boring stuff.  “The majority of the effort is in pre-processing”
  • Save the ML for the problems you can’t think to solve in any other way.
  • Coding in R makes you feel like a ninja.  “The R core library is full of awesome one-liners ….”

« Older entries § Newer entries »