General ML

You are currently browsing the archive for the General ML category.

The Daily Mail reports that the Computer Poker Research Group at the University of Alberta seems to have solved heads-up limit hold’em poker.

You can play against their AI online.

(My thanks to Glen for emailing me the story!)

Fernandez-Delgado, Cernadas, Barro, and Amorim tested 179 classfiers on 121 data sets and reported their results in “Do we Need Hundreds of Classi ers to Solve Real World Classi cation Problems?”  The classifiers were drawn from the following 17 families:

“discriminant analysis, Bayesian, neural networks, support vector machines, decision trees, rule-based classi fiers, boosting, bagging, stacking, random forests and other ensembles, generalized linear models, nearest-neighbors, partial least squares and principal component regression, logistic and multinomial regression, multiple adaptive regression splines and other methods”

from the Weka, Matlab, and R machine learning libraries.  The 121 datasets were drawn mostly from the UCI classification repository.

The overall result was that the random forest classifiers were best on average followed by support vector machines, neural networks, and boosting ensembles.

For more details, read the paper!

Mnih, Kavukcuoglu, Silver, Graves, Antonoglon, Wierstra, and Riedmiller authored the paper “Playing Atari with Deep Reinforcement Learning” which describes and an Atari game playing program created by the company Deep Mind (recently acquired by Google). The AI did not just learn how to pay one game. It learned to play seven Atari games without game specific direction from the programmers. (The same learning parameters, neural network topologies, and algorithms were used for every game).

The 2600 Atari gaming system was quite popular in the late 1970’s and the early 1980’s. The games ran with only four kilobytes of RAM and a 210 x 160 pixel display with 128 colors. Various machine learning techniques have been applied to the old Atari games using the Arcade Learning Environment which precisely reproduces the Atari 2600 gaming system. (See e.g. “An Object-Oriented Representation for Efficient Reinforcement Learning” by Diuk, Cohen, and Littman 2008, ”HyperNEAT-GGP:A HyperNEAT-based Atari General Game Player” by Hausknecht, Khandelwal, Miikkulainen, and Stone 2012, “Application of TEXPLORE on Atari Games “ by Shung Zhang , ”A Neuroevolution Approach to General Atari Game Playing” by Hausknect, Lehman,” Miikkulalianen, and Stone 2014, and “Replicating the Paper ‘Playing Atari with Deep Reinforcement Learning’ ” by Korjus, Kuzovkin, Tampuu, and Pungas 2014.)

Various papers have been written on how computers can learn to pay the Atari games, but most of them used the abstract representations of objects on the screen within the emulator. The Mnih et al AI learned to play the games using only the raw 210 x 160 video and the score. It seems to be the first successful attempt to learn arcade gaming from raw video.

To learn from raw video, they first converted the video to grayscale and then downsampled/cropped to 84 x 84 images. The last four frames were used to determine actions. The 28224 input pixels were run through two hidden convolution neural net layers and one fully connected (no convolution) 256 node hidden layer with a single output for each possible action. Training was done with stochastic gradient decent using random samples drawn from a historical database of previous games played by the AI to improve convergence   (This technique known as “experience replay” is described in “Reinforcement learning for robots using neural nets” Long-Ji Lin 1993.)

The objective function for supervised learning is usually a loss function representing the difference between the predicted label and the actual label. For these games the correct action is unknown, so reinforcement learning is used instead of supervised learning. The authors used a variant of Q-learning to train the weights in their neural network. They describe their algorithm in detail and compare it to several historical reinforcement algorithms, so this section of the paper can be used as a brief introduction to reinforcement learning.

The AI was trained to play seven games: Beam Rider, Breakout, Enduro, Pong, Q*bert, Seaquest, and Space Invaders. In six of the seven games, this general game learning algorithm outperformed all previously known reinforcement learning algorithms tested on those games and “surpasses a human expert on three” of the seven games.


The KDD 2014 article, written by Dong, Gabrilovich, Heitz, Horn, Lao, Murphy, Strohmann, Sun, and Zang, describes in detail Google’s knowledge database Knowledge Vault. Knowledge Vault is a probability triple store database. Each entry in the database is of the form subject-predicate-object-probability restricted to about 4500 predicates such as “born in”, “married to”, “held at”, or “authored by”. The database was built by combining the knowledge base Freebase with the Wikipedia and approximately one billion web pages. Knowledge Vault appears to be the largest knowledge base ever created. Dong et al. compared Knowledge Vault to NELL, YAGO2, Freebase, and the related project Knowledge Graph. (Knowledge Vault is probabilistic and contains many facts with less than 50% certainty. Knowledge Graph consists of high confidence knowledge.)

The information from the Wikipedia and the Web was extracted using standard natural language processing (NLP) tools including: “named entity recognition, part of speech tagging, dependency parsing, co-reference resolution (within each document), and entity linkage (which maps mentions of proper nouns and their co-references to the corresponding entities in the KB).” The text in these sources is mined using “distance supervision” (see Mintz, Bills, Snow, and Jurafksy “Distant Supervision for relation extraction without labeled data” 2009). Probabilities for each triple store are calculated using logistic regression (via MapReduce). Further information is extracted from internet tables (over 570 million tables) using the techniques in “Recovering semantics of tables on the web” by Ventis, Halevy, Madhavan, Pasca, Shen, Wu, Miao, and Wi 2012.

The facts extracted using the various extraction techniques are fused with logistic regression and boosted decision stumps (see “How boosting the margin can also boost classifier complexity” by Reyzin and Schapire 2006). Implications of the extracted knowledge are created using two techniques: the path ranking algorithm and a modified tensor decomposition.

The path ranking algorithm (see “Random walk inference and learning in a large scale knowledge base” by Lao, Mitchell, and Cohen 2011) can guess that if two people parent the same child, then it is likely that they are married. Several other examples of inferences derived from path ranking are provided in table 3 of the paper.

Tensor decomposition is just a generalization of singular value decomposition, a well-known machine learning technique. The authors used a “more powerful” modified version of tensor decomposition to derive additional facts. (See “Reasoning with Neural Tensor Networks for Knowledge Base Completion” by Socher, Chen, Manning, and Ng 2013.)

The article is very detailed and provides extensive references to knowledge base construction techniques. It, along with the references, can serve as a great introduction to modern knowledge engineering.

I read two nice articles this week: “Ten Simple Rules for Effective Computational Research” and “Seven Principles of Learning Better From Cognitive Science”.

In “Ten Simple Rules for Effective Computational Research”, Osborne, Bernabeu, Bruna, Calderhead, Cooper, Dalchau, Dunn, Fletcher, Freeman, Groen, Knapp, McInerny, Mirams, Pitt-Francis, Sengupta, Wright, Yates, Gavaghan, Emmott and Deane wrote up these guidelines for algorithm research:

  1. Look Before You Leap
  2. Develop a Prototype First
  3. Make Your Code Understandable to Others (and Yourself)
  4. Don’t Underestimate the Complexity of Your Task
  5. Understand the Mathematical, Numerical, and Computational Methods Underpinning Your Work
  6. Use Pictures: They Really Are Worth a Thousand Words
  7. Version Control Everything
  8. Test Everything
  9. Share Everything
  10. Keep Going!

Read the full PLOS 5 page article by clicking here!



Scott Young recently returned from a year abroad and wrote Up “Seven Principles of Learning Better From Cognitive Science” which is a review/summary of the book “Why Don’t Students Like School?” by Daniel Willingham. Here are the 7 Principles:

  1. Factual knowledge precedes skill.
  2. Memory is the residue of thought.
  3. We understand new things in the context of what we already know.
  4. Proficiency requires practice.
  5. Cognition is fundamentally different early and late in training.
  6. People are more alike than different in how we learn.
  7. Intelligence can be changed through sustained hard work.

Read the full 5 page article by clicking here! Get the great $10 book here. has a great reading list.

Topics covered include:

  • 5 General overview papers (Bengio, Schmidhuber, Graves ….)
  • 12  Foundation Theory and Motivation papers (Bingio, Hinton, LeCun, ….)
  • 7 Computer Vision papers (Deep Convolutional Networks, Hierarchical Features, Practical Applications)
  • 5 Papers on Natural Language Processing and Speech (Recursive Auto-encoders, Bidirectional Long Short Term Memory)
  • 15 Papers on  Unsupervised Feature Learning (Deep Boltzmann Machines, Autoencoders, …)
  • And about 40 papers under the headings: Disentangling Factors and Varitions with Depth, Transfer Learning and domain adaptation, Practical Tricks and Guides, Sparse Coding,  Classification,  Large Scale Deep Learning,  Recurrent Networks,   Hyper Parameters, Miscellaneous, and Optimization.

They conclude their list with a list of three other machine learning reading lists and three other links to deep learning tutorials.

Carl sent me a YouTube video by Dr  Gunnar Carlsson on the application of Topology to Data Mining (Topological Data Analysis).

Dr. Carlsson created a short 5 minute introduction, and a longer video of one of his lectures.

For more information, check out “Topology based data analysis identifies a subgroup of breast cancers with a unique mutational profile and
excellent survival” by Nicolaua, Levineb, and Carlsson.

Also, Bansal and Choudhary put together a nice set of slides on the subject with applications to clustering and visualization.


  1. Enjoying John Baez’s blog Azimuth.  Especially the posts on good research practices and an older post on levels of mathematical understanding.
  2. García-Pérez, Serrano, and Boguñá wrote a cool paper on primes, probability, and integers as a bipartite network.
  3. Loved the idea behind the game theoretical book “Survival of the Nicest”  (see Yes Magazine for a two page introduction).
  4. Scott Young is learning Chinese quickly.
  5. Cyber warriors to the rescue.
  6. Mao, Fluxx, and Douglas Hofstadter‘s Nomic are fun games.
  7. Healy and Caudell are applying category theory to semantic and neural networks.
  8. Some MOOCs for data science and machine learning.
  9. Here an old but good free online course on the Computational Complexity of Machine Learning.
  10. Great TeX graphics.
  11. Watch this Ted Video to learn anything in 20 hours (YMMV).
  12. Where are all the Steeler fans?  Cowboy fans?  ….
  13. Productivity Hints.
  14. Copper + Magnets = Fun
  15. Stray dogs on the subway.
  16. Deep learning on NPR.
  17. Happy 40th birthday D&D
  18. Google is applying deep learning to images of house numbers
  19. Deep learning in your browser.
  20. How to write a great research paper.
  21. Do Deep Nets Really Need to be Deep?
  22. A variation on neural net dropout.
  23. Provable algorithms for Machine Learning
  24. 100 Numpy Exercises
  25. Learn and Practice Applied Machine Learning | Machine Learning Mastery


Over the last few weeks, I’ve been working with some tricky features. Interestingly, I needed to add noise to the features to improve my classifier performance.  I will write a post on these “confounding features” later.  For now, let me just point out the following useful fact.


$$f(x, \sigma) = {1\over{\sigma \sqrt{2 \pi}}} \exp{\left(-{{x^2}\over{2 \sigma^2}}\right)},$$


$$\max_\sigma f(x,\sigma) = f(x, |x|).$$

So, if you have a Gaussian with mean zero and you want to fatten it to maximize the likelihood of the probability density function at $x$ without changing the mean, then set the standard deviation to $|x|$.

So I have been learning a little category theory in the hopes that I could use it somehow to teach computers abstraction.

What is abstraction?  Carl and I have been talking about this for years.  Here are some ideas on what abstraction could be:

1a) An average model of a large group of similar things.  For example, the abstraction of a chair might be an idealized chair with a seat, a backrest, four legs, and possibly “strechers” between the legs to give it strength and stability.

1b) A function in a computer program that is used in several places in the code.  In effect this function shortens the code.  If the code has length L1 without the function, the code is length L2 with the function, the function is length LC, and the function is used N times in the code with the function, then typically

L2 $\approx$ L1 – LC*N + LC + N.

The function is an abstraction of the pieces of code that it replaced similar to the way an ideal chair is an abstraction of many chairs.  The function has the common characteristics of the code it replaced.  For example, the following code consists of about 160 characters

// C Code without function

   printf("Hello John\n");
   printf("Hello Harry\n");
   printf("Hello Rose\n");
   printf("Hello Carol\n");
   printf("Hello Mr. Thomas\n");
   printf("Hello Mrs. Jones\n");


and it could be shortened slightly to about 137 characters by


// C Code with function

void pr(char *s)
   printf("Hello %s\n, s");

main() {
   pr("Mr. Thomas");
   pr("Mrs. Jones");


Though this example is not in any way practical, it does illustrate how a function can capture commonality.  All of the pieces of code replaced by the pr function had a call to printf, the substring “Hello “, and the carriage return “\n”.  The pr function was, in a sense, the idealized version of the six print statements in the original code.

In category theory, we could refer to the category of strings with arrows going from superstrings to substrings.  In this category, there would be an arrow from the line of code ”   printf(“Hello John\n”);” to the string ”   printf(“Hello \n”);”.  The categorical coproduct of several strings in this category would be the longest common substring.  So, category theory suggests the substring ”   printf(“Hello \n”);” should somehow be included in the abstraction of those six printf statements in the original code.

2)  A simplifying transformation that changes things into simpler things which are easier to work with.  For example, if you are counting sheep in a pen, you might convert groups of sheep into numbers and then add the numbers to get a total count for the entire pen.  We simplified the counting process by first converting groups of sheep into numbers.  Another example might be solving a corn maze by first flying over it, taking a picture, and then converting the “impassable” rows of corn into lines drawn on a piece of paper.  To solve the corn maze, you first convert it to a pen drawing on paper and then solve the pen drawn maze.

3)  Removing irrelevant information or detail.  If a piece in chess is attacking the opponent’s king, then the opponent must either a) move the king, b) put a piece between the attacker and the king, or c) capture the attacking piece.  If you wanted to list the possible moves for the opponent, you can often delete from the chess board most of the pieces because most of the pieces on the board do not affect the three options above.  In the diagram below, the white pawns, rook, and king and the black pawn on a7 do not affect the legal moves for black.


In a corn maze, the type of corn, the height of the corn stalks, the temperature, or the type of soil do not affect the solution of the corn maze, so that information could be ignored.  Ignoring irrelevant information allows the brain to form a more abstract, simple model of the corn maze.

4) Variables in first order predicate logic.  “If person A is taller than person B and person B is taller than person C, then person A is taller than person C.”  In that statement, person A could any one of a billion people on earth.

5) Establishing classes of objects.  “Fruit tends to spoil quickly.”  That statement applies to a huge variety of organic objects.  Abstraction classes can even be put in a hierarchy like   Animal ==>> Mammal ==> Cat ==> Siamese Cat.

6) Statistical Patterns and Compression.  If a piece of code like zip.exe compresses a corpus of English text files, we could say that the piece of code formed an abstract model of the corpus.  In the case of the zip function which uses the Lempel–Ziv–Markov chain algorithm, the zip program would notice that

  • the letters ‘e’ and ‘t’ appear frequently,
  • the letter ‘u’ usually follows ‘q’, and
  • the capital letter I frequently appears with spaces on either side.

These patterns are all mini-abstractions or aspects of English.


A very general notion of abstraction could be any function which shortens the compression of data.  That function could be an abstraction of some part of the data.  If your data consisted of 3 dimensional models of chairs, then a compression program could use the fact that a chair usually has four legs, a seat, and a backrest.  A compression program might benefit from a function like the one shown below.

char *DescribeChair( double *featureVector)
   char *description;

   description = describeLegs(outputBuffer, featureVector);
   description = describeSeatOnTopOfLegs(description,
   description = describeBackRestOnTopOfSeat(description,

   return description;


The abstract version of a chair would help compress the data describing many chairs.

A compression program for a corpus of chess games would probably include a board evaluation function.  The program could use the evaluation function to suggest the top five most likely moves.  This would allow the program to use less space to compress the moves by using a probability distribution (e.g. arithmetic encoding) over moves as a function of the “goodness” of each move.  Better moves would be more likely than worse moves.  Converting the positions to valuations helps the computer to compress by making it “understand” chess.  If the computer is compressing many chess games, it may even note that some human players were more aggressive players or stronger players.  By classifying the players by skill or behavior, the program achieves better compression.

Well that’s it for now.

Hopefully, sometime in the future, I will be describing how category theory can help us automate those six abstractions and then use the abstractions to generate features for learning algorithms.



(Code highlighting was done by  Thanks!)



« Older entries