Abstraction for Learning


So I have been learning a little category theory in the hopes that I could use it somehow to teach computers abstraction.

What is abstraction?  Carl and I have been talking about this for years.  Here are some ideas on what abstraction could be:

1a) An average model of a large group of similar things.  For example, the abstraction of a chair might be an idealized chair with a seat, a backrest, four legs, and possibly “stretchers” between the legs to give it strength and stability.

1b) A function in a computer program that is used in several places in the code.  In effect this function shortens the code.  Suppose the code has length L1 without the function and length L2 with the function, the function has length LC, each of the N pieces of code it replaces has length roughly LC, and each call takes only a few characters.  Then typically

L2 $\approx$ L1 – LC*N + LC + N.

The function is an abstraction of the pieces of code that it replaced, similar to the way an ideal chair is an abstraction of many chairs.  The function has the common characteristics of the code it replaced.  For example, the following code consists of about 160 characters of greetings

// C Code without function

#include <stdio.h>

int main(void)
{
   printf("Hello John\n");
   printf("Hello Harry\n");
   printf("Hello Rose\n");
   printf("Hello Carol\n");
   printf("Hello Mr. Thomas\n");
   printf("Hello Mrs. Jones\n");
   return 0;
}


and it could be shortened slightly to about 137 characters by


// C Code with function

#include <stdio.h>

void pr(char *s)
{
   printf("Hello %s\n", s);
}

int main(void)
{
   pr("John");
   pr("Harry");
   pr("Rose");
   pr("Carol");
   pr("Mr. Thomas");
   pr("Mrs. Jones");
   return 0;
}


Though this example is not in any way practical, it does illustrate how a function can capture commonality.  All of the pieces of code replaced by the pr function had a call to printf, the substring “Hello ”, and the newline “\n”.  The pr function was, in a sense, the idealized version of the six print statements in the original code.

In category theory, we could refer to the category of strings with arrows going from superstrings to substrings.  In this category, there would be an arrow from the line of code “printf("Hello John\n");” to the string “printf("Hello ”.  The categorical coproduct of several strings in this category, when it exists, is their longest common substring.  (Strictly speaking, “printf("Hello \n");” is a common subsequence rather than a substring of the six greeting lines, since the names intervene, but the idea is the same.)  So, category theory suggests that this common pattern should somehow be included in the abstraction of those six printf statements in the original code.

2)  A simplifying transformation that changes things into simpler things which are easier to work with.  For example, if you are counting sheep in a pen, you might convert groups of sheep into numbers and then add the numbers to get a total count for the entire pen.  We simplified the counting process by first converting groups of sheep into numbers.  Another example might be solving a corn maze by first flying over it, taking a picture, and then converting the “impassable” rows of corn into lines drawn on a piece of paper.  To solve the corn maze, you first convert it to a pen drawing on paper and then solve the pen drawn maze.

3)  Removing irrelevant information or detail.  If a piece in chess is attacking the opponent’s king, then the opponent must either a) move the king, b) put a piece between the attacker and the king, or c) capture the attacking piece.  If you want to list the possible moves for the opponent, you can often delete most of the pieces from the chess board because most of them do not affect the three options above.  In the diagram below, the white pawns, rook, and king and the black pawn on a7 do not affect the legal moves for black.

(Chess diagram: black in check)

In a corn maze, the type of corn, the height of the corn stalks, the temperature, or the type of soil do not affect the solution of the corn maze, so that information could be ignored.  Ignoring irrelevant information allows the brain to form a more abstract, simple model of the corn maze.

4) Variables in first order predicate logic.  “If person A is taller than person B and person B is taller than person C, then person A is taller than person C.”  In that statement, person A could be any one of the billions of people on earth.
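In first order notation, with a binary predicate Taller, the statement reads:

```latex
\forall A\,\forall B\,\forall C\;
\bigl(\mathrm{Taller}(A,B)\wedge\mathrm{Taller}(B,C)\bigr)
\rightarrow\mathrm{Taller}(A,C)
```

The variable A ranges over the whole domain of people, which is exactly what makes the statement abstract.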

5) Establishing classes of objects.  “Fruit tends to spoil quickly.”  That statement applies to a huge variety of organic objects.  Abstraction classes can even be put in a hierarchy like Animal ==> Mammal ==> Cat ==> Siamese Cat.

6) Statistical Patterns and Compression.  If a piece of code like zip.exe compresses a corpus of English text files, we could say that the piece of code formed an abstract model of the corpus.  A compressor from the Lempel–Ziv family (zip uses DEFLATE; 7-Zip uses the Lempel–Ziv–Markov chain algorithm) would notice that

  • the letters ‘e’ and ‘t’ appear frequently,
  • the letter ‘u’ usually follows ‘q’, and
  • the capital letter I frequently appears with spaces on either side.

These patterns are all mini-abstractions or aspects of English.


A very general notion of abstraction could be any function that shortens the compressed representation of data.  That function could be an abstraction of some part of the data.  If your data consisted of 3-dimensional models of chairs, then a compression program could use the fact that a chair usually has four legs, a seat, and a backrest.  A compression program might benefit from a function like the one shown below.

// Hypothetical sketch: each describe* helper appends its part of the
// description to the buffer and returns the buffer pointer.
char *DescribeChair(double *featureVector)
{
   static char outputBuffer[1024];
   char *description;

   description = describeLegs(outputBuffer, featureVector);
   description = describeSeatOnTopOfLegs(description, featureVector);
   description = describeBackRestOnTopOfSeat(description, featureVector);

   return description;
}


The abstract version of a chair would help compress the data describing many chairs.

A compression program for a corpus of chess games would probably include a board evaluation function.  The program could use the evaluation function to suggest the top five most likely moves.  This would allow the program to compress the moves into less space by using a probability distribution over moves (e.g. with arithmetic coding) as a function of the “goodness” of each move.  Better moves would be more likely than worse moves.  Converting the positions to valuations helps the computer to compress by making it “understand” chess.  If the computer is compressing many chess games, it may even note that some human players were more aggressive or stronger than others.  By classifying the players by skill or behavior, the program achieves better compression.

Well that’s it for now.

Hopefully, sometime in the future, I will be describing how category theory can help us automate those six abstractions and then use the abstractions to generate features for learning algorithms.

(Code highlighting was done by http://hilite.me/.  Thanks!)

I’m quite excited by the Nuit Blanche post on the papers “Structure Discovery in Nonparametric Regression through Compositional Kernel Search” (Duvenaud, Lloyd, Grosse, Tenenbaum, and Ghahramani 2013) and “Exploiting compositionality to explore a large space of model structures” (Grosse, Salakhutdinov, Freeman, and Tenenbaum 2012).  For years my old company Momentum Investment Services, Carl, and I have been looking for fast, systematic ways to search large hypothesis spaces.  We considered context-free grammars as a means of generating hypotheses.  Carl and I did not get anywhere with that research, but now it seems that others have succeeded.  Be sure to look over the article, the blog posts, and the comments.

I really enjoyed Jeremy Kun’s article on Information Distance at Math ∩ Programming.

Srivastava and Schmidhuber have recently written about their first experiences with Power Play.  Power Play is a self-modifying general problem solver that uses algorithmic complexity theory and ideas similar to Levin Search and AIQ/AIXI (see [1], [2], and [3]) to find novel solutions to problems and invent new “interesting” problems.  I was more interested in Hutter‘s approximations to AIXI, mainly because it was easy to understand the results when they were applied to simple games (1d-maze, Cheese Maze, Tiger, TicTacToe, Biased Rock–Paper–Scissors, Kuhn Poker, and Partially Observable Pacman), but I look forward to future papers on Power Play.


In “Knowledge-Based General Game Playing”, Haufe, Michulke, Schiffel, and Thielscher (2011) discuss their general game playing program FLUXPLAYER.  They summarize their article as

“This article gives an overview of five essential methods for knowledge-based General Game Playing:
1. We show how to automatically generate evaluation functions for non-terminal positions.
2. We demonstrate the application of machine learning techniques to improve these evaluation functions.
3. We illustrate how a system can automatically derive new properties of games using automated theorem proving.
4. We present two specific techniques that help to reduce the complexity of the search space by mimicking the ability of humans to
(a) detect symmetries in games and
(b) identify subgames in composite games.”

FLUXPLAYER creates a fuzzy logic position evaluation function based on the rules “for terminal and goal states”, persistent markers (such as pieces), a “logic-to-neural-network transformation”, simulations, temporal difference learning, symmetry detection, factoring (“Given its mere rules, how can a previously unknown game be automatically decomposed into independent subgames?”), automated theorem proving, and the $C\text{-}IL^2P$ algorithm.

Suppose some Intelligent Algorithm can learn to play tic-tac-toe. Tic-tac-toe is such a simple game that it can be exhaustively learned via memorization – but this is not what I mean here. I mean that the intelligent algorithm (IA) actually has the ability to generally learn, and that it has used this ability to learn tic-tac-toe. What would one like the IA to now know? Well, in addition to knowing the rules and optimal strategy of the game, the IA should also have figured out that tic-tac-toe is played on a planar grid, and that the winning positions correspond to horizontal, vertical, and diagonal lines on this grid.

Now after it has learned 3 x 3 tic-tac-toe, suppose that we want our IA to also learn to play 4 x 4 tic-tac-toe (perhaps this game should be called tic-toc-toe-tum). The IA should be able to use what it has learned in the 3 x 3 game to more easily learn the 4 x 4 game, and it should explicitly “understand” this relationship. Concretely, what does this mean?

I believe this is one of a handful of fundamental open problems in machine learning.

So, I am a bit disappointed that there do not seem to be any great methods for automated proving of theorems for games.  The best seem to be “Automated Theorem Proving for General Game Playing” (Schiffel & Thielscher 2009) and “A Temporal Proof System for General Game Playing” (Thielscher & Voigt 2010).  In the first paper, their program (which uses a sophisticated “answer set programming” theorem prover) proves theorems like:

“• proving that the argument of the control(P) feature is unique in every state (provided the feature exists);
• proving that the contents of a board cell is unique (to which end boards were identified manually);
• proving that the contents of a board cell is unique given the information that the argument of control(P) is;
• proving that the game is zero-sum;
• systematic search for unique arguments as described at the end of the preceding section.”

It is great that the authors combine automated theorem provers with game playing, but I am still somewhat disappointed because most people who play a game can easily read the rules and almost immediately answer questions about the game like “Is it zero-sum?” or “Do the squares only have one piece?”.  I was hoping for deeper theorems.  It is also disappointing that one of our best automated theorem proving algorithms cannot even prove these simple theorems for many games.

On the other hand, maybe we should only prove simple properties like this. When playing chess about the only theorems I ever apply are:  1) when in check, there are only three possible options – move the king, interpose a piece, or capture the attacking piece, and 2) if a piece is blocking an attack on its king, it cannot move (a pin). Really, I don’t think I use any theorems deeper than this while playing chess and those two theorems are not at all useful to a computer because the computer can easily list all the possible moves.

In other games, I guess I do use theorems like “if I lead the third highest card in bridge, a finesse will succeed if the player on my left has the second highest card and my partner has the highest card (assuming no trump)” or “if my team holds 11 of the hearts, then the other team’s hearts will split 1-1 52% of the time.”

If computers are going to reason about games, then we probably need fuzzy logic (as in this paper) and/or probability theory.

In the paper “Automated Theorem Proving for General Game Playing” by Stephan Schiffel and Michael Thielscher, the authors apply answer set programming (similar to Prolog) to improve performance in general game playing.  They point out that it is important for the AI to classify what type of game it is playing and to apply appropriate techniques for that class of game.  For example, null moves, upper confidence trees (UCT), alpha-beta search, Bellman equations (dynamic programming), and temporal difference learning would all be appropriate for two-player, zero-sum, turn-taking games.  Also it is important to recognize concepts “like boards, pieces, and mobility.”  Apparently, van der Hoek pioneered using automated theorem proving to extract appropriate concepts from the game rules.  Schiffel and Thielscher extend the theorem proving to “automatically prove properties that hold across all legal positions.”

(Based on the discussion between Carl and me on Saturday, July 7.)

It seems that in order to achieve Strong AI we need to combine Machine Learning and Good Old-Fashioned Artificial Intelligence.  For those not familiar with GOFAI, it is mostly about symbol manipulation and includes subjects like computer algebra, theorem proving, first order predicate calculus, lambda calculus, Turing machines, alpha-beta pruning, context-free grammars, depth/breadth-first search, branch and bound, semantic networks, expert systems, Prolog, Lisp, resolution theorem proving, and Gödel’s incompleteness theorem.  I wonder if we also have to add less formal, plausible reasoning systems that guess at theorems which are only “mostly true” (true only for points in general position, outside of a set of measure zero) or have a high “probability of being true”.  Such reasoning techniques might be derived from Bayesian networks or Alternative Hypothesis Spaces (Griffin 2009) (http://bmir.stanford.edu/file_asset/index.php/1565/BMIR-2010-1375.pdf).

Maybe we can invent a non-discrete Turing machine.  Say we replace the finite state machine which reads and prints on the tape with a neural net.  We might also print real numbers between 0 and 1 on the tape instead of printing only zeros and ones.  Of course, this continuous Turing machine would not be better at classifying binary strings than a traditional discrete Turing machine; rather, the hope is that such a machine might be easier to optimize using general optimization techniques similar to gradient descent.  My personal feeling is that this particular method will not work, but something like it might.

We need something that captures the most fundamental ideas, like the idea of a Cartesian plane.  The idea of a Cartesian product arises naturally from category theory.  If one wants to formulate rules of thumb for tic-tac-toe, the 3 by 3 grid is an essential idea that is hard for a neural net to learn without prompting.  Maybe we can use category theory to guide our learning algorithms (http://golem.ph.utexas.edu/category/2007/09/category_theory_in_machine_lea.html) to the correct abstractions.

What is an abstraction anyway?  In chess, if a king is in check, there are only three types of possible moves: move the king, interpose a piece, or capture the piece attacking the king.  So when we consider responses to a check, we only need consider those three types of moves.  The fact that only these three types of moves exist is a theorem.  This theorem is like a subroutine that speeds up our processing in a chess game.  So an abstraction could be like a helpful subroutine.  Also, this theorem allows us to disregard pieces that are not “near” the king or the attacking piece.  In this regard, the theorem allows us to remove information from the game board before calculating the possible counter-moves; in that sense, the theorem is like a projection.  The matrix

$\begin{pmatrix} 1 & 0 \\ 0 & 0 \end{pmatrix}$

sets the y coordinate to zero, so it destroys information. So some abstractions can be viewed as helpful projections that remove irrelevant information—just like manifold learning or dimension reduction. Another way to remove information is to disregard the particulars of the situation and boil it down to the essentials. Group theory was invented this way.

Group theory is important in another regard: examples and models. Suppose that a thinking machine was trying to prove conjectures about group theory. Before trying to prove the conjecture, the machine could check some simple finite symmetric groups, a few finite cyclic groups, and maybe addition on real numbers to see if the conjecture was true for those examples. If it was true, then the prover could try to prove the conjecture. If it was false, then the computer would not waste time trying to prove the conjecture.

Another useful idea is that classifiers or theorems could have natural topologies (http://philosophy.berkeley.edu/file/698/FractalCompletenessTechniques.pdf).  It seems that it is hard for neural nets to learn parity because changing any one bit in the input changes the class from even to odd or vice versa.  Nearest neighbor with k=1 would be worse than useless for such a problem.  Topology could offer further insights for machine learning, such as what is learnable.  If the fractal dimension of a class is high (or its boundary has a high fractal dimension), we would imagine that the concept is harder to learn.

For example, if a classifier had to classify the points in a circle and it used nearest neighbor with k=1, then the number of random samples needed to achieve an error of $\epsilon$ is order $(1/\epsilon)^2$.  The number of points needed to classify the area inside a Koch snowflake (http://www.wolframalpha.com/input/?i=Koch+snowflake&lk=3) is order $(1/\epsilon)^{2/(2-k)}$ where k is the fractal dimension of the border of the Koch snowflake.