So I played the game 2048 and thought it would be fun to create an AI for it.  I wrote my code in Mathematica because it is a fun, terse language with easy access to numerical libraries, and I was thinking that some simple linear or logistic regression ought to give me some easy-to-code strategies.

The game itself is pretty easy to code in Mathematica.  The game is played on a 4×4 grid of tiles.  Each tile in the grid can contain either a blank (I will hereafter use a _ to denote a blank) or one of the numbers 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024, or 2048.  On every turn, you can shift the board right, left, up, or down. When you shift the board, any adjacent matching numbers will combine.

The object of the game is to get one 2048 tile, which usually requires around a thousand turns.

Here’s an example.  Suppose you have the board below.

start

 

If you shift right, the top row will become _, _, 4, 16, because the two twos will combine.  Nothing else will combine.
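For concreteness, here is what that shift-and-merge rule looks like in code.  My implementation was in Mathematica, but this Python sketch captures the same logic; the starting top row _, 2, 2, 16 is an assumption inferred from the result described above, since the board image isn't reproduced here.

```python
def shift_row_right(row):
    """Slide a row's tiles to the right, merging adjacent equal pairs once.
    Blanks are represented by 0."""
    tiles = [t for t in row if t != 0]       # drop the blanks
    merged, i = [], len(tiles) - 1
    while i >= 0:                            # scan from the right edge
        if i > 0 and tiles[i] == tiles[i - 1]:
            merged.append(tiles[i] * 2)      # equal neighbors combine
            i -= 2                           # each tile merges at most once
        else:
            merged.append(tiles[i])
            i -= 1
    merged.reverse()
    return [0] * (len(row) - len(merged)) + merged

# The (assumed) top row _, 2, 2, 16 shifted right:
print(shift_row_right([0, 2, 2, 16]))  # [0, 0, 4, 16]
```

Scanning from the right edge makes the rightmost pair merge first, which matters for rows like 2, 2, 2 where only one merge should happen.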

movEr

 

Shifting left gives the same result as shifting right except all of the tiles are on the left side instead of the right.

Shift Left

If we look back at the original image (below), we can see that there are two sixteen tiles stacked vertically.

original

So the original board (above) shifted down combines the two sixteens into a thirty-two (below), setting up some possible follow-up moves like combining the twos on the bottom row or combining the two thirty-twos into a sixty-four.

Shifted down

There are a few other game features that make things more difficult.  After every move, if the board changes, then a two or four tile randomly fills an empty space. So, the sum of the tiles increases by two or four every move.
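That spawn rule can be sketched in a few lines of Python (again, my code was in Mathematica).  The 90/10 split between 2s and 4s matches the original game, but the text above only says "a two or four," so treat that ratio as an assumption.

```python
import random

def spawn_tile(board):
    """Place a 2 (or occasionally a 4) on a random empty square, in place.
    The 90/10 ratio is assumed; the post only says a 2 or a 4 appears."""
    empties = [(r, c) for r in range(4) for c in range(4) if board[r][c] == 0]
    if not empties:
        return board                     # no space left; board unchanged
    r, c = random.choice(empties)
    board[r][c] = 2 if random.random() < 0.9 else 4
    return board

board = [[0] * 4 for _ in range(4)]
spawn_tile(board)                        # exactly one cell now holds a 2 or 4
```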

 

Random Play

The first strategy I tried was random play.  A typical random-play board will look something like this after 50 moves.

rand50

 

If you play for a while, you will see that this position is a little difficult because the high numbers are not clumped, so they might be hard to combine.  A typical random game ends in a position like this one.

randEnd

In the board above, the random player is in deep trouble.  His large tiles are near the center, the 128 is not beside the 64, there are two 32 tiles separated by the 64, and the board is almost filled.  The player can only move up or down.  In this game, the random player incorrectly moved up, combining the two twos in the lower right, and the new two tile appeared in the lower-right corner square, ending the game.

I ran 1000 random games and found that the random player ends the game with an average tile total of 252.  Here is a histogram of the resulting tile totals at the end of all 1000 games.

histRand

(The random player’s tile total was between 200 and 220 in 127 of the games, and between 400 and 420 in 13 of them.)

The random player’s largest tile was 256 5% of the time, 128 44% of the time, 64 42% of the time, 32 8% of the time, and 16 0.5% of the time.  It seems almost impossible to have a max of 16 at the end.  Here is how the random player achieved this ignominious result.

randEnd16
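The random-play experiment can be reproduced with a short self-contained sketch.  This is Python rather than my Mathematica code, and the 90/10 split for spawning 2s versus 4s follows the original game and is an assumption not stated above.

```python
import random

def shift_row_right(row):
    """Slide one row's tiles right, merging adjacent equal pairs once."""
    tiles = [t for t in row if t]
    out, i = [], len(tiles) - 1
    while i >= 0:
        if i > 0 and tiles[i] == tiles[i - 1]:
            out.append(tiles[i] * 2)
            i -= 2
        else:
            out.append(tiles[i])
            i -= 1
    out.reverse()
    return [0] * (len(row) - len(out)) + out

def move(board, direction):
    """Return a new board shifted 'left', 'right', 'up', or 'down'."""
    flip = lambda b: [r[::-1] for r in b]
    transpose = lambda b: [list(r) for r in zip(*b)]
    if direction == 'right':
        return [shift_row_right(r) for r in board]
    if direction == 'left':
        return flip([shift_row_right(r) for r in flip(board)])
    if direction == 'down':
        return transpose(move(transpose(board), 'right'))
    if direction == 'up':
        return transpose(move(transpose(board), 'left'))

def spawn(board):
    """Drop a 2 (or, less often, a 4) on a random empty square, in place."""
    empties = [(r, c) for r in range(4) for c in range(4) if board[r][c] == 0]
    r, c = random.choice(empties)
    board[r][c] = 2 if random.random() < 0.9 else 4

def random_game():
    """Play one game picking uniformly among legal moves; return the tile total."""
    board = [[0] * 4 for _ in range(4)]
    spawn(board)
    spawn(board)                         # the game starts with two random tiles
    while True:
        legal = [d for d in ('left', 'right', 'up', 'down')
                 if move(board, d) != board]
        if not legal:                    # board full, nothing combines: game over
            return sum(t for row in board for t in row)
        board = move(board, random.choice(legal))
        spawn(board)
```

Averaging `random_game()` over 1000 runs is how a tile-total figure like the 252 above would be estimated.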

The Greedy Strategy

So what’s a simple improvement?  About the simplest possible improvement is to try to score the most points every turn—the greedy strategy.  You get the most points by making the largest tile possible.  In the board below, the player has the option of going left or right to combine the two 32 tiles, or moving up or down to combine the two 2 tiles in the lower left.

combine2048

combine2048Opt

The greedy strategy would randomly choose between left and right, combining the two 32 tiles.  The greedy action avoids splitting the 32 tiles.
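A greedy chooser can be sketched by scoring each legal move by the largest tile it creates and breaking ties at random.  This is a Python sketch, not my Mathematica code, and the test board below (two 32s side by side, two 2s stacked in the lower left) is an assumed layout matching the description above.

```python
import random

def slide_right_scored(row):
    """Slide one row right; also return the largest tile created by a merge."""
    tiles = [t for t in row if t]
    out, best, i = [], 0, len(tiles) - 1
    while i >= 0:
        if i > 0 and tiles[i] == tiles[i - 1]:
            out.append(tiles[i] * 2)
            best = max(best, tiles[i] * 2)   # track the biggest new tile
            i -= 2
        else:
            out.append(tiles[i])
            i -= 1
    out.reverse()
    return [0] * (len(row) - len(out)) + out, best

def move_and_score(board, direction):
    """Shifted board plus the largest tile that the shift created."""
    flip = lambda b: [r[::-1] for r in b]
    transpose = lambda b: [list(r) for r in zip(*b)]
    if direction == 'right':
        pairs = [slide_right_scored(r) for r in board]
        return [p[0] for p in pairs], max(p[1] for p in pairs)
    if direction == 'left':
        new, s = move_and_score(flip(board), 'right')
        return flip(new), s
    if direction == 'up':
        new, s = move_and_score(transpose(board), 'left')
        return transpose(new), s
    if direction == 'down':
        new, s = move_and_score(transpose(board), 'right')
        return transpose(new), s

def greedy_move(board):
    """Among board-changing moves, pick one creating the largest tile."""
    options = []
    for d in ('left', 'right', 'up', 'down'):
        new, score = move_and_score(board, d)
        if new != board:                 # only legal (board-changing) moves
            options.append((score, d, new))
    best = max(s for s, _, _ in options)  # assumes at least one legal move
    return random.choice([(d, n) for s, d, n in options if s == best])

board = [[0, 0, 0, 0],
         [32, 32, 0, 0],
         [2, 0, 0, 0],
         [2, 0, 0, 0]]
direction, new_board = greedy_move(board)
print(direction)  # 'left' or 'right': the 32+32=64 merge beats 2+2=4
```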

I ran 1000 greedy player games.  The average tile total at the end of a greedy player simulation was 399—almost a 60% improvement over random play. Here is a histogram of the greedy player’s results.

GreedyResults

If you look closely, you will see that there is a tiny bump just above 1000.  In all 1000 games, the greedy player did manage to get a total above 1000 once, but never got the elusive 1024 tile.  He got the 512 tile 2% of the time, the 256 tile 43% of the time, the 128 tile 44%, the 64 tile 11%, and in three games his maximum tile was 32.

I ran a number of other simple strategies before moving on to the complex look-ahead strategies like UCT.  More on that in my next post.

 

  1. Enjoying John Baez’s blog Azimuth.  Especially the posts on good research practices and an older post on levels of mathematical understanding.
  2. García-Pérez, Serrano, and Boguñá wrote a cool paper on primes, probability, and integers as a bipartite network.
  3. Loved the idea behind the game theoretical book “Survival of the Nicest”  (see Yes Magazine for a two page introduction).
  4. Scott Young is learning Chinese quickly.
  5. Cyber warriors to the rescue.
  6. Mao, Fluxx, and Douglas Hofstadter’s Nomic are fun games.
  7. Healy and Caudell are applying category theory to semantic and neural networks.
  8. Some MOOCs for data science and machine learning.
  9. Here is an old but good free online course on the Computational Complexity of Machine Learning.
  10. Great TeX graphics.
  11. Watch this Ted Video to learn anything in 20 hours (YMMV).
  12. Where are all the Steeler fans?  Cowboy fans?  ….
  13. Productivity Hints.
  14. Copper + Magnets = Fun
  15. Stray dogs on the subway.
  16. Deep learning on NPR.
  17. Happy 40th birthday D&D
  18. Google is applying deep learning to images of house numbers
  19. Deep learning in your browser.
  20. How to write a great research paper.
  21. Do Deep Nets Really Need to be Deep?
  22. A variation on neural net dropout.
  23. Provable algorithms for Machine Learning
  24. 100 Numpy Exercises
  25. Learn and Practice Applied Machine Learning | Machine Learning Mastery

 

What would you do with 2880 cores?

Vasik Rajlich, author of the legendary chess program Rybka, used them to show that chess champion Bobby Fischer was right about the chess opening King’s Gambit (see “A bust for King’s Gambit“).

 

King’s Gambit

 

Vasik proved that King’s Gambit is a loss for white unless white plays 3. Be2 after black takes the pawn.  (In that case, it’s a draw.)  For the details, click on the link below.

http://en.chessbase.com/post/rajlich-busting-the-king-s-gambit-this-time-for-sure

In September 2008, I met John Dobson in Harrisburg.  We slept overnight at his friend Zoungy’s place and then enjoyed the beautiful drive to Cherry Springs for the Black Forest Star Party.  John was perhaps the most entertaining, charming, and inspiring person I have ever met.  He had a million stories about astronomy, philosophy, science, and life in general.  At age 93, he was recovering from a stroke, meeting new people, travelling alone with strangers like me, and giving entertaining, informative speeches to crowds numbering in the hundreds.  He would charm everyone he met.

John — rest in peace.

http://www.skyandtelescope.com/news/home/John-Dobson-1915ndash2014-240456881.html

http://www.universetoday.com/108150/john-dobson-inventor-of-the-popular-dobsonian-telescope-dead-at-98/

http://www.astronomy.com/news/2014/01/astronomy-popularizer-john-dobson-dies

 

polarplot2

Equation from Teacher’s Choice

So I have been wanting to understand Category Theory (see [1], [2], [3]) mainly because I thought it would help me understand advanced functional programming in Haskell and because of the book “Physics, topology, logic and computation: a Rosetta Stone” by Baez and Stay (2011).  A number of the constructs in Haskell were derived from Category Theory so that was enough motivation for me to learn it.  But also, monoidal categories are useful in several fields: physics, computer science, pure mathematics, and statistics. The biggest application for me is the idea of generalizing probabilistic graphical models through either Ologs or monoidal categories / tensor networks (see Brendan Fong’s thesis “Causal Theories: A Categorical Perspective on Bayesian Networks“).

To start my investigation of Category Theory, I began with the $20 thin book “Basic Category Theory for Computer Scientists” by Benjamin Pierce (see also the free online version “A taste of category theory for computer scientists” (1988)).

I really enjoyed the book.  Dr. Pierce’s style is a little informal compared to pure math books like Mac Lane’s “Categories for the Working Mathematician” (pdf / Amazon), but I enjoy that more relaxed style of writing when I am first learning a field.

The biggest obstacle to learning category theory is the fact that category theory generalizes a lot of areas of pure mathematics like topology, abstract algebra, and geometry.  It’s hard to generalize before you have examples to generalize from, but the examples being generalized in category theory are mostly from higher-level mathematics found in senior-level undergraduate and graduate courses. Pierce ameliorates this problem by introducing some of the most basic categories first: sets, ordered sets, partially ordered sets, groups, monoids, vector spaces, measure spaces, topological spaces, proofs, and a simple functional computer language.  He takes the time to explicitly define most of these ideas, so, in theory, you could read this book without a background in theoretical mathematics, but it would be hard.

After defining categories and introducing the most basic categories, Pierce describes and defines the most basic ideas in category theory: subcategories, commutative diagrams, monomorphisms, epimorphisms, isomorphisms, initial/terminal objects, products, coproducts, universal constructions, equalizers, pullbacks, pushouts, limits, cones, colimits, cocones, exponentiation, and closed Cartesian categories. These ideas are spelled out over the thirty pages of chapter one including illuminating homework exercises. The homework exercises varied significantly in difficulty.  Many of the exercises were trivial and there are two or three that I am still working on despite investing several hours of thought.  Generally, I found the exercises to be a bit harder than those in Mac Lane’s book, but Pierce’s book required less of a background in mathematics.  A couple of the exercises were incorrectly stated or impossible.

Chapter two introduced functors, natural transformations, adjoints, and F-algebras. After reading this chapter, I was finally able to understand the definition of monads which are an important part of the computer language Haskell!  Pierce provides many examples of each of these ideas and enjoyable homework exercises to increase understanding.  Pierce’s definition of adjoints is much easier to understand than the standard definitions using counit adjunction or Hom Sets.

The last major chapter concerns applications of category theory to computer science–specifically lambda-calculus and programming language design.  

The first two chapters of the book give a reasonable condensed introduction to category theory for students that have taken a course in abstract algebra.  A course in topology or linear algebra would be another useful prerequisite.  I carried around the light 100-page book for a few months so that I could learn something whenever I had some extra time.  I had hoped that once I had proven that several functors were monads, I would then really understand monads, but a full understanding still eludes me.  Similarly, I have proven that several functor pairs are adjoint, but I still don’t feel as though I understand adjoint functors.  I guess more work and learning are needed.

Have a great year.  –  Cheers, Hein

 

Check out Markus Beissinger’s blog post “Deep Learning 101”.  Markus reviews a lot of deep learning basics derived from the papers “Representation Learning: A Review and New Perspectives” (Bengio, Courville, and Vincent 2012) and “Deep Learning of Representations: Looking Forward” (Bengio 2013). Beissinger covers the following topics:

  • An easy intro to Deep Learning
  • The Current State of Deep Learning
  • Probabilistic Graphical Models
  • Principal Component Analysis
  • Restricted Boltzmann Machines
  • Auto-Encoders
  • “Challenges Looking Ahead”

This is a great intro and I highly recommend it.

If you want more information, check out Ng’s lecture notes, Honglak Lee’s 2010 NIPS slides, and Hinton’s videos ([2009] [2013]).

Zygmunt Zając at fastml.com has a great post titled “13 NIPS Papers that caught our eye.” Zając provides a short, readable summary of his favorite NIPS 2013 papers.  (NIPS 2013 just ended last week.) The papers are:

  1. Understanding Dropout  by Baldi and Sadowski
  2. Training and Analysing Deep Recurrent Neural Networks by Hermans and Schrauwen
  3. RNADE: The real-valued neural autoregressive density-estimator by Uria, Murray, and Larochelle
  4. Predicting Parameters in Deep Learning by Denil, Shakibi, Dinh, Ranzato, and de Freitas
  5. Pass-efficient unsupervised feature selection by Maung and Schweitzer
  6. Multi-Prediction Deep Boltzmann Machines by Goodfellow, Mirza, Courville, and Bengio
  7. Memoized Online Variational Inference for Dirichlet Process Mixture Models by Hughes and Sudderth
  8. Learning word embeddings efficiently with noise-contrastive estimation by Mnih and Kavukcuoglu
  9. Learning Stochastic Feedforward Neural Networks by Tang and Salakhutdinov
  10. Distributed Representations of Words and Phrases and their Compositionality by Mikolov, Sutskever, Chen, Corrado, and Dean
  11. Correlated random features for fast semi-supervised learning by McWilliams, Balduzzi, and Buhmann
  12. Convex Two-Layer Modeling by Aslan, Cheng, Zhang, and Schuurmans, and
  13. Approximate inference in latent Gaussian-Markov models from continuous time observations by Cseke, Opper, and Sanguinetti

I suggest reading Zając’s summaries before diving in.

Reddit has this pretty cool subreddit ELI5 = “Explain like I’m 5” where people try to explain topics as if they are talking to a five-year-old.  It’s an interesting exercise to see how much of a difficult topic like nuclear energy or solar physics you could explain to a five-year-old.  Here is my attempt to explain entropy and mutual information to a bright five-year-old who isn’t afraid of big words or long sentences.

(This post is going to be amazingly long and very simplistic, so I suggest you don’t bother reading it unless you think it might be fun to consider how to explain a complex concept like mutual information to a five-year-old.)

 

Random Variables

In order to explain (information theoretic) entropy we need to talk about random variables.  A random variable is a number that you get from a repeatable experiment.  For example, if you roll a die, the outcome is 1, 2, 3, 4, 5, or 6.  We say that the result of each die roll is a random variable.

 

Entropy

So random variables have something called entropy.  In order to talk about entropy it’s helpful to first talk about the doubling numbers.  The doubling numbers are 1, 2, 4, 8, 16, ….  I call them doubling numbers because 1+1 = 2, 2+2 =4, 4+4 =8, and so on.  (The official name of doubling numbers is “the powers of two” but let’s just call them doubling numbers for now.)

 

BITS

If a random variable has two equally likely outcomes, we say that the variable has an entropy of 1 bit.  If it has four equally likely outcomes, then the variable has 2 bits of entropy; eight outcomes give 3 bits, and 16 outcomes give 4 bits.  Of course, some random variables have, say, 6 outcomes, and 6 is not a doubling number, so we could say that such a variable has about two and a half bits because 6 outcomes is halfway between 4 outcomes (2 bits) and 8 outcomes (3 bits).
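For grown-ups: the number of bits for n equally likely outcomes is the base-2 logarithm of n, which is exactly the "halfway between doubling numbers" idea above.  A quick check in Python:

```python
import math

# Entropy (in bits) of n equally likely outcomes is log2(n)
for n in (2, 4, 6, 8, 16):
    print(n, "outcomes ->", round(math.log2(n), 2), "bits")
# 6 outcomes come out to about 2.58 bits, a little more than "two and a half"
```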

How did we get the name bits?  Well, in a computer bits are little electronic parts that are either on or off.  We use the code 0 for off and 1 for on.  (This code is called the binary code.)  We could represent the result of an experiment (a random variable) with these bits.  Suppose that our experiment is flipping a coin and we call tails 1 and heads 0.  Then we would need one bit to record the results of each coin flip.

 

Three bits for an eight sided DIE

The number of bits needed to record the roll of an eight-sided die would be larger. It can be done with three bits as follows.  If you roll a 1, set the bits to off-off-on (001).  If you roll a 2, set the bits to 010.  For a 3, set them to 011; 4 -> 100, 5 -> 101, 6 -> 110, 7 -> 111, and 8 -> 000.  Notice that we were able to represent every possible outcome with only 3 bits.  That’s why we say the entropy of an eight-sided die roll is three bits.
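The same encoding can be written down in one line of Python; the wrap of 8 around to 000 follows the table above.

```python
# Three-bit codes for an eight-sided die, with 8 wrapping around to 000
codes = {roll: format(roll % 8, '03b') for roll in range(1, 9)}
print(codes[1], codes[2], codes[3], codes[8])  # 001 010 011 000
```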

 

A multi-stage experiment

More complicated experiments would require some more difficult computations. Sometimes, the outcomes are not all equally likely.  Suppose our experiment has two stages.  In stage one we flip a nickel.  If the result is tails, we quit.  If the result is heads, we flip a quarter.  There are three possible results for this experiment:

  1. The nickel is tails
  2. The nickel is heads and the quarter is tails.
  3. The nickel is heads and the quarter is heads.

The nickel is tails about half the time and heads half the time, so we only have to record the quarter half of the time.  The average number of bits needed to record this experiment is one and a half: one bit for the nickel and one half bit for the quarter, because we only need to write down the quarter flip outcome (1 bit) half of the time.
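The same number drops out of the standard entropy formula (a step beyond the five-year-old version): entropy is the sum of −p·log2(p) over the outcomes.

```python
import math

# Outcomes: tails (p = 1/2), heads-then-tails (1/4), heads-then-heads (1/4)
probs = [0.5, 0.25, 0.25]
entropy = -sum(p * math.log2(p) for p in probs)
print(entropy)  # 1.5 bits, matching "one and a half" above
```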

 

MUTUAL INFORMATiON

Sometimes random variables share bits.  When that happens, we say that they have mutual information. Most random variables do not share information. In that case, we say the variables are independent and the mutual information (shared information) is zero.

 

A COMPLICATED EXAMPLE

Suppose that Harry and Sally are doing an experiment.  They flip a penny, a nickel, a dime, and a quarter.  After the experiment, Harry writes down the results of the penny flip and the nickel flip.  Sally writes down the results of the nickel, the dime, and the quarter.  There are four possible outcomes for Harry: Head-Head, Head-Tail, Tail-Head, and Tail-Tail.  We say that Harry’s recording uses 2 bits, so his entropy is 2 bits.  Sally could have 8 different results, but she would only need 3 bits, one for each coin, to record her results, so her entropy is 3 bits.  The mutual information between Harry and Sally is the nickel flip, which is just 1 bit (1 for heads, 0 for tails).  The overall entropy of Harry and Sally together is 4 bits because there are four coins.  The joint entropy of Harry and Sally together is the number of bits needed to record both Harry’s result and Sally’s result.  Harry uses 2 bits and Sally uses 3 bits, but only 4 bits are needed because the nickel is shared.

 

THE FORMULA

There is a general formula for the mutual information between Harry and Sally.

Mutual Information = Harry’s Entropy + Sally’s Entropy – The Joint Entropy.

For this experiment the mutual information is 2 + 3 – 4, which is 1 bit.
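The whole Harry-and-Sally calculation can be checked by brute force over the 16 equally likely flip combinations.  This Python sketch computes each entropy from outcome counts and then applies the formula above.

```python
import math
from itertools import product
from collections import Counter

# All 16 equally likely flips of (penny, nickel, dime, quarter)
flips = list(product('HT', repeat=4))

def entropy(outcomes):
    """Entropy in bits of an equally weighted list of outcomes."""
    counts = Counter(outcomes)
    n = len(outcomes)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

harry = [(p, n) for p, n, d, q in flips]       # penny and nickel
sally = [(n, d, q) for p, n, d, q in flips]    # nickel, dime, and quarter
joint = list(zip(harry, sally))                # both records together

mi = entropy(harry) + entropy(sally) - entropy(joint)
print(entropy(harry), entropy(sally), entropy(joint), mi)  # 2.0 3.0 4.0 1.0
```

The shared nickel is what keeps the joint entropy at 4 bits instead of 2 + 3 = 5, and that 1-bit gap is exactly the mutual information.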

 

 End Note

Well, that’s it.  Maybe I gave an explanation appropriate for twelve-year-olds. Maybe not.  I had my eighth-grade son read it and then I gave him a six-question quiz. After I corrected two of his answers, he was able to calculate the mutual information of two fairly simple random variables.

 
