Eli5 Entropy and Mutual Information

Reddit has this pretty cool subreddit ELI5 = “Explain like I’m 5″ where people try to explain topics as if they are talking to a five year old.  It’s an interesting exercise to see how much of a difficult topic like nuclear energy or solar physics you could explain to a five year old.  Here is my attempt to explain Entropy and Mutual Information to a bright five year old that isn’t afraid of big words or long sentences.

(This post is going to be amazingly long and very simplistic, so I suggest you don’t bother reading it unless you think it might be fun to consider how to explain a complex concept like mutual information to a five year old.)


Random Variables

In order to explain (information theoretic) entropy we need to talk about random variables.  A random variable is a number that you get from a repeatable experiment.  For example, if you roll a die, the outcome is 1, 2, 3, 4, 5, or 6.  We say that the result of each die roll is a random variable.



So random variables have something called entropy.  In order to talk about entropy it’s helpful to first talk about the doubling numbers.  The doubling numbers are 1, 2, 4, 8, 16, ….  I call them doubling numbers because 1+1 = 2, 2+2 =4, 4+4 =8, and so on.  (The official name of doubling numbers is “the powers of two” but let’s just call them doubling numbers for now.)



If a random variable has two equally likely outcomes, we say that the variable has an entropy of 1 bit.  If it has four equally likely outcomes, then the variable has 2 bits, eight 3 bits, and 16 outcome gives 4 bits.  Of course, some random variables have like 6 outcomes and that is not a doubling number, so we could say that it has about two and a half bits because 6 outcomes is half way between 4 outcomes which is 2 bits and 8 outcomes which is 3 bits.

How did we get the name bits?  Well, in a computer bits are little electronic parts that are either on or off.  We use the code 0 for off and 1 for on.  (This code is called the binary code.)  We could represent the result of an experiment (a random variable) with these bits.  Suppose that our experiment is flipping a coin and we call tails 1 and heads 0.  Then we would need one bit to record the results of each coin flip.


Three bits for an eight sided DIE

The number of bits needed to record the roll of an eight sided die would be larger. It can be done with three bits as follows.   If you roll a 1, set the bits to off-off-on (001).  If you roll a two, set the bits to 010.  A 3 set them to 011, 4 ->  100, 5 -> 101, 6 -> 110, 7 -> 111, and 8 -> 000.  Notice that we were able to represent every possible outcome with only 3 bits.  That’s why we say the entropy of an eight sided die roll is three bits.


A multi-stage experiment

More complicated experiments would require some more difficult computations. Sometimes, the outcomes are not all equally likely.  Suppose our experiment has two stages.  In stage one we flip a nickel.  If the result is tails, we quit.  If the result is heads, we flip a quarter.  There are three possible results for this experiment:

  1. The nickel is tails
  2. The nickel is heads and the quarter is tails.
  3. The nickel is heads and the quarter is heads.

The nickel is tails about half the time and heads half the time.  So we only have to record the quarter half of the time.  The average number of bits needed to record this experiment is one and a half.  One bit for the nickel and one half bit for the quarter because we only need to write down the quarter flip outcome (1 bit) half of the time.



Sometimes random variables share bits.  When that happens, we say that they have mutual information. Most random variables do not share information. In that case, we say the variables are independent and the mutual information (shared information) is zero.



Suppose that Harry and Sally are doing an experiment.  They flip a penny, a nickel, a dime, and a quarter.  After the experiment, Harry writes down the results of the penny flip and the nickel flip.  Sally writes down the results of the nickel, the dime, and the quarter.  There are four possible outcomes for Harry: Head-Head, Head-Tail, Tail-Head, and Tail-Tail.  We say that Harry’s recording uses 2 bits, so his entropy is 2 bits.  Sally could have 8 different results, but she would only need 3 bits, one for each coin, to record her results, so her entropy is 3 bits.  The mutual information between Harry and Sally is the nickel flip which is just 1 bit (1 for heads, 0 for tails).  The overall entropy of Harry and Sally together is 4 bits because there are four coins.  The Joint entropy of Harry and Sally together is the number of bits needed to record both Harry’s result and Sally’s result.  Harry uses 2 bits and Sally 3 bits, but only 4 bits are needed because the nickel is shared.



There is a general formula for the mutual information between Harry and Sally.

Mutual Information = Harry’s Entropy + Sally’s Entropy – The Joint Entropy.

For this experiment the mutual information is  2 + 3 – 4  which is 1 bit.


 End Note

Well, that’s it.  Maybe I gave an explanation appropriate for twelve year olds. Maybe not.  I had my eighth grade son read it and then I gave him a 6 question quiz. After I corrected two of his answers, he was able to calculate the mutual information of two, fairly simple random variables.