Articles by hundalhh

Mathematician and Father. Into games, astronomy, psychology and philosophy.

Zygmunt Zając at fastml.com has a great post titled “13 NIPS Papers that caught our eye.”  Zając provides a short, readable summary of his favorite papers from NIPS 2013, which just ended last week.  The papers are:

  1. Understanding Dropout  by Baldi and Sadowski
  2. Training and Analysing Deep Recurrent Neural Networks by Hermans and Schrauwen
  3. RNADE: The real-valued neural autoregressive density-estimator by Uria, Murray, and Larochelle
  4. Predicting Parameters in Deep Learning by Denil, Shakibi, Dinh, Ranzato, and de Freitas
  5. Pass-efficient unsupervised feature selection by Maung and Schweitzer
  6. Multi-Prediction Deep Boltzmann Machines by Goodfellow, Mirza, Courville, and Bengio
  7. Memoized Online Variational Inference for Dirichlet Process Mixture Models by Hughes and Sudderth
  8. Learning word embeddings efficiently with noise-contrastive estimation by Mnih and Kavukcuoglu
  9. Learning Stochastic Feedforward Neural Networks by Tang and Salakhutdinov
  10. Distributed Representations of Words and Phrases and their Compositionality by Mikolov, Sutskever, Chen, Corrado, and Dean
  11. Correlated random features for fast semi-supervised learning by McWilliams, Balduzzi, and Buhmann
  12. Convex Two-Layer Modeling by Aslan, Cheng, Zhang, and Schuurmans, and
  13. Approximate inference in latent Gaussian-Markov models from continuous time observations by Cseke, Opper, and Sanguinetti

I suggest reading Zając’s summaries before diving in.

Reddit has this pretty cool subreddit ELI5 = “Explain Like I’m 5,” where people try to explain topics as if they were talking to a five-year-old.  It’s an interesting exercise to see how much of a difficult topic like nuclear energy or solar physics you could explain to a five-year-old.  Here is my attempt to explain entropy and mutual information to a bright five-year-old who isn’t afraid of big words or long sentences.

(This post is going to be amazingly long and very simplistic, so I suggest you don’t bother reading it unless you think it might be fun to consider how to explain a complex concept like mutual information to a five year old.)

 

Random Variables

In order to explain (information theoretic) entropy we need to talk about random variables.  A random variable is a number that you get from a repeatable experiment.  For example, if you roll a die, the outcome is 1, 2, 3, 4, 5, or 6.  We say that the result of each die roll is a random variable.

 

Entropy

So random variables have something called entropy.  In order to talk about entropy it’s helpful to first talk about the doubling numbers.  The doubling numbers are 1, 2, 4, 8, 16, ….  I call them doubling numbers because 1+1 = 2, 2+2 = 4, 4+4 = 8, and so on.  (The official name of the doubling numbers is “the powers of two,” but let’s just call them doubling numbers for now.)

 

BITS

If a random variable has two equally likely outcomes, we say that the variable has an entropy of 1 bit.  If it has four equally likely outcomes, then the variable has 2 bits of entropy; eight equally likely outcomes give 3 bits, and 16 outcomes give 4 bits.  Of course, some random variables have, say, 6 outcomes, and that is not a doubling number, so we could say that such a variable has about two and a half bits because 6 outcomes is halfway between 4 outcomes (2 bits) and 8 outcomes (3 bits).
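
For the curious, the exact number for six equally likely outcomes is log2(6), which is about 2.58 bits.  Here is a one-line check in C (my addition, not part of the five-year-old explanation):

// Quick check: log2(6) is close to the "two and a half bits" estimate above.
#include <stdio.h>
#include <math.h>

int main(void)
{
   printf("log2(6) = %.3f bits\n", log2(6.0));   /* prints 2.585 */
   return 0;
}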

How did we get the name bits?  Well, in a computer bits are little electronic parts that are either on or off.  We use the code 0 for off and 1 for on.  (This code is called the binary code.)  We could represent the result of an experiment (a random variable) with these bits.  Suppose that our experiment is flipping a coin and we call tails 1 and heads 0.  Then we would need one bit to record the results of each coin flip.

 

Three bits for an eight-sided DIE

The number of bits needed to record the roll of an eight-sided die is larger, but it can be done with three bits as follows.  If you roll a 1, set the bits to off-off-on (001).  If you roll a 2, set the bits to 010; a 3 sets them to 011; 4 -> 100; 5 -> 101; 6 -> 110; 7 -> 111; and 8 -> 000.  Notice that we were able to represent every possible outcome with only 3 bits.  That’s why we say the entropy of an eight-sided die roll is three bits.
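
Here is a tiny C sketch (my addition, not from the original explanation) that prints this 3-bit code for each roll, using the convention above where an 8 maps to 000:

// Print the 3-bit code for each roll of an eight-sided die.
#include <stdio.h>

int main(void)
{
   for (int roll = 1; roll <= 8; roll++) {
      int code = roll % 8;   /* rolls 1 through 7 keep their value, 8 becomes 000 */
      printf("roll %d -> %d%d%d\n", roll, (code >> 2) & 1, (code >> 1) & 1, code & 1);
   }
   return 0;
}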

 

A multi-stage experiment

More complicated experiments would require some more difficult computations. Sometimes, the outcomes are not all equally likely.  Suppose our experiment has two stages.  In stage one we flip a nickel.  If the result is tails, we quit.  If the result is heads, we flip a quarter.  There are three possible results for this experiment:

  1. The nickel is tails
  2. The nickel is heads and the quarter is tails.
  3. The nickel is heads and the quarter is heads.

The nickel is tails about half the time and heads half the time.  So we only have to record the quarter half of the time.  The average number of bits needed to record this experiment is one and a half.  One bit for the nickel and one half bit for the quarter because we only need to write down the quarter flip outcome (1 bit) half of the time.
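
For readers who want the grown-up formula, the average number of bits is H = -Σ p log2(p), summed over the outcome probabilities.  Here is a quick C sketch (my addition, not part of the original post) that applies it to the nickel/quarter experiment:

// Entropy of the two-stage nickel/quarter experiment: H = -sum p*log2(p).
#include <stdio.h>
#include <math.h>

int main(void)
{
   /* Outcome probabilities: tails (1/2), heads then tails (1/4), heads then heads (1/4). */
   double p[3] = {0.5, 0.25, 0.25};
   double entropy = 0.0;

   for (int i = 0; i < 3; i++)
      entropy -= p[i] * log2(p[i]);
   printf("Entropy = %.2f bits\n", entropy);   /* prints 1.50 */
   return 0;
}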

 

MUTUAL INFORMATION

Sometimes random variables share bits.  When that happens, we say that they have mutual information. Most random variables do not share information. In that case, we say the variables are independent and the mutual information (shared information) is zero.

 

A COMPLICATED EXAMPLE

Suppose that Harry and Sally are doing an experiment.  They flip a penny, a nickel, a dime, and a quarter.  After the experiment, Harry writes down the results of the penny flip and the nickel flip.  Sally writes down the results of the nickel, the dime, and the quarter.  There are four possible outcomes for Harry: Head-Head, Head-Tail, Tail-Head, and Tail-Tail.  Harry’s recording uses 2 bits, so his entropy is 2 bits.  Sally could have 8 different results, but she would only need 3 bits, one for each coin, to record her results, so her entropy is 3 bits.  The mutual information between Harry and Sally is the nickel flip, which is just 1 bit (1 for heads, 0 for tails).  The joint entropy of Harry and Sally together is the number of bits needed to record both Harry’s result and Sally’s result.  Harry uses 2 bits and Sally uses 3 bits, but only 4 bits are needed overall because there are only four coins and the nickel is shared.

 

THE FORMULA

There is a general formula for the mutual information between Harry and Sally.

Mutual Information = Harry’s Entropy + Sally’s Entropy – The Joint Entropy.

For this experiment the mutual information is 2 + 3 – 4, which is 1 bit.
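
If you want to check the numbers with a computer, here is a short C sketch (my addition, not part of the original post) that estimates the three entropies by listing all 16 equally likely ways the four coins can land and then applies the formula:

// Estimate the entropies in the Harry-and-Sally example by enumerating
// all 16 equally likely flips of the penny, nickel, dime, and quarter.
#include <stdio.h>
#include <math.h>

int main(void)
{
   double pHarry[4] = {0}, pSally[8] = {0}, pJoint[4][8] = {{0}};

   for (int penny = 0; penny < 2; penny++)
      for (int nickel = 0; nickel < 2; nickel++)
         for (int dime = 0; dime < 2; dime++)
            for (int quarter = 0; quarter < 2; quarter++) {
               int h = 2 * penny + nickel;              /* Harry records the penny and the nickel */
               int s = 4 * nickel + 2 * dime + quarter; /* Sally records the nickel, dime, and quarter */
               pHarry[h]    += 1.0 / 16;
               pSally[s]    += 1.0 / 16;
               pJoint[h][s] += 1.0 / 16;
            }

   double hHarry = 0, hSally = 0, hJoint = 0;
   for (int h = 0; h < 4; h++) hHarry -= pHarry[h] * log2(pHarry[h]);
   for (int s = 0; s < 8; s++) hSally -= pSally[s] * log2(pSally[s]);
   for (int h = 0; h < 4; h++)
      for (int s = 0; s < 8; s++)
         if (pJoint[h][s] > 0) hJoint -= pJoint[h][s] * log2(pJoint[h][s]);

   printf("Harry: %.1f bits, Sally: %.1f bits, joint: %.1f bits\n", hHarry, hSally, hJoint);
   printf("Mutual information = %.1f bit\n", hHarry + hSally - hJoint);   /* prints 1.0 */
   return 0;
}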

 

End Note

Well, that’s it.  Maybe I gave an explanation appropriate for twelve-year-olds.  Maybe not.  I had my eighth-grade son read it, and then I gave him a six-question quiz.  After I corrected two of his answers, he was able to calculate the mutual information of two fairly simple random variables.

 

I have been wanting to write a post about using mutual information for feature selection, but I know that a few readers do not know information theory.  Some of my executive friends have forgotten all of their calculus, so I will probably first write a short executive introduction to information theory next week or find an appropriate introduction on the internet.  Earlier today, while I was looking for a very basic introduction, I ran across the 111-page “Basic Concepts in Information Theory” by Marc Uro and the three-page blog post “A Short, Simple Introduction to Information Theory” by Ryan Moulton, which are both appropriate for undergraduate sophomores.

Marc’s book is great.  It covers all of the basic information theory ideas like entropy, cross-entropy, mutual information, linear codes, compression, and Shannon capacity.  Most of his explanations use only basic probability with a little summation notation and a matrix or two.  (His notation is carefully crafted, but sometimes a bit non-standard.)  He avoids using calculus, so the book can be read by a bright freshman.

Ryan’s short post uses a little probability, logarithms, and summation notation, but nothing else.

   I have slightly edited the post below by my friend the big O.
    After all my complaining I have reached an actual argument about scientists or, better put, the science class.  They lack rigorous checks and balances.
    To think that the science class can effectively check itself for error and bias just because it’s supposed to is analogous to thinking that the House of Representatives can effectively check itself for error and bias just because it’s supposed to.  But we know that the job of congressman frequently attracts a certain type of personality.  We know there is an attractive glory to accomplishment in government, just as there is an attractive glory to accomplishment in science.  We know that money plays an incentivizing role in even the noblest of endeavors.
    And so in government we have branches whose explicit role is to check and balance one another.  And we have a press whose job is to check the government itself and seek out its errors.  And we have a population who feel no embarrassment at checking everyone: we scrutinize the press, we critique the government in general, and we attack specific branches of government, all the while recognizing that government by the people (etc.) is prone to errors and biases specifically because it is a government run by people.
     This is not the attitude we have with the class of people who conduct science.  The scientist class has reached such a rarefied status that it lacks equivalent checks and balances.  We expect scientists to nobly check themselves.  But I argue they cannot because they’re people.  We need more rigorous outside skepticism than we currently have.
    Science has the hardest arguments on earth to develop, prove, justify, and explain, because the arguments of science are targeted at revealing something close to objective truth.  There are more obstacles and unseen variables between scientific theory and proof than in any other field.  I think we would be better off considering non-scientist checking and balancing (a “scientist” today being someone who is sanctioned by a university to be labeled as such) as part of – not apart from – the scientific process.
     I like the idea of retaining an unembarrassed and reasoned skepticism of the “truth” offered by scientists – particularly in the weak sciences – and instead accepting effectiveness (e.g., when science becomes technology) as truth.  When something – a theory or an experiment – works in our daily lives, we can label that something as “true enough to be effective” and recognize that as an auspicious label.  The rest should invite continued checking and balancing from both inside and outside the scientific class.
   Some links below
  And then in another email, I added this as a response to my critics:
  The one thing I do not attack is the scientific method or reason.
   I say the scientific method is not currently being employed to the extent that it could be, and we’re worse off for it.  As evidence of this I see the very frequent conflation of science – which is an effective process – with scientists – whom I consider to be as flawed, and for the same reasons, as the members of any other profession.  (That NPR study conforms with my personal experience that scientists may be more prone to justifying their biases than others.)  This conflation leads to a citizenry reluctant to be skeptical of scientists and scientific work because they fear they are questioning science.  They are not.  They are a part (apparently unknowingly) of science, and their reluctance to analyze, question, investigate, and criticize scientists and scientists’ work leads to a weakening of science.

The peer review you cite is precisely what I am labeling inadequate.  What if a congressman suggested his law was good because it had been peer reviewed?  What if he said he had special training as a lawyer or an economist or a historian?  Would that be a satisfying rationale for not having separate gov’t branches, the press, or the citizenry actively challenge his work?  Scientific peer review is equivalent to gov’t peer review – it is necessary, but it is less adequate than a peer review broadened to include others who are currently only marginally involved in, or allowed into, the process.

   Again, re: the scientific method, I don’t believe that a science which includes only an academia-certified science class does, or even can, adequately follow the scientific method to its fullest rigor, any more than Congress can adequately run the government to its fullest rigor minus the critical analysis of outside agencies such as other branches of gov’t, the press, and the citizenry.

Over the last few weeks, I’ve been working with some tricky features.  Interestingly, I needed to add noise to the features to improve my classifier’s performance.  I will write a post on these “confounding features” later.  For now, let me just point out the following useful fact.

If

$$f(x, \sigma) = {1\over{\sigma \sqrt{2 \pi}}} \exp{\left(-{{x^2}\over{2 \sigma^2}}\right)},$$

then

$$\max_\sigma f(x,\sigma) = f(x, |x|).$$

So, if you have a Gaussian with mean zero and you want to fatten it to maximize the value of the probability density function at $x$ without changing the mean, then set the standard deviation to $|x|$.
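
Here is a quick numerical check in C (my own sketch, with an arbitrary choice of $x$): scan over $\sigma$ and confirm that $f(x, \sigma)$ peaks near $\sigma = |x|$.

// Numerically confirm that f(x, sigma) is maximized at sigma = |x|.
#include <stdio.h>
#include <math.h>

static const double PI = 3.14159265358979323846;

double f(double x, double sigma)
{
   return exp(-x * x / (2.0 * sigma * sigma)) / (sigma * sqrt(2.0 * PI));
}

int main(void)
{
   double x = 1.7;   /* arbitrary example point */
   double bestSigma = 0.0, bestValue = -1.0;

   for (double sigma = 0.01; sigma <= 10.0; sigma += 0.001) {
      double value = f(x, sigma);
      if (value > bestValue) { bestValue = value; bestSigma = sigma; }
   }
   printf("argmax over sigma = %.3f, |x| = %.3f\n", bestSigma, fabs(x));
   return 0;
}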

Busy grading, so just some links this week.

How to Gamify Your Life: An Experiment – Part 1

Write Yourself a Scheme in 48 Hours

Minecraft with Quantum Rules

The smarter you are, the better you are at self-deception

A (very?) fast Lisp

I don’t have any time these days, so I’m just going to say that the link below is the best machine learning video that I have seen this year.

“Random Kitchen Sinks: Replacing Optimization with Randomization in Learning”.

(Thanks to Ali Rahimi for giving the talk and thanks to Carl for sending me the link.)

Edit 10-29-13:

Check out Igor’s and Alex Smola’s posts:

“Random Kitchen Sinks, Fast Food and other randomized kernel evaluations” (Nuit Blanche)

“Compressed Sensing: Random Features for Large-Scale Kernel Machines” (Nuit Blanche)

“The Neal Kernel and Random Kitchen Sinks” (Adventures in Data Land)

 

 

Imitation is the greatest form of flattery.

 

  1. The just plain fun retro Turbo Pascal editor.
  2. The Rush movie reminds me of the “Who is your arch-enemy?” post (see also 37 signals) and “The smackdown learning model“.
  3. The rule of three for re-usable code.
  4. The PCP Theorem and Zero Knowledge Proofs suggest avenues of attack on P vs NP
  5. God’s number is 20 (assuming God invented Rubik’s cube)
  6. Why Your Brain Needs More Downtime (Scientific American, Link from Carl)
  7. Keith’s Photos of Caves, Critters, and Nature

 

So I have been learning a little category theory in the hopes that I could use it somehow to teach computers abstraction.

What is abstraction?  Carl and I have been talking about this for years.  Here are some ideas on what abstraction could be:

1a) An average model of a large group of similar things.  For example, the abstraction of a chair might be an idealized chair with a seat, a backrest, four legs, and possibly “stretchers” between the legs to give it strength and stability.

1b) A function in a computer program that is used in several places in the code.  In effect, this function shortens the code.  If the code without the function has length L1, the code with the function has length L2, the function has length LC, and the function is called N times, then typically

L2 $\approx$ L1 – LC*N + LC + N.

The function is an abstraction of the pieces of code that it replaced, similar to the way an ideal chair is an abstraction of many chairs.  The function has the common characteristics of the code it replaced.  For example, the following code consists of about 160 characters

// C Code without function

main()
{
   printf("Hello John\n");
   printf("Hello Harry\n");
   printf("Hello Rose\n");
   printf("Hello Carol\n");
   printf("Hello Mr. Thomas\n");
   printf("Hello Mrs. Jones\n");
}

 

and it could be shortened slightly to about 137 characters by

 

// C Code with function

void pr(char *s)
{
   printf("Hello %s\n", s);
}

main() {
   pr("John");
   pr("Harry");
   pr("Rose");
   pr("Carol");
   pr("Mr. Thomas");
   pr("Mrs. Jones");
}

 

Though this example is not in any way practical, it does illustrate how a function can capture commonality.  All of the pieces of code replaced by the pr function had a call to printf, the substring “Hello “, and the newline “\n”.  The pr function was, in a sense, the idealized version of the six print statements in the original code.

In category theory, we could refer to the category of strings with arrows going from superstrings to substrings.  In this category, there would be an arrow from the line of code printf("Hello John\n"); to the string printf("Hello \n");.  The categorical coproduct of several strings in this category would be the longest common substring.  So, category theory suggests the substring printf("Hello \n"); should somehow be included in the abstraction of those six printf statements in the original code.
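
To make the “commonality as a common substring” idea concrete, here is a brute-force sketch (my own, not from the original post) that finds the longest common substring of two of the printf lines:

// Brute-force longest common substring of two of the printf lines.
#include <stdio.h>
#include <string.h>

int main(void)
{
   const char *a = "   printf(\"Hello John\\n\");";
   const char *b = "   printf(\"Hello Harry\\n\");";
   int bestLen = 0, bestStart = 0;

   for (int i = 0; a[i]; i++)
      for (int j = 0; b[j]; j++) {
         int k = 0;
         while (a[i + k] && b[j + k] && a[i + k] == b[j + k])
            k++;
         if (k > bestLen) { bestLen = k; bestStart = i; }
      }
   printf("Longest common substring: \"%.*s\"\n", bestLen, a + bestStart);
   return 0;
}

For these two lines the answer is the shared prefix up through “Hello “; repeating the search across all six lines recovers the common piece discussed above.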

2)  A simplifying transformation that changes things into simpler things which are easier to work with.  For example, if you are counting sheep in a pen, you might convert groups of sheep into numbers and then add the numbers to get a total count for the entire pen.  We simplified the counting process by first converting groups of sheep into numbers.  Another example might be solving a corn maze by first flying over it, taking a picture, and then converting the “impassable” rows of corn into lines drawn on a piece of paper.  To solve the corn maze, you first convert it to a pen drawing on paper and then solve the pen drawn maze.

3)  Removing irrelevant information or detail.  If a piece in chess is attacking the opponent’s king, then the opponent must either a) move the king, b) put a piece between the attacker and the king, or c) capture the attacking piece.  If you wanted to list the possible moves for the opponent, you can often delete from the chess board most of the pieces because most of the pieces on the board do not affect the three options above.  In the diagram below, the white pawns, rook, and king and the black pawn on a7 do not affect the legal moves for black.

(Chess diagram: “Check”)

In a corn maze, the type of corn, the height of the corn stalks, the temperature, or the type of soil do not affect the solution of the corn maze, so that information could be ignored.  Ignoring irrelevant information allows the brain to form a more abstract, simple model of the corn maze.

4) Variables in first order predicate logic.  “If person A is taller than person B and person B is taller than person C, then person A is taller than person C.”  In that statement, person A could be any one of the billions of people on earth.

5) Establishing classes of objects.  “Fruit tends to spoil quickly.”  That statement applies to a huge variety of organic objects.  Abstraction classes can even be put in a hierarchy like Animal ==> Mammal ==> Cat ==> Siamese Cat.

6) Statistical Patterns and Compression.  If a piece of code like zip.exe compresses a corpus of English text files, we could say that the piece of code formed an abstract model of the corpus.  In the case of zip, which uses Lempel–Ziv-style dictionary compression, the program would notice that

  • the letters ‘e’ and ‘t’ appear frequently,
  • the letter ‘u’ usually follows ‘q’, and
  • the capital letter I frequently appears with spaces on either side.

These patterns are all mini-abstractions or aspects of English.
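
As a toy illustration (my own sketch, not from the original post), here is the kind of letter-frequency statistic a compressor collects; run on a real English corpus, the counts for ‘e’ and ‘t’ dominate:

// Count letter frequencies in a sample string -- a tiny statistical
// pattern of the kind a compressor exploits.
#include <stdio.h>
#include <ctype.h>

int main(void)
{
   const char *text = "I think that the letters e and t appear frequently in English text.";
   int counts[26] = {0};

   for (int i = 0; text[i]; i++)
      if (isalpha((unsigned char)text[i]))
         counts[tolower((unsigned char)text[i]) - 'a']++;

   for (int c = 0; c < 26; c++)
      if (counts[c] > 0)
         printf("%c: %d\n", 'a' + c, counts[c]);
   return 0;
}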

 

A very general notion of abstraction could be any function that shortens the compressed representation of some data.  That function could be an abstraction of some part of the data.  If your data consisted of 3-dimensional models of chairs, then a compression program could use the fact that a chair usually has four legs, a seat, and a backrest.  A compression program might benefit from a function like the one shown below.

// Sketch of a chair-describing helper; describeLegs, describeSeatOnTopOfLegs,
// and describeBackRestOnTopOfSeat are illustrative functions defined elsewhere.
char *DescribeChair(double *featureVector)
{
   char *description = NULL;

   description = describeLegs(description, featureVector);
   description = describeSeatOnTopOfLegs(description, featureVector);
   description = describeBackRestOnTopOfSeat(description, featureVector);

   return description;
}

 

The abstract version of a chair would help compress the data describing many chairs.

A compression program for a corpus of chess games would probably include a board evaluation function.  The program could use the evaluation function to suggest the top five most likely moves.  This would allow the program to use less space when compressing the moves, by encoding each move with a probability distribution over moves (e.g. with arithmetic coding) that depends on the “goodness” of each move.  Better moves would be more likely than worse moves.  Converting the positions to valuations helps the computer to compress by making it “understand” chess.  If the computer is compressing many chess games, it may even note that some human players were more aggressive or stronger than others.  By classifying the players by skill or behavior, the program achieves better compression.
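
Here is a back-of-the-envelope sketch (my own hypothetical numbers, not from the original post) of why a skewed distribution over moves compresses better than treating roughly 30 legal moves as equally likely:

// Expected bits per move: uniform over ~30 legal moves versus a skewed
// distribution where 5 favored moves share 80% of the probability.
#include <stdio.h>
#include <math.h>

int main(void)
{
   double pTop = 0.80 / 5;    /* each of the 5 favored moves    */
   double pRest = 0.20 / 25;  /* each of the remaining 25 moves */
   double expectedBits = 5 * pTop * -log2(pTop) + 25 * pRest * -log2(pRest);

   printf("uniform: %.2f bits per move\n", log2(30.0));   /* about 4.91 */
   printf("skewed:  %.2f bits per move\n", expectedBits); /* about 3.51 */
   return 0;
}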

Well that’s it for now.

Hopefully, sometime in the future, I will be describing how category theory can help us automate those six abstractions and then use the abstractions to generate features for learning algorithms.

 

 

(Code highlighting was done by http://hilite.me/.  Thanks!)

 

 
