Michael Nielsen wrote an interesting, informative, and lengthy blog post on Simpson’s paradox and causal calculus titled “If correlation doesn’t imply causation, then what does?”  Nielsen’s post reminded me of Judea Pearl‘s talk at KDD 2011 where Pearl described his causal calculus.  At the time I found it hard to follow, but Nielsen’s post made it more clear to me.

 

Causal calculus is a way of reasoning about causality if the independence relationships between random variables are known even if some of the variables are unobserved.  It uses notation like

$\alpha$ = P( Y=1 | do(X=2))

to mean the probability that Y=1 if an experimenter forces the X variable to be 2. Using Pearl’s calculus, it may be possible to estimate $\alpha$ from a large number of observations where X is free rather than performing the experiment where X is forced to be 2.  This is not as straightforward as it might seem. We tend to conflate P(Y=1 | do(X=2)) with the conditional probability P(Y=1 | X=2). Below I will describe an example, based on Simpson’s paradox, where they are different.

Suppose that there are two treatments for kidney stones: treatment A and treatment B.  The following situation is possible:

  • Patients that received treatment A recovered 33% of the time.  
  • Patients that received treatment B recovered 67% of the time.  
  • Treatment A is significantly better than treatment B.

This seemed very counterintuitive to me.  How is this possible?

The problem is that there is a hidden variable in the kidney stone situation. Some kidney stones are larger and therefore harder to treat and others are smaller and easier to treat.  If treatment A is usually applied to large stones and treatment B is usually used for small stones, then the recovery rate for each treatment is biased by the type of stone it treated.

Imagine that

  • treatment A is given to one million people with a large stone and 1/3 of them recover,
  • treatment A is given to one thousand people with a small stone and all of them recover,
  • treatment B is given to one thousand people with a large stone and none of them recover,
  • treatment B is given to one million people with a small stone and 2/3 of them recover.

Notice that about one-third of the treatment A patients recovered and about two-thirds of the treatment B patients recovered, and yet, treatment A is much better than treatment B.  If you have a large stone, then treatment B is pretty much guaranteed to fail (0 out of 1000) and treatment A works about 1/3 of the time. If you have a small stone, treatment A is almost guaranteed to work, while treatment B only works 2/3 of the time.

Mathematically, P( Recovery | Treatment A) $\approx$ 1/3   (i.e. about 1/3 of the patients who got treatment A recovered).

The formula for P( Recovery | do(Treatment A)) is quite different.  Here we force all patients (all 2,002,000 of them) to use treatment A.  In that case,

P( Recovery | do(Treatment A) ) $\approx \frac{1}{2}\cdot\frac{1}{3} + \frac{1}{2}\cdot 1 = \frac{2}{3}$.

Similarly for treatment B, P( Recovery |  Treatment B) $\approx$ 2/3 and

P( Recovery | do(Treatment B) ) $\approx$ 1/3.
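
Since this arithmetic is easy to get backwards, here is a minimal Python sketch (my own, not from Nielsen’s post) that recomputes both quantities from the counts above, treating the roughly 50/50 split between large and small stones as the population distribution:

```python
# Counts from the example above: (patients, recoveries) by treatment and stone size.
counts = {
    ("A", "large"): (1_000_000, 333_333),
    ("A", "small"): (1_000, 1_000),
    ("B", "large"): (1_000, 0),
    ("B", "small"): (1_000_000, 666_667),
}

def p_recovery_given(treatment):
    """Observed conditional probability P(Recovery | Treatment)."""
    n = sum(counts[(treatment, s)][0] for s in ("large", "small"))
    r = sum(counts[(treatment, s)][1] for s in ("large", "small"))
    return r / n

def p_recovery_do(treatment):
    """Interventional probability P(Recovery | do(Treatment)): force the treatment
    on everyone, but keep the population's stone-size distribution."""
    total = sum(n for n, _ in counts.values())
    p = 0.0
    for size in ("large", "small"):
        n_size = sum(counts[(t, size)][0] for t in ("A", "B"))   # P(size), about 1/2 here
        n, r = counts[(treatment, size)]
        p += (n_size / total) * (r / n)
    return p

print(p_recovery_given("A"), p_recovery_do("A"))   # ~0.33 vs ~0.67
print(p_recovery_given("B"), p_recovery_do("B"))   # ~0.67 vs ~0.33
```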

 

This example may seem contrived, but as Nielsen said, “Keep in mind that this really happened.”

 

Edit Aug 8, 2013:  Judea Pearl has a wonderful write-up on Simpson’s paradox titled “Simpson’s Paradox: An Anatomy” (2011?).  I think equation (9) in the article has a typo on the right-hand side; it should read

$$ P (E |do(\neg C)) = P (E |do(\neg C),F) P (F ) +P (E | do(\neg C), \neg F) P (\neg F ).$$

Nuit Blanche’s article “The Summer of the Deeper Kernels” references the two-page paper “Deep Support Vector Machines for Regression Problems” by Schutten, Meijster, and Schomaker (2013).

 

The deep SVM is a pretty cool idea.  A normal support vector machine (SVM) classifier finds $\alpha_i$ such that

$f(x) = \sum_i \alpha_i K(x_i, x)$ is positive for one class of $x_i$ and negative for the other class (sometimes allowing exceptions).  ($K(x,y)$ is called the kernel function, which in the simplest case is just the dot product of $x$ and $y$.)  SVMs are great because they are fast and the solution is sparse (i.e. most of the $\alpha_i$ are zero).

Schutten, Meijster, and Schomaker apply the ideas of deep neural nets to SVMs.

They construct $d$ SVMs of the form

$f_a(x) = \sum_i \alpha_i(a) K(x_i, x)+b_a$

and then compute a more complex two layered SVM

$g(x) = \sum_i \alpha_i  K(f(x_i), f(x))+b$

where $f(x) = (f_1(x), f_2(x), \ldots, f_d(x))$.  They use a simple gradient descent algorithm to optimize the alphas and obtain numerical results on ten different data sets, comparing the mean squared error to that of a standard SVM.
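
To make the two-layer structure concrete, here is a toy NumPy sketch of the forward pass. The RBF kernel, the random initialization, and the single gradient step are stand-ins of my own for illustration; the paper’s actual training procedure and kernel choices may differ.

```python
import numpy as np

rng = np.random.default_rng(0)

def rbf_kernel(X, Y, gamma=1.0):
    """K(x, y) = exp(-gamma * ||x - y||^2), computed for all pairs of rows."""
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

# Toy training data: n points in 5 dimensions, scalar regression targets.
n, dim, d_hidden = 50, 5, 3
X = rng.normal(size=(n, dim))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=n)

# First layer: d_hidden SVM-style functions f_a(x) = sum_i alpha_i(a) K(x_i, x) + b_a.
A1 = rng.normal(scale=0.1, size=(n, d_hidden))   # alpha_i(a)
b1 = np.zeros(d_hidden)

def first_layer(Xnew):
    return rbf_kernel(Xnew, X) @ A1 + b1          # shape (m, d_hidden)

# Second layer: g(x) = sum_i alpha_i K(f(x_i), f(x)) + b.
F_train = first_layer(X)
a2 = rng.normal(scale=0.1, size=n)
b2 = 0.0

def deep_svm(Xnew):
    return rbf_kernel(first_layer(Xnew), F_train) @ a2 + b2

# The paper fits the alphas by gradient descent on the regression error;
# here is a single (scaled) gradient step on the second layer, just to show the idea.
pred = deep_svm(X)
grad_a2 = rbf_kernel(F_train, F_train).T @ (pred - y) / n
a2 -= 0.1 * grad_a2
print(np.mean((deep_svm(X) - y) ** 2))
```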

ASIMO is becoming very impressive–here’s the video.

Robin Hanson gave a very interesting TED talk about a possible technological singularity based on AI and robotics.  I outline his talk below and add a few ideas from Hanson’s 1998 article “Long-Term Growth As A Sequence of Exponential Modes“, Ray Kurzweil’s essay (2001) and book (2005), and Bill Joy’s Wired article (2000).

 

History of Economic Growth Rates

 

Notes

Hanson’s “Great Eras” graph illustrates major economic revolutions and their effect on growth rates on a log-log scale.  Humans doubled their population about every 200,000 years until the agricultural age started 10,000 years ago.1  In the agricultural age, human population and production doubled every 1000 years. Then the industrial revolution occurred and our world economy started doubling every 15 years.  Both revolutions increased the growth rate by a factor of 50 to 200.  (Kurzweil extends this pattern of revolutions into both the past and the future.)

What if a third major economic revolution occurred?  How could it occur?

Robots.

For the past 10,000 years, human economic growth has been based on land, capital, and labor.  For the past 200 years we have increased world productivity by population growth, education, and by creating bigger/better tools.  Intelligent robots could reproduce more quickly than us, transfer knowledge faster, and they would be their own tools.  Robots or artificial intelligences would be so extremely productive that we could experience an economy where our production doubled every month.  The amount of stories, art, television, games, science, philosophy, and legal opinions produced would be phenomenal.

Hanson suggests a path to a humanistic robotic world created by this third economic revolution.  Artificial intelligence might first be created by mapping the human brain and uploading human brains into computer hardware.  The most productive individual human minds uploaded into computers would proliferate the fastest and thus would dominate.  They would be immortal and they would be able to travel across the world quickly and cheaply.   Humans would not be able to compete and perhaps we would all retire.  The uploaded human minds would live in a beautiful virtual reality.

Hanson describes a society of virtual minds.  The most productive minds would acquire or create the most computer resources.  Their software personality emulations would run the fastest relative to real-time.  He speculates about the structure of this virtual society which includes a caste system, their working habits, and the virtual worlds that the different castes would inhabit.
1This estimate is from “Long-Term Growth As A Sequence of Exponential Modes” and is based on “Population Bottlenecks and Pleistocene Human Evolution” Hawks et al. (2000) and “On the number of members of the genus Homo who have ever lived, and some evolutionary implications” Weiss (1984).

 

The Johnson–Lindenstrauss lemma roughly states that if you have a large number of points in a high dimensional space, you can find a mapping to a lower dimensional space that almost preserves the distances between points. This lemma is useful for compressive sensing, self-organizing maps (Kohonen maps), and manifold learning.  The following corollary is an immediate consequence of the lemma.

Corollary: For any positive integers $j>k$ and any set $S$ of $n$ points in $R^j$ there exists a mapping $f: R^j \rightarrow R^k$ such that for any two points $x,y \in S$,

$$(1 - \epsilon) ||x-y|| < ||f(x) - f(y)|| < (1 +\epsilon) ||x-y||$$

where

$$\epsilon = \sqrt{ 8 \log n \over k }.$$

According to Wikipedia, “The map used for the embedding is at least Lipschitz, and can even be taken to be an orthogonal projection.”
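
Here is a quick numerical check of the corollary using a scaled Gaussian random matrix, which is one standard way to realize a Johnson–Lindenstrauss map (the dimensions and the particular projection below are my own choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

n, j, k = 200, 1000, 400                     # n points in R^j, target dimension k
S = rng.normal(size=(n, j))                  # the point set
eps = np.sqrt(8 * np.log(n) / k)             # epsilon from the corollary

# A Gaussian random matrix scaled by 1/sqrt(k) realizes a JL-style map f.
P = rng.normal(size=(j, k)) / np.sqrt(k)
T = S @ P

def pairwise_dists(X):
    sq = (X ** 2).sum(axis=1)
    d2 = sq[:, None] + sq[None, :] - 2 * X @ X.T
    return np.sqrt(np.maximum(d2, 0.0))

i_up = np.triu_indices(n, 1)
ratios = pairwise_dists(T)[i_up] / pairwise_dists(S)[i_up]
print(eps, ratios.min(), ratios.max())       # ratios should stay within 1 +/- eps
```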

Ailon and Chazelle published an interesting approximate nearest neighbor algorithm based on “the Fast Johnson-Lindenstrauss Transform”.  They write “The FJLT is faster than standard random projections and just as easy to implement. It is based upon the preconditioning of a sparse projection matrix with a randomized Fourier transform.” and

By the Johnson-Lindenstrauss lemma [25,27,31], $n$ points in Euclidean space can be projected down to $k = O(\epsilon^{−2} \log n)$ dimensions while incurring a distortion of at most $1 + \epsilon$ in their pairwise distances. To achieve this requires a dense $k$-by-$d$ matrix; and so mapping each point takes $O(d \log n)$ time (for fixed $\epsilon$). Is that optimal? A lower bound of Alon [3] dashes any hope of reducing the number of rows (as a function of n), so the obvious question is whether the matrix can be made sparse.  Achlioptas [1] has shown that the density can be reduced by a constant factor, but one cannot go much further because a sparse matrix will typically distort a sparse vector. To prevent this, we use a central concept from harmonic analysis known as the Heisenberg principle (so named because it is the key idea behind the Uncertainty Principle): A signal and its spectrum cannot be both concentrated. With this in mind, we precondition the random projection with a Fourier transform (via an FFT) in order to isometrically enlarge the support of any sparse vector. To prevent the inverse effect, ie, the sparsification of dense vectors, we randomize the Fourier transform.
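
Just to fix ideas, here is a rough NumPy sketch of the preconditioning structure described in the quote: random sign flips, a Fourier transform to spread out sparse vectors, and then a sparse random projection. The scaling constants and sparsity level are simplifications of mine, not the parameters from the Ailon–Chazelle paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def fjlt_sketch(x, k, density=0.1):
    """Rough sketch of the P * H * D structure: D flips signs randomly, H is an FFT
    (which spreads the support of sparse vectors), and P is a sparse random
    projection down to k coordinates.  The rescaling below roughly preserves norms."""
    d = len(x)
    D = rng.choice([-1.0, 1.0], size=d)             # random sign flips
    Hx = np.fft.fft(x * D) / np.sqrt(d)             # norm-preserving Fourier "preconditioning"
    P = rng.normal(size=(k, d)) * (rng.random((k, d)) < density)
    P /= np.sqrt(k * density)                       # sparse projection, rescaled
    return P @ Hx

x = np.zeros(1024)
x[3] = 1.0                                          # a very sparse vector
y = fjlt_sketch(x, k=64)
print(np.linalg.norm(x), np.linalg.norm(y))         # the norms should be comparable
```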

 

 


In “A Neuro-evolution Approach to General Atari Game Playing“, Hausknecht, Miikkulainen, and Stone (2013) describe and test four general game learning AIs based on evolving neural nets. They apply the AIs to sixty-one Atari 2600 games, exceeding the best known human performance in three of the sixty-one games (Bowling, Kung Fu Master, and Video Pinball).  This work improves their previous Atari gaming AI described in “HyperNEAT-GGP: A HyperNEAT-based Atari General Game Player” (2012) with P. Khandelwal.

The Atari 2600 presents a uniform interface for all its games:  a 2D screen, a joystick, and one button.  The Atari 2600 games are simulated in the Arcade Learning Environment which has allowed several researchers to develop AIs for the Atari.

The four algorithms tested are:

  1. Fixed topology neural nets that adapt by changing weights between neurons
  2. Neural nets that evolve both the weights and the topology of the network (NEAT created by Stanley and Miikkulainen (2002))
  3. “Indirect encoding of the network weights”  (HyperNEAT created by Gauci and Stanley (2008)) 
  4. “A hybrid algorithm combining elements of both indirect encodings and individual weight evolution” (HybrID by Clune, Stanley, Pennock, and Ofria (2011))

All of the algorithms evolved a population of 100 individual neural nets over 150 generations mostly using the topology shown below.

[Figure: the neural network topology used in the experiments]
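
As a toy illustration of the simplest of the four variants (fixed topology, evolving only the weights), here is a small NumPy sketch with a population of 100 evolved for 150 generations. The network size, mutation scheme, and dummy fitness function are mine; the real experiments score each network by its Atari game performance through the Arcade Learning Environment.

```python
import numpy as np

rng = np.random.default_rng(0)

# Fixed-topology net whose weights form a flat genome; fitness here is a dummy
# score standing in for the total game score of a full Atari episode.
n_in, n_hidden, n_out = 8, 6, 4            # e.g. screen features in, joystick actions out
genome_len = n_in * n_hidden + n_hidden * n_out

def policy(genome, obs):
    W1 = genome[: n_in * n_hidden].reshape(n_in, n_hidden)
    W2 = genome[n_in * n_hidden:].reshape(n_hidden, n_out)
    return np.argmax(np.tanh(obs @ W1) @ W2)   # pick one discrete action

def fitness(genome):
    obs = rng.normal(size=(32, n_in))          # dummy observations
    return sum(policy(genome, o) == 1 for o in obs)

pop = rng.normal(size=(100, genome_len))       # population of 100, as in the paper
for gen in range(150):                         # 150 generations, as in the paper
    scores = np.array([fitness(g) for g in pop])
    parents = pop[np.argsort(scores)[-20:]]    # keep the best 20
    children = parents[rng.integers(0, 20, 80)] + 0.1 * rng.normal(size=(80, genome_len))
    pop = np.vstack([parents, children])
print(scores.max())
```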

For MOVIES of the AIs in action click below

http://www.cs.utexas.edu/~mhauskn/projects/atari/movies.html

 

So I was thinking about how I should estimate the KL distance between two mixtures of Gaussians.  For discussion purposes, assume that the first mixture has pdf $f(x)$ and the second has pdf $g(x)$.

My first thought was to randomly generate samples from $f(x)$ and to compute the average value of $\log(g(x))$ from the other mixture.  That would give me $E_f[\log(g(X))] = -H(f,g)$  and  $KL(f,g) = H(f,g) - H(f)$ where $H(f,g)$ is the cross entropy and $H(f)$ is the entropy.  This converges at order $1/\sqrt{n}$ and it is easy to program.
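
Here is roughly what I had in mind, sketched for one-dimensional mixtures in NumPy (the mixture parameters are arbitrary examples):

```python
import numpy as np

rng = np.random.default_rng(0)

# A 1-D Gaussian mixture is (weights w, means mu, standard deviations sd).
def gmm_sample(w, mu, sd, n):
    comp = rng.choice(len(w), size=n, p=w)
    return rng.normal(mu[comp], sd[comp])

def gmm_logpdf(x, w, mu, sd):
    # log sum_k w_k N(x; mu_k, sd_k^2), vectorized over x.
    z = (x[:, None] - mu[None, :]) / sd[None, :]
    logp = -0.5 * z**2 - np.log(sd[None, :] * np.sqrt(2 * np.pi)) + np.log(w[None, :])
    return np.logaddexp.reduce(logp, axis=1)

# Monte Carlo estimate of KL(f || g) = E_f[log f(X) - log g(X)].
def kl_mc(f, g, n=100_000):
    x = gmm_sample(*f, n)
    return np.mean(gmm_logpdf(x, *f) - gmm_logpdf(x, *g))

f = (np.array([0.3, 0.7]), np.array([-1.0, 2.0]), np.array([1.0, 0.5]))
g = (np.array([0.5, 0.5]), np.array([0.0, 2.5]), np.array([1.0, 1.0]))
print(kl_mc(f, g))   # the Monte Carlo error decreases like 1/sqrt(n)
```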

My second thought was to borrow an idea from Unscented Kalman filters.  I thought about creating a non-random sampling from one distribution at specific points and estimating $E_f[\log(g(X))]$ from those points.

My third thought was to try Google.  Google suggested “Lower and Upper Bounds for Approximation of the Kullback-Leibler Divergence Between Gaussian Mixture Models” by Durrieu, Thiran, and Kelly (2012) and “Approximating the Kullback Leibler divergence between Gaussian Mixture Models” by Hershey and Olsen (2007).  Here are some notes from their papers:

1.  The KL distance between two Gaussians $f$ and $g$ is

$D_{KL}( f || g ) = {1\over2}\left( \log\left( { \det(\Sigma_g)}\over { \det(\Sigma_f)}\right) + Tr( \Sigma_g^{-1}  \Sigma_f)  + ||\mu_f - \mu_g||_g^2 -d \right)$ where $d$ is the dimension of the space, $\Sigma$ is the covariance matrix, $\mu$ is the mean, $Tr$ is the trace, and

$|| x ||^2_g = x^T (\Sigma_g^{-1}) x$.
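
Implemented directly, the formula looks like this (a small NumPy sketch of my own; the sanity check just confirms that the divergence of a Gaussian from itself is zero):

```python
import numpy as np

def gaussian_kl(mu_f, Sigma_f, mu_g, Sigma_g):
    """KL(f || g) for two multivariate Gaussians, following the formula above."""
    d = len(mu_f)
    Sg_inv = np.linalg.inv(Sigma_g)
    diff = mu_f - mu_g
    return 0.5 * (np.log(np.linalg.det(Sigma_g) / np.linalg.det(Sigma_f))
                  + np.trace(Sg_inv @ Sigma_f)
                  + diff @ Sg_inv @ diff
                  - d)

mu = np.array([1.0, -2.0])
S = np.array([[2.0, 0.3], [0.3, 1.0]])
print(gaussian_kl(mu, S, mu, S))            # 0.0
print(gaussian_kl(mu, S, mu + 1.0, 2 * S))  # > 0
```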

2.   Hershey and Olsen review several methods for estimating the divergence:

  • Monte-Carlo methods,  
  • Unscented methods (unscented methods are simple and an unscented approximation of $\int f(x) g(x) dx$ is exact if $f$ is a Gaussian and $g$ is quadratic),
  • Gaussian Approximation (this is bad, don’t do it, if you do do it, “I told you so”),
  • Product of Gaussian approximations using Jensen’s inequality (this is cute, I like it, but I’m not sure how accurate it is), and
  • Match Bound approximation by Do (2003) and Goldberg et al (2003) (just match each Gaussian with another Gaussian in the other mixture and compute those KL distances).

3.  Hershey and Olsen introduce a delightful improvement over the Match Bound approximation using variational methods.  They have the same idea as Match Bound, but they significantly reduce the error in Jensen’s inequality by introducing weighted averages.  Since Jensen’s inequality produces a lower bound using the weighted average, they maximize the lower bound over all possible weightings. The maximizer happens to have a very simple form, so the bound is also very simple to compute.  Very nice.  (Numerical results are given at the end of the paper.)  I’ve got to try this one out.
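
My reconstruction of their variational approximation, for whatever it’s worth, is sketched below; the exact weighting should be checked against the paper before trusting it. Each mixture is a list of (weight, mean, covariance) triples, and the component-to-component KLs come from the closed form in item 1.

```python
import numpy as np

def gaussian_kl(mu_f, Sf, mu_g, Sg):
    # Closed-form KL between two Gaussians (same formula as in item 1).
    d, Sg_inv, diff = len(mu_f), np.linalg.inv(Sg), mu_f - mu_g
    return 0.5 * (np.log(np.linalg.det(Sg) / np.linalg.det(Sf))
                  + np.trace(Sg_inv @ Sf) + diff @ Sg_inv @ diff - d)

def variational_kl(f, g):
    """My reading of the Hershey-Olsen variational approximation to KL(f || g),
    with f and g given as lists of (weight, mean, covariance) triples.
    The weighting here is my reconstruction; verify it against the paper."""
    total = 0.0
    for wa, ma, Sa in f:
        num = sum(wb * np.exp(-gaussian_kl(ma, Sa, mb, Sb)) for wb, mb, Sb in f)
        den = sum(wb * np.exp(-gaussian_kl(ma, Sa, mb, Sb)) for wb, mb, Sb in g)
        total += wa * np.log(num / den)
    return total

I = np.eye(2)
f = [(0.5, np.zeros(2), I), (0.5, np.array([3.0, 0.0]), I)]
g = [(0.5, np.array([0.5, 0.0]), I), (0.5, np.array([3.5, 0.0]), I)]
print(variational_kl(f, g))   # and variational_kl(f, f) gives 0
```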

4.  Durrieu, Thiran, and Kelly improve on the Hershey and Olsen method, but I’m not sure how much better the new method is.  More research required.

Nuit Blanche has a nice summary of the FOCS 2012 53rd Annual IEEE Symposium on Foundations of Computer Science Workshop on “Randomized Numerical Linear Algebra (RandNLA): Theory and Practice“.  Faster random algorithms for QR, SVD, Eigenvalues, Least Squares, …  using random projections and other techniques were discussed.

The last week of classes and finals week were kind of brutal both for me and my students.  (It wasn’t that bad, but it was a lot of work.)  So, I’m taking some time off blogging.  Cheers, Hein

Erik Brynjolfsson, director of the MIT Center for Digital Business, gave a great TED talk on the recent changes in world-wide productivity, computers, machine learning and the “great stagnation“.  There has been a stagnation in GDP per capita for the last 5 years in the United States, Europe, and Japan.  However, growth has continued unabated in India, China, and Brazil.  Economist Tyler Cowen is not particularly optimistic about the current economic situation, which he describes in his book “The Great Stagnation: How America Ate All The Low-Hanging Fruit of Modern History, Got Sick, and Will (Eventually) Get Better” ($4 Kindle edition). Erik disagrees with the idea that the economy, technology, and science are stagnant.  Instead, we are being confused by what he calls the “great decoupling” of productivity growth and employment/median income.  He believes that we are actually in the early phases of the “new machine age” which will create more wealth than the industrial revolution.  Here are some notes on his TED talk:

  • The introduction of electricity did not immediately lead to large productivity growth. At first, factory managers just replaced existing steam machines with electric machines. It took thirty years (one generation) before the factories were redesigned with electricity in mind and new work processes were created. Then productivity growth took off.
  • Big general purpose technologies like fire, agriculture, writing, metallurgy, printing press, steam engine, telecommunications & electricity, and now computers cause “cascades of complementary innovations”.
  • The GDP growth rate in America has been fairly steady at an average of 1.9% per year since 1800.
  • Erik graphically compares the 1960-2011 productivity growth caused partly by computers and the 1890-1940 productivity growth caused by electricity.  They appear quite similar.
  • Productivity has grown faster (2.4% per year) in the last decade than in any other decade since 1970. The average productivity growth rate was about 2.3% during the second industrial revolution.
  • World GDP in the last decade has grown faster than any other decade in history.
  • New growth has been driven by thoughts and ideas more than physical products. This growth is hard to measure. The massive utility of free products like Google Search or the Wikipedia is not included in GDP.
  • The “new machine age” is “digital, exponential, and combinatorial”.
  • Digital goods are perfectly, freely replicable and they have near zero transportation cost (often moving at great speed crossing the globe in a minute.)
  • Computer power has grown exponentially. In Erik’s words, “Computers get better faster than anything else ever.” (2013 playstation = 1996 supercomputer)
  • We are not designed by evolution to anticipate exponential trends. We expect linear trends.
  • We now have Big Data and Machine Learning which allow us to analyze everything more deeply.
  • He disagrees with the idea of “low hanging fruit” inventions (see e.g. Cowen’s book). He thinks that “each innovation produces building blocks for even more innovations”.
  • The digital, exponential, and combinatorial nature of the new machine age will lead to the greatest industrial revolution in history. (see e.g. cat transportation at 6:55 in the video).
  • Machine Learning may be the most important innovation of this revolution. (The Watson Jeopardy team improved faster than any human could by using machine learning.)  Watson’s technology is now being applied in the legal, banking, and medical fields.
  • “The great decoupling” is the decoupling of high productivity growth from employment and median income.
  • Jobs are being replaced by computers.
  • In 1997 Garry Kasparov was beaten by the supercomputer Deep Blue. Today a cell phone can beat a chess grandmaster.
  • In computer assisted chess play (Advanced Chess), sometimes the winners are neither computer experts nor grandmasters. Instead, they are the best at interacting with the computer, guiding its exploration of the possibilities. So, maybe we will not lose our jobs to computers; rather, we will collaborate with computers to create wonderful things.
