July 2013

You are currently browsing the monthly archive for July 2013.

One of my friends showed me his new gaming computer and said that the GPU could do 1.3 teraflops (1.3 trillion floating point operations per second) which is about 500 times faster than my home computer, so I thought, “Imagine how quickly we could search a game tree.”  So I started looking around the internet for a super-great GPU chess engine and found basically nothing!!  Turns out that the amount of memory per thread is too low, the size of the L1 cache is too small, and the alpha-beta pruning algorithm is not quite parallel enough for GPUs to play chess well.  Here is a nice graphic of the L1 access time for a few CPUs and GPUs.


In the paper “Parallel Game Tree Search Using GPU” (2011), L’ubomír Lackovic improved the tree search speed by a factor of two to three by using a GPU instead of the more traditional CPU based tree search for Czech draughts (similar to American Checkers).  His tests were based on the ATI Radeon 4890 GPU, the Nvidia GTX460 GPU, and the quad-core processor Intel i5 750 CPU.  I had hoped for more speed.

In “Large-Scale Parallel State Space Search Utilizing Graphics Processing Units and Solid State Disks” (2011), von Damian Sulewski invented and test several algorithms for search, reviewed game theory algorithms, and applied GPU processing to several games including “Nine Men’s Morris“.  Sulewski used an Intel Core i7 CPU 920 with an NVIDIA GeForce 285 GTX GPU to run his tests. He reported that the GPU was faster by a factor of three to twelve as long as sufficient RAM was available.  If the computer ran out of RAM and had to use disk storage, then the GPU performance degraded significantly.  He states,


“The observed speed-ups of over one order of magnitude have been obtained (plotted in bold font), exceeding the number of cores on most current PCs. Note that this assertion is true for the dual 6-core CPUs available from Intel, but not on a dual Xeon machine with two quad-core CPUs creating 16 logical cores due to multi-threading. Nonetheless, better speed-ups are possible since NVIDIA GPUs can be used in parallel and the Fermi architecture (e.g. located on the GeForce GTX 480 graphics card) is coming out which will go far beyond the 240 GPU cores we had access to.

For larger levels, however, we observe that the GPU performance degrades. When profiling the code, we identified I/O access as one limiting factor. For example, reading S8,8 from one HDD required 100 seconds, while the expansion of 8 million states, including ranking and unranking required only about 1 second on the GPU.”


So GPU’s are not the silver bullet for games yet.


The linear-nonlinear-Poisson (LNP) cascade model is a standard model of neuron responses.  Louis Shao has recently shown that an artificial neural net consisting of LNP neurons can simulate any Boltzmann machine and perform “a semi-stochastic Bayesian inference algorithm lying between Gibbs sampling and variational inference.”  In his paper he notes that the “properties of visual area V2 are found to be comparable to those on the sparse autoencoder networks [3]; the sparse coding learning algorithm [4] is originated directly from neuroscience observations; also psychological phenomenon such as end-stopping is observed in sparse coding experiments [5].”


Michael Nielsen wrote an interesting, informative, and lengthy blog post on Simpson’s paradox and causal calculus titled “If correlation doesn’t imply causation, then what does?”  Nielsen’s post reminded me of Judea Pearl‘s talk at KDD 2011 where Pearl described his causal calculus.  At the time I found it hard to follow, but Nielsen’s post made it more clear to me.


Causal calculus is a way of reasoning about causality if the independence relationships between random variables are known even if some of the variables are unobserved.  It uses notation like

$\alpha$ = P( Y=1 | do(X=2))

to mean the probability that Y=1 if an experimenter forces the X variable to be 2. Using the Pearl’s calculus, it may be possible to estimate $\alpha$ from a large number of observations where X is free rather than performing the experiment where X is forced to be 2.  This is not as straight forward as it might seem. We tend to conflate P(Y=1 | do(X=2)) with the conditional probability P(Y=1 | X=2). Below I will describe an example1, based on Simpson’s paradox, where they are different.

Suppose that there are two treatments for kidney stones: treatment A and treatment B.  The following situation is possible:

  • Patients that received treatment A recovered 33% of the time.  
  • Patients that received treatment B recovered 67% of the time.  
  • Treatment A is significantly better than treatment B.

This seemed very counterintuitive to me.  How is this possible?

The problem is that there is a hidden variable in the kidney stone situation. Some kidney stones are larger and therefore harder to treat and others are smaller and easier to treat.  If treatment A is usually applied to large stones and treatment B is usually used for small stones, then the recovery rate for each treatment is biased by the type of stone it treated.

Imagine that

  • treatment A is given to one million people with a large stone and 1/3 of them recover,
  • treatment A is given to one thousand people with a small stone and all of them recover,
  • treatment B is given to one thousand people with a large stone and none of them recover,
  • treatment B is given to one million people with a small stone and 2/3 of them recover.

Notice that about one-third of the treatment A patients recovered and about two-thirds of the treatment B patients recovered, and yet, treatment A is much better than treatment B.  If you have a large stone, then treatment B is pretty much guaranteed to fail (0 out of 1000) and treatment A works about 1/3 of the time. If you have a small stone, treatment A is almost guaranteed to work, while treatment B only works 2/3 of the time.

Mathematically P( Recovery | Treatment A) $\approx$ 1/3   (i.e.  about 1/3 of the patients who got treatment A recovered).

The formula for P( Recovery | do(Treatment A)) is much different.  Here we force all patients (all 2,002,000 of them) to use treatment A.  In that case,

P( Recovery | do(Treatment A) ) $\approx$ 1/2*1/3 + 1/2*1 = 2/3.

Similarly for treatment B, P( Recovery |  Treatment B) $\approx$ 2/3 and

P( Recovery | do(Treatment B) ) $\approx$ 1/3.


This example may seemed contrived, but as Nielsen said, “Keep in mind that this really happened.”


Edit Aug 8,2013:  Judea Pearl has a wonderful write-up on Simpson’s paradox titled “Simpson’s Paradox: An Anatomy” (2011?).  I think equation (9) in the article has a typo on the right-hand side.  I think it should read

$$ P (E |do(\neg C)) = P (E |do(\neg C),F) P (F ) +P (E | do(\neg C), \neg F) P (\neg F ).$$

Nuit Blanche‘s article “The Summer of the Deeper Kernels” references the two page paper “Deep Support Vector Machines for Regression Problems” by Schutten, Meijster, and Schomaker (2013).


The deep SMV is a pretty cool idea.  A normal support vector machine (SVM) classifier, finds $\alpha_i$ such that

$f(x) = \sum_i \alpha_i K(x_i, x)$ is positive for one class of $x_i$ and negative for the other class (sometimes allowing exceptions).  ($K(x,y)$ is called the kernel function which is in the simplest case just the dot product of $x$ and $y$.)  SVM’s are great because they are fast and the solution is sparse (i.e. most of the $\alpha_i$ are zero).

Schutten, Meijster, and Schomaker apply the ideas of deep neural nets to SVMs.

They construct $d$ SVMs of the form

$f_a(x) = \sum_i \alpha_i(a) K(x_i, x)+b_a$

and then compute a more complex two layered SVM

$g(x) = \sum_i \alpha_i  K(f(x_i), f(x))+b$

where $f(x) = (f_1(x), f_2(x), \ldots, f_d(x))$.  They use a simple gradient descent algorithm to optimize the alphas and obtain numerical results on ten different data sets comparing the mean squared error to a standard SVM.

ASIMO is becoming very impressive–here’s the video.