Michael Nielsen wrote an interesting, informative, and lengthy blog post on Simpson’s paradox and causal calculus titled “If correlation doesn’t imply causation, then what does?” Nielsen’s post reminded me of Judea Pearl’s talk at KDD 2011, where Pearl described his causal calculus. At the time I found it hard to follow, but Nielsen’s post made it clearer to me.

Causal calculus is a way of reasoning about causality when the independence relationships between random variables are known, even if some of the variables are unobserved. It uses notation like

$$\alpha = P(Y=1 \mid do(X=2))$$

to mean the probability that Y=1 if an experimenter forces the variable X to be 2. Using Pearl’s calculus, it may be possible to estimate $\alpha$ from a large number of observations where X is free to vary, rather than performing the experiment where X is forced to be 2. This is not as straightforward as it might seem. We tend to conflate P(Y=1 | do(X=2)) with the conditional probability P(Y=1 | X=2). Below I will describe an example^{1}, based on Simpson’s paradox, where the two differ.

Suppose that there are two treatments for kidney stones: treatment A and treatment B. The following situation is possible:

- Patients that received treatment A recovered 33% of the time.
- Patients that received treatment B recovered 67% of the time.
- Treatment A is significantly better than treatment B.

This seemed very counterintuitive to me. How is this possible?

The problem is that there is a hidden variable in the kidney stone situation. Some kidney stones are larger and therefore harder to treat and others are smaller and easier to treat. If treatment A is usually applied to large stones and treatment B is usually used for small stones, then the recovery rate for each treatment is biased by the type of stone it treated.

Imagine that

- treatment A is given to one million people with a large stone and 1/3 of them recover,
- treatment A is given to one thousand people with a small stone and all of them recover,
- treatment B is given to one thousand people with a large stone and none of them recover,
- treatment B is given to one million people with a small stone and 2/3 of them recover.
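The counts above can be tabulated in a few lines of Python. This is just a sketch using the numbers from the scenario (the dictionary layout and variable names are mine); it prints both the overall recovery rate per treatment and the rate within each stone-size subgroup:

```python
# Recovery counts from the hypothetical scenario above:
# (treatment, stone size) -> (number recovered, number treated)
counts = {
    ("A", "large"): (1_000_000 // 3, 1_000_000),  # ~1/3 of a million recover
    ("A", "small"): (1_000, 1_000),               # all recover
    ("B", "large"): (0, 1_000),                   # none recover
    ("B", "small"): (2 * 1_000_000 // 3, 1_000_000),  # ~2/3 of a million recover
}

sizes = ("large", "small")

for t in ("A", "B"):
    recovered = sum(counts[(t, s)][0] for s in sizes)
    treated = sum(counts[(t, s)][1] for s in sizes)
    print(f"P(Recovery | Treatment {t}) ~ {recovered / treated:.3f}")

# Within each stone-size subgroup, treatment A beats treatment B:
for t in ("A", "B"):
    for s in sizes:
        r, n = counts[(t, s)]
        print(f"P(Recovery | Treatment {t}, {s} stone) = {r / n:.3f}")
```

The overall rates come out near 1/3 for A and 2/3 for B, even though A wins in both subgroups, which is exactly the paradox.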

Notice that about one-third of the treatment A patients recovered and about two-thirds of the treatment B patients recovered, and yet, treatment A is much better than treatment B. If you have a large stone, then treatment B is pretty much guaranteed to fail (0 out of 1000) and treatment A works about 1/3 of the time. If you have a small stone, treatment A is almost guaranteed to work, while treatment B only works 2/3 of the time.

Mathematically, P( Recovery | Treatment A) $\approx$ 1/3 (i.e., about 1/3 of the patients who got treatment A recovered).

The formula for P( Recovery | do(Treatment A)) is quite different. Here we force all patients (all 2,002,000 of them) to use treatment A. Since half of the population has a large stone and half has a small stone,

P( Recovery | do(Treatment A) ) $\approx$ 1/2 * 1/3 + 1/2 * 1 = 2/3.

Similarly for treatment B, P( Recovery | Treatment B) $\approx$ 2/3 and

P( Recovery | do(Treatment B) ) $\approx$ 1/3.
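The two interventional quantities above are instances of the adjustment formula: average the per-stone-size recovery rate over the population’s stone-size distribution. A minimal Python sketch using the same counts as before (the helper name `p_recovery_do` is mine):

```python
# Counts from the scenario: (treatment, stone size) -> (recovered, treated)
counts = {
    ("A", "large"): (1_000_000 // 3, 1_000_000),
    ("A", "small"): (1_000, 1_000),
    ("B", "large"): (0, 1_000),
    ("B", "small"): (2 * 1_000_000 // 3, 1_000_000),
}

sizes = ("large", "small")
total = sum(n for _, n in counts.values())  # 2,002,000 patients

# Marginal distribution of stone size over the whole population
# (here it works out to exactly 1/2 large, 1/2 small)
p_size = {
    s: sum(counts[(t, s)][1] for t in ("A", "B")) / total
    for s in sizes
}

def p_recovery_do(treatment):
    """P(Recovery | do(treatment)) via the adjustment formula:
    sum over stone sizes of P(Recovery | treatment, size) * P(size)."""
    return sum(
        (counts[(treatment, s)][0] / counts[(treatment, s)][1]) * p_size[s]
        for s in sizes
    )

print(f"P(Recovery | do(A)) ~ {p_recovery_do('A'):.3f}")  # ~ 2/3
print(f"P(Recovery | do(B)) ~ {p_recovery_do('B'):.3f}")  # ~ 1/3
```

Note that the adjustment weights each subgroup by P(size) for the whole population, not by P(size | treatment), which is precisely what separates do(Treatment) from conditioning on Treatment.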

This example may seem contrived, but as Nielsen said, “Keep in mind that this *really happened*.”

Edit Aug 8, 2013: Judea Pearl has a wonderful write-up on Simpson’s paradox titled “Simpson’s Paradox: An Anatomy” (2011?). I think equation (9) in the article has a typo on the right-hand side. I think it should read

$$P(E \mid do(\neg C)) = P(E \mid do(\neg C), F)\, P(F) + P(E \mid do(\neg C), \neg F)\, P(\neg F).$$