# Statistics

You are currently browsing the archive for the Statistics category.

## A Simple Minesweeper Puzzle

Suppose that you are playing the game Minesweeper.  On your first move, you click on the lower left corner square and reveal a 1.  Then you click on the square above the corner and reveal a 2.  Then you click on the square above that and reveal a 3.  What is the “safest” next move?

In order to talk about the contents of the blue squares, we will label them A,B,C,D, and E.

There are only three possible scenarios:

a) A, B, C, and E have mines,

b) A, C, and D have mines, or

c) B, C, and D have mines.

But, not all of these scenarios are equally likely.  Suppose there are a total of $m$ mines on the board and $s$ squares left excluding the eight that we are looking at. Then the total number of possible distributions of the mines for scenarios a, b, and c are:

• $n_a = {s\choose m-4},$
• $n_b= {s\choose m-3},$ and
• $n_c ={s\choose m-3}.$

These scenarios are not equally likely.  (Above we used choose notation.  ${n\choose m}= \frac{n!}{m! (n-m)!}$ where $n!=1\cdot2\cdot3\cdot\cdots\cdot n$.  For example 4!=24 and ${5\choose 2}=\frac{5!}{2!\ \cdot\ 3!} = \frac{120}{2 \cdot 6}= 10$.)  In fact,

\begin{aligned} r=\frac{n_b}{n_a}&=\frac{s\choose m-3}{s\choose m-4} \\&=\frac{\frac{s!}{(m-3)! (s-(m-3))!}}{\frac{s!}{(m-4)! (s-(m-4))!}}\\&=\frac{\frac{1}{(m-3)! (s-(m-3))!}}{\frac{1}{(m-4)! (s-(m-4))!}}\\&= \frac{(m-4)! (s-(m-4))!}{(m-3)! (s-(m-3))!}\\&= \frac{ (s-m+4)!}{(m-3) (s-m+3))!}\\&= \frac{ s-m+4}{m-3 }.\end{aligned}

In the beginning of the game $r\approx s/m-1\approx 4$, so scenarios b and c are about four times as likely as scenario a.  We can now estimate the probabilities of scenarios a, b, and c to be about

• “probability of scenario a” = $p_a \approx 1/9,$
• “probability of scenario b” = $p_b \approx 4/9, and$
• “probability of scenario c” = $p_c \approx 4/9.$

We can now conclude that the probability that square A has a mine is 5/9, that square B has a mine is 5/9, that square C has a mine is 100%, that square D has a mine is 8/9, and that square E has a mine is 1/9, so square E is the “safest” move.

Generally speaking, scenarios with more mines are less likely if less than half of the unknown squares have mines.

Another interesting way to think about it is that the 3 and 2 squares pulled the mines toward them making square E less likely to contain a mine.

You can approximate the probability of each scenario by just assuming that the squares are independent random variables (a false, but almost true assumption) each of which has probability $m/s$ of containing a mine.  Using that method gives the same results as the approximation above.

If you prefer an exact calculation, then use the formula

$$r=\frac{ s-m+4}{m-3 }$$

to get the exact ratio of $\frac{p_b}{p_a} = \frac{p_c}{p_a}=r$.

(PS:  Jennifer told me via email that you can play Minesweeper online at https://www.solitaireparadise.com/games_list/minesweeper.htm)

## A very basic description of the binomial distribution

I wrote this up for a friend this morning.  Hopefully someone else can benefit from it.

### Part 1 –  The relationship between the expansion of (x+y)^n and the binomial distribution

The binomial distribution for n coin flips and i heads is the probability of getting i heads with n independent coin flips assuming the probability of getting one head with one coin toss is p.  (In the explanations below, we will always assume that the coin flips are independent.)

There is a very strong relationship between the binomial distribution and the expansion of (x+y)^n.

For example, for n=1 if we set x=1/3 and y=2/3, then

(x+y)^1 = x+ y = 1/3 + 2/3.

If a person flips a biased coin once and the probability of heads is 1/3, then the two possible outcomes are

– one head with probability 1/3, or

– one tail with probability 2/3.

For n=2 if we set x=1/3 and y=2/3, then

(x+y)^2

= (x+ y)*(x+y)

= x*(x+y) + y*(x+y)

= x*x + x*y + y*x + y*y

= x^2 + 2*x*y + y^2

= (1/3)^2 + 2*(1/3)*(2/3) + (2/3)^2.

(I am using * to indicate multiplication which is common for programmers.  Mathematicians tend to omit the * and write  x y  to indicate x times y.)

If a person flips a biased coin twice and the probability of each heads is 1/3, then there are the following possible outcomes:

– two heads HH with probability (1/3)^2,

– one head and tail, HT or TH  2*(1/3)*(2/3), or

– two tails TT with probability (2/3)^2.

For n=3 if we set x=1/3 and y=2/3, then

(x+y)^3

= (x+y)*(x+y)^2

= (x+y)*(x*x + x*y +y*x + y*y)

= x*(x*x + x*y +y*x + y*y) + y*(x*x + x*y +y*x + y*y)

= x*x*x + x*x*y + x*y*x + x*y*y + y*x*x + y*x*y + y*y*x + y*y*y

= x^3 + 3*x^2*y + 3*x*y^2 + y^3

= (1/3)^3 + 3*(1/3)^2*(2/3)^2 + 3*(1/3)*(2/3)^2 + (2/3)^3.

If a person flips a biased coin three times and the probability of heads is 1/3, then there are the following possible outcomes:

– three Heads HHH with probability (1/3)^3,

– two heads and one tail, HHT, HTH, THH with probability   3*(1/3)^2*(2/3),

– one head and two tails, HTT, THT, TTH with probability   3*(1/3)*(2/3)^2, or

– three tails TTT with probability (2/3)^3.

Notice that every possible sequence of H’s and T’s for 3 flips can be obtained by fully expanding (x+y)^3 and replacing the x’s and y’s with H’s and T’s.  This shows that there is a very strong correspondence between the results of n coin flips and the expansion of (x+y)^n.  Note also that ( 1/3 + 2/3)^n = 1^n = 1, so all these probabilities will add up to 1.

### Part 2 -Pascal’s Triangle

If we look more closely at the expansion of (x+y)^n, we see a pattern.

(x+y)^0 = 1

(x+y)^1 = 1*x+1*y

(x+y)^2 = 1*x^2 + 2*x*y + 1*y^2

(x+y)^3 = 1*x^3 + 3*x^2*y + 3*x*y^2 +1* y^3

(x+y)^4 = 1*x^4 + 4*x^3*y + 6*x^2*y^2+ 4*x*y^3 + 1*y^4

(x+y)^5 = 1*x^5 + 5*x^4*y + 10*x^3*y^2+ 10*x^2*y^3 + 5*x*y^4 + 1*y^5.

Looking just at the numbers in each expansion gives us a triangle which is knows as Pascal’s triangle:

1

1    1

1    2   1

1  3     3  1

1  4    6   4  1

1 5   10 10  5  1

1 6 15 20 15 6 1

Every number is the sum of the two numbers “above” it.   These numbers can be obtained from the formula

(n  choose  i) = n!/ ( i! * (n-i)!)

where n is the row number and i=0,1,2,…, n.

For example, in the 5th row the third entry corresponds to i=2.

b(5,1) = 5!/( 2! * 3!) = 120/( 2 * 6) = 120/12=10.

Where does the formula for (n  choose  i)  come from?    It comes from the fact that there are exactly n! ways to order the numbers 1,2,3, …, n.  For example, for n=5, there are 5!=120 ways to order  the numbers 1,2,3, 4, and 5.

12345 12354 12435 12453 12534 12543 13245 13254 13425 13452 13524

13542 14235 14253 14325 14352 14523 14532 15234 15243 15324 15342

15423 15432 21345 21354 21435 21453 21534 21543 23145 23154 23415

23451 23514 23541 24135 24153 24315 24351 24513 24531 25134 25143

25314 25341 25413 25431 31245 31254 31425 31452 31524 31542 32145

32154 32415 32451 32514 32541 34125 34152 34215 34251 34512 34521

35124 35142 35214 35241 35412 35421 41235 41253 41325 41352 41523

41532 42135 42153 42315 42351 42513 42531 43125 43152 43215 43251

43512 43521 45123 45132 45213 45231 45312 45321 51234 51243 51324

51342 51423 51432 52134 52143 52314 52341 52413 52431 53124 53142

53214 53241 53412 53421 54123 54132 54213 54231 54312 54321

We could call these 120 orderings.

If we look at the first two numbers in every 5 digit ordering above, there are only 10 possible prefixes:  12, 13, 14, 15, 23, 24, 25, 34, 35, and 45.  So there are 10 ways to choose two numbers from a list 1,2,3,4, 5.  Mathematicians would say that

(5 choose 2) = 10.

After you choose 2 numbers, there are 3 remaining.   We can get every one of the 120 ordering by

a) taking each of the 10 choices of 2 numbers from 5,

b) looking at how many ways each of the chose numbers can be ordered.  In this case the chosen can be ordered 2!=2 ways.  (For example, 13 could ordered as 13 or 31.), and

c)  looking at how many ways the unchosen numbers can be ordered.  In this case each choice can be ordered 3!=6 ways.  (For example, if we chose 13, then 2, 4, and 5 remain and those numbers can be ordered as  245, 254, 425, 452, 524, and 542.)

So there were 2! orderings for the first 2 and 3! orderings for the remaining 3.  All 120 orderings can be found by

a) one of 10 choices,  (10 = 5 choose 2),

b) one of the 2 ways to order the chosen, (2=2!) and

c) one of the 6 ways to order the unchosen (6=3!).

The resulting formula is

Number of orderings of 1,2,3,4, and 5  =  5! = 120 = 10*2*6 = (5 choose 2)*2!*3!.

In general, if you have the numbers 1,2,…, n, and you choose i numbers, then every one of the n! orderings can be reproduced from

a)  (n choose i) ways to select i numbers from the list 1,2,3,…,n,

b)  i! ways to order the i chosen numbers, and

c)  (n-i)! ways to order the unchosen.

The resulting formula is

n! = (n choose i) * i! * (n-i)!  .

If we divide both sides by  i! * (n-i)!, we get

n! / (  i! * (n-i)!  ) = (n choose i).

### Part 3 – Conclusion

We showed that there is a very strong relationship between the expansion of (x+y)^n and the binomial distribution of n coin flips.  The coefficient for each term of (x+y)^n has the formula

(n choose i) = n!/( i! *  (n-i)! ).

We derived this formula.    The final result is that the probability of getting  i   heads when flipping   n   coins  each of which has probability p of heads is

“the number of ways of choosing which of the n coins are heads”  *  “the probability of flipping i heads in a row” * “ the probability of flipping (n-i) tails in a row”

= (n choose i) * p^i * (1-p)^(n-i)

= n!.( i! * (n-i)! ) * p^i * (1-p)^(n-i).

## When does the MGF of A-C exist?

On Math stack exchange, purpleostrich asked “Consider random variables A, B, and C. We know that A = B + C. We also know that A and C have an MGF. Is it the case that B must have a MGF?”

#### You Can’t Compute the MGF

In general, you can’t compute the MGF of $B$ if you only know the MGFs of $A$ and $C$. For example, consider two possible joint distributions of $A$ and $C$:

Case 1: P( A=0 and C=0) = 1/2 and P(A=1 and C=1)=1/2. In this case, the MGFs of A and C are $(1+\exp(t))/2$ and the MGF of B is 1.

Case 2: P( A=0 and C=1) = 1/2 and P(A=1 and C=0)=1/2. In this case, the MGFs of A and C are $(1+\exp(t))/2$ and the MGF of B is $\frac{\exp(-t)+\exp(t)}2$.

Notice that in both Case 1 and Case 2 the MGFs for $A$ and $C$ were $(1+exp(t))/2$, but the MGF for $B$ changed from Case 1 to Case 2.

#### You can prove the MGF exists

Although you can’t computer the MGF of $B$, you can prove that $M_B(t)$ exists for $t\in D=\frac12 (Dom(M_A)\cap (-Dom(M_C))$. Suppose $t\in D$. Then $||\exp(ta)||_1<\infty$ and $||\exp(-tc)||_1<\infty$ where $||g||_p=\left(\int\int |g(a,c)|^p\; f(a,c)\; da\; dc\right)^{1/p}$ is the $L_p$-norm of $g$ over the joint probability space and $f(a,c)$ is the joint pdf of $A$ and $C$. That implies $||\exp(ta/2)||_2 < \infty$ and $||\exp(-tc/2)||_2 < \infty$. By the Hölder’s inequality or, more specifically, Schwarz inequality, $||\exp(ta)\exp(-tc)||_1<\infty$. But, $||\exp(ta)\exp(-tc)||_1= ||\exp(t(a-c)||_1= E[\exp(tB)]=M_B(t).$ This proves that $M_B(t)$ exists for $t\in D$.

#### If A and C are independent

If $A$ and $C$ are independent and $B = A-C$, then it must be the case that
$$M_B(t) = M_A(t)\cdot M_C(-t)$$
whenever $t\in Dom(M_A)\cap(-Dom(M_C))$ (see e.g. Wikipedia). Here is a rough proof.

If $t\in Dom(M_A)\cap(-Dom(M_C))$, then
$$M_A(t)\cdot M_C(-t) = \int_{a=-\infty}^\infty \exp(t a) dF_A(a) \cdot \int_{c=-\infty}^\infty \exp(-t c) dF_C(c)$$
$$= \int_{a=-\infty}^\infty \int_{c=-\infty}^\infty \exp(t (a-c)) dF_A(a) dF_C(c)$$
$$= \int_{b=-\infty}^\infty \exp(t b) dF_B(b) = M_B(t)$$
where $F_A, F_B$, and $F_C$ are the cumulative distribution functions of $A, B$, and $C$ respectively.

## Removing 0 from a distributoin

An interesting mathematical problem came up at work today.  I had to find a formula for the standard deviation of a binomial distribution given that the random variable was not zero.  I put some notes below summarizing my results.

### Removing 0 from any Distribution

Suppose that you have a random variable $X$.  What are the values of $\mu_0 := E[X | X\neq 0]$ and $\sigma_0 := \sqrt{E[ (X-\mu_0)^2| X\neq 0]}$?  After doing some algebra, I got

$$\mu_0 = \bar{X}/(1-p_0), \quad\mathrm{and}$$

$$\sigma_0 = \sqrt{ \frac{\sigma_X^2 – p_0({\bar{X}}^2+\sigma_X^2)}{\left(1-p_0\right)^2}}= \sqrt{\frac{\sigma_X^2}{1-p_0} \;-\; \frac{ p_0 \bar{X}^2}{(1-p_0)^2}}$$

where $p_0:=P(X=0)$, $\bar{X}=E[X]$, and $\sigma_X := \sqrt{E\left[\left(X-\bar{X}\right)^2\right]}\,$.

Notice that if $p_0=0$ then the right hand side reduces to $\sigma_X$.

### Bernoulli Distribution

If we apply the formulas above to the Bernoulli Distribution where $X$ is either 0 or 1 and $P(X=1)=p$, then $p_0 = (1-p)$, $\bar{X}=p$, and $\sigma_X^2 = p(1-p)$, so $\mu_0 = p/(1-(1-p))=1$ and

$$\sigma_0 = \sqrt{\frac{\sigma_X^2}{1-p_0} – \frac{ p_0 \bar{X}^2}{(1-p_0)^2}}=\sqrt{\frac{p(1-p)}{p} – \frac{ (1-p)p^2}{p^2}}=0.$$

That is to be expected because if $X$ is not 0, then it must be 1.

### Binomial Distribution

Anyway, I really wanted to apply these formulas to the Binomial Distribution.  For the Binomial Distribution, $p_0=(1-p)^n$, $\bar{X} = np$, and $\sigma_X = \sqrt{n p (1-p)}$.  So,

$$\mu_0 = n p/(1-(1-p)^n), \quad\mathrm{and}$$

\begin{align}\sigma_0&= \sqrt{ \frac{n p (1-p) – (1-p)^n(n^2p^2+n p (1-p))}{\left(1-(1-p)^n\right)^2} }\\&= \sqrt{ n p \frac{ (1-p) – (1-p)^n(np+ (1-p))}{\left(1-(1-p)^n\right)^2}.}\end{align}

Notice that if $n=1$ then $\mu_0=1$ and $\sigma_0=0$ which makes sense because if $n=1$ and $X\neq0$ then $X$ is always 1.  Also notice that $\lim_{n->\infty} (\mu_0 – n p) = 0$ and $\lim_{n->\infty} (\sigma_0 – \sqrt{n p (1-p)}) = 0$ which is to be expected because $\lim_{n->\infty} p_0=0$. (I am assuming $0< p<1$.)

## Simple Fact about Maximizing a Gaussian

Over the last few weeks, I’ve been working with some tricky features. Interestingly, I needed to add noise to the features to improve my classifier performance.  I will write a post on these “confounding features” later.  For now, let me just point out the following useful fact.

If

$$f(x, \sigma) = {1\over{\sigma \sqrt{2 \pi}}} \exp{\left(-{{x^2}\over{2 \sigma^2}}\right)},$$

then

$$\max_\sigma f(x,\sigma) = f(x, |x|).$$

So, if you have a Gaussian with mean zero and you want to fatten it to maximize the likelihood of the probability density function at $x$ without changing the mean, then set the standard deviation to $|x|$.

## Zipf’s Law, ArtEnt Blog Hits

As I look at the hit statistics for the last quarter, I cannot help but wonder how well they fit Zipf’s law (a.k.a. Power Laws, Zipf–Mandelbrot law, discrete Pareto distribution).  Zipf’s law states that the distribution of many ranked things like city populations, country populations, blog hits, word frequency distribution, probability distribution of questions for Alicebot, Wikipedia Hits, terrorist attacksthe response time of famous scientists, … look like a line when plotted on a log-log diagram.  So here are the numbers for my blog hits and, below that, a plot of log(blog hits) vs log(rank) :

Not too linear.  Hmmm.

(Though Zipf’s “law” has been known for a long time, this post is at least partly inspired by Tarence Tao’s wonderful post “Benford’s law, Zipf’s law, and the Pareto distribution“.)

## The Exact Standard Deviation of the Sample Median

In a previous post, I gave the well-known approximation to the standard deviation of the sample median

$$\sigma \approx {1 \over 2\sqrt{n}\,f(x_m)}$$

where $f(x)$ is the probability density function and $x_m$ is the median (see Laplace and Kenney and Keeping).  Here are some examples.

 Distribution Median Approx StD of Median Standard Gaussain mean 0 std 1 0 $\sqrt{\pi \over{2 n}}$ Uniform 0 to 1 1/2 $1\over{2\sqrt{n}}$ Logistic with mean 0 and shape $\beta$ 0 ${2\beta}\over{\sqrt{n}}$ Student T with mean 0 and $\nu$ deg free 0 $\frac{\sqrt{\nu }\ B\left(\frac{\nu }{2},\frac{1}{2}\right)}{2 \sqrt{n}}$

$\$

Computing the exact standard deviation of the sample median is more difficult. You first need to find the probability density function of the sample median which is

$$f_m(x) = g(c(x)) f(x)$$

where

$$g(x) = \frac{(1-x)^{\frac{n-1}{2}} x^{\frac{n-1}{2}}}{B\left(\frac{n+1}{2},\frac{n+1}{2}\right)},$$
$B$ is the beta function, $c(x)$ is the cumulative distribution function of the sample distribution, and $f(x)$ is the probability density function of the sample distribution.

Now the expected value of the sample median is

$$\mu_m = \int x f_m(x) dx$$

and the standard deviation of the sample median is

$$\sigma_m = \sqrt{\int (x-\mu_m)^2 f_m(x)\ dx}.$$

Generally speaking, these integrals are hard, but they are fairly simple for the uniform distribution.  If the sample distribution is uniform between 0 and 1, then

$f(x) = 1,$

$c(x) = x,$

$g(x) = \frac{(1-x)^{\frac{n-1}{2}} x^{\frac{n-1}{2}}}{B\left(\frac{n+1}{2},\frac{n+1}{2}\right)},$

$f_m(x) = g(x),$

$\mu_m = \int x g(x) dx \ =\ 1/2,$ and

$$\sigma_m = \sqrt{\int (x-\mu_m)^2 f_m(x)\ dx}\ = \ {1\over{2\sqrt{n+2}}}$$

which is close to the approximation given in the table.

(Technical note:  The formulas above only apply for odd values of $n$ and continuous sample probability distributions.)

If you want the standard deviation for the sample median of a particular distribution and a $n$, then you can use numerical integration to get the answer. If you like, I could compute it for you.  Just leave a comment indicating the distribution and $n$.

## Simpson’s paradox and Judea Pearl’s Causal Calculus

Michael Nielsen wrote an interesting, informative, and lengthy blog post on Simpson’s paradox and causal calculus titled “If correlation doesn’t imply causation, then what does?”  Nielsen’s post reminded me of Judea Pearl‘s talk at KDD 2011 where Pearl described his causal calculus.  At the time I found it hard to follow, but Nielsen’s post made it more clear to me.

Causal calculus is a way of reasoning about causality if the independence relationships between random variables are known even if some of the variables are unobserved.  It uses notation like

$\alpha$ = P( Y=1 | do(X=2))

to mean the probability that Y=1 if an experimenter forces the X variable to be 2. Using the Pearl’s calculus, it may be possible to estimate $\alpha$ from a large number of observations where X is free rather than performing the experiment where X is forced to be 2.  This is not as straight forward as it might seem. We tend to conflate P(Y=1 | do(X=2)) with the conditional probability P(Y=1 | X=2). Below I will describe an example1, based on Simpson’s paradox, where they are different.

Suppose that there are two treatments for kidney stones: treatment A and treatment B.  The following situation is possible:

• Patients that received treatment A recovered 33% of the time.
• Patients that received treatment B recovered 67% of the time.
• Treatment A is significantly better than treatment B.

This seemed very counterintuitive to me.  How is this possible?

The problem is that there is a hidden variable in the kidney stone situation. Some kidney stones are larger and therefore harder to treat and others are smaller and easier to treat.  If treatment A is usually applied to large stones and treatment B is usually used for small stones, then the recovery rate for each treatment is biased by the type of stone it treated.

Imagine that

• treatment A is given to one million people with a large stone and 1/3 of them recover,
• treatment A is given to one thousand people with a small stone and all of them recover,
• treatment B is given to one thousand people with a large stone and none of them recover,
• treatment B is given to one million people with a small stone and 2/3 of them recover.

Notice that about one-third of the treatment A patients recovered and about two-thirds of the treatment B patients recovered, and yet, treatment A is much better than treatment B.  If you have a large stone, then treatment B is pretty much guaranteed to fail (0 out of 1000) and treatment A works about 1/3 of the time. If you have a small stone, treatment A is almost guaranteed to work, while treatment B only works 2/3 of the time.

Mathematically P( Recovery | Treatment A) $\approx$ 1/3   (i.e.  about 1/3 of the patients who got treatment A recovered).

The formula for P( Recovery | do(Treatment A)) is much different.  Here we force all patients (all 2,002,000 of them) to use treatment A.  In that case,

P( Recovery | do(Treatment A) ) $\approx$ 1/2*1/3 + 1/2*1 = 2/3.

Similarly for treatment B, P( Recovery |  Treatment B) $\approx$ 2/3 and

P( Recovery | do(Treatment B) ) $\approx$ 1/3.

This example may seemed contrived, but as Nielsen said, “Keep in mind that this really happened.”

Edit Aug 8,2013:  Judea Pearl has a wonderful write-up on Simpson’s paradox titled “Simpson’s Paradox: An Anatomy” (2011?).  I think equation (9) in the article has a typo on the right-hand side.  I think it should read

$$P (E |do(\neg C)) = P (E |do(\neg C),F) P (F ) +P (E | do(\neg C), \neg F) P (\neg F ).$$

## Corey Chivers’ Introduction to Bayesian Methods

Corey Chivers’ created a beautiful set of slides introducing the reader to Bayesian Inference, Metropolis-Hastings (Markov Chain Monte Carlo), and Hyper Parameters with some applications to biology.

## “Introduction To Monte Carlo Algorithms”

Introduction To Monte Carlo Algorithms” by Werner Krauth (1998) is a cute, simple introduction to Monte Carlo algorithms with both amusing hand drawn images and informative graphics.  He starts with a simple game from Monaco for estimating $\pi$ (not Buffon’s Needle) and revises it to introduce the concepts of Markov Chain Monte Carlo (MCMC), ergodicity, rejection, and Detailed Balance (see also Haystings-Metropolis).  He then introduces the hard sphere MCMC example, Random Sequential Absorption, the 8-puzzle, and ends with some accelerated algorithms for sampling.

« Older entries