Statistics


On Math Stack Exchange, purpleostrich asked, “Consider random variables A, B, and C. We know that A = B + C. We also know that A and C have an MGF. Is it the case that B must have an MGF?”

Here is my answer:

You Can’t Compute the MGF

In general, you can’t compute the MGF of $B$ if you only know the MGFs of $A$ and $C$. For example, consider two possible joint distributions of $A$ and $C$:

Case 1: $P(A=0 \text{ and } C=0) = 1/2$ and $P(A=1 \text{ and } C=1) = 1/2$. In this case, the MGFs of $A$ and $C$ are $(1+\exp(t))/2$ and the MGF of $B$ is 1.

Case 2: $P(A=0 \text{ and } C=1) = 1/2$ and $P(A=1 \text{ and } C=0) = 1/2$. In this case, the MGFs of $A$ and $C$ are $(1+\exp(t))/2$ and the MGF of $B$ is $\frac{\exp(-t)+\exp(t)}{2}$.

Notice that in both Case 1 and Case 2 the MGFs for $A$ and $C$ were $(1+\exp(t))/2$, but the MGF for $B$ changed from Case 1 to Case 2.

 

You Can Prove the MGF Exists

Although you can’t compute the MGF of $B$, you can prove that $M_B(t)$ exists for $t\in D=\frac12 \left(Dom(M_A)\cap (-Dom(M_C))\right)$. Suppose $t\in D$. Then $2t\in Dom(M_A)$ and $-2t\in Dom(M_C)$, so $||\exp(2ta)||_1<\infty$ and $||\exp(-2tc)||_1<\infty$, where $||g||_p=\left(\int\int |g(a,c)|^p\; f(a,c)\; da\; dc\right)^{1/p}$ is the $L_p$-norm of $g$ over the joint probability space and $f(a,c)$ is the joint pdf of $A$ and $C$. That implies $||\exp(ta)||_2 < \infty$ and $||\exp(-tc)||_2 < \infty$. By Hölder’s inequality or, more specifically, the Schwarz inequality, $||\exp(ta)\exp(-tc)||_1<\infty$. But $||\exp(ta)\exp(-tc)||_1= ||\exp(t(a-c))||_1= E[\exp(tB)]=M_B(t).$ This proves that $M_B(t)$ exists for $t\in D$.
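Here is a quick numerical sanity check of that bound (a minimal Python sketch of my own; the correlated bivariate normal used for $(A,C)$ is just an arbitrary example). It estimates $M_B(t)$ by Monte Carlo and compares it to the Cauchy–Schwarz bound $\sqrt{M_A(2t)\,M_C(-2t)}$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Arbitrary joint distribution for (A, C): a correlated bivariate normal.
cov = [[2.0, 0.8],
       [0.8, 1.0]]
a, c = rng.multivariate_normal([0.0, 0.0], cov, size=1_000_000).T
b = a - c                                   # A = B + C, so B = A - C

for t in [0.25, 0.5, 1.0]:
    mgf_b = np.mean(np.exp(t * b))          # empirical M_B(t)
    bound = np.sqrt(np.mean(np.exp(2 * t * a)) *
                    np.mean(np.exp(-2 * t * c)))   # sqrt(M_A(2t) M_C(-2t))
    print(f"t={t}: M_B(t) ~ {mgf_b:.3f} <= bound ~ {bound:.3f}")
```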

 

If A and C Are Independent

If $A$ and $C$ are independent and $B = A-C$, then it must be the case that
$$
M_B(t) = M_A(t)\cdot M_C(-t)
$$
whenever $t\in Dom(M_A)\cap(-Dom(M_C))$ (see e.g. Wikipedia). Here is a rough proof.

If $t\in Dom(M_A)\cap(-Dom(M_C))$, then
$$M_A(t)\cdot M_C(-t) = \int_{a=-\infty}^\infty \exp(t a) dF_A(a) \cdot \int_{c=-\infty}^\infty \exp(-t c) dF_C(c)$$
$$
= \int_{a=-\infty}^\infty \int_{c=-\infty}^\infty \exp(t (a-c)) dF_A(a) dF_C(c)
$$
$$
= \int_{b=-\infty}^\infty \exp(t b) dF_B(b) = M_B(t)
$$
where $F_A, F_B$, and $F_C$ are the cumulative distribution functions of $A, B$, and $C$ respectively.
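As a quick check of the independent case, here is a small sketch of my own, with an arbitrary choice of $A\sim N(0,1)$ and $C\sim\mathrm{Exp}(1)$, for which $M_A(t)=e^{t^2/2}$ and $M_C(-t)=1/(1+t)$ when $t>-1$:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1_000_000

# Independent A ~ N(0,1) and C ~ Exponential(1); B = A - C.
a = rng.normal(size=n)
c = rng.exponential(size=n)
b = a - c

for t in [0.2, 0.5]:
    lhs = np.mean(np.exp(t * b))             # empirical M_B(t)
    rhs = np.exp(t**2 / 2) / (1.0 + t)       # M_A(t) * M_C(-t) in closed form
    print(f"t={t}: empirical M_B(t)={lhs:.4f}, M_A(t)M_C(-t)={rhs:.4f}")
```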

 

 

An interesting mathematical problem came up at work today.  I had to find a formula for the standard deviation of a binomial random variable given that it was not zero.  I put some notes below summarizing my results.

Removing 0 from any Distribution

Suppose that you have a random variable $X$.  What are the values of $\mu_0 := E[X | X\neq 0]$ and $\sigma_0 := \sqrt{E[ (X-\mu_0)^2| X\neq 0]}$?  After doing some algebra, I got

$$\mu_0 = \bar{X}/(1-p_0), \quad\mathrm{and}$$

$$\sigma_0 = \sqrt{ \frac{\sigma_X^2 - p_0({\bar{X}}^2+\sigma_X^2)}{\left(1-p_0\right)^2}}= \sqrt{\frac{\sigma_X^2}{1-p_0} \;-\; \frac{ p_0 \bar{X}^2}{(1-p_0)^2}}$$

where $p_0:=P(X=0)$, $\bar{X}=E[X]$, and $\sigma_X := \sqrt{E[\left(X-\bar{X}\right)^2]}\,$.

Notice that if $p_0=0$, then these formulas reduce to $\mu_0=\bar{X}$ and $\sigma_0=\sigma_X$, as expected.
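Here is a quick Monte Carlo check of the two formulas (a sketch of my own; the Poisson is just an arbitrary distribution with positive mass at 0):

```python
import numpy as np

rng = np.random.default_rng(2)

# Arbitrary example: X ~ Poisson(1.5), which puts positive mass at 0.
x = rng.poisson(1.5, size=1_000_000)

p0    = np.mean(x == 0)       # estimate of P(X = 0)
x_bar = x.mean()
sigma = x.std()

# Formulas from the post.
mu0_formula    = x_bar / (1 - p0)
sigma0_formula = np.sqrt(sigma**2 / (1 - p0) - p0 * x_bar**2 / (1 - p0)**2)

# Direct estimates conditioned on X != 0.
nz = x[x != 0]
print(mu0_formula, nz.mean())        # the two values should agree closely
print(sigma0_formula, nz.std())
```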

Bernoulli Distribution

If we apply the formulas above to the Bernoulli Distribution where $X$ is either 0 or 1 and $P(X=1)=p$, then $p_0 = (1-p)$, $\bar{X}=p$, and $\sigma_X^2 = p(1-p)$, so $\mu_0 = p/(1-(1-p))=1$ and

$$\sigma_0 = \sqrt{\frac{\sigma_X^2}{1-p_0} - \frac{ p_0 \bar{X}^2}{(1-p_0)^2}}=\sqrt{\frac{p(1-p)}{p} - \frac{ (1-p)p^2}{p^2}}=0.$$

That is to be expected because if $X$ is not 0, then it must be 1.

Binomial Distribution

Anyway, I really wanted to apply these formulas to the Binomial Distribution.  For the Binomial Distribution, $p_0=(1-p)^n$, $\bar{X} = np$, and $\sigma_X = \sqrt{n p (1-p)}$.  So,

$$\mu_0 = n p/(1-(1-p)^n), \quad\mathrm{and}$$

$$\begin{align}\sigma_0&= \sqrt{  \frac{n p (1-p) - (1-p)^n(n^2p^2+n p (1-p))}{\left(1-(1-p)^n\right)^2} }\\&= \sqrt{  n p\, \frac{ (1-p) - (1-p)^n(np+ (1-p))}{\left(1-(1-p)^n\right)^2}}.\end{align}$$

Notice that if $n=1$ then $\mu_0=1$ and $\sigma_0=0$, which makes sense because if $n=1$ and $X\neq0$ then $X$ is always 1.  Also notice that $\lim_{n\to\infty} (\mu_0 - n p) = 0$ and $\lim_{n\to\infty} (\sigma_0 - \sqrt{n p (1-p)}) = 0$, which is to be expected because $\lim_{n\to\infty} p_0=0$. (I am assuming $0< p<1$.)
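A short check of the binomial formulas against a direct computation from the probability mass function (my own sketch; $n=20$ and $p=0.1$ are arbitrary choices):

```python
import numpy as np
from scipy.stats import binom

n, p = 20, 0.1
k   = np.arange(n + 1)
pmf = binom.pmf(k, n, p)

# Direct computation of E[X | X != 0] and Std[X | X != 0] from the pmf.
cond   = pmf[1:] / (1 - pmf[0])
mu0    = np.sum(k[1:] * cond)
sigma0 = np.sqrt(np.sum((k[1:] - mu0) ** 2 * cond))

# Closed-form expressions from the post.
p0           = (1 - p) ** n
mu0_formula  = n * p / (1 - p0)
sig0_formula = np.sqrt(n * p * ((1 - p) - p0 * (n * p + (1 - p))) / (1 - p0) ** 2)

print(mu0, mu0_formula)        # should match
print(sigma0, sig0_formula)
```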

Over the last few weeks, I’ve been working with some tricky features. Interestingly, I needed to add noise to the features to improve my classifier performance.  I will write a post on these “confounding features” later.  For now, let me just point out the following useful fact.

If

$$f(x, \sigma) = {1\over{\sigma \sqrt{2 \pi}}} \exp{\left(-{{x^2}\over{2 \sigma^2}}\right)},$$

then

$$\max_\sigma f(x,\sigma) = f(x, |x|).$$

So, if you have a Gaussian with mean zero and you want to fatten it to maximize the likelihood of the probability density function at $x$ without changing the mean, then set the standard deviation to $|x|$.
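A quick numerical confirmation (my own sketch; the test point $x=1.7$ is arbitrary):

```python
import numpy as np
from scipy.optimize import minimize_scalar

def f(x, sigma):
    """Density of a zero-mean Gaussian with standard deviation sigma, at x."""
    return np.exp(-x**2 / (2 * sigma**2)) / (sigma * np.sqrt(2 * np.pi))

x = 1.7   # arbitrary test point
res = minimize_scalar(lambda s: -f(x, s), bounds=(1e-6, 100), method="bounded")
print(res.x, abs(x))              # the maximizer should be approximately |x|
print(f(x, res.x), f(x, abs(x)))  # and the maximum value should be f(x, |x|)
```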

As I look at the hit statistics for the last quarter, I cannot help but wonder how well they fit Zipf’s law (a.k.a. power laws, the Zipf–Mandelbrot law, the discrete Pareto distribution).  Zipf’s law states that the distributions of many ranked things, like city populations, country populations, blog hits, word frequency distributions, the probability distribution of questions for Alicebot, Wikipedia hits, terrorist attacks, the response time of famous scientists, …, look like a line when plotted on a log-log diagram.  So here are the numbers for my blog hits and, below that, a plot of log(blog hits) vs. log(rank):

 

“Deep Support Vector Machines for Regression Problems” 400
Simpson’s paradox and Judea Pearl’s Causal Calculus 223
Standard Deviation of Sample Median 220
100 Most useful Theorems and Ideas in Mathematics 204
Computer Evaluation of the best Historical Chess Players 181
Notes on “A Few Useful Things to Know about Machine Learning” 178
Comet ISON, Perihelion, Mars, and the rule of 13.3 167
Dropout – What happens when you randomly drop half the features? 139
The Exact Standard Deviation of the Sample Median 101
Bengio LeCun Deep Learning Video 99
Category Theory ? 92
“Machine Learning Techniques for Stock Prediction” 89
Approximation of KL distance between mixtures of Gaussians 75
“A Neuro-evolution Approach to General Atari Game Playing” 74
The 20 most striking papers, workshops, and presentations from NIPS 2012 65
Matlab code and a Tutorial on DIRECT Optimization 61
About 51

 

 

[Figure: plot of log(blog hits) vs. log(rank)]

 

Not too linear.  Hmmm.
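To put a number on “not too linear,” here is a short sketch of my own that fits a least-squares line to the log-log data; under a pure Zipf law the residuals would be small and the slope would be the negative of the Zipf exponent.

```python
import numpy as np

# Hit counts from the list above, already sorted by rank.
hits = np.array([400, 223, 220, 204, 181, 178, 167, 139, 101,
                 99, 92, 89, 75, 74, 65, 61, 51])
rank = np.arange(1, len(hits) + 1)

# Least-squares line through (log rank, log hits); Zipf's law predicts
# log(hits) ~ const - s * log(rank) for some exponent s.
slope, intercept = np.polyfit(np.log(rank), np.log(hits), 1)
print("fitted exponent:", -slope)

# Residuals give a rough sense of how far the data is from a pure power law.
pred = intercept + slope * np.log(rank)
print("max |residual| in log space:", np.max(np.abs(np.log(hits) - pred)))
```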

(Though Zipf’s “law” has been known for a long time, this post is at least partly inspired by Terence Tao’s wonderful post “Benford’s law, Zipf’s law, and the Pareto distribution”.)

In a previous post, I gave the well-known approximation to the standard deviation of the sample median

$$\sigma \approx {1 \over 2\sqrt{n}\,f(x_m)}$$

where $f(x)$ is the probability density function and $x_m$ is the median (see Laplace and Kenney and Keeping).  Here are some examples.

| Distribution | Median | Approx. StD of Median |
| --- | --- | --- |
| Standard Gaussian, mean 0, std 1 | 0 | $\sqrt{\pi \over{2 n}}$ |
| Uniform on 0 to 1 | 1/2 | $1\over{2\sqrt{n}}$ |
| Logistic with mean 0 and shape $\beta$ | 0 | ${2\beta}\over{\sqrt{n}}$ |
| Student T with mean 0 and $\nu$ deg. free. | 0 | $\frac{\sqrt{\nu }\  B\left(\frac{\nu }{2},\frac{1}{2}\right)}{2 \sqrt{n}}$ |


Computing the exact standard deviation of the sample median is more difficult. You first need to find the probability density function of the sample median which is

$$f_m(x) = g(c(x)) f(x)$$

where

$$g(x) = \frac{(1-x)^{\frac{n-1}{2}} x^{\frac{n-1}{2}}}{B\left(\frac{n+1}{2},\frac{n+1}{2}\right)},$$
$B$ is the beta function, $c(x)$ is the cumulative distribution function of the sample distribution, and $f(x)$ is the probability density function of the sample distribution.

Now the expected value of the sample median is

$$\mu_m = \int x f_m(x) dx$$

and the standard deviation of the sample median is

$$\sigma_m = \sqrt{\int (x-\mu_m)^2 f_m(x)\ dx}. $$

 

Generally speaking, these integrals are hard, but they are fairly simple for the uniform distribution.  If the sample distribution is uniform between 0 and 1, then

$f(x) = 1,$

$c(x) = x,$

$g(x) = \frac{(1-x)^{\frac{n-1}{2}} x^{\frac{n-1}{2}}}{B\left(\frac{n+1}{2},\frac{n+1}{2}\right)},$

$f_m(x) = g(x),$

$\mu_m = \int x g(x) dx \ =\ 1/2,$ and

$$\sigma_m = \sqrt{\int (x-\mu_m)^2 f_m(x)\ dx}\  = \ {1\over{2\sqrt{n+2}}} $$

which is close to the approximation given in the table.

 

(Technical note:  The formulas above only apply for odd values of $n$ and continuous sample probability distributions.)

If you want the standard deviation of the sample median for a particular distribution and a particular $n$, then you can use numerical integration to get the answer, as in the sketch below. If you like, I could compute it for you.  Just leave a comment indicating the distribution and $n$.
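Here is a sketch of that numerical integration (my own code, assuming scipy is available; `dist` can be any frozen continuous scipy.stats distribution and $n$ must be odd):

```python
import numpy as np
from scipy import integrate, special, stats

def sample_median_std(dist, n):
    """Standard deviation of the sample median for odd n and a continuous
    distribution `dist`, by numerically integrating f_m(x) = g(F(x)) f(x)."""
    assert n % 2 == 1, "formula assumes an odd sample size"
    B = special.beta((n + 1) / 2, (n + 1) / 2)

    def f_m(x):
        u = dist.cdf(x)
        return (1 - u) ** ((n - 1) / 2) * u ** ((n - 1) / 2) / B * dist.pdf(x)

    lo, hi = dist.ppf(1e-12), dist.ppf(1 - 1e-12)
    mu  = integrate.quad(lambda x: x * f_m(x), lo, hi)[0]
    var = integrate.quad(lambda x: (x - mu) ** 2 * f_m(x), lo, hi)[0]
    return np.sqrt(var)

# Check against the closed form 1/(2*sqrt(n+2)) for the uniform distribution,
# and against the approximation sqrt(pi/(2n)) for the standard normal.
n = 11
print(sample_median_std(stats.uniform(), n), 1 / (2 * np.sqrt(n + 2)))
print(sample_median_std(stats.norm(), n), np.sqrt(np.pi / (2 * n)))
```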

Michael Nielsen wrote an interesting, informative, and lengthy blog post on Simpson’s paradox and causal calculus titled “If correlation doesn’t imply causation, then what does?”  Nielsen’s post reminded me of Judea Pearl‘s talk at KDD 2011 where Pearl described his causal calculus.  At the time I found it hard to follow, but Nielsen’s post made it more clear to me.

 

Causal calculus is a way of reasoning about causality when the independence relationships among the random variables are known, even if some of the variables are unobserved.  It uses notation like

$\alpha$ = P( Y=1 | do(X=2))

to mean the probability that Y=1 if an experimenter forces the X variable to be 2. Using Pearl’s calculus, it may be possible to estimate $\alpha$ from a large number of observations where X is free rather than by performing the experiment where X is forced to be 2.  This is not as straightforward as it might seem. We tend to conflate P(Y=1 | do(X=2)) with the conditional probability P(Y=1 | X=2). Below I will describe an example, based on Simpson’s paradox, where they are different.

Suppose that there are two treatments for kidney stones: treatment A and treatment B.  The following situation is possible:

  • Patients that received treatment A recovered 33% of the time.  
  • Patients that received treatment B recovered 67% of the time.  
  • Treatment A is significantly better than treatment B.

This seemed very counterintuitive to me.  How is this possible?

The problem is that there is a hidden variable in the kidney stone situation. Some kidney stones are larger and therefore harder to treat and others are smaller and easier to treat.  If treatment A is usually applied to large stones and treatment B is usually used for small stones, then the recovery rate for each treatment is biased by the type of stone it treated.

Imagine that

  • treatment A is given to one million people with a large stone and 1/3 of them recover,
  • treatment A is given to one thousand people with a small stone and all of them recover,
  • treatment B is given to one thousand people with a large stone and none of them recover,
  • treatment B is given to one million people with a small stone and 2/3 of them recover.

Notice that about one-third of the treatment A patients recovered and about two-thirds of the treatment B patients recovered, and yet, treatment A is much better than treatment B.  If you have a large stone, then treatment B is pretty much guaranteed to fail (0 out of 1000) and treatment A works about 1/3 of the time. If you have a small stone, treatment A is almost guaranteed to work, while treatment B only works 2/3 of the time.

Mathematically P( Recovery | Treatment A) $\approx$ 1/3   (i.e.  about 1/3 of the patients who got treatment A recovered).

The formula for P( Recovery | do(Treatment A) ) is quite different.  Here we force all patients (all 2,002,000 of them) to use treatment A.  Since about half of the patients have large stones and half have small stones,

P( Recovery | do(Treatment A) ) $\approx$ (1/2)(1/3) + (1/2)(1) = 2/3.

Similarly for treatment B, P( Recovery |  Treatment B) $\approx$ 2/3 and

P( Recovery | do(Treatment B) ) $\approx$ (1/2)(0) + (1/2)(2/3) = 1/3.
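Here is a small sketch of my own that computes both the observational and the interventional probabilities directly from the imaginary counts above:

```python
# Counts from the hypothetical example above: (number of patients, number recovered).
groups = {
    ("A", "large"): (1_000_000, 1_000_000 / 3),
    ("A", "small"): (1_000,     1_000),
    ("B", "large"): (1_000,     0),
    ("B", "small"): (1_000_000, 2_000_000 / 3),
}

def p_recovery_given(treatment):
    """Observational P(Recovery | Treatment): pool the patients who actually got it."""
    n   = sum(c for (t, _), (c, _) in groups.items() if t == treatment)
    rec = sum(r for (t, _), (_, r) in groups.items() if t == treatment)
    return rec / n

def p_recovery_do(treatment):
    """Interventional P(Recovery | do(Treatment)): force the treatment on everyone,
    weighting each stone size by its share of the whole population."""
    total = sum(c for c, _ in groups.values())
    prob = 0.0
    for size in ("large", "small"):
        p_size = sum(c for (_, s), (c, _) in groups.items() if s == size) / total
        n, rec = groups[(treatment, size)]
        prob += p_size * (rec / n)
    return prob

for t in ("A", "B"):
    print(t, round(p_recovery_given(t), 3), round(p_recovery_do(t), 3))
# prints roughly:  A 0.334 0.667   and   B 0.666 0.333
```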

 

This example may seem contrived, but as Nielsen said, “Keep in mind that this really happened.”

 

Edit Aug 8, 2013:  Judea Pearl has a wonderful write-up on Simpson’s paradox titled “Simpson’s Paradox: An Anatomy” (2011?).  I think equation (9) in the article has a typo on the right-hand side; it should read

$$ P (E |do(\neg C)) = P (E |do(\neg C),F) P (F ) +P (E | do(\neg C), \neg F) P (\neg F ).$$

Corey Chivers created a beautiful set of slides introducing the reader to Bayesian Inference, Metropolis-Hastings (Markov Chain Monte Carlo), and hyperparameters, with some applications to biology.

“Introduction To Monte Carlo Algorithms” by Werner Krauth (1998) is a cute, simple introduction to Monte Carlo algorithms with both amusing hand-drawn images and informative graphics.  He starts with a simple game from Monaco for estimating $\pi$ (not Buffon’s Needle) and revises it to introduce the concepts of Markov Chain Monte Carlo (MCMC), ergodicity, rejection, and Detailed Balance (see also Metropolis-Hastings).  He then introduces the hard-sphere MCMC example, Random Sequential Adsorption, and the 8-puzzle, and ends with some accelerated algorithms for sampling.
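For reference, here is a minimal sketch of my own of the direct-sampling version of that pebble game (not Krauth’s Markov-chain variant): throw points uniformly into a square and count how many land inside the inscribed circle.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 1_000_000

# Direct sampling: points uniform in the square [-1, 1]^2; the fraction that
# lands inside the unit circle estimates pi/4.
x, y = rng.uniform(-1, 1, size=(2, n))
inside = (x**2 + y**2) <= 1.0
print("pi is approximately", 4 * inside.mean())
```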

You can find anything on the web.  Earlier today, I noticed that the Stirling approximation and the Gaussian distribution both have a $\sqrt{2 \pi}$ in them, so I started thinking that maybe you could apply Stirling’s approximation to the binomial distribution with $p=1/2$ and large $n$ and the central limit theorem to derive the Gaussian distribution.  After making a few calculations it seemed to be working and I thought, “Hey, maybe someone else has done this.”  Sure enough, Jake Hoffman did it and posted it here.  The internet is pretty cool.
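For a quick numerical illustration of why this works, here is a small sketch of my own comparing the Binomial($n$, 1/2) pmf with the Gaussian density that has the same mean $n/2$ and variance $n/4$:

```python
import numpy as np
from scipy.stats import binom, norm

n = 1000
k = np.arange(n + 1)

# Binomial(n, 1/2) pmf versus the Gaussian density with the same mean and variance.
pmf   = binom.pmf(k, n, 0.5)
gauss = norm.pdf(k, loc=n / 2, scale=np.sqrt(n / 4))

print("max absolute difference:", np.max(np.abs(pmf - gauss)))
```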

I’m quite excited by the Nuit Blanche post on the papers “Structure Discovery in Nonparametric Regression through Compositional Kernel Search” (Duvenaud, Lloyd, Grosse, Tenenbaum, and Ghahramani 2013) and “Exploiting compositionality to explore a large space of model structures” (Grosse, Salakhutdinov, Freeman, and Tenenbaum 2012).  For years, my old company Momentum Investment Services, Carl, and I have been looking for fast, systematic ways to search large hypothesis spaces.  We considered context-free grammars as a means of generating hypotheses.  Carl and I did not get anywhere with that research, but now it seems that others have succeeded.  Be sure to look over the article, the blog posts, and the comments.
