# Statistics

You are currently browsing the archive for the Statistics category.

## Deriving the Gaussian Distribution from the Sterling Approximation and the Central Limit Theorem

You can find anything on the web.  Earlier today, I noticed that the Sterling approximation and the Gaussian distribution both had a $\sqrt{2 \pi}$ in them, so I started thinking that maybe you could apply Sterling’s approximation to the binomial distribution with $p=1/2$ and large $n$ and the central limit theorem to derive the Gaussian distribution.  After making a few calculations it seemed to be working and I thought, “Hey, maybe someone else has done this.”  Sure enough, Jake Hoffman did it and posted it here.  The internet is pretty cool.

## “Matrix Factorizations and the Grammar of Life”

I’m quite excited by the Nuit Blanche post on the papers “Structure Discovery in Nonparametric Regression through Compositional Kernel Search” (Duvenaudy, Lloydy, Grossez, Tenenbaumz, Ghahramaniy 2013) and “Exploiting compositionality to explore a large space of model structures” (Grosse, Salakhutdinovm, Freeman, and Tenenbaum 2012).  For years my old company Momentum Investment Services, Carl, and I have been looking for fast, systematic ways to search large hypothesis spaces.  We considered context-free grammars as a means of generating hypothesis.  Carl and I did not get anywhere with that research, but now it seems that others have succeeded.  Be sure to look over the article, the blog posts, and the comments.

## “Trustworthy Online Controlled Experiments: Five Puzzling Outcomes Explained”

“…we often joke that our job, as the team that builds the experimentation platform, is to tell our clients that their new baby is ugly, …”

Andrew Gelman at Statistical Modeling, Causal Inference, and Social Science pointed me towards the paper “Trustworthy Online Controlled Experiments:  Five Puzzling Outcomes Explained” by Ron Kohavi, Alex Deng, Brian Frasca, Roger Longbotham, Toby Walker, and Ya Xu all of whom seem to be affiliated with Microsoft.

The paper itself recounted five online statistical experiments mostly done at Microsoft that had informative counter-intuitive results:

• Overall Evaluation Criteria for Bing
• Click Tracking
• Initial Effects
• Experiment Length
• Carry Over Effects.

The main lessons learned were:

• Be careful what you wish for. – Short term effects may be diametrically opposed to long-term effects.  Specifically, a high number clicks or queries per session could be indicative of a bug rather than success.  It’s important to choose the right metric.  The authors ended up focusing on “sessions per user” as a metric as opposed to “queries per month” partly due to a bug which increased (in the short-term) queries and revenues while degrading the user’s experience.
• Initial results are strongly affected by “Primacy and Novelty”. – In the beginning, experienced users may click on a new option just because it is new, not because it’s good.  On the other hand, experienced users may be initially slowed by a new format even if the new format is “better”.
• If reality is constantly changing, the experiment length may not improve the accuracy of the experiment.  The underlying behavior of the users may change every month.  A short-term experiment may only capture a short-term behavior. Rather than running the experiment for years, the best option may be to run several short-term experiments and adapt the website to the changing behavior as soon as the new behavior is observed.
• If the same user is presented with the same experiment repeatedly, her reaction to the experiment is a function of the number of times she has been exposed to the experiment.  This effect must be considered when interpreting experimental results.
• The Poisson Distribution should not be used to model clicks.  They preferred Negative Binomial.

The paper is easy to read, well written, and rather informative. It is especially good for web analytics and for anyone new to experimental statistics.  I found the references below to be especially interesting:

Sean J. Taylor writes a short, critical, amusing article about RPython or JVM languages, Julia, Stata, SPSS,  Matlab, Mathematica, and SAS.

## “The No-U-Turn Sampler: Adaptively Setting Path Lengths in Hamiltonian Monte Carlo”

In “The No-U-Turn Sampler: Adaptively Setting Path Lengths in Hamiltonian Monte Carlo“, Ho man and Gelman present an improvement of the Markov Chain Monte Carlo and the Hamiltonian Monte Carlo methods.  Here’s the abstract:

Hamiltonian Monte Carlo (HMC) is a Markov chain Monte Carlo (MCMC) algorithm that avoids the random walk behavior and sensitivity to correlated parameters that plague many MCMC methods by taking a series of steps informed by first-order gradient information. These features allow it to converge to high-dimensional target distributions much more quickly than simpler methods such as random walk Metropolis or Gibbs sampling. However, HMC’s performance is highly sensitive to two user-specified parameters: a step size and a desired number of steps L. In particular, if L is too small then the algorithm exhibits undesirable random walk behavior, while if L is too large the algorithm wastes computation. We introduce the No-U-Turn Sampler (NUTS), an extension to HMC that eliminates the need to set a number of steps L. NUTS uses a recursive algorithm to build a set of likely candidate points that spans a wide swath of the target distribution, stopping automatically when it starts to double back and retrace its steps. Empirically, NUTS perform at least as efficiently as and sometimes more efficiently than a well tuned standard HMC method, without requiring user intervention or costly tuning runs. We also derive a method for adapting the step size parameter $\epsilon$ on the fly based on primal-dual averaging. NUTS can thus be used with no hand-tuning at all. NUTS is also suitable for applications such as BUGS-style automatic inference engines that require efficient “turnkey” sampling algorithms.

## Cotner’s Observation

Taleb published an idea is similar to Cotner’s Observation which I will formally state below.
$\$
If you see a tall person from a distance, you tend to expect that the tall person is male.  In fact, the taller the person is, the more likely it is that the person is male. If you plot histograms of people’s heights, you will find that this intuition is correct.  Cotner once pointed out to me that this intuition is false for many non-Gaussian Distributions.
$\$

Cotner’s Observation:$\$

Consider two random variables, $A$ and $B$ with $P(B>A) > 0.5$ and a random variable $X$ which is always equal to $A$ or $B$.  Suppose we flip a fair coin and if the coin is heads we assign $X$ to $A$ and if it is tails we let $X$ equal $B$.   Now if $X$ is large, we tend to think the coin was probably tails because $B$ is usually larger than $A$.  Furthermore, the larger $X$ is the more we think that the coin was probably tails.  This intuition is right for Gaussian Distributions, but it is wrong for many other distributions.  Let $p(x)$ be the probability that the coin was heads given that $X = x$
$$p(x) := P( X=A | X=x).$$
• If $A$ and $B$ have Gaussian Distributions with equal standard deviations, then $p(x)$ is decreasing.
• If $A$ and $B$ have Laplace Distributions with equal standard deviations, then $p(x)$ is becomes constant for all $x > E[B]$ and $p(x) < 0.5$ when $x > E[B]$.
• If $A$ and $B$ have Student T-Distributions with equal standard deviations and equal degrees of freedom, then $p(x)$ will actually increase if $x$ is large enough and $p(x)$ approaches 0.5 as $x$ goes to infinity.
$\$
Many distributions have a fat tails like the Student T Distributions.

## Ecological Fallacy

Ecological Fallacy” is a phenomena in statistics where correlations for groups of data points are differ from the correlations for ungrouped data.  Here a quote from the Wikipedia,

“Another example is a 1950 paper by William S. Robinson that coined the term. For each of the 48 states + District of Columbia in the US as of the 1930 census, he computed the illiteracy rate and the proportion of the population born outside the US. He showed that these two figures were associated with a negative correlation of −0.53 — in other words, the greater the proportion of immigrants in a state, the lower its average illiteracy. However, when individuals are considered, the correlation was +0.12 — immigrants were on average more illiterate than native citizens. Robinson showed that the negative correlation at the level of state populations was because immigrants tended to settle in states where the native population was more literate. He cautioned against deducing conclusions about individuals on the basis of population-level, or “ecological” data. In 2011, it was found that Robinson’s calculations of the ecological correlations are based on the wrong state level data. The correlation of −0.53 mentioned above is in fact −0.46.

## What happens when you combine Relational Databases, Logic, and Machine Learning?

Answer:  Statistical Relational Learning.  Maybe I can get the book for Christmas.

## Why are Gaussian Distributions Great?

Gaussian distributions are the most “natural” distributions. They show up everywhere. Here is a list of the properties that make me think that Gaussians are the most natural distributions:

• The sum of several random variables (like dice) tends to be Gaussian. (Central Limit Theorem).
• There are two natural ideas that appear in Statistics, the standard deviation and the maximum entropy principle. If you ask the question, “Among all distributions with standard deviation 1 and mean 0, what is the distribution with maximum entropy?” The answer is the Gaussian.
• Randomly select a point inside a high dimensional hypersphere. The distribution of any particular coordinate is approximately Gaussian. The same is true for a random point on the surface of the hypersphere.
• Take several samples from a Gaussian Distribution. Compute the Discrete Fourier Transform of the samples. The results have a Gaussian Distribution. I am pretty sure that the Gaussian is the only distribution with this property.
• The eigenfunctions of the Fourier Transforms are products of polynomials and Gaussians.
• The solution to the differential equation y’ = -x y is a Gaussian. This fact makes computations with Gaussians easier. (Higher derivatives involve Hermite polynomials.)
• I think Gaussians are the only distributions closed under multiplication, convolution, and linear transformations.
• Maximum likelihood estimators to problems involving Gaussians tend to also be the least squares solutions.
• I think all solutions to stochastic differential equations involve Gaussians. (This is mainly a consequence of the Central Limit Theorem.
• “The normal distribution is the only absolutely continuous distribution all of whose cumulants beyond the first two (i.e. other than the mean and variance) are zero.” – Wikipedia.
• For even n, the nth moment of the Gaussian is simply an integer multiplied by the standard deviation to the nth power.
• Many of the other standard distributions are strongly related to the Gaussian (i.e. binomial, Poisson, chi-squared, Student t, Rayleigh, Logistic, Log-Normal, Hypergeometric …)
• “If X1 and X2 are independent and their sum X1 + X2 is distributed normally, then both X1 and X2 must also be normal.” — From the Wikipedia.
• “The conjugate prior of the mean of a normal distribution is another normal distribution.” — From the Wikipedia.
• When using Gaussians, the math is easier.
• The Erdős–Kac theorem implies that the distribution of the prime factors of a “random” integer is Gaussian.
• The velocities of random molecules in a gas are distributed as a Gaussian. (With standard deviation = $z*\sqrt{ k\, T / m}$ where $z$ is a “nice” constant, $m$ is the mass of the particle, and $k$ is Boltzmann’s constant.)
• “A Gaussian function is the wave function of the ground state of the quantum harmonic oscillator.” — From Wikipedia
• Kalman Filters.
• The Gauss–Markov theorem.

## Standard Deviation of Sample Median

“Take $n$ samples from a standard normal distribution $N(0,1)$. Which has greater variance: the average or the median?”

(Aug 12, 2013 edit:  If you want to understand the answer to this question, read below.  If you want to compute the exact standard deviation of the median for a particular sample distribution, click “The Exact Standard Deviation of the Sample Median“.)

This question was mostly answered by Laplace and by Kenney and Keeping.

If the number of samples $n$ of a symmetric continuous distribution with probability density function $f$ is large, then the distribution of the sample median is approximately $N(\overline{x}, \sigma)$, where $\overline{x}$ is the the mean of $f$, and
$$\sigma = {1 \over 2\sqrt{n}\,f(\overline{x})}.$$

The $1 \over 2\sqrt{n}$ comes from the fact that the number of samples above the mean has mean $n/2$ with standard deviation $1 \over 2\sqrt{n}$.

For the Gaussian distribution, the estimate for the standard deviation of the sample median is
$${\sqrt{2\pi} \over 2\sqrt{n}} = \sqrt{\pi \over 2n}.$$

The standard deviation of the sample mean is $1 / \sqrt{n}$. So the standard deviation of the sample median is larger by a factor of $\sqrt{\pi / 2} \approx 1.25$ (i.e., it has about 25% more error). This is a small price to play if outliers are possible.

So we have answered the question for the Gaussian. What about any symmetric continuous single mode distribution? The standard deviation of the median decreases as the density of the distribution at the median increases, so for sharply pointed distributions like the symmetric Laplacian distribution the standard deviation of the sample median is actually lower than the standard deviation of the sample mean.

I used approximations above, but with a lot more work and Beta distributions, you can calculate to arbitrary precision values for the standard deviation of the sample median.