July 2012


A friend of mine asked:

“Take $n$ samples from a standard normal distribution $N(0,1)$. Which has greater variance: the average or the median?”

(Aug 12, 2013 edit:  If you want to understand the answer to this question, read below.  If you want to compute the exact standard deviation of the median for a particular sample distribution, click “The Exact Standard Deviation of the Sample Median”.)

This question was mostly answered by Laplace and by Kenney and Keeping.

If the number of samples $n$ from a symmetric continuous distribution with probability density function $f$ is large, then the distribution of the sample median is approximately $N(\overline{x}, \sigma)$, where $\overline{x}$ is the mean of $f$, and
$$\sigma = {1 \over 2\sqrt{n}\,f(\overline{x})}.$$

The $1 \over 2\sqrt{n}$ comes from the fact that the fraction of samples above the mean has mean $1/2$ with standard deviation $1 \over 2\sqrt{n}$.
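A quick Monte Carlo sketch of that fact (my own check, using only the Python standard library; $n$ and the trial count are arbitrary choices):

```python
import random
import statistics

# The fraction of n samples from N(0,1) that land above the mean 0 is
# Binomial(n, 1/2)/n, which has mean 1/2 and standard deviation 1/(2*sqrt(n)).
random.seed(2)
n, trials = 400, 4000
fractions = [sum(random.gauss(0, 1) > 0 for _ in range(n)) / n
             for _ in range(trials)]

# Rescale the observed SD by 2*sqrt(n); it should be close to 1.
print(round(statistics.pstdev(fractions) * 2 * (n ** 0.5), 2))
```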

For the Gaussian distribution, the estimate for the standard deviation of the sample median is
$${\sqrt{2\pi} \over 2\sqrt{n}} = \sqrt{\pi \over 2n}.$$

The standard deviation of the sample mean is $1 / \sqrt{n}$. So the standard deviation of the sample median is larger by a factor of $\sqrt{\pi / 2} \approx 1.25$ (i.e., it has about 25% more error). This is a small price to pay if outliers are possible.
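Here is a Monte Carlo sketch of the $\sqrt{\pi/2}$ factor (my own check, not from Laplace or Kenney and Keeping; the sample size and trial count are arbitrary):

```python
import random
import statistics

# For n samples from N(0,1), the SD of the sample median should be roughly
# sqrt(pi/2) ~ 1.25 times the SD of the sample mean.
random.seed(0)
n, trials = 101, 5000
medians, means = [], []
for _ in range(trials):
    xs = [random.gauss(0, 1) for _ in range(n)]
    medians.append(statistics.median(xs))
    means.append(statistics.mean(xs))

sd_median = statistics.pstdev(medians)
sd_mean = statistics.pstdev(means)
print(round(sd_median / sd_mean, 2))  # close to 1.25
```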

So we have answered the question for the Gaussian. What about any symmetric, continuous, unimodal distribution? The standard deviation of the median decreases as the density of the distribution at the median increases, so for sharply peaked distributions like the symmetric Laplace distribution, the standard deviation of the sample median is actually lower than the standard deviation of the sample mean.
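A sketch of that claim for the standard Laplace distribution (density $f(x) = e^{-|x|}/2$, so $f(0) = 1/2$): the formula above gives the sample median an asymptotic standard deviation of $1/\sqrt{n}$, while the sample mean has standard deviation $\sqrt{2}/\sqrt{n}$, so the median should win by a factor of $1/\sqrt{2} \approx 0.71$. Again the sample size and trial count are arbitrary choices of mine:

```python
import random
import statistics

# The difference of two independent unit exponentials is standard Laplace.
random.seed(1)
n, trials = 101, 5000
medians, means = [], []
for _ in range(trials):
    xs = [random.expovariate(1.0) - random.expovariate(1.0) for _ in range(n)]
    medians.append(statistics.median(xs))
    means.append(statistics.mean(xs))

# For the Laplace distribution the median is the *better* estimator:
# the ratio should be near 1/sqrt(2) ~ 0.71.
print(round(statistics.pstdev(medians) / statistics.pstdev(means), 2))
```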

I used approximations above, but with a lot more work and Beta distributions, you can calculate the standard deviation of the sample median to arbitrary precision.

I’m rather fascinated by the HOO algorithm.  It seems to be a combination of simulated annealing, branch and bound, the simplex method, and upper confidence trees.  I’m going to code it up and try it on a few problems.

According to Wikipedia, Student’s t-distribution is the maximum entropy probability distribution for a random variate $X$ for which $E(\log(\nu + X^2))$ is fixed. This is similar to the fact that the Gaussian distribution is the maximum entropy distribution among all distributions for which $E(X^2)$ is fixed.

Here is a cool set of PowerPoint slides by Brodeur and Child on upper confidence trees (UCT) applied to Go.  They explore modifications to UCT to explore directed acyclic graphs and grouping of game states.

Carl and I, with help from Wikipedia, put together this list.

 

Wikipedia list

Bubeck Slides “Continuous Stochastic Optimization” using Hierarchical Optimistic Optimization (HOO)

X-Armed Bandits (HOO)

Sample-Based Planning for Continuous Action Markov Decision Processes

Open Loop Optimistic Planning

Prediction, Learning, and Games (book by Cesa-Bianchi & Lugosi)

The non-stochastic multi-armed bandit problem

I just discovered the Face Recognition Homepage while looking for a database of images of faces. It is also a good place to find algorithms, source code, and other good stuff!

In this article, Wang, Huang, Kamangar, Nie, and Ding discuss a new algorithm for multi-instance learning.

“In MIL data objects are represented as bags of instances, therefore the distance between the objects is a set-to-set distance. Compared to traditional single-instance data that use vector distance such as Euclidean distance, estimating the Bag-to-Bag (B2B) distance in MIL is more challenging [7,8]. In addition, the B2B distances often do not truly reflect the semantic similarities [9].”

I am hoping to apply these ideas to astronomical image stacking.

(Based on the discussion between Carl and me on Saturday, July 7.)

It seems that in order to achieve Strong AI we need to combine Machine Learning and Good Old-Fashioned Artificial Intelligence. For those not familiar with GOFAI, it is mostly about symbol manipulation and includes subjects like computer algebra, theorem proving, first-order predicate calculus, lambda calculus, Turing machines, alpha-beta pruning, context-free grammars, depth/breadth-first search, branch and bound, semantic networks, expert systems, Prolog, Lisp, resolution theorem proving, and Gödel’s incompleteness theorem. I wonder if we also have to add less formal, plausible reasoning systems that guess at theorems which are only “mostly true” (true only for points in general position outside of a set of measure zero) or have a high “probability of being true”. Such reasoning techniques might be derived from Bayesian networks or Alternative Hypothesis Spaces (Griffin 2009) (http://bmir.stanford.edu/file_asset/index.php/1565/BMIR-2010-1375.pdf).

Maybe we can invent a non-discrete Turing machine. Say we replace the finite state machine that reads and prints on the tape with a neural net. We might also print real numbers between 0 and 1 on the tape instead of printing only zeros and ones. Of course, this continuous Turing machine would not be better at classifying binary strings than a traditional discrete Turing machine; rather, the hope is that such a machine might be easier to optimize using general optimization techniques similar to gradient descent. My personal feeling is that this particular method will not work, but something like it might.

We need something that captures the most fundamental ideas, like the idea of a Cartesian plane. The idea of a Cartesian product arises naturally from Category Theory. If one wants to formulate rules of thumb for tic-tac-toe, the 3 by 3 grid is an essential idea that is hard for a neural net to learn without prompting. Maybe we can use Category Theory to guide our learning algorithms (http://golem.ph.utexas.edu/category/2007/09/category_theory_in_machine_lea.html) to the correct abstractions.

What is an abstraction anyway? In chess, if a king is in check, there are only three types of possible moves: move the king, interpose a piece, or capture the piece attacking the king. So when we consider responses to a check, we only need to consider those three types of moves. The fact that only these three types of moves exist is a theorem. This theorem is like a subroutine that speeds up our processing in a chess game, so an abstraction could be like a helpful subroutine. The theorem also allows us to disregard pieces that are not “near” the king or the attacking piece. In this regard, the theorem allows us to remove information from the game board before calculating the possible counter-moves, so it acts like a projection. The matrix

$$\begin{pmatrix} 1 & 0 \\ 0 & 0 \end{pmatrix}$$

sets the y coordinate to zero, so it destroys information. So some abstractions can be viewed as helpful projections that remove irrelevant information—just like manifold learning or dimension reduction. Another way to remove information is to disregard the particulars of the situation and boil it down to the essentials. Group theory was invented this way.
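The projection above can be written out as a tiny sketch (my own illustration; the function name is made up):

```python
# Apply the matrix [[1, 0], [0, 0]] to a point in the plane: x survives,
# y is sent to zero -- the "irrelevant information" is destroyed.
def project(point):
    x, y = point
    return (1 * x + 0 * y, 0 * x + 0 * y)

print(project((3, 7)))  # (3, 0)
```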

Group theory is important in another regard: examples and models. Suppose that a thinking machine was trying to prove conjectures about group theory. Before trying to prove the conjecture, the machine could check some simple finite symmetric groups, a few finite cyclic groups, and maybe addition on real numbers to see if the conjecture was true for those examples. If it was true, then the prover could try to prove the conjecture. If it was false, then the computer would not waste time trying to prove the conjecture.
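A sketch of that "check small examples first" idea, using the symmetric group $S_3$ (permutations of $\{0,1,2\}$ under composition) as the cheap test case; the conjectures chosen here are my own illustrations:

```python
from itertools import permutations

# S3: all permutations of {0, 1, 2}, with composition as the group operation.
elements = list(permutations(range(3)))

def compose(p, q):
    # (p o q)(i) = p(q(i))
    return tuple(p[q[i]] for i in range(3))

def inverse(p):
    inv = [0] * 3
    for i, pi in enumerate(p):
        inv[pi] = i
    return tuple(inv)

# A true conjecture, (ab)^(-1) = b^(-1) a^(-1), survives the check,
# so a prover could go on to attempt a proof.
ok = all(inverse(compose(a, b)) == compose(inverse(b), inverse(a))
         for a in elements for b in elements)

# A false conjecture, "every group is abelian", is refuted immediately
# by S3, so the prover should not waste time on it.
abelian = all(compose(a, b) == compose(b, a)
              for a in elements for b in elements)

print(ok, abelian)  # True False
```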

Another useful idea is that classifiers or theorems could have natural topologies (http://philosophy.berkeley.edu/file/698/FractalCompletenessTechniques.pdf). It seems that it is hard for neural nets to learn parity because changing any one bit in the input changes the class from even to odd or vice versa. Nearest neighbor with k=1 would be worse than useless for such a problem. Topology could offer further insights for machine learning, such as what is learnable. If the fractal dimension of a class is high (or its boundary has a high fractal dimension), we would imagine that the concept is harder to learn.
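The "worse than useless" claim about 1-nearest-neighbor on parity can be checked directly (my own sketch; the bit length is an arbitrary choice). Even with every other bit string available as training data, the nearest neighbors of any string are at Hamming distance 1, and flipping one bit always flips the parity:

```python
from itertools import product

n = 6
points = list(product([0, 1], repeat=n))
parity = {p: sum(p) % 2 for p in points}

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

correct = 0
for p in points:  # leave-one-out evaluation over all 2^n strings
    nearest = min((q for q in points if q != p), key=lambda q: hamming(p, q))
    correct += (parity[nearest] == parity[p])

print(correct / len(points))  # 0.0 -- every single prediction is wrong
```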

For example, if a classifier had to classify the points in a circle and it used nearest neighbor with k=1, then the number of random samples needed to achieve an error of $\epsilon$ is of order $(1/\epsilon)^2$. The number of points needed to classify the area inside a Koch snowflake (http://www.wolframalpha.com/input/?i=Koch+snowflake&lk=3) is of order $(1/\epsilon)^{2/(2-k)}$, where $k$ is the fractal dimension of the border of the Koch snowflake.

It occurs to me that when designing an algorithm for a large organization the designer should minimize
$$\alpha T + \beta L + \gamma H$$
where $\alpha$, $\beta$, and $\gamma$ are unknown positive constants, $T$ is the time your code needs to run, $L$ is the number of lines of code required, and $H$ is the number of hours required by other engineers to understand your algorithm.

PS: The above statement is considerably less funny if you try to estimate $\alpha$, $\beta$, and $\gamma$.
