September 2012

You are currently browsing the monthly archive for September 2012.

I was reading the Machine Learning‘s article “Coactive Learning” and they referred to that paper “From Bandits to Experts: On the Value of Side-Observations” by Mannor and Shamir (2011). This paper develops algorithms for the situation where the learner gets information about neighboring bandits after it chooses which bandit arm to pull. Recall that in the mixture of experts situation, the leaner gets to see the results of all the experts (bandits) after choosing which arm to pull.

There is a great post at stack overflow comparing SVMs with Neural Nets. Check it out.

Nuit Blanche pointed me toward the interesting paper “Re-Weighted L1 Dynamic Filtering for Time-Varying Sparse Signal Estimation” by Charles and Rozell (2012). They investigate the generalized Kalman filter situation which is a hidden Markov model with an infinite number of states and real valued observations. More specifically
$$x_n = f_n (x_{n-1}) + v_n$$

$$y_n = G_n x_{n-1} + \epsilon_n$$

where $x_n$ is the state, $f_n$ is the known state transition function, $y_n$ is a real valued observation vector, $G_n$ is the known observation matrix (full rank), $v_n$ is the process vector noise (not necessarily Gaussian), and $\epsilon_n$ is the observation noise vector (once again, not necessarily Gaussian). The object of the game is to estimate $x_n$ given the transition function, the observation matrices, and the observations $y_n$.

They sparsely represent the state vector $x_n$ is by $x_n = W z_n$ where $W$ is a matrix and $z_n$ is a vector that is mostly zeros. In order to achieve a sparse representation, a L1 norm regularization term is applied. This regularization usually corresponds to a Laplacian prior as in Bias pursuit denoising or LASSO regression. As with neural nets, sparsity has had an impact in many fields. They write

In particular, outside of the dynamic estimation regime, signal priors based on the notion of sparsity have led to state-of-the-art performance in many linear inverse problems (e.g., undersampling, inpainting, denoising, compressed sensing, etc. [3]) across several different application domains (e.g. natural images [3], audio [4], hyperspectral imagery [5], etc.).

The previous work integrating sparsity models into dynamic estimation has not yet followed the spirit
of the successful Kalman filter by propagating higher-order statistics from one time step to the next in a
way that is well-matched to the assumptions of the problem.

As is pointed out in [8], propagating a covariance matrix is no longer necessarily the correct or optimal method once the
assumptions of the Kalman filter are not met. Instead, such a matrix is no more special than a simple parameter matrix, which
may or may not assist in the signal estimation.”

The paper includes several experimental results in section IV.

Gaussian distributions are the most “natural” distributions. They show up everywhere. Here is a list of the properties that make me think that Gaussians are the most natural distributions:

  • The sum of several random variables (like dice) tends to be Gaussian. (Central Limit Theorem).
  • There are two natural ideas that appear in Statistics, the standard deviation and the maximum entropy principle. If you ask the question, “Among all distributions with standard deviation 1 and mean 0, what is the distribution with maximum entropy?” The answer is the Gaussian.
  • Randomly select a point inside a high dimensional hypersphere. The distribution of any particular coordinate is approximately Gaussian. The same is true for a random point on the surface of the hypersphere.
  • Take several samples from a Gaussian Distribution. Compute the Discrete Fourier Transform of the samples. The results have a Gaussian Distribution. I am pretty sure that the Gaussian is the only distribution with this property.
  • The eigenfunctions of the Fourier Transforms are products of polynomials and Gaussians.
  • The solution to the differential equation y’ = -x y is a Gaussian. This fact makes computations with Gaussians easier. (Higher derivatives involve Hermite polynomials.)
  • I think Gaussians are the only distributions closed under multiplication, convolution, and linear transformations.
  • Maximum likelihood estimators to problems involving Gaussians tend to also be the least squares solutions.
  • I think all solutions to stochastic differential equations involve Gaussians. (This is mainly a consequence of the Central Limit Theorem.
  • “The normal distribution is the only absolutely continuous distribution all of whose cumulants beyond the first two (i.e. other than the mean and variance) are zero.” – Wikipedia.
  • For even n, the nth moment of the Gaussian is simply an integer multiplied by the standard deviation to the nth power.
  • Many of the other standard distributions are strongly related to the Gaussian (i.e. binomial, Poisson, chi-squared, Student t, Rayleigh, Logistic, Log-Normal, Hypergeometric …)
  • “If X1 and X2 are independent and their sum X1 + X2 is distributed normally, then both X1 and X2 must also be normal.” — From the Wikipedia.
  • “The conjugate prior of the mean of a normal distribution is another normal distribution.” — From the Wikipedia.
  • When using Gaussians, the math is easier.
  • The Erdős–Kac theorem implies that the distribution of the prime factors of a “random” integer is Gaussian.
  • The velocities of random molecules in a gas are distributed as a Gaussian. (With standard deviation = $z*\sqrt{ k\, T / m} $ where $z$ is a “nice” constant, $m$ is the mass of the particle, and $k$ is Boltzmann’s constant.)
  • “A Gaussian function is the wave function of the ground state of the quantum harmonic oscillator.” — From Wikipedia
  • Kalman Filters.
  • The Gauss–Markov theorem.

Sometimes I can’t remember the formulas for matrix calculus.  Fortunately there are some good references on the web.  One of my favorites is “Old and New Matrix Algebra Useful for Statistics” (Minka 2000 1, 2).

Someone at stack overflow asked for advice about Machine Learning for checkers. Here is my response:

You might want to take a look at the following: Chinook, Upper Confidence Trees, Reinforcement Learning, and Alpha-Beta pruning.  I personally like to combine Alpha-Beta Pruning and Upper Confidence Trees (UCT) for perfect information games where each player has less than 10 reasonable moves.  You can use Temporal Difference Learning to create a position evaluation function.  Game AI is probably the funnest way to learn machine learning.

Proofs without Words

On mathoverflow, they have this wonderful question “Can you give examples of proofs without words?”.  The answers are beautiful. They visually “prove”

$$1 + 2 + 3 + \cdots + n = n (n + 1) / 2 = \left({ \begin{matrix} n \\ 2 \end{matrix}}\right),$$

$$32.5 = 31.5,$$

$$1^2 + 2^2 + 3^2 + \cdots + n^2 = n(n + 1)(n + 1/2)/3,$$

$$F^2_0 + F^2_1 + \cdots + F^2_n = F_n F_{n+1},$$

with $F_0 = 1$ for Fibonacci numbers $F_n$, and more than 15 other facts visually.

The other day, Carl mentioned that Euclidean Geometry was decidable.  I thought that was impossible because I thought it would have an isomorphic copy of Peano arithmetic which is not decidable.   Later he pointed me toward Tarski’s axioms.   Here’s a quote from the Wikipedia page “This fact allowed Tarski to prove that Euclidean geometry is decidable: there exists an algorithm which can determine the truth or falsity of any sentence.”  I found the whole article to be pretty cool because I had never really dug into geometry as a first order predicate calculus.

The article “Big Data and the Hourly Workforce” (see Slashdot comments) repeats the known story that Big Data is being used by companies to improve their product.  It is used by Target to target customers (by, for example, identifying pregnant women), by the Center for Disease Control to identify outbreaks, to predicted box office hits from Twitter, and by political campaigns to target the right voters with the right message.  The article implies that Big Data algorithms identify patterns in the data which companies can exploit.

But then the article gets into application of Big Data to the hourly work force. Companies are focusing on improving worker retention, worker productivity, customer satisfaction, and average revenue per sale.  Big Data is being used to increase each of these metrics five to twenty-five percent.  As in baseball (Moneyball), some of the old rules of thumb like excluding “job hoppers” or even those with an old criminal history are being rejected by the data in favor of machine learning features like distance from work or the Meyers Briggs personality classification which are supported by the data.

This 2011 tutorial (Parts 1 and 2) is the best introduction to stochastic and adversarial bandits and details recent advances in research that I have seen. Many of the bandit papers that I have linked to in previous posts are quoted.   Some experimental results and suggested parameter values are given. Part one mentions that regret may be much larger than the expected regret as per the paper “Exploration-exploitation tradeoff using variance estimates in multi-armed bandits” by Audibert, Munos, Szepesvári (2009). It describes the UCB-Horizon algorithm and Hoeffding-based GCL∗ policy which alleviate this problem. Part two is titled “Bandits with large sets of actions” and mostly deals with the case of infinitely many bandits. Many algorithms are described including Hierarchical Optimistic Optimization and a huge bibliography is provided at the end of the tutorial.

« Older entries