January 2013

You are currently browsing the monthly archive for January 2013.

Taleb published an idea is similar to Cotner’s Observation which I will formally state below.
$\ $
If you see a tall person from a distance, you tend to expect that the tall person is male.  In fact, the taller the person is, the more likely it is that the person is male. If you plot histograms of people’s heights, you will find that this intuition is correct.  Cotner once pointed out to me that this intuition is false for many non-Gaussian Distributions.
$\ $

Cotner’s Observation:$\ $

Consider two random variables, $A$ and $B$ with $P(B>A) > 0.5$ and a random variable $X$ which is always equal to $A$ or $B$.  Suppose we flip a fair coin and if the coin is heads we assign $X$ to $A$ and if it is tails we let $X$ equal $B$.   Now if $X$ is large, we tend to think the coin was probably tails because $B$ is usually larger than $A$.  Furthermore, the larger $X$ is the more we think that the coin was probably tails.  This intuition is right for Gaussian Distributions, but it is wrong for many other distributions.  Let $p(x)$ be the probability that the coin was heads given that $X = x$
$$p(x) := P( X=A | X=x).$$
  • If $A$ and $B$ have Gaussian Distributions with equal standard deviations, then $p(x)$ is decreasing.
  • If $A$ and $B$ have Laplace Distributions with equal standard deviations, then $p(x)$ is becomes constant for all $x > E[B]$ and $p(x) < 0.5$ when $x > E[B]$.
  • If $A$ and $B$ have Student T-Distributions with equal standard deviations and equal degrees of freedom, then $p(x)$ will actually increase if $x$ is large enough and $p(x)$ approaches 0.5 as $x$ goes to infinity.
$\ $
Many distributions have a fat tails like the Student T Distributions.

In “BOA: The Bayesian Optimization Algorithm“, Pelikan, Goldberg, and Cant´u-Paz introduce an adaptive improvement over genetic optimization algorithms (See also [1]).  They write, “In this paper, an algorithm based on the concepts of genetic algorithms that uses an estimation of a probability distribution of promising solutions in order to generate new candidate solutions is proposed. To estimate the distribution, techniques for modeling multivariate data by Bayesian networks are used.”  and

“The algorithm proposed in this paper is also capable of covering higher order interactions. It uses techniques from the field of modeling data by Bayesian networks in order to estimate the joint distribution of promising solutions. The class of distributions that are considered is identical to the class of conditional distributions used in the FDA. Therefore, the theory of the FDA can be used in order to demonstrate the power of the proposed algorithm to solve decomposable problems. However, unlike the FDA, our algorithm does not require any prior information about the problem.  It discovers the structure of a problem on the fly.”

where FDA refers to the Factorized Distribution Algorithm (Mühlenbein et al., 1998).  The algorithm consists of the following steps

The Bayesian Optimization Algorithm (BOA)
(1) set t ← 0 randomly generate initial population P(0)
(2) select a set of promising strings S(t) from P(t)
(3) construct the network B using a chosen metric and constraints
(4) generate a set of new strings O(t) according to the joint distribution encoded by B
(5) create a new population P(t+1) by replacing some strings from P(t) with O(t)  set t ← t + 1
(6) if the termination criteria are not met, go to (2)

where B is a Bayesian network.

Check out the NIPS 2011 workshop.

Check out the new class (all lectures on-line) advertised on the Machine Learning (Theory) blog.

Check out the Nuit Blanche posts “Predicting the Future: The Steamrollers” and “Predicting the Future: Randomness and Parsimony” where Igor Carron repeats the well known mantra of Moore’s law that always seems to catch us by surprise. Carron’s remarks on medicine surprised me but also I thought, “I should have guessed that would happen” while reading the articles.

Garvesh Raskutti has some nice slides on Probabilistic Graphical Models and Markov Logic Networks (Richardson & Domingos 2006).  Markov Logic Networks encode first order predicate logic into a Markov Random Field. The resulting networks can be quite large because statements like “for all x, y, and z, x is y’s parent and z is x’s parent imply z is y’s grandparent” require the existence of $2n^2$ nodes in the graph where $n$ is the number of people considered.  The resulting networks are frequently solved by using Gibbs Sampling.  For even more information Pedro Domingos has put an entire course online at



Ecological Fallacy” is a phenomena in statistics where correlations for groups of data points are differ from the correlations for ungrouped data.  Here a quote from the Wikipedia,

“Another example is a 1950 paper by William S. Robinson that coined the term.[5] For each of the 48 states + District of Columbia in the US as of the 1930 census, he computed the illiteracy rate and the proportion of the population born outside the US. He showed that these two figures were associated with a negative correlation of −0.53 — in other words, the greater the proportion of immigrants in a state, the lower its average illiteracy. However, when individuals are considered, the correlation was +0.12 — immigrants were on average more illiterate than native citizens. Robinson showed that the negative correlation at the level of state populations was because immigrants tended to settle in states where the native population was more literate. He cautioned against deducing conclusions about individuals on the basis of population-level, or “ecological” data. In 2011, it was found that Robinson’s calculations of the ecological correlations are based on the wrong state level data. The correlation of −0.53 mentioned above is in fact −0.46.[6]

Check out the Bookworm9765’s review of Kurzweil‘s book “How to Create a mind” on Amazon.com.  Here is a snippet:

The core of Kurzweil’s theory is that the brain is made up of pattern processing units comprised of around 100 neurons, and he suggests that the brain can be understood and simulated primarily by looking at how these lego-like building blocks are interconnected.

In “Efficient Reinforcement Learning for High Dimensional Linear Quadratic Systems” Ibrahimi, Javanmard, and Van Roy (2012) present a fast reinforcement learning solution for linear control systems problems with quadratic costs.  The problem is to minimize the total cost $\min_u \sum_{i=1}^T x_i^t Q x_i + u_i^t R u_i$ for the control system $x_{i+1} = A x_i + B u_i + w_i$ where $x\in R^j$ is the state of the system; $u\in R^k$ is the control; $Q, R, A,$ and $B$ are matrices; and $w_i\in R^j$ is a random multivariate normally distributed vector.

Newer entries »