Artificial Intelligence Blog

Maybe good teachers should be paid $200,000 per year

January 18, 2013 in Uncategorized by hundalhh | Permalink

According to this graph

Figure 10 From THE LONG-TERM IMPACTS OF TEACHERS: TEACHER VALUE-ADDED AND STUDENT OUTCOMES IN ADULTHOOD by Chetty, Friedman, and Jonah E. Rockoff

high quality elementary school teachers increase the lifetime earnings of their students by about $200,000 per child.

Dropout – What happens when you randomly drop half the features?

January 16, 2013 in Deep Belief Networks, Neural Nets by hundalhh | Permalink

In “Improving neural networks by preventing co-adaptation of feature detectors“, Hinton, Srivastava, Krizhevsky, Sutskever, and Salakhutdinov answer the question: What happens if “On each presentation of each training case, each hidden unit is randomly omitted from the network with a probability of 0.5, so a hidden unit cannot rely on other hidden units being present.” This mimics the standard technique of training several neural nets and averaging them, but it is faster. When they applied the “dropout” technique to a deep Boltzmann neural net on the MNIST hand written digit data set and the TIMIT speech data set, they got robust learning without overfitting. This was one of the main techniques used by the winners of the Merck Molecular Activity Challenge.

Hinton talks about the dropout technique in his video Brains, Sex, and Machine Learning.

“Aaron Swartz (1986-2013)”

January 15, 2013 in Uncategorized by hundalhh | Permalink

“Aaron Swartz (1986-2013)”

“Analysis of Thompson Sampling for the multi-armed bandit problem”

January 14, 2013 in Multi-Armed Bandit Problem by hundalhh | 1 comment

In “Analysis of Thompson Sampling for the multi-armed bandit problem“, Agrawal and Goyal show that Thompson Sampling (“The basic idea is to choose an arm to play according to its probability of being the best arm.”) has logarithmic regret. More specifically, if there are $n$ bandits and the regret $R$ at time $T$ is defined by

$$R(T) := \sum_{t=1}^T (\mu^* – \mu_{i(t)})$$

where $\mu_i$ is the expected return of the $i$th arm and $\mu^* = \max_{i = 1, \ldots, n} \mu_i$, then

$$E[R(T)] \in O\left(\left(\sum_{i\neq i^*} 1/\Delta_i^2\right)^2 \log T \right)$$

where $\mu(i^*) = \mu^*$ and $\Delta_i = \mu^* – \mu_i$.

Cotner’s Observation

January 12, 2013 in Statistics by hundalhh | 1 comment

Taleb published an idea is similar to Cotner’s Observation which I will formally state below.

$\ $

If you see a tall person from a distance, you tend to expect that the tall person is male. In fact, the taller the person is, the more likely it is that the person is male. If you plot histograms of people’s heights, you will find that this intuition is correct. Cotner once pointed out to me that this intuition is false for many non-Gaussian Distributions.

$\ $

Cotner’s Observation:$\ $

Consider two random variables, $A$ and $B$ with $P(B>A) > 0.5$ and a random variable $X$ which is always equal to $A$ or $B$. Suppose we flip a fair coin and if the coin is heads we assign $X$ to $A$ and if it is tails we let $X$ equal $B$. Now if $X$ is large, we tend to think the coin was probably tails because $B$ is usually larger than $A$. Furthermore, the larger $X$ is the more we think that the coin was probably tails. This intuition is right for Gaussian Distributions, but it is wrong for many other distributions. Let $p(x)$ be the probability that the coin was heads given that $X = x$

$$p(x) := P( X=A | X=x).$$

If $A$ and $B$ have Gaussian Distributions with equal standard deviations, then $p(x)$ is decreasing.
If $A$ and $B$ have Laplace Distributions with equal standard deviations, then $p(x)$ is becomes constant for all $x > E[B]$ and $p(x) < 0.5$ when $x > E[B]$.
If $A$ and $B$ have Student T-Distributions with equal standard deviations and equal degrees of freedom, then $p(x)$ will actually increase if $x$ is large enough and $p(x)$ approaches 0.5 as $x$ goes to infinity.

$\ $

Many distributions have a fat tails like the Student T Distributions.

“BOA: The Bayesian Optimization Algorithm”

January 10, 2013 in Graphical Models, Multi-Armed Bandit Problem, Optimization by hundalhh | Permalink

In “BOA: The Bayesian Optimization Algorithm“, Pelikan, Goldberg, and Cant´u-Paz introduce an adaptive improvement over genetic optimization algorithms (See also [1]). They write, “In this paper, an algorithm based on the concepts of genetic algorithms that uses an estimation of a probability distribution of promising solutions in order to generate new candidate solutions is proposed. To estimate the distribution, techniques for modeling multivariate data by Bayesian networks are used.” and

“The algorithm proposed in this paper is also capable of covering higher order interactions. It uses techniques from the ﬁeld of modeling data by Bayesian networks in order to estimate the joint distribution of promising solutions. The class of distributions that are considered is identical to the class of conditional distributions used in the FDA. Therefore, the theory of the FDA can be used in order to demonstrate the power of the proposed algorithm to solve decomposable problems. However, unlike the FDA, our algorithm does not require any prior information about the problem. It discovers the structure of a problem on the ﬂy.”

where FDA refers to the Factorized Distribution Algorithm (Mühlenbein et al., 1998). The algorithm consists of the following steps

The Bayesian Optimization Algorithm (BOA)
(1) set t ← 0 randomly generate initial population P(0)
(2) select a set of promising strings S(t) from P(t)
(3) construct the network B using a chosen metric and constraints
(4) generate a set of new strings O(t) according to the joint distribution encoded by B
(5) create a new population P(t+1) by replacing some strings from P(t) with O(t) set t ← t + 1
(6) if the termination criteria are not met, go to (2)

where B is a Bayesian network.

Check out the NIPS 2011 workshop.

NYU Large Scale Machine Learning Class

January 8, 2013 in General ML by hundalhh | Permalink

Check out the new class (all lectures on-line) advertised on the Machine Learning (Theory) blog.

“Predicting the Future: Randomness and Parsimony”

January 8, 2013 in Uncategorized by hundalhh | 2 comments

Check out the Nuit Blanche posts “Predicting the Future: The Steamrollers” and “Predicting the Future: Randomness and Parsimony” where Igor Carron repeats the well known mantra of Moore’s law that always seems to catch us by surprise. Carron’s remarks on medicine surprised me but also I thought, “I should have guessed that would happen” while reading the articles.

Markov Logic Networks Tutorial

January 6, 2013 in Graphical Models, Logic by hundalhh | Permalink

Garvesh Raskutti has some nice slides on Probabilistic Graphical Models and Markov Logic Networks (Richardson & Domingos 2006). Markov Logic Networks encode first order predicate logic into a Markov Random Field. The resulting networks can be quite large because statements like “for all x, y, and z, x is y’s parent and z is x’s parent imply z is y’s grandparent” require the existence of $2n^2$ nodes in the graph where $n$ is the number of people considered. The resulting networks are frequently solved by using Gibbs Sampling. For even more information Pedro Domingos has put an entire course online at

www.cs.washington.edu/homes/pedrod/803

Ecological Fallacy

January 4, 2013 in Statistics by hundalhh | Permalink

“Ecological Fallacy” is a phenomena in statistics where correlations for groups of data points are differ from the correlations for ungrouped data. Here a quote from the Wikipedia,

“Another example is a 1950 paper by William S. Robinson that coined the term.^[5] For each of the 48 states + District of Columbia in the US as of the 1930 census, he computed the illiteracy rate and the proportion of the population born outside the US. He showed that these two figures were associated with a negative correlation of −0.53 — in other words, the greater the proportion of immigrants in a state, the lower its average illiteracy. However, when individuals are considered, the correlation was +0.12 — immigrants were on average more illiterate than native citizens. Robinson showed that the negative correlation at the level of state populations was because immigrants tended to settle in states where the native population was more literate. He cautioned against deducing conclusions about individuals on the basis of population-level, or “ecological” data. In 2011, it was found that Robinson’s calculations of the ecological correlations are based on the wrong state level data. The correlation of −0.53 mentioned above is in fact −0.46.^[6]“

« Older entries § Newer entries »

Artificial Intelligence Blog

Maybe good teachers should be paid $200,000 per year

Dropout – What happens when you randomly drop half the features?

“Aaron Swartz (1986-2013)”

“Analysis of Thompson Sampling for the multi-armed bandit problem”

Cotner’s Observation

“BOA: The Bayesian Optimization Algorithm”

NYU Large Scale Machine Learning Class

“Predicting the Future: Randomness and Parsimony”

Markov Logic Networks Tutorial

Ecological Fallacy

Categories

Archives

Subscribe to ArtEnt via Email