Articles by hundalhh

Mathematician and Father. Into games, astronomy, psychology and philosophy.

In “Greedy Function Approximation: A Gradient Boosting Machine”, Friedman (1999) presents an algorithm for incrementally improving the approximation of a function

$$f(x) = \sum_{m=0}^M \beta_m h(x, a_m)$$

where the $\beta_m$ are real and the $a_m$ are vectors of trained parameters. At each iteration, gradient boosting adds a new function $h$ chosen by steepest descent.
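For squared-error loss the negative gradient is just the current residual, so each stage fits the base learner to the residuals of the ensemble built so far. Here is a minimal sketch in Mathematica under that assumption, using regression stumps as the $h(x, a_m)$; stumpFit and boost are names I made up for this post, and I use a fixed shrinkage nu in place of Friedman’s line search for $\beta_m$:

(* Sketch only: one-dimensional gradient boosting under squared-error loss. *)
(* Fit a regression stump to residuals rs: choose the threshold that
   minimizes the within-side sum of squares; predict each side's mean. *)
stumpFit[xs_, rs_] := Module[{ts, err, t},
  ts = MovingAverage[Sort[DeleteDuplicates[xs]], 2]; (* candidate thresholds *)
  err[s_] := Module[{L, R},
    L = Pick[rs, UnitStep[s - xs], 1];
    R = Pick[rs, UnitStep[s - xs], 0];
    Total[(L - Mean[L])^2] + Total[(R - Mean[R])^2]];
  t = First[MinimalBy[ts, err]];
  With[{s = t, cL = Mean[Pick[rs, UnitStep[t - xs], 1]],
        cR = Mean[Pick[rs, UnitStep[t - xs], 0]]},
    Function[x, If[x <= s, cL, cR]]]]

(* Start from the constant model, then repeatedly fit a stump to the
   residuals and add a shrunken copy of it to the ensemble. *)
boost[xs_, ys_, m_, nu_ : 0.1] := Module[{F, h},
  F = With[{c = Mean[ys]}, Function[x, c]];
  Do[h = stumpFit[xs, ys - Map[F, xs]];
     F = With[{g = F, b = h}, Function[x, g[x] + nu b[x]]],
    {m}];
  F]

For example, boost[xs, ys, 100] returns a function of one variable that can be plotted against the training data.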

Carl and I have been discussing concepts related to sparsity in neural nets. One older idea similar to sparsity is negatively correlated or uncorrelated neural nets. To use this technique, you could train several neural nets on part or all of the data and then keep the neural networks that are less correlated. The idea used in the CELS algorithm is to add a correlation penalty term to the loss function of the neural net optimization process. In “Simultaneous Training of Negatively Correlated Neural Networks in an Ensemble”, Liu and Yao (1999) say that

The idea behind CELS is to encourage different individual networks in an ensemble to learn different parts or aspects of a training data so that the ensemble can learn the whole training data better.

which seems similar to what happens in sparse neural networks.
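From my reading of Liu and Yao (1999), hedged from memory: each network $F_i$ in CELS is trained to minimize its own squared error plus a $\lambda$-weighted correlation penalty,

$$E_i = \frac{1}{2}\,(F_i - d)^2 + \lambda\,(F_i - \overline{F}) \sum_{j \neq i} (F_j - \overline{F}),$$

where $d$ is the target and $\overline{F}$ is the ensemble’s combined output. The penalty is negative precisely when network $i$’s deviation from the ensemble is anti-correlated with the other networks’ deviations, which is what pushes the individual networks apart.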

My friend Andrew wrote:

Initially, I thought it was dependent on your definition of “random rational number”, but then I saw your second example that removed duplicates (1/2, 2/4, etc.).

I tried a different method by sampling from a real distribution and then rationalizing it:

Mean@Mod[Denominator@Rationalize[#, 1*^-6] & /@ RandomReal[{0, 1}, 100000], 2]

Then, I tried changing the distribution and still got the same result:

Mean@Mod[Denominator@Rationalize[#, 1*^-6] & /@ RandomReal[NormalDistribution[0, 1], 100000], 2]

— Andrew


What percentage of “random fractions” have an even denominator? My friend Pig told me that the answer is around 1/3. (At least the answer is 1/3 for two somewhat natural definitions of the term “random fraction”.) Here is the experimental “proof” in Mathematica. :)

Mathematica input for test #1:

percentage[v_List, pred_] := N[ Length[Select[v, pred]] / Length[v] ];

percentage[ Table[ RandomInteger[{1, 1000}] / RandomInteger[{1, 1000}],
 {100000} ], EvenQ[Denominator[#]] & ]

Output for test #1:

0.33343

Mathematica input for test #2:

percentage[ Table[i/j, {i, 1000}, {j, 1000}] // Flatten // Union,
            EvenQ[Denominator[#]] & ]

Output for test #2:

0.333443
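Here is a heuristic argument for the 1/3 (my own sketch, not Pig’s): in lowest terms, the denominator of $i/j$ is $j/\gcd(i, j)$, which is even exactly when $j$ contains strictly more factors of 2 than $i$. For a large random integer, the exponent of 2 in its factorization equals $k$ with probability $2^{-(k+1)}$, so if $a$ and $b$ are the exponents of 2 in $i$ and $j$,

$$P(b > a) = \sum_{k=0}^{\infty} P(a = k)\, P(b > k) = \sum_{k=0}^{\infty} 2^{-(k+1)} \cdot 2^{-(k+1)} = \sum_{k=0}^{\infty} 4^{-(k+1)} = \frac{1}{3}.$$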


In “Negative Correlation Learning for Classification Ensembles”, Wang, Chen, and Yao (2010) modify AdaBoost to put more weight on classifiers that are less correlated with the boosted classifier. They do this by introducing a penalty term
$$p_t = (h_t - H) \sum_{k\neq t} (h_k - H),$$
“where H is the final output by combining the decisions from the individuals.” The idea for this came from the cooperative ensemble learning system (CELS) of Liu and Yao (1999). Experimental results and performance bounds are given in “The Effectiveness of a New Negative Correlation Learning Algorithm for Classification Ensembles” by Wang and Yao (2010).

I ran across an interesting paper yesterday, “Logistic Regression, AdaBoost and Bregman Distances” (Collins, Schapire, and Singer 2000).  AdaBoost is a great way to combine several weak predictors into a strong predictor (Freund 2012 is a great reference).  The Bregman distance is a generalization of both the squared Euclidean distance and the KL divergence.  Many well-known algorithms that use the Euclidean norm still work when it is replaced with a Bregman distance (e.g., Dykstra’s algorithm; see Censor and Reich 2000).  In the paper, the authors show that AdaBoost solves a best-approximation problem using the KL distance, and they introduce many variants of AdaBoost.  Here are some quotes from the paper:

1) “We are now able to borrow methods from the maximum entropy literature for logistic regression and apply them to the exponential loss used by AdaBoost, especially convergence proof techniques.”

2) “Duffy and Helmbold (1999) gave conditions under which a loss function gives a boosting algorithm. They showed that minimizing logistic loss does lead to a boosting algorithm in the PAC sense.”

3) “Our work builds heavily on that of Kivinen and Warmuth (1999) who, along with Lafferty, were the first to make a connection between AdaBoost and information geometry. They showed that the update used by AdaBoost is a form of ‘entropy projection.’”
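As background (the standard definition, not a quote from the paper): for a strictly convex, differentiable $F$, the Bregman distance is

$$B_F(p, q) = F(p) - F(q) - \nabla F(q) \cdot (p - q).$$

Taking $F(p) = \|p\|^2$ gives the squared Euclidean distance, and $F(p) = \sum_i p_i \ln p_i$ gives the KL divergence (for probability vectors).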

So, I am a bit disappointed that there do not seem to be any great methods for automated proving of theorems about games.  The best seem to be “Automated Theorem Proving for General Game Playing” (Schiffel & Thielscher 2009) and “A Temporal Proof System for General Game Playing” (Thielscher & Voigt 2010).  In the first paper, their program (which uses a sophisticated “answer set programming” theorem prover) proves theorems like:

• “proving that the argument of the control(P) feature is unique in every state (provided the feature exists);
• proving that the contents of a board cell is unique (to which end boards were identified manually);
• proving that the contents of a board cell is unique given the information that the argument of control(P) is;
• proving that the game is zero-sum;
• systematic search for unique arguments as described at the end of the preceding section.”

It is great that the authors combine automated theorem provers with game playing, but I am still somewhat disappointed because most people who play a game can easily read the rules and almost immediately answer questions about the game like “Is it zero-sum?” or “Do the squares only have one piece?”.  I was hoping for deeper theorems.  It is also disappointing that one of our best automated theorem proving algorithms cannot even prove these simple theorems for many games.

On the other hand, maybe we should only prove simple properties like this. When playing chess, about the only theorems I ever apply are 1) when in check, there are only three options: move the king, interpose a piece, or capture the attacking piece; and 2) if a piece is blocking an attack on its king, it cannot move (a pin). Really, I don’t think I use any theorems deeper than these while playing chess, and those two theorems are not at all useful to a computer because the computer can easily list all the possible moves.

In other games, I guess I do use theorems like “if I lead the third highest card in bridge, a finesse will succeed if the player on my left has the second highest card and my partner has the highest card (assuming no trump)” or “if my team holds 11 of the hearts, then the other team’s hearts will split 1-1 52% of the time.”
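That 52% is quick to verify: the two outstanding hearts lie among the 26 unseen cards, 13 in each opponent’s hand, so

$$P(\text{1-1 split}) = 1 - 2 \cdot \frac{13}{26} \cdot \frac{12}{25} = 1 - \frac{12}{25} = \frac{13}{25} = 52\%.$$

(The subtracted term is the chance that both hearts sit in the same hand, counted once for each opponent.)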

If computers are going to reason about games, then we probably need fuzzy logic (as in this paper) and/or probability theory.

In the paper “Automated Theorem Proving for General Game Playing” by Stephan Schiffel and Michael Thielscher, the authors apply answer set programming (similar to Prolog) to improve performance in general game playing. They point out that it is important for the AI to classify what type of game it is playing and to apply the techniques appropriate for that class of game. For example, null moves, upper confidence trees (UCT), alpha-beta search, Bellman equations (dynamic programming), and temporal difference learning would all be appropriate for two-player, zero-sum, turn-taking games.  It is also important to recognize concepts “like boards, pieces, and mobility.”  Apparently, van der Hoek pioneered using automated theorem proving to extract appropriate concepts from the game rules.  Schiffel and Thielscher extend the theorem proving to “automatically prove properties that hold across all legal positions.”

A friend of mine asked:

“Take $n$ samples from a standard normal distribution $N(0,1)$. Which has greater variance: the average or the median?”

(Aug 12, 2013 edit:  If you want to understand the answer to this question, read below.  If you want to compute the exact standard deviation of the median for a particular sample distribution, click “The Exact Standard Deviation of the Sample Median”.)

This question was mostly answered by Laplace and by Kenney and Keeping.

If the number of samples $n$ of a symmetric continuous distribution with probability density function $f$ is large, then the distribution of the sample median is approximately $N(\overline{x}, \sigma)$, where $\overline{x}$ is the mean of $f$, and
$$\sigma = {1 \over 2\sqrt{n}\,f(\overline{x})}.$$

The $1 \over 2\sqrt{n}$ comes from the fact that the fraction of the samples above the mean has mean $1/2$ with standard deviation $1 \over 2\sqrt{n}$.
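In a bit more detail (a standard argument, sketched here): the empirical CDF $\hat{F}_n$ evaluated at $\overline{x}$ is the fraction of samples below $\overline{x}$, which has standard deviation $1 \over 2\sqrt{n}$, and near $\overline{x}$ the true CDF rises with slope $f(\overline{x})$. Since the sample median $\hat{m}$ is where $\hat{F}_n$ crosses $1/2$,

$$\hat{m} - \overline{x} \approx \frac{1/2 - \hat{F}_n(\overline{x})}{f(\overline{x})}, \qquad \text{so} \qquad \sigma = \mathrm{sd}(\hat{m}) \approx {1 \over 2\sqrt{n}\, f(\overline{x})}.$$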

For the Gaussian distribution, the estimate for the standard deviation of the sample median is
$${\sqrt{2\pi} \over 2\sqrt{n}} = \sqrt{\pi \over 2n}.$$

The standard deviation of the sample mean is $1 / \sqrt{n}$. So the standard deviation of the sample median is larger by a factor of $\sqrt{\pi / 2} \approx 1.25$ (i.e., it has about 25% more error). This is a small price to pay if outliers are possible.
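A quick Monte Carlo check in Mathematica (my own sketch; the sample size n and the trial count are arbitrary choices):

(* Spread of the sample mean and sample median over many trials of n
   standard normal samples, compared with 1/Sqrt[n] and Sqrt[Pi/(2 n)]. *)
n = 101; trials = 10000;
data = RandomVariate[NormalDistribution[0, 1], {trials, n}];
{StandardDeviation[Mean /@ data], StandardDeviation[Median /@ data],
 N[1/Sqrt[n]], N[Sqrt[Pi/(2 n)]]}

The first and third numbers should nearly match, as should the second and fourth.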

So we have answered the question for the Gaussian. What about other symmetric, continuous, unimodal distributions? The standard deviation of the sample median decreases as the density at the median increases, so for sharply peaked distributions like the Laplace distribution, the standard deviation of the sample median is actually lower than the standard deviation of the sample mean.

I used approximations above, but with a lot more work and Beta distributions (the distributions of the order statistics), you can calculate the standard deviation of the sample median to arbitrary precision.

I’m rather fascinated by the HOO (Hierarchical Optimistic Optimization) algorithm.  It seems to be a combination of simulated annealing, branch and bound, the simplex method, and upper confidence trees.  I’m going to code it up and try it on a few problems.
