In the paper “Automated Theorem Proving for General Game Playing” by Stephan Schiffel and Michael Thielscher, the authors apply answer set programming (similar to Prolog) to improve performance in general game playing. They point out that it is important for the AI to classify what type of game it is playing and to apply appropriate techniques for that class of game. For example, null moves, upper confidence trees (UCT), alpha-beta search, Bellman equations (dynamic programming), and temporal difference learning would all be appropriate for two-player, zero-sum, turn-taking games. It is also important to recognize concepts “like boards, pieces, and mobility.” Apparently, van der Hoek pioneered the use of automated theorem proving to extract appropriate concepts from the game rules. Schiffel and Thielscher extend the theorem proving to “automatically prove properties that hold across all legal positions.”
A friend of mine asked:
“Take $n$ samples from a standard normal distribution $N(0,1)$. Which has greater variance: the average or the median?”
(Aug 12, 2013 edit: If you want to understand the answer to this question, read below. If you want to compute the exact standard deviation of the median for a particular sample distribution, click “The Exact Standard Deviation of the Sample Median”.)
This question was mostly answered by Laplace and by Kenney and Keeping.
If the number of samples $n$ from a symmetric continuous distribution with probability density function $f$ is large, then the distribution of the sample median is approximately $N(\overline{x}, \sigma)$, where $\overline{x}$ is the mean of $f$, $\sigma$ is the standard deviation, and
$$\sigma = {1 \over 2\sqrt{n}\,f(\overline{x})}.$$
The ${1 \over 2\sqrt{n}}$ comes from the fact that the number of samples above the mean is Binomial$(n, 1/2)$, so the fraction of samples above the mean has mean $1/2$ and standard deviation ${1 \over 2\sqrt{n}}$.
For the Gaussian distribution, the estimate for the standard deviation of the sample median is
$${\sqrt{2\pi} \over 2\sqrt{n}} = \sqrt{\pi \over 2n}.$$
The standard deviation of the sample mean is $1 / \sqrt{n}$. So the standard deviation of the sample median is larger by a factor of $\sqrt{\pi / 2} \approx 1.25$ (i.e., it has about 25% more error). This is a small price to pay if outliers are possible.
So we have answered the question for the Gaussian. What about other symmetric, continuous, unimodal distributions? The standard deviation of the sample median decreases as the density of the distribution at the median increases, so for sharply peaked distributions like the Laplace distribution, the standard deviation of the sample median is actually lower than the standard deviation of the sample mean.
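Here is a quick Monte Carlo sketch of both claims (my own check, not part of the derivation above); the sample size $n = 100$, the number of trials, and the use of numpy are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(0)
n, trials = 100, 20000

for name, sample in [("normal", lambda: rng.standard_normal((trials, n))),
                     ("laplace", lambda: rng.laplace(size=(trials, n)))]:
    x = sample()
    sd_mean = x.mean(axis=1).std()
    sd_median = np.median(x, axis=1).std()
    print(f"{name}: sd(mean) = {sd_mean:.4f}, sd(median) = {sd_median:.4f}, "
          f"ratio = {sd_median / sd_mean:.3f}")

# For the normal distribution the ratio should be close to sqrt(pi/2) ~ 1.25;
# for the Laplace distribution the median should have the smaller standard deviation.
print("sqrt(pi/2) =", np.sqrt(np.pi / 2))
```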
I used approximations above, but with a lot more work and the Beta distribution, you can calculate the standard deviation of the sample median to arbitrary precision.
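For instance, for odd $n$ the sample median is the $k$-th order statistic with $k = (n+1)/2$, and its density is the Beta$(k, n-k+1)$ density evaluated at the normal CDF, times the normal density. Here is a sketch of that calculation (assuming scipy; the choice $n = 101$ is arbitrary):

```python
import numpy as np
from scipy import stats, integrate
from scipy.special import betaln

n = 101
k = (n + 1) // 2                      # index of the median order statistic

def median_pdf(x):
    # density of the k-th order statistic of n standard normal samples:
    # Beta(k, n-k+1) density evaluated at F(x), times the normal density f(x)
    F = stats.norm.cdf(x)
    log_beta_pdf = -betaln(k, n - k + 1) + (k - 1) * np.log(F) + (n - k) * np.log1p(-F)
    return np.exp(log_beta_pdf) * stats.norm.pdf(x)

# The mean is 0 by symmetry, so the variance is just the second moment.
second_moment, _ = integrate.quad(lambda x: x * x * median_pdf(x), -10, 10)
print("exact sd of the sample median:", np.sqrt(second_moment))
print("approximation sqrt(pi/(2n)): ", np.sqrt(np.pi / (2 * n)))
```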
I’m rather fascinated by the HOO algorithm. It seems to be a combination of simulated annealing, branch and bound, the simplex method, and upper confidence trees. I’m going to code it up and try it on a few problems.
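Below is my rough sketch of HOO over the unit interval, written from the pseudocode in the X-Armed Bandits paper. The test function, the noise level, and the parameters `nu1` and `rho` are my own arbitrary choices, and the paper’s actual recommendation rule is more careful than the simple read-out at the end.

```python
import math, random

def hoo(f, rounds=1000, nu1=1.0, rho=0.5):
    # Node (h, i) covers the interval [(i-1)/2^h, i/2^h]; the root is (0, 1).
    # stats[node] = [visit count T, empirical mean of the rewards seen below it].
    stats = {(0, 1): [0, 0.0]}

    def children(h, i):
        return (h + 1, 2 * i - 1), (h + 1, 2 * i)

    def b_value(node, n):
        if node not in stats or stats[node][0] == 0:
            return float("inf")            # unexplored nodes are maximally optimistic
        T, mean = stats[node]
        u = mean + math.sqrt(2 * math.log(n) / T) + nu1 * rho ** node[0]
        return min(u, max(b_value(c, n) for c in children(*node)))

    for n in range(1, rounds + 1):
        # Follow the child with the larger B-value until reaching a node not yet explored.
        node, path = (0, 1), [(0, 1)]
        while node in stats and stats[node][0] > 0:
            node = max(children(*node), key=lambda c: b_value(c, n))
            path.append(node)
        stats.setdefault(node, [0, 0.0])

        # Play an arm inside the chosen node's interval and observe a noisy reward.
        h, i = node
        y = f(random.uniform((i - 1) / 2 ** h, i / 2 ** h))

        # Update counts and running means along the path back to the root.
        for p in path:
            T, mean = stats[p]
            stats[p] = [T + 1, mean + (y - mean) / (T + 1)]

    # Simple read-out: follow the most-visited child from the root and return the midpoint.
    node = (0, 1)
    while True:
        kids = [c for c in children(*node) if c in stats and stats[c][0] > 0]
        if not kids:
            break
        node = max(kids, key=lambda c: stats[c][0])
    h, i = node
    return (i - 0.5) / 2 ** h

# Toy problem: maximize a bumpy function on [0, 1] observed with Gaussian noise.
noisy = lambda x: 0.5 * (math.sin(13 * x) * math.sin(27 * x) + 1) + random.gauss(0, 0.05)
print("HOO's guess at the maximizer:", hoo(noisy))
```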
According to the Wikipedia, Student’s t-distribution is the maximum entropy probability distribution for a random variate $X$ for which $E(\log(\nu + X^2))$ is fixed. This is similar to the fact that the Gaussian distribution is the maximum entropy distribution among all distributions for which $E(X^2)$ is fixed.
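A quick way to see why (my gloss, not from the Wikipedia article): maximizing entropy subject to a constraint on $E(g(X))$ gives, by Lagrange multipliers, a density of the form
$$p(x) \propto e^{-\lambda\, g(x)}.$$
Taking $g(x) = x^2$ gives the Gaussian kernel, while taking $g(x) = \log(\nu + x^2)$ gives
$$p(x) \propto e^{-\lambda \log(\nu + x^2)} = (\nu + x^2)^{-\lambda},$$
which is the Student’s t kernel (with $\lambda = (\nu+1)/2$).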
Here is a cool set of PowerPoint slides by Brodeur and Child on upper confidence trees (UCT) applied to Go. They explore modifications to UCT for searching directed acyclic graphs and for grouping game states.
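The core of UCT is the UCB1 rule used to pick which child to descend into. Here is a minimal sketch of just that selection step (mine, not from the slides); the function name, the `(visits, total_reward)` representation, and the exploration constant are my own choices:

```python
import math

def ucb1_select(children, parent_visits, c=math.sqrt(2)):
    """children: list of (visits, total_reward) pairs; returns the index to follow."""
    def score(child):
        visits, total = child
        if visits == 0:
            return float("inf")          # always try unvisited children first
        return total / visits + c * math.sqrt(math.log(parent_visits) / visits)
    return max(range(len(children)), key=lambda i: score(children[i]))

# Example: the second child has a lower mean reward but far fewer visits,
# so the exploration bonus makes it the one to try next.
print(ucb1_select([(90, 54.0), (10, 4.0)], parent_visits=100))
```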
Carl, the Wikipedia, and I put together this list.
- Theory
- Learning
- Markov Methods & Probabilistic Graphical Models
- Neural nets
- Boltzmann machine, Restricted Boltzmann machines
- Deep Belief Networks (Hinton1, Hinton2, Drop out)
- Sparse Deep Belief Networks and Autoencoders
- Pattern Recognition
- Naive Bayes
- Support vector machines
- Decision trees (CART)
- Predictive Analytics
- Kernel methods (kernelized support vector machine)
- Kernel Density Estimation (Parzen Windows)
- Random Forests
- Structured Prediction
- Dimension Reduction
- LDA & QDA (Linear and Quadratic Discriminant Analysis)
- PCA
- ICA
- Self-organizing maps. (Kohonen)
- Latent Dirichlet Allocation
- Curve fitting
Bubeck Slides “Continuous Stochastic Optimization” using Hierarchical Optimistic Optimization (HOO)
X-Armed Bandits (HOO)
Sample-Based Planning for Continuous Action Markov Decision Processes
Prediction, Learning, and Games (book by Cesa-Bianchi & Lugosi)
I just discovered the Face Recognition Homepage while looking for a database of images of faces. It is also a good place to find algorithms, source code, and other good stuff!
In this article, Wang, Huang, Kamangar, Nie, and Ding discuss a new algorithm for multi-instance learning.
“In MIL data objects are represented as bags of instances, therefore the distance between the objects is a set-to-set distance. Compared to traditional single-instance data that use vector distance such as Euclidean distance, estimating the Bag-to-Bag (B2B) distance in MIL is more challenging [7,8]. In addition, the B2B distances often do not truly reflect the semantic similarities [9].”
I am hoping to apply these ideas to astronomical image stacking.
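As a concrete illustration of the set-to-set distances mentioned in the quote (common baselines, not the algorithm from the paper), here is a sketch of the Hausdorff and minimal bag-to-bag distances between two bags of instance vectors, assuming numpy:

```python
import numpy as np

def b2b_hausdorff(bag_a, bag_b):
    """Hausdorff distance between two bags, each an (n_instances, n_features) array."""
    d = np.linalg.norm(bag_a[:, None, :] - bag_b[None, :, :], axis=-1)  # pairwise distances
    return max(d.min(axis=1).max(), d.min(axis=0).max())

def b2b_min(bag_a, bag_b):
    """'Minimal' variant: distance between the closest pair of instances."""
    d = np.linalg.norm(bag_a[:, None, :] - bag_b[None, :, :], axis=-1)
    return d.min()

bag1 = np.array([[0.0, 0.0], [1.0, 0.0]])
bag2 = np.array([[0.0, 0.1], [5.0, 5.0]])
print(b2b_hausdorff(bag1, bag2), b2b_min(bag1, bag2))
```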
(Based on the discussion between Carl and me on Saturday, July 7.)
It seems that in order to achieve Strong AI we need to combine Machine Learning and Good Old-Fashioned Artificial Intelligence. For those not familiar with GOFAI, it is mostly about symbol manipulation and includes subjects like computer algebra, theorem proving, first-order predicate calculus, lambda calculus, Turing machines, alpha-beta pruning, context-free grammars, depth-first/breadth-first search, branch and bound, semantic networks, expert systems, Prolog, Lisp, resolution theorem proving, and Gödel’s incompleteness theorem. I wonder if we also have to add less formal, plausible reasoning systems that guess at theorems which are only “mostly true” (true only for points in general position, outside of a set of measure zero) or have a high “probability of being true”. Such reasoning techniques might be derived from Bayesian networks or Alternative Hypothesis Spaces (Griffin 2009) (http://bmir.stanford.edu/file_asset/index.php/1565/BMIR-2010-1375.pdf).
Maybe we can invent a non-discrete Turing machine. Say we replace the finite state machine that reads and prints on the tape with a neural net. We might also print real numbers between 0 and 1 on the tape instead of printing only zeros and ones. Of course, this continuous Turing machine would not be better at classifying binary strings than a traditional discrete Turing machine; rather, the hope is that such a machine might be easier to optimize using general optimization techniques similar to gradient descent. My personal feeling is that this particular method will not work, but something like it might.
We need something that captures the most fundamental ideas, like the idea of a Cartesian plane. The idea of a Cartesian product arises naturally from category theory. If one wants to formulate rules of thumb for tic-tac-toe, the 3 by 3 grid is an essential idea that is hard for a neural net to learn without prompting. Maybe we can use category theory to guide our learning algorithms (http://golem.ph.utexas.edu/category/2007/09/category_theory_in_machine_lea.html) toward the correct abstractions.
What is an abstraction anyway? In chess, if a king is in check, there are only three types of possible moves: move the king, interpose a piece, or capture the piece attacking the king. So when we consider responses to a check, we only need to consider those three types of moves. The fact that only these three types of moves exist is a theorem. This theorem is like a subroutine that speeds up our processing in a chess game. So an abstraction could be like a helpful subroutine. The theorem also allows us to disregard pieces that are not “near” the king or the attacking piece; in that regard, it lets us remove information from the game board before calculating the possible counter-moves, so the theorem is like a projection. The matrix
$$\begin{pmatrix} 1 & 0 \\ 0 & 0 \end{pmatrix}$$
sets the y coordinate to zero, so it destroys information. So some abstractions can be viewed as helpful projections that remove irrelevant information—just like manifold learning or dimension reduction. Another way to remove information is to disregard the particulars of the situation and boil it down to the essentials. Group theory was invented this way.
Group theory is important in another regard: examples and models. Suppose a thinking machine were trying to prove conjectures about group theory. Before attempting a proof, the machine could check the conjecture on a few small symmetric groups, a few finite cyclic groups, and maybe addition on the real numbers. If the conjecture held for all of those examples, the prover could then try to prove it; if it failed on any example, the computer would not waste time trying.
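Here is a toy sketch of that filtering step; the conjectures and the encodings of $Z_n$ and $S_3$ as Python functions are my own illustrative choices:

```python
from itertools import permutations, product

def cyclic(n):
    """Elements and operation of the cyclic group Z_n."""
    return list(range(n)), lambda a, b: (a + b) % n

def symmetric(n):
    """Elements and operation (composition) of the symmetric group S_n."""
    return list(permutations(range(n))), lambda p, q: tuple(p[q[i]] for i in range(n))

examples = [("Z_4", *cyclic(4)), ("Z_5", *cyclic(5)), ("S_3", *symmetric(3))]

def check(conjecture, arity=2):
    """Test a conjecture (a predicate over the group operation and `arity` elements)
    on every tuple of elements of every example group."""
    for name, elems, op in examples:
        for args in product(elems, repeat=arity):
            if not conjecture(op, *args):
                return "fails on " + name + " -- do not bother proving it"
    return "holds on all examples -- worth handing to the prover"

# "Every group is abelian": S_3 is a counterexample, so the prover is never invoked.
print(check(lambda op, a, b: op(a, b) == op(b, a)))
# Associativity holds in every example, so the prover could go to work on it.
print(check(lambda op, a, b, c: op(op(a, b), c) == op(a, op(b, c)), arity=3))
```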
Another useful idea is that classifiers or theorems could have natural topologies (http://philosophy.berkeley.edu/file/698/FractalCompletenessTechniques.pdf). It seems hard for neural nets to learn parity because changing any one bit of the input changes the class from even to odd or vice versa; nearest neighbor with $k=1$ would be worse than useless for such a problem. Topology could offer further insights for machine learning, such as what is learnable. If the fractal dimension of a class is high (or its boundary has a high fractal dimension), we would expect the concept to be harder to learn.
For example, if a classifier had to classify the points in a circle and it used nearest neighbor with $k=1$, then the number of random samples needed to achieve an error of $\epsilon$ is on the order of $(1/\epsilon)^2$. The number of points needed to classify the area inside a Koch snowflake (http://www.wolframalpha.com/input/?i=Koch+snowflake&lk=3) is on the order of $(1/\epsilon)^{2/(2-k)}$, where $k$ is the fractal dimension of the border of the Koch snowflake.
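A quick empirical check of the $(1/\epsilon)^2$ claim (my sketch; the circle radius, the sample sizes, and the brute-force nearest-neighbor search are arbitrary choices): label random points in the unit square by whether they land inside a circle, classify fresh points by their nearest labeled neighbor, and watch the error fall off roughly like $1/\sqrt{N}$.

```python
import numpy as np

rng = np.random.default_rng(1)
inside = lambda p: np.linalg.norm(p - 0.5, axis=1) < 0.35   # true class: inside the circle

test = rng.random((2000, 2))
test_labels = inside(test)

for n_train in [100, 400, 1600, 6400]:
    train = rng.random((n_train, 2))
    train_labels = inside(train)
    # brute-force 1-nearest-neighbor via squared distances
    d2 = (test ** 2).sum(1)[:, None] + (train ** 2).sum(1)[None, :] - 2 * test @ train.T
    err = np.mean(train_labels[np.argmin(d2, axis=1)] != test_labels)
    print(f"N = {n_train:5d}   1-NN error = {err:.4f}   error * sqrt(N) = {err * np.sqrt(n_train):.2f}")
```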