(Based on the discussion between Carl and me on Saturday, July 7.)

It seems that in order to achieve Strong AI we need to combine Machine Learning and Good Old-Fashioned Artificial Intelligence. For those not familiar with GOFAI, it is mostly about symbol manipulation and includes subjects like computer algebra, theorem proving, first-order predicate calculus, lambda calculus, Turing machines, alpha-beta pruning, context-free grammars, depth/breadth-first search, branch and bound, semantic networks, expert systems, Prolog, Lisp, resolution theorem proving, and Gödel’s incompleteness theorem. I wonder if we also have to add less formal, plausible reasoning systems that guess at theorems which are only “mostly true” (true only for points in general position, outside of a set of measure zero) or have a high “probability of being true”. Such reasoning techniques might be derived from Bayesian networks or Alternative Hypothesis Spaces (Griffin 2009) (http://bmir.stanford.edu/file_asset/index.php/1565/BMIR-2010-1375.pdf).

Maybe we can invent a non-discrete Turing machine. Say we replace the finite state machine which reads and prints on the tape with a neural net. We might also print real numbers between 0 and 1 on the tape instead of printing only zeros and ones. Of course, this continuous Turing machine would not be better at classification of binary strings than a traditional discrete Turing machine; rather, the hope is that such a machine might be easier to optimize using general optimization techniques similar to gradient descent. My personal feeling is that this particular method will not work, but something like it might work.
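
Just to make the idea concrete, here is a purely hypothetical sketch of such a machine: a tiny feed-forward controller with random weights stands in for the finite state machine, and the tape holds real numbers. Everything here (the controller architecture, the hard left/right head move, the lack of any training loop) is my own guess at what "continuous Turing machine" might mean; to optimize it by gradient descent the head move would itself have to be made differentiable.

```python
import numpy as np

# Hypothetical "continuous Turing machine": a small neural controller replaces
# the finite state machine, and the tape holds real numbers in [0, 1].
rng = np.random.default_rng(0)
n_hidden = 8
W_in = rng.normal(size=(n_hidden, n_hidden + 1))   # input: hidden state + value read from tape
W_out = rng.normal(size=(3, n_hidden))             # outputs: write value, move-left score, move-right score

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def step(tape, head, hidden):
    read = tape[head]
    hidden = np.tanh(W_in @ np.concatenate([hidden, [read]]))
    out = W_out @ hidden
    tape[head] = sigmoid(out[0])                    # write a real number, not just 0 or 1
    head = (head + (1 if out[1] > out[2] else -1)) % len(tape)   # non-differentiable head move
    return tape, head, hidden

tape = rng.random(16)
head, hidden = 0, np.zeros(n_hidden)
for _ in range(20):                                 # run the machine for 20 steps
    tape, head, hidden = step(tape, head, hidden)
print(np.round(tape, 2))
```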

We need something that captures the most fundamental ideas like the idea of a Cartesian Plane. The idea of a Cartesian product arises naturally from Category Theory. If one wants to formulate rules of thumb for tic-tac-toe, the 3 by 3 grid is an essential idea that is hard for a neural net to learn without prompting. Maybe we can use category theory to guide our learning algorithms (http://golem.ph.utexas.edu/category/2007/09/category_theory_in_machine_lea.html) to the correct abstractions.

What is an abstraction anyway? In chess, if a king is in check, there are only three types of possible moves: move the king, interpose a piece, or capture the piece attacking the king. So when we consider responses to a check, we only need to consider those three types of moves. The fact that only these three types of moves exist is a theorem. This theorem is like a subroutine that speeds up our processing in a chess game. So an abstraction could be like a helpful subroutine. The theorem also allows us to disregard pieces that are not “near” the king or the attacking piece. In that sense, it lets us remove information from the game board before calculating the possible counter-moves, so the theorem acts like a projection. The matrix

$$\begin{bmatrix} 1 & 0 \\ 0 & 0 \end{bmatrix}$$

sets the y coordinate to zero, so it destroys information. So some abstractions can be viewed as helpful projections that remove irrelevant information—just like manifold learning or dimension reduction. Another way to remove information is to disregard the particulars of the situation and boil it down to the essentials. Group theory was invented this way.
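
A tiny illustration of that projection (my own snippet, not from the post): applying the matrix above to a few points keeps the x coordinate and erases the y coordinate.

```python
import numpy as np

# The projection matrix from the text: it keeps x and throws away y.
P = np.array([[1, 0],
              [0, 0]])
points = np.array([[3, 7], [2, -5], [0, 4]])
print(points @ P.T)   # [[3 0], [2 0], [0 0]] -- the y information is gone
```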

Group theory is important in another regard: examples and models. Suppose that a thinking machine was trying to prove conjectures about group theory. Before trying to prove a conjecture, the machine could check some simple finite symmetric groups, a few finite cyclic groups, and maybe addition on the real numbers to see whether the conjecture holds for those examples. If it does, then the prover could try to prove the conjecture. If it does not, then the computer would not waste time trying to prove it. A sketch of this filtering idea appears below.
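
Here is a hedged sketch of that filter in Python, using two of the small groups mentioned above (cyclic groups under addition mod n, and the symmetric group S3 under composition of permutations). The conjecture tested, "every group is commutative", is my own illustrative example, not one from the post.

```python
from itertools import permutations, product

# Small finite groups to test conjectures against before attempting a proof.
def cyclic_group(n):
    elements = list(range(n))
    op = lambda a, b: (a + b) % n                   # addition mod n
    return elements, op

def symmetric_group(n):
    elements = list(permutations(range(n)))
    op = lambda p, q: tuple(p[i] for i in q)        # composition of permutations
    return elements, op

def holds(conjecture, group):
    elements, op = group
    return all(conjecture(a, b, op) for a, b in product(elements, repeat=2))

# Illustrative conjecture: "every group is commutative".
abelian = lambda a, b, op: op(a, b) == op(b, a)

groups = [cyclic_group(4), cyclic_group(5), symmetric_group(3)]
if all(holds(abelian, g) for g in groups):
    print("No counterexample found; worth attempting a proof.")
else:
    print("Counterexample found; do not waste time on a proof.")   # S3 is non-abelian
```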

Another useful idea is that classifiers or theorems could have natural topologies (http://philosophy.berkeley.edu/file/698/FractalCompletenessTechniques.pdf). It seems that it is hard for neural nets to learn parity because changing any one bit in the input changes the class from even to odd or vice versa. Nearest neighbor with $k=1$ would be worse than useless for such a problem. Topology could offer further insights for machine learning, such as what is learnable. If the fractal dimension of a class is high (or its boundary has a high fractal dimension), we would expect the concept to be harder to learn.
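
A quick sketch (my own, not from the post) of why 1-nearest-neighbor is "worse than useless" on parity: hold out any bit string, and its nearest training neighbor is at Hamming distance 1, so it always lies in the opposite class.

```python
from itertools import product

# All 8-bit strings, labeled by parity (even = 0, odd = 1).
bits = list(product([0, 1], repeat=8))
parity = lambda x: sum(x) % 2
hamming = lambda x, y: sum(a != b for a, b in zip(x, y))

correct = 0
for test in bits:                                   # leave-one-out evaluation
    train = [x for x in bits if x != test]
    nearest = min(train, key=lambda x: hamming(x, test))
    correct += (parity(nearest) == parity(test))
print(correct / len(bits))                          # 0.0 -- every single prediction is wrong
```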

For example, if a classifier had to classify the points in a circle and it used nearest neighbor with $k=1$, then the number of random samples needed to achieve an error of $\epsilon$ is of order $(1/\epsilon)^2$. The number of points needed to classify the area inside a Koch snowflake (http://www.wolframalpha.com/input/?i=Koch+snowflake&lk=3) is of order $(1/\epsilon)^{2/(2-k)}$, where $k$ is the fractal dimension of the border of the Koch snowflake.
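
Taking that formula at face value and plugging in the known fractal dimension of the Koch curve makes the gap concrete:

$$k = \frac{\log 4}{\log 3} \approx 1.26, \qquad \frac{2}{2 - k} \approx 2.71,$$

so the snowflake's interior needs on the order of $(1/\epsilon)^{2.71}$ samples, versus $(1/\epsilon)^{2}$ for the circle.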

It occurs to me that when designing an algorithm for a large organization the designer should minimize
$$\alpha T + \beta L + \gamma H$$
where $\alpha$, $\beta$, and $\gamma$ are unknown positive constants, $T$ is the time your code needs to run, $L$ is the number of lines of code required, and $H$ is the number of hours required by other engineers to understand your algorithm.

PS: The above statement is considerably less funny if you try to estimate $\alpha$, $\beta$, and $\gamma$.

Several multi-armed bandit algorithms with logarithmic regret.

Here is a fun paper solving 4×4 minesweeper via the Bellman MDP algorithm.

This paper by Kuleshov and Precup gives some practical advice and empirical evidence for multi-armed bandit (MAB) algorithms. They conduct several experiments in the standard MAB setting with $\epsilon$-greedy, upper confidence bound (UCB), and Boltzmann (softmax) algorithms. Their time horizon is only 1000 pulls, so, in general, the $\epsilon$-greedy and Boltzmann algorithms did better than the logarithmic-regret algorithms.
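
For readers who have not seen these algorithms, here is a minimal sketch (my own illustration, not code from the paper) of two of the strategies compared: $\epsilon$-greedy and UCB1, an instance of the upper confidence bound family, run on Bernoulli arms with a 1000-pull horizon. The arm probabilities, $\epsilon = 0.1$, and the horizon are arbitrary choices.

```python
import math, random

def pull(p):                                        # Bernoulli arm with success probability p
    return 1.0 if random.random() < p else 0.0

def epsilon_greedy(arm_probs, horizon=1000, eps=0.1):
    counts = [0] * len(arm_probs)
    values = [0.0] * len(arm_probs)
    total = 0.0
    for _ in range(horizon):
        if random.random() < eps:
            a = random.randrange(len(arm_probs))                        # explore
        else:
            a = max(range(len(arm_probs)), key=lambda i: values[i])     # exploit
        r = pull(arm_probs[a])
        counts[a] += 1
        values[a] += (r - values[a]) / counts[a]                        # running mean reward
        total += r
    return total

def ucb1(arm_probs, horizon=1000):
    counts = [1] * len(arm_probs)
    values = [pull(p) for p in arm_probs]                               # pull each arm once
    total = sum(values)
    for t in range(len(arm_probs) + 1, horizon + 1):
        a = max(range(len(arm_probs)),
                key=lambda i: values[i] + math.sqrt(2 * math.log(t) / counts[i]))
        r = pull(arm_probs[a])
        counts[a] += 1
        values[a] += (r - values[a]) / counts[a]
        total += r
    return total

arms = [0.2, 0.5, 0.6]
print(epsilon_greedy(arms), ucb1(arms))             # total reward collected by each strategy
```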

In “The Weighted Majority Algorithm” (1994) Littlestone and Warmuth describe an algorithm for combining several predictors into a better predictor. The algorithm is very similar to AdaBoost (Freund and Schapire 1995).

Assume that there is an observed sequence of zeros and ones and a collection of Turing machines which also output zeros and ones. Suppose that we want to predict the $i$-th value of the observed sequence given the first $i-1$ values of the observed sequence and the first $i$ outputs of the Turing machines. The weighted majority algorithm works by assigning an initial weight of $1$ to each Turing machine and then multiplying the weight of a Turing machine by $\beta$, a number between 0 and 1, every time that machine's output does not match the observed value. Then to predict the $i$-th value of the sequence, we 1) add up the weights of the Turing machines that predict $1$ and add up the weights of the $0$ predictors, and 2) predict the value associated with the larger sum of weights.

For example, suppose that $\beta= 0.5$, the observed sequence is $01001000100001$, and there are three Turing machines which output $101010101$, $001001001$, and $101101110$ respectively.

For the first observation the weights are $1$, $1$, and $1$. The predictions are $1$, $0$, and $1$ respectively, so the weighted majority prediction is $1$, which is wrong!

The first and third Turing machines were wrong, so the new weights are $0.5$, $1.0$, and $0.5$. The next predicted values are $0$, $0$, and $0$. So the prediction is $0$.

All of the machines were wrong, so the new weights are $0.25$, $0.5$, and $0.25$. The third predicted values are $1$, $1$, and $1$. Once again, they are all wrong.

All of the machines were wrong, so the new weights are $0.125$, $0.25$, and $0.125$. The fourth predictions are $0$, $0$, and $1$. The weighted majority prediction is $0$ which is correct!

The new weights are $0.125$, $0.25$, and $0.0625$. The predictions are $1$, $0$, and $0$. The majority prediction is $0$ which is wrong.

According to Littlestone and Warmuth, the maximum number of mistakes made by the weighted majority algorithm is
$$\frac{\log n + m\log(1/\beta)}{\log(2/(1+\beta))}$$
where $m$ is the number of mistakes made by the best performing Turing machine and $n$ is the number of Turing machines. The coefficient of $m$, namely $\log(1/\beta)/\log(2/(1+\beta))$, is always at least $2$ and approaches $2$ as $\beta$ approaches $1$, so the bound is never smaller than $2m$.
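
Here is a minimal sketch of the algorithm run on the worked example above (my own code, not Littlestone and Warmuth's; ties are broken in favor of predicting 0, and only the first nine observations are used since each machine's output is nine bits long).

```python
# Weighted majority, run on the example above with beta = 0.5 and three predictors.
def weighted_majority(observed, experts, beta=0.5):
    weights = [1.0] * len(experts)
    predictions = []
    for t, truth in enumerate(observed):
        # Sum the weights voting for 1 and for 0; ties go to 0.
        w1 = sum(w for w, e in zip(weights, experts) if e[t] == 1)
        w0 = sum(w for w, e in zip(weights, experts) if e[t] == 0)
        predictions.append(1 if w1 > w0 else 0)
        # Halve the weight of every expert that was wrong on this round.
        weights = [w * beta if e[t] != truth else w
                   for w, e in zip(weights, experts)]
    return predictions, weights

observed = [0, 1, 0, 0, 1, 0, 0, 0, 1]              # first nine bits of 01001000100001
experts = [[1, 0, 1, 0, 1, 0, 1, 0, 1],
           [0, 0, 1, 0, 0, 1, 0, 0, 1],
           [1, 0, 1, 1, 0, 1, 1, 1, 0]]
preds, final_weights = weighted_majority(observed, experts)
print(preds)           # [1, 0, 1, 0, 0, ...] -- the first five predictions match the walkthrough
print(final_weights)
```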

“How Many Computers to Identify a Cat? 16,000” is a New York Times article about Deep Belief Networks, “the hottest thing in the speech recognition field these days.”

“Overproduction of Ph.D.s, caused by universities’ recruitment of graduate students and postdocs to staff labs, without regard to the career opportunities that await them, has glutted the market with scientists hoping for academic research careers. Long years of training and dismal career prospects form ‘a strong disincentive to American college graduates to enroll in doctoral programs,’ and early-career Ph.D.s have ‘little expectation of finding an academic research position that utilizes the training they received as a graduate student and a postdoctorate [sic] fellow,’ Research Universities states” (from “A Stellar Opportunity”, Science Career Magazine).

You would think that a large population of highly educated people would be a great thing for a country. The unemployment rate for astrophysics and math Ph.D.’s is low because their skills can be applied in many industrial settings, but for chemistry and biology Ph.D.’s the unemployment rate is higher.

This PowerPoint presentation by Avrim Blum covers the history of Bandit Problems, Regret, and Online Learning Theory. More presentations from his course can be found at http://www.cs.cmu.edu/~avrim/ML12/.

“Matternet will alleviate poverty and accelerate economic growth for the rising billion through a roadless transportation network.”

If you skip to 4:00 in the video, you can see how mini-quadricopters can transport small packages in the third world.
