(Based on the discussion between Carl and me on Saturday, July 7.)
It seems that in order to achieve Strong AI we need to combine Machine Learning and Good Old-Fashioned Artificial Intelligence. For those not familiar with it, GOFAI is mostly about symbol manipulation and includes subjects like computer algebra, theorem proving, first-order predicate calculus, lambda calculus, Turing machines, alpha-beta pruning, context-free grammars, depth/breadth-first search, branch and bound, semantic networks, expert systems, Prolog, Lisp, resolution theorem proving, and Gödel's incompleteness theorem. I wonder if we also have to add less formal, plausible reasoning systems that guess at theorems which are only "mostly true" (true only for points in general position, outside of a set of measure zero) or have a high "probability of being true". Such reasoning techniques might be derived from Bayesian networks or Alternative Hypothesis Spaces (Griffin 2009) (http://bmir.stanford.edu/file_asset/index.php/1565/BMIR-2010-1375.pdf).
Maybe we can invent a non-discrete Turing machine. Say we replace the finite state machine that reads and prints on the tape with a neural net. We might also print real numbers between 0 and 1 on the tape instead of printing only zeros and ones. Of course, this continuous Turing machine would not be better at classifying binary strings than a traditional discrete Turing machine; rather, the hope is that such a machine might be easier to optimize using general optimization techniques similar to gradient descent. My personal feeling is that this particular method will not work, but something like it might work.
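To make the idea concrete, here is one toy formalization (entirely my own invention, not a standard construction): the tape holds reals in [0, 1], and the finite-state controller is replaced by a small parameterized unit whose weights could, in principle, be tuned by a general-purpose optimizer.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

class ContinuousTuringMachine:
    """Toy sketch: a Turing-machine-like device whose controller is a tiny
    differentiable unit and whose tape holds reals in [0, 1] instead of bits.
    The weight layout and update rule here are arbitrary illustrative choices."""

    def __init__(self, weights, tape_len=16):
        self.w = weights              # controller parameters (the thing to optimize)
        self.tape = [0.0] * tape_len  # real-valued tape
        self.head = 0
        self.state = 0.0              # real-valued internal state

    def step(self):
        x = self.tape[self.head]
        # Controller reads the current cell and internal state, then emits a
        # new symbol in (0, 1), a new state, and a head-move direction.
        symbol = sigmoid(self.w[0] * x + self.w[1] * self.state + self.w[2])
        self.state = sigmoid(self.w[3] * x + self.w[4] * self.state + self.w[5])
        move = 1 if self.w[6] * x + self.w[7] * self.state > 0 else -1
        self.tape[self.head] = symbol
        self.head = (self.head + move) % len(self.tape)

machine = ContinuousTuringMachine(weights=[0.5, -1.0, 0.1, 1.0, 0.3, -0.2, 0.7, 0.4])
for _ in range(10):
    machine.step()
print(machine.tape)
```

The point is only that every operation here is a smooth function of the weights (except the hard head move, which a real attempt would also soften), which is what would make gradient-style optimization conceivable at all.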
We need something that captures the most fundamental ideas, like the idea of a Cartesian plane. The idea of a Cartesian product arises naturally from category theory. If one wants to formulate rules of thumb for tic-tac-toe, the 3 by 3 grid is an essential idea that is hard for a neural net to learn without prompting. Maybe we can use category theory to guide our learning algorithms (http://golem.ph.utexas.edu/category/2007/09/category_theory_in_machine_lea.html) to the correct abstractions.
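As a minimal illustration of that connection: in the category of sets, the categorical product is exactly the Cartesian product with its two projection maps, and the tic-tac-toe grid is the product {0,1,2} × {0,1,2}.

```python
from itertools import product

# The categorical product of two sets is their Cartesian product,
# equipped with the two projection maps.
rows = cols = range(3)
grid = list(product(rows, cols))  # the 3-by-3 grid as {0,1,2} x {0,1,2}

fst = lambda cell: cell[0]        # projection onto rows
snd = lambda cell: cell[1]        # projection onto columns

print(len(grid))                  # 9 cells
print(fst((2, 1)), snd((2, 1)))   # 2 1
```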
What is an abstraction anyway? In chess, if a king is in check, there are only three types of possible moves: move the king, interpose a piece, or capture the piece attacking the king. So when we consider responses to a check, we only need to consider those three types of moves. The fact that only these three types of moves exist is a theorem. This theorem is like a subroutine that speeds up our processing in a chess game. So an abstraction could be like a helpful subroutine. Also, this theorem allows us to disregard pieces that are not "near" the king or the attacking piece. In this regard, the theorem allows us to remove information from the game board before calculating the possible counter-moves. In this sense, the theorem is like a projection. The matrix
$$\begin{pmatrix} 1 & 0 \\ 0 & 0 \end{pmatrix}$$
sets the y coordinate to zero, so it destroys information. So some abstractions can be viewed as helpful projections that remove irrelevant information, just like manifold learning or dimension reduction. Another way to remove information is to disregard the particulars of a situation and boil it down to its essentials. Group theory was invented this way.
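A quick sketch of the projection idea: applying this matrix to a point simply forgets the y coordinate, and applying it twice changes nothing more, which is exactly what makes it a projection.

```python
def mat_vec(m, v):
    """Multiply a 2x2 matrix by a 2-vector."""
    return [m[0][0] * v[0] + m[0][1] * v[1],
            m[1][0] * v[0] + m[1][1] * v[1]]

P = [[1, 0],
     [0, 0]]  # projection onto the x-axis

print(mat_vec(P, [3, 7]))              # [3, 0]: the y coordinate is destroyed
print(mat_vec(P, mat_vec(P, [3, 7])))  # [3, 0]: projections are idempotent, P*P = P
```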
Group theory is important in another regard: examples and models. Suppose that a thinking machine was trying to prove conjectures about group theory. Before trying to prove the conjecture, the machine could check some simple finite symmetric groups, a few finite cyclic groups, and maybe addition on real numbers to see if the conjecture was true for those examples. If it was true, then the prover could try to prove the conjecture. If it was false, then the computer would not waste time trying to prove the conjecture.
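A sketch of that filtering step, with a conjecture invented here purely for illustration: the identity (ab)(ab) = (aa)(bb), which holds exactly in abelian groups. Checking it on a cyclic group and on the symmetric group S3 shows how cheap counterexample hunting can save the prover from a doomed proof attempt.

```python
from itertools import permutations, product

def compose(p, q):
    """Compose two permutations given as tuples: (p o q)(i) = p[q[i]]."""
    return tuple(p[q[i]] for i in range(len(p)))

def holds_in_group(elements, op, conjecture):
    """Check a two-variable conjecture on every pair of group elements."""
    return all(conjecture(a, b, op) for a, b in product(elements, repeat=2))

# Invented example conjecture: (a*b)*(a*b) == (a*a)*(b*b).
conj = lambda a, b, op: op(op(a, b), op(a, b)) == op(op(a, a), op(b, b))

# Cyclic group Z_5 under addition mod 5: abelian, so the conjecture holds.
z5 = range(5)
print(holds_in_group(z5, lambda a, b: (a + b) % 5, conj))  # True

# Symmetric group S_3: non-abelian, so a counterexample turns up immediately.
s3 = list(permutations(range(3)))
print(holds_in_group(s3, compose, conj))                   # False
```

In the first case the machine would go on to attempt a proof; in the second it would discard the conjecture without wasting any prover time.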
Another useful idea is that classifiers or theorems could have natural topologies (http://philosophy.berkeley.edu/file/698/FractalCompletenessTechniques.pdf). It seems that it is hard for neural nets to learn parity because changing any one bit in the input changes the class from even to odd or vice versa. Nearest neighbor with k=1 would be worse than useless for such a problem. Topology could yield further insights for machine learning, such as what is learnable. If the fractal dimension of a class is high (or its boundary has a high fractal dimension), we would imagine that the concept is harder to learn.
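The "worse than useless" claim can be checked directly. In a leave-one-out test on all 4-bit strings, every string's nearest neighbor differs in exactly one bit and therefore has the opposite parity, so 1-NN misclassifies every single point:

```python
from itertools import product

def parity(bits):
    return sum(bits) % 2

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

# All 4-bit strings, labeled by parity.
data = [(bits, parity(bits)) for bits in product([0, 1], repeat=4)]

# Leave-one-out 1-nearest-neighbor: predict each point's class from its
# single closest other point (in Hamming distance).
correct = 0
for bits, label in data:
    others = [(b, l) for b, l in data if b != bits]
    _, predicted = min(others, key=lambda bl: hamming(bits, bl[0]))
    correct += (predicted == label)

# Any nearest neighbor is at Hamming distance 1 and so has opposite parity:
# 1-NN gets literally every point wrong.
print(correct / len(data))  # 0.0
```

An accuracy of 0.0 is indeed worse than useless: flipping every prediction would give a perfect classifier.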
For example, if a classifier had to classify the points in a circle and it used nearest neighbor with k=1, then the number of random samples needed to achieve an error of $\epsilon$ is of order $(1/\epsilon)^2$. The number of points needed to classify the area inside a Koch snowflake (http://www.wolframalpha.com/input/?i=Koch+snowflake&lk=3) is of order $(1/\epsilon)^{2/(2-k)}$, where $k$ now denotes the fractal dimension of the border of the Koch snowflake.
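Plugging in the numbers: the Koch curve has fractal dimension $\log 4 / \log 3 \approx 1.26$, so the exponent $2/(2-k)$ comes out a bit above the smooth-boundary value of 2.

```python
import math

# Fractal dimension of the Koch curve: each segment is replaced by 4 copies,
# each scaled down by a factor of 3, so k = log 4 / log 3.
k = math.log(4) / math.log(3)
print(round(k, 4))      # 1.2619

# Sample-complexity exponent 2 / (2 - k) from the text.
exponent = 2 / (2 - k)
print(round(exponent, 4))

# Sanity check: for a smooth border (k = 1), the exponent reduces to 2,
# matching the circle case above.
print(2 / (2 - 1))      # 2.0
```

So the rougher border costs roughly $(1/\epsilon)^{2.7}$ samples instead of $(1/\epsilon)^2$, matching the intuition that higher fractal dimension makes a concept harder to learn.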