August 2012

One blog has a post on the Julia computer language, which seems to be getting faster. Carl and I have stumbled across many languages in our efforts to get a fast version of Lisp with good debugging tools. For a while we thought Haskell was the answer, but recently we have been leaning more toward Clojure and Python. It is hard to pass up languages that can use tools like WEKA, SciPy, and NumPy. Another option is R, but only one of my friends uses R. Clojure is neat because it runs on the JVM and compiles to JavaScript (and is being targeted at other languages such as Python, Scheme, and C).

For a speed comparison with Matlab, see this post.

I looked through some machine learning blogs, most of which are listed here, and this is what I found:

Machine Learning Theory – Widely read

Edwin Chen’s blog – Great Visualization, Great Posts

Maxim Raginsky’s The Information Structuralist – Some Econ, Some AI

Brendan O’Connor’s AI and Social Science blog – Econ, AI, great graphics, fun posts

Matthew Hurst’s Data Mining: Text Mining, Visualization and Social Media – Lots of posts and images


Bob Carpenter’s LingPipe Blog Natural Language Processing and Text Analytics – Full of interesting stuff

Computer Languages, Matlab, Software, Cool Miscellaneous

Sports Predictions, Lots of other cool ML stuff

A personal blog with several AI related posts

Personal blog with many technical posts (some AI)

Three or fewer posts this year:

Venn Diagrams

Carl sent me a link to a Venn diagrams post, so that got me thinking. A Venn Diagram with $n$ atoms has to represent $2^n$ regions. For example if $n$ is $2$, then you have the standard Venn diagram below.

Each time you increase $n$ by one, you double the number of regions.  This makes me think of binary codes and orthogonal functions. Everybody’s favorite orthogonal functions are the trig functions, so you should be able to draw Venn diagrams with wavy trig functions. Here was my first try.
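The binary-code idea above can be checked directly: the sign patterns of $\sin(x), \sin(2x), \sin(4x), \dots$ act like binary digits, so $n$ wavy curves carve the interval into arcs realizing all $2^n$ membership codes. Here is a quick sketch in Python (the function name `region_codes` is mine, just for illustration):

```python
import math

# Each sample point x gets a binary "region code": is it above or below
# the axis for sin(x), sin(2x), sin(4x), ...?  An n-atom Venn diagram
# needs all 2^n codes to appear.
def region_codes(n, samples=10000):
    codes = set()
    for i in range(samples):
        x = 2 * math.pi * (i + 0.5) / samples   # offset avoids exact zeros
        codes.add(tuple(math.sin(2 ** k * x) > 0 for k in range(n)))
    return codes

print(len(region_codes(3)))  # 8 regions
print(len(region_codes(4)))  # 16 regions
```

Doubling the frequency each time is exactly the "increase $n$ by one, double the regions" observation: each new curve splits every existing arc in two.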

Those seemed kind of busy, so I dampened the amplitude of the high frequencies (making the slopes the same and possibly increasing the artistic appeal).

I really like the last one.

I discovered automatic differentiation a few weeks ago, and I can’t believe I had never heard of it before. Although I believe everybody who has more than a passing knowledge of algorithms (especially numerical algorithms!) should know about it, apparently very few do.

I will just give a very brief introduction here before pointing out a few good sources of information.

First of all, automatic differentiation — “autodiff” — is neither numerical differentiation nor symbolic differentiation, although it does calculate exact derivatives!

In the formulation that I find most astonishing, autodiff uses object-oriented programming techniques and operator overloading to simultaneously and transparently turn all (differentiable) function calculations into evaluations of the function’s first derivative.

More concretely, instead of calculating with a floating point variable x, we calculate with an object x that has two data components (x.value and x.deriv, say), the value and the derivative, and methods that overload all of the mathematical functions and operators in the language. So, for example, when one calculates y = cos x, one automatically calculates both $y = \cos x$ and $y’ = – \sin x$, with the results stored in y.value and y.deriv! Operator overloading covers cases like x ^ 3, 3 ^ x, or even x ^ x using the standard rules of differentiation. Once one sees how that works, expressions like x + y, x * y, and x / y become easy as well. In this way, all calculations can be handled.
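A minimal sketch of this idea in Python (rather than Matlab) might look like the following; the class name `Dual` and its two overloaded operators are just enough to show the mechanism:

```python
import math

class Dual:
    """Forward-mode autodiff value: carries (value, derivative) together."""
    def __init__(self, value, deriv=0.0):
        self.value, self.deriv = value, deriv

    def __add__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.value + other.value, self.deriv + other.deriv)
    __radd__ = __add__

    def __mul__(self, other):  # product rule
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.value * other.value,
                    self.deriv * other.value + self.value * other.deriv)
    __rmul__ = __mul__

def cos(x):
    """Overloaded cosine: d/dx cos(u) = -sin(u) * u'."""
    if isinstance(x, Dual):
        return Dual(math.cos(x.value), -math.sin(x.value) * x.deriv)
    return math.cos(x)

# Seed x with derivative 1; every downstream calculation carries d/dx along.
x = Dual(0.5, 1.0)
y = cos(x * x)          # y = cos(x^2), so y' = -2x sin(x^2)
print(y.value, y.deriv)
```

The caller writes ordinary-looking formulas, and the derivative just comes along for the ride in y.deriv.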

It should be noted that this works for all numerical computations, not just calculations involving mathematical formulas, and that it can be easily generalized to calculating arbitrary higher-order derivatives.

By the way, the technique outlined above is “forward mode” autodiff; it is less obvious that there is also a “reverse mode” autodiff. Forward mode is more efficient for functions of a single variable; reverse mode is more efficient for functions with a single output value (i.e., real-valued as opposed to vector-valued functions). It turns out that reverse mode autodiff is a generalization of neural net back-propagation and was actually discovered before backprop!
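Reverse mode can also be sketched in a few lines: instead of pushing derivatives forward, each value remembers its parents and the local partial derivatives, and a backward sweep accumulates gradients. This toy `Var` class (my naming, illustration only) propagates each path’s contribution separately, which is correct though not efficient:

```python
class Var:
    """Reverse-mode autodiff value: records parents and local partials."""
    def __init__(self, value, parents=()):
        self.value, self.parents, self.grad = value, parents, 0.0

    def __add__(self, other):
        return Var(self.value + other.value, [(self, 1.0), (other, 1.0)])

    def __mul__(self, other):
        return Var(self.value * other.value,
                   [(self, other.value), (other, self.value)])

def backward(out):
    """Sweep backward from the output, accumulating each path's gradient."""
    stack = [(out, 1.0)]
    while stack:
        node, upstream = stack.pop()
        node.grad += upstream
        for parent, local in node.parents:
            stack.append((parent, local * upstream))

x, y = Var(2.0), Var(3.0)
z = x * y + x           # z = xy + x
backward(z)
print(x.grad, y.grad)   # dz/dx = y + 1 = 4, dz/dy = x = 2
```

One backward sweep yields the gradient with respect to every input at once, which is exactly why reverse mode wins for real-valued functions of many variables, and why backprop is a special case.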

Justin Domke made a great blog post on automatic differentiation in 2009; and Introduction to Automatic Differentiation and MATLAB Object-Oriented Programming is a very accessible paper on actually implementing autodiff in Matlab.

Finally, there is a site that seems to be the home of all things autodiff on the web.

So I decided to look at other AI blogs and I began by typing “artificial intelligence blog” into Google. (That might not be the best way to find AI blogs, but it seemed like a good place to start.) Most of the links were to popular blogs with an article on AI (like the Economist commenting on crossword AI). Here are some of the other links I came across:

A Great List of AI resources

Blog – Lisp – 3 posts this year




Two posts this year – One of the posts is very good.

Good ML blog

This blog markets “Drools” software.

Great response to a question about resources for AI.

Social Cognition and AI

AI applied to the stock market

A huge number of videos from PyCon US 2012 are available, and Marcel Caraciolo posted links to 17 of them on his blog. I have been avoiding Python for machine learning, but maybe I’ve been wrong.

In “How to Use Expert Advice“, Cesa-Bianchi, Freund, Haussler, Helmbold, Schapire, and Warmuth (1997) describe boosting-type algorithms for combining classifiers. They prove a bound of

$$ \sqrt{n  \log(M)}$$

on the total regret, where $n$ is the number of data points and $M$ is the number of classifiers. They relate their bounds to the L1 distance between binary strings and apply their results to pattern recognition theory and PAC learning.
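A bound of this form can be seen in action with the exponentially weighted average forecaster (often called Hedge), one of the standard algorithms in this line of work. The sketch below (function name and the particular learning rate $\eta = \sqrt{8 \ln M / n}$ are the textbook choices, not taken from the paper itself) checks the $\sqrt{(n/2)\ln M}$ regret guarantee on random losses in $[0,1]$:

```python
import math, random

def hedge(losses, eta):
    """Exponentially weighted forecaster over M experts.

    losses: list of rounds, each a list of M expert losses in [0, 1].
    Returns the forecaster's total (expected) loss.
    """
    M = len(losses[0])
    w = [1.0] * M
    total = 0.0
    for round_losses in losses:
        W = sum(w)
        # Expected loss of the weighted-average forecaster this round.
        total += sum(wi * li for wi, li in zip(w, round_losses)) / W
        # Downweight each expert exponentially in its loss.
        w = [wi * math.exp(-eta * li) for wi, li in zip(w, round_losses)]
    return total

random.seed(0)
n, M = 2000, 10
losses = [[random.random() for _ in range(M)] for _ in range(n)]
eta = math.sqrt(8 * math.log(M) / n)
best = min(sum(l[j] for l in losses) for j in range(M))
regret = hedge(losses, eta) - best
print(regret, math.sqrt(n * math.log(M) / 2))  # regret vs. theoretical bound
```

The guarantee is worst-case, so on random data the observed regret is usually far below the bound.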


In “Greedy Function Approximation: A Gradient Boosting Machine”, Friedman (1999) presents an algorithm for incrementally improving the approximation of a function

$$f(x) = \sum_{m=0}^M \beta_m h(x, a_m)$$

where the $\beta_m$ are real and the $a_m$ are vectors of trained parameters. At each iteration, gradient boosting adds a new function $h$ using steepest descent.
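For squared loss the steepest-descent direction is just the residual, so a minimal sketch of the idea needs only a base learner fit to residuals at each round. Here is one in Python using decision stumps as the $h(x, a_m)$ (the helper names and the 1-D setup are mine, purely for illustration):

```python
def fit_stump(x, r):
    """Best threshold split of 1-D inputs x minimizing squared error on r."""
    best = None
    for t in sorted(x)[:-1]:
        left = [ri for xi, ri in zip(x, r) if xi <= t]
        right = [ri for xi, ri in zip(x, r) if xi > t]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        err = (sum((ri - lm) ** 2 for ri in left)
               + sum((ri - rm) ** 2 for ri in right))
        if best is None or err < best[0]:
            best = (err, t, lm, rm)
    _, t, lm, rm = best
    return lambda z: lm if z <= t else rm

def gradient_boost(x, y, M=20, lr=0.5):
    base = sum(y) / len(y)
    pred, stumps = [base] * len(y), []
    for _ in range(M):
        # For squared loss, the negative gradient is the residual.
        resid = [yi - pi for yi, pi in zip(y, pred)]
        h = fit_stump(x, resid)
        stumps.append(h)
        pred = [pi + lr * h(xi) for pi, xi in zip(pred, x)]
    return lambda z: base + sum(lr * h(z) for h in stumps)

x = list(range(10))
y = [xi ** 2 for xi in x]          # try to fit f(x) = x^2
f = gradient_boost(x, y)
mse = sum((f(xi) - yi) ** 2 for xi, yi in zip(x, y)) / len(x)
print(mse)
```

Each stump plays the role of one $h(x, a_m)$ with $a_m$ its threshold and leaf values, and the shrinkage factor lr stands in for $\beta_m$.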

Carl and I have been discussing concepts related to sparsity in neural nets. One older idea similar to sparsity is negatively correlated or uncorrelated neural nets. To use this technique, you could train several neural nets on part or all of the data and then keep the neural networks that are less correlated. The idea used in the CELS algorithm is to add a correlation penalty term to the loss function of the neural net optimization process. In “Simultaneous Training of Negatively Correlated Neural Networks in an Ensemble”, Liu and Yao (1999) say that

The idea behind CELS is to encourage different individual networks in an ensemble to learn different parts or aspects of a training data so that the ensemble can learn the whole training data better.

which seems similar to what happens in sparse neural networks.
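The penalty term is simple to write down: network $i$ pays its squared error plus $\lambda \, (f_i - \bar f) \sum_{j \neq i} (f_j - \bar f)$, where $\bar f$ is the ensemble mean. Since the deviations from the mean sum to zero, this penalty equals $-\lambda (f_i - \bar f)^2$, so it rewards networks for spreading out around the ensemble average. A small sketch (the function name is mine; this is the loss computation only, not the full training loop):

```python
def cels_loss(preds, y, lam=0.5):
    """Per-network CELS-style loss: squared error + correlation penalty.

    preds: predictions of each network in the ensemble at one data point.
    y: the target value at that point.
    """
    fbar = sum(preds) / len(preds)
    losses = []
    for i, fi in enumerate(preds):
        penalty = (fi - fbar) * sum(fj - fbar
                                    for j, fj in enumerate(preds) if j != i)
        losses.append((fi - y) ** 2 + lam * penalty)
    return losses

print(cels_loss([1.0, 1.0, 1.0], 1.0))  # identical nets: no penalty
print(cels_loss([0.0, 1.0, 2.0], 1.0))  # diverse nets: penalty is negative
```

Gradient descent on this loss pushes each network away from the ensemble mean while still fitting the data, which is the diversity pressure the quote describes.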

My Friend Andrew wrote:

Initially, I thought it was dependent on your definition of “random rational number”, but then I saw your second example, which removed duplicates (1/2, 2/4, etc.).

I tried a different method by sampling from a real distribution and then rationalizing it:

Mean@Mod[Denominator@Rationalize[#, 1*^-6] & /@ RandomReal[{0, 1}, 100000], 2]

Then, I tried changing the distribution, and still get the same result:

Mean@Mod[Denominator@Rationalize[#, 1*^-6] & /@ RandomReal[NormalDistribution[0, 1], 100000], 2]

— Andrew
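For anyone without Mathematica, a rough Python analogue of Andrew’s experiment can be built from the standard library, with `Fraction.limit_denominator` standing in for `Rationalize` at tolerance $10^{-6}$ (the two rationalization routines are similar in spirit but not guaranteed identical):

```python
import random
from fractions import Fraction

# Sample uniform reals, rationalize each to within ~1e-6, and take the
# mean of (denominator mod 2), as in Andrew's Mean@Mod[...] expression.
random.seed(1)
denoms = [Fraction(random.random()).limit_denominator(10 ** 6).denominator
          for _ in range(20000)]
odd_share = sum(d % 2 for d in denoms) / len(denoms)
print(odd_share)
```

Swapping in a normal distribution for the uniform one, as in Andrew’s second line, is a one-line change to the `random.random()` call.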
