T Jake Luciani wrote a nice, easy-to-read blog post on recent developments in neural networks.
In “Autoencoders, MDL, and Helmholtz Free Energy“, Hinton and Zemel (1994) use Minimum Description Length as an objective function for formulating the generative and recognition weights of an autoencoding neural net. They develop a stochastic vector quantization method, very similar to a mixture of Gaussians, in which each input vector is encoded with
$$E_i = -\log \pi_i - k \log t + \frac{k}{2} \log 2 \pi \sigma^2 + \frac{d^2}{2\sigma^2}$$
nats (1 nat = $1/\ln 2$ bits ≈ 1.44 bits), where $t$ is the quantization width, $d$ is the Mahalanobis distance to the mean of the Gaussian, $k$ is the dimension of the input space, and $\pi_i$ is the mixing weight of the $i$th Gaussian. They call this the “energy” of the code. Encoding using only this scheme wastes bits because, for example, there may be vectors that are equally distant from two Gaussians. The amount wasted is
$$H = -\sum_i p_i \log p_i$$
where $p_i$ is the probability that the code will be assigned to the $i$th Gaussian. So the “true” expected description length is
$$F = \sum_i p_i E_i - H$$
which “has exactly the form of the Helmholtz free energy.” This free energy is minimized by setting
$$p_i = {{e^{-E_i}}\over{\sum_j e^{-E_j}}}.$$
In order to make the computation practical, they recommend using suboptimal distributions “as a Lyapunov function for learning” (see Neal and Hinton 1993). They apply their method to learning factorial codes.
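Here is a minimal numeric sketch of these formulas (not Hinton and Zemel’s code; the Gaussians, quantization width, and input below are made up, and I treat $d$ as Euclidean distance under isotropic unit-$\sigma$ Gaussians):

```python
import numpy as np

def code_energies(x, means, sigma, weights, t):
    """Energy E_i, in nats, of encoding input x with the i-th Gaussian.

    Assumes isotropic Gaussians with a common sigma, so the Mahalanobis
    distance reduces to Euclidean distance scaled by sigma.
    """
    k = x.shape[0]                               # dimension of the input space
    d2 = np.sum((x - means) ** 2, axis=1)        # squared distances to the means
    return (-np.log(weights) - k * np.log(t)
            + 0.5 * k * np.log(2 * np.pi * sigma ** 2)
            + d2 / (2 * sigma ** 2))

# Toy example: two 2-D Gaussians and an input equidistant from both means.
means = np.array([[0.0, 0.0], [3.0, 0.0]])
weights = np.array([0.5, 0.5])
x = np.array([1.5, 0.0])

E = code_energies(x, means, sigma=1.0, weights=weights, t=0.01)
p = np.exp(-E) / np.sum(np.exp(-E))   # Boltzmann distribution that minimizes F
H = -np.sum(p * np.log(p))            # nats saved by the stochastic assignment
F = np.sum(p * E) - H                 # Helmholtz free energy
print(p, F)                           # p = [0.5, 0.5]: the "wasted bits" case
```

With the input exactly between the two means, the soft assignment is 50/50 and the entropy term recovers a full $\ln 2$ nats, which is precisely the waste the hard-assignment scheme incurs.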
A long time ago I was a meteorologist at the Joint Typhoon Warning Center in Guam. One of my forecasters had a friend who used solar images to forecast wheat futures. I wonder what those people are predicting these days.
The Dalton minimum occurred between 1790 and 1830. If the sun does not perk up, solar cycle #24 will resemble the solar cycles of the Dalton minimum. (See also the Spörer minimum and the Maunder minimum.) For more information, check out NOAA’s Solar Cycle Progression page.
You can find anything on the web. Earlier today, I noticed that Stirling’s approximation and the Gaussian distribution both contain a $\sqrt{2 \pi}$, so I started thinking that maybe you could apply Stirling’s approximation to the binomial distribution with $p=1/2$ and large $n$, together with the central limit theorem, to derive the Gaussian distribution. After making a few calculations it seemed to be working, and I thought, “Hey, maybe someone else has done this.” Sure enough, Jake Hoffman did it and posted it here. The internet is pretty cool.
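For the record, the shape of that calculation is the classical de Moivre–Laplace argument (sketched here from memory, not from Jake Hoffman’s post): take $p = 1/2$, write $k = n/2 + x$, and apply Stirling’s formula $n! \approx \sqrt{2\pi n}\,(n/e)^n$ to the binomial coefficient, giving

$$\binom{n}{k}\, 2^{-n} \approx \sqrt{\frac{2}{\pi n}}\, e^{-2x^2/n},$$

a Gaussian with mean $n/2$ and variance $np(1-p) = n/4$. The $\sqrt{2\pi}$ in Stirling’s formula is the very one that survives into the normal density.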
In “Generalized Thompson sampling for sequential decision-making and causal inference“, Ortega and Braun (2013) give a short history of Thompson sampling (see [1], [2], [3], [4]) and report on the relationships among intelligent agents, evolutionary game theory, Bayesian inference, KL divergence, and Thompson sampling. They develop a generalization of Thompson sampling based on a Bayesian prior over a distribution of environments. Some numerical results are provided. Here’s the abstract:
Recently, it has been shown how sampling actions from the predictive distribution over the optimal action—sometimes called Thompson sampling—can be applied to solve sequential adaptive control problems, when the optimal policy is known for each possible environment. The predictive distribution can then be constructed by a Bayesian superposition of the optimal policies weighted by their posterior probability that is updated by Bayesian inference and causal calculus. Here we discuss three important features of this approach. First, we discuss in how far such Thompson sampling can be regarded as a natural consequence of the Bayesian modeling of policy uncertainty. Second, we show how Thompson sampling can be used to study interactions between multiple adaptive agents, thus, opening up an avenue of game-theoretic analysis. Third, we show how Thompson sampling can be applied to infer causal relationships when interacting with an environment in a sequential fashion. In summary, our results suggest that Thompson sampling might not merely be a useful heuristic, but a principled method to address problems of adaptive sequential decision-making and causal inference.
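For readers meeting the idea for the first time, here is plain vanilla Thompson sampling on a Bernoulli bandit, a minimal sketch rather than the paper’s generalized version; the three arm probabilities are made up:

```python
import numpy as np

rng = np.random.default_rng(0)
true_p = np.array([0.3, 0.5, 0.7])   # unknown arm reward probabilities (made up)
alpha = np.ones(3)                   # Beta(1, 1) posterior parameters per arm
beta = np.ones(3)

for _ in range(10_000):
    theta = rng.beta(alpha, beta)          # sample one "environment" from the posterior
    arm = int(np.argmax(theta))            # act optimally for the sampled environment
    reward = rng.random() < true_p[arm]    # observe a Bernoulli reward
    alpha[arm] += reward                   # standard Bayesian update
    beta[arm] += 1 - reward

print(alpha / (alpha + beta))   # posterior means; pulls concentrate on the best arm
```

The key move, which the abstract generalizes, is that the action is drawn by sampling an environment from the posterior and then acting optimally for that sampled environment.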
I enjoyed this simple, entertaining video by Hilary Mason, which is both a history of machine learning and an overview of the field.
I was reading MarkCC’s gripe about the misuse of the word dimension, and it reminded me of the Hausdorff dimension. Generally speaking, the Hausdorff dimension matches our normal intuitive definition of dimension, i.e., the H-dimension of a smooth curve or line is 1, the H-dimension of a smooth surface is 2, and the H-dimension of the interior of a cube is 3. But for fractals, the H-dimension can be a non-integer. For the Cantor set, the H-dimension is about 0.63. For more information, check out the Wikipedia article:
http://en.wikipedia.org/wiki/Hausdorff_dimension
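For strictly self-similar sets, the Hausdorff dimension coincides with the similarity dimension: a set built from $N$ copies of itself, each scaled by a factor $s$, has dimension

$$d = \frac{\log N}{\log (1/s)}.$$

The Cantor set is $N = 2$ copies of itself scaled by $s = 1/3$, giving $d = \log 2 / \log 3 \approx 0.63$.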
I’m quite excited by the Nuit Blanche post on the papers “Structure Discovery in Nonparametric Regression through Compositional Kernel Search” (Duvenaud, Lloyd, Grosse, Tenenbaum, and Ghahramani 2013) and “Exploiting compositionality to explore a large space of model structures” (Grosse, Salakhutdinov, Freeman, and Tenenbaum 2012). For years, my old company Momentum Investment Services, Carl, and I looked for fast, systematic ways to search large hypothesis spaces. We considered context-free grammars as a means of generating hypotheses. Carl and I did not get anywhere with that research, but now it seems that others have succeeded. Be sure to look over the article, the blog posts, and the comments.
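To give a flavor of grammar-driven hypothesis search, here is a toy enumeration only loosely modeled on the kernel grammar in Duvenaud et al.; the base kernels and operators below are my own stand-ins:

```python
from itertools import product

BASE = ["SE", "LIN", "PER"]   # assumed base kernels: squared-exp, linear, periodic
OPS = ["+", "*"]              # composition operators of the toy grammar

def expand(exprs):
    """One grammar step: combine every pair of expressions with every operator."""
    return [f"({a} {op} {b})" for a, b in product(exprs, repeat=2) for op in OPS]

hypotheses = list(BASE)
frontier = list(BASE)
for _ in range(2):            # grow the hypothesis space two levels deep
    frontier = expand(frontier)
    hypotheses.extend(frontier)

print(len(hypotheses))        # 669 expressions already; the space explodes quickly
print(hypotheses[:5])
```

As I understand it, the papers do not enumerate blindly like this; they search the space greedily, scoring each candidate model by an approximate marginal likelihood.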
In “Algorithm Portfolio Design: Theory vs. Practice“, Gomes and Selman (2013) study the use of a portfolio of stochastic search algorithms to solve computationally hard search problems. Here are some interesting quotes from the paper:
“Our studies reveal that in many cases the performance of a single algorithm dominates all others, on the problem class under consideration.”
“Given the diversity in performance profiles among algorithms, various approaches have been developed to combine different algorithms to take into account the computational resource constraints and to optimize the overall performance. These considerations led to the development of anytime algorithms (Dean and Boddy 1988), decision theoretic metareasoning and related approaches (Horvitz and Zilberstein 1996; Russell and Norvig 1995), and algorithm portfolio design (Huberman et al. 1997).”
“In addition, we also show that a good strategy for designing a portfolio is to combine many short runs of the same algorithm. The effectiveness of such portfolios explains the common practice of “restarts” for stochastic procedures, where the same algorithm is run repeatedly with different initial seeds for the random number generator. (For related work on the effectiveness of restarts, see e.g., Aldous and Vazirani 1994; Ertel 1991; Selman and Kirkpatrick 1996.)”
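The last quote’s point about restarts is easy to reproduce with a toy Monte Carlo (my own illustration, not from the paper; the Pareto runtime model and the cutoff of 3 are assumptions): when run times are heavy-tailed, many short runs beat one long run.

```python
import numpy as np

rng = np.random.default_rng(1)

def solve_with_restarts(cutoff):
    """Total work to solve one instance, restarting whenever a run exceeds cutoff."""
    total = 0.0
    while True:
        runtime = rng.pareto(1.5) + 1.0   # heavy-tailed runtime of one randomized run
        if runtime <= cutoff:
            return total + runtime        # this run finished before the cutoff
        total += cutoff                   # give up on this run; restart with a new seed

single_run = np.mean([rng.pareto(1.5) + 1.0 for _ in range(100_000)])
restarts = np.mean([solve_with_restarts(cutoff=3.0) for _ in range(100_000)])
print(single_run, restarts)               # restarts finish sooner on average
```

With this runtime distribution the mean of a single uninterrupted run is 3, while restarting at a cutoff of 3 brings the expected total work down to roughly 2.3, which is why the “restarts with different seeds” practice works so well.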
Scott Aaronson gave a wonderful talk at NIPS 2012 called “Quantum Information and the Brain“. I really enjoyed his very concise, intuitive description of quantum mechanics and his skeptical yet fair attitude toward the ideas expressed by Penrose in “The Emperor’s New Mind”.