Deep Belief Networks

  1. Enjoying John Baez’s blog Azimuth.  Especially the posts on good research practices and an older post on levels of mathematical understanding.
  2. García-Pérez, Serrano, and Boguñá wrote a cool paper on primes, probability, and integers as a bipartite network.
  3. Loved the idea behind the game theoretical book “Survival of the Nicest”  (see Yes Magazine for a two page introduction).
  4. Scott Young is learning Chinese quickly.
  5. Cyber warriors to the rescue.
  6. Mao, Fluxx, and Douglas Hofstadter’s Nomic are fun games.
  7. Healy and Caudell are applying category theory to semantic and neural networks.
  8. Some MOOCs for data science and machine learning.
  9. Here is an old but good free online course on the Computational Complexity of Machine Learning.
  10. Great TeX graphics.
  11. Watch this Ted Video to learn anything in 20 hours (YMMV).
  12. Where are all the Steeler fans?  Cowboy fans?  ….
  13. Productivity Hints.
  14. Copper + Magnets = Fun
  15. Stray dogs on the subway.
  16. Deep learning on NPR.
  17. Happy 40th birthday D&D
  18. Google is applying deep learning to images of house numbers
  19. Deep learning in your browser.
  20. How to write a great research paper.
  21. Do Deep Nets Really Need to be Deep?
  22. A variation on neural net dropout.
  23. Provable algorithms for Machine Learning
  24. 100 Numpy Exercises
  25. Learn and Practice Applied Machine Learning | Machine Learning Mastery


Check out Markus Beissinger’s blog post “Deep Learning 101”.  Markus reviews a lot of deep learning basics derived from the papers “Representation Learning: A Review and New Perspectives” (Bengio, Courville, Vincent 2012) and “Deep Learning of Representations: Looking Forward” (Bengio 2013). Beissinger covers the following topics:

  • An easy intro to Deep Learning
  • The Current State of Deep Learning
  • Probabilistic Graphical Models
  • Principal Component Analysis
  • Restricted Boltzmann Machines
  • Auto-Encoders
  • “Challenges Looking Ahead”

This is a great intro and I highly recommend it.

If you want more information, check out Ng’s lecture notes, Honglak Lee’s 2010 NIPS slides, and Hinton’s Videos ([2009] [2013]).

The linear-nonlinear-Poisson (LNP) cascade model is a standard model of neuron responses.  Louis Shao has recently shown that an artificial neural net consisting of LNP neurons can simulate any Boltzmann machine and perform “a semi-stochastic Bayesian inference algorithm lying between Gibbs sampling and variational inference.”  In his paper he notes that the “properties of visual area V2 are found to be comparable to those on the sparse autoencoder networks [3]; the sparse coding learning algorithm [4] is originated directly from neuroscience observations; also psychological phenomenon such as end-stopping is observed in sparse coding experiments [5].”
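For readers who have not met the LNP cascade before, here is a minimal sketch of a single LNP neuron: a linear filter, a pointwise nonlinearity, and a Poisson spike count. The exponential nonlinearity, the filter weights, and the stimulus are illustrative assumptions, not values taken from Shao's paper.

```python
import math
import random


def lnp_rate(stimulus, linear_filter, bias=0.0):
    """Linear stage followed by a pointwise nonlinearity (assumed here: exp)."""
    drive = sum(w * s for w, s in zip(linear_filter, stimulus)) + bias
    return math.exp(drive)  # the firing rate must be non-negative


def sample_poisson(rate, rng):
    """Draw one Poisson count with Knuth's multiplication method (fine for small rates)."""
    threshold = math.exp(-rate)
    count, running_product = 0, rng.random()
    while running_product > threshold:
        count += 1
        running_product *= rng.random()
    return count


def lnp_spike_count(stimulus, linear_filter, rng, bias=0.0):
    """Full cascade: linear filter -> nonlinearity -> Poisson spike count."""
    return sample_poisson(lnp_rate(stimulus, linear_filter, bias), rng)


rng = random.Random(0)
stimulus = [1.0, 0.5, -1.0]       # hypothetical input
linear_filter = [0.5, 1.0, 0.5]   # hypothetical receptive field
counts = [lnp_spike_count(stimulus, linear_filter, rng) for _ in range(5000)]
print("rate:", lnp_rate(stimulus, linear_filter))
print("mean spike count:", sum(counts) / len(counts))
```

The mean spike count should land close to the rate, since a Poisson variable's mean equals its rate parameter.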


Nuit Blanche’s article “The Summer of the Deeper Kernels” references the two-page paper “Deep Support Vector Machines for Regression Problems” by Schutten, Meijster, and Schomaker (2013).


The deep SVM is a pretty cool idea.  A normal support vector machine (SVM) classifier finds $\alpha_i$ such that

$f(x) = \sum_i \alpha_i K(x_i, x)$ is positive for one class of $x_i$ and negative for the other class (sometimes allowing exceptions).  ($K(x,y)$ is called the kernel function; in the simplest case it is just the dot product of $x$ and $y$.)  SVMs are great because they are fast and the solution is sparse (i.e. most of the $\alpha_i$ are zero).
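To make the decision function concrete, here is a small sketch that evaluates $f(x) = \sum_i \alpha_i K(x_i, x)$ for hand-picked values. The support vectors and alphas are hypothetical; in a real SVM they come out of training, which is not shown here.

```python
import math


def linear_kernel(x, y):
    """Simplest kernel: the dot product of x and y."""
    return sum(a * b for a, b in zip(x, y))


def rbf_kernel(x, y, gamma=0.5):
    """Gaussian (RBF) kernel, a common nonlinear alternative."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))


def decision_function(x, support_vectors, alphas, kernel=linear_kernel):
    """f(x) = sum_i alpha_i K(x_i, x); the sign of f(x) is the predicted class."""
    return sum(a * kernel(x_i, x) for a, x_i in zip(alphas, support_vectors))


# Hypothetical, hand-picked values (not the result of any training run).
support_vectors = [(1.0, 1.0), (-1.0, -1.0)]
alphas = [1.0, -1.0]  # the sign of each alpha carries the class, as in the formula above

print(decision_function((2.0, 0.0), support_vectors, alphas))    # positive side
print(decision_function((-1.0, -3.0), support_vectors, alphas))  # negative side
print(decision_function((1.0, 1.0), support_vectors, alphas, rbf_kernel))
```

Only points with nonzero $\alpha_i$ enter the sum, which is why sparsity makes prediction cheap.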

Schutten, Meijster, and Schomaker apply the ideas of deep neural nets to SVMs.

They construct $d$ SVMs of the form

$f_a(x) = \sum_i \alpha_i(a) K(x_i, x)+b_a$

and then compute a more complex two-layer SVM

$g(x) = \sum_i \alpha_i  K(f(x_i), f(x))+b$

where $f(x) = (f_1(x), f_2(x), \ldots, f_d(x))$.  They use a simple gradient descent algorithm to optimize the alphas and obtain numerical results on ten different data sets comparing the mean squared error to a standard SVM.
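The two-layer construction can be sketched as a forward pass: first map $x$ to the vector $f(x)$ of $d$ first-layer SVM outputs, then apply a kernel expansion in that feature space. All coefficients below are hypothetical, and the gradient descent training described in the paper is left out.

```python
import math


def rbf_kernel(x, y, gamma=0.5):
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))


def first_layer(x, train_x, alpha_rows, biases, kernel=rbf_kernel):
    """f(x) = (f_1(x), ..., f_d(x)) with f_a(x) = sum_i alpha_i(a) K(x_i, x) + b_a."""
    return [
        sum(a * kernel(x_i, x) for a, x_i in zip(alpha_a, train_x)) + b_a
        for alpha_a, b_a in zip(alpha_rows, biases)
    ]


def deep_svm(x, train_x, alpha_rows, biases, top_alphas, top_bias, kernel=rbf_kernel):
    """g(x) = sum_i alpha_i K(f(x_i), f(x)) + b."""
    hidden_x = first_layer(x, train_x, alpha_rows, biases, kernel)
    hidden_train = [first_layer(x_i, train_x, alpha_rows, biases, kernel) for x_i in train_x]
    return sum(a * kernel(h_i, hidden_x) for a, h_i in zip(top_alphas, hidden_train)) + top_bias


# Hypothetical training points and coefficients (d = 2 first-layer SVMs).
train_x = [(1.0, 0.0), (0.0, 1.0)]
alpha_rows = [[1.0, 0.0], [0.0, 1.0]]  # alpha_rows[a][i] plays the role of alpha_i(a)
biases = [0.0, 0.0]
top_alphas = [2.0, -1.0]
top_bias = 0.5

print(deep_svm((3.0, 4.0), train_x, alpha_rows, biases, top_alphas, top_bias))
```

Note that the top layer never sees the raw inputs: both arguments of its kernel are first-layer outputs, which is what makes the model "deep".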

Here’s a pretty cool video by Alex Acero (Microsoft).
Check out minutes 47 to 50, where he says that the deep belief network approach yielded a 30% improvement over state-of-the-art speech recognition systems.

T Jake Luciani wrote a nice, easy-to-read blog post on the recent developments in neural networks.

In “Autoencoders, MDL, and Helmholtz Free Energy”, Hinton and Zemel (1994) use Minimum Description Length as an objective function for formulating generative and recognition weights for an autoencoding neural net.  They develop a stochastic Vector Quantization method very similar to a mixture of Gaussians, where each input vector is encoded with

$$E_i = -\log \pi_i - k \log t + {k\over2} \log 2 \pi \sigma^2 + {{d^2} \over{2\sigma^2}}$$

nats (1 nat = 1/ln(2) bits ≈ 1.44 bits) where $t$ is the quantization width, $d$ is the Mahalanobis distance to the mean of the Gaussian, $k$ is the dimension of the input space, and $\pi_i$ is the weight of the $i$th Gaussian.  They call this the “energy” of the code.  Encoding using only this scheme wastes bits because, for example, there may be vectors that are equally distant from two Gaussians. The amount wasted is

$$H = -\sum p_i \log p_i$$

where $p_i$ is the probability that the code will be assigned to the $i$th Gaussian. So the “true” expected description length is

$$F = \sum_i p_i E_i - H$$

which “has exactly the form of the Helmholtz free energy.”  This free energy is minimized by setting

$$p_i = {{e^{-E_i}}\over{\sum_j e^{-E_j}}}.$$

In order to make computation practical, they recommend using a suboptimal distribution “as a Lyapunov function for learning” (see Neal and Hinton 1993). They apply their method to learn factorial codes.
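The claim that the Boltzmann distribution minimizes $F = \sum_i p_i E_i - H$ is easy to check numerically. The sketch below uses three hypothetical code energies (in nats) and compares the Boltzmann assignment with two alternatives: always picking the cheapest code, and picking uniformly at random.

```python
import math


def free_energy(p, energies):
    """F = sum_i p_i E_i - H, with H = -sum_i p_i log p_i (in nats)."""
    expected_energy = sum(pi * ei for pi, ei in zip(p, energies))
    entropy = -sum(pi * math.log(pi) for pi in p if pi > 0)
    return expected_energy - entropy


def boltzmann(energies):
    """p_i = exp(-E_i) / sum_j exp(-E_j), shifted by min(E) for numerical stability."""
    e_min = min(energies)
    weights = [math.exp(-(e - e_min)) for e in energies]
    total = sum(weights)
    return [w / total for w in weights]


energies = [1.0, 1.2, 3.0]  # hypothetical code costs in nats

p_star = boltzmann(energies)
f_star = free_energy(p_star, energies)
f_greedy = free_energy([1.0, 0.0, 0.0], energies)   # always use the cheapest code
f_uniform = free_energy([1 / 3] * 3, energies)      # ignore the energies entirely

print("Boltzmann:", f_star)
print("greedy:   ", f_greedy)
print("uniform:  ", f_uniform)
print("-log Z:   ", -math.log(sum(math.exp(-e) for e in energies)))
```

At the minimum, $F$ equals $-\log \sum_j e^{-E_j}$, which is lower than the cost of the single cheapest code; that gap is the saving from choosing codes stochastically.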


In “Temporal Autoencoding Restricted Boltzmann Machine”, Hausler and Susemihl explain how to train a deep belief RBM to learn to recognize patterns in sequences of inputs (mostly video).  The resulting networks could recognize the patterns in human motion capture or the non-linear dynamics of a bouncing ball.

Bengio and LeCun created this wonderful video on Deep Neural Networks.  Any logical function can be represented by a neural net with 3 layers (one hidden; see e.g. CNF). However, simple 4-level logical functions with a small number of nodes may require a large number of nodes in a 3-layer representation.  They point to theorems showing that representing a k-level logical function in a (k-1)-level network can require exponentially many nodes. They go on to explain denoising autoencoders for the training of deep neural nets.
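A standard illustration of this depth versus size trade-off (not necessarily the example used in the video) is n-bit parity. As a shallow OR-of-ANDs formula it needs one AND term per input with odd parity, and none of those terms can be merged, whereas a deep chain of two-input XOR gates needs only n-1 gates. The sketch below counts both.

```python
from itertools import product


def parity(bits):
    return sum(bits) % 2


def dnf_terms(n):
    """Inputs on which n-bit parity is 1; each one is an AND term in the shallow OR-of-ANDs form."""
    return [bits for bits in product((0, 1), repeat=n) if parity(bits) == 1]


def can_merge_any(terms):
    """True if two terms differ in exactly one bit, i.e. could be merged into a shorter AND term."""
    term_set = set(terms)
    for term in terms:
        for i in range(len(term)):
            flipped = term[:i] + (1 - term[i],) + term[i + 1:]
            if flipped in term_set:
                return True
    return False


for n in (2, 4, 8):
    terms = dnf_terms(n)
    # shallow size grows as 2**(n-1); the deep XOR chain grows as n-1
    print(n, "shallow AND terms:", len(terms), "deep XOR gates:", n - 1,
          "mergeable:", can_merge_any(terms))
```

Flipping any single bit changes the parity, so no two terms are neighbours and the shallow form stays exponentially large.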

In “Improving neural networks by preventing co-adaptation of feature detectors”, Hinton, Srivastava, Krizhevsky, Sutskever, and Salakhutdinov answer the question:  What happens if “On each presentation of each training case, each hidden unit is randomly omitted from the network with a probability of 0.5, so a hidden unit cannot rely on other hidden units being present.”  This mimics the standard technique of training several neural nets and averaging them, but it is faster.  When they applied the “dropout” technique to a deep Boltzmann neural net on the MNIST handwritten digit data set and the TIMIT speech data set, they got robust learning without overfitting.  This was one of the main techniques used by the winners of the Merck Molecular Activity Challenge.
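Here is a minimal sketch of the dropout idea on a single vector of hidden activations. At training time each unit is omitted with probability 0.5; at test time all units are kept and scaled by the keep probability, which for a linear next layer has the same effect as halving the outgoing weights. The activations and function names are hypothetical.

```python
import random


def dropout_train(hidden, rng, p_drop=0.5):
    """Training pass: omit each hidden unit independently with probability p_drop."""
    return [0.0 if rng.random() < p_drop else h for h in hidden]


def dropout_test(hidden, p_drop=0.5):
    """Test pass: keep every unit but scale it by the keep probability."""
    return [h * (1.0 - p_drop) for h in hidden]


rng = random.Random(0)
hidden = [1.0, 2.0, -3.0]  # hypothetical hidden activations

print(dropout_train(hidden, rng))  # one random "thinned" network
print(dropout_test(hidden))        # the averaged network used at test time

# Averaging many thinned passes approaches the test-time output.
trials = 4000
sums = [0.0] * len(hidden)
for _ in range(trials):
    for i, value in enumerate(dropout_train(hidden, rng)):
        sums[i] += value
averages = [s / trials for s in sums]
print(averages)
```

The last printout is the sense in which dropout mimics averaging many networks: the test-time scaling matches the expected activation over random masks.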

Hinton talks about the dropout technique in his video Brains, Sex, and Machine Learning.
