# Sparsity

You are currently browsing the archive for the Sparsity category.

## “Deep Support Vector Machines for Regression Problems”

Nuit Blanche‘s article “The Summer of the Deeper Kernels” references the two page paper “Deep Support Vector Machines for Regression Problems” by Schutten, Meijster, and Schomaker (2013).

The deep SMV is a pretty cool idea.  A normal support vector machine (SVM) classifier, finds $\alpha_i$ such that

$f(x) = \sum_i \alpha_i K(x_i, x)$ is positive for one class of $x_i$ and negative for the other class (sometimes allowing exceptions).  ($K(x,y)$ is called the kernel function which is in the simplest case just the dot product of $x$ and $y$.)  SVM’s are great because they are fast and the solution is sparse (i.e. most of the $\alpha_i$ are zero).

Schutten, Meijster, and Schomaker apply the ideas of deep neural nets to SVMs.

They construct $d$ SVMs of the form

$f_a(x) = \sum_i \alpha_i(a) K(x_i, x)+b_a$

and then compute a more complex two layered SVM

$g(x) = \sum_i \alpha_i K(f(x_i), f(x))+b$

where $f(x) = (f_1(x), f_2(x), \ldots, f_d(x))$.  They use a simple gradient descent algorithm to optimize the alphas and obtain numerical results on ten different data sets comparing the mean squared error to a standard SVM.

## “Autoencoders, MDL, and Helmholtz Free Energy”

In “Autoencoders, MDL, and Helmholtz Free Energy“, Hinton and Zemel (2001) use Minimum Description Length as an objective function for formulating generative and recognition weights for an autoencoding neural net.  They develop a stochastic Vector Quantization method very similar to mixture of Gaussians where each input vector is encoded with

$$E_i = – \log \pi_i – k \log t + {k\over2} \log 2 \pi \sigma^2 + {{d^2} \over{2\sigma^2}}$$

nats (1 nat = 1/log(2) bits = 1.44 bits) where $t$ is the quantization width, $d$ is the Mahalanobis distance to the mean of the Gaussian, $k$ is the dimension of the input space, $\pi_i$ is the weight of the $i$th Gaussian.  They call this the “energy” of the code.  Encoding only using this scheme wastes bits because, for example, there may be vectors that are equally distant from two Gaussian. The amount wasted is

$$H = -\sum p_i \log p_i$$

where $p_i$ the probability that the code will be assigned to the $i$th Gaussian. So the “true” expected description length is

$$F = \sum_i p_i E_i – H$$

which “has exactly the form of the Helmholtz free energy.”  This free energy is minimized by setting

$$p_i = {{e^{-E_i}}\over{\sum_j e^{-E_j}}}.$$

In order to make computation practical, they recommend using a suboptimal distributions “as a Lyapunov function for learning” (see Neal and Hinton 1993). They apply their method to learn factorial codes.

## “Matrix Factorizations and the Grammar of Life”

I’m quite excited by the Nuit Blanche post on the papers “Structure Discovery in Nonparametric Regression through Compositional Kernel Search” (Duvenaudy, Lloydy, Grossez, Tenenbaumz, Ghahramaniy 2013) and “Exploiting compositionality to explore a large space of model structures” (Grosse, Salakhutdinovm, Freeman, and Tenenbaum 2012).  For years my old company Momentum Investment Services, Carl, and I have been looking for fast, systematic ways to search large hypothesis spaces.  We considered context-free grammars as a means of generating hypothesis.  Carl and I did not get anywhere with that research, but now it seems that others have succeeded.  Be sure to look over the article, the blog posts, and the comments.

## “Exact Sparse Recovery with L0 Projections”

Thanks to Nuit Blanche for pointing me towards the paper “Exact Sparse Recovery with L0 Projections” by Ping Li and Cun-Hui Zhang (2013).  The paper describes an algorithm (Min+Gap(3)), which perfectly recovers $x\in R^N$ from $M_0$ random measurements $y_i\in R$.  The vector $y$ is generated by

$y = S x$

where $S\in R^{N\times M},\ ||x||_0 = K,\ K\ll N,\ s_{ij} \sim S(\alpha, 1)$, and $S(\alpha, 1)$ denotes an $\alpha$-stable distribution with unit scale.   Typically the number of measurements required for exact recovery is “about $5 K$”.  More specifically,

“$M_0 = K \log ((N − K)/\delta)$ measurements are sufﬁcient for detecting all zeros with at least probability $1 − \delta$.”

Li and Zhang illustrate the effectiveness of their algorithm by applying it to a simple video stream exactly recovering the video from measurements constructed from the differences between frames.  They also discuss an application to finding “heavy hitters” (“elephant detection”) in marketing data (see e.g “Compressive sensing medium access control for wireless lans” Globecom, 2012 and “A data streaming algorithm for estimating entropies of od ﬂows” IMC, San Diego, CA, 2007.).

The idea for the algorithm seems to have stemmed from Piotr Indyk’s paper “Stable distributions, pseudorandom generators, embeddings, and data stream computation.”  Li and Zang state that Indyk estimated “the $\alpha$-th frequency moment $\sum_{i=1}^N |x_i|^\alpha$” using random measurements generated from an $\alpha$-stable distribution instead of the harder problem of recovering $x$ exactly.

In numerical simulations, they compare their algorithm to orthogonal matching pursuit and linear programming.  Additionally, detailed proofs of the their algorithm’s effectiveness are provided.

## History of Compressed Sensing

Igor Carron provides a nice history of Compressed Sensing on his blog.  He reviews the computation complexity of various Compressed Sensing algorithms, noting that the field started with the discovery of polynomial time solutions to underdetermined linear problems.  While $l_0$ norm regularization results in NP-complete problems, $l_1$ regularization is easier.