
NIPS was pretty fantastic this year. There were a number of breakthroughs in the areas that interest me most: Markov Decision Processes, Game Theory, Multi-Armed Bandits, and Deep Belief Networks. Here is the list of papers, workshops, and presentations I found the most interesting or potentially useful:

- Representation, Inference and Learning in Structured Statistical Models
- Stochastic Search and Optimization
- Quantum information and the Brain
- Relax and Randomize : From Value to Algorithms (Great)
- Classification with Deep Invariant Scattering Networks
- Discriminative Learning of Sum-Product Networks
- On the Use of Non-Stationary Policies for Stationary Infinite-Horizon Markov Decision Processes
- A Unifying Perspective of Parametric Policy Search Methods for Markov Decision Processes
- Regularized Off-Policy TD-Learning
- Multi-Stage Multi-Task Feature Learning
- Graphical Models via Generalized Linear Models (Great)
- No voodoo here! Learning discrete graphical models via inverse covariance estimation (Great)
- Gradient Weights help Nonparametric Regressors
- Dropout: A simple and effective way to improve neural networks (Great)
- Efficient Monte Carlo Counterfactual Regret Minimization in Games with Many Player Actions
- A Better Way to Pre-Train Deep Boltzmann Machines
- Bayesian Optimization and Decision Making
- Practical Bayesian Optimization of Machine Learning Algorithms
- Modern Nonparametric Methods in Machine Learning
- Deep Learning and Unsupervised Feature Learning

In “Semantic Hashing”, Salakhutdinov and Hinton (2007) show how to represent documents with short binary vectors. They combine deep learning and graphical models to assign each document a binary code; similar documents can then be found by comparing codes with the Hamming distance (equivalently, the L1 distance between binary vectors). Here is their abstract.
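A minimal sketch of the retrieval step, assuming we already have 32-bit codes for each document (the codes below are random stand-ins, not the output of a trained model):

```python
import numpy as np

# Hypothetical 32-bit binary codes for 5 documents, as a semantic-hashing
# model would produce (random stand-ins here).
rng = np.random.default_rng(0)
codes = rng.integers(0, 2, size=(5, 32), dtype=np.uint8)

def hamming_neighbors(query_code, codes, max_bits=3):
    """Indices of documents whose codes differ from the query in at
    most `max_bits` positions (Hamming distance)."""
    dists = np.count_nonzero(codes != query_code, axis=1)
    return np.flatnonzero(dists <= max_bits)

# Documents near document 0 (always includes document 0 itself, distance 0).
neighbors = hamming_neighbors(codes[0], codes)
```

In the paper the lookup is even cheaper: the code is used directly as a memory address, and nearby addresses are enumerated bit-flip by bit-flip.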

We show how to learn a deep graphical model of the word-count vectors obtained from a large set of documents. The values of the latent variables in the deepest layer are easy to infer and give a much better representation of each document than Latent Semantic Analysis. When the deepest layer is forced to use a small number of binary variables (e.g. 32), the graphical model performs “semantic hashing”: Documents are mapped to memory addresses in such away that semantically similar documents are located at nearby addresses. Documents similar to a query document can then be found by simply accessing all the addresses that differ by only a few bits from the address of the query document. This way of extending the efficiency of hash-coding to approximate matching is much faster than locality sensitive hashing, which is the fastest current method. By using semantic hashing to ﬁlter the documents given to TF-IDF, we achieve higher accuracy than applying TF-IDF to the entire document set.

In “Church: a language for generative models”, Goodman, Mansinghka, Roy, Bonawitz, and Tenenbaum introduce the probabilistic programming language Church, “a universal language for describing stochastic generative processes. Church is based on the Lisp model of lambda calculus, containing a pure Lisp as its deterministic subset.” There will be a workshop on probabilistic programming at NIPS (which I first read about at the blog Statistical Modeling, Causal Inference, and Social Science). Here is a cool tutorial.
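Church itself is Lisp-based, but its core idea — conditioning an arbitrary generative process, here by rejection sampling — can be sketched in a few lines of Python (the model and names below are illustrative, not Church code):

```python
import random

def flip(p=0.5):
    """Bernoulli draw, the basic stochastic primitive in Church."""
    return random.random() < p

def model():
    """A tiny generative process: two biased coins and their OR."""
    a = flip(0.3)
    b = flip(0.3)
    return a, b, (a or b)

def rejection_query(model, condition, n=10000):
    """Church-style conditioning: keep only samples where the
    condition holds, then estimate P(a | condition)."""
    samples = []
    for _ in range(n):
        a, b, c = model()
        if condition(a, b, c):
            samples.append(a)
    return sum(samples) / len(samples)

# P(a | a or b): conditioning raises the estimate above the prior 0.3
# (the exact posterior is 0.3 / 0.51, about 0.59).
random.seed(0)
p = rejection_query(model, lambda a, b, c: c)
```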

In the seminal paper “Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data“, Lafferty, McCallum, and Pereira (2001) introduce a very popular type of Markov random field for segmentation and labeling. Conditional Random Fields (CRFs) are used in many fields including machine translation, parsing, genetics, and transmission codes. They can be viewed as an undirected, conditionally trained counterpart of hidden Markov models. The paper describes conditional random fields, provides an iterative method to estimate the parameters of a CRF, and reports experimental comparisons between CRFs, hidden Markov models, and maximum entropy Markov models.

Nuit Blanche pointed me toward the interesting paper “Re-Weighted L1 Dynamic Filtering for Time-Varying Sparse Signal Estimation” by Charles and Rozell (2012). They investigate a generalized Kalman filter setting: a hidden Markov model with a continuous (real-valued) state space and real-valued observations. More specifically,

$$x_n = f_n (x_{n-1}) + v_n$$

$$y_n = G_n x_n + \epsilon_n$$

where $x_n$ is the state, $f_n$ is the *known* state transition function, $y_n$ is a real-valued observation vector, $G_n$ is the known observation matrix (full rank), $v_n$ is the process noise vector (not necessarily Gaussian), and $\epsilon_n$ is the observation noise vector (again, not necessarily Gaussian). The object of the game is to estimate $x_n$ given the transition function, the observation matrices, and the observations $y_n$.
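For reference, the classical Kalman filter handles the special case where $f_n$ is linear and both noises are Gaussian. A minimal sketch of one predict/update cycle (my notation, not the paper's):

```python
import numpy as np

def kalman_step(x_est, P, y, F, G, Q, R):
    """One predict/update cycle of the classical Kalman filter:
    f_n linear (matrix F), Gaussian process noise with covariance Q,
    Gaussian observation noise with covariance R."""
    # Predict
    x_pred = F @ x_est
    P_pred = F @ P @ F.T + Q
    # Update
    S = G @ P_pred @ G.T + R              # innovation covariance
    K = P_pred @ G.T @ np.linalg.inv(S)   # Kalman gain
    x_new = x_pred + K @ (y - G @ x_pred)
    P_new = (np.eye(len(x_est)) - K @ G) @ P_pred
    return x_new, P_new
```

The covariance propagation in the update step is exactly the “higher-order statistics” the authors argue should be rethought once the Gaussian assumptions fail.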

They sparsely represent the state vector $x_n$ as $x_n = W z_n$, where $W$ is a matrix and $z_n$ is a vector that is mostly zeros. To encourage a sparse representation, an L1-norm regularization term is applied. This regularization corresponds to a Laplacian prior, as in basis pursuit denoising or LASSO regression. As with neural nets, sparsity has had an impact in many fields. They write

In particular, outside of the dynamic estimation regime, signal priors based on the notion of sparsity have led to state-of-the-art performance in many linear inverse problems (e.g., undersampling, inpainting, denoising, compressed sensing, etc. [3]) across several different application domains (e.g. natural images [3], audio [4], hyperspectral imagery [5], etc.).

The previous work integrating sparsity models into dynamic estimation has not yet followed the spirit of the successful Kalman filter by propagating higher-order statistics from one time step to the next in a way that is well-matched to the assumptions of the problem. As is pointed out in [8], propagating a covariance matrix is no longer necessarily the correct or optimal method once the assumptions of the Kalman filter are not met. Instead, such a matrix is no more special than a simple parameter matrix, which may or may not assist in the signal estimation.
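The static L1-regularized estimate behind basis pursuit denoising and LASSO can be sketched with iterative soft-thresholding (ISTA). This is the generic solver, not the paper's re-weighted dynamic filter, which additionally drives the weights from the previous time step's estimate:

```python
import numpy as np

def soft_threshold(v, lam):
    """Proximal operator of the L1 norm (element-wise shrinkage)."""
    return np.sign(v) * np.maximum(np.abs(v) - lam, 0.0)

def ista(G, y, lam, n_iter=500):
    """Minimize 0.5 * ||y - G z||^2 + lam * ||z||_1 by iterative
    soft-thresholding: a gradient step on the quadratic term
    followed by shrinkage."""
    step = 1.0 / np.linalg.norm(G, 2) ** 2   # 1 / Lipschitz constant
    z = np.zeros(G.shape[1])
    for _ in range(n_iter):
        z = soft_threshold(z + step * G.T @ (y - G @ z), step * lam)
    return z
```

With a well-conditioned $G$ and a genuinely sparse signal, this recovers the nonzero entries up to the small bias that L1 shrinkage introduces.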

The paper includes several experimental results in section IV.