Graphical Models


Answer:  Statistical Relational Learning.  Maybe I can get the book for Christmas.

NIPS was pretty fantastic this year.  There were a number of breakthroughs in the areas that interest me most:  Markov Decision Processes, Game Theory, Multi-Armed Bandits, and Deep Belief Networks.  Here is the list of papers, workshops, and presentations I found the most interesting or potentially useful:


  1. Representation, Inference and Learning in Structured Statistical Models
  2. Stochastic Search and Optimization
  3. Quantum information and the Brain
  4. Relax and Randomize: From Value to Algorithms  (Great)
  5. Classification with Deep Invariant Scattering Networks
  6. Discriminative Learning of Sum-Product Networks
  7. On the Use of Non-Stationary Policies for Stationary Infinite-Horizon Markov Decision Processes
  8. A Unifying Perspective of Parametric Policy Search Methods for Markov Decision Processes
  9. Regularized Off-Policy TD-Learning
  10. Multi-Stage Multi-Task Feature Learning
  11. Graphical Models via Generalized Linear Models (Great)
  12. No voodoo here! Learning discrete graphical models via inverse covariance estimation (Great)
  13. Gradient Weights help Nonparametric Regressors
  14. Dropout: A simple and effective way to improve neural networks (Great)
  15. Efficient Monte Carlo Counterfactual Regret Minimization in Games with Many Player Actions
  16. A Better Way to Pre-Train Deep Boltzmann Machines
  17. Bayesian Optimization and Decision Making
  18. Practical Bayesian Optimization of Machine Learning Algorithms
  19. Modern Nonparametric Methods in Machine Learning
  20. Deep Learning and Unsupervised Feature Learning
Unfortunately, when you have 30 full-day workshops in a two-day period, you miss most of them.  I could only attend the three listed above.  There were many other great ones.


In “Semantic Hashing”, Salakhutdinov and Hinton (2007) show how to represent and retrieve documents with short binary vectors.  They combine deep learning and graphical models to assign each document a binary code.  Similar documents can then be found by comparing the L1 (Hamming) distance between the binary codes.  Here is their abstract.

We show how to learn a deep graphical model of the word-count vectors obtained from a large set of documents. The values of the latent variables in the deepest layer are easy to infer and give a much better representation of each document than Latent Semantic Analysis. When the deepest layer is forced to use a small number of binary variables (e.g. 32), the graphical model performs “semantic hashing”: Documents are mapped to memory addresses in such a way that semantically similar documents are located at nearby addresses. Documents similar to a query document can then be found by simply accessing all the addresses that differ by only a few bits from the address of the query document. This way of extending the efficiency of hash-coding to approximate matching is much faster than locality sensitive hashing, which is the fastest current method. By using semantic hashing to filter the documents given to TF-IDF, we achieve higher accuracy than applying TF-IDF to the entire document set.
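The lookup idea in the abstract can be sketched in a few lines of Python. This is only an illustration of the retrieval step: the 8-bit codes and document names below are made up, not produced by a trained deep model, which is where the actual paper does its work.

```python
# Toy sketch of semantic-hashing lookup: documents live at binary
# "addresses", and a query visits all addresses within a small Hamming
# radius.  Codes and documents here are invented for illustration.
from itertools import combinations

def neighbors_within(code, radius, nbits=8):
    """All codes whose Hamming distance from `code` is at most `radius`."""
    out = {code}
    for r in range(1, radius + 1):
        for bits in combinations(range(nbits), r):
            flipped = code
            for b in bits:
                flipped ^= 1 << b   # flip bit b
            out.add(flipped)
    return out

# Pretend each document was assigned an 8-bit semantic code.
index = {0b10110010: ["doc_a"], 0b10110011: ["doc_b"], 0b01001100: ["doc_c"]}

def query(code, radius=1):
    hits = []
    for c in neighbors_within(code, radius):
        hits.extend(index.get(c, []))
    return sorted(hits)

print(query(0b10110010, radius=1))  # finds doc_a (exact) and doc_b (1 bit away)
```

Because the candidate set is every address within a few bit flips, retrieval cost depends on the code length and radius, not on the number of documents — that is the speed advantage the abstract claims over locality sensitive hashing.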

In “Church: a language for generative models”, Goodman, Mansinghka, Roy, Bonawitz, and Tenenbaum introduce the probabilistic programming language Church, “a universal language for describing stochastic generative processes. Church is based on the Lisp model of lambda calculus, containing a pure Lisp as its deterministic subset.”  There will be a workshop on probabilistic programming at NIPS (which I first read about at the blog Statistical Modeling, Causal Inference, and Social Science).  Here is a cool tutorial.
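The core idea behind Church can be sketched without its Lisp syntax: a model is just a stochastic procedure, and inference means sampling executions of that procedure conditioned on an observation. Below is a minimal Python analogue of Church's rejection-style query, with a toy two-coin model; the model and the condition are my own illustration, not an example from the paper.

```python
import random

def flip(p=0.5):
    """An elementary random primitive, like Church's `flip`."""
    return random.random() < p

def model():
    # Generative process: toss two fair coins.
    a, b = flip(), flip()
    return a, b

def rejection_query(model, condition, n=10000):
    """Crude rejection sampling: keep only runs satisfying `condition`."""
    return [s for s in (model() for _ in range(n)) if condition(s)]

random.seed(0)
# Condition on observing "at least one coin came up heads".
samples = rejection_query(model, lambda s: s[0] or s[1])
p_a = sum(a for a, b in samples) / len(samples)
print(round(p_a, 2))  # P(a | a or b) should be close to 2/3
```

Rejection sampling is the conceptually simplest inference strategy; real probabilistic programming systems use much more efficient methods, but the programming model — write the generative process, then condition — is the same.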

In the seminal paper “Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data“, Lafferty, McCallum, and Pereira (2001) introduce a very popular type of Markov random field for segmentation. Conditional Random Fields (CRFs) are used in many fields including machine translation, parsing, genetics, and transmission codes.  They are an undirected analogue of hidden Markov models.  The paper describes conditional random fields, provides an iterative method to estimate the parameters of a CRF, and reports experimental comparisons between CRFs, hidden Markov models, and maximum entropy Markov models.
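A linear-chain CRF defines p(y|x) as proportional to the exponential of a sum of transition and per-position feature scores. The toy sketch below uses made-up scores and brute-force enumeration over label sequences to normalize; it illustrates the model form only, not the paper's parameter-estimation procedure.

```python
from itertools import product
from math import exp

LABELS = ["N", "V"]   # toy tag set; all scores below are invented
trans = {("N", "N"): 0.5, ("N", "V"): 1.0, ("V", "N"): 0.8, ("V", "V"): -0.5}
emit = [{"N": 1.2, "V": 0.1},   # per-position scores for one fixed input x
        {"N": 0.3, "V": 1.5}]

def score(y):
    """Unnormalized log-score: emission terms plus transition terms."""
    s = sum(emit[i][tag] for i, tag in enumerate(y))
    s += sum(trans[(y[i], y[i + 1])] for i in range(len(y) - 1))
    return s

# Normalize by enumerating all label sequences (fine at toy sizes; real
# CRFs use forward-backward dynamic programming instead).
Z = sum(exp(score(y)) for y in product(LABELS, repeat=2))

def p(y):
    return exp(score(y)) / Z

best = max(product(LABELS, repeat=2), key=p)
print(best, round(p(best), 3))
```

The key contrast with an HMM is that the scores condition on the whole observation sequence x and are normalized globally over label sequences, which is what avoids the label-bias problem of maximum entropy Markov models.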

Nuit Blanche pointed me toward the interesting paper “Re-Weighted L1 Dynamic Filtering for Time-Varying Sparse Signal Estimation” by Charles and Rozell (2012). They investigate the generalized Kalman filter setting, which is a hidden Markov model with a continuous (uncountable) state space and real-valued observations. More specifically,
$$x_n = f_n (x_{n-1}) + v_n$$

$$y_n = G_n x_n + \epsilon_n$$

where $x_n$ is the state, $f_n$ is the known state transition function, $y_n$ is a real valued observation vector, $G_n$ is the known observation matrix (full rank), $v_n$ is the process vector noise (not necessarily Gaussian), and $\epsilon_n$ is the observation noise vector (once again, not necessarily Gaussian). The object of the game is to estimate $x_n$ given the transition function, the observation matrices, and the observations $y_n$.
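When $f_n$ is linear and both noise terms are Gaussian, this is exactly the setting of the classical Kalman filter, the baseline the paper generalizes. A one-dimensional predict/update sketch (with invented toy values for the dynamics and noise variances) looks like this:

```python
# One-dimensional Kalman filter sketch: assumes linear dynamics
# x_n = a * x_{n-1} + v_n and observations y_n = g * x_n + eps_n with
# Gaussian noise.  All numeric values below are made up for illustration.

def kalman_step(mean, var, y, a=1.0, g=1.0, q=0.1, r=0.5):
    """One predict/update cycle; q and r are process/observation variances."""
    # Predict the state through the dynamics.
    mean_pred = a * mean
    var_pred = a * a * var + q
    # Update with the observation y_n.
    k = var_pred * g / (g * g * var_pred + r)   # Kalman gain
    mean_new = mean_pred + k * (y - g * mean_pred)
    var_new = (1.0 - k * g) * var_pred
    return mean_new, var_new

mean, var = 0.0, 1.0           # prior on the initial state
for y in [1.2, 0.9, 1.1, 1.0]:
    mean, var = kalman_step(mean, var, y)
print(round(mean, 2), round(var, 3))
```

The point of the quote below is that once the noise is non-Gaussian or the state is sparse rather than smooth, the mean-and-covariance recursion above is no longer the right statistic to propagate.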

They sparsely represent the state vector $x_n$ by $x_n = W z_n$ where $W$ is a matrix and $z_n$ is a vector that is mostly zeros. In order to achieve a sparse representation, an L1-norm regularization term is applied. This regularization usually corresponds to a Laplacian prior, as in basis pursuit denoising or LASSO regression. As with neural nets, sparsity has had an impact in many fields. They write

In particular, outside of the dynamic estimation regime, signal priors based on the notion of sparsity have led to state-of-the-art performance in many linear inverse problems (e.g., undersampling, inpainting, denoising, compressed sensing, etc. [3]) across several different application domains (e.g. natural images [3], audio [4], hyperspectral imagery [5], etc.).

The previous work integrating sparsity models into dynamic estimation has not yet followed the spirit of the successful Kalman filter by propagating higher-order statistics from one time step to the next in a way that is well-matched to the assumptions of the problem.

As is pointed out in [8], propagating a covariance matrix is no longer necessarily the correct or optimal method once the assumptions of the Kalman filter are not met. Instead, such a matrix is no more special than a simple parameter matrix, which may or may not assist in the signal estimation.

The paper includes several experimental results in section IV.
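The basic building block behind the L1 machinery discussed above is the soft-thresholding operator, the proximal operator of the L1 norm used inside LASSO and basis-pursuit-denoising solvers (and, in re-weighted form, in the paper's filter). A minimal sketch, with made-up input values:

```python
# Soft-thresholding: shrink each coefficient toward zero and set exact
# zeros for entries whose magnitude is below the threshold t.  This is
# why L1 penalties produce sparse solutions.  Inputs are illustrative.

def soft_threshold(z, t):
    """Elementwise prox of t * |.|."""
    return [0.0 if abs(v) <= t else (abs(v) - t) * (1.0 if v > 0 else -1.0)
            for v in z]

z = [0.9, -0.05, 0.02, -1.4, 0.3]
print([round(v, 2) for v in soft_threshold(z, 0.1)])
# → [0.8, 0.0, 0.0, -1.3, 0.2]
```

Re-weighted L1 schemes like the one in the paper replace the single threshold t with per-coefficient weights that are updated between iterations (and, here, between time steps), so that coefficients believed to be active are penalized less.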
