Graphical Models


I was doing a little bit of research on Multi-stage Markov Chains for a friend when I ran across this nice set of slides on Multi-stage Markov Chains for diffusion through porous media (i.e., oil or water flowing through the ground) by Pereira and Rahunanthan.

The authors first review the multi-scale equations for diffusion under pressure (see Darcy’s law) and mixed finite element analysis.  Then, interestingly, they introduce a Bayesian Markov chain Monte Carlo (MCMC) method to get approximate solutions to the differential equation.  They use multi-stage Hastings-Metropolis and prefetching (to parallelize the computation) to improve the speed of MCMC.  The slides also display the results of their algorithm applied to aquifer contamination and oil recovery.
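As a refresher, here is a minimal random-walk Metropolis-Hastings sampler in Python. This is only the basic building block that the slides’ multi-stage and prefetching machinery accelerates, and the Gaussian target at the bottom is just a stand-in for their posterior over permeability fields:

```python
import numpy as np

def metropolis_hastings(log_target, x0, n_samples=10_000, step=0.5, seed=0):
    """Random-walk Metropolis-Hastings: propose a Gaussian perturbation,
    accept with probability min(1, target(proposal) / target(current))."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    logp = log_target(x)
    samples = np.empty((n_samples, x.size))
    for i in range(n_samples):
        proposal = x + step * rng.standard_normal(x.size)  # symmetric proposal
        logp_prop = log_target(proposal)
        if np.log(rng.random()) < logp_prop - logp:        # accept/reject
            x, logp = proposal, logp_prop
        samples[i] = x                                     # rejected moves repeat x
    return samples

# Stand-in target: a 2-D standard normal.
draws = metropolis_hastings(lambda x: -0.5 * x @ x, x0=np.zeros(2))
print(draws.mean(axis=0), draws.var(axis=0))               # roughly [0 0] and [1 1]
```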

It’s really cool that methods from probabilistic graphical models can be used to approximate the solutions of partial differential equations.

 

I’m hoping to get back to 2048 in a few weeks.  Turns out that it takes a long time to write code, make the code look nice, run the code, analyze the results, and then put together a blog post.  It’s much easier and quicker to read papers and summarize them or to try to explain things you already know.

 

Have a great weekend.  – Hein

So I have been wanting to understand Category Theory (see [1], [2], [3]) mainly because I thought it would help me understand advanced functional programming in Haskell and because of the book “Physics, topology, logic and computation: a Rosetta Stone” by Baez and Stay (2011).  A number of the constructs in Haskell were derived from Category Theory so that was enough motivation for me to learn it.  But also, monoidal categories are useful in several fields: physics, computer science, pure mathematics, and statistics. The biggest application for me is the idea of generalizing probabilistic graphical models through either Ologs or monoidal categories / tensor networks (see Brendan Fong’s thesis “Causal Theories: A Categorical Perspective on Bayesian Networks”).

To start my investigation of Category Theory, I began with the $20 thin book “Basic Category Theory for Computer Scientists” by Benjamin Pierce (see also the free online version “A taste of category theory for computer scientists” (1988)).

I really enjoyed the book.  Dr. Pierce’s style is a little informal compared to pure math books like Mac Lane’s “Categories for the Working Mathematician” (pdf / Amazon), but I enjoy that more relaxed style of writing when I am first learning a field.

The biggest obstacle to learning category theory is the fact that category theory generalizes a lot of areas of pure mathematics like topology, abstract algebra, and geometry.  It’s hard to generalize before you have examples to generalize from, but the examples being generalized in category theory come mostly from higher-level mathematics found in senior-level undergraduate and graduate courses. Pierce ameliorates this problem by introducing some of the most basic categories first: sets, ordered sets, partially ordered sets, groups, monoids, vector spaces, measure spaces, topological spaces, proofs, and a simple functional computer language.  He takes the time to explicitly define most of these ideas, so, in theory, you could read this book without a background in theoretical mathematics, but it would be hard.

After defining categories and introducing the most basic examples, Pierce describes and defines the most basic ideas in category theory: subcategories, commutative diagrams, monomorphisms, epimorphisms, isomorphisms, initial/terminal objects, products, coproducts, universal constructions, equalizers, pullbacks, pushouts, limits, cones, colimits, cocones, exponentiation, and Cartesian closed categories. These ideas are spelled out over the thirty pages of chapter one, including illuminating homework exercises. The homework exercises vary significantly in difficulty: many are trivial, but there are two or three that I am still working on despite investing several hours of thought.  Generally, I found the exercises to be a bit harder than those in Mac Lane’s book, though Pierce’s book requires less background in mathematics.  A couple of the exercises were incorrectly stated or impossible.

Chapter two introduces functors, natural transformations, adjoints, and F-algebras. After reading this chapter, I was finally able to understand the definition of monads, which are an important part of the computer language Haskell!  Pierce provides many examples of each of these ideas and enjoyable homework exercises to increase understanding.  Pierce’s definition of adjoints is much easier to understand than the standard definitions via the unit/counit adjunction or hom-sets.

The last major chapter concerns applications of category theory to computer science, specifically the lambda calculus and programming language design.

The first two chapters of the book give a reasonably condensed introduction to category theory for students that have taken a course in abstract algebra.  A course in topology or linear algebra would be another useful prerequisite.  I carried the light 100-page book around for a few months so that I could learn something whenever I had some extra time.  I had hoped that once I had proven that several functors were monads, I would really understand monads, but a full understanding still eludes me.  Similarly, I have proven that several functor pairs are adjoint, but I still don’t feel as though I understand adjoint functors.  I guess more work and learning are needed.

Have a great year.  –  Cheers, Hein

 

Check out Markus Beissinger’s blog post “Deep Learning 101”.  Markus reviews a lot of deep learning basics derived from the papers “Representation Learning: A Review and New Perspectives” (Bengio, Courville, Vincent 2012) and “Deep Learning of Representations: Looking Forward” (Bengio 2013). Beissinger covers the following topics:

  • An easy intro to Deep Learning
  • The Current State of Deep Learning
  • Probabilistic Graphical Models
  • Principal Component Analysis
  • Restricted Boltzmann Machines
  • Auto-Encoders
  • “Challenges Looking Ahead”

This is a great intro and I highly recommend it.

If you want more information, check out Ng’s lecture notes, Honglak Lee’s 2010 NIPS slides, and Hinton’s videos ([2009], [2013]).

Michael Nielsen wrote an interesting, informative, and lengthy blog post on Simpson’s paradox and causal calculus titled “If correlation doesn’t imply causation, then what does?”  Nielsen’s post reminded me of Judea Pearl’s talk at KDD 2011 where Pearl described his causal calculus.  At the time I found it hard to follow, but Nielsen’s post made it clearer to me.

 

Causal calculus is a way of reasoning about causality when the independence relationships among the random variables are known, even if some of the variables are unobserved.  It uses notation like

$\alpha$ = P( Y=1 | do(X=2))

to mean the probability that Y=1 if an experimenter forces the X variable to be 2. Using Pearl’s calculus, it may be possible to estimate $\alpha$ from a large number of observations where X is free rather than performing the experiment where X is forced to be 2.  This is not as straightforward as it might seem. We tend to conflate P(Y=1 | do(X=2)) with the conditional probability P(Y=1 | X=2). Below I will describe an example, based on Simpson’s paradox, where they are different.
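The key tool relating the two quantities (and the one implicitly used in the calculations below) is Pearl’s adjustment formula: when a variable $Z$ accounts for all of the confounding between $X$ and $Y$,

$$P(Y=y \mid do(X=x)) = \sum_z P(Y=y \mid X=x, Z=z)\, P(Z=z).$$

In the example below, the role of $Z$ is played by the size of the kidney stone.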

Suppose that there are two treatments for kidney stones: treatment A and treatment B.  The following situation is possible:

  • Patients that received treatment A recovered 33% of the time.  
  • Patients that received treatment B recovered 67% of the time.  
  • Treatment A is significantly better than treatment B.

This seemed very counterintuitive to me.  How is this possible?

The problem is that there is a hidden variable in the kidney stone situation. Some kidney stones are larger and therefore harder to treat and others are smaller and easier to treat.  If treatment A is usually applied to large stones and treatment B is usually used for small stones, then the recovery rate for each treatment is biased by the type of stone it treated.

Imagine that

  • treatment A is given to one million people with a large stone and 1/3 of them recover,
  • treatment A is given to one thousand people with a small stone and all of them recover,
  • treatment B is given to one thousand people with a large stone and none of them recover,
  • treatment B is given to one million people with a small stone and 2/3 of them recover.

Notice that about one-third of the treatment A patients recovered and about two-thirds of the treatment B patients recovered, and yet, treatment A is much better than treatment B.  If you have a large stone, then treatment B is pretty much guaranteed to fail (0 out of 1000) and treatment A works about 1/3 of the time. If you have a small stone, treatment A is almost guaranteed to work, while treatment B only works 2/3 of the time.

Mathematically, P( Recovery | Treatment A) $\approx$ 1/3 (i.e., about 1/3 of the patients who got treatment A recovered).

The formula for P( Recovery | do(Treatment A)) is quite different.  Here we force all patients (all 2,002,000 of them) to use treatment A.  Since about half of the patients have large stones and half have small stones, each stone size receives a weight of about 1/2.  In that case,

P( Recovery | do(Treatment A) ) $\approx$ 1/2*1/3 + 1/2*1 = 2/3.

Similarly for treatment B, P( Recovery |  Treatment B) $\approx$ 2/3 and

P( Recovery | do(Treatment B) ) $\approx$ 1/3.
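Here is a quick sanity check of these four numbers in Python, computed directly from the counts in the example above (using 333,333 and 666,667 for one-third and two-thirds of a million):

```python
counts = {  # (treatment, stone size) -> (patients, recoveries)
    ("A", "large"): (1_000_000, 333_333),
    ("A", "small"): (1_000, 1_000),
    ("B", "large"): (1_000, 0),
    ("B", "small"): (1_000_000, 666_667),
}
total = sum(n for n, _ in counts.values())          # 2,002,000 patients

def stone_share(size):
    """Fraction of all patients with this stone size (about 1/2 each)."""
    return sum(n for (_, s), (n, _) in counts.items() if s == size) / total

for t in ("A", "B"):
    n_t = sum(n for (tt, _), (n, _) in counts.items() if tt == t)
    r_t = sum(r for (tt, _), (_, r) in counts.items() if tt == t)
    conditional = r_t / n_t                          # P(Recovery | T = t)
    interventional = sum(                            # P(Recovery | do(T = t))
        (counts[(t, s)][1] / counts[(t, s)][0]) * stone_share(s)
        for s in ("large", "small")
    )
    print(f"P(R | T={t}) = {conditional:.2f}   P(R | do(T={t})) = {interventional:.2f}")
```

Running it prints 0.33 vs. 0.67 for treatment A and 0.67 vs. 0.33 for treatment B, exactly the reversal described above.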

 

This example may seem contrived, but as Nielsen said, “Keep in mind that this really happened.”

 

Edit Aug 8, 2013:  Judea Pearl has a wonderful write-up on Simpson’s paradox titled “Simpson’s Paradox: An Anatomy” (2011?).  I think equation (9) in the article has a typo on the right-hand side.  I think it should read

$$P(E \mid do(\neg C)) = P(E \mid do(\neg C), F)\, P(F) + P(E \mid do(\neg C), \neg F)\, P(\neg F).$$

“A probabilistic programming language is a high-level language that makes it easy for a developer to define probability models and then “solve” these models automatically.  These languages incorporate random events as primitives and their runtime environment handles inference. Now, it is a matter of programming that enables a clean separation between modeling and inference.”

writes Beau Cronin in this post about Probabilistic Programming (PP) (see e.g. [1] and [2]).  He goes on to informally describe a probabilistic graphical model (PGM) and how PP languages, or extensions to existing languages like BLOG, BUGS, Church (Lisp), FACTORIE, Figaro (Scala), HANSEI (OCaml), Hierarchical Bayesian Compiler, Infer.NET, ProbLog, and Stan, make it much easier to set up and solve PGMs.  He provides links to a tutorial, a great easy-to-understand video, the NIPS workshop on PP, and several ongoing PP projects.
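To make the “clean separation between modeling and inference” concrete, here is a toy version in plain Python (not any of the systems listed above): the model is an ordinary function that makes random choices, and a generic rejection-sampling routine does the inference without knowing anything about the model’s internals.

```python
import random

def model():
    """Generative model: random events as primitives."""
    theta = random.random()                        # prior: theta ~ Uniform(0, 1)
    heads = sum(random.random() < theta for _ in range(10))  # 10 coin flips
    return theta, heads

def infer(observed_heads, n_runs=200_000):
    """Generic inference: rerun the model and keep the runs whose
    simulated data match the observation (rejection sampling)."""
    kept = [theta for theta, heads in (model() for _ in range(n_runs))
            if heads == observed_heads]
    return sum(kept) / len(kept)                   # posterior mean of theta

# Saw 8 heads in 10 flips; the posterior mean should be near (8+1)/(10+2) = 0.75.
print(infer(observed_heads=8))
```

Real PP systems replace the brute-force rejection loop with far smarter inference, but the division of labor is the same: the user writes `model`, the runtime supplies `infer`.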

I also enjoyed the more detailed post “Why Probabilistic Programming Matters” by Rob Zinkov, in which Rob shows how to represent a number of standard machine learning techniques in a PP language.

 

Check out “Recent Algorithms Development and Faster Belief Propagation algorithms” by Igor Carron at the Nuit Blanche blog.

 

In “The No-U-Turn Sampler: Adaptively Setting Path Lengths in Hamiltonian Monte Carlo“, Hoffman and Gelman present an improvement to Hamiltonian Monte Carlo, itself a Markov chain Monte Carlo method.  Here’s the abstract:

Hamiltonian Monte Carlo (HMC) is a Markov chain Monte Carlo (MCMC) algorithm that avoids the random walk behavior and sensitivity to correlated parameters that plague many MCMC methods by taking a series of steps informed by first-order gradient information. These features allow it to converge to high-dimensional target distributions much more quickly than simpler methods such as random walk Metropolis or Gibbs sampling. However, HMC’s performance is highly sensitive to two user-specified parameters: a step size and a desired number of steps L. In particular, if L is too small then the algorithm exhibits undesirable random walk behavior, while if L is too large the algorithm wastes computation. We introduce the No-U-Turn Sampler (NUTS), an extension to HMC that eliminates the need to set a number of steps L. NUTS uses a recursive algorithm to build a set of likely candidate points that spans a wide swath of the target distribution, stopping automatically when it starts to double back and retrace its steps. Empirically, NUTS performs at least as efficiently as and sometimes more efficiently than a well tuned standard HMC method, without requiring user intervention or costly tuning runs. We also derive a method for adapting the step size parameter $\epsilon$ on the fly based on primal-dual averaging. NUTS can thus be used with no hand-tuning at all. NUTS is also suitable for applications such as BUGS-style automatic inference engines that require efficient “turnkey” sampling algorithms.
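For intuition about the two parameters NUTS eliminates, here is a bare-bones HMC transition in Python (plain HMC, not NUTS): `step_size` is the $\epsilon$ that the paper adapts by primal-dual averaging, and `n_steps` is the path length L that NUTS chooses automatically.

```python
import numpy as np

def hmc_step(log_p, grad_log_p, x, step_size, n_steps, rng):
    """One HMC transition: resample momentum, simulate Hamiltonian
    dynamics with the leapfrog integrator, then accept/reject."""
    p = rng.standard_normal(x.size)                 # fresh momentum
    x_new, p_new = x.copy(), p.copy()
    p_new += 0.5 * step_size * grad_log_p(x_new)    # leapfrog: half kick
    for _ in range(n_steps - 1):
        x_new += step_size * p_new                  # full drift
        p_new += step_size * grad_log_p(x_new)      # full kick
    x_new += step_size * p_new
    p_new += 0.5 * step_size * grad_log_p(x_new)    # final half kick
    # Metropolis correction using the Hamiltonian H = -log p(x) + |p|^2 / 2.
    h_old = -log_p(x) + 0.5 * p @ p
    h_new = -log_p(x_new) + 0.5 * p_new @ p_new
    return x_new if np.log(rng.random()) < h_old - h_new else x

# Sample a 2-D standard normal; too-small or too-large n_steps wastes effort.
rng = np.random.default_rng(0)
x = np.zeros(2)
for _ in range(1000):
    x = hmc_step(lambda x: -0.5 * x @ x, lambda x: -x, x, 0.1, 20, rng)
```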

In “BOA: The Bayesian Optimization Algorithm“, Pelikan, Goldberg, and Cantú-Paz introduce an adaptive improvement over genetic optimization algorithms (see also [1]).  They write, “In this paper, an algorithm based on the concepts of genetic algorithms that uses an estimation of a probability distribution of promising solutions in order to generate new candidate solutions is proposed. To estimate the distribution, techniques for modeling multivariate data by Bayesian networks are used.”  and

“The algorithm proposed in this paper is also capable of covering higher order interactions. It uses techniques from the field of modeling data by Bayesian networks in order to estimate the joint distribution of promising solutions. The class of distributions that are considered is identical to the class of conditional distributions used in the FDA. Therefore, the theory of the FDA can be used in order to demonstrate the power of the proposed algorithm to solve decomposable problems. However, unlike the FDA, our algorithm does not require any prior information about the problem.  It discovers the structure of a problem on the fly.”

where FDA refers to the Factorized Distribution Algorithm (Mühlenbein et al., 1998).  The algorithm consists of the following steps:

The Bayesian Optimization Algorithm (BOA)
(1) Set t ← 0; randomly generate the initial population P(0).
(2) Select a set of promising strings S(t) from P(t).
(3) Construct the network B using a chosen metric and constraints.
(4) Generate a set of new strings O(t) according to the joint distribution encoded by B.
(5) Create a new population P(t+1) by replacing some strings in P(t) with O(t); set t ← t + 1.
(6) If the termination criteria are not met, go to (2).

where B is a Bayesian network.
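Here is a runnable Python sketch of this loop. To keep it self-contained, step (3)’s network learning is simplified to the degenerate case of an edgeless network, i.e., independent bit marginals (closer to UMDA than to full BOA); the real algorithm also searches for network edges using a scoring metric.

```python
import numpy as np

def boa_sketch(fitness, n_bits, pop_size=200, n_promising=100, n_generations=60):
    """BOA loop with step (3) reduced to fitting an edgeless Bayesian
    network (per-bit marginals); real BOA learns edges too."""
    rng = np.random.default_rng(0)
    pop = rng.integers(0, 2, size=(pop_size, n_bits))               # (1)
    for _ in range(n_generations):
        scores = np.array([fitness(x) for x in pop])
        order = np.argsort(scores)
        promising = pop[order[-n_promising:]]                       # (2) truncation selection
        marginals = promising.mean(axis=0)                          # (3) fit the (edgeless) network
        offspring = (rng.random((n_promising, n_bits)) < marginals).astype(int)  # (4)
        pop[order[:n_promising]] = offspring                        # (5) replace the worst
    scores = np.array([fitness(x) for x in pop])
    return pop[np.argmax(scores)]                                   # best string found

print(boa_sketch(sum, n_bits=20))   # OneMax: converges to all ones
```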

Check out the NIPS 2011 workshop.

Garvesh Raskutti has some nice slides on Probabilistic Graphical Models and Markov Logic Networks (Richardson & Domingos 2006).  Markov Logic Networks encode first-order predicate logic into a Markov random field. The resulting networks can be quite large because statements like “for all x, y, and z, x is y’s parent and z is x’s parent imply z is y’s grandparent” require the existence of $2n^2$ nodes in the graph, where $n$ is the number of people considered.  Inference in the resulting networks is frequently performed with Gibbs sampling.  For even more information, Pedro Domingos has put an entire course online at

www.cs.washington.edu/homes/pedrod/803
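A quick way to see where the $2n^2$ comes from (a small illustration of grounding, not from the slides): each ground atom of the predicates Parent and Grandparent becomes a node of the ground network, while the rule itself grounds into $n^3$ clauses.

```python
from itertools import product

people = ["ann", "bob", "cara", "dan"]                       # n = 4
n = len(people)

# Nodes of the ground Markov network: one per ground atom.
parent_atoms      = [("Parent", x, y) for x, y in product(people, repeat=2)]
grandparent_atoms = [("Grandparent", z, y) for z, y in product(people, repeat=2)]
print(len(parent_atoms) + len(grandparent_atoms), "==", 2 * n**2)   # 32 nodes

# Features of the ground network: one per grounding of the rule's variables.
ground_clauses = list(product(people, repeat=3))             # (x, y, z) triples
print(len(ground_clauses), "==", n**3)                       # 64 ground clauses
```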

 

Lifted Inference uses the rules of first-order predicate logic to improve the speed of the standard Markov random field algorithms applied to Markov Logic Networks.  I wish I had been in Barcelona, Spain in July last year for IJCAI-11 because they had a cool tutorial on Lifted Inference.  Here’s a quote:

Much has been achieved in the field of AI, yet much remains to be done if we are to reach the goals we all imagine. One of the key challenges with moving ahead is closing the gap between logical and statistical AI. Recent years have seen an explosion of successes in combining probability and (subsets of) first-order logic respectively programming languages and databases in several subfields of AI: Reasoning, Learning, Knowledge Representation, Planning, Databases, NLP, Robotics, Vision, etc. Nowadays, we can learn probabilistic relational models automatically from millions of inter-related objects. We can generate optimal plans and learn to act optimally in uncertain environments involving millions of objects and relations among them. Exploiting shared factors can speed up message-passing algorithms for relational inference but also for classical propositional inference such as solving SAT problems. We can even perform exact lifted probabilistic inference avoiding explicit state enumeration by manipulating first-order state representations directly.

In the related paper “Lifted Inference Seen from the Other Side: The Tractable Features“, Jha, Gogate, Meliou, and Suciu (2010) reverse this notion.  Here’s the abstract:

Lifted Inference algorithms for representations that combine first-order logic and graphical models have been the focus of much recent research. All lifted algorithms developed to date are based on the same underlying idea: take a standard probabilistic inference algorithm (e.g., variable elimination, belief propagation etc.) and improve its efficiency by exploiting repeated structure in the first-order model. In this paper, we propose an approach from the other side in that we use techniques from logic for probabilistic inference. In particular, we define a set of rules that look only at the logical representation to identify models for which exact efficient inference is possible. Our rules yield new tractable classes that could not be solved efficiently by any of the existing techniques.
