# Optimization

You are currently browsing the archive for the Optimization category.

## “The No-U-Turn Sampler: Adaptively Setting Path Lengths in Hamiltonian Monte Carlo”

In “The No-U-Turn Sampler: Adaptively Setting Path Lengths in Hamiltonian Monte Carlo“, Ho man and Gelman present an improvement of the Markov Chain Monte Carlo and the Hamiltonian Monte Carlo methods.  Here’s the abstract:

Hamiltonian Monte Carlo (HMC) is a Markov chain Monte Carlo (MCMC) algorithm that avoids the random walk behavior and sensitivity to correlated parameters that plague many MCMC methods by taking a series of steps informed by first-order gradient information. These features allow it to converge to high-dimensional target distributions much more quickly than simpler methods such as random walk Metropolis or Gibbs sampling. However, HMC’s performance is highly sensitive to two user-specified parameters: a step size and a desired number of steps L. In particular, if L is too small then the algorithm exhibits undesirable random walk behavior, while if L is too large the algorithm wastes computation. We introduce the No-U-Turn Sampler (NUTS), an extension to HMC that eliminates the need to set a number of steps L. NUTS uses a recursive algorithm to build a set of likely candidate points that spans a wide swath of the target distribution, stopping automatically when it starts to double back and retrace its steps. Empirically, NUTS perform at least as efficiently as and sometimes more efficiently than a well tuned standard HMC method, without requiring user intervention or costly tuning runs. We also derive a method for adapting the step size parameter $\epsilon$ on the fly based on primal-dual averaging. NUTS can thus be used with no hand-tuning at all. NUTS is also suitable for applications such as BUGS-style automatic inference engines that require efficient “turnkey” sampling algorithms.

## “BOA: The Bayesian Optimization Algorithm”

In “BOA: The Bayesian Optimization Algorithm“, Pelikan, Goldberg, and Cant´u-Paz introduce an adaptive improvement over genetic optimization algorithms (See also [1]).  They write, “In this paper, an algorithm based on the concepts of genetic algorithms that uses an estimation of a probability distribution of promising solutions in order to generate new candidate solutions is proposed. To estimate the distribution, techniques for modeling multivariate data by Bayesian networks are used.”  and

“The algorithm proposed in this paper is also capable of covering higher order interactions. It uses techniques from the ﬁeld of modeling data by Bayesian networks in order to estimate the joint distribution of promising solutions. The class of distributions that are considered is identical to the class of conditional distributions used in the FDA. Therefore, the theory of the FDA can be used in order to demonstrate the power of the proposed algorithm to solve decomposable problems. However, unlike the FDA, our algorithm does not require any prior information about the problem.  It discovers the structure of a problem on the ﬂy.”

where FDA refers to the Factorized Distribution Algorithm (Mühlenbein et al., 1998).  The algorithm consists of the following steps

The Bayesian Optimization Algorithm (BOA)
(1) set t ← 0 randomly generate initial population P(0)
(2) select a set of promising strings S(t) from P(t)
(3) construct the network B using a chosen metric and constraints
(4) generate a set of new strings O(t) according to the joint distribution encoded by B
(5) create a new population P(t+1) by replacing some strings from P(t) with O(t)  set t ← t + 1
(6) if the termination criteria are not met, go to (2)

where B is a Bayesian network.

Check out the NIPS 2011 workshop.

## Stochastic Optimization Talk at NIPS

At NIPS yesterday, James Spall gave a nice overview of stochastic optimization. Stochastic optimization is the process of finding the minimum of a function $f(x)$ where are measurements or samples of the function are noisy.  He stressed that the no free lunch theorems (Wolpert & Macready 1995) limit the efficiency of any global minimization problem if there are no restrictions on $f$.  He described in detail the Simultaneous Perturbation Stochastic Approximation (SPSA) method which appears to be a great method for optimizing with noisy measurements. The basic is idea is that you don’t need to approximate the gradient by making $p+1$ measurements in a $p$ dimensional domain.  Instead, you sample $f$ at two nearby randomly generated points and make a nearly unbiased estimate of the gradient from those two measurements.   There is also way to form estimates of the Hessian with just 4 samples which leads to a stochastic algorithm similar to Newton-Raphson.  None of these methods can converge faster than $O(1/\sqrt{n})$ due to the noise, but they may be very useful as robust semi-global optimizers for functions with lots of local minima or high frequencies.

## Interior Point Methods for Large Scale SVMs

Jacek Gondzio has some nice slides (2009) on interior point methods for large scale support vector machines.  He focuses on the primal dual logarithmic barrier methods (see e.g. Wright 1987) for softer classification.  Great explanations, diagrams, and numerical results are provided.  Kristian Woodsend wrote his 2009 Ph.D. thesis on the same subject.  Woodsend applies the interior point methods and low rank approximations of the SVM kernel to reduce the computational cost to order $n$ where $n$ is the number of data points.  He compares this approach to active set methods, gradient projection algorithms, and cutting-plane algorithms and concludes with numerical results.

## Matlab code and a Tutorial on DIRECT Optimization

Yves Brise created this nice set of slides describing the DIRECT algorithm for Lipschitz functions.  Tim Kelly of Drexel university provides Matlab code here.

## “Optimistic Optimization of a Deterministic Function without the Knowledge of its Smoothness”

In “Optimistic Optimization of a Deterministic Function without the Knowledge of its Smoothness“, Remi Munos (2011) develops an optimization algorithm similar to hierarchical optimistic optimization, Lipschitz optimization, the Zooming algorithm (Klienberg, Slivkins, & Upfal 2008), branch and bound, and the upper confidence bound algorithm.  The algorithm does not minimize regret, rather it attempts to maximize $\max_n f(x_n)$ over all the samples $x_n$.  Munos’s first algorithm, Deterministic Optimistic Optimization, requires that the function be smooth with respect to a known semi-metric.  His second algorithm, Simultaneous Optimistic Optimization, does not require knowledge of the smoothness semi-metric.  He proves performance bounds, gives examples for both algorithms, and finishes the paper by comparing the algorithm to the well known DIRECT (DIviding RECTangles) algorithm (Jones, Perttunen, Stuckman 1993).

## “A Tutorial on the Cross-Entropy Method”

In “A Tutorial on the Cross-Entropy Method” De Boer, Kroese, Mannor, and Rubinstein (2005) write

“The CE method was motivated by an adaptive algorithm for estimating probabilities of rare events in complex stochastic networks (Rubinstein, 1997), which involves variance minimization. It was soon realized (Rubinstein, 1999, 2001) that a simple cross-entropy modiﬁcation of Rubinstein (1997) could be used not only for estimating probabilities of rare events but for solving difﬁcult COPs as well. This is done by translating the “deterministic” optimization problem into a related “stochastic” optimization problem and then using rare event simulation techniques similar to Rubinstein (1997). Several recent applications demonstrate the power of the CE method (Rubinstein, 1999) as a generic and practical tool for solving NP-hard problems.”

and then go on to compare cross-entropy with simulated annealing, tabu search , genetic algorithms, the nested partitioning method, ant colony optimization, rare event estimation algorithms, importance sampling, and algorithms for continuous multi-extremal optimization.

Several applications are listed in the introduction.

“An increasing number of applications is being found for the CE method. Recent publications on applications of the CE method include: buffer allocation (Alon et al., 2005); static simulation models (Homem-de-Mello and Rubinstein, 2002); queueing models of telecommunication systems (de Boer, 2000; de Boer, Kroese, and Rubinstein, 2004); neural computation (Dubin, 2002, 2004); control and navigation (Helvik and Wittner, 2001); DNA sequence alignment (Keith and Kroese, 2002); scheduling (Margolin, 2002, 2004); vehicle routing (Chepuri and Homem-de-Mello, 2005); reinforcement learning (Mannor, Rubinstein, and Gat, 2003; Menache, Mannor, and Shimkin, 2005); project management (Cohen, Golany, and Shtub, 2005); heavy-tail distributions (Asmussen, Kroese, and Rubinstein, 2005); (Kroese and Rubinstein, 2004); CE convergence (Margolin, 2005); network reliability (Hui et al., 2005); repairable systems (Ridder, 2005); and maxcut and bipartition problems (Rubinstein, 2002).”

The rest of the paper describes the basic CE algorithm, its convergence properties, specific applications, and variations.

## Tetris, Reinforcement Learning, The Cross-Entropy method

Tetris is hard, but it is good for your brain. So machine learners have written a number of papers about the application of reinforcement learning to Tetris. Donald Carr (2005) reviews a number of algorithms for Tetris including temporal difference learning, genetic algorithms (Llima, 2005), hand-tuned algorithms specific to Tetris (Dellacherie, Fahey).  Szita and Lörincz (2006) in “Learning Tetris Using the Noisy Cross-Entropy Method” use noise to prevent early suboptimal convergence of the fascinating cross entropy method (similar to particle swarm optimization).  Much like chess, the value of the current state of a Tetris game is estimated with a linear combination of 22 features (for more details, check out the seminal Tetris paper “Feature-Based Methods for Large Scale Dynamic Programming” by Bertsekas and Tsitsiklis 1996.) Their noisy CE method produced solutions almost as good at the best contemporaneous hand-tuned Tetris algorithms.

$$f(x) = \sum_{m=0}^M \beta_m h(x, a_m)$$
where the $\beta_m$ are real and $a_m$ are a vectors of trained parameters. At each iteration, gradient boost adds on a new function $h$ using steepest descent.