Dr Xu at Penn State introduced me to Richardson Extrapolation way back in the early 90’s. We used it to increase the speed of convergence of finite element analysis algorithms for partial differential equations, but it can be used for many numerical methods.

You are currently browsing the archive for the **Optimization** category.

I was doing a little bit of research on Mulit-stage Markov Chains for a friend when I ran across this nice set of slides on Multi-stage Markov Chains for diffusion through porous media (i.e. oil or water flowing through the ground) by Pereira and Rahunanthan.

The authors first review the multi-scale equations for diffusion under pressure (see Darcy’s law) and mixed finite element analysis. Then, interestingly they introduce a Bayesian Markov chain Monte Carlo method (MCMC) to get approximate solutions to the differential equation. They use Multi-Stage Hastings-Metropolis and prefetching (to parallelize the computation) to improve the speed of MCMC. The slides also display the results of their algorithm applied to aquifer contamination and oil recovery.

It’s really cool how methods used for probabilistic graphical models can be used to approximate the solutions to partial differential equations.

I’m hoping to get back to 2048 in a few weeks. Turns out that it takes a long time to write code, make the code look nice, run the code, analyze the results, and then put together a blog post. It’s much easier and quicker to read papers and summarize them or to try to explain things you already know.

Have a great weekend. – Hein

Nuit Blanche has a nice summary of the FOCS 2012 53rd Annual IEEE Symposium on Foundations of Computer Science Workshop on “Randomized Numerical Linear Algebra (RandNLA): Theory and Practice“. Faster random algorithms for QR, SVD, Eigenvalues, Least Squares, … using random projections and other techniques were discussed.

John Regehr writes this post about a compiler speed optimization technique called “Stochastic Superopimization“. “Stochastic Superopimization” systematically searches for algorithmic improvement in code using machine learning algorithms similar to multi-armed bandit strategies. It appears to be related to “Programming by Optimization“. “Stochastic Superopimization” is more like a very good optimization flag on a compiler. “Programming by Optimization” is constructing the program in such a fashion that design options are exposed and easily manipulable by an optimization program trying to maximize some performance metric. The “Programming by Optimization” community seems to mostly use BOA, the Bayesian optimization algorithm (see [1], [2]). I am hoping to read and write more about both of these ideas later.

In “Algorithm Portfolio Design: Theory vs. Practice“, Gomes and Selman (2013) study the use of a portfolio of stochastic search algorithm to solve computationally hard search problems. Here are some interesting quotes from the paper:

“Our studies reveal that in many cases the performance of a single algorithm dominates all others, on the problem class under consideration.”

“Given the diversity in performance profiles among algorithms, various approaches have been developed to combine different algorithms to take into account the computational resource constraints and to optimize the overall performance. These considerations led to the development of anytime algorithms (Dean and Boddy 1988), decision theoretic metareasoning and related approaches (Horvitz and Zilberstein 1996; Russell and Norvig 1995), and algorithm portfolio design (Huberman et al. 1997).”

“In addition, we also show that a good strategy for designing a portfolio is to combine many short runs of the same algorithm. The effectiveness of such portfolios explains the common practice of “restarts” for stochastic procedures, where the same algorithm is run repeatedly with different initial seeds for the random number generator. (For related work on the effectiveness of restarts, see e.g., Aldous and Vazirani 1994; Ertel 1991; Selman and Kirkpatrick 1996.)”

“…we often joke that our job, as the team that builds the experimentation platform, is to tell our clients that their new baby is ugly, …”

Andrew Gelman at Statistical Modeling, Causal Inference, and Social Science pointed me towards the paper “Trustworthy Online Controlled Experiments: Five Puzzling Outcomes Explained” by Ron Kohavi, Alex Deng, Brian Frasca, Roger Longbotham, Toby Walker, and Ya Xu all of whom seem to be affiliated with Microsoft.

The paper itself recounted five online statistical experiments mostly done at Microsoft that had informative counter-intuitive results:

- Overall Evaluation Criteria for Bing
- Click Tracking
- Initial Effects
- Experiment Length
- Carry Over Effects.

The main lessons learned were:

*Be careful what you wish for*. – Short term effects may be diametrically opposed to long-term effects. Specifically, a high number clicks or queries per session could be indicative of a bug rather than success. It’s important to choose the right metric. The authors ended up focusing on “sessions per user” as a metric as opposed to “queries per month” partly due to a bug which increased (in the short-term) queries and revenues while degrading the user’s experience.*Initial results are strongly affected by “Primacy and Novelty”*. – In the beginning, experienced users may click on a new option just because it is new, not because it’s good. On the other hand, experienced users may be initially slowed by a new format even if the new format is “better”.*If reality is constantly changing, the experiment length may not improve the accuracy of the experiment*. The underlying behavior of the users may change every month. A short-term experiment may only capture a short-term behavior. Rather than running the experiment for years, the best option may be to run several short-term experiments and adapt the website to the changing behavior as soon as the new behavior is observed.*If the same user is presented with the same experiment repeatedly, her reaction to the experiment is a function of the number of times she has been exposed to the experiment*. This effect must be considered when interpreting experimental results.*The Poisson Distribution should not be used to model clicks*. They preferred Negative Binomial.

The paper is easy to read, well written, and rather informative. It is especially good for web analytics and for anyone new to experimental statistics. I found the references below to be especially interesting:

- “Uncontrolled: The Surprising Payoff of Trial-and-Error for Business, Politics, and Society” by Manzi (book)
- “Web Analytics: An Hour per Day” by Kaushik (book)
- “Controlled experiments on the web: survey and practical guide” by Kohavi, Longbotham, Sommerﬁeld, and Henne (2009)
- “Seven Pitfalls to Avoid when Running Controlled Experiments on the Web” by Crook, Frasca, Kohavi, Longbotham (2009)

Thanks to Nuit Blanche for pointing me towards the presentation by Andrea Montanari “Collaborative Filtering: Models and Algorithms” and the associated Deshpande and Montanari paper “Linear Bandits in High Dimension and Recommendation Systems” (2012). In the presentation, Montanari reviews Spectral, Gradient Descent, Stochasitc Gradient Descent, Convex Relaxation, and Linear Bandit methods for approximating the standard linear model for recommendation systems and some accuracy guarantees.

Assuming the $j$th movie has features $v_{j1}, v_{j2}, \ldots, v_{jr}$, then the $i$th viewer gives the rating

$R_{ij} = \langle u_i, v_j \rangle +\epsilon_{ij}$

where $u_i$ is an $r$ dimensional vector representing the preferences of $i$th viewer and $\epsilon_{ij}$ is Gaussian noise.

The paper introduces a new Linear Bandit method, Smooth Explore, better suited for recommendation systems. Their method is motivated by the three objectives:

- Constant-optimal cumulative reward,
- Constant-optimal regret, and
- Approximate monotonicity (rewards approximately increase with time).

Smooth Explore estimates the user preferences vectors with a regularized least squares regression. Proofs of optimality and numerical results are provided.

In “The No-U-Turn Sampler: Adaptively Setting Path Lengths in Hamiltonian Monte Carlo“, Homan and Gelman present an improvement of the Markov Chain Monte Carlo and the Hamiltonian Monte Carlo methods. Here’s the abstract:

Hamiltonian Monte Carlo (HMC) is a Markov chain Monte Carlo (MCMC) algorithm that avoids the random walk behavior and sensitivity to correlated parameters that plague many MCMC methods by taking a series of steps informed by first-order gradient information. These features allow it to converge to high-dimensional target distributions much more quickly than simpler methods such as random walk Metropolis or Gibbs sampling. However, HMC’s performance is highly sensitive to two user-specified parameters: a step size and a desired number of steps L. In particular, if L is too small then the algorithm exhibits undesirable random walk behavior, while if L is too large the algorithm wastes computation. We introduce the No-U-Turn Sampler (NUTS), an extension to HMC that eliminates the need to set a number of steps L. NUTS uses a recursive algorithm to build a set of likely candidate points that spans a wide swath of the target distribution, stopping automatically when it starts to double back and retrace its steps. Empirically, NUTS perform at least as efficiently as and sometimes more efficiently than a well tuned standard HMC method, without requiring user intervention or costly tuning runs. We also derive a method for adapting the step size parameter $\epsilon$ on the fly based on primal-dual averaging. NUTS can thus be used with no hand-tuning at all. NUTS is also suitable for applications such as BUGS-style automatic inference engines that require efficient “turnkey” sampling algorithms.

In “BOA: The Bayesian Optimization Algorithm“, Pelikan, Goldberg, and Cant´u-Paz introduce an adaptive improvement over genetic optimization algorithms (See also [1]). They write, “In this paper, an algorithm based on the concepts of genetic algorithms that uses an estimation of a probability distribution of promising solutions in order to generate new candidate solutions is proposed. To estimate the distribution, techniques for modeling multivariate data by Bayesian networks are used.” and

“The algorithm proposed in this paper is also capable of covering higher order interactions. It uses techniques from the ﬁeld of modeling data by Bayesian networks in order to estimate the joint distribution of promising solutions. The class of distributions that are considered is identical to the class of conditional distributions used in the FDA. Therefore, the theory of the FDA can be used in order to demonstrate the power of the proposed algorithm to solve decomposable problems. However, unlike the FDA, our algorithm does not require any prior information about the problem. It discovers the structure of a problem on the ﬂy.”

where FDA refers to the Factorized Distribution Algorithm (Mühlenbein et al., 1998). The algorithm consists of the following steps

The Bayesian Optimization Algorithm (BOA)

(1) set t ← 0 randomly generate initial population P(0)

(2) select a set of promising strings S(t) from P(t)

(3) construct the network B using a chosen metric and constraints

(4) generate a set of new strings O(t) according to the joint distribution encoded by B

(5) create a new population P(t+1) by replacing some strings from P(t) with O(t) set t ← t + 1

(6) if the termination criteria are not met, go to (2)

where B is a Bayesian network.

Check out the NIPS 2011 workshop.

At NIPS yesterday, James Spall gave a nice overview of stochastic optimization. Stochastic optimization is the process of finding the minimum of a function $f(x)$ where are measurements or samples of the function are noisy. He stressed that the no free lunch theorems (Wolpert & Macready 1995) limit the efficiency of any global minimization problem if there are no restrictions on $f$. He described in detail the Simultaneous Perturbation Stochastic Approximation (SPSA) method which appears to be a great method for optimizing with noisy measurements. The basic is idea is that you don’t need to approximate the gradient by making $p+1$ measurements in a $p$ dimensional domain. Instead, you sample $f$ at two nearby randomly generated points and make a nearly unbiased estimate of the gradient from those two measurements. There is also way to form estimates of the Hessian with just 4 samples which leads to a stochastic algorithm similar to Newton-Raphson. None of these methods can converge faster than $O(1/\sqrt{n})$ due to the noise, but they may be very useful as robust semi-global optimizers for functions with lots of local minima or high frequencies.