Artificial Intelligence Blog

“Algorithm Portfolio Design: Theory vs. Practice”

March 20, 2013 in Optimization by hundalhh | Permalink

In “Algorithm Portfolio Design: Theory vs. Practice“, Gomes and Selman (2013) study the use of a portfolio of stochastic search algorithm to solve computationally hard search problems. Here are some interesting quotes from the paper:

“Our studies reveal that in many cases the performance of a single algorithm dominates all others, on the problem class under consideration.”

“Given the diversity in performance profiles among algorithms, various approaches have been developed to combine different algorithms to take into account the computational resource constraints and to optimize the overall performance. These considerations led to the development of anytime algorithms (Dean and Boddy 1988), decision theoretic metareasoning and related approaches (Horvitz and Zilberstein 1996; Russell and Norvig 1995), and algorithm portfolio design (Huberman et al. 1997).”

“In addition, we also show that a good strategy for designing a portfolio is to combine many short runs of the same algorithm. The effectiveness of such portfolios explains the common practice of “restarts” for stochastic procedures, where the same algorithm is run repeatedly with different initial seeds for the random number generator. (For related work on the effectiveness of restarts, see e.g., Aldous and Vazirani 1994; Ertel 1991; Selman and Kirkpatrick 1996.)”

“Quantum information and the Brain”

March 18, 2013 in General ML, Neural Nets by hundalhh | Permalink

Scott Aaronson gave a wonderful talk at NIPS 2012 called “Quantum information and the Brain“. I really enjoyed his very concise intuitive description of quantum mechanics and his sceptical yet fair attitude towards the ideas expressed by Penrose in “The Emperor’s New Mind”.

Turn your browser into a notepad

March 16, 2013 in Uncategorized by hundalhh | Permalink

This is just too cute. Type

data:text/html, <html contenteditable>

into the URL tab and then type in the blank page.

Courtesy of Jose Jesus Perez Aguinaga https://coderwall.com/p/lhsrcq !

“Trustworthy Online Controlled Experiments: Five Puzzling Outcomes Explained”

March 14, 2013 in Multi-Armed Bandit Problem, Optimization, Statistics by hundalhh | Permalink

“…we often joke that our job, as the team that builds the experimentation platform, is to tell our clients that their new baby is ugly, …”

Andrew Gelman at Statistical Modeling, Causal Inference, and Social Science pointed me towards the paper “Trustworthy Online Controlled Experiments: Five Puzzling Outcomes Explained” by Ron Kohavi, Alex Deng, Brian Frasca, Roger Longbotham, Toby Walker, and Ya Xu all of whom seem to be affiliated with Microsoft.

The paper itself recounted five online statistical experiments mostly done at Microsoft that had informative counter-intuitive results:

Overall Evaluation Criteria for Bing
Click Tracking
Initial Effects
Experiment Length
Carry Over Effects.

The main lessons learned were:

Be careful what you wish for. – Short term effects may be diametrically opposed to long-term effects. Specifically, a high number clicks or queries per session could be indicative of a bug rather than success. It’s important to choose the right metric. The authors ended up focusing on “sessions per user” as a metric as opposed to “queries per month” partly due to a bug which increased (in the short-term) queries and revenues while degrading the user’s experience.
Initial results are strongly affected by “Primacy and Novelty”. – In the beginning, experienced users may click on a new option just because it is new, not because it’s good. On the other hand, experienced users may be initially slowed by a new format even if the new format is “better”.
If reality is constantly changing, the experiment length may not improve the accuracy of the experiment. The underlying behavior of the users may change every month. A short-term experiment may only capture a short-term behavior. Rather than running the experiment for years, the best option may be to run several short-term experiments and adapt the website to the changing behavior as soon as the new behavior is observed.
If the same user is presented with the same experiment repeatedly, her reaction to the experiment is a function of the number of times she has been exposed to the experiment. This effect must be considered when interpreting experimental results.
The Poisson Distribution should not be used to model clicks. They preferred Negative Binomial.

The paper is easy to read, well written, and rather informative. It is especially good for web analytics and for anyone new to experimental statistics. I found the references below to be especially interesting:

“Uncontrolled: The Surprising Payoff of Trial-and-Error for Business, Politics, and Society” by Manzi (book)
“Web Analytics: An Hour per Day” by Kaushik (book)
“Controlled experiments on the web: survey and practical guide” by Kohavi, Longbotham, Sommerﬁeld, and Henne (2009)
“Seven Pitfalls to Avoid when Running Controlled Experiments on the Web” by Crook, Frasca, Kohavi, Longbotham (2009)

Notes on “A Few Useful Things to Know about Machine Learning”

March 12, 2013 in General ML by hundalhh | Permalink

Pedro Domingos wrote this wonderful introductory overview of machine learning which teaches twelve short lessons about classifiers:

LEARNING = REPRESENTATION + EVALUATION + OPTIMIZATION
IT’S GENERALIZATION THAT COUNTS
DATA ALONE IS NOT ENOUGH
OVERFITTING HAS MANY FACES
INTUITION FAILS IN HIGH DIMENSIONS
THEORETICAL GUARANTEES ARE NOT WHAT THEY SEEM
FEATURE ENGINEERING IS THE KEY
MORE DATA BEATS A CLEVERER ALGORITHM
LEARN MANY MODELS, NOT JUST ONE
SIMPLICITY DOES NOT IMPLY ACCURACY
REPRESENTABLE DOES NOT IMPLY LEARNABLE
CORRELATION DOES NOT IMPLY CAUSATION

The lessons are augmented with excellent examples, so I highly recommend reading the simply written, short article.

I put together some notes and quotes, on and from these lessons which I will share below.

LEARNING = REPRESENTATION + EVALUATION + OPTIMIZATION – Creating a classifiers involves three key ideas: representation, an objective function, and optimization. For example, for $k$-means classification, the domain is divided up by a Voronoi diagram and each region has a different label or possibly a mixture of labels. This is the representation. The objective function does not involve the labels—only the input data. The objective function is the sum of the squared distances from input data to the “center” of each region to which the datum belongs. The optimization technique is greedy alternation between optimizing the center of each region and the assignment of data to regions.

IT’S GENERALIZATION THAT COUNTS – “The fundamental goal of machine learning is to generalize beyond the examples in the training set.” Merely learning the training set is both trivial and insufficient. Often doing well on the training set is a bad indicator of performance outside the training set. Generalization is the only way to label inputs that are not in the training set. Unfortunately, Wolpert’s “no free lunch” theorems demonstrate that no generalization method can always beat random guessing for learning problems without structure (i.e. Martin-Löf random classification problems do not admit generalization.)

OVERFITTING HAS MANY FACES – If we do not have enough data or time to determine the correct representation or the correct number of parameters, then we introduce either bias errors when we use the wrong representation or overfitting errors when we have too many parameters (i.e. Bias-Variance Trade-off). The fact that trying too many hypotheses has the same dangers as using too many parameters implies “contrary to intuition, a more powerful learner is not necessarily better than a less powerful one.” and “strong false assumptions can be better than weak true ones, because a learner with the latter needs more data to avoid overﬁtting.” Cross validation, statistical significance tests, minimizing the false discovery rate, and regularization terms prevent most overfitting. Overfitting can even occur when all the data is correctly labelled.

INTUITION FAILS IN HIGH DIMENSIONS – Most algorithms have more trouble in higher dimensional spaces and our intuition fails in higher dimensional spaces. Nearest Neighbor for simple classification problems tends to fail in more than 100 dimensions. Our intuition built on experience with 3 dimensional spaces fails to understand facts like a hypersphere with radius 0.95 has less than 1% of the volume of a hypersphere of radius 1 in a 200 dimensional space. So for a Gaussian in such a space, the majority of the probability is not near the mean, instead it lies in a “shell” about the mean. The volume of a hypersphere is much less than the smallest hypercube that contains it. These counter-intuitive facts reek havoc for several machine learning algorithms. It is even hard to fix or adjust the algorithm because we often can’t visualize the problem in high dimensional spaces. So dimensionality reduction via manifold learning, PCA, LDA, ICA, and feature selection becomes very important.

THEORETICAL GUARANTEES ARE NOT WHAT THEY SEEM – Domingos repeats the well known argument that if bad classifiers are wrong with probability greater than $\epsilon$ and the number of hypothetical classifiers is $H$, then $n = {1\over\epsilon}(\ln(H) + \ln({1\over\delta}))$ randomly chosen test examples are sufficient to eliminate all of the bad classifiers with probability greater than $1- \delta$. This works great as long at $H$ is not large, but often $H$ is large. The number of possible classifiers for $k$ binary features is $2^{2^k}$, so the bound above is sometimes not useful. The number of hypotheses can also be effectively infinite. Similarly, if the number of bad classifiers is large compared to the number of good classifiers, then $\delta$ needs to be very small. Other performance bounds are true only in the limit (e.g. infinite data) or contain unknown constants, so they merely give the user hope rather than a guarantee that the resulting classifier is correct.

FEATURE ENGINEERING IS THE KEY – Machine learning is often an iterative process of generating features, choosing algorithms, and analyzing the results. Creating useful features is often difficult and can require domain knowledge or luck, so “there is ultimately no replacement for the smarts you put into feature engineering.”

LEARN MANY MODELS, NOT JUST ONE – Use ensemble learning including bagging, boosting, bootstrap, stacking, Bayesian optimization, mixture of experts, and plain old averaging/voting. These ensembles of other well-known methods work and win contests like the Netflix prize. Bayesian model averaging (BMA) is theoretically optimal, but it tend to heavily weight one of the learners, so it should not be considered ensemble learning.

MORE DATA BEATS A CLEVERER ALGORITHM – If you are only going to read one section, read this section. The title itself is self-explanatory, but Domingos adds several interesting insights in this section like,

“…even though in principle more data means that more complex classiﬁers can be learned, in practice simpler classiﬁers wind up being used, because complex ones take too long to learn.”

“As a rule, it pays to try the simplest learners ﬁrst (e.g., naive Bayes before logistic regression, k-nearest neighbor before support vector machines). More sophisticated learners are seductive, but they are usually harder to use, because they have more knobs you need to turn to get good results, and because their internals are more opaque.”,

“Part of the reason using cleverer algorithms has a smaller payoﬀ than you might expect is that, to a ﬁrst approximation, they all do the same. This is surprising when you consider representations as different as, say, sets of rules and neural networks. But in fact propositional rules are readily encoded as neural networks, and similar relationships hold between other representations. All learners essentially work by grouping nearby examples into the same class; the key difference is in the meaning of “nearby.” “,

“In the end, the biggest bottleneck is not data or CPU cycles, but human cycles. In research papers, learners are typically compared on measures of accuracy and computational cost. But human eﬀort saved and insight gained, although harder to measure, are often more important. This favors learners that produce human-understandable output (e.g., rule sets). And the organizations that make the most of machine learning are those that have in place an infrastructure that makes experimenting with many different learners, data sources and learning problems easy and eﬃcient, and where there is a close collaboration between machine learning experts and application domain ones.”,

and

“Learners can be divided into two major types: those whose representation has a ﬁxed size, like linear classiﬁers, and those whose representation can grow with the data, like decision trees.”

He says the variable size learners can, in theory, learn any class, but they often fall into local optima, require too much time, or fall to the curse of dimensionality, so these methods require a lot of experimentation and thinking on the part of the analyst.

Lastly, Generative models are more fun.

“Machine Learning Cheat Sheet (for scikit-learn)”

March 10, 2013 in General ML by hundalhh | Permalink

I love this diagram created by Peekaboo: Andy’s Computer Vision and Machine Learning Blog.

“An Empirical Evaluation of Thompson Sampling”

March 8, 2013 in Multi-Armed Bandit Problem by hundalhh | Permalink

In “An Empirical Evaluation of Thompson Sampling”, Chapelle and Lihong Li (NIPS 2011) compare Upper Confidence Bound (UCB), Gittins, and Thompson methods for the multi-armed bandit problem. More theoretical analysis has been done for UCB and Gittins, but Thompson sampling is simple (as opposed to Gittins) and seems to work well, sometimes performing significantly better that the other common algorithms. Both synthetic and real world (advertising) results are presented.

“Exact Sparse Recovery with L0 Projections”

March 6, 2013 in Sparsity by hundalhh | Permalink

Thanks to Nuit Blanche for pointing me towards the paper “Exact Sparse Recovery with L0 Projections” by Ping Li and Cun-Hui Zhang (2013). The paper describes an algorithm (Min+Gap(3)), which perfectly recovers $x\in R^N$ from $M_0$ random measurements $y_i\in R$. The vector $y$ is generated by

$y = S x$

where $S\in R^{N\times M},\ ||x||_0 = K,\ K\ll N,\ s_{ij} \sim S(\alpha, 1)$, and $S(\alpha, 1)$ denotes an $\alpha$-stable distribution with unit scale. Typically the number of measurements required for exact recovery is “about $5 K$”. More specifically,

“$M_0 = K \log ((N − K)/\delta)$ measurements are sufﬁcient for detecting all zeros with at least probability $1 − \delta$.”

Li and Zhang illustrate the effectiveness of their algorithm by applying it to a simple video stream exactly recovering the video from measurements constructed from the differences between frames. They also discuss an application to finding “heavy hitters” (“elephant detection”) in marketing data (see e.g “Compressive sensing medium access control for wireless lans” Globecom, 2012 and “A data streaming algorithm for estimating entropies of od ﬂows” IMC, San Diego, CA, 2007.).

The idea for the algorithm seems to have stemmed from Piotr Indyk’s paper “Stable distributions, pseudorandom generators, embeddings, and data stream computation.” Li and Zang state that Indyk estimated “the $\alpha$-th frequency moment $\sum_{i=1}^N |x_i|^\alpha$” using random measurements generated from an $\alpha$-stable distribution instead of the harder problem of recovering $x$ exactly.

In numerical simulations, they compare their algorithm to orthogonal matching pursuit and linear programming. Additionally, detailed proofs of the their algorithm’s effectiveness are provided.

“Linear Bandits in High Dimension and Recommendation Systems”

March 4, 2013 in Multi-Armed Bandit Problem, Optimization by hundalhh | Permalink

Thanks to Nuit Blanche for pointing me towards the presentation by Andrea Montanari “Collaborative Filtering: Models and Algorithms” and the associated Deshpande and Montanari paper “Linear Bandits in High Dimension and Recommendation Systems” (2012). In the presentation, Montanari reviews Spectral, Gradient Descent, Stochasitc Gradient Descent, Convex Relaxation, and Linear Bandit methods for approximating the standard linear model for recommendation systems and some accuracy guarantees.

Assuming the $j$th movie has features $v_{j1}, v_{j2}, \ldots, v_{jr}$, then the $i$th viewer gives the rating

$R_{ij} = \langle u_i, v_j \rangle +\epsilon_{ij}$

where $u_i$ is an $r$ dimensional vector representing the preferences of $i$th viewer and $\epsilon_{ij}$ is Gaussian noise.

The paper introduces a new Linear Bandit method, Smooth Explore, better suited for recommendation systems. Their method is motivated by the three objectives:

Constant-optimal cumulative reward,
Constant-optimal regret, and
Approximate monotonicity (rewards approximately increase with time).

Smooth Explore estimates the user preferences vectors with a regularized least squares regression. Proofs of optimality and numerical results are provided.

Lunokhod 2 Moon Rover

March 2, 2013 in Robots by hundalhh | Permalink

This Russian rover traveled over 20 miles on the moon in 1973!

« Older entries § Newer entries »