Ensemble Learning

You are currently browsing the archive for the Ensemble Learning category.

In the seminal paper “Stacked generalization“, David H. Wolpert generalizes the idea of cross-validation.

Suppose you had a data set $(x_i, y_i)$ where $i=1, \ldots, n$, $x_i \in A$, and $y_i \in R$. For cross validation, you might partition the data set into three subsets, train $k$ classifiers on two of the three subsets, and test the classifiers on the held out data. Then we might select one of the $k$ classifiers by choosing the classifier which did the best on the held out data.

Wolpert generalizes this idea by forming an $n$ by $k$ matrix of predictions from the $k$ classifiers.  The $i$th row and the $j$th column would contain the prediction of the $j$th classifier on  $x_i$ trained on partitions of the data set that do not include $x_i$.  Then Wolpert would train a new classifier using the $i$th row of the matrix as input and trying to match $y_i$ as output.   The new classifier $f$ would map $R^k$ into $R$.  Let $G_j$ be the result of training the $j$th classifier on all of the data.  Then the Wolpert generalized classifier would have the form

$$h(x) = f( G_1(x), G_2(x), \ldots, G_k(x) ).$$

Wolpert actually describes an even more general scheme which could have a large number of layers, much like deep belief networks and auto encoders. The idea of leaving part of the data out is similar to denoising or dropout.

In the paper “Feature-Weighted Linear Stacking“, Sill, Takacs, Mackey, and Lin describe a faster version of stacking used extensively by the second place team in the Neflix Prize contest.


In the widely cited paper “Rapid object detection using a boosted cascade of simple features“, Viola and Jones (CVPR 2001) apply “Harr-like” features and AdaBoost to a fast “cascade” of increasingly complex image classifiers (mostly facial recognition).   They write, “The cascade can be viewed as an object specific focus-of-attention mechanism which unlike previous approaches provides statistical guarantees that discarded regions are unlikely to contain the object of interest.” The Harr-like decomposition quickly (constant time) creates mostly localized features and AdaBoost learns quickly so the combination is fast. They report, “In the domain of face detection it is possible to achieve fewer than 1% false negatives and 40% false positives using a classifier constructed from two Harr-like features.” [emphasis added]

“A Product of Experts model (PoE) (Hinton 2002) combines a number of individual component models (the experts) by taking their product and normalizing the result.”  Here’s the Scholarpedia page.

In “Rotation Forest: A New Classifier Ensemble Method“, Rodrıguez and Kuncheva (2006) develop a Random Forest method where the decision boundaries are not of the form   original data feature value <= fixed number. They do this by creating new features by “rotating” the data in the high dimensional feature space before creating a decision tree.  Each decision tree is created with a different rotation of the data, but all of the rotations are generated from Principal Component Analysis. The trees are combined to generate the final classifier.  Here is their abstract

We propose a method for generating classifier ensembles based on feature extraction. To create the training data for a base classifier, the feature set is randomly split into K subsets (K is a parameter of the algorithm) and Principal Component Analysis (PCA) is applied to each subset. All principal components are retained in order to preserve the variability information in the data. Thus, K axis rotations take place to form the new features for a base classifier. The idea of the rotation approach is to encourage simultaneously individual accuracy and diversity within the ensemble. Diversity is promoted through the feature extraction for each base classifier. Decision trees were chosen here because they are sensitive to rotation of the feature axes, hence the name “forest.” Accuracy is sought by keeping all principal components and also using the whole data set to train each base classifier. Using WEKA, we examined the Rotation Forest ensemble on a random selection of 33 benchmark data sets from the UCI repository and compared it with Bagging, AdaBoost, and Random Forest. The results were favorable to Rotation Forest and prompted an investigation into diversity-accuracy landscape of the ensemble models. Diversity-error diagrams revealed that Rotation Forest ensembles construct individual classifiers which are more accurate than these in AdaBoost and Random Forest, and more diverse than these in Bagging, sometimes more accurate as well.

In “Statistical Learning Algorithms Based on Bregman Distances“,  La erty, Pietra, and Pietra (1999) take a fairly standard entropy based tree growing algorithm and replace KL distance with Bregman distance. “In the feature selection step, each linear constraint in a pool of candidate features is evaluated by the reduction in Bregman distance that would result from adding it to the model.”  This is reminiscent of the distribution update step in Softboost and the “information projection” view of Adaboost (See Boosting as Entropy Projection).  The paper is somewhat technical containing interesting theorem proving techniques.


I’ve been looking at versions of Adaboost that are less sensitive to noise such as Softboost. Softboost works by ignoring a number of outliers set by the user (the parameter $v$), finding weak learners that are not highly correlated with the weak learners already in the boosted learner mix and updating the distribution by KL projection onto the set of distributions restricted to those uncorrelated to the mistakes of the latest learner and not placing too much weight on any particular data point. Softboost avoids over fitting by stopping when no feasible point is found for the KL projection.

In “Soft Margins for AdaBoost”, Ratsch, Onoda, and Muller, generalize Adaboost by adding a softening parameter $\phi$ to the distribution update step. They relate soft boosting to simulated annealing and minimization of a generalized exponential loss function. The paper has numerous helpful graphs and experimental data.

In “How to Use Expert Advice“, Cesa-Bianchi, Freund, Haussler, Helmbold, Sharpire, and Warmuth (1997) describe boosting type algorithms to combine classifiers. They prove a bound of

$$ \sqrt{n  \log(M)}$$

on the total regret where $n$ is the number of data points and $M$ is the number of classifiers.  They relate their bounds to the L1 distance between binary strings and apply their results to pattern recognition theory and PAC learning.


Carl and I have been discussing concepts related to sparsity in neural nets. One older idea similar to sparsity is negatively correlated or uncorrelated neural nets. To use this technique, you could train several neural nets on part or all of the data and then keep the neural networks that are less correlated. The idea used CELS algorithm is to add a correlation penalty term to the loss function of the neural net optimization process. In “Simultaneous Training of Negatively Correlated Neural Networks in an Ensemble” Liu and Yao (1999) say that

The idea behind CELS is to encourage different individual networks in an ensemble to learn different parts or aspects of a training data so that the ensemble can learn the whole training data better.

which seems similar to what happens in sparse neural networks.

In “Negative Correlation Learning for Classification Ensembles” Wang, Chen and Yao (2010) modify adaboost to put more weight on classifiers that are less correlated to the boosted classifier.  They do this by introducing a penalty term “
$$p_t = (h_t – H) \sum_{k\neq t} (h_k – H),$$
where H is the final output by combining the decisions from the individuals.”
The idea for this came from the Cooperative ensemble learning system (CELS) by Liu and Yao (1999).  Experimental results and performance bounds were given in “The Effectiveness of a New Negative Correlation Learning Algorithm for Classification Ensembles” Wang and Yao (2010).

I ran across an interesting paper yesterday “Logistic Regression, AdaBoost and Bregman Distances” (Collins, Schapire, Singer 2000).  Adaboost is a great way to combine several weak predictors into a strong predictor (Freund 2012 is a great reference).  The Bregman distance is a generalization of the Euclidean norm and KL-Divergence.  Many well known algorithms that use the Euclidean norm still work when the Euclidian norm is replaced with a Bregman distance (e.g. Dykstra’s Algorithm  see Censor Reich 2000).  In the paper, the authors show that Adaboost solves a best approximation problem using the KL-distance and introduce many variants of Adaboost.  Here are some quotes from the paper:

1) “We are now able to borrow methods from the maximum entropy literature for logistic regression and apply them to the exponential loss used by AdaBoost, especially convergence proof techniques.”

2) “Duffy and Helmbold (1999) gave conditions under which a loss function gives a boosting algorithm. They showed that minimizing logistic loss does lead to a boosting algorithm the PAC sense.”

3) “Our work builds heavily on that of Kivinen andWarmuth (1999) who, along with Lafferty, were the first to make a connection between AdaBoost and information geometry. They showed that the update used by AdaBoost is a form of “entropy projection.”” (bold face added.)

« Older entries