In the seminal paper “Stacked generalization”, David H. Wolpert generalizes the idea of cross-validation.

Suppose you have a data set $(x_i, y_i)$ where $i=1, \ldots, n$, $x_i \in A$, and $y_i \in R$. For cross-validation, you might partition the data set into three subsets, train $k$ classifiers on two of the three subsets, and test the classifiers on the held-out data. You might then select the one of the $k$ classifiers that did best on the held-out data.
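
The selection step can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the data is synthetic, the two learners (`fit_mean` and `fit_linear`) are hypothetical stand-ins for the $k$ classifiers, and because $y_i$ is real-valued they are scored by mean squared error on the held-out subset.

```python
import numpy as np

def fit_mean(X, y):
    """Hypothetical baseline learner: always predict the training mean."""
    m = y.mean()
    return lambda Xn: np.full(len(Xn), m)

def fit_linear(X, y):
    """Hypothetical learner: ordinary least squares with an intercept."""
    A = np.column_stack([X, np.ones(len(X))])
    w, *_ = np.linalg.lstsq(A, y, rcond=None)
    return lambda Xn: np.column_stack([Xn, np.ones(len(Xn))]) @ w

# Synthetic data (an assumption made purely for illustration).
rng = np.random.default_rng(0)
X = rng.normal(size=(90, 2))
y = X @ np.array([2.0, -1.0]) + 0.1 * rng.normal(size=90)

# Partition the indices into three subsets and hold out the last one.
parts = np.array_split(rng.permutation(len(X)), 3)
train = np.concatenate(parts[:2])
held_out = parts[2]

# Train every learner on two subsets and score it on the held-out subset.
learners = [fit_mean, fit_linear]
errors = []
for fit in learners:
    model = fit(X[train], y[train])
    errors.append(np.mean((model(X[held_out]) - y[held_out]) ** 2))

# Keep the learner with the smallest held-out error.
best = int(np.argmin(errors))
```

Here `best` indexes the winning learner. All other learners are discarded, which is the step that stacking replaces.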

Wolpert generalizes this idea by forming an $n$ by $k$ matrix of predictions from the $k$ classifiers. The entry in the $i$th row and $j$th column is the prediction at $x_i$ of the $j$th classifier, trained on partitions of the data set that do not include $x_i$. Wolpert then trains a new classifier $f$, mapping $R^k$ into $R$, that takes the $i$th row of the matrix as input and tries to match $y_i$ as output. Let $G_j$ be the result of training the $j$th classifier on all of the data. Then the Wolpert generalized classifier has the form

$$h(x) = f( G_1(x), G_2(x), \ldots, G_k(x) ).$$
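
The construction above can be sketched as follows. This is a minimal illustration, not Wolpert's own procedure in every detail: the learners `fit_linear` and `fit_nearest`, the helper `stack`, the choice of a linear combiner for $f$, and the synthetic data are all assumptions made for the example.

```python
import numpy as np

def fit_linear(X, y):
    """Hypothetical learner: ordinary least squares with an intercept."""
    A = np.column_stack([X, np.ones(len(X))])
    w, *_ = np.linalg.lstsq(A, y, rcond=None)
    return lambda Xn: np.column_stack([Xn, np.ones(len(Xn))]) @ w

def fit_nearest(X, y):
    """Hypothetical learner: predict the target of the nearest training point."""
    def predict(Xn):
        d = ((Xn[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
        return y[d.argmin(axis=1)]
    return predict

def stack(X, y, learners, n_parts=3, seed=0):
    """Return the stacked predictor h and the n-by-k matrix of held-out predictions."""
    n, k = len(X), len(learners)
    rng = np.random.default_rng(seed)
    parts = np.array_split(rng.permutation(n), n_parts)
    P = np.empty((n, k))
    for held_out in parts:
        train = np.setdiff1d(np.arange(n), held_out)
        for j, fit in enumerate(learners):
            # Row i, column j: learner j's prediction at x_i, trained without x_i.
            P[held_out, j] = fit(X[train], y[train])(X[held_out])
    f = fit_linear(P, y)                  # combiner f: R^k -> R (linear by assumption)
    G = [fit(X, y) for fit in learners]   # G_j: learner j retrained on all of the data

    def h(Xn):
        # h(x) = f(G_1(x), ..., G_k(x))
        return f(np.column_stack([g(Xn) for g in G]))

    return h, P

# Synthetic data (an assumption made purely for illustration).
rng = np.random.default_rng(1)
X = rng.normal(size=(120, 2))
y = X @ np.array([2.0, -1.0]) + np.sin(3 * X[:, 0]) + 0.1 * rng.normal(size=120)

h, P = stack(X, y, [fit_linear, fit_nearest])
predictions = h(X[:5])
```

Fitting $f$ on held-out predictions rather than on in-sample ones matters here: the nearest-neighbour learner reproduces its own training targets exactly, so an in-sample matrix would lead $f$ to trust it far too much.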

Wolpert actually describes an even more general scheme which could have a large number of layers, much like deep belief networks and autoencoders. The idea of leaving part of the data out is similar to denoising or dropout.

In the paper “Feature-Weighted Linear Stacking”, Sill, Takacs, Mackey, and Lin describe a faster version of stacking used extensively by the second-place team in the Netflix Prize contest.
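
As I understand that paper, the key idea is to let the weight given to each model be a linear function of per-example "meta-features", so the blend is linear in the products of meta-features and model predictions and can be fit by ordinary linear regression. The sketch below illustrates only that idea. The function names `fwls_fit` and `fwls_predict`, the small ridge penalty, and the synthetic data are assumptions for the example, not the authors' implementation.

```python
import numpy as np

def fwls_fit(P, F, y, ridge=1e-8):
    """Fit blend coefficients V of shape (m, k).

    P: (n, k) matrix of model predictions.
    F: (n, m) matrix of meta-features (include a column of ones for a constant weight).
    """
    n, k = P.shape
    m = F.shape[1]
    # Design matrix holding every product F[:, j] * P[:, i], shape (n, m * k).
    Z = (F[:, :, None] * P[:, None, :]).reshape(n, m * k)
    # Ridge-regularised least squares via the normal equations.
    v = np.linalg.solve(Z.T @ Z + ridge * np.eye(m * k), Z.T @ y)
    return v.reshape(m, k)

def fwls_predict(V, P, F):
    """Blend: sum over j and i of V[j, i] * F[:, j] * P[:, i]."""
    return np.einsum("nj,ji,ni->n", F, V, P)

# Synthetic check (an assumption for illustration): recover known coefficients.
rng = np.random.default_rng(0)
P = rng.normal(size=(200, 2))                                # predictions of k = 2 models
F = np.column_stack([np.ones(200), rng.normal(size=200)])   # m = 2 meta-features
V_true = np.array([[1.0, 2.0], [1.0, -1.0]])
y = fwls_predict(V_true, P, F)

V = fwls_fit(P, F, y)
```

With a single constant meta-feature this reduces to stacking with a plain linear combiner.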