Ensemble Methods (Part 3): Meta-learning, Stacking and Mixture of Experts

October 13, 2017

Ensemble methods were introduced in a previous tutorial. In this tutorial we explore two more ensemble learning algorithms, namely stacking and mixture of experts. Both methods can be viewed as examples of meta-learning, in which machine learning models are trained on the predictions output by other machine learning models.

Stacking

Let us continue with the scenario where m models are trained on a dataset of n samples. Stacking (or stacked generalization) builds the models in the ensemble using different learning algorithms (e.g. one neural network, one decision tree, …), as opposed to bagging or boosting, which train multiple incarnations of the same learner (e.g. all decision trees).

The outputs of the models are combined to compute the ultimate prediction of any instance x:

$$\hat{y}(x) = \sum_{j=1}^m \beta_j h_j(x)$$

Whereas boosting sequentially computes the weights αj using an empirical formula, stacking introduces a level-1 algorithm, called the meta-learner, to learn the weights βj of the level-0 predictors. That means the m predictions for each training instance xi now become training data for the level-1 learner.

Although any machine learning technique can be used, the optimization problem typically involves least-squares regression:

$$\beta^* = \operatorname*{argmin}_{\beta} \sum_{i=1}^{n} \left(y(x_i) - \sum_{j=1}^{m} \beta_j h_j^{(-i)}(x_i) \right)^2$$

Here, hj(-i) denotes the leave-one-out prediction obtained by training hj on the subset of n-1 instances with the ith sample (xi, yi) left aside. Once the optimal weights βj are estimated, the base models hj are re-trained on the whole dataset and used to predict previously unseen instances x.
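This procedure can be sketched with scikit-learn, where `cross_val_predict` with a leave-one-out split yields exactly the hj(-i) predictions. The three base learners, the synthetic dataset, and the hyperparameters below are illustrative choices, not prescribed by stacking itself:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import LeaveOneOut, cross_val_predict
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=60, n_features=4, noise=5.0, random_state=0)

# Level-0 models built with different learning algorithms
base_models = [
    LinearRegression(),
    DecisionTreeRegressor(max_depth=3, random_state=0),
    KNeighborsRegressor(n_neighbors=5),
]

# Column j holds h_j^{(-i)}(x_i): the leave-one-out prediction of model j
# for each training instance x_i
H = np.column_stack(
    [cross_val_predict(m, X, y, cv=LeaveOneOut()) for m in base_models]
)

# Level-1 learner: least-squares fit of the weights beta on the LOO predictions
beta, *_ = np.linalg.lstsq(H, y, rcond=None)

# Re-train the base models on the whole dataset for future predictions
for m in base_models:
    m.fit(X, y)

def predict(x_new):
    """Combine the level-0 predictions with the learned stacking weights."""
    level0 = np.column_stack([m.predict(x_new) for m in base_models])
    return level0 @ beta
```

In practice, k-fold out-of-fold predictions are often used instead of strict leave-one-out to reduce the n model fits per learner to k.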

Mixture of experts

Based on the divide-and-conquer principle, mixture of experts (ME) trains individual models to become experts in different regions of the feature space. Then, a gating network (trainable combiner) decides which combination of ensemble learners is used to predict the final output of any instance x:

$$\hat{y}(x) = \sum_{j=1}^m g_j(x)\ h_j(x)$$

Here weights gj, called gating functions, are dynamically estimated through the gating network based on the current input data. They correspond to probabilities:

$$0 \leqslant g_j(x) \leqslant 1 \quad \text{and} \quad \sum_{j=1}^m g_j(x) = 1$$

From the figure it is easy to see that this strategy resembles a neural network. Under this approach, the experts (hidden nodes) can be chosen to be linear models with input weights θj, i.e. hj(x) = θjTx. Typically, the gating network is modeled by a softmax activation function for soft switching between learners:

$$g_j(x) = \dfrac{e^{A_j^T x} }{\sum_k e^{A_k^T x}}$$

Weights θj and Aj can be trained using the backpropagation algorithm with gradient descent.
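The whole model can be sketched in NumPy. The toy two-regime dataset, the choice of two experts, the learning rate, and the iteration count below are all illustrative assumptions; the gradients follow from differentiating the squared error through the softmax gate:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 1-D regression with two regimes: y = -x on the left, y = +x on the right,
# so each linear expert can specialize in one half of the feature space
X = rng.uniform(-2, 2, size=(200, 1))
y = np.abs(X[:, 0])

# Append a bias feature so experts and gate can learn offsets
Xb = np.hstack([X, np.ones((X.shape[0], 1))])
m, d = 2, Xb.shape[1]                        # number of experts, input dimension

theta = rng.normal(scale=0.1, size=(m, d))   # expert weights: h_j(x) = theta_j^T x
A = rng.normal(scale=0.1, size=(m, d))       # gating-network weights

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)     # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

lr = 0.1
for _ in range(2000):
    h = Xb @ theta.T                 # (n, m) expert outputs
    g = softmax(Xb @ A.T)            # (n, m) gating probabilities, rows sum to 1
    y_hat = (g * h).sum(axis=1)      # gated combination sum_j g_j(x) h_j(x)
    err = y_hat - y

    # Gradients of the mean squared error:
    # d y_hat / d theta_j = g_j * x
    grad_theta = (err[:, None] * g).T @ Xb / len(y)
    # d y_hat / d A_j = g_j * (h_j - y_hat) * x   (softmax derivative)
    grad_A = (err[:, None] * g * (h - y_hat[:, None])).T @ Xb / len(y)

    theta -= lr * grad_theta
    A -= lr * grad_A
```

After training, the gate should route left-half inputs to the expert with a negative slope and right-half inputs to the one with a positive slope, which a single linear model could not fit.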

Stacking vs. Mixture of experts

Although both techniques use a meta-learner to estimate the weights of the ensemble learners, they differ in the nature of this combiner. Stacking learns constant weights, based on an optimization criterion, by means of a machine learning algorithm (e.g. least-squares regression). Mixture of experts, on the other hand, uses a gating network that assigns probabilities based on the current input (the weights are a function of x).

Stacking trains all learners on the entire dataset, whereas mixture-of-experts models specialize in different partitions of the feature space.