# Ensemble Methods (Part 3): Meta-learning, Stacking and Mixture of Experts

October 13, 2017

Ensemble methods were introduced in a previous tutorial. In this tutorial we will explore two more ensemble learning algorithms, namely **stacking** and **mixture of experts**. Both methods can be seen as examples of **meta-learning**, in which machine learning models are trained on the predictions output by other machine learning models.

# Stacking

Let us continue with the scenario where *m* models are trained on a dataset of *n* samples. **Stacking** (or stacked generalization) builds the models in the ensemble using different learning algorithms (e.g. one neural network, one decision tree, …), as opposed to **bagging** or **boosting** that train various incarnations of the same learner (e.g. all decision trees).

The outputs of the models are combined to compute the ultimate prediction for any instance *x*:

$$ H(x) = \sum_{j=1}^{m} \beta_j h_j(x) $$

Whereas boosting sequentially computes weights *α _{j}* using an empirical formula, stacking introduces a level-1 algorithm, called **meta-learner**, for learning the weights *β _{j}* of the level-0 predictors. That means the *m* predictions of each training instance *x _{i}* now become training data for the level-1 learner. Although any machine learning technique can be used, the optimization problem typically involves least-squares regression:

$$ \beta^* = \mathrm{argmin}_{\beta} \sum_{i=1}^{n} \left( y(x_i) - \sum_{j=1}^{m} \beta_j h_j^{(-i)}(x_i) \right)^2 $$

Here, *h _{j}^{(-i)}* corresponds to the **leave-one-out** prediction generated by training *h _{j}* over the subset of *n-1* instances with the i^{th} sample (*x _{i}, y _{i}*) left aside. Once the optimal weights *β _{j}* are estimated, the base models *h _{j}* are re-trained over the whole dataset and used to evaluate previously unseen instances *x*.
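The procedure above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the choice of base learners, the synthetic dataset, and the helper name `stacked_predict` are assumptions made here for the demo. Out-of-sample predictions *h _{j}^{(-i)}* are obtained with scikit-learn's `cross_val_predict` using leave-one-out splits, and the weights *β* come from an ordinary least-squares fit:

```python
# Stacking sketch: leave-one-out level-0 predictions + least-squares meta-learner.
# Base learners, dataset, and function names are illustrative choices.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import cross_val_predict, LeaveOneOut
from sklearn.tree import DecisionTreeRegressor
from sklearn.neighbors import KNeighborsRegressor

X, y = make_regression(n_samples=60, n_features=4, noise=10.0, random_state=0)

# Two different learning algorithms, as stacking prescribes
base_models = [DecisionTreeRegressor(max_depth=3, random_state=0),
               KNeighborsRegressor(n_neighbors=5)]

# Column j holds h_j^{(-i)}(x_i): prediction for x_i by model j trained without sample i
H = np.column_stack([cross_val_predict(m, X, y, cv=LeaveOneOut())
                     for m in base_models])

# beta* = argmin_beta sum_i (y_i - sum_j beta_j * H[i, j])^2
beta, *_ = np.linalg.lstsq(H, y, rcond=None)

# Re-train the base models on the whole dataset for deployment
for m in base_models:
    m.fit(X, y)

def stacked_predict(X_new):
    preds = np.column_stack([m.predict(X_new) for m in base_models])
    return preds @ beta

print(stacked_predict(X[:3]))
```

Note that the meta-learner sees only out-of-sample predictions; fitting *β* on in-sample predictions would overweight the base model that overfits the most.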

# Mixture of experts

Based on the *divide-and-conquer principle*, **mixture of experts** (ME) trains individual models to become experts in different regions of the feature space. Then, a **gating network** (trainable combiner) decides which combination of ensemble learners is used to predict the final output for any instance *x*:

$$ H(x) = \sum_{j=1}^{m} g_j(x)\, h_j(x) $$

Here the weights *g _{j}*, called **gating functions**, are dynamically estimated by the gating network based on the current input data. They correspond to probabilities:

$$ g_j(x) \geq 0, \qquad \sum_{j=1}^{m} g_j(x) = 1 $$

From the figure it is easy to see that this strategy resembles a neural network. Under this approach, the experts (hidden nodes) can be chosen to be linear models with input weights *θ _{j}*: *h _{j}(x) = θ _{j}^{T}x*. Typically, the gating network is modeled by a softmax activation function for a soft switching between learners:

$$ g_j(x) = \frac{\exp(A_j^T x)}{\sum_{k=1}^{m} \exp(A_k^T x)} $$

Weights *θ _{j}* and *A _{j}* can be trained using the backpropagation algorithm with gradient descent.
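A minimal NumPy sketch of this training loop is shown below, assuming linear experts *h _{j}(x) = θ _{j}^{T}x*, softmax gating, a squared-error loss, and hand-derived gradients; the synthetic piecewise-linear target and all hyperparameters are illustrative choices, not from the original post:

```python
# Mixture-of-experts sketch: linear experts + softmax gating, trained jointly
# by gradient descent on the mean squared error. All names are illustrative.
import numpy as np

rng = np.random.default_rng(0)
n, d, m = 200, 3, 2                          # samples, features, experts
X = rng.normal(size=(n, d))
# Piecewise-linear target: two regimes, so each expert can specialize
y = np.where(X[:, 0] > 0, X @ [1., 2., 0.], X @ [-1., 0., 2.])

theta = rng.normal(scale=0.1, size=(m, d))   # expert weights theta_j
A = rng.normal(scale=0.1, size=(m, d))       # gating weights A_j
lr = 0.05

for _ in range(500):
    experts = X @ theta.T                    # (n, m): h_j(x_i) = theta_j^T x_i
    logits = X @ A.T
    g = np.exp(logits - logits.max(axis=1, keepdims=True))
    g /= g.sum(axis=1, keepdims=True)        # softmax gating g_j(x_i)
    pred = (g * experts).sum(axis=1)         # H(x_i) = sum_j g_j h_j
    err = pred - y                           # gradient of 0.5 * squared error
    # dL/dtheta_j = err * g_j * x ;  dL/dA_j = err * g_j * (h_j - pred) * x
    theta -= lr * ((err[:, None] * g).T @ X) / n
    A -= lr * ((err[:, None] * g * (experts - pred[:, None])).T @ X) / n

print(np.mean((pred - y) ** 2))
```

The gating gradient follows from the softmax derivative: since the gating outputs must sum to one, pushing probability toward one expert necessarily pulls it from the others, which is what the *(h _{j} − pred)* factor encodes.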

# Stacking vs. Mixture of experts

Although both techniques use a **meta-learner** to estimate the weights of the ensemble learners, they differ on the nature of this combiner. Stacking learns *constant* weights based on an optimization criterion by means of a machine learning algorithm (e.g. least-squares regression). **Mixture-of-experts,** on the other hand, is a gating network that assigns probabilities *based on the current input data* (weights are a *function of x*).

Stacking trains all learners on the entire dataset, whereas mixture-of-experts models specialize in different partitions of the feature space.