Ensemble Methods (Part 3): Meta-learning, Stacking and Mixture of Experts

- Stacking
- Mixture of experts
- Stacking vs. Mixture of experts

Ensemble methods were introduced in a previous tutorial. In this tutorial we will explore two more ensemble learning algorithms: **stacking** and **mixture of experts**. Both can be viewed as examples of **meta-learning**, in which machine learning models are trained on the predictions output by other machine learning models.

Let us continue with the scenario where *m* models are trained on a dataset of *n* samples. **Stacking** (or stacked generalization) builds the models in the ensemble using different learning algorithms (e.g. one neural network, one decision tree, ...), as opposed to **bagging** or **boosting**, which train multiple incarnations of the same learner (e.g. all decision trees).

The outputs of the models are combined to compute the final prediction for any instance *x*:

\hat{y}(x) = \sum_{j=1}^m \beta_j h_j(x)

Whereas boosting sequentially computes weights *α _{j}* using an empirical formula, stacking introduces a level-1 algorithm, called a **meta-learner**, that learns the weights *β _{j}* directly from the data.

Although any machine learning technique can be used, the optimization problem typically involves least-squares regression:

\beta^* = \operatorname{argmin}_{\beta} \sum_{i=1}^{n} \left( y(x_i) - \sum_{j=1}^{m} \beta_j h_j^{(-i)}(x_i) \right)^2

Here, *h _{j}^{(-i)}* corresponds to the model *h _{j}* trained on all samples except the *i*-th one (a leave-one-out scheme). Using held-out predictions prevents the meta-learner from simply favoring models that overfit the training set.
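The procedure above can be sketched in a few lines of NumPy. The two level-0 learners (a linear and a quadratic fit) and the synthetic dataset are hypothetical choices for illustration; the essential steps are the leave-one-out predictions and the least-squares fit of the weights *β*:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 40
x = np.linspace(0, 1, n)
y = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(n)

# Two level-0 learners (hypothetical choices): h1 = linear fit, h2 = quadratic fit.
degrees = (1, 2)

# Leave-one-out predictions h_j^{(-i)}(x_i): column j holds learner j's
# prediction at x_i when trained without the i-th sample.
H = np.zeros((n, len(degrees)))
for i in range(n):
    mask = np.arange(n) != i
    for j, deg in enumerate(degrees):
        coef = np.polyfit(x[mask], y[mask], deg)
        H[i, j] = np.polyval(coef, x[i])

# Level-1 learner: least-squares weights beta minimizing
# sum_i (y_i - sum_j beta_j * h_j^{(-i)}(x_i))^2.
beta, *_ = np.linalg.lstsq(H, y, rcond=None)

# For the final ensemble, each learner is refit on the full dataset
# and predictions are combined with the learned weights.
coefs = [np.polyfit(x, y, deg) for deg in degrees]
def y_hat(x_new):
    return sum(b * np.polyval(c, x_new) for b, c in zip(beta, coefs))
```

Note that the weights *β* are constants: once learned, the same combination is applied to every input.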

Based on the *divide-and-conquer principle*, **mixture of experts** (ME) trains individual models to become experts in different regions of the feature space. Then, a **gating network** (trainable combiner) decides which combination of ensemble learners is used to predict the final output of any instance *x*:

\hat{y}(x) = \sum_{j=1}^m g_j(x)\ h_j(x)

Here the weights *g _{j}*, called **gating weights**, are non-negative and sum to one:

0 \leqslant g_j(x) \leqslant 1 \quad \text{and} \quad \sum_{j=1}^m g_j(x) = 1

This strategy resembles a neural network. Under this approach, the experts (hidden nodes) can be chosen to be linear models with input weights *θ _{j}*, while the gating network computes a softmax over the input:

g_j(x) = \dfrac{e^{A_j^T x} }{\sum_k e^{A_k^T x}}

Weights *θ _{j}* and gating parameters *A _{j}* are typically learned jointly, for example with gradient descent or the expectation-maximization (EM) algorithm.
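A minimal sketch of the forward pass makes the gating mechanism concrete. The expert weights `theta` and gate weights `A` below are fixed, hand-picked values for illustration; in practice both are learned as described above:

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax.
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

# Hypothetical 1-D example with m = 2 linear experts and a softmax gate.
theta = np.array([[0.0, 1.0],    # expert 1: h1(x) = 0 + 1*x
                  [1.0, -1.0]])  # expert 2: h2(x) = 1 - x
A = np.array([[0.0, -5.0],       # gate favors expert 1 for small x
              [0.0, 5.0]])       # and expert 2 for large x

def predict(x):
    phi = np.array([1.0, x])     # input with a bias term
    g = softmax(A @ phi)         # gating weights g_j(x), sum to 1
    h = theta @ phi              # expert outputs h_j(x)
    return float(g @ h)          # y_hat(x) = sum_j g_j(x) * h_j(x)
```

Unlike stacking, the combination weights here change with each input: near *x* = 0 both experts contribute, while for large *x* the gate routes almost all weight to the second expert.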

Although both techniques use a **meta-learner** to estimate the weights of the ensemble learners, they differ in the nature of this combiner. Stacking learns *constant* weights based on an optimization criterion by means of a machine learning algorithm (e.g. least-squares regression). **Mixture of experts**, on the other hand, uses a gating network that assigns probabilities *based on the current input* (the weights are a *function of x*).

Stacking trains all learners on the entire dataset, whereas mixture of experts trains each model to specialize in a different partition of the feature space.


About the contributors: Marta Enesco (93%); Keshav Dhandhania, MSc in Deep Learning @ MIT (2014) (7%).
