# Ensemble Methods (Part 2): Boosting and AdaBoost

October 13, 2017

Ensemble methods were introduced in a previous tutorial. In this tutorial we explore another approach to ensemble learning called **boosting**, and a specific boosting algorithm called **AdaBoost**.

# Boosting

**Boosting** aims to build a *strong learner* from a set of *weak learners*. A **weak learner** is a predictive model that performs only slightly better than random guessing, whereas **strong learner** refers to a model whose predictions are close to perfect.

Boosting proceeds by repeatedly modifying the training set based on the performance of the earlier predictors. It can be seen as a *sequential ensemble method.* It assigns a greater importance to the data points that were previously misclassified, thereby leading subsequent models to focus more of their resources on those tougher data points. **AdaBoost** (adaptive boosting) is a popular boosting algorithm.

# AdaBoost

Let there be a binary classification dataset of *n* training samples (*x _{i}, y_{i}*) where

*y*is either +1 or -1. We seek to successively train

_{i}*m*binary classifiers

*h*to later combine them for preicting new instances

_{j}*x*as their

**weighted majority vote**:

classifier weights *α _{j}* are computed following the premise of giving a greater influence to the more accurate classifiers. At each model iteration, the misclassification error

*ε*on the training set is calculated:

_{j}and used to compute α_{j}:

In addition, probability weights W_{j} = (w_{j}(1), …, w_{j}(n)) are used to assign different importances to the training instances. They are individually modified after each model iteration before resampling the training set for next classifier *h _{j+1}*:

where, for a binary classification,

$$ y_i h_j(x_i) = \begin{cases} +1 &\text{if} \ x_i \ \text{well-classified} \\ -1 &\text{if} \ x_i \ \text{mis-classified} \end{cases} $$Since *α _{j }* > 0, weights are increased for instances misclassified by h

_{j}(scaled up by the exponential term), and decreased for well-classified samples. Misclassified points take a larger part of h

_{j+1}training set. In addition, to make

*W*a distribution whose values sum up to 1, weights are normalized:

_{j+1}Note that there are two types of weights: *w _{i}* for initially training the models, and

*α*for combining the weak learners to predict any future instance.

_{j}__Pseudocode__

```
# Probability weights initialization
for i=1, ..., n
w1(i) = 1/n
# Model iteration
for j=1, ..., m
resample training set using wj and fit model hj
calculate misclassification error εj using (2)
calculate weight αj by applying (3)
for i=1, ..., n
compute wj+1(i) by applying (4)
normalize all wj+1
# Termination
output the weighted majority vote ŷ using (1)
```