Ensemble Methods (Part 1): Model averaging, Bagging and Random Forests

May 27, 2017

Ensemble learning is a method in which we train multiple machine learning models and combine their predictions in order to achieve better accuracy and reduce the variance of the predictions. Below we discuss some approaches to ensemble learning: model averaging, bagging, and random forests (a specific bagging algorithm).

Model Averaging

Model averaging is the simplest form of ensemble learning. It does exactly what the name says: multiple models are trained on the same dataset, and at prediction time we average their outputs.

In the case of classification, the most common method for combining the predictions is to take votes from each model. For regression problems, we take the mean of the predictions from each model. In general, the specific choice of which function we use to combine the predictions depends on the cost function specific to the problem.

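Averaging improves performance mainly by reducing variance: the errors of the individual models partly cancel out, so the combined prediction overfits less than a single model tends to. The benefit is largest when the models' errors are not strongly correlated.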
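The two combination rules described above can be sketched in a few lines of NumPy. This is only an illustration: the function names and the prediction arrays are hypothetical, and ties in the vote are broken in favour of the smallest label, which is one choice among several.

```python
import numpy as np

def average_regression(predictions):
    """Mean of per-model predictions; rows are models, columns are samples."""
    return np.mean(np.asarray(predictions, dtype=float), axis=0)

def majority_vote(predictions):
    """Most frequent class label per sample; rows are models, columns are samples.

    Ties go to the smallest label, because argmax returns the first maximum.
    """
    predictions = np.asarray(predictions)
    labels = np.unique(predictions)
    # counts[k, j] = number of models that predicted labels[k] for sample j
    counts = np.array([(predictions == label).sum(axis=0) for label in labels])
    return labels[np.argmax(counts, axis=0)]

# Hypothetical outputs of three models on four samples
reg_preds = [[2.0, 3.0, 5.0, 1.0],
             [4.0, 3.0, 6.0, 1.0],
             [3.0, 3.0, 4.0, 4.0]]
clf_preds = [[0, 1, 1, 2],
             [0, 1, 0, 2],
             [1, 1, 0, 0]]

avg = average_regression(reg_preds)   # one averaged value per sample
votes = majority_vote(clf_preds)      # one winning label per sample
print(avg)
print(votes)
```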

Bagging (Bootstrap aggregating)

Bagging is the next simplest form of ensemble learning. First we create multiple datasets from the original dataset by bootstrap sampling: each new dataset is drawn at random from the original with replacement, and is typically the same size as the original. Then we train a machine learning model on each dataset. For prediction, we combine the predictions from the individual models, as in model averaging.
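A minimal sketch of the procedure, assuming each dataset is a bootstrap sample (drawn with replacement, same size as the original) and that the base model is a decision tree. The synthetic data from make_classification and the choice of 25 models are placeholders for illustration only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
# Synthetic binary classification data, used only as a stand-in
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

n_models = 25  # odd, so a two-class vote cannot tie
n_samples = X.shape[0]
models = []
for _ in range(n_models):
    # Bootstrap sample: draw n_samples row indices with replacement
    idx = rng.integers(0, n_samples, size=n_samples)
    tree = DecisionTreeClassifier(random_state=0).fit(X[idx], y[idx])
    models.append(tree)

# Combine by majority vote; labels are 0/1 here, so the vote is a thresholded mean
all_preds = np.array([m.predict(X) for m in models])  # shape (n_models, n_samples)
bagged_pred = (all_preds.mean(axis=0) > 0.5).astype(int)
train_accuracy = np.mean(bagged_pred == y)
print(f"training accuracy of the bagged ensemble: {train_accuracy:.3f}")
```

The accuracy printed here is measured on the training data, so it is optimistic and should not be read as an estimate of generalization error.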

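We do not need to hold out a separate test set to estimate the error. Because each bootstrap dataset is sampled with replacement, it leaves out some of the original samples (about 37% on average when it is the same size as the original). These are called the out-of-bag samples for that model. Predicting each sample using only the models that did not see it during training gives an error estimate known as the out-of-bag error.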
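The out-of-bag error can also be computed by hand. The sketch below assumes the usual definition, in which a sample is out-of-bag for a model if that model's bootstrap sample did not include it. The synthetic data and the number of trees are placeholders.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
# Synthetic data with integer labels 0 and 1, used only as a stand-in
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
n_samples, n_models = X.shape[0], 50
n_classes = len(np.unique(y))

# oob_votes[i, c] = number of trees that did not train on sample i and predicted class c
oob_votes = np.zeros((n_samples, n_classes), dtype=int)
for _ in range(n_models):
    idx = rng.integers(0, n_samples, size=n_samples)  # bootstrap sample
    oob_mask = np.ones(n_samples, dtype=bool)
    oob_mask[idx] = False                             # rows never drawn are out-of-bag
    tree = DecisionTreeClassifier(random_state=0).fit(X[idx], y[idx])
    preds = tree.predict(X[oob_mask])
    oob_votes[np.flatnonzero(oob_mask), preds] += 1

# Only score samples that were out-of-bag for at least one tree
has_vote = oob_votes.sum(axis=1) > 0
oob_pred = oob_votes.argmax(axis=1)
oob_error = np.mean(oob_pred[has_vote] != y[has_vote])
print(f"OOB error: {oob_error:.3f}")
```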

Random Forest

Random Forest is a widely used bagging (and hence ensemble learning) algorithm. In this case, the individual classifiers are decision trees.

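We create multiple datasets as in bagging, by sampling at random from the original data. In addition, each tree is restricted to a random subset of the features. In the standard algorithm (and in scikit-learn, via the max_features parameter), a fresh subset is drawn at every split rather than once per dataset. Without this feature randomness, the decision trees in our forest could become highly correlated, for example because they all split on the same few strong predictors, and averaging correlated trees reduces variance much less.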
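One rough way to see the effect of feature randomness is to measure how correlated the individual trees' predictions are with and without it. The sketch below is an assumption-laden illustration: the helper mean_tree_correlation is hypothetical, the data is synthetic, and max_features=None (all features at every split) is compared against max_features="sqrt". One would typically expect the lower correlation with "sqrt", but the exact numbers depend on the data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic data, used only as a stand-in
X, y = make_classification(n_samples=500, n_features=20, n_informative=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

def mean_tree_correlation(forest, X_eval):
    """Average pairwise Pearson correlation between the trees' class-1 probabilities."""
    per_tree = np.array([t.predict_proba(X_eval)[:, 1] for t in forest.estimators_])
    corr = np.corrcoef(per_tree)
    upper = np.triu_indices_from(corr, k=1)  # each pair of trees counted once
    return corr[upper].mean()

results = {}
for max_features in (None, "sqrt"):
    forest = RandomForestClassifier(n_estimators=50, max_features=max_features,
                                    random_state=0)
    forest.fit(X_train, y_train)
    results[max_features] = mean_tree_correlation(forest, X_test)
    print(max_features, round(results[max_features], 3))
```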

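Random forests have many advantages: the trees can be trained independently, so training is fast and easy to parallelize, and they need little input preparation (for example, no feature scaling). One disadvantage is that the model can become large, since it stores many full trees, which costs memory and slows down prediction.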

Scikit-learn illustration

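from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

# Placeholder data so the snippet runs on its own; substitute your own X, y
X, y = load_iris(return_X_y=True)

# Tune a hyperparameter (the number of decision trees) using the out-of-bag error
min_estimators = 30  # minimum number of trees to build
max_estimators = 60  # maximum number of trees to build

# warm_start=True keeps the trees already built and only adds new ones on each fit;
# oob_score=True computes the accuracy on each tree's out-of-bag samples
rf = RandomForestClassifier(criterion="entropy", warm_start=True,
                            oob_score=True, random_state=42)
for n_trees in range(min_estimators, max_estimators + 1):
    rf.set_params(n_estimators=n_trees)
    rf.fit(X, y)  # no separate test set needed: the OOB samples play that role
    oob_error = 1 - rf.oob_score_  # oob_score_ is an accuracy, so this is the error rate
    print(n_trees, oob_error)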