Strategies to Avoid Overfitting

September 29, 2017

In the last tutorial, we learned about generalization and overfitting. In this tutorial, we will discuss various methods to deal with overfitting.

Feature selection: In the previous tutorial, we saw that as the number of features grew, the model became more prone to overfitting. Hence, reducing the number of features, also known as feature selection, can reduce overfitting.
Regularization: Here, we penalize the model for having weights / parameters with large magnitude. This is achieved by modifying the cost function to include a penalty term based on the magnitude of the model weights. So minimizing the cost function also leads to a preference for smaller weights.

Another way to think about feature selection and regularization is as follows. Feature selection is equivalent to setting some model weights to 0. If a model weight is zero, the corresponding feature is not being used. The intuition behind regularization is that instead of setting some model weights to 0, we are trying to reduce the magnitude of all the weights.
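As a minimal sketch of this view, the snippet below uses a made-up linear model (the weights, bias and data point are illustrative assumptions, not values from this tutorial). Setting a weight to 0 gives exactly the same prediction as dropping that feature, while shrinking keeps every feature but reduces its influence.

```python
def predict(weights, bias, features):
    """Linear model: weighted sum of the features plus a bias."""
    return sum(w * x for w, x in zip(weights, features)) + bias

# Hypothetical model with three features (illustrative values only).
weights = [2.0, -3.0, 0.5]
bias = 1.0
x = [1.0, 2.0, 4.0]

# Feature selection: set the weight of the second feature to zero ...
selected = [2.0, 0.0, 0.5]
with_zero_weight = predict(selected, bias, x)
# ... which is the same as removing that feature from the model entirely.
with_feature_dropped = predict([2.0, 0.5], bias, [1.0, 4.0])

# Regularization-style shrinkage: every weight is kept, but scaled down.
shrunk = [0.5 * w for w in weights]

print(with_zero_weight, with_feature_dropped)                 # 5.0 5.0
print(predict(weights, bias, x), predict(shrunk, bias, x))    # -1.0 0.0
```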

Hyperparameters

We have already seen in previous tutorials that both logistic regression and linear regression models involve weights and bias. The weights and bias are estimated during the training process, such that the model is able to best predict the outcome variable. The weights and bias are known as the parameters of the model.

So what are hyperparameters?

Hyperparameters are parameters that define the model architecture and govern the training process. Hyperparameters are not estimated using machine learning algorithms while training the model. Instead these are decided externally before the training process starts.

Hyperparameter example: Polynomial degree

One example of a hyperparameter that we have already come across in previous tutorials is the degree of the polynomial terms in a polynomial regression model.

The equation of a polynomial regression model with one feature variable and one outcome variable is as follows:

$$ y_{pred}=w_1x+w_2x^2+w_3x^3+\dots+w_dx^d+b $$

where $x$ is the original feature of the dataset. All other features are higher order terms of the original feature.

The degree of this polynomial model, $d$, is considered a hyperparameter of this model.
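To make the role of $d$ concrete, here is a small sketch of the equation above. The function names `polynomial_features` and `predict` and the example weights are hypothetical, chosen only for illustration. Note that the number of weights is fixed by the degree, which is why $d$ has to be chosen before training starts.

```python
def polynomial_features(x, degree):
    """Expand a single feature x into [x, x^2, ..., x^degree]."""
    return [x ** k for k in range(1, degree + 1)]

def predict(x, weights, bias):
    """y_pred = w_1*x + w_2*x^2 + ... + w_d*x^d + b, with d = len(weights)."""
    features = polynomial_features(x, len(weights))
    return sum(w * f for w, f in zip(weights, features)) + bias

# Degree d = 3: three weights plus a bias (illustrative values only).
print(polynomial_features(2.0, 3))           # [2.0, 4.0, 8.0]
print(predict(2.0, [1.0, 0.5, -0.25], 3.0))  # 1*2 + 0.5*4 - 0.25*8 + 3 = 5.0
```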

Very low values of $d$ can make our model underfit the data, while very high values of $d$ can make it overfit the data. In the example plots below, the model seems to improve up to degree 3, but then starts overfitting the data.

What is hyperparameter tuning?

The model we get after training depends directly on the hyperparameter values we use. Hence, selecting appropriate values for the hyperparameters is important for finding the best model which will generalize well to unseen data. This process of selecting the best hyperparameters is known as hyperparameter tuning.

Typically, the process for selecting the best hyperparameters is fairly straightforward. We train multiple machine learning models, each with a different value for the hyperparameters. Then, we choose the best model among these.

Validation Dataset

Which dataset should we use for tuning the hyperparameters?

Using the training dataset doesn’t work. For example, in the polynomial degree example, the error on the training dataset will keep decreasing (the fit will keep looking better) as we increase the degree of the polynomial, so we would always end up picking the highest degree.
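The sketch below illustrates this on synthetic data. It is an illustration under stated assumptions, not the setup used in the previous tutorials: the data (a noisy sine curve), the helper names and the least-squares solver via the normal equations are all made up for this example, and only the standard library is used. The training error never goes up as the degree increases, so it cannot tell us when to stop.

```python
import math
import random

def solve(A, b):
    """Solve the linear system A w = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    w = [0.0] * n
    for r in range(n - 1, -1, -1):
        known = sum(M[r][c] * w[c] for c in range(r + 1, n))
        w[r] = (M[r][n] - known) / M[r][r]
    return w

def fit_polynomial(xs, ys, degree):
    """Least-squares fit; returns the coefficients [b, w_1, ..., w_d]."""
    rows = [[x ** k for k in range(degree + 1)] for x in xs]
    size = degree + 1
    XtX = [[sum(r[i] * r[j] for r in rows) for j in range(size)] for i in range(size)]
    Xty = [sum(r[i] * y for r, y in zip(rows, ys)) for i in range(size)]
    return solve(XtX, Xty)

def mse(coeffs, xs, ys):
    """Mean-squared error of the polynomial with the given coefficients."""
    preds = [sum(c * x ** k for k, c in enumerate(coeffs)) for x in xs]
    return sum((p - y) ** 2 for p, y in zip(preds, ys)) / len(xs)

# Synthetic training data: a noisy sine curve (an assumption for illustration).
random.seed(0)
xs = [random.uniform(-1.0, 1.0) for _ in range(30)]
ys = [math.sin(3.0 * x) + random.gauss(0.0, 0.2) for x in xs]

degrees = range(1, 7)
train_errors = [mse(fit_polynomial(xs, ys, d), xs, ys) for d in degrees]
for d, err in zip(degrees, train_errors):
    print(f"degree {d}: training MSE = {err:.4f}")
```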

Using the test dataset is cheating. As we learnt in the previous tutorial, we should never use the test dataset to train / select our model. The test set should only be used to evaluate a model.

So, we create a third split in our data, called the validation dataset. Typically, we first split the dataset into train and test data. Then, we further split the training dataset into two parts, one to use as the actual training dataset, and the rest to use for validation.

Since the number of hyperparameters is usually small, most of the time using 10-15% of the data as the validation data is sufficient.
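A minimal sketch of such a three-way split is shown below. The function name `train_val_test_split` and the default fractions (10% validation, 20% test) are assumptions made for this example. In practice a library helper would normally be used instead.

```python
import random

def train_val_test_split(data, val_fraction=0.1, test_fraction=0.2, seed=0):
    """Shuffle a copy of the data, then cut it into train / validation / test parts."""
    shuffled = list(data)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    n_val = round(len(shuffled) * val_fraction)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test

examples = list(range(100))  # stand-in for 100 (features, outcome) pairs
train, val, test = train_val_test_split(examples)
print(len(train), len(val), len(test))  # 70 10 20
```

Shuffling before cutting matters when the data is stored in some meaningful order (for example sorted by date or by outcome), and fixing the seed keeps the split reproducible between runs.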

Regularization

One of the most popular and effective strategies to combat overfitting is regularization.

In regularization, we introduce an additional term in our cost function in order to penalize large weights. This biases our model to be simpler, where by simpler we are referring to models with weights of smaller magnitude (or even zero). We want to make the weights smaller, because complex models and overfitting are characterized by large weights.

We will look at two different kinds of regularization in this section:

L2 Regularization or Ridge Regression
L1 Regularization or Lasso Regression

The strength of regularization is decided by a hyperparameter called the regularization parameter, often denoted by $\lambda$.

L2 Regularization or Ridge Regression

Recall the mean-squared error cost function,

$$ J(w)=\frac{1}{n}\sum_{i=1}^n (y(x^i) - y_t^i)^2 $$

In L2 regularization (also known as Ridge Regression) we add a penalty proportional to the squared magnitude of each weight. Our new cost function with L2 regularization is as follows:

$$ J(w)=\frac{1}{n}\sum_{i=1}^n (y(x^i) - y_t^i)^2 + \lambda\sum_{i=1}^d w_i^2 $$

where the first term is the same as in regular linear regression (without any regularization), and the second term is the regularization term.

$\lambda$ is a hyperparameter that we choose, and it determines the regularization strength. Larger values of $\lambda$ imply more regularization, i.e. smaller values for the model weights.
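The effect of the regularization strength can be seen in a deliberately simplified case: a model with a single weight and no bias, for which the L2-regularized cost above has the closed-form minimizer $w = \sum x_i y_i / (\sum x_i^2 + n\lambda)$. The simplification, the function name `ridge_weight` and the toy data are assumptions made for this sketch.

```python
def ridge_weight(xs, ys, lam):
    """Minimizer of (1/n) * sum((w*x - y)^2) + lam * w^2 for a single weight w."""
    n = len(xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + n * lam)

# Toy data roughly following y = 2x (illustrative values only).
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]

lambdas = [0.0, 0.1, 1.0, 10.0]
weights = [ridge_weight(xs, ys, lam) for lam in lambdas]
for lam, w in zip(lambdas, weights):
    print(f"lambda = {lam}: w = {w:.3f}")
```

With no regularization the weight is close to 2, and it shrinks steadily toward zero as the regularization strength grows.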

L2 regularization penalizes the larger weights more (since the penalty is proportional to the weight squared). For example, reducing w = 10 to w = 9 has a larger effect on the penalty term (10² - 9² = 19) than reducing w = 3 to w = 2 (3² - 2² = 5). So even though the weights reduce by 1 in each case, the cost function decreases more when the larger weight is reduced.
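The arithmetic above can be checked with a few lines of code. The helper names are hypothetical. The gradient of the penalty for one weight, $2w$ (before multiplying by the regularization strength), shows the same thing from another angle: the pull toward zero grows with the size of the weight.

```python
def l2_penalty(w):
    """Contribution of one weight to the L2 penalty (before scaling by lambda)."""
    return w ** 2

def l2_penalty_gradient(w):
    """Derivative of w^2 with respect to w."""
    return 2 * w

print(l2_penalty(10) - l2_penalty(9))                   # 19
print(l2_penalty(3) - l2_penalty(2))                    # 5
print(l2_penalty_gradient(10), l2_penalty_gradient(3))  # 20 6
```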

L1 Regularization or Lasso Regression

In L1 regularization (also known as Lasso Regression), the penalty term is proportional to the absolute value of each weight. That is, our cost function is:

$$ J(w)=\frac{1}{n}\sum_{i=1}^n (y(x^i) - y_t^i)^2 + \lambda\sum_{i=1}^d |w_i| $$

Again, the first term is the same as in regular linear regression and the second term is the regularization term.

An interesting property of L1 regularization is that the model’s parameters become sparse during optimization, i.e. it drives a larger number of the weights $w$ to exactly zero. This is because the L1 penalty pulls every weight toward zero with the same strength regardless of its magnitude, so small weights keep being pushed until they reach zero. In L2 regularization, by contrast, the pull is proportional to the weight itself, so large weights are penalized much more while small weights are barely pushed and rarely become exactly zero. This sparsity is often quite useful. For example, it might help us identify which features are more important for making predictions, or it might help us reduce the size of a model (the zero values don’t need to be stored).
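The difference can be sketched in a simplified one-weight setting. Suppose a weight would take the value $a$ without regularization, and we minimize $(w - a)^2$ plus a penalty. With the L1 penalty $\lambda|w|$ the minimizer is $a$ moved toward zero by $\lambda/2$ and clipped at zero (soft thresholding). With the L2 penalty $\lambda w^2$ it is $a/(1+\lambda)$. This one-weight simplification, the function names and the numbers are assumptions made for illustration.

```python
def l1_solution(a, lam):
    """Minimizer of (w - a)^2 + lam * |w| (soft thresholding)."""
    magnitude = max(abs(a) - lam / 2.0, 0.0)
    if magnitude == 0.0:
        return 0.0
    return magnitude if a > 0 else -magnitude

def l2_solution(a, lam):
    """Minimizer of (w - a)^2 + lam * w^2 (proportional shrinkage)."""
    return a / (1.0 + lam)

unregularized = [3.0, -0.4, 0.2, -2.0, 0.05]  # illustrative values only
lam = 1.0
l1_weights = [l1_solution(a, lam) for a in unregularized]
l2_weights = [l2_solution(a, lam) for a in unregularized]
print("L1:", l1_weights)  # L1: [2.5, 0.0, 0.0, -1.5, 0.0]
print("L2:", l2_weights)  # L2: [1.5, -0.2, 0.1, -1.0, 0.025]
```

L1 sets the three small weights to exactly zero, whereas L2 halves every weight and leaves none of them at zero.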

Typical machine learning workflow

Let’s see what a typical machine learning workflow looks like and how it incorporates the concepts we learnt in this tutorial.

1. Load the dataset.
2. Split the dataset into train dataset, validation dataset and test dataset. Typically, we do a 70-10-20 split.
3. Choose some specific hyperparameter values. Train the model using the train dataset. For example, we might have a linear regression model which we train using gradient descent. This gives us some values for the model parameters.
4. Evaluate the model on the validation dataset. (Not the test dataset).
5. Based on how much the model is overfitting, decide which other values for the hyperparameters we should try. Then repeat steps 3 and 4 for those hyperparameter values.
6. Choose the best model among the models we tried (based on performance on the validation data). This is our final model.
7. Test the final model on the test dataset. This tells us how well the final model performs on data it has not seen before. In particular, model parameters / hyperparameters should never be modified / chosen using the test dataset.

Steps 5 and 6 together are called hyperparameter tuning.
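The whole workflow can be sketched end to end as follows. This is an illustration under stated assumptions rather than a reference implementation: the dataset is synthetic, the model is a linear regression with an L2 penalty trained by plain gradient descent, and the candidate values of the regularization parameter, the learning rate and the number of steps are arbitrary choices. For brevity, the candidate values are fixed up front instead of being chosen adaptively.

```python
import random

def predict(w, b, x):
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def mse(w, b, data):
    return sum((predict(w, b, x) - y) ** 2 for x, y in data) / len(data)

def train(data, lam, lr=0.1, steps=200):
    """Gradient descent on MSE + lam * sum(w_i^2); the bias is not penalized."""
    d = len(data[0][0])
    n = len(data)
    w, b = [0.0] * d, 0.0
    for _ in range(steps):
        grad_w, grad_b = [0.0] * d, 0.0
        for x, y in data:
            err = predict(w, b, x) - y
            grad_b += 2.0 * err / n
            for i in range(d):
                grad_w[i] += 2.0 * err * x[i] / n
        w = [wi - lr * (gi + 2.0 * lam * wi) for wi, gi in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b

# Step 1: "load" a dataset (synthetic here; only the first 2 of 5 features matter).
rng = random.Random(0)
dataset = []
for _ in range(200):
    x = [rng.uniform(-1.0, 1.0) for _ in range(5)]
    y = 3.0 * x[0] - 2.0 * x[1] + 1.0 + rng.gauss(0.0, 0.3)
    dataset.append((x, y))

# Step 2: 70-10-20 split (the synthetic data is already in random order).
train_set, val_set, test_set = dataset[:140], dataset[140:160], dataset[160:]

# Steps 3 to 5: train once per candidate value, evaluate on the validation data.
candidates = [0.0, 0.01, 0.1, 1.0]
results = []
for lam in candidates:
    w, b = train(train_set, lam)
    results.append((mse(w, b, val_set), lam, w, b))

# Step 6: keep the model with the lowest validation error.
best_val_mse, best_lam, best_w, best_b = min(results, key=lambda r: r[0])

# Step 7: evaluate the final model on the test set, exactly once.
test_mse = mse(best_w, best_b, test_set)
print(f"best lambda = {best_lam}, validation MSE = {best_val_mse:.3f}, test MSE = {test_mse:.3f}")
```

The test set is touched only in the last line, after the hyperparameter has been fixed, so the reported test error is not influenced by the tuning loop.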

Summary

Hyperparameters are parameters which define the model architecture and govern the training process.
The process of selecting the best hyperparameters is known as hyperparameter tuning.
The validation dataset is a portion of data kept aside from the training set to be used for hyperparameter tuning.
There are two broad strategies to avoid overfitting: feature selection and regularization.
One way to perform automatic feature selection is to train a polynomial regression model and treat the polynomial degree $d$ as a hyperparameter.
Regularization introduces an additional term in the cost function in order to penalize large weights.
The strength of regularization is denoted by the hyperparameter $\lambda$, known as the regularization parameter.
The penalty term in L2 Regularization is $\lambda \sum w^2$. It penalizes the squares of the model weights.
The penalty term in L1 Regularization is $\lambda \sum |w|$. It penalizes the absolute values of the model weights.