Generalization and Overfitting

April 30, 2019

In this tutorial, we will talk about

generalization — the ability of a machine learning model to perform well on new unseen data, and
overfitting — when a machine learning model performs much better on data used to train the model, but doesn’t perform well on new unseen data.

Both of these are extremely important concepts in machine learning.

Let’s get started!

Train and Test data

Before we make predictions using a machine learning model, we first estimate the parameters (such as weights and bias) of the model.

The dataset based on which we estimate the parameters of the model is known as the training set, as we are training the model on this dataset.

Once our model has been trained and the optimum parameters have been determined, we check if it can accurately predict the target.

This step must be performed on new and unseen data, i.e. data that the machine learning model did not encounter during training. The unseen data on which our model is tested is known as the test set.
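As a minimal sketch of how a dataset might be divided into these two sets, the snippet below shuffles a list of examples and holds out a fraction of them for testing. The function name `split_train_test`, the 25% test fraction, the fixed seed and the toy data are all assumptions made for illustration.

```python
import random


def split_train_test(examples, test_fraction=0.25, seed=0):
    """Shuffle a copy of the examples and hold out a fraction as the test set."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)  # fixed seed so the split is repeatable
    n_test = int(round(len(shuffled) * test_fraction))
    test_set = shuffled[:n_test]
    train_set = shuffled[n_test:]
    return train_set, test_set


# Toy dataset of (x, y) pairs; the values are made up for illustration.
dataset = [(x, 2 * x + 1) for x in range(20)]
train_set, test_set = split_train_test(dataset)

print(len(train_set), "training examples,", len(test_set), "test examples")
```

Shuffling before splitting matters when the original data is ordered, since otherwise the test set could cover a different range of inputs than the training set.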

Generalization, Overfitting and Underfitting

It’s not a good idea to test a machine learning model on the same dataset which we used to train it. If we do this, we won’t get any indication of how well our model performs on unseen data.

The ability to generalize is the ability to perform well on unseen data. This is the main characteristic we want in a model.

We must always test machine learning models on new unseen data.

When a model performs well on training data but does not perform well on test data, we say that it has overfit the training data or that the model is overfitting. This happens because the model learns the noise present in the training data as if it were a reliable pattern.

Conversely, when a model performs well on neither the training data nor unseen data, it is said to be underfitting. That is, the model is unable to capture the patterns present in the training data.

A smaller dataset can significantly increase the chance of overfitting. This is because it is much tougher to separate reliable patterns from noise when the dataset is small.

Overfitting and underfitting in a Regression problem

Suppose we have the following dataset (red points in the figure), where we have only one input variable x and one output variable y.

If we fit y = b + w₁x to the above dataset, we get the straight line fit as shown above. Note that this is not a good fit since it is quite far from many data points. This is an example of underfitting.

Now, if we add another feature x² and fit y = b + w₁x + w₂x² then we’ll get a curve fit as shown above. As you can see, this is a better fit since it passes much closer to the data points above.

Note: The above model is still a linear model, since x² is treated as just another input feature. The weights w₁ and w₂ enter linearly, each multiplying one of the features x and x². The curve we are fitting, however, is quadratic in x.
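To make this concrete, here is a small sketch that fits both models by ordinary least squares with NumPy. The synthetic data, the noise level and the helper name `fit_least_squares` are assumptions for illustration. The tutorial does not say how its fits were produced.

```python
import numpy as np

# Synthetic data, assumed for illustration: a quadratic trend plus noise.
rng = np.random.default_rng(0)
x = np.linspace(-3.0, 3.0, 20)
y = 1.0 + 2.0 * x + 0.5 * x**2 + rng.normal(scale=0.3, size=x.shape)


def fit_least_squares(features, targets):
    """Return weights minimizing squared error; the first weight is the bias b."""
    design = np.column_stack([np.ones(len(targets))] + features)
    weights, *_ = np.linalg.lstsq(design, targets, rcond=None)
    predictions = design @ weights
    mse = float(np.mean((predictions - targets) ** 2))
    return weights, mse


# Straight line: y = b + w1*x
line_weights, line_mse = fit_least_squares([x], y)
# Quadratic curve, still linear in the weights: y = b + w1*x + w2*x^2
quad_weights, quad_mse = fit_least_squares([x, x**2], y)

print("line training MSE:", line_mse)
print("quadratic training MSE:", quad_mse)
```

The same solver handles both cases because x² simply becomes one more column of the design matrix. Only the set of features changes, not the fitting procedure.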

If we keep adding more and more features, we’ll get curves that are more and more complex and that pass through more and more data points.

The above figure shows an example. This is an example of overfitting. In this case, we are fitting the polynomial curve y = b + w₁x + w₂x² + … + w_d x^d.

Even though the fitted curve passes through almost all points, we can see that it will not perform well on unseen data.
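The effect can be explored numerically with a sketch like the one below, which fits polynomials of degree 1, 2 and 9 to ten noisy training points and reports the error on both the training points and a separate test set. The ground-truth function, noise level, sample sizes and degrees are all assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)


def true_function(x):
    # Assumed ground truth for this illustration only.
    return 1.0 + 2.0 * x - 1.5 * x**2


# A small training set and a separate test set drawn from the same process.
x_train = np.linspace(-1.0, 1.0, 10)
y_train = true_function(x_train) + rng.normal(scale=0.2, size=x_train.shape)
x_test = rng.uniform(-1.0, 1.0, size=50)
y_test = true_function(x_test) + rng.normal(scale=0.2, size=x_test.shape)


def mse(coeffs, x, y):
    """Mean squared error of the polynomial with the given coefficients."""
    return float(np.mean((np.polyval(coeffs, x) - y) ** 2))


results = {}
for degree in (1, 2, 9):
    coeffs = np.polyfit(x_train, y_train, degree)
    results[degree] = (mse(coeffs, x_train, y_train), mse(coeffs, x_test, y_test))
    print(f"degree {degree}: train MSE = {results[degree][0]:.4f}, "
          f"test MSE = {results[degree][1]:.4f}")
```

The training error can only go down as the degree increases, and the degree-9 polynomial passes through all ten training points. The test error is the number to watch: in a setup like this it would typically stop improving, or get worse, once the curve starts fitting the noise.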

Overfitting and underfitting in a Classification problem

In a classification problem, overfitting can be seen in terms of how the decision boundary is drawn between target labels.

Suppose we have a dataset with two features $x_1$ and $x_2$, and a target variable $y$ which has only two labels.

We can use logistic regression to predict the target labels.

The diagram below shows a decision boundary that is underfitting.

This is formed from the logistic regression model:

$g(z) = \frac{1}{1+e^{-z}}$, where $z=w_1x_1+w_2x_2+b$

It’s a very simple decision boundary, and is unable to classify a lot of points correctly.
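The reason this boundary is so simple can be seen in a short sketch of the model. Since g(z) ≥ 0.5 exactly when z ≥ 0, the boundary is the straight line w₁x₁ + w₂x₂ + b = 0. The parameter values and the function name `predict_label` below are hypothetical, chosen only to illustrate the computation.

```python
import math


def sigmoid(z):
    """Logistic function g(z) = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_label(x1, x2, w1, w2, b):
    """Predict label 1 when g(z) >= 0.5, which is the same as z >= 0."""
    z = w1 * x1 + w2 * x2 + b
    return 1 if sigmoid(z) >= 0.5 else 0


# Hypothetical parameters: the decision boundary is the line x1 + x2 - 1 = 0.
w1, w2, b = 1.0, 1.0, -1.0

print(predict_label(2.0, 2.0, w1, w2, b))  # z = 3, on the positive side
print(predict_label(0.0, 0.0, w1, w2, b))  # z = -1, on the negative side
```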

If we add a few polynomial terms as features to this model: $z=w_1x_1+w_2x_2+w_3{x_1}^{2}+w_4{x_2}^{2}+w_5x_1x_2+b$

we get a decision boundary as shown below.

This fits the data quite well. The decision boundary classifies most of the points correctly and still retains a simple structure.

Moreover, intuitively, it feels like the red points in the middle of the blue points are incorrectly labelled in the data. The decision boundary also agrees with our intuition and tells us the same thing.
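The difference between the two feature sets can be sketched in code. The example below trains logistic regression by batch gradient descent, first on x₁ and x₂ alone and then with the three quadratic terms added. The data is synthetic (label 1 inside a circle, 0 outside), and the learning rate, step count and function names are assumptions for illustration rather than the setup behind the tutorial's figures.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data, assumed for illustration: label 1 inside a circle, 0 outside.
points = rng.uniform(-2.0, 2.0, size=(300, 2))
x1, x2 = points[:, 0], points[:, 1]
labels = (x1**2 + x2**2 < 1.5).astype(float)


def train_logistic(features, targets, learning_rate=0.1, steps=5000):
    """Fit weights and bias with batch gradient descent on the log loss."""
    weights = np.zeros(features.shape[1])
    bias = 0.0
    for _ in range(steps):
        z = features @ weights + bias
        probs = 1.0 / (1.0 + np.exp(-z))
        error = probs - targets
        weights -= learning_rate * (features.T @ error) / len(targets)
        bias -= learning_rate * error.mean()
    return weights, bias


def accuracy(features, targets, weights, bias):
    """Fraction of points whose predicted label (z >= 0) matches the target."""
    predicted = (features @ weights + bias >= 0).astype(float)
    return float(np.mean(predicted == targets))


linear_features = np.column_stack([x1, x2])
quadratic_features = np.column_stack([x1, x2, x1**2, x2**2, x1 * x2])

w_lin, b_lin = train_logistic(linear_features, labels)
w_quad, b_quad = train_logistic(quadratic_features, labels)
acc_linear = accuracy(linear_features, labels, w_lin, b_lin)
acc_quadratic = accuracy(quadratic_features, labels, w_quad, b_quad)

print("linear features, training accuracy:", acc_linear)
print("quadratic features, training accuracy:", acc_quadratic)
```

With only x₁ and x₂ the boundary must be a straight line, which cannot enclose a circular region. The quadratic terms let the same training procedure draw a curved boundary.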

But if we add more polynomial terms to this model, the decision boundary gets extremely convoluted.

The above model is formed by adding higher degree polynomial terms:

$z=w_1x_1+w_2x_2+w_3{x_1}^{2}+w_4{x_2}^{2}+w_5x_1x_2+w_6x_1^2x_2^2 + \dots + b$

This is a case of overfitting. The model classifies every single point correctly, but its decision boundary is unnecessarily complicated.

Typical machine learning workflow

Let’s see what a typical machine learning workflow looks like and how it incorporates the concepts we learnt in this tutorial.

Load the dataset.
Split the dataset into train dataset and test dataset. Typically, we use 70-80% of the data for training, and 20-30% of the data for testing.
Train the model using the train dataset. For example, we might have a linear regression model which we train using gradient descent. This gives us the final values for the model parameters.
Test the model on the test dataset. This tells us how well the model performs on data it has not seen before. In particular, model parameters should never be modified using the test dataset.

If we don’t like the final model we end up with, we can make any changes we want to the training process. Then, we redo the last two steps, training and testing.
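The four steps above can be sketched end to end as follows. This is a minimal illustration under stated assumptions: the dataset is synthetic (generated from y = 3x + 2 plus noise rather than loaded from a file), the split is 75/25, and the learning rate and number of iterations are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(42)

# Step 1: "load" the dataset. Here it is synthetic: y = 3x + 2 plus noise.
x = rng.uniform(0.0, 1.0, size=100)
y = 3.0 * x + 2.0 + rng.normal(scale=0.1, size=x.shape)

# Step 2: split into 75% training data and 25% test data.
indices = rng.permutation(len(x))
n_train = int(0.75 * len(x))
train_idx, test_idx = indices[:n_train], indices[n_train:]
x_train, y_train = x[train_idx], y[train_idx]
x_test, y_test = x[test_idx], y[test_idx]

# Step 3: train y = b + w*x with gradient descent, using training data only.
w, b = 0.0, 0.0
learning_rate = 0.1
for _ in range(2000):
    error = (w * x_train + b) - y_train
    w -= learning_rate * 2.0 * np.mean(error * x_train)
    b -= learning_rate * 2.0 * np.mean(error)

# Step 4: evaluate on the test data; the parameters are not updated here.
train_mse = float(np.mean((w * x_train + b - y_train) ** 2))
test_mse = float(np.mean((w * x_test + b - y_test) ** 2))
print(f"w = {w:.3f}, b = {b:.3f}")
print(f"train MSE = {train_mse:.4f}, test MSE = {test_mse:.4f}")
```

Note that the test arrays appear only in the final evaluation. Nothing computed from them feeds back into w or b.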

Summary

The dataset based on which we estimate the parameters of the model is known as the training set.
Once the model is trained, we must check whether the model can correctly predict outcomes on new and unseen data. This data is known as the test set.
The main characteristic we want in a machine learning model is the ability to generalize, i.e. perform well on previously unseen data.
When a model performs well on training data but not on test data, the model is said to be overfitting. This happens because the model learns the noise present in the training data as if it were a reliable pattern.
When a model fails to capture patterns present in the training data, then it is said to be underfitting.

In the next tutorial, we will learn about various strategies for handling overfitting and underfitting.