Overfitting is one of the most important problems (and concepts) in machine learning.
It's not a good idea to test a machine learning model on a dataset which we used to train it, since it won't give any indication of how well our model performs on unseen data. The ability to perform well on unseen data is called generalization, and is the desirable characteristic we want in a model.
When a model performs well on training data (the data on which the algorithm was trained) but does not perform well on test data (new or unseen data), we say that it has overfit the training data or that the model is overfitting. This happens because the model learns the noise present in the training data as if it was a reliable pattern.
Conversely, when a model does not perform well on training data (i.e. it fails to capture patterns present in the training data) as well as unseen data then it is said to be under-fitting. That is, the model is unable to capture patterns present in the training data.
A smaller dataset can significantly increase the chance of overfitting. This is because it is much tougher to separate reliable patterns from noise when the dataset is small. 
Suppose we have the following dataset (red points in the figure), where we have only one input variable x and one output variable y.
If we fit y = w0 + w1x to the above dataset, we get the straight line fit as shown above. Note that this is not a good fit since it is quite far from many data points. This is an example of under-fitting.
Now, if we add another feature x2 and fit y = w0 + w1x1 + w2x2 then we'll get a curve fit as shown above. (Side note: This is still a linear model. x2 is a feature, i.e. input. The weights are w's and they are interacting linearly with the features x and x2. The curve we are fitting is a quadratic curve). As you can see, this is slightly better since it passes much closer to the data points above.
If we keep adding more features we'll get a curve that is more and more complex and that passes through more and more data points. Above figure shows an example. This is an example of overfitting. In this case, we are performing polynomial fitting y = w0 + w1x1 + w2x2 + ... + wdxd.
Even though the fitted curve passes through almost all points, it won't perform well on unseen data.
One way to avoid overfitting it to collect more data. However, that is not always feasible. Below are some other strategies to overcome the problem of overfitting - regularization and cross-validation.
In regularization, we combat overfitting by controlling the model's complexity, i.e. by introducing an additional term in our cost function in-order to penalize large weights. This biases our model to be simpler, where simpler is weights of smaller magnitude (or even zero). We want to make the weights smaller, because complex models and overfitting are characterized by large weights. Recall the mean-squared error cost function,
In L2 regularization, a commonly used regularization technique, we add a penalty proportional to the squared magnitude of each weight. Our new cost function with L2 regularization is as follows:-
where, the first term is the same as in regular linear regression (without any regularization), and the second term is the regularization term. λ is a hyper-parameter that we choose and decides the regularization strength. Larger values of λ imply more regularization, i.e. smaller values for the model parameters. ||w2|| is w12 + w22 + ... wd2.
L2 regularization penalizes the larger weights more (since the penalty is proportional to the weight squared). For example, reducing w = 10 to w = 9 has a larger effect on the penalty term (102-92) than reducing w = 3 to w = 2 (32-22).
In L1 regularization, we the penalty term is λ ||w||. That is, our cost function is:
An interesting property of L1 regularization is that model's parameters become sparse during optimization, i.e. it promotes a larger number of parameters w to be zero. This is because smaller weights are equally penalized as larger weights, whereas in L2 regularizations, larger weights are being penalized much more. This sparse property is often quite useful. For example, it might help us identify which features are more important for making predictions, or it might help us reduce the size of a model (the zero values don't need to be stored).
Ordinary least square (which we saw earlier in linear regression) with L2 regularization is known as Ridge Regression and with L1 regularization it is known as Lasso Regression.
Cross Validation is a method for finding the best hyper-parameters of a model. For example, in gradient descent, we need to choose a stopping criteria. The simplest stopping criteria is to check whether our accuracy is improving on the training dataset. However, this is prone to overfitting since the model might be capturing noise present in the training data as reliable patterns.
We can overcome this problem by not using the entire training data while training a model. Instead we will hold out some data (validation dataset) and we'll train only on remaining data. For example, we can split our training dataset into 70/30 and use 70% data for training and 30% data for validation.
In the above example of gradient descent, now we train our algorithm on the training data, but check whether or not our model is getting better on the validation dataset. This is known as the holdout method and it is one of the simplest cross validation methods.
We can also use the validation data for other types of experimentation. Such as if we want to run multiple experiments where we choose different features to use to train our machine learning model.
In K-fold cross validation, the dataset is divided into k separate parts. We repeat training process k times. Each time, one part is used as validation data, and the rest is used for training a model. Then we average the error to evaluate a model. Note that k-fold cross validation increases the computational requirements for training our model by a factor of k.
The main advantages of k-fold cross validation are that
- It is more robust to over-fitting than the holdout method when performing large number of experiments.
- It is better to use when the dataset size is small. This is because when performing k-fold cross-validation, we can use a much smaller validation split (say 10% instead of 30%) since we are testing the model on various subsamples of the data being in the 10%.
Leave-one-out cross validation is a special instance of k-fold cross validation in which k is equal to the number of data points in the dataset. Each time, we hold out a single data point and train a model on rest of the data. We use the single data point to test our model. Then we calculate the average error to evaluate a model.
- Ensemble Methods are good learners in the case of small dataset. In ensemble methods, instead of training one model we train multiple models and average the predictions.