In this tutorial, we'll discuss some regularization methods in deep learning. Recall that machine learning models, including deep learning models, are susceptible to overfitting, and regularization methods are techniques to prevent or curtail it. We'll cover early stopping, data augmentation, and transfer learning. Another important regularization technique, dropout, which has proven very successful for deep learning models, is discussed in more detail in the next tutorial.
We train neural networks using gradient descent, which is an iterative algorithm. Below is a plot of training error (red) and validation error (purple), where the x-axis is the number of training iterations.
The idea behind early stopping is quite simple: stop training when the error starts to increase. Here, by error, we mean validation error, i.e. the error measured on validation data, the part of the training data set aside for tuning hyper-parameters. In this case, the hyper-parameter is the stopping criterion.
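The stopping criterion is often implemented with a "patience" window: stop once the validation error has failed to improve for a fixed number of epochs. Below is a minimal sketch; the hard-coded error values stand in for a real train/evaluate loop, which would recompute the validation error each epoch.

```python
def early_stopping(val_errors, patience=2):
    """Return the index of the epoch at which training would stop."""
    best_error = float("inf")
    epochs_without_improvement = 0
    for epoch, error in enumerate(val_errors):
        if error < best_error:
            best_error = error
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
        if epochs_without_improvement >= patience:
            return epoch  # validation error stopped improving
    return len(val_errors) - 1  # trained to the end without triggering

# Simulated validation errors: decreasing at first, then rising (overfitting).
errors = [0.9, 0.7, 0.5, 0.45, 0.47, 0.5, 0.55]
stop_epoch = early_stopping(errors, patience=2)  # stops at epoch 5
```

In practice you would also save the model weights at the best epoch and restore them after stopping, so the final model is the one with the lowest validation error, not the last one trained.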
Data augmentation means increasing the amount of data we have (i.e., augmenting it) by applying transformations to existing data. The exact transformations depend on the task at hand; moreover, which transformations help the neural net the most depends on its architecture.
For example, in many computer vision tasks such as object classification, an effective data augmentation technique is to create new data points that are cropped or translated versions of the original images. When a computer takes an image as input, it takes in an array of pixel values. Say the whole image is shifted left by 10 pixels: to us, this change is imperceptible, but to a computer it can be fairly significant, because the classification or label of the image doesn't change while the array does. Typically, we apply many different shifts in different directions, resulting in an augmented dataset 5-10x the size of the original. Other ideas include horizontal flips, vertical flips, color jitter, rotations, etc.
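Two of these transformations can be sketched on a toy "image" represented as a list of pixel rows. A real pipeline would use an image library, but the principle is the same: the label stays fixed while the pixel array changes.

```python
def shift_left(image, pixels, fill=0):
    """Shift every row left by `pixels`, padding the right edge with `fill`."""
    return [row[pixels:] + [fill] * pixels for row in image]

def horizontal_flip(image):
    """Mirror each row left-to-right."""
    return [row[::-1] for row in image]

# A tiny 2x4 "image" of pixel values.
image = [[1, 2, 3, 4],
         [5, 6, 7, 8]]

shifted = shift_left(image, 2)    # [[3, 4, 0, 0], [7, 8, 0, 0]]
flipped = horizontal_flip(image)  # [[4, 3, 2, 1], [8, 7, 6, 5]]
```

Each transformed array is a new training example paired with the original label, which is how a handful of transformations can multiply the dataset several times over.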
Transfer learning is the process of taking a pre-trained model (the weights and parameters of a network that has been trained on a large dataset by somebody else) and “fine-tuning” the model with your own dataset. There are multiple ways to do this; a couple are described below.
- Take a model pre-trained on the large dataset. Remove the last layer of the network and replace it with a new layer with random weights. Then freeze the weights of all the other layers and train the network normally (freezing a layer means not changing its weights during gradient descent / optimization). The idea is that the pre-trained model acts as a feature extractor, and only the last layer is trained on the current task.
- Take a model pre-trained on the large dataset. Then train the same network on the current dataset, but add a penalty for changing the weights. The weights then remain close to what they were after pre-training on the large dataset, and only minor adaptation towards the current dataset is allowed.
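The second strategy can be sketched as one gradient descent update with an extra term that pulls each weight back toward its pre-trained value. The task gradients below are made-up placeholders, and `lam` is an assumed hyper-parameter controlling how strongly the weights are anchored.

```python
def fine_tune_step(weights, pretrained, task_grads, lr=0.1, lam=0.5):
    """One update of w <- w - lr * (dL/dw + lam * (w - w_pre)).

    The lam * (w - w_pre) term is the gradient of a squared penalty
    (lam/2) * (w - w_pre)^2 that keeps w near its pre-trained value.
    """
    return [w - lr * (g + lam * (w - wp))
            for w, wp, g in zip(weights, pretrained, task_grads)]

pretrained = [1.0, -2.0, 0.5]
weights = list(pretrained)        # fine-tuning starts from the pre-trained values
task_grads = [0.2, -0.4, 0.0]     # gradients of the task loss (illustrative)

weights = fine_tune_step(weights, pretrained, task_grads)
```

On the first step the penalty gradient is zero (the weights equal their pre-trained values), so the update follows the task loss alone; as the weights drift, the penalty increasingly resists further movement.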
Let’s investigate why this works. Say the pre-trained model was trained on ImageNet for object classification (ImageNet is a dataset that contains 14 million images with over 1,000 classes). The lower layers of such a network detect generic features like edges and curves. Unless you have a very unusual problem space and dataset, your network is going to need to detect edges and curves as well. So rather than training the whole network from a random initialization of the weights, we can reuse the weights of the pre-trained model (and freeze them) and focus training on the layers that matter most for the new task (the ones higher up). If your dataset is quite different from something like ImageNet, you’d want to train more of your layers and freeze only a couple of the lowest ones.
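Freezing itself amounts to skipping the update for the frozen parameters. A minimal sketch, with illustrative layer names and one scalar weight per layer standing in for real parameter tensors:

```python
def sgd_with_freezing(params, grads, frozen, lr=0.01):
    """Update only the parameters of layers not in `frozen`."""
    return {name: (w if name in frozen else w - lr * grads[name])
            for name, w in params.items()}

# Hypothetical network: two pre-trained lower layers and a new final layer.
params = {"conv1": 0.30, "conv2": -0.10, "fc": 0.00}
grads  = {"conv1": 1.00, "conv2": 2.00, "fc": 0.50}

updated = sgd_with_freezing(params, grads, frozen={"conv1", "conv2"})
# conv1 and conv2 are unchanged; only the new final layer "fc" moves.
```

In a real framework this is usually done by marking the frozen parameters as not requiring gradients, so the optimizer never touches them.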
Some papers for further reading on transfer learning:
- Paper by Yoshua Bengio.
- Paper by Ali Sharif Razavian.
- Paper by Jeff Donahue.
- Paper and subsequent paper by Dario Garcia-Gasulla.
Happy (deep) learning! Don't forget to read about dropout.