Hands-on Assignment: Training your first Neural Network

April 27, 2018

As part of this assignment, you’ll get to train your own neural network. You’ll go through the various decisions that one needs to make when training a neural network, such as the architecture of the network, the type of activations to use, whether to normalize the input, and how to initialize the weights. Most importantly, you’ll also learn the basics of the most modern Deep Learning library, PyTorch!

Overview

We will train a neural network for a pretty simple task: calculating the exclusive-or (XOR) of two inputs. Note that even though XOR is a very simple function, many machine learning models such as Support Vector Machines [2], Linear Regression, and Logistic Regression can’t learn it, no matter how much data they are given (assuming no feature engineering). Overall, the steps are going to be: (1) Understand the code, (2) Take the quiz, and (3) Do the experiments.
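To make the claim about linear models concrete, here is a minimal sketch (not part of the assignment code) that fits the best possible linear model to the four noise-free XOR points using NumPy least squares. The variable names are illustrative choices.

```python
import numpy as np

# The four XOR input/output pairs (noise-free truth table).
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 1, 1, 0], dtype=float)

# Append a column of ones so the linear model can learn a bias term.
X_with_bias = np.hstack([X, np.ones((4, 1))])

# Best linear fit in the least-squares sense.
weights, *_ = np.linalg.lstsq(X_with_bias, y, rcond=None)
predictions = X_with_bias @ weights

print(weights)      # input weights are (numerically) zero, bias is 0.5
print(predictions)  # every prediction is 0.5
```

The best a linear model can do is predict 0.5 for every input, which carries no information about the XOR label. A hidden layer with a non-linear activation is what lets the network escape this limitation.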

Why PyTorch?: PyTorch is the most modern deep learning library, and innovates on a number of things over TensorFlow / Keras. PyTorch code is often much more intuitive, and looks more like normal Python as compared to other deep learning libraries [1]. All of this saves time both for students and professionals. Although TensorFlow is currently more popular, PyTorch is the library of choice for researchers and teams in industry who have the option to start afresh.

You’ll learn a lot! Let’s get started!

Setup

Option 1: Run the code on Google Colab

Notebook Link

You can also play with this project directly in-browser via Google Colaboratory using the link above. Google Colab is a free tool that lets you run small Machine Learning experiments through your browser. You should read this 1 min tutorial if you’re unfamiliar with Google Colaboratory.

Option 2: Install PyTorch on your own system

The installation command for PyTorch can be found here: PyTorch. For CUDA, say “None”. (Projects in this course assume you are running the code on your own computer.) If you are on Windows, only Python 3 is supported. If you are not familiar with Python 3, don’t worry. The differences as compared to Python 2 will be quite minor for our use cases.

Dataset

The following code can be used to create the dataset.

###############################################################################
## load data
import random
import numpy as np
# random.seed(42)
def make_data():
    x1 = random.randint(0, 1)
    x2 = random.randint(0, 1)
    yy = 0 if (x1 == x2) else 1
    # x1 = 2. * (x1 - 0.5)
    # x2 = 2. * (x2 - 0.5)
    # yy = 2. * (yy - 0.5)
    # add noise
    x1 += 0.1 * random.random()
    x2 += 0.1 * random.random()
    yy += 0.1 * random.random()
    return [x1, x2, ], yy
batch_size = 10
def make_batch():
    data = [make_data() for ii in range(batch_size)]
    labels = [label for xx, label in data]
    data = [xx for xx, label in data]
    return np.array(data, dtype='float32'), np.array(labels, dtype='float32')
print(make_batch())
print(make_batch())
print(make_batch())
train_data = [make_batch() for ii in range(500)]
test_data = [make_batch() for ii in range(50)]
###############################################################################

Go through the above code line by line. Some notes:

make_data creates a single example.
It first randomly samples x1 and x2, the inputs. Then, the correct answer, yy = XOR(x1, x2) is calculated.
Commented out code can be used to change the range of the inputs and outputs to be (-1, 1) instead of (0, 1). This will be useful during experimentation.
Lastly, it adds noise to the data points, so that the train and test datasets are different, and the network is asked to classify new points not seen before.
make_batch batches together data points, i.e. it creates mini-batches. This is the standard way of processing data in deep learning, since the most popular method for training uses mini-batch gradient descent.
Finally, the random seed can be set (at the top of the file) if we want the code to produce the exact same data points each time it is run. This will be useful during experimentation.
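The notes above can be checked with a short, self-contained sketch. It repeats the data functions in a slightly condensed form (batch_size is passed as a parameter here, which is a small deviation from the assignment code) and shows the batch shapes as well as the effect of seeding Python's random module. The seed value 42 is an arbitrary choice.

```python
import random
import numpy as np

def make_data():
    # Same sampling as the assignment's make_data: XOR label plus small noise.
    x1 = random.randint(0, 1)
    x2 = random.randint(0, 1)
    yy = 0 if (x1 == x2) else 1
    x1 += 0.1 * random.random()
    x2 += 0.1 * random.random()
    yy += 0.1 * random.random()
    return [x1, x2], yy

def make_batch(batch_size=10):
    examples = [make_data() for _ in range(batch_size)]
    inputs = np.array([xx for xx, _ in examples], dtype='float32')
    labels = np.array([label for _, label in examples], dtype='float32')
    return inputs, labels

random.seed(42)  # arbitrary seed value
first_inputs, first_labels = make_batch()

random.seed(42)  # re-seeding restarts the same random sequence
second_inputs, second_labels = make_batch()

print(first_inputs.shape, first_labels.shape)       # (10, 2) (10,)
print(np.array_equal(first_inputs, second_inputs))  # True
```

Each batch is a pair: a (10, 2) array of inputs and a (10,) array of labels. Without the calls to random.seed, the two batches would almost certainly differ.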

Defining our neural network

The following code defines our neural network, and the optimization method.

###############################################################################
## model
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.autograd import Variable
# torch.manual_seed(42)
class NN(nn.Module):
    def __init__(self):
        super(NN, self).__init__()
        self.dense1 = nn.Linear(2, 2)
        self.dense2 = nn.Linear(2, 1)
        print(self.dense1.weight)
        print(self.dense1.bias)
        print(self.dense2.weight)
        print(self.dense2.bias)
        # self.dense1.weight.data.uniform_(-1.0, 1.0)
        # self.dense1.bias.data.uniform_(-1.0, 1.0)
        # self.dense2.weight.data.uniform_(-1.0, 1.0)
        # self.dense2.bias.data.uniform_(-1.0, 1.0)
    def forward(self, x):
        x = F.sigmoid(self.dense1(x))
        x = self.dense2(x)
        return torch.squeeze(x)
model = NN()
## optimizer = stochastic gradient descent
optimizer = optim.SGD(model.parameters(), lr=0.01)
###############################################################################

Again, go through the above code line by line. Some notes:

The model defines the forward function. This function defines how the output is computed given the input. We do not need to implement the gradients.

The main tool that deep learning libraries (like PyTorch) provide us is automatic gradient calculation. They are able to do this because they have defined the gradient for every primitive we are allowed to use. For example, someone has already implemented how to compute the gradients for F.sigmoid and torch.squeeze used above. Using those gradients and combining them with the chain rule for differentiation, PyTorch is able to calculate gradients for our newly defined neural network.
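As a small illustration of automatic gradient calculation, the sketch below asks PyTorch for the gradient of a sigmoid at a single point and compares it with the hand-derived formula. It uses torch.sigmoid, which computes the same function as F.sigmoid; the input value 0.5 is an arbitrary choice.

```python
import torch

# A single input value that we want a gradient with respect to.
x = torch.tensor(0.5, requires_grad=True)

# Forward pass: PyTorch records the operations applied to x.
y = torch.sigmoid(x)

# Backward pass: fills x.grad with dy/dx.
y.backward()

# Hand-derived gradient: sigmoid'(x) = sigmoid(x) * (1 - sigmoid(x)).
analytic = (y * (1 - y)).item()

print(x.grad.item(), analytic)  # the two values agree up to float precision
```

The same mechanism is what runs when the training loop calls loss.backward(), only with many more operations chained together.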

Training and Evaluating the network

The code below can be used to train and test the neural network.

###############################################################################
## train and test functions
def train(epoch):
    model.train()
    for batch_idx, (data, target) in enumerate(train_data):
        data, target = Variable(torch.from_numpy(data)), Variable(torch.from_numpy(target))
        optimizer.zero_grad()
        output = model(data)
        loss = F.mse_loss(output, target)
        loss.backward()
        optimizer.step()
        if batch_idx % 100 == 0:
            print('Train Epoch: {} {}\tLoss: {:.4f}'.format(epoch, batch_idx * len(data), loss.item()))
def test():
    model.eval()
    test_loss = 0
    correct = 0
    # volatile=True was removed in PyTorch 0.4; torch.no_grad() now disables gradient tracking
    with torch.no_grad():
        for data, target in test_data:
            data, target = Variable(torch.from_numpy(data)), Variable(torch.from_numpy(target))
            output = model(data)
            test_loss += F.mse_loss(output, target)
            # a prediction counts as correct if it rounds to the same value as the target
            correct += (np.around(output.data.numpy()) == np.around(target.data.numpy())).sum()
    test_loss /= len(test_data)
    test_loss = test_loss.item()
    print('\nTest set: Average loss: {:.4f}, Accuracy: {}/{} ({:.2f}%)\n'.format(
        test_loss, correct, batch_size * len(test_data), 100. * correct / (batch_size * len(test_data))) )
###############################################################################
## run
nepochs = 100
for epoch in range(1, nepochs + 1):
    train(epoch)
    test()
###############################################################################

Again, go through the above code line by line. Some notes:

During training:
We iterate over the batches of data. For each batch, we calculate the output from the model. Then, we calculate the loss, given the output from the model and the target.
Lastly, we use loss.backward() to calculate the gradients, and optimizer.step() to update the weights once the gradients have been calculated.
During testing, we calculate the output, the loss, and the number of predictions that are correct.

We go over the entire dataset 100 times. Going over the entire dataset once is called an epoch.
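To see what optimizer.step() does with the gradients, here is a minimal sketch on a single linear layer. It computes the plain SGD update by hand (parameter minus learning rate times gradient) and checks it against the optimizer. The layer size, the two data points, and the seed are illustrative assumptions, not values from the assignment.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

torch.manual_seed(0)  # arbitrary seed for repeatable initial weights
lr = 0.01
layer = nn.Linear(2, 1)
optimizer = optim.SGD(layer.parameters(), lr=lr)

# Two made-up examples, just to obtain a non-zero gradient.
data = torch.tensor([[0.0, 1.0], [1.0, 1.0]])
target = torch.tensor([1.0, 0.0])

optimizer.zero_grad()
loss = F.mse_loss(torch.squeeze(layer(data)), target)
loss.backward()

# What plain SGD should produce: parameter - lr * gradient.
expected_weight = layer.weight.detach() - lr * layer.weight.grad
expected_bias = layer.bias.detach() - lr * layer.bias.grad

optimizer.step()

print(torch.allclose(layer.weight.detach(), expected_weight))  # True
print(torch.allclose(layer.bias.detach(), expected_bias))      # True
```

This holds for optim.SGD with its default settings (no momentum, no weight decay); other optimizers apply different update rules to the same gradients.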

Take the Quiz

Save all the code above into a file on your computer. Read the code. Run it, and add print statements wherever something is unclear. Then take the quiz (no time limit), which guides you through the parts of the code you should be paying attention to.

Understanding our NN more (deep)ly

Hope you did well on the quiz! Time to do some experiments with our neural network.

Experiment 1 (activation units):

Replace the sigmoid activation with tanh and then with relu. Run the model and note the differences in accuracy. Make sure you run with each activation multiple times to get a reliable measure.

Which one works best? Which one performs worst? Why?
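One way to make switching activations less error-prone is to pass the activation in as an argument. The sketch below is a hypothetical variant of the NN class (the name FlexibleNN and the activation parameter are not part of the assignment code); editing the forward function by hand works just as well.

```python
import torch
import torch.nn as nn

class FlexibleNN(nn.Module):
    """Same 2-2-1 layout as NN, with a pluggable activation (hypothetical helper)."""

    def __init__(self, activation=torch.sigmoid):
        super().__init__()
        self.dense1 = nn.Linear(2, 2)
        self.dense2 = nn.Linear(2, 1)
        self.activation = activation

    def forward(self, x):
        x = self.activation(self.dense1(x))
        x = self.dense2(x)
        return torch.squeeze(x)

# Build one model per activation and confirm the output shape for a batch of 10.
for name, activation in [('sigmoid', torch.sigmoid),
                         ('tanh', torch.tanh),
                         ('relu', torch.relu)]:
    model = FlexibleNN(activation)
    output = model(torch.zeros(10, 2))
    print(name, tuple(output.shape))
```

Remember that the optimizer must be created from the parameters of whichever model instance is being trained.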

Experiment 2 (weight initialization):

Uncomment the lines in __init__ so that the weights are initialized uniformly in the range -1 to 1. Re-run the model multiple times each with sigmoid, tanh, and relu activations.

Did you notice any improvement for any activation type?

Comment out the lines after you’re done.
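For reference, the same kind of initialization can also be written with the torch.nn.init module. This is an alternative sketch, not the assignment's approach, shown on a single stand-alone layer.

```python
import torch
import torch.nn as nn

torch.manual_seed(42)  # arbitrary seed so the printed values repeat
layer = nn.Linear(2, 2)

# Overwrite the default initialization with samples from the uniform range -1 to 1.
nn.init.uniform_(layer.weight, -1.0, 1.0)
nn.init.uniform_(layer.bias, -1.0, 1.0)

print(layer.weight)
print(layer.bias)
```

Printing the weights before and after the call is a quick way to compare the new range with PyTorch's default initialization.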

Experiment 3 (input normalization):

Do the same thing as above after centering the input around 0 (uncomment the lines in make_data).
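The centering formula in make_data is small enough to check by hand. This sketch wraps it in a helper (the name center is illustrative) to show where the two possible input values end up.

```python
def center(value):
    # Same formula as the commented-out lines in make_data.
    return 2. * (value - 0.5)

print(center(0), center(1))  # -1.0 1.0
```

Because the noise in make_data is added after this step, the centered inputs end up slightly above -1 or slightly above 1.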

Experiment 4 (termination criteria):

When you use sigmoid activation, notice how the loss function flattens out, and then improves a lot, and then flattens, and then improves a lot.

This kind of behavior is common for some problems (although not always), and it is quite tough to know whether the network has stopped learning, or just needs more time. It becomes a much more thorny issue if the training time of the model is in days instead of seconds, which is often the case for deep learning.

Nothing to do in this experiment. Just make sure you look at the losses, and understand what the above paragraph is saying.

Experiment 5 (learning rate):

For this experiment, use the tanh activation function and inputs centered around 0. Use random seed 42 for both Python and torch (uncomment the seed lines in the code).

Try various values for the learning rate.

For what range of values does the network learn and reach 100% accuracy? For what values does the network training diverge and become unstable? For what values does it learn too slowly to reach optimal accuracy in 100 epochs?
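The three regimes (too slow, working, diverging) can be seen in isolation on a toy problem. The sketch below runs plain gradient descent on f(w) = w**2, which is a deliberately simplified stand-in for the network's loss; the specific learning rates are illustrative and the thresholds do not transfer to the XOR network.

```python
def gradient_descent(lr, steps=20, w=1.0):
    # Minimize f(w) = w**2, whose gradient is 2 * w.
    for _ in range(steps):
        w = w - lr * 2 * w
    return w

# Distance from the minimum (w = 0) after 20 steps, for several learning rates.
for lr in [0.001, 0.1, 0.5, 1.1]:
    print(lr, abs(gradient_descent(lr)))
```

With a very small learning rate, w has barely moved after 20 steps. With a moderate one, it gets close to the minimum. With lr = 1.1, each step overshoots by more than it corrects, so the distance grows instead of shrinking.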

Experiment Bonus (extend the neural network):

Extend the code above to solve XOR for 3 inputs, 4 inputs, 5 inputs.

At what points did you need to change the size of the hidden layer for the network to achieve 100% accuracy?
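A possible starting point for the data side of this experiment is sketched below. It assumes that "XOR of n inputs" means parity, i.e. the label is 1 when an odd number of inputs are 1; the num_inputs parameter is a hypothetical addition to make_data.

```python
import random

def make_data(num_inputs=3):
    # Assumption: XOR of n inputs is parity (1 when an odd number of inputs are 1).
    xs = [random.randint(0, 1) for _ in range(num_inputs)]
    yy = sum(xs) % 2
    # add noise, as in the two-input version
    xs = [x + 0.1 * random.random() for x in xs]
    yy += 0.1 * random.random()
    return xs, yy

random.seed(42)  # arbitrary seed value
inputs, label = make_data(num_inputs=3)
print(inputs, label)
```

The first layer of the network then needs to accept num_inputs features, for example nn.Linear(num_inputs, hidden_size), where hidden_size is the quantity this experiment asks you to vary.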

Conclusion

Overall, once you are done with this assignment, make sure you understand how the various pieces you have learnt so far in this course tie together. In particular, notice the following.

How multiple layers helped you solve a problem (learning XOR) that linear regression or logistic regression or Support Vector Machines [2] can’t solve.
The power of the gradient descent optimization method to solve problems even though the cost function is not convex.
The ease in training a new neural network with frameworks like PyTorch.

Hope you learnt a lot from this assignment. Happy (deep) learning!

Footnotes

[1] The main advantage of PyTorch is that the computation graph is dynamic instead of static, which means that there is no compile step between defining and executing the computation graph. The importance of this difference gets more and more clear as deep learning architectures get more complex, such as recurrent neural networks and deep reinforcement learning models.
[2] Support Vector Machines can solve XOR with the use of kernels.