CommonLounge Archive

Hands-on Assignment: Text Generation using Recurrent Neural Networks

April 27, 2018

In this assignment, you’ll be implementing a recurrent neural network to generate text one letter at a time. In particular, we’ll start with a corpus of names extracted from news articles, mostly consisting of names of politicians and celebrities. We’ll train an RNN on this data, and use it to generate new names! In the process, you’ll also encounter adaptive learning rates and learn to generate random samples from an RNN. Let’s get started!

Dataset

After you download ner_dataset.csv, you can use the following code to load the data. The dataset is part of CoNLL, and was created for the purpose of named entity recognition. We use the dataset to extract names of people from the text. By changing a few lines in this code, you'll easily be able to generate organization names, nouns, locations, etc. (for the first pass, though, I recommend sticking to the code given below).

from __future__ import print_function
###############################################################################
## load data
import re
import pandas
data = pandas.read_csv('ner_dataset.csv')
words = list(data['Word'])
tags  = list(data['Tag'])
names = list()
for word, tt in zip(words, tags):
    if tt == 'O': continue
    if not tt.endswith('-per'): continue
    if tt == 'B-per':
        names.append(word)
    else:
        names[-1] = names[-1] + ' ' + word
# Only alphabets, hyphens and spaces allowed.
# Minimum 2 words. (Filters out names like "Bush")
names = [name for name in names if re.match(r'^[A-Za-z\- ]*$', name) and ' ' in name]
names = list(set(names)) # unique
print('Number of names:', len(names))
train_data = names[:3600]
test_data = names[3600:4000]
print(len(train_data), train_data[:10])
all_letters = 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ- $' # $ denotes BOS / EOS
n_letters = len(all_letters)
###############################################################################

The code produces the following output.

Number of names: 4345
3600 ['Minister Bernard Bot', 'Saad al-Hamash', 'Solomon Passy', 'President Sepp Blatter', 'Marco Antonio Muniz -', 'Herb Alpert', 'Mohammed Reza Heydari', 'Yuri Gagarin', 'Edgar Ugalde', 'Bashar al-ASAD']

Every name has at least 2 words, and often the name contains titles, such as Minister or President. Apart from letters, only hyphens and spaces are allowed. In addition, we'll use $ as a special character to denote beginning-of-sequence or end-of-sequence (abbreviated BOS and EOS).

The training data has 3600 names and the test data has 400 names. Since we are training one character at a time, the total number of training instances is actually 3600 * (average name length).
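
As mentioned above, extracting a different entity type only requires changing the two tag checks in the loading loop. For example, here is a sketch for organization names, assuming the dataset labels them with the B-org and I-org tags (check the Tag column of the CSV to confirm):

names = list()
for word, tt in zip(words, tags):
    if tt == 'O': continue
    if not tt.endswith('-org'): continue   # was '-per'
    if tt == 'B-org':                      # was 'B-per'
        names.append(word)
    else:
        names[-1] = names[-1] + ' ' + word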

Implementing our Recurrent Neural Network

Below is the template code for the RNN. You’ll need to fill in all the ....

###############################################################################
## Creating the Network
import torch
import torch.nn as nn
from torch.autograd import Variable
import torch.nn.functional as F
class RNN(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super(RNN, self).__init__()
        self.hidden_size = hidden_size
        ## declare the layers
        # rnn layer 1. input = input + hidden values from previous time step
        self.rnn1 = nn.Linear(...) 
        # rnn layer 2. input = hidden values from rnn1 + hidden values from previous time step
        self.rnn2 = nn.Linear(...) 
        # final layer. output from rnn2 to final output  
        self.output = nn.Linear(...)
        self.softmax = nn.LogSoftmax(dim=1)
    def forward(self, input, hidden):
        hidden1 = ...
        # dropout layer (optional) 
        hidden2 = ...
        # dropout layer (optional)
        output = ...
        return output, (hidden1, hidden2)
    def init_hidden(self):
        # initialize hidden layers to 0 at the beginning
        return (Variable(torch.zeros(1, self.hidden_size)), Variable(torch.zeros(1, self.hidden_size)))
rnn = RNN(n_letters, 128, n_letters)
criterion = nn.NLLLoss() # negative log likelihood
###############################################################################
## Preparing Data for Training
# One-hot matrix of first to last letters (+BOS) for input
def input_tensor(name, bos=True):
    input = ('$' + name) if bos else name
    tensor = torch.zeros(len(input), 1, n_letters)
    for idx, letter in enumerate(input):
        tensor[idx][0][all_letters.find(letter)] = 1
    return tensor
# Index of first letter to last letter (+EOS) for target
def target_tensor(name):
    target = name + '$'
    letter_indexes = [all_letters.find(letter) for letter in target]
    return torch.LongTensor(letter_indexes)
print(input_tensor('Keshav Dhandhania'))
print(target_tensor('Keshav Dhandhania')) 
###############################################################################

Some notes.

  • The above outline is for implementing a simple RNN (not an LSTM) with 2 layers. The RNN module above represents a single time step. That is, it is called as: output, hidden_current = rnn(input, hidden_previous).
  • The input is a one-hot encoding of the characters. For example, letter c is represented by a vector of size 55 with 3rd dimension = 1 and all other dimensions = 0. Here, 55 is the total number of possible characters.
  • In many places in the code, there is an extra dimension of size 1. This is because in our case, the batch size is 1. (i.e. we’ll train one name at a time). For example, the shape of input_tensor is (len(input), 1, n_letters) and the shape of each hidden layer is (1, self.hidden_size).
  • Make sure to use the sigmoid activation for the hidden layers. Usually, we try to interpret the activations of hidden units in an RNN as on and off, where 1 denotes on and 0 denotes off. One possible way to wire all of this up is sketched right after these notes.
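
For reference, here is a minimal sketch of one way the blanks could be filled in, assuming sigmoid activations on both hidden layers and no dropout. The class name RNNSketch is only used to keep it separate from your own RNN; treat this as a hint rather than the reference solution.

import torch
import torch.nn as nn
from torch.autograd import Variable

class RNNSketch(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super(RNNSketch, self).__init__()
        self.hidden_size = hidden_size
        # layer 1 sees the current input concatenated with its own previous hidden state
        self.rnn1 = nn.Linear(input_size + hidden_size, hidden_size)
        # layer 2 sees layer 1's output concatenated with its own previous hidden state
        self.rnn2 = nn.Linear(hidden_size + hidden_size, hidden_size)
        # map layer 2's hidden state to one score per character
        self.output = nn.Linear(hidden_size, output_size)
        self.softmax = nn.LogSoftmax(dim=1)
    def forward(self, input, hidden):
        hidden1 = torch.sigmoid(self.rnn1(torch.cat((input, hidden[0]), dim=1)))
        hidden2 = torch.sigmoid(self.rnn2(torch.cat((hidden1, hidden[1]), dim=1)))
        output = self.softmax(self.output(hidden2))
        return output, (hidden1, hidden2)
    def init_hidden(self):
        return (Variable(torch.zeros(1, self.hidden_size)),
                Variable(torch.zeros(1, self.hidden_size)))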

Training and Testing

Below is the code for training and testing the RNN.

###############################################################################
## Training
import sys
log_interval = 100
learning_rate = 0.01
previous_loss = 100.0
def train():
    global learning_rate, previous_loss
    print('Using learning rate: %.5f' % learning_rate)
    rnn.train()
    running_loss = 0.0
    for ii, name in enumerate(train_data):
        if ii % log_interval == 0:
            print('.', end='')
            sys.stdout.flush()
        input, target = Variable(input_tensor(name)), Variable(target_tensor(name))
        rnn.zero_grad()
        hidden = rnn.init_hidden()
        loss = 0
        for idx in range(len(name)+1):
            output, hidden = rnn(input[idx], hidden)
            loss += criterion(output, target[idx].unsqueeze(0))
        loss.backward()
        # torch.nn.utils.clip_grad_norm(rnn.parameters(), 0.25)
        for p in rnn.parameters():
            p.data.add_(p.grad.data, alpha=-learning_rate)  # manual SGD update
        running_loss += loss.item()
    print('')
    avg_loss = running_loss / sum(len(name) for name in train_data)
    if previous_loss < avg_loss: learning_rate *= 0.8
    previous_loss = avg_loss
    print('Training loss: %.3f' % avg_loss)
###############################################################################
## Testing
def test():
    rnn.eval()
    running_loss = 0.0
    for name in test_data:
        input, target = Variable(input_tensor(name)), Variable(target_tensor(name))
        hidden = rnn.init_hidden()
        loss = 0
        for idx in range(len(name)+1):
            output, hidden = rnn(input[idx], hidden)
            loss += criterion(output, target[idx].unsqueeze(0))
        running_loss += loss.item()
    avg_loss = running_loss / sum(len(name) for name in test_data)
    print('Testing loss: %.3f' % avg_loss)
###############################################################################

Some notes.

  1. Adaptive learning rate: Every time our training error increases, we reduce the learning rate by a multiplicative factor.
  2. Since we have an adaptive learning rate, we're updating the weights ourselves instead of using the built-in SGD optimizer (a rough sketch of the equivalent with the built-in optimizer and scheduler follows these notes).
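
For comparison, here is a rough sketch of how similar behavior could be obtained with PyTorch's built-in SGD optimizer and the ReduceLROnPlateau scheduler. It is not an exact match for the code above (ReduceLROnPlateau compares against the best loss seen so far rather than only the previous epoch), but the idea is the same.

import torch.optim as optim

optimizer = optim.SGD(rnn.parameters(), lr=0.01)
scheduler = optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode='min', factor=0.8, patience=0)

# inside the per-name training loop, instead of the manual parameter update:
#     optimizer.zero_grad()
#     loss.backward()
#     optimizer.step()
# once per epoch, after computing avg_loss:
#     scheduler.step(avg_loss)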

Generating samples

Below is the template code for generating random samples. You’ll need to fill in all the ....

###############################################################################
## Sampling the network
from numpy.random import choice
max_length = 50
# Sample given a starting letter
def sample(prefix=''):
    rnn.eval()
    output_string = prefix
    ... # get initial hidden vector
    for idx in range(max_length):
        input = Variable(input_tensor(output_string))
        ... # execute one time step of rnn 
        if idx < len(prefix): continue # still 'feeding' in the prefix, no need to change output_string
        probabilities = ... # calculate probabilities from output
        sample_idx = ... # sample idx between (0, n_letters) using numpy choice
        if sample_idx == n_letters - 1: break # EOS 
        letter = ... 
        output_string = ... 
    print('Prefix = "%s".' % prefix, 'Generated string =', output_string)
    return output_string
# Get multiple samples: a few with no prefix, then one for each starting letter
def samples():
    for i in range(5):
        sample()
    for start_letter in 'ABCDEFGHIJKLMNOPQRSTUVWXYZ':
        sample(start_letter)
###############################################################################
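
If you are stuck on the probabilities and sample_idx blanks, here is a small sketch of one way they could be computed. The network's final layer is a LogSoftmax, so output holds log-probabilities; exponentiating them recovers a distribution that numpy.random.choice can sample from. The helper name sample_next_letter is made up for illustration.

def sample_next_letter(output):
    # output: the (1, n_letters) log-probability tensor returned by the rnn for one step
    probabilities = output.data.view(-1).exp().numpy()
    probabilities = probabilities / probabilities.sum()  # guard against rounding drift
    return choice(n_letters, p=probabilities)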

Running everything

You can use the following code to run.

###############################################################################
## run
import datetime
start = datetime.datetime.now()
def time_since(since):
    delta = datetime.datetime.now() - since
    return '%dm %ds' % (delta.total_seconds() / 60.0, delta.total_seconds() % 60.0)
nepochs = 100
for iepoch in range(1, nepochs+1):
    print('='*100)
    print('Starting epoch', iepoch)
    train()
    test()
    samples()
    print('Time since start:', time_since(start))
###############################################################################

It produces the following output for me. Your output will look different, but make sure the test loss is roughly the same or lower, and that the generated samples look as “sensible” as mine (see the bottom half of the output for the final results).

Number of names: 4345
3600 ['Minister Bernard Bot', 'Saad al-Hamash', 'Solomon Passy', 'President Sepp Blatter', 'Marco Antonio Muniz -', 'Herb Alpert', 'Mohammed Reza Heydari', 'Yuri Gagarin', 'Edgar Ugalde', 'Bashar al-ASAD']
========================================================================================================================
Starting epoch 1
Using learning rate: 0.01000
Training loss: 3.417
Testing loss: 3.206
Prefix = "". Generated string = Ci
Prefix = "". Generated string = Aheoao
Prefix = "". Generated string = BuaLb Eadlaaumovtr
Prefix = "". Generated string = Ruh
Prefix = "". Generated string = Liu Len
Prefix = "A". Generated string = An Ro Eii
Prefix = "B". Generated string = Blaaader
Prefix = "C". Generated string = Caayvsvh
Prefix = "D". Generated string = Dhna
Prefix = "E". Generated string = Eocaheoxrlrthfndn
Prefix = "F". Generated string = Frnaoamnomh Mnr
Prefix = "G". Generated string = Ga Ceiaminrabazv
Prefix = "H". Generated string = HicapeiraoaeadyVhs LuzodnnwouazRoeane
Prefix = "I". Generated string = Ian Genh
Prefix = "J". Generated string = Jo
Prefix = "K". Generated string = Khranle
Prefix = "L". Generated string = Ljy
Prefix = "M". Generated string = Mamehamka Annc a
Prefix = "N". Generated string = NalaoylalOiBVaatr Rata Pa
Prefix = "O". Generated string = OeU
Prefix = "P". Generated string = Pulaanamlraie Ietd
Prefix = "Q". Generated string = Qawf
Prefix = "R". Generated string = Ruhr DogojA PsnidultoAat
Prefix = "S". Generated string = Slosalchslrddirraa
Prefix = "T". Generated string = Tu-
Prefix = "U". Generated string = Uoao
Prefix = "V". Generated string = Vana
Prefix = "W". Generated string = Wntiinaaal Aheohpa Medaciuav
Prefix = "X". Generated string = Xooi
Prefix = "Y". Generated string = YurrddluhucyyLe
Prefix = "Z". Generated string = ZamI Gorg re Nrmaidadtwa
Time since start: 0m 23s
========================================================================================================================
...
... output from 98 more epochs ... 
... 
========================================================================================================================
Starting epoch 100
Using learning rate: 0.00023
....................................
Training loss: 2.196
Testing loss: 2.179
Prefix = "". Generated string = Michael Unawz
Prefix = "". Generated string = General Ramas Jaeed
Prefix = "". Generated string = Eliva Baky
Prefix = "". Generated string = Neszather Postoffva
Prefix = "". Generated string = General Vhyaur Fertod
Prefix = "A". Generated string = Abu Kheel
Prefix = "B". Generated string = Bour Raye
Prefix = "C". Generated string = Carmethon Garkhor
Prefix = "D". Generated string = David Aderduel
Prefix = "E". Generated string = Engrin Baedianj Prgwewt
Prefix = "F". Generated string = Freel Condonmou
Prefix = "G". Generated string = Ghyissaik Alizma
Prefix = "H". Generated string = Hudhold Cantestve
Prefix = "I". Generated string = Indwuel Lophun
Prefix = "J". Generated string = Jayon Perez
Prefix = "K". Generated string = Kim Dong
Prefix = "L". Generated string = Lloring Timdimal
Prefix = "M". Generated string = Minister Aybur Hushean Akaza
Prefix = "N". Generated string = Norper Stacer
Prefix = "O". Generated string = Olad Abdeli Charlan So
Prefix = "P". Generated string = President Bashoro A
Prefix = "Q". Generated string = Qurieg er Javin Pute
Prefix = "R". Generated string = Riad Sin
Prefix = "S". Generated string = Said Houssevina
Prefix = "T". Generated string = Tamethijo Allam
Prefix = "U". Generated string = Ulivavid Berguwnerg
Prefix = "V". Generated string = Vartina Borminder
Prefix = "W". Generated string = Woly Laministeda
Prefix = "X". Generated string = Xrubant Kerky
Prefix = "Y". Generated string = Yousef Ban Paltron
Prefix = "Z". Generated string = Zuni Alsk
Time since start: 39m 39s

Results and discussion

You should be able to achieve a test loss of less than 2.20, with a total training time of less than 1 hour on a standard CPU.

Notice how good some of the random samples from the network look. It produces names like Michael, David, Kim and Yousef character-by-character. It also produces titles such as President, Minister and General. It has learned that most names consist of two words, and that names are longer when titles are involved. Most words sound like a name, even if such names don’t exist. Plus, our dataset wasn’t that large: just 3600 names!

Note: Your generated text will look much cleaner if you overfit, but the test loss will be higher. Make sure your test loss is comparable or better; in that case, the samples will look as clean and sensible as the ones shown above.

Solution on Google Colaboratory

Notebook Link

You can play with our solution directly in-browser via Google Colaboratory using the link above. Google Colab is a free tool that lets you run small Machine Learning experiments through your browser. You should read this 1 min tutorial if you’re unfamiliar with Google Colaboratory.

Parting notes

The solution to this assignment can be found here: Solution to Hands-on Assignment: Text Generation using Recurrent Neural Networks. Please use it only for checking your solution, and only to take small hints for the specific part you are stuck on.

A neural network very similar to this one can be used for word-level models. Those need a lot more computation, since the vocabulary is then 10,000+ words, compared to the ~55 possible characters in this assignment. Such models power Google Translate, and represent the state of the art in language modeling, question answering and Wikipedia compression.

Hope you enjoyed the assignment. Happy (deep) learning!

