Prerequisites: Standard Convolutional Neural Networks (CNNs). For Neural Machine Translation background, read this reply to this post.
Motivation 1: Why character by character?
- Robust to spelling mistakes and typos.
- Handles long words that have internal sub-structure (e.g. compounds and rich morphology).
Motivation 2: Why multi-lingual? (one model for multiple languages)
- Language switching, i.e. using an English phrase in the middle of a French sentence (a common real-world scenario).
- Reduced over-fitting (when the languages have similar characteristics).
Challenge = Computation
Computation is quadratic in the length of the source sentence. This is because the attention mechanism is used t times, where t is the length of the target sentence (which is usually proportional to the length of the source sentence), and each time the attention mechanism looks at the entire representation of the source sentence.
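To see why this is quadratic, here is a rough count of attention score computations (illustrative numbers only, not from the paper):

```python
# Each decoding step attends over every position of the source
# representation, so the number of attention scores is source_len * target_len.
def num_attention_scores(source_len, target_len):
    return source_len * target_len

# If target length is roughly proportional to source length, doubling the
# sentence length quadruples the work:
print(num_attention_scores(100, 100))  # 10000
print(num_attention_scores(200, 200))  # 40000
# Character-level input makes source_len several times larger than
# word-level input, which is why a 5x shorter source representation helps.
```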
Main contribution 1 = 5x smaller source sentence representation
Neural Machine Translation systems have three components: Encoder, Attention and Decoder. This paper makes important modifications to the encoder.
Source: Figure 1 in paper
The architecture of the encoder is as follows (a code sketch follows the notes below):
Character Embeddings => Convolutional Layer with ReLU units => Max-pooling with stride 5 => Highway network => Bidirectional GRU
Interesting things to note in the architecture:
- They use a max-pooling layer with a stride of 5, so the source sentence representation is 5x shorter than the number of characters.
- Done naively, this would lose a lot of information and the accuracy of the model would suffer.
- To remedy this, the authors modify the convolutional layer to use many filters of multiple widths (ranging from 1 to 8) instead of a single width.
- In the paper, they use 200-300 filters of each width, roughly 2,000 in total. Hence, for a sentence with 100 characters, the output of the convolutional layer is of size 100x2000, and the output of the max-pool layer is of size 20x2000.
- Even though the max-pooling regions do not overlap, the input regions from which adjacent segments receive information overlap significantly, because the largest filter width (8) is larger than the max-pooling stride (5).
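To make the encoder concrete, here is a minimal PyTorch-style sketch of the pipeline above. The layer sizes (character vocabulary, embedding dimension, GRU size) are assumptions chosen for illustration; the filter widths 1-8, the roughly 2,000 total filters, and the max-pooling stride of 5 follow the numbers quoted above. This is a sketch of the idea, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CharEncoder(nn.Module):
    # Illustrative sizes: vocab_size, emb_dim and gru_dim are assumptions.
    def __init__(self, vocab_size=300, emb_dim=128,
                 filters_per_width=250, max_width=8,
                 pool_stride=5, gru_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        # One convolution per filter width (1..8); outputs are concatenated.
        self.convs = nn.ModuleList([
            nn.Conv1d(emb_dim, filters_per_width, kernel_size=w)
            for w in range(1, max_width + 1)
        ])
        conv_dim = filters_per_width * max_width            # 250 * 8 = 2000
        # Non-overlapping max-pooling makes the sequence 5x shorter.
        self.pool = nn.MaxPool1d(kernel_size=pool_stride, stride=pool_stride)
        # Single highway layer: y = g * relu(W1 x) + (1 - g) * x
        self.hw_transform = nn.Linear(conv_dim, conv_dim)
        self.hw_gate = nn.Linear(conv_dim, conv_dim)
        self.gru = nn.GRU(conv_dim, gru_dim, batch_first=True, bidirectional=True)

    def forward(self, char_ids):                  # (batch, num_chars)
        x = self.embed(char_ids).transpose(1, 2)  # (batch, emb_dim, num_chars)
        # Right-pad so every filter width produces num_chars outputs.
        feats = [F.relu(conv(F.pad(x, (0, conv.kernel_size[0] - 1))))
                 for conv in self.convs]
        x = torch.cat(feats, dim=1)               # (batch, 2000, num_chars)
        x = self.pool(x)                          # (batch, 2000, num_chars // 5)
        x = x.transpose(1, 2)                     # (batch, num_chars // 5, 2000)
        g = torch.sigmoid(self.hw_gate(x))
        x = g * F.relu(self.hw_transform(x)) + (1 - g) * x
        out, _ = self.gru(x)                      # (batch, num_chars // 5, 2 * gru_dim)
        return out
```

For a 100-character sentence, the shapes traced in the comments match the arithmetic above: 100x2000 after the convolutions and 20x2000 after pooling, and it is this 5x shorter sequence that the attention mechanism and decoder operate on.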
Main contribution 2 = One model for multiple languages
They use a single model (not separate models with the same architecture, literally one model) for translating multiple different languages into English. For 3 out of 4 languages, this improves accuracy over the bilingual models by reducing over-fitting.
Evaluation:
- Accuracy: They show that their model performs significantly better than BPE models. Note that BPE has been shown to work better than word-level models and is the state of the art. [4]
- Speed: Trains in two weeks on 2 GPUs. The other notable character-by-character model takes 3 months. [5]
- Language switching: Although the multi-lingual model improves accuracy, it is not good at handling language switching. The hunch is that this is because the training data does not include any examples of language switching.
For further reading, read this reply to this post. Leave a reply if you have suggestions or would like to add additional information. I plan to write one paper summary per week. Scroll to the bottom and click follow if you want to receive updates (you'll be asked to log in with Facebook).