Motivation 1: Why character by character?
Motivation 2: Why multi-lingual? (one model for multiple languages)
Challenge = Computation
Computation is quadratic in the length of the source sentence. This is because the attention mechanism is used t times, where t is the length of the target sentence (which is usually proportional to the length of the source sentence), and each time the attention mechanism looks at the entire representation of the source sentence. For a source of length s, the total cost is therefore on the order of s × t, and with t proportional to s that is roughly quadratic in s.
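A back-of-the-envelope sketch of this scaling (the helper below is hypothetical, not from the paper):

```python
# A minimal sketch of why attention cost is quadratic: each of the
# t decoder steps scores all s source positions.

def attention_score_count(source_len: int, target_len: int) -> int:
    """Number of (query, key) score computations across decoding."""
    return source_len * target_len

# With target length proportional to source length (t ~ s),
# cost grows quadratically in s:
s = 100
print(attention_score_count(s, s))          # 10000
print(attention_score_count(5 * s, 5 * s))  # 250000: 25x for a 5x longer input

# Shrinking the source representation 5x (the paper's contribution 1)
# cuts each decoding step's attention work by 5x:
print(attention_score_count(s // 5, s))     # 2000
```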
Main contribution 1 = 5x smaller source sentence representation
Neural Machine Translation systems have three components: an Encoder, an Attention mechanism, and a Decoder. This paper makes its important modifications to the encoder.
Source: Figure 1 in the paper
The architecture of the encoder is as follows,
Character Embeddings => Convolutional Layer with ReLU units => Max-pooling with stride 5 => Highway network => Bidirectional GRU
The key thing to note in the architecture: the max-pooling with stride 5 is exactly what produces the 5x smaller source sentence representation (Main contribution 1), since it keeps one vector for every 5 positions. A sketch of the whole pipeline follows.
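Here is a minimal PyTorch sketch of this encoder. The layer sizes and the single convolution width are illustrative assumptions, not the paper's exact hyperparameters:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Highway(nn.Module):
    """Highway layer: y = g * relu(W1 x) + (1 - g) * x, with gate g = sigmoid(W2 x)."""
    def __init__(self, dim: int):
        super().__init__()
        self.transform = nn.Linear(dim, dim)
        self.gate = nn.Linear(dim, dim)

    def forward(self, x):
        g = torch.sigmoid(self.gate(x))
        return g * F.relu(self.transform(x)) + (1 - g) * x

class CharEncoder(nn.Module):
    def __init__(self, vocab_size=300, embed_dim=128, n_filters=256, hidden=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        # Convolution over character embeddings; padding keeps the length unchanged.
        self.conv = nn.Conv1d(embed_dim, n_filters, kernel_size=5, padding=2)
        # Stride-5 max-pooling: this is where the 5x length reduction happens.
        self.pool = nn.MaxPool1d(kernel_size=5, stride=5)
        self.highway = Highway(n_filters)
        self.gru = nn.GRU(n_filters, hidden, batch_first=True, bidirectional=True)

    def forward(self, char_ids):                  # (batch, src_len)
        x = self.embed(char_ids)                  # (batch, src_len, embed_dim)
        x = F.relu(self.conv(x.transpose(1, 2)))  # (batch, n_filters, src_len)
        x = self.pool(x).transpose(1, 2)          # (batch, src_len // 5, n_filters)
        x = self.highway(x)
        out, _ = self.gru(x)                      # (batch, src_len // 5, 2 * hidden)
        return out

enc = CharEncoder()
chars = torch.randint(0, 300, (2, 100))  # batch of 2 sentences, 100 characters each
print(enc(chars).shape)                  # torch.Size([2, 20, 1024])
```

Note the sequence the decoder attends over is 20 vectors long for a 100-character input, so every attention step is 5x cheaper than attending over the raw characters.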
Main contribution 2 = One model for multiple languages
They use a single model (not several models that merely share an architecture; literally one model) to translate multiple different source languages to English. For 3 out of 4 languages, this multilingual model is more accurate than the corresponding bilingual models, which the authors attribute to reduced over-fitting.