This paper presents a simple approach to mapping an input sequence to an output sequence. The authors use a multi-layer LSTM-based encoder-decoder architecture and show promising results on neural machine translation. Their approach beats a phrase-based statistical machine translation (SMT) system by more than 1.0 BLEU point and is close to the state of the art when used to re-rank the 1000-best predictions of the SMT system. Main contributions:
- The first LSTM encodes the input sequence into a single fixed-size vector, which is then decoded by a second LSTM. The end of a sequence is indicated by a special end-of-sequence symbol.
- 4-layer deep LSTMs.
- 160k source vocabulary, 80k target vocabulary. Trained on 12M sentences. Words in the output sequence are generated by a softmax over a fixed vocabulary.
- Beam search is used at test time to predict translations (a beam size of 2 already gives most of the benefit).
- Qualitative results (PCA projections) show that learned representations are fairly insensitive to active/passive voice, as sentences similar in meaning are clustered together.
- Another interesting observation is that reversing the source sequence gives a significant performance gain, particularly on long sentences. The most likely reason is that reversal introduces many short-term dependencies between source and target words, which are easier for gradient-based training to capture.
- The source-reversal idea needs better justification; otherwise it comes across as an 'ugly hack'.
- To re-score the n-best list of the baseline, they average the confidences of the LSTM and the baseline model. They should also have reported re-ranking results using the LSTM confidences alone.
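
The short-term-dependency argument for source reversal can be made concrete with a small sketch. It assumes, purely for illustration, equal-length sentences and a monotone word-for-word alignment; the function names `reverse_source` and `source_target_gaps` are hypothetical and not taken from the paper.

```python
from typing import List


def reverse_source(tokens: List[str]) -> List[str]:
    """Reverse the source tokens; the target sequence is left untouched."""
    return tokens[::-1]


def source_target_gaps(n: int, reverse: bool) -> List[int]:
    """Time-step gap between source word i and target word i.

    Assumption (illustrative only): the n source words are fed first,
    followed by n target words, and word i aligns with word i.
    """
    gaps = []
    for i in range(n):
        src_pos = (n - 1 - i) if reverse else i  # position in the input stream
        tgt_pos = n + i                          # target follows the source
        gaps.append(tgt_pos - src_pos)
    return gaps


forward_gaps = source_target_gaps(4, reverse=False)
reversed_gaps = source_target_gaps(4, reverse=True)
```

Under these assumptions the average gap is the same in both cases, but after reversal the first target words sit right next to their source words, which is one way to read the "short-term dependencies" explanation.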
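
The beam-search decoding mentioned in the notes can be sketched as follows. This is a minimal illustration, not the authors' implementation: `toy_model` is a hypothetical hand-written probability table standing in for the decoder LSTM, and `EOS` is an assumed name for the end-of-sequence symbol.

```python
import math
from typing import Callable, Dict, List, Tuple

EOS = "<eos>"  # assumed name for the end-of-sequence symbol

Hypothesis = Tuple[Tuple[str, ...], float]  # (tokens, total log-probability)


def beam_search(
    next_log_probs: Callable[[Tuple[str, ...]], Dict[str, float]],
    beam_size: int,
    max_len: int,
) -> List[Hypothesis]:
    """Return finished hypotheses, best total log-probability first."""
    beams: List[Hypothesis] = [((), 0.0)]
    finished: List[Hypothesis] = []
    for _ in range(max_len):
        # Extend every partial hypothesis by every possible next token.
        candidates = [
            (prefix + (token,), score + logp)
            for prefix, score in beams
            for token, logp in next_log_probs(prefix).items()
        ]
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        # Keep the top `beam_size`; hypotheses ending in EOS leave the beam.
        for hyp, score in candidates[:beam_size]:
            if hyp[-1] == EOS:
                finished.append((hyp, score))
            else:
                beams.append((hyp, score))
        if not beams:
            break
    finished.sort(key=lambda c: c[1], reverse=True)
    return finished


def toy_model(prefix: Tuple[str, ...]) -> Dict[str, float]:
    """Hypothetical next-token distribution (stand-in for the decoder)."""
    table = {
        (): {"a": 0.6, "b": 0.4},
        ("a",): {EOS: 0.55, "c": 0.45},
        ("b",): {EOS: 0.9, "c": 0.1},
    }
    probs = table.get(prefix, {EOS: 1.0})
    return {token: math.log(p) for token, p in probs.items()}


greedy_best = beam_search(toy_model, beam_size=1, max_len=5)[0]
beam2_best = beam_search(toy_model, beam_size=2, max_len=5)[0]
```

In this toy table the greedy search (beam size 1) commits to `a` and ends with probability 0.33, while a beam of 2 keeps `b` alive and finds the higher-probability sequence `b <eos>` at 0.36.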
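
The n-best re-scoring point can also be sketched in a few lines. The helper `rescore_nbest`, its `lstm_weight` parameter, and all scores below are hypothetical and made up for illustration: `lstm_weight=0.5` corresponds to the even average described above, and `lstm_weight=1.0` to the LSTM-only ranking that the notes ask for.

```python
from typing import Callable, List, Tuple


def rescore_nbest(
    nbest: List[Tuple[str, float]],
    lstm_score: Callable[[str], float],
    lstm_weight: float = 0.5,
) -> List[Tuple[str, float]]:
    """Re-rank (hypothesis, baseline_score) pairs by a weighted score.

    lstm_weight=0.5 is an even average of the two models' scores;
    lstm_weight=1.0 ignores the baseline score entirely.
    """
    rescored = [
        (hyp, lstm_weight * lstm_score(hyp) + (1.0 - lstm_weight) * base)
        for hyp, base in nbest
    ]
    return sorted(rescored, key=lambda pair: pair[1], reverse=True)


# Made-up log-scores for three hypotheses (not results from the paper).
nbest = [("hyp A", -1.0), ("hyp B", -1.2), ("hyp C", -3.0)]
lstm_scores = {"hyp A": -2.0, "hyp B": -0.5, "hyp C": -0.4}

averaged = rescore_nbest(nbest, lstm_scores.__getitem__, lstm_weight=0.5)
lstm_only = rescore_nbest(nbest, lstm_scores.__getitem__, lstm_weight=1.0)
```

With these made-up numbers the two settings produce different rankings, which is why reporting only the averaged variant leaves the LSTM's own contribution unclear.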