Introduction
- Word-based language models suffer from the problem of rare or out-of-vocabulary (OOV) words.
- Learning representations for OOV words directly on the end task often results in poor representations.
- The alternatives are to replace all rare words with a single, shared representation (losing information) or to use character-level models to obtain word representations (which tend to miss semantic relationships).
- The paper proposes to learn a network that can predict the representations of words using auxiliary data (referred to as definitions) such as dictionary definitions, Wikipedia infoboxes, the spelling of the word, etc.
- The auxiliary data encoders are trained jointly with the end task to ensure that word representations align with the requirements of the end task.
Approach
- Given a rare word w, let d(w) = <x1, x2, …> denote its definition, where the xi are words.
- d(w) is fed to a definition reader network f (an LSTM), and its last hidden state is used as the definition embedding e_d(w).
- In case w has multiple definitions, the embeddings are combined using mean pooling.
- The approach can be extended to in-vocabulary words as well by using the definition embedding of such words to update their original embeddings; a minimal sketch of this pipeline is given below.
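Below is a minimal, hypothetical sketch (in PyTorch) of the pipeline described above: an LSTM reads each definition, its last hidden state is taken as that definition's embedding, multiple definitions are mean-pooled, and for in-vocabulary words the result is combined with the original embedding. The dimensions, the toy inputs, and the additive combination are illustrative assumptions rather than the authors' exact setup.

```python
# Minimal sketch of the definition-reader idea; not the authors' implementation.
import torch
import torch.nn as nn

class DefinitionReader(nn.Module):
    def __init__(self, vocab_size, emb_dim, hidden_dim):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True)

    def forward(self, definitions):
        # definitions: list of 1-D LongTensors, one token-id sequence per
        # definition of the same word. Returns a single vector e_d(w).
        states = []
        for d in definitions:
            _, (h_n, _) = self.lstm(self.embed(d.unsqueeze(0)))
            states.append(h_n[-1, 0])            # last hidden state for this definition
        return torch.stack(states).mean(dim=0)   # mean pooling over multiple definitions

reader = DefinitionReader(vocab_size=10000, emb_dim=100, hidden_dim=100)
defs_of_w = [torch.tensor([4, 17, 256]), torch.tensor([9, 3])]  # two toy definitions
e_d = reader(defs_of_w)

# For an in-vocabulary word, one simple (assumed) way to use e_d is to add it
# to the word's original embedding before feeding it to the end-task model.
e_original = torch.zeros(100)
e_w = e_original + e_d
```

Because the definition reader is trained jointly with the end task, gradients from the task loss flow back into the encoder, aligning the predicted representations with what the task needs.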
Experiments
- Auxiliary data sources
- Word definitions from WordNet
- Spelling of words (a sketch of how both sources can be turned into token sequences appears at the end of this section)
- The proposed approach was tested on the following tasks:
- Extractive Question Answering over SQuAD
- Base model from Xiong et al. 2016
- Entailment Prediction over SNLI corpus
- Base models from Bowman et al. 2015 and Chen et al. 2016
- One Billion Words Language Modelling
- For all the tasks, models using both spelling and dictionary definitions (SD) outperformed models using only one of the two sources.
- While SD does not outperform the GloVe model (with the full vocabulary), it narrows the performance gap considerably.
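As referenced in the auxiliary data sources above, the sketch below shows one hypothetical way to turn the two sources into token sequences for the definition reader. It assumes NLTK with the WordNet corpus downloaded; the lower-casing and whitespace tokenisation are illustrative choices, not necessarily what the paper uses.

```python
# Hypothetical preparation of the two auxiliary sources; not the authors' code.
# Assumes: pip install nltk, then nltk.download('wordnet').
from nltk.corpus import wordnet as wn

def dictionary_definitions(word):
    # One word-level "definition" per WordNet synset of the word (its gloss).
    return [syn.definition().lower().split() for syn in wn.synsets(word)]

def spelling_definition(word):
    # The word's spelling, treated as a character-level "definition".
    return list(word.lower())

print(dictionary_definitions("cat")[:1])  # first tokenised WordNet gloss of "cat"
print(spelling_definition("cat"))         # ['c', 'a', 't']
```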
Future Work
- Multi-token expressions like “San Francisco” are not accounted for yet.
- The model does not handle rare words that appear within a definition; it simply replaces them with the UNK token. Making the model recursive (applying the definition reader to those words as well) would be a useful addition.