Given a word sequence s (local context) and a document d in which the sequence occurs (global context), learn word representations while learning to discriminate the last correct word in s from other words.
g(s, d) - scoring function giving the likelihood of the correct sequence.
g(s_w, d) - scoring function giving the likelihood of s with its last word replaced by a word w.
Objective - g(s, d) > g(s_w, d) + 1 for any other word w, i.e. the correct sequence should outscore every corrupted one by a margin of 1.
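A minimal sketch of the hinge loss this margin constraint implies (function and argument names are illustrative, not from the paper):

```python
def ranking_loss(g_correct, g_corrupt):
    """Hinge loss: the true sequence score g(s, d) should exceed the
    corrupted score g(s_w, d) by a margin of at least 1."""
    return max(0.0, 1.0 - g_correct + g_corrupt)
```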
Two scoring components (neural networks), one capturing the local context and one the global context:
Map word sequence s into an ordered list of vectors x = [x1, ..., xm].
xi - embedding corresponding to the ith word in the sequence.
Compute the local score score_l using a neural network (with one hidden layer) over x.
Preserves word order and syntactic information.
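A sketch of the local scorer under these definitions, assuming a tanh hidden layer (the activation and parameter shapes are assumptions, not from the paper):

```python
import numpy as np

def local_score(x, W1, b1, w2, b2):
    """score_l: one-hidden-layer network over the ordered window embeddings.

    x  : list of m embeddings [x1, ..., xm], order preserved
    W1 : (hidden, m * embed_dim) weights, b1 : (hidden,) bias
    w2 : (hidden,) output weights,        b2 : scalar bias
    """
    h = np.tanh(W1 @ np.concatenate(x) + b1)  # hidden layer over [x1; ...; xm]
    return float(w2 @ h + b2)                 # scalar local score
```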
Map document d to an ordered list of word embeddings, d = (d1, ..., dk).
Compute c, the weighted average of all word vectors in the document.
The paper uses idf scores to weight the words.
x = concatenation of c and the embedding of the last word in s.
Compute the global score score_g using a neural network (with two hidden layers) over x.
The global component is similar to bag-of-words features. The final score is score = score_l + score_g.
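A matching sketch of the global scorer, again assuming tanh hidden layers, with the idf-weighted average computed as described above (parameter names are illustrative):

```python
import numpy as np

def global_score(doc_vecs, idf, last_word_vec, W1, b1, W2, b2, w3, b3):
    """score_g: two-hidden-layer network over [c; last word of s], where c
    is the idf-weighted average of the document's word vectors."""
    idf = np.asarray(idf)
    c = (idf[:, None] * np.asarray(doc_vecs)).sum(axis=0) / idf.sum()
    x = np.concatenate([c, last_word_vec])    # [c; embedding of last word]
    h1 = np.tanh(W1 @ x + b1)
    h2 = np.tanh(W2 @ h1 + b2)
    return float(w3 @ h2 + b3)

# combined score for a (sequence, document) pair:
# score = local_score(...) + global_score(...)
```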
Train the weights of the hidden layers and the word embeddings.
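For training, a hedged sketch of one update of the ranking objective, assuming a PyTorch module `score` that computes score_l + score_g (the module, optimizer, and corrupt-word sampling scheme here are assumptions):

```python
import random
import torch

def training_step(score, seq_ids, doc_ids, vocab_size, optimizer):
    """One ranking update: corrupt the last word of the window with a
    random vocabulary word and penalize margin violations; gradients
    flow into both the hidden-layer weights and the word embeddings."""
    corrupt = seq_ids.clone()
    corrupt[-1] = random.randrange(vocab_size)  # random replacement word w
    loss = torch.clamp(1 - score(seq_ids, doc_ids) + score(corrupt, doc_ids), min=0)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```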
Multi-Prototype Neural Language Model
Words can have different meanings in different contexts, which is difficult to capture when we train only one vector per word.
Solution - train multiple vectors per word to capture the different meanings.
Gather all the fixed-size context windows for all occurrences of a given word.
Find the context vector by taking a weighted average of the vectors of all words in the context window.
Cluster the context vectors using spherical k-means.
Each word occurrence in the corpus is re-labeled to its associated cluster.
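A compact numpy sketch of spherical k-means over the context vectors (the initialization and iteration count are arbitrary choices, not from the paper):

```python
import numpy as np

def spherical_kmeans(X, k, iters=20, seed=0):
    """Cluster unit-normalized context vectors by cosine similarity.

    X : (n, d) matrix, one context vector per word occurrence.
    Returns a cluster label per occurrence and unit-norm centers.
    """
    rng = np.random.default_rng(seed)
    X = X / np.linalg.norm(X, axis=1, keepdims=True)        # project onto the unit sphere
    centers = X[rng.choice(len(X), size=k, replace=False)]  # init from random data points
    for _ in range(iters):
        labels = (X @ centers.T).argmax(axis=1)             # assign by cosine similarity
        for j in range(k):
            members = X[labels == j]
            if len(members):                                # recompute and renormalize center
                c = members.sum(axis=0)
                centers[j] = c / np.linalg.norm(c)
    return labels, centers
```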
To find the similarity between a pair of words (w, w'):
For each pair of clusters (i, j), where i is a prototype of w and j a prototype of w', compute the distance between the cluster centers and weight it by the probability of w belonging to i times the probability of w' belonging to j, given their respective contexts.
Average this value over the k^2 cluster pairs.
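A sketch of this contextual similarity (AvgSimC in the paper), with cosine similarity standing in for the function d(·,·) between cluster centers:

```python
import numpy as np

def avg_sim_c(centers_w, centers_v, p_w, p_v):
    """Contextual similarity between words w and w'.

    centers_w, centers_v : (k, d) prototype centers for the two words
    p_w, p_v : length-k probabilities of each prototype given the
               observed context of w and of w' respectively.
    """
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

    k = len(centers_w)
    total = sum(p_w[i] * p_v[j] * cos(centers_w[i], centers_v[j])
                for i in range(k) for j in range(k))
    return total / k**2
```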
100 hidden units
No weight regularization
10 different word embeddings learnt for words having multiple meanings.
WordSim-353 dataset - 353 pairs of nouns
words represented without context
contains human similarity judgements on pairs of words
The paper contributed a new dataset (Stanford Contextual Word Similarities)
captures human similarity judgements on pairs of words in the context of a sentence
consists of verbs and adjectives along with nouns
for details on how the dataset is constructed, refer to the paper
Proposed model achieves higher correlation to human scores than models using only the local or global context.
Performance can be improved by removing stop words.
Using multi-prototype approach (multiple vectors for the same word) benefits the model on the tasks where the context is also given.
This work predated the more general word embedding models like Word2Vec and GloVe. While this model performs well on intrinsic evaluation tasks like word similarity, it is outperformed by the more general and recent models on downstream tasks like NER.