Introduction
- Introduces a new global log-bilinear regression model which combines the benefits of both global matrix factorization and local context window methods.
Global Matrix Factorization Methods
- Decompose large matrices into low-rank approximations.
- e.g., Latent Semantic Analysis (LSA); a small SVD sketch follows this list.
Limitations
- Poor performance on word analogy task
- Frequent words contribute disproportionately to the similarity measure.
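- As an illustration (not from the paper), a minimal NumPy sketch of the LSA idea: build a toy term-document count matrix and keep a rank-k truncated SVD, whose scaled left singular vectors serve as word vectors. The matrix, its values, and the rank are illustrative only.

```python
import numpy as np

# Toy term-document count matrix (rows = words, columns = documents);
# shape and values are illustrative only.
X = np.array([
    [3., 0., 1., 0.],
    [2., 0., 0., 1.],
    [0., 4., 0., 2.],
    [0., 1., 0., 3.],
])

k = 2  # target rank of the low-rank approximation
U, s, Vt = np.linalg.svd(X, full_matrices=False)
word_vectors = U[:, :k] * s[:k]       # one k-dimensional vector per word
X_approx = word_vectors @ Vt[:k, :]   # rank-k reconstruction of X
print(word_vectors.shape)             # (4, 2)
```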
Shallow, Local Context-Based Window Methods
- Learn word representations using adjacent words.
- e.g., the continuous bag-of-words (CBOW) and skip-gram models; a window-extraction sketch follows this list.
Limitations
- Since they do not operate directly on the global co-occurrence counts, they cannot make effective use of the statistics of the corpus.
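- For intuition, a minimal sketch (not the paper's code) of how a window method sees the data: it streams over the corpus and emits (center, context) pairs from a fixed-size window, never materializing global co-occurrence counts.

```python
from typing import Iterator, List, Tuple

def context_pairs(tokens: List[str], window: int = 2) -> Iterator[Tuple[str, str]]:
    """Yield (center, context) pairs the way a skip-gram style model sees them:
    only local neighbours, no global co-occurrence matrix."""
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                yield center, tokens[j]

# Toy usage.
print(list(context_pairs("the quick brown fox".split(), window=1)))
# [('the', 'quick'), ('quick', 'the'), ('quick', 'brown'), ...]
```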
GloVe Model
- To capture the relationship between words i and j, word vector models should use ratios of co-occurrence probabilities (with other words k) instead of using raw probabilities themselves.
- In its most general form:
F(w_i, w_j, w_k^*) = P_{ik}/P_{jk}
- We want F to encode this information in the word vector space, which has a linear structure, so we restrict F to depend on the difference of w_i and w_j:
F(w_i - w_j, w_k^*) = P_{ik}/P_{jk}
- Since the right-hand side is a scalar while the arguments of F are vectors, we take the dot product of the arguments:
F((w_i - w_j)^T w_k^*) = P_{ik}/P_{jk}
- The model should be symmetric under exchanging a word with a context word; requiring F to be a homomorphism between (R, +) and (R_{>0}, ×), i.e. F = exp, gives
F(w_i^T w_k^*) = P_{ik}
- Taking the logarithm, absorbing log(X_i) into a bias term b_i (with b_j^* added for symmetry), and weighting each pair to dampen rare and very frequent co-occurrences (refer to the paper for details), we get the cost function (a code sketch follows this list):
J = Sum (over all word pairs i, j in the vocabulary)
f(X_{ij}) [w_i^T w_j^* + b_i + b_j^* - log(X_{ij})]^2
- f is a weighting function:
- f(x) = min((x/x_max)^α, 1)
- Typical values: x_max = 100 and α = 3/4
- b_i and b_j^* are the bias terms.
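- A minimal NumPy sketch of the cost above, assuming a pre-built dense co-occurrence matrix X; variable names and sizes are illustrative and this is not the paper's implementation.

```python
import numpy as np

def weight(x, x_max=100.0, alpha=0.75):
    """Weighting function f(x) = min((x / x_max)^alpha, 1)."""
    return np.minimum((x / x_max) ** alpha, 1.0)

def glove_cost(W, W_ctx, b, b_ctx, X):
    """J = sum over non-zero X_ij of f(X_ij) * (w_i . w_j^* + b_i + b_j^* - log X_ij)^2."""
    i, j = X.nonzero()                              # only non-zero co-occurrences
    dots = (W[i] * W_ctx[j]).sum(axis=1)            # w_i^T w_j^*
    diff = dots + b[i] + b_ctx[j] - np.log(X[i, j])
    return float(np.sum(weight(X[i, j]) * diff ** 2))

# Toy usage: 5 words, 10-dimensional vectors (sizes are illustrative).
rng = np.random.default_rng(0)
X = rng.integers(0, 5, size=(5, 5)).astype(float)
W, W_ctx = rng.normal(size=(5, 10)), rng.normal(size=(5, 10))
b, b_ctx = np.zeros(5), np.zeros(5)
print(glove_cost(W, W_ctx, b, b_ctx, X))
```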
Complexity
- Depends on the number of non-zero elements in the co-occurrence matrix X.
- This is upper-bounded by the square of the vocabulary size.
- Since the complexity of shallow window-based approaches scales with |C| (the corpus size), a tighter bound is needed for a fair comparison.
- By modelling the number of co-occurrences of a word pair as a power-law function of its frequency rank, the number of non-zero elements, and hence the complexity, can be shown to be proportional to |C|^0.8, i.e. better than the |C| scaling of the window-based methods.
Evaluation
Tasks
- Word Analogies
- a is to b as c is to ___?
- Both semantic and syntactic pairs
- The answer d is the word whose vector is closest to w_b - w_a + w_c by cosine similarity (a query sketch follows this list).
- Word Similarity
- Named Entity Recognition
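- A minimal sketch of the analogy query described above, assuming a dict of word vectors; the toy vectors are made up for illustration.

```python
import numpy as np

def analogy(a, b, c, vectors):
    """Return the word d whose vector is closest (cosine similarity) to
    w_b - w_a + w_c, answering 'a is to b as c is to d'."""
    unit = {w: v / np.linalg.norm(v) for w, v in vectors.items()}
    query = unit[b] - unit[a] + unit[c]
    query /= np.linalg.norm(query)
    best, best_sim = None, -np.inf
    for w, v in unit.items():
        if w in (a, b, c):                 # query words are excluded by convention
            continue
        sim = float(v @ query)             # cosine similarity (vectors are unit-norm)
        if sim > best_sim:
            best, best_sim = w, sim
    return best

# Toy usage with made-up 3-dimensional vectors.
vecs = {"king": np.array([0.9, 0.1, 0.4]), "man": np.array([0.7, 0.2, 0.1]),
        "woman": np.array([0.6, 0.9, 0.1]), "queen": np.array([0.8, 0.8, 0.4])}
print(analogy("man", "king", "woman", vecs))   # -> 'queen'
```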
Datasets
- Wikipedia Dumps - 2010 and 2014
- Gigaword5
- Combination of Gigaword5 and Wikipedia2014
- CommonCrawl
- 400,000 most frequent words considered from the corpus.
Hyperparameters
- Size of context window.
- Whether to distinguish left context from right context.
- Decreasing weighting for distant pairs: word pairs that are d words apart contribute 1/d to the co-occurrence count (a counting sketch follows this list).
- x_max = 100
- α = 3/4
- AdaGrad update
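- A minimal sketch of the distance-weighted co-occurrence counting described above (pairs d tokens apart add 1/d), with an optional left-only window; names are illustrative, not the paper's code.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

def cooccurrence_counts(tokens: List[str], window: int = 10,
                        symmetric: bool = True) -> Dict[Tuple[str, str], float]:
    """Accumulate X[(word, context)] where a pair d tokens apart adds 1/d.
    With symmetric=False only the left context is counted."""
    X = defaultdict(float)
    for i, word in enumerate(tokens):
        for d in range(1, window + 1):
            if i - d >= 0:                          # left context
                X[(word, tokens[i - d])] += 1.0 / d
            if symmetric and i + d < len(tokens):   # right context
                X[(word, tokens[i + d])] += 1.0 / d
    return X

# Toy usage.
counts = cooccurrence_counts("the cat sat on the mat".split(), window=2)
print(counts[("sat", "cat")])   # 1.0 (adjacent pair)
```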
Models Compared With
- Singular Value Decomposition
- Continuous Bag-Of-Words
- Skip-Gram
Results
- GloVe significantly outperforms all the other models compared.
- Diminishing returns for vectors larger than 200 dimensions.
- Small, asymmetric context windows (context only to the left) work better for syntactic tasks.
- Large, symmetric context windows (context on both sides) work better for semantic tasks.
- The syntactic task benefited from a larger corpus, while the semantic task performed better with Wikipedia than with Gigaword5, probably because of Wikipedia's comprehensiveness and Gigaword5's slightly outdated content.
- Word2vec’s performance decreases if the number of negative samples increases beyond about 10.
- For the same corpus, vocabulary, and window size, GloVe consistently achieves better results, and does so faster.