# Introduction

- Introduces a new global log-bilinear regression model which combines the benefits of both global matrix factorization and local context window methods.

# Global Matrix Factorization Methods

- Decompose large matrices into low-rank approximations.
- e.g., Latent Semantic Analysis (LSA)

**Limitations**

- Poor performance on the word analogy task.
- Frequent words contribute disproportionately to the similarity measure.

# Shallow, Local Context-Based Window Methods

- Learn word representations using adjacent words.
- e.g., the continuous bag-of-words (CBOW) and skip-gram models.

**Limitations**

- Since they do not operate directly on global co-occurrence counts, they cannot exploit the statistics of the corpus effectively.

# GloVe Model

- To capture the relationship between words *i* and *j*, word vector models should use ratios of co-occurrence probabilities (with probe words *k*) rather than the raw probabilities themselves.
- In the most general form:

F(w_i, w_j, w_k^*) = P_{ik}/P_{jk}

- We want *F* to encode this information in the vector space (which has a linear structure), so we can restrict *F* to depend on the difference of *w_i* and *w_j*:

F(w_i - w_j, w_k^* ) = P_{ik}/P_{jk}

- Since the right-hand side is a scalar while the arguments on the left are vectors, we take the dot product of the arguments:

F( (w_i - w_j)^T w_k^* ) = P_{ik}/P_{jk}

- *F* should be invariant under exchanging the roles of a word and a context word (*i* and *j* are interchangeable); imposing this symmetry (requiring *F* to be a homomorphism) gives:

F(w_i^Tw_k^*) = P_{ik}
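
The ratio intuition above can be checked on a toy corpus. This is a minimal sketch, not the paper's code: the corpus, window size, and function names (`counts`, `P`) are all illustrative.

```python
from collections import defaultdict

# Toy corpus and a symmetric context window of size 2 (both illustrative).
corpus = "ice is solid and cold while steam is gas and hot".split()
window = 2

# counts[i][k] = X_ik: how often word k appears in the context of word i.
counts = defaultdict(lambda: defaultdict(int))
for i, w in enumerate(corpus):
    for j in range(max(0, i - window), min(len(corpus), i + window + 1)):
        if j != i:
            counts[w][corpus[j]] += 1

def P(k, i):
    """Co-occurrence probability P_ik = X_ik / X_i."""
    total = sum(counts[i].values())
    return counts[i][k] / total if total else 0.0
```

For a probe word like "solid", P("solid", "ice") comes out larger than P("solid", "steam"), so the ratio P_ik/P_jk discriminates between related and unrelated words, which raw probabilities alone do less sharply.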

- Doing further simplifications and optimisations (refer to the paper), we get the cost function:

J = Σ_{i,j} f(X_{ij}) [w_i^T w_j^* + b_i + b_j^* - log(X_{ij})]^2

where the sum runs over all word pairs *(i, j)* in the vocabulary.

- *f* is a weighting function:

f(x) = min((x/x_max)^α, 1)

- Typical values: *x_max = 100* and *α = 3/4*.
- *b_i* and *b_j^** are bias terms.
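
The cost function and weighting function can be sketched in plain Python. This is an illustrative evaluation of the loss (not the paper's training code); the dict-based layout of vectors and counts is an assumption for readability.

```python
import math

def f(x, x_max=100.0, alpha=0.75):
    # Weighting function: down-weights rare pairs, caps frequent ones at 1.
    return min((x / x_max) ** alpha, 1.0)

def glove_cost(w, w_ctx, b, b_ctx, X):
    """J = sum over pairs of f(X_ij) * (w_i . w~_j + b_i + b~_j - log X_ij)^2.

    w, w_ctx: dicts word -> vector (lists of floats);
    b, b_ctx: dicts word -> bias;
    X: dict (i, j) -> co-occurrence count. Illustrative sketch only.
    """
    J = 0.0
    for (i, j), x in X.items():
        if x > 0:  # only non-zero co-occurrences enter the sum
            dot = sum(a * c for a, c in zip(w[i], w_ctx[j]))
            J += f(x) * (dot + b[i] + b_ctx[j] - math.log(x)) ** 2
    return J
```

Note that f(0) = 0, so pairs that never co-occur contribute nothing, and the log(X_ij) singularity at zero is never evaluated.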

**Complexity**

- Depends on the number of non-zero elements in the input co-occurrence matrix.
- Upper-bounded by the square of the vocabulary size.
- Since the complexity of shallow window-based approaches scales with *|C|* (the size of the corpus), a tighter bound is needed for a fair comparison.
- By modelling the number of co-occurrences of word pairs as a power-law function of their frequency rank, the complexity can be shown to be proportional to *|C|^0.8*.
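
A quick back-of-the-envelope comparison shows why the power-law bound matters. The corpus size below is a hypothetical figure; the vocabulary size matches the 400,000 most frequent words used in the paper's experiments.

```python
# Comparing the two complexity bounds discussed above.
corpus_size = 1_000_000_000   # |C|: 1B tokens (hypothetical)
vocab_size = 400_000          # |V|: 400k words, as in the paper's setup

dense_bound = vocab_size ** 2         # |V|^2 bound on matrix entries
power_law_bound = corpus_size ** 0.8  # O(|C|^0.8) non-zero entries

# The power-law bound is roughly 1.6e7 vs 1.6e11 for |V|^2:
# about four orders of magnitude fewer entries to process.
```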

# Evaluation

**Tasks**

- Word Analogies
    - a is to b as c is to ___?
    - Both semantic and syntactic pairs.
    - Find the word d whose vector is closest to w_b - w_a + w_c (by cosine similarity).
- Word Similarity
- Named Entity Recognition

**Datasets**

- Wikipedia Dumps - 2010 and 2014
- Gigaword5
- Combination of Gigaword5 and Wikipedia2014
- CommonCrawl
- 400,000 most frequent words considered from the corpus.

**Hyperparameters**

- Size of context window.
- Whether to distinguish left context from right context.
- Weighting function *f*.
- Word pairs that are *d* words apart contribute *1/d* to the total count.
- *x_max = 100*, *α = 3/4*.
- AdaGrad updates.
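
The distance-weighted counting (pairs *d* words apart contribute *1/d*) can be sketched as follows. The function name and symmetric-window choice are illustrative, not taken from the paper's released code.

```python
from collections import defaultdict

def cooccurrence(tokens, window=10):
    """Distance-weighted co-occurrence counts: a pair d words apart
    adds 1/d to the count, matching the decreasing weighting above.
    Symmetric window; counts are accumulated in both directions."""
    X = defaultdict(float)
    for i, w in enumerate(tokens):
        for d in range(1, window + 1):
            j = i + d
            if j < len(tokens):
                X[(w, tokens[j])] += 1.0 / d
                X[(tokens[j], w)] += 1.0 / d
    return X
```

For example, in the sequence ["a", "b", "c"] with window 2, the pair (a, b) is 1 word apart and gets weight 1.0, while (a, c) is 2 words apart and gets weight 0.5.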

**Models Compared With**

- Singular Value Decomposition
- Continuous Bag-Of-Words
- Skip-Gram

**Results**

- GloVe significantly outperforms all other models compared.
- Diminishing returns for vectors larger than 200 dimensions.
- Small, asymmetric context windows (context only to the left) work better for syntactic tasks.
- Long, symmetric context windows (context on both sides) work better for semantic tasks.
- The syntactic task benefited from a larger corpus, though the semantic task performed better with Wikipedia than with Gigaword5, probably due to Wikipedia's comprehensiveness and Gigaword5's slightly outdated nature.
- Word2vec’s performance decreases if the number of negative samples increases beyond about 10.
- For the same corpus, vocabulary, and window size, GloVe consistently achieves better results, faster.