KL divergence is used as an asymmetric similarity measure for comparing the distributions.
Unlike the Word2Vec model, the proposed model uses a ranking-based loss.
Similarity Measures used
For two Gaussian distributions P_i and P_j, compute the inner product E(P_i, P_j) = ∫ N(x; μ_i, Σ_i) N(x; μ_j, Σ_j) dx = N(0; μ_i − μ_j, Σ_i + Σ_j).
Compute the gradients of log E with respect to the means and covariances.
The resulting loss function can be interpreted as pulling the two means closer together while encouraging the two Gaussians to be more concentrated.
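For diagonal covariances the inner product and its gradients have simple closed forms. A minimal NumPy sketch (the function names and the variance-vector parameterization are illustrative, not from the paper):

```python
import numpy as np

def log_energy(mu_i, var_i, mu_j, var_j):
    """log E(P_i, P_j) = log N(0; mu_i - mu_j, Sigma_i + Sigma_j)
    for diagonal-covariance Gaussians (variances given as vectors)."""
    var = var_i + var_j
    diff = mu_i - mu_j
    d = mu_i.shape[0]
    return -0.5 * (d * np.log(2 * np.pi) + np.sum(np.log(var))
                   + np.sum(diff ** 2 / var))

def log_energy_grads(mu_i, var_i, mu_j, var_j):
    """Gradients of log E w.r.t. mu_i and the diagonal variances var_i."""
    var = var_i + var_j
    delta = (mu_i - mu_j) / var        # (Sigma_i + Sigma_j)^{-1} (mu_i - mu_j)
    grad_mu_i = -delta                 # ascent step pulls the means together
    grad_var_i = 0.5 * (delta ** 2 - 1.0 / var)
    return grad_mu_i, grad_var_i
```

The sign of grad_mu_i shows the interpretation directly: a gradient-ascent step on log E moves μ_i toward μ_j.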
Alternatively, use KL divergence as the similarity, measuring how well one distribution encodes the context distribution.
The benefit over the symmetric setting is that entailment-type relations can now also be modeled.
For example, a low KL divergence from x to y indicates that y can be encoded as x or that y “entails” x.
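For diagonal Gaussians, KL divergence also has a closed form. A sketch (function name is illustrative):

```python
import numpy as np

def kl_diag(mu_i, var_i, mu_j, var_j):
    """KL(P_i || P_j) for Gaussians with diagonal covariances
    (variances passed as vectors). Note the asymmetry:
    KL(P_i || P_j) != KL(P_j || P_i) in general."""
    return 0.5 * np.sum(var_i / var_j
                        + (mu_i - mu_j) ** 2 / var_j
                        - 1.0
                        + np.log(var_j / var_i))
```

The asymmetry is what captures entailment: a narrow Gaussian centered inside a wide one gives a small KL in one direction but a large KL in the other.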
One of the two notions of similarity is chosen, and a max-margin ranking loss is used for training.
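The ranking loss can be sketched as a standard hinge on the similarity gap between an observed and a negatively sampled pair (the margin value is an illustrative hyperparameter):

```python
def max_margin_loss(sim_pos, sim_neg, margin=1.0):
    """Ranking hinge loss: require the similarity of an observed
    (word, context) pair to exceed that of a negative pair by `margin`.
    `sim_*` can be either log E or negative KL divergence."""
    return max(0.0, margin - sim_pos + sim_neg)
```

The loss is zero once the positive pair beats the negative pair by the margin, so well-separated pairs contribute no gradient.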
The mean is regularized by a simple constraint on its L2 norm.
For the covariance matrix, the eigenvalues are constrained to lie within a hypercube, i.e. each eigenvalue is clipped to a fixed interval. This maintains the positive-definite property of the covariance matrix while bounding its size.
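Both constraints amount to a projection step after each gradient update. A sketch for the diagonal-covariance case, where the eigenvalues are just the diagonal entries (the bound constants are illustrative hyperparameters):

```python
import numpy as np

def project(mu, var, max_norm=1.0, eig_min=1e-3, eig_max=10.0):
    """Project parameters back onto the feasible set after a gradient step:
    cap the mean's L2 norm at max_norm, and clip the diagonal covariance's
    eigenvalues (its diagonal entries) to [eig_min, eig_max], which keeps
    the matrix positive definite while bounding its size."""
    norm = np.linalg.norm(mu)
    if norm > max_norm:
        mu = mu * (max_norm / norm)   # rescale onto the L2 ball
    var = np.clip(var, eig_min, eig_max)
    return mu, var
```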
Polysemous words have higher variance in their word embeddings than more specific words do.
KL divergence (with diagonal covariance) outperforms the other model variants.
Simple tree hierarchies can also be modeled by embedding them into the Gaussian space: a Gaussian with a randomly initialized mean is created for each node, and the same set of embeddings is used for both nodes and contexts.
On word-similarity benchmarks, embeddings with spherical covariance have a slight edge over embeddings with diagonal covariance, and both outperform the Skip-Gram model in all cases.
Future directions:
Using combinations of low-rank and diagonal matrices for covariances.
Improving the optimisation strategies.
Trying other distributions, such as the Student's t-distribution.