Paper Summary: Visualizing Data using t-SNE

- Introduction
- SNE
- t-SNE (t-Distributed SNE)
- Advantages of t-SNE
- Limitations of t-SNE

__Introduction__

- A method to visualize high-dimensional datapoints in a 2- or 3-dimensional space.
- Data visualization techniques like Chernoff faces and graph approaches just provide a representation and not an interpretation.
- Dimensionality reduction techniques fail to retain both local and global structure of the data simultaneously. For example, PCA and MDS are linear techniques and fail on data lying on a non-linear manifold.
- t-SNE approach converts data into a matrix of pairwise similarities and visualizes this matrix.
- Based on SNE (Stochastic Neighbor Embedding)
- Link to paper

__SNE__

- Given a set of datapoints *x1, x2, ..., xn*, *pj|i* is the conditional probability that *xi* would pick *xj* as its neighbor if neighbors were picked in proportion to their probability density under a Gaussian centred at *xi*. The calculation of *σi* is described later.
- Similarly, *qj|i* is the conditional probability defined over the low-dimensional representations *yi* and *yj* (corresponding to *xi* and *xj*). The variance of the Gaussian in this case is set to 1/sqrt(2).
- The argument is that if *yi* and *yj* correctly capture the similarity between *xi* and *xj*, then *pj|i* and *qj|i* should be equal. So the objective function to be minimized is the Kullback-Leibler (KL) divergence between *Pi* and *Qi*, where *Pi* (*Qi*) denotes the conditional probability distribution given *xi* (*yi*). Since the KL divergence is not symmetric, the objective function focuses on retaining the local structure of the data.
- Users specify a value called perplexity (a measure of the effective number of neighbors). A binary search is performed to find the *σi* that makes the perplexity of *Pi* equal to the user-specified value.
- Gradient descent (with momentum) is used to minimize the objective function, and Gaussian noise is added in the early stages to perform simulated annealing.
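The two computational steps above (the conditional probabilities and the binary search for *σi*) can be sketched as follows. This is an illustrative NumPy version; `cond_probs` and `find_sigma` are my names, not the paper's:

```python
import numpy as np

def cond_probs(X, i, sigma):
    """p(j|i): Gaussian similarity of every point to x_i, normalized so the
    row sums to 1 (a point is never its own neighbor)."""
    d2 = np.sum((X - X[i]) ** 2, axis=1)
    p = np.exp(-d2 / (2.0 * sigma ** 2))
    p[i] = 0.0
    return p / p.sum()

def find_sigma(X, i, perplexity, tol=1e-5, max_iter=100):
    """Binary-search sigma_i so that 2^H(P_i) matches the requested
    perplexity, where H is the Shannon entropy of P_i."""
    lo, hi = 1e-10, 1e10
    for _ in range(max_iter):
        sigma = 0.5 * (lo + hi)
        p = cond_probs(X, i, sigma)
        nz = p[p > 0]
        perp = 2.0 ** (-np.sum(nz * np.log2(nz)))  # effective neighbor count
        if abs(perp - perplexity) < tol:
            break
        if perp > perplexity:   # too many effective neighbors: shrink sigma
            hi = sigma
        else:
            lo = sigma
    return sigma
```

Because perplexity increases monotonically with *σi*, plain bisection is sufficient here.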

__Symmetric SNE__

- A single KL Divergence between P (joint probability distribution in high-dimensional space) and Q (joint probability distribution in low-dimensional space) is minimized.
- Symmetric because the joint probabilities satisfy *pij* = *pji* and *qij* = *qji*.
- More robust to outliers and has a simpler gradient expression.
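The joint probabilities used in the high-dimensional space are obtained from the conditional ones as p_ij = (p(j|i) + p(i|j)) / 2n, which guarantees every datapoint contributes at least 1/(2n) to the cost, so outliers cannot vanish from the embedding. A small sketch (the function name is mine):

```python
import numpy as np

def symmetrize(P_cond):
    """Turn conditional probabilities p(j|i) (each row sums to 1) into the
    joint probabilities p_ij = (p(j|i) + p(i|j)) / (2n); the result is
    symmetric and sums to 1 over all pairs."""
    n = P_cond.shape[0]
    return (P_cond + P_cond.T) / (2.0 * n)
```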

__Crowding Problem__

- When we model a high-dimensional dataset in 2 (or 3) dimensions, it is difficult to segregate nearby datapoints from moderately distant ones, and gaps cannot form between natural clusters.
- One way around this problem is UNI-SNE, but its cost function is difficult to optimize.

__t-SNE__

- Instead of a Gaussian, use a heavy-tailed distribution (the Student-t distribution) to convert distances into probabilities in the low-dimensional space. This way, moderate distances in the high-dimensional space can be modeled by larger distances in the low-dimensional space.
- The Student-t distribution is an infinite mixture of Gaussians, and the density of a point under it can be evaluated much faster than under a Gaussian (no exponential).
- The cost function is easy to optimize.
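Assuming a symmetric joint distribution P (zero diagonal, summing to 1), the Student-t similarities and the gradient of the KL cost can be sketched as follows; this is an illustrative NumPy version and `tsne_grad` is my name:

```python
import numpy as np

def tsne_grad(P, Y):
    """Gradient of KL(P||Q) with Student-t similarities (1 degree of freedom):
    dC/dy_i = 4 * sum_j (p_ij - q_ij) * (y_i - y_j) / (1 + ||y_i - y_j||^2).
    P must be symmetric with zero diagonal and sum to 1 over all pairs."""
    diff = Y[:, None, :] - Y[None, :, :]             # diff[i, j] = y_i - y_j
    num = 1.0 / (1.0 + np.sum(diff ** 2, axis=2))    # heavy-tailed kernel
    np.fill_diagonal(num, 0.0)
    Q = num / num.sum()                              # joint low-dim similarities
    return 4.0 * np.sum(((P - Q) * num)[:, :, None] * diff, axis=1)
```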

__Optimization Tricks__

**Early Compression**

- At the start of optimization, force the datapoints (in low-dimensional space) to stay close together so that datapoints can easily move from one cluster to another.
- Implemented via an L2 penalty term proportional to the sum of the squared distances of the datapoints from the origin.

**Early Exaggeration**

- Scale up all the *pij*'s (e.g. by a factor of 4) in the early stages of optimization. The *qij*'s are too small to match these exaggerated values, so the natural clusters in the data form tight, widely separated clusters, and the resulting empty space in the low-dimensional map makes it easy for clusters to move around relative to each other.
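Both tricks slot into the gradient-descent loop. A schematic version follows; the function name, schedule constants, and defaults are illustrative choices rather than the paper's exact settings, and `grad_fn` stands for any function returning the gradient of the cost:

```python
import numpy as np

def optimize(P, Y, grad_fn, n_iter=300, lr=0.1, momentum=0.5,
             exaggeration=4.0, exag_iters=50,
             compression=1e-2, comp_iters=50):
    """Schematic t-SNE optimization loop showing where both tricks enter.
    grad_fn(P, Y) must return dC/dY for cost C."""
    vel = np.zeros_like(Y)
    for it in range(n_iter):
        # early exaggeration: scale up the target similarities at the start
        P_eff = P * exaggeration if it < exag_iters else P
        g = grad_fn(P_eff, Y)
        if it < comp_iters:
            # early compression: gradient of c * sum_i ||y_i||^2 keeps
            # the map points close to the origin initially
            g = g + 2.0 * compression * Y
        vel = momentum * vel - lr * g        # gradient descent with momentum
        Y = Y + vel
    return Y
```

With a convergent `grad_fn` the loop behaves like standard momentum descent; in real t-SNE the gradient would be that of the KL cost.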

__t-SNE on large datasets__

- Space and time complexity is quadratic in the number of datapoints, so t-SNE is infeasible to apply to large datasets directly.
- Select a random subset of points (called landmark points) to display.
- For each landmark point, define random walks over the neighborhood graph that start at that landmark point and terminate at any other landmark point.
- *pj|i* is defined as the fraction of random walks starting at *xi* that finish at *xj* (both being landmark points). This way, *pj|i* is not sensitive to "short-circuits" in the graph (caused by noisy datapoints).
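A Monte-Carlo sketch of these random-walk affinities, assuming a precomputed neighborhood graph; the function name and interface are mine, and this illustrative version simply simulates the walks:

```python
import numpy as np

def walk_affinities(neighbors, landmarks, n_walks=200, max_steps=10000, rng=None):
    """Estimate p(j|i) between landmark points as the fraction of random
    walks on the neighborhood graph that start at landmark i and first
    reach another landmark j.  `neighbors[v]` lists the graph neighbors
    of node v (e.g. from a k-nearest-neighbor graph)."""
    if rng is None:
        rng = np.random.default_rng(0)
    land = set(landmarks)
    idx = {l: t for t, l in enumerate(landmarks)}
    P = np.zeros((len(landmarks), len(landmarks)))
    for i in landmarks:
        for _ in range(n_walks):
            v = i
            for _ in range(max_steps):
                v = rng.choice(neighbors[v])   # step to a uniform random neighbor
                if v in land and v != i:       # terminate at another landmark
                    P[idx[i], idx[v]] += 1.0
                    break
    return P / n_walks
```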

__Advantages of t-SNE__

- The Gaussian kernel employed by t-SNE (in the high-dimensional space) defines a soft border between the local and global structure of the data.
- Both nearby and distant pairs of datapoints get equal importance in modeling the low-dimensional coordinates.
- The local neighborhood size of each datapoint is determined on the basis of the local density of the data.
- The random-walk version of t-SNE takes care of the "short-circuit" problem.

__Limitations of t-SNE__

- It is unclear how t-SNE would perform on the general **dimensionality reduction** task, i.e. reducing to more than 3 dimensions. For such higher-dimensional embeddings, a Student-t distribution with more degrees of freedom should be more appropriate.
- t-SNE reduces the dimensionality of the data mainly based on local properties, which means it may fail on data with an intrinsically high-dimensional structure (**curse of dimensionality**).
- The cost function of t-SNE is not convex, so several optimization parameters must be chosen, and these choices affect the constructed solution.


About the author:

Shagun Sodhani
