# Introduction

- Method to visualize high-dimensional data points in 2/3 dimensional space.
- Data visualization techniques like Chernoff faces and graph approaches just provide a representation and not an interpretation.
- Dimensionality reduction techniques fail to retain both local and global structure of the data simultaneously. For example, PCA and MDS are linear techniques and fail on data lying on a non-linear manifold.
- t-SNE approach converts data into a matrix of pairwise similarities and visualizes this matrix.
- Based on SNE (Stochastic Neighbor Embedding)
- Link to paper

# SNE

- Given a set of datapoints
*x1, x2, ...xn, pi|j*is the probability that*xi*would pick*xj*as its neighbor if neighbors were picked in propotion to their porbability density under a Gaussian centred at*xi*. Calculation of*σi*is described later. - Similarly, define
*qi|j*as condtional probability corresponding to low-dimensional representations of*yi*and*yj*(corresponding to*xi*and*xj*). The variance of Gaussian in this case is set to be 1/sqrt(2) - Argument is that if
*yi*and*yj*captures the similarity between*xi*and*xj*,*pi|j*and*qi|j*should be equal. So objective function to be minimized is Kullback-Leibler (KL) Divergence measure for*Pi*and*Qi*, where*Pi*(*Qi*) represent condtional probability distribution given*xi*(*yi*) - Since KL Divergence is not symmetric, the objective function focuses on retaining the local structure.
- Users specify a value called perplexity (measure of effective number of neighbors). A binary search is performed to find &sigmai which produces the
*Pi*having same perplexity as specified by the user. - Gradient Descent (with momentum) is used to minimize objective function and Gaussian noise is added in early stages to perform simulated annealing.

# t-SNE (t-Distributed SNE)

__Symmetric SNE__

- A single KL Divergence between P (joint probability distribution in high-dimensional space) and Q (joint probability distribution in low-dimensional space) is minimized.
- Symmetric because
*pi|j**= pj|i*and*qi|j**= qj|i* - More robust to outliers and has a simpler gradient expression.

__Crowding Problem__

- When we model a high-dimensional dataset in 2 (or 3) dimensions, it is difficult to segregate the nearby datapoints from moderately distant datapoints and gaps can not form between natural clusters.
- One way around this problem is to use UNI-SNE but optimization of the cost function, in that case, is difficult.

__t-SNE__

- Instead of Gaussian, use a heavy-tailed distribution (like Student-t distribution) to convert distances into probability scores in low dimensions. This way moderate distance in high-dimensional space can be modeled by larger distance in low-dimensional space.
- Student-t distribution is an infinite mixture of Gaussians and density for a point under this distribution can be computed very fast.
- The cost function is easy to optimize.

__Optimization Tricks__

**Early Compression**

- At the start of optimization, force the datapoints (in low-dimensional space) to stay close together so that datapoints can easily move from one cluster to another.
- Implemented an L2-penalty term proportional to the sum of the squared distance of datapoints from the origin.

**Early Exaggeration**

- Scale all the
*pi|j*'s so that large*qi|j*'s are obtained with the effect that natural clusters in the data form tight, widely separated clusters as a lot of empty space is created in the low-dimensional space.

__t-SNE on large datasets__

- Space and time complexity is quadratic in the number of datapoints so infeasible to apply on large datasets.
- Select a random subset of points (called landmark points) to display.
- for each landmark point, define a random walk starting at a landmark point and terminating at any other landmark point.
*pi|j*is defined as fraction of random walks starting at*xi*and finishing at*xj*(both these points are landmark points). This way,*pi|j*is not sensitive to "short-circuits" in the graph (due to noisy data points).

# Advantages of t-SNE

- Gaussian kernel employed by t-SNE (in high-dimensional) defines a soft border between the local and global structure of the data.
- Both nearby and distant pair of datapoints get equal importance in modeling the low-dimensional coordinates.
- The local neighborhood size of each datapoint is determined on the basis of the local density of the data.
- Random walk version of t-SNE takes care of "short-circuit" problem.

# Limitations of t-SNE

- It is unclear t-SNE would perform on general
**Dimensionality Reduction**for more than 3 dimensions. For such higher (than 3) dimensions, Student-t distribution with more degrees of freedom should be more appropriate. - t-SNE reduces the dimensionality of data mainly based on local properties of the data which means it would fail in data which has intrinsically high dimensional structure (
**curse of dimensionality**). - The cost function for t-SNE is not convex requiring several optimization parameters to be chosen which would affect the constructed solution.