- A method to visualize high-dimensional datapoints in 2- or 3-dimensional space.
- Data visualization techniques like Chernoff faces and graph-based approaches provide only a representation, not an interpretation, of the data.
- Dimensionality reduction techniques fail to retain both local and global structure of the data simultaneously. For example, PCA and MDS are linear techniques and fail on data lying on a non-linear manifold.
- The t-SNE approach converts the data into a matrix of pairwise similarities and visualizes this matrix.
- Based on SNE (Stochastic Neighbor Embedding)
- Link to paper
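As a quick illustration of what the method produces, here is a minimal usage sketch based on scikit-learn's independent `TSNE` implementation (not the paper's original code), embedding the 64-dimensional digits dataset into 2 dimensions:

```python
# Minimal usage sketch: embed the 64-dimensional digits dataset into 2D
# with scikit-learn's TSNE and plot the map, colored by class label.
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

X, y = load_digits(return_X_y=True)   # 1797 points, 64 dimensions
Y = TSNE(n_components=2, perplexity=30.0, random_state=0).fit_transform(X)

plt.scatter(Y[:, 0], Y[:, 1], c=y, s=5, cmap="tab10")
plt.colorbar(label="digit class")
plt.show()
```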
- Given a set of datapoints x1, x2, ..., xn, pj|i is the conditional probability that xi would pick xj as its neighbor if neighbors were picked in proportion to their probability density under a Gaussian centered at xi. The calculation of σi is described later.
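A minimal sketch of this computation, assuming σi (the `sigma` argument below) is already known; the perplexity-based search that produces it is described later:

```python
import numpy as np

def conditional_probabilities(X, i, sigma):
    """p(j|i): probability that x_i picks x_j as its neighbor, proportional
    to a Gaussian density centered at X[i] with standard deviation sigma."""
    d2 = np.sum((X - X[i]) ** 2, axis=1)   # squared distances ||x_i - x_j||^2
    p = np.exp(-d2 / (2.0 * sigma ** 2))
    p[i] = 0.0                             # a point never picks itself
    return p / p.sum()
```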
- Similarly, define qj|i as the conditional probability corresponding to the low-dimensional representations yi and yj (of xi and xj respectively). The variance of the Gaussian in this case is set to 1/√2.
- The argument is that if yi and yj correctly capture the similarity between xi and xj, then pj|i and qj|i should be equal. So the objective function to be minimized is the sum of Kullback-Leibler (KL) divergences between Pi and Qi, where Pi (Qi) denotes the conditional probability distribution given xi (yi).
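Concretely, the cost function from the paper is:

```latex
C = \sum_i \mathrm{KL}(P_i \,\|\, Q_i) = \sum_i \sum_j p_{j|i} \log \frac{p_{j|i}}{q_{j|i}}
```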
- Since the KL divergence is not symmetric, mismatches are penalized asymmetrically: using widely separated map points to represent nearby datapoints is costly, while using nearby map points to represent widely separated datapoints is cheap. Hence the objective function focuses on retaining the local structure of the data.
- Users specify a value called perplexity (a measure of the effective number of neighbors). A binary search is performed to find the σi that produces a Pi with the same perplexity as specified by the user.
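A sketch of that binary search, reusing the hypothetical `conditional_probabilities` helper from above; perplexity is defined as Perp(Pi) = 2^H(Pi), where H(Pi) is the Shannon entropy of Pi in bits:

```python
import numpy as np

def sigma_for_perplexity(X, i, target_perplexity, tol=1e-5, max_iter=50):
    """Binary-search the sigma_i whose P_i has the requested perplexity."""
    lo, hi = 1e-10, 1e10
    for _ in range(max_iter):
        sigma = (lo + hi) / 2.0
        p = conditional_probabilities(X, i, sigma)
        h = -np.sum(p[p > 0] * np.log2(p[p > 0]))  # entropy H(P_i) in bits
        if abs(2.0 ** h - target_perplexity) < tol:
            break
        if 2.0 ** h > target_perplexity:
            hi = sigma   # too many effective neighbors: shrink sigma
        else:
            lo = sigma   # too few effective neighbors: grow sigma
    return sigma
```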
- Gradient descent (with momentum) is used to minimize the objective function, and Gaussian noise is added to the map points in the early stages to perform a form of simulated annealing.
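The momentum update takes the standard form, where Y denotes the low-dimensional map, η the learning rate, and α(t) the momentum at iteration t:

```latex
\mathcal{Y}^{(t)} = \mathcal{Y}^{(t-1)} - \eta \frac{\partial C}{\partial \mathcal{Y}} + \alpha(t)\left(\mathcal{Y}^{(t-1)} - \mathcal{Y}^{(t-2)}\right)
```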
- In symmetric SNE, a single KL divergence between P (the joint probability distribution in the high-dimensional space) and Q (the joint probability distribution in the low-dimensional space) is minimized.
- Symmetric because pij = pji and qij = qji for all i, j.
- This variant is more robust to outliers and has a simpler gradient expression.
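In particular, the high-dimensional joint probabilities are symmetrized so that even outliers make a non-negligible contribution to the cost, and the gradient simplifies to:

```latex
p_{ij} = \frac{p_{j|i} + p_{i|j}}{2n}, \qquad
\frac{\partial C}{\partial y_i} = 4 \sum_j (p_{ij} - q_{ij})(y_i - y_j)
```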
- When a high-dimensional dataset is modeled in 2 (or 3) dimensions, it is difficult to segregate nearby datapoints from moderately distant datapoints, and gaps cannot form between natural clusters. This is the "crowding problem".
- One way around this problem is UNI-SNE, but its cost function is difficult to optimize.
- Instead of a Gaussian, t-SNE uses a heavy-tailed distribution (the Student-t distribution) to convert distances into probabilities in the low-dimensional space (see the equations below). This way, a moderate distance in the high-dimensional space can be modeled by a larger distance in the low-dimensional space.
- The Student-t distribution is an infinite mixture of Gaussians, and the density of a point under this distribution can be computed very quickly because it does not involve an exponential.
- The resulting cost function is also easier to optimize than that of UNI-SNE.
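With a Student-t distribution with one degree of freedom in the low-dimensional space, the joint probabilities and the gradient from the paper become:

```latex
q_{ij} = \frac{(1 + \lVert y_i - y_j \rVert^2)^{-1}}{\sum_{k \neq l} (1 + \lVert y_k - y_l \rVert^2)^{-1}}, \qquad
\frac{\partial C}{\partial y_i} = 4 \sum_j (p_{ij} - q_{ij})(y_i - y_j)(1 + \lVert y_i - y_j \rVert^2)^{-1}
```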
- At the start of the optimization, force the datapoints (in the low-dimensional space) to stay close together so that datapoints can easily move from one cluster to another (a trick called "early compression").
- This is implemented as an L2 penalty term proportional to the sum of the squared distances of the datapoints from the origin.
- Alternatively, scale up all the pij's in the early stages of optimization ("early exaggeration") so that large qij's are obtained. The effect is that natural clusters in the data form tight, widely separated clusters, as a lot of empty space is created in the low-dimensional map.
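A sketch of an optimization loop with early exaggeration; the factor 4, the 50-iteration cutoff, and the momentum schedule follow the paper, while `tsne_gradient` is a placeholder for the gradient given above:

```python
import numpy as np

def optimize(P, Y, n_iter=1000, learning_rate=100.0):
    """Gradient descent with momentum and early exaggeration (sketch).
    P: symmetrized high-dimensional joint probabilities; Y: initial map."""
    P = P * 4.0                       # early exaggeration: inflate all p_ij
    update = np.zeros_like(Y)
    for t in range(n_iter):
        if t == 50:
            P = P / 4.0               # revert to the true p_ij
        momentum = 0.5 if t < 250 else 0.8
        grad = tsne_gradient(P, Y)    # placeholder for the t-SNE gradient
        update = momentum * update - learning_rate * grad
        Y = Y + update
    return Y
```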
t-SNE on large datasets
- The space and time complexity of t-SNE is quadratic in the number of datapoints, so it is infeasible to apply it to large datasets directly.
- Select a random subset of points (called landmark points) to display.
- For each landmark point, define a random walk starting at that landmark point and terminating as soon as it reaches any other landmark point.
- pj|i is defined as the fraction of random walks starting at xi that terminate at xj (where both points are landmark points). This way, pj|i is not sensitive to "short-circuits" in the graph (caused by noisy datapoints).
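A simplified sketch of this estimate, assuming a precomputed neighborhood graph stored as adjacency lists; the paper weights each step by Gaussian similarities, replaced here by uniform steps for brevity:

```python
import random

def random_walk_p(neighbors, landmarks, i, n_walks=1000, max_steps=10_000):
    """Estimate p(j|i) for landmark i as the fraction of random walks on the
    neighborhood graph that start at node i and terminate at landmark j."""
    landmark_set = set(landmarks)
    counts = dict.fromkeys(landmarks, 0)
    for _ in range(n_walks):
        node = i
        for _ in range(max_steps):                 # guard against dead ends
            node = random.choice(neighbors[node])  # uniform step (simplified)
            if node in landmark_set and node != i:
                counts[node] += 1
                break
    return {j: count / n_walks for j, count in counts.items()}
```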
Advantages of t-SNE
- The Gaussian kernel employed by t-SNE (in the high-dimensional space) defines a soft border between the local and global structure of the data.
- Both nearby and distant pairs of datapoints get equal importance in modeling the low-dimensional coordinates.
- The local neighborhood size of each datapoint is determined on the basis of the local density of the data.
- The random walk version of t-SNE takes care of the "short-circuit" problem.
Weaknesses
- It is unclear how t-SNE would perform on the general dimensionality reduction task, when reducing to more than 3 dimensions. For such higher dimensionalities, a Student-t distribution with more degrees of freedom would likely be more appropriate.
- t-SNE reduces the dimensionality of data mainly based on local properties of the data, which means it would fail on data with an intrinsically high-dimensional structure (curse of dimensionality).
- The cost function of t-SNE is not convex, so several optimization parameters have to be chosen, and the constructed solution depends on these choices.