Dimensionality reduction aims to reduce the number of features of a high dimensional dataset in-order to overcome the difficulties that arise due to the curse of dimensionality.
There are two approaches: feature selection and feature extraction. Feature selection focuses on finding a subset of the original attributes. Whereas feature extraction transforms the original high-dimensional space into a lower dimensional one. Ideally, some structure in the data should remain in order to preserve enough information. Algorithms can be unsupervised (principal component analysis or PCA, independent component analysis or ICA) or supervised (linear discriminant analysis or LDA). In feature extraction, transformations can be linear (PCA, LDA) or non-linear (t-SNE, autoencoders).
There are plenty of applications such as hidden patterns visualization (by removing highly correlated attributes), noise reduction (removing irrelevant features), further exploration, data compression and storage, etc. In fact, dimensionality reduction is usually applied as a preprocessing step for other machine learning and data mining algorithms by instance in unsupervised tasks PCA + K-Means clustering or LDA + K-Nearest Neighbors for supervised.
First, we'll see how Lasso Regression can be used for feature selection. Based on the solution for the regression minimization problem
features can be automatically selected (the above objective function is the standard cost function with mean squared error and L1 regularization). Since L1 regularization promotes sparsity, many of the estimated coefficients in the weight vector would be zero. Then, those features associated with non-zero coefficients would be kept. The higher the alpha parameter, the fewer features selected. However, L1 norm can lead to unstable results whenever features are highly correlated. For example by zeroing the coefficient of one out of two correlated features.
For an example of how Lasso regression can be used for feature selection, check this example on scikit-learn website. It shows an example using the Boston house-prices dataset where they select two most important features from the original 13.
PCA is a popular feature extraction technique based on retaining the maximum variance of the data. The input space X is linearly transformed into a space X’ of uncorrelated components called the principal components:
For this matter, it exploits the information encoded in the covariance data matrix CX.
Implementation for dataset X of d-dimensional instances x:
1. Normalize data X2. Calculate CX = covariance(X)3. Singular-value decomposition (SVD) of CX: CX = UΣUT,where U is the matrix of eigenvectors and Σ matrix of eigenvalues4. Choose first n <= d vectors in U, call it Ũ5. Project X into transformed space: X’ = ŨX
The idea is that Ũ should preserve maximal variance (90 − 99%) even for small n. Note that there are as many principal components as features d, but only n are used for projecting data into a lower dimensional space. For example, d can be of the order of 10 but n = 2. Since eigenvectors in Ũ are already ordered in decreasing variance contribution, it is enough to pick the first d. Then in the new coordinate system, the first principal component (first axis) captures the greatest variance, the second axis the second greatest variance, and so on.
The following is a visualization of PCA in small number of dimensions to help understand how it works. In this case, the dataset has 2 dimensions (this example is mostly for illustration, typically PCA is applied when the input data has many dimensions).
The figure below shows how a rotation can transforming the original 2D dataset of highly correlated features to uncorrelated features. This is the geometrical representation of multiplying X by U, i.e. the left graph represents X, and the right graph UX.
If dimensionality reduction is desired, transformed data can be further projected into 1D along its largest principal component. This is the step of choosing the top n (in this case 1) most important dimensions from the d (in this case 2) available dimensions. The right graph represents ŨX.
In some cases the orthogonality of the principal components prevent it from extracting informative features. For these cases independent component analysis (ICA) is a better choice. An examples case where ICA works much better than PCA is illustrated below.
Finally, PCA works better on data with linear relationships. For non-linear transformation, Kernel PCA can be applied for better results. This link shows an illustrative example. In general, dimensionality reduction problems are approached by first applying a linear algorithm. Then, when results are not satisfactory, by trying non-linear techniques (autoencoders, etc).
PCA has been successfully applied on single-cell RNA sequencing data. The goal of the project was to analyze similar cells based on the genes that were transcribed (the first step towards protein synthesis inside animal and human cells). Although each cell had initially about 10,000 transcribed genes (input space X), a compressed transcription profile was obtained by projecting cells vectors into the first few principal components (a 2D target space X’). Then a clustering algorithm was run; it correctly grouped cells with similar transcription profiles. Furthermore, it can be seen in the figure that each PC is helpful in distinguishing different pairs of clusters. This can be further used for studying which genes are most involved in differentiating between distinct types of cells. For example, by analyzing PC1, genes responsible for the difference between dermal and neural cells could be learned.
For more details on this work, check out the following YouTube video.
from sklearn.decomposition import PCAimport numpy as np# lets create featuresx1 = np.random.normal(size=200)x2 = np.random.normal(size=200)x3 = x1 + x2 #not useful since its highly correlated with other features.X = np.c_[x1,x2,x3]pca = PCA()pca.fit(X)
PCA(copy=True, iterated_power='auto', n_components=None, random_state=None, svd_solver='auto', tol=0.0, whiten=False)
pca.explained_variance_ #third feature is clearly useless
array([ 2.961e+00, 1.061e+00, 3.341e-32])
pca.n_components_ #still 3, because we have not specify no of components to keep in PCA() method
pca2 = PCA(n_components=2)pca2.fit(X)
PCA(copy=True, iterated_power='auto', n_components=2, random_state=None, svd_solver='auto', tol=0.0, whiten=False)
X_processed = pca2.fit_transform(X)X.shape
Note that dimensionality reduction by means of Lasso regularization works by selecting features without modifying them (feature selection). In contrast to PCA, which transforms those original features into a low-dimensional space (feature extraction).