**Dimensionality reduction** aims to reduce the number of features of a high dimensional dataset in-order to overcome the difficulties that arise due to the curse of dimensionality.

There are two approaches: **feature selection** and **feature extraction**. Feature selection focuses on finding a subset of the original attributes. Whereas feature extraction transforms the original high-dimensional space into a lower dimensional one. Ideally, some structure in the data should remain in order to preserve enough information. Algorithms can be unsupervised (principal component analysis or PCA, independent component analysis or ICA) or supervised (linear discriminant analysis or LDA). In feature extraction, transformations can be linear (PCA, LDA) or non-linear (t-SNE, autoencoders).

There are plenty of applications such as hidden patterns visualization (by removing highly correlated attributes), noise reduction (removing irrelevant features), further exploration, data compression and storage, etc. In fact, dimensionality reduction is usually applied as a **preprocessing step **for other machine learning and data mining algorithms by instance in unsupervised tasks* PCA + K-Means clustering* or *LDA + K-Nearest Neighbors* for supervised.

# Feature Selection with Lasso Regression

First, we'll see how Lasso Regression can be used for feature selection. Based on the solution for the *regression minimization problem*

\min_{w} \dfrac{1}{N} \sum_{i=1}^{N} (w x_i - y_i)^2 + \alpha \sum_{j=1}^{d} |w_j|

features can be automatically selected (the above objective function is the standard cost function with mean squared error and L1 regularization). Since L1 regularization promotes sparsity, many of the estimated coefficients in the weight vector would be zero. Then, those features associated with **non-zero coefficients** would be kept. The higher the alpha parameter, the fewer features selected. However, L1 norm can lead to unstable results whenever features are highly correlated. For example by zeroing the coefficient of one out of two correlated features.

For an example of how Lasso regression can be used for feature selection, check this example on scikit-learn website. It shows an example using the Boston house-prices dataset where they select two most important features from the original 13.

# Feature Extraction with Principal Component Analysis

PCA is a popular feature extraction technique based on retaining the **maximum variance of the data**. The input space X is linearly transformed into a space X’ of uncorrelated components called the **principal components**:

X^{\prime} = \tilde{U}X,\quad \text{where } \tilde{U}\ \text{orthogonal matrix}

For this matter, it exploits the information encoded in the covariance data matrix C_{X}.

__Implementation for dataset X of d-dimensional instances x:__

1. Normalize data X

2. Calculate CX = covariance(X)

3. Singular-value decomposition (SVD) of CX: CX = UΣUT,

where U is the matrix of eigenvectors and Σ matrix of eigenvalues

4. Choose first n <= d vectors in U, call it Ũ

5. Project X into transformed space: X’ = ŨX

The idea is that Ũ should preserve maximal variance (90 − 99%) even for small *n*. Note that there are as many principal components as features *d*, but only *n* are used for projecting data into a lower dimensional space. For example, *d* can be of the order of 10 but *n *= 2. Since eigenvectors in Ũ are already ordered in decreasing variance contribution, it is enough to pick the first *d*. Then in the new coordinate system, the first principal component (first axis) captures the greatest variance, the second axis the second greatest variance, and so on.

## Example and visualization

The following is a visualization of PCA in small number of dimensions to help understand how it works. In this case, the dataset has 2 dimensions (this example is mostly for illustration, typically PCA is applied when the input data has many dimensions).

The figure below shows how a rotation can transforming the original 2D dataset of __highly correlated__ features to __uncorrelated__ features. This is the geometrical representation of multiplying X by U, i.e. the left graph represents X, and the right graph UX.

If dimensionality reduction is desired, transformed data can be further projected into 1D along its largest principal component. This is the step of choosing the top *n* (in this case 1) most important dimensions from the *d* (in this case 2) available dimensions. The right graph represents ŨX.

## Extensions of PCA

In some cases the __orthogonality of the principal components__ prevent it from extracting informative features. For these cases **independent component analysis** (ICA) is a better choice. An examples case where ICA works much better than PCA is illustrated below.

Finally, PCA works better on data with linear relationships. For non-linear transformation, Kernel PCA can be applied for better results. This link shows an illustrative example. In general, dimensionality reduction problems are approached by first applying a linear algorithm. Then, when results are not satisfactory, by trying non-linear techniques (autoencoders, etc).

## Real-world example where PCA is successful

PCA has been successfully applied on single-cell RNA sequencing data. The goal of the project was to analyze similar cells based on the genes that were transcribed (the first step towards protein synthesis inside animal and human cells). Although each cell had initially about 10,000 transcribed genes (input space *X*), a compressed transcription profile was obtained by projecting cells vectors into the first few principal components (a 2D target space *X’*). Then a **clustering algorithm **was run; it correctly grouped cells with similar transcription profiles. Furthermore, it can be seen in the figure that each PC is helpful in distinguishing different pairs of clusters. This can be further used for studying which genes are most involved in differentiating between distinct types of cells. For example, by analyzing PC1, genes responsible for the difference between dermal and neural cells could be learned.

For more details on this work, check out the following YouTube video.

## PCA code illustration using scikit learn

from sklearn.decomposition import PCA

import numpy as np

# lets create features

x1 = np.random.normal(size=200)

x2 = np.random.normal(size=200)

x3 = x1 + x2 #not useful since its highly correlated with other features.

X = np.c_[x1,x2,x3]

pca = PCA()

pca.fit(X)

PCA(copy=True, iterated_power='auto', n_components=None, random_state=None, svd_solver='auto', tol=0.0, whiten=False)

pca.explained_variance_ #third feature is clearly useless

array([ 2.961e+00, 1.061e+00, 3.341e-32])

pca.n_components_ #still 3, because we have not specify no of components to keep in PCA() method

3

pca2 = PCA(n_components=2)

pca2.fit(X)

PCA(copy=True, iterated_power='auto', n_components=2, random_state=None, svd_solver='auto', tol=0.0, whiten=False)

pca2.n_components_

2

X_processed = pca2.fit_transform(X)

X.shape

(200, 3)

X_processed.shape

(200, 2)

# Feature selection vs Feature extraction

Note that dimensionality reduction by means of Lasso regularization works by selecting features without modifying them (feature *selection*). In contrast to PCA, which *transforms* those original features into a low-dimensional space (feature *extraction*).