Principal Component Analysis (PCA) is a linear dimensionality reduction algorithm. Before moving forward, it's important to mention the curse of dimensionality: many algorithms perform poorly on high-dimensional data. High dimensionality also makes clustering hard, because with many dimensions every point tends to be far away from every other point, so it becomes unclear what distance really means.
Reducing the dimensions means reducing the number of features in a dataset. We can do this by feature selection or feature extraction. PCA is a feature extraction technique: it projects the original high-dimensional dataset onto a new, lower-dimensional feature space. We can also think of it as data compression that retains the most relevant information. PCA aims to find the directions of maximum variance in the high-dimensional data and project it onto a new subspace with fewer dimensions.
Given a d-dimensional input dataset X, we want to find a projection matrix W such that X·W gives an n-dimensional representation, with n <= d. We construct W from eigenvectors, which can be computed by decomposing the covariance matrix of X or by performing singular value decomposition. One important thing to note here is that PCA is sensitive to the scale of the data, so we should standardize the data before applying PCA. We can summarize the process as follows (a small NumPy sketch of these steps appears after the list):
- Standardize the dataset.
- Find eigenvectors and eigenvalues by decomposing the covariance matrix, or use singular value decomposition.
- Sort eigenvalues in descending order and select the n eigenvectors corresponding to the n largest eigenvalues.
- Construct the projection matrix W from those eigenvectors.
- Project the dataset X onto the low-dimensional space by multiplying X and W.
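As a rough illustration of these steps (not the only possible implementation), here is a minimal NumPy sketch that performs PCA manually via the covariance matrix; the variable names (X_std, W, n and so on) are my own choices for this example:

import numpy as np

# toy data: 200 samples, 3 features (d = 3)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))

# 1. standardize: zero mean, unit variance per feature
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# 2. eigendecompose the covariance matrix (eigh, since it is symmetric)
cov = np.cov(X_std, rowvar=False)
eig_vals, eig_vecs = np.linalg.eigh(cov)

# 3. sort eigenvalues (and the matching eigenvectors) in descending order
order = np.argsort(eig_vals)[::-1]
eig_vals, eig_vecs = eig_vals[order], eig_vecs[:, order]

# 4. build the projection matrix W from the top n eigenvectors (n = 2 here)
n = 2
W = eig_vecs[:, :n]       # shape (3, 2)

# 5. project X onto the new n-dimensional subspace
X_low = X_std @ W         # shape (200, 2)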
from sklearn.decomposition import PCA
import numpy as np

# let's create features
x1 = np.random.normal(size=200)
x2 = np.random.normal(size=200)
x3 = x1 + x2  # not useful, since it is highly correlated with the other features
X = np.c_[x1, x2, x3]

pca = PCA()
pca.fit(X)
PCA(copy=True, iterated_power='auto', n_components=None, random_state=None, svd_solver='auto', tol=0.0, whiten=False)
pca.explained_variance_  # the third component carries essentially no variance, reflecting the redundant x3
array([ 2.961e+00, 1.061e+00, 3.341e-32])
pca.n_components_  # still 3, because we have not specified the number of components to keep in PCA()
3
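A common way to decide how many components to keep is to look at explained_variance_ratio_ (an attribute of the fitted sklearn PCA object) and its cumulative sum. The sketch below uses the pca object fitted above and keeps enough components to cover roughly 95% of the variance; the threshold is an arbitrary choice for illustration:

cum_var = np.cumsum(pca.explained_variance_ratio_)  # cumulative fraction of total variance
n_keep = int(np.argmax(cum_var >= 0.95)) + 1         # smallest n that reaches the 95% threshold
print(n_keep)                                        # 2 for this data, since x3 adds no variance of its own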
pca2 = PCA(n_components=2)
pca2.fit(X)
PCA(copy=True, iterated_power='auto', n_components=2, random_state=None, svd_solver='auto', tol=0.0, whiten=False)
pca2.n_components_
2
X_processed = pca2.fit_transform(X)
X.shape
(200, 3)
X_processed.shape
(200, 2)
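Because the discarded component carries essentially no variance, mapping the reduced data back with inverse_transform (part of scikit-learn's PCA API) recovers the original features almost exactly, which is the "compression with minimal information loss" view mentioned earlier:

X_reconstructed = pca2.inverse_transform(X_processed)  # back to shape (200, 3)
print(np.allclose(X, X_reconstructed))                 # True, since x3 = x1 + x2 exactly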
PCA is not the only dimensionality reduction algorithm. Check out Linear Discriminant Analysis (LDA); a brief sketch follows.
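For reference, a minimal usage sketch of LDA in scikit-learn. Unlike PCA it is supervised, so it needs class labels y; the Iris dataset here is just a convenient stand-in, not data from the example above:

from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X_iris, y_iris = load_iris(return_X_y=True)
lda = LinearDiscriminantAnalysis(n_components=2)  # at most (number of classes - 1) components
X_lda = lda.fit_transform(X_iris, y_iris)         # shape (150, 2)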