Genes encode and can be used to synthesize proteins, and this process is known as gene expression. In higher organisms like humans, thousands of genes express together by different amounts depending upon various factors such as the type of cell (nerve cell or heart cell), environment and disease conditions. For example, different types of cancers invoke different gene expression patterns in humans. These different gene expression patterns under different conditions can be studied using Microarray technology.
Data from a Microarray can be imagined as rectangular matrix or a grid with each cell in the matrix corresponding to a gene expression value under a particular condition. As shown in the figure below, each row of the matrix corresponds to a gene gi and each each column corresponds to a condition/sample si
With reference to the cancer example, humans have approximately 20,000 expressing genes and let say we want to know their expression patterns i.e. which genes produce proteins at higher or lower levels under different types of human cancers. Also, let say there are 20 types of human cancers known, then the microarray gene expression matrix will have 20,000 rows corresponding to the genes and 20 columns corresponding to the 20 types of cancers.
One of the first steps in analyzing the gene expression data is clustering the genes or the samples in classic data mining setup. The genes can be clustered based on their expression patterns across all the conditions and the samples can be clustered using patterns of gene expression across all the genes.
With respect to the clustering problem, for gene clustering, the data-points are genes and features are the expression values across all the samples. Consequently, in the gene clustering with respect to the cancer example, 20,000 data-points with each point having 20 dimensions will be clustered.
Clustering gene expression data gives important insights into the gene co-regulation and cellular functions of genes. Genes that cluster together have similar expression patterns across all the samples which may indicate co-regulation of those genes. Also, genes from the same cluster are likely to perform similar cellular functions which helps in annotating newly discovered genes.
Conversely, for sample clustering, the samples are the data-points to be clustered using the gene expression values across all the genes as features. For sample clustering, 20 data-points with each point having 20,000 dimensions will be clustered.
Below, we'll discuss to different methods to perform clustering - Llyod's K-means clustering and Hierarchical Clustering.
It is important to calculate the distance or proximity between the data-points for clustering, since all the clustering algorithms work on the principle that close points should be clustered together in one cluster. One of the effective measures of calculating the distances between the data-points Oi and Oj using their features in the Pearson’s Correlation Coefficient:
The K-means clustering algorithm attempts to partition the given data into clusters by iteratively assigning data-points to new clusters.
Consider the following 2D data. By looking at the scatterplot it looks like the data contains 3 distinct clusters. Therefore, we will arbitrarily initiate 3 cluster centroids or cluster centers. Since we do not have any clusters yet, these centroids are arbitrary points in the space.
Then, we calculate the distances of all the points from the 3 centroids and assign the points to cluster to which they are closest. Then, we recalculate the centroids using the points assigned in the cluster. Cluster centroid is simply the average of all the points in the cluster.
Recalculate the distances of the points from the 3 newly assigned centroids and reassign the points to cluster to which they are closest. After the points are re-assigned to their closest clusters recalculate the cluster centroids.
The above steps are repeated until convergence i.e. until no points are moved between the clusters.
Hierarchical clustering is one of the progressive clustering techniques which starts with small clusters and progressively merges the closely related small clusters into larger clusters until only one large cluster remains. One of the biggest advantages of hierarchical over K-means is that we do not have to pre-define the number of clusters beforehand. Instead the optimum number of clusters can be inferred after the clustering process is complete.
Lets take a closer look at the hierarchical clustering algorithm using the following 2D data containing 25 data-points
Firstly, assign every single point to its own individual cluster i.e. there are 25 clusters with each cluster containing 1 point. Then, calculate centroids of the clusters which will be the points itself. Calculate all-against-all centroid distances and join the two clusters into a new cluster whose centroids are the closest. Recalculate the centroid for the newly formed cluster.
Again, calculate all-against-all centroid distances, and detect and join closest two clusters into a new cluster. Recalculate the centroid for the new cluster.
Repeat the 3 steps, calculating all-against-all centroid distance calculation, merging the 2 closest clusters and recalculating the centroids for newly formed clusters, until only one large cluster containing all the 25 data points is obtained (convergence).
The entire hierarchical clustering process can be visualized using a dendrogram as show below, where the leaf nodes of the bifurcating tree are the data-points and internal nodes show each merging step that was carried out.
The height scale on left shows the distances at which the clusters were merged. The lowest internal nodes are at small distances showing that the closest clusters or points were merged first. The highest internal node is at a large distance signifying that points or clusters that were far apart were joined in one cluster at the highest distances.
The actual clustering solution is obtained by drawing a horizontal line across the clustering dendrogram at a specified distance cutoff. The number of clusters is equal to the number of intersections encountered by the horizontal cutting line. For example, the red horizontal line drawn at the distance cutoff=60 defines 3 clusters for the 25 data points (see figure for iteration 22).
An example showing the distinct types of diffuse large B-cell lymphoma identified by hierarchical clustering of gene expression data. Based on the different types identified, our estimates for how the cancer is expected to grow would differ, and could also lead to differences in the prescribed treatment.