# Clustering Gene Expression Data

January 05, 2018

# Introduction

Genes encode and can be used to synthesize *proteins*, and this process is known as *gene expression*. In higher organisms like humans, thousands of genes express together by different amounts depending upon various factors such as the type of cell (nerve cell or heart cell), environment and disease conditions. For example, different types of cancers invoke different *gene expression patterns* in humans. These different gene expression patterns under different conditions can be studied using *Microarray* technology.

# Microarrays and Gene Expression profiling

Data from a *Microarray* can be imagined as *rectangular matrix* or a *grid* with each cell in the matrix corresponding to a *gene expression value* under a particular condition. As shown in the figure below, each *row* of the matrix corresponds to a *gene* g_{i} and each each *column* corresponds to a *condition/sample* s_{i}

With reference to the cancer example, humans have approximately 20,000 expressing genes and let say we want to know their expression patterns i.e. which genes produce proteins at higher or lower levels under different types of human cancers. Also, let say there are 20 types of human cancers known, then the microarray gene expression matrix will have 20,000 rows corresponding to the genes and 20 columns corresponding to the 20 types of cancers.

# Gene Expression Clustering

One of the first steps in analyzing the gene expression data is *clustering* the genes or the samples in classic data mining setup. The genes can be clustered based on their expression patterns across all the conditions and the samples can be clustered using patterns of gene expression across all the genes.

With respect to the clustering problem, for gene clustering, the *data-points* are genes and *features* are the expression values across all the samples. Consequently, in the gene clustering with respect to the cancer example, 20,000 data-points with each point having 20 dimensions will be clustered.

Clustering gene expression data gives important insights into the *gene co-regulation* and *cellular functions* of genes. Genes that cluster together have similar expression patterns across all the samples which may indicate co-regulation of those genes. Also, genes from the same cluster are likely to perform similar cellular functions which helps in *annotating* newly discovered genes.

Conversely, for sample clustering, the samples are the *data-points* to be clustered using the gene expression values across all the genes as *features*. For sample clustering, 20 data-points with each point having 20,000 dimensions will be clustered.

Below, we’ll discuss to different methods to perform clustering - *Llyod’s K-means clustering* and *Hierarchical Clustering*.

# Proximity calculation

It is important to calculate the *distance* or *proximity* between the data-points for clustering, since all the clustering algorithms work on the principle that close points should be clustered together in one cluster. One of the effective measures of calculating the distances between the data-points O_{i} and O_{j} using their features in the *Pearson’s Correlation Coefficient*:

# K-means clustering

The K-means clustering algorithm attempts to partition the given data into clusters by iteratively assigning data-points to new clusters.

Consider the following 2D data. By looking at the scatterplot it looks like the data contains 3 distinct clusters. Therefore, we will arbitrarily initiate 3 cluster *centroids* or *cluster centers*. Since we do not have any clusters yet, these centroids are *arbitrary points* in the space.

Then, we calculate the *distances* of all the points from the 3 centroids and *assign* the points to cluster to which they are *closest*. Then, we *recalculate* the centroids using the points assigned in the cluster. Cluster centroid is simply the *average* of all the points in the cluster**.**

Recalculate the distances of the points from the 3 newly assigned centroids and reassign the points to cluster to which they are closest. After the points are re-assigned to their closest clusters recalculate the cluster centroids.

The above steps are repeated until *convergence* i.e. until no points are moved between the clusters.

# Hierarchical clustering

Hierarchical clustering is one of the *progressive clustering* techniques which starts with *small clusters* and progressively *merges* the closely related small clusters into larger clusters until only one large cluster remains. One of the biggest advantages of hierarchical over K-means is that we do not have to pre-define the *number of clusters* beforehand. Instead the optimum number of clusters can be inferred after the clustering process is complete.

Lets take a closer look at the hierarchical clustering algorithm using the following 2D data containing 25 data-points

Firstly, assign every single point to its own individual cluster i.e. there are 25 clusters with each cluster containing 1 point. Then, calculate centroids of the clusters which will be the points itself. Calculate *all-against-all* centroid distances and join the two clusters into a new cluster whose centroids are the closest. Recalculate the centroid for the newly formed cluster.

Iteration 1

Again, calculate all-against-all centroid distances, and detect and join closest two clusters into a new cluster. Recalculate the centroid for the new cluster.

Iteration 2

Repeat the 3 steps, calculating all-against-all centroid distance calculation, merging the 2 closest clusters and recalculating the centroids for newly formed clusters, until only one large cluster containing all the 25 data points is obtained (convergence).

Iteration 3

Iteration 4

Iteration 5

Iteration 6

Iteration 7

Iteration 8

Iteration 9

Iteration 10

Iteration 11

Iteration 12

Iteration 13

Iteration 14

Iteration 15

Iteration 16

Iteration 17

Iteration 18

Iteration 19

Iteration 20

Iteration 21

Iteration 22

Iteration 23

Iteration 24

The entire hierarchical clustering process can be visualized using a *dendrogram* as show below, where the *leaf nodes* of the *bifurcating tree* are the data-points and *internal nodes* show each *merging step* that was carried out.

The *height scale* on left shows the *distances* at which the clusters were merged. The *lowest internal nodes* are at *small distances* showing that the closest clusters or points were merged first. The *highest internal node* is at a large distance signifying that points or clusters that were far apart were joined in one cluster at the *highest distances*.

The actual clustering solution is obtained by drawing a *horizontal line* across the clustering dendrogram at a specified distance cutoff. The *number of clusters* is equal to the *number of intersections* encountered by the horizontal cutting line. For example, the red horizontal line drawn at the distance cutoff=60 defines 3 clusters for the 25 data points (see figure for iteration 22).

# Example

An example showing the distinct types of diffuse large B-cell lymphoma identified by hierarchical clustering of gene expression data. Based on the different types identified, our estimates for how the cancer is expected to grow would differ, and could also lead to differences in the prescribed treatment.