K-means clustering is an algorithm to perform clustering. It is simple to understand and implement, and hence is one of the most popular methods, and often the first algorithm to be tried out when performing clustering.
The idea behind clustering is to segregate the data into groups called clusters, so that instances with similar behavior are classified in the same cluster. It is used in data mining, pattern recognition and anomaly detection. Clustering is the most popular unsupervised learning method. Unsupervised learning is a set of techniques to identify patterns and underlying characteristics in data.
K-means clustering partitions the dataset into k clusters, where each instance is allocated to one cluster. Each cluster is represented by the cluster's center, also known as the centroid. The following figure plots the partition of a data set into five clusters, with the cluster centroids shown as crosses.
The procedure works iteratively by relocating the K centroids and re-classifying data points.
It is easy to implement:
- Initialize K centroids (b)
- For each data point, find the nearest centroid and classify the sample into the respective cluster (c)
- Recalculate centroids as the arithmetic mean of all instances assigned to it (d)
- Repeat steps 2-3 till convergence (e)-(f). The K final clusters are then obtained.
Convergence is achieved when no further improvement can be made. That is, until there is no more change of the centroids or their memberships. It is a heuristic algorithm, which means that results will depend on the initial clusters.
There are multiple possible options for initializing the position of the centroids. Some common methods are described below:
- Forgy method: randomly chosen among the data points
- Kmeans++ (ideal for small datasets): selected to be as far from each other as possible, with the goal of converging faster. They can or not correspond to existing data points.
The optimization criterion of k-means clustering is:
where cluster j has centroid cj. The algorithm typically uses the Euclidean distance. However, any distance measure can be applied to the centroid calculation and the points assignation (e.g. cosine, Manhattan, Pearson).
The following are some methods for choosing a good value for parameter K:
- Rule of thumb: K ≈ sqrt(N/2). By doing so, each cluster will ideally have sqrt(2N) points. For example, for a dataset of N=50 points, start with sqrt(50/2) = 5 clusters; of 10 points each.
- Elbow method: perform clustering for a range of parameter values, and calculate the total within-cluster sums-of-squares (WCSS) for each K. WCSS gives a measurement of the variance within each cluster. The lower the WCSS, the better the partition. As K increases, WCSS will monotonically decrease. Therefore, an optimal K can be selected from the range where the curve flattens down.
- Silhouette analysis: by studying the separation distance between clusters. Silhouette coefficients measure how close are data points to the neighboring clusters and have a range of [-1, +1]. Points with a score near +1 are far away from neighboring clusters, whereas values around 0 indicate that samples are close to the boundaries between clusters. In addition, negative values might point out that instances are wrongly classified. Silhouette plots and averaged silhouette scores can be used to select an optimal K. A nice visual example with sklearn can be found here.
The above explanation uses hard assignments, i.e. each data point belongs completely to one cluster at any given point of time. This can be generalized in a way such that each cluster represents a probability distribution (say gaussian, with some mean and variance), and each data point belongs to different clusters with different probabilities. These are updated iteratively in the same way as described above. This version of K-means is called soft assignment, and it is an example of the EM-algorithm.
An interactive visualization of the algorithm is available at Alekseyn Plotnick's blog. You can play around with the distribution of the data, the number of clusters, and so on.
K-means clustering can be applied in multiple scenarios:
- Image processing: image segmentation (image partitioning into clusters), color quantization (reducing the number of colors in an image)
- Cluster analysis: segmenting users by spending behavior for e-commerce companies, user profiles grouping in search engines, patient clustering in healthcare
- Feature learning: features extraction in images and documents
- Anomaly detection: finding micro clusters formed by outliers
Unintuitive results and limitations may arise given the nature of the algorithm. It aims to build a partition of the dataset into evenly spaced and well-shaped (ideally spherical) clusters of similar size. However, clusters might have unequal variance (left figure) or be of uneven size (right). Note that some of these results might be expected, as in the right figure where data points are correctly separated.
In following example, k-means clustering with k = 3 is applied to the well-known Iris flower data set. As it can be seen, the resulting clusters fail to separate the three species.