Part of course:
- Choosing K
- Soft assignment
- Interactive visualization
K-means clustering is an algorithm to perform clustering. It is simple to understand and implement, and hence is one of the most popular methods, and often the first algorithm to be tried out when performing clustering.
The idea behind clustering is to segregate the data into groups called clusters, so that instances with similar behavior are classified in the same cluster. It is used in data mining, pattern recognition and anomaly detection. Clustering is the most popular unsupervised learning method. Unsupervised learning is a set of techniques to identify patterns and underlying characteristics in data.
K-means clustering partitions the dataset into k clusters, where each instance is allocated to one cluster. Each cluster is represented by the cluster's center, also known as the centroid. The following figure plots the partition of a data set into five clusters, with the cluster centroids shown as crosses.
The procedure works iteratively by relocating the K centroids and re-classifying data points.
It is easy to implement:
Convergence is achieved when no further improvement can be made. That is, until there is no more change of the centroids or their memberships. It is a heuristic algorithm, which means that results will depend on the initial clusters.
There are multiple possible options for initializing the position of the centroids. Some common methods are described below:
The optimization criterion of k-means clustering is:
where cluster j has centroid cj. The algorithm typically uses the Euclidean distance. However, any distance measure can be applied to the centroid calculation and the points assignation (e.g. cosine, Manhattan, Pearson).
The following are some methods for choosing a good value for parameter K:
The above explanation uses hard assignments, i.e. each data point belongs completely to one cluster at any given point of time. This can be generalized in a way such that each cluster represents a probability distribution (say gaussian, with some mean and variance), and each data point belongs to different clusters with different probabilities. These are updated iteratively in the same way as described above. This version of K-means is called soft assignment, and it is an example of the EM-algorithm.
An interactive visualization of the algorithm is available at Alekseyn Nozdryn-Plotnick's blog. You can play around with the distribution of the data, the number of clusters, and so on.
K-means clustering can be applied in multiple scenarios:
Unintuitive results and limitations may arise given the nature of the algorithm. It aims to build a partition of the dataset into evenly spaced and well-shaped (ideally spherical) clusters of similar size. However, clusters might have unequal variance (left figure) or be of uneven size (right). Note that some of these results might be expected, as in the right figure where data points are correctly separated.
In following example, k-means clustering with k = 3 is applied to the well-known Iris flower data set. As it can be seen, the resulting clusters fail to separate the three species.