
K-means clustering

What is K-means clustering?

Clustering is the process of partitioning a group of data points into a small number of clusters. For instance, the items in a supermarket are clustered in categories (butter, cheese and milk are grouped in dairy products). This is, of course, a qualitative kind of partitioning. A quantitative approach would be to measure certain features of the products, say the percentage of milk among others, and group together products with a high percentage of milk.

In general, we have $n$ data points $\b{x}_i,i=1...n$ that have to be partitioned into $k$ clusters. The goal is to assign a cluster to each data point. K-means is a clustering method that aims to find the positions $\b{\mu}_i,i=1...k$ of the cluster centers that minimize the distance from each data point to its cluster center. K-means clustering solves $$\arg\min_\b{c} \sum_{i=1}^k\sum_{\b{x}\in c_i} d(\b{x},\b{\mu}_i) = \arg\min_\b{c} \sum_{i=1}^k\sum_{\b{x}\in c_i} \left\Vert \b{x}-\b{\mu}_i \right\Vert_2^2$$ where $c_i$ is the set of points that belong to cluster $i$, and the distance is the square of the Euclidean distance, $d(\b{x},\b{\mu}_i) = \left\Vert \b{x}-\b{\mu}_i \right\Vert_2^2$. This problem is not trivial (in fact it is NP-hard), so the k-means algorithm can only hope to find the global minimum, and may in fact get stuck at a different, suboptimal solution.
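To make the objective concrete, here is a minimal NumPy sketch that evaluates the sum of squared distances for a given set of centers and assignments (the names `X`, `centers` and `labels` are illustrative, not from the text):

```python
import numpy as np

def kmeans_objective(X, centers, labels):
    """Sum of squared Euclidean distances from each point to its assigned center.

    X: (n, d) data matrix; centers: (k, d); labels: length-n array of
    cluster indices in 0..k-1.
    """
    diffs = X - centers[labels]   # (n, d): x_j - mu of the cluster of x_j
    return float(np.sum(diffs ** 2))
```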

K-means algorithm

Lloyd's algorithm, mostly known as the k-means algorithm, is used to solve the k-means clustering problem and works as follows. First, decide the number of clusters $k$. Then:

1. Initialize the centers of the clusters: $\b{\mu}_i =$ some value, $i=1,...,k$
2. Assign each data point to the closest cluster: $$\b{c}_i = \{j: d(\b{x}_j, \b{\mu}_i) \le d(\b{x}_j, \b{\mu}_l),\ l \ne i,\ j=1,...,n\}$$
3. Set the position of each cluster center to the mean of all data points belonging to that cluster: $$\b{\mu}_i = \frac{1}{|\b{c}_i|}\sum_{j\in \b{c}_i} \b{x}_j,\ \forall i$$
4. Repeat steps 2-3 until convergence.

Notation: $|\b{c}| =$ number of elements in $\b{c}$.

The algorithm eventually converges to a fixed point, although it is not necessarily the minimum of the sum of squares: the problem is non-convex, and the algorithm is just a heuristic that converges to a local minimum. The algorithm stops when the assignments no longer change from one iteration to the next.
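As an illustration of steps 1-4, here is a minimal NumPy sketch of Lloyd's algorithm; the function name and the choice of Forgy initialization (picking $k$ data points as the initial centers) are my own, not prescribed by the text:

```python
import numpy as np

def kmeans(X, k, max_iter=100, rng=None):
    """A minimal sketch of Lloyd's algorithm (Forgy initialization assumed)."""
    rng = np.random.default_rng(rng)
    n = X.shape[0]
    # Step 1: initialize the centers as k distinct data points.
    centers = X[rng.choice(n, size=k, replace=False)].astype(float)
    labels = np.full(n, -1)
    for _ in range(max_iter):
        # Step 2: assign each point to the closest center
        # (squared Euclidean distances, shape (n, k)).
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        new_labels = d2.argmin(axis=1)
        # Convergence: stop when the assignments no longer change.
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Step 3: move each center to the mean of its assigned points.
        for i in range(k):
            members = X[labels == i]
            if len(members):              # guard against an empty cluster
                centers[i] = members.mean(axis=0)
    return centers, labels
```

The vectorized assignment step materializes an $n \times k$ distance matrix, which is simple but memory-hungry; it is enough for small examples like the demos below.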

Deciding the number of clusters

The number of clusters should match the data; an incorrect choice will invalidate the whole process. An empirical way to find a good number of clusters is to run k-means with different numbers of clusters and measure the resulting sum of squares. The sum of squares always decreases as $k$ grows, so one looks for the value of $k$ beyond which the decrease levels off.
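A sketch of that procedure, reusing the `kmeans` and `kmeans_objective` sketches above (the toy data is made up for illustration):

```python
import numpy as np

# Toy data: three blobs in 2-D.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(100, 2))
               for c in ([0, 0], [4, 0], [2, 3])])

for k in range(1, 8):
    centers, labels = kmeans(X, k, rng=k)
    print(k, round(kmeans_objective(X, centers, labels), 1))
# The sum of squares decreases as k grows; the "elbow" where the
# decrease levels off (here around k = 3) suggests the number of clusters.
```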

The most curious can look at this paper for a benchmark of 30 procedures for estimating the number of clusters.

Initializing the position of the clusters

Since the algorithm stops at a local minimum, the initial positions of the cluster centers are very important. Beyond that, it is really up to you! Two common methods, sketched below, are Forgy initialization, which picks $k$ data points at random and uses them as the initial centers, and random partition, which assigns each data point to a random cluster and uses the means of these random clusters as the initial centers.
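Here are those two methods, sketched in NumPy (the function names are mine):

```python
import numpy as np

def forgy_init(X, k, rng):
    """Forgy: pick k distinct data points as the initial centers."""
    return X[rng.choice(X.shape[0], size=k, replace=False)].astype(float)

def random_partition_init(X, k, rng):
    """Random partition: assign each point to a random cluster and use
    the cluster means as the initial centers. (A real implementation
    would guard against a cluster receiving no points.)"""
    labels = rng.integers(0, k, size=X.shape[0])
    return np.array([X[labels == i].mean(axis=0) for i in range(k)])
```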

Good example

[Interactive demo: choose the initialization method and the number of clusters; 'Reset' generates a new dataset, 'Iterate' runs one step. Steps: 1. Initialize clusters; 2. Assign data points to the closest cluster; 3. Calculate the center of each cluster.]

In this scatter plot you have several two-dimensional data points, clustered around 4 distinct positions. You can choose the initialization method and the number of clusters used in the k-means algorithm. The button 'Reset' resets the algorithm and generates a new dataset. The button 'Iterate' runs one step of the algorithm, which is highlighted in bold in the text below the button. More often than not, you will see the algorithm converge to the best solution. However, if you try enough times, some initializations of the clusters lead to a "bad" local minimum. And if you choose the wrong number of clusters, you can see the drastic effects on the result of the algorithm.
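A common remedy for such bad local minima, not part of the demo itself, is to run the algorithm several times from different random initializations and keep the run with the lowest sum of squares. A sketch reusing the functions above:

```python
import numpy as np

best_score, best_centers, best_labels = np.inf, None, None
for seed in range(10):
    centers, labels = kmeans(X, k=4, rng=seed)
    score = kmeans_objective(X, centers, labels)
    if score < best_score:
        best_score, best_centers, best_labels = score, centers, labels
# best_centers / best_labels now hold the best of 10 restarts.
```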

Bad examples

The k-means algorithm works reasonably well when the data fits the cluster model:

1. The clusters are spherical: the data points in each cluster are centered around that cluster.
2. The clusters are of similar scatter (size).

If any one of these assumptions does not hold (and assuming the clusters are not too far apart), then the local minima of the k-means algorithm will be counter-intuitive. Here are two examples to demonstrate this:

[Interactive demo: choose the initialization method and the number of clusters; 'Reset' generates a new dataset, 'Iterate' runs one step. Steps: 1. Initialize clusters; 2. Assign data points to the closest cluster; 3. Calculate the center of each cluster.]

Although the clusters have the same scatter (in fact, the same shape), they are not spherical. As before, you can choose the initialization method and the number of clusters, and step through the algorithm with 'Reset' and 'Iterate'. The k-means algorithm never correctly assigns the tips of the shapes, because the spherical assumption fails.
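Such a dataset is easy to synthesize. Below is a hedged NumPy sketch of two elongated, parallel clusters; even with the centers placed at the true means, the nearest-center rule splits the plane at the perpendicular bisector and misassigns the tips:

```python
import numpy as np

rng = np.random.default_rng(1)

def stripe(center, n=150):
    """An elongated cluster stretched along the diagonal direction (1, 1)."""
    t = rng.normal(scale=3.0, size=(n, 1))          # spread along the stripe
    return t * [1.0, 1.0] + rng.normal(scale=0.2, size=(n, 2)) + center

X = np.vstack([stripe([0.0, 0.0]), stripe([4.0, 0.0])])
# With centers at (0, 0) and (4, 0), the decision boundary is the vertical
# line x = 2, so the far tip of each stripe lies on the wrong side and is
# assigned to the other cluster.
```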

[Interactive demo: choose the initialization method and the number of clusters; 'Reset' generates a new dataset, 'Iterate' runs one step. Steps: 1. Initialize clusters; 2. Assign data points to the closest cluster; 3. Calculate the center of each cluster.]

Although the clusters have spherical shapes, they have different scatters. As before, you can choose the initialization method and the number of clusters, and step through the algorithm with 'Reset' and 'Iterate'. The data points that clearly belong to the large cluster, but are closer to the small clusters, are misclassified.
