# K-means clustering

## What do you need to know to understand this topic?

• Distance norms

## What is K-means clustering?

Clustering is the process of partitioning a group of data points into a small number of clusters. For instance, the items in a supermarket are clustered into categories (butter, cheese and milk are grouped in dairy products). Of course this is a qualitative kind of partitioning. A quantitative approach would be to measure certain features of the products, say the percentage of milk and others, and group together products with a high percentage of milk. In general, we have $n$ data points $\b{x}_i,i=1...n$ that have to be partitioned into $k$ clusters. The goal is to assign a cluster to each data point. K-means is a clustering method that aims to find the positions $\b{\mu}_i,i=1...k$ of the cluster centers that minimize the distance from the data points to their clusters. K-means clustering solves $$\arg\min_{\b{c}} \sum_{i=1}^k\sum_{\b{x}\in \b{c}_i} d(\b{x},\b{\mu}_i) = \arg\min_{\b{c}} \sum_{i=1}^k\sum_{\b{x}\in \b{c}_i} \left\Vert \b{x}-\b{\mu}_i \right\Vert_2^2$$ where $\b{c}_i$ is the set of points that belong to cluster $i$. K-means clustering uses the square of the Euclidean distance $d(\b{x},\b{\mu}_i) = \left\Vert \b{x}-\b{\mu}_i \right\Vert_2^2$. This problem is not trivial (in fact it is NP-hard), so the K-means algorithm can only hope to find the global minimum, possibly getting stuck at a different, local one.
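To make the objective concrete, here is a minimal sketch in Python with NumPy (the function name `kmeans_objective` and the toy data are illustrative, not from the article) that evaluates the within-cluster sum of squared distances for a given assignment:

```python
import numpy as np

def kmeans_objective(X, centers, labels):
    """Sum of squared Euclidean distances from each point to its assigned center."""
    return sum(np.sum((X[labels == i] - mu) ** 2)
               for i, mu in enumerate(centers))

# Toy one-dimensional data with two obvious groups.
X = np.array([[0.0], [1.0], [10.0], [11.0]])
centers = np.array([[0.5], [10.5]])   # one center per group
labels = np.array([0, 0, 1, 1])       # cluster index for each point

print(kmeans_objective(X, centers, labels))  # 1.0: four points, each 0.5 away
```

A bad assignment or badly placed centers would make this value larger; the clustering problem is exactly the search for the assignment that makes it smallest.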

## K-means algorithm

Lloyd's algorithm, better known as the k-means algorithm, is used to solve the k-means clustering problem and works as follows. First, decide the number of clusters $k$. Then:

 1. Initialize the centers of the clusters: $\b{\mu}_i =$ some value, $i=1,...,k$
 2. Attribute the closest cluster to each data point: $$\b{c}_i = \{j : d(\b{x}_j, \b{\mu}_i) \le d(\b{x}_j, \b{\mu}_l),\ \forall l \ne i,\ j=1,...,n\}$$
 3. Set the position of each cluster to the mean of all data points belonging to that cluster: $$\b{\mu}_i = \frac{1}{|\b{c}_i|}\sum_{j\in \b{c}_i} \b{x}_j,\ \forall i$$
 4. Repeat steps 2-3 until convergence.

Notation: $|\b{c}| =$ number of elements in $\b{c}$.

The algorithm eventually converges to a point, although it is not necessarily the minimum of the sum of squares. That is because the problem is non-convex and the algorithm is just a heuristic, converging to a local minimum. The algorithm stops when the assignments do not change from one iteration to the next.
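The steps above can be sketched in a few lines of Python with NumPy (a hypothetical implementation, not code from the article; the stopping test is the one just described, assignments no longer changing):

```python
import numpy as np

def kmeans(X, k, init_centers, max_iter=100):
    """Lloyd's algorithm: alternate assignment (step 2) and update (step 3)
    until the assignments stop changing."""
    centers = np.asarray(init_centers, dtype=float).copy()
    labels = None
    for _ in range(max_iter):
        # Step 2: assign each point to its closest center
        # (squared Euclidean distances via broadcasting).
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        new_labels = d2.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break  # converged: assignments unchanged from the last iteration
        labels = new_labels
        # Step 3: move each center to the mean of its assigned points
        # (an empty cluster keeps its previous position).
        for i in range(k):
            pts = X[labels == i]
            if len(pts) > 0:
                centers[i] = pts.mean(axis=0)
    return centers, labels

X = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]])
centers, labels = kmeans(X, 2, init_centers=[[0.0, 0.0], [5.0, 5.0]])
print(centers)  # [[0.  0.5], [5.  5.5]]
```

With a different initialization the same code can settle on a different, worse set of centers, which is the local-minimum behavior described above.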

### Deciding the number of clusters

The number of clusters should match the data. An incorrect choice of the number of clusters will invalidate the whole process. An empirical way to find the best number of clusters is to run K-means clustering with different numbers of clusters and measure the resulting sum of squares, picking the value of $k$ after which adding more clusters no longer gives a substantial decrease.
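As a small sketch of that idea (the helper `within_ss` and the toy data are mine, not from the article), compare the sum of squares for one cluster versus two on data with two obvious groups:

```python
import numpy as np

def within_ss(X, centers):
    """Total squared distance from each point to its closest center."""
    centers = np.asarray(centers, dtype=float)
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return d2.min(axis=1).sum()

X = np.array([[0.0], [1.0], [10.0], [11.0]])

# k = 1: the single best center is the mean of all points.
ss1 = within_ss(X, [X.mean(axis=0)])
# k = 2: one center per group.
ss2 = within_ss(X, [[0.5], [10.5]])

print(ss1, ss2)  # 101.0 1.0 -- a sharp drop once k matches the data
```

Going from two to three clusters on this data would only shave off a little more, which is the signal that $k=2$ is the right choice.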

The most curious can look at this paper for a benchmarking of 30 procedures for estimating the number of clusters.

### Initializing the position of the clusters

It is really up to you! Here are some common methods:

• Forgy: set the positions of the $k$ clusters to $k$ observations chosen randomly from the dataset.
• Random partition: assign a cluster randomly to each observation and compute means as in step 3.

Since the algorithm stops at a local minimum, the initial positions of the clusters are very important.
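Both initialization methods are easy to sketch in NumPy (again a hypothetical snippet, with made-up variable names and random data):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2))  # 20 two-dimensional data points
k = 3

# Forgy: pick k distinct observations as the initial centers.
forgy_centers = X[rng.choice(len(X), size=k, replace=False)]

# Random partition: assign a random cluster to each observation,
# then take each cluster's mean as its initial center.
labels = rng.integers(k, size=len(X))
partition_centers = np.array([
    # Guard against an empty cluster by falling back to a random point;
    # real implementations need some such rule.
    X[labels == i].mean(axis=0) if np.any(labels == i) else X[rng.integers(len(X))]
    for i in range(k)
])

print(forgy_centers.shape, partition_centers.shape)  # (3, 2) (3, 2)
```

Forgy tends to spread the initial centers out across the data, while random partition places them all near the overall centroid; which works better depends on the dataset.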

### Good example

*(Interactive demo: choose an initialization method (Forgy or Random partition) and the number of clusters; the steps are 1. Initialize clusters, 2. Assign data points to the closest cluster, 3. Calculate the center of each cluster.)*

In this scatter plot you have several two-dimensional data points, clustered at 4 distinct positions. You can choose the initialization method and the number of clusters used in the k-means algorithm. The button 'Reset' resets the algorithm and generates a new dataset. The button 'Iterate' runs one step of the algorithm, which becomes bolded in the text below the button. More often than not, you see that the algorithm converges to the best solution. However, if you try enough times, there are some initializations of the clusters that lead to a "bad" local minimum. If you choose the wrong number of clusters, you can see the drastic effects on the result of the algorithm.

The k-means algorithm works reasonably well when the data fits the cluster model:

• The clusters are spherical: the data points in a cluster are centered around that cluster
• The spread/variance of the clusters is similar: each data point belongs to the closest cluster

If any one of these assumptions does not hold (and assuming the clusters are not too far apart), the local minima of the k-means algorithm will be counter-intuitive. Here are two examples to demonstrate this:

*(Interactive demo: choose an initialization method (Forgy or Random partition) and the number of clusters; the steps are 1. Initialize clusters, 2. Assign data points to the closest cluster, 3. Calculate the center of each cluster.)*

Although the clusters have the same scatter (in fact, the same shape), they are not spherical. As before, you can choose the initialization method and the number of clusters, reset the algorithm and generate a new dataset with 'Reset', and step through the algorithm with 'Iterate'. The k-means algorithm never assigns the tips of the shapes correctly, because the spherical assumption fails.

*(Interactive demo: choose an initialization method (Forgy or Random partition) and the number of clusters; the steps are 1. Initialize clusters, 2. Assign data points to the closest cluster, 3. Calculate the center of each cluster.)*

Although the clusters have spherical shapes, they have different scatters. Again, the data points that clearly belong to the large cluster but lie closer to a small cluster are misclassified.
