k-means clustering is a simple iterative algorithm that partitions a dataset into a pre-determined number of clusters k, based on a chosen distance metric. Unlike supervised classification algorithms, which have some notion of a target class, the objects comprising the input to k-means do not come with an associated target. For this reason, k-means clustering and its variations are considered instances of unsupervised learning techniques.

In this post, we cover the details of the k-means algorithm. After laying the theoretical foundation, we walk through a Python implementation of k-means and explore how to determine an optimal value for k. We also highlight a variant of k-means with an improved initialization scheme known as k-means++. Finally, we examine how to leverage the clustering routines available in scikit-learn.

Background

Given a set of observations \((x_{1}, x_{2}, \cdots, x_{n})\), where each observation is a \(p\)-dimensional vector, k-means attempts to partition the \(n\) observations into \(k\) clusters \(C = \{C_{1}, C_{2}, \cdots, C_{k}\}\) such that the intra-cluster sum of squares is minimized. The k-means objective function is defined as

$$ J = \sum_{n=1}^{N} \sum_{k=1}^{K} r_{nk}||x_{n} - \mu_{k}||^{2}, $$

which represents the sum of the squared distances from each data point to its assigned centroid \(\mu_{k}\). The assignment indicator \(r_{nk}\) is given by

$$ r_{nk} = \begin{cases} 1 & \text{if } k = \operatorname*{arg\,min}_{j} ||x_{n}-\mu_{j}||^{2} \\ 0 & \text{otherwise} \end{cases} $$

which states that the \(n^{th}\) observation gets assigned to the cluster whose centroid is nearest.
Conventional k-means begins by selecting k observations at random to serve as the initial centroids. The algorithm proceeds by assigning each observation to its nearest centroid, then updating the k centroids using the latest cluster assignments. This is an example of an iterative refinement technique, which terminates once the change in distances between consecutive iterations, taken over all observations, falls below some pre-determined threshold.

The assignment and update stages correspond to the expectation (E) and maximization (M) steps of the Expectation-Maximization (EM) algorithm. The value of \(J\) is reduced after each EM cycle, so convergence is guaranteed, but not necessarily to the global minimum. Since k-means runs quickly, it is common to try several combinations of starting points and cluster counts k to see how the choice of observations selected for the initial k means affects the final formation of the k clusters.
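To make the notation concrete, below is a minimal NumPy sketch (using made-up toy data) of a single assignment step and the evaluation of \(J\) for a fixed set of centroids:

import numpy as np

# Toy data: 6 observations in 2 dimensions, and 2 candidate centroids.
X_toy = np.array([[1., 1.], [1.2, .8], [.9, 1.1], [5., 5.], [5.2, 4.9], [4.8, 5.1]])
mu = np.array([[1., 1.], [5., 5.]])

# Squared distances from each observation to each centroid: shape (n, k).
d2 = ((X_toy[:, None, :] - mu[None, :, :])**2).sum(axis=2)

# Assignment step: r[n, k] = 1 if centroid k is nearest to observation n.
r = np.zeros_like(d2)
r[np.arange(X_toy.shape[0]), d2.argmin(axis=1)] = 1

# Objective: J = sum over n and k of r[n, k] * ||x_n - mu_k||^2.
J = (r * d2).sum()
print(J)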

Implementation

In what follows, we present an implementation of k-means that uses the conventional method of centroid initialization. In this version, the kmeans function takes as input data representing the observations of interest, k for the number of clusters, and max_cycles, which limits the number of allowable EM cycles before iteration ceases. The function returns a dictionary containing:

cycles: Number of cycles required 
all_means: Progression of centroids for all cycles 
all_assign: Cluster assignments for each observation at each iteration 
all_dists: Distances from each observation to its assigned centroid at each iteration 
final_means: Centroids at the final iteration 
final_assign: Final cluster assignments for each observation as a 1-D array 
final_dists: Final distances from each observation to its assigned centroid 
total_cost: The total variance over all observations 

The implementation is provided below:

import itertools
import numpy as np
import pandas as pd
import scipy
from scipy.spatial import distance


def kmeans(data, k=3, max_cycles=25):
    """
    Return means and cluster assignments for a given
    dataset `data`.

    Step I : Assignment step: Assign each observation to the cluster whose 
             mean has the least squared Euclidean distance; this is 
             intuitively the nearest mean.
    Step II: Update step: Recalculate each cluster's centroid as the mean 
             of the observations assigned to it.
    """
    if isinstance(data, pd.DataFrame): 
        data = data.values

    # Add index column to data.
    tmpid = np.atleast_2d(np.arange(0, data.shape[0]))
    X = np.concatenate([tmpid.T, data], axis=1)

    # Conventional initialization: Select k random means from  X.
    init_indx = np.random.choice(X.shape[0], k, replace=False)
    init_means = X[init_indx][:, 1:]
    centroids = init_means
    all_means, all_assign, all_dists = [], [], []

    # Initialize indicator and distance matrices which indicate cluster
    # membership for each observation.
    init_clust_assign = np.zeros(shape=(X.shape[0], k))
    init_clust_dists = np.zeros_like(init_clust_assign)

    all_means.append(init_means)
    all_assign.append(init_clust_assign)
    all_dists.append(init_clust_dists)

    for c in itertools.count(start=1):

        clust_assign = np.zeros(shape=(X.shape[0], k))
        clust_dists  = np.zeros_like(clust_assign)

        for i in range(X.shape[0]): # Step I (Assignment Step)
            iterobs = X[i, 1:]
            iterdists = [distance.euclidean(j, iterobs) for j in centroids]
            min_dist = np.min(iterdists)
            min_indx = np.argmin(iterdists)
            clust_assign[i, min_indx] = 1
            clust_dists[i, min_indx]  = min_dist

        # Test for convergence.
        if np.allclose(clust_dists, all_dists[-1], atol=1e-9) or c >= max_cycles:
            break

        all_assign.append(clust_assign)
        all_dists.append(clust_dists)
        centroids = np.zeros_like(init_means)

        # Step II (Update Step).
        for iter_k, _ in enumerate(centroids):

            # Use indices from clust_assign to retrieve elements from X
            # in order to calculate centroid updates.
            iter_mean = X[np.nonzero(clust_assign[:, iter_k])][:, 1:].mean(axis=0)
            centroids[iter_k] = iter_mean

        all_means.append(centroids)

    # Calculate cost function over all centroids.
    total_cost = (all_dists[-1]**2).sum()

    # Return final_assign as a 1-D array representing final cluster assignment.
    final_assign = np.apply_along_axis(
        lambda a: np.where(a==1)[0][0], axis=1, arr=all_assign[-1]
        )

    dresults = {
        'k': k, 'cycles': c, 'all_means': all_means, 'all_assign': all_assign,
        'all_dists': all_dists, 'final_means': all_means[-1], 
        'final_assign': final_assign, 'final_dists': all_dists[-1],
        'total_cost': total_cost
        }

    return dresults

At the very start of the function body, we append an identifier column, which allows us to pair observations with their instance identifiers. To demonstrate, we use the dataset found here, which contains the sample mean and standard deviation of unemployment data for each of the 50 states:

import pandas as pd

df = pd.read_csv("https://gist.githubusercontent.com/jtrive84/ea0f275a9ef010415392189e64c71fc3/raw/4bb56a71c5b597c16513c48183dde799ddf9ec51/unemp.csv")

# For visualizations.
x1 = df['mean'].values
x2 = df['stddev'].values

# Pass as input to kmeans function.
X = df[['mean','stddev']].values

results = kmeans(X, k=3)
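
As a quick sanity check, a few elements of the returned dictionary can be inspected directly:

# Inspect selected elements of the returned dictionary.
print(results['cycles'])        # number of EM cycles required
print(results['final_means'])   # centroids at the final iteration
print(results['final_assign'])  # 1-D array of final cluster assignments
print(results['total_cost'])    # total variance over all observations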

Because kmeans returns the progression of centroid values and cluster assignments after each E-M step, we can create a visualization that demonstrates how assignments and centroids change with each iteration. We next plot cluster assignments at four iterations:

"""
Plotting various iterations of k-means cluster assignments.
"""
import matplotlib.pyplot as plt
import seaborn as sns


# Call kmeans with k=3.
output = kmeans(X, k=3, max_cycles=25)
am = output['all_means']
ad = output['all_dists']
aa = output['all_assign']
# Plot cluster assignments on 2x2 lattice plot.
fig = plt.figure()

# Plot initial means.
ax1 = fig.add_subplot(221)
ax1.plot(x1, x2,color="white", marker="o", fillstyle='none', markeredgecolor='black', markeredgewidth=.90)
ax1.set_title("k-means: Random Initialization ", loc="left")
m1a, m2a = am[0][0,:]
m1b, m2b = am[0][1,:]
m1c, m2c = am[0][2,:]

ax1.plot(m1a,m2a,color="#FF0080",marker="D",markersize=12,fillstyle='none',markeredgewidth=1.30)
ax1.plot(m1b,m2b,color="#1A8B55",marker="D",markersize=12,fillstyle='none',markeredgewidth=1.30)
ax1.plot(m1c,m2c,color="#0080FF",marker="D",markersize=12,fillstyle='none',markeredgewidth=1.30)
ax1.plot(m1a,m2a,color="#FF0080",marker="+",markersize=28,fillstyle='none',markeredgewidth=1.50)
ax1.plot(m1b,m2b,color="#1A8B55",marker="+",markersize=28,fillstyle='none',markeredgewidth=1.50)
ax1.plot(m1c,m2c,color="#0080FF",marker="+",markersize=28,fillstyle='none',markeredgewidth=1.50)
ax1.plot(m1a,m2a,color="#FF0080",marker="o")
ax1.plot(m1b,m2b,color="#1A8B55",marker="o")
ax1.plot(m1c,m2c,color="#0080FF",marker="o")
plt.xticks([]); plt.yticks([])

# 1st Assignment Step.
ax2 = fig.add_subplot(222)
m1a, m2a = am[0][0,:]
m1b, m2b = am[0][1,:]
m1c, m2c = am[0][2,:]

ax2.plot(m1a,m2a,color="#FF0080",marker="D",markersize=12,fillstyle='none',markeredgewidth=1.30)
ax2.plot(m1b,m2b,color="#1A8B55",marker="D",markersize=12,fillstyle='none',markeredgewidth=1.30)
ax2.plot(m1c,m2c,color="#0080FF",marker="D",markersize=12,fillstyle='none',markeredgewidth=1.30)
ax2.plot(m1a,m2a,color="#FF0080",marker="+",markersize=28,fillstyle='none',markeredgewidth=1.50)
ax2.plot(m1b,m2b,color="#1A8B55",marker="+",markersize=28,fillstyle='none',markeredgewidth=1.50)
ax2.plot(m1c,m2c,color="#0080FF",marker="+",markersize=28,fillstyle='none',markeredgewidth=1.50)

# Retrieve points assigned to clusters.
aa1 = aa[1]
X0 = X[np.nonzero(aa1[:,0])][:,0:]
x01, x02 = X0[:, 0], X0[:, 1]

X1 = X[np.nonzero(aa1[:,1])][:,0:]
x11, x12 = X1[:, 0], X1[:, 1]

X2 = X[np.nonzero(aa1[:, 2])][:, 0:]
x21, x22 = X2[:, 0], X2[:, 1]

ax2.scatter(x01, x02, color="#FF0080", marker="o")
ax2.scatter(x11, x12, color="#1A8B55", marker="o")
ax2.scatter(x21, x22, color="#0080FF", marker="o")
ax2.set_title("k-means: First Assignment Step", loc="left")
plt.xticks([]); plt.yticks([])


# 5th assignment step.
ax3 = fig.add_subplot(223)
m1a, m2a = am[4][0, :]; m1b, m2b = am[4][1, :]; m1c, m2c = am[4][2, :]
ax3.plot(m1a, m2a, color="#FF0080", marker="D", markersize=12, fillstyle='none', markeredgewidth=1.30)
ax3.plot(m1b, m2b, color="#1A8B55", marker="D", markersize=12, fillstyle='none', markeredgewidth=1.30)
ax3.plot(m1c, m2c, color="#0080FF", marker="D", markersize=12, fillstyle='none', markeredgewidth=1.30)
ax3.plot(m1a, m2a, color="#FF0080", marker="+", markersize=28, fillstyle='none', markeredgewidth=1.50)
ax3.plot(m1b, m2b, color="#1A8B55", marker="+", markersize=28, fillstyle='none', markeredgewidth=1.50)
ax3.plot(m1c, m2c, color="#0080FF", marker="+", markersize=28, fillstyle='none', markeredgewidth=1.50)

# Retrieve points assigned to clusters.
aa1 = aa[5]
X0 = X[np.nonzero(aa1[:, 0])][:, 0:]
x01, x02 = X0[:,0], X0[:, 1]

X1 = X[np.nonzero(aa1[:, 1])][:, 0:]
x11, x12 = X1[:, 0], X1[:, 1]

X2 = X[np.nonzero(aa1[:, 2])][:, 0:]
x21, x22 = X2[:, 0], X2[:, 1]

ax3.scatter(x01, x02, color="#FF0080", marker="o")
ax3.scatter(x11, x12, color="#1A8B55", marker="o")
ax3.scatter(x21, x22, color="#0080FF", marker="o")
ax3.set_title("k-means: Fourth Assignment Step", loc="left")
plt.xticks([]); plt.yticks([])


# Final assignment step.
ax4 = fig.add_subplot(224)
m1a, m2a = am[-1][0, :]; m1b, m2b = am[-1][1, :]; m1c, m2c = am[-1][2, :]
ax4.plot(m1a, m2a, color="#FF0080", marker="D", markersize=12, fillstyle='none', markeredgewidth=1.30)
ax4.plot(m1b, m2b, color="#1A8B55", marker="D", markersize=12, fillstyle='none', markeredgewidth=1.30)
ax4.plot(m1c, m2c, color="#0080FF", marker="D", markersize=12, fillstyle='none', markeredgewidth=1.30)
ax4.plot(m1a, m2a, color="#FF0080", marker="+", markersize=28, fillstyle='none', markeredgewidth=1.50)
ax4.plot(m1b, m2b, color="#1A8B55", marker="+", markersize=28, fillstyle='none', markeredgewidth=1.50)
ax4.plot(m1c, m2c, color="#0080FF", marker="+", markersize=28, fillstyle='none', markeredgewidth=1.50)

# Retrieve points assigned to clusters.
aa1 = aa[-1]
X0 = X[np.nonzero(aa1[:, 0])][:, 0:]
x01, x02 = X0[:, 0], X0[:, 1]

X1 = X[np.nonzero(aa1[:, 1])][:, 0:]
x11, x12 = X1[:, 0], X1[:, 1]

X2 = X[np.nonzero(aa1[:, 2])][:, 0:]
x21, x22 = X2[:, 0], X2[:, 1]

ax4.scatter(x01, x02, color="#FF0080", marker="o")
ax4.scatter(x11, x12, color="#1A8B55", marker="o")
ax4.scatter(x21, x22, color="#0080FF", marker="o")
ax4.set_title("k-means: Final Assignment Step", loc="left")
plt.xticks([]); plt.yticks([])

fig.tight_layout()
fig.subplots_adjust(wspace=.05, hspace=0.1, left=.05, right=.95)
plt.show()

This produces the following:

[Figure km01: k-means cluster assignments at initialization and after the first, fifth, and final assignment steps]

The centroids have changed noticeably when comparing the upper-left and bottom-right subplots. However, we arbitrarily chose 3 as the number of cluster centroids. We next discuss how to determine a suitable k, the point beyond which adding centroids yields little further reduction in the cost function \(J\).

Determining the Optimal k

A common method for assessing the optimal number of centroids k is the elbow method, which plots the sum of squared residuals across all observations as a function of the number of clusters. If we calculate the sum of squared residuals from a single-centroid model and compare it with the same measure from a two-centroid model, it is clear that the two-centroid model will have lower total variance, since the larger residuals in the \(k=1\) model will almost surely be reduced when \(k=2\). However, at a certain point increasing k has little impact on the intra-cluster variance, and the elbow method helps identify that point. In the code that follows, we leverage our kmeans function and the sample dataset introduced earlier to assess total variance as a function of the number of centroids k, with \(1 \leq k \leq 10\).
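
The plotting code below expects x (the candidate cluster counts) and y (the corresponding total variance). A short loop along the following lines can generate them; this is a sketch, and the exact values will vary from run to run because of the random initialization:

# Run kmeans for k = 1 through 10 and record the total cost for each.
x = list(range(1, 11))
y = [kmeans(X, k=i)['total_cost'] for i in x]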

"""
Demonstrating the relationship between total variance as a function of the 
number of clusters to determine the optimal number of centroids.
"""

fig = plt.figure()
sns.set(context='notebook', style='darkgrid', font_scale=1)
ax = fig.add_subplot(111)
ax.set_title(
    "Total Variance as a Function of Cluster Count", fontsize=18, 
    loc="left", color="#FF6666", weight="bold"
    )
ax.grid(True)
ax.set_xlabel("Number of Clusters", fontsize=15, labelpad=12.5, color="black")
ax.set_ylabel("Total Variance", fontsize=15, labelpad=12.5, color="black")
ax.axvline(x=3., linewidth=2., color="grey")
ax.plot(x, y, "--", linewidth=2.25, color="#FF6666")
ax.scatter(x, y, marker="o", color="black", alpha=.90)
ax.tick_params(axis='both', which='major', labelsize=12, pad=10)

# Add arrow highlighting k=3.
ax.annotate("optimal k=3", fontsize=16, xy=(3,18.85), xytext=(4,30),
            arrowprops=dict(
                facecolor='black', width=.6, headwidth=8, shrink=.05),)
plt.show()

Note the characteristic elbow shape, after which the total variance changes very little with the inclusion of additional centroids:

[Figure km02: Total variance as a function of cluster count, with the elbow at k=3]

Alternative Initialization Methods

One of the shortcomings of conventional k-means is that occasionally the approximation found can be arbitrarily bad with respect to the objective function compared to the optimal clustering. The k-means++ algorithm attempts to address this shortcoming by providing an alternative initialization scheme. In what follows, \(D(x)\) denotes the shortest distance from a data point \(x\) to the closest center \(C_{i}\) already selected:

  1. Take one center \(C_{1}\), chosen uniformly at random from the initial dataset.
  2. Take a new center \(C_{i}\), choosing \(x \in X\) with probability \(\frac {D(x)^{2}}{\sum_{x \in X} D(x)^{2}}\).
  3. Repeat Step 2 until we have taken \(k\) centroids in total.
  4. Proceed with standard k-means algorithm.

From step 2, the expression \(\frac {D(x)^{2}}{\sum_{x \in X} D(x)^{2}}\) can be interpreted as follows: observations farther from the existing centroids are given a higher probability of being selected as the next centroid. In essence, this attempts to spread the initial centroids as far apart as possible before commencing k-means.
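
As a small toy illustration (values made up, with a single center already chosen at the origin), the selection probabilities are simply the normalized squared distances:

import numpy as np

# Three 1-D points and one previously chosen center at 0.
pts = np.array([0.5, 1.0, 4.0])
D2 = (pts - 0.0)**2            # squared distance to the nearest chosen center
probs = D2 / D2.sum()
print(probs)                   # the farthest point is most likely to be selected next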

We can encapsulate the k-means++ initialization method within a function that takes the dataset and number of clusters k as input. A sample implementation is given by kmeanspp:

def kmeanspp(X, k):
    """
    Return cluster centroids using k-means++ initialization technique.

    Parameters
    ----------
    X: 
        The input dataset.
    k: 
        The number of centroids to return.
    """
    # First centroids selected uniformly at random.
    centroids = [X[np.random.choice(range(X.shape[0]), 1)[0], :]]

    # For each remaining centroid, compute each observation's squared distance
    # to its nearest already-chosen centroid (D(x)^2 in the notation above).
    for _ in range(k - 1):
        dsq = np.array(
            [min(distance.euclidean(X[i, :], j)**2 for j in centroids)
                for i in range(X.shape[0])]
            )

        r_indx = np.random.choice(range(X.shape[0]), 1, p=dsq/np.sum(dsq))[0]

        # Ensure same index not selected multiple times.
        if dsq[r_indx]==0:
            while True:
                r_indx = np.random.choice(range(X.shape[0]), 1, p=dsq/np.sum(dsq))[0]
                if dsq[r_indx]!=0: 
                    break
        centroids.append(X[r_indx, :])
    return centroids
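
For instance, assuming X still holds the two-column unemployment array from earlier, a minimal call looks like:

# Obtain k=3 initial centroids via k-means++ initialization.
init_centroids = kmeanspp(X, k=3)
print(init_centroids)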

Calling kmeanspp with our original dataset and k=3, as sketched above, produces initial centroids with much better spread than conventional initialization:

[Figure km03: Initial centroids selected via k-means++ for the unemployment dataset]

Clustering in scikit-learn

The k-means clustering algorithm can be found in scikit-learn’s sklearn.cluster module, where KMeans is one of the many available clustering algorithms. Information on these algorithms can be found here. In this section, we leverage the KMeans class to generate cluster assignments.
We start by scaling our variables so that no single explanatory variable contributes disproportionately to the overall variation in the dataset.

from sklearn.preprocessing import StandardScaler

# Read in data as before.
df = pd.read_csv("unemp.csv")
X = df[['mean', 'stddev']].values

sclr = StandardScaler().fit(X)
X = sclr.transform(X)

This rescales each explanatory variable to have zero mean and unit variance. Next we pass X to the fit method of KMeans:

import numpy as np
from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters=3).fit(X)

# Get final cluster assignments.
labels = kmeans.labels_

# Get final cluster centroids.
centroids = kmeans.cluster_centers_

# Predict cluster assignment for new observations (inputs must be on the same standardized scale as X).
kmeans.predict(np.atleast_2d([.515, .73337]))
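
One caveat worth noting: because KMeans was fit on standardized data, a new observation expressed in the original units should be passed through the same scaler before prediction, and the fitted centroids can be mapped back to the original units with inverse_transform. A short sketch, using made-up raw values:

# A new observation in the original (unscaled) units; values are illustrative.
new_obs = np.atleast_2d([5.15, .75])

# Standardize with the scaler fit earlier, then predict its cluster.
print(kmeans.predict(sclr.transform(new_obs)))

# Express the fitted centroids back in the original units.
print(sclr.inverse_transform(kmeans.cluster_centers_))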