k-means clustering is a simple iterative clustering algorithm that partitions a dataset into a pre-determined number of clusters, k, based on a pre-determined distance metric. Unlike supervised classification algorithms that have some notion of a target class, the objects comprising the input to k-means do not come with an associated target. For this reason, k-means clustering and its variations are considered instances of unsupervised learning techniques1.

In this post, details of the k-means algorithm are covered in detail. After laying the theoretical foundation, we walk through a Python implementation of k-means, and explore how to determine an optimal value for $$k$$. We’ll also highlight a variant of k-means with an improved initialization scheme known as kmeans++. Finally, we examine how to leverage the clustering routines available in scikit-learn.

## Background

Given a set of observations $$(x_{1}, x_{2}, \cdots, x_{n})$$, where each observation is a p-dimensional vector, k-means attempts to partition the $$n$$ observations into $$k$$ clusters $$C = {C_{1}, C_{2}, \cdots, C_{k}}$$ such that the intra-cluster sum of squares is minimized. Adopting Bishop’s notational convention, we define the k-means objective (‘cost’) function as

$$J = \sum_{n=1}^{N} \sum_{k=1}^{K} r_{nk}||x_{n} - \mu_{k}||^{2},$$

which represents the sum of the squares of the distances of each data point to its assigned vector $$\mu_{k}$$2. Here, $$r_{nk}$$ is given by

$$\DeclareMathOperator*{\argmin}{arg\,min_{j}} r_{nk} = \begin{cases} 1 & \mbox{if} \argmin ||x_{n}-\mu_{j}||^{2} \\ 0 & \mbox{otherwise} \end{cases}$$

which states that the $$n^{th}$$ observation gets assigned to the cluster whose centroid is nearest.
Conventional k-means usually begins with random selection of $$k$$ observations for the initial centroids. The algorithm proceeds by assigning each observation to it’s nearest centroid. Upon completion, the $$k$$ centroids are updated using the latest cluster assignments. This is an example of an iterative refinement technique, which ceases only after the change in distance between consecutive iterations for all observations is less than some pre-determined threshold. In Pattern Recognition and Machine Learning, Bishop describes these steps as “minimizing $$J$$ with respect to the $$r_{nk}$$ , keeping the $$\mu_{k}$$ fixed, then minimizing $$J$$ with respect to the $$\mu_{k}$$, keeping $$r_{nk}$$ fixed”2. He also points out that the two stages correspond to the E (expectation) and M (maximization) steps of the EM (Expectation Maximization) algorithm.
The value of $$J$$ is reduced after each EM cycle, so convergence is guaranteed, but not necessarily to the global minimum.
Since k-means can be run quickly, various combinations of starting points and number of clusters $$k$$ are run in turn to see how changing the observations selected for the initial means varies the ultimate formation of the $$k$$ clusters.

## Implementation

In what follows, we present an implementation of the k-means algorithm that uses the conventional method of centroid intialization. This is also commonly referred to as Lloyd’s algorithm. In this version, the kmeans function takes as input data representing the observations of interest, k for the number of clusters and max_cycles, which limits the number of allowable EM cycles before iteration ceases. kmeans returns a dict of values containing:

cycles - Number of cycles required
all_means - Progression of centroids for all cycles
all_assign - Cluster assignments for each observation at each iteration
all_dists - Distances from observations to each of the $$k$$ centroids
final_means - Centroids at the final iteration
final_assign - Final cluster assignments for each observation as 1-D array
final_dists - Final distances from observations to each of the $$k$$ centroids
total_cost - The total variance over all observations

We define the kmeans function as follows:

import numpy as np
import scipy
from scipy.spatial import distance
import pandas as pd
import itertools

def kmeans(data, k=3, max_cycles=25):
"""
Return means and cluster assignments for a given
dataset data.

STEP I  => Assignment step: Assign each observation to the
cluster whose mean has the least squared Euclidean
distance, this is intuitively the "nearest" mean.
STEP II => Update step; Calculate New Cluster Centroids for
each observation.

"""
if isinstance(data, pd.DataFrame): data = data.values

# Add index column to data.
_tmpid = np.atleast_2d(np.arange(0, data.shape[0]))
X = np.concatenate([_tmpid.T, data], axis=1)

# Conventional initialization: Select k random means from  X.
init_indx = np.random.choice(np.arange(0,X.shape[0]-1),k,replace=False)
init_means = X[init_indx][:,1:]
centroids = init_means
all_means, all_assign, all_dists = [], [], []

# Initialize indicator and distance matricies which indicate cluster
# membership for each observation.
init_clust_assign = np.zeros(shape=(X.shape[0], k))
init_clust_dists  = np.zeros_like(init_clust_assign)

all_means.append(init_means)
all_assign.append(init_clust_assign)
all_dists.append(init_clust_dists)

for c in itertools.count(start=1):

clust_assign = np.zeros(shape=(X.shape[0], k))
clust_dists  = np.zeros_like(clust_assign)

for i in range(X.shape[0]): # Step I (Assignment Step)

iterobs   = X[i, 1:]
iterdists = [distance.euclidean(j, iterobs) for j in centroids]
min_dist  = np.min(iterdists)
min_indx  = np.argmin(iterdists)

clust_assign[i, min_indx] = 1
clust_dists[i, min_indx]  = min_dist

# Test for convergence.
if np.allclose(clust_dists,all_dists[-1],atol=1e-09) or c>=max_cycles:
break

all_assign.append(clust_assign)
all_dists.append(clust_dists)

centroids = np.zeros_like(init_means)

for i in enumerate(centroids): # Step II (Update Step)

iter_k = i[0]
iter_v = i[1]

# Use indicies from clust_assign to retrieve elements from X
# in order to calculate centroid updates.
iter_mean = X[np.nonzero(clust_assign[:,iter_k])][:,1:].mean(axis=0)
centroids[iter_k] = iter_mean

all_means.append(centroids)

# Calculate cost function (variance) over all centroids.
total_cost = (all_dists[-1]**2).sum()

# Return final_assign as a 1-D array representing final
# cluster assignment.
final_assign = np.apply_along_axis(
lambda a: np.where(a==1)[0][0],
axis=1, arr=all_assign[-1])

result = {
'k'           :k,
'cycles'      :c,
'all_means'   :all_means,
'all_assign'  :all_assign,
'all_dists'   :all_dists,
'final_means' :all_means[-1],
'final_assign':final_assign,
'final_dists' :all_dists[-1],
'total_cost'  :total_cost
}

return(result)


At the very start of the function definition we append an identifer column which allows us to pair observations to their instance identifiers.
To demonstrate, we’ll use the dataset found here, which contains the mean and standard deviation of unemployment data for each of the 50 states:

df = pd.read_csv("unemp.csv")

# For visualizations.
x1 = df['mean'].values
x2 = df['stddev'].values

# Pass as input to kmeans function.
X = df[['mean','stddev']].values

results = kmeans(X, k=3)


Because kmeans returns the progression of centroid values and cluster assignments after each E-M step, we can create a visualization that shows how assignments and centroids change with each iteration. We next plot cluster assignments at four iterations:

"""
Plotting various iterations of k-means cluster assignments.
"""
import numpy as np
import scipy
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Call kmeans with k=3.
output = kmeans(X, k=3, max_cycles=25)
am = output['all_means']
aa = output['all_assign']

# Plot cluster assignments on 2x2 lattice plot =>
fig = plt.figure()

# Plot initial means =>
ax1.plot(x1, x2,color="white",marker="o",fillstyle='none',markeredgecolor='black',markeredgewidth=.90)
ax1.set_title("k-means: Random Initialization ", loc="left")

m1a, m2a = am[0][0,:]
m1b, m2b = am[0][1,:]
m1c, m2c = am[0][2,:]

ax1.plot(m1a,m2a,color="#FF0080",marker="D",markersize=12,fillstyle='none',markeredgewidth=1.30)
ax1.plot(m1b,m2b,color="#1A8B55",marker="D",markersize=12,fillstyle='none',markeredgewidth=1.30)
ax1.plot(m1c,m2c,color="#0080FF",marker="D",markersize=12,fillstyle='none',markeredgewidth=1.30)
ax1.plot(m1a,m2a,color="#FF0080",marker="+",markersize=28,fillstyle='none',markeredgewidth=1.50)
ax1.plot(m1b,m2b,color="#1A8B55",marker="+",markersize=28,fillstyle='none',markeredgewidth=1.50)
ax1.plot(m1c,m2c,color="#0080FF",marker="+",markersize=28,fillstyle='none',markeredgewidth=1.50)
ax1.plot(m1a,m2a,color="#FF0080",marker="o")
ax1.plot(m1b,m2b,color="#1A8B55",marker="o")
ax1.plot(m1c,m2c,color="#0080FF",marker="o")
plt.xticks([])
plt.yticks([])

# 1st Assignment Step =>

m1a, m2a = am[0][0,:]
m1b, m2b = am[0][1,:]
m1c, m2c = am[0][2,:]

ax2.plot(m1a,m2a,color="#FF0080",marker="D",markersize=12,fillstyle='none',markeredgewidth=1.30)
ax2.plot(m1b,m2b,color="#1A8B55",marker="D",markersize=12,fillstyle='none',markeredgewidth=1.30)
ax2.plot(m1c,m2c,color="#0080FF",marker="D",markersize=12,fillstyle='none',markeredgewidth=1.30)
ax2.plot(m1a,m2a,color="#FF0080",marker="+",markersize=28,fillstyle='none',markeredgewidth=1.50)
ax2.plot(m1b,m2b,color="#1A8B55",marker="+",markersize=28,fillstyle='none',markeredgewidth=1.50)
ax2.plot(m1c,m2c,color="#0080FF",marker="+",markersize=28,fillstyle='none',markeredgewidth=1.50)

# Retrieve points assigned to clusters.
aa1 = aa[1]
X0 = X[np.nonzero(aa1[:,0])][:,0:]
x01, x02 = X0[:,0], X0[:,1]

X1 = X[np.nonzero(aa1[:,1])][:,0:]
x11, x12 = X1[:,0], X1[:,1]

X2 = X[np.nonzero(aa1[:,2])][:,0:]
x21, x22 = X2[:,0], X2[:,1]

ax2.scatter(x01,x02,color="#FF0080",marker="o")
ax2.scatter(x11,x12,color="#1A8B55",marker="o")
ax2.scatter(x21,x22,color="#0080FF",marker="o")
ax2.set_title("k-means: First Assignment Step", loc="left")
plt.xticks([])
plt.yticks([])

# 5th assignment step =>
m1a, m2a = am[4][0,:]; m1b, m2b = am[4][1,:]; m1c, m2c = am[4][2,:]
ax3.plot(m1a,m2a,color="#FF0080",marker="D",markersize=12,fillstyle='none',markeredgewidth=1.30)
ax3.plot(m1b,m2b,color="#1A8B55",marker="D",markersize=12,fillstyle='none',markeredgewidth=1.30)
ax3.plot(m1c,m2c,color="#0080FF",marker="D",markersize=12,fillstyle='none',markeredgewidth=1.30)
ax3.plot(m1a,m2a,color="#FF0080",marker="+",markersize=28,fillstyle='none',markeredgewidth=1.50)
ax3.plot(m1b,m2b,color="#1A8B55",marker="+",markersize=28,fillstyle='none',markeredgewidth=1.50)
ax3.plot(m1c,m2c,color="#0080FF",marker="+",markersize=28,fillstyle='none',markeredgewidth=1.50)

# Retrieve points assigned to clusters.
aa1 = aa[5]
X0 = X[np.nonzero(aa1[:,0])][:,0:]
x01, x02 = X0[:,0], X0[:,1]

X1 = X[np.nonzero(aa1[:,1])][:,0:]
x11, x12 = X1[:,0], X1[:,1]

X2 = X[np.nonzero(aa1[:,2])][:,0:]
x21, x22 = X2[:,0], X2[:,1]

ax3.scatter(x01,x02,color="#FF0080",marker="o")
ax3.scatter(x11,x12,color="#1A8B55",marker="o")
ax3.scatter(x21,x22,color="#0080FF",marker="o")
ax3.set_title("k-means: Fourth Assignment Step", loc="left")
plt.xticks([])
plt.yticks([])

# Final assignment step =>
m1a, m2a = am[-1][0,:]; m1b, m2b = am[-1][1,:]; m1c, m2c = am[-1][2,:]
ax4.plot(m1a,m2a,color="#FF0080",marker="D",markersize=12,fillstyle='none',markeredgewidth=1.30)
ax4.plot(m1b,m2b,color="#1A8B55",marker="D",markersize=12,fillstyle='none',markeredgewidth=1.30)
ax4.plot(m1c,m2c,color="#0080FF",marker="D",markersize=12,fillstyle='none',markeredgewidth=1.30)
ax4.plot(m1a,m2a,color="#FF0080",marker="+",markersize=28,fillstyle='none',markeredgewidth=1.50)
ax4.plot(m1b,m2b,color="#1A8B55",marker="+",markersize=28,fillstyle='none',markeredgewidth=1.50)
ax4.plot(m1c,m2c,color="#0080FF",marker="+",markersize=28,fillstyle='none',markeredgewidth=1.50)

# Retrieve points assigned to clusters.
aa1 = aa[-1]
X0 = X[np.nonzero(aa1[:,0])][:,0:]
x01, x02 = X0[:,0], X0[:,1]

X1 = X[np.nonzero(aa1[:,1])][:,0:]
x11, x12 = X1[:,0], X1[:,1]

X2 = X[np.nonzero(aa1[:,2])][:,0:]
x21, x22 = X2[:,0], X2[:,1]

ax4.scatter(x01,x02,color="#FF0080",marker="o")
ax4.scatter(x11,x12,color="#1A8B55",marker="o")
ax4.scatter(x21,x22,color="#0080FF",marker="o")
ax4.set_title("k-means: Final Assignment Step", loc="left")
plt.xticks([])
plt.yticks([])

fig.tight_layout()
plt.show()


Running the code above generates the following:

It’s clear that the centroids have changed noticably when comparing the upper-left and the bottom-right subplots. However, we arbitrarily chose 3 as the number of cluster centroids. We next discuss how to determine k, which corresponds to the number of centroids that minimizes the cost function $$J$$ across all observations.

## Determining the Optimal $$k$$

A common method used in assessing the optimal number of centroids $$k$$ to use for clustering is the elbow method, which plots the sum of squared residuals as a function of the number of clusters across all observations. If we caculate the sum of squared residuals from a single centroid model as compared with the same measure from the two-centroid model, it’s clear that the two-centroid model will have a lower total variance, since the larger residuals in the $$k=1$$ model will almost certainly be reduced when $$k=2$$. However, at a certain point increasing $$k$$ will have little impact on the intra-cluster variance, and the elbow method can be used to help identify this point.
In the code that follows, we leverage our kmeans function and the sample dataset introduced earlier to plot total variance as a function of $$k$$, the number of centroids, with $$1 \leq k \leq 10$$:

"""
Demonstrating the relationship between total variance
as a function of the number of clusters to determine the
optimal number of centroids.
"""
fig = plt.figure()
sns.set(context='notebook', style='darkgrid', font_scale=1)
ax.set_title(
"Total Variance as a Function of Cluster Count",
fontsize=18,
loc="left",
color="#FF6666",
weight="bold")
ax.grid(True)
ax.set_xlabel("Number of Clusters", fontsize=15, labelpad=12.5, color="black")
ax.axvline(x=3., linewidth=2., color="grey")
ax.plot(x, y, "--", linewidth=2.25, color="#FF6666")
ax.scatter(x, y, marker="o", color="black", alpha=.90)

ax.annotate(
"optimal k=3",
fontsize=16,
xy=(3,18.85),
xytext=(4,30),
)
plt.show()


This code produces the following (note the characteristic ‘elbow’ shape, after which the total variance changes very little with the inclusion of additional centroids):

Notice that the reduction in variance going from $$k=3$$ to $$k=4$$ isn’t as significant as the reduction going from $$k=1$$ to $$k=2$$, or even $$k=2$$ to $$k=3$$. So even though selecting the ‘best’ $$k$$ can be open to subjectivity, in this case it’s pretty clear that there would be little benefit is setting $$k \geq 4$$. Thus, for this dataset, we conclude that the optimal number of centroids is $$k=3$$.

## Alternative Initialization Methods

One of the shortcomings of conventional k-means is that ocassionally the approximation found can be arbitrarily bad with respect to the objective function compared to the optimal clustering4. The k-means++ algorithm attempts to address this issue by providing an alternative initialization scheme. In a 2007 paper that introduces k-means++, Arthur and Vassilvitskii propose a specific way of choosing k-means cluster centers. Note that in what follows, $$D(x)$$ denotes the shortest distance from a data point $$x$$ to the closest center $$C_{i}$$ already selected:

1. Take one center C_{1}, chosen uniformly at random from the initial dataset X.

2. Take a new center $$C_{i}$$, choosing $$x \in X$$ with probability $$\frac {D(x)^{2}}{\sum_{x \in X} D(x)^{2}}$$.

3. Repeat Step 2 until we have taken $$k$$ centroids in total.

4. Proceed as with the standard k-means algorithm.

From step 2, the expression $$\frac {D(x)^{2}}{\sum_{x \in X} D(x)^{2}}$$ can be interpreted as meaning observations having a greater distance from existing centroids will be given a higher probability of being selected when the next centroid is determined. In essence, this is an attempt to spread the initial centroids as far as possible from one another prior to commencing the k-means expectation and maximization steps.
We can encapsulate the logic corresponding to the k-means++ initialization method within a function that takes the dataset and number of clusters k as input. A sample implementation is given by kmeanspp:

def kmeanspp(data, k):
"""
Return cluster centroids using k-means++ initialization.
data => The input dataset
k    => THe number of centroids to return
"""
# First centroid selected uniformly at random.
centroids = [X[np.random.choice(range(X.shape[0]),1)[0],:]]

# Determine distance of each observation from chosen centroids.
for i in range(k-1):
dsq = np.array([max(distance.euclidean(X[i,:],j)**2
for j in centroids)
for i in range(X.shape[0])])

r_indx = np.random.choice(
range(X.shape[0]),1,p=dsq/np.sum(dsq)
)[0]

# Ensure same index not selected multiple times.
if dsq[r_indx]==0:
while True:
r_indx = np.random.choice(
range(X.shape[0]),1,p=dsq/np.sum(dsq)
)[0]
if dsq[r_indx]!=0: break

centroids.append(X[r_indx,:])

return(centroids)


If we call kmeanspp with our original dataset and k=3, the initial centroids have much better spread than when using conventional initialization:

It would be straightforward to replace the original centroid initialization logic in kmeans with kmeanspp.

## Clustering in scikit-learn

The k-means clustering algorithm can be found in scikit-learn within the sklearn.cluster module, where KMeans is just one of the many available clustering algorithms. More information on these algorithms can be found here.
In this section, discussion will be limited to the KMeans class and how it can be used to generate cluster assignments.
One of the greate advantages of scikit-learn is that it exposes a uniform interface to a large number of algorithms. The steps taken to setup and run scikit-learn’s implementation of KMeans can be repeated essentially without variation when leveraging any of the various supervised learning algorithms.

We start by scaling our variables so that no single explanatory variable contributes disproportionately to the overall variation in the dataset. This is accomplished quite easily using the StandardScaler class from the sklearn.preprocessing module:

from sklearn.preprocessing import StandardScaler

# Read in data as before.
X  = df[['mean','stddev']].values

sclr = StandardScaler().fit(X)
X = sclr.transform(X)


This transforms the data that comprises each explanatory variable into a standard normal distribution having 0 mean and unit variance.
Although not strictly required in this case since our dataset is complete, the next pre-processing step would typically consist of imputing missing values. Missing data elements are bound to one of 'mean', 'median' or 'most_frequent'. Since the definition will vary across datasets, the user specifies what constitutes a missing value. For the purposes of demonstration, we next apply scikit-learn’s Imputer class to the unemployment dataset, with np.nan specified as the missing value and 'mean' as the imputation strategy:

from sklearn.preprocessing import Imputer

imp = Imputer(missing_values=np.nan, strategy='mean').fit(X)
X = imp.transform(X)


Finally, we pass X to the fit method of KMeans:

from sklearn.cluster import KMeans
import numpy as np

kmeans = KMeans(n_clusters=3).fit(X)

# Get final cluster assignments.
kmeans.labels_

# Get final cluster centroids.
kmeans.cluster_centers_

# Predict cluster assignment for new observations.
kmeans.predict(np.atleast_2d([.515, .73337]))


## Conclusion

This post serves only as the briefest introduction to k-means and demonstrated only the simplest use cases. In higher dimensions, cluster assignments become difficult to interpret. However, for performing exploratory data analysis with reasonably sized datasets, k-means can prove to be very useful for gleaning insights. Until next time, happy coding!

### Footnotes:

1. Xindong and Vipin, The Top Ten Algorithms in Data Mining (Taylor & Francis Group 2009), pg. 21.

2. Bishop, Christopher M, Pattern Recognition and Machine Learning (Springer 2007), pg. 424.

3. https://en.wikipedia.org/wiki/K-means%2B%2B