k-means clustering is a simple iterative clustering algorithm that partitions a dataset into a pre-determined number of clusters, \(k\), based on a chosen distance metric. Unlike supervised classification algorithms, which have some notion of a target class, the objects comprising the input to k-means do not come with an associated target. For this reason, k-means clustering and its variations are considered instances of unsupervised learning techniques [1].
In this post, we cover the k-means algorithm in detail. After laying the theoretical foundation, we walk through a Python implementation of k-means and explore how to determine an optimal value for \(k\). We’ll also highlight a variant of k-means with an improved initialization scheme known as k-means++. Finally, we examine how to leverage the clustering routines available in scikit-learn.
Background
Given a set of observations \((x_{1}, x_{2}, \cdots, x_{n})\), where each observation is a p-dimensional vector, k-means attempts to partition the \(n\) observations into \(k\) clusters \(C = \{C_{1}, C_{2}, \cdots, C_{k}\}\) such that the intra-cluster sum of squares is minimized. Adopting Bishop’s notational convention, we define the k-means objective (‘cost’) function as

\[
J = \sum_{n=1}^{N} \sum_{k=1}^{K} r_{nk} \lVert x_{n} - \mu_{k} \rVert^{2},
\]

which represents the sum of the squares of the distances of each data point to its assigned vector \(\mu_{k}\) [2].
Here, \(r_{nk}\) is given by

\[
r_{nk} =
\begin{cases}
1 & \text{if } k = \underset{j}{\arg\min} \, \lVert x_{n} - \mu_{j} \rVert^{2} \\
0 & \text{otherwise,}
\end{cases}
\]

which states that the \(n^{th}\) observation gets assigned to the cluster whose centroid is nearest.
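To make the objective concrete, here is a minimal NumPy sketch using made-up observations and centroids (the names obs and mu are ours, chosen purely for illustration) that evaluates \(J\) for a fixed set of assignments:

import numpy as np

# Five 2-D observations and two candidate centroids (made-up values).
obs = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1], [5.0, 5.2], [5.1, 4.9]])
mu = np.array([[1.0, 1.0], [5.0, 5.0]])

# Squared Euclidean distance of every observation to every centroid.
d2 = ((obs[:, None, :] - mu[None, :, :])**2).sum(axis=2)

# r[n, k] = 1 when observation n is assigned to its nearest centroid k.
r = np.zeros_like(d2)
r[np.arange(obs.shape[0]), d2.argmin(axis=1)] = 1

# J = sum over n and k of r_nk * ||x_n - mu_k||^2.
J = (r * d2).sum()
print(J)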
Conventional k-means usually begins with the random selection of \(k\) observations to serve as the initial centroids. The algorithm proceeds by assigning each observation to its nearest centroid. Upon completion, the \(k\) centroids are updated using the latest cluster assignments. This is an example of an iterative refinement technique, which ceases only once the change between consecutive iterations falls below some pre-determined threshold for all observations. In Pattern Recognition and Machine Learning, Bishop describes these steps as “minimizing \(J\) with respect to the \(r_{nk}\), keeping the \(\mu_{k}\) fixed, then minimizing \(J\) with respect to the \(\mu_{k}\), keeping \(r_{nk}\) fixed” [2]. He also points out that the two stages correspond to the E (expectation) and M (maximization) steps of the EM (Expectation Maximization) algorithm. The value of \(J\) is reduced after each EM cycle, so convergence is guaranteed, though not necessarily to the global minimum.
Since k-means runs quickly, it is common to repeat the procedure from several different random starting points (and for different values of \(k\)) to see how the choice of observations selected for the initial means affects the final formation of the \(k\) clusters.
Implementation
In what follows, we present an implementation of the k-means algorithm that uses the conventional method of centroid initialization, commonly referred to as Lloyd’s algorithm. In this version, the kmeans function takes as input data, representing the observations of interest, k, the number of clusters, and max_cycles, which limits the number of allowable EM cycles before iteration ceases. kmeans returns a dict of values containing:
cycles - Number of EM cycles required before iteration ceased
all_means - Progression of the centroids across all cycles
all_assign - Cluster assignments for each observation at each iteration
all_dists - Distance from each observation to its assigned centroid at each iteration
final_means - Centroids at the final iteration
final_assign - Final cluster assignments for each observation as a 1-D array
final_dists - Distance from each observation to its assigned centroid at the final iteration
total_cost - The total variance (within-cluster sum of squares) over all observations
We define the kmeans function as follows:
import numpy as np
import scipy
from scipy.spatial import distance
import pandas as pd
import itertools
def kmeans(data, k=3, max_cycles=25):
    """
    Return means and cluster assignments for a given
    dataset `data`.

    STEP I  => Assignment step: Assign each observation to the
               cluster whose mean has the least squared Euclidean
               distance; this is intuitively the "nearest" mean.
    STEP II => Update step: Recalculate the centroid of each
               cluster from its currently assigned observations.
    """
if isinstance(data, pd.DataFrame): data = data.values
# Add index column to data.
_tmpid = np.atleast_2d(np.arange(0, data.shape[0]))
X = np.concatenate([_tmpid.T, data], axis=1)
# Conventional initialization: Select k random means from X.
    init_indx = np.random.choice(np.arange(X.shape[0]), k, replace=False)
init_means = X[init_indx][:,1:]
centroids = init_means
all_means, all_assign, all_dists = [], [], []
    # Initialize indicator and distance matrices which indicate cluster
    # membership for each observation.
init_clust_assign = np.zeros(shape=(X.shape[0], k))
init_clust_dists = np.zeros_like(init_clust_assign)
all_means.append(init_means)
all_assign.append(init_clust_assign)
all_dists.append(init_clust_dists)
for c in itertools.count(start=1):
clust_assign = np.zeros(shape=(X.shape[0], k))
clust_dists = np.zeros_like(clust_assign)
for i in range(X.shape[0]): # Step I (Assignment Step)
iterobs = X[i, 1:]
iterdists = [distance.euclidean(j, iterobs) for j in centroids]
min_dist = np.min(iterdists)
min_indx = np.argmin(iterdists)
clust_assign[i, min_indx] = 1
clust_dists[i, min_indx] = min_dist
# Test for convergence.
if np.allclose(clust_dists,all_dists[-1],atol=1e-09) or c>=max_cycles:
break
all_assign.append(clust_assign)
all_dists.append(clust_dists)
centroids = np.zeros_like(init_means)
        for iter_k in range(k):  # Step II (Update Step)
            # Use indices from clust_assign to retrieve elements from X
            # in order to calculate centroid updates.
            iter_mean = X[np.nonzero(clust_assign[:, iter_k])][:, 1:].mean(axis=0)
            centroids[iter_k] = iter_mean
all_means.append(centroids)
# Calculate cost function (variance) over all centroids.
total_cost = (all_dists[-1]**2).sum()
# Return final_assign as a 1-D array representing final
# cluster assignment.
final_assign = np.apply_along_axis(
lambda a: np.where(a==1)[0][0],
axis=1, arr=all_assign[-1])
result = {
'k' :k,
'cycles' :c,
'all_means' :all_means,
'all_assign' :all_assign,
'all_dists' :all_dists,
'final_means' :all_means[-1],
'final_assign':final_assign,
'final_dists' :all_dists[-1],
'total_cost' :total_cost
}
return(result)
At the very start of the function definition we append an identifier column to the data, which allows us to pair observations with their instance identifiers.
To demonstrate, we’ll use the dataset found here,
which contains the mean and standard deviation of unemployment data for each of the 50 states:
df = pd.read_csv("unemp.csv")
# For visualizations.
x1 = df['mean'].values
x2 = df['stddev'].values
# Pass as input to kmeans function.
X = df[['mean','stddev']].values
results = kmeans(X, k=3)
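The returned dict can be inspected directly; for instance (exact values will differ from run to run because of the random initialization):

print(results['cycles'])        # number of EM cycles until convergence
print(results['final_means'])   # k x 2 array of final centroids
print(results['total_cost'])    # total within-cluster variance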
Because kmeans returns the progression of centroid values and cluster assignments after each E-M step, we can create a visualization that shows how assignments and centroids change with each iteration. We next plot cluster assignments at four iterations:
"""
Plotting various iterations of k-means cluster assignments.
"""
import numpy as np
import scipy
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# Call kmeans with k=3.
output = kmeans(X, k=3, max_cycles=25)
am = output['all_means']
ad = output['all_dists']
aa = output['all_assign']
# Plot cluster assignments on 2x2 lattice plot =>
fig = plt.figure()
# Plot initial means =>
ax1 = fig.add_subplot(221)
ax1.plot(x1, x2,color="white",marker="o",fillstyle='none',markeredgecolor='black',markeredgewidth=.90)
ax1.set_title("k-means: Random Initialization ", loc="left")
m1a, m2a = am[0][0,:]
m1b, m2b = am[0][1,:]
m1c, m2c = am[0][2,:]
ax1.plot(m1a,m2a,color="#FF0080",marker="D",markersize=12,fillstyle='none',markeredgewidth=1.30)
ax1.plot(m1b,m2b,color="#1A8B55",marker="D",markersize=12,fillstyle='none',markeredgewidth=1.30)
ax1.plot(m1c,m2c,color="#0080FF",marker="D",markersize=12,fillstyle='none',markeredgewidth=1.30)
ax1.plot(m1a,m2a,color="#FF0080",marker="+",markersize=28,fillstyle='none',markeredgewidth=1.50)
ax1.plot(m1b,m2b,color="#1A8B55",marker="+",markersize=28,fillstyle='none',markeredgewidth=1.50)
ax1.plot(m1c,m2c,color="#0080FF",marker="+",markersize=28,fillstyle='none',markeredgewidth=1.50)
ax1.plot(m1a,m2a,color="#FF0080",marker="o")
ax1.plot(m1b,m2b,color="#1A8B55",marker="o")
ax1.plot(m1c,m2c,color="#0080FF",marker="o")
plt.xticks([])
plt.yticks([])
# 1st Assignment Step =>
ax2 = fig.add_subplot(222)
m1a, m2a = am[0][0,:]
m1b, m2b = am[0][1,:]
m1c, m2c = am[0][2,:]
ax2.plot(m1a,m2a,color="#FF0080",marker="D",markersize=12,fillstyle='none',markeredgewidth=1.30)
ax2.plot(m1b,m2b,color="#1A8B55",marker="D",markersize=12,fillstyle='none',markeredgewidth=1.30)
ax2.plot(m1c,m2c,color="#0080FF",marker="D",markersize=12,fillstyle='none',markeredgewidth=1.30)
ax2.plot(m1a,m2a,color="#FF0080",marker="+",markersize=28,fillstyle='none',markeredgewidth=1.50)
ax2.plot(m1b,m2b,color="#1A8B55",marker="+",markersize=28,fillstyle='none',markeredgewidth=1.50)
ax2.plot(m1c,m2c,color="#0080FF",marker="+",markersize=28,fillstyle='none',markeredgewidth=1.50)
# Retrieve points assigned to clusters.
aa1 = aa[1]
X0 = X[np.nonzero(aa1[:,0])][:,0:]
x01, x02 = X0[:,0], X0[:,1]
X1 = X[np.nonzero(aa1[:,1])][:,0:]
x11, x12 = X1[:,0], X1[:,1]
X2 = X[np.nonzero(aa1[:,2])][:,0:]
x21, x22 = X2[:,0], X2[:,1]
ax2.scatter(x01,x02,color="#FF0080",marker="o")
ax2.scatter(x11,x12,color="#1A8B55",marker="o")
ax2.scatter(x21,x22,color="#0080FF",marker="o")
ax2.set_title("k-means: First Assignment Step", loc="left")
plt.xticks([])
plt.yticks([])
# 5th assignment step =>
ax3 = fig.add_subplot(223)
m1a, m2a = am[4][0,:]; m1b, m2b = am[4][1,:]; m1c, m2c = am[4][2,:]
ax3.plot(m1a,m2a,color="#FF0080",marker="D",markersize=12,fillstyle='none',markeredgewidth=1.30)
ax3.plot(m1b,m2b,color="#1A8B55",marker="D",markersize=12,fillstyle='none',markeredgewidth=1.30)
ax3.plot(m1c,m2c,color="#0080FF",marker="D",markersize=12,fillstyle='none',markeredgewidth=1.30)
ax3.plot(m1a,m2a,color="#FF0080",marker="+",markersize=28,fillstyle='none',markeredgewidth=1.50)
ax3.plot(m1b,m2b,color="#1A8B55",marker="+",markersize=28,fillstyle='none',markeredgewidth=1.50)
ax3.plot(m1c,m2c,color="#0080FF",marker="+",markersize=28,fillstyle='none',markeredgewidth=1.50)
# Retrieve points assigned to clusters.
aa1 = aa[5]
X0 = X[np.nonzero(aa1[:,0])][:,0:]
x01, x02 = X0[:,0], X0[:,1]
X1 = X[np.nonzero(aa1[:,1])][:,0:]
x11, x12 = X1[:,0], X1[:,1]
X2 = X[np.nonzero(aa1[:,2])][:,0:]
x21, x22 = X2[:,0], X2[:,1]
ax3.scatter(x01,x02,color="#FF0080",marker="o")
ax3.scatter(x11,x12,color="#1A8B55",marker="o")
ax3.scatter(x21,x22,color="#0080FF",marker="o")
ax3.set_title("k-means: Fourth Assignment Step", loc="left")
plt.xticks([])
plt.yticks([])
# Final assignment step =>
ax4 = fig.add_subplot(224)
m1a, m2a = am[-1][0,:]; m1b, m2b = am[-1][1,:]; m1c, m2c = am[-1][2,:]
ax4.plot(m1a,m2a,color="#FF0080",marker="D",markersize=12,fillstyle='none',markeredgewidth=1.30)
ax4.plot(m1b,m2b,color="#1A8B55",marker="D",markersize=12,fillstyle='none',markeredgewidth=1.30)
ax4.plot(m1c,m2c,color="#0080FF",marker="D",markersize=12,fillstyle='none',markeredgewidth=1.30)
ax4.plot(m1a,m2a,color="#FF0080",marker="+",markersize=28,fillstyle='none',markeredgewidth=1.50)
ax4.plot(m1b,m2b,color="#1A8B55",marker="+",markersize=28,fillstyle='none',markeredgewidth=1.50)
ax4.plot(m1c,m2c,color="#0080FF",marker="+",markersize=28,fillstyle='none',markeredgewidth=1.50)
# Retrieve points assigned to clusters.
aa1 = aa[-1]
X0 = X[np.nonzero(aa1[:,0])][:,0:]
x01, x02 = X0[:,0], X0[:,1]
X1 = X[np.nonzero(aa1[:,1])][:,0:]
x11, x12 = X1[:,0], X1[:,1]
X2 = X[np.nonzero(aa1[:,2])][:,0:]
x21, x22 = X2[:,0], X2[:,1]
ax4.scatter(x01,x02,color="#FF0080",marker="o")
ax4.scatter(x11,x12,color="#1A8B55",marker="o")
ax4.scatter(x21,x22,color="#0080FF",marker="o")
ax4.set_title("k-means: Final Assignment Step", loc="left")
plt.xticks([])
plt.yticks([])
fig.tight_layout()
fig.subplots_adjust(wspace=.05,hspace=0.1,left=.05, right=.95)
plt.show()
Running the code above generates the following:
It’s clear that the centroids have changed noticeably when comparing the upper-left and bottom-right subplots. However, we arbitrarily chose 3 as the number of cluster centroids. We next discuss how to determine a suitable \(k\): the number of centroids beyond which adding more clusters yields only marginal reductions in the cost function \(J\).
Determining the Optimal \(k\)
A common method for assessing the optimal number of centroids \(k\) to use
for clustering is the elbow method, which plots the sum of squared residuals
over all observations as a function of the number of clusters. If we calculate
the sum of squared residuals from a single-centroid model and compare it with the
same measure from a two-centroid model, it’s clear that the two-centroid
model will have a lower total variance, since the larger residuals in the
\(k=1\) model will almost certainly be reduced when \(k=2\). However, at a certain
point increasing \(k\) has little impact on the intra-cluster variance,
and the elbow method can be used to help identify this point.
In the code that follows, we leverage our kmeans function and the sample dataset introduced earlier to plot total variance as a function of \(k\), the number of centroids, with \(1 \leq k \leq 10\).
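First, we compute the total cost for each candidate \(k\); the x and y arrays built here feed the plotting code that follows (a minimal sketch, and since initialization is random the exact values will vary from run to run):

# Total cost (within-cluster sum of squares) for k = 1 through 10.
ks = list(range(1, 11))
costs = [kmeans(X, k=i)['total_cost'] for i in ks]
x, y = ks, costs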
"""
Demonstrating the relationship between total variance
as a function of the number of clusters to determine the
optimal number of centroids.
"""
fig = plt.figure()
sns.set(context='notebook', style='darkgrid', font_scale=1)
ax = fig.add_subplot(111)
ax.set_title(
"Total Variance as a Function of Cluster Count",
fontsize=18,
loc="left",
color="#FF6666",
weight="bold")
ax.grid(True)
ax.set_xlabel("Number of Clusters", fontsize=15, labelpad=12.5, color="black")
ax.set_ylabel("Total Variance", fontsize=15, labelpad=12.5, color="black")
ax.axvline(x=3., linewidth=2., color="grey")
ax.plot(x, y, "--", linewidth=2.25, color="#FF6666")
ax.scatter(x, y, marker="o", color="black", alpha=.90)
ax.tick_params(axis='both', which='major', labelsize=12, pad=10)
# Add arrow highlighting k=3.
ax.annotate(
"optimal k=3",
fontsize=16,
xy=(3,18.85),
xytext=(4,30),
arrowprops=dict(facecolor='black',width=.6,headwidth=8,shrink=.05),
)
plt.show()
This code produces the following (note the characteristic ‘elbow’ shape, after
which the total variance changes very little with the inclusion of additional
centroids):
Notice that the reduction in variance going from \(k=3\) to \(k=4\) isn’t as significant as the reduction going from \(k=1\) to \(k=2\), or even from \(k=2\) to \(k=3\). So even though selecting the ‘best’ \(k\) can be somewhat subjective, in this case it’s clear that there would be little benefit in setting \(k \geq 4\). Thus, for this dataset, we conclude that the optimal number of centroids is \(k=3\).
Alternative Initialization Methods
One of the shortcomings of conventional k-means is that occasionally the approximation found can be arbitrarily bad with respect to the objective function compared to the optimal clustering [3]. The k-means++ algorithm attempts to address this issue by providing an alternative initialization scheme. In the 2007 paper that introduces k-means++, Arthur and Vassilvitskii propose a specific way of choosing the initial k-means cluster centers. Note that in what follows, \(D(x)\) denotes the shortest distance from a data point \(x\) to the closest center \(C_{i}\) already selected:
1. Take one center \(C_{1}\), chosen uniformly at random from the initial dataset \(X\).
2. Take a new center \(C_{i}\), choosing \(x \in X\) with probability \(\frac {D(x)^{2}}{\sum_{x \in X} D(x)^{2}}\).
3. Repeat Step 2 until we have taken \(k\) centroids in total.
4. Proceed as with the standard k-means algorithm.
From step 2, the expression \(\frac {D(x)^{2}}{\sum_{x \in X} D(x)^{2}}\) can be
interpreted as meaning observations having a greater distance from existing
centroids will be given a higher probability of being selected when the next
centroid is determined. In essence, this is an attempt to spread the initial
centroids as far as possible from one another prior to commencing the k-means
expectation and maximization steps.
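As a toy illustration of the weighting in step 2 (the numbers here are made up purely for demonstration), suppose one centroid has already been chosen and the squared distances \(D(x)^{2}\) of four candidate observations to it are 1, 4, 9 and 16. The probability of each observation being drawn as the next centroid is then:

# Hypothetical squared distances D(x)^2 to the nearest chosen centroid.
dsq = np.array([1., 4., 9., 16.])
probs = dsq / dsq.sum()
print(probs)  # [0.0333... 0.1333... 0.3 0.5333...]; the farthest point is most likely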
We can encapsulate the logic corresponding to the k-means++ initialization method within a function that takes the dataset and the number of clusters k as input. A sample implementation is given by kmeanspp:
def kmeanspp(data, k):
    """
    Return cluster centroids using k-means++ initialization.

    data => The input dataset
    k    => The number of centroids to return
    """
    # First centroid selected uniformly at random.
    centroids = [data[np.random.choice(range(data.shape[0]), 1)[0], :]]
    for _ in range(k - 1):
        # D(x)^2: squared distance of each observation to its *closest*
        # already-selected centroid.
        dsq = np.array([min(distance.euclidean(data[i, :], j)**2
                            for j in centroids)
                        for i in range(data.shape[0])])
        # Sample the next centroid with probability proportional to D(x)^2.
        r_indx = np.random.choice(
            range(data.shape[0]), 1, p=dsq/np.sum(dsq)
        )[0]
        # Ensure the same index is not selected multiple times.
        if dsq[r_indx] == 0:
            while True:
                r_indx = np.random.choice(
                    range(data.shape[0]), 1, p=dsq/np.sum(dsq)
                )[0]
                if dsq[r_indx] != 0:
                    break
        centroids.append(data[r_indx, :])
    return centroids
If we call kmeanspp with our original dataset and k=3, the initial centroids have a much better spread than those obtained with conventional initialization:
It would be straightforward to replace the original centroid initialization logic in kmeans with kmeanspp.
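For example, assuming both functions defined above are in scope, the k-means++ centroids for our dataset can be computed directly, and the substitution inside kmeans amounts to a two-line change (sketched in the comments below; an init keyword argument would be another reasonable design, though that parameter is our own suggestion rather than part of the implementation above):

# k-means++ initial centroids for the unemployment dataset.
pp_centroids = np.array(kmeanspp(X, k=3))
print(pp_centroids)

# Inside `kmeans`, the conventional initialization
#     init_indx  = np.random.choice(np.arange(X.shape[0]), k, replace=False)
#     init_means = X[init_indx][:, 1:]
# could be replaced with
#     init_means = np.array(kmeanspp(data, k))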
Clustering in scikit-learn
The k-means clustering algorithm can be found in scikit-learn within the sklearn.cluster module, where KMeans is just one of the many available clustering algorithms. More information on these algorithms can be found here. In this section, discussion will be limited to the KMeans class and how it can be used to generate cluster assignments.
One of the great advantages of scikit-learn is that it exposes a uniform interface to a large number of algorithms. The steps taken to set up and run scikit-learn’s implementation of KMeans can be repeated essentially without variation when leveraging any of the library’s other estimators, including its supervised learning algorithms.
We start by scaling our variables so that no single explanatory variable contributes disproportionately to the overall variation in the dataset. This is accomplished quite easily using the StandardScaler class from the sklearn.preprocessing module:
from sklearn.preprocessing import StandardScaler
# Read in data as before.
df = pd.read_csv("unemp.csv")
X = df[['mean','stddev']].values
sclr = StandardScaler().fit(X)
X = sclr.transform(X)
This standardizes each explanatory variable to have zero mean and unit variance.
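As a quick sanity check, each scaled column should now have (approximately) zero mean and unit standard deviation:

# Verify the effect of StandardScaler on each column.
print(X.mean(axis=0))  # approximately [0., 0.]
print(X.std(axis=0))   # approximately [1., 1.]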
Although not strictly required in this case since our dataset is complete, the next pre-processing step would typically consist of imputing missing values. Missing data elements are replaced according to one of three strategies: 'mean', 'median' or 'most_frequent'. Since the definition of a missing value varies across datasets, the user specifies what constitutes one. For the purposes of demonstration, we next apply scikit-learn’s Imputer class to the unemployment dataset, with np.nan specified as the missing value and 'mean' as the imputation strategy:
from sklearn.preprocessing import Imputer
imp = Imputer(missing_values=np.nan, strategy='mean').fit(X)
X = imp.transform(X)
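As an aside, in more recent scikit-learn releases the Imputer class has been removed in favor of SimpleImputer from the sklearn.impute module; an equivalent sketch using the newer API would be:

from sklearn.impute import SimpleImputer

imp = SimpleImputer(missing_values=np.nan, strategy='mean').fit(X)
X = imp.transform(X)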
Finally, we pass X to the fit method of KMeans:
from sklearn.cluster import KMeans
import numpy as np
kmeans = KMeans(n_clusters=3).fit(X)
# Get final cluster assignments.
kmeans.labels_
# Get final cluster centroids.
kmeans.cluster_centers_
# Predict cluster assignment for new observations.
kmeans.predict(np.atleast_2d([.515, .73337]))
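Two additional notes, offered as a sketch rather than a complete treatment: the fitted KMeans object exposes the final value of the objective through its inertia_ attribute (analogous to the total_cost returned by our own implementation), and since the model was fit on scaled data, new observations should be passed through the same StandardScaler before calling predict. The observation below is purely illustrative:

# Total within-cluster sum of squares, analogous to our `total_cost`.
print(kmeans.inertia_)

# New observations receive the same preprocessing as the training data
# before prediction (the values here are illustrative only).
new_obs = np.atleast_2d([5.15, 1.25])
print(kmeans.predict(sclr.transform(new_obs)))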
Conclusion
This post serves only as the briefest introduction to k-means and demonstrates only the simplest use cases. In higher dimensions, cluster assignments become difficult to interpret. However, for exploratory data analysis on reasonably sized datasets, k-means can prove to be very useful for gleaning insights. Until next time, happy coding!
Footnotes:
1. Wu, Xindong and Kumar, Vipin (eds.), The Top Ten Algorithms in Data Mining (Taylor & Francis Group, 2009), pg. 21.
2. Bishop, Christopher M., Pattern Recognition and Machine Learning (Springer, 2007), pg. 424.
3. https://en.wikipedia.org/wiki/K-means%2B%2B