k-means clustering is a simple iterative algorithm that partitions a dataset into a pre-determined number of clusters k based on a chosen distance metric. Unlike supervised classification algorithms, which have some notion of a target class, the objects comprising the input to k-means do not come with an associated target. For this reason, k-means clustering and its variations are considered instances of unsupervised learning techniques.
In this post, we cover the details of the k-means algorithm. After laying the theoretical foundation, we walk through a Python implementation of k-means and explore how to determine an optimal value for k. We also highlight a variant of k-means with an improved initialization scheme known as k-means++. Finally, we examine how to leverage the clustering routines available in scikit-learn.
Background
Given a set of observations \((x_{1}, x_{2}, \cdots, x_{n})\), where each observation is a \(p\)-dimensional vector, k-means attempts to partition the \(n\) observations into \(k\) clusters \(C = \{C_{1}, C_{2}, \cdots, C_{k}\}\) such that the intra-cluster sum of squares is minimized. The k-means objective function is defined as

$$
J = \sum_{n}\sum_{k} r_{nk} \lVert x_{n} - \mu_{k} \rVert^{2},
$$

which represents the sum of the squares of the distances of each data point to its assigned centroid \(\mu_{k}\). The indicator \(r_{nk}\) is given by

$$
r_{nk} =
\begin{cases}
1 & \text{if } k = \underset{j}{\operatorname{arg\,min}}\, \lVert x_{n} - \mu_{j} \rVert^{2} \\
0 & \text{otherwise},
\end{cases}
$$

which states that the \(n^{th}\) observation gets assigned to the cluster whose centroid is nearest.
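To make the objective concrete, here is a minimal sketch (not from the original derivation) that evaluates \(J\) for a set of observations, centroids, and hard cluster assignments; the array names are illustrative assumptions:
import numpy as np

def kmeans_objective(X, centroids, labels):
    """
    Evaluate the k-means objective J: the sum of squared distances from
    each observation to the centroid of its assigned cluster.
    X         : (n, p) array of observations.
    centroids : (k, p) array of cluster centroids.
    labels    : (n,) array of cluster indices in {0, ..., k-1}.
    """
    diffs = X - centroids[labels]      # difference to each assigned centroid
    return float(np.sum(diffs ** 2))   # sum of squared distances

# Tiny illustrative example with two well-separated clusters.
X_demo = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
demo_centroids = np.array([[0.05, 0.0], [5.05, 5.0]])
demo_labels = np.array([0, 0, 1, 1])
print(kmeans_objective(X_demo, demo_centroids, demo_labels))  # small J indicates tight clusters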
Conventional k-means usually begins with the random selection of k observations to serve as the initial centroids. The algorithm proceeds by assigning each observation to its nearest centroid. Once all observations have been assigned, the k centroids are updated using the latest cluster assignments. This is an example of an iterative refinement technique, which ceases only after the change in distances between consecutive iterations falls below some pre-determined threshold (or a maximum number of iterations is reached).
The two stages correspond to the expectation and maximization steps of the Expectation-Maximization (EM) algorithm. The value of \(J\) is never increased by either step, so convergence is guaranteed, although not necessarily to the global minimum. Since k-means runs quickly, it is common to rerun the algorithm with different starting points (and different values of k) to see how the choice of observations selected for the initial centroids affects the final formation of the k clusters.
Implementation
In what follows, we present an implementation of k-means that uses the conventional method of centroid initialization. In this version, the kmeans function takes as input data, representing the observations of interest, k, the number of clusters, and max_cycles, which limits the number of allowable EM cycles before iteration ceases. The function returns a dictionary containing:
- cycles: Number of cycles required
- all_means: Progression of centroids across all cycles
- all_assign: Cluster assignments for each observation at each iteration
- all_dists: Distance from each observation to its assigned centroid at each iteration
- final_means: Centroids at the final iteration
- final_assign: Final cluster assignments for each observation as a 1-D array
- final_dists: Final distance from each observation to its assigned centroid
- total_cost: The total variance over all observations
The implementation is provided below:
import itertools

import numpy as np
import pandas as pd
from scipy.spatial import distance


def kmeans(data, k=3, max_cycles=25):
    """
    Return means and cluster assignments for a given dataset `data`.

    Step I : Assignment step: Assign each observation to the cluster whose
             mean has the least squared Euclidean distance; this is
             intuitively the nearest mean.
    Step II: Update step: Calculate new cluster centroids from the
             observations assigned to each cluster.
    """
    if isinstance(data, pd.DataFrame):
        data = data.values

    # Add index column to data.
    tmpid = np.atleast_2d(np.arange(0, data.shape[0]))
    X = np.concatenate([tmpid.T, data], axis=1)

    # Conventional initialization: Select k random observations from X
    # to serve as the initial means.
    init_indx = np.random.choice(X.shape[0], k, replace=False)
    init_means = X[init_indx][:, 1:]
    centroids = init_means
    all_means, all_assign, all_dists = [], [], []

    # Initialize indicator and distance matrices which indicate cluster
    # membership for each observation.
    init_clust_assign = np.zeros(shape=(X.shape[0], k))
    init_clust_dists = np.zeros_like(init_clust_assign)
    all_means.append(init_means)
    all_assign.append(init_clust_assign)
    all_dists.append(init_clust_dists)

    for c in itertools.count(start=1):
        clust_assign = np.zeros(shape=(X.shape[0], k))
        clust_dists = np.zeros_like(clust_assign)

        # Step I (Assignment Step).
        for i in range(X.shape[0]):
            iterobs = X[i, 1:]
            iterdists = [distance.euclidean(j, iterobs) for j in centroids]
            min_dist = np.min(iterdists)
            min_indx = np.argmin(iterdists)
            clust_assign[i, min_indx] = 1
            clust_dists[i, min_indx] = min_dist

        # Test for convergence.
        if np.allclose(clust_dists, all_dists[-1], atol=1e-9) or c >= max_cycles:
            break

        all_assign.append(clust_assign)
        all_dists.append(clust_dists)

        # Step II (Update Step).
        centroids = np.zeros_like(init_means)
        for iter_k in range(k):
            # Use indices from clust_assign to retrieve the observations
            # assigned to cluster iter_k, then average them to obtain the
            # updated centroid.
            iter_mean = X[np.nonzero(clust_assign[:, iter_k])][:, 1:].mean(axis=0)
            centroids[iter_k] = iter_mean

        all_means.append(centroids)

    # Calculate cost function over all centroids.
    total_cost = (all_dists[-1]**2).sum()

    # Return final_assign as a 1-D array representing final cluster assignment.
    final_assign = np.apply_along_axis(
        lambda a: np.where(a == 1)[0][0], axis=1, arr=all_assign[-1]
    )

    dresults = {
        'k': k, 'cycles': c, 'all_means': all_means, 'all_assign': all_assign,
        'all_dists': all_dists, 'final_means': all_means[-1],
        'final_assign': final_assign, 'final_dists': all_dists[-1],
        'total_cost': total_cost
    }
    return dresults
At the start of the function, we prepend an identifier column, which allows us to pair observations with their instance identifiers. To demonstrate, we use the dataset found here, which contains the sample mean and standard deviation of unemployment data for each of the 50 states:
import pandas as pd
df = pd.read_csv("https://gist.githubusercontent.com/jtrive84/ea0f275a9ef010415392189e64c71fc3/raw/4bb56a71c5b597c16513c48183dde799ddf9ec51/unemp.csv")
# For visualizations.
x1 = df['mean'].values
x2 = df['stddev'].values
# Pass as input to kmeans function.
X = df[['mean','stddev']].values
results = kmeans(X, k=3)
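As noted earlier, the final clusters depend on which observations are randomly selected as the initial centroids. A minimal sketch (not part of the original post) of running several restarts of our kmeans function and keeping the lowest-cost solution; the number of restarts is an arbitrary choice for illustration:
# Rerun kmeans from different random initializations and retain the
# solution with the smallest total cost. n_restarts is illustrative.
n_restarts = 10
best = min(
    (kmeans(X, k=3, max_cycles=25) for _ in range(n_restarts)),
    key=lambda d: d['total_cost']
)
print(best['total_cost'], best['cycles'])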
Because kmeans returns the progression of centroid values and cluster assignments after each EM step, we can create a visualization that demonstrates how assignments and centroids change with each iteration. We next plot cluster assignments at four stages: the random initialization, the first assignment step, the fifth assignment step, and the final assignment step:
"""
Plotting various iterations of k-means cluster assignments.
"""
import matplotlib.pyplot as plt
import seaborn as sns
# Call kmeans with k=3.
output = kmeans(X, k=3, max_cycles=25)
am = output['all_means']
ad = output['all_dists']
aa = output['all_assign']
# Plot cluster assignments on 2x2 lattice plot.
fig = plt.figure()
# Plot initial means.
ax1 = fig.add_subplot(221)
ax1.plot(x1, x2, color="white", marker="o", fillstyle='none', markeredgecolor='black', markeredgewidth=.90)
ax1.set_title("k-means: Random Initialization", loc="left")
m1a, m2a = am[0][0, :]
m1b, m2b = am[0][1, :]
m1c, m2c = am[0][2, :]
ax1.plot(m1a, m2a, color="#FF0080", marker="D", markersize=12, fillstyle='none', markeredgewidth=1.30)
ax1.plot(m1b, m2b, color="#1A8B55", marker="D", markersize=12, fillstyle='none', markeredgewidth=1.30)
ax1.plot(m1c, m2c, color="#0080FF", marker="D", markersize=12, fillstyle='none', markeredgewidth=1.30)
ax1.plot(m1a, m2a, color="#FF0080", marker="+", markersize=28, fillstyle='none', markeredgewidth=1.50)
ax1.plot(m1b, m2b, color="#1A8B55", marker="+", markersize=28, fillstyle='none', markeredgewidth=1.50)
ax1.plot(m1c, m2c, color="#0080FF", marker="+", markersize=28, fillstyle='none', markeredgewidth=1.50)
ax1.plot(m1a, m2a, color="#FF0080", marker="o")
ax1.plot(m1b, m2b, color="#1A8B55", marker="o")
ax1.plot(m1c, m2c, color="#0080FF", marker="o")
plt.xticks([]); plt.yticks([])
# 1st Assignment Step.
ax2 = fig.add_subplot(222)
m1a, m2a = am[0][0,:]
m1b, m2b = am[0][1,:]
m1c, m2c = am[0][2,:]
ax2.plot(m1a,m2a,color="#FF0080",marker="D",markersize=12,fillstyle='none',markeredgewidth=1.30)
ax2.plot(m1b,m2b,color="#1A8B55",marker="D",markersize=12,fillstyle='none',markeredgewidth=1.30)
ax2.plot(m1c,m2c,color="#0080FF",marker="D",markersize=12,fillstyle='none',markeredgewidth=1.30)
ax2.plot(m1a,m2a,color="#FF0080",marker="+",markersize=28,fillstyle='none',markeredgewidth=1.50)
ax2.plot(m1b,m2b,color="#1A8B55",marker="+",markersize=28,fillstyle='none',markeredgewidth=1.50)
ax2.plot(m1c,m2c,color="#0080FF",marker="+",markersize=28,fillstyle='none',markeredgewidth=1.50)
# Retrieve points assigned to clusters.
aa1 = aa[1]
X0 = X[np.nonzero(aa1[:,0])][:,0:]
x01, x02 = X0[:, 0], X0[:, 1]
X1 = X[np.nonzero(aa1[:,1])][:,0:]
x11, x12 = X1[:, 0], X1[:, 1]
X2 = X[np.nonzero(aa1[:, 2])][:, 0:]
x21, x22 = X2[:, 0], X2[:, 1]
ax2.scatter(x01, x02, color="#FF0080", marker="o")
ax2.scatter(x11, x12, color="#1A8B55", marker="o")
ax2.scatter(x21, x22, color="#0080FF", marker="o")
ax2.set_title("k-means: First Assignment Step", loc="left")
plt.xticks([]); plt.yticks([])
# 5th assignment step.
ax3 = fig.add_subplot(223)
m1a, m2a = am[4][0, :]; m1b, m2b = am[4][1, :]; m1c, m2c = am[4][2, :]
ax3.plot(m1a, m2a, color="#FF0080", marker="D", markersize=12, fillstyle='none', markeredgewidth=1.30)
ax3.plot(m1b, m2b, color="#1A8B55", marker="D", markersize=12, fillstyle='none', markeredgewidth=1.30)
ax3.plot(m1c, m2c, color="#0080FF", marker="D", markersize=12, fillstyle='none', markeredgewidth=1.30)
ax3.plot(m1a, m2a, color="#FF0080", marker="+", markersize=28, fillstyle='none', markeredgewidth=1.50)
ax3.plot(m1b, m2b, color="#1A8B55", marker="+", markersize=28, fillstyle='none', markeredgewidth=1.50)
ax3.plot(m1c, m2c, color="#0080FF", marker="+", markersize=28, fillstyle='none', markeredgewidth=1.50)
# Retrieve points assigned to clusters.
aa1 = aa[5]
X0 = X[np.nonzero(aa1[:, 0])][:, 0:]
x01, x02 = X0[:,0], X0[:, 1]
X1 = X[np.nonzero(aa1[:, 1])][:, 0:]
x11, x12 = X1[:, 0], X1[:, 1]
X2 = X[np.nonzero(aa1[:, 2])][:, 0:]
x21, x22 = X2[:, 0], X2[:, 1]
ax3.scatter(x01, x02, color="#FF0080", marker="o")
ax3.scatter(x11, x12, color="#1A8B55", marker="o")
ax3.scatter(x21, x22, color="#0080FF", marker="o")
ax3.set_title("k-means: Fourth Assignment Step", loc="left")
plt.xticks([]); plt.yticks([])
# Final assignment step.
ax4 = fig.add_subplot(224)
m1a, m2a = am[-1][0, :]; m1b, m2b = am[-1][1, :]; m1c, m2c = am[-1][2, :]
ax4.plot(m1a, m2a, color="#FF0080", marker="D", markersize=12, fillstyle='none', markeredgewidth=1.30)
ax4.plot(m1b, m2b, color="#1A8B55", marker="D", markersize=12, fillstyle='none', markeredgewidth=1.30)
ax4.plot(m1c, m2c, color="#0080FF", marker="D", markersize=12, fillstyle='none', markeredgewidth=1.30)
ax4.plot(m1a, m2a, color="#FF0080", marker="+", markersize=28, fillstyle='none', markeredgewidth=1.50)
ax4.plot(m1b, m2b, color="#1A8B55", marker="+", markersize=28, fillstyle='none', markeredgewidth=1.50)
ax4.plot(m1c, m2c, color="#0080FF", marker="+", markersize=28, fillstyle='none', markeredgewidth=1.50)
# Retrieve points assigned to clusters.
aa1 = aa[-1]
X0 = X[np.nonzero(aa1[:, 0])][:, 0:]
x01, x02 = X0[:, 0], X0[:, 1]
X1 = X[np.nonzero(aa1[:, 1])][:, 0:]
x11, x12 = X1[:, 0], X1[:, 1]
X2 = X[np.nonzero(aa1[:, 2])][:, 0:]
x21, x22 = X2[:, 0], X2[:, 1]
ax4.scatter(x01, x02, color="#FF0080", marker="o")
ax4.scatter(x11, x12, color="#1A8B55", marker="o")
ax4.scatter(x21, x22, color="#0080FF", marker="o")
ax4.set_title("k-means: Final Assignment Step", loc="left")
plt.xticks([]); plt.yticks([])
fig.tight_layout()
fig.subplots_adjust(wspace=.05, hspace=0.1, left=.05, right=.95)
plt.show()
This produces the following:
The centroids have changed noticeably when comparing the upper-left and bottom-right subplots. However, we arbitrarily chose 3 as the number of cluster centroids. We next discuss how to determine a suitable k: the number of centroids beyond which adding clusters yields little further reduction in the cost function \(J\).
Determining the Optimal k
A common method used in assessing the optimal number of centroids k to use
for clustering is the elbow method, which plots the sum of squared residuals as
a function of the number of clusters across all observations.
If we compare the sum of squared residuals from a single-centroid model with the
same measure from the two-centroid model, it’s clear that the two-centroid
model will have a lower total variance since the larger residuals in the \(k=1\)
model will almost surely be reduced when \(k=2\). However, at a certain point
increasing k will have little impact on the intra-cluster variance, and the
elbow method can be used to help identify this point. In the code that follows, we
leverage our kmeans
function and the sample dataset introduced earlier to assess total
variance as a function of k, the number of centroids, with \(1 \leq k \leq 10\):
"""
Demonstrating the relationship between total variance as a function of the
number of clusters to determine the optimal number of centroids.
"""
fig = plt.figure()
sns.set(context='notebook', style='darkgrid', font_scale=1)
ax = fig.add_subplot(111)
ax.set_title(
"Total Variance as a Function of Cluster Count", fontsize=18,
loc="left", color="#FF6666", weight="bold"
)
ax.grid(True)
ax.set_xlabel("Number of Clusters", fontsize=15, labelpad=12.5, color="black")
ax.set_ylabel("Total Variance", fontsize=15, labelpad=12.5, color="black")
ax.axvline(x=3., linewidth=2., color="grey")
ax.plot(x, y, "--", linewidth=2.25, color="#FF6666")
ax.scatter(x, y, marker="o", color="black", alpha=.90)
ax.tick_params(axis='both', which='major', labelsize=12, pad=10)
# Add arrow highlighting k=3.
ax.annotate("optimal k=3", fontsize=16, xy=(3,18.85), xytext=(4,30),
arrowprops=dict(
facecolor='black', width=.6, headwidth=8, shrink=.05),)
plt.show()
Note the characteristic elbow shape, after which the total variance changes very little with the inclusion of additional centroids:
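In addition to visually inspecting the plot, a simple programmatic heuristic (an illustrative addition, not part of the original analysis) is to select the smallest k beyond which the relative reduction in total variance falls below some threshold, using the y values computed above:
# Heuristic elbow detection: pick the smallest k for which moving to k+1
# clusters improves total variance by less than 10% (threshold is arbitrary).
costs = np.asarray(y, dtype=float)                  # total variance for k = 1..10
rel_drop = (costs[:-1] - costs[1:]) / costs[:-1]    # fractional improvement k -> k+1
optimal_k = next(
    (k for k, r in zip(range(1, len(costs)), rel_drop) if r < 0.10),
    len(costs)
)
print(optimal_k)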
Alternative Initialization Methods
One of the shortcomings of conventional k-means is that occasionally the clustering found can be arbitrarily bad with respect to the objective function when compared to the optimal clustering. The k-means++ algorithm attempts to address this shortcoming by providing an alternative initialization scheme. In what follows, \(D(x)\) denotes the shortest distance from a data point \(x\) to the closest center \(C_{i}\) already selected:
1. Take one center \(C_{1}\), chosen uniformly at random from the initial dataset.
2. Take a new center \(C_{i}\), choosing \(x \in X\) with probability \(\frac {D(x)^{2}}{\sum_{x \in X} D(x)^{2}}\).
3. Repeat Step 2 until \(k\) centroids have been taken in total.
4. Proceed with the standard k-means algorithm.
From Step 2, the expression \(\frac {D(x)^{2}}{\sum_{x \in X} D(x)^{2}}\) means that observations farther from the already-selected centroids are given a higher probability of being chosen as the next centroid. In essence, this is an attempt to spread the initial centroids as far from one another as possible before commencing standard k-means.
We can encapsulate the k-means++ initialization method within a function that takes the dataset and the number of clusters k as input. A sample implementation is given by kmeanspp:
def kmeanspp(X, k):
    """
    Return cluster centroids using the k-means++ initialization technique.

    Parameters
    ----------
    X:
        The input dataset.
    k:
        The number of centroids to return.
    """
    # First centroid selected uniformly at random.
    centroids = [X[np.random.choice(range(X.shape[0]), 1)[0], :]]

    for _ in range(k - 1):
        # D(x)^2: squared distance from each observation to its nearest
        # already-selected centroid.
        dsq = np.array(
            [min(distance.euclidean(X[i, :], j)**2 for j in centroids)
             for i in range(X.shape[0])]
        )
        # Sample the next centroid with probability proportional to D(x)^2.
        r_indx = np.random.choice(range(X.shape[0]), 1, p=dsq / np.sum(dsq))[0]

        # Ensure the same index is not selected multiple times.
        if dsq[r_indx] == 0:
            while True:
                r_indx = np.random.choice(range(X.shape[0]), 1, p=dsq / np.sum(dsq))[0]
                if dsq[r_indx] != 0:
                    break

        centroids.append(X[r_indx, :])

    return centroids
If we call kmeanspp
with our original dataset and k=3, the initial
centroids have much better spread than when using conventional initialization:
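For reference, a minimal call (not shown in the original post) that produces such an initialization for the unemployment data; the seed is an arbitrary choice for reproducibility:
# Generate k-means++ initial centroids for the unemployment dataset.
np.random.seed(516)   # arbitrary seed for reproducibility
init_centroids = kmeanspp(X, k=3)
for c in init_centroids:
    print(c)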
Clustering in scikit-learn
The k-means clustering algorithm can be found in scikit-learn’s
sklearn.cluster
module, where KMeans
is one of the many available
clustering algorithms. Information on these algorithms can be found here.
In this section, we leverage the KMeans
class to generate cluster
assignments.
We start by scaling our variables so that no single explanatory variable
contributes disproportionately to the overall variation in the dataset.
from sklearn.preprocessing import StandardScaler
# Read in data as before.
df = pd.read_csv("unemp.csv")
X = df[['mean', 'stddev']].values
sclr = StandardScaler().fit(X)
X = sclr.transform(X)
This standardizes each explanatory variable to zero mean and unit variance. Next we pass X to the fit method of KMeans:
import numpy as np
from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters=3).fit(X)
# Get final cluster assignments.
labels = kmeans.labels_
# Get final cluster centroids.
centroids = kmeans.cluster_centers_
# Predict the cluster assignment for a new observation. Note that any new
# observation must be on the same scale as the data used to fit the model,
# so raw values should first be passed through the same StandardScaler.
kmeans.predict(np.atleast_2d([.515, .73337]))
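Worth noting: KMeans uses the k-means++ scheme by default (init="k-means++") and reruns the algorithm from several random initializations, keeping the lowest-inertia solution, so the ideas covered earlier map directly onto its parameters. A short sketch tying things together; the parameter values are illustrative choices:
# init="k-means++" is the default initialization; n_init controls the number
# of random restarts, with the lowest-inertia fit retained. Values below are
# illustrative only.
km = KMeans(n_clusters=3, init="k-means++", n_init=10, random_state=516)
km.fit(X)

# inertia_ is the total within-cluster sum of squares: the same quantity
# plotted in the elbow analysis above.
print(km.inertia_)

# Map the fitted centroids back onto the original (unscaled) basis.
print(sclr.inverse_transform(km.cluster_centers_))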