K-Means Clustering from scratch in Python

Kunal Chhikara
4 min read · Jul 1, 2021

What is clustering?

Clustering refers to the segregation of data points into a number of different clusters (or groups) such that all the data points within a cluster are similar to each other.

Types of Clustering —

Clustering can broadly be divided into two types:

Hard Clustering — In hard clustering, each data point either belongs to a cluster completely or does not belong to it at all.

Soft Clustering — In soft clustering, instead of assigning each data point to exactly one cluster, a probability or likelihood of that data point belonging to each cluster is assigned.

Hard Clustering vs Soft Clustering

K-Means Clustering

K-Means clustering is one of the most popular unsupervised learning algorithms, which means that it learns patterns from unlabeled data.

The ‘K’ in K-Means stands for the number of centroids. A centroid is the point with the least total distance to all the points of its cluster; in K-Means it is simply the mean of those points. Each cluster has its own centroid. The ‘means’ in K-Means stands for this average value, which we calculate in order to find the centroid.
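
As a quick illustration (a minimal sketch with made-up points, not taken from the article's dataset), the centroid of a cluster is just the coordinate-wise mean of its points:

import numpy as np

# three made-up 2-D points forming one cluster
points = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
centroid = points.mean(axis=0)  # coordinate-wise mean -> array([3., 4.])
print(centroid)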

K-Means Algorithm

  1. Specify the number of clusters, k.

  2. Randomly initialize the k centroids.

  3. Keep iterating until the centroids become constant, i.e. the assignment of data points to clusters no longer changes:

  • Find the Euclidean distance between each centroid and the data points.
  • Assign each data point to the closest cluster.
  • Update each centroid by taking the average of all the data points in its cluster.
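
Before building this from scratch, here is a quick reference point. Assuming scikit-learn is installed, the same algorithm is available as sklearn.cluster.KMeans; this short sketch only illustrates the expected behaviour and is not part of the from-scratch implementation below (the alias avoids clashing with our own KMeans class).

from sklearn.cluster import KMeans as SKLearnKMeans
from sklearn.datasets import make_blobs

# toy data: 500 points drawn around 3 centers
X, _ = make_blobs(centers=3, n_samples=500, n_features=2, random_state=40)

# fit and label in one step; n_clusters plays the role of K
labels = SKLearnKMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
print(labels[:10])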

Implementation in Python from scratch

  • Function to calculate Euclidean distance

import numpy as np
import matplotlib.pyplot as plt

def euclidean_distance(x1, x2):
    # straight-line (L2) distance between two feature vectors
    return np.sqrt(np.sum((x1 - x2) ** 2))
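
A quick sanity check of this helper, with made-up values:

print(euclidean_distance(np.array([0.0, 0.0]), np.array([3.0, 4.0])))  # expected 5.0
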
  • KMeans class

Defining the constructor and initializing the values:

class KMeans:
    def __init__(self, K=5, max_iters=100, plot_steps=False):
        self.K = K
        self.max_iters = max_iters
        self.plot_steps = plot_steps
        # list of sample indices for each cluster
        self.clusters = [[] for _ in range(self.K)]
        # the centers (mean feature vector) for each cluster
        self.centroids = []

Predict Method

    def predict(self, X):
        self.X = X
        self.n_samples, self.n_features = X.shape

        # initialize centroids with K randomly chosen samples
        random_sample_idxs = np.random.choice(self.n_samples, self.K, replace=False)
        self.centroids = [self.X[idx] for idx in random_sample_idxs]

        # optimize clusters
        for _ in range(self.max_iters):
            # assign samples to the closest centroids (create clusters)
            self.clusters = self._create_clusters(self.centroids)
            if self.plot_steps:
                self.plot()

            # calculate new centroids from the clusters
            centroids_old = self.centroids
            self.centroids = self._get_centroids(self.clusters)

            # check if the clusters have changed
            if self._is_converged(centroids_old, self.centroids):
                break
            if self.plot_steps:
                self.plot()

        # classify samples as the index of their clusters
        return self._get_cluster_labels(self.clusters)

Helper functions for the predict method

    def _get_cluster_labels(self, clusters):
        # each sample gets the label of the cluster it was assigned to
        labels = np.empty(self.n_samples)
        for cluster_idx, cluster in enumerate(clusters):
            for sample_index in cluster:
                labels[sample_index] = cluster_idx
        return labels

    def _create_clusters(self, centroids):
        # assign the samples to the closest centroids to create clusters
        clusters = [[] for _ in range(self.K)]
        for idx, sample in enumerate(self.X):
            centroid_idx = self._closest_centroid(sample, centroids)
            clusters[centroid_idx].append(idx)
        return clusters

    def _closest_centroid(self, sample, centroids):
        # distance of the current sample to each centroid
        distances = [euclidean_distance(sample, point) for point in centroids]
        closest_index = np.argmin(distances)
        return closest_index

    def _get_centroids(self, clusters):
        # assign the mean value of each cluster to its centroid
        centroids = np.zeros((self.K, self.n_features))
        for cluster_idx, cluster in enumerate(clusters):
            cluster_mean = np.mean(self.X[cluster], axis=0)
            centroids[cluster_idx] = cluster_mean
        return centroids

    def _is_converged(self, centroids_old, centroids):
        # distances between each old and new centroid, for all centroids
        distances = [
            euclidean_distance(centroids_old[i], centroids[i]) for i in range(self.K)
        ]
        return sum(distances) == 0

    def plot(self):
        fig, ax = plt.subplots(figsize=(12, 8))
        for i, index in enumerate(self.clusters):
            point = self.X[index].T
            ax.scatter(*point)
        for point in self.centroids:
            ax.scatter(*point, marker="x", color="black", linewidth=2)
        plt.show()

    def cent(self):
        return self.centroids
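
One detail worth noting: _is_converged above stops only when the old and new centroids match exactly (sum(distances) == 0). A common variant, sketched here as an optional tweak rather than part of the original class (tol is an illustrative parameter), is to stop once the centroids move less than a small tolerance:

    def _is_converged(self, centroids_old, centroids, tol=1e-6):
        # stop once the total centroid movement falls below a small tolerance
        distances = [
            euclidean_distance(centroids_old[i], centroids[i]) for i in range(self.K)
        ]
        return sum(distances) < tol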

Implementing the model

from sklearn.datasets import make_blobs

X, y = make_blobs(centers=3, n_samples=500, n_features=2, shuffle=True, random_state=40)
print(X.shape)

clusters = len(np.unique(y))
print(clusters)

k = KMeans(K=clusters, max_iters=150, plot_steps=False)
y_pred = k.predict(X)
print(np.unique(y_pred))

Output:

(500, 2)
3
[0. 1. 2.]
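
To inspect the result visually, the plot and cent helpers defined above can be called on the fitted object (this assumes matplotlib is installed):

k.plot()          # scatter plot of the clusters, with centroids marked as "x"
print(k.cent())   # final centroid coordinates, one row per cluster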

Applications of K-Means

K-Means clustering is used in a variety of real-life applications and business cases, such as:

  • Academic performance
  • Diagnostic systems
  • Search engines
  • Wireless sensor networks

Advantages of K-Means

  • Simple — it is easy to implement K-Means and to identify unknown groups in complex data sets.
  • Flexible — the algorithm adapts easily when new examples are added or the data changes.
  • Suitable for large datasets — K-Means scales well to large datasets and converges relatively quickly.

Disadvantages of K-Means

  • No optimal set of clusters — K-Means does not determine the optimal number of clusters on its own; for effective results you have to decide on the number of clusters beforehand.
  • Lacks consistency — because the centroids are initialized randomly, different runs of the algorithm can give different results.
  • Uniform effect — it tends to produce clusters of roughly uniform size even when the groups in the input data have very different sizes.
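
Because the implementation above picks its initial centroids with np.random.choice, one simple way to make runs repeatable (a sketch, not part of the original code) is to seed NumPy's global random state before calling predict:

np.random.seed(0)                # fix the random initialization so repeated runs match
k = KMeans(K=3, max_iters=150)
y_pred = k.predict(X)            # X is the make_blobs data from the example above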
