K-Means Clustering from scratch in Python

Kunal Chhikara
4 min read · Jul 1, 2021

What is clustering?

Clustering refers to the segregation of data points into a number of different clusters (or groups) such that all the data points within a cluster are similar to each other.

Types of Clustering —

Clustering can broadly be divided into two types:

Hard Clustering — In hard clustering, each data point either belongs to a cluster completely or does not belong to it at all.

Soft Clustering — In soft clustering, instead of assigning each data point to exactly one cluster, a probability or likelihood of that data point belonging to each cluster is assigned.

Hard Clustering vs Soft Clustering

K-Means Clustering

K-Means clustering is one of the most popular unsupervised learning algorithms, which means that it learns patterns from unlabeled data.

The ‘K’ in K-Means stands for the number of centroids. A centroid is the point with the least total distance to all the points of its cluster; in K-Means it is simply the mean of those points. Each cluster has its own centroid. The ‘means’ in K-Means stands for this average value, which we calculate in order to find the centroid.
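
As a quick illustration (a minimal sketch with made-up points, not taken from the article's dataset), the centroid of a cluster is just the coordinate-wise mean of its points:

import numpy as np

# three made-up 2-D points forming one cluster
points = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
centroid = points.mean(axis=0)  # coordinate-wise mean -> array([3., 4.])
print(centroid)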

K-Means Algorithm

  1. Specify the number of clusters, k.

  2. Randomly initialize the k centroids.

  3. Keep iterating until the centroids become constant, i.e. the assignment of data points to clusters no longer changes:

  • Find the Euclidean distance between each centroid and the data points.
  • Assign each data point to the closest cluster.
  • Update each centroid by taking the average of all the data points in its cluster.
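
Before building this from scratch, here is a quick reference point. Assuming scikit-learn is installed, the same algorithm is available as sklearn.cluster.KMeans; this short sketch only illustrates the expected behaviour and is not part of the from-scratch implementation below (the alias avoids clashing with our own KMeans class).

from sklearn.cluster import KMeans as SKLearnKMeans
from sklearn.datasets import make_blobs

# toy data: 500 points drawn around 3 centers
X, _ = make_blobs(centers=3, n_samples=500, n_features=2, random_state=40)

# fit and label in one step; n_clusters plays the role of K
labels = SKLearnKMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
print(labels[:10])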

Implementation in Python from scratch

  • Function to calculate Euclidean distance

import numpy as np
import matplotlib.pyplot as plt

def euclidean_distance(x1, x2):
    # straight-line (L2) distance between two feature vectors
    return np.sqrt(np.sum((x1 - x2) ** 2))
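
A quick sanity check of this helper, with made-up values:

print(euclidean_distance(np.array([0.0, 0.0]), np.array([3.0, 4.0])))  # expected 5.0
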
  • KMeans class

Defining the constructor and initializing the values:

class KMeans:
    def __init__(self, K=5, max_iters=100, plot_steps=False):
        self.K = K
        self.max_iters = max_iters
        self.plot_steps = plot_steps
        # list of sample indices for each cluster
        self.clusters = [[] for _ in range(self.K)]
        # the centers (mean feature vector) for each cluster
        self.centroids = []

Predict Method

    def predict(self, X):
        self.X = X
        self.n_samples, self.n_features = X.shape

        # initialize centroids with K randomly chosen samples
        random_sample_idxs = np.random.choice(self.n_samples, self.K, replace=False)
        self.centroids = [self.X[idx] for idx in random_sample_idxs]

        # optimize clusters
        for _ in range(self.max_iters):
            # assign samples to the closest centroids (create clusters)
            self.clusters = self._create_clusters(self.centroids)
            if self.plot_steps:
                self.plot()

            # calculate new centroids from the clusters
            centroids_old = self.centroids
            self.centroids = self._get_centroids(self.clusters)

            # check if the clusters have changed
            if self._is_converged(centroids_old, self.centroids):
                break
            if self.plot_steps:
                self.plot()

        # classify samples as the index of their clusters
        return self._get_cluster_labels(self.clusters)

Helper functions for the predict method

    def _get_cluster_labels(self, clusters):
        # each sample gets the label of the cluster it was assigned to
        labels = np.empty(self.n_samples)
        for cluster_idx, cluster in enumerate(clusters):
            for sample_index in cluster:
                labels[sample_index] = cluster_idx
        return labels

    def _create_clusters(self, centroids):
        # assign the samples to the closest centroids to create clusters
        clusters = [[] for _ in range(self.K)]
        for idx, sample in enumerate(self.X):
            centroid_idx = self._closest_centroid(sample, centroids)
            clusters[centroid_idx].append(idx)
        return clusters

    def _closest_centroid(self, sample, centroids):
        # distance of the current sample to each centroid
        distances = [euclidean_distance(sample, point) for point in centroids]
        closest_index = np.argmin(distances)
        return closest_index

    def _get_centroids(self, clusters):
        # assign the mean value of each cluster to its centroid
        centroids = np.zeros((self.K, self.n_features))
        for cluster_idx, cluster in enumerate(clusters):
            cluster_mean = np.mean(self.X[cluster], axis=0)
            centroids[cluster_idx] = cluster_mean
        return centroids

    def _is_converged(self, centroids_old, centroids):
        # distances between each old and new centroid, for all centroids
        distances = [
            euclidean_distance(centroids_old[i], centroids[i]) for i in range(self.K)
        ]
        return sum(distances) == 0

    def plot(self):
        fig, ax = plt.subplots(figsize=(12, 8))
        for i, index in enumerate(self.clusters):
            point = self.X[index].T
            ax.scatter(*point)
        for point in self.centroids:
            ax.scatter(*point, marker="x", color="black", linewidth=2)
        plt.show()

    def cent(self):
        return self.centroids
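
One detail worth noting: _is_converged above stops only when the old and new centroids match exactly (sum(distances) == 0). A common variant, sketched here as an optional tweak rather than part of the original class (tol is an illustrative parameter), is to stop once the centroids move less than a small tolerance:

    def _is_converged(self, centroids_old, centroids, tol=1e-6):
        # stop once the total centroid movement falls below a small tolerance
        distances = [
            euclidean_distance(centroids_old[i], centroids[i]) for i in range(self.K)
        ]
        return sum(distances) < tol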

Implementing the model

from sklearn.datasets import make_blobs

X, y = make_blobs(centers=3, n_samples=500, n_features=2, shuffle=True, random_state=40)
print(X.shape)

clusters = len(np.unique(y))
print(clusters)

k = KMeans(K=clusters, max_iters=150, plot_steps=False)
y_pred = k.predict(X)
print(np.unique(y_pred))

Output:

(500, 2)
3
[0. 1. 2.]
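
To inspect the result visually, the plot and cent helpers defined above can be called on the fitted object (this assumes matplotlib is installed):

k.plot()          # scatter plot of the clusters, with centroids marked as "x"
print(k.cent())   # final centroid coordinates, one row per cluster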

Applications of K-Means

K-Means clustering is used in a variety of real-life applications and business cases, such as:

  • Academic performance
  • Diagnostic systems
  • Search engines
  • Wireless sensor networks

Advantages of K-Means

  • Simple — it is easy to implement K-Means and to identify unknown groups in complex data sets.
  • Flexible — the algorithm adapts easily when new examples are added or the data changes.
  • Suitable for large datasets — K-Means scales well to large datasets and converges relatively quickly.

Disadvantages of K-Means

  • No optimal set of clusters — K-Means does not determine the optimal number of clusters on its own; for effective results you have to decide on the number of clusters beforehand.
  • Lacks consistency — because the centroids are initialized randomly, different runs of the algorithm can give different results.
  • Uniform effect — it tends to produce clusters of roughly uniform size even when the groups in the input data have very different sizes.
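
Because the implementation above picks its initial centroids with np.random.choice, one simple way to make runs repeatable (a sketch, not part of the original code) is to seed NumPy's global random state before calling predict:

np.random.seed(0)                # fix the random initialization so repeated runs match
k = KMeans(K=3, max_iters=150)
y_pred = k.predict(X)            # X is the make_blobs data from the example above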
