
Machine Learning - Building K-means Clustering Algorithm from scratch

 


We have seen how a K-means clustering model is built in R using predefined functions such as kmeans(). However, it is also important to understand how the machine learning works at the back-end, i.e. what computations run inside the main function. We will walk through the various stages of the machine learning process in K-means and create our own machine learning functions step by step:

  1. Define a function to compute centroids
  2. Define a function to assign a cluster number to each record
  3. Define a function to randomly initialize centroids
  4. Define a function to train the K-means model
  5. Define a function to plot the progress of the K-means model
  6. Train the model on data using all the functions

In the video lecture, the algorithm has been explained in detail. Please go through the lecture before starting this lab session for a better understanding.

Centroid computation function

Let’s start with the centroid computation function. This function takes the dataset and the centroid id (cluster number) of each data point, and calculates the mean of all the data points in each cluster, which becomes the centroid of that cluster. It returns a matrix containing the centroids in its rows.

Centroids <- function(X, id, K){
  
  # X = dataset in a matrix, each row is an observation
  # id = centroid number/id (1 to K) assigned to each data point
  # K = number of centroids
  
  X <- as.matrix(X) #In case the dataset is not in a matrix format
  
  n = ncol(X) #number of dimensions/features in X
  
  #Empty matrix with all 0s for storing centroids in each row
  centroids = matrix(0, K, n)
  
  for(i in 1:K){
    ind = which(id == i) #Indices of all data points belonging to cluster i
    #Mean value of each dimension of cluster i; drop = FALSE keeps a
    #single-row subset as a matrix so colMeans() still works
    centroids[i,] = colMeans(X[ind, , drop = FALSE])
  }
  
  centroids
}
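As a quick sanity check on this logic, the same per-cluster means can be computed with base R's rowsum() (a hypothetical toy example, not part of the lab's functions):

```r
X  <- matrix(c(1, 2,  3, 4,  9, 8,  11, 10), ncol = 2, byrow = TRUE)
id <- c(1, 1, 2, 2)  # first two rows in cluster 1, last two in cluster 2

# Per-cluster means: sum the rows by group, then divide by group size
cents <- rowsum(X, id) / as.vector(table(id))
cents  # cluster means: (2, 3) and (10, 9)
```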

Function for assigning cluster number to each record

This function takes the dataset and the centroids, calculates the Euclidean distance of each data point from each centroid, and assigns the observation to the nearest centroid. It returns the vector “id” containing the assigned centroid number for every data point.

centroid_id <- function(X, centroids){
  
  # X = a matrix of dataset
  # centroids = a matrix containing centroids in rows
  
  X <- as.matrix(X) #In case the dataset is not in a matrix format
  centroids <- as.matrix(centroids) #In case the centroids are not in a matrix format
  
  K = nrow(centroids) #Number of centroids
  
  #Empty vector with all 0s for storing the assigned cluster number of each data point
  id = rep(0, nrow(X))
  
  for(i in 1:nrow(X)){
    d = rep(0, K)
    
    for(j in 1:K){
      #Squared Euclidean distance of record i from centroid j
      #(the square root is not needed for finding the minimum)
      d[j] = t(X[i,] - centroids[j,]) %*% (X[i,] - centroids[j,])
    }
    
    id[i] <- which.min(d)
  }
  
  id
}
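The inner loop over centroids can also be vectorized: for a single centroid, colSums((t(X) - centroid)^2) gives the squared distance of every record at once. A small sketch on toy data (an alternative, not part of the lab's functions):

```r
set.seed(7)
X <- matrix(rnorm(12), ncol = 2)      # 6 toy records
centroids <- rbind(c(0, 0), c(2, 2))  # 2 toy centroids

# Squared distance of every record to every centroid, one centroid per column
d  <- sapply(1:nrow(centroids), function(j) colSums((t(X) - centroids[j, ])^2))
id <- apply(d, 1, which.min)          # nearest centroid for each record

# Cross-check against the explicit formula used in the loop version
d1 <- sum((X[1, ] - centroids[1, ])^2)
stopifnot(isTRUE(all.equal(d1, d[1, 1])))
```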

Function for randomly initializing centroids

This function takes the dataset X and K (the number of centroids), and initializes the centroids by randomly selecting K records from the dataset X. It returns a matrix of random centroids.

initCentroids <- function(X, K){
  
  X <- as.matrix(X) #In case the dataset is not in a matrix format
  
  rand_id = sample(nrow(X)) #Random permutation of the row indices
  centroids = X[rand_id[1:K],] #First K rows of the shuffled data
  centroids
}
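Equivalently, sample(nrow(X), K) draws K distinct row indices directly, without permuting all the rows first (a small sketch on toy data):

```r
set.seed(42)
X <- matrix(rnorm(20), ncol = 2)  # 10 toy records
K <- 3

rand_rows <- sample(nrow(X), K)   # K distinct row indices
centroids <- X[rand_rows, ]       # K rows of X, used as initial centroids
dim(centroids)                    # 3 rows, 2 columns
```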

Function to plot kmeans model progress

This function takes the dataset X, the current centroids, the previous centroids, the vector id containing the centroid number for every record, the number of centroids K and the current iteration number. It plots the model’s progress.

modelProgress <- function(X, centroids, prev_centroids, id, K, i){
  
  # X = a matrix of dataset
  # centroids = a matrix containing centroids in rows
  # prev_centroids = previous centroids
  # id = centroid number/id (1 to k) assigned to each data point 
  # K = number of centroids
  # i = current iteration number
  
  # Create palette
  palette = rainbow(K)
  colors = palette[id]
  
  #Plot the data points, coloured by assigned cluster
  #(assumes an empty plot has already been set up, as in kMeansModel())
  points(X[,1], X[,2], col=colors, lwd=1, pch=1)
  
  #Plot the centroids as black crosses
  points(centroids[,1], centroids[,2], col="black", pch=4, lty=1, lwd=6)
  
  #Draw a line from each previous centroid to its new position
  for(j in 1:nrow(centroids)){
    lines(c(centroids[j,1], prev_centroids[j,1]), c(centroids[j,2], prev_centroids[j,2]))
  }
}

Function to train kmeans model

This function takes the dataset X, the randomly initialized centroids, the maximum number of iterations and a flag indicating whether to display the model’s progress (TRUE or FALSE). It trains the K-means model and returns the final centroids along with the vector id containing the centroid number for every record in the data.

kMeansModel <- function(X, init.centroids, maxit, model_progress=FALSE){
  
  # X = dataset
  # init.centroids = randomly initialized centroids
  # maxit = maximum number of iteration
  # model_progress = TRUE for plotting model progress
  # Uses centroid_id() in a for loop for maxit to assign centroid number to every record
  # Uses Centroids() in the same for loop to compute centroids for all clusters
  
  X <- as.matrix(X) #In case the dataset is not in a matrix format
  
 if(model_progress) {
   plot(X[,1],X[,2],type="n")
   title(main=paste0('Model progress in ', maxit,' Iterations' ))
 }
   
  
  # Initializing variables
  m <- nrow(X)
  K <- nrow(init.centroids)
  centroids <- init.centroids
  prev_centroids <- centroids #To store the previous centroids as the new centroids are computed
  id <- rep(0,m) #Empty vector to store centroid/cluster number assigned to every record
  
  #Train K-means
  for(i in 1:maxit){
    
    #Display iteration number
    cat('iteration no.', i,'\n')
    
    #Every record in X is assigned to the closest centroid
    id = centroid_id(X, centroids)
    
    #If TRUE, plot model progress and wait for user input
    if (model_progress){
      modelProgress(X, centroids, prev_centroids, id, K, i)
      prev_centroids = centroids
      cat("Press [enter] to continue")
      line <- readline()
    }
    
    #Compute new centroids based on id returned by centroid_id()
    centroids = Centroids(X, id, K)
  }
  
  list(centroids=centroids, id=id)
}

Training the K-means model

Now we have all the functions needed to run the algorithm. Let’s build a K-means model using a dataset.

Simulating dataset for training

Let’s randomly simulate some data points for building our model.

set.seed(2)
x = matrix(rnorm(150*2), ncol = 2)
x[1:50, 1] = x[1:50, 1] + 2
x[1:50, 2] = x[1:50, 2] - 1
x[51:100, 1] = x[51:100, 1] - 3
x[51:100, 2] = x[51:100, 2] + 2
x[101:150, 1] = x[101:150, 1] + 6
x[101:150, 2] = x[101:150, 2] - 4

Initializing important values for running K-Means

Let’s set the number of clusters and the maximum number of iterations, and initialize the centroids.

K = 3
maxit = 10
init.centroids = matrix(c(-4, -4, 2, 4, 8, 4),3, byrow = T)

Train K-Means model

Let’s train the K-means model on the dataset while plotting the model’s progress.

km.model <- kMeansModel(x, init.centroids, maxit, TRUE) 

## iteration no. 1 
## Press [enter] to continue
## iteration no. 2 
## Press [enter] to continue
## iteration no. 3 
## Press [enter] to continue
## iteration no. 4 
## Press [enter] to continue
## iteration no. 5 
## Press [enter] to continue
## iteration no. 6 
## Press [enter] to continue
## iteration no. 7 
## Press [enter] to continue
## iteration no. 8 
## Press [enter] to continue
## iteration no. 9 
## Press [enter] to continue
## iteration no. 10 
## Press [enter] to continue

Exercise: Use some other dataset from base R, the MASS package or elsewhere, or simulated data, and run the above algorithm in R.

An application of K-Means in real life

Let’s go through one real life application of k-means clustering.

We will use K-means clustering to compress an image of size 128-by-128 pixels:

  1. Divide the pixels into clusters, let’s say 18
  2. Compute the centroid of each cluster
  3. Assign each pixel to the closest centroid
  4. Replace every pixel with its respective centroid
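These steps can be rehearsed end-to-end on a tiny synthetic “image” using base R’s kmeans() and made-up pixel values (a sketch only; below we apply our own functions to a real file):

```r
set.seed(3)
img <- array(runif(4 * 4 * 3), dim = c(4, 4, 3))  # a fake 4x4 RGB image

pix <- img
dim(pix) <- c(16, 3)                              # one row per pixel (R, G, B)

km <- kmeans(pix, centers = 2)                    # cluster pixels into 2 colors
pix_new <- km$centers[km$cluster, ]               # replace each pixel by its centroid

nrow(unique(pix_new))                             # only 2 distinct colors remain
dim(pix_new) <- c(4, 4, 3)                        # back to image shape
```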

Loading an image

We will load a small 128-by-128 image of a flower.

To read a .png file, we can use the png package and its readPNG() function in R.

To read a .jpeg file, we can use the jpeg package and its readJPEG() function in R.

We are using a PNG image.

library(png)
img <- readPNG("flower_small.png") #Ideally 128 x 128 x 3, but this PNG has a 4th (alpha) channel: 128 x 128 x 4
img <- img[, ,-4] #Drop the alpha channel, which is nothing but 1s

Size of the image

im_dim = dim(img)

Reshape the image into a matrix

We need to reshape the image into a (128 x 128)-by-3 matrix, where 128 x 128 = 16384 is the number of pixels. The matrix has 16384 rows and 3 columns.

The 3 columns are the red, green and blue values of the 16384 pixels.

X = img #We need to retain img for displaying it later

dim(X) <- c(im_dim[1]*im_dim[2], 3)
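Note that assigning to dim() does not copy or reorder any data: R stores arrays column-major, so the reshape only reinterprets the same values. A small sketch of the idea:

```r
a <- array(1:12, dim = c(2, 3, 2))  # a tiny 2x3x2 array
m <- a
dim(m) <- c(6, 2)                   # reinterpret as a 6x2 matrix

m[1, 1]  # 1  -- same first element
m[6, 2]  # 12 -- same last element
stopifnot(all(as.vector(a) == as.vector(m)))
```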

Train the K-Means model on pixels of the image

Initialize cluster number and maximum iteration

Let’s set the number of clusters and the maximum number of iterations.

K = 18 
maxit = 15

Initialize centroids randomly

Let’s use our function initCentroids() to initialize random centroids

init.centroids = initCentroids(X, K)

Train K-Means model

Let’s train our K-means model on the image’s pixel data using our function kMeansModel().

km.model.img = kMeansModel(X, init.centroids, maxit)
## iteration no. 1 
## iteration no. 2 
## iteration no. 3 
## iteration no. 4 
## iteration no. 5 
## iteration no. 6 
## iteration no. 7 
## iteration no. 8 
## iteration no. 9 
## iteration no. 10 
## iteration no. 11 
## iteration no. 12 
## iteration no. 13 
## iteration no. 14 
## iteration no. 15
centroids = km.model.img$centroids                  # Matrix of centroids
id = km.model.img$id                                # cluster number for every record 

Compressing image

Let’s compress the image using our trained k-means model

Replacing pixels with centroids

We need to replace all the pixel RGB values with their respective centroid RGB values

i.e., replace all the records with their respective centroids

X_new = centroids[id,] #id is a vector containing cluster number for each record
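This one-liner works because indexing a matrix with a vector of row numbers repeats rows as often as they appear in the vector (a toy illustration with made-up "colors"):

```r
centroids <- rbind(c(1, 0, 0),  # a "red" centroid
                   c(0, 0, 1))  # a "blue" centroid
id <- c(1, 2, 2, 1, 2)          # cluster of each of 5 toy pixels

X_new <- centroids[id, ]        # 5 rows: each pixel replaced by its centroid
X_new[3, ]                      # c(0, 0, 1) -- the "blue" centroid
```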

Number of colors: original vs compressed

Let’s check the number of colors in the original image. We can use the function unique() in R, which removes all duplicate entries and returns only the unique records (rows).

dim(unique(X))
## [1] 13524     3

Therefore, the original image uses 13524 different combinations of RGB, i.e. 13524 different colors.

dim(unique(X_new))
## [1] 18  3

As expected, the compressed image has only 18 unique colors.

Reshape X_new

To get the image back to its original dimensions, we need to reshape it to 128 x 128 x 3.

dim(X_new) = c(im_dim[1],im_dim[2], 3)

Display the original image vs compressed image

Now let’s display the original and compressed images

par(mfrow=c(1, 2))

#Original image
plot(0, type='n', xlim=0:1, ylim=0:1, main = "Originally 13524 colors")
rasterImage(as.raster(img), 0, 0, 1, 1)

#Compressed image
plot(0, type='n', xlim=0:1, ylim=0:1, main = paste0("Now only ",K," colors"))
rasterImage(as.raster(X_new), 0, 0, 1, 1)

So, basically, we were able to reproduce the image with just 18 colors, as opposed to the 13524 colors of the original image.


Exercise: Use some other image and run the above algorithm.

