Lab_CLUS_2_RM
Asmi Ariv
2022-10-06
Machine Learning - K-means Clustering Algorithm
We have seen how a K-means clustering model is built in R using predefined functions such as “kmeans()”. However, it is also important to know how the machine learning works at the back end: what functions are running inside the main function. We will walk through the various stages of the machine learning process in K-means and create our own machine learning functions step by step:
- Define a function to compute centroids
- Define a function to assign a cluster number to each record
- Define a function to randomly initialize centroids
- Define a function to train the k-means model
- Define a function to plot the progress of the k-means model
- Train the model on data using all the functions
In the video lecture, the algorithm has been explained in detail. Please go through the lecture before starting this lab session for a better understanding.
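Before we build the full versions, the alternation at the heart of k-means (assign each point to its nearest centroid, then recompute each centroid as the mean of its cluster) can be sketched in a few lines of base R. This is a toy illustration on made-up 1-D data; the names X, mu and id here are illustrative, not the ones defined later in the lab.

```r
X  <- c(1, 2, 10, 11)  # four 1-D points forming two obvious clusters
mu <- c(0, 5)          # two initial centroids
for (it in 1:5) {
  # Assignment step: each point goes to its nearest centroid
  id <- apply(abs(outer(X, mu, "-")), 1, which.min)
  # Update step: each centroid becomes the mean of its assigned points
  mu <- tapply(X, id, mean)
}
print(mu)  # settles at 1.5 and 10.5
```

The functions we define below implement exactly these two steps, plus initialization and plotting.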
Centroids computation function
Let’s start with the centroid computation function. This function takes the dataset and the centroid id (cluster number) of each data point, and calculates the mean of all the data points in each cluster, which becomes the centroid of that cluster. It returns a matrix containing the centroids in its rows.
Centroids <- function(X, id, K){
# X = dataset in a matrix, each row is an observation
# id = centroid number/id (1 to k) assigned to each data point
# K = number of centroids
X <- as.matrix(X) #In case the dataset is not in a matrix format
m = nrow(X) #number of records in X
n = ncol(X) #number of dimensions/features in X
#Empty matrix with all 0s for storing centroids in each row
centroids = matrix(rep(0, K*n),K)
for(i in 1:K){
ind = which(id==i) #TRUE for all data points belonging to cluster i
centroids[i,] = colMeans(X[ind, , drop=FALSE]) #Mean value of each dimension of cluster i; drop=FALSE keeps a one-row matrix even if cluster i has a single point
}
centroids
}

Function for assigning a cluster number to each record
This function takes the dataset and the centroids, calculates the squared Euclidean distance of each data point from each centroid, and assigns the observation to the nearest centroid. It returns the vector “id” containing the assigned centroid number for every data point.
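The quadratic form used in the function below, t(x - c) %*% (x - c), is just the squared Euclidean distance sum((x - c)^2). A quick check on two made-up points (x and c1 here are illustrative values, not lab data):

```r
x  <- c(1, 2)
c1 <- c(4, 6)
d_quadform <- as.numeric(t(x - c1) %*% (x - c1))  # 1x1 matrix coerced to a number
d_sum      <- sum((x - c1)^2)
print(d_quadform)  # 25
print(d_sum)       # 25
```

Minimizing the squared distance picks the same nearest centroid as minimizing the distance itself, so taking the square root is unnecessary.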
centroid_id <- function(X, centroids){
# X = a matrix of dataset
# centroids = a matrix containing centroids in rows
X <- as.matrix(X) #In case the dataset is not in a matrix format
centroids <- as.matrix(centroids) #In case the centroids are not in a matrix format
K = nrow(centroids) #Number of centroids
#Empty vector with all 0's for storing assigned cluster number of each data point
id = rep(0,nrow(X))
for(i in 1:nrow(X)){
ind = rep(0,K)
for(j in 1:K){
ind[j] = t(X[i,]-centroids[j,])%*%(X[i,]-centroids[j,]) #Squared Euclidean distance of record i from centroid j
}
id[i] <- which.min(ind)
}
id
}

Function for randomly initializing centroids
This function takes the dataset X and K = number of centroids; it initializes the centroids by randomly selecting K records from the dataset X. It returns a matrix of random centroids.
initCentroids <- function(X, K){
X <- as.matrix(X) #In case the dataset is not in a matrix format
rand_id = sample(nrow(X)) #Random permutation of the row indices
centroids = X[rand_id[1:K], , drop=FALSE] #The first K shuffled rows become the initial centroids
centroids
}

Function to plot k-means model progress
This function takes the dataset X, the current centroids, the previous centroids, the vector id containing the centroid number for every record, the number of centroids K, and the current iteration number. It plots the model progress.
modelProgress <- function(X, centroids, prev_centroids, id, K, i){
# X = a matrix of dataset
# centroids = a matrix containing centroids in rows
# prev_centroids = previous centroids
# id = centroid number/id (1 to k) assigned to each data point
# K = number of centroids
# i = current iteration number
# Create palette
palette = rainbow(K)
colors = palette[id]
#Plot the datapoints
points(X[,1],X[,2],col=colors, lwd=1, pch=1)
#Plot the centroids
points(centroids[,1], centroids[,2],col="black",pch=4,lty=1,lwd=6)
#Keep plotting previous centroids as lines
for(j in 1:nrow(centroids)){
lines(c(centroids[j,1], prev_centroids[j,1]), c(centroids[j,2], prev_centroids[j,2]))
}
}

Function to train the k-means model
This function takes the dataset X, the randomly initialized centroids, the maximum number of iterations, and whether to display the progress of the model (TRUE or FALSE). It trains the k-means model and returns the final centroids and the vector id containing the centroid number for every record in the data.
kMeansModel <- function(X, init.centroids, maxit, model_progress=FALSE){
# X = dataset
# init.centroids = randomly initialized centroids
# maxit = maximum number of iterations
# model_progress = TRUE for plotting model progress
# Uses centroid_id() in a for loop for maxit to assign centroid number to every record
# Uses Centroids() in the same for loop to compute centroids for all clusters
X <- as.matrix(X) #In case the dataset is not in a matrix format
if(model_progress) {
plot(X[,1],X[,2],type="n")
title(main=paste0('Model progress in ', maxit,' Iterations' ))
}
# Initializing variables
m <- nrow(X)
n <- ncol(X)
K <- nrow(init.centroids)
centroids <- init.centroids
prev_centroids <- centroids #To store the previous centroids as the new centroids are computed
id <- rep(0,m) #Empty vector to store centroid/cluster number assigned to every record
#Train K-means
for(i in 1:maxit){
#Display iteration number
cat('iteration no.', i,'\n')
#Every record in X is assigned to the closest centroid
id = centroid_id(X, centroids)
#If true, plot model progress
if (model_progress){
modelProgress(X, centroids, prev_centroids, id, K, i);
prev_centroids = centroids
cat ("Press [enter] to continue")
line <- readline()
}
#compute new centroids based on id returned by centroid_id()
centroids = Centroids(X, id, K)
}
list(centroids=centroids,id=id)
}

Training the K-means model
Now we have all the functions to run the algorithm. Let’s build a k-means model using a dataset.
Simulating dataset for training
Let’s simulate some data points randomly for our model building.
set.seed(2); x = matrix(rnorm(150*2), ncol = 2)
x[1:50, 1] = x[1:50, 1] + 2
x[1:50, 2] = x[1:50, 2] - 1
x[51:100, 1] = x[51:100, 1] - 3
x[51:100, 2] = x[51:100, 2] + 2
x[101:150, 1] = x[101:150, 1] + 6
x[101:150, 2] = x[101:150, 2] - 4

Initializing important values for running K-Means
Let’s set the number of clusters and the maximum number of iterations, and initialize the centroids.
K = 3
maxit = 10
init.centroids = matrix(c(-4, -4, 2, 4, 8, 4), 3, byrow = T)

Train K-Means model
Let’s train the k-means model on the dataset while plotting the model progress.
km.model <- kMeansModel(x, init.centroids, maxit, TRUE)

## iteration no. 1
## Press [enter] to continue
## iteration no. 2
## Press [enter] to continue
## iteration no. 3
## Press [enter] to continue
## iteration no. 4
## Press [enter] to continue
## iteration no. 5
## Press [enter] to continue
## iteration no. 6
## Press [enter] to continue
## iteration no. 7
## Press [enter] to continue
## iteration no. 8
## Press [enter] to continue
## iteration no. 9
## Press [enter] to continue
## iteration no. 10
## Press [enter] to continue

Exercise: Use some other dataset from base R, from the MASS package, from elsewhere, or simulated data, and run the above algorithm in R.
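As a quick cross-check (not part of the lab itself), base R’s kmeans(), started from the same initial centroids on the same simulated data, should converge to essentially the same three centroids as our kMeansModel(). A self-contained sketch:

```r
set.seed(2); x = matrix(rnorm(150*2), ncol = 2)
x[1:50, 1]    = x[1:50, 1] + 2;    x[1:50, 2]    = x[1:50, 2] - 1
x[51:100, 1]  = x[51:100, 1] - 3;  x[51:100, 2]  = x[51:100, 2] + 2
x[101:150, 1] = x[101:150, 1] + 6; x[101:150, 2] = x[101:150, 2] - 4
# Same starting centroids as above, fed to the built-in implementation
km.base <- kmeans(x, centers = matrix(c(-4, -4, 2, 4, 8, 4), 3, byrow = TRUE))
print(round(km.base$centers, 2))  # three rows, one centroid per cluster
print(km.base$size)               # cluster sizes; they sum to 150
```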
An application of K-Means in real life
Let’s go through one real life application of k-means clustering.
We will use k-means clustering to compress an image of size 128-by-128 pixels
We will divide the pixels into different clusters, let’s say 18
We will then compute the centroids of each cluster
Assign pixels to the closest centroids
Replace all the pixels with their respective centroids
Loading an image
We will load a small 128-by-128 image of a flower.
To read a .png file, we can use the png package and its readPNG function in R.
To read a .jpeg file, we can use the jpeg package and its readJPEG function in R.
We are using a PNG image.
library(png)
img <- readPNG("flower_small.png") #Ideally 128-by-128-by-3, but what we get is 128-by-128-by-4
img <- img[, , -4] #Drop the 4th (alpha) channel, which is nothing but 1s

Size of the image

im_dim = dim(img)

Reshape the image into a matrix
We need to reshape the image into a (128x128)-by-3 matrix, where 128x128 = 16384 is the number of pixels: the matrix has 16384 rows and 3 columns.
The 3 columns are the red, green and blue values of the 16384 pixels.
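Assigning to dim() reshapes without moving any values: R keeps the entries in column-major order, so each column of the resulting matrix is one color channel. A toy check with a 2-by-2-by-3 “image” (the array a is made up for illustration):

```r
a <- array(1:12, dim = c(2, 2, 3))  # 2x2 "image" with 3 channels
dim(a) <- c(4, 3)                   # 4 pixels in rows, 3 channels in columns
print(a[1, ])                       # the first pixel's three channel values: 1 5 9
```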
X = img #We need to retain img for displaying it later
dim(X) <- c(im_dim[1]*im_dim[2], 3)

Train the K-Means model on pixels of the image
Initialize cluster number and maximum iteration
Let’s set the number of clusters and the maximum number of iterations.
K = 18
maxit = 15

Initialize centroids randomly
Let’s use our function initCentroids() to initialize random centroids
init.centroids = initCentroids(X, K)

Train K-Means model
Let’s train our k-means model on image pixel data using our function kMeansModel()
km.model.img = kMeansModel(X, init.centroids, maxit)

## iteration no. 1
## iteration no. 2
## iteration no. 3
## iteration no. 4
## iteration no. 5
## iteration no. 6
## iteration no. 7
## iteration no. 8
## iteration no. 9
## iteration no. 10
## iteration no. 11
## iteration no. 12
## iteration no. 13
## iteration no. 14
## iteration no. 15

centroids = km.model.img$centroids # Matrix of centroids
id = km.model.img$id # Cluster number for every record

Compressing image
Let’s compress the image using our trained k-means model
Replacing pixels with centroids
We need to replace all the pixel RGB values with their respective centroid RGB values, i.e., replace all the records with their respective centroids.
X_new = centroids[id,] #id is a vector containing the cluster number for each record

Number of colors: original vs compressed
Let’s check the number of colors in the original image. We can use the function unique() in R, which removes all duplicate entries and returns only the unique records (rows).
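On a matrix, unique() works row-wise, which is why it counts distinct colors here. A tiny illustration with a made-up matrix m:

```r
m <- matrix(c(1, 2,
              1, 2,
              3, 4), ncol = 2, byrow = TRUE)  # second row duplicates the first
print(dim(unique(m)))  # 2 2 : the duplicated row is counted once
```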
dim(unique(X))

## [1] 13524 3

Therefore, the original image used 13524 different combinations of RGB values, i.e., 13524 different colors.
dim(unique(X_new))

## [1] 18 3

As expected, the compressed image has only 18 unique colors.
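A back-of-the-envelope sketch of the savings (counting stored values, not bytes; the actual PNG file size also depends on the encoder): instead of 3 channel values for every pixel, we only need the 18 centroid colors plus one cluster index per pixel.

```r
n_pixels   <- 128 * 128           # 16384 pixels
original   <- n_pixels * 3        # one R, G, B value per pixel
compressed <- 18 * 3 + n_pixels   # 18 centroid colors + one index per pixel
print(original)                   # 49152
print(compressed)                 # 16438
print(original / compressed)      # roughly a 3x reduction
```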
Reshape X_new
To get the image back to its original dimensions, we need to reshape it to 128-by-128-by-3.
dim(X_new) = c(im_dim[1], im_dim[2], 3)

Display the original image vs compressed image
Now let’s display the original and compressed images
par(mfrow=c(1, 2))
#Original image
plot(0, type='n', xlim=0:1, ylim=0:1, main = "Originally 13524 colors")
rasterImage(as.raster(img), 0, 0, 1, 1)
#Compressed image
plot(0, type='n', xlim=0:1, ylim=0:1, main = paste0("Now only ",K," colors"))
rasterImage(as.raster(X_new), 0, 0, 1, 1)

So, basically, we were able to reproduce the image in just 18 colors, as opposed to the 13524 colors of the original image.
Exercise: Use some other image and run the above algorithm.