Clustering in R

In this lab we will go through various clustering (machine learning) methods in R, using different datasets.

Function for Best value of k (number of clusters)

The function takes a dataset and a maximum number of centroids k, and returns an elbow plot for finding the best value of k based on the within-cluster sum of squares (the cost). As the number of centroids increases from 1 to k, the cost drops rapidly at first and then the decline slows at some point, although the cost continues to decrease, giving the curve the appearance of an elbow. That point is the optimum value of k.

bestK = function(X, k){
  
  #bestK plots within sum of squares (cost) for 1 to k centroids; the elbow marks the best k
  
  cost = rep(0, k)
  for (i in 1:k){
    km = kmeans(X, centers = i, nstart = 10)
    cost[i] = km$tot.withinss
  }
  
  plot(1:k, cost, main = "Elbow Method for optimum K", xlab = "Number of Centroids (K)",
       ylab = "Cost (within SS)", col = 1:k, cex = 4, pch = 21, bg = "lightblue", type = "o")
}

K-means clustering

We will use the kmeans() function that comes with base R.

Let’s look at an example of k-means clustering in R using our simulated dataset.

X1 = c(10.1,  8.9, 13.5,  7.8,  9.7, 10.6,  8.4,  9.5, 18.0, 10.2,  5.3, 13.9,  9.0,  9.5,  9.4,  6.9,  6.2,  6.2,  7.1,  9.9, 13.1, 17.4,  9.3, 11.4,  4.4)

X2 = c(1.4, 14.0,  9.3,  6.0, 11.4, 10.8, 11.6,  4.9,  9.9,  3.0, 12.4, 10.0,  5.1, 13.6,  4.7, 10.2,  3.7,  6.3,  3.4,  7.8, 10.1,  5.7,  4.6, 12.5,  5.0)

data = data.frame(X1, X2)

In R, we can use the predict() function to assign clusters based on a kmeans model. However, for this you need to install the package “fdm2id” in R. Once you have installed and loaded this package, the predict() function works for kmeans models as well. But remember that clustering is unsupervised learning, not a classification model, which is supervised learning.
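If the “fdm2id” package is not installed yet, it can be added from CRAN first (a one-time step):

install.packages("fdm2id") #One-time installation from CRAN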

Splitting data into train and test

library(fdm2id)
set.seed(1); train.idx = sample(1:nrow(data), 20, replace=F)
train = data[train.idx,]
test = data[-train.idx,]
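A quick sanity check on the split sizes (20 of the 25 observations go to training, the remaining 5 to testing):

nrow(train) #20
nrow(test)  #5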

Training kmeans on train data

km = kmeans(train, centers = 3, nstart = 10) #Using three centroids
km
## K-means clustering with 3 clusters of sizes 3, 7, 10
## 
## Cluster means:
##          X1        X2
## 1 16.433333  8.533333
## 2  8.471429 12.000000
## 3  8.060000  4.590000
## 
## Clustering vector:
## 25  4  7  1  2 11 14 18 22  5 16 10  6 19 23  9 15 12 17 20 
##  3  3  2  3  2  2  2  3  1  2  2  3  2  3  3  1  3  1  3  3 
## 
## Within cluster sum of squares by cluster:
## [1] 21.85333 31.73429 67.07300
##  (between_SS / total_SS =  76.8 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
## [6] "betweenss"    "size"         "iter"         "ifault"

So, kmeans has created three clusters of sizes 3, 7 and 10. Here, size means the number of data points in each cluster.

Under Clustering vector, the first row shows the index (row number) of each observation and the second row shows the cluster number assigned to that index.
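The individual components listed above can be accessed directly from the fitted object, for example:

km$centers #Coordinates of the three centroids
km$cluster #Cluster assignment for each training observation
km$size    #Number of observations in each cluster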

Plotting clusters

Let’s plot the data points according to their clusters

plot(train, type="n", xlim=c(3,19), xlab="X1", ylab="X2")
text(x=train$X1, y=train$X2, col=km$cluster+1) #Default labels are the row indices, colored by cluster

As we can see, the data points have been segregated into their respective clusters.

Prediction on test data set

pred.test = predict(km, test)
pred.test
##  3  8 13 21 24 
##  1  3  3  1  2
plot(test, type="n", xlim=c(3,19), xlab="X1", ylab="X2")
text(x=test$X1, y=test$X2, col=pred.test+1)

Looking at both plots, we can see that the model has done a good job of predicting the clusters for the test data.
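Under the hood, predicting the cluster of a new observation simply means assigning it to the nearest centroid. Here is a minimal sketch of that idea, using a hypothetical helper function and Euclidean distance (the distance kmeans itself uses):

#Hypothetical helper: assign each row of newdata to its nearest centroid
nearest.centroid = function(km, newdata){
  k = nrow(km$centers)
  d = as.matrix(dist(rbind(km$centers, newdata))) #All pairwise distances
  apply(d[-(1:k), 1:k, drop=FALSE], 1, which.min) #Index of nearest centroid per new row
}
nearest.centroid(km, test) #Should agree with predict(km, test)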

Best value of k

Let’s find out the best k for our dataset

bestK(train, k=10)

As per our elbow plot, the optimum value of k is 3, the point at which the decline in cost slows down. This is also what we chose in our kmeans model, so we don’t need to rebuild it.

IRIS data set

Let’s build a kmeans clustering model on the iris data and see how our model does, since this dataset also has labels. We will exclude the Species variable while building the model.

There is a function splitdata() in the package “fdm2id” that helps in splitting a dataset into train and test sets.

require(datasets) #require() is another way of loading a package in R
data(iris)
m = nrow(iris)
d = splitdata(iris, target = 5, size = floor(0.8*m), seed = 1) #target is the column number of the class variable
                                                               #by default it splits data 70-30; we have set it to 80-20
str (d)
## List of 4
##  $ train.x:'data.frame': 120 obs. of  4 variables:
##   ..$ Sepal.Length: num [1:120] 5.8 6.4 4.4 4.3 7 5.4 5.4 7.6 6.1 4.6 ...
##   ..$ Sepal.Width : num [1:120] 2.7 2.8 3.2 3 3.2 3 3.4 3 2.8 3.4 ...
##   ..$ Petal.Length: num [1:120] 4.1 5.6 1.3 1.1 4.7 4.5 1.7 6.6 4.7 1.4 ...
##   ..$ Petal.Width : num [1:120] 1 2.1 0.2 0.1 1.4 1.5 0.2 2.1 1.2 0.3 ...
##  $ train.y: Factor w/ 3 levels "setosa","versicolor",..: 2 3 1 1 2 2 1 3 2 1 ...
##  $ test.x :'data.frame': 30 obs. of  4 variables:
##   ..$ Sepal.Length: num [1:30] 4.6 5 5 4.4 5.4 5.7 4.7 5 4.8 5.1 ...
##   ..$ Sepal.Width : num [1:30] 3.1 3.6 3.4 2.9 3.7 4.4 3.2 3.2 3 3.8 ...
##   ..$ Petal.Length: num [1:30] 1.5 1.4 1.5 1.4 1.5 1.5 1.6 1.2 1.4 1.6 ...
##   ..$ Petal.Width : num [1:30] 0.2 0.2 0.2 0.2 0.2 0.4 0.2 0.2 0.3 0.2 ...
##  $ test.y : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
##  - attr(*, "class")= chr "dataset"

Training kmeans on IRIS

We will use the KMEANS() function from the package “fdm2id” (notice that the function name is in capital letters)

km.iris = KMEANS(d$train.x, k = 3)
km.iris
## K-means clustering with 3 clusters of sizes 33, 48, 39
## 
## Cluster means:
##   Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1     6.896970    3.100000     5.751515   2.0969697
## 2     5.943750    2.733333     4.468750   1.4354167
## 3     5.007692    3.420513     1.461538   0.2512821
## 
## Clustering vector:
##  68 129  43  14  51  85  21 106  74   7  73  79  37 105 110  34 143 126  89  33 
##   2   1   3   3   2   2   3   1   2   3   2   2   3   1   1   3   2   1   2   3 
##  84  70 142  42  38 111  20  28 124  44  87 149  40 121  25 119  39 146 127   6 
##   2   2   1   3   3   1   3   3   2   3   2   1   3   1   3   1   3   1   2   3 
##  24  32 147   2  45  18  22  78 102  65 115 120 100  75  81  13 118 132  48  93 
##   3   3   2   3   3   3   3   1   2   2   2   2   2   2   2   3   1   1   3   2 
##  23 130  29  95 104 123  92 131 134 144  31  17 140  91  64  60 113 135  10   1 
##   3   1   3   2   1   1   2   1   2   1   3   3   1   2   2   2   1   2   3   3 
## 148  59  26  15  58  88 136 112  77  53  12 114  76  61 145  86  94  83  19 150 
##   1   2   3   3   2   2   1   1   2   1   3   2   2   2   1   2   2   2   3   2 
##  35  98  71 101 108  55 125  56  41 138   3  82  50 141 133  27  63 122 116  66 
##   3   2   2   1   1   2   1   2   3   1   3   2   3   1   1   3   2   2   1   2 
## 
## Within cluster sum of squares by cluster:
## [1] 21.18182 31.95771 11.64103
##  (between_SS / total_SS =  88.5 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
## [6] "betweenss"    "size"         "iter"         "ifault"
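Since we kept the training labels in d$train.y, we can cross-tabulate the cluster assignments against the true species. Cluster numbers are arbitrary, so we simply look for one dominant species per cluster:

table(km.iris$cluster, d$train.y)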

Prediction on test data

pred.test = predict (km.iris, d$test.x)
pred.test
##   4   5   8   9  11  16  30  36  46  47  49  52  54  57  62  67  69  72  80  90 
##   3   3   3   3   3   3   3   3   3   3   3   2   2   2   2   2   2   2   2   2 
##  96  97  99 103 107 109 117 128 137 139 
##   2   2   2   1   2   1   1   2   1   2
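Again, because iris has true labels, we can compare the predicted clusters with the species in the test set:

table(pred.test, d$test.y)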

Plotting kmeans clusters of IRIS data

plot(d$train.x$Sepal.Length, d$train.x$Sepal.Width, type="n", xlab="Sepal.Length", ylab="Sepal.Width")

text(x=d$train.x$Sepal.Length, y=d$train.x$Sepal.Width, labels=d$train.y,col=km.iris$cluster+1)

As we can see from this 2-dimensional plot, one cluster (representing setosa) is quite separate from the other two clusters, which are very close to each other. In fact, some members of the two clusters appear to merge together, but this could be because we are plotting 4-dimensional data in just two dimensions.

plot(d$test.x$Sepal.Length, d$test.x$Sepal.Width, type="n", xlab="Sepal.Length", ylab="Sepal.Width")

text(x=d$test.x$Sepal.Length, y=d$test.x$Sepal.Width, labels=d$test.y,col=pred.test+1)

The pattern is the same as for the training data.

Mixtures of Normal Distributions Clustering

As we have discussed in the lecture, kmeans performs hard clustering, which may not always give the best result. Hence, we need an algorithm that performs soft clustering.

Let’s build a clustering model based on soft clustering, using a Gaussian Mixture Model. We will use the function mvnormalmixEM() from the package mixtools.

We will use the IRIS data

library(mixtools)
mix.model <- mvnormalmixEM(iris[,-5],arbvar=TRUE,k=3,epsilon=1e-02)
## number of iterations= 27
names(mix.model)
## [1] "x"          "lambda"     "mu"         "sigma"      "loglik"    
## [6] "posterior"  "all.loglik" "restarts"   "ft"

Here, the posterior component is a matrix of posterior probabilities for the observations, i.e., the probability of an observation belonging to each cluster. If there are k clusters, then the matrix has k columns, each representing one cluster. In our case, there are 3 posterior probabilities for each observation.

We will round the values to 3 digits to get a better view of which cluster gets the highest probability for each observation.

prob1=round(mix.model$posterior[,1],digits=3)
prob2=round(mix.model$posterior[,2],digits=3)
prob3=round(mix.model$posterior[,3],digits=3)

prob1[1:5]
## [1] 1 1 1 1 1
prob2[1:5]
## [1] 0 0 0 0 0
prob3[1:5]
## [1] 0 0 0 0 0

Every observation has a certain probability of belonging to each cluster (that’s the purpose of soft clustering).

Let’s plot the data based on their posterior probabilities.

plot(iris$Sepal.Length, iris$Sepal.Width, type="n", xlab="Sepal.Length", ylab="Sepal.Width")

text(x=iris$Sepal.Length, y=iris$Sepal.Width, labels=iris[,5],col=ifelse(prob1>prob2 & prob1>prob3, "red", ifelse(prob2>prob3, "blue", "orange")))

As we can see from the above plot, the observations have been assigned different probabilities of belonging to each cluster. Our plot is again in 2D while we are dealing with 4-dimensional data.

We cannot use predict() function for this method here.
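Although predict() is unavailable, we can still convert the soft assignments into hard cluster labels by taking, for each observation, the cluster with the highest posterior probability:

hard.clus = apply(mix.model$posterior, 1, which.max) #Most probable cluster per observation
table(hard.clus, iris$Species) #Compare against the true species labels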

K-medoids

K-medoids is a partition-based variant of k-means, but instead of using the mean of the observations in a cluster as its centroid, it uses an actual observation (the medoid) from among the data points. This makes it more robust to outliers than k-means.

We will use the cluster package and its pam() function

library(cluster)

pam.model = pam(iris[,-5], k=3)

pam.model
## Medoids:
##       ID Sepal.Length Sepal.Width Petal.Length Petal.Width
## [1,]   8          5.0         3.4          1.5         0.2
## [2,]  79          6.0         2.9          4.5         1.5
## [3,] 113          6.8         3.0          5.5         2.1
## Clustering vector:
##   [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
##  [38] 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 3 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
##  [75] 2 2 2 3 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 2 3 3 3 3 2 3 3 3 3
## [112] 3 3 2 2 3 3 3 3 2 3 2 3 2 3 3 2 2 3 3 3 3 3 2 3 3 3 3 2 3 3 3 2 3 3 3 2 3
## [149] 3 2
## Objective function:
##     build      swap 
## 0.6709391 0.6542077 
## 
## Available components:
##  [1] "medoids"    "id.med"     "clustering" "objective"  "isolation" 
##  [6] "clusinfo"   "silinfo"    "diss"       "call"       "data"

Clusters

pam.model$clustering
##   [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
##  [38] 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 3 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
##  [75] 2 2 2 3 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 2 3 3 3 3 2 3 3 3 3
## [112] 3 3 2 2 3 3 3 3 2 3 2 3 2 3 3 2 2 3 3 3 3 3 2 3 3 3 3 2 3 3 3 2 3 3 3 2 3
## [149] 3 2

Plotting data into clusters

plot(iris$Sepal.Length, iris$Sepal.Width, type="n", xlab="Sepal.Length", ylab="Sepal.Width")

text(x=iris$Sepal.Length, y=iris$Sepal.Width, labels=iris[,5],col= pam.model$clustering+1)
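pam() also computes silhouette information (the silinfo component listed above), which gives another way to judge cluster quality; average widths closer to 1 indicate well-separated clusters:

pam.model$silinfo$avg.width #Average silhouette width over all observations
plot(silhouette(pam.model)) #Silhouette plot, one bar per observation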

Hierarchical Clustering

Let’s use the hierarchical clustering method to build clusters from the data.

We will use the agnes() function from the “cluster” package.

library(cluster)

hi.model = agnes(iris[,-5], diss=FALSE, metric="euclidean", method="complete") #method can also be "single" or "average"

hi.model
## Call:     agnes(x = iris[, -5], diss = FALSE, metric = "euclidean", method = "complete") 
## Agglomerative coefficient:  0.9574622 
## Order of objects:
##   [1]   1  18  41   8  40  50  28  29   5  38  12  30  31  25   2  46  13  10
##  [19]  35  26  36   3   4  48  43   7   9  39  14  23  42   6  19  11  49  37
##  [37]  21  32  20  22  47  45  24  27  44  15  16  17  33  34  54  90  70  81
##  [55]  82  60  65  80  63  56  91  67  85  62  72  68  83  93  89  96  97  95
##  [73] 100 107  58  94  99  61  51  53  87  78  55  59  77  66  76  52  57  86
##  [91]  64  92  79  74  75  98  69  88 120  71 128 139 150 102 143 114 122 115
## [109]  73  84 134 124 127 112 147 135 101 137 149 121 144 125 141 145 104 117
## [127] 138 105 129 133 109 111 148 116 113 140 142 146 103 126 130 108 131 106
## [145] 123 119 110 136 118 132
## Height (summary):
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.2449  0.3742  0.5865  0.6481  7.0852 
## 
## Available components:
## [1] "order"  "height" "ac"     "merge"  "diss"   "call"   "method" "data"

Plotting Hierarchical Clustering

plot(hi.model)

This plot shows that the clusters have been built using the agglomerative (bottom-up) method. Each data point is considered a cluster at the beginning (at the bottom) and clusters are then joined together pairwise depending on their similarities (we have used Euclidean distance). Finally, one big cluster is formed at the end (at the top).

Required cluster

To get the required number of clusters, we use another function, cutree()

clus = cutree(hi.model, 3) #3 is number of clusters
clus
##   [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
##  [38] 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 3 2 3 2 3 2 3 3 3 3 2 3 2 3 3 2 3 2 3 2 2
##  [75] 2 2 2 2 2 3 3 3 3 2 3 2 2 2 3 3 3 2 3 3 3 3 3 2 3 3 2 2 2 2 2 2 3 2 2 2 2
## [112] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [149] 2 2

Plotting data into clusters

plot(iris$Sepal.Length, iris$Sepal.Width, type="n", xlab="Sepal.Length", ylab="Sepal.Width")

text(x=iris$Sepal.Length, y=iris$Sepal.Width, labels=iris[,5],col= clus+1)
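We can also visualize where the tree is cut by converting the agnes result to an hclust object and outlining the three clusters on the dendrogram (a small illustrative sketch):

hc = as.hclust(hi.model) #Convert the agnes result to an hclust object
plot(hc, labels=FALSE, hang=-1)
rect.hclust(hc, k=3, border=2:4) #Draw boxes around the three clusters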

Weighted kernel version of K-means

We can also build clusters based on kernel functions (which we have seen in SVM). We will use the kkmeans() function from the package “kernlab”.

This function accepts a matrix instead of a data frame.

require(kernlab)
kkm = kkmeans(as.matrix(iris[,-5]), centers=3)
## Using automatic sigma estimation (sigest) for RBF or laplace kernel
kkm
## Spectral Clustering object of class "specc" 
## 
##  Cluster memberships: 
##  
## 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 3 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 2 3 3 3 3 2 3 3 3 3 3 3 2 3 3 3 3 3 2 3 2 3 2 3 3 2 2 3 3 3 3 3 2 3 3 3 3 2 3 3 3 2 3 3 3 2 3 3 2 
##  
## Gaussian Radial Basis kernel function. 
##  Hyperparameter : sigma =  0.782254497703445 
## 
## Centers:  
##          [,1]     [,2]     [,3]     [,4]
## [1,] 5.006000 3.428000 1.462000 0.246000
## [2,] 5.903279 2.747541 4.381967 1.418033
## [3,] 6.823077 3.066667 5.725641 2.079487
## 
## Cluster size:  
## [1] 50 61 39
## 
## Within-cluster sum of squares:  
## [1] 1313.240 1424.553 1166.031

Plotting data into clusters

plot(iris$Sepal.Length, iris$Sepal.Width, type="n", xlab="Sepal.Length", ylab="Sepal.Width")

text(x=iris$Sepal.Length, y=iris$Sepal.Width, labels=iris[,5],col= kkm+1)
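If we want the cluster memberships as a plain vector (for tables or further analysis), note that the kkm object itself behaves like a vector of memberships; its .Data slot holds the assignments:

clusters = kkm@.Data #Cluster membership vector
table(clusters, iris$Species) #Cross-tabulate against the true species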

Exercise: Try different datasets and build clustering models using all the methods above. 

