Lab_CLUS_1_RM
Asmi Ariv
2022-10-06
Clustering in R
In this lab we will work through several clustering methods in R, using different datasets.
Function for Best value of k (number of clusters)
The function takes a dataset and a maximum number of clusters k, and produces an elbow plot for choosing the best value of k based on the within-cluster sum of squares (WSS) cost. As the number of centroids increases from 1 to k, the cost first drops rapidly and then the decline slows down, though the cost keeps decreasing, giving the curve the appearance of an elbow. The point where the decline slows is the optimum value of k.
bestK = function(X, k){
  # bestK plots within-cluster sum of squares against the number of clusters
  cost = rep(0, k)
  for (i in 1:k){
    km = kmeans(X, centers = i, nstart = 10)
    cost[i] = km$tot.withinss
  }
  # pch = 21 so that bg fills the points; without it, bg is silently ignored
  plot(1:k, cost, main = "Elbow Method for optimum K", xlab = "Number of Centroids (K)",
       ylab = "Cost (within SS)", col = 1:k, cex = 4, pch = 21, bg = "lightblue", type = "o")
}
K-means clustering
We will use the kmeans() function that comes with base R.
Let’s look at an example of k-means clustering in R using our simulated dataset.
X1 = c(10.1, 8.9, 13.5, 7.8, 9.7, 10.6, 8.4, 9.5, 18.0, 10.2, 5.3, 13.9, 9.0, 9.5, 9.4, 6.9, 6.2, 6.2, 7.1, 9.9, 13.1, 17.4, 9.3, 11.4, 4.4)
X2 = c(1.4, 14.0, 9.3, 6.0, 11.4, 10.8, 11.6, 4.9, 9.9, 3.0, 12.4, 10.0, 5.1, 13.6, 4.7, 10.2, 3.7, 6.3, 3.4, 7.8, 10.1, 5.7, 4.6, 12.5, 5.0)
data = data.frame(X1, X2)
In R, we can use the predict() function to assign new observations to clusters learned by kmeans. However, for this you need to install the package "fdm2id". Once you have installed and loaded this package, predict() works for a kmeans model as well. But remember that clustering is unsupervised learning, not a classification model, which is supervised learning.
Splitting data into train and test
library(fdm2id)
set.seed(1)
train.idx = sample(1:nrow(data), 20, replace = FALSE)
train = data[train.idx, ]
test = data[-train.idx, ]
Training kmeans on train data
km = kmeans(train, centers = 3, nstart = 10) #Using three centroids
km
## K-means clustering with 3 clusters of sizes 3, 7, 10
##
## Cluster means:
## X1 X2
## 1 16.433333 8.533333
## 2 8.471429 12.000000
## 3 8.060000 4.590000
##
## Clustering vector:
## 25 4 7 1 2 11 14 18 22 5 16 10 6 19 23 9 15 12 17 20
## 3 3 2 3 2 2 2 3 1 2 2 3 2 3 3 1 3 1 3 3
##
## Within cluster sum of squares by cluster:
## [1] 21.85333 31.73429 67.07300
## (between_SS / total_SS = 76.8 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
## [6] "betweenss" "size" "iter" "ifault"So, kmeans has created three clusters of sizes 3, 7 and 10. Here size means number of data points in each cluster.
Under the clustering vector, the first row shows the index of each observation and the second row shows the cluster number assigned to it.
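The components of the fitted object can also be inspected directly; for instance, the cluster sizes and the centroid coordinates:
table(km$cluster)  # number of points in each cluster, matching the sizes reported above
km$centers         # coordinates of the three centroids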
Plotting clusters
Let’s plot the data points according to their clusters
plot(train, type="n", xlim=c(3,19), xlab="X1", ylab="X2",)
text(x=train$X1, y=train$X2, col=km$cluster+1)As we can see that the data points have been segregated into their respective clusters.
Prediction on test data set
pred.test = predict(km, test)
pred.test
## 3 8 13 21 24
## 1 3 3 1 2
plot(test, type = "n", xlim = c(3, 19), xlab = "X1", ylab = "X2")
text(x = test$X1, y = test$X2, col = pred.test + 1)
Looking at both plots, we can see that the model has done a good job of predicting the clusters for the test data.
Best value of k
Let’s find out best K for our data set
bestK(train, k = 10)
As per the elbow plot, the optimum value of k is 3: that is where the decline in cost slows down. This is also what we chose in our kmeans model, so we don't need to rebuild it.
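The elbow can also be read off numerically rather than visually; a small sketch reusing the same loop as bestK():
set.seed(1)  # for reproducible kmeans starts
cost = sapply(1:10, function(i) kmeans(train, centers = i, nstart = 10)$tot.withinss)
round(-diff(cost), 1)  # successive decreases in cost; the elbow is where they flatten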
IRIS data set
Let’s build a kmeans clustering model on iris data and see how our model does, as this dataset also has labels. We will exclude the species variable while building the model.
There is a function splitdata() in the package "fdm2id" that helps split a dataset into train and test sets.
require (datasets) #require() is another way of loading a package in R
data (iris)
m= nrow(iris)
d = splitdata (iris, target=5, size = floor(0.8*m), seed = 1) #Here, target is column number of Class variable
#by default it splits data into 70-30 ratio, we have set it at 80-20
str(d)
## List of 4
## $ train.x:'data.frame': 120 obs. of 4 variables:
## ..$ Sepal.Length: num [1:120] 5.8 6.4 4.4 4.3 7 5.4 5.4 7.6 6.1 4.6 ...
## ..$ Sepal.Width : num [1:120] 2.7 2.8 3.2 3 3.2 3 3.4 3 2.8 3.4 ...
## ..$ Petal.Length: num [1:120] 4.1 5.6 1.3 1.1 4.7 4.5 1.7 6.6 4.7 1.4 ...
## ..$ Petal.Width : num [1:120] 1 2.1 0.2 0.1 1.4 1.5 0.2 2.1 1.2 0.3 ...
## $ train.y: Factor w/ 3 levels "setosa","versicolor",..: 2 3 1 1 2 2 1 3 2 1 ...
## $ test.x :'data.frame': 30 obs. of 4 variables:
## ..$ Sepal.Length: num [1:30] 4.6 5 5 4.4 5.4 5.7 4.7 5 4.8 5.1 ...
## ..$ Sepal.Width : num [1:30] 3.1 3.6 3.4 2.9 3.7 4.4 3.2 3.2 3 3.8 ...
## ..$ Petal.Length: num [1:30] 1.5 1.4 1.5 1.4 1.5 1.5 1.6 1.2 1.4 1.6 ...
## ..$ Petal.Width : num [1:30] 0.2 0.2 0.2 0.2 0.2 0.4 0.2 0.2 0.3 0.2 ...
## $ test.y : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
## - attr(*, "class")= chr "dataset"Training kmeans of IRIS
We will use the KMEANS() function from the package "fdm2id" (notice that the function name is in capital letters).
km.iris = KMEANS(d$train.x, k = 3)
km.iris
## K-means clustering with 3 clusters of sizes 33, 48, 39
##
## Cluster means:
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1 6.896970 3.100000 5.751515 2.0969697
## 2 5.943750 2.733333 4.468750 1.4354167
## 3 5.007692 3.420513 1.461538 0.2512821
##
## Clustering vector:
## 68 129 43 14 51 85 21 106 74 7 73 79 37 105 110 34 143 126 89 33
## 2 1 3 3 2 2 3 1 2 3 2 2 3 1 1 3 2 1 2 3
## 84 70 142 42 38 111 20 28 124 44 87 149 40 121 25 119 39 146 127 6
## 2 2 1 3 3 1 3 3 2 3 2 1 3 1 3 1 3 1 2 3
## 24 32 147 2 45 18 22 78 102 65 115 120 100 75 81 13 118 132 48 93
## 3 3 2 3 3 3 3 1 2 2 2 2 2 2 2 3 1 1 3 2
## 23 130 29 95 104 123 92 131 134 144 31 17 140 91 64 60 113 135 10 1
## 3 1 3 2 1 1 2 1 2 1 3 3 1 2 2 2 1 2 3 3
## 148 59 26 15 58 88 136 112 77 53 12 114 76 61 145 86 94 83 19 150
## 1 2 3 3 2 2 1 1 2 1 3 2 2 2 1 2 2 2 3 2
## 35 98 71 101 108 55 125 56 41 138 3 82 50 141 133 27 63 122 116 66
## 3 2 2 1 1 2 1 2 3 1 3 2 3 1 1 3 2 2 1 2
##
## Within cluster sum of squares by cluster:
## [1] 21.18182 31.95771 11.64103
## (between_SS / total_SS = 88.5 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
## [6] "betweenss" "size" "iter" "ifault"Prediction on test data
pred.test = predict (km.iris, d$test.x)
pred.test
## 4 5 8 9 11 16 30 36 46 47 49 52 54 57 62 67 69 72 80 90
## 3 3 3 3 3 3 3 3 3 3 3 2 2 2 2 2 2 2 2 2
## 96 97 99 103 107 109 117 128 137 139
## 2 2 2 1 2 1 1 2 1 2
Plotting kmeans clusters of IRIS data
plot(d$train.x$Sepal.Length, d$train.x$Sepal.Width, type = "n", xlab = "Sepal.Length", ylab = "Sepal.Width")
text(x = d$train.x$Sepal.Length, y = d$train.x$Sepal.Width, labels = d$train.y, col = km.iris$cluster + 1)
As we can see from this two-dimensional plot, one cluster (representing setosa) is well separated from the other two, which lie very close to each other. In fact, some members of the two clusters appear to merge, but this could be because we are plotting 4-dimensional data in two dimensions.
plot(d$test.x$Sepal.Length, d$test.x$Sepal.Width, type = "n", xlab = "Sepal.Length", ylab = "Sepal.Width")
text(x = d$test.x$Sepal.Length, y = d$test.x$Sepal.Width, labels = d$test.y, col = pred.test + 1)
The pattern is the same as for the training data.
Mixtures of Normal Distributions Clustering
As we discussed in the lecture, kmeans performs hard clustering, which may not give the best result. Hence, we need an algorithm that performs soft clustering.
Let's build a clustering model based on soft clustering using a Gaussian Mixture Model. We will use the function mvnormalmixEM() from the package mixtools.
We will use the IRIS data
library(mixtools)
mix.model <- mvnormalmixEM(iris[,-5], arbvar = TRUE, k = 3, epsilon = 1e-02)
## number of iterations= 27
names(mix.model)
## [1] "x" "lambda" "mu" "sigma" "loglik"
## [6] "posterior" "all.loglik" "restarts" "ft"
Here, the posterior component is a matrix of posterior probabilities, i.e., the probability of each observation belonging to each cluster. If there are k clusters, the matrix has k columns, each representing one cluster. In our case, there are 3 posterior probabilities for each observation.
We will round the values to 3 digits, to get a better view of which cluster gets the higher probability for each observation.
prob1=round(mix.model$posterior[,1],digits=3)
prob2=round(mix.model$posterior[,2],digits=3)
prob3=round(mix.model$posterior[,3],digits=3)
prob1[1:5]
## [1] 1 1 1 1 1
prob2[1:5]
## [1] 0 0 0 0 0
prob3[1:5]
## [1] 0 0 0 0 0
Every observation has a certain probability of belonging to each cluster (that's the purpose of soft clustering).
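If hard labels are needed from the soft assignments, a common approach is to take, for each observation, the cluster with the highest posterior probability; a minimal sketch (again, cluster numbering is arbitrary):
gmm.cluster = apply(mix.model$posterior, 1, which.max)  # most probable cluster per row
table(gmm.cluster, iris$Species)                        # compare against the true labels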
Let’s plot the data based on their posterior probabilities.
plot(iris$Sepal.Length, iris$Sepal.Width, type = "n", xlab = "Sepal.Length", ylab = "Sepal.Width")
text(x = iris$Sepal.Length, y = iris$Sepal.Width, labels = iris[,5], col = ifelse(prob1 > prob2 & prob1 > prob3, "red", ifelse(prob2 > prob3, "blue", "orange")))
As we can see from the plot above, each observation has been coloured by the cluster to which it has the highest probability of belonging. Our plot is again in 2d while we are dealing with 4d data.
We cannot use the predict() function with this method.
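However, the posterior probability of any point follows from Bayes' rule using the fitted mixing proportions (lambda), means (mu) and covariances (sigma). A minimal sketch, assuming the mvtnorm package is available for the multivariate normal density:
library(mvtnorm)            # assumed available; provides dmvnorm()
x = unlist(iris[1, -5])     # score one observation (a training point here, for checking)
dens = sapply(1:3, function(j)
  mix.model$lambda[j] * dmvnorm(x, mean = mix.model$mu[[j]], sigma = mix.model$sigma[[j]]))
round(dens / sum(dens), 3)  # should match mix.model$posterior[1, ] up to rounding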
K-medoids
K-medoids is a partition-based variant of k-means, but instead of using the mean of the observations in a cluster as its center, it uses an actual observation (the medoid). This makes it more robust to outliers than k-means, as the sketch below illustrates.
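A minimal sketch of that robustness on our simulated data, using a single hypothetical outlier and one cluster so the effect is easy to see:
library(cluster)                   # for pam()
out = rbind(data, c(100, 100))     # append one extreme (hypothetical) outlier
kmeans(out, centers = 1)$centers   # the mean-based center is dragged toward the outlier
pam(out, k = 1)$medoids            # the medoid remains an actual, central observation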
We will use the pam() function from the cluster package.
library(cluster)
pam.model = pam(iris[,-5], k=3)
pam.model
## Medoids:
## ID Sepal.Length Sepal.Width Petal.Length Petal.Width
## [1,] 8 5.0 3.4 1.5 0.2
## [2,] 79 6.0 2.9 4.5 1.5
## [3,] 113 6.8 3.0 5.5 2.1
## Clustering vector:
## [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [38] 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 3 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [75] 2 2 2 3 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 2 3 3 3 3 2 3 3 3 3
## [112] 3 3 2 2 3 3 3 3 2 3 2 3 2 3 3 2 2 3 3 3 3 3 2 3 3 3 3 2 3 3 3 2 3 3 3 2 3
## [149] 3 2
## Objective function:
## build swap
## 0.6709391 0.6542077
##
## Available components:
## [1] "medoids" "id.med" "clustering" "objective" "isolation"
## [6] "clusinfo" "silinfo" "diss" "call" "data"Clusters
pam.model$clustering
## [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [38] 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 3 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [75] 2 2 2 3 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 2 3 3 3 3 2 3 3 3 3
## [112] 3 3 2 2 3 3 3 3 2 3 2 3 2 3 3 2 2 3 3 3 3 3 2 3 3 3 3 2 3 3 3 2 3 3 3 2 3
## [149] 3 2
Plotting data into clusters
plot(iris$Sepal.Length, iris$Sepal.Width, type = "n", xlab = "Sepal.Length", ylab = "Sepal.Width")
text(x = iris$Sepal.Length, y = iris$Sepal.Width, labels = iris[,5], col = pam.model$clustering + 1)
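pam() also stores silhouette information, which summarises how well each point sits in its cluster; a minimal sketch using the silhouette() function from the same cluster package:
sil = silhouette(pam.model)  # pam objects carry the information silhouette() needs
summary(sil)$avg.width       # average silhouette width; closer to 1 is better
plot(sil, main = "Silhouette plot for k-medoids on iris")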
Hierarchical Clustering
Let's use the hierarchical clustering method to build clusters on the data.
We will use agnes() function from the “cluster” package.
library(cluster)
hi.model = agnes(iris[,-5], diss = FALSE, metric = "euclidean", method = "complete") # method can also be "single" or "average"
hi.model
## Call: agnes(x = iris[, -5], diss = FALSE, metric = "euclidean", method = "complete")
## Agglomerative coefficient: 0.9574622
## Order of objects:
## [1] 1 18 41 8 40 50 28 29 5 38 12 30 31 25 2 46 13 10
## [19] 35 26 36 3 4 48 43 7 9 39 14 23 42 6 19 11 49 37
## [37] 21 32 20 22 47 45 24 27 44 15 16 17 33 34 54 90 70 81
## [55] 82 60 65 80 63 56 91 67 85 62 72 68 83 93 89 96 97 95
## [73] 100 107 58 94 99 61 51 53 87 78 55 59 77 66 76 52 57 86
## [91] 64 92 79 74 75 98 69 88 120 71 128 139 150 102 143 114 122 115
## [109] 73 84 134 124 127 112 147 135 101 137 149 121 144 125 141 145 104 117
## [127] 138 105 129 133 109 111 148 116 113 140 142 146 103 126 130 108 131 106
## [145] 123 119 110 136 118 132
## Height (summary):
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.2449 0.3742 0.5865 0.6481 7.0852
##
## Available components:
## [1] "order" "height" "ac" "merge" "diss" "call" "method" "data"Plotting Hierarchical Clustering
plot(hi.model)
This plot shows that the clusters have been built using the agglomerative (bottom-up) method. Each data point is treated as its own cluster at the beginning (at the bottom), and clusters are then joined pairwise depending on their similarities (we have used euclidean distance), until one big cluster is formed at the end (at the top).
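To see where a three-cluster cut falls on the tree, the agnes object can be converted to an hclust object and the cut outlined; a minimal sketch using base R's rect.hclust():
hc = as.hclust(hi.model)   # the cluster package provides as.hclust() for agnes objects
plot(hc, labels = FALSE, main = "Complete-linkage dendrogram, k = 3 cut")
rect.hclust(hc, k = 3, border = 2:4)  # draw a box around each of the 3 clusters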
Required cluster
To get the required cluster, we use another function cutree()
clus = cutree(hi.model, 3) #3 is number of clusters
clus
## [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [38] 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 3 2 3 2 3 2 3 3 3 3 2 3 2 3 3 2 3 2 3 2 2
## [75] 2 2 2 2 2 3 3 3 3 2 3 2 2 2 3 3 3 2 3 3 3 3 3 2 3 3 2 2 2 2 2 2 3 2 2 2 2
## [112] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [149] 2 2
Plotting data into clusters
plot(iris$Sepal.Length, iris$Sepal.Width, type = "n", xlab = "Sepal.Length", ylab = "Sepal.Width")
text(x = iris$Sepal.Length, y = iris$Sepal.Width, labels = iris[,5], col = clus + 1)
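Again, because iris has true labels, a contingency table gives a quick read on how well the k = 3 hierarchical cut recovers the species:
table(clus, iris$Species)  # rows: hierarchical cluster, columns: true species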
Weighted kernel version of K-means
We can also build clusters based on kernel functions (as we have seen in SVM). We will use the kkmeans() function from the package "kernlab".
This function accepts a matrix instead of a data frame.
require(kernlab)
kkm = kkmeans(as.matrix(iris[,-5]), centers = 3)
## Using automatic sigma estimation (sigest) for RBF or laplace kernel
kkm
## Spectral Clustering object of class "specc"
##
## Cluster memberships:
##
## 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 3 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 2 3 3 3 3 2 3 3 3 3 3 3 2 3 3 3 3 3 2 3 2 3 2 3 3 2 2 3 3 3 3 3 2 3 3 3 3 2 3 3 3 2 3 3 3 2 3 3 2
##
## Gaussian Radial Basis kernel function.
## Hyperparameter : sigma = 0.782254497703445
##
## Centers:
## [,1] [,2] [,3] [,4]
## [1,] 5.006000 3.428000 1.462000 0.246000
## [2,] 5.903279 2.747541 4.381967 1.418033
## [3,] 6.823077 3.066667 5.725641 2.079487
##
## Cluster size:
## [1] 50 61 39
##
## Within-cluster sum of squares:
## [1] 1313.240 1424.553 1166.031
Plotting data into clusters
plot(iris$Sepal.Length, iris$Sepal.Width, type = "n", xlab = "Sepal.Length", ylab = "Sepal.Width")
text(x = iris$Sepal.Length, y = iris$Sepal.Width, labels = iris[,5], col = kkm + 1)
Exercise: Try different datasets and build clustering models using all the methods above.