Lab_PCA_1_RM
Asmi Ariv
2022-10-07
Machine Learning - Principal Component Analysis (PCA)
PCA is an unsupervised machine learning technique whose main objective is to project the data into a space of lower dimension than the original, so that its behaviour can be analysed. This is achieved by rotating the axes to align with the directions in which the variability of the data is maximum.
When the number of dimensions is very large, the data suffers from what is called the "curse of dimensionality", which makes it difficult to analyse. With many dimensions the data becomes quite sparse, and countering this may require a massive amount of data, which is often not available. In such cases the number of features/dimensions is much larger than the number of records, and as the space (number of dimensions) grows, the similarities among data points decline rapidly.
In machine learning parlance, the predictive power of a model increases as predictors/features are added, but beyond a certain dimensionality the results start to deteriorate.
PCA constructs a set of uncorrelated, non-collinear (orthogonal) independent variables, helping us get rid of correlated, collinear variables that might be unnecessarily inflating the dimensionality of the dataset.
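To make the "uncorrelated" claim concrete, here is a minimal sketch on made-up toy data (the toy matrix is purely illustrative and not part of the lab data): the principal component scores returned by prcomp() are mutually uncorrelated.
set.seed(42)
toy = matrix(rnorm(200), ncol = 4)   # 50 records, 4 features of random toy data
pc = prcomp(toy, scale = TRUE)
round(cor(pc$x), 10)                 # off-diagonal correlations are ~0, i.e. the PC scores are uncorrelated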
Please go through the video lectures to understand better.
We will work through PCA using the NCI60 dataset from the ISLR package.
It is NCI microarray data, containing expression levels of 6830 genes from 64 cancer cell lines, and each record is labelled with a cancer type.
Loading the dataset
library(ISLR)
data(NCI60)
class(NCI60); names(NCI60)
## [1] "list"
## [1] "data" "labs"
It is a list with two components: data (the independent variables) and labs (the response variable).
X = NCI60$data
y = as.factor(NCI60$labs)
dim(X)
## [1] 64 6830
length(y)
## [1] 64
levels(y)
## [1] "BREAST" "CNS" "COLON" "K562A-repro" "K562B-repro"
## [6] "LEUKEMIA" "MCF7A-repro" "MCF7D-repro" "MELANOMA" "NSCLC"
## [11] "OVARIAN" "PROSTATE" "RENAL" "UNKNOWN"
table(y)
## y
## BREAST CNS COLON K562A-repro K562B-repro LEUKEMIA
## 7 5 7 1 1 6
## MCF7A-repro MCF7D-repro MELANOMA NSCLC OVARIAN PROSTATE
## 1 1 8 9 6 2
## RENAL UNKNOWN
## 9 1
So, the data has 6830 gene expression measurements (features) and only 64 records.
There are 13 cancer types besides one case of an unknown type.
X[1:5, 1:5]
## 1 2 3 4 5
## V1 0.300000 1.180000 0.550000 1.140000 -0.265000
## V2 0.679961 1.289961 0.169961 0.379961 0.464961
## V3 0.940000 -0.040000 -0.170000 -0.040000 -0.605000
## V4 0.280000 -0.310000 0.680000 -0.810000 0.625000
## V5 0.485000 -0.465000 0.395000 0.905000 0.200000
y[1:5]
## [1] CNS CNS CNS RENAL BREAST
## 14 Levels: BREAST CNS COLON K562A-repro K562B-repro LEUKEMIA ... UNKNOWN
PCA on NCI60
Now, let's perform PCA on the NCI60 data.
We will use the function prcomp() that comes with base R.
We will scale the data so that each variable has mean zero and standard deviation one.
pca.mod = prcomp(X, scale = TRUE)
names(pca.mod)
## [1] "sdev" "rotation" "center" "scale" "x"
So, prcomp() returns some useful components. "center" and "scale" are the means and standard deviations of the variables that were used to scale them.
"rotation" is the matrix of principal component loadings and "x" is the matrix of principal component scores. When the (centred and scaled) data matrix X is multiplied by the rotation matrix, we get the PC scores, which are simply the coordinates of the data points along the rotated axes. In other words, the principal components become the new axes, and the scores become the new coordinates of the data points in terms of those components.
The "rotation" matrix holds the weights assigned to the original features: rows correspond to features and columns to PCs, so each column gives the weights that one principal component assigns to the different variables/features.
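As a quick check of this relationship (a small sketch, using pca.mod fitted with scale = TRUE as above), we can recompute the scores by hand and compare them with pca.mod$x:
X_scaled = scale(X)                                   # centre and scale the data, as prcomp(..., scale = TRUE) does internally
max(abs(X_scaled %*% pca.mod$rotation - pca.mod$x))   # should be ~0; any difference is just floating-point noise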
Plotting the PCs
Let's plot the principal component scores. We will create a palette of colors for the cancer types so that each type is plotted in its own color.
# Create palette
palette = rainbow(length(unique(y)))
colors = palette[as.numeric(y)]
plot(pca.mod$x[, 1:2], col = colors, pch = 19, xlab = "Z1", ylab = "Z2", main = "1st and 2nd PCs") #First two PCs
plot(pca.mod$x[, c(1, 3)], col = colors, pch = 19, xlab = "Z1", ylab = "Z3", main = "1st and 3rd PCs") #First and third PCs
plot(pca.mod$x[, c(1, 4)], col = colors, pch = 19, xlab = "Z1", ylab = "Z4", main = "1st and 4th PCs")
From the above plots, we see that cell lines belonging to the same type of cancer tend to have similar values on the first few principal components. Therefore, cell lines with the same cancer type have fairly similar gene expression measurements.
Analysing PVE of PCs
Let’s look at the summary
summary(pca.mod)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6
## Standard deviation 27.8535 21.48136 19.82046 17.03256 15.97181 15.72108
## Proportion of Variance 0.1136 0.06756 0.05752 0.04248 0.03735 0.03619
## Cumulative Proportion 0.1136 0.18115 0.23867 0.28115 0.31850 0.35468
## PC7 PC8 PC9 PC10 PC11 PC12
## Standard deviation 14.47145 13.54427 13.14400 12.73860 12.68672 12.15769
## Proportion of Variance 0.03066 0.02686 0.02529 0.02376 0.02357 0.02164
## Cumulative Proportion 0.38534 0.41220 0.43750 0.46126 0.48482 0.50646
## PC13 PC14 PC15 PC16 PC17 PC18
## Standard deviation 11.83019 11.62554 11.43779 11.00051 10.65666 10.48880
## Proportion of Variance 0.02049 0.01979 0.01915 0.01772 0.01663 0.01611
## Cumulative Proportion 0.52695 0.54674 0.56590 0.58361 0.60024 0.61635
## PC19 PC20 PC21 PC22 PC23 PC24
## Standard deviation 10.43518 10.3219 10.14608 10.0544 9.90265 9.64766
## Proportion of Variance 0.01594 0.0156 0.01507 0.0148 0.01436 0.01363
## Cumulative Proportion 0.63229 0.6479 0.66296 0.6778 0.69212 0.70575
## PC25 PC26 PC27 PC28 PC29 PC30 PC31
## Standard deviation 9.50764 9.33253 9.27320 9.0900 8.98117 8.75003 8.59962
## Proportion of Variance 0.01324 0.01275 0.01259 0.0121 0.01181 0.01121 0.01083
## Cumulative Proportion 0.71899 0.73174 0.74433 0.7564 0.76824 0.77945 0.79027
## PC32 PC33 PC34 PC35 PC36 PC37 PC38
## Standard deviation 8.44738 8.37305 8.21579 8.15731 7.97465 7.90446 7.82127
## Proportion of Variance 0.01045 0.01026 0.00988 0.00974 0.00931 0.00915 0.00896
## Cumulative Proportion 0.80072 0.81099 0.82087 0.83061 0.83992 0.84907 0.85803
## PC39 PC40 PC41 PC42 PC43 PC44 PC45
## Standard deviation 7.72156 7.58603 7.45619 7.3444 7.10449 7.0131 6.95839
## Proportion of Variance 0.00873 0.00843 0.00814 0.0079 0.00739 0.0072 0.00709
## Cumulative Proportion 0.86676 0.87518 0.88332 0.8912 0.89861 0.9058 0.91290
## PC46 PC47 PC48 PC49 PC50 PC51 PC52
## Standard deviation 6.8663 6.80744 6.64763 6.61607 6.40793 6.21984 6.20326
## Proportion of Variance 0.0069 0.00678 0.00647 0.00641 0.00601 0.00566 0.00563
## Cumulative Proportion 0.9198 0.92659 0.93306 0.93947 0.94548 0.95114 0.95678
## PC53 PC54 PC55 PC56 PC57 PC58 PC59
## Standard deviation 6.06706 5.91805 5.91233 5.73539 5.47261 5.2921 5.02117
## Proportion of Variance 0.00539 0.00513 0.00512 0.00482 0.00438 0.0041 0.00369
## Cumulative Proportion 0.96216 0.96729 0.97241 0.97723 0.98161 0.9857 0.98940
## PC60 PC61 PC62 PC63 PC64
## Standard deviation 4.68398 4.17567 4.08212 4.04124 2.148e-14
## Proportion of Variance 0.00321 0.00255 0.00244 0.00239 0.000e+00
## Cumulative Proportion 0.99262 0.99517 0.99761 1.00000 1.000e+00
The summary shows the importance of each component in terms of how much of the variance in the data each principal component (PC) is able to explain.
We need to look at the proportion of variance explained (PVE) by each PC. PC1 has the highest value, 0.1136 (i.e. 11.36%). Likewise, PC2 explains 0.06756 (i.e. about 6.76%).
We need to look at the Cumulative Proportion to decide how many PCs are required to explain a given percentage of the variance in the data. E.g., if we want 80% of the variance explained by the PCs, we need the first 32 PCs.
In this table, Standard Deviation is the square root of the variance explained by each PC, while PVE expresses that variance as a proportion (or percentage) of the total.
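If we prefer to pull these numbers out programmatically rather than reading the printed table, the summary object exposes them as a matrix (a small sketch):
imp = summary(pca.mod)$importance   # rows: Standard deviation, Proportion of Variance, Cumulative Proportion
imp[, 1:3]                          # the same three quantities for the first three PCs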
Let's plot the PCs to see how much variance is explained by each of them.
plot(pca.mod)
Here, the height of each bar is the square of the standard deviation (which, by definition, is the variance) of the corresponding PC. The bars appear in the same order as the PCs, from the first PC onward. E.g., the height of the first bar = the square of the standard deviation of PC1:
pca.mod$sdev[1]^2 # square of the standard deviation of PC1
## [1] 775.8157
This shows that as we move from the first to the last PC, the variance explained keeps declining, i.e., the first PC explains the largest amount of variance in the data, followed by PC2, and so on.
Plotting PVE
To get a better understanding, let's plot the PVE and the cumulative PVE of each principal component. Since prcomp() does not return the PVE as a component, we will compute it ourselves by squaring sdev and dividing by the sum of the squared sdev values.
We will also plot cumulative PVE to look at the cumulative variance explained by PCs.
PVE = (pca.mod$sdev^2)/sum(pca.mod$sdev^2)
PVE[1:5] # Let's look at some of the values
## [1] 0.11358942 0.06756203 0.05751842 0.04247554 0.03734972
The above result matches the summary output.
plot(PVE, type = "o", ylab = "PVE", xlab = "Principal Components", col = "blue")
plot(cumsum(PVE), type = "o", ylab = "Cumulative PVE", xlab = "Principal Components", col = "brown3")
The PVE plot resembles the elbow plot we used in k-means clustering. Judging by where the elbow starts, the first 7 or 8 PCs explain a significant amount of the variance, after which the remaining PCs may not help much.
So, as discussed earlier from the summary table, the cumulative PVE shows that about 80% of the variance is explained by the first 30-32 PCs, and around 40% by the first 8 PCs.
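We can confirm the 80% figure directly from the cumulative PVE (a small sketch using the PVE vector computed above):
min(which(cumsum(PVE) >= 0.80))   # smallest number of PCs whose cumulative PVE reaches 80%; 32 here, matching the summary table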
The main idea is to minimize the number of principal components while retaining as much information about the dataset as possible. It is difficult to analyse many PCs (even 7-8 is considered quite a lot).
Analysing PCs
dim(pca.mod$rotation)
## [1] 6830 64
As we can see, the model has created 64 principal components from all 6830 variables. In general, at most min(m-1, n) principal components are informative (where m = number of records and n = number of variables), i.e. 63 here; prcomp() still returns 64 PCs, but the last one has an essentially zero standard deviation, as seen in the summary above.
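A quick check of that rule for this data (m = 64 records, n = 6830 variables):
min(nrow(X) - 1, ncol(X))   # 63 informative PCs, consistent with PC64's ~0 standard deviation (2.148e-14) in the summary above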
Since the first PC explains more variance in the data than any other PC, it can give us an idea of the importance of the variables in the dataset.
We usually look at only the first few PCs, analyse the behaviour or importance of the variables in those PCs, and decide which ones to use for further analysis or for building predictive models.
Because there are so many variables, we cannot look at the weights of all of them in a PC.
Let's look at some of them in PC1:
pca.mod$rotation[1:20, 1]
## 1 2 3 4 5
## -0.0106823696 -0.0023120784 -0.0058797496 0.0032780710 -0.0076775347
## 6 7 8 9 10
## 0.0022666712 -0.0085883790 -0.0049757108 -0.0045281698 -0.0081231123
## 11 12 13 14 15
## -0.0015250368 0.0006676954 0.0032915834 -0.0060017646 -0.0084585316
## 16 17 18 19 20
## -0.0029104606 0.0035443606 -0.0060346714 -0.0021566243 -0.0004145221
What we see here is how much each variable has contributed to PC1. Their weights indicate their importance in the data, given that the first PC is the most important component in PCA. Likewise, we can analyse the other PCs (which are orthogonal to one another) and select only those variables that make larger contributions.
Let's look at the top 10 variables with the highest absolute weights in PC1.
idx = order(abs(pca.mod$rotation[, 1]), decreasing = TRUE) #Indices of variables in decreasing absolute weights
pca.mod$rotation[idx[1:10], 1]
## 5951 5874 5886 5872 5937 5868
## -0.03113715 -0.03044319 -0.03004791 -0.02964670 -0.02940276 -0.02917913
## 5887 5960 5956 5902
## -0.02917070 -0.02898966 -0.02891342 -0.02856706
These variables have been given the highest absolute weights in the first principal component, which indicates their importance. Therefore, we might consider these variables for model building or further analysis.
Likewise, we can look at the most important variables in other PCs, such as PC2, PC3, etc.
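For example, the same one-liner applied to PC2 (a small sketch; the variables selected will generally differ from those of PC1):
idx2 = order(abs(pca.mod$rotation[, 2]), decreasing = TRUE)   # indices sorted by absolute weight in PC2
pca.mod$rotation[idx2[1:10], 2]                               # top 10 variables by absolute loading on PC2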
Therefore, we can reduce the dimensionality of the data by considering only the important variables from the first few principal components.
Working on reduced dimensionality
Let's write a function that returns the indices of the most important variables across the first K principal components.
imp_var = function(pca_rotation, K, n=10){
# pca_rotation = rotation matrix of PCA
# K = number of principal components to be considered for dimensionality reduction
# n = number of top important variables to be considered for dimensionality reduction
m = nrow(pca_rotation)
idx = rep(0, m) # Zero-filled vector to store indices of the most important variables (unused slots remain 0)
c = 1
for(i in 1:K){
id = order(abs(pca_rotation[, i]), decreasing = TRUE)[1:n]
idx[c:(i*n)] = id
c = c+n
if(c >= m-n) break
}
idx = unique(idx)
idx
}
Let's get the indices of the top 10 variables from each of the first 8 principal components:
idx = imp_var(pca.mod$rotation, K=8, n=10)
Let's look at the values of idx:
length(idx)
## [1] 81
idx
## [1] 5951 5874 5886 5872 5937 5868 5887 5960 5956 5902 4320 4321 4327 4324 4193
## [16] 4322 4325 4335 4329 4192 267 3939 5644 215 2846 279 3953 2845 5630 3933
## [31] 957 941 93 2849 2850 4832 4830 975 5131 3126 1620 1561 793 3355 4925
## [46] 3398 4811 3357 3996 1621 1385 1384 1387 6698 4044 4045 6693 6692 1398 4038
## [61] 5369 5368 423 3461 5568 6295 421 424 422 6294 3703 3702 559 3286 3832
## [76] 184 6798 808 3289 3701 0
So, we have reduced the dimensionality from 6830 to 80 variables (ignoring the trailing 0, which is just an unused slot left over from the zero-filled vector inside imp_var()).
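The stray 0 is harmless because R silently ignores zero indices when subsetting, but if we prefer a clean index vector we can drop it explicitly (optional; the reduced data below is identical either way):
idx = idx[idx != 0]   # remove the unused zero slot; length(idx) is now 80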
Training the model on data with reduced dimensionality
Let’s get the data X with 80 variables and split it into train and test data to build and test the model.
m = nrow(X)
X_reduced = X[,idx]
dim(X_reduced)
## [1] 64 80
set.seed(1); train.idx = sample(1:m, 50, replace=F)
train = X_reduced[train.idx,]
test = X_reduced[-train.idx,]
train.y = y[train.idx]
test.y = y[-train.idx]
dim(train); length(train.y)
## [1] 50 80
## [1] 50
dim(test); length(test.y)
## [1] 14 80
## [1] 14
Let's build a classifier on the train data using a neural network:
library(nnet)
nn.model = nnet(train.y ~ ., data = train, size = 10, decay = 0.01, maxit = 400)
## # weights: 964
## initial value 154.602101
## iter 10 value 22.800897
## iter 20 value 12.948009
## iter 30 value 8.335844
## iter 40 value 6.766485
## iter 50 value 6.196805
## iter 60 value 5.923871
## iter 70 value 5.850125
## iter 80 value 5.798908
## iter 90 value 5.753930
## iter 100 value 5.743480
## iter 110 value 5.736503
## iter 120 value 5.734001
## iter 130 value 5.723961
## iter 140 value 5.709124
## iter 150 value 5.692332
## iter 160 value 5.609125
## iter 170 value 5.567010
## iter 180 value 5.516563
## iter 190 value 5.470570
## iter 200 value 5.459094
## iter 210 value 5.457853
## iter 220 value 5.457762
## iter 230 value 5.457745
## iter 240 value 5.457742
## final value 5.457742
## converged
Brief summary of the model:
print(nn.model)
## a 80-10-14 network with 964 weights
## inputs: `5951` `5874` `5886` `5872` `5937` `5868` `5887` `5960` `5956` `5902` `4320` `4321` `4327` `4324` `4193` `4322` `4325` `4335` `4329` `4192` `267` `3939` `5644` `215` `2846` `279` `3953` `2845` `5630` `3933` `957` `941` `93` `2849` `2850` `4832` `4830` `975` `5131` `3126` `1620` `1561` `793` `3355` `4925` `3398` `4811` `3357` `3996` `1621` `1385` `1384` `1387` `6698` `4044` `4045` `6693` `6692` `1398` `4038` `5369` `5368` `423` `3461` `5568` `6295` `421` `424` `422` `6294` `3703` `3702` `559` `3286` `3832` `184` `6798` `808` `3289` `3701`
## output(s): train.y
## options were - softmax modelling decay=0.01
Predictions
pred.train = predict(nn.model, newdata = train, type = "class")
pred.test = predict(nn.model, newdata = test, type = "class")
Let's calculate train and test errors:
confMtx.train=table(train.y,pred.train)
confMtx.test=table(test.y,pred.test)
confMtx.train
## pred.train
## train.y BREAST CNS COLON K562A-repro K562B-repro LEUKEMIA MCF7A-repro
## BREAST 6 0 0 0 0 0 0
## CNS 0 3 0 0 0 0 0
## COLON 0 0 5 0 0 0 0
## K562A-repro 0 0 0 1 0 0 0
## K562B-repro 0 0 0 0 1 0 0
## LEUKEMIA 0 0 0 0 0 6 0
## MCF7A-repro 0 0 0 0 0 0 1
## MCF7D-repro 0 0 0 0 0 0 0
## MELANOMA 0 0 0 0 0 0 0
## NSCLC 0 0 0 0 0 0 0
## OVARIAN 0 0 0 0 0 0 0
## PROSTATE 0 0 0 0 0 0 0
## RENAL 0 0 0 0 0 0 0
## UNKNOWN 0 0 0 0 0 0 0
## pred.train
## train.y MCF7D-repro MELANOMA NSCLC OVARIAN PROSTATE RENAL UNKNOWN
## BREAST 0 0 0 0 0 0 0
## CNS 0 0 0 0 0 0 0
## COLON 0 0 0 0 0 0 0
## K562A-repro 0 0 0 0 0 0 0
## K562B-repro 0 0 0 0 0 0 0
## LEUKEMIA 0 0 0 0 0 0 0
## MCF7A-repro 0 0 0 0 0 0 0
## MCF7D-repro 1 0 0 0 0 0 0
## MELANOMA 0 5 0 0 0 0 0
## NSCLC 0 0 8 0 0 0 0
## OVARIAN 0 0 0 6 0 0 0
## PROSTATE 0 0 0 0 1 0 0
## RENAL 0 0 0 0 0 5 0
## UNKNOWN 0 0 0 0 0 0 1
confMtx.test
## pred.test
## test.y CNS COLON MELANOMA OVARIAN RENAL
## BREAST 1 0 0 0 0
## CNS 2 0 0 0 0
## COLON 0 2 0 0 0
## K562A-repro 0 0 0 0 0
## K562B-repro 0 0 0 0 0
## LEUKEMIA 0 0 0 0 0
## MCF7A-repro 0 0 0 0 0
## MCF7D-repro 0 0 0 0 0
## MELANOMA 0 0 3 0 0
## NSCLC 0 0 0 0 1
## OVARIAN 0 0 0 0 0
## PROSTATE 0 0 0 0 1
## RENAL 0 0 0 1 3
## UNKNOWN 0 0 0 0 0
cat('\nTrain Set Accuracy: ', mean(pred.train == train.y) * 100, "%")
##
## Train Set Accuracy: 100 %
cat('\nTest Set Accuracy: ', mean(pred.test == test.y) * 100, "%")
##
## Test Set Accuracy: 71.42857 %
The model has performed really well on the train set with 100% accuracy, but not so well on the test data, where accuracy is only about 71.43%. One reason could be that there is still some sparsity in the data, since the number of features exceeds the number of records. However, we should try to improve the model's performance through different methods/options.
Exercise: Try different combinations of K (number of PCs) and n (number of top important variables) in the function imp_var() and see if you can improve the model performance on test data. Also try different algorithms such as SVM to improve the performance on test data.
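As one possible starting point for the exercise, here is a hedged sketch of an SVM fitted on the same reduced data (it assumes the e1071 package is installed; the linear kernel and cost = 1 are arbitrary choices to be tuned, and no accuracy figures are claimed here):
library(e1071)
svm.model = svm(x = train, y = train.y, kernel = "linear", cost = 1)   # SVM classifier on the 80 selected variables
pred.svm = predict(svm.model, newdata = test)                          # predict cancer types for the held-out cell lines
mean(pred.svm == test.y) * 100                                         # test-set accuracy in percent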