Lab_PCA_1_RM
Asmi Ariv
2022-10-07
Machine Learning - Principal Component Analysis (PCA)
PCA is an unsupervised machine learning technique whose main objective is to project the data into a space of lower dimension than the original, so that its behaviour can be analysed. This is achieved by rotating the axes to align with the directions in which the variability of the data is maximum.
When the number of dimensions is very large, the data suffers from what is called the "curse of dimensionality", which makes it difficult to analyse. With many dimensions the data becomes quite sparse, and countering this may require a massive amount of data, which is often not available. In such cases the number of features/dimensions is much larger than the number of records, and as the space (number of dimensions) grows, the similarities among data points decline rapidly.
In machine learning parlance, the predictive power of a model increases as predictors/features are added, but beyond a certain dimensionality the results start to deteriorate.
PCA constructs a set of uncorrelated, non-collinear (orthogonal) independent variables, helping us get rid of correlated, collinear variables that might be unnecessarily inflating the dimensionality of the dataset.
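To make the "uncorrelated" claim concrete, here is a minimal sketch on made-up toy data (the toy matrix is purely illustrative and not part of the lab data): the principal component scores returned by prcomp() are mutually uncorrelated.
set.seed(42)
toy = matrix(rnorm(200), ncol = 4)   # 50 records, 4 features of random toy data
pc = prcomp(toy, scale = TRUE)
round(cor(pc$x), 10)                 # off-diagonal correlations are ~0, i.e. the PC scores are uncorrelated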
Please go through the video lectures to understand better.
We will work through PCA using the NCI60 dataset from the ISLR package.
It is NCI microarray data, containing expression levels of 6830 genes from 64 cancer cell lines, and each record is labelled with a cancer type.
Loading the dataset
library(ISLR)
data(NCI60)
class(NCI60); names(NCI60)
## [1] "list"
## [1] "data" "labs"
It is a list with two components: data (the independent variables) and labs (the response variable).
X = NCI60$data
y = as.factor(NCI60$labs)
dim(X)
## [1] 64 6830
length(y)
## [1] 64
levels(y)
## [1] "BREAST" "CNS" "COLON" "K562A-repro" "K562B-repro"
## [6] "LEUKEMIA" "MCF7A-repro" "MCF7D-repro" "MELANOMA" "NSCLC"
## [11] "OVARIAN" "PROSTATE" "RENAL" "UNKNOWN"
table(y)
## y
## BREAST CNS COLON K562A-repro K562B-repro LEUKEMIA
## 7 5 7 1 1 6
## MCF7A-repro MCF7D-repro MELANOMA NSCLC OVARIAN PROSTATE
## 1 1 8 9 6 2
## RENAL UNKNOWN
## 9 1
So, the data has 6830 gene expression measurements (features) and only 64 records.
There are 13 cancer types besides one case of an unknown type.
X[1:5, 1:5]
## 1 2 3 4 5
## V1 0.300000 1.180000 0.550000 1.140000 -0.265000
## V2 0.679961 1.289961 0.169961 0.379961 0.464961
## V3 0.940000 -0.040000 -0.170000 -0.040000 -0.605000
## V4 0.280000 -0.310000 0.680000 -0.810000 0.625000
## V5 0.485000 -0.465000 0.395000 0.905000 0.200000
y[1:5]
## [1] CNS CNS CNS RENAL BREAST
## 14 Levels: BREAST CNS COLON K562A-repro K562B-repro LEUKEMIA ... UNKNOWN
PCA on NCI60
Now, let's perform PCA on the NCI60 data.
We will use the function prcomp() that comes with base R.
We will scale the data so that each variable has mean zero and standard deviation one.
pca.mod = prcomp(X, scale = TRUE)
names(pca.mod)
## [1] "sdev" "rotation" "center" "scale" "x"
So, prcomp() returns some useful components. "center" and "scale" are the means and standard deviations of the variables that were used to scale them.
"rotation" is the matrix of principal component loadings and "x" is the matrix of principal component scores. When the (centred and scaled) data matrix X is multiplied by the rotation matrix, we get the PC scores, which are simply the coordinates of the data points along the rotated axes. In other words, the principal components become the new axes, and the scores become the new coordinates of the data points in terms of those components.
The "rotation" matrix holds the weights assigned to the original features: rows correspond to features and columns to PCs, so each column gives the weights that one principal component assigns to the different variables/features.
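As a quick check of this relationship (a small sketch, using pca.mod fitted with scale = TRUE as above), we can recompute the scores by hand and compare them with pca.mod$x:
X_scaled = scale(X)                                   # centre and scale the data, as prcomp(..., scale = TRUE) does internally
max(abs(X_scaled %*% pca.mod$rotation - pca.mod$x))   # should be ~0; any difference is just floating-point noise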
Plotting the PCs
Let's plot the principal component scores. We will create a palette of colors for the cancer types so that each type is plotted in its own color.
# Create palette
palette = rainbow(length(unique(y)))
colors = palette[as.numeric(y)]
plot(pca.mod$x[, 1:2], col = colors, pch = 19, xlab = "Z1", ylab = "Z2", main = "1st and 2nd PCs") #First two PCs
plot(pca.mod$x[, c(1, 3)], col = colors, pch = 19, xlab = "Z1", ylab = "Z3", main = "1st and 3rd PCs") #First and third PCs
plot(pca.mod$x[, c(1, 4)], col = colors, pch = 19, xlab = "Z1", ylab = "Z4", main = "1st and 4th PCs")
From the above plots, we see that cell lines belonging to the same type of cancer tend to have similar values on the first few principal components. Therefore, cell lines with the same cancer type have fairly similar gene expression measurements.
Analysing PVE of PCs
Let’s look at the summary
summary(pca.mod)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6
## Standard deviation 27.8535 21.48136 19.82046 17.03256 15.97181 15.72108
## Proportion of Variance 0.1136 0.06756 0.05752 0.04248 0.03735 0.03619
## Cumulative Proportion 0.1136 0.18115 0.23867 0.28115 0.31850 0.35468
## PC7 PC8 PC9 PC10 PC11 PC12
## Standard deviation 14.47145 13.54427 13.14400 12.73860 12.68672 12.15769
## Proportion of Variance 0.03066 0.02686 0.02529 0.02376 0.02357 0.02164
## Cumulative Proportion 0.38534 0.41220 0.43750 0.46126 0.48482 0.50646
## PC13 PC14 PC15 PC16 PC17 PC18
## Standard deviation 11.83019 11.62554 11.43779 11.00051 10.65666 10.48880
## Proportion of Variance 0.02049 0.01979 0.01915 0.01772 0.01663 0.01611
## Cumulative Proportion 0.52695 0.54674 0.56590 0.58361 0.60024 0.61635
## PC19 PC20 PC21 PC22 PC23 PC24
## Standard deviation 10.43518 10.3219 10.14608 10.0544 9.90265 9.64766
## Proportion of Variance 0.01594 0.0156 0.01507 0.0148 0.01436 0.01363
## Cumulative Proportion 0.63229 0.6479 0.66296 0.6778 0.69212 0.70575
## PC25 PC26 PC27 PC28 PC29 PC30 PC31
## Standard deviation 9.50764 9.33253 9.27320 9.0900 8.98117 8.75003 8.59962
## Proportion of Variance 0.01324 0.01275 0.01259 0.0121 0.01181 0.01121 0.01083
## Cumulative Proportion 0.71899 0.73174 0.74433 0.7564 0.76824 0.77945 0.79027
## PC32 PC33 PC34 PC35 PC36 PC37 PC38
## Standard deviation 8.44738 8.37305 8.21579 8.15731 7.97465 7.90446 7.82127
## Proportion of Variance 0.01045 0.01026 0.00988 0.00974 0.00931 0.00915 0.00896
## Cumulative Proportion 0.80072 0.81099 0.82087 0.83061 0.83992 0.84907 0.85803
## PC39 PC40 PC41 PC42 PC43 PC44 PC45
## Standard deviation 7.72156 7.58603 7.45619 7.3444 7.10449 7.0131 6.95839
## Proportion of Variance 0.00873 0.00843 0.00814 0.0079 0.00739 0.0072 0.00709
## Cumulative Proportion 0.86676 0.87518 0.88332 0.8912 0.89861 0.9058 0.91290
## PC46 PC47 PC48 PC49 PC50 PC51 PC52
## Standard deviation 6.8663 6.80744 6.64763 6.61607 6.40793 6.21984 6.20326
## Proportion of Variance 0.0069 0.00678 0.00647 0.00641 0.00601 0.00566 0.00563
## Cumulative Proportion 0.9198 0.92659 0.93306 0.93947 0.94548 0.95114 0.95678
## PC53 PC54 PC55 PC56 PC57 PC58 PC59
## Standard deviation 6.06706 5.91805 5.91233 5.73539 5.47261 5.2921 5.02117
## Proportion of Variance 0.00539 0.00513 0.00512 0.00482 0.00438 0.0041 0.00369
## Cumulative Proportion 0.96216 0.96729 0.97241 0.97723 0.98161 0.9857 0.98940
## PC60 PC61 PC62 PC63 PC64
## Standard deviation 4.68398 4.17567 4.08212 4.04124 2.148e-14
## Proportion of Variance 0.00321 0.00255 0.00244 0.00239 0.000e+00
## Cumulative Proportion 0.99262 0.99517 0.99761 1.00000 1.000e+00
The summary shows the importance of each component in terms of how much of the variance in the data each principal component (PC) is able to explain.
We need to look at the proportion of variance explained (PVE) by each PC. PC1 has the highest value, 0.1136 (i.e. 11.36%). Likewise, PC2 explains 0.06756 (i.e. about 6.76%).
We need to look at the Cumulative Proportion to decide how many PCs are required to explain a given percentage of the variance in the data. E.g., if we want 80% of the variance explained by the PCs, we need the first 32 PCs.
In this table, Standard Deviation is the square root of the variance explained by each PC, while PVE expresses that variance as a proportion (or percentage) of the total.
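If we prefer to pull these numbers out programmatically rather than reading the printed table, the summary object exposes them as a matrix (a small sketch):
imp = summary(pca.mod)$importance   # rows: Standard deviation, Proportion of Variance, Cumulative Proportion
imp[, 1:3]                          # the same three quantities for the first three PCs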
Let's plot the PCs to see how much variance is explained by each of them.
plot(pca.mod)
Here, the height of each bar is the square of the standard deviation (which, by definition, is the variance) of the corresponding PC. The bars appear in the same order as the PCs, from the first PC onward. E.g., the height of the first bar = the square of the standard deviation of PC1:
pca.mod$sdev[1]^2 # square of the standard deviation of PC1
## [1] 775.8157
This shows that as we move from the first to the last PC, the variance explained keeps declining, i.e., the first PC explains the largest amount of variance in the data, followed by PC2, and so on.
Plotting PVE
To get a better understanding, let's plot the PVE and the cumulative PVE of each principal component. Since prcomp() does not return the PVE as a component, we will compute it ourselves by squaring sdev and dividing by the sum of the squared sdev values.
We will also plot cumulative PVE to look at the cumulative variance explained by PCs.
PVE = (pca.mod$sdev^2)/sum(pca.mod$sdev^2)
PVE[1:5] # Let's look at some of the values
## [1] 0.11358942 0.06756203 0.05751842 0.04247554 0.03734972
The above result matches the summary output.
plot(PVE, type = "o", ylab = "PVE", xlab = "Principal Components", col = "blue")
plot(cumsum(PVE), type = "o", ylab = "Cumulative PVE", xlab = "Principal Components", col = "brown3")
The PVE plot resembles the elbow plot we used in k-means clustering. Judging by where the elbow starts, the first 7 or 8 PCs explain a significant amount of the variance, after which the remaining PCs may not help much.
So, as discussed earlier from the summary table, the cumulative PVE shows that about 80% of the variance is explained by the first 30-32 PCs, and around 40% by the first 8 PCs.
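We can confirm the 80% figure directly from the cumulative PVE (a small sketch using the PVE vector computed above):
min(which(cumsum(PVE) >= 0.80))   # smallest number of PCs whose cumulative PVE reaches 80%; 32 here, matching the summary table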
The main idea is to minimize the number of principal components while retaining as much information about the dataset as possible. It is difficult to analyse many PCs (even 7-8 is considered quite a lot).
Analysing PCs
dim(pca.mod$rotation)
## [1] 6830 64
As we can see, the model has created 64 principal components from all 6830 variables. In general, at most min(m-1, n) principal components are informative (where m = number of records and n = number of variables), i.e. 63 here; prcomp() still returns 64 PCs, but the last one has an essentially zero standard deviation, as seen in the summary above.
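A quick check of that rule for this data (m = 64 records, n = 6830 variables):
min(nrow(X) - 1, ncol(X))   # 63 informative PCs, consistent with PC64's ~0 standard deviation (2.148e-14) in the summary above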
Since the first PC explains more variance in the data than any other PC, it can give us an idea of the importance of the variables in the dataset.
We usually look at only the first few PCs, analyse the behaviour or importance of the variables in those PCs, and decide which ones to use for further analysis or for building predictive models.
Because there are so many variables, we cannot look at the weights of all of them in a PC.
Let's look at some of them in PC1:
pca.mod$rotation[1:20, 1]
## 1 2 3 4 5
## -0.0106823696 -0.0023120784 -0.0058797496 0.0032780710 -0.0076775347
## 6 7 8 9 10
## 0.0022666712 -0.0085883790 -0.0049757108 -0.0045281698 -0.0081231123
## 11 12 13 14 15
## -0.0015250368 0.0006676954 0.0032915834 -0.0060017646 -0.0084585316
## 16 17 18 19 20
## -0.0029104606 0.0035443606 -0.0060346714 -0.0021566243 -0.0004145221
What we see here is how much each variable has contributed to PC1. Their weights indicate their importance in the data, given that the first PC is the most important component in PCA. Likewise, we can analyse the other PCs (which are orthogonal to one another) and select only those variables that make larger contributions.
Let's look at the top 10 variables with the highest absolute weights in PC1.
idx = order(abs(pca.mod$rotation[, 1]), decreasing = TRUE) #Indices of variables in decreasing absolute weights
pca.mod$rotation[idx[1:10], 1]
## 5951 5874 5886 5872 5937 5868
## -0.03113715 -0.03044319 -0.03004791 -0.02964670 -0.02940276 -0.02917913
## 5887 5960 5956 5902
## -0.02917070 -0.02898966 -0.02891342 -0.02856706
These variables have been given the highest absolute weights in the first principal component, which indicates their importance. Therefore, we might consider these variables for model building or further analysis.
Likewise, we can look at the most important variables in other PCs, such as PC2, PC3, etc.
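For example, the same one-liner applied to PC2 (a small sketch; the variables selected will generally differ from those of PC1):
idx2 = order(abs(pca.mod$rotation[, 2]), decreasing = TRUE)   # indices sorted by absolute weight in PC2
pca.mod$rotation[idx2[1:10], 2]                               # top 10 variables by absolute loading on PC2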
Therefore, we can reduce the dimensionality of the data by considering only the important variables from the first few principal components.
Working on reduced dimensionality
Let's write a function that returns the indices of the most important variables across the first K principal components.
imp_var = function(pca_rotation, K, n=10){
# pca_rotation = rotation matrix of PCA
# K = number of principal components to be considered for dimensionality reduction
# n = number of top important variables to be considered for dimensionality reduction
m = nrow(pca_rotation)
idx = rep(0, m) # Zero-filled vector to store indices of the most important variables (unused slots remain 0)
c = 1
for(i in 1:K){
id = order(abs(pca_rotation[, i]), decreasing = TRUE)[1:n]
idx[c:(i*n)] = id
c = c+n
if(c >= m-n) break
}
idx = unique(idx)
idx
}
Let's get the indices of the top 10 variables from each of the first 8 principal components:
idx = imp_var(pca.mod$rotation, K=8, n=10)
Let's look at the values of idx:
length(idx)
## [1] 81
idx
## [1] 5951 5874 5886 5872 5937 5868 5887 5960 5956 5902 4320 4321 4327 4324 4193
## [16] 4322 4325 4335 4329 4192 267 3939 5644 215 2846 279 3953 2845 5630 3933
## [31] 957 941 93 2849 2850 4832 4830 975 5131 3126 1620 1561 793 3355 4925
## [46] 3398 4811 3357 3996 1621 1385 1384 1387 6698 4044 4045 6693 6692 1398 4038
## [61] 5369 5368 423 3461 5568 6295 421 424 422 6294 3703 3702 559 3286 3832
## [76] 184 6798 808 3289 3701 0
So, we have reduced the dimensionality from 6830 to 80 variables (ignoring the trailing 0, which is just an unused slot left over from the zero-filled vector inside imp_var()).
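The stray 0 is harmless because R silently ignores zero indices when subsetting, but if we prefer a clean index vector we can drop it explicitly (optional; the reduced data below is identical either way):
idx = idx[idx != 0]   # remove the unused zero slot; length(idx) is now 80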
Training the model on data with reduced dimensionality
Let’s get the data X with 80 variables and split it into train and test data to build and test the model.
m = nrow(X)
X_reduced = X[,idx]
dim(X_reduced)
## [1] 64 80
set.seed(1); train.idx = sample(1:m, 50, replace=F)
train = X_reduced[train.idx,]
test = X_reduced[-train.idx,]
train.y = y[train.idx]
test.y = y[-train.idx]
dim(train); length(train.y)
## [1] 50 80
## [1] 50
dim(test); length(test.y)
## [1] 14 80
## [1] 14
Let's build a classifier on the train data using a neural network:
library(nnet)
nn.model = nnet(train.y ~ ., data = train, size = 10, decay = 0.01, maxit = 400)
## # weights: 964
## initial value 154.602101
## iter 10 value 22.800897
## iter 20 value 12.948009
## iter 30 value 8.335844
## iter 40 value 6.766485
## iter 50 value 6.196805
## iter 60 value 5.923871
## iter 70 value 5.850125
## iter 80 value 5.798908
## iter 90 value 5.753930
## iter 100 value 5.743480
## iter 110 value 5.736503
## iter 120 value 5.734001
## iter 130 value 5.723961
## iter 140 value 5.709124
## iter 150 value 5.692332
## iter 160 value 5.609125
## iter 170 value 5.567010
## iter 180 value 5.516563
## iter 190 value 5.470570
## iter 200 value 5.459094
## iter 210 value 5.457853
## iter 220 value 5.457762
## iter 230 value 5.457745
## iter 240 value 5.457742
## final value 5.457742
## converged
Brief summary of the model:
print(nn.model)
## a 80-10-14 network with 964 weights
## inputs: `5951` `5874` `5886` `5872` `5937` `5868` `5887` `5960` `5956` `5902` `4320` `4321` `4327` `4324` `4193` `4322` `4325` `4335` `4329` `4192` `267` `3939` `5644` `215` `2846` `279` `3953` `2845` `5630` `3933` `957` `941` `93` `2849` `2850` `4832` `4830` `975` `5131` `3126` `1620` `1561` `793` `3355` `4925` `3398` `4811` `3357` `3996` `1621` `1385` `1384` `1387` `6698` `4044` `4045` `6693` `6692` `1398` `4038` `5369` `5368` `423` `3461` `5568` `6295` `421` `424` `422` `6294` `3703` `3702` `559` `3286` `3832` `184` `6798` `808` `3289` `3701`
## output(s): train.y
## options were - softmax modelling decay=0.01
Predictions
pred.train = predict(nn.model, newdata = train, type = "class")
pred.test = predict(nn.model, newdata = test, type = "class")
Let's calculate train and test errors:
confMtx.train=table(train.y,pred.train)
confMtx.test=table(test.y,pred.test)
confMtx.train
## pred.train
## train.y BREAST CNS COLON K562A-repro K562B-repro LEUKEMIA MCF7A-repro
## BREAST 6 0 0 0 0 0 0
## CNS 0 3 0 0 0 0 0
## COLON 0 0 5 0 0 0 0
## K562A-repro 0 0 0 1 0 0 0
## K562B-repro 0 0 0 0 1 0 0
## LEUKEMIA 0 0 0 0 0 6 0
## MCF7A-repro 0 0 0 0 0 0 1
## MCF7D-repro 0 0 0 0 0 0 0
## MELANOMA 0 0 0 0 0 0 0
## NSCLC 0 0 0 0 0 0 0
## OVARIAN 0 0 0 0 0 0 0
## PROSTATE 0 0 0 0 0 0 0
## RENAL 0 0 0 0 0 0 0
## UNKNOWN 0 0 0 0 0 0 0
## pred.train
## train.y MCF7D-repro MELANOMA NSCLC OVARIAN PROSTATE RENAL UNKNOWN
## BREAST 0 0 0 0 0 0 0
## CNS 0 0 0 0 0 0 0
## COLON 0 0 0 0 0 0 0
## K562A-repro 0 0 0 0 0 0 0
## K562B-repro 0 0 0 0 0 0 0
## LEUKEMIA 0 0 0 0 0 0 0
## MCF7A-repro 0 0 0 0 0 0 0
## MCF7D-repro 1 0 0 0 0 0 0
## MELANOMA 0 5 0 0 0 0 0
## NSCLC 0 0 8 0 0 0 0
## OVARIAN 0 0 0 6 0 0 0
## PROSTATE 0 0 0 0 1 0 0
## RENAL 0 0 0 0 0 5 0
## UNKNOWN 0 0 0 0 0 0 1
confMtx.test
## pred.test
## test.y CNS COLON MELANOMA OVARIAN RENAL
## BREAST 1 0 0 0 0
## CNS 2 0 0 0 0
## COLON 0 2 0 0 0
## K562A-repro 0 0 0 0 0
## K562B-repro 0 0 0 0 0
## LEUKEMIA 0 0 0 0 0
## MCF7A-repro 0 0 0 0 0
## MCF7D-repro 0 0 0 0 0
## MELANOMA 0 0 3 0 0
## NSCLC 0 0 0 0 1
## OVARIAN 0 0 0 0 0
## PROSTATE 0 0 0 0 1
## RENAL 0 0 0 1 3
## UNKNOWN 0 0 0 0 0
cat('\nTrain Set Accuracy: ', mean(pred.train == train.y) * 100, "%")
##
## Train Set Accuracy: 100 %
cat('\nTest Set Accuracy: ', mean(pred.test == test.y) * 100, "%")
##
## Test Set Accuracy: 71.42857 %
The model has performed really well on the train set with 100% accuracy, but not so well on the test data, where accuracy is only about 71.43%. One reason could be that there is still some sparsity in the data, since the number of features exceeds the number of records. However, we should try to improve the model's performance through different methods/options.
Exercise: Try different combinations of K (number of PCs) and n (number of top important variables) in the function imp_var() and see if you can improve the model performance on test data. Also try different algorithms such as SVM to improve the performance on test data.
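As one possible starting point for the exercise, here is a hedged sketch of an SVM fitted on the same reduced data (it assumes the e1071 package is installed; the linear kernel and cost = 1 are arbitrary choices to be tuned, and no accuracy figures are claimed here):
library(e1071)
svm.model = svm(x = train, y = train.y, kernel = "linear", cost = 1)   # SVM classifier on the 80 selected variables
pred.svm = predict(svm.model, newdata = test)                          # predict cancer types for the held-out cell lines
mean(pred.svm == test.y) * 100                                         # test-set accuracy in percent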