
Machine Learning - Building Principal Component Analysis algorithm from scratch

 

Machine Learning - Principal Component Analysis

We have seen how PCA is done in R using the predefined function “prcomp()”. However, it is also important to understand how the algorithm works behind the scenes: what computations run inside the main function. We will walk through the various stages of the PCA process and build our own functions step by step:

  1. Define a function to compute eigenvectors and eigenvalues
  2. Define a function to normalize features
  3. Define a function to calculate PC scores for projecting data
  4. Define a function to calculate the proportion of variance explained (PVE) and cumulative PVE
  5. Define a function to recover an approximation of the original data
  6. Define a function to draw lines in a plot
  7. Compute PCs using all the functions on a dataset and reduce its dimensionality

In the video lecture, the algorithm has been explained in detail. Please go through the lecture before starting with this lab session for a better understanding.

pComp Function for eigenvectors and eigenvalues

Let’s start with the principal components function, pComp(). It takes the dataset and computes the eigenvectors and eigenvalues of its covariance matrix.

pComp <- function(X){
  
  X = as.matrix(X, dimnames = list(row.names(X), names(X))) #In case, it is not in a matrix form
  
  
  m = nrow(X) #Number of rows
  n = ncol(X) #Number of columns
  
  U = matrix(rep(0,n*n),n) #To store eigenvectors
  S = matrix(rep(0,n*n),n) #To store eigenvalues
  
  
  sigma = (1/m)*(t(X)%*%X) #Covariance matrix of X (assuming the columns of X are centered, e.g. normalized)
  
  US = svd(sigma) #Singular value decomposition of the symmetric covariance matrix
  
  diag(S) = US$d  #Eigenvalues
  U = US$u        #Eigenvectors
  list(U = U, S = S)
  
}
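
As a quick sanity check (a small sketch on made-up data, not part of the lab), the eigenvectors returned by pComp() should match those of base R’s eigen() applied to the same covariance matrix, up to sign:

set.seed(123)
X_demo = scale(matrix(rnorm(20), ncol = 2))          #Toy dataset: 10 observations, 2 centered and scaled features
res = pComp(X_demo)
eig = eigen((1/nrow(X_demo)) * t(X_demo) %*% X_demo) #Eigen-decomposition of the same covariance matrix
round(abs(res$U) - abs(eig$vectors), 10)             #Should be (near) zero: same eigenvectors up to sign
round(diag(res$S) - eig$values, 10)                  #Should be (near) zero: same eigenvalues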

Feature normalization

For some datasets, we may want to normalize all the features (due to differences in the scales of the variables).

So, we need a function that normalizes all the variables. The function takes X (all the features) and returns the normalized features along with the mean and standard deviation of each original feature.

Normalize <- function(X){
  
  X = as.matrix(X, dimnames = list(row.names(X), names(X))) #In case, it is not in a matrix form
  
  #Creating X.norm to store normalized features
  X.norm = X
  
  mu = rep(0, dim(X)[2]) #Vector to store mean values of features
  sigma = rep(0, dim(X)[2]) #Vector to store standard deviation values of features
  
  for (i in 1:dim(X)[2]){
    mu[i] = mean(X[,i])
    sigma[i] = sd(X[,i])
    X.norm[,i] = (X.norm[,i]-mu[i])/sigma[i]
  }
  
 list(X=X.norm, mu=mu, sigma=sigma) 
}
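
As a cross-check (a small sketch, assuming the Normalize() function defined above), base R’s scale() should produce the same normalized values, means and standard deviations:

X_demo = matrix(c(10, 20, 30, 40, 5, 6, 7, 8), ncol = 2) #Toy data with two columns on different scales
norm_out = Normalize(X_demo)
scaled = scale(X_demo)                                   #Base R: center by the mean, scale by the standard deviation
all.equal(as.numeric(norm_out$X), as.numeric(scaled))                #TRUE
all.equal(norm_out$mu, as.numeric(attr(scaled, "scaled:center")))    #TRUE
all.equal(norm_out$sigma, as.numeric(attr(scaled, "scaled:scale")))  #TRUE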

Function to calculate PC Scores for projecting data

This function takes the data X, the eigenvectors U, and K = the number of leading eigenvectors to use for calculating the principal component scores of the first K principal components. These scores are nothing but the projection of the original data onto the space spanned by the rotated axes (i.e. the principal components).

pcScores <- function(X, U, K){
  
  # X = original data
  # U = matrix of all eigenvectors of covariance matrix of X
  # K = number of first few eigenvectors
  
  X = as.matrix(X, dimnames = list(row.names(X), names(X))) #In case, it is not in a matrix form
  
  if(K>ncol(U)) {K = ncol(U); 
              cat(" As your K >  number of eigenvectors, setting K = number of eigenvectors, which is", ncol(U))}
  
  U_k = U[, 1:K]
  
  pc_scores = X%*%U_k
  
  pc_scores
}
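
For intuition (a tiny illustrative sketch, separate from the lab data), the PC score of a single centered observation along one eigenvector is just a dot product:

u1 = c(1, 1)/sqrt(2) #A hypothetical first eigenvector (unit length)
x = c(2, 1)          #A centered observation with two features
sum(x*u1)            #Its PC1 score: the coordinate of x along u1 (about 2.12)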

Function to calculate proportion of variance explained (PVE) and cumulative PVE

This function takes principal component scores and returns PVE and cumulative PVE.

pve = function(pc_scores){
  # pc_scores = Principal Component scores of first K PCs
  
  stdev = round(apply(pc_scores, 2, sd), 5) #Standard deviation of each PC
  
  pc_var = apply(pc_scores, 2, var) #Variance of each PC
  
  PVE = round(pc_var/sum(pc_var), 5) #Proportion of variance explained
  
  cum_PVE = round(cumsum(PVE), 5) #Cumulative PVE
  
  list(stdev=stdev, PVE=PVE, cum_PVE=cum_PVE)
  
}
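
Equivalently (a sketch, assuming the S matrix returned by pComp() above), when all the PCs are kept the PVE can be read straight from the eigenvalues, since the variance along each principal component is proportional to its eigenvalue:

pveFromEigen <- function(S){ #Hypothetical helper, not used later in this lab
  ev = diag(S)               #Eigenvalues of the covariance matrix
  PVE = ev/sum(ev)           #Proportion of variance explained by each PC
  list(PVE = round(PVE, 5), cum_PVE = round(cumsum(PVE), 5))
}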

Function to recover approximation of original data

The function takes pc_scores (the data projected onto K PCs), K = the number of leading eigenvectors, and U = the matrix of all eigenvectors. It recovers an approximation of the data in its original higher-dimensional space from the projected data in the reduced dimensions (i.e. the principal components).

data_apx <- function(pc_scores, U, K){
  
  # pc_scores = Principal Component scores of first K PCs
  # K = number of first few eigenvectors
  # U = matrix of all eigenvectors
  
  if(K>ncol(U)) {K = ncol(U); 
                  cat(" As your K >  number of eigenvectors, setting K = number of eigenvectors, which is", ncol(U))}
  
  X_apx = pc_scores[, 1:K, drop=FALSE]%*%t(U[, 1:K, drop=FALSE]) #Project the scores back onto the original axes
  
  X_apx
}
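
A useful property to keep in mind (a quick sketch on toy data, assuming the functions defined above): if we keep all the eigenvectors (K = ncol(U)), the reconstruction is exact up to floating-point error because U is orthonormal; information is lost only when K is smaller.

set.seed(42)
X_demo = scale(matrix(rnorm(30), ncol = 3))             #Toy centered data with 3 features
pc_demo = pComp(X_demo)
scores_demo = pcScores(X_demo, pc_demo$U, 3)            #Keep all 3 PCs
max(abs(data_apx(scores_demo, pc_demo$U, 3) - X_demo))  #~0: the data are recovered exactly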

Function to draw lines in a plot

plotLine <- function(p1, p2, ...){
  #Draw a line segment from point p1 = (x1, y1) to point p2 = (x2, y2) on an existing plot
  lines(c(p1[1],p2[1]),c(p1[2],p2[2]),...)
}

PCA using functions

Let’s go through Principal Component Analysis with a simulated dataset to understand how our functions work.

Data

Let’s generate a simple simulated dataset with just two variables, age and salary

set.seed(1); age = rnorm(100, mean=45, sd=10)
salary = 2 + 5*age + rnorm(100, mean=50, sd=20)
plot(age, salary, col="red", pch = 1, cex=2)

X = cbind(age, salary)

As we can see, the variables have very different scales; hence we need to normalize them before proceeding with PCA.

Normalize data

Let’s normalize the data

Norm = Normalize(X)

mu <- Norm$mu #Means of variables
sigma <- Norm$sigma #standard deviations of variables

X.norm <- Norm$X #Normalized data

Perform PCA

Now let’s run our normalized data through all the functions to perform PCA

pc = pComp(X.norm)

U <- pc$U #eigenvectors of covariance matrix of X.norm 
S <- pc$S #eigenvalues of covariance matrix of X.norm 

Calculating PC Scores

Let’s calculate the principal component scores for PC1, which is nothing but projecting the original data onto the lower-dimensional space of the rotated axis (i.e. PC1), which is one-dimensional in this case.

K = 1
pc.score = pcScores(X.norm, U, K) #PC score using 1st eigenvector and normalized data

Recover approximation of original data

Now let’s recover the approximation of the normalized data by projecting the PC1 scores back onto the original high-dimensional space, i.e. two dimensions.

X_apx  = data_apx(pc.score, U, K) #Approximated normalized data using 1st principal component

Plotting approximation data

Let’s plot the approximated normalized data in the reduced dimension (i.e. one dimension), calculated using the first principal component, against the original normalized data.

plot(X.norm, col='blue', pch=1,
     xlim=c(-3, 4), ylim=c(-3, 3))

points(X_apx[, 1], X_apx[, 2], col='red', pch=1) #Add the projected (approximated) points to the plot

for(i in 1:nrow(X.norm)){
  
  plotLine(X.norm[i,], X_apx[i,])
  
}

Blue circles are the original normalized data points, while red circles are the projected data.

From the plot, we can see how the two-dimensional data is explained well by the data recovered from just a one-dimensional space (a line). There is, however, a small loss of information, since the projected data retains only the information lying along the directions of the retained principal components (here, PC1).
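
We can also quantify this loss (a quick sketch using the objects computed above): the fraction of the total variation retained by the one-dimensional projection, which equals the PVE that PC1 would get if we computed the scores of all the PCs.

mse = mean((X.norm - X_apx)^2) #Average squared reconstruction error
tot = mean(X.norm^2)           #Average squared value of the centered (normalized) data
1 - mse/tot                    #Fraction of the total variation retained by the PC1 projection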

PCA of USArrests

Let’s perform PCA on the dataset USArrests from base R.

This data set contains statistics, in arrests per 100,000 residents for assault, murder, and rape in each of the 50 US states in 1973. Also given is the percent of the population living in urban areas.

A data frame with 50 observations on 4 variables.

  • [,1] Murder numeric Murder arrests (per 100,000)
  • [,2] Assault numeric Assault arrests (per 100,000)
  • [,3] UrbanPop numeric Percent urban population
  • [,4] Rape numeric Rape arrests (per 100,000)
X = as.matrix(USArrests)
dim(X)
## [1] 50  4
head(X)
##            Murder Assault UrbanPop Rape
## Alabama      13.2     236       58 21.2
## Alaska       10.0     263       48 44.5
## Arizona       8.1     294       80 31.0
## Arkansas      8.8     190       50 19.5
## California    9.0     276       91 40.6
## Colorado      7.9     204       78 38.7

There are 50 records and four variables, as explained above. The variables are also on different scales, so we need to normalize the data.

Normalize data

Let’s normalize the data

Norm = Normalize(X)

mu <- Norm$mu #Means of variables
sigma <- Norm$sigma #standard deviations of variables

X.norm <- Norm$X #Normalized data

Perform PCA

Now let’s run our normalized data through all the functions to perform PCA

pc = pComp(X.norm)

U <- pc$U #eigenvectors of covariance matrix of X.norm 
S <- pc$S #eigenvalues of covariance matrix of X.norm 

row.names(U) = colnames(X) #Assigning the variable names

dim(U)
## [1] 4 4

So, we have a total of 4 eigenvectors; that means we can get a maximum of 4 principal components.

Calculating PC Scores

Let’s calculate the principal component scores of all PCs, i.e. 1, 2, 3 and 4, which is nothing but projecting the original data onto the space of principal components (the rotated axes).

K = 5  #Let's see what the function does when we set K > number of eigenvectors
pc.score = pcScores(X.norm, U, K) #PC score using eigenvectors and normalized data
##  As your K >  number of eigenvectors, setting K = number of eigenvectors, which is 4

Proportion of Variance Explained (PVE)

Let’s calculate PVE of all the PCs

pc_var = pve(pc.score)

stdev = pc_var$stdev
PVE = pc_var$PVE
cum_PVE = pc_var$cum_PVE

pve_scores = rbind(stdev, PVE, cum_PVE)
pve_scores
##            [,1]    [,2]    [,3]    [,4]
## stdev   1.57488 0.99487 0.59713 0.41645
## PVE     0.62006 0.24744 0.08914 0.04336
## cum_PVE 0.62006 0.86750 0.95664 1.00000

As we can see, the first principal component explains 62% of the variance in the data, while the second PC explains 24.7%. The third and fourth PCs explain only 8.9% and 4.3%, respectively.

Together, PC1 and PC2 account for 86.75% of the variance in the data.

Therefore, we can recover most of the variance in the data using just the first two principal components.
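
A common rule of thumb (a small sketch, using the cum_PVE vector computed above and a hypothetical 85% threshold) is to keep the smallest number of PCs whose cumulative PVE crosses the chosen threshold:

K_chosen = which(cum_PVE >= 0.85)[1] #Smallest K explaining at least 85% of the variance
K_chosen                             #2 for this dataset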

Plotting PVE

As discussed in Lab 1, let’s plot the PVE and the cumulative PVE.

plot(PVE, type="o", ylab="PVE", xlab="Principal Components", col="blue")

plot(cum_PVE, type="o", ylim=c(0.5, 1.0), ylab="Cumulative PVE", xlab="Principal Components", col="brown3")

Looking at the elbow of the curve, the first two PCs explain a significant amount of the variance, after which the remaining PCs add little.

As seen earlier in the PVE table, the cumulative PVE shows that 86.75% of the variance is explained by the first two PCs, while PC1 alone explains 62%.

Plotting and analysing first 2 PCs

We can use a nice function biplot() to plot the first two principal components.

biplot(pc.score[,1:2], U[,1:2])

In this plot, black state names represent the scores, and red arrows represent the loadings of the first two principal components (read against the two axes on the top and right). For example, the loading for Rape is around -0.54 on PC1 and around -0.17 on PC2, so its coordinates in the plot are (-0.54, -0.17).

U[,1:2]
##                [,1]       [,2]
## Murder   -0.5358995  0.4181809
## Assault  -0.5831836  0.1879856
## UrbanPop -0.2781909 -0.8728062
## Rape     -0.5434321 -0.1673186

The first principal component places almost equal weights on Murder, Assault and Rape, and a smaller weight on UrbanPop. Therefore, PC1 represents a measure of overall serious crime rates rather than urbanization.

The second principal component places very high weight on UrbanPop and lower weights on other variables. Therefore, PC2 represents Urbanization of the state rather than crime rates.

So, the crime-related variables (murder, assault and rape) are close to each other, while UrbanPop is far away from them.

Hence, the crime-related variables are correlated: states with a high murder rate also tend to have high assault and rape rates.

To analyse the biplot, we must understand that if the PC scores of the states have the same sign (- or +) as the weights of the variables in a principal component, then their interpretation follows that of the PC (i.e. they are positively associated with it). We also need to consider the absolute values of the scores.

For example, states such as California, Nevada and Florida have high crime rates: their scores on PC1 are large in absolute value and negative in sign, matching the negative weights of the crime variables on PC1 (and we have already seen that PC1 represents crime rates). Similarly, states such as North Dakota and South Dakota have large positive scores on PC1 and therefore have low crime rates.

Another way of reading the plot is through the arrows, which show the directions of the variables. Since PC1 corresponds to a measure of crime rates and the crime-variable arrows point to the left, crime rates increase as we move from right to left across the states.

PC2 corresponds to UrbanPop (which has a negative weight on PC2); hence urbanization increases as we move from top to bottom. Therefore, states like California and New Jersey have a high level of urbanization, while states like Mississippi and North Carolina have a low level of urbanization.

Hence, for interpreting the data using principal components, both the sign (- or +) and the absolute values of the variable weights in each PC and of the PC scores of the records are important.

This is how principal components work: they combine correlated variables into a single dimension, leading to dimensionality reduction. In this case, PC1 combined three correlated variables, viz. murder, assault and rape, to represent the overall serious crime rate.

These PCs are also orthogonal to each other. Since each PC captures a set of correlated variables, two PCs capture two different groups of variables; variables are correlated within a group but not across groups. In our example, PC1 represents the crime-related variables while PC2 represents the urban-population variable; therefore, they are orthogonal.
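
We can verify this orthogonality numerically (a quick sketch, using the U and pc.score objects computed above): the eigenvectors are orthonormal, and the PC scores are uncorrelated with one another.

round(t(U)%*%U, 10)      #Identity matrix: the eigenvectors are orthonormal
round(cor(pc.score), 10) #Off-diagonal entries ~0: the PC scores are uncorrelated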

Validate our results with R’s prcomp()

Let’s validate our results against the prcomp() function from base R. We will run it on the USArrests dataset, construct the principal components, and compare its results with those produced by our functions. This way we will know that the functions we have written are correct.

pca = prcomp(USArrests, scale=TRUE) #Scale must be true as we are using normalized data.
summary(pca)
## Importance of components:
##                           PC1    PC2     PC3     PC4
## Standard deviation     1.5749 0.9949 0.59713 0.41645
## Proportion of Variance 0.6201 0.2474 0.08914 0.04336
## Cumulative Proportion  0.6201 0.8675 0.95664 1.00000

As we can see, both sets of results are the same.

Let’s compare our eigenvectors with those of prcomp().

pca$rotation #The function has named it rotation
##                 PC1        PC2        PC3         PC4
## Murder   -0.5358995  0.4181809 -0.3412327  0.64922780
## Assault  -0.5831836  0.1879856 -0.2681484 -0.74340748
## UrbanPop -0.2781909 -0.8728062 -0.3780158  0.13387773
## Rape     -0.5434321 -0.1673186  0.8177779  0.08902432
U #Eigenvectors calculated by our function
##                [,1]       [,2]       [,3]        [,4]
## Murder   -0.5358995  0.4181809 -0.3412327  0.64922780
## Assault  -0.5831836  0.1879856 -0.2681484 -0.74340748
## UrbanPop -0.2781909 -0.8728062 -0.3780158  0.13387773
## Rape     -0.5434321 -0.1673186  0.8177779  0.08902432

Again, the results are the same.

Let’s also compare the PC scores.

pca$x[1:4,] #We will show only the first 4 rows to save space
##                 PC1        PC2         PC3        PC4
## Alabama  -0.9756604  1.1220012 -0.43980366  0.1546966
## Alaska   -1.9305379  1.0624269  2.01950027 -0.4341755
## Arizona  -1.7454429 -0.7384595  0.05423025 -0.8262642
## Arkansas  0.1399989  1.1085423  0.11342217 -0.1809736
pc.score[1:4,]
##                [,1]       [,2]        [,3]       [,4]
## Alabama  -0.9756604  1.1220012 -0.43980366  0.1546966
## Alaska   -1.9305379  1.0624269  2.01950027 -0.4341755
## Arizona  -1.7454429 -0.7384595  0.05423025 -0.8262642
## Arkansas  0.1399989  1.1085423  0.11342217 -0.1809736

The results match again.

Therefore, we can say that all our functions are working correctly.

Recover approximation of original data

Now let’s recover the approximation of the normalized data by projecting the scores of only the first two PCs back onto the original high-dimensional space, i.e. four dimensions.

X_apx  = data_apx(pc.score, U, 2) #Approximated normalized data using only 1st and 2nd principal components

Plotting approximation data

Let’s plot the approximated normalized data in the reduced dimension (i.e. two dimensions), calculated using the first and second principal components, against the original normalized data. Since the data have four variables, we plot only the first two (Murder and Assault).

plot(X.norm[, 1:2], col='blue', pch=1,
     xlim=c(-3, 4), ylim=c(-3, 3))

points(X_apx[, 1], X_apx[, 2], col='red', pch=1) #Add the projected (approximated) points to the plot

for(i in 1:nrow(X.norm)){
  
  plotLine(X.norm[i,], X_apx[i,])
  
}

Blue circles are the original normalized data points, while red circles are the projected data.

From the plot, we can see how the four-dimensional data is explained well by the data recovered from just a two-dimensional space. There is, however, a small loss of information, since the projected data retains only the information lying along the directions of the retained principal components (here, PC1 and PC2).

Exercise: Use other datasets (such as the iris dataset) from R or other sources, run PCA using these functions, and analyze the data based on the PCs.

