Lab_PCA_2_RM
Asmi Ariv
2022-10-08
Machine Learning - Principal Component Analysis
We have seen how PCA is done in R using the predefined function “prcomp()”. However, it is also important to know how the machine learning works at the back end, i.e. what computations run inside the main function. We will walk through the various stages of the machine learning process in PCA and create our own machine learning functions step by step:
- Define Function to compute Eigenvectors and Eigenvalues
- Define Function for normalizing features
- Define Function to calculate PC Scores for projecting data
- Define Function to calculate proportion of variance explained (PVE) and cumulative PVE
- Define Function to recover approximation of original data
- Define Function to draw lines in a plot
- Compute PCs using all the functions on data and reduce the dimensionality
In the video lecture, the algorithm has been explained in detail. Please go through the lecture before starting with this lab session for a better understanding.
pComp Function for eigenvectors and eigenvalues
Let’s start with the principal components function. This function takes the dataset and computes the eigenvectors and eigenvalues of the covariance matrix of the dataset.
pComp <- function(X){
  X = as.matrix(X, dimnames = list(row.names(X), names(X))) #In case it is not in matrix form
  m = nrow(X) #Number of rows
  n = ncol(X) #Number of columns
  U = matrix(rep(0,n*n),n) #To store eigenvectors
  S = matrix(rep(0,n*n),n) #To store eigenvalues
  sigma = (1/m)*(t(X)%*%X) #Covariance matrix of X (assumes centered X; the 1/m scaling does not affect the eigenvectors)
  US = svd(sigma)
  diag(S) = US$d #Eigenvalues
  U = US$u #Eigenvectors
  list(U = U, S = S)
}
Feature normalization
For some data sets, we may like to normalize all the features (due to difference in the scales of variables).
So, we need a function that normalizes all the variables. The function takes X (all the features) and returns normalized features, mean and sd values of all the original features
Normalize <- function(X){
  X = as.matrix(X, dimnames = list(row.names(X), names(X))) #In case it is not in matrix form
  #Creating X.norm to store normalized features
  X.norm = X
  mu = rep(0, dim(X)[2]) #Vector to store mean values of features
  sigma = rep(0, dim(X)[2]) #Vector to store standard deviation values of features
  for (i in 1:dim(X)[2]){
    mu[i] = mean(X[,i])
    sigma[i] = sd(X[,i])
    X.norm[,i] = (X.norm[,i]-mu[i])/sigma[i]
  }
  list(X=X.norm, mu=mu, sigma=sigma)
}
Function to calculate PC Scores for projecting data
This function takes the data X, the eigenvectors U, and K = number of first few eigenvectors to be used for calculating the Principal Component Scores of the first K principal components. These scores are nothing but the projection of the original data onto the space of the rotated axes (i.e. the principal components).
pcScores <- function(X, U, K){
  # X = original data
  # U = matrix of all eigenvectors of covariance matrix of X
  # K = number of first few eigenvectors
  X = as.matrix(X, dimnames = list(row.names(X), names(X))) #In case it is not in matrix form
  if(K>ncol(U)) {K = ncol(U);
    cat(" As your K > number of eigenvectors, setting K = number of eigenvectors, which is", ncol(U))}
  U_k = U[, 1:K]
  pc_scores = X%*%U_k
  pc_scores
}
Function to calculate proportion of variance explained (PVE) and cumulative PVE
This function takes principal component scores and returns PVE and cumulative PVE.
pve = function(pc_scores){
  # pc_scores = Principal Component scores of first K PCs
  stdev = round(apply(pc_scores, 2, sd), 5)
  pc_var = apply(pc_scores, 2, var)
  PVE = round(pc_var/sum(pc_var), 5)
  cum_PVE = round(cumsum(PVE), 5)
  list(stdev=stdev, PVE=PVE, cum_PVE=cum_PVE)
}
Function to recover approximation of original data
The function takes pc_scores (the data projected onto K PCs), K = number of first few eigenvectors, and U = matrix of all eigenvectors. It recovers an approximation of the data in its original higher-dimensional space from the projected data in the reduced dimensions (i.e. the principal components).
data_apx <- function(pc_scores, U, K){
  # pc_scores = Principal Component scores of first K PCs
  # K = number of first few eigenvectors
  # U = matrix of all eigenvectors
  if(K>ncol(U)) {K = ncol(U);
    cat(" As your K > number of eigenvectors, setting K = number of eigenvectors, which is", ncol(U))}
  X_apx = pc_scores[,1:K]%*%t(U[,1:K]) #Project the scores back onto the original axes
  X_apx
}
Function to draw lines in a plot
plotLine <- function(p1, p2, ...){
  lines(c(p1[1],p2[1]), c(p1[2],p2[2]), ...)
}
PCA using functions
Let’s go through Principal Component Analysis with one simulated dataset to understand how our functions are working.
Data
Let’s generate a simple simulated dataset with just two variables, age and salary
set.seed(1); age = rnorm(100, mean=45, sd=10)
salary = 2 + 5*age + rnorm(100, mean=50, sd=20)
plot(age, salary, col="red", pch = 1, cex=2)
X = cbind(age, salary)
As we can see, the variables have very different scales, hence we need to normalize them before proceeding with PCA.
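To see the scale difference concretely (a quick check that is not part of the original lab), we can compare the standard deviations of the two variables:
apply(X, 2, sd) #salary varies on a much larger scale than age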
Normalize data
Let’s normalize the data
Norm = Normalize(X)
mu <- Norm$mu #Means of variables
sigma <- Norm$sigma #standard deviations of variables
X.norm <- Norm$X #Normalized data
Perform PCA
Now let’s run our normalized data through all the functions to perform PCA
pc = pComp(X.norm)
U <- pc$U #eigenvectors of covariance matrix of X.norm
S <- pc$S #eigenvalues of covariance matrix of X.norm
Calculating PC Scores
Let’s calculate the principal component scores of PC1, which is nothing but projecting the original data onto a lower-dimensional space of rotated axes (i.e. PC1), which is one-dimensional in this case.
K = 1
pc.score = pcScores(X.norm, U, K) #PC score using 1st eigenvector and normalized data
Recover approximation of original data
Now let’s recover the approximation of the normalized data by projecting the PC1 scores back onto the original higher-dimensional space, which is two-dimensional here.
X_apx = data_apx(pc.score, U, K) #Approximated normalized data using 1st principal component
Plotting approximation data
Let’s plot the approximated normalized data in the reduced dimension (i.e. one dimension), calculated using the first principal component, against the original normalized data.
plot(X.norm, col='blue', pch=1,
     xlim=c(-3, 4), ylim=c(-3, 3))
points(X_apx[, 1], X_apx[, 2], col='red', pch=1)
for(i in 1:nrow(X.norm)){
  plotLine(X.norm[i,], X_apx[i,])
}
Blue circles are the original normalized data points, while red circles are the projected data.
From the plot, we can see how the two-dimensional data is explained well by the data recovered from just a one-dimensional space (a line). There is a small loss of information, however, because the projected data retains only the information that lies along the directions of the principal components (here, PC1).
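We can quantify this loss with a quick check (a sketch using the objects created above, not part of the original lab): the squared reconstruction error relative to the total sum of squares of the normalized data should be roughly the proportion of variance not captured by PC1.
sum((X.norm - X_apx)^2)/sum(X.norm^2) #Fraction of total variance lost by keeping only PC1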
PCA of USArrests
Let’s perform PCA on the dataset USArrests from base R.
This data set contains statistics, in arrests per 100,000 residents for assault, murder, and rape in each of the 50 US states in 1973. Also given is the percent of the population living in urban areas.
A data frame with 50 observations on 4 variables.
- [,1] Murder numeric Murder arrests (per 100,000)
- [,2] Assault numeric Assault arrests (per 100,000)
- [,3] UrbanPop numeric Percent urban population
- [,4] Rape numeric Rape arrests (per 100,000)
X = as.matrix(USArrests)
dim(X)
## [1] 50  4
head(X)
##            Murder Assault UrbanPop Rape
## Alabama      13.2     236       58 21.2
## Alaska       10.0     263       48 44.5
## Arizona       8.1     294       80 31.0
## Arkansas      8.8     190       50 19.5
## California    9.0     276       91 40.6
## Colorado      7.9     204       78 38.7
There are 50 records and four variables, as explained above. There is also a scale difference in the dataset, so we need to normalize the data.
Normalize data
Let’s normalize the data
Norm = Normalize(X)
mu <- Norm$mu #Means of variables
sigma <- Norm$sigma #standard deviations of variables
X.norm <- Norm$X #Normalized data
Perform PCA
Now let’s run our normalized data through all the functions to perform PCA
pc = pComp(X.norm)
U <- pc$U #eigenvectors of covariance matrix of X.norm
S <- pc$S #eigenvalues of covariance matrix of X.norm
row.names(U) = colnames(X) #Assigning the variable names
dim(U)
## [1] 4 4
So, we have a total of 4 eigenvectors; that means we can get a maximum of 4 principal components.
Calculating PC Scores
Let’s calculate the principal component scores of all PCs, i.e. 1, 2, 3 and 4, which is nothing but projecting the original data onto the space of principal components (the rotated axes).
K = 5 #Let's see what the function does when we set K > number of eigenvectors
pc.score = pcScores(X.norm, U, K) #PC score using eigenvectors and normalized data
##  As your K > number of eigenvectors, setting K = number of eigenvectors, which is 4
Proportion of Variance Explained (PVE)
Let’s calculate PVE of all the PCs
pc_var = pve(pc.score)
stdev = pc_var$stdev
PVE = pc_var$PVE
cum_PVE = pc_var$cum_PVE
pve_scores = rbind(stdev, PVE, cum_PVE)
pve_scores
##            [,1]    [,2]    [,3]    [,4]
## stdev   1.57488 0.99487 0.59713 0.41645
## PVE     0.62006 0.24744 0.08914 0.04336
## cum_PVE 0.62006 0.86750 0.95664 1.00000
As we can see, the first principal component explains 62% of the variance in the data, while the second PC explains 24.7%. The third and fourth PCs explain only 8.9% and 4.3%, respectively.
We are able to capture 86.75% of the variance in the data from PC1 and PC2.
Therefore, we can recover most of the variance in the data just by using first two principal components.
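As a cross-check (a small addition using the S matrix already returned by pComp()), the same proportions can be obtained directly from the eigenvalues of the covariance matrix, since each eigenvalue measures the variance along its principal component; the ratios should match the PVE table above.
eig = diag(S) #Eigenvalues of the covariance matrix of X.norm
round(eig/sum(eig), 5) #PVE from eigenvalues
round(cumsum(eig)/sum(eig), 5) #Cumulative PVE from eigenvalues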
Plotting PVE
As discussed in lab1, let’s plot the PVE and the cumulative PVE.
plot(PVE, type="o", ylab="PVE", xlab="Principal Components", col="blue")
plot(cum_PVE, type="o", ylim=c(0.5, 1.0), ylab="Cumulative PVE", xlab="Principal Components", col="brown3")
Looking at the elbow of the curve, the first 2 PCs are useful in explaining a significant amount of variance, after which the remaining PCs add little.
So, as discussed earlier from the PVE table, when we look at Cumulative PVE, we can see that 86.75% of the variance is explained by 2 PCs, while 62% by PC1 alone.
Plotting and analysing first 2 PCs
We can use a nice function biplot() to plot the first two principal components.
biplot(pc.score[,1:2], U[,1:2])
In this map, black state names represent scores and red arrows represent the loadings of the first two principal components (the two axes on the top and right). For example, the loading for Rape is around -0.54 in PC1 and around -0.17 in PC2, so its coordinates in the plot are (-0.54, -0.17).
U[,1:2]
##                [,1]       [,2]
## Murder   -0.5358995  0.4181809
## Assault  -0.5831836  0.1879856
## UrbanPop -0.2781909 -0.8728062
## Rape     -0.5434321 -0.1673186
The first principal component places almost equal weights on Murder, Assault and Rape, and a smaller weight on UrbanPop. Therefore, PC1 represents a measure of overall serious crime rates rather than urbanization.
The second principal component places very high weight on UrbanPop and lower weights on other variables. Therefore, PC2 represents Urbanization of the state rather than crime rates.
So, the crime-related variables (Murder, Assault and Rape) are close to each other in the biplot, while UrbanPop is far away from them.
Hence, the crime-related variables are correlated, and states with a high murder rate also tend to have high assault and rape rates.
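We can verify this directly from the correlation matrix of the normalized data (a quick check, not part of the original lab):
round(cor(X.norm), 2) #Murder, Assault and Rape are strongly correlated with each other; UrbanPop correlates more weakly with them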
To analyse the biplot, we must understand that if the PC scores of the states have the same signs (- or +) as the weights of the variables in a principal component, then their interpretation is the same as that of the PC (that is, they are positively associated with it). However, we also need to consider the absolute values of the scores.
For example, states such as California, Nevada and Florida have high crime rates: their PC1 scores are large in absolute value and negative in sign, matching the negative signs of the crime variables’ weights on PC1 (and we have already seen that PC1 represents crime rates). Similarly, states such as North Dakota and South Dakota have large positive scores on PC1, the opposite sign to the crime weights, and therefore have low crime rates.
Another way of looking at the plot is through the arrows, which show the direction of each variable along the principal components. Since PC1 corresponds to a measure of crime rates and the crime arrows point to the left, crime rates increase as states move from right to left.
PC2 corresponds to UrbanPop (which has a negative weight on PC2); hence, urbanization increases as states move from top to bottom. Therefore, states like California and New Jersey have a high level of urbanization, while states like Mississippi and North Carolina have a low level of urbanization.
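We can support this reading of the biplot by listing the states with the most extreme scores on each component (a quick check using the pc.score matrix computed above; not part of the original lab):
head(sort(pc.score[, 1]), 3) #Most negative PC1 scores: high-crime states
tail(sort(pc.score[, 1]), 3) #Most positive PC1 scores: low-crime states
head(sort(pc.score[, 2]), 3) #Most negative PC2 scores: highly urbanized states
tail(sort(pc.score[, 2]), 3) #Most positive PC2 scores: less urbanized states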
Hence, for interpreting the data using principal components, the signs (- or +) and absolute values of both the variable weights in a PC and the PC scores of the records are important factors.
This is the way principal components work. They club together all the correlated variables to form a single dimension leading to dimensionality reduction. In this case PC1 clubbed three correlated variables, viz. murder, assault and rape to represent overall serious crime rate.
These PCs are also orthogonal to each other by construction, since the eigenvectors of a covariance matrix are orthogonal. Because each PC mainly captures a set of correlated variables, two PCs tend to capture two different groups of variables: variables are correlated within a group but not across groups. In our example, PC1 mainly represents the crime-related variables while PC2 mainly represents the urban population variable.
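We can confirm the orthogonality numerically (a small check, not part of the original lab): the eigenvector matrix should satisfy t(U) %*% U = I, and the PC scores should be uncorrelated.
round(t(U)%*%U, 5) #Should be the 4x4 identity matrix
round(cor(pc.score), 5) #Off-diagonal correlations should be ~0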
Validate our results with R’s prcomp()
Let’s validate our results with that of prcomp() function of base R. We will use this function on USArrests dataset and construct principal components and compare the results thus obtained with results produced by our functions. This way we will know that the functions we have written are correct.
pca = prcomp(USArrests, scale=TRUE) #Scale must be true as we are using normalized data.
summary(pca)
## Importance of components:
##                           PC1    PC2     PC3     PC4
## Standard deviation     1.5749 0.9949 0.59713 0.41645
## Proportion of Variance 0.6201 0.2474 0.08914 0.04336
## Cumulative Proportion  0.6201 0.8675 0.95664 1.00000
As we can see, both results are the same.
Let’s compare our eigenvectors with those of prcomp().
pca$rotation #The function has named it rotation
##                 PC1        PC2        PC3         PC4
## Murder   -0.5358995  0.4181809 -0.3412327  0.64922780
## Assault  -0.5831836  0.1879856 -0.2681484 -0.74340748
## UrbanPop -0.2781909 -0.8728062 -0.3780158  0.13387773
## Rape     -0.5434321 -0.1673186  0.8177779  0.08902432
U #Eigenvectors calculated by our function
##                [,1]       [,2]       [,3]        [,4]
## Murder   -0.5358995  0.4181809 -0.3412327  0.64922780
## Assault  -0.5831836  0.1879856 -0.2681484 -0.74340748
## UrbanPop -0.2781909 -0.8728062 -0.3780158  0.13387773
## Rape     -0.5434321 -0.1673186  0.8177779  0.08902432
Again, the results are the same.
Let’s also compare the PC scores.
pca$x[1:4,] #We will use only the first 4 rows to save space
##                  PC1        PC2         PC3        PC4
## Alabama   -0.9756604  1.1220012 -0.43980366  0.1546966
## Alaska    -1.9305379  1.0624269  2.01950027 -0.4341755
## Arizona   -1.7454429 -0.7384595  0.05423025 -0.8262642
## Arkansas   0.1399989  1.1085423  0.11342217 -0.1809736
pc.score[1:4,]
##                 [,1]       [,2]        [,3]       [,4]
## Alabama   -0.9756604  1.1220012 -0.43980366  0.1546966
## Alaska    -1.9305379  1.0624269  2.01950027 -0.4341755
## Arizona   -1.7454429 -0.7384595  0.05423025 -0.8262642
## Arkansas   0.1399989  1.1085423  0.11342217 -0.1809736
The results match again.
Therefore, we can say that all our functions are working fine.
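One caveat worth noting (an addition to the original lab): the sign of an eigenvector is arbitrary, so svd() and prcomp() may return some columns with flipped signs. Here they happen to agree exactly, but a sign-agnostic comparison is a safer general check:
all.equal(abs(unname(U)), abs(unname(pca$rotation))) #Compare eigenvectors up to sign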
Recover approximation of original data
Now let’s recover the approximation of the normalized data by projecting the scores of only 2 PCs back onto the original higher-dimensional space, which is four-dimensional here.
X_apx = data_apx(pc.score, U, 2) #Approximated normalized data using only 1st and 2nd principal components
Plotting approximation data
Let’s plot the approximated normalized data in the reduced dimension (i.e. two dimensions), calculated using the first and second principal components, against the original normalized data.
plot(X.norm[, 1:2], col='blue', pch=1,
     xlim=c(-3, 4), ylim=c(-3, 3))
points(X_apx[, 1], X_apx[, 2], col='red', pch=1)
for(i in 1:nrow(X.norm)){
  plotLine(X.norm[i,], X_apx[i,])
}
Blue circles are the original normalized data points, while red circles are the projected data.
From the plot, we can see how the four-dimensional data is explained well by the data recovered from just a two-dimensional space. There is a small loss of information, however, because the projected data retains only the information that lies along the directions of the principal components (here, PC1 and PC2).
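As with the simulated example, we can quantify the loss with a quick check (not part of the original lab): the relative squared reconstruction error should be roughly 1 minus the cumulative PVE of the first two components, i.e. about 0.1325 here.
sum((X.norm - X_apx)^2)/sum(X.norm^2) #Fraction of total variance lost by keeping only PC1 and PC2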
Exercise: Use some other datasets (such as iris dataset) from R or other sources and run the PCA using these functions and analyze the data based on the PCs.
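For instance, a possible starting point for the iris dataset (a sketch only; the analysis of the resulting PCs is left as the exercise):
X.iris = as.matrix(iris[, 1:4]) #Use the four numeric columns of iris
iris.norm = Normalize(X.iris)$X #Normalize the features
pc.iris = pComp(iris.norm) #Eigenvectors and eigenvalues
iris.scores = pcScores(iris.norm, pc.iris$U, 4) #Scores on all four PCs
pve(iris.scores) #Inspect PVE and cumulative PVE before choosing K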