Lab_NN_2_RM
Asmi Ariv
2022-10-03
Machine Learning - Neural Network Algorithm
We have seen how a neural network model is built in R using predefined functions such as nnet(). However, it is also important to know how the machine learning works at the back end: what functions are running inside the main function. We will walk through the various stages of the machine learning process for a neural network, creating our own machine learning functions step by step:
- Define Sigmoid Function
- Define Sigmoid Gradient Function
- Define Cost Function using regularization
- Define Gradient function using regularization
- Define Random Initialization of Weight Function
- Define predict function to predict classes for each record
- Define function for mapping response variable into integers
- Define function for mapping response variable back to original classes
- Train the model using all the functions on training data
- Make predictions on train and test data
The algorithm is explained in detail in the video lecture. Please go through the lecture before starting this lab session for a better understanding.
Sigmoid Function
Let’s start with the sigmoid function, which we will use as the activation function.
sigmoid <- function(x){
  x <- as.matrix(x)
  sig <- 1/(1 + exp(-x)) # Element-wise logistic function
  sig
}
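As a quick sanity check (this snippet is illustrative and not part of the original lab), the function should return values close to 0 for large negative inputs, exactly 0.5 at zero, and values close to 1 for large positive inputs:
sigmoid(c(-10, 0, 10)) # approximately 0, exactly 0.5, approximately 1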
Sigmoid Gradient Function
Let’s define the sigmoid gradient function, which we will need in the backpropagation step.
sigmoid.grad <- function(x){
  x <- as.matrix(x)
  sig <- 1/(1 + exp(-x))
  sig.grad <- sig*(1 - sig) # Derivative of the sigmoid: g'(x) = g(x)(1 - g(x))
  sig.grad
}
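As an illustrative check (our own addition, not from the original lab), the analytical gradient should match a finite-difference derivative of the sigmoid; it also peaks at 0.25 when x = 0:
h <- 1e-6
(sigmoid(0.5 + h) - sigmoid(0.5 - h))/(2*h) # Numerical derivative at 0.5
sigmoid.grad(0.5)                           # Analytical derivative; the two should agree closely
sigmoid.grad(0)                             # 0.25, the maximum of the sigmoid gradient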
Cost Function
Let’s define the cost function for a neural net with one input, one hidden, and one output layer. The main purpose is to minimize the cost to obtain the optimal parameter values. This function requires X, y, the unrolled vector of thetas for all layers, the number of units/nodes in the input and hidden layers, the number of classes/categories for the output layer, and the regularization term lambda. It returns the cost.
costFunction <- function(vec_thetas, n.input, n.hidden, n.class, X, y, lambda){
  # n.input = number of nodes/units in the input layer
  # n.hidden = number of nodes/units in the hidden layer
  # n.class = number of categories/classes in the output layer (or response variable)
  m <- nrow(X) # Number of records in the training data
  J <- 0       # Initialize cost
  # vec_thetas is a vector containing all the weights of all the layers.
  # Reshape vec_thetas into the weight matrices theta.h (between input and hidden)
  # and theta.out (between hidden and output), as we are only building a neural net
  # with one hidden layer.
  # We will use vectorized calculations through matrix operations as much as possible to avoid loops.
  theta.h <- vec_thetas[1:(n.hidden*(n.input + 1))] # Between input and hidden layers
  dim(theta.h) <- c(n.hidden, (n.input + 1))
  theta.out <- vec_thetas[(1 + (n.hidden*(n.input + 1))):length(vec_thetas)] # Between hidden and output layers
  dim(theta.out) <- c(n.class, (n.hidden + 1))
  X <- as.matrix(X)        # Converting into a matrix for matrix operations
  X <- cbind(rep(1, m), X) # Adding the bias term
  A <- sigmoid(X %*% t(theta.h)) # Activation of the weighted inputs at the hidden layer
  ma <- nrow(A)             # Number of rows in the hidden layer
  A <- cbind(rep(1, ma), A) # Adding the bias term to the hidden layer
  for (i in 1:n.class){
    J1 <- (1/m)*(-t(as.numeric(y == i)) %*% log(sigmoid(A %*% (t(theta.out)[, i]))) -
                  (t(1 - as.numeric(y == i)) %*% log(1 - sigmoid(A %*% (t(theta.out)[, i])))))
    J <- J + J1
  }
  # Regularization: exclude the bias column (first column) of each weight matrix
  temp.h <- theta.h
  temp.out <- theta.out
  temp.h[, 1] <- 0
  temp.out[, 1] <- 0
  J <- J + (lambda/(2*m))*(sum(temp.h^2) + sum(temp.out^2))
  J
}
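Before moving on, here is a minimal sketch (our own illustration, using made-up toy data, not part of the original lab) of how the cost function can be called once the thetas are unrolled into a single vector:
# Toy data: 5 records, 4 inputs, 3 hidden units, 2 classes (all values are assumptions for illustration)
set.seed(42)
Xs <- matrix(runif(20), 5, 4)
ys <- c(1, 2, 1, 2, 1) # Classes already mapped to integers 1 and 2
vt <- runif(3*(4 + 1) + 2*(3 + 1), -0.1, 0.1) # Unrolled thetas: theta.h (3x5) followed by theta.out (2x4)
costFunction(vt, n.input = 4, n.hidden = 3, n.class = 2, X = Xs, y = ys, lambda = 1)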
Gradient Function
Let’s define the gradient function. This function requires X, y, the unrolled vector of thetas for all layers, the number of units/nodes in the input and hidden layers, the number of classes/categories for the output layer, and the regularization term lambda. It returns the unrolled vector of gradient values for the corresponding theta values.
gradFunction <- function(vec_thetas, n.input, n.hidden, n.class, X, y, lambda){
  # This function returns the unrolled vector of gradient values for the corresponding thetas
  # n.input = number of nodes/units in the input layer
  # n.hidden = number of nodes/units in the hidden layer
  # n.class = number of categories/classes in the output layer (or response variable)
  m <- nrow(X) # Number of records in the training data
  # vec_thetas is a vector containing all the weights of all the layers.
  # Reshape vec_thetas into the weight matrices theta.h (between input and hidden)
  # and theta.out (between hidden and output), as we are only building a neural net
  # with one hidden layer.
  # We will use vectorized calculations through matrix operations as much as possible to avoid loops.
  theta.h <- vec_thetas[1:(n.hidden*(n.input + 1))] # Between input and hidden layers
  dim(theta.h) <- c(n.hidden, (n.input + 1))
  theta.out <- vec_thetas[(1 + (n.hidden*(n.input + 1))):length(vec_thetas)] # Between hidden and output layers
  dim(theta.out) <- c(n.class, (n.hidden + 1))
  theta.h_grad <- matrix(rep(0, length(theta.h)), nrow(theta.h))
  theta.out_grad <- matrix(rep(0, length(theta.out)), nrow(theta.out))
  X <- as.matrix(X)        # Converting into a matrix for matrix operations
  X <- cbind(rep(1, m), X) # Adding the bias term to the input layer
  # Copies with the bias column zeroed out, so the bias weights are not regularized
  temp.h <- theta.h
  temp.out <- theta.out
  temp.h[, 1] <- 0
  temp.out[, 1] <- 0
  # Converting the response variable y into a matrix of m rows and n.class columns
  # (one-hot encoding) for vectorized calculation
  tp <- matrix(1:n.class, 1)             # One-row matrix with columns as classes
  yt <- matrix(rep(tp, m), m, byrow = T) # Matrix of m rows and n.class columns (each column represents one class)
  yt <- 1*(yt == as.vector(y))           # 1 in the column of the actual class of each row, 0 elsewhere
  # Forward propagation
  A1 <- X
  Z2 <- A1 %*% t(theta.h)
  A2 <- sigmoid(Z2)
  ma <- nrow(A2)
  A2 <- cbind(rep(1, ma), A2) # Adding the bias term to the hidden layer
  Z3 <- A2 %*% t(theta.out)
  A3 <- sigmoid(Z3)
  # Backpropagation
  d3 <- A3 - yt                           # Error term of the output layer
  d2 <- (d3 %*% theta.out)*(A2*(1 - A2))  # Error term of the hidden layer (sigmoid gradient from the activations)
  end <- nrow(t(d2))
  theta.h_grad <- theta.h_grad + (t(d2) %*% A1)[2:end, ] # Dropping row 1, as A1 is not connected to the bias term of the hidden layer
  theta.out_grad <- theta.out_grad + t(d3) %*% A2
  theta.h_grad <- (1/m)*(theta.h_grad + lambda*temp.h)
  theta.out_grad <- (1/m)*(theta.out_grad + lambda*temp.out)
  grad <- c(as.vector(theta.h_grad), as.vector(theta.out_grad)) # Unrolling gradient values into a vector
  grad
}
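A standard way to verify a backpropagation implementation is gradient checking: compare gradFunction against a finite-difference approximation built from costFunction. The sketch below is our own addition (the toy data and the expected magnitude are assumptions, not from the original lab):
set.seed(42)
Xs <- matrix(runif(20), 5, 4) # Toy data: 5 records, 4 inputs
ys <- c(1, 2, 1, 2, 1)
vt <- runif(3*(4 + 1) + 2*(3 + 1), -0.1, 0.1) # 3 hidden units, 2 classes
num.grad <- sapply(seq_along(vt), function(j){
  e <- rep(0, length(vt)); e[j] <- 1e-4 # Perturb one theta at a time
  (costFunction(vt + e, 4, 3, 2, Xs, ys, 1) -
   costFunction(vt - e, 4, 3, 2, Xs, ys, 1))/(2e-4)
})
max(abs(num.grad - gradFunction(vt, 4, 3, 2, Xs, ys, 1))) # Should be tiny (around 1e-9) if backprop is correct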
Random Initialization of weights
We need to randomly initialize the weights to break the symmetry and to make learning quick and efficient. We will use the normalized Xavier weight initialization method for this purpose.
The function takes the number of units/nodes in the two adjacent layers that the thetas connect. It returns a matrix of randomized theta values between the two layers.
rand.Weights <- function(n.prev, n.next){
  # n.prev = number of units in the previous layer
  # n.next = number of units in the next layer
  # randTheta is a matrix of dimension (n.next, 1 + n.prev), as the first column is for the intercept/bias term
  small_init <- sqrt(6)/sqrt(n.prev + n.next) # Normalized Xavier range
  randTheta <- matrix(runif(n.next*(1 + n.prev)), n.next)*2*small_init - small_init # Uniform in (-small_init, small_init)
  randTheta
}
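As a quick illustrative check (our own, not from the original lab), the returned matrix should have one column per previous-layer unit plus the bias, and all values should lie within the Xavier range:
w <- rand.Weights(4, 3)
dim(w)   # 3 x 5 (the extra column is for the bias term)
range(w) # Within +/- sqrt(6)/sqrt(4 + 3)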
Prediction function
This function takes the theta matrix for each layer along with X. It returns the predicted class for each record.
predict <- function(theta.h, theta.out, X){
  X <- as.matrix(X)        # Converting into a matrix for matrix operations
  m <- nrow(X)
  X <- cbind(rep(1, m), X) # Adding the bias term
  A <- sigmoid(X %*% t(theta.h))
  ma <- nrow(A)
  A <- cbind(rep(1, ma), A)
  p <- apply(sigmoid(A %*% t(theta.out)), 1, which.max) # Class with the highest activation for each record
  p
}
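The key step is apply(..., 1, which.max), which picks, for each record (row), the class whose output unit has the highest activation. A tiny illustration with made-up numbers:
apply(matrix(c(0.9, 0.2, 0.1, 0.8), 2), 1, which.max) # Returns 1 2: class 1 for row 1, class 2 for row 2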
Mapping response variable into integers (1 to n.class)
This converts all the classes into integers so that our algorithm can run and perform matrix operations. Make sure that the response variable is a factor with the classes as levels.
map_y = function(y){
  classes = levels(y) # Response variable must be a factor with classes as levels
  n.class = length(classes)
  y_map = rep(0, length(y))
  for (i in 1:n.class){
    idx = y %in% classes[i]
    y_map[idx] = i
  }
  list(y_map = y_map, classes = classes)
}
Mapping response variable back to original classes
This converts the mapped y/predicted values back to the original classes, giving the predicted class for each record.
class_y = function(y_map, classes){
  n.class = length(classes)
  y_class = rep(0, length(y_map))
  for (i in 1:n.class){
    idx = y_map %in% i
    y_class[idx] = classes[i]
  }
  list(y_class = y_class, classes = classes)
}
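Here is a small illustrative round trip through map_y and class_y (the toy factor below is our own example, not from the original lab):
yf <- factor(c("benign", "malignant", "benign"))
mapped <- map_y(yf)
mapped$y_map                                  # 1 2 1
class_y(mapped$y_map, mapped$classes)$y_class # "benign" "malignant" "benign"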
Load Data
We will use the same data that we used for lab 1 of the neural network section. Please go through lab 1 to learn about the data.
Let’s load the data:
library(mlbench)
data(BreastCancer)
Let’s get rid of records with missing values:
sum(!complete.cases(BreastCancer))
## [1] 16
BreastCancerComplete = BreastCancer[complete.cases(BreastCancer),]
sum(!complete.cases(BreastCancerComplete))
## [1] 0
sum(is.na(BreastCancerComplete))
## [1] 0
dim(BreastCancerComplete)
## [1] 683  11
BreastCancerNew = BreastCancerComplete[,-1] # Dropping "Id"
names(BreastCancerNew)
## [1] "Cl.thickness"    "Cell.size"       "Cell.shape"      "Marg.adhesion"
## [5] "Epith.c.size"    "Bare.nuclei"     "Bl.cromatin"     "Normal.nucleoli"
## [9] "Mitoses"         "Class"
Our algorithm needs the data in a numeric format. Therefore, we will use the function model.matrix to convert all the categorical variables, except the response variable, into 0s and 1s or numeric values, depending on the variable type.
X = model.matrix(Class ~ ., data = BreastCancerNew)
dim(X)
## [1] 683  81
X = X[,-1]
dim(X)
## [1] 683  80
m = nrow(X)
m
## [1] 683
So X had 81 columns, the first of which is an intercept or bias term. We have removed it because our functions are going to add the bias term themselves.
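To see what model.matrix does, here is a tiny made-up example (our own, not from the original lab): a factor with three levels becomes two 0/1 dummy columns plus an intercept column, which we drop just as above:
df <- data.frame(y = c(1, 0, 1), f = factor(c("a", "b", "c")))
model.matrix(y ~ ., data = df) # Columns: (Intercept), fb, fc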
Response variable
y = BreastCancerNew$Class
levels(y)
## [1] "benign"    "malignant"
Let’s map y into integer values.
Mapping response variable into integers
We convert all the classes into integers so that our algorithm can run and perform matrix operations.
y_mapped = map_y(y)
y_map = y_mapped$y_map
classes = y_mapped$classes
y_map[1:10]
##  [1] 1 1 1 1 1 2 1 1 1 1
y[1:10]
##  [1] benign    benign    benign    benign    benign    malignant benign
##  [8] benign    benign    benign
## Levels: benign malignant
classes
## [1] "benign"    "malignant"
Train and Test Data
Let’s split the data set into train (80%) and test (20%)
set.seed(1)
train.idx = sample(1:m, 0.8*m, replace=F)
train.n = length(train.idx)
test.n = m-train.n
trainX = X[train.idx,]
testX = X[-train.idx,]
trainy = y_map[train.idx]
testy = y_map[-train.idx]
dim(trainX)
## [1] 546  80
dim(testX)
## [1] 137  80
trainy[10:20]
##  [1] 2 1 1 2 1 2 1 1 1 2 1
testy[10:20]
##  [1] 2 1 2 1 2 1 1 1 1 1 1
Layer dimensions
n.input = 80 # 80 input variables
n.hidden = 10 # 10 hidden units
n.class = 2 # 2 classes, "benign" and "malignant"
Initializing random Neural Network Weights
iniTheta.h = rand.Weights(n.input, n.hidden)
iniTheta.out = rand.Weights(n.hidden, n.class)
Unrolling thetas
initial_vec_thetas = c(as.vector(iniTheta.h), as.vector(iniTheta.out))
Training Neural Network
We will try two different optimizers, optim() and nlminb(), to see which one takes less time to train the model.
First, let’s try optim():
lambda = 1
#Compute Weights and cost using optim() as optimizer
#number of iterations = 100
costh <- optim(par=initial_vec_thetas, fn=costFunction, gr=gradFunction,
method="BFGS", n.input=n.input,
n.hidden=n.hidden, n.class=n.class,
X=trainX,y=trainy,lambda=lambda, control = list(maxit=100))
vec_thetas <- costh$par
cost <- costh$value
cat('Cost after last iteration:', '\n', cost)
## Cost after last iteration:
##  0.1638301
Rolling vec_thetas back to theta.h and theta.out
theta.h <- vec_thetas[1:(n.hidden*(n.input + 1))] #Between input and hidden layers
dim(theta.h) <- c(n.hidden, (n.input + 1))
theta.out <- vec_thetas[(1 + (n.hidden * (n.input + 1))):length(vec_thetas)] #Between hidden and output layers
dim(theta.out) <- c(n.class, (n.hidden + 1))
Predictions by the trained model using optim()
Train data set
pred.train = predict(theta.h, theta.out, trainX)
cat('Accuracy on train data set:\n',mean((pred.train == trainy))*100)
## Accuracy on train data set:
##  99.45055
Test data set
pred.test = predict(theta.h, theta.out, testX)
cat('Accuracy on test data set:\n',mean((pred.test == testy))*100)
## Accuracy on test data set:
##  95.62044
Now let’s try nlminb() as the optimizer to get the parameters and cost:
#Compute Weights and cost using nlminb() as optimizer
#number of iterations = 100
costh <- nlminb(start=initial_vec_thetas, objective=costFunction,
gradient=gradFunction, n.input=n.input,
n.hidden=n.hidden, n.class=n.class,
X=trainX,y=trainy,lambda=lambda, control = list(iter.max=100))
vec_thetas <- costh$par
cost <- costh$objective
cat('Cost after last iteration:', '\n', cost)
## Cost after last iteration:
##  0.1638596
Rolling vec_thetas back to theta.h and theta.out
theta.h <- vec_thetas[1:(n.hidden*(n.input + 1))] #Between input and hidden layers
dim(theta.h) <- c(n.hidden, (n.input + 1))
theta.out <- vec_thetas[(1 + (n.hidden * (n.input + 1))):length(vec_thetas)] #Between hidden and output layers
dim(theta.out) <- c(n.class, (n.hidden + 1))
Predictions by the trained model using nlminb()
Train data set
pred.train = predict(theta.h, theta.out, trainX)
cat('Accuracy on train data set:\n',mean((pred.train == trainy))*100)
## Accuracy on train data set:
##  99.45055
Test data set
pred.test = predict(theta.h, theta.out, testX)
cat('Accuracy on test data set:\n',mean((pred.test == testy))*100)
## Accuracy on test data set:
##  95.62044
Both optim() and nlminb() take about the same amount of time on this data, and their accuracies are similar as well. We can see that this simple neural network classifier with just one hidden layer has performed really well.
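If you want to verify the timing claim yourself, a simple sketch (our own suggestion; actual timings will vary by machine) is to wrap each optimizer call in system.time():
system.time(optim(par=initial_vec_thetas, fn=costFunction, gr=gradFunction,
                  method="BFGS", n.input=n.input, n.hidden=n.hidden, n.class=n.class,
                  X=trainX, y=trainy, lambda=lambda, control=list(maxit=100)))
system.time(nlminb(start=initial_vec_thetas, objective=costFunction, gradient=gradFunction,
                   n.input=n.input, n.hidden=n.hidden, n.class=n.class,
                   X=trainX, y=trainy, lambda=lambda, control=list(iter.max=100)))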
Predicted class
Let’s convert the response variable and predicted values back to the original classes.
pred.train.class = class_y(pred.train, classes)$y_class
trainy.class = class_y(trainy, classes)$y_class
pred.test.class = class_y(pred.test, classes)$y_class
testy.class = class_y(testy, classes)$y_class
Let’s look at the confusion matrices:
confMatTrain = table(trainy.class, pred.train.class)
confMatTest = table(testy.class, pred.test.class)
confMatTrain
##             pred.train.class
## trainy.class benign malignant
##    benign       345         2
##    malignant      1       198
cat('\n Accuracy on train data set:\n',mean((pred.train.class == trainy.class))*100, '\n')
##
##  Accuracy on train data set:
##  99.45055
confMatTest
##            pred.test.class
## testy.class benign malignant
##   benign        94         3
##   malignant      3        37
cat('\n Accuracy on test data set:\n',mean((pred.test.class == testy.class))*100, '\n')
##
##  Accuracy on test data set:
##  95.62044
Another example
Let’s quickly run through another example and see how our neural net performs
We will use the iris data set, which comes with base R. We have already built a multinomial logistic regression model using this data (go through lab 1 in the Logistic Regression section for more details).
Train and Test subset
Data dimensions and converting into X and y
dim(iris)
## [1] 150   5
levels(iris$Species)
## [1] "setosa"     "versicolor" "virginica"
X = iris[, -5]
y = iris$Species
Mapping response variable into integers
y_mapped = map_y(y)
y_map = y_mapped$y_map
classes = y_mapped$classes
y_map[c(1:3, 51:53, 101:103)]
## [1] 1 1 1 2 2 2 3 3 3
y[c(1:3, 51:53, 101:103)]
## [1] setosa     setosa     setosa     versicolor versicolor versicolor virginica
## [8] virginica  virginica
## Levels: setosa versicolor virginica
classes
## [1] "setosa"     "versicolor" "virginica"
m = dim(X)[1]
set.seed(1)
train.idx = sample(1:m, 120, replace=F)
trainX = X[train.idx, ]
testX = X[-train.idx, ]
train.y = y_map[train.idx] #Mapped into integer values for respective classes
train.y.class = y[train.idx] #Original classes
test.y = y_map[-train.idx] #Mapped into integer values for respective classes
test.y.class = y[-train.idx] #Original classes
dim(trainX)
## [1] 120   4
dim(testX)
## [1] 30  4
levels(test.y.class)
## [1] "setosa"     "versicolor" "virginica"
levels(train.y.class)
## [1] "setosa"     "versicolor" "virginica"
Training the neural net
Layer dimensions
n.input = 4 # 4 input variables
n.hidden = 10 # 10 hidden units
n.class = 3 # 3 classes: "setosa", "versicolor" and "virginica"
Initializing random Neural Network Weights
iniTheta.h = rand.Weights(n.input, n.hidden)
iniTheta.out = rand.Weights(n.hidden, n.class)
Unrolling thetas
initial_vec_thetas = c(as.vector(iniTheta.h), as.vector(iniTheta.out))
Training Neural Network with optim()
lambda = 1
#Compute Weights and cost using optim() as optimizer
#number of iterations = 100
costh <- optim(par=initial_vec_thetas, fn=costFunction, gr=gradFunction,
method="BFGS", n.input=n.input,
n.hidden=n.hidden, n.class=n.class,
X=trainX,y=train.y,lambda=lambda, control = list(maxit=100))
vec_thetas <- costh$par
cost <- costh$value
cat('Cost after last iteration:', '\n', cost)
## Cost after last iteration:
##  0.6549467
Rolling vec_thetas back to theta.h and theta.out
theta.h <- vec_thetas[1:(n.hidden*(n.input + 1))] #Between input and hidden layers
dim(theta.h) <- c(n.hidden, (n.input + 1))
theta.out <- vec_thetas[(1 + (n.hidden * (n.input + 1))):length(vec_thetas)] #Between hidden and output layers
dim(theta.out) <- c(n.class, (n.hidden + 1))
Prediction accuracy on the train and test sets
Train data set
pred.train = predict(theta.h, theta.out, trainX)
cat('Accuracy on train data set:\n',mean((pred.train == train.y))*100)
## Accuracy on train data set:
##  98.33333
Test data set
pred.test = predict(theta.h, theta.out, testX)
cat('Accuracy on test data set:\n',mean((pred.test == test.y))*100)
## Accuracy on test data set:
##  93.33333
Predicted class
Let’s convert the response variable and predicted values back to original classes
pred.train.class = class_y(pred.train, classes)$y_class
pred.test.class = class_y(pred.test, classes)$y_class
Let’s look at the confusion matrices:
confMatTrain = table(train.y.class, pred.train.class)
confMatTest = table(test.y.class, pred.test.class)
confMatTrain
##              pred.train.class
## train.y.class setosa versicolor virginica
##    setosa         39          0         0
##    versicolor      0         36         2
##    virginica       0          0        43
cat('\n Accuracy on train data set:\n',mean((pred.train.class == train.y.class))*100, '\n')
##
##  Accuracy on train data set:
##  98.33333
confMatTest
##             pred.test.class
## test.y.class setosa versicolor virginica
##   setosa         11          0         0
##   versicolor      0         12         0
##   virginica       0          2         5
cat('\n Accuracy on test data set:\n',mean((pred.test.class == test.y.class))*100, '\n')
##
##  Accuracy on test data set:
##  93.33333
As we can see, our neural net has again performed extremely well.
Exercise: Use different data sets from R, MASS or any other package or source that you like and run this algorithm to see how it works.