
Machine Learning - Building Neural Network Algorithm from scratch

We have seen how a neural network model is built in R using predefined functions such as nnet(). However, it is also important to understand how the machine learning works at the back end, that is, which functions run inside the main function. We will walk through the various stages of the machine learning process for a neural network and create our own functions step by step:

  1. Define Sigmoid Function
  2. Define Sigmoid Gradient Function
  3. Define Cost Function using regularization
  4. Define Gradient function using regularization
  5. Define Random Initialization of Weight Function
  6. Define predict function to predict classes for each record
  7. Define function for mapping response variable into integers
  8. Define function for mapping response variable back to original classes
  9. Train the model using all the functions on training data
  10. Make predictions on train and test data

In the video lecture, the algorithm is explained in detail. Please go through the lecture before starting this lab session for a better understanding.

Sigmoid Function

Let’s start with the sigmoid function, which we will use as the activation function.

sigmoid <- function(x){
  x <- as.matrix(x)        #Ensure matrix form for element-wise operations
  
  sig <- 1/(1 + exp(-x))   #Element-wise sigmoid
  
  sig
  
}
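As a quick illustration (not part of the lab code), the function maps large negative inputs towards 0, zero to 0.5 and large positive inputs towards 1:

#Illustrative call
sigmoid(c(-5, 0, 5))   #approximately 0.0067 0.5000 0.9933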

Sigmoid Gradient Function

Let’s define the sigmoid gradient function, which we will need for the back-propagation step.

sigmoid.grad <- function(x){
  x <- as.matrix(x)       #Ensure matrix form
  
  s <- 1/(1 + exp(-x))    #Sigmoid
  sig.grad <- s*(1 - s)   #Derivative: sigmoid(x)*(1 - sigmoid(x))
  
  sig.grad
  
}
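Since the derivative of the sigmoid satisfies sigmoid'(x) = sigmoid(x)*(1 - sigmoid(x)), it peaks at x = 0 with the value 0.25. An illustrative call (not part of the lab code):

#Illustrative check of the gradient
sigmoid.grad(0)             #approximately 0.25
sigmoid.grad(c(-5, 0, 5))   #approximately 0.0066 0.2500 0.0066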

Cost Function

Let’s define the cost function for a neural net with one input layer, one hidden layer and one output layer. The main purpose is to minimize this cost to obtain the optimum parameter values. The function requires X, y, the unrolled vector of thetas for all layers, the number of units/nodes in the input and hidden layers, the number of classes/categories in the output layer, and the regularization term lambda. It returns the cost.
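For reference, the quantity the code below computes is the standard regularized cross-entropy cost (the notation here is ours, not from the lecture): with K classes and h(x)_k denoting the output-layer activation for class k,

$$J(\Theta) = \frac{1}{m}\sum_{i=1}^{m}\sum_{k=1}^{K}\Big[-y^{(i)}_k \log h_\Theta(x^{(i)})_k - \big(1 - y^{(i)}_k\big)\log\big(1 - h_\Theta(x^{(i)})_k\big)\Big] + \frac{\lambda}{2m}\Big(\sum_{j,l}\big(\theta^{h}_{jl}\big)^2 + \sum_{j,l}\big(\theta^{out}_{jl}\big)^2\Big)$$

where the bias (first) columns of theta.h and theta.out are excluded from the regularization sums.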

costFunction <- function(vec_thetas, n.input, n.hidden, n.class, X, y, lambda){
  
  # n.input = number of nodes/units in input layer
  # n.hidden = number of nodes/units in hidden layer
  # n.class = number of categories/classes in the output layer (or response variable)
  
   m <- nrow(X) #No. of records in the training data
   J <- 0 #Initialize cost
   
  
  # vec_thetas is a vector containing all the weights of all the layers 
  # Reshape vec_thetas into matrices of the weights theta.h (between input and hidden) and theta.out (between hidden and output), as we are only building a neural net with two layers
  # We will use vectorized calculations through matrix operations as much as possible to avoid loops
  
   theta.h <- vec_thetas[1:(n.hidden*(n.input + 1))] #Between input and hidden layers
   dim(theta.h) <- c(n.hidden, (n.input + 1))
   
   theta.out <- vec_thetas[(1 + (n.hidden * (n.input + 1))):length(vec_thetas)] #Between hidden and output layers
   dim(theta.out) <- c(n.class, (n.hidden + 1))
  
  
  
   X = as.matrix(X) #Converting into matrix for matrix operations
   X = cbind(rep(1,m),X) #Adding bias
  
   A = sigmoid(X%*%t(theta.h)) #Activation of weighted inputs at the hidden layer
  
   ma = nrow(A) #Number of rows in the activation matrix (equal to the number of records)
  
   A = cbind(rep(1,ma), A) #Adding bias to the hidden layer
  
   for (i in 1:n.class){
  
     
     J1 <- (1/m)*(-t(as.numeric(y==i))%*%log(sigmoid(A%*%(t(theta.out)[,i])))-
                   (t(1-as.numeric(y==i))%*%log(1-sigmoid(A%*%(t(theta.out)[,i]))))) 
    
     J <- J + J1  
   }
  
   temp.h = theta.h
   temp.out = theta.out
   temp.h[,1] = 0   #Exclude the bias columns from regularization
   temp.out[,1] = 0
   J = J + (lambda/(2*m))*(sum(temp.h*temp.h) + sum(temp.out*temp.out)) #Add the regularization term
  J

  
  }
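As a quick smoke test (a made-up toy problem, not the lab data), the cost of randomly initialized weights on a tiny synthetic data set should come out as a single finite, positive number:

#Smoke test on a toy problem: 2 inputs, 3 hidden units, 2 classes, 5 records
set.seed(42)
toyX <- matrix(rnorm(10), nrow = 5)
toyy <- c(1, 2, 1, 2, 1)
toy_thetas <- runif(3*(2 + 1) + 2*(3 + 1), -0.1, 0.1)   #Unrolled random weights
costFunction(toy_thetas, n.input = 2, n.hidden = 3, n.class = 2,
             X = toyX, y = toyy, lambda = 1)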

Gradient Function

Let’s define the gradient function. It requires X, y, the unrolled vector of thetas for all layers, the number of units/nodes in the input and hidden layers, the number of classes/categories in the output layer, and the regularization term lambda. It returns the unrolled vector of gradient values for the corresponding theta values.

gradFunction <- function(vec_thetas, n.input, n.hidden, n.class, X, y, lambda){
  
  # This function returns the unrolled vector of gradient values for the corresponding thetas
  
  
  # n.input = number of nodes/units in input layer
  # n.hidden = number of nodes/units in hidden layer
  # n.class = number of categories/classes in the output layer (or response variable)
  
   m <- nrow(X) #No. of records in the training data
   
  
  
  # vec_thetas is a vector containing all the weights of all the layers 
  # Reshape vec_thetas into matrices of the weights theta.h (between input and hidden) and theta.out (between hidden and output), as we are only building a neural net with two layers
  # We will use vectorized calculations through matrix operations as much as possible to avoid loops
  
   theta.h <- vec_thetas[1:(n.hidden*(n.input + 1))] #Between input and hidden layers
   dim(theta.h) <- c(n.hidden, (n.input + 1))
   
   theta.out <- vec_thetas[(1 + (n.hidden * (n.input + 1))):length(vec_thetas)] #Between hidden and output layers
   dim(theta.out) <- c(n.class, (n.hidden + 1))
  
  
    
  theta.h_grad = matrix(rep(0,length(theta.h)),nrow(theta.h))
  theta.out_grad = matrix(rep(0,length(theta.out)),nrow(theta.out))
  
  X = as.matrix(X) #Converting into matrix for matrix operations
  X = cbind(rep(1,m),X) #Adding bias term to input layer
  
   
  temp.h = theta.h
  temp.out = theta.out
  temp.h[,1] = 0;
  temp.out[,1] = 0;
  
  #Converting response variable y into a matrix of m rows and n.class columns for vectorized calculation
  tp <- matrix(1:n.class,1) #Creating one row matrix with columns as classes
  yt <- matrix(rep(tp,m),m,byrow=T) # Matrix of m rows and n.class columns (each column represents one class)
  yt = 1*(yt == as.vector(y)) # One-hot encoding: entry (i, j) is 1 if record i belongs to class j
  
  A1 = X
  Z2 = A1%*%t(theta.h)
  A2 = sigmoid(Z2)
  ma = nrow(A2)
  A2 = cbind(rep(1,ma), A2) #Adding bias term to hidden layer
  Z3 = A2%*%t(theta.out)
  A3 = sigmoid(Z3)
  d3 = A3-yt                #Error term of output layer 
  d2 = (d3%*%theta.out)*(A2*(1-A2)) #Error term of hidden layer
  end = nrow(t(d2))   #Number of hidden units plus the bias unit
  
  theta.h_grad = theta.h_grad + (t(d2)%*%A1)[2:end,] #Starting from row 2, as the bias unit of the hidden layer is not connected to A1
  theta.out_grad = theta.out_grad + t(d3)%*%(A2)
                               
  theta.h_grad = (1/m)*(theta.h_grad + lambda*temp.h);
  theta.out_grad = (1/m)*(theta.out_grad + lambda*temp.out);
  
  
  grad <- c(as.vector(theta.h_grad),as.vector(theta.out_grad)) #Unrolling gradient values in a vector
  
 grad
  
  }
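A standard way to validate back propagation is to compare gradFunction() against a numerical (finite-difference) gradient of costFunction() on a tiny problem. The sketch below uses a made-up toy problem (2 inputs, 3 hidden units, 2 classes); the dimensions, seed and tolerance are our own choices:

#Numerical gradient check on a toy problem (sketch)
set.seed(7)
toyX <- matrix(rnorm(10), nrow = 5)   #5 records, 2 input variables
toyy <- c(1, 2, 1, 2, 1)              #Mapped class labels
theta0 <- runif(3*(2 + 1) + 2*(3 + 1), -0.1, 0.1)   #Unrolled random weights

num.grad <- sapply(seq_along(theta0), function(j){
  e <- rep(0, length(theta0)); e[j] <- 1e-4
  (costFunction(theta0 + e, 2, 3, 2, toyX, toyy, 1) -
   costFunction(theta0 - e, 2, 3, 2, toyX, toyy, 1))/(2e-4)   #Central difference
})

ana.grad <- gradFunction(theta0, 2, 3, 2, toyX, toyy, 1)

max(abs(num.grad - ana.grad))   #Should be very small (e.g. below 1e-6) if back propagation is correct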

Random Initialization of weights

We will use the normalized Xavier weight initialization method for this purpose.

We need to initialize the weights randomly to break the symmetry and to make the learning quick and efficient.

The function takes the number of units/nodes in the two adjacent layers that the thetas connect. It returns a matrix of randomized theta values between the two layers.

rand.Weights <- function(n.prev, n.next){
  
  #   n.prev = number of units in the previous layer
  #   n.next = number of units in the next layer
  
  #   randTheta is a matrix of dimension (n.next, 1 + n.prev), as the first column is for the intercept/bias term
  
  small_init = sqrt(6)/sqrt(n.prev + n.next)   #Normalized Xavier limit
  
  randTheta <- matrix(runif(n.next*(1 + n.prev)), n.next)*2*small_init - small_init #Uniform values in (-small_init, small_init)
  
  randTheta
  
}
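For example (illustrative only), the weights between a 4-unit input layer and a 10-unit hidden layer form a 10 x 5 matrix (the extra column is the bias term), with values inside the normalized Xavier limit of about 0.65:

#Illustrative check of the returned dimensions and range
w <- rand.Weights(4, 10)
dim(w)     #10  5
range(w)   #values lie within +/- sqrt(6)/sqrt(4 + 10), roughly +/- 0.65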

Prediction function

This function takes the theta matrix for each layer and X. It returns the predicted class for each record.

predict <- function(theta.h, theta.out, X){
  
  X = as.matrix(X)        #Converting into matrix for matrix operations
  m = nrow(X)
  X = cbind(rep(1,m), X)  #Adding bias term to the input layer
  
  A = sigmoid(X%*%t(theta.h))   #Hidden layer activations
  ma = nrow(A)
  A = cbind(rep(1,ma), A)       #Adding bias term to the hidden layer
  
  p <- apply(sigmoid(A%*%t(theta.out)), 1, which.max)  #Class with the highest output activation
  
  p
  
}

Mapping response variable into integers (1 to n.class)

This function converts all the classes into integers so that our algorithm can run and perform matrix operations. Make sure that the response variable is a factor with the classes as levels.

map_y = function(y){
  
  classes = levels(y) #Response variable must be a factor with classes as levels
  n.class = length (classes)
  y_map = rep(0, length(y))
  for (i in 1:n.class){
    idx = y %in% classes[i] 
    y_map[idx] = i
  }
    
  list(y_map=y_map, classes = classes)
  
}

Mapping response variable back to original classes

This function converts the mapped y values (or predictions) back to the original classes, so that we can read off the predicted class for each record.

class_y = function(y_map, classes){
  
  n.class = length (classes)
  y_class = rep(0, length(y_map))
  for (i in 1:n.class){
    idx = y_map %in% i 
    y_class[idx] = classes[i]
  }
    
  list(y_class=y_class, classes=classes)
  
}
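A quick round-trip check on a hypothetical toy factor (illustrative only) shows that class_y() undoes map_y():

#Illustrative round trip: map to integers and back to the original labels
toy <- factor(c("benign", "malignant", "benign"))
mapped <- map_y(toy)
mapped$y_map                                   #1 2 1
class_y(mapped$y_map, mapped$classes)$y_class  #"benign" "malignant" "benign"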

Load Data

We will use the same data that we used for lab1 of the neural network section. Please go through lab1 to learn about the data.

Let's load the data:

library(mlbench)

data(BreastCancer)

Let’s get rid of records with missing values

sum(!complete.cases(BreastCancer))
## [1] 16
BreastCancerComplete = BreastCancer[complete.cases(BreastCancer),]

sum(!complete.cases(BreastCancerComplete))
## [1] 0
sum(is.na(BreastCancerComplete))
## [1] 0
dim(BreastCancerComplete)
## [1] 683  11
BreastCancerNew = BreastCancerComplete[,-1] # Dropping "Id"

names(BreastCancerNew)
##  [1] "Cl.thickness"    "Cell.size"       "Cell.shape"      "Marg.adhesion"  
##  [5] "Epith.c.size"    "Bare.nuclei"     "Bl.cromatin"     "Normal.nucleoli"
##  [9] "Mitoses"         "Class"

Our algorithm needs the data in a numeric format. Therefore, we will use the function model.matrix() to convert all the categorical variables, except the response variable, into 0s and 1s or numeric contrast values, depending on the variable type.
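To see what model.matrix() does, here is a small hypothetical example (not the BreastCancer data): an unordered factor with three levels is expanded into two 0/1 dummy columns plus the intercept.

#Hypothetical illustration of model.matrix()
toy <- data.frame(Class = c("a", "b", "a"),
                  Colour = factor(c("red", "green", "blue")))
model.matrix(Class ~ ., data = toy)   #Intercept column plus dummy columns Colourgreen and Colourred

The BreastCancer predictors are factors with up to 10 levels each, so each one expands into several columns, which is why 9 predictors turn into 80 columns below.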

X = model.matrix(Class ~ .,data= BreastCancerNew)
dim(X)
## [1] 683  81
X = X[,-1]
dim(X)
## [1] 683  80
m = nrow(X)
m
## [1] 683

So model.matrix() produces 81 columns, the first being an intercept or bias term, which we have removed because our functions add the bias term themselves.

Response variable

y = BreastCancerNew$Class

levels(y)
## [1] "benign"    "malignant"

Let’s map y into integer values.

Mapping response variable into integers

We convert all the classes into integers so that our algorithm can run and perform matrix operations.

y_mapped = map_y(y)

y_map = y_mapped$y_map
classes = y_mapped$classes

y_map[1:10]
##  [1] 1 1 1 1 1 2 1 1 1 1
y[1:10]
##  [1] benign    benign    benign    benign    benign    malignant benign   
##  [8] benign    benign    benign   
## Levels: benign malignant
classes
## [1] "benign"    "malignant"

Train and Test Data

Let’s split the data set into train (80%) and test (20%)

set.seed(1);train.idx = sample(1:m, 0.8*m, replace=F)

train.n = length(train.idx)

test.n = m-train.n

trainX = X[train.idx,]
testX = X[-train.idx,]

trainy = y_map[train.idx]
testy = y_map[-train.idx]

dim(trainX)
## [1] 546  80
dim(testX)
## [1] 137  80
trainy[10:20]
##  [1] 2 1 1 2 1 2 1 1 1 2 1
testy[10:20]
##  [1] 2 1 2 1 2 1 1 1 1 1 1

Layer dimensions

n.input  = 80  # 80 input variables
n.hidden = 10   # 10 hidden units
n.class = 2          # 2 classes, "benign" and "malignant" 

Initializing random Neural Network Weights

iniTheta.h = rand.Weights(n.input, n.hidden)
iniTheta.out = rand.Weights(n.hidden, n.class)

Unrolling thetas

initial_vec_thetas = c(as.vector(iniTheta.h) , as.vector(iniTheta.out))

Training Neural Network

We will try two different optimizers, optim() and nlminb(), to see which one takes less time to train the model.

First, let’s try optim().

lambda = 1

#Compute Weights and cost using optim() as optimizer
#number of iterations = 100

costh <- optim(par=initial_vec_thetas, fn=costFunction, gr=gradFunction, 
               method="BFGS", n.input=n.input,
               n.hidden=n.hidden, n.class=n.class,
               X=trainX,y=trainy,lambda=lambda, control = list(maxit=100))

vec_thetas <- costh$par

cost <- costh$value


cat('Cost after last iteration:', '\n', cost)
## Cost after last iteration: 
##  0.1638301

Rolling vec_thetas back to theta.h and theta.out

theta.h <- vec_thetas[1:(n.hidden*(n.input + 1))] #Between input and hidden layers
dim(theta.h) <- c(n.hidden, (n.input + 1))
   
theta.out <- vec_thetas[(1 + (n.hidden * (n.input + 1))):length(vec_thetas)] #Between hidden and output layers
dim(theta.out) <- c(n.class, (n.hidden + 1))
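Since this rolling step is repeated after every training run, it could be wrapped in a small convenience helper (this is our own optional addition, not part of the lecture code):

#Optional helper: reshape an unrolled weight vector back into the two theta matrices
roll_thetas <- function(vec_thetas, n.input, n.hidden, n.class){
  theta.h <- vec_thetas[1:(n.hidden*(n.input + 1))]
  dim(theta.h) <- c(n.hidden, n.input + 1)
  theta.out <- vec_thetas[(1 + n.hidden*(n.input + 1)):length(vec_thetas)]
  dim(theta.out) <- c(n.class, n.hidden + 1)
  list(theta.h = theta.h, theta.out = theta.out)
}

With this helper, theta.h and theta.out could be recovered as roll_thetas(vec_thetas, n.input, n.hidden, n.class)$theta.h and $theta.out; we keep the explicit code in the rest of the lab so every step stays visible.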

Predictions by trained model using optim()

Train data set

pred.train = predict(theta.h, theta.out, trainX)

cat('Accuracy on train data set:\n',mean((pred.train == trainy))*100)
## Accuracy on train data set:
##  99.45055

Test data set

pred.test = predict(theta.h, theta.out, testX)

cat('Accuracy on test data set:\n',mean((pred.test == testy))*100)
## Accuracy on test data set:
##  95.62044

Now let’s try nlminb() as the optimizer to get the parameters and cost.

#Compute Weights and cost using nlminb() as optimizer
#number of iterations = 100

costh <- nlminb(start=initial_vec_thetas, objective=costFunction, 
                gradient=gradFunction, n.input=n.input,
                n.hidden=n.hidden, n.class=n.class,
                X=trainX,y=trainy,lambda=lambda, control = list(iter.max=100))

vec_thetas <- costh$par

cost <- costh$objective


cat('Cost after last iteration:', '\n', cost)
## Cost after last iteration: 
##  0.1638596

Rolling vec_thetas back to theta.h and theta.out

theta.h <- vec_thetas[1:(n.hidden*(n.input + 1))] #Between input and hidden layers
dim(theta.h) <- c(n.hidden, (n.input + 1))
   
theta.out <- vec_thetas[(1 + (n.hidden * (n.input + 1))):length(vec_thetas)] #Between hidden and output layers
dim(theta.out) <- c(n.class, (n.hidden + 1))

Predictions by trained model using nlminb()

Train data set

pred.train = predict(theta.h, theta.out, trainX)

cat('Accuracy on train data set:\n',mean((pred.train == trainy))*100)
## Accuracy on train data set:
##  99.45055

Test data set

pred.test = predict(theta.h, theta.out, testX)

cat('Accuracy on test data set:\n',mean((pred.test == testy))*100)
## Accuracy on test data set:
##  95.62044

We can see that this simple neural network classifier with just one hidden layer has performed really well. Both optimizers reach similar accuracies, and in our runs optim() and nlminb() took roughly the same amount of time to train the model; a quick way to check the timing on your own machine is sketched below.
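The sketch simply wraps each optimizer call in system.time(); everything it uses (costFunction, gradFunction, initial_vec_thetas, trainX, trainy, lambda) is defined above, and the exact timings will of course vary by machine.

#Sketch: compare wall-clock time of the two optimizers (results are machine-dependent)
t.optim  <- system.time(
  optim(par=initial_vec_thetas, fn=costFunction, gr=gradFunction,
        method="BFGS", n.input=n.input, n.hidden=n.hidden, n.class=n.class,
        X=trainX, y=trainy, lambda=lambda, control=list(maxit=100))
)
t.nlminb <- system.time(
  nlminb(start=initial_vec_thetas, objective=costFunction, gradient=gradFunction,
         n.input=n.input, n.hidden=n.hidden, n.class=n.class,
         X=trainX, y=trainy, lambda=lambda, control=list(iter.max=100))
)
t.optim["elapsed"]; t.nlminb["elapsed"]   #Elapsed seconds for each optimizer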

Predicted class

Let’s convert the response variable and predicted values back to original classes

pred.train.class = class_y(pred.train, classes)$y_class

trainy.class = class_y(trainy, classes)$y_class

pred.test.class = class_y(pred.test, classes)$y_class

testy.class = class_y(testy, classes)$y_class

Let’s look at the confusion matrices

confMatTrain = table(trainy.class, pred.train.class)

confMatTest = table(testy.class, pred.test.class)

confMatTrain
##             pred.train.class
## trainy.class benign malignant
##    benign       345         2
##    malignant      1       198
cat('\n Accuracy on train data set:\n',mean((pred.train.class == trainy.class))*100, '\n')
## 
##  Accuracy on train data set:
##  99.45055
confMatTest
##            pred.test.class
## testy.class benign malignant
##   benign        94         3
##   malignant      3        37
cat('\n Accuracy on test data set:\n',mean((pred.test.class == testy.class))*100, '\n')
## 
##  Accuracy on test data set:
##  95.62044

Another example

Let’s quickly run through another example and see how our neural net performs.

We will use the iris data set, which comes with base R. We have already built a multinomial logistic regression model on this data (go through lab1 in the Logistic Regression section for more details).

Train and Test subset

Data dimensions and converting into X and y

dim(iris)
## [1] 150   5
levels(iris$Species)
## [1] "setosa"     "versicolor" "virginica"
X = iris[, -5]
y = iris$Species

Mapping response variable into integers

y_mapped = map_y(y)

y_map = y_mapped$y_map
classes = y_mapped$classes

y_map[c(1:3, 51:53, 101:103)]
## [1] 1 1 1 2 2 2 3 3 3
y[c(1:3, 51:53, 101:103)]
## [1] setosa     setosa     setosa     versicolor versicolor versicolor virginica 
## [8] virginica  virginica 
## Levels: setosa versicolor virginica
classes
## [1] "setosa"     "versicolor" "virginica"
m = dim(X)[1]
set.seed(1); train.idx = sample(1:m, 120, replace=F)

trainX = X[train.idx, ]
testX = X[-train.idx, ]

train.y = y_map[train.idx]    #Mapped into integer values for respective classes
train.y.class = y[train.idx]  #Original classes

test.y = y_map[-train.idx]    #Mapped into integer values for respective classes
test.y.class = y[-train.idx]  #Original classes


dim(trainX)
## [1] 120   4
dim(testX)
## [1] 30  4
levels(test.y.class)
## [1] "setosa"     "versicolor" "virginica"
levels(train.y.class)
## [1] "setosa"     "versicolor" "virginica"

Training the neural net

Layer dimensions

n.input  = 4    # 4 input variables
n.hidden = 10   # 10 hidden units
n.class = 3     # 3 classes, "setosa", "versicolor" and "virginica" 

Initializing random Neural Network Weights

iniTheta.h = rand.Weights(n.input, n.hidden)
iniTheta.out = rand.Weights(n.hidden, n.class)

Unrolling thetas

initial_vec_thetas = c(as.vector(iniTheta.h) , as.vector(iniTheta.out))

Training Neural Network with optim()

lambda = 1

#Compute Weights and cost using optim() as optimizer
#number of iterations = 100

costh <- optim(par=initial_vec_thetas, fn=costFunction, gr=gradFunction, 
               method="BFGS", n.input=n.input,
               n.hidden=n.hidden, n.class=n.class,
               X=trainX,y=train.y,lambda=lambda, control = list(maxit=100))

vec_thetas <- costh$par

cost <- costh$value


cat('Cost after last iteration:', '\n', cost)
## Cost after last iteration: 
##  0.6549467

Rolling vec_thetas back to theta.h and theta.out

theta.h <- vec_thetas[1:(n.hidden*(n.input + 1))] #Between input and hidden layers
dim(theta.h) <- c(n.hidden, (n.input + 1))
   
theta.out <- vec_thetas[(1 + (n.hidden * (n.input + 1))):length(vec_thetas)] #Between hidden and output layers
dim(theta.out) <- c(n.class, (n.hidden + 1))

Prediction accuracy on the train and test sets

Train data set

pred.train = predict(theta.h, theta.out, trainX)

cat('Accuracy on train data set:\n',mean((pred.train == train.y))*100)
## Accuracy on train data set:
##  98.33333

Test data set

pred.test = predict(theta.h, theta.out, testX)

cat('Accuracy on test data set:\n',mean((pred.test == test.y))*100)
## Accuracy on test data set:
##  93.33333

Predicted class

Let’s convert the response variable and predicted values back to original classes

pred.train.class = class_y(pred.train, classes)$y_class

pred.test.class = class_y(pred.test, classes)$y_class

Let’s look at the confusion matrices

confMatTrain = table(train.y.class, pred.train.class)

confMatTest = table(test.y.class, pred.test.class)

confMatTrain
##              pred.train.class
## train.y.class setosa versicolor virginica
##    setosa         39          0         0
##    versicolor      0         36         2
##    virginica       0          0        43
cat('\n Accuracy on train data set:\n',mean((pred.train.class == train.y.class))*100, '\n')
## 
##  Accuracy on train data set:
##  98.33333
confMatTest
##             pred.test.class
## test.y.class setosa versicolor virginica
##   setosa         11          0         0
##   versicolor      0         12         0
##   virginica       0          2         5
cat('\n Accuracy on test data set:\n',mean((pred.test.class == test.y.class))*100, '\n')
## 
##  Accuracy on test data set:
##  93.33333

As we can see, our neural net has again performed extremely well.

Exercise: Use different data sets from R, MASS or any other package or source that you like, and run this algorithm to see how it works.
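As a starting point for the exercise, here is a sketch using the Sonar data set from the mlbench package (the split, hidden-layer size and lambda below are arbitrary choices, and the code assumes all the functions defined above are in the workspace):

#Exercise starter (sketch): the same pipeline on the Sonar data from mlbench
library(mlbench)
data(Sonar)                          #208 records, 60 numeric inputs, 2 classes ("M" and "R")

sonar.X = as.matrix(Sonar[, 1:60])   #Predictors are already numeric, so model.matrix() is not needed
sonar.y = map_y(Sonar$Class)$y_map   #Map "M"/"R" to integers

set.seed(1); idx = sample(1:nrow(sonar.X), floor(0.8*nrow(sonar.X)))   #80% train split

n.input = 60; n.hidden = 10; n.class = 2
init = c(as.vector(rand.Weights(n.input, n.hidden)),
         as.vector(rand.Weights(n.hidden, n.class)))

fit = optim(par = init, fn = costFunction, gr = gradFunction, method = "BFGS",
            n.input = n.input, n.hidden = n.hidden, n.class = n.class,
            X = sonar.X[idx, ], y = sonar.y[idx], lambda = 1, control = list(maxit = 100))

The remaining steps, rolling fit$par back into theta.h and theta.out, calling predict() on the train and test subsets and mapping the predictions back with class_y(), are exactly the same as in the two examples above.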

