
Machine Learning - Building Neural Network Algorithm from scratch

We have seen how a neural network model is built in R using predefined functions such as nnet(). However, it is also important to understand how the machine learning works at the back end, that is, which functions run inside the main function. We will walk through the various stages of the machine learning process for a neural network and create our own functions step by step:

  1. Define Sigmoid Function
  2. Define Sigmoid Gradient Function
  3. Define Cost Function using regularization
  4. Define Gradient function using regularization
  5. Define Random Initialization of Weight Function
  6. Define predict function to predict classes for each record
  7. Define function for mapping response variable into integers
  8. Define function for mapping response variable back to original classes
  9. Train the model using all the functions on training data
  10. Make predictions on train and test data

In the video lecture, the algorithm is explained in detail. Please go through the lecture before starting this lab session for a better understanding.

Sigmoid Function

Let’s start with the sigmoid function, which we will use as the activation function.

sigmoid <- function(x){
  x <- as.matrix(x)        #Ensure matrix form for element-wise operations
  
  sig <- 1/(1 + exp(-x))   #Element-wise sigmoid
  
  sig
  
}
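As a quick illustration (not part of the lab code), the function maps large negative inputs towards 0, zero to 0.5 and large positive inputs towards 1:

#Illustrative call
sigmoid(c(-5, 0, 5))   #approximately 0.0067 0.5000 0.9933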

Sigmoid Gradient Function

Let’s define the sigmoid gradient function, which we will need for the back-propagation step.

sigmoid.grad <- function(x){
  x <- as.matrix(x)       #Ensure matrix form
  
  s <- 1/(1 + exp(-x))    #Sigmoid
  sig.grad <- s*(1 - s)   #Derivative: sigmoid(x)*(1 - sigmoid(x))
  
  sig.grad
  
}
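Since the derivative of the sigmoid satisfies sigmoid'(x) = sigmoid(x)*(1 - sigmoid(x)), it peaks at x = 0 with the value 0.25. An illustrative call (not part of the lab code):

#Illustrative check of the gradient
sigmoid.grad(0)             #approximately 0.25
sigmoid.grad(c(-5, 0, 5))   #approximately 0.0066 0.2500 0.0066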

Cost Function

Let’s define the cost function for a neural net with one input layer, one hidden layer and one output layer. The main purpose is to minimize this cost to obtain the optimum parameter values. The function requires X, y, the unrolled vector of thetas for all layers, the number of units/nodes in the input and hidden layers, the number of classes/categories in the output layer, and the regularization term lambda. It returns the cost.
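For reference, the quantity the code below computes is the standard regularized cross-entropy cost (the notation here is ours, not from the lecture): with K classes and h(x)_k denoting the output-layer activation for class k,

$$J(\Theta) = \frac{1}{m}\sum_{i=1}^{m}\sum_{k=1}^{K}\Big[-y^{(i)}_k \log h_\Theta(x^{(i)})_k - \big(1 - y^{(i)}_k\big)\log\big(1 - h_\Theta(x^{(i)})_k\big)\Big] + \frac{\lambda}{2m}\Big(\sum_{j,l}\big(\theta^{h}_{jl}\big)^2 + \sum_{j,l}\big(\theta^{out}_{jl}\big)^2\Big)$$

where the bias (first) columns of theta.h and theta.out are excluded from the regularization sums.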

costFunction <- function(vec_thetas, n.input, n.hidden, n.class, X, y, lambda){
  
  # n.input = number of nodes/units in input layer
  # n.hidden = number of nodes/units in hidden layer
  # n.class = number of categories/classes in the output layer (or response variable)
  
   m <- nrow(X) #No. of records in the training data
   J <- 0 #Initialize cost
   
  
  # vec_thetas is a vector containing all the weights of all the layers 
  # Reshape vec_thetas into matrices of the weights theta.h (between input and hidden) and theta.out (between hidden and output), as we are only building a neural net with two layers
  # We will use vectorized calculations through matrix operations as much as possible to avoid loops
  
   theta.h <- vec_thetas[1:(n.hidden*(n.input + 1))] #Between input and hidden layers
   dim(theta.h) <- c(n.hidden, (n.input + 1))
   
   theta.out <- vec_thetas[(1 + (n.hidden * (n.input + 1))):length(vec_thetas)] #Between hidden and output layers
   dim(theta.out) <- c(n.class, (n.hidden + 1))
  
  
  
   X = as.matrix(X) #Converting into matrix for matrix operations
   X = cbind(rep(1,m),X) #Adding bias
  
   A = sigmoid(X%*%t(theta.h)) #Activation of weighted inputs at the hidden layer
  
   ma = nrow(A) #Number of rows in the activation matrix (equal to the number of records)
  
   A = cbind(rep(1,ma), A) #Adding bias to the hidden layer
  
   for (i in 1:n.class){
  
     
     J1 <- (1/m)*(-t(as.numeric(y==i))%*%log(sigmoid(A%*%(t(theta.out)[,i])))-
                   (t(1-as.numeric(y==i))%*%log(1-sigmoid(A%*%(t(theta.out)[,i]))))) 
    
     J <- J + J1  
   }
  
   temp.h = theta.h
   temp.out = theta.out
   temp.h[,1] = 0   #Exclude the bias columns from regularization
   temp.out[,1] = 0
   J = J + (lambda/(2*m))*(sum(temp.h*temp.h) + sum(temp.out*temp.out)) #Add the regularization term
  J

  
  }
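As a quick smoke test (a made-up toy problem, not the lab data), the cost of randomly initialized weights on a tiny synthetic data set should come out as a single finite, positive number:

#Smoke test on a toy problem: 2 inputs, 3 hidden units, 2 classes, 5 records
set.seed(42)
toyX <- matrix(rnorm(10), nrow = 5)
toyy <- c(1, 2, 1, 2, 1)
toy_thetas <- runif(3*(2 + 1) + 2*(3 + 1), -0.1, 0.1)   #Unrolled random weights
costFunction(toy_thetas, n.input = 2, n.hidden = 3, n.class = 2,
             X = toyX, y = toyy, lambda = 1)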

Gradient Function

Let’s define the gradient function. It requires X, y, the unrolled vector of thetas for all layers, the number of units/nodes in the input and hidden layers, the number of classes/categories in the output layer, and the regularization term lambda. It returns the unrolled vector of gradient values for the corresponding theta values.

gradFunction <- function(vec_thetas, n.input, n.hidden, n.class, X, y, lambda){
  
  # This function returns the unrolled vector of gradient values for the corresponding thetas
  
  
  # n.input = number of nodes/units in input layer
  # n.hidden = number of nodes/units in hidden layer
  # n.class = number of categories/classes in the output layer (or response variable)
  
   m <- nrow(X) #No. of records in the training data
   
  
  
  # vec_thetas is a vector containing all the weights of all the layers 
  # Reshape vec_thetas into matrices of the weights theta.h (between input and hidden) and theta.out (between hidden and output), as we are only building a neural net with two layers
  # We will use vectorized calculations through matrix operations as much as possible to avoid loops
  
   theta.h <- vec_thetas[1:(n.hidden*(n.input + 1))] #Between input and hidden layers
   dim(theta.h) <- c(n.hidden, (n.input + 1))
   
   theta.out <- vec_thetas[(1 + (n.hidden * (n.input + 1))):length(vec_thetas)] #Between hidden and output layers
   dim(theta.out) <- c(n.class, (n.hidden + 1))
  
  
    
  theta.h_grad = matrix(rep(0,length(theta.h)),nrow(theta.h))
  theta.out_grad = matrix(rep(0,length(theta.out)),nrow(theta.out))
  
  X = as.matrix(X) #Converting into matrix for matrix operations
  X = cbind(rep(1,m),X) #Adding bias term to input layer
  
   
  temp.h = theta.h
  temp.out = theta.out
  temp.h[,1] = 0;
  temp.out[,1] = 0;
  
  #Converting response variable y into a matrix of m rows and n.class columns for vectorized calculation
  tp <- matrix(1:n.class,1) #Creating one row matrix with columns as classes
  yt <- matrix(rep(tp,m),m,byrow=T) # Matrix of m rows and n.class columns (each column represents one class)
  yt = 1*(yt == as.vector(y)) # One-hot encoding: entry (i, j) is 1 if record i belongs to class j
  
  A1 = X
  Z2 = A1%*%t(theta.h)
  A2 = sigmoid(Z2)
  ma = nrow(A2)
  A2 = cbind(rep(1,ma), A2) #Adding bias term to hidden layer
  Z3 = A2%*%t(theta.out)
  A3 = sigmoid(Z3)
  d3 = A3-yt                #Error term of output layer 
  d2 = (d3%*%theta.out)*(A2*(1-A2)) #Error term of hidden layer
  end = nrow(t(d2))   #Number of hidden units plus the bias unit
  
  theta.h_grad = theta.h_grad + (t(d2)%*%A1)[2:end,] #Starting from row 2, as the bias unit of the hidden layer is not connected to A1
  theta.out_grad = theta.out_grad + t(d3)%*%(A2)
                               
  theta.h_grad = (1/m)*(theta.h_grad + lambda*temp.h);
  theta.out_grad = (1/m)*(theta.out_grad + lambda*temp.out);
  
  
  grad <- c(as.vector(theta.h_grad),as.vector(theta.out_grad)) #Unrolling gradient values in a vector
  
 grad
  
  }
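A standard way to validate back propagation is to compare gradFunction() against a numerical (finite-difference) gradient of costFunction() on a tiny problem. The sketch below uses a made-up toy problem (2 inputs, 3 hidden units, 2 classes); the dimensions, seed and tolerance are our own choices:

#Numerical gradient check on a toy problem (sketch)
set.seed(7)
toyX <- matrix(rnorm(10), nrow = 5)   #5 records, 2 input variables
toyy <- c(1, 2, 1, 2, 1)              #Mapped class labels
theta0 <- runif(3*(2 + 1) + 2*(3 + 1), -0.1, 0.1)   #Unrolled random weights

num.grad <- sapply(seq_along(theta0), function(j){
  e <- rep(0, length(theta0)); e[j] <- 1e-4
  (costFunction(theta0 + e, 2, 3, 2, toyX, toyy, 1) -
   costFunction(theta0 - e, 2, 3, 2, toyX, toyy, 1))/(2e-4)   #Central difference
})

ana.grad <- gradFunction(theta0, 2, 3, 2, toyX, toyy, 1)

max(abs(num.grad - ana.grad))   #Should be very small (e.g. below 1e-6) if back propagation is correct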

Random Initialization of weights

We will use the normalized Xavier weight initialization method for this purpose.

We need to initialize the weights randomly to break the symmetry and to make the learning quick and efficient.

The function takes the number of units/nodes in the two adjacent layers that the thetas connect. It returns a matrix of randomized theta values between the two layers.

rand.Weights <- function(n.prev, n.next){
  
  #   n.prev = number of units in the previous layer
  #   n.next = number of units in the next layer
  
  #   randTheta is a matrix of dimension (n.next, 1 + n.prev), as the first column is for the intercept/bias term
  
  small_init = sqrt(6)/sqrt(n.prev + n.next)   #Normalized Xavier limit
  
  randTheta <- matrix(runif(n.next*(1 + n.prev)), n.next)*2*small_init - small_init #Uniform values in (-small_init, small_init)
  
  randTheta
  
}
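For example (illustrative only), the weights between a 4-unit input layer and a 10-unit hidden layer form a 10 x 5 matrix (the extra column is the bias term), with values inside the normalized Xavier limit of about 0.65:

#Illustrative check of the returned dimensions and range
w <- rand.Weights(4, 10)
dim(w)     #10  5
range(w)   #values lie within +/- sqrt(6)/sqrt(4 + 10), roughly +/- 0.65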

Prediction function

This function takes the theta matrix for each layer and X. It returns the predicted class for each record.

predict <- function(theta.h, theta.out, X){
  
  X = as.matrix(X)        #Converting into matrix for matrix operations
  m = nrow(X)
  X = cbind(rep(1,m), X)  #Adding bias term to the input layer
  
  A = sigmoid(X%*%t(theta.h))   #Hidden layer activations
  ma = nrow(A)
  A = cbind(rep(1,ma), A)       #Adding bias term to the hidden layer
  
  p <- apply(sigmoid(A%*%t(theta.out)), 1, which.max)  #Class with the highest output activation
  
  p
  
}

Mapping response variable into integers (1 to n.class)

This function converts all the classes into integers so that our algorithm can run and perform matrix operations. Make sure that the response variable is a factor with the classes as levels.

map_y = function(y){
  
  classes = levels(y) #Response variable must be a factor with classes as levels
  n.class = length (classes)
  y_map = rep(0, length(y))
  for (i in 1:n.class){
    idx = y %in% classes[i] 
    y_map[idx] = i
  }
    
  list(y_map=y_map, classes = classes)
  
}

Mapping response variable back to original classes

This function converts the mapped y values (or predictions) back to the original classes, so that we can read off the predicted class for each record.

class_y = function(y_map, classes){
  
  n.class = length (classes)
  y_class = rep(0, length(y_map))
  for (i in 1:n.class){
    idx = y_map %in% i 
    y_class[idx] = classes[i]
  }
    
  list(y_class=y_class, classes=classes)
  
}
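A quick round-trip check on a hypothetical toy factor (illustrative only) shows that class_y() undoes map_y():

#Illustrative round trip: map to integers and back to the original labels
toy <- factor(c("benign", "malignant", "benign"))
mapped <- map_y(toy)
mapped$y_map                                   #1 2 1
class_y(mapped$y_map, mapped$classes)$y_class  #"benign" "malignant" "benign"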

Load Data

We will use the same data that we used for lab1 of the neural network section. Please go through lab1 to learn about the data.

Let's load the data:

library(mlbench)

data(BreastCancer)

Let’s get rid of records with missing values

sum(!complete.cases(BreastCancer))
## [1] 16
BreastCancerComplete = BreastCancer[complete.cases(BreastCancer),]

sum(!complete.cases(BreastCancerComplete))
## [1] 0
sum(is.na(BreastCancerComplete))
## [1] 0
dim(BreastCancerComplete)
## [1] 683  11
BreastCancerNew = BreastCancerComplete[,-1] # Dropping "Id"

names(BreastCancerNew)
##  [1] "Cl.thickness"    "Cell.size"       "Cell.shape"      "Marg.adhesion"  
##  [5] "Epith.c.size"    "Bare.nuclei"     "Bl.cromatin"     "Normal.nucleoli"
##  [9] "Mitoses"         "Class"

Our algorithm needs the data in a numeric format. Therefore, we will use the function model.matrix() to convert all the categorical variables, except the response variable, into 0s and 1s or numeric contrast values, depending on the variable type.
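To see what model.matrix() does, here is a small hypothetical example (not the BreastCancer data): an unordered factor with three levels is expanded into two 0/1 dummy columns plus the intercept.

#Hypothetical illustration of model.matrix()
toy <- data.frame(Class = c("a", "b", "a"),
                  Colour = factor(c("red", "green", "blue")))
model.matrix(Class ~ ., data = toy)   #Intercept column plus dummy columns Colourgreen and Colourred

The BreastCancer predictors are factors with up to 10 levels each, so each one expands into several columns, which is why 9 predictors turn into 80 columns below.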

X = model.matrix(Class ~ .,data= BreastCancerNew)
dim(X)
## [1] 683  81
X = X[,-1]
dim(X)
## [1] 683  80
m = nrow(X)
m
## [1] 683

So model.matrix() produces 81 columns, the first being an intercept or bias term, which we have removed because our functions add the bias term themselves.

Response variable

y = BreastCancerNew$Class

levels(y)
## [1] "benign"    "malignant"

Let’s map y into integer values.

Mapping response variable into integers

We convert all the classes into integers so that our algorithm can run and perform matrix operations.

y_mapped = map_y(y)

y_map = y_mapped$y_map
classes = y_mapped$classes

y_map[1:10]
##  [1] 1 1 1 1 1 2 1 1 1 1
y[1:10]
##  [1] benign    benign    benign    benign    benign    malignant benign   
##  [8] benign    benign    benign   
## Levels: benign malignant
classes
## [1] "benign"    "malignant"

Train and Test Data

Let’s split the data set into train (80%) and test (20%)

set.seed(1);train.idx = sample(1:m, 0.8*m, replace=F)

train.n = length(train.idx)

test.n = m-train.n

trainX = X[train.idx,]
testX = X[-train.idx,]

trainy = y_map[train.idx]
testy = y_map[-train.idx]

dim(trainX)
## [1] 546  80
dim(testX)
## [1] 137  80
trainy[10:20]
##  [1] 2 1 1 2 1 2 1 1 1 2 1
testy[10:20]
##  [1] 2 1 2 1 2 1 1 1 1 1 1

Layer dimensions

n.input  = 80  # 80 input variables
n.hidden = 10   # 10 hidden units
n.class = 2          # 2 classes, "benign" and "malignant" 

Initializing random Neural Network Weights

iniTheta.h = rand.Weights(n.input, n.hidden)
iniTheta.out = rand.Weights(n.hidden, n.class)

Unrolling thetas

initial_vec_thetas = c(as.vector(iniTheta.h) , as.vector(iniTheta.out))

Training Neural Network

We will try two different optimizers, optim() and nlminb(), to see which one takes less time to train the model.

First, let’s try optim().

lambda = 1

#Compute Weights and cost using optim() as optimizer
#number of iterations = 100

costh <- optim(par=initial_vec_thetas, fn=costFunction, gr=gradFunction, 
               method="BFGS", n.input=n.input,
               n.hidden=n.hidden, n.class=n.class,
               X=trainX,y=trainy,lambda=lambda, control = list(maxit=100))

vec_thetas <- costh$par

cost <- costh$value


cat('Cost after last iteration:', '\n', cost)
## Cost after last iteration: 
##  0.1638301

Rolling vec_thetas back to theta.h and theta.out

theta.h <- vec_thetas[1:(n.hidden*(n.input + 1))] #Between input and hidden layers
dim(theta.h) <- c(n.hidden, (n.input + 1))
   
theta.out <- vec_thetas[(1 + (n.hidden * (n.input + 1))):length(vec_thetas)] #Between hidden and output layers
dim(theta.out) <- c(n.class, (n.hidden + 1))
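Since this rolling step is repeated after every training run, it could be wrapped in a small convenience helper (this is our own optional addition, not part of the lecture code):

#Optional helper: reshape an unrolled weight vector back into the two theta matrices
roll_thetas <- function(vec_thetas, n.input, n.hidden, n.class){
  theta.h <- vec_thetas[1:(n.hidden*(n.input + 1))]
  dim(theta.h) <- c(n.hidden, n.input + 1)
  theta.out <- vec_thetas[(1 + n.hidden*(n.input + 1)):length(vec_thetas)]
  dim(theta.out) <- c(n.class, n.hidden + 1)
  list(theta.h = theta.h, theta.out = theta.out)
}

With this helper, theta.h and theta.out could be recovered as roll_thetas(vec_thetas, n.input, n.hidden, n.class)$theta.h and $theta.out; we keep the explicit code in the rest of the lab so every step stays visible.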

Predictions by trained model using optim()

Train data set

pred.train = predict(theta.h, theta.out, trainX)

cat('Accuracy on train data set:\n',mean((pred.train == trainy))*100)
## Accuracy on train data set:
##  99.45055

Test data set

pred.test = predict(theta.h, theta.out, testX)

cat('Accuracy on test data set:\n',mean((pred.test == testy))*100)
## Accuracy on test data set:
##  95.62044

Now let’s try nlminb() as the optimizer to get the parameters and cost.

#Compute Weights and cost using nlminb() as optimizer
#number of iterations = 100

costh <- nlminb(start=initial_vec_thetas, objective=costFunction, 
                gradient=gradFunction, n.input=n.input,
                n.hidden=n.hidden, n.class=n.class,
                X=trainX,y=trainy,lambda=lambda, control = list(iter.max=100))

vec_thetas <- costh$par

cost <- costh$objective


cat('Cost after last iteration:', '\n', cost)
## Cost after last iteration: 
##  0.1638596

Rolling vec_thetas back to theta.h and theta.out

theta.h <- vec_thetas[1:(n.hidden*(n.input + 1))] #Between input and hidden layers
dim(theta.h) <- c(n.hidden, (n.input + 1))
   
theta.out <- vec_thetas[(1 + (n.hidden * (n.input + 1))):length(vec_thetas)] #Between hidden and output layers
dim(theta.out) <- c(n.class, (n.hidden + 1))

Predictions by trained model using nlminb()

Train data set

pred.train = predict(theta.h, theta.out, trainX)

cat('Accuracy on train data set:\n',mean((pred.train == trainy))*100)
## Accuracy on train data set:
##  99.45055

Test data set

pred.test = predict(theta.h, theta.out, testX)

cat('Accuracy on test data set:\n',mean((pred.test == testy))*100)
## Accuracy on test data set:
##  95.62044

We can see that this simple neural network classifier with just one hidden layer has performed really well. Both optimizers reach similar accuracies, and in our runs optim() and nlminb() took roughly the same amount of time to train the model; a quick way to check the timing on your own machine is sketched below.
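The sketch simply wraps each optimizer call in system.time(); everything it uses (costFunction, gradFunction, initial_vec_thetas, trainX, trainy, lambda) is defined above, and the exact timings will of course vary by machine.

#Sketch: compare wall-clock time of the two optimizers (results are machine-dependent)
t.optim  <- system.time(
  optim(par=initial_vec_thetas, fn=costFunction, gr=gradFunction,
        method="BFGS", n.input=n.input, n.hidden=n.hidden, n.class=n.class,
        X=trainX, y=trainy, lambda=lambda, control=list(maxit=100))
)
t.nlminb <- system.time(
  nlminb(start=initial_vec_thetas, objective=costFunction, gradient=gradFunction,
         n.input=n.input, n.hidden=n.hidden, n.class=n.class,
         X=trainX, y=trainy, lambda=lambda, control=list(iter.max=100))
)
t.optim["elapsed"]; t.nlminb["elapsed"]   #Elapsed seconds for each optimizer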

Predicted class

Let’s convert the response variable and predicted values back to original classes

pred.train.class = class_y(pred.train, classes)$y_class

trainy.class = class_y(trainy, classes)$y_class

pred.test.class = class_y(pred.test, classes)$y_class

testy.class = class_y(testy, classes)$y_class

Let’s look at the confusion matrices

confMatTrain = table(trainy.class, pred.train.class)

confMatTest = table(testy.class, pred.test.class)

confMatTrain
##             pred.train.class
## trainy.class benign malignant
##    benign       345         2
##    malignant      1       198
cat('\n Accuracy on train data set:\n',mean((pred.train.class == trainy.class))*100, '\n')
## 
##  Accuracy on train data set:
##  99.45055
confMatTest
##            pred.test.class
## testy.class benign malignant
##   benign        94         3
##   malignant      3        37
cat('\n Accuracy on test data set:\n',mean((pred.test.class == testy.class))*100, '\n')
## 
##  Accuracy on test data set:
##  95.62044

Another example

Let’s quickly run through another example and see how our neural net performs.

We will use the iris data set, which comes with base R. We have already built a multinomial logistic regression model on this data (go through lab1 in the Logistic Regression section for more details).

Train and Test subset

Data dimensions and converting into X and y

dim(iris)
## [1] 150   5
levels(iris$Species)
## [1] "setosa"     "versicolor" "virginica"
X = iris[, -5]
y = iris$Species

Mapping response variable into integers

y_mapped = map_y(y)

y_map = y_mapped$y_map
classes = y_mapped$classes

y_map[c(1:3, 51:53, 101:103)]
## [1] 1 1 1 2 2 2 3 3 3
y[c(1:3, 51:53, 101:103)]
## [1] setosa     setosa     setosa     versicolor versicolor versicolor virginica 
## [8] virginica  virginica 
## Levels: setosa versicolor virginica
classes
## [1] "setosa"     "versicolor" "virginica"
m = dim(X)[1]
set.seed(1); train.idx = sample(1:m, 120, replace=F)

trainX = X[train.idx, ]
testX = X[-train.idx, ]

train.y = y_map[train.idx]    #Mapped into integer values for respective classes
train.y.class = y[train.idx]  #Original classes

test.y = y_map[-train.idx]    #Mapped into integer values for respective classes
test.y.class = y[-train.idx]  #Original classes


dim(trainX)
## [1] 120   4
dim(testX)
## [1] 30  4
levels(test.y.class)
## [1] "setosa"     "versicolor" "virginica"
levels(train.y.class)
## [1] "setosa"     "versicolor" "virginica"

Training the neural net

Layer dimensions

n.input  = 4    # 4 input variables
n.hidden = 10   # 10 hidden units
n.class = 3     # 3 classes, "setosa", "versicolor" and "virginica" 

Initializing random Neural Network Weights

iniTheta.h = rand.Weights(n.input, n.hidden)
iniTheta.out = rand.Weights(n.hidden, n.class)

Unrolling thetas

initial_vec_thetas = c(as.vector(iniTheta.h) , as.vector(iniTheta.out))

Training Neural Network with optim()

lambda = 1

#Compute Weights and cost using optim() as optimizer
#number of iterations = 100

costh <- optim(par=initial_vec_thetas, fn=costFunction, gr=gradFunction, 
               method="BFGS", n.input=n.input,
               n.hidden=n.hidden, n.class=n.class,
               X=trainX,y=train.y,lambda=lambda, control = list(maxit=100))

vec_thetas <- costh$par

cost <- costh$value


cat('Cost after last iteration:', '\n', cost)
## Cost after last iteration: 
##  0.6549467

Rolling vec_thetas back to theta.h and theta.out

theta.h <- vec_thetas[1:(n.hidden*(n.input + 1))] #Between input and hidden layers
dim(theta.h) <- c(n.hidden, (n.input + 1))
   
theta.out <- vec_thetas[(1 + (n.hidden * (n.input + 1))):length(vec_thetas)] #Between hidden and output layers
dim(theta.out) <- c(n.class, (n.hidden + 1))

Prediction accuracy on the train and test sets

Train data set

pred.train = predict(theta.h, theta.out, trainX)

cat('Accuracy on train data set:\n',mean((pred.train == train.y))*100)
## Accuracy on train data set:
##  98.33333

Test data set

pred.test = predict(theta.h, theta.out, testX)

cat('Accuracy on test data set:\n',mean((pred.test == test.y))*100)
## Accuracy on test data set:
##  93.33333

Predicted class

Let’s convert the response variable and predicted values back to original classes

pred.train.class = class_y(pred.train, classes)$y_class

pred.test.class = class_y(pred.test, classes)$y_class

Let’s look at the confusion matrices

confMatTrain = table(train.y.class, pred.train.class)

confMatTest = table(test.y.class, pred.test.class)

confMatTrain
##              pred.train.class
## train.y.class setosa versicolor virginica
##    setosa         39          0         0
##    versicolor      0         36         2
##    virginica       0          0        43
cat('\n Accuracy on train data set:\n',mean((pred.train.class == train.y.class))*100, '\n')
## 
##  Accuracy on train data set:
##  98.33333
confMatTest
##             pred.test.class
## test.y.class setosa versicolor virginica
##   setosa         11          0         0
##   versicolor      0         12         0
##   virginica       0          2         5
cat('\n Accuracy on test data set:\n',mean((pred.test.class == test.y.class))*100, '\n')
## 
##  Accuracy on test data set:
##  93.33333

As we can see, our neural net has again performed extremely well.

Exercise: Use different data sets from R, MASS or any other package or source that you like, and run this algorithm to see how it works.
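As a starting point for the exercise, here is a sketch using the Sonar data set from the mlbench package (the split, hidden-layer size and lambda below are arbitrary choices, and the code assumes all the functions defined above are in the workspace):

#Exercise starter (sketch): the same pipeline on the Sonar data from mlbench
library(mlbench)
data(Sonar)                          #208 records, 60 numeric inputs, 2 classes ("M" and "R")

sonar.X = as.matrix(Sonar[, 1:60])   #Predictors are already numeric, so model.matrix() is not needed
sonar.y = map_y(Sonar$Class)$y_map   #Map "M"/"R" to integers

set.seed(1); idx = sample(1:nrow(sonar.X), floor(0.8*nrow(sonar.X)))   #80% train split

n.input = 60; n.hidden = 10; n.class = 2
init = c(as.vector(rand.Weights(n.input, n.hidden)),
         as.vector(rand.Weights(n.hidden, n.class)))

fit = optim(par = init, fn = costFunction, gr = gradFunction, method = "BFGS",
            n.input = n.input, n.hidden = n.hidden, n.class = n.class,
            X = sonar.X[idx, ], y = sonar.y[idx], lambda = 1, control = list(maxit = 100))

The remaining steps, rolling fit$par back into theta.h and theta.out, calling predict() on the train and test subsets and mapping the predictions back with class_y(), are exactly the same as in the two examples above.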

