Lab_LRC_1_RM
Asmi Ariv
2022-09-28
Logistic Regression
In this lab, we will focus on logistic regression using R.
We will not focus on exploratory data analysis, as we will be looking at multiple data sets. The main objective is to work through building classification models using binomial and multinomial (multi-class) logistic regression on different data sets.
We will start with the iris data set for the multinomial logistic regression model.
Multinomial Logistic Regression
A multinomial logistic regression model is built on a data set whose response variable has multiple classes (>2). For this we will use the package “VGAM”, which provides the function “vglm”. This function is easy to use. You need to install this package.
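For reference, with K classes and class K chosen as the reference level, the model fits K-1 logit equations and recovers the class probabilities from them:
log[P(class j)/P(class K)] = b0_j + b1_j*x1 + ... + bp_j*xp, for j = 1, ..., K-1
P(class j) = exp(logit_j)/(1 + exp(logit_1) + ... + exp(logit_K-1)) and P(class K) = 1/(1 + exp(logit_1) + ... + exp(logit_K-1))
We will apply exactly this arithmetic by hand in the Predictions section below.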
install.packages("VGAM")Let’s look at some of the details of the data:
dim(iris) #Dimension of the data set (rows and columns)
## [1] 150 5
names(iris) #Names of the variables in the data set
## [1] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width" "Species"
summary(iris) #Summary of the data set
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100
## 1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300
## Median :5.800 Median :3.000 Median :4.350 Median :1.300
## Mean :5.843 Mean :3.057 Mean :3.758 Mean :1.199
## 3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800
## Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500
## Species
## setosa :50
## versicolor:50
## virginica :50
##
##
##
There are five variables in the data set with 150 records. The data set contains records of different species of the flower iris.
The variables are:
- Sepal Length (Numeric)
- Sepal Width (Numeric)
- Petal Length (Numeric)
- Petal Width (Numeric)
- Species (categorical)
We will use Species as the response variable, which has three categories: setosa, versicolor, and virginica.
#Mean of all features for different species
aggregate(cbind(Sepal.Length, Sepal.Width, Petal.Length, Petal.Width) ~ Species, data = iris, FUN=mean)
## Species Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1 setosa 5.006 3.428 1.462 0.246
## 2 versicolor 5.936 2.770 4.260 1.326
## 3 virginica 6.588 2.974 5.552 2.026
#Median of all features for different species
aggregate(cbind(Sepal.Length, Sepal.Width, Petal.Length, Petal.Width) ~ Species, data = iris, FUN=median)
## Species Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1 setosa 5.0 3.4 1.50 0.2
## 2 versicolor 5.9 2.8 4.35 1.3
## 3 virginica 6.5 3.0 5.55 2.0
So, the above tables show the mean and median of all the features for different species. As we can observe from the means and medians of the lengths and widths, setosa seems to be the smallest and virginica the largest in size, except that Sepal.Width is largest for setosa.
Based on our observations, let’s plot boxplots:
boxplot(Sepal.Length ~ Species, data = iris, col=c(2,3,4))
boxplot(Sepal.Width ~ Species, data = iris, col=c(2,3,4))
From the above plots we can also see that Sepal.Width is largest for the setosa species of iris.
Exercise: Try boxplots and other types of plots for the data set and draw your conclusions based on that.
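For instance, a scatterplot matrix of the four numeric features, colored by species with the same color codes as above, is one quick way to see how well the classes separate (a minimal sketch):
pairs(iris[, 1:4], col = as.numeric(iris$Species) + 1, pch = 19) #Colors 2,3,4 match the boxplots above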
Train and Test subset
m = dim(iris)[1] #Total number of records (150)
set.seed(1); train.n = sample(1:m, 120, replace=F) #Randomly select 120 row indices for training
train = iris[train.n, ]
test = iris[-train.n, ]
dim(train)
## [1] 120 5
dim(test)
## [1] 30 5
Multinomial Logistic Model for IRIS
Let’s load the package “VGAM” and build the model on the train data:
library(VGAM)
mulC = vglm(Species ~ ., data=train, family = multinomial(refLevel=3)) #Here we are using virginica as the reference level; type levels(iris$Species) to check the levels
coef(mulC)
## (Intercept):1 (Intercept):2 Sepal.Length:1 Sepal.Length:2 Sepal.Width:1
## 45.23777274 56.89190371 8.51186654 0.05398653 12.12252310
## Sepal.Width:2 Petal.Length:1 Petal.Length:2 Petal.Width:1 Petal.Width:2
## 7.54364522 -24.18141496 -10.23431193 -27.85185960 -16.58067086
Predictions
Let’s make predictions on test data
pred = predict(mulC, newdata= test) #It gives the logit values of "setosa" and "versicolor" as against "virginica"
head(pred)
## log(mu[,1]/mu[,3]) log(mu[,2]/mu[,3])
## 4 80.12969 61.85794
## 5 92.01384 66.67479
## 8 87.17119 64.14263
## 9 78.42095 61.36184
## 11 94.21269 66.42732
## 16 99.68165 68.40793
pred = exp(pred) #Calculating the exponential values of logits for both "setosa" and "versicolor"
head(pred)
## log(mu[,1]/mu[,3]) log(mu[,2]/mu[,3])
## 4 6.307837e+34 7.320857e+26
## 5 9.143261e+39 9.046751e+28
## 8 7.210521e+37 7.191002e+27
## 9 1.142315e+34 4.457697e+26
## 11 8.242382e+40 7.063450e+28
## 16 1.955192e+43 5.119015e+29
sum=(pred[,1]+pred[,2] + 1) #Sum of exponential values of "setosa", "versicolor" and "virginica" (it is 1 as it is the reference level)
head(sum)
## 4 5 8 9 11 16
## 6.307838e+34 9.143261e+39 7.210521e+37 1.142315e+34 8.242382e+40 1.955192e+43
pVirginica = round(1/sum) #Probability values for virginica (rounded to the nearest integer)
pSetosa = round(pred[,1]/sum) #Probability values for setosa (rounded)
pVersicolor = round(pred[,2]/sum) #Probability values for versicolor (rounded)
predTest = data.frame(pSetosa, pVersicolor, pVirginica, test$Species)
predTest
## pSetosa pVersicolor pVirginica test.Species
## 4 1 0 0 setosa
## 5 1 0 0 setosa
## 8 1 0 0 setosa
## 9 1 0 0 setosa
## 11 1 0 0 setosa
## 16 1 0 0 setosa
## 30 1 0 0 setosa
## 36 1 0 0 setosa
## 46 1 0 0 setosa
## 47 1 0 0 setosa
## 49 1 0 0 setosa
## 52 0 1 0 versicolor
## 54 0 1 0 versicolor
## 57 0 1 0 versicolor
## 62 0 1 0 versicolor
## 67 0 1 0 versicolor
## 69 0 1 0 versicolor
## 72 0 1 0 versicolor
## 80 0 1 0 versicolor
## 90 0 1 0 versicolor
## 96 0 1 0 versicolor
## 97 0 1 0 versicolor
## 99 0 1 0 versicolor
## 103 0 0 1 virginica
## 107 0 1 0 virginica
## 109 0 0 1 virginica
## 117 0 0 1 virginica
## 128 0 0 1 virginica
## 137 0 0 1 virginica
## 139 0 1 0 virginica
The first column shows the row numbers as in the original iris data. Wherever you see the value 1 under a class column (pSetosa, pVersicolor, pVirginica), that class is our model’s prediction for that record, and the last column shows the actual species in the test data. In this example, the probability values have turned out to be 1s and 0s (we rounded them, and the fitted probabilities are extreme), but this may not always be the case.
If you use this method to build a model and then predict probability values for each class, you may get values (>=0 and <1) spread across the classes of the same record. In that case, the class with the highest value is the prediction for that record.
As you can see, the model has performed fairly well on the test data, except for the misclassification of two records, 107 and 139. That is over 93% accuracy. Of course, there is always a possibility to improve the model performance by using more data, polynomial features, transformed features, etc.
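As a side note, predict() with type = "response" returns the class probabilities directly, so the manual exponentiation above can be skipped. A minimal sketch using the mulC model and test set from above:
probs = predict(mulC, newdata = test, type = "response") #Matrix of class probabilities, one column per species
pred.class = colnames(probs)[apply(probs, 1, which.max)] #Most probable class for each record
table(pred.class, test$Species) #Confusion matrix
mean(pred.class == test$Species) #Overall accuracy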
Binomial Logistic Regression
Now let’s build a logistic regression model to solve a binomial classification problem.
For this section of the lab, we will use the Smarket data set.
Details of the data set:
- Contains the percentage returns on S&P 500 stock index for 1250 days (2001 to 2005)
- Lag1 to Lag5: percentage returns of five previous days for each date
- Volume: Number of shares (in billions) traded the previous day
- Today: Percentage return on today’s date (i.e. date in question)
- Direction: Indicates whether market was up or down (We will use this as our response variable)
This dataset is part of a package, ISLR. You need to install this package.
install.packages("ISLR")Let’s load the package and look at some of the key details of the data:
library(ISLR)
dim(Smarket)
## [1] 1250 9
names(Smarket)
## [1] "Year" "Lag1" "Lag2" "Lag3" "Lag4" "Lag5"
## [7] "Volume" "Today" "Direction"As discussed above, there are 9 variables in the data set and 1250 records
summary(Smarket)
## Year Lag1 Lag2 Lag3
## Min. :2001 Min. :-4.922000 Min. :-4.922000 Min. :-4.922000
## 1st Qu.:2002 1st Qu.:-0.639500 1st Qu.:-0.639500 1st Qu.:-0.640000
## Median :2003 Median : 0.039000 Median : 0.039000 Median : 0.038500
## Mean :2003 Mean : 0.003834 Mean : 0.003919 Mean : 0.001716
## 3rd Qu.:2004 3rd Qu.: 0.596750 3rd Qu.: 0.596750 3rd Qu.: 0.596750
## Max. :2005 Max. : 5.733000 Max. : 5.733000 Max. : 5.733000
## Lag4 Lag5 Volume Today
## Min. :-4.922000 Min. :-4.92200 Min. :0.3561 Min. :-4.922000
## 1st Qu.:-0.640000 1st Qu.:-0.64000 1st Qu.:1.2574 1st Qu.:-0.639500
## Median : 0.038500 Median : 0.03850 Median :1.4229 Median : 0.038500
## Mean : 0.001636 Mean : 0.00561 Mean :1.4783 Mean : 0.003138
## 3rd Qu.: 0.596750 3rd Qu.: 0.59700 3rd Qu.:1.6417 3rd Qu.: 0.596750
## Max. : 5.733000 Max. : 5.73300 Max. :3.1525 Max. : 5.733000
## Direction
## Down:602
## Up :648
##
##
##
##
pairs(Smarket)
Although the plots aren’t very legible, we can still make out that there is hardly any correlation between any two variables.
cor(Smarket[, -9])
## Year Lag1 Lag2 Lag3 Lag4
## Year 1.00000000 0.029699649 0.030596422 0.033194581 0.035688718
## Lag1 0.02969965 1.000000000 -0.026294328 -0.010803402 -0.002985911
## Lag2 0.03059642 -0.026294328 1.000000000 -0.025896670 -0.010853533
## Lag3 0.03319458 -0.010803402 -0.025896670 1.000000000 -0.024051036
## Lag4 0.03568872 -0.002985911 -0.010853533 -0.024051036 1.000000000
## Lag5 0.02978799 -0.005674606 -0.003557949 -0.018808338 -0.027083641
## Volume 0.53900647 0.040909908 -0.043383215 -0.041823686 -0.048414246
## Today 0.03009523 -0.026155045 -0.010250033 -0.002447647 -0.006899527
## Lag5 Volume Today
## Year 0.029787995 0.53900647 0.030095229
## Lag1 -0.005674606 0.04090991 -0.026155045
## Lag2 -0.003557949 -0.04338321 -0.010250033
## Lag3 -0.018808338 -0.04182369 -0.002447647
## Lag4 -0.027083641 -0.04841425 -0.006899527
## Lag5 1.000000000 -0.02200231 -0.034860083
## Volume -0.022002315 1.00000000 0.014591823
## Today -0.034860083 0.01459182 1.000000000
As we can see from the correlation matrix above, most of the variables aren’t correlated, except Year and Volume, which have some substantial correlation.
attach(Smarket)
plot(Volume)
Since Volume represents the number of shares traded the previous day (the daily volume of shares in billions), the plot suggests that the daily average volume of shares traded continuously increased from 2001 to 2005, with some spikes here and there.
Are the two classes linearly separable?
plot(Volume,Today,col=(Direction=="Up")+1)
From the plot above, we can clearly see that the two classes are linearly separable in this space of Volume and Today (not surprising, since Direction simply records the sign of Today), which may not be the case in other spaces using other axes.
Exercise: Plot all the other pairs using Direction as the color (same as above), such as Lag1 and Lag2, Lag3 and Lag4, etc. One example is sketched below.
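For instance, following the same coloring trick (Smarket is attached above, so the columns are available by name):
plot(Lag1, Lag2, col = (Direction=="Up")+1, xlab="Lag1", ylab="Lag2") #One pair; repeat for the other combinations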
We will build a logistic regression to predict Direction using all the other variables except Year (as it doesn’t make sense to include this variable).
We will use the R function glm(), which stands for generalized linear models. It can fit various types of models, including logistic regression; we need to pass family=binomial for logistic regression.
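For reference, binomial logistic regression models the log-odds of the positive class as a linear function of the predictors:
log[p/(1-p)] = b0 + b1*x1 + ... + bk*xk, i.e. p = 1/(1 + exp(-(b0 + b1*x1 + ... + bk*xk)))
where p is the probability of the positive class (here, Direction being Up).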
Train and Test subset
Since this is time series (sequential) data, we will split it into train and test sets a little differently than we would a randomized data set.
Let’s split the data set into train (data from 2001 to 2004) and test (all 2005 data):
train.idx = Smarket$Year < 2005 #TRUE for records from 2001-2004
train = Smarket[train.idx, ]
test = Smarket[!train.idx, ]
dim(train)
## [1] 998 9
dim(test)
## [1] 252 9
train.n = dim(train)[1]
test.n = dim(test)[1]
Binomial logistic regression model for Smarket data set
Now let’s build the model on train set
glm.fit1 = glm(Direction ~ Lag1+Lag2+Lag3+Lag4+Lag5+Volume, data=train, family=binomial )
coef(glm.fit1)
## (Intercept) Lag1 Lag2 Lag3 Lag4 Lag5
## 0.191212621 -0.054178292 -0.045805334 0.007200118 0.006440875 -0.004222672
## Volume
## -0.116256960
Lag1, Lag2, and Lag5 have negative coefficients, suggesting that if the market was up on those days, it is less likely that the market will be up today. Lag3 and Lag4 have positive coefficients, suggesting that if the market was up on those days, it is likely that the market will be up today as well.
contrasts(Direction)
## Up
## Down 0
## Up 1
So, R has used Up as 1 and Down as 0.
Prediction
Let’s look at the predictions on both train and test data:
pred.train = predict(glm.fit1, newdata= train, type = "response")
pred.test = predict(glm.fit1, newdata= test, type = "response")
pred.train[1:5]
## 1 2 3 4 5
## 0.4985061 0.4893137 0.4849791 0.5098788 0.5145737
pred.test[1:5]
## 999 1000 1001 1002 1003
## 0.5282195 0.5156688 0.5226521 0.5138543 0.4983345
What we are getting are the probability values of Up, as R is treating Up as 1 (success).
Let’s look at train and test errors
train.class = 1*(train$Direction=="Up") #Getting actual class in 1s and 0s
pred.train.class=floor(pred.train+0.5) #Turning predictions into 1s and 0s
confMtx.train=table(train.class,pred.train.class)
error.train=(confMtx.train[1,2]+confMtx.train[2,1])/train.n #(False Positive + False Negative)/no. of records
test.class = 1*(test$Direction=="Up") #Getting actual class in 1s and 0s
pred.test.class=floor(pred.test+0.5) #Turning predictions into 1s and 0s
confMtx.test=table(test.class,pred.test.class)
error.test=(confMtx.test[1,2]+confMtx.test[2,1])/test.n #(False Positive + False Negative)/no. of records
confMtx.train
## pred.train.class
## train.class 0 1
## 0 175 316
## 1 156 351
confMtx.test
## pred.test.class
## test.class 0 1
## 0 77 34
## 1 97 44
error.train
## [1] 0.4729459
error.test
## [1] 0.5198413
Clearly, the model has performed poorly: the error rate for the training data is 47% and that for the test data is around 52%. This is expected, as one cannot really predict the stock market just by using previous days’ returns.
Another reason, from the logistic regression point of view, becomes apparent when you do the exercise of plotting pairs of variables using Direction as the color to see whether the classes are linearly separable in a different space. Logistic regression works well when the decision boundaries are linearly separable. For non-linearly separable decision boundaries, we can use other algorithms such as Neural Networks, Support Vector Machines, etc. (which we will see in later sections).
You can also look at the summary of the model we built. The p-values of the features are much larger than the significance level of 0.05. Hence, they aren’t helpful; rather, they increase variance without decreasing bias, thus increasing the test errors.
However, we will use Lag1 and Lag2, as they have the lowest p-values among all the features.
glm.fit2 = glm(Direction ~ Lag1+Lag2, data=train, family=binomial )
coef(glm.fit2)
## (Intercept) Lag1 Lag2
## 0.03221753 -0.05562122 -0.04448684
Let’s look at the predictions:
pred.train = predict(glm.fit2, newdata= train, type = "response")
pred.test = predict(glm.fit2, newdata= test, type = "response")
pred.train[1:5]
## 1 2 3 4 5
## 0.5048917 0.4904830 0.4830449 0.5052396 0.5064450
pred.test[1:5]
## 999 1000 1001 1002 1003
## 0.5098275 0.5208237 0.5332635 0.5260574 0.5072103
Let’s look at the train and test errors:
pred.train.class = ifelse(pred.train<0.5, "Down", "Up") #Predicted class for train data
pred.test.class = ifelse(pred.test<0.5, "Down" , "Up") #Predicted class for test data
error.train = mean(pred.train.class != train$Direction) #Error rate for train data
error.test = mean(pred.test.class != test$Direction) #Error rate for test data
confMtx.train=table(train$Direction,pred.train.class)
confMtx.test=table(test$Direction,pred.test.class)
confMtx.train
## pred.train.class
## Down Up
## Down 168 323
## Up 160 347
confMtx.test
## pred.test.class
## Down Up
## Down 35 76
## Up 35 106
error.train
## [1] 0.4839679
error.test
## [1] 0.4404762
accu.test.Up = confMtx.test[2,2]/sum(confMtx.test[,2]) #Percentage of accuracy for prediction "Up"
accu.test.Up
## [1] 0.5824176
So, the error rate for the test data has come down to 44%, i.e. the accuracy of prediction on the test data is now 56%.
This simple model’s predictions of whether the market will go up or down will be correct 56% of the time, with a 44% chance of being wrong.
When the model predicts that the market will be up, there is a 58% chance that the prediction is correct and a 42% chance that it is incorrect.
So we can take a cue from the above: if the model predicts Up and we buy shares that day, we have a 58% chance of making a profit, and we can avoid buying on days when the model predicts Down.
Chi-square Test of the model
Another way to test the model fit in logistic regression is the chi-square test.
It is analogous to the F-test (ANOVA) in linear regression, which compares the prediction of y (the response variable) without x (the independent variables) against the prediction of y with x, i.e. the base model (intercept only) versus the regression model with intercept and independent variables.
Similarly, for a classification model we use the chi-square test, in which we calculate the chi-square value from the null deviance and residual deviance, with their respective degrees of freedom.
That is, the chi-square value is calculated by comparing two models (one with predictors and one without), which gives us the difference in goodness of fit. The intuition is that the goodness of fit with predictors must be higher than that without predictors.
Hence, chi-square = -2 x (log-likelihood without predictors - log-likelihood with predictors)
= -2 ln[likelihood without predictors / likelihood with predictors]
= null deviance - residual deviance
Hence, the chi-square test is a test of the fit of the model. We just calculate the p-value of the chi-square statistic; if the p-value is less than 0.05, the null hypothesis is rejected and the model with independent variables is accepted.
#P-value of chi-square
pchisq(glm.fit2$null.deviance - glm.fit2$deviance, glm.fit2$df.null - glm.fit2$df.residual, lower.tail = FALSE)
## [1] 0.393926
As we can see, the p-value is much larger than 0.05. Therefore, this model is not a good fit, which was obvious from our test predictions as well.
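As a side note, R’s built-in anova() on the fitted glm object reports the same null and residual deviances as a sequential deviance table, with a chi-square test per added term:
anova(glm.fit2, test = "Chisq") #Deviance drops (and p-values) as Lag1 and Lag2 are added to the null model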
If you want to make a prediction on new, unseen data for Lag1 (e.g. -0.12, 1.58) and Lag2 (e.g. 0.49, 0.67):
p = predict(glm.fit2, newdata= data.frame(Lag1 = c(-0.12, 1.58), Lag2=c(0.49, 0.67)), type="response")
p
## 1 2
## 0.5042733 0.4786455
p.class = ifelse(p<0.5, "Down", "Up")
p.class
## 1 2
## "Up" "Down"
detach(Smarket)
Default Data - Binomial Logistic Regression
Let’s build a binomial logistic regression on a data set “Default” from the package “ISLR”.
library("ISLR")
dim(Default)
## [1] 10000 4
names(Default)
## [1] "default" "student" "balance" "income"
It is a simulated data set containing information on ten thousand customers: 10,000 records and 4 variables. The aim here is to predict which customers will default on their credit card debt.
- default: A factor with levels No and Yes indicating whether the customer defaulted on their debt
- student: A factor with levels No and Yes indicating whether the customer is a student
- balance: The average balance that the customer has remaining on their credit card after making their monthly payment
- income: Income of the customer
boxplot(balance ~ student, data=Default, xlab="Student Status", ylab= "Credit Card Balance", col=c(2, 3))
boxplot(income ~ student, data=Default, xlab="Student Status", ylab= "Income of Customer", col=c(2, 3))
As we can see from the plots, students tend to have a higher average credit card balance than non-students, while the average income of students is lower than that of non-students, which is quite expected.
#Plotting bar chart of Balance vs Defaulters and Student Status
Student_status = ifelse(Default$student=="Yes", "Student", "Non-student")
default_status = ifelse(Default$default=="Yes", "Defaulter", "Non-defaulter")
credit_status = tapply(Default$balance, list(default_status, Student_status), FUN = mean)
barplot(credit_status, beside = TRUE, col= c(2,3), legend.text=TRUE, main= "Balance vs Defaulters and Student Status" )
credit_status
## Non-student Student
## Defaulter 1678.4295 1860.3791
## Non-defaulter 744.5044 948.4802As we can see that whether student or non-student, the customers with higher average credit card balance tend to default more likely than those with lower average credit card balance.
Let’s write a function to calculate cumulative default rates and cumulative average credit card balance. The function takes def (1s and 0s, where a defaulter is 1) for defaulters and bal for average credit card balance for each customer. Returns cumulative defaulters and average credit card balance for the number of records in the data.
default_rate = function(def, bal){
m = length(def)
cum_rate = c(rep(0, m))
cum_bal = c(rep(0, m))
for(i in 2:m){ #Loop starts at 2, so the first element of each vector stays 0
cum_rate[i] = sum(def[1:i])
cum_bal[i] = sum(bal[1:i])
}
list(cum_rate = cum_rate, cum_bal = cum_bal)
}
Let’s calculate the cumulative defaulter counts and credit card balances for students and non-students:
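As a side note, base R’s cumsum() computes the same cumulative sums in a single vectorized call; a minimal sketch (note it also counts the first record, which the loop above leaves at zero):
default_rate2 = function(def, bal){
  list(cum_rate = cumsum(def), cum_bal = cumsum(bal)) #Vectorized cumulative counts and balances
}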
def = ifelse(Default$default=="Yes", 1,0)
studen_def_rate = default_rate(def[Default$student=="Yes"], Default$balance[Default$student=="Yes"])
nonStu_def_rate = default_rate(def[Default$student=="No"], Default$balance[Default$student=="No"])
studen_def_rate$cum_rate[1:5]; studen_def_rate$cum_bal[1:5]
## [1] 0 0 0 0 0
## [1] 0.000 1736.769 2545.436 2545.436 3766.020
nonStu_def_rate$cum_rate[1:5]; nonStu_def_rate$cum_bal[1:5]
## [1] 0 0 0 0 0
## [1] 0.000 1803.076 2332.326 3117.982 3943.495
Let’s look at some plots:
plot(Default$balance[Default$student=="Yes"], studen_def_rate$cum_rate, xlab="Average Credit Card Balance", ylab="Cumulative Defaulter Rate", type="l", col="blue")
lines(Default$balance[Default$student=="No"], nonStu_def_rate$cum_rate, col="orange")
legend("topright", legend=c("Student", "Non-student"), lwd=2, col = c("blue", "orange"))From the above plot, we can understand that if the average credit card balance is somewhere between 600 and 1600, the chances of defaulters among non-student customers are quite high as this region is thickly filled with non-student defaulters (orange lines).
Once the average credit card balance goes beyond 1700-1800, the chances of default among student customers are much higher, as this region is thickly filled with student defaulters (blue line).
Note, though, that the overall significant range of average credit card balance is quite wide (almost 1600) for non-student defaulters, compared to that of student defaulters, which is maybe just 2300 - 1700 = 600.
Let’s look at another plot
plot(studen_def_rate$cum_bal, studen_def_rate$cum_rate, xlab="Cumulative Average Credit Card Balance", ylab="Cumulative Defaulter Rate", type="l", col="blue")
lines(nonStu_def_rate$cum_bal, nonStu_def_rate$cum_rate, col="orange")
legend("topright", legend=c("Student", "Non-student"), lwd=2, col = c("blue", "orange"))Here we can see that the cumulative Defaulter rate against cumulative average credit card balance of student customers is higher throughout compared to that of non-student customers.
In the beginning, both are almost close, but once the cumulative CC balance goes beyond 1,250,000 (1.25 million) the gap between two graphs keep widening, i.e. student customers tend to default more likely.
Splitting data into train and test
m = dim(Default)[1]
set.seed(1); train.idx = sample(1:m, floor(0.8*m), replace=F)
train.n = length(train.idx)
test.n = m-train.n
train = Default[train.idx,]
test = Default[-train.idx,]
train.n; test.n
## [1] 8000
## [1] 2000
Building binomial logistic regression model on train set
glm.fit3 = glm(default ~ student + balance + income, data=train, family=binomial)
summary(glm.fit3)
##
## Call:
## glm(formula = default ~ student + balance + income, family = binomial,
## data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.4736 -0.1389 -0.0543 -0.0202 3.5171
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.094e+01 5.517e-01 -19.834 < 2e-16 ***
## studentYes -7.282e-01 2.683e-01 -2.714 0.00664 **
## balance 5.713e-03 2.585e-04 22.100 < 2e-16 ***
## income 5.877e-06 9.172e-06 0.641 0.52170
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2313.6 on 7999 degrees of freedom
## Residual deviance: 1231.9 on 7996 degrees of freedom
## AIC: 1239.9
##
## Number of Fisher Scoring iterations: 8
The model has chosen No as the reference level for student and has created a dummy variable studentYes for the categorical variable student. The p-values are quite low for all variables except income.
As per this model, considering the coefficient of studentYes (which is negative), for fixed values of income and balance, students seem to default less than non-students.
Model performance on test data
Let’s test our model’s performance on test data.
pred.test = predict(glm.fit3, newdata=test, type="response")
pred.test.class = ifelse(pred.test<0.5, "No", "Yes")
confMtx.test = table(test$default, pred.test.class)
error.test = mean(pred.test.class != test$default) #Error rate on test data
accu.test = mean(pred.test.class == test$default) #Accuracy on test data
accu.test.Yes = confMtx.test[2,2]/sum(confMtx.test[,2]) #Percentage of accuracy for prediction "Yes"
error.test
## [1] 0.0275
accu.test
## [1] 0.9725
accu.test.Yes
## [1] 0.8947368
sum(test$default=="Yes") #Number of defaulters in test data
## [1] 70
sum(train$default=="Yes") #Number of defaulters in train data
## [1] 263
We can see that the model performed really well on the test data, with an error rate as low as 2.75% and accuracy as high as 97.25%.
The accuracy of the Defaulter predictions (accu.test.Yes) is 89.47%: when the model predicts a defaulter, it is correct almost 90% of the time. That is quite good, considering the test data has only 70 defaulters among 2000 records and the train data only 263 defaulters among 8000 records. See the quick check below for the complementary recall measure.
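Note that accu.test.Yes is a precision: it conditions on the model’s Yes predictions. A complementary question is what fraction of the actual defaulters the model catches (recall). A quick check using the confusion matrix already computed (rows are actuals, columns are predictions):
precision.Yes = confMtx.test[2,2]/sum(confMtx.test[,2]) #Of the predicted "Yes", the fraction that actually defaulted
recall.Yes = confMtx.test[2,2]/sum(confMtx.test[2,]) #Of the actual defaulters, the fraction the model flagged
precision.Yes; recall.Yes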
Let’s look at chi-square test:
pchisq(glm.fit3$null.deviance - glm.fit3$deviance, glm.fit3$df.null - glm.fit3$df.residual, lower.tail = FALSE)
## [1] 3.427193e-234
The p-value is much smaller than 0.05; hence we reject the null hypothesis and accept the model with predictors.
Exercise: Build a logistic regression model using the “Default” data from the ISLR package, excluding the income variable (as its p-value is large), and repeat the entire process as above. Try to understand the changes in the coefficient values and their interpretations.
ROC Curve
We can either plot the ROC curve using sensitivity (i.e. true positive rate) vs specificity (i.e. true negative rate),
or plot it using true positive rate vs false positive rate.
Let’s define what these are (recall the confusion matrix):
true positive rate = (true positives)/(true positives + false negatives) = (true positives)/(actual positives)
true positive rate is also known as sensitivity
true negative rate = (true negatives)/(true negatives + false positives) = (true negatives)/(actual negatives)
true negative rate is also known as specificity
false positive rate = (false positives)/(true negatives + false positives) = (false positives)/(actual negatives)
false positive rate is 1 - specificity
Taking the above example (the Default model predictions on the test data), let’s plot the ROC curve using the first method, i.e. sensitivity vs specificity.
For this we will use the package pROC; you need to install it.
install.packages("pROC")library(pROC)## Warning: package 'pROC' was built under R version 4.1.3rocScore = roc(test$default, pred.test) #Calculate ROC scores (different rates of sensitivity and specificity)
plot(rocScore, main="ROC Curve")
If you want to know all the components of rocScore:
names(rocScore) #To get different components
## [1] "percent" "sensitivities" "specificities"
## [4] "thresholds" "direction" "cases"
## [7] "controls" "fun.sesp" "auc"
## [10] "call" "original.predictor" "original.response"
## [13] "predictor" "response" "levels"rocScore$auc #To get the area under curve## Area under the curve: 0.9368length(rocScore$thresholds) #To get the number of thresholds (or iterations used by the function)## [1] 2001Let’s create ROC Curve using second method (i.e. true positive rate vs false positive rate) for the test predictions
tp = c(1:50000) #True positive rate vector, in our example positive is "Yes" (i.e. defaulter) and negative is "No"
fp = c(1:50000) #False positive rate vector
for ( i in 1:50000){
pred.test.class = factor(ifelse(pred.test>=i/50000, "Yes", "No"), levels=c("No", "Yes"))
confMtx.test = table(test$default, pred.test.class)
tp[i] = confMtx.test[2,2]/(confMtx.test[2,1]+confMtx.test[2,2])
fp[i] = confMtx.test[1,2]/(confMtx.test[1,1]+confMtx.test[1,2])
}
plot(fp,tp, main = "ROC Curve", xlab = "False Positive Rate", ylab="True Positive Rate", type="l")
We had to keep the number of thresholds (iterations) very high (50,000) to get an acceptable ROC curve, perhaps because of the very low percentage of defaulters in the data set. We are not going to calculate the area under the curve (AUC) exactly, but looking at the plot, we get an idea that it is above 0.7.
Although there is no standard value defining a good AUC score, higher is better.
A model with AUC score 1 is able to perfectly predict all the classes without any error, while AUC 0.5 and below is no better than random guessing.
AUC: 0.7 to 0.8 is considered acceptable, anything higher than 0.8 is excellent.
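If you do want a quick numeric estimate of the AUC from the manual curve above, the trapezoidal rule over the (fp, tp) pairs gives one; a minimal sketch using the vectors from the loop:
ord = order(fp) #Sort the points by false positive rate
auc.est = sum(diff(fp[ord]) * (head(tp[ord], -1) + tail(tp[ord], -1))/2) #Trapezoidal area under the curve
auc.est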
The main idea is to minimize the False Positive rate and maximize the True Positive rate by selecting the optimal threshold value (the probability threshold at which our model predicts positive).
To decide the optimal threshold, one needs to take into consideration the domain and industry one is dealing with. In this scenario, the bank or credit card company may be averse to taking high risks when it comes to issuing credit cards to customers. The risk of offering a credit card to a defaulter could cost much more than the risk of losing some genuine customers.
In this case, positive means defaulter. Therefore, when the False Positive rate is high (i.e. genuine customers are falsely flagged as defaulters), there is a greater chance of losing genuine customers, but it also means avoiding the risk of accepting defaulters.
So, the threshold value depends on how much tolerance a domain/industry/company has towards False Positive rates; a sketch for screening candidate thresholds follows below.
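For instance, to shortlist candidate thresholds whose False Positive rate stays within a tolerated band, one can filter the tp and fp vectors from the loop above (the threshold for index i is i/50000); a minimal sketch with a hypothetical 15% cap:
idx = which(fp <= 0.15) #Indices whose FP rate is at most 15%
candidates = data.frame(threshold = idx/50000, TP = tp[idx], FP = fp[idx])
head(candidates[order(-candidates$TP), ]) #Candidates with the highest TP rate under the cap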
pt_fp_Rate = data.frame(TP=tp, FP=fp)
pt_fp_Rate = pt_fp_Rate[order(pt_fp_Rate$FP, decreasing=TRUE),] #Arrange the data in a decreasing order of fp
pt_fp_Rate[1:10,]
## TP FP
## 1 1 0.9865285
## 2 1 0.9046632
## 3 1 0.8839378
## 4 1 0.8637306
## 5 1 0.8466321
## 6 1 0.8321244
## 7 1 0.8176166
## 8 1 0.8031088
## 9 1 0.7880829
## 10 1 0.7803109
Exercise: Suppose you meet the credit card department head and she tells you that they are fine with a 20-23% range of False Positive rate and asks you to build a predictive model for defaulters. Since we have already built the model, your job is to determine the threshold value from the ROC curve above.