Lab_LRC_1_RM
Asmi Ariv
2022-09-28
Logistic Regression
In this lab, we will focus on logistic regression using R.
We will not focus on exploratory data analysis, as we will be looking at multiple data sets. The main objective is to work through building classification models using binomial and multinomial (multi-class) logistic regression on different data sets.
We will start with the iris data set for the multinomial logistic regression model.
Multinomial Logistic Regression
A multinomial logistic regression model is built on a data set whose response variable has multiple classes (>2). For this we will use the package “VGAM”, which provides the function “vglm”. This function is easy to use. You need to install this package.
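For reference, with K classes and class K chosen as the reference level, the model fits K-1 logit equations and recovers the class probabilities from them:
log[P(class j)/P(class K)] = b0_j + b1_j*x1 + ... + bp_j*xp, for j = 1, ..., K-1
P(class j) = exp(logit_j)/(1 + exp(logit_1) + ... + exp(logit_K-1)) and P(class K) = 1/(1 + exp(logit_1) + ... + exp(logit_K-1))
We will apply exactly this arithmetic by hand in the Predictions section below.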
install.packages("VGAM")Let’s look at some of the details of the data:
dim(iris) #Dimension of the data set (rows and columns)
## [1] 150 5
names(iris) #Names of the variables in the data set
## [1] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width" "Species"
summary(iris) #Summary of the data set
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100
## 1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300
## Median :5.800 Median :3.000 Median :4.350 Median :1.300
## Mean :5.843 Mean :3.057 Mean :3.758 Mean :1.199
## 3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800
## Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500
## Species
## setosa :50
## versicolor:50
## virginica :50
##
##
##
There are five variables in the data set with 150 records. The data set contains records of different species of the flower iris.
The variables are:
- Sepal Length (Numeric)
- Sepal Width (Numeric)
- Petal Length (Numeric)
- Petal Width (Numeric)
- Species (categorical)
We will use Species as the response variable, which has three categories: setosa, versicolor, and virginica.
#Mean of all features for different species
aggregate(cbind(Sepal.Length, Sepal.Width, Petal.Length, Petal.Width) ~ Species, data = iris, FUN=mean)
## Species Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1 setosa 5.006 3.428 1.462 0.246
## 2 versicolor 5.936 2.770 4.260 1.326
## 3 virginica 6.588 2.974 5.552 2.026
#Median of all features for different species
aggregate(cbind(Sepal.Length, Sepal.Width, Petal.Length, Petal.Width) ~ Species, data = iris, FUN=median)
## Species Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1 setosa 5.0 3.4 1.50 0.2
## 2 versicolor 5.9 2.8 4.35 1.3
## 3 virginica 6.5 3.0 5.55 2.0
So, the above tables show the mean and median of all the features for different species. As we can observe from the means and medians of the lengths and widths, setosa seems to be the smallest and virginica the largest in size, except that Sepal.Width is largest for setosa.
Based on our observations, let’s plot boxplots:
boxplot(Sepal.Length ~ Species, data = iris, col=c(2,3,4))
boxplot(Sepal.Width ~ Species, data = iris, col=c(2,3,4))
From the above plots we can also see that Sepal.Width is largest for the setosa species of iris.
Exercise: Try boxplots and other types of plots for the data set and draw your conclusions based on that.
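For instance, a scatterplot matrix of the four numeric features, colored by species with the same color codes as above, is one quick way to see how well the classes separate (a minimal sketch):
pairs(iris[, 1:4], col = as.numeric(iris$Species) + 1, pch = 19) #Colors 2,3,4 match the boxplots above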
Train and Test subset
m = dim(iris)[1] #Total number of records (150)
set.seed(1); train.n = sample(1:m, 120, replace=F) #Randomly select 120 row indices for training
train = iris[train.n, ]
test = iris[-train.n, ]
dim(train)
## [1] 120 5
dim(test)
## [1] 30 5
Multinomial Logistic Model for IRIS
Let’s load the package “VGAM” and build the model on the train data:
library(VGAM)
mulC = vglm(Species ~ ., data=train, family = multinomial(refLevel=3)) #Here we are using virginica as the reference level; type levels(iris$Species) to check the levels
coef(mulC)
## (Intercept):1 (Intercept):2 Sepal.Length:1 Sepal.Length:2 Sepal.Width:1
## 45.23777274 56.89190371 8.51186654 0.05398653 12.12252310
## Sepal.Width:2 Petal.Length:1 Petal.Length:2 Petal.Width:1 Petal.Width:2
## 7.54364522 -24.18141496 -10.23431193 -27.85185960 -16.58067086
Predictions
Let’s make predictions on test data
pred = predict(mulC, newdata= test) #It gives the logit values of "setosa" and "versicolor" as against "virginica"
head(pred)
## log(mu[,1]/mu[,3]) log(mu[,2]/mu[,3])
## 4 80.12969 61.85794
## 5 92.01384 66.67479
## 8 87.17119 64.14263
## 9 78.42095 61.36184
## 11 94.21269 66.42732
## 16 99.68165 68.40793
pred = exp(pred) #Calculating the exponential values of logits for both "setosa" and "versicolor"
head(pred)
## log(mu[,1]/mu[,3]) log(mu[,2]/mu[,3])
## 4 6.307837e+34 7.320857e+26
## 5 9.143261e+39 9.046751e+28
## 8 7.210521e+37 7.191002e+27
## 9 1.142315e+34 4.457697e+26
## 11 8.242382e+40 7.063450e+28
## 16 1.955192e+43 5.119015e+29
sum=(pred[,1]+pred[,2] + 1) #Sum of exponential values of "setosa", "versicolor" and "virginica" (it is 1 as it is the reference level)
head(sum)
## 4 5 8 9 11 16
## 6.307838e+34 9.143261e+39 7.210521e+37 1.142315e+34 8.242382e+40 1.955192e+43
pVirginica = round(1/sum) #Probability values for virginica (rounded to the nearest integer)
pSetosa = round(pred[,1]/sum) #Probability values for setosa (rounded)
pVersicolor = round(pred[,2]/sum) #Probability values for versicolor (rounded)
predTest = data.frame(pSetosa, pVersicolor, pVirginica, test$Species)
predTest
## pSetosa pVersicolor pVirginica test.Species
## 4 1 0 0 setosa
## 5 1 0 0 setosa
## 8 1 0 0 setosa
## 9 1 0 0 setosa
## 11 1 0 0 setosa
## 16 1 0 0 setosa
## 30 1 0 0 setosa
## 36 1 0 0 setosa
## 46 1 0 0 setosa
## 47 1 0 0 setosa
## 49 1 0 0 setosa
## 52 0 1 0 versicolor
## 54 0 1 0 versicolor
## 57 0 1 0 versicolor
## 62 0 1 0 versicolor
## 67 0 1 0 versicolor
## 69 0 1 0 versicolor
## 72 0 1 0 versicolor
## 80 0 1 0 versicolor
## 90 0 1 0 versicolor
## 96 0 1 0 versicolor
## 97 0 1 0 versicolor
## 99 0 1 0 versicolor
## 103 0 0 1 virginica
## 107 0 1 0 virginica
## 109 0 0 1 virginica
## 117 0 0 1 virginica
## 128 0 0 1 virginica
## 137 0 0 1 virginica
## 139 0 1 0 virginica
The first column shows the row numbers as in the original iris data. Wherever you see the value 1 under a class column (pSetosa, pVersicolor, pVirginica), that class is our model’s prediction for that record, and the last column shows the actual species in the test data. In this example, the probability values have turned out to be 1s and 0s (we rounded them, and the fitted probabilities are extreme), but this may not always be the case.
If you use this method to build a model and then predict probability values for each class, you may get values (>=0 and <1) spread across the classes of the same record. In that case, the class with the highest value is the prediction for that record.
As you can see, the model has performed fairly well on the test data, except for the misclassification of two records, 107 and 139. That is over 93% accuracy. Of course, there is always a possibility to improve the model performance by using more data, polynomial features, transformed features, etc.
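As a side note, predict() with type = "response" returns the class probabilities directly, so the manual exponentiation above can be skipped. A minimal sketch using the mulC model and test set from above:
probs = predict(mulC, newdata = test, type = "response") #Matrix of class probabilities, one column per species
pred.class = colnames(probs)[apply(probs, 1, which.max)] #Most probable class for each record
table(pred.class, test$Species) #Confusion matrix
mean(pred.class == test$Species) #Overall accuracy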
Binomial Logistic Regression
Now let’s build a logistic regression model to solve a binomial classification problem.
For this section of the lab, we will use the Smarket data set.
Details of the data set:
- Contains the percentage returns on S&P 500 stock index for 1250 days (2001 to 2005)
- Lag1 to Lag5: percentage returns of five previous days for each date
- Volume: Number of shares (in billions) traded the previous day
- Today: Percentage return on today’s date (i.e. date in question)
- Direction: Indicates whether market was up or down (We will use this as our response variable)
This dataset is part of a package, ISLR. You need to install this package.
install.packages("ISLR")Let’s load the package and look at some of the key details of the data:
library(ISLR)
dim(Smarket)
## [1] 1250 9
names(Smarket)
## [1] "Year" "Lag1" "Lag2" "Lag3" "Lag4" "Lag5"
## [7] "Volume" "Today" "Direction"As discussed above, there are 9 variables in the data set and 1250 records
summary(Smarket)
## Year Lag1 Lag2 Lag3
## Min. :2001 Min. :-4.922000 Min. :-4.922000 Min. :-4.922000
## 1st Qu.:2002 1st Qu.:-0.639500 1st Qu.:-0.639500 1st Qu.:-0.640000
## Median :2003 Median : 0.039000 Median : 0.039000 Median : 0.038500
## Mean :2003 Mean : 0.003834 Mean : 0.003919 Mean : 0.001716
## 3rd Qu.:2004 3rd Qu.: 0.596750 3rd Qu.: 0.596750 3rd Qu.: 0.596750
## Max. :2005 Max. : 5.733000 Max. : 5.733000 Max. : 5.733000
## Lag4 Lag5 Volume Today
## Min. :-4.922000 Min. :-4.92200 Min. :0.3561 Min. :-4.922000
## 1st Qu.:-0.640000 1st Qu.:-0.64000 1st Qu.:1.2574 1st Qu.:-0.639500
## Median : 0.038500 Median : 0.03850 Median :1.4229 Median : 0.038500
## Mean : 0.001636 Mean : 0.00561 Mean :1.4783 Mean : 0.003138
## 3rd Qu.: 0.596750 3rd Qu.: 0.59700 3rd Qu.:1.6417 3rd Qu.: 0.596750
## Max. : 5.733000 Max. : 5.73300 Max. :3.1525 Max. : 5.733000
## Direction
## Down:602
## Up :648
##
##
##
##
pairs(Smarket)
Although the plots aren’t very legible, we can still make out that there is hardly any correlation between any two variables.
cor(Smarket[, -9])
## Year Lag1 Lag2 Lag3 Lag4
## Year 1.00000000 0.029699649 0.030596422 0.033194581 0.035688718
## Lag1 0.02969965 1.000000000 -0.026294328 -0.010803402 -0.002985911
## Lag2 0.03059642 -0.026294328 1.000000000 -0.025896670 -0.010853533
## Lag3 0.03319458 -0.010803402 -0.025896670 1.000000000 -0.024051036
## Lag4 0.03568872 -0.002985911 -0.010853533 -0.024051036 1.000000000
## Lag5 0.02978799 -0.005674606 -0.003557949 -0.018808338 -0.027083641
## Volume 0.53900647 0.040909908 -0.043383215 -0.041823686 -0.048414246
## Today 0.03009523 -0.026155045 -0.010250033 -0.002447647 -0.006899527
## Lag5 Volume Today
## Year 0.029787995 0.53900647 0.030095229
## Lag1 -0.005674606 0.04090991 -0.026155045
## Lag2 -0.003557949 -0.04338321 -0.010250033
## Lag3 -0.018808338 -0.04182369 -0.002447647
## Lag4 -0.027083641 -0.04841425 -0.006899527
## Lag5 1.000000000 -0.02200231 -0.034860083
## Volume -0.022002315 1.00000000 0.014591823
## Today -0.034860083 0.01459182 1.000000000
As we can see from the correlation matrix above, most of the variables aren’t correlated, except Year and Volume, which have some substantial correlation.
attach(Smarket)
plot(Volume)
Since Volume represents the number of shares traded the previous day (the daily volume of shares in billions), the plot suggests that the daily average volume of shares traded continuously increased from 2001 to 2005, with some spikes here and there.
Are the two classes linearly separable?
plot(Volume,Today,col=(Direction=="Up")+1)
From the plot above, we can clearly see that the two classes are linearly separable in this space of Volume and Today (not surprising, since Direction simply records the sign of Today), which may not be the case in other spaces using other axes.
Exercise: Plot all the other pairs using Direction as the color (same as above), such as Lag1 and Lag2, Lag3 and Lag4, etc. One example is sketched below.
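For instance, following the same coloring trick (Smarket is attached above, so the columns are available by name):
plot(Lag1, Lag2, col = (Direction=="Up")+1, xlab="Lag1", ylab="Lag2") #One pair; repeat for the other combinations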
We will build a logistic regression to predict Direction using all the other variables except Year (as it doesn’t make sense to include this variable).
We will use the R function glm(), which stands for generalized linear models. It can fit various types of models, including logistic regression; we need to pass family=binomial for logistic regression.
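For reference, binomial logistic regression models the log-odds of the positive class as a linear function of the predictors:
log[p/(1-p)] = b0 + b1*x1 + ... + bk*xk, i.e. p = 1/(1 + exp(-(b0 + b1*x1 + ... + bk*xk)))
where p is the probability of the positive class (here, Direction being Up).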
Train and Test subset
Since this is time series (sequential) data, we will split it into train and test sets a little differently than we would a randomized data set.
Let’s split the data set into train (data from 2001 to 2004) and test (all 2005 data):
train.idx = Smarket$Year < 2005 #TRUE for records from 2001-2004
train = Smarket[train.idx, ]
test = Smarket[!train.idx, ]
dim(train)
## [1] 998 9
dim(test)
## [1] 252 9
train.n = dim(train)[1]
test.n = dim(test)[1]
Binomial logistic regression model for Smarket data set
Now let’s build the model on train set
glm.fit1 = glm(Direction ~ Lag1+Lag2+Lag3+Lag4+Lag5+Volume, data=train, family=binomial )
coef(glm.fit1)
## (Intercept) Lag1 Lag2 Lag3 Lag4 Lag5
## 0.191212621 -0.054178292 -0.045805334 0.007200118 0.006440875 -0.004222672
## Volume
## -0.116256960
Lag1, Lag2, and Lag5 have negative coefficients, suggesting that if the market was up on those days, it is less likely that the market will be up today. Lag3 and Lag4 have positive coefficients, suggesting that if the market was up on those days, it is likely that the market will be up today as well.
contrasts(Direction)
## Up
## Down 0
## Up 1
So, R has used Up as 1 and Down as 0.
Prediction
Let’s look at the predictions on both train and test data:
pred.train = predict(glm.fit1, newdata= train, type = "response")
pred.test = predict(glm.fit1, newdata= test, type = "response")
pred.train[1:5]
## 1 2 3 4 5
## 0.4985061 0.4893137 0.4849791 0.5098788 0.5145737
pred.test[1:5]
## 999 1000 1001 1002 1003
## 0.5282195 0.5156688 0.5226521 0.5138543 0.4983345
What we are getting are the probability values of Up, as R is treating Up as 1 (success).
Let’s look at train and test errors
train.class = 1*(train$Direction=="Up") #Getting actual class in 1s and 0s
pred.train.class=floor(pred.train+0.5) #Turning predictions into 1s and 0s
confMtx.train=table(train.class,pred.train.class)
error.train=(confMtx.train[1,2]+confMtx.train[2,1])/train.n #(False Positive + False Negative)/no. of records
test.class = 1*(test$Direction=="Up") #Getting actual class in 1s and 0s
pred.test.class=floor(pred.test+0.5) #Turning predictions into 1s and 0s
confMtx.test=table(test.class,pred.test.class)
error.test=(confMtx.test[1,2]+confMtx.test[2,1])/test.n #(False Positive + False Negative)/no. of records
confMtx.train
## pred.train.class
## train.class 0 1
## 0 175 316
## 1 156 351
confMtx.test
## pred.test.class
## test.class 0 1
## 0 77 34
## 1 97 44
error.train
## [1] 0.4729459
error.test
## [1] 0.5198413
Clearly, the model has performed poorly: the error rate for the training data is 47% and that for the test data is around 52%. This is expected, as one cannot really predict the stock market just by using previous days’ returns.
Another reason, from the logistic regression point of view, becomes apparent when you do the exercise of plotting pairs of variables using Direction as the color to see whether the classes are linearly separable in a different space. Logistic regression works well when the decision boundaries are linearly separable. For non-linearly separable decision boundaries, we can use other algorithms such as Neural Networks, Support Vector Machines, etc. (which we will see in later sections).
You can also look at the summary of the model we built. The p-values of the features are much larger than the significance level of 0.05. Hence, they aren’t helpful; rather, they increase variance without decreasing bias, thus increasing the test errors.
However, we will use Lag1 and Lag2, as they have the lowest p-values among all the features.
glm.fit2 = glm(Direction ~ Lag1+Lag2, data=train, family=binomial )
coef(glm.fit2)
## (Intercept) Lag1 Lag2
## 0.03221753 -0.05562122 -0.04448684
Let’s look at the predictions:
pred.train = predict(glm.fit2, newdata= train, type = "response")
pred.test = predict(glm.fit2, newdata= test, type = "response")
pred.train[1:5]
## 1 2 3 4 5
## 0.5048917 0.4904830 0.4830449 0.5052396 0.5064450
pred.test[1:5]
## 999 1000 1001 1002 1003
## 0.5098275 0.5208237 0.5332635 0.5260574 0.5072103
Let’s look at the train and test errors:
pred.train.class = ifelse(pred.train<0.5, "Down", "Up") #Predicted class for train data
pred.test.class = ifelse(pred.test<0.5, "Down" , "Up") #Predicted class for test data
error.train = mean(pred.train.class != train$Direction) #Error rate for train data
error.test = mean(pred.test.class != test$Direction) #Error rate for test data
confMtx.train=table(train$Direction,pred.train.class)
confMtx.test=table(test$Direction,pred.test.class)
confMtx.train
## pred.train.class
## Down Up
## Down 168 323
## Up 160 347
confMtx.test
## pred.test.class
## Down Up
## Down 35 76
## Up 35 106
error.train
## [1] 0.4839679
error.test
## [1] 0.4404762
accu.test.Up = confMtx.test[2,2]/sum(confMtx.test[,2]) #Percentage of accuracy for prediction "Up"
accu.test.Up
## [1] 0.5824176
So, the error rate for the test data has come down to 44%, i.e. the accuracy of prediction on the test data is now 56%.
This simple model’s predictions of whether the market will go up or down will be correct 56% of the time, with a 44% chance of being wrong.
When the model predicts that the market will be up, there is a 58% chance that the prediction is correct and a 42% chance that it is incorrect.
So we can take a cue from the above: if the model predicts Up and we buy shares that day, we have a 58% chance of making a profit, and we can avoid buying on days when the model predicts Down.
Chi-square Test of the model
Another way to test the model fit in logistic regression is the chi-square test.
It is analogous to the F-test (ANOVA) in linear regression, which compares the prediction of y (the response variable) without x (the independent variables) against the prediction of y with x, i.e. the base model (intercept only) versus the regression model with intercept and independent variables.
Similarly, for a classification model we use the chi-square test, in which we calculate the chi-square value from the null deviance and residual deviance, with their respective degrees of freedom.
That is, the chi-square value is calculated by comparing two models (one with predictors and one without), which gives us the difference in goodness of fit. The intuition is that the goodness of fit with predictors must be higher than that without predictors.
Hence, chi-square = -2 x (log-likelihood without predictors - log-likelihood with predictors)
= -2 ln[likelihood without predictors / likelihood with predictors]
= null deviance - residual deviance
Hence, the chi-square test is a test of the fit of the model. We just calculate the p-value of the chi-square statistic; if the p-value is less than 0.05, the null hypothesis is rejected and the model with independent variables is accepted.
#P-value of chi-square
pchisq(glm.fit2$null.deviance - glm.fit2$deviance, glm.fit2$df.null - glm.fit2$df.residual, lower.tail = FALSE)
## [1] 0.393926
As we can see, the p-value is much larger than 0.05. Therefore, this model is not a good fit, which was obvious from our test predictions as well.
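As a side note, R’s built-in anova() on the fitted glm object reports the same null and residual deviances as a sequential deviance table, with a chi-square test per added term:
anova(glm.fit2, test = "Chisq") #Deviance drops (and p-values) as Lag1 and Lag2 are added to the null model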
If you want to make a prediction on new, unseen data for Lag1 (e.g. -0.12, 1.58) and Lag2 (e.g. 0.49, 0.67):
p = predict(glm.fit2, newdata= data.frame(Lag1 = c(-0.12, 1.58), Lag2=c(0.49, 0.67)), type="response")
p
## 1 2
## 0.5042733 0.4786455
p.class = ifelse(p<0.5, "Down", "Up")
p.class
## 1 2
## "Up" "Down"
detach(Smarket)
Default Data - Binomial Logistic Regression
Let’s build a binomial logistic regression on a data set “Default” from the package “ISLR”.
library("ISLR")
dim(Default)
## [1] 10000 4
names(Default)
## [1] "default" "student" "balance" "income"
It is a simulated data set containing information on ten thousand customers: 10,000 records and 4 variables. The aim here is to predict which customers will default on their credit card debt.
- default: A factor with levels No and Yes indicating whether the customer defaulted on their debt
- student: A factor with levels No and Yes indicating whether the customer is a student
- balance: The average balance that the customer has remaining on their credit card after making their monthly payment
- income: Income of the customer
boxplot(balance ~ student, data=Default, xlab="Student Status", ylab= "Credit Card Balance", col=c(2, 3))
boxplot(income ~ student, data=Default, xlab="Student Status", ylab= "Income of Customer", col=c(2, 3))
As we can see from the plots, students tend to have a higher average credit card balance than non-students, while the average income of students is lower than that of non-students, which is quite expected.
#Plotting bar chart of Balance vs Defaulters and Student Status
Student_status = ifelse(Default$student=="Yes", "Student", "Non-student")
default_status = ifelse(Default$default=="Yes", "Defaulter", "Non-defaulter")
credit_status = tapply(Default$balance, list(default_status, Student_status), FUN = mean)
barplot(credit_status, beside = TRUE, col= c(2,3), legend.text=TRUE, main= "Balance vs Defaulters and Student Status" )
credit_status
## Non-student Student
## Defaulter 1678.4295 1860.3791
## Non-defaulter 744.5044 948.4802As we can see that whether student or non-student, the customers with higher average credit card balance tend to default more likely than those with lower average credit card balance.
Let’s write a function to calculate cumulative default rates and cumulative average credit card balance. The function takes def (1s and 0s, where a defaulter is 1) for defaulters and bal for average credit card balance for each customer. Returns cumulative defaulters and average credit card balance for the number of records in the data.
default_rate = function(def, bal){
m = length(def)
cum_rate = c(rep(0, m))
cum_bal = c(rep(0, m))
for(i in 2:m){ #Loop starts at 2, so the first element of each vector stays 0
cum_rate[i] = sum(def[1:i])
cum_bal[i] = sum(bal[1:i])
}
list(cum_rate = cum_rate, cum_bal = cum_bal)
}
Let’s calculate the cumulative defaulter counts and credit card balances for students and non-students:
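As a side note, base R’s cumsum() computes the same cumulative sums in a single vectorized call; a minimal sketch (note it also counts the first record, which the loop above leaves at zero):
default_rate2 = function(def, bal){
  list(cum_rate = cumsum(def), cum_bal = cumsum(bal)) #Vectorized cumulative counts and balances
}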
def = ifelse(Default$default=="Yes", 1,0)
studen_def_rate = default_rate(def[Default$student=="Yes"], Default$balance[Default$student=="Yes"])
nonStu_def_rate = default_rate(def[Default$student=="No"], Default$balance[Default$student=="No"])
studen_def_rate$cum_rate[1:5]; studen_def_rate$cum_bal[1:5]
## [1] 0 0 0 0 0
## [1] 0.000 1736.769 2545.436 2545.436 3766.020
nonStu_def_rate$cum_rate[1:5]; nonStu_def_rate$cum_bal[1:5]
## [1] 0 0 0 0 0
## [1] 0.000 1803.076 2332.326 3117.982 3943.495
Let’s look at some plots:
plot(Default$balance[Default$student=="Yes"], studen_def_rate$cum_rate, xlab="Average Credit Card Balance", ylab="Cumulative Defaulter Rate", type="l", col="blue")
lines(Default$balance[Default$student=="No"], nonStu_def_rate$cum_rate, col="orange")
legend("topright", legend=c("Student", "Non-student"), lwd=2, col = c("blue", "orange"))From the above plot, we can understand that if the average credit card balance is somewhere between 600 and 1600, the chances of defaulters among non-student customers are quite high as this region is thickly filled with non-student defaulters (orange lines).
Once the average credit card balance goes beyond 1700-1800, the chances of default among student customers are much higher, as this region is thickly filled with student defaulters (blue line).
Note, though, that the overall significant range of average credit card balance is quite wide (almost 1600) for non-student defaulters, compared to that of student defaulters, which is maybe just 2300 - 1700 = 600.
Let’s look at another plot
plot(studen_def_rate$cum_bal, studen_def_rate$cum_rate, xlab="Cumulative Average Credit Card Balance", ylab="Cumulative Defaulter Rate", type="l", col="blue")
lines(nonStu_def_rate$cum_bal, nonStu_def_rate$cum_rate, col="orange")
legend("topright", legend=c("Student", "Non-student"), lwd=2, col = c("blue", "orange"))Here we can see that the cumulative Defaulter rate against cumulative average credit card balance of student customers is higher throughout compared to that of non-student customers.
In the beginning, both are almost close, but once the cumulative CC balance goes beyond 1,250,000 (1.25 million) the gap between two graphs keep widening, i.e. student customers tend to default more likely.
Splitting data into train and test
m = dim(Default)[1]
set.seed(1); train.idx = sample(1:m, floor(0.8*m), replace=F)
train.n = length(train.idx)
test.n = m-train.n
train = Default[train.idx,]
test = Default[-train.idx,]
train.n; test.n
## [1] 8000
## [1] 2000
Building binomial logistic regression model on train set
glm.fit3 = glm(default ~ student + balance + income, data=train, family=binomial)
summary(glm.fit3)
##
## Call:
## glm(formula = default ~ student + balance + income, family = binomial,
## data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.4736 -0.1389 -0.0543 -0.0202 3.5171
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.094e+01 5.517e-01 -19.834 < 2e-16 ***
## studentYes -7.282e-01 2.683e-01 -2.714 0.00664 **
## balance 5.713e-03 2.585e-04 22.100 < 2e-16 ***
## income 5.877e-06 9.172e-06 0.641 0.52170
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2313.6 on 7999 degrees of freedom
## Residual deviance: 1231.9 on 7996 degrees of freedom
## AIC: 1239.9
##
## Number of Fisher Scoring iterations: 8
The model has chosen No as the reference level for student and has created a dummy variable studentYes for the categorical variable student. The p-values are quite low for all variables except income.
As per this model, considering the coefficient of studentYes (which is negative), for fixed values of income and balance, students seem to default less than non-students.
Model performance on test data
Let’s test our model’s performance on test data.
pred.test = predict(glm.fit3, newdata=test, type="response")
pred.test.class = ifelse(pred.test<0.5, "No", "Yes")
confMtx.test = table(test$default, pred.test.class)
error.test = mean(pred.test.class != test$default) #Error rate on test data
accu.test = mean(pred.test.class == test$default) #Accuracy on test data
accu.test.Yes = confMtx.test[2,2]/sum(confMtx.test[,2]) #Percentage of accuracy for prediction "Yes"
error.test
## [1] 0.0275
accu.test
## [1] 0.9725
accu.test.Yes
## [1] 0.8947368
sum(test$default=="Yes") #Number of defaulters in test data
## [1] 70
sum(train$default=="Yes") #Number of defaulters in train data
## [1] 263
We can see that the model performed really well on the test data, with an error rate as low as 2.75% and accuracy as high as 97.25%.
The accuracy of the Defaulter predictions (accu.test.Yes) is 89.47%: when the model predicts a defaulter, it is correct almost 90% of the time. That is quite good, considering the test data has only 70 defaulters among 2000 records and the train data only 263 defaulters among 8000 records. See the quick check below for the complementary recall measure.
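Note that accu.test.Yes is a precision: it conditions on the model’s Yes predictions. A complementary question is what fraction of the actual defaulters the model catches (recall). A quick check using the confusion matrix already computed (rows are actuals, columns are predictions):
precision.Yes = confMtx.test[2,2]/sum(confMtx.test[,2]) #Of the predicted "Yes", the fraction that actually defaulted
recall.Yes = confMtx.test[2,2]/sum(confMtx.test[2,]) #Of the actual defaulters, the fraction the model flagged
precision.Yes; recall.Yes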
Let’s look at chi-square test:
pchisq(glm.fit3$null.deviance - glm.fit3$deviance, glm.fit3$df.null - glm.fit3$df.residual, lower.tail = FALSE)
## [1] 3.427193e-234
The p-value is much smaller than 0.05; hence we reject the null hypothesis and accept the model with predictors.
Exercise: Build a logistic regression model using the “Default” data from the ISLR package, excluding the income variable (as its p-value is large), and repeat the entire process as above. Try to understand the changes in the coefficient values and their interpretations.
ROC Curve
We can either plot the ROC curve using sensitivity (i.e. true positive rate) vs specificity (i.e. true negative rate),
or plot it using true positive rate vs false positive rate.
Let’s define what these are (recall the confusion matrix):
true positive rate = (true positives)/(true positives + false negatives) = (true positives)/(actual positives)
true positive rate is also known as sensitivity
true negative rate = (true negatives)/(true negatives + false positives) = (true negatives)/(actual negatives)
true negative rate is also known as specificity
false positive rate = (false positives)/(true negatives + false positives) = (false positives)/(actual negatives)
false positive rate is 1 - specificity
Taking the above example (the Default model predictions on the test data), let’s plot the ROC curve using the first method, i.e. sensitivity vs specificity.
For this we will use the package pROC; you need to install it.
install.packages("pROC")library(pROC)## Warning: package 'pROC' was built under R version 4.1.3rocScore = roc(test$default, pred.test) #Calculate ROC scores (different rates of sensitivity and specificity)
plot(rocScore, main="ROC Curve")
If you want to know all the components of rocScore:
names(rocScore) #To get different components
## [1] "percent" "sensitivities" "specificities"
## [4] "thresholds" "direction" "cases"
## [7] "controls" "fun.sesp" "auc"
## [10] "call" "original.predictor" "original.response"
## [13] "predictor" "response" "levels"rocScore$auc #To get the area under curve## Area under the curve: 0.9368length(rocScore$thresholds) #To get the number of thresholds (or iterations used by the function)## [1] 2001Let’s create ROC Curve using second method (i.e. true positive rate vs false positive rate) for the test predictions
tp = c(1:50000) #True positive rate vector, in our example positive is "Yes" (i.e. defaulter) and negative is "No"
fp = c(1:50000) #False positive rate vector
for ( i in 1:50000){
pred.test.class = factor(ifelse(pred.test>=i/50000, "Yes", "No"), levels=c("No", "Yes"))
confMtx.test = table(test$default, pred.test.class)
tp[i] = confMtx.test[2,2]/(confMtx.test[2,1]+confMtx.test[2,2])
fp[i] = confMtx.test[1,2]/(confMtx.test[1,1]+confMtx.test[1,2])
}
plot(fp,tp, main = "ROC Curve", xlab = "False Positive Rate", ylab="True Positive Rate", type="l")
We had to keep the number of thresholds (iterations) very high (50,000) to get an acceptable ROC curve, perhaps because of the very low percentage of defaulters in the data set. We are not going to calculate the area under the curve (AUC) exactly, but looking at the plot, we get an idea that it is above 0.7.
Although there is no standard value defining a good AUC score, higher is better.
A model with AUC score 1 is able to perfectly predict all the classes without any error, while AUC 0.5 and below is no better than random guessing.
AUC: 0.7 to 0.8 is considered acceptable, anything higher than 0.8 is excellent.
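If you do want a quick numeric estimate of the AUC from the manual curve above, the trapezoidal rule over the (fp, tp) pairs gives one; a minimal sketch using the vectors from the loop:
ord = order(fp) #Sort the points by false positive rate
auc.est = sum(diff(fp[ord]) * (head(tp[ord], -1) + tail(tp[ord], -1))/2) #Trapezoidal area under the curve
auc.est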
The main idea is to minimize the False Positive rate and maximize the True Positive rate by selecting the optimal threshold value (the probability threshold at which our model predicts positive).
To decide the optimal threshold, one needs to take into consideration the domain and industry one is dealing with. In this scenario, the bank or credit card company may be averse to taking high risks when it comes to issuing credit cards to customers. The risk of offering a credit card to a defaulter could cost much more than the risk of losing some genuine customers.
In this case, positive means defaulter. Therefore, when the False Positive rate is high (i.e. genuine customers are falsely flagged as defaulters), there is a greater chance of losing genuine customers, but it also means avoiding the risk of accepting defaulters.
So, the threshold value depends on how much tolerance a domain/industry/company has towards False Positive rates; a sketch for screening candidate thresholds follows below.
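For instance, to shortlist candidate thresholds whose False Positive rate stays within a tolerated band, one can filter the tp and fp vectors from the loop above (the threshold for index i is i/50000); a minimal sketch with a hypothetical 15% cap:
idx = which(fp <= 0.15) #Indices whose FP rate is at most 15%
candidates = data.frame(threshold = idx/50000, TP = tp[idx], FP = fp[idx])
head(candidates[order(-candidates$TP), ]) #Candidates with the highest TP rate under the cap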
pt_fp_Rate = data.frame(TP=tp, FP=fp)
pt_fp_Rate = pt_fp_Rate[order(pt_fp_Rate$FP, decreasing=TRUE),] #Arrange the data in a decreasing order of fp
pt_fp_Rate[1:10,]
## TP FP
## 1 1 0.9865285
## 2 1 0.9046632
## 3 1 0.8839378
## 4 1 0.8637306
## 5 1 0.8466321
## 6 1 0.8321244
## 7 1 0.8176166
## 8 1 0.8031088
## 9 1 0.7880829
## 10 1 0.7803109
Exercise: Suppose you meet the credit card department head and she tells you that they are fine with a 20-23% range of False Positive rate and asks you to build a predictive model for defaulters. Since we have already built the model, your job is to determine the threshold value from the ROC curve above.