Lab_LR_3_RM
Asmi Ariv
2022-09-26
Multivariate Linear Regression in R
In the first lab, we worked on simple linear regression in R (using the women data set from base R); while improving the fit, our model eventually became a multivariate model as we added transformed versions of the feature.
Later, in the second lab, we built a machine learning algorithm from scratch, defining various functions and thereby building a predictive model on our Experiment data set.
Now, let’s use R’s inbuilt function to build a multivariate model.
For this lab we will use the Boston data set from the MASS package. This data set records medv (median house value) for 506 neighborhoods around Boston. It has a total of 14 variables, including medv. We will use medv as our response variable and the others as our independent variables.
You can find more information on the data set in the MASS package documentation (run ?Boston in R).
Loading MASS package and Boston Data and performing basic Exploratory Data Analysis (EDA)
We will load the data and perform only basic EDA (we won't go into detail, as our main focus is to show different aspects of building a multivariate linear regression model).
The main intention behind this lab is to show you the different challenges and roadblocks one can face while building a model with multiple variables. There isn't any perfect sequence to follow. Therefore, go through this lab with an open mind and do not expect a perfect model at the end (there are many factors that can only be handled with the help of subject matter experts, including getting more data).
Let’s load the package MASS
library(MASS)
names(Boston) #All the variables
## [1] "crim" "zn" "indus" "chas" "nox" "rm" "age"
## [8] "dis" "rad" "tax" "ptratio" "black" "lstat" "medv"
There are 14 variables in the data set.
Summary of the data
Let's look at a summary of the data
summary(Boston)
## crim zn indus chas
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
## 1st Qu.: 0.08205 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
## nox rm age dis
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## rad tax ptratio black
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
## Median : 5.000 Median :330.0 Median :19.05 Median :391.44
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
## lstat medv
## Min. : 1.73 Min. : 5.00
## 1st Qu.: 6.95 1st Qu.:17.02
## Median :11.36 Median :21.20
## Mean :12.65 Mean :22.53
## 3rd Qu.:16.95 3rd Qu.:25.00
## Max. :37.97 Max. :50.00
Except chas, all the other variables are quantitative. We must convert chas into a factor variable before model building.
Comparing the means and medians, we can see that:
crim, zn, indus, nox, dis, lstat, and medv are right skewed: mean > median
age, ptratio, and black are left skewed: median > mean
rm, rad, and tax do not show much skewness when plotted
Exercise: Plot the histogram of all these variables to see their distribution
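A minimal sketch of one way to do this exercise (skipping the binary chas; the 4x4 panel layout is an arbitrary choice):

```r
library(MASS)  # provides the Boston data set

# Draw a histogram for every quantitative variable (chas is binary, so skip it)
num_vars <- setdiff(names(Boston), "chas")
par(mfrow = c(4, 4))                 # 4x4 grid of panels for the 13 variables
for (v in num_vars) {
  hist(Boston[[v]], main = v, xlab = v)
}
par(mfrow = c(1, 1))                 # reset the plotting layout
```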
Scatter plot of the data
Boston$chas <- as.factor(Boston$chas) #Converting the qualitative variable into a factor variable
plot(Boston)
Since there are 14 variables, the scatter plot matrix isn't very legible.
Correlation Matrix
Let’s look at the correlations among different variables
cor(Boston[,-4]) #Excluding chas from the correlation, as it is not a quantitative variable
## crim zn indus nox rm age
## crim 1.0000000 -0.2004692 0.4065834 0.4209717 -0.2192467 0.3527343
## zn -0.2004692 1.0000000 -0.5338282 -0.5166037 0.3119906 -0.5695373
## indus 0.4065834 -0.5338282 1.0000000 0.7636514 -0.3916759 0.6447785
## nox 0.4209717 -0.5166037 0.7636514 1.0000000 -0.3021882 0.7314701
## rm -0.2192467 0.3119906 -0.3916759 -0.3021882 1.0000000 -0.2402649
## age 0.3527343 -0.5695373 0.6447785 0.7314701 -0.2402649 1.0000000
## dis -0.3796701 0.6644082 -0.7080270 -0.7692301 0.2052462 -0.7478805
## rad 0.6255051 -0.3119478 0.5951293 0.6114406 -0.2098467 0.4560225
## tax 0.5827643 -0.3145633 0.7207602 0.6680232 -0.2920478 0.5064556
## ptratio 0.2899456 -0.3916785 0.3832476 0.1889327 -0.3555015 0.2615150
## black -0.3850639 0.1755203 -0.3569765 -0.3800506 0.1280686 -0.2735340
## lstat 0.4556215 -0.4129946 0.6037997 0.5908789 -0.6138083 0.6023385
## medv -0.3883046 0.3604453 -0.4837252 -0.4273208 0.6953599 -0.3769546
## dis rad tax ptratio black lstat
## crim -0.3796701 0.6255051 0.5827643 0.2899456 -0.3850639 0.4556215
## zn 0.6644082 -0.3119478 -0.3145633 -0.3916785 0.1755203 -0.4129946
## indus -0.7080270 0.5951293 0.7207602 0.3832476 -0.3569765 0.6037997
## nox -0.7692301 0.6114406 0.6680232 0.1889327 -0.3800506 0.5908789
## rm 0.2052462 -0.2098467 -0.2920478 -0.3555015 0.1280686 -0.6138083
## age -0.7478805 0.4560225 0.5064556 0.2615150 -0.2735340 0.6023385
## dis 1.0000000 -0.4945879 -0.5344316 -0.2324705 0.2915117 -0.4969958
## rad -0.4945879 1.0000000 0.9102282 0.4647412 -0.4444128 0.4886763
## tax -0.5344316 0.9102282 1.0000000 0.4608530 -0.4418080 0.5439934
## ptratio -0.2324705 0.4647412 0.4608530 1.0000000 -0.1773833 0.3740443
## black 0.2915117 -0.4444128 -0.4418080 -0.1773833 1.0000000 -0.3660869
## lstat -0.4969958 0.4886763 0.5439934 0.3740443 -0.3660869 1.0000000
## medv 0.2499287 -0.3816262 -0.4685359 -0.5077867 0.3334608 -0.7376627
## medv
## crim -0.3883046
## zn 0.3604453
## indus -0.4837252
## nox -0.4273208
## rm 0.6953599
## age -0.3769546
## dis 0.2499287
## rad -0.3816262
## tax -0.4685359
## ptratio -0.5077867
## black 0.3334608
## lstat -0.7376627
## medv 1.0000000
As we can see, some variables are strongly correlated: crim with rad and tax; zn with age and dis; indus with nox, age, dis, rad, tax, and lstat. Several other pairs show similarly strong correlations.
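Rather than scanning the matrix by eye, the strongly correlated pairs can be pulled out programmatically; a sketch (the 0.7 cutoff is an arbitrary choice):

```r
library(MASS)

# List variable pairs whose absolute correlation exceeds 0.7
cm <- cor(Boston[, -4])                  # drop chas (column 4), as above
cm[upper.tri(cm, diag = TRUE)] <- NA     # keep each pair only once
strong <- which(abs(cm) > 0.7, arr.ind = TRUE)
data.frame(var1 = rownames(cm)[strong[, "row"]],
           var2 = colnames(cm)[strong[, "col"]],
           r    = round(cm[strong], 3))
```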
Missing data
Let’s see if there are any missing values in the data set
Boston[!complete.cases(Boston),]
## [1] crim zn indus chas nox rm age dis rad
## [10] tax ptratio black lstat medv
## <0 rows> (or 0-length row.names)
As we can see, the data set is complete, with no missing values.
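Two quicker base-R checks for the same thing:

```r
library(MASS)

anyNA(Boston)           # → FALSE: no missing values anywhere
colSums(is.na(Boston))  # per-column NA counts (all zero here)
```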
Model Building using R functions
lm.fit = lm(medv ~ ., data=Boston) #To include all variables, we just use the ~ . shorthand
Summary of the model
summary(lm.fit)
##
## Call:
## lm(formula = medv ~ ., data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.595 -2.730 -0.518 1.777 26.199
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.646e+01 5.103e+00 7.144 3.28e-12 ***
## crim -1.080e-01 3.286e-02 -3.287 0.001087 **
## zn 4.642e-02 1.373e-02 3.382 0.000778 ***
## indus 2.056e-02 6.150e-02 0.334 0.738288
## chas1 2.687e+00 8.616e-01 3.118 0.001925 **
## nox -1.777e+01 3.820e+00 -4.651 4.25e-06 ***
## rm 3.810e+00 4.179e-01 9.116 < 2e-16 ***
## age 6.922e-04 1.321e-02 0.052 0.958229
## dis -1.476e+00 1.995e-01 -7.398 6.01e-13 ***
## rad 3.060e-01 6.635e-02 4.613 5.07e-06 ***
## tax -1.233e-02 3.760e-03 -3.280 0.001112 **
## ptratio -9.527e-01 1.308e-01 -7.283 1.31e-12 ***
## black 9.312e-03 2.686e-03 3.467 0.000573 ***
## lstat -5.248e-01 5.072e-02 -10.347 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.745 on 492 degrees of freedom
## Multiple R-squared: 0.7406, Adjusted R-squared: 0.7338
## F-statistic: 108.1 on 13 and 492 DF, p-value: < 2.2e-16
The R-squared value is 0.74, which means our model explains only 74% of the variability in the data. In multiple linear regression, R-squared turns out to be nothing but the squared correlation between the response variable and the model fit (i.e. the predicted values of the response). Hence, our model is not yet a good fit.
We also notice that the p-values of indus and age are larger than 0.05 (our 5% significance level), so these variables are not statistically significant and we can drop them.
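The same screening can be done programmatically by filtering the coefficient table returned by summary(); a sketch:

```r
library(MASS)

Boston$chas <- as.factor(Boston$chas)
lm.fit <- lm(medv ~ ., data = Boston)
coefs  <- summary(lm.fit)$coefficients   # columns: Estimate, Std. Error, t value, Pr(>|t|)

# Predictors whose p-values exceed the 5% significance level
rownames(coefs)[coefs[, "Pr(>|t|)"] > 0.05]
# → "indus" "age"
```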
The interpretation of the chas1 dummy variable:
Notice that R has created a dummy variable, chas1. chas is a categorical variable with only two values, 1 and 0: a house on the bank of the Charles River is coded 1, otherwise 0. chas1 takes the value 1 wherever chas is 1 and 0 otherwise, so 0 becomes the reference level. Our model suggests that, holding the other predictors fixed, the median value of houses on the bank of the Charles is higher by about 2.69 on average (i.e. about $2,690, since medv is measured in $1000s) compared to those not on the bank.
contrasts(Boston$chas) #To see how R treats a categorical variable
## 1
## 0 0
## 1 1
The contrasts() function shows that R has created a dummy variable chas1 (the column header of the table above), which takes the value 1 when chas is 1 and 0 when chas is 0.
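If you ever want chas = 1 (river bank) to be the reference level instead, relevel the factor before fitting; a sketch:

```r
library(MASS)

Boston$chas <- as.factor(Boston$chas)
# Make "1" the reference level; R will then build a dummy chas0 instead of chas1
Boston$chas <- relevel(Boston$chas, ref = "1")
contrasts(Boston$chas)
```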
Let’s fit another model without age and indus
lm.fit2 = lm(medv ~ . -age - indus, data=Boston) #Prefix variable names with a minus sign to exclude them
summary(lm.fit2)
##
## Call:
## lm(formula = medv ~ . - age - indus, data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.5984 -2.7386 -0.5046 1.7273 26.2373
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 36.341145 5.067492 7.171 2.73e-12 ***
## crim -0.108413 0.032779 -3.307 0.001010 **
## zn 0.045845 0.013523 3.390 0.000754 ***
## chas1 2.718716 0.854240 3.183 0.001551 **
## nox -17.376023 3.535243 -4.915 1.21e-06 ***
## rm 3.801579 0.406316 9.356 < 2e-16 ***
## dis -1.492711 0.185731 -8.037 6.84e-15 ***
## rad 0.299608 0.063402 4.726 3.00e-06 ***
## tax -0.011778 0.003372 -3.493 0.000521 ***
## ptratio -0.946525 0.129066 -7.334 9.24e-13 ***
## black 0.009291 0.002674 3.475 0.000557 ***
## lstat -0.522553 0.047424 -11.019 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.736 on 494 degrees of freedom
## Multiple R-squared: 0.7406, Adjusted R-squared: 0.7348
## F-statistic: 128.2 on 11 and 494 DF, p-value: < 2.2e-16
Although all the remaining variables are significant based on their p-values, the model itself hasn't improved much. We can also drop some variables with very small coefficients, such as zn, tax, and black, and see what happens.
lm.fit3 = update(lm.fit2, ~. -zn -tax -black) #update() removes these variables from the model
summary(lm.fit3)
##
## Call:
## lm(formula = medv ~ crim + chas + nox + rm + dis + rad + ptratio +
## lstat, data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.0030 -3.0225 -0.6074 1.8951 26.8346
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 40.80894 4.96675 8.216 1.83e-15 ***
## crim -0.11061 0.03338 -3.314 0.000988 ***
## chas1 3.09444 0.87505 3.536 0.000444 ***
## nox -21.95511 3.51626 -6.244 9.17e-10 ***
## rm 3.99022 0.41033 9.724 < 2e-16 ***
## dis -1.19964 0.16552 -7.248 1.63e-12 ***
## rad 0.11877 0.04062 2.924 0.003611 **
## ptratio -1.11544 0.12426 -8.977 < 2e-16 ***
## lstat -0.55201 0.04827 -11.435 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.877 on 497 degrees of freedom
## Multiple R-squared: 0.7233, Adjusted R-squared: 0.7188
## F-statistic: 162.4 on 8 and 497 DF, p-value: < 2.2e-16
By removing those few variables, our model has become worse in terms of both R-squared and adjusted R-squared.
We also notice that as we drop variables, the coefficients of the remaining variables keep changing: chas1 has gone up from 2.69 to 3.09, nox has gone down from -17.77 to -21.96, the intercept (or bias) has gone up from 36.46 to 40.81, and rad has dropped from 0.31 to 0.12.
cor(Boston[,c("indus","age","zn","tax","black","nox","rad")])
## indus age zn tax black nox
## indus 1.0000000 0.6447785 -0.5338282 0.7207602 -0.3569765 0.7636514
## age 0.6447785 1.0000000 -0.5695373 0.5064556 -0.2735340 0.7314701
## zn -0.5338282 -0.5695373 1.0000000 -0.3145633 0.1755203 -0.5166037
## tax 0.7207602 0.5064556 -0.3145633 1.0000000 -0.4418080 0.6680232
## black -0.3569765 -0.2735340 0.1755203 -0.4418080 1.0000000 -0.3800506
## nox 0.7636514 0.7314701 -0.5166037 0.6680232 -0.3800506 1.0000000
## rad 0.5951293 0.4560225 -0.3119478 0.9102282 -0.4444128 0.6114406
## rad
## indus 0.5951293
## age 0.4560225
## zn -0.3119478
## tax 0.9102282
## black -0.4444128
## nox 0.6114406
## rad 1.0000000
nox is strongly positively correlated with indus, age, tax, and rad, and negatively correlated with zn; rad, in turn, is strongly correlated with tax and positively correlated with indus. These correlations (collinearity among the predictors) likely explain why the coefficients shifted in the new model.
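Collinearity like this can be quantified with variance inflation factors (VIFs). Base R has no vif() function, so here is a hand-rolled sketch (the vif helper and the informal cutoff of 5 are our choices, not from the lab):

```r
library(MASS)

# VIF by hand: regress each predictor on all the others; VIF_j = 1 / (1 - R^2_j).
# Values well above ~5 signal strong collinearity.
X <- Boston[, !(names(Boston) %in% c("medv", "chas"))]   # quantitative predictors
vif <- sapply(names(X), function(v) {
  r2 <- summary(lm(reformulate(setdiff(names(X), v), response = v), data = X))$r.squared
  1 / (1 - r2)
})
round(sort(vif, decreasing = TRUE), 2)   # tax and rad should top the list
```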
Let’s plot the models and try to understand the problems:
par(mfrow=c(2,2))
plot(lm.fit3)
The residuals-vs-fitted and normal Q-Q plots reveal non-linearity in the data. That means we may have to consider higher-order polynomial terms, or lower powers, for some of the variables.
If you wish to see all the components of the model object built by R, use the names() function.
names(lm.fit3)
## [1] "coefficients" "residuals" "effects" "rank"
## [5] "fitted.values" "assign" "qr" "df.residual"
## [9] "contrasts" "xlevels" "call" "terms"
## [13] "model"
lm.fit3$contrasts[1] #Since there is only one dummy variable
## $chas
## [1] "contr.treatment"
lm.fit3$xlevels
## $chas
## [1] "0" "1"
Let's look at the coefficient values of the predictors.
lm.fit3$coefficients
## (Intercept) crim chas1 nox rm dis
## 40.8089437 -0.1106138 3.0944416 -21.9551069 3.9902161 -1.1996435
## rad ptratio lstat
## 0.1187750 -1.1154412 -0.5520089
We can see that nox and the intercept (bias) term have the largest coefficients in magnitude: a one-unit change in nox would bring medv down by about 22. Of course, the practical impact of each variable also depends on its scale. It is therefore sometimes useful to normalize the variables, removing scale differences, to understand the model better.
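Normalization can be sketched with base R's scale(): after standardizing, each slope measures the change in medv per one-standard-deviation change in that predictor, so the coefficients become directly comparable.

```r
library(MASS)

# Refit lm.fit3 with standardized predictors (chas stays a factor)
vars <- c("crim", "nox", "rm", "dis", "rad", "ptratio", "lstat")
std  <- Boston
std[vars] <- scale(std[vars])      # center to mean 0, scale to sd 1
std$chas  <- as.factor(std$chas)

lm.std <- lm(medv ~ crim + chas + nox + rm + dis + rad + ptratio + lstat, data = std)
round(coef(lm.std), 3)             # slopes now share a common scale
```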
Right skewed variables: crim, nox, dis, lstat
Left skewed variables: ptratio
While rm and rad do not show much skewness.
Let’s try to bring some symmetry in the variables, through various transformations.
attach(Boston) #Attaches the data set to R's search path, so we can call variables by name without referring to Boston
par(mfrow=c(2,2))
hist(log(crim)); hist(log(nox)); hist(log(dis)); hist(sqrt(lstat))
As we can see, the log transformation of crim, nox, and dis, and the square root of lstat, have reduced the skewness to a great extent.
Now let’s transform ptratio, which is left skewed.
hist((ptratio)**5)
We can take either the 5th or the 4th power of ptratio.
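Rather than eyeballing histograms, skewness can be quantified numerically. Base R has no skewness function, so the moment-based formula is coded by hand below (the skewness helper is our addition, not from the lab):

```r
library(MASS)

# Sample skewness: third central moment divided by the cubed standard deviation
skewness <- function(x) {
  m <- mean(x)
  mean((x - m)^3) / sd(x)^3
}

skewness(Boston$ptratio)    # negative, confirming the left skew
skewness(Boston$ptratio^5)  # should move closer to zero after the power transform
```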
Now, let’s build the model with the transformed variables.
lm.fit4 = lm(medv ~ log(crim)+ log(nox) + log(dis) + sqrt(lstat) + I(ptratio^5) + chas + rm + rad)
#Use the I() function to raise a variable to a power, as ^ has a special meaning in lm() formulas
summary(lm.fit4)
##
## Call:
## lm(formula = medv ~ log(crim) + log(nox) + log(dis) + sqrt(lstat) +
## I(ptratio^5) + chas + rm + rad)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.5366 -2.7585 -0.3929 2.1939 22.7735
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.168e+01 3.641e+00 5.953 4.98e-09 ***
## log(crim) 1.216e-01 2.578e-01 0.472 0.637418
## log(nox) -1.318e+01 2.347e+00 -5.617 3.24e-08 ***
## log(dis) -6.449e+00 7.652e-01 -8.429 3.80e-16 ***
## sqrt(lstat) -5.298e+00 3.479e-01 -15.229 < 2e-16 ***
## I(ptratio^5) -1.530e-06 2.197e-07 -6.967 1.03e-11 ***
## chas1 2.907e+00 8.295e-01 3.505 0.000498 ***
## rm 3.550e+00 3.947e-01 8.995 < 2e-16 ***
## rad -1.431e-02 4.808e-02 -0.298 0.766055
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.622 on 497 degrees of freedom
## Multiple R-squared: 0.7515, Adjusted R-squared: 0.7475
## F-statistic: 187.8 on 8 and 497 DF, p-value: < 2.2e-16
Although the model doesn't show much improvement, we can drop log(crim) and rad, as their p-values are larger than 0.05.
lm.fit5 = lm(medv ~ log(nox) + log(dis) + sqrt(lstat) + I(ptratio^5) + chas + rm)
summary(lm.fit5)
##
## Call:
## lm(formula = medv ~ log(nox) + log(dis) + sqrt(lstat) + I(ptratio^5) +
## chas + rm)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.6349 -2.7224 -0.4356 2.2259 22.8071
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.162e+01 3.632e+00 5.953 4.98e-09 ***
## log(nox) -1.277e+01 2.106e+00 -6.063 2.65e-09 ***
## log(dis) -6.513e+00 7.519e-01 -8.662 < 2e-16 ***
## sqrt(lstat) -5.267e+00 3.394e-01 -15.519 < 2e-16 ***
## I(ptratio^5) -1.522e-06 2.045e-07 -7.446 4.26e-13 ***
## chas1 2.913e+00 8.278e-01 3.518 0.000474 ***
## rm 3.554e+00 3.899e-01 9.114 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.614 on 499 degrees of freedom
## Multiple R-squared: 0.7514, Adjusted R-squared: 0.7484
## F-statistic: 251.3 on 6 and 499 DF, p-value: < 2.2e-16
How about adding back some of the variables dropped earlier? We can transform them first to remove their skewness.
We can use 1/log(indus) and the 3rd power of age; tax doesn't show much skewness.
zn is zero for most records, and black lies between 350 and 400 for most records (although its full range is 0.32 to 396.90).
Hence, we can either ignore zn and black or include them and see how they perform in the model.
lm.fit6 = lm(medv ~ log(crim) + log(nox) + log(dis) + sqrt(lstat) + I(ptratio^5) + chas + 1/log(indus) + (I(age^3)) + zn + tax + rm + black)
#Caution: 1/log(indus) needs I(), i.e. I(1/log(indus)); written as above, the formula parser treats / as nesting and silently drops the term (note that no indus coefficient appears in the summary below)
summary(lm.fit6)
##
## Call:
## lm(formula = medv ~ log(crim) + log(nox) + log(dis) + sqrt(lstat) +
## I(ptratio^5) + chas + 1/log(indus) + (I(age^3)) + zn + tax +
## rm + black)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.8306 -2.4388 -0.3933 2.0180 22.4578
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.101e+01 3.994e+00 5.260 2.15e-07 ***
## log(crim) 6.791e-01 2.231e-01 3.044 0.002459 **
## log(nox) -1.204e+01 2.357e+00 -5.109 4.64e-07 ***
## log(dis) -6.707e+00 8.218e-01 -8.161 2.78e-15 ***
## sqrt(lstat) -5.121e+00 3.665e-01 -13.974 < 2e-16 ***
## I(ptratio^5) -1.284e-06 2.258e-07 -5.685 2.24e-08 ***
## chas1 2.600e+00 8.137e-01 3.195 0.001486 **
## I(age^3) 4.041e-08 1.049e-06 0.039 0.969293
## zn 2.949e-02 1.199e-02 2.459 0.014280 *
## tax -7.651e-03 2.378e-03 -3.218 0.001377 **
## rm 3.579e+00 3.951e-01 9.059 < 2e-16 ***
## black 9.449e-03 2.575e-03 3.670 0.000269 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.513 on 494 degrees of freedom
## Multiple R-squared: 0.7645, Adjusted R-squared: 0.7593
## F-statistic: 145.8 on 11 and 494 DF, p-value: < 2.2e-16
Since the p-value of age^3 is much larger than 0.05, we can drop it.
lm.fit7 = lm(medv ~ log(crim) + log(nox) + log(dis) + sqrt(lstat) + I(ptratio^5) + chas + 1/log(indus) + zn + tax + rm + black)
summary(lm.fit7)
##
## Call:
## lm(formula = medv ~ log(crim) + log(nox) + log(dis) + sqrt(lstat) +
## I(ptratio^5) + chas + 1/log(indus) + zn + tax + rm + black)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.8403 -2.4291 -0.3915 2.0199 22.4658
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.102e+01 3.986e+00 5.273 2.01e-07 ***
## log(crim) 6.802e-01 2.212e-01 3.075 0.002223 **
## log(nox) -1.202e+01 2.306e+00 -5.214 2.72e-07 ***
## log(dis) -6.718e+00 7.654e-01 -8.778 < 2e-16 ***
## sqrt(lstat) -5.116e+00 3.426e-01 -14.934 < 2e-16 ***
## I(ptratio^5) -1.283e-06 2.244e-07 -5.716 1.88e-08 ***
## chas1 2.601e+00 8.125e-01 3.202 0.001454 **
## zn 2.952e-02 1.196e-02 2.468 0.013931 *
## tax -7.663e-03 2.355e-03 -3.254 0.001217 **
## rm 3.582e+00 3.872e-01 9.251 < 2e-16 ***
## black 9.461e-03 2.556e-03 3.702 0.000238 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.508 on 495 degrees of freedom
## Multiple R-squared: 0.7645, Adjusted R-squared: 0.7597
## F-statistic: 160.7 on 10 and 495 DF, p-value: < 2.2e-16
The coefficient of ptratio^5 looks extremely small (partly because ptratio^5 itself takes very large values); let's drop the term and see what happens.
lm.fit8 = lm(medv ~ log(crim) + log(nox) + log(dis) + sqrt(lstat) + chas + 1/log(indus) + zn + tax + rm + black)
summary(lm.fit8)
##
## Call:
## lm(formula = medv ~ log(crim) + log(nox) + log(dis) + sqrt(lstat) +
## chas + 1/log(indus) + zn + tax + rm + black)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.5019 -2.7145 -0.4338 2.1731 22.1456
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 20.574601 4.110193 5.006 7.74e-07 ***
## log(crim) 0.638173 0.228044 2.798 0.005334 **
## log(nox) -10.075400 2.352300 -4.283 2.21e-05 ***
## log(dis) -6.942027 0.788440 -8.805 < 2e-16 ***
## sqrt(lstat) -5.333125 0.351179 -15.186 < 2e-16 ***
## chas1 3.089209 0.833356 3.707 0.000233 ***
## zn 0.048815 0.011837 4.124 4.36e-05 ***
## tax -0.011911 0.002305 -5.167 3.45e-07 ***
## rm 3.774133 0.397886 9.485 < 2e-16 ***
## black 0.008858 0.002634 3.363 0.000831 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.65 on 496 degrees of freedom
## Multiple R-squared: 0.749, Adjusted R-squared: 0.7444
## F-statistic: 164.4 on 9 and 496 DF, p-value: < 2.2e-16
lm.fit7 remains our best bet so far in terms of R-squared and adjusted R-squared, while keeping all the variables significant by their p-values. That model explains 76.45% of the variability in the data.
Let’s look at the residual plots:
par(mfrow=c(2,2))
plot(lm.fit7)
There is still non-linearity in the residual plot.
Let’s try adding all the polynomial terms of age and ptratio and see what happens:
lm.fit9 = lm(medv ~ log(crim) + log(nox) + log(dis) + sqrt(lstat) + ptratio + I(ptratio^2) + I(ptratio^3) + I(ptratio^4) + I(ptratio^5) + chas + 1/log(indus) + age + (I(age^2)) + (I(age^3)) + zn + tax + rm + black)
summary(lm.fit9)
##
## Call:
## lm(formula = medv ~ log(crim) + log(nox) + log(dis) + sqrt(lstat) +
## ptratio + I(ptratio^2) + I(ptratio^3) + I(ptratio^4) + I(ptratio^5) +
## chas + 1/log(indus) + age + (I(age^2)) + (I(age^3)) + zn +
## tax + rm + black)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.460 -2.456 -0.374 1.847 22.529
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.024e+04 5.246e+03 -1.952 0.051489 .
## log(crim) 5.088e-01 2.283e-01 2.229 0.026302 *
## log(nox) -1.551e+01 2.536e+00 -6.117 1.96e-09 ***
## log(dis) -6.564e+00 8.280e-01 -7.928 1.53e-14 ***
## sqrt(lstat) -5.055e+00 3.688e-01 -13.705 < 2e-16 ***
## ptratio 3.023e+03 1.561e+03 1.937 0.053271 .
## I(ptratio^2) -3.522e+02 1.843e+02 -1.911 0.056528 .
## I(ptratio^3) 2.029e+01 1.080e+01 1.879 0.060782 .
## I(ptratio^4) -5.786e-01 3.140e-01 -1.843 0.065991 .
## I(ptratio^5) 6.538e-03 3.628e-03 1.802 0.072168 .
## chas1 2.924e+00 8.144e-01 3.591 0.000363 ***
## age 7.576e-02 1.107e-01 0.685 0.493907
## I(age^2) -1.219e-03 2.208e-03 -0.552 0.581129
## I(age^3) 5.988e-06 1.303e-05 0.460 0.645934
## zn 1.204e-02 1.394e-02 0.864 0.388247
## tax -5.022e-03 2.632e-03 -1.908 0.056999 .
## rm 3.454e+00 4.060e-01 8.508 < 2e-16 ***
## black 9.147e-03 2.566e-03 3.565 0.000400 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.459 on 488 degrees of freedom
## Multiple R-squared: 0.7728, Adjusted R-squared: 0.7649
## F-statistic: 97.66 on 17 and 488 DF, p-value: < 2.2e-16
As you can see, most of the polynomial terms are insignificant (large p-values) and hence do not contribute to the model.
par(mfrow=c(2,2))
plot(lm.fit9)
There is no real change in the residual plot.
Hence, we can drop the polynomial terms. Let's analyze the residuals of lm.fit7.
res = lm.fit7$residuals
hist(res)
boxplot(res)
The residuals do appear roughly normally distributed. However, the boxplot shows some outliers. Let's look at them.
outliers = boxplot.stats(res)$out
outliers
## 8 167 187 196 215 226 229
## 11.189451 9.187839 12.926856 10.418538 13.608669 8.859031 9.863847
## 234 263 268 365 366 368 369
## 10.008200 9.157586 11.147849 -15.840305 11.639497 10.563031 21.594391
## 370 371 372 373 375 376 381
## 13.523434 10.185482 22.465833 21.240269 9.175768 -12.769298 -14.201047
## 402 406 413 506
## -10.980890 -10.970145 14.207404 -11.231709
As we can see, there are many outliers. Let's get their indices.
idx = res %in% outliers #To get the indices
idx <- which(idx)
idx #Indices
## [1] 8 167 187 196 215 226 229 234 263 268 365 366 368 369 370 371 372 373 375
## [20] 376 381 402 406 413 506
Let's remove these records from the data and then rebuild the model.
newData = Boston[-idx,]
names(newData)
## [1] "crim" "zn" "indus" "chas" "nox" "rm" "age"
## [8] "dis" "rad" "tax" "ptratio" "black" "lstat" "medv"
detach(Boston) #Detach Boston, because we are going to use newData now
attach(newData) #Attach newData for building the model
lm.fit10 = lm(medv ~ log(crim) + log(nox) + log(dis) + sqrt(lstat) + I(ptratio^5) + chas + 1/log(indus) + zn + tax + rm + black)
summary(lm.fit10)
##
## Call:
## lm(formula = medv ~ log(crim) + log(nox) + log(dis) + sqrt(lstat) +
## I(ptratio^5) + chas + 1/log(indus) + zn + tax + rm + black)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.599 -2.013 -0.358 1.729 10.284
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.594e+00 3.366e+00 1.662 0.09718 .
## log(crim) 5.108e-01 1.677e-01 3.046 0.00245 **
## log(nox) -7.247e+00 1.760e+00 -4.118 4.52e-05 ***
## log(dis) -4.187e+00 5.967e-01 -7.017 7.94e-12 ***
## sqrt(lstat) -3.686e+00 2.897e-01 -12.725 < 2e-16 ***
## I(ptratio^5) -1.145e-06 1.685e-07 -6.792 3.35e-11 ***
## chas1 1.886e+00 6.354e-01 2.968 0.00315 **
## zn 2.539e-02 8.948e-03 2.837 0.00475 **
## tax -8.778e-03 1.759e-03 -4.991 8.48e-07 ***
## rm 5.036e+00 3.439e-01 14.645 < 2e-16 ***
## black 1.224e-02 1.945e-03 6.293 7.13e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.312 on 470 degrees of freedom
## Multiple R-squared: 0.8392, Adjusted R-squared: 0.8358
## F-statistic: 245.4 on 10 and 470 DF, p-value: < 2.2e-16
As we can see, the model has improved significantly after removing those outliers. (We generally don't worry about the p-value of the intercept: it gives the mean of the response when all the predictors are zero, and setting every predictor to zero is not meaningful here.)
Our model now explains about 84% of the variability in the data. We could keep trying different permutations and combinations to find the best model, but doing so properly requires working closely with subject matter experts on many other factors, such as understanding the variables better, obtaining more variables, and getting more data.
par(mfrow=c(2,2))
plot(lm.fit10)
There is some improvement in the residual plot as well.
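In-sample R-squared can flatter a model, so a simple hold-out check is worth sketching. Here the lm.fit7 predictors are refit on a random 80% of the original data and scored on the rest (the seed and the 80/20 split are arbitrary choices, not from the lab):

```r
library(MASS)

Boston$chas <- as.factor(Boston$chas)
set.seed(42)                               # arbitrary seed, for reproducibility
n     <- nrow(Boston)
train <- sample(n, size = round(0.8 * n))  # 80/20 split

fit <- lm(medv ~ log(crim) + log(nox) + log(dis) + sqrt(lstat) +
            I(ptratio^5) + chas + zn + tax + rm + black,
          data = Boston[train, ])

pred <- predict(fit, newdata = Boston[-train, ])
rmse <- sqrt(mean((Boston$medv[-train] - pred)^2))
rmse  # out-of-sample error, in the same units as medv ($1000s)
```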
Interaction Effect
Let's try one last option before we close this lab session: an interaction term. An interaction term captures the fact that two independent variables sometimes interact with each other, and that their interaction influences the response variable beyond their individual effects.
For example, the marketing budget for digital media could be one variable and the budget for print media another. Marketing in digital media can reinforce marketing in print media and vice versa, and together they drive overall sales. So splitting the budget between digital and print marketing may increase sales more than allocating the whole budget to either one alone.
An interaction term is created by multiplying the two variables: (digital marketing budget) × (print marketing budget).
We will create interaction terms of crim and tax; zn and dis; lstat and rm
In R, putting ":" (colon) between two variables creates only the interaction term, while "*" (star) also includes the individual variables along with the interaction; i.e., crim*tax is shorthand for crim + tax + crim:tax.
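This expansion can be verified with model.matrix(); a minimal check:

```r
library(MASS)

# "*" expands to main effects plus the interaction; ":" gives only the product term
colnames(model.matrix(~ crim * tax, data = Boston))
# → "(Intercept)" "crim" "tax" "crim:tax"
colnames(model.matrix(~ crim:tax, data = Boston))
# → "(Intercept)" "crim:tax"
```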
Whether to use the transformed or the original variable in an interaction term depends on what you want the interaction to explain. If you want it to explain changes in the slope of the transformed variable, use the transformed variable; if you wish to explain changes in the slope of the original variable, then it matters little which form you use.
lm.fit11 = lm(medv ~ log(nox) + log(dis) + sqrt(lstat) + I(ptratio^5) + chas + 1/log(indus) + zn + tax + rm + black + zn:log(dis) + sqrt(lstat):rm)
summary(lm.fit11)
##
## Call:
## lm(formula = medv ~ log(nox) + log(dis) + sqrt(lstat) + I(ptratio^5) +
## chas + 1/log(indus) + zn + tax + rm + black + zn:log(dis) +
## sqrt(lstat):rm)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.5185 -1.7733 -0.2025 1.8610 11.6327
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.935e+01 4.816e+00 -8.170 2.88e-15 ***
## log(nox) -3.717e+00 1.502e+00 -2.475 0.01367 *
## log(dis) -2.362e+00 5.792e-01 -4.078 5.35e-05 ***
## sqrt(lstat) 9.822e+00 1.179e+00 8.330 8.94e-16 ***
## I(ptratio^5) -8.463e-07 1.518e-07 -5.574 4.21e-08 ***
## chas1 1.647e+00 5.640e-01 2.921 0.00366 **
## zn 8.141e-02 3.936e-02 2.068 0.03917 *
## tax -5.169e-03 1.309e-03 -3.949 9.05e-05 ***
## rm 1.204e+01 6.798e-01 17.707 < 2e-16 ***
## black 8.205e-03 1.709e-03 4.802 2.11e-06 ***
## log(dis):zn -4.165e-02 2.013e-02 -2.070 0.03904 *
## sqrt(lstat):rm -2.191e+00 1.885e-01 -11.628 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.939 on 469 degrees of freedom
## Multiple R-squared: 0.8737, Adjusted R-squared: 0.8707
## F-statistic: 294.9 on 11 and 469 DF, p-value: < 2.2e-16
log(crim) and log(crim):tax had high p-values, so they were dropped.
Finally, the model explains 87.4% of the variability (R-squared) in the data. It can be improved further.
par(mfrow=c(2,2))
plot(lm.fit11)
The residual plot looks much better now.
The main objective of this lab was to show the different ways one can go about improving the performance of a model. Model building is not just typing some code using predefined functions in R or Python; it requires a solid understanding of the data, the industry, the domain, and various other factors.
Exercise: Try to improve the model further using different options, such as normalizing the features.