Multivariate Linear Regression in R

 

In the first lab, we worked through a simple linear regression in R (using the women data set from base R); while improving the fit, our model eventually became a multivariate model once we added transformed versions of the feature.

Later, in the second lab, we built a machine learning algorithm from scratch, defining various functions and thereby building a predictive model from our Experiment data set.

Now, let's use R's built-in lm() function to build a multivariate model.

For this lab we will use the Boston data set from the MASS package. It records medv (median home value, in $1000s) for 506 neighborhoods around Boston and contains 14 variables in total, including medv. We will use medv as our response variable and the others as our independent variables.

You can find more information on the data set by running ?Boston after loading MASS.

Loading the MASS package and the Boston data, and performing basic Exploratory Data Analysis (EDA)

We will load the data and perform only basic EDA (we will not go into detail, as our main focus is to show the different aspects of building a multivariate linear regression model).

The main intention behind this lab is to show the challenges and roadblocks one can face while building a model with multiple variables. There isn't a single perfect sequence to follow. Therefore, go through this lab with an open mind and do not expect a perfect model at the end (many of the remaining factors can only be handled with the help of subject matter experts, including getting more data).

Let’s load the package MASS

library(MASS)

names(Boston) #All the variables
##  [1] "crim"    "zn"      "indus"   "chas"    "nox"     "rm"      "age"    
##  [8] "dis"     "rad"     "tax"     "ptratio" "black"   "lstat"   "medv"

There are 14 variables in the data set

Summary of data

Let’s look at summary of the data

summary(Boston)
##       crim                zn             indus            chas        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
##  1st Qu.: 0.08205   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
##       nox               rm             age              dis        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       rad              tax           ptratio          black       
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
##  Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
##      lstat            medv      
##  Min.   : 1.73   Min.   : 5.00  
##  1st Qu.: 6.95   1st Qu.:17.02  
##  Median :11.36   Median :21.20  
##  Mean   :12.65   Mean   :22.53  
##  3rd Qu.:16.95   3rd Qu.:25.00  
##  Max.   :37.97   Max.   :50.00

Except for chas, all the variables are quantitative. We must convert chas into a factor variable before building our model.

From mean and median comparison, we can see that,

crim, zn, indus, nox, dis, lstat, medv are right skewed: mean > median

age, ptratio, and black are left skewed: median > mean

rm, rad, and tax do not show much skewness when plotted.

Exercise: Plot the histogram of all these variables to see their distribution
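One possible sketch for this exercise (chas is skipped, since it is the lone categorical variable):

```r
library(MASS)  # provides the Boston data set

# Histograms of all 13 quantitative variables in one 4x4 grid
quant_vars <- setdiff(names(Boston), "chas")
par(mfrow = c(4, 4))
for (v in quant_vars) {
  hist(Boston[[v]], main = v, xlab = v)
}
par(mfrow = c(1, 1))  # reset the plotting layout
```

The mean-vs-median comparisons above should match what you see: long right tails for crim, lstat, etc., and a left tail for ptratio.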

Scatter plot of the data

Boston$chas <- as.factor(Boston$chas) #Converting qualitative variable into a factor variable

plot(Boston)

Since there are 14 variables, the scatter-plot matrix isn't legible.

Correlation Matrix

Let’s look at the correlations among different variables

cor(Boston[,-4]) #Not using chas for correlation as it is not a quantitative variable
##               crim         zn      indus        nox         rm        age
## crim     1.0000000 -0.2004692  0.4065834  0.4209717 -0.2192467  0.3527343
## zn      -0.2004692  1.0000000 -0.5338282 -0.5166037  0.3119906 -0.5695373
## indus    0.4065834 -0.5338282  1.0000000  0.7636514 -0.3916759  0.6447785
## nox      0.4209717 -0.5166037  0.7636514  1.0000000 -0.3021882  0.7314701
## rm      -0.2192467  0.3119906 -0.3916759 -0.3021882  1.0000000 -0.2402649
## age      0.3527343 -0.5695373  0.6447785  0.7314701 -0.2402649  1.0000000
## dis     -0.3796701  0.6644082 -0.7080270 -0.7692301  0.2052462 -0.7478805
## rad      0.6255051 -0.3119478  0.5951293  0.6114406 -0.2098467  0.4560225
## tax      0.5827643 -0.3145633  0.7207602  0.6680232 -0.2920478  0.5064556
## ptratio  0.2899456 -0.3916785  0.3832476  0.1889327 -0.3555015  0.2615150
## black   -0.3850639  0.1755203 -0.3569765 -0.3800506  0.1280686 -0.2735340
## lstat    0.4556215 -0.4129946  0.6037997  0.5908789 -0.6138083  0.6023385
## medv    -0.3883046  0.3604453 -0.4837252 -0.4273208  0.6953599 -0.3769546
##                dis        rad        tax    ptratio      black      lstat
## crim    -0.3796701  0.6255051  0.5827643  0.2899456 -0.3850639  0.4556215
## zn       0.6644082 -0.3119478 -0.3145633 -0.3916785  0.1755203 -0.4129946
## indus   -0.7080270  0.5951293  0.7207602  0.3832476 -0.3569765  0.6037997
## nox     -0.7692301  0.6114406  0.6680232  0.1889327 -0.3800506  0.5908789
## rm       0.2052462 -0.2098467 -0.2920478 -0.3555015  0.1280686 -0.6138083
## age     -0.7478805  0.4560225  0.5064556  0.2615150 -0.2735340  0.6023385
## dis      1.0000000 -0.4945879 -0.5344316 -0.2324705  0.2915117 -0.4969958
## rad     -0.4945879  1.0000000  0.9102282  0.4647412 -0.4444128  0.4886763
## tax     -0.5344316  0.9102282  1.0000000  0.4608530 -0.4418080  0.5439934
## ptratio -0.2324705  0.4647412  0.4608530  1.0000000 -0.1773833  0.3740443
## black    0.2915117 -0.4444128 -0.4418080 -0.1773833  1.0000000 -0.3660869
## lstat   -0.4969958  0.4886763  0.5439934  0.3740443 -0.3660869  1.0000000
## medv     0.2499287 -0.3816262 -0.4685359 -0.5077867  0.3334608 -0.7376627
##               medv
## crim    -0.3883046
## zn       0.3604453
## indus   -0.4837252
## nox     -0.4273208
## rm       0.6953599
## age     -0.3769546
## dis      0.2499287
## rad     -0.3816262
## tax     -0.4685359
## ptratio -0.5077867
## black    0.3334608
## lstat   -0.7376627
## medv     1.0000000

As we can see, some of the variables have strong correlations such as crim with rad and tax; zn with age and dis; indus with nox, age, dis, rad, tax and lstat. Similarly, we can see that others also have some strong correlations with different variables.

Missing data

Let’s see if there are any missing values in the data set

Boston[!complete.cases(Boston),]
##  [1] crim    zn      indus   chas    nox     rm      age     dis     rad    
## [10] tax     ptratio black   lstat   medv   
## <0 rows> (or 0-length row.names)

As we can see, the data set is complete, with no missing values.

Model Building using R functions

lm.fit = lm(medv ~ ., data=Boston) #To include all variables, we just use ~ . shorthand

Summary of the model

summary(lm.fit)
## 
## Call:
## lm(formula = medv ~ ., data = Boston)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -15.595  -2.730  -0.518   1.777  26.199 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  3.646e+01  5.103e+00   7.144 3.28e-12 ***
## crim        -1.080e-01  3.286e-02  -3.287 0.001087 ** 
## zn           4.642e-02  1.373e-02   3.382 0.000778 ***
## indus        2.056e-02  6.150e-02   0.334 0.738288    
## chas1        2.687e+00  8.616e-01   3.118 0.001925 ** 
## nox         -1.777e+01  3.820e+00  -4.651 4.25e-06 ***
## rm           3.810e+00  4.179e-01   9.116  < 2e-16 ***
## age          6.922e-04  1.321e-02   0.052 0.958229    
## dis         -1.476e+00  1.995e-01  -7.398 6.01e-13 ***
## rad          3.060e-01  6.635e-02   4.613 5.07e-06 ***
## tax         -1.233e-02  3.760e-03  -3.280 0.001112 ** 
## ptratio     -9.527e-01  1.308e-01  -7.283 1.31e-12 ***
## black        9.312e-03  2.686e-03   3.467 0.000573 ***
## lstat       -5.248e-01  5.072e-02 -10.347  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.745 on 492 degrees of freedom
## Multiple R-squared:  0.7406, Adjusted R-squared:  0.7338 
## F-statistic: 108.1 on 13 and 492 DF,  p-value: < 2.2e-16

The R-squared value is only 0.74, which means our model explains just 74% of the variability in the data. In multivariate linear regression, R-squared turns out to be nothing but the square of the correlation between the response variable and the model's fitted values (i.e. the predicted values of the response). Hence, our model is not yet a good fit.
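We can verify that claim directly: for a linear model with an intercept, the multiple R-squared equals the squared correlation between the observed response and the fitted values (chas is left numeric here for brevity; the identity holds either way).

```r
library(MASS)
fit <- lm(medv ~ ., data = Boston)

r2_reported <- summary(fit)$r.squared
r2_from_cor <- cor(Boston$medv, fitted(fit))^2

all.equal(r2_reported, r2_from_cor)  # TRUE
```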

We also notice that the p-values of indus and age are larger than 0.05 (our 5% significance level), so these variables are not significant and we should drop them.

Interpretation of the chas1 dummy variable:

Notice that R has created a dummy variable chas1. chas is a categorical variable with only two values: 1 for a house at the bank of the Charles River and 0 otherwise. chas1 takes the value 1 wherever chas is 1 and 0 otherwise, so 0 becomes the reference level. Our model suggests that, with the other variables held fixed, the median value of houses at the bank of the Charles is on average about 2.69 (i.e. $2,687) higher than that of houses away from the river — an additive shift, not a multiplicative factor.
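A quick way to see that the chas1 effect is additive: predict medv for two rows that are identical except for chas — the predictions differ by exactly the chas1 coefficient (a sketch; the duplicated row is arbitrary).

```r
library(MASS)
Boston$chas <- as.factor(Boston$chas)
fit <- lm(medv ~ ., data = Boston)

nd <- Boston[c(1, 1), ]                       # duplicate an arbitrary row
nd$chas <- factor(c(0, 1), levels = c(0, 1))  # differ only in chas
preds <- predict(fit, newdata = nd)
unname(diff(preds))                           # equals the chas1 coefficient, ~2.687
```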

contrasts(Boston$chas) #To see how R treats a categorical variable
##   1
## 0 0
## 1 1

The contrasts() function confirms that R has created the dummy variable chas1 (the column header of the table above), which takes the value 1 when chas is 1 and 0 when chas is 0.

Let’s fit another model without age and indus

lm.fit2 = lm(medv ~ . -age - indus, data=Boston) #Mention names of variables with minus signs to be excluded
summary(lm.fit2)
## 
## Call:
## lm(formula = medv ~ . - age - indus, data = Boston)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -15.5984  -2.7386  -0.5046   1.7273  26.2373 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  36.341145   5.067492   7.171 2.73e-12 ***
## crim         -0.108413   0.032779  -3.307 0.001010 ** 
## zn            0.045845   0.013523   3.390 0.000754 ***
## chas1         2.718716   0.854240   3.183 0.001551 ** 
## nox         -17.376023   3.535243  -4.915 1.21e-06 ***
## rm            3.801579   0.406316   9.356  < 2e-16 ***
## dis          -1.492711   0.185731  -8.037 6.84e-15 ***
## rad           0.299608   0.063402   4.726 3.00e-06 ***
## tax          -0.011778   0.003372  -3.493 0.000521 ***
## ptratio      -0.946525   0.129066  -7.334 9.24e-13 ***
## black         0.009291   0.002674   3.475 0.000557 ***
## lstat        -0.522553   0.047424 -11.019  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.736 on 494 degrees of freedom
## Multiple R-squared:  0.7406, Adjusted R-squared:  0.7348 
## F-statistic: 128.2 on 11 and 494 DF,  p-value: < 2.2e-16

Although the model now reports all remaining variables as significant by their p-values, the fit itself hasn't improved much. We can also drop some variables that have very small coefficient values, such as zn, tax, and black, and see what happens (bearing in mind that raw coefficient size also depends on each variable's scale).

lm.fit3 = update(lm.fit2, ~. -zn -tax -black) #update() function removes these variables from the model
summary(lm.fit3)
## 
## Call:
## lm(formula = medv ~ crim + chas + nox + rm + dis + rad + ptratio + 
##     lstat, data = Boston)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -16.0030  -3.0225  -0.6074   1.8951  26.8346 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  40.80894    4.96675   8.216 1.83e-15 ***
## crim         -0.11061    0.03338  -3.314 0.000988 ***
## chas1         3.09444    0.87505   3.536 0.000444 ***
## nox         -21.95511    3.51626  -6.244 9.17e-10 ***
## rm            3.99022    0.41033   9.724  < 2e-16 ***
## dis          -1.19964    0.16552  -7.248 1.63e-12 ***
## rad           0.11877    0.04062   2.924 0.003611 ** 
## ptratio      -1.11544    0.12426  -8.977  < 2e-16 ***
## lstat        -0.55201    0.04827 -11.435  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.877 on 497 degrees of freedom
## Multiple R-squared:  0.7233, Adjusted R-squared:  0.7188 
## F-statistic: 162.4 on 8 and 497 DF,  p-value: < 2.2e-16

By removing those few variables, our model has gotten worse in terms of both R-squared and adjusted R-squared.

We also notice that as we drop variables, the coefficients of the remaining variables keep changing: chas1 has risen from 2.69 to 3.09, nox has fallen from -17.77 to -21.96, the intercept (or bias) has risen from 36.46 to 40.81, and rad has dropped from 0.31 to 0.12.

cor(Boston[,c("indus","age","zn","tax","black","nox","rad")])
##            indus        age         zn        tax      black        nox
## indus  1.0000000  0.6447785 -0.5338282  0.7207602 -0.3569765  0.7636514
## age    0.6447785  1.0000000 -0.5695373  0.5064556 -0.2735340  0.7314701
## zn    -0.5338282 -0.5695373  1.0000000 -0.3145633  0.1755203 -0.5166037
## tax    0.7207602  0.5064556 -0.3145633  1.0000000 -0.4418080  0.6680232
## black -0.3569765 -0.2735340  0.1755203 -0.4418080  1.0000000 -0.3800506
## nox    0.7636514  0.7314701 -0.5166037  0.6680232 -0.3800506  1.0000000
## rad    0.5951293  0.4560225 -0.3119478  0.9102282 -0.4444128  0.6114406
##              rad
## indus  0.5951293
## age    0.4560225
## zn    -0.3119478
## tax    0.9102282
## black -0.4444128
## nox    0.6114406
## rad    1.0000000

nox has strong positive correlations with indus, age, tax, and rad, and a negative correlation with zn; rad, in turn, is strongly correlated with tax and positively correlated with indus. These correlations (collinearity) may explain why the coefficient values shifted in the new model.

Let’s plot the models and try to understand the problems:

par(mfrow=c(2,2))
plot(lm.fit3)

Looking at the residuals-vs-fitted and normal Q-Q plots, we see that there is non-linearity in the data. That means we may have to consider higher-order polynomial terms, or lower powers, for some of the variables.
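One simple way to probe such non-linearity (a sketch using lstat, whose relationship with medv is visibly curved, as the illustrative predictor):

```r
library(MASS)

# Compare a straight-line fit with one that adds a quadratic term
fit_lin  <- lm(medv ~ lstat, data = Boston)
fit_quad <- lm(medv ~ poly(lstat, 2), data = Boston)

anova(fit_lin, fit_quad)  # F-test: does the quadratic term improve the fit?
```

A small p-value in the anova() comparison indicates the quadratic term captures real curvature.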

If you wish to see all the components of the model object built by R, use the names() function:

names(lm.fit3)
##  [1] "coefficients"  "residuals"     "effects"       "rank"         
##  [5] "fitted.values" "assign"        "qr"            "df.residual"  
##  [9] "contrasts"     "xlevels"       "call"          "terms"        
## [13] "model"
lm.fit3$contrasts[1] #Since there is only one dummy variable
## $chas
## [1] "contr.treatment"
lm.fit3$xlevels
## $chas
## [1] "0" "1"

Let’s look at the coefficient values of predictors

lm.fit3$coefficients
## (Intercept)        crim       chas1         nox          rm         dis 
##  40.8089437  -0.1106138   3.0944416 -21.9551069   3.9902161  -1.1996435 
##         rad     ptratio       lstat 
##   0.1187750  -1.1154412  -0.5520089

We can see that nox and the bias term have the largest coefficients: a one-unit change in nox lowers medv by about 22 (i.e. $22,000). However, the impact of each variable also depends on its scale, so it is sometimes important to normalize the variables: removing the scale differences gives a better sense of each variable's contribution to the model.
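As a sketch of that normalization idea, we can refit the lm.fit3 predictors on standardized (mean 0, sd 1) versions of the quantitative variables. Standardizing leaves R-squared unchanged; it only puts the coefficients on a common per-standard-deviation scale so their magnitudes become comparable.

```r
library(MASS)
Boston$chas <- as.factor(Boston$chas)

# Standardize the quantitative predictors; leave the response and chas alone
num_vars <- setdiff(names(Boston), c("chas", "medv"))
scaled <- Boston
scaled[num_vars] <- scale(Boston[num_vars])

fit_scaled <- lm(medv ~ crim + chas + nox + rm + dis + rad + ptratio + lstat,
                 data = scaled)

# Largest standardized coefficients = strongest per-SD effects
sort(abs(coef(fit_scaled)[-1]), decreasing = TRUE)
```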

Right skewed variables: crim, nox, dis, lstat

Left skewed variables: ptratio

While rm and rad do not show much skewness.

Let’s try to bring some symmetry in the variables, through various transformations.

attach(Boston) #Attach the data frame to R's search path, so variables can be called by name without referring to Boston
par(mfrow=c(2,2))
hist(log(crim)); hist(log(nox)); hist(log(dis)); hist(sqrt(lstat))

As we can see, the log transformations of crim, nox, and dis and the square root of lstat have reduced the skewness to a great extent.

Now let’s transform ptratio, which is left skewed.

hist((ptratio)**5)

We can take either 5th or 4th power of ptratio.

Now, let’s build the model with the transformed variables.

lm.fit4 = lm(medv ~ log(crim)+ log(nox) + log(dis) + sqrt(lstat) + I(ptratio^5) + chas + rm + rad)

#Use the I() function to raise a variable to a power, as ^ has a special meaning in lm() formulas
summary(lm.fit4)
## 
## Call:
## lm(formula = medv ~ log(crim) + log(nox) + log(dis) + sqrt(lstat) + 
##     I(ptratio^5) + chas + rm + rad)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -16.5366  -2.7585  -0.3929   2.1939  22.7735 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   2.168e+01  3.641e+00   5.953 4.98e-09 ***
## log(crim)     1.216e-01  2.578e-01   0.472 0.637418    
## log(nox)     -1.318e+01  2.347e+00  -5.617 3.24e-08 ***
## log(dis)     -6.449e+00  7.652e-01  -8.429 3.80e-16 ***
## sqrt(lstat)  -5.298e+00  3.479e-01 -15.229  < 2e-16 ***
## I(ptratio^5) -1.530e-06  2.197e-07  -6.967 1.03e-11 ***
## chas1         2.907e+00  8.295e-01   3.505 0.000498 ***
## rm            3.550e+00  3.947e-01   8.995  < 2e-16 ***
## rad          -1.431e-02  4.808e-02  -0.298 0.766055    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.622 on 497 degrees of freedom
## Multiple R-squared:  0.7515, Adjusted R-squared:  0.7475 
## F-statistic: 187.8 on 8 and 497 DF,  p-value: < 2.2e-16

Although the model doesn't show much improvement, we can drop log(crim) and rad, as their p-values are larger than 0.05.

lm.fit5 = lm(medv ~ log(nox) + log(dis) + sqrt(lstat) + I(ptratio^5) + chas + rm)
summary(lm.fit5)
## 
## Call:
## lm(formula = medv ~ log(nox) + log(dis) + sqrt(lstat) + I(ptratio^5) + 
##     chas + rm)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -16.6349  -2.7224  -0.4356   2.2259  22.8071 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   2.162e+01  3.632e+00   5.953 4.98e-09 ***
## log(nox)     -1.277e+01  2.106e+00  -6.063 2.65e-09 ***
## log(dis)     -6.513e+00  7.519e-01  -8.662  < 2e-16 ***
## sqrt(lstat)  -5.267e+00  3.394e-01 -15.519  < 2e-16 ***
## I(ptratio^5) -1.522e-06  2.045e-07  -7.446 4.26e-13 ***
## chas1         2.913e+00  8.278e-01   3.518 0.000474 ***
## rm            3.554e+00  3.899e-01   9.114  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.614 on 499 degrees of freedom
## Multiple R-squared:  0.7514, Adjusted R-squared:  0.7484 
## F-statistic: 251.3 on 6 and 499 DF,  p-value: < 2.2e-16

How about adding back some of the variables we dropped earlier? We can transform them first to reduce their skewness.

We can use 1/log(indus) and the 3rd power of age; tax doesn't show much skewness. (Note: inside an lm() formula, 1/log(indus) is parsed with formula operators rather than arithmetic, so the indus term is silently dropped from the fit — no indus coefficient appears in the output below. To genuinely include the reciprocal, wrap it as I(1/log(indus)).)

zn is zero and black lies between 350 and 400 (although its full range is 0.32 to 396.90) for most of the records.

Hence, we can either ignore zn and black or include them to see how they perform in the model.
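These transformation choices can be sanity-checked numerically. The sketch below uses a hand-rolled moment-based skewness helper (our own function, not from any package) to compare a variable before and after transforming:

```r
library(MASS)

# Moment-based sample skewness: ~0 for symmetric, >0 right skewed, <0 left skewed
skewness <- function(x) mean((x - mean(x))^3) / sd(x)^3

round(c(crim  = skewness(Boston$crim),  log_crim   = skewness(log(Boston$crim))), 2)
round(c(lstat = skewness(Boston$lstat), sqrt_lstat = skewness(sqrt(Boston$lstat))), 2)
```

The transformed versions should show skewness much closer to zero than the raw variables.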

lm.fit6 = lm(medv ~ log(crim) + log(nox) + log(dis) + sqrt(lstat) + I(ptratio^5) + chas + 1/log(indus) + (I(age^3)) + zn + tax + rm + black)
summary(lm.fit6)
## 
## Call:
## lm(formula = medv ~ log(crim) + log(nox) + log(dis) + sqrt(lstat) + 
##     I(ptratio^5) + chas + 1/log(indus) + (I(age^3)) + zn + tax + 
##     rm + black)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -15.8306  -2.4388  -0.3933   2.0180  22.4578 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   2.101e+01  3.994e+00   5.260 2.15e-07 ***
## log(crim)     6.791e-01  2.231e-01   3.044 0.002459 ** 
## log(nox)     -1.204e+01  2.357e+00  -5.109 4.64e-07 ***
## log(dis)     -6.707e+00  8.218e-01  -8.161 2.78e-15 ***
## sqrt(lstat)  -5.121e+00  3.665e-01 -13.974  < 2e-16 ***
## I(ptratio^5) -1.284e-06  2.258e-07  -5.685 2.24e-08 ***
## chas1         2.600e+00  8.137e-01   3.195 0.001486 ** 
## I(age^3)      4.041e-08  1.049e-06   0.039 0.969293    
## zn            2.949e-02  1.199e-02   2.459 0.014280 *  
## tax          -7.651e-03  2.378e-03  -3.218 0.001377 ** 
## rm            3.579e+00  3.951e-01   9.059  < 2e-16 ***
## black         9.449e-03  2.575e-03   3.670 0.000269 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.513 on 494 degrees of freedom
## Multiple R-squared:  0.7645, Adjusted R-squared:  0.7593 
## F-statistic: 145.8 on 11 and 494 DF,  p-value: < 2.2e-16

Since p-value of age^3 is much larger than 0.05, we can drop it.

lm.fit7 = lm(medv ~ log(crim) + log(nox) + log(dis) + sqrt(lstat) + I(ptratio^5) + chas + 1/log(indus) + zn + tax + rm + black)
summary(lm.fit7)
## 
## Call:
## lm(formula = medv ~ log(crim) + log(nox) + log(dis) + sqrt(lstat) + 
##     I(ptratio^5) + chas + 1/log(indus) + zn + tax + rm + black)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -15.8403  -2.4291  -0.3915   2.0199  22.4658 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   2.102e+01  3.986e+00   5.273 2.01e-07 ***
## log(crim)     6.802e-01  2.212e-01   3.075 0.002223 ** 
## log(nox)     -1.202e+01  2.306e+00  -5.214 2.72e-07 ***
## log(dis)     -6.718e+00  7.654e-01  -8.778  < 2e-16 ***
## sqrt(lstat)  -5.116e+00  3.426e-01 -14.934  < 2e-16 ***
## I(ptratio^5) -1.283e-06  2.244e-07  -5.716 1.88e-08 ***
## chas1         2.601e+00  8.125e-01   3.202 0.001454 ** 
## zn            2.952e-02  1.196e-02   2.468 0.013931 *  
## tax          -7.663e-03  2.355e-03  -3.254 0.001217 ** 
## rm            3.582e+00  3.872e-01   9.251  < 2e-16 ***
## black         9.461e-03  2.556e-03   3.702 0.000238 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.508 on 495 degrees of freedom
## Multiple R-squared:  0.7645, Adjusted R-squared:  0.7597 
## F-statistic: 160.7 on 10 and 495 DF,  p-value: < 2.2e-16

Since the coefficient of ptratio^5 is extremely small (partly because ptratio^5 itself is numerically huge), let's drop it and see what happens.

lm.fit8 = lm(medv ~ log(crim) + log(nox) + log(dis) + sqrt(lstat) + chas + 1/log(indus) + zn + tax + rm + black)
summary(lm.fit8)
## 
## Call:
## lm(formula = medv ~ log(crim) + log(nox) + log(dis) + sqrt(lstat) + 
##     chas + 1/log(indus) + zn + tax + rm + black)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.5019  -2.7145  -0.4338   2.1731  22.1456 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  20.574601   4.110193   5.006 7.74e-07 ***
## log(crim)     0.638173   0.228044   2.798 0.005334 ** 
## log(nox)    -10.075400   2.352300  -4.283 2.21e-05 ***
## log(dis)     -6.942027   0.788440  -8.805  < 2e-16 ***
## sqrt(lstat)  -5.333125   0.351179 -15.186  < 2e-16 ***
## chas1         3.089209   0.833356   3.707 0.000233 ***
## zn            0.048815   0.011837   4.124 4.36e-05 ***
## tax          -0.011911   0.002305  -5.167 3.45e-07 ***
## rm            3.774133   0.397886   9.485  < 2e-16 ***
## black         0.008858   0.002634   3.363 0.000831 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.65 on 496 degrees of freedom
## Multiple R-squared:  0.749,  Adjusted R-squared:  0.7444 
## F-statistic: 164.4 on 9 and 496 DF,  p-value: < 2.2e-16

lm.fit7 has been our best bet so far in terms of R-squared and adjusted R-squared while keeping every variable significant by its p-value. This model explains 76.45% of the variability in the data.

Let’s look at the residual plots:

par(mfrow=c(2,2))
plot(lm.fit7)

There is still non-linearity in the residual plot.

Let’s try adding all the polynomial terms of age and ptratio and see what happens:

lm.fit9 = lm(medv ~ log(crim) + log(nox) + log(dis) + sqrt(lstat) + ptratio + I(ptratio^2) + I(ptratio^3) + I(ptratio^4) + I(ptratio^5) + chas + 1/log(indus) + age + (I(age^2)) + (I(age^3)) + zn + tax + rm + black)
summary(lm.fit9)
## 
## Call:
## lm(formula = medv ~ log(crim) + log(nox) + log(dis) + sqrt(lstat) + 
##     ptratio + I(ptratio^2) + I(ptratio^3) + I(ptratio^4) + I(ptratio^5) + 
##     chas + 1/log(indus) + age + (I(age^2)) + (I(age^3)) + zn + 
##     tax + rm + black)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -15.460  -2.456  -0.374   1.847  22.529 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -1.024e+04  5.246e+03  -1.952 0.051489 .  
## log(crim)     5.088e-01  2.283e-01   2.229 0.026302 *  
## log(nox)     -1.551e+01  2.536e+00  -6.117 1.96e-09 ***
## log(dis)     -6.564e+00  8.280e-01  -7.928 1.53e-14 ***
## sqrt(lstat)  -5.055e+00  3.688e-01 -13.705  < 2e-16 ***
## ptratio       3.023e+03  1.561e+03   1.937 0.053271 .  
## I(ptratio^2) -3.522e+02  1.843e+02  -1.911 0.056528 .  
## I(ptratio^3)  2.029e+01  1.080e+01   1.879 0.060782 .  
## I(ptratio^4) -5.786e-01  3.140e-01  -1.843 0.065991 .  
## I(ptratio^5)  6.538e-03  3.628e-03   1.802 0.072168 .  
## chas1         2.924e+00  8.144e-01   3.591 0.000363 ***
## age           7.576e-02  1.107e-01   0.685 0.493907    
## I(age^2)     -1.219e-03  2.208e-03  -0.552 0.581129    
## I(age^3)      5.988e-06  1.303e-05   0.460 0.645934    
## zn            1.204e-02  1.394e-02   0.864 0.388247    
## tax          -5.022e-03  2.632e-03  -1.908 0.056999 .  
## rm            3.454e+00  4.060e-01   8.508  < 2e-16 ***
## black         9.147e-03  2.566e-03   3.565 0.000400 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.459 on 488 degrees of freedom
## Multiple R-squared:  0.7728, Adjusted R-squared:  0.7649 
## F-statistic: 97.66 on 17 and 488 DF,  p-value: < 2.2e-16

As we can see, most of the polynomial terms are insignificant due to their large p-values and hence do not contribute to the model.

par(mfrow=c(2,2))
plot(lm.fit9)

There is no change in residual plot.

Hence, we can drop the polynomial terms. Let's analyse the residuals of lm.fit7.

res = lm.fit7$residuals
hist(res)

boxplot(res)

The residuals appear roughly normally distributed. However, the boxplot shows some outliers; let's look at them.

outliers = boxplot.stats(res)$out

outliers
##          8        167        187        196        215        226        229 
##  11.189451   9.187839  12.926856  10.418538  13.608669   8.859031   9.863847 
##        234        263        268        365        366        368        369 
##  10.008200   9.157586  11.147849 -15.840305  11.639497  10.563031  21.594391 
##        370        371        372        373        375        376        381 
##  13.523434  10.185482  22.465833  21.240269   9.175768 -12.769298 -14.201047 
##        402        406        413        506 
## -10.980890 -10.970145  14.207404 -11.231709

As we can see, there are quite a few outliers. Let's get their indices.

idx = which(res %in% outliers) #Indices of the outlying residuals
idx
##  [1]   8 167 187 196 215 226 229 234 263 268 365 366 368 369 370 371 372 373 375
## [20] 376 381 402 406 413 506

Let’s remove these records from the data and then build the model

newData = Boston[-idx,]

names(newData)
##  [1] "crim"    "zn"      "indus"   "chas"    "nox"     "rm"      "age"    
##  [8] "dis"     "rad"     "tax"     "ptratio" "black"   "lstat"   "medv"
detach(Boston) #Detach Boston because we are going to use newData now

attach(newData) #Attach newData for building the model 
lm.fit10 = lm(medv ~ log(crim) + log(nox) + log(dis) + sqrt(lstat) + I(ptratio^5) + chas + 1/log(indus) + zn + tax + rm + black)
summary(lm.fit10)
## 
## Call:
## lm(formula = medv ~ log(crim) + log(nox) + log(dis) + sqrt(lstat) + 
##     I(ptratio^5) + chas + 1/log(indus) + zn + tax + rm + black)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -8.599 -2.013 -0.358  1.729 10.284 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   5.594e+00  3.366e+00   1.662  0.09718 .  
## log(crim)     5.108e-01  1.677e-01   3.046  0.00245 ** 
## log(nox)     -7.247e+00  1.760e+00  -4.118 4.52e-05 ***
## log(dis)     -4.187e+00  5.967e-01  -7.017 7.94e-12 ***
## sqrt(lstat)  -3.686e+00  2.897e-01 -12.725  < 2e-16 ***
## I(ptratio^5) -1.145e-06  1.685e-07  -6.792 3.35e-11 ***
## chas1         1.886e+00  6.354e-01   2.968  0.00315 ** 
## zn            2.539e-02  8.948e-03   2.837  0.00475 ** 
## tax          -8.778e-03  1.759e-03  -4.991 8.48e-07 ***
## rm            5.036e+00  3.439e-01  14.645  < 2e-16 ***
## black         1.224e-02  1.945e-03   6.293 7.13e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.312 on 470 degrees of freedom
## Multiple R-squared:  0.8392, Adjusted R-squared:  0.8358 
## F-statistic: 245.4 on 10 and 470 DF,  p-value: < 2.2e-16

As we can see, the model has improved significantly after removing those outliers. We don't worry about the p-value of the intercept: it estimates the mean of the response when all other variables are zero, and setting every predictor to zero doesn't make sense in this case.

Our model now explains about 83% of the variability in the data. We could keep trying different permutations and combinations in search of the best model, but that also requires working closely with subject matter experts on many other factors, such as understanding the variables better, obtaining more variables, and getting more data.

par(mfrow=c(2,2))
plot(lm.fit10)

There is some improvement in the residual plot as well.

Interaction Effect

Let's try one last option before we close this lab session: an interaction term. The idea behind an interaction term is that two independent variables sometimes interact with each other, and that interaction influences the response variable beyond their individual effects.

For example, the marketing budget for digital media could be one variable and the budget for print media another. Digital marketing reinforces print marketing and vice versa, and together they drive the overall sales. So splitting the budget between digital and print marketing may increase sales more than allocating the whole budget to either one alone.

An interaction term is created by multiplying the two variables: (digital marketing budget) × (print marketing budget).

We will create interaction terms of crim and tax; zn and dis; lstat and rm

In R, putting ":" (a colon) between two variables creates only the interaction term, while "*" (a star) also includes the individual variables along with the interaction; i.e. crim*tax is shorthand for crim + tax + crim:tax.

Whether to use the transformed or the original variable in an interaction term depends on what you want the interaction to explain: if you want it to capture changes in the slope of the transformed variable, use the transformed variable; if you only wish to explain changes in the slope of the original variable, then either form will do.
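The ":" vs "*" distinction, and the fact that the interaction column is just the elementwise product, can be checked with a small sketch:

```r
library(MASS)

# ":" adds only the product term; "*" also adds the two main effects
names(coef(lm(medv ~ crim:tax, data = Boston)))   # "(Intercept)" "crim:tax"
names(coef(lm(medv ~ crim * tax, data = Boston))) # adds "crim" and "tax" too

# The interaction column in the design matrix is the elementwise product
mm <- model.matrix(~ crim * tax, data = Boston)
all.equal(unname(mm[, "crim:tax"]), Boston$crim * Boston$tax)  # TRUE
```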

lm.fit11 = lm(medv ~ log(nox) + log(dis) + sqrt(lstat) + I(ptratio^5) + chas + 1/log(indus) + zn + tax + rm + black + zn:log(dis) + sqrt(lstat):rm)
summary(lm.fit11)
## 
## Call:
## lm(formula = medv ~ log(nox) + log(dis) + sqrt(lstat) + I(ptratio^5) + 
##     chas + 1/log(indus) + zn + tax + rm + black + zn:log(dis) + 
##     sqrt(lstat):rm)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.5185 -1.7733 -0.2025  1.8610 11.6327 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    -3.935e+01  4.816e+00  -8.170 2.88e-15 ***
## log(nox)       -3.717e+00  1.502e+00  -2.475  0.01367 *  
## log(dis)       -2.362e+00  5.792e-01  -4.078 5.35e-05 ***
## sqrt(lstat)     9.822e+00  1.179e+00   8.330 8.94e-16 ***
## I(ptratio^5)   -8.463e-07  1.518e-07  -5.574 4.21e-08 ***
## chas1           1.647e+00  5.640e-01   2.921  0.00366 ** 
## zn              8.141e-02  3.936e-02   2.068  0.03917 *  
## tax            -5.169e-03  1.309e-03  -3.949 9.05e-05 ***
## rm              1.204e+01  6.798e-01  17.707  < 2e-16 ***
## black           8.205e-03  1.709e-03   4.802 2.11e-06 ***
## log(dis):zn    -4.165e-02  2.013e-02  -2.070  0.03904 *  
## sqrt(lstat):rm -2.191e+00  1.885e-01 -11.628  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.939 on 469 degrees of freedom
## Multiple R-squared:  0.8737, Adjusted R-squared:  0.8707 
## F-statistic: 294.9 on 11 and 469 DF,  p-value: < 2.2e-16

log(crim) and log(crim):tax had high p-values, so they were dropped. Note also that R does not interpret 1/log(indus) in a formula as arithmetic division (in formula syntax, “/” is a nesting operator), which is why no indus term appears in the coefficient table above; to include the reciprocal as a predictor, it would need to be wrapped in I(), i.e. I(1/log(indus)).

Finally, the model now explains 87.4% of the variability in the data (Multiple R-squared = 0.8737). It can still be improved further.

par(mfrow=c(2,2))
plot(lm.fit11)

The residual plot looks much better now.
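Because R-squared is computed on the same data the model was fit to, it can be optimistic. A quick way to sanity-check the final model is to estimate its out-of-sample error with a random train/test split. The sketch below refits the final formula (omitting the 1/log(indus) term, which contributes no coefficient in the summary above) on 70% of the rows and measures RMSE on the held-out rows; the seed and split ratio are arbitrary illustrative choices.

```r
# Sketch: estimate out-of-sample error with a simple 70/30 split.
# The seed and split proportion are arbitrary choices for illustration.
library(MASS)

set.seed(1)
train = sample(nrow(Boston), 0.7 * nrow(Boston))

fit = lm(medv ~ log(nox) + log(dis) + sqrt(lstat) + I(ptratio^5) +
           chas + zn + tax + rm + black + zn:log(dis) + sqrt(lstat):rm,
         data = Boston[train, ])

pred = predict(fit, newdata = Boston[-train, ])
rmse = sqrt(mean((Boston$medv[-train] - pred)^2))   # test-set RMSE
rmse
```

A test RMSE close to the residual standard error reported by summary() suggests the model is not badly overfit; a much larger value would be a warning sign. Cross-validation would give a more stable estimate than a single split.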

The main objective of this lab was to show the different ways one can go about improving the performance of a model. Model building is not just typing code that calls predefined functions in R or Python; it requires a solid understanding of the data, the industry, the domain and various other factors.

Exercise: Try to improve the model further using different options, such as normalizing the features.

