Simple Linear Regression with R

 

Built-in data sets in R

We can use the data sets available in base R:

library(help = "datasets")              #Displays the list of data sets in R

We can also use the MASS package in R for more data sets:

library("MASS")
data(package="MASS")                #Displays the list of data sets in MASS

Simple Linear Regression (try the following in R):

We will use the data set “women” from R

About data:

  • This data set was taken from the American Society of Actuaries Build and Blood Pressure Study (years unknown)
  • The data set has only two variables: the heights and weights of women
  • Heights are in inches and weights are in pounds (lbs)
  • Weights were measured in ordinary indoor clothing and shoes; heights were measured with shoes
  • Source: The World Almanac and Book of Facts, 1975

## Let’s do a bit of exploratory data analysis

dim(women)
## [1] 15  2
head(women) #First 6 rows of the data set
##   height weight
## 1     58    115
## 2     59    117
## 3     60    120
## 4     61    123
## 5     62    126
## 6     63    129
summary(women) #summary of data set
##      height         weight     
##  Min.   :58.0   Min.   :115.0  
##  1st Qu.:61.5   1st Qu.:124.5  
##  Median :65.0   Median :135.0  
##  Mean   :65.0   Mean   :136.7  
##  3rd Qu.:68.5   3rd Qu.:148.0  
##  Max.   :72.0   Max.   :164.0

So, there are 15 records and two variables. The summary also gives other details such as the minimum and maximum values, the mean, the median, and the 1st and 3rd quartiles.

Let’s plot the data with a smooth line through the scatter plot

plot(women,pch=22,bg ="lightblue", cex=2)
lines(stats::lowess(women))
As you can see, it’s pretty much linear

The data is plain and simple:

  • Weight of women on y axis
  • Height of women on x axis
  • They are linear in relation

Let’s see if the variables are correlated

cor(women) #This gives us the matrix of correlation of all possible pairs in the data set
##           height    weight
## height 1.0000000 0.9954948
## weight 0.9954948 1.0000000

As we can see, both variables have a strong positive correlation, which was obvious from the plot as well
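Beyond the correlation matrix, `cor.test()` adds a confidence interval and a p-value for the correlation. A small sketch using the same `women` data:

```r
# Formal test of the correlation between height and weight
cor.test(women$height, women$weight)
# The tiny p-value and an interval far from 0 confirm the strong positive correlation
```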

How about their means and medians?

mean(women$weight)
## [1] 136.7333
mean(women$height)
## [1] 65
median(women$weight)
## [1] 135
median(women$height)
## [1] 65
  • The mean (average) height of women is 65 inches
  • The mean (average) weight of women is 136.73 pounds (lbs)
  • The median height of women is 65 inches
  • The median weight of women is 135 pounds (lbs)

The mean usually gets affected by the presence of outliers, but the median doesn’t get influenced by them
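A quick sketch of this point: appending one artificial extreme value (400 lbs, a made-up outlier) to the weights shifts the mean far more than the median.

```r
x <- women$weight
c(mean = mean(x), median = median(x))          # 136.73 and 135
x_out <- c(x, 400)                             # add one artificial outlier
c(mean = mean(x_out), median = median(x_out))  # mean jumps to 153.19, median only moves to 137
```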

Are there any outliers in the data affecting the average values?

boxplot(women$height)

boxplot(women$weight)

boxplot.stats(women$height)$out 
## numeric(0)
boxplot.stats(women$weight)$out
## numeric(0)

As we can see, there are no outliers.

Since there are no outliers, the mean gives us a good overall idea of the average height and weight of the women. We can also observe that, for both height and weight, the mean and median are almost the same.

Let’s look at their distribution:

hist(women$height, freq = FALSE, breaks = 10, col="blue")

hist(women$weight,freq = FALSE, breaks = 10, col="green")

We have a very small data set (just 15 records); still, we can say that weight is slightly right-skewed, while height (equally spaced from 58 to 72 inches) is essentially symmetric, which is why its mean and median are both 65. As a rule of thumb:

  • Right skewness: mean > median
  • Left skewness: mean < median
  • No skewness: mean = median
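We can also quantify skewness directly. Base R has no built-in skewness function, so the helper below is a hand-rolled sketch of the moment coefficient of skewness:

```r
# Moment coefficient of skewness: mean of cubed deviations over sd^3
skew <- function(x) mean((x - mean(x))^3) / sd(x)^3
skew(women$height)  # 0: heights are equally spaced, hence perfectly symmetric
skew(women$weight)  # positive: weights are slightly right-skewed
```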

Are there any missing values in the data?

women[!complete.cases(women),]
## [1] height weight
## <0 rows> (or 0-length row.names)

There are no missing values in the data

## Let’s build a simple linear model using Weight as a response variable and Height as a predictor

m1 = lm(weight ~ height, data = women)  #Building model using lm() function in R
summary(m1)                             #Summary of the model
## 
## Call:
## lm(formula = weight ~ height, data = women)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.7333 -1.1333 -0.3833  0.7417  3.1167 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -87.51667    5.93694  -14.74 1.71e-09 ***
## height        3.45000    0.09114   37.85 1.09e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.525 on 13 degrees of freedom
## Multiple R-squared:  0.991,  Adjusted R-squared:  0.9903 
## F-statistic:  1433 on 1 and 13 DF,  p-value: 1.091e-14

As you can see in the summary, the R-squared value is very high: 0.991. It shows that the model is able to explain about 99% of the variability in the data. Therefore, it’s a good fit.
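As a sanity check, R-squared can also be computed by hand as 1 − SSE/SST (m1 is re-fitted inside the block so it runs on its own):

```r
m1 <- lm(weight ~ height, data = women)
sse <- sum(resid(m1)^2)                           # residual (error) sum of squares
sst <- sum((women$weight - mean(women$weight))^2) # total sum of squares
1 - sse / sst                                     # ~0.991, matching the summary output
```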

plot(weight ~ height, frame=FALSE, data=women, pch=21,bg ="lightblue", cex=2 )
abline(m1,lwd=2, col="red")

Let’s look at the normality assumption of residuals of the model

qqnorm(resid(m1))
qqline(resid(m1))

The above plot is a visual check that the residuals follow a normal distribution. It is called a quantile-quantile (Q-Q) plot, and it shows whether the data came from a given theoretical distribution (in this case, the normal distribution).

The y-axis shows the sample quantiles and the x-axis the theoretical quantiles. If the points fall close to a straight line, the data follow a normal distribution.

As we can see here, the residuals lie close to the straight line, though not perfectly on it.
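The Shapiro-Wilk test is a formal complement to this visual check (a sketch; m1 is re-fitted so the block is self-contained):

```r
m1 <- lm(weight ~ height, data = women)
shapiro.test(resid(m1))
# A p-value above 0.05 means we cannot reject the hypothesis that the residuals are normal
```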

Let’s look at another plot to verify whether the residuals show randomness

plot(fitted(m1), resid(m1))

The residual values don’t show randomness against the fitted values; we can see a clear pattern in them. That means our model is not able to explain some systematic variability in the data, so the model’s assumptions (linearity and homoscedasticity) are in doubt.

Since we are dealing with a very small data set, there isn’t much we can do about it; we would need more data to improve the model enough to get truly random residuals. We can still try adding transformed terms of height (for example, powers below 1) and see if the pattern in the residuals can be offset.
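One common alternative worth sketching is a quadratic term for height, wrapped in `I()` so the `^` is treated arithmetically inside the formula (the model name m_quad is ours):

```r
# Quadratic fit: another way to capture the curvature in the data
m_quad <- lm(weight ~ height + I(height^2), data = women)
summary(m_quad)$r.squared                  # also very close to 1
plot(fitted(m_quad), resid(m_quad))        # compare this residual plot with m1's
```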

Since the variables showed some degree of right skewness, we can lower the power of one or both and see if the model improves. (For left skewness, we would raise the power of the variables.)

m2 = lm(weight ~ height+sqrt(height), data = women)
summary(m2)
## 
## Call:
## lm(formula = weight ~ height + sqrt(height), data = women)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.5311 -0.3437  0.0225  0.2929  0.6802 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1299.734    109.407   11.88 5.41e-08 ***
## height         24.858      1.688   14.72 4.81e-09 ***
## sqrt(height) -344.859     27.195  -12.68 2.61e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4183 on 12 degrees of freedom
## Multiple R-squared:  0.9994, Adjusted R-squared:  0.9993 
## F-statistic:  9605 on 2 and 12 DF,  p-value: < 2.2e-16

There is some improvement in the R-squared and adjusted R-squared values

qqnorm(resid(m2))
qqline(resid(m2))

In this plot the residual values look more normally distributed than those of the first model

plot(fitted(m2), resid(m2))

We do see the residual values showing quite a lot more randomness compared to the previous model (m1).

Let’s try some more options:

m3 = lm(sqrt(weight) ~ height+sqrt(height), data = women)
summary(m3)
## 
## Call:
## lm(formula = sqrt(weight) ~ height + sqrt(height), data = women)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -0.0207719 -0.0114944 -0.0006494  0.0117824  0.0233955 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   45.49281    4.06721   11.19 1.05e-07 ***
## height         0.81683    0.06277   13.01 1.95e-08 ***
## sqrt(height) -10.78593    1.01096  -10.67 1.77e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.01555 on 12 degrees of freedom
## Multiple R-squared:  0.9995, Adjusted R-squared:  0.9994 
## F-statistic: 1.261e+04 on 2 and 12 DF,  p-value: < 2.2e-16
qqnorm(resid(m3))
qqline(resid(m3))

plot(fitted(m3), resid(m3))

There isn’t much improvement; in fact, the residual plots look the same. Therefore, there is no point complicating the model with the square root of weight.

Let’s add cube root of height and see what happens

h_cube_root = (women$height)^1/3 #Note: ^1/3 is parsed as (height^1)/3, i.e. height/3, not the cube root
h__srt = sqrt(women$height)
m4 = lm(weight ~ height + h__srt + h_cube_root, data = women)
summary(m4)
## 
## Call:
## lm(formula = weight ~ height + h__srt + h_cube_root, data = women)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.5311 -0.3437  0.0225  0.2929  0.6802 
## 
## Coefficients: (1 not defined because of singularities)
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 1299.734    109.407   11.88 5.41e-08 ***
## height        24.858      1.688   14.72 4.81e-09 ***
## h__srt      -344.859     27.195  -12.68 2.61e-08 ***
## h_cube_root       NA         NA      NA       NA    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4183 on 12 degrees of freedom
## Multiple R-squared:  0.9994, Adjusted R-squared:  0.9993 
## F-statistic:  9605 on 2 and 12 DF,  p-value: < 2.2e-16
qqnorm(resid(m4))
qqline(resid(m4))

plot(fitted(m4), resid(m4))

As we can see, the model has dropped h_cube_root (it returns NA). The reason is that ^1/3 actually computed height/3, which is perfectly collinear with height, so lm() could not estimate a separate coefficient for it.
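For the record, the intended cube root is written `^(1/3)`, and inside a formula it must be wrapped in `I()` so the arithmetic is not interpreted as formula syntax (a sketch, with our own model name m4b):

```r
# I() protects arithmetic inside a formula; height^(1/3) is the actual cube root
m4b <- lm(weight ~ height + I(height^(1/3)), data = women)
coef(m4b)  # a coefficient is now estimated for the cube-root term (no NA)
```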

Let’s look at the effect of log transformation of height on the model

m5 = lm(weight ~ height + log(height), data = women)
summary(m5)
## 
## Call:
## lm(formula = weight ~ height + log(height), data = women)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.53764 -0.35937  0.03375  0.29480  0.70804 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2105.5923   178.1317   11.82 5.72e-08 ***
## height        14.1234     0.8673   16.29 1.51e-09 ***
## log(height) -691.9374    56.1990  -12.31 3.63e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4299 on 12 degrees of freedom
## Multiple R-squared:  0.9993, Adjusted R-squared:  0.9992 
## F-statistic:  9092 on 2 and 12 DF,  p-value: < 2.2e-16
qqnorm(resid(m5))
qqline(resid(m5))

plot(fitted(m5), resid(m5))

In terms of the residual plots, the model hasn’t improved. In fact, the R-squared and adjusted R-squared values have come down a little compared to m2. So, we can ignore the log transformation.

How about using 1/height in the model?

m6 = lm(weight ~ height + 1/(height), data = women)
summary(m6)
## 
## Call:
## lm(formula = weight ~ height + 1/(height), data = women)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.7333 -1.1333 -0.3833  0.7417  3.1167 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -87.51667    5.93694  -14.74 1.71e-09 ***
## height        3.45000    0.09114   37.85 1.09e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.525 on 13 degrees of freedom
## Multiple R-squared:  0.991,  Adjusted R-squared:  0.9903 
## F-statistic:  1433 on 1 and 13 DF,  p-value: 1.091e-14
qqnorm(resid(m6))
qqline(resid(m6))

plot(fitted(m6), resid(m6))

This model is identical to m1: in R’s formula syntax, / denotes nesting rather than division, so 1/(height) does not actually add the reciprocal of height as a term. Hence, we can discard this option
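For completeness, to actually include the reciprocal of height as a predictor it must be wrapped in `I()` (a sketch, with our own model name m6b):

```r
# In formula syntax '/' means nesting, so 1/height needs I() to be treated as arithmetic
m6b <- lm(weight ~ height + I(1/height), data = women)
summary(m6b)  # now a coefficient is estimated for the reciprocal term
```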

Hence, considering the limited data we have, we can stick to m2, although with more data we could build a better model.

Our Interpretations of the model (m2):

  1. Weights of women increase with the increase in their heights
  2. Weights of women decrease with the increase in the square root of their heights
  3. Holding the square-root term fixed, the average change in weight is 24.858 pounds per unit change in height (increase or decrease)
  4. Holding height fixed, the average decrease in weight is 344.859 pounds per unit increase in the square root of height
  5. There is no need to interpret the intercept here; it makes no sense, for the following reasons:
    • The intercept value is 1299.734
    • It would be the average weight when height is 0 (not possible)
    • A person cannot have any weight if the height is 0 (such a person doesn’t exist)
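The chosen model m2 can be used for prediction with `predict()` (m2 is re-fitted here so the block runs on its own; the height of 66 inches is just an illustrative value):

```r
m2 <- lm(weight ~ height + sqrt(height), data = women)
predict(m2, newdata = data.frame(height = 66))  # close to the observed weight of 139 lbs
```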

Exercise: Try different transformations of weight (the response variable), such as log or cube root, and see if you can improve the model any further

