Simple Linear Regression with R

 

Built-in data sets in R

We can use the data sets available in base R:

library(help = "datasets")              #Displays the list of data sets in R

We can also use the MASS package in R for more data sets:

library("MASS")
data(package="MASS")                #Displays the list of data sets in MASS

Simple Linear Regression (try the following in R):

We will use the data set “women” from R

About data:

  • This data set was taken from the American Society of Actuaries Build and Blood Pressure Study (years unknown)
  • The data set has only two variables: the heights and weights of women
  • Heights are in inches and weights are in pounds (lbs)
  • Weights were measured in ordinary indoor clothing and shoes; heights were measured with shoes
  • Source: The World Almanac and Book of Facts, 1975

## Let’s do a bit of exploratory data analysis

dim(women)
## [1] 15  2
head(women) #First 6 rows of the data set
##   height weight
## 1     58    115
## 2     59    117
## 3     60    120
## 4     61    123
## 5     62    126
## 6     63    129
summary(women) #summary of data set
##      height         weight     
##  Min.   :58.0   Min.   :115.0  
##  1st Qu.:61.5   1st Qu.:124.5  
##  Median :65.0   Median :135.0  
##  Mean   :65.0   Mean   :136.7  
##  3rd Qu.:68.5   3rd Qu.:148.0  
##  Max.   :72.0   Max.   :164.0

So, there are 15 records and two variables. The summary also gives other details such as the minimum and maximum values, the mean, the median, and the 1st and 3rd quartiles.

Let’s plot the data with a smooth line through the scatter plot

plot(women,pch=22,bg ="lightblue", cex=2)
lines(stats::lowess(women))
As you can see, it’s pretty much linear

The data is plain and simple:

  • Weight of women on y axis
  • Height of women on x axis
  • They are linear in relation

Let’s see if the variables are correlated

cor(women) #This gives us the matrix of correlation of all possible pairs in the data set
##           height    weight
## height 1.0000000 0.9954948
## weight 0.9954948 1.0000000

As we can see, both variables have a strong positive correlation, which was obvious from the plot as well
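Beyond the correlation matrix, `cor.test()` adds a confidence interval and a p-value for the correlation. A small sketch using the same `women` data:

```r
# Formal test of the correlation between height and weight
cor.test(women$height, women$weight)
# The tiny p-value and an interval far from 0 confirm the strong positive correlation
```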

How about their means and medians?

mean(women$weight)
## [1] 136.7333
mean(women$height)
## [1] 65
median(women$weight)
## [1] 135
median(women$height)
## [1] 65
  • The mean (average) height of women is 65 inches
  • The mean (average) weight of women is 136.73 pounds (lbs)
  • The median height of women is 65 inches
  • The median weight of women is 135 pounds (lbs)

The mean usually gets affected by the presence of outliers, but the median doesn’t get influenced by them
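A quick sketch of this point: appending one artificial extreme value (400 lbs, a made-up outlier) to the weights shifts the mean far more than the median.

```r
x <- women$weight
c(mean = mean(x), median = median(x))          # 136.73 and 135
x_out <- c(x, 400)                             # add one artificial outlier
c(mean = mean(x_out), median = median(x_out))  # mean jumps to 153.19, median only moves to 137
```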

Are there any outliers in the data affecting the average values?

boxplot(women$height)

boxplot(women$weight)

boxplot.stats(women$height)$out 
## numeric(0)
boxplot.stats(women$weight)$out
## numeric(0)

As we can see, there are no outliers.

Since there are no outliers, the mean gives us a good overall idea of the average height and weight of the women. We can also observe that, for both height and weight, the mean and median are almost the same.

Let’s look at their distribution:

hist(women$height, freq = FALSE, breaks = 10, col="blue")

hist(women$weight,freq = FALSE, breaks = 10, col="green")

We have a very small data set (just 15 records); still, we can say that weight is slightly right-skewed, while height (equally spaced from 58 to 72 inches) is essentially symmetric, which is why its mean and median are both 65. As a rule of thumb:

  • Right skewness: mean > median
  • Left skewness: mean < median
  • No skewness: mean = median
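We can also quantify skewness directly. Base R has no built-in skewness function, so the helper below is a hand-rolled sketch of the moment coefficient of skewness:

```r
# Moment coefficient of skewness: mean of cubed deviations over sd^3
skew <- function(x) mean((x - mean(x))^3) / sd(x)^3
skew(women$height)  # 0: heights are equally spaced, hence perfectly symmetric
skew(women$weight)  # positive: weights are slightly right-skewed
```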

Are there any missing values in the data?

women[!complete.cases(women),]
## [1] height weight
## <0 rows> (or 0-length row.names)

There are no missing values in the data

## Let’s build a simple linear model using Weight as a response variable and Height as a predictor

m1 = lm(weight ~ height, data = women)  #Building model using lm() function in R
summary(m1)                             #Summary of the model
## 
## Call:
## lm(formula = weight ~ height, data = women)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.7333 -1.1333 -0.3833  0.7417  3.1167 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -87.51667    5.93694  -14.74 1.71e-09 ***
## height        3.45000    0.09114   37.85 1.09e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.525 on 13 degrees of freedom
## Multiple R-squared:  0.991,  Adjusted R-squared:  0.9903 
## F-statistic:  1433 on 1 and 13 DF,  p-value: 1.091e-14

As you can see in the summary, the R-squared value is very high: 0.991. It shows that the model is able to explain about 99% of the variability in the data. Therefore, it’s a good fit.
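As a sanity check, R-squared can also be computed by hand as 1 − SSE/SST (m1 is re-fitted inside the block so it runs on its own):

```r
m1 <- lm(weight ~ height, data = women)
sse <- sum(resid(m1)^2)                           # residual (error) sum of squares
sst <- sum((women$weight - mean(women$weight))^2) # total sum of squares
1 - sse / sst                                     # ~0.991, matching the summary output
```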

plot(weight ~ height, frame=FALSE, data=women, pch=21,bg ="lightblue", cex=2 )
abline(m1,lwd=2, col="red")

Let’s look at the normality assumption of residuals of the model

qqnorm(resid(m1))
qqline(resid(m1))

The above plot is a visual check that the residuals follow a normal distribution. It is called a quantile-quantile (Q-Q) plot, and it shows whether the data came from a given theoretical distribution (in this case, the normal distribution).

The y-axis shows the sample quantiles and the x-axis the theoretical quantiles. If the points fall close to a straight line, the data follow a normal distribution.

As we can see here, the residuals lie close to the straight line, though not perfectly on it.
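The Shapiro-Wilk test is a formal complement to this visual check (a sketch; m1 is re-fitted so the block is self-contained):

```r
m1 <- lm(weight ~ height, data = women)
shapiro.test(resid(m1))
# A p-value above 0.05 means we cannot reject the hypothesis that the residuals are normal
```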

Let’s look at another plot to verify whether the residuals show randomness

plot(fitted(m1), resid(m1))

The residual values don’t show randomness against the fitted values; we can see a clear pattern in them. That means our model is not able to explain some systematic variability in the data, so the model’s assumptions (linearity and homoscedasticity) are in doubt.

Since we are dealing with a very small data set, there isn’t much we can do about it; we would need more data to improve the model enough to get truly random residuals. We can still try adding transformed terms of height (for example, powers below 1) and see if the pattern in the residuals can be offset.
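One common alternative worth sketching is a quadratic term for height, wrapped in `I()` so the `^` is treated arithmetically inside the formula (the model name m_quad is ours):

```r
# Quadratic fit: another way to capture the curvature in the data
m_quad <- lm(weight ~ height + I(height^2), data = women)
summary(m_quad)$r.squared                  # also very close to 1
plot(fitted(m_quad), resid(m_quad))        # compare this residual plot with m1's
```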

Since the variables showed some degree of right skewness, we can lower the power of one or both and see if the model improves. (For left skewness, we would raise the power of the variables.)

m2 = lm(weight ~ height+sqrt(height), data = women)
summary(m2)
## 
## Call:
## lm(formula = weight ~ height + sqrt(height), data = women)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.5311 -0.3437  0.0225  0.2929  0.6802 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1299.734    109.407   11.88 5.41e-08 ***
## height         24.858      1.688   14.72 4.81e-09 ***
## sqrt(height) -344.859     27.195  -12.68 2.61e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4183 on 12 degrees of freedom
## Multiple R-squared:  0.9994, Adjusted R-squared:  0.9993 
## F-statistic:  9605 on 2 and 12 DF,  p-value: < 2.2e-16

There is some improvement in the R-squared and adjusted R-squared values

qqnorm(resid(m2))
qqline(resid(m2))

In this plot the residual values look more normally distributed than those of the first model

plot(fitted(m2), resid(m2))

We do see the residual values showing quite a lot more randomness compared to the previous model (m1).

Let’s try some more options:

m3 = lm(sqrt(weight) ~ height+sqrt(height), data = women)
summary(m3)
## 
## Call:
## lm(formula = sqrt(weight) ~ height + sqrt(height), data = women)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -0.0207719 -0.0114944 -0.0006494  0.0117824  0.0233955 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   45.49281    4.06721   11.19 1.05e-07 ***
## height         0.81683    0.06277   13.01 1.95e-08 ***
## sqrt(height) -10.78593    1.01096  -10.67 1.77e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.01555 on 12 degrees of freedom
## Multiple R-squared:  0.9995, Adjusted R-squared:  0.9994 
## F-statistic: 1.261e+04 on 2 and 12 DF,  p-value: < 2.2e-16
qqnorm(resid(m3))
qqline(resid(m3))

plot(fitted(m3), resid(m3))

There isn’t much improvement; in fact, the residual plots look the same. Therefore, there is no point complicating the model with the square root of weight.

Let’s add cube root of height and see what happens

h_cube_root = (women$height)^1/3 #Note: ^1/3 is parsed as (height^1)/3, i.e. height/3, not the cube root
h__srt = sqrt(women$height)
m4 = lm(weight ~ height + h__srt + h_cube_root, data = women)
summary(m4)
## 
## Call:
## lm(formula = weight ~ height + h__srt + h_cube_root, data = women)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.5311 -0.3437  0.0225  0.2929  0.6802 
## 
## Coefficients: (1 not defined because of singularities)
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 1299.734    109.407   11.88 5.41e-08 ***
## height        24.858      1.688   14.72 4.81e-09 ***
## h__srt      -344.859     27.195  -12.68 2.61e-08 ***
## h_cube_root       NA         NA      NA       NA    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4183 on 12 degrees of freedom
## Multiple R-squared:  0.9994, Adjusted R-squared:  0.9993 
## F-statistic:  9605 on 2 and 12 DF,  p-value: < 2.2e-16
qqnorm(resid(m4))
qqline(resid(m4))

plot(fitted(m4), resid(m4))

As we can see, the model has dropped h_cube_root (it returns NA). The reason is that ^1/3 actually computed height/3, which is perfectly collinear with height, so lm() could not estimate a separate coefficient for it.
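For the record, the intended cube root is written `^(1/3)`, and inside a formula it must be wrapped in `I()` so the arithmetic is not interpreted as formula syntax (a sketch, with our own model name m4b):

```r
# I() protects arithmetic inside a formula; height^(1/3) is the actual cube root
m4b <- lm(weight ~ height + I(height^(1/3)), data = women)
coef(m4b)  # a coefficient is now estimated for the cube-root term (no NA)
```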

Let’s look at the effect of log transformation of height on the model

m5 = lm(weight ~ height + log(height), data = women)
summary(m5)
## 
## Call:
## lm(formula = weight ~ height + log(height), data = women)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.53764 -0.35937  0.03375  0.29480  0.70804 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2105.5923   178.1317   11.82 5.72e-08 ***
## height        14.1234     0.8673   16.29 1.51e-09 ***
## log(height) -691.9374    56.1990  -12.31 3.63e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4299 on 12 degrees of freedom
## Multiple R-squared:  0.9993, Adjusted R-squared:  0.9992 
## F-statistic:  9092 on 2 and 12 DF,  p-value: < 2.2e-16
qqnorm(resid(m5))
qqline(resid(m5))

plot(fitted(m5), resid(m5))

In terms of the residual plots, the model hasn’t improved. In fact, the R-squared and adjusted R-squared values have come down a little compared to m2. So, we can ignore the log transformation.

How about using 1/height in the model?

m6 = lm(weight ~ height + 1/(height), data = women)
summary(m6)
## 
## Call:
## lm(formula = weight ~ height + 1/(height), data = women)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.7333 -1.1333 -0.3833  0.7417  3.1167 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -87.51667    5.93694  -14.74 1.71e-09 ***
## height        3.45000    0.09114   37.85 1.09e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.525 on 13 degrees of freedom
## Multiple R-squared:  0.991,  Adjusted R-squared:  0.9903 
## F-statistic:  1433 on 1 and 13 DF,  p-value: 1.091e-14
qqnorm(resid(m6))
qqline(resid(m6))

plot(fitted(m6), resid(m6))

This model is identical to m1: in R’s formula syntax, / denotes nesting rather than division, so 1/(height) does not actually add the reciprocal of height as a term. Hence, we can discard this option
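For completeness, to actually include the reciprocal of height as a predictor it must be wrapped in `I()` (a sketch, with our own model name m6b):

```r
# In formula syntax '/' means nesting, so 1/height needs I() to be treated as arithmetic
m6b <- lm(weight ~ height + I(1/height), data = women)
summary(m6b)  # now a coefficient is estimated for the reciprocal term
```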

Hence, considering the limited data we have, we can stick to m2, although with more data we could build a better model.

Our Interpretations of the model (m2):

  1. Weights of women increase with the increase in their heights
  2. Weights of women decrease with the increase in the square root of their heights
  3. Holding the square-root term fixed, the average change in weight is 24.858 pounds per unit change in height (increase or decrease)
  4. Holding height fixed, the average decrease in weight is 344.859 pounds per unit increase in the square root of height
  5. There is no need to interpret the intercept here; it makes no sense, for the following reasons:
    • The intercept value is 1299.734
    • It would be the average weight when height is 0 (not possible)
    • A person cannot have any weight if the height is 0 (such a person doesn’t exist)
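The chosen model m2 can be used for prediction with `predict()` (m2 is re-fitted here so the block runs on its own; the height of 66 inches is just an illustrative value):

```r
m2 <- lm(weight ~ height + sqrt(height), data = women)
predict(m2, newdata = data.frame(height = 66))  # close to the observed weight of 139 lbs
```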

Exercise: Try different transformations of weight (the response variable), such as log or cube root, and see if you can improve the model any further

