Lab_LR_1
Asmi Ariv
2022-09-23
Built-in data sets in R
We can use data sets, available in base R:
library(help = "datasets") #Displays the list of data sets in RWe can also use package MASS in R for data sets:
library("MASS")
data(package="MASS") #Displays the list of data sets in MASSSimple Linear Regression (try the following in R):
We will use the data “women” from R
About data:
- This data set was taken from American Society of Actuaries Build and Blood Pressure Study (years unknown)
- Data set has only two variables: Height and Weight of women
- Heights are in inches and weights are in pound lbs
- weights in ordinary indoor clothing and shoes, and heights with shoes
- Source: The World Almanac and Book of Facts, 1975
##Let’s do a bit of exploratory data analysis
dim(women)## [1] 15 2head(women) #First 6 rows of the data set## height weight
## 1 58 115
## 2 59 117
## 3 60 120
## 4 61 123
## 5 62 126
## 6 63 129summary(women) #summary of data set## height weight
## Min. :58.0 Min. :115.0
## 1st Qu.:61.5 1st Qu.:124.5
## Median :65.0 Median :135.0
## Mean :65.0 Mean :136.7
## 3rd Qu.:68.5 3rd Qu.:148.0
## Max. :72.0 Max. :164.0So, there are 15 records and two variables You can also get other details from the summary such as minimum and maximum values, mean, median, 1st and 3rd quantiles
Let’s plot the data with a smooth line through the scatter plot
plot(women,pch=22,bg ="lightblue", cex=2)
lines(stats::lowess(women))As you can see, it’s pretty much linear
Data is plane and simple:
- Weight of women on y axis
- Height of women on x axis
- They are linear in relation
Let’s see if the variables are correlated
cor(women) #This gives us the matrix of correlation of all possible pairs in the data set## height weight
## height 1.0000000 0.9954948
## weight 0.9954948 1.0000000As we can see that both the variables have strong positive correlation, which was obvious from the plot as well
How about their mean and median
mean(women$weight)## [1] 136.7333mean(women$height)## [1] 65median(women$weight)## [1] 135median(women$height)## [1] 65- The mean(average) height of women is 65 inches
- The mean(average) weight of women is 136.73 pound lbs
- The median height of women is 65 inches
- The median weight of women is 135 pound lbs
Mean usually gets affected by the presence of outliers, but median doesn’t get influenced by an outlier
Are there any outliers in the data affecting the average values?
boxplot(women$height)boxplot(women$weight)boxplot.stats(women$height)$out ## numeric(0)boxplot.stats(women$weight)$out## numeric(0)As we can see that there are no outliers
Since, there are no outliers, mean gives us an overall idea about the average height and weight of the women We can also observe that for each, height and weight, mean and median are almost similar.
Let’s look at their distribution:
hist(women$height, freq = FALSE, breaks = 10, col="blue")hist(women$weight,freq = FALSE, breaks = 10, col="green")We have very small data set (just 15 records), still, we can say that both variables are slightly right-skewed. As skewness is very small, mean and median of each variable are quite similar. For right skewness: mean > median For left skewness: mean < median No skewness: mean = median
Are there any missing values in the data?
women[!complete.cases(women),]## [1] height weight
## <0 rows> (or 0-length row.names)There are no missing values in the data
##Let’s build a simple linear model using Weight as a response variable and Height as a predictor
m1 = lm(weight ~ height, data = women) #Building model using lm() function in R
summary(m1) #Summary of the model##
## Call:
## lm(formula = weight ~ height, data = women)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.7333 -1.1333 -0.3833 0.7417 3.1167
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -87.51667 5.93694 -14.74 1.71e-09 ***
## height 3.45000 0.09114 37.85 1.09e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.525 on 13 degrees of freedom
## Multiple R-squared: 0.991, Adjusted R-squared: 0.9903
## F-statistic: 1433 on 1 and 13 DF, p-value: 1.091e-14As you can see in the summary, that R-square value is very high, 0.99. It shows that the model is able to explain 99% of the variability in the data. Therefore, it’s a good fit.
plot(weight ~ height, frame=FALSE, data=women, pch=21,bg ="lightblue", cex=2 )
abline(m1,lwd=2, col="red")Let’s look at the normality assumption of residuals of the model
qqnorm(resid(m1))
qqline(resid(m1))The above plot is a visual confirmation that the residual follows normal distribution. It is called quantile to quantile plot, which shows whether the data came from a theoretical distribution (in this case normal distribution).
Y-axis is sample quantiles and x-axis is theoretical quantiles. If the plotting follows close to a straight line, then the data follows normal distribution.
As we can see here, that the residual is close to the straight line, and yet not very straight.
Let’s look at another plot to verify, if the residual shows randomness
plot(fitted(m1), resid(m1))Residual values doesn’t show randomness against fitted values. We can see some pattern in the residual values.That means our model is not able to explain certain variability in the data. Therefore, homoscedasticity condition is not met.
Since, we are dealing with a very small data set, there isn’t much we can do about it. We need more data to improve the model performance in terms of getting randomness in the residual values. We can still try adding some polynomial terms (or with lower power) of height and see if the pattern in the residual can be offset.
Since, both the variables showed some degree of right skewness, we should lower the power of both or either one to see if the model improves. (For letf-skewness, we raise the power of variables)
m2 = lm(weight ~ height+sqrt(height), data = women)
summary(m2)##
## Call:
## lm(formula = weight ~ height + sqrt(height), data = women)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.5311 -0.3437 0.0225 0.2929 0.6802
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1299.734 109.407 11.88 5.41e-08 ***
## height 24.858 1.688 14.72 4.81e-09 ***
## sqrt(height) -344.859 27.195 -12.68 2.61e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4183 on 12 degrees of freedom
## Multiple R-squared: 0.9994, Adjusted R-squared: 0.9993
## F-statistic: 9605 on 2 and 12 DF, p-value: < 2.2e-16There is some improvement in R-square and Adjusted R-Square values
qqnorm(resid(m2))
qqline(resid(m2))In this plot residual values look more normally distributed than that of first model
plot(fitted(m2), resid(m2))We do see residual values showing quite a lot of randomness compared to that of the previous model (m1).
Let’s try some more options:
m3 = lm(sqrt(weight) ~ height+sqrt(height), data = women)
summary(m3)##
## Call:
## lm(formula = sqrt(weight) ~ height + sqrt(height), data = women)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.0207719 -0.0114944 -0.0006494 0.0117824 0.0233955
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 45.49281 4.06721 11.19 1.05e-07 ***
## height 0.81683 0.06277 13.01 1.95e-08 ***
## sqrt(height) -10.78593 1.01096 -10.67 1.77e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.01555 on 12 degrees of freedom
## Multiple R-squared: 0.9995, Adjusted R-squared: 0.9994
## F-statistic: 1.261e+04 on 2 and 12 DF, p-value: < 2.2e-16qqnorm(resid(m3))
qqline(resid(m3))plot(fitted(m3), resid(m3))There isn’t much improvement, in fact residual plots look the same. Therefore, there isn’t any point complicating the model using square root of weight.
Let’s add cube root of height and see what happens
h_cube_root = (women$height)^1/3 #lm() returns errors if you use height^1/3
h__srt = sqrt(women$height)
m4 = lm(weight ~ height + h__srt + h_cube_root, data = women)
summary(m4)##
## Call:
## lm(formula = weight ~ height + h__srt + h_cube_root, data = women)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.5311 -0.3437 0.0225 0.2929 0.6802
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1299.734 109.407 11.88 5.41e-08 ***
## height 24.858 1.688 14.72 4.81e-09 ***
## h__srt -344.859 27.195 -12.68 2.61e-08 ***
## h_cube_root NA NA NA NA
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4183 on 12 degrees of freedom
## Multiple R-squared: 0.9994, Adjusted R-squared: 0.9993
## F-statistic: 9605 on 2 and 12 DF, p-value: < 2.2e-16qqnorm(resid(m4))
qqline(resid(m4))plot(fitted(m4), resid(m4))As we can see, that model has not used cube root of height (returns NA). Hence, this term is not useful.
Let’s look at the effect of log transformation of height on the model
m5 = lm(weight ~ height + log(height), data = women)
summary(m5)##
## Call:
## lm(formula = weight ~ height + log(height), data = women)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.53764 -0.35937 0.03375 0.29480 0.70804
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2105.5923 178.1317 11.82 5.72e-08 ***
## height 14.1234 0.8673 16.29 1.51e-09 ***
## log(height) -691.9374 56.1990 -12.31 3.63e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4299 on 12 degrees of freedom
## Multiple R-squared: 0.9993, Adjusted R-squared: 0.9992
## F-statistic: 9092 on 2 and 12 DF, p-value: < 2.2e-16qqnorm(resid(m5))
qqline(resid(m5))plot(fitted(m5), resid(m5))In terms of residual plotting, the model hasn’t improved. In fact, R-square and Adjusted R-square values have come down a little compared to m2. So, we can ignore log transformation.
How about using 1/height in the model
m6 = lm(weight ~ height + 1/(height), data = women)
summary(m6)##
## Call:
## lm(formula = weight ~ height + 1/(height), data = women)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.7333 -1.1333 -0.3833 0.7417 3.1167
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -87.51667 5.93694 -14.74 1.71e-09 ***
## height 3.45000 0.09114 37.85 1.09e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.525 on 13 degrees of freedom
## Multiple R-squared: 0.991, Adjusted R-squared: 0.9903
## F-statistic: 1433 on 1 and 13 DF, p-value: 1.091e-14qqnorm(resid(m6))
qqline(resid(m6))plot(fitted(m6), resid(m6))This model is as good as m1. Hence, we can discard this option
hence, considering the limited data we have, we can stick to m2. Although, with more data, we can build a better model than this.
Our Interpretations of the model (m2):
- Weights of women increase with the increase in their heights
- Weights of women decrease with the increase in the square root of their heights
- Average change in their weights is 24.858 times a unit change in their heights(increase or decrease)
- Average increase in their weights is 344.859 times a unit decrease in the square root of their heights
- There is no need to interpret intercept here, it makes no sense for the following reasons
- Intercept value is 1299.734
- Average weight is 1299.734, when height is 0 (not possible)
- A person cannot have any weight if the height is 0 (such a person doesn’t exists)
Exercise: Try different transformation of weight (response variable), such as log, cube root and see if you can improve the model any further
Comments
Post a Comment