Linear Regression in R

Kent S Johnson
http://kentsjohnson.com

May 2012

What is linear regression?

Simple linear regression

Example

What is the relationship between height and weight?

library(ggplot2)
data = read.csv("CDC_data.csv")
p = qplot(height, weight, data = data, size = I(1))
p + stat_smooth(method = "lm")

Source: CDC National Health and Nutrition Examination Survey 2009-2010

The linear regression model

Simple linear regression uses the model \[y_i = \beta_0 + \beta_1 x_i + \epsilon_i\]

Finding the best line

Example (cont)

model = lm(weight ~ height, data)
summary(model)
# 
# Call:
# lm(formula = weight ~ height, data = data)
# 
# Residuals:
#    Min     1Q Median     3Q    Max 
#  -92.7  -28.9   -6.3   21.4  325.7 
# 
# Coefficients:
#             Estimate Std. Error t value Pr(>|t|)    
# (Intercept) -163.233      8.882   -18.4   <2e-16 ***
# height         5.218      0.135    38.8   <2e-16 ***
# ---
# Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
# 
# Residual standard error: 42.3 on 5992 degrees of freedom
# Multiple R-squared:  0.2, Adjusted R-squared:  0.2 
# F-statistic: 1.5e+03 on 1 and 5992 DF,  p-value: <2e-16 
# 

What?

Model statistics

The rest of the output is useful for evaluating the model. It includes

Confidence intervals

Confidence intervals on the model coefficients are another way to evaluate the model. The stats::confint() function computes confidence intervals for model coefficients.

confint(model)
#                2.5 %   97.5 %
# (Intercept) -180.645 -145.820
# height         4.954    5.482

Graphical evaluation of the model

Not all data is well represented by a linear model.

Diagnostic plots

op <- par(mfrow = c(1, 2))
plot(model, which = c(1, 2), pch = 20, cex = 0.5)

Hmmm…

The diagnostic plots indicate that the errors are not normally distributed. They seem to be skewed toward the right, i.e. the tail is longer on the right. This is visible in the original plot as well.

In the real world, you might try transforming the data to get a better fit.

Predictions

The stats::predict() function is used to predict the \(y\) values for new values of \(x\).

Values for the independent variable are passed in a data frame column with the same name as the original data.

new_x = data.frame(height = c(60, 65, 70))
predict(model, new_x)
#     1     2     3 
# 149.9 176.0 202.1 

Predictions may include confidence intervals:

predict(model, new_x, interval = "prediction")
#     fit   lwr   upr
# 1 149.9  66.9 232.8
# 2 176.0  93.0 258.9
# 3 202.1 119.1 285.0

Multiple regression

Multiple regression models the dependent variable as a linear combination of multiple independent variables.

Example (cont)

model <- lm(weight ~ height + gender, data)
summary(model)
# 
# Call:
# lm(formula = weight ~ height + gender, data = data)
# 
# Residuals:
#    Min     1Q Median     3Q    Max 
#  -95.9  -28.9   -6.0   21.5  322.7 
# 
# Coefficients:
#             Estimate Std. Error t value Pr(>|t|)    
# (Intercept) -191.437     11.528  -16.61  < 2e-16 ***
# height         5.688      0.182   31.25  < 2e-16 ***
# genderM       -5.665      1.479   -3.83  0.00013 ***
# ---
# Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
# 
# Residual standard error: 42.3 on 5991 degrees of freedom
# Multiple R-squared: 0.202,    Adjusted R-squared: 0.202 
# F-statistic:  760 on 2 and 5991 DF,  p-value: <2e-16 
# 

What?

Diagnostics again

op <- par(mfrow = c(1, 2))
plot(model, which = c(1, 2), pch = 20, cex = 0.5)

Interaction terms (briefly)

Does the slope of the relationship between height and weight depend on gender? This is an interaction. The model for this is

\[weight_i = \beta_0 + \beta_1 height_i + \beta_2 gender_i + \beta_3 height_i weight_i + \epsilon_i\]

The R formula uses * instead of +: weight ~ height * gender

Example (cont)

model <- lm(weight ~ height * gender, data)
summary(model)
# 
# Call:
# lm(formula = weight ~ height * gender, data = data)
# 
# Residuals:
#    Min     1Q Median     3Q    Max 
#  -95.4  -29.0   -5.9   21.6  324.5 
# 
# Coefficients:
#                Estimate Std. Error t value Pr(>|t|)    
# (Intercept)    -155.139     16.619   -9.34   <2e-16 ***
# height            5.114      0.263   19.47   <2e-16 ***
# genderM         -78.289     24.010   -3.26   0.0011 ** 
# height:genderM    1.103      0.364    3.03   0.0025 ** 
# ---
# Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
# 
# Residual standard error: 42.2 on 5990 degrees of freedom
# Multiple R-squared: 0.204,    Adjusted R-squared: 0.203 
# F-statistic:  510 on 3 and 5990 DF,  p-value: <2e-16 
# 

```
# Coefficients:
#                Estimate Std. Error t value Pr(>|t|)    
# (Intercept)    -155.139     16.619   -9.34   <2e-16 ***
# height            5.114      0.263   19.47   <2e-16 ***
# genderM         -78.289     24.010   -3.26   0.0011 ** 
# height:genderM    1.103      0.364    3.03   0.0025 ** 
```

ggplot rocks!

p = qplot(height, weight, data = data, color = gender, size = I(1))
p + stat_smooth(method = lm) + scale_color_manual(values = c("red", 
    "blue"))