Extensions to Linear Regression
Part 2 of Week 7
Aida Parnia
[email protected]
U of T Sociology
October 15, 2024
SOC252H1F
Extensions to Linear regression
Linear regression as explaining variation
→ R2 and adjusted R2
Hypothesis testing in regression
→ t distribution and p value
Multiple linear regression
→ Categorical variables as multiple indicator variables
→ Interaction term
Model selection criteria
Today’s example: UN data
Our investigation is around the determinants of life expectancy
str(UN)
tibble [193 × 8] (S3: tbl_df/tbl/data.frame)
$ country : chr [1:193] "Afghanistan" "Albania" "Algeria" "Angola" ...
$ region : chr [1:193] "Asia" "Europe" "Africa" "Africa" ...
$ group : chr [1:193] "other" "other" "africa" "africa" ...
$ fertility : num [1:193] 5.97 1.52 2.14 5.13 2.17 ...
$ ppgdp : num [1:193] 499 3677 4473 4322 9162 ...
$ lifeExpF : num [1:193] 49.5 80.4 75 53.2 79.9 ...
$ pctUrban : num [1:193] 23 53 67 59 93 64 47 89 68 52 ...
$ infantMortality: num [1:193] 124.5 16.6 21.5 96.2 12.3 ...
- attr(*, "na.action")= 'omit' Named int [1:20] 4 6 21 35 38 54 67 75 77 78 ...
..- attr(*, "names")= chr [1:20] "4" "6" "21" "35" ...
Explained Variation and R-Squared
Definition and Interpretation of R-Squared
→ Formula: $R^2 = 1 - \frac{SS_{res}}{SS_{tot}}$
→ Proportion of variance explained by the model
Adjusted R-Squared for Multiple Regression Models
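→ As we increase the number of variables in the model, the proportion explained never decreases, even if the added variables are irrelevant
→ R-squared can be adjusted for the number of predictors
→ Formula: $\text{Adjusted } R^2 = 1 - \frac{(1 - R^2)(n - 1)}{n - k - 1}$, where $n$ is the number of observations and $k$ is the number of predictors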
→ We can use this in model comparison
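As a minimal sketch of the two formulas above, the following R code computes R-squared and adjusted R-squared by hand and checks them against `summary()`. It uses R's built-in `mtcars` data as a stand-in (an assumption, not the UN data), and the names `fit_demo`, `r2_manual`, and `adj_r2_manual` are illustrative.

```r
# Stand-in model on built-in data (not the UN data from the slides)
fit_demo <- lm(mpg ~ wt + hp, data = mtcars)

y      <- mtcars$mpg
ss_res <- sum(residuals(fit_demo)^2)   # residual sum of squares
ss_tot <- sum((y - mean(y))^2)         # total sum of squares

n <- nrow(mtcars)   # number of observations
k <- 2              # number of predictors (wt, hp)

r2_manual     <- 1 - ss_res / ss_tot
adj_r2_manual <- 1 - (1 - r2_manual) * (n - 1) / (n - k - 1)

# Both comparisons should return TRUE
all.equal(r2_manual, summary(fit_demo)$r.squared)
all.equal(adj_r2_manual, summary(fit_demo)$adj.r.squared)
```

Adjusted R-squared is always at most R-squared, and the gap grows as predictors are added without a matching gain in fit.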
Explained variation of life expectancy by GDP or log-GDP
fit_gdp <- lm(lifeExpF ~ ppgdp,
              data = UN)
fit_loggdp <- lm(lifeExpF ~ log(ppgdp),
                 data = UN)
modelsummary(list("GDP" = fit_gdp,
                  "log(GDP)" = fit_loggdp),
             statistic = "conf.int",
             fmt = 3,
             gof_map = c("nobs", "r.squared"))

              GDP                log(GDP)
(Intercept)   68.072             29.257
              [66.598, 69.546]   [24.154, 34.359]
ppgdp         0.000
              [0.000, 0.000]
log(ppgdp)                       5.090
                                 [4.494, 5.687]
Num.Obs.      193                193
R2            0.311              0.597
In the model with GDP as the predictor, only 31.1% of the variation in female life expectancy is explained by
the model, while with log GDP as the predictor, 59.7% of the variation is explained by the model.
Hypothesis Testing of the Regression Coefficients
The Null Hypothesis
$H_0: \beta_k = 0$
We know the point estimate, but we want to account for the uncertainty in the data.
If the assumptions of the model are met, then we can get the probability distribution of the estimated coefficient.
It turns out that the standardized estimated coefficients follow a t distribution with n minus the number of parameters degrees of freedom.
The t distribution is a modified normal distribution
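One way to see this, sketched in base R: the t distribution has heavier tails than the standard normal, so its critical values are larger, and the gap shrinks as the degrees of freedom grow. The degrees of freedom below are illustrative choices, except 191, which matches the residual degrees of freedom reported for the log(GDP) model in these slides.

```r
# 97.5th percentile (two-sided 5% critical value) for several degrees of freedom
df_values <- c(5, 30, 191)
t_crit <- qt(0.975, df = df_values)
z_crit <- qnorm(0.975)

# t critical values exceed the normal one and approach it as df increases
data.frame(df = df_values, t_crit = t_crit, normal_crit = z_crit)
```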
Hypothesis Testing of the Regression Coefficients
t-Tests for Individual Coefficients
→ Formula: $t = \frac{\hat{\beta} - 0}{SE(\hat{\beta})}$
→ The calculated t value is then put on this t distribution and we
calculate the corresponding probability for it (p-value).
sum_loggdp <- summary(fit_loggdp)
sum_loggdp
Hypothesis Testing of the Regression Coefficients
Call:
lm(formula = lifeExpF ~ log(ppgdp), data = UN)
Residuals:
Min 1Q Median 3Q Max
-25.885 -2.908 1.396 3.982 12.402
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 29.2566 2.5870 11.31 <2e-16 ***
log(ppgdp) 5.0901 0.3025 16.83 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 6.479 on 191 degrees of freedom
Multiple R-squared: 0.5972, Adjusted R-squared: 0.5951
F-statistic: 283.2 on 1 and 191 DF, p-value: < 2.2e-16
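The t value and p-value for log(ppgdp) in this output can be reproduced by hand from the estimate and its standard error. The sketch below copies those two numbers and the residual degrees of freedom from the output above; the variable names are illustrative. It also rebuilds the 95% confidence interval, which should come out close to the [4.494, 5.687] reported earlier (small differences come from rounding the inputs).

```r
# Values copied from the summary output for log(ppgdp)
estimate  <- 5.0901
std_error <- 0.3025
df_resid  <- 191

t_value <- (estimate - 0) / std_error              # test statistic under H0: beta = 0
p_value <- 2 * pt(-abs(t_value), df = df_resid)    # two-sided p-value

# 95% confidence interval: estimate +/- critical t value * standard error
ci_95 <- estimate + c(-1, 1) * qt(0.975, df = df_resid) * std_error

t_value
p_value < 2e-16
ci_95
```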
Multiple Linear Regression
For our example the model is the following
Life Expectancy = β0 + β1 ⋅ log(GDP) + β2 ⋅ fertility + ε
library(dplyr)  # provides %>% and mutate()
UN <- UN %>% mutate(log_gdp = log(ppgdp))
# Multiple linear regression model
fit_ml <- lm(lifeExpF ~ log_gdp + fertility,
             data = UN)
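Once fitted, the model equation can be used to compute a predicted value by hand. The sketch below plugs in the rounded coefficients reported for this model in the comparison table later in these slides. The country profile (GDP per capita of 5000, fertility of 2) is a hypothetical input chosen only for illustration.

```r
# Rounded coefficients as reported for the log_gdp + fertility model
b0 <- 63.06   # intercept
b1 <- 2.45    # log_gdp
b2 <- -4.18   # fertility

# Hypothetical country (illustrative values, not a row of the UN data)
ppgdp     <- 5000
fertility <- 2

# Predicted female life expectancy from the model equation
pred_le <- b0 + b1 * log(ppgdp) + b2 * fertility
pred_le
```

With the fitted object available, `predict(fit_ml, newdata = data.frame(log_gdp = log(5000), fertility = 2))` does the same calculation using the unrounded coefficients.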
Multiple Linear Regression as a regression plane
[Figure: Regression Plane for Life Expectancy; colour scale pred_le (predicted life expectancy) from 50 to 80]
More on non-linear relationships
In our model we assume that the fertility rate has a linear relationship with life expectancy.
This means an increase in fertility from 1 to 2 is associated with the same change in
life expectancy as an increase from 5 to 6.
But this may not be true, so we can categorize our fertility measure to
see different relationships. Here we use quartiles.
fertility_b <- c(min(UN$fertility),
                 quantile(UN$fertility, 0.25),
                 quantile(UN$fertility, 0.5),
                 quantile(UN$fertility, 0.75),
                 max(UN$fertility))
UN <- UN %>% mutate(
  fertility_q = cut(fertility,
                    breaks = fertility_b,
                    include.lowest = TRUE,
                    right = FALSE))
UN %>% count(fertility_q)

# A tibble: 4 × 2
  fertility_q     n
  <fct>       <int>
1 [1.13,1.75)    48
2 [1.75,2.26)    48
3 [2.26,3.7)     48
4 [3.7,6.92]     49
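The role of `right = FALSE` and `include.lowest = TRUE` is easier to see on a tiny vector. This is a sketch on made-up values (`x` is not the UN fertility data); it also shows that `quantile()` accepts a vector of probabilities, so the five break points can be produced in one call.

```r
# Small illustrative vector (not the UN fertility values)
x <- 1:8

# Minimum, quartiles, and maximum in one call
breaks <- quantile(x, probs = seq(0, 1, by = 0.25))

# right = FALSE gives [a, b) intervals; include.lowest = TRUE closes the
# last interval so the maximum value is not turned into NA
x_q <- cut(x, breaks = breaks, include.lowest = TRUE, right = FALSE)

table(x_q)
```

Without `include.lowest = TRUE`, the largest value would fall outside the last half-open interval and be coded as missing.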
More on non-linear relationships: categorical
variables
Dummy variable coding is when a categorical predictor is separated
into multiple dummy (indicator) variables, with one category defined
as the reference category.
fit_ml2 <- lm(lifeExpF ~ log_gdp + fertility_q,
              data = UN)

summary(fit_ml2)
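To see the indicator variables that `lm()` builds behind the scenes, `model.matrix()` can be applied to a small factor. The factor `grp` below is hypothetical data used only for illustration.

```r
# Illustrative three-level factor (hypothetical data)
grp <- factor(c("low", "mid", "high", "low"),
              levels = c("low", "mid", "high"))

# model.matrix() shows the indicator columns lm() would build;
# the first level ("low") is the reference category and gets no column
X <- model.matrix(~ grp)
X
```

Rows belonging to the reference category have zeros in every indicator column, which is why the other coefficients are read as differences from that category.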
More on non-linear relationships: categorical
variables
Call:
lm(formula = lifeExpF ~ log_gdp + fertility_q, data = UN)
Residuals:
Min 1Q Median 3Q Max
-21.6867 -1.6328 0.2597 2.9554 12.4939
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 52.2092 3.6933 14.136 < 2e-16 ***
log_gdp 2.9235 0.3801 7.691 7.93e-13 ***
fertility_q[1.75,2.26) -1.2362 1.1532 -1.072 0.285
fertility_q[2.26,3.7) -5.6775 1.2696 -4.472 1.34e-05 ***
fertility_q[3.7,6.92] -11.8377 1.5434 -7.670 9.01e-13 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 5.646 on 188 degrees of freedom
More on non-linear relationships: categorical
variables
                          (1)
(Intercept)               52.21
                          [44.92, 59.49]
log_gdp                   2.92
                          [2.17, 3.67]
fertility_q[1.75,2.26)    -1.24
                          [-3.51, 1.04]
fertility_q[2.26,3.7)     -5.68
                          [-8.18, -3.17]
fertility_q[3.7,6.92]     -11.84
                          [-14.88, -8.79]

Interpretation of coefficients
β0 or the intercept
The average life expectancy of females in countries with 0 log_GDP and a fertility rate of less than 1.75 is 52.21 years.
β1 or the coefficient for log_GDP
When fertility is held constant, the average life expectancy of females in countries with one percent higher GDP is 0.03 years higher than in others.
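The 0.03 years figure for a one percent difference in GDP follows from the log scale of the predictor. A short sketch of that arithmetic, using the log_gdp coefficient from the summary output (the doubling case is an extra illustration, not from the slides):

```r
b_log_gdp <- 2.9235   # coefficient on log_gdp from the summary output

# A 1% increase multiplies GDP by 1.01, so log(GDP) rises by log(1.01)
change_1pct <- b_log_gdp * log(1.01)

# Doubling GDP raises log(GDP) by log(2)
change_double <- b_log_gdp * log(2)

round(c(one_percent = change_1pct, doubling = change_double), 2)
```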
More on non-linear relationships: categorical
variables
Interpretation of coefficients
β2 or the coefficient for fertility [1.75,2.26)
When GDP is held constant, the average life expectancy of females in countries with fertility rates of 1.75 to just below 2.26 is 1.24 years lower than in countries with fertility rates of less than 1.75 children per woman.
β3 or the coefficient for fertility [2.26,3.7)
When GDP is held constant, the average life expectancy of females in countries with fertility rates of 2.26 to just below 3.7 is 5.68 years lower than in countries with fertility rates of less than 1.75 children per woman.
β4 or the coefficient for fertility [3.7,6.92]
When GDP is held constant, the average life expectancy of females in countries with fertility rates of 3.7 to 6.92 is 11.84 years lower than in countries with fertility rates of less than 1.75 children per woman.
More on non-linear relationships: categorical
variables
library(ggplot2)  # provides ggplot(), geom_line(), geom_point()
# Add the model predictions as a new column
UN %>% mutate(ml_pred = predict(fit_ml2)) %>%
  # Visualize the model: lines are predictions, points are observed values
  ggplot(aes(y = ml_pred, x = log_gdp, colour = fertility_q)) +
  geom_line() +
  geom_point(aes(y = lifeExpF, x = log_gdp, colour = fertility_q)) +
  theme_light(base_size = 20)
Interaction in Regressions
What if we allow the slope of one variable to change by the values of
another?
In our example, what if the relationship of life expectancy with GDP
depended on the values of fertility rate?
This is the concept of an Interaction Effect.
In the previous example, the predicted values change additively with the
values of fertility rate, but with an interaction this change is
multiplicative.
Life Expectancy = β0 + β1 ⋅ log(GDP) + β2 ⋅ fertility + β3 ⋅ log(GDP) × fertility + ε
Interaction in Regressions
# Note the change from + to * in the formula
fit_ml3 <- lm(lifeExpF ~ log_gdp * fertility_q,
              data = UN)

summary(fit_ml3)
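What the switch from `+` to `*` does can be checked without any data: R expands `a * b` into both main effects plus their interaction `a:b`. A small sketch with placeholder variable names `y`, `a`, and `b` (illustrative, not columns of the UN data):

```r
# Term labels R derives from each formula (no data needed)
additive         <- attr(terms(y ~ a + b), "term.labels")
with_interaction <- attr(terms(y ~ a * b), "term.labels")

additive           # main effects only
with_interaction   # main effects plus the a:b interaction term
```

So `log_gdp * fertility_q` is shorthand for `log_gdp + fertility_q + log_gdp:fertility_q`.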
Interaction in Regressions
Call:
lm(formula = lifeExpF ~ log_gdp * fertility_q, data = UN)
Residuals:
Min 1Q Median 3Q Max
-21.6893 -1.4094 0.2941 3.2680 12.3541
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 54.40179 7.99615 6.803 1.37e-10 ***
log_gdp 2.69213 0.83931 3.208 0.00158 **
fertility_q[1.75,2.26) -0.01215 9.94127 -0.001 0.99903
fertility_q[2.26,3.7) -8.58005 10.38544 -0.826 0.40978
fertility_q[3.7,6.92] -19.63143 9.90168 -1.983 0.04889 *
log_gdp:fertility_q[1.75,2.26) -0.13312 1.04587 -0.127 0.89886
log_gdp:fertility_q[2.26,3.7) 0.31927 1.16941 0.273 0.78515
log_gdp:fertility_q[3.7,6.92] 1.06005 1.19859 0.884 0.37762
---
Interaction in Regressions
# Added predictions
UN %>% mutate(ml_pred = predict(fit_ml3)) %>%
  # visualizing the model
  ggplot(aes(y = ml_pred, x = log_gdp, colour = fertility_q)) +
  geom_line() +
  geom_point(aes(y = lifeExpF, x = log_gdp, colour = fertility_q)) +
  theme_light(base_size = 20)
Interaction in Regressions
# Added predictions
UN %>% mutate(ml_pred = predict(fit_ml3)) %>%
  # visualizing the model, one panel per fertility quartile
  ggplot(aes(y = ml_pred, x = log_gdp, colour = fertility_q)) +
  geom_line() +
  geom_point(aes(y = lifeExpF, x = log_gdp, colour = fertility_q)) +
  theme_light(base_size = 20) + facet_wrap(fertility_q ~ .)
Interpreting interaction effects
An interaction is often hard to interpret clearly in words, so visualization
or choosing specific predicted values is often more intuitive.
Alternatively, we can define partial effects by calculating the coefficient for
each category of the categorical variable.
The coefficient for log_GDP when fertility is below 1.75 (reference
group) is β_{log_gdp}
The coefficient for log_GDP when fertility is [1.75,2.26) is
β_{log_gdp} + β_{log_gdp:fertility_q[1.75,2.26)}
The coefficient for log_GDP when fertility is [2.26,3.7) is
β_{log_gdp} + β_{log_gdp:fertility_q[2.26,3.7)}
The coefficient for log_GDP when fertility is [3.7,6.92] is
β_{log_gdp} + β_{log_gdp:fertility_q[3.7,6.92]}
Interpreting interaction effects
log_gdp_fer1 <- fit_ml3$coefficients["log_gdp"]

log_gdp_fer2 <- fit_ml3$coefficients["log_gdp"] + fit_ml3$coefficients["log_gdp:fertility_q[1.7

log_gdp_fer3 <- fit_ml3$coefficients["log_gdp"] + fit_ml3$coefficients["log_gdp:fertility_q[2.2

log_gdp_fer4 <- fit_ml3$coefficients["log_gdp"] + fit_ml3$coefficients["log_gdp:fertility_q[3.7

c(log_gdp_fer1,log_gdp_fer2,log_gdp_fer3,log_gdp_fer4)
log_gdp log_gdp log_gdp log_gdp
2.692128 2.559012 3.011393 3.752180
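The same group-specific slopes can be reproduced from the printed coefficient table alone. This sketch copies the log_gdp main effect and the three interaction estimates from the summary output of the interaction model; the object names `b_log_gdp`, `b_interactions`, and `slopes` are illustrative.

```r
# Coefficients copied from the interaction model's summary output
b_log_gdp <- 2.69213
b_interactions <- c("[1.75,2.26)" = -0.13312,
                    "[2.26,3.7)"  =  0.31927,
                    "[3.7,6.92]"  =  1.06005)

# Slope of log_gdp in each fertility group: the reference group gets the
# main effect only, the others add their interaction term
slopes <- c("[1.13,1.75)" = b_log_gdp, b_log_gdp + b_interactions)
round(slopes, 3)
```

Labelling each slope with its fertility interval makes the result easier to read than the repeated `log_gdp` names in the output above.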
Model Selection Criteria
So which model is best?
There are a number of criteria created to evaluate goodness of fit of a
model.
R-squared is one, but it is not always the most useful.
Two very useful model selection criteria are
→ Akaike Information Criterion (AIC)
→ Bayesian Information Criterion (BIC)
For both, lower values indicate a better balance between fit and model complexity.
There are other methods too, like cross-validation. The purpose of all of them is to
find the best model without over-fitting the data.
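Both criteria combine the model's log-likelihood with a penalty for the number of estimated parameters: AIC = -2 logLik + 2k and BIC = -2 logLik + log(n) k. The sketch below computes them by hand and compares with R's `AIC()` and `BIC()`. It uses the built-in `mtcars` data as a stand-in (an assumption, not the UN models), and `manual_ic` is a hypothetical helper written for this illustration.

```r
# Stand-in models on built-in data (not the UN models from the slides)
fit_small <- lm(mpg ~ wt, data = mtcars)
fit_large <- lm(mpg ~ wt + hp + qsec, data = mtcars)

# Hypothetical helper: AIC and BIC from the log-likelihood
manual_ic <- function(fit) {
  ll <- logLik(fit)
  k  <- attr(ll, "df")   # estimated parameters, including the error variance
  n  <- nobs(fit)
  c(AIC = -2 * as.numeric(ll) + 2 * k,
    BIC = -2 * as.numeric(ll) + log(n) * k)
}

# Hand-computed values should match R's built-in AIC() and BIC()
rbind(small = manual_ic(fit_small), large = manual_ic(fit_large))
c(AIC(fit_small), BIC(fit_small))
```

Because log(n) exceeds 2 once n is larger than about 7, BIC penalizes extra parameters more heavily than AIC in most data sets.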
Model Selection Criteria
modelsummary(list("Fertility" = fit_ml, "Fertility_q" = fit_ml2,
                  "Interaction" = fit_ml3),
             fmt = 2,
             statistic = "conf.int",
             gof_map = c("r.squared", "aic", "bic"),
             stars = TRUE)
Model Selection Criteria
                                   Fertility         Fertility_q        Interaction
(Intercept)                        63.06***          52.21***           54.40***
                                   [55.52, 70.60]    [44.92, 59.49]     [38.63, 70.18]
log_gdp                            2.45***           2.92***            2.69**
                                   [1.77, 3.14]      [2.17, 3.67]       [1.04, 4.35]
fertility                          -4.18***
                                   [-4.96, -3.39]
fertility_q[1.75,2.26)                               -1.24              -0.01
                                                     [-3.51, 1.04]      [-19.62, 19.60]
fertility_q[2.26,3.7)                                -5.68***           -8.58
                                                     [-8.18, -3.17]     [-29.07, 11.91]
fertility_q[3.7,6.92]                                -11.84***          -19.63*
                                                     [-14.88, -8.79]    [-39.17, -0.10]
log_gdp × fertility_q[1.75,2.26)                                        -0.13
                                                                        [-2.20, 1.93]
log_gdp × fertility_q[2.26,3.7)                                         0.32
                                                                        [-1.99, 2.63]
log_gdp × fertility_q[3.7,6.92]                                         1.06
                                                                        [-1.30, 3.42]
R2                                 0.745             0.699              0.701
AIC                                1186.6            1222.8             1227.4
BIC                                1199.6            1242.4             1256.8
+ p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001
Choosing the Best Model
1. We first map the model based on our theoretical understanding
2. If and when we don’t have a good grasp of the model, we can follow
other methods available to us
Stepwise Regression Methods (Forward, Backward, and Stepwise
Selection)
→ Adding/removing predictors based on criteria (often p value)
Comparing Models Using Selection Criteria
→ Evaluating AIC, BIC, and adjusted R-squared; or cross-validation
techniques
Practical Considerations in Model Selection
→ Interpretability and simplicity
Summary of Key Points
Research design
→ cycle of research, hierarchy of evidence, experimental vs. non-experimental methods
Regression is about explaining variation (R-Squared)
Hypothesis testing in regression and p values
Multiple linear regression with categorical variables
Interaction effect
Model selection criteria
We have only scratched the surface of regression analysis. But be proud of yourselves, as these are not easy
topics.
Next week: Introduction to Causal Inference and
Directed Acyclic Graphs