Sociology: Intermediate Quantitative Research Method

Extensions to Linear Regression


Part 2 of Week 7

Aida Parnia
[email protected]

U of T Sociology

October 15, 2024

SOC252H1F

Extensions to Linear regression


Linear regression as explaining variation
→ R² and adjusted R²
Hypothesis testing in regression
→ t distribution and p value
Multiple linear regression
→ Categorical variables as multiple indicator variables
→ Interaction term
Model selection criteria


Today’s example: UN data


We investigate the determinants of life expectancy.
str(UN)
tibble [193 × 8] (S3: tbl_df/tbl/data.frame)
$ country : chr [1:193] "Afghanistan" "Albania" "Algeria" "Angola" ...
$ region : chr [1:193] "Asia" "Europe" "Africa" "Africa" ...
$ group : chr [1:193] "other" "other" "africa" "africa" ...
$ fertility : num [1:193] 5.97 1.52 2.14 5.13 2.17 ...
$ ppgdp : num [1:193] 499 3677 4473 4322 9162 ...
$ lifeExpF : num [1:193] 49.5 80.4 75 53.2 79.9 ...
$ pctUrban : num [1:193] 23 53 67 59 93 64 47 89 68 52 ...
$ infantMortality: num [1:193] 124.5 16.6 21.5 96.2 12.3 ...
- attr(*, "na.action")= 'omit' Named int [1:20] 4 6 21 35 38 54 67 75 77 78 ...
..- attr(*, "names")= chr [1:20] "4" "6" "21" "35" ...


Explained Variation and R-Squared


Definition and Interpretation of R-Squared
→ Formula: $R^2 = 1 - \frac{SS_{res}}{SS_{tot}}$

Proportion of variance explained by the model


Adjusted R-Squared for Multiple Regression Models
→ As we add predictors to the model, the proportion of variation explained never decreases, even when the added predictors are uninformative
→ R-squared can be adjusted for the number of predictors

→ Formula: $\text{Adjusted } R^2 = 1 - \frac{(1 - R^2)(n - 1)}{n - k - 1}$
→ We can use this in model comparison
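
The two formulas above can be checked by hand against what `summary()` reports. This is a minimal sketch on simulated toy data (the variables `x1`, `x2`, `y` and the data frame `toy` are made up for illustration and are not the UN data); here n is the number of observations and k the number of predictors, excluding the intercept.

```r
# Toy data (made up for illustration; not the UN data from the slides)
set.seed(1)
n <- 50
x1 <- rnorm(n)
x2 <- rnorm(n)
y <- 2 + 1.5 * x1 + rnorm(n)
toy <- data.frame(y, x1, x2)

fit <- lm(y ~ x1 + x2, data = toy)

# R-squared from the residual and total sums of squares
ss_res <- sum(residuals(fit)^2)
ss_tot <- sum((toy$y - mean(toy$y))^2)
r2 <- 1 - ss_res / ss_tot

# Adjusted R-squared, with k = number of predictors (excluding the intercept)
k <- length(coef(fit)) - 1
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - k - 1)

# Compare with the values reported by summary()
c(by_hand = r2, summary = summary(fit)$r.squared)
c(by_hand = adj_r2, summary = summary(fit)$adj.r.squared)
```

The adjusted value is always a little below the unadjusted one, because the factor (n − 1)/(n − k − 1) is greater than 1 whenever there is at least one predictor.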


Explained variation of life expectancy by GDP or log-GDP

library(modelsummary)  # provides modelsummary()

fit_gdp <- lm(lifeExpF ~ ppgdp,
              data = UN)

fit_loggdp <- lm(lifeExpF ~ log(ppgdp),
                 data = UN)

modelsummary(list("GDP" = fit_gdp,
                  "log(GDP)" = fit_loggdp),
             statistic = "conf.int",
             fmt = 3,
             gof_map = c("nobs", "r.squared"))

              GDP                 log(GDP)
(Intercept)   68.072              29.257
              [66.598, 69.546]    [24.154, 34.359]
ppgdp         0.000
              [0.000, 0.000]
log(ppgdp)                        5.090
                                  [4.494, 5.687]
Num.Obs.      193                 193
R2            0.311               0.597

In the model with GDP as the predictor, only 31.1% of the variation in female life expectancy is explained by
the model, while with log GDP as the predictor, 59.7% of the variation is explained by the model.


Hypothesis Testing of the Regression Coefficients


The Null Hypothesis
H0: βk = 0
We know the point estimate, but we want to account for the uncertainty in the data.

If the assumptions of the model are met, then we can derive the probability distribution of the estimated coefficient.
It turns out that the standardized estimated coefficients follow a t distribution (with n − number of parameters degrees of freedom).


The t distribution is a modified normal distribution
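
One way to see how the t distribution relates to the normal is to compare their critical values. The sketch below uses only base R; the degrees of freedom 5, 30 and 191 are chosen for illustration (191 is the residual degrees of freedom of the log-GDP model in this deck).

```r
# Two-sided 5% critical values of the t distribution for several degrees of freedom
df <- c(5, 30, 191)
t_crit <- qt(0.975, df = df)

# The corresponding critical value of the standard normal distribution
z_crit <- qnorm(0.975)

# The t critical values shrink towards the normal value as df grows
data.frame(df = df, t_crit = round(t_crit, 3), z_crit = round(z_crit, 3))
```

The t distribution has heavier tails than the normal, so its critical values are larger, and the gap narrows as the degrees of freedom increase.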


Hypothesis Testing of the Regression Coefficients


t-Tests for Individual Coefficients

→ Formula: $t = \frac{\hat{\beta} - 0}{SE(\hat{\beta})}$
→ The calculated t value is then located on this t distribution, and we
calculate the probability of a value at least this extreme under the null hypothesis (the p-value).
sum_loggdp <- summary(fit_loggdp)
sum_loggdp


Hypothesis Testing of the Regression Coefficients


Call:
lm(formula = lifeExpF ~ log(ppgdp), data = UN)

Residuals:
Min 1Q Median 3Q Max
-25.885 -2.908 1.396 3.982 12.402

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 29.2566 2.5870 11.31 <2e-16 ***
log(ppgdp) 5.0901 0.3025 16.83 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 6.479 on 191 degrees of freedom


Multiple R-squared: 0.5972, Adjusted R-squared: 0.5951
F-statistic: 283.2 on 1 and 191 DF, p-value: < 2.2e-16
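
The t value and p-value for log(ppgdp) in this output can be reproduced by hand. This is a small sketch that copies the estimate, standard error and residual degrees of freedom from the printed summary; the variable names are made up for illustration.

```r
# Values copied from the summary output above
estimate  <- 5.0901  # coefficient of log(ppgdp)
std_error <- 0.3025  # its standard error
df_resid  <- 191     # residual degrees of freedom

# t statistic for H0: beta = 0
t_value <- (estimate - 0) / std_error

# Two-sided p-value from the t distribution with 191 degrees of freedom
p_value <- 2 * pt(-abs(t_value), df = df_resid)

round(t_value, 2)
p_value < 2e-16
```

The rounded t value matches the `t value` column, and the p-value is below the `<2e-16` threshold that R prints.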


Multiple Linear Regression


For our example the model is the following

Life Expectancy = β0 + β1 ⋅ log(GDP) + β2 ⋅ fertility + ε


library(dplyr)  # provides %>% and mutate()

UN <- UN %>% mutate(log_gdp = log(ppgdp))
# Multiple linear regression (ML) model
fit_ml <- lm(lifeExpF ~ log_gdp + fertility,
             data = UN)
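
In a multiple regression, each coefficient describes the change in the predicted outcome when that predictor increases by one unit and the other predictors stay fixed. The sketch below illustrates this on simulated toy data that reuses the slide's variable names; the values, the data frame `toy` and the two hypothetical countries are assumptions for illustration, not the UN data.

```r
# Toy data with the same variable names as the slides (simulated, not the UN data)
set.seed(2)
toy <- data.frame(log_gdp = runif(100, 5, 11),
                  fertility = runif(100, 1, 7))
toy$lifeExpF <- 60 + 2.5 * toy$log_gdp - 4 * toy$fertility + rnorm(100, sd = 3)

fit_toy <- lm(lifeExpF ~ log_gdp + fertility, data = toy)

# Two hypothetical countries: same fertility, log GDP one unit apart
new_countries <- data.frame(log_gdp = c(8, 9), fertility = c(3, 3))
pred <- predict(fit_toy, newdata = new_countries)

# The difference in predictions equals the log_gdp coefficient
unname(diff(pred))
unname(coef(fit_toy)["log_gdp"])
```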


Multiple Linear Regression as a regression plane


[Figure: Regression Plane for Life Expectancy; scale: pred_le, 50 to 80]


More on non-linear relationships


In our model we assume that fertility rate has a linear relationship with life expectancy.
That is, an increase in fertility from 1 to 2 is associated with the same change in
life expectancy as an increase from 5 to 6.
But this may not be true, so we can categorize our fertility measure to
allow a different relationship in each category. Here we use quartiles.

fertility_b <- c(min(UN$fertility),
                 quantile(UN$fertility, 0.25),
                 quantile(UN$fertility, 0.5),
                 quantile(UN$fertility, 0.75),
                 max(UN$fertility))
UN <- UN %>% mutate(
  fertility_q = cut(fertility,
                    breaks = fertility_b,
                    include.lowest = TRUE,
                    right = FALSE))
UN %>% count(fertility_q)

# A tibble: 4 × 2
  fertility_q     n
  <fct>       <int>
1 [1.13,1.75)    48
2 [1.75,2.26)    48
3 [2.26,3.7)     48
4 [3.7,6.92]     49
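
The roles of `right = FALSE` and `include.lowest = TRUE` in `cut()` are easier to see on a tiny vector. This is a sketch on made-up values (the integers 1 to 9, not the UN data); it also builds the five cut points with a single `quantile()` call, which is an alternative to the slide's approach rather than the author's code.

```r
# Toy values (not the UN data): the integers 1 to 9
x <- 1:9

# Quartile cut points in one call: minimum, 25%, 50%, 75%, maximum
breaks <- quantile(x, probs = seq(0, 1, by = 0.25))

# right = FALSE gives intervals closed on the left, [a, b);
# include.lowest = TRUE closes the last interval so the maximum is kept
x_q <- cut(x, breaks = breaks, include.lowest = TRUE, right = FALSE)

levels(x_q)
table(x_q)
```

Without `include.lowest = TRUE`, the maximum value would fall outside the last interval and become `NA`.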


More on non-linear relationships: categorical variables

Dummy variable coding separates a categorical predictor into multiple
dummy (indicator) variables, with one category defined as the
reference category.
fit_ml2 <- lm(lifeExpF ~ log_gdp + fertility_q,
              data = UN)

summary(fit_ml2)
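
To see the indicator variables that `lm()` builds from a factor, `model.matrix()` can be called directly. The sketch below uses a made-up three-level factor `grp` (hypothetical, not from the UN data).

```r
# Toy categorical predictor with three levels (hypothetical values)
toy <- data.frame(grp = factor(c("low", "mid", "high", "mid", "low"),
                               levels = c("low", "mid", "high")))

# model.matrix() shows the indicator columns that lm() would build;
# the first level ("low") is the reference and gets no column of its own
X <- model.matrix(~ grp, data = toy)
X
```

Rows in the reference category have 0 in every indicator column, so their prediction comes from the intercept (and any other predictors) alone.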


More on non-linear relationships: categorical variables

Call:
lm(formula = lifeExpF ~ log_gdp + fertility_q, data = UN)

Residuals:
Min 1Q Median 3Q Max
-21.6867 -1.6328 0.2597 2.9554 12.4939

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 52.2092 3.6933 14.136 < 2e-16 ***
log_gdp 2.9235 0.3801 7.691 7.93e-13 ***
fertility_q[1.75,2.26) -1.2362 1.1532 -1.072 0.285
fertility_q[2.26,3.7) -5.6775 1.2696 -4.472 1.34e-05 ***
fertility_q[3.7,6.92] -11.8377 1.5434 -7.670 9.01e-13 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 5.646 on 188 degrees of freedom


More on non-linear relationships: categorical variables

                          (1)
(Intercept)               52.21
                          [44.92, 59.49]
log_gdp                   2.92
                          [2.17, 3.67]
fertility_q[1.75,2.26)    -1.24
                          [-3.51, 1.04]
fertility_q[2.26,3.7)     -5.68
                          [-8.18, -3.17]
fertility_q[3.7,6.92]     -11.84
                          [-14.88, -8.79]

Interpretation of coefficients
β0 or the intercept
The average life expectancy of females in countries with a log_GDP of 0
and a fertility rate of less than 1.75 is 52.21 years.
β1 or the coefficient for log_GDP
When fertility is held constant, the average life expectancy of
females in countries with one percent higher GDP is 0.03 years
higher than in others.
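
The 0.03-year figure for a one percent higher GDP follows from the log scale of the predictor. A short sketch of the arithmetic, using the log_gdp estimate copied from the printed summary of fit_ml2 (the variable names are made up for illustration):

```r
# Coefficient of log_gdp, copied from the summary output of fit_ml2
b_log_gdp <- 2.9235

# A 1% higher GDP raises log(GDP) by log(1.01), which is about 0.00995
change_1pct <- b_log_gdp * log(1.01)

# A common shortcut: divide the coefficient by 100
shortcut <- b_log_gdp / 100

round(c(exact = change_1pct, shortcut = shortcut), 3)
```

Both versions round to 0.03 years, which is why a coefficient on a logged predictor is often read as "coefficient / 100 per one percent change".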


More on non-linear relationships: categorical variables

Interpretation of coefficients
β2 or the coefficient for fertility [1.75,2.26)
When GDP is held constant, the average life expectancy of
females in countries with fertility rates of 1.75 to just below
2.26 is 1.24 years lower than in countries with fertility rates of
less than 1.75 children per woman.
β3 or the coefficient for fertility [2.26,3.7)
When GDP is held constant, the average life expectancy of
females in countries with fertility rates of 2.26 to just below 3.7
is 5.68 years lower than in countries with fertility rates of less
than 1.75 children per woman.
β4 or the coefficient for fertility [3.7,6.92]
When GDP is held constant, the average life expectancy of
females in countries with fertility rates of 3.7 to 6.92 is 11.84
years lower than in countries with fertility rates of less than
1.75 children per woman.


More on non-linear relationships: categorical variables

library(ggplot2)  # provides ggplot() and the geoms

# Add predictions
UN %>% mutate(ml_pred = predict(fit_ml2)) %>%
  # visualizing the model
  ggplot(aes(y = ml_pred, x = log_gdp, colour = fertility_q)) +
  geom_line() +
  geom_point(aes(y = lifeExpF, x = log_gdp, colour = fertility_q)) +
  theme_light(base_size = 20)


Interaction in Regressions
What if we allow the slope of one variable to change with the values of
another?
In our example, what if the relationship of life expectancy with GDP
depended on the values of fertility rate?
This is the concept of an Interaction Effect.
In the previous example, the predicted values change additively with the
values of fertility rate, but with an interaction this change is
multiplicative.

Life Expectancy = β0 + β1 ⋅ log(GDP) + β2 ⋅ fertility + β3 ⋅ log(GDP) × fertility + ε
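
Grouping the terms of this equation by log(GDP) makes the meaning of the interaction explicit. The rearrangement below is ordinary algebra on the model as written, not an additional assumption:

```latex
\begin{aligned}
\text{Life Expectancy}
  &= \beta_0 + \beta_1 \log(\text{GDP}) + \beta_2\,\text{fertility}
     + \beta_3 \log(\text{GDP}) \times \text{fertility} + \varepsilon \\
  &= (\beta_0 + \beta_2\,\text{fertility})
     + (\beta_1 + \beta_3\,\text{fertility}) \log(\text{GDP}) + \varepsilon
\end{aligned}
```

The slope of log(GDP) is therefore β1 + β3 ⋅ fertility: it equals β1 only when fertility is 0, and it shifts by β3 for each one-unit increase in fertility.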


Interaction in Regressions
# Note the change from + to * in the formula
fit_ml3 <- lm(lifeExpF ~ log_gdp * fertility_q,
              data = UN)

summary(fit_ml3)
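
In an R formula, `a * b` is shorthand for both main effects plus their interaction, `a + b + a:b`. A quick sketch that checks this with `terms()`, using the slide's variable names (no data is needed to expand a formula; `f_star` and `f_long` are names made up for illustration):

```r
# The * operator expands to both main effects plus their interaction
f_star <- lifeExpF ~ log_gdp * fertility_q
f_long <- lifeExpF ~ log_gdp + fertility_q + log_gdp:fertility_q

# Both formulas contain the same three terms
attr(terms(f_star), "term.labels")
attr(terms(f_long), "term.labels")
```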


Interaction in Regressions
Call:
lm(formula = lifeExpF ~ log_gdp * fertility_q, data = UN)

Residuals:
Min 1Q Median 3Q Max
-21.6893 -1.4094 0.2941 3.2680 12.3541

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 54.40179 7.99615 6.803 1.37e-10 ***
log_gdp 2.69213 0.83931 3.208 0.00158 **
fertility_q[1.75,2.26) -0.01215 9.94127 -0.001 0.99903
fertility_q[2.26,3.7) -8.58005 10.38544 -0.826 0.40978
fertility_q[3.7,6.92] -19.63143 9.90168 -1.983 0.04889 *
log_gdp:fertility_q[1.75,2.26) -0.13312 1.04587 -0.127 0.89886
log_gdp:fertility_q[2.26,3.7) 0.31927 1.16941 0.273 0.78515
log_gdp:fertility_q[3.7,6.92] 1.06005 1.19859 0.884 0.37762
---


Interaction in Regressions
# Add predictions
UN %>% mutate(ml_pred = predict(fit_ml3)) %>%
  # visualizing the model
  ggplot(aes(y = ml_pred, x = log_gdp, colour = fertility_q)) +
  geom_line() +
  geom_point(aes(y = lifeExpF, x = log_gdp, colour = fertility_q)) +
  theme_light(base_size = 20)


Interaction in Regressions
# Add predictions
UN %>% mutate(ml_pred = predict(fit_ml3)) %>%
  # visualizing the model, one panel per fertility quartile
  ggplot(aes(y = ml_pred, x = log_gdp, colour = fertility_q)) +
  geom_line() +
  geom_point(aes(y = lifeExpF, x = log_gdp, colour = fertility_q)) +
  theme_light(base_size = 20) + facet_wrap(fertility_q ~ .)


Interpreting interaction effects: Interpretation


Interaction effects are often hard to state clearly, so visualizing the model
or comparing specific predicted values is often more intuitive.
Alternatively, we can define partial effects by calculating the coefficient of
log_GDP for each category of the categorical variable.
The coefficient for log_GDP when fertility is below 1.75 (reference
group) is β_{log_gdp}
The coefficient for log_GDP when fertility is [1.75,2.26) is
β_{log_gdp} + β_{log_gdp:fertility_q[1.75,2.26)}
The coefficient for log_GDP when fertility is [2.26,3.7) is
β_{log_gdp} + β_{log_gdp:fertility_q[2.26,3.7)}
The coefficient for log_GDP when fertility is [3.7,6.92] is
β_{log_gdp} + β_{log_gdp:fertility_q[3.7,6.92]}


Interpreting interaction effects: Interpretation


log_gdp_fer1 <- fit_ml3$coefficients["log_gdp"]

log_gdp_fer2 <- fit_ml3$coefficients["log_gdp"] + fit_ml3$coefficients["log_gdp:fertility_q[1.7

log_gdp_fer3 <- fit_ml3$coefficients["log_gdp"] + fit_ml3$coefficients["log_gdp:fertility_q[2.2

log_gdp_fer4 <- fit_ml3$coefficients["log_gdp"] + fit_ml3$coefficients["log_gdp:fertility_q[3.7

c(log_gdp_fer1,log_gdp_fer2,log_gdp_fer3,log_gdp_fer4)

 log_gdp  log_gdp  log_gdp  log_gdp
2.692128 2.559012 3.011393 3.752180
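
The same four slopes can be reproduced from the printed summary of fit_ml3 without the model object. This sketch copies the log_gdp estimate and the three interaction estimates from that output; the vector names and labels are made up for illustration, and the reference group label [1.13,1.75) is taken from the quartile table.

```r
# Estimates copied from the summary output of fit_ml3
b_log_gdp <- 2.69213
b_interactions <- c("[1.75,2.26)" = -0.13312,
                    "[2.26,3.7)"  =  0.31927,
                    "[3.7,6.92]"  =  1.06005)

# Slope of log_gdp in each fertility group:
# the reference group has no interaction term added
slopes <- c("[1.13,1.75)" = b_log_gdp, b_log_gdp + b_interactions)
round(slopes, 3)
```

Labelling each slope with its fertility group avoids the four identical `log_gdp` names in the output above.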


Model Selection Criteria


So which model is best?
There are a number of criteria for evaluating the goodness of fit of a
model.
R squared is one, but it is not always the most useful.
Two very useful model selection criteria are
→ Akaike Information Criterion (AIC)
→ Bayesian Information Criterion (BIC)
For both, lower values indicate a better balance of fit and complexity.
There are other methods too, like cross-validation. The purpose of all of them is to
find the best model without over-fitting the data.
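
Both criteria combine the model's log-likelihood with a penalty for the number of estimated parameters: AIC = −2 ⋅ logLik + 2k and BIC = −2 ⋅ logLik + log(n) ⋅ k. The sketch below checks these formulas against R's `AIC()` and `BIC()` on simulated toy data (the data frame `toy` and its variables are made up, not the UN data).

```r
# Toy data (simulated for illustration; not the UN data)
set.seed(3)
n <- 80
toy <- data.frame(x = rnorm(n))
toy$y <- 1 + 2 * toy$x + rnorm(n)
fit <- lm(y ~ x, data = toy)

ll <- logLik(fit)
k <- attr(ll, "df")  # parameters counted by R: coefficients plus the error variance

# Penalised log-likelihood: BIC's penalty grows with the sample size
aic_by_hand <- -2 * as.numeric(ll) + 2 * k
bic_by_hand <- -2 * as.numeric(ll) + log(n) * k

c(by_hand = aic_by_hand, builtin = AIC(fit))
c(by_hand = bic_by_hand, builtin = BIC(fit))
```

Because log(n) exceeds 2 once n is larger than about 7, BIC penalises extra parameters more heavily than AIC in most practical samples.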


Model Selection Criteria


modelsummary(list("Fertility" = fit_ml, "Fertility_q" = fit_ml2,
                  "Interaction" = fit_ml3),
             fmt = 2,
             statistic = "conf.int",
             gof_map = c("r.squared", "aic", "bic"),
             stars = TRUE)


Model Selection Criteria


                                    Fertility         Fertility_q        Interaction
(Intercept)                         63.06***          52.21***           54.40***
                                    [55.52, 70.60]    [44.92, 59.49]     [38.63, 70.18]
log_gdp                             2.45***           2.92***            2.69**
                                    [1.77, 3.14]      [2.17, 3.67]       [1.04, 4.35]
fertility                           -4.18***
                                    [-4.96, -3.39]
fertility_q[1.75,2.26)                                -1.24              -0.01
                                                      [-3.51, 1.04]      [-19.62, 19.60]
fertility_q[2.26,3.7)                                 -5.68***           -8.58
                                                      [-8.18, -3.17]     [-29.07, 11.91]
fertility_q[3.7,6.92]                                 -11.84***          -19.63*
                                                      [-14.88, -8.79]    [-39.17, -0.10]
log_gdp × fertility_q[1.75,2.26)                                         -0.13
                                                                         [-2.20, 1.93]
log_gdp × fertility_q[2.26,3.7)                                          0.32
                                                                         [-1.99, 2.63]
log_gdp × fertility_q[3.7,6.92]                                          1.06
                                                                         [-1.30, 3.42]
R2                                  0.745             0.699              0.701
AIC                                 1186.6            1222.8             1227.4
BIC                                 1199.6            1242.4             1256.8
+ p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001


Choosing the Best Model


1. We first specify the model based on our theoretical understanding
2. If and when we don’t have a good grasp of the model, we can follow
other methods available to us

Stepwise Regression Methods (Forward, Backward, and Stepwise Selection)
→ Adding/removing predictors based on criteria (often p value)
Comparing Models Using Selection Criteria
→ Evaluating AIC, BIC, and adjusted R-squared; or cross-validation
techniques
Practical Considerations in Model Selection
→ Interpretability and simplicity
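
As one illustration of a stepwise method, base R's `step()` performs backward elimination. Note that `step()` selects by AIC rather than by p value, so this is a sketch of the general idea and not necessarily the procedure intended in the slide; the toy data and variable names are simulated for illustration.

```r
# Toy data (simulated): y depends on x1 only; x2 and x3 are noise
set.seed(4)
n <- 100
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
toy$y <- 1 + 2 * toy$x1 + rnorm(n)

full <- lm(y ~ x1 + x2 + x3, data = toy)

# Backward elimination; step() drops a term only if doing so lowers the AIC
selected <- step(full, direction = "backward", trace = 0)

formula(selected)
c(full = AIC(full), selected = AIC(selected))
```

The selected model can never have a higher AIC than the starting model, but automated selection does not replace the theoretical reasoning in point 1.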


Summary of Key Points


Research design
→ cycle of research, hierarchy of evidence, experimental vs. non-experimental methods
Regression is about explaining variation (R-Squared)
Hypothesis testing in regression and p values
Multiple linear regression with categorical variables
Interaction effect
Model selection criteria

We have scratched the surface of regression analysis. But be proud of yourselves, as these are not easy
topics.


Next week: Introduction to Causal Inference and Directed Acyclic Graphs
