Regression PDF
Qin Gao
Contents
Simple Linear Regression
  Rationale of simple linear regression
  The Method of Least Squares
  Assessing how well the model fits the observed data
  Perform simple regression in R
Multiple Regression
  Rationale of multiple regression
  Multiple Regression: Parameter Estimation
  Partial correlation, semi-partial (part) correlation, and regression coefficients
  Perform multiple regression in R
  Interpretation of the model
Methods of Regression
Assumptions of Regression
  Straightforward Assumptions
  The More Tricky Assumptions
  DFFits
  Cook's D
  Hat values
  Model Building and Validation
Summary
Simple Linear Regression
$$Y_i = b_0 + b_1 X_i + \varepsilon_i$$
This graph shows a scatterplot of some data with a line representing the general trend. The vertical lines
(dotted) represent the differences (or residuals) between the line and the actual data
Testing the Model: ANOVA
If the model results in better prediction than using the mean, then we expect SSM to be much greater than
SSR
$$F = \frac{MS_M}{MS_R}$$
Coefficient of determination: $R^2$
$$R^2 = \frac{SS_M}{SS_T}$$
Assessing the significance of individual predictors: t-test
$$H_0: \beta_1 = 0 \qquad H_1: \beta_1 \neq 0$$
Testing statistic:
$$t = \frac{b_{\text{observed}} - b_{\text{expected}}}{SE_b} = \frac{b_{\text{observed}}}{SE_b} \sim t(N - 2)$$
Assessing the significance of individual predictors: F-test
$$H_0: \beta_1 = 0 \qquad H_1: \beta_1 \neq 0$$
Testing statistic: for a single predictor the F-test is equivalent to the t-test above, with $F = t^2 \sim F(1, N - 2)$.
Perform simple regression in R
We run a regression analysis using the lm() function – lm stands for 'linear model'. This function takes the general form:
newModel <- lm(outcome ~ predictor(s), data = dataFrame, na.action = an action)
• na.action = na.fail: the fit fails if the data contain missing values.
• na.action = na.omit or na.exclude: cases with missing values are omitted from the fit (na.exclude additionally pads residuals and fitted values back to the original length).
Example
• A record company boss was interested in predicting record sales from advertising.
• Data: 200 different album releases
• Outcome variable:
– Sales (CDs and downloads) in the week after release
• Predictor variable:
– The amount (in units of £1000) spent promoting the record before release.
head(album1)
## adverts sales
## 1 10.256 330
## 2 985.685 120
## 3 1445.563 360
## 4 1188.193 270
## 5 574.513 220
## 6 568.954 170
albumSales.1 <- lm(sales ~ adverts, data = album1)
summary(albumSales.1)
##
## Call:
## lm(formula = sales ~ adverts, data = album1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -152.949 -43.796 -0.393 37.040 211.866
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.341e+02 7.537e+00 17.799 <2e-16 ***
## adverts 9.612e-02 9.632e-03 9.979 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 65.99 on 198 degrees of freedom
## Multiple R-squared: 0.3346, Adjusted R-squared: 0.3313
## F-statistic: 99.59 on 1 and 198 DF, p-value: < 2.2e-16
cor(album1$sales, album1$adverts)^2
## [1] 0.3346481
Multiple Regression
In matrix form, the multiple regression model is
$$y = X\beta + \epsilon$$
When the weights for each observation are identical and the errors are uncorrelated, the least-squares estimate of the parameters is
$$\hat{\beta} = (X^T X)^{-1} X^T y$$
The fitted values are
$$\hat{y} = X\hat{\beta} = X(X^T X)^{-1} X^T y$$
where the hat matrix is
$$H = X(X^T X)^{-1} X^T$$
The residuals are
$$\hat{\epsilon} = y - \hat{y} = y - Hy = (I - H)y$$
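As an illustration (not part of the original notes), the matrix formula can be checked against lm() using the album1 data from the simple regression example:
# Sketch: OLS estimate via the matrix formula, compared with lm()
X <- cbind(1, album1$adverts)             # design matrix with an intercept column
y <- album1$sales
beta.hat <- solve(t(X) %*% X) %*% t(X) %*% y
beta.hat                                  # intercept and slope
coef(lm(sales ~ adverts, data = album1))  # should give the same values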
• Partial correlation: measures the relationship between two variables, controlling for the effect that a third variable has on both of them.
• Semi-partial correlation: Measures the relationship between two variables controlling for the effect that
a third variable has on only one of the others. It measures the unique contribution of a predictor to
explaining the variance of the outcome.
Partial correlation
$$r_{12.3}^2 = \frac{R_{1.23}^2 - R_{1.3}^2}{1 - R_{1.3}^2}$$
• $R_{1.23}^2$ is the R² from a multiple regression with 1 being Y and 2 and 3 being the predictor variables.
• $R_{1.3}^2$ is the R² from a simple regression with 1 being Y and 3 being the single predictor variable.
Semi-partial correlation
Semipartial correlation removes the effects of additional variables from one of the variables under study (typically X):
$$r_{1(2.3)}^2 = R_{1.23}^2 - R_{1.3}^2$$
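For illustration (not in the original notes), both quantities can be computed from the R² values of two regressions; the sketch below uses the album2 data from the upcoming example, treating sales as 1, airplay as 2, and adverts as 3:
# Squared partial and semi-partial correlations from R-squared values
R2.123 <- summary(lm(sales ~ airplay + adverts, data = album2))$r.squared
R2.13  <- summary(lm(sales ~ adverts, data = album2))$r.squared
(R2.123 - R2.13) / (1 - R2.13)  # squared partial correlation of sales and airplay, controlling adverts
R2.123 - R2.13                  # squared semi-partial correlation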
Uses of Partial and Semipartial
• The partial correlation is most often used when some third variable z is a plausible explanation of the
correlation between X and Y.
• The semipartial is most often used when we want to show that some variable adds incremental variance in Y above and beyond the other X variables.
• Each regression coefficient in the regression model is the amount of change in the outcome variable
that would be expected per one-unit change of the predictor, if all other variables in the model
were held constant.
Example
• A record company boss was interested in predicting record sales from advertising.
• Data: 200 different album releases
• Outcome variable:
– Sales (CDs and downloads) in the week after release (in units of 1000)
• Predictor variables:
– The amount (in units of £1000) spent promoting the record before release.
– Number of plays on the radio: number of times played on radio the week before release
– Attractiveness of the CD cover: expert rating with a 0-10 scale
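The summary output below comes from fitting all three predictors; a call consistent with it (the object name albumSales.2 is taken from the lm.beta() example further down) would be:
albumSales.2 <- lm(sales ~ adverts + airplay + attract, data = album2)
summary(albumSales.2)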
##
## Call:
## lm(formula = sales ~ adverts + airplay + attract, data = album2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -121.324 -28.336 -0.451 28.967 144.132
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -26.612958 17.350001 -1.534 0.127
## adverts 0.084885 0.006923 12.261 < 2e-16 ***
## airplay 3.367425 0.277771 12.123 < 2e-16 ***
## attract 11.086335 2.437849 4.548 9.49e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 47.09 on 196 degrees of freedom
## Multiple R-squared: 0.6647, Adjusted R-squared: 0.6595
## F-statistic: 129.5 on 3 and 196 DF, p-value: < 2.2e-16
Interpretation of the model
• F(3, 196) = 129.5, p < .001: the model results in significantly better prediction than using the mean value of album sales.
• $\beta_1 = 0.085$: as advertising increases by 1 unit (£1,000), album sales increase by 0.085 units (of 1,000 sales).
• $\beta_2 = 3.367$: when the number of plays on the radio increases by 1 unit, sales increase by 3.367 units (of 1,000 sales).
• $\beta_3 = 11.086$: when the attractiveness of the CD cover increases by 1 unit, sales increase by 11.086 units (of 1,000 sales).
The standardized regression coefficients represent the change in the response for a change of one standard deviation in a predictor:
$$\beta_i^* = \beta_i \frac{s_{x_i}}{s_y}$$
You can calculate standardised beta values using the lm.beta() function from QuantPsyc package.
library(QuantPsyc)
lm.beta(albumSales.2)
Alternatively, you can standardize the original raw data by converting each original data value to a z-score,
and then perform multiple linear regression using the standardized data. The obtained regression coefficients
are standardized.
Standardised Beta Values
• $\beta_1^* = 0.511$: as advertising increases by 1 standard deviation (£485,655), album sales increase by 0.511 of a standard deviation (0.511 × 80,699).
• $\beta_2^* = 0.512$: when the number of plays on the radio increases by 1 SD (12.27), sales increase by 0.512 standard deviations (0.512 × 80,699).
• $\beta_3^* = 0.192$: when the attractiveness of the CD cover increases by 1 SD (1.40), sales increase by 0.192 standard deviations (0.192 × 80,699).
R2 and adjusted R2
$$\bar{R}^2 = 1 - \frac{(1 - R^2)(n - 1)}{n - k - 1}$$
where n is the sample size and k is the number of predictors.
The adjusted R² increases when a new explanatory variable is included only if it improves R² more than would be expected by chance.
• Akaike Information Criterion (AIC) is a measure of fit which penalizes the model for having more
variables
$$AIC = n \ln\left(\frac{SSE}{n}\right) + 2k$$
Calculating AIC in R
AIC(regressionModel) returns the AIC of a fitted model. (Note that the k argument of AIC() is the penalty per parameter, which defaults to 2; it is not the number of predictors.)
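A brief sketch (not from the original notes) comparing the two album models fitted earlier; a lower AIC indicates a better fit-complexity trade-off. AIC() is based on the full log-likelihood, while extractAIC() uses the n·ln(SSE/n) + 2k form for lm objects, so the two are on different scales but order models the same way:
AIC(albumSales.1)        # adverts only
AIC(albumSales.2)        # adverts + airplay + attract (object from the sketch above)
extractAIC(albumSales.2) # equivalent degrees of freedom and n*log(RSS/n) + 2*edf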
Methods of Regression
Interpretation of Results
• The F-test: It tells us whether using the regression model is significantly better at predicting values of
the outcome than using the mean.
• Beta values: the change in the outcome associated with a unit change in the predictor.
• Standardised beta values: tell us the same but expressed as standard deviations.
Hierarchical Regression
• Predictors are entered into the regression model in the order specified by the researcher, based on past research.
• New predictors are then entered in a separate step/block.
• Each IV is assessed in terms of what it adds to the prediction of the DV after the previous IVs have been controlled for.
• The overall model and the relative contribution of each block of variables are assessed
  – F-test of the $R^2$ change.
Example
If we control for the possible effect of promotion budget, are airplay and CD cover design still able to predict
a significant amount of the variance in CD sales?
Examine individual models
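The original code for the two fits is not shown; calls consistent with the output (the object names album.adv.only and album.full are taken from the code further down) would be:
album.adv.only <- lm(sales ~ adverts, data = album2)
album.full <- lm(sales ~ adverts + attract + airplay, data = album2)
summary(album.adv.only)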
##
## Call:
## lm(formula = sales ~ adverts, data = album2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -152.949 -43.796 -0.393 37.040 211.866
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.341e+02 7.537e+00 17.799 <2e-16 ***
## adverts 9.612e-02 9.632e-03 9.979 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 65.99 on 198 degrees of freedom
## Multiple R-squared: 0.3346, Adjusted R-squared: 0.3313
## F-statistic: 99.59 on 1 and 198 DF, p-value: < 2.2e-16
summary.lm(album.full)
##
## Call:
## lm(formula = sales ~ adverts + attract + airplay, data = album2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -121.324 -28.336 -0.451 28.967 144.132
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -26.612958 17.350001 -1.534 0.127
## adverts 0.084885 0.006923 12.261 < 2e-16 ***
## attract 11.086335 2.437849 4.548 9.49e-06 ***
## airplay 3.367425 0.277771 12.123 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 47.09 on 196 degrees of freedom
## Multiple R-squared: 0.6647, Adjusted R-squared: 0.6595
## F-statistic: 129.5 on 3 and 196 DF, p-value: < 2.2e-16
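The model and error sums of squares printed below can be obtained from the fitted models; a sketch consistent with the printed values:
SSM.adv.only <- sum((fitted(album.adv.only) - mean(album2$sales))^2)
SSE.adv.only <- sum(resid(album.adv.only)^2)
SSM.full <- sum((fitted(album.full) - mean(album2$sales))^2)
SSE.full <- sum(resid(album.full)^2)
cbind(SSM.adv.only, SSE.adv.only)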
## SSM.adv.only SSE.adv.only
## [1,] 433687.8 862264.2
cbind(SSM.full, SSE.full)
## SSM.full SSE.full
## [1,] 861377.4 434574.6
F=((SSM.full-SSM.adv.only)/2)/(SSE.full/196)
F
## [1] 96.44738
anova(album.adv.only, album.full) # Note: for anova(model1, model2), all predictors in model1 must also be included in model2, i.e. the models must be nested.
• A hierarchical regression can have as many blocks as there are groups of independent variables, i.e. the analyst can test a hypothesis that specifies an exact order of entry for the variables.
• A more common hierarchical regression specifies two blocks of variables: a set of control variables
entered in the first block and a set of predictor variables entered in the second block.
– Control variables are often demographics which are thought to make a difference in scores on
the dependent variable.
– Predictors are the variables whose effects our research question is really about, and whose effects we want to separate from those of the control variables.
• Support for a hierarchical hypothesis would be expected to require statistical significance for the
addition of each block of variables.
• However, many times, we want to exclude the effect of blocks of variables previously entered into the
analysis, whether or not a previous block was statistically significant. The analysis is interested in
obtaining the best indicator of the effect of the predictor variables. The statistical significance of
previously entered variables is not interpreted.
• The latter strategy is also widely adopted in research.
• The $R^2$ change, i.e. the increase when the predictor variables are added to the analysis, is interpreted rather than the overall R² for the model with all variables entered.
• In the interpretation of individual relationships, the relationship between the predictors and the de-
pendent variable is presented.
• Similarly, in the validation analysis, we are only concerned with verifying the significance of the pre-
dictor variables. Differences in control variables are often ignored.
Reporting Hierarchical Regression
Hierarchical multiple regression was performed to investigate the ability airplay and CD cover
design to predict the variance in CD sales, after controlling for the possible effect of promotion
budget.
In the first step of the hierarchical multiple regression, advertisement budget was entered. This model was statistically significant (F(1, 198) = 99.59; p < .001) and explained 33% of the variance in CD sales.
• Report the $R^2$ change and its significance test results after the entry of each new block
After the entry of airplay and CD cover design at Step 2, the total variance explained by the model as a whole was 66% (F(3, 196) = 129.5; p < .001). The introduction of airplay and CD cover design explained an additional 33% of the variance in CD sales, after controlling for advertisement budget (F(2, 196) = 96.45; p < .001).
Source: Park, N., Kee, K. F., & Valenzuela, S. (2009). Being immersed in social networking environment:
Facebook groups, uses and gratifications, and social outcomes. CyberPsychology & Behavior, 12, 729-733.
Stepwise Regression
Stepwise Regression in R
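The calls that generated the traces below are not shown in the original; they are consistent with step() run in backward, forward, and both directions, roughly as follows (the album.null name is an assumption):
album.null <- lm(sales ~ 1, data = album2)
step(album.full, direction = "backward")                              # first trace
step(album.null, scope = formula(album.full), direction = "forward")  # second trace
step(album.null, scope = formula(album.full), direction = "both")     # third trace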
## Start: AIC=1544.76
## sales ~ adverts + attract + airplay
##
## Df Sum of Sq RSS AIC
## <none> 434575 1544.8
## - attract 1 45853 480428 1562.8
## - airplay 1 325860 760434 1654.7
## - adverts 1 333332 767907 1656.6
##
## Call:
## lm(formula = sales ~ adverts + attract + airplay, data = album2)
##
## Coefficients:
## (Intercept) adverts attract airplay
## -26.61296 0.08488 11.08634 3.36743
## Start: AIC=1757.29
## sales ~ 1
##
## Df Sum of Sq RSS AIC
## + airplay 1 464863 831089 1670.4
## + adverts 1 433688 862264 1677.8
## + attract 1 137822 1158130 1736.8
## <none> 1295952 1757.3
##
## Step: AIC=1670.44
## sales ~ airplay
##
## Df Sum of Sq RSS AIC
## + adverts 1 350661 480428 1562.8
## + attract 1 63182 767907 1656.6
## <none> 831089 1670.4
##
## Step: AIC=1562.82
## sales ~ airplay + adverts
##
## Df Sum of Sq RSS AIC
## + attract 1 45853 434575 1544.8
## <none> 480428 1562.8
##
## Step: AIC=1544.76
## sales ~ airplay + adverts + attract
##
## Call:
## lm(formula = sales ~ airplay + adverts + attract, data = album2)
##
## Coefficients:
## (Intercept) airplay adverts attract
## -26.61296 3.36743 0.08488 11.08634
## Start: AIC=1757.29
## sales ~ 1
##
## Df Sum of Sq RSS AIC
## + airplay 1 464863 831089 1670.4
## + adverts 1 433688 862264 1677.8
## + attract 1 137822 1158130 1736.8
## <none> 1295952 1757.3
##
## Step: AIC=1670.44
## sales ~ airplay
##
## Df Sum of Sq RSS AIC
## + adverts 1 350661 480428 1562.8
## + attract 1 63182 767907 1656.6
## <none> 831089 1670.4
## - airplay 1 464863 1295952 1757.3
##
## Step: AIC=1562.82
## sales ~ airplay + adverts
##
## Df Sum of Sq RSS AIC
## + attract 1 45853 434575 1544.8
## <none> 480428 1562.8
## - adverts 1 350661 831089 1670.4
## - airplay 1 381836 862264 1677.8
##
## Step: AIC=1544.76
## sales ~ airplay + adverts + attract
##
## Df Sum of Sq RSS AIC
## <none> 434575 1544.8
## - attract 1 45853 480428 1562.8
## - airplay 1 325860 760434 1654.7
## - adverts 1 333332 767907 1656.6
##
## Call:
## lm(formula = sales ~ airplay + adverts + attract, data = album2)
##
## Coefficients:
## (Intercept) airplay adverts attract
## -26.61296 3.36743 0.08488 11.08634
All-subsets-regressions (Best-subset)
• A procedure that considers all possible regression models given the set of potentially important pre-
dictors
• Model selection criteria:
– 𝑅2 . Find a subset model so that adding more variables will yield only small increases in R-squared
– Adjusted R2.
– MSE criterion
– Mallows' Cp criterion
– Other: PRESS, predicted 𝑅2 (which is calculated from the PRESS statistic)…
Mallows' Cp
$$C_p = \frac{SSE_k}{MSE_T} - (n - 2(k + 1))$$
• Identify subsets of predictors for which the Cp value is near k+1 (if possible).
– The full model always yields Cp = k+1, so don’t select the full model based on Cp.
• If all models, except the full model, yield a large Cp not near k+1, it suggests some important predic-
tor(s) are missing from the analysis. In this case, we are well-advised to identify the predictors that
are missing!
• If a number of models have Cp near k+1, choose the model with the smallest Cp value, thereby ensuring that the combination of bias and variance is at a minimum.
• When more than one model has a small Cp value near k+1, in general choose the simpler model or the model that best meets your research needs.
All-subsets-regression in R
library(leaps)
album.subsets <- regsubsets(sales~adverts+attract+airplay, data = album2)
summary(album.subsets)
## Subset selection object
## Call: regsubsets.formula(sales ~ adverts + attract + airplay, data = album2)
## 3 Variables (and intercept)
## Forced in Forced out
## adverts FALSE FALSE
## attract FALSE FALSE
## airplay FALSE FALSE
## 1 subsets of each size up to 3
## Selection Algorithm: exhaustive
## adverts attract airplay
## 1 ( 1 ) " " " " "*"
## 2 ( 1 ) "*" " " "*"
## 3 ( 1 ) "*" "*" "*"
[Figure: plots of the subset-selection results, one panel scaled by Mallows' Cp and one by adjusted R² (approximately 0.36, 0.63, and 0.66 for the one-, two-, and three-predictor subsets), showing which of adverts, attract, and airplay enter each subset.]
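These panels come from the plot() method for regsubsets objects; for example:
plot(album.subsets, scale = "Cp")
plot(album.subsets, scale = "adjr2")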
Options for plot( ) are r2, bic, Cp, and adjr2.
• It is important to note that no single criterion can determine which model is best.
• The different criteria quantify different aspects of the regression model, and therefore often yield different choices for the best set of predictors.
• Subsets regression is best used as a screening tool to reduce the large number of possible regression models to just a handful for further evaluation.
• Further evaluation and refinement might entail performing residual analyses, transforming the predictors and/or the response, adding interaction terms, and so on, until you are satisfied with a model that summarizes the trend in the data and allows you to answer your research question.
• Model selection statistics are generally not used blindly, but rather information about the field of
application, the intended use of the model, and any known biases in the data are taken into account
in the process of model selection.
• More suitable for exploratory model building
• Better to cross-validate the model with new data
Assumptions of Regression
Straightforward Assumptions
• Variable Type:
– Outcome must be continuous
– Predictors can be continuous or dichotomous.
• Non-Zero Variance: Predictors must not have zero variance.
• Linearity: The relationship we model is, in reality, linear.
• Homoscedasticity: For each value of the predictors the variance of the error term should be constant.
• Independence: All values of the outcome should come from different persons.
The More Tricky Assumptions
• No multicollinearity: Predictors must not be highly correlated.
• Independent Errors: For any pair of observations, the error terms should be uncorrelated.
– Tested with the Durbin-Watson test
– The statistic ranges from 0 to 4; a value of 2 means the errors are uncorrelated
• Normally-distributed Errors
Multicollinearity
library(car)
vif(album.full)
Testing independence in R
durbinWatsonTest(model) or dwt(model)
dwt(album.full)
Sample Size
$$N > 50 + 8k$$
where N is the number of participants and k is the number of IVs (predictors).
• An observation that is unconditionally unusual in Y value is called an outlier, but it is not necessarily
a regression outlier
• An observation that has an unusual X value—i.e., it is far from the mean of X—has leverage on (i.e.,
the potential to influence) the regression line
• Influential cases: an unusual X-value with an unusual Y-value given its X-value
• The olsrr package offers a number of tools to detect influential observations. For more use of olsrr,
check out this introduction.
• Unstandardized residuals
• Standardized residuals:
  1. Cases with absolute values greater than 3 are cause for concern.
  2. If more than 1% of cases have absolute values greater than 2.5, the level of error within the model is unacceptable.
  3. If more than 5% of cases have absolute values greater than 2, the level of error within the model is unacceptable.
• Estimating the outliers in R
  – Unstandardized residuals: resid()
  – Standardized residuals: rstandard()
One way to flag cases whose standardized residual exceeds 2 in absolute value:
large.standardized.residual <- abs(rstandard(album.full)) > 2
sum(large.standardized.residual)
library(olsrr)
ols_plot_resid_stand(album.full)
[Figure: standardized-residuals chart from ols_plot_resid_stand(album.full), with the ±2 threshold marked and the most extreme cases labelled.]
DFBeta
• DFBeta: the difference between a parameter estimated using all cases and estimated when one case is
excluded
Belsley, Kuh, and Welsch recommend 2 as a general cutoff value to indicate influential observations, and $2/\sqrt{n}$ as a size-adjusted cutoff.
For our sample, $2/\sqrt{200} = 0.14$.
ols_plot_dfbetas(album.full)
[Figure: DFBETAS influence diagnostics, one panel per coefficient (intercept, adverts, airplay, attract), each with the ±0.14 threshold marked and the most influential cases labelled.]
DFFits
• DFFit: The difference between the predicted value for a case when the model is calculated including
that case and when the model is calculated excluding that case.
• An observation is deemed influential if the absolute value of its DFFITS value is greater than:
$$2\sqrt{\frac{k + 1}{n}}$$
where n is the number of observations and k is the number of predictors.
For our sample, this equals
$$2\sqrt{\frac{3 + 1}{200}} = 0.28$$
album2$dffit <- dffits(album.full)
head(album2$dffit)
ols_plot_dffits(album.full)
[Figure: DFFITS chart for album.full, with the ±0.28 threshold marked and the most influential cases labelled.]
Cook’s D
• Cook’s distance: the impact that a case has on the model’s ability to predict all cases
  – $D_i = \dfrac{\sum_{j=1}^{n} (\hat{Y}_j - \hat{Y}_{j(i)})^2}{p \, MSE}$, where $\hat{Y}_{j(i)}$ is the prediction for case j from the model refitted without case i, and p is the number of parameters in the model.
• Since Cook's distance is in the metric of an F distribution with p and n − p degrees of freedom, the median point of $F_{(p, n-p)}$ can be used as a cut-off.
• For large n, a simple cutoff value of 1 can be used. (Weisberg, 1982)
[Figure: index plot of Cook's distance values for album.full.]
ols_plot_cooksd_chart(album.full)
[Figure: Cook's D chart from ols_plot_cooksd_chart(), threshold 0.02; case 164 has the largest distance, and all values are well below the simple cutoff of 1.]
Hat values
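Hat values (leverage) are the diagonal elements of the hat matrix H defined earlier; their average is (k + 1)/n, and a common rule of thumb (an addition here, not from the original notes) is to inspect cases exceeding two or three times that average:
album2$hatvalues <- hatvalues(album.full)  # leverage of each case
average.leverage <- (3 + 1) / 200          # (k + 1)/n for this model
which(album2$hatvalues > 2 * average.leverage)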
[Figure: index plot of album2$hatvalues (leverage values) for the fitted model.]
Model Building and Validation
• In data-driven research, the final step in the model-building process is to validate the selected regression model.
– Collect new data and compare the results.
– Cross-validation: If the data set is large, split the data into two parts and cross-validate the results
(often 80-20).
Categorical predictors and multiple regression
• Often you may have categorical variables (e.g., gender, major) which you want to include as predictors
in regression.
• To do that, you need to code the categorical predictors with dummy variables
– Create k-1 dummy variables
– Choose the baseline/control group. If you don't have a specific control group, choose the group that represents the majority of people.
– Assign the baseline values of 0 for all of your dummy variables.
– For your first dummy variable, assign the value 1 to the first group that you want to compare
against the baseline group. Assign all other groups 0 for this variable.
– For your second dummy variable, assign the value 1 to the second group that you want to compare
against the baseline group. Assign all other groups 0 for this variable.
– Repeat this until you run out of dummy variables
– Put dummy variables into the regression model
Example
Salaries is a data frame included in the carData package. It contains 397 observations on the 2008-09 nine-month academic salary of Assistant Professors, Associate Professors, and Professors at a college in the U.S.
library(car)
data("Salaries", package = "carData")
str(Salaries)
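The first model regresses salary on years of service; a call consistent with the summary output below (the object name is an assumption) would be:
salary.model1 <- lm(salary ~ yrs.service, data = Salaries)
summary(salary.model1)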
##
## Call:
## lm(formula = salary ~ yrs.service, data = Salaries)
##
## Residuals:
## Min 1Q Median 3Q Max
## -81933 -20511 -3776 16417 101947
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 99974.7 2416.6 41.37 < 2e-16 ***
## yrs.service 779.6 110.4 7.06 7.53e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 28580 on 395 degrees of freedom
## Multiple R-squared: 0.1121, Adjusted R-squared: 0.1098
## F-statistic: 49.85 on 1 and 395 DF, p-value: 7.529e-12
If we want to examine whether gender plays a role in determining professors' salaries, we can add it to the regression model.
Note that for the sex variable in this dataset, Female is the baseline group under the default treatment contrasts:
contrasts(Salaries$sex)
## Male
## Female 0
## Male 1
You can use the relevel() function to set the baseline category to males as follows:
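A call consistent with the contrasts shown below (this reassignment is assumed rather than shown in the original):
Salaries$sex <- relevel(Salaries$sex, ref = "Male")
contrasts(Salaries$sex)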
## Female
## Male 0
## Female 1
The fact that the coefficient for sexFemale in the regression output is negative indicates that being a Female
is associated with a constant decrease (i.e., smaller intercept in regression) in salary (relative to Males).
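A call consistent with the output below (object name assumed):
salary.model2 <- lm(salary ~ yrs.service + sex, data = Salaries)
summary(salary.model2)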
##
## Call:
## lm(formula = salary ~ yrs.service + sex, data = Salaries)
##
## Residuals:
## Min 1Q Median 3Q Max
## -81757 -20614 -3376 16779 101707
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 101428.7 2531.9 40.060 < 2e-16 ***
## yrs.service 747.6 111.4 6.711 6.74e-11 ***
## sexFemale -9071.8 4861.6 -1.866 0.0628 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 28490 on 394 degrees of freedom
## Multiple R-squared: 0.1198, Adjusted R-squared: 0.1154
## F-statistic: 26.82 on 2 and 394 DF, p-value: 1.201e-11
• The interaction term indicates whether females' salaries grow with years of service at the same rate as males' salaries do.
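A call consistent with the output below (object name assumed; the formula is copied from the printed Call):
salary.model3 <- lm(salary ~ yrs.service + yrs.service * sex, data = Salaries)
summary(salary.model3)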
##
## Call:
## lm(formula = salary ~ yrs.service + yrs.service * sex, data = Salaries)
##
## Residuals:
## Min 1Q Median 3Q Max
## -80381 -20258 -3727 16353 102536
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 102197.1 2563.7 39.863 < 2e-16 ***
## yrs.service 705.6 113.7 6.205 1.39e-09 ***
## sexFemale -20128.6 7991.1 -2.519 0.0122 *
## yrs.service:sexFemale 931.7 535.2 1.741 0.0825 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 28420 on 393 degrees of freedom
## Multiple R-squared: 0.1266, Adjusted R-squared: 0.1199
## F-statistic: 18.98 on 3 and 393 DF, p-value: 1.622e-11
• Note that R treats the first level of a factor as the baseline (when a factor is created from text, the levels default to alphabetical order). In the Salaries data the rank factor has levels AsstProf, AssocProf, and Prof (level 1 = AsstProf, level 2 = AssocProf, level 3 = Prof), so AsstProf is the baseline.
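The column names 2 and 3 in the contrasts below (and the rank2/rank3 coefficient names) suggest that treatment contrasts were assigned explicitly; a reconstruction consistent with the output (assumed, not shown in the original) is:
contrasts(Salaries$rank) <- contr.treatment(3)
contrasts(Salaries$rank)
salary.model4 <- lm(salary ~ yrs.service + sex + rank, data = Salaries)
summary(salary.model4)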
## 2 3
## AsstProf 0 0
## AssocProf 1 0
## Prof 0 1
##
## Call:
## lm(formula = salary ~ yrs.service + sex + rank, data = Salaries)
##
## Residuals:
## Min 1Q Median 3Q Max
## -64500 -15111 -1459 11966 107011
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 82081.5 2974.1 27.598 < 2e-16 ***
## yrs.service -171.8 115.3 -1.490 0.13694
## sexFemale -5468.7 4035.3 -1.355 0.17613
## rank2 14702.9 4266.6 3.446 0.00063 ***
## rank3 48980.2 3991.8 12.270 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 23580 on 392 degrees of freedom
## Multiple R-squared: 0.4, Adjusted R-squared: 0.3938
## F-statistic: 65.32 on 4 and 392 DF, p-value: < 2.2e-16
Summary
• Different types of regression methods are developed for theory-driven or data-driven research purposes.
• To compare nested models, F-test and AIC are often used.
• To compare non-nested models, more general goodness-of-fit indices are often used, including Mallows' Cp.
• It is possible for a single observation to have a great influence on the results of a regression analysis.
Methods to detect such observations include DFBeta, DFFits, Cook’s D, and hat values.
• Categorical variables can be included in linear regression models after being coded with a set of dummy variables.