
Regression

Qin Gao

Contents
Simple Linear Regression 3
Rationale of simple linear regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
The Method of Least Squares . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
Assessing how well the model fits the observed data . . . . . . . . . . . . . . . . . . . . . . . . . . 3
Perform simple regression in R . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5

Multiple Regression 6
Rationale of multiple regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
Multiple Regression: Parameter Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
Partial correlation, semi-partial (part) correlation, and regression coefficients . . . . . . . . . . . . 7
Perform multiple regression in R . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
Interpretation of the model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9


Types of Research Questions and Corresponding Regression Methods 10


Forced Entry Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
Hierarchical Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
Stepwise Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
All-subsets-regressions (Best-subset) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
Summary of Data-Driven Regression Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20

Assumptions of Regression 20
Straightforward Assumptions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
The More Tricky Assumptions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22

Finding outliers and influential cases 23


Outliers and influential cases . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
Residuals and standardized residuals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
DFBeta . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25

DFFits . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
Cook’s D . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
Hat values . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
Model Building and Validation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29

Categorical predictors and multiple regression 30

Summary 33

Simple Linear Regression

Rationale of simple linear regression

• Regression is a way of predicting the value of one variable from another.


• In correlation, the two variables are treated as equals. In regression, one variable is considered the independent (= predictor) variable (X) and the other the dependent (= outcome) variable (Y).

$$Y_i = b_0 + b_1 X_i + \varepsilon_i$$

The Method of Least Squares

This graph shows a scatterplot of some data with a line representing the general trend. The vertical lines
(dotted) represent the differences (or residuals) between the line and the actual data.

Assessing how well the model fits the observed data

• The regression line is only a model based on the data.


• We need some way of testing how well the model fits the observed data.
• Testing the model fit with ANOVA
– SST: Total variability (variability between scores and the mean).
– SSR: Residual/error variability (variability between the regression model and the actual data).
– SSM: Model variability (difference in variability between the model and the mean).

Testing the Model: ANOVA

If the model results in better prediction than using the mean, then we expect SSM to be much greater than
SSR

$$F = \frac{MS_M}{MS_R}$$

• dfM: number of predictors


• dfR: the number of observations minus the number of parameters being estimated

Coefficient of determination: 𝑅2

• 𝑅2 is the proportion of variance accounted for by the regression model.


• 𝑅2 is the Pearson Correlation Coefficient Squared

$$R^2 = \frac{SS_M}{SS_T}$$

Assessing the significance of individual predictors: t-test

$$H_0: \beta_1 = 0 \qquad H_1: \beta_1 \neq 0$$

Test statistic:

$$t = \frac{b_{observed} - b_{expected}}{SE_b} = \frac{b_{observed}}{SE_b} \sim t(N - 2)$$

Assessing the significance of individual predictors: F-test

$$H_0: \beta_1 = 0 \qquad H_1: \beta_1 \neq 0$$

• $SS_M$: model sum of squares

• $SS_E$: error sum of squares
• $SS_{M_H}$: model sum of squares without the j-th predictor
• $SS_{E_H}$: error sum of squares without the j-th predictor

Test statistic:

$$F_j = \frac{SS_M - SS_{M_H}}{SS_E/(n-k-1)} = \frac{SS_{M_j}}{SS_E/(n-k-1)} \sim F(1,\, n-k-1)$$

Perform simple regression in R

We run a regression analysis using the lm() function – lm stands for ‘linear model’. This function takes the
general form:
newModel <- lm(outcome ~ predictor(s), data = dataFrame, na.action = an action)

* na.action = na.fail: fail if the data contain missing values
* na.action = na.omit or na.exclude: drop cases with missing values

Example

• A record company boss was interested in predicting record sales from advertising.
• Data: 200 different album releases
• Outcome variable:
– Sales (CDs and downloads) in the week after release
• Predictor variable:
– The amount (in units of £1000) spent promoting the record before release.

album1 <- read.delim("Album Sales 1.dat", header = TRUE)


str(album1)

## 'data.frame': 200 obs. of 2 variables:


## $ adverts: num 10.3 985.7 1445.6 1188.2 574.5 ...
## $ sales : int 330 120 360 270 220 170 70 210 200 300 ...

head(album1)

## adverts sales
## 1 10.256 330
## 2 985.685 120
## 3 1445.563 360
## 4 1188.193 270
## 5 574.513 220
## 6 568.954 170

albumSales.1 <- lm(sales ~ adverts, data = album1)
summary(albumSales.1)

##
## Call:
## lm(formula = sales ~ adverts, data = album1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -152.949 -43.796 -0.393 37.040 211.866
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.341e+02 7.537e+00 17.799 <2e-16 ***
## adverts 9.612e-02 9.632e-03 9.979 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 65.99 on 198 degrees of freedom
## Multiple R-squared: 0.3346, Adjusted R-squared: 0.3313
## F-statistic: 99.59 on 1 and 198 DF, p-value: < 2.2e-16

cor(album1$sales, album1$adverts)^2

## [1] 0.3346481
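With a single predictor, the t-test of the slope and the overall F-test are equivalent (F = t²). A quick check of this using the model fitted above (a sketch, not part of the original handout):

t.adverts <- summary(albumSales.1)$coefficients["adverts", "t value"]
t.adverts^2 # about 99.59, matching the F-statistic reported by summary() and anova()
anova(albumSales.1)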

Multiple Regression

Rationale of multiple regression

Multiple Regression is a natural extension of linear model:

$$y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \beta_3 X_{3i} + \ldots + \varepsilon_i$$

Multiple Regression: Parameter Estimation

Least squares solution:

The model in matrix form is

$$y = X\beta + \epsilon$$

When the weights for each observation are identical and the errors are uncorrelated, the estimated parameters are

$$\hat{\beta} = (X^T X)^{-1} X^T y$$

so the fitted values are

$$\hat{y} = X\hat{\beta} = X(X^T X)^{-1} X^T y$$

The following matrix is called a projection matrix or hat matrix:

$$H = X(X^T X)^{-1} X^T$$

The residuals are

$$\hat{\epsilon} = y - \hat{y} = y - Hy = (I - H)y$$
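To make the matrix formulas concrete, here is a minimal sketch (using the album1 data loaded earlier; not part of the original handout) that computes the least-squares estimates from the normal equations and checks them against lm():

X <- model.matrix(~ adverts, data = album1) # design matrix: intercept column plus adverts
y <- album1$sales
beta.hat <- solve(t(X) %*% X) %*% t(X) %*% y # (X'X)^(-1) X'y
beta.hat
coef(lm(sales ~ adverts, data = album1)) # same estimates from lm()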

Partial correlation, semi-partial (part) correlation, and regression coefficients

Partial vs. semi-partial correlation

• Partial correlation: measures the relationship between two variables, controlling for the effect that a
third variable has on all the variables involved
• Semi-partial correlation: Measures the relationship between two variables controlling for the effect that
a third variable has on only one of the others. It measures the unique contribution of a predictor to
explaining the variance of the outcome.

Partial correlation

• Using bi-variate correlations

$$r_{12.3} = \frac{r_{12} - r_{13}\, r_{23}}{\sqrt{1 - r_{13}^2}\,\sqrt{1 - r_{23}^2}}$$

• Using multiple regressions

$$r_{12.3}^2 = \frac{R_{1.23}^2 - R_{1.3}^2}{1 - R_{1.3}^2}$$

• $R_{1.23}^2$ is the $R^2$ from a multiple regression with 1 being Y and 2 and 3 being the predictor variables.
• $R_{1.3}^2$ is the $R^2$ from a simple regression with 1 being Y and 3 being X, the single predictor variable.

Semi-partial correlation

Semi-partial correlation removes the effects of additional variables from one of the variables under study (typically X):

$$r_{1(2.3)} = \frac{r_{12} - r_{13}\, r_{23}}{\sqrt{1 - r_{23}^2}}$$

Using multiple regressions:

$$r_{1(2.3)}^2 = R_{1.23}^2 - R_{1.3}^2$$

Uses of Partial and Semipartial

• The partial correlation is most often used when some third variable z is a plausible explanation of the
correlation between X and Y.
• The semipartial is most often used when we want to show that some variable adds incremental variance in Y above and beyond the other X variables.

Calculate semi-partial correlation in R: spcor() from the ppcor package

spcor(x, method = c("pearson", "kendall", "spearman"))
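A minimal sketch of both routes to the semi-partial correlation (it assumes the album2 data frame loaded in the multiple-regression example below; not part of the original handout):

library(ppcor)
spcor(album2[, c("sales", "adverts", "airplay", "attract")]) # semi-partial correlations among the album variables

# The squared semi-partial correlation of a predictor equals the drop in R^2
# when that predictor is removed from the full model
r2.full <- summary(lm(sales ~ adverts + airplay + attract, data = album2))$r.squared
r2.reduced <- summary(lm(sales ~ airplay + attract, data = album2))$r.squared
r2.full - r2.reduced # squared semi-partial correlation of adverts with sales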

Semi-partial correlation and regression coefficients

• Each regression coefficient in the regression model is the amount of change in the outcome variable
that would be expected per one-unit change of the predictor, if all other variables in the model
were held constant.

• Regression is essentially about semi-partials: Each X is residualized on the other X variables.


• For each X we add to the equation, we ask, “What is the unique contribution of this X above and
beyond the others?” Increment in R2 when added last.
• We do NOT residualize Y, just X.
• Semipartial correlation coefficient is conceptually close to standardized regression coefficient

$$r_{1(2.3)} = \frac{r_{12} - r_{13}\, r_{23}}{\sqrt{1 - r_{23}^2}} \qquad \beta_{1(2.3)} = \frac{r_{12} - r_{13}\, r_{23}}{1 - r_{23}^2}$$

• The difference is the square root in the denominator.


• The regression coefficient can exceed 1.0 in absolute value; the correlation cannot.

Perform multiple regression in R

Example

• A record company boss was interested in predicting record sales from advertising.
• Data: 200 different album releases
• Outcome variable:
– Sales (CDs and downloads) in the week after release (in units of 1000)
• Predictor variable:
– The amount (in units of £1000) spent promoting the record before release.
– Number of plays on the radio: number of times played on radio the week before release
– Attractiveness of the CD cover: expert rating with a 0-10 scale

album2 <- read.delim("Album Sales 2.dat", header = TRUE)


albumSales.2 <- lm(sales ~ adverts+airplay+attract, data = album2)
summary(albumSales.2)

##
## Call:
## lm(formula = sales ~ adverts + airplay + attract, data = album2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -121.324 -28.336 -0.451 28.967 144.132
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -26.612958 17.350001 -1.534 0.127
## adverts 0.084885 0.006923 12.261 < 2e-16 ***
## airplay 3.367425 0.277771 12.123 < 2e-16 ***
## attract 11.086335 2.437849 4.548 9.49e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 47.09 on 196 degrees of freedom
## Multiple R-squared: 0.6647, Adjusted R-squared: 0.6595
## F-statistic: 129.5 on 3 and 196 DF, p-value: < 2.2e-16

Interpretation of the model

• F(3, 196) = 129.5, p < .001: The model results in significantly better prediction than if we used the mean value of album sales.
• $\beta_1$ = 0.085: As advertising increases by 1 unit (£1000), album sales increase by 0.085 units (of 1000 sales).
• $\beta_2$ = 3.367: When the number of plays on the radio increases by 1 unit, sales increase by 3.367 units (of 1000 sales).
• $\beta_3$ = 11.086: When the attractiveness of the CD cover increases by 1 unit, sales increase by 11.086 units (of 1000 sales).

Standardised Beta Values

The standardized regression coefficients represent the change in response for a change of one standard
deviation in a predictor.

$$\beta_i^* = \beta_i \frac{s_{x_i}}{s_y}$$

You can calculate standardised beta values using the lm.beta() function from the QuantPsyc package.

library(QuantPsyc)
lm.beta(albumSales.2)

## adverts airplay attract


## 0.5108462 0.5119881 0.1916834

Alternatively, you can standardize the original raw data by converting each original data value to a z-score,
and then perform multiple linear regression using the standardized data. The obtained regression coefficients
are standardized.
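A minimal sketch of this approach, using the album2 data from above (the explicit column selection is illustrative, not part of the original handout):

album2.z <- data.frame(scale(album2[, c("sales", "adverts", "airplay", "attract")])) # convert every variable to z-scores
coef(lm(sales ~ adverts + airplay + attract, data = album2.z)) # the slopes are the standardized betas (about 0.511, 0.512, 0.192)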

Standardised Beta Values

• $\beta_1^*$ = 0.511: As advertising increases by 1 standard deviation (£485,655), album sales increase by 0.511 of a standard deviation (0.511 × 80,699).
• $\beta_2^*$ = 0.512: When the number of plays on the radio increases by 1 SD (12.27), sales increase by 0.512 standard deviations (0.512 × 80,699).
• $\beta_3^*$ = 0.192: When the attractiveness of the CD cover increases by 1 SD (1.40), sales increase by 0.192 standard deviations (0.192 × 80,699).

R2 and adjusted R2

• 𝑅2 : The proportion of variance accounted for by the model.


• 𝐴𝑑𝑗.𝑅2 : penalized by adjusting for the number of parameters in the model compared to the number
of observations.
– Wherry’s formula (Default output in R)

$$\text{Adj. } R^2 = 1 - \frac{(1 - R^2) \cdot (n - 1)}{n - v}$$

The adjusted $R^2$ increases when a new predictor is included only if that predictor improves $R^2$ more than would be expected by chance.
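For example, taking v to be the number of estimated parameters (k + 1, i.e. three predictors plus the intercept; the handout does not define v, so this is an assumption), Wherry's formula reproduces the adjusted R² that R reports for the full album model:

r2 <- summary(albumSales.2)$r.squared
n <- nrow(album2); k <- 3
1 - (1 - r2) * (n - 1) / (n - k - 1) # about 0.6595
summary(albumSales.2)$adj.r.squared # the value reported by summary()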

Parsimony-adjusted measures of fit

• Akaike Information Criterion (AIC) is a measure of fit which penalizes the model for having more
variables

$$AIC = n \ln\frac{SS_E}{n} + 2k$$

* The bigger the AIC, the worse the fit
* The smaller the AIC, the better the fit
* AIC is only useful for comparing models fitted to the same data and the same outcome variable

Calculating AIC in R: AIC(model). Note that the k argument of R's AIC() is the penalty per parameter (2 by default), not the number of predictors.
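For example, comparing the full album model with a reduced model fitted to the same data (a sketch):

AIC(albumSales.2) # full model: adverts + airplay + attract
AIC(update(albumSales.2, . ~ . - attract)) # attract dropped; a higher AIC means a worse trade-off of fit and complexity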

Types of Research Questions and Corresponding Regression Methods
Types of research questions that multiple regression can answer

• How well is a set of IVs able to predict a particular outcome (DV)?

• Which IV is the best predictor of an outcome?
• Whether a particular predictor variable is still able to predict an outcome when the effect of another variable is controlled for?

Methods of Regression

• Forced Entry: All predictors are entered simultaneously.


• Hierarchical: Experimenter decides the order in which variables are entered into the model.
• Stepwise: Predictors are selected using their semi-partial correlation with the outcome.
– Forward/backward/both
• All-subsets methods (best-subset methods)

Forced Entry Regression

• All variables are entered into the model simultaneously.


• Identifies the strongest predictor variable within the model
• Some researchers have argued that this is the only appropriate method for theory testing (Studenmund & Cassidy, 1987)

Interpretation of Results

• The F-test: It tells us whether using the regression model is significantly better at predicting values of
the outcome than using the mean.
• Beta values: the change in the outcome associated with a unit change in the predictor.
• Standardised beta values: tell us the same but expressed as standard deviations.

Hierarchical Regression

• Predictors are entered into the regression model in the order specified by the researcher, based on past research.
• New predictors are then entered in a separate step/block.
• Each IV is assessed in terms of what it adds to the prediction of DV after the previous IVs have been
controlled for.
• Overall model and relative contribution of each block of variables is assessed
– F test of 𝑅2 change.

Example

If we control for the possible effect of promotion budget, are airplay and CD cover design still able to predict
a significant amount of the variance in CD sales?
Examine individual models

album.adv.only <- lm(sales~adverts, data = album2)


album.full <- lm(sales~adverts+attract+airplay, data = album2)
summary(album.adv.only)

##
## Call:
## lm(formula = sales ~ adverts, data = album2)
##
## Residuals:
## Min 1Q Median 3Q Max

## -152.949 -43.796 -0.393 37.040 211.866
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.341e+02 7.537e+00 17.799 <2e-16 ***
## adverts 9.612e-02 9.632e-03 9.979 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 65.99 on 198 degrees of freedom
## Multiple R-squared: 0.3346, Adjusted R-squared: 0.3313
## F-statistic: 99.59 on 1 and 198 DF, p-value: < 2.2e-16

summary.lm(album.full)

##
## Call:
## lm(formula = sales ~ adverts + attract + airplay, data = album2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -121.324 -28.336 -0.451 28.967 144.132
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -26.612958 17.350001 -1.534 0.127
## adverts 0.084885 0.006923 12.261 < 2e-16 ***
## attract 11.086335 2.437849 4.548 9.49e-06 ***
## airplay 3.367425 0.277771 12.123 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 47.09 on 196 degrees of freedom
## Multiple R-squared: 0.6647, Adjusted R-squared: 0.6595
## F-statistic: 129.5 on 3 and 196 DF, p-value: < 2.2e-16

Compare nested regression models

SSE.adv.only <- sum((fitted(album.adv.only)-album2$sales)^2)


SSM.adv.only <- sum((fitted(album.adv.only)-mean(album2$sales))^2)
SSE.full <- sum((fitted(album.full)-album2$sales)^2)
SSM.full <- sum((fitted(album.full)-mean(album2$sales))^2)
cbind(SSM.adv.only, SSE.adv.only)

## SSM.adv.only SSE.adv.only
## [1,] 433687.8 862264.2

cbind(SSM.full, SSE.full)

## SSM.full SSE.full
## [1,] 861377.4 434574.6

F=((SSM.full-SSM.adv.only)/2)/(SSE.full/196)
F

## [1] 96.44738

Under the null hypothesis, 𝐹 ∼ 𝐹(2,196).

anova(album.adv.only, album.full) # Note: for anova(model1, model2) to give a meaningful comparison, all predictors in model1 must also be included in model2 (the models must be nested)

## Analysis of Variance Table


##
## Model 1: sales ~ adverts
## Model 2: sales ~ adverts + attract + airplay
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 198 862264
## 2 196 434575 2 427690 96.447 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
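The R² change for the second block (reported in the write-up below) is simply the difference between the two models' R² values (a small sketch):

summary(album.full)$r.squared - summary(album.adv.only)$r.squared # about 0.33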

Note on hierarchical regression

Build the model

• A hierarchical regression can have as many blocks as there are groups of independent variables, i.e. the
analyst can specify a hypothesis that specifies an exact order of entry for variables.

• A more common hierarchical regression specifies two blocks of variables: a set of control variables
entered in the first block and a set of predictor variables entered in the second block.
– Control variables are often demographics which are thought to make a difference in scores on
the dependent variable.
– Predictors are the variables in whose effect our research question is really interested, but whose
effect we want to separate out from the control variables.

Evaluate the models

• Support for a hierarchical hypothesis would be expected to require statistical significance for the
addition of each block of variables.
• However, many times we want to exclude the effect of blocks of variables previously entered into the analysis, whether or not a previous block was statistically significant. Here the aim is to obtain the best indicator of the effect of the predictor variables, and the statistical significance of previously entered variables is not interpreted.
• The latter strategy is also widely adopted in research.

Interpret hierarchical regression

• The 𝑅2 change, i.e. the increase when the predictor variables are added to the analysis, is interpreted rather than the overall R² for the model with all variables entered.
• In the interpretation of individual relationships, the relationship between the predictors and the de-
pendent variable is presented.
• Similarly, in the validation analysis, we are only concerned with verifying the significance of the pre-
dictor variables. Differences in control variables are often ignored.

Reporting Hierarchical Regression

• Report the procedure to perform the regression

Hierarchical multiple regression was performed to investigate the ability of airplay and CD cover design to predict the variance in CD sales, after controlling for the possible effect of promotion budget.

In the first step of the hierarchical multiple regression, advertisement budget was entered. This model was statistically significant (F(1, 198) = 99.59; p < .001) and explained 33% of the variance in CD sales.

• Report 𝑅2 change and its significant test results after the entry of each new block

After the entry of airplay and CD cover design at Step 2, the total variance explained by the model as a whole was 66% (F(3, 196) = 129.5; p < .001). The introduction of airplay and CD cover design explained an additional 33% of the variance in CD sales, after controlling for advertisement budget (F(2, 196) = 96.45; p < .001).

• Interpret the final model

Source: Park, N., Kee, K. F., & Valenzuela, S. (2009). Being immersed in social networking environment:
Facebook groups, uses and gratifications, and social outcomes. CyberPsychology & Behavior, 12, 729-733.

Stepwise Regression

• Variables are entered into the model based on mathematical criteria.


• Forward regression as an example
– R looks for the predictor that can explain the most variance in the outcome variable.
– Having selected the first predictor, a second one is chosen from the remaining predictors based on some prespecified criterion.
∗ The semi-partial correlation is often used as a criterion for selection.
∗ Alternative model selection criteria include adjusted $R^2$ and AIC.
– The procedure stops when the model cannot be improved significantly by adding any new variables.
– In R, the step() function chooses a model by AIC.
• The backward method is the opposite
– All predictors enter the model at first
– Remove each variable and see if the AIC goes up
• The "both" method (called stepwise by some other programs)
– Starts in the same way as the forward method
– Each time a predictor is added to the equation, a removal test is made of the least useful predictor
• Difference: in forward selection, a variable is never removed once it is in the model; in stepwise selection, the variable entered in the first step might not be included in the second step.

Stepwise Regression in R

step(album.full, direction = "backward")

## Start: AIC=1544.76
## sales ~ adverts + attract + airplay
##
## Df Sum of Sq RSS AIC
## <none> 434575 1544.8

## - attract 1 45853 480428 1562.8
## - airplay 1 325860 760434 1654.7
## - adverts 1 333332 767907 1656.6

##
## Call:
## lm(formula = sales ~ adverts + attract + airplay, data = album2)
##
## Coefficients:
## (Intercept) adverts attract airplay
## -26.61296 0.08488 11.08634 3.36743

album.min = lm (sales ~ 1, data = album2)


step(album.min, direction = "forward", scope = ~ adverts + airplay + attract)

## Start: AIC=1757.29
## sales ~ 1
##
## Df Sum of Sq RSS AIC
## + airplay 1 464863 831089 1670.4
## + adverts 1 433688 862264 1677.8
## + attract 1 137822 1158130 1736.8
## <none> 1295952 1757.3
##
## Step: AIC=1670.44
## sales ~ airplay
##
## Df Sum of Sq RSS AIC
## + adverts 1 350661 480428 1562.8
## + attract 1 63182 767907 1656.6
## <none> 831089 1670.4
##
## Step: AIC=1562.82
## sales ~ airplay + adverts
##
## Df Sum of Sq RSS AIC
## + attract 1 45853 434575 1544.8
## <none> 480428 1562.8
##
## Step: AIC=1544.76
## sales ~ airplay + adverts + attract

##
## Call:
## lm(formula = sales ~ airplay + adverts + attract, data = album2)
##
## Coefficients:
## (Intercept) airplay adverts attract
## -26.61296 3.36743 0.08488 11.08634

step(album.min, direction = "both", scope = ~ adverts + airplay + attract)

## Start: AIC=1757.29
## sales ~ 1
##
## Df Sum of Sq RSS AIC
## + airplay 1 464863 831089 1670.4
## + adverts 1 433688 862264 1677.8
## + attract 1 137822 1158130 1736.8
## <none> 1295952 1757.3
##
## Step: AIC=1670.44
## sales ~ airplay
##
## Df Sum of Sq RSS AIC
## + adverts 1 350661 480428 1562.8
## + attract 1 63182 767907 1656.6
## <none> 831089 1670.4
## - airplay 1 464863 1295952 1757.3
##
## Step: AIC=1562.82
## sales ~ airplay + adverts
##
## Df Sum of Sq RSS AIC
## + attract 1 45853 434575 1544.8
## <none> 480428 1562.8
## - adverts 1 350661 831089 1670.4
## - airplay 1 381836 862264 1677.8
##
## Step: AIC=1544.76
## sales ~ airplay + adverts + attract
##
## Df Sum of Sq RSS AIC
## <none> 434575 1544.8
## - attract 1 45853 480428 1562.8
## - airplay 1 325860 760434 1654.7
## - adverts 1 333332 767907 1656.6

##
## Call:
## lm(formula = sales ~ airplay + adverts + attract, data = album2)
##
## Coefficients:
## (Intercept) airplay adverts attract
## -26.61296 3.36743 0.08488 11.08634

Limitations with Stepwise Methods

• Rely on a mathematical criterion.


• Variable selection may depend upon only slight differences in the semi-partial correlation.
• These slight numerical differences can lead to major theoretical differences and create nonsensical
results!
• The computer cannot distinguish spurious correlations or make judgments regarding multi-collinearity
• Stepwise regression is a completely non-theoretical approach to prediction
• Should be used only for exploration or for predictive research

All-subsets-regressions (Best-subset)

• A procedure that considers all possible regression models given the set of potentially important pre-
dictors
• Model selection criteria:
– 𝑅2 . Find a subset model so that adding more variables will yield only small increases in R-squared
– Adjusted R2.
– MSE criterion
– Mallows' Cp criterion
– Other: PRESS, predicted 𝑅2 (which is calculated from the PRESS statistic)…

Mallows' Cp

• An underspecified model is a model in which important predictors are missing. An underspecified model yields biased regression coefficients and biased predictions of the response.
• Mallows' Cp statistic estimates the size of the bias that is introduced into the predicted responses by having an underspecified model.

$$C_p = \frac{SSE_k}{MSE_T} - \bigl(n - 2(k+1)\bigr)$$

* k = number of independent variables included in a particular regression model


* T = total number of parameters to be estimated in the full regression model
* $SSE_k$ = residual sum of squares for the particular regression model with k predictors
* $MSE_T$ = mean square error of the full model

• When the Cp value is …


– … near k+1, the bias is small (next to none)
– … much greater than k+1, the bias is substantial
– … below k+1, it is due to sampling error; interpret as no bias

**Strategy for using Cp to identify “best” models**

• Identify subsets of predictors for which the Cp value is near k+1 (if possible).
– The full model always yields Cp = k+1, so don’t select the full model based on Cp.
• If all models, except the full model, yield a large Cp not near k+1, it suggests some important predic-
tor(s) are missing from the analysis. In this case, we are well-advised to identify the predictors that
are missing!
• If a number of models have Cp near k+1, choose the model with the smallest Cp value, thereby ensuring that the combination of the bias and the variance is at a minimum.
• When more than one model has a small value of Cp value near k+1, in general, choose the simpler
model or the model that meets your research needs.

All-subsets-regression in R

library(leaps)
album.subsets <- regsubsets(sales~adverts+attract+airplay, data = album2)
summary(album.subsets)

## Subset selection object
## Call: regsubsets.formula(sales ~ adverts + attract + airplay, data = album2)
## 3 Variables (and intercept)
## Forced in Forced out
## adverts FALSE FALSE
## attract FALSE FALSE
## airplay FALSE FALSE
## 1 subsets of each size up to 3
## Selection Algorithm: exhaustive
## adverts attract airplay
## 1 ( 1 ) " " " " "*"
## 2 ( 1 ) "*" " " "*"
## 3 ( 1 ) "*" "*" "*"

plot(album.subsets, scale = "Cp")

[Plot: candidate subsets ranked by Mallows' Cp, showing which of the predictors (adverts, attract, airplay) enter each subset.]

plot(album.subsets, scale = "adjr2")

[Plot: candidate subsets ranked by adjusted $R^2$, showing which of the predictors (adverts, attract, airplay) enter each subset.]

Options for plot() are r2, bic, Cp, and adjr2.
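The numeric criteria behind these plots can also be pulled directly from the summary object (a minimal sketch):

subset.stats <- summary(album.subsets)
data.frame(n.predictors = 1:3, cp = subset.stats$cp, adjr2 = subset.stats$adjr2) # one row per best subset of each size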

• It is important to note that no single criterion can determine which model is best.
• The different criteria quantify different aspects of the regression model, and therefore often yield different choices for the best set of predictors.
• Subsets regression is best used as a screening tool to reduce the large number of possible regression models to just a handful for further evaluation.
• Further evaluation and refinement might entail performing residual analyses, transforming the predictors and/or response, adding interaction terms, and so on, until you are satisfied with the model that best summarizes the trend in the data and allows you to answer your research question.

Summary of Data-Driven Regression Methods

• Model selection statistics are generally not used blindly, but rather information about the field of
application, the intended use of the model, and any known biases in the data are taken into account
in the process of model selection.
• More suitable for exploratory model building
• Better to cross-validate the model with new data

Assumptions of Regression

Straightforward Assumptions

• Variable Type:

– Outcome must be continuous
– Predictors can be continuous or dichotomous.
• Non-Zero Variance: Predictors must not have zero variance.
• Linearity: The relationship we model is, in reality, linear.
• Homoscedasticity: For each value of the predictors the variance of the error term should be constant.
• Independence: All values of the outcome should come from different persons.

Residual Analysis: check assumptions

• Check the assumptions of regression by examining the residuals (see the sketch after this list)


– Examine for linearity assumption
– Examine for constant variance for all levels of X (homoscedasticity)

– Evaluate normal distribution assumption


– Evaluate independence assumption
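A minimal sketch of this in R: the plot() method for lm objects produces the standard diagnostic plots (residuals vs. fitted, normal Q-Q, scale-location, and residuals vs. leverage), which cover most of the checks above.

par(mfrow = c(2, 2)) # arrange the four diagnostic plots in a 2 x 2 grid
plot(album.full)
par(mfrow = c(1, 1)) # restore the default layout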

The More Tricky Assumptions
• No multicollinearity: Predictors must not be highly correlated.
• Independent Errors: For any pair of observations, the error terms should be uncorrelated.
– Tested with Durbin-Watson test
– Values range from 0 to 4; a value of 2 means the errors are uncorrelated
• Normally-distributed Errors

Multicollinearity

• Multicollinearity exists if predictors are highly correlated.


• This assumption can be checked with collinearity diagnostics.
• Variance inflation factor (VIF): $VIF_i = \frac{1}{1 - R_i^2}$, where $R_i^2$ is the $R^2$ from regressing predictor i on the other predictors.

– VIF > 10 are worthy of concern (Myers 1990)


– VIF > 5 are worthy of concern (Menard 1995)
– Average VIF substantially greater than 1: the regression may be biased (Bowerman & O'Connell, 1990)
• Tolerance statistic: 1/VIF

library(car)
vif(album.full)

## adverts attract airplay


## 1.014593 1.038455 1.042504

Testing independence in R

• Using the Durbin-Watson test

durbinwatsonTest(model) or dwt(model)

dwt(album.full)

## lag Autocorrelation D-W Statistic p-value


## 1 0.0026951 1.949819 0.68
## Alternative hypothesis: rho != 0

Sample Size

• Sample size: results from small samples do not generalize

– 15 cases per predictor (Stevens, 1996)
– Formula for the required sample size (Tabachnick & Fidell, 2007):

$$N > 50 + 8k$$

∗ N = number of participants
∗ k = number of IVs

Finding outliers and influential cases

Outliers and influential cases

• An observation that is unconditionally unusual in Y value is called an outlier, but it is not necessarily
a regression outlier
• An observation that has an unusual X value—i.e., it is far from the mean of X—has leverage on (i.e.,
the potential to influence) the regression line
• Influential cases: an unusual X-value with an unusual Y-value given its X-value
• The olsrr package offers a number of tools to detect influential observations. For more on olsrr, see the package's online introduction.

Residuals and standardized residuals

• Unstandardized residuals
• Standardized residuals:
1. those >3 are cause of concern
2. If more than 1% of cases > 2.5, the level of error within model is unacceptable
3. If more than 5% of cases > 2, the level of error within model is unacceptable
• Estimating the outliers in R
– unstandardized residuals: resid()

– standardized residuals: rstandard()

album2$standardized.residual <- rstandard(album.full)


library(dplyr)
large.standardized.residual <- filter(album2, standardized.residual > 2 | standardized.residual < -2)

large.standardized.residual

## adverts sales airplay attract standardized.residual


## 1 10.256 330 43 10 2.177404
## 2 985.685 120 28 7 -2.323083
## 3 174.093 300 40 7 2.130289
## 4 102.568 40 25 8 -2.460996
## 5 405.913 190 12 4 2.099446
## 6 1542.329 190 33 8 -2.455913
## 7 579.321 300 30 7 2.104079
## 8 56.895 70 37 7 -2.363549
## 9 1000.000 250 5 7 2.095399
## 10 9.104 120 53 8 -2.628814
## 11 145.585 360 42 8 3.093333
## 12 785.694 110 20 9 -2.088044
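The rules of thumb listed above can be checked directly (a small sketch):

rstd <- rstandard(album.full)
mean(abs(rstd) > 2) # proportion beyond +/-2; the rule of thumb is below about 5%
mean(abs(rstd) > 2.5) # proportion beyond +/-2.5; the rule of thumb is below about 1%
sum(abs(rstd) > 3) # cases of serious concern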

library(olsrr)
ols_plot_resid_stand(album.full)

[Standardized residuals chart: standardized residuals plotted against observation number with threshold lines at ±2; the cases listed above fall outside the threshold.]

DFBeta

• DFBeta: the difference between a parameter estimated using all cases and estimated when one case is
excluded

Belsley, Kuh, and Welsch recommend 2 as a general cutoff value to indicate influential observations and $2/\sqrt{n}$ as a size-adjusted cutoff.

For our sample, $2/\sqrt{200} = 0.14$.

album2$dfbeta <- dfbeta(album.full)


head(album2$dfbeta)

## (Intercept) adverts attract airplay


## 1 -5.42182707 -1.661591e-03 0.8529699235 0.0433929166
## 2 0.21601702 -8.649690e-04 -0.0450304095 0.0025870806
## 3 -0.65851797 1.207436e-03 -0.0130879018 0.0128983716
## 4 -0.04480869 8.441700e-05 0.0003156056 0.0009589848
## 5 -0.14928350 7.552860e-06 0.0331263834 -0.0039693382
## 6 1.14345654 1.554094e-05 -0.1251265742 -0.0057924331

ols_plot_dfbetas(album.full)

[DFBETAS panels ("Influence Diagnostics") for (Intercept), adverts, attract, and airplay: DFBETAS plotted against observation number with threshold lines at ±0.14; a few cases exceed the threshold in each panel.]

DFFits

• DFFit: the difference between the predicted value for a case when the model is calculated including that case and when the model is calculated excluding that case.
• An observation is deemed influential if the absolute value of its DFFITS value is greater than

$$2\sqrt{\frac{k+1}{n}}$$

where n is the number of observations and k is the number of predictors. For our sample, this equals

$$2\sqrt{\frac{3+1}{200}} = 0.28$$
album2$dffit <- dffits(album.full)
head(album2$dffit)

## [1] 0.48929398 -0.21109830 0.21418431 0.01688873 -0.02020169 0.07410797

##large.standardized.residual <- filter(album2, dffit > 0.28 | dffit < -0.28)


##large.standardized.residual

ols_plot_dffits(album.full)

[DFFITS chart ("Influence Diagnostics for sales"): DFFITS plotted against observation number with threshold lines at ±0.28; a handful of cases exceed the threshold.]

Cook’s D

• Cook's distance: the impact that a case has on the model's ability to predict all cases

$$D_i = \frac{\sum_{j=1}^{n}\left(\hat{Y}_j - \hat{Y}_{j(i)}\right)^2}{k\, MSE}$$

• Since Cook's distance is in the metric of an F distribution with p and n − p degrees of freedom, the median point of $F(p,\, n-p)$ can be used as a cut-off.
• For large n, a simple cutoff value of 1 can be used (Weisberg, 1982).

album2$cooks.distance <- cooks.distance(album.full)


plot(album2$cooks.distance)
[Scatter plot of Cook's distance against observation index; all values are below about 0.07.]

ols_plot_cooksd_chart(album.full)

[Cook's D chart: Cook's distance plotted against observation number with a threshold line at 0.02; cases such as 1, 164, and 169 stand out.]

Hat values

• Leverage: hat value


– the diagonal elements of the hat matrix $H = X(X^T X)^{-1} X^T$
– $h_{ii} = \partial \hat{y}_i / \partial y_i$
– Average: (k+1)/n
• Criteria
– 0 indicates no influence; 1 indicates the case has complete influence
– Values greater than 2 times the average are cause of concern (Welsch 1978)
– Values greater than 3 times the average are cause of concern (Stevens 2002)

album2$hatvalues <- hatvalues(album.full)


plot(album2$hatvalues)
abline(h = 2*(3+1)/200, lty = 2, col = "green")
abline(h = 3*(3+1)/200, lty = 2, col = "red")

[Scatter plot of hat values against observation index, with dashed reference lines at 2(k+1)/n = 0.04 (green) and 3(k+1)/n = 0.06 (red).]

Model Building and Validation

• In data-driven research, the final step in the model-building process is to validate the selected regression
model.
– Collect new data and compare the results.
– Cross-validation: If the data set is large, split the data into two parts and cross-validate the results (often 80-20); a sketch follows below.
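A minimal sketch of an 80-20 split for the album model (the seed and split ratio are illustrative choices, not from the original handout):

set.seed(111) # illustrative seed for a reproducible split
train.rows <- sample(seq_len(nrow(album2)), size = 0.8 * nrow(album2))
album.train <- album2[train.rows, ]
album.test <- album2[-train.rows, ]
cv.model <- lm(sales ~ adverts + airplay + attract, data = album.train)
pred <- predict(cv.model, newdata = album.test)
cor(pred, album.test$sales)^2 # cross-validated R^2 on the held-out 20%
summary(cv.model)$r.squared # R^2 on the training 80%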


Categorical predictors and multiple regression
• Often you may have categorical variables (e.g., gender, major) which you want to include as predictors
in regression.
• To do that, you need to code the categorical predictors with dummy variables
– Create k-1 dummy variables
– Choose the baseline/control group. If you don't have a specific control group, choose the group that represents the majority of people.
– Assign the baseline values of 0 for all of your dummy variables.
– For your first dummy variable, assign the value 1 to the first group that you want to compare
against the baseline group. Assign all other groups 0 for this variable.
– For your second dummy variable, assign the value 1 to the second group that you want to compare
against the baseline group. Assign all other groups 0 for this variable.
– Repeat this until you run out of dummy variables
– Put the dummy variables into the regression model (a small sketch of the resulting coding follows this list)
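For instance (a hypothetical three-level factor, not from the original example), model.matrix() shows the k − 1 dummy variables that R constructs for a factor predictor:

major <- factor(c("CS", "CS", "Math", "Physics", "Math")) # hypothetical categorical predictor
model.matrix(~ major) # the first level ("CS") is the baseline; two dummy columns are created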

Example

Salaries is a data frame that comes with the carData package. It includes 397 observations on the 2008-09 nine-month academic salary for Assistant Professors, Associate Professors, and Professors in a college in the U.S.

• rank: a factor with levels AssocProf AsstProf Prof


• discipline: a factor with levels A (“theoretical” departments) or B (“applied” departments).
• yrs.since.phd: years since PhD.
• yrs.service: years of service.
• sex: a factor with levels Female Male
• salary: nine-month salary, in dollars.

library(car)
data("Salaries", package = "carData")
str(Salaries)

## 'data.frame': 397 obs. of 6 variables:


## $ rank : Factor w/ 3 levels "AsstProf","AssocProf",..: 3 3 1 3 3 2 3 3 3 3 ...
## $ discipline : Factor w/ 2 levels "A","B": 2 2 2 2 2 2 2 2 2 2 ...
## $ yrs.since.phd: int 19 20 4 45 40 6 30 45 21 18 ...
## $ yrs.service : int 18 16 3 39 41 6 23 45 20 18 ...
## $ sex : Factor w/ 2 levels "Female","Male": 2 2 2 2 2 2 2 2 2 1 ...
## $ salary : int 139750 173200 79750 115000 141500 97000 175000 147765 119250 129000 ...

Regression with Interval Variables

salaryModel1 <- lm(salary ~ yrs.service, data = Salaries)


summary(salaryModel1)

##
## Call:
## lm(formula = salary ~ yrs.service, data = Salaries)

##
## Residuals:
## Min 1Q Median 3Q Max
## -81933 -20511 -3776 16417 101947
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 99974.7 2416.6 41.37 < 2e-16 ***
## yrs.service 779.6 110.4 7.06 7.53e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 28580 on 395 degrees of freedom
## Multiple R-squared: 0.1121, Adjusted R-squared: 0.1098
## F-statistic: 49.85 on 1 and 395 DF, p-value: 7.529e-12

Coding categorical variables with two levels

If we want to examine whether gender plays a role in determining professors' salaries, we would like to put it into the regression model.
Note that the sex variable in this dataset already has contrasts set, with Female as the baseline group:

contrasts(Salaries$sex)

## Male
## Female 0
## Male 1

You can use the relevel() function to set the baseline category to males as follows:

Salaries <- Salaries %>%


mutate(sex = relevel(sex, ref = "Male"))
contrasts(Salaries$sex)

## Female
## Male 0
## Female 1

Regression with Dummy Variables

The fact that the coefficient for sexFemale in the regression output is negative indicates that being a Female
is associated with a constant decrease (i.e., smaller intercept in regression) in salary (relative to Males).

salaryModel2 <- lm(salary ~ yrs.service + sex, data = Salaries)


summary(salaryModel2)

##
## Call:
## lm(formula = salary ~ yrs.service + sex, data = Salaries)
##

## Residuals:
## Min 1Q Median 3Q Max
## -81757 -20614 -3376 16779 101707
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 101428.7 2531.9 40.060 < 2e-16 ***
## yrs.service 747.6 111.4 6.711 6.74e-11 ***
## sexFemale -9071.8 4861.6 -1.866 0.0628 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 28490 on 394 degrees of freedom
## Multiple R-squared: 0.1198, Adjusted R-squared: 0.1154
## F-statistic: 26.82 on 2 and 394 DF, p-value: 1.201e-11

• The interaction term indicates whether females' salaries grow with years of service at the same rate as males' salaries do.

salaryModel3 <- lm(salary ~ yrs.service+yrs.service*sex, data = Salaries)


summary(salaryModel3)

##
## Call:
## lm(formula = salary ~ yrs.service + yrs.service * sex, data = Salaries)
##
## Residuals:
## Min 1Q Median 3Q Max
## -80381 -20258 -3727 16353 102536
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 102197.1 2563.7 39.863 < 2e-16 ***
## yrs.service 705.6 113.7 6.205 1.39e-09 ***
## sexFemale -20128.6 7991.1 -2.519 0.0122 *
## yrs.service:sexFemale 931.7 535.2 1.741 0.0825 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 28420 on 393 degrees of freedom
## Multiple R-squared: 0.1266, Adjusted R-squared: 0.1199
## F-statistic: 18.98 on 3 and 393 DF, p-value: 1.622e-11

Code categorical variables with more than two levels

• Note that the levels of this factor are ordered as level 1 = AsstProf, level 2 = AssocProf, level 3 = Prof (by default, R orders the levels of a factor created from character data alphabetically)

contrasts(Salaries$rank) <- contr.treatment(3, base = 1)


contrasts(Salaries$rank)

## 2 3

## AsstProf 0 0
## AssocProf 1 0
## Prof 0 1

• Create the dummy variables by setting contrasts


– The contr.treatment() function sets a contrast based on comparing all groups to a baseline con-
dition.
– contr.treatment(number of groups, base = number representing the baseline groups)

Regression with dummy variables

salaryModel4 <- lm(salary ~ yrs.service + sex + rank, data = Salaries)


summary(salaryModel4)

##
## Call:
## lm(formula = salary ~ yrs.service + sex + rank, data = Salaries)
##
## Residuals:
## Min 1Q Median 3Q Max
## -64500 -15111 -1459 11966 107011
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 82081.5 2974.1 27.598 < 2e-16 ***
## yrs.service -171.8 115.3 -1.490 0.13694
## sexFemale -5468.7 4035.3 -1.355 0.17613
## rank2 14702.9 4266.6 3.446 0.00063 ***
## rank3 48980.2 3991.8 12.270 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 23580 on 392 degrees of freedom
## Multiple R-squared: 0.4, Adjusted R-squared: 0.3938
## F-statistic: 65.32 on 4 and 392 DF, p-value: < 2.2e-16

Summary
• Different types of regression methods are developed for theory-driven or data-driven research purposes.
• To compare nested models, F-test and AIC are often used.
• To compare non-nested models, more general goodness-of-fit indices are often used, including Mallows' Cp.
• It is possible for a single observation to have a great influence on the results of a regression analysis.
Methods to detect such observations include DFBeta, DFFits, Cook’s D, and hat values.
• Categorical variables can be included in linear regression models after being coded using a set of dummy variables.
