Ch08 - Linear Regression
Slide 2
What is Regression?
• A way of predicting the value of one
variable from another.
– It is a hypothetical model of the relationship
between two variables.
– The model used is a linear one.
– Therefore, we describe the relationship using
the equation of a straight line.
Slide 3
Describing a Straight Line
Yi = b0 + b1Xi + εi
• b1
– Regression coefficient for the predictor
– Gradient (slope) of the regression line
– Direction/Strength of Relationship
• b0
– Intercept (value of Y when X = 0)
– Point at which the regression line crosses the Y-
axis (ordinate)
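As a rough illustration of this equation (not part of the original slides), the sketch below generates data from a straight-line model; the values chosen for b0, b1 and the error term are made up.

```python
import numpy as np

# Minimal sketch of Yi = b0 + b1*Xi + ei with made-up values.
rng = np.random.default_rng(42)

b0, b1 = 50.0, 0.1               # assumed intercept and gradient
X = rng.uniform(0, 1000, 200)    # predictor values
e = rng.normal(0, 65, 200)       # error term
Y = b0 + b1 * X + e              # outcome generated from the linear model
```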
Slide 4
Intercepts and Gradients
The Method of Least Squares
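A minimal sketch of the least-squares estimates for a single predictor, using the standard closed-form formulas; the example data are made up.

```python
import numpy as np

def least_squares_line(x, y):
    """OLS estimates of b0 and b1 for the model y = b0 + b1*x."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    # Slope: covariance of x and y divided by the variance of x
    b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    # Intercept: the fitted line passes through (mean of x, mean of y)
    b0 = y.mean() - b1 * x.mean()
    return b0, b1

print(least_squares_line([1, 2, 3, 4, 5], [2.1, 4.3, 5.9, 8.2, 10.1]))
```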
Slide 6
How Good is the Model?
• The regression line is only a model
based on the data.
• This model might not reflect reality.
– We need some way of testing how well
the model fits the observed data.
– How?
Slide 7
Sums of Squares
Slide 8
Summary
• SST
– Total variability (variability between scores and the
mean).
• SSR
– Residual/Error variability (variability between the
regression model and the actual data).
• SSM
– Model variability (difference in variability between
the model and the mean).
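A short sketch (not from the slides) of how the three sums of squares could be computed for any outcome y and model predictions y_hat; the data are made up.

```python
import numpy as np

def sums_of_squares(y, y_hat):
    """SST, SSM and SSR for observed values y and model predictions y_hat."""
    y, y_hat = np.asarray(y, float), np.asarray(y_hat, float)
    sst = np.sum((y - y.mean()) ** 2)      # total variability around the mean
    ssm = np.sum((y_hat - y.mean()) ** 2)  # improvement due to the model
    ssr = np.sum((y - y_hat) ** 2)         # residual/error variability
    return sst, ssm, ssr

# Made-up data and a least-squares line fitted to it
x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([2.1, 4.3, 5.9, 8.2, 10.1])
b1, b0 = np.polyfit(x, y, 1)               # slope, intercept
sst, ssm, ssr = sums_of_squares(y, b0 + b1 * x)
print(sst, ssm, ssr)                       # for an OLS fit, SST = SSM + SSR
```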
Slide 9
Testing the Model: ANOVA
SST = SSM + SSR
• SST: total variance in the data
• SSM: improvement due to the model
• SSR: error in the model
Slide 10
Testing the Model: ANOVA
• Mean Squared Error
– Sums of Squares are total values.
– They can be expressed as averages.
– These are called Mean Squares, MS
F = MSM / MSR (the model mean square divided by the residual mean square)
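A sketch of the mean squares and F-ratio, assuming n cases and k predictors (so the residual degrees of freedom are n − k − 1); the numbers passed in are illustrative only.

```python
def f_ratio(ssm, ssr, n, k):
    """F = MSM / MSR for a model with k predictors fitted to n cases."""
    ms_m = ssm / k              # model mean square (df = k)
    ms_r = ssr / (n - k - 1)    # residual mean square (df = n - k - 1)
    return ms_m / ms_r

# Illustrative values only
print(f_ratio(ssm=120.0, ssr=80.0, n=50, k=1))
```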
Slide 11
Testing the Model: R2
• R2
– The proportion of variance accounted for by
the regression model.
– The Pearson Correlation Coefficient Squared
R2 = SSM / SST
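A small check, with made-up data, that R2 = SSM / SST equals the squared Pearson correlation for a single-predictor model.

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5, 6], dtype=float)
y = np.array([2.0, 4.1, 5.8, 8.3, 9.9, 12.2])

b1, b0 = np.polyfit(x, y, 1)     # least-squares slope and intercept
y_hat = b0 + b1 * x

ss_t = np.sum((y - y.mean()) ** 2)
ss_m = np.sum((y_hat - y.mean()) ** 2)

print(ss_m / ss_t)                    # R^2 = SSM / SST
print(np.corrcoef(x, y)[0, 1] ** 2)   # squared Pearson r -- the same value
```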
Slide 12
Outliers and residuals
• An outlier is a case that differs
substantially from the main trend of the
data
• The green line shows the original model,
and the red line shows the model with the
outlier included. The outlier has a
dramatic effect on the regression model:
the line becomes flatter (i.e., b1 is smaller)
and the intercept increases (i.e., b0 is
larger)
• Examine residuals to look for outliers
• These residuals represent the error present
in the model. If a model fits the sample
data well then all residuals will be small.
Also, if any cases stand out as having a
large residual, then they could be outliers.
• The unstandardized residuals described above are measured in the same units as the outcome variable, so they are difficult to interpret across different models.
• We cannot define a universal cut-off point for what constitutes a large residual.
• The solution is to use standardized residuals: the residuals converted to z-scores.
• 1.96 cuts off the top 2.5% of the distribution.
• −1.96 cuts off the bottom 2.5% of the
distribution.
• As such, 95% of z-scores lie between −1.96 and
1.96.
• 99% of z-scores lie between −2.58 and 2.58,
• 99.9% of them lie between −3.29 and 3.29.
Standardized Residuals
• In an average sample, 95% of standardized residuals should lie between ±2.
• 99% of standardized residuals should lie between ±2.5.
• Outliers
– Any case for which the absolute value of
the standardized residual is 3 or more, is
likely to be an outlier.
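A rough sketch of these rules of thumb. SPSS computes standardized residuals by dividing each residual by the residual standard error; dividing by the residuals' standard deviation, as below, is only an approximation, and the residual values are made up.

```python
import numpy as np

def standardized_residuals(residuals):
    """Approximate standardized residuals: raw residuals divided by their SD."""
    residuals = np.asarray(residuals, dtype=float)
    return residuals / residuals.std(ddof=1)

res = np.array([1.2, -0.8, 2.5, -1.6, 0.3, -1.1, 0.9, -1.4])   # made-up residuals
z = standardized_residuals(res)

print(np.where(np.abs(z) >= 3)[0])   # indices of cases likely to be outliers
print(np.mean(np.abs(z) < 2))        # proportion within +/-2 (about .95 is expected)
```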
Slide 18
Influential cases
• Look at whether certain cases exert undue influence over the parameters of the model.
• If we were to delete a certain case, would we obtain different regression coefficients?
• This type of analysis can help to determine
whether the regression model is stable across the
sample, or whether it is biased by a few
influential cases. Again, this process will unveil
outliers.
• The adjusted predicted value is the predicted value for a case when that case is excluded from the analysis. In effect, SPSS calculates a new model without that case and then uses this new model to predict the value of the outcome variable for the excluded case.
• We can also look at the residual based on the
adjusted predicted value: that is, the difference
between the adjusted predicted value and the
original observed value. This is the
deleted residual. The deleted residual can be
divided by the standard error to give a
standardized value known as the Studentized
deleted residual. This residual can be compared
across different regression analyses because it is
measured in standard units.
• The deleted residuals are very useful to assess
the influence of a case on the ability of the
model to predict that case.
• However, they do not provide any information
about how a case influences the model as a
whole
• Use Cook's distance: a measure of the overall influence of a case on the model; values greater than 1 may be cause for concern.
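A hedged sketch of how studentized deleted residuals and Cook's distance could be obtained outside SPSS, using statsmodels; the data and the planted influential case are illustrative only.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
x = rng.uniform(0, 100, 30)
y = 5 + 0.5 * x + rng.normal(0, 3, 30)
y[0] += 40                                   # plant one potentially influential case

fit = sm.OLS(y, sm.add_constant(x)).fit()
influence = fit.get_influence()

student_del = influence.resid_studentized_external   # Studentized deleted residuals
cooks_d, _ = influence.cooks_distance                # Cook's distance for each case

print(np.where(np.abs(student_del) >= 3)[0])   # cases with large deleted residuals
print(np.where(cooks_d > 1)[0])                # Cook's distance above the cut-off of 1
```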
• Run the regression analysis with a case
included and then rerun the analysis with
that same case excluded. If we did this,
undoubtedly there would be some
difference between the b coefficients in
the two regression equations. This
difference would tell us how much
influence a particular case has on the
parameters of the regression model.
• The difference between a parameter estimated using all cases and the same parameter estimated when one case is excluded is known as the DFBeta.
• Again, the units of measurement will affect these values, so SPSS also produces a standardized DFBeta.
• A related statistic is the DFFit, which is
the difference between the predicted value
for a case when the model is calculated
including that case and when the model is
calculated excluding that case: in this
example the value is −1.90
• We have the same problem with units. Therefore, SPSS also produces standardized versions of the DFFit values (Standardized DFFit).
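Similarly, standardized DFBeta and DFFit values can be sketched with statsmodels; the data here are made up and the code is only an illustration of the idea.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
x = rng.uniform(0, 100, 30)
y = 5 + 0.5 * x + rng.normal(0, 3, 30)

fit = sm.OLS(y, sm.add_constant(x)).fit()
influence = fit.get_influence()

dfbetas = influence.dfbetas          # standardized DFBeta: one column per parameter (b0, b1)
dffits, cutoff = influence.dffits    # standardized DFFit and a suggested cut-off

print(np.abs(dfbetas).max(axis=0))          # largest standardized shift in each coefficient
print(np.where(np.abs(dffits) > cutoff)[0]) # cases exceeding the DFFit cut-off
```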
Regression: An Example
• A record company boss was interested in
predicting album sales from advertising.
• Data
– 200 different album releases
• Outcome variable:
– Sales (CDs and Downloads) in the week after
release
• Predictor variable:
– The amount (in £s) spent promoting the album
before release.
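A sketch of how this regression could be run in Python rather than SPSS. The file name 'album_sales.csv' and the column names 'adverts' and 'sales' are assumptions for illustration, not the original dataset.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical file with one row per album release
album = pd.read_csv("album_sales.csv")

# Predict sales in the week after release from the promotion budget
fit = smf.ols("sales ~ adverts", data=album).fit()
print(fit.summary())   # reports R-squared, the ANOVA F-test and the b coefficients
```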
Step One: Graph the Data
Slide 28
Regression Using IBM SPSS
Slide 29
Output: Model Summary
Model Summary
Model 1: R = .578, R Square = .335, Adjusted R Square = .331, Std. Error of the Estimate = 65.9914
Predictors: (Constant), Advertising Budget (thousands of pounds)
Slide 30
Output: ANOVA
Slide 31
SPSS Output: Model Parameters
Slide 32
t = (b_observed − b_expected) / SE_b, where b_expected is the value of b expected under the null hypothesis (i.e., zero).
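A tiny worked example of this t-test with purely illustrative numbers (not the values from the album-sales output).

```python
# t = (b_observed - b_expected) / SE_b, with b_expected = 0 under the null hypothesis
b_observed = 0.40    # illustrative slope estimate
se_b = 0.08          # illustrative standard error
t = (b_observed - 0) / se_b
print(t)             # compare with a t distribution on n - k - 1 degrees of freedom
```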
Using The Model
Slide 34
• The sample size required to test the overall regression model depends on the number of predictors and the size of the expected effect: R2 = .02 (small), .13 (medium) and .26 (large). (The equivalent Cohen's f2 values are sketched below.)
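These R2 benchmarks correspond to Cohen's f2 effect sizes via f2 = R2 / (1 − R2); a quick check:

```python
# Convert the quoted R^2 benchmarks to Cohen's f^2 effect sizes
for label, r2 in [("small", 0.02), ("medium", 0.13), ("large", 0.26)]:
    f2 = r2 / (1 - r2)
    print(f"{label}: f2 = {f2:.3f}")
```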