Simple Linear Regression Part I - Updated FA18
Mean
[Scatterplot: stress (x-axis) vs. symptoms (y-axis, a measure of psychological problems)]
Predicting a Score
Let's assume we want to find out if we can find a way to predict the ERM680 experience. We could collect scores on a symptom scale from a sample and compute a mean to estimate symptoms.
[Scatterplot: stress vs. symptoms]
Predicting a Score
We can predict a symptom score "given" a specific stress score. E.g., if we know a student's stress score, the predicted symptom score may be 92.
[Scatterplot: stress (x-axis, 0–75) vs. symptoms, with predictions marked at stress scores 18 and 42]
But how?
We are looking for the best-fitting line. One logical way to find the best line is to find one that minimizes the errors of prediction.
If Y represents the observed response value and Ŷ (Y-hat) represents the predicted value (the value on the line), the difference between them is called the residual, represented as (Y − Ŷ).
As with deviations in the computation of the standard deviation, we cannot simply sum the residuals, because some of them are positive and some are negative. To estimate the overall error, we have to square the residuals before finding the sum. The error in predicting Y (i.e., Y − Ŷ) is therefore derived from the sum of squared residuals.
But how?...
The goal of regression is to find the best-fitting line, i.e., the one that minimizes the sum of squared residuals or SSerror, represented by Σ(Y − Ŷ)².

Ŷ = a + bX

where Ŷ is the predicted value of Y, X is the value of the predictor, a is the intercept, and b is the slope of the regression line.
The Regression Line

Ŷ = a + bX

When our values for Y are predicted or estimated, we indicate this with Ŷ (Y-hat). When the values are actual observed scores, they are indicated with Y.
a is the intercept, or Ŷ when X = 0.
b is the slope, or the rate of change in Y: for every one-unit increase in X, we will see an increase or decrease of b units in the predicted Ŷ, depending on the sign associated with the slope.
The Regression Equation

Ŷ = bX + a

where
Ŷ = the predicted value of Y (pronounced "Y-hat")
b = the slope of the regression line
a = the intercept (i.e., the predicted value of Y when X = 0)

[Scatterplot: Anxiety (x-axis, 0–20) vs. Test Performance (y-axis), with fitted regression line; Rsq = 0.5965]
Defining the line
Finding the regression line is the same as finding the a and b that define the line.
Finding b
The slope, b, is derived from the covariance of X and Y and the variance of X:

b = cov_XY / s_X²

Notice that the equation for b is similar to the correlation equation. You can also find b using r:

r = cov_XY / (s_X s_Y)    so    b = r (s_Y / s_X)
Finding a

a = Ȳ − bX̄
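As a sketch of these two formulas in code, using made-up illustrative data (not the lecture's dataset), computing b from the covariance and variance, a from the means, and checking the equivalent route through r:

```python
import statistics as st

# Illustrative data (hypothetical, not the lecture's SATV dataset)
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 6.0]

n = len(x)
x_bar, y_bar = st.mean(x), st.mean(y)

# Sample covariance of X and Y
cov_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / (n - 1)

# Slope: b = cov_XY / s_X^2
b = cov_xy / st.variance(x)

# Intercept: a = Y-bar - b * X-bar
a = y_bar - b * x_bar

# Equivalent route through the correlation: b = r * (s_Y / s_X)
r = cov_xy / (st.stdev(x) * st.stdev(y))
b_from_r = r * (st.stdev(y) / st.stdev(x))

print(b, a, b_from_r)  # → 0.8 1.8 0.8
```

Both routes give the same slope, since b = r(s_Y/s_X) is just cov_XY/s_X² with the correlation substituted in.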
Another View of the Regression Equation

Ŷ = r (s_Y / s_X)(X − X̄) + Ȳ

Reading the pieces:
(X − X̄): put in the difference between X and its mean; like a z score, it tells how many SDs X is above or below the mean.
s_Y / s_X: puts that difference on Y's metric.
r: adjusts the prediction in proportion to the strength of the relationship.
Ȳ: the mean of Y is your estimate if you know nothing else.
Plotting a Regression Line
To plot a regression line you solve the
regression equation for a few values of X
Place the points on your graph
Connect the dots
Or just have SPSS do it!
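The plotting steps above can be sketched in a few lines; the coefficients here are made up purely for illustration:

```python
# Plot a regression line by solving Yhat = a + b*X at a few X values.
# Coefficients are hypothetical, chosen only to illustrate the steps.
a, b = 10.0, 2.5

xs = [0, 5, 10]                        # a few values of X
points = [(x, a + b * x) for x in xs]  # (X, Yhat) pairs to place on the graph

print(points)  # → [(0, 10.0), (5, 22.5), (10, 35.0)] — connect these dots
```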
Scatter Plot of SATV vs Score (Table 9.6 in Howell)

b = cov_XY / s_X² = 220.317 / 6.729² = 4.865

Slope: b = 4.865
Intercept: a = 373.73

Ŷ = 4.865X + 373.73

For every 1-point increase in Score (X), we expect a 4.865-point increase in SATV scores.
Predicting Y given X
You can now use your equation to make a prediction:

Ŷ = 4.865X + 373.73

When X = 53, we predict Ŷ = 4.865(53) + 373.73 = 631.58.
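Checking the X = 53 prediction with the slide's coefficients:

```python
# Regression equation from the SATV example: Yhat = 4.865*X + 373.73
def predict_satv(score):
    """Predicted SATV for a given Score (X)."""
    return 4.865 * score + 373.73

yhat = predict_satv(53)
print(yhat)  # ≈ 631.58, matching row 18 (X = 53) of the residual table
```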
Errors of Prediction

residual = (Y − Ŷ)

Residuals
What do you think the residuals should add up to?

sum of residuals = Σ(Y − Ŷ)
sum of squared residuals = Σ(Y − Ŷ)²

Error variance: s²_{Y−Ŷ} = Σ(Y − Ŷ)² / (n − 2)

Note: the error variance is also written as s²_{Y·X}.
Table of residuals (prediction error)
ID   Score (X)   SATV observed (Y)   Predicted SATV (Ŷ)   Residual (Y − Ŷ)
1 58 590 655.9096 -65.9096
2 48 580 607.259 -27.259
3 34 550 539.1483 10.85174
4 38 550 558.6085 -8.60848
5 41 560 573.2037 -13.2037
6 55 800 641.3144 158.6856
7 43 650 582.9338 67.06625
8 47 660 602.394 57.60603
9 47 600 602.394 -2.39397
10 46 610 597.5289 12.47108
11 40 620 568.3386 51.66141
12 39 560 563.4735 -3.47354
13 50 570 616.9891 -46.9891
14 46 510 597.5289 -87.5289
15 48 590 607.259 -17.259
16 41 490 573.2037 -83.2037
17 43 580 582.9338 -2.93375
18 53 700 631.5843 68.4157
19 60 690 665.6397 24.36032
20 44 600 587.7988 12.20119
21 49 580 612.1241 -32.1241
22 33 590 534.2832 55.71679
23 40 540 568.3386 -28.3386
24 53 580 631.5843 -51.5843
25 45 600 592.6639 7.33614
26 47 560 602.394 -42.394
27 53 630 631.5843 -1.5843
28 53 620 631.5843 -11.5843
Residuals and Squared Residuals
ID   Y − Ŷ     (Y − Ŷ)²
1    -65.91    4344.08
2    -27.26    743.05
3    10.852    117.76
4    -8.608    74.11
5    -13.2     174.34
6    158.69    25181.12
7    67.066    4497.88
8    57.606    3318.45
9    -2.394    5.73
10   12.471    155.53
11   51.661    2668.90
12   -3.474    12.07
13   -46.99    2207.98
14   -87.53    7661.31
15   -17.26    297.87
16   -83.2     6922.86
17   -2.934    8.61
18   68.416    4680.71
19   24.36     593.43
20   12.201    148.87
21   -32.12    1031.96
22   55.717    3104.36
23   -28.34    803.08
24   -51.58    2660.94
25   7.3361    53.82
26   -42.39    1797.25
27   -1.584    2.51
28   -11.58    134.20

Sums: Σ(Y − Ŷ) = 0.00;  SSerror = SS_{Y−Ŷ} = Σ(Y − Ŷ)² = 73402.75

Error variance: Σ(Y − Ŷ)² / (n − 2) = 73402.75 / 26 = 2823.18

Standard error of estimate: s_{Y·X} = s_{Y−Ŷ} = √(Σ(Y − Ŷ)² / (n − 2)) = √2823.18 = 53.13
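The summary numbers follow mechanically from the residual column; a quick check using the sums from the table:

```python
import math

# Summary values from the SATV residual table
ss_error = 73402.75   # sum of squared residuals, SS_error
n = 28                # number of students

error_variance = ss_error / (n - 2)   # s^2_{Y.X}
see = math.sqrt(error_variance)       # standard error of estimate, s_{Y.X}

print(round(error_variance, 2), round(see, 2))  # → 2823.18 53.13
```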
Error Variance
Error variance is used to describe how much error is in the predictions.
Just as the variance describes variability around the mean of a distribution, in regression the error variance describes variability around the predicted values.
If our predictions were perfect (i.e., r = 1), there would be no error at all and the error variance would be zero.
The regression coefficients minimize the error variance.
Conditional Distribution of Errors
Assumption of Homoscedasticity
[Figure: the conditional distribution of errors at each value of X, with equal spread across X]
Standard Error of the Estimate (SEE)
As with the regular variance, there is a standard deviation measure for the error variance, i.e., the standard error of the estimate:

s_{Y·X} = s_{Y−Ŷ} = √(error variance) = √(Σ(Y − Ŷ)² / (n − 2))
SEE as a measure of the accuracy of prediction
SEE is one way we can assess how well our regression equation is working.
If the regression equation works well, it's safe to assume the predicted values of Y should be very close to the actual values. Referring to the equation on the previous slide: as the predicted and actual values get very close, SEE gets smaller. A smaller SEE implies more accurate predictions.
It should be made clear that in practice this value can be quite sizeable; predicting with an independent variable is better than predicting without one, but it is not without error. We rarely have a dataset that falls exactly on one straight line.
r² as a measure of accuracy of prediction
Remember that Σ(Y − Ŷ)² is the sum of the squared residuals, SSerror, again.

Ȳ = 598.5714;  Σ(Y − Ȳ) = 0.00;  SS_Y = Σ(Y − Ȳ)² = 102342.86

ID   Predicted SATV (Ŷ)   Deviation from mean (Ŷ − Ȳ)   Squared deviation (Ŷ − Ȳ)²
1    655.9096    57.3382     3287.666
2    607.2590    8.6876      75.4739
3    539.1483    -59.4231    3531.108
4    558.6085    -39.9629    1597.036
5    573.2037    -25.3677    643.5217
6    641.3144    42.7430     1826.962
7    582.9338    -15.6376    244.5354
8    602.3940    3.8226      14.61205
9    602.3940    3.8226      14.61205
10   597.5289    -1.0425     1.086866
11   568.3386    -30.2328    914.0239
12   563.4735    -35.0979    1231.865
13   616.9891    18.4177     339.2106
14   597.5289    -1.0425     1.086866
15   607.2590    8.6876      75.4739
16   573.2037    -25.3677    643.5217
17   582.9338    -15.6376    244.5354
18   631.5843    33.0129     1089.85
19   665.6397    67.0683     4498.153
20   587.7988    -10.7726    116.0495
21   612.1241    13.5527     183.6749
22   534.2832    -64.2882    4132.976
23   568.3386    -30.2328    914.0239
24   631.5843    33.0129     1089.85
25   592.6639    -5.9075     34.89889
26   602.3940    3.8226      14.61205
27   631.5843    33.0129     1089.85
28   631.5843    33.0129     1089.85

The sum of the squared deviations of Ŷ from Ȳ, SS_Ŷ = Σ(Ŷ − Ȳ)² = 28940.12, is called SSregression in SPSS, also referred to as SSmodel.

Since the regression line passes through the mean of Y, the mean of the predicted Y (Ŷ-bar) is the same as the mean of the observed Y, so Σ(Ŷ − Ȳ) = 0.
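The accuracy measure itself falls out of the sums of squares: SSregression is the total SS minus SSerror, and dividing it by SS_Y recovers r². A quick check with the SATV numbers:

```python
# Sums of squares from the SATV example
ss_y = 102342.86      # total: sum of (Y - Ybar)^2
ss_error = 73402.75   # residual: sum of (Y - Yhat)^2

ss_regression = ss_y - ss_error   # = sum of (Yhat - Ybar)^2, SS_model

r_squared = ss_regression / ss_y
print(round(r_squared, 3))  # → 0.283, i.e. 0.532**2
```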
[Scatterplots contrasting a strong relationship (r = 0.75) with a weak one (r = 0.22)]
Alternate Standard Error Equation
This equation does not require the computation of all the residuals. Notice how, as r increases to 1, the standard error decreases to 0:

s_{Y·X} = s_Y √((1 − r²)(n − 1)/(n − 2))

For large n, this is approximately:

s_{Y·X} ≈ s_Y √(1 − r²)
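A sketch of that behavior, evaluating the formula at increasing values of r (the s_Y and n here are made up for illustration):

```python
import math

def se_estimate(s_y, r, n):
    """Standard error of estimate: s_Y * sqrt((1 - r^2) * (n - 1) / (n - 2))."""
    return s_y * math.sqrt((1 - r ** 2) * (n - 1) / (n - 2))

s_y, n = 10.0, 30   # illustrative values, not from the lecture's data
ses = [se_estimate(s_y, r, n) for r in (0.0, 0.5, 0.9, 0.99)]
print([round(s, 2) for s in ses])  # each SE is smaller than the last
```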
Standard Error of Estimate for the SATV example

s_{Y−Ŷ} = s_Y √((1 − r²)(n − 1)/(n − 2))
s_{Y−Ŷ} = 61.567 √((1 − .532²)(27/26))
s_{Y−Ŷ} = 61.567 √(.7446)
s_{Y−Ŷ} = 53.1
Another Example: Use the data in the tables below to obtain the regression equation and the standard error of estimate for modeling Test Performance as a function of Anxiety.

b = cov_XY / s_X²        a = Ȳ − bX̄

Correlations (SPSS):
                                          Anxiety    Test Performance
Anxiety           Pearson Correlation     1          -.772**
                  Sig. (2-tailed)         .          .000
                  Sum of Squares and
                  Cross-products          693.500    -404.500
                  Covariance              33.024     -19.262
                  N                       22         22
Test Performance  Pearson Correlation     -.772**    1
                  Sig. (2-tailed)         .000       .
                  Sum of Squares and
                  Cross-products          -404.500   395.500
                  Covariance              -19.262    18.833
                  N                       22         22
**. Correlation is significant at the 0.01 level (2-tailed).

Descriptive Statistics:
                  Mean      Std. Deviation   N
Anxiety           10.5000   5.74663          22
Test Performance  9.5000    4.33974          22
SPSS's Answer

Coefficients(a)
                Unstandardized Coefficients   Standardized Coefficients
Model           B         Std. Error          Beta       t         Sig.
1  (Constant)   15.624    1.277                          12.234    .000
   Anxiety      -.583     .107                -.772      -5.438    .000
a. Dependent Variable: Test Performance

Ŷ = −.583X + 15.62

For every 1-point increase in Anxiety (X), we expect a .583-point decrease in Test Performance (Y).

Q: What test score would you predict for someone with an anxiety score of 16?
A: Ŷ = −.583(16) + 15.62 = 6.3
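Checking that prediction with the SPSS coefficients:

```python
# Regression equation from the SPSS output: Yhat = -0.583*X + 15.62
def predict_test_performance(anxiety):
    """Predicted Test Performance for a given Anxiety score."""
    return -0.583 * anxiety + 15.62

yhat = predict_test_performance(16)
print(round(yhat, 1))  # → 6.3
```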
Standard Error From SPSS
Model Summary

s_{Y−Ŷ} = s_Y √(1 − r²)
s_{Y−Ŷ} = 4.34 √(1 − (.772)²)
s_{Y−Ŷ} = 4.34 √(.404)
s_{Y−Ŷ} = 2.8
Significance Testing of the Regression Slope
Regression coefficients are simply estimates of the true parameters in the population. As such, we can perform hypothesis tests on them. For example, we can perform a t-test on regression slopes:

H0: b* = 0;  Ha: b* ≠ 0

Because β is used to indicate the standardized regression coefficient (as in the SPSS output), we will use b* to represent the true slope of the regression line in the population.
With a single-predictor regression equation, you will get the same result whether you test the correlation coefficient r or the slope b. This applies only to simple linear regression, i.e., when you have only one predictor in the model.
Significance Testing of the Regression Slope for the SATV Data
Using the SATV data, test whether the slope b is significantly different from zero at the 5% level of significance.

t = b / s_b = b(s_X)√(n − 1) / s_{Y·X}

where the standard error of the slope is s_b = s_{Y·X} / (s_X √(n − 1)) and s_{Y·X} = s_Y √((1 − r²)(n − 1)/(n − 2)).

t_slope = 4.865(6.729)√27 / 52.497 = 170.104 / 52.497 = 3.24

tcv = t_{.05/2, df=26} = ±2.056

The t_obs (3.24) is more extreme than the critical t (2.056); we therefore reject H0 and declare that the regression slope is different from zero. I.e., not reading the passage (intelligent guessing) is still a significant predictor of the SAT verbal scores.
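Re-running that arithmetic, with the slope, s_X, n, and the standard-error term taken directly from the slide:

```python
import math

# Values from the SATV slope test on the slide
b, s_x, n = 4.865, 6.729, 28
se_term = 52.497   # s_{Y.X} denominator term as given on the slide

t_obs = b * s_x * math.sqrt(n - 1) / se_term
print(round(t_obs, 2))  # → 3.24, which exceeds the critical t of 2.056
```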
Significance Testing of the Regression Slope - the Easy Way
SPSS reports the t_obs value directly in the Coefficients table.