Simple Linear Regression Part I - Updated FA18

The document discusses simple linear regression and how it can be used to predict a variable from another correlated variable. It provides examples of using linear regression to predict psychological symptom scores, first from the mean score alone, and then more precisely by also considering an individual's stress level.

ERM 680

Simple Linear Regression


Part I
Simple Linear Regression
Students should be able to:
• Calculate and describe b, a, Ŷ, r², and the standard error (given the appropriate equations)
• Construct and use a regression equation
• Describe error variance
• Describe the meaning and uses of r²
What is Regression?
• Regression is the use of correlational data to predict one variable from another.
• Regression is also the plotting of a "best-fitting" line through a scatter plot.
• The plotted regression line is the graphical representation of the prediction.
Predicting a Score
• If we know nothing else about the variable we want to predict, our best prediction is the mean of that variable.
• Remember, if we randomly draw a score from a population, chances are it will be closer to the mean of the population than far from it.
Predicting a Score
• Suppose all we know is that the mean symptom score of students at UNCG is 90.7 (the score is a measure of psychological problems).
• If we know a student comes from UNCG and are asked about his symptom score (how many psychological problems he might have), the closest we can say is "about 90.7," because that's the mean, the best estimate we can have.

[Scatter plot: symptoms (y-axis, roughly 75–175) vs. stress (x-axis, 0–75)]
Predicting a Score
• Let's assume we want to find a way to predict the number of symptoms students in ERM 680 experience.
• We could collect scores on a symptom scale from a sample and compute a mean to estimate the average in the population.
• However, this average may be higher or lower than the actual scores for many individuals.
• Assume we also had scores on the level of stress for these individuals. We could then predict the symptom scores for each individual more precisely if we can establish a relationship between the number of symptoms reported and the associated stress level.

[Scatter plot: symptoms vs. stress, with stress values 18 and 42 marked on the x-axis]
Predicting a Score
• We can predict a symptom score "given" a specific stress score.
• E.g., if a student's stress score is 42, his estimated symptom score may be 105.
• E.g., if a student's stress score is 18, his estimated symptom score may be 92.

[Scatter plot: symptoms vs. stress, with the predicted values read off the line at stress = 18 and stress = 42]
But how?
• We are looking for the best-fitting line. One logical way to find the best line is to find one that minimizes errors of prediction.
• If Y represents the observed response value and Ŷ (Y-hat) represents the predicted value (the value on the line), the difference between them is called the residual, represented as (Y − Ŷ).
• Like deviations in the computation of the standard deviation, we cannot simply sum the residuals, because some of them are positive and some are negative. To estimate the overall error, we have to square the residuals before finding the sum. The error in predicting Y (i.e., Y − Ŷ) is derived from the sum of squared residuals.
But how?...
• The goal of regression is to find the best-fitting line, i.e., the one that minimizes the sum of squared residuals, or SS_error, represented by Σ(Y − Ŷ)².
• Our approach to finding the best line is called "least squares regression."
• One of the conditions for the regression line is that it must always pass through the point (mean of X, mean of Y), that is, (X̄, Ȳ).
The Regression Line
• The line that meets this criterion can actually be derived numerically.
• Think back to your high school math class... the equation for a straight line is:

Ŷ = a + bX

where Ŷ is the predicted value of Y, a is the intercept, b is the slope of the regression line, and X is the value of X.
The Regression Line

Ŷ = a + bX

• When our values for Y are predicted or estimated, we indicate this with Ŷ (Y-hat).
• When the values are actual observed scores, they're indicated with Y.
• a is the intercept, or the value of Ŷ when X = 0.
• b is the slope, or the rate of change in Y: for every one-unit increase in X, we will see an increase or decrease of b units in the predicted Ŷ. Whether it is an increase or a decrease depends on the sign of the slope, b.
The Regression Equation

Ŷ = bX + a

where
Ŷ = pronounced "Y-hat," the predicted value of Y
b = the slope of the regression line
a = the intercept (i.e., the predicted value of Y when X = 0)

a and b are called unstandardized regression coefficients.
Example Regression

[Scatter plot: Test Performance (y-axis, roughly 8–18) vs. Anxiety (x-axis, 0–20), with a fitted regression line; R² = 0.5965]
Defining the Line
• Finding the regression line is the same as finding the a and b that define the line.
Finding b
The slope, b, is derived from the covariance of X and Y and the variance of X:

b = cov_XY / s_X²

Notice that the equation for b is similar to the correlation equation. You can also find b using r:

r = cov_XY / (s_X s_Y)        b = r (s_Y / s_X)
Finding a
The intercept, a, is derived from the slope, b:

a = Ȳ − bX̄
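To make the two formulas concrete, here is a minimal numpy sketch that computes b and a exactly as defined above. The data arrays are hypothetical, chosen only for illustration.

import numpy as np

# Hypothetical (x, y) data for illustration -- not the course dataset.
x = np.array([10., 14., 18., 22., 26., 30., 34., 38., 42., 46.])
y = np.array([88., 90., 95., 93., 99., 104., 101., 108., 105., 112.])

cov_xy = np.cov(x, y, ddof=1)[0, 1]   # sample covariance of X and Y
var_x  = np.var(x, ddof=1)            # sample variance of X (s_X squared)

b = cov_xy / var_x                    # slope: b = cov_XY / s_X^2
a = y.mean() - b * x.mean()           # intercept: a = Ybar - b * Xbar

print(f"b = {b:.3f}, a = {a:.3f}")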
Another View of the Regression Equation

Ŷ = r (s_Y / s_X)(X − X̄) + Ȳ

• (X − X̄) is the difference between X and its mean.
• s_Y / s_X puts that difference on Y's metric; like a z score, it reflects how many SDs X is above or below its mean.
• r adjusts the difference in proportion to the strength of the relationship.
• Ȳ, the mean of Y, is your estimate if you know nothing else.
Plotting a Regression Line
• To plot a regression line, you solve the regression equation for a few values of X.
• Place the points on your graph.
• Connect the dots.
• Or just have SPSS do it! (A minimal plotting sketch follows.)
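If SPSS isn't at hand, the same recipe works in a few lines of matplotlib: fit the line, solve the equation at two X values, and connect them. A minimal sketch with hypothetical data:

import numpy as np
import matplotlib.pyplot as plt

# Hypothetical data for illustration.
x = np.array([10., 14., 18., 22., 26., 30., 34., 38., 42., 46.])
y = np.array([88., 90., 95., 93., 99., 104., 101., 108., 105., 112.])

b = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)   # slope
a = y.mean() - b * x.mean()                           # intercept

x_line = np.array([x.min(), x.max()])   # two X values define the line
plt.scatter(x, y, label="observed")
plt.plot(x_line, a + b * x_line, color="red", label="regression line")
plt.xlabel("X"); plt.ylabel("Y"); plt.legend()
plt.show()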
Scatter Plot of SATV vs. Score (Table 9.6 in Howell)

Compare the two charts and comment.

[Scatter plot: SATV vs. Score (Table 9.6), with fitted regression line; Y-intercept = 373.736]

• The equation in SPSS, Y = 3.74E2 + 4.87X, should actually be written as:
• Ŷ = 374 + 4.87X
SPSS Example

b = cov_XY / s_X² = 220.317 / 6.729² = 4.865

a = Ȳ − bX̄ = 598.57 − (4.865 × 46.21) = 373.73

[SPSS coefficients output: the slope is significantly different from 0]
SPSS Example

Slope: b = 4.865
Intercept: a = 373.73

Ŷ = 4.865X + 373.73

For every 1-point increase in the score X, we expect a 4.865-point increase in SATV scores.
Predicting Y Given X
You can now use your equation to make a prediction:

Ŷ = 4.865X + 373.73

When X = 40, we predict Ŷ = 4.865(40) + 373.73 = 568.33.
When X = 53, we predict Ŷ = 4.865(53) + 373.73 = 631.58.

[Scatter plot of SATV vs. Score, with Ŷ = 631.58 marked at X = 53]

ID (i)   Score Xi   Yi    Ŷi       Residual
18       53         700   631.58    68.42
24       53         580   631.58   -51.58
27       53         630   631.58    -1.58
28       53         620   631.58   -11.58
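The same prediction can be scripted. A small Python sketch using the fitted SATV equation from this slide (the helper name predict_satv is ours, not from the course materials):

# Fitted equation from the slides: Yhat = 4.865*X + 373.73
def predict_satv(x):
    """Predicted SATV score for a given Score value X."""
    return 4.865 * x + 373.73

print(predict_satv(40))         # ~568.33
print(predict_satv(53))         # ~631.58

# Residual for ID 18 (X = 53, observed SATV = 700):
print(700 - predict_satv(53))   # ~68.42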
Errors of Prediction
• When you predict a score using regression, you may not obtain the exact observed score, i.e., Y ≠ Ŷ.
• The stronger the correlation r, the more accurate your prediction.
• The difference between the predicted and actual value is called the residual:

residual = (Y − Ŷ)
Residuals
• What do you think the residuals should add up to?

sum of residuals = Σ(Y − Ŷ)

sum of squared residuals = Σ(Y − Ŷ)²

error variance = s²_Y−Ŷ = Σ(Y − Ŷ)² / (n − 2)

Note: the error variance is also written as s²_Y·X.
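A short numpy sketch, on hypothetical data, confirming the two facts above: the residuals from a least-squares fit sum to (essentially) zero, and the error variance divides the sum of squared residuals by n − 2.

import numpy as np

# Hypothetical data; any least-squares fit will do for the demonstration.
x = np.array([3., 5., 7., 9., 11., 13.])
y = np.array([8., 11., 13., 18., 19., 24.])

b = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
a = y.mean() - b * x.mean()
y_hat = a + b * x

residuals = y - y_hat
print(residuals.sum())              # ~0: residuals always sum to zero
ss_error = np.sum(residuals ** 2)   # SS_error = sum of squared residuals
err_var = ss_error / (len(y) - 2)   # error variance: n - 2 in the denominator
see = np.sqrt(err_var)              # standard error of the estimate
print(err_var, see)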
Table of Residuals (Prediction Errors)

ID   Score (X)   SATV (Y, observed)   Predicted SATV (Ŷ)   Residual (Y − Ŷ)
1    58          590                  655.9096             -65.9096
2    48          580                  607.2590             -27.2590
3    34          550                  539.1483              10.8517
4    38          550                  558.6085              -8.6085
5    41          560                  573.2037             -13.2037
6    55          800                  641.3144             158.6856
7    43          650                  582.9338              67.0663
8    47          660                  602.3940              57.6060
9    47          600                  602.3940              -2.3940
10   46          610                  597.5289              12.4711
11   40          620                  568.3386              51.6614
12   39          560                  563.4735              -3.4735
13   50          570                  616.9891             -46.9891
14   46          510                  597.5289             -87.5289
15   48          590                  607.2590             -17.2590
16   41          490                  573.2037             -83.2037
17   43          580                  582.9338              -2.9338
18   53          700                  631.5843              68.4157
19   60          690                  665.6397              24.3603
20   44          600                  587.7988              12.2012
21   49          580                  612.1241             -32.1241
22   33          590                  534.2832              55.7168
23   40          540                  568.3386             -28.3386
24   53          580                  631.5843             -51.5843
25   45          600                  592.6639               7.3361
26   47          560                  602.3940             -42.3940
27   53          630                  631.5843              -1.5843
28   53          620                  631.5843             -11.5843
ID   Residual (Y − Ŷ)   Squared Residual (Y − Ŷ)²
1    -65.91              4344.08
2    -27.26               743.05
3     10.85               117.76
4     -8.61                74.11
5    -13.20               174.34
6    158.69             25181.12
7     67.07              4497.88
8     57.61              3318.45
9     -2.39                 5.73
10    12.47               155.53
11    51.66              2668.90
12    -3.47                12.07
13   -46.99              2207.98
14   -87.53              7661.31
15   -17.26               297.87
16   -83.20              6922.86
17    -2.93                 8.61
18    68.42              4680.71
19    24.36               593.43
20    12.20               148.87
21   -32.12              1031.96
22    55.72              3104.36
23   -28.34               803.08
24   -51.58              2660.94
25     7.34                53.82
26   -42.39              1797.25
27    -1.58                 2.51
28   -11.58               134.20

Sum: Σ(Y − Ŷ) = 0.00     SS_error = SS_Y−Ŷ = Σ(Y − Ŷ)² = 73402.75

Error variance (variance of residuals) = Σ(Y − Ŷ)² / (n − 2) = 2823.18

Standard error of estimate: s_Y·X = s_Y−Ŷ = √[Σ(Y − Ŷ)² / (n − 2)] = 53.13
Error Variance
• Error variance is used to describe how much error is in the predictions.
• Just as the variance describes variability around the mean of a distribution, in regression the error variance describes variability around the predicted values.
• If our predictions were perfect (i.e., r = 1), there would be no error at all and the error variance would be zero.
• The regression coefficients minimize the error variance.
Conditional Distribution of Errors / Assumption of Homoscedasticity

[Figures: conditional distributions of the errors around the regression line at each value of X; under homoscedasticity these distributions have equal variance across values of X]
Standard Error of the Estimate (SEE)
• As with the regular variance, there is a standard-deviation measure for the error variance: the standard error of the estimate.

s_Y·X = s_Y−Ŷ = √(error variance) = √[Σ(Y − Ŷ)² / (n − 2)]
SEE as a Measure of the Accuracy of Prediction
SEE is one way we can assess how well our regression equation is working.
If the regression equation works well, it's safe to assume the predicted values of Y should be very close to the real values. Referring to the equation on the previous slide: as the predicted and actual values get very close, SEE gets smaller. A smaller SEE implies more accurate predictions.
It should be made clear that in practice this value can be quite sizeable. Predicting with an independent variable is better than predicting without one, but it is not without error. We rarely have a dataset that falls exactly on one straight line.
r² as a Measure of Accuracy of Prediction
• SS_error again: remember Σ(Y − Ŷ)² is the sum of the squared residuals. Residuals are "leftovers."
• SS_error is the amount of the variance in the Y scores left unaccounted for even when X is used as a predictor (it is the amount of variance not explained by X).
• We can also sum up the squared differences between Y and Ȳ. This is the difference between the actual observed values and the grand mean of Y.
• We will denote it as SS_Y, or Σ(Y − Ȳ)².
• This is the total variance, or the variance in the original Y (without taking X into account). It appears as SS_total in SPSS.
ID   SATV (Y, observed)   Deviation from Mean (Y − Ȳ)   Squared Deviation (Y − Ȳ)²
1    590                   -8.5714                          73.4694
2    580                  -18.5714                         344.8980
3    550                  -48.5714                        2359.1837
4    550                  -48.5714                        2359.1837
5    560                  -38.5714                        1487.7551
6    800                  201.4286                       40573.4694
7    650                   51.4286                        2644.8980
8    660                   61.4286                        3773.4694
9    600                    1.4286                           2.0408
10   610                   11.4286                         130.6122
11   620                   21.4286                         459.1837
12   560                  -38.5714                        1487.7551
13   570                  -28.5714                         816.3265
14   510                  -88.5714                        7844.8980
15   590                   -8.5714                          73.4694
16   490                 -108.5714                       11787.7551
17   580                  -18.5714                         344.8980
18   700                  101.4286                       10287.7551
19   690                   91.4286                        8359.1837
20   600                    1.4286                           2.0408
21   580                  -18.5714                         344.8980
22   590                   -8.5714                          73.4694
23   540                  -58.5714                        3430.6122
24   580                  -18.5714                         344.8980
25   600                    1.4286                           2.0408
26   560                  -38.5714                        1487.7551
27   630                   31.4286                         987.7551
28   620                   21.4286                         459.1837

Ȳ = 598.5714     Σ(Y − Ȳ) = 0.00     SS_Y = Σ(Y − Ȳ)² = 102342.86   (SS_total in SPSS)
ID   SATV (Ŷ, predicted)   Deviation from Mean (Ŷ − Ȳ)   Squared Deviation (Ŷ − Ȳ)²
1    655.9096               57.3382                        3287.666
2    607.2590                8.6876                          75.474
3    539.1483              -59.4231                        3531.108
4    558.6085              -39.9629                        1597.036
5    573.2037              -25.3677                         643.522
6    641.3144               42.7430                        1826.962
7    582.9338              -15.6376                         244.535
8    602.3940                3.8226                          14.612
9    602.3940                3.8226                          14.612
10   597.5289               -1.0425                           1.087
11   568.3386              -30.2328                         914.024
12   563.4735              -35.0979                        1231.865
13   616.9891               18.4177                         339.211
14   597.5289               -1.0425                           1.087
15   607.2590                8.6876                          75.474
16   573.2037              -25.3677                         643.522
17   582.9338              -15.6376                         244.535
18   631.5843               33.0129                        1089.850
19   665.6397               67.0683                        4498.153
20   587.7988              -10.7726                         116.050
21   612.1241               13.5527                         183.675
22   534.2832              -64.2882                        4132.976
23   568.3386              -30.2328                         914.024
24   631.5843               33.0129                        1089.850
25   592.6639               -5.9075                          34.899
26   602.3940                3.8226                          14.612
27   631.5843               33.0129                        1089.850
28   631.5843               33.0129                        1089.850

Mean of Ŷ = Ȳ = 598.5714     Σ(Ŷ − Ȳ) = 0.00     SS_Ŷ = Σ(Ŷ − Ȳ)² = 28940.12

This is SS_regression in SPSS, also referred to as SS_model. Since the regression line passes through the mean of Y, the mean of the predicted Y values (Ŷ-bar) is the same as the mean of the observed Y.
R² as a Measure of Accuracy of Prediction
• It can be shown that SS_error = SS_Y (1 − r²).
• Expanding and rearranging, we have r² = (SS_Y − SS_error) / SS_Y.
• SS_Y, the total variability in Y, can be decomposed into two parts:

SS_total = SS_regression + SS_error
SS_Y = SS_Ŷ + SS_Y−Ŷ

a. The part that can be explained/predicted, SS_Ŷ: accounted for by (or attributable to) X, as a result of the linear relationship between X and Y.
b. The part of the variability in Y that is independent of X: SS_Y−Ŷ.
Partitioning of the Sum of Squares and r²

r² = SS_Ŷ / SS_Y = 28940.123 / 102342.86 = .283

Note that

r² = SS_regression / SS_total = (SS_total − SS_error) / SS_total

i.e.,

SS_total = SS_regression + SS_error,  or  SS_Y = SS_Ŷ + SS_Y−Ŷ

[Diagram: the total sum of squares (102342.86) partitioned into SS_Ŷ (28940.123) and SS_error (73402.734)]
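The partition can be verified numerically. A sketch with hypothetical data, computing the three sums of squares and checking both SS_total = SS_regression + SS_error and r² = SS_regression / SS_total:

import numpy as np

# Hypothetical data for illustration.
x = np.array([3., 5., 7., 9., 11., 13.])
y = np.array([8., 11., 13., 18., 19., 24.])

b = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
a = y.mean() - b * x.mean()
y_hat = a + b * x

ss_total = np.sum((y - y.mean()) ** 2)       # SS_Y
ss_reg   = np.sum((y_hat - y.mean()) ** 2)   # SS_Yhat (SS_regression)
ss_error = np.sum((y - y_hat) ** 2)          # SS_(Y - Yhat)

print(np.isclose(ss_total, ss_reg + ss_error))   # True: the partition holds
print(ss_reg / ss_total)                          # r-squared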
Predictable Variability & r²
• The higher the r², the better the predictors are working (the more variance in Y is explained by the predictors).
• If a correlation (r) is found to be 0.8, we calculate r² = 0.8² = 0.64.
• How do we interpret this?
• This means that 64% of the variance of Y can be explained by the variability in X.
• You can use the phrases on the previous slide interchangeably.
• Remember, this does NOT mean that 64% of Y is caused by X.
Factors That Affect Regression

[Two scatter plots with fitted lines: one with r = 0.75, one with r = 0.22; the stronger correlation shows points clustered more tightly around the line]
Alternate Standard Error Equation
• This equation does not require the computation of all the residuals.
• Notice how, as r increases to 1, the standard error decreases to 0.

s_Y·X = s_Y √[(1 − r²)(n − 1)/(n − 2)]

For large n, this is approximately

s_Y·X ≈ s_Y √(1 − r²)
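As a sketch, the alternate equation written as a small function and checked against the SATV values from these slides (the function name see_from_r is ours, introduced only for illustration):

import numpy as np

def see_from_r(s_y, r, n):
    """SEE without residuals: s_Y * sqrt((1 - r^2) * (n - 1) / (n - 2))."""
    return s_y * np.sqrt((1 - r ** 2) * (n - 1) / (n - 2))

# SATV example values from the slides: s_Y = 61.567, r = .532, n = 28.
print(see_from_r(61.567, 0.532, 28))   # ~53.1, matching the residual-based value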
Standard Error of Estimate for the SATV Example

s_Y−Ŷ = s_Y √[(1 − r²)(n − 1)/(n − 2)]
s_Y−Ŷ = 61.567 √[(1 − .532²)(27/26)]
s_Y−Ŷ = 61.567 √.7445
s_Y−Ŷ = 53.1
Another Example
Use the data below to obtain the regression equation and the standard error of estimate for modeling Test Performance as a function of Anxiety.

SPSS Correlations output (N = 22):
• Pearson correlation, Anxiety with Test Performance: r = −.772** (Sig. 2-tailed = .000)
• Sum of squares and cross-products: Anxiety 693.500; cross-products −404.500; Test Performance 395.500
• Covariance: Anxiety 33.024; Anxiety × Test Performance −19.262; Test Performance 18.833
• **. Correlation is significant at the 0.01 level (2-tailed).

Descriptive statistics:
• Anxiety: mean = 10.5000, SD = 5.74663, N = 22
• Test Performance: mean = 9.5000, SD = 4.33974, N = 22

Apply:
b = cov_XY / s_X²
a = Ȳ − bX̄
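Before looking at SPSS's answer, the slope and intercept can be recovered directly from the summary statistics in the output above. A quick sketch:

# Slope and intercept from the summary statistics reported in the slides.
cov_xy = -19.262   # covariance of Anxiety and Test Performance
var_x  = 33.024    # variance of Anxiety (5.74663 squared)
mean_x, mean_y = 10.5, 9.5

b = cov_xy / var_x         # ~ -0.583
a = mean_y - b * mean_x    # ~ 15.62
print(b, a)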
SPSS’s Answer
Coefficientsa

Unstandardized Standardized
Coefficients Coefficients
Model B Std. Error Beta t Sig.
1 (Constant) 15.624 1.277 12.234 .000
Anxiety -.583 .107 -.772 -5.438 .000
a. Dependent Variable: Test Performance

ˆ
Y  .583 X  15.62
For every 1 point increase in Anxiety (X) we expect
a .583 point decrease in Test Performance (Y)
Que: What test score would you predict for
someone with an anxiety score of 16?
 6.3
Standard Error From SPSS

Model Summary (predictors: (Constant), Anxiety):
Model   R      R Square   Adjusted R Square   Std. Error of the Estimate
1       .772   .597       .576                2.82459

s_Y−Ŷ = s_Y √(1 − r²)
s_Y−Ŷ = 4.34 √(1 − .772²)
s_Y−Ŷ = 4.34 √.404
s_Y−Ŷ ≈ 2.8

(SPSS's 2.82459 additionally includes the (n − 1)/(n − 2) correction.)
Significance Testing of the Regression Slope
• Regression coefficients are simply estimates of the true parameters in the population. As such, we can perform hypothesis tests on them.
• For example, we can perform a t test on the regression slope:
• H0: b* = 0, Ha: b* ≠ 0
• Because b denotes the sample regression coefficient, we use b* to represent the true slope of the regression line in the population.
• With single-predictor regression equations, you will get the same result whether you test the correlation coefficient r or the slope b.
• This applies only to simple linear regression, i.e., when you have only one predictor in the model.
Significance Testing of the Regression Slope for the SATV Data
Using the SATV data, we test whether the slope b is significantly different from zero at the 5% level of significance.

t = b / s_b,  where the standard error of the slope is  s_b = s_Y·X / (s_X √(n − 1))  and  s_Y·X = s_Y √[(1 − r²)(n − 1)/(n − 2)]

t_slope = b (s_X) √(n − 1) / s_Y·X = 4.865 (6.729) √27 / 53.13 = 170.104 / 53.13 ≈ 3.20

• t_cv = t.05/2, df=26 = ±2.056
• The t_obs (≈3.20) is more extreme than the critical t (2.056); we therefore reject H0 and conclude that the regression slope is different from zero.
• I.e., not reading the passage (intelligent guessing) is still a significant predictor of the SAT Verbal scores.
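In practice the whole test is one call. A sketch using scipy.stats.linregress on hypothetical data; it returns the slope, its standard error, and the two-sided p value for H0: b* = 0:

import numpy as np
from scipy import stats

# Hypothetical data for illustration.
x = np.array([3., 5., 7., 9., 11., 13., 15., 17.])
y = np.array([8., 11., 13., 18., 19., 24., 22., 27.])

res = stats.linregress(x, y)
t_obs = res.slope / res.stderr   # t = b / s_b, df = n - 2
print(t_obs, res.pvalue)         # reject H0 if the p value < .05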
Significance Testing of the Regression Slope - the Easy Way
• SPSS reports the t_obs value for the slope directly in the coefficients table.
• Significance is achieved if the accompanying p value (the "Sig." column) is less than your alpha level (usually .05).
Assumptions in Regression
• There are no firm assumptions underlying regression (or correlation) when the statistics are used only to describe data.
• A regression line is the "best fit" no matter what the properties of the data are.
• However, when we conduct significance testing, or if we want to create confidence intervals, we make assumptions:
• Homogeneity of variance and normality in the predicted variable (e.g., Y), independence of observations, normality of the errors of prediction, etc.
• More on this next week.
Potential Problems with Regression
• The same issues arise in regression as occur with correlation; they are two faces of the same procedure:
• Curvilinear relationships, range restriction, subsamples, outliers, spurious correlations.
• No proof of causality.
Other Notes about Regression
• Avoid making predictions where you have no data (i.e., outside the range of observed data).
• If you were predicting first-year Engineering School GPA from SATM scores, you would not want to make a prediction based on an SATM of 200.
• The prediction is likely not accurate because the program isn't likely to have enrolled a student with an SATM score that low.
Standardized Regression
• In many situations the y-intercept, a, is meaningless, such as when predicting the weight (Y) of a baby when height (X) is zero.
• We could then transform (standardize) the raw scores to z scores before conducting the regression.
• The z scores for Y (Z_Y) are regressed onto the z scores for X (Z_X).
• Then the y-intercept is equal to the mean of Z_Y corresponding to the mean of Z_X, i.e., (0, 0) on the z-score scale. This is consistent with the mean of Y corresponding to the mean of X on the raw-score scale.
Standardized Regression
• β = r when there is only one independent variable, i.e., this is true only for simple linear regression.
• The standardized value of the y-intercept: a = mean of Z_Y = 0.
• Note that the intercept point corresponds to Ȳ on the Y metric but to zero on the z metric.
• Standardized coefficients are generally used with multiple linear regression.
Standardized Coefficients
• When you perform regression on standardized data (i.e., z scores), the regression coefficients are standardized. In the case of simple linear regression, the standardized slope is called β instead of b.
• β = r when there is only one independent variable, i.e., only one predictor in the model.
• The standardized constant, sometimes referred to as β₀, is always equal to zero.
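A small numpy sketch, on hypothetical data, verifying both claims: standardizing X and Y makes the fitted slope equal r, and the fitted intercept zero.

import numpy as np

# Hypothetical data; standardizing both variables makes the slope equal r.
x = np.array([3., 5., 7., 9., 11., 13.])
y = np.array([8., 11., 13., 18., 19., 24.])

zx = (x - x.mean()) / x.std(ddof=1)
zy = (y - y.mean()) / y.std(ddof=1)

beta = np.cov(zx, zy, ddof=1)[0, 1] / np.var(zx, ddof=1)  # slope on z scores
r = np.corrcoef(x, y)[0, 1]

print(np.isclose(beta, r))   # True: beta = r in simple linear regression
# Intercept: zy.mean() - beta * zx.mean() = 0, since both means are 0.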
Regression Model in Terms of Standardized Coefficients

ẑ_Y = β z_X

• β is also sometimes called the beta weight.
• Note the intercept term is absent, because in the standardized model the mean of the z scores for both X and Y is zero, i.e., the line passes through (0, 0).
• In our SATV example, β = .532.
• The regression equation can be written as ẑ_Y = .532 z_X.
Interpretation of Standardized Coefficients
• The slope .532 would be interpreted as the expected increase in the standardized SATV score for every one-standard-deviation increase in the score when not reading the passage.
• Note the unit for .532 is standard deviations, not raw scores.
• I.e., an increase of one standard deviation in the score when not reading the passage will lead to an increase of slightly more than half a standard deviation in the predicted SATV scores.
Unstandardized vs. Standardized
• The β slope is not very stable from sample to sample.
• In simple linear regression, most researchers prefer to use the unstandardized slope, b, as it tends to be more consistent from sample to sample.
• When there are several predictors in a model, the regression is referred to as multiple linear regression. If the predictors are on scales which are not commensurate, the β's are used instead of the b's to compare the relative importance of each predictor.
• But this comes with its own issues, because the β's are affected by the variances and covariances of both the included predictors and the predictors not included in the model.
