
Chapter 17

This document provides an introduction to simple linear regression. Simple linear regression examines the relationship between a dependent variable (Y) and one independent variable (X) using a linear equation of the form Y = β0 + β1X + ε, where β0 is the Y-intercept, β1 is the slope, and ε is the error term. The coefficients β0 and β1 are estimated using the method of least squares, which finds the line that minimizes the sum of squared differences between the observed data points and the regression line. The quality of fit is assessed based on how well the linear model predicts the dependent variable from the independent variable.

Simple Linear Regression

Introduction

• In Chapters 17 to 19, we examine the relationship
  between interval variables via a mathematical equation.
• The motivation for using the technique:
  – Forecast the value of a dependent variable (Y) from
    the values of independent variables (X1, X2, …, Xk).
  – Analyze the specific relationships between the
    independent variables and the dependent variable.
The Model
The model has a deterministic and a probabilistic component.

[Scatter plot: House Cost vs. House Size]

Building a house costs about $75 per square foot, and most lots
sell for $25,000, so the deterministic component is:

    House cost = 25,000 + 75(Size)

However, house costs vary even among same-size houses! Since
cost behaves unpredictably, we add a random component ε:

    House cost = 25,000 + 75(Size) + ε
• The first order linear model

    Y = β0 + β1X + ε

  Y  = dependent variable
  X  = independent variable
  β0 = Y-intercept
  β1 = slope of the line (Rise/Run)
  ε  = error variable

  β0 and β1 are unknown population parameters, and are
  therefore estimated from the data.
Estimating the Coefficients
• The estimates are determined by
  – drawing a sample from the population of interest,
  – calculating sample statistics,
  – producing a straight line that cuts into the data.

[Scatter plot of sample points]
Question: What should be considered a good line?
The Least Squares (Regression) Line

A good line is one that minimizes the sum of squared
differences between the points and the line.

Let us compare two lines through the points (1,2), (2,4),
(3,1.5), and (4,3.2): the line Y = X and the horizontal line
Y = 2.5.

Sum of squared differences (Y = X):
  (2 - 1)² + (4 - 2)² + (1.5 - 3)² + (3.2 - 4)² = 7.89
Sum of squared differences (Y = 2.5):
  (2 - 2.5)² + (4 - 2.5)² + (1.5 - 2.5)² + (3.2 - 2.5)² = 3.99

The smaller the sum of squared differences, the better the
fit of the line to the data.
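As a quick check, the two sums of squared differences above can be computed directly. A minimal sketch; the candidate lines Y = X and Y = 2.5 are the two lines from the comparison above:

```python
# Compute the sum of squared vertical differences between the four
# sample points and each candidate line.
points = [(1, 2), (2, 4), (3, 1.5), (4, 3.2)]

def sum_sq_diff(points, predict):
    """Sum of squared vertical distances from the points to a line."""
    return sum((y - predict(x)) ** 2 for x, y in points)

sse_line1 = sum_sq_diff(points, lambda x: x)      # the line Y = X
sse_line2 = sum_sq_diff(points, lambda x: 2.5)    # horizontal line Y = 2.5
print(round(sse_line1, 2), round(sse_line2, 2))   # prints: 7.89 3.99
```

The horizontal line Y = 2.5 has the smaller sum of squared differences, so by the least squares criterion it fits these four points better.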
The Estimated Coefficients

To calculate the estimates of the line coefficients that
minimize the differences between the data points and the
line, use the formulas:

    b1 = cov(X,Y) / sX²
    b0 = Ȳ - b1X̄

The regression equation that estimates the equation of the
first order linear model is:

    Ŷ = b0 + b1X
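The coefficient formulas above can be sketched in code. The sample data below are made up for illustration; only the formulas b1 = cov(X,Y)/sX² and b0 = Ȳ - b1X̄ come from the text:

```python
from statistics import mean

# Least squares coefficients computed from first principles on a
# small made-up sample (illustrative data only).
X = [1, 2, 3, 4, 5]
Y = [2.1, 3.9, 6.2, 7.8, 10.1]

n = len(X)
x_bar, y_bar = mean(X), mean(Y)
# Sample covariance and sample variance (divide by n - 1).
cov_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(X, Y)) / (n - 1)
s2_x = sum((x - x_bar) ** 2 for x in X) / (n - 1)

b1 = cov_xy / s2_x          # slope estimate
b0 = y_bar - b1 * x_bar     # intercept estimate
print(f"Y-hat = {b0:.2f} + {b1:.2f}X")
```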
The Simple Linear Regression Line

• Example 17.2 (Xm17-02)
  – A car dealer wants to find the relationship between the
    odometer reading (independent variable X) and the selling
    price (dependent variable Y) of used cars.
  – A random sample of 100 cars is selected, and the data
    recorded.
  – Find the regression line.

      Car   Odometer   Price
       1     37388     14636
       2     44758     14122
       3     45833     14016
       4     30862     15590
       5     31705     15568
       6     34010     14718
       .       .         .
• Solution
  – Solving by hand: Calculate a number of statistics, with
    n = 100:

    X̄ = 36,009.45;   sX² = Σ(Xi - X̄)² / (n - 1) = 43,528,690
    Ȳ = 14,822.82;   cov(X,Y) = Σ(Xi - X̄)(Yi - Ȳ) / (n - 1) = -2,712,511

    b1 = cov(X,Y) / sX² = -2,712,511 / 43,528,690 = -.0623
    b0 = Ȳ - b1X̄ = 14,822.82 - (-.0623)(36,009.45) = 17,067

    Ŷ = b0 + b1X = 17,067 - .0623X

• Solution – continued
– Using the computer (Xm17-02)

Tools > Data Analysis > Regression >
[Shade the Y range and the X range] > OK
Xm17-02
SUMMARY OUTPUT

Regression Statistics
Multiple R 0.8063
R Square 0.6501
Adjusted R Square 0.6466

Yˆ  17,067  .0623X
Standard Error 303.1
Observations 100

ANOVA
df SS MS F Significance F
Regression 1 16734111 16734111 182.11 0.0000
Residual 98 9005450 91892
Total 99 25739561
 Coefficients Standard Error t Stat P-value
Intercept 17067 169 100.97 0.0000
Odometer -0.0623 0.0046 -13.49 0.0000
Interpreting the Linear Regression Equation

[Odometer Line Fit Plot: Price vs. Odometer]

    Ŷ = 17,067 - .0623X

The intercept is b0 = $17,067. Do not interpret the intercept as
the "price of cars that have not been driven": there are no data
points with odometer readings near zero.

The slope is b1 = -.0623. For each additional mile on the
odometer, the price decreases by an average of $0.0623.
Error Variable: Required Conditions

• The error is a critical part of the regression model.


• Four requirements involving the distribution of ε must
  be satisfied:
  – The probability distribution of ε is normal.
  – The mean of ε is zero: E(ε) = 0.
  – The standard deviation of ε is σε, a constant, for all
    values of X.
  – The set of errors associated with different values of Y
    are all independent.
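A sketch of data generated under these four conditions; the parameter values below are illustrative, borrowed loosely from the house-cost example, not from any dataset in the chapter:

```python
import random
from statistics import mean, stdev

# Simulate Y = beta0 + beta1*X + eps with errors that satisfy the
# four conditions: normal, mean zero, constant sigma, independent.
random.seed(1)
beta0, beta1, sigma = 25_000, 75, 2_000   # made-up parameter values

X = [random.uniform(1_000, 4_000) for _ in range(5_000)]
eps = [random.gauss(0, sigma) for _ in X]   # independent N(0, sigma) draws
Y = [beta0 + beta1 * x + e for x, e in zip(X, eps)]

# The simulated errors should have mean near 0 and sd near sigma.
print(round(mean(eps)), round(stdev(eps)))
```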
The Normality of ε

[Plot: three normal curves centered on the line β0 + β1X at
X1, X2, X3]

The standard deviation remains constant, but the mean value
changes with X:
    E(Y|X1) = β0 + β1X1
    E(Y|X2) = β0 + β1X2
    E(Y|X3) = β0 + β1X3

From the first three assumptions we have: Y is normally
distributed with mean E(Y) = β0 + β1X and a constant standard
deviation σε.
Assessing the Model
• The least squares method will produce a regression line
  whether or not there is a linear relationship between
  X and Y.
• Consequently, it is important to assess how well
the linear model fits the data.
• Several methods are used to assess the model.
All are based on the sum of squares for errors,
SSE.
Sum of Squares for Errors
– This is the sum of squared differences between the points
  and the regression line.
– It can serve as a measure of how well the line fits the
  data. SSE is defined by

    SSE = Σ (Yi - Ŷi)²,   summed over i = 1, …, n

– A shortcut formula:

    SSE = (n - 1)[sY² - cov(X,Y)² / sX²]
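The definition and the shortcut formula should agree. A sketch checking this on a small made-up sample (the data are illustrative only):

```python
from statistics import mean

# Compare SSE computed from its definition with the shortcut
# formula SSE = (n - 1) * (sy^2 - cov(X,Y)^2 / sx^2).
X = [1, 2, 3, 4, 5]
Y = [2.1, 3.9, 6.2, 7.8, 10.1]
n = len(X)

x_bar, y_bar = mean(X), mean(Y)
cov_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(X, Y)) / (n - 1)
s2_x = sum((x - x_bar) ** 2 for x in X) / (n - 1)
s2_y = sum((y - y_bar) ** 2 for y in Y) / (n - 1)

b1 = cov_xy / s2_x
b0 = y_bar - b1 * x_bar

# Direct definition: sum of squared residuals about the fitted line.
sse_direct = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(X, Y))
# Shortcut formula.
sse_shortcut = (n - 1) * (s2_y - cov_xy ** 2 / s2_x)

print(round(sse_direct, 6), round(sse_shortcut, 6))
```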
Standard Error of Estimate
– The mean error is equal to zero.
– If σε is small, the errors tend to be close to zero
  (close to the mean error), and the model fits the data well.
– Therefore, we can use σε as a measure of the suitability
  of using a linear model.
– An estimator of σε is given by sε, the standard error of
  estimate:

    sε = sqrt( SSE / (n - 2) )
• Example 17.3
  – Calculate the standard error of estimate for Example 17.2,
    and describe what it tells you about the model fit.
• Solution

    sY² = Σ(Yi - Ȳ)² / (n - 1) = 259,996      (calculated before)

    SSE = (n - 1)[sY² - cov(X,Y)² / sX²]
        = 99[259,996 - (-2,712,511)² / 43,528,690] = 9,005,450

    sε = sqrt(SSE / (n - 2)) = sqrt(9,005,450 / 98) = 303.13

  It is hard to assess the model based on sε alone, even when
  compared with the mean value of Y:
    sε = 303.1,   Ȳ = 14,823
Testing the Slope
– When no linear relationship exists between two variables,
  the regression line should be horizontal.

[Two scatter plots]

Linear relationship:                 No linear relationship:
different inputs (X) yield           different inputs (X) yield
different outputs (Y);               the same output (Y);
the slope is not equal to zero.      the slope is equal to zero.
• We can draw inferences about β1 from b1 by testing
    H0: β1 = 0
    H1: β1 ≠ 0  (or < 0, or > 0)
  – The test statistic is

      t = (b1 - β1) / sb1,   where sb1 = sε / sqrt((n - 1)sX²)

    is the standard error of b1.
  – If the error variable is normally distributed, the
    statistic has a Student t distribution with d.f. = n - 2.
• Example 17.4
– Test to determine whether there is enough evidence
to infer that there is a linear relationship between the
car auction price and the odometer reading for all
three-year-old Tauruses, in Example 17.2.
Use  = 5%.
• Solving by hand
  – To compute t we need the values of b1 and sb1:

      b1 = -.0623
      sb1 = sε / sqrt((n - 1)sX²)
          = 303.1 / sqrt((99)(43,528,690)) = .00462

      t = (b1 - β1) / sb1 = (-.0623 - 0) / .00462 = -13.49

  – The rejection region is t > t.025 or t < -t.025 with
    d.f. = n - 2 = 98. Approximately, t.025 = 1.984.
    Since t = -13.49 < -1.984, we reject H0.
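The hand computation of the test statistic can be reproduced as:

```python
import math

# t-test for the slope in Example 17.4:
# s_b1 = s_eps / sqrt((n - 1) * sx^2), t = (b1 - 0) / s_b1.
n = 100
b1 = -0.0623
s_eps = 303.1
s2_x = 43_528_690

s_b1 = s_eps / math.sqrt((n - 1) * s2_x)   # standard error of b1
t = (b1 - 0) / s_b1                        # test statistic under H0: beta1 = 0
print(round(s_b1, 5), round(t, 2))         # prints: 0.00462 -13.49
```

Since |t| = 13.49 far exceeds the critical value of about 1.984, H0 is rejected.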
Xm17-02
• Using the computer
Price Odometer SUMMARY OUTPUT
14636 37388
14122 44758 Regression Statistics
14016 45833 Multiple R 0.8063
15590 30862 R Square 0.6501 There is overwhelming evidence to infer
15568 31705 Adjusted R Square 0.6466
14718 34010 Standard Error 303.1 that the odometer reading affects the
14470 45854 Observations 100 auction selling price.
15690 19057
15072 40149 ANOVA
14802 40237 df SS MS F Significance F
15190 32359 Regression 1 16734111 16734111 182.11 0.0000
14660 43533 Residual 98 9005450 91892
15612 32744 Total 99 25739561
15610 34470
14634 37720 Coefficients Standard Error t Stat P-value
14632 41350 Intercept 17067 169 100.97 0.0000
15740 24469 Odometer -0.0623 0.0046 -13.49 0.0000
Coefficient of Determination
– To measure the strength of the linear relationship we use
  the coefficient of determination:

    R² = cov(X,Y)² / (sX² sY²),   or equivalently, R² = r²XY

  or,

    R² = 1 - SSE / Σ(Yi - Ȳ)²      (with SSE as defined above)
• To understand the significance of this coefficient, note
  that the overall variability in Y is explained in part by
  the regression model, while the rest remains unexplained
  (the error).

[Plot: two data points (X1,Y1) and (X2,Y2) of a certain
sample, with the regression line and Ȳ marked]

  Variation in Y = SSR + SSE
  Total variation in Y = variation explained by the
  regression line + unexplained variation (error)

  (Y1 - Ȳ)² + (Y2 - Ȳ)² = (Ŷ1 - Ȳ)² + (Ŷ2 - Ȳ)²
                           + (Y1 - Ŷ1)² + (Y2 - Ŷ2)²
• R² measures the proportion of the variation in Y that is
  explained by the variation in X:

    R² = 1 - SSE / Σ(Yi - Ȳ)²
       = [Σ(Yi - Ȳ)² - SSE] / Σ(Yi - Ȳ)²
       = SSR / Σ(Yi - Ȳ)²

• R² takes on any value between zero and one.
  R² = 1: perfect match between the line and the data points.
  R² = 0: there is no linear relationship between X and Y.
• Example 17.5
– Find the coefficient of determination for Example 17.2;
what does this statistic tell you about the model?
• Solution
  – Solving by hand:

    R² = cov(X,Y)² / (sX² sY²)
       = (-2,712,511)² / [(43,528,690)(259,996)] = .6501

– Using the computer: from the regression output we have
  R Square = 0.6501.

  SUMMARY OUTPUT

  Regression Statistics
  Multiple R          0.8063
  R Square            0.6501
  Adjusted R Square   0.6466
  Standard Error      303.1
  Observations        100

  65% of the variation in the auction selling price is
  explained by the variation in odometer reading. The rest
  (35%) remains unexplained by this model.
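The hand calculation of R² from Example 17.5 can be reproduced as:

```python
# R^2 = cov(X,Y)^2 / (sx^2 * sy^2), using the summary statistics
# computed earlier for the used-car data.
cov_xy = -2_712_511
s2_x = 43_528_690
s2_y = 259_996

r2 = cov_xy ** 2 / (s2_x * s2_y)
print(round(r2, 4))   # prints: 0.6501
```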
ANOVA
df SS MS F Significance F
Regression 1 16734111 16734111 182.11 0.0000
Residual 98 9005450 91892
Total 99 25739561

            Coefficients   Standard Error   t Stat    P-value
Intercept      17067           169          100.97     0.0000
Odometer      -0.0623          0.0046       -13.49     0.0000
