Simple Linear Regression

Fall 2013
Prof. Yao Xie, [email protected]
H. Milton Stewart School of Industrial & Systems Engineering, Georgia Tech
Midterm 2
• mean: 91.2
• median: 93.75
• std: 6.5
Meddicorp Sales
Meddicorp Company sells medical supplies to hospitals, clinics, and doctors' offices.

Meddicorp's management is evaluating the effectiveness of a new advertising program.

Management wants to know whether the 1999 advertising expenditure is related to sales.
Data
For each of 25 offices, the company records the yearly sales (in thousands of dollars) and the advertising expenditure under the new program (in hundreds of dollars):

     SALES      ADV
1    963.50     374.27
2    893.00     408.50
3    1057.25    414.31
4    1183.25    448.42
5    1419.50    517.88
...
Regression analysis
• Step 1: graphical display of data — scatter plot: sales vs. advertisement cost
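For example, in R (a sketch; meddicorp is a hypothetical data frame holding the ADV and SALES columns from the table above):

    plot(meddicorp$ADV, meddicorp$SALES,
         xlab = "Advertising expenditure (hundreds of $)",
         ylab = "Sales (thousands of $)")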
• Step 2: find the relationship or association between Sales and Advertisement Cost — Regression
Regression Analysis
• The collection of statistical tools used to model and explore relationships between variables that are related in a nondeterministic manner is called regression analysis.

• Such problems occur frequentlyly in engineering and science.
Scatter Diagram
Many problems in engineering and science involve exploring the relationships between two or more variables.

Regression analysis is a statistical technique that is very useful for these types of problems.

A first measure of linear association is the sample correlation coefficient:

ρ̂ = ∑_{i=1}^{n} (xi − x̄)(yi − ȳ) / √( ∑_{i=1}^{n} (xi − x̄)² × ∑_{i=1}^{n} (yi − ȳ)² ),   −1 ≤ ρ̂ ≤ 1
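In R this is a one-liner (a sketch, using the hypothetical meddicorp data frame from above):

    cor(meddicorp$ADV, meddicorp$SALES)   # sample correlation, always in [-1, 1]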
Basics of Regression
• We observe a response or dependent variable (Y)

• With each (Y), we also observe regressors or predictors {X1, …, Xn}

• Goal: determine the mathematical relationship between the response variable and the regressors

• Y = h(X1, …, Xn)
• The function can be nonlinear

• In this class, we will focus on the case where Y is a linear function of {X1, …, Xn}:

Y = h(X1, …, Xn) = β0 + β1X1 + … + βnXn

[Regression plot of a nonlinear (quadratic) fit: C2 = -0.488636 + 3.78859 C1 - 0.246379 C1**2; S = 0.680055, R-Sq = 96.9%, R-Sq(adj) = 96.2%]
Different forms of regression
• Simple linear regression:
Y = β0 + β1X + ε

• Multiple linear regression:
Y = β0 + β1X1 + β2X2 + ε

• Polynomial regression:
Y = β0 + β1X + β2X² + ε
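All three forms can be fit in R with lm() (a sketch; y, x, x1, x2 are hypothetical numeric vectors):

    fit1 <- lm(y ~ x)            # simple linear regression
    fit2 <- lm(y ~ x1 + x2)      # multiple linear regression
    fit3 <- lm(y ~ x + I(x^2))   # polynomial (quadratic) regression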
Basics of regression
Which is the RESPONSE and which is the PREDICTOR?

The response or dependent variable varies with different values of the regressor/predictor.

The predictor values are fixed: we observe the response at these fixed values.

The focus is on explaining the response variable in association with one or more predictors.
Simple linear regression
Our goal is to find the best line that describes a linear relationship:

Y = β0 + β1X + ε

[Regression plot: fitted line C2 = 1.80955 + 1.29268 C1, S = 1.13916]

Unknown parameters:
1. β0, the intercept (where the line crosses the y-axis)
2. β1, the slope of the line

Basic idea:
a. Plot the observations (X, Y)
b. Find the best line that follows the plotted points
Class activity
1. In the Meddicorp Company example, the response is:
A. Sales   B. Advertisement Expenditure

2. In the Meddicorp Company example, the predictor is:
A. Sales   B. Advertisement Expenditure

3. To learn about the association between sales and advertisement expenditure, we can use simple linear regression:
A. True   B. False

4. If the association between response and predictor is positive, then the slope is:
A. Positive   B. Negative   C. We cannot identify the slope sign
Simple linear regression: model
With observed data {(X1, Y1), …, (Xn, Yn)}, we model the linear relationship as

Yi = β0 + β1Xi + εi,   i = 1, …, n

where
E(εi) = 0
Var(εi) = σ²
{ε1, …, εn} are independent random variables
(Later we assume εi ~ Normal)
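To make the assumptions concrete, here is a small R sketch (all parameter values hypothetical) that simulates data from this model:

    set.seed(1)
    n <- 25
    x <- seq(1, 10, length.out = n)    # fixed predictor values
    e <- rnorm(n, mean = 0, sd = 2)    # independent errors: E(e) = 0, Var(e) = 4
    y <- 3 + 1.5 * x + e               # Yi = b0 + b1*Xi + ei with b0 = 3, b1 = 1.5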
Summary: simple linear regression
Based on the scatter diagram, it is probably reasonable to assume that the mean of the random variable Y is related to X by the following simple linear regression model:

Yi = β0 + β1Xi + εi,   i = 1, 2, …, n,   εi ~ N(0, σ²)

where β0 is the intercept, β1 is the slope, and εi is the random error; the slope and intercept of the line are called regression coefficients.

• The case of simple linear regression considers a single regressor or predictor x and a dependent or response variable Y.
Estimate regression parameters
To estimate (β0, β1), we find the values that minimize the sum of squared errors:

∑_{i=1}^{n} ( yi − (β0 + β1xi) )²

• derivation: method of least squares
We call this criterion for estimating the regression coefficients the method of least squares. The sum of squared deviations is

L = ∑_{i=1}^{n} εi² = ∑_{i=1}^{n} ( yi − β0 − β1xi )²   (11-4)

Setting the partial derivatives of L to zero at the estimates (β̂0, β̂1),

∂L/∂β0 = −2 ∑_{i=1}^{n} ( yi − β̂0 − β̂1xi ) = 0
∂L/∂β1 = −2 ∑_{i=1}^{n} ( yi − β̂0 − β̂1xi ) xi = 0   (11-5)

and simplifying gives

n β̂0 + β̂1 ∑_{i=1}^{n} xi = ∑_{i=1}^{n} yi
β̂0 ∑_{i=1}^{n} xi + β̂1 ∑_{i=1}^{n} xi² = ∑_{i=1}^{n} yi xi   (11-6)

Equations 11-6 are called the least squares normal equations. Their solution gives the least squares estimators β̂0 and β̂1.
The least squares estimates of the intercept and slope in the simple linear regression model are

β̂0 = ȳ − β̂1 x̄   (11-7)

β̂1 = [ ∑_{i=1}^{n} xi yi − (∑_{i=1}^{n} xi)(∑_{i=1}^{n} yi)/n ] / [ ∑_{i=1}^{n} xi² − (∑_{i=1}^{n} xi)²/n ]   (11-8)

and the fitted (estimated) regression line is

ŷ = β̂0 + β̂1 x   (11-9)
Additionally, it is occasionally convenient to give special symbols to the numerator and denominator of Equation 11-8. Given data (x1, y1), (x2, y2), …, (xn, yn), let
Alternative notation
Sxx = ∑_{i=1}^{n} (xi − x̄)² = ∑_{i=1}^{n} xi² − (∑_{i=1}^{n} xi)²/n   (11-10)

Sxy = ∑_{i=1}^{n} (yi − ȳ)(xi − x̄) = ∑_{i=1}^{n} xi yi − (∑_{i=1}^{n} xi)(∑_{i=1}^{n} yi)/n   (11-11)

so that

β̂0 = ȳ − β̂1 x̄
β̂1 = Sxy / Sxx

Fitted (estimated) regression model: ŷi = β̂0 + β̂1 xi
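These formulas translate directly into R (a sketch; x and y are numeric vectors of equal length):

    Sxx <- sum((x - mean(x))^2)
    Sxy <- sum((x - mean(x)) * (y - mean(y)))
    b1  <- Sxy / Sxx               # slope estimate
    b0  <- mean(y) - b1 * mean(x)  # intercept estimate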
Example: oxygen purity data

While the mean of Y is a linear function of x, the actual observed value y does not fall exactly on a straight line.

[Table 11-1: 20 observations of oxygen purity y (%) versus hydrocarbon level x (%)]

[Figure 11-1: Scatter diagram of oxygen purity versus hydrocarbon level from Table 11-1]
From the data in Table 11-1, the following quantities are computed:

n = 20,   ∑ xi = 23.92,   ∑ yi = 1,843.21
x̄ = 1.1960,   ȳ = 92.1605

Sxx = ∑ xi² − (∑ xi)²/20 = 29.2892 − (23.92)²/20 = 0.68088

Sxy = ∑ xi yi − (∑ xi)(∑ yi)/20 = 2,214.6566 − (23.92)(1,843.21)/20 = 10.17744

Therefore, the least squares estimates of the slope and intercept are

β̂1 = Sxy / Sxx = 10.17744 / 0.68088 = 14.94748

β̂0 = ȳ − β̂1 x̄ = 92.1605 − (14.94748)(1.196) = 74.28331

and the fitted simple linear regression model is

ŷ = 74.283 + 14.947 x
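As a quick check, the same estimates can be reproduced in R from the summary statistics above:

    n   <- 20
    Sx  <- 23.92;  Sy <- 1843.21
    Sxx <- 29.2892 - Sx^2 / n        # 0.68088
    Sxy <- 2214.6566 - Sx * Sy / n   # 10.17744
    b1  <- Sxy / Sxx                 # 14.94748
    b0  <- Sy/n - b1 * Sx/n          # 74.28331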
This model is plotted in Fig. 11-4, along with the sample data.

[Figure 11-4: Scatter plot of the oxygen purity data with the fitted model ŷ = 74.283 + 14.947x]

Practical interpretation: using the model, we would predict an oxygen purity of ŷ = 89.23% when the hydrocarbon level is x = 1.00%. This can be interpreted as an estimate of the true population mean purity when x = 1.00%, or as an estimate of a new observation when x = 1.00%.

• These estimates are, of course, subject to error; it is unlikely that a future observation on purity would be exactly 89.23% when the hydrocarbon level is 1.00%.

• Later: we will use confidence intervals and prediction intervals to describe the error in estimation from a regression model.
Estimation of variance
• Using the fitted model, we can estimate the value of the response variable for a given predictor:

ŷi = β̂0 + β̂1 xi

• Residuals: ri = yi − ŷi

• Our model: Yi = β0 + β1Xi + εi, i = 1, …, n, with Var(εi) = σ²

• Unbiased estimator (MSE: mean square error):

σ̂² = MSE = ∑_{i=1}^{n} ri² / (n − 2)

(Equivalent computational formulas, involving the total sum of squares SST = ∑_{i=1}^{n} (yi − ȳ)² = ∑_{i=1}^{n} yi² − nȳ², are presented in Section 11-4.) For the oxygen purity data, the estimate is σ̂² = 1.18.
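In R, with a model object fit returned by lm() (a sketch):

    r      <- resid(fit)                   # residuals yi - yhat_i
    sigma2 <- sum(r^2) / (length(r) - 2)   # MSE = SSE / (n - 2)
    # equivalently: summary(fit)$sigma^2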
Example: Oil Well Drilling Costs
• Step 1: graphical display of the data

[Scatter plot of drilling cost versus well depth]

• R code: plot(Depth, Cost, xlab = "Depth", ylab = "Cost")
Class activity
1. In this example, the response is:
A. The drilling cost   B. The well depth

2. In this example, the dependent variable is:
A. The drilling cost   B. The well depth

3. Is there a linear association between the drilling cost and the well depth?
A. Yes, and positive   B. Yes, and negative   C. No
• Step 2: find the relationship between Depth and Cost (see the R sketch below)
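A minimal fitting sketch in R, assuming Depth and Cost are the numeric vectors used in the plot above:

    fit <- lm(Cost ~ Depth)   # least squares fit of Cost on Depth
    summary(fit)              # coefficient estimates, R-squared, residual SE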
Results and use of regression model
1. Fit a linear regression model:
The estimates of (β0, β1) are (-2277.1, 1.0033).

2. What does the model predict as the cost increase for an additional depth of 1,000 ft?
If we increase X by 1,000, we increase Y by 1000 β̂1 ≈ $1,003.

3. What cost would you predict for an oil well of 10,000 ft depth?
X = 10,000 ft is in the range of the data, and the estimate of the line at x = 10,000 is β̂0 + (10,000) β̂1 = -2277.1 + 10,033 ≈ $7,756.

4. What is the estimate of the error variance?
σ̂² ≈ 774,211.

5. What could you say about the cost of an oil well of depth 20,000 ft?
X = 20,000 ft is much greater than all the observed values of X; we should not extrapolate the regression out that far.
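With the fitted object fit from the sketch above, the prediction in question 3 could be obtained as:

    predict(fit, newdata = data.frame(Depth = 10000))   # about $7,756 (interpolation)
    # Depth = 20000 lies far outside the observed range; predict() would still
    # return a number, but it would be an unreliable extrapolation.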
To summarize:

Model: Y = β0 + β1X + ε

Least squares criterion: minimize L = ∑_{i=1}^{n} ( yi − β0 − β1xi )²

Least squares estimates:
β̂0 = ȳ − β̂1 x̄
β̂1 = Sxy / Sxx

Fitted (estimated) regression model: ŷi = β̂0 + β̂1 xi

• Estimate of variance: σ̂² = MSE = ∑_{i=1}^{n} ri² / (n − 2)

[Figure 11-3: Deviations of the data from the estimated regression model]