Simple Linear Regression

This document discusses simple linear regression. It begins with an overview of simple linear regression and its goal of determining the mathematical relationship between a response variable and predictor variables. Key aspects covered include identifying the response and predictor variables, plotting the variables to visualize their relationship, and using the method of least squares to estimate the regression coefficients that minimize the squared errors between the observed and modeled responses.

Lecture 11

Simple Linear Regression

Fall 2013
Prof. Yao Xie, [email protected]
H. Milton Stewart School of Industrial and Systems Engineering
Georgia Tech
Midterm 2
• mean: 91.2
• median: 93.75
• std: 6.5

2
Meddicorp Sales
Meddicorp Company sells medical supplies to hospitals, clinics, and doctors' offices.

Meddicorp's management wants to assess the effectiveness of a new advertising program.

Management wants to know if the 1999 advertising expenditure is related to sales.

3
Data
The company observes, for 25 offices, the yearly sales (in thousands) and the advertisement expenditure for the new program (in hundreds):

      SALES     ADV
1     963.50    374.27
2     893.00    408.50
3    1057.25    414.31
4    1183.25    448.42
5    1419.50    517.88
..........
4
Regression analysis
• Step 1: graphical display of data — scatter plot: sales vs. advertisement cost

5
• Step 2: find the relationship or association between Sales and Advertisement Cost — Regression

6
Regression Analysis
• The collection of statistical tools used to model and explore relationships between variables that are related in a nondeterministic manner is called regression analysis

• Regression occurs frequently in engineering and science

7
Scatter Diagram
Many problems in engineering and science involve exploring the relationships between two or more variables.

Regression analysis is a statistical technique that is very useful for these types of problems.

The sample correlation coefficient measures the strength of the linear association:

    ρ̂ = Σᵢ₌₁ⁿ (xᵢ − x̄)(yᵢ − ȳ) / √[ Σᵢ₌₁ⁿ (xᵢ − x̄)² · Σᵢ₌₁ⁿ (yᵢ − ȳ)² ],   −1 ≤ ρ̂ ≤ 1
8
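The sample correlation coefficient can be computed directly from paired data. A minimal Python sketch (not part of the original slides; the function name is ours):

```python
import math

def sample_correlation(x, y):
    """Sample correlation coefficient rho-hat of paired data x, y."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    sxx = sum((xi - xbar) ** 2 for xi in x)
    syy = sum((yi - ybar) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)

# Perfectly linear, increasing data has correlation exactly 1.
print(sample_correlation([1, 2, 3, 4], [2, 4, 6, 8]))  # → 1.0
```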
Basics of Regression
• We observe a response or dependent variable (Y)

• With each Y, we also observe regressors or predictors {X1, …, Xn}

• Goal: determine the mathematical relationship between the response variable and the regressors

• Y = h(X1, …, Xn)
9
• Function can be non-linear
• In this class, we will focus on the case where Y is a linear function of {X1, …, Xn}

    Y = h(X1, …, Xn) = β0 + β1X1 + … + βnXn

[Regression plot: fitted model C2 = -0.488636 + 3.78859 C1 - 0.246379 C1**2, S = 0.680055, R-Sq = 96.9%, R-Sq(adj) = 96.2%]
10
Different forms of regression
• Simple linear regression

    Y = β0 + β1X + ε

• Multiple linear regression

    Y = β0 + β1X1 + β2X2 + ε

• Polynomial regression

    Y = β0 + β1X + β2X² + ε
11
Basics of regressions
Which is the RESPONSE and which is the PREDICTOR?

The response or dependent variable varies with different values of the regressor/predictor.

The predictor values are fixed: we observe the response at these fixed values.

The focus is on explaining the response variable in association with one or more predictors.
12
Simple linear regression
Our goal is to find the best line that describes a linear relationship:

Find (β0, β1) where

    Y = β0 + β1X + ε

Unknown parameters:
1. β0 — intercept (where the line crosses the y-axis)
2. β1 — slope of the line

Basic idea
a. Plot observations (X, Y)
b. Find the best line that follows the plotted points

[Regression plot: fitted line C2 = 1.80955 + 1.29268 C1, S = 1.13916, R-Sq = 87.9%, R-Sq(adj) = 86.6%]
13
Class activity
1. In the Meddicorp Company example, the response is:
A. Sales  B. Advertisement Expenditure

2. In the Meddicorp Company example, the predictor is:
A. Sales  B. Advertisement Expenditure

3. To learn about the association between sales and the advertisement expenditure we can use simple linear regression:
A. True  B. False

4. If the association between response and predictor is positive, then the slope is:
A. Positive  B. Negative  C. We cannot identify the slope sign

14
Simple linear regression: model
With observed data {(X1,Y1), …, (Xn,Yn)}, we model the linear relationship

    Yi = β0 + β1Xi + εi,   i = 1, …, n

    E(εi) = 0
    Var(εi) = σ²
    {ε1, …, εn} are independent random variables
    (Later we assume εi ~ Normal)

Later, we will check these assumptions when we assess model adequacy.

15
Summary: simple linear regression
Based on the scatter diagram, it is probably reasonable to assume that the mean of the random variable Y is related to X by the following simple linear regression model:

    Yi = β0 + β1Xi + εi,   i = 1, 2, …, n,   εi ~ N(0, σ²)

Here Yi is the response, Xi is the regressor or predictor, β0 is the intercept, β1 is the slope, and εi is the random error. The slope and intercept of the line are called regression coefficients.

• The case of simple linear regression considers a single regressor or predictor x and a dependent or response variable Y.

16
Estimate regression parameters
To estimate (β0, β1), we find values that minimize the squared error:

    Σᵢ₌₁ⁿ (yᵢ − (β0 + β1xᵢ))²

• derivation: method of least squares

17
Method of least squares

Using Equation 11-2, we may express the n observations in the sample as

    yi = β0 + β1xi + εi,   i = 1, 2, …, n        (11-3)

and the sum of the squares of the deviations of the observations from the true regression line is

    L = Σᵢ₌₁ⁿ εi² = Σᵢ₌₁ⁿ (yi − β0 − β1xi)²        (11-4)

We call this criterion for estimating the regression coefficients the method of least squares. The least squares estimators of β0 and β1, say β̂0 and β̂1, must satisfy

    ∂L/∂β0 |(β̂0, β̂1) = −2 Σᵢ₌₁ⁿ (yi − β̂0 − β̂1xi) = 0

    ∂L/∂β1 |(β̂0, β̂1) = −2 Σᵢ₌₁ⁿ (yi − β̂0 − β̂1xi) xi = 0        (11-5)

Simplifying these two equations yields the least squares normal equations:

    n β̂0 + β̂1 Σᵢ₌₁ⁿ xi = Σᵢ₌₁ⁿ yi

    β̂0 Σᵢ₌₁ⁿ xi + β̂1 Σᵢ₌₁ⁿ xi² = Σᵢ₌₁ⁿ yi xi        (11-6)

[Figure 11-3: Deviations of the data (y) from the estimated regression line.]

18
Least squares estimates
Equations 11-6 are called the least squares normal equations. The solution to the normal equations results in the least squares estimators β̂0 and β̂1.

The least squares estimates of the intercept and slope in the simple linear regression model are

    β̂0 = ȳ − β̂1 x̄        (11-7)

    β̂1 = [ Σᵢ₌₁ⁿ yi xi − (Σᵢ₌₁ⁿ yi)(Σᵢ₌₁ⁿ xi)/n ] / [ Σᵢ₌₁ⁿ xi² − (Σᵢ₌₁ⁿ xi)²/n ]        (11-8)

where ȳ = (1/n) Σᵢ₌₁ⁿ yi and x̄ = (1/n) Σᵢ₌₁ⁿ xi.

The fitted or estimated regression line is therefore

    ŷ = β̂0 + β̂1 x        (11-9)

19
0 1 19
the adequacy of the fitted model.
tionally, it is occasionally convenient to give special symbols to the numerator and
t is occasionally convenient to give special symbols to the numerator and
ator of Equation 11-8. Given data (x1, y1), (x2, y2), p , (xn, yn), let
uation 11-8. Given data (x1, y1), (x2, y2), p , (xn, yn), let
Alternative notation
aa
n 2

Sn x x " a 1xi % x2n2 " a xa2i a


n 2 xb
n n i
% xi b n
Sx x " a 1xi % x2 " a x i %
i"1
(11-10)
i"1 2 2 i"1 i"1
n (11-10)
i"1 i"1

a a xinb a a yi b
n n

a a
S " a 1yi % y2 1xi % x2 " a xiayi % xi b a n yi b
n n n
i"1 i"1
(11-11)
" a 1yi % y2 1xi % x2 " a xi yi %
n xy i"1
n i"1 i"1 i"1
xy n (11-11)
i"1 i"1

βˆ0 = y − βˆ1 x
yˆ i = βˆ0 + βˆ1 xi
Fitted (estimated)
S xy regression model
β̂1 =
S xx
20
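Equations 11-7 and 11-8 translate directly into code. A small Python sketch (the function name is ours), checked on data lying exactly on the line y = 1 + 2x:

```python
def least_squares_fit(x, y):
    """Least squares intercept and slope (Eqs. 11-7, 11-8) via Sxx and Sxy."""
    n = len(x)
    sxx = sum(xi ** 2 for xi in x) - sum(x) ** 2 / n                   # Eq. 11-10
    sxy = sum(xi * yi for xi, yi in zip(x, y)) - sum(x) * sum(y) / n   # Eq. 11-11
    b1 = sxy / sxx                       # slope, Eq. 11-8
    b0 = sum(y) / n - b1 * sum(x) / n    # intercept, Eq. 11-7
    return b0, b1

# Data exactly on y = 1 + 2x recovers the coefficients exactly.
b0, b1 = least_squares_fit([1, 2, 3, 4], [3, 5, 7, 9])
print(round(b0, 6), round(b1, 6))  # → 1.0 2.0
```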
Example: oxygen and hydrocarbon level

Question: fit a simple regression model to relate purity (y) to hydrocarbon level (x).

Table 11-1  Oxygen and Hydrocarbon Levels

Observation    Hydrocarbon Level    Purity
Number         x (%)                y (%)
1              0.99                 90.01
2              1.02                 89.05
3              1.15                 91.43
4              1.29                 93.74
5              1.46                 96.73
6              1.36                 94.45
7              0.87                 87.59
8              1.23                 91.77
9              1.55                 99.42
10             1.40                 93.65
11             1.19                 93.54
12             1.15                 92.52
13             0.98                 90.56
14             1.01                 89.54
15             1.11                 89.85
16             1.20                 90.39
17             1.26                 93.25
18             1.32                 93.41
19             1.43                 94.98
20             0.95                 87.33

[Figure 11-1: Scatter diagram of oxygen purity versus hydrocarbon level from Table 11-1.]

21
From the data, the following quantities are computed:

    n = 20
    Σᵢ xi = 23.92            Σᵢ yi = 1,843.21
    x̄ = 1.1960               ȳ = 92.1605
    Σᵢ yi² = 170,044.5321    Σᵢ xi² = 29.2892
    Σᵢ xi yi = 2,214.6566

    Sxx = Σᵢ xi² − (Σᵢ xi)²/20 = 29.2892 − (23.92)²/20 = 0.68088

    Sxy = Σᵢ xi yi − (Σᵢ xi)(Σᵢ yi)/20 = 2,214.6566 − (23.92)(1,843.21)/20 = 10.17744

22
Therefore, the least squares estimates of the slope and intercept are

    β̂1 = Sxy / Sxx = 10.17744 / 0.68088 = 14.94748

    β̂0 = ȳ − β̂1 x̄ = 92.1605 − (14.94748)(1.196) = 74.28331

The fitted simple linear regression model (with the coefficients reported to three decimal places) is

    ŷ = 74.283 + 14.947x

This model is plotted in Fig. 11-4, along with the sample data.

Practical Interpretation: Using the regression model, we would predict oxygen purity of ŷ = 89.23% when the hydrocarbon level is x = 1.00%. The purity 89.23% may be interpreted as an estimate of the true population mean purity when x = 1.00%, or as an estimate of a new observation when x = 1.00%. These estimates are, of course, subject to error; that is, it is unlikely that a future observation on purity would be exactly 89.23% when the hydrocarbon level is 1.00%. In subsequent sections we will see how to use confidence intervals and prediction intervals to describe the error in estimation from a regression model.

Computer software programs are widely used in regression analysis and typically carry more decimal places in the calculations.

[Figure 11-4: Scatter plot of oxygen purity y versus hydrocarbon level x and regression model ŷ = 74.283 + 14.947x.]

23
Interpretation of regression model
• Regression model:

    ŷ = 74.283 + 14.947x

• The prediction ŷ = 89.23% at x = 1.00% may be interpreted as an estimate of the true population mean purity when x = 1.00%, or as an estimate of a new observation when x = 1.00%
• These estimates are subject to error
• Later: we will use confidence intervals to describe the error in estimation from a regression model

24
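The oxygen-purity fit can be reproduced numerically from Table 11-1. A Python sketch of the same Sxx/Sxy arithmetic (plain script, no special libraries):

```python
# Hydrocarbon level x (%) and purity y (%) from Table 11-1.
x = [0.99, 1.02, 1.15, 1.29, 1.46, 1.36, 0.87, 1.23, 1.55, 1.40,
     1.19, 1.15, 0.98, 1.01, 1.11, 1.20, 1.26, 1.32, 1.43, 0.95]
y = [90.01, 89.05, 91.43, 93.74, 96.73, 94.45, 87.59, 91.77, 99.42, 93.65,
     93.54, 92.52, 90.56, 89.54, 89.85, 90.39, 93.25, 93.41, 94.98, 87.33]
n = len(x)

sxx = sum(xi ** 2 for xi in x) - sum(x) ** 2 / n                   # Eq. 11-10
sxy = sum(xi * yi for xi, yi in zip(x, y)) - sum(x) * sum(y) / n   # Eq. 11-11
b1 = sxy / sxx
b0 = sum(y) / n - b1 * sum(x) / n

# Matches the slides: y-hat = 74.283 + 14.947x, and y-hat(1.00) ≈ 89.23.
print(round(b0, 3), round(b1, 3), round(b0 + b1 * 1.00, 2))
```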
Estimation of variance
Using the fitted model, we can estimate the value of the response variable for a given predictor:

    ŷi = β̂0 + β̂1 xi

• Residuals: ri = yi − ŷi
• Our model: Yi = β0 + β1Xi + εi, i = 1, …, n, Var(εi) = σ²
• Unbiased estimator (MSE: mean squared error):

    σ̂² = MSE = Σᵢ₌₁ⁿ (yi − ŷi)² / (n − 2)

• For the oxygen and hydrocarbon level example, the estimate is σ̂² = 1.18

25
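Continuing the oxygen example, the residual-based variance estimate can be computed the same way. This sketch refits the model from the Table 11-1 data and should reproduce σ̂² ≈ 1.18:

```python
# Oxygen purity data from Table 11-1.
x = [0.99, 1.02, 1.15, 1.29, 1.46, 1.36, 0.87, 1.23, 1.55, 1.40,
     1.19, 1.15, 0.98, 1.01, 1.11, 1.20, 1.26, 1.32, 1.43, 0.95]
y = [90.01, 89.05, 91.43, 93.74, 96.73, 94.45, 87.59, 91.77, 99.42, 93.65,
     93.54, 92.52, 90.56, 89.54, 89.85, 90.39, 93.25, 93.41, 94.98, 87.33]
n = len(x)

# Fit by least squares, then estimate sigma^2 from the residuals.
sxx = sum(xi ** 2 for xi in x) - sum(x) ** 2 / n
sxy = sum(xi * yi for xi, yi in zip(x, y)) - sum(x) * sum(y) / n
b1 = sxy / sxx
b0 = sum(y) / n - b1 * sum(x) / n

residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
sigma2_hat = sum(r ** 2 for r in residuals) / (n - 2)  # MSE: divide by n - 2
print(round(sigma2_hat, 2))  # → 1.18
```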
Example: Oil Well Drilling Costs
Estimating the costs of drilling oil wells is an important consideration for the oil industry.

Data: the total costs and the depths of 16 off-shore oil wells located in the Philippines.

Depth    Cost       Depth    Cost
5000     2596.8     8210     4813.1
5200     3328.0     8600     5618.7
6000     3181.1     9026     7736.0
6538     3198.4     9197     6788.3
7109     4779.9     9926     7840.8
7556     5905.6     10813    8882.5
8005     5769.2     13800    10489.5
8207     8089.5     14311    12506.6
26
• Step 1: graphical display of the data

• R code: plot(Depth, Cost, xlab = "Depth", ylab = "Cost")

27
Class activity
1. In this example, the response is:
A. The drilling cost  B. The well depth

2. In this example, the dependent variable is:
A. The drilling cost  B. The well depth

3. Is there a linear association between the drilling cost and the well depth?
A. Yes and positive  B. Yes and negative  C. No

28
• Step 2: find the relationship between Depth and Cost
29
Results and use of regression model
1. Fit a linear regression model:
Estimates (β̂0, β̂1) are (−2277.1, 1.0033)

2. What does the model predict as the cost increase for an additional depth of 1000 ft?
If we increase X by 1000, we increase Y by 1000·β̂1 ≈ $1,003

3. What cost would you predict for an oil well of 10,000 ft depth?
X = 10,000 ft is in the range of the data, and the estimate of the line at x = 10,000 is β̂0 + (10,000)·β̂1 = −2277.1 + 10,033 ≈ $7,756

4. What is the estimate of the error variance? Estimate σ̂² ≈ 774,211

5. What could you say about the cost of an oil well of depth 20,000 ft?
X = 20,000 ft is much greater than all the observed values of X. We should not extrapolate the regression out that far.

30
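The predictions above are plain arithmetic on the fitted line. A sketch using the rounded estimates reported on the slide (with these rounded coefficients, the 10,000 ft prediction comes out as 7755.9 rather than a figure from the unrounded fit):

```python
b0, b1 = -2277.1, 1.0033  # estimates reported on the slide (rounded)

def predicted_cost(depth_ft):
    """Point prediction from the fitted line; valid only inside the data range."""
    return b0 + b1 * depth_ft

print(round(b1 * 1000, 1))               # cost increase per extra 1000 ft → 1003.3
print(round(predicted_cost(10_000), 1))  # → 7755.9
```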
Summary
• Simple linear regression

    Y = β0 + β1X + ε

• Estimate coefficients from data: the method of least squares

    β̂1 = Sxy / Sxx
    β̂0 = ȳ − β̂1 x̄

• Fitted (estimated) regression model

    ŷi = β̂0 + β̂1 xi

• Estimate of variance

    σ̂² = MSE = Σᵢ₌₁ⁿ (yi − ŷi)² / (n − 2)

31