Probability and Statistics, Chapter 7: Linear Regression Models
OUTLINE
1 INTRODUCTION
2 A SIMPLE LINEAR REGRESSION MODEL
3 ABUSES OF REGRESSION
4 INTERPRETING R RESULTS
LEARNING OUTCOMES
After careful study of this chapter, you should be able to do the following:
1 Understand how the method of least squares is used to estimate the parameters in a linear regression model.
2 Test statistical hypotheses and construct confidence intervals on regression model parameters.
3 Use the regression model to predict a future observation.
4 Analyze residuals to determine whether the regression model is an adequate fit to the data or whether any underlying assumptions are violated.
5 Apply the correlation model.
6 Use R software to fit simple linear regression models and interpret the output.
Dr. Phan Thi Huong Probability and Statistics
A SIMPLE LINEAR REGRESSION MODEL
MODEL DEFINITION
The simple linear regression model is

Y = β0 + β1 x + ε,   (1)

where
β0, β1 are unknown parameters, called the regression coefficients;
ε is the random error, assumed to be normally distributed with E(ε) = 0 and Var(ε) = σ².
The regression function, the mean of Y at a given x, is

E[Y | x] = β0 + β1 x,

where β0 and β1 are, respectively, the intercept and the slope of the straight line.
For n observations the model reads

Yi = β0 + β1 xi + εi,   i = 1, 2, . . . , n.
The cars dataset contains 50 observations of two variables: speed (mph) and dist (ft).

      speed    dist
 1     4.00    2.00
 2     4.00   10.00
 3     7.00    4.00
 4     7.00   22.00
 5     8.00   16.00
...     ...     ...
48    24.00   93.00
49    24.00  120.00
50    25.00   85.00
REGRESSION PARAMETERS
DEFINITION
For a dataset of n observations (x_1, y_1), . . . , (x_n, y_n), the sum of squares for errors is defined by

SSE = Σ_{i=1}^{n} e_i² = Σ_{i=1}^{n} [y_i − (β̂0 + β̂1 x_i)]²

The least-squares method finds the estimates β̂0 and β̂1 by minimizing SSE. Those estimates are called the least squares estimates.
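Minimizing SSE yields the closed-form solutions β̂1 = S_xy/S_xx and β̂0 = ȳ − β̂1 x̄. As a sketch (in Python rather than the chapter's R, with toy data assumed purely for illustration):

```python
def fit_least_squares(x, y):
    """Return (b0, b1) minimizing SSE = sum of (y_i - (b0 + b1*x_i))^2."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)                       # S_xx
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))  # S_xy
    b1 = sxy / sxx           # slope estimate
    b0 = ybar - b1 * xbar    # intercept estimate
    return b0, b1

# Toy data lying exactly on y = 1 + 2x, so the fit should recover (1, 2).
x = [0, 1, 2, 3, 4]
y = [1, 3, 5, 7, 9]
b0, b1 = fit_least_squares(x, y)
```

Because the toy points are exactly collinear, SSE = 0 at the fitted line.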
EXAMPLE 1
A large midwestern bank is planning on introducing a new word
processing system to its secretarial staff. To learn about the amount
of training that is needed to effectively implement the new system,
the bank chose eight employees of roughly equal skill. These
workers were trained for different amounts of time and were then
individually put to work on a given project. The following data
indicate the training times and the resulting times (both in hours)
that it took each worker to complete the project.
Training time (= x):             22    18    30    16    25    20    10    14
Time to complete project (= Y):  18.4  19.2  14.5  19.0  16.6  17.7  24.4  21.0
EXAMPLE 1 (CONTINUED)
(A) What is the estimated regression line?
(B) Predict the amount of time it would take a worker who receives
28 hours of training to complete the project.
(C) Find the residual e_i of the observation (x_i, y_i) = (22, 18.4).
Solution:
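A numerical sketch of parts (A)-(C), in Python rather than the chapter's R:

```python
x = [22, 18, 30, 16, 25, 20, 10, 14]                   # training time
y = [18.4, 19.2, 14.5, 19.0, 16.6, 17.7, 24.4, 21.0]   # time to complete project

n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
sxx = sum((xi - xbar) ** 2 for xi in x)
sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
b1 = sxy / sxx               # (A) slope of the estimated regression line
b0 = ybar - b1 * xbar        # (A) intercept

y_at_28 = b0 + b1 * 28       # (B) predicted completion time after 28 h of training
e_first = y[0] - (b0 + b1 * x[0])   # (C) residual of (22, 18.4)

sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
sigma2_hat = sse / (n - 2)   # MSE, the usual estimate of sigma^2 (see below)
```

The fit has a negative slope, as expected: more training shortens the completion time.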
An unbiased estimator of σ² is

σ̂² = MSE = SSE / (n − 2)

Proof:
EXERCISE 2
The following data give, for certain years between 1982 and 2002,
the percentages of British women who were cigarette smokers.
Treat these data as coming from a linear regression model, with the
input being the year and the response being the percentage. Take
1982 as the base year, so 1982 has input value x = 0, 1986 has input
value x = 4, and so on.
(A) Estimate the value of σ².
(B) Predict the percentage of British women who smoked in 1997.
E(β̂0) = β0,   Var(β̂0) = (1/n + x̄²/S_xx) σ²,   (3)
E(β̂1) = β1,   Var(β̂1) = σ²/S_xx.   (4)
Proof:
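These claims can also be checked numerically. The following Monte Carlo sketch (a hypothetical setup, not from the text: β0 = 1, β1 = 2, σ = 1, x = 0, ..., 9) simulates the model repeatedly and compares the mean and variance of the slope estimates with (4):

```python
import random

random.seed(0)
beta0, beta1, sigma = 1.0, 2.0, 1.0
x = list(range(10))
n = len(x)
xbar = sum(x) / n
sxx = sum((xi - xbar) ** 2 for xi in x)   # S_xx = 82.5 here

b1_draws = []
for _ in range(2000):
    # Simulate Y_i = beta0 + beta1*x_i + eps_i with normal errors.
    y = [beta0 + beta1 * xi + random.gauss(0.0, sigma) for xi in x]
    ybar = sum(y) / n
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    b1_draws.append(sxy / sxx)

mean_b1 = sum(b1_draws) / len(b1_draws)
var_b1 = sum((b - mean_b1) ** 2 for b in b1_draws) / (len(b1_draws) - 1)
# mean_b1 should be near beta1 = 2 and var_b1 near sigma^2/S_xx = 1/82.5.
```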
Under H0 : β1 = b1, the test statistic

Tβ1 = (β̂1 − b1) / SE(β̂1) ∼ t(n − 2),

where SE(β̂1) = √(σ̂²/S_xx).

Under H0 : β0 = b0, the test statistic

Tβ0 = (β̂0 − b0) / SE(β̂0) ∼ t(n − 2),

where SE(β̂0) = √(σ̂² (1/n + x̄²/S_xx)).
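A sketch of the t-test for H0 : β1 = 0 on the Example 1 data (Python instead of R; the critical value t_{0.975}(6) ≈ 2.447 is taken from a t-table):

```python
x = [22, 18, 30, 16, 25, 20, 10, 14]
y = [18.4, 19.2, 14.5, 19.0, 16.6, 17.7, 24.4, 21.0]
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
sxx = sum((xi - xbar) ** 2 for xi in x)
sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
b1 = sxy / sxx
b0 = ybar - b1 * xbar
sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))

se_b1 = (sse / (n - 2) / sxx) ** 0.5   # SE(b1) = sqrt(sigma2_hat / S_xx)
t_stat = (b1 - 0.0) / se_b1            # T under H0: beta1 = 0
t_crit = 2.447                         # t_{0.975}(n - 2), here n - 2 = 6
reject = abs(t_stat) > t_crit          # True: the slope is significant
```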
THEOREM
Under the assumption that the observations are normally and independently distributed, a 100(1 − α)% confidence interval on the slope β1 in simple linear regression is

β̂1 − t_{1−α/2}(n − 2) √(σ̂²/S_xx) ≤ β1 ≤ β̂1 + t_{1−α/2}(n − 2) √(σ̂²/S_xx).   (5)
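Interval (5) can be sketched on the Example 1 data (reused here for illustration; Python instead of R, with t_{0.975}(6) ≈ 2.447 from a t-table):

```python
x = [22, 18, 30, 16, 25, 20, 10, 14]
y = [18.4, 19.2, 14.5, 19.0, 16.6, 17.7, 24.4, 21.0]
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
sxx = sum((xi - xbar) ** 2 for xi in x)
sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
b1 = sxy / sxx
b0 = ybar - b1 * xbar
sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))

half = 2.447 * (sse / (n - 2) / sxx) ** 0.5   # t_{0.975}(6) * sqrt(sigma2_hat/S_xx)
lo, hi = b1 - half, b1 + half
# The interval is roughly (-0.58, -0.31); it excludes 0, matching the t-test.
```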
COEFFICIENT OF DETERMINATION
DEFINITION
The coefficient of determination is the proportion of the variation in the response variable that is explained by the different values of the independent variable, relative to the total variation. It is computed as

R² = SSR/SST = 1 − SSE/SST.   (7)

Note that 0 ≤ R² ≤ 1.
A value of R² near 1 indicates that most of the variation in the response data is explained by the different values of the independent variable; in other words, the linear regression model describes the relationship between Y and x well.
A value of R² near 0 indicates that little of the variation is explained by the different values of x, i.e. only a small portion of the pairs (Yi, xi) shows a linear correlation.
EXAMPLE 4
A new-car dealer is interested in the relationship between the
number of salespeople working on a weekend and the number of
cars sold. Data were gathered for six consecutive Sundays:
Number of salespeople 5 7 4 2 4 8
Number of cars sold 22 20 15 9 17 25
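A numerical sketch for Example 4 (in Python rather than the chapter's R):

```python
x = [5, 7, 4, 2, 4, 8]       # number of salespeople
y = [22, 20, 15, 9, 17, 25]  # number of cars sold

n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
sxx = sum((xi - xbar) ** 2 for xi in x)                        # S_xx
syy = sum((yi - ybar) ** 2 for yi in y)                        # SST
sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))   # S_xy

b1 = sxy / sxx                   # fitted slope
r2 = sxy ** 2 / (sxx * syy)      # R^2 = SSR/SST, about 0.82 here
```

An R² of about 0.82 means roughly 82% of the variation in cars sold is explained by the number of salespeople.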
DEFINITION
For a sample of n observations (Xi, Yi), i = 1, . . . , n, the sample correlation coefficient r_XY is defined by

r_XY = Σ_{i=1}^{n} (Xi − X̄)(Yi − Ȳ) / √( Σ_{i=1}^{n} (Xi − X̄)² · Σ_{i=1}^{n} (Yi − Ȳ)² ) = S_XY / √(S_XX · SST)   (8)

Note that

β̂1 = r_XY √(SST/S_XX),

thus

r²_XY = β̂1² S_XX/SST = β̂1 S_XY/SST = SSR/SST.

• The coefficient of determination R² in a simple linear regression model equals the square of the sample correlation coefficient: R² = r²_XY.
The range of r_XY is −1 ≤ r_XY ≤ 1:
−1 ≤ r_XY < 0: negative correlation; the closer r_XY is to −1, the stronger the negative correlation between X and Y.
0 < r_XY ≤ 1: positive correlation; the closer r_XY is to 1, the stronger the positive correlation between X and Y.
The closer r_XY is to 0, the weaker the correlation; r_XY = 0 indicates that X and Y are linearly uncorrelated.
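Definition (8) and the identity R² = r²_XY can be sketched in Python (reusing the Example 4 data for illustration):

```python
x = [5, 7, 4, 2, 4, 8]
y = [22, 20, 15, 9, 17, 25]

n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
sxx = sum((xi - xbar) ** 2 for xi in x)
sst = sum((yi - ybar) ** 2 for yi in y)
sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))

r_xy = sxy / (sxx * sst) ** 0.5   # sample correlation coefficient, definition (8)

# R^2 computed independently as 1 - SSE/SST from the fitted line:
b1 = sxy / sxx
b0 = ybar - b1 * xbar
sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
r2 = 1 - sse / sst
# r_xy**2 and r2 agree, illustrating R^2 = r_XY^2.
```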
ANALYSIS OF RESIDUALS: ASSESSING THE MODEL
THE QQ-PLOT
A Q–Q plot is a plot of the quantiles of two distributions against each other, or a plot based on estimates of the quantiles. The pattern of points in the plot is used to compare the two distributions.
The points plotted in a Q–Q plot are always non-decreasing when viewed from left to right. If the two distributions being compared are identical, the Q–Q plot follows the 45° line y = x.
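A non-graphical sketch of the normal Q–Q idea: pair the sorted sample with standard-normal quantiles at probabilities (i − 0.5)/n (one common plotting convention; details vary by software). For normal data the pairs hug a straight line, so their correlation is close to 1:

```python
import random
from statistics import NormalDist

random.seed(1)
sample = sorted(random.gauss(0.0, 1.0) for _ in range(100))
n = len(sample)
# Theoretical standard-normal quantiles at probabilities (i + 0.5)/n.
theo = [NormalDist().inv_cdf((i + 0.5) / n) for i in range(n)]

# Correlation of the plotted pairs (theo[i], sample[i]).
mt, ms = sum(theo) / n, sum(sample) / n
num = sum((t - mt) * (s - ms) for t, s in zip(theo, sample))
den = (sum((t - mt) ** 2 for t in theo)
       * sum((s - ms) ** 2 for s in sample)) ** 0.5
corr = num / den   # near 1 for normal data
```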
STANDARDIZED RESIDUALS
The standardized residuals are defined as

Ei = [Yi − (β̂0 + β̂1 xi)] / √(SSE/(n − 2)),   i = 1, 2, . . . , n
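A Python sketch of the standardized residuals for the Example 1 data (the chapter itself uses R). With an intercept in the model the raw least-squares residuals sum to zero, so the standardized ones do too:

```python
x = [22, 18, 30, 16, 25, 20, 10, 14]
y = [18.4, 19.2, 14.5, 19.0, 16.6, 17.7, 24.4, 21.0]

n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
sxx = sum((xi - xbar) ** 2 for xi in x)
sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
b1 = sxy / sxx
b0 = ybar - b1 * xbar

resid = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]   # raw residuals e_i
sse = sum(e ** 2 for e in resid)
sigma_hat = (sse / (n - 2)) ** 0.5
E = [e / sigma_hat for e in resid]   # standardized residuals E_i
# Values well inside (-3, 3) suggest no gross outliers.
```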
ABUSES OF REGRESSION
INTERPRETING R RESULTS