15. Simple Linear Regression (530)
First-Order Linear Model = Simple Linear Regression Model

    Y_i = β0 + β1 X_i + ε_i

where
  y  = dependent variable
  x  = independent variable
  β0 = y-intercept
  β1 = slope of the line
  ε  = error variable
Simple Linear Model

    Y_i = β0 + β1 X_i + ε_i

This model is
  – Simple: only one X
  – Linear in the parameters: no parameter appears as an exponent or is
    multiplied or divided by another parameter
  – Linear in the predictor variable (X): X appears only to the first power.
Examples

• Multiple Linear Regression:
    Y_i = β0 + β1 X_1i + β2 X_2i + ε_i
• Polynomial Linear Regression:
    Y_i = β0 + β1 X_i + β2 X_i² + ε_i
• Linear Regression:
    log10(Y_i) = β0 + β1 X_i + β2 exp(X_i) + ε_i
• Nonlinear Regression:
    Y_i = β0 / (1 + β1 exp(β2 X_i)) + ε_i

What matters is whether the model is linear or nonlinear in the parameters.
Deterministic Component of Model

    y = β̂0 + β̂1 x

[Plot of the fitted line: the y-intercept is β̂0 and the slope is β̂1 = Δy/Δx.]
Mathematical vs Statistical Relation

[Scatter plot of y against x with the fitted line ŷ = -5.3562 + 3.3988x.]
Error

• The scatterplot shows that the points are not on a line, so in addition to
  the relationship we also describe the error:

    y_i = β0 + β1 x_i + ε_i ,   i = 1, 2, ..., n

• The Y's are the response (or dependent) variable, the x's are the
  predictors or independent variables, and the ε's are the errors. We assume
  that the errors are normal, mutually independent, and have variance σ².
Least Squares:

    Minimize   Σ_{i=1}^{n} ε_i²  =  Σ_{i=1}^{n} (y_i - β0 - β1 x_i)²
Minimizing error
• The Simple Linear Regression Model
    y = β0 + β1 x + ε
• The Least Squares Regression Line
    ŷ = β̂0 + β̂1 x
  where
    β̂1 = SS_xy / SS_x        β̂0 = ȳ - β̂1 x̄
    SS_x  = Σ(x_i - x̄)²           = Σ x_i² - (Σ x_i)²/n
    SS_xy = Σ(x_i - x̄)(y_i - ȳ)   = Σ x_i y_i - (Σ x_i)(Σ y_i)/n
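A minimal numpy sketch of these formulas (function and variable names are illustrative, not from the slides):

    import numpy as np

    def least_squares_line(x, y):
        """Return (b0, b1) for the least squares line y-hat = b0 + b1*x."""
        x, y = np.asarray(x, float), np.asarray(y, float)
        n = len(x)
        ss_x  = np.sum(x**2) - np.sum(x)**2 / n            # SS_x
        ss_xy = np.sum(x * y) - np.sum(x) * np.sum(y) / n  # SS_xy
        b1 = ss_xy / ss_x                                  # slope
        b0 = y.mean() - b1 * x.mean()                      # intercept
        return b0, b1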
What form does the error take?

• Each observation may be decomposed into two parts:
    y = ŷ + (y - ŷ)
• The first part is used to determine the fit, and the second to estimate
  the error.
• We estimate the standard deviation of the error from
    SSE = Σ(Y_i - Ŷ_i)² = SS_y - SS_xy² / SS_x
Estimate of σ²

• We estimate σ² by
    s² = MSE = SSE / (n - 2)
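Continuing the sketch above, SSE and this variance estimate might be computed as follows (an illustrative helper, not from the slides):

    import numpy as np

    def sse_and_mse(x, y, b0, b1):
        """SSE = sum of squared residuals; MSE = SSE / (n - 2) estimates sigma^2."""
        x, y = np.asarray(x, float), np.asarray(y, float)
        resid = y - (b0 + b1 * x)        # residuals y - y-hat
        sse = np.sum(resid**2)
        return sse, sse / (len(x) - 2)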
Example

• An educational economist wants to establish the relationship between an
  individual's income and education. He takes a random sample of 10
  individuals and asks for their income (in $1000s) and education (in years).
  The results are shown below. Find the least squares regression line.
Education 11 12 11 15 8 10 11 12 17 11
Income 25 33 22 41 18 28 32 24 53 26
Dependent and Independent Variables

• The dependent variable is the one that we want to forecast or analyze.
• The independent variable is hypothesized to affect the dependent variable.
• In this example we wish to analyze income, and we choose education as the
  variable that most affects income. Hence y is income and x is years of
  education.
First Step:

    Σ x_i = 118       Σ x_i² = 1450
    Σ y_i = 302       Σ y_i² = 10072
    Σ x_i y_i = 3779
Sum of Squares:

    SS_xy = Σ x_i y_i - (Σ x_i)(Σ y_i)/n = 3779 - (118)(302)/10 = 215.4

    SS_x  = Σ x_i² - (Σ x_i)²/n = 1450 - (118)²/10 = 57.6

Therefore,

    β̂1 = SS_xy / SS_x = 215.4 / 57.6 = 3.74

    β̂0 = ȳ - β̂1 x̄ = 302/10 - 3.74(118/10) = -13.93
The Least Squares Regression Line

• The least squares regression line is
    ŷ = -13.93 + 3.74x
• Interpretation of coefficients:
  – The sample slope β̂1 = 3.74 tells us that, on average, for each additional
    year of education an individual's income rises by $3.74 thousand.
  – The y-intercept is β̂0 = -13.93. This would be the expected (or average)
    income for an individual with 0 years of education, which is meaningless
    here.
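A quick check of this example in Python (np.polyfit is one of several ways to obtain the same line):

    import numpy as np

    educ   = np.array([11, 12, 11, 15,  8, 10, 11, 12, 17, 11], float)
    income = np.array([25, 33, 22, 41, 18, 28, 32, 24, 53, 26], float)

    b1, b0 = np.polyfit(educ, income, 1)   # degree-1 fit returns [slope, intercept]
    print(round(b0, 2), round(b1, 2))      # approximately -13.93 and 3.74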
Example

• Car dealers across North America use the Red Book to determine a car's
  selling price on the basis of important features. One of these is the
  car's current odometer reading.
• To examine this issue, 100 three-year-old cars in mint condition were
  randomly selected. Their selling price and odometer reading were observed.
Portion of the data file
Odometer Price
37388 5318
44758 5061
45833 5008
30862 5795
….. …
34212 5283
33190 5259
39196 5356
36392 5133
Example (Minitab Output)
Regression Analysis
Analysis of Variance
Source DF SS MS F P
Regression 1 4183528 4183528 182.11 0.000
Error 98 2251362 22973
Total 99 6434890
Example

• The least squares regression line is
    ŷ = 6533.38 - 0.031158x

[Plot of the fitted line, with Price on the vertical axis.]
R² and R² adjusted
• R² measures the degree of linear association
between X and Y.
• So, an R² close to 0 does not necessarily
indicate that X and Y are unrelated (relation can
be nonlinear)
• Also, a high R² does not necessarily indicate
that the estimated regression line is a good fit.
• As more and more X’s are added to the model,
R² always increases. R²adj accounts for the
number of parameters in the model.
Scatter Plot

[Odometer vs. Price line fit plot: Price (roughly 4500 to 6000) against
Odometer (roughly 19000 to 49000), with the fitted line.]
Testing the slope

• Are X and Y linearly related?
    H0: β1 = 0
    HA: β1 ≠ 0
• Test statistic:
    t = (β̂1 - β1) / s_β̂1 ,   where   s_β̂1 = s / √SS_x
Testing the slope (continued)

• The rejection region:
    Reject H0 if t < -t_{α/2, n-2} or t > t_{α/2, n-2}
• Minitab output
Predictor Coef StDev T P
Constant 6533.38 84.51 77.31 0.000
Odometer -0.031158 0.002309 -13.49 0.000
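From this output, the t statistic for the odometer slope can be reproduced as follows (a sketch; scipy is assumed to be available for the p-value):

    from scipy import stats

    b1_hat, se_b1, n = -0.031158, 0.002309, 100
    t_stat = (b1_hat - 0) / se_b1                      # tests H0: beta1 = 0
    p_value = 2 * stats.t.sf(abs(t_stat), df=n - 2)    # two-sided p-value
    print(round(t_stat, 2), p_value)                   # about -13.49, p near 0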
Coefficient of Determination

    R² = 1 - SSE / SS_y

For the data in the odometer example we obtain:

    R² = 1 - SSE/SS_y = 1 - 2,251,363/6,434,890 = 1 - 0.3499 = 0.6501

    R²_adj = 1 - ((n - 1)/(n - p)) (SSE / SS_y)

where p is the number of parameters in the model (intercept plus predictors).
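A small sketch of both quantities, using the SSE and total SS from the Minitab output above (p = 2 parameters for the simple model; function names are illustrative):

    def r_squared(sse, ss_y):
        return 1 - sse / ss_y

    def r_squared_adj(sse, ss_y, n, p):
        """Adjusted R^2 with p = number of parameters (intercept + predictors)."""
        return 1 - (n - 1) / (n - p) * (sse / ss_y)

    print(round(r_squared(2_251_363, 6_434_890), 4))            # about 0.6501
    print(round(r_squared_adj(2_251_363, 6_434_890, 100, 2), 4))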
Using the Regression Equation

• Suppose we would like to predict the selling price for a car with 40,000
  miles on the odometer:

    ŷ = 6,533 - 0.0312x
      = 6,533 - 0.0312(40,000)
      = $5,285
Prediction and Confidence Intervals

• Prediction interval of y for x = x_g: the interval for predicting a
  particular value of y at a given x,

    ŷ ± t_{α/2, n-2} s_e √(1 + 1/n + (x_g - x̄)² / SS_x)

[Plot of the fitted line showing the prediction interval and the confidence
interval around the predicted value.]
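A sketch of this interval as a function (illustrative names; scipy supplies the t quantile):

    import numpy as np
    from scipy import stats

    def prediction_interval(x, y, x_g, conf=0.95):
        """Prediction interval for a single new y at x = x_g (simple regression)."""
        x, y = np.asarray(x, float), np.asarray(y, float)
        n = len(x)
        ss_x = np.sum((x - x.mean())**2)
        b1 = np.sum((x - x.mean()) * (y - y.mean())) / ss_x
        b0 = y.mean() - b1 * x.mean()
        resid = y - (b0 + b1 * x)
        s = np.sqrt(np.sum(resid**2) / (n - 2))              # s_e
        y_hat = b0 + b1 * x_g
        t = stats.t.ppf(1 - (1 - conf) / 2, df=n - 2)
        margin = t * s * np.sqrt(1 + 1/n + (x_g - x.mean())**2 / ss_x)
        return y_hat - margin, y_hat + margin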
Notes
• No matter how strong the statistical relation between X and Y, no
  cause-and-effect pattern is necessarily implied by the regression model.
  Example: although a positive and significant relationship is observed
  between vocabulary (X) and writing speed (Y), this does not imply that an
  increase in X causes an increase in Y. Other variables, such as age, may
  affect both X and Y: older children have a larger vocabulary and faster
  writing speed.
Regression Diagnostics
Residual Analysis:
Non-normality
Heteroscedasticity (non-constant variance)
Non-independence of the errors
Outlier
Influential observations
Standardized Residuals

• The standardized residuals are calculated as
    standardized residual_i = r_i / s_{r_i} ,   where r_i = y_i - ŷ_i
• The standard deviation of the i-th residual is
    s_{r_i} = s √(1 - h_i) ,   where   h_i = 1/n + (x_i - x̄)² / SS_x
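A numpy sketch of these quantities (function and variable names are illustrative):

    import numpy as np

    def standardized_residuals(x, y, b0, b1):
        """Residuals divided by their estimated standard deviations s*sqrt(1 - h_i)."""
        x, y = np.asarray(x, float), np.asarray(y, float)
        n = len(x)
        resid = y - (b0 + b1 * x)                     # r_i
        s = np.sqrt(np.sum(resid**2) / (n - 2))
        ss_x = np.sum((x - x.mean())**2)
        h = 1/n + (x - x.mean())**2 / ss_x            # leverages h_i
        return resid / (s * np.sqrt(1 - h))           # r_i / s_{r_i}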
Non-normality:

• The errors should be normally distributed. To check the normality of the
  errors we use a histogram of the residuals, a normal probability plot of
  the residuals, or tests such as the Shapiro-Wilk test (see the sketch
  below).
• Dealing with non-normality:
  – Transformation of Y
  – Other types of regression (e.g., Poisson or logistic regression)
  – Nonparametric methods (e.g., nonparametric regression, i.e. smoothing)
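For example, scipy's Shapiro-Wilk test can be applied to the residuals of the education-income fit from earlier (a sketch):

    import numpy as np
    from scipy import stats

    educ   = np.array([11, 12, 11, 15,  8, 10, 11, 12, 17, 11], float)
    income = np.array([25, 33, 22, 41, 18, 28, 32, 24, 53, 26], float)
    b1, b0 = np.polyfit(educ, income, 1)
    resid = income - (b0 + b1 * educ)

    stat, p = stats.shapiro(resid)   # H0: the residuals are normally distributed
    print(stat, p)                   # a small p-value suggests non-normal errors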
Non-constant variance:

• The error variance σ² should be constant.
• To diagnose non-constant variance, one method is to plot the residuals
  against the predicted values of y (or against x); see the sketch below. If
  the points are spread evenly around the expected value of the errors,
  which is 0, the error variance is constant. Formal tests such as the
  Breusch-Pagan test can also be used.
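A minimal residual-versus-fitted plot helper with matplotlib (a sketch; the fitted values and residuals are assumed to come from a fit such as the ones above):

    import matplotlib.pyplot as plt

    def residual_plot(fitted, resid):
        """Residuals vs fitted values; an even band around 0 suggests constant variance."""
        plt.scatter(fitted, resid)
        plt.axhline(0, linestyle="--")
        plt.xlabel("Fitted values")
        plt.ylabel("Residuals")
        plt.show()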
Dealing with non-constant variance

• Transform Y
• Re-specify the model (e.g., are important X's missing?)
• Use weighted least squares instead of ordinary least squares (a sketch
  follows below):

    minimize  Σ_{i=1}^{n}  ε_i² / Var(ε_i)
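A manual weighted least squares sketch for the simple linear model, with weights w_i = 1/Var(ε_i) assumed known or estimated (names are illustrative):

    import numpy as np

    def weighted_least_squares(x, y, w):
        """Minimize sum of w_i * (y_i - b0 - b1*x_i)^2, with w_i ~ 1/Var(eps_i)."""
        x, y, w = (np.asarray(a, float) for a in (x, y, w))
        x_w = np.sum(w * x) / np.sum(w)      # weighted means
        y_w = np.sum(w * y) / np.sum(w)
        b1 = np.sum(w * (x - x_w) * (y - y_w)) / np.sum(w * (x - x_w)**2)
        b0 = y_w - b1 * x_w
        return b0, b1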
Non-independence of the error variable:

• The values of the error should be independent. When the data are a time
  series, the errors are often correlated (i.e., autocorrelated or serially
  correlated). To detect autocorrelation we plot the residuals against the
  time periods; if there is no pattern, the errors are independent. More
  formal tests, such as the Durbin-Watson test, can also be used.
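The Durbin-Watson statistic can be computed directly from the residuals taken in time order (a sketch; values near 2 suggest no first-order autocorrelation):

    import numpy as np

    def durbin_watson(resid):
        """DW = sum((e_t - e_{t-1})^2) / sum(e_t^2) for residuals in time order."""
        e = np.asarray(resid, float)
        return np.sum(np.diff(e)**2) / np.sum(e**2)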
Outliers:

• An outlier is an observation that is unusually small or large. Two
  possibilities that produce an outlier are:
  1. An error in recording the data: detect the error and correct it. If the
     outlier point should not have been included in the data (it belongs to
     another population), discard the point from the sample.
  2. The observation is unusually small or large although it belongs to the
     sample and there is no recording error: do NOT remove it.
Influential Observations

[Two scatter plots of y against x: one including a single influential
observation and one with that observation removed.]
Influential Observations

• Detection: Cook's distance, DFFITS, DFBETAS (Neter, J., Kutner, M.H.,
  Nachtsheim, C.J., and Wasserman, W. (1996), Applied Linear Statistical
  Models, 4th edition, Irwin, pp. 378-384).
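A numpy sketch of Cook's distance for the simple linear model (illustrative; p = 2 parameters):

    import numpy as np

    def cooks_distance(x, y):
        """D_i = (e_i^2 / (p * MSE)) * h_i / (1 - h_i)^2 for each observation."""
        x, y = np.asarray(x, float), np.asarray(y, float)
        n, p = len(x), 2
        ss_x = np.sum((x - x.mean())**2)
        b1 = np.sum((x - x.mean()) * (y - y.mean())) / ss_x
        b0 = y.mean() - b1 * x.mean()
        e = y - (b0 + b1 * x)                       # residuals
        mse = np.sum(e**2) / (n - p)
        h = 1/n + (x - x.mean())**2 / ss_x          # leverages
        return (e**2 / (p * mse)) * h / (1 - h)**2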
Multicollinearity
• A common issue in multiple regression is
multicollinearity. This exists when some or
all of the predictors in the model are highly
correlated. In such cases, the estimated
coefficient of any variable depends on
which other variables are in the model.
Also, standard errors of the coefficients
are very high…
Multicollinearity

• Look at the correlation coefficients among the X's: if Cor > 0.8, suspect
  multicollinearity.
• Look at the variance inflation factors (VIF): VIF > 10 is usually a sign
  of multicollinearity (a VIF sketch follows after this list).
• If there is multicollinearity:
  – Use a transformation of the X's, e.g. centering or standardization.
    Ex: Cor(X, X²) = 0.991; after standardization Cor = 0!
  – Remove the X that causes the multicollinearity
  – Factor analysis
  – Ridge regression
  – …
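A manual VIF sketch using numpy, where X is an n-by-k matrix whose columns are the predictors (without the intercept column; names are illustrative):

    import numpy as np

    def vif(X):
        """VIF_j = 1 / (1 - R_j^2), with R_j^2 from regressing X_j on the other X's."""
        X = np.asarray(X, float)
        n, k = X.shape
        out = []
        for j in range(k):
            y = X[:, j]
            others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
            beta, *_ = np.linalg.lstsq(others, y, rcond=None)
            resid = y - others @ beta
            r2 = 1 - np.sum(resid**2) / np.sum((y - y.mean())**2)
            out.append(1 / (1 - r2))
        return out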
Exercise
• In baseball, the fans are always interested
in determining which factors lead to
successful teams. The table below lists the
team batting average and the team
winning percentage for the 14 league
teams at the end of a recent season.
Team batting average    Winning %
0.254 0.414
0.269 0.519
0.255 0.500
0.262 0.537
0.254 0.352
0.247 0.519
0.264 0.506
0.271 0.512
0.280 0.586
0.256 0.438
0.248 0.519
0.255 0.512
0.270 0.525
0.257 0.562
    Σ y_i = 7.001      Σ y_i² = 3.549
    Σ x_i = 3.642      Σ x_i² = 0.948622
    Σ x_i y_i = 1.824562

    SS_xy = Σ x_i y_i - (Σ x_i)(Σ y_i)/n = 1.824562 - (3.642)(7.001)/14 = 0.003302

    SS_x  = Σ x_i² - (Σ x_i)²/n = 0.948622 - (3.642)²/14 = 0.001182

    β̂1 = SS_xy / SS_x = 0.003302 / 0.001182 ≈ 2.7936

    β̂0 = ȳ - β̂1 x̄ = 0.5 - (2.7936)(0.2601) ≈ -0.2266

    SSE = SS_y - SS_xy²/SS_x = 0.04778 - (0.003302)²/0.001182 = 0.03856

So, s² = SSE/(n - 2) = 0.03856/(14 - 2) = 0.00321 and s = √s² = 0.0567.
d) Coefficient of Determination

    R² = SS_xy² / (SS_x SS_y) = 1 - SSE/SS_y = 1 - 0.03856/0.04778 ≈ 0.193
e) Predict with 90% confidence the winning percentage of a team whose batting
   average is 0.275.

    ŷ = -0.2266 + 2.7936(0.275) ≈ 0.5416

    ŷ ± t_{α/2, n-2} s √(1 + 1/n + (x_g - x̄)²/SS_x)
      = 0.5416 ± (1.782)(0.0567) √(1 + 1/14 + (0.275 - 0.2601)²/0.001182)
      = 0.5416 ± 0.1134

    90% PI for y: (0.4282, 0.6550)
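The exercise can be checked end to end with a short script (a sketch; scipy provides the t quantile):

    import numpy as np
    from scipy import stats

    ba  = np.array([0.254, 0.269, 0.255, 0.262, 0.254, 0.247, 0.264,
                    0.271, 0.280, 0.256, 0.248, 0.255, 0.270, 0.257])
    win = np.array([0.414, 0.519, 0.500, 0.537, 0.352, 0.519, 0.506,
                    0.512, 0.586, 0.438, 0.519, 0.512, 0.525, 0.562])

    n = len(ba)
    b1, b0 = np.polyfit(ba, win, 1)          # roughly 2.79 and -0.23
    resid = win - (b0 + b1 * ba)
    s = np.sqrt(np.sum(resid**2) / (n - 2))  # roughly 0.057

    x_g = 0.275
    y_hat = b0 + b1 * x_g                    # roughly 0.54
    ss_x = np.sum((ba - ba.mean())**2)
    t = stats.t.ppf(0.95, df=n - 2)          # 90% interval, so alpha/2 = 0.05
    margin = t * s * np.sqrt(1 + 1/n + (x_g - ba.mean())**2 / ss_x)
    print(y_hat - margin, y_hat + margin)    # roughly (0.43, 0.65)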