Section 2: Regression
Carlos M. Carvalho
The University of Texas McCombs School of Business
1. Simple Linear Regression
2. Multiple Linear Regression
3. Dummy Variables
4. Residual Plots and Transformations
5. Variable Selection and Regularization
6. Dimension Reduction Methods
1. Regression: General Introduction
1st Example: Predicting House Prices
Problem:
- Predict market price based on observed characteristics.

Solution:
- Look at property sales data where we know the price and some observed characteristics.
- Build a decision rule that predicts price as a function of the observed characteristics.
Predicting House Prices
[Scatter plot: house price vs. size]
Regression Model
Y = f (X1 , X2 , . . . , Xp ) + e
Y = b0 + b1 X1 + b2 X2 + . . . + bp Xp + e
Linear Prediction
There appears to be a linear relationship between price and size: as size goes up, price goes up.
Linear Prediction
[Diagram: the line Y = b0 + b1 X, with intercept b0 and slope b1]
Linear Prediction
What is the “fitted value”?
[Diagram: observed value Yi and fitted value Ŷi at Xi]

The dots are the observed values and the line represents our fitted values, given by Ŷi = b0 + b1 Xi.
Linear Prediction
What is the "residual" for the ith observation?

[Diagram: the residual is the vertical distance between the observed Yi and the fitted Ŷi]

ei = Yi − Ŷi
Least Squares
Ideally we want to minimize the size of all residuals:
- If they were all zero we would have a perfect line.
- Trade-off between moving closer to some points and at the same time moving away from other points.
- Minimize the "total" of residuals to get the best fit.
Least Squares chooses b0 and b1 to minimize the sum of squared residuals:

sum_{i=1}^{N} ei² = e1² + e2² + ... + eN²
                  = (Y1 − Ŷ1)² + (Y2 − Ŷ2)² + ... + (YN − ŶN)²
                  = sum_{i=1}^{N} (Yi − [b0 + b1 Xi])²
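A minimal numerical sketch of this criterion in Python (made-up size/price numbers, not the house data from the slides):

```python
import numpy as np

# hypothetical data: house size (1000s of sq ft) and price ($1000s)
size = np.array([1.0, 1.2, 1.5, 1.8, 2.0, 2.3])
price = np.array([72.0, 85.0, 99.0, 118.0, 129.0, 147.0])

# np.polyfit with deg=1 returns the least squares [slope, intercept]
b1, b0 = np.polyfit(size, price, deg=1)

fitted = b0 + b1 * size
residuals = price - fitted
sse = np.sum(residuals**2)   # the quantity least squares minimizes
print(b0, b1, sse)
```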
Least Squares – Excel Output
SUMMARY OUTPUT
Regression Statistics
Multiple R 0.909209967
R Square 0.826662764
Adjusted R Square 0.81332913
Standard Error 14.13839732
Observations 15
ANOVA
df SS MS F Significance F
Regression 1 12393.10771 12393.10771 61.99831126 2.65987E-06
Residual 13 2598.625623 199.8942787
Total 14 14991.73333
2nd Example: Offensive Performance in Baseball
1. Problems:
   - Evaluate/compare traditional measures of offensive performance.
   - Help evaluate the worth of a player.
2. Solutions:
   - Compare prediction rules that forecast runs as a function of either AVG (batting average), SLG (slugging percentage), or OBP (on-base percentage).
Baseball Data – Using AVG
Each observation corresponds to a team in MLB. Each quantity is
the average over a season.
(1/N) sum_{i=1}^{N} ei² = (1/N) sum_{i=1}^{N} (Ŷi − Runsi)², where Ŷi is the predicted runs for team i.
The Least Squares Criterion
Sample Mean and Sample Variance
Example

sy² = (1/(n − 1)) sum_{i=1}^{n} (Yi − Ȳ)²
[Two plots: the X sample with its mean X̄ and deviations (Xi − X̄), and the Y sample with its mean Ȳ and deviations (Yi − Ȳ), plotted against the sample index]

sx = 9.7, sy = 16.0
Covariance
Covariance measures the direction and strength of the linear relationship between Y and X:

Cov(Y, X) = sum_{i=1}^{n} (Yi − Ȳ)(Xi − X̄) / (n − 1)

[Scatter plot of Y vs. X divided into quadrants at (X̄, Ȳ): points with (Yi − Ȳ)(Xi − X̄) > 0 fall in the upper-right and lower-left quadrants, points with (Yi − Ȳ)(Xi − X̄) < 0 in the other two]

- sy = 15.98, sx = 9.7
- Cov(X, Y) = 125.9

How do we interpret that?
Correlation
corr(Y, X) = cov(X, Y) / sqrt(sx² sy²) = cov(X, Y) / (sx sy) = 125.9 / (15.98 × 9.7) = 0.812

[Same scatter plot of Y vs. X, divided into quadrants at (X̄, Ȳ)]
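A quick numpy check of these formulas (made-up data, not from the slides):

```python
import numpy as np

# hypothetical paired sample
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

n = len(x)
cov_xy = np.sum((x - x.mean()) * (y - y.mean())) / (n - 1)   # sample covariance
corr_xy = cov_xy / (x.std(ddof=1) * y.std(ddof=1))           # sample correlation

# same quantities from numpy's built-ins
assert np.isclose(cov_xy, np.cov(x, y, ddof=1)[0, 1])
assert np.isclose(corr_xy, np.corrcoef(x, y)[0, 1])
```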
Correlation
[Four scatter plots illustrating corr = 1, corr = .5, corr = .8, and corr = −.8]
Correlation
Only measures linear relationships:
corr(X , Y ) = 0 does not mean the variables are not related!
[Two scatter plots with strong nonlinear relationships but corr(X, Y) ≈ 0]
Back to Least Squares
1. Intercept:

   b0 = Ȳ − b1 X̄  ⇒  Ȳ = b0 + b1 X̄

2. Slope:

   b1 = corr(X, Y) × sY / sX = sum_{i=1}^{n} (Xi − X̄)(Yi − Ȳ) / sum_{i=1}^{n} (Xi − X̄)² = Cov(X, Y) / var(X)
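These formulas translate directly to numpy (again, made-up data):

```python
import numpy as np

# hypothetical data
x = np.array([1.0, 1.2, 1.5, 1.8, 2.0, 2.3])
y = np.array([72.0, 85.0, 99.0, 118.0, 129.0, 147.0])

b1 = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)   # slope = Cov(X, Y) / var(X)
b0 = y.mean() - b1 * x.mean()                         # intercept = Ȳ − b1 X̄

# equivalent form: b1 = corr(X, Y) * sY / sX
b1_alt = np.corrcoef(x, y)[0, 1] * y.std(ddof=1) / x.std(ddof=1)
assert np.isclose(b1, b1_alt)
print(b0, b1)
```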
Decomposing the Variance

SST = SSR + SSE, where SST = sum (Yi − Ȳ)² is the total variation, SSR = sum (Ŷi − Ȳ)² is the variation explained by the regression, and SSE = sum (Yi − Ŷi)² = sum ei² is the residual variation.
A Goodness of Fit Measure: R²

R² = SSR/SST = 1 − SSE/SST

- 0 < R² < 1.
- The closer R² is to 1, the better the fit.
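Continuing the same kind of sketch, R² can be computed from the fitted values (made-up data):

```python
import numpy as np

# hypothetical data and its least squares fit
x = np.array([1.0, 1.2, 1.5, 1.8, 2.0, 2.3])
y = np.array([70.0, 86.0, 99.0, 117.0, 128.0, 146.0])
b1, b0 = np.polyfit(x, y, deg=1)

y_hat = b0 + b1 * x
sse = np.sum((y - y_hat) ** 2)          # residual sum of squares
ssr = np.sum((y_hat - y.mean()) ** 2)   # regression sum of squares
sst = np.sum((y - y.mean()) ** 2)       # total sum of squares
r2 = ssr / sst                          # same as 1 - sse/sst for OLS with intercept
assert np.isclose(r2, 1 - sse / sst)
print(r2)
```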
Back to the House Data

R² = SSR/SST = 12395/14991 = 0.82
Back to Baseball
       R²      corr    SSE
OBP    0.88    0.94    0.79
SLG    0.76    0.87    1.64
AVG    0.63    0.79    2.49
Prediction and the Modeling Goal
Then we can say something like "with 95% probability the prediction error will be between −$28,000 and +$28,000".
Prediction and the Modeling Goal
- Suppose you only had the purple points in the graph. The dashed line fits the purple points. The solid line fits all the points. Which line is better? Why?

[Scatter plot: RPG vs. AVG, with a dashed line fit to the purple subset and a solid line fit to all points]

The Simple Linear Regression Model

Y = β0 + β1 X + ε,   ε ∼ N(0, σ²)
[Plot: data simulated from the simple linear regression model]
The Simple Linear Regression Model – Example
β0 = 40; β1 = 45; σ = 10
and you are asked to predict the price of a 1,500-square-foot house:

Y = 40 + 45(1.5) + ε = 107.5 + ε

With σ = 10, a 95% predictive interval is roughly 107.5 ± 20, i.e., between $87,500 and $127,500.
Conditional Distributions
Y = β0 + β1 X + ε
[Plot: the conditional distributions of Y at several values of X]

We estimate σ² with

s² = (1/(n − 2)) sum_{i=1}^{n} ei² = SSE / (n − 2)
Estimation of Error Variance
Where is s in the Excel output? It is the "Standard Error" reported in the Regression Statistics table.
One Picture Summary of SLR
- The plot below has the house data, the fitted regression line (b0 + b1 X), and bands at ±2s.
- From this picture, what can you tell me about β0, β1, and σ²? How about b0, b1, and s²?

[Scatter plot: price vs. size with the fitted line and ±2s bands]
Understanding Variation... Runs per Game and AVG
- blue line: all points
- red line: only purple points
- Which slope is closer to the true one? How much closer?

[Scatter plot: RPG vs. AVG with both fitted lines]
The Importance of Understanding Variation
Sampling Distribution of b1
sb1² = s² / sum (Xi − X̄)² = s² / ((n − 1) sx²)

Three factors: sample size (n), error variance (s²), and X-spread (sx).
Sampling Distribution of b0
sb0² = var(b0) = s² [ 1/n + X̄² / ((n − 1) sx²) ]
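A numpy sketch of these standard-error formulas (made-up data, not from the slides):

```python
import numpy as np

# hypothetical data
x = np.array([1.0, 1.2, 1.5, 1.8, 2.0, 2.3, 2.6, 3.0])
y = np.array([71.0, 84.0, 101.0, 113.0, 131.0, 144.0, 158.0, 181.0])
n = len(x)

b1, b0 = np.polyfit(x, y, deg=1)
e = y - (b0 + b1 * x)

s2 = np.sum(e**2) / (n - 2)                        # s² = SSE / (n − 2)
sx2 = np.var(x, ddof=1)
se_b1 = np.sqrt(s2 / ((n - 1) * sx2))              # standard error of b1
se_b0 = np.sqrt(s2 * (1/n + x.mean()**2 / ((n - 1) * sx2)))  # standard error of b0
print(se_b0, se_b1)
```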
Understanding Variation... Runs per Game and AVG
Regression with all points
SUMMARY OUTPUT
Regression Statistics
Multiple R 0.798496529
R Square 0.637596707
Adjusted R Square 0.624653732
Standard Error 0.298493066
Observations 30
ANOVA
df SS MS F Significance F
Regression 1 4.38915033 4.38915 49.26199 1.239E-07
Residual 28 2.494747094 0.089098
Total 29 6.883897424
sb1 = 4.78
Understanding Variation... Runs per Game and AVG
Regression with subsample
SUMMARY OUTPUT
Regression Statistics
Multiple R 0.933601392
R Square 0.87161156
Adjusted R Square 0.828815413
Standard Error 0.244815842
Observations 5
ANOVA
df SS MS F Significance F
Regression 1 1.220667405 1.220667 20.36659 0.0203329
Residual 3 0.17980439 0.059935
Total 4 1.400471795
sb1 = 10.78
Confidence Intervals

An (approximate) 95% confidence interval for β1 is b1 ± 2 × sb1; similarly, b0 ± 2 × sb0 for β0.
Example: Runs per Game and AVG
Regression with all points
SUMMARY OUTPUT
Regression Statistics
Multiple R 0.798496529
R Square 0.637596707
Adjusted R Square 0.624653732
Standard Error 0.298493066
Observations 30
ANOVA
df SS MS F Significance F
Regression 1 4.38915033 4.38915 49.26199 1.239E-07
Residual 28 2.494747094 0.089098
Total 29 6.883897424
Testing
t = (b1 − β1⁰) / sb1

where β1⁰ is the value of β1 under the null hypothesis.
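In practice, the estimates, standard errors, t-statistics, and confidence intervals all come out of one fit; a minimal statsmodels sketch with made-up data:

```python
import numpy as np
import statsmodels.api as sm

# hypothetical data (not from the slides)
x = np.array([1.0, 1.2, 1.5, 1.8, 2.0, 2.3, 2.6, 3.0])
y = np.array([71.0, 84.0, 101.0, 113.0, 131.0, 144.0, 158.0, 181.0])

fit = sm.OLS(y, sm.add_constant(x)).fit()
b0, b1 = fit.params                # intercept and slope
se_b0, se_b1 = fit.bse             # their standard errors
t_b1 = b1 / se_b1                  # t-statistic for H0: β1 = 0
ci = fit.conf_int(alpha=0.05)      # 95% confidence intervals, one row per coefficient
print(fit.summary())
```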
Example: Mutual Funds
Another Example of Conditional Distributions

Let's investigate the performance of the Windsor Fund, an aggressive large-cap fund by Vanguard... Let's look at a scattergram of the data:

rw = β0 + β1 rsp500 + ε
Hypothesis Testing – Windsor Fund Example
Recall the Windsor regression...

[Excel output: the estimates b0 and b1 with their standard errors sb0 and sb1]

- t = 32.10... reject β1 = 0!!
- The 95% confidence interval is [0.87, 0.99]... again, reject!!
Example: Mutual Funds

Does the Windsor Fund move one-for-one with the market? Test H0: β1 = 1.

Recall the Windsor regression...

[Excel output: the estimates b0 and b1 with their standard errors sb0 and sb1]

- t = (b1 − 1)/sb1 = −0.0643/0.0291 = −2.205... reject.
- The 95% confidence interval is [0.87, 0.99]... again, reject, but...
Testing – Why I like Conf. Int.

Suppose the 95% confidence interval for β1 were

[1.00001, 1.00002]

We would reject H0: β1 = 1, yet the difference from 1 is of no practical importance.
Testing – Why I like Conf. Int.

Now suppose the interval were

[−100, 100]

We would fail to reject H0: β1 = 1, but the interval is so wide that it tells us essentially nothing.
Testing – Summary
Forecasting
[Scatter plot: Price vs. Size, with predictions extended beyond the range of the observed data]

- Careful with extrapolation!
House Data – one more time!
- R² = 82%
- Great R², so we are happy using this model to predict house prices, right?

[Scatter plot: price vs. size with the fitted line]
House Data – one more time!
- But s = 14, leading to a predictive interval width of about US$60,000 (roughly ±$28,000 around the prediction)!! How do you feel about the model now?
- As a practical matter, s is a much more relevant quantity than R². Once again, intervals are your friend!

[Scatter plot: price vs. size with the fitted line and ±2s bands]
2. The Multiple Regression Model
The MLR Model
Same as always, but with more covariates.
Y = β0 + β1 X1 + β2 X2 + ... + βp Xp + ε

Y | X1, ..., Xp ∼ N(β0 + β1 X1 + ... + βp Xp, σ²)
The MLR Model
If p = 2, we can plot the regression surface in 3D.
Consider sales of a product as predicted by the price of this product (P1) and the price of a competing product (P2):

Sales = β0 + β1 P1 + β2 P2 + ε
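A minimal sketch of fitting an MLR like this with statsmodels (simulated sales/price data, not the dataset in the slides):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# simulated data standing in for the sales example
rng = np.random.default_rng(0)
p1 = rng.uniform(2, 8, 100)
p2 = rng.uniform(1, 15, 100)
sales = 100 - 20 * p1 + 15 * p2 + rng.normal(0, 25, 100)
df = pd.DataFrame({"Sales": sales, "P1": p1, "P2": p2})

fit = smf.ols("Sales ~ P1 + P2", data=df).fit()
print(fit.params)      # b0, b1, b2
print(fit.rsquared)    # R²
print(fit.summary())   # full regression output (like the Excel table)
```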
Least Squares
Y = β0 + β1 X1 + ... + βp Xp + ε,   ε ∼ N(0, σ²)
Least Squares
Regression Statistics
Multiple R 0.99
R Square 0.99
Adjusted R Square 0.99
Standard Error 28.42
Observations 100.00
ANOVA
df SS MS F Significance F
Regression 2.00 6004047.24 3002023.62 3717.29 0.00
Residual 97.00 78335.60 807.58
Total 99.00 6082382.84
Least Squares
Residuals: ei = Yi − Ŷi .
Least Squares: find b0, b1, b2, ..., bp to minimize sum_{i=1}^{n} ei².
Fitted Values in MLR
A useful way to plot the results for MLR problems is to look at Y (true values) against Ŷ (fitted values).

[Plot: Sales vs. fitted values]

If things are working, these values should form a nice straight line. Can you guess the slope of the blue line?
Fitted Values in MLR
With just P1...

[Plots: y = Sales against fitted values from models using different subsets of the predictors]
R-squared
Intervals for Individual Coefficients
bj ∼ N(βj, sbj²)
Understanding Multiple Regression
Understanding Multiple Regression
- If we regress Sales on our own price, we obtain a somewhat surprising conclusion... the higher the price the more we sell!!

[Regression plot: Sales vs. p1 with the fitted line
 Sales = 211.165 + 63.7130 p1
 S = 223.401   R-Sq = 19.6%   R-Sq(adj) = 18.8%]

- It looks like we should just raise our prices, right? NO, not if you have taken this statistics class!
Understanding Multiple Regression

- How can we see what is going on? Let's compare Sales in two different observations: weeks 82 and 99.
- We see that an increase in P1, holding P2 constant (week 82 to week 99), corresponds to a drop in Sales!

[Plots: p1 vs. p2 and Sales vs. p1, with weeks 82 and 99 highlighted]
Understanding Multiple Regression
Summary:
1. A larger P1 is associated with a larger P2, and the overall effect leads to bigger sales.
2. With P2 held fixed, a larger P1 leads to lower sales.
3. MLR does the trick and unveils the "correct" economic relationship between Sales and prices!
Understanding Multiple Regression
Beer Data (from an MBA class)
- nbeer – number of beers before getting drunk
- height and weight

[Scatter plot: nbeer vs. height]

nbeers = β0 + β1 height + ε

SUMMARY OUTPUT
Regression Statistics
Multiple R 0.58
R Square 0.34
Adjusted R Square 0.33
Standard Error 3.11
Observations 50.00
ANOVA
df SS MS F Significance F
Regression 1.00 237.77 237.77 24.60 0.00
Residual 48.00 463.86 9.66
Total 49.00 701.63
Understanding Multiple Regression
SUMMARY OUTPUT
nbeers = β0 + β1 weight + β2 height + ε
Regression Statistics
Multiple R 0.69
R Square 0.48
Adjusted R Square 0.46
Standard Error 2.78
Observations 50.00
ANOVA
df SS MS F Significance F
Regression 2.00 337.24 168.62 21.75 0.00
Residual 47.00 364.38 7.75
Total 49.00 701.63
S = 2.784   R-Sq = 48.1%   R-Sq(adj) = 45.9%

The correlations:
          nbeer    weight
weight    0.692
height    0.582    0.806

[Scatter plot: height vs. weight]
Understanding Multiple Regression
SUMMARY OUTPUT
nbeers = β0 + β1 weight + ε
Regression Statistics
Multiple R 0.69
R Square 0.48
Adjusted R Square 0.47
Standard Error 2.76
Observations 50
ANOVA
df SS MS F Significance F
Regression 1 336.0317807 336.0318 44.11878 2.60227E-08
Residual 48 365.5932193 7.616525
Total 49 701.625
Why is this a better model than the one with weight and height??
Understanding Multiple Regression
Any time a report says two variables are related and there's a suggestion of a "causal" relationship, ask yourself whether or not other variables might be the real reason for the effect. Multiple regression allows us to control for all important variables by including them in the regression. "Once we control for weight, height and beers are NOT related"!!
Back to Baseball – Let's try to add AVG on top of OBP

SUMMARY OUTPUT
Regression Statistics
Multiple R 0.948136
R Square 0.898961
Adjusted R Square 0.891477
Standard Error 0.160502
Observations 30
ANOVA
df SS MS F Significance F
Regression 2 6.188355 3.094177 120.1119098 3.63577E‐14
Residual 27 0.695541 0.025761
Total 29 6.883896
ANOVA
df SS MS F Significance F
Regression 2 6.28747 3.143735 142.31576 4.56302E‐15
Residual 27 0.596426 0.02209
Total 29 6.883896
Correlations
       AVG     OBP     SLG
AVG    1
OBP    0.77    1
SLG    0.75    0.83    1
3. Dummy Variables... Example: House Prices
A dummy variable turns a categorical characteristic (here, the neighborhood) into 0/1 regressors: with three neighborhoods we create Nbhd2 = 1 if the house is in neighborhood 2 (0 otherwise) and Nbhd3 = 1 if it is in neighborhood 3, with neighborhood 1 as the baseline:

Pricei = β0 + β1 Sizei + β2 Nbhd2i + β3 Nbhd3i + εi
Dummy Variables... Example: House Prices
Regression Statistics
Multiple R 0.828
R Square 0.685
Adjusted R Square 0.677
Standard Error 15.260
Observations 128
ANOVA
df SS MS F Significance F
Regression 3 62809.1504 20936 89.9053 5.8E-31
Residual 124 28876.0639 232.87
Total 127 91685.2143
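A sketch of the same kind of dummy-variable regression in Python (simulated data; coding the neighborhood with C() is one common way to do it, not necessarily how the slide data were prepared):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# simulated house-style data
rng = np.random.default_rng(1)
n = 128
df = pd.DataFrame({
    "Size": rng.uniform(1.5, 3.5, n),
    "Nbhd": rng.choice(["1", "2", "3"], n),
})
df["Price"] = (20 + 40 * df["Size"]
               + df["Nbhd"].map({"1": 0.0, "2": 20.0, "3": 40.0})
               + rng.normal(0, 15, n))

# C(Nbhd) expands the categorical variable into dummies (neighborhood 1 is the baseline)
fit = smf.ols("Price ~ Size + C(Nbhd)", data=df).fit()
print(fit.params)
```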
Dummy Variables... Example: House Prices
[Plot: Price vs. Size with separate fitted lines for Nbhd = 1, Nbhd = 2, and Nbhd = 3]
Dummy Variables... Example: House Prices
Pricei = β0 + β1 Sizei + εi
SUMMARY OUTPUT
Regression Statistics
Multiple R 0.553
R Square 0.306
Adjusted R Square 0.300
Standard Error 22.476
Observations 128
ANOVA
df SS MS F Significance F
Regression 1 28036.4 28036.36 55.501 1E-11
Residual 126 63648.9 505.1496
Total 127 91685.2
            Coefficients   Standard Error   t Stat   P-value   Lower 95%   Upper 95%
Intercept   -10.09         18.97            -0.53    0.60      -47.62      27.44
size         70.23          9.43             7.45    0.00       51.57      88.88
Dummy Variables... Example: House Prices
[Plot: Price vs. Size with the three neighborhood-specific lines and the "Just Size" line]
Sex Discrimination Case
[Scatter plot: Salary vs. Year Hired, with males and females marked separately]

Does it look like the effect of experience on salary is the same for males and females?
Sex Discrimination Case
For Females:
Salaryi = β0 + β1 Expi + εi
For Males:
Sex Discrimination Case
Salaryi = β0 + β1 Sexi + β2 Expi + β3 Expi ∗ Sexi + εi

SUMMARY OUTPUT
Regression Statistics
Multiple R 0.799130351
R Square 0.638609318
Adjusted R S 0.63329475
Standard Err 6.816298288
Observations 208
ANOVA
df SS MS F Significance F
Regression 3 16748.88 5582.96 120.16 7.513E-45
Residual 204 9478.232 46.4619
Total 207 26227.11
[Scatter plot: Salary vs. Year Hired with separate fitted lines for males and females]
Variable Interaction

The Exp ∗ Sex term is an interaction: it lets the slope on experience differ between males and females, not just the intercept.
4. Residual Plots and Transformations
Non Linearity
Example: Telemarketing
- How does length of employment affect productivity (number of calls per day)?
Non Linearity
Example: Telemarketing
- The residual plot highlights the non-linearity!
Non Linearity
Let's try

Y = β0 + β1 X + β2 X² + ε

The data...
Telemarketing
Adding Polynomials
[Plots: the linear model fit and the fit with X² added]

∂E[Y|X]/∂X = β1 + 2 β2 X
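A sketch of fitting the quadratic model and evaluating the marginal effect β1 + 2β2X (made-up telemarketing-style data):

```python
import numpy as np
import statsmodels.api as sm

# hypothetical data: months on the job and calls per day
months = np.array([10, 12, 15, 18, 20, 22, 25, 28, 30, 33, 36, 39], dtype=float)
calls = np.array([18, 20, 23, 25, 27, 28, 29, 30, 30, 31, 31, 30], dtype=float)

X = np.column_stack([months, months**2])      # include X and X²
fit = sm.OLS(calls, sm.add_constant(X)).fit()
b0, b1, b2 = fit.params

# the marginal effect of X depends on X: d E[Y|X] / dX = b1 + 2*b2*X
marginal_at_20 = b1 + 2 * b2 * 20
print(fit.params, marginal_at_20)
```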
Polynomial Regression
Closing Comments on Polynomials
Be very careful about predicting outside the data range. The curve
may do unintended things beyond the observed data.
Be careful when extrapolating...
[Plot: calls vs. months, with the fitted curve extrapolated beyond the observed range]
...and, be careful when adding more polynomial terms!
[Plot: calls vs. months with polynomial fits of degree 2, 3, and 8]
Non-constant Variance
Example...
Non-constant Variance
Y = γ0 X^β1 (1 + R)
The Log-Log Model
log(Y) = β0 + β1 log(X) + ε
Elasticity and the log-log Model
β1 ≈ d%Y / d%X   (Why?)
Price Elasticity
Price Elasticity of OJ
A chain of gas station convenience stores was interested in the dependency between the price of orange juice and its Sales...
They decided to run an experiment and change prices randomly at different locations. With the data in hand, let's first run a regression of Sales on Price:

Sales = β0 + β1 Price + ε
SUMMARY OUTPUT
Regression Statistics
Multiple R 0.719
R Square 0.517
Adjusted R Square 0.507
Standard Error 20.112
Observations 50.000
ANOVA
df SS MS F Significance F
Regression 1.000 20803.071 20803.071 51.428 0.000
Residual 48.000 19416.449 404.509
Total 49.000 40219.520
Price Elasticity of OJ
[Two plots: Sales vs. Price with the linear fit, and residuals vs. Price]

No good!!
Price Elasticity of OJ
But... would you really think this relationship would be linear? Moving a price from $1 to $2 is the same as changing it from $10 to $11?? We should probably be thinking about the price elasticity of OJ...

log(Sales) = γ0 + γ1 log(Price) + ε
SUMMARY OUTPUT
Regression Statistics
Multiple R 0.869
R Square 0.755
Adjusted R Square 0.750
Standard Error 0.386
Observations 50.000
ANOVA
df SS MS F Significance F
Regression 1.000 22.055 22.055 148.187 0.000
Residual 48.000 7.144 0.149
Total 49.000 29.199
[Two plots: the fit on the log scale and residuals vs. Price]

Much better!!
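A sketch of the log-log regression in Python (simulated OJ-style data, not the dataset from the slides):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# simulated price/sales data
rng = np.random.default_rng(2)
price = rng.uniform(1.5, 4.0, 50)
sales = np.exp(4.8 - 1.6 * np.log(price) + rng.normal(0, 0.4, 50))
df = pd.DataFrame({"Sales": sales, "Price": price})

# log-log regression: the slope estimates the price elasticity
fit = smf.ols("np.log(Sales) ~ np.log(Price)", data=df).fit()
elasticity = fit.params["np.log(Price)"]
print(elasticity)
```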
Making Predictions
Making Predictions
[Two plots titled "Plug-in Prediction": predictions on the original scale (Sales vs. Price) and on the log scale (log(Sales) vs. log(Price))]
Summary of Transformations
The bottom line: you should combine what the plots and the
regression output are telling you with your common sense and
knowledge about the problem. Keep playing around with it until
you get something that makes sense and has nothing obviously
wrong with it.
Airline Data
Monthly passengers in the U.S. airline industry (in 1,000s of passengers) from 1949 to 1960... we need to predict the number of passengers in the next couple of months.

[Time series plot: Passengers vs. Time]

Any ideas?
Airline Data
[Time series plot: Passengers and fitted values over Time]

What do you think?
Airline Data
[Plot: residuals over Time]

Is there any obvious pattern here? YES!!
Airline Data
The variance of the residuals seems to be growing in time... Let's try taking the log:

log(Yt) = β0 + β1 t + εt

[Plot: log(Passengers) and fitted values over Time]

Any better?
Airline Data
Residuals...

[Plot: standardized residuals over Time]

Still we can see some obvious temporal/seasonal pattern...
Airline Data
Okay, let's add dummy variables for months (only 11 dummies)...

log(Yt) = β0 + β1 t + β2 Jan + ... + β12 Dec + εt

[Plot: log(Passengers) and fitted values over Time]

Much better!!
Airline Data
Residuals...

[Plot: standardized residuals over Time]

I am still not happy... it doesn't look normal iid to me...
Airline Data
Residuals... corr(e(t), e(t−1)) = 0.786

[Scatter plot: e(t) vs. e(t−1)]

I was right! The residuals are dependent on time...
Airline Data
We have one more tool... let's add one lagged term:

log(Yt) = β0 + β1 t + β2 Jan + ... + β12 Dec + β13 log(Yt−1) + εt

[Plot: log(Passengers) and fitted values over Time]

Okay, good...
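A sketch of this final specification in Python, assuming a pandas Series of monthly passenger counts with a DatetimeIndex (the actual series is not reproduced here):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

def fit_airline_model(passengers):
    """passengers: a pandas Series of monthly totals with a DatetimeIndex (assumed)."""
    df = pd.DataFrame({"log_y": np.log(passengers.to_numpy(dtype=float))})
    df["t"] = np.arange(len(df))                      # linear time trend
    df["month"] = passengers.index.month.astype(str)  # month labels, turned into dummies by C()
    df["log_y_lag1"] = df["log_y"].shift(1)           # lagged term log(Y_{t-1})
    df = df.dropna()                                  # the first observation has no lag
    return smf.ols("log_y ~ t + C(month) + log_y_lag1", data=df).fit()
```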
Airline Data
Residuals...

[Plot: standardized residuals over Time]

Much better!!
Airline Data
Residuals... corr(e(t), e(t−1)) = −0.11

[Scatter plot: e(t) vs. e(t−1)]

Much better indeed!!
Model Building Process
Big models tend to over-fit and find features that are specific to the data at hand... i.e., not generalizable relationships.

The result: bad predictions and bad science!
5. Variable Selection and Regularization
Subset Selection
The idea here is very simple: fit as many models as you can and
compare their performance based on some criteria!
Issues:
- How many possible models? The total number of models is 2^p. Is this large?
- What criteria to use? Just as before, if prediction is what we have in mind, out-of-sample predictive ability should be the criterion.
Information Criteria
P(Mi) ≈ exp(−BIC(Mi)/2) / sum_{r=1}^{R} exp(−BIC(Mr)/2)
      = exp(−[BIC(Mi) − BICmin]/2) / sum_{r=1}^{R} exp(−[BIC(Mr) − BICmin]/2)
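A small helper implementing this formula (subtracting BICmin first, exactly as the second expression suggests, for numerical stability):

```python
import numpy as np

def bic_model_probs(bic_values):
    """Approximate model probabilities from a list of BIC values."""
    bic = np.asarray(bic_values, dtype=float)
    w = np.exp(-0.5 * (bic - bic.min()))   # relative weights
    return w / w.sum()                     # normalize so they add to 1

# hypothetical BICs for three candidate models
print(bic_model_probs([1012.3, 1015.1, 1024.8]))
```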
Search Strategies: Stepwise Regression
Shrinkage Methods
Ridge Regression
sum_{i=1}^{n} ( Yi − β0 − sum_{j=1}^{p} βj Xij )² + λ sum_{j=1}^{p} βj²
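A minimal ridge fit with scikit-learn (simulated data; alpha plays the role of λ):

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.preprocessing import StandardScaler

# simulated data; ridge is sensitive to scale, so standardize the predictors first
rng = np.random.default_rng(3)
X = rng.normal(size=(100, 10))
y = X[:, 0] * 3 - X[:, 1] * 2 + rng.normal(size=100)

Xs = StandardScaler().fit_transform(X)
fit = Ridge(alpha=1.0).fit(Xs, y)   # alpha is the penalty weight λ
print(fit.coef_)
```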
Ridge Regression
[Plots: mean squared error as a function of λ and of ||β̂λ^R||₂ / ||β̂||₂]

[Plots: standardized coefficient paths (Income, Limit, Rating, Student) for ridge and for the LASSO as λ varies]
LASSO

The LASSO replaces the squared (ℓ2) penalty of ridge with an absolute-value (ℓ1) penalty:

sum_{i=1}^{n} ( Yi − β0 − sum_{j=1}^{p} βj Xij )² + λ sum_{j=1}^{p} |βj|
Ridge vs. LASSO
Choosing λ
The idea is to solve the ridge or LASSO objective function over a grid of possible values for λ and pick the value with the best out-of-sample performance...
[Plots: RMSE vs. log(lambda) for ridge and for the LASSO]
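One way to do this in practice is cross-validation over the λ grid; a sketch with scikit-learn's LassoCV (simulated data):

```python
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.preprocessing import StandardScaler

# simulated data
rng = np.random.default_rng(4)
X = StandardScaler().fit_transform(rng.normal(size=(200, 20)))
y = X[:, 0] * 2 - X[:, 3] + rng.normal(size=200)

# LassoCV searches a grid of λ values (called alphas) by cross-validation
fit = LassoCV(cv=10).fit(X, y)
print(fit.alpha_)                              # chosen λ
print(np.sum(fit.coef_ != 0), "nonzero coefficients")
```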
6. Dimension Reduction Methods
Principal Components Regression
Principal Components Analysis

PCA looks for high-variance projections from multivariate x (i.e., the long direction) and finds the least squares fit. Here's a picture...

[Scatter plot: x2 vs. x1 with the PC1 and PC2 directions drawn in]

These two variables x1 and x2 are very correlated. PC1 tells you almost everything that is going on in this dataset! Components are ordered by variance of the fitted projection.

PCA will look for linear combinations of the original variables that account for most of their variability!
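A small simulated illustration with scikit-learn (two correlated variables, as in the picture):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# two highly correlated variables, like x1 and x2 above
rng = np.random.default_rng(5)
x1 = rng.normal(size=200)
x2 = 0.9 * x1 + 0.3 * rng.normal(size=200)
X = StandardScaler().fit_transform(np.column_stack([x1, x2]))

pca = PCA().fit(X)
print(pca.explained_variance_ratio_)   # PC1 should account for most of the variability
scores = pca.transform(X)              # the principal components (new coordinates)
```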
Principal Components Analysis
Let's look at a simple example... Data: protein consumption per person by country for nine variables: red meat, white meat, eggs, milk, fish, cereals, starch, nuts, vegetables.

[Two plots: WhiteMeat vs. RedMeat by country, and PC2 vs. PC1 by country]

Looks to me like PC1 measures how rich you are and PC2 something to do with the Mediterranean diet!
Principal Components Analysis
[Plot: PC2 vs. PC1 with country labels]
Principal Components Analysis: Comments
Principal Components Regression (PCR)
Histogram of loadings on PC1... What bills are important in
defining PC1?
[Histograms: loadings on PC1, with Afford. Health (amdt.) and TARP labeled, and PC1 scores, with districts TX-11 and CA-34 labeled]
[Panel: districts FL-18 and FL-19 labeled]
Principal Components Regression (PCR)
[Plots: % Obama vs. PC1 and PC2, with the 2010 losers highlighted]
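A sketch of PCR as a scikit-learn pipeline (simulated data, not the roll-call data):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# simulated data
rng = np.random.default_rng(6)
X = rng.normal(size=(150, 40))
y = X[:, :5].sum(axis=1) + rng.normal(size=150)

# PCR: standardize, keep the first few principal components, regress y on them
pcr = make_pipeline(StandardScaler(), PCA(n_components=2), LinearRegression())
pcr.fit(X, y)
print(pcr.score(X, y))   # in-sample R²
```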
Partial Least Squares (PLS)

Like PCR, PLS builds a small number of components from the X variables, but it uses the information in Y when constructing them: components are chosen to capture variation in X that is also predictive of Y.
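A matching PLS sketch with scikit-learn (same kind of simulated data):

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression

# simulated data
rng = np.random.default_rng(7)
X = rng.normal(size=(150, 40))
y = X[:, :5].sum(axis=1) + rng.normal(size=150)

pls = PLSRegression(n_components=2).fit(X, y)
print(pls.score(X, y))           # in-sample R²
fitted = pls.predict(X).ravel()  # fitted values from the two PLS components
```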
Partial Least Squares (PLS)
Roll Call Data again... it looks like the first component from PLS
is the same as the first principal component!
[Plots: % Obama vs. the first PLS component and vs. the first principal component]
Partial Least Squares (PLS)
[Two plots: % Obama vs. fitted values]
Partial Least Squares (PLS)
Not easy to understand the difference between the second component in each method (how is that for a homework!)... the bottom line is that, by using the information from Y in summarizing the X variables, PLS finds a second component that has the ability to explain part of Y.

[Plots: the second component from PCR (PC2) and from PLS (Comp 2)]