
Regression and Model Selection

Book Chapters 3 and 6.

Carlos M. Carvalho
The University of Texas McCombs School of Business

1
1. Simple Linear Regression
2. Multiple Linear Regression
3. Dummy Variables
4. Residual Plots and Transformations
5. Variable Selection and Regularization
6. Dimension Reduction Methods
1. Regression: General Introduction

I Regression analysis is the most widely used statistical tool for


understanding relationships among variables
I It provides a conceptually simple method for investigating
functional relationships between one or more factors and an
outcome of interest
I The relationship is expressed in the form of an equation or a
model connecting the response or dependent variable and one
or more explanatory or predictor variable

1
1st Example: Predicting House Prices

Problem:
I Predict market price based on observed characteristics

Solution:
I Look at property sales data where we know the price and
some observed characteristics.
I Build a decision rule that predicts price as a function of the
observed characteristics.

2
Predicting House Prices

It is much more useful to look at a scatterplot

[Scatterplot: price (vertical axis) vs. size (horizontal axis) for the property sales data]

In other words, view the data as points in the X × Y plane.

3
Regression Model

Y = response or outcome variable


X 1, X 2, X 3, . . . , Xp = explanatory or input variables

The general relationship approximated by:

Y = f (X1 , X2 , . . . , Xp ) + e

And a linear relationship is written

Y = b0 + b1 X1 + b2 X2 + . . . + bp Xp + e

4
Linear Prediction
Appears to be a linear relationship between price and size:
As size goes up, price goes up.

The line shown was fit by the “eyeball” method.

5
Linear Prediction

[Diagram of the line Y = b0 + b1 X: b0 is the intercept and b1 is the slope (the rise between X = 1 and X = 2)]

Our “eyeball” line has b0 = 35, b1 = 40.

6
Linear Prediction

Can we do better than the eyeball method?

We desire a strategy for estimating the slope and intercept


parameters in the model Ŷ = b0 + b1 X

A reasonable way to fit a line is to minimize the amount by which


the fitted value differs from the actual value.

This amount is called the residual.

7
Linear Prediction
What is the “fitted value”?

[Diagram: observed value Yi and fitted value Ŷi at Xi]

The dots are the observed values and the line represents our fitted
values given by Ŷi = b0 + b1 Xi.

8
Linear Prediction
What is the “residual”’ for the ith observation’ ?

[Diagram: the residual ei = Yi − Ŷi is the vertical distance between the observed value Yi and the fitted value Ŷi at Xi]

We can write Yi = Ŷi + (Yi − Ŷi ) = Ŷi + ei .

9
Least Squares
Ideally we want to minimize the size of all residuals:
I If they were all zero we would have a perfect line.
I Trade-off between moving closer to some points and at the
same time moving away from other points.
I Minimize the “total” of residuals to get best fit.
Least Squares chooses b0 and b1 to minimize the sum of squared residuals:

Σ_{i=1}^{N} ei² = e1² + e2² + · · · + eN² = (Y1 − Ŷ1)² + (Y2 − Ŷ2)² + · · · + (YN − ŶN)²
               = Σ_{i=1}^{N} (Yi − [b0 + b1 Xi])²

10
Least Squares – Excel Output

SUMMARY OUTPUT

Regression Statistics
Multiple R 0.909209967
R Square 0.826662764
Adjusted R Square 0.81332913
Standard Error 14.13839732
Observations 15

ANOVA
df SS MS F Significance F
Regression 1 12393.10771 12393.10771 61.99831126 2.65987E-06
Residual 13 2598.625623 199.8942787
Total 14 14991.73333

Coefficients Standard Error t Stat P-value Lower 95% Upper 95%


Intercept 38.88468274 9.09390389 4.275906499 0.000902712 19.23849785 58.53086763
Size 35.38596255 4.494082942 7.873900638 2.65987E-06 25.67708664 45.09483846

11
2nd Example: Offensive Performance in Baseball

1. Problems:
I Evaluate/compare traditional measures of offensive
performance
I Help evaluate the worth of a player
2. Solutions:
I Compare prediction rules that forecast runs as a function of
either AVG (batting average), SLG (slugging percentage) or
OBP (on base percentage)

12
2nd Example: Offensive Performance in Baseball

13
Baseball Data – Using AVG
Each observation corresponds to a team in MLB. Each quantity is
the average over a season.

I Y = runs per game; X = AVG (average)


LS fit: Runs/Game = -3.93 + 33.57 AVG
14
Baseball Data – Using SLG

I Y = runs per game


I X = SLG (slugging percentage)
LS fit: Runs/Game = -2.52 + 17.54 SLG
15
Baseball Data – Using OBP

I Y = runs per game


I X = OBP (on base percentage)
LS fit: Runs/Game = -7.78 + 37.46 OBP
16
Place your Money on OBP!!!

(1/N) Σ_{i=1}^{N} ei² = (1/N) Σ_{i=1}^{N} (Runs-hat_i − Runs_i)²

Average Squared Error


AVG 0.083
SLG 0.055
OBP 0.026

17
The Least Squares Criterion

The formulas for b0 and b1 that minimize the least squares


criterion are:
b1 = rxy × (sy / sx)          b0 = Ȳ − b1 X̄
where,
I X̄ and Ȳ are the sample mean of X and Y
I corr (x, y ) = rxy is the sample correlation
I sx and sy are the sample standard deviation of X and Y

18
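Not from the original slides: a quick numerical check of these formulas, as a small Python/numpy sketch on made-up (x, y) data. It computes b0 and b1 from the correlation and the standard deviations, then compares with a direct least squares fit.

import numpy as np

# Made-up (x, y) data, for illustration only
x = np.array([1.0, 1.5, 1.8, 2.2, 2.5, 3.0, 3.4])
y = np.array([70.0, 83.0, 88.0, 100.0, 105.0, 120.0, 135.0])

r_xy = np.corrcoef(x, y)[0, 1]          # sample correlation
sx, sy = x.std(ddof=1), y.std(ddof=1)   # sample standard deviations

b1 = r_xy * sy / sx                     # slope: b1 = r_xy * (sy / sx)
b0 = y.mean() - b1 * x.mean()           # intercept: b0 = Ybar - b1 * Xbar

# Same answer as a direct least squares fit
b1_ls, b0_ls = np.polyfit(x, y, deg=1)  # polyfit returns [slope, intercept] for deg=1
print(b0, b1)
print(b0_ls, b1_ls)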
Sample Mean and Sample Variance

I Sample Mean: measure of centrality


Ȳ = (1/n) Σ_{i=1}^{n} Yi

I Sample Variance: measure of spread


sy² = (1/(n−1)) Σ_{i=1}^{n} (Yi − Ȳ)²

I Sample Standard Deviation:


sy = √(sy²)

19
Example

sy² = (1/(n−1)) Σ_{i=1}^{n} (Yi − Ȳ)²

[Two plots: the deviations (Xi − X̄) and the deviations (Yi − Ȳ) for each observation in the sample]

sx = 9.7 sy = 16.0
20
Covariance
Measure the direction and strength of the linear relationship between Y and X
Cov(Y, X) = Σ_{i=1}^{n} (Yi − Ȳ)(Xi − X̄) / (n − 1)
[Scatterplot of Y vs. X split into quadrants at (X̄, Ȳ): points to the upper-right and lower-left have (Yi − Ȳ)(Xi − X̄) > 0, points in the other two quadrants have (Yi − Ȳ)(Xi − X̄) < 0]

I sy = 15.98, sx = 9.7
I Cov(X, Y) = 125.9

How do we interpret that?

21
Correlation

Correlation is the standardized covariance:


corr(X, Y) = cov(X, Y) / √(sx² sy²) = cov(X, Y) / (sx sy)

The correlation is scale invariant and the units of measurement


don’t matter: It is always true that −1 ≤ corr(X , Y ) ≤ 1.

This gives the direction (- or +) and strength (0 → 1)


of the linear relationship between X and Y .

22
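Not on the original slides: a small Python sketch (simulated data, for illustration only) computing the sample covariance and correlation exactly as defined above and checking them against numpy's built-ins.

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 2.0 * x + rng.normal(size=100)      # made-up, linearly related data

# Sample covariance (n - 1 in the denominator, as on the slide)
cov_xy = np.sum((x - x.mean()) * (y - y.mean())) / (len(x) - 1)

# Correlation = covariance standardized by the two standard deviations
corr_xy = cov_xy / (x.std(ddof=1) * y.std(ddof=1))

print(cov_xy, np.cov(x, y)[0, 1])        # matches numpy's built-in covariance
print(corr_xy, np.corrcoef(x, y)[0, 1])  # matches numpy's built-in correlation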
Correlation
corr(Y, X) = cov(X, Y) / (sx sy) = 125.9 / (15.98 × 9.7) = 0.812

[The same quadrant scatterplot as before, showing which points contribute positively and which contribute negatively to the covariance]

23
Correlation

[Four example scatterplots: corr = 1, corr = .5, corr = .8, corr = −.8]

24
Correlation
Only measures linear relationships:
corr(X , Y ) = 0 does not mean the variables are not related!

[Two scatterplots: corr = 0.01 despite a strong but nonlinear relationship, and corr = 0.72]

Also be careful with influential observations.

25
Back to Least Squares
1. Intercept:
b0 = Ȳ − b1 X̄ ⇒ Ȳ = b0 + b1 X̄

I The point (X̄ , Ȳ ) is on the regression line!


I Least squares finds the point of means and rotate the line
through that point until getting the “right” slope
2. Slope:

b1 = corr(X, Y) × (sY / sX) = Σ_{i=1}^{n} (Xi − X̄)(Yi − Ȳ) / Σ_{i=1}^{n} (Xi − X̄)²
   = Cov(X, Y) / var(X)

I So, the right slope is the correlation coefficient times a scaling


factor that ensures the proper units for b1
26
Decomposing the Variance
How well does the least squares line explain variation in Y ?
Remember that Y = Ŷ + e

Since Ŷ and e are uncorrelated, i.e. corr(Ŷ , e) = 0,

var(Y ) = var(Ŷ + e) = var(Ŷ ) + var(e)


Σ(Yi − Ȳ)²/(n−1) = Σ(Ŷi − mean(Ŷ))²/(n−1) + Σ(ei − ē)²/(n−1)

Given that ē = 0 and mean(Ŷ) = Ȳ (why?) we get to:

Σ_{i=1}^{n} (Yi − Ȳ)² = Σ_{i=1}^{n} (Ŷi − Ȳ)² + Σ_{i=1}^{n} ei²

27
Decomposing the Variance

SST: Total variation in Y, Σ(Yi − Ȳ)².

SSR: Variation in Y explained by the regression line, Σ(Ŷi − Ȳ)².

SSE: Variation in Y that is left unexplained, Σ ei².

SST = SSR + SSE, and SSR = SST ⇒ perfect fit.

Be careful of similar acronyms; e.g. SSR for “residual” SS.

28
A Goodness of Fit Measure: R 2

The coefficient of determination, denoted by R 2 ,


measures goodness of fit:

R² = SSR / SST = 1 − SSE / SST

I 0 < R 2 < 1.
I The closer R 2 is to 1, the better the fit.

29
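Not on the slides: a short numpy sketch on simulated data (loosely mimicking the house example, not the real file) verifying the decomposition SST = SSR + SSE and computing R² both ways.

import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(1.0, 3.5, size=50)              # made-up "size"-like data
y = 40 + 35 * x + rng.normal(0, 14, size=50)    # made-up "price"-like data

b1, b0 = np.polyfit(x, y, deg=1)
yhat = b0 + b1 * x
e = y - yhat

sst = np.sum((y - y.mean()) ** 2)      # total variation in Y
ssr = np.sum((yhat - y.mean()) ** 2)   # variation explained by the line
sse = np.sum(e ** 2)                   # variation left unexplained

print(np.isclose(sst, ssr + sse))      # SST = SSR + SSE
print(ssr / sst, 1 - sse / sst)        # two equivalent ways to compute R^2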
Back to the House Data
[Excel regression output for the house data]

R² = SSR / SST = 12395 / 14991 = 0.82

30
Back to Baseball

Three very similar, related ways to look at a simple linear


regression... with only one X variable, life is easy!

R2 corr SSE
OBP 0.88 0.94 0.79
SLG 0.76 0.87 1.64
AVG 0.63 0.79 2.49

31
Prediction and the Modeling Goal

There are two things that we want to know:


I What value of Y can we expect for a given X?
I How sure are we about this forecast? Or how different could
Y be from what we expect?

Our goal is to measure the accuracy of our forecasts or how much


uncertainty there is in the forecast. One method is to specify a
range of Y values that are likely, given an X value.

Prediction Interval: probable range for Y-values given X

32
Prediction and the Modeling Goal

Key Insight: To construct a prediction interval, we will have to


assess the likely range of error values corresponding to a Y value
that has not yet been observed!

We will build a probability model (e.g., normal distribution).

Then we can say something like "with 95% probability the error
will be no smaller than −$28,000 nor larger than $28,000".

We must also acknowledge that the “fitted” line may be fooled by


particular realizations of the residuals.

33
Prediction and the Modeling Goal
I Suppose you only had the purple points in the graph. The
dashed line fits the purple points. The solid line fits all the
points. Which line is better? Why?

[Scatterplot: RPG (runs per game) vs. AVG; the purple points are a subsample, the dashed line fits only the purple points, the solid line fits all the points]

I In summary, we need to work with the notion of a “true line”


and a probability distribution that describes deviation around
the line.
34
The Simple Linear Regression Model

The power of statistical inference comes from the ability to make


precise statements about the accuracy of the forecasts.

In order to do this we must invest in a probability model.

Simple Linear Regression Model: Y = β0 + β1 X + ε

ε ∼ N(0, σ 2 )

I β0 + β1 X represents the “true line”; The part of Y that


depends on X .
I The error term ε is independent "idiosyncratic noise"; The
part of Y not associated with X .

35
The Simple Linear Regression Model
Y = β0 + β1 X + ε

[Plot: simulated data from the model, y vs. x, scattered around the true line β0 + β1 x]

36
The Simple Linear Regression Model – Example

You are told (without looking at the data) that

β0 = 40; β1 = 45; σ = 10

and you are asked to predict price of a 1500 square foot house.

What do you know about Y from the model?

Y = 40 + 45(1.5) + ε
= 107.5 + ε

Thus our prediction for price is Y |X = 1.5 ∼ N(107.5, 10²)


and a 95% Prediction Interval for Y is 87.5 < Y < 127.5

37
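The plug-in calculation from this slide as a few lines of Python (the "plus or minus 2 sigma" interval is the approximate 95% interval used throughout these notes):

b0, b1, sigma = 40.0, 45.0, 10.0   # values given on the slide
x = 1.5                            # 1,500 square feet, in thousands

mu = b0 + b1 * x                   # E[Y | X = 1.5] = 107.5

# Y | X = 1.5 ~ N(107.5, 10^2), so the approximate 95% prediction interval is mu +/- 2*sigma
lower, upper = mu - 2 * sigma, mu + 2 * sigma
print(mu, lower, upper)            # 107.5, 87.5, 127.5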
Conditional Distributions
Y = β0 + β1 X + ε

[Plot: the conditional distributions of Y at several values of x, each centered on the true line]

The conditional distribution for Y given X is Normal:


Y |X = x ∼ N(β0 + β1 x, σ 2 ).
38
Estimation of Error Variance

We estimate σ² with:

s² = (1/(n−2)) Σ_{i=1}^{n} ei² = SSE / (n − 2)

(2 is the number of regression coefficients; i.e. 2 for β0 and β1 ).

We have n − 2 degrees of freedom because 2 have been “used up”


in the estimation of b0 and b1 .
We usually use s = √(SSE/(n − 2)), which is in the same units as Y. It's
also called the regression standard error.

39
Estimation of Error Variance
Where is s in the Excel output?

[Excel regression output with the "Standard Error" entry highlighted]

Remember that whenever you see "standard error", read it as estimated
standard deviation: σ is the standard deviation.

40
One Picture Summary of SLR
I The plot below has the house data, the fitted regression line
(b0 + b1 X ) and ±2 ∗ s...
I From this picture, what can you tell me about β0 , β1 and σ 2 ?
How about b0 , b1 and s 2 ?

[Scatterplot: price vs. size with the fitted line b0 + b1·size and bands at ±2s]

41
Understanding Variation... Runs per Game and AVG
I blue line: all points
I red line: only purple points
I Which slope is closer to the true one? How much closer?


[Scatterplot: RPG vs. AVG with the blue line (all points) and the red line (purple points only)]

42
The Importance of Understanding Variation

When estimating a quantity, it is vital to develop a notion of the


precision of the estimation; for example:
I estimate the slope of the regression line
I estimate the value of a house given its size
I estimate the expected return on a portfolio
I estimate the value of a brand name
I estimate the damages from patent infringement
Why is this important?
We are making decisions based on estimates, and these may be
very sensitive to the accuracy of the estimates!

43
Sampling Distribution of b1

The sampling distribution of b1 describes how estimator b1 = β̂1


varies over different samples with the X values fixed.

It turns out that b1 is normally distributed (approximately):


b1 ∼ N(β1, s_b1²).
I b1 is unbiased: E [b1 ] = β1 .
I sb1 is the standard error of b1 . In general, the standard error is
the standard deviation of an estimate. It determines how close
b1 is to β1 .
I This is a number directly available from the regression output.

44
Sampling Distribution of b1

Can we intuit what should be in the formula for sb1 ?


I How should s figure in the formula?
I What about n?
I Anything else?

s_b1² = s² / Σ(Xi − X̄)² = s² / ((n − 1) sx²)

Three Factors:
sample size (n), error variance (s 2 ), and X -spread (sx ).

45
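Not on the slides: a sketch of this formula on simulated data (the numbers are made up, not the baseball data). Fit the line, estimate s², compute the standard error of b1 and an approximate 95% confidence interval.

import numpy as np

rng = np.random.default_rng(2)
n = 30
x = rng.uniform(0.24, 0.29, size=n)              # made-up AVG-like values
y = -4 + 34 * x + rng.normal(0, 0.3, size=n)     # made-up runs-per-game values

b1, b0 = np.polyfit(x, y, deg=1)
e = y - (b0 + b1 * x)

s2 = np.sum(e ** 2) / (n - 2)                    # s^2 = SSE / (n - 2)
sb1 = np.sqrt(s2 / np.sum((x - x.mean()) ** 2))  # standard error of b1

print(b1, sb1)
print(b1 - 2 * sb1, b1 + 2 * sb1)                # approximate 95% confidence interval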
Sampling Distribution of b0

The intercept is also normal and unbiased: b0 ∼ N(β0, s_b0²).

s_b0² = var(b0) = s² × (1/n + X̄² / ((n − 1) sx²))

What is the intuition here?

46
Understanding Variation... Runs per Game and AVG
Regression with all points
SUMMARY  OUTPUT

Regression  Statistics
Multiple  R 0.798496529
R  Square 0.637596707
Adjusted  R  Square 0.624653732
Standard  Error 0.298493066
Observations 30

ANOVA
df SS MS F Significance  F
Regression 1 4.38915033 4.38915 49.26199 1.239E-07
Residual 28 2.494747094 0.089098
Total 29 6.883897424

Coefficients Standard Error t Stat P-value Lower 95% Upper 95%


Intercept -3.936410446 1.294049995 -3.04193 0.005063 -6.587152 -1.2856692
AVG 33.57186945 4.783211061 7.018689 1.24E-07 23.773906 43.369833

sb1 = 4.78

47
Understanding Variation... Runs per Game and AVG
Regression with subsample
SUMMARY  OUTPUT

Regression  Statistics
Multiple  R 0.933601392
R  Square 0.87161156
Adjusted  R  Square 0.828815413
Standard  Error 0.244815842
Observations 5

ANOVA
df SS MS F Significance  F
Regression 1 1.220667405 1.220667 20.36659 0.0203329
Residual 3 0.17980439 0.059935
Total 4 1.400471795

Coefficients Standard Error t Stat P-value Lower 95% Upper 95%


Intercept -7.956288201 2.874375987 -2.76801 0.069684 -17.10384 1.191259
AVG 48.69444328 10.78997028 4.512936 0.020333 14.355942 83.03294

sb1 = 10.78

48
Confidence Intervals

I 68% Confidence Interval: b1 ± 1 × sb1


I 95% Confidence Interval: b1 ± 2 × sb1
I 99% Confidence Interval: b1 ± 3 × sb1

Same thing for b0


I 95% Confidence Interval: b0 ± 2 × sb0

The confidence interval provides you with a set of plausible values


for the parameters

49
Example: Runs per Game and AVG
Regression with all points
SUMMARY  OUTPUT

Regression  Statistics
Multiple  R 0.798496529
R  Square 0.637596707
Adjusted  R  Square 0.624653732
Standard  Error 0.298493066
Observations 30

ANOVA
df SS MS F Significance  F
Regression 1 4.38915033 4.38915 49.26199 1.239E-07
Residual 28 2.494747094 0.089098
Total 29 6.883897424

Coefficients Standard Error t Stat P-value Lower 95% Upper 95%


Intercept -3.936410446 1.294049995 -3.04193 0.005063 -6.587152 -1.2856692
AVG 33.57186945 4.783211061 7.018689 1.24E-07 23.773906 43.369833

[b1 − 2 × sb1 ; b1 + 2 × sb1 ] ≈ [23.77; 43.36]

50
Testing

Suppose we want to assess whether or not β1 equals a proposed


value β10 . This is called hypothesis testing.

Formally we test the null hypothesis:


H0 : β1 = β1⁰
vs. the alternative
H1 : β1 ≠ β1⁰

51
Testing

There are 2 ways we can think about testing:


1. Building a test statistic... the t-stat,

t = (b1 − β1⁰) / s_b1

This quantity measures how many standard deviations the estimate (b1) is
from the proposed value (β1⁰).
If the absolute value of t is greater than 2, we need to worry
(why?)... we reject the hypothesis.

52
Testing

2. Looking at the confidence interval. If the proposed value is


outside the confidence interval you reject the hypothesis.
Notice that this is equivalent to the t-stat. An absolute value
for t greater than 2 implies that the proposed value is outside
the confidence interval... therefore reject.
This is my preferred approach for the testing problem. You
can’t go wrong by using the confidence interval!

53
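Not on the slides: the t-stat as a tiny Python helper. The numbers plugged in below are the approximate Windsor estimates used a few slides ahead (b1 ≈ 0.9357, s_b1 ≈ 0.0291), so the two calls reproduce the two tests discussed there.

# t-statistic for H0: beta1 = beta1_0, given the estimate and its standard error
def t_stat(b, se, null_value):
    # how many standard errors the estimate is from the proposed value
    return (b - null_value) / se

b1, sb1 = 0.9357, 0.0291        # approximate Windsor values from the slides ahead
print(t_stat(b1, sb1, 0.0))     # about 32   -> reject beta1 = 0
print(t_stat(b1, sb1, 1.0))     # about -2.2 -> reject beta1 = 1 (barely)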
Example: Mutual Funds
Another Example of Conditional Distributions
Let’s investigate the performance of the Windsor Fund, an
aggressive large cap fund by Vanguard...
-"./0$(11#$2.$2$032.."45426$17$.8"$*2.29$

Applied Regression Analysis


!""#$%%&$'()*"$+,
Carlos plot
The M. Carvalho
shows monthly returns for Windsor vs. the S&P500 54
Example: Mutual Funds

Consider a CAPM regression for the Windsor mutual fund.

rw = β0 + β1 rsp500 + ε

Let’s first test β1 = 0

H0 : β1 = 0. Is the Windsor fund related to the market?


H1 : β1 ≠ 0

55
Hypothesis Testing – Windsor Fund Example
-"./(($01"$!)2*345$5"65"33)42$72$8$,9:;<
Example: Mutual Funds

b! sb! b!
sb!
Applied Regression Analysis
!""#$%%%&$'()*"$+,
Carlos M. Carvalho
I t = 32.10... reject β1 = 0!!
I the 95% confidence interval is [0.87; 0.99]... again, reject!!

56
Example: Mutual Funds

Now let’s test β1 = 1. What does that mean?

H0 : β1 = 1 Windsor is as risky as the market.


H1 : β1 ≠ 1 and Windsor softens or exaggerates market moves.

We are asking whether or not Windsor moves in a different way


than the market (e.g., is it more conservative?).

57
-"./(($01"$!)2*345$5"65"33)42$72$8$,9:;<
Example: Mutual Funds

b! sb! b!
sb!
Applied Regression Analysis
!""#$%%%&$'()*"$+,
b1 −1
Carlos M. Carvalho −0.0643
I t= sb1 = 0.0291 = −2.205... reject.
I the 95% confidence interval is [0.87; 0.99]... again, reject,
but...
58
Testing – Why I like Conf. Int.

I Suppose in testing H0 : β1 = 1 you got a t-stat of 6 and the


confidence interval was

[1.00001, 1.00002]

Do you reject H0 : β1 = 1? Could you justify that to your
boss? Probably not! (why?)

59
Testing – Why I like Conf. Int.

I Now, suppose in testing H0 : β1 = 1 you got a t-stat of -0.02


and the confidence interval was

[−100, 100]

Do you accept H0 : β1 = 1? Could you justify that to your
boss? Probably not! (why?)

The Confidence Interval is your best friend when it comes to


testing!!

60
Testing – Summary

I Large t or small p-value mean the same thing...


I p-value < 0.05 is equivalent to a t-stat > 2 in absolute value
I Small p-value means something weird would have happened if the null
hypothesis were true...
I Bottom line, small p-value → REJECT! Large t → REJECT!
I But remember, always look at the confidence interval!

61
Forecasting

[Scatterplot: Price vs. Size over a wide range of sizes, with the fitted line extended beyond the observed data]
I Careful with extrapolation!
62
House Data – one more time!
I R 2 = 82%
I Great R 2 , we are happy using this model to predict house
prices, right?


[Scatterplot: price vs. size with the fitted line]

63
House Data – one more time!
I But, s = 14 leading to a predictive interval width of about
US$60,000!! How do you feel about the model now?
I As a practical matter, s is a much more relevant quantity than
R 2 . Once again, intervals are your friend!

[The same price vs. size plot, now with the ±2s prediction bands]

64
2. The Multiple Regression Model

Many problems involve more than one independent variable or


factor which affects the dependent or response variable.

I More than size to predict house price!


I Demand for a product given prices of competing brands,
advertising, household attributes, etc.

In SLR, the conditional mean of Y depends on X. The Multiple


Linear Regression (MLR) model extends this idea to include more
than one independent variable.

65
The MLR Model
Same as always, but with more covariates.

Y = β0 + β1 X1 + β2 X2 + · · · + βp Xp + ε

Recall the key assumptions of our linear regression model:


(i) The conditional mean of Y is linear in the Xj variables.
(ii) The error term (deviations from line)
I are normally distributed
I independent from each other
I identically distributed (i.e., they have constant variance)

Y |X1 . . . Xp ∼ N(β0 + β1 X1 . . . + βp Xp , σ 2 )

66
The MLR Model
If p = 2, we can plot the regression surface in 3D.
Consider sales of a product as predicted by price of this product
(P1) and the price of a competing product (P2).
Sales = β0 + β1 P1 + β2 P2 + ε

67
Least Squares

Y = β0 + β1 X1 . . . + βp Xp + ε, ε ∼ N(0, σ 2 )

How do we estimate the MLR model parameters?

The principle of Least Squares is exactly the same as before:


I Define the fitted values
I Find the best fitting plane by minimizing the sum of squared
residuals.

68
Least Squares

Model: Salesi = β0 + β1 P1i + β2 P2i + εi ,  ε ∼ N(0, σ²)


SUMMARY OUTPUT

Regression Statistics
Multiple R 0.99
R Square 0.99
Adjusted R Square 0.99
Standard Error 28.42
Observations 100.00

ANOVA
df SS MS F Significance F
Regression 2.00 6004047.24 3002023.62 3717.29 0.00
Residual 97.00 78335.60 807.58
Total 99.00 6082382.84

Coefficients Standard Error t Stat P-value Lower 95% Upper 95%


Intercept 115.72 8.55 13.54 0.00 98.75 132.68
p1 -97.66 2.67 -36.60 0.00 -102.95 -92.36
p2 108.80 1.41 77.20 0.00 106.00 111.60

b0 = β̂0 = 115.72, b1 = β̂1 = −97.66, b2 = β̂2 = 108.80,


s = σ̂ = 28.42

69
Least Squares

Just as before, each bi is our estimate of βi

Fitted Values: Ŷi = b0 + b1 X1i + b2 X2i + . . . + bp Xpi.

Residuals: ei = Yi − Ŷi.

Least Squares: Find b0, b1, b2, . . . , bp to minimize Σ_{i=1}^{n} ei².

In MLR the formulas for the bi ’s are too complicated so we won’t


talk about them...

70
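Not on the slides: a sketch of MLR by least squares with numpy. The data are simulated, loosely patterned on the Sales/P1/P2 example; the coefficients used to simulate are assumptions, not the real data.

import numpy as np

rng = np.random.default_rng(3)
n = 100
p1 = rng.uniform(2, 8, size=n)                         # made-up own price
p2 = 0.8 * p1 + rng.uniform(0, 8, size=n)              # made-up competitor price (correlated with p1)
sales = 115 - 97 * p1 + 108 * p2 + rng.normal(0, 28, size=n)

X = np.column_stack([np.ones(n), p1, p2])              # design matrix with an intercept column
b, *_ = np.linalg.lstsq(X, sales, rcond=None)          # minimizes the sum of squared residuals

e = sales - X @ b
s = np.sqrt(np.sum(e ** 2) / (n - X.shape[1]))         # regression standard error

print(b)   # b0, b1, b2 -- close to the values used to simulate the data
print(s)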
Fitted Values in MLR
Useful way to plot the results for MLR problems is to look at
Y (true values) against Ŷ (fitted values).

[Plot: y = Sales vs. ŷ (fitted values from the MLR on p1 and p2)]

If things are working, these values should form a nice straight line. Can
you guess the slope of the blue line?
71
Fitted Values in MLR
With just P1...
[Two plots: Sales vs. p1 (left) and Sales vs. ŷ from the SLR on p1 alone (right)]

I Left plot: Sales vs P1


I Right plot: Sales vs. ŷ (only P1 as a regressor)
72
Fitted Values in MLR

Now, with P1 and P2...


[Three plots: Sales vs. ŷ from the SLR on p1, Sales vs. ŷ from the SLR on p2, and Sales vs. ŷ from the MLR on p1 and p2]

I First plot: Sales regressed on P1 alone...


I Second plot: Sales regressed on P2 alone...
I Third plot: Sales regressed on P1 and P2

73
R-squared

I We still have our old variance decomposition identity...

SST = SSR + SSE

I ... and R 2 is once again defined as


R² = SSR / SST = 1 − SSE / SST
telling us the percentage of variation in Y explained by the
X ’s.
I In Excel, R 2 is in the same place and “Multiple R” refers to
the correlation between Ŷ and Y .

74
Intervals for Individual Coefficients

As in SLR, the sampling distribution tells us how close we can


expect bj to be from βj

The LS estimators are unbiased: E [bj ] = βj for j = 0, . . . , p.

I We denote the sampling distribution of each estimator as

bj ∼ N(βj, s_bj²)

75
Intervals for Individual Coefficients

Intervals and t-statistics are exactly the same as in SLR.

I A 95% C.I. for βj is approximately bj ± 2sbj


I The t-stat: tj = (bj − βj⁰) / s_bj is the number of standard errors
between the LS estimate and the null value (βj⁰)
I As before, we reject the null when t-stat is greater than 2 in
absolute value
I Also as before, a small p-value leads to a rejection of the null
I Rejecting when the p-value is less than 0.05 is equivalent to
rejecting when the |tj | > 2

76
Intervals for Individual Coefficients

IMPORTANT: Intervals and testing via bj & sbj are one-at-a-time


procedures:
I You are evaluating the j th coefficient conditional on the other
X ’s being in the model, but regardless of the values you’ve
estimated for the other b’s.

77
Understanding Multiple Regression

The Sales Data:


I Sales : units sold in excess of a baseline
I P1: our price in $ (in excess of a baseline price)
I P2: competitors price (again, over a baseline)

78
Understanding Multiple Regression
I If we regress Sales on our own price, we obtain a somewhat surprising conclusion... the higher the price the more we sell!!

Regression Plot: Sales = 211.165 + 63.7130 p1
S = 223.401   R-Sq = 19.6 %   R-Sq(adj) = 18.8 %

[Scatterplot of Sales vs. p1 with the fitted line: higher prices are associated with more sales!!]

I It looks like we should just raise our prices, right? NO, not if
you have taken this statistics class!
79
Understanding Multiple Regression

I The regression equation for Sales on own price (P1) is:

Sales = 211 + 63.7P1

I If now we add the competitors price to the regression we get

Sales = 116 − 97.7P1 + 109P2

I Does this look better? How did it happen?


I Remember: −97.7 is the effect on sales of a change in P1
with P2 held fixed!!

80
Understanding Multiple Regression

I How can we see what is going on? Let's compare Sales in two
different observations: weeks 82 and 99.
I We see that an increase in P1, holding P2 constant (from week 82 to
week 99), corresponds to a drop in Sales!

[Two plots: p1 vs. p2 with weeks 82 and 99 highlighted, and Sales vs. p1 with the same two weeks highlighted]

I Note the strong relationship (dependence) between P1 and P2!!
81
Understanding Multiple Regression
I Let's look at a subset of points where P1 varies and P2 is held
approximately constant...

[Two plots: p1 vs. p2 with the selected subset highlighted, and Sales vs. p1 for that subset]

I For a fixed level of P2, variation in P1 is negatively correlated
with Sales!
82
Understanding Multiple Regression
I Below, different colors indicate different ranges of P2...
I Larger p1 is associated with larger p2, yet for each fixed level of
p2 there is a negative relationship between Sales and p1.

[Two plots, colored by p2 range: p1 vs. p2 and Sales vs. p1]

83
Understanding Multiple Regression

I Summary:
1. A larger P1 is associated with larger P2 and the overall effect
leads to bigger sales
2. With P2 held fixed, a larger P1 leads to lower sales
3. MLR does the trick and unveils the “correct” economic
relationship between Sales and prices!

84
Understanding Multiple Regression
Beer Data (from an MBA class)
I nbeer – number of beers before getting drunk
I height and weight
[Scatterplot: nbeer vs. height]

Is number of beers related to height?


85
Understanding Multiple Regression

nbeers = β0 + β1 height + ε

SUMMARY OUTPUT
Regression Statistics
Multiple R 0.58
R Square 0.34
Adjusted R Square 0.33
Standard Error 3.11
Observations 50.00

ANOVA
df SS MS F Significance F
Regression 1.00 237.77 237.77 24.60 0.00
Residual 48.00 463.86 9.66
Total 49.00 701.63

Coefficients Standard Error t Stat P-value Lower 95% Upper 95%


Intercept -36.92 8.96 -4.12 0.00 -54.93 -18.91
height 0.64 0.13 4.96 0.00 0.38 0.90

Yes! Beers and height are related...

86
Understanding Multiple Regression

nbeers = β0 + β1 weight + β2 height + ε

SUMMARY OUTPUT
Regression Statistics
Multiple R 0.69
R Square 0.48
Adjusted R Square 0.46
Standard Error 2.78
Observations 50.00

ANOVA
df SS MS F Significance F
Regression 2.00 337.24 168.62 21.75 0.00
Residual 47.00 364.38 7.75
Total 49.00 701.63

Coefficients Standard Error t Stat P-value Lower 95% Upper 95%


Intercept -11.19 10.77 -1.04 0.30 -32.85 10.48
weight 0.09 0.02 3.58 0.00 0.04 0.13
height 0.08 0.20 0.40 0.69 -0.32 0.47

What about now?? Height is not necessarily a factor...

87
Understanding Multiple Regression

S = 2.784   R-Sq = 48.1%   R-Sq(adj) = 45.9%

The correlations:
          nbeer   weight
weight    0.692
height    0.582   0.806

[Scatterplot: height vs. weight — the two x's are highly correlated!!]

I If we regress “beers” only on height we see an effect. Bigger


heights go with more beers.
I However, when height goes up weight tends to go up as well...
in the first regression, height was a proxy for the real cause of
drinking ability. Bigger people can drink more and weight is a
more accurate measure of “bigness”.
88
Understanding Multiple Regression

[Same regression output (S = 2.784, R-Sq = 48.1%, R-Sq(adj) = 45.9%), correlation table, and height vs. weight plot as the previous slide]

I In the multiple regression, when we consider only the variation


in height that is not associated with variation in weight, we
see no relationship between height and beers.

89
Understanding Multiple Regression

nbeers = β0 + β1 weight + ε

SUMMARY OUTPUT

Regression Statistics
Multiple R 0.69
R Square 0.48
Adjusted R Square 0.47
Standard Error 2.76
Observations 50

ANOVA
df SS MS F Significance F
Regression 1 336.0317807 336.0318 44.11878 2.60227E-08
Residual 48 365.5932193 7.616525
Total 49 701.625

Coefficients Standard Error t Stat P-value Lower 95% Upper 95%


Intercept -7.021 2.213 -3.172 0.003 -11.471 -2.571
weight 0.093 0.014 6.642 0.000 0.065 0.121

Why is this a better model than the one with weight and height??

90
Understanding Multiple Regression

In general, when we see a relationship between y and x (or x’s),


that relationship may be driven by variables “lurking” in the
background which are related to your current x’s.

This makes it hard to reliably find “causal” relationships. Any


correlation (association) you find could be caused by other
variables in the background... correlation is NOT causation

Any time a report says two variables are related and there’s a
suggestion of a “causal” relationship, ask yourself whether or not
other variables might be the real reason for the effect. Multiple
regression allows us to control for all important variables by
including them into the regression. “Once we control for weight,
height and beers are NOT related”!!

91
Back to Baseball – Let's try to add AVG on top of OBP

SUMMARY OUTPUT
Regression Statistics
Multiple R 0.948136
R Square 0.898961
Adjusted R Square 0.891477
Standard Error 0.160502
Observations 30

ANOVA
df SS MS F Significance F
Regression 2 6.188355 3.094177 120.1119098 3.63577E‐14
Residual 27 0.695541 0.025761
Total 29 6.883896

Coefficients Standard Error t Stat P-value Lower 95% Upper 95%


Intercept ‐7.933633 0.844353 ‐9.396107 5.30996E‐10 ‐9.666102081 ‐6.201163
AVG 7.810397 4.014609 1.945494 0.062195793 ‐0.426899658 16.04769
OBP 31.77892 3.802577 8.357205 5.74232E‐09 23.9766719 39.58116

R/G = β0 + β1 AVG + β2 OBP + ε

Is AVG any good?


92
Back to Baseball – Now let's add SLG

SUMMARY OUTPUT
Regression Statistics
Multiple R 0.955698
R Square 0.913359
Adjusted R Square 0.906941
Standard Error 0.148627
Observations 30

ANOVA
df SS MS F Significance F
Regression 2 6.28747 3.143735 142.31576 4.56302E‐15
Residual 27 0.596426 0.02209
Total 29 6.883896

Coefficients Standard Error t Stat P-value Lower 95% Upper 95%


Intercept ‐7.014316 0.81991 ‐8.554984 3.60968E‐09 ‐8.69663241 ‐5.332
OBP 27.59287 4.003208 6.892689 2.09112E‐07 19.37896463 35.80677
SLG 6.031124 2.021542 2.983428 0.005983713 1.883262806 10.17899

R/G = β0 + β1 OBP + β2 SLG + ε

What about now? Is SLG any good?


93
Back to Baseball

Correlations
        AVG    OBP    SLG
AVG     1
OBP     0.77   1
SLG     0.75   0.83   1

I When AVG is added to the model with OBP, no additional


information is conveyed. AVG does nothing “on its own” to
help predict Runs per Game...

I SLG however, measures something that OBP doesn’t (power!)


and by doing something “on its own” it is relevant to help
predict Runs per Game. (Okay, but not much...)

94
3. Dummy Variables... Example: House Prices

We want to evaluate the difference in house prices in a couple of


different neighborhoods.

Nbhd SqFt Price


1 2 1.79 114.3
2 2 2.03 114.2
3 2 1.74 114.8
4 2 1.98 94.7
5 2 2.13 119.8
6 1 1.78 114.6
7 3 1.83 151.6
8 3 2.16 150.7
... ... ... ...

95
Dummy Variables... Example: House Prices

Let’s create the dummy variables dn1, dn2 and dn3...

Nbhd SqFt Price dn1 dn2 dn3


1 2 1.79 114.3 0 1 0
2 2 2.03 114.2 0 1 0
3 2 1.74 114.8 0 1 0
4 2 1.98 94.7 0 1 0
5 2 2.13 119.8 0 1 0
6 1 1.78 114.6 1 0 0
7 3 1.83 151.6 0 0 1
8 3 2.16 150.7 0 0 1
... ... ...

96
Dummy Variables... Example: House Prices

Pricei = β0 + β1 dn1i + β2 dn2i + β3 Sizei + εi

E [Price|dn1 = 1, Size] = β0 + β1 + β3 Size (Nbhd 1)


E [Price|dn2 = 1, Size] = β0 + β2 + β3 Size (Nbhd 2)
E [Price|dn1 = 0, dn2 = 0, Size] = β0 + β3 Size (Nbhd 3)

97
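Not on the slides: a minimal sketch of how the dummy variables turn into three parallel lines, using a tiny made-up version of the neighborhood data (the numbers below are illustrative, not the real file).

import numpy as np

# Tiny made-up version of the neighborhood data
nbhd  = np.array([2, 2, 2, 1, 1, 3, 3, 3])
size  = np.array([1.79, 2.03, 1.74, 1.78, 1.60, 1.83, 2.16, 2.40])
price = np.array([114.3, 114.2, 114.8, 114.6, 105.0, 151.6, 150.7, 160.0])

dn1 = (nbhd == 1).astype(float)      # 1 if neighborhood 1, else 0
dn2 = (nbhd == 2).astype(float)      # 1 if neighborhood 2, else 0
# neighborhood 3 is the baseline (both dummies equal 0)

X = np.column_stack([np.ones(len(price)), dn1, dn2, size])
b0, b1, b2, b3 = np.linalg.lstsq(X, price, rcond=None)[0]

# Three parallel lines, one per neighborhood:
print("Nbhd 1 intercept:", b0 + b1, " slope:", b3)
print("Nbhd 2 intercept:", b0 + b2, " slope:", b3)
print("Nbhd 3 intercept:", b0,      " slope:", b3)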
Dummy Variables... Example: House Prices

Pricei = β0 + β1 dn1 + β2 dn2 + β3 Size + εi


SUMMARY  OUTPUT

Regression  Statistics
Multiple  R 0.828
R  Square 0.685
Adjusted  R  Square 0.677
Standard  Error 15.260
Observations 128

ANOVA
df SS MS F Significance  F
Regression 3 62809.1504 20936 89.9053 5.8E-31
Residual 124 28876.0639 232.87
Total 127 91685.2143

Coefficients Standard Error t Stat P-value Lower 95% Upper 95%


Intercept 62.78 14.25 4.41 0.00 34.58 90.98
dn1 -41.54 3.53 -11.75 0.00 -48.53 -34.54
dn2 -30.97 3.37 -9.19 0.00 -37.63 -24.30
size 46.39 6.75 6.88 0.00 33.03 59.74

Pricei = 62.78 − 41.54 dn1 − 30.97 dn2 + 46.39 Size + εi

98
Dummy Variables... Example: House Prices

[Scatterplot: Price vs. Size with three parallel fitted lines, one for each neighborhood (Nbhd 1, 2, 3)]
99
Dummy Variables... Example: House Prices

Pricei = β0 + β1 Size + εi
SUMMARY  OUTPUT

Regression  Statistics
Multiple  R 0.553
R  Square 0.306
Adjusted  R  Square 0.300
Standard  Error 22.476
Observations 128

ANOVA
df SS MS F Significance  F
Regression 1 28036.4 28036.36 55.501 1E-11
Residual 126 63648.9 505.1496
Total 127 91685.2

Coefficients Standard Error t Stat P-value Lower 95% Upper 95%
Intercept -10.09 18.97 -0.53 0.60 -47.62 27.44
size 70.23 9.43 7.45 0.00 51.57 88.88

Pricei = −10.09 + 70.23 Size + εi

100
Dummy Variables... Example: House Prices

[Scatterplot: Price vs. Size with the three neighborhood lines and the single "Just Size" line]
101
Sex Discrimination Case

[Scatterplot: Salary vs. Year Hired, by sex]

Does it look like the effect of experience on salary is the same for
males and females? 102
Sex Discrimination Case

Could we try to expand our analysis by allowing a different slope


for each group?
Yes... Consider the following model:

Salaryi = β0 + β1 Expi + β2 Sexi + β3 Expi × Sexi + εi

For Females:
Salaryi = β0 + β1 Expi + εi
For Males:

Salaryi = (β0 + β2) + (β1 + β3) Expi + εi

103
Sex Discrimination Case

What does the data look like?

YrHired Gender Salary Sex SexExp


1 92 Male 32.00 1 92
2 81 Female 39.10 0 0
3 83 Female 33.20 0 0
4 87 Female 30.60 0 0
5 92 Male 29.00 1 92
... ... ...
208 62 Female 30.00 0 62

104
Sex Discrimination Case
Salaryi = β0 + β1 Sexi + β2 Expi + β3 Expi × Sexi + εi

SUMMARY OUTPUT

Regression Statistics
Multiple R 0.799130351
R Square 0.638609318
Adjusted R S 0.63329475
Standard Err 6.816298288
Observations 208

ANOVA
df SS MS F Significance F
Regression 3 16748.88 5582.96 120.16 7.513E-45
Residual 204 9478.232 46.4619
Total 207 26227.11

Coefficients Standard Error t Stat P-value Lower 95% Upper 95%


Intercept 61.12479795 8.770854 6.96908 4E-11 43.831649 78.41795
Gender 114.4425931 11.7012 9.78041 9E-19 91.371794 137.5134
YrHired -0.279963351 0.102456 -2.7325 0.0068 -0.4819713 -0.077955
GenderExp -1.247798369 0.136676 -9.1296 7E-17 -1.5172765 -0.97832

Salaryi = 61 + 114 Sexi − 0.27 Expi − 1.24 Expi × Sexi + εi


105
Sex Discrimination Case

[Scatterplot: Salary vs. Year Hired with separate fitted lines for males and females]

106
Variable Interaction

So, the effect of experience on salary is different for males and


females... in general, when the effect of the variable X1 onto Y
depends on another variable X2 we say that X1 and X2 interact
with each other.

We can extend this notion by the inclusion of multiplicative effects


through interaction terms.
Yi = β0 + β1 X1i + β2 X2i + β3 (X1i X2i) + ε

∂E[Y |X1, X2] / ∂X1 = β1 + β3 X2

This is our first non-linear model!

107
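Not on the slides: a sketch of the interaction model on simulated data (all coefficients below are made up). The extra column is just Exp × Sex, and the fitted slope for experience differs by group exactly as the derivative above says.

import numpy as np

rng = np.random.default_rng(4)
n = 200
exp_ = rng.uniform(0, 30, size=n)                      # made-up years of experience
sex  = rng.integers(0, 2, size=n).astype(float)        # 1 = male, 0 = female (as on the slides)
salary = 61 - 0.3 * exp_ + 11 * sex + 1.2 * exp_ * sex + rng.normal(0, 7, size=n)

X = np.column_stack([np.ones(n), exp_, sex, exp_ * sex])   # last column is the interaction term
b0, b1, b2, b3 = np.linalg.lstsq(X, salary, rcond=None)[0]

# Marginal effect of experience depends on sex: dE[Y]/dExp = b1 + b3 * Sex
print("slope for females:", b1)
print("slope for males:  ", b1 + b3)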
4. Residual Plots and Transformations

What kind of properties should the residuals have??

ei ≈ N(0, σ 2 ) iid and independent from the X’s

I We should see no pattern between e and each of the X ’s


I This can be summarized by looking at the plot between
Ŷ and e
I Remember that Ŷ is “pure X ”, i.e., a linear function of the
X ’s.
If the model is good, the regression should have pulled out of Y all
of its “x ness”... what is left over (the residuals) should have
nothing to do with X .

108
Non Linearity
Example: Telemarketing
I How does length of employment affect productivity (number
of calls per day)?

109
Non Linearity
Example: Telemarketing
I Residual plot highlights the non-linearity!

110
Non Linearity

What can we do to fix this?? We can use multiple regression and


transform our X to create a non-linear model...

Let’s try
Y = β0 + β1 X + β2 X² + ε
The data...

months months2 calls


10 100 18
10 100 19
11 121 22
14 196 23
15 225 25
... ... ...

111
Telemarketing
Adding Polynomials

Linear Model

112
Telemarketing
Adding Polynomials

With X2

113
Telemarketing
Adding Polynomials

114
Telemarketing

What is the marginal effect of X on Y?

∂E[Y |X] / ∂X = β1 + 2 β2 X

I To better understand the impact of changes in X on Y you


should evaluate different scenarios.
I Moving from 10 to 11 months of employment raises
productivity by 1.47 calls
I Going from 25 to 26 months only raises the number of calls
by 0.27.

115
Polynomial Regression

Even though we are limited to a linear mean, it is possible to get


nonlinear regression by transforming the X variable.

In general, we can add powers of X to get polynomial regression:


Y = β0 + β1 X + β2 X² + . . . + βm X^m

You can fit any mean function if m is big enough.


Usually, m = 2 does the trick.

116
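Not on the slides: a sketch of polynomial regression with m = 2 on simulated "telemarketing-like" data (the numbers are made up), including the marginal effect β1 + 2β2X evaluated at two employment lengths.

import numpy as np

rng = np.random.default_rng(5)
months = rng.uniform(10, 40, size=60)                           # made-up employment lengths
calls = 2 + 2.8 * months - 0.06 * months**2 + rng.normal(0, 1, size=60)

X = np.column_stack([np.ones_like(months), months, months**2])  # add the X^2 column
b0, b1, b2 = np.linalg.lstsq(X, calls, rcond=None)[0]

def marginal_effect(x):
    # dE[Y|X]/dX = b1 + 2*b2*X: effect of one extra month at employment length x
    return b1 + 2 * b2 * x

print(marginal_effect(10))   # early on, an extra month adds a lot of calls
print(marginal_effect(25))   # later on, the gain is much smaller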
Closing Comments on Polynomials

We can always add higher powers (cubic, etc) if necessary.

Be very careful about predicting outside the data range. The curve
may do unintended things beyond the observed data.

Watch out for over-fitting... remember, simple models are


“better”.

117
Be careful when extrapolating...

[Plot: calls vs. months with the fitted quadratic extrapolated beyond the observed range]

118
...and, be careful when adding more polynomial terms!

[Plot: calls vs. months with polynomial fits of degree 2, 3, and 8 overlaid]

119
Non-constant Variance

Example...

This violates our assumption that all εi have the same σ 2 .

120
Non-constant Variance

Consider the following relationship between Y and X :

Y = γ0 X^β1 (1 + R)

where we think about R as a random percentage error.


I On average we assume R is 0...
I but when it turns out to be 0.1, Y goes up by 10%!
I Often we see this, the errors are multiplicative and the
variation is something like ±10% and not ±10.
I This leads to non-constant variance (or heteroskedasticity)

121
The Log-Log Model

We have data on Y and X and we still want to use a linear


regression model to understand their relationship... what if we take
the log (natural log) of Y ?
log(Y) = log[ γ0 X^β1 (1 + R) ]
log(Y) = log(γ0) + β1 log(X) + log(1 + R)

Now, if we call β0 = log(γ0 ) and  = log(1 + R) the above leads to

log(Y ) = β0 + β1 log(X ) + 

a linear regression of log(Y ) on log(X )!

122
Elasticity and the log-log Model

In a log-log model, the slope β1 is sometimes called elasticity.

In english, a 1% increase in X gives a beta % increase in Y.

β1 ≈ d%Y / d%X      (Why?)

123
Price Elasticity

In economics, the slope coefficient β1 in the regression


log(sales) = β0 + β1 log(price) + ε is called price elasticity.

This is the % change in sales per 1% change in price.

The model implies that E[sales] = A × price^β1, where A = exp(β0).

124
Price Elasticity of OJ
A chain of gas station convenience stores was interested in the
dependency between the price of orange juice and its sales...
They decided to run an experiment and change prices randomly at
different locations. With the data in hand, let's first run a
regression of Sales on Price:

Sales = β0 + β1 Price + ε
SUMMARY OUTPUT

Regression Statistics
Multiple R 0.719
R Square 0.517
Adjusted R Square 0.507
Standard Error 20.112
Observations 50.000

ANOVA
df SS MS F Significance F
Regression 1.000 20803.071 20803.071 51.428 0.000
Residual 48.000 19416.449 404.509
Total 49.000 40219.520

Coefficients Standard Error t Stat P-value Lower 95% Upper 95%


Intercept 89.642 8.610 10.411 0.000 72.330 106.955
Price -20.935 2.919 -7.171 0.000 -26.804 -15.065

125
Price Elasticity of OJ

[Fitted Model: Sales vs. Price with the fitted line. Residual Plot: residuals vs. Price]

No good!!

126
Price Elasticity of OJ
But... would you really think this relationship would be linear?
Moving a price from $1 to $2 is the same as changing it from $10
to $11?? We should probably be thinking about the price elasticity
of OJ...

log(Sales) = γ0 + γ1 log(Price) + ε
SUMMARY OUTPUT

Regression Statistics
Multiple R 0.869
R Square 0.755
Adjusted R Square 0.750
Standard Error 0.386
Observations 50.000

ANOVA
df SS MS F Significance F
Regression 1.000 22.055 22.055 148.187 0.000
Residual 48.000 7.144 0.149
Total 49.000 29.199

Coefficients Standard Error t Stat P-value Lower 95% Upper 95%


Intercept 4.812 0.148 32.504 0.000 4.514 5.109
LogPrice -1.752 0.144 -12.173 0.000 -2.042 -1.463

How do we interpret γ̂1 = −1.75?


(When prices go up 1%, sales go down by 1.75%)
127
Price Elasticity of OJ

[Fitted Model: Sales vs. Price with the fitted log-log curve. Residual Plot: residuals vs. Price]

Much better!!

128
Making Predictions

What if the gas station store wants to predict their sales of OJ if


they decide to price it at $1.8?
The predicted log(Sales) = 4.812 + (−1.752) × log(1.8) = 3.78

So, the predicted Sales = exp(3.78) = 43.82.


How about the plug-in prediction interval?
In the log scale, our prediction interval is
[log(Sales)-hat − 2s ; log(Sales)-hat + 2s] =
[3.78 − 2(0.38); 3.78 + 2(0.38)] = [3.02; 4.54].
In terms of actual Sales the interval is
[exp(3.02), exp(4.54)] = [20.5; 93.7]

129
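The same plug-in prediction as a few lines of Python, using the coefficients and s from the Excel output above (the results differ slightly from the slide because s = 0.386 is used instead of the rounded 0.38):

import numpy as np

# Fitted log-log model from the slides: log(Sales) = 4.812 - 1.752*log(Price), s = 0.386
b0, b1, s = 4.812, -1.752, 0.386
price = 1.8

log_pred = b0 + b1 * np.log(price)                  # about 3.78
low, high = log_pred - 2 * s, log_pred + 2 * s      # interval in the log scale

print(np.exp(log_pred))                             # predicted Sales, about 44
print(np.exp(low), np.exp(high))                    # roughly [20, 95] in the original scale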
Making Predictions
[Plug-in Prediction plots: Sales vs. Price (left, original scale) and log(Sales) vs. log(Price) (right, log scale), each with the fitted line and the ±2s prediction bands]

I In the log scale (right) we have [Ŷ − 2s; Ŷ + 2s]


I In the original scale (left) we have
[exp(Ŷ ) ∗ exp(−2s); exp(Ŷ ) exp(2s)]
130
Some additional comments...

I Another useful transformation to deal with non-constant


variance is to take only the log(Y ) and keep X the same.
Clearly the “elasticity” interpretation no longer holds.
I Always be careful in interpreting the models after a
transformation
I Also, be careful in using the transformed model to make
predictions

131
Summary of Transformations

Coming up with a good regression model is usually an iterative


procedure. Use plots of residuals vs X or Ŷ to determine the next
step.

Log transform is your best friend when dealing with non-constant


variance (log(X ), log(Y ), or both).

Add polynomial terms (e.g. X 2 ) to get nonlinear regression.

The bottom line: you should combine what the plots and the
regression output are telling you with your common sense and
knowledge about the problem. Keep playing around with it until
you get something that makes sense and has nothing obviously
wrong with it.

132
Airline Data
Monthly passengers in the U.S. airline industry (in 1,000 of
passengers) from 1949 to 1960... we need to predict the number of
passengers in the next couple of months.
[Time series plot: monthly Passengers vs. Time]

Any ideas?
133
Airline Data

How about a "trend model"? Yt = β0 + β1 t + εt


[Time series plot: Passengers and the Fitted Values from the trend model vs. Time]
What do you think?

134
Airline Data

Let’s look at the residuals...


[Plot: standardized residuals vs. Time]
Is there any obvious pattern here? YES!!

135
Airline Data
The variance of the residuals seems to be growing in time... Let’s
try taking the log. log(Yt) = β0 + β1 t + εt
[Time series plot: log(Passengers) and the Fitted Values vs. Time]
Any better?
136
Airline Data

Residuals...
[Plot: standardized residuals vs. Time]
Still we can see some obvious temporal/seasonal pattern....

137
Airline Data
Okay, let’s add dummy variables for months (only 11 dummies)...
log(Yt) = β0 + β1 t + β2 Jan + . . . + β12 Dec + εt
[Time series plot: log(Passengers) and the Fitted Values vs. Time]
Much better!!
138
Airline Data
Residuals...

[Plot: standardized residuals vs. Time]
I am still not happy... it doesn’t look normal iid to me...
139
Airline Data
Residuals... corr(e(t),e(t-1))= 0.786

[Scatterplot: e(t) vs. e(t−1)]
I was right! The residuals are dependent on time...
140
Airline Data
We have one more tool... let's add a lagged term.
log(Yt) = β0 + β1 t + β2 Jan + . . . + β12 Dec + β13 log(Yt−1) + εt
[Time series plot: log(Passengers) and the Fitted Values vs. Time]
Okay, good...
141
Airline Data
Residuals...

[Plot: standardized residuals vs. Time]
Much better!!
142
Airline Data
Residuals... corr(e(t),e(t-1))= -0.11

[Scatterplot: e(t) vs. e(t−1)]
Much better indeed!!
143
Model Building Process

When building a regression model remember that simplicity is your


friend... smaller models are easier to interpret and have fewer
unknown parameters to be estimated.

Keep in mind that every additional parameter represents a cost!!

The first step of every model building exercise is the selection of


the universe of variables to be potentially used. This task is
entirely solved through your experience and context-specific
knowledge...
I Think carefully about the problem
I Consult subject matter research and experts
I Avoid the mistake of selecting too many variables

144
Model Building Process

With a universe of variables in hand, the goal now is to select the


model. Why not include all the variables in?

Big models tend to over-fit and find features that are specific to
the data in hand... ie, not generalizable relationships.
The results are bad predictions and bad science!

In addition, bigger models have more parameters and potentially


more uncertainty about everything we are trying to learn...

We need a strategy to build a model in ways that accounts for the


trade-off between fitting the data and the uncertainty associated
with the model

145
5. Variable Selection and Regularization

When working with linear regression models where the number of


X variables is large, we need to think about strategies to select
what variables to use...

We will focus on 3 ideas:


I Subset Selection
I Shrinkage
I Dimension Reduction

146
Subset Selection

The idea here is very simple: fit as many models as you can and
compare their performance based on some criteria!

Issues:
I How many possible models? Total number of models = 2^p
Is this large?
I What criteria to use?
Just as before, if prediction is what we have in mind,
out-of-sample predictive ability should be the criteria

147
Information Criteria

Another way to evaluate a model is to use Information Criteria


metrics which attempt to quantify how well our model would have
predicted the data (regardless of what you’ve estimated for the
βj ’s).

A good alternative is the BIC: Bayes Information Criterion, which


is based on a “Bayesian” philosophy of statistics.
BIC = n log(s²) + p log(n)
You want to choose the model that leads to minimum BIC.

148
Information Criteria

One nice thing about the BIC is that you can


interpret it in terms of model probabilities. Given a list of possible
models {M1 , M2 , . . . , MR }, the probability that model i is correct is

P(Mi) ≈ e^{−BIC(Mi)/2} / Σ_{r=1}^{R} e^{−BIC(Mr)/2}
      = e^{−[BIC(Mi)−BICmin]/2} / Σ_{r=1}^{R} e^{−[BIC(Mr)−BICmin]/2}

(Subtract BICmin = min{BIC (M1 ) . . . BIC (MR )} for numerical stability.)

Similar, alternative criteria include AIC, Cp , adjusted R 2 ...


bottom line: these are only useful if we lack the ability to compare
models based on their out-of-sample predictive ability!!!

149
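Not on the slides: a sketch of turning BIC values into approximate model probabilities. The three BIC numbers at the bottom are hypothetical, and the s² convention inside bic() (s² = SSE/n) is one common choice and an assumption here.

import numpy as np

def bic(n, sse, p):
    # BIC = n*log(s^2) + p*log(n); here s^2 = SSE/n, one common convention (an assumption)
    return n * np.log(sse / n) + p * np.log(n)

def model_probs(bics):
    # approximate posterior model probabilities from a list of BIC values
    b = np.asarray(bics, dtype=float)
    b = b - b.min()                    # subtract BIC_min for numerical stability
    w = np.exp(-0.5 * b)
    return w / w.sum()

# hypothetical BIC values for three candidate models
print(model_probs([1012.3, 1015.9, 1020.1]))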
Search Strategies: Stepwise Regression

One computational approach to build a regression model


step-by-step is “stepwise regression” There are 3 options:
I Forward: adds one variable at a time until no remaining
variable makes a significant contribution (or meets a certain
criterion... could be out-of-sample prediction)
I Backwards: starts with all possible variables and removes one
at a time until further deletions would do more harm than
good
I Stepwise: just like the forward procedure but allows for
deletions at each step

150
Shrinkage Methods

An alternative way to deal with selection is to work with all p


predictors at once while placing a constraint on the size of the
estimated coefficients

This idea is a regularization technique that reduces the variability


of the estimates and tends to lead to better predictions.

The hope is that by having the constraint in place, the estimation


procedure will be able to focus on “the important β’s”

151
Ridge Regression

Ridge Regression is a modification of the least squares criteria that


minimizes (as a function of β’s)
 2
n
X p
X p
X
Yi − β0 − βj Xij  + λ βj2
i=1 j=1 j=1

for some value of λ > 0

I The “blue” part of the equation is the traditional objective


function of LS
I The “red” part is the shrinkage penalty, ie, something that
makes it costly to have big values for β

152
Ridge Regression

 2
n
X p
X p
X
Yi − β0 − βj Xij  + λ βj2
i=1 j=1 j=1

I if λ = 0 we are back to least squares


I when λ → ∞, it is “too expensive” to allow for any β to be
different than 0...
I So, for different values of λ we get a different solution to the
problem

153
Ridge Regression

I What ridge regression is doing is exploring the bias-variance


trade-off! The larger the λ the more bias (towards zero) is
being introduced in the solution, ie, the less flexible the model
becomes... at the same time, the solution has less variance
I As always, the trick is to find the "right" value of λ that makes
the model not too simple but not too complex!
I Whenever possible, we will choose λ by comparing the
out-of-sample performance (usually via cross-validation)

154
Ridge Regression

[Two plots of Mean Squared Error: vs. λ (left) and vs. ‖β̂λR‖₂/‖β̂‖₂ (right); bias² (black), var (green), test MSE (purple)]


Comments:
I Ridge is computationally very attractive as the “computing
cost” is almost the same of least squares (contrast that with
subset selection!)
I It’s a good practice to always center and scale the X ’s before
running ridge
155
LASSO
The LASSO is a shrinkage method that performs automatic
selection. It is similar to ridge but it will provide solutions that are
sparse, ie, some β’s exactly equal to 0! This facilitates
interpretation of the results...

[Coefficient paths for Ridge (left) and LASSO (right): standardized coefficients for Income, Limit, Rating, and Student as λ varies]

156
LASSO

The LASSO solves the following problem:


  2 
X n X p p
X 
arg minβ Yi − β0 − βj Xij  + λ |βj |
 
i=1 j=1 j=1

I Once again, λ controls how flexible the model gets to be


I Still a very efficient computational strategy
I Whenever possible, we will choose λ by comparing the
out-of-sample performance (usually via cross-validation)

157
Ridge vs. LASSO

Why does the LASSO output zeros?

158
Ridge vs. LASSO

Which one is better?


I It depends...
I In general LASSO will perform better than Ridge when a
relatively small number of predictors have a strong effect on Y
while Ridge will do better when Y is a function of many of the
X ’s and the coefficients are of moderate size
I LASSO can be easier to interpret (the zeros help!)
I But, if prediction is what we care about the only way to
decide which method is better is comparing their
out-of-sample performance

159
Choosing λ
The idea is to solve the ridge or LASSO objective function over a
grid of possible values for λ...

[Ridge CV (k=10) and LASSO CV (k=10): cross-validated RMSE vs. log(lambda)]

160
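Not on the slides: a sketch of choosing λ by cross-validation with scikit-learn (note that scikit-learn calls λ "alpha"). The data are simulated so that only a few predictors matter, which is the setting where the LASSO should zero out most coefficients.

import numpy as np
from sklearn.linear_model import RidgeCV, LassoCV
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(6)
n, p = 200, 20
X = rng.normal(size=(n, p))
beta = np.zeros(p); beta[:3] = [3.0, -2.0, 1.5]      # only a few predictors matter
y = X @ beta + rng.normal(size=n)

Xs = StandardScaler().fit_transform(X)               # center and scale the X's first

ridge = RidgeCV(alphas=np.logspace(-2, 4, 50), cv=10).fit(Xs, y)
lasso = LassoCV(cv=10, random_state=0).fit(Xs, y)

print("ridge lambda:", ridge.alpha_)
print("lasso lambda:", lasso.alpha_)
print("lasso zeros :", int(np.sum(lasso.coef_ == 0)), "of", p)   # sparsity from the L1 penalty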
6. Dimension Reduction Methods

Sometimes, the number (p) of X variables available is too large for


us to work with the methods presented above.

Perhaps, we could first summarize the information in the predictors


into a smaller set of variables (m << p) and then try to predict Y .

In general, these summaries are often linear combinations of the


original variables.

161
Principal Components Regression

A very popular way to summarize multivariate data is Principal


Components Analysis (PCA).

PCA is a dimensionality reduction technique that tries to


represent p variables with k < p "new" variables.

These "new" variables are created by linear combinations of the


original variables and the hope is that a small number of them are
able to effectively represent what is going on in the original data.

162
Principal Components Analysis

Assume we have a dataset where p variables are observed. Let Xi


be the i th observation of the p-dimensional vector X . PCA writes:

xij = b1j zi1 + b2j zi2 + · · · + bkj zik + eij

where zij is the i th observation of the j th principal component.

You can think about these z variables as the “essential variables”


responsible for all the action in X .

163
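Not on the slides: a small scikit-learn sketch of PCA on simulated data built from one "essential" variable z, so PC1 should capture almost all of the variability.

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(7)
n = 100
z = rng.normal(size=n)                              # one "essential" variable
X = np.column_stack([ z + 0.1 * rng.normal(size=n),
                     -z + 0.1 * rng.normal(size=n),
                     2 * z + 0.1 * rng.normal(size=n)])   # three highly correlated X's

pca = PCA()                                         # centers the data by default
Z = pca.fit_transform(X)                            # the principal component scores z_ij

print(pca.explained_variance_ratio_)                # PC1 should explain almost everything
print(pca.components_[0])                           # the weights (loadings) defining PC1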
Principal Components Analysis

PCA looks for high-variance projections from multivariate x (i.e., the long
direction) and finds the least squares fit. Components are ordered by
variance of the fitted projection. Here's a picture...

[Scatterplot of x2 vs. x1 with the PC1 and PC2 directions drawn]

These two variables x1 and x2 are very correlated. PC1 tells you almost
everything that is going on in this dataset!

PCA will look for linear combinations of the original variables that
account for most of their variability! 164
Principal Components Analysis
Let’s look at a simple example... Data: Protein consumption by
person by country for 9 variables: red meat, white meat, eggs,
milk, fish, cereals, starch, nuts, vegetables.

[Left: scatterplot of WhiteMeat vs. RedMeat by country. Right: the countries plotted on PC1 vs. PC2]

Looks to me that PC1 measures how rich you are and PC2
something to do with the Mediterranean diet!
165
Principal Components Analysis
[PC1 vs. PC2 plot of the countries]

These are the weights defining the principal components...


166
Principal Components Analysis
3 variables might be enough to represent this data... Most of the
variability (75%) is explained with PC1, PC2 and PC3.
[Screeplot "Food Principal Components Variance": variance of each principal component]
167
Principal Components Analysis: Comments

I PCA is a great way to summarize data


I It “clusters” both variables and observations simultaneously!
I The choice of k can be evaluated as a function of the
interpretation of the results or via the fit (% of the variation
explained)
I The units of each PC are not interpretable in an absolute sense.
However, relative to each other they are... see example above.
I Always a good idea to center the data before running PCA.

168
Principal Components Regression (PCR)

Let’s go back to and think of predicting Y with a potentially large


number of X variables...

PCA is sometimes used as a way to reduce the dimensionality of


X ... if only a small number of PC’s are enough to represent X , I
don’t need to use all the X ’s, right? Remember, smaller models
tend to do better in predictive terms!

This is called Principal Component Regression. First represent X


via k principal components (Z ) and then run a regression of Y
onto Z. PCR assumes that the directions in which X shows the most
variation (the PCs) are the directions associated with Y.

The choice of k can be done by comparing the out-of-sample


predictive performance.
169
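Not on the slides: a sketch of PCR with scikit-learn — summarize X with k principal components, regress Y on them, and pick k by cross-validated out-of-sample R². The data here are simulated, not the roll-call votes.

import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(8)
n, p = 150, 30
X = rng.normal(size=(n, p))
y = X[:, 0] - 2 * X[:, 1] + rng.normal(size=n)      # simulated outcome

# PCR: summarize X with k principal components, then regress y on them;
# choose k by out-of-sample (cross-validated) performance.
for k in (1, 2, 5, 10):
    pcr = make_pipeline(PCA(n_components=k), LinearRegression())
    score = cross_val_score(pcr, X, y, cv=10).mean()    # cross-validated R^2
    print(k, round(score, 3))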
Principal Components Regression (PCR)
Example: Roll Call Votes in Congress... all votes in the 111th
Congress (2009-2011); p = 1647, n = 445.
Goal: Predict how "liberal" a district is as a function of the
votes by its representative
Let’s first take the principal component decomposition of the
votes...
[Screeplot: Rollcall-Vote Principal Components variances]

It looks like 2 PCs capture much of what is going on...

170
Principal Components Regression (PCR)
Histogram of loadings on PC1... What bills are important in
defining PC1?
[Histogram of 1st Principal Component Vote-Loadings; the extreme loadings are labeled "Afford. Health (amdt.)" and "TARP"]


171
Principal Components Regression (PCR)
Histogram of PC1... “Ideology Score”
[Histogram of 1st Principal Component Vote-Scores; the extremes are labeled Conaway (TX-11), Rosleht (FL-18), Deutch (FL-19), Royball (CA-34)]


172
Principal Components Regression (PCR)

TX-11 CA-34

The two extremes...

173
Principal Components Regression (PCR)

FL-18 FL-19

The swing state!

174
Principal Components Regression (PCR)
[Two plots: % Obama vs. PC1, and PC2 vs. PC1]

All we need is PC1 to predict party affiliation!


How can this picture help you understand what happened in the
2010 election?
175
Principal Components Regression (PCR)
[Two plots of % Obama vs. PC1; in the right panel the districts whose representatives lost in 2010 are highlighted ("losers in 2010")]

176
Partial Least Squares (PLS)

PLS works very similarly to PCR as it will create “new” variables


(Z ) by taking linear combinations of the original variables (X ).

The difference is that PLS attempts to find the directions of


variation in X that help explain BOTH X and Y .

It is a supervised learning alternative to PCR...

177
Partial Least Squares (PLS)

PLS works as follows:


1. The weights of the first linear combination (Z1 ) is defined by
the regression of Y onto each of the X ’s... i.e., large weights
are going to placed on the X variables most related to Y in a
univariate sense
2. Regress each X variable onto Z1 and compute the residuals
3. Repeat step (1) using the residuals from (2) in place of X
4. iterate

As always, the choice of where to stop, i.e., how many Z variables


to use should be done by comparing the out-of-sample predictive
performance.

178
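Not on the slides: the same exercise with PLS via scikit-learn's PLSRegression, again choosing the number of components by cross-validated performance (simulated data, as in the PCR sketch above).

import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(9)
n, p = 150, 30
X = rng.normal(size=(n, p))
y = X[:, 0] - 2 * X[:, 1] + rng.normal(size=n)      # simulated outcome

# PLS builds its components using both X and y; choose the number of
# components by comparing cross-validated performance.
for k in (1, 2, 5, 10):
    pls = PLSRegression(n_components=k)
    score = cross_val_score(pls, X, y, cv=10).mean()    # cross-validated R^2
    print(k, round(score, 3))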
Partial Least Squares (PLS)

Roll Call Data again... it looks like the first component from PLS
is the same as the first principal component!
[Two plots: % Obama vs. PC1 (left) and % Obama vs. Comp1 from PLS (right) — essentially the same]

179
Partial Least Squares (PLS)

But, using two components PLS does better than PCR!!


[Two plots of % Obama vs. fitted values using two components: PCR (R² = 0.448) and PLS (R² = 0.558)]

180
Partial Least Squares (PLS)
Not easy to understand the difference between the second
component in each method (how is that for a homework!)... the
bottom line is that by using the information from Y in
summarizing the X variables, PLS find a second component that
has the ability to explain part of Y .
[Two plots: PC2 vs. PC1 (PCR) and Comp 2 vs. Comp 1 (PLS)]
181
