
Introduction to Econometrics

THE SIMPLE REGRESSION MODEL:


DEFINITION AND ESTIMATION

Tobias Broer

January 24, 2023


Outline of today’s lecture

1. Definition of the Simple Regression Model
2. Deriving the Ordinary Least Squares Estimates
3. Properties of OLS on any Sample of Data
4. Units of Measurement and Functional Form
Learning outcomes

1. After this lecture, you will be familiar with the concept, and language, of
“OLS regression analysis"
2. You will know how to estimate the coefficients in the simple linear
regression model on the basis of a sample
3. You will know the properties of the OLS estimates, including goodness of
fit
4. And you will be able to interpret the estimated coefficients, and the effects of changing units of measurement and functional form
1. Definition of the Simple Regression Model

1. Consider a cross-sectional population (of which we will eventually have a random sample), with attributes including x and y
2. We are interested in “how y varies with changes in x" in this population
3. Examples:
I x is amount of fertilizer, and y is soybean yield.
I x is years of schooling, y is hourly wage.
Three issues

1. How do we allow factors other than x to affect y? There is never an exact relationship between two variables (in interesting cases).
2. What is the functional relationship between y and x?
3. How can we be sure we are capturing a ceteris paribus (all other factors held constant) relationship between y and x (as is so often the goal)?
Simple linear regression model

Assume the following relation in the population

y = β0 + β1 x + u, (1)

I Assumption 1a: Linear model, with β0 the intercept parameter and β1 the slope parameter
I Assumption 1b: All factors other than x affect y through the “error term” u
I Note: the relation between y and x is not symmetric
I Terminology:
I “Simple” = two-variable model
I “Linear” = linear model
I “Regression” has no useful meaning here (historical origin: “regression to the mean”)
I y is dependent / explained / response variable / regressand
I x is independent / explanatory / control variable / regressor
I u is error / disturbance term
Simple linear regression model

y = β0 + β1 x + u, (2)

Assumptions 1a and 1b solve the three issues


I Factors other than x affect y only through u.
I Functional relationship between y and x: linear
I Ceteris paribus relationship: ∆y = β1 ∆x + ∆u, so ∆y/∆x |_{∆u=0} = β1 tells us how y changes when x changes, holding all other factors (i.e. u) fixed.
EXAMPLE 1: Yield and Fertilizer

I A model to explain crop yield by fertilizer use is

yield = β0 + β1 fertilizer + u, (3)


I u contains land quality, rainfall on a plot of land, and so on.
I The slope parameter, β1 , is of primary interest: it tells us how yield
changes when the amount of fertilizer changes, holding all else fixed.
I Note: Is the effect of fertilizer really constant?
The linear function is probably not realistic here. The effect of fertilizer is
likely to diminish at large amounts of fertilizer.
EXAMPLE 2: Wage and Education

wage = β0 + β1 educ + u (4)

I u contains somewhat nebulous factors (“ability”) but also past workforce


experience and tenure on the current job.

∆wage = β1 ∆educ (5)

when ∆u = 0
I But: Is each year of education really worth the same dollar amount no
matter how much education one starts with?
Simple linear regression model

y = β0 + β1 x + u, (6)

I Note: We are mainly interested in the population coefficient β1 , which describes a ceteris paribus relation. If we know it, that’s enough.
I But: We typically do not know β1 . If we had data on y ,x,u, we could just
calculate β1 .
I But: u is unobserved. If we could manipulate x experimentally, we might
still identify β1 .
I But: We typically cannot change x, and have to identify β1 from given,
“observational" data.
I In other words, we only observe given yi and xi , where yi = β0 + β1 xi + ui
Simple linear regression model

I We only observe given yi = β0 + β1 xi + ui and xi


I Can we identify β1 for any arbitrary u?
NO!
I For example, suppose u = c + dx, so yi = (β0 + c) + (β1 + d)xi .
I Or, even worse, u = −β1 x, so yi = β0 and x appears unrelated to y.
I The problem here is that the unobserved u comoves with x (the argument is unchanged if u comoves with x only “on average”, i.e. up to a further error term).
I This implies that some of the observed comovement of x and y comes from the unobserved u. So we cannot hope to identify β1 from observations on {yi , xi } alone, as the simulation sketch below illustrates.
I So we must restrict our attention to certain kinds of situations where u fulfills additional assumptions.
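As a rough illustration (an addition, not part of the original slides), the following Python sketch simulates a population where u comoves with x; all numbers, including the comovement strength d, are made up. The slope recovered from the observed (x, y) pairs mixes β1 with that comovement:

```python
# Sketch: u comoves with x, so the regression of y on x does not recover beta1.
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
beta0, beta1 = 1.0, 2.0   # made-up population parameters
d = 0.5                   # made-up comovement between u and x

x = rng.normal(size=n)
u = d * x + rng.normal(size=n)   # u comoves with x "on average"
y = beta0 + beta1 * x + u

# slope recoverable from observed (x, y) alone:
slope = np.cov(x, y)[0, 1] / np.var(x, ddof=1)
print(slope)  # close to beta1 + d = 2.5, not the true beta1 = 2.0
```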
Additional assumptions on u

y = β0 + β1 x + u, (9)

I Assumption 2a: E (u) = 0 where E (·) is the expected value operator.


I NB: As long as we are not interested in β0 , assumption 2a is without loss
of generality: The presence of β0 in

y = β0 + β1 x + u (10)

allows us to assume E (u) = 0. If the average of u is different from zero,


we just adjust the intercept, leaving the slope the same. If α0 = E (u) then
we can write

y = (β0 + α0 ) + β1 x + (u − α0 ), (11)

where the new error, u′ = u − α0 , has a zero mean.


I The new intercept is β0 + α0 . The important point is that the slope, β1 ,
has not changed.
How about dependence between u and x?

y = β0 + β1 x + u, (12)

I We could assume u and x uncorrelated in the population:

Corr (x, u) = 0 (13)


I Zero correlation actually works for most purposes, and in this course.
I But it implies only that u and x are not linearly related. Ruling out only
linear dependence can cause problems with interpretation and makes
statistical analysis more difficult.
How about dependence between u and x?

y = β0 + β1 x + u, (14)

I We could instead assume u and x independent in the population.
I It turns out (full) independence is more than we need.
Mean independence

y = β0 + β1 x + u, (15)

I Assumption 2b: E (u|x) = E (u) for all values of x, where E (u|x) means
“the expected value of u given x.”
I We say u is mean independent of x.
I Note: Full independence also suffices, as it implies mean independence.
I For this course, Cov (x, u) = 0 also suffices.
Example 1: Fertilizer and yield

I Suppose u is “land quality” and x is fertilizer amount. Then E(u|x) = E(u) if fertilizer amounts are chosen independently of quality. This assumption is reasonable, but it assumes fertilizer amounts are assigned at random.
I Might fail if farmers have deployed more fertilizer on better / worse land.
Example 2: Wage equation

I Suppose u is “ability” and x is years of education. We need, for example,

E (ability |x = 8) = E (ability |x = 12) = E (ability |x = 16) (16)

so that the average ability is the same in the different portions of the population with an 8th grade education, a 12th grade education, and a four-year college education.
I Because people choose education levels partly based on ability, this
assumption is almost certainly false.
Zero conditional mean and population regression function

I Combining E(u|x) = E(u) (the substantive assumption) with E(u) = 0 (a normalization) gives

E (u|x) = 0, for all values of x (17)


I Called the zero conditional mean assumption.
I Because the expected value is a linear operator, E (u|x) = 0 implies

E (y |x) = β0 + β1 x + E (u|x) = β0 + β1 x, (18)

which shows that the population regression function (PRF) is a linear function of x.
I Regression analysis is essentially about explaining effects of explanatory
variables on average outcomes of y . Only if the PRF has slope β1 can we
hope to identify it from data using the techniques in this course.
I A different approach to simple regression ignores the causality issue and
just starts with a linear model for E (y |x) as a descriptive device.
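A small simulation sketch (an illustrative addition, with made-up parameters) shows the zero conditional mean assumption at work: when u is drawn independently of x, the average of y within narrow bins of x traces out the linear PRF:

```python
# Sketch: with E(u|x) = 0, bin averages of y trace out E(y|x) = beta0 + beta1*x.
import numpy as np

rng = np.random.default_rng(1)
n = 200_000
beta0, beta1 = 3.0, 0.8   # made-up population parameters
x = rng.uniform(0, 10, size=n)
u = rng.normal(scale=2.0, size=n)   # independent of x, so E(u|x) = 0
y = beta0 + beta1 * x + u

for lo in (2.0, 5.0, 8.0):          # conditional means in three narrow bins
    mask = (x >= lo) & (x < lo + 0.5)
    mid = lo + 0.25
    print(mid, y[mask].mean(), beta0 + beta1 * mid)
# the bin averages of y line up with the straight line beta0 + beta1*x
```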
Illustration

I The straight line is the PRF, E (y |x) = β0 + β1 x. The conditional


distribution of y at three different values of x are superimposed.
I For a given value of x, we see a range of y values: remember,
y = β0 + β1 x + u, and u has a distribution in the population.
Summary so far

I Aim: test economic theory, quantify the ceteris paribus effect of x on y, forecast y
I Needed to define the functional form, and how factors other than x affect y, making sure we can isolate the ceteris paribus effect
I Assumed y = β0 + β1 x + u, the “simple linear regression model”
I Saw that, if we want to have some hope of gaining information about β1 from observations on yi , xi , i = 1, …, N, we need to restrict u
I Added two assumptions
I Assumption 2a: E(u) = 0
I Assumption 2b: E(u|x) = E(u)
I Implies a linear “population regression function” E[y|x] = β0 + β1 x
2. Deriving the Ordinary Least Squares Estimates
Deriving the Ordinary Least Squares Estimates

I Given data on x and y, how can we estimate the population parameters, β0 and β1 ?
I Let {(xi , yi ) : i = 1, 2, . . . , n} be a sample of size n (the number of
observations) from the population. Think of this as a random sample.
Deriving the Ordinary Least Squares Estimates

I The graph shows n = 15 families and the (usually unknown) population regression of saving on income.
I We observe yi and xi , but not ui . (However, we know ui is there.)
I Plug any observation into the population equation:

yi = β0 + β1 xi + ui (19)
where the i subscript indicates a particular observation.
I Strategy: Use our assumptions about u to “identify" β0 , β1 from observed
yi , xi .
Deriving the Ordinary Least Squares Estimates

I Zero conditional mean assumption E (u|x) = 0. Implies


1. E (u) = E (y − β0 − β1 x) = 0
2. Cov (x, u) = Cov (x, y − β0 − β1 x) = 0
in the population
I Remember, the first condition essentially defines the intercept.
I The second condition, stated in terms of the covariance, means that x and
u are uncorrelated. (We could have assumed Cov (u, x) = 0 directly).
I Note:
1. E(u|x) = 0 implies Cov(x, u) = E[(x − µx ) u] = E_x[(x − µx ) E(u|x)] = E_x[(x − µx ) · 0] = 0
2. With E(u) = 0, Cov(x, u) = 0 is the same as E(xu) = 0, because Cov(x, u) = E(xu) − E(x)E(u) = E(xu).
Deriving the Ordinary Least Squares Estimates

1. E (u) = E (y − β0 − β1 x) = 0
2. E (xu) = E [x(y − β0 − β1 x)] = 0
I These are the two conditions in the population that determine β0 and β1 .
I Method of moments: Use their sample analogs to determine β̂0 and β̂1 ,
the estimates from the data (Note the hats!):

N⁻¹ Σ_{i=1}^N (yi − β̂0 − β̂1 xi ) = 0 (20)

N⁻¹ Σ_{i=1}^N xi (yi − β̂0 − β̂1 xi ) = 0 (21)

I These are the “OLS normal equations”: two linear equations in the two unknowns β̂0 and β̂1 (solved directly in the sketch below).
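A minimal sketch, with illustrative data, of solving the two normal equations as a 2×2 linear system:

```python
# Sketch: solve the two OLS normal equations as a 2x2 linear system.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # illustrative data
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1])
n = len(x)

# sum(y - b0 - b1*x) = 0 and sum(x*(y - b0 - b1*x)) = 0, rearranged as A @ b = c:
A = np.array([[n,       x.sum()],
              [x.sum(), (x ** 2).sum()]])
c = np.array([y.sum(), (x * y).sum()])
b0_hat, b1_hat = np.linalg.solve(A, c)
print(b0_hat, b1_hat)
```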
Deriving the Ordinary Least Squares Estimates

N⁻¹ Σ_{i=1}^N (yi − β̂0 − β̂1 xi ) = 0 (22)

N⁻¹ Σ_{i=1}^N xi (yi − β̂0 − β̂1 xi ) = 0 (23)

I Solve (22) to get β̂0 = ȳ − β̂1 x̄, where ȳ = n⁻¹ Σ_{i=1}^n yi and x̄ = n⁻¹ Σ_{i=1}^n xi are the sample averages.
I Substitute this into (23):

Σ_{i=1}^n xi [yi − (ȳ − β̂1 x̄) − β̂1 xi ] = 0 (24)

I Simple algebra gives

Σ_{i=1}^n xi (yi − ȳ ) = β̂1 [ Σ_{i=1}^n xi (xi − x̄) ] (25)
Deriving the Ordinary Least Squares Estimates

Σ_{i=1}^n xi (yi − ȳ ) = β̂1 [ Σ_{i=1}^n xi (xi − x̄) ] (26)

I Remember:
Σ_{i=1}^n (xi − x̄) = 0
Σ_{i=1}^n xi (yi − ȳ ) = Σ_{i=1}^n (xi − x̄)(yi − ȳ ) = Σ_{i=1}^n (xi − x̄) yi
Σ_{i=1}^n xi (xi − x̄) = Σ_{i=1}^n (xi − x̄)²
I Use this to write (26) as

Σ_{i=1}^n (xi − x̄)(yi − ȳ ) = β̂1 [ Σ_{i=1}^n (xi − x̄)² ] (27)

I If Σ_{i=1}^n (xi − x̄)² > 0, we can write

β̂1 = Σ_{i=1}^n (xi − x̄)(yi − ȳ ) / Σ_{i=1}^n (xi − x̄)² = Sample Covariance(xi , yi ) / Sample Variance(xi ) (28)
Deriving the Ordinary Least Squares Estimates

β̂1 = Σ_{i=1}^n (xi − x̄)(yi − ȳ ) / Σ_{i=1}^n (xi − x̄)² = Sample Covariance(xi , yi ) / Sample Variance(xi ) (29)

I This formula for β̂1 shows us how to take the data we have and compute
the slope estimate. For reasons we will see, β̂1 is called the ordinary least
squares (OLS) slope estimate. We often refer to it as the slope estimate.
I It can be computed whenever the sample variance of the xi is not zero,
which only rules out the case where each xi is the same value. In other
words, we do not have to assume anything about the population to
calculate β̂1 .
I Once we have β̂1 , we compute β̂0 = ȳ − β̂1 x̄. This is the OLS intercept
estimate.
I These days, one lets a computer do the calculations, which can be tedious
even if n is small.
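A minimal sketch of that computation, using the closed-form expressions (28) and (29) on illustrative data:

```python
# Sketch: the closed-form OLS estimates of equations (28)-(29).
import numpy as np

def ols_simple(x, y):
    """Return (b0_hat, b1_hat) for the simple regression of y on x."""
    x_bar, y_bar = x.mean(), y.mean()
    b1_hat = ((x - x_bar) * (y - y_bar)).sum() / ((x - x_bar) ** 2).sum()
    b0_hat = y_bar - b1_hat * x_bar
    return b0_hat, b1_hat

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # illustrative data
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1])
print(ols_simple(x, y))   # same answer as solving the normal equations directly
```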
Why “Ordinary Least Squares Estimates"

I For any candidates β̂0 and β̂1 , define a fitted value for each data point i as

ŷi = β̂0 + β̂1 xi (30)

We have n of these. It is the value we predict for yi given that x has taken
on the value xi .
I The mistake we make is the residual:

ûi = yi − ŷi = yi − β̂0 − β̂1 xi , (31)

and we have n residuals.



I Suppose we measure the size of the mistake, for each i, by squaring the
residual: ûi2 ≥ 0. Then we add them all up:

Σ_{i=1}^n ûi² = Σ_{i=1}^n (yi − β̂0 − β̂1 xi )² (34)

I This quantity is called the sum of squared residuals.


I If we choose β̂0 and β̂1 to minimize the sum of squared residuals it can be
shown (using calculus or other arguments) that the solutions are the slope
and intercept estimates we obtained before.
Deriving the ‘Ordinary Least Squares Estimates’

min_{β̂0 , β̂1} Σ_{i=1}^n ûi² = Σ_{i=1}^n (yi − β̂0 − β̂1 xi )² (36)

FOC for β̂0 : Σ_{i=1}^n 2 (yi − β̂0 − β̂1 xi )(−1) = 0
⇒ Σ_{i=1}^n yi − n β̂0 − β̂1 Σ_{i=1}^n xi = 0
⇒ β̂0 = ȳ − β̂1 x̄

FOC for β̂1 : Σ_{i=1}^n 2 (yi − β̂0 − β̂1 xi )(−xi ) = 0
⇒ Σ_{i=1}^n xi yi − β̂0 Σ_{i=1}^n xi − β̂1 Σ_{i=1}^n xi² = 0
⇒ (substituting β̂0 = ȳ − β̂1 x̄) Σ_{i=1}^n xi (yi − ȳ ) = β̂1 [ Σ_{i=1}^n xi (xi − x̄) ]

β̂1 = Σ_{i=1}^n (xi − x̄)(yi − ȳ ) / Σ_{i=1}^n (xi − x̄)² = Sample Covariance(xi , yi ) / Sample Variance(xi ) (37)

Same equations as before!
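To see the two derivations agree numerically, the following sketch (assuming scipy is available) minimises the sum of squared residuals and recovers the same estimates as the closed-form solution above:

```python
# Sketch: minimising the sum of squared residuals numerically gives the same
# estimates as the method-of-moments / closed-form solution.
import numpy as np
from scipy.optimize import minimize

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # illustrative data
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1])

def ssr(params):
    b0, b1 = params
    return ((y - b0 - b1 * x) ** 2).sum()

res = minimize(ssr, x0=np.zeros(2))
print(res.x)   # matches (b0_hat, b1_hat) from the closed-form expressions
```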


Deriving the OLS estimators - 2 approaches

I Method of moments: impose population assumptions on sample


I OLS: minimise the sum of squared deviations of yi from the sample regression line (i.e. the residuals ûi )
I NB: the OLS regression coefficients can be derived on any sample of the
data, independently of the model
I But: only under certain assumptions about the true model can we hope
that our OLS regression coefficients are “good" estimators of some true
population coefficients
Interpreting OLS estimates

I Once we have the numbers β̂0 and β̂1 for a given data set, we write the
OLS regression line as a function of x:

ŷ = β̂0 + β̂1 x (38)


I The OLS regression line allows us to predict y for any (sensible) value of
x. It is also called the sample regression function.
I The intercept, β̂0 , is the predicted y when x = 0. (The prediction is
usually meaningless if x = 0 is not possible.)
I The slope, β̂1 , allows us to predict changes in y for any (reasonable)
change in x:

∆ŷ = β̂1 ∆x (39)


I If ∆x = 1, so that x increases by one unit, then ∆ŷ = β̂1 . So β̂1 is the “predicted change in y when x changes by 1 unit”.
EXAMPLE: Effects of Education on Hourly Wage (WAGE2.DTA)

I Data are from 1991 on men only. wage is reported in dollars per hour,
educ is number of completed years of schooling.
I The estimated equation is

ŵage = −5.12 + 1.43 educ (40)
n = 759 (41)
I Below we discuss the negative intercept. Literally, it says that wage is
predicted to be −$5.12 when educ = 0!
I Each additional year of schooling is estimated to be worth $1.43.
EXAMPLE: Effects of Education on Hourly Wage (WAGE2.DTA)

ŵage = −5.12 + 1.43 educ (42)
n = 759 (43)

I Note: We do not know the true population coefficients β0 and β1 . Rather, β̂0 = −5.12 and β̂1 = 1.43 are our estimates from this particular sample of 759 men. These estimates may or may not be close to the population values.
I If we obtain another sample of 759 men the estimates would almost
certainly change. But we can use the sampling distribution of the OLS
estimators to derive confidence intervals and test hypotheses about β0 and
β1 .
EXAMPLE: Effects of Education on Hourly Wage (WAGE2.DTA)

I The function

ŵage = −5.12 + 1.43 educ (44)

is the OLS (or sample) regression line.
I Plugging in educ = 0 gives the silly prediction ŵage = −5.12. Extrapolating outside the range of the data can produce strange predictions. There are no men in the sample with educ < 8.
I When educ = 8,

ŵage = −5.12 + 1.43(8) = 6.32 (45)
I The predicted hourly wage at eight years of education is $6.32, which we
can think of as our estimate of the average wage in the population when
educ = 8. But no one in the sample earns exactly $6.32: some earn more,
some earn less. One worker earns $6.25, which is close.
3. Properties of OLS on any Sample of Data

1. Algebraic Properties of OLS estimates, that hold mechanically / independently of the underlying population model
2. Goodness of fit: R 2
3. Properties of OLS on any Sample of Data

I Once we have the sample regression function

ŷ = β̂0 + β̂1 x (46)

we get the OLS fitted values by plugging the xi into the equation:

ŷi = β̂0 + β̂1 xi , i = 1, 2, . . . , n (47)


I The OLS residuals are

ûi = yi − ŷi = yi − β̂0 − β̂1 xi , i = 1, 2, . . . , n (48)


1. Algebraic Properties of OLS Statistics
From the first OLS normal equation (22),

N⁻¹ Σ_{i=1}^N (yi − β̂0 − β̂1 xi ) = 0 (49)

1. The OLS residuals always add up to zero:

Σ_{i=1}^n ûi = 0 (50)

2. Because yi = ŷi + ûi by definition,

n⁻¹ Σ_{i=1}^n yi = n⁻¹ Σ_{i=1}^n ŷi + n⁻¹ Σ_{i=1}^n ûi

and so

ȳ = ŷ̄ (51)

So the sample average of the actual yi is the same as the sample average of the fitted values ŷi .
Algebraic Properties of OLS Statistics 2

From the second OLS normal equation (23), N⁻¹ Σ_{i=1}^N xi (yi − β̂0 − β̂1 xi ) = 0:

4. The sample covariance (and therefore the sample correlation) between the explanatory variable and the residuals is always zero:

Σ_{i=1}^n xi ûi = 0 (52)

5. Because the ŷi are linear functions of the xi , the fitted values and residuals are uncorrelated, too:

Σ_{i=1}^n ŷi ûi = Σ_{i=1}^n (β̂0 + β̂1 xi ) ûi = β̂0 Σ_{i=1}^n ûi + β̂1 Σ_{i=1}^n xi ûi = 0 (53)

I Properties (50) to (53) hold by construction: β̂0 and β̂1 were chosen to make them true (see the verification sketch below).
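A quick verification sketch of properties (50) to (53) on simulated data (illustrative parameters only):

```python
# Sketch: verify the algebraic properties (50)-(53) on simulated data.
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(size=500)
y = 1.0 + 2.0 * x + rng.normal(size=500)   # made-up population

b1 = ((x - x.mean()) * (y - y.mean())).sum() / ((x - x.mean()) ** 2).sum()
b0 = y.mean() - b1 * x.mean()
y_hat = b0 + b1 * x
u_hat = y - y_hat

print(u_hat.sum())              # ~0: residuals add up to zero, (50)
print(y.mean() - y_hat.mean())  # ~0: same sample averages, (51)
print((x * u_hat).sum())        # ~0: residuals uncorrelated with x, (52)
print((y_hat * u_hat).sum())    # ~0: residuals uncorrelated with fits, (53)
```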
Algebraic Properties of OLS Statistics 3: ȳ = β̂0 + β̂1 x̄

I From (22), the point (x̄, ȳ) is always on the OLS regression line. That is, if we plug in the average for x, we predict the sample average for y:

N⁻¹ Σ_{i=1}^N (yi − β̂0 − β̂1 xi ) = 0
⇒ ȳ = β̂0 + β̂1 x̄ (54)

Again, we chose the estimates to make this true.

I This implies we can write the sample regression function in terms of “mean deviations” as

β̂0 = ȳ − β̂1 x̄
yi = β̂0 + β̂1 xi + ûi
⇒ yi − ȳ = β̂1 (xi − x̄) + ûi (55)

Sometimes this helps to interpret β̂1 . It also means we can estimate β1 from
N⁻¹ Σ_{i=1}^N (xi − x̄)(yi − ȳ − β̂1 (xi − x̄)) = 0.
2. Goodness-of-Fit

I For each observation, write yi = ŷi + ûi
I Define the total sum of squares (SST), explained sum of squares (SSE), and residual sum of squares (or sum of squared residuals, SSR) as

SST = Σ_{i=1}^n (yi − ȳ )² (56)
SSE = Σ_{i=1}^n (ŷi − ȳ )² (57)
SSR = Σ_{i=1}^n ûi² (58)

I Each of these is a sample variance when divided by n (or n − 1). SST/n is the sample variance of yi , SSE/n is the sample variance of ŷi , and SSR/n is the sample variance of ûi .
Goodness-of-Fit

I By writing

SST = Σ_{i=1}^n (yi − ȳ )² = Σ_{i=1}^n [(yi − ŷi ) + (ŷi − ȳ )]² (59)
    = Σ_{i=1}^n [ûi + (ŷi − ȳ )]² (60)

and using that the fitted values and residuals are uncorrelated,

SST = SSE + SSR (61)

I Assuming SST > 0, we can define the fraction of the total variation in yi that is explained by xi (or the OLS regression line) as

R² = SSE/SST = 1 − SSR/SST (62)

I Called the R-squared of the regression.
I It can be shown to equal the square of the correlation between yi and ŷi . Therefore,

0 ≤ R² ≤ 1 (63)
Goodness-of-Fit

I R² = 0 means no linear relationship between yi and xi ; R² = 1 means a perfect linear relationship.
I As R² increases, the yi are closer and closer to falling on the OLS regression line.
I Do not fixate on R². It is a useful summary measure, but it tells us nothing about causality. Having a “high” R-squared is neither necessary nor sufficient to infer causality.
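A sketch computing SST, SSE, SSR and R² on simulated data, checking both the decomposition (61) and that R² equals the squared correlation between yi and ŷi :

```python
# Sketch: compute SST, SSE, SSR and R^2, and check the identities of this slide.
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(size=500)
y = 1.0 + 2.0 * x + rng.normal(size=500)   # made-up population

b1 = ((x - x.mean()) * (y - y.mean())).sum() / ((x - x.mean()) ** 2).sum()
b0 = y.mean() - b1 * x.mean()
y_hat = b0 + b1 * x
u_hat = y - y_hat

sst = ((y - y.mean()) ** 2).sum()
sse = ((y_hat - y.mean()) ** 2).sum()
ssr = (u_hat ** 2).sum()
print(np.isclose(sst, sse + ssr))                   # SST = SSE + SSR, (61)
print(sse / sst, np.corrcoef(y, y_hat)[0, 1] ** 2)  # R^2 = corr(y, y_hat)^2
```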
4. Units of Measurement and Functional Form
Units of Measurement

I In the simple linear regression model y = β0 + β1 x + u, the coefficients are interpreted as follows:
1. β0 is the value of y when x = 0
2. β1 is the change in y when x changes by 1 unit, ceteris paribus (i.e. holding all else, including u, fixed). Sometimes easier to interpret from ∆y = β1 ∆x + ∆u
I It is very important to know how y and x are measured in order to interpret regression functions. Consider an equation estimated from CEOSAL1.DTA, where annual CEO salary is in thousands of dollars and the return on equity is a percent:

ŝalary = 963.191 + 18.501 roe (64)
n = 209, R² = .0132 (65)

I When roe = 0 (it never is in the data), ŝalary = 963.191. But salary is in thousands of dollars, so this is $963,191.
I A one percentage point increase in roe increases predicted salary by 18.501, or $18,501.
Units of Measurement

I What if we measure roe as a decimal number, rather than a percent?


Define

roedec = roe/100 (66)


I What will happen to the intercept, slope, and R² when we regress salary on roedec?
I To see the effect, substitute roe = 100 · roedec in (64).
I Nothing happens to the intercept: roedec = 0 is the same as roe = 0. But the slope will increase by a factor of 100. The goodness-of-fit does not change.
I The new regression is

ŝalary = 963.191 + 1850.1 roedec (67)
n = 209, R² = .0132 (68)

I Now a one percentage point change in roe is the same as ∆roedec = .01, and so we get the same effect as before.
Units of Measurement

I What if we measure salary in dollars, rather than thousands of dollars, so salarydol = 1000 · salary?
I Substitute salary = salarydol/1000 and simplify: both the intercept and slope get multiplied by 1000:

ŝalarydol = 963,191 + 18,501 roe (69)
n = 209, R² = .0132 (70)
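The following sketch replicates these rescaling effects on simulated data (not the CEOSAL1.DTA sample; all parameters are made up):

```python
# Sketch: units of measurement. Rescaling x by 1/100 multiplies the slope by
# 100; rescaling y by 1000 multiplies both coefficients by 1000.
import numpy as np

def fit(x, y):
    b1 = ((x - x.mean()) * (y - y.mean())).sum() / ((x - x.mean()) ** 2).sum()
    return y.mean() - b1 * x.mean(), b1

rng = np.random.default_rng(4)
roe = rng.uniform(5, 25, size=200)                          # percent (made up)
salary = 900 + 20 * roe + rng.normal(scale=500, size=200)   # thousands (made up)

print(fit(roe, salary))          # baseline (b0, b1)
print(fit(roe / 100, salary))    # roedec: slope multiplied by 100
print(fit(roe, 1000 * salary))   # salarydol: both coefficients times 1000
```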
Using the Natural Logarithm in Simple Regression

I Recall the wage example:

ŵage = −5.12 + 1.43 educ (71)
n = 759, R² = .133 (72)
I Might be an okay approximation, but unsatisfying for a couple of reasons.
First, the negative intercept is a bit strange (even though the equation
gives sensible predictions for education ranging from 8 to 20).
I Second reason is more important: the dollar value of another year of
schooling is constant. So the 16th year of education is worth the same as
the second. We expect additional years of schooling to be worth more, in
dollar terms, than previous years.
I How can we incorporate an increasing effect? One way is to postulate a
constant percentage effect. We can approximate percentage changes using
the natural log (log for me, but also ln is common).
Using the Natural Logarithm in Simple Regression

I Now let the dependent variable be log(wage):

log(wage) = β0 + β1 educ + u (73)

Holding u fixed,

∆ log(wage) = β1 ∆educ (74)

so
β1 = ∆log(wage)/∆educ (75)
Using the Natural Logarithm in Simple Regression

β1 = ∆log(wage)/∆educ (76)

I For small changes of w from w1 : log(w) ≈ log(w1) + (w − w1)/w1 (first-order Taylor approximation)
I Thus, a small log change equals the proportionate change:
∆log(wage) = log(w2) − log(w1) ≈ [log(w1) + (w2 − w1)/w1 ] − log(w1) = ∆w/w1
I Note that a proportionate change ∆x/x (e.g. 0.05) corresponds to 100 · (∆x/x) percent (e.g. 5 percent), since ∆x/x = %∆x/100. Hence

∆log(wage) = (1/100) · %∆wage = β1 ∆educ + ∆u (77)

I This gives a simple interpretation of β1 : multiplying by 100,

100 β1 ≈ %∆wage when ∆educ = 1 (78)

I In this example, 100 β1 is often called the return to education (just like an investment).
I This measure is free of the units of measurement of wage (currency, price level).
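A sketch of the log-level model on simulated data (not the WAGE2.DTA sample; the 9 percent return used to generate the data is made up):

```python
# Sketch: log-level model. 100 * b1_hat approximates the percent change in wage
# per extra year of schooling.
import numpy as np

rng = np.random.default_rng(5)
educ = rng.integers(8, 21, size=1000).astype(float)
# generate log(wage) with a (made-up) 9 percent return to schooling:
log_wage = 0.6 + 0.09 * educ + rng.normal(scale=0.4, size=1000)

b1 = (((educ - educ.mean()) * (log_wage - log_wage.mean())).sum()
      / ((educ - educ.mean()) ** 2).sum())
print(100 * b1)   # ~9: each extra year of schooling raises wage by about 9%
```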
Using the Natural Logarithm in Simple Regression

I We still want to explain wage in terms of educ! This gives us a way to get
a (roughly) constant percentage effect.
I Predicting wage is more complicated but not usually the most interesting
question.
I The next graph shows the relationship between wage and educ when
u = 0.
Using the Natural Logarithm in Simple Regression

I We can also use the log on both sides of the equation to get constant
elasticity models. For example, if

log(salary ) = β0 + β1 log(sales) + u (79)

then

β1 = ∆log(salary)/∆log(sales) ≈ %∆salary/%∆sales (80)
I The elasticity is free of units of salary and sales.
I A constant elasticity model for salary and sales makes more sense than a
constant dollar effect.
I The elasticity does not change if sales is measured in, say, billions. (So, define lsalesbil = log(sales/1000).) Only the intercept will change, as the sketch below illustrates.
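A sketch of the constant-elasticity model on simulated data (illustrative parameters): rescaling sales shifts log(sales) by a constant, leaving the slope (the elasticity) unchanged:

```python
# Sketch: log-log (constant elasticity) model. Rescaling sales shifts log(sales)
# by a constant, so the slope (the elasticity) is unchanged; only the intercept
# moves.
import numpy as np

def fit(x, y):
    b1 = ((x - x.mean()) * (y - y.mean())).sum() / ((x - x.mean()) ** 2).sum()
    return y.mean() - b1 * x.mean(), b1

rng = np.random.default_rng(6)
log_sales = rng.normal(7.0, 1.0, size=300)   # made-up, sales in millions
log_salary = 4.0 + 0.25 * log_sales + rng.normal(scale=0.3, size=300)

print(fit(log_sales, log_salary))                 # elasticity ~0.25
print(fit(log_sales - np.log(1000), log_salary))  # sales in billions: same slope
```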
Using the Natural Logarithm in Simple Regression

Model         Dep. Var.   Indep. Var.   Interpretation of β1
Level-Level   y           x             ∆y = β1 ∆x
Level-Log     y           log(x)        ∆y = (β1/100) %∆x
Log-Level     log(y)      x             %∆y = (100 β1) ∆x
Log-Log       log(y)      log(x)        %∆y = β1 %∆x
What is “linear"?

I The possibility of using the natural log to get nonlinear relationships


between y and x raises a question: What do we mean now by “linear”
regression?
I The answer is that the model is linear in the parameters, β0 and β1 .
I In other words, we must be able to write the model as g(y) = β0 + β1 f(x) + u for known transformations g(·) and f(·).
I We can use any transformations of the dependent and independent
variables to get interesting interpretations for the parameters.
Recap

I We defined the simple linear regression model

y = β0 + β1 x + u (81)
I We saw how we needed to assume E (u|x) = 0 to identify β0 , β1 from
observations on x, y .
I Alternatively, we could have assumed Cov (x, u) = 0, or independence of
x, u.
I We derived the OLS estimator of β1 ,

β̂1 = Σ_{i=1}^n (xi − x̄)(yi − ȳ ) / Σ_{i=1}^n (xi − x̄)² = Sample Covariance(xi , yi ) / Sample Variance(xi ) (82)

using the method of moments and OLS (with β̂0 = ȳ − β̂1 x̄).


I We showed algebraic properties of OLS estimates.
I And derived R 2 as a measure of goodness of fit.
I We also discussed changing the units of x and y , and using log(y), log(x).
