Week1 Lecture2

This document discusses different approaches to econometric analysis, including structural vs reduced form models and micro vs macroeconometrics. It also covers out-of-sample predictions using linear regression models and how to evaluate models based on their predictive performance rather than in-sample fit alone.

Econometrics 1 (6012B0374Y)

dr. Artūras Juodis

University of Amsterdam

Week 1. Lecture 2

February, 2024

1 / 57
Overview

1 The Art of Econometrics
    Econometric analysis
    Out-of-sample (counterfactual predictions)
2 Why OLS and linear model? (Advanced material)
    Structural motivation of linear models
    Population linear projection
    Population Least-squares
3 Finite sample statistical properties
    Classical assumptions
    Bias of the OLS estimator
    Variance of the OLS estimator
4 Summary

2 / 57
The plan for today

I We discuss different approaches to econometric analysis and

different ways in which the linear model can be motivated.
I We argue how the LS objective function can be motivated from the
infinite population point of view.
I We discuss conditions that can be used to distinguish between causal
and non-causal linear models.
I We study whether the OLS estimator is unbiased and efficient for any
n.

3 / 57
Recap: Linear model

In the first lecture, we considered the simple linear model with a single
regressor xi
yi = α + βxi + εi , i = 1, . . . , n. (1)
We used this model to understand the determinants of hotel prices in
Vienna. Unfortunately, we did not provide any motivation for why the linear
model is a useful starting point for empirical analysis. More on this today.

4 / 57
Recap: OLS estimator

We used sample data {(yi , xi )}ni=1 to construct statistics that can be used
as estimates of (α, β). For this purpose, we considered the Ordinary Least
Squares (OLS) objective function.

The OLS estimators:

β̂ = Σi (xi − x̄)(yi − ȳ) / Σi (xi − x̄)²,   (2)

and

α̂ = ȳ − x̄ β̂.   (3)
Today we show why OLS is the natural estimator for the linear model, and
investigate whether it is unbiased and efficient with respect to the true probabilistic
model that generated the data (the Data Generating Process, or DGP).

5 / 57
1. The Art of Econometrics

6 / 57
1.1. Econometric analysis

7 / 57
Structural and reduced form model analysis

In many cases we distinguish between econometric/statistical models

based on their relationship with economic (behavioural) models.
I Reduced form models are usually driven by heuristic observations about
the potential relationships between certain variables. They are usually
not well grounded in any economic model. With these models it is generally
difficult to draw conclusions and make predictions outside the range of
observations at hand. They usually focus on internal validity.
I Structural form models/analysis are generally (at least partially)
derived from certain types of economic models (e.g. individual utility
maximization, firm profit maximization, etc.). They are fairly restrictive
as their properties do not tell us a lot about the situations where these
economic models fail. However, they can tell us a lot about
counterfactual/out-of-sample predictions. The focus is thus on
external validity.

8 / 57
Microeconometrics vs. Macroeconometrics

Microeconometrics is mostly characterized by its analysis of cross-section (this

course) and panel data (Microeconometrics, 3rd year) and by its focus on
individual consumers, firms, and micro-level decision makers. Overall, the
data are disaggregated.

Macroeconometrics is concerned with the analysis of time-series data (Time


Series, 3rd year), usually of broad aggregates such as price levels, the money
supply, exchange rates, output, investment, economic growth, and so on.

Within financial econometrics, tools from both micro- and macroeconometrics are used
(as well as separate tools unique to financial datasets). Applications can be
both aggregate (e.g. fluctuations of a stock market index) and
disaggregated (e.g. fluctuations of individual stocks within the same stock
exchange).

9 / 57
Microeconometrics vs. Macroeconometrics. Examples.

Micro:
I Does attending an elite college bring a payoff in expected
lifetime income sufficient to justify the higher tuition?
I Does an increase in the minimum wage lead to reduced employment?
I Do smaller class sizes bring real benefits in student performance?
I Does the religious belief of your GP have an impact on the choice of
birth control? What is the effect on lifetime income?
Macro:
I Does a monetary policy regime that is strongly oriented towards
controlling inflation impose real costs in terms of lost output?
I What are the effects of monetary policy when interest rates are
negative?
I Why did the volatility of inflation go down in the 1990s?

10 / 57
The model or a model? Causal or non-causal analysis?

A statistical/probabilistic model is supposed to describe relationships

between multiple random variables. In most cases, linearity is an important
assumption.

Overall, for economists (and econometricians) it is important to distinguish

between truly causal models (e.g. an increase in xk,i causes an
increase/decrease in yi , ceteris paribus) and non-causal models (where xk,i
is simply a feature that helps to predict yi better).

As we will learn in this course, the distinction between causal and
non-causal models usually comes from the assumptions we impose on the
distribution of εi and (x1,i , . . . , xK,i ).

11 / 57
1.2. Out-of-sample (counterfactual predictions)

12 / 57
The R² is not everything

So far we have learned that the OLS estimator maximizes the sample R² measure. So, if

you were to choose between two competing models, you could opt for the
model with the highest R². This can be risky (as we know these days)!
I Both the OLS estimator and the R² are calculated from the same
data {(yi , xi )}ni=1 . Model selection/evaluation done this way is not
honest.
I A higher R² does not imply that your out-of-sample predictions are
better (in an appropriately defined way) and/or more useful
for policy/business decision making.
The first aspect was probably discussed in your Statistical/Machine learning
course. I focus on the second point.

13 / 57
The counterfactual predictions

Given the linear regression model:

yi = α + βxi + εi , (4)

and the estimated regression coefficients (α̂, β̂), we can think of making
predictions about the LHS variable y for any value of the RHS variable x,
i.e.:

ŷ(x) = α̂ + β̂ x.   (5)

Here I make the dependence on x explicit. These are point predictions, since
we act as if there were no error in our prediction.

14 / 57
The Vienna hotels example.

We considered two specifications with the corresponding functions for fitted


values.

Binary:
ŷ(D) = 92.09 + 34.67 D,   (6)
for D = 1(distance < 2km).

Continuous:
ŷ(distance) = 118.73 − 3.44 distance.   (7)
Can we compare the usefulness of the two models in terms of their
predictions?
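As a quick numerical check, here is a minimal sketch (Python, using only the rounded coefficients from Eqs. (6)-(7); the Vienna data themselves are not reproduced here) of how the two fitted specifications generate point predictions. Small discrepancies with the numbers quoted on the next slide come from the rounding of the coefficients.

```python
# Point predictions from the two fitted specifications, Eqs. (6)-(7).
# Illustrative sketch only: coefficients are the rounded values from the slides.

def predict_binary(distance_km: float) -> float:
    """Binary model: y-hat = 92.09 + 34.67 * 1(distance < 2km)."""
    D = 1.0 if distance_km < 2.0 else 0.0
    return 92.09 + 34.67 * D

def predict_continuous(distance_km: float) -> float:
    """Continuous model: y-hat = 118.73 - 3.44 * distance."""
    return 118.73 - 3.44 * distance_km

for d in (2.0, 5.0):
    print(f"{d:.0f} km: binary = {predict_binary(d):.2f}, "
          f"continuous = {predict_continuous(d):.2f}")
```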

15 / 57
Counterfactual (thought) experiment.

Consider a hotel chain that wants to set up a hotel in Vienna. It has a


choice between two locations: location1 is 5km from the city center, while
location2 is just 2km from the city center. In all other aspects, the hotels
will be identical.

The binary model in both cases suggests a per-night average price of:

ŷ(D = 0) = 92.09.   (8)

The continuous-variable model, on the other hand, gives two predictions:

ŷ(x = 2) = 111.85,   ŷ(x = 5) = 101.51.   (9)

16 / 57
Which one is better?
Benefits of continuous:
I The model with the continuous distance variable gives different predictions that
are more in line with common sense.
I Also, the model avoids the boundary problem at 2km, as the difference
between the predictions for 1.99km and 2.00km is only about 0.03eur.
I This is all despite a worse R²!
Benefits of binary:
I Usually hotels in the city center can be very heterogeneous (think
of IBIS Budget and Conrad Hilton). If you are only interested in
predicting prices outside the 2km radius, maybe only the information
about the hotels outside that radius is relevant. For both 2km and 5km, the
binary model predicts using only data from outside the 2km radius.
I Better in-sample R².
Summary: if you want to choose a good model for decision making,
you have to take in-sample, out-of-sample, as well as
non-statistical considerations into account.

17 / 57
Interpolation vs. extrapolation

I Computing the predicted value of the LHS variable for non-existing

values of the RHS variable in between existing values in the data is called
interpolation.
I Computing it for values of the explanatory variable that are outside its
range (i.e. lower than min, or larger than max) is called extrapolation.
In our case, we conducted interpolation as both 2km and 5km are in the
range of observed distances. Extrapolation would occur if we considered
a distance of 25km.

Do you think both models would still provide useful predictions? Why?

18 / 57
2. Why OLS and linear model? (Advanced
material)

19 / 57
2.1. Structural motivation of linear models

20 / 57
Binary regressor as a motivation of linear model (Reduced
form thinking)

Consider a situation where you have a random sample of (yi , Di ), where

Di ∈ {0, 1} is a binary variable. You know that in the infinite population:

E[yi |Di = 0] = µ0 ,   E[yi |Di = 1] = µ1 .   (10)

Combining the two:

E[yi |Di ] = µ0 + Di (µ1 − µ0 ).   (11)

From there:

yi = µ0 + Di β + εi ,   (12)

with β = µ1 − µ0 and εi ≡ yi − E[yi |Di ]. Hence the linear model is a natural (or, as we
usually call it, non-parametric) model for the conditional mean of yi given the binary
variable Di . Hence, linearity in (12) is not a restriction at all!
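A minimal simulation sketch (assumed values for µ0 , µ1 and the noise; not part of the slides) illustrating this point: with a binary regressor, the OLS slope and intercept of Eqs. (2)-(3) coincide with the two sample group means, so the linear specification imposes nothing.

```python
import numpy as np

# Sketch: with a binary regressor D, the OLS slope equals the difference in
# group means (mu1 - mu0) and the intercept equals the D=0 group mean.
rng = np.random.default_rng(0)
n, mu0, mu1 = 10_000, 92.0, 127.0            # illustrative values
D = rng.integers(0, 2, size=n)                # binary regressor
y = mu0 + D * (mu1 - mu0) + rng.normal(0, 20, size=n)

# OLS slope and intercept for y on D, as in Eqs. (2)-(3)
beta_hat = np.cov(D, y, bias=True)[0, 1] / np.var(D)
alpha_hat = y.mean() - beta_hat * D.mean()

print(alpha_hat, beta_hat)
print(y[D == 0].mean(), y[D == 1].mean() - y[D == 0].mean())
# The two pairs of numbers coincide (up to floating-point error).
```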

21 / 57
Coefficient β

Note that in the case of binary/discrete regressors, we cannot think of


marginal changes (as variables are not continuous), thus:

β = E[yi |Di = 1] − E[yi |Di = 0]. (13)

For example, if yi is your health status, while Di measures whether you
received a certain drug, then the coefficient β measures the Average
Treatment Effect (ATE) in the infinite population.

In the course Empirical Project, you will learn more about estimation of
ATEs and of the Average Treatment Effect on the Treated (ATT).

22 / 57
Linearized economic model as a motivation (Structural
form thinking)

Assume that we observe a set of companies whose production function can be

well approximated by a Cobb-Douglas production function of the form:

Yi = Ai Li^β Ki^α,   (14)

where Yi is output, Li labour, Ki capital, and Ai an unobserved

non-negative productivity shock. Then, after taking logs:

yi = βli + αki + εi , (15)

where yi = ln(Yi ), ki = ln(Ki ), li = ln(Li ), and εi = ln(Ai ).

If the constant returns-to-scale assumption is reasonable, then β = 1 − α.

Using real data, this assumption can be (statistically) tested.
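The sketch below (assumed parameter values and distributions, purely illustrative) simulates Cobb-Douglas data and recovers (β, α) by OLS on the log-linearized model (15); an intercept is included to absorb the mean of ln(Ai ).

```python
import numpy as np

# Illustrative sketch (assumed DGP): simulate Cobb-Douglas data and recover
# (beta, alpha) by OLS on the log-linearized model (15).
rng = np.random.default_rng(1)
n, beta_true, alpha_true = 5_000, 0.6, 0.4            # CRS: beta + alpha = 1

L = rng.lognormal(mean=2.0, sigma=0.5, size=n)        # labour
K = rng.lognormal(mean=3.0, sigma=0.5, size=n)        # capital
A = rng.lognormal(mean=0.0, sigma=0.2, size=n)        # productivity shock
Y = A * L**beta_true * K**alpha_true                  # Cobb-Douglas output (14)

# Log-linearize: ln Y = beta*ln L + alpha*ln K + ln A (intercept absorbs E[ln A])
X = np.column_stack([np.ones(n), np.log(L), np.log(K)])
coef, *_ = np.linalg.lstsq(X, np.log(Y), rcond=None)
print("beta_hat, alpha_hat:", coef[1], coef[2])
print("implied returns to scale:", coef[1] + coef[2])  # close to 1 under CRS
```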

23 / 57
2.2. Population linear projection

24 / 57
Population-level decomposition

Take two general random variables (yi , xi ) with means µy , µx , and the
covariance σxy .

Consider the problem of finding an alternative, population-level fitted/residual

decomposition:

yi = ŷi + êi .   (16)

Previously, we used the data {(yi , xi )}ni=1 (so statistics) to construct ŷi .
What about the population-level decomposition?

25 / 57
Population-level decomposition

Note that if Cov(xi , yi ) = σxy then:

Cov(xi , yi − (σxy /σx²)xi ) = 0.   (17)

Why? By linearity of the covariance operator:

Cov(xi , yi − (σxy /σx²)xi ) = Cov(xi , yi ) − (σxy /σx²)Cov(xi , xi )
                            = Cov(xi , yi ) − Cov(xi , yi )
                            = 0,

because σx² = Cov(xi , xi ).

26 / 57
Population-level decomposition

Why all of that? Denote ei ≡ yi − µy − (σxy /σx²)(xi − µx ); then:

yi = (µy − (σxy /σx²)µx ) + (σxy /σx²)xi + ei ,   (18)

where E[ei ] = 0 and E[xi ei ] = 0 by construction.

What has just happened here? From the first two moments of (yi , xi ) we
“invented” a linear model:

yi = α0 + β0 xi + ei ,   (19)

where α0 and β0 are the “true” coefficients of the linear model:

β0 ≡ σxy /σx² ,   α0 ≡ µy − β0 µx .   (20)
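A small numerical sketch (assumed, deliberately non-linear, joint distribution for (xi , yi )) that approximates the infinite population by a very large sample: β0 and α0 are computed from the first two moments as in (20), and the implied residual satisfies E[ei ] = 0 and E[xi ei ] = 0 even though the true conditional mean is not linear.

```python
import numpy as np

# Sketch: the linear projection exists for (almost) any joint distribution,
# and its residual is uncorrelated with x by construction.
rng = np.random.default_rng(2)
N = 1_000_000                                      # "infinite population" proxy
x = rng.gamma(shape=2.0, scale=1.5, size=N)
y = np.sin(x) + 0.5 * x + rng.normal(0, 1, size=N)  # non-linear conditional mean

beta0 = np.cov(x, y, bias=True)[0, 1] / np.var(x)   # sigma_xy / sigma_x^2
alpha0 = y.mean() - beta0 * x.mean()                 # mu_y - beta0 * mu_x
e = y - alpha0 - beta0 * x                           # projection residual

print("beta0, alpha0:", beta0, alpha0)
print("E[e], E[x e]:", e.mean(), (x * e).mean())     # numerically zero
```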

27 / 57
Population linear projection

In this equation

yi = α0 + β0 xi + ei ,   (21)

you can think of α0 + β0 xi as the fitted (population) value of yi on xi , and
ei as the population residual (or the error term). Equations like Eq. (21) are
usually referred to as Linear Projections in an L2 space.

As is clear from the above, restrictions like E[ei ] = 0 and E[xi ei ] = 0 generally

are not sufficient to establish causal links between variables.

In this example, β0 just measures a certain scaled correlation between yi and

xi .

28 / 57
Population linear projection vs. causal model

For
yi = α0 + β0 xi + εi ,   (22)

to serve as a proper causal model for the first two moments of yi , it is
generally required that E[εi xi ] = 0 (which is not an assumption at all) is replaced
by the stronger assumption that

E[εi |xi ] = 0. (23)

29 / 57
Special case. Binary regressor.

As we argued in this lecture, this restriction holds naturally when xi is a

binary variable, i.e. xi = Di ∈ {0, 1}: we showed that

yi = µ0 + Di (µ1 − µ0 ) + εi ,   (24)

with E[εi |Di ] = 0 for both Di = 1 and Di = 0. Hence, for models with
binary regressors the population linear projection and the population conditional
expectation coincide.

30 / 57
2.3. Population Least-squares

31 / 57
Linear projection as a solution to optimization problem

The coefficients (α0 , β0 ) can be obtained as the minimizers of the population

Least-squares objective function. Define:

LS(a, b) = E[(yi − a − bxi )²].   (25)

Next we show that:

(α0 , β0 ) = arg min(a,b) LS(a, b).   (26)

32 / 57
Population LS objective function

Observe that we can use the decomposition yi = α0 + β0 xi + ei ; then

LS(a, b) = E[((α0 − a) + (β0 − b)xi + ei )²]
         = E[((α0 − a) + (β0 − b)xi )²] + 2 E[((α0 − a) + (β0 − b)xi )ei ] + E[ei²]
         = E[((α0 − a) + (β0 − b)xi )²] + E[ei²]
         = E[((α0 − a + (β0 − b)µx ) + (β0 − b)(xi − µx ))²] + E[ei²]
         ≥ E[ei²].

The cross term vanishes because E[ei ] = 0 and E[xi ei ] = 0 by construction.
Hence the minimum is achieved at a = α0 and b = β0 .

33 / 57
Population LS objective function

Hence, the linear projection coefficients (α0 , β0 ) that implicitly define (in
population) the linear model:

yi = α + βxi + εi ,   (27)

can be obtained through the minimization of the population least-squares

objective function. This motivates the least-squares objective function also for
any n, as:
I The OLS estimator is defined as the arg min of the sample counterpart of
LS(a, b), obtained upon replacing E[·] with its sample analogue (1/n)Σi .
I Thus, we can expect that (α̂, β̂) should be good estimators for the
population linear projection coefficients (α0 , β0 ) if this is the true
model.
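A sketch (simulated data; scipy's general-purpose minimizer is used purely for illustration) showing that numerically minimizing the sample counterpart of LS(a, b) reproduces the closed-form OLS estimates of Eqs. (2)-(3).

```python
import numpy as np
from scipy.optimize import minimize

# Sketch: the sample analogue of LS(a, b) is minimized at the OLS estimates.
rng = np.random.default_rng(3)
n = 500
x = rng.normal(0, 1, size=n)
y = 1.0 + 2.0 * x + rng.normal(0, 1, size=n)      # illustrative DGP

def sample_LS(params):
    a, b = params
    return np.mean((y - a - b * x) ** 2)           # (1/n) * sum of squared residuals

res = minimize(sample_LS, x0=[0.0, 0.0])           # generic numerical minimization

beta_hat = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
alpha_hat = y.mean() - beta_hat * x.mean()          # Eqs. (2)-(3)

print("numerical arg min:", res.x)
print("closed-form OLS  :", alpha_hat, beta_hat)    # the two should agree
```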

34 / 57
3. Finite sample statistical properties

35 / 57
3.1. Classical assumptions

36 / 57
Statistical properties

Given that (α̂, β̂) are two statistics, we want to use standard statistical
measures to understand the quality of the OLS estimator. In particular, we
might want to investigate:
I Whether the estimator is biased/unbiased for the true values (α0 , β0 ),
for any fixed value of n.
I Whether the estimator is consistent (more about this in Week 4) for
the true values (α0 , β0 ).
I Whether it has the lowest variance and/or root mean squared error
(for any fixed value of n) in a certain class of estimators.
In this lecture, we will tackle points 1 and 3. These are the finite sample (n
fixed) properties of the OLS estimator.

37 / 57
Assumptions (slightly modified from the ones in the book)
To answer those points, we need to impose some restrictions on the
probabilistic model that generated {(yi , xi )}ni=1 .
Assumption 1 (Fixed regressors/Exogeneity). The n observations on the
explanatory variable x1 , . . . , xn can be treated as fixed
numbers (i.e. one can condition on them). They satisfy
Σi (xi − x̄)² > 0.
Assumption 2 (Random disturbances). The n disturbances/error terms
ε1 , . . . , εn are random variables with
E[εi |xi ] = E[εi |x1 , . . . , xn ] = 0.
Assumption 3 (Homoscedasticity). The variances of ε1 , . . . , εn exist and are
all equal: E[εi²|xi ] = E[εi²|x1 , . . . , xn ] = σ0² > 0.
Assumption 4 (Independence). ε1 , . . . , εn are independent after
conditioning on x1 , . . . , xn .
Assumption 5 (Fixed parameters). The true parameters α0 , β0 , σ0² are fixed
unknown parameters.
Assumption 6 (Linear model). The data on y1 , . . . , yn have been generated
by:
yi = α0 + β0 xi + εi .   (28)
Assumption 7 (Normality). The disturbances ε1 , . . . , εn are jointly normally
distributed conditionally on x1 , . . . , xn .

38 / 57
Discussion. Assumptions 1 and 2.

Assumptions 1 and 2 are usually known as the exogeneity restrictions. As


we discussed above, they are generally important in order to distinguish
between the causal and non-causal linear models.

Among other things, E[εi |xi ] = 0 implies that E[εi xi ] = 0. Hence, this
assumption indeed adds information beyond the linear projection we
discussed previously.

39 / 57
Discussion. Assumptions 2,3,4.

These assumptions concern the properties of the disturbance

terms.
I By the law of iterated expectations, Assumption 2 implies that
E[εi ] = 0.
I Extreme distributions like the Cauchy and t(2), whose variances are not
finite, are excluded by Assumption 3.
I The error terms εi are homoscedastic as all variances are the same. This is
in contrast with the heteroscedastic setting (later in this course) where
E[εi²|xi ] ≠ E[εj²|xj ].
I Assumption 4 of independence can be replaced by uncorrelatedness, i.e.
E[εi εj |x1 , . . . , xn ] = 0 for all i ≠ j. But independence (random
sampling) is just much easier to use in proofs.

40 / 57
Discussion. Assumptions 5-6.

These two assumptions are sometimes replaced by the following assumption:

E[yi |x1 , . . . , xn ] = E[yi |xi ] = α0 + β0 xi . (29)

From here: yi = E[yi |x1 , . . . , xn ] + εi with εi defined in Assumptions 2,3,4.


I In particular, this assumption implies that only the characteristics of
the i’th unit determine the explained variable yi . Using the hotels
example, this means that only the location of the i’th hotel matters for
the per-night price of that hotel, not the location of the neighbouring
hotels.
I For some applications, this assumption can be difficult to justify, as
the characteristics of your neighbours might determine your characteristics
and outcomes, e.g. because of local competition.

41 / 57
Discussion. Assumption 7.

This assumption is mostly redundant in modern statistical and

econometric analysis. In particular, together with the other assumptions it
implies that:

yi | xi ∼ N(α0 + β0 xi , σ0²).   (30)

1 It is a very strong assumption that is very rarely satisfied for economic

data.
2 We will only use this assumption for the sake of illustration next week.
3 This assumption is generally not used in modern econometric and
statistical analysis.
4 This assumption is mostly redundant if we study the large sample
(asymptotic) properties of the OLS estimator (α̂, β̂) for n → ∞ (Week 4).
Despite all these facts, it is important that you understand where and how
this assumption is used to derive certain results in the next 2 weeks!

42 / 57
3.2. Bias of the OLS estimator

43 / 57
OLS is unbiased

Let (xi , yi ) ∼ P(x, y |α0 , β0 , . . .) (so random variables are drawn from some
distribution with (among others) parameters α0 and β0 ) for all i = 1, . . . , n.
Then we say that the statistics α̂ and β̂ are unbiased for α0 and β0 if:

E[α̂] = α0 ,   E[β̂] = β0 ,   (31)

assuming that both expectations exist.

We will show that the OLS estimator is unbiased.

44 / 57
OLS estimator of β

Recall that

β̂ = Σi (xi − x̄)(yi − ȳ) / Σi (xi − x̄)².   (32)

Using Assumption 6 we replace yi with yi = α0 + β0 xi + εi :

β̂ = β0 Σi (xi − x̄)(xi − x̄) / Σi (xi − x̄)² + Σi (εi − ε̄)(xi − x̄) / Σi (xi − x̄)²
  = β0 + Σi εi (xi − x̄) / Σi (xi − x̄)².

Next consider:

E[β̂ | x1 , . . . , xn ] = E[β0 | x1 , . . . , xn ] + E[ Σi εi (xi − x̄) / Σi (xi − x̄)² | x1 , . . . , xn ].   (33)

45 / 57
By using the definition of conditional expectations and Assumption 2:

E[ Σi εi (xi − x̄) / Σi (xi − x̄)² | x1 , . . . , xn ] = Σi E[εi | x1 , . . . , xn ](xi − x̄) / Σi (xi − x̄)²
                                                    = 0.

From here:

E[β̂ | x1 , . . . , xn ] = E[β0 | x1 , . . . , xn ] = β0 ,   (34)

and by the law of iterated expectations:

E[β̂] = E[ E[β̂ | x1 , . . . , xn ] ] = β0 .   (35)

Hence, β̂ is an unbiased estimator of β0 .
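A Monte Carlo sketch (assumed DGP satisfying Assumptions 1-6, with illustrative parameter values): holding x1 , . . . , xn fixed across replications, the average of β̂ over many samples is close to β0 , in line with (34)-(35).

```python
import numpy as np

# Sketch: simulate many samples with the same fixed x's and check that the
# average of beta_hat across replications is close to beta_0.
rng = np.random.default_rng(4)
n, alpha0, beta0, sigma0 = 50, 1.0, 2.0, 3.0
x = rng.uniform(0, 10, size=n)                 # fixed regressors (Assumption 1)

R = 20_000
beta_hats = np.empty(R)
for r in range(R):
    eps = rng.normal(0, sigma0, size=n)        # Assumptions 2-4 (and 7)
    y = alpha0 + beta0 * x + eps               # Assumption 6
    beta_hats[r] = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)

print("mean of beta_hat over replications:", beta_hats.mean())   # approx 2.0
```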

46 / 57
OLS estimator of α is also unbiased.

What about α̂? Note that:

α̂ = ȳ − β̂ x̄.   (36)

Hence:

α̂ = α0 + β0 x̄ + ε̄ − β̂ x̄.   (37)

From here:

E[α̂ | x1 , . . . , xn ] = α0 + E[x̄(β0 − β̂) | x1 , . . . , xn ] + E[ε̄ | x1 , . . . , xn ]
                       = α0 + x̄ E[(β0 − β̂) | x1 , . . . , xn ] + 0
                       = α0 + 0 + 0
                       = α0 .

Here the first line follows by the linearity of expectations; the second line by
Assumption 2; and the third line from the fact that E[β̂ | x1 , . . . , xn ] = β0 .

47 / 57
3.3. Variance of the OLS estimator

48 / 57
Variance of β̂

Using similar steps as before, we can derive the (conditional) variance of the
OLS estimator β̂:

E[(β̂ − β0 )² | x1 , . . . , xn ] = E[ ( Σi εi (xi − x̄) / Σi (xi − x̄)² )² | x1 , . . . , xn ]
                               = (1 / (Σi (xi − x̄)²)²) E[ ( Σi εi (xi − x̄) )² | x1 , . . . , xn ].

49 / 57
Next we can expand:

E[ ( Σi εi (xi − x̄) )² | x1 , . . . , xn ] = E[ Σi Σj εi εj (xi − x̄)(xj − x̄) | x1 , . . . , xn ]
                                          = E[ Σi εi² (xi − x̄)² | x1 , . . . , xn ]
                                          = σ0² Σi (xi − x̄)².

Here we used Assumption 4 (independence) in the second line, and


Assumption 3 (homoscedasticity) in the third line.

50 / 57
To summarize:

E[(β̂ − β0 )² | x1 , . . . , xn ] = σ0² / Σi (xi − x̄)².   (38)

It is not difficult to see that the (conditional) variance of β̂ decreases as n

increases.
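A companion Monte Carlo sketch (same kind of assumed DGP as before) comparing the simulated variance of β̂ with the formula σ0²/Σi (xi − x̄)² of Eq. (38) for two sample sizes; the variance also visibly shrinks as n grows.

```python
import numpy as np

# Sketch: the Monte Carlo variance of beta_hat matches Eq. (38) and falls with n.
rng = np.random.default_rng(5)
alpha0, beta0, sigma0, R = 1.0, 2.0, 3.0, 20_000

for n in (25, 100):
    x = rng.uniform(0, 10, size=n)                        # fixed regressors
    theory = sigma0**2 / np.sum((x - x.mean()) ** 2)      # Eq. (38)
    draws = np.empty(R)
    for r in range(R):
        y = alpha0 + beta0 * x + rng.normal(0, sigma0, size=n)
        draws[r] = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    print(f"n={n:4d}  MC variance={draws.var():.5f}  formula={theory:.5f}")
```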

51 / 57
Variance of α̂

As before, expand:

α̂ = α0 + β0 x̄ + ε̄ − β̂ x̄.   (39)

Hence:

E[(α̂ − α0 )² | x1 , . . . , xn ] = E[ε̄² | x1 , . . . , xn ] + E[((β̂ − β0 )x̄)² | x1 , . . . , xn ]
                                − 2 E[((β̂ − β0 )x̄)ε̄ | x1 , . . . , xn ].

Using previous results:

E[ε̄² | x1 , . . . , xn ] = σ0² / n,   (40)

E[((β̂ − β0 )x̄)² | x1 , . . . , xn ] = x̄² σ0² / Σi (xi − x̄)².   (41)

52 / 57
As for the covariance term, note:

E[((β̂ − β0 )x̄)ε̄ | x1 , . . . , xn ] = x̄ E[(β̂ − β0 )ε̄ | x1 , . . . , xn ]
    = x̄ E[ ( Σi εi (xi − x̄) / Σi (xi − x̄)² ) ε̄ | x1 , . . . , xn ]
    = (x̄ / (n Σi (xi − x̄)²)) E[ Σi Σj εi εj (xi − x̄) | x1 , . . . , xn ]
    = (x̄ / (n Σi (xi − x̄)²)) E[ Σi εi² (xi − x̄) | x1 , . . . , xn ]
    = (x̄ σ0² / (n Σi (xi − x̄)²)) E[ Σi (xi − x̄) | x1 , . . . , xn ]
    = 0.

Here in the fourth line we used the conditional independence between εi and εj (i ≠ j);

in the fifth line we used the homoscedasticity assumption; the final line
follows from the fact that Σi (xi − x̄) = 0 by construction.

53 / 57
OLS is actually BLUE

Next week, we show that the OLS estimator (α̂, β̂) is actually BLUE (Best
Linear Unbiased Estimator) in the class of all unbiased estimators that can
be expressed as:

α̃ = Σi ai yi ,   β̃ = Σi bi yi ,   (42)

for some weights {ai } and {bi } that are a function of the data {xi } only. This
is the celebrated Gauss-Markov theorem.
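As a preview, here is an illustrative sketch (simulated data; the competing estimator is the Wald/grouping estimator, chosen only as an example of a linear unbiased alternative) of what the Gauss-Markov theorem predicts: both estimators are unbiased for β0 , but OLS has the smaller variance.

```python
import numpy as np

# Sketch: compare OLS with another linear unbiased estimator of beta_0
# (the Wald/grouping estimator, splitting the sample at the median of x).
rng = np.random.default_rng(6)
n, alpha0, beta0, sigma0, R = 50, 1.0, 2.0, 3.0, 20_000
x = rng.uniform(0, 10, size=n)                 # fixed regressors
high = x > np.median(x)                        # grouping depends on x only

ols, wald = np.empty(R), np.empty(R)
for r in range(R):
    y = alpha0 + beta0 * x + rng.normal(0, sigma0, size=n)
    ols[r] = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    wald[r] = (y[high].mean() - y[~high].mean()) / (x[high].mean() - x[~high].mean())

print("means    :", ols.mean(), wald.mean())   # both approx beta_0 (unbiased)
print("variances:", ols.var(), wald.var())     # OLS variance is the smaller one
```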

54 / 57
4. Summary

55 / 57
Summary today

In this lecture:
I We argued how to make out-of-sample (counterfactual) predictions
based on the OLS regression results.
I We suggested that a good model need not always be the one that
maximizes/minimizes in-sample statistical measures.
I We provided motivation for the OLS objective function.
I We studied unbiasedness and variance properties of the OLS estimator
with a single regressor.

56 / 57
Next week

I We study in detail the properties of the regression model and OLS


estimator for a general number of regressors.
I We introduce the vector and matrix notation for this purpose.
I We will show how multivariate regressions (for general K ) can be
analyzed in terms of K appropriately defined regressions on single
(transformed) regressors.
I We introduce the classical assumptions (going back to the early 20th
century) that can be used to derive exact finite sample properties of
the OLS estimator.

57 / 57
