Econ 4

The document outlines the curriculum for an Advanced Econometrics course at Université Lumière Lyon 2 for the academic year 2023-2024, focusing on advanced econometric methods and their applications. Key topics include time series analysis, ARMA modeling, maximum likelihood principles, and instrumental variable estimation, with an emphasis on understanding the limitations and properties of these tools. The course assumes prior knowledge of basic econometric concepts and aims to deepen the understanding of more complex relationships and causality in econometric models.

Uploaded by

Raphaël Dealet

Advanced Econometrics

master MBFA/APE

Université Lumière Lyon 2

Jouneau-Sion Frédéric

academic year 2023-2024

frederic.jouneau(at)univ-lyon2.fr
Roadmap

1. Beyond Linear Model


2. Introduction to time series analysis
3. ARMA Modeling and unit root issues
4. Maximum likelihood principle and main properties
5. Bootstrap and resampling methods
Bibliography

▸ Gouriéroux Monfort + Time series
▸ James Davison
▸ Alistair R. Hall
▸ De Jong
▸ Cameron and Trivedi
▸ Angrist and Pischke
Scope

The purpose of this course is to give a broad overview of current methods in econometrics. We intend to present the main tools that are currently used to derive econometric inference, the known properties of these tools and their limitations.
This is not an elementary course in econometrics. Basic knowledge of the subject, as well as most basic results, is assumed. We shall devote most of the analysis to extensions of these results and their justifications. But we shall also keep our distance from the difficult pieces of mathematics (ranging from topology to functional analysis) that underlie most of the current research in theoretical econometrics. In a similar vein, sufficient conditions for mathematical existence are given only if relevant.
Lesson 1 : Beyond linear model
How (not) to write a linear model

▸ Before considering more general settings it is best to start with the simplest possible case, namely the linear (regression) model.
▸ According to countless presentations (from both academic and non-academic sources), the linear model may be written as

Y = a + bX + ϵ

where ϵ stands for the unobserved so-called error term.
▸ This presentation is often used to introduce the distinction between "exogenous/causal" effects (namely when X and ϵ are uncorrelated) and non-causal/spurious relationships (when this property fails to hold).
▸ We shall now argue that this presentation is at best useless and, more importantly, misleading. We shall claim that the correct presentation of the linear model is as follows:

∃ b such that E[Y − bX∣X] is constant (1)


Do we need error terms ?

▸ Clearly ϵ does not appear in the (claimed) correct presentation. When (1) holds true, ϵ may be defined as Y − E[Y∣X], in which case its denomination as "error term" is plainly justified.
▸ Remark however that this is a forecasting error. It follows that ϵ differs from X not because it is unobserved, but because it is a random variable which is not yet realized. It should also be stressed that the definition of ϵ involves both Y and X in an asymmetrical way.
▸ Although defining ϵ may be useful as a shortcut in some proofs and models, it is clear that many important properties derive directly from (1), in particular the unbiasedness of the OLS estimator and

\[ \frac{\partial E[Y \mid X = x]}{\partial x} = b \quad \text{when } X \text{ is a continuous variable} \]

\[ \frac{\Delta E[Y \mid X = x]}{\Delta x} = b \quad \text{when } X \text{ is a discrete variable} \]

▸ Hence (1) provides us with both an interpretation of the parameter b and a way to justify an estimator. These two basic (yet important) properties may be presented without referring to any "error term".
Relying on error term may lead to paradoxes

▸ Let us consider Y = a + bX + ϵ as a relationship between three random variables Y, X, ϵ. Also assume b ≠ 0. Then we may write

\[ X = -\frac{a}{b} + \frac{1}{b} Y - \frac{1}{b} \epsilon \]

▸ If ϵ is interpreted as some "unobserved gap" around a linear causal relationship between Y and X, then Y causes X if and only if X causes Y. For instance gender causes the wage gap as much as the wage gap causes gender...
▸ Even if we assume Cov[X, ϵ] = 0, the above manipulations may lead to problems.
▸ Consider a Gaussian vector (X, Y) such that Cov[X, Y] ≠ 0. As is well known, E[Y∣X] is a linear function of X and E[X∣Y] is a linear function of Y. Hence if we perform the same algebraic manipulation as above, we must have two different error terms depending on whether we consider forecasting Y from X or vice versa.
▸ But according to the above computation, if the "causal" effect of X on Y is b = 2 (say) then the "causal" effect of Y on X should be 1/2, so that the product of both effects is 1. Now the product of the two regression slopes, Cov[X, Y]/Var[X] and Cov[X, Y]/Var[Y], is

\[ \frac{\mathrm{Cov}[X, Y]^2}{\mathrm{Var}[X]\,\mathrm{Var}[Y]} \]

which is strictly less than 1 according to the Cauchy-Schwarz inequality as soon as X and Y are not perfectly correlated.
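The argument above can be checked numerically. The following is a minimal sketch, assuming a standardized Gaussian pair with correlation 0.6 (an illustrative value, not taken from the course); it compares the slope of the regression of X on Y with the inverse of the slope of the regression of Y on X.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
rho = 0.6  # assumed correlation between X and Y (illustrative value)
cov_matrix = np.array([[1.0, rho], [rho, 1.0]])
x, y = rng.multivariate_normal([0.0, 0.0], cov_matrix, size=n).T

c = np.cov(x, y)                     # 2x2 sample covariance matrix
slope_y_on_x = c[0, 1] / c[0, 0]     # sample slope of E[Y|X]
slope_x_on_y = c[0, 1] / c[1, 1]     # sample slope of E[X|Y]

product = slope_y_on_x * slope_x_on_y
corr_sq = np.corrcoef(x, y)[0, 1] ** 2

print("slope of Y on X:", slope_y_on_x)
print("slope of X on Y:", slope_x_on_y, "vs 1 / slope of Y on X:", 1 / slope_y_on_x)
print("product of slopes:", product, "squared correlation:", corr_sq)
```

The product of the two slopes coincides with the squared correlation, so the "reverse" slope is not the algebraic inverse of the "direct" one unless the correlation is perfect.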
Endogeneity problems

▸ Clearly (1) may be objected to in many instances. From a purely mathematical viewpoint, linearity is a strong assumption. It has long been recognized that (1) may be extended to account for more plausible relationships. Adaptations to cases in which E[Y∣X = x] is a polynomial or logistic function have proven useful in several cases.¹
▸ Now, economists are usually more concerned about so-called "endogeneity" problems due to omitted variables, errors in variables, simultaneity bias... Probably the most common criticism raised when regression estimates are provided is the "(possible) endogeneity of explanatory variables".
▸ However, one must recall that exogeneity of a variable is defined (see Engle et al 1983) with respect to some parameter. Indeed, since we can always decompose the joint distribution as the product of conditional and marginal distributions

\[ f_{X,Y}(x, y) = f_{Y \mid X = x}(y) \times f_X(x) \]

X is always (weakly) exogenous with respect to the "parameter" f_{Y∣X}. Consequently there is no such thing as an "endogenous" variable per se. Any variable can be considered as weakly exogenous for some parameter in a rich enough model.
▸ What is really at stake is the set of parameters of interest, and whether a given set of variables can be considered as exogenous with respect to these parameters.

¹ Notice that the interpretation of the parameters involved may then be more complex. For instance if E[Y∣X = x] = a + bx + cx² then b cannot be interpreted "other things left equal".
Parameters matter

▸ Consider a Gaussian vector (Y, X, Z) such that Cov(X, Z) ≠ 0. As is well known, E[Y∣X] is a linear function of X and hence OLS provides an unbiased estimate of the slope of E[Y∣X].
▸ Hence, the statement "if you omit a variable that is correlated with the explanatory variable then OLS is biased" is not always true. This statement is true if we use the OLS approach justified for E[Y∣X] when the parameter of interest is the slope of X in E[Y∣X, Z].
▸ Using E[Y∣X] or E[Y∣X, Z] may lead to different measures of causality, as exemplified by the famous Simpson's Paradox. The following table provides the success rates of two treatments to remove kidney stones (as well as the sample sizes in each situation).

Size of the stone    Treatment A      Treatment B
Large                93% (81/87)      87% (234/270)
Small                73% (192/263)    69% (55/80)

Treatment A looks favorable whether the stone is large or small; however, Treatment B's unconditional success rate is 83% = (234+55)/(270+80) whereas that of A is only 78% = (81+192)/(87+263).
Parameters matter cont’
▸ If we define Y = 1 if the treatment is a success (otherwise 0), X = 1 if Treatment A has been applied (otherwise 0) and Z = 1 if the stone is large (otherwise 0), then E[Y∣X] is linear and E[Y∣Z, X] may be written as a linear combination of X, Z and X × Z.
▸ The problem with Simpson's Paradox is not that one measure is biased, since OLS estimates are unbiased in both models

E[Y∣X] = α + βX
E[Y∣X, Z] = a + bX + cZ + dX × Z

The quantity −0.05 (= 78% − 83%) is an unbiased estimate of β, just as 0.04 (= 73% − 69%) is for b, 0.18 (= 87% − 69%) for c and 0.02 (= (93% − 87%) − (73% − 69%)) for d. Notice that since Z ≥ 0 the total effect b + dZ of X on E[Y∣X, Z] is never negative.
▸ The problem is that E[Y∣X] can hardly be considered a parameter of interest in this case. Indeed, in any given case, the stone is either large or small, so the physician should always use Treatment A.
▸ The measure E[Y∣X] would be useful if the size of the stone randomly changed after the treatment has been fixed (quite far-fetched in this case) since we can write

E[Y∣X] = a + bX + cE[Z∣X] + dX × E[Z∣X]

▸ Now consider labor economics, with Y = 1 whenever some unemployed individual has finally found a job, Z = 1 if he made a special search effort, and X = 1 whenever we decrease unemployment benefits. Then in most cases the effort Z is unobserved, and it could change if unemployment benefits drop.
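The kidney stone figures can be turned into individual 0/1 observations and both regressions run by OLS. The sketch below does this with NumPy; the helper `ols` and the variable names are illustrative choices, and the coding of X and Z follows the table as labelled in the slides.

```python
import numpy as np

# (successes, trials) for each cell, keyed by (X, Z):
# X = 1 for Treatment A, Z = 1 for a large stone, as labelled in the table.
cells = {(1, 1): (81, 87), (0, 1): (234, 270), (1, 0): (192, 263), (0, 0): (55, 80)}

rows = []
for (x_val, z_val), (successes, trials) in cells.items():
    rows += [(1, x_val, z_val)] * successes            # successful treatments
    rows += [(0, x_val, z_val)] * (trials - successes)  # failures
y, x, z = np.array(rows, dtype=float).T

def ols(target, *regressors):
    """Least-squares coefficients of target on a constant and the regressors."""
    design = np.column_stack([np.ones_like(target), *regressors])
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    return coef

alpha, beta = ols(y, x)              # E[Y|X] = alpha + beta X
a, b, c, d = ols(y, x, z, x * z)     # E[Y|X,Z] = a + bX + cZ + dXZ

print("beta (unconditional effect of A):", round(beta, 3))
print("b, c, d:", round(b, 3), round(c, 3), round(d, 3))
print("effect of A given Z=0:", round(b, 3), " given Z=1:", round(b + d, 3))
```

Because both models are saturated in binary regressors, the OLS coefficients reproduce the differences in success rates exactly: β is negative while b and b + d are both positive.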
Instrumental estimation some preliminary remarks
▸ The use of instruments is probably one of the most common tools in current applied microeconometric studies to measure "causal" effects. We shall now make use of the arguments developed above to present IV estimation. Let us first go back to our first argument regarding the "error term".
▸ The requirements for a correct instrument as stated in any econometrics textbook, namely "i) correlated with the regressor, ii) uncorrelated with the error term", make direct use of the error term. While i) ensures identifiability, ii) provides a constraint that may be used for estimation.
▸ Now the problem is to explain what the "error" represents.² In particular the equation

Y = a + bX + ϵ

is true for any a, b, Y, X if ϵ is totally arbitrary. Now Cov[ϵ, X] ≠ 0 writes Cov[Y − bX, X] ≠ 0, which is true for any b except b = ∂E[Y∣X = x]/∂x.
▸ Hence if we are unable to explain where the "error term" comes from, any "instrument" (X included) provides its own "b". As a consequence, any discussion as to why this or that instrument is "valid" is pointless.
▸ So textbook presentations claim that some unobserved individual characteristics (W, say) are linked both to the outcome and to the effect under study. As W could cover anything, we rely on instruments which were determined very far in the past or are related to some aggregate effect to ensure requirement ii) holds whatever W could be. But we then often fall short of requirement i). In a sense, weakness of instruments follows from our theory and data leaving too much room for unexplained and unobserved potential causes.

² This problem is settled in current macroeconometrics by the notion of "shocks", which are viewed as deviations from long-run deterministic relationships.
Instrumental estimation some preliminary remarks cont’
▸ Our second argument above states that the parameter of interest should be
precisely defined.
▸ To investigate this point let T be a variable taking values in {0, 1} (the choice of the letter T is of course intended to introduce the treatment effect literature).
▸ If the parameter of interest is ΔE[Y∣T = t]/Δt then running the linear regression of Y on T and computing the OLS estimator provides an unbiased estimate, for E[Y∣T] is always a linear function of T since the treatment takes only two values. Whether T has been randomly assigned or not makes no difference.
▸ "Wait a minute! If I simulate ϵ in such a way that Cov[T, ϵ] ≠ 0 and I compute Y = 1 + 2T + ϵ, then the OLS estimate obtained when running the regression of Y on T cannot be 2, right?"
▸ True, but if Cov[ϵ, T] ≠ 0, then E[Y∣T] ≠ 1 + 2T either! OLS is a fine estimator which provides the genuinely correct measure, which is not 2 in this case.
▸ Hence if we are really interested in something different from ΔE[Y∣T = t]/Δt we should explain what this parameter exactly means. In our simulated example, "2" is obviously not related to the optimal forecast of Y when T is known.
▸ As we have no direct access to the error term except through the observable variables and the (unknown) parameter, we must first define what the parameter is, second specify its links to observable and (possibly) unobservable variables, and third devise a statistical strategy to derive correct inference.
▸ We must then present instrumental variable approaches to infer EXPLICITLY GIVEN parameters of interest WITHOUT relying on error terms.
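The "wait a minute" experiment is easy to run. The sketch below is one possible version, assuming an error of the form ϵ = 0.8(T − 0.5) + noise (an arbitrary choice made here so that Cov[T, ϵ] ≠ 0); under this assumption E[Y∣T] = 0.6 + 2.8T, so the slope of the conditional expectation is 2.8, not 2.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500_000
t = rng.integers(0, 2, size=n)                  # binary treatment
eps = 0.8 * (t - 0.5) + rng.normal(size=n)      # error built to be correlated with T (assumption)
y = 1 + 2 * t + eps

# OLS slope of Y on T and the difference in conditional means
ols_slope = np.cov(t, y)[0, 1] / np.var(t, ddof=1)
diff_in_means = y[t == 1].mean() - y[t == 0].mean()

print("OLS slope:", ols_slope)
print("E[Y|T=1] - E[Y|T=0] (sample):", diff_in_means)
```

The OLS slope coincides with the difference in conditional means and is close to 2.8: OLS recovers ΔE[Y∣T = t]/Δt, and the "2" used in the simulation is simply a different object.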
The omitted variable case

▸ Let us first consider the simplest omitted variable case. Without any reference to an error term, it could be presented as follows

E[Y∣X, W] = a + bX + cW

with W unobserved and b the parameter of interest.
▸ For Z to be an instrument, you need

Cov[Y − bX − cW; Z] = 0

▸ Now as the parameter of interest is b, the above constraint contains a nuisance parameter c. If we want to get rid of it, we need Cov[Z; W] = 0. This exclusion constraint together with Cov[Y − bX; Z] = 0 leads to the famous ratio of covariances estimator Cov[Z; Y]/Cov[Z; X] (provided Cov[Z; X] ≠ 0).
▸ Notice that Z is uncorrelated with the forecasting error term Y − E[Y∣X, W], the omitted variable W and the "pseudo residual" Y − bX. Of course, if we have more than one omitted variable, the same line of reasoning implies Z must be uncorrelated with all of the omitted variables.
▸ To put it the other way around, a given instrument leaves us with a measure Cov[Z; Y]/Cov[Z; X] that is valid in models in which each of the omitted variables is uncorrelated with Z.
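A small simulation illustrates the ratio of covariances estimator in the omitted variable case. This is only a sketch: the parameter values and the way X is generated from Z and W are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 500_000
a, b, c = 1.0, 2.0, 3.0                       # illustrative parameter values
z = rng.normal(size=n)                        # instrument, drawn independently of W
w = rng.normal(size=n)                        # omitted (unobserved) variable
x = 0.7 * z + 0.5 * w + rng.normal(size=n)    # X is correlated with both Z and W
y = a + b * x + c * w + rng.normal(size=n)    # so that E[Y|X,W] = a + bX + cW

def cov(u, v):
    """Sample covariance between two 1-d arrays."""
    return np.cov(u, v)[0, 1]

ols_slope = cov(x, y) / cov(x, x)   # slope of E[Y|X], which is not b here
iv_slope = cov(z, y) / cov(z, x)    # ratio of covariances estimator

print("OLS slope of Y on X:", ols_slope)
print("IV ratio Cov[Z;Y]/Cov[Z;X]:", iv_slope)
```

In this design the OLS slope is close to 2 + 3 × 0.5/1.74 ≈ 2.86, the slope of E[Y∣X], while the covariance ratio is close to b = 2, the slope of X in E[Y∣X, W].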
Interpreting IV estimates in the omitted variable case

▸ The above argument shows that the IV estimate is the correct measure of the slope of the conditional expectation of Y given observed variables and unobserved variables that may be correlated with X, as long as they are uncorrelated with the instrument (and provided the conditional expectation is linear).
▸ Considering another instrument changes the parameter of interest since the (linear) conditional expectation is not the same one. Hence, choosing an instrument merely consists in deciding which forecast you are interested in.
▸ Of course, choosing an instrument which is uncorrelated with "several" plausible explanations while still correlated with the effect under study strengthens your argument. The drawback is that such an instrument may be only weakly correlated with X, leading to identification issues. But you may face another problem.
▸ Let us compare the interpretation of b when W is observed and when it is not. In the first case the OLS estimate may be computed and it measures the average effect of X on Y keeping W fixed. In the second case the IV estimator measures the effect of X while keeping fixed an unknown set of variables, precisely those affecting the expectation of Y and which are uncorrelated with the instrument Z.
▸ If this set of variables is very large, then the interest of the measure may fade away since the experiment in which everything else is kept fixed may lack practical interest.
The error–in–variable case
▸ The case for errors in variables may be presented as follows:³

E[Y∣X∗] = a + bX∗

but OLS cannot be applied directly, for X∗ is not observed. Rather we observe X = X∗ + η (and of course, here η is not observed).
▸ Notice we introduce an error here, but the situation is pretty different. First, the link between observed and unobserved variables contains no parameter. Second, the error now has a clear interpretation, whatever the values of the parameters of interest.
▸ As soon as Var[η] > 0, the naive OLS estimator is biased⁴ since, when η is uncorrelated with X∗ and with Y, it writes

\[ \frac{\mathrm{Cov}[Y, X]}{\mathrm{Var}[X]} = b \left( 1 + \frac{\mathrm{Var}[\eta]}{\mathrm{Var}[X^*]} \right)^{-1} \]

▸ For Z to be an instrument, you need Cov[Y − bX∗; Z] = 0 which translates to

Cov[bη + Y − bX; Z] = 0

▸ As the above condition involves the nuisance parameter Cov[η; Z], it must be null for the ratio of covariances to provide a correct measure.

³ We focus on the one-dimensional case, as adaptation to the multiple variables case is straightforward.
⁴ Be careful, the famous "toward zero bias" statement may not be true if more than one variable is measured with error.
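The attenuation of the naive OLS slope and its correction by an instrument can be illustrated as follows. The sketch assumes a classical measurement error (η independent of X∗ and of Y), for which the attenuation factor is 1/(1 + Var[η]/Var[X∗]), and uses a second noisy measurement of X∗ as the instrument; both choices are assumptions of the example.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 500_000
a, b = 1.0, 2.0                               # illustrative parameter values
x_star = rng.normal(scale=1.0, size=n)        # unobserved regressor, Var[X*] = 1
eta = rng.normal(scale=0.5, size=n)           # measurement error, Var[eta] = 0.25
x = x_star + eta                              # observed regressor
y = a + b * x_star + rng.normal(size=n)       # E[Y|X*] = a + bX*
z = x_star + rng.normal(size=n)               # second measurement, Cov[eta; Z] = 0

def cov(u, v):
    """Sample covariance between two 1-d arrays."""
    return np.cov(u, v)[0, 1]

naive_ols = cov(y, x) / cov(x, x)
attenuated_value = b / (1 + 0.25 / 1.0)       # b (1 + Var[eta]/Var[X*])^(-1) = 1.6
iv_slope = cov(z, y) / cov(z, x)

print("naive OLS:", naive_ols, "theoretical attenuated value:", attenuated_value)
print("IV ratio:", iv_slope)
```

The naive slope settles near 1.6 rather than 2, while the covariance ratio with an instrument uncorrelated with η is close to b.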
Simultaneous equations models: casual introduction
▸ We shall now consider simultaneous equations models. These models are presented using explicit error terms and emphasis is put on "exogenous" and "endogenous" variables. Also, as considerable attention has been given to relationships between "structural form" and "reduced form" parameters, our second argument above (namely that the choice of parameters of interest is an important part of the model) is also relevant.
▸ To illustrate our argument in this case, let us consider the following usual demand/(inverse) supply "model"⁵

\[ \begin{cases} q = \alpha - \beta p + \gamma Z + u & (D) \\ p = a + b q + d W + v & (S) \end{cases} \]

where q is the logarithm of the quantity, p is the logarithm of the price and Z (resp. W) is some "exogenous effect" affecting the demand (resp. supply) curve, u, v being uncorrelated "error terms". As we follow here the usual presentation, we shall assume that β > 0 and b > 0, since price is expected to affect quantity negatively and quantity is expected to affect prices positively (more on this seemingly harmless assumption below).
▸ This setup explains why "equation per equation" OLS estimates are affected by the simultaneity bias. Indeed, the reduced form writes

\[ \begin{cases} p = \dfrac{b(\alpha + \gamma Z) + a + dW + bu + v}{1 + b\beta} & (RP) \\[2ex] q = \dfrac{-\beta(a + dW) + \alpha + \gamma Z + u - \beta v}{1 + b\beta} & (RQ) \end{cases} \]

Hence, Cov[p, u] > 0 and Cov[q, v] < 0 whenever Cov[u, v] = 0.

⁵ Recall as above that the term "model" is not really appropriate; we use it for the sake of the argument developed below.
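The simultaneity bias can be visualised by simulating (p, q) from the reduced form (RP)-(RQ) and regressing q on p and Z. In the sketch below the numerical values of the structural parameters are arbitrary, and the IV step (using W, which is excluded from the demand equation, as an instrument for p) is one standard way to recover β, shown here as an illustration rather than as the course's own procedure.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 500_000
alpha, beta, gamma = 1.0, 0.5, 1.0    # demand parameters (illustrative)
a, b, d = 0.5, 0.8, 1.0               # inverse supply parameters (illustrative)
Z, W = rng.normal(size=n), rng.normal(size=n)
u, v = rng.normal(size=n), rng.normal(size=n)   # uncorrelated "error terms"

den = 1 + b * beta
p = (b * (alpha + gamma * Z) + a + d * W + b * u + v) / den          # (RP)
q = (-beta * (a + d * W) + alpha + gamma * Z + u - beta * v) / den   # (RQ)

# Equation-by-equation OLS for the demand equation: q on (1, p, Z)
X = np.column_stack([np.ones(n), p, Z])
ols_demand = np.linalg.lstsq(X, q, rcond=None)[0]

# IV for the demand equation: instruments (1, W, Z)
inst = np.column_stack([np.ones(n), W, Z])
iv_demand = np.linalg.solve(inst.T @ X, inst.T @ q)

print("Cov[p,u]:", np.cov(p, u)[0, 1], " Cov[q,v]:", np.cov(q, v)[0, 1])
print("OLS coefficient on p:", ols_demand[1], " IV coefficient on p:", iv_demand[1], " -beta:", -beta)
```

With these values the OLS coefficient on p is far from −β, whereas the IV coefficient is close to it.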
Simultaneous equations models: interpretation of structural parameters

▸ Now if Cov[u, v] = 0 then ∂E[q∣p = x, Z]/∂x ≠ −β. Indeed, if it were the case then there would be no simultaneity bias. In particular, the change in p keeping u, v, Z fixed may be due to a change in W (in which case the derivative depends on d).
▸ Hence, the reason why the "structural" effect of price on demand should be negative cannot be justified as we casually did above. A correct justification comes from the differentiation of (RQ) with respect to v and of (RP) with respect to u:

\[ \frac{\partial}{\partial v} \left\{ \frac{-\beta(a + dW) + \alpha + \gamma Z + u - \beta v}{1 + b\beta} \right\} = -\frac{\beta}{1 + b\beta} \]

\[ \frac{\partial}{\partial u} \left\{ \frac{b(\alpha + \gamma Z) + a + dW + bu + v}{1 + b\beta} \right\} = \frac{b}{1 + b\beta} \]

▸ Obviously if b > 0 and β > 0 an upward exogenous shock on price (i.e. v ↑) negatively affects demand at equilibrium and an upward exogenous shock on demand (i.e. u ↑) positively affects the equilibrium price.
▸ But neither b nor β is a correct measure of the elasticities of equilibrium quantities. Structural parameters can only be interpreted by means of out-of-equilibrium "thought experiments". They are not directly related to the distribution of observed equilibrium quantities (p, q).
▸ Using the same letters (namely q, p) for two different purposes may be misleading. But this is exactly what "structural" econometrics is about: deriving inference about "deep" parameters (such as b and β in our example) which may then be used to assess counterfactual arguments.
Inference in structural equation models

▸ In the most general form, if we have m equations and the sample size is n, the inference problem may be presented as follows. Let Y be the n × m matrix of endogenous variables and X be the n × k matrix of exogenous variables; we have

E[YΓ∣X] = XB

where Γ is an unknown non-singular m × m matrix whose diagonal elements are all equal to 1 and B is an unknown k × m matrix.
▸ Post-multiplying by Γ⁻¹ we obtain a multivariate linear model

E[Y∣X] = XBΓ⁻¹ = XΘ

▸ The model being multivariate is not really an issue here since any multivariate linear model may be represented as a usual univariate linear model by means of the vec() operator.
▸ The real problem is to extract the parameters of interest B and Γ from the estimation of Θ. This is clearly not always feasible. For instance if B = 0 then Γ is not identifiable. Also, if M is a size-m non-singular matrix, the couples (B, Γ) and (BM, ΓM) lead to the same value of Θ, since BM(ΓM)⁻¹ = BΓ⁻¹.
▸ Identification restrictions are clearly needed. Necessary ones require k to be large enough (or more precisely that the rank of B is large enough) with respect to m. Common sufficient ones are known as exclusion restrictions in the SEM literature.
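The lack of identification can be seen on a toy example: post-multiplying both structural matrices by the same non-singular matrix leaves the reduced form Θ = BΓ⁻¹ unchanged. The matrices below are arbitrary illustrative values; note that ΓM need not have a unit diagonal, so the normalisation of Γ removes some, but not all, of these observationally equivalent structures.

```python
import numpy as np

rng = np.random.default_rng(5)
k, m = 3, 2
B = rng.normal(size=(k, m))                     # k x m matrix of exogenous effects
Gamma = np.array([[1.0, 0.4], [-0.3, 1.0]])     # unit diagonal, non-singular
M = np.array([[2.0, 1.0], [0.5, 3.0]])          # arbitrary non-singular m x m matrix

theta = B @ np.linalg.inv(Gamma)
theta_alt = (B @ M) @ np.linalg.inv(Gamma @ M)  # reduced form of the transformed structure

print("largest difference between the two reduced forms:", np.abs(theta - theta_alt).max())
```

Both structures imply the same E[Y∣X], so the data alone cannot tell them apart without further restrictions.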
Supplementary material (1) : approximation of non linear model by a linear regression

▸ Clearly, E[Y∣X] has no reason to be linear. It is often claimed that OLS gives an approximation of some non-linear relationship. Let us explore this when both Y and X are univariate.
▸ If f(x) = E[Y∣X = x] is twice differentiable, we have for any x in some neighborhood of E[X]

\[ f(x) = f(E[X]) + (x - E[X]) f'(E[X]) + (x - E[X])^2 f''(c(x))/2 \]

for some c(x).
▸ Hence the (theoretical) slope of the linear regression of Y on X may be written as

\[ \begin{aligned} \beta_{OLS} = \frac{\mathrm{Cov}[Y; X]}{\mathrm{Var}[X]} &= \frac{\mathrm{Cov}[E[Y \mid X]; X]}{\mathrm{Var}[X]} \\ &= f'(E[X]) + \frac{\mathrm{Cov}[X; (X - E[X])^2 f''(c(X))]}{2\,\mathrm{Var}[X]} \\ &= f'(E[X]) + \frac{E[(X - E[X])^3 f''(c(X))]}{2\,\mathrm{Var}[X]} \end{aligned} \]

▸ If the norm of the second derivative of E[Y∣X = x] is uniformly bounded by M on the domain of X we obtain

\[ \left| \beta_{OLS} - \left. \frac{\partial E[Y \mid X = x]}{\partial x} \right|_{x = E[X]} \right| \le M \, \frac{E[|X - E[X]|^3]}{2\,\mathrm{Var}[X]} \]
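A quadratic example makes the gap between the OLS slope and the derivative at the mean concrete. The sketch assumes f(x) = x² and X exponentially distributed with mean 1 (so Var[X] = 1 and the third central moment is 2); the Taylor expansion is then exact with f'' = 2, and the slope Cov[X, f(X)]/Var[X] equals f'(E[X]) + E[(X − E[X])³]/Var[X] = 2 + 2 = 4.

```python
import numpy as np

rng = np.random.default_rng(6)
n = 1_000_000
x = rng.exponential(scale=1.0, size=n)    # E[X] = 1, Var[X] = 1, third central moment = 2
y = x**2 + rng.normal(size=n)             # E[Y|X=x] = f(x) = x^2

beta_ols = np.cov(x, y)[0, 1] / np.var(x, ddof=1)

slope_at_mean = 2 * 1.0                   # f'(E[X]) = 2 E[X]
third_central_moment = 2.0
var_x = 1.0
predicted = slope_at_mean + third_central_moment * 2.0 / (2 * var_x)   # f'' = 2 everywhere

print("OLS slope:", beta_ols, " predicted:", predicted, " f'(E[X]):", slope_at_mean)
```

Here the OLS slope is roughly twice the derivative at the mean, because X is strongly skewed; for a symmetric X the third central moment vanishes and, with a quadratic f, the two would coincide.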
Supplementary material (2) : Simultaneity bias and forecasting
▸ If we consider the following model

E[q∣p, Z] = α − βp + γZ
E[p∣q, W] = a + bq + dW

equation-by-equation OLS provides unbiased estimates of (α, β, γ, a, b, d).
▸ Clearly, if we set u = q − E[q∣p, Z] and v = p − E[p∣q, W] we go back to the "usual" SEM we described in the main part of the document. So acknowledging that u appears in the reduced form of p and v in that of q is not sufficient to prove that OLS is biased.
▸ If the disturbances u, v that affect "long term relationships" are uncorrelated, they cannot be considered as forecast errors for (q, p) (and the per-equation OLS does not provide a correct measure of the parameter of interest). Conversely, if u, v are forecast errors, they must be correlated (conditionally on Z, W) as soon as q and p are.
▸ This is true even for a simple centered bivariate Gaussian vector (X1, X2) with correlated components. If one decomposes

X1 = b1 X2 + U1
X2 = b2 X1 + U2

then either E[Xi∣Xj] ≠ bi Xj or Cov[U1, U2] ≠ 0.
▸ Hence forecasting and counterfactual analyses are different exercises.
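The bivariate Gaussian claim can be verified in two lines. The derivation below is a sketch in which the symbols σ₁², σ₂², c and r (variances, covariance and correlation of X1 and X2) are notation introduced here; b1 and b2 are taken to be the forecasting coefficients, so that E[Xi∣Xj] = bi Xj.

```latex
% Notation (introduced for this sketch): \sigma_i^2 = Var[X_i], c = Cov[X_1, X_2] \neq 0,
% r = c / (\sigma_1 \sigma_2). Forecasting coefficients: b_1 = c / \sigma_2^2, b_2 = c / \sigma_1^2.
\begin{aligned}
\mathrm{Cov}[U_1, U_2]
  &= \mathrm{Cov}[X_1 - b_1 X_2,\; X_2 - b_2 X_1] \\
  &= c - b_2 \sigma_1^2 - b_1 \sigma_2^2 + b_1 b_2\, c \\
  &= c - c - c + \frac{c^3}{\sigma_1^2 \sigma_2^2} \\
  &= -c\,(1 - r^2)
\end{aligned}
```

Since c ≠ 0, this covariance vanishes only when ∣r∣ = 1: two genuine forecast errors are necessarily correlated unless X1 and X2 are perfectly correlated.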
Lesson 2 : Introduction to Time Series Modelling
Time series’ specificity

▸ Time series are random sequences indexed by time (denoted t throughout this presentation).
▸ They call for special modeling and inference strategies since
1. Contrary to individual data, the index cannot be modified (otherwise we lose track of what is past, present, and future).
2. New data come in a specific way: they belong to the future before realisation, to the present when they are realized and to the past after that.
3. Correlation often exhibits a particular shape: it is weaker when the difference between indices is large.
▸ Many data sets have a time dimension (market data, macro-economic data, unemployment data,...). But as this topic is also of importance far beyond economics, many useful tools have been developed in other disciplines (engineering, computer science,...).
Random (or stochastic) processes
▸ A stochastic process is a sequence of random variables (Y_t)_{t∈T} where T is some ordered set. In most applications we consider T = N or T = Z, but continuous time modeling requires T ⊂ R.
▸ Special caution must be taken to define the probabilistic space we are working with. From a practical viewpoint, it must change as the sample size increases (as is the case when we consider asymptotic properties of estimators), but forecasting exercises require a special structure so as to ensure that past values are used to forecast future ones and not the reverse. This delicate point (which becomes really tricky in continuous time) is beyond the scope of this course; interested (and courageous) readers can have a look at this course.
▸ In the following, unless otherwise mentioned, we consider square integrable processes (i.e. such that marginal moments of order 2 exist). In case the data require us to depart from this hypothesis (an important issue in macro-economics and finance), we shall consider transformations which bring us back to this safe haven.
▸ For square integrable processes, the definition of the conditional expectation given all past values may be done as in the usual case. Hence E[Y_t ∣ \underline{Y}_t], where \underline{Y}_t = {Y_s ∣ s < t} is the past of Y_t, has a proper definition.
▸ Notice the above expression directly raises problems. Indeed, E[Y∣X] is a function of X, but if T = Z the past of Y_t contains an infinite number of variables. Hence E[Y_t ∣ \underline{Y}_t] is a function of an infinite number of variables. Moreover, recall this function is defined as the solution of an optimization program. As a consequence, we must make sure that the functional space we are working with enjoys some minimal properties for this program to be properly defined. This is the type of "special caution" we referred to above... (more on this below).
Examples
▸ Y_t = 1, ∀t is a constant process.
▸ Y_t = sin(t), ∀t is a deterministic process since forecasting errors are zero at any time.
▸ If ϵ_1, ϵ_2, ... is an infinite sequence of iid centered variables, the process (ϵ_t)_{t∈N} is a strong white noise.⁶
▸ If ϵ_1, ϵ_2, ... is a sequence of centered, uncorrelated random variables with constant variance, the process (ϵ_t)_{t∈N} is a weak white noise.
▸ In the following every white noise process (either weak or strong depending on the context) will be denoted using the letter ϵ.
▸ If ρ ∈ ]−1, 1[ and (ϵ_t)_{t∈Z} is a strong (resp. weak) white noise, then we may define (Y_t)_{t∈Z} such that Y_t = ρY_{t−1} + ϵ_t, for all t. Such a process is a strong (resp. weak) order-1 autoregressive process, or AR(1).
▸ If θ ∈ R and (ϵ_t)_{t∈Z} is a strong (resp. weak) white noise, then we may define (Y_t)_{t∈Z} such that Y_t = ϵ_t + θϵ_{t−1}, for all t. Such a process is a strong (resp. weak) order-1 moving average process, or MA(1).

⁶ This denomination comes from signal theory, as our brain is unable to distinguish any "regularity" if we "play" this process. See below for further explanations.
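The last three examples are straightforward to simulate. The sketch below assumes Gaussian noise with unit variance and illustrative values ρ = 0.7 and θ = 0.5; the AR(1) path is started from its stationary distribution, which is a convenience of the example. It compares sample variances with (1 + θ²) for the MA(1) and 1/(1 − ρ²) for the AR(1).

```python
import numpy as np

rng = np.random.default_rng(7)
n, rho, theta = 200_000, 0.7, 0.5           # illustrative values
eps = rng.normal(size=n + 1)                # Gaussian strong white noise, Var = 1

# MA(1): Y_t = eps_t + theta * eps_{t-1}
ma1 = eps[1:] + theta * eps[:-1]

# AR(1): Y_t = rho * Y_{t-1} + eps_t, started from the stationary distribution
ar1 = np.empty(n)
ar1[0] = eps[0] / np.sqrt(1 - rho**2)
for t in range(1, n):
    ar1[t] = rho * ar1[t - 1] + eps[t + 1]

print("Var MA(1):", ma1.var(), " expected:", 1 + theta**2)
print("Var AR(1):", ar1.var(), " expected:", 1 / (1 - rho**2))
```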
Weak and strong stationarity

▸ If t → E(Y_t) is a totally unspecified function we have only one observation of each variable, and clearly statistics is useless. Some regularity is needed for inference to be of any use.
▸ A stochastic process (Y_t)_{t∈Z} is strongly stationary if for all n ∈ N and any sequence t_1, ..., t_n we have

∀ h ∈ Z, L(Y_{t_1}, ..., Y_{t_n}) = L(Y_{h+t_1}, ..., Y_{h+t_n})

where L(·) denotes the joint distribution.
▸ In a literal sense, the above condition ensures that the probabilistic properties of such a process are time invariant.
▸ A stochastic process (Y_t)_{t∈Z} is weakly stationary if

E[Y_t] = m ∀t ∈ Z
E[Y_t²] = σ² ∀t ∈ Z
Cov[Y_t, Y_{t+h}] = γ(h) ∀t ∈ Z

▸ For any given weakly stationary process we may define the autocovariance function h → γ(h).
Remarks

▸ Notice that strong stationarity does not imply the existence of second order moments. Hence, strong stationarity implies weak stationarity only within the set of square integrable processes.
▸ Strong (resp. weak) white noise processes are strongly (resp. weakly) stationary.
▸ Deterministic stationary processes are constant processes (show it).
▸ Any square integrable MA(1) process is weakly stationary (show it).
▸ Any strong MA(1) process is strongly stationary (show it).
▸ If (ϵ_t)_{t∈Z} is a strong white noise such that Var[ϵ_1] > 0, then the process such that X_0 = 0 and X_t = X_{t−1} + ϵ_t for all t > 0 is a random walk. Random walks are non-stationary processes (show it when E[ϵ_1] ≠ 0 and also when E[ϵ_1] = 0).
▸ Non-stationary processes are often met in economics and finance. They are closely related to growth, inflation, saving behaviors... As a consequence, macro and financial data sets are often corrected to ensure stationarity before being used in econometric models. Such pre-treatments include discounting, de-seasonalization, filtering, etc.
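The non-stationarity of the random walk in the centered case shows up in its variance, which grows linearly with t. A minimal sketch, assuming Gaussian increments with unit variance and simulating many independent paths to estimate Var[X_t] at each date:

```python
import numpy as np

rng = np.random.default_rng(8)
n_paths, horizon = 20_000, 200
eps = rng.normal(size=(n_paths, horizon))   # centered strong white noise, Var = 1
walks = np.cumsum(eps, axis=1)              # column t-1 holds X_t = eps_1 + ... + eps_t

var_by_date = walks.var(axis=0)             # cross-sectional variance at t = 1, ..., horizon

for t in (10, 50, 200):
    print(f"t = {t:3d}  estimated Var[X_t] = {var_by_date[t - 1]:.1f}")
```

Since E[X_t²] depends on t, the second condition of weak stationarity fails even though E[X_t] = 0 for all t.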
MA(∞) processes

▸ (Refresher) A real sequence (a_n)_{n∈Z} is associated with an L2 consistent series if and only if the following limit exists:

\[ \lim_{k,l \to +\infty} \sum_{n=-l}^{k} a_n^2 \]

▸ If (Y_t)_{t∈Z} is a square integrable process such that (a_n² Var[Y_n])_{n∈Z} is L2 consistent then

\[ \lim_{k,l \to +\infty} \sum_{n=-l}^{k} a_n Y_n \]

almost surely converges towards the random variable

\[ \sum_{n=-\infty}^{+\infty} a_n Y_n \]

▸ If (Y_t)_{t∈Z} is a weakly stationary process and (a_n)_{n∈Z} is associated with an L2 consistent series then
1. Z_t = Σ_{n=−∞}^{+∞} a_n Y_{t−n} is a.s. defined for all t
2. (Z_t)_{t∈Z} is a weakly stationary process
3. E[Z_1] = E[Y_1] Σ_{n∈Z} a_n and Cov[Z_0, Z_h] = Σ_{k,l} a_k a_l Cov[Y_0, Y_{h+k−l}]
4. The process (Z_t)_{t∈Z} is said to be the MA-transform of (Y_t)_{t∈Z} by the sequence of weights (a_n)_{n∈Z}.
Innovations

▸ For any square integrable process (Y_t)_{t∈Z} one can define E[Y_t ∣ \underline{Y}_t], the conditional expectation of Y_t given its past \underline{Y}_t = {Y_s ∣ s < t}.
▸ The process Y_t − E[Y_t ∣ \underline{Y}_t] is the innovation process associated with (or simply "of") (Y_t)_{t∈Z}.
▸ The innovation process associated with any strongly stationary, square integrable process is a strong white noise (we admit this property).
▸ Linear innovations are defined accordingly, with linear conditional expectations instead of conditional expectations.
▸ The linear innovation process associated with any weakly stationary process is a weak white noise (we admit this property).
▸ Beware: innovation processes are mainly theoretical objects, since we cannot run the regression as the complete set of past data is not available.
▸ If Y_t = ρY_{t−1} + ϵ_t is a square integrable stationary AR(1) process, then ϵ_t is the innovation process of Y_t.
Wold’s decomposition

▸ If (Y_t)_{t∈Z} is a weakly stationary process and (ϵ^Y_t)_{t∈Z} is its associated linear innovation process, then there exist a deterministic process m_t and an L2 consistent sequence of reals (a_n)_{n∈N} such that

\[ Y_t = m_t + \sum_{k=0}^{+\infty} a_k \epsilon^Y_{t-k} \]

▸ This property implies that we can "view" the "random part" of any weakly stationary process as an MA sequence of past "shocks". It is the main justification for modeling time series by means of ARMA models.
▸ In practice this property is difficult to use directly since (a_n)_{n∈N} is an infinite unknown sequence, but many proofs are based on this decomposition. Mathematically, it shows that the set of weakly stationary processes is the same as that of scalar products of L2 consistent sequences of reals with weak white noise processes.
Autocorrelation and Moving Average processes
On considère ici un processus stationnaire au second ordre de variance non nulle. Let
us consider weakly stationary process with strictly positive variance.
▸ We recall the autocovariance function is defined by
γ(h) = Cov [Y0 , Yh ]
▸ The autocorrelation function is defined by ρ(h) = γ(h)/γ(0).
▸ By the Schwarz inequality we have ρ(h) ∈ [−1, 1] for all h.
▸ If (ϵt )t∈Z is a strong (resp. weak) white noise and (θ1 , . . . , θq ) is a
vector of size q such that θq ≠ 0, then the process defined by
$$Y_t = \epsilon_t + \sum_{k=1}^{q} \theta_k \epsilon_{t-k}$$
for all t is a strong (resp. weak) order-q moving average process (or
simply an MA(q) process).
▸ A weakly stationary process Yt admits an MA(q) representation whenever
there exist a vector (θ1 , . . . , θq ) of size q and a white noise process such
that the above formula holds.
▸ The autocovariance function of an MA(q) process is such that
γ(h) = 0 (hence ρ(h) = 0) whenever h > q.
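The cut-off of the autocorrelation function after lag q can be checked numerically. The sketch below is only an illustration: the function names `ma_acf` and `sample_acf`, the MA(2) coefficients and the sample size are assumptions made for this example, not part of the course material.

```python
import numpy as np

def ma_acf(theta, max_lag):
    """Theoretical autocorrelations rho(0..max_lag) of an MA(q) process."""
    coeffs = np.concatenate(([1.0], np.asarray(theta, dtype=float)))  # theta_0 = 1
    q = len(coeffs) - 1
    # autocovariances up to the factor sigma^2; zero beyond lag q
    gamma = np.array([
        coeffs[: q + 1 - h] @ coeffs[h:] if h <= q else 0.0
        for h in range(max_lag + 1)
    ])
    return gamma / gamma[0]

def sample_acf(y, max_lag):
    """Empirical autocorrelations of a series for lags 0..max_lag."""
    y = np.asarray(y, dtype=float) - np.mean(y)
    denom = y @ y
    return np.array([y[h:] @ y[: len(y) - h] / denom for h in range(max_lag + 1)])

rng = np.random.default_rng(0)
theta = [0.5, -0.4]                      # illustrative MA(2) coefficients
eps = rng.standard_normal(20_002)        # Gaussian strong white noise
y = eps[2:] + theta[0] * eps[1:-1] + theta[1] * eps[:-2]

print(np.round(ma_acf(theta, 4), 3))     # theoretical ACF, zero after lag 2
print(np.round(sample_acf(y, 4), 3))     # empirical ACF, close to zero after lag 2
```

The empirical values at lags 3 and 4 are not exactly zero, which is why confidence bands are needed in practice.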
Partial autocorrelation and Autoregressive processes
▸ If (ϵt )t∈Z is a white noise and 0 < ∣ρ∣ < 1 we already considered the order–1
autoregressive process Yt = ρYt−1 + ϵt .
▸ Remark, contrary to the MA processes the process Yt is implicitly defined as a
solution of a recursive equation. We must then show that this equation admits
solutions. We can write
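$$Y_t = \rho(\rho Y_{t-2} + \epsilon_{t-1}) + \epsilon_t = \rho^2 Y_{t-2} + \rho \epsilon_{t-1} + \epsilon_t = \rho^2(\rho Y_{t-3} + \epsilon_{t-2}) + \rho \epsilon_{t-1} + \epsilon_t$$

▸ As ∣ρ∣ < 1 the following sum exists (see footnote 7)
$$Y_t = \sum_{i=0}^{+\infty} \rho^i \epsilon_{t-i}$$

▸ Hence Yt depends on all past shocks and
$$\rho(h) = \rho^h$$
In particular, the autocorrelation function vanishes exponentially fast, but it is
never zero.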
▸ Now, in the linear regression of Yt on its own past, the first coefficient equals ρ
whereas all others are zero. This justifies the following definition:
▸ Let (Yt )t∈Z be a weakly stationary process and r (p) the slope associated with
Y−p in the linear regression of Y0 on Y−1 , . . . , Y−p . Then p → r (p) is the partial
autocorrelation function of the process (Yt )t∈Z .
7 More precisely, the sequence (∑_{i=0}^{T} ρ^i ϵt−i )T ∈N admits a limit in the L2 sense if (ϵt )t∈Z is a weak white noise.
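The definition of the partial autocorrelation function can be turned into a small computation: r(p) is the last coefficient of the regression of Y0 on Y−1, ..., Y−p, which only depends on the autocorrelations. The following sketch (the helper name `pacf_from_acf` and the value 0.7 are assumptions for the illustration) checks that, for an AR(1), the PACF equals ρ at lag 1 and zero afterwards.

```python
import numpy as np

def pacf_from_acf(rho, max_lag):
    """Partial autocorrelations r(1..max_lag) from autocorrelations rho[0..max_lag].

    For each p, the regression coefficients of Y_0 on Y_{-1}, ..., Y_{-p}
    solve the p x p linear system R a = (rho(1), ..., rho(p)), where
    R[i, j] = rho(|i - j|); r(p) is the last component of a.
    """
    rho = np.asarray(rho, dtype=float)
    out = []
    for p in range(1, max_lag + 1):
        R = np.array([[rho[abs(i - j)] for j in range(p)] for i in range(p)])
        a = np.linalg.solve(R, rho[1 : p + 1])
        out.append(a[-1])
    return np.array(out)

rho_param = 0.7                          # illustrative AR(1) coefficient
acf = rho_param ** np.arange(6)          # rho(h) = rho^h for h = 0..5
pacf = pacf_from_acf(acf, 5)
print(np.round(pacf, 10))                # first value 0.7, the others zero up to rounding
```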
AR(p) Process and lag operator
Let us consider (ϵt )t∈Z a strong (resp. weak) white noise.
▸ Similarly to what we have done for MA(q) processes, we may
consider a process associated with the real vector (ϕ1 , . . . , ϕp ) (where
ϕp ≠ 0) such that
$$Y_t = \epsilon_t + \sum_{k=1}^{p} \phi_k Y_{t-k}$$
▸ Now, as in the AR(1) case, the above equation defines Yt in an
implicit way. More precisely, it is not clear that the above equation
admits a solution. If this is the case, the Yt process is a strong (resp.
weak) order–p autoregressive process.
▸ Let us denote by L the operator such that LYt = Yt−1 . We have
L^k Yt = Yt−k for all k ≥ 0 (in particular L^0 is the identity operator).
▸ If we consider the polynomial Φ(x) = ∑_{k=1}^{p} ϕk x^k we may write in a
more compact way
$$Y_t = \epsilon_t + \sum_{k=1}^{p} \phi_k Y_{t-k} = \epsilon_t + \Phi(L) Y_t$$

▸ As we have (L^0 − Φ(L))Yt = ϵt , the above equation admits a solution
in the L2 sense if and only if the operator (L^0 − Φ(L)) is invertible.


Invertibility condition

▸ In the AR(1) case, we have L^0 − Φ(L) = L^0 − ρL and inversion is
possible as soon as ∣ρ∣ < 1 since we can write
$$\frac{1}{1-\rho} = \sum_{k=0}^{+\infty} \rho^k$$
(the same expansion holds with ρ replaced by ρL, which gives back the
infinite moving average form of the AR(1) process).
▸ In the general case, thanks to the Fundamental Theorem of Algebra, we
know that 1 − Φ(x) admits p (possibly complex) roots x1 , . . . , xp . As
1 − Φ(0) = 1 none of these roots is zero. Hence, writing
$$1 - \Phi(x) = -\phi_p \prod_{k=1}^{p} (x - x_k) = -\phi_p \left( \prod_{k=1}^{p} (-x_k) \right) \prod_{k=1}^{p} \left(1 - \frac{x}{x_k}\right) = \alpha \prod_{k=1}^{p} (1 - \lambda_k x)$$
we face, for each factor 1 − λk x, the invertibility condition of the AR(1)
case (in the complex plane, if needed).
▸ As we already showed, invertibility is guaranteed if and only if
∣λk ∣ < 1 ∀ k, which means that all the roots of the polynomial 1 − Φ(x)
lie outside the unit circle in the complex plane (beware!
λk = 1/xk ).
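The root condition is easy to check numerically. With the convention Yt = ϵt + ϕ1 Yt−1 + ... + ϕp Yt−p, the relevant polynomial is 1 − ϕ1 x − ... − ϕp x^p. The sketch below uses `numpy.roots`; the helper names `ar_roots` and `is_stationary` and the coefficient values are assumptions made for the example.

```python
import numpy as np

def ar_roots(phi):
    """Roots of the polynomial 1 - phi_1 x - ... - phi_p x^p."""
    phi = np.asarray(phi, dtype=float)
    # np.roots expects coefficients ordered from the highest degree down to the constant
    coeffs = np.concatenate((-phi[::-1], [1.0]))
    return np.roots(coeffs)

def is_stationary(phi, tol=1e-8):
    """True when every root lies strictly outside the unit circle."""
    return bool(np.all(np.abs(ar_roots(phi)) > 1.0 + tol))

print(is_stationary([0.5]))         # AR(1) with rho = 0.5: root x = 2
print(is_stationary([1.0]))         # random walk: root x = 1, on the unit circle
print(ar_roots([1.2, -0.5]))        # complex conjugate roots with modulus sqrt(2)
```

The last example has complex roots outside the unit circle, the situation associated with damped cyclical behavior.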
Some remarks

▸ Let us consider (ϵt )t∈Z a weak white noise and the random walk

Yt = Yt−1 + ϵt
▸ In this case, we have 1 − Φ(x) = 1 − x. The unique root is 1 and its
modulus is 1. As a consequence, there is no stationary solution in the
L2 sense.
▸ Indeed, a stationary solution would have a constant variance. Now,
as Yt−1 and ϵt are uncorrelated,
Var [Yt ] = Var [Yt−1 ] + Var [ϵt ]
and the equation x = x + Var [ϵt ] admits no solution if Var [ϵt ] > 0.
▸ The larger the modulus of the roots of the polynomial 1 − Φ(x), the faster
the damping of past shocks (as we can easily see by simulation).
▸ Moreover, the fact that some roots admit a non-zero imaginary part
entails cyclical behaviors.
ARMA Modeling and unit Root issues
ARMA models
▸ Using a similar trick as in the AR case, notice we may write any MA process as

Yt = θ(L)ϵt

(with straightforward notations)

▸ MA and AR parts can be combined to obtain ARMA processes

ϕ(L)Yt = θ(L)ϵt

where ϕ() is a polynomial such that ϕ(0) = 1 and whose roots are outside the unit
circle and θ() is a polynomial such that θ(0) = 1.
▸ For such a process, the partial autocorrelation function vanishes exponentially fast
for values larger than the degree of ϕ() and the autocorrelation function vanishes
exponentially fast for values larger than the degree of θ().
▸ These processes have become increasingly popular after the seminal work by Box
and Jenkins. In particular, ARMA processes often show up as solutions of optimal
responses to shocks in macro-economic models. The AR part is mostly driven by
the law of motion of capital (together with time–to–build assumptions) whereas
the MA part results from smoothing effects due to risk aversion and discounting. In
the simplest DSGE case the aggregate quantities admit an ARMA(1,1)
representation. This is rather at odds with empirical findings, since the impact of
shocks does not vanish exponentially fast. As a consequence, most DSGE models
include AR(1)-type shocks, with values of ρ smaller than but close to 1. This
also entails that the degree of the AR polynomial is strictly larger than 1, allowing for
complex roots and cyclical features (a key issue for endogenous cycles theory).
Estimation in ARMA processes
▸ Consider an ARMA(1,1) process

Yt = ϕ1 Yt−1 + ϵt + θ1 ϵt−1

If we estimate ϕ1 by running the regression of Yt on Yt−1 the result will be biased
since Cov [Yt−1 , ϵt−1 ] ≠ 0. This approach is only feasible for pure AR(p)
processes.
▸ Instrumental variables are however a possibility: for instance we can use Yt−2 since
Cov [Yt−2 , Yt−1 ] ≠ 0 and Cov [Yt−2 , ϵt + θ1 ϵt−1 ] = 0. In the general case, Yt−q−1 is a
valid instrument for any ARMA(p,q) process.
▸ There are however two problems with this approach: it is inefficient and it cannot
be used for the coefficients associated with the MA part.
▸ Consider a pure MA(q)

Yt = ϵt + θ1 ϵt−1 + . . . + θq ϵt−q

Clearly, regression cannot be performed as shocks are not directly observed.
However we can exploit the fact that the autocovariance function is easily computed
as a function of θ()
$$\gamma(h) = \sigma^2 \sum_{0 \le j \le q-h} \theta_j \theta_{j+h}$$

(using the conventions ∑j∈∅ = 0 and θ0 = 1, and writing σ² for the variance of ϵt ),
so that ρ(h) = γ(h)/γ(0). This function can be estimated using
the empirical correlations and θ1 , . . . , θq may be estimated by inversion of the
formula.
Estimation in ARMA processes –cont’–
▸ This approach is however not easy to implement. Consider an MA(3)
process with σ² normalized to 1; we have
γ(1) = θ1 + θ1 θ2 + θ2 θ3
γ(2) = θ2 + θ1 θ3
γ(3) = θ3
Hence
θ3 = γ(3)
θ2 = γ(2) − θ1 γ(3)
and
γ(1) = θ1 + (θ1 + γ(3))(γ(2) − θ1 γ(3))
More generally, as the link between γ(.) and θ(.) is nonlinear,
inverting the system is not a trivial task.
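The simplest case already shows why the inversion is delicate. For an MA(1), ρ(1) = θ1/(1 + θ1²) is a quadratic equation in θ1 with two solutions, θ1 and 1/θ1; the sketch below keeps the one with modulus at most 1. The function name `ma1_theta_from_rho1` is an assumption made for this illustration.

```python
import numpy as np

def ma1_theta_from_rho1(rho1):
    """Solution of rho1 = theta / (1 + theta^2) with |theta| <= 1."""
    if abs(rho1) > 0.5:
        # no real theta can produce a first autocorrelation larger than 1/2
        raise ValueError("an MA(1) process requires |rho(1)| <= 1/2")
    if rho1 == 0.0:
        return 0.0
    return (1.0 - np.sqrt(1.0 - 4.0 * rho1**2)) / (2.0 * rho1)

theta_hat = ma1_theta_from_rho1(0.4)
print(theta_hat)                        # root inside the unit interval
print(1.0 / theta_hat)                  # the other root gives the same rho(1)
```

In practice ρ(1) is replaced by its empirical counterpart, and nothing guarantees that the estimate falls in [−1/2, 1/2].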
▸ This method can also be applied to pure AR processes. The values of the
autocorrelation function are directly related to the coefficients of the
Φ() polynomial. In this case, one can express this relationship as a
recursive sequence of linear systems (these are the famous
Yule-Walker equations).
▸ If both AR and MA coefficients are present, inverting the link between
the ACF/PACF functions and the coefficients of Φ() and Θ() leads to
very complex systems even when the degrees are moderate.
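For a pure AR(p), the Yule-Walker system is linear in the coefficients and can be solved directly from the empirical autocovariances. The following is a minimal sketch under stated assumptions: the function name `yule_walker`, the AR(2) coefficients, the seed and the sample size are all chosen for the illustration.

```python
import numpy as np

def yule_walker(y, p):
    """Estimate AR(p) coefficients by solving the Yule-Walker equations."""
    y = np.asarray(y, dtype=float) - np.mean(y)
    n = len(y)
    # empirical autocovariances gamma(0), ..., gamma(p)
    gamma = np.array([y[h:] @ y[: n - h] / n for h in range(p + 1)])
    R = np.array([[gamma[abs(i - j)] for j in range(p)] for i in range(p)])
    phi = np.linalg.solve(R, gamma[1:])
    sigma2 = gamma[0] - phi @ gamma[1:]      # implied variance of the shocks
    return phi, sigma2

rng = np.random.default_rng(1)
true_phi = np.array([1.2, -0.5])             # illustrative stationary AR(2)
n, burn = 20_000, 500
eps = rng.standard_normal(n + burn)
y = np.zeros(n + burn)
for t in range(2, n + burn):
    y[t] = true_phi[0] * y[t - 1] + true_phi[1] * y[t - 2] + eps[t]

phi_hat, sigma2_hat = yule_walker(y[burn:], 2)
print(np.round(phi_hat, 3), round(sigma2_hat, 3))
```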
Estimation in ARMA processes –finish–
▸ The previous methods are, in general, not asymptotically efficient.
They are then used as a first step in more elaborate numerical methods.
The current first-best approach is to compute a pseudo-likelihood, as
if the shocks were a strong, Gaussian white noise.
▸ To this end, we must write the log-likelihood of the available sample.
This can be done by inverting the relationship which defines the
ARMA process in order to compute the shocks as functions of Yt and
the parameters.
$$\epsilon_t = Y_t - \Phi(L) Y_t - \Theta(L) \epsilon_t$$
▸ The previous equation defines the shocks in a recursive way, once
Y0 , . . . , Y−p , and ϵ0 , . . . , ϵ−q are known. As they are not, two
simplifications can be proposed
● Limited Information Likelihood: only the final part of the likelihood is computed
(ϵp , . . . , ϵT ) and the unknown shocks are set to zero (their unconditional mean
value)
● Full Information: missing parts are replaced by their best forecasts as functions of
known values and parameters.
▸ As the first technique is easier to implement, it is often used as a
first step before, possibly, implementing the second one, which is
more demanding.
▸ Both techniques require specific optimization routines and
convergence is often slow, especially if the degrees of the polynomials are
large.
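The recursive extraction of the shocks can be sketched for a zero-mean ARMA(1,1) as follows. This is only an illustration of the Limited Information idea (the initial shock is set to zero); the function names, the parameter values and the simulated data are assumptions, and a real estimation would maximize the criterion with a numerical optimizer.

```python
import numpy as np

def arma11_residuals(y, phi, theta):
    """Shocks implied by Y_t = phi*Y_{t-1} + eps_t + theta*eps_{t-1} for t >= 1,
    with the unobserved initial shock eps_0 set to zero."""
    y = np.asarray(y, dtype=float)
    eps = np.zeros(len(y))
    for t in range(1, len(y)):
        eps[t] = y[t] - phi * y[t - 1] - theta * eps[t - 1]
    return eps[1:]

def conditional_loglik(y, phi, theta, sigma2):
    """Gaussian log-likelihood of the recursively extracted shocks."""
    e = arma11_residuals(y, phi, theta)
    return -0.5 * len(e) * np.log(2 * np.pi * sigma2) - 0.5 * (e @ e) / sigma2

# Illustrative data: ARMA(1,1) started at y_0 = 0 with eps_0 = 0
rng = np.random.default_rng(6)
phi0, theta0, T = 0.6, 0.3, 2000
shocks = rng.standard_normal(T + 1)
shocks[0] = 0.0
y = np.zeros(T + 1)
for t in range(1, T + 1):
    y[t] = phi0 * y[t - 1] + shocks[t] + theta0 * shocks[t - 1]

print(conditional_loglik(y, phi0, theta0, 1.0))   # at the true parameters
print(conditional_loglik(y, 0.0, 0.0, 1.0))       # ignoring the dynamics
```

At the true parameters the recursion recovers the simulated shocks exactly, because the simulation starts from the same initial conditions as the recursion.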
Estimators’ properties
▸ If the associated white noise (ϵt )t∈Z admits a fourth-order moment and
the Φ() polynomial has its roots outside the unit circle, the previous
estimators are consistent at speed T^{−1/2} and asymptotically Gaussian.
▸ Numerically, reliable estimators for the MA part often require that the Θ()
polynomial has all of its roots outside the unit circle.
▸ Consistency of the variance-covariance matrix of the estimators
requires the existence of moments of higher orders (6 to 8 depending
on the technique used).
▸ The computation of the estimators, standard errors, tests, and the
like requires specific numerical optimization programs. These are well
documented in specialized software such as R, GRETL and SAS. Python and C
routines also exist; their performance can vary quite a lot with the
first-step parameters.
▸ The quality of the estimation typically decreases rapidly when roots
are close to unit modulus.
▸ Finally, it is of course vital not to set the degrees of the
polynomials Θ() and Φ() too low, otherwise consistency is lost. As these are
often unknown, we need a way to get them from the data. This is the
so-called “identification” problem. (The term is misleading since p
and q are identifiable, in the statistical sense).
Identification in ARMA models

▸ The parameter set obviously depends on the degrees of the Φ() and
Θ(.) polynomials. Let p be the degree we choose for the former and
q for the latter. The problem is a double-edged one. If p and/or q is
too small, then some parameters are wrongly set to zero and no
estimation method can achieve consistency. Now if they are set at
too large a value, efficiency decreases with respect to the case where the extra
parameters are correctly neglected. This last case (called overfitting) is
illustrated below
▸ Overfitting: white noise fitted with an ARMA(3,3) model
         Coefficient   Std. err.    z         p-value
const     0.0741117    0.0396156    1.8708    0.0614
ϕ1       −0.297077     0.235598    −1.2609    0.2073
ϕ2       −0.676338     0.0940068   −7.1946    0.0000
ϕ3       −0.590732     0.229936    −2.5691    0.0102
θ1        0.316387     0.220251     1.4365    0.1509
θ2        0.599366     0.104415     5.7403    0.0000
θ3        0.647322     0.207655     3.1173    0.0018
Identification –cont’–
▸ The most common method consists in looking at the estimated ACF and PACF. We
know that for a pure AR(p) process the PACF is strictly zero for any h > p and
for a pure MA(q) process the ACF is strictly zero for any h > q.
▸ Hence if the ACF (resp. PACF) abruptly vanishes, then p = 0 (resp. q = 0) and q
(resp. p) equals the last value where the ACF (resp. PACF) is nonzero.

▸ In the general case, the PACF and ACF can display many different forms for low
values of h, but when h > max{p, q} both the ACF and the PACF converge
exponentially fast towards zero.
▸ The idea is then to detect a value r after which both ACF and PACF decrease
and try various combinations of integers p, q strictly smaller than r . The selected
model minimizes a criterion that penalizes both overfitted models and the estimated
standard error of the shocks. The two most popular criteria are
$$\mathrm{AIC}: \log(\hat\sigma^2) + 2\,\frac{p+q}{T} \qquad \text{and} \qquad \mathrm{BIC}: \log(\hat\sigma^2) + (p+q)\,\frac{\log(T)}{T}$$
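Both criteria are straightforward to compute once the variance of the extracted shocks is available. In the sketch below the function name `information_criteria` is an assumption, and the residual variances attached to each candidate (p, q) are hypothetical numbers invented for the illustration, not estimation results.

```python
import numpy as np

def information_criteria(sigma2_hat, p, q, T):
    """AIC and BIC as functions of the estimated shock variance and the orders."""
    aic = np.log(sigma2_hat) + 2.0 * (p + q) / T
    bic = np.log(sigma2_hat) + (p + q) * np.log(T) / T
    return aic, bic

T = 200
# hypothetical residual variances for a few candidate orders (p, q)
candidates = {(1, 0): 1.10, (1, 1): 1.02, (2, 2): 1.01, (3, 3): 1.00}
for (p, q), s2 in candidates.items():
    aic, bic = information_criteria(s2, p, q, T)
    print((p, q), round(aic, 4), round(bic, 4))

best_bic = min(candidates, key=lambda pq: information_criteria(candidates[pq], *pq, T)[1])
print(best_bic)
```

As soon as log(T) > 2, the BIC penalty per parameter is heavier than the AIC one, so BIC tends to select smaller models.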
Validation
▸ There are two major problems with the above approach. First, even in the “pure”
cases, as we only have access to estimated ACF and PACF, the uncertainty
can result in over– or under–fitting. Most specialized software then provides
confidence bounds to check for non-significant values. Second, detecting
exponential decay is not an easy task. Identification and estimation steps are then
completed with a final check called validation.
▸ The idea is to exploit the extracted shocks that are computed in the Full or Limited
Information Maximisation. If the model is correctly specified, they should
approximately behave as a white noise. Hence, the PACF and ACF of the extracted
shock sequence should be zero. This is often checked by the Ljung-Box statistic,
whose null hypothesis is ρ(j) = 0 for 0 < j ≤ h. More precisely, the rule is to reject
H0 (hence the entire approach) at level α whenever
$$T(T+2) \sum_{j=1}^{h} \frac{\hat\rho(j)^2}{T-j} > \chi^2_{1-\alpha}(h)$$
where ρ̂(j) is the empirical autocorrelation of order j of the extracted shocks.
This procedure is known as the portmanteau test.
▸ The portmanteau test is often completed by two other procedures. The first one
aims at detecting heteroskedasticity (recall indeed that the pseudo ML approach rests
on an iid assumption). This is particularly important for financial data since
heteroskedasticity has important consequences for pricing methods. The most
common procedure is the Breusch-Pagan test.
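▸ The second procedure is a stability test. It often rests on the so-called CUSUM
statistics
$$S_t = (T-k)\,\frac{\sum_{j=k+1}^{t} \hat\epsilon_j}{\sum_{j=k+1}^{t} \hat\epsilon_j^2}, \qquad t = k+1, \ldots, T$$
$$S'_t = \frac{\sum_{j=k+1}^{t} \hat\epsilon_j^2}{\sum_{j=k+1}^{T} \hat\epsilon_j^2}, \qquad t = k+1, \ldots, T$$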
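The portmanteau statistic used in the validation step can be computed directly from the extracted shocks. The sketch below is an assumption-laden illustration: the function name `ljung_box` is ours, the statistic uses squared empirical autocorrelations (the usual Ljung-Box form), and the comparison with a chi-square quantile is left to a table or a statistical library.

```python
import numpy as np

def ljung_box(resid, h):
    """Ljung-Box statistic T(T+2) * sum_{j=1..h} rho_hat(j)^2 / (T - j)."""
    e = np.asarray(resid, dtype=float)
    e = e - e.mean()
    T = len(e)
    denom = e @ e
    lags = np.arange(1, h + 1)
    rho = np.array([e[j:] @ e[: T - j] / denom for j in lags])
    return T * (T + 2) * np.sum(rho**2 / (T - lags))

rng = np.random.default_rng(8)
white = rng.standard_normal(500)          # behaves like a correctly specified model
alternating = np.tile([1.0, -1.0], 5)     # strongly autocorrelated toy sequence
print(ljung_box(white, 10))               # to be compared with 18.307 (chi2(10), 5% level)
print(ljung_box(alternating, 1))          # far above 3.841 (chi2(1), 5% level)
```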
Forecasting in ARMA models
▸ A key feature of the ARMA model is that forecasted values can easily be
computed, whatever the distribution of the underlying noise process.
▸ Moreover, the ML approach directly provides the forecasts since it is written as a
function of linear innovations.
▸ Specialized software then provides forecast values for any ARMA model, as well as
confidence intervals for these. Typically three types of forecasts for Ys are
distinguished
1. Ys is observed and it is used in the likelihood to provide estimations (in–sample forecast)
2. Ys is observed but it is NOT used in the likelihood to provide estimations
(out–of–sample forecast)
3. Ys is NOT observed (“pure” forecast)
▸ It is of particular importance to distinguish the first case from the last two. Indeed,
when Ys is used to provide estimation, the inference penalizes the forecasting error
associated with Ys , which is not the case in the last two cases. As a consequence,
the last two methods provide a more honest view of the actual forecasting
capabilities of the model. To put it the other way around, the in–sample forecast
errors are by construction smaller than their out–of–sample or pure equivalents.
▸ Forecasts, either out–of or in–sample, are mostly one-step-ahead ones. However,
in the most general case, the forecast horizon can be larger than 1. This causes no
specific problem in ARMA models, but the practical issue is that out–of–sample
and pure predictions converge exponentially fast towards the stationary value as
the horizon increases.
▸ This is due to the fact that the best prediction of a future white noise shock is
always zero. Hence, in particular if the horizon is larger than q, the MA part plays no
role, and the stationarity of the AR part implies geometric convergence towards
the mean value. This convergence is faster as the roots of Φ() have larger modulus.
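The geometric convergence of multi-step forecasts can be seen on a zero-mean ARMA(1,1). The sketch below assumes the parameters and the last extracted shock are known; the function name `arma11_forecast` and the numerical values are assumptions made for the illustration.

```python
import numpy as np

def arma11_forecast(y_last, eps_last, phi, theta, horizon):
    """Forecasts of Y_{T+1}, ..., Y_{T+horizon} for a zero-mean ARMA(1,1)."""
    out = np.empty(horizon)
    # horizon 1: the last shock is known, the next one is forecast by zero
    out[0] = phi * y_last + theta * eps_last
    for h in range(1, horizon):
        # beyond horizon q = 1 the MA part plays no role
        out[h] = phi * out[h - 1]
    return out

forecasts = arma11_forecast(y_last=2.0, eps_last=1.0, phi=0.5, theta=0.4, horizon=4)
print(forecasts)   # each value is phi times the previous one
```

With ϕ = 0.5 the forecast is halved at every additional step, so long-horizon forecasts carry almost no information beyond the stationary mean.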
Unit root issues

▸ As we have already seen, when Φ() has a root inside the unit circle, distant
past shocks tend to have more impact on current values than recent ones and,
ultimately, realizations of the process are unbounded. This is of course at odds
with most (if not all) economic phenomena. But some economic models (both in
finance and macroeconomics) are compatible with exact unit roots.
▸ Moreover, as shown by the seminal paper by Hall in the late 70’s, the predictions
obtained under the random walk hypothesis tend to provide better forecasts than
stationary ARMA adjustments. In particular, forecasting that yt+1 will be exactly
equal to yt is an almost unbeatable strategy (considering the fact that it is much easier
to compute than ARMA forecasts).
▸ There are many problems related to unit root issues.
● How to model such processes?
● Can we test for the presence of a unit root?
● How to perform estimation?
● What are the properties of such processes?
ARIMA models

▸ In theory, unit roots of Φ() could have complex parts. However, as the
coefficients of Φ() are real, this would imply an even number (that is, at least 2)
of such roots (more precisely any such root would come with its conjugate). Despite
intensive research efforts in this direction, evidence for complex unit roots is weak.
▸ As a consequence, many models rest on the following ARIMA case
$$(1-L)^d Y_t = \frac{\Theta(L)}{\Phi(L)} \epsilon_t$$

where Φ(.) is a polynomial whose roots lie outside the unit circle, and d ≥ 1 is an
integer (the order of integration). In this case, the only unit root is exactly 1.
▸ In most cases, d is set to one. Again, evidence for I(2) integrated processes is
weak. Remark however that if a flow variable has order of integration 1, the
corresponding stock has order of integration 2.
▸ The inference is conducted as in the ARMA case, after stationarization, that is to
say, considering the process (1 − L)^d Yt instead of Yt .
▸ More precisely, order–1 differences Yt − Yt−1 are considered if d = 1, order–2
differences if d = 2 (hence Yt − Yt−1 − (Yt−1 − Yt−2 )) and so on.
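Differencing of order d is available in NumPy through `numpy.diff`. The short sketch below uses an artificial series with a quadratic trend (an assumption for the example) to show that the second difference coincides with Yt − 2Yt−1 + Yt−2, and a simulated random walk to show that first differencing recovers the shocks.

```python
import numpy as np

t = np.arange(10, dtype=float)
y = 3.0 + 2.0 * t + 0.5 * t**2          # illustrative series with a quadratic trend
d1 = np.diff(y, n=1)                    # (1 - L) y_t
d2 = np.diff(y, n=2)                    # (1 - L)^2 y_t
print(d1)
print(d2)                               # constant once differenced twice

rng = np.random.default_rng(9)
shocks = rng.standard_normal(100)
random_walk = np.cumsum(shocks)         # Y_t = Y_{t-1} + eps_t started at zero
recovered = np.diff(random_walk)        # equals the shocks from the second one on
```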
Unit root testing
▸ The major difficulty with the previous approach is to know whether we do indeed
face a unit root. Most unit root tests concern the case d = 1 vs d = 0. These are also
known as (non)stationarity tests. This problem has probably resulted in more
papers in econometric journals than any previous one.
▸ A “direct” approach would be to consider the following model

E [Yt ∣Yt−1 ] = ρYt−1

and test for ρ = 1. Now if the null hypothesis is ρ = 1, Dickey and Fuller have
shown that the “natural” estimator
$$\frac{\widehat{Cov}_T[y, y_{-1}]}{\widehat{Var}_T[y]}$$
has a non-standard (that is to say not directly related to Gaussian) distribution.
As a consequence, the unit root hypothesis cannot be tested by Student-type
procedures.
▸ Moreover, they also show that the associated test procedure would have zero
power against alternatives in which the non-stationary behavior results from a
deterministic trend. They then proposed several testing procedures to account for
the presence or absence of deterministic trends such as linear or quadratic ones.
▸ Another approach is to test for stationarity. The most common procedure is that
proposed by Kwiatkowski, Phillips, Schmidt and Shin (the so-called KPSS test). Again,
this test requires specific tables and its power against specific
alternatives can be dramatically low.
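The non-standard distribution under the unit root hypothesis can be visualized by simulation. The sketch below is a Monte Carlo illustration under stated assumptions: Gaussian shocks, no constant and no trend in the regression, and a helper `df_tstat` written for this example. The simulated 5% quantile is typically close to −1.95, noticeably below the Gaussian value −1.645.

```python
import numpy as np

def df_tstat(y):
    """t-statistic of (rho - 1) in the regression of y_t on y_{t-1}, no constant."""
    y_lag, dy = y[:-1], np.diff(y)
    b = (y_lag @ dy) / (y_lag @ y_lag)              # estimate of rho - 1
    resid = dy - b * y_lag
    s2 = resid @ resid / (len(dy) - 1)              # residual variance
    return b / np.sqrt(s2 / (y_lag @ y_lag))

rng = np.random.default_rng(2)
T, n_rep = 200, 2000
stats = np.array([
    df_tstat(np.cumsum(rng.standard_normal(T)))     # random walk: the null is true
    for _ in range(n_rep)
])
q05 = np.quantile(stats, 0.05)
print(round(q05, 2), round(stats.mean(), 2))        # left-shifted, asymmetric distribution
```

Using the Gaussian critical value here would reject a true unit root more often than the nominal 5%.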
Lesson 3 : The Maximum likelihood principle
A basic example

Assume we have a Bernoulli iid sample of size n denoted (y1 , . . . , yn ). We would like to
know the probability P(Y1 = 1) = p. Assume moreover 0 < p < 1 (we shall consider
this assumption in more detail below). The probability of the event that is observed is
$$\prod_{i=1}^{n} p^{y_i} \times (1-p)^{1-y_i}$$

Considered as a function of 0 < p < 1, this quantity may be written as
$$L(p) = (1-p)^{n - \sum_{i=1}^{n} y_i} \times p^{\sum_{i=1}^{n} y_i}$$

It is a polynomial function of p of degree n. It has two roots, 1 and 0, with orders of
multiplicity n − ∑_{i=1}^{n} yi and ∑_{i=1}^{n} yi , so there is no other root. The derivative with respect to p is
$$\frac{\partial L}{\partial p} = L(p) \times \left( -\frac{n - \sum_{i=1}^{n} y_i}{1-p} + \frac{\sum_{i=1}^{n} y_i}{p} \right)$$

It admits a single root in (0, 1), p̂ = (1/n) ∑_{i=1}^{n} yi (the above expression seems to imply that 0 and 1
are roots of the above polynomial but it is not the case, check it). As a consequence,
since L(p) > 0 on (0, 1) and L vanishes at 0 and 1 (when 0 < ∑ yi < n), p̂ is the unique
maximum; indeed log L(p) is a concave function of p.
Finally notice that p̂ coincides with the empirical mean. Also notice that we could
consider 0 ≤ p ≤ 1. We obtain the same solution, except that the optimum is not
interior if y1 = y2 = . . . = yn (check this case).
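The closed-form solution can be compared with a brute-force maximization of the log-likelihood. The sample below is an arbitrary illustration and the function name `log_likelihood` is an assumption; the grid search stands in for a proper optimizer.

```python
import numpy as np

y = np.array([1, 0, 0, 1, 1, 0, 1, 1, 0, 1])       # illustrative Bernoulli sample

def log_likelihood(p, y):
    """log L(p) = (sum y_i) log p + (n - sum y_i) log(1 - p)."""
    s, n = y.sum(), len(y)
    return s * np.log(p) + (n - s) * np.log(1.0 - p)

grid = np.linspace(0.001, 0.999, 999)               # candidate values of p in (0, 1)
p_grid = grid[np.argmax(log_likelihood(grid, y))]
print(p_grid, y.mean())                             # grid maximizer vs empirical mean
```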
Maximum likelihood principle

The previous example is an instance of a much more general approach. The argument
runs as follows.
▸ The model is a set of probability distributions indexed by a parameter θ ∈ Θ.
(This indexation is considered bijective).8
▸ The observed event Z is the result of drawing by Nature according to one of
these probabilities corresponding to the unknown value θ0 .
▸ The model allows to compute the probability of the realization of Z as a function
of θ.
▸ The value θ is “more likely” than θ′ if the probability of the realization of Z
computed for θ is larger than this probability computed for θ′ .
▸ The decision between different values of θ must be conducted according to the
preference relationship induced by the above definition of “more likely”.
The Maximum Likelihood Principle is generally presented as an inference device, in the
sense that it allows one to go back from ’effects’ (the observed sample) to ’cause’ (the
value of the parameter). This point raises disputes among Bayesian and frequentist
statisticians. It has also been challenged on methodological grounds.
The main arguments in favor of the maximum likelihood principle are twofold: it is a
principle that may be applied whatever the model, and it leads to consistent point
estimators in a large variety of cases.

8
If we consider the mapping from Θ to the probability distribution space, the surjective nature poses no problem
as Θ may be enlarged if necessary. Injectivity is much more demanding as it is ultimately linked to identification
problems, see below for details.
Consistency of the MLE (very informal argument)

Let P, Q, µ be three probability measures such that P and Q are absolutely continuous
with respect to µ. The Kullback-Leibler divergence (hereafter KL divergence) of Q with respect to P is
$$KL(Q|P) = \int p \log\left(\frac{p}{q}\right) d\mu$$

(where p, q are the densities of P and Q with respect to µ.)

The Jensen inequality asserts that KL(Q∣P) ≥ 0 whereas we trivially have
KL(P∣P) = 0. Notice however that the KL divergence is not symmetric, thus it is not
a distance (moreover it may be shown that it does not satisfy the triangle
inequality).
The link between the Maximum Likelihood principle and the KL divergence is given by
the following (very informal) arguments
$$\begin{aligned}
\hat\theta &= \arg\max_{\theta\in\Theta} \prod_{i=1}^{n} p(Y_i;\theta) \\
&= \arg\max_{\theta\in\Theta} \sum_{i=1}^{n} \log p(Y_i;\theta) \\
&= \arg\min_{\theta\in\Theta} -\frac{1}{n} \sum_{i=1}^{n} \log p(Y_i;\theta) \\
&\simeq_{n\to+\infty} \arg\min_{\theta\in\Theta} E\left[\log \frac{p(Y;\theta_0)}{p(Y;\theta)}\right] \\
&= \arg\min_{\theta\in\Theta} KL(p(Y;\theta)|p(Y;\theta_0))
\end{aligned}$$
where the ≃n→+∞ step is an application of the Law of Large Numbers (adding the term
E [log p(Y ; θ0 )], which does not depend on θ, leaves the argmin unchanged).
A more formal discussion

The previous argument is very informal, notably because the appeal to the LLN is much
too loosely stated to be of any use.
First, we manipulate a function of θ. Hence, we need some kind of functional
equivalent of the LLN. Second, we also need to guarantee that the sequence of optima
(as n increases) does not diverge. This is often obtained (see below) by assuming Θ is
compact, but in practice it may not be the case. Finally, the limit criterion must exist,
which entails some kind of uniform integrability condition. For the sake of illustration,
the following result ensures strong consistency of the mle.

Proposition (Wald 1949) Let Θ be a compact space and consider a model {pθ ∣θ ∈ Θ}
dominated by a measure µ.
Let X1 , . . . , Xn be an iid sample from pθ0 . Assume

i) the mapping θ → KL(p(Y ; θ)∣p(Y ; θ0 )) is continuous.

ii) There exists a function K (.) such that E [K (X1 )] exists and
∀ x, θ, log(pθ0 (x)) − log(pθ (x)) < K (x)
Then the maximum likelihood estimator is consistent.

The initial result by Wald has been extended in several directions. Notably,
neither compactness nor (strict) uniform integrability is required. Note however that
(counter)examples in which the mle is not consistent exist (a famous example
provided by Ferguson (1982) involves a single, real parameter, but uniform
integrability fails).
Asymptotic efficiency of the MLE

Besides consistency, the MLE enjoys asymptotic efficiency in most “regular” cases.
This property relies on two main results.
First, according to the Cramér-Rao bound, the variance of any unbiased estimator is
bounded from below (an important result in its own right). This comes from
differentiating the unbiasedness relationship
$$E_{\theta_0}[\hat\theta] = \theta_0$$
which provides (denoting by S(.) = ∂/∂θ log(fθ )(.) the score function)
$$E_{\theta_0}[\hat\theta S(X)] = 1 \Leftrightarrow Cov_{\theta_0}[\hat\theta; S(X)] = 1$$
(where the last equivalence comes from the fact that E [S(X )] = 0, obtained by differentiating
∫ fθ (x) dµ(x) = 1 with respect to θ). Hence using Cov²(X , Y ) ≤ Var (X )Var (Y ) we obtain
$$Var_{\theta_0}[\hat\theta] \ge Var_{\theta_0}[S(X)]^{-1}$$
(the right hand side is the inverse of the Fisher Information).

Second, the equality Cov²(X , Y ) = Var (X )Var (Y ) arises if and only if X is a linear
(affine) function of Y . But the first order condition for the maximization of the log-likelihood
precisely asserts that the mle is asymptotically a linear function of the score.
Note however that this result requires the mle to be consistent and some regularity
conditions, notably the fact that the FOC is met (hence an interior solution).
Asymptotic distribution of the MLE
As a final property, the MLE in most “regular” cases is asymptotically Gaussian. For
the sake of simplicity, assume Θ is some interval of the real line, consider the average
sample log-likelihood
$$L_n(\theta) = \frac{1}{n} \sum_{i=1}^{n} \log f(x_i;\theta)$$
and assume it converges uniformly in θ to some criterion L∞ (θ) = Eθ0 [log f (X ; θ)].
We have already obtained that L∞ (θ) − L∞ (θ0 ) is minus a KL divergence, hence always
negative or zero (by the Jensen inequality), so that θ0 maximizes the limit criterion. We shall
assume that it is the unique value of Θ to do so (= identifiability) and that
$$L'_\infty(\theta_0) = 0$$
(that is we assume θ0 is an interior maximizer of the limit criterion and that the FOC
holds). Now
$$0 = L'_n(\hat\theta_n) = L'_n(\theta_0) + L''_n(\tilde\theta_n)(\hat\theta_n - \theta_0)$$
for some θ̃n ∈ [min(θ̂n ; θ0 ); max(θ̂n ; θ0 )]. Hence
$$\sqrt{n}(\hat\theta_n - \theta_0) = -\frac{\sqrt{n}\, L'_n(\theta_0)}{L''_n(\tilde\theta_n)}$$
By consistency of the mle the denominator converges a.s. to L″∞ (θ0 ), which is minus the
Fisher Information, whereas the numerator converges by the Central Limit Theorem to a centred
Gaussian distribution with asymptotic variance equal to Var [L′ (θ0 )], the variance of the
score of one observation computed at the true value. Using the equality
−L″∞ (θ0 ) = Var [L′ (θ0 )], we get the limit distribution
$$N\left(0, Var[L'(\theta_0)]^{-1}\right)$$
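The limit distribution can be checked by simulation in the Bernoulli example, where the MLE is the empirical mean and the Fisher Information of one observation is 1/(p(1 − p)). The values of p0, n and the number of replications below are assumptions chosen for the illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
p0, n, n_rep = 0.3, 400, 5000

# each replication: MLE of p from an iid Bernoulli sample of size n
p_hat = rng.binomial(n, p0, size=n_rep) / n
z = np.sqrt(n) * (p_hat - p0)                 # rescaled estimation error

fisher = 1.0 / (p0 * (1.0 - p0))              # Fisher Information of one observation
print(round(z.mean(), 3))                     # close to 0
print(round(z.var(), 3), round(1.0 / fisher, 3))   # empirical vs asymptotic variance
```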
Some important remarks
The Maximum Likelihood is a widespread method whose main advantages rest on
asymptotic arguments: consistency, asymptotic normality, asymptotic efficiency and
flexibility (it does not require linearity). It should also be clear that it has some
drawbacks, notably
▸ Good asymptotic properties arise in regular cases. In particular, the solution
must be interior and locally unique.
▸ It is a parametric method, since the ’correct’ family of distributions is assumed to
belong to some finite–dimensional space. If the set of distributions is misspecified,
consistency may be obtained, but efficiency is typically lost and the asymptotic
variance may be difficult to compute (since we lose the property
−L″(θ0 ) = Var [L′ (θ0 )])
▸ Actual computation may be cumbersome, either because the likelihood is very
difficult to compute and/or because maximization is a hard problem. These issues must
be overcome with special software (in particular in time series analysis).
▸ Small sample properties may be very poor and asymptotic approximations may be
very bad. This is in particular the case when the log-likelihood function is not
globally concave or when the Fisher Information is very small (identifiability is
crucial).
Some of these drawbacks may be overcome. For instance, pseudo-Likelihood
techniques (Gouriéroux, Monfort and Trognon 1982) and more recently Empirical
Likelihood (Owen 1998) provide ways to avoid exact knowledge of the distribution.
Note however that the asymptotic properties must be studied on a case by case basis. Also,
poor small sample behavior may be handled by means of the Bootstrap (at least in the
cross–section case).
First example Gaussian Linear model

This model is defined as follows
$$f(y_i|X_i) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(y_i - \theta' X_i)^2}{2\sigma^2}\right)$$

If we observe the sample (y1 , X1 , . . . , yn , Xn ), the First Order Conditions for a
maximum in θ of the log-likelihood are
$$0 = \sum_{i=1}^{n} (y_i - \theta' X_i) X_i$$

providing the famous solution θ̂n = (X ′ X )^{−1} X ′ Y (if X ′ X is not invertible, there are
multiple solutions). The second order condition is that X ′ X be positive definite.

Notice that the First Order Condition asserts that the vector whose components are
the forecasting errors yi − ŷi (where ŷi = θ̂n′ Xi ) is orthogonal to the columns of X .

Notice also that the Fisher Information is proportional to X ′ X (it equals X ′ X /σ², minus
the derivative of the score) and we
then obtain that the Cramér-Rao bound is reached in this case in finite samples.
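The first order conditions can be verified numerically: the ML solution is obtained by solving the normal equations, and the forecasting errors are orthogonal to the columns of X. The simulated design, the parameter values and the seed below are assumptions made for the illustration.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 200
X = np.column_stack((np.ones(n), rng.standard_normal((n, 2))))   # constant + 2 regressors
theta_true = np.array([1.0, 2.0, -0.5])
y = X @ theta_true + rng.standard_normal(n)                      # Gaussian errors

# ML / OLS estimator: solve the normal equations (X'X) theta = X'y
theta_hat = np.linalg.solve(X.T @ X, X.T @ y)
resid = y - X @ theta_hat
sigma2_hat = resid @ resid / n            # ML estimator of sigma^2 (divides by n)

print(np.round(theta_hat, 3))
print(np.round(X.T @ resid, 10))          # first order conditions: zero up to rounding
```

Solving the normal equations with `numpy.linalg.solve` avoids forming the inverse of X′X explicitly.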
Second example : Logit model
This model is defined as follows: Yi ∈ {0, 1} and
$$P(Y_i = 1|X_i) = \frac{1}{1 + \exp(-\theta' X_i)}$$

The log-likelihood is
$$-\sum_{i=1}^{n} \log(1 + \exp(\theta' X_i)) + \theta' \sum_{i=1}^{n} y_i X_i$$

The score function is
$$\sum_{i=1}^{n} \left(y_i - \frac{1}{1 + \exp(-\theta' X_i)}\right) X_i$$
(Exercise: derive the above formula)
Notice that the mle is such that the vector whose i-th component is
yi − 1/(1 + exp(−θ̂n′ Xi )) is orthogonal to the columns of the matrix X . Also notice that
1/(1 + exp(−θ̂n′ Xi )) coincides with
the expectation of yi if the true value was θ̂n . Hence for this value, the vector of
“prediction errors” is orthogonal to X .

Assume there exists a θ such that for all i we have θ′ Xi > 0 if and only if yi = 1
(complete separation case). Then the mle is not defined. Indeed, for any λ > 0 the vector
λθ also separates the sample, since λ × (θ′ Xi ) > 0 if and only if yi = 1. This amounts to
saying that when θ is scaled up in the direction where
complete separation occurs, the likelihood keeps increasing. Indeed, in this
direction, the predicted value approaches yi for all indexes.
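Since the logit score has no closed-form root, the mle is computed iteratively. The sketch below uses Newton-Raphson steps on simulated data without complete separation; the function name `logit_mle`, the number of iterations, the design and the true parameter are assumptions for the illustration, and the success probability is written 1/(1 + exp(−θ′Xi)).

```python
import numpy as np

def logit_mle(X, y, n_iter=25):
    """Newton-Raphson iterations on the logit log-likelihood, started at zero."""
    theta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        prob = 1.0 / (1.0 + np.exp(-X @ theta))              # P(Y_i = 1 | X_i)
        score = X.T @ (y - prob)                             # gradient of the log-likelihood
        hessian = -(X * (prob * (1.0 - prob))[:, None]).T @ X
        theta = theta - np.linalg.solve(hessian, score)
    return theta

rng = np.random.default_rng(5)
n = 1000
X = np.column_stack((np.ones(n), rng.standard_normal(n)))
theta_true = np.array([-0.5, 1.0])
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-X @ theta_true))).astype(float)

theta_hat = logit_mle(X, y)
score_at_mle = X.T @ (y - 1.0 / (1.0 + np.exp(-X @ theta_hat)))
print(np.round(theta_hat, 3))
print(np.round(score_at_mle, 8))      # "prediction errors" orthogonal to the columns of X
```

Under complete separation the same iterations would not converge: the norm of θ would keep growing, in line with the remark above.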
Third example : (stationary) ARMA models
AutoRegressive Moving Average models are a cornerstone of time series analysis.
An ARMA(p,q) model evolves according to
$$y_t = \epsilon_t + \alpha + \sum_{i=1}^{p} \phi_i y_{t-i} + \sum_{j=1}^{q} \theta_j \epsilon_{t-j}$$

where ϵt is some iid sequence of “shocks” (we shall assume the distribution of ϵ1 is
N (0, σ²)). We shall not consider here in detail the question of existence of a random
sequence yt that fulfills this constraint, but we consider the question of estimation of
the parameters (α, ϕ1 , . . . , ϕp , θ1 , . . . θq , σ²).
A direct formulation of the likelihood is difficult, but some approximations may be
proposed. For instance, if ϵ0 , ϵ−1 , . . . , ϵ−q were known, ϵ1 , . . . , ϵT could be computed
recursively from y1 , . . . , yT
$$\epsilon_t = y_t - \left( \alpha + \sum_{i=1}^{p} \phi_i y_{t-i} + \sum_{j=1}^{q} \theta_j \epsilon_{t-j} \right)$$

and the log-likelihood of the sequence of shocks may be derived as a function of the
parameters, the observations, and ϵ0 , ϵ−1 , . . . , ϵ−q .
Computing the exact ML is a hard task since we should integrate out the conditional
distribution by averaging over the distribution of the q + 1 vector ϵ0 , ϵ−1 , . . . , ϵ−q given
y1 , . . . , yT .
To avoid this q + 1–dimensional integral, either ϵ0 , ϵ−1 , . . . , ϵ−q are set to their marginal
expectations (that is 0), which is the Limited Information Maximum Likelihood, or
ϵ0 , ϵ−1 , . . . , ϵ−q are set to their conditional expectations, which is the so-called Full
Information Likelihood.
Testing and Maximum Likelihood techniques
Another important issue with the ML approach is testing. Assume we would like to test
H0 ∶ g (θ) = 0 (where g is some known function mapping into R^k ). The ML approach to
inference basically gives rise to three ways to test this hypothesis
Wald Compute the unconstrained MLE θ̂ and decide whether g (θ̂) is significantly
different from 0 or not;
LM Compute the constrained MLE θ̂0 by maximisation of the Lagrangian L(θ) − λ′g (θ)
and decide whether λ̂ is significantly different from 0 or not;
LR Compute the constrained MLE θ̂0 and the unconstrained MLE and decide whether
L(θ̂) is significantly larger than L(θ̂0 ) or not.
(Exercise: why do I write “larger” in the last sentence ?)
The three test statistics are respectively given by
$$\zeta_W = g(\hat\theta)' \left[ \frac{\partial g}{\partial \theta^T}(\hat\theta)\, \widehat{Var}_\theta\, \frac{\partial g'}{\partial \theta}(\hat\theta) \right]^{-1} g(\hat\theta)$$
$$\zeta_{LM} = \hat\lambda' \left[ \frac{\partial g}{\partial \theta^T}(\hat\theta)\, \widehat{Var}_\theta\, \frac{\partial g'}{\partial \theta}(\hat\theta) \right] \hat\lambda$$
$$\zeta_{LR} = 2(L(\hat\theta) - L(\hat\theta_0))$$

All these quantities must be compared with the appropriate quantile of the χ²(k)
distribution.
An important result is that all these tests are asymptotically equivalent (under suitable
regularity conditions); moreover
ζLM ≤ ζLR ≤ ζW
which means that rejection by the Lagrange Multiplier test leads to rejection by the other two.
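In the Gaussian linear model with linear restrictions, the three statistics have closed forms in terms of the restricted and unrestricted sums of squared residuals, and the ordering LM ≤ LR ≤ Wald can be checked directly. The sketch below relies on these standard expressions (with σ² estimated by ML); the function name `trinity_linear` and the simulated data are assumptions made for the illustration.

```python
import numpy as np

def trinity_linear(y, X_restricted, X_full):
    """LM, LR and Wald statistics for nested Gaussian linear models."""
    n = len(y)

    def ssr(X):
        beta = np.linalg.lstsq(X, y, rcond=None)[0]
        r = y - X @ beta
        return r @ r

    ssr_r, ssr_u = ssr(X_restricted), ssr(X_full)
    wald = n * (ssr_r - ssr_u) / ssr_u
    lr = n * np.log(ssr_r / ssr_u)
    lm = n * (ssr_r - ssr_u) / ssr_r
    return lm, lr, wald

rng = np.random.default_rng(7)
n = 300
x1, x2 = rng.standard_normal(n), rng.standard_normal(n)
y = 1.0 + 0.5 * x1 + 0.1 * x2 + rng.standard_normal(n)
X_full = np.column_stack((np.ones(n), x1, x2))
X_restricted = X_full[:, :2]          # imposes H0: the coefficient of x2 is zero

lm, lr, wald = trinity_linear(y, X_restricted, X_full)
print(round(lm, 3), round(lr, 3), round(wald, 3))   # to be compared with 3.841 (chi2(1), 5%)
```

Writing x = (SSR_r − SSR_u)/SSR_u, the ordering follows from x/(1 + x) ≤ log(1 + x) ≤ x for x ≥ 0.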
Testing and Maximum Likelihood techniques, con’t

A legitimate question is: why do we need three tests, given that they are all justified on asymptotic grounds and are asymptotically equivalent? There are at least three main reasons.

First, there are instances in which constrained ML estimation is easy whereas unconstrained estimation is difficult, or the other way around. In the first case LM should be preferred, and Wald in the second. There are finally (rare) instances in which the variance/covariance matrix is difficult to compute, in which case LR should be preferred.

Second, Wald tests and confidence regions may perform very poorly when there is a lack of identification in finite samples. In such cases LM is usually preferable since it correctly detects this problem (at the cost of low power for the test and unboundedness of the confidence region, but this is a good property!). See Dufour (1997) for details.

Third, the inequality holds if the model is correctly specified. A triplet of decisions that is incompatible with the above inequality signals that the model may be misspecified (see Godfrey 1991 for a complete treatment of this question).
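As an illustration of the three statistics, here is a minimal sketch for one special case that is not taken from the slides: a Gaussian linear regression in which H0 sets some coefficients to zero. In that case the three statistics have closed forms in terms of the constrained and unconstrained residual sums of squares (variances estimated by ML, that is RSS/n). The function name `trinity` and the simulated data are hypothetical.

```python
import numpy as np

def trinity(y, X_full, X_null):
    """LM, LR and Wald statistics for nested Gaussian linear regressions."""
    n = len(y)

    def rss(Z):
        # Residual sum of squares of the OLS (= Gaussian ML) fit of y on Z
        beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
        resid = y - Z @ beta
        return float(resid @ resid)

    rss1, rss0 = rss(X_full), rss(X_null)   # unconstrained, constrained
    wald = n * (rss0 - rss1) / rss1         # uses the unconstrained variance
    lr = float(n * np.log(rss0 / rss1))     # 2 * (L(theta_hat) - L(theta_hat_0))
    lm = n * (rss0 - rss1) / rss0           # uses the constrained variance
    return lm, lr, wald

rng = np.random.default_rng(0)
n = 100
x1, x2 = rng.standard_normal(n), rng.standard_normal(n)
y = 1.0 + 0.5 * x1 + 0.3 * x2 + rng.standard_normal(n)
X_full = np.column_stack([np.ones(n), x1, x2])
X_null = X_full[:, :2]                      # H0: coefficient on x2 is zero (k = 1)
lm, lr, wald = trinity(y, X_full, X_null)
print(lm, lr, wald)                         # each is compared with a chi2(1) quantile
```

In this special case the ordering of the three statistics holds in every sample, since 1 − 1/r ≤ log r ≤ r − 1 for r = RSS0/RSS1 ≥ 1.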
Quasi Maximum Likelihood

Our first example above shows that the OLS estimator may coincide with the MLE. This is true only if the error term is Gaussian (this is not trivial to see, since it requires solving a multivariate differential equation). Now consistency of the OLS does not require the error term to be Gaussian. Hence the MLE may be consistent even if the distribution function used to compute the likelihood is not the right one. This is an instance of Quasi ML estimation.

Two questions arise: how should the distribution be chosen so that the QMLE is consistent and, besides consistency, what are the asymptotic properties of the QMLE?
The second one is much easier to answer. As a matter of fact, the argument is exactly the same as for the MLE, except that in the limit the QMLE is no longer a linear function of the score. As a consequence, the limit variance is much more complicated to write down (see, for instance, Trognon 1987, in French).

The first one turns out to be too general to have a single answer. A useful restriction is to consider the case in which the first moment of the model is well specified, that is, there exists a value θ0 of the parameter such that the true conditional expectation E[Y∣X] coincides with the conditional expectation computed with the chosen model at this value of the parameter. In this case, it may be shown that the QMLE is CAN, that is, consistent and asymptotically normal (regularity assumptions required, as usual). This result has been obtained by White [1982].
Indirect Inference (notions)

Assume we perform OLS estimation in a case where the error terms are iid, and denote by f the true density function. Then, according to the previous result, $\theta_{QMLE(f)_n}$ is such that

$$\lim_{n\to+\infty} E_{\theta_{QMLE(f)_n}}[Y \mid X] = E_f[Y \mid X]$$

In words, if the parameter of interest is $E_f[Y \mid X]$ then the QMLE provides a consistent approach for this parameter [9]. The result says nothing else about f (for instance, quantiles are usually not consistently estimated).
The idea behind Indirect Inference is to use the QMLE when we know how to simulate from the true model but genuine ML is difficult to compute (as in example 3 above). The approach rests on the computation of the binding function that maps f → $\theta_{QMLE(f)_n}$, obtained from several simulated samples for different values of f. This requires fast computation of the QMLE only.
Then f is estimated by minimizing with respect to f some distance between the QMLE computed from the data and $\theta_{QMLE(f)_n}$. If the link between f and $\theta_{QMLE(f)_n}$ is "precise enough", this provides a consistent estimate of the true model using an auxiliary, wrong one.
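The steps above can be sketched on a toy case. Everything below is an illustrative assumption rather than the example of the slides: the "true" model is an MA(1) with Gaussian shocks indexed by a scalar θ (playing the role of f), the auxiliary wrong model is a Gaussian AR(1) whose QMLE is an OLS slope, and the binding function is approximated on a grid of candidate values.

```python
import numpy as np

def simulate_ma1(theta, shocks):
    """True model (easy to simulate): x_t = e_t - theta * e_{t-1}."""
    return shocks[1:] - theta * shocks[:-1]

def auxiliary_estimate(x):
    """QMLE of the auxiliary Gaussian AR(1) model: OLS slope of x_t on x_{t-1}."""
    return float(x[1:] @ x[:-1] / (x[:-1] @ x[:-1]))

def indirect_inference(x_obs, grid, n_sim=20, seed=0):
    rng = np.random.default_rng(seed)
    # The same simulated shocks are reused for every candidate theta
    shocks = rng.standard_normal((n_sim, len(x_obs) + 1))
    target = auxiliary_estimate(x_obs)
    # Simulated binding function: theta -> average auxiliary estimate
    binding = np.array([
        np.mean([auxiliary_estimate(simulate_ma1(th, e)) for e in shocks])
        for th in grid
    ])
    # Keep the theta whose simulated auxiliary estimate is closest to the data
    return float(grid[np.argmin((binding - target) ** 2)])

rng = np.random.default_rng(1)
x_obs = simulate_ma1(0.5, rng.standard_normal(2001))   # data generated with theta = 0.5
theta_hat = indirect_inference(x_obs, np.linspace(-0.9, 0.9, 181))
print(theta_hat)
```

The grid search is only a convenient stand-in for a proper numerical minimization; it works here because the binding function is monotone on the grid.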

[9] A caveat: this result is usually obtained for each given value of f. Uniform consistency is not guaranteed (and fails to hold unless the set over which it is obtained is constrained).
Lesson 4 : Bootstrap and resampling techniques
Introduction
Bootstrap is part of the toolbox of many applied econometricians. It is popular for several reasons:
▸ you do not have to care about estimators for the limit distribution (notably the asymptotic variance)
▸ it provides "better" estimators than the "usual" techniques
▸ it is easy to implement

Some of the above statements are not always true. For instance the last one is false, for genuine bootstrap estimators are often impossible to compute and numerical approximations are used instead [10]. Some others are not clearly defined (in what sense is the bootstrap estimator "better"?).

The technique was not discovered by Bradley Efron (nor by Herbert Simon...). It goes back to the beginning of modern statistics and was used explicitly for the first time in 1923 by Hubback. Fisher acknowledged the great influence of Hubback's ideas on his own work [11]. Nevertheless, Efron's contribution is seminal in many respects and should be regarded as the main building block.

[10] As we shall see, a theoretical (non-parametric) bootstrap estimator requires $n^n$ computations for a sample of size n, so that if the statistic takes $10^{-8}$ second to compute and the sample size is n = 100, a genuine theoretical computation requires about $10^{174}$ times the age of the universe...
[11] "The use of the method of random sampling is theoretically sound. I may mention that its practicability, convenience and economy was demonstrated by an extensive series of crop-cutting experiments on paddy carried out by Hubback... They influenced greatly the development of my methods at Rothamsted." (R.A. Fisher, 1945).
What does the Bootstrap achieve?
Let Θ be a closed subset of $\mathbb{R}^p$ and consider a parametric model $(F_\theta)_{\theta\in\Theta}$. Now assume we have an iid sample X1, …, Xn from the true distribution $F_{\theta_0}$, where θ0 lies in the interior Θ̊ of Θ.

The "usual" approach to estimation (either by maximum likelihood or by the method of moments) provides an estimator θ̂. Assume θ̂ is CAN: as n goes to infinity,
$$n^{1/2}(\hat\theta - \theta_0) \xrightarrow{d} N(0, V_{\theta_0})$$
so that test procedures and confidence intervals that are asymptotically valid may be derived.

This derivation (and in particular the computation of Vθ0 ) usually relies on the
examination of the mapping θ0 → θ̂(θ0 ) (this is striking in the Indirect Inference
approach). The main tools used here are the Slutsky Theorem and the continuous
mapping theorem.

For instance, in a Student procedure for the significance of the mean we compare the statistic $T_n = \bar{x}_n/\hat{s}_n$ with a "critical point" c(α, n), which in this case is a quantile of the Student distribution with n − 1 degrees of freedom, because we know that under the null the distribution of Tn is close to a Student distribution with n − 1 degrees of freedom under "fairly general conditions".

The problem is that this approach is only approximate in finite samples, because the exact distribution of $T_n = \bar{x}_n/\hat{s}_n$ is not known. Hence the 1 − α quantile of the St(n − 1) distribution is not the 1 − α quantile of the distribution of Tn (even if the null is true). This may lead to possibly large Errors in Rejection Probability (ERP). The bootstrap provides a way to compute more accurate critical points, so that the ERP converges to zero more quickly than with the usual classical approaches.
Parametric Bootstrap

Consider a sequence of families of distributions $(P_n(\theta))_{\theta\in\Theta}$. We want to test θ ∈ Θ0 ⊂ Θ.

We first define the asymptotic test procedure: ϕA = 1 iff $T_n > H^{-1}(1-\alpha)$ (otherwise ϕA = 0), where H is the limiting distribution of Tn under the null.

A crucial remark here is that we assume that the limit distribution of Tn under the null
does not depend on θ. In technical terms, Tn is asymptotically pivotal under the null.

Now let θ̂n be a root-n consistent estimator of θ under the null (for instance the constrained MLE) and assume we are able to simulate several independent drawings $(x_1^s, \dots, x_n^s)$, s = 1, …, S, from the distribution $P_n(\hat\theta_n)$.

For each drawing s we can compute the value of the statistic $T_{n,s}$. Now, instead of using $H^{-1}(1-\alpha)$ as a critical point, we may use $H_n^{-1}(1-\alpha, \hat\theta_n)$ (we omit the dependence on S), the empirical 1 − α quantile computed from $T_{n,1}, \dots, T_{n,S}$.

Hence the bootstrap testing procedure sets ϕB = 1 iff $T_n > H_n^{-1}(1-\alpha, \hat\theta_n)$ (otherwise ϕB = 0).

This idea may be used to provide confidence intervals, simply by keeping the values of θ that are not rejected by such a test.
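The parametric bootstrap test described above can be sketched as follows. The model is an assumption made for illustration only: X = µ + σ(E − 1) with E exponential of mean one, H0: µ = 0, and σ estimated under the null by a simple moment estimator (not the constrained MLE). The names `t_stat` and `parametric_bootstrap_test` are hypothetical.

```python
import numpy as np

def t_stat(x):
    """Studentized mean T_n = sqrt(n) * mean / standard deviation."""
    return float(np.sqrt(len(x)) * x.mean() / x.std(ddof=1))

def parametric_bootstrap_test(x, alpha=0.05, S=2000, seed=0):
    rng = np.random.default_rng(seed)
    n = len(x)
    # Root-n consistent estimate of sigma under H0 (E[X] = 0 implies E[X^2] = sigma^2)
    sigma0 = np.sqrt(np.mean(x ** 2))
    # S independent samples of size n drawn from P_n(theta_hat_n)
    sims = sigma0 * (rng.exponential(1.0, size=(S, n)) - 1.0)
    t_sims = np.sqrt(n) * sims.mean(axis=1) / sims.std(axis=1, ddof=1)
    # Empirical 1 - alpha quantile replaces the asymptotic critical point
    crit = float(np.quantile(t_sims, 1 - alpha))
    return t_stat(x) > crit, crit

rng = np.random.default_rng(0)
x = 0.7 * (rng.exponential(1.0, 20) - 1.0)   # a sample of size 20 drawn under H0
reject, crit = parametric_bootstrap_test(x)
print(reject, crit)
```

With such skewed data and n = 20 the simulated critical point is typically visibly smaller than the Gaussian value 1.645, which is the kind of correction the asymptotic test cannot deliver.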
Does it work and why ? (for details see Beran 1986 and Beran 1988, JASA)

The key argument is that for each x and θ the empirical cdf $H_n(x,\theta)$ converges to $H(x,\theta)$. More precisely, we can usually write something like

$$H_n(x,\theta) = H(x) + n^{-1/2}\,h(x,\theta) + O(n^{-1})$$

for some function h. Thus we have

$$H_n(x,\hat\theta_n) = H(x) + n^{-1/2}\,h(x,\hat\theta_n) + O_p(n^{-1})$$

Hence we get

$$H_n(x,\hat\theta_n) - H_n(x,\theta) = n^{-1/2}\bigl(h(x,\hat\theta_n) - h(x,\theta)\bigr) + O_p(n^{-1}) = O_p(n^{-1})$$

Indeed, we assume that the difference between θ̂n and θ is of order $n^{-1/2}$, hence if h(x, θ) is sufficiently regular we have

$$h(x,\hat\theta_n) - h(x,\theta) = O_p(n^{-1/2})$$

Now, by definition of $H_n(x,\theta)$, the probability of the event

$$H_n(T_n,\theta) > 1-\alpha$$

is exactly α when θ ∈ Θ0. We then deduce that $H_n(x,\hat\theta_n)$ provides a better approximation of the genuine distribution $H_n(x,\theta)$ of Tn than H(x) does: the error is $O_p(n^{-1})$ instead of $O(n^{-1/2})$.
Why is asymptotic pivotality important ?

Consider now the case in which the limit distribution depends on θ under the null (that is, we lose pivotality). We now have

$$H_n(x,\theta) = H(x,\theta) + n^{-1/2}\,h(x,\theta) + O(n^{-1})$$

and thus

$$H_n(x,\hat\theta_n) = H(x,\hat\theta_n) + n^{-1/2}\,h(x,\hat\theta_n) + O_p(n^{-1})$$

Hence the leading term in the difference $H_n(x,\hat\theta_n) - H_n(x,\theta)$ is now

$$H(x,\hat\theta_n) - H(x,\theta)$$

which is $O_p(n^{-1/2})$ under regularity conditions for H(x, θ). In this case, the bootstrap does not provide any extra precision, since the order of magnitude of the ERP is the same as when we use the asymptotic distribution.

In short, when the limit distribution of the statistic Tn is not pivotal, the bootstrap is useless.

Caveat: there is a lot of "hand waving" in the previous arguments. A compelling proof requires precise statements about the type of convergence (it must achieve some degree of uniformity with respect to θ). When these conditions are not fulfilled, the bootstrap may also fail despite the limiting distribution of Tn under the null being pivotal.
Non parametric Bootstrap and bootstrap ”principle”
In the previous presentation, we implicitly assumed that the parameter θ provides a full description of the distribution. If Θ is a finite dimensional space, this is in many cases quite a restrictive assumption. Indeed, if we suspect that large ERP are possible, this is because we have no strict parametric representation under the null. Hence the bootstrap is often presented in a non parametric setting.

In this case, Θ is typically divided in two parts: one is parametric (for instance the mean) and the other is not. Hence θ = (µ, F). The value of µ is fixed under the null (say 0), but F is not.

Strict application of the above technique then requires a consistent estimator of F. A usual solution is the empirical estimator F̂n. Then we have to compute simulations using this estimator.

If we use the (pseudo-)inversion algorithm, this amounts to computing $\hat F_n^{-1}(U)$ for a uniform drawing U. Now, with probability one, this algorithm is equivalent to equiprobable drawing with replacement in the original dataset.

This is by far the best-known presentation of the bootstrap (if not the only one, judging by websites) and it explains the name. The extension of this idea, namely computing the same statistic from the original dataset X and from a sample X* obtained by equiprobable drawings with replacement, is called the "bootstrap principle".
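The equivalence between inverting the empirical cdf and drawing with replacement can be checked with a short sketch. The function names and the choice of the mean as statistic are illustrative assumptions.

```python
import numpy as np

def ecdf_inverse(sample, u):
    """Generalized inverse of the empirical cdf: inf{x : F_n(x) >= u}."""
    xs = np.sort(sample)
    n = len(xs)
    # F_n^{-1}(u) is the order statistic of rank ceil(n * u)
    idx = np.clip(np.ceil(u * n).astype(int) - 1, 0, n - 1)
    return xs[idx]

def bootstrap_se(sample, stat, B=1000, seed=0):
    """Standard error of `stat` over B resamples obtained by inversion."""
    rng = np.random.default_rng(seed)
    u = rng.random((B, len(sample)))        # uniform drawings
    reps = np.array([stat(ecdf_inverse(sample, row)) for row in u])
    return float(reps.std(ddof=1))

rng = np.random.default_rng(0)
sample = rng.standard_normal(200)
resample = ecdf_inverse(sample, rng.random(200))
se_boot = bootstrap_se(sample, np.mean)
print(se_boot, sample.std(ddof=1) / np.sqrt(len(sample)))
```

Every resampled value is one of the original observations, each picked with probability 1/n, which is exactly equiprobable drawing with replacement.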

This case is much more complicated, because controlling the limit behavior so as to mimic the presentation of the parametric case in a non parametric one raises difficulties. The best-known approach was pioneered by Peter Hall. It rests on peculiar
Bootstrap extensions : iterations

The bootstrap principle may be iterated. Indeed, we saw that the testing procedure ϕB = 1 iff $T_n > H_n^{-1}(1-\alpha, \hat\theta_n)$ and ϕB = 0 otherwise may provide a test with a smaller ERP in large samples.

Now, this does not mean that the probability of the event ϕB = 1 is exactly α under the null (see below for examples of bootstrap failure). But the bootstrap principle may be applied to the bootstrap test itself.

More precisely, if the probability under the null of the event $T_n > H_n^{-1}(1-\alpha, \hat\theta_n)$ is not exactly α, it may be estimated by simulations. These simulations may then be used to provide a new correction of the critical point, and so on.

Theoretically, each iteration provides a new improvement in the asymptotic precision of the test. Hence, if the first iteration changes the asymptotic order of the ERP from $O(n^{-1/2})$ to $O(n^{-1})$, the second one provides asymptotically an ERP of order $O(n^{-3/2})$, and so on.

Notice however that this iteration is often very costly with current computers [12]. Indeed, in a non parametric setting, complete simulation of all the possible re-samples from a sample of size n requires $N = n^n$ computations. The second iteration of such a process then requires $N^N$ computations. If n = 10, then $N = 10^{10}$ and $N^N$ is a number that starts with "1" followed by 100000000000 zeros...

[12] Keep in mind however that research in quantum computing is currently reaching important milestones.
Bootstrap in dynamic models
A major problem with the bootstrap arises when time series are considered. For instance consider an MA(1) model

$$x_t = \epsilon_t - \theta\,\epsilon_{t-1}$$

where $(\epsilon_t)_{t\in\mathbb{Z}}$ is a strong white noise with unknown distribution F. Assume we want to compute a bootstrap confidence interval for θ and that our estimator of θ is based on the empirical autocorrelation at horizon 1.

A "blind" application of the bootstrap "principle" may lead us to re-sample from the sample x1, …, xT. This is not a good idea, since the re-sampled observations are drawn independently of each other: for large sample sizes this leads a.s. to an iid sequence. Hence the re-sampled sequence does not have the same distribution as the original one. In particular, the autocorrelograms are different.

A correct application of the bootstrap requires a consistent estimate of F. If ∣θ∣ < 1 this is feasible since

$$\epsilon_t = \sum_{i=0}^{+\infty} \theta^i x_{t-i}$$

We then get, by truncation, an approximate version of an iid sample

$$\hat\epsilon_t = \sum_{i=0}^{t-1} \hat\theta^i x_{t-i}$$

in which re-sampling may be performed. Notice however that this is a model-specific approach and that it is only approximate (hence "improvements" are not guaranteed).
Block Bootstrap

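To circumvent the previous problem, remark that the same model provides direct access to the following bivariate iid sequence:

$$Z_k = (x_{3k}, x_{3k+1})$$

Indeed, in the MA(1) model $x_t$ and $x_s$ are independent as soon as |t − s| ≥ 2, so that leaving out one observation between two consecutive pairs makes the pairs independent. For this iid sequence, the bootstrap principle may be applied directly by resampling in the Z sample.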

This is an instance of a much more general approach in which resampling is done by keeping "blocks" of given size unchanged. This makes it possible to apply the bootstrap "principle" while keeping track of the dynamics.

The tricky point is to choose the proper size for the blocks. Blocks that are too small destroy much of the dynamics, while blocks that are too large decrease the number of distinct simulated samples, hence the precision of the plug-in.

A difficult problem is that, except in very simple cases (such as the above MA), block bootstrap sequences are no longer stationary. Randomly changing the size of the blocks may help to circumvent this problem (see Künsch, H. R. 1989. "The jackknife and the bootstrap for general stationary observations," Annals of Statistics, 17, 1217–1241 for details).
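A minimal sketch of block resampling, under assumptions that go beyond the text: overlapping blocks of a fixed length chosen by hand (often called the moving block bootstrap), applied to a simulated MA(1) series with θ = 0.5. The helper names are hypothetical.

```python
import numpy as np

def moving_block_bootstrap(x, block_len, rng):
    """Resample a series by concatenating randomly chosen blocks of consecutive values."""
    n = len(x)
    n_blocks = -(-n // block_len)                      # ceil(n / block_len)
    starts = rng.integers(0, n - block_len + 1, size=n_blocks)
    # Indices of each block, glued together and cut back to length n
    idx = (starts[:, None] + np.arange(block_len)).ravel()[:n]
    return x[idx]

def acf1(x):
    """Empirical autocorrelation at horizon 1."""
    xc = x - x.mean()
    return float(xc[1:] @ xc[:-1] / (xc @ xc))

rng = np.random.default_rng(0)
e = rng.standard_normal(2001)
x = e[1:] - 0.5 * e[:-1]                               # MA(1) with theta = 0.5
x_block = moving_block_bootstrap(x, 10, rng)
x_iid = rng.choice(x, size=len(x), replace=True)       # "blind" resampling
print(acf1(x), acf1(x_block), acf1(x_iid))
```

The block resample keeps an autocorrelation close to that of the original series, whereas the blind resample has an autocorrelation close to zero.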
Bootstrap and the linear model
Consider now the linear model
E[Y∣X] = θX
and assume θ is the parameter of interest.

A direct application of the bootstrap principle leads to a major difficulty here, since the model is a constraint on the conditional distribution of Y given X. If we perform resampling as an equiprobable drawing with replacement in the (Y, X) sequence, then we no longer consider a distribution of Y conditional on the full sequence X1, …, Xn.

This difficulty is often overcome by applying the bootstrap principle to $\hat\epsilon_n = Y - \hat\theta_n X$. More precisely, we draw a resampled version $\hat\epsilon_n^*$ of the estimated error terms, which is used to compute a new set of endogenous variables $Y^* = \hat\theta_n X + \hat\epsilon_n^*$. This is known as the "resampling in the residuals" technique.

Again, proving that this provides better inference (in the above sense) is difficult because, as n increases, the sequence of estimated residuals (ϵ̂n) does not converge to the sequence of the actual residuals ϵn (because as we increase the sample size we also increase the number of residuals).

Notice finally that the initial solution (that is, equiprobable drawing with replacement in the (Y, X) sequence) and the "resampling in the residuals" technique rest implicitly on two different assumptions. The first assumes we have an iid drawing in the (Y, X) sequence, whereas the second assumes that Y − θX is an iid sequence conditional on X.
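The two resampling schemes can be put side by side in a small sketch. The data, the function names and the centering of the residuals (a common practical refinement that the text does not mention) are assumptions made for illustration.

```python
import numpy as np

def ols_slope(x, y):
    """OLS estimate of theta in E[Y|X] = theta * X (no intercept)."""
    return float(x @ y / (x @ x))

def pairs_bootstrap(x, y, B, rng):
    """Equiprobable drawing with replacement in the (Y, X) sequence."""
    n = len(y)
    draws = rng.integers(0, n, size=(B, n))
    return np.array([ols_slope(x[i], y[i]) for i in draws])

def residual_bootstrap(x, y, B, rng):
    """Resampling in the residuals: X is kept fixed, Y* = theta_hat * X + eps*."""
    n = len(y)
    theta_hat = ols_slope(x, y)
    resid = y - theta_hat * x
    resid = resid - resid.mean()            # centering: an added assumption
    draws = rng.integers(0, n, size=(B, n))
    return np.array([ols_slope(x, theta_hat * x + resid[i]) for i in draws])

rng = np.random.default_rng(0)
n = 200
x = rng.normal(1.0, 1.0, n)
y = 2.0 * x + rng.standard_normal(n)        # iid errors, independent of x
se_pairs = float(pairs_bootstrap(x, y, 500, rng).std(ddof=1))
se_resid = float(residual_bootstrap(x, y, 500, rng).std(ddof=1))
print(se_pairs, se_resid)
```

With iid errors independent of X both assumptions hold and the two standard errors are close; they would typically differ under conditional heteroskedasticity, where only the first assumption may still be reasonable.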
What Bootstrap cannot achieve

It is important to keep in mind that the bootstrap is not a panacea. While the World Wide Web provides an orgy of happy-ending stories, examples of failures are not difficult to come up with. In particular, the belief that "bootstrap performs better in small samples" (compared to usual asymptotic approaches) is a tale, as the following example shows.

Consider the set of probability distributions for X such that P(X = −1) = p and P(X = m) = 1 − p with 0 ≤ p ≤ 1 and m ∈ R. Also assume we observe an iid sample of size n from one such distribution. Suppose we would like to test
H0: E[X] = 0 ⇔ (1 − p)m = p against H1: E[X] > 1000.
Assume we use any test procedure ϕ. Can we be sure that using the bootstrap principle will decrease the ERP in small samples?

The answer is 'no' because, no matter how large n is, there exist distributions under the null such that P(X1 = m, …, Xn = m) = $(1-p)^n$ is as close to 1 as desired (take p close to 0). As a consequence, there exist distributions under the null such that the probability that the initial and the resampled drawings are exactly the same is as close as desired to 1. Hence with probability arbitrarily close to 1 the bootstrap and classical tests provide the same decision.
An example of Bootstrap failure (lack of differentiability)
Another issue is the implicit "regularity assumptions" that lie behind the above developments and underlie Beran's arguments. When the expansions are not correct (because functionals are not differentiable) or provide a poor fit (because orders of magnitude are unclear), the arguments of strong bootstrap advocates should be taken with a grain of salt.

Here is an example of such a failure.

Let X1, …, Xn be an i.i.d. sample from an unknown distribution F with mean µ and variance σ² < ∞. Suppose we are interested in the absolute value of the mean ∣µ∣ and we consider the plug-in estimator
$$\hat\theta_n = |\bar X_n|$$
Then the bootstrap estimate of the variance is not consistent if µ = 0.
Lesson 5 : Non parametric estimation and testing
Introduction
Many (if not all) of the procedures we discussed so far assume that the model is well specified. This means that the Data Generating Process (DGP) must belong to the set of probability distributions that forms the model. Of course, if the model is a small (in the sense of inclusion) set of distributions, the risk of misspecification is high.

This is the main reason why non parametric models are proposed: they are large enough for the misspecification risk to be reasonably small. More precisely, the model is parametric when there exists an injection from the set of distributions into $\mathbb{R}^p$ (otherwise it is non-parametric).

There are basically two ways to approach inference in non-parametric models. In the first one, the parameter of interest lies in a finite-dimensional set and the non parametric part of the model is a nuisance parameter (in most cases, it is the common cdf of an iid sequence of "error terms"). In the second, the parameter of interest belongs to some functional (infinite dimensional) set.

In the first case, we often look for methods in which the statistic is (at least asymptotically) pivotal with respect to the non-parametric part of the model (we have already discussed some instances of this case). In the second case, we want to provide estimators for a function.

In the first case, we often speak of semi-parametric inference, whereas in the second we use the term non-parametric estimation. This terminology is not as clear as it may look, but we shall use it for simplicity.
Examples : tests

Assume we have an iid sample X1, …, Xn from some unknown continuous cdf F and we would like to test the hypothesis H0: Med[X1] = 0. A well known test for this case is as follows:

Compute $T_n = \sum_{1\le i\le n} 1_{X_i \ge 0}$ and reject H0 when Tn is larger than the 1 − α quantile of the Bin(n, 0.5) distribution. This is a valid non parametric test at level α for H0. It is known as the sign test.
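The sign test can be written in a few lines. The sketch below uses the p-value form of the rule: Tn exceeds the 1 − α quantile of Bin(n, 0.5) exactly when P(Bin(n, 0.5) ≥ Tn) ≤ α. The function name `sign_test` is hypothetical.

```python
from math import comb

def sign_test(x, alpha=0.05):
    """One-sided sign test of H0: median = 0 (rejects for large T_n)."""
    n = len(x)
    t_n = sum(1 for v in x if v >= 0)
    # Exact upper tail of the Bin(n, 1/2) distribution: P(B >= t_n)
    p_value = sum(comb(n, k) for k in range(t_n, n + 1)) / 2 ** n
    return t_n, p_value, p_value <= alpha

print(sign_test([0.3, 1.2, 0.8, 2.1, 0.4, 1.7, 0.9, 0.2, 1.1, 0.6]))
print(sign_test([1.0, -1.0, 2.0, -0.5]))
```

Only the signs of the observations are used, which is why no assumption on F beyond continuity is needed.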

Assume now we have an iid bivariate sample (X1, Y1), …, (Xn, Yn) such that P(X ∈ A, Y ∈ B) = PF(A) × PG(B) for any pair of measurable sets A, B. We shall also assume that P(X = Y) = 0. Suppose we would like to test the hypothesis H0: F = G.

If we compute $T_n = \sum_{1\le i,j\le n} 1_{X_i \ge Y_j}$ (that is, the number of times an X 'beats' a Y in a face-to-face contest), Tn has a known distribution under the null, so that a valid test may be performed (this is the famous Wilcoxon-Mann-Whitney test procedure).

Consider ANY hypothesis H0 in ANY set of probability distributions. Randomly draw U from a uniform distribution on [0, 1] and reject H0 whenever U ≤ α: this provides a valid test at level α. This is a very stupid one, but it is valid. Basically, this example shows that providing a valid test is not an issue. The point is that the test must be valid AND, hopefully, somewhere powerful. That is, it must be able to reject at least one distribution in the alternative with probability strictly larger than the level.
Examples : Estimation
By far the best-known non-parametric estimator is the histogram. It is an estimator of the density. The (arguably) second best-known is the empirical distribution function
$$\hat F(x) = \frac{1}{n}\sum_{1\le i\le n} 1_{X_i \le x}$$
This is an estimator of the cdf. It is consistent in the following sense: whatever the cdf F0 of the DGP, the Glivenko-Cantelli theorem asserts that

$$P\left(\lim_{n\to+\infty} \|\hat F - F_0\|_\infty = 0\right) = 1.$$

Finite sample properties may be made precise by the Dvoretzky-Kiefer-Wolfowitz (DKW) inequality.
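As an illustration, the DKW inequality states that P(‖F̂ − F0‖∞ > ε) ≤ 2 exp(−2nε²), which yields a uniform confidence band around the empirical cdf. A minimal sketch (function names are hypothetical):

```python
import numpy as np

def ecdf(sample):
    """Return the empirical cdf x -> (1/n) * #{i : X_i <= x}."""
    xs = np.sort(np.asarray(sample, dtype=float))
    n = len(xs)
    return lambda x: np.searchsorted(xs, x, side="right") / n

def dkw_halfwidth(n, alpha=0.05):
    """eps such that 2 * exp(-2 * n * eps**2) = alpha."""
    return float(np.sqrt(np.log(2.0 / alpha) / (2.0 * n)))

rng = np.random.default_rng(0)
sample = rng.standard_normal(100)
F_hat = ecdf(sample)
eps = dkw_halfwidth(len(sample))
# With probability at least 95%, F0 lies within +/- eps of F_hat everywhere
print(F_hat(0.0), eps)
```

The half-width shrinks at the rate n^(-1/2) and does not depend on F0.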

Assume the model is described by a set of distributions such that we know that, for some parameter θ ∈ Θ ⊂ $\mathbb{R}^p$ and some given function h(X, θ), we have

$$E_F[h(X,\theta)] = 0$$

An estimator of θ may be proposed by solving the following maximization problem

$$\max_{\theta,\pi_1,\dots,\pi_n} \sum_{i=1}^n \log(\pi_i) \quad \text{s.t.} \quad \sum_{i=1}^n \pi_i = 1, \qquad \sum_{i=1}^n \pi_i\, h(X_i,\theta) = 0$$

This is the empirical likelihood method pioneered by Owen [1988].


Kernel estimation : density
Consider an iid sequence X1, …, Xn. A common estimator for the density is the kernel estimator
$$\hat f_{n,h}(x) = \frac{1}{nh}\sum_{i=1}^n K\left(\frac{x - X_i}{h}\right)$$
where h > 0 is the bandwidth (the width of the window) and K(.) is a kernel, i.e. a positive function, symmetric about 0, that integrates to one. These requirements imply that $\hat f_{n,h}(.)$ is a positive function that integrates to one. Moreover, for the usual kernels, which decrease away from 0, the Xi that are closer to x are given more weight in the estimation of $\hat f_{n,h}(x)$.

If h is very small, $(x - X_i)/h$ is very large in absolute value except when Xi is very close to x, so that only the observations nearest to x contribute: the estimate is very wiggly (under-smoothing). At the other extreme, if h is very large then $K\left(\frac{x - X_i}{h}\right)$ takes almost the same value for every i and f̂ is very smooth, whatever the data (over-smoothing). [Figure: illustration of the under- and over-smoothing phenomena.]
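The estimator and the effect of the bandwidth can be sketched as follows, assuming a Gaussian kernel and simulated standard normal data (both are illustrative choices, as are the function names).

```python
import numpy as np

def gaussian_kernel(u):
    return np.exp(-0.5 * u ** 2) / np.sqrt(2.0 * np.pi)

def kde(x_grid, sample, h, kernel=gaussian_kernel):
    """Kernel density estimate (1 / (n h)) * sum_i K((x - X_i) / h) on a grid."""
    u = (x_grid[:, None] - sample[None, :]) / h
    return kernel(u).sum(axis=1) / (len(sample) * h)

rng = np.random.default_rng(0)
sample = rng.standard_normal(500)
grid = np.linspace(-6.0, 6.0, 1201)
dx = grid[1] - grid[0]

f_under = kde(grid, sample, h=0.02)   # very small h: spiky estimate
f_mid = kde(grid, sample, h=0.3)
f_over = kde(grid, sample, h=2.0)     # very large h: flat estimate
print(f_mid.sum() * dx)               # numerical integral, close to one
print(f_under.max(), f_mid.max(), f_over.max())
```

The true density peaks at about 0.4; the under-smoothed estimate overshoots it at the spikes while the over-smoothed one is far too flat.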
Optimal bandwidth choice : Mean Integrated squared error (MISE) criterion
Basically, if h is very small, the estimated density puts a mass of almost 1/n in the neighborhood of each point of the sample (since the frequency of each observation is 1/n) and is almost 0 elsewhere.

In this case the in-sample bias is zero (we "predict" exactly what happened) but every time we have a new observation, everything changes. The estimation is very unstable, and it displays a large variance.

At the other extreme, the opposite arises: the bias is large and the variance is small. Choosing h then requires a criterion that balances both variance and bias. The usual one is the MISE criterion. Asymptotically, the leading terms of this criterion are

$$\frac{1}{nh}\int K^2(u)\,du + \frac{h^4}{4}\left(\int u^2 K(u)\,du\right)^2 \int \bigl(f''(u)\bigr)^2\,du$$

The first term goes to infinity when h goes to zero (this is the variance part) while the second diverges when h is large (this is the squared bias part). Overall, this quantity, as a function of h, is minimized when

$$h = \left(\frac{\int K^2(u)\,du}{n\left(\int u^2 K(u)\,du\right)^2 \int \bigl(f''(u)\bigr)^2\,du}\right)^{1/5}$$

Interestingly, the leading term in the MISE criterion is then $O(n^{-4/5})$. This means that if we want to divide the (square root of the) MISE by 10, the number of data must be multiplied by about 316.
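The rate and the factor 316 can be checked by a short computation. The shorthand $R(K) = \int K^2(u)\,du$, $\mu_2 = \int u^2 K(u)\,du$ and $R(f'') = \int (f''(u))^2\,du$ is introduced here only for readability.

```latex
\begin{aligned}
h^{*} &= \left(\frac{R(K)}{n\,\mu_2^{2}\,R(f'')}\right)^{1/5},\\[4pt]
\frac{R(K)}{n h^{*}} &= R(K)^{4/5}\,\bigl(\mu_2^{2} R(f'')\bigr)^{1/5}\, n^{-4/5},
\qquad
\frac{(h^{*})^{4}}{4}\,\mu_2^{2} R(f'') = \frac{1}{4}\,R(K)^{4/5}\,\bigl(\mu_2^{2} R(f'')\bigr)^{1/5}\, n^{-4/5},\\[4pt]
\text{leading term at } h^{*} &= \frac{5}{4}\,R(K)^{4/5}\,\bigl(\mu_2^{2} R(f'')\bigr)^{1/5}\, n^{-4/5},\\[4pt]
\sqrt{\text{MISE}} \propto n^{-2/5}
&\;\Longrightarrow\; \text{dividing it by } 10 \text{ requires multiplying } n \text{ by } 10^{5/2} \approx 316.
\end{aligned}
```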
Choice of kernel

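One may also look at the MISE criterion as a function of the kernel. Once the optimal bandwidth is plugged in, this amounts to solving the following problem

$$\min_K \left(\int u^2 K(u)\,du\right)^{1/2} \int K^2(u)\,du$$
$$\text{s.t.}\quad K(-u) = K(u)\ \ \forall u \in \mathbb{R}, \qquad K(u) \ge 0\ \ \forall u \in \mathbb{R}, \qquad \int K(u)\,du = 1$$

The "best" one is known as the Epanechnikov kernel (difficult exercise):
$$K(u) = \frac{3}{4}(1 - u^2), \quad u \in [-1, 1].$$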

The relative inefficiencies of most other commonly used kernels are usually small, meaning that an optimal choice of the kernel is not a major issue. For instance the uniform kernel (which underlies the classic histogram) has a relative efficiency of .93 whereas the Gaussian kernel achieves .95.
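These efficiencies can be recovered numerically. The sketch below assumes that relative efficiency is measured by the ratio of the quantity $(\int u^2 K)^{1/2} \int K^2$ for the Epanechnikov kernel to the same quantity for the kernel under study; this definition is an assumption chosen because it reproduces the figures quoted above.

```python
import numpy as np

u = np.linspace(-8.0, 8.0, 160001)
du = u[1] - u[0]

kernels = {
    "epanechnikov": np.where(np.abs(u) <= 1, 0.75 * (1 - u ** 2), 0.0),
    "uniform": np.where(np.abs(u) <= 1, 0.5, 0.0),
    "gaussian": np.exp(-0.5 * u ** 2) / np.sqrt(2 * np.pi),
}

def criterion(k):
    """sqrt(int u^2 K) * int K^2, computed with a simple Riemann sum."""
    second_moment = np.sum(u ** 2 * k) * du
    roughness = np.sum(k ** 2) * du
    return np.sqrt(second_moment) * roughness

best = criterion(kernels["epanechnikov"])
efficiency = {name: float(best / criterion(k)) for name, k in kernels.items()}
print(efficiency)
```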
Non parametric estimation of conditional expectation (Nadaraya Watson)

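Consider estimation of E[Y∣X]. To this end, we have an iid sample (Y1, X1), …, (Yn, Xn). The kernel estimation method applied to this case leads to

$$\hat E[Y \mid X = x] = \frac{\sum_{i=1}^n Y_i\, K\left(\frac{X_i - x}{h}\right)}{\sum_{i=1}^n K\left(\frac{X_i - x}{h}\right)}$$

Indeed, we have $E[Y \mid X = x] = \int y\, f_{X,Y}(x,y)/f_X(x)\,dy$. Hence if we consider the kernel estimators

$$\hat f_{X,Y}(x,y) = \frac{1}{nh^2}\sum_{i=1}^n K\left(\frac{x - X_i}{h}\right) K\left(\frac{y - Y_i}{h}\right), \qquad \hat f_X(x) = \frac{1}{nh}\sum_{i=1}^n K\left(\frac{x - X_i}{h}\right)$$

a plug-in estimator is

$$\int y\, \frac{\frac{1}{h}\sum_{i=1}^n K\left(\frac{x - X_i}{h}\right) K\left(\frac{y - Y_i}{h}\right)}{\sum_{i=1}^n K\left(\frac{x - X_i}{h}\right)}\,dy = \frac{\sum_{i=1}^n K\left(\frac{x - X_i}{h}\right) \frac{1}{h}\int y\, K\left(\frac{y - Y_i}{h}\right) dy}{\sum_{i=1}^n K\left(\frac{x - X_i}{h}\right)}$$

and since $\frac{1}{h}\int y\, K\left(\frac{y - Y_i}{h}\right) dy = Y_i$ (change of variable $v = (y - Y_i)/h$, using that K is symmetric and integrates to one) we obtain the desired expression for Ê[Y∣X = x], recalling that $K\left(\frac{x - X_i}{h}\right) = K\left(\frac{X_i - x}{h}\right)$ by symmetry.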
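A minimal sketch of this estimator, assuming a Gaussian kernel (whose normalizing constant cancels between numerator and denominator) and simulated data; the function name and the regression function sin(x) are illustrative choices.

```python
import numpy as np

def nadaraya_watson(x0, x, y, h):
    """Kernel-weighted average of the Y_i with weights K((X_i - x0) / h)."""
    x0 = np.atleast_1d(np.asarray(x0, dtype=float))
    # Gaussian kernel weights, one row per evaluation point
    w = np.exp(-0.5 * ((x[None, :] - x0[:, None]) / h) ** 2)
    return (w @ y) / w.sum(axis=1)

rng = np.random.default_rng(0)
x = rng.uniform(-2.0, 2.0, 500)
y = np.sin(x) + 0.1 * rng.standard_normal(500)

m_hat = nadaraya_watson([0.5], x, y, h=0.2)
print(m_hat[0], np.sin(0.5))          # estimate versus the true regression function
```

The estimate is a weighted average of the Yi, so it always lies between the smallest and the largest observed Y.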
Controlling endpoints
Most of the results and discussions on kernel estimation use the MISE as criterion. As the MISE is an integrated criterion, it gives more weight to the central part of the distribution of the DGP. It has been documented for a while (in particular by Müller 1993) that endpoints may create very large bias and variance.

This is due to the fact that endpoints correspond to very extreme events which are, by definition, rarely observed. For these very rare cases, the behavior of the estimator is driven by extreme value theory, a domain in which the usual Central Limit Theorem does not apply.

There exist several ways to correct for this problem. The first one is to vary the width of the window in order to enlarge the window where the number of points is too small. The second amounts to trimming the sample, that is, discarding the most extreme points so as to ensure that the estimation always relies on a sufficiently large number of points.

In both cases, the study of the properties of kernel estimators is much more complex than in the "central" case. One should at least keep in mind that "optimal" kernel and bandwidth choices may lead to severe misrepresentation of the true DGP when extreme events are considered.

In some instances (in particular in insurance) tail behaviors are dealt with by specific parametric models. Fat-tailed distributions (Gumbel, Pareto, ...) are used as proxies when some underlying information is known. The parameter estimation of these distributions is also difficult (in particular in the time series context) and a good knowledge of their mathematical specificities is highly recommended.
Asymptotic behavior of kernel estimation for density (Parzen 1962)

Asymptotic behavior of the MISE has already been discussed, but we have just seen that this criterion is an averaged one and that specific characteristics may be difficult to estimate.
We shall now present a very classic result obtained by Parzen (1962). Let h(n) be a deterministic sequence such that $\lim_{n\to+\infty} h(n) = 0$; then

$$\lim_{n\to+\infty} E\left[\frac{1}{n h(n)}\sum_{i=1}^n K\left(\frac{x_0 - X_i}{h(n)}\right)\right] = f(x_0)$$

if f is continuous in a neighborhood of x0 and $\int K^2(u)\,du < \infty$. Moreover, under the same assumptions,

$$\lim_{n\to+\infty} n h(n)\, E\left[\left(\frac{1}{n h(n)}\sum_{i=1}^n K\left(\frac{x_0 - X_i}{h(n)}\right) - f(x_0)\right)^2\right] = f(x_0)\int K^2(u)\,du$$

The first result asserts that h(n) → 0 is a sufficient condition for the kernel estimator to be asymptotically unbiased at every given point of continuity, whereas the second asserts that consistency requires nh(n) → +∞.

We then recover the same message: desirable asymptotic properties require a bandwidth choice that is neither too large nor too small. In particular, notice that the "optimal" choice (from the MISE viewpoint), that is $h(n) = O(n^{-1/5})$, satisfies both requirements. Similar requirements appear for all kernel estimation problems.
Non parametric tests
Non-parametric testing problems arise when the set of distributions under the null hypothesis cannot be parametrized in a finite dimensional space. We have already presented two examples, but there are many more:
▸ H0: $E_{F_X}[X] = 0$
▸ H0: X1, …, Xn is an iid sequence
▸ H0: X1, …, Xn is a stationary sequence
▸ H0: $(X_t)_{t\in\mathbb{Z}}$ is an ARMA(2,1) process
▸ H0: ∃!x such that u < v < x ⇒ fX(u) < fX(v) and x < u < v ⇒ fX(u) > fX(v)
▸ H0: x → $E_{F_{(X,Y)}}[Y \mid X = x]$ is a monotonic function
▸ H0: x → $E_{F_{(X,Y)}}[Y \mid X = x]$ is a linear function

Clearly enough, all these questions cannot be answered with the same approach. Two general remarks may nevertheless be put forward:
1. The main question is not to provide a valid test procedure (because the "stupid" procedure is always available) nor a consistent procedure (because "always reject" is also admissible). The problem is to provide a procedure the level of which is at least controlled (even asymptotically) and which is somewhere consistent.
2. In many cases, the alternative hypothesis is much too large to ensure consistency of (asymptotically) valid test procedures, even in the pointwise sense. Hence many tests are built in order to provide consistency against some specific subset of distributions under the alternative (for instance the KPSS test in the stationarity example above).
Impossibility results

In several instances, it is impossible to provide a valid and (even somewhere) consistent procedure. This is the case, for instance, of the first example, as already discussed. Unimodality is another instance of the same problem.

This arises because the Neyman-Pearson approach to testing is related to the total variation distance between two probability measures, which is

$$TV(P,Q) = \sup_A |P(A) - Q(A)|$$

More precisely, assume we want to test that the DGP is P against the alternative that the DGP is Q; then the smallest achievable sum of type-one and type-two risks equals 1 − TV(P, Q).

Hence an α-level test for this simple testing problem has power at most α + TV(P, Q). It follows that if P and Q are arbitrarily close in the TV sense, the "stupid" test and the Neyman-Pearson procedure essentially coincide.

Unfortunately, in non parametric contexts it is often the case that a measure under the alternative can be found in any TV-neighborhood of any point under the null. In this case, the "stupid" test is Uniformly Most Powerful.
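The bound on the power used above follows in two lines for a non-randomized test; the rejection region is denoted A here, a notation introduced only for this sketch.

```latex
\begin{aligned}
\text{level:}\quad & P(A) \le \alpha,\\
\text{power:}\quad & Q(A) = P(A) + \bigl(Q(A) - P(A)\bigr)
   \le \alpha + \sup_{B}\,\lvert P(B) - Q(B)\rvert = \alpha + TV(P,Q),\\
\text{sum of risks:}\quad & P(A) + \bigl(1 - Q(A)\bigr) = 1 - \bigl(Q(A) - P(A)\bigr) \ge 1 - TV(P,Q).
\end{aligned}
```

So when TV(P, Q) is close to zero, no test at level α can have power noticeably above α, which is the power of the "stupid" test.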
Permutation tests

Permutation testing is a simple yet powerful way to derive non parametric test procedures that have the correct level and enjoy some power properties.
In practice, permutation and bootstrap procedures are similar and they are often presented as parts of a general approach. As we shall see, the similarity in practice does not carry over to the theory.
For permutation devices to be used, one needs
1. That each distribution under the null is invariant with respect to permutations of the indices
2. That the realization of the test statistic is not invariant with respect to permutations
3. That the distribution under the alternative hypothesis is not invariant with respect to permutations
For instance, the first statement is fulfilled under "H0: X1, …, Xn is an iid sequence". The second statement is not fulfilled by the arithmetic mean but it is by the rank statistic, for instance. The last statement is required so that the test is more powerful than the "stupid" one.

The procedure amounts to computing some given statistic Tn on the original data and
computing Tπ(n) for every permutation π of the indices. We then obtain a ranked sequence
T(1)∶n! ≤ T(2)∶n! ≤ … ≤ T(n!)∶n!. Pick any interval in this ranked sequence that
contains at least a proportion 1 − α of the sequence. Rejecting the null whenever the
statistic computed from the original data does not belong to this interval is a valid
test at level α.
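
The ranking step can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a reference implementation: the function names `permutation_test` and `trend_statistic`, the one-sided choice of acceptance interval (the smallest values of the ranked sequence) and the toy data are all hypothetical. The statistic Σ i·x_i is used only because its value changes when the indices are permuted.

```python
import itertools
import math


def trend_statistic(sample):
    """Sum of i * x_i: a statistic that is NOT invariant to permutations."""
    return sum((i + 1) * value for i, value in enumerate(sample))


def permutation_test(sample, statistic, alpha=0.05):
    """Rank the statistic over all n! permutations and test at level alpha.

    The acceptance interval is made of the smallest values of the ranked
    sequence and covers at least a proportion 1 - alpha of it.
    """
    observed = statistic(sample)
    ranked = sorted(statistic(p) for p in itertools.permutations(sample))
    n_accept = math.ceil((1 - alpha) * len(ranked))  # size of the interval
    threshold = ranked[n_accept - 1]                 # its upper end
    return observed, threshold, observed > threshold


# Hypothetical sample with an obvious upward trend (n = 6, hence 720 permutations)
observed, threshold, reject = permutation_test((1, 2, 3, 4, 5, 6), trend_statistic)
print(observed, threshold, reject)
```

Full enumeration is only practical for small n, since the ranked sequence has n! entries.
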
Example : ”lady tasting tea” (Fisher 1935)

A lady claims she is able to tell, by tasting a cup of tea with milk, whether the tea or
the milk was poured first. The null hypothesis is that she has no such ability.
Assume that, without the lady knowing it, the cups are labeled as follows: in cups 1 to 4
the milk was poured first, whereas it was the tea in cups 5 to 8. The lady then tastes
each of the 8 cups and provides her answers.

If she answers at random, declaring four cups as milk first and four as tea first, there
are C(8,4) = 70 equally likely answers. There is 1 possibility without an error:
(m,m,m,m,t,t,t,t). There are 16 possibilities with one error (you must exchange one out
of four 'm' cups for one out of four 't' cups, hence 4x4=16), (4x3/2)x(4x3/2)=36 ways to
make two errors, 16 ways to make three (because this amounts to getting one right) and
one way to get it all wrong: (t,t,t,t,m,m,m,m).

Then, depending on her score, here are the corresponding p-values:

number of errors    0       ≤1       ≤2       ≤3       ≤4
p-value             1/70    17/70    53/70    69/70    1

Instead of using a combinatorial argument, these quantities may also be obtained by
computing the statistic (here the number of errors) for all possible permutations.
Notice that in this case we have 8!=40320 possible permutations, but there are only 5
possible different values for the statistic. Hence the combination formula saves time.
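
Both routes can be compared directly. The sketch below is an illustration only: the encoding of the cups as a tuple of 'm' and 't' labels and the helper names `n_errors` and `cumulative` are assumptions made for the example. It counts the errors once over the 70 ways of choosing four 'm' cups and once over all 8! permutations of the labels, then builds the cumulative proportions that serve as p-values.

```python
from collections import Counter
from fractions import Fraction
from itertools import combinations, permutations

# True labels: milk first in cups 1 to 4, tea first in cups 5 to 8
truth = ("m", "m", "m", "m", "t", "t", "t", "t")


def n_errors(answer):
    """Number of milk-first cups that the answer declares as tea first."""
    return sum(1 for a, t in zip(answer, truth) if t == "m" and a != "m")


# Combination argument: choose which 4 of the 8 cups are declared 'm'
counts_comb = Counter()
for milk_cups in combinations(range(8), 4):
    answer = tuple("m" if i in milk_cups else "t" for i in range(8))
    counts_comb[n_errors(answer)] += 1

# Brute force: all 8! orderings of the labels (each answer appears 4! x 4! times)
counts_perm = Counter(n_errors(p) for p in permutations(truth))


def cumulative(counts):
    """P(number of errors <= k) for k = 0, ..., 4 as exact fractions."""
    total = sum(counts.values())
    running, out = 0, []
    for k in range(5):
        running += counts[k]
        out.append(Fraction(running, total))
    return out


p_values = cumulative(counts_comb)
print([str(p) for p in p_values])
print(cumulative(counts_perm) == p_values)
```

The brute-force loop visits 40320 orderings to recover the same five proportions that the 70 combinations give.
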
Testing by permutation in the linear model

First assume that we consider a simple linear model E[Y∣X] = βX with Y − E[Y∣X]
iid. Consider testing β = 0.

Under the null hypothesis, Y is iid conditionally on X so that, if we keep X unchanged,
all orderings of the Y sequence are equally probable.

We may then compute r²n, the square of the correlation coefficient, both for the original
sample and for all of the samples obtained by permuting Y while keeping X fixed. If this
statistic is too large compared to its permuted equivalents, we reject the null.

More precisely, we compute r²(1)∶n! ≤ … ≤ r²(n!)∶n! and we reject the null at level α
whenever the original result is larger than the empirical 1 − α quantile of this ranked
sequence.
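
For a small sample this can be carried out exhaustively. The following sketch assumes NumPy is available; the data, the function names `r_squared` and `exact_permutation_pvalue` and the tolerance `tol` (used to guard against floating-point ties) are hypothetical choices for illustration. Instead of the quantile, it reports the proportion of permuted samples whose r² is at least as large as the original one; rejecting when this proportion is at most α also gives a test of level α.

```python
import itertools

import numpy as np


def r_squared(x, y):
    """Square of the empirical correlation coefficient between x and y."""
    r = np.corrcoef(x, y)[0, 1]
    return r * r


def exact_permutation_pvalue(x, y, tol=1e-12):
    """Permute Y over all n! orderings while keeping X fixed."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    observed = r_squared(x, y)
    permuted = np.array([
        r_squared(x, y[list(perm)])
        for perm in itertools.permutations(range(len(y)))
    ])
    # Proportion of orderings with a statistic at least as large as the original
    p_value = float(np.mean(permuted >= observed - tol))
    return observed, p_value


# Hypothetical data with n = 7, hence 7! = 5040 permutations
x = [1, 2, 3, 4, 5, 6, 7]
y = [1.1, 1.9, 3.2, 3.8, 5.1, 6.3, 6.8]
observed, p_value = exact_permutation_pvalue(x, y)
print(observed, p_value)
```

The identity permutation is part of the enumeration, so the reported proportion is never below 1/n!.
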
Some general considerations about permutation tests
Permutation is one of the very few general devices that may be used to derive valid
test procedures in quite general non parametric contexts. Yet the power of the
procedure depends on the statistic used to perform the test.

For instance, in the first example, if we simply decide on the basis of whether the
lady gets it right on the first cup she tastes, the experiment will not be very convincing.

Similarly, in the second example, notice we reject when the statistic computed from
the original dataset is too large because 0 is the smallest achievable value of the
square of the correlation coefficient. Hence under the alternative, we expect this
statistic to be large.

Notice that, contrary to the bootstrap, the justification of the method does not rely on
any asymptotic argument. Also, we do not need the statistic to be pivotal under the
null. Finally, permutation tests are exact (but, contrary to what is claimed by
Wikipedia, exact tests are not always obtained by permutation).

Of course, when n is large n! is very large and computing all of the permutations is not
feasible. But such a large number of computations is not required. In the second
example, if we pick 19 permutations at random and reject the null whenever the
statistic computed from the original data is larger than all of its permuted
counterparts, we still get a valid 5%-level test because under the null this sequence is
exchangeable. This is an instance of a much more general principle that will be
discussed in the next sequence.
This spreadsheet makes it possible to implement this procedure.
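
The 19-permutation rule is also easy to sketch in Python. The code below is a hedged illustration, assuming NumPy: the function name `reject_with_random_permutations`, the simulated data, the seed and the number of replications are arbitrary choices for the example. It applies the rule once to data with a strong linear signal, then estimates the rejection frequency when Y is independent of X, which should be close to 1/20.

```python
import numpy as np


def r_squared(x, y):
    """Square of the empirical correlation coefficient between x and y."""
    r = np.corrcoef(x, y)[0, 1]
    return r * r


def reject_with_random_permutations(x, y, rng, n_perm=19):
    """Reject when the original r^2 exceeds all n_perm randomly permuted ones."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    observed = r_squared(x, y)
    permuted = [r_squared(x, rng.permutation(y)) for _ in range(n_perm)]
    return bool(observed > max(permuted))


rng = np.random.default_rng(0)  # arbitrary seed, for reproducibility only
x = np.arange(30, dtype=float)

# Hypothetical data under the alternative: strong linear signal plus noise
y_signal = 2.0 * x + rng.normal(size=30)
reject_strong = reject_with_random_permutations(x, y_signal, rng)

# Hypothetical data under the null: Y drawn independently of X, many times
n_rep = 400
rejections = sum(
    reject_with_random_permutations(x, rng.normal(size=30), rng)
    for _ in range(n_rep)
)
rate = rejections / n_rep
print(reject_strong, rate)
```

With 19 random permutations the original statistic is one of 20 exchangeable values under the null, which is why the rule has level 5% whatever the sample size.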
