Econ 4
master MBFA/APE
Jouneau-Sion Frédéric
frederic.jouneau(at)univ-lyon2.fr
Roadmap
▸ Before considering more general settings it is best to start with the simplest
possible case, namely the linear (regression) model.
▸ According to countless presentations (both from academic and non academic
sources), the linear model may be written as
Y = a + bX + ϵ
▸ Clearly ϵ does not appear in the –claimed– correct presentation. When (1) holds
true then ϵ may be defined as Y − E [Y ∣X ] in which case its denomination as
“error term” is plainly justified.
▸ Remark however that this is a forecasting error. It follows that ϵ does not differ
from X because it is unobserved. It is a random variable which is not –yet–
realized. It shall also be stressed that the definition of ϵ involves both Y and X in
an asymmetrical way.
▸ Although defining ϵ may be useful as a shortcut in some proofs and models, it is
clear that many important properties derive directly from (1), in particular the
unbiasedness of the OLS estimator and
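\[ \frac{\partial E[Y \mid X = x]}{\partial x} = b \quad \text{when } X \text{ is a continuous variable} \]
\[ \frac{\Delta E[Y \mid X = x]}{\Delta x} = b \quad \text{when } X \text{ is a discrete variable} \]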
\[ f_{X,Y}(x, y) = f_{Y \mid X = x}(y) \times f_X(x) \]
1
Notice the interpretation of the parameters involved may then be more complex. For instance if
E [Y ∣X = x] = a + bx + cx² then b cannot be interpreted ”other things left equal”.
Parameters matter
▸ Let us first consider the simplest omitted variables case. Without any reference to
error term, it could be presented as follows
E [Y ∣X , W ] = a + bX + cW
Cov [Y − bX − cW ; Z ] = 0
▸ The above argument shows that the IV estimate is the correct measure of the
slope of the conditional expectation of Y given observed variables and unobserved
variables that may be correlated with X as long as they are uncorrelated with
the instrument (and provided the conditional expectation is linear).
▸ Considering another instrument changes the parameter of interest since the
–linear– conditional expectation is not the same one. Hence, choosing an
instrument merely consists in deciding which forecast you are interested in.
▸ Of course, choosing an instrument which is uncorrelated with ”several” plausible
explanations while still correlated with the effect under study strengthens your
argument. As a drawback such an instrument may be weakly correlated with X ,
leading to identification issues, but you may face another problem.
▸ Let us compare the interpretation of b when W is observed and when it is not. In
the first case the OLS estimate may be computed and it measures the average
effect of X on Y keeping W fixed. In the second case the IV estimator measures
the effect of X while keeping fixed an unknown set of variables, precisely those
affecting the expectation of Y and which are uncorrelated with the instrument Z .
▸ If this set of variables is very large, then the interest of the measure may fade
away since the experiment in which everything else is kept fixed may lack
practical interest.
The error–in–variable case
▸ The case for error–in–variable may be presented as follows 3 :
E [Y ∣X ∗ ] = a + bX ∗
but OLS cannot be applied directly for X ∗ is not observed. Rather we observe
X = X ∗ + η (and of course, here η is not observed).
▸ Notice we introduce an error here, but the situation is pretty different. First, the
link between observed and unobserved variables contains no parameter. Second,
the error has now a clear interpretation, whatever the values of the parameters of
interest.
▸ As soon as Var [η] > 0, the naive OLS estimator is biased 4 since, when η is
uncorrelated with X ∗ and with Y − bX ∗ , it equals
\[ \frac{Cov[Y, X]}{Var[X]} = b \left( 1 + \frac{Var[\eta]}{Var[X^*]} \right)^{-1} \]
▸ For Z to be an instrument, you need Cov [Y − bX ∗ ; Z ] = 0 which translates to
Cov [bη + Y − bX ; Z ] = 0
▸ As the above condition involves the nuisance parameter Cov [η; Z ], it must be
null for the ratio of covariances to provide a correct measure.
3
We focus on the one–dimensional case, as adaptation to the multiple variables case is straightforward.
4
Be careful, the famous ”toward zero bias” statement may not be true if more than one variable is measured
with error.
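The attenuation effect can be checked by simulation. The sketch below is only an illustration under assumptions that are not in the slides: Gaussian draws, a hypothetical slope b = 2, Var[X∗] = Var[η] = 1, and the helper names `ols_slope` and `simulate_attenuation` are made up for the example. Under these assumptions the ratio Cov[Y, X]/Var[X] should be close to b/2 rather than b.

```python
import random

def ols_slope(x, y):
    """Empirical Cov[y, x] / Var[x], i.e. the naive OLS slope."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var = sum((a - mx) ** 2 for a in x)
    return cov / var

def simulate_attenuation(b=2.0, var_eta=1.0, n=20000, seed=0):
    """Regress Y on the mismeasured X = X* + eta (all values are hypothetical)."""
    rng = random.Random(seed)
    x_star = [rng.gauss(0.0, 1.0) for _ in range(n)]          # unobserved regressor
    y = [1.0 + b * xs + rng.gauss(0.0, 0.5) for xs in x_star]  # E[Y | X*] = 1 + b X*
    x = [xs + rng.gauss(0.0, var_eta ** 0.5) for xs in x_star]  # observed, with error
    return ols_slope(x, y)

slope = simulate_attenuation()
print(slope)  # expected to be near b / (1 + Var[eta] / Var[X*]) = 1, not near b = 2
```

Setting `var_eta=0.0` in this sketch removes the measurement error, and the slope then moves back towards b.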
Simultaneous equations models: casual introduction
▸ We shall now consider simultaneous equations models. These models are
presented using explicit error terms and emphasis is put on ”exogenous” and
”endogenous” variables. Also, as considerable attention has been given to
relationships between ”structural form” and ”reduced form” parameters, our
second argument above (namely that the choice of the parameters of interest is an important part of
the model) is also relevant.
▸ To illustrate our argument in this case, let us consider the following usual
demand/(inverse) supply ”model” 5
\[
\begin{cases}
q = \alpha - \beta p + \gamma Z + u & (D) \\
p = a + bq + dW + v & (S)
\end{cases}
\]
where q is the logarithm of the quantity, p is the logarithm of the price and Z
(resp. W ) is some ”exogenous effect” affecting the demand (resp. supply) curve,
u, v being uncorrelated ”error terms”. As we follow here the usual presentation,
we shall assume that β > 0 and b > 0, since price is expected to affect negatively
quantity and quantity is expected to affect positively prices (more on this
seemingly harmless assumption below).
▸ This setup explains why ”equation per equation” OLS estimation is affected by
simultaneity bias. Indeed, the reduced form writes
\[
\begin{cases}
p = \dfrac{b(\alpha + \gamma Z) + a + dW + bu + v}{1 + b\beta} & (RP) \\[1ex]
q = \dfrac{-\beta(a + dW) + \alpha + \gamma Z + u - \beta v}{1 + b\beta} & (RQ)
\end{cases}
\]
Hence, Cov [p, u] > 0 and Cov [q, v ] < 0 whenever Cov [u, v ] = 0.
5
Recall as above that the term ”model” is not really appropriate, we use it for the sake of argument developed
below.
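The signs of Cov[p, u] and Cov[q, v] implied by the reduced form can be illustrated numerically. The following sketch draws Z, W, u and v as independent standard Gaussian variables and uses arbitrary parameter values (α = 10, β = 1, γ = 1, a = 1, b = 0.5, d = 1); these choices and the function names are assumptions made for the example, not values from the course.

```python
import random

def sample_cov(x, y):
    """Empirical covariance of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / n

def simulate_market(n=5000, alpha=10.0, beta=1.0, gamma=1.0,
                    a=1.0, b=0.5, d=1.0, seed=1):
    """Draw (p, q, u, v) from the reduced form (RP)-(RQ) of the system (D)-(S)."""
    rng = random.Random(seed)
    p, q, u, v = [], [], [], []
    denom = 1.0 + b * beta
    for _ in range(n):
        z, w = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)    # exogenous shifters
        u_i, v_i = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)  # uncorrelated errors
        p_i = (b * (alpha + gamma * z) + a + d * w + b * u_i + v_i) / denom
        q_i = (-beta * (a + d * w) + alpha + gamma * z + u_i - beta * v_i) / denom
        p.append(p_i); q.append(q_i); u.append(u_i); v.append(v_i)
    return p, q, u, v

p, q, u, v = simulate_market()
cov_pu = sample_cov(p, u)  # theory: b * Var[u] / (1 + b * beta) > 0
cov_qv = sample_cov(q, v)  # theory: -beta * Var[v] / (1 + b * beta) < 0
print(cov_pu, cov_qv)
```

Since p is correlated with u, regressing q on p and Z by OLS does not recover −β in this design, which is the simultaneity bias discussed above.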
Simultaneous equations models: interpretation of structural parameters
▸ In the most general form, if we have m equations and the sample size is n , the
inference problem may be presented as follows. Let Y be the n × m matrix of
endogenous variables and X be the n × k matrix of exogenous variables; we have
E [YΓ∣X] = XB
where Γ is an unknown non singular matrix whose diagonal elements are all equal
to 1 and B is an unknown k × m matrix.
▸ Post–multiplying by Γ⁻¹ we obtain a multivariate linear model
\[ E[Y \mid X] = X B \Gamma^{-1} = X \Theta \]
▸ The model being multivariate is not really an issue here since any multivariate
linear model may be represented as a usual univariate linear model by means of
the vec () operator.
▸ The real problem is to extract the parameters of interest B and Γ from the
estimation of Θ. This is clearly not always feasible. For instance if B = 0 then Γ is
not identifiable. Also if M is a size-m non singular matrix, the couples (B, Γ) and
(BM, ΓM) lead to the same value of Θ, since BM(ΓM)⁻¹ = BΓ⁻¹.
▸ Identification restrictions are clearly needed. Necessary ones require k to be large
enough (or more precisely that the rank of B be large enough) relative to m. Common
sufficient ones are known as exclusion restrictions in the SEM literature.
Supplementary material (1) : approximation of non linear model by a linear
regression
\[ \left| \beta_{OLS} - \left. \frac{\partial E[Y \mid X = x]}{\partial x} \right|_{x = E[X]} \right| \le M \, \frac{E\left[ |X - E[X]|^3 \right]}{2} \]
Supplementary material (2) : Simultaneity bias and forecasting
▸ If we consider the following model
E [q∣p, Z ] = α − βp + γZ
E [p∣q, W ] = a + bq + dW
X1 = b1 X2 + U1
X2 = b2 X1 + U2
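\[ E[Y_t] = m \quad \forall t \in \mathbb{Z} \]
\[ E[Y_t^2] = \sigma^2 \quad \forall t \in \mathbb{Z} \]
\[ Cov[Y_t, Y_{t+h}] = \gamma(h) \quad \forall t \in \mathbb{Z} \]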
▸ For any given weakly stationary process we may define the
autocovariance function h → γ(h).
Remarks
▸ Notice that strong stationarity does not imply the existence of second
order moments. Hence, strong stationarity logically implies weak
stationarity only within the set of square integrable processes.
▸ Strong (resp. weak) white noise processes are strongly (resp. weakly)
stationary.
▸ Deterministic stationary processes are constant processes (show it).
▸ Any square integrable MA(1) process is weakly stationary (show it).
▸ Any strong MA(1) process is strongly stationary. (show it).
▸ If (ϵt )t∈Z is a strong white noise such that Var [ϵ1 ] > 0, then the
process such that X0 = 0 and Xt = Xt−1 + ϵt for all t > 0 is a random
walk. Random walks are non stationary processes (show it when
E [ϵ1 ] ≠ 0 and also when E [ϵ1 ] = 0).
▸ Non stationary processes are often met in economics and finance.
They are closely related to growth, inflation, saving behaviors... As a
consequence, macro and financial data sets are often corrected to
ensure stationarity before being used in econometric models. Such
pre-treatments include discounting, de-seasonalization, filtering, etc.
MA(∞) processes
▸ For any square integrable process (Yt )t∈Z one can define E [Yt ∣Y t ].
▸ The process Yt − E [Yt ∣Y t ] is the innovation process associated
with -or simply “of”- (Yt )t∈Z .
▸ The innovation process associated with any strongly stationary, square
integrable process is a strong white noise (we admit this property).
▸ Linear innovations are defined accordingly, with linear conditional
expectations instead of conditional expectations.
▸ The linear innovation process associated with any weakly stationary
process is a weak white noise (we admit this property).
▸ Beware: innovation processes are mainly theoretical objects since we
cannot run the regression, as the complete set of past data is not
available.
▸ If Yt = ρYt−1 + ϵt is a square integrable stationary AR(1) process,
then ϵt is the innovation process of Yt .
Wold’s decomposition
▸ This property implies that we can “view” the “random part” of any
weakly stationary process as an MA sequence of past “shocks”. It is
the main justification for modeling time series by means of ARMA
models.
▸ In practice this property is difficult to use directly since (an )n∈N is an
infinite unknown sequence, but many proofs are based on this
decomposition. Mathematically, it shows that the set of weakly
stationary processes is the same as that of scalar products of
square-summable sequences of reals with weak white noise processes.
Autocorrelation and Moving Average processes
Let us consider a weakly stationary process with strictly positive variance.
▸ We recall the autocovariance function is defined by
γ(h) = Cov [Y0 , Yh ]
▸ The autocorrelation function is defined by ρ(h) = γ(h)/γ(0).
▸ By the Schwarz inequality we have ρ(h) ∈ [−1, 1] for all h.
▸ If (ϵt )t∈Z is a strong (resp. weak) white noise and θ1 , . . . , θq is a
vector of size q such that θq ≠ 0 then the process defined by
\[ Y_t = \epsilon_t + \sum_{k=1}^{q} \theta_k \epsilon_{t-k} \]
for all t is a strong (resp. weak) order-q moving average process (or
simply an MA(q) process).
▸ A weakly stationary process Yt admits an MA(q) representation whenever there
exist a vector θ1 , . . . , θq of size q and a white noise process such
that the above formula holds.
▸ The autocovariance function (hence the autocorrelation function) of an MA(q) process is such that
γ(h) = 0 whenever h > q.
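The cut-off of the autocovariance function after lag q follows from a direct computation: with θ0 = 1 and shocks of variance σ², γ(h) = σ² ∑ θj θj+h, the sum running over the indexes j such that both j and j + h lie between 0 and q. The sketch below evaluates this formula; the function name and the MA(3) coefficients (0.5, 0.3, 0.2) are hypothetical choices made for the example.

```python
def ma_autocovariance(thetas, h, sigma2=1.0):
    """Autocovariance at lag h of Y_t = eps_t + sum_k thetas[k-1] * eps_{t-k}."""
    coeffs = [1.0] + list(thetas)   # theta_0 = 1 by convention
    h = abs(h)
    if h >= len(coeffs):            # lag beyond q: no common shock, covariance is zero
        return 0.0
    return sigma2 * sum(coeffs[j] * coeffs[j + h] for j in range(len(coeffs) - h))

thetas = [0.5, 0.3, 0.2]            # hypothetical MA(3) coefficients
gammas = [ma_autocovariance(thetas, h) for h in range(6)]
rhos = [g / gammas[0] for g in gammas]   # autocorrelations rho(h) = gamma(h) / gamma(0)
print(gammas)
print(rhos)
```

In this example the autocovariances at lags 4 and 5 are exactly zero, while lags 1 to 3 are not.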
Partial autocorrelation and Autoregressive processes
▸ If (ϵt )t∈Z is a white noise and 0 < ∣ρ∣ < 1 we already considered the order–1
autoregressive process Yt = ρYt−1 + ϵt .
▸ Remark, contrary to the MA processes the process Yt is implicitly defined as a
solution of a recursive equation. We must then show that this equation admits
solutions. We can write
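\[
\begin{aligned}
Y_t &= \rho(\rho Y_{t-2} + \epsilon_{t-1}) + \epsilon_t = \rho^2 Y_{t-2} + \rho \epsilon_{t-1} + \epsilon_t \\
&= \rho^2(\rho Y_{t-3} + \epsilon_{t-2}) + \rho \epsilon_{t-1} + \epsilon_t
\end{aligned}
\]
\[ Y_t = \sum_{i=0}^{+\infty} \rho^i \epsilon_{t-i} \]
\[ \rho(h) = \rho^h \]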
▸ Let us consider (ϵt )t∈Z a weak white noise and the random walk
Yt = Yt−1 + ϵt
▸ In this case, we have 1 + Φ(x) = 1 + x. The unique root is −1 and its
modulus is 1. As a consequence, there is no stationary solution in the
L2 sense.
▸ Indeed, if we had a solution it would have a constant variance. Now
Var [Yt ] = Var [Yt−1 ] + Var [ϵt ] > 0
and the equation x = x + Var [ϵt ] admits no solution if Var [ϵt ] > 0.
▸ The larger the modulus of the roots of polynomial 1 + Φ(x) the faster
the damping of past shocks (as we can easily see by simulation).
▸ Moreover, the fact that some roots admit a non-zero imaginary part
entails cyclical behaviors.
ARMA Modeling and unit Root issues
ARMA models
▸ Using a similar trick as in the AR case, notice we may write any MA process as
Yt = θ(L)ϵt
ϕ(L)Yt = θ(L)ϵt
where ϕ() is a polynomial such that ϕ(0) = 1 and whose roots are outside the unit
circle and θ() is a polynomial such that θ(0) = 1.
▸ For such a process, the partial autocorrelation function vanishes exponentially fast
for values larger than the degree of ϕ() and the autocorrelation function vanishes
exponentially fast for values larger than the degree of θ().
▸ These processes have become increasingly popular after the seminal work by Box
and Jenkins. In particular ARMA processes often show up as solutions of optimal
response to shocks in macro-economic models. The AR part is mostly driven by
the law of motion of capital (together with time–to–build assumptions) whereas
the MA part results from smoothing effects due to risk aversion and discounting. In
the simplest DSGE case the aggregate quantities admit an ARMA(1,1)
representation. This is rather at odds with empirical findings since the impact of
shocks does not vanish exponentially fast. As a consequence, most DSGE models
include AR(1) type of shocks, with values of ρ smaller than 1 but close to it. This
also entails that the degree of the polynomial is strictly larger than 1, allowing for
complex roots and cyclical features (a key issue for endogenous cycles theory).
Estimation in ARMA processes
▸ Consider an ARMA(1,1) process
Yt = ϕ1 Yt−1 + ϵt + θ1 ϵt−1
Yt = ϵt + θ1 ϵt−1 + . . . + θq ϵt−q
(using the convention ∑j∈∅ = 0 and θ0 = 1). This function can be estimated using
the empirical correlations and θ1 , . . . , θq may be estimated by inversion of the
formula.
Estimation in ARMA processes –cont’–
▸ This approach is however not easy to implement. Consider an MA(3)
process; normalizing Var [ϵt ] = 1, we have
γ(1) = θ1 + θ1 θ2 + θ2 θ3
γ(2) = θ2 + θ1 θ3
γ(3) = θ3
Hence
θ3 = γ(3)
θ2 = γ(2) − θ1 γ(3)
and
γ(1) = θ1 + (θ1 + γ(3))(γ(2) − θ1 γ(3))
More generally, as the link between γ(.) and θ(.) is non linear,
inverting the system is not a trivial task.
▸ This method can be applied to pure AR processes. The values of the
autocorrelation function are directly related to the coefficients of the
Φ() polynomial. In this case, one can express this relationship as a
recursive sequence of linear systems (these are the famous
Yule-Walker equations).
▸ If both AR and MA coefficients are present, inverting the link between
ACF/PACF functions and the coefficients of Φ() and Θ() leads to
very complex systems even when the degrees are moderate.
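The contrast between the two pure cases can be made concrete. In the MA(3) example, substituting θ3 = γ(3) and θ2 = γ(2) − θ1 γ(3) into γ(1) = θ1 + θ2 (θ1 + θ3) gives a quadratic equation in θ1, so the inversion may have several solutions. For an AR(2) written as Yt = ϕ1 Yt−1 + ϕ2 Yt−2 + ϵt (a sign convention assumed here for the example), the Yule-Walker equations ρ(1) = ϕ1 + ϕ2 ρ(1) and ρ(2) = ϕ1 ρ(1) + ϕ2 form a linear system. The sketch below solves both; the function names and numerical values are hypothetical, and the MA part assumes unit shock variance.

```python
import math

def invert_ma3_autocovariances(g1, g2, g3):
    """Return every real (theta1, theta2, theta3) matching gamma(1), gamma(2), gamma(3)."""
    # Quadratic in theta1: -g3 * t^2 + (1 + g2 - g3^2) * t + (g2 * g3 - g1) = 0
    a = -g3
    b = 1.0 + g2 - g3 ** 2
    c = g2 * g3 - g1
    if a == 0.0:                       # gamma(3) = 0: the equation is linear in theta1
        t1_values = [-c / b]
    else:
        disc = b * b - 4.0 * a * c
        if disc < 0.0:
            raise ValueError("no real solution for these autocovariances")
        root = math.sqrt(disc)
        t1_values = [(-b + root) / (2.0 * a), (-b - root) / (2.0 * a)]
    return [(t1, g2 - t1 * g3, g3) for t1 in t1_values]

def yule_walker_ar2(rho1, rho2):
    """Solve the 2x2 Yule-Walker system for (phi1, phi2) by Cramer's rule."""
    det = 1.0 - rho1 ** 2
    phi1 = (rho1 - rho1 * rho2) / det
    phi2 = (rho2 - rho1 ** 2) / det
    return phi1, phi2

# MA(3) with theta = (0.5, 0.3, 0.2): gamma(1..3) = (0.71, 0.4, 0.2)
ma_solutions = invert_ma3_autocovariances(0.71, 0.4, 0.2)
print(ma_solutions)   # two candidate parameter vectors, only one of them the original

# AR(2) with phi = (0.5, 0.3): rho(1) = phi1 / (1 - phi2), rho(2) = phi1 * rho(1) + phi2
rho1 = 0.5 / (1.0 - 0.3)
rho2 = 0.5 * rho1 + 0.3
ar_solution = yule_walker_ar2(rho1, rho2)
print(ar_solution)
```

The MA inversion returns two parameter vectors that reproduce the same three autocovariances, whereas the AR system has a unique solution as soon as ρ(1)² ≠ 1.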
Estimation in ARMA processes –finish–
▸ The previous methods are, in general, not asymptotically efficient.
They are then used as a first step in more elaborate numerical methods.
The current first best approach is to compute a pseudo-likelihood, as
if the shocks were a strong, Gaussian white noise.
▸ To this end, we must write the log-likelihood of the available sample.
This can be done by inverting the relationship which defines the
ARMA process in order to compute the shocks as functions of Yt and
the parameters.
ϵt = Yt − ϕ(L)Yt − Θ(L)ϵt
▸ The previous equation defines the shocks in a recursive way, once
Y0 , . . . , Y−p , and ϵ0 , . . . , ϵ−q are known. As they are not, two
simplifications can be proposed
● Limited Information Likelihood : only the final part of the likelihood is computed
(from ϵp , . . . , ϵT ) and the unknown shocks are set to zero (their unconditional mean
value)
● Full Information : missing parts are replaced by their best forecasts as functions of
known values and parameters.
▸ As the first technique is easier to implement, it is often used as a
first step before possibly implementing the second one, which is
more demanding.
▸ Both techniques require specific optimization routines and
convergence is often slow, especially if the degrees of the polynomials are
large.
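The limited information idea can be sketched for the ARMA(1,1) case Yt = ϕ1 Yt−1 + ϵt + θ1 ϵt−1. The code below sets the unknown initial shock to zero, extracts the remaining shocks recursively and evaluates the Gaussian pseudo log-likelihood conditional on the first observation. It is an illustration only: the function names, the absence of a constant term and the numerical values are assumptions, and a real estimation would pass this criterion to a numerical optimizer.

```python
import math

def arma11_residuals(y, phi, theta, eps0=0.0):
    """Recursively extract eps_1..eps_T from y_0..y_T, with the unknown eps_0 set to eps0."""
    eps = []
    prev_eps = eps0
    for t in range(1, len(y)):
        e = y[t] - phi * y[t - 1] - theta * prev_eps
        eps.append(e)
        prev_eps = e
    return eps

def conditional_gaussian_loglik(y, phi, theta, sigma2):
    """Gaussian pseudo log-likelihood of the extracted shocks, conditional on y[0]."""
    eps = arma11_residuals(y, phi, theta)
    n = len(eps)
    return -0.5 * n * math.log(2.0 * math.pi * sigma2) - sum(e * e for e in eps) / (2.0 * sigma2)

# Build a short series from known shocks (eps_0 = 0), then recover them.
phi, theta = 0.6, 0.4
true_shocks = [0.1, -0.2, 0.05]
y = [0.3]                      # arbitrary starting value y_0
previous = 0.0
for e in true_shocks:
    y.append(phi * y[-1] + e + theta * previous)
    previous = e

recovered = arma11_residuals(y, phi, theta)
loglik_true = conditional_gaussian_loglik(y, phi, theta, sigma2=1.0)
loglik_wrong = conditional_gaussian_loglik(y, -0.9, theta, sigma2=1.0)
print(recovered, loglik_true, loglik_wrong)
```

At the parameter values used to generate the series, the recursion recovers the shocks exactly; at other values it produces different residuals, and the likelihood compares them.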
Estimators’ properties
▸ If the associated white noise (ϵt )t∈Z admits a fourth-order moment and
the Φ() polynomial has roots outside the unit circle, the previous
estimators are consistent at speed $T^{-1/2}$ and asymptotically Gaussian.
▸ Numerically, reliable estimators for the MA part often require that the Θ()
polynomial have all of its roots outside the unit circle.
▸ Consistency of the variance co-variance matrix of the estimators
requires the existence of moments of higher orders (6 to 8 depending
on the technique used).
▸ The computation of the estimators, standard errors, tests, and the
like requires specific numerical optimization programs. These are well
documented in specialized software such as R, GRETL and SAS. Python and C
routines also exist; their performance can vary quite a lot with first
step parameters.
▸ The quality of the estimation typically decreases rapidly when roots
are close to unit modulus.
▸ Finally, it is of course vital not to underestimate the degrees of the
polynomials Θ() and Φ(), otherwise consistency is lost. As these are
often unknown, we need a way to get them from the data. This is the
so-called “identification” problem. (The term is misleading since p
and q are identifiable, in the statistical sense).
Identification in ARMA models
▸ The parameter set obviously depends on the degrees of the Φ() and
Θ() polynomials. Let p be the degree we choose for the former and
q for the latter. The problem is a double-edged one. If p and/or q is
too small, then some parameters are wrongly set to zero and no
estimation method can achieve consistency. Now if they are set at
too large a value, efficiency decreases relative to the case where the extra
parameters are correctly neglected. This last case –called overfitting– is
illustrated below.
▸ Overfitting : white noise fitted with ARMA(3,3) model
Parameter   Coefficient   Std. err.    z         p-value
const        0.0741117    0.0396156     1.8708   0.0614
ϕ1          −0.297077     0.235598     −1.2609   0.2073
ϕ2          −0.676338     0.0940068    −7.1946   0.0000
ϕ3          −0.590732     0.229936     −2.5691   0.0102
θ1           0.316387     0.220251      1.4365   0.1509
θ2           0.599366     0.104415      5.7403   0.0000
θ3           0.647322     0.207655      3.1173   0.0018
Identification –cont’–
▸ The most common method consists in looking at the estimated ACF and PACF. We
know that in the case of a pure AR(p) process the PACF is strictly zero for any h > p and
for a pure MA(q) process the ACF is strictly zero for any h > q.
▸ Hence if the ACF (resp. PACF) abruptly vanishes, then p = 0 (resp. q = 0) and q
(resp. p) is equal to the last value where the ACF (resp. PACF) is non zero.
▸ In the general case, PACF and ACF can display many different forms for low
values of h but when h > max{p, q} then both the ACF and the PACF converge
exponentially fast towards zero.
▸ The idea is then to detect a value r after which both ACF and PACF decrease
and try various combinations of integers p, q strictly smaller than r . The selected
model minimizes a criterion that penalizes both over-fitted models and the estimated
standard error of the shocks. The two most popular criteria are
\[ AIC: \ \log(\hat\sigma^2) + 2\,\frac{p+q}{T} \qquad \text{and} \qquad BIC: \ \log(\hat\sigma^2) + (p+q)\,\frac{\log(T)}{T} \]
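The two criteria, AIC = log(σ̂²) + 2(p + q)/T and BIC = log(σ̂²) + (p + q) log(T)/T, are simple to evaluate once each candidate model has been estimated. The sketch below assumes the estimated shock variances are already available; the helper `select_order` and the variances in the example are hypothetical and do not come from an actual fit.

```python
import math

def aic(sigma2_hat, p, q, T):
    """AIC as written above: log(sigma2_hat) + 2 (p + q) / T."""
    return math.log(sigma2_hat) + 2.0 * (p + q) / T

def bic(sigma2_hat, p, q, T):
    """BIC as written above: log(sigma2_hat) + (p + q) log(T) / T."""
    return math.log(sigma2_hat) + (p + q) * math.log(T) / T

def select_order(candidates, T, criterion=bic):
    """Pick the (p, q) with the smallest criterion; candidates maps (p, q) -> sigma2_hat."""
    return min(candidates, key=lambda pq: criterion(candidates[pq], pq[0], pq[1], T))

# Hypothetical estimated shock variances for three candidate orders, T = 100
candidates = {(0, 0): 1.00, (1, 1): 0.995, (3, 3): 0.99}
best_bic = select_order(candidates, T=100)
best_aic = select_order(candidates, T=100, criterion=aic)
print(best_bic, best_aic)
```

In this made-up example the small reduction in σ̂² obtained with six extra parameters does not offset the penalty, so both criteria retain the white noise model.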
Validation
▸ There are two major problems with the above approach. First, even in the “pure”
cases, as we only have access to the estimated ACF and PACF, the uncertainty
can result in over– or under–fitting. Most specialized software then provides
confidence bounds to check for non significant values. Second, detecting
exponential decay is not an easy task. Identification and estimation steps are then
completed with a final check called validation.
▸ The idea is to exploit the extracted shocks that are computed in the Full or Limited
Information Maximisation. If the model is correctly specified, they should
approximately behave as a white noise. Hence, the PACF and ACF of the extracted
shock sequence should be zero. This is often checked by the Ljung-Box statistic
whose null hypothesis is ρ(j) = 0 for 0 < j ≤ h. More precisely the rule is to reject
H0 (hence the entire approach) at level α whenever
\[ T(T+2) \sum_{j=1}^{h} \frac{\hat\rho(j)^2}{T-j} > \chi^2_{1-\alpha}(h) \]
where ρ̂(j) is the empirical autocorrelation of order j of the extracted shocks.
This procedure is known as the portmanteau test.
▸ The portmanteau test is often completed by two other procedures. The first one
aims at detecting heteroskedasticity –recall indeed the pseudo ML approach rests
on an iid assumption–. This is particularly important for financial data since
heteroskedasticity has important consequences for pricing methods. The most
common procedure is the Breusch-Pagan test.
▸ The second procedure is a stability test. It often rests on the so-called CUSUM
statistics
\[ S_t = (T-k)\,\frac{\sum_{j=k+1}^{t} \hat\epsilon_j}{\sum_{j=k+1}^{t} \hat\epsilon_j^2}, \qquad t = k+1, \dots, T \]
and on their counterpart S′t , built on the cumulated squares ∑ ϵ̂j² for j = k + 1, . . . , t, with t = k + 1, . . . , T.
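The Ljung-Box (portmanteau) statistic used in the validation step can be computed directly from the extracted shocks as T(T + 2) ∑ ρ̂(j)²/(T − j), the sum running over j = 1, . . . , h, with ρ̂(j) the empirical autocorrelation of order j. The sketch below is a plain implementation; the function names and the two test series are made up for the example, and the 5% critical value of the χ²(1) distribution (about 3.841) is hard-coded rather than computed.

```python
import random

def sample_autocorrelation(x, lag):
    """Empirical autocorrelation of order `lag` (mean removed, normalized by lag 0)."""
    n = len(x)
    m = sum(x) / n
    denom = sum((v - m) ** 2 for v in x)
    num = sum((x[t] - m) * (x[t + lag] - m) for t in range(n - lag))
    return num / denom

def ljung_box(x, h):
    """Ljung-Box statistic based on the first h empirical autocorrelations."""
    T = len(x)
    return T * (T + 2) * sum(sample_autocorrelation(x, j) ** 2 / (T - j)
                             for j in range(1, h + 1))

CHI2_95_DF1 = 3.841   # 95% quantile of the chi-square distribution with 1 degree of freedom

# A strongly autocorrelated "residual" sequence: alternating signs
alternating = [1.0, -1.0] * 50
q_alternating = ljung_box(alternating, h=1)

# A simulated Gaussian white noise, for comparison
rng = random.Random(0)
noise = [rng.gauss(0.0, 1.0) for _ in range(100)]
q_noise = ljung_box(noise, h=1)

print(q_alternating, q_alternating > CHI2_95_DF1)
print(q_noise)
```

For the alternating sequence ρ̂(1) = −0.99 and the statistic is far above the critical value, so the white noise hypothesis is rejected; for the simulated noise the statistic is typically small.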
Forecasting in ARMA models
▸ A key feature of the ARMA model is that forecasted values can easily be
computed, whatever the distribution of the underlying noise process.
▸ Moreover, the ML approach directly provides the forecasts since it is written as a
function of linear innovations.
▸ Specialized software then provides forecast values for any ARMA model, as well as
confidence intervals for these. Typically three types of forecasts for Ys are
distinguished
1. Ys is observed and it is used in the likelihood to provide estimations (in–sample forecast)
2. Ys is observed but it is NOT used in the likelihood to provide estimations
(out–of–sample forecast)
3. Ys is NOT observed (”pure” forecast)
▸ It is of particular importance to distinguish the first case from the last two. Indeed,
when Ys is used to provide estimations, the inference penalizes the forecasting error
associated with Ys , which is not the case in the last two cases. As a consequence,
the last two methods provide a more honest view of the actual forecasting
capabilities of the model. To put it the other way around, the in–sample forecast
errors are by construction smaller than their out–of–sample or pure equivalents.
▸ Forecasts, either out–of or in–sample, are mostly one-step ahead ones. However,
in the most general case, the forecast horizon can be larger than 1. This causes no
specific problem in ARMA models, but the practical issue is that out–of–sample
and pure predictions converge exponentially fast towards the stationary values as
the horizon increases.
▸ This is due to the fact that white noise processes always have a best prediction equal
to zero. Hence, in particular if the horizon is larger than q, the MA part plays no
role, and the stationarity of the AR part implies geometric convergence towards
the mean value. This convergence is faster as the roots of Φ() have larger modulus.
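The geometric convergence of multi-step forecasts is easy to see in the AR(1) case. Assuming the process is written as Yt − m = ρ(Yt−1 − m) + ϵt with ∣ρ∣ < 1 (a parametrization chosen for this example), the h-step ahead forecast made at date T is m + ρ^h (YT − m). The function name and the numbers below are hypothetical.

```python
def ar1_forecasts(last_value, rho, mean=0.0, horizon=10):
    """h-step ahead forecasts, h = 1..horizon, of a stationary AR(1) around `mean`."""
    return [mean + rho ** h * (last_value - mean) for h in range(1, horizon + 1)]

# Starting two units above a zero mean, with rho = 0.5
forecasts = ar1_forecasts(last_value=2.0, rho=0.5, mean=0.0, horizon=5)
print(forecasts)   # each forecast is half as far from the mean as the previous one

# A smaller |rho| (root of larger modulus) gives faster convergence
slow = ar1_forecasts(2.0, 0.9, horizon=5)
fast = ar1_forecasts(2.0, 0.2, horizon=5)
print(slow[-1], fast[-1])
```

The distance to the mean is multiplied by ρ at each additional step, which is why long-horizon forecasts from a stationary model carry little information beyond the mean.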
Unit root issues
▸ As we have already seen, when Φ() has a root inside the unit circle, distant
past shocks tend to have more impact on current values than close ones, and,
ultimately, realizations of the process are unbounded. This is of course at odds
with most –if not all– economic phenomena. But some economic models (both in
finance and macro-economics) are compatible with exact unit roots.
▸ Moreover, as shown by the seminal paper by Hall in the late 70’s, the predictions
obtained under the random walk hypothesis tend to provide better forecasts than
stationary ARMA adjustments. In particular, forecasting that yt+1 will be exactly
equal to yt is an almost unbeatable strategy (considering the fact that it is much easier
to compute than ARMA forecasts).
▸ There are many problems related to unit root issues.
● How to model such processes ?
● Can we test for the presence of unit root ?
● How to perform estimation ?
● What are the properties of such processes ?
ARIMA models
▸ In theory, unit roots of Φ() could have complex parts. However, as the
coefficients of Φ() are real, this would imply an even (that is at least 2) number
of such roots (more precisely any such root would have a conjugate part). Despite
intensive research efforts in this direction, evidence for complex unit root is weak.
▸ As a consequence, many models rest on the following ARIMA case
\[ (1 - L)^d Y_t = \frac{\Theta(L)}{\Phi(L)} \epsilon_t \]
where Φ(.) is a polynomial whose roots lie outside the unit circle, and d ≥ 1 is an
integer (the order of integration). In this case, the only unit root is exactly 1.
▸ In most cases, d is set to one. Again, evidence for I(2) integrated processes is
weak. Remark however that if a flow variable has order of integration 1, the
corresponding stock has order of integration 2.
▸ The inference is conducted as in the ARMA case, after stationarization, that is to
say, considering the process (1 − L)^d Yt instead of Yt .
▸ More precisely, order–1 differences Yt − Yt−1 are considered if d = 1, order–2 differences if
d = 2 (hence Yt − Yt−1 − (Yt−1 − Yt−2 )) and so on.
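The stationarization step amounts to applying the difference operator (1 − L) d times. A minimal sketch follows; the function name `difference` and the toy series are assumptions made for the illustration.

```python
def difference(y, d=1):
    """Apply (1 - L) d times; each pass shortens the series by one observation."""
    out = list(y)
    for _ in range(d):
        out = [current - previous for previous, current in zip(out, out[1:])]
    return out

series = [1, 4, 9, 16, 25]
first_diff = difference(series, d=1)    # Y_t - Y_{t-1}
second_diff = difference(series, d=2)   # Y_t - 2 Y_{t-1} + Y_{t-2}
print(first_diff, second_diff)

# Differencing a random walk started at zero gives back its shocks
shocks = [0.5, -1.0, 0.25, 2.0]
walk = [0.0]
for e in shocks:
    walk.append(walk[-1] + e)
print(difference(walk, d=1) == shocks)
```

Each differencing pass loses one observation, so a series of length T yields T − d usable values for the ARMA estimation.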
Unit root testing
▸ The major difficulty with the previous approach is to know whether we do indeed
face a unit root. Most unit root tests concern the case d = 1 vs d = 0. These are also
known as (non)stationarity tests. This problem has probably resulted in more
papers in econometric journals than any previous one.
▸ A “direct” approach would be to consider the following model
and test for ρ = 1. Now if the null hypothesis is ρ = 1, Dickey and Fuller have
shown that the “natural” estimator
\[ \frac{\widehat{Cov}_T[y, y_{-1}]}{\widehat{Var}_T[y]} \]
has a non standard (that is to say not directly related to Gaussian) distribution.
As a consequence, the unit root hypothesis cannot be tested by Student-type
procedures.
▸ Moreover, they also show that the associated test procedure would have zero
power against alternatives in which the non stationary behavior results from a
deterministic trend. They then proposed several testing procedures to account for the
presence or absence of deterministic trends such as linear or quadratic ones.
▸ Another approach is to test for stationarity. The most common procedure is that
proposed by Kwiatkowski-Phillips-Schmidt-Shin (the so-called KPSS test). Again
this test requires specific tables to be used and its power against specific
alternatives can be dramatically low.
Lesson 3 : The Maximum likelihood principle
A basic example
Assume we have a Bernoulli iid sample of size n denoted (y1 , . . . , yn ). We would like to
know the probability P(Y1 = 1) = p. Assume moreover 0 < p < 1 (we shall consider
this assumption in more details below). The probability of the event that is observed is
\[ \prod_{i=1}^{n} p^{y_i} \times (1-p)^{1-y_i} \]
\[ \frac{\partial L}{\partial p} = L(p) \times \left( -\Big(n - \sum_{i=1}^{n} y_i\Big)/(1-p) + \sum_{i=1}^{n} y_i / p \right) \]
It admits a single root $\hat p = \frac{1}{n}\sum_{i=1}^{n} y_i$ (the above expression seems to imply that 0 and 1
are roots of the above polynomial but it is not the case, check it). As a consequence,
since L(p) > 0 it is a concave function of p and the maximum is unique.
Finally notice that p̂ coincides with the empirical mean. Also notice that we could
consider 0 ≤ p ≤ 1. We obtain the same solution, except that the optimum is not
interior if y1 = y2 = . . . = yn (check this case).
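The Bernoulli example can be checked numerically: the log-likelihood ∑ yi log(p) + (n − ∑ yi ) log(1 − p) should be maximal at the empirical mean. The sketch below compares the closed-form solution with a brute-force search over a grid of values of p; the sample and the function names are invented for the illustration.

```python
import math

def bernoulli_loglik(p, ys):
    """Log-likelihood of an iid Bernoulli(p) sample, for 0 < p < 1."""
    s = sum(ys)
    n = len(ys)
    return s * math.log(p) + (n - s) * math.log(1.0 - p)

def bernoulli_mle(ys):
    """Closed-form maximum likelihood estimator: the empirical mean."""
    return sum(ys) / len(ys)

ys = [1, 0, 1, 1, 0, 1, 0, 1]            # hypothetical sample, mean 5/8
p_hat = bernoulli_mle(ys)

# Brute-force check on the grid 0.001, 0.002, ..., 0.999
grid = [k / 1000 for k in range(1, 1000)]
p_grid = max(grid, key=lambda p: bernoulli_loglik(p, ys))
print(p_hat, p_grid)
```

With a sample made only of ones (or only of zeros) the empirical mean is 1 (or 0), and `bernoulli_loglik` cannot be evaluated there, which is the non-interior case mentioned above.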
Maximum likelihood principle
The previous example is an instance of a much more general approach. The argument
runs as follows.
▸ The model is a set of probability distributions indexed by a parameter θ ∈ Θ.
(This indexation is considered bijective).8
▸ The observed event Z is the result of drawing by Nature according to one of
these probabilities corresponding to the unknown value θ0 .
▸ The model allows one to compute the probability of the realization of Z as a function
of θ.
▸ The value θ is “more likely” than θ′ if the probability of the realization of Z
computed for θ is larger than this probability computed for θ′ .
▸ The decision between different values of θ must be conducted according to the
preference relationship induced by the above definition of “more likely”.
The Maximum Likelihood Principle is generally presented as an inference device, in the
sense that it allows one to go back from ’effects’ (the observed sample) to ’cause’ (the
value of the parameter). This point raises disputes between Bayesian and frequentist
statisticians. It has also been challenged on methodological grounds.
The main arguments in favor of the maximum likelihood principle are twofold: it is a
principle that may be applied whatever the model, and it leads to consistent point
estimators in a large variety of cases.
8
If we consider the mapping from Θ to the probability distribution space, the surjective nature poses no problem
as Θ may be enlarged if necessary. Injectivity is much more demanding as it is ultimately linked to identification
problems, see below for details.
Consistency of the MLE (very informal argument)
Let P, Q, µ be three probability measures such that P and Q are absolutely continuous
with respect to µ, with densities p and q. The Kullback-Leibler divergence (hereafter KL divergence) of Q with respect to P is
\[ KL(Q \mid P) = \int p \log\left( \frac{p}{q} \right) d\mu \]
The previous argument is very informal notably because the mention of LLN is much
too loosely stated to be of any use.
First, we manipulate a function of θ. Hence, we need some kind of functional
equivalent of the LLN. Second, we also need to guarantee that the sequence of optima
(as n increases) does not diverge. This is often obtained (see below) by assuming Θ is
compact, but in practice it may not be the case. Finally, the limit criterion must exist
which entails some kind of uniform integrability condition. For the sake of argument,
the following result ensures strong consistency of the mle.
Proposition (Wald 1949) Let Θ be a compact space and consider a model {pθ ∣θ ∈ Θ}
with a dominating measure µ.
Let X1 , . . . Xn be an iid sample from pθ0 . Assume
The initial result by Wald has been extended in several directions. Notably,
compactness is not required, nor is (strict) uniform integrability. Note however that
(counter) examples in which the mle is not consistent exist (a famous example
provided by Ferguson (1982) relies on the case of a single, real parameter, but uniform
integrability fails).
Asymptotic efficiency of the MLE
Besides consistency, the MLE enjoys asymptotic efficiency in most ”regular” cases.
This property relies on two main results.
First, according to the Cramér-Rao bound the variance of any unbiased estimator is
bounded from below (an important result for its own sake). This comes from
differentiation of the unbiasedness relationship
Eθ0 [θ̂] = θ0
(where the last equality comes from the fact that E [S(X )] = 0, obtained by differentiating
∫ fθ dµ = 1 wrt θ). Hence using Cov²(X , Y ) ≤ Var (X )Var (Y ) we obtain
L′∞ (θ0 ) = 0
(that is we assume θ0 is an interior maximizer of the limit criterion and that FOC
holds). Now
\[ 0 = L_n'(\hat\theta_n) = L_n'(\theta_0) + L_n''(\tilde\theta_n)(\hat\theta_n - \theta_0) \]
for some θ̃n ∈ [min(θ̂n ; θ0 ); max(θ̂n ; θ0 )]. Hence
\[ \sqrt{n}(\hat\theta_n - \theta_0) = -\frac{\sqrt{n}\, L_n'(\theta_0)}{L_n''(\tilde\theta_n)} \]
By consistency of the mle the denominator converges a.s. to the Fisher Information
L∞''(θ0 ) whereas the numerator converges by the Central Limit Theorem to a centred
Gaussian distribution with asymptotic variance equal to Var [L′ (θ0 )], the variance of the score. Using again the
fact that the mle is a linear function of the score computed at the true value, we get
\[ N\left( 0, Var[L'(\theta_0)]^{-1} \right) \]
Some important remarks
The Maximum Likelihood is a widespread method whose main advantages rest on
asymptotic arguments : consistency, asymptotic normality, asymptotic efficiency and
flexibility (it does not require linearity). It should also be clear that it has some
drawbacks, notably
▸ Good asymptotic properties arise in regular cases. In particular, the solution
must be interior and locally unique.
▸ It is a parametric method, since the ’correct’ family of distributions belongs to
some finite–dimensional space. If the set of distributions is misspecified,
consistency may be obtained, but efficiency is typically lost and the asymptotic
variance may be difficult to compute (since we lose the property
L''(θ0 ) = Var [L′ (θ0 )]).
▸ Actual computation may be cumbersome either because the likelihood is very
difficult to compute and/or because maximization is a hard problem. These issues must
be overcome with specialized software (in particular in time series analysis).
▸ Small sample properties may be very poor and asymptotic approximations may be
very bad. This is in particular the case when the log-likelihood function is not
globally concave or when the Fisher Information is very small (identifiability is
crucial).
Some of these drawbacks may be overcome. For instance, pseudo-Likelihood
techniques (Gouriéroux, Monfort and Trognon 1982) and more recently Empirical
Likelihood (Owen 1998) provide ways to avoid exact knowledge of the distribution.
Note however that the asymptotic properties must be studied on a case by case basis. Also
poor small sample behavior may be handled by means of the bootstrap (at least in the
cross–section case).
First example Gaussian Linear model
providing the famous solution θ̂n = (X′X)^{-1} X′Y (if X′X is not invertible, there are
multiple solutions). The second order condition is that X′X be positive definite.
Notice that the First Order Condition asserts that the vector whose components are
the forecasting errors yi − ŷi (where ŷi = θ̂n′ Xi ) is orthogonal to the columns of X .
Notice also that the Fisher Information is proportional to X′X (minus the derivative
of the score) and we then obtain that the Cramer-Rao Bound is reached in this case in
finite samples.
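The orthogonality stated by the First Order Condition can be checked numerically. The sketch below is only an illustration under assumptions that are not in the text: a single regressor plus a constant, simulated Gaussian data, and the hypothetical helper name `ols_simple`.

```python
import random

def ols_simple(x, y):
    """OLS for y = a + b*x, i.e. (X'X)^{-1} X'Y with X = [1, x]."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b = sxy / sxx
    a = my - b * mx
    return a, b

# Simulated data (assumed design: a = 1, b = 2, standard Gaussian errors).
rng = random.Random(0)
x = [rng.gauss(0.0, 1.0) for _ in range(200)]
y = [1.0 + 2.0 * xi + rng.gauss(0.0, 1.0) for xi in x]

a_hat, b_hat = ols_simple(x, y)
resid = [yi - (a_hat + b_hat * xi) for xi, yi in zip(x, y)]

# First Order Conditions: residuals are orthogonal to each column of X.
orth_const = sum(resid)                            # column of ones
orth_x = sum(r * xi for r, xi in zip(resid, x))    # regressor column
```

Both `orth_const` and `orth_x` should be zero up to rounding error, whatever the distribution of the errors.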
Second example : Logit model
This model is defined as follows: Yi ∈ {0, 1} and
$$P(Y_i = 1 \mid X) = \frac{1}{1 + \exp(-\theta' X_i)}$$
The log-likelihood is
$$-\sum_{i=1}^{n} \log\left(1 + \exp(\theta' X_i)\right) + \theta' \sum_{i=1}^{n} y_i X_i$$
Assume there exists a θ such that for all i we have θ′ Xi > 0 if and only if yi = 1
(complete separation case). Then the mle is not defined. Indeed, for any λ > 0 the
product λ × (θ′ Xi ) has the same sign as θ′ Xi , so λθ separates the data as well. This
amounts to saying that when θ is scaled up in a direction where complete separation
occurs, the likelihood increases continuously and never attains its supremum. Indeed,
in this direction, the predicted probability approaches yi for all indexes.
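The separation argument can be illustrated numerically. The following is a minimal sketch assuming a scalar regressor without constant and a toy dataset chosen so that yi = 1 exactly when xi > 0; the function name `logit_loglik` is hypothetical.

```python
import math

def logit_loglik(theta, xs, ys):
    """Logit log-likelihood: sum_i [ y_i*theta*x_i - log(1 + exp(theta*x_i)) ]."""
    total = 0.0
    for x, y in zip(xs, ys):
        eta = theta * x
        # log(1 + exp(eta)) written in a numerically stable way
        softplus = max(eta, 0.0) + math.log1p(math.exp(-abs(eta)))
        total += y * eta - softplus
    return total

# Completely separated toy data: y_i = 1 if and only if x_i > 0.
xs = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]
ys = [0, 0, 0, 1, 1, 1]

# Scale theta = 1 by increasing factors lambda.
values = [logit_loglik(lam, xs, ys) for lam in (1, 2, 5, 10, 50)]
```

The values in `values` increase towards 0 without ever reaching it, which is why no finite maximizer exists.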
Third example : (stationary) ARMA models
AutoRegressive Moving Average models are a cornerstone of time series analysis.
An ARMA(p,q) model evolves according to
$$y_t = \epsilon_t + \alpha + \sum_{i=1}^{p} \phi_i y_{t-i} + \sum_{j=1}^{q} \theta_j \epsilon_{t-j}$$
where (ϵt ) is some iid sequence of ”shocks” (we shall assume that the distribution of ϵ1
is N (0, σ 2 )). We shall not consider here in detail the question of existence of a random
sequence yt that fulfills this constraint, but we consider the question of estimation of
the parameters (α, ϕ1 , . . . , ϕp , θ1 , . . . θq , σ 2 ).
A direct formulation of the likelihood is difficult, but some approximations may be
proposed. For instance, if ϵ0 , ϵ−1 , . . . , ϵ−q were known, ϵ1 , . . . , ϵT could be computed
recursively from y1 , . . . , yT
$$\epsilon_t = y_t - \left(\alpha + \sum_{i=1}^{p} \phi_i y_{t-i} + \sum_{j=1}^{q} \theta_j \epsilon_{t-j}\right)$$
and the log-likelihood of the sequence of shocks may be derived as a function of the
parameters, the observations, and ϵ0 , ϵ−1 , . . . , ϵ−q .
Computing the exact ML is a hard task since we should integrate out the conditional
distribution by averaging over the distribution of the (q + 1)–vector ϵ0 , ϵ−1 , . . . , ϵ−q
given y1 , . . . , yT .
To avoid this (q + 1)–dimensional integral, either ϵ0 , ϵ−1 , . . . , ϵ−q are set to their
marginal expectations (that is 0), which is the Limited Information Maximum
Likelihood, or they are set to their conditional expectations, which is the so-called
Full Information Likelihood.
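The recursive computation of the shocks can be sketched for an ARMA(1,1). This is an illustration only: the function names are hypothetical, the pre-sample shock is set to 0 as described above, and the pre-sample observation y0 is also set to 0, which is an additional assumption not discussed in the text.

```python
import math
import random

def arma11_shocks(y, alpha, phi, theta, eps0=0.0, y0=0.0):
    """Recover eps_1..eps_T recursively, given pre-sample values eps0 and y0."""
    eps, prev_eps, prev_y = [], eps0, y0
    for yt in y:
        e = yt - (alpha + phi * prev_y + theta * prev_eps)
        eps.append(e)
        prev_eps, prev_y = e, yt
    return eps

def conditional_loglik(y, alpha, phi, theta, sigma2):
    """Gaussian log-likelihood of the recovered shocks (pre-sample values at 0)."""
    eps = arma11_shocks(y, alpha, phi, theta)
    T = len(y)
    return (-0.5 * T * math.log(2 * math.pi * sigma2)
            - sum(e * e for e in eps) / (2 * sigma2))

# Simulate a path that really starts from eps0 = y0 = 0 (assumed parameter values).
rng = random.Random(0)
alpha, phi, theta = 0.5, 0.6, 0.3
true_eps, y, prev_y, prev_e = [], [], 0.0, 0.0
for _ in range(200):
    e = rng.gauss(0.0, 1.0)
    yt = e + alpha + phi * prev_y + theta * prev_e
    true_eps.append(e)
    y.append(yt)
    prev_y, prev_e = yt, e

recovered = arma11_shocks(y, alpha, phi, theta)
loglik_at_truth = conditional_loglik(y, alpha, phi, theta, 1.0)
```

At the true parameter values and with correct pre-sample values the recursion returns the simulated shocks; in practice `conditional_loglik` would be passed to a numerical optimizer.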
Testing and Maximum Likelihood techniques
Another important issue with the ML approach is testing. Assume we would like to test
H0 ∶ g (θ) = 0 (where g is some known function taking values in R^k ). The ML approach
to inference basically gives rise to three ways to test this hypothesis
Wald Compute the unconstrained MLE θ̂ and decide whether g (θ̂) is significantly
different from 0 or not;
LM Compute the constrained MLE θ̂0 by maximisation of the Lagrangian L(θ) − λ′g (θ)
and decide whether λ̂ is significantly different from 0 or not;
LR Compute the constrained MLE θ̂0 and the unconstrained MLE and decide whether
L(θ̂) is significantly larger than L(θ̂0 ) or not.
(Exercise: why do I write ”larger” in the last sentence ?)
The three tests statistics are respectively given by
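$$\zeta_W = g(\hat\theta)' \left[ \frac{1}{T}\, \frac{\partial g}{\partial \theta'}(\hat\theta)\; \widehat{\mathrm{Var}}_\theta\; \frac{\partial g'}{\partial \theta}(\hat\theta) \right]^{-1} g(\hat\theta)$$
$$\zeta_{LM} = \hat\lambda' \left[ \frac{1}{T}\, \frac{\partial g}{\partial \theta'}(\hat\theta)\; \widehat{\mathrm{Var}}_\theta\; \frac{\partial g'}{\partial \theta}(\hat\theta) \right] \hat\lambda$$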
A legitimate question could be: why do we need three tests (taking into account
the fact that they are justified on asymptotic grounds and that they are asymptotically
equivalent)? There are at least three main reasons.
Second, Wald tests and Confidence Regions may perform very poorly when there is a
lack of identification in finite samples. In such cases, LM is usually preferable since it
correctly detects this problem (at the cost of low power for the test and unboundedness
of the CI, but this is a good property !). See Dufour (1997) for details.
Third, the inequality holds if the model is correctly specified. A triplet of decisions
that is incompatible with the above inequality signals that the model may be
misspecified (see Godfrey 1991 for a complete treatment of this question.)
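To make the three statistics concrete, here is a sketch for one special case that is not taken from the text: the Gaussian linear model y = a + bx + ϵ with H0 ∶ b = 0. In that model the statistics have well-known textbook closed forms in terms of the restricted and unrestricted residual sums of squares, and they satisfy the ordering W ≥ LR ≥ LM. The function name `wald_lr_lm` and the simulated data are assumptions.

```python
import math
import random

def wald_lr_lm(x, y):
    """Wald, LR and LM statistics for H0: b = 0 in y = a + b*x + eps (Gaussian)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    rss0 = sum((yi - my) ** 2 for yi in y)   # restricted fit (b = 0)
    rss1 = rss0 - sxy ** 2 / sxx             # unrestricted OLS fit
    wald = n * (rss0 - rss1) / rss1
    lr = n * math.log(rss0 / rss1)
    lm = n * (rss0 - rss1) / rss0
    return wald, lr, lm

rng = random.Random(0)
x = [rng.gauss(0.0, 1.0) for _ in range(50)]
y = [0.3 * xi + rng.gauss(0.0, 1.0) for xi in x]
wald, lr, lm = wald_lr_lm(x, y)
```

Each statistic would be compared with the same chi-square(1) critical value, so the ordering implies that the Wald test rejects whenever LR does, and LR rejects whenever LM does.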
Quasi Maximum Likelihood
Our first example above shows that the OLS may coincide with the MLE. This is true
only if the error term is Gaussian (this is not trivial to see since it requires solving a
multivariate differential equation). Now consistency of the OLS does not require the
error term to be Gaussian. Hence, it may be that the MLE is consistent even if the
distribution function used to compute the likelihood is not the right one. This is an
instance of a Quasi ML estimation.
Two questions arise: how should the distribution be chosen so that the QMLE is
consistent and, besides consistency, what are the asymptotic properties of the QMLE ?
The second one is much easier to answer. As a matter of fact, the argument is exactly
the same as for the mle, except that in the limit, the qmle is no longer a linear
function of the score. As a consequence, the limit variance is much more complicated
to write down (see, for instance, Trognon 1987 (in French)).
The first one turns out to be too general to be answered uniquely. A useful restriction
is to consider the case in which the first moment of the model is well specified, that is
there exists a value of the parameter θ0 such that the true conditional expectation
E [Y ∣X ] coincides with the conditional expectation computed with the chosen model
at this value of the parameter. In this case, it may be shown that the QMLE is CAN
(regularity assumptions required, as usual). This result has been obtained by White
[1982].
Indirect Inference (notions)
Assume we perform OLS estimation in a case where the error terms are iid and denote
f the true density function. Then, according to the previous result, θQMLE (f )n is such
that
$$\lim_{n\to+\infty} E_{\theta_{QMLE(f)_n}}[Y \mid X] = E_f[Y \mid X]$$
9. A caveat: this result is usually obtained for each given value of f . Uniform consistency is not guaranteed (and
fails to hold unless the set over which it is obtained is constrained).
Lesson 4 : Bootstrap and resampling techniques
Introduction
Bootstrap is part of the toolbox of many applied econometricians. Bootstrap is
popular for several reasons
▸ you should not have to care about estimators for the limit distribution (notably
the asymptotic variance)
▸ it provides ”better” estimators than ”usual” techniques
▸ it is easy to implement
Some of the above statements are not always true. For instance the last one is false,
for genuine bootstrap estimators are often impossible to compute and numerical
approximations are used instead.10 Others are not clearly defined (in what sense
is the bootstrap estimator ”better”?)
The technique was not discovered by Bradley Efron (nor by Hebert Simon...).
It goes back to the beginning of modern statistics and was used explicitly for the
first time in 1923 by Hubback. Fisher acknowledged the great influence of Hubback’s
ideas on his own work.11 Nevertheless, Efron’s contribution is seminal in many
respects and should be regarded as the main building block.
10. As we shall see, a theoretical (non parametric) bootstrap estimator requires n^n computations for a sample of
size n, so that if the statistic takes 10^{-8} second to be computed and the sample size is n = 100, a genuine
theoretical computation takes about 10^{192} seconds, that is roughly 10^{174} times the age of the universe...
11. ”The use of the method of random sampling is theoretically sound. I may mention that its practicability,
convenience and economy was demonstrated by an extensive series of crop-cutting experiments on paddy carried
out by Hubback. ... They influenced greatly the development of my methods at Rothamsted.” (R.A. Fisher, 1945).
What does the Bootstrap achieve ?
Let Θ be a closed subset of R^p and consider a parametric model (Fθ )θ∈Θ . Now assume
we have an iid sample X1 , . . . , Xn from the true distribution Fθ0 where θ0 ∈ Θ̊.
Some estimation method (for instance maximum likelihood or the method of
moments) provides an estimator θ̂. Assume θ̂ is CAN: as n goes to infinity,
n^{1/2} (θ̂ − θ0 ) → N (0, Vθ0 ) in distribution, so that test procedures and confidence
intervals that are asymptotically valid may be derived.
This derivation (and in particular the computation of Vθ0 ) usually relies on the
examination of the mapping θ0 → θ̂(θ0 ) (this is striking in the Indirect Inference
approach). The main tools used here are the Slutsky Theorem and the continuous
mapping theorem.
For instance in a Student procedure for the significance of the mean we compare the
statistic Tn = x̄n /ŝn with a ”critical point” c(α, n), which is in this case the 1 − α
quantile of the Student distribution with n − 1 degrees of freedom, because we know
that the distribution of Tn under the null is approximately a Student distribution with
n − 1 degrees of freedom under ”fairly general conditions”.
The problem is that this approach is not exact in finite samples because Tn = x̄n /ŝn
does not have a known distribution. Hence the 1 − α quantile of the St (n − 1)
distribution is not the 1 − α quantile of the distribution of Tn (even if the null is true).
This may lead to possibly large Errors in Rejection Probability (ERP). The bootstrap
provides a way to compute more accurate critical points so that the ERP converges to
zero more quickly than in the usual classical approaches.
Parametric Bootstrap
A crucial remark here is that we assume that the limit distribution of Tn under the null
does not depend on θ. In technical terms, Tn is asymptotically pivotal under the null.
Now let θ̂n be a root-n consistent estimator of θ under the null (for instance the
constrained MLE) and assume we are able to simulate several independent drawings
(x1s , . . . , xns )s where s = 1, . . . , S from the distribution Pn (θ̂n ).
For each drawing s we can compute the value of the statistic Tn,s . Now instead of
using H −1 (1 − α) as a critical point, we may use Hn−1 (1 − α, θ̂n ) (we omit the
dependence on S) the empirical 1 − α quantile computed from Tn,1 . . . , Tn,S .
Hence the Bootstrap testing procedure sets ϕB = 1 iff Tn > Hn−1 (1 − α, θ̂n ) (otherwise
ϕB = 0)
This idea may be used to provide confidence intervals, simply by keeping the values
of θ that are not rejected by such a test.
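The testing procedure above can be sketched as follows. The setting is an assumption made for illustration: a Gaussian model N(µ, σ²) with H0 ∶ µ = 0, the Student-type statistic as Tn, and the constrained MLE of σ as estimator under the null. All function names are hypothetical.

```python
import math
import random
import statistics

def t_stat(x):
    """Student-type statistic sqrt(n) * mean / standard deviation."""
    return math.sqrt(len(x)) * statistics.fmean(x) / statistics.stdev(x)

def parametric_bootstrap_test(x, alpha=0.05, S=2000, seed=0):
    """Return (decision, critical point) of the one-sided bootstrap test of mu = 0."""
    rng = random.Random(seed)
    n = len(x)
    # Constrained MLE of sigma under the null mu = 0.
    sigma0 = math.sqrt(sum(xi * xi for xi in x) / n)
    # S simulated samples from the estimated null distribution.
    sims = sorted(
        t_stat([rng.gauss(0.0, sigma0) for _ in range(n)]) for _ in range(S)
    )
    # Empirical 1 - alpha quantile of the simulated statistics.
    crit = sims[math.ceil((1 - alpha) * S) - 1]
    return t_stat(x) > crit, crit

rng = random.Random(1)
x = [rng.gauss(0.0, 1.0) for _ in range(20)]
reject, crit = parametric_bootstrap_test(x)
```

In this Gaussian case the statistic is exactly pivotal, so `crit` should be close to the Student quantile with n − 1 degrees of freedom; the interest of the method lies in models where no such exact result is available.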
Does it work and why ? (for details see Beran 1986 and Beran 1988, JASA)
The key argument is that for each x and θ the empirical cdf Hn (x, θ) converges to
H(x, θ). More precisely, we usually can write something like
Hence we get
$$H_n(x, \hat\theta_n) - H_n(x, \theta) = n^{-1/2}\left(h(x, \hat\theta_n) - h(x, \theta)\right) + O_p(n^{-1}) = O_p(n^{-1})$$
Indeed we assume that the difference between θ̂n and θ is of order n−1/2 hence if
h(x, θ) is sufficiently regular we have
Hn (Tn , θ) > 1 − α
Consider now the case in which the limit null distribution depends on θ under the null
(that is, we lose pivotality). We now have
Hence the leading term in the difference Hn (x, θ̂n ) − Hn (x, θ) is now
In short when the limit distribution of the statistic Tn is not pivotal, then bootstrap is
useless.
In this case, Θ is typically divided in two parts : one is parametric (for instance the
mean) and the other is not. Hence θ = (µ, F ). The value of µ is fixed under the null
(say 0), but F is not.
If we use the (pseudo)-inversion algorithm, this amounts to computing F̂n−1 (U) for a
uniform drawing U. Now with probability one, this algorithm is equivalent to an
equiprobable drawing with replacement in the original dataset.
This is by far the best-known presentation of the Bootstrap (if not the only one,
judging by websites) and it explains the name. The extension of this idea, namely to
compute the same statistic from the original dataset X and from a sample X ∗
obtained by equiprobable drawings with replacement, is called the ”Bootstrap principle”.
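The equivalence between the pseudo-inversion of the empirical cdf and equiprobable drawing with replacement can be sketched as follows (hypothetical function names, toy data).

```python
import math
import random

def empirical_quantile(sorted_x, u):
    """Pseudo-inverse of the empirical cdf: smallest x such that F_n(x) >= u."""
    n = len(sorted_x)
    k = max(1, math.ceil(n * u))   # rank of the selected order statistic
    return sorted_x[k - 1]

def bootstrap_resample(x, rng):
    """Draw a resample of the same size by inverting the empirical cdf."""
    sorted_x = sorted(x)
    return [empirical_quantile(sorted_x, rng.random()) for _ in x]

rng = random.Random(0)
x = [2.5, -1.0, 0.3, 4.1, 0.9]
resample = bootstrap_resample(x, rng)
```

Each uniform draw falls in one of n intervals of length 1/n, so each original observation is selected with probability 1/n, exactly as in a drawing with replacement.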
This case is much more complicated because controlling the limit behavior so as to
mimic the presentation of the parametric case in a non parametric one raises
difficulties. The best-known approach was pioneered by Peter Hall. It rests on peculiar
Bootstrap extensions : iterations
The Bootstrap principle may be iterated. Indeed, we saw that the testing procedure
ϕB = 1 iff Tn > Hn−1 (1 − α, θ̂n ) and ϕB = 0 otherwise may provide a test with a smaller
ERP in large samples.
Now, this does not mean that the probability of the event ϕB = 1 is exactly α under
the null (see below examples of bootstrap’s failure). But the bootstrap principle may
be applied to the bootstrap test itself.
More precisely, if the probability under the null of the event Tn > Hn−1 (1 − α, θ̂n ) is not
exactly α it may be estimated by simulations. These simulations may then be used
to provide a new correction of the critical point, and so on.
Notice however that this iteration is often very costly with current computers.12
Indeed, in a non parametric setting, complete simulation of all of the possible
re-samples from a sample of size n requires N = n^n computations. The second
iteration of such a process then requires N^N computations. If n = 10 then N = 10^{10}
and N^N is a number that starts with ”1” followed by 100000000000 zeros...
12
Keep in mind however that research in quantum computing is currently reaching important milestones.
Bootstrap in dynamic models
A major problem with the bootstrap arises when time series are considered. For
instance consider an MA(1) model
xt = ϵt − θϵt−1
where (ϵt )t∈Z is a strong white noise with unknown distribution F . Assume we want
to compute a Bootstrap confidence interval for θ and that we consider as an estimator
for θ the empirical autocorrelogram at horizon 1.
A ”blind” application of the bootstrap ”principle” may lead to re-sampling from the
sample x1 , . . . , xT . This is not a good idea since for large sample sizes this leads a.s.
to an iid sequence. Hence the re-sampled sequence does not have the same
distribution as the original one. In particular the autocorrelograms are different.
in which re-sampling may be performed. Notice however that this is a model specific
approach and it is only approximate (hence ”improvements” are not guaranteed).
Block Bootstrap
To circumvent the previous problem, remark that the same model provides direct
access to an iid bivariate sequence. Since xt and xt+2 share no shock, blocks separated
by one discarded observation are independent, for instance
Zk = (x3k , x3k+1 )
For this iid sequence, the bootstrap principle may be applied directly by resampling in
the Z sample.
The tricky point is to choose the proper size for the blocks. Blocks that are too small
destroy much of the dynamics, while blocks that are too large decrease the number of
simulated samples, hence the precision of the plug–in.
A difficult problem is that, except in very simple cases (such as the above MA), block
bootstrap sequences are no longer stationary. Randomly changing the size of the
blocks may help to circumvent this problem (see Künsch, H. R. 1989. “The jackknife
and the bootstrap for general stationary observations,” Annals of Statistics, 17,
1217–1241 for details).
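A generic block resampling scheme can be sketched as follows. This is an illustration with a fixed block length chosen by the user (the parameter `block_len` and the function name are hypothetical); it is not the specific construction used above for the MA(1) model.

```python
import random

def block_bootstrap(x, block_len, rng):
    """Resample a series by concatenating randomly chosen blocks of consecutive values."""
    T = len(x)
    n_blocks = -(-T // block_len)          # ceiling of T / block_len
    out = []
    for _ in range(n_blocks):
        start = rng.randrange(0, T - block_len + 1)
        out.extend(x[start:start + block_len])
    return out[:T]                          # trim to the original length

series = list(range(100))                   # toy series: the time index itself
resampled = block_bootstrap(series, 5, random.Random(1))
```

Within each block the original ordering, hence the short-run dynamics, is preserved; dependence is broken only at the block boundaries.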
Bootstrap and the linear model
Consider now the linear model
E [Y ∣X ] = θX
and assume θ is the parameter of interest.
A direct application of the Bootstrap principle in this case leads to a major difficulty,
since the model is a constraint on the conditional distribution of Y given X .
If we perform resampling as an equiprobable drawing with replacement in the (Y , X )
sequence then we no longer consider a distribution of Y conditional on the full
sequence X1 , . . . , Xn .
This difficulty is often overcome by applying the bootstrap principle to Y − θ̂n X = ϵ̂n .
More precisely we get a new resample version of the estimated error terms ϵ̂∗n which is
used to compute a new set of endogenous variables Y ∗ = θ̂n X + ϵ̂∗n . This is known as
”resampling in the residuals” technique.
Again, proving that this provides better inference (in the above sense) is difficult
because as n increases, the sequence of estimated residuals (ϵ̂n ) does not converge to
the sequence of the actual residuals ϵn (because as we increase the sample size we also
increase the number of residuals).
Notice finally that the initial solution (that is equiprobable drawing with replacement
in the (Y , X ) sequence) and the ”resampling in the residuals” technique rest implicitly
on two different assumptions. The first assumes we have an iid drawing in the (Y , X )
sequence whereas the second assumes that Y − θX is an iid sequence conditional on X .
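The two resampling schemes can be put side by side in a short sketch for the model E[Y∣X] = θX with a scalar regressor. Function names and the simulated data are assumptions made for the illustration.

```python
import random

def theta_hat(x, y):
    """OLS slope in the model without constant: sum(x*y) / sum(x^2)."""
    return sum(a * b for a, b in zip(x, y)) / sum(a * a for a in x)

def pairs_bootstrap(x, y, rng):
    """Resample (Y, X) pairs with replacement, then re-estimate theta."""
    idx = [rng.randrange(len(x)) for _ in x]
    return theta_hat([x[i] for i in idx], [y[i] for i in idx])

def residual_bootstrap(x, y, rng):
    """Keep X fixed, resample estimated residuals, rebuild Y*, re-estimate theta."""
    th = theta_hat(x, y)
    resid = [yi - th * xi for xi, yi in zip(x, y)]
    y_star = [th * xi + rng.choice(resid) for xi in x]
    return theta_hat(x, y_star)

rng = random.Random(0)
x = [rng.gauss(0.0, 1.0) for _ in range(100)]
y = [1.5 * xi + rng.gauss(0.0, 1.0) for xi in x]
th = theta_hat(x, y)
pairs_draws = [pairs_bootstrap(x, y, rng) for _ in range(200)]
resid_draws = [residual_bootstrap(x, y, rng) for _ in range(200)]
```

The first scheme treats the regressors as random, the second keeps them fixed, which mirrors the two implicit assumptions just mentioned.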
What Bootstrap cannot achieve
It is important to keep in mind that bootstrap is not a panacea. While the World
Wide Web provides an orgy of happy ending stories, examples of failures are not
difficult to come up with. In particular, the belief according to which ”bootstrap
performs better in small samples” (compared to usual asymptotic approaches) is a
tale, as the following example shows.
Consider the set of probability distributions for X such that P(X = −1) = p and
P(X = m) = 1 − p with 0 ≤ p ≤ 1 and m ∈ R. Also assume we observe an iid sample of
size n such that p = p and m = m. Suppose we would like to test
H0 ∶ E [X ] = 0 ⇔ (1 − p)m = p against H1 ∶ E [X ] > 1000.
Assume we use any test procedure ϕ. Can we be sure that using the bootstrap principle
will decrease the ERP in small samples ?
The answer is ’no’ because, no matter how large n is, there exist distributions under
the null such that P(X1 = −1, . . . , Xn = −1) = p^n is as close to 1 as desired (take m
large, so that p = m/(1 + m) is close to 1). In such a sample all observations are
equal, so every resample coincides with the original sample.
As a consequence, there exist distributions under the null such that the probability
that the initial and the resample drawing are exactly the same is as close as desired to
1. Hence with probability arbitrarily close to 1 bootstrap and classical tests provide
the same decision.
An example of Bootstrap failure (lack of differentiability)
Another issue is the implicit ”regularity assumptions” that lie behind the above
developments underlying Beran’s arguments. When the expansions are not correct
(because functionals are non differentiable) or provide a poor fit (because orders of
magnitude are unclear), the arguments of strong bootstrap advocates should be taken
with a grain of salt.
This is the main reason why non parametric models are proposed: they are large
enough for the misspecification risk to be reasonably small. More precisely, the model
is parametric when there exists an injection between the set of distributions and R^p
(otherwise it is non–parametric).
There are basically two ways to approach inference in non–parametric models. In the
first one, the parameter of interest lies in a finite–dimensional set and the non
parametric part of the model is a nuisance parameter (in most cases, it is the
common cdf of an iid sequence of ”error terms”). In the second, the parameter of
interest belongs to some functional (infinite dimensional) set.
In the first case, we often look for methods in which the statistic is (at least
asymptotically) pivotal with respect to the non–parametric part of the model (we
already discussed some instances of this case). In the second case, we want to provide
estimators for a function.
In the first case, we often speak of semi-parametric inference, whereas in the second
we use the term non-parametric estimation. This terminology is not as clear as it may
look but we shall use it for simplicity.
Examples : tests
Compute $T_n = \sum_{1 \le i \le n} 1_{X_i \ge 0}$ and reject H0 when Tn is larger than the 1 − α quantile of
the Bin(n, 0.5) distribution. This is a valid non parametric test at level α for H0 . It is
known as the sign test.
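A minimal sketch of the sign test follows. Since the statement of H0 is not reproduced here, the code assumes the usual null under which P(X ≥ 0) = 1/2 and a one-sided alternative; the decision is expressed through the exact binomial p-value, which is a slight variant of the quantile rule above. The function name is hypothetical.

```python
from math import comb

def sign_test(x, alpha=0.05):
    """One-sided sign test: returns (T_n, exact p-value, reject?)."""
    n = len(x)
    t = sum(1 for xi in x if xi >= 0)
    # P(Bin(n, 1/2) >= t), computed exactly from binomial coefficients.
    p_value = sum(comb(n, k) for k in range(t, n + 1)) / 2 ** n
    return t, p_value, p_value <= alpha

all_positive = sign_test([1.0] * 10)
all_negative = sign_test([-1.0] * 10)
```

Only the signs of the observations are used, so the level of the test does not depend on the shape of the distribution.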
Assume now we have an iid bivariate sample (X1 , Y1 ), . . . (Xn , Yn ) such that
P(X ∈ A, Y ∈ B) = PF (A) × PG (B) for any couple of measurable sets A, B. We shall
also assume that P(X = Y ) = 0. Suppose we would like to test the hypothesis
H0 ∶ F = G.
If we compute $T_n = \sum_{1 \le i,j \le n} 1_{X_i \ge Y_j}$ (that is the number of times ’each X beats a Y ’ in
a face–to–face contest), Tn has a known distribution under the null so that a valid test
may be performed (this is the famous Wilcoxon–Mann–Whitney test procedure).
Assume the model is described by a set of distributions such that we know that for
some parameter θ ∈ Θ ⊂ R^p and some given function h(X , θ) we have
EF [h(X , θ)] = 0
In this case the in–sample bias is zero (we ”predict” exactly what happened) but every
time we have a new observation, everything changes. The estimation is very unstable,
and it displays a large variance.
At the other extreme, the opposite arises : the bias is large and the variance is small.
Choosing h then requires a criterion that balances both variance and bias. The usual
one is the MISE criterion. Asymptotically, the leading terms of this criterion are
$$\frac{1}{nh} \int K^2(u)\,du + \frac{h^4}{4} \left(\int u^2 K(u)\,du\right)^2 \int \left(f''(u)\right)^2 du$$
The first term goes to infinity when h goes to zero (this is the variance part) while the
other diverges when h is large (this is the squared bias part). Overall, this quantity, as
a function of h, is minimized when
$$h = \left( \frac{\int K^2(u)\,du}{n \left(\int u^2 K(u)\,du\right)^2 \int \left(f''(u)\right)^2 du} \right)^{1/5}$$
Interestingly, the leading term in the MISE criterion is O(n^{-4/5}). This means that if
we want to divide the (square root of the) MISE by 10, the number of data must be
multiplied by 10^{5/2} ≈ 316.
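The optimal bandwidth formula can be evaluated in closed form once K and f are specified. The sketch below assumes a Gaussian kernel and a Gaussian reference density N(µ, σ²), which is an assumption made here for illustration (in practice f is unknown). Under these assumptions the formula reduces to h = (4/(3n))^{1/5} σ, about 1.06 σ n^{-1/5}. Function names are hypothetical.

```python
import math

def amise_bandwidth(n, int_K2, int_u2K, int_f2_sq):
    """h = ( int K^2 / ( n * (int u^2 K)^2 * int (f'')^2 ) )^(1/5)."""
    return (int_K2 / (n * int_u2K ** 2 * int_f2_sq)) ** 0.2

def gaussian_reference_bandwidth(n, sigma):
    """Optimal h for a Gaussian kernel when f is assumed to be N(mu, sigma^2)."""
    int_K2 = 1.0 / (2.0 * math.sqrt(math.pi))                 # int K^2 for the Gaussian kernel
    int_u2K = 1.0                                             # int u^2 K(u) du
    int_f2_sq = 3.0 / (8.0 * math.sqrt(math.pi) * sigma ** 5)  # int (f'')^2 for N(mu, sigma^2)
    return amise_bandwidth(n, int_K2, int_u2K, int_f2_sq)

h_100 = gaussian_reference_bandwidth(100, 1.0)
# Sample-size multiplier needed to divide sqrt(MISE), of order n^(-2/5), by 10.
sample_multiplier = 10 ** 2.5
```

The value `sample_multiplier` reproduces the factor of about 316 quoted above.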
Choice of kernel
One may also look at the MISE criterion as a function of the kernel. Once h is set to
its optimal value, this amounts to solving the following problem
$$\min_K \ \sqrt{\int u^2 K(u)\,du}\ \int K^2(u)\,du$$
s.t. K (−u) = K (u) ∀u ∈ R
K (u) ≥ 0 ∀u ∈ R
∫ K (u)∂u = 1
The ”best” one is known as the Epanechnikov kernel (difficult exercise) :
$$K(u) = \frac{3}{4}\left(1 - u^2\right), \quad u \in [-1, 1].$$
The relative inefficiencies of most other commonly used kernels are usually small,
meaning that an optimal choice of the kernel is not a major issue. For instance the
uniform kernel (which underlies the classic histogram) has a relative efficiency of .93
whereas the Gaussian Kernel achieves .95.
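The quoted efficiencies can be recovered numerically. The sketch below assumes that relative efficiency is measured by the ratio of the scale-free constants √(∫u²K) × ∫K² (Epanechnikov over candidate kernel), a common textbook convention that is not spelled out above; integrals are approximated by a midpoint rule and all names are hypothetical.

```python
import math

def kernel_constant(K, lo, hi, m=20000):
    """sqrt(int u^2 K) * int K^2, approximated by a midpoint rule on [lo, hi]."""
    du = (hi - lo) / m
    mu2 = rough = 0.0
    for i in range(m):
        u = lo + (i + 0.5) * du
        k = K(u)
        mu2 += u * u * k * du
        rough += k * k * du
    return math.sqrt(mu2) * rough

def epanechnikov(u):
    return 0.75 * (1.0 - u * u) if abs(u) <= 1.0 else 0.0

def uniform(u):
    return 0.5 if abs(u) <= 1.0 else 0.0

def gaussian(u):
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

c_epa = kernel_constant(epanechnikov, -1.0, 1.0)
eff_uniform = c_epa / kernel_constant(uniform, -1.0, 1.0)
eff_gaussian = c_epa / kernel_constant(gaussian, -10.0, 10.0)
```

Under this convention `eff_uniform` and `eff_gaussian` come out at about 0.93 and 0.95.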
Non parametric estimation of conditional expectation (Nadaraya Watson)
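Indeed, we have E [Y ∣X = x] = ∫ y fX ,Y (x, y )/fX (x)∂y . Hence if we consider the kernel
estimators
$$\hat f_{X,Y}(x, y) = \frac{1}{nh^2} \sum_{i=1}^{n} K\left(\frac{x - X_i}{h}\right) K\left(\frac{y - Y_i}{h}\right)$$
$$\hat f_X(x) = \frac{1}{nh} \sum_{i=1}^{n} K\left(\frac{x - X_i}{h}\right)$$
a plug–in estimator is
$$\int y\, \frac{\sum_{i=1}^{n} K\left(\frac{x - X_i}{h}\right) \frac{1}{h} K\left(\frac{y - Y_i}{h}\right)}{\sum_{i=1}^{n} K\left(\frac{x - X_i}{h}\right)}\, dy = \frac{\sum_{i=1}^{n} K\left(\frac{x - X_i}{h}\right) \int y\, \frac{1}{h} K\left(\frac{y - Y_i}{h}\right) dy}{\sum_{i=1}^{n} K\left(\frac{x - X_i}{h}\right)}$$
and since $\int y\, \frac{1}{h} K\left(\frac{y - Y_i}{h}\right) dy = Y_i$ (K is symmetric and integrates to one) we obtain the
desired expression
$$\hat E[Y \mid X = x] = \frac{\sum_{i=1}^{n} K\left(\frac{x - X_i}{h}\right) Y_i}{\sum_{i=1}^{n} K\left(\frac{x - X_i}{h}\right)}$$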
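The resulting estimator is a kernel-weighted average of the Yi, which takes a few lines to implement. The sketch below assumes a Gaussian kernel and a bandwidth supplied by the user; the function names are hypothetical.

```python
import math

def gauss_kernel(u):
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

def nadaraya_watson(x0, xs, ys, h, K=gauss_kernel):
    """Kernel-weighted average of the y_i, with weights K((x0 - x_i) / h)."""
    weights = [K((x0 - xi) / h) for xi in xs]
    return sum(w * yi for w, yi in zip(weights, ys)) / sum(weights)

# Symmetric design with y = 2x: the estimate at x0 = 0 is 0 by symmetry.
at_zero = nadaraya_watson(0.0, [-1.0, 0.0, 1.0], [-2.0, 0.0, 2.0], 0.5)
# Constant responses are reproduced whatever the bandwidth.
constant = nadaraya_watson(0.3, [-1.0, 0.0, 1.0], [3.0, 3.0, 3.0], 0.5)
```

Note that the normalizing constants 1/(nh) cancel between numerator and denominator, so only the ratio of weights matters.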
Controlling endpoints
Most of the results and discussion on Kernel estimation used the MISE as criterion.
As the MISE is an integrated criterion, it gives more weight to the central part of the
distribution of the DGP. It has been documented for a while (in particular by Müller
1993) that endpoints may create very large bias and variance.
This is due to the fact that endpoints correspond to very extreme events, which are,
by definition, rarely observed. For these very rare cases, the behavior of the estimator
is driven by extreme value theory, a domain in which the usual Central Limit Theorem
does not apply.
There exist several ways to correct for this problem. The first one is to vary the width
of the window in order to enlarge the window where the number of points is too small.
The second amounts to trim the sample. In practice, this amounts to discard from the
sample the most extreme points so as to ensure that the estimation always relies on a
sufficiently large number of points.
In both cases, the study of the properties of kernel estimators is much more complex
than in the ”central” case. One should at least keep in mind that ”optimal” kernel
and bandwidth choices may lead to severe misrepresentation of the true DGP when
extreme events are considered.
In some instances (in particular in insurance) tail behaviors are dealt with by specific
parametric models. Fat-tailed distributions (Gumbel, Pareto,...) are used as proxies
when some underlying information is known. The parameter estimation of these
distributions is also difficult (in particular in the time series context) and good
knowledge of their mathematical specificities is highly recommended.
Asymptotic behavior of kernel estimation for density (Parzen 1962)
The asymptotic behavior of the MISE has already been discussed, but we have just seen
that this criterion is an averaged one and that specific characteristics may be difficult
to estimate.
We shall now present a very classic result obtained by Parzen (1962). Let h(n) be a
deterministic sequence such that limn→+∞ h(n) = 0 then (a.s.)
$$\lim_{n\to+\infty} E\left[ \frac{1}{n\,h(n)} \sum_{i=1}^{n} K\left(\frac{x_0 - X_i}{h(n)}\right) \right] = f(x_0)$$
Clearly enough, all these questions cannot be answered with the same approach. Two
general remarks may nevertheless be put forward
1. The main question is not to provide a valid test procedure (because the “stupid”
procedure is always available) nor a consistent procedure (because ”always reject”
is also admissible). The problem is to provide a procedure the level of which is at
least controlled (even asymptotically) and which is somewhere consistent.
2. In many cases, the alternative hypothesis is much too large to ensure consistency
of (asymptotically) valid test procedures, even in the pointwise sense. Hence
many tests are built in order to provide consistency against some specific subset
of distributions under the alternative (for instance the KPSS in the second above
case).
Impossibility results
More precisely, assume we want to test that the DGP is P against the alternative that
the DGP is Q; then the smallest achievable sum of type–one and type–two risks equals
1 − TV (P, Q) (see, for instance here for a direct proof).
Hence an α–level test for this simple testing problem has power at most α + TV (P, Q).
It follows that if P and Q are arbitrarily close in the TV sense, the ”stupid” test and
the Neyman and Pearson procedure have nearly the same power.
Unfortunately in non parametric contexts it is often the case that a measure under the
alternative can be found in any TV–neighborhood of any point under the null. In
this case, the ”stupid” test is Uniformly Most Powerful.
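For discrete distributions the bound is easy to evaluate. The sketch below uses the standard fact that the power of a level-α test of P against Q exceeds α by at most TV(P, Q); the two distributions and the function names are assumptions chosen for illustration.

```python
def total_variation(p, q):
    """Total variation distance between two discrete distributions on the same support."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

def power_upper_bound(alpha, p, q):
    """Upper bound on the power of any level-alpha test of P against Q."""
    return min(1.0, alpha + total_variation(p, q))

P = [0.5, 0.5]
Q = [0.4, 0.6]
tv = total_variation(P, Q)
bound = power_upper_bound(0.05, P, Q)
```

Here the distance is 0.1, so no 5%-level test based on a single observation can have power above 0.15.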
Permutation tests
Permutation test is a simple yet powerful way to derive non parametric test procedures
that have the correct level and enjoy some power properties.
In practice permutation and Bootstrap procedures are similar and they are often
presented as part of a general approach (see, for instance, here). As we shall see, the
similarity of the practice does not carry over to the theory.
For permutation devices to be used, one needs
1. That each distribution under the null is invariant with respect to permutations of the indices
2. That the realization of the test statistic is not invariant with respect to permutations
3. That the distribution under the alternative hypothesis is not invariant with respect to
permutations
For instance, the first statement is fulfilled under ”H0 ∶ X1 , . . . Xn is an iid sequence”.
The second statement is not fulfilled by the arithmetic mean but it is by the rank
statistic, for instance. The last statement is required so that the test is more powerful
than the ”stupid” one.
The procedure amounts to computing some given statistic Tn and computing Tπ(n) for
all permutations π of the indices. We then obtain a ranked sequence
T(1)∶n! ≤ T(2)∶n! ≤ . . . ≤ T(n!)∶n! . Pick any interval in this ranked sequence that
contains at least a proportion 1 − α of the sequence: rejecting the null whenever the
statistic computed from the original data does not belong to this set is a valid test at
level α.
Example : ”lady tasting tea” (Fisher 1935)
A lady claims she is able to tell, by tasting a cup of milked tea, whether tea or milk
was put first in the cup. The null hypothesis is that she has no such ability.
Assume –without the lady knowing it– the cups are labeled as follows : in cup 1 to 4
milk was put first whereas it is the tea in cup 5 to 8. The lady then tastes each of the
8 cups and provides her answers.
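With 8 cups the permutation distribution can be enumerated completely. The sketch below follows Fisher's classical design, in which the lady knows that exactly 4 cups had milk first and therefore designates 4 cups; this is an assumption added here, as is the function name.

```python
from itertools import combinations

def tea_p_value(n_correct, n_cups=8, n_milk_first=4):
    """P(at least n_correct milk-first cups identified) under the null of no ability."""
    truth = set(range(n_milk_first))   # cups 1 to 4, coded 0..3
    hits = [
        len(truth & set(guess))
        for guess in combinations(range(n_cups), n_milk_first)
    ]
    return sum(h >= n_correct for h in hits) / len(hits)

p_all_correct = tea_p_value(4)
p_three_or_more = tea_p_value(3)
```

There are 70 equally likely ways to designate 4 cups out of 8 under the null, so a perfect answer has p-value 1/70, whereas three correct cups out of four is not significant at the 5% level.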
We may then compute $r_n^2$, the square of the correlation coefficient, both for the
original sample and for all of the samples obtained by permuting Y while keeping X
fixed. If this statistic is too large compared to its permuted equivalents, we reject the null.
More precisely, we compute $r^2_{(1):n!} \le \dots \le r^2_{(n!):n!}$ and we reject the null at level α
whenever the original result is larger than the empirical 1 − α quantile of this ranked sequence.
Some general considerations about permutation tests
Permutation is one of the very few general devices that may be used to derive valid
test procedures in quite general non parametric contexts. Yet the power of the
procedure depends on the statistic used to perform the test.
For instance, in the first previous example, if we simply decide on whether the
lady gets it right on the first cup she tastes, the experiment will not be very convincing.
Similarly, in the second example, notice we reject when the statistic computed from
the original dataset is too large because 0 is the smallest achievable value of the
square of the correlation coefficient. Hence under the alternative, we expect this
statistic to be large.
Notice that contrary to the Bootstrap, the justification of the method does not rely on
any asymptotic argument. Also we do not need the statistic to be pivotal under the
null. Finally, permutation tests are exact (but, contrary to what is claimed by
Wikipedia, exact tests are not always obtained by permutation).
Of course, when n is large n! is very large and computing all of the permutations is not
feasible. But such a large number of computations is not required. In the second
example, if we pick 19 permutations at random and reject the null whenever the
statistic computed from the original data is larger than all of its permuted
counterparts, we still get a valid 5%–level test because under the null this sequence is
exchangeable. This is an instance of a much more general principle that will be
discussed in the next sequence.
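The 19-permutation rule can be sketched as follows, with the squared correlation coefficient as statistic. Function names and the toy data are assumptions.

```python
import random

def r_squared(x, y):
    """Square of the empirical correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy * sxy / (sxx * syy)

def permutation_test_19(x, y, rng, n_perm=19):
    """Reject (True) iff the original r^2 strictly exceeds all permuted values."""
    observed = r_squared(x, y)
    y_perm = list(y)
    for _ in range(n_perm):
        rng.shuffle(y_perm)              # random permutation of Y, X kept fixed
        if r_squared(x, y_perm) >= observed:
            return False
    return True

# Strongly dependent toy data: the null should be rejected.
x = [float(i) for i in range(30)]
y = [2.0 * xi for xi in x]
reject_dependent = permutation_test_19(x, y, random.Random(0))
```

Under the null the original statistic and its 19 permuted counterparts are exchangeable, so the original one is the strict maximum with probability at most 1/20, which gives the 5% level.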
This spreadsheet allows to implement this procedure.