ECONOMETRICS Course


Lecture 1: The Econometrics of

Financial Returns

Proff. Massimo Guidolin – Francesco Rotondi

20192– Financial Econometrics

Winter/Spring 2022

Overview
General goals of the course and definition of risk(s)

Predicting asset returns: discrete vs. continuous


compounding and their aggregation properties

Stylized facts concerning asset returns

A baseline model for asset returns

Predicting Densities

Conditional vs. Unconditional Moments and Densities

Lecture 1: The Econometrics of Financial Returns – Proff. Guidolin & Rotondi 2


General goals
This course is about risk and prediction
Risk must be correctly measured in order to select the
quantity to be borne vs. to be hedged
Several kinds of risk: market, liquidity (including funding),
operational, business, credit
There are different kinds of risk we care for:
o Market risk is defined as the risk to a financial portfolio from
movements in market prices such as equity prices, foreign exchange
rates, interest rates, and commodity prices
o It is important to choose how much of this risk may be taken on (thus
reaping profits and losses), and how much should be hedged away
o Liquidity risk comes from a chance to have to trade in markets
characterized by low trading volume and/or large bid-ask spreads
o Under such conditions, the attempt to sell assets may push prices
lower, and assets may have to be sold at prices below their
fundamental values or within a time frame longer than expected
Lecture 1: The Econometrics of Financial Returns – Proff. Guidolin & Rotondi 3

General goals
Not all risks can be predicted and, even when they are predictable, not
all of them can be managed in asset markets
When they can be, this course cares about them
o Operational (op) risk is defined as the risk of loss due to physical
catastrophe, technical failure, and human error in the operation of a
firm, including fraud, failure of management, and process errors
Although it should be mitigated and ideally eliminated in any firm, this
course has little to say about op risk because op risk is typically very
difficult to hedge in asset markets
• But cat bonds…
o Op risk is instead typically managed using self-insurance or third-
party insurance
o Credit risk is defined as the risk that a counterparty may become less
likely to fulfill its obligation in part or in full on the agreed upon date
o Banks spend much effort to carefully manage their credit risk
exposure while nonfinancial firms try and remove it completely
Lecture 1: The Econometrics of Financial Returns – Proff. Guidolin & Rotondi 4
Predicting asset returns
When risk is quantifiable and manageable in asset markets,
then we shall predict the distribution of risky asset returns
o Business risk is defined as the risk that changes in variables of a
business plan will destroy that plan’s viability
• It includes quantifiable risks such as business cycle and demand equation
risks, and non-quantifiable risks such as changes in technology
o These risks are integral part of the core business of firms
The lines between the different kinds of risk are often blurred; e.g.,
the securitization of credit risk via credit default swaps (CDS) is an
example of a credit risk becoming a market risk (price of a CDS)
How do we measure and predict risks? Studying asset returns
Because returns have much better statistical properties than price
levels, risk modeling focuses on describing the dynamics of returns

R_{t+1} = (S_{t+1} − S_t)/S_t  (discretely compounded)        r_{t+1} = ln(S_{t+1}/S_t)  (continuously compounded)


Lecture 1: The Econometrics of Financial Returns – Proff. Guidolin & Rotondi 5

Predicting asset returns


At daily or weekly frequencies, the numerical differences
between simple and compounded returns are minor
Simple rates aggregate well cross-sectionally (in portfolios),
while continuously compounded returns aggregate over time
o The two returns are typically fairly similar over short time
intervals, such as daily: r_{t+1} = ln(1 + R_{t+1}) ≈ R_{t+1}
o The approximation holds because ln(x) ≈ x − 1 when x ≅ 1

o The simple rate of return definition has the advantage that the
rate of return on a portfolio is the portfolio of the rates of return
If V_{PF,t} is the value of the portfolio on day t, so that V_{PF,t} = Σ_i N_i S_{i,t},
then the portfolio rate of return is R_{PF,t+1} = Σ_i w_i R_{i,t+1},
where w_i = N_i S_{i,t}/V_{PF,t} is the portfolio weight in asset i


Lecture 1: The Econometrics of Financial Returns – Proff. Guidolin & Rotondi 6
Predicting asset returns
o This relationship does not hold for log returns because the log
of a sum is not the sum of the logs
o However, most assets have a lower bound of zero on the price.
Log returns are more convenient for preserving this lower
bound in risk models, because an arbitrarily large negative log
return tomorrow will still imply a positive price at the end of
tomorrow: S_{t+1} = S_t·exp(r_{t+1}) > 0
• If instead we use the simple rate of return definition, then tomorrow's
closing price is S_{t+1} = S_t(1 + R_{t+1}), so that the price might go negative
in the model unless the assumed distribution of tomorrow's return, R_{t+1},
is bounded below by −1
o An advantage of the log return definition is that we can
calculate the compounded return at the K-day horizon simply as
the sum of the daily returns:
r_{t+1:t+K} = ln(S_{t+K}/S_t) = Σ_{k=1}^{K} r_{t+k}
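A minimal numpy sketch (not part of the original slides) illustrating the two aggregation properties; the price paths, the 1,000-unit portfolio size, and the 60/40 weights are made-up illustrative inputs.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up daily prices for two assets over 11 days (illustration only)
S = 100.0 * np.cumprod(1 + 0.01 * rng.standard_normal((11, 2)), axis=0)

R = S[1:] / S[:-1] - 1        # simple (discretely compounded) returns
r = np.log(S[1:] / S[:-1])    # log (continuously compounded) returns

# (1) Simple returns aggregate cross-sectionally: R_PF = sum_i w_i * R_i
w = np.array([0.6, 0.4])              # portfolio weights at the start of day 1
N = w * 1_000.0 / S[0]                # number of shares held in a 1,000 portfolio
V0, V1 = N @ S[0], N @ S[1]           # portfolio values on days 0 and 1
print(np.isclose(V1 / V0 - 1, w @ R[0]))                  # True

# (2) Log returns aggregate over time: K-day log return = sum of daily log returns
print(np.allclose(np.log(S[-1] / S[0]), r.sum(axis=0)))   # True

# At a daily frequency the two definitions are numerically close
print(np.max(np.abs(r - R)))                              # small number
```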

Lecture 1: The Econometrics of Financial Returns – Proff. Guidolin & Rotondi 7

Stylized facts on asset returns


At daily or weekly frequencies, asset returns display weak (small
in absolute value) serial correlations
Returns are not normal and display asymmetries and fat tails
Asset returns display a few stylized facts that tend to generally apply
and that are well-known
o Refer to daily returns on the
S&P 500 from January 1, 2001,
through December 31, 2010
o But these properties are much
more general, see below
① Daily returns show weak autocorrelation:
Corr(R_{t+1}, R_{t+1−τ}) ≈ 0 for τ = 1, 2, 3, …
o Returns are almost impossible to predict from their own past


② The unconditional distribution of daily returns does not follow the
normal distribution
Lecture 1: The Econometrics of Financial Returns – Proff. Guidolin & Rotondi 8
Stylized facts on asset returns

o The histogram is more peaked around zero than a normal distribution


o Daily returns tend to have more small positive and fewer small
negative returns than the normal distribution (fat tails)
o The stock market exhibits occasional, very large drops but not equally
large upmoves
o Consequently, the distribution is asymmetric or negatively skewed
Lecture 1: The Econometrics of Financial Returns – Proff. Guidolin & Rotondi 9

Stylized facts on asset returns


At high frequencies, the standard deviation of asset returns
completely dominates the mean which is often not significant
Squared and absolute returns have strong serial correlations
and there is a leverage effect
Correlations between asset returns are time-varying
③ Std. dev. completely dominates the mean at short horizons
o S&P 500: daily mean of 0.056% and daily std. dev. of 1.3771%
④ Variance, measured, for example, by squared returns, displays
positive correlation with its own past
⑤ Equity and equity indices
display negative correlation
between variance and mean
returns, the leverage effect
⑥ Correlation between assets
appears to be time varying
Lecture 1: The Econometrics of Financial Returns – Proff. Guidolin & Rotondi 10
A baseline model for asset returns
Our general model for asset returns is: R_{t+1} = μ_{t+1} + σ_{t+1} z_{t+1}
o Correlations appear to increase in highly volatile down markets
and extremely so during market crashes
As the return horizon increases, the unconditional return
distribution changes and looks increasingly like a normal
Based on the previous list of stylized facts, our model of asset
returns will take the generic form:
R_{t+1} = μ_{t+1} + σ_{t+1} z_{t+1},   z_{t+1} ~ i.i.d. D(0, 1)
o z_{t+1} is an innovation term, which we assume is identically and
independently distributed (i.i.d.) according to the distribution D(0, 1),
which has a mean equal to zero and variance equal to one
o The conditional mean of the return, E_t[R_{t+1}], is thus μ_{t+1}, and the
conditional variance, E_t[(R_{t+1} − μ_{t+1})²], is σ²_{t+1}
o Often we assume μ_{t+1} = 0, as for daily data this is a reasonable assumption
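A minimal simulation sketch of the baseline model R_{t+1} = σ_{t+1} z_{t+1} (with μ_{t+1} = 0). The RiskMetrics-style variance recursion with λ = 0.94 and the standardized Student's t(8) choice for D(0, 1) are illustrative assumptions, not prescriptions from the slides.

```python
import numpy as np

rng = np.random.default_rng(42)

T, lam, nu = 2000, 0.94, 8            # illustrative sample size, smoothing, t d.o.f.
z = rng.standard_t(nu, size=T)
z /= np.sqrt(nu / (nu - 2))           # rescale so that Var(z) = 1, i.e. z ~ D(0, 1)

R = np.zeros(T)
sig2 = np.full(T + 1, 0.01**2)        # sigma^2_t, initialized at (1%)^2
for t in range(T):
    R[t] = np.sqrt(sig2[t]) * z[t]                      # R_{t+1} = sigma_{t+1} * z_{t+1}
    sig2[t + 1] = lam * sig2[t] + (1 - lam) * R[t]**2   # sigma^2_{t+1} = lam*sigma^2_t + (1-lam)*R^2_t

# Stylized facts reproduced by the model: weak autocorrelation of returns,
# strong autocorrelation of squared returns, fat-tailed unconditional distribution
def acf1(x):
    return np.corrcoef(x[:-1], x[1:])[0, 1]

print("ACF(1) of returns:        ", round(acf1(R), 3))
print("ACF(1) of squared returns:", round(acf1(R**2), 3))
print("excess kurtosis:          ", round(((R - R.mean())**4).mean() / R.var()**2 - 3, 2))
```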
Lecture 1: The Econometrics of Financial Returns – Proff. Guidolin & Rotondi 11

Density prediction
o Notice that D(0, 1) does not have to be a normal distribution
o Our task will consist of building and estimating models for both the
conditional variance and the conditional mean
• E.g., μ_{t+1} = φ_0 + φ_1 R_t and σ²_{t+1} = λσ²_t + (1 − λ)R²_t
o However, robust conditional mean relationships are not easy to find,
and assuming a zero mean return may be a prudent choice
In what sense do we care for predicting return distributions?

Lecture 1: The Econometrics of Financial Returns – Proff. Guidolin & Rotondi 12


Unconditional vs. Conditional objects
Unconditional moments and densities represent the long-run,
average properties of times series of interest
Conditional moments and densities capture how our perceptions
of the dynamics of random variables change over time as news arrives

One important notion in this course distinguishes between
unconditional vs. conditional moments and/or densities
An unconditional moment or density represents the long-run,
average, “stable” properties of one or more random variables
o Example 1: E[Rt+1] = 11% means that on average, over all data, one
expects that an asset gives a return of 11%
Lecture 1: The Econometrics of Financial Returns – Proff. Guidolin & Rotondi 13

Unconditional vs. Conditional objects


o Example 2: E[Rt+1] = 11% is not inconsistent with Et[Rt+1] = -6% if
news are bad today, e.g., after a bank has defaulted on its obligations
o Example 3: One good reason for the conditional mean to move over
time is that Et[Rt+1] = + Xt, which is a predictive regression
o Example 4: This applies also to variances, i.e., there is a difference
between Var[R_{t+1}] ≡ σ² and Var_t[R_{t+1}] ≡ σ²_{t+1}
o Example 5: Therefore the unconditional density of a time series
represents long-run average frequencies in one observed sample
o Example 6: The conditional density describes the expected frequen-
cies (probabilities) of the data based on currently available info
When a series (or a vector of series) is identically and
independently distributed (i.i.d. or IID) over time, the
conditional objects collapse into the unconditional ones
Otherwise, the unconditional objects are mixtures of the conditional ones…

Lecture 1: The Econometrics of Financial Returns – Proff. Guidolin & Rotondi 14


Appendix A: What is an econometric model?
A relationship btw. a set of variables subject to stochastic shocks
In general, say g(Yt, X1,t-1, X1,t-2, …, X1,t-J1, …., XK,t-1, …, XK,t-Jk) = 0
where all variables are random, subject to random perturbations
o That the relationship is set equal to zero is not that important
o When the relationship g(·) is sufficiently simple (call it h(·)), then
some variables will be explained or predicted by others, Y_t = h(X_{1,t−1},
X_{1,t−2}, …, X_{1,t−J1}, …, X_{K,t−1}, …, X_{K,t−JK}) or even
Y_t = h(X_{1,t−1}, X_{1,t−2}, …, X_{1,t−J1}, …, X_{K,t−1}, …, X_{K,t−JK}) + ε_t
where X_{1,t−1}, X_{1,t−2}, …, X_{1,t−J1}, …, X_{K,t−1}, …, X_{K,t−JK} are fixed in repeated
samples
o When h(·) is so simple as to be almost trivial, then it may be
represented by a linear function:
Y_t = β_0 + β_{1,1}X_{1,t−1} + β_{1,2}X_{1,t−2} + … + β_{1,J1}X_{1,t−J1} + … + β_{K,1}X_{K,t−1} + … + β_{K,JK}X_{K,t−JK} + ε_t
o Recall that linear functions may be interpreted as first-order Taylor
expansions, in this case of h(X1,t-1, X1,t-2, …, X1,t-J1, …, XK,t-1, …, XK,t-Jk)
Lecture 1: The Econometrics of Financial Returns – Prof. Guidolin 15

Appendix A: What is an econometric model?


In Y_t = h(X_{1,t−1}, X_{1,t−2}, …, X_{1,t−J1}, …, X_{K,t−1}, …, X_{K,t−JK}) + ε_t
the properties of the shocks ε_t will matter a lot
In general terms, we say that ε_t ~ D(0, V_{t−1|t}; θ)
o The zero mean in ε_t ~ D(0, V_{t−1|t}; θ) is just a standardization, because
any deviations may usually be absorbed by the constant(s), β_0
o D(·; θ) is a parametric distribution from which the shocks are drawn
o This is where statistics enters the model and turns it into
an econometric model
o θ is the vector or matrix collecting such parameters
• For instance, it is the number of degrees of freedom in a Student's t
distribution
o V_{t−1|t} is a variance-covariance (sometimes «dispersion») matrix known on
the basis of time t−1 information and valid for time t
o «~» does not specify whether there is any dependence structure
characterizing the data, but typically we assume ε_t ~ IID D(0, V_{t−1|t}; θ)
Lecture 1: The Econometrics of Financial Returns – Prof. Guidolin 16
Appendix A: What is an econometric model?
o A linear model with the annotations used on the slide:
Y_t = β_0 + β_{1,j} X_{1,t−j} + β_{1,j+1} X_{1,t−(j+1)} + … + β_{K,j} X_{K,t−j} + β_{K,j+1} X_{K,t−(j+1)} + … + ε_t,    ε_t ~ IID D(0, V_{t−j})
o Y_t is the variable we care to explain (for instance, FTSE MIB daily returns); β_0 is the value of Y_t absent other effects
o β_{1,j}, β_{1,j+1}, … are the marginal effects of the first explanatory variable at increasing lags (for instance, current and lagged ECB interest rates, with j ≥ 0)
o X_{K,t−j}, X_{K,t−(j+1)}, … are further variables that explain Y_t (for instance, the aggregate earnings-price ratio and its lags)
o ε_t captures omitted variables, functional misspecifications, and measurement errors
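A minimal estimation sketch of a linear model of this kind, using synthetic stand-ins for the variables named on the slide; the data-generating process, the two-lag specification, and the coefficient values are all made up for illustration.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
T = 500

# Synthetic stand-ins: x1 ~ "ECB interest rate", x2 ~ "earnings-price ratio", y ~ "index return"
x1 = np.cumsum(0.01 * rng.standard_normal(T))
x2 = 0.05 + 0.01 * rng.standard_normal(T)
eps = 0.02 * rng.standard_normal(T)
y = 0.001 - 0.3 * np.roll(x1, 1) + 0.5 * np.roll(x2, 1) + eps   # true model uses one lag of each

# Regressor matrix with one and two lags of each explanatory variable
X = np.column_stack([np.roll(x1, 1), np.roll(x1, 2), np.roll(x2, 1), np.roll(x2, 2)])
y, X = y[2:], X[2:]              # drop rows contaminated by the wrap-around of np.roll
X = sm.add_constant(X)           # beta_0

ols = sm.OLS(y, X).fit()
print(ols.params)                # estimates of beta_0 and the four slope coefficients
print(ols.bse)                   # their standard errors
```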
Lecture 1: The Econometrics of Financial Returns – Prof. Guidolin 17
Lecture 2: Essential Concepts in
Time Series Analysis

Proff. Massimo Guidolin/Francesco Rotondi

20192– Financial Econometrics

Winter/Spring 2022

Overview
Time Series: When Can We Focus on the First Two Moments
Only?

Strict vs. Weak Stationarity

White noise processes

The sample autocorrelation function vs. the population ACF

The sample partial autocorrelation function vs. the


population PACF

Box-Pierce-Ljung test for sample ACF

Lecture 2: Essential Concepts in Time Series Analysis – Proff. Guidolin & Rotondi 2
Time Series
A time series consists of a sequence of random variables, y_1, y_2, …,
y_T, also known as a stochastic process {y_t}_{t=1}^{T}, of which we only
observe the empirical realizations
o An observed time series {y_t}_{t=1}^{T} (technically, a sub-sequence,
because limited to a finite sample) collects the realized values of a family
of random variables {Y_t}_{t=−∞}^{+∞} defined on an appropriate probability
space
o Note the difference between the sample ({y_t}_{t=1}^{T}) and the population ({Y_t}_{t=−∞}^{+∞})
A time series model for the observations {y_t}_{t=1}^{T} is a specification
of the joint distribution of the set of random variables of which
the sampled data are a realization
o We often exploit the linearity of the process to specify only the first-
and second-order moments of the joint distribution, i.e., the mean,
variances and covariances of {Y_t}_{t=−∞}^{+∞}

Lecture 2: Essential Concepts in Time Series Analysis – Proff. Guidolin & Rotondi 3

Linear Processes

If a time series process is linear, modelling its conditional mean


and variance is sufficient in a mean-squared error sense
Lecture 2: Essential Concepts in Time Series Analysis – Proff. Guidolin & Rotondi 4
Strict Stationarity
To use past realizations of a variable of interest to forecast its
future values, it is necessary for the stochastic process that has
originated the observations to be stationary
Loosely speaking, a process is said to be stationary if its statistical
properties do not change over time

Lecture 2: Essential Concepts in Time Series Analysis – Proff. Guidolin & Rotondi 5

Weak (Covariance) Stationarity


In many applications, a weaker form of stationarity generally
provides a useful sufficient condition

γ_h ≡ Cov(y_t, y_{t−h}),  h = 0, ±1, ±2, …   (Autocovariance function, ACVF)
ρ_h ≡ γ_h/γ_0 (where γ_0 is the variance) is called the autocorrelation function
(ACF), for h = …, −2, −1, 1, 2, …
o Often more meaningful than the ACVF because it is expressed as pure
numbers that fall in [−1, 1]
Lecture 2: Essential Concepts in Time Series Analysis – Proff. Guidolin & Rotondi 6
An Example of Stationary Series
[Figure: Panel (a) AR(1) simulated data; Panel (b) AR(1) simulated data vs. random walk simulated data]

A time series generated by a stationary process fluctuates around a


constant mean, because its memory of past shocks decays over time
o The data plotted in panel (a) are 1,000 realizations of a first-order
autoregressive (henceforth, AR) process of the type y_t = φ_0 + φ_1 y_{t−1} + ε_t,
with φ_0 = 0 and φ_1 = 0.2
o In panel (b) we have a nonstationary random walk, y_t = y_{t−1} + ε_t
o We shall describe these models later
Lecture 2: Essential Concepts in Time Series Analysis – Proff. Guidolin & Rotondi 7

White Noise Process


A fundamental class of stationary processes, white noise, is the
building block of all (covariance) stationary processes

[Figure: Gaussian white noise simulated data, 1,000 observations]
Lecture 2: Essential Concepts in Time Series Analysis – Proff. Guidolin & Rotondi 8
Sample Autocorrelation Function
Stationary AR and white noise processes may sometimes be hard to
tell apart – what tools are available to identify them?
The sample ACF reflects important information about the linear
dependence of a series at different times

If {Y_t}_{t=−∞}^{+∞} is an i.i.d. process with finite variance, then in a large
sample the estimated autocorrelations ρ̂_h will be asymptotically
normally distributed with mean ρ_h and variance 1/T
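A short numpy sketch (illustrative, reusing the AR coefficient 0.2 from the earlier example) that computes the sample ACF and compares it with the ±1.96/√T band.

```python
import numpy as np

rng = np.random.default_rng(7)
T = 1000

# Simulate an AR(1) with phi_1 = 0.2 and a pure white noise series
eps = rng.standard_normal(T)
y = np.zeros(T)
for t in range(1, T):
    y[t] = 0.2 * y[t - 1] + eps[t]

def sample_acf(x, nlags):
    """Sample autocorrelations rho_hat_1, ..., rho_hat_nlags."""
    x = x - x.mean()
    denom = np.sum(x**2)
    return np.array([np.sum(x[h:] * x[:-h]) / denom for h in range(1, nlags + 1)])

band = 1.96 / np.sqrt(T)                  # approximate 95% confidence band under i.i.d.
rho_hat = sample_acf(y, 20)
print("AR(1) lags outside the band:", np.where(np.abs(rho_hat) > band)[0] + 1)        # typically lag 1
print("white noise lags outside:   ", np.where(np.abs(sample_acf(eps, 20)) > band)[0] + 1)  # none, or a spurious one
```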
Lecture 2: Essential Concepts in Time Series Analysis – Proff. Guidolin & Rotondi 9

Sample Autocorrelation Function


[Figure: Panel (a) sample autocorrelations of AR(1) simulated data (AR coefficient = 0.2, mean = 0); Panel (b) sample autocorrelations of random walk simulated data (drift = 0); lags 1–20]


o The dashed lines correspond to approximate (asymptotic) 95%
confidence intervals built as ±1.96/√T
o The SACF in panel (a) shows that a stationary process quickly
“forgets” information from a distant past
o The theoretical ACF of a random walk process is (essentially) equal to one at
all lags, but because the SACF is a downward-biased estimate of the true
and unobserved ACF, the sample coefficients are less than 1
Lecture 2: Essential Concepts in Time Series Analysis – Proff. Guidolin & Rotondi 10
Ljung-Box Test for SACF
It is also possible to jointly test whether several (say, M) consecutive
autocorrelation coefficients are equal to zero:
H_0: ρ_1 = ρ_2 = ⋯ = ρ_M = 0   vs.   H_a: ∃ some j s.t. ρ_j ≠ 0
Box and Pierce (1970) and Ljung and Box (1978) developed a
well-known portmanteau test based on the Q- or LB-statistic,
LB(M) = T(T + 2) Σ_{h=1}^{M} ρ̂²_h/(T − h), asymptotically χ²_M under the null
[EViews correlogram output. Panel (a): serial correlation structure of simulated AR(1) data – the SACF and SPACF are sizable at lag 1 (0.168) and the Ljung-Box Q-statistics reject the null of no autocorrelation at every lag order (p-values ≈ 0.000). Panel (b): simulated white noise data – all sample autocorrelations and partial autocorrelations are small and the Q-statistic p-values never fall below conventional significance levels.]
Lecture 2: Essential Concepts in Time Series Analysis – Proff. Guidolin & Rotondi 11
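A hedged sketch of the same exercise using statsmodels' implementation of the Ljung-Box test; the simulated AR(1) and white noise series mimic the two panels above.

```python
import numpy as np
from statsmodels.stats.diagnostic import acorr_ljungbox

rng = np.random.default_rng(7)
T = 1000
eps = rng.standard_normal(T)            # white noise
y = np.zeros(T)
for t in range(1, T):
    y[t] = 0.2 * y[t - 1] + eps[t]      # AR(1) with coefficient 0.2

# Ljung-Box Q-statistics and p-values for M = 5, 10, 20 autocorrelations
print(acorr_ljungbox(y,   lags=[5, 10, 20]))   # tiny p-values: reject H0 of no autocorrelation
print(acorr_ljungbox(eps, lags=[5, 10, 20]))   # large p-values: do not reject H0
```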

Sample Partial Autocorrelation Function


The partial autocorrelation between y_t and y_{t−h} is the
autocorrelation between the two random variables in the time
series, conditional on y_{t−1}, y_{t−2}, …, y_{t−h+1}
Equivalently, it is the ACF measured after netting out the portion of the variability
already linearly explained by the lags between y_{t−1} and y_{t−h+1}
o The sample estimate of the partial autocorrelation at lag h is obtained
as the ordinary least squares estimator of φ_h in an autoregressive
model: y_t = φ_0 + φ_1 y_{t−1} + φ_2 y_{t−2} + … + φ_h y_{t−h} + ε_t (see the sketch below)
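A minimal sketch of this OLS-based definition of the sample PACF, cross-checked against statsmodels' pacf with the "ols" method; the simulated AR(1) data are illustrative.

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.tsa.stattools import pacf

rng = np.random.default_rng(7)
T = 1000
y = np.zeros(T)
for t in range(1, T):
    y[t] = 0.2 * y[t - 1] + rng.standard_normal()

def pacf_by_ols(x, h):
    """Sample partial autocorrelation at lag h: OLS coefficient on x_{t-h}
    in a regression of x_t on a constant and x_{t-1}, ..., x_{t-h}."""
    X = np.column_stack([x[h - k: len(x) - k] for k in range(1, h + 1)])
    beta = sm.OLS(x[h:], sm.add_constant(X)).fit().params
    return beta[-1]                      # coefficient on the longest lag, phi_h

print([round(pacf_by_ols(y, h), 3) for h in (1, 2, 3)])
print(np.round(pacf(y, nlags=3, method="ols")[1:], 3))   # statsmodels, same idea
```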
[Same EViews correlogram output as on the previous slide: the SPACF of the simulated AR(1) data cuts off after lag 1 (PAC = 0.168 at lag 1, insignificant afterwards), while both the SACF and SPACF of the simulated white noise data are insignificant at all lags.]
Lecture 2: Essential Concepts in Time Series Analysis – Proff. Guidolin & Rotondi 12
Lecture 3: Autoregressive Moving
Average (ARMA) Models and their
Practical Applications
Prof. Massimo Guidolin

20192– Financial Econometrics

Winter/Spring 2021

Overview
Moving average processes
Autoregressive processes: moments and the Yule-Walker
equations
Wold’s decomposition theorem
Moments, ACFs and PACFs of AR and MA processes
Mixed ARMA(p, q) processes
Model selection: SACF and SPACF vs. information criteria
Model specification tests
Forecasting with ARMA models
A few examples of applications
Lecture 3: Autoregressive Moving Average (ARMA) Models – Prof. Guidolin 2
Moving Average Process

An MA(q) process is a finite linear combination of white noise terms: y_t = μ + ε_t + θ_1 ε_{t−1} + … + θ_q ε_{t−q}
MA(q) models are always stationary, as they are finite linear
combinations of white noise processes
o Therefore an MA(q) process has a constant mean and variance, and
autocovariances that differ from zero up to lag q but are zero afterwards

Lecture 3: Autoregressive Moving Average (ARMA) Models – Prof. Guidolin 3

Moving Average Process: Examples


MA(q) models are always stationary, as they are finite linear
combinations of white noise processes
o Therefore an MA(q) process has a constant mean and variance, and
autocovariances that differ from zero up to lag q but are zero afterwards

o Simulations are based on


Lecture 3: Autoregressive Moving Average (ARMA) Models – Prof. Guidolin 4
Moving Average Process: Examples

Lecture 3: Autoregressive Moving Average (ARMA) Models – Prof. Guidolin 5

Autoregressive Process
An autoregressive (henceforth AR) process of order p is a process
in which the series y_t is a weighted sum of p past values of the
series (y_{t−1}, y_{t−2}, …, y_{t−p}) plus a white noise error term, ε_t
o AR(p) models are simple univariate devices to capture the observed
Markovian nature of financial and macroeconomic data, i.e., the fact
that the series tends to be influenced at most by a finite number of
past values of the same series, which is often also described as the
series only having a finite memory (see below on this claim)

Lecture 3: Autoregressive Moving Average (ARMA) Models – Prof. Guidolin 6


The Lag and Difference Operators
The lag operator, generally denoted by L, shifts the time index of a
variable regularly sampled over time backward by one unit
o Therefore, applying the lag operator to a generic variable y_t, we
obtain the value of the variable at time t − 1, i.e., L y_t = y_{t−1}
o Equivalently, applying L^k means lagging the variable k > 1 times, i.e.,
L^k y_t = L^{k−1}(L y_t) = L^{k−1} y_{t−1} = L^{k−2}(L y_{t−1}) = ⋯ = y_{t−k}
The difference operator, Δ, is used to express the difference
between consecutive realizations of a time series, Δy_t = y_t − y_{t−1}
o With Δ we denote the first difference; with Δ² we denote the second-
order difference, i.e., Δ²y_t = Δ(Δy_t) = Δ(y_t − y_{t−1}) = Δy_t − Δy_{t−1} =
(y_t − y_{t−1}) − (y_{t−1} − y_{t−2}) = y_t − 2y_{t−1} + y_{t−2}, and so on
o Note that Δ²y_t ≠ y_t − y_{t−2}
o Δy_t can also be rewritten using the lag operator, i.e., Δy_t = (1 − L)y_t
o More generally, we can write a difference equation of any order, Δ^k y_t,
as Δ^k y_t = (1 − L)^k y_t, k ≥ 1
Lecture 3: Autoregressive Moving Average (ARMA) Models – Prof. Guidolin 7

Stability and Stationarity of AR(p) Processes


In the case of an AR(p), because it is a stochastic difference equation, it
can be rewritten as y_t − φ_1 y_{t−1} − φ_2 y_{t−2} − … − φ_p y_{t−p} = φ_0 + ε_t
or, more compactly, as φ(L)y_t = φ_0 + ε_t, where φ(L) is a
polynomial of order p, φ(L) ≡ 1 − φ_1 L − φ_2 L² − … − φ_p L^p
Replacing in the polynomial φ(L) the lag operator with a variable λ
and setting it equal to 0, i.e., φ(λ) = 0, we obtain the characteristic
equation associated with the difference equation φ(L)y_t = φ_0 + ε_t
o A value of λ which satisfies the polynomial equation is called a root
o A polynomial of degree p has p roots, often complex numbers
If the absolute value of all the roots of the characteristic equation
is higher than one, the process is said to be stable (see the sketch below)
A stable process is always weakly stationary
o Even if stability and stationarity are conceptually different, stability
conditions are commonly referred to as stationarity conditions
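A minimal sketch of the stability check: the characteristic polynomial 1 − φ_1 z − … − φ_p z^p is solved numerically and the moduli of its roots are compared with one; the coefficient values are illustrative.

```python
import numpy as np

def ar_is_stable(phi):
    """phi = [phi_1, ..., phi_p]; True if all roots of 1 - phi_1 z - ... - phi_p z^p
    lie outside the unit circle (modulus > 1)."""
    coeffs = np.r_[-np.array(phi)[::-1], 1.0]     # numpy.roots wants highest power first
    roots = np.roots(coeffs)
    return np.all(np.abs(roots) > 1.0)

print(ar_is_stable([0.5, 0.3]))   # True: satisfies the AR(2) stationarity conditions
print(ar_is_stable([0.9, 0.2]))   # False: phi_1 + phi_2 > 1
print(ar_is_stable([1.0]))        # False: random walk (unit root)
```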
Lecture 3: Autoregressive Moving Average (ARMA) Models – Prof. Guidolin 8
Wold’s Decomposition Theorem


y_t − μ = ε_t + ψ_1 ε_{t−1} + ψ_2 ε_{t−2} + ⋯ = Σ_{i=0}^{∞} ψ_i ε_{t−i}   (with ψ_0 = 1)
An autoregressive process of order p with no constant and no other
predetermined, fixed terms can be expressed as an infinite-order
moving average process, MA(∞), and it is therefore linear
If the process is stationary, the sum Σ_{i=0}^{∞} ψ_i ε_{t−i} will converge
The (unconditional) mean of an AR(p) model is
E[y_t] = φ_0/(1 − φ_1 − φ_2 − … − φ_p)
o The sufficient condition for the mean of an AR(p) process to exist and
be finite is that the sum of the AR coefficients is less than one in
absolute value, |φ_1 + φ_2 + … + φ_p| < 1; see next
Lecture 3: Autoregressive Moving Average (ARMA) Models – Prof. Guidolin 9

Moments and ACFs of an AR(p) Process


The (unconditional) variance of an AR(p) process is computed from the
Yule-Walker equations written in recursive form (see below)
o In the AR(2) case, for instance, we have
Var[y_t] = (1 − φ_2)σ²_ε / [(1 + φ_2)(1 − φ_1 − φ_2)(1 + φ_1 − φ_2)]
o For AR(p) models, the characteristic polynomials are rather
convoluted – it is infeasible to define simple restrictions on the AR
coefficients that ensure covariance stationarity
o E.g., for an AR(2), the conditions are φ_1 + φ_2 < 1, φ_2 − φ_1 < 1, |φ_2| < 1
The autocovariances and autocorrelations functions of AR(p)
processes can be computed by solving a set of simultaneous
equations known as Yule-Walker equations
o It is a system of K equations that we recursively solve to determine the
ACF of the process, i.e., ρ_h for h = 1, 2, …
o See the example concerning the AR(2) process given in the lectures and/or in
the textbook (a numerical sketch follows below)
For a stationary AR(p), the ACF will decay geometrically to zero
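A small numerical sketch of the Yule-Walker recursion for an AR(2); the coefficients φ_1 = 0.5, φ_2 = 0.3 and σ²_ε = 1 are illustrative.

```python
import numpy as np

# ACF of a stationary AR(2) via the Yule-Walker recursion:
#   rho_1 = phi_1 / (1 - phi_2)
#   rho_h = phi_1 * rho_{h-1} + phi_2 * rho_{h-2},  h >= 2
phi1, phi2 = 0.5, 0.3

rho = np.zeros(11)
rho[0] = 1.0
rho[1] = phi1 / (1 - phi2)
for h in range(2, 11):
    rho[h] = phi1 * rho[h - 1] + phi2 * rho[h - 2]
print(np.round(rho[1:], 3))    # geometric decay toward zero

# Unconditional variance from the AR(2) formula above, with sigma_eps^2 = 1
sigma2_eps = 1.0
var_y = (1 - phi2) * sigma2_eps / ((1 + phi2) * (1 - phi1 - phi2) * (1 + phi1 - phi2))
print(round(var_y, 3))
```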
Lecture 3: Autoregressive Moving Average (ARMA) Models – Prof. Guidolin 10
ACF and PACF of AR(p) Process
The SACF and SPACF are of primary importance to identify the lag
order p of a process

Lecture 3: Autoregressive Moving Average (ARMA) Models – Prof. Guidolin 11

ACF and PACF of AR(p) and MA(q) Processes

An AR(p) process is described by an ACF that may slowly tail off at


infinity and a PACF that is zero for lags larger than p
Conversely, the ACF of a MA(q) process cuts off after lag q, while the
PACF of the process may slowly tail off at infinity

Lecture 3: Autoregressive Moving Average (ARMA) Models – Prof. Guidolin 12


ARMA(p,q) Processes
In some applications, the empirical description of the dynamic
structure of the data requires us to specify high-order AR or MA
models, with many parameters
To overcome this problem, the literature has introduced the class of
autoregressive moving average (ARMA) models, which combine
AR and MA components

Lecture 3: Autoregressive Moving Average (ARMA) Models – Prof. Guidolin 13

ARMA(p,q) Processes
We can also write the ARMA(p, q) process using the lag operator:
φ(L)y_t = φ_0 + θ(L)ε_t
The ARMA(p, q) model will have a stable solution (seen as a
deterministic difference equation) and will be covariance
stationary if the roots of the polynomial φ(z) = 1 − φ_1 z − φ_2 z² − … − φ_p z^p
lie outside the unit circle
The statistical properties of an ARMA process will be a combination
of those of its AR and MA components
The unconditional expectation of an ARMA(p, q) is
E[y_t] = φ_0/(1 − φ_1 − … − φ_p)
o An ARMA(p, q) process has the same mean as the corresponding
ARMA(p, 0) or AR(p)
The general variances and autocovariances can be found solving
the Yule-Walker equation, see the book
Lecture 3: Autoregressive Moving Average (ARMA) Models – Prof. Guidolin 14
ARMA(p,q) Processes
For a general ARMA(p, q) model, beginning with lag q, the values of
ρ_s will satisfy:
ρ_s = φ_1 ρ_{s−1} + φ_2 ρ_{s−2} + … + φ_p ρ_{s−p}
o After the qth lag, the ACF of an ARMA model is geometrically


declining, similarly to a pure AR(p) model
The PACF is useful for distinguishing between an AR(p) process and
an ARMA(p, q) process
o While both have geometrically declining autocorrelation functions,
the former will have a partial autocorrelation function which cuts off
to zero after p lags, while the latter will have a partial autocorrelation
function which declines geometrically

Lecture 3: Autoregressive Moving Average (ARMA) Models – Prof. Guidolin 15

ARMA(p,q) Processes

o As one would expect of an ARMA process, both the ACF and the PACF
decline geometrically: the ACF as a result of the AR part and the PACF
as a result of the MA part
o However, as the coefficient of the MA part is quite small, the PACF
becomes insignificant after only two lags; instead, the AR coefficient is
higher (0.7) and thus the ACF dies away only after about 9 lags, and rather slowly
Lecture 3: Autoregressive Moving Average (ARMA) Models – Prof. Guidolin 16
Model Selection: SACF and SPACF
A first strategy compares the
sample ACF and PACF with the
theoretical, population ACF and
PACF and uses them to identify
the order of the ARMA(p, q) model
US CPI Inflation

o Process of some ARMA type, but it


remains quite difficult to determine
its precise order (especially the MA)
Lecture 3: Autoregressive Moving Average (ARMA) Models – Prof. Guidolin 17

Model Selection: Information Criteria


The alternative is to use information criteria (often shortened to IC)
They essentially trade off the goodness of (in-sample) fit and the
parsimony of the model and provide a (cardinal, even if specific to
an estimation sample) summary measure
o We are interested in forecasting out-of-sample: using too many para-
meters we will end up fitting noise and not the dependence structure
in the data, reducing the predictive power of the model (overfitting)
Information criteria combine, in rather simple mathematical
formulations, two terms: one which is a function of the sum of
squared residuals (SSR), supplemented by a penalty for the loss of
degrees of freedom from the number of parameters of the model
o Adding a new variable (or a lag of a shock or of the series itself) will
have two opposite effects on the information criteria: it will reduce
the residual sum of squares but increase the value of the penalty term
The best performing (promising in out-of-sample terms) model will
be the one that minimizes the information criteria
Lecture 3: Autoregressive Moving Average (ARMA) Models – Prof. Guidolin 18
Model Selection: Information Criteria
With k denoting the number of parameters and T the sample size, the criteria take (in their standard forms)
AIC = ln(SSR/T) + 2k/T     SBIC = ln(SSR/T) + (k/T)·ln T     HQIC = ln(SSR/T) + (2k/T)·ln(ln T)
The SBIC is the one IC that imposes the strongest penalty (ln T) for
each additional parameter that is included in the model
The HQIC embodies a penalty that is somewhere in between the
ones typical of the AIC and the SBIC
o SBIC is a consistent criterion, i.e.,
it selects the true model
asymptotically
o AIC asymptotically overestimates
the order/complexity of a model
with positive probability
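A hedged sketch of IC-based selection on a simulated ARMA(1,1), using statsmodels' ARIMA estimator; the grid of orders and the simulated coefficients are illustrative choices.

```python
import numpy as np
import warnings
from statsmodels.tsa.arima.model import ARIMA

warnings.filterwarnings("ignore")       # silence convergence chatter for the small grid

rng = np.random.default_rng(3)
T = 500
eps = rng.standard_normal(T + 1)
y = np.zeros(T)
for t in range(1, T):                   # simulate y_t = 0.7 y_{t-1} + eps_t + 0.3 eps_{t-1}
    y[t] = 0.7 * y[t - 1] + eps[t] + 0.3 * eps[t - 1]

results = {}
for p in range(3):
    for q in range(3):
        res = ARIMA(y, order=(p, 0, q)).fit()
        results[(p, q)] = (res.aic, res.bic, res.hqic)

print("AIC choice: ", min(results, key=lambda k: results[k][0]))
print("SBIC choice:", min(results, key=lambda k: results[k][1]))
print("HQIC choice:", min(results, key=lambda k: results[k][2]))
```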

Estimation Methods: OLS vs MLE


o It is not uncommon that different criteria lead to different models
o Using the guidance derived from the inspection of the correlogram,
we believe that an ARMA model is more likely, given that the ACF does
not show signs of geometric decay
o One could be inclined to conclude in favor of an ARMA(2,1) for the US
monthly CPI inflation rate
The estimation of an AR(p) model is simple because it can be
performed by (conditional) OLS
o Conditional on p starting values for the series
When an MA(q) component is included, the estimation becomes
more complicated and requires Maximum Likelihood
o Please review Statistics prep-course + see the textbook
However, this opposition is only apparent: conditional on the p
starting values, under the assumptions of a classical regression
model, OLS and MLE are identical for an AR(p)
o See 20191 for the classical linear regression model
Lecture 3: Autoregressive Moving Average (ARMA) Models – Prof. Guidolin 20
Estimation Methods: MLE
The first step in deriving the MLE consists of defining the joint
probability distribution of the observed data
The joint density of the random variables in the sample may be
written as a product of conditional densities, so that the log-likelihood
function of an ARMA(p, q) process has the form
ℓ(θ) = Σ_t ln f(y_t | y_{t−1}, …, y_1; θ)
o For instance, if y_t has a joint and marginal normal pdf (which must
derive from the fact that ε_t has it), then each conditional density is Gaussian

o MLE can be applied to any parametric distribution even when


different from the normal
Under general conditions, the resulting estimators will then be
consistent and have an asymptotic normal distribution, which may
be used for inference
Lecture 3: Autoregressive Moving Average (ARMA) Models – Prof. Guidolin 21

Example: ARMA(2,1) Model of US Inflation

Lecture 3: Autoregressive Moving Average (ARMA) Models – Prof. Guidolin 22


Model Specification Tests
If the model has been specified correctly, all the structure in the
(mean of the) data ought to be captured and the residuals shall not
exhibit any predictable patterns
Most diagnostic checks involve the analysis of the residuals
① An intuitive way to identify potential problems with an ARMA
model is to plot the residuals or, better, the standardized residuals,
i.e., the residuals divided by their estimated standard deviation
o If the residuals are normally distributed with zero mean and unit
variance, then approximately 95% of the standardized residuals
should fall in an interval of ±2 around zero
o Also useful to plot the squared (standardized) residuals: if the model
is correctly specified, such a plot of squared residuals should not
display any clusters, i.e., the tendency of high (low) squared residuals
to be followed by other high (low) squared standardized residuals
② A more formal way to test for normality of the residuals is the
Jarque-Bera test
Lecture 3: Autoregressive Moving Average (ARMA) Models – Prof. Guidolin 23

Model Specification Tests: Jarque-Bera Test


o Because the normal distribution is symmetric, the third central
moment, denoted by μ_3, should be zero; and the fourth central
moment, μ_4, should satisfy μ_4 = 3σ⁴_ε
o A typical index of asymmetry based on the third moment (skewness),
that we denote by Ŝ, of the distribution of the residuals is Ŝ = μ̂_3/σ̂³_ε
o The most commonly employed index of tail thickness based on the
fourth moment (excess kurtosis), denoted by K̂, is K̂ = μ̂_4/σ̂⁴_ε − 3

o If the residuals were normal, Ŝ and K̂ would have a zero-mean
asymptotic distribution, with variances 6/T and 24/T, respectively
o The Jarque-Bera test concerns the composite null hypothesis of zero
skewness and zero excess kurtosis (i.e., normality): S = 0 and K = 0

Lecture 3: Autoregressive Moving Average (ARMA) Models – Prof. Guidolin 24


Model Specification Tests: Jarque-Bera Test
Jarque and Bera prove that, because the square roots of the sample
statistics λ²_1 = T·Ŝ²/6 and λ²_2 = T·K̂²/24
are N(0,1) distributed, the null consists of a joint test that λ_1 and
λ_2 are both zero, carried out through the statistic λ²_1 + λ²_2 ~ χ²_2 as T ⟶ ∞
③ Compute sample autocorrelations of residuals and perform tests
of hypotheses to assess whether there is any linear dependence
o The same portmanteau tests based on the Q-statistic can be applied to test
the null hypothesis that there is no autocorrelation in the residuals at orders up to h (see the sketch below)
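A minimal sketch of the Jarque-Bera computation on fat-tailed pseudo-residuals (a Student's t(5) sample, used purely for illustration), cross-checked against scipy's implementation.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(11)
resid = rng.standard_t(5, size=2000)        # fat-tailed "residuals" for illustration

T = resid.shape[0]
e = resid - resid.mean()
sigma = np.sqrt(np.mean(e**2))
S_hat = np.mean(e**3) / sigma**3            # sample skewness
K_hat = np.mean(e**4) / sigma**4 - 3        # sample excess kurtosis

JB = T * (S_hat**2 / 6 + K_hat**2 / 24)     # JB = lambda_1^2 + lambda_2^2 ~ chi^2(2)
p_value = stats.chi2.sf(JB, df=2)
print(round(JB, 2), round(p_value, 4))      # large JB, tiny p-value: reject normality

# Cross-check against scipy's implementation
print(stats.jarque_bera(resid))
```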

ARMA(2,1) Model of US Inflation

Lecture 3: Autoregressive Moving Average (ARMA) Models – Prof. Guidolin 25

Example: ARMA(2,1) Model of US Inflation

Residuals Squared Residuals

Lecture 3: Autoregressive Moving Average (ARMA) Models – Prof. Guidolin 26


Forecasting with ARMA
In-sample forecasts are those generated with reference to the same
data that were used to estimate the parameters of the model
o The R-square of the model is a measure of in-sample goodness of fit
o Yet, ARMA are time series models in which the past of a series is used
to explain the behavior of the series, so that using the R-square to
quantify the quality of a model faces limitations
We are more interested in how well the model performs when it is
used to forecast out-of-sample, i.e., to predict the value of
observations that were not used to specify and estimate the model
Forecasts can be one-step-ahead, ŷ_t(1), or multi-step-ahead, ŷ_t(h)
In order to evaluate the usefulness of a forecast, we need to specify
a loss function that defines how concerned we are if our forecast
were to be off relative to the realized value by a certain amount
Convenient results obtain if one assumes a quadratic loss function,
i.e., the minimization of E_t[(y_{t+h} − ŷ_t(h))²]
Lecture 3: Autoregressive Moving Average (ARMA) Models – Prof. Guidolin 27

Forecasting with AR(p)


o This is known as the mean square forecast error (MSFE)
It is possible to prove that the MSFE is minimized when ŷ_t(h) is equal
to E[y_{t+h} | ℑ_t], where ℑ_t is the information set available at time t
In words, the conditional mean of y_{t+h} given its past observations
is the best choice of ŷ_t(h) in terms of MSFE
In the case of an AR(p) model, we have:
ŷ_t(h) = φ_0 + φ_1 ŷ_t(h − 1) + … + φ_p ŷ_t(h − p)
where ŷ_t(h − i) = y_{t+h−i} when h − i ≤ 0
o For instance, ŷ_t(1) = φ_0 + φ_1 y_t + φ_2 y_{t−1} + … + φ_p y_{t−p+1}
o The forecast error is e_t(1) = y_{t+1} − ŷ_t(1) = ε_{t+1}
o The h-step forecast can be computed recursively, see the
textbook/class notes (and the sketch below)
For a stationary AR(p) model, ŷ_t(h) converges to the mean E[y_t] as
h grows, the mean reversion property
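A minimal sketch of the recursive h-step forecast rule and of mean reversion for a stationary AR(2); the coefficients and the last two observed values are made up for illustration.

```python
import numpy as np

# Recursive h-step-ahead forecasts for an AR(2):
#   y_hat_t(h) = phi_0 + phi_1 * y_hat_t(h-1) + phi_2 * y_hat_t(h-2),
# with forecasts at non-positive horizons replaced by observed values.
phi0, phi1, phi2 = 0.5, 0.5, 0.3          # illustrative stationary AR(2)
y_t, y_tm1 = 4.0, 3.0                     # last two observed values, y_t and y_{t-1}

history = {0: y_t, -1: y_tm1}             # horizon -> observed value or forecast
forecasts = []
for h in range(1, 21):
    f = phi0 + phi1 * history[h - 1] + phi2 * history[h - 2]
    history[h] = f
    forecasts.append(f)

uncond_mean = phi0 / (1 - phi1 - phi2)    # E[y_t] = phi_0 / (1 - phi_1 - phi_2) = 2.5
print(np.round(forecasts[:5], 3))         # forecasts revert toward the mean...
print(round(forecasts[-1], 3), "->", uncond_mean)   # ...and converge to it as h grows
```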
Lecture 3: Autoregressive Moving Average (ARMA) Models – Prof. Guidolin 28
Forecasting with MA(q)
Because the model has a memory limited to q periods only, the
point forecasts converge to the mean quickly and they are forced to
do so when the forecast horizon exceeds q periods
o E.g., for an MA(2), ŷ_t(1) = μ + θ_1 ε_t + θ_2 ε_{t−1},
because both shocks have been observed and are therefore known
o Because ε_{t+1} has not yet been observed at time t, and its expectation at
time t is zero, then ŷ_t(2) = μ + θ_2 ε_t
o By the same principle, ŷ_t(3) = μ (as for any horizon beyond q = 2), because ε_{t+3},
ε_{t+2}, and ε_{t+1} are not known at time t
By induction, the forecasts of an ARMA(p, q) model can be obtained
from
ŷ_t(h) = φ_0 + Σ_{i=1}^{p} φ_i ŷ_t(h − i) + Σ_{j=1}^{q} θ_j E_t[ε_{t+h−j}]
How do we assess the forecasting accuracy of a model?

Lecture 3: Autoregressive Moving Average (ARMA) Models – Prof. Guidolin 29

Forecasting US CPI Inflation with ARMA Models

Lecture 3: Autoregressive Moving Average (ARMA) Models – Prof. Guidolin 30


A Few Examples of Potential Applications
To fit the time series dynamics of the earnings of company i = 1, 2,
…, N with ARMA(pi, qi) models, to compare the resulting forecasts
with earnings forecasts published by analysts
o Possibly, also compare the 90% forecast confidence intervals from
ARMA(p_i, q_i) with the dispersion over time of analysts' forecasts
o Possibly also price stocks of each company using some DCF model and
compare it with the target prices published by the analysts
To fit the time series dynamics of commodity future returns (for a
range of underlying assets) using ARMA(p, q) models to forecast
o Possibly, compare such forecasts with those produced by predictive
regressions that just use (or also use) commodity-specific information
o A predictive regression is a linear model to predict the conditional
mean, with structure ŷ_t(h) = α̂ + Σ_{i=1}^{K} β̂_i x_{i,t}, where
x_{1,t}, x_{2,t}, …, x_{K,t} are the commodity-specific variables
o Possibly, to try and understand why and when only the past of a series
helps to predict future returns or not (i.e., for which commodities)
Lecture 3: Autoregressive Moving Average (ARMA) Models – Prof. Guidolin 31

A Few Examples of Potential Applications


Given the mkt. ptf., use mean-variance portfolio theory (see 20135,
part 1) and ARMA models to forecast the conditional risk
premium and decide the optimal weight to be assigned to risky vs.
riskless assets
o Also called strategic asset allocation problem
o As you will surely recall, ω̂_t = (1/(λσ²)) × ARMA Forecast of (r_{t+1} − r^f_{t+1})
o Similar/partly identical to a question of Homework 2 in 20135!?
o Possibly compare with the performance results (say, Sharpe ratio)
produced by the strategy ω̂_t = (1/(λσ²)) × Hist. Mean of (r_{t+1} − r^f_{t+1}),
which results from ARMA(0,0) processes (i.e., white noise returns)
After measuring which portion of a given policy or company
announcement represents news/unexpected information, measure
how long it takes for the news to be incorporated in the price
o Equivalent to test the number of lags q in a ARMA(0, q) model
o Unclear what the finding of p > 0 could mean in a ARMA(p, q) case
o Related to standard and (too) popular event studies
Lecture 3: Autoregressive Moving Average (ARMA) Models – Prof. Guidolin 32
Lecture 4: Multivariate Linear Time
Series (Vector Autoregressions)

Prof. Massimo Guidolin

20192– Financial Econometrics

Winter/Spring 2022

Overview
Multivariate strong vs. weak stationarity

Multivariate white noise and testing for it

Vector autoregressions: reduced vs. structural forms

From structural to reduced-form VARs and back:


identification issues

Recursive Choleski identification

Stationarity and moments of VARs, model specification

Impulse response functions and variance decompositions

Granger causality tests


Lecture 4: Multivariate Linear Time Series (VARs) – Prof. Guidolin 2
Motivation and Preliminaries
Because markets and institutions are highly intercorrelated, in
financial applications we need to jointly model different time series
to study the dynamic relationship among them
This requires multivariate time series analysis
Instead of focusing on a single variable, we consider a set of
variables, y_t ≡ [y_{1,t} y_{2,t} … y_{N,t}]′, with t = 1, 2, …, T
o The resulting sequence is called an N-dimensional (discrete) vector
stochastic process
Most important example are vector autoregressive (VAR) models
o Flexible models in which a researcher needs to know very little ex-
ante theoretical information about the relationship among the
variables to guide the specification of the model
o All variables are treated as a-priori endogenous
But first we need to generalize the concept of (weak) stationarity to the case
of N-dimensional vector time series and discuss how to compute the first
two moments
Lecture 4: Multivariate Linear Time Series (VARs) – Prof. Guidolin 3

Multivariate Weak vs. Strong Stationarity

Lecture 4: Multivariate Linear Time Series (VARs) – Prof. Guidolin 4


Multivariate Weak vs. Strong Stationarity
The object that appears in the definition is new, but it collects
familiar objects: given a sample {y_t}_{t=1}^{T}, the cross-covariance matrix can be
estimated by
Γ̂_h = (1/T) Σ_{t=h+1}^{T} (y_t − ȳ)(y_{t−h} − ȳ)′,  h ≥ 0
o Here ȳ = (1/T) Σ_{t=1}^{T} y_t is the vector of sample means
o When h = 0 we have the sample covariance matrix
o The cross-correlation matrix is ρ̂_h = D̂^{−1} Γ̂_h D̂^{−1}, where D̂ is the diagonal
matrix that collects the sample standard deviations on its main diagonal
Lecture 4: Multivariate Linear Time Series (VARs) – Prof. Guidolin 5

Multivariate Weak vs. Strong Stationarity

[Figure: cross-sample correlogram, i.e., the off-diagonal elements of the cross-correlation matrices for h = 0, 1, …, 24]
Iff (it means “if and only if”) a series is stationary, all cross-serial correlations will decay to 0


Strict stationarity has identical definition, except that it now
involves the joint multivariate PDF of the variables
o So we have both a time series dimension, f(y1), f(y2), …, f(yT), but also
each of the f(yt) is a N X 1 multivariate density
Lecture 4: Multivariate Linear Time Series (VARs) – Prof. Guidolin 6
Multivariate White Noise and Portmanteau Tests
Ljung and Box’s (1978) Q-statistic to jointly test whether several
(M) consecutive autocorrelation coefficients are equal to zero can
be generalized to the multivariate case, see Hosking (1980):
H_0: ρ_1 = ρ_2 = ⋯ = ρ_m = 0 vs. H_a: ρ_i ≠ 0 for some i = 1, 2, …, m can be tested
using:
Q(m) = T² Σ_{h=1}^{m} (1/(T − h)) tr(Γ̂′_h Γ̂_0^{−1} Γ̂_h Γ̂_0^{−1})
o tr(A) is the trace of a matrix, simply the sum of the elements on its
main diagonal
o Q(m) has an asymptotic (large sample) χ²_{N²m} distribution (which may be poor in
small samples)
o Note that the null hypothesis corresponds to Γ_1 = Γ_2 = ⋯ = Γ_m = 0, i.e., to the
absence of any own- and cross-serial correlation up to order m
Lecture 4: Multivariate Linear Time Series (VARs) – Prof. Guidolin 7

Vector Autoregressions: Reduced-Form vs. Structural


A VAR is a system regression model that treats all the N variables as
endogenous and allows each of them to depend on p lagged values
of itself and of all the other variables
y_t = a_0 + A_1 y_{t−1} + A_2 y_{t−2} + … + A_p y_{t−p} + u_t,   with u_t serially uncorrelated
For instance, when N = 2, y_t = [x_t z_t]′ or [R_{1,t} R_{2,t}]′, one example
concerning two asset returns may be a bivariate VAR(1) in which each
return depends on one lag of both returns, with shocks u_{1,t} and u_{2,t}
Lecture 4: Multivariate Linear Time Series (VARs) – Prof. Guidolin 8
Vector Autoregressions: Reduced-Form vs. Structural
o Σ ≡ Var[ut] is the covariance matrix of the shocks
o When it is a full matrix, it implies that contemporaneous shocks to
one variable may produce effects on others that are not captured by
the VAR structure
If the variables included on the RHS of each equation in the VAR
are the same then the VAR is called unrestricted and OLS can be
used equation-by-equation to estimate the parameters
o This means that estimation is very simple
When the VAR includes restrictions, then one should use system
estimation methods, which in this case often take the form of Generalized Least
Squares (GLS), Seemingly Unrelated Regressions (SUR), or MLE
Because the VAR(p) model, yt = a0 + A1yt-1 + A2yt-2 + ... + Apyt-p+ ut
does not include contemporaneous effects, it is said to be in
standard or reduced form, to be opposed to a structural VAR
In a structural VAR(p), the contemporaneous effects do not need to
go through the covariance matrix of the residuals, ut
Lecture 4: Multivariate Time Series Analysis– Prof. Guidolin 9

Vector Autoregressions: Reduced-Form vs. Structural


What is the difference? Consider the simple N = 2, p = 1 case of
x_t = b_{10} − b_{12} z_t + γ_{11} x_{t−1} + γ_{12} z_{t−1} + ε_{x,t}
z_t = b_{20} − b_{21} x_t + γ_{21} x_{t−1} + γ_{22} z_{t−1} + ε_{z,t}
where both x_t and z_t are stationary, and ε_{x,t} and ε_{z,t} are uncorrelated
white noise processes, also called structural errors
Using matrices, this VAR(1) model may be re-written as:
B y_t = Γ_0 + Γ_1 y_{t−1} + ε_t,  with B = [1 b_{12}; b_{21} 1]   (Structural VAR)
Pre-multiplying both sides by B^{−1} (this will be possible if b_{12}b_{21} ≠ 1),
y_t = B^{−1}Γ_0 + B^{−1}Γ_1 y_{t−1} + B^{−1}ε_t ≡ a_0 + A_1 y_{t−1} + u_t   (Reduced-form VAR)
Lecture 4: Multivariate Time Series Analysis– Prof. Guidolin 10
Vector Autoregressions: Reduced-Form vs. Structural
The “new” error terms are composites of the two primitive shocks:
u_{x,t} = (ε_{x,t} − b_{12} ε_{z,t})/(1 − b_{12}b_{21}),   u_{z,t} = (ε_{z,t} − b_{21} ε_{x,t})/(1 − b_{12}b_{21})
What are the properties of the reduced-form errors? Recall that ε_{x,t}
and ε_{z,t} were uncorrelated, white noise processes; then both u_{x,t} and u_{z,t}
have zero mean, constant variances, and are individually serially uncorrelated
Lecture 4: Multivariate Time Series Analysis– Prof. Guidolin 11

Vector Autoregressions: Reduced-Form vs. Structural


The reduced-form shocks u_{x,t} and u_{z,t} will be correlated even though
the structural shocks are not:
Cov(u_{x,t}, u_{z,t}) = −(b_{21}σ²_{εx} + b_{12}σ²_{εz})/(1 − b_{12}b_{21})²
which shows that they are uncorrelated if b_{12} = b_{21} = 0


This is very important: unless the variables are contempo-
raneously uncorrelated in the structural VAR (b12 = b21 = 0), a
reduced-form VAR will generally display correlated shocks
o If VARs are just extensions of AR models under what conditions will
they be stationary? Stay tuned...
Lecture 4: Multivariate Time Series Analysis– Prof. Guidolin 12
Vector Autoregressions: Reduced-Form vs. Structural
A structural VAR cannot be estimated directly by OLS: because of
the contemporaneous feedbacks, each contemporaneous variable
is correlated with its own error term
o Standard estimation requires that the regressors be uncorrelated
with the error term or a simultaneous equation bias will emerge
o An additional drawback of structural models is that contempora-
neous terms cannot be used in forecasting, where VARs are popular
However there is no such problem in estimating the VAR system in
its reduced form; OLS can provide estimates of a0 and A1, A2, …
o Moreover, from the residuals, it is possible to calculate estimates of
the variances of and of the covariances between elements of ut
o The issue is whether it is possible to recover all of the information
present in the structural primitive VAR
Is the primitive system identifiable given OLS estimates?
The lack of identification may be overcome if one is prepared to
impose appropriate restrictions on the primitive, structural system
Lecture 4: Multivariate Time Series Analysis– Prof. Guidolin 13

Identifying Structural from Reduced-Form VARs


The reason is clear if we compare the number of parameters of the
primitive system with the number recovered from the estimated
VAR model
o The structural (primitive) system contains 8 mean parameters (b_{10}, b_{20}, b_{12}, b_{21},
γ_{11}, γ_{12}, γ_{21}, γ_{22}) plus the 2 variances of the structural shocks, i.e., 10 in total
o The reduced form delivers 6 mean parameters (the two elements of a_0 and the
four elements of A_1) plus 3 distinct elements of the residual covariance matrix
(two variances and one covariance), i.e., 9 in total
o 9 vs. 10: unless one is willing to restrict one of the parameters, it is
not possible to identify the primitive system and the structural VAR
is under-identified
One way to identify the model is to use the type of recursive
system proposed by Sims (1980): we speak of triangularizations
In our example, it consists of imposing a restriction on the
primitive system such as, for example, b21 = 0
As a result, while zt has a contemporaneous impact on xt, the
opposite is not true
Lecture 4: Multivariate Time Series Analysis– Prof. Guidolin 14
Identifying Structural from Reduced-Form VARs
In a sense, shocks to zt are more primitive, enjoy a higher rank, and
move the system also through a contemporaneous impact on xt
The VAR(1) now acquires a triangular structure: with b_{21} = 0, the
equation for z_t contains no contemporaneous term in x_t, while the
equation for x_t still loads on z_t
This corresponds to imposing a Choleski decomposition on the
covariance matrix of the residuals of the VAR in its reduced form
Indeed, now we can re-write the relationship between the pure
shocks (from the structural VAR) and the regression residuals as
u_{x,t} = ε_{x,t} − b_{12} ε_{z,t},   u_{z,t} = ε_{z,t}
Lecture 4: Multivariate Time Series Analysis– Prof. Guidolin 15

Recursive Choleski Identification


Working out the full algebra, the reduced-form estimates can be mapped
back into the structural parameters
o Notice that by estimating the 6 mean parameters (a_{01}, a_{02}, and the elements
of A_1) it is possible to pin down the 6 structural mean parameters; the same
holds for variances and covariances: 9 equations, 9 unknowns
We say that the triangularized VAR(1) is exactly identified

Lecture 4: Multivariate Time Series Analysis– Prof. Guidolin 16


Recursive Choleski Identification
In fact, this method is quite general and extends well beyond this
VAR(1), N = 2 example: in a N-variable VAR(p), B is a NxN matrix
because there are N residuals and N structural shocks
Exact identification requires (N² − N)/2 restrictions placed on the
relationship btw. regression residuals and structural innovations
Because a Choleski decomposition is triangular, it forces exactly
(N² − N)/2 values of the B matrix to equal zero
o Because with N = 2, (2² − 2)/2 = 1, you see that b_{21} = 0 was sufficient
in our example
There are as many Choleski decompositions as possible
orderings of the variables, a combinatorial factor of N!
o A Choleski identification scheme results in a specific ordering, we are
introducing a number of (potentially arbitrary) assumptions on the
contemporaneous relationships among variables
o Choleski decompositions are deliberate in the restrictions they place
but tend not to be based on theoretical assumptions
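A minimal sketch of recursive identification via numpy's Choleski factorization; the residual covariance matrix is an illustrative example, and structural shocks are normalized to unit variance.

```python
import numpy as np

# Recursive (Choleski) identification sketch for an N = 2 reduced-form VAR.
# Given the residual covariance matrix Sigma_u, find a lower-triangular W such
# that Sigma_u = W W'; then u_t = W eps_t with unit-variance, orthogonal
# structural shocks eps_t (this imposes the (N^2 - N)/2 = 1 zero restriction).
Sigma_u = np.array([[1.00, 0.30],      # illustrative residual covariance matrix
                    [0.30, 0.50]])

W = np.linalg.cholesky(Sigma_u)        # lower-triangular Choleski factor
print(np.round(W, 3))
print(np.allclose(W @ W.T, Sigma_u))   # True: the factorization reproduces Sigma_u

# W[0, 1] = 0: under this ordering, the second structural shock has no
# contemporaneous effect on the first variable; reversing the ordering of the
# variables would impose the opposite (equally arbitrary) restriction.
```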
Lecture 4: Multivariate Time Series Analysis– Prof. Guidolin 17

Stationarity of VAR Processes


Alternative identification schemes are possible (but they are more
popular in macroeconomics than in finance)
For a general VAR(p), algebra shows that
y_t = μ + u_t + Ψ_1 u_{t−1} + Ψ_2 u_{t−2} + Ψ_3 u_{t−3} + …
which is the vector moving average (VMA) infinite representation
o The matrices of coefficients Ψ_1, Ψ_2, Ψ_3, … are complex functions of the
original (reduced-form) coefficients
o μ = E[y_t] is the unconditional mean of the VAR process
In a VAR(1), we have μ = (I_N − A_1)^{−1} a_0
and Ψ_i = A_1^i, i.e., increasing powers of the A_1 matrix


For such dependence to fade progressively away as the time
distance between y_t and past innovations grows, Ψ_i must converge
to zero as i goes to infinity
o This means that all the N eigenvalues of the matrix A_1 must be less
than 1 in modulus, in order to avoid that A_1^i will either explode or
converge to a nonzero matrix as i goes to infinity (see the sketch below)
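A short eigenvalue check for an illustrative reduced-form VAR(1) coefficient matrix (the numbers are made up).

```python
import numpy as np

# Stability check for a reduced-form VAR(1), y_t = a0 + A1 y_{t-1} + u_t:
# the VAR is stable (hence stationary) iff all eigenvalues of A1 are < 1 in modulus.
A1 = np.array([[0.5, 0.1],
               [0.4, 0.5]])            # illustrative coefficient matrix

eigvals = np.linalg.eigvals(A1)
print(np.round(np.abs(eigvals), 3))    # [0.7, 0.3]: both < 1, so the VAR is stationary

# The VMA(infinity) coefficients Psi_i = A1^i then die out as i grows
print(np.round(np.linalg.matrix_power(A1, 10), 4))   # entries shrink geometrically toward zero
```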
Lecture 4: Multivariate Time Series Analysis– Prof. Guidolin 18
Stationarity and Moments of VAR Processes
o The requirement that all the eigenvalues of A_1 are less than one in
modulus is a necessary and sufficient condition for y_t to be stable
(and, thus, stationary, as stability implies stationarity), that is,
all roots z of det(I_N − A_1 z) = 0 satisfy |z| > 1
Multivariate extension of Wold’s representation theorem
Assuming now stationarity, in a VAR(1) case:
μ = E[y_t] = (I_N − A_1)^{−1} a_0,   vec(Γ_0) = (I_{N²} − A_1 ⊗ A_1)^{−1} vec(Σ_u)
These hold if and only if (I_N − A_1) is non-singular, which requires
that, again, all the eigenvalues of A_1 are less than 1 in modulus
These unconditional moments must be contrasted with the
conditional moments:
E_t[y_{t+1}] = a_0 + A_1 y_t,   Var_t[y_{t+1}] = Σ_u

Lecture 4: Multivariate Time Series Analysis– Prof. Guidolin 19

Conditional vs. Unconditional Moments


o While the unconditional covariance matrix is a function of both the
covariance matrix of the residuals, u, and of the matrix A1,
conditioning on past information, the covariance matrix of yt is the
same as the covariance matrix of the residuals, u
For instance, in the case of the US monthly interest rate data on 1-
month and 10-year Treasuries, we have (t-statistics in […]):

[Estimated unconditional mean vector and covariance matrix reported in the original slide]
o The conditional moments are obviously different, e.g., E_t[y_{t+1}] = a_0 + A_1 y_t
varies with the most recent observation y_t

Lecture 4: Multivariate Time Series Analysis– Prof. Guidolin 20


Generalizations to VAR(p) Models
At some cost of algebra complexity, these findings generalize
o If (I_N − A_1 − A_2 − … − A_p) is non-singular, μ = E[y_t] = (I_N − A_1 − … − A_p)^{−1} a_0,
assuming the series is weakly stationary
o The conditional mean differs from the unconditional one, as
E_t[y_{t+1}] = a_0 + A_1 y_t + A_2 y_{t−1} + … + A_p y_{t−p+1}
o The expressions for conditional and unconditional covariance


matrices are harder to derive (think of Yule-Walker equations)
o As for stationarity, a VAR(p) model is stable (thus stationary) as long
as all roots z of det(I_N − A_1 z − … − A_p z^p) = 0 satisfy |z| > 1
o The roots of the characteristic polynomial should all exceed one in
modulus (i.e., they should lie outside the unit circle)
The typical estimation outputs,
o OLS equation-by-equation for unrestricted reduced-form models
o MLE/GLS otherwise (more complex estimators for structural VARs)
have standard structure but deal with N + N²p + N(N + 1)/2 parameters
Lecture 4: Multivariate Time Series Analysis– Prof. Guidolin 21

VAR(p) Model Specification


Increasing the order of a VAR reduces the (absolute) size of the
residuals and improves the fit, but damages its forecasting power
because of the risk of overfitting
o If the lag length is p, each of the N equations will contain Np coeffi-
cients plus the intercept
How do we appropriately select or even test for p?
o Such a lag choice is then common to all equations, as opposed to restricted
models in which each equation is specified separately (by t- or F-tests, which
however produce restricted systems that can no longer be estimated by simple equation-by-equation OLS)
We can use the likelihood ratio (LR) test – we want to test the
hypothesis that a set of variables was generated from a Gaussian
VAR with p_0 lags against the alternative specification of p_1 > p_0 lags
Under the assumption of normally distributed shocks,
LRT(p_0, p_1) = (T − Np_0 − 1)·[ln|Σ̂_u(p_0)| − ln|Σ̂_u(p_1)|]  ~asy  χ²_{N²(p_1 − p_0)}
o N²(p_1 − p_0) is the number of restrictions that are tested; |·| denotes the
determinant of a matrix
Lecture 4: Multivariate Time Series Analysis– Prof. Guidolin 22
VAR(p) Model Specification
o Large values of the LRT trigger a rejection of the null hypothesis that
p_0 lags are sufficient, as they reflect a large ln|Σ̂_u(p_0)| − ln|Σ̂_u(p_1)|, an indication that increasing
the number of lags reduces the RSS by a lot
LR tests can only be used to perform a pairwise comparison of two
VARs, one obtained as a restricted version of the other (nested)
The recipe is then first to specify the largest VAR and then
proceed to pare it down until we can reject the null hypothesis
o LRT is valid only under the assumption that errors are normally
distributed; without distributional assumptions, it is unclear
whether it may have any merit
An alternative approach is to minimize a multivariate version of
the information criteria, e.g.,
   MAIC = ln|Σ̂u| + 2K/T     MSBIC = ln|Σ̂u| + (K/T)lnT     MHQIC = ln|Σ̂u| + (2K/T)ln(lnT)
where K = N²p + N is the total number of conditional mean parameters (a sketch of an automated selection follows below)
Lecture 4: Multivariate Time Series Analysis– Prof. Guidolin 23
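A minimal sketch of how this information-criterion-based lag selection can be run with statsmodels, assuming the data sit in a hypothetical pandas DataFrame `yields` with one column per series:

```python
# Lag selection for a VAR via multivariate information criteria (sketch, not the
# slides' own computations); `yields` is an assumed (T x N) pandas DataFrame.
import pandas as pd
from statsmodels.tsa.api import VAR

def select_var_order(data: pd.DataFrame, max_lags: int = 12):
    model = VAR(data)
    # select_order fits VAR(1), ..., VAR(max_lags) and tabulates AIC, BIC, HQIC, FPE
    order = model.select_order(maxlags=max_lags)
    print(order.summary())
    return order.selected_orders   # e.g. {'aic': 3, 'bic': 1, 'hqic': 2, 'fpe': 3}

# p_star = select_var_order(yields)['bic']
```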

Forecasting with VAR Models


o The various criteria suggest it would be prudent to estimate larger VAR models
Loss functions that lead to the minimization of the mean squared forecast error (MSFE) are the most widely used
The minimum time-t MSFE prediction at a forecast horizon h is the conditional expected value:
   ŷt+h|t = Et[yt+h]
Lecture 4: Multivariate Time Series Analysis– Prof. Guidolin 24
Impulse Response Functions
o The formula can be used recursively to compute h-step-ahead
predictions starting with h = 1:
   ŷt+1|t = a0 + A1yt + … + Apyt+1−p     ŷt+h|t = a0 + A1ŷt+h−1|t + … + Apŷt+h−p|t   (with ŷt+j|t = yt+j for j ≤ 0)
In essence, the same results that apply to AR models generalize


VAR models are used in practice with the goal of understanding
the dynamic relationships between the variables of interest

Let’s use again a simple VAR(1) model: yt = a0 + A1yt−1 + ut
We know that a stationary VAR has a MA(∞) representation:
   yt = μ + Σ_{i=0}^∞ Θi ut−i,   with Θi = A1^i
25

Impulse Response Functions


o The two error processes, u1,t and u2,t, can be represented in
terms of the two sequences ε1,t and ε2,t, i.e., the structural
innovations: ut = Wεt, with W = B⁻¹
Therefore the model can be re-written as
   yt = μ + Σ_{i=0}^∞ Θi W εt−i = μ + Σ_{i=0}^∞ Φi εt−i
The coefficients in Φi (impact multipliers) can be used to generate
the effects of shocks to the innovations ε1,t and ε2,t on the time
path of y1,t and y2,t
The cumulative effects of a one-unit shock (or impulse) to a
structural shock on an endogenous variable after H periods can
then be obtained by computing the sum of the corresponding responses, Σ_{i=0}^H φjk(i)
A VAR in reduced form cannot identify the structural form and
therefore we cannot compute the coefficients in Φi from the OLS
estimates in its standard form unless we impose restrictions
Lecture 4: Multivariate Time Series Analysis– Prof. Guidolin
Impulse Response Functions
One method to place these restrictions consists of the application
of a Choleski decomposition:
   Σu = WW′, where W = B⁻¹ is lower triangular
o Because of the triangular structure of W = B-1, a Choleski
decomposition allows only the shock to the first variable to
contemporaneously affect all the other variables in the system
o A shock to the second variable will produce a contemporaneous
effect on all the variables in the system except the first one
o The first variable may of course be impacted in subsequent periods, through
the transmission effects mediated by the autoregressive coefficients
o A shock to the third variable will affect all the variables in the system
except the first two, and so on
Therefore, a Choleski identification scheme forces a potentially
important identification asymmetry on the system
A different ordering of the variables in the system would have
been possible, implying a reverse ordering of the shocks
Lecture 4: Multivariate Time Series Analysis– Prof. Guidolin
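A minimal sketch of Choleski-identified impulse responses with statsmodels, assuming the hypothetical DataFrame `yields` has its columns arranged in the chosen Choleski ordering (the ordering is the column order):

```python
# Orthogonalized IRFs from a reduced-form VAR(2) (sketch under assumed data `yields`)
from statsmodels.tsa.api import VAR

results = VAR(yields).fit(2)        # VAR(2), as in the weekly example discussed above
irf = results.irf(periods=24)       # impulse response analysis over 24 steps
# orth=True applies the Choleski factor of the residual covariance matrix, so only
# the first variable reacts contemporaneously to its own shock, and so on
irf.plot(orth=True)
# Cumulative effects after H periods (sums of the orthogonalized responses)
irf.plot_cum_effects(orth=True)
```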

Impulse Response Functions


IRFs are based on estimated coefficients: as each coefficient is
estimated with uncertainty, IRFs will contain sampling error
o Advisable to construct confidence intervals around IRFs
o An IRF is statistically significant if zero is not included in the confidence interval
o Estimates based on a VAR(2) for weekly series selected by information criteria
o The experiment is a tightening of short-term rates by the FED
Variance Decompositions
Understanding the properties of forecast errors from VARs is
helpful in order to assess the interrelationships among variables
Using the VMA representation of the errors, the h-step-ahead
forecast error is
   yt+h − ŷt+h|t = Σ_{i=0}^{h−1} Φi εt+h−i
o See lecture notes for algebra of such representation


o Because all (structural) white noise shocks share the same variance σ²ε, if we denote by
σ²y1(h) the h-step-ahead forecast error variance of (say) y1, we have:
   σ²y1(h) = σ²ε[φ11(0)² + … + φ11(h−1)² + φ12(0)² + … + φ12(h−1)²]
o Because all the squared coefficients are non-negative, the variance of
the forecast error increases as the forecast horizon h increases
o We decompose the h-step-ahead forecast error variance into the
proportions due to each of the (structural) shocks
Such a proportion due to each shock is a variance decomposition


Lecture 4: Multivariate Time Series Analysis– Prof. Guidolin

Variance Decompositions
Like in IRF analysis, variance decompositions of reduced-form
VARs require identification (because otherwise we would be
unable to go from the coefficients in Θi to their counterparts in Φi)
o Choleski decompositions are typically imposed
o Forecast error variance decomposition and IRF analyses both entail
similar information from the time series
o Example on weekly US Treasury yields, 1990-2016 sample:

Choleski ordering:
__ 1M Yield
__ 1Y Yield
__ 5Y Yield
__ 10Y Yield

Lecture 4: Multivariate Time Series Analysis– Prof. Guidolin
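A minimal sketch of the corresponding forecast error variance decomposition, again using the VAR(2) fitted above on the hypothetical `yields` DataFrame:

```python
# Choleski-based forecast error variance decomposition (FEVD); the ordering is the
# column order of the assumed DataFrame, exactly as in the IRF sketch above.
fevd = results.fevd(periods=24)
print(fevd.summary())   # share of each structural shock in the h-step forecast error variance
fevd.plot()
```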


Variance Decompositions and Granger Causality

Choleski ordering:
__ 10Y Yield
__ 1M Yield
__ 1Y Yield
__ 5Y Yield

Unfortunately, the ordering may turn out to be crucial


o This occurs because the reduced-form residuals from a 4x1 VAR(2)
system for US Treasury rates are highly, positively correlated
Variance decompositions may convey useful information when, in a
given VAR(p), a subset of the N×1 vector (say xt) forecasts its
own future and all the remaining variables in yt but…
… such remaining variables fail to forecast xt
Lecture 4: Multivariate Time Series Analysis– Prof. Guidolin

Granger-Sims Causality
We say that the sub-vector xt is (block) exogenous to yt or that xt
is not Granger-caused by the remaining variables
We also write xt ⟹GC yt but yt ⇏GC xt
When in a N×1 system [xt yt]′, xt ⟹GC yt and yt ⟹GC xt, we say
that there is a feedback system or two-way causality
o Consider the example:

Lecture 4: Multivariate Time Series Analysis– Prof. Guidolin


Granger-Sims Causality

If y1 is found to Granger-cause y2, but not vice versa, we say that


variable y1 is strongly exogenous (in the equation for y2).
If neither set of lags is statistically significant in the equation for
the other variable, it would be said that y1 and y2 are linearly
unrelated
o Practically, block-causality tests simply consist of LR or F-type tests
The word “causality” is somewhat of a misnomer, for Granger-Sims
causality really means only a correlation between the current
value of one variable and the past values of others
It does not mean that movements of one variable actually cause
movements of another
Lecture 4: Multivariate Time Series Analysis– Prof. Guidolin 33

One Example of Granger-Sims Causality Tests


Use the VAR(2) model for the one-month, one-, five-, and ten-year
Treasury yields to test Granger causality
The table considers one dependent variable at a time and tests
whether the lags of each of the other variables help to predict it
o The chi-square statistics refer to a test in which the null is that the
lagged coefficients of the “excluded” variable are equal to zero

Lecture 4: Multivariate Time Series Analysis– Prof. Guidolin 34
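A minimal sketch of block Granger causality (Wald) tests in the fitted VAR(2), assuming the hypothetical column names of `yields` are the four maturities:

```python
# One test per dependent variable: do the lags of all other variables help predict it?
# (sketch; `results` is the fitted VAR from the earlier snippet, `yields` the assumed data)
for caused in yields.columns:
    causing = [c for c in yields.columns if c != caused]
    test = results.test_causality(caused, causing, kind='wald')  # chi-square version
    print(f'{caused} caused by {causing}: stat={test.test_statistic:.2f}, p={test.pvalue:.3f}')
```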


Lecture 5: Unit Roots, Cointegration
and Error Correction Models – The
Spurious Regression Problem
Prof. Massimo Guidolin

20192– Financial Econometrics

Winter/Spring 2022

Overview
Stochastic vs. deterministic trends

The random walk process

Isolating and removing trends and the associated perils

Spurious regressions

Unit root tests

Defining cointegration

Vector Error Correction VAR models

Testing for cointegration and Johansen’s method

Lecture 5: Unit Roots, Cointegration and Error Correction Models – Prof. Guidolin 2
Trends in Time Series
Can the methods of lectures 2-4 be applied to nonstationary series?
o No, the inference would be invalid, and the costs are considerable
o If nonstationarity is simply ignored, it can produce misleading
inferences because they will be spurious
Nonstationarity may be linked to the presence of trends in time
series processes
Time series often contain a trend, a possibly random component
with a permanent effect on a time series
① Deterministic trends, functions (linear or non-linear) of time, t; for instance, polynomials:
   yt = δ0 + δ1t + δ2t² + … + δQt^Q + εt,   where εt is a white noise process
o The trending effect caused by functions of t is permanent and impresses a trend because time is obviously irreversible, e.g., for Q = 1, yt = δ0 + δ1t + εt, where the deviation from the trend line, εt, is stationary
o The solution to the associated difference equation is the trend line plus the effect of the shocks ητ
Lecture 5: Unit Roots, Cointegration and Error Correction Models – Prof. Guidolin 3

Trends in Time Series


The long-term forecast of 𝑦𝑦𝑡𝑡 will converge to the trend line, 𝛿𝛿𝑡𝑡, so
that this type of model is said to be trend stationary
② Stochastic trends, which characterize all processes that can be
written as: yt = y0 + μt + Σ_{τ=1}^t ετ
o Because Δyt+1 = μ + εt+1, the presence of a
stochastic trend implies a random walk with drift: yt+1 = μ + yt + εt+1
Therefore a stochastic trend is not a RW, but it implies its presence
o A RW is the non-stationary variant of AR(1) with 𝜇𝜇 = 𝜙𝜙0 and 𝜙𝜙1 = 1
A deterministic trend with Q = 1 implies the presence of a (restricted) stochastic trend, but stochastic trends may arise on their own
Since all values of ετ carry a coefficient of unity, the effect of each
shock on the intercept term is permanent, which is indeed the
intrinsic nature of a trend
If shocks are never forgotten and time series have infinite memory
⟹ both deterministic and stochastic trends are non-stationary; the
latter are denoted as I(d), d ≥ 1
Lecture 5: Unit Roots, Cointegration and Error Correction Models – Prof. Guidolin
Trends in Time Series
Because unit root tests are sensitive to the presence of
deterministic trends, this distinction is not just a matter of classification
Useful unit root tests will manage to tell deterministic time trends
apart from stochastic ones, to generate decompositions such as
   yt = (deterministic trend) + (stochastic trend) + (stationary component)
These 6 series are generated from the same sequence of IID N(0,1) shocks
Lecture 5: Unit Roots, Cointegration and Error Correction Models – Prof. Guidolin 5

The Random Walk Process


The RW is the key (but not only) type of non-stationary process
o While its conditional mean is well-defined, its unconditional mean
explodes:
   Et[yt+1] = μ + Et[yt] + Et[εt+1] = μ + yt
   Et[yt+s] = μ + Et[yt+s−1] = 2μ + Et[yt+s−2] = … = sμ + yt   (s ≥ 1)
so it depends on t, and if there is no drift (μ = 0), then Et[yt+1] = Et[yt+s] = yt


o Also the unconditional variance depends on time and it explodes as t ⟶ ∞: Var[yt] = tσ²
o Also autocovariances and autocorrelations display pathological
patterns:
   Cov[yt, yt−s] = (t − s)σ²     Corr[yt, yt−s] = √((t − s)/t)
   As t ⟶ ∞, there is perfect memory, Corr[yt, yt−s] ⟶ 1
Lecture 5: Unit Roots, Cointegration and Error Correction Models – Prof. Guidolin 6
The Random Walk Process
o The ACF of a RW shows, especially in small samples, a slight tendency
to decay that would make one think of a stationary AR(p) process
with a sum of the AR coefficients close to 1

In realistic samples, it is not possible to use the ACF to distinguish
between a unit root process and a stationary, near-I(1) process
Suppose you have already established that a time series contains a
trend: how do you estimate (remove) it/them to apply the
decomposition
7
Lecture 5: Unit Roots, Cointegration and Error Correction Models – Prof. Guidolin

De-Trending a Series: Deterministic vs. Stochastic

In the trend-stationary case, de-trending is simply done by OLS: regress the series on (powers of) time and retain the residuals
For stochastic trends, consider the RW with drift process and take
its first difference: Δyt = yt − yt−1 = μ + εt
The result is a white noise plus a constant intercept (the drift)
This approach is more general: an I(d) series becomes stationary after being differenced d times
De-Trending a Series: Deterministic vs. Stochastic
o If you want to see what an I(2) process may look like, see Appendix A
(Plots: a simulated quadratic trend and the corresponding de-trended series)
o The two de-trended series are identical in the two plots because they
had been generated to be identical, and they are white noise

Pitfalls in De-Trending Applications


Serious damage—in a statistical sense—can be done when the
inappropriate method is used to eliminate a trend
① When a time series is I(d) but an attempt is made to remove its
stochastic trend by fitting deterministic time trend functions, the
OLS residuals will still contain one or more unit roots
o Deterministic de-trending does not remove the stochastic trends
o For instance, fitting a linear trend to a RW with drift leaves residuals that still contain the cumulated shocks
o Even when μ = δ, the stochastic trend remains

10
Pitfalls in De-Trending Applications
② When a time series contains a deterministic trend but is otherwise I(0) (trend-stationary) and an attempt is made to remove the
trend by differencing the series d times, the resulting differenced
series will contain d unit roots in its MA components
o It will therefore be not invertible
o Differencing a trend-stationary series creates new stochastic trends that are shifted inside the shocks of the series

Lecture 5: Unit Roots, Cointegration and Error Correction Models – Prof. Guidolin 11

Pitfalls in De-Trending Applications


o Even when the trend-stationary component is absent, if the time
series is I(0) but it is incorrectly differenced d times, the resulting
differenced series will contain d unit roots in its MA components
What if yt ~ I(d) but by mistake we difference it d + r times?
③a -- If r > 0, we are over-differencing the series, and as such ②
applies, that is, the resulting over-differenced series will contain r
unit roots in its MA components and will therefore be not invertible
③b -- If r < 0, we are not differencing the series enough and the
resulting series will still contain −r unit roots and will remain nonstationary
Why is it that we care so much for isolating and removing trends?
It turns out that using I(d) series with d > 0 in
standard regression analysis generally exposes us to the peril of
invalid inferences
We speak of spurious regressions
Suppose that 𝑦𝑦𝑡𝑡 ~𝐼𝐼(1) and 𝑥𝑥𝑡𝑡 ~𝐼𝐼(1), e.g., stock prices and GDP
Lecture 5: Unit Roots, Cointegration and Error Correction Models – Prof. Guidolin 12
The Spurious Regression Problem
You estimate a regression of yt on xt, yt = β0 + β1xt + ηt, expecting the
errors ηt to be white noise, as required by OLS, but instead, when
yt ~ I(1) and xt ~ I(1):
o The very error terms of the regression, ηt = yt − β0 − β1xt, are I(1)!
o This occurs unless very special conditions hold, see below
A spurious regression has the following features:
① The residuals are I(1) and as such any shock permanently
changes the intercept of the regression, rather than representing transitory news
Lecture 5: Unit Roots, Cointegration and Error Correction Models – Prof. Guidolin 13

The Spurious Regression Problem


② Standard OLS estimators are inconsistent and the associated
inferential procedures are invalid and statistically meaningless
③ The regression has a high R2 and t-statistics that appear to be
significant, but the results are void of any economic meaning
o Do not fall into the spurious regression trap and do not just boast huge R-
squares; in finance they are more often symptoms of problems
o This is not a small sample problem; in fact, these issues worsen as the
sample size grows
o These ideas generalize, at the cost of technical complexity when one
would try and regress an I(d) series on another I(d) series
o Or when we regress a deterministic trend on another trend
The cure of the problem is to work with stationary first/d-
differenced series
o E.g., we generate two independent sets of IID white noise variables
and use them to simulate 1000 observations from two driftless RWs
o The two RWs are expected to be unrelated
Lecture 5: Unit Roots, Cointegration and Error Correction Models – Prof. Guidolin 14
The Spurious Regression Problem
o If you object that the series seem to be trending in similar directions, you
should know that they are not: it is just chance and a good dose of visual illusion
o The estimated regression of one RW on the other gives a seemingly significant slope and a high R², with regression residuals that wander like a RW
o When the series are differenced, the regression provides no explanatory power:

Lecture 5: Unit Roots, Cointegration and Error Correction Models – Prof. Guidolin 15
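A minimal sketch of this spurious regression experiment: simulate two independent driftless random walks, then regress one on the other in levels and in first differences (any specific numbers below are whatever the random seed happens to produce, not the slides' figures):

```python
# Spurious regression demo: levels regression of two independent RWs vs. the same
# regression in first differences.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
T = 1000
y = np.cumsum(rng.standard_normal(T))   # driftless RW #1
x = np.cumsum(rng.standard_normal(T))   # driftless RW #2, independent of the first

levels = sm.OLS(y, sm.add_constant(x)).fit()
diffs = sm.OLS(np.diff(y), sm.add_constant(np.diff(x))).fit()
print('levels:      R2 =', round(levels.rsquared, 3), ' t(slope) =', round(levels.tvalues[1], 2))
print('differences: R2 =', round(diffs.rsquared, 3), ' t(slope) =', round(diffs.tvalues[1], 2))
```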

The Spurious Regression Problem

o The sample ACF of the residuals from the regression in levels points to I(1) behavior, while that of the residuals from the regression in differences resembles white noise
16
Testing for Unit Roots: the Dickey-Fuller Test
SACFs are downward biased ⟹ Box-Jenkins cannot be used
o Dickey and Fuller (1979) offer a test procedure to take the bias in due
account, based on a Monte Carlo design, under the null of a RW
Their method boils down to estimating by OLS the regression
   Δyt = αyt−1 + εt   (where α = ρ − 1)
and computing the t-ratio α̂/SE(α̂), the number of standard deviations of the estimate away from 0
o The one-sided t-statistic of the OLS estimate of α is then compared to
critical values found by DF by simulations under the null of a RW
• E.g., if the estimated ρ is 0.962 with a standard error of 0.013, then
the estimated α is -0.038 and the t-statistic is -0.038/0.013 = -2.923
• According to DF’s simulations, such a value occurs less than 5% of the
time under the null of a RW, but in more than 1% of the simulations
• This is a rather unlucky event under the null of a RW and this may
lead to a rejection of the hypothesis, with a p-value btw. 0.01 and 0.05
We therefore use a standard t-ratio taking into account that under
the null of a RW, its distribution is nonstandard and cannot be
analytically evaluated 17
Lecture 5: Unit Roots, Cointegration and Error Correction Models – Prof. Guidolin

Testing for Unit Roots: Augmented DF Test


The classical DF test suffers from one rigidity: given the null
hypothesis of a RW, the alternative hypothesis is specified as an
AR(1) stationary process
It is possible to use the DF tests in more general cases
o Appendix B shows that through a sequence of “adding-and-subtract”
operations, it is possible to re-write a general AR(p) process as:
   Δyt = αyt−1 + γ1Δyt−1 + … + γp−1Δyt−p+1 + εt
We can test for the presence of a unit root using the DF test,
although this is called augmented Dickey-Fuller (ADF) test
o ADF implements a parametric correction for high order correlation
o DF show that the asymptotic distribution of the t-ratio for α is
independent of the number of lagged first differences included
In fact, the appropriate “tables” (simulated statistic) to use depend
on the deterministic components included in the regression

Lecture 5: Unit Roots, Cointegration and Error Correction Models – Prof. Guidolin 18
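A minimal sketch of ADF tests with statsmodels, assuming `series` is a hypothetical pandas Series (e.g., log real earnings) and with the augmentation lags chosen by AIC:

```python
# Augmented Dickey-Fuller tests with intercept only and with intercept + trend (sketch)
from statsmodels.tsa.stattools import adfuller

for det, label in [('c', 'intercept only'), ('ct', 'intercept and linear trend')]:
    stat, pvalue, usedlag, nobs, crit, _ = adfuller(series, regression=det, autolag='AIC')
    print(f'ADF ({label}): stat = {stat:.3f}, p-value = {pvalue:.3f}, lags = {usedlag}')
```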
Other Unit Root Tests
o In the case of real earnings, an ADF test that includes an intercept gives
an estimate of α of 0 with a t-ratio of 10.196, which leads to a failure to
reject the null of a unit root
o The presence of a time trend cannot be ruled out on theoretical
grounds – an ADF test also including a linear time trend, gives an
estimate of α of -0.002 which is -1.900 standard deviations away from
0 and that does not allow us to reject the null of a unit root
Phillips and Perron (1988) propose a nonparametric method of
controlling for serial correlation when testing for a unit root that is
an alternative to the ADF test
o Classical DF test + modify the t-ratio of α so that serial correlation in
the residuals does not affect the asymptotic distribution of the test
o See lecture notes for PP test statistic
o Null hypothesis remains a unit root
Kwiatkowski, Phillips, Schmidt, and Shin (1992) have proposed a
testing strategy under the null of (trend-) stationarity
Lecture 5: Unit Roots, Cointegration and Error Correction Models – Prof. Guidolin 19

Other Unit Root Tests


The KPSS statistic is based on the residuals êt from a regression of the series
on exogenous, deterministic factors (an intercept, or an intercept and a trend t); the KPSS test is:
   KPSS = (1/T²) Σ_{t=1}^T St² / σ̂²∞,   where St = Σ_{s=1}^t ês and σ̂²∞ is a long-run variance estimator

Re-examine whether S&P real stock prices, aggregate earnings, and
aggregate dividends give evidence of a unit root, with PP tests applied to
the levels and to the first differences ΔPt+1, ΔEt+1, ΔDt+1
All series contain a unit root in levels and this should be taken into account
KPSS tests lead to the same conclusion even though the null differs
Although rejecting the null of a unit root does not imply “accepting”
the alternative hypothesis of stationarity, ADF-type and KPSS tests
are sufficiently different to occasionally contradict each other
Lecture 5: Unit Roots, Cointegration and Error Correction Models – Prof. Guidolin 20
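A minimal sketch of the complementary KPSS test (null of stationarity), to be read side by side with the ADF output above, with `series` the same assumed pandas Series:

```python
# KPSS tests under a level-stationary and a trend-stationary null (sketch)
from statsmodels.tsa.stattools import kpss

for det, label in [('c', 'level-stationary null'), ('ct', 'trend-stationary null')]:
    stat, pvalue, nlags, crit = kpss(series, regression=det, nlags='auto')
    print(f'KPSS ({label}): stat = {stat:.3f}, p-value = {pvalue:.3f}')
```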
Cointegration: General Concept and Definition
In finance and macroeconomics, most popular series contain a unit
root, i.e., they are I(1) series (random walks)
o For instance, the US aggregate dividends and stock prices
As we shall see, however, there may exist a linear combination (e.g.,
their weighted difference) of them that becomes stationary
o Situations in which the choice to simultaneously difference all nonsta-
tionary series may be a mistake, as it will imply a loss of information,
possibly invalid inference, and suboptimal predictive performance
o Economic theories often useful, e.g., no-arbitrage relationships
We say that two non-stationary series integrated of order d are
cointegrated of order b, if there exists a linear combination of them
which is integrated of order d-b

21

Getting Intuition Through One Realistic Case


For concreteness, let’s consider a N = 2 case in which the variables
are log dividends (ldt) and log stock prices (lpt)
We represent their dynamic process as a restricted VAR(1):
   lpt = a0 + a1lpt−1 + a2ldt−1 + u1,t
   ldt = b0 + b1ldt−1 + u2,t
o Lagged log prices are excluded from the dividend equation: this is why the VAR is restricted
This can be re-written in compact form as yt = a + A1yt−1 + ut, with yt = [lpt ldt]′
Consider the realistic case, when our variables are non-stationary,
which is obtained simply by setting b1 = 1 (with a2 > 0)
This way, ldt becomes a random walk with drift; because lpt is a
linear function of a random walk, it becomes itself a random walk
22
Lecture 5: Unit Roots, Cointegration and Error Correction Models – Prof. Guidolin
Getting Intuition Through One Realistic Case
Cointegration can be identified from the need for models to be
balanced in terms of LHS vs. RHS orders of integration
To understand the essence of cointegration, consider re-
parameterizing the model in the following way:
   Δlpt = a0 + (a1 − 1)[lpt−1 − γ1ldt−1] + u1,t,   with γ1 ≡ a2/(1 − a1)
   Δldt = b0 + u2,t
The model for changes in log-prices, i.e., for log-stock returns, is
“balanced” if and only if lpt−1 − γ1ldt−1 is I(0)
For a model to be balanced it means that it must involve variables
of the same level of integration, i.e., I(0) = a0 + coefficient • I(0) + I(0)
o Of course the model for Δldt is I(0)
23
Lecture 5: Unit Roots, Cointegration and Error Correction Models – Prof. Guidolin

Getting Intuition Through One Realistic Case


Cointegrating vector = coefficients that balance the model
This model is balanced if and only if lpt−1 − γ1ldt−1 is I(0)
However, this implies that a γ1 must exist such that a weighted sum
(difference) of two I(1) variables is stationary
Hence, lpt−1 and ldt−1 are cointegrated, with cointegrating vector
equal to (1, −γ1)′
The model written as
   Δlpt = a0 + (a1 − 1)[lpt−1 − γ1ldt−1] + u1,t
in its first equation is called an error correction model (the system is a vector ECM, or VECM)
An error correction model represents all variables as I(0), showing
the adjustment mechanism that drives them back towards the
cointegrating relationship
Dropping the error correction term and just estimating a VAR in differences makes the model invalid
If we interpret γ1ldt−1 as the long-run equilibrium for the log-stock
price, lp*t−1 = γ1ldt−1, the meaning of the correction part becomes clear
24
Lecture 5: Unit Roots, Cointegration and Error Correction Models – Prof. Guidolin
Getting Intuition Through One Realistic Case
If a1 < 1, then a1 − 1 < 0, and lpt−1 < γ1ldt−1 implies that Δlpt
> 0, i.e., when prices are below their long-run equilibrium defined
by dividends, then prices will increase
When lpt−1 > γ1ldt−1, Δlpt < 0, i.e., when prices are above their long-
run equilibrium defined by dividends, prices will decrease
The adjustment parameter α = a1 − 1 in the ECM specification determines the speed of
adjustment in the presence of disequilibrium
The system defined by the ECM based on a cointegrating
relationship is self-equilibrating
(Simulated paths: small α vs. high α)


Lecture 5: Unit Roots, Cointegration and Error Correction Models – Prof. Guidolin 25

Testing Bivariate Cointegrating Relationships


Cointegrated I(d) variables are such because they share at least one
common stochastic trend, see Appendix C for an example
Two alternative and fundamental ways to test for cointegration:
① Univariate, regression-based tests (Engle and Granger’s, 1987)
that exploit the idea that a regression can be used to find at least
one (the mean-squared error minimal) cointegrating relationship
② Multivariate, VECM-based multi-cointegration tests, Johansen’s
Engle and Granger’s methodology seeks to determine whether the
residuals of an estimated equilibrium relationship are stationary
o Suppose Pt and Ft are both I(1) and estimate the long-run equilibrium
relationship: Pt = κ0 + κ1Ft + et
o If the variables are cointegrated, an OLS regression yields a super-
consistent estimator of the cointegrating parameters κ0 and κ1
o The OLS estimator converges faster (at a rate T) than in OLS models using
stationary variables, where the convergence rate is traditionally T^(1/2)
o The test consists of no-intercept ADF tests applied to the residuals êt
26
Lecture 5: Unit Roots, Cointegration and Error Correction Models – Prof. Guidolin
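A minimal sketch of the Engle-Granger approach with statsmodels, assuming `P` and `F` are hypothetical pandas Series of the two I(1) variables:

```python
# Engle-Granger residual-based cointegration test (sketch under assumed series P, F)
import statsmodels.api as sm
from statsmodels.tsa.stattools import coint, adfuller

# One-call version: ADF-type test on the residuals of the static regression of P on F
stat, pvalue, crit = coint(P, F, trend='c')
print('Engle-Granger stat =', round(stat, 3), ' p-value =', round(pvalue, 3))

# Equivalent "by hand": OLS of P on F, then a no-intercept ADF test on the residuals
# (older statsmodels versions call the no-deterministics option 'nc' instead of 'n')
resid = sm.OLS(P, sm.add_constant(F)).fit().resid
print('ADF on residuals (stat, p-value):', adfuller(resid, regression='n')[0:2])
```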
Multivariate Cointegrating Relationships
Matters are a bit more complicated in the multivariate case
In general, among N non-stationary series of the same integration
order, we may have up to N-1 cointegrating vectors
o The single-equation dynamic modeling we have used in the bivariate
example for the Engle and Granger test may cause serious troubles when there
are multiple cointegrating vectors
o There will exist a sort of indeterminacy as to which relationship holds
o The solution of this identification problem requires a framework to
allow the researcher to find the number of cointegrating vectors
among a set of variables and to identify them
The procedure proposed by Johansen (1988, 1992) within a VAR
framework achieves both results
Its advantage is that all the r ≤ N − 1 cointegrating relationships
will be tested and estimated
Consider the multivariate generalization of the single-equation
dynamic model derived in Appendix B
27
Lecture 5: Unit Roots, Cointegration and Error Correction Models – Prof. Guidolin

Johansen’s Method
Suppose that N ≥ 2 variables are I(1) and follow: yt = A1yt−1 + … + Apyt−p + ut
To use Johansen’s test, the VAR needs to be turned into a V-ECM:
   Δyt = Πyt−1 + Γ1Δyt−1 + … + Γp−1Δyt−p+1 + ut     (*)
Johansen’s test centers around the matrix Π that can be interpreted
as a long-run coefficient matrix, because in equilibrium, all the Δyt−i
are zero, and setting ut to their expectation of zero yields Πy* = 0 in (*)
28
Johansen’s Method
Formal tests are based on the rank of the matrix Π via its eigenvalues,
to determine the number of cointegrating relationships/vectors
The rank of a matrix is equal to the number of its characteristic
roots (eigenvalues) that are different from 0
o The eigenvalues, λi's, are put in descending order λ1 ≥ λ2 ≥ . . . ≥ λg ≥ 0
o By construction, they must be less than 1 in absolute value and
positive, and λ1 will be the largest, while λg will be the smallest
If the variables are not cointegrated, the rank of Π will not be
significantly different from zero, so λi ≃ 0 ∀i
o If rank(Π) = 1, then ln(1 − λ1) will be negative and ln(1 − λi) = 0 ∀i > 1
o In general, if eigenvalue λi is non-zero, then ln(1 − λi) < 0; for Π to have a
rank of 1, the largest eigenvalue must be significantly non-zero, while the
others will not be significantly different from 0
Two test statistics for cointegration under the Johansen approach:
   λtrace(r) = −T Σ_{i=r+1}^N ln(1 − λ̂i)     λmax(r, r + 1) = −T ln(1 − λ̂r+1)
29
Lecture 5: Unit Roots, Cointegration and Error Correction Models – Prof. Guidolin

Johansen’s Method
Each eigenvalue will have associated with it a different
cointegrating vector, which will be the corresponding eigenvector
A significant eigenvalue indicates a significant cointegrating vector
λtrace is a joint test where the null is that the number of
cointegrating vectors is less than or equal to r against an
unspecified or general alternative that they are more than r
λmax conducts separate tests on each eigenvalue, and has as its null
hypothesis that the number of cointegrating vectors is r against an
alternative of r + 1
o The distribution of the test statistics is non-standard: the critical values
depend on N − r , the number of non-stationary components and whether
constants are included in each of the equations
o If the test statistic is greater than the critical value, we reject the null
hypothesis of r cointegrating vectors in favor of the alternative
o The testing is conducted in a sequence of nulls, r = 0, 1, …, N − 1
r is the rank of Π: it cannot be of full rank (N) since this would
correspond to the original yt being stationary
30
Lecture 5: Unit Roots, Cointegration and Error Correction Models – Prof. Guidolin
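A minimal sketch of Johansen's trace and maximum-eigenvalue tests with statsmodels, assuming `yields` is a hypothetical (T x N) DataFrame of I(1) series and a VAR(p) in levels (so k_ar_diff = p − 1 lagged differences in the VECM):

```python
# Johansen cointegration rank tests (sketch; det_order=0 includes a constant term)
from statsmodels.tsa.vector_ar.vecm import coint_johansen

res = coint_johansen(yields, det_order=0, k_ar_diff=1)
print('eigenvalues  :', res.eig)
print('trace stats  :', res.lr1)   # compare row r with res.cvt[r] (90/95/99% critical values)
print('max-eig stats:', res.lr2)   # compare row r with res.cvm[r]
```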
Appendix A: An Example of I(2) Process

31

Appendix B: Deriving the ADF Test Regression

32
Appendix C: Cointegration = Sharing Common Trends

33
Lecture 6: Univariate Volatility
Modelling: ARCH and GARCH
Models
Prof. Massimo Guidolin

20192– Financial Econometrics

Winter/Spring 2022

Overview
Generalities and the mixture of distributions hypothesis

Stylized facts of conditional heteroskedasticity

Simple models: rolling window and RiskMetrics

ARCH models and their limitations

Generalized ARCH models and the reasons of their success

Integrated GARCH

Exponential GARCH and asymmetric effects in volatility

Lecture 6: Univariate Volatility Modelling, ARCH and GARCH – Prof. Guidolin 2


Generalities: from Risk to Conditional Heteroskedasticity
Since the late 1980s and the seminal work by Engle (1982) and
Bollerslev (1986), financial econometrics has witnessed a drive
towards methods to specify and estimate models of risk
o Until 30 years ago the focus of time series analysis centered on the conditional first moment, with any dependencies in higher order moments
treated as a nuisance that required at best adjustments to estimation
o This required the development of new techniques that allow for time-
varying higher order moments, chiefly variances and covariances
When the data display patterns of time-varying variances and
covariances, they are said to be conditionally heteroskedastic (CH)
In finance, risk has been identified with variance or its square root,
(the std. dev., volatility) but variance becomes a proper measure of
risk only when coupled with an assumption on predictive densities
o E.g., variance is a proper measure of risk when returns are normal
o Distributions can be found such that alternative measures (such as
mean absolute deviation or interquartile range) ought to be used
Lecture 6: Univariate Volatility Modelling, ARCH and GARCH – Prof. Guidolin 3

The Mixture of Distributions Hypothesis


Crucially, variances, covariances, and all higher order moments
(skewness, kurtosis, etc.) are not directly observable—unlike
market prices, they are latent variables
o E.g., if prices fluctuate a lot, we know volatility is high, but we cannot
ascertain precisely how high
o Can’t distinguish if a large shock to prices is transitory or permanent
What is the economics/finance theory of CH data?
o Unfortunately, not many compelling models, although some exist
(learning, transaction costs, asymmetric risk aversion, etc.)
o An ever-green explanation for CH is that information arrivals which
drive price changes occur in clusters rather than being evenly spaced,
the mixture of distributions hypothesis (Tauchen and Pitts, 1983)
o Returns and volume are determined by same latent mixing variable; if
the news arrival process is serially dependent, volatility and trading
volume will be jointly serially correlated
o Such theory just moves the issue one step further down the road
Lecture 6: Univariate Volatility Modelling, ARCH and GARCH – Prof. Guidolin 4
Stylized Facts of Conditional Heteroskedasticity
See lecture 1 for the relevant stylized fact, only listed here
These empirical regularities commonly found in the data have
driven the development of ARCH and GARCH models
① Financial data tend to be leptokurtic: their unconditional density is
characterized by tails that are “thicker” than a normal’s, as well as by more
probability mass collected around the mean (or the mode)
② Data are characterized by clusters in higher order moments,
especially volatility: large changes tend to be followed by large changes,
of either sign, and small changes tend to be followed by small changes

o To favor visibility, the values of the squares and cubes have been
trimmed, even though this hides a few large spikes
Lecture 6: Univariate Volatility Modelling, ARCH and GARCH – Prof. Guidolin 5

Stylized Facts of Conditional Heteroskedasticity


o One typical way in which volatility clustering is analyzed is by
conducting Box-Jenkins analysis on the squared residuals of some
conditional mean function (or on returns themselves)…
o … or by studying the cross-serial correlation of powers of the data

o The book proves that the two stylized facts are intimately related
because the model residuals εt = σt|t−1zt are such that
   E[εt⁴]/(E[εt²])² = E[zt⁴]·E[σ⁴t|t−1]/(E[σ²t|t−1])² ≥ E[zt⁴]     (*)
o Excess kurtosis may arise from randomness in conditional variance

6
Lecture 6: Univariate Volatility Modelling, ARCH and GARCH – Prof. Guidolin
Stylized Facts of Conditional Heteroskedasticity
③ The so-called leverage effect, first reported by Black (1976), the
tendency for changes in (stock) prices to be negatively correlated with
changes in subsequent (stock) volatility
o The leverage/firm risk link is rather simplistic
o CH models are applied to data (such as interest or exchange rates)
that pertain to assets that are not equities and may not show leverage
o In such cases we speak of asymmetric CH effects
Kernel regression
fits

Lecture 6: Univariate Volatility Modelling, ARCH and GARCH – Prof. Guidolin 7

Simple Models: Rolling Window Variance Forecast


The most naive and yet surprisingly widespread models among
practitioners are simple rolling window models:
   σ²t+1|t = (1/W) Σ_{τ=0}^{W−1} ε²t−τ
o εt consists of the empirical residuals of some conditional mean
function model (an ARMA or a regression, say)
o W is the rolling window length, the only parameter to be selected
In short, this is a moving average model for squared residuals
W << T allows the model to capture time variation in conditional
variance ⟹ predictive power that responds to market conditions
o When W = T, the model gives the ML estimator of the variance
This model has obvious limitations:
① All past squared errors are given the same weight, 1/W, irrespective
of how old they are
② Unclear how we should go about selecting the window length W as it
represents the upper limit of a sum
Lecture 6: Univariate Volatility Modelling, ARCH and GARCH – Prof. Guidolin 8
Simple Models: Rolling Window Variance Forecast
o Selection of W is left to subjective assessments, with the paradox that
users with the same data will deliver very different forecasts

③ Especially when W is small, the forecasts generate “box shaped effects”


o When the forecast spikes up, this may be due either to some small
squared residual from W + 1 periods before having been dropped or to a
large time-t squared residual
o The former event is hard to rationalize
Lecture 6: Univariate Volatility Modelling, ARCH and GARCH – Prof. Guidolin 9
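A minimal sketch of the rolling-window forecast, assuming `eps` is a hypothetical pandas Series of conditional-mean residuals and W a window length chosen by the user:

```python
# Rolling-window variance forecast: equally weighted average of the last W squared residuals
import pandas as pd

def rolling_window_forecast(eps: pd.Series, W: int = 22) -> pd.Series:
    sigma2 = (eps ** 2).rolling(window=W).mean()
    # the value dated t is the time-t forecast of the variance of eps at t+1
    return sigma2
```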

Simple Models: Rolling Window Variance Forecast

Lecture 6: Univariate Volatility Modelling, ARCH and GARCH – Prof. Guidolin 10


Simple Models: RiskMetrics
The RiskMetrics (Exponential Variance Smoother) model offers a
way to use the entire history of a time series…
… but at the same time to weight each past observation as a
decreasing function of its distance to the forecast origin:
   σ²t+1|t = (1 − λ) Σ_{τ=1}^∞ λ^(τ−1) ε²t+1−τ
o The presence of the factor (1 − λ) that pre-multiplies the infinite sum
guarantees that the weights, (1 − λ)λ^(τ−1), sum to 1
In the late 1980s, researchers at J.P. Morgan realized that this
famous forecasting device could be rewritten in a simpler way:
   σ²t+1|t = λσ²t|t−1 + (1 − λ)ε²t
o RiskMetrics is characterized by just one parameter, 𝜆𝜆 that can be


estimated from the data 11
Lecture 6: Univariate Volatility Modelling, ARCH and GARCH – Prof. Guidolin

Simple Models: RiskMetrics


o The variance forecast consists of a convex linear combinations of (i)
the most recent squared residual and (ii) the most recently saved
forecast of the variance at time t-1 for time t
o In RiskMetrics, 𝜆𝜆 plays a role similar to the choice of the rolling
window parameter W : the larger λ, the slower is the speed at which
past squared innovations are forgotten by the model
o Note that E[σ²t+1|t] = λE[σ²t|t−1] + (1 − λ)E[ε²t] = E[σ²t|t−1], or today’s
(expected) forecast of time t+1 variance is simply yesterday’s variance forecast
o However, solving the model backward in this way, we would have E[σ²t+1] =
E[σ²t] = … = σ²0, i.e., in expectation the process for variance becomes homoskedastic
o The use of squared residuals (or returns) to predict variance derives
instead from Et[ε²t+1] = σ²t+1|t
o The fact that RiskMetrics contains one parameter is attractive
o Even though in many applications λ is estimated by ML, for a variety
of high-frequency data sets it has become typical to obtain estimated
values for λ that tend to be close to 0.94
12
Lecture 6: Univariate Volatility Modelling, ARCH and GARCH – Prof. Guidolin
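A minimal sketch of the RiskMetrics recursion, assuming `eps` is a numpy array or pandas Series of residuals; the initialization with the first-20-observation variance is an assumption of this sketch, not part of the slides:

```python
# Exponentially weighted (RiskMetrics-style) variance recursion with fixed lambda
import numpy as np

def riskmetrics_forecast(eps, lam: float = 0.94):
    eps = np.asarray(eps, dtype=float)
    sigma2 = np.empty_like(eps)
    sigma2[0] = eps[:20].var()   # assumed initialization for the recursion
    for t in range(1, len(eps)):
        # sigma^2_{t|t-1} = lambda * sigma^2_{t-1|t-2} + (1 - lambda) * eps_{t-1}^2
        sigma2[t] = lam * sigma2[t - 1] + (1 - lam) * eps[t - 1] ** 2
    return sigma2   # sigma2[t] is the variance forecast for eps[t] formed at t-1
```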
ARCH Models
The key idea of ARCH is that conditional forecasts are generally
vastly superior to unconditional forecasts
Because the model is based on the decomposition εt+1 = σt+1|t·zt+1, with
   σ²t+1|t = α0 + α1ε²t + α2ε²t−1 + … + αpε²t+1−p
and σ²t+1|t is time-varying provided that at least one of the coef-
ficients α1, α2, …, αp is positive, by Jensen’s inequality, we have the excess kurtosis result in (*)
Because ARCH(p) generates a symmetric return distribution (which must
integrate to 1), the inflated tails must be compensated by the
absence of probability mass in the intermediate range: ARCH
captures the leptokurtic nature of asset returns
Lecture 6: Univariate Volatility Modelling, ARCH and GARCH – Prof. Guidolin 13

ARCH Models
ARCH captures volatility clustering: large past squared innovations
will lead to large forecasts of subsequent variance when all or most
α1, α2, …, αp coefficients are positive and non-negligible
However, ARCH models cannot capture the existence of asymmetric
reaction of conditional variance to positive vs. negative shocks
ARCH(p), in particular ARCH(1) differs from RiskMetrics in 2 ways:
① It features no memory for recent, past variance forecasts
② It features a constant α0 that was absent in RiskMetrics
o When we set α0 = 0 and αi = 1/W, then an ARCH(W) model simply
becomes a rolling window variance model
o Appendix A collects the moments and key properties of an ARCH(1)
o Algebra in this Appendix establishes that the
long-run, ergodic variance from an ARCH(p) is: σ̄² = α0/(1 − Σ_{i=1}^p αi)
Even though conditional variance changes over time, the model can
be (covariance) stationary and the unconditional variance exists
Lecture 6: Univariate Volatility Modelling, ARCH and GARCH – Prof. Guidolin 14
ARCH Models: One Example
Even though ARCH represents progress vs. the simple rolling window,
it has one limitation: its specification is richly parameterized
o Given the empirical success of RiskMetrics, the need to pick a large p
does not come as a surprise, because such a selection obviously
surrogates the role played by σ²t|t−1 on the RHS of RiskMetrics
Consider 1963 - 2016 CRSP stock excess daily returns
o SACF/SPACF and information criteria analyses suggest an MA(1) mean
(estimation output with p-values reported on the slide)
o SACF/SPACF of squared residuals give evidence of at least an AR(5)

Lecture 6: Univariate Volatility Modelling, ARCH and GARCH – Prof. Guidolin 15

ARCH Models: One Example


o BIC criterion for MA(1)/ARCH(6) is 2.4426 and for MA(1)/ARCH(5) is
2.4513; therefore, we select the former model
o Such a BIC is lower vs. 2.8070 from a homoskedastic MA(1) model
o The MA(1)/ARCH(6) model estimated by ML in Eviews is

o Each of the estimated coefficients is positive and statistically


significant and their sum is 0.784 which establishes stationarity
o One wonders if a more parsimonious way can be found

16
Are ARCH Models Enough?
ARCH models are not set up or estimated to imply unconditional
variance = sample variance and this may be embarrassing
One constraint often imposed in estimation is variance targeting:
   α0 = σ̄²·(1 − Σ_{i=1}^p αi),   where σ̄² is set equal to the sample variance
o It guarantees that ARCH(p) yields unconditional = sample variance


How do we assess whether a CH model is “adequate” for a given
application/data sets?
① If a CH model is correctly specified, then the standardized
residuals from the model should reflect any assumptions made
when the model has been specified and estimated
o E.g., from zt+1 = εt+1/σt+1|t, check whether zt+1 ~ IID(0,1) holds
o Testable by using Corr[h(zt+1), g(zt+1−i)] = 0 (i = 1, 2, …) for sensible choices
of the functions h: ℝ⟶ℝ and g: ℝ⟶ℝ (not necessarily identical)
② A good CH model should accurately predict future variance
Lecture 6: Univariate Volatility Modelling, ARCH and GARCH – Prof. Guidolin 17

Are ARCH Models Enough?


o What does it mean that a CH model yields “good” forecasts? A
requirement is that on average the realized squared residuals must
equal the variance forecasts that the model offers
o Empirically, it implies that two simple restrictions must be satisfied in
the regression ε²t+1 = a + bσ²t+1|t + et+1, with et+1 white noise:
o a = 0 and b = 1, jointly (when this occurs, σ²t+1|t offers an unbiased
predictor of squared residuals, used as a proxy of realized variance)
o The regression R2 must be “large”
o However, this test of predictive performance may be fallacious: the
process ε²t invariably provides a poor proxy for the process followed
by the true but unobserved time-varying variance, σ²t
o This follows from ε²t+1 = σ²t+1|t·z²t+1, so that Vart[ε²t+1] = σ⁴t+1|t·Var[z²t+1]
Lecture 6: Univariate Volatility Modelling, ARCH and GARCH – Prof. Guidolin 18


Are ARCH Models Enough?
o When either σ²t+1|t (hence, σ⁴t+1|t) or the kurtosis of the stdz. residuals
are high, Var[ε²t+1] will be large, and using squared residuals to proxy
instantaneous variances exposes a researcher to a lot of noise
o This choice is almost guaranteed to yield low regression R2
o There are remedies (see the book) but this is advanced material
o Compare ARCH and RiskMetrics for daily stock returns, 1963-2016
o ARCH(6) forecasts are spikier
o However, there are significant departures from IID-ness in the
squared stdz. residuals from both models
Lecture 6: Univariate Volatility Modelling, ARCH and GARCH – Prof. Guidolin 19

Are ARCH Models Enough?


o All forecasting power of past squared
residuals is well captured
o It seems that past US equities losses
lead to subsequent higher variance,
the leverage effect
o Predictive accuracy regressions give
(std errors in parentheses):

o Crucial to report standard errors and not p-values because the simple
null hypothesis of b = 1 requires that we calculate the t ratios:

o The null of a = 0 may be rejected with p-values close to 0.000


o Given individual rejections, pointless to apply F-tests of joint hypothesis
Lecture 6: Univariate Volatility Modelling, ARCH and GARCH – Prof. Guidolin 20
Generalized ARCH Models
o Although R²s are not irrelevant, positive and significant estimates of intercepts ⟹ predicted variance is too low vs. realized variance
o The two slope coefficients significantly less than 1 ⟹ realized variance
moves over time less than what is predicted
Because of these limitations, ARCH has soon been generalized from an
AR(p)-style model for squared residuals to an ARMA(max[p,q], q)-style one:
   σ²t+1|t = ω + Σ_{i=1}^p αiε²t+1−i + Σ_{j=1}^q βjσ²t+1−j|t−j
Bollerslev (1987) observed that such a process may be written as
   ε²t+1 = ω + Σ_{i=1}^{max(p,q)} (αi + βi)ε²t+1−i − Σ_{j=1}^q βjvt+1−j + vt+1,   with vt ≡ ε²t − σ²t|t−1
21

Generalized ARCH Models


A key issue of GARCH models is to keep variance forecasts positive
o ω > 0, α1, α2, …, αp ≥ 0, and β1, β2, …, βq ≥ 0 are only sufficient conditions
o Under technical conditions on the lag polynomials characterized by α1,
α2, …, αp and β1 β2, …, βq (provided the roots of the polynomial defined
by the βs lie outside the unit circle), positivity constraint is satisfied
o As for all ARMA processes, GARCH will be (covariance) stationary if
and only if the roots of the characteristic polynomial associated to the
coeffs α1 + β1, α2 + β2, …, αmax(p,q)+ βmax(p,q) lie outside the unit circle
o As far as strict stationarity goes, in the case of a GARCH(1,1), the
condition E[ln(α1z²t + β1)] < 0 is sufficient (see Lumsdaine, 1996)
o However, because E[ln(α1z²t + β1)] ≤ ln E[α1z²t + β1] = ln(α1 + β1) < 0
under covariance stationarity, for a GARCH(1,1), covariance statio-
narity guarantees strict stationarity
GARCH is a highly successful and resilient empirical model because,
with fewer parameters than an ARCH, it may lead to a more parsi-
monious representation of volatility clustering
Lecture 6: Univariate Volatility Modelling, ARCH and GARCH – Prof. Guidolin 22
GARCH(1,1): The Reasons of Its Success
ARCH(p) is simply a GARCH(p,0) model in which there is no
memory in the process for past conditional variance predictions
Because in forecasting applications it has proven to be very hard to
beat (Hansen and Lunde, 2005), practitioners usually resort to
simple GARCH(1,1) models:
   σ²t+1|t = ω + αε²t + βσ²t|t−1
o This is an ARMA(1,1) for squared errors
o In the case of GARCH(1,1), positivity comes from the restrictions ω > 0,
α ≥ 0, and β ≥ 0, and stationarity from the constraint α + β < 1
o Exploiting the equivalence to an ARMA(1,1), the stationary long-run variance is σ̄² = ω/(1 − α − β)
Under stationarity, Wold’s theorem ⟹ GARCH(1,1) is a sample
variance that downweights distant lagged squared errors:
   σ²t+1|t = ω/(1 − β) + α Σ_{i=0}^∞ β^i ε²t−i
23

GARCH(1,1): The Reasons of Its Success


o The reason for the success of GARCH(1,1) over a complex ARCH(p) with
relatively large p is that GARCH(1,1) can be shown to be equivalent to
an ARCH(∞) model with a special structure of decaying weights!
o GARCH(1,1) predicts as a weighted average of long-term variance (the
constant), most recent forecast (GARCH term), and information about
volatility observed in the previous period (ARCH term):

o Let’s study weekly 1982-2016 returns on 10-year US Treasury notes


o A BIC-based specification leads to a simple AR(1) mean model
o SACF and SPACF of squared residuals show evidence of ARMA

24
GARCH(1,1): A Fixed Income Example
o Attempt to use ARCH leads to a large, possibly ARCH(11) specification
o GARCH(1,1) offers best trade-off between simplicity and in-sample fit

p-values

o The sum of the coefficients is 0.983 ⟹ (covariance) stationarity


o Evidence in favor of GARCH(1,1) is strong: SACF of squared stdz
residuals is characterized by absence of additional structure
o The regression that tests whether GARCH(1,1) can forecast
squared residuals gives (standard errors in parentheses):
o The intercept is not significant, while the slope does not differ significantly from 1
o F-test of hypothesis of a = 0, b = 1 gives 1.687 that with (2, 1822) d.f.


implies a p-value of 0.185 and leads to a failure to reject
Lecture 6: Univariate Volatility Modelling, ARCH and GARCH – Prof. Guidolin 25
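A minimal sketch of how such an AR(1)-GARCH(1,1) could be estimated with the `arch` package, assuming `ret` is a hypothetical pandas Series of weekly returns expressed in percent (scaling helps the optimizer):

```python
# AR(1) conditional mean with GARCH(1,1) conditional variance (sketch, Gaussian errors)
from arch import arch_model

am = arch_model(ret, mean='AR', lags=1, vol='GARCH', p=1, q=1, dist='normal')
res = am.fit(disp='off')
print(res.summary())
# 1- to 4-step-ahead variance forecasts from the end of the sample
sigma2_fore = res.forecast(horizon=4).variance
```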

The Persistence of Shocks in GARCH(1,1) Models


Although the persistence index of a GARCH(p,q) model is given by
Σ_{i=1}^p αi + Σ_{j=1}^q βj,
different coefficients contribute to increase σ²t+1|t in different ways
o The larger are the αi's, the larger is the response of σ²t+1|t to new
information; the larger are the βj's, the longer and stronger is the
memory of conditional variance to past (forecasts of) variance
For any given persistence index, it is possible for different
stationary GARCH models to behave rather differently and
therefore yield heterogeneous economic insights
o The plot reports simulations based on a baseline estimate (with α + β = 0.984)
on monthly UK stock returns, sample period 1977-2016
o The volatility scenarios different from the solid blue line fix the
persistence but impute it to alternative combinations of α and β
Lecture 6: Univariate Volatility Modelling, ARCH and GARCH – Prof. Guidolin 26
Integrated GARCH Model
In many applications to high-frequency financial data, the estimate
of Σ_{i}αi + Σ_{j}βj turns out to be close to 1
This is the empirical motivation for the IGARCH(p,q) model, where
Σ_{i}αi + Σ_{j}βj = 1 (a unit root in the ARMA representation for conditional variance)
Consequently a shock to the conditional variance is infinitely
persistent ⟹ it remains equally important at all horizons
IGARCH may be strictly stationary (under appropriate conditions,
e.g., ω > 0) but is not covariance stationary
In the case α + β = 1, i.e., α = 1 − β, i.e., IGARCH(1,1), this is no news:
   σ²t+1|t = ω + (1 − β)ε²t + βσ²t|t−1
This is just RiskMetrics in which λ = β and with an intercept,
which establishes that RiskMetrics is not covariance stationary
o The long-run variance does not exist
o Yet, then RiskMetrics should be generalized to include an intercept
and to have ARMA “complexity dimensions” p and q that should be
either estimated or at least selected on the basis of the data
Lecture 6: Univariate Volatility Modelling, ARCH and GARCH – Prof. Guidolin 27

Exponential GARCH Model


Similarly to ARCH, GARCH captures thick-tailed returns and
volatility clustering but it is not well suited to capture the “leverage
effect” because σ²t+1|t is only a function of the ε²t's and not of their signs
In the exponential GARCH (EGARCH) model of Nelson (1991),
lnσ²t+1|t depends on both the size and the sign of lagged residuals
and therefore can capture asymmetries; in the (1,1) case:
   lnσ²t+1|t = ω + α(|zt| − E|zt|) + θzt + β lnσ²t|t−1,   with zt = εt/σt|t−1

28
Exponential GARCH Model
Because σ²t+1|t = exp(lnσ²t+1|t) and exp(·) > 0, EGARCH always
yields positive variance forecasts without imposing restrictions
o g(zt) = α(|zt| − E|zt|) + θzt is a function of both the magnitude and the sign
of past standardized residuals, and it allows the conditional variance
process to respond asymmetrically to rises and falls in asset prices
o It can be rewritten as: g(zt) = (α + θ)zt − αE|zt| for zt ≥ 0 and g(zt) = (θ − α)zt − αE|zt| for zt < 0
o Nelson’s EGARCH has another advantage: in a GARCH, the parameter
restrictions needed to ensure moment existence become increasingly
stringent as the order of the moment grows
o E.g., in the case of an ARCH(1), for an integer r, the 2r-th moment exists if and
only if α1^r·∏_{j=1}^r(2j − 1) < 1; for r = 2, existence of the unconditional
kurtosis requires α1 < (1/3)^(1/2)
o In the EGARCH(p,q) case, if the error process ηt in the ARMA repre-
sentation of the model has all moments and the representation is stationary, then all
moments of an EGARCH process exist
How far better can EGARCH fare versus a standard GARCH model?
how important are asymmetries in conditional variance?
Lecture 6: Univariate Volatility Modelling, ARCH and GARCH – Prof. Guidolin 29
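A minimal sketch of an EGARCH fit with an asymmetry term via the `arch` package, again with a hypothetical Series `ret` of daily excess returns in percent:

```python
# EGARCH(1,1) with a leverage/asymmetry (o) term (sketch under assumed data `ret`)
from arch import arch_model

egarch = arch_model(ret, mean='Constant', vol='EGARCH', p=1, o=1, q=1).fit(disp='off')
print(egarch.summary())
# A negative and significant coefficient on the o-term indicates that negative shocks
# raise next-period log-variance by more than positive shocks of the same size.
```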

EGARCH and Asymmetries: One Example


o Let’s return to the 1963-2016 CRSP daily stock excess return data
o A model specification search based on information criteria in the
space of GARCH and EGARCH(p, q) models yields

o The ICs select large models: a GARCH(2,2) in the GARCH family and
even a more complex EGARCH(3,3), the latter being preferred:

GARCH(1,1)

Lecture 6: Univariate Volatility Modelling, ARCH and GARCH – Prof. Guidolin 30


EGARCH and Asymmetries: One Example

EGARCH(3,3)

o This process implies an odd, mixed leverage effect, because negative


returns from the previous business day increase predicted variance,
but negative returns from two previous business days depress it

Lecture 6: Univariate Volatility Modelling, ARCH and GARCH – Prof. Guidolin 31

EGARCH and Asymmetries: One Example


o Although variance forecasts are not radically different, the scatter plot
shows that when volatility is predicted to be high, often GARCH(1,1)
predicts a higher level than EGARCH(3,3) does
o We have tested the two models for their ability to predict squared
realized residuals, obtaining:

o While in the case of GARCH(1,1) we obtain the same result as before,


in the case of EGARCH the R2 increases but the results on the
intercept and slope point towards a rejection of model accuracy
A preview of topics to follow:
Are alternative, possibly more complex GARCH-structures useful?
How do you estimate a GARCH-type model?
Is there any gain in specifying εt to be anything but N(0, σ²t|t−1)?
How do you test whether your data are affected by GARCH?
Lecture 6: Univariate Volatility Modelling, ARCH and GARCH – Prof. Guidolin 32
Appendix A: Key Properties of ARCH(1)

33

Appendix A: Key Properties of ARCH(1)

34
Appendix A: Key Properties of ARCH(1)

35
Lecture 7: Advanced Volatility
Modelling

Prof. Massimo Guidolin

20192– Financial Econometrics

Winter/Spring 2022

Overview
Threshold GARCH Model
Power ARCH and Nonlinear GARCH Models
The Component GARCH Model
GARCH-in-Mean Models and Asset Pricing Theory
Non-Normalities in GARCH Modelling: t-Student GARCH
Generalized Error Distribution GARCH
Testing for ARCH and GARCH
Forecasting with GARCH Models
ML and Quasi-ML Estimation
Lecture 7: Advanced Volatility Modelling – Prof. Guidolin 2
All in the Family! Other GARCH Models
The GARCH class is so successful in risk mgmt., derivative pricing,
and to some extent portfolio choice, that hundreds of different spin-
offs of GARCH(p,q) have been proposed and tested
o Bollerslev (2009) published a glossary that compiled 139 different
GARCH models, when also a few “multivariate” strands were included!
Threshold GARCH(p,d,q) (TARCH) offers a way to capture
asymmetries and leverage that is alternative to EGARCH, but
requires positivity-driven restrictions:
   σ²t+1|t = ω + Σ_{i=1}^p αiε²t+1−i + Σ_{l=1}^d δlε²t+1−l·I(εt+1−l < 0) + Σ_{j=1}^q βjσ²t+1−j|t−j
Threshold GARCH Model


o Assuming for simplicity p = d, good news and bad news have differential effects on variance: past good news (εt−l+1 ≥ 0) has an impact
of αi, while bad news (εt−l+1 < 0) has an impact of αi + δi
o If δi > 0, bad news increases volatility, and we say that there is a
leverage effect of the ith order
o The standard GARCH(p, q) is a special case of the Threshold GARCH
model where the threshold order, d, is set to 0
o In a TARCH, the ARMA structure becomes nonlinear and the unconditional variance of the process, under covariance stationarity, is:
   σ̄² = ω / (1 − Σ_{i=1}^p αi − 0.5Σ_{l=1}^d δl − Σ_{j=1}^q βj)
where Pr(εt < 0) = 0.5 holds for all symmetric
distributions with zero mean
o An on-line example on the book’s web site documents the different degree of
estimated exposure to asymmetries of different asset classes
Lecture 7: Advanced Volatility Modelling – Prof. Guidolin 4
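A minimal sketch of a threshold-type asymmetric model with the `arch` package: with o=1 and power=2.0 this is the GJR variant of the asymmetric family discussed here, while power=1.0 gives the threshold specification in standard deviations (Zakoian-style); `ret` is the same hypothetical returns Series:

```python
# Asymmetric GARCH variants via the shared GARCH volatility class (sketch)
from arch import arch_model

gjr = arch_model(ret, vol='GARCH', p=1, o=1, q=1, power=2.0).fit(disp='off')
tarch = arch_model(ret, vol='GARCH', p=1, o=1, q=1, power=1.0).fit(disp='off')
print(gjr.summary())   # the gamma[1] coefficient captures the extra impact of bad news
```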
Power ARCH and Nonlinear GARCH Models
o The example finds that equity data contain strong evidence of asym-
metries in GARCH; bond returns data do not, and economically it
would be more complex to find any justification for such asymmetries
o The frequency of the data is not crucial: stock returns contain leve-
rage both at daily and monthly frequency and over different periods
Taylor (1986) and Schwert (1989) introduced a standard deviation
GARCH:
   σt+1|t = ω + Σ_{i=1}^p αi|εt+1−i| + Σ_{j=1}^q βjσt+1−j|t−j
The model was generalized to a flexible Power ARCH(p, d, q) model:
   σ^γ_t+1|t = ω + Σ_{i=1}^p αi(|εt+1−i| − δiεt+1−i)^γ + Σ_{j=1}^q βjσ^γ_t+1−j|t−j
where γ > 0, |δi| ≤ 1 for i = 1, 2, …, d and δi = 0 for i > d


o For γ = 1 and d = 0, one obtains the symmetric standard deviation
model of Taylor and Schwert
o For γ = 2 and d = 0, this is a standard GARCH(p,q) model
o For γ = 2 and d = 1, this is a threshold quadratic GARCH(p,d,q)
Lecture 7: Advanced Volatility Modelling – Prof. Guidolin 5

Power ARCH and Nonlinear GARCH Models


o When γ = 1 and p = d, this is Zakoian’s (1994) threshold GARCH
o Another nonlinear asymmetric GARCH, or NAGARCH, is, in the (1,1) case, simply
   σ²t+1|t = ω + α(εt − δσt|t−1)² + βσ²t|t−1 = ω + ασ²t|t−1(zt − δ)² + βσ²t|t−1
o NAGARCH(1,1) is asymmetric, because if δ ≠ 0, the impact of a past
squared standardized shock (for given σ²t|t−1) is proportional to
ασ²t|t−1z²t − 2αδσ²t|t−1zt, which is no longer a simple, symmetric
quadratic function of the standardized residuals, differently from GARCH
o It is nonlinear, because NAGARCH(1,1) may be written as:
   σ²t+1|t = ω + αε²t + β(zt)σ²t|t−1,   with β(zt) = β + αδ² − 2αδzt
Lecture 7: Advanced Volatility Modelling – Prof. Guidolin 6


Nonlinear GARCH
o β(zt) = β + αδ² − 2αδzt is a function that makes the beta coefficient of the
GARCH dependent on lagged standardized residuals
o A linear affine model in which some coefficients depend on condition-
ning variables is a nonlinear model, a time-varying coefficient model
NAGARCH plays a key role in option pricing with stochastic
volatility because it allows us to derive closed-form expressions for
European option prices in spite of its rich volatility dynamics
Because in a NAGARCH σ²t+1|t = ω + ασ²t|t−1(zt − δ)² + βσ²t|t−1 and zt is
independent of σ²t|t−1 (as σ²t|t−1 just depends on an infinite number
of past squared returns), it is possible to easily derive the long-run,
unconditional variance under the assumption of stationarity:
   E[σ²t|t−1(zt − δ)²] = E[σ²t|t−1]·E[(zt − δ)²] = E[σ²t|t−1](1 + δ²)
because E[zt] = 0 and E[z²t] = 1
As a result, σ̄² = ω / [1 − α(1 + δ²) − β]
Lecture 7: Advanced Volatility Modelling – Prof. Guidolin 7

Power and Nonlinear GARCH: One Example


o The unconditional variance exists and is positive if and only if α(1 + δ²) + β < 1
o NAGARCH(1,1) is stationary if and only if this condition is satisfied
o Consider asymmetric Power GARCH(1,1,1) estimates for daily 1963-
2016 US excess stock returns (the table reports, among other things, the estimate of γ and its p-value)

o γ is estimated at 1.187 and the standard error of the estimate is 0.054,


which means that the estimate significantly differs from both 1 and 2

8
The Component GARCH Model
o Because past shocks are here raised to a power which is less than 2,
PARCH forecasts are less spiky vs. threshold GARCH
o Once differences in estimates and functional form are factored in, the
two sets of forecasts are not so different (correlation is 0.987)
The variance process in a plain vanilla GARCH model shows mean
reversion to σ̄² = ω/(1 − α − β), which is a constant at all times
Engle and Lee (1999) generalize GARCH to a component GARCH (C-
GARCH) model that allows mean reversion to a time-varying level
Equivalently, the model incorporates a distinction between transitory and permanent conditional variance dynamics:
   σ²t+1|t − v²t+1|t = α(ε²t − v²t|t−1) + β(σ²t|t−1 − v²t|t−1)     (short-term)
   v²t+1|t = ω + ψv²t|t−1 + φ(ε²t − σ²t|t−1)     (long-term)
o The model can be easily generalized to the (p,q) case
o First equation: dynamics of the transitory variance component which,
by construction, converges to v²t+1|t
o Second equation: dynamics of the long-term or permanent variance
Lecture 7: Advanced Volatility Modelling – Prof. Guidolin 9

The Component GARCH Model


o Algebra shows that the C-GARCH(1,1) may be re-written so that the
transitory component of ε²t+1 is an ARMA(1,1) process that converges
to v²t+1|t with a persistence rate α + β < 1
o The permanent component is an AR(1) process with persistence ψ
o One can combine the two components to show (see the book) that the
C-GARCH(1,1) is a (nonlinear) restricted GARCH(2,2)
One wonders about the use of GARCH(p,q) models with p ≥ 2 and q ≥ 2
Higher-order GARCH models are rarely used, and this GARCH(2,2)
case represents one of the few cases in which—although it is sub-
ject to constraints (see the book)—a (2,2) model has relevance
o Note that GARCH(2,2) fulfills positivity constraints because CGARCH
(1,1) does, but some coefficients may be negative!
Lecture 7: Advanced Volatility Modelling – Prof. Guidolin 10
Component GARCH: One Fixed Income Example
o You can also include asymmetric, leverage-type effect in CGARCH:

transitory permanent

o Consider weekly 1982-2016 data on the negative of the changes in 10-


year US Treasury rates
o Comparing a plain vanilla GARCH(1,1) with a CGARCH(1,1):

GARCH(1,1)

Component GARCH(1,1)

o While the permanent component features a precisely estimated


GARCH(1,1), the transitory variance component is rather weak
Lecture 7: Advanced Volatility Modelling – Prof. Guidolin 11

Component GARCH: One Fixed Income Example

o Most of the variance of govy returns comes from the permanent component, with the transitory variance small and scarcely persistent, used to fit occasional weeks of low volatility, when it gives a negative contribution
o Although a CGARCH(1,1) model is in fact a restricted GARCH(2,2) model, it gives forecasts that are very hard to tell apart from a plain vanilla GARCH(1,1)
o In fact, GARCH(1,1), GARCH(2,2), and CGARCH(1,1) give maximized
log-likelihoods of 1425.64, 1426.922, and 1426.921
o Improvement from a model more complex vs. GARCH(1,1) is modest
Lecture 7: Advanced Volatility Modelling – Prof. Guidolin 12
Component GARCH: One Equity Example
o Are component GARCH models of dubious practical relevance?
o When estimation is applied to 1963-2016 US daily excess stock
returns, model selection is more favorable, but asymmetries needed:

o The asymmetric effect is precisely estimated
o CGARCH models have a crucial derivative-pricing justification: often the fair price may not be made to depend on short-run spikes
Lecture 7: Advanced Volatility Modelling – Prof. Guidolin 13

GARCH-in Mean
So far, we have rigidly and somewhat artificially kept distinct the
conditional variance from the conditional mean specifications
o Apart from the obvious: the residuals that a CH model uses are esti-
mated by subtracting from the data an estimate of conditional mean
It is possible to jointly model conditional mean and variance when
the latter appears in the specification of the former
o Formally, when μ_{t+1|t} is a function of σ_{t+1|t} we have a (G)ARCH-in-mean model, with g(·) possibly linear (a sketch of a typical specification follows below)
o When γ ≠ 0, the risk premium will be time-varying
o One of the basic tenets of modern finance is a linear relationship btw. expected returns and variance, e.g., Merton’s (1973) intertemporal CAPM
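A sketch of one common (G)ARCH-in-mean specification (whether g is applied to the conditional variance or to the volatility varies across applications; the symbols μ_0 and g are illustrative, not taken from the slide):

```latex
% GARCH-in-mean: the conditional second moment enters the conditional mean
R_{t+1} = \mu_0 + \gamma\, g\!\big(\sigma^2_{t+1|t}\big) + \sigma_{t+1|t}\, z_{t+1},
\qquad z_{t+1} \sim \text{IID}(0,1)
% typical choices: g(x) = x (variance-in-mean) or g(x) = \sqrt{x} (volatility-in-mean)
```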
Up to this point nothing was said on the distribution of the standardized shocks, besides z_{t+1} ~ IID(0,1)
o One usually assumes z_{t+1} ~ IID N(0,1), although this may be sub-optimal: z_{t+1} ~ IID D(0,1), with D non-normal, may provide additional degrees of freedom
Lecture 7: Advanced Volatility Modelling – Prof. Guidolin 14
Non-Normal Marginal Densities in GARCH
o Alternatively, assume the data have been truly generated by:
o δ is a vector of unknown parameters and D is a generic distribution with zero mean and unit variance
o Suppose that, even though the conditional mean and volatility functions μ_{t+1|t} and σ_{t+1|t} were correctly specified, we estimate:
o where the distribution D is incorrectly specified to be Gaussian


Such distributional misspecifications may cause the resulting ML
estimates to be inefficient or even inconsistent

[Figure: empirical kernel densities vs. matched Gaussian benchmarks, US daily stock returns, 1963-2016]
15

Non-Normal Marginal Densities in GARCH


o Leftmost plot concerns raw data and—by comparison with a dashed
normal density with same mean and variance—shows leptokurticity
o Rightmost plot shows that the standardized residuals from a
threshold GARCH(1,1,1) with Gaussian shocks follow a distribution
that is much closer to a normal density than the raw data
o Yet, the differences btw. the estimated empirical kernel density and
the dashed Gaussian benchmark remain visible
A simple GARCH modeling strategy may be inadequate, without
further adjustments to the assumed marginal density of the errors
o One robust but technically involved idea is to escape the full specifica-
tion of D to be replaced by a nonparametric or semiparametric GARCH
o A more classical option consists instead in proposing an alternative
parametric specification
Key candidate: the standardized residuals follow a symmetric t-Student, with a density written in terms of the Gamma function and of ν, the number of degrees of freedom (a sketch of this density follows below)
Lecture 7: Advanced Volatility Modelling – Prof. Guidolin 16
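A sketch of the density referred to above, in the standardized (unit-variance) form that is consistent with the numerical kernel comparison reported two slides below; the book's exact normalization may differ:

```latex
% standardized Student's t with \nu > 2 degrees of freedom (zero mean, unit variance)
f(z_{t+1};\nu) = \frac{\Gamma\!\left(\tfrac{\nu+1}{2}\right)}
                      {\Gamma\!\left(\tfrac{\nu}{2}\right)\sqrt{\pi(\nu-2)}}
                 \left(1 + \frac{z_{t+1}^2}{\nu-2}\right)^{-\tfrac{\nu+1}{2}}
% as \nu \to \infty the negative-power kernel tends to the Gaussian kernel,
% by the definition of e:  (1 + z^2/(\nu-2))^{-(\nu+1)/2} \to e^{-z^2/2}
```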
t-Student GARCH
o The gamma function has well-known recursive properties
o Note that the t-Student density is a negative power function, while its Gaussian limit is a negative exponential function; the limit derives from the very definition of the number “e”


As the number of degrees of freedom diverges to infinity, a t-
Student distribution converges to a standard Gaussian
o The t-Student nests the Gaussian ⟺ the data may tell us whether
their underlying DGP was (approximately) normal or not
o In practice, a useful rule-of-thumb is that when ν is large and exceeds
20, then for applied purposes f(z_{t+1}; ν) may be well approximated by
a normal distribution
o Comparison btw. the t-Student PDF (a negative power fnct.) and the
limit standard normal (a negative exponential fnct.) leads to a further
point 17
Lecture 7: Advanced Volatility Modelling – Prof. Guidolin

t-Student GARCH
o The tails of a negative power fnct. decay to 0 as zt+1⟶±∞ at a speed
given by the power –ν/2; the lower is ν, the slower the rate
o The negative exponential is the specification whose tails go to 0 faster
among all functions
o E.g., for z = 4 (a shock 4 std. dev. from the mean), under a negative power with ν = 10 the kernel takes the value 0.002376, while the negative exponential (Gaussian) kernel equals 0.000336
o The former is then 0.002376/0.000336 ≈ 7 times larger than the latter
o Repeat this experiment for an extreme realization, say a (standardized) return 12 std. dev. away from the sample mean (say a -9.5% return), and the ratio becomes a few thousand billion times larger!
Events that are impossible under a Gaussian distribution remain
rare but billions of times more likely under a fat-tailed, t-Student
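A minimal Python sketch that reproduces the comparison above, assuming the standardized-t kernel (1 + z²/(ν − 2))^(−(ν+1)/2) and the Gaussian kernel exp(−z²/2):

```python
import numpy as np

def t_kernel(z, nu):
    """Negative-power kernel of a standardized Student's t with nu degrees of freedom."""
    return (1.0 + z**2 / (nu - 2.0)) ** (-(nu + 1.0) / 2.0)

def gauss_kernel(z):
    """Negative-exponential kernel of a standard normal."""
    return np.exp(-z**2 / 2.0)

for z in (4.0, 12.0):
    t_val, g_val = t_kernel(z, nu=10), gauss_kernel(z)
    print(f"z = {z:>4}: t-kernel = {t_val:.6g}, normal kernel = {g_val:.6g}, "
          f"ratio = {t_val / g_val:.3g}")
# z = 4 gives roughly 0.002376 vs. 0.000336 (ratio about 7), as in the slide;
# z = 12 gives a ratio of astronomical size
```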
Consider a flexible threshold GARCH(1,1,1) under normal vs. t-
Student applied to a variety of time series
o Both the accurately estimated values of ν—always below 20 and often close to 5—and (unreported) information criteria confirm that all the data require a t-Student assumption
18
Lecture 7: Advanced Volatility Modelling – Prof. Guidolin
t-Student GARCH

o As the frequency of data declines (from daily, to weekly, to monthly),


the need for t-Student innovations remains
o The type of financial series does not seem to drive such a feature 19
Lecture 7: Advanced Volatility Modelling – Prof. Guidolin

Moments under t-Student GARCH


Assuming they exist, the moments of a t-Student are characterized as follows: the variance exists iff ν > 2 and the kurtosis exists iff ν > 4
o E.g., if ν = 3, then only the mean and variance exist and the skewness is not 0, it is simply not defined; if ν = 3.001, then mean, variance, and skewness exist, with both the mean and the skewness equal to 0, etc.
o Please pay attention: variance and kurtosis cannot be negative!
o Three further implications follow: it is no longer true that σ²_{t+1|t} is the time-t forecast of the one-step-ahead variance, as the conditional variance exceeds σ²_{t+1|t}
o When ν > 4, a t-Student marginal density for the shocks in a GARCH model ends up creating excess kurtosis versus a Gaussian benchmark
o When ν diverges to +∞ and the t-Student → a normal distribution, then σ²_{t+1|t} becomes the conditional variance and the excess kurtosis is zero
o A t-Student GARCH therefore implies higher conditional and unconditional kurtosis
Lecture 7: Advanced Volatility Modelling – Prof. Guidolin 20
Testing for ARCH
How do we decide whether our data need a CH model?
We know how to test whether a given CH model is misspecified, see
lecture 6
o We tested the null that the standardized residuals are IID(0,1) using Box-Jenkins tools
If such tests are applied not to the residuals of a CH model but to
the raw, unfiltered data, they are tests for ARCH effects
A better formalized approach is given by Lagrange multiplier (LM) tests:
① Estimate the most appropriate conditional mean model (e.g., some ARMA model or a regression), and collect the squared (standardized) residuals
o If one approaches yet-unexplored data, the fitted model is homoskedastic, so that standardizing the residuals only amounts to dividing them by a constant
② Regress the squared (standardized) residuals on a constant and on m of their lagged values in an auxiliary regression with white noise errors:
③ In the absence of ARCH, the estimated values of ξ1 through ξm should be 0 and the regression has no explanatory power, so that its R² will be low
o Under the null of no ARCH, the test statistic TR² converges to a χ²_m, so that a large TR² leads to rejecting the joint null ξ1 = … = ξm = 0, which is equivalent to rejecting the null hypothesis of no ARCH (a sketch of the procedure follows this slide)
Lecture 7: Advanced Volatility Modelling – Prof. Guidolin 21
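A minimal Python sketch of the LM procedure just outlined, implemented through the auxiliary OLS regression; the residual series `resid` and the lag order m are placeholders to be supplied by the user:

```python
import numpy as np
from scipy import stats

def arch_lm_test(resid, m):
    """Engle-type LM test for ARCH: regress squared residuals on m lags of themselves."""
    e2 = np.asarray(resid, dtype=float) ** 2
    y = e2[m:]                                              # dependent variable
    X = np.column_stack([np.ones(len(y))] +                 # constant
                        [e2[m - j:-j] for j in range(1, m + 1)])  # m lags of e^2
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)            # OLS on the auxiliary regression
    fitted = X @ beta
    r2 = 1.0 - np.sum((y - fitted) ** 2) / np.sum((y - y.mean()) ** 2)
    stat = len(y) * r2                                      # TR^2 statistic
    pval = stats.chi2.sf(stat, df=m)                        # asymptotic chi-square(m) p-value
    return stat, pval

# usage: stat, pval = arch_lm_test(ols_residuals, m=4)
```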

Testing for ARCH: A Fixed Income Example


Also likelihood ratio tests may be examined, although they have an
uncertain distribution under the null
o Consider testing for ARCH in the residuals of a Gaussian AR(1) model for weekly returns on 5-year Treasury notes, selected on the basis of the minimum BIC
o The R² is 0.045, with T = 1,820 ⟹ TR² = 82.54 which, under a χ²_4, yields a p-value of essentially 0
A flexible way to capture and test for ARCH asymmetries is the news impact curve (NIC)
22
The News Impact Curve and ARCH Asymmetries
o For GARCH(1,1), the NIC is a quadratic function (a parabola) symmetric around ε_t = 0 (see the sketch after this slide)
o For EGARCH(1,1), such a function is clearly not symmetric around ε_t = 0
o Let us consider the asymmetry properties of CH models of the daily value factor (HML) returns over a 1963-2016 sample
o We estimate 3 models with GED errors (from an AR(1) conditional mean)
o The TARCH and EGARCH models are different (the EGARCH kink is caused by the absolute value), but they both display negative leverage
Lecture 7: Advanced Volatility Modelling – Prof. Guidolin 23
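A sketch of the news impact curves (NICs) for two of the models above, holding the lagged variance fixed at its unconditional level σ̄²; θ denotes a generic threshold coefficient and is not a symbol taken from the slide:

```latex
% GARCH(1,1): a parabola that is symmetric around \varepsilon = 0
NIC_{GARCH}(\varepsilon) = \omega + \beta\,\bar{\sigma}^2 + \alpha\,\varepsilon^2
% threshold (GJR-type) GARCH(1,1,1): a steeper branch for negative shocks when \theta > 0
NIC_{TGARCH}(\varepsilon) = \omega + \beta\,\bar{\sigma}^2 + \alpha\,\varepsilon^2
                          + \theta\,\varepsilon^2\,\mathbf{1}\{\varepsilon < 0\}
```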

Forecasting with GARCH Models


The point of GARCH models is to allow the calculation of forecasts
o For concreteness, we focus on GARCH(1,1)
o We still ignore estimation issues and parameter uncertainty
At one level, the answer is simple, as the one-step-ahead forecast σ²_{t+1|t} is given directly by the model
We are rarely interested in the 1-step-ahead variance only—consider a generic horizon H:
o This is a recursive relationship: the deviations of the t+H forecasts from the unconditional variance equal α + β times the predicted deviations of the t+H−1 forecasts from the unconditional variance
o By recursive substitution we obtain the closed form sketched after this slide
Lecture 7: Advanced Volatility Modelling – Prof. Guidolin 24
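The recursion just described admits the closed form sketched here, writing σ̄² = ω/(1 − α − β) for the unconditional variance of a stationary GARCH(1,1):

```latex
E_t\big[\sigma^2_{t+H}\big] - \bar{\sigma}^2
  = (\alpha+\beta)\,\Big(E_t\big[\sigma^2_{t+H-1}\big] - \bar{\sigma}^2\Big)
  = (\alpha+\beta)^{H-1}\,\big(\sigma^2_{t+1|t} - \bar{\sigma}^2\big)
% as H grows, the forecast converges geometrically to \bar{\sigma}^2 at rate \alpha+\beta
```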


Forecasting with GARCH Models
Working backwards H − 1 times, we obtain the expression above
o A stationary GARCH is mean-reverting and its long-run forecasts converge to the unconditional variance
o Because in a GARCH we restrict both α and β to be non-negative, if the current forecast exceeds the long-run variance then all subsequent forecasts do as well, and vice versa
o We examine a t-Student AR(1) GARCH(1,1)-in-mean model for weekly returns on 5-year Treasury notes, over 1982-2016

Lecture 7: Advanced Volatility Modelling – Prof. Guidolin 25

Forecasting with GARCH Models: One Example
[Figure: term structure of variance forecasts as of Dec. 2016, starting from the 1-week horizon; the curve is upward sloping because the current forecast lies below the long-run level, and with a persistence index of 0.99 convergence is slow—it takes more than 5 years]
Lecture 7: Advanced Volatility Modelling – Prof. Guidolin 26
Forecasting with GARCH Models
Because under RiskMetrics we know that α + β = 1, it follows that the H-step-ahead forecast equals the current one-step-ahead forecast at any horizon
Any shock to current variance persists forever: if today is a high (low)-variance day, then RiskMetrics predicts that all future days will be high (low)-variance days, which is unrealistic
o Under RiskMetrics (more generally, IGARCH) σ²_{t+1|t} is a random walk
In asset allocation, we care for long-horizon returns:
o Because the model implies that the residuals have zero autocorrelations, the variance of the cumulative H-day return equals the sum of the H per-period variance forecasts
o Under a stationary GARCH(1,1), this sum admits a closed form (see the sketch below)
Lecture 7: Advanced Volatility Modelling – Prof. Guidolin 27

Forecasting with GARCH Models


The variance of the (log-) long-horizon return is not simply H times the unconditional, long-run variance
Hσ̄² needs to be adjusted to take into account transitory effects
o However, under RiskMetrics, the variance of long-horizon returns is exactly H times the current one-step-ahead variance
o The usual “×H” multiplication rule holds only when α + β = 1; the further α + β falls below 1, the more inaccurate it is (the sketch below illustrates this)
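A minimal Python sketch of both results; the GARCH(1,1) parameters and the current one-step-ahead forecast are made-up values used only for illustration:

```python
import numpy as np

def garch11_term_structure(omega, alpha, beta, sigma2_next, H):
    """Per-period variance forecasts E_t[sigma^2_{t+h}], h = 1..H, for a stationary GARCH(1,1)."""
    sigma2_bar = omega / (1.0 - alpha - beta)          # unconditional variance
    h = np.arange(1, H + 1)
    return sigma2_bar + (alpha + beta) ** (h - 1) * (sigma2_next - sigma2_bar)

# illustrative parameters: persistence 0.98, current forecast below the long-run level
omega, alpha, beta, sigma2_next, H = 0.02, 0.08, 0.90, 0.6, 260
path = garch11_term_structure(omega, alpha, beta, sigma2_next, H)

var_H_days = path.sum()                                # variance of the cumulative H-day return
naive = H * omega / (1.0 - alpha - beta)               # the "x H" rule based on the long-run variance
print(f"H-day variance: {var_H_days:.2f} vs. naive H*sigma2_bar: {naive:.2f}")
# the two coincide only when alpha+beta -> 1 (RiskMetrics) or when sigma2_next equals sigma2_bar
```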
In a frequentist statistical perspective, to estimate the parameters of
a GARCH model means that, given a random sample, one wants to
select a vector θ ∈ ℝ^K (K is the number of parameters) to maximize a
criterion function that measures how plausible each value for θ is
Choice of the criterion and of a method to maximize it defines the
(point) estimation method
Three methods typically applied to GARCH: 1) (Generalized)
Method of Moments, 2) Maximum Likelihood, 3) Quasi MLE
Lecture 7: Advanced Volatility Modelling – Prof. Guidolin 28
Hints to Estimation of GARCH Models: GMM
One idea is to look for a (unique) θMM such that some statistical
features implied by the process—e.g., some moments, such as
unconditional means and variances, or the IRFs—are the same:
⊕ when computed from the assumed stochastic process under θ
= θMM, and
⊕ in the sample
One such estimator, based on matching the sample with population
moments, is the method-of-moment (MM) estimator
o See also statistics prep course
Problem: although intuitive, because MM does not exploit the entire
empirical density of the data but only a few sample moments, it is
clearly not as efficient as other methods, e.g., MLE
o MM yields standard errors that are larger than necessary
o GARCH models can be estimated with methods that extend the MM, called the generalized method of moments (GMM), which requires at least K orthogonality conditions implied by the model
Lecture 7: Advanced Volatility Modelling – Prof. Guidolin 29

Hints to Estimation of GARCH Models: MLE

o The ML estimator maximizes the likelihood over Θ, the subset of admissible parameters (i.e., any constraints go in here)
o MLE requires fully specified parametric models: given θ, all necessary information is available to compute the probability of each obs. in the sample under the joint PDF or to simulate the variable(s) of interest
o Consider the MLE of a standard Gaussian AR(1)/GARCH(1,1) model
o Under IID normal shocks, the conditional density—the contribution to the total likelihood, l_t—of the time t observation, given the information up to time t−1, follows from normality

Lecture 7: Advanced Volatility Modelling – Prof. Guidolin 30


Hints to Estimation of GARCH Models: MLE
o Because each shock is conditionally independent of the others (IIDness), the total PDF of the entire sample is the product of T such densities
o We condition on the probability of the very first observation
o Because it is computationally convenient, we deal with the natural logarithm, the log-likelihood function of the sample
o To maximize a function of θ and to maximize any monotone increasing transformation of the same function (e.g., ln f) gives the same maximizer
o The book shows the log-likelihood function for the Gaussian AR(1)/GARCH(1,1) (a sketch in code is given after this slide)
o MLE applies whenever the distribution of the shocks is fully specified, not only to a Gaussian distribution; e.g., a t-Student or GED will do
o This description is for a given, fixed θ, but infinite choices are possible; one would need to repeat this operation an infinite number of times to span all the vectors of parameters in Θ
Lecture 7: Advanced Volatility Modelling – Prof. Guidolin 31
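A minimal Python sketch of the Gaussian AR(1)/GARCH(1,1) log-likelihood and of its numerical maximization; the simulated data, starting values, and the crude penalty for inadmissible parameters are illustrative choices, not the book's recipe:

```python
import numpy as np
from scipy.optimize import minimize

def neg_loglik(theta, r):
    """Negative Gaussian log-likelihood of an AR(1) mean with GARCH(1,1) variance."""
    phi0, phi1, omega, alpha, beta = theta
    if omega <= 0 or alpha < 0 or beta < 0 or alpha + beta >= 1:
        return 1e10                                    # crude penalty for inadmissible theta
    eps = r[1:] - phi0 - phi1 * r[:-1]                 # AR(1) residuals
    sigma2 = np.empty_like(eps)
    sigma2[0] = eps.var()                              # initialize at the sample variance
    for t in range(1, len(eps)):
        sigma2[t] = omega + alpha * eps[t - 1] ** 2 + beta * sigma2[t - 1]
    ll = -0.5 * (np.log(2 * np.pi) + np.log(sigma2) + eps ** 2 / sigma2)
    return -ll.sum()

# simulated returns just to make the example self-contained
rng = np.random.default_rng(0)
r = rng.standard_normal(2000)

theta0 = np.array([0.0, 0.05, 0.05, 0.05, 0.90])       # illustrative starting values
res = minimize(neg_loglik, theta0, args=(r,), method="Nelder-Mead",
               options={"maxiter": 5000, "xatol": 1e-6, "fatol": 1e-6})
print(res.x)                                           # MLE of (phi0, phi1, omega, alpha, beta)
```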

Hints to Estimation of GARCH Models: MLE


o Appropriate methods of numerical, constrained optimization need to be used: this is what packages such as Matlab, Gauss, E-Views, or Stata are for
o Posted optional material describes Newton’s method to perform the optimization (but better and more complex methods exist)
No need for further examples: all previous cases were estimated via MLE
Under correct specification of the estimated functional forms (conditional mean and conditional variance) and of the assumed parametric density for the errors, MLE has optimal properties (consistency, asymptotic normality, and efficiency)
o The standard errors for MLEs are derived from the information matrix
Lecture 7: Advanced Volatility Modelling – Prof. Guidolin 32
Hints to Estimation of GARCH Models: Quasi MLE
o MLE is the most efficient estimator (i.e., it reaches the Cramér-Rao lower bound) because it exploits the information in the joint PDF of the data
o See the statistics prep course for details
o The result implies that the asymptotic covariance matrix of the estimator is the inverse of the information matrix
o This covariance matrix can be used for hypothesis testing by constructing the usual z-ratio statistic
o The book reports examples of the model dependence of std. errors
Unfortunately, MLE requires that we know the parametric functional form of the joint PDF of the sample and that the z_t are IID
o If one assumes normality, t-Student or GED, everything depends on that assumption, with no margin of error
What if we are not positive about the parametric distribution? E.g., all we can say is that z_{t+1} ~ IID D(0,1), but D is unknown: can we still apply the ML procedure and count on some of the good properties of MLEs?
The answer is a qualified—as it holds only subject to specific conditions—“yes”
Lecture 7: Advanced Volatility Modelling – Prof. Guidolin 33

Hints to Estimation of GARCH Models: Quasi MLE


The resulting estimator is the quasi (or pseudo) maximum like-
lihood estimator (QMLE)
o The associated proof is one of the most frequently exploited findings in
econometrics—in a way, as close to magic as econometrics can get!
o Under QML, even though the distribution of the shocks z_{t+1} is non-normal (unknown), under some conditions an application of MLE based on z_{t+1} ~ IID N(0,1)—where normality is anyway assumed to hold—will yield consistent estimators
o Apart from IIDness and some technical requirements, the conditions are:
① The conditional variance, σ²_{t+1|t}, must be correctly specified
② The conditional mean function, μ_{t+1|t}, must be correctly specified
34
Hints to Estimation of GARCH Models: Quasi MLE
o The need to correctly specify the conditional mean function applies also
when our main interest lies with the estimation of conditional variance
However QMLE methods imply a cost, in a statistical sense: QMLE
will in general be less efficient than ML estimators are
o The denominator of the z-score statistics will be larger under QMLE vs.
MLE, and this implies less power of the tests
Lecture 7: Advanced Volatility Modelling – Prof. Guidolin 35

Quasi MLE and Sequential ML Estimation


o In practice, QMLE buys us the freedom to worry about distribution of
the shocks in a second stage, focusing most of the initial effort on
empirically sound models of the conditional mean and variance
o Bollerslev and Wooldridge (1992) performed extensive simulations to
show that for symmetric departures from normality, QMLE is close to
the exact MLE, in terms of resulting estimates
o However, Engle and Gonzalez-Rivera (1991) note that for
nonsymmetric distributions, the loss in efficiency is large
There is one case in which QMLE is useful although our problem is
not the correct specification of the joint density of the shocks, i.e.,
when we have a precise idea on the parametric structure of the PDF
This occurs when estimation of some vector of parameters θ is split
up in a number of sequential estimation stages
o E.g., if θ = (θ1′, θ2′)′, one would first estimate by regular, fully specified MLE the first sub-vector θ1
o Then, conditional on the estimate of θ1, estimate—again, at least in principle by full MLE—θ2
Lecture 7: Advanced Volatility Modelling – Prof. Guidolin 36
Quasi MLE and Sequential ML Estimation
o Why would we do that? Because of enhanced tractability
o In other occasions, to avoid numerical optimization altogether
o For instance, estimate a Gaussian AR(1)/ARCH(1) by OLS, recalling from 20191 that under the assumptions of the classical linear regression model, OLS = MLE
o Use a daily sample of 1963-2016 excess stock return data
o By MLE, we obtain the estimates reported in the slide
o Let’s try a different route: why not obtain the estimated OLS residuals from a simple AR(1) regression, and then separately estimate ω and α from the maximization of a likelihood in which the OLS residuals are considered as if they were data (a sketch of this two-step route follows below)
o Crucially, under the assumption of normally distributed errors in the original AR(1) model, we have that the OLS estimates of 𝜙0 and 𝜙1 are also ML estimates
Lecture 7: Advanced Volatility Modelling – Prof. Guidolin 37
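A minimal Python sketch of the two-step (sequential, QML-type) route just described—OLS for the AR(1) mean, then Gaussian ML for ARCH(1) treating the OLS residuals as data; the simulated series and starting values are purely illustrative:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
r = rng.standard_normal(2000)                       # placeholder for excess returns

# Step 1: OLS estimation of the AR(1) conditional mean (equal to MLE under CLRM assumptions)
X = np.column_stack([np.ones(len(r) - 1), r[:-1]])
phi_hat, *_ = np.linalg.lstsq(X, r[1:], rcond=None)
eps = r[1:] - X @ phi_hat                           # OLS residuals, treated as data below

# Step 2: Gaussian ML estimation of ARCH(1) applied to the OLS residuals
def neg_loglik_arch1(params, eps):
    omega, alpha = params
    if omega <= 0 or alpha < 0 or alpha >= 1:
        return 1e10                                 # crude penalty for inadmissible values
    sigma2 = np.empty_like(eps)
    sigma2[0] = eps.var()
    sigma2[1:] = omega + alpha * eps[:-1] ** 2
    return 0.5 * np.sum(np.log(2 * np.pi) + np.log(sigma2) + eps ** 2 / sigma2)

res = minimize(neg_loglik_arch1, x0=np.array([0.5, 0.1]), args=(eps,), method="Nelder-Mead")
print("phi_hat:", phi_hat, "  (omega, alpha):", res.x)
```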

Quasi MLE and Sequential ML Estimation


o If we apply the two-step QML procedure in E-Views, we obtain the estimates reported in the slide
o The estimates are rather different, especially those of the AR(1) model
o Although not all p-values reveal it, the standard errors have grown
o OLS estimation of GARCH models should be avoided in favor of MLE
Waves of partial ML estimators that may, on the surface, fully exploit the assumed model structure will not deliver the optimal statistical properties and characterization of the MLE
o A sequential ML-based estimator may be characterized as a QMLE and will be subject to the limitations of QMLE: loss of efficiency
o This is due to the fact that, when we split θ into θ1 and θ2 to estimate them separately, this very separation implies a zero covariance between the two blocks of estimates
o Such a zero covariance may be at odds with the data
o E.g., this is the case in our earlier example
Under variance targeting, what seems MLE is in fact QMLE
Lecture 7: Advanced Volatility Modelling – Prof. Guidolin 38
MSc. Finance/CLEFIN
2021/2022 Edition

FINANCIAL ECONOMETRICS AND EMPIRICAL FINANCE


- MODULE 2
Mock Exam
Time Allowed: 1 hour
This exam is closed book, closed notes, subject to the Honor Code and proctored
through Respondus Monitor. In case you believe it is needed, you can use the
calculator that is available within the Respondus framework. You cannot use your
personal calculator or your smartphone. You must avoid using scratch paper, as
this will be interpreted as a suspicious behaviour by the software and expose you
to risks of disciplinary actions.
You always need to carefully justify your answers. However, no algebra or
formulas are required in order to achieve the maximum score in all questions. In
case you feel that you need an assumption in order to answer a question,
please do so and proceed. The necessity and quality of the assumption(s) will be
assessed together with the answer.
Note: the scores of the questions sum to 27 to accommodate the 4 points (maximum) deriving from the (optional) homeworks; however, because the homeworks were optional, in case the max rule favours a student—i.e., when re-scaling the exam score to 31 gives a higher grade—the re-scaled grade will prevail (see the syllabus for the max rule formula and details).

Question 1.A (9.5 points)


Consider the following VMA(∞) representation of a VAR(1) model:

𝒚𝒚𝑡𝑡 = 𝝁𝝁 + � 𝚯𝚯𝑖𝑖 𝒖𝒖𝑡𝑡−𝑖𝑖 + 𝒖𝒖𝑡𝑡


𝑖𝑖=1
where 𝚯𝚯𝑖𝑖 = 𝑨𝑨1𝑖𝑖 and 𝑨𝑨1 is the matrix of the coefficients of the reduced form VAR(1). Can we
interpret the coefficients 𝚯𝚯𝑖𝑖 as impact multipliers of the true, structural innovations? If not,
carefully explain why and discuss whether and under what conditions it is possible to retrieve
the impact multipliers of the structural innovations from the OLS estimates of a VAR in its
reduced form. Finally, explain what it means that a Cholesky decomposition imposes “an
ordering” upon the variables and why this may turn out to be a disadvantage. (Note: No algebra
is required in order to achieve the maximum score in this question; to refer to elements of any
equations you may know of, you can use their names instead of their symbols, e.g., 𝚯𝚯 may be
referred to as the vector of theta coefficients. Examples may be useful to corroborate your
explanations).

1
Debriefing.
It is not possible to interpret the theta coefficients as impact multipliers of the true, structural
innovations, as they are functions of A1, which is the matrix of the coefficients of the reduced
form VAR (estimated via OLS). Consequently, the theta coefficients are the impact multipliers
of the error terms of the reduced form model, which are a composite of the true, structural and
uncorrelated innovations to the series. In general, it is not possible to retrieve the impact
multipliers of the structural innovations because a VAR in its reduced form is under-identified,
i.e., the OLS estimated parameters are less than the parameters in the primitive system. For
instance, a bivariate VAR(1) will have 10 structural parameters (six coefficients, two intercepts, and two variances) while its reduced form only contains 9 parameters (four coefficients, two intercepts, two variances, and one covariance). Therefore, in order to recover the primitive parameters, we should be ready to impose restrictions (specifically, we need to impose (N^2 - N)/2 restrictions). A popular method to impose restrictions consists of the application of a Cholesky decomposition, which enforces (N^2 - N)/2 contemporaneous coefficients in the primitive system to be equal to zero. More specifically, the Cholesky decomposition enforces a
triangular identification scheme: the first variable in the system is allowed to
contemporaneously impact all the others, the second variable in the system influences all the
variables but the first one, and so on. For this reason, we say that the Cholesky decomposition
imposes an ordering on the variables. This is also a major drawback of the Cholesky
decomposition, as the values of the impact multipliers (and therefore the IRFs) depend on the
ordering of the variables, so that the conclusions that are drawn may change if we place the
variables in a different order. This is a problem especially when the ordering is not based on
any sensible economic assumption.

2
Question 1. B (4 points)
Mr. Earl, a senior analyst at Vicod’s Hedge Fund, wants to estimate a VAR model for the 1-month,
1-, 5-, and 10-year Treasury yields and he is wondering which lag order he should be selecting.
Therefore, he conducts a full specification search and produces the table below. Why is it a good
practice to look at the information criteria? Which is the model selected by each of the three
information criteria (AIC, SC, HQ)? Do they all lead to the selection of the same model and, if
not, is this plausible?
VAR Lag Order Selection Criteria
Endogenous variables: ONEMONTH ONEYEAR FIVEYEARS TENYEARS
Exogenous variables: C
Date: 10/11/18 Time: 17:43
Sample: 1/05/1990 12/30/2016
Included observations: 1395

Lag LogL LR AIC SC HQ

0 -4984.879 NA 7.152514 7.167541 7.158133


1 6162.080 22214.01 -8.805849 -8.730715 -8.777757
2 6264.603 203.7234 -8.929897 -8.794654 -8.879331
3 6284.063 38.55692 -8.934857 -8.739507 -8.861818
4 6299.304 30.11002 -8.933769 -8.678310 -8.838256
5 6354.346 108.4274 -8.989743 -8.674177 -8.871756
6 6375.059 40.68298 -8.996500 -8.620826 -8.856039
7 6390.868 30.96155 -8.996226 -8.560445 -8.833292
8 6406.572 30.66431 -8.995802 -8.499912 -8.810394
9 6419.565 25.29749 -8.991491 -8.435494 -8.783610
10 6443.217 45.91304 -9.002461 -8.386356 -8.772106
11 6460.887 34.20005 -9.004855 -8.328643 -8.752027
12 6473.561 24.45873 -9.000088 -8.263767 -8.724785
13 6490.912 33.38322 -9.002024 -8.205596 -8.704248
14 6504.644 26.34175 -8.998773 -8.142236 -8.678523

Debriefing
It is a good practice to look at information criteria (instead of comparing, for instance, the
squared residuals from different models) because they trade off the goodness of (in-sample) fit
and the parsimony of the model, thus avoiding the overfitting problem, i.e., the selection of
large-scale models that display a good in sample fit, but often have a poor forecasting
performance out-of-sample. The best model according to each information criterion is the
model that minimizes the criterion. Therefore, the AIC selects a VAR(11) model while both the
SC and the HQ criterion select a more parsimonious VAR(2) model. It is perfectly possible that
the three information criteria lead to the selection of different models. Indeed, although they
all contain a first term which is a function of the squared residuals, each of the criteria imposes a
different penalty for the loss of degrees of freedom from the presence of a given number of
parameters in the model. In particular, the SC is the criterion that imposes the strongest penalty
for each additional parameter that is included in the model, while the AIC is the one with the
smallest penalization for additional parameters (HQ falls between the other two).

Question 2.A (10 points) Consider the family of GARCH(p, q) models for the variance of asset
returns. Define the persistence index and discuss what is the role that such index plays in
establishing the covariance stationarity of a GARCH model. Consider two alternative GARCH
models for the same series of returns characterized by identical persistence index, but (i) the

3
𝑝𝑝 𝑞𝑞
first model characterized by a large ∑𝑖𝑖=1 𝛼𝛼𝑖𝑖 and a small ∑𝑗𝑗=1 𝛽𝛽𝑗𝑗 ; (ii) the second model
𝑝𝑝 𝑞𝑞
characterized by a small ∑𝑖𝑖=1 𝛼𝛼𝑖𝑖 and a large ∑𝑗𝑗=1 𝛽𝛽𝑗𝑗 . What do you expect that the differences
between the filtered, one-step ahead predicted variances from the two models will look like?
Consider the two cases that follow:
You are a risk manager and you are considering calculating daily value-at-risk on the
basis of a Gaussian homoskedastic model: is the mistake you are about to make larger
under model (i) or model (ii)? Carefully explain why.
You write and sell long-term options written on the underlying asset that you price using
a pricing tool that accounts for time varying volatility under GARCH: in the presence of
large return shocks, will the mispricing be larger under model (i) or under model (ii)?
(Note: Remember, you are not required to use any formulas or equation in your replies).

Debriefing:
In a standard GARCH(p, q) model, the persistence index is defined as the sum of all alpha and
beta coefficients. Such an index provides information on the time required for the effects of a
shock to returns to decay to zero in terms of its effects in the prediction of the variance or,
equivalently, how much memory (in a ARMA sense) the GARCH model displays. Given the
overall persistence index of a model, when—the case of model (i)— the sum of all the alpha
coefficients is large while the sum of all beta coefficients is small, this means that the GARCH
will be characterized by a high reactivity of variance forecasts to the most recent p shocks to
returns (squared) and by modest memory of the past q variance forecasts: this gives a sample
path of conditional variance predictions that tends to be jagged and rather spiky. When instead
the sum of all alpha coefficients is small while the sum of all beta coefficients is large, i.e., model
(ii), this means that the GARCH will be characterized by a modest reactivity of variance
forecasts to the most recent p shocks to returns (squared) and by considerable memory for the
past q variance forecasts: this gives a sample path of conditional variance predictions that tends
to be smooth but to swing up and down in dependence of strings of squared shocks that move
predicted variance above or below the ergodic, long-run variance.
As for the two cases, a risk manager engaged in estimating daily VaR under a model that
incorrectly sets p = q = 0 (basically, no GARCH), will incur larger mistakes under (a true)
model (i), when the sum of all alpha coefficients is large while the sum of all beta coefficients is
small, because the risk officer will ignore the short-lived predicted variance spikes implied by
a large sum of all alpha coefficients; in fact, a homoscedastic model is somewhat close to case
(ii), representing a situation of extreme smoothness, when the variance is constant. A long-term
option trader will instead make the same mistakes by using a homoscedastic Gaussian model
(also known as Black-Scholes) vs. either model (i) or (ii): the reason is that in the pricing of
long-term contracts, what really matters is not either the sum of all alpha or of all beta
coefficients, but their overall sum, i.e., the very persistence index.

4
Question 2.B (3.5 points) Bruno Cerelli, an analyst at Unresponsive & So., has estimated a
Gaussian GARCH(1,1) model for FTSE MIB stock index returns and found that α̂ + β̂ = 1; upon testing, he has not been able to reject the null hypothesis that α + β = 1. Therefore he has concluded that, because the condition α + β < 1 is violated, the GARCH model is non-stationary, so that a time-invariant unconditional distribution for the FTSE MIB returns does not exist and one cannot learn from past data to forecast future returns. A colleague of his, Vari Keen, has objected that this implication is unwarranted and forecasting is indeed possible, even though a GARCH with α + β = 1 implies that variance follows a random walk (with drift) process, so that the time t estimated variance is (in a mean-squared error sense) the best forecast for the variance at time t + 1, t + 2, …, t + H for all H ≥ 1. Which one of the two analysts at Unresponsive & So. is correct and why? Would you ever advise using such a special GARCH(1,1) model with α + β = 1 to obtain long-run forecasts? Make sure to carefully explain your answer.

Debriefing:
Vari Keen is right in her claims: when α + β = 1 in a GARCH(1,1) model, we are truly facing an IGARCH(1,1), of which the RiskMetrics model is a special case. A RiskMetrics model corresponds to a random walk (possibly with drift, in the case of IGARCH), which implies that forecasting is possible but also characterized by the odd property that the time t estimated variance is the best forecast for the variance at time t + 1, t + 2, …, t + H for all H ≥ 1. However, IGARCH/RiskMetrics is not covariance stationary but it is strictly stationary, which allows us to produce forecasts at least in general terms, even though how advisable it may be to produce H-step-ahead forecasts that correspond to time t, short-term estimates of the variance remains questionable. Moreover, Vari Keen should place her focus on how/whether the conditions (that we do not state, as they are rather complex) that ensure strict stationarity of the IGARCH model are satisfied, a matter on which we have had no specific information (but the book shows that a sufficient condition is satisfied).

5
MSc. Finance/CLEFIN
2017/2018 Edition

FINANCIAL ECONOMETRICS AND EMPIRICAL FINANCE


- MODULE 2
Mock Question 1 (total 16 points, out of 50 from 3 questions)
Time Advised: 22 minutes (for this question)

Question 1.A (13 points)


Write in formal terms an AR(p) model with p ≥ 1, making sure to explain what each term
represents and whether each term is an observable random variable, a latent shock, or a
parameter; also explain the economic intuition for the model, if any. What does it mean, both in
logical and in statistical terms, that an AR(p) time series process is stationary? Assuming
stationarity, make sure to discuss what the relevant population moments of the process are,
also providing a few examples of the corresponding closed-form formulas.

Debriefing:
We expect all sub-questions to be answered but within a well-organized, 12-15 line long reply
that will need to fit in the (generous) space provided.

In class we have expressed some doubts as to a rational, efficient-markets hypothesis related


explanation for the meaning of AR(p) models in finance, and that represents the expected
answer to that part of the question.

1
2
Question 1.B (2 points)
Using the lag operator L, write an AR(2) process in “lag operator-polynomial” form and discuss
how would you go about testing whether the process is stable and hence stationary. Will the
resulting stationarity, if verified, be strong or weak? Make sure to explain your reasoning.

Debriefing:
Because, if the series is stationary, Wold’s decomposition applies and the process is linear; as such, weak and strong stationarity are equivalent.
See also material copied below.

3
Question 1.C (1 point)
Consider the following data and the corresponding sample ACF:

[Figure: a simulated time series of 1,000 observations and its sample autocorrelation function up to lag 24, which displays a cyclical but fading pattern between roughly -0.4 and 0.5]
4
What is the most likely type of ARMA(p, q) process that may have originated this SACF? What
other type of information would you be needing in order to make sure of your answer? Make
sure to carefully justify your arguments.

Debriefing: The series and the corresponding SACF were generated from 1,000 simulations
from an AR(2) process with the following features:

As seen in the lectures (lecture 3, slide 12), this cyclical but fading pattern characterizes
stationary AR(2) processes with coefficients of opposite signs. However, one could be really
“positive” this SACF comes from an AR(2) only after verifying that the SPACF has the typical
behavior of PACF for stationary AR(2) processes: two values statistically significant, followed
by no other significant values. Of course, one cannot detect the generating process just on the
basis of the SACF (also the confidence intervals were not given), but she could speculate on its
likely nature.

5
MSc. Finance/CLEFIN
2017/2018 Edition

FINANCIAL ECONOMETRICS AND EMPIRICAL FINANCE


- MODULE 2
Mock Question 2 (total 17 points, out of 50 from 3 questions)
Time Advised: 24 minutes (for this question)

Question 2.A (14 points)


Consider a bivariate VAR(2) model for S&P 500 returns and the log changes in the VIX volatility
index (𝑅𝑅𝑡𝑡𝑆𝑆&𝑃𝑃 and ∆𝑙𝑙𝑙𝑙𝑉𝑉𝑉𝑉𝑉𝑉𝑡𝑡 ). Write:
The structural, unconstrained VAR(2) that includes contemporaneous effects across the
two markets.
The associated unconstrained reduced-form VAR(2).
Explain through which steps it is possible to transform the structural VAR model into the
reduced-form one (algebra is not required, unless it helps you provide an efficient answer). How
would/could you estimate the structural VAR? How would/could you estimate the reduced-
form model? Explain what are the issues/limitations caused by the transformation of a
structural VAR into a reduced-form model.

Debriefing:

1
2
Question 2.B (2 points)
Suppose that the bivariate structural VAR(2) is to be exactly identified by imposing either of the
two possible Choleski triangularization schemes:
$\mathbf{B}' = \begin{pmatrix} 1 & 0 \\ b_{21} & 1 \end{pmatrix}$ and $\mathbf{B}'' = \begin{pmatrix} 1 & b_{12} \\ 0 & 1 \end{pmatrix}$
Carefully explain the implications and differences in economic interpretations of the estimated,
corresponding reduced-form model deriving from imposing the restriction in 𝑩𝑩′ instead of 𝑩𝑩′′.
How does your answer change when the restriction
$\mathbf{B}''' = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}$
is imposed instead?

Debriefing:
Trivially, B''' implies that the original structural model is in reduced form or, alternatively, that the model has been over-identified by removing all contemporaneous effects between the variables.
B' implies that S&P 500 returns are ordered before VIX log-changes, so that any u_t^{S&P} shock to the S&P 500 is structural and primitive, while the u_t^{VIX} shocks are correlated with both structural shocks to the S&P 500 and the VIX.
On the contrary, B'' implies that log changes in the VIX are ordered before the S&P 500, so that any u_t^{VIX} shock to the VIX is structural and primitive, while the u_t^{S&P} shocks are correlated with both structural shocks to the S&P 500 and the VIX.

3
Question 2.C (1 point)
Suppose that the estimation of a constrained, reduced-form VAR(2) has provided the following
ML estimates of the conditional mean function and of the covariance matrix of the reduced-form
shocks (p-values are in parentheses):
\begin{aligned}
R_t^{S\&P} &= \underset{(0.044)}{0.006} + \underset{(0.093)}{0.053}\,R_{t-1}^{S\&P} - \underset{(0.003)}{0.473}\,\Delta \ln VIX_{t-1} + \underset{(0.045)}{0.113}\,\Delta \ln VIX_{t-2} + u_t^{S\&P} \\
\Delta \ln VIX_t &= -\underset{(0.149)}{0.194} - \underset{(0.024)}{0.375}\,R_{t-1}^{S\&P} + \underset{(0.050)}{0.094}\,R_{t-2}^{S\&P} + \underset{(0.000)}{0.804}\,\Delta \ln VIX_{t-1} + u_t^{VIX} \\
Var\!\begin{pmatrix} u_t^{S\&P} \\ u_t^{VIX} \end{pmatrix} &=
\begin{pmatrix} \underset{(0.000)}{0.008} & \underset{(0.007)}{-0.016} \\ \underset{(0.007)}{-0.016} & \underset{(0.000)}{0.014} \end{pmatrix}
\end{aligned}
You would like to recover the original structural parameters, including the contemporaneous,
average impact of both VIX changes on S&P 500 returns and vice-versa. Is there a chance that
this may be possible even though you are not ready to impose a Choleski ordering on the two
variables?

Debriefing: One cannot say for sure but the evidence shown has two implications:
The estimated reduced-form VAR(2) carries restrictions and in fact estimation has been
properly performed by MLE applied to the bivariate system.
There are two restrictions that have been imposed, setting the coefficient of R_{t-2}^{S&P} to zero in the first equation and the coefficient of Δln VIX_{t-2} to zero in the second equation;
indeed note that ML estimation has been performed, because it is likely that the reduced-

4
form VAR will include restrictions.
Now, we can only speculate that such restrictions derive from restrictions that have been
imposed on the matrix Γ_2 in the structural representation of the model,
$$\mathbf{B}\begin{pmatrix} R_t^{S\&P} \\ \Delta \ln VIX_t \end{pmatrix} = \mathbf{\Gamma}_0 + \mathbf{\Gamma}_1 \begin{pmatrix} R_{t-1}^{S\&P} \\ \Delta \ln VIX_{t-1} \end{pmatrix} + \mathbf{\Gamma}_2 \begin{pmatrix} R_{t-2}^{S\&P} \\ \Delta \ln VIX_{t-2} \end{pmatrix} + \begin{pmatrix} \epsilon_t^{S\&P} \\ \epsilon_t^{VIX} \end{pmatrix},$$
in the sense that $\mathbf{\Gamma}_2 = \begin{pmatrix} 0 & \gamma^{2}_{12} \\ \gamma^{2}_{21} & 0 \end{pmatrix}$.
However, we know that the exact identification of a bivariate structural VAR requires imposing (2² − 2)/2 = 1 restriction and that such constraints do not necessarily have to be imposed on the matrix of contemporaneous effects B. Because two such restrictions seem to have been imposed on Γ_2, yes, there is a chance for the structural model—in particular for the two coefficients measuring the contemporaneous, average impact of both VIX changes on S&P 500 returns and vice-versa—to be identified (probably, over-identified), even though no Choleski triangularization has affected B (in fact, no restrictions at all have been imposed on B).

5
MSc. Finance/CLEFIN
2017/2018 Edition

FINANCIAL ECONOMETRICS AND EMPIRICAL FINANCE


- MODULE 2
Mock Question 3 (total 17 points, out of 50 from 3 questions)
Time Advised: 24 minutes (for this question)

Question 3.A (13 points)


Define a stochastic trend and indicate what is the relationship between a stochastic trend and a
random walk, with and without drift, for the special case of a I(1) process. For this case,
comment on (or show, as you deem most appropriate) the stationarity or lack thereof of a
random walk and explain why this may represent a problem in empirical work. Indicate how
would you proceed to make an I(d) time series, {y_t}, with d ≥ 2, stationary. Would the choice of considering {y_t − y_{t−d}} instead of {y_t} be an appropriate one? Make sure to carefully explain
your answers.

Debriefing:

1
2
The answer to the last sub-point is negative, because we know that Δ^d y_t ≠ y_t − y_{t−d}: Δ^d y_t consists of taking d successive differences of the series under consideration, i.e., Δ² y_t = Δ(Δy_t), Δ³ y_t = Δ(Δ² y_t), …

Question 3.B (2.5 points)


An analyst at Charles Thomas and Associates has just downloaded the following series of data
on the quarterly US real GDP (in constant dollars, expressed as 2009 billions).

He has proposed to make this series stationary by first fitting (by simple OLS) a quadratic
function of time (shown as a dashed red line in the picture) and then replacing the time series of real GDP with the OLS residuals from such a quadratic trend regression. What are the risks that the analyst is exposing himself and his firm to by adopting this simple procedure?

Debriefing:
As commented in the lectures, such a procedure does not really make an obviously trending, non-stationary series stationary when the series contains a stochastic (unit root) trend, which should be tested for as a first order of business. The risk is that, because the adopted method is ineffective, the residuals will then be treated as I(0) while they are in fact I(1) or worse.

3
Question 3.C (1.5 points)
You know that a time series {y_t} was originally suspected to be I(d) with d ≥ 1. A fellow quant analyst, Ms. Maria Delas, has then transformed it by differencing it three times, in an attempt to make it stationary, and has delivered the series to you. Upon your own analysis, you determine that the series now contains 2 unit roots in its MA component (i.e., the residuals need to be differenced twice for them to be “well-behaved”, which we may have called invertible). What do you know about the d characterizing the original series?

Debriefing:
The original series was I(1): differencing it three times—more than what is needed—“messes up” its stochastic structure, by creating two unit roots in its MA component. In short, if d − 3 = −2, then it must have been that d = 1.

4
MSc. Finance/CLEFIN
2020/2021 Edition

FINANCIAL ECONOMETRICS AND EMPIRICAL FINANCE


- MODULE 2
Exam May 12, 2021
Time Allowed: 1 hour

This exam is closed book, closed notes, subject to the Honor Code and proctored
through Respondus Monitor. In case you believe it is needed, you can use the
calculator that is available within the Respondus framework. You cannot use
your personal calculator or your smartphone. You must avoid using scratch
paper, as this will be interpreted as a suspicious behaviour by the software and
expose you to risks of disciplinary actions.
You always need to carefully justify your answers. However, no algebra or
formulas are required in order to achieve the maximum score in all questions. In
case you feel that you need an assumption in order to answer a question,
please do so and proceed. The necessity and quality of the assumption(s) will be
assessed together with the answer.
Note: the scores of the questions sum to 27 to accommodate the 4 points (maximum) deriving from the (optional) homeworks; however, because the homeworks were optional, in case the max rule favours a student—i.e., when re-scaling the exam score to 31 gives a higher grade—the re-scaled grade will prevail (see the syllabus for the max rule formula and details).

Question 1.A (10 points)


Provide a definition of cointegration of order (d, b) among N time series and discuss how it
relates to the existence of a long-run equilibrium relating the N variables. Suppose you have N
time series available: how can you test whether they are cointegrated? What are the
drawbacks of a univariate test for cointegration? How many cointegrating relationships
among the N variables do potentially exist? (NOTE: it is not required that you use formulas in
order to obtain the maximum score in this question; a minimal use of symbols is supported by
the text editor embedded in the system, but the use of words in place of symbols is also
accepted, e.g., 𝛽𝛽 can be written as beta, 𝑦𝑦𝑡𝑡 can be written as y_t and so on).

Debriefing. Given an N-dimensional random vector y_t, its components y_{1,t}, …, y_{N,t} are said to be cointegrated of order (d, b) if they are integrated of order d and there exists at least one vector κ (the cointegrating vector) such that κ′y_t is integrated of order d − b (with d − b larger than or equal to zero and b strictly greater than zero). A cointegrating vector κ represents
one (of the potentially many) long-run equilibrium relationships among the variables, i.e., the
variables cannot arbitrarily wander away from each other.
If y_t contains N nonstationary components, there may be as many as (N − 1) linearly independent cointegrating vectors. The number of cointegrating vectors is called the cointegrating rank of y_t. There are two ways to test for cointegration: univariate, regression-
based procedures (Engle and Granger’s test) and multivariate, VECM-based tests (Johansen’s
tests). In the case of Engle and Granger’s procedure, the method consists of regressing one
series on all the others (after having determined that they are both integrated of the same
order) and to test the residuals of such a regression for the presence of unit root (e.g., using an
ADF test). If the residuals are stationary, this will prove that the series are cointegrated. This
procedure has two main drawbacks:
• the Engle-Granger procedure relies on a two-step estimator, in which the first step
regression residuals are used in the second step to estimate an ADF (or PP)-type
regression, which causes errors and contamination deriving from a generated
regressors problem;
• Engle and Granger’s method has no systematic procedure to perform the separate
estimation of multiple cointegrating vectors, in the sense that the approach suffers from
an inherent indeterminacy, as it is unclear which I(1) variables ought to be used as a
dependent variable and which ones as regressors; moreover, in principle, the outcomes
from Engle-Granger’s test may turn out to depend on exactly such choices of dependent
vs. explanatory variables in the testing regressions.
Therefore, in the case of 𝑁𝑁 > 2, the use of Johansen’s procedure is strongly advised in order to
test whether a number N of series are cointegrated (and how many cointegrating vectors
exist). With a Johansen’s test, one estimates the vector error correction VAR(p) (in difference)
model best suited to the data (e.g., on the basis of appropriate information criteria) and
proceeds to apply tests concerning the eigenvalues of a particular matrix (“pi”) to ascertain
the rank of such a matrix. There are three possible outcomes concerning the rank of the
matrix “pi”: (i) if the matrix “pi” has rank equal to zero, then all the N variables in the VECM
contain a unit root and they are not cointegrated; (ii) if the matrix “pi” has a rank r such that 0 < r < N, then all the N variables in the VECM contain a unit root and they are characterized by r cointegrating relationships; (iii) if the matrix “pi” has full rank, all the variables in the VECM are stationary. Johansen proposed two different tests (based on the eigenvalues of the matrix “pi”, given that the rank of an N × N matrix is equal to N minus the number of eigenvalues that are equal to zero) to ascertain the rank of “pi”: the trace test, which tests the null that the number of distinct cointegrating vectors is less than or equal to r against a general alternative of a number exceeding r; and the max eigenvalue test, which tests the null that the number of cointegrating vectors is r against the alternative of r + 1 cointegrating vectors.
If the test statistic is greater than the appropriate critical value, we will reject:
• the null hypothesis that there are at most r cointegrating vectors in favor of the alternative that there are more than r, in the case of the trace test, or
• the null hypothesis that there are exactly r cointegrating vectors in favor of the alternative that there are r + 1 vectors, in the max eigenvalue test.

Question 1. B (3.5 points)


Osman Juice, an analyst at Happy Hedge Fund, has estimated the following regression:
y_t = β_0 + β_1 x_t + ε_t.
He has then performed an ADF test on the residuals of the regression and has discovered that
they contain a unit root (while their first difference is stationary). While he does not know
anything about the order of integration of the series x_t, he had previously determined that y_t is I(2). What conclusion (if any) can he draw about the order of integration of x_t and on the
existence of a cointegrating relationship between the two series? Should he trust the results of
this regression? If not, what would you advise him to do? Carefully justify your answer.

Debriefing. The errors of the regression are a weighted sum of two random variables and therefore their order of integration should be equal to the maximum of the orders of integration of the variables, unless the variables are cointegrated (of some order). This means that, because the order of integration of y_t is 2, the two variables need to be cointegrated; otherwise the errors of the regression would be I(2). However, because the residuals are still I(1) (i.e., they are non-stationary), the variables are CI(2,1). This means that the order of integration of x_t should be 2 as well, as variables need to be integrated of the same order to be cointegrated. Although the variables are CI(2,1), the results of the regression should not be trusted because the errors are I(1) (so that any shock has a permanent effect on the regression). Juice should take the first difference of both variables and then regress one on the other, if he needs to insist on carrying on his analysis.

Question 2.A (9.5 points)


Describe (in qualitative terms) the structure of a component GARCH model and in what ways
may such a model differ from a standard GARCH model of the same order. What ensures that a
component GARCH model may be covariance stationary? Thinking about the case of a
component GARCH(1,1), what is its relationship with higher order standard GARCH(p, q)
models? Can component GARCH models be generalized to include asymmetric effects? Finally,
make sure to explain the economic purpose/meaning of component GARCH models. (NOTE: it
is not required that you use formulas in order to obtain the maximum score in this question; a
minimal use of symbols is supported by the text editor embedded in the system, but the use of
words in place of symbols is also accepted, e.g., 𝛽𝛽 can be written as beta, 𝑦𝑦𝑡𝑡 can be written as
y_t and so on).

Debriefing.
Engle and Lee (1999) have generalized the classical GARCH model (with and without threshold effects) to component GARCH (C-GARCH) models. Differently from standard GARCH models, which fluctuate around a constant long-run variance, C-GARCH allows mean reversion to a varying level or—equivalently—incorporates a distinction between transitory and permanent conditional variance dynamics. However, the long-term, permanent variance is itself stationary and converges to some constant long-run level. In essence, a C-GARCH is a two-layer CH model in which the short-term (also known as temporary) variance follows a GARCH that fluctuates around a randomly varying permanent conditional variance component which, in its turn, follows an AR model (often with an interaction between the short-term ARMA(1,1) model and the AR(1) permanent component).
Interestingly, one can combine the transitory and permanent components to write a C-GARCH(1,1) model as a restricted, classical GARCH(2,2) model in which the parameters (a total of 5) are restricted to depend (in a non-linear fashion) on the 5 parameters that characterized the original C-GARCH(1,1) model. This is meaningful because it illustrates that, empirically, a C-GARCH model not only carries a meaningful power to separate the temporary and permanent components (see below for comments), but also has an enhanced power to fit the long memory that squared and absolute residuals from a variety of models for financial returns tend to display. In fact, higher-order GARCH models are rarely used in practice, and the GARCH(2,2) case deriving from a C-GARCH(1,1) represents one of the few cases in which—even though it is subject to constraints coming from the structure of the C-GARCH—implicitly, through the component GARCH case, a (2,2) model has been used in many practical applications. Even though by construction the corresponding GARCH(2,2) always gives positive variance components, its coefficients need not all be positive, which is a powerful proof of the fact that setting all coefficients to be non-negative in a GARCH(p, q) is certainly sufficient but not necessary to guarantee that the resulting variance forecast be positive at all times.
In the literature, we have also had variations of C-GARCH(p, d, q) models that include leverage-type asymmetric effects in the short-run conditional variance equation. This model combines the component model with the asymmetric threshold GARCH model, introducing asymmetric effects in the transitory equation. Even though they may not always represent the best-fitting models, C-GARCH models may have a crucial asset pricing justification: it may often be optimal that the fair price of certain securities (think of long-term options, sometimes called leaps) and a few types of financial decisions (such as portfolio weights for the long run) not be made to depend on short-run spikes and alterations in predicted conditional variance.

4
Question 2.B (4 points)
Geremia Morgano is a quant analyst at Randomshoot Inc. Geremia has estimated by MLE,
under normality a component GARCH(1,1) for a return series, obtaining the outputs that
follow.
Sample: 1975M01 2020M12
Included observations: 552
Convergence achieved after 29 iterations
Presample variance: backcast (parameter = 0.7)
Q = C(2) + C(3)*(Q(-1) - C(2)) + C(4)*(RESID(-1)^2 - GARCH(-1))
GARCH = Q + C(5) * (RESID(-1)^2 - Q(-1)) + C(6)*(GARCH(-1) - Q(-1))

Variable Coeffici... Std. Error z-Statistic Prob.

C 1.193457 0.273234 4.367889 0.0000

Variance Equation

C(2) 95.15398 46.10993 2.063633 0.0391


C(3) 0.971678 0.019781 49.12141 0.0000
C(4) 0.181394 0.029895 6.067784 0.0000
C(5) -0.089096 0.039014 -2.283729 0.0224
C(6) 0.190395 0.450726 0.422418 0.6727

R-squared -0.001459 Mean dependent var 1.491830


Adjusted R-squared -0.001459 S.D. dependent var 7.819156
S.E. of regression 7.824857 Akaike info criterion 6.838580
Sum squared resid 33736.84 Schwarz criterion 6.885466
Log likelihood -1881.448 Hannan-Quinn criter. 6.856900
Durbin-Watson stat 1.843526

5
The sample correlogram refers to squared standardized residuals derived from the C-
GARCH(1,1) model. Do you think that Geremia may use the estimated C-GARCH to forecast
variance, price options, compute risk measures, etc.? Make sure to justify your answer.
Geremia has also proceeded to estimate a GARCH(2,2) model deriving the following outputs:
Sample: 1975M01 2020M12
Included observations: 552
Convergence achieved after 24 iterations
Coefficient covariance computed using outer product of gradients
Presample variance: backcast (parameter = 0.7)
GARCH = C(2) + C(3)*RESID(-1)^2 + C(4)*RESID(-2)^2 + C(5)
*GARCH(-1) + C(6)*GARCH(-2)

Variable Coeffici... Std. Error z-Statistic Prob.

C 1.193355 0.273269 4.366966 0.0000

Variance Equation

C 2.412452 1.225584 1.968410 0.0490


RESID(-1)^2 0.092934 0.041258 2.252482 0.0243
RESID(-2)^2 0.067425 0.083952 0.803139 0.4219
GARCH(-1) 0.982479 0.440309 2.231342 0.0257
GARCH(-2) -0.168067 0.358593 -0.468686 0.6393

R-squared -0.001460 Mean dependent var 1.491830


Adjusted R-squared -0.001460 S.D. dependent var 7.819156
S.E. of regression 7.824861 Akaike info criterion 6.837776
Sum squared resid 33736.87 Schwarz criterion 6.884662
Log likelihood -1881.226 Hannan-Quinn criter. 6.856096
Durbin-Watson stat 1.843524

6
Geremia claims that the second, GARCH(2,2) model is superior to the original component
GARCH(1,1) because it fails to imply a highly variable but generally small permanent
component. Do you agree with Geremia? Are you worried for the negatively estimated
coefficient obtained from the ML estimation of the GARCH(2,2) model?

Debriefing.
Geremia should refrain from using the estimated component GARCH(1,1) model because the transitory component is poorly identified and its coefficients are either not significantly estimated (at 0.190) or negative (at -0.089). As a result, the transitory component is highly variable but generally small and seems to contribute to total variance mostly through jumps, in a highly volatile manner and not in a smooth way. Consequently, almost all of the filtered variance consists of permanent variance, which is also problematic and unrealistic. Moreover, in spite of the ML estimation under normality, the empirical distribution of the standardized residuals fails to be normally distributed.
However, Geremia’s claim that the second, classical GARCH(2,2) model is superior to the original component GARCH(1,1) model is absurd. First, we know that component GARCH(1,1) models can be written as GARCH(2,2) models under restrictions. In this specific case, the restrictions can hardly be binding, because the plots of the filtered variances under the component GARCH(1,1) and the GARCH(2,2) models are basically identical (apart from very high variances, in excess of 200, i.e., the two models just differ in the spikes in variance that they forecast). Moreover, the histograms of the standardized residuals are practically identical. The two models imply the same total variance, but the ML estimation remains a reason for concern, as the models appear to be misspecified. Finally, there is some reason to be worried about the negative coefficient in the GARCH(2,2) model, as a negative coefficient was also appearing in the component GARCH(1,1) specification, which may violate sufficient conditions for positivity of the conditional variance.
