ECONOMETRICS Course
Financial Returns
Winter/Spring 2022
Overview
General goals of the course and definition of risk(s)
Predicting Densities
General goals
Not all risks can be predicted and, even when they are predictable, not all of them can be managed in asset markets
When they can be, we care about them in this course
o Operational (op) risk is defined as the risk of loss due to physical
catastrophe, technical failure, and human error in the operation of a
firm, including fraud, failure of management, and process errors
Although it should be mitigated and ideally eliminated in any firm, this
course has little to say about op risk because op risk is typically very
difficult to hedge in asset markets
• But cat bonds…
o Op risk is instead typically managed using self-insurance or third-
party insurance
o Credit risk is defined as the risk that a counterparty may become less
likely to fulfill its obligation in part or in full on the agreed upon date
o Banks spend much effort to carefully manage their credit risk exposure, while nonfinancial firms try to remove it completely
Predicting asset returns
When risk is quantifiable and manageable in asset markets,
then we shall predict the distribution of risky asset returns
o Business risk is defined as the risk that changes in variables of a
business plan will destroy that plan’s viability
• It includes quantifiable risks such as business cycle and demand equation
risks, and non-quantifiable risks such as changes in technology
o These risks are an integral part of the core business of firms
The lines between the different kinds of risk are often blurred; e.g.,
the securitization of credit risk via credit default swaps (CDS) is an
example of a credit risk becoming a market risk (price of a CDS)
How do we measure and predict risks? Studying asset returns
Because returns have much better statistical properties than price
levels, risk modeling focuses on describing the dynamics of returns
Density prediction
o Notice that D(0, 1) does not have to be a normal distribution
o Our task will consist of building and estimating models for both the
conditional variance and the conditional mean
• E.g., μ_{t+1} = φ_0 + φ_1 R_t and σ²_{t+1} = λσ²_t + (1 − λ)R²_t
o However, robust conditional mean relationships are not easy to find,
and assuming a zero mean return may be a prudent choice
In what sense do we care for predicting return distributions?
One important notion in this course distinguishes between
unconditional vs. conditional moments and/or densities
An unconditional moment or density represents the long-run,
average, “stable” properties of one or more random variables
o Example 1: E[Rt+1] = 11% means that on average, over all data, one
expects that an asset gives a return of 11%
Winter/Spring 2022
Overview
Time Series: When Can We Focus on the First Two Moments Only?
Linear Processes
(Autocovariance function)
ρ_h ≡ γ_h/γ_0 (where γ_0 is the variance) is called the autocorrelation function (ACF), for h = …, −2, −1, 1, 2, …
o Often more meaningful than the ACVF because it is expressed as pure numbers that fall in [−1, 1]
An Example of Stationary Series
[Figure: Panel (a) AR(1) simulated data; Panel (b) AR(1) simulated data vs. random walk simulated data; Panel (c) Gaussian white noise data; T = 1,000 observations in each panel]
Sample Autocorrelation Function
Stationary AR and white noise processes may sometimes be hard to
tell apart – what tools are available to identify them?
The sample ACF reflects important information about the linear
dependence of a series at different times
If {Y_t}_{t=−∞}^{+∞} is an i.i.d. process with finite variance, then, in a large sample, the estimated autocorrelations ρ̂_h will be asymptotically normally distributed with mean ρ_h and variance 1/T
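To make the asymptotic result concrete, here is a minimal Python sketch (not from the lecture) that computes the sample ACF of a simulated i.i.d. series and compares it with the approximate ±1.96/√T confidence band; the series and the lag cutoff are hypothetical.

```python
# Minimal sketch: sample ACF and its approximate 95% band under the i.i.d. null.
import numpy as np

def sample_acf(y, max_lag=20):
    """Sample autocorrelations rho_hat_1, ..., rho_hat_max_lag."""
    y = np.asarray(y, dtype=float)
    y = y - y.mean()
    gamma0 = np.dot(y, y) / len(y)                 # sample variance (gamma_0)
    return np.array([np.dot(y[h:], y[:-h]) / len(y) / gamma0
                     for h in range(1, max_lag + 1)])

T = 1_000
rng = np.random.default_rng(0)
y = rng.standard_normal(T)                         # i.i.d. Gaussian white noise
rho_hat = sample_acf(y, max_lag=20)
band = 1.96 / np.sqrt(T)                           # approximate 95% band, variance 1/T
print(np.mean(np.abs(rho_hat) > band))             # roughly 5% of lags exceed the band
```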
[EViews correlograms (AC, PAC, Q-stat, p-value) of the two simulated series. Panel (a): at lag 24, AC = 0.038, PAC = 0.046, Q-stat = 55.161, p-value = 0.000. Panel (b): at lag 24, AC = −0.037, PAC = −0.042, Q-stat = 25.038, p-value = 0.404]
Lecture 3: Autoregressive Moving
Average (ARMA) Models and their
Practical Applications
Prof. Massimo Guidolin
Winter/Spring 2021
Overview
Moving average processes
Autoregressive processes: moments and the Yule-Walker
equations
Wold’s decomposition theorem
Moments, ACFs and PACFs of AR and MA processes
Mixed ARMA(p, q) processes
Model selection: SACF and SPACF vs. information criteria
Model specification tests
Forecasting with ARMA models
A few examples of applications
Moving Average Process
Autoregressive Process
An autoregressive (henceforth AR) process of order p is a process in which the series y_t is a weighted sum of p past values of the series (y_{t−1}, y_{t−2}, …, y_{t−p}) plus a white noise error term, ε_t
o AR(p) models are simple univariate devices to capture the observed
Markovian nature of financial and macroeconomic data, i.e., the fact
that the series tends to be influenced at most by a finite number of
past values of the same series, which is often also described as the
series only having a finite memory (see below on this claim)
y_t − μ = ε_t + ψ_1 ε_{t−1} + ψ_2 ε_{t−2} + ⋯ = Σ_{i=0}^{∞} ψ_i ε_{t−i}
An autoregressive process of order p with no constant and no other predetermined, fixed terms can be expressed as an infinite-order moving average process, MA(∞), and it is therefore linear
If the process is stationary, the sum Σ_{j=0}^{∞} ψ_j ε_{t−j} will converge
The (unconditional) mean of an AR(p) model is E[y_t] = φ_0/(1 − φ_1 − ⋯ − φ_p)
o The sufficient condition for the mean of an AR(p) process to exist and be finite is that the sum of the AR coefficients is less than one in absolute value, |φ_1 + φ_2 + ⋯ + φ_p| < 1, see next
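As an illustration of the two expressions just stated, here is a minimal Python sketch (with hypothetical AR(2) coefficients) that computes the unconditional mean φ_0/(1 − Σφ_i) and checks stationarity through the roots of the AR lag polynomial.

```python
# Minimal sketch: AR(p) unconditional mean and a root-based stationarity check.
import numpy as np

phi0 = 0.5
phi = np.array([0.6, 0.2])                    # hypothetical AR(2) coefficients

# Unconditional mean: E[y] = phi_0 / (1 - phi_1 - ... - phi_p)
mean_y = phi0 / (1.0 - phi.sum())

# Roots of 1 - phi_1 z - ... - phi_p z^p; stationarity requires all |z| > 1
poly = np.concatenate(([1.0], -phi))          # coefficients in increasing powers of z
roots = np.roots(poly[::-1])                  # np.roots expects decreasing powers
is_stationary = np.all(np.abs(roots) > 1.0)
print(mean_y, roots, is_stationary)
```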
ARMA(p,q) Processes
We can also write the ARMA(p, q) process using the lag operator: (1 − φ_1L − φ_2L² − ⋯ − φ_pL^p)y_t = φ_0 + (1 + θ_1L + θ_2L² + ⋯ + θ_qL^q)ε_t
o As one would expect of an ARMA process, both the ACF and the PACF
decline geometrically: the ACF as a result of the AR part and the PACF
as a result of the MA part
o However, as the coefficient of the MA part is quite small, the PACF becomes insignificant after only two lags; instead, the AR coefficient is higher (0.7) and thus the ACF dies away only after 9 lags, and rather slowly
Model Selection: SACF and SPACF
A first strategy compares the sample ACF and PACF with the theoretical, population ACF and PACF and uses them to identify the order of the ARMA(p, q) model
[Example: SACF and SPACF of US CPI inflation]
The SBIC is the one IC that imposes the strongest penalty (lnT) for
each additional parameter that is included in the model.
The HQIC embodies a penalty that is somewhere in between the
one typical of AIC and the SBIC
o SBIC is a consistent criterion, i.e., it selects the true model asymptotically
o AIC asymptotically overestimates the order/complexity of a model with positive probability
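A minimal Python sketch of an information-criteria-based order search, assuming statsmodels is available; the simulated AR(1) series, the order grid, and the use of the SBIC/BIC as the selection rule are illustrative choices, not part of the original slides.

```python
# Minimal sketch: compare ARMA(p, q) orders via AIC, BIC (= SBIC) and HQIC.
import numpy as np
from statsmodels.tsa.arima.model import ARIMA

rng = np.random.default_rng(1)
y = np.zeros(500)
for t in range(1, 500):                     # hypothetical AR(1) with coefficient 0.7
    y[t] = 0.7 * y[t - 1] + rng.standard_normal()

results = {}
for p in range(3):
    for q in range(3):
        res = ARIMA(y, order=(p, 0, q)).fit()
        results[(p, q)] = (res.aic, res.bic, res.hqic)

best_by_bic = min(results, key=lambda k: results[k][1])   # smallest BIC wins
print(best_by_bic, results[best_by_bic])
```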
o For instance, if y_t has a joint and marginal normal pdf (which must derive from the fact that ε_t has it), then suitably standardized sample statistics λ_1 and λ_2 are N(0, 1) distributed; the null consists of a joint test that λ_1 and λ_2 are zero, H_0: λ_1 = λ_2 = 0, where λ_1² + λ_2² ~ χ²_2 as T ⟶ ∞
③ Compute sample autocorrelations of residuals and perform tests
of hypotheses to assess whether there is any linear dependence
o The same portmanteau tests based on the Q-statistic can be applied to test the null hypothesis that there is no autocorrelation at orders up to h
o For instance,
o The forecast error is
o The h-step forecast can be computed recursively, see the
textbook/class notes
For a stationary AR(p) model, ŷ_t(h) converges to the mean E[y_t] as h grows: the mean-reversion property
Forecasting with MA(q)
Because the model has a memory limited to q periods only, the
point forecasts converge to the mean quickly and they are forced to
do so when the forecast horizon exceeds q periods
o E.g., for an MA(2), ŷ_t(1) depends on both ε_t and ε_{t−1}, because both shocks have been observed and are therefore known
o Because ε_{t+1} has not yet been observed at time t, and its expectation at time t is zero, ŷ_t(2) loads only on ε_t
o By the same principle, ŷ_t(3) equals the unconditional mean, because ε_{t+3}, ε_{t+2}, and ε_{t+1} are not known at time t
By induction, the forecasts of an ARMA(p, q) model can be obtained from
ŷ_t(h) = φ_0 + Σ_{i=1}^{p} φ_i ŷ_t(h − i) + Σ_{j=1}^{q} θ_j E_t[ε_{t+h−j}]
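A minimal Python sketch of the recursion above, using the convention that ŷ_t(h − i) equals the observed y_{t+h−i} when h − i ≤ 0 and that unobserved future shocks are replaced by their zero expectation; the ARMA(1,1) numbers in the usage line are hypothetical.

```python
# Minimal sketch: recursive ARMA(p, q) point forecasts.
def arma_forecast(y, eps, phi0, phi, theta, H):
    """y[-1] = y_t, eps[-1] = eps_t; returns y_hat_t(1), ..., y_hat_t(H)."""
    p, q = len(phi), len(theta)
    y_hat = []
    for h in range(1, H + 1):
        ar_part = 0.0
        for i in range(1, p + 1):
            # use a forecast if h - i >= 1, otherwise the observed value y_{t+h-i}
            past = y_hat[h - i - 1] if h - i >= 1 else y[h - i - 1]
            ar_part += phi[i - 1] * past
        ma_part = 0.0
        for j in range(1, q + 1):
            # E_t[eps_{t+h-j}] = eps_{t+h-j} if already observed, else 0
            ma_part += theta[j - 1] * (eps[h - j - 1] if h - j <= 0 else 0.0)
        y_hat.append(phi0 + ar_part + ma_part)
    return y_hat

# Hypothetical ARMA(1,1): phi0 = 0.1, phi1 = 0.7, theta1 = 0.3
print(arma_forecast(y=[0.2, 0.4], eps=[0.05, 0.15],
                    phi0=0.1, phi=[0.7], theta=[0.3], H=3))
```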
How do we assess the forecasting accuracy of a model?
Winter/Spring 2022
Overview
Multivariate strong vs. weak stationarity
o Here ȳ is the vector of sample means
o When h = 0 we have the sample covariance matrix
o The cross-correlation matrix is R̂_h = D̂^{−1} Γ̂_h D̂^{−1}, where D̂ is the diagonal matrix that collects the sample standard deviations on its main diagonal
[Figure: cross-sample correlogram, i.e., off-diagonal elements of the cross-correlation matrices for h = 0, 1, …, 24]
o tr(A) is the trace of a matrix, simply the sum of the elements on its main diagonal
o Q(m) has an asymptotic (large-sample) χ² distribution with N²m degrees of freedom (which may be a poor approximation in small samples)
o Note that the null hypothesis corresponds to the series being (serially) uncorrelated at all orders up to m
For instance, when N = 2, y_t = [x_t z_t]′ or [R_{1,t} R_{2,t}]′, one example concerning two asset returns may be:
R_{1,t} = a_{10} + a_{11}R_{1,t−1} + a_{12}R_{2,t−1} + u_{1,t}
R_{2,t} = a_{20} + a_{21}R_{1,t−1} + a_{22}R_{2,t−1} + u_{2,t}
Vector Autoregressions: Reduced-Form vs. Structural
o Σ ≡ Var[ut] is the covariance matrix of the shocks
o When it is a full matrix, it implies that contemporaneous shocks to
one variable may produce effects on others that are not captured by
the VAR structure
If the variables included on the RHS of each equation in the VAR
are the same then the VAR is called unrestricted and OLS can be
used equation-by-equation to estimate the parameters
o This means that estimation is very simple
When the VAR includes restrictions, one should instead use estimators such as Generalized Least Squares (GLS), Seemingly Unrelated Regressions (SUR), or full-information MLE
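To illustrate the equation-by-equation OLS estimation of an unrestricted VAR described above, here is a minimal numpy sketch for a VAR(1); the simulated data and sample size are hypothetical.

```python
# Minimal sketch: equation-by-equation OLS for an unrestricted VAR(1).
import numpy as np

rng = np.random.default_rng(2)
T, N = 500, 2
Y = rng.standard_normal((T, N))                      # hypothetical stationary data

X = np.hstack([np.ones((T - 1, 1)), Y[:-1]])         # regressors: constant and y_{t-1}
coefs = np.empty((N, N + 1))
resid = np.empty((T - 1, N))
for i in range(N):                                   # same regressors in every equation
    b, *_ = np.linalg.lstsq(X, Y[1:, i], rcond=None)
    coefs[i] = b
    resid[:, i] = Y[1:, i] - X @ b

a0, A1 = coefs[:, 0], coefs[:, 1:]                   # intercepts and lag-1 matrix
Sigma = resid.T @ resid / (T - 1 - (N + 1))          # estimate of Var[u_t]
print(a0, A1, Sigma, sep="\n")
```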
Because the VAR(p) model, yt = a0 + A1yt-1 + A2yt-2 + ... + Apyt-p+ ut
does not include contemporaneous effects, it is said to be in
standard or reduced form, to be opposed to a structural VAR
In a structural VAR(p), the contemporaneous effects do not need to
go through the covariance matrix of the residuals, ut
What are the properties of the reduced-form errors? Recall that x_t and z_t were uncorrelated, white noise processes; the reduced-form errors are then composites of the structural shocks
o The reduced form contains 9 parameters (6 conditional-mean parameters + 3 distinct elements of the error covariance matrix), while the primitive system contains 10
o 9 vs. 10: unless one is willing to restrict one of the parameters, it is not possible to identify the primitive system and the structural VAR is under-identified
One way to identify the model is to use the type of recursive
system proposed by Sims (1980): we speak of triangularizations
In our example, it consists of imposing a restriction on the
primitive system such as, for example, b21 = 0
As a result, while zt has a contemporaneous impact on xt, the
opposite is not true
Identifying Structural from Reduced-Form VARs
In a sense, shocks to zt are more primitive, enjoy a higher rank, and
move the system also through a contemporaneous impact on xt
The VAR(1) now acquires a triangular structure:
This corresponds to imposing a Choleski decomposition on the
covariance matrix of the residuals of the VAR in its reduced form
Indeed, now we can re-write the relationship between the pure shocks (from the structural VAR) and the regression residuals as u_t = B^{−1}ε_t, where B (with b_21 = 0) is triangular
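A minimal Python sketch of the idea: factorize the reduced-form residual covariance matrix with a Choleski (Cholesky) decomposition and back out orthogonalized shocks; the residuals are simulated and purely illustrative.

```python
# Minimal sketch: orthogonalizing reduced-form residuals via a Cholesky factor.
import numpy as np

rng = np.random.default_rng(3)
u = rng.multivariate_normal(mean=[0.0, 0.0],
                            cov=[[1.0, 0.4], [0.4, 0.5]], size=1_000)

Sigma = np.cov(u, rowvar=False)            # reduced-form covariance matrix
P = np.linalg.cholesky(Sigma)              # lower-triangular factor, Sigma = P P'
eps = u @ np.linalg.inv(P).T               # implied orthogonalized shocks
print(np.cov(eps, rowvar=False).round(2))  # approximately the identity matrix
```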
o The conditional moments are obviously different, e.g., the conditional mean is E_t[y_{t+1}] = a_0 + A_1y_t + ⋯ + A_py_{t−p+1}; K = N²p + N is the total number of conditional-mean parameters
[Figure: impulse response functions; the experiment is a tightening of short-term rates by the Fed]
Variance Decompositions
Understanding the properties of forecast errors from VARs is
helpful in order to assess the interrelationships among variables
Using the VMA representation of the errors, the h-step-ahead
forecast error is
Like in IRF analysis, variance decompositions of reduced-form VARs require identification (because otherwise we would be unable to go from the coefficients in Θ_i to their counterparts in Φ_i)
o Choleski decompositions are typically imposed
o Forecast error variance decomposition and IRF analyses both entail
similar information from the time series
o Example on weekly US Treasury yields, 1990-2016 sample:
[Figure: forecast error variance decompositions under two Choleski orderings — (a) 1M, 1Y, 5Y, 10Y yields; (b) 10Y, 1M, 1Y, 5Y yields]
Granger-Sims Causality
We say that the sub-vector x_t is (block) exogenous to y_t, or that x_t is not Granger-caused by the remaining variables
Winter/Spring 2022
Overview
Stochastic vs. deterministic trends
Spurious regressions
Defining cointegration
Trends in Time Series
Can the methods of lectures 2-4 be applied to nonstationary series?
o No, the inference would be invalid, and the costs are considerable
o If nonstationarity is simply ignored, it can produce misleading inferences because the resulting regressions will be spurious
Nonstationarity may be linked to the presence of trends in time
series processes
Time series often contain a trend, a possibly random component
with a permanent effect on a time series
① Deterministic trends: functions (linear or non-linear) of time, t; for instance, polynomials y_t = δ_0 + δ_1 t + ⋯ + δ_Q t^Q + ε_t, where ε_t is a white noise process
o The trending effect caused by functions of t is permanent and impresses a trend because time is obviously irreversible; e.g., for Q = 1, the deviation from the linear trend is stationary
o The solution to the difference equation cumulates the shocks η_τ
These 6 series are generated from the same sequence of IID N(0,1) shocks
The Random Walk Process
o The ACF of a RW shows, especially in small samples, a slight tendency
to decay that would make one think of a stationary AR(p) process
with a sum of the AR coefficients close to 1
o The two de-trended series are identical in the two plots because they
had been generated to be identical, and they are white noise
Pitfalls in De-Trending Applications
② When a time series contains a deterministic trend but is otherwise I(0) (trend-stationary) and an attempt is made to remove the trend by differentiating the series d times, the resulting differentiated series will contain d unit roots in its MA component
o It will therefore not be invertible
o Differentiating a trend-stationary series once already introduces such a unit root in its MA component
Testing for Unit Roots: the Dickey-Fuller Test
SACFs are downward biased ⟹ Box-Jenkins cannot be used
o Dickey and Fuller (1979) offer a test procedure to take the bias in due
account, based on a Monte Carlo design, under the null of a RW
Their method boils down to estimating by OLS the regression Δy_t = αy_{t−1} + u_t (where α = ρ − 1), so that α̂/SE(α̂) measures the number of standard deviations of the estimate away from 0
o The one-sided t-statistic of the OLS estimate of α is then compared to critical values found by DF by simulations under the null of a RW
• E.g., if the estimated ρ is 0.962 with a standard error of 0.013, then the estimated α is −0.038 and the t-statistic is −0.038/0.013 = −2.923
• According to DF’s simulations, this happens in less than 5% of the
time, under the null of an RW, but in more than 1% of the simulations
• This is a rather unlucky event under the null of a RW and this may
lead to a rejection of the hypothesis, with a p-value btw. 0.01 and 0.05
We therefore use a standard t-ratio, taking into account that under the null of a RW its distribution is nonstandard and cannot be analytically evaluated
We can test for the presence of a unit root using the DF test,
although this is called augmented Dickey-Fuller (ADF) test
o ADF implements a parametric correction for high order correlation
o DF show that the asymptotic distribution of the t-ratio for α is
independent of the number of lagged first differences included
In fact, the appropriate “tables” (simulated statistic) to use depend
on the deterministic components included in the regression
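A minimal Python sketch of an ADF test, assuming statsmodels is available; the simulated random walk and the choice regression="c" (intercept only) are illustrative.

```python
# Minimal sketch: augmented Dickey-Fuller test on a simulated random walk.
import numpy as np
from statsmodels.tsa.stattools import adfuller

rng = np.random.default_rng(4)
y = np.cumsum(rng.standard_normal(1_000))          # a simulated random walk

stat, pvalue, usedlag, nobs, crit, _ = adfuller(y, regression="c", autolag="AIC")
print(f"ADF stat = {stat:.3f}, p-value = {pvalue:.3f}, 5% cv = {crit['5%']:.3f}")
# For a random walk we expect a failure to reject the unit-root null
```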
Other Unit Root Tests
o In the case of real earnings, an ADF test that includes an intercept gives an estimate of α of 0 with a t-ratio of 10.196, which leads to a failure to reject the null of a unit root
o The presence of a time trend cannot be ruled out on theoretical
grounds – an ADF test also including a linear time trend, gives an
estimate of α of -0.002 which is -1.900 standard deviations away from
0 and that does not allow us to reject the null of a unit root
Phillips and Perron (1988) propose a nonparametric method of
controlling for serial correlation when testing for a unit root that is
an alternative to the ADF test
o Classical DF test + modify the t-ratio of α so that serial correlation in
the residuals does not affect the asymptotic distribution of the test
o See lecture notes for PP test statistic
o Null hypothesis remains a unit root
Kwiatkowski, Phillips, Schmidt, and Shin (1992) have proposed a
testing strategy under the null of (trend-) stationarity
This way, ldt becomes a random walk with drift; because lpt is a
linear function of a random walk, it becomes itself a random walk
Getting Intuition Through One Realistic Case
Cointegration can be identified from the need for models to be
balanced in terms of LHS vs. RHS orders of integration
To understand the essence of cointegration, consider re-
parameterizing the model in the following way:
Johansen’s Method
Suppose that N ≥ 2 variables are I(1) and follow a VECM
The Johansen test centers on the matrix Π, which can be interpreted as a long-run coefficient matrix because, in equilibrium, all the Δy_{t−i} are zero, and setting the u_t to their expectation of zero leaves only the long-run relation Πy = 0
Formal tests are based on the rank of the matrix Π via its eigenvalues, to determine the number of cointegrating relationships/vectors
The rank of a matrix is equal to the number of its characteristic roots (eigenvalues) that are different from 0
o The eigenvalues, λ_i's, are put in descending order λ_1 ≥ λ_2 ≥ … ≥ λ_g ≥ 0
o By construction, they must be less than 1 in absolute value and positive, and λ_1 will be the largest, while λ_g will be the smallest
If the variables are not cointegrated, the rank of Π will not be significantly different from zero, so λ_i ≃ 0 ∀i
o If rank(Π) = 1, then ln(1 − λ_1) will be negative and ln(1 − λ_i) ≃ 0 ∀i > 1
o If an eigenvalue λ_i is non-zero, then ln(1 − λ_i) < 0; for Π to have a rank of 1, the largest eigenvalue must be significantly non-zero, while the others will not be significantly different from 0
Two test statistics for cointegration under the Johansen approach: λ_trace(r) = −T Σ_{i=r+1}^{N} ln(1 − λ̂_i) and λ_max(r, r + 1) = −T ln(1 − λ̂_{r+1})
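A minimal Python sketch that computes the two Johansen statistics from a hypothetical vector of estimated eigenvalues (in practice the eigenvalues come from the estimated VECM and the critical values from the appropriate non-standard tables).

```python
# Minimal sketch: Johansen trace and max-eigenvalue statistics from eigenvalues.
import numpy as np

T = 500                                            # sample size
lam = np.array([0.08, 0.02, 0.005])                # hypothetical eigenvalues, descending

def johansen_stats(lam, T):
    trace = [-T * np.sum(np.log(1 - lam[r:])) for r in range(len(lam))]
    max_eig = [-T * np.log(1 - lam[r]) for r in range(len(lam))]
    return np.array(trace), np.array(max_eig)

trace, max_eig = johansen_stats(lam, T)
for r in range(len(lam)):
    print(f"H0: rank <= {r}:  trace = {trace[r]:.2f},  max-eig = {max_eig[r]:.2f}")
```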
Each eigenvalue will have associated with it a different
cointegrating vector, which will be the corresponding eigenvector
A significant eigenvalue indicates a significant cointegrating vector
λtrace is a joint test where the null is that the number of
cointegrating vectors is less than or equal to r against an
unspecified or general alternative that they are more than r
λmax conducts separate tests on each eigenvalue, and has as its null
hypothesis that the number of cointegrating vectors is r against an
alternative of r + 1
o The distribution of the test statistics is non-standard: the critical values
depend on N − r , the number of non-stationary components and whether
constants are included in each of the equations
o If the test statistic is greater than the critical value, we reject the null
hypothesis of r cointegrating vectors in favor of the alternative
o The testing is conducted in a sequence and under the null
r is the rank of Π: it cannot be of full rank (N), since this would correspond to the original y_t being stationary
Appendix A: An Example of I(2) Process
Appendix C: Cointegration = Sharing Common Trends
Lecture 6: Univariate Volatility
Modelling: ARCH and GARCH
Models
Prof. Massimo Guidolin
Winter/Spring 2022
Overview
Generalities and the mixture of distributions hypothesis
Integrated GARCH
o To favor visibility, the values of the squares and cubes have been
trimmed, even though this hides a few large spikes
o The book proves that the two stylized facts are intimately related because the model residuals are such that (*)
o The presence of the factor λ^{τ−1} that pre-multiplies the infinite sum guarantees that the sum of the weights equals 1
In the late 1980s, researchers at J.P. Morgan realized that this famous forecasting device could be rewritten in a simpler, recursive way: σ²_{t+1|t} = λσ²_{t|t−1} + (1 − λ)R²_t
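A minimal Python sketch of the recursive RiskMetrics/EWMA filter just written; the return series is simulated, and the initialization at the sample variance and λ = 0.94 are common but illustrative choices.

```python
# Minimal sketch: RiskMetrics/EWMA variance recursion.
import numpy as np

def riskmetrics_variance(r, lam=0.94):
    sig2 = np.empty(len(r) + 1)
    sig2[0] = np.var(r)                       # initialize at the sample variance
    for t in range(len(r)):
        sig2[t + 1] = lam * sig2[t] + (1 - lam) * r[t] ** 2
    return sig2                               # sig2[t+1] is the forecast made at t

rng = np.random.default_rng(5)
r = 0.01 * rng.standard_normal(1_000)         # hypothetical daily returns
print(riskmetrics_variance(r)[-5:])
```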
ARCH Models
ARCH captures volatility clustering: large past squared innovations
will lead to large forecasts of subsequent variance when all or most
α1, α2, …, αp coefficients are positive and non-negligible
However, ARCH models cannot capture the existence of asymmetric
reaction of conditional variance to positive vs. negative shocks
ARCH(p), in particular ARCH(1) differs from RiskMetrics in 2 ways:
① It features no memory for recent, past variance forecasts
② It features a constant α0 that was absent in RiskMetrics
o When we set α0 = 0 and αi = 1/W, then an ARCH(W) model simply
becomes a rolling window variance model
o Appendix A collects the moments and key properties of an ARCH(1)
o Algebra in this Appendix establishes that the long-run, ergodic variance from an ARCH(p) is σ̄² = α_0/(1 − Σ_{i=1}^{p} α_i)
Even though conditional variance changes over time, the model can be (covariance) stationary and the unconditional variance exists
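A minimal Python sketch that simulates an ARCH(1) with hypothetical parameters and checks that the sample variance is close to the long-run value α_0/(1 − α_1).

```python
# Minimal sketch: ARCH(1) simulation vs. its long-run (ergodic) variance.
import numpy as np

alpha0, alpha1 = 0.1, 0.4
T = 100_000
rng = np.random.default_rng(6)

eps = np.zeros(T)
sig2 = np.full(T, alpha0 / (1 - alpha1))
for t in range(1, T):
    sig2[t] = alpha0 + alpha1 * eps[t - 1] ** 2
    eps[t] = np.sqrt(sig2[t]) * rng.standard_normal()

print(eps.var(), alpha0 / (1 - alpha1))       # both close to 0.1/0.6 ≈ 0.167
```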
ARCH Models: One Example
Even though ARCH represents progress vs. a simple rolling window, one limitation remains: the specification is richly parameterized
o Given the empirical success of RiskMetrics, the need to pick a large p does not come as a surprise, because such a selection obviously surrogates the role played by σ²_{t|t−1} on the RHS of RiskMetrics
Consider 1963-2016 CRSP stock excess daily returns
o SACF/SPACF and information criteria analyses suggest an MA(1) mean
o SACF/SPACF of squared residuals give evidence of (at least) an AR(5)
Are ARCH Models Enough?
ARCH models are not set up or estimated to imply that the unconditional variance equals the sample variance, and this may be embarrassing
One constraint often imposed in estimation is variance targeting: the intercept is fixed so that the model-implied unconditional (long-run) variance equals the sample variance
o It is crucial to report standard errors and not p-values, because the simple null hypothesis of b = 1 requires that we calculate the t-ratios:
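A minimal Python sketch of variance targeting for a GARCH(1,1): the intercept is pinned down by the sample variance and the assumed (hypothetical) α and β, so only the latter two would remain to be estimated.

```python
# Minimal sketch: variance targeting in a GARCH(1,1).
import numpy as np

def targeted_omega(returns, alpha, beta):
    sample_var = np.var(returns)
    return sample_var * (1.0 - alpha - beta)   # omega = sigma_bar^2 (1 - alpha - beta)

rng = np.random.default_rng(7)
r = 0.01 * rng.standard_normal(2_500)          # hypothetical daily returns
print(targeted_omega(r, alpha=0.08, beta=0.90))
```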
GARCH(1,1): A Fixed Income Example
o Attempt to use ARCH leads to a large, possibly ARCH(11) specification
o GARCH(1,1) offers best trade-off between simplicity and in-sample fit
o Different coefficients contribute to increase σ²_{t+1|t} in different ways: the larger the α_i's, the larger the response of σ²_{t+1|t} to new information; the larger the β_j's, the longer and stronger the memory of conditional variance for past (forecasts of) variance
For any given persistence index, it is possible for different
stationary GARCH models to behave rather differently and
therefore yield heterogeneous economic insights
o This plot performs simulations on a baseline estimate with persistence index (sum of the ARCH and GARCH coefficients) equal to 0.984
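A minimal Python sketch of the kind of simulation just mentioned: two GARCH(1,1) processes with the same (hypothetical) persistence α + β = 0.95 but different α/β splits, fed the same shocks, produce a spiky vs. a smooth variance path.

```python
# Minimal sketch: same persistence, different alpha/beta mix in GARCH(1,1).
import numpy as np

def garch_path(z, omega, alpha, beta):
    T = len(z)
    sig2 = np.full(T, omega / (1 - alpha - beta))   # start at the long-run variance
    eps = np.zeros(T)
    for t in range(1, T):
        sig2[t] = omega + alpha * eps[t - 1] ** 2 + beta * sig2[t - 1]
        eps[t] = np.sqrt(sig2[t]) * z[t]
    return sig2

rng = np.random.default_rng(8)
z = rng.standard_normal(2_000)
spiky  = garch_path(z, omega=0.05, alpha=0.30, beta=0.65)   # reactive, short memory
smooth = garch_path(z, omega=0.05, alpha=0.05, beta=0.90)   # sluggish, long memory
print(np.std(np.diff(spiky)), np.std(np.diff(smooth)))       # spiky path moves much more
```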
Exponential GARCH Model
Because σ²_{t+1|t} = exp(ln σ²_{t+1|t}) and exp(·) > 0, EGARCH always yields positive variance forecasts without imposing restrictions
o The log conditional variance is a function of both the magnitude and the sign of past standardized residuals, and it allows the conditional variance process to respond asymmetrically to rises and falls in asset prices
o It can be rewritten as:
o Nelson’s EGARCH has another advantage: in a GARCH, the parameter
restrictions needed to ensure moment existence become increasingly
stringent as the order of the moment grows
o E.g., in the case of an ARCH(1), for an integer r, the 2r-th moment exists if and only if α_1^r ∏_{j=1}^{r}(2j − 1) < 1; for r = 2, existence of unconditional kurtosis requires α_1 < (1/3)^{1/2}
o In the EGARCH(p,q) case, if the error process η_t in the ARMA representation of the model has all moments and the sum of the squared coefficients is finite, then all moments of an EGARCH process exist
How much better can EGARCH fare versus a standard GARCH model? How important are asymmetries in conditional variance?
o The ICs select large models: a GARCH(2,2) in the GARCH family and
even a more complex EGARCH(3,3), the latter being preferred:
[Table: estimated GARCH(1,1) and EGARCH(3,3) models]
Appendix A: Key Properties of ARCH(1)
Lecture 7: Advanced Volatility
Modelling
Winter/Spring 2022
Overview
Threshold GARCH Model
Power ARCH and Nonlinear GARCH Models
The Component GARCH Model
GARCH-in-Mean Models and Asset Pricing Theory
Non-Normalities in GARCH Modelling: t-Student GARCH
Generalized Error Distribution GARCH
Testing for ARCH and GARCH
Forecasting with GARCH Models
ML and Quasi-ML Estimation
All in the Family! Other GARCH Models
The GARCH class is so successful in risk mgmt., derivative pricing,
and to some extent portfolio choice, that hundreds of different spin-
offs of GARCH(p,q) have been proposed and tested
o Bollerslev (2009) published a glossary that compiled 139 different GARCH models, once a few "multivariate" strands are also included!
Threshold GARCH(p,d,q) (TARCH) offers a way to capture asymmetries and leverage that is an alternative to EGARCH, but it requires positivity restrictions on the coefficients
The Component GARCH Model
o Because past shocks are here raised to a power which is less than 2,
PARCH forecasts are less spiky vs. threshold GARCH
o Once differences in estimates and functional form are factored in, the
two sets of forecasts are not so different (correlation is 0.987)
The variance process in a plain vanilla GARCH model shows mean reversion to the long-run, ergodic variance, which is a constant at all times
Engle and Lee (1999) generalize GARCH to a component GARCH (C-
GARCH) model that allows mean reversion to a varying level
Equivalently, the model incorporates a distinction between transi-
tory and permanent conditional variance dynamics:
(short-term)
(long-term)
o The model can be easily generalized to the (p,q) case
o First equation: dynamics of the transitory variance component which, by construction, converges to v²_{t+1|t}
o Second equation: dynamics of the long-term or permanent variance
which shows that the transitory component for ε²_{t+1} is an ARMA(1,1) process that converges to v²_{t+1|t} with a persistence rate α + β < 1
o The permanent component is an AR(1) process with persistence ψ
o One can combine the two components to write the conditional variance as the sum of a transitory and a permanent part (see the book)
[Table: GARCH(1,1) vs. Component GARCH(1,1) estimates]
o The asymmetric effect is precisely estimated
o C-GARCH models have a crucial derivative-pricing justification: often the fair price may not be made to depend on short-run spikes
GARCH-in-Mean
So far, we have rigidly and somewhat artificially kept distinct the
conditional variance from the conditional mean specifications
o Apart from the obvious link: the residuals that a CH model uses are estimated by subtracting from the data an estimate of the conditional mean
It is possible to jointly model conditional mean and variance when
the latter appears in the specification of the former
o Formally, when μ_{t+1|t} is a function of σ_{t+1|t} we have (G)ARCH-in-mean
[Figure: US daily stock returns, 1963-2016]
t-Student GARCH
o The gamma function has well-known recursive properties
o Note that the t-Student density behaves as a negative power function in the tails, while the Gaussian benchmark behaves as a negative exponential function
o The tails of a negative power function decay to 0 as z_{t+1} ⟶ ±∞ at a speed given by the power −ν/2; the lower is ν, the slower the rate
o The negative exponential is the specification whose tails go to 0 fastest among all such functions
o E.g., for z = 4 (a shock 4 standard deviations away from the mean), under a negative power with ν = 10 we obtain a probability of 0.002376, versus 0.000336 under the Gaussian
o The former probability is then 0.002376/0.000336 ≈ 7 times larger
o Repeat this experiment for an extreme realization, say a (standardized) return 12 standard deviations away from the sample mean (say a −9.5% return): then the ratio is a few thousand billion times larger!
Events that are impossible under a Gaussian distribution remain
rare but billions of times more likely under a fat-tailed, t-Student
Consider a flexible threshold GARCH(1,1,1) under normal vs. t-
Student applied to a variety of time series
o Both the accurately estimated values of ν—always below 20 and often close to 5—and (unreported) information criteria confirm that all the data require a t-Student assumption
t-Student GARCH ARMA(1,1)
o E.g., if ν = 3, then only mean and variance exist and skewness is not 0,
it is simply not defined; if ν = 3.001, then mean, variance, and
skewness exist, with both mean and skewness equal to 0, etc.
o Please pay attention: variance and kurtosis cannot be negative!
o Three further implications: it is no longer true that σ²_{t+1|t} is the time-t forecast of one-step-ahead variance, as the conditional variance exceeds σ²_{t+1|t}
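A minimal sketch of estimating a GARCH(1,1) with t-Student innovations, assuming the third-party Python package arch is installed; the simulated fat-tailed "returns" are purely illustrative.

```python
# Minimal sketch: GARCH(1,1) with Student-t errors using the `arch` package.
import numpy as np
from arch import arch_model

rng = np.random.default_rng(9)
r = rng.standard_t(df=6, size=2_000)               # hypothetical fat-tailed returns (in %)

am = arch_model(r, mean="Constant", vol="GARCH", p=1, q=1, dist="t")
res = am.fit(disp="off")
print(res.params)          # includes the estimated degrees-of-freedom parameter "nu"
```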
The News Impact Curve and ARCH Asymmetries
o For a GARCH(1,1), the NIC holds last period's variance fixed at its unconditional level, so that it equals a constant plus αε_t²
o The NIC is a quadratic function (a parabola) symmetric around ε_t = 0
o For an EGARCH(1,1), the corresponding NIC is instead exponential in the standardized shock
o Such a function is clearly not symmetric around ε_t = 0
o Let us consider the asymmetry properties of CH models of the daily value factor (HML) returns over a 1963-2016 sample
o We estimate 3 models with GED errors (from an AR(1) conditional mean)
o TARCH and EGARCH models are different (the EGARCH kink is caused by the absolute value), but they both display negative leverage
[Figure: GARCH variance forecasts at a 1-week horizon; a persistence index of 0.99 implies slow convergence]
o Under a stationary GARCH(1,1), the h-step-ahead forecast is σ²_{t+h|t} = σ̄² + (α + β)^{h−1}(σ²_{t+1|t} − σ̄²)
o The usual "×H" multiplication rule holds only when α + β = 1; the more α + β << 1, the more inaccurate it is
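A minimal Python sketch that compares the cumulative 1-week variance implied by the h-step forecasts above with the naive "×H" scaling of the 1-step forecast; all parameter values are hypothetical.

```python
# Minimal sketch: multi-step GARCH(1,1) variance forecasts vs. the "x H" rule.
import numpy as np

omega, alpha, beta = 0.02, 0.07, 0.90
sigma2_bar = omega / (1 - alpha - beta)            # long-run (ergodic) variance
sigma2_1 = 2.5 * sigma2_bar                        # a high current 1-step forecast

H = 5                                              # 1-week horizon (5 trading days)
h = np.arange(1, H + 1)
sigma2_h = sigma2_bar + (alpha + beta) ** (h - 1) * (sigma2_1 - sigma2_bar)

exact_week = sigma2_h.sum()                        # cumulative weekly variance
naive_week = H * sigma2_1                          # the "x H" rule
print(exact_week, naive_week)                      # the naive rule overstates variance here
```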
In a frequentist statistical perspective, to estimate the parameters of a GARCH model means that, given a random sample, one wants to select a vector θ ∈ ℝ^K (K is the number of parameters) to maximize a criterion function that measures how plausible each value for θ is
Choice of the criterion and of a method to maximize it defines the
(point) estimation method
Three methods typically applied to GARCH: 1) (Generalized)
Method of Moments, 2) Maximum Likelihood, 3) Quasi MLE
Hints to Estimation of GARCH Models: GMM
One idea is to look for a (unique) θMM such that some statistical
features implied by the process—e.g., some moments, such as
unconditional means and variances, or the IRFs—are the same:
⊕ when computed from the assumed stochastic process under θ
= θMM, and
⊕ in the sample
One such estimator, based on matching the sample with population
moments, is the method-of-moment (MM) estimator
o See also statistics prep course
Problem: although intuitive, because MM does not exploit the entire
empirical density of the data but only a few sample moments, it is
clearly not as efficient as other methods, e.g., MLE
o MM yields standard errors that are larger than necessary
o GARCH models can be estimated with methods that extend MM, called the generalized method of moments (GMM), based on (at least) K orthogonality conditions implied by the model
o The standard errors for MLEs are derived from the information matrix:
Hints to Estimation of GARCH Models: Quasi MLE
o MLE is the most efficient estimator (i.e., it reaches the Cramér-Rao lower bound) because it exploits the information in the joint PDF of the data
o See the statistics prep course for details
o The result implies that
o This covariance matrix can be used for hypothesis
testing by constructing the usual z-ratio stat:
o The book reports examples of the model-dependence of std. errors
Unfortunately, MLE requires that we know the parametric functional form of the joint PDF of the sample and that the z_t are IID
o If one assumes normality, t-Student, or GED, everything depends on that assumption, with no margin of error
What if we are not positive about the parametric distribution? E.g., all we can say is that z_t ~ i.i.d. D(0, 1), but D is unknown: can we still apply the ML procedure and count on some of the good properties of MLEs?
The answer is a qualified—as it holds only subject to specific conditions—"yes"
Hints to Estimation of GARCH Models: Quasi MLE
o QMLE idea: maximize the Gaussian (normal) likelihood even when the true conditional distribution D is unknown
o The need to correctly specify the conditional mean function applies also when our main interest lies with the estimation of the conditional variance
However, QMLE methods imply a cost, in a statistical sense: QMLE will in general be less efficient than ML estimators
o The denominator of the z-score statistics will be larger under QMLE vs. MLE, and this implies less power for the tests
o Let’s try a different route: why not obtain the estimated OLS residuals
from a simple regression, and then separately
estimate ω and α from maximization of:
o The estimates are rather different, especially those of the AR(1) model
o Although not all p-values reveal it, the standard errors have grown
o OLS estimation of GARCH models should be avoided in favor of MLE
Sequential, partial ML estimators that may, on the surface, fully exploit the assumed model structure will not deliver the optimal statistical properties and characterization of the MLE
o A sequential ML-based estimator may be characterized as a QMLE and will be subject to the limitations of QMLE: loss of efficiency
o This is due to the fact that when we split θ into θ1 and θ2 to estimate them separately, this very separation implies a zero covariance between the estimators for any pair of parameters across the two blocks
o Such a zero covariance may be at odds with the data
o E.g., in our earlier example,
Under variance targeting, what seems MLE is in fact QMLE
MSc. Finance/CLEFIN
2021/2022 Edition
Debriefing.
It is not possible to interpret the theta coefficients as impact multipliers of the true, structural
innovations, as they are functions of A1, which is the matrix of the coefficients of the reduced
form VAR (estimated via OLS). Consequently, the theta coefficients are the impact multipliers
of the error terms of the reduced form model, which are a composite of the true, structural and
uncorrelated innovations to the series. In general, it is not possible to retrieve the impact
multipliers of the structural innovations because a VAR in its reduced form is under-identified,
i.e., the OLS estimated parameters are fewer than the parameters in the primitive system. For instance, a bivariate VAR(1) will have 10 structural parameters (six coefficients, two intercepts, and two variances) while its reduced form only contains 9 parameters (four coefficients, two intercepts, two variances, and one covariance). Therefore, in order to recover the primitive parameters, we should be ready to impose restrictions (specifically, we need to impose (N² − N)/2 restrictions). A popular method to impose restrictions consists of the application of a Cholesky decomposition which enforces (N² − N)/2 contemporaneous coefficients in the
primitive system to be equal to zero. More specifically, the Cholesky decomposition enforces a
triangular identification scheme: the first variable in the system is allowed to
contemporaneously impact all the others, the second variable in the system influences all the
variables but the first one, and so on. For this reason, we say that the Cholesky decomposition
imposes an ordering on the variables. This is also a major drawback of the Cholesky
decomposition, as the values of the impact multipliers (and therefore the IRFs) depend on the
ordering of the variables, so that the conclusions that are drawn may change if we place the
variables in a different order. This is a problem especially when the ordering is not based on
any sensible economic assumption.
Question 1.B (4 points)
Mr. Earl, a senior analyst at Vicod’s Hedge Fund, wants to estimate a VAR model for the 1-month,
1-, 5-, and 10-year Treasury yields and he is wondering which lag order he should be selecting.
Therefore, he conducts a full specification search and produces the table below. Why is it a good
practice to look at the information criteria? Which is the model selected by each of the three
information criteria (AIC, SC, HQ)? Do they all lead to the selection of the same model and, if
not, is this plausible?
[EViews output, VAR Lag Order Selection Criteria; endogenous variables: ONEMONTH ONEYEAR FIVEYEARS TENYEARS; exogenous variables: C; sample: 1/05/1990-12/30/2016; included observations: 1395; criteria values omitted]
Debriefing
It is a good practice to look at information criteria (instead of comparing, for instance, the
squared residuals from different models) because they trade off the goodness of (in-sample) fit
and the parsimony of the model, thus avoiding the overfitting problem, i.e., the selection of
large-scale models that display a good in sample fit, but often have a poor forecasting
performance out-of-sample. The best model according to each information criterion is the
model that minimizes the criterion. Therefore, the AIC selects a VAR(11) model while both the
SC and the HQ criterion select a more parsimonious VAR(2) model. It is perfectly possible that
the three information criteria lead to the selection of different models. Indeed, although they all contain a first term which is a function of the squared residuals, each of the criteria imposes a
different penalty for the loss of degrees of freedom from the presence of a given number of
parameters in the model. In particular, the SC is the criterion that imposes the strongest penalty
for each additional parameter that is included in the model, while the AIC is the one with the
smallest penalization for additional parameters (HQ falls between the other two).
Question 2.A (10 points) Consider the family of GARCH(p, q) models for the variance of asset
returns. Define the persistence index and discuss what is the role that such index plays in
establishing the covariance stationarity of a GARCH model. Consider two alternative GARCH
models for the same series of returns characterized by an identical persistence index, but (i) the first model characterized by a large ∑_{i=1}^{p} α_i and a small ∑_{j=1}^{q} β_j; (ii) the second model characterized by a small ∑_{i=1}^{p} α_i and a large ∑_{j=1}^{q} β_j. What do you expect the differences between the filtered, one-step-ahead predicted variances from the two models will look like?
Consider the two cases that follow:
You are a risk manager and you are considering calculating daily value-at-risk on the
basis of a Gaussian homoskedastic model: is the mistake you are about to make larger
under model (i) or model (ii)? Carefully explain why.
You write and sell long-term options written on the underlying asset that you price using
a pricing tool that accounts for time varying volatility under GARCH: in the presence of
large return shocks, will the mispricing be larger under model (i) or under model (ii)?
(Note: Remember, you are not required to use any formulas or equation in your replies).
Debriefing:
In a standard GARCH(p, q) model, the persistence index is defined as the sum of all alpha and
beta coefficients. Such an index provides information on the time required for the effects of a
shock to returns to decay to zero in terms of its effects in the prediction of the variance or,
equivalently, how much memory (in a ARMA sense) the GARCH model displays. Given the
overall persistence index of a model, when—the case of model (i)— the sum of all the alpha
coefficients is large while the sum of all beta coefficients is small, this means that the GARCH
will be characterized by a high reactivity of variance forecasts to the most recent p shocks to
returns (squared) and by modest memory of the past q variance forecasts: this gives a sample
path of conditional variance predictions that tends to be jagged and rather spiky. When instead
the sum of all alpha coefficients is small while the sum of all beta coefficients is large, i.e., model
(ii), this means that the GARCH will be characterized by a modest reactivity of variance
forecasts to the most recent p shocks to returns (squared) and by considerable memory for the
past q variance forecasts: this gives a sample path of conditional variance predictions that tends
to be smooth but to swing up and down in dependence of strings of squared shocks that move
predicted variance above or below the ergodic, long-run variance.
As for the two cases, a risk manager engaged in estimating daily VaR under a model that
incorrectly sets p = q = 0 (basically, no GARCH) will incur larger mistakes under (a true)
model (i), when the sum of all alpha coefficients is large while the sum of all beta coefficients is
small, because the risk officer will ignore the short-lived predicted variance spikes implied by
a large sum of all alpha coefficients; in fact, a homoscedastic model is somewhat close to case
(ii), representing a situation of extreme smoothness, when the variance is constant. A long-term
option trader will instead make the same mistakes by using a homoscedastic Gaussian model
(also known as Black-Scholes) vs. either model (i) or (ii): the reason is that in the pricing of
long-term contracts, what really matters is not either the sum of all alpha or of all beta
coefficients, but their overall sum, i.e., the very persistence index.
Question 2.B (3.5 points) Bruno Cerelli, an analyst at Unresponsive & So., has estimated a
Gaussian GARCH(1,1) model for FTSE MIB stock index returns and found that α̂ + β̂ = 1; upon testing, he has not been able to reject the null hypothesis that α + β = 1. Therefore he has concluded that, because the condition α + β < 1 is violated, the GARCH model is non-stationary
so that a time-invariant unconditional distribution for the FTSE MIB returns does not exist and
one cannot learn from past data to forecast future returns. A colleague of his, Vari Keen, has
objected that this implication is unwarranted, forecasting is indeed possible, even though a
GARCH with α + β = 1 implies that variance follows a random walk with drift process so that
time t estimated variance is (in a mean-squared error sense) the best forecast for variance at
time t + 1, t + 2, …, t + H for all H ≥ 1. Which one of the two analysts at Unresponsive & So. is correct and why? Would you ever advise to use such a special GARCH(1,1) model with α + β = 1 to obtain long-run forecasts? Make sure to carefully explain your answer.
Debriefing:
Vari Keen is right in her claims: when α + β = 1 in a GARCH(1,1) model, we are truly facing an IGARCH(1,1), which in turn represents a special case of the RiskMetrics model. A RiskMetrics
model corresponds to a random walk (possibly, with drift in the case of IGARCH) which implies
that forecasting is possible but also characterized by the odd property that the time t estimated
variance is the best forecast for variance at time t + 1, t + 2, …, t + H for all H ≥ 1. However,
IGARCH/RiskMetrics is not covariance stationary but it is strictly stationary, which allows us to
produce forecasts at least in general terms, even though how advisable it may be to produce H-
step ahead forecasts that correspond to time t, short-term estimates of the variance, remains
questionable. Moreover, Vari Keen should place her focus on how/whether the conditions (that we do not state, as they are rather complex) that ensure strict stationarity of the IGARCH model are
satisfied, a matter on which we have had no specific information (but the book shows that a
sufficient condition is satisfied).
MSc. Finance/CLEFIN
2017/2018 Edition
Debriefing:
We expect all sub-questions to be answered but within a well-organized, 12-15 line long reply
that will need to fit in the (generous) space provided.
Question 1.B (2 points)
Using the lag operator L, write an AR(2) process in “lag operator-polynomial” form and discuss
how would you go about testing whether the process is stable and hence stationary. Will the
resulting stationarity, if verified, be strong or weak? Make sure to explain your reasoning.
Debriefing:
Because, if the series is stationary, Wold's decomposition applies: the process is linear and, as such, weak and strong stationarity are equivalent.
See also material copied below.
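A minimal Python sketch of the stability check referred to above: compute the roots of the AR(2) lag polynomial 1 − φ_1z − φ_2z² and verify that they lie outside the unit circle; the coefficients are hypothetical.

```python
# Minimal sketch: AR(2) stability via the roots of its lag polynomial.
import numpy as np

phi1, phi2 = 1.2, -0.5                      # hypothetical AR(2) coefficients
roots = np.roots([-phi2, -phi1, 1.0])       # roots of 1 - phi1 z - phi2 z^2 = 0
print(roots, np.all(np.abs(roots) > 1.0))   # True => stable, hence (weakly) stationary
```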
Question 1.C (1 point)
Consider the following data and the corresponding sample ACF:
[Figure: simulated time series (T = 1,000 observations) and its sample autocorrelation function for lags 1-24]
What is the most likely type of ARMA(p, q) process that may have originated this SACF? What
other type of information would you need in order to be sure of your answer? Make
sure to carefully justify your arguments.
Debriefing: The series and the corresponding SACF were generated from 1,000 simulations
from an AR(2) process with the following features:
As seen in the lectures (lecture 3, slide 12), this cyclical but fading pattern characterizes
stationary AR(2) processes with coefficients of opposite signs. However, one could be really
“positive” this SACF comes from an AR(2) only after verifying that the SPACF has the typical
behavior of PACF for stationary AR(2) processes: two values statistically significant, followed
by no other significant values. Of course, one cannot detect the generating process just on the
basis of the SACF (also, the confidence intervals were not given), but one could speculate on its
likely nature.
MSc. Finance/CLEFIN
2017/2018 Edition
Debriefing:
Question 2.B (2 points)
Suppose that the bivariate structural VAR(2) is to be exactly identified by imposing either of the
two possible Choleski triangularization schemes:
B′ = (1 0; b21 1)   and   B″ = (1 b12; 0 1)
(rows separated by semicolons)
Carefully explain the implications and differences in economic interpretations of the estimated,
corresponding reduced-form model deriving from imposing the restriction in 𝑩𝑩′ instead of 𝑩𝑩′′.
How does your answer change when the restriction
B‴ = (1 0; 0 1)
is imposed instead?
Debriefing:
Trivially, 𝑩𝑩′′′ implies that the original structural model is in reduced form or, alternatively, the
model has been over-identified by removing all contemporaneous effects between variables.
B′ implies that S&P 500 returns are ordered before VIX log-changes, so that any u_t^{S&P} shock to the S&P 500 is structural and primitive, while the u_t^{VIX} shocks are correlated with both structural shocks, to the S&P 500 and to the VIX.
On the opposite, B″ implies that log-changes in the VIX are ordered before S&P 500 returns, so that any u_t^{VIX} shock to the VIX is structural and primitive, while the u_t^{S&P} shocks are correlated with both structural shocks, to the S&P 500 and to the VIX.
Question 2.C (1 point)
Suppose that the estimation of a constrained, reduced-form VAR(2) has provided the following
ML estimates of the conditional mean function and of the covariance matrix of the reduced-form
shocks (p-values are in parentheses):
R_t^{S&P} = 0.006 + 0.053 R_{t−1}^{S&P} − 0.473 ΔlnVIX_{t−1} + 0.113 ΔlnVIX_{t−2} + u_t^{S&P}
           (0.044)  (0.093)            (0.003)              (0.045)

ΔlnVIX_t = −0.194 − 0.375 R_{t−1}^{S&P} + 0.094 R_{t−2}^{S&P} + 0.804 ΔlnVIX_{t−1} + u_t^{VIX}
           (0.149)  (0.024)             (0.050)              (0.000)

Var[(u_t^{S&P}, u_t^{VIX})′] = ( 0.008 (0.000)   −0.016 (0.007) ; −0.016 (0.007)   0.014 (0.000) )
You would like to recover the original structural parameters, including the contemporaneous,
average impact of both VIX changes on S&P 500 returns and vice-versa. Is there a chance that
this may be possible even though you are not ready to impose a Choleski ordering on the two
variables?
Debriefing: One cannot say for sure but the evidence shown has two implications:
• The estimated reduced-form VAR(2) carries restrictions and in fact estimation has been properly performed by MLE applied to the bivariate system.
• There are two restrictions that have been imposed, setting the coefficient of R_{t−2}^{S&P} to zero in the first equation and the coefficient of ΔlnVIX_{t−2} to zero in the second equation; indeed, note that ML estimation has been performed, because it is likely that the reduced-form VAR will include restrictions.
Now, we can only speculate that such restrictions derive from restrictions that have been
imposed on the matrix Γ2 in the structural representation of the model,
B [R_t^{S&P}; ΔlnVIX_t] = Γ_0 + Γ_1 [R_{t−1}^{S&P}; ΔlnVIX_{t−1}] + Γ_2 [R_{t−2}^{S&P}; ΔlnVIX_{t−2}] + [ε_t^{S&P}; ε_t^{VIX}],
in the sense that Γ_2 = (0  γ_12^{(2)} ; γ_21^{(2)}  0).
However, we know that the exact identification of a bivariate structural VAR requires imposing (2² − 2)/2 = 1 restriction and that such constraints do not have to be imposed necessarily on the
matrix of contemporaneous effects B. Because two such restrictions seem to have been imposed on Γ_2, yes, there is a chance for the structural model—in particular for the two coefficients
measuring the contemporaneous, average impact of both VIX changes on S&P 500 returns and
vice-versa—to be identified (probably, over-identified), even though no Choleski
triangularization has affected B (in fact, no restrictions at all have been imposed).
MSc. Finance/CLEFIN
2017/2018 Edition
Debriefing:
The answer to the last sub-point is negative because we know that Δ^d y_t ≠ y_t − y_{t−d}; rather, Δ^d y_t consists of taking a number d of successive differences of the series under consideration, i.e., Δ²y_t = Δ(Δy_t), Δ³y_t = Δ(Δ²y_t), …
He has proposed to make this series stationary by first fitting (by simple OLS) a quadratic function of time (shown as a dashed red line in the picture) and then replacing the time series of real GDP with the OLS residuals from such a quadratic trend regression. What are the risks that the analyst is exposing himself and his firm to by adopting this simple procedure?
Debriefing:
As commented in the lectures, such a procedure does not really make an obviously trending, non-stationary series stationary when the series contains a stochastic (unit root) trend, which should be tested for as a first order of business. The risks are that, because the adopted method is ineffective, the residuals will then be treated as I(0) while they are in fact I(1) or worse.
Question 3.C (1.5 points)
You know that a time series {y_t} was originally suspected to be I(d) with d ≥ 1. A fellow quant
analyst, Ms. Maria Delas, has then transformed it by differentiating three times, in the attempt
to make it stationary and delivered the series to you. Upon your own analysis, you determine
that the series contains now 2 unit roots in its MA component (i.e., the residuals need to be
differentiated twice for them to be “well-behaved”, that we may have called invertible). What
do you know about the d characterizing the original series?
Debriefing:
The original series was I(1): differentiating it three times—well more than what is needed—"messes up" its stochastic structure, by creating two unit roots in its MA component. In short, if d − 3 = −2, then it must have been that d = 1.
MSc. Finance/CLEFIN
2020/2021 Edition
This exam is closed book, closed notes, subject to the Honor Code and proctored
through Respondus Monitor. In case you believe it is needed, you can use the
calculator that is available within the Respondus framework. You cannot use
your personal calculator or your smartphone. You must avoid using scratch
paper, as this will be interpreted as a suspicious behaviour by the software and
expose you to risks of disciplinary actions.
You always need to carefully justify your answers. However, no algebra or
formulas are required in order to achieve the maximum score in all questions. In
case you **FEEL THAT YOU NEED** an assumption in order to answer a question,
please do so and proceed. The necessity and quality of the assumption(s) will be
assessed together with the answer.
Note: the scores to the questions sum to 27 to accommodate the 4 points
(maximum) deriving from the (optional) homeworks; however, because
homeworks were optional, or in case the max-rule score favours a student, when re-scaling the exam score to 31 gives a higher grade, the latter will prevail (see the
syllabus for the max rule formula and details).
Debriefing. Given an N-dimensional random vector y_t, its components y_{1,t}, …, y_{N,t} are said to be cointegrated of order (d, b) if they are integrated of order d and there exists at least one vector κ (the cointegrating vector) such that κ′y_t is integrated of order d − b (with d − b greater than or equal to zero and b strictly greater than zero). A cointegrating vector κ represents
one (of the potentially many) long-run equilibrium relationships among the variables, i.e., the
variables cannot arbitrarily wander away from each other.
If y_t contains N nonstationary components, there may be as many as (N - 1) linearly independent cointegrating vectors. The number of cointegrating vectors is called the cointegrating rank of y_t. There are two ways to test for cointegration: univariate, regression-
based procedures (Engle and Granger’s test) and multivariate, VECM-based tests (Johansen’s
tests). In the case of Engle and Granger’s procedure, the method consists of regressing one
series on all the others (after having determined that they are both integrated of the same
order) and to test the residuals of such a regression for the presence of a unit root (e.g., using an ADF test). If the residuals are stationary, this will prove that the series are cointegrated. This
procedure has two main drawbacks:
• the Engle-Granger procedure relies on a two-step estimator, in which the first step
regression residuals are used in the second step to estimate an ADF (or PP)-type
regression, which causes errors and contamination deriving from a generated
regressors problem;
• Engle and Granger’s method has no systematic procedure to perform the separate
estimation of multiple cointegrating vectors, in the sense that the approach suffers from
an inherent indeterminacy as it is unclear which I(1) variables ought to be used as a
dependent variable and which ones as regressors; moreover, in principle, the outcomes
from Engle-Granger’s test may turn out to depend on exactly such choices of dependent
vs. explanatory variables in the testing regressions.
Therefore, in the case of N > 2, the use of Johansen's procedure is strongly advised in order to
test whether a number N of series are cointegrated (and how many cointegrating vectors
exist). With Johansen's test, one estimates the vector error correction VAR(p) (in differences)
model best suited to the data (e.g., on the basis of appropriate information criteria) and
proceeds to apply tests concerning the eigenvalues of a particular matrix ("pi") to ascertain
the rank of such a matrix (a code sketch follows the rejection rules below). There are three possible
outcomes concerning the rank of the matrix "pi": (i) if the matrix "pi" has rank equal to zero, then all
the N variables in the VECM contain a unit root and they are not cointegrated; (ii) if the matrix "pi"
has rank r such that 0 < r < N, then all the N variables in the VECM contain a unit root and they are
characterized by r cointegrating relationships; (iii) if the matrix "pi" has full rank, all the variables in
the VECM are stationary. Johansen proposed two different tests (based on the eigenvalues of the
matrix "pi", given that the rank of an N × N matrix equals N minus the number of
eigenvalues that are equal to zero) to ascertain the rank of "pi": the trace test, which tests the
null that the number of distinct cointegrating vectors is less than or equal to r against a
general alternative of a number exceeding r; and the max eigenvalue test, which tests the null that
the number of cointegrating vectors is r against the alternative of r + 1 cointegrating vectors.
If the test statistic is greater than the appropriate critical value, we will reject:
• the null hypothesis that there are at most r cointegrating vectors in favor of the
alternative that there are more than r, in the case of the trace test, or
• the null hypothesis that there are exactly r cointegrating vectors in favor of the
alternative that there are r + 1 vectors, in the case of the max eigenvalue test.
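The sketch below shows how the two Johansen tests may be run in practice with statsmodels; the simulated data matrix Y, the seed, and the lag choice are illustrative assumptions of ours.

import numpy as np
from statsmodels.tsa.vector_ar.vecm import coint_johansen, select_order

rng = np.random.default_rng(1)
# Three simulated series driven by a single common stochastic trend
trend = np.cumsum(rng.normal(size=(600, 1)), axis=0)
Y = trend @ np.array([[1.0, 0.7, -0.4]]) + rng.normal(size=(600, 3))

# Lag order of the VAR in differences, e.g. by information criteria (at least 1)
p = max(select_order(Y, maxlags=6).aic, 1)

# Johansen test: det_order=0 includes a constant, k_ar_diff = lags in differences
res = coint_johansen(Y, det_order=0, k_ar_diff=p)
print("trace statistics:       ", res.lr1)
print("trace 95% crit. values: ", res.cvt[:, 1])
print("max-eig statistics:     ", res.lr2)
print("max-eig 95% crit. values:", res.cvm[:, 1])
# Compare the statistics with the critical values sequentially for r = 0, 1, ...
# to infer the rank of the "pi" matrix; with a single common trend among three
# series, one expects a cointegrating rank of 2.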
Debriefing. The errors of the regression are a weighted sum of two random variables and
therefore their order of integration should be equal to the maximum of the orders of
integration of the variables, unless the variables are cointegrated (of some order). This means
that, because the order of integration of y_t is 2, the two variables need to be cointegrated;
otherwise the errors of the regression would be I(2). However, because the residuals are still
I(1) (i.e., they are non-stationary), the variables are CI(2,1). This means that the order of
integration of x_t should be 2 as well, as variables need to be integrated of the same order to be
cointegrated. Even though the variables are CI(2,1), the results of the regression should not be
trusted because the errors are I(1) (so that any shock has a permanent effect on the
regression). Juice should take the first difference of both variables and then regress
one on the other, if he insists on carrying on with his analysis.
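A minimal sketch of these checks, using statsmodels on simulated data; the helper integration_order, the simulated series, and their parameters are our own illustrative assumptions, not part of the exercise.

import numpy as np
import statsmodels.api as sm
from statsmodels.tsa.stattools import adfuller

def integration_order(series, alpha=0.05, max_d=3):
    """Difference the series until an ADF test rejects a unit root; return that order d."""
    for d in range(max_d + 1):
        p_value = adfuller(np.diff(series, n=d) if d > 0 else series)[1]
        if p_value < alpha:
            return d
    return max_d

rng = np.random.default_rng(2)
common = np.cumsum(np.cumsum(rng.normal(size=800)))        # a shared I(2) trend
x = common + rng.normal(size=800)                          # I(2)
y = 2.0 + 0.5 * common + np.cumsum(rng.normal(size=800))   # I(2), CI(2,1) with x

levels = sm.OLS(y, sm.add_constant(x)).fit()
print("integration order of the residuals:", integration_order(levels.resid))  # expect 1

# As recommended above: first-difference both variables before regressing
diff_reg = sm.OLS(np.diff(y), sm.add_constant(np.diff(x))).fit()
print(diff_reg.params)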
Debriefing.
Engle and Lee (1999) have generalized the classical GARCH model (with and without
threshold effects) to component GARCH (C-GARCH) models. Differently from standard GARCH
models, whose conditional variance fluctuates around a constant long-run variance, C-GARCH
models allow mean reversion to a time-varying level or, equivalently, they incorporate a
distinction between transitory and permanent conditional variance dynamics. However, the
long-term, permanent variance is itself stationary and converges to some constant long-run
level. In essence, a C-GARCH is a two-layer conditional heteroskedasticity model in which the
short-term (also known as transitory) variance follows a GARCH process that fluctuates around
a randomly varying permanent conditional variance component which, in its turn, follows an
AR model (often with an interaction between the short-term ARMA(1,1)-type dynamics and
the AR(1) permanent component).
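In symbols (a sketch in one common notation that we choose so as to match EViews-style output such as the one reported in Question 2.B below, where ω, ρ, φ, α, and β correspond to the coefficients C(2) through C(6); other parameterizations exist), a C-GARCH(1,1) can be written as

\[
\begin{aligned}
q_t &= \omega + \rho\,(q_{t-1} - \omega) + \phi\,(\varepsilon_{t-1}^{2} - \sigma_{t-1}^{2}) && \text{(permanent component)}\\
\sigma_t^{2} &= q_t + \alpha\,(\varepsilon_{t-1}^{2} - q_{t-1}) + \beta\,(\sigma_{t-1}^{2} - q_{t-1}) && \text{(transitory dynamics)}
\end{aligned}
\]

where ε_t is the return innovation, σ_t² the total conditional variance, and q_t the permanent variance component.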
Interestingly, one can combine the transitory and permanent components to write a C-
GARCH(1,1) model as a restricted, classical GARCH(2,2) model in which the parameters (a
total of 5) are restricted to depend (in a non-linear fashion) on the 5 parameters that
characterize the original C-GARCH(1,1) model. This is meaningful because it
illustrates that empirically a C-GARCH model not only carries a meaningful power to separate
the temporary and permanent components (see below for comments), but also has an
enhanced power to fit the long memory that squared and absolute residuals from a variety
of models for financial returns tend to display. In fact, higher-order GARCH models are rarely
used in practice, and the GARCH(2,2) case deriving from a C-GARCH(1,1) represents one of
the few cases in which, even though it is subject to constraints coming from the structure of
the C-GARCH, a (2,2) model has implicitly been used in many practical applications through
the component GARCH case. Even though by construction the corresponding GARCH(2,2)
always gives positive variance components, its coefficients are not all positive, which is a
clear proof of the fact that setting all coefficients to be non-negative in a GARCH(p, q) is
certainly sufficient but not necessary to guarantee that the resulting variance forecast be
positive at all times.
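As a sketch of this equivalence (in the notation introduced above, which is our own and not the exam's), substituting the permanent equation into the transitory one and eliminating q_t yields

\[
\sigma_t^{2} = (1-\alpha-\beta)(1-\rho)\,\omega + (\alpha+\phi)\,\varepsilon_{t-1}^{2} - \big[\alpha\rho + (\alpha+\beta)\phi\big]\,\varepsilon_{t-2}^{2} + (\beta+\rho-\phi)\,\sigma_{t-1}^{2} + \big[(\alpha+\beta)\phi - \beta\rho\big]\,\sigma_{t-2}^{2},
\]

i.e., a GARCH(2,2) whose five coefficients are non-linear functions of the five C-GARCH(1,1) parameters. Note that the coefficient on ε²_{t-2} is typically negative even though the implied conditional variance stays positive, which is exactly the sufficiency-versus-necessity point made above.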
In the literature there have also been variations of C-GARCH(p, q) models that include
leverage-type asymmetric effects in the short-run conditional variance equation. This model
combines the component model with the asymmetric threshold GARCH model, introducing
asymmetric effects in the transitory equation. Even though they may not always represent the
best-fitting models, C-GARCH models may have a crucial asset pricing justification: it may often
be optimal that the fair price of certain securities (think of long-term options, sometimes called
LEAPS) and a few types of financial decisions (such as portfolio weights for the long run)
not be made to depend on short-run spikes and alterations in predicted conditional variance.
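One common way to introduce the asymmetry (again a sketch in our own notation, with an additional leverage parameter γ and 1{·} an indicator for negative shocks; the permanent equation is left unchanged) is

\[
\sigma_t^{2} = q_t + \alpha\,(\varepsilon_{t-1}^{2} - q_{t-1}) + \gamma\,(\varepsilon_{t-1}^{2} - q_{t-1})\,\mathbb{1}\{\varepsilon_{t-1} < 0\} + \beta\,(\sigma_{t-1}^{2} - q_{t-1}),
\]

so that a positive γ makes the transitory variance react more strongly to negative news.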
Question 2.B (4 points)
Geremia Morgano is a quant analyst at Randomshoot Inc. Geremia has estimated by MLE,
under normality, a component GARCH(1,1) model for a return series, obtaining the output that
follows.
Sample: 1975M01 2020M12
Included observations: 552
Convergence achieved after 29 iterations
Presample variance: backcast (parameter = 0.7)
Q = C(2) + C(3)*(Q(-1) - C(2)) + C(4)*(RESID(-1)^2 - GARCH(-1))
GARCH = Q + C(5) * (RESID(-1)^2 - Q(-1)) + C(6)*(GARCH(-1) - Q(-1))
Variance Equation
[estimated coefficient table]
The sample correlogram refers to squared standardized residuals derived from the C-
GARCH(1,1) model. Do you think that Geremia may use the estimated C-GARCH to forecast
variance, price options, compute risk measures, etc.? Make sure to justify your answer.
Geremia has also proceeded to estimate a GARCH(2,2) model deriving the following outputs:
Sample: 1975M01 2020M12
Included observations: 552
Convergence achieved after 24 iterations
Coefficient covariance computed using outer product of gradients
Presample variance: backcast (parameter = 0.7)
GARCH = C(2) + C(3)*RESID(-1)^2 + C(4)*RESID(-2)^2 + C(5)*GARCH(-1) + C(6)*GARCH(-2)
Variance Equation
[estimated coefficient table]
Geremia claims that the second, GARCH(2,2) model is superior to the original component
GARCH(1,1) because it fails to imply a highly variable but generally small transitory
component. Do you agree with Geremia? Are you worried about the negative estimated
coefficient obtained from the ML estimation of the GARCH(2,2) model?
Debriefing.
Geremia should refrain from using the estimated component GARCH(1,1) model because the
transitory component is poorly identified and its coefficients are either not significantly
estimated (at 0.190) or negative (at -0.089). As a result, the transitory component is highly
variable but generally small and seems to contribute to total variance mostly through jumps,
in a highly volatile rather than smooth way. Consequently, almost all of the filtered variance
consists of permanent variance, which is also problematic and unrealistic. Moreover, in spite of
the ML estimation under normality, the empirical distribution of the standardized residuals
fails to be normally distributed.
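A minimal sketch of the diagnostics mentioned above (a Ljung-Box test on the squared standardized residuals and a Jarque-Bera normality check), assuming the arch, statsmodels and scipy packages; the return series r is a simulated placeholder, and a plain GARCH(1,1) stands in for the C-GARCH(1,1) purely to keep the sketch simple.

import numpy as np
from arch import arch_model
from statsmodels.stats.diagnostic import acorr_ljungbox
from scipy import stats

rng = np.random.default_rng(3)
r = rng.standard_t(df=5, size=552)           # placeholder for the actual return series

# Fit a GARCH(1,1) by (quasi-)MLE under normality
fit = arch_model(r, vol="GARCH", p=1, q=1, dist="normal").fit(disp="off")
z = fit.std_resid                            # standardized residuals

# Remaining ARCH effects show up as autocorrelation in the squared standardized residuals
print(acorr_ljungbox(z ** 2, lags=[10, 20]))

# Normality of the standardized residuals (Jarque-Bera test)
print(stats.jarque_bera(z))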
However, Geremia's claim that the second, classical GARCH(2,2) model is superior to the
original component GARCH(1,1) model is absurd. First, we know that component GARCH(1,1)
models can be written as GARCH(2,2) models under restrictions. In this specific case, the
restrictions can hardly be binding, because the plots of the filtered variances under the
component GARCH(1,1) and the GARCH(2,2) models are basically identical (apart from very
high variances, in excess of 200, i.e., the two models just differ in the spikes in variance that
they forecast). Moreover, the histograms of the standardized residuals are practically identical.
The two models imply the same total variance, but the ML estimation remains a reason for
concern, as the models appear to be misspecified. Finally, there is some reason to be worried
about the negative coefficient in the GARCH(2,2) model, as a negative coefficient also appeared
in the component GARCH(1,1) specification; such negative estimates violate the sufficient (but
not necessary) conditions for positivity of the conditional variance.