Modern Time Series: Description, Prediction and Causality
Neil Shephard
December 8, 2023
©2023 Neil Shephard
Contents
1 Introduction 7
1.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.2 Describe . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.2.1 Lags and differences . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.3 Predict . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
1.4 Causality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
1.5 Recap . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
3 Stationarity 39
3.1 Strict stationarity and marginal distributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
3.2 Moving average process . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
3.3 Covariance stationarity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
3.3.1 Definition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
3.3.2 Examples of covariance stationarity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
3.3.3 Covariance stationarity and the MA(∞) . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
3.3.4 Covariance stationarity and the sample mean . . . . . . . . . . . . . . . . . . . . . . . . . 46
3.4 Method of moments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
3.4.1 Covariance stationary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
3.4.2 Inference under martingale differences . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
3.5 Invertibility . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
3.5.1 Invertibility of moving average . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
3.5.2 Lag operator and lag polynomial . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
3.6 Recap . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
4 Memory 61
4.1 Markov process . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
4.1.1 Basic case . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
4.1.2 Companion form and K-th order Markov processes . . . . . . . . . . . . . . . . . . . . . . 62
4.2 Autoregression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
4.2.1 AR(p) and VAR(p) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
4.2.2 Autoregressions, the MA(∞) and covariance stationarity . . . . . . . . . . . . . . . . . . 66
4.3 m-dependence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
4.3.1 Definition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
4.3.2 m-dependence CLT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
4.4 Integration and differencing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
4.4.1 Random walk and integration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
4.4.2 Cointegration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
4.4.3 Lévy, Brownian and Poisson processes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
4.5 Inference and linear autoregression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
4.5.1 Three versions of an autoregression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
4.5.2 Least squares and AR(1) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
4.5.3 Properties of $\hat\phi_{LS}$ . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
4.6 Recap . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
6 Linearity 119
6.1 Frequency domain time series . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119
6.1.1 A regression model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120
6.1.2 Some background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123
6.1.3 Writing the regression model using complex variables . . . . . . . . . . . . . . . . . . . . 128
6.2 Fourier transform of data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129
6.2.1 Core ideas . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129
6.2.2 Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 130
Introduction
1.1 Overview
Welcome to the cross-listed course on Time Series, which is labelled Stat242 and Econ2142. The course centers on three goals for time series data:
describing;
predicting;
understanding causality.
The main distinctive feature of time series is the role of the past and future, but many aspects of the above
goals carry over.
A central question is how data X in the past might help predict data Y in the future.
We will learn through:
simulation;
applications.
Throughout, an observed time series of length $T$ is written $y_1, \ldots, y_T$.
Remark 1. There are two academic disciplines which look at sequences of data in time: “time series” and “stochastic processes.” They share many features and it is rather sad there is not just a single topic. Courses and writing about time series tend to be rather more statistical and applied, while stochastic processes tend to be focused on probability theory. There are a couple of dedicated time series journals: the “Journal of Time Series Analysis” and the “Journal of Forecasting.” Of course you will see time series papers published in Nature, Econometrica, the Journal of Econometrics, etc. Leading stochastic process journals are “Stochastic Processes and Their Applications,” “Stochastics” and “Statistical Inference for Stochastic Processes,” while you will also see stochastic process papers in general probability journals like the “Annals of Probability” and the “Annals of Applied Probability.” There has been a surge of time series work on global warming type problems in the last decade, e.g. leading econometricians such as Jim Stock, Lars Hansen, David Hendry and Frank Diebold have been working in that area. More traditionally, time series is seen mostly in academic economics in finance and macroeconomics. Outside economics, it is widely seen in engineering (a modern example would be the automatic driving of cars).
I will teach a general time series course, but my applied interest is in economics. There will be a bias towards
economic problems, both in theory and applications. I tend to think of things from a time series viewpoint,
but for a time series person I nudge towards stochastic processes due to my background in continuous time
processes.
1.2 Describe
A basic way to describe a time series is through temporal averages such as
\[ \bar{y} = \frac{1}{T}\sum_{t=1}^{T} y_t, \]
or quantile versions.
In many problems it is helpful to transform the data, to make it more meaningful. As well as the usual log,
square root and other transforms we see in Introductory statistics, there are some new and vital time series
transforms. They appear all over the subject and have some specialized notation.
The lag operator $L$ is defined by
\[ L y_t = y_{t-1}. \]
Applied to anything indexed by time, it shifts the index back one period, e.g. $Lt = t-1$, $Lt^2 = (t-1)^2$. Applying it $S$ times gives $L^S y_t = y_{t-S}$.
The difference operator is
\[ \Delta y_t = y_t - y_{t-1} = (1 - L)y_t. \]
Logs (of strictly positive time series) and differences combine in important ways:
\[ \Delta \log y_t = \log(y_t/y_{t-1}) = \log\left(1 + \frac{y_t - y_{t-1}}{y_{t-1}}\right). \]
Now
\[ 100\,\frac{y_t - y_{t-1}}{y_{t-1}} \]
is the percentage change. If $\frac{y_t - y_{t-1}}{y_{t-1}}$ is small then, by a Taylor expansion,
\[ \Delta \log y_t \simeq \frac{y_t - y_{t-1}}{y_{t-1}}. \]
The $S$-period difference cumulates first differences:
\[ \Delta_2 y_t = y_t - y_{t-2} = \Delta y_t + \Delta y_{t-1}, \]
and more generally, for integer $S > 1$,
\[ \Delta_S y_t = y_t - y_{t-S}. \]
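To make these operators concrete, here is a minimal R sketch (the series y and its length are invented purely for illustration):
\begin{verbatim}
# hypothetical monthly series, strictly positive
set.seed(1)
y <- exp(cumsum(rnorm(120, sd = 0.01))) * 100

Ly  <- c(NA, y[-length(y)])               # lag operator: L y_t = y_{t-1}
dy  <- y - Ly                             # first difference: (1 - L) y_t
dly <- diff(log(y))                       # Delta log y_t = log(y_t / y_{t-1})
S   <- 12
dSy <- y[-(1:S)] - y[1:(length(y) - S)]   # Delta_S y_t = y_t - y_{t-S}

# Taylor check: Delta log y_t ~ (y_t - y_{t-1}) / y_{t-1} for small changes
max(abs(dly - (dy / Ly)[-1]))
\end{verbatim}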
[Figure 1.1: U.S. monthly consumer price index (CPI) plotted against time.]
Example 1.2.1. (Inflation) $P_t$ is the time-$t$ monthly price index, plotted in Figure 1.1 from 1947 (the non-seasonally adjusted U.S. consumer price index). Then
\[ 100\,\frac{P_t - P_{t-12}}{P_{t-12}} \]
is the annual percentage change, which is the traditional way of computing the time-$t$ inflation rate. This is not very pretty, as it is not the sum of the monthly inflation rates
\[ 100\,\frac{\Delta P_t}{P_{t-1}}, \]
but rather a weighted version of the monthly inflation (the weights are spelled out in Example 1.3.2). If annual inflation is low these weights should all be close to one, but for high inflation the weights could be substantially larger than one.
[Figure 1.2: LHS: Monthly geometric inflation in U.S.A. from 1947, $100\Delta\log P_t$. RHS: Annual geometric inflation in U.S.A. from 1947, $100\Delta_{12}\log P_t$.]
A mathematically more attractive way of thinking about the time series of inflation is to work with the geometric measure $100\Delta\log P_t$; since $\Delta_{12}\log P_t = \sum_{s=0}^{11}\Delta\log P_{t-s}$, monthly inflation aggregates to yearly inflation. The left hand side of Figure 1.2 shows $100\Delta\log P_t$, while the right hand side shows $100\Delta_{12}\log P_t$. Summary measures include the average inflation rate.
Recent inflation targeting has been at 2%, but inflation has been above that frequently:
\[ \frac{1}{T}\sum_{t=S+1}^{T} 1\left(100\,\Delta_S \log P_t > 2\right) \simeq 0.67. \]
Remark 2. I often download data through APIs. In R there is a nice package quantmod which allows access to FRED (the St. Louis Fed’s free economic database service), Google, Yahoo and many other databases. Once this is installed, the data can be obtained through commands such as
\begin{verbatim}
library(quantmod)
getSymbols("CPIAUCNS", src = "FRED")  # monthly U.S. CPI, not seasonally adjusted
tail(CPIAUCNS, 15)   # tail() gets us the last rows of the dataset
\end{verbatim}
Example 1.2.2. (Stock returns) Example 1.2.1 is also important in thinking about the price (plus any reinvested dividends) $P_t$ of a risky asset at time $t$. Then over $S$ periods
\[ 100\,\frac{P_t - P_{t-S}}{P_{t-S}} \]
is the return. Think of the case where
\[ \frac{\Delta P_t}{P_{t-1}} = 0.05 \quad\text{but}\quad \frac{\Delta P_{t-1}}{P_{t-2}} = -0.05; \]
then the impact of the weight $P_{t-1}/P_{t-2} < 1$ means that
\[ \frac{\Delta_2 P_t}{P_{t-2}} < 0. \]
Hence early negative returns outweighed later positive returns of the same size. This can be super confusing for an investor. Reporting returns through $100\Delta\log P_t$ removes this problem. These $\{100\Delta\log P_t\}$ are called geometric returns in finance — they sum up over time. The $\{100\,\Delta_S P_t/P_{t-S}\}$ are called arithmetic returns.
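The summing property is easy to check numerically. A small R sketch with made-up prices, mirroring the $\pm 5\%$ example above:
\begin{verbatim}
p <- c(100, 95, 99.75)                  # -5% then +5% arithmetic moves
arith <- 100 * diff(p) / p[-length(p)]  # arithmetic returns
geom  <- 100 * diff(log(p))             # geometric returns

sum(geom)                        # equals the two-period geometric return:
100 * (log(p[3]) - log(p[1]))
sum(arith)                       # does NOT equal the two-period arithmetic return:
100 * (p[3] - p[1]) / p[1]
\end{verbatim}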
1.3 Predict
A prime goal of applied researchers using time series data is to make forecasts — what comes next?
h 1.3.1. ChatGPT is a large language model, a version of natural language processing. The statistical basis of this is to take some text (user input) and to predict what would come next using a corpus of existing text (data). The intellectual origins of this are due to Shannon (1948), who thought of this as a time series type problem — predicting future text from past text.
A traditional way of thinking about this is to use data to statistically estimate the conditional expectation of the future given the past, e.g. $E[Y_{T+1} \mid Y_{1:T} = y_{1:T}]$. Of course, for some purposes computing the conditional median or alike might be better, but let us think about that later.
The conditional expectation sure looks like the kind of problem we saw in Introductory Statistics, where now $X$ is past data and $Y$ is future data. But there we saw many pairs of random variables $(X, Y)$ in order to produce a good $E[Y|X]$. In time series we do not have that replication, we only have one series. This is a fundamental challenge which means applied researchers are likely to have to make stronger assumptions than you see in Introductory Statistics.
Example 1.3.2. Think about predicting next month’s official inflation number. Recall from Example 1.2.1 it is
\[ \frac{\Delta_{12} P_t}{P_{t-12}} = \frac{\Delta P_t}{P_{t-1}} \times \frac{P_{t-1}}{P_{t-12}} + \frac{\Delta P_{t-1}}{P_{t-2}} \times \frac{P_{t-2}}{P_{t-12}} + \cdots + \frac{\Delta P_{t-11}}{P_{t-12}}. \]
This looks mighty complicated but we predict it given what we know at time $t-1$ — and we know a lot! Write $Y_t = \frac{P_t - p_{t-1}}{p_{t-1}}$, so that $y_{t-s} = \Delta p_{t-s}/p_{t-s-1}$ for $s = 1, 2, \ldots$. A famous forecast is
\[ y_{t-1} = \frac{\Delta p_{t-1}}{p_{t-2}}, \]
the last data point we saw. Of course many other forecasts are possible and we will explore some of them.
1.4 Causality
In Introductory Statistics a classical way of quantifying causality is to think about the instant impact on an outcome of taking action $a = 1$ instead of action $a = 0$ at time $t-s$. The corresponding pair of outcomes at time $t$ under these possible actions, $\{Y_{t,s}(0), Y_{t,s}(1)\}$ (here $Y_{t,s}(1)$ corresponds to action $a = 1$ having been taken at time $t-s$), are called lag-$s$ “potential outcomes” (the idea of potential outcomes is due to Neyman (1923)), while the lag-$s$ average treatment effect is the expectation of the random variable $Y_{t,s}(1) - Y_{t,s}(0)$.
In practice we cannot see both potential outcomes, so we have no choice but to select a single action $A_{t-s} \in \{0, 1\}$, and then see the outcome $Y_t = Y_{t,s}(A_{t-s})$. One way to progress is to use a lag-$s$ sequential randomization assumption. This is a simple assumption to write down; understanding it takes some thought! It will be discussed extensively later. Under it,
\[ E[Y_{t,s}(1)] = E[Y_t \mid A_{t-s} = 1] = \mu_{t,s}(1), \]
where, generically,
\[ \mu_{t,s}(a) = E[Y_t \mid A_{t-s} = a]. \]
1.5 Recap
Key concepts introduced in this chapter: lag operator; difference; difference operator; assignment; outcome; potential outcome.
Time Series Model
2.1 Introduction
Time series data is a sequence
\[ y_1, \ldots, y_T \]
ordered in time, which we might plot. But where does this data come from?
I will answer this question using the language of probability theory, defining a time series model. As usual, I will use upper case letters to denote random variables and lower case letters to denote data or arguments in integrals, etc., and bold for vectors and matrices. This can be confusing, as sometimes this notation will clash with the linear algebra convention of using capital letters for matrices and lower case for vectors and scalars. Further, I will use Y as my prime notation for a random variable, as is standard in statistics, while noting X is predominately used for the same purpose in probability theory.
h 2.1.1. The probabilistic viewpoint of time series is not the only viable approach. There is also an algorithmic
tradition of filters and optimization which intersects with the time series model viewpoint. I will discuss some
aspects of this in later Chapters.
The time series model views the data as a realization of a sequence of random variables. Sometimes it is
helpful to go outside the range of data we see, e.g. for the purposes of forecasting or asymptotic arguments.
Formally,
\[ \mathbf{y} = (y_1, \ldots, y_T) \]
is a single realization (draw) of length $T$ from the joint distribution of a sequence of random variables
\[ \mathbf{Y} = (Y_1, \ldots, Y_T), \]
ordered in time $t = 1, \ldots, T$. Write $\dim(y_t) = d_t \ge 1$. The sequence $\mathbf{Y}$ is written compactly as $\{Y_t\}_{t=1}^{T}$ and the subsequence from time $s$ to time $t$ as $Y_{s:t} = (Y_s, Y_{s+1}, \ldots, Y_t)$, where $s \in \{1, 2, \ldots, t\}$. Sometimes it is helpful to think of $\mathbf{Y}$ as a single snippet of a longer $\{Y_t\}_{t=1}^{T+s}$, where $s > 0$, or of an infinitely lived time series, written as $\{Y_t\}_{t=-\infty}^{\infty}$ or $\{Y_t\}_{t\in\mathbb{Z}}$.
Remark 3. [Calendar time] The definition of a time series model makes no reference to the calendar times at which the $Y_{1:T}$ appear, only that these random variables appear in sequence. In some applied contexts the times are crucial, e.g. the times at which trades in a financial market happen. In general we will denote the corresponding times as $\tau_1, \ldots, \tau_T$. Sometimes scientifically it is helpful to think of the times as random variables, and include these times as part of the time series itself, e.g. $Y_t = (X_t, \tau_t)$, where $X_t$ is bivariate, the price and volume of the $t$-th transaction, while $\tau_t$ is the time of the $t$-th trade. But for most problems this level of generality is not needed and the times are regarded as a deterministic sequence, e.g. every month, and $Y_t$ does not refer to the calendar time explicitly.
h 2.1.3. The infinitely lived version is rarely literally believable, at least in the social and medical sciences, e.g.
the GDP of Belgium literally cannot exist after the Earth is swallowed by the expanding Sun. It is important
to not unthinkingly use such an assumption.
Example 2.1.4. An example of the use of a longer series is to focus on the law of YT +1 | (Y1:T = y1:T ), the
distribution of the next variable in the sequence extrapolated beyond the single realization y1:T of Y1:T .
The definition of a time series model excludes the possibility that Yt is directly a text or an image, but data
of that kind can be potentially included by preprocessing the text or image into a real variable.
Example 2.1.5. ChatGPT output is a forecast from a very high-dimensional time series model, coding input text into numbers.
h 2.1.6. [Fundamental challenge of time series: no easy replication] $y_1$ is the single data point from the random variable $Y_1$. There is no other data point with the same law as $Y_1$ — unless some additional assumptions are made. Likewise the pair $y_{1:2}$ is the only pair from the joint law of $Y_{1:2}$. This lack of straightforward replication is the fundamental statistical challenge of time series and places it apart from introductory statistics. This is summarized by the warning:
There is no easy replication in time series without assumption
The ordering in time makes it distinct from other dependent data such as from networks, spatial structures and
images.
From the definition of expectations for continuous random variables, if it exists,
\[ E[g(Y_{t_1}, \ldots, Y_{t_p})] = \int g(y_1, \ldots, y_p)\, f_{Y_{t_1},\ldots,Y_{t_p}}(y_1, \ldots, y_p)\, dy_1 \cdots dy_p, \]
where $f_{Y_{t_1},\ldots,Y_{t_p}}$ is the density of $Y_{t_1}, \ldots, Y_{t_p}$ and $(t_1, \ldots, t_p) \in \{1, \ldots, T\}^p$. The equivalent result for discrete random variables replaces the integral by a sum.
More abstractly, the cumulative distribution function FY of Y is derived from the probability triple (Ω, F, P ).
An important condition, which appears frequently in model building and estimation, is the existence of means and variances for each $t = 1, \ldots, T$, e.g. $\sup_t E[|Y_t|] < \infty$ and $\sup_t E[Y_t^2] < \infty$ (“square integrable”). Here the sup is over time $t$, typically $t \in \{1, \ldots, T\}$, but for asymptotics it could be $t \in \mathbb{N}$. The space of square integrable random variables is sometimes written as $L^2(\Omega, \mathcal{F}, P)$. I will refer to these as $L^1$ and $L^2$ processes as shorthand.
h 2.1.8. In many application areas, time series data has very heavy tails and so it is important to be careful
about the existence of higher order moments.
The conditional expectation $E[Y_2 \mid Y_1 = y_1]$ is, for a single $y_1$, a real number. Calculating this conditional expectation for a variety of possible values of $y_1$ delivers
\[ E[Y_2 \mid Y_1], \]
which I now usually think of as a random variable, as this expectation is just a function of $Y_1$, having integrated out the randomness in $Y_2$. You will often see in these notes terms like
\[ E[Y_t \mid Y_{1:t-1}] \]
or
\[ f_{Y_t \mid Y_{1:t-1}}(y_t). \]
The former is a random variable, while the latter, evaluated at a real number $y_t$, is not! Once in a while I will get bored writing things out so carefully, and express the last term as $f_{Y_t \mid Y_{1:t-1}}(y_t)$ even when I read it as a constant, e.g. in expressing a likelihood or carrying out filtering. I hope you will forgive me for using this more compact notation, where I think the context should make things clear. This may be particularly tricky when I use a standard abstract notation for conditioning on the past, $\mathcal{F}_{t-1}$. Then I will typically think of
\[ E[Y_t \mid \mathcal{F}_{t-1}] \]
as a random variable.
Example 2.1.10. [Mean and variance of an average] For a time series model, define the statistic $\bar{Y} = T^{-1}\sum_{t=1}^T Y_t$. Then, if $\mathbf{Y}$ is in $L^1$, $E[\bar{Y}]$ exists and is
\[ E[\bar{Y}] = \frac{1}{T}\sum_{t=1}^T E[Y_t], \]
while if $\mathbf{Y}$ is in $L^2$,
\[ \mathrm{Var}(\bar{Y}) = \frac{1}{T^2}\sum_{s=1}^T\sum_{t=1}^T \mathrm{Cov}(Y_s, Y_t), \tag{2.1} \]
as $\mathrm{Cov}(Y_s, Y_t) = \mathrm{Cov}(Y_t, Y_s)^T$. Equation (2.1) is sometimes called Bienaymé’s identity. There is no reason to expect $\mathrm{Var}(\bar{Y})$ to go to $0_{d,d}$ as $T$ increases for time series models, nor for $E[\bar{Y}]$ to be scientifically interesting. Additional assumptions will be needed for this to be true.
Prediction plays a distinctive role in time series model building, statistical analysis, decision making and applied
work. In particular, one-step ahead prediction has many probabilistic features which make it a very close cousin
to the role of independence in introductory statistics.
Definition 2.2.1. [One-step ahead prediction] The law of the conditional random variable
\[ Y_t \mid (Y_{1:t-1} = y_{1:t-1}) \]
is called the (one-step ahead) time-$t$ predictive distribution of $Y_t$ given the past data $y_{1:t-1}$. This past data, or information set, plays a central role in time series.
Remark 4. The past data $y_{1:t-1}$ is often (particularly in probability theory) labeled the natural filtration and written (using $\sigma$ as the notation for a sigma-algebra) as
\[ \mathcal{F}^Y_{t-1} = \sigma(Y_{1:t-1}), \]
or just $\mathcal{F}_{t-1}$ if the context is obvious. Note $\mathcal{F}^Y_{t-1} \subseteq \mathcal{F}^Y_t = \sigma(Y_{1:t})$. The natural filtration is a special case of a general filtration or information set
\[ \mathcal{F}^X_{t-1} = \sigma(X_{1:t-1}). \]
For $Y_t$ to be adapted with respect to $\mathcal{F}^X_t$, there must exist a function $h_t$ such that
\[ Y_t = h_t(X_{1:t}), \]
i.e. glimpsing $X_{1:t}$ is enough to determine $Y_t$. Thus if $Y_t$ is adapted to $\mathcal{F}^X_t$, then $\mathcal{F}^Y_{t-1} \subseteq \mathcal{F}^X_{t-1}$. For the process $\{Z_t\}_{t\ge 1}$ to be previsible (sometimes the label “predictable” is used instead) the $Z_t$ must be determined by $\mathcal{F}^X_{t-1}$ for each $t \ge 1$.
Recall, the joint distribution being equal to the product of marginal distributions is the defining characteristic of independence. In a time series, the sequence itself is not independent, but its one-step-ahead predictions
\[ Y_t \mid Y_{1:t-1}, \quad t = 1, \ldots, T, \]
have an independence-like feature: the joint density is the product of the one-step ahead predictive densities,
\[ f_{Y_{1:T}}(y_{1:T}) = \prod_{t=1}^T f_{Y_t \mid Y_{1:t-1}}(y_t). \tag{2.2} \]
The result (2.2) is profoundly important in the study of time series.
Example 2.2.2. Write $Y_t \mid (Y_{1:t-1} = y_{1:t-1}) \sim N(\phi y_{t-1}, \sigma^2)$, a Gaussian autoregression model where the parameters are $\phi, \sigma^2$. Write $Y_t \mid (Y_{1:t-1} = y_{1:t-1}) \sim N(0, \alpha + \beta y_{t-1}^2)$, a Gaussian autoregressive conditional heteroskedasticity model (Engle (1982)) where the parameters are $\alpha, \beta$.
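Both models can be simulated by drawing recursively from their one-step ahead predictive distributions. A minimal R sketch; the parameter values are illustrative only:
\begin{verbatim}
set.seed(42)
T <- 1000
phi <- 0.9; sigma <- 1            # Gaussian autoregression
alpha <- 0.2; beta <- 0.7         # Gaussian ARCH
y_ar <- numeric(T); y_arch <- numeric(T)
for (t in 2:T) {
  y_ar[t]   <- rnorm(1, mean = phi * y_ar[t - 1], sd = sigma)
  y_arch[t] <- rnorm(1, mean = 0, sd = sqrt(alpha + beta * y_arch[t - 1]^2))
}
\end{verbatim}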
Following the logic expressed in introductory statistics, it is often convenient, both mathematically and computationally, to work with logs of the joint density
\[ \log f_{Y_{1:T}}(y_{1:T}) = \log f_{Y_{1:T-1}}(y_{1:T-1}) + \log f_{Y_T \mid Y_{1:T-1}}(y_T) = \sum_{t=1}^T \log f_{Y_t \mid Y_{1:t-1}}(y_t) \]
rather than the joint density itself, or in ratios, comparing two joint densities $g$ and $f$ for $Y_{1:T}$:
\[ LR_T := \frac{g_{Y_{1:T}}(y_{1:T})}{f_{Y_{1:T}}(y_{1:T})} = LR_{T-1} \times \Lambda(G\|F)_T, \quad \text{where } \Lambda(G\|F)_t := \frac{g_{Y_t \mid Y_{1:t-1}}(y_t)}{f_{Y_t \mid Y_{1:t-1}}(y_t)}, \]
if $f_{Y_{1:T}}(y_{1:T}) > 0$ for all $y_{1:T}$ (more formally, $f$ absolutely dominates $g$). The sequence $\{LR_t\}$ satisfies
\[ E[LR_t \mid Y_{1:t-1} = y_{1:t-1}] = LR_{t-1} \tag{2.4} \]
and is an example of a fundamental sequence in time: a martingale. This section defines these objects and studies some of their properties. Before we start, note that it is traditional to write (2.4) more compactly, removing the subscripts and using filtration notation, as
\[ E[LR_t \mid \mathcal{F}_{t-1}] = LR_{t-1}. \]
Definition 2.3.1. The sequence $\{Y_t\}_{t\in\mathbb{N}_{>0}}$ is a martingale process with respect to an adapted filtration $\{\mathcal{F}_t\}_{t\in\mathbb{N}_{>0}}$ if both (a) $E[|Y_t|] < \infty$ and (b) $E[Y_t \mid \mathcal{F}_{t-1}] = Y_{t-1}$ hold for every $t = 1, 2, \ldots$. If (b) switches to $E[Y_t \mid \mathcal{F}_{t-1}] = 0$, then it is a martingale difference. If (b) switches to $E[Y_t \mid \mathcal{F}_{t-1}] \le Y_{t-1}$, then it is a supermartingale. If (b) switches to $E[Y_t \mid \mathcal{F}_{t-1}] \ge Y_{t-1}$, then it is a submartingale.
Example 2.3.2. The process $\left\{\frac{g_{Y_{1:t}}(Y_{1:t})}{f_{Y_{1:t}}(Y_{1:t})}\right\}_{t\in\mathbb{N}_{>0}}$ is a martingale process with respect to the filtration generated by $\{Y_t\}_{t\in\mathbb{N}_{>0}}$, noting that $\frac{g_{Y_{1:t}}(Y_{1:t})}{f_{Y_{1:t}}(Y_{1:t})} \ge 0$ and
\[ E\left[\frac{g_{Y_{1:t}}(Y_{1:t})}{f_{Y_{1:t}}(Y_{1:t})}\right] = 1, \quad \text{for every } t = 1, 2, \ldots. \]
Definition 2.3.3. In the martingale difference case of Definition 2.3.1, if (a) is strengthened to $\sup_t E[Y_t^2] < \infty$, then $\{Y_t\}_{t\in\{1,\ldots,T\}}$ is a MD in $L^2$.
h 2.3.4. You might think that being i.i.d. and symmetrically distributed about 0 is enough to be a martingale difference — but this is not true (though i.i.d. plus $E[Y_t] = 0$ is enough). For example, if $\{Y_t\}_{t\in\mathbb{N}_{>0}}$ is an i.i.d. Cauchy sequence, then it is not a martingale difference sequence as $E[|Y_t|] = \infty$. Heavy tailed processes play a much more prominent role in time series than in introductory statistics: it is important to keep track of this seemingly technical condition.
Martingales and martingale differences have many important properties. Here we focus on five.
MD and correlation
Here is the first. It is my favourite! Recall that previsible means that Ct ∈ Ft−1 , i.e. it is part of the past.
Theorem 2.3.5. If $\{Y_t\}_{t\in\mathbb{N}_{>0}}$ is a martingale difference with respect to the filtration $\{\mathcal{F}_t\}_{t\in\mathbb{N}_{>0}}$, and the scalar $C_t$ is previsible with $E[|C_t Y_t|] < \infty$, then, for every $t$,
\[ E[C_t Y_t] = 0. \]
The most famous special case of this is where Ct = 1, which implies that for all MDs (as E[|Yt |] < ∞ is part
of the definition of a MD)
E[Yt ] = 0.
As $E[Y_t] = 0$, the Theorem means that $Y_t$ is uncorrelated (but not independent!) with the past:
\[ \mathrm{Cov}(C_t, Y_t) = 0, \]
so long as $E[|C_t Y_t|] < \infty$. The most used case of this is $C_t = Y_{t-s}$ for a fixed integer $s > 0$. This means, for every $t$,
\[ \mathrm{Cov}(Y_{t-s}, Y_t) = 0, \]
so long as $E[|Y_t Y_{t-s}|] < \infty$. By Hölder’s inequality $E[|Y_{t-s} Y_t|] \le E[|Y_t|^2]^{1/2} E[|Y_{t-s}|^2]^{1/2}$, so for MDs in $L^2$, $\mathrm{Cov}(Y_{t-s}, Y_t) = 0$ always holds.
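A quick simulation illustrates the uncorrelated-but-dependent nature of martingale differences, using a Gaussian ARCH-type sequence (parameter values chosen purely for illustration):
\begin{verbatim}
set.seed(7)
T <- 50000
y <- numeric(T)
for (t in 2:T) y[t] <- rnorm(1, 0, sqrt(0.2 + 0.5 * y[t - 1]^2))
cor(y[-1], y[-T])        # ~ 0: Cov(Y_{t-1}, Y_t) = 0
cor(y[-1]^2, y[-T]^2)    # clearly positive: not independent
\end{verbatim}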
The next result is remarkable. It shows that martingales reproduce when combined with previsible processes. This result plays a huge role in the theory of finance and the theory of gambling.
Theorem 2.3.6. [Martingale transform theorem] Assume that (i) $\{C_t, Y_t\}_{t\in\mathbb{N}_{>0}}$ are adapted with respect to $\{\mathcal{F}_t\}_{t\in\mathbb{N}_{>0}}$; (ii) $\{Y_t\}_{t\in\mathbb{N}_{>0}}$ is a martingale with respect to $\{\mathcal{F}_t\}_{t\in\mathbb{N}_{>0}}$; (iii) $\{C_t\}_{t\in\mathbb{N}_{>0}}$ is previsible and bounded, that is $|C_t| < c$ for all $t \in \mathbb{N}_{>0}$. Then
\[ M_t := \sum_{j=1}^t C_j \Delta Y_j = (C \cdot Y)_t \]
is a martingale with respect to $\{\mathcal{F}_t\}_{t\in\mathbb{N}_{>0}}$.
Proof. Bounded $\{C_t\}_{t\in\mathbb{N}_{>0}}$ means that $E[|C_t \Delta Y_t|] < \infty$; the only issue is the martingale property, which follows as $E[M_t \mid \mathcal{F}_{t-1}] = M_{t-1} + C_t E[\Delta Y_t \mid \mathcal{F}_{t-1}] = M_{t-1}$, using that $C_t \in \mathcal{F}_{t-1}$.
Example 2.3.7. [Investing] Let the value of a risky asset be
\[ Y_t = Y_{t-1}\eta_t, \quad \eta_t \text{ i.i.d.}, \quad E[\eta_t] = 1, \quad Y_0 > 0, \quad \eta_t \ge 0, \quad t = 1, 2, \ldots. \]
Then
\[ E[Y_t \mid \mathcal{F}_{t-1}] = Y_{t-1} E[\eta_t] = Y_{t-1}, \]
so it is a martingale. Let $C_t$ be the (bounded) share of the risky asset at time $t$ you hold! Then $C_t$ must be decided at time $t-1$, so $C_t \in \mathcal{F}_{t-1}$, i.e. it is previsible (this rules out knowledge of $\eta_t$ when the investment is made; that would be insider knowledge!). Thus the change in the value of your investment during the $t$-th period is
\[ C_t \Delta Y_t, \]
and the accumulated gain is $M_t = \sum_{j=1}^t C_j \Delta Y_j$, which is a martingale with respect to $\{\mathcal{F}_t\}$. The main beauty of this result is it abstracts from the details of the investment strategy: all that matters is the timing of the investment — it is previsible.
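A simulation sketch of the point: whatever bounded previsible rule is used, the average terminal gain is (numerically) zero. The trading rule below is invented purely for illustration:
\begin{verbatim}
set.seed(3)
n_paths <- 20000; T <- 50
MT <- replicate(n_paths, {
  eta <- exp(rnorm(T, -0.5 * 0.04, 0.2))   # E[eta_t] = 1, eta_t >= 0
  Y <- 100 * cumprod(eta)                  # risky asset value, a martingale
  Ylag <- c(100, Y[-T])
  C <- as.numeric(Ylag > 100)              # previsible rule: hold iff above start
  sum(C * (Y - Ylag))                      # M_T = sum C_j * Delta Y_j
})
mean(MT)                                   # ~ 0
\end{verbatim}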
Here is the third property. An implication of the Cov(Yt−s , Yt ) = 0 result is a weak law of large numbers for
time series.
Theorem 2.3.8. [WLLN for MDs] If $\{Y_t\}_{t\in\mathbb{N}}$ is a martingale difference in $L^2$ with respect to the filtration $\{\mathcal{F}_t\}_{t\in\mathbb{N}_{>0}}$, then
\[ E[\bar{Y}] = 0, \qquad \mathrm{Var}(\bar{Y}) = \frac{1}{T^2}\sum_{t=1}^T \mathrm{Var}(Y_t). \tag{2.6} \]
Then as $T \to \infty$,
\[ \bar{Y} \overset{p}{\to} 0. \]
Proof. Go back to the variance of the sample average (2.1) and apply the $\mathrm{Cov}(Y_{t-s}, Y_t) = 0$ result for martingale differences in $L^2$. This yields (2.6). The property that $\sup_t \mathrm{Var}(Y_t) < \infty$ comes from the $L^2$ assumption, so $\mathrm{Var}(\bar{Y})$ is driven to 0. Then Chebyshev’s inequality yields the convergence result.
Note, of course,
\[ \bar{Y} = \frac{1}{T}\sum_{t=1}^T Y_t = \frac{M_T}{T}, \qquad M_t = \sum_{j=1}^t Y_j, \]
where $\{M_t\}$ is a martingale, so the result can be restated as
\[ \frac{1}{t} M_t \overset{p}{\to} 0. \]
Doob decomposition
Here is the fourth property. How do martingales relate to a broader time series $\{Y_t\}_{t\in\mathbb{N}_{>0}}$ which is adapted to the filtration $\{\mathcal{F}_t\}_{t\in\mathbb{N}_{>0}}$ and has $E[|Y_t|] < \infty$ for all $t$? The Doob (1953) decomposition says that
\[ Y_t = A_t + M_t, \qquad A_t = \sum_{j=1}^t \left\{E[Y_j \mid \mathcal{F}_{j-1}] - Y_{j-1}\right\}, \qquad M_t = Y_0 + \sum_{j=1}^t \left\{Y_j - E[Y_j \mid \mathcal{F}_{j-1}]\right\}, \]
almost surely, where the splitting into $\{A_t\}_{t\in\mathbb{N}_{>0}}$ and $\{M_t\}_{t\in\mathbb{N}_{>0}}$ is unique. The $\{M_t\}_{t\in\mathbb{N}_{>0}}$ is a martingale with $M_0 = Y_0$, and the $\{A_t\}_{t\in\mathbb{N}_{>0}}$ is previsible. It is sometimes useful to call $\{E[Y_t \mid \mathcal{F}_{t-1}] - Y_{t-1}\}_{t\in\mathbb{N}_{>0}}$ the drift process.
Why does this hold? It is by iteration, unwinding the process. The first step is to write
\[ Y_t = Y_{t-1} + \left\{E[Y_t \mid \mathcal{F}_{t-1}] - Y_{t-1}\right\} + \left\{Y_t - E[Y_t \mid \mathcal{F}_{t-1}]\right\}; \]
repeating the process, working on the $Y_{t-1}$ term, yields the desired result. This decomposition can be shown to be unique, but I will not give a proof of that here.
Why is this useful? Every time series with $E[|Y_t|] < \infty$ has a martingale component and another important term. We have tools to understand the behaviour of martingales, which can simplify the problem; e.g. by the WLLN for martingale differences in $L^2$,
\[ \frac{Y_t - A_t}{t} = \frac{M_t}{t} \overset{p}{\to} 0. \]
Example 2.3.9. Let $Y_t = \phi Y_{t-1} + \varepsilon_t$, where $\{\varepsilon_t\}_{t\in\mathbb{N}_{>0}}$ is a martingale difference sequence with respect to $\{\mathcal{F}_t\}_{t\in\mathbb{N}_{>0}}$; then $E[Y_t \mid \mathcal{F}_{t-1}] = \phi Y_{t-1}$ and so the Doob decomposition writes
\[ A_t = \sum_{j=1}^t (\phi - 1) Y_{j-1}, \qquad M_t = Y_0 + \sum_{j=1}^t \varepsilon_j. \]
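Example 2.3.9 can be checked numerically, computing $A_t$ and $M_t$ from a simulated autoregression and verifying $Y_t = A_t + M_t$. A sketch with illustrative parameters:
\begin{verbatim}
set.seed(11)
T <- 200; phi <- 0.8
eps <- rnorm(T)
y <- filter(eps, phi, method = "recursive")   # Y_t = phi Y_{t-1} + eps_t, Y_0 = 0
ylag <- c(0, y[-T])
A <- cumsum((phi - 1) * ylag)                 # previsible part
M <- 0 + cumsum(eps)                          # martingale part (Y_0 = 0 here)
max(abs(y - (A + M)))                         # numerically zero
\end{verbatim}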
Doob’s inequalities
Here is the fifth property. Let $\{Y_t\}$ be a martingale; then Doob’s inequalities are
\[ P\left(\sup_{s\le t} |Y_s| \ge c\right) \le \frac{E[Y_t^2]}{c^2}, \tag{2.7} \]
and
\[ E\left[\sup_{s\le t} Y_s^2\right] \le 4\, E[Y_t^2]. \]
The latter is sometimes called Doob’s maximal quadratic inequality. The proof of (2.7) is beyond these lectures.
Example 2.3.10. Suppose $Y_t = \sum_{j=1}^t \varepsilon_j$, where $\{\varepsilon_t\}$ is i.i.d., zero mean and variance $\sigma^2$. Then $\{Y_t\}$ is a martingale and
\[ P\left(\sup_{s\le t} |Y_s| > c\right) \le \frac{t\sigma^2}{c^2}, \qquad E\left[\sup_{s\le t} Y_s^2\right] \le 4 t \sigma^2. \]
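Doob’s maximal quadratic inequality in Example 2.3.10 is easy to probe by simulation; a minimal R sketch:
\begin{verbatim}
set.seed(5)
n <- 100; sigma <- 1
maxsq <- replicate(20000, max(cumsum(rnorm(n, 0, sigma))^2))
mean(maxsq)          # well below the bound
4 * n * sigma^2      # the Doob bound, 4 t sigma^2
\end{verbatim}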
Many dynamic economics and engineering problems are phrased in terms of taking actions to maximize expected utility, given the current information set. Leading cases are consumption based models and dynamic portfolio analysis in finance, as well as problems of control and reinforcement learning. Here I will discuss this abstractly: at time $t-1$ an action $a_{t-1} \in \mathcal{A}$ is taken, generating the random time-$t$ outcome
\[ Y_t(a_{t-1}) \]
(hence the outcome is a random function of $a_{t-1}$ and is not interfered with by $a_{1:t-2}$). Associated with each level of the time-$t$ outcome $y_t \in \mathcal{Y}$ is the deterministic time-$t$ utility function $U_t(y_t)$, so the conditional expected utility $E[U_t(Y_t(a_{t-1})) \mid \mathcal{F}_{t-1}]$ varies with the action $a_{t-1}$. Throughout assume that $E[|U_t(Y_t(a_{t-1}))|] < \infty$ uniformly over $a_{t-1} \in \mathcal{A}$.
Assume that $U_t$ is strictly concave and $Y_t(a_{t-1})$ is linear in $a_{t-1}$, and note that almost surely
\[ \frac{\partial^2 U_t(Y_t(a_{t-1}))}{\partial a_{t-1}^2} = \left.\frac{\partial^2 U_t(y_t)}{\partial y_t^2}\right|_{y_t = Y_t(a_{t-1})} \times \left(\frac{\partial Y_t(a_{t-1})}{\partial a_{t-1}}\right)^2 < 0. \]
Thus $E[U_t(Y_t(a_{t-1})) \mid \mathcal{F}_{t-1}]$ is strictly concave in the action $a_{t-1}$, with a unique maximizer $\hat{a}_{t-1}$ given by the turning point
\[ \left.\frac{\partial E[U_t(Y_t(a_{t-1})) \mid \mathcal{F}_{t-1}]}{\partial a_{t-1}}\right|_{a_{t-1} = \hat{a}_{t-1}} = \frac{\partial E[U_t(Y_t) \mid \mathcal{F}_{t-1}]}{\partial a_{t-1}} = 0, \]
where we write $Y_t = Y_t(\hat{a}_{t-1})$. Hence
\[ E[U_t' \mid \mathcal{F}_{t-1}] = 0, \quad \text{where } U_t' = \frac{\partial U_t(Y_t)}{\partial a_{t-1}}, \]
that is $U_t'$, the marginal conditional expected utility from action $a_{t-1}$ evaluated at $\hat{a}_{t-1}$, is a martingale difference sequence with respect to the filtration.
Example 2.3.11. [Discrete time version of Merton (1969)] Think of the portfolio allocation problem, holding $a_{t-1}$ units in the risky asset and $1 - a_{t-1}$ in the riskless asset. Then the portfolio at time $t$ is worth
\[ Y_t(a_{t-1}) = (1 + r) + a_{t-1}(R_t - r), \]
where $R_t$ is the time-$t$ return on the risky asset and $r$ is the return on the riskless asset. To get analytic results, assume utility from the portfolio has constant absolute risk aversion (CARA),
\[ U_t(y_t) = \frac{1 - e^{-\gamma y_t}}{\gamma}, \qquad \gamma \ne 0. \]
Then $\hat{a}_{t-1}$ is found by selecting $a_{t-1}$ to minimize (as log transforms are 1-1) $\log E[e^{-\gamma Y_t(a_{t-1})} \mid \mathcal{F}_{t-1}]$, noting that
\[ \frac{\partial E[U_t(Y_t(a_{t-1})) \mid \mathcal{F}_{t-1}]}{\partial a_{t-1}} = -\frac{1}{\gamma}\frac{\partial E[e^{-\gamma Y_t(a_{t-1})} \mid \mathcal{F}_{t-1}]}{\partial a_{t-1}} = -\frac{1}{\gamma} E[e^{-\gamma Y_t(a_{t-1})} \mid \mathcal{F}_{t-1}] \times \frac{\partial \log E[e^{-\gamma Y_t(a_{t-1})} \mid \mathcal{F}_{t-1}]}{\partial a_{t-1}}. \]
Hence
\[ E[U_t' \mid \mathcal{F}_{t-1}] = 0 \iff \left.\frac{\partial \log E[e^{-\gamma Y_t(a_{t-1})} \mid \mathcal{F}_{t-1}]}{\partial a_{t-1}}\right|_{a_{t-1} = \hat{a}_{t-1}} = 0. \]
If $R_t \mid \mathcal{F}_{t-1} \sim N(\mu_t, \sigma_t^2)$, then using the moment generating function of a normal,
\[ \log E[e^{-\gamma Y_t(a_{t-1})} \mid \mathcal{F}_{t-1}] = -\gamma(1+r) - \gamma a_{t-1}(\mu_t - r) + \gamma^2 a_{t-1}^2 \sigma_t^2/2, \]
\[ \frac{\partial \log E[e^{-\gamma Y_t(a_{t-1})} \mid \mathcal{F}_{t-1}]}{\partial a_{t-1}} = -\gamma(\mu_t - r) + \gamma^2 a_{t-1}\sigma_t^2, \]
so
\[ \hat{a}_{t-1} = \frac{\mu_t - r}{\gamma \sigma_t^2}, \]
implying the excess return on the wealth portfolio, beyond the risk free rate, is
\[ \left\{Y_t(\hat{a}_{t-1}) - (1+r)\right\} \mid \mathcal{F}_{t-1} \sim N\left(\frac{(\mu_t - r)^2}{\gamma \sigma_t^2}, \frac{(\mu_t - r)^2}{\gamma^2 \sigma_t^2}\right). \]
Thus the marginal conditional expected utility from investing Ut′ is a martingale difference sequence, not the
returns.
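The closed form for $\hat{a}_{t-1}$ is trivially computed; a tiny R sketch with invented values of $\mu_t$, $r$, $\gamma$ and $\sigma_t^2$:
\begin{verbatim}
mu <- 0.08; r <- 0.02; gamma <- 3; sigma2 <- 0.04
a_hat <- (mu - r) / (gamma * sigma2)
a_hat                         # hold 0.5 of wealth in the risky asset
\end{verbatim}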
Bayes theorem is simple but extraordinary: combining prior knowledge of $\theta$ and the likelihood function, the result is a posterior, reflecting rational knowledge about $\theta$ gleaned from the data. But it is not that easy to use: it needs the specification of the prior and the likelihood and then the use of computer resources to manipulate the posterior.
The following is a partial shortcut, available without explicitly specifying the prior or likelihood! The sequence of posterior means
\[ M_t = E[\theta \mid Y_{1:t}] \]
is a martingale (Doob (1949), Miller (2018)). This is a remarkable and inspiring result. It has a direct proof:
\[ M_{t-1} := E[\theta \mid Y_{1:t-1}] = E[E[\theta \mid Y_{1:t}] \mid Y_{1:t-1}] = E[M_t \mid Y_{1:t-1}], \]
as required, by the law of iterated expectations. More generally, $E[h(\theta) \mid Y_{1:t}]$ is a martingale if $E[|h(\theta)|] < \infty$. An important example of this is $E[1(\theta \le c) \mid Y_{1:t}]$, the posterior cumulative distribution function evaluated at the constant $c$.
This is a key result in time series practice. Repeated forecasts of the same object, e.g. daily nowcast updates of GDP in 2024Q3, should form a martingale sequence.
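The martingale property of posterior means can be watched directly in the conjugate Bernoulli case, where $E[\theta \mid y_{1:t}]$ has the standard Beta-posterior closed form. A minimal sketch:
\begin{verbatim}
set.seed(9)
theta <- rbeta(1, 1, 1)                 # draw the "true" theta from a uniform prior
y <- rbinom(300, 1, theta)
M <- (1 + cumsum(y)) / (2 + seq_along(y))  # posterior mean under Beta(1,1) prior
plot(M, type = "l", xlab = "t", ylab = "posterior mean")
\end{verbatim}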
h 2.3.12. $E[h(\theta) \mid Y_{1:t}]$ is not everything. Assume the prior has $E[\theta^2] < \infty$; then the posterior variance using information at time $t-1$ is
\begin{align*}
V_{t-1} &= E[(\theta - M_{t-1})^2 \mid Y_{1:t-1}], \quad \text{recalling } M_{t-1} = E[\theta \mid Y_{1:t-1}], \\
&= E[E[(\theta - M_{t-1})^2 \mid Y_{1:t}] \mid Y_{1:t-1}] \\
&= E[V_t \mid Y_{1:t-1}] + E[(\Delta M_t)^2 \mid Y_{1:t-1}], \qquad \Delta M_t = M_t - M_{t-1}. \tag{2.8}
\end{align*}
Rearranging, $E[V_t \mid Y_{1:t-1}] = V_{t-1} - E[(\Delta M_t)^2 \mid Y_{1:t-1}] \le V_{t-1}$, which implies the posterior variance is a supermartingale (but $V_t$ is not, in general, less than or equal to $V_{t-1}$ — which superficially you might think would hold). Likewise, posterior quantiles will not, in general, be martingales.
The cumulative sum of term (2.8) in Biohazard 2.3.12 plays an important role in probability theory and statistics.
Definition 2.3.13. [Quadratic variation] For a martingale $\{Y_t\}_{t=1}^\infty$ on the filtration $\{\mathcal{F}_t\}_{t=1}^\infty$ whose differences are MDs in $L^2$,
\[ \langle Y, Y \rangle_t = \sum_{j=1}^t E[(\Delta Y_j)^2 \mid \mathcal{F}_{j-1}] \]
is called the predictable quadratic variation (predictable QV) of $\{Y_t\}_{t=1}^\infty$, or the “angle-bracket” process in probability theory. Crucially $\langle Y, Y \rangle_t \in \mathcal{F}_{t-1}$, that is, it is previsible, and each term is finite due to the $L^2$ assumption.
The predictable QV drives a version of the strong law of large numbers (SLLN) for martingales.
Theorem 2.3.14. [SLLN for MDs] If $\{Y_t\}_{t=1}^\infty$ is a martingale and $Y_0 = 0$, then
\[ \frac{Y_t}{\langle Y, Y \rangle_t} \overset{a.s.}{\to} 0 \]
for every such martingale where, almost surely, $\langle Y, Y \rangle_t \to \infty$.
Example 2.3.15. [Regression] Think about the bivariate time series $\{X_t, Y_t\}_{t\in\mathbb{N}_{>0}}$. Suppose $E[Y_t \mid X_t, \mathcal{F}_{t-1}] = \beta X_t$ and construct
\[ U_t = Y_t - \beta X_t, \]
implying $E[U_t \mid X_t, \mathcal{F}_{t-1}] = 0$, where the filtration $\mathcal{F}_t$ is generated by $\{X_t, Y_t\}_{t\in\mathbb{N}_{>0}}$. The least squares estimator is
\[ \hat{\beta}_T = \frac{\sum_{t=1}^T X_t Y_t}{\sum_{t=1}^T X_t^2} = \beta + \frac{\sum_{t=1}^T X_t U_t}{\sum_{t=1}^T X_t^2}. \]
By iterated expectations,
\[ E[X_t U_t \mid \mathcal{F}_{t-1}] = 0. \]
Thus $\{X_t U_t\}_{t\in\mathbb{N}_{>0}}$ is a MD sequence. If $E[(X_t U_t)^2] < \infty$ for all $t$, then $\{X_t U_t\}_{t\in\mathbb{N}_{>0}}$ is a MD in $L^2$ and so $\langle XU, XU \rangle_t = \sum_{j=1}^t E[(X_j U_j)^2 \mid \mathcal{F}_{j-1}]$. Then by the SLLN for MDs
\[ \frac{\sum_{t=1}^T X_t U_t}{\langle XU, XU \rangle_T} \overset{a.s.}{\to} 0, \]
so long as $\langle XU, XU \rangle_T \to \infty$.
Martingale differences are also at the center of the Brown (1971) basic central limit theorem (CLT) for martingales.
Theorem 2.3.16. [CLT for MDs] If $\{Y_t\}_{t=1}^\infty$ is a martingale, define $S_t^2 = E[\langle Y, Y \rangle_t]$ and assume that as $t \to \infty$
\[ \frac{\langle Y, Y \rangle_t}{S_t^2} \overset{p}{\to} 1, \tag{2.9} \]
and that the Lindeberg-type condition holds:
\[ \frac{1}{S_t^2} \sum_{j=1}^t E\left[(\Delta Y_j)^2\, 1_{(\Delta Y_j)^2 \ge \epsilon S_t^2}\right] \to 0, \quad \text{for all } \epsilon > 0. \tag{2.10} \]
Then
\[ \frac{Y_t}{S_t} \overset{d}{\to} N(0, 1). \]
Proof. See Brown (1971).
Equation (2.9) requires that the ratio of $\sum_{j=1}^t E[(\Delta Y_j)^2 \mid \mathcal{F}_{j-1}]$ to its unconditional mean
\[ S_t^2 = E[\langle Y, Y \rangle_t] = \sum_{j=1}^t E[(\Delta Y_j)^2] = \sum_{j=1}^t \mathrm{Var}(\Delta Y_j) \]
converges to one when the sample size is large. This should happen if the $E[(\Delta Y_j)^2 \mid \mathcal{F}_{j-1}]$ has only a moderate degree of memory through time — we will formalize this later.
Equation (2.10) appears in the Lindeberg CLT for independent but not identically distributed random variables — time plays no special role here. The Lindeberg-type condition stops a few data points (as the data is not identically distributed through time) from dominating the average. The Lindeberg condition is implied by Lyapunov’s condition, which is that for some $\delta > 0$ it is possible to show that $S_t^{-(2+\delta)} \sum_{j=1}^t E[|\Delta Y_j|^{2+\delta}] \to 0$. One way to satisfy this is to assume that $E[|\Delta Y_t|^{2+\delta}] < \infty$ and $d < E[\Delta Y_t^2] < \infty$, for some $d > 0$.
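A simulation sketch of Theorem 2.3.16 for an ARCH-type martingale difference, standardizing by an approximation to $S_T$ (for a Gaussian ARCH(1) with coefficient $\beta$ the unconditional variance is $\alpha/(1-\beta)$; parameter values are illustrative):
\begin{verbatim}
set.seed(2)
T <- 500; reps <- 2000
z <- replicate(reps, {
  y <- numeric(T)
  for (t in 2:T) y[t] <- rnorm(1, 0, sqrt(0.3 + 0.5 * y[t - 1]^2))
  sum(y) / sqrt(T * 0.3 / (1 - 0.5))   # S_T^2 ~ T * unconditional variance
})
c(mean(z), var(z))                      # ~ (0, 1)
\end{verbatim}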
Assume a univariate $Y_t$ comes from a continuous $F_{Y_t \mid Y_{1:t-1}}$, then evaluate $F_{Y_t \mid Y_{1:t-1}}$ at the random $Y_t$. This yields
\[ U_t = F_{Y_t \mid Y_{1:t-1}}(Y_t), \tag{2.11} \]
a standard uniform, using the universality of the uniform result (e.g. Section 5.3 of Blitzstein and Hwang (2019)). The transform (2.11) is often called the probability integral transform. The joint density of the sequence $U_1, \ldots, U_T$ is
\[ f_{U_{1:T}}(u_{1:T}) = \prod_{t=1}^T f_{U_t}(u_t), \tag{2.12} \]
so $U_1, \ldots, U_T$ are independent standard uniforms (and so any statistic $T(U_{1:T})$ is a pivot), where the independence is due to the product (2.12).
At a high level this means that one-step ahead predictions reveal a pure source of replication in time series, $U_{1:T}$, providing an answer to Biohazard 2.1.6. Of course, forming these predictions in practice involves some form of modeling, so it is not easy. This contrasts with Introductory Statistics problems where blocks of data are replicated by assumption.
Example 2.4.1. An early use of the i.i.d. uniformity of $U_1, \ldots, U_T$ was in checking weather forecasting (e.g. a model forecasts rain from 10–11am with a 6% chance; then check all model forecasts through time by asking if forecasts with a 6% chance happen 6% of the time). For an extensive review and much more, see the forecast evaluation literature.
The result (2.12) goes back at least to Rosenblatt (1952); its use in econometrics was introduced by Diebold, Gunther, and Tay (1998), Kim, Shephard, and Chib (1998), Omori, Chib, Shephard, and Nakajima (2007) and Bladt and McNeil (2022).
To see a proof of (2.12), it is sufficient to look at the $T = 2$ case, as the $T > 2$ case has the same structure. Differentiating the bivariate distribution function yields the stated joint density (Diebold, Gunther, and Tay (1998) give a different type of proof, one based on the change of variable method for densities, which uses Jacobians).
In the multivariate case the transform can be applied coordinate by coordinate, where the $U_{j,t} \overset{iid}{\sim} U(0,1)$ through time. However, the $U_{1,t}, \ldots, U_{d,t}$ are not necessarily independent of one another (they are linked through a time-$t$ copula — but that copula can, potentially, change every time period).
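A sketch of the probability integral transform in action for the Gaussian autoregression of Example 2.2.2: if the model (and its parameters) are correct, the $U_t$ should look like i.i.d. uniforms. Parameter values are illustrative:
\begin{verbatim}
set.seed(4)
T <- 1000; phi <- 0.9; sigma <- 1
y <- filter(rnorm(T, 0, sigma), phi, method = "recursive")
u <- pnorm((y[-1] - phi * y[-T]) / sigma)   # U_t = F(Y_t | Y_{1:t-1})
hist(u, breaks = 20)    # should look flat
acf(u)                  # should show no serial correlation
\end{verbatim}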
Just because we deal with time series does not change the definitions of estimands, statistics, estimators and estimates. They are the same as in introductory statistics. An estimand $\theta$ is an object we would like to learn from data; a statistic is a function of the data, $T(\mathbf{Y})$; an estimator is a statistic aimed at learning an estimand, $\hat{\theta} = T(\mathbf{Y})$; and an estimate is the data version of the estimator, $\hat{\theta} = T(\mathbf{y})$. Example estimands include:
$\theta = E[Y_{T+2} \mid (\mathbf{Y} = \mathbf{y})]$, the conditional mean forecast of $Y_{T+2}$ given the past data $y_1, \ldots, y_T$ up to time $T$;
parameters $\theta$ which index a parametric model, here expressed through the cumulative distribution function $F_{\mathbf{Y}|\theta}$, or $F_{\mathbf{Y}}(\mathbf{y}|\theta)$;
the chance a randomly selected pair of consecutive quarters will both have strictly negative growth (recall two consecutive contractions in GDP is one crude indicator of a recession).
One approach to learning from data is through the use of parametric models, where the number of parameters $K = \dim(\theta)$ is finite; $K = \infty$ corresponds to a nonparametric statistical model.
h 2.5.3. Many researchers in the last 30 years have made a big deal out of the difference between parametric
and nonparametric models. But some of this is nonsense. The autoregressive language model behind ChatGPT
is a parametric statistical model but it has K = 175 billion. Hence it is a super flexible parametric model. Do
not be taken in by the mantra nonparametric good, parametric bad. It is intellectually vacuous. Instead, be
aware of the difference between flexible models and very tightly constrained models. Sometimes very tightly
constrained models can yield simple but fragile results and so can be dangerous. Thought is needed.
Example 2.5.4. Suppose $Y_t \mid Y_{1:t-1}; \theta \sim N(\phi Y_{t-1}, \sigma^2)$, where $\phi$ and $\sigma^2 > 0$ are unknown parameters and so $\theta = (\phi, \sigma^2)^T$ and $\Theta = \mathbb{R} \times \mathbb{R}_{\ge 0}$. This is called a first order Gaussian autoregression. Then
\[ \log L(\theta) = c - \frac{T-1}{2}\log\sigma^2 - \frac{1}{2\sigma^2}\sum_{t=2}^T (Y_t - \phi Y_{t-1})^2. \]
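This log likelihood is easy to code and maximize numerically. A sketch on simulated data (true values $\phi = 0.7$, $\sigma^2 = 4$ chosen for illustration):
\begin{verbatim}
set.seed(6)
y <- filter(rnorm(500, 0, 2), 0.7, method = "recursive")
negloglik <- function(par) {
  phi <- par[1]; sig2 <- exp(par[2])   # log-variance keeps sig2 > 0
  e <- y[-1] - phi * y[-length(y)]
  0.5 * (length(e) * log(sig2) + sum(e^2) / sig2)
}
fit <- optim(c(0, 0), negloglik)
c(phi = fit$par[1], sigma2 = exp(fit$par[2]))   # close to (0.7, 4)
\end{verbatim}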
To compare the joint model $F_{Y_{1:t}}$ to $G_{Y_{1:t}}$, look at the ratio of the densities
\[ \Lambda_t = \frac{f_{Y_{1:t}}(y_{1:t})}{g_{Y_{1:t}}(y_{1:t})}, \]
so
\[ \log \Lambda_t = \log \Lambda_{t-1} + \log \Lambda(F\|G)_t. \]
This form means that if $E[|\log \Lambda_t|] < \infty$, then $E[\log \Lambda_t \mid \mathcal{F}_{t-1}] - \log \Lambda_{t-1}$ equals $E[\log\{\Lambda(F\|G)_t\} \mid \mathcal{F}_{t-1}]$, so the Doob decomposition is
\[ \log \Lambda_t = A_t + M_t, \qquad A_t = \sum_{j=1}^t E[\log\{\Lambda(F\|G)_j\} \mid \mathcal{F}_{j-1}], \]
noting $\log \Lambda_0 = 0$, where $\{M_t\}$ is a martingale with respect to the filtration generated by $\{Y_t\}_{t\in\mathbb{N}_{>0}}$, with increment $\Delta M_t = M_t - M_{t-1}$ which equals $\Delta \log \Lambda_t - E[\log\{\Lambda(F\|G)_t\} \mid \mathcal{F}_{t-1}]$.
From the LLN for martingales, we know that if $t$ gets large $M_t/t$ goes to zero, so
\[ T^{-1}\{\log \Lambda_T - A_T\} \overset{p}{\to} 0. \]
So what? Is $A_T$ interesting?
Theorem 2.5.5. Assume that $E[|\log\{\Lambda(F\|G)_t\}|] < \infty$; then the predictive Kullback–Leibler divergence
\[ \Delta A_t = E[\log\{\Lambda(F\|G)_t\} \mid \mathcal{F}_{t-1}] \ge 0. \]
Proof. Recall the core Kullback–Leibler divergence inequality: for any distribution functions $P$ and $Q$ with densities $p$ and $q$,
\[ D_{KL}(P\|Q) = E_P\left[\log\frac{p}{q}\right] = -E_P\left[\log\frac{q}{p}\right] \ge -\log E_P\left[\frac{q}{p}\right], \]
by Jensen’s inequality. Then note $E_P\left[\frac{q}{p}\right] = 1$, yielding the celebrated result that
\[ D_{KL}(P\|Q) \ge 0. \]
Apply this to the conditional densities given $\mathcal{F}_{t-1}$.
As $\Delta A_t \ge 0$ for every $t$, $\{\log \Lambda_t\}$ is a submartingale with respect to the filtration generated by $\{Y_t\}_{t\in\mathbb{N}_{>0}}$ — so it will tend to drift upwards as $t$ increases unless $F = G$, when $\Delta A_t = 0$.
Comparing the data generating process $F$ to the parametric model $G = F_\theta$, for each $\theta \in \Theta$ the corresponding $\Delta A_t$ changes with $\theta$: write this as $\Delta A_t(\theta)$. In the special case where $F = F_{\theta^*}$ with $\theta^* \in \Theta$ (the truth is a special case of the model), the corresponding $\Delta A_t$ changes with $\theta, \theta^*$: write this as $\Delta A_t(\theta^*, \theta) \ge 0$, with equality at $\theta = \theta^*$.
This inequality provides a major justification for the use of the maximum likelihood estimator (MLE) for
time series models
\begin{align*}
\hat{\theta} &= \arg\max_{\theta\in\Theta} \log L(\theta; y_{1:T}) = \arg\max_{\theta\in\Theta} \log f_{Y_{1:T}}(y_{1:T}; \theta) \\
&= \arg\min_{\theta\in\Theta} \log \frac{f_{Y_{1:T}}(y_{1:T}; \theta^*)}{f_{Y_{1:T}}(y_{1:T}; \theta)} \\
&= \arg\min_{\theta\in\Theta} \left\{A_T(\theta^*, \theta) + M_T(\theta^*, \theta)\right\} = \arg\min_{\theta\in\Theta} \left\{\frac{1}{T}A_T(\theta^*, \theta) + \frac{1}{T}M_T(\theta^*, \theta)\right\} \\
&\simeq \arg\min_{\theta\in\Theta} \frac{1}{T}A_T(\theta^*, \theta), \quad T \text{ large}, \\
&= \theta^*,
\end{align*}
when T is large and dim(θ) is small. Obviously the large T behaviour of AT (θ∗ , θ)/T needs formalizing in this
argument.
To finish, now focus on a semiparametric model where $F = H_{\theta^*}$, which is indexed by some modelled aspect of $F$, e.g. $E[Y_t \mid \mathcal{F}_{t-1}] = \theta^* Y_{t-1}$. If
\[ \theta^* = \arg\min_{\theta\in\Theta} \Delta A_t(\theta^*, \theta) \]
for all $t$ and $Y_{1:t-1}$, then $\log L(\theta; y_{1:T})$ is called a quasi-likelihood for the “pseudo-true” $\theta^*$.
Example 2.5.6. Think of the Gaussian autoregression from Example 2.5.4 as the parametric model, with
\[ \Delta A_t(\theta) = E\left[\log f_{Y_t \mid Y_{1:t-1}}(y_t) \mid \mathcal{F}_{t-1}\right] - c + \frac{1}{2}\log\sigma^2 + \frac{1}{2\sigma^2}\left\{E[Y_t^2 \mid \mathcal{F}_{t-1}] - 2\theta Y_{t-1} E[Y_t \mid \mathcal{F}_{t-1}] + \theta^2 Y_{t-1}^2\right\}. \]
Assume, under $F = H_{\theta^*}$, the conditional moment $E[Y_t \mid \mathcal{F}_{t-1}] = \theta^* Y_{t-1}$ holds; then
\[ \Delta A_t(\theta^*, \theta) = E\left[\log f_{Y_t \mid Y_{1:t-1}; \theta^*}(y_t) \mid \mathcal{F}_{t-1}\right] - c + \frac{1}{2}\log\sigma^2 + \frac{1}{2\sigma^2}\left\{E[Y_t^2 \mid \mathcal{F}_{t-1}] - 2\theta\theta^* Y_{t-1}^2 + \theta^2 Y_{t-1}^2\right\}, \]
which is minimized over $\theta$ at $\theta = \theta^*$. Hence the $\theta^*$ has a pseudo-true interpretation. The Gaussianity in the Gaussian autoregression is not vital.
2.6 Recap
This Chapter has covered a great deal of material, starting with using probability to define a time series model.
The Chapter has focused on the use of prediction to unlock fundamental properties of time series, moving
through the prediction decomposition and going on to martingales and martingale difference sequences. Some
of this material was used to justify an interest in the MLE.
Stationarity
In introductory statistics much of the analysis is based on the i.i.d. assumption. Here we focus on the time series analog of the identically distributed assumption. It has two versions: strict stationarity and covariance stationarity.
3.1 Strict stationarity and marginal distributions
Recall a time series model is the joint distribution function $F_{\mathbf{Y}}$ of the sequence of random variables
\[ \mathbf{Y} = (Y_1, \ldots, Y_T). \]
From it we can compute the marginal distributions
\[ F_{Y_t}, \quad t = 1, \ldots, T. \]
Recall we are extending to time series the idea of the identically distributed assumption. A crude version of this is that the marginal distributions do not change through time,
\[ F_{Y_t} = F_{Y_1}, \quad t = 1, \ldots, T \]
(written in longhand this means that $F_{Y_t}(y_1) = F_{Y_1}(y_1)$ for all $y_1 \in \mathcal{Y}$), which would drive time-invariance of the marginal means, variances, quantiles, etc. But for a time series this is often not enough, for the dependence between the elements in $\mathbf{Y}$ could change dramatically through time.
Instead strict stationarity is often employed. This is stated for an infinitely lived process. The idea is we select some $k$ arbitrary times $t_1, \ldots, t_k$ and calculate the joint distribution of the corresponding random variables, $F_{Y_{t_1},\ldots,Y_{t_k}}$. Then, if we shift forward all these times by the same period $s$, the process is said to be strictly stationary if the joint distribution function does not change — the $Y_{t_1}, \ldots, Y_{t_k}$ have the identical joint distribution under time shifts. This is stated mathematically below.
Definition 3.1.1. The process $\{Y_t\}_{t\in\mathbb{N}}$ is said to be $k$-th order strictly stationary if
\[ F_{Y_{t_1},\ldots,Y_{t_k}} = F_{Y_{t_1+s},\ldots,Y_{t_k+s}}, \quad \text{for all } t_1, \ldots, t_k \text{ and all shifts } s. \]
If this holds for any choice of $k \in \mathbb{Z}_{>0}$, the process is simply called strictly stationary. It is the latter construction that is used here.
Compare the stationary process
\[ Y_t = \varepsilon, \quad t \in \mathbb{N}, \]
where $\varepsilon$ is a single random variable repeatedly seen, with another stationary process
\[ Y_t = \varepsilon_t, \quad t \in \mathbb{N}, \quad \text{where the } \varepsilon_t \text{ are i.i.d.} \]
They are polar opposites but both stationary. Stationarity tells you nothing about the degree of dependence inside the time series.
Remark 5. “Moving average,” very unfortunately, has two entirely different meanings in time series. First, as a process, which we will define in a moment. Second, as an algorithm: using local averages of a time series to reduce “noise.” In the academic literature the second method is usually called a type of filter or smoother. Two examples of such a moving average are
\[ \hat{M}_t = \tfrac{1}{4} Y_{t-1} + \tfrac{1}{2} Y_t + \tfrac{1}{4} Y_{t+1}, \quad t = 2, \ldots, T-1, \]
a smoother (which estimates something at time $t$, using data in the past, the current $t$ and the future), and
\[ \tilde{M}_t = \tfrac{1}{6} Y_{t-2} + \tfrac{1}{3} Y_{t-1} + \tfrac{1}{2} Y_t, \quad t = 3, \ldots, T, \]
a filter (which estimates something at time $t$, using data in the past and the current period). We will see much more of these types of filters and smoothers, e.g. EWMA and cubic splines, in Chapter 5.
The moving average process is one of the most influential in time series, developed independently by Yule (1921, 1926, 1927) and Slutsky (1927). It was named by Wold (1938). Much of modern macroeconometrics is phrased using moving averages. This will be discussed in Chapter 7.
Definition 3.2.1. [Moving average process] The $q$-th order (homogeneous) linear moving average (denoted MA(q)) process
\[ Y_t = \theta_0 \varepsilon_t + \theta_1 \varepsilon_{t-1} + \cdots + \theta_q \varepsilon_{t-q}, \quad t \in \mathbb{N}, \tag{3.1} \]
is driven by the time series of “shocks” $\{\varepsilon_t\}_{t\in\mathbb{Z}}$ of i.i.d. (vector) random variables, while $\{\theta_j\}_{j=0}^q$ are $d \times r$ matrices of constants (often researchers set $\theta_0 = I_d$ in applications). In most applications $d = r$.
Theorem 3.2.2. The MA(q) process is always strictly stationary. Further, if $\{Y_t\}$ is a MA(q) process, then for any $r \times d$ real matrix $A$ the process $\{AY_t\}$ is a MA(q) process (i.e. the class of processes is closed under linear combinations).
Proof. [Sketch proof only]. We will use a generic notation for the cumulant-generating function (the log of the characteristic function) $C\{\omega \ddagger X\} = \log E[e^{i X^T \omega}]$ of a vector random variable $X$ and $\omega$ a vector of reals. Then for every $t$
\begin{align*}
C\{\omega \ddagger Y_t\} &= \log E[e^{i \omega^T Y_t}] \\
&= C\{\omega^T\theta_0 \ddagger \varepsilon_t\} + C\{\omega^T\theta_1 \ddagger \varepsilon_{t-1}\} + \cdots + C\{\omega^T\theta_q \ddagger \varepsilon_{t-q}\}, \quad \{\varepsilon_t\}_{t\in\mathbb{Z}} \text{ independent} \\
&= \sum_{j=0}^q C\{\omega^T\theta_j \ddagger \varepsilon_1\}, \quad \text{by identical distribution} \\
&= C\{\omega \ddagger Y_{t+s}\}.
\end{align*}
So by the uniqueness theorem of characteristic functions (there is a 1-1 relationship between distributions and characteristic functions), $F_{Y_t}$ exists and $F_{Y_t} = F_{Y_{t+s}}$. The same type of argument holds when we look at joint distributions; e.g. think of the MA(1), then
\begin{align*}
C\{(\omega, \lambda) \ddagger (Y_{t_1}, Y_{t_2})\} &= \log E[e^{i(\omega^T Y_{t_1} + \lambda^T Y_{t_2})}] \\
&= \left[C\{\omega^T\theta_0 \ddagger \varepsilon_1\} + C\{\omega^T\theta_1 \ddagger \varepsilon_1\}\right] + \left[C\{\lambda^T\theta_0 \ddagger \varepsilon_1\} + C\{\lambda^T\theta_1 \ddagger \varepsilon_1\}\right] \\
&\quad + \left[C\{(\omega^T\theta_0 + \lambda^T\theta_1) \ddagger \varepsilon_1\} - C\{\omega^T\theta_0 \ddagger \varepsilon_1\} - C\{\lambda^T\theta_1 \ddagger \varepsilon_1\}\right] 1_{|t_2 - t_1| = 1},
\end{align*}
which depends on $(t_1, t_2)$ only through $|t_2 - t_1|$. Hence the MA(q) processes are always stationary. The second point of the theorem follows as
\[ A Y_t = A\theta_0 \varepsilon_t + A\theta_1 \varepsilon_{t-1} + \cdots + A\theta_q \varepsilon_{t-q} = \lambda_0 \varepsilon_t + \lambda_1 \varepsilon_{t-1} + \cdots + \lambda_q \varepsilon_{t-q}, \]
writing $\lambda_j = A\theta_j$, which is again a MA(q).
In macroeconomics MA(∞) models are often used. Are they strictly stationary?
Example 3.2.3. Think about when $\varepsilon_1$ is univariate symmetric $\alpha$-stable, where $\alpha \in (0, 2]$; then $C\{\omega \ddagger \varepsilon_1\} = -|\omega|^\alpha$. Then for a symmetric $\alpha$-stable MA(q) process, $C\{\omega \ddagger Y_t\} = -|\omega|^\alpha \sum_{j=0}^q |\theta_j|^\alpha$. So the symmetric $\alpha$-stable MA($\infty$) process is stationary iff $\sum_{j=0}^\infty |\theta_j|^\alpha < \infty$ (see Rootzén (1978) for a more extensive discussion). This shows strict stationarity issues can be intricate for MA($\infty$) processes — the tail thickness of the shocks $\{\varepsilon_t\}$ interacts with the dependence parameters $\{\theta_j\}$.
h 3.2.4. Strict stationarity is straightforward for MA(q) with finite $q$, but very intricate for MA($\infty$) processes. This is somewhat odd, as $q$ could be enormous, for example 100 billion, and the process is always strictly stationary, but 100 billion is not big enough to cover the MA($\infty$) case.
In Chapter 2 we saw that martingale differences appear naturally in time series and play some similar roles in probability theory to zero mean independence. So it is interesting to define a zero mean MA(q) process not driven by i.i.d. shocks, but by MD sequences:
\[ Y_t = \theta_0 \varepsilon_t + \theta_1 \varepsilon_{t-1} + \cdots + \theta_q \varepsilon_{t-q}, \]
where the $\{\theta_j\}_{j=0}^q$ are, again, $d \times d$ matrices of constants but now $\{\varepsilon_t\}_{t\in\mathbb{Z}}$ is a martingale difference sequence with respect to $\{\mathcal{F}_t\}$. We will call this a MA(q) MD process, and a MA(q) MD $L^2$ process if the $\{\varepsilon_t\}$ is a MD $L^2$ process.
For a MA(q) MD process,
\[ E[Y_t] = 0 \]
always holds. For a MA(q) MD $L^2$ process the $\mathrm{Var}(\varepsilon_t)$ exists (but is not necessarily constant through time) and $\mathrm{Cov}(\varepsilon_t, \varepsilon_{t-s}) = 0$, and so
\[ \mathrm{Cov}(Y_t, Y_{t-s}) = \sum_{j=0}^q \theta_j \mathrm{Var}(\varepsilon_{t-j})\, \theta_{j-s}^T, \tag{3.4} \]
with the convention $\theta_j = 0$ for $j < 0$ or $j > q$. Replacing the MD process by i.i.d. shocks (which do not need 0 means) where $\mathrm{Var}(\varepsilon_1) < \infty$, then
\[ E[Y_t] = \sum_{j=0}^q \theta_j E[\varepsilon_1]. \tag{3.5} \]
The same covariance result holds for i.i.d. shocks so long as the moments exist, but now it enforces homoskedasticity. Likewise, in the special case where the MD $L^2$ process has $\mathrm{Var}(\varepsilon_t) = \mathrm{Var}(\varepsilon_1)$ for all $t$, (3.4) simplifies to
\[ \mathrm{Cov}(Y_t, Y_{t-s}) = \sum_{j=0}^q \theta_j \mathrm{Var}(\varepsilon_1)\, \theta_{j-s}^T. \tag{3.6} \]
3.3 Covariance stationarity
Strict stationarity requires all joint distributions to be invariant to a time shift by some period $s$. A different, but related, concept is covariance stationarity — which switches the focus from distributions to solely means and covariances.
Definition 3.3.1. The process $\{Y_t\}$ is covariance stationary if $\mathrm{Var}(Y_1) < \infty$ and
\[ E[Y_t] = E[Y_1], \qquad \mathrm{Cov}(Y_t, Y_{t-s}) = \mathrm{Cov}(Y_1, Y_{1-s}), \]
for every $t$ and $s$. The $\mathrm{Cov}(Y_1, Y_{1-s})$ is called the $s$-th autocovariance, while the function $\{\gamma_s\}_{s\in\mathbb{Z}}$, where
\[ \gamma_s = \mathrm{Cov}(Y_1, Y_{1-s}), \]
is called the autocovariance function. The $s$-th autocorrelation is, for a scalar time series,
\[ \rho_s = \mathrm{Cor}(Y_1, Y_{1-s}) = \frac{\mathrm{Cov}(Y_1, Y_{1-s})}{\sqrt{\mathrm{Var}(Y_1)\mathrm{Var}(Y_{1-s})}} = \frac{\gamma_s}{\gamma_0}. \]
Remark 6. Just because a series is covariance stationary does not imply it is strictly stationary, and some strictly stationary processes are not covariance stationary (e.g. a sequence of i.i.d. Cauchy random variables is strictly stationary but not covariance stationary). Strict stationarity plus the existence of $\mathrm{Var}(Y_1)$ implies covariance stationarity. A special case of this covers the Gaussian process, where strict and covariance stationarity coincide.
Example 3.3.2. $Y_t = \varepsilon$, where $\mathrm{Var}(\varepsilon) < \infty$; then $\{Y_t\}$ is a covariance stationary process with $\gamma_s = \gamma_0$ for all $s$.
Example 3.3.3. Martingale differences always have $E[Y_t] = 0$, but the unconditional $\mathrm{Var}(Y_t)$ is not guaranteed to exist and, even if it does exist, can vary with $t$. Hence martingale differences are not in general covariance stationary.
Example 3.3.4. Let $\lambda_1 \in (0, 2\pi)$, a “frequency,” and define the process $\{Y_t\}$ by
\[ Y_t = \beta_1 \cos(\lambda_1 t) + \beta_2 \sin(\lambda_1 t), \]
where $\beta_1, \beta_2$ are uncorrelated, zero mean random variables with common variance $\sigma_1^2$. Then $E[Y_t] = 0$ and
\begin{align*}
\gamma_s &= \left\{\cos(\lambda_1 t)\cos(\lambda_1 t - \lambda_1 s) + \sin(\lambda_1 t)\sin(\lambda_1 t - \lambda_1 s)\right\}\sigma_1^2 \\
&= \cos(\lambda_1 s)\,\sigma_1^2, \quad \text{using the trig identity } \cos(\alpha - \beta) = \cos(\alpha)\cos(\beta) + \sin(\alpha)\sin(\beta).
\end{align*}
Thus $\{Y_t\}$ is a weakly stationary process (if $\beta = (\beta_1, \beta_2)$ is Gaussian, it is strictly stationary). Notice $\gamma_s$ oscillates as $s$ increases, never settling down.
The i.i.d. assumption appears frequently in introductory statistics and drives the basic MA(q) definition. The covariance stationary version of this is called white noise.
Definition 3.3.5. [Weak white noise] A time series $\{Y_t\}$ is called weak white noise if it is covariance stationary with $\gamma_s = 0$ for all $s \neq 0$. Sometimes the finite variance i.i.d. case is called “independent white noise.”
Example 3.3.6. [Weak white noise driven MA(q)] In the univariate zero mean weak white noise driven MA(1), that is (3.1) with $q = 1$ where $\{\varepsilon_t\}$ is weak white noise,
\[ \gamma_0 = \mathrm{Var}(\varepsilon_1)(\theta_0^2 + \theta_1^2), \qquad \gamma_1 = \mathrm{Var}(\varepsilon_1)\,\theta_1\theta_0, \qquad \gamma_s = 0 \text{ for } |s| > 1. \]
The weak white noise driven MA(q) are often called the covariance stationary MA(q) processes. The univariate version of the autocovariance is
\[ \gamma_s = \mathrm{Var}(\varepsilon_1) \sum_{j=0}^q \theta_j \theta_{j-s}, \tag{3.7} \]
again with the convention $\theta_j = 0$ outside $0 \le j \le q$.
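The MA(1) autocorrelation implied by these autocovariances, $\rho_1 = \theta_1/(1 + \theta_1^2)$, can be compared with the sample autocorrelation from a long simulation. A minimal R sketch with an illustrative $\theta_1$:
\begin{verbatim}
set.seed(8)
theta1 <- 0.5
y <- arima.sim(model = list(ma = theta1), n = 5000)
acf(y, lag.max = 5, plot = FALSE)$acf[2]   # sample rho_1 (index 1 is lag 0)
theta1 / (1 + theta1^2)                    # theoretical rho_1 = 0.4
\end{verbatim}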
Definition 3.3.7. [MD white noise] Assume $\{Y_t\}$ is a martingale difference sequence with respect to $\{\mathcal{F}_t\}$ and $\mathrm{Var}(Y_t) = \mathrm{Var}(Y_1) < \infty$; then $\{Y_t\}$ is called a martingale difference white noise sequence — MD white noise. MD white noise is always covariance stationary, with $\gamma_s = 0$ for $s \neq 0$.
MD white noise is important in statistics as it drives the properties of many sophisticated estimators (e.g. method of moments and least squares estimators). Here we collect two core MD white noise results — they are not new, they are special (relatively simple) cases of existing laws of large numbers and CLTs for martingale differences. It will be helpful to refer to these results later. By the WLLN for MDs, $\bar{Y} \overset{p}{\to} 0$, and a CLT holds,
\[ \sqrt{T}\,\bar{Y} \overset{d}{\to} N(0, \mathrm{Var}(Y_1)), \]
so long as:
(a) $T^{-1}\sum_{t=1}^T \sigma_t^2 \overset{p}{\to} \mathrm{Var}(Y_1)$ (that is, $\sigma_t^2$ is ergodic, where $\sigma_t^2 = \mathrm{Var}(Y_t \mid \mathcal{F}_{t-1})$), and
(b) the Lindeberg condition holds. In practice, if $\{Y_t\}$ is stationary, this can be checked by making sure that $E[|Y_1|^{2+\delta}] < \infty$ for some $\delta > 0$.
The MA($\infty$) is rather important from a theoretical viewpoint. Is the weak white noise driven MA($\infty$) covariance stationary? The answer is: sometimes!
The basic question is: do $E[|Y_1|]$ and $\mathrm{Var}(Y_1)$ exist? If they do, then the autocovariance function will just be (3.7) with $q = \infty$. Now
\[ \mathrm{Var}(Y_1) = \mathrm{Var}(\varepsilon_1) \sum_{j=0}^\infty \theta_j^2, \quad \text{using } \mathrm{Var}(\varepsilon_1) < \infty \text{ from the weak white noise assumption}, \]
which is finite if $\sum_{j=0}^\infty \theta_j^2 < \infty$ (square summability). Of course if $\mathrm{Var}(Y_1)$ exists then so does $E[Y_1]$ (that is, $E[|Y_1|] < \infty$) and
\[ \gamma_s = \mathrm{Cov}(Y_1, Y_{1-s}) = \mathrm{Var}(\varepsilon_1) \sum_{j=0}^\infty \theta_j \theta_{j-s}, \quad s = 1, 2, \ldots, \]
so the weak white noise driven MA($\infty$) is covariance stationary under square summability, $\sum_{j=0}^\infty \theta_j^2 < \infty$.
A stronger condition is absolute summability, $\sum_{j=0}^\infty |\theta_j| < \infty$. It turns out this absolute summability condition on $\{\theta_j\}$ is enough to guarantee the absolute summability of the autocovariances, $\sum_{s=1}^\infty |\gamma_s| < \infty$. Absolute summability implies squared summability. Why does this classic mathematics result hold? Assume absolute summability; then $|\theta_j| \to 0$ as $j$ increases. Find a finite $N > 0$ such that for all $j > N$ the corresponding $|\theta_j| < 1$. Then
\[ \sum_{j=0}^\infty \theta_j^2 = \sum_{j=0}^N \theta_j^2 + \sum_{j=N+1}^\infty \theta_j^2 \le \sum_{j=0}^N \theta_j^2 + \sum_{j=N+1}^\infty |\theta_j| \le \sum_{j=0}^N \theta_j^2 + \sum_{j=0}^\infty |\theta_j| < \infty, \]
as $\sum_{j=0}^N \theta_j^2$ is finite and $\sum_{j=0}^\infty |\theta_j| < \infty$.
Averages are a core building block for much of statistics. The following result is helpful for inference purposes. If $\sum_{s=1}^\infty |\gamma_s| < \infty$, then as $T$ increases
\[ \mathrm{Var}\left\{\sqrt{T}\left(\bar{Y} - E[Y_1]\right)\right\} \to \sum_{s=-\infty}^\infty \gamma_s. \]
The term $\sum_{s=-\infty}^\infty \gamma_s$ is often called the long-run variance.
Proof.
\begin{align}
T \times \mathrm{Var}(\bar{Y}) &= \frac{1}{T}\sum_{t=1}^T\sum_{j=1}^T \mathrm{Cov}(Y_t, Y_j) \nonumber \\
&= \frac{1}{T}\sum_{j=1}^T\sum_{t=1}^T \mathrm{Cov}(Y_1, Y_{1+(t-j)}), \quad \text{covariance stationarity} \tag{3.9} \\
&= \mathrm{Var}(Y_1) + \frac{1}{T}\sum_{s=1}^{T-1}\sum_{t=1}^{T-s}\left\{\mathrm{Cov}(Y_1, Y_{1-s}) + \mathrm{Cov}(Y_1, Y_{1+s})\right\} \tag{3.10} \\
&= \gamma_0 + \frac{1}{T}\sum_{s=1}^{T-1} (T - |s|)\left\{\gamma_s + \gamma_s^T\right\} \tag{3.11} \\
&= \gamma_0 + \sum_{s=1}^{T-1}\left(1 - \frac{|s|}{T}\right)\left\{\gamma_s + \gamma_s^T\right\}. \tag{3.12}
\end{align}
As $T$ goes to infinity the limit is straightforward, if it exists. Its existence is assumed here.
In time series, if the time average of a process converges to its expectation, the underlying process is said to be “ergodic.” There is a vast mathematical discipline on ergodicity; here, when you hear ergodicity you should just think that averages converge to their expectations.
Remark 8. A sufficient condition for the covariance stationary MA($\infty$) process to be ergodic is that $\sum_{j=0}^\infty |\theta_j| < \infty$. Why? We saw in the previous subsection that $\sum_{j=0}^\infty |\theta_j| < \infty$ is enough to force $\sum_{s=1}^\infty |\gamma_s| < \infty$, which gives the result.
More broadly (i.e. not necessarily assuming a MA($\infty$) process), for covariance stationary processes a rather beautiful result holds.
Lemma 1. For a covariance stationary process $\{Y_t\}$ with $\lim_{s\to\infty} |\gamma_s| = 0$,
\[ \bar{Y} \overset{p}{\to} E[Y_1]. \]
Proof. Recall Cesàro’s lemma, which says that for strictly positive real numbers $b_n \uparrow \infty$ and a convergent sequence of reals $v_n \to v_\infty \in \mathbb{R}$,
\[ \frac{1}{b_n}\sum_{k=1}^n (b_k - b_{k-1}) v_k \to v_\infty. \]
Applying this with $b_k = k$ and $v_k = |\gamma_k| \to 0$ delivers, from (3.12),
\[ \mathrm{Var}(\bar{Y}) \le \frac{1}{T}\left(\gamma_0 + 2\sum_{s=1}^{T-1}|\gamma_s|\right) \to 0, \]
so Chebyshev’s inequality gives the result.
Example 3.3.9. Assume $\gamma_s = |s|^{-\alpha}$ for $\alpha > 0$ and $s \neq 0$, with $\gamma_0 = 1$; then
\[ T \times \mathrm{Var}(\bar{Y}) = 1 + 2\sum_{s=1}^{T-1}\left(1 - \frac{s}{T}\right)\frac{1}{s^\alpha}, \tag{3.13} \]
while $\lim_{s\to\infty} |\gamma_s| = 0$. Hence by Lemma 1, $\bar{Y} \overset{p}{\to} E[Y_1]$. But at what speed does this convergence happen as $T$ increases? The right hand side of (3.13) behaves like the “p-series” $\sum_s s^{-\alpha}$ in mathematics. It diverges if $\alpha \le 1$, that is
\[ \sum_{s=-\infty}^\infty \gamma_s = \infty, \]
so when $\alpha \le 1$ the $\bar{Y}$ goes to $E[Y_1]$ slower than $T^{-1/2}$! However, $\sum_{s=-\infty}^\infty \gamma_s$ does exist if $\alpha > 1$, in which case $\bar{Y}$ goes to $E[Y_1]$ at rate $T^{-1/2}$.
We finish with our second CLT. Here the usual driving weak white noise has been strengthened to driving independent zero mean white noise. The proof will be delayed to the next chapter when we discuss $m$-dependence. The same CLT holds if the driving noise is MD white noise.
Remark 9. Unfortunately weak white noise is not enough to drive a Gaussian CLT. Why? At first sight it is shocking, but the result is not deep. Focus on a very simple case, $Y_t = X\varepsilon_t$, where $\{\varepsilon_t\}$ is i.i.d. with zero mean and unit variance, and $X$ is a non-degenerate random variable, independent of $\{\varepsilon_t\}$, with $E[X^2] < \infty$. Then $\{Y_t\}$ is weak white noise, but $\sqrt{T}\,\bar{Y} = X\sqrt{T}\,\bar{\varepsilon}$ converges in distribution to a scale mixture of normals, not a normal. There is simply too much dependence allowed by weak white noise (here due to the common random scale $X$) to drive a Gaussian CLT.
Before we state the result, we give a simple but very compact result.
Lemma 2. For a covariance stationary MA($\infty$) with absolute summability $\sum_{j=0}^\infty |\theta_j| < \infty$, define
\[ \theta(z) = \sum_{j=0}^\infty \theta_j z^j. \]
Then
\[ \sum_{s=-\infty}^\infty \gamma_s = \mathrm{Var}(\varepsilon_1) \times \theta(1)^2. \]
Theorem 3.3.10. Assume an independent white noise driven MA($\infty$) with absolute summability $\sum_{j=0}^\infty |\theta_j| < \infty$; then, writing $\theta(z) = \sum_{j=0}^\infty \theta_j z^j$ and assuming $\theta(1)^2 > 0$,
\[ \sqrt{T}\left(\bar{Y} - E[Y_1]\right) \overset{d}{\to} N(0, \Psi), \qquad \Psi = \sum_{s=-\infty}^\infty \gamma_s = \mathrm{Var}(\varepsilon_1) \times \theta(1)^2. \]
It is helpful to take a step back and make a summary about MA($\infty$) processes:
covariance stationarity needs $\sum_{j=0}^\infty \theta_j^2 < \infty$, when driven by weak white noise;
ergodicity holds if $\sum_{j=0}^\infty |\theta_j| < \infty$, when driven by weak white noise;
a CLT needs $\sum_{j=0}^\infty |\theta_j| < \infty$ plus more than weak white noise (e.g. independent or MD white noise).
A simulation check of the CLT and its long-run variance is sketched below.
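Here is the promised simulation check, for an MA(1) with $\theta_0 = 1$ and standard normal shocks, so $\Psi = (1 + \theta_1)^2$ (values illustrative):
\begin{verbatim}
set.seed(10)
theta1 <- 0.5; T <- 2000
m <- replicate(5000, mean(arima.sim(model = list(ma = theta1), n = T)))
var(sqrt(T) * m)    # sample variance of sqrt(T) * Ybar
(1 + theta1)^2      # Psi = Var(eps) * theta(1)^2, with sigma^2 = 1
\end{verbatim}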
3.4 Method of moments
The method of moments is one of the three major estimation strategies discussed in Introductory Statistics; it is due to Karl Pearson from 1894. The other two are MLE and Bayesian inference. The broad idea is to relate the estimands to moments; the method of moments principle then replaces estimands by estimators and moments by averages.
Means and covariances naturally appear in covariance stationary processes, so the method of moments is often
Here FY1:s ;θ is the marginal distribution of Y1:s (think of it as the stationary distribution) and
Z
EY1:s ;θ [h(Y1 )] = h(y1:s )fY1:s ;θ (y1:s )dy1 ...dys .
where b
θM oM is a method of moments estimator (typically, for any given scientific problem, there are a massive
number of method of moments estimators available — so the function h should be selected with some care).
θ1
γ1 = Cov(Y1 , Y0 ) = α(θ1 ) = , θ1 ∈ Θ = [−1, 1],
1 + θ12
If |γ1 | ≤ 1/2, then the pair of solutions to γ1 θ12 − θ1 + γ1 = 0 are real. If |γ1 | > 1/2 the solutions are a pair of
complex conjugates. If γ1 ∈ [−1/2, 1/2] then the first real solution
p
1 − 1 − 4γ12
∈ [−1, 1],
2γ1
q
1+2γ1
p
(to show the range, when γ1 ∈ [0, 1/2], then 1 − 1 − 4γ12 = 1 − (1 − 2γ1 ) 1−2γ 1
≤ 2γ1 ). If γ1 ∈ (−1/2, 1/2)
then the second real solution p
1+ 1 − 4γ12
∈
/ [−1, 1],
2γ1
3.4. METHOD OF MOMENTS 51
as the method of moment’s principle offers no guidance about how to select θ1 when |γ1 | > 1/2 — which is a
big fail (a reasonable ad hoc choice would be to replace NA with sign(γ1 ), but that goes outside the method of
moments principle). As a result the method of moment estimator becomes:
√ 2
1− 2bγ1−4bγ 1 ,
|r1 | ≤ 1/2
1
θ1 =
b
NA |r1 | > 1/2.
zero fast. However, it will not if the true value of θ1 ∈ {−1, 1}, where the NA solution will be problematic.
Famously, MoM estimators can be good, bad and ugly.
A different framing of the same idea, which is somewhat more opaque but more flexible, is to work with the
expectation of a function of the data and the parameter EY1:s ;θ0 [g(Y1:s , θ)], where θ0 is the true value of the
while EY1:s ;θ0 [g(Y1 , θ)] ̸= 0 for all θ ∈ Θ̸=θ0 , that is all possible θ not being θ0 . Then (3.15) is called a moment
Example 3.4.3. By selecting g(Y1:s , θ) = h(Y1:s )−α(θ) from (3.14) reproduces the first approach when α is 1-1
(which guarantees uniqueness). By writing g(Y1 , θ) = ∂ log L(θ0 ; Y1 )/∂θ produces the typical case of maximum
likelihood estimation.
Applying the method of moment’s principle to this problem, replace the expectation by an average and the
estimand by an estimator, yields
T −s
1 X
g(Yt:t+s , b
θM oM ) = 0p .
T − s t=1
Often b
θM oM has to be found numerically (it may not be unique). This is often implemented by minimizing
( T −s
)T ( T −s
)
1 X 1 X
θM oM = arg min
b g(Yt:t+s , θ) g(Yt:t+s , θ) .
θ∈Θ T − s t=1 T − s t=1
h 3.4.4. Proving θ = θ0 is the unique θ which solves EY1:s ;θ0 [g(Y1:s , θ)] = 0p is difficult for generic MoM
problems. It is typically carried out case-by-case or not at all.
52 CHAPTER 3. STATIONARITY
∞
Assume {h(Yt:t+s )}t=−∞ is covariance stationary and write
T −s
1 X p
h(Yt:t+s ) → E[h(Y1:s )],
T − s t=1
p
θM oM → α−1 (E[h(Y1:s )]) = θ,
b
that is the method of moment estimator is consistent. This simple idea has been applied massively in time
series over the decades.
In terms of moment conditions, assume g(Yt:t+s , θ) is covariance stationary for each value of θ in Θ, then if
the autocovariances goes to 0 at large lags, then pointwise
T −s
1 X p
g(Yt:t+s , θ) → EY1:s ;θ0 [g(Y1:s , θ)].
T − s t=1
p
θM oM → θ0 as
Hence b
θ0 = arg min EY1:s ;θ0 [g(Y1:s , θ)]T EY1:s ;θ0 [g(Y1:s , θ)], under uniqueness of θ0 .
h 3.4.5. The driving assumption is that {g(Yt:t+s , θ)} is covariance stationary, for every θ, which may be
difficult to show outside cases where g is linear in Yt or cases where {Yt } is assumed to be a strictly stationary
process.
None of the above gives us a inference engine. The usual way of doing this is by applying the delta method.
p
b→
Remark 10. [Recalling the delta method] Assume λ λ, then the delta method establishes the limit distri-
which holds if {h(Yt )}. As we saw in Section 3.3.4 this is not simple! It does hold, for example, for a
P∞
independent, zero mean white noise driven MA(∞) process with j=0 |θj | < ∞.
If those two pieces hold, then by Slutsky’s theorem the method of moments estimator would have the CLT
T −1
" −1 #
√
d ∂α(λ) ∂α(λ)
θM oM − θ) → N 0,
T (b Ψ(h) .
∂λT ∂λ
If the MoM is phrased through a moment condition then the equivalent CLT would be1
√ d T
θM oM − θ0 ) → N (0, H −1 Ψ(g) H −1
T (b ), assuming H > 0, (3.16)
How do models which involve martingale differences map into the method of moments?
1 Recall θM oM − θ is small, so by Taylor expansion
why (3.16) holds. This is from Introductory Statistics. Now b
T T T
X X X ∂g(Yt , θ0 ) b
0= θM oM ) ≃
g(Yt , b g(Yt , θ0 ) + T
θM oM − θ0 ,
t=1 t=1 t=1
∂θ
so !−1
√ T T T
1 X ∂g(Yt , θ0 ) 1 X X ∂g(Yt , θ0 )
θM oM − θ0 ) ≃
T (b − g(Yt , θ0 ), assuming > 0.
T t=1 ∂θT T 1/2 t=1 t=1
∂θT
Hence the CLT we need to derive two results, (3.17) and a CLT for {g(Yt , θ0 )}, for if we have those two we can apply Slutsky’s
theorem to yield the result. The (3.17) holds if, for example, {∂g(Yt , θ0 )/∂θT } is covariance stationary and the autocovariance
function goes to 0 as the lag length extends to infinity.
54 CHAPTER 3. STATIONARITY
Example 3.4.6. In Section 2.3.3 we how {Ut′ }, the marginal utility from a choice, follows a martingale difference
condition
E[Ut′ |Ft−1 ] = 0,
choice is (assuming E[|Ut (Yt (at−1 ); θ)|] < ∞ for all at−1 and θ)
then
∂Ut (Yt (at−1 ) ; θ)
Ut′ (Yt , θ) = , Yt = Yt (b
at−1 (θ0 ))
∂at−1 at−1 =b
at−1 (θ0 )
such that
E[Ut′ (Yt , θ0 )|Ft−1 ] = 0.
Hence if we see a time series of {Yt } we can estimate θ by using the method of moments
T
1X ′
U (Yt , b
θM oM ) = 0.
T t=1 t
One of the attractive aspects of this is there is no need to model the law of the process {Yt }. The assumption
of the choice problem induces the martingale difference sequence, which drives the MoM estimator. This type
of approach to inference on choice problems using time series data was advocated by Hansen and Singleton
(1982).
Now think about this more abstractly. Suppose θ0 is such that Gt = g(Yt , θ0 ) forms a sequence {Gt } which
is a martingale differences in L2 with respect to the filtration {Ft }. Then this is a conditional moment condition
E[g(Yt , θ0 )] = 0, t = 1, ..., T.
Theorem 3.4.7. Assume {Gt } which is a MD in L2 process with respect to the filtration {Ft } and uniqueness
of the moment constraint E[g(Yt , θ0 )] = 0 holds, then
a.s.
θM oM → θ0 ,
b
whenever ⟨M, M ⟩T → 0.
3.5. INVERTIBILITY 55
a.s.
whenever ⟨M, M ⟩T → 0. This is enough for b
θM oM → θ0 (assuming uniqueness).
p
Further, under a Lindeberg condition and ST−1 ⟨M, M ⟩T → Ip , then
T
−1/2 d
X
ST g(Yt , θ0 ) → N (0, Ip ).
t=1
If {Gt } is covariance stationary as well as being a MD sequence, then St = tE[G1 GT1 ], then the familiar CLT
for the MoM (3.16) would hold with a particularly simple to estimate Ψ(g) = E[G1 GT1 ]. This is summarized in
the following theorem.
Theorem 3.4.8. Assume {Gt } which is a martingale differences with respect to the filtration {Ft }, is covariance
stationary and the uniqueness of the moment constraint E[g(Yt , θ0 )] = 0 holds, then
" −1 −1 #
√ T
d ∂g(Y 1 , θ0 ) ∂g(Y 1 , θ0 )
θM oM − θ0 ) → N 0, E
T (b E[G1 GT1 ] E ,
∂θT ∂θ
Hence the CLT for a MoM based on covariance stationary, martingale difference sequence {Gt } is as if the
{Gt } is i.i.d..
3.5 Invertibility
Y = θ0 ε,
where θ0 is a known d × d matrix of constants which that θ0−1 exists. Then ε = θ0−1 Y , so informally, seeing Y
implies seeing ε.
What is the time series version of this? Think about the MA(1) process
Yt = θ0 εt + θ1 εt−1 ,
Now ask: can I see εt from Ft , the history of the observed up to time t. More formally, is εt ∈ Ft , i.e.
Now ask: why would I care if εt ∈ Ft . It is jolly useful for it means εt−1 ∈ Ft−1 , so
L
(Yt − θ1 εt−1 ) |Ft−1 = θ0 εt ,
56 CHAPTER 3. STATIONARITY
Hence we can simply convert the MA(1) into a super simple predictive form.
εt ∈ Ft (e.g. b
Definition 3.5.1. For any η > 0, if there exists a b εt = E[εt |Ft ]) such that
lim P ({|εt − b
εt | > η} |Ft ) = 0,
t→∞
εt = Yt − θ1b
θ0 b εt−1 , t = 1, ...,
so
εt = Yt − θ1b
θ0 b εt−1 = θ0 εt + θ1 (εt−1 − b
εt−1 ) .
Rewriting
(εt − b
εt ) = −θ0−1 θ1 (εt−1 − b
εt−1 ) , a difference equation in εt − b
εt
t
= −θ0−1 θ1 (ε0 − b ε0 ) ,
or, letting wt = εt − b
εt , then
t
So if |θ1 /θ0 | < 1 then the process is invertible as |θ1 /θ0 | → 0. Statistically the most worked version of the
|θ1 | < 1.
εt = Yt − θ1b
θ0 b εt−1 ... − θq b
εt−q = θ0 εt + θ1 (εt−1 − b
εt−1 ) + ... + θq (εt−q − b
εt−q ) ,
so
We will see that the MA(q) will be invertibility if each eigenvalue of the q × q matrix ϕ has an absolute value
which is less than one.
Write the eigenvalues as λ1:q , while the corresponding eigenvectors are u1 , ..., uq , so
t
Zt = aj (λj ) uj , j = 1, ..., q, t = 1, 2, ... (3.21)
t−1
ϕZt−1 = aj (λj ) ϕuj , from defn of Zt−1 from (3.21)
t−1
= aj (λj ) λj uj , from (3.20)
t
= aj (λj ) uj
so, immediately,
q
X q
X
ϕZt−1 = aj λt−1
j ϕuj = aj λtj uj = Zt .
i=1 i=1
Hence if
|λj | < 1, j = 1, ..., q,
then |Zt | → 0 as t → ∞, for whatever choice of constants a1:q . This is enough to guarantee that the MA(q) is
invertible.
58 CHAPTER 3. STATIONARITY
λ + θ1 θ2
0 = |λI2 − ϕ| = = λ2 + θ1 λ + θ2 ,
−1 λ
a 2nd order polynomial in λ, with coefficients θ1:2 . Hence the pair of roots
p
−θ1 ± θ12 − 4θ2
λ1:2 = ,
2
need, in absolute value, to be less than one. The invertibility region of MA(2) process in terms of θ1:2 is
which appears on the right hand side of Figure 3.1 and is the subject of Theorem 3.5.3. There are three types
of pairs: both real and distinct (θ12 − 4θ2 > 0); complex conjugate pair (θ12 − 4θ2 < 0); 2 real roots both equal
to −θ1 /2 (θ12 − 4θ2 = 0). The picture shows the values of θ1:2 with complex roots.
non−invertible
1.0
non−invertible non−invertible
−1.0
−1.5
−2 −1 0 1 2
θ1
Theorem 3.5.3. For the MA(2) the θ1:2 which yield invertibility satisfy
absolute value is
θ12 1
−θ12 + 4θ2 = θ2 , ⇒ θ2 < 1.
+
4 4
p
Real roots: θ12 − 4θ2 > 0. Largest real root implies −θ1 + θ12 − 4θ2 < 2, so
2
θ12 − 4θ2 < (2 + θ1 ) = 4 + 4θ1 + θ12 , ⇒ −1 < θ2 + θ1 .
p
Finally smallest implies −2 < −θ1 − θ12 − 4θ2
2
(θ1 − 2) > θ12 − 4θ2 ⇒ 1 > θ1 − θ2 .
√
Remark 11. Recall, in the complex conjugate pair case λ1 = b + ic and λ2 = b − ic, where i = −1 (when the
eigenvalues appear in complex conjugate pairs, then so do the eigenvectors: u1 = d + ig and u2 = d − ig, where
p √
d and g are real vectors). In that case |λ1 | = (b + ic)(b − ic) = b2 + c2 = |λ2 |. Hence
b2 + c2 < 1
is needed for these |λ1 | < 1. This is sometimes called the roots are inside the unit circle. The exponential form
√
of conjugate complex pairs λ1 = b + ic and λ2 = b − ic, writes λ1 = ρeiθ , ρ = b2 + c2 and θ = tan−1 (c/b),
t t
while λ2 = ρe−iθ . Then λt1 = (ρ) eiθt and λt2 = (ρ) e−iθt , so |λt1 | = ρt .
The above result, relating the eigenvalues to the roots of a polynomial, hold for the MA(q) process (here
a q-th order polynomial. Then the roots will be a collection of distinct real roots, pairs of real roots and pairs
of complex conjugate roots. If they are all inside the unit circle then the MA(q) is invertible.
Before we end think a little more abstractly, which will help later on models, like the autoregression. Recall
from Chapter 1 the lag operator.
Definition 3.5.4. Define a lag operator L, which works on any time series yt , so that
θ(z) = 1 + θ1 z + ... + θq z q ,
a q-th order lag polynomial. The θ(L) is the lag-polynomial. We saw θ(z) in the CLT for MA(∞) processes in
Theorem 3.3.10 and Lemma 2.
Then
z q θ(z −1 ) = z q + θ1 z q−1 + ... + θq
instead — which is the polynomial we saw above in (3.22). Notice that if z solves
too. The roots of z q θ(z −1 ) = 0 are 1/a1 , ..., 1/aq , the reciprocal of those for θ(z) = 0. Then the requirement
for invertibility, that the roots of z q ϕ(z −1 ) = 0 are strictly inside the unit circle, imply the roots of θ(z) = 0
are outside the unit circle. This leads to confusion, sometimes researchers talk about roots being inside and
sometimes outside the unit circle. The reason is the swapping between these two polynomials: θ(z) and
z q θ(z −1 ).
3.6 Recap
This Chapter has covered a lot of ground! The mail topics are listed in Table 3.1.
Students often get confused as to the point of stationarity. Thinking of it as the time series version of
the identically distributed assumption you see in Introductory Statistics really puts you on a sound footing.
Stationarity opens up the potential wide use of the method of moments, based upon the stationary distribution,
martingale differences or covariance stationarity type properties.
Chapter 4
Memory
This places the one-step ahead prediction at the centre of much of modern time series. This Chapter will focus
on how Yt is impacted by Yt ’s past, Y1:t−1 , that is the degree of memory in the process.
The most famous special case of one-step ahead prediction is a Markov process.
Definition 4.1.1. [Markov process] If Y1:T has, for every t = 1, ..., T , the property
L
Yt | (Y1:t−1 = y1:t−1 ) = Yt | (Yt−1 = yt−1 ) ,
where A ⊥
⊥ B is the standard notation for the random variable A and B are independent and (A ⊥⊥ B) |C notes
A and B are independent conditional on C.
Definition 4.1.2. Let Yt ∈ {0, 1} to each t, then the binary Markov process {Yt } is governed by the transition
probabilities
2
P (Yt = i|Yt−1 = j), i, j ∈ {0, 1} .
61
62 CHAPTER 4. MEMORY
Definition 4.1.3. The first order linear d-dimensional autoregressive process {Yt } (denoted VAR(1), for
vector autoregressions and AR(1) for the scalar case) sets
iid
Yt = ϕ1 Yt−1 + εt , εt ∼ , t = 1, ..., T, (4.1)
Yt = Yt−1 + εt , (4.2)
∆Yt = εt .
The VAR(1) process in (4.1) can be reparameterized, placing ∆Yt at its heart, by taking Yt−1 from both sides,
∆Yt = γYt−1 + εt , γ = ϕ1 − Id
T
= αβ T Yt−1 + εt , γ = α β (rank factorization, non-unique: r = rank(γ) ∈ {0, 1, ..., d} ).
d×d d×r r×d
This is called an error correction model, which appears often in applied research. When r < d, then this is an
example of reduce rank regression, regressing the d-dimensional ∆Yt on the r-dimensional β T Yt−1 .
L
Yt | (Y1:t−1 = y1:t−1 ) = Yt | (Yt−K:t−1 = yt−K:t−1 ) ,
Example 4.1.4. Shannon (1948) used an K-order Markov model of text, initiating the field of quantative
language models.
iid
Yt = ϕ1 Yt−1 + ... + ϕp Yt−p + εt , εt ∼ ,
K-order Markov model can be written in a dK-dimensional Markov process. This is helpful conceptually
(as Markov thinking can be immediately ported to many initially non-Markov processes) and computationally
(code based on a Markov structure can be used much more widely).
4.1. MARKOV PROCESS 63
Theorem 4.1.6. Assume {Yt } is a K-order Markov process and define the stacked variable Zt := Yt−K:t . Then
{Zt } is Markovian if the dimension of Zt does not grow systematically with t. If the dim(Zt ) does not change
with t then {Zt } is called a “companion form” of the non-Markovian process. If {Yt } is strictly stationary,
L
Yt |Y1:t−1 = Yt |Yt−2:t−1 .
Obviously {Yt } is not a Markov chain. Work with the discrete case, then
which is the stated result. The result on strict stationarity are definitional, the covariance case is immediate.
Example 4.1.7. For the VAR(p) from Definition 4.1.5, stacking Yt−p+1:t , produces the companion form
Yt ϕ1 ϕ2 ··· ϕp εt
Yt−1 Id 0d×d ··· 0d×d 0d
Zt = ϕZt−1 + εt , where Zt = , ϕ = , εt = ,
.. .. .. .. .. ..
. . . . . .
Yt−p+1 0d×d Id 0d×d 0d
a pd-dimensional VAR(1).
Example 4.1.8. An impactful model (we will see it prominently in Chapter 5) is where
iid
∆2 Yt = εt , εt ∼ , noting ∆2 Yt = ∆Yt − ∆Yt−1 = Yt − 2Yt−1 + Yt−2 .
Now {Yt } is 2nd order Markovian. Keep track of the first differences by writing
βt−1 = ∆Yt .
Then
L
Yt |G1:t , Y1:t−1 = Yt |Gt , Gt = g(Gt−1 , Yt−1 ), t = 2, ..., T, dim(Gt ) = dim(Gt−1 ).
Then {Yt } is not Markovian, but {(Yt , Gt )} is. A special case of this is the GARCH(1,1) model of Bollerslev
1/2 2
(1986), which writes that Yt = Gt εt and Gt = α+βGt−1 +γYt−1 , regarding (α, β, γ) as known (or parameters)
and {εt } are i.i.d..
L
Zt |Zt−1 , Zt−2 = Zt |Zt−1 , t = 1, 2, ...,
and conclude that Zt is a companion form. But it is not. It is Markovian, but notice the dimension of Zt
increases systematically through time — ruining the practical usefulness of any Markovian property.
4.2 Autoregression
4.2.1 AR(p) and VAR(p)
The VAR(p) process was set out in Definition 4.1.5. The implication from the prediction decomposition is that,
conditioning on some initial values Y1:p , the
T
Y
fYp+1:T |Y1:p (yp+1:T ) = fε1 (yt − ϕ1 yt−1 − ... − ϕp yt−p ),
t=p+1
which is not an autoregression in {AYt } (unless A is invertible, in which case write Aϕ1 Yt−1 = Aϕ1 A−1 AYt−1 ,
etc., producing an VAR(p) in {AYt } or ϕj = Id for all j = 1, ..., p). This result contrasts with the case where
Theorem 4.2.2. For an VAR(1) process Yt = ϕ1 Yt−1 + εt , the cumulant generating function satisfies, for each
q ≥ 1,
q
X T T
C {ω ‡ Yt }) = C ω T ϕj1 ‡ ε1 +C ω T ϕq+1
1 ‡ Yt−q−1 . (4.3)
j=0
T
P∞
If if j=0 C ω T ϕj1 ‡ ε1 exists, then {Yt } has a stationary solution
∞
X
Yt = ϕj1 εt−j ,
j=0
4.2. AUTOREGRESSION 65
an MA(∞) process.
P∞
Remark 12. If ω is small and E[|ε1 |] < ∞, then C {ω ‡ ε1 } ≃ iω T E[ε1 ] + o(|ω|) as ω → 0, so if j=0 ϕj1 exists,
then T
∞
X T X∞
C ω T ϕj1 ‡ ε1 ≃ i ωT ϕj1 E[ε1 ].
j=0 j=0
P∞
But j=0 ϕj1 exists if all the eigenvalues of ϕ1 are inside the unit circle.
so taking the cumulants of the sum of independent terms yields the result (4.3). Now assume Yt−q−1 has a
T
P∞ T j
cumulant function j=0 C ω ϕ1 ‡ ε1 which exists, then
q
X T ∞
X T
C {ω ‡ Yt } = C ω T ϕj1 ‡ ε1 + C ω T ϕq+1
1 ϕj1 ‡ ε1
j=0 j=0
Xq T ∞
X T
= C ω T ϕj1 ‡ ε1 + C ω T ϕq+1+j
1 ‡ ε1
j=0 j=0
Xq T ∞
X T ∞
X T
= C ω T ϕj1 ‡ ε1 + C ω T ϕj1 ‡ ε1 = C ω T ϕj1 ‡ ε1 ,
j=0 j=q+1 j=0
P∞
the MA(∞) cumulant function. Hence Yt = j=0 ϕj1 εt−j is a strictly stationary solution.
Thus the stationary AR(1) is written as a special case of the MA(∞) process.
Focus on the weights: in this MA(∞) the weights go to zero exponentially fast when eigenvalue of ϕ1 all
P∞
inside unit circle. Think about univariate case: |ϕ1 | < 1, so immediately we have that j=0 |θj | < ∞ and so
P∞ 2
P∞ P∞
j=−∞ θj < ∞ and j=0 |γj | < ∞. It was j=0 |γj | < ∞ plus i.i.d. {εt } (or MD white noise) which drove
the CLT for the sample average in Section 3.3.4 so long as Var(ε1 ) < ∞, while the asymptotic variance was
P∞ 2
j=−∞ θj . This fast decay in the weights is common to all covariance stationary AR(p) processes. This makes
Example 4.2.3. Return to Example 3.2.3, with ε1 being univariate α-stable, where α ∈ (0, 2], then limit of
Pq j
j=0 C(ωϕ1 ‡ ε1 ) as q increases:
∞ α
α
X α j − |ω|
C {ω ‡ Y1 } = − |ω| (|ϕ1 | ) = α,
j=0
1 − |ϕ1 |
for |ϕ1 | < 1, noting Y1 is marginally a scaled α-stable variable. Hence, when |ϕ1 | < 1, the AR(1) model with
α-stable shocks is strictly stationary for all α ∈ (0, 2] — a beautifully simple result compared to the strict
stationarity conditions needed for the general MA(∞) process.
66 CHAPTER 4. MEMORY
Suppose {εt } is not i.i.d. but zero mean weak white noise with Var(ε1 ) < ∞. Then the {Yt } which follows the
iff
|ϕ1 | < 1.
In that case
∞
Var(ε1 ) X 2j
E[Y1 ] = 0, Var(Y1 ) = = ϕ1 Var(ε1 ), γs = Cov(Y1 , Y1−s ) = ϕs1 Var(Y1 ).
1 − ϕ21 j=0
Importantly, if {εt } is a MD white noise with respect to FtY , then the AR(1) can be written as Yt =
P∞
j=0 θj εt−j with the same MD white noise {εt } sequence.
P∞
Assume |ϕ1 | < 1 and Var(Y1 ) = j=0 ϕ2j
1 Var(ε1 ) < ∞. Then at t = 2
then this holds for t = 2, 3, .... The γs result is trivial. Hence there exists a weakly stationary solution.
Now return to the companion form from Example 4.1.7, where
ϕ1 ϕ2 ··· ϕp
Id 0 d×d ··· 0d×d
Zt = ϕZt−1 + εt , ϕ = . .
. . .. .. ..
. . .
0d×d Id 0d×d
If {Yt } is covariance stationary then {Zt } is covariance stationary and so Var(Z1 ) satisfies the equation
T
Var(Z1 ) = ϕVar(Z1 )ϕ + Var(ε1 ). (4.5)
4.2. AUTOREGRESSION 67
One way to numerically solve (4.5) is to use the vec and Kronecker product notation from matrix algebra
(e.g. Chapter 2 of Magnus and Neudecker (2019)). Then in the covariance stationary case
implying
−1
vec {Var(Z1 )} = Ip2 − (ϕ ⊗ ϕ) vec {Var(ε1 )} . (4.6)
The first block column of Var(Z1 ) reads off the matrices γ0 , γ1 , ..., γp−1 of {Yt }.
The same type of result holds for the AR(p) process. To manipulate AR(p) processes it is helpful to write
it using lag-polynomial
If all the roots of the equation z p ϕ(z −1 ) = 0 are inside the unit circle, then
εt
Yt =
ϕ(L)
∞
X ∞
X
= θj εt−j , |θj | < 1,
j=0 j=0
∞
X
= θ(L)εt , θ(z) = θj z j ,
j=0
∞
where the {θj }j=0 are determined by ϕ1:p . That math from Section 3.5.2 immediately ports to this problem:
∞ P∞
there we saw the implied {|θj |}j=0 declines exponentially implying j=0 |θj | < 1, crucially implying that
P∞
j=−∞ |γj | < 1.
p
For the covariance stationary AR(p) process driven by white noise it is relatively easy to go from {ϕj }j=1 to
∞
{γj }j=p , once γ0:p−1 are found, using the Yule-Walker equations given in Theorem 4.2.4 below. But γ0:p−1 can
be determined by working through the companion form and calculating (4.6) numerically. The analytic form of
γ0:p−1 is (in my view) not particularly illuminating.
Theorem 4.2.4. [Yule-Walker] The covariance stationary AR(p) driven by zero mean white noise {εt }, has
Proof. Recall Yt = ϕ1 Yt−1 + ... + ϕp Yt−p + εt , take s > 0 and multiply both sides by Yt+s and take expectations.
Note that
Cov(Yt+s , εt ) = 0, s > 0.
68 CHAPTER 4. MEMORY
Using covariance stationarity, that delivers (4.7). The s = 0 case has an extra term, due to
Cov(Yt , εt ) = Var(ε1 ).
p p
It is relatively easy to go from {γj }j=1 to the {ϕj }j=1 . Stacking the equation (4.7) for s = 1, ..., p, yields,
noting γs = γ−s , that
−1
γ1 γ0 γ1 ··· γp−1 ϕ1 ϕ1 γ0 γ1 ··· γp−1 γ1
γ2 γ1 γ0 ··· γp−2 ϕ2 ϕ2 γ1 γ0 ··· γp−2 γ2
= , so = ,
.. .. .. .. .. .. .. .. .. .. .. ..
. . . . . . . . . . . .
γp γp−1 γp−2 ··· γ0 ϕp ϕp γp−1 γp−2 ··· γ0 γp
4.3 m-dependence
4.3.1 Definition
One-step ahead predictions have many of the features of independent random variables we see in introductory
statistics. A different place independence bites in time series is m-dependence.
∞
Definition 4.3.1. [m-dependence] The {Yt }t=−∞ process is m-dependent iff
Yt ⊥
⊥ Ys , for all t, s and |t − s| > m.
∞
The most basic version of this if when m = 0 then the {Yt }t=−∞ are independent through time.
that is Yt is independent of Y1:t−2 , but only conditionally on Yt−1 . Conditional independence does not imply
unconditional independence, so m-th order Markov processes are not m-dependent.
Definition 4.3.3. [Moving average process] The q-th order (non-linear) moving average process, says
iid
Yt = g(εt−q:t ), εt ∼ , t ∈ Z,
where Yt and εt are assumed to be d-dimensional for all t = 1, ..., T and g is a non-stochastic function.
Example 4.3.4. In the linear case, the q-th order moving average process becomes Yt = θ0 εt + ... + θq εt−q ,
which is q-dependent.
4.3. M -DEPENDENCE 69
The moving average process has Yt depending upon εt−q:t and Yt+q+s depending upon εt+s:t+q+s for s >
0. But εt−q:t and εt+s:t+q+s have no overlap and so are independent, so Yt and Yt+q+s are probabilistically
independent. Thus this process is m-dependent with m = q.
iid
h 4.3.5. Think of the Gaussian, linear, scalar, first order moving average Yt = εt + θεt−1 , where εt ∼
N (0, 1/(1 + θ2 )) and t = 1, ..., T . When t ≥ 3, then
Yt 1 ρ 0
Yt−1 ∼ N 03 , ρ 1 ρ , θ
ρ= ∈ [−1/2, 1/2],
1 + θ2
Yt−2 0 ρ 1
so
ρYt−1 − ρ2 Yt−2 1 − 2ρ2
Yt |Yt−1 ∼ N (ρYt−1 , 1 − ρ2 ), Yt |Yt−2 ∼ N (0, 1), Yt |Yt−2:t−1 ∼ N , .
1 − ρ2 1 − ρ2
Thus Yt is independent of Yt−2 , but given Yt−1 the pair {Yt−2 , Yt } are not independent. This is the opposite of a
Markov process (recall the Markov process had conditional independence but not independence), see Biohazard
4.3.2. That (Yt ⊥
/⊥ Yt−2 ) |Yt−1 has substantial computational and statistical importance for moving averages,
e.g. it means that the law of the Yt |Y1:t−1 , that is the predictive distribution, will depend on all elements of
Y1:t−1 (in the Markov process it only depends upon Yt−1 ) even though Yt is independent of Y1:t−2 .
So far we have seen a CLT based on i.i.d. and MD sequences, but crucially the CLT fails for weak white noise.
What about when {Yt } being m-dependent?
You might expect a CLT to hold as the sequence has a form of replication, due to the independence structure
beyond m-lags. This expectation does indeed hold. CLTs for m-dependent processes have a storied history in
{XT,t }T ≥1,1≤t≤T .
The assumption we will use says the array structure has each T , the time series model
being m-dependent. Define, for each value of T , the time series sum
T
X
ST = XT,t .
t=1
1 In the probability literature only univariate and infinite dimensional CLTs are discussed. The reason for this is that for the
multivariate version the corresponding results can be established by using the Cramér-Wold device (Cramer and Wold (1936)).
PNT T
That is define the d-dimensional vector of constants a, then check univariate CLT for t=1 a XT,t . If this holds for all a, then
the multivariate CLT holds.
70 CHAPTER 4. MEMORY
iid
Example 4.3.6. Suppose Yt ∼ N (0, 1) and, then to get it into the notation used in Theorem 4.3.7 below, write
T
1 X
σT∗−1 ST ∼ N (0, 1).
p
XT,t = √ Yt , ST = XT,t ∼ N (0, 1), σT∗ = Var(ST ) = 1,
T t=1
Theorem 4.3.7 is a CLT for σT∗−1 ST as T increases allowing the data to be non-Gaussian and when the
i.i.d. assumption is replaced by m-dependence. It has a complicated condition named after Lindeberg, which
is standard in not identically distributed CLTs. It will be replaced in a moment by an easier to think about
condition.
Theorem 4.3.7. [Janson (2021)] Suppose {XT,t }T ≥1,1≤t≤T is an m-dependent triangular array of univariate
random variables, Var(XT,t ) < ∞ and E[XT,t ] = 0. Define
p
σT∗ = Var(ST )
and assume that σT∗ > 0 for all T . Assume a Lindeberg-type condition: for every ε > 0 as T → ∞ then
T
1 X 2
E[XT,t 1|XT t |>(εσ∗ ) ] → 0. (4.8)
σT2∗ t=1 T
Then, as T → ∞, so
d
σT∗−1 ST → N (0, 1).
The m = 0 case is exactly the Lindeberg-Feller CLT, one of the most celebrated CLTs which deals with
independent but not identically distributed random variables. Romano and Wolf (2000) and Janson (2021)
Remark 13. The Lindeberg-type condition (4.8) only involves the marginal laws of XT,1 , ..., XT,T , not their
dynamics. It can be tricky to check. The Lyapunov-type condition, which says that if there exists a δ > 0 such
−(2+δ) PT 2+δ
that (σT∗ ) t=1 E[|XT,t | ] → 0, implies the Lindeberg-type condition holds and is easier to think about
δ
for time series. Why? If |XT,t | ≥ εσT∗ then |XT,t /εσT∗ | ≥ 1, so |XT,t /εσT∗ | ≥ 1. The
T T T
1 X 2 1 X 2+δ 1 X 2+δ
E[XT,t 1|XT ,t |>(εσ∗ ) ] ≤ E[|XT,t | 1|XT ,t /εσT |δ >1 ] ≤
∗ E[|XT,t | ],
σT2∗ t=1 T δ ∗
ε (σT )
2+δ
t=1
δ ∗
ε (σT )
2+δ
t=1
hence the RHS going to zero is enough for the Lindeberg-type condition to hold.
Theorem 4.3.7 can be applied to the MA(q) process. The result is beautifully simple and powerfully useful.
Remark 14. In my view Lemma 3 and the MD CLT are the most insightful CLTs in time series.
4.4. INTEGRATION AND DIFFERENCING 71
2+δ
Lemma 3. For the scalar MA(q) process, where there exists a δ > 0 such that E[|Y1 | ] < ∞. Then
q
√ n −1/2 o d X
T Ψ Y − E[Y1 ] → N (0, 1) , Ψ= Cov(Y1 , Y1+s ), (4.9)
s=−q
2+δ Pm
so long as 0 < Ψ < ∞. In the linear MA(q) case, this holds if E[|ε1 | ] < ∞, and j=0 θj ̸= 0, yielding
2
Xq q
X
Ψ = Var(ε1 )θ(1)2 = Var(ε1 ) θj , θ(x) = θ j xj , where θ0 = 1. (4.10)
j=0 j=0
Proof. The MA(q) is an m-dependent process, so we will check the Lyapunov-type condition for Theorem 4.3.7.
2+δ
For the MA(q) process E[|XT,t | ] = E[|Y1 |2+δ ]. The remaining task is to compute σT∗2 , the variance of
PT 2
t=1 Yt , which equals T Var Y . Let
T
X s−1
ΨT = T × Var Y = Var(Y1 ) + 2 1− Cov(Y1 , Ys )
s=2
T
q q
X s−1 2X
= Var(Y1 ) + 2 1− Cov(Y1 , Ys ) = Ψ − (s − 1) Cov(Y1 , Ys )
s=2
T T s=2
→ Ψ, as T → ∞ for fixed q.
2+δ
So by Slutsky’s Theorem the result. In the linear case E[|Y1 | ] < ∞ holds for E[|ε1 |2+δ ] < ∞ is assumed while
q
X q
X
2+δ 2+δ 2+δ 2+δ
E[|Y1 | ]≤ E[|ε1−j θj | ] = E[|ε1 | ] |θj | ,
j=0 j=0
Pq 2+δ
This is bounded as j=0 |θj | < ∞ automatically as q is finite.
Notice linearity only plays a role going from (4.9) to (4.10), yielding a pretty expression for Ψ, and giving a
2+δ
more primitive condition for E[|Y1 | ] < ∞.
Differencing has already played a significant role, going from a martingale {Mt } to a martingale difference
sequence {∆Mt }. Undoing differencing is, oddly, called integration in the time series literature — rather than
what you might think it would be called: cumulating.
Definition 4.4.1. Start with a time series {Ct }, then the integrated version is
t
X
Yt = Y0 + Cj , t = 1, 2, ...
j=1
72 CHAPTER 4. MEMORY
Clearly Ct = ∆Yt . In the time series literature, if {Ct } is a stationary process then the integrated version {Yt }
is sometimes denoted I(1), integrated of order 1.
Ct −1
Yt = , or Yt = (1 − L) Ct ,
1−L
If {Ct } is i.i.d. then the integrated version {Yt } is a d-dimensional random walk,
t
X
Yt = Yt−1 + Ct = Cj , Y0 = 0, t = 1, 2, ....
j=1
The random walk is a special case of a Markov chain, with, cumulant function
T
C {ω ‡ Yt } = log E[eiω Yt
] = tC {ω ‡ C1 } .
Thus the random walk is a non-stationary process, so long as at least one of the diagonal elements of Var(C1 )
are strictly positive.
Typically (but not always) integrated processes, built out of covariance stationary processes, are non-
stationary as the means and variances change through time.
Theorem 4.4.3. If {Ct } is a covariance stationary process then E[Yt ] = tE[C1 ], and
T −1
X |s|
Var(Yt ) = t 1− γs .
T
s=−(t−1)
P∞
If 0 < s=−∞ γs < ∞, then
∞
X
Var(Yt ) ≃ t γs . (4.11)
s=−∞
4.4. INTEGRATION AND DIFFERENCING 73
Proof. Trivial from Theorem 3.3.8, the result on the properties of the sample mean.
h 4.4.4. If {εt } is weak white noise, then Ct = (1−L)εt = εt −εt−1 , so {Ct } is a zero mean, covariance stationary
process (it is a non-invertible MA(1) process), with ρ1 = −1/2. Then Yt = εt − ε0 , by telescoping, which means
P∞
{Yt } is covariance stationary. How is this compatible with Theorem 4.4.3. In this case s=−∞ γs = 0, as
ρ1 = −1/2 and E[C1 ] = 0. Hence the limit result (4.11) is not helpful in that case. The {Ct } process is called
over-differenced.
4.4.2 Cointegration
which have singular covariance matrices, i.e. rank(ββ T ) = 1, but both {Yt,1 } and {Yt,2 } are non-stationary. So
far, nothing is very interesting.
But now think about, for any α,
T T
β2 β2 β1
α Ct = 0, as = 0.
−β1 −β1 β2
Thus
T
β2
α Yt = 0, with probability one,
−β1
n o
T T
so α (β2 , −β1 ) Yt is (the most trivial special case of) stationary. In the time series literature {α (β2 , −β1 )}
is called a cointegrating vector (not unique as it holds for any α), while {Yt } exhibits cointegration.
Var(C1 ) = α2 ββ T , where β = (β1 , β2 )T : then rank(Var(C1 )) = 1. In this case {Yt } is bivariate non-
Var(C1 ) > 0: then rank(Var(C1 )) = 2. In this case {Yt } is bivariate non-stationary and has no cointe-
gration.
74 CHAPTER 4. MEMORY
Definition 4.4.5. Suppose all the elements of the d-dimensional {Yt } is a non-stationary time series. If there
exists a 1 ≤ r < d and r × d dimensional matrix A such that the process {AYt } is stationary, then {Yt } exhibits
cointegation.
The concept of cointegration was formalized in Granger (1981) and Engle and Granger (1987).
Example 4.4.6. [Common trend model] Work with the bivariate process
t
X β1
Yt = βµt + ηt , µt = εj , β =
β2
j=1
where the scalar sequence {εt } is i.i.d. with Var(ε1 ) > 0 and the bivariate sequence {ηt } is stationary. The
{µt } is a common random walk which hits both {Y1,t } and {Y2,t } and has the same structure as (4.12). This
model is often called a common trend model — while the {µt } is called the common trend. Then
Example 4.4.7. Granger (1981) has a beautiful thought experiment. Suppose {Yt,1 } is the number of cars
since 9am which have entered a tunnel which only opened at 9am (which is one way and only has 1 exit and 1
entrance), where t increments up by 1 each second and {Yt,2 } is the number of cars since 9am which have left
the same tunnel. Then both {Yt,1 } and {Yt,2 } are non-stationary processes, incrementing up each second. But,
by construction, Yt,1 − Yt,2 ≥ 0 is the number of cars in the tunnel at time t. It maybe reasonable to model
{Yt,1 − Yt,2 } as a stationary process.
iid
Yt = Yt−1 + εt , εt ∼ .
Here we discuss taking this random walk process to times which can be recorded continuously
{Y (t)}t≥0 .
This is useful as it expands the kinds of processes we can use to build new models (e.g. in financial econometrics),
allows some analysis of problems where the data is not equally spaced through time, some of these objects appear
4.4. INTEGRATION AND DIFFERENCING 75
in important asymptotic arguments (e.g. functional central limit theorem) in later chapters. Before we define
what this process us we need to recall what a càdlàg function is.
A càdlàg function is a function which is right continuous with left limit. It is familiar in introductory
statistics from, for example, the distribution function of a binary random variable — shown in Figure 4.2. A
càdlàg process {Y (t)}t≥0 has
Definition 4.4.8. [Lévy process] A càdlàg process {Y (t)}t≥0 , where L0 = 0, is a Lévy process if it obeys two
additional (beyond càdlàg) properties:
{Y (t) − Y (s)} ⊥
⊥ {Y (b) − Y (a)} , for all 0 ≤ a < b < s < t;
L
{Y (t + s) − Y (t)} = {Y (s) − Y (0)} , for all 0 ≤ s, 0 ≤ t.
is called a subordinator. The Poisson version is the Poisson process, the Gaussian case is Brownian motion.
Lévy processes all have independent and stationary increments, so for t > 0, the cumulant function
C {ω ‡ Y (t)} = tC {ω ‡ Y (1)} ,
E[Y (t) − Y (s)] = (t − s)E[Y (1)], Var[Y (t) − Y (s)] = (t − s)Var[Y (1)], for all t > s > 0.
Example 4.4.9. A Poisson process is a special case of a Lévy process. It has Poisson increments
indep
Y (t) − Y (s) ∼ P o(ψ(t − s)), ψ > 0, t > s ≥ 0. (4.13)
The ψ is usually called the intensity in this context. A simulated path of a Poisson process with ψ = 1 is given
on the left hand side of Figure 4.1. All the jumps in the Poisson process are one, with probability one.
76 CHAPTER 4. MEMORY
5
6
10
4
5
8
3
4
6
Y
Y
3
4
2
2
1
0
0
0
−1
0 2 4 6 8 10 0 2 4 6 8 10 0 2 4 6 8 10
Figure 4.1: LHS: sample path of Poisson process with intensity ψ = 1. Middle: sample path of Brownian motion
with µ = 0 and σ = 1. RHS: sample path of a gamma process.
Example 4.4.10. A Brownian motion is a special case of a Lévy process. It has Gaussian increments
indep
Y (t) − Y (s) ∼ N (µ (t − s) , σ 2 (t − s)), µ ∈ R, σ > 0, t > s ≥ 0. (4.14)
The µ, σ are usually called the drift and volatility, respectively, in this context. A simulated path of a Brownian
motion with µ = 0 and σ = 1 is given in the middle of Figure 4.1. It is the only Lévy process without jumps —
it has, a continuous sample path, but the path is nowhere differentiable. When µ = 0 and σ = 1, then {Y (t)}
is called standard Brownian motion. In the standard Brownian motion case, the process {Y (t) − tY (1)}t∈[0,1]
is called a Brownian bridge.
Example 4.4.11. A gamma process is a special case of a Lévy process. It has gamma increments
indep
Y (t) − Y (s) ∼ Gamma(α (t − s) , β), α, β > 0, t > s ≥ 0. (4.15)
A simulated path of a Brownian motion with α = β = 1 is given on the right hand side of Figure 4.1. This
process has, however small the time increment, strictly positive increments (as the increments are gamma) —
made up of an infinity of jumps. In the gamma process case, the process {Y (t)/Y (1)}t∈[0,1] is called a Dirichlet
process — a continuous time version of the Dirichlet distribution. It plays a large role in modern Bayesian
nonparametric statistics.
The Poisson process and Brownian motion are the two most famous continuous time processes in probability.
Another process which is characterized by the behaviour of its increments in continuous time is an orthogonal
process (this is the continuous time version of the discrete time integration of zero mean, weak white noise).
An orthogonal process is less prescriptive, it just features the first two moments.
F (y)
•1 •
p = 0.6 • •
0.3 • ◦
◦0
• •
0 1 y
1 = F −1 (0.6) = Q(0.6)
Figure 4.2: Example of càdlàg function. Distribution function of a binary random variable with P (Y = 0) = 0.3
and P (Y = 1) = 0.7. Here we compute the 0.6-quantile, F −1 (0.6) = Q(0.6), which is 1.
Of course, all zero mean Lévy processes with the additional feature that Var(Y (t)) < ∞ are orthogonal
Although we have discussed the stationary and covariance stationary properties of autoregressions, from a
statistical perspective it is helpful to classifying how autoregressions are discussed in the empirical literature
and used to carry out inference. Often writers are unclear as to which kind of autoregression they are working
with.
The classification will be into 3 buckets. Each has its own voice (typically the 1st appears in probability
theory, the 2nd in statistics and the 3rd in econometrics) and vices. The focus will be on the AR(1) case to
look at core issues.
iid
Yt = ϕ1 Yt−1 + εt , εt ∼ , t = 1, ..., T.
This model is a special case of the Markov process above. It has been extensively discussed above.
Definition 4.5.2. [AR(1) predictions] The process {Yt } has AR(1) predictions iff it obeys, for t = 1, ...,
Definition 4.5.3. [AR(1) projection] The zero mean, covariance stationary {Yt } has, for each Yt the AR(1)
projection ϕ1 Yt−1 , where
Cov(Y0 , Y1 )
ϕ1 = .
Var(Y0 )
The AR(1) projection model is not a Markov process. It starts with a covariance stationarity assumption,
and then highlights an interesting estimand ϕ1 . There are no additional assumptions are made.
These three autoregressions share many features, but dramatically depart in other aspects. The AR(1)
process was discussed extensively above. Focus on to the AR(1) prediction and projection cases.
AR(1) prediction
If {Yt } has E|Yt | < ∞, then it follows that E[Yt |Ft−1 ] exists. Then for all conditional mean prediction based
models decompose
Yt = E[Yt |Ft−1 ] + Ut ,
where {Ut } is a MD sequence adapted to {Ft }, as E|Yt | < ∞ implies E|Ut | < ∞.
In the special case of the AR(1) prediction assumption: E[Yt |Ft−1 ] = ϕ1 Yt−1 , we can use the MD property
of {Ut } to carry out inference on ϕ1 , e.g. based on MD laws of large numbers or CLTs.
AR(1) projection
In a lot of modern econometrics and statistics regression is presented as a linear projection. The time series
version of this, for 1 lag, is AR(1) projection which comes from assuming {Yt } to be a zero mean covariance
stationary sequence and define a projection estimand
2 E[Y0 Y1 ] γ1
ϕ1 = arg min E[(Y1 − aY0 ) ] = = ∈ [−1, 1].
a E[Y02 ] γ0
Then {ϕ1 Yt−1 } is the linear projection of {Yt }. The properties are immediate, from familiar properties of
regression.
Theorem 4.5.4. If {Yt } is covariance stationary, define the the projection error
Ut = Yt − ϕ1 Yt−1 , t = 1, 2, ....
This error process {Ut } is covariance stationary and has four properties (I do not know a fifth!)
More broadly,
E[Ut Ut−s ] ̸= 0, s > 1,
so {Ut } is not weak white noise, nor i.i.d. nor a MD sequence. Thus laws of large numbers and CLTs will be
tricky!
= 0, using (4.17),
2 E[Y0 Y1 ]
ϕ1 = arg min E[(Y1 − aY0 ) ] = ,
a E[Y02 ]
then the method of moments principle can be used to estimate E[Y0 Y1 ] and E[Y02 ] and so deliver ϕ
b . Of
LS
course, the ϕ
b is also the numerical solution to the least squares principle:
LS
T
X
ϕ
b = arg min
LS (Yt − aYt−1 )2 .
a
t=2
4.5.3 Properties of ϕ
b
LS
First note that for a AR(1) prediction problem, if Var(Yt−1 ) < ∞ and Var(Ut ) < ∞ for every t, then {Yt−1 Ut }
is a MD sequence with respect to FtY .
Why?
Check conditions (a) and (b) for MD sequence. First (a): that is E|Yt−1 Ut | < ∞ for every t. The Cauchy-
Schwartz inequality says that for a pair of random variables {X, Z} where Var(X) < ∞ and Var(Z) < ∞,
then
2 2
so {E|Yt−1 Ut |} ≤ E[Yt−1 ]E[Ut2 ]. So Var(Yt−1 ) < ∞ and Var(Ut ) < ∞ for every t, is enough for condition (a).
Now for condition (b): that is E[Yt−1 Ut |Y1:t−1 ] = 0. The
as desired. So {Yt−1 Ut } is a MD sequence so long as Var(Yt−1 ) < ∞ and Var(Ut ) < ∞ for every t.
Again assuming Var(Ut ) < ∞, define σt2 = Var(Ut |Y1:t−1 ), then under the AR(1) prediction case
t
X t
X t
X
2 2
⟨M, M ⟩t = Var(∆Mj |Y1:j−1 ) = Yj−1 Var(Uj |Y1:j−1 ) = Yj−1 σj2 , t = 2, 3, ...,
j=2 j=2 j=2
a.s.
If ⟨M, M ⟩T → ∞ as T → ∞, then MT / ⟨M, M ⟩T → 0 using Theorem 2.3.14, the martingale strong law of large
PT 2
PT
numbers. If t=2 Yt−1 σt2 / t=2 Yt−1
2
is bounded from above (e.g. this is guaranteed if σt2 ≤ d < ∞ for all t),
Next focus on the CLT for the prediction model, in the |ϕ1 | < 1 case, using a MD CLT. I will state a basic
result, many of the conditions can be relaxed.
Theorem 4.5.5. Assume the AR(1) prediction model, |ϕ1 | < 1 and Var(Ut ) = Var(U1 ) < ∞ for all t. Then
if Var(Yt−1 Ut ) = Var(Y0 U1 ) < ∞ and Yt2 is ergodic,
√ E(Y02 U12 )
d Var(U1 )
T ϕb − ϕ1 →
LS N 0, , E[Y02 ] = .
E[Y02 ]2 1 − ϕ21
√
d Var(U1 )
= N 0, 1 − ϕ21 .
T ϕLS − ϕ1 → N 0,
b
2 (4.18)
E[Y0 ]
4.5. INFERENCE AND LINEAR AUTOREGRESSION 81
2
E[|Yt−1 Ut |] ≤ E[Yt−1 ]E[Ut2 ]
As |ϕ1 | < 1 and Var(Ut ) = Var(U1 ), so {Yt } is covariance stationary. Thus E[|Yt−1 Ut |] < ∞ holds so {Yt−1 Ut }
is a MD sequence. As we assumed Var(Yt−1 Ut ) = Var(Y0 U1 ) < ∞, the {Yt−1 Ut } is a MD white noise sequence,
so
T
1 X d
√ Yt−1 Ut → N (0, Var(Y0 U1 )) .
T t=2
Covariance stationarity plus ergodicity, implies that
T
1X 2 p
Y → E[Y02 ].
T t=2 t−1
Combining the results using Slutsky’s theorem yields the first displayed equation.
Lemma 4. Assume the AR(1) process and |ϕ1 | < 1 and Var(U1 ) < ∞, then (4.18) holds.
Proof. It is the homoskedastic case while E(Y02 U12 ) = E[U12 ]2 / 1 − ϕ21 while ergodicity holds as γs = ϕs1 → 0.
The inference for the AR(1) prediction model hides a danger for empirical work, which is not much discussed
in the literature.
Remark 15. The assumption that Var(Y0 U1 ) < ∞ is not trivial empirically. Think of (a special case of a
ARCH(1) model σt2 = Y02 )
iid
U1 = |Y0 | ε1 , εt ∼ , E[ε1 ] = 0, Var[ε1 ] = 1,
then E[U12 ] = E[Y12 ] = E[Y12 ] < ∞, so {Ut } is MD white noise. But now Var(Y0 U1 ) < ∞ needs E[Y04 ] < ∞
— which is quite an ask for many time series problems. Under homoskedasticity this problem disappears, just
The AR(1) projection only makes sense under the covariance stationarity assumption. Then recall the {Ut } has
four properties — given in Theorem 4.5.4. As a result
T T
1X 1X
T −1 MT = Yt−1 Ut , so E[T −1 MT ] = E[Yt−1 Ut ] = 0
T t=2 T t=2
T T
1X 2 h
2] = 1
i X
\
E[Y 2
0] = Y , so \
E E[Y Y 2 = E[Y02 ].
T t=2 t−1 0
T t=2 t−1
2 p
If we assume that Yt2 is ergodic then E[Y
\ 2
0 ] → E[Y0 ].
82 CHAPTER 4. MEMORY
The main problem is deriving a CLT when we do not know any properties of the {Yt−1 Ut } sequence beyond
it has a zero mean. One popular approach is to assert some high level assumptions. This is very close to
being circular — assuming the CLT for ϕ
b holds directly.
LS
Theorem 4.5.6. Assume {Yt } is covariance stationary and think about the AR(1) projection. Assume {Yt−1 Ut }
is covariance stationary and ergodic sequence and that it obeys the CLT
∞
T
!
√ 1X d
X
T Yt−1 Ut → N (0, Ψ(Y0 U1 )), Ψ(Y0 U1 ) = Cov(Y0 U1 , Y−s U1−s ), 0 < Ψ(Y0 U1 ) < ∞.
T t=2 s=−∞
Then
√
b − ϕ1 →d Ψ(Y0 U1 )
T ϕ LS N 0, . (4.19)
E[Y02 ]2
1
PT 2 p
Proof. Ergodicity plus stationarity means T t=2 Yt−1 → E(Y02 ). Then combine with the assumed CLT, yields
(4.19) by Slutsky’s theorem.
h 4.5.7. This looks simple, but we have already seen CLTs of time series are not trivial, e.g. covariance
stationarity is not enough to drive a CLT. Some kind of replication is needed. Replication can be obtained by
Perhaps the prettiest version of this is to assume up-front that {Yt } is strictly stationary, which implies
{Yt−1 Ut } is strictly stationary. If, additionally, Var(Y1 ) < ∞ and Var(Y0 U1 ) < ∞, then strict stationarity
implies both {Yt } and {Yt−1 Ut } are covariance stationary. Then ergodicity just needs Cov(Y02 , Ys2 ) → 0 as
s → ∞. The CLT is still hard due to the absence of replication. If {Yt } is additionally assumed m-dependent,
then the CLT holds (and indeed ergodicity will hold) subject to a Lindeberg condition.
Theorem 4.5.8. Assume {Yt } is strictly stationary, m-dependent with Var(Y1 ) < ∞ and Var(Y0 U1 ) < ∞.
Then
M
√
b − ϕ1 →d Ψ(Y0 U1 ) X
T ϕ LS N 0, , Ψ(Y0 U1 ) = Cov(Y0 U1 , Y−s U1−s ), 0 < Ψ(Y0 U1 ) < ∞,
E[Y02 ]2
s=−M
Of course this relies on M being finite. But there are CLTs for m-dependent processes where m in-
crease with T or you can relax and think of M being large and finite, e.g. m is 100 billion. So long as
P∞
0 < s=−∞ Cov(Y0 U1 , Y−s U1−s ) < ∞, using this finite m case is likely to yield only tiny errors.
P∞
Of course estimating the long-run variance s=−∞ Cov(Y0 U1 , Y−s U1−s ) is difficult in practice.
4.6. RECAP 83
4.6 Recap
We have covered a lot of ground. Special cases of the predictive distribution play enormously roles in time
series: Markov processes, AR, VAR, ECM, Brownian motion, Poisson processes, ARCH. Here some of these
are developed. Sometimes they are linked back to MA(∞) processes and martingales.
Table 4.1 has some of the highlighted ideas from the Chapter.
84 CHAPTER 4. MEMORY
Chapter 5
5.1 Seasonality
In descriptive analysis we often try to get a fast summary of the structure of the data by reporting sample aver-
ages, sample quantiles or correlograms. But time series often have fascinating structures which are important
the seasons (e.g. cereals harvested per month, price of fruit, public holidays),
In time series these types of effects are usually called seasonal effects, where S are the number of seasons
being modelled. Sometimes S can be very large (e.g. think about diurnal features where the data is recorded
at the second-by-second basis, twenty-four hours a day), or small (e.g. S = 4, for quarterly data)
Example 5.1.1. The time series of electricity demand gives insight into managing a modern electrical grid,
which is becoming an every more important skill as the amount of solar and wind energy ramps up. Fig-
ure 5.1 shows demand in Great Britain every 30 minutes from 1 January 2023 until 6 September 2023.
The data comes from the UK’s national grid, which manages the movement of electricity over the coun-
try and between the UK and other countries (through interconnectors). The data was downloaded from
https://data.nationalgrideso.com/demand/historic-demand-data
The website has the data going back to 2009, so T ≃ 122k. A wonderfully accessible way to gain an under-
standing of UK electricity generation, in real time, is via the website https://grid.iamkate.com/, due to data
scientist Kate Morley, who displays data merged from various sources. The left hand side plots the demand on
85
86 CHAPTER 5. ACTION: DESCRIBING VIA FILTERING AND SMOOTHING
35
40
35
35
30
30
30
Output
Output
Output
25
25
25
20
20
20
15
15
15
0 1 2 3 4 0 2 4 6 8 10 12 14 0 100 200 300
Figure 5.1: Demand for electricity from the Great Britian electric grid during 2023, in giga watts, recorded
every 30 minutes.
the first 4 days of the year, showing the diurnal feature of the data. The middle graph plots the first 14 days
which shows the day of the week effect (electricity demand is lower on Sunday, not much lower on Saturday).
The right hand side shows the demand so far this year, which has fallen during the summer but will ramp up
again over the rest of the year. Hence to model this dataset at least 3 seasonal components would be needed:
one with S = 48, one with S = 24 × 7 and one with S = 24 × 365. The data has many usages, e.g. short run
forecasting of demand, longer term planning dealing with managing the peaks of demand and exploiting the
troughs to recharge storage capacity (batteries or pumped storage). Similar datasets are available for electricity
generation, e.g. through coal, gas, wind, solar and transferred through interconnectors.
Yt = γt + Xt , t = 1, 2, ..., T,
where {γt } is the additive deterministic seasonal “component” of a time series. Here we will use the convention
that seasonality cumulates to zero over the scan of a season.
γt = γ (t/S − ⌊t/S⌋)
while ⌊x⌋ is the floor function (largest integer not larger than x).
One approach to dealing with seasonality transform the series to make it less sensitive to seasonality.
5.1. SEASONALITY 87
Example 5.1.3. [Inflation] Traditionally price inflation is reported every month, but is calculated as the
percentage change in prices over the last year. Write {Yt } as the price index at time t, measured in months,
then arithmetic annual inflation is the year of monthly price moves
∆S Yt
100 × , ∆S Yt = Yt − Yt−S , S = 12,
Yt−S
where ∆S = 1 − LS is called the seasonal difference operator. Notice that
the sum of S monthly changes. Why report these annual moves, rather than, say 6 month moves? Well
∆S Yt (γt − γt−S ) Xt − Xt−S
100 × = 100 × + 100 ×
Yt−S Yt−S Yt−S
Xt − Xt−S
= 100 × , as γt − γt−S = 0 under deterministic seasonal.
Yt−S
Hence annual inflation is robust to deterministic seasonal variations in prices, due to the use of seasonal differ-
ences.
Example 5.1.4. [Financial volatility] Setup a database of the price in a financial market {Yj }j=1 measured
every ∆ > 0 seconds. Then the quadratic variation (QV) of {Yj } is the time series
t
X
[Y, Y ]t = (Yj − Y(j−1) )2 .
j=1
is the increase in the QV over the last day — differencing out the complicated diurnal patterns of trading often
seen in financial markets (a high percentage of trading happens at the start and end of the trading day). In
financial econometrics ∆S [Y, Y ]t is called the realized variance and was formalized by Andersen, Bollerslev,
Diebold, and Labys (2001) and Barndorff-Nielsen and Shephard (2002). It is the modern way of measuring
how volatile a financial market has been using high frequency data. Its simplicity is due to it being robust to
seasonal patterns, again through a seasonal difference operator.
A second approach is to estimate the seasonal component and then remove it — allowing the “seasonally
adjusted” series to be subsequently analyzed by other researchers.
Yt − γ
bt , t = 1, ..., T,
Example 5.1.6. Most U.S. Government time series are published as “seasonally adjusted,” that is preprocessed
using a statistical model or procedure to try to remove the seasonal effects before it is published. Hence,
for example, many U.S. macroeconomists rarely discuss seasonality, although unemployment and prices have
pronounced seasonal patterns, e.g. the classic time series textbook by Hamilton (1994) has little discussion of
seasonality. In time series practice outside academic macroeconomics, on the other hand, seasonality is a very
big deal — worthy of serious thought.
A major approach to statistical model building of the seasonal function {γ(u)}u∈[0,1] is to use a Fourier repre-
sentation {γB (u)}u∈[0,1] — where B determines the number of terms in a sum. This representation is
B
X
γ1+2(j−1) cos(λj u) + γ2j sin(λj u) , λj = 2πj, u ∈ [0, 1].
j=0
h 5.1.7. I realize there are some coefficients γ−1 and γ0 with unusual subscripts. They will disappear in a
B
X
γB (u) = γ1+2(j−1) cos(λj u) + γ2j sin(λj u) , λj = 2πj, u ∈ [0, 1],
j=0
noting sin(λ0 u) = 0 making γ0 redundant. Because of this the series is often written as
B
X
γB (u) = γ−1 + γ1+2(j−1) cos(λj u) + γ2j sin(λj u) .
j=1
As B goes off to infinity it is able to well approximate continuous period functions (Fourier’s Theorem) on
u ∈ [0, 1].
B
X
λ−1
= γ1 + j γ1+2(j−1) sin(2πj) − γ2j {cos(2πj) − 1}
j=1
= γ−1 .
To make {γB (u)}u∈[0,1] suitable for a seasonal model, which should integrated to 0, makes sense to set γ−1 = 0,
yielding what the time series literature typically calls:
5.1. SEASONALITY 89
or in matrix form:
∗
γ2 (1/4) 1 0 −1 γ1
γ 1 γ2∗ S
γ2 (2/4) 0 −1 1 X
γk∗ = 0,
γ2 = ∗
γ2 (3/4) = −1
, where
0 −1 γ3
γ3 k=1
γ2 (4/4) 0 1 1 γ4∗
which is the same as the textbook dummy approach to seasonality. This dummy approach is often seen in
simple regression analysis with small S.
A practical advantage of the Fourier approach is that it can be used even if S varies over time (e.g. S is the
number of days in a month).
It turns out that the {γB (u)} function has a recursive structure which is helpful computationally and inspires
further model building. To see this, the following well known stacked trigonometric identity is useful.
γ cos(λt) + γ ∗ sin(λt)
γt
: =
γt∗ −γ sin(λt) + γ ∗ cos(λt)
γt−a cos (λa) sin (λa)
= ϕ(λa) ∗ , ϕ(λa) = .
γt−a − sin (λa) cos (λa)
cos(λt) = cos {λa + λ (t − a)} = cos (λa) cos {λ (t − a)} − sin (λa) sin {λ (t − a)}
sin(λt) = sin {λa + λ (t − a)} = sin (λa) cos {λ (t − a)} + cos (λa) sin {λ (t − a)} ,
Thus
γ cos(λt) + γ ∗ sin(λt) = γ cos (λa) cos {λ (u − a)} − γ sin (λa) sin {λ (u − a)}
and
−γ sin(λt) + γ ∗ cos(λt) = −γ sin (λa) cos {λ (t − a)} − γ cos (λa) sin {λ (t − a)}
Writing a = 1/S, and building a collection of such terms each with a different frequency:
γt,1 γt−1,1
= ϕ(λ1 /S) ,
γt,2 γt−1,2
Then (notice the sum does not involve the γt,2j term, it just appears in the computation)
B
X
γB (t/S) = γt,1+2(j−1) , t = 1, 2, ..., S,
j=1
Example 5.1.12. [Continuing Example 5.1.8] Writing a = 1/S and u = t/S, then equation (??) in Theorem
?? implies that γt = γ(t/S) has the recursive structure
B
X γt,1+2(j−1) γt−1,1+2(j−1)
γt = γt,1+2(j−1) , = ϕ(λj /S) , j = 1, ..., B.
γt,2j γt−1,2j
j=1
with
ϕ(λ1 /S) 02×2 ··· ··· 02×2
γt,1
02×2 ϕ(λ2 /S) 02×2 ··· 02×2 γt,2
.. ..
.. ..
..
αt+1 =
. 02×2 . . . αt ,
αt =
. .
.. .. .. ..
γt,2B−1
. . . . 02×2
02×2 02×2 ··· 02×2 ϕ(λB /S) γt,2B
5.1. SEASONALITY 91
The above is a 2B-dimensional VAR(1), inside of which there are B bivariate VAR(1) processes, but with
no noise in sight — rather like we saw when we studied invertibility of a moving average in Section 3.5.1.
Do these bivariate VAR(1)s converge to 02 as it is iterated through t = 1, 2, ...? From Section 3.5.1 we
know this would happen if the eigenvalues were inside the unit circle. The result below shows that for each
j = 1, ..., B the ϕ(λj /S) has a pair of complex conjugate eigenvalues who are on the unit circle. Hence there
is no convergence to zero. They go up and down, as time evolves, never converging to 0.
Remark 16. Think of (??) as a difference equation Zλu = ϕ(aλ)Zλ(u−a) . The eigenvalues, written ω1:2 , of
ϕ(λ) solve
2 2 2 2
|I2 ω − ϕ(λ)| = {ω − cos (λ)} + sin (λ) = ω 2 − 2ω cos (λ) + cos (λ) + sin (λ) = ω 2 − 2ω cos (λ) + 1,
which are, for any choice of λ, a complex conjugate pair of roots on the unit circle:
q
2
2 cos (λ) ± 4 cos (λ) − 4 q
2
ω1:2 = = cos (λ) ± i 1 − cos (λ) implying |ωk | = 1, k = 1, 2.
2
In applied time series the seasonal components are rarely unchanged through lengthy periods of time. Think of
electricity demand. The advent of efficient heat pumps reduces electricity demand during harsh seasons, while
increase it due to the move away from gas central heating during winter. Likewise the use of electric vehicles,
increases electricity demand, but it may flatten the seasonal effects due to the increase spread of high powered
batteries.
How can seasonal components be allowed to change through time? We start with
Yt = γt + εt , t = 1, 2, ..., T,
but now allow the seasonal component {γt } to be stochastic. It is generated by the stochastic seasonal compo-
nent.
Example 5.1.14. [Stochastic trigonometric seasonal model] The stochastic trigonometric seasonal MD model
has
γt = ((1, 0) ⊗ ι )T αt ,
B×1
92 CHAPTER 5. ACTION: DESCRIBING VIA FILTERING AND SMOOTHING
where
ϕ(λ1 /S) 02×2 ··· ··· 02×2
02×2 ϕ(λ2 /S) 02×2 ··· 02×2
.. ..
.. ..
αt+1 = ϕ1 αt + ω t , ϕ1 =
. 02×2 . . .
.. .. .. ..
. . . . 02×2
02×2 02×2 ··· 02×2 ϕ(λB /S)
where {ω t } is a martingale difference sequence — that is E [|ωt,j |] < ∞ and
Z
Yt
E ω t |Ft−1 = 0, where Zt = .
αt+1
Taken all together, the system of data plus the stochastic seasonals can be written as
0 ιT εt
Zt = Zt−1 + .
02B ϕ1 ωt
Z 2
Var ω t |Ft−1 = diag(σω,1:2B ).
2
The textbook version of this model has σω,1:2B = σω2 . The Hindrayanto, Aston, Koopman, and Ooms (2013)
approach has
2 2
σω,1+2(j−1) = σω,2j , j = 1, 2, ..., B.
The same mathematical structure is sometimes used to explicitly force a component to cycle.
Definition 5.1.15. [Stochastic cycle model] A stochastic cycle MD model {ψt } is where
T
1 ψt,1
ψt = ψt,1 = ψt , ψt = , ψ t+1 = ρC × ϕ(λC )ψ t + κt , λC ∈ [0, π], ρC ∈ [0, 1).
0 ψt,2
T
where κt = (κt,1:2 ) is a bivariate MD sequence.
Remark 16 applies, so now ψ t is a VAR(1) with a pair of complex conjugate roots, with absolute value which
is |ρC | < 1. Hence if {κt } is a MD white noise process then ψt is covariance stationary. In the special case
Z
where λC = π, then ψt = ρC ψt−1 +κt,1 as sin (λC ) = 0. Typically researchers impose that Var[κt |Ft−1 ] = σκ2 I2 .
h 5.1.16. When a cycle model is used it is important to be careful about modeling the additional noise
component, e.g.
Yt = ψt + εt .
Once one goes beyond {εt } being a MD white noise process, and allow it to be covariance stationary, then the
model threatens trouble, as {ψt } is typically covariance stationary too. One approach would be to make {εt }
and AR(p) but ensure it has no complex roots.
5.2. STOCHASTIC TREND 93
Example 5.2.1. Many countries have grown wealthier through time, per adjusting for population growth. A
main way of measuring this is to compute a real GDP per capita number through time, looking at the rate
of growth. Of course, GDP has its own limits as a measure of human welfare, e.g. Coyle (2015). However,
we sidestep that issue here. The left hand side of Figure 5.2 shows real GDP per capita in the US from 1947
onwards (seasonally adjusted), based on 2017 dollars (it is the series on FRED coded: A939RX0Q048SBEA).
The middle plot shows the series in logs, which makes looking at growth rates easier. It is helpful to put these
U.S. Real GDP per capita Log U.S. real GDP per capita Log U.K. real, per capita, GDP
11.0
10
60000
50000
9
10.5
log of GDP
log GDP
GDP
40000
8
30000
10.0
20000
7
1960 1980 2000 2020 1960 1980 2000 2020 1400 1600 1800 2000
Figure 5.2: Per capita real GDP. Left: US. Middle: log for US. Right: log for UK.
growth rates in a broader historical context. Data over very long periods are rare. Broadberry, Campbell,
Klein, Overton, and van Leeuwen (2015) estimated annual real GDP for England from 1270 to 1870, which
can be sliced with GDP numbers from Great Britain onwards. Here the data stops in 2016. It is in the file
UKGDP1270.csv, downloaded from Our World in Data website. The right hand side of Figure 5.2 shows the
result on a log-scale, the series tends to go upwards, but the rate at which is goes up has changed over time.
In terms of the US data, does the middle graph indicate that the US growth rate fallen? This is the subject of
the books by Gordon (2016) and DeLong (2022). In terms of the UK on the right hand side, you can see from
the early 1700s the establishment of some systematic economic growth, which seems to accelerate again after
about 1820 and again from about half way through the 1900s. Another feature, which is less obvious, is the
yearly variability of the series about the long-run development of GDP seems to have fallen a great deal after
the early 1700s — this is important to human welfare, as rapid falls in real GDP can be catastrophic. Finally,
the first four centuries are very thought provoking. It shows almost no growth at all. I find this the most
stunning data I have ever seen. Economic growth is not an inevitable result of the human condition!
94 CHAPTER 5. ACTION: DESCRIBING VIA FILTERING AND SMOOTHING
In a simple growth rate regression model, e.g. for the log of GDP,
Yt = µt + εt ,
where
then β would be the growth rate. The {µt } process will be called a trend component. It will be helpful later
to get used to writing the system as a VAR(1):
Yt
Zt =
µt+1
0 0 1
= ϕ1 Zt−1 + , ϕ1 = .
β 0 1
Notice that the elements of the first column of ϕ1 are all zero, so Yt−1 has no impact on Zt,2 = µt . All the time
series dependence carries through via {µt }. The column of zeros in ϕ1 , means the memory in {Yt } is induced
indirectly, in this case through {µt }. These columns of zeros in ϕ1 will be a common theme in the discussion
below.
Y Y Y
E[Yt+s |Ft−1 ] = E[µt+s |Ft−1 ] + E[εt+s |Ft−1 ], s ≥ 0,
Y Y
= E[µt |Ft−1 ] + sβ + E[εt+s |Ft−1 ].
In the special case where {εt } is a MD sequence with respect to FtY , then
Y Y
E[Yt+s |Ft−1 ] = E[µt |Ft−1 ] + sβ,
Yt − µ
bt , t = 1, ..., T,
One popular approach to the statistical analysis of trending series is to difference away the growth:
= β + ∆εt ,
5.2. STOCHASTIC TREND 95
so for the differenced series the growth rate is the location of the differenced series. Often β might be estimated
using data Y1:T through a simple statistic like
T
1 X
β
b = ∆Yt
T − 1 t=2
YT − Y1
= , telescoping
T −1
εT − ε1
= β+ .
T −1
The UK data, at least, suggests the growth rates has changed over long periods of time, while Gordon (2016)
has argued at great length that U.S. growth rates have fallen in recent times — an important conclusion for
public policy if it is true. One approach to deepen the regression model to deal with this kind of question, is
to make the growth rate a stochastic process {βt }, so
µt+1 = µt + βt = µ0 + β0 + β1 + ... + βt .
Writing
βt+1 = βt + ξt ,
then
Yt εt 0 1 0
Zt = µt+1 = ϕ1 Zt−1 + 0 , ϕ1 = 0 1 1 . (5.1)
βt+1 ξt 0 1 1
In the special case where (εt , ξt ) is independent of Zt−1 , then {Zt } is a Markov chain, but of course {Yt } is not.
Notice again the elements of the first column of ϕ1 are all zero.
Definition 5.2.3. [Smooth trend] A core model is to assume that E[|ξt |] < ∞ and
Z
E[ξt |Ft−1 ] = 0,
that is the {βt } is a martingale — expressing ignorance about how the slope might change in the future. The
implied {µt } process is called a smooth trend component. Notice that
= ∆2 µt+2 .
Y Y Y
E[Yt+s |Ft−1 ] = E[µt+s |Ft−1 ] + E[εt+s |Ft−1 ]
t+s
X
Y Y Y
= E[µt |Ft−1 ]+ E[βj |Ft−1 ] + E[εt+s |Ft−1 ]
j=t+1
Y Y Y
= E[µt |Ft−1 ] + sE βt |Ft−1 + E[εt+s |Ft−1 ], as {βt } is a martingale.
96 CHAPTER 5. ACTION: DESCRIBING VIA FILTERING AND SMOOTHING
In the special case where {εt } is a MD sequence with respect to FtZ , then
Y Y Y
E[Yt+s |Ft−1 ] = E[µt |Ft−1 ] + sE βt |Ft−1 ,
Y
so then the prediction is a function of forecast horizon s, is a straight-line with a slope E βt |Ft−1 — the
If {εt , ξt } are i.i.d. independent, zero mean Gaussian variables, then the model (5.1) has
log f (Y1:T , µ1:T |µ−1:0 ) = log f (Y1:T |µ1:T ) + log f (µ1:T |µ0,−1 )
T T
1 X
2 1 X 2
= c− (Yt − µt ) − ∆2 µt
2Var(ε1 ) t=1 2Var(ξ1 ) t=1
( T T
)
1 X
2 1 X 2 2 Var(ξ1 )
= c− (Yt − µt ) + ∆ µt , qξ = ,
2Var(ε1 ) t=1 qξ t=1 Var(ε1 )
so the posterior mean and median of µ1:T |Y1:T is the L2-penalized least squares estimator of µ1:T , which is
T T
X 1 X 2 2
µ
b1:T = arg min (Yt − µt )2 + ∆ µt .
µ1:T
t=1
qξ t=1
b1:T is a ridge regression applied to ∆2 µt . As qξ increases the more flexible the µ1:T can be.
The µ In
the limits qξ → ∞, so µ
bt → Yt , while as qξ → 0, so µ
bt → Y for all t. In macroeconomics µ
b1:T is sometimes
called the Hodrick and Prescott (1997) filter (they advocated the universal data free choice of qξ = 1/1600 for
quarterly macro data), but it appeared earlier in many areas of economics, statistics and applied mathematics,
e.g. Whittaker (1923), Kimmeldorf and Wahba (1970) (cubic spline), Green and Silverman (1994) (penalty),
Harvey (1989) (time series smoothing).
If {εt , ξt } are i.i.d. independent, zero mean random variables, with εt being Gaussian but ξt is Laplace, then
T T
1 X 1 X
log f (Y1:T , µ1:T |µ−1:0 ) = c− (Yt − µt )2 − p ∆2 µt ,
2Var(ε1 ) t=1 Var(ξ1 )/2 t=1
( T T
) p
1 X
2 1 X 2 Var(ξ1 )/2
= c− (Yt − µt ) + ∆ µt , qξ = ,
2Var(ε1 ) t=1 qξ t=1 2Var(ε1 )
so the posterior mode of µ1:T |Y1:T is the L1-penalized least squares estimator of µ1:T , which is
T T
X 1 X 2
µ
e1:T = arg min (Yt − µt )2 + ∆ µt .
µ1:T
t=1
qξ t=1
e1:T is a Lasso regression applied to ∆2 µt . It seems due to Kim, Koh, Boyd, and Gorinevsky (2009).
The µ
µt } has some of the fitted ∆2 µ
An extensive discussion of it is given in Tibshirani (2014). Crucially the {e et set
to zero, so {e
µt } is a continuous, piecewise linear function of time.
5.3. HIDDEN MARKOV MODELS 97
Some series have time-varying levels but no particular systematic growth. We will see an example of this in
Example 5.5.7.
A local linear trend model mixes the martingale trend with the smooth trend model to yield
Yt εt 0 1 0
Zt = µt+1 = ϕ1 Zt−1 + ηt , ϕ1 = 0 1 1
βt+1 ξt 0 1 1
where {εt , ηt , ξt } is MD with respect to FtZ . Notice again the elements of the first column of ϕ1 are all zero.
Y Y
E[Yt+s |Ft−1 ] = E[µt |Ft−1 ], as {εt } is MD, s ≥ 0.
Y
Notice that E[µt |Ft−1 ] is not Yt−1 , so the {Yt } process is not a martingale, even though {µt } is. This structure
Y Y Y
E[Yt+s |Ft−1 ] = E[µt |Ft−1 ] + sE[βt |Ft−1 ], s ≥ 0.
so here both the level and slope needs to be estimated from the data in order to extrapolate it into the future.
The next model structure is perhaps the most commonly used in modern time series.
Definition 5.3.1. [Hidden Markov model] The Hidden Markov model (HMM) of {Yt , αt+1 }, labels Yt the t-th
observation and αt as the t-th state. The HMM has two assumptions:
(a) The
L
Yt | (Y1:t−1 , Yt+1:T , α1:T ) = Yt |αt
98 CHAPTER 5. ACTION: DESCRIBING VIA FILTERING AND SMOOTHING
αt+1 |αt
rather than αt |αt−1 — obviously this is just a notational change (it is carried out to make some formulas more
compact). Notice that although {αt } is a Markov chain, it is mostly non-stationary in applied work.
Remark 17. HMM have other names in the literature. Sometimes they are called state space models or
parameter driven models. In the linear case, they are sometimes called dynamic linear models. HMM are
special cases of Markov random fields, which play an important role in physics and probability as models of
undirected graphs.
The HMM is very extensively used in most areas of applied science, e.g. signal processing, bioinformatics,
pattern recognition and economics.
The HMM induces a Markov chain on
Yt
Zt =
αt+1
with the transition density
having a special form — with the memory in {Yt } being carried only through the Markovian “state” αt .
When Y1:T , α1:T are Gaussian this line of work started with the linear model of Kalman (1960), looking at
the “linear state space system”
Yt 0d×d Zt Ht 0d×r
Zt = = ϕt,1 Z t−1 + Bt ωt , ϕt,1 = , Bt = ,
αt+1 0r×d Tt 0r×d Qt
where
iid
ωt ∼ E [ωt ] = 0d+r , Var [ωt ] = Id+r , r = dim(αt ),
and the matrices ϕt,1 , Bt are non-stochastic. Book length treatments of the many models which are included
in this class include Harvey (1989), West and Harrison (1989) and Durbin and Koopman (2012).
Section ?? will discuss these systems when i.i.d. assumption is replaced by weak white noise.
The local level model captures many crucial features of models with state variables.
To avoid confusion when reading research papers it is useful to clearly delineate three versions of the local
level model:
5.3. HIDDEN MARKOV MODELS 99
In the local level MD model, we have that Var(εt ) < ∞, Var(ηt ) < ∞ and
εt Z Z
E |Ft−1 = 02d , E[εt ηtT |Ft−1 ] = 0d×d .
ηt
Then {µt } is a martingale. But taken together, the local level MD model is not a HMM.
Then {µt } is a random walk. Then the local level model is a HMM with αt = µt . In practice, the noise
terms are typically assumed to have second moments, but there are heavy tailed applications where this
addition is not made.
In the local level WN model, the are assumed to {εt , ηt } is weak zero mean white noise, with Cov(εt , ηt ) =
0d . This is not a HMM. The local level model implies the local level WN model if the noise terms have
a zero mean and their variances exist. The local level MD model is a local level WN model if the noise
terms are unconditionally homoskedastic, that is Var(εt ) = Var(ε1 ) and Var(ηt ) = Var(η1 ) for all t.
where {εt } , {ηt } are zero mean weak white processes with Var(εt ) = Σε and Var(ηt ) = Ση . Then {∆Yt } is
covariance stationary, has an MA(1) representation with
T T
E[∆Yt ] = 0, Var[∆Yt ] = Ση + 2Σε , E[∆Yt ∆Yt−1 ] = −Σε , E[∆Yt ∆Yt−s ] = 0, s > 1.
100 CHAPTER 5. ACTION: DESCRIBING VIA FILTERING AND SMOOTHING
less important). For the MA(1) process Yt = ωt + θ1 ωt−1 , then Cor(Yt , Yt−1 ) = θ1 /(1 + θ12 ), so equating terms
1 1
−2 − qη = (1 + θ1 )2 , implying qn = + (−θ1 ) − 2
θ1 (−θ1 )
which is only possible if θ1 ∈ [−1, 0). Notice that qη monotonically decreases as θ1 moves from 0 to −1.
Proof. The
T
so E[∆Yt ] = 0 and E[∆Yt ∆Yt−s ] = 0 for s > 1, while
= −Σε .
One of the main ways of building a statistical model for a martingale difference sequence is through stochastic
volatility. This is directly important in financial applications, but is super helpful as an ingredient for flexible
statistical models.
iid
Yt = σt εt , εt ∼ , E[εt ] = 0, Var[εt ] = 1, σt ≥ 0, {εt } ⊥⊥ {σs } ,
then {Yt } follows a univariate stochastic volatility (SV) process. The {σt } is called the volatility process. If
{σt } is Markovian, then the SV process is a HMM. More broadly, if some state αt is Markovian, and σt = h(αt ),
If E[σt ] < ∞ for all t, then {Yt } is a MD sequence with respect to FtY .
5.3. HIDDEN MARKOV MODELS 101
If E[σt2 ] = E[σ12 ] < ∞ then {Yt } is a zero mean, weak white noise sequence.
while
2 2 2
E [|Y1 |] = E [|ε1 |] E[σ1 ], Var(|Y1 |) = E[σ12 ]−{E [|ε1 |]} {E[σ1 ]} , Cov[|Y1 | , |Y1−s |] = {E [|ε1 |]} Cov(σ1 , σ1−s ).
The SV model can be used to parameterize a local level MD model, making it a HMM model.
Example 5.3.4. [Local level SV model] The univariate local level SV model assumes
Yt 0 1
Zt = = Z t−1 + Bt ωt , Bt = diag(σt,1:2 ), ωt ∼ N (0, I2 ),
µt+1 0 1
so long as {σt,1:2 } is Markovian. There has been a very vibrant recent literature on the local level SV model
and various extensions, include Stock and Watson (2007, Stock and Watson (2016a), Shephard (2015) and Li
is often called a log-normal SV model and it has its roots in the work of Taylor (1982). Then
while
E [σ1 σ1−s ] = exp(2µ + ση2 (1 + ϕs1 )), Cov [σ1 , σ1−s ] = exp 2µ + ση2 exp(ση2 ϕs1 ) − 1 .
102 CHAPTER 5. ACTION: DESCRIBING VIA FILTERING AND SMOOTHING
Perhaps the most famous example of the HMM is where αt is a finite state Markov chain. The initial devel-
opment and analysis of this model is usually credited to Leonard Baum is a series of papers (e.g. Baum and
Petrie (1966) and Baum and Eagon (1967)).
Example 5.3.6. In economics it is often associated with Hamilton (1989), who uses αt ∈ {0, 1} as a recession
indicator, taking the value 1 if the economy is in recession at time t, with {αt }, a binary Markov chain, buried
inside an autoregression, e.g.
Yt = µ(αt ) + ϕ1 Yt−1 + εt .
More extensive work along this line include Kim and Nelson (1999).
Example 5.3.7. The use of HMM is extremely common in DNA analysis, e.g. Durbin, Eddy, Krogh, and
Mitchison (1998), where, in the most basic model, αt ∈ {A, G, C, T }, where the letter A denotes adenine, G is
guanine, C is cytosine and T is thymine, while {Yt } is a high throughput initial guess at the sequence of the
letters.
is commonly used, allowing time-varying volatility (notice there is not assumption that the {ε1,j } are uncorre-
a multivariate SV model introduced by Harvey, Ruiz, and Shephard (1994). The {Yt−p:t , σt,1:d } is a HMM.
Example 5.3.9. [Dynamic factor model] Suppose d is very high, then many researchers use a dynamic factor
model
Yt = Zαt + εt ,
αt+1 = T αt + ηt ,
with a low dimensional αt and {εt , ηt } being independent white noise. Z is called a factor loading matrix.
This is often called a dynamic factor model. Stock and Watson (2016b) provides a survey.
Y Y
E[Yt+s |Ft−1 ] = E[µt |Ft−1 ].
5.4. FILTERING AND SMOOTHING 103
Y
E[µt |Ft−1 ].
Think more broadly about using a sequence of observations {Yt } to learn a sequence of states {αt }. There
filtering: calculating features (e.g. moments and quantiles) of the posterior distribution of the current
state
αt |Y1:t , t = 1, 2, ...
αt+1 |Y1:t ;
αt |Y1:T , t = 1, 2, ..., T,
Filtering is important as an ingredient for prediction and statistical inference. Smoothing tends to be for
In principle this raises no new intellectual issues — everything really is just computational. First look at
the principles.
With knowledge of joint law of the observations and states up to time t, that is
f (Y1:t , α1:t )
f (α1:t |Y1:t ) = R , t = 1, 2, ..., T
f (Y1:t , α1:t )dα1:t
In special models these integrals can be computed, e.g. if Y1:t , α1:t is jointly Gaussian. Typically though
we should expect to see the use of computational methods, e.g. MCMC where we generate samples
[1] [B]
α1:t , ..., α1:t
[1] [B]
αt , ..., αt
from αt |Y1:t , which can be used to simulated based estimation of moments and quantiles of the posterior for
filtering. The same simulation strategy potentially works for the smoothing problem, drawing
[1] [B]
α1:T , ..., α1:T
5.4.2 HMM
For HMM the law of the observations given the states and the states themselves massively simplify due to the
assumed sequential nature of the conditional independence assumptions:
t
Y t
Y
f (Y1:t |α1:t ) = f (Yj |αj ), f (α1:t ) = f (α1 ) f (αj |αj−1 ).
j=1 j=2
Theorem 5.4.1. [Filtering for HMM] For a HMM {Yt , αt } there are three terms
(1) prediction step
Z
f (αt |Y1:t−1 ) = f (αt |αt−1 ) × f (αt−1 |Y1:t−1 )dαt−1 ,
Proof. The
while
The implication is that filtering can potentially be carried out sequentially, if one can solve 2t integrals of
dimension αt . This may be much easier than the problem without the HMM structure, where we had to do
Pt
one j=1 dim(αj )-dimensional integral.
Example 5.4.2. [Baum filter] In the case where the state αt has finite support, the integrals are replaced by
sums. This special case is called the Baum filter. It takes on the form
X X
P (αt |Y1:t−1 ) = P (αt |αt−1 )P (αt−1 |Y1:t−1 ), f (Yt |Y1:t−1 ) = f (Yt |αt )P (αt |αt−1 )P (αt−1 |Y1:t−1 ).
at−1 at−1
Smoothing can also be carried out recursively, after filtering going though the data forward t = 1, 2, ..., T ,
the smoother goes backwards in time, starting with the final output from the filter f (αT |Y1:T ).
106 CHAPTER 5. ACTION: DESCRIBING VIA FILTERING AND SMOOTHING
Theorem 5.4.3. [Smoothing for HMM] The joint distribution of the smoother is
−1
TY
f (αt |Y1:t )
f (α1:T |Y1:T ) = f (αT |Y1:T ) f (αj |αj+1 , Y1:j ), where f (αt |αt+1 , Y1:t ) = f (αt+1 |αt ) × .
j=1
f (αt+1 |Y1:t )
In the literature this decomposition of the joint distribution is often called the forward-backward filter. Further,
the time-t smoother is
Z
f (αt |Y1:T ) = f (αt |αt+1 , Y1:t ) × f (αt+1 |Y1:T )dαt+1 .
f (α1:T |Y1:T ) = f (αT |Y1:T )f (αT −1 |αT , Y1:T )...f (α1 |α2:T , Y1:T ), prediction decomposition in state, going backwards!
= f (αT |Y1:T )f (αT −1 |αT , Y1:T −1 )...f (α1 |α2 , Y1:t ), HMM
−1
TY
= f (αT |Y1:T ) f (αj |αj+1 , Y1:j ), Markov state.
j=1
Now
Example 5.4.4. [Baum smooth] In the case where the support of αt the integrals are replaced by sums. Then
X
P (αt |Y1:T ) = P (αt |αt+1 , Y1:t ) × P (αt+1 |Y1:T ).
αt+1
The importance of the forward-backward filter was emphasized by Carter and Kohn (1994) and Frühwirth-
[b]
α1:T ∼ α1:T |Y1:T , b = 1, ..., B,
using:
[b]
Simulate from αT ∼ αT |Y1:T ;
5.5. COMPUTATION METHODS 107
[b] [b]
Simulate from αj ∼ αj |αj+1 , Y1:j , for j = (T − 1), (T − 2), ..., 1.
Of course to use this we need to be able to simulate from αT |Y1:T and, then, repeatedly
In the binary state case filtering and smoothing for the HMM are quite simple. But more broadly the presence
of the integrals makes the filtering and smoothing recursions nontrivial. In the case where the joint law of
α1:T , Y1:T is Gaussian, then they can be analytically solved. This model is a Gaussian HMM. In the literature
is is sometimes called a Gaussian state space model, a linear state space model (Harvey (1989)) or a dynamic
linear model (West and Harrison (1989)).
Definition 5.5.1. Gaussian HMM (GHMM) has the pair {Yt , αt+1 } following
Yt 0d×d Zt Ht 0d×r iid
Zt = = ϕt,1 Z t−1 + Bt ωt , ϕt,1 = , Bt = , ωt ∼ N (0, Id+r ),
αt+1 0r×d Tt 0r×d Qt
where
α1 ∼ N (a1 , P1 ), [{ωt } ⊥⊥ α1 ] ,
we assume
T T
a1 , P1 , ϕt,1 , Bt t=1
⊥⊥ {ωt }t=1 , α1
(in which case we work with {Yt , αt } | a1 , P1 , ϕt,1 , Bt ) or assume that a1 , P1 , ϕt,1 , Bt are non-stochastic.
We denote this as
This linear structure was introduced by Kalman (1960), although he did not use Gaussianity, instead he
focused on weak white noise assumptions on the {εt , ζt }. This weak white noise version of his model will be
Definition 5.5.2. The Kalman filter computes {at+1 , Pt+1 } where αt+1 |Y1:t ∼ N (at+1 , Pt+1 ) assuming the
{Yt , αt+1 } ∼ GHMM.
Write the “prediction error” at time t + 1 as vt+1 = Yt+1 − E[Yt+1 |Y1:t ] and the corresponding conditional
variance Ft+1 = Var(vt+1 |Y1:t ). Then the Kalman filter runs sequentially.
Algorithm 5.5.3. [Kalman filter] Assume {Yt , αt+1 } ∼ GHMM. Then for t = 1, 2, ...
108 CHAPTER 5. ACTION: DESCRIBING VIA FILTERING AND SMOOTHING
1. Compute
(a) vt = Yt − Zt at ,
2. Compute
(a) at+1 = Tt at + Kt vt ,
so
vt = Yt − E[Yt |Y1:t−1 ] = Yt − Tt at .
Step 1b:
= Zt Pt ZtT + Ht HtT .
implying that
αt+1 Y Tt at Tt Pt TtT + Qt QTt Tt Pt ZtT
|Ft−1 ∼N , ,
vt 0 Zt Pt TtT Ft
and finishing by applying Bayes theorem for a Gaussian likelihood and prior by conditioning on vt , so αt+1 |vt , Y1:t ∼
The corresponding smoothing result is αt |Y1:T ∼ N (at|T , Pt|T ). Here we give a very simple and fast algorithm
for computing at|T and/or Pt|T . It is due to Durbin and Koopman (2012), where a proof can be found.
Often in this literature these kinds of smoothers are called Kalman smoothers, which is a bit odd as Kalman
never discussed smoothing.
5.5. COMPUTATION METHODS 109
Algorithm 5.5.4. [Durbin-Koopman state smoother] Assume {Yt , αt+1 } ∼ GHMM, run the Kalman filter, storing
at , vt , Kt , Ft−1 , Pt and computing {Lt } where Lt = Tt − Kt Zt . For t = T, T − 1, ...
1. If at|T is needed, set rT = 0a , and compute
2. If Pt|T is also needed, then set NT = 0a×a , and compute backwards
To handle non-Gaussian HMM, it is sometimes helpful to draw from the joint smoothing density
[1]
α1:T ∼ α1:T |Y1:T .
This is called simulation smoothing. Early simulation smoothers are due to Carter and Kohn (1994), Frühwirth-
Schnatter (1994) and de Jong and Shephard (1995). However, we will focus on Durbin and Koopman (2002)
who have a pretty solution. At its core, their idea is not a time series result, but it is super helpful here.
Theorem 5.5.5. Suppose (X, Y ) are jointly Gaussian and the task is to simulate from X|Y . Assume
X [1] , Y [1] ∼ (X, Y ), then
E[X|Y ] + X [1] − E[X [1] |Y [1] ] ∼ X|Y.
Proof. Now X|Y ∼ N (E[X|Y ], V ) but X [1] |Y [1] ∼ N (E[X|Y [1] ], V ), as the posterior variance matrix is data
independent in the Gaussian model. Hence, as required:
E[X|Y ] + X [1] |Y [1] − E[X|Y [1] ] ∼ N (E[X|Y ], V ).
To implement: run the state smoother twice: once of the real data and once on the simulated data.
What could be easier than that! Notice this simulation smoother does not use the Pt|T part of the state
smoother, nor is it necessary to twice compute {Kt , Ft , Pt } in the Kalman filter, as these terms are invariant to
the data.
110 CHAPTER 5. ACTION: DESCRIBING VIA FILTERING AND SMOOTHING
can be computed, where θ is buried inside {vt , Ft }. If the dim(θ) is small or the problem has a particularly
simple structure, then the above log-likelihood can be used directly to implement MLE or Bayesian inference,
typically by simulating from
Taken together this approach has been applied massively in empirical work.
For larger dimensional θ there are some advantages in carrying out inference using data augmentation —
either through the EM algorithm or through Bayesian simulation methods.
Think of the Bayesian case where {Yt , αt+1 } |θ ∼ GHMM and there is a prior on θ. Then the task is to simulate
[1] [B]
and then discarding the drawn states α1:T , ..., α1:T . One way of doing this is via a block MCMC (e.g. Gibbs
sampler) approach. It is quite simple for many problems:
with α1 ∼ N (a0 , P0 ). Then carry out inference based on Y1:T |θ. Define
εt = Yt − αt , ηt = αt+1 − αt , t = 1, ..., T.
5.5. COMPUTATION METHODS 111
f (σε |α1:T , Y1:T , ση ) ∝ f (σε )f (ε1 , ..., εT |σε ), f (ση |α1:T , Y1:T , σε ) ∝ f ση )f (η1 , ..., ηT |ση ).
A simple conjugate approach is to assume σε−2 ∼ Ga(aε /2, bε /2) and ση−2 ∼ Ga(aη /2, bη /2), where X ∼
Ga(α, β) meaning fX (x) ∝ xα−1 exp(−xβ), so E[X] = α/β and Var (X) = α/β 2 (in R this accessed through the
gamma(shape=α, rate=β) family of functions). Then
T
! ! T
! !
X X
−2
σε |α1:T , Y1:T , ση ∼ Ga (aε + T ) /2, bε + 2
εt /2 , ση−2 |α1:T , Y1:T , σε ∼ Ga (aη + T ) /2, bη + ηt2 /2 .
t=1 t=1
In experiments I have found there is some danger by setting bη , bε too high for the data, for in some time series
PT PT
problems either t=1 ηt2 or t=1 ε2t can be tiny. If the prior bη , bε are too small for the data, then this typically
does not matter that much as the data will dominate it.
To illustrate this return to Chapter 1 where we discussed US monthly geometric inflation, recorded monthly
from 1947. Throughout we will work with a univariate Gaussian local level model. I expect the measurement
noise to be much more substantial than the change in the rate of underlying inflation. Thus I set the independent
priors
a0 = 0, P0 = 102 , aε = 1, bε = 0.52 , aη = 1, bη = 0.012 .
I ran the algorithm with B = 2, 000 iterations, initialized at σε = 4 and ση = 3, throwing out the first 1/3 of
the iterations.
Table ? gives the simulation based estimates of the prior and posterior 0.1, 0.5 and 0.9 quantiles for σε and
ση .
σε ση
quantiles 0.1 0.5 0.9 0.1 0.5 0.9
prior 0.31 0.77 3.8 0.0058 0.015 0.067
posterior 0.30 0.31 0.32 0.05 0.06 0.07
Notice how much bigger the σε is to ση , while both are quite precisely estimated using this data.
The right hand side of Figure 5.3 shows the smoothed estimated underlying inflation rate, E[µt |Y1:T , b
θ],
where b
θ are fixed at the posterior medians. The right hand side plots the same measure but focusing on the
most recent data.
The Gaussian assumption is strong, it can yield estimates which are overly sensitive to extreme data.
It would be attractive to allow for outliers, e.g. by replacing the Gaussian distribution by a Laplace
distribution.
112 CHAPTER 5. ACTION: DESCRIBING VIA FILTERING AND SMOOTHING
15
20
10
Annualized monthly inflation
5
0
0
−10
−5
−20
1960 1980 2000 2020 2017 2018 2019 2020 2021 2022 2023 2024
Time Time
Figure 5.3: Smooth estimate of underlying geometric annualized U.S. inflation, based on monthly data. LHS:
long scale of data (dots are observations, red line is underlying rate of inflation). RHS: more recent data (red
line is underlying rate of inflation).
There is some evidence that the volatility of inflation has reduced through time, at least until recently. It
would be attractive to allow for SV on the measurement error and the random walk component.
Throwing away B/3 of the iterations is a bit of a hack, there are more rigorous ways of ensuring the initial
conditions of a MCMC sampler do not overly impact the statistical conclusions. The work of Jacob and
John O’Leary (2020) largely solves this issue using parallel computation and coupling of Markov chains.
This work is quite easy to implement in our context and is particularly inspiring, but is beyond the scope
of these notes.
extends to a vast class of non-Gaussian HMM, which have enough structure to make the computations still
relatively simple.
Suppose there exists a time series β1:T (ignore for now parameters θ, they again can be dealt with by data
T T
{Yt , αt+1 }t=1 | {βt }t=1 ∼ GHMM
5.5. COMPUTATION METHODS 113
using blocks
Sometimes this structure is called a conditionally Gaussian HMM. It has long roots, e.g. Shephard (1994)
and Kim and Nelson (1999).
Example 5.5.8. [Robust filtering] A problem with GHMM is the Gaussianity of the shocks, which make the
filters super sensitive to unusual datapoints — which for some scientific problems can be a drawback. One
approach would be to make the Gaussian state space model
Yt 0d×d Zt Ht 0d×r
Zt = = ϕt,1 Zt−1 + Bt ωt , ϕt,1 = , Bt = ,
αt+1 0r×d Tt 0r×d Qt
where
iid iid
βt,j ∼ , E[βt,j ] = 1, implying ωt,j ∼ , E[ωt,j ] = 0, Var(ωt,j ) = 1.
Then {Zt , (αt , βt )} is a HMM with Y1:T |β1:T ∼ GHM M . One parameterization is to use a mixture of two
normals by setting (think, e.g. c = 40)
c2
iid c c−1 1
βt,j ∼ , P βt,j = = , P βt,j = = , c > 1,
2c − 1 c 2c − 1 c
iid
ωt,j ∼ , E[ωt,j ] = 0, Var(ωt,j ) = 1.
The left hand side of Figure 5.4 shows the density function of a standard normal random variable (black line)
and the corresponding density for ωt,j (red line) when c = 40. The tail behaviour is clearer when plotting the
log-density, which is given in the right hand side plot in Figure 5.4. These pictures show this simple mixture
model ups the chance some variables are a long way from zero. Then sampling β1:T |Y1:T , α1:T just becomes
the task of sampling from the conditionally independent
c2
2
−1/2 c
P (βt,j |ωt,j ) ∝ P (βt,j ) exp −ωt,j /2βt,j βt,j , βt,j ∈ , ,
2c − 1 2c − 1
114 CHAPTER 5. ACTION: DESCRIBING VIA FILTERING AND SMOOTHING
−1
0.5
−2
0.4
−3
Log−density
0.3
Density
−4
0.2
−5
0.1
−6
0.0
−7
−6 −4 −2 0 2 4 6 −6 −4 −2 0 2 4 6
x x
Figure 5.4: Comparing a standard normal distribution with a mixture of normals which have a zero mean and
unit variable. LHS: density functions; black line is the mixture random variable, the red line is the normal
variable. RHS: log-density functions; black line is the mixture random variable, the red line is the normal
variable.
which is sampled as a Bernoulli draw. Kim, Shephard, and Chib (1998) extend this to a mixture of many
normals to model some very skew data. An alternative is to model βt,j as generalized inverse Gaussian with
E[βt,j ] = 1, so ωt,j are generalized hyperbolic and βt,j |ωt,j is generalized inverse Gaussian (e.g. Jørgensen
(1982)). The left hand side of Figure 5.5 shows the sample path of a local level model in the Gaussian case,
with the standard deviations of εt and ηt being 1 and 0.1, respectively, while T = 1, 000. The right hand side
is tramatically different. It uses the same signal, but now takes c=40 in the measurement noise. Hence the
second order properties of process has not change, but now some very odd datapoints arises in the process.
iid
Example 5.5.9. [SV noise] Instead of βt,j ∼ then the {βt } could be its own stochastic process with significant
memory, delivering a GHMM-SV model. We saw a local level version of this in Example 5.3.4.
Over the last 30 years another Monte Carlo method has been extensively used to tackle HMMs. They are
pretty effective as long as the state vector is only moderately long. These methods are called particle filters or
sequential Monte Carlo. An attractive recent review includes Chopin and Papasphiliopoulos (2020), Dai, Heng,
Jacob, and Whiteley (2022) and Fearnhead and Kunsch (2018). The latter is a good place to start. Here we
will give a simple introduction.
5.5. COMPUTATION METHODS 115
Gaussian tailed local level model Heavier tailed local level model
5
2
Y
0
0
−2
−5
0 200 400 600 800 1000 0 200 400 600 800 1000
Time Time
Figure 5.5: LHS: concention Gaussian local level model. RHS: same signal, but now the measurement error is
a mixture of normals with c = 40.
Imagine we have a sample of size B directly from the filtering distribution αt |Y1:t−1
[1] [B]
αt , ..., αt .
In this literature this sample is often called a swarm of particles. Then we can form the empirical distribution
function of it as
B
1 X [b]
Fb(αt |Y1:t−1 ) = 1(αt ≤ αt ).
B
b=1
R
Now f (Yt |Y1:t−1 ) = f (yt |αt )dF (αt |Y1:t−1 ). Replacing F (αt |Y1:t−1 ) by Fb(αt |Y1:t−1 ), then the integral can be
solved, yielding
B
1 X [b]
fb(Yt |Y1:t−1 ) = f (Yt |αt ).
B
b=1
which an unbiased estimator of f (yt |y1:t−1 ). As B gets large this sum should become very accurate.
This is nice, but what about the next time step? If we can simulate from αt+1 |Y1:t , then we can repeat the
above, stepping through time via simulation.
How can we simulate from αt+1 |y1:t . We could do data augmentation, using MCMC to simulate from
α1:t+1 |Y1:t
and throwing α1:t away. But the cost of this will grow with t and so will be expensive. Here we follow a
different approach.
116 CHAPTER 5. ACTION: DESCRIBING VIA FILTERING AND SMOOTHING
Now
f (Yt |αt )
Z Z Z
f (αt+1 |Y1:t ) = f (αt+1 , αt |Y1:t )dαt = f (αt+1 |αt )f (αt |Y1:t )dαt = f (αt+1 |αt ) f (αt |Y1:t−1 )dαt
f (Yt |Y1:t−1 )
Z
∝ f (αt+1 |αt )f (Yt |αt )dF (αt |Y1:t−1 ).
B
[b] [b]
X
fb(αt+1 |Y1:t ) ∝ f (Yt |αt )f (αt+1 |αt ).
b=1
Then the task is to simulate from this approximation! This looks tricky computationally. B is large so
evaluating fb(αt+1 |y1:t ) is awful.
But think about this as a data augmentation problem: build a fake joint density
[b] [b]
fb(b, αt+1 |Y1:t ) ∝ f (Yt |αt )f (αt+1 |αt ),
then this marginalizes to fb(αt+1 |Y1:t ). Hence we can sample from this, which is potentially very cheap, and
then throw away the sampled b variable.
One simple way to sample from fb(b, αt+1 |Y1:t ) is the bootstrap particle filter. This systematically samples b
[b] [b]
and the samples from αt+1 |αt , weighing the sample using Yt |αt . When you first read the bootstrap particle
filter, take k = 1, so C = B. Taking k>1 improves the sampler, but it is not conceptually vital.
[1] [B]
Algorithm 5.5.10. Bootstrap particle filter. Start with αt , ..., αt from αt |Y1:t−1
(1) Sample from
[c] [c−⌊c/B⌋]
αt+1 ∼ αt+1 |αt , c = 1, 2, ..., B, ..., C, C = kB, k ≥ 1 integer,
[c] [c]
and compute wt = f (yt |αt ) ≥ 0. This yields the C pairs
[1] [1] [C] [C]
αt+1 , wt , ..., αt+1 , wt ,
[1] [B]
αt+1 , ..., αt+1 ,
in the 1990s. Influential work includes Liu and Chen (1995) and Liu and Chen (1998). To my knowledge the
first use of particle filters in economics appeared in Kim, Shephard, and Chib (1998) and Pitt and Shephard
(1999). Herbst and Schorfheide (2015) gives a textbook account of using these types of methods in some forms
of macroeconomics.
is an unbiased estimator of f (Y1:T ) and so the likelihood function (but not the log-likelihood). It turns out this
means that this estimated likelihood can be used inside an MCMC algorithm to make inference on underlying
parameters θ — the estimation error does not spoil the output from the MCMC output. See Andrieu, Doucet,
and Holenstein (2010) and Flury and Shephard (2011).
5.6 Recap
Linearity
Spectral analysis is a classic, massive area of time series. Most often this area is called frequency domain time
series, due to its focus on the impact of frequency, while most other areas are called time domain time series
due to its focus on temporal memory. Frequency domain time series relies entirely on the covariance structure
of the {Yt } process, so belongs in the category of linear time series methods. All the core methods will be
transformations of statistics which are linear in the data.
An exhaustive treatment of frequency domain methods is given by Priestley (1981) and Percival and Walden
(1993). The notes by Subba Rao (2022) are also very good on this topic. It is also closely connected to wavelet
analysis, e.g. Percival and Walden (2000), but we will not discuss it here.
Spectral analysis is closely connected to Fourier analysis, so depending on your background some of the
material will appear trivial or somewhat abstract or not how you would express it. However, various aspects of
spectral analysis appears throughout modern time series and so knowing some of this material is important.
h 6.1.1. So far nearly all our time series thinking has been focused on prediction — the prediction decompo-
sition, Kalman filtering, martingales, autoregressions. In the frequency domain we junk that line of thinking!
Example 6.1.2. Think about recording electronically a voice (or music) at a very high level of resolution,
where the recording is made every microsecond, and think of that recording as a long time series! A classic
problem is:
119
120 CHAPTER 6. LINEARITY
any continuous function can be represented by an infinite sum of weighted sine and cosine functions. So
one approach to compressing the voice is to think of the voice as a continuous function and approximate the
corresponding time series by a finite sum of weighted sine and cosine functions and send the corresponding large
weights in the sum not the original time series! The collection of large weights may be tiny compared to the
original series. Then at the other end a computer can take the weights and reconstruct an approximation to the
original time series — and play that as our approximation to the original recording. Approximating the time
series of the voice has nothing to do with prediction — it is more naturally phrased in the frequency domain.
Example 6.1.3. A macroeconomist might be interested in a similar idea to the voice one. If they are interested
in very long-run effects, to do with population growth and technological progress, then they may find it helpful
to simplify the time series of annual GDPs, to extract the long-run trend. Statistically this is the same problem
as Example 6.1.2. A different macroeconomist might be interested in recessions and booms, that is short-run
fluctuations. So they might be interested in the opposite of the voice problem — studying what is left over
where {εt } is weak white noise, so Y1:T has a simple structure, dominated by the choice of
2πj
λj = ∈ [0, 2π]
T
a single “frequency” and how (αj , γj ) magnifies the sine and cosine functions — the nomenclature “frequency
p
domain” derives from this kind of frequency! The scaling by 1/T is selected to help the math be compact
later.
Remark 18. For those readers familiar with the weak instrument literature in econometrics (see Andrews,
Stock, and Sun (2019)), writing down regressors which are scaled by the square root of the sample size may well
feel familiar. It leads to the same effects seen here: the coefficient βj cannot be consistently estimated, but a
The cosine and sine functions are periodic, with period 2π, so λj ∈ [0, 2π] is without loss. If λj is small,
then cos(λj t) cycles in t very slowly, inducing massive memory in {Yt }. If λj is close to π the impact of cos(λj t)
is close to being instantaneous.
6.1. FREQUENCY DOMAIN TIME SERIES 121
If we have data y1:T then think about the least squares estimate of (αj , γj ):
q P
PT 2 PT !−1 2 T
2
j 2πt 2 2πt 2π cos (λj t) yt
α
b j,OLS T t=1 cos T T t=1 cos j T sin j t T t=1
= PT PT T
2πt 2
q P
γ 2 2πt
sin j 2π 2
T
2
t=1 cos j T T t t=1 sin j T sin (λ t) y
bj,OLS
T T t=1T j t
set x = y, then
= 2 cos(x)2 − 1.
So
T 2 T 2
2X 2πt 2X 2πt
cos j = sin j = 1. (6.2)
T t=1 T T t=1 T
while, using sin(x) cos(x) = sin(2x)/2, the
T T
1X 2πt 2π 1 X 2π
cos j sin j t = sin 2j t (6.3)
T t=1 T T 2T t=1 T
= 0, 2π-periodicity of sine and integer j. (6.4)
Taken together (6.2) and (6.3) means that the regressors are othonormal: othogonal plus the sums of squares
regressors is π. That is
2
PT 2 PT !
j 2πt 2 2πt
sin j 2π
T t=1 cos T T t=1 cos j T T
t
= I2 .
T 2πt 2
2 2πt
PT
sin j 2π 2
P
T t=1 cos j T T t T t=1 sin j T
122 CHAPTER 6. LINEARITY
Now apply the lessons from this detour: it yields the beautiful and practical result that
r T r T
2X 2X
α
b j,OLS = cos (λj t) yt = cos (λj t) (yt − y) , 2π-periodicity of cosine,
T t=1 T t=1
and r T r T
2X 2X
γ
bj,OLS = sin (λj t) yt = sin (λj t) (yt − y)
T t=1 T t=1
T
( T
) ( T
)
X 2 2X 2 2X 2
yt − y) =
(b b 2j,OLS
α cos (λj t) + b2j,OLS
γ sin (λj t) =α b2j,OLS .
b 2j,OLS + γ
t=1
T t=1 T t=1
b 2j,OLS and γ
indicating α b2j,OLS measure the contribution the frequency λj makes to the variation in y1:T .
This is the simple version, of course there are many frequencies. But that is the main point.
Remark 19. You might ask why do we care about sequences {Xt } of the generic form
Take a leap in the dark and assume {α, β} are zero mean, uncorrelated random variables with Var(α) = Var(γ) =
Thus {Xt } is covariance stationary. If we were to add together p such components, each uncorrelated from one
another, each with a different variance σj2 and different frequency {λj }, then
p
X
Cov(Xt , Xt−s ) = σj2 cos (λj s) , σj2 ≥ 0, λj ∈ [0, 2π],
j=1
which is a very flexible way of modeling the second order properties of a covariance stationary process. It will
turn out that it will be sufficiently flexible, when setup the right way, to represent the second order properties
of all covariance stationary processes.
6.1. FREQUENCY DOMAIN TIME SERIES 123
To work compactly in the frequency domain we often use complex numbers and random extensions. Here we
remind ourselves of the former and introduce some aspects of the latter. I will do this in four stages:
x = a + ib,
x∗ = Conj(x) = a − ib.
Likewise
∗
eia = e−ia .
The notation Re(x) means the real part of a complex x, so Re(x) = a = Re(x∗ ). Likewise Im(x) = b =
− Im(x∗ ).
Also crucial for us is the complex version of squaring. It is denoted
There is some heavy investment working in the frequency domain, but some of the results are beautiful. You
can glimpse the elegance of complex numbers through this example.
124 CHAPTER 6. LINEARITY
Example 6.1.4. The elegance of complex variables can be seen immediately by noting
2
eia = eia e−ia = 1
= cos(a)2 + sin(a)2 .
= 1.
= ei(2π−λ)t .
The term e−iλt appears all over the frequency domain manipulations.
Our second step of preparation is to define a complex function of time and frequency:
r
1 −iλt
x(t, λ) = e ,
T
2πj
λj = , j = 1, ..., T.
T
Then the
T T T
X 1 X i(λk −λj )t 1 X 2i(k−j)πt/T
x(t, λk )∗ x(t, λj ) = e = e
t=1
T t=1 T t=1
1 λk = λj
=
0 λk ̸= λj ,
Definition 6.1.6. [Complex random variables] Let the complex random variable X be
X = A + iB
Definition 6.1.7. [Mean and variance of complex random variables]. Let the pair of complex random variables
(X, Y ) be defined as
X = A + iB, Y = C + iD,
2
and Var(X) = E[X 2 ] − |E[X]| . Notice that Var(X) ≥ 0 is real, but Cov(X, Y ) can be complex.
1 −iλt 1
Q(t) = √ e β + eiλt β ∗ , β = √ (A + iB), A ⊥⊥ B, A, B ∼ N (0, σλ2 ), t ∈ [0, T ]
T 2
2 −iλt
= √ Re(e β), as e β is the complex conjugate of e−iλt β
iλt ∗
T
1 −iλt 1 iλt
(A + iB) + eiλt (A − iB) = √ e + e−iλt A + i e−iλt − eiλt B
= √ e
2T 2T
r
2
= {cos(λt)A + sin(λt)B} .
T
1 n −iλt iλ(t+s) o σ2
Var(β) + eiλt e−iλ(t+s) Var(β) = λ eiλs + e−iλs
Cov[Q(t), Q(t + s)] = e e
T T
2σλ2
= cos(λs).
T
We have seen this result before: in Example 3.3.4 and Remark 19.
Example 6.1.8 is not new. But the derivation is easier now. Why? The use of complex random variables
allows us to use eiλt rather than cosine and sine functions in the calculations. The eiλt terms are much simpler
to manipulate.
126 CHAPTER 6. LINEARITY
A couple of time we will refer to a complex continuous time stochastic process {Z(t)}t≥0 . A simple version is
Definition 6.1.9. [Complex Brownian motion] Let the process {Z(t)}t≥0 be defined as
1
Z(t) := √ {B(t) + iW (t)} , t ≥ 0,
2
where {B(t)}t≥0 ⊥
⊥ {W (t)}t≥0 are independent standard Brownian motions. Then {Z(t)}t≥0 is called complex
E[Z(t)] = 0, and
2 1 1
|Z(t)| = {B(t) + iW (t)} {B(t) − iW (t)} = B(t)2 + W (t)2 .
2 2
Thus
2
Var(Z(t)) = E[|Z(t)| ] = t,
??.
At a basic level think about Q[1] (t) t∈[0,T ]
(and the corresponding Q[2] (t)) as the pointwise (that is for
each individual t) probability limit of the sum
B
[1] 1 X j t j−1 j j−1
QB (t) = √ 1 ≤ h T B T −B T , t ∈ [0, T ],
2 j=1 B T B B B
as B → ∞. As Q[1] (t) is the sum of weighted independent Gaussians (as {h(t)} is non-stochastic), it has
Gaussian increments
1 t
Z
Q[1] (t) − Q[1] (s) ∼ N 0, h(u)2 du , t > s ≥ 0,
2 s
which are independent (but not necessarily stationary) Q[1] (t) − Q[1] (s) ⊥⊥ Q[1] (b) − Q[1] (a) for all t > s ≥
b > a ≥ 0. The same holds for Q[2] (t) . Sometimes you will see
1 1
dQ(t) = h(t)dZ(t) = √ h(t)dB(t) + i √ h(t)dW (t),
2 2
which means that
2
E[dQ(t)] = 0, and Var(dQ(t)) = E(|dQ(t)| ) = h(t)2 dt.
6.1. FREQUENCY DOMAIN TIME SERIES 127
At one point, we will need a continuous time complex orthogonal process — which drops the Gaussian assump-
tion, noting {Q(t)}t≥0 is an orthogonal process. You saw an orthogonal process in Definition 4.4.12, but now
it has to be complex.
Definition 6.1.10. [Complex orthogonal process] Let the process {Z(t)}t∈[0,T ] be defined as
1
Z(t) := √ {A(t) + iD(t)} , t ≥ 0,
2
where {A(t)}t∈[0,T ] ⊥{D(t)}t∈[0,T ] are uncorrelated, zero mean orthogonal processes. Then {Z(t)}t∈[0,T ] is a
complex orthogonal process on [0, T ].
Then
1
Var(Z(t)) = {Var(A(t)) + Var(D(t))} .
2
Definition 6.1.11. [Circular orthogonal process] Suppose {Z(λ)}λ∈[0,π] is a complex orthogonal process Z(λ) =
√
{A(λ) + iD(λ)} / 2, where A(0) = D(0) = 0, and extend time to [0, 2π] by defining
Z(2π − λ) = Conj(Z(λ))
1
= √ {A(λ) − iD(λ)} , λ ∈ (π, 2π].
2
Circular orthogonal processes are useful in the frequency domain for frequencies living on [0, 2π].
Remark 20. The circular orthogonal process {Z(λ)}λ∈[0,2π] has uncorrelated increments for {Z(λ)}λ∈[0,π] , but
not in general:
= 2 Re(e−iλt Z(λ))
√
= 2 {cos(λt)A(λ) + sin(λt)D(λ)} , using Example 6.1.8.
128 CHAPTER 6. LINEARITY
Now let us use what we have learnt. This will put us on the launch pad for what we want.
1 1
α eiλt + e−iλt + γi e−iλt − eiλt = e−iλt (α + iγ) /2 + eiλt (α − iγ) /2
α cos(λt) + γ sin(λt) =
2 2
−iλt iλt ∗
√
= e β+e β / 2 (6.5)
√
= 2 Re(e−iλt β) (6.6)
∗
where denotes a complex conjugate and
√ √ √ 1 1 √
β = (α + iγ) / 2, β ∗ = (α − iγ) / 2, so α= 2 Re(β) = √ (β + β ∗ ) , iγ = √ (β − β ∗ ) , γ= 2 Im(β).
2 2
Using (6.5) and thinking about frequency λj , the trigonometric seasonal model is
(r ) (r ) r
2 2 1 −iλj t
Yt = αj cos (λj t) + γj sin (λj t) + εt = e βλj + eiλj t βλ∗j + εt ,
T T T
r
1 −iλj t
= e βλj + Conj(e−iλj t βλj ) + εt (6.7)
T
r
X 1 −iλt
= x(t, λ)βλ + εt , where x(t, λ) := e , β2π−λ := Conj(βλ ), (6.8)
T
λ∈{λj ,2π−λj }
r !
X
−iλt 1b X
= y+ e βλ = y + x(t, λ)βb ,
λ
T
λ∈{λj ,2π−λj } λ∈{λj ,2π−λj }
noting ybt is real, even though each of the terms in the sum are complex, due to the complex conjugate structure.
2
Notice that β b
λj = α b 2j,OLS + γ
b2j,OLS /2.
6.2. FOURIER TRANSFORM OF DATA 129
The material covered in this Section will be the following core ideas.
– The Fourier transform of the data y1:T is the (complex) functional statistic
r T
1 X iλt
JT (λ) = yt e , frequency λ, recall eiλt = cos(λt) + i sin(λt) (6.9)
2πT t=1
(r T
) (r T
)
1 X 1 X
= yt cos(λt) + i yt sin(λt) (6.10)
2πT t=1 2πT t=1
= Re {JT (λ)} + i Im {JT (λ)} . (6.11)
Re {JT (λ)} = Re {JT (2π − λ)} , Im {JT (λ)} = − Im {JT (2π − λ)} , λ ∈ [0, 2π]. (6.12)
This implies all the statistical information embedded in {JT (λ)}λ∈[0,2π] appears in {JT (λ)}λ∈[0,π] —
although using the full range of {JT (λ)}λ∈[0,2π] often yields nicer formulae.
– The (complex) discrete Fourier transforms (DFT) evaluates JT (λ) at specific frequencies
– The periodogram
a real functional statistic of the data y1:T . Due to (6.12), the periodogram cycles with period 2π,
that is IT (λ) = IT (2π − λ), λ ∈ R. As a result of the periodicity, researchers typically only plotted
IT (λ) against λ ∈ [0, π]. Then
T −|s|
" T #
1 X T −s 1 X
T
IT (λ) = γ
b +2 cos(sλ)b
γs , γ
bs = yt yt+|s| ,
2π 0 s=1
T T − |s| t=1
T T −|s|
1 X iλs 1 X T
= e γ es , γ
es = yt yt+|s| ,
2π T t=1
s=−T
130 CHAPTER 6. LINEARITY
6.2.2 Setup
{YT (t)}t∈[0,T ] ,
t = seq(1,T); iC = 1i;
sqrt(1.0/(2*pi*T))*sum(exp(lambda*t*iC)*y);
It is not that hard to use modern scripting languages to do simple manipulations of complex objects.
Inverting (6.14) back to the data
r T T T T T T
2π X −iλj t 1 X −iλj t X iλj s 1 X X 2ijπ(s−t)/T 1X
e JT (λj ) = e e ys = e ys = yt
T j=1 T j=1 s=1
T s=1 j=1 T s=1
= yt , t = 1, ..., T.
6.2. FOURIER TRANSFORM OF DATA 131
It is worth writing this out again, but the other way around
r T
2π X −iλj t
yt = e JT (λj ) (6.15)
T j=1
T T
√
r r
X 1 X iλj t 1 −iλt
= x(t, λj )β
b , β
j
b := 2πJT (λj ) =
j e yt , x(t, λ) = e .
j=1
T t=1 T
This is an exact OLS fit of a regression model with T orthonormal (complex) regressors, with T regression
coefficients (proportional to the Fourier transform of the data) and no error! The above delivers expressions
(6.9) and (6.13) above. This completes the data quantities statement.
which precisely goes through the data y1:T . This function has a period of T .
Remark 21. For any frequency of the form λj = 2πj/T , for j = 1, ..., T , the
T X T T
X 2πj 2πj X 2πj 0 j ̸= 0
exp it = cos t +i sin t =
T T T T j = 0,
t=1 t=1 t=1
This squared term is called the j-th periodogram ordinate of the data y1:T . It is one of the most famous objects
in time series — it parallels the sample autocovariance function in the time domain.
The final result in this Section appeared in Bartlett (1950). It relates the periodogram to the sample
autocovariance function.
132 CHAPTER 6. LINEARITY
Theorem 6.2.3. The periodogram at frequency λ ∈ [0, π], written IT (λ), can be expressed as
T T −|s|
X T −s 1 X
2πIT (λ) = γ
b0 + 2 cos(sλ)b
γs, where γ
bs = yt yt+|s| , s = −T, −(T − 1), ..., T
s=1
T T − |s| t=1
T T −|s|
X T −s 1 X
= γ
e0 + 2 cos(sλ)e
γs, where γ
es = yt yt+|s|
s=1
T T t=1
T
X
= eisλ γ
es .
s=−T
PT
Obviously IT (λ) ≥ 0, so, in particular, IT (0) = s=−T es ≥ 0.
γ
Proof. Now
T 2 T T
1 X iλt 1 X X iλ(t−s)
2π × IT (λ) = e yt = e yt ys
T t=1 T s=1 t=1
T
1X 2
= y
T t=1 t
T −1 T
1 X iλ 1 X −iλ
+ e yt yt+1 + e yt yt−1
T t=1 T t=2
T −2 T
1 X 2iλ 1 X −2iλ
+ e yt yt+2 + e yt yt−2
T t=1 T t=3
1 T
1 X (T −1)iλ 1 X −(T −1)iλ
... + e y1 yT + e yT y1
T t=1 T
t=T
e0 + eiλ + e−iλ γe1 + ... + e(T −1)iλ + e−(T −1)iλ γ
= γ eT −1
T
X
= eisλ γ
es .
s=−T
In turn
T
X T
X
eisλ γ
es = γ
b0 + 2 cos(sλ)e
γs
s=−T s=1
T
X T −s
= γ
b0 + 2 cos(sλ)b
γs.
s=1
T
This result allows us to go the other way: from the periodogram to the sample autocovariance function.
T
2π X −isλj
γ
es = e IT (λj ).
T j=1
6.3. POPULATION QUANTITIES: BUILDING THE SPECTRAL DENSITY 133
1
PT
Proof. Start with IT (λ) = 2π s=−T eisλ γ
es . Then, taking the discrete Fourier inverse gives the result. Writing
this out:
T T T
2π X 1 X X ibλj
IT (λj )e−isλj = e eb e−isλj
γ
T j=1 T j=1
b=−T
T T
X 1 X
= γ
eb ei(b−s)λj
T j=1
b=−T
T
X
= γ
eb 1(b = s)
b=−T
= γ
es .
In this Section we study some of the population properties of a Fourier representation of a stationary process
and see that as the number of terms increases all covariance stationary processes can be written this way.
P∞
– If {Yt } is covariance stationary process with the additional assumption that s=1 |γs | < ∞, then the
(real) spectral density function
∞
1 X iλs
fY Y (λ) = e γs , λ ∈ [0, 2π], (6.17)
2π s=−∞
1
P∞
exists. The leading example is fY Y (0) = 2π s=−∞ γs , a scaled version of the long-run variance —
which appears often as the asymptotic variance of the sample average of a time series. This appears
all over these notes, as it drives most CLTs used in practice by combining this results with Slutsky’s
– Assume {Yt } is a zero mean covariance stationary process with spectral distribution function FY Y .
Cramér’s representation (also called the spectral representation) says there exists a zero mean (com-
plex) orthogonal processes {Z(λ)}λ∈[0,2π] with
Z 2π
Yt = e−iλt dZ(λ), t ≥ 0,
0
134 CHAPTER 6. LINEARITY
2
where E[|Z(λ2 ) − Z(λ1 )| ] = 2 {FY Y (λ2 ) − FY Y (λ1 )} and FY Y (0) = 0. Definition 4.4.12, in the
section on continuous time processes, says the core characteristic of an orthogonal process is that is
has uncorrelated increments. The leading special case is where
√
1/2 1
dZ(λ) = {fY Y (λ)} √ {dB(λ) + idW (λ)} , i= −1,
2
where {B(λ)}λ∈[0,2π] and {W (λ)}λ∈[0,π] are independent standard Brownian motion, then {Yt }t≥0 is
a stationary Gaussian process.
Here we will use a circular orthogonal process, which was introduced in Definition 6.1.11.
Definition 6.3.1. [Fourier representation model] Let {Z(λ)}λ∈[0,2π] be a zero mean circular orthogonal (com-
plex) process with
2
E[|Z(λ)| ] = FY Y (λ), λ ∈]0, π].
Define
T
X
YT (t) = e−iλj t {Z(λj ) − Z(λj−1 )} ,
j=1
PT /2
Where does this come from? Start with a sum j=1 e−iλj t {Z(λj ) − Z(λj−1 )} and its complex conjugate
PT /2
j=1 eiλj t {Z(λj )∗ − Z(λj−1 )∗ }. Add them:
T /2 T /2
X X
−iλj t
YT (t) = e {Z(λj ) − Z(λj−1 )} + eiλj t {Z(λj )∗ − Z(λj−1 )∗ } ,
j=1 j=1
T /2 T /2
X X
= e−iλj t {Z(λj ) − Z(λj−1 )} + e−i(2π−λj )t {Z(λj )∗ − Z(λj )∗ } , as eiλj t = e−i(2π−λj )t (Example 6.1.5)
j=1 j=1
T /2 T /2
X X
−iλj t
= e {Z(λj ) − Z(λj−1 )} + e−i(2π−λj )t {Z(2π − λj ) − Z(2π − λj−1 )} , circular orthogonality
j=1 j=1
T
X
= e−iλj t {Z(λj ) − Z(λj−1 )} .
j=1
6.3. POPULATION QUANTITIES: BUILDING THE SPECTRAL DENSITY 135
a Riemann-Stieltjes integral.
Further
where {G(λ)} has the properties of a probability distribution function. Then γ(0) = FY Y (2π) and
Z 2π
ρ(s) = Cor[Y (t), Y (t + s)] = e−iλs dG(λ),
0
Hence the autocovariance function of {Y (t)}≥0 mathematically plays the role of a characteristic function,
which uniquely maps backwards and forwards to G via the uniqueness theorem of characteristic functions. As
T goes to infinity then we produce
Z 2π
Y (t) = e−iλt dZ(λ) (6.18)
0
Z 2π
γ(s) = e−iλs dFY Y (λ). (6.19)
0
Theorem 6.3.2. [Cramér’s representation] For any covariance stationary process {Yt } it is possible to find an
orthogonal process {Z(λ)} so that (6.18) holds with probability one.
136 CHAPTER 6. LINEARITY
and sometimes as
Z π Z π
Y (t) = cos(λt)dU (λ) + sin(λt)dU (λ),
0 0
where {U (λ)}λ∈[0,π] is a real orthogonal process with Var[dU (λ)] = 2fY Y (λ)dt.
We do not prove this theorem here, but the result is kind of obvious from what we have done as the
autocovariance function uniquely determines F , a result called the Wiener–Khintchine theorem in time series,
and so selects the correct orthogonal process.
Of course knowing the right orthogonal process is not enough to simulate from {Y (t)} as the class of
orthogonal process is not generative. As an example, think of all Lévy processes with the same drift and finite
variance — they yield the same orthogonal process. But everyone knows the Brownian motion and Poisson
process are quite different!
Start with some assumptions. The {Yt } is covariance stationary with {γs } being the associate autocovariance
function.
P∞
Definition 6.3.4. If {Yt } is covariance stationary process with the additional assumption that s=1 |γs | < ∞,
then the spectral density function is defined as
∞
1 X iλs
fY Y (λ) = e γs , λ ∈ R, (6.20)
2π s=−∞
∞
( )
1 X
= γ0 + 2 cos(λs)γs . (6.21)
2π s=1
Rλ
The FY Y (λ) = 0
fY Y (ω)dω, is the spectral distribution function.
Note
P∞ P∞ 2 P∞
(a) |fY Y (λ)| ≤ s=−∞ eiλs |γs | ≤ s=−∞ |γs | as, for all a, the eia = 1. We assumed s=−∞ |γs | so
γ0 Var(Y1 )
fY Y (λ) = = ,
2π 2π
Example 6.3.6. If {Yt } is a MA(1) driven by weak white noise Yt = θ1 (L)εt , then
γ0 + 2 cos(λ)γ1 Var(ε1 ) 1 + θ12 + 2θ1 cos(λ)
fY Y (λ) = = .
2π 2π
0.7
15
0.5
0.6
0.4
0.5
10
spectral density
spectral density
spectral density
0.4
0.3
0.3
0.2
0.2
0.1
0.1
0.0
0.0 0.5 1.0 1.5 2.0 2.5 3.0 0.0 0.5 1.0 1.5 2.0 2.5 3.0 0.0 0.5 1.0 1.5 2.0 2.5 3.0
Figure 6.1: LHS: spectral density fY Y (λ) for two MA(1) models, θ1 = −0.5 (red) and θ1 = 0.9 (black). Middle:
spectral density fY Y (λ) for two AR(1) models, ϕ1 = −0.5 (red) and ϕ1 = 0.9 (black). RHS: spectral density
fY Y (λ) for an AR(2) models, ϕ1 = 0.8 and ϕ1 = −0.4.
To efficiently manipulate the spectral density it is helpful to define an autocovariance generating function.
P∞
Definition 6.3.7. The autocovariance generating function (agf) is gY (z) = s=−∞ z s γs .
Then
gY (eiλ ).
fY Y (λ) =
2π
The following Theorem about polynomials is helpful in managing MA(∞) and thus other linear models.
P∞ P∞
Theorem 6.3.8. Define as = j=0 θj θj+s , for s = ..., −1, 0, 1, ..., and s=1 |as | < ∞, then
∞
X ∞
X
z s as = θ(z)θ(z −1 ), where θ(z) = θj z j .
s=−∞ j=0
138 CHAPTER 6. LINEARITY
Proof.
∞
X ∞ X
X ∞ ∞
X ∞
X
z s as = z s θj θj+s = z s θj θj+s , θj = 0, for j < −1,
s=−∞ s=−∞ j=0 j=−∞ s=−j
∞
X ∞
X
= θj θh z h−j , j + s = h;
j=−∞ h=−∞
∞
! ∞
X X
= θh z h θj z −j = θ(z)θ(z −1 ).
h=0 j=0
P∞ ∞
j
P
Example 6.3.9. For an MA(∞) process θ(z) = j=0 θj z with |θj | < ∞ (which is enough to guarantee
j=0
P∞
that s=1 |γs | < ∞,) then
Var(ε1 ) 2
fY Y (λ) = θ(eiλ ) ,
2π
e.g. in the MA(1) case θ(z) = 1 + θ1 z, then
2
θ(eiλ ) = (1 + θ1 eiλ )(1 + θ1 e−iλ ) = 1 + θ12 + 2θ1 cos(λ).
This is drawn in the right hand side of Figure 6.1 for θ1 ∈ {−0.5, 0.9} — when θ1 = −0.5 the shocks somewhat
cancel out through time and so most of the variation is at the higher frequencies (shocks die out really fast);
when θ1 = 0.9 the shocks appear positively twice and so this lifts a little the activity at the lower frequencies.
Pp 1
Example 6.3.10. For an AR(p) process with ϕ(z) = 1 − j=1 ϕj z j , so in the stationary case Yt = ϕ(L) εt .
Thus
Var(ε1 ) 1
fY Y (λ) = ,
2π |ϕ(eiλ )|2
e.g. in the AR(1) case ϕ(z) = 1 − ϕ1 z, then
2
ϕ(eiλ ) = (1 − ϕ1 eiλ )(1 − ϕ1 e−iλ ) = 1 + ϕ21 − 2ϕ1 cos(λ),
so
Var(ε1 ) 1
fY Y (λ) = .
2π 1 + ϕ21 − 2ϕ1 cos(λ)
This is drawn in the middle of Figure 6.1 for ϕ1 ∈ {−0.5, 0.9} — when ϕ1 = −0.5 shocks somewhat cancel out
through time and so is of higher frequencies (things move away fast); when ϕ1 = 0.9 shocks get reinforced and
so most of the action is at the lower frequencies. Notice how much higher the AR(1) spectrum goes than the
MA(1) does. The right hand side of Figure 6.1 shows the spectrum for an AR(2) process with ϕ1 = 0.8 and
ϕ2 = −0.4 — this has a pair of complex eigenvalues and so yields a process which cycles. This is nicely shown
in the spectrum which has a peak around frequency one. Although it is possible to work out the spectrum in
terms of cosines, it is easier not to do the math and just use complex variables to do the computations. Given
6.3. POPULATION QUANTITIES: BUILDING THE SPECTRAL DENSITY 139
below is an R snippet which computes the spectrum for an AR(p), implementing with p = 4. I m not claiming
this code is efficient: but I am saying it is simple to use and simple to read.
spAR = (1.0/(2.0*pi))/(abs(AC)^2);
plot(spAR);
Definition 6.3.4 goes from the sequence of autocovariances {γs} to the spectral density function {fY Y (λ)}λ∈[0,2π]. What about the reverse? This is really Bochner's theorem, but here it is derived.
Theorem 6.3.11. If $\{Y_t\}$ is a covariance stationary process with the additional assumption that $\sum_{s=1}^{\infty}|\gamma_s| < \infty$, then
$$\gamma_s = \int_0^{2\pi} f_{YY}(\lambda)e^{-is\lambda}\,d\lambda.$$

A filter keeps some frequencies of the spectral representation and takes out certain frequencies, e.g. a low pass filter removes the higher frequencies:
$$Y_a(t) = \int_0^{2\pi} 1_{|\lambda - \pi| > a}\, e^{i\lambda t}\,dZ(\lambda), \qquad a \in [0, \pi).$$
Figure 6.2: Simulated autoregression (black dots) plotted against time, together with the low pass filter
(smoother) drawn as a red solid line.
This is implemented as
$$\hat{\mu}_{t|T} = \sqrt{\frac{2\pi}{T}}\sum_{j=1}^{T} e^{-i\lambda_j t} J_T(\lambda_j)\, 1_{|\lambda_j - \pi| > a}.$$
Advocacy of this approach for macroeconomics includes Baxter and King (1999) and Watson (2007).
Example 6.3.12. Figure 6.2 plots $\{Y_t, \hat{\mu}_{t|T}\}$ from running a low pass filter ($\hat{\mu}_{t|T}$, plotted in red) through a simulated Gaussian AR(1) ($\{Y_t\}$, plotted as black dots) with ϕ1 = 0.95, taking a near π, so all but the smallest frequencies are discarded. The low pass filter roughly follows the main thread of the series, but it has some problems at the start of the process — it will have difficulties with end effects.
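To make the low pass filter concrete, here is a minimal R sketch using the FFT. The AR(1) simulation, the series length and the threshold playing the role of a are all illustrative choices, and the DFT normalization is the one built into R's fft:

set.seed(1)
T <- 400
y <- as.numeric(arima.sim(model = list(ar = 0.95), n = T))  # simulated Gaussian AR(1)
J <- fft(y)                                   # DFT of the data at the Fourier frequencies
lambda <- 2 * pi * (0:(T - 1)) / T
keep <- abs(lambda - pi) > 0.95 * pi          # low pass: keep only frequencies near 0 or 2*pi
mu.hat <- Re(fft(J * keep, inverse = TRUE)) / T  # invert the DFT using the kept frequencies
plot(y, pch = 16, cex = 0.4)                  # black dots: the series
lines(mu.hat, col = "red")                    # red line: the low pass filter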
6.4 Estimating the spectral density

For a covariance stationary process, the spectrum $\{f_{YY}(\lambda)\}$ is usually estimated by fitting an AR(p),
$$\hat{f}_{YY}(\lambda) = \frac{\widehat{\mathrm{Var}}(\varepsilon_1)}{2\pi\left|\hat{\varphi}(e^{i\lambda})\right|^2}.$$

Write $\theta := f_{YY}(\lambda)$, the spectral density at one frequency λ ∈ [0, π]. There are many ways of estimating θ.
A core way of estimating θ is to assume the covariance stationary process follows an AR(p) process driven by white noise, which approximates θ by the spectral density of the AR(p) process:
$$\theta \simeq \theta_{AR(p)} = \frac{\mathrm{Var}(\varepsilon_1)}{2\pi\left|1 - \sum_{j=1}^{p}\varphi_j e^{i\lambda j}\right|^2},$$
when p is large. Now estimate the associated Var(ε1) and ϕ1:p, and plug the estimated parameters into the spectral density for an AR(p), yielding
$$\hat{\theta}_{AR(p)} = \frac{\widehat{\mathrm{Var}}(\varepsilon_1)}{2\pi\left|1 - \sum_{j=1}^{p}\hat{\varphi}_j e^{i\lambda j}\right|^2}.$$
Confidence intervals can then be generated by the delta method. For a Gaussian AR(p) process this procedure is the MLE of the spectral density at frequency λ. For large p (perhaps shrinking the AR coefficients at large lags towards zero) and large T, this procedure is widely used in applied problems.
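A minimal R sketch of this plug-in strategy; the simulated AR(2) data and the choice p = 4 are illustrative:

set.seed(2)
y <- as.numeric(arima.sim(model = list(ar = c(0.8, -0.4)), n = 1000))  # simulated AR(2)
fit <- ar(y, aic = FALSE, order.max = 4, method = "yule-walker")       # fit an AR(4)
phi <- fit$ar; s2 <- fit$var.pred                                      # phi_1:p and Var(eps)
lambda <- seq(0, pi, length.out = 512)
AC <- sapply(lambda, function(l) 1 - sum(phi * exp(1i * (1:length(phi)) * l)))
f.hat <- (s2 / (2 * pi)) / abs(AC)^2                                   # plug-in spectral density
plot(lambda, f.hat, type = "l", xlab = "lambda", ylab = "spectral density")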
Here we will focus instead on estimating θ using the periodogram ordinates IT (λ1), ..., IT (λT), placing high weights on frequencies near λ.
Example 6.4.1. Figure 6.3 shows a simulation experiment with T ∈ {50, 250, 1000} and an MA(1) model with θ1 = 0.9. Notice that as T increases each ordinate of the periodogram does not get any more precise, there are just more of them! The same setup is reported in Figure 6.4, but now for an AR(1) model with ϕ1 = 0.9.
Figure 6.3: Spectrum (red line) and periodogram (black line) for an MA(1) model with θ1 = 0.9. LHS: T = 50. Middle: T = 250. RHS: T = 1, 000.
It really helps to understand the properties of IT (λj) under the Fourier representation model: for each T,
$$E\left[|J_T(\lambda_j)|^2\right] = \frac{F_{YY}(\lambda_j) - F_{YY}(\lambda_{j-1})}{\lambda_j - \lambda_{j-1}}, \qquad \mathrm{Cov}[J_T(\lambda_j), J_T(\lambda_k)] = 0, \quad j \ne k,$$
then $J_T(\lambda_j) \perp\!\!\!\perp J_T(\lambda_k)$ and $I_T(\lambda_j) \perp\!\!\!\perp I_T(\lambda_k)$. In particular, rather beautifully,
$$|Z(\lambda)|^2 = Z(\lambda)Z(\lambda)^* = \frac{1}{2}\left(\int_0^{\lambda} f_{YY}(c)\,dc\right)\times\left\{B(1)^2 + W(1)^2\right\} \stackrel{L}{\sim} \frac{1}{2}\left(\int_0^{\lambda} f_{YY}(c)\,dc\right)\times\chi^2_2,$$
so
$$I_T(\lambda_j) \stackrel{ind}{\sim} \frac{1}{2}\chi^2_2\times f_{YY,j}, \qquad f_{YY,j} = \frac{\int_{\lambda_{j-1}}^{\lambda_j} f_{YY}(\lambda)\,d\lambda}{\lambda_j - \lambda_{j-1}}, \quad j = 1, 2, \ldots, T.$$
That is, the periodogram ordinates have converted a covariance stationary time series into a sequence of scaled independent variables! It is possible to show that, for covariance stationary processes,
$$\frac{I_T(\lambda_j)}{f_{YY}(\lambda_j)} \stackrel{d}{\to} \frac{1}{2}\chi^2_2,$$
Figure 6.4: Spectrum (red line) and periodogram (black line) for an AR(1) model with ϕ1 = 0.9. LHS: T = 50. Middle: T = 250. RHS: T = 1, 000.
and this ratio is asymptotically independent over j. If λj is close to λ and {fYY(λ)} is twice continuously differentiable, then by a second order Taylor expansion
$$f_{YY,j} \simeq f_{YY}(\lambda) + (\lambda_j - \lambda)f'_{YY}(\lambda) + \frac{1}{2}(\lambda_j - \lambda)^2 f''_{YY}(\lambda). \tag{6.22}$$
More broadly, notice that the complex JT(λ) is a weighted sum of the data, where the weights cannot be extreme as $|e^{i\lambda t}|^2 \le 1$. Hence you might expect that if {Yt} is covariance stationary with spectral density {fYY(λ)} (and a linear model with MD, m-dependent or i.i.d. shocks, so a CLT can be derived), then JT(λ) will obey a CLT. Corollary 11.2.1 of Subba Rao (2022) gives such a CLT,
$$\begin{pmatrix}\mathrm{Re}(J_T(\lambda))\\ \mathrm{Im}(J_T(\lambda))\end{pmatrix} \stackrel{d}{\to} N\left(0, \frac{1}{2}f_{YY}(\lambda)I_2\right),$$
the joint limit for the real and imaginary elements of JT(λ). So the Gaussian part of the Fourier representation model is not essential.
Kernel estimators

Kernel estimation of the spectral density goes back to the work of Grenander and Rosenblatt (1953) (later, the same idea was ported into kernel density estimation by Parzen (1962) and Rosenblatt (1956)). Subba Rao (2022) has a good analysis of the general case. Notice that $\hat{\theta}_h(\lambda) \ge 0$, as the periodogram ordinates and the kernel weights are non-negative.
Example 6.4.2. The left hand side of Figure 6.5 shows the case of the MA(1) process with θ1 = 0.9 and T = 1, 000. The red line is the true spectral density. The dotted black line is the kernel spectral density using a Gaussian kernel, with a standard deviation of 0.07. The green line is the rectangular kernel with $h = 0.04/\pi$.
Figure 6.5: Non-parametric estimator of the spectral density. Red: truth; green: rectangular kernel; black:
Gaussian kernel. Throughout T = 1, 000. LHS: MA(1) case with θ1 = 0.9 and RHS: AR(1) with ϕ1 = 0.9.
The right hand side shows the same result but for an AR(1) process with ϕ1 = 0.9. The results indicate the superiority of the Gaussian kernel for the case where the periodogram has some sharp peaks, here at frequencies close to 0. This is not surprising. The Gaussian kernel puts more weight closer to 0 and then damps everything else out. The rectangular kernel puts equal weight on all the frequencies close to 0.
Here we study in some detail the properties of a relatively simple version: the kernel estimator based on a rectangular kernel,
$$\hat{\theta}_h(\lambda) := \frac{1}{n_h}\sum_{j=1}^{T} I_T(\lambda_j)\,1(\lambda_j \in (\lambda - h, \lambda + h]), \qquad \lambda_j = \frac{2\pi j}{T};$$
the effective sample size of this statistic is
$$n_h := \sum_{j=1}^{T} 1\left(\frac{2\pi j}{T} \in (\lambda - h, \lambda + h]\right) \simeq 2\frac{Th}{2\pi}.$$
Theorem 6.4.3. Assume the spectral density $\{f_{YY}(\lambda)\}_{\lambda\in[0,\pi]}$ is twice continuously differentiable. Then
$$\mathrm{bias}\left(h^{-2}\hat{\theta}_h(\lambda)\right) \to \frac{1}{6}f''_{YY}(\lambda), \quad\text{and}\quad \mathrm{Var}\left(\sqrt{Th}\,\hat{\theta}_h(\lambda)\right) \to \pi f_{YY}(\lambda)^2, \qquad \lambda\in[0,\pi],$$
as $h\to 0$ and $Th\to\infty$. Hence $\mathrm{mse}(\hat{\theta}_h(\lambda)) \simeq \frac{h^4}{36}f''_{YY}(\lambda)^2 + \frac{\pi}{Th}f_{YY}(\lambda)^2$.

Here the bias and the variance are in conflict, the bias preferring a small h, the variance a large h. Balancing them gives $h \propto T^{-1/5}$, hence $\mathrm{mse}(\hat{\theta}_h) = O(T^{-4/5})$.
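To see where the $T^{-4/5}$ rate comes from, minimize the approximate mse over h:
$$\frac{d}{dh}\left\{\frac{h^4}{36}f''_{YY}(\lambda)^2 + \frac{\pi}{Th}f_{YY}(\lambda)^2\right\} = \frac{h^3}{9}f''_{YY}(\lambda)^2 - \frac{\pi}{Th^2}f_{YY}(\lambda)^2 = 0 \;\Longrightarrow\; h^* = \left\{\frac{9\pi f_{YY}(\lambda)^2}{T f''_{YY}(\lambda)^2}\right\}^{1/5} \propto T^{-1/5},$$
so $\mathrm{mse}(\hat{\theta}_{h^*}) = O\left((T^{-1/5})^4\right) = O(T^{-4/5})$.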
Why? Using the Taylor expansion (6.22) and writing $B = \lfloor Th/(2\pi)\rfloor$,
$$E[\hat{\theta}_h(\lambda)] - f_{YY}(\lambda) \simeq \frac{2\pi}{T}\frac{f'_{YY}(\lambda)}{n_h}\sum_{j=-B}^{B} j + \left(\frac{2\pi}{T}\right)^2\frac{1}{n_h}\frac{1}{2}f''_{YY}(\lambda)\sum_{j=-B}^{B} j^2,$$
and the first sum is exactly zero by symmetry. Hence
$$\mathrm{bias}\left[h^{-2}\hat{\theta}_h\right] \simeq h^{-2}\frac{1}{n_h}\left(\frac{2\pi}{T}\right)^2\frac{1}{2}f''_{YY}(\lambda)\,\frac{B(B+1)(2B+1)}{3}, \qquad\text{using}\quad \sum_{j=1}^{n} j^2 = \frac{n(n+1)(2n+1)}{6},$$
$$\simeq h^{-2}\left(2\frac{Th}{2\pi}\right)^{-1}\frac{T}{2\pi}h^3 f''_{YY}(\lambda)/3 \to f''_{YY}(\lambda)/6.$$
Likewise
$$\mathrm{Var}\left[\sqrt{Th}\,\hat{\theta}_h\right] = Th\frac{1}{n_h^2}\sum_{j=1}^{T} f_{YY,j}^2\,1(\lambda_j\in(\lambda-h,\lambda+h]) \simeq Th\left(2\frac{Th}{2\pi}\right)^{-2} f_{YY}(\lambda)^2\, 2\frac{Th}{2\pi} = \pi f_{YY}(\lambda)^2.$$
Recall that the periodogram can be written in terms of the sample autocovariances:
$$I_T(\lambda) = \frac{1}{2\pi}\sum_{s=-T}^{T}\frac{T-|s|}{T}\cos(s\lambda)\,\hat{\gamma}_s, \qquad \hat{\gamma}_s = \frac{1}{T-|s|}\sum_{t=1}^{T-|s|} y_t y_{t+|s|},$$
$$= \frac{1}{2\pi}\sum_{s=-T}^{T}\cos(s\lambda)\,\tilde{\gamma}_s, \qquad \tilde{\gamma}_s = \frac{1}{T}\sum_{t=1}^{T-|s|} y_t y_{t+|s|},$$
so the kernel estimator weights the periodogram ordinates. Thus it must be the case that the kernel estimator can be written in the time domain as a sum
$$\hat{\theta}_h(\lambda) = \frac{1}{2\pi}\sum_{s=-T}^{T} w_s(\lambda)\,\tilde{\gamma}_s, \qquad \tilde{\gamma}_s = \frac{1}{T}\sum_{t=1}^{T-|s|} y_t y_{t+|s|},$$
where {ws(λ)} are weights, called lag windows. Think of w0(λ): it places weight on γ̃0, the sample variance; while w1(λ) places weight on γ̃1, the sample autocovariance at lag 1.
Example 6.4.4. Suppose K(u) ∝ 1(|u| ≤ 1). Then the s-th weight is $w_s = 1(|s|\le B)$ and
$$K(\lambda) = \sum_{s=-T}^{T} w_s e^{-is\lambda} = \sum_{s=-B}^{B} e^{-is\lambda} = 1 + 2\sum_{s=1}^{B}\cos(s\lambda) = \frac{\sin[(B+1/2)\lambda]}{\sin(\lambda/2)},$$
the Dirichlet kernel in Fourier analysis. As Figure 6.6 demonstrates (this problem does not go away as B becomes quite large), K(λ) can have rather negative weights, so $\hat{\theta}_h(\lambda)$ can sometimes be negative, even though the estimand is non-negative. This is problematic for applications. Why does this Dirichlet kernel formula hold?
Figure 6.6: Spectral weight function K(λ) corresponding to a rectangular weight function in the time domain.
LHS: B = 2; middle: B = 5 and RHS: B = 10.
Thus
$$1 + 2\sum_{s=1}^{n}\cos(s\lambda) = \sum_{s=-n}^{n} e^{is\lambda} = D_n(e^{i\lambda}),$$
this sum being called the Dirichlet kernel. Now
$$D_n(e^{i\lambda}) = \frac{e^{(n+1)i\lambda} - e^{-in\lambda}}{e^{i\lambda} - 1} = \frac{e^{(n+1/2)i\lambda} - e^{-i(n+1/2)\lambda}}{e^{i\lambda/2} - e^{-i\lambda/2}}, \quad\text{scaling both top and bottom by } e^{-i\lambda/2},$$
$$= \frac{\sin[(n+1/2)\lambda]}{\sin(\lambda/2)}.$$
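A tiny R check of the negativity; B = 5 is an illustrative truncation:

B <- 5; lambda <- seq(0.01, pi, length.out = 400)
K <- sin((B + 0.5) * lambda) / sin(lambda / 2)   # Dirichlet kernel
min(K)                                           # clearly below zero
plot(lambda, K, type = "l"); abline(h = 0, lty = 2)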
Example 6.4.6. [Bartlett weight] Suppose $w_s = \left(1 - \frac{|s|}{B}\right)1(|s|\le B)$, a weight function due to Bartlett. Then
$$K_B(\lambda) = \sum_{s=-B}^{B}\left(1 - \frac{|s|}{B}\right)e^{-is\lambda} = \frac{1}{B}\left(\frac{\sin(B\lambda/2)}{\sin(\lambda/2)}\right)^2,$$
the Fejér kernel in Fourier analysis. This is non-negative and is drawn in Figure 6.7 for a variety of values of B. Hence $\hat{\theta}_h(\lambda)$ is non-negative. Why? Recall
Figure 6.7: Spectral weight function K(λ) corresponding to the Bartlett weight function in the time domain.
LHS: B = 2; middle: B = 5 and RHS: B = 10.
$$D_n(w) := \sum_{s=-n}^{n} w^s = \frac{w^{n+1} - w^{-n}}{w - 1}.$$
Then
$$nF_n(w) = \sum_{s=0}^{n-1} D_s(w) = \sum_{s=0}^{n-1}\frac{w^{s+1} - w^{-s}}{w-1} = \frac{1}{w-1}\left(\sum_{s=0}^{n-1}w^{s+1} - \sum_{s=0}^{n-1}w^{-s}\right)$$
$$= \frac{1}{w-1}\left\{w\sum_{s=0}^{n-1}w^s - \sum_{s=0}^{n-1}(1/w)^s\right\} = \frac{1}{w-1}\left\{w\frac{1-w^n}{1-w} - \frac{1-(1/w)^n}{1-1/w}\right\}$$
$$= \frac{w}{(w-1)^2}\left\{(w^n - 1) - (1 - (1/w)^n)\right\} = \frac{w}{(w-1)^2}\left\{w^n - 2 + (1/w)^n\right\}$$
$$= \frac{1}{\left(w^{1/2} - w^{-1/2}\right)^2}\left\{w^n - 2 + (1/w)^n\right\} = \left(\frac{w^{n/2} - w^{-n/2}}{w^{1/2} - w^{-1/2}}\right)^2.$$
Hence, setting $w = e^{i\lambda}$ and $n = B$,
$$K_B(\lambda) = F_B(e^{i\lambda}) = \frac{1}{B}\left(\frac{e^{i\lambda B/2} - e^{-i\lambda B/2}}{e^{i\lambda/2} - e^{-i\lambda/2}}\right)^2 = \frac{1}{B}\left(\frac{\sin(B\lambda/2)}{\sin(\lambda/2)}\right)^2,$$
which is the stated result. This is non-negative and integrates to one. Hence $\hat{\theta}_h(\lambda)$ is non-negative.
The Bartlett lag window is used extensively in applied time series. In the case of λ = 0, the resulting long run variance estimator
$$2\pi\hat{\theta}_B(0) = \sum_{s=-B}^{B}\left(1 - \frac{|s|}{B}\right)\tilde{\gamma}_s, \qquad \tilde{\gamma}_s = \frac{1}{T}\sum_{t=1}^{T-|s|} y_t y_{t+|s|},$$
is a non-negative estimator.
Remark 22. In econometrics the scaled special case of the Bartlett estimator, $2\pi\hat{\theta}_B(0)$, is often called the Newey and West (1987) estimator. The Bartlett version of $2\pi\hat{\theta}_B(0)$ is used very extensively as an ingredient to statistical inference procedures for problems. More extensive discussion of estimators of the zero frequency includes Andrews (1991) and Lazarus, Lewis, and Stock (2021). Obviously, when the data is a vector, the Bartlett version of $2\pi\hat{\theta}_B(0)$ is a square, symmetric positive semi-definite matrix.
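A minimal R sketch of the Bartlett long run variance estimator $2\pi\hat{\theta}_B(0)$; the simulated AR(1) data and the truncation B = 10 are illustrative, and the function demeans the data before forming the γ̃s:

lrv.bartlett <- function(y, B) {
  T <- length(y); y <- y - mean(y)
  g <- sapply(0:B, function(s) sum(y[1:(T - s)] * y[(1 + s):T]) / T)  # gamma-tilde_s
  g[1] + 2 * sum((1 - (1:B) / B) * g[-1])        # sum of (1 - |s|/B) * gamma-tilde_s
}
set.seed(4)
y <- as.numeric(arima.sim(model = list(ar = 0.5), n = 500))
lrv.bartlett(y, B = 10)                          # non-negative by construction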
6.5 Wold representation

Before discussing the Wold representation, to get into the swing of things, it is elegant to talk again about martingales. Recall the beautiful and deep Doob martingale: if E[|θ|] < ∞ and {Ft} is a filtration, then
$$X_t = E[\theta|\mathcal{F}_t]$$
is a martingale with respect to {Ft}, and so its increments Xt − Xt−1 form a martingale difference. We will use a similar idea here.
Think of a time series $\{Y_t\}_{t=-\infty}^{\infty}$ adapted to $\{\mathcal{F}_t\}_{t=-\infty}^{\infty}$, the natural filtration going into the infinite past. Assume for each t that E[|Yt|] < ∞ and that $E[Y_t|\mathcal{F}_{t-j}]$ settles down as j grows (think of the limit as E[Yt], the expectation given you know nothing). Then at time t, telescope:
$$Y_t = E[Y_t|\mathcal{F}_t] = \left\{E[Y_t|\mathcal{F}_t] - E[Y_t|\mathcal{F}_{t-1}]\right\} + E[Y_t|\mathcal{F}_{t-1}],$$
and writing $U_{t,j} = E[Y_t|\mathcal{F}_{t-j}] - E[Y_t|\mathcal{F}_{t-j-1}]$ produces a sequence $\{U_{t,j}\}_{j=0}^{\infty}$ which is a martingale difference in j = 0, 1, 2, ..., for each t. Then keeping going,
$$Y_t = \sum_{j=0}^{\infty} U_{t,j} + E[Y_t];$$
e.g. for a stationary AR(1), $U_{t,j} = \varphi_1^j Y_{t-j} - \varphi_1^{j+1}Y_{t-j-1} = \varphi_1^j\varepsilon_{t-j}$.
Instead of working with conditional expectations, work with linear projections. Before we do this in general, think about a generic linear regression. Suppose the pair X, Y have zero means and each has a variance. Recall, if Var(X) is non-singular, then
$$\beta_{Y\sim X} = \mathrm{Var}(X)^{-1}\mathrm{Cov}(X, Y).$$
In the context below, the special case where the X are uncorrelated is important. If Var(X) and, so, Var(X)−1 are diagonal, then
$$\beta_{Y\sim X} = \begin{pmatrix}\beta_{Y\sim X_1}\\ \beta_{Y\sim X_2}\\ \vdots\\ \beta_{Y\sim X_p}\end{pmatrix},$$
collecting p small regressions.
We are going to apply these ideas to time series. There are two stages. The first stage regresses each Yt on its own past Y1:t−1, producing the sequence of one-step ahead projection errors {Ut}t≥1.

The second stage needs some preliminary work! For each t ≥ 2, there exists a non-stochastic (t − 1) × (t − 1) matrix Bt−1 such that U1:t−1 = Bt−1Y1:t−1 (just think about how Ut is built out of Yt and βYt∼Y1:t−1 × Y1:t−1 — the same type of relationship holds for each t), so the second stage regresses Yt on U1:t, yielding coefficients βYt∼Ut−s, noting, in general, that βYt∼Ut−s, for s ≥ 0, will change with t (no stationarity assumption has been made). The fitted value of the regression of Yt on U1:t has no error, Ỹt = Yt, as the regressors U1:t are a linear combination of Y1:t. The result
$$Y_t = \sum_{s=0}^{t-1}\psi_{t,s}U_{t-s}$$
thus holds exactly. Intuitively, if {Yt} is covariance stationary and the first stage allowed Yt to be regressed on Y−∞:t−1 and then Yt on U−∞:t, then you would expect the t in the notation ψt,s to cease to move those coefficients around and yield $Y_t = \sum_{s=0}^{\infty}\psi_s U_{t-s}$. Of course some additional technique is needed, allowing an infinite number of regressors. The math behind this extension is linear projection in Hilbert space, which is discussed briefly in Appendix 6.8.
Theorem 6.5.2. [Wold decomposition] Assume $\{Y_t\}_{t=-\infty}^{\infty}$ is a covariance stationary process. Then
$$Y_t = \sum_{s=0}^{\infty}\psi_s U_{t-s} + \mu_t,$$
where {Ut} are zero mean, weak white noise and {µt} is non-stochastic.

Proof. The best linear projection of Yt on the past Yt−1, Yt−2, ... is
$$\hat{Y}_t := P(Y_t|Y_{-\infty:t-1}) = a_0 + \sum_{j=1}^{\infty} a_j Y_{t-j},$$
where the sequence {aj} is time invariant due to covariance stationarity. Let {Ut} be an infinitely lived sequence of one-step ahead errors from the linear projections,
$$U_t = Y_t - \hat{Y}_t.$$
By construction the projection errors are uncorrelated, Cov(Ut, Us) = 0, t ≠ s. Regressing Yt on the uncorrelated {Ut−j} gives
$$\psi_j = \frac{\mathrm{Cov}(Y_t, U_{t-j})}{\mathrm{Var}(U_{t-j})} = \frac{\mathrm{Cov}(Y_0, U_{-j})}{\mathrm{Var}(U_{-j})}, \qquad j = 0, 1, 2, \ldots,$$
free of t as {Yt, Ut} is covariance stationary.

In most applications we work with zero mean covariance stationary processes, and then the Wold decomposition is simply $Y_t = \sum_{s=0}^{\infty}\psi_s U_{t-s}$.
Example 6.5.3. Suppose Yt = ϕ1Yt−1 + εt, where {εt} is zero mean weak white noise and |ϕ1| < 1, so {Yt} is covariance stationary. Then $\hat{Y}_t = P(Y_t|Y_{-\infty:t-1}) = \varphi_1 Y_{t-1}$, so
$$U_t = Y_t - \hat{Y}_t = Y_t - \varphi_1 Y_{t-1} = \varepsilon_t,$$
where
$$\psi_j = \frac{\mathrm{Cov}(Y_t, U_{t-j})}{\mathrm{Var}(U_{t-j})} = \frac{\mathrm{Cov}(Y_t, \varepsilon_{t-j})}{\mathrm{Var}(\varepsilon_t)} = \varphi_1^j,$$
so
$$Y_t = \sum_{j=0}^{\infty}\psi_j\varepsilon_{t-j} = \sum_{j=0}^{\infty}\varphi_1^j\varepsilon_{t-j}.$$
Example 6.5.4. Suppose Yt = α cos(λt) + β sin(λt), where α, β are uncorrelated, zero mean random variables with variance σλ². We saw in Example 3.3.4 that {Yt} is covariance stationary, with zero mean and γs = σλ² cos(λs). What is the Wold decomposition? Suppose we start at time t = 1, then
$$\begin{pmatrix}Y_1\\ Y_2\end{pmatrix} = \begin{pmatrix}\cos(\lambda) & \sin(\lambda)\\ \cos(2\lambda) & \sin(2\lambda)\end{pmatrix}\begin{pmatrix}\alpha\\ \beta\end{pmatrix}, \quad\text{so}\quad \begin{pmatrix}\alpha\\ \beta\end{pmatrix} = \begin{pmatrix}\cos(\lambda) & \sin(\lambda)\\ \cos(2\lambda) & \sin(2\lambda)\end{pmatrix}^{-1}\begin{pmatrix}Y_1\\ Y_2\end{pmatrix}.$$
More broadly,
$$U_t = Y_t - \hat{Y}_t = 0, \qquad t = 3, 4, \ldots$$
Thus
$$\mu_t = \alpha\cos(\lambda t) + \beta\sin(\lambda t),$$
which is now viewed, by the Wold decomposition, as non-stochastic as (α, β) is determined by Y1:2. Hence the Wold decomposition is Yt = µt, for t = 3, 4, ...
6.6 Kalman filter as a linear projection

Recall the linear state space form, but now remove the Gaussian assumption. The resulting model is not a hidden Markov model. It is the model which appears in the Kalman (1960) paper. Here we will analyse it using linear projection theory.
Definition 6.6.1. [Linear state space] The LSS model has the pair {Yt, αt+1} following
$$\bar{Z}_t = \begin{pmatrix}Y_t\\ \alpha_{t+1}\end{pmatrix} = \phi_{t,1}\bar{Z}_{t-1} + B_t\omega_t, \quad \phi_{t,1} = \begin{pmatrix}0_{d\times d} & Z_t\\ 0_{r\times d} & T_t\end{pmatrix}, \quad B_t = \begin{pmatrix}H_t & 0_{d\times r}\\ 0_{r\times d} & Q_t\end{pmatrix}, \quad E[\omega_1] = 0_{d+r}, \; \mathrm{Var}(\omega_1) = I_{d+r},$$
where we assume a1, P1, ϕt,1, Bt are non-stochastic.
Then the Kalman filter replaces conditional expectations by linear projections, and conditional variances by unconditional expectations of outer products:
$$a_{t+1} := P(\alpha_{t+1}|Y_{1:t}), \qquad P_{t+1} := E\left[(\alpha_{t+1} - a_{t+1})(\alpha_{t+1} - a_{t+1})^T\right],$$
$$a_{t|t} := P(\alpha_t|Y_{1:t}), \qquad P_{t|t} := E\left[(\alpha_t - a_{t|t})(\alpha_t - a_{t|t})^T\right].$$
Then the sequence {at+1, Pt+1} is the same as Algorithm 5.5.3, which had assumed Gaussianity.
Proof. As Var(ω1) = Id+r and Var(α1) = P1, so Y1:T, α1:T ∈ H. Assume the result holds up to time t − 1, and let
$$a_t := P(\alpha_t|Y_{1:t-1}), \qquad v_t := Y_t - P(Y_t|Y_{1:t-1}) = Y_t - Z_t a_t. \tag{6.25}$$
So, applying the linear projection results of Appendix 6.8 to the uncorrelated regressors (Y1:t−1, vt), the updating recursions follow, and
$$P_{t|t} = P_t - \beta_t F_t\beta_t^T.$$
These are the third and fourth results in the Kalman filter.
These are the third and fourth result in the Kalman filter. Likewise
iid
{εt } ⊥⊥ σt2 ,
Xt = σt εt , εt ∼ N (0, 1),
where {Ut } is a martingale difference sequence if E[σt2 ] < ∞. Further, {Ut } is zero mean, weak noise if
E[σt4 ] < ∞. If, in addition,
2
= µσ2 + ϕ1 σt2 − µσ2 + Vt ,
σt+1 {Ut } ⊥ {Vt } ,
and {Vt } is weak white noise, then {Yt , αt+1 } is in the LSS and the Kalman filter delivers
2 2 2
P (σt+1 |Y1:t ) = P (σt+1 |X1:t ).
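For concreteness, here is a minimal univariate R sketch of the filtering recursions, written in the standard Kalman form that the result above justifies without Gaussianity; the scalar system matrices and simulated local level data are illustrative:

kf <- function(y, Z, H, Tt, Q, a1, P1) {
  n <- length(y); a <- a1; P <- P1; out <- numeric(n)
  for (t in 1:n) {
    v <- y[t] - Z * a                 # innovation v_t = Y_t - Z a_t
    F <- Z^2 * P + H^2                # F_t = Var(v_t)
    K <- Tt * P * Z / F               # Kalman gain
    a <- Tt * a + K * v               # a_{t+1} = P(alpha_{t+1} | Y_{1:t})
    P <- Tt^2 * P + Q^2 - K^2 * F     # P_{t+1}
    out[t] <- a
  }
  out
}
set.seed(5); alpha <- cumsum(rnorm(200, sd = 0.3)); y <- alpha + rnorm(200)
a.pred <- kf(y, Z = 1, H = 1, Tt = 1, Q = 0.3, a1 = 0, P1 = 10)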
6.7 Recap
Again we have covered a lot of ground, now focusing entirely on linear methods, based on covariance stationarity.
The main topics have been spectral methods, the Wold decomposition and the Kalman filter.
Table 6.1 contains the major topics covered in this chapter.
6.8 Appendix: linear projection

Take a step back from time series, returning to introductory statistics. Think of the collection of random variables with a variance:
$$\mathcal{H} = \left\{Z : E[Z^2] < \infty\right\}.$$
(Then H is a special case of a Hilbert space.) In this subsection we will make three assumptions:

1. Assume X1, ..., Xp ∈ H.

2. Let
$$\mathcal{M} = \left\{W = \sum_{j=0}^{p} b_j X_j : b_j\in\mathbb{R}\right\} = \mathrm{sp}(1, X_1, \ldots, X_p), \qquad X_0 = 1.$$

3. Assume Y ∈ H.
Now predict Y using X, but constrain ourselves to linear predictors. Measuring the error of the prediction by mean square error, then
$$\left(\gamma, \beta_{Y|X}\right) = \arg\min_{b\in\mathbb{R}^{p+1}} E_{Y,X}\left[\left(Y - b^T X\right)^2\right] = \arg\min_{\hat{Y}\in\mathcal{M}}\left\langle Y - \hat{Y}, Y - \hat{Y}\right\rangle.$$
The minimizer is
$$\hat{Y} = P(Y|X) = \mu_Y + \beta_{Y|X}^T(X - \mu_X), \qquad \beta_{Y|X} = \mathrm{Var}(X)^{-1}\mathrm{Cov}(X, Y).$$
We will write $U_{Y|X} = Y - P(Y|X)$ for the projection error, so
$$\mathrm{Var}(U_{Y|X}) = \mathrm{Var}(Y) - \beta_{Y|X}^T\mathrm{Cov}(X, Y).$$
Why? Think of {Z1, ..., ZJ} as a collection of scalar random variables, each possessing a variance. Then
$$\mu_{\sum_{j=1}^{J} Z_j} = \sum_{j=1}^{J}\mu_{Z_j}, \qquad \mathrm{Cov}\left(X, \sum_{j=1}^{J} Z_j\right) = \sum_{j=1}^{J}\mathrm{Cov}(X, Z_j), \quad\text{implying}\quad \beta_{\sum_{j=1}^{J} Z_j|X} = \sum_{j=1}^{J}\beta_{Z_j|X},$$
as
$$\mathrm{Cov}\left(X, \sum_{j=1}^{J} Z_j\right) = \mathrm{Var}(X)\sum_{j=1}^{J}\beta_{Z_j|X}.$$
Other useful properties are
$$P(X_j|X) = X_j, \qquad P(Y|X) = \mu_Y \;\text{ if }\; \mathrm{Cov}(X, Y) = 0_p.$$
So far we have made projections using X. What happens if we add Z to our set of predictors? Then the linear projection of Y using (X, Z) is
$$P(Y|X, Z) = \mu_Y + \beta_{Y|X,Z}^T\begin{pmatrix}X - \mu_X\\ Z - \mu_Z\end{pmatrix},$$
and we would like to know how $\beta_{Y|X,Z}^T$ splits across X and Z. This result is specialized, requiring orthogonality, but orthogonality can be induced. Build the linear projection errors
$$U_{Z|X} = Z - P(Z|X), \qquad \mathrm{Cov}(U_{Z|X}, X) = 0.$$
So then
$$P(Y|X, Z) = P(Y|X) + \beta_{Y-\mu_Y|U_{Z|X}}^T U_{Z|X},$$
while
$$\mathrm{Var}(U_{Y|X,Z}) = \mathrm{Var}(U_{Y|X}) - \beta_{Y-\mu_Y|U_{Z|X}}^T\mathrm{Var}(U_{Z|X})\beta_{Y-\mu_Y|U_{Z|X}}.$$
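A small numerical check in R of the orthogonalization result; the simulated design is illustrative. The projection of Y on (X, Z) equals the projection on X plus a projection on the orthogonalized error UZ|X:

set.seed(6)
n <- 1e5
X <- rnorm(n); Z <- 0.6 * X + rnorm(n); Y <- 1 + 2 * X - 1.5 * Z + rnorm(n)
UZX <- residuals(lm(Z ~ X))                  # U_{Z|X}: orthogonal to (1, X) in sample
fit.joint <- fitted(lm(Y ~ X + Z))           # P(Y | X, Z)
fit.split <- fitted(lm(Y ~ X)) + coef(lm(Y ~ UZX))[2] * UZX
max(abs(fit.joint - fit.split))              # zero up to rounding error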
Chapter 7
Action: Causality
In introductory statistics regression is about seeing predictors X and estimating where outcome Y is likely to
land. This is passive — we see X and then see Y .
Causality in introductory statistics is about how outcome Y will change as an assignment A is moved. This
is active — we impact the world by altering A and then this pinballs to changing Y .
In time series causality is similar, but different. In prediction, we use past data, Y1:t−1 , as predictors of Yt .
In causality, we change (or imagine changing) an assignment At−s which then moves Yt , that is s ≥ 0 periods
later.
Many areas of the time series literature are causal. Examples include:

– treatment effects: estimating average treatment effects under some form of sequential randomization or unconfoundedness. These treatment effects are often called impulse response functions in the time series literature.

– control and reinforcement learning: designing a sequence of assignments {At} so a sequence of outcomes {Yt} or utilities is favourable.
7.1 Causality and time series assignments

Start with formalizing the language of a time series experiment based on randomization. This language will be expressed in terms of potential outcomes, outcomes and assignments.

First think of a setup we will quickly move away from! A basic causal system might be the sequence $\{Z_{t,T}\}_{t\ge 1}$, which is made up of the random variables
$$Z_{t,T} := \begin{pmatrix}\{Y_t(a_{1:T})\}_{a_{1:T}\in\mathcal{A}^T}\\ X_t\\ A_{1:T}\end{pmatrix}, \qquad A_{1:T}\in\mathcal{A}^T;$$
the p-dimensional a1:T ∈ A^T are possible assignment paths from time 1 to time T, the {Yt(a1:T)}a1:T∈A^T are the time-t potential outcomes and A1:T is the assignment path, while the time-t d-dimensional outcome is Yt = Yt(A1:T). The predictor Xt is not causally moved by the assignments (in economics the Xt variable would be called exogenous).

A fundamental restriction is that potential outcomes should not be moved by future assignments. It would then be natural to redefine the notation and write the potential outcomes as Yt(a1:t). Using that fundamental restriction, we can define a causal time series system, which will be the center of the action in this section.
Definition 7.1.1. The causal time series system {Zt}t≥1 is made up of the random variables
$$Z_t := \begin{pmatrix}\{Y_t(a_{1:t})\}_{a_{1:t}\in\mathcal{A}^t}\\ X_t\\ A_{1:t}\end{pmatrix}.$$
Here the p-dimensional a1:t ∈ A^t are possible treatment paths from time 1 to time t, the {Yt(a1:t)}a1:t∈A^t are the time-t potential outcomes and A1:t is the assignment path, while the time-t d-dimensional outcome is Yt = Yt(A1:t). The predictor Xt is not causally moved by the assignments.
Example 7.1.2. [Control and treatment] The leading case of this is univariate with A = {0, 1}; then when the time-t assignment is At = 1 the assignment is usually said to be to treatment, and when At = 0 the assignment is said to be to control. Often researchers call At the treatment, but I prefer the more neutral nomenclature assignment (and use treatment and control with their standard meanings). I think it is clearer. The left hand side of Figure 7.1 shows the corresponding potential outcomes associated with a path of these binary assignments going up to T = 3. That is, it plots {Y1(a1)}a1∈A, then {Y2(a1:2)}a1:2∈A² and finally {Y3(a1:3)}a1:3∈A³. The right hand side highlights the path of Yt = Yt(A1:t), where here A1:3 = (1, 1, 0)T.
The time-t causal effect of moving from assignment path a′1:t to assignment path a1:t is the random variable $Y_t(a_{1:t}) - Y_t(a'_{1:t})$. Bojinov and Shephard (2019) work with this causal time series system to study time series experiments.
They provide many references to the literature, noting the vast majority of the work in this literature looks
at panel data not pure time series. See also Angrist and Kuersteiner (2011), Angrist, Jordà, and Kuersteiner
(2018), Rambachan and Shephard (2021) and Bojinov, Rambachan, and Shephard (2021).
Figure 7.1: The left figure shows all the potential outcome paths for T = 3. The right figure shows the observed
outcome path Y1:3 (A1:3 ) where A1:3 = (1, 1, 0)T , indicated by the thick blue line. The gray arrows indicate the
missing data.
In experiments the researcher will control this assignment law; in observational studies it will be unknown. Crucially notice that potential outcomes do not appear in the conditioning set — so assignment is based on observables.
Example 7.1.4. [Linear case] Hasbrouck (1991a,b) studied the causal impact of a buy initiated trade (on a financial market) on the mid-price just after the buy (or sell). Think of At = 1 if trade-t is a buy (treatment), At = 0 if a sell (control). A recent discussion of the finance literature on this topic is included in
Some advanced methods we will discuss in Section 7.2 will try to make causal conclusions just from seeing the path Y1:T — this is traditional in empirical macroeconomics. This can be done under very strong assumptions. In nearly all of the discussion in this section, the predictors will be entirely ignored to ease exposition.
Example 7.1.5. [Linear case] The causal time series system with linear potential outcomes has
$$Y_t(a_{1:t}) = \sum_{j=0}^{t-1}\theta_j a_{t-j} + V_t,$$
where {Vt} is a stochastic process not causally impacted by a1:t (e.g. it could include Xt or lagged versions of Xt) and $\{\theta_j\}_{j=0}^{t-1}$ is a non-stochastic sequence. If Vt = 0, with probability one, then the potential outcomes are non-stochastic.
We often measure the causal impact of changing the first element of At. Highlighting the causal impact of other elements of At follows the same logic and raises no new intellectual issues. It is sometimes helpful to work with the ultra compact notation: lag-s potential outcomes are defined as
$$Y_{t,s}(a_{t-s,1}) := Y_t\left(A_{1:t-s-1}, (a_{t-s,1}, A_{t-s,2:p}), A_{t-s+1:t}\right).$$
The insightful notation {Yt,s(at−s,1)}at−s,1∈A1 appears in the work of Angrist and Kuersteiner (2011) and Angrist, Jordà, and Kuersteiner (2018). Then Yt = Yt,s(At−s,1), noticing that Yt,s is not the s-th element of Yt. This lag-s potential outcome notation buries many details, so it needs to be used with thought — this is not a non-interference assumption; instead the deep dependence on A1:t−s−1, At−s,2:p, At−s+1:t is just suppressed.

Definition 7.1.6. [lag-s average treatment effect] Assuming the causal system, the lag-s average treatment effect of moving the first element of At, written At−s,1, from 0 to 1 is, at time t,
$$\tau_{t,s} := E\left[Y_{t,s}(1) - Y_{t,s}(0)\right].$$
Example 7.1.7. [Linear case] (Continuing Example 7.1.5) The causal time series system with linear potential outcomes has
$$Y_{t,s}(a_{t-s,1}) = \theta_s\begin{pmatrix}a_{t-s,1}\\ 0\\ \vdots\\ 0\end{pmatrix} + U_t, \qquad U_t = \sum_{j=0}^{s-1}\theta_j A_{t-j} + \theta_s\begin{pmatrix}0\\ A_{t-s,2}\\ \vdots\\ A_{t-s,p}\end{pmatrix} + \sum_{j=s+1}^{t-1}\theta_j A_{t-j} + V_t.$$
m-order causality restricts our study of causal impacts of assignments on outcomes to not more than m periods.

Assumption 7.1.8. Assume a causal time series system. It is m-order causal if, for each t,
$$Y_t(a_{1:t}) = Y_t(a'_{1:t}) \quad\text{whenever}\quad a_{t-m:t} = a'_{t-m:t}.$$
Again it is a time series version of a non-interference assumption (Cox (1958b)). For an m-th order causal time series system write the time-t potential outcome using the shorthand Yt(at−m:t).

Remark 7.1.9. m-order causality says nothing about the probability law of the A1:T, X1:T, Y1:T — so it has no direct relationship to m-order Markov processes or m-dependence. However, it shares some of the spirit of those methods, here limiting causal effects to m lags. If you are worried about this restriction, think of m as a billion.
Under m-th order causality, the causal system {Zt}t≥1 simplifies to
$$Z_t = \begin{pmatrix}\{Y_t(a_{t-m:t})\}_{a_{t-m:t}\in\mathcal{A}^{m+1}}\\ X_t\\ A_{t-m:t}\end{pmatrix}, \qquad Y_t = Y_t(A_{t-m:t}).$$
The dimension of Zt does not change with t. Thus under m-order causality it is possible to think of {Zt} as infinitely lived without having to deal with the technicality of infinitely dimensional objects — opening up the possibility of working with a stationarity assumption. The causal effect of moving from assignment path a′t−m:t to at−m:t is then Yt(at−m:t) − Yt(a′t−m:t).
Definition 7.1.10. The (time-invariant) linear m-order causal time series system (with no predictors) has
$$Y_t(a_{t-m:t}) = \sum_{j=0}^{m}\theta_j a_{t-j} + V_t, \qquad A_t = \sum_{j=1}^{p}\phi_j A_{t-j} + \sum_{j=1}^{q}\beta_j Y_{t-j} + \varepsilon_t, \qquad Y_t = Y_t(A_{t-m:t}).$$
Notice that in the linear system the time-t assignment instantly impacts the time-t potential outcome, but outcomes only impact assignments with a lag. This is crucial. It needs to be justified in the applied context in which this model is used. The outcome and causal effects are, respectively,
$$Y_t = \sum_{j=0}^{m}\theta_j A_{t-j} + V_t \qquad\text{and}\qquad Y_t(a_{t-m:t}) - Y_t(a'_{t-m:t}) = \sum_{j=0}^{m}\theta_j\left(a_{t-j} - a'_{t-j}\right).$$
Remark 7.1.11. If the {At} are i.i.d. and Vt = 0 for all t, then the time series {Yt} is a vector MA(m) process. For a moment think of the MA(1) case to simplify the exposition, then
$$\gamma_0 = \theta_0\mathrm{Var}(A_1)\theta_0^T + \theta_1\mathrm{Var}(A_1)\theta_1^T, \qquad \gamma_1 = \theta_1\mathrm{Var}(A_1)\theta_0^T.$$
In this i.i.d. assignment case, if we see the time series of assignments we can estimate Var(A1), while the outcomes allow us to estimate γ0:1. Taken together they allow us to learn θ0:1. This carries over to the MA(m) case. Game over, we are then causal heroes. But the reach of this strategy is limited. If we do not see {At} we cannot learn Var(A1) and so there is no way of splitting the θs apart from Var(A1), so $\{\theta_s\}_{s=0}^{m}$ are not identified. This is because θ0 is not constrained to be Id in the MA(m) process. This point is an animating observation in much of the macroeconometric causal literature. Notice that θ0 should not be expected to be the identity matrix! This is not the textbook linear moving average seen in statistics, although our treatment of linear MA(q) processes in previous chapters has often allowed a flexible θ0.
Under m-th order causality, if the law of the causal system {Zt} is strictly stationary, then the lag-s average treatment effects do not depend upon t, and so write them as
$$\tau_s = E\left[Y_{t,s}(1) - Y_{t,s}(0)\right],$$
using the lag-s potential outcome notation. Due to the m-th order causality assumption, τs = 0 for s > m. Finally, the causal triple (the assignment, the predictors and the outcome) forms a sequence {At, Xt, Yt}t≥1 which is strictly stationary.
Much of our work will be based around the m-order causal system under stationarity.
Example 7.1.12. [Linear case] (Example 7.1.4 continued) Hasbrouck (1991a,b) defined the univariate outcome Yt as the observed mid-quote return after the t-th trade, (qt − qt−1)/qt−1, where qt is the mid-price at the t-th trade. Hasbrouck assumed returns were linear in the buy/sell indicator. Think of the structure as a linear m-order causal time series system, with p = 2, q = 1 and m = 1, to make keeping track easier. Then
$$Y_t = \theta_0 A_t + \theta_1 A_{t-1} + V_t,$$
implying
$$\tau_0 = Y_t\{A_{t-1}, (1, A_{t,2:p})\} - Y_t\{A_{t-1}, (0, A_{t,2:p})\} = \theta_0\begin{pmatrix}1\\ 0\\ \vdots\\ 0\end{pmatrix},$$
$$\tau_1 = Y_t\{(1, A_{t-1,2:p}), A_t\} - Y_t\{(0, A_{t-1,2:p}), A_t\} = \theta_1\begin{pmatrix}1\\ 0\\ \vdots\\ 0\end{pmatrix},$$
while
$$\beta_{Y_t\sim A_{t-1:t}} = \mathrm{Cov}(Y_t, A_{t-1:t})\mathrm{Var}(A_{t-1:t})^{-1} = \left\{\mathrm{Cov}((\theta_1, \theta_0)A_{t-1:t}, A_{t-1:t}) + \mathrm{Cov}(V_t, A_{t-1:t})\right\}\mathrm{Var}(A_{t-1:t})^{-1}$$
$$= (\theta_1, \theta_0) + \beta_{V_t\sim A_{t-1:t}},$$
where βVt∼At−1:t is Cov(Vt, At−1:t)Var(At−1:t)−1. This is not what we want from a causal perspective. Extra conditions are needed to encourage βVt∼At−1:t to be zero. This will be the topic of the next subsection.
Linearity makes things easier to think about. But it is not really the point.
Example 7.1.13. [Non-linear potential outcomes] The univariate m-th order causal system with strict stationarity means that
$$Y_t(a_{t-m:t}) = g(a_{t-m:t}, V_t),$$
where {Vt} is strictly stationary. Then $Y_{t,s}(a_{t-s,1}) := g(A_{t-m:t-s-1}, a_{t-s}, A_{t-s+1:t}, V_t)$, so Yt,s(1) − Yt,s(0) is a strictly stationary process (in the linear case it was non-stochastic!), while
$$\tau_s = E\left[Y_{t,s}(1) - Y_{t,s}(0)\right],$$
which is non-stochastic, due to the time-separability of the non-linear model. A special case of the time-separable structure is the ARCH(m) process (Engle (1982) and Bollerslev (1986)), when h(x) = x². The study of the impact of assignments in volatility models was initiated by Engle and Ng (1993) — although it was not really couched in a causal language and again, like in macroeconometrics, those authors assumed they only see the outcome time series.
Sequential randomization

Randomization in Fisher (1925)'s randomized control trials selects assignments randomly, independently of everything else in the world (e.g. by drawing assignments randomly on a computer) — and thus they are independent of the potential outcomes.

Definition 7.1.14. Assume an m-order causal time series system. Sequential randomization is where, for each t,
$$\{Y_t(a_{t-m:t})\}_{a_{t-m:t}\in\mathcal{A}^{m+1}} \perp\!\!\!\perp A_{t-m:t}.$$
The independence assumption is between the path of assignments At−m:t and the unobservable potential outcomes {Yt(at−m:t)}at−m:t∈Am+1. Validation of this assumption must come from outside the joint distribution of the A1:T, Y1:T, e.g. I conduct an experiment and draw the path At−m:t randomly on a computer as an autoregression. It is crucial to understand we do not need assignments to be i.i.d. to get sequential randomization.
Example 7.1.15. [Linear case] (Example 7.1.12 continued) In the linear m-order causal time series system
$$Y_t(a_{t-m:t}) = (\theta_m, \ldots, \theta_0)a_{t-m:t} + V_t, \qquad A_t = \sum_{j=1}^{p}\phi_j A_{t-j} + \sum_{j=1}^{q}\beta_j Y_{t-j} + \varepsilon_t, \qquad Y_t = Y_t(A_{t-m:t}),$$
then if
$$V_t\perp\!\!\!\perp A_{t-m:t}, \tag{7.1}$$
sequential randomization holds. In turn, (7.1) is implied by
$$\{V_t, \varepsilon_t\}\stackrel{ind}{\sim}\text{ over } t, \qquad V_t\perp\!\!\!\perp\varepsilon_t.$$
Both parts matter:
(a) dynamics: assignments can depend upon lagged outcomes, so lagged Vt can be inside the assignments, which could induce correlation between Vt and the assignments. The independence of {Vt, εt} over t removes this danger.
(b) contemporaneous: if Vt, εt are contemporaneously dependent, it would immediately induce dependence between At and Vt.
For linear models estimated using linear methods, the conditions can be expressed using second moments: {Vt, εt} are weak white noise and Cov(Vt, εt) = 0.
The core takeaway from Example 7.1.15 is that the sequential randomization restrictions are made on the
innovations in the model!
Lag-s randomization

Sometimes we will see a different set of assumptions, which are written in terms of At−s,1 and the lag-s potential outcomes. This is attractive as At−s,1 is the assignment we most centrally focus on in terms of causal inference and Yt,s(at−s,1) is the corresponding outcome. Hence it is tempting to work with lag-s sequential randomization.

Definition 7.1.16. Assume an m-order causal time series system. Lag-s randomization is where, for each t and s ≥ 0,
$$\{Y_{t,s}(a_{t-s,1})\}_{a_{t-s,1}\in\mathcal{A}_1}\perp\!\!\!\perp A_{t-s,1}.$$
Lag-s randomization is the focus of Angrist and Kuersteiner (2011) and Angrist, Jordà, and Kuersteiner (2018). It is pretty, but it sure is hard to think about if we want to validate it in practical models. This kind of assumption is at the heart of the assumptions made in linear causal models in macroeconometrics, discussed in Section 7.2.

Remark 23. It is difficult to see how lag-s randomization will hold without making the assignments {At} and system noise {Vt} obey something like:
(1) At−m:t are independent vectors through time;
(2) At−s ⊥⊥ Vt;
(3) and the element At−s,1 ⊥⊥ At−s,2:p.
This would imply that the outcomes {Yt} would be a moving average type process.

The core takeaway from Remark 23 is that the lag-s randomization restrictions are made on the assignments themselves!
Example 7.1.17 works through the linear model to make the lag-s randomization assumption concrete.

Example 7.1.17. [Linear case] Under an m-th order causal system with linear lag-s potential outcomes
$$Y_{t,s}(a_{t-s,1}) = \theta_s\begin{pmatrix}a_{t-s,1}\\ 0\\ \vdots\\ 0\end{pmatrix} + U_t, \qquad U_t = \sum_{j=0}^{s-1}\theta_j A_{t-j} + \theta_s\begin{pmatrix}0\\ A_{t-s,2}\\ \vdots\\ A_{t-s,p}\end{pmatrix} + \sum_{j=s+1}^{m}\theta_j A_{t-j} + V_t,$$
lag-s randomization is achieved, for all t, s, by assuming
$$U_t\perp\!\!\!\perp A_{t-s,1}.$$
That looks easy! But there is a lot in Ut. Sufficient conditions for this are Assumptions 1-3 of Remark 23. Add to these the conditions Vt = 0 for all t and {At} i.i.d.; then
$$Y_t = Y_{t,s}(A_{t-s}) = \sum_{j=0}^{m}\theta_j A_{t-j}, \qquad \mathrm{Var}(A_t) = D = \mathrm{diag}(\sigma_1^2, \ldots, \sigma_p^2),$$
assuming the variance exists. In the literature this is called a structural linear m-th order moving average (written SMA(m)) process.
The same Assumptions 1-3 of Remark 23 yield sequential randomization in the non-linear causal system in Example 7.1.13 — linearity is not the point. The latter point seems a tad underappreciated in the econometrics literature, but it is at the heart of Angrist and Kuersteiner (2011) and Angrist, Jordà, and Kuersteiner (2018).
Sequential unconfoundedness

Unconfoundedness is at the core of most cross-sectional observational studies. There we make assignments independent of potential outcomes conditional on the predictors. What is the time series version of that? The time series version of unconfoundedness is sequential unconfoundedness. Here the conditioning is on the filtration of the observations and the time t − s predictor Xt−s. This is an example of selection on observables.

Definition 7.1.18. Assume an order-m causal time series system and think of m ≥ s. Sequential unconfoundedness is where
$$\{Y_t(A_{t-m:t-s-1}, a_{t-s:t})\}_{a_{t-s:t}\in\mathcal{A}^{s+1}}\perp\!\!\!\perp A_{t-s:t}\,\big|\,\mathcal{F}^{Y,A,X}_{t-s-1}, X_{t-s},$$
where $\mathcal{F}^{Y,A,X}_{t-s-1}$ is the history of the assignments, predictors and outcomes (not potential outcomes) up to time t − s − 1.

The really handy aspect of sequential unconfoundedness is that it only refers to the assignments At−s:t, not all the way back to t − m.
Example 7.1.19. [Linear case] Return to the linear m-order causal time series system:
$$Y_t(a_{t-m:t}) = \sum_{j=0}^{m}\theta_j a_{t-j} + V_t, \qquad A_t = \sum_{j=1}^{p}\phi_j A_{t-j} + \sum_{j=1}^{q}\beta_j Y_{t-j} + \varepsilon_t, \qquad Y_t = Y_t(A_{t-m:t}).$$
When we condition on $\mathcal{F}^{Y,A}_{t-s-1}$, many terms above become non-stochastic. If
$$\left(V_t\perp\!\!\!\perp A_{t-s:t}\right)\big|\mathcal{F}^{Y,A}_{t-s-1}, \tag{7.3}$$
then sequential unconfoundedness holds and
$$\beta_{Y_t\sim A_{t-s:t}|\mathcal{F}^{Y,A}_{t-s-1}} = \mathrm{Cov}(Y_t, A_{t-s:t}|\mathcal{F}^{Y,A}_{t-s-1})\mathrm{Var}(A_{t-s:t}|\mathcal{F}^{Y,A}_{t-s-1})^{-1} = (\theta_s, \ldots, \theta_0).$$
In the linear model, the conditions can be expressed in terms of second conditional moments:
$$\mathrm{Cov}(V_t, \varepsilon_t|\mathcal{F}^{Y,A}_{t-1}) = 0, \quad\text{and}\quad \{V_t, \varepsilon_t\}\text{ is MD with respect to } \mathcal{F}^{Y,A}_t.$$
Lag-s unconfoundedness

The extension of lag-s randomization to lag-s unconfoundedness is important, again knocking out the assumptions about the more distant past.

Definition 7.1.20. Assume an m-order causal time series system. Lag-s unconfoundedness is where
$$\{Y_{t,s}(a_{t-s,1})\}_{a_{t-s,1}}\perp\!\!\!\perp A_{t-s,1}\,\big|\,\mathcal{F}^{Y,A,X}_{t-s-1}, X_{t-s}. \tag{7.4}$$
Even though At−m:t−s+1 no longer features, very strong assumptions are still needed to validate it — but now all stated conditioning on $\mathcal{F}^{Y,A,X}_{t-s-1}, X_{t-s}$. To see this, again note there is a lot inside Ut, so conditions are needed on
$$\left(V_t, A_{t-s,2:p}, A_{t-s+1:t}\right)\big|\mathcal{F}^{Y,A,X}_{t-s-1}, X_{t-s}.$$
Think about estimating the expected causal effect of moving from assignment path a′t−m:t to at−m:t under sequential randomization. Define
$$\mu(a_{t-m:t}) := E[Y_t|A_{t-m:t} = a_{t-m:t}] = E[Y_t(a_{t-m:t})];$$
then
$$E[Y_t(a_{t-m:t})] - E[Y_t(a'_{t-m:t})] = \mu(a_{t-m:t}) - \mu(a'_{t-m:t}).$$
This can be estimated using two non-parametric regressions. So we are back to standard time series. No new ideas are needed going forward.
Example 7.1.21. Assume the linear m-order causal time series system and sequential randomization. If the second moments exist, then
$$\mu(a_{t-m:t}) = \sum_{j=0}^{m}\theta_j a_{t-j} + E[V_t].$$
A difficulty here is that m could be quite large. Using sequential unconfoundedness would reduce this problem in practice.
If the focus is on the less ambitious lag-s average treatment effect τs, then
$$E[Y_{t,s}(a_{t-s})] = E[Y_t(A_{t-m:t-s-1}, a_{t-s}, A_{t-s+1:t})], \quad\text{recalling}\quad Y_{t,s}(a_{t-s}) = Y_t(A_{t-m:t-s-1}, a_{t-s}, A_{t-s+1:t}),$$
so
$$\tau_s = E[\mu(A_{t-m:t})|A_{t-s} = 1] - E[\mu(A_{t-m:t})|A_{t-s} = 0]. \tag{7.5}$$
From a statistical perspective, this is a much easier estimand: the At−m:t−s−1, At−s+1:t are averaged out, which reduces the dimension of the problem.

Example 7.1.22. Assume the linear m-order causal time series system and sequential randomization, with the second moments existing. The E[At−m:t|At−s = 1] is easy to estimate — element by element. If m is large, the regressions could potentially be estimated with some form of shrinkage, e.g. ridge regression or Lasso regression.
Under lag-s randomization and stationarity, (Yt, At−s,1) are a pair from the stationary distribution. This means the causal τs is just the difference of a couple of traditional time series estimands, expressed entirely in terms of observables, $\tau_s = E[Y_t|A_{t-s,1} = 1] - E[Y_t|A_{t-s,1} = 0]$, which can be estimated in various ways:
[non-parametrics] kernel regression of the outcome Yt on the assignment At−s,1, evaluated at 1 and 0, e.g.
$$\hat{\tau}_s = \frac{\sum_{t=s+1}^{T} Y_t 1(A_{t-s,1}\in(1-h, 1+h])}{\sum_{t=s+1}^{T} 1(A_{t-s,1}\in(1-h, 1+h])} - \frac{\sum_{t=s+1}^{T} Y_t 1(A_{t-s,1}\in(-h, h])}{\sum_{t=s+1}^{T} 1(A_{t-s,1}\in(-h, h])},$$
for a small bandwidth h > 0. If the assignment At−s,1 is binary, then the simpler version
$$\hat{\tau}_s = \frac{\sum_{t=s+1}^{T} Y_t 1(A_{t-s,1} = 1)}{\sum_{t=s+1}^{T} 1(A_{t-s,1} = 1)} - \frac{\sum_{t=s+1}^{T} Y_t 1(A_{t-s,1} = 0)}{\sum_{t=s+1}^{T} 1(A_{t-s,1} = 0)}$$
can be used.
[linear projection] instead of estimating τs using conditional expectations, approximate µs(a) = E[Yt|At−s,1 = a] by the linear projection
$$\mu_s^L(a) := E[Y_t] + (a - E[A_{t-s,1}])\frac{\mathrm{Cov}(Y_t, A_{t-s,1})}{\mathrm{Var}(A_{t-s,1})},$$
then
$$\tau_s^L := \mu_s^L(1) - \mu_s^L(0) = \frac{\mathrm{Cov}(Y_t, A_{t-s,1})}{\mathrm{Var}(A_{t-s,1})}.$$
Notice τsL is not directly causal. The τsL can be estimated by a least squares regression of {Yt} on {At−s,1}:
$$\hat{\tau}_s^L = \frac{\sum_{t=s+1}^{T}\left(Y_t - \bar{Y}\right)\left(A_{t-s,1} - \bar{A}\right)}{\sum_{t=s+1}^{T}\left(A_{t-s,1} - \bar{A}\right)^2}.$$
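A small R simulation, with binary i.i.d. assignments and illustrative θ values, checking that the difference-of-means estimator (which here coincides with the least squares slope, the regressor being binary) recovers τs = θs in the linear system under randomization:

set.seed(7)
T <- 5000; m <- 2; theta <- c(1.0, 0.5, 0.25)             # theta_0, theta_1, theta_2
A <- rbinom(T, 1, 0.5)                                     # randomized binary assignments
Y <- as.numeric(stats::filter(A, theta, sides = 1)) + rnorm(T)  # Y_t = sum theta_j A_{t-j} + V_t
tau.hat <- sapply(0:m, function(s) {
  idx <- (m + s + 1):T                                     # drop start-up observations
  mean(Y[idx][A[idx - s] == 1]) - mean(Y[idx][A[idx - s] == 0])
})
round(tau.hat, 2)                                          # approximately (1.0, 0.5, 0.25)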
Remark 24. Regressing outcomes on lagged assignments appears in the work of Jorda (2005) under the title "local projection." Notice it is not really local in any statistical sense (we have assumed stationarity); it is simply a linear projection, used twice. This local projection approach has been very influential in modern research. Recent papers on this include Olea and Plagborg-Moller (2021) and Plagborg-Moller and Wolf (2021).
The above goes from assignments to outcomes. The basic conception is you randomize, see the assignments and trace out the resulting outcomes. In the linear model with lag-s randomization this yields
$$Y_t = \sum_{j=0}^{m}\theta_j A_{t-j} + V_t,$$
with Vt ⊥⊥ At−s,1, At−s,2:p ⊥⊥ At−s,1 and {At} being independent through time.

If we would like to infer not just the causal impacts of At−s,1 but of all of At−s, in turn, then this needs to strengthen to Vt ⊥⊥ At−s and the elements of At−s being mutually independent. Then
$$\beta_{Y_t\sim A_{t-s}} = \theta_s,$$
the matrix of lag-s average treatment effects.

7.2 Structural models: SVAR and SVMA

This section will have much of this spirit — the same model will pop up, but with Vt being set to 0. But the practice will be really different. There will be models, and things called structural shocks will appear which look like assignments. This is mere labelling. The big idea will be that the time series properties of the {Yt}
alone plus a causal story are enough to yield estimators of causal quantities. There is no need to see the
assignments. This is a vast and powerful step. It is also somewhat fundamentally fragile.
In the treatment below, I will cut everything back to its very simplest case, focusing on a single lag. You
have enough knowledge of time series that you know how to extend things to longer lags and really they raise
no new intellectual issues.
The core models in this literature do not use the language of potential outcomes, so I will not force it upon the
models either. But I will again use the notation {At } to drive everything: now these former assignments will
be called structural shocks!
Definition 7.2.1. A d-dimensional structural first-order vector autoregressive process (SVAR(1); we follow the old style convention of this literature in keeping the vector label inside the names of processes) is where
$$BY_t = \Gamma_1 Y_{t-1} + A_t, \qquad A_t\stackrel{iid}{\sim}\text{ over } t,$$
where B is a square matrix and the process {At} are called structural shocks, obeying the constraint that the At,j are independent over j and possess a variance,
$$\mathrm{Var}(A_1) = D,$$
a diagonal matrix. Typically B is constrained to have 1s on the leading diagonal and is assumed to be invertible.
Remark 25. Making B have unit leading diagonal elements, e.g. when d = 2,
$$B = \begin{pmatrix}1 & b_{1,2}\\ b_{2,1} & 1\end{pmatrix},$$
just constrains the scaling of the model, so can be done without any loss of flexibility and makes it easier to interpret. Constraining Var(A1) to be diagonal makes sense when trying to think about the shocks as assignment type objects. We saw this type of assumption frequently above — it appeared due to lag-s randomization and the linearity assumption.
The SVAR comes out of one of the main econometric traditions: simultaneous equation models (SEMs),
which were significantly developed in the late 1940s and 1950s, e.g. Haavelmo (1943) and Koopmans (1950).
The pure simultaneous equation case is where Γ1 = 0.
Rewrite the SVAR, assuming the matrix B is invertible, into the more conventional VAR(1),
$$Y_t = \phi_1 Y_{t-1} + \varepsilon_t, \quad\text{where}\quad \phi_1 = B^{-1}\Gamma_1, \quad \varepsilon_t = B^{-1}A_t, \quad \Sigma = \mathrm{Var}(\varepsilon_1) = B^{-1}D\left(B^{-1}\right)^T.$$
This parameterization is often called the "reduced form", again mimicking the nomenclature of the earlier SEM literature. If the eigenvalues of ϕ1 are all inside the unit circle, then the VAR(1) can, of course, be written as a VMA(∞),
$$Y_t = \sum_{j=0}^{\infty}\psi_j\varepsilon_{t-j}, \qquad \psi_j = \left(B^{-1}\Gamma_1\right)^j; \quad j = 0, 1, 2, \ldots,$$
so
$$Y_t = \sum_{j=0}^{\infty}\theta_j A_{t-j}, \qquad \theta_j = \psi_j B^{-1} = \left(B^{-1}\Gamma_1\right)^j B^{-1}; \quad j = 0, 1, 2, \ldots$$
Crucially recall Var(A1) = D and now there is no reason to expect θ0 to be Id. This structure is called a structural moving average (SVMA(∞)). We have already studied its probabilistic properties at enormous length, it is just an infinite order moving average process. It raises no new issues.
The impulse response function is
$$\frac{\partial Y_t}{\partial A_{t-s}^T} = \theta_s = \psi_s B^{-1} = \phi_1^s B^{-1}.$$
The θs appears from the SVMA. It is a causal quantity, due to the assumptions in the SVAR. Of course the SVMA is not easy to estimate from the path of {Yt} alone, as θ0 is not the identity matrix. The ψs appears in the VMA, so is estimable, and likewise the ϕ1 from the VAR. But there is also the B−1 matrix here.
Let us take stock. We have defined 4 models, simply by reparameterizing the original SVAR! It is easy to get lost in the blizzard of models, so the summary in Table 7.1 helps (at least me).

An enormous amount of attention is paid in this literature to the difficulty of learning B, Γ1 and D from an infinite amount of data: what is often called identification in statistics. That this is tricky can be seen from a simple example.
Example 7.2.2. Think about the d = 2 case. Then B has 2 free elements (ones on the leading diagonal), Γ1 has 4 and D has 2 (diagonal matrix). Thus the SVAR(1) has 8 parameters. The VAR(1) only has 7: four from ϕ1 and three from Var(ε1) (due to the symmetry of the matrix). All the VAR(1) parameters can be estimated from the data. But that is not enough to pin down all aspects of the SVAR(1). 8 into 7 does not go, so the SVAR(1) is not identified — that is, with an infinite amount of data ϕ1 and Var(ε1) can be learnt, but that is not enough to determine all of B, Γ1 and D.
The general issue is outlined in Table 7.2, which counts parameters for the (unconstrained) SVAR(1) and
VAR(1):
Economists have labored on imposing sensible constraints on B, Γ1 and D to remove this gap.
There are a variety of strategies to enforce identification with these kinds of models.
Strategy 1. In the most influential work in this area, Sims (1980) suggested constraining B to be upper triangular, which knocks out enough parameters to yield identification. In the d = 2 case this means that
$$B = \begin{pmatrix}1 & b_{1,2}\\ 0 & 1\end{pmatrix}, \qquad B^{-1} = \begin{pmatrix}1 & -b_{1,2}\\ 0 & 1\end{pmatrix}, \qquad B^{-1}\Gamma_1 = \begin{pmatrix}\gamma_{1,1} - b_{1,2}\gamma_{2,1} & \gamma_{1,2} - b_{1,2}\gamma_{2,2}\\ \gamma_{2,1} & \gamma_{2,2}\end{pmatrix}.$$
Table 7.2 deals with the general case. Unfortunately this messes up some issues, e.g. it means that causal inference is sensitive to the ordering of the system. If there is a compelling economic reason for the restrictions imposed by the upper triangular B this is fine. Otherwise this is quackery. Once B is pinned down, Γ1 and D follow from the VAR parameters. (A numerical sketch of this strategy is given after the list of strategies below.)
Strategy 2. Suppose the variance of the shocks changes over time. The simplest version is
$$\mathrm{Var}(A_t) = \begin{cases}D & t\le T/2,\\ D^* & t > T/2;\end{cases}$$
then the SVAR has 10 parameters, gaining 2 parameters from D*. But now the VAR(1) also has a switch in its variances, so the VAR(1) grows by 3 parameters to 10. Magic, the problem is solved. Of course more sophisticated volatility models are possible, e.g. using SV models for the shocks. This again leads to identification and better fitting models. Some econometricians feel uncomfortable with this strategy, as they regard only information conveyed by covariances as solid. It is unclear to me if this is sound, or simply stuck in the past, rejecting empirical progress.
Strategy 3. For a Bayesian the lack of identification is not a dead stop, as a prior can be placed over B, Γ1 and D, and Bayes theorem still provides valid probabilistic calculations. The potential difficulty here is that the scientific conclusion may be sensitive to the prior specification, as there are aspects of the likelihood which will be flat due to the lack of identification, even with quite large samples. This is a milder version of the a priori imposition of B being upper triangular. An alternative to Bayesian inference in this context is the use of regularization on the parameters, which should limit the damage caused by the lack of identification.
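A numerical sketch of Strategy 1 in R, with all parameter values illustrative: when B is upper triangular with a unit diagonal, b1,2 is recoverable from the reduced form $\Sigma = \mathrm{Var}(\varepsilon_1) = B^{-1}D(B^{-1})^T$ by choosing $\hat{B}$ to diagonalize $\hat{\Sigma}$:

set.seed(8)
B <- matrix(c(1, 0, 0.5, 1), 2, 2)            # upper triangular, unit diagonal
G1 <- matrix(c(0.5, 0.1, 0.0, 0.4), 2, 2)     # Gamma_1
Binv <- solve(B)
T <- 1e5; Y <- matrix(0, T, 2)
A <- cbind(rnorm(T, sd = 1), rnorm(T, sd = sqrt(2)))   # structural shocks, D = diag(1, 2)
for (t in 2:T) Y[t, ] <- Binv %*% (G1 %*% Y[t - 1, ] + A[t, ])
fit <- lm(Y[2:T, ] ~ Y[1:(T - 1), ] - 1)      # reduced form VAR(1) by least squares
Sigma <- crossprod(residuals(fit)) / (T - 1)
b12 <- -Sigma[1, 2] / Sigma[2, 2]             # makes (B Sigma B')_{1,2} = 0
Bhat <- matrix(c(1, 0, b12, 1), 2, 2)
b12                                           # close to the true 0.5
Bhat %*% Sigma %*% t(Bhat)                    # approximately diag(1, 2) = D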
So far we started with a SVAR(1) and rewrote it as a VMA(∞) and a SVMA(∞). In this analysis the causal structure and the assignments have been expressed using independence assumptions, e.g. sequential randomization. This yielded a mathematically clear causal meaning to the impulse response functions.

A different approach is to start by assuming that {Yt} is covariance stationary and then appeal to the Wold decomposition,
$$Y_t = \sum_{j=0}^{\infty}\theta_j U_{t-j} + V_t,$$
where {Ut} are zero mean white noise and {Vt} is a non-stochastic sequence. Here the Ut = Yt − P(Yt|Y−∞:t−1). So far, so good.
The next step is to assert that the VMA(∞) structure is causal, i.e. is a SVMA(∞) structure. This is sometimes called the Frisch (1933)-Slutsky (1927) paradigm. Although this is the conventional viewpoint of modern macroeconometric causality, I find this difficult to follow, as Ut is physically built from the path of Y−∞:t. I literally have no way of moving Ut, as it follows from Yt.
A different viewpoint of the Frisch (1933)-Slutsky (1927) paradigm is to say: I will build a causal model
$$Y_t = \sum_{j=0}^{\infty}\theta_j U_{t-j},$$
where {Ut } are assignments or shocks. The Wold decomposition shows this is a superflexible approach, for
covariance stationary sequences. In the statistical analysis the {Ut } are only assumed to be a zero mean white
noise process. I find this more appealing — but it is a matter of taste for how science is carried out.
7.3 Markov decision process

Suppose a unit or an agent (e.g. an individual, a company or a society) has a time-t manipulable action variable which can be used to impact other future variables for the unit's benefit. This is also causality! A variable is moved, other variables respond — here by design.

This area of study is often called control theory or, more recently, reinforcement learning. It appears all over engineering, economics and machine learning.

The main structure used to study this problem is a Markov decision process (MDP). Like the HMM, the MDP is stated rather abstractly. Instead of having a measurement density and a transition density, it has a utility function, a policy and an environment. The environment will be quite similar to the transition density, but where the action can impact the states too.
To start we will go through a preamble, which I find helpful! All expositions I have seen skip this step, regarding it as all obvious. But I am slow, I need to spell it out. It also helps me relate everything back to the way we discussed causality before.

Let At be called the time-t action or control, St the potential states and Ut the utility (utility is usually the label used in economics, minus loss in machine learning and reward in reinforcement learning — it is all the same intellectually). Collect them initially (in a moment a simpler, stripped down version will be given) into
$$Z'_t = \begin{pmatrix}U_t\\ \{S_t(a_{1:t-1})\}_{a_{1:t-1}\in\mathcal{A}^{t-1}}\\ A_t\end{pmatrix}, \qquad t = 1, \ldots, T.$$
The state St = St(A1:t−1) is causally moved by past actions only. Assume the utility obeys a non-interference type restriction,
$$U_t(\{a_{1:t-2}, a_{t-1}\}, \{s_{1:t-2}, s_{t-1:t}\}) = U_t(\{a'_{1:t-2}, a_{t-1}\}, \{s'_{1:t-2}, s_{t-1:t}\}), \quad\text{for all}\quad a_{1:t-2}, a'_{1:t-2}, s_{1:t-2}, s'_{1:t-2},$$
so the time-t utility depends only on the most recent action and states. In what is below, I will write the single path of potential states from time t to time T, as they are selected by a single path of actions, as
$$S_{t:T}(a_{t-1:T-1}).$$

Remark 7.3.1. St:T(at−1:T−1) is a powerful, but possibly confusing, notation as it looks like all the states St:T can depend upon all the actions at−1:T−1, but that is not the case.
Then define
$$Z_t = \begin{pmatrix}U_t\\ S_t\\ A_t\end{pmatrix}$$
and assume this follows a Markovian law (the causal understanding comes from the preamble). Now work with the decomposition
$$P(\{S_t(a_{t-1})\}|Z_{t-1}) = P(\{S_t(a_{t-1})\}|A_{t-1}, S_{t-1}), \quad\text{called the environment,}$$
$$P(A_t|Z_{t-1}, \{S_t(a_{t-1})\}) = P(A_t|S_t), \quad\text{called the policy (controlled by the agent), recalling } S_t = S_t(A_{t-1}).$$
This structure has a lot of similarities with the transition equation of a HMM (where st−1 is the time-t state), but now there is this extra action element. If there were no action, the environment would be the transition equation of the HMM. The non-interference assumption makes the time-t utility a little like a potential outcome, with the action being the assignment, but here it is lagged.

The policy is the law of the action given the state, At|St. Thinking of St = s, the support of possible actions is sometimes written as A(s).
One of the reasons the MDP is mathematically tractable is the time separability of the utility functions, which means that the discounted time-t utility function splits into
$$U_t^* = U_t + \delta U_{t+1}^*,$$
the current utility plus δ times the discounted time-(t + 1) utility function. This equation looks a bit weird: it works backwards in time, relating terms at time t to things at time t + 1. But we have seen equations which go backwards in time, e.g. smoothing in HMMs.
Remark 26. In many setups the policy, the environment and/or the utility will not be stochastic. In the extreme case where nothing is random and the system is time-invariant, then zt = (ut, st, at), with the maps linking zt−1 to zt being deterministic.
Example 7.3.3. [LQG control] The Linear-Quadratic-Gaussian (LQG) control problem sets the time-t potential states as
$$S_t(a_{t-1}) = TS_{t-1} + Ra_{t-1} + \eta_t, \qquad \eta_t \text{ Gaussian, with } E[\eta_t|S_{t-1}] = 0.$$
This structure means that the agent selects the action, which then pushes the state through to the next period. Instead of having a measurement density there is a utility function,
$$U_t(a_{t-1}, s_{t-1:t}) = -s_{t-1}^T H s_{t-1} - a_{t-1}^T Q a_{t-1}.$$
So utility likes the actions at−1 and states st−1 to be small. Hansen and Sargent (2014) mostly focus on using this case for different problems in macroeconomics. When there is no ηt, this setup is called the Linear-Quadratic (LQ) control problem.
7.4 Stochastic dynamic program

One of the most influential ideas in the second half of the 20th century in engineering and dynamic economics is the value function of the stochastic dynamic program,
$$v_t(s) := \max_{a_{t-1:T-1}} E\left[U_t^*(a_{t-1:T-1}, S_{t:T}(a_{t-1:T-1}))\,|\,S_{t-1} = s\right],$$
the highest achievable expected discounted utility given we are in state St−1 = s.

The time separability of the discounted time-t utility function means the stochastic dynamic program's value function also splits:
$$v_t(s) = \max_{a_{t-1}}\left\{E[U_t(a_{t-1}, s, S_t(a_{t-1}))|S_{t-1} = s] + \delta E\left[v_{t+1}(S_t(a_{t-1}))|S_{t-1} = s\right]\right\}.$$
This is an example of a Bellman equation, relating the value function at time t to the value function at time t + 1.
Proof. The first term is obvious; the second is notation heavy but quite simple:
$$\max_{a_{t-1:T-1}} E[U_{t+1}^*(a_{t:T-1}, S_{t:T}(a_{t-1:T-1}))|S_{t-1} = s] = \max_{a_{t-1}}\left\{\max_{a_{t:T-1}} E[U_{t+1}^*(a_{t:T-1}, S_{t:T}(a_{t-1:T-1}))|S_{t-1} = s]\right\},$$
while
$$E[U_{t+1}^*(a_{t:T-1}, S_{t:T}(a_{t-1:T-1}))|S_{t-1} = s] = E\left[E[U_{t+1}^*(a_{t:T-1}, S_{t:T}(a_{t-1:T-1}))|S_t(a_{t-1})]\,\big|\,S_{t-1} = s\right], \quad\text{Adam's law,}$$
$$= E\left[E[U_{t+1}^*(a_{t:T-1}, S_t(a_{t-1}), S_{t+1:T}(a_{t:T-1}))|S_t(a_{t-1})]\,\big|\,S_{t-1} = s\right].$$
Thus
$$\max_{a_{t-1:T-1}} E[U_{t+1}^*(\cdot)|S_{t-1} = s] = \max_{a_{t-1}} E\left[\max_{a_{t:T-1}} E[U_{t+1}^*(a_{t:T-1}, S_t(a_{t-1}), S_{t+1:T}(a_{t:T-1}))|S_t(a_{t-1})]\,\Big|\,S_{t-1} = s\right] = \max_{a_{t-1}} E\left[v_{t+1}(S_t(a_{t-1}))|S_{t-1} = s\right].$$
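A minimal R sketch of backward induction on a tiny finite MDP, with states, actions, utilities and transition probabilities all illustrative; it directly implements the Bellman equation, working backwards from the terminal period:

S <- 2; A <- 2; Tn <- 25; delta <- 0.9
U <- matrix(c(1, 0, 0, 2), S, A)               # U[s, a]: expected current utility
P <- array(c(0.8, 0.3, 0.2, 0.7,               # P[s, s', a]: the environment
             0.5, 0.6, 0.5, 0.4), c(S, S, A))
v <- numeric(S)                                 # terminal value v_{T+1} = 0
for (t in Tn:1) {
  q <- sapply(1:A, function(a) U[, a] + delta * P[, , a] %*% v)
  v <- apply(q, 1, max)                         # Bellman maximization over actions
}
v                                               # value function v_1(s)
apply(q, 1, which.max)                          # greedy (optimal) policy at t = 1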
It takes some time to get used to these value functions. Thinking about a special case helps.

Example 7.4.3. Recall from Example 7.3.3 the Linear-Quadratic-Gaussian (LQG) control problem. Then
$$v_t(s) = -s^T H s + \max_{a_{t-1}}\left\{-a_{t-1}^T Q a_{t-1} + \delta E\left[v_{t+1}(Ts + Ra_{t-1} + \eta_t)|S_{t-1} = s\right]\right\}.$$
How to progress?
Theorem 7.4.4. The stochastic dynamic program in the case of a LQG program sets
$$\hat{a}_{t-1} = -K_{t+1}Ts, \quad\text{where}\quad K_{t+1} = \left(Q + \delta R^T P_{t+1}R\right)^{-1}R^T P_{t+1},$$
$$P_t = H + T^T K_{t+1}^T Q K_{t+1}T + \delta T^T(I - K_{t+1})^T P_{t+1}(I - K_{t+1})T.$$
Proof. Guess that
$$v_t(s) = -\left(s^T P_t s + \rho_t\right) \tag{7.6}$$
holds for every t. If this is true then (removing all the negative signs), using that form at time t and t + 1, the equality
$$s^T P_t s + \rho_t = s^T H s + \min_{a_{t-1}}\left\{a_{t-1}^T Q a_{t-1} + \delta E\left[(Ts + Ra_{t-1} + \eta_t)^T P_{t+1}(Ts + Ra_{t-1} + \eta_t)|S_{t-1} = s\right]\right\} + \delta\rho_{t+1} \tag{7.7}$$
holds. Find the ât−1 which does the minimization by solving the first order condition, assuming E[ηt|St−1 = s] = 0, so
$$\hat{a}_{t-1} = -K_{t+1}Ts, \qquad K_{t+1} = \left(Q + \delta R^T P_{t+1}R\right)^{-1}R^T P_{t+1}.$$
Now let us get back to ρt, Pt: can they be found to verify (7.6)? Noting $Ts + R\hat{a}_{t-1} = (I - K_{t+1})Ts$, plug in:
$$s^T P_t s + \rho_t = s^T H s + s^T T^T K_{t+1}^T Q K_{t+1}Ts + \delta s^T T^T(I - K_{t+1})^T P_{t+1}(I - K_{t+1})Ts + \delta E[\eta_t^T P_{t+1}\eta_t|S_{t-1} = s] + \delta\rho_{t+1}$$
$$= s^T P_t s + \delta\left\{E[\eta_t^T P_{t+1}\eta_t|S_{t-1} = s] + \rho_{t+1}\right\},$$
where
$$P_t = H + T^T K_{t+1}^T Q K_{t+1}T + \delta T^T(I - K_{t+1})^T P_{t+1}(I - K_{t+1})T.$$
The ât−1 only relies on the assumptions that E[ηt|St−1 = s] = 0 and that E[ηt^T Pt+1ηt|St−1 = s] does not vary with s. The solved Pt does not depend upon the properties of ηt.
Example 7.4.5. Consider an environment where the state is run down by the actions,
$$S_t = S_{t-1} - a_{t-1}, \qquad S_t \ge 0.$$
Chapter 8

Stochastic Integration and Time Series
Stochastic integration plays a large role in modern science, engineering, economics and statistics. There are quite a few beautiful books on this subject. Mikosch (1998) is an elegant introduction to stochastic calculus by a very strong mathematician. His Chapter 2 on integration is extremely clear. He assumes only very basic probability. Steele (2001) mixes finance with probability to deliver an elegant book. Protter (2010) is my favorite book on stochastic calculus. It is very elegant, with enormous economy of effort. Duffie (2001), Shreve (2004) and Karatzas and Shreve (1998) are classic references on mathematical finance expressed in continuous time.

In this chapter we will focus on stochastic integrals which are driven by Brownian motion. We have already seen objects a bit like Brownian motion: random walks,
$$Y_t = Y_{t-1} + \varepsilon_t, \qquad \{\varepsilon_t\} \text{ i.i.d.}$$
Here we discuss taking this random walk process to times which can be recorded continuously, $\{Y(t)\}_{t\ge 0}$, where the increments are Gaussian. This is useful as it expands the kinds of processes we can use to build new models (e.g. in financial econometrics), allowing some analysis of problems where the data is not equally spaced through time; some of these objects also appear in important asymptotic arguments (e.g. functional central limit theorems).
8.1 Background
8.1.1 A càdlàg function
A càdlàg function is a function which is right continuous with left limits. It is familiar in introductory statistics
from, for example, the distribution function of a binary random variable — shown in Figure 8.1.
Figure 8.1: Example of càdlàg function. Distribution function of a binary random variable with P (Y = 0) = 0.3
and P (Y = 1) = 0.7. Here we compute the 0.6-quantile, F −1 (0.6) = Q(0.6), which is 1.
A càdlàg function can jump at a point, the size of the jump being the difference between the right and left limits, respectively. I will use C(0, 1) to denote the space (i.e. informally the collection) of continuous functions on the unit interval, while D(0, 1) denotes the corresponding space of càdlàg functions (it is called the Skorokhod space in the literature).
It is helpful to take a step back and remind ourselves about what we usually mean by integration. The usual
integrals we see in statistics are typically Riemann integrals or its extension Riemann–Stieltjes integrals. In
our discussion we will follow some of the lines setout by Mikosch (1998).
To do this, start with p-variation. This will be based on partitions of the interval [0, 1] governed by the end points
\[ \tau_n : \quad 0 = t_0 < t_1 < \cdots < t_n = 1, \]
with mesh $\|\tau_n\| = \max_{1 \le j \le n} (t_j - t_{j-1})$, the length of the longest subinterval — this is also the $L^\infty$ norm of the time-gaps.
Definition 8.1.1. [Finite p-variation] A real function g on [0, 1] is said to be of finite p-variation (also called bounded p-variation) for some p ≥ 1 if
\[ \sup_n \sup_{\tau_n} \sum_{j=1}^n |g(t_j) - g(t_{j-1})|^p < \infty, \]
where the inner supremum is taken over all partitions $\tau_n$ of [0, 1].
This is a pretty abstract quantity; how does it vary with p? The following shows that, for any ∞ > q ≥ 0, if g is of finite p-variation then g is of finite (p + q)-variation.
Why? For any $\{a_j > 0\}_{j=1}^n$ and $1 \le p < \infty$, let the p-norm of a (also called the $L^p$ norm of a) be
\[ \|a\|_p := \left( \sum_{j=1}^n a_j^p \right)^{1/p}. \]
A standard norm inequality says that
\[ \|a\|_{p+q} \le \|a\|_p. \]
Taking $a_j = |g(t_j) - g(t_{j-1})|$, this delivers $\sum_{j=1}^n |g(t_j) - g(t_{j-1})|^{p+q} \le \{ \sum_{j=1}^n |g(t_j) - g(t_{j-1})|^p \}^{(p+q)/p}$, which is bounded uniformly over partitions whenever g is of finite p-variation.
By far the most famous case of finite p-variation is when p = 1; then it is called finite variation. In that case the term
\[ TV(g, \tau_n) = \sum_{j=1}^n |g(t_j) - g(t_{j-1})| \]
is called the total variation of g over the partition $\tau_n$.
All non-decreasing functions are of finite variation: by telescoping, $TV(g, \tau_n) = g(1) - g(0)$, for any choice of $\tau_n$. For example, for a distribution function F with F(0) = 0 and F(1) = 1,
\[ TV(F, \tau_n) = \sum_{j=1}^n \left\{ F(t_j) - F(t_{j-1}) \right\} = F(1) - F(0) = 1, \]
whatever the choice of $\tau_n$. A Poisson process with intensity ψ, written $\{Y(t)\}_{t \ge 0}$, is non-decreasing, so $TV(Y, \tau_n) = Y(1) - Y(0) = Y(1) \sim Poisson(\psi)$. Hence the probability that $TV(Y, \tau_n)$ is less than c can be made arbitrarily close to one by allowing c to be large. Thus Poisson processes are of finite variation with probability one.
Example 8.1.3. Assume g is continuously differentiable with bounded derivative g′. Then by the mean value theorem, for t > s,
\[ |g(t) - g(s)| \le (t - s) \sup_{u \in [0,1]} |g'(u)|, \]
so
\[ TV(g, \tau_n) \le \sup_{u \in [0,1]} |g'(u)| \times \sum_{j=1}^n (t_j - t_{j-1}) = \sup_{u \in [0,1]} |g'(u)| < \infty. \]
Now think about the sum of squared increments,
\[ QV(g, \tau_n) = \sum_{j=1}^n \{g(t_j) - g(t_{j-1})\}^2, \]
as $\|\tau_n\| \to 0$. The quadratic variation of the function, written [g, g](1), is the mse limit of $QV(g, \tau_n)$, that is
\[ \lim_{\|\tau_n\| \to 0} E[\{QV(g, \tau_n) - [g, g](1)\}^2] = 0. \]
For the g of Example 8.1.3, $QV(g, \tau_n) \le \{\sup_u |g'(u)|\}^2 \sum_{j=1}^n (t_j - t_{j-1})^2 \le \{\sup_u |g'(u)|\}^2 \|\tau_n\| \to 0$, so
\[ [g, g](1) = 0. \]
Quadratic variation is not 2-variation, as quadratic variation looks at the limit as $\|\tau_n\| \to 0$, while 2-variation looks at the sup over all partitions. These things are different.
Think of the sum
\[ S(f, g, \tau_n) = \sum_{j=1}^n f(t_{j-1}) \{g(t_j) - g(t_{j-1})\}, \]
built from the functions $\{f(t)\}_{t \in [0,1]}$ and $\{g(t)\}_{t \in [0,1]}$. Then it turns out this has a limit, called the Riemann–Stieltjes integral and written
\[ \int_0^1 f(u) dg(u), \]
as $\|\tau_n\| \to 0$, provided:
• f and g have no common points of discontinuity;
• f has finite p-variation and g has finite q-variation, for p, q > 0 such that
\[ \frac{1}{p} + \frac{1}{q} > 1. \tag{8.1} \]
In the integral, f is called the integrand, and g is labelled the integrator. In the special case that $g(t_j) - g(t_{j-1}) = t_j - t_{j-1}$, this is called a Riemann integral.
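As a quick numerical illustration, here is a small Python sketch of the sum $S(f, g, \tau_n)$ settling down as the partition refines; the choices f(u) = cos(u) and g(u) = u² are assumptions made purely for the demonstration.

import numpy as np

def rs_sum(f, g, n):
    # Left-point Riemann-Stieltjes sum of f against g on [0,1], n equal subintervals.
    t = np.linspace(0.0, 1.0, n + 1)
    return np.sum(f(t[:-1]) * np.diff(g(t)))

f = np.cos                      # smooth integrand (illustrative choice)
g = lambda u: u ** 2            # smooth integrator of finite variation

for n in (10, 100, 1000, 10000):
    print(n, rs_sum(f, g, n))
# The sums converge to the classical answer: int_0^1 cos(u) d(u^2) = int_0^1 2u cos(u) du.
print("limit:", 2 * (np.cos(1.0) + np.sin(1.0) - 1.0))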
This result is due to Young (1936). Some textbooks make out that the Riemann–Stieltjes integral needs g
to be of finite variation, but that is not true. Mikosch (1998) has a very accessible and precise discussion of the
issues. See also Dudley and Norvaiša (1999). The point about (8.1) is rather important as Brownian motion
will turn out not to have finite variation, but it does have finite 2-variation, so-called quadratic variation.
Standard references for the basic theory of Riemann–Stieltjes integrals are Apostol (1957) and Widder
(1946).
Example 8.1.4. In statistics you often see Riemann–Stieltjes integrals in the context of moments, e.g. for a random variable Y living on [0, 1],
\[ E[Y] = \int_0^1 y dF(y); \]
for the binary variable of Figure 8.1 this delivers $0 \times 0.3 + 1 \times 0.7 = 0.7$.
When $Y(t) = \int_0^t f(u) dg(u)$ exists, it is convenient to use the shorthand differential notation
\[ dY(t) = f(t) dg(t). \]
This is super powerful, as it is the basis of Newton's chain rule, applied to the transformation
\[ A(t) = A\{Y(t), t\}, \]
where the function A is assumed to be continuously differentiable in Y and t (written as $A \in C^{1,1}$). The chain rule says that
\begin{align*}
dA(t) &= \frac{\partial A\{Y(t), t\}}{\partial t} dt + \frac{\partial A\{Y(t), t\}}{\partial Y(t)} dY(t) \\
&= A_t(t) dt + A_Y(t) dY(t), \quad \text{where } A_Y(t) := \frac{\partial A\{Y(t), t\}}{\partial Y(t)}, \quad A_t(t) := \frac{\partial A\{Y(t), t\}}{\partial t}, \\
&= A_t(t) dt + A_Y(t) f(t) dg(t),
\end{align*}
so
\[ A(t) = A(0) + \int_0^t A_t(u) du + \int_0^t A_Y(u) f(u) dg(u). \]
Example 8.1.5. If $g(t) = \mu t$, then $dY(t) = f(t) \mu dt$, so $\{A(t)\}_{t \in [0,T]}$ is the solution to the differential equation
\[ dA(t) = \{A_t(t) + A_Y(t) f(t) \mu\} dt. \]
8.2 Brownian motion
A Brownian motion is the Gaussian special case of a Lévy process, a Gaussian continuous time random walk.
[Figure 8.2 about here. LHS: sample path of a Poisson process with intensity ψ = 1. Middle: sample path of Brownian motion with µ = 0 and σ = 1. RHS: sample path of a gamma process.]
Definition 8.2.1. The process $\{B(t)\}_{t \ge 0}$ is standard Brownian motion iff it has
(i) càdlàg sample paths;
(ii) independent increments;
(iii) Gaussian increments, $B(t) - B(s) \sim N(0, t - s)$ for $t > s \ge 0$, with $B(0) = 0$.
The process $\{Y(t)\}_{t \ge 0}$ where $Y(t) = \mu t + \sigma B(t)$ is called Brownian motion, where µ is drift and σ is volatility. Over an interval of length $t - s$ the increment $Y(t) - Y(s) \sim N(\mu(t - s), \sigma^2(t - s))$; as $t - s$ falls, the standard deviation $\sigma\sqrt{t - s}$ shrinks much more slowly than the mean $\mu(t - s)$, so the randomness in the increment $Y(t) - Y(s)$ will dominate the drift.
As an example, consider the Riemann integral of Brownian motion through time,
\[ Y(t) = \int_0^t B(u) du. \]
Then, for t > s,
\[ E[Y(t) \mid \mathcal{F}_s^Y] = Y(s) + (t - s) B(s), \quad Y(t) \sim N\left(0, \frac{t^3}{3}\right), \quad Cov(Y(t), Y(s)) = s^3/3 + (t - s)s^2/2. \]
To see the first result, write $Y(t) = Y(s) + \int_s^t B(u) du$. But $\{B(u) - B(s)\}_{u \ge s}$ is independent of B(s), as Brownian motion has independent increments, so $E[B(u) \mid \mathcal{F}_s^Y] = B(s)$ for $u \ge s$. For the variance,
\[ Var(Y(t)) = \int_0^t \int_0^t Cov(B(u), B(v)) du dv = \int_0^t \int_0^t \min(u, v) du dv. \]
But
\[ \int_0^t \min(u, v) du = \int_0^v u du + \int_v^t v du = \frac{v^2}{2} + v(t - v) = vt - v^2/2, \]
so $Var(Y(t)) = t^3/2 - t^3/6 = t^3/3$. Using the $E[Y(t) \mid \mathcal{F}_s^Y]$ expression,
\[ Cov(Y(t), Y(s)) = E[Y(t) Y(s)] = E[E[Y(t) Y(s) \mid \mathcal{F}_s^Y]] = E[Y(s)^2] + (t - s) E[B(s) Y(s)]. \]
But
\[ E[B(t) Y(t)] = \int_0^t E[B(t) B(u)] du = \int_0^t u du = t^2/2, \]
which delivers $Cov(Y(t), Y(s)) = s^3/3 + (t - s)s^2/2$.
We have just seen that using Brownian motion type objects in a Riemann integral raises no big issues. But what about using Brownian motion as an integrator? To answer this we need to determine some of the core properties of Brownian motion, which:
• is a martingale;
• is self-similar;
• is at the center of a process version of the Lindeberg–Lévy central limit theory, Donsker's invariance principle.
To see that Brownian motion $\{B(t)\}_{t \ge 0}$ is a martingale with respect to $\{\mathcal{F}(t)\}_{t \ge 0}$, write, for t > s, $B(t) = B(s) + \{B(t) - B(s)\}$. Then
\[ E[B(t) \mid \mathcal{F}(s)] = B(s) + E[B(t) - B(s) \mid \mathcal{F}(s)] = B(s) + E[B(t) - B(s)] = B(s). \]
We know $Var(B(t)) = t$, so $E[|B(t)|]$ exists, and thus Brownian motion satisfies the two conditions for being a martingale.
First, think about asking if $\{B(t)\}_{t \ge 0}$ is a continuous function of time. Figure 8.2 shows a simulated path of a standard Brownian motion. The plot is made using small dots at each datapoint, not a line graph. The graph gives the impression the Brownian motion has a continuous sample path. We can prove this is true.
Proof. Uses the "Kolmogorov continuity criterion", e.g. Bass (2011, Ch. 8). This says that if $\{X(t)\}_{t \in [0,1]}$ is a real valued process and there exist constants $c, \varepsilon, p > 0$ s.t.
\[ E\{|X(t) - X(s)|^p\} \le c |t - s|^{1 + \varepsilon}, \]
where t and s live in [0, 1], then with probability one $\{X(t)\}_{t \in [0,1]}$ is uniformly continuous on [0, 1]. In the Brownian motion case,
\[ B(t) - B(s) \overset{L}{=} \sqrt{t - s}\, U, \quad U \sim N(0, 1). \]
So
\[ E\{|B(t) - B(s)|^p\} = \mu_p |t - s|^{p/2}, \quad \mu_p = E|U|^p. \]
Now we need
\[ |t - s|^{p/2} \le c |t - s|^{1 + \varepsilon}, \]
which holds with $c = \mu_p$ and $\varepsilon = p/2 - 1 > 0$ for any $p > 2$, verifying the criterion.
Next: is Brownian motion differentiable? The answer is no, but actually this is pretty hard to prove rigorously. To think informally about it, suppose $\tau > 0$ is small; then $B(t + \tau) - B(t) \sim N(0, \tau)$, so
\[ \frac{B(t + \tau) - B(t)}{\tau} \sim N(0, \tau^{-1}). \]
Hence it is not well behaved as $\tau \downarrow 0$. In fact $\{B(t)\}_{t \ge 0}$ is nowhere differentiable.
The smoothness of the Brownian motion path can be measured using the following theorem (Lévy's modulus of continuity):
\[ \limsup_{h \downarrow 0} \sup_{0 \le t \le 1 - h} \frac{|B(t + h) - B(t)|}{\sqrt{2h \log(1/h)}} = 1. \]
The proof of this is beyond these notes. This means that Brownian motion is everywhere locally Hölder continuous of any order strictly less than 1/2.
Next consider the total variation of Brownian motion along a partition $\tau_n$ of [0, 1],
\[ TV(B, \tau_n) = \sum_{j=1}^n |B(t_j) - B(t_{j-1})|, \]
so $TV(B, \tau_n)$ sums up independent terms. Now $B(t) - B(s) \overset{L}{=} \sqrt{t - s}\, U$, $U \sim N(0, 1)$. This means that
\[ E|B(t) - B(s)| = \sqrt{\frac{2}{\pi}} \sqrt{t - s}, \quad Var|B(t) - B(s)| = E\{B(t) - B(s)\}^2 - \{E|B(t) - B(s)|\}^2 = (t - s)\{1 - (2/\pi)\}. \tag{8.2} \]
Notice the variance does not depend upon the details of $\tau_n$, only T.
Meanwhile $E[TV(B, \tau_n)] = \sqrt{2/\pi} \sum_{j=1}^n \sqrt{t_j - t_{j-1}}$, which for an equally spaced partition equals $\sqrt{2n/\pi} \to \infty$ as $n \to \infty$. As the variance stays bounded while the mean explodes,
\[ TV(B) = \infty. \]
This means that Brownian motion does not have finite variation, so its use as an integrator in a Riemann–Stieltjes integral will be, at best, subtle.
Definition 8.2.5. $[Y, Y](1)$ is the quadratic variation of a process $\{Y(t)\}_{t \in [0,1]}$. It is the mean square error limit of
\[ \sum_{j=1}^n \{Y(t_j) - Y(t_{j-1})\}^2, \]
as $\|\tau_n\| \to 0$, when that limit exists.
Now let
\[ QV(B, \tau_n) = \sum_{j=1}^n \{B(t_j) - B(t_{j-1})\}^2, \]
then
\[ E[QV(B, \tau_n)] = 1, \quad Var(QV(B, \tau_n)) = 2 \sum_{j=1}^n (t_j - t_{j-1})^2 \le 2 \|\tau_n\| \sum_{j=1}^n (t_j - t_{j-1}) = 2 \|\tau_n\|, \]
so if $\|\tau_n\| \to 0$ then
\[ \lim_{\|\tau_n\| \to 0} E[\{QV(B, \tau_n) - [B, B](1)\}^2] = 0, \quad \text{where} \quad [B, B](1) = 1. \]
This extends more generally to
\[ [B, B](t) = t. \]
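The contrast between exploding total variation and stabilizing quadratic variation is easy to see by simulation; a Python sketch (grid sizes are arbitrary):

import numpy as np

rng = np.random.default_rng(1)

for n in (100, 1000, 10000, 100000):
    dB = rng.normal(0.0, np.sqrt(1.0 / n), size=n)  # increments of B on [0,1]
    tv = np.abs(dB).sum()        # total variation along this partition
    qv = (dB ** 2).sum()         # quadratic variation along this partition
    print(f"n={n:6d}  TV={tv:9.2f}  QV={qv:.4f}")
# TV grows like sqrt(2n/pi) while QV settles down near [B,B](1) = 1.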
Finally, think about the p-variation of Brownian motion. The following theorem shows how slightly different quadratic variation is to 2-variation.
Theorem. Brownian motion on [0, 1] is of finite p-variation, almost surely, for any p > 2.
Proof. Lévy's modulus of continuity means that for p > 2 there exists a $K_p$ such that
\[ |B_t - B_s| \le K_p |t - s|^{1/p}, \]
so
\[ \sum_{j=1}^n |B(t_j) - B(t_{j-1})|^p \le K_p^p \sum_{j=1}^n |t_j - t_{j-1}| = K_p^p < \infty \]
almost surely.
Self-similarity
The next property we discuss is that scaling Brownian motion by σ > 0 is the same as running time faster:
\[ \{\sigma B(t)\}_{t \ge 0} \overset{d}{=} \{B(\sigma^2 t)\}_{t \ge 0}. \]
This follows from the properties of the normal distribution plus the $B(t) \sim N(0, t)$ property. This implies Brownian motion is an example of a self-similar process, with γ = 1/2 and a = σ².
Definition 8.2.7. The process $\{X(t)\}_{t \ge 0}$ is said to be self-similar if there exists a γ > 0 such that, for all a > 0,
\[ \{a^\gamma X(t)\}_{t \ge 0} \overset{d}{=} \{X(at)\}_{t \ge 0}. \]
One of the most important results in probability theory is the Lindeberg–Lévy central limit theorem, which is seen in introductory statistics classes.
Theorem 8.2.8. [Lindeberg–Lévy CLT] Assume that the sequence $\{X_j\}_{j=1}^\infty$ are i.i.d. draws with mean µ and variance $\sigma^2 < \infty$. Then the scaled, centered sample average satisfies
\[ \frac{1}{\sigma n^{1/2}} \sum_{j=1}^n (X_j - \mu) \overset{d}{\to} N(0, 1), \tag{8.3} \]
as $n \to \infty$.
This theorem places the Gaussian distribution at the heart of much of statistics. Brownian motion also plays a large role, extending the Lindeberg–Lévy theorem to processes.
To start, construct an artificial continuous time process $\{S_n(t)\}_{t \in [0,1]}$ formed by the partial sum
\[ S_n(t) = \frac{1}{\sigma n^{1/2}} \sum_{j=1}^{\lfloor tn \rfloor} (X_j - \mu), \quad t \in (0, 1], \]
using 100t% of the data. Here $\lfloor x \rfloor$ denotes the integer part of x. At each individual t, $S_n(t)$ behaves like (8.3) but based on only the first $\lfloor tn \rfloor$ observations, a fraction of the data. So as $n \to \infty$ for fixed t, the Lindeberg–Lévy CLT applies, marginally:
\[ S_n(t) \overset{d}{\to} N(0, t), \]
as $n \to \infty$.
But what happens to the entire $\{S_n(t)\}_{t \in [0,1]}$ process? The functional central limit theorem, which is also called Donsker's invariance principle, provides the answer.
Theorem 8.2.9. [Donsker's invariance principle] Assume that the sequence $\{X_j\}_{j=1}^\infty$ are i.i.d. draws with mean µ and variance $\sigma^2 < \infty$; then, on the space C(0, 1) of continuous functions on the unit interval, using a sup-norm metric, $\{S_n(t)\}_{t \in [0,1]}$ converges in distribution to $\{B(t)\}_{t \in [0,1]}$. This is often written as
\[ S_n \Rightarrow B. \]
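A simulation sketch of Donsker's principle in Python; the exponential draws are an arbitrary non-Gaussian choice, emphasizing that only the first two moments matter.

import numpy as np

rng = np.random.default_rng(2)
n, reps = 1000, 2000
mu, sigma = 1.0, 1.0             # mean and sd of Exponential(1) draws

# Build S_n(t) = (1 / (sigma sqrt(n))) sum_{j <= tn} (X_j - mu) for each replication.
X = rng.exponential(scale=1.0, size=(reps, n))
S = np.cumsum(X - mu, axis=1) / (sigma * np.sqrt(n))

# Marginals match those of Brownian motion: S_n(t) ~ N(0, t), approximately.
for t in (0.25, 0.5, 1.0):
    col = S[:, int(t * n) - 1]
    print(f"t={t}: mean {col.mean():+.3f}, var {col.var():.3f} (theory {t})")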
8.3 Stochastic integration

Recall from (8.1) that the Riemann–Stieltjes integral exists when f has finite p-variation and g has finite q-variation, for p, q > 0 such that $1/p + 1/q > 1$. Now we are going to think of the integrand $\{f(t)\}_{t \in [0,1]}$ and integrator $\{B(t)\}_{t \in [0,1]}$ as continuous time stochastic processes adapted to $\{\mathcal{F}_t\}_{t \in [0,1]}$ (so this filtration must contain at least the history of the integrand and integrator). We are eventually going to get to an Ito integral,
\[ \int_0^1 f(u) dB(u), \]
written compactly as $Y(t) = f \cdot B(t)$ for the version with upper limit t.
However, this is not so easy. Brownian motion has continuous sample paths, so bullet point one of the conditions for a Riemann–Stieltjes integral is dealt with. But Brownian motion is not of finite variation, so we need to be careful. All we have is q = 2. This rules out allowing p = 2, which would happen if, for example, f(t) = B(t), the same Brownian motion, i.e.
\[ \int_0^1 B(u) dB(u). \]
Moreover, evaluating the integrand at the midpoint of each subinterval, $f(\{t_{j-1} + t_j\}/2)$, rather than at the left end-point, yields, in the limit, a totally different integral: not an Ito integral but a Fisk–Stratonovich integral. Using $f(t_{j-1})$ is vital: it makes the integrand previsible.
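The left-point versus midpoint distinction is easy to see numerically; a Python sketch, where the average $\{B(t_{j-1}) + B(t_j)\}/2$ stands in for midpoint evaluation:

import numpy as np

rng = np.random.default_rng(3)
n = 200000
dB = rng.normal(0.0, np.sqrt(1.0 / n), n)    # Brownian increments on [0,1]
B = np.concatenate([[0.0], np.cumsum(dB)])   # path, with B(0) = 0

ito = np.sum(B[:-1] * dB)                    # left end-point: Ito sum
strat = np.sum(0.5 * (B[:-1] + B[1:]) * dB)  # averaged end-points: Fisk-Stratonovich sum

print("Ito sum          :", ito,   " theory:", 0.5 * (B[-1] ** 2 - 1.0))
print("Stratonovich sum :", strat, " theory:", 0.5 * B[-1] ** 2)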
To tackle this problem about $\{f(t)\}_{t \in [0,1]}$, think about replacing it with a less rough version. For a fixed n, build a "simple process" $\{f^{(n)}(t)\}_{t \in [0,1]}$:
\[ f^{(n)}(t) = f(t_{j-1}), \quad t \in [t_{j-1}, t_j), \quad j = 1, \ldots, n, \]
which is a càdlàg step function (like a distribution function). Then, for fixed n, the
\[ TV(f^{(n)}) = \sup_\tau TV(f^{(n)}, \tau) = \sum_{j=1}^n |f(t_j) - f(t_{j-1})| < \infty, \]
so the Riemann–Stieltjes sum
\[ \sum_{j=1}^n f^{(n)}(t_{j-1}) \{B(t_j) - B(t_{j-1})\} \tag{8.5} \]
is well defined. The idea is that, as n goes to infinity, this sequence converges in mean square to a unique limit, the integral of $\{f(t)\}_{t \in [0,1]}$ by $\{B(t)\}_{t \in [0,1]}$, written
\[ \int_0^1 f(u) dB(u). \]
This is an Ito integral, but now $\{f(t)\}_{t \in [0,1]}$ does not need to be of finite variation. To do this we need to make a couple of assumptions. Now assume that:
• the process $\{f(t)\}_{t \in [0,1]}$ is adapted to $\{\mathcal{F}_t\}_{t \in [0,1]}$;
• the process is square integrable,
\[ \int_0^1 E[f(u)^2] du < \infty. \]
To start the work, think of the sum (8.5). We are going to see three properties:
1. The increment $f^{(n)}(t_{j-1}) \{B(t_j) - B(t_{j-1})\}$ has
\[ E\left[ f^{(n)}(t_{j-1}) \{B(t_j) - B(t_{j-1})\} \mid \mathcal{F}_{t_{j-1}} \right] = 0, \]
hence is a martingale difference, as $E\left| f^{(n)}(t_{j-1}) \{B(t_j) - B(t_{j-1})\} \right| < \infty$ due to the square integrability assumption.
2. Its variance is
\[ Var\left[ f^{(n)}(t_{j-1}) \{B(t_j) - B(t_{j-1})\} \right] = E\left[ f^{(n)}(t_{j-1})^2 Var\left\{ B(t_j) - B(t_{j-1}) \mid \mathcal{F}_{t_{j-1}} \right\} \right] = (t_j - t_{j-1}) E\left[ f^{(n)}(t_{j-1})^2 \right]. \]
3. Think of (8.5) as a partial sum; the partial sum is a martingale with respect to $\{\mathcal{F}_t\}$ and so
\[ Var\left( \int_0^1 f^{(n)}(u) dB(u) \right) = \sum_{j=1}^n E\left[ f^{(n)}(t_{j-1})^2 \right] (t_j - t_{j-1}) = \int_0^1 E[f^{(n)}(u)^2] du. \]
Expressing the variance of $f^{(n)} \cdot B(1)$ as $\int_0^1 E[f^{(n)}(u)^2] du$ is called L² isometry. It is an important feature of Ito integrals.
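A Monte Carlo sketch of the L² isometry in Python, using the previsible integrand f(t) = B(t); this choice of f is an illustrative assumption.

import numpy as np

rng = np.random.default_rng(4)
n, reps = 200, 20000
dt = 1.0 / n

# Ito sums sum_j B(t_{j-1}) {B(t_j) - B(t_{j-1})} across many replications.
dB = rng.normal(0.0, np.sqrt(dt), size=(reps, n))
B = np.cumsum(dB, axis=1) - dB           # B(t_{j-1}): path value before each increment
ito_sums = np.sum(B * dB, axis=1)

# L2 isometry: Var(int_0^1 B dB) = int_0^1 E[B(u)^2] du = int_0^1 u du = 1/2.
print("sample variance:", ito_sums.var(), " theory: 0.5")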
Under the two bullet point assumptions, it turns out it is always possible to find a sequence of simple processes $\{f^{(n)}(t)\}_{t \in [0,1]}$ such that
\[ \int_0^1 E\left[ \left\{ f^{(n)}(u) - f(u) \right\}^2 \right] du \to 0. \]
A proof of this is in Kloeden and Platen (1992), Lemma 3.2.1. Thus $\{f^{(n)}(t)\}_{t \in [0,1]}$ is a Cauchy sequence, approximating $\{f(t)\}_{t \in [0,1]}$ arbitrarily well in L² as n goes to infinity.
Applying Doob’s maximal quadratic inequality from equation (2.7), that is for any square integrable mar-
tingale
E sup Ys2 ≤ 4E[Yt2 ],
s≤t
Letting k → ∞, the
" Z 1 Z t 2 # Z 1 n o2
(n) (n)
E sup f (u)dB(u) − f (u)dB(u) ≤4 E f (u) − f (u) du → 0.
t∈[0,1] 0 0 0
as a valid integral, the mean square error limit of a sequence of Reimann-Stieltjes integrals. This limit is called
an Ito integral of {f (t)}t∈[0,1] by {B(t)}t∈[0,1] . That was the goal of this section.
More broadly, instead of working with the sums, one can define a partial sum process
\[ Y^{(n)}(t) = \sum_{j=1}^n 1(t_j \le t) f^{(n)}(t_{j-1}) \{B(t_j) - B(t_{j-1})\}, \]
and it is possible to show, if $\int_0^T E[f(u)^2] du < \infty$, that
\[ \lim_{\|\tau_n\| \to 0} E\left[ \sup_{0 \le t \le T} \left\{ Y^{(n)}(t) - Y(t) \right\}^2 \right] = 0, \]
where
\[ Y(t) = \int_0^t f(u) dB(u), \]
the Ito integral up to time t.
Definition 8.3.3. For the processes $\{X(t), Y(t)\}_{t \in [0,T]}$ adapted to $\{\mathcal{F}(t)\}_{t \in [0,T]}$, the mean square limit of
\[ QV(X, Y, \tau_n) = \sum_{j=1}^n \{X(t_j) - X(t_{j-1})\} \{Y(t_j) - Y(t_{j-1})\}, \]
as $\|\tau_n\| \to 0$, is called the quadratic covariation [X, Y](T).
By the Cauchy–Schwarz inequality,
\[ |QV(X, Y, \tau_n)| \le \sqrt{QV(X, X, \tau_n) \, QV(Y, Y, \tau_n)}, \]
so
\[ |[X, Y](t)| \le \sqrt{[X, X](t) [Y, Y](t)}, \tag{8.6} \]
and, expanding the squared increments,
\[ [X, Y] = \frac{1}{4} \left\{ [X + Y, X + Y] - [X - Y, X - Y] \right\}. \]
The latter result is called the polarization identity.
Example 8.3.4. If $\{g(t)\}_{t \in [0,T]}$ is a continuous function of finite variation, then in Section 8.1.2 we saw that $[g, g](t) = 0$. Thus, using (8.6),
\[ [g, B](t) = 0. \]
This means that, if
\[ Y(t) = g(t) + \int_0^t \sigma(u) dB(u), \]
then
\[ [Y, Y](t) = \int_0^t \sigma^2(u) du. \]
8.4 Stochastic differential equations

Having defined a stochastic integral, it is possible to work with the stochastic differential equation (SDE) corresponding to it. In this section we will work with SDEs associated with Ito processes.
Definition 8.4.1. The process
\[ Y(t) = Y(0) + \int_0^t \mu(u) du + \int_0^t \sigma(u) dB(u) \tag{8.7} \]
is called an Ito process, assuming $\{\mu(t), \sigma(t), B(t)\}_{t \in [0,T]}$ are adapted to $\{\mathcal{F}(t)\}_{t \in [0,T]}$.
Think of the corresponding stochastic differential equation for $\{Y(t)\}$ in terms of $\{\mu(t), \sigma(t), B(t)\}_{t \in [0,T]}$ as
\[ dY(t) = \mu(t) dt + \sigma(t) dB(t), \tag{8.8} \]
where $dY(t)$ is thought of as
\[ Y(t) - Y(t - dt). \]
The SDE (8.8) is short-hand for (8.7) — the solution (8.7) is the fundamental object, the SDE is shorthand for it.
Example 8.4.2. [Geometric Brownian motion] One of the most used Ito processes is geometric Brownian motion,
\[ dY(t) = \mu Y(t) dt + \sigma Y(t) dB(t), \quad Y(0) > 0, \quad \sigma \ge 0, \quad \mu \in \mathbb{R}; \]
then $\{Y(t)\}_{t \ge 0}$ is non-negative with probability one. This is a standard continuous time model of prices in financial economics. It is often written, in short-hand, as
\[ \frac{dY(t)}{Y(t)} = \mu dt + \sigma dB(t), \]
or more abstractly
\[ \frac{dY}{Y} = \mu dt + \sigma dB. \]
Later we will see it has the solution
\[ Y(t) = Y(0) \exp\left\{ \left( \mu - \frac{1}{2}\sigma^2 \right) t + \sigma B(t) \right\}. \]
At first sight, the appearance of the σ²/2 term in the solution is not obvious. More fascinating is that an innocent reading of the SDE, with shocks $\sigma Y(t) dB(t)$, suggests the process $\{Y(t)\}_{t \ge 0}$ could go negative, as the increments of Brownian motion are on the real line — however this is not true.
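A Python sketch contrasting a naive Euler discretization of this SDE with the exact solution; the step size and parameter values are arbitrary choices.

import numpy as np

rng = np.random.default_rng(6)
mu, sigma, y0 = 0.05, 0.4, 1.0
n = 1000
dt = 1.0 / n
dB = rng.normal(0.0, np.sqrt(dt), n)

# Euler scheme: Y_{j+1} = Y_j + mu Y_j dt + sigma Y_j dB_j.
y_euler = y0
for db in dB:
    y_euler += mu * y_euler * dt + sigma * y_euler * db

# Exact solution at t = 1: Y(1) = Y(0) exp{(mu - sigma^2/2) + sigma B(1)}.
B1 = dB.sum()
y_exact = y0 * np.exp((mu - 0.5 * sigma ** 2) + sigma * B1)

print("Euler:", y_euler, " exact:", y_exact)  # close for small dt; both stay positive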
Biohazard 8.4.3. Think of the Brownian increment dB(t). It is quite subtle. Go back to quadratic variation: for partitions of [0, t],
\[ \lim_{\|\tau_n\| \to 0} E\left[ \left( \sum_{j=1}^n \{B(t_j) - B(t_{j-1})\}^2 - [B, B](t) \right)^2 \right] = 0. \tag{8.9} \]
In terms of the differential notation $dB(t) \sim N(0, dt)$, this means that if we see a $\{dB(t)\}^2$ then, to get the right mean square limit, we should set
\[ \{dB(u)\}^2 = dt = d[B, B](t). \]
When you first see this, it can appear mighty odd, as it looks for all the world like $\{dB(u)\}^2 \sim \chi_1^2 \, dt$. But the SDE notation is really just a shorthand for the fundamental object: the stochastic integral. To get the right stochastic integral you need to take $\{dB(u)\}^2 = dt$. This is very important and not obvious.
8.5 Ito's Formula

Now suppose
\[ A(t) = A\{Y(t), t\}, \]
a smooth function of Y(t); then what SDE does $\{A(t)\}$ follow? This is a fundamental question! In short-hand, what is the chain rule for Ito processes?
To start, recall that if $A \in C^{1,1}$ and Y(t) is continuously differentiable, then Newton's chain rule says that
\[ dA(t) = A_t(t) dt + A_Y(t) dY(t), \quad \text{where} \quad A_t(t) = \frac{\partial A\{Y(t), t\}}{\partial t}, \quad A_Y(t) = \frac{\partial A\{Y(t), t\}}{\partial Y(t)}. \]
But that does not work for Ito processes, which are not differentiable! The extension of Newton's result to Ito processes is called Ito's Formula (or Ito's Lemma).
8.5. ITO’S FORMULA 205
Lemma 6. [Ito's Formula] If $A(y, t) \in C^{2,1}$, $\{\mu(t), \sigma(t), B(t)\}_{t \ge 0}$ is adapted to $\{\mathcal{F}_t\}_{t \ge 0}$, and $A(t) = A\{Y(t), t\}$, then Ito's Formula says that
\[ A(t) = A(0) + \int_0^t \left\{ A_u(u) + \frac{1}{2} A_{YY}(u) \sigma_u^2 \right\} du + \int_0^t A_Y(u) dY(u), \quad \text{where} \quad A_{YY}(t) = \frac{\partial^2 A\{Y(t), t\}}{\partial Y(t)^2}. \]
Proof (sketch). A second order Taylor expansion gives
\[ dA(t) = A_t(t) dt + \frac{1}{2} A_{tt}(t) (dt)^2 + A_Y(t) dY(t) + \frac{1}{2} A_{YY}(t) (dY(t))^2 + A_{Yt}(t) (dY(t))(dt), \]
where $A_{tt}(t) = \partial^2 A\{Y(t), t\}/\partial t^2$ and $A_{Yt}(t) = \partial^2 A\{Y(t), t\}/\partial t \partial Y(t)$. The $(dt)^2$ and $(dY(t))(dt)$ terms are of smaller order than dt and drop out in the limit, leaving
\[ dA\{Y(t), t\} = A_t(t) dt + A_Y(t) dY(t) + \frac{1}{2} A_{YY}(t) \{dY(t)\}^2. \]
But what is $\{dY(t)\}^2$? From Biohazard 8.4.3, the
\[ \{dY(t)\}^2 = d[Y, Y](t) = \sigma_t^2 dt, \]
so
\[ dA\{Y(t), t\} = A_t(t) dt + \frac{1}{2} A_{YY}(t) d[Y, Y](t) + A_Y(t) dY(t). \]
This yields the stated result, by rearranging and integrating.
It is the crucial $\{dY(t)\}^2 = d[Y, Y](t)$ step which is the heart of Ito's Formula.
Here we will discuss the four examples of the use of Ito's formula given in Table 8.1. Some of them just get the relevant SDE; others use the form of the SDE to solve for the stochastic integral itself.
First take $Y(t) = \mu t + \sigma B(t)$ and $A(t) = \exp\{Y(t)\}$, so $A_Y = A_{YY} = A$ and $A_t = 0$. Then Ito's formula gives
\begin{align*}
dA(t) &= \frac{1}{2} A(t) \sigma^2 dt + A(t) dY(t) \\
&= A(t) \left( \mu + \frac{1}{2}\sigma^2 \right) dt + A(t) \sigma dB(t), \tag{8.10}
\end{align*}
which is drift-free if and only if
\[ \mu = -\frac{1}{2}\sigma^2, \]
delivering the exponential martingale of Brownian motion: if $\mu = -\frac{1}{2}\sigma^2$ then $\{A(t)\}_{t \ge 0}$ is a martingale with respect to $\{\mathcal{F}_t\}_{t \ge 0}$. Matching (8.10) with Example 8.4.2 also shows that geometric Brownian motion $\{Y(t)\}_{t \ge 0}$ has the solution
\[ Y(t) = Y(0) \exp\left\{ \left( \mu - \frac{1}{2}\sigma^2 \right) t + \sigma B(t) \right\}, \]
as replacing the drift µ in the exponent by $\mu - \frac{1}{2}\sigma^2$ exactly offsets the extra $\frac{1}{2}\sigma^2$ drift Ito's formula creates.
Example 8.5.3. Take $A(y, t) = y^2$ and apply Ito's formula to Y = B, so $d\{B(t)^2\} = 2B(t) dB(t) + dt$. Integrating, $B(t)^2 = 2\int_0^t B(u) dB(u) + t$, or, beautifully,
\[ 2 \int_0^t B(u) dB(u) = B(t)^2 - t. \]
More generally, applying Ito's formula to $Y(t)^2$ for an Ito process Y shows that $Y(t)^2 - Y(0)^2 - 2\int_0^t Y(u) dY(u)$ is often used as the definition of [Y, Y](t), the quadratic variation up to time t.
Now consider the SDE
\[ dY(t) = -\lambda \{Y(t) - \mu\} dt + \sigma dB(t), \quad \lambda > 0. \]
The solution to this SDE is called the Ornstein–Uhlenbeck process. The form of this SDE reminds me of the error correction mechanism we saw in Definition 4.1.3, a reparameterization of an autoregression, where the differences of the series are regressed on lagged levels. How to solve this SDE? Let
\[ A(y, t) = y e^{\lambda t}, \]
so, by Ito's formula (here $A_{yy} = 0$, so no correction term appears),
\[ dA(t) = A_t(t) dt + A_y(t) dY(t) = \lambda Y(t) e^{\lambda t} dt + e^{\lambda t} \{-\lambda Y(t) dt + \lambda \mu dt + \sigma dB(t)\} = e^{\lambda t} \lambda \mu dt + e^{\lambda t} \sigma dB(t). \]
So solving,
\[ A(t) = A(0) + \lambda \mu \int_0^t e^{\lambda u} du + \sigma \int_0^t e^{\lambda u} dB(u) = A(0) + \mu \left( e^{\lambda t} - 1 \right) + \sigma \int_0^t e^{\lambda u} dB(u), \]
and multiplying through by $e^{-\lambda t}$,
\[ Y(t) = e^{-\lambda t} Y(0) + \mu \left( 1 - e^{-\lambda t} \right) + \sigma \int_0^t e^{-\lambda(t - u)} dB(u). \]
8.6 Applications in time series

In Section 4.5.3 we estimated φ₁ in an autoregression. In the stationary case, this yielded a Gaussian central limit theorem as $T \to \infty$.
Now suppose the data is non-stationary. This time it will be coming from a Gaussian random walk,
\[ Y_t = Y_{t-1} + \varepsilon_t, \quad \varepsilon_t \sim N(0, \sigma^2), \quad Y_0 = 0, \]
so the true value of φ₁ is one. Thus
\[ T\left( \hat{\phi}_{OLS} - 1 \right) = \frac{\sum_{t=1}^T \frac{1}{\sqrt{T}} Y_{t-1} \left\{ \frac{1}{\sqrt{T}} Y_t - \frac{1}{\sqrt{T}} Y_{t-1} \right\}}{T^{-1} \sum_{t=1}^T \left\{ \frac{1}{\sqrt{T}} Y_{t-1} \right\}^2}. \]
Now write
\[ \frac{1}{\sqrt{T}} Y_t = \frac{1}{\sqrt{T}} \sum_{j=1}^t \varepsilon_j \overset{d}{=} \sigma \sum_{j=1}^t \{B(j/T) - B((j-1)/T)\} = \sigma B(t/T); \]
then
\[ \sum_{t=1}^T \frac{1}{\sqrt{T}} Y_{t-1} \left\{ \frac{1}{\sqrt{T}} Y_t - \frac{1}{\sqrt{T}} Y_{t-1} \right\} \overset{d}{=} \sigma^2 \sum_{t=1}^T B((t-1)/T) \{B(t/T) - B((t-1)/T)\}, \]
\[ T^{-1} \sum_{t=1}^T \left\{ \frac{1}{\sqrt{T}} Y_{t-1} \right\}^2 \overset{d}{=} \sigma^2 \sum_{t=1}^T B((t-1)/T)^2 \{(t/T) - (t-1)/T\}, \]
so, the σ² cancelling in the ratio,
\[ T\left( \hat{\phi}_{OLS} - 1 \right) \overset{d}{\to} \frac{\int_0^1 B(u) dB(u)}{\int_0^1 B(u)^2 du}. \]
Using Example 8.5.3, we can simplify the numerator, so
\[ T\left( \hat{\phi}_{OLS} - 1 \right) \overset{d}{\to} \frac{\frac{1}{2}\left\{ B(1)^2 - 1 \right\}}{\int_0^1 B(u)^2 du}. \]
The law of the right hand side is often called the unit root distribution. This is a highly skewed distribution. Some of the work on unit roots is reviewed by Stock (1994).
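A Monte Carlo sketch of the unit root distribution in Python (the sample size and replication count are arbitrary choices):

import numpy as np

rng = np.random.default_rng(8)
T, reps = 500, 20000

stats = np.empty(reps)
for r in range(reps):
    y = np.concatenate([[0.0], np.cumsum(rng.normal(size=T))])  # random walk, Y_0 = 0
    ylag, dy = y[:-1], np.diff(y)
    phi_hat = 1.0 + (ylag * dy).sum() / (ylag ** 2).sum()       # OLS of Y_t on Y_{t-1}
    stats[r] = T * (phi_hat - 1.0)

# The distribution is centered below zero and strongly skewed to the left.
print("mean:", stats.mean(), " median:", np.median(stats))
print("5% and 95% quantiles:", np.quantile(stats, [0.05, 0.95]))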
More broadly, if there is heteroskedasticity, with
\[ \frac{1}{\sqrt{T}} Y_t - \frac{1}{\sqrt{T}} Y_{t-1} = \sigma((t-1)/T) \{B(t/T) - B((t-1)/T)\}, \]
and we write
\[ X((t-1)/T) = \frac{1}{\sqrt{T}} Y_{t-1}, \]
then
\[ T\left( \hat{\phi}_{OLS} - 1 \right) \overset{d}{\to} \frac{\int_0^1 X(u) \sigma(u) dB(u)}{\int_0^1 X(u)^2 du}, \quad \text{where} \quad X(t) = \int_0^t \sigma(u) dB(u). \]
Work on related problems includes Boswijk, Cavaliere, Rahbek, and Taylor (2016) and the references contained within.
8.7 Recap
Index
Ft . See filtration
∆. See jump
Action
Adam’s law
Adapted
Angle bracket
Assignment
Autocorrelation
Autocovariance
Autoregression
Bandwidth
Bartlett kernel
Baum filter
Bellman equation
Bootstrap
Brownian motion
Càdlàg
Cauchy sequence
Chain rule
Characteristic function
Choice
Cointegration
Cointegration vector
Companion form
Complex conjugate
Complex variables
Consistency
Continuous function
Continuous time
Control
Controllability
Covariance stationarity
Cumulant
Cumulant function
Cramer’s representation
Cycle
Difference equation
Difference
Difference operator
Dirichlet kernel
Discounted utility
Doob’s decomposition
Doob’s inequality
Donsker’s theorem
Dynamic programming
Eigenvalues
Ergodicity
Expected loss
Expected utility
Filter
Filtration
Finite variation
Forecast
Fourier transform
Frequency
Gamma process
Baum filter
Gaussian HMM
Kalman filter
Particle filter
Hilbert space
Initial conditions
Integrated of order d
Invertible process
Ito’s Formula
Ito process
Jumps
Kalman filter
Koopman-Durbin smoother
Kullback-Leibler divergence
Lag operator
Lag polynomial
Lag-s randomization
Lag-s unconfoundedness
Lasso regression
Lead
Least squares
Lévy process
Likelihood function
Linear control
Linear projection
Long memory
Loss function
m-order dependent
m-order causality
m-order Markov
Markov chain
Martingale
Martingale difference
Martingale transform
Method of moments
Moment condition
Moving average
Filter
Wold representation
Non-stationary
Norm of partition
Nowcasting
Optimal policy
Ornstein-Uhlenbeck process
Outcome
p-norm
p-variation
Partial autocorrelation
Particle filter
Partition
Periodogram
Poisson process
Potential outcome
Prediction decomposition
Predictor
Previsible
Quadratic covariation
Quadratic variation
Quasi-likelihood
Random walk
Realized volatility
Reinforcement learning
Riccati equation
Ridge regression
Riemann integral
Riemann-Stieltjes integral
Roots of polynomial
Sample mean
Sample autocovariance
Sample autocorrelation
Seasonal adjustment
Seasonal component
Seasonality
Sequential randomization
Shrinkage
Simple process
Smoother
Spectral density
Spline
Stationarity
Covariance
Strict
Stochastic integration
Strict stationarity
Summability
Absolute
Square
Stochastic differential equation
Stochastic volatility
Trend
Trigonometric seasonality
Unit root
Utility
VAR. See Autoregression
Value function
Volatility. See stochastic volatility
Weak convergence
Wiener process. See Brownian motion
White noise
Wold decomposition
Yule-Walker equation
Bibliography
Andersen, T. G., T. Bollerslev, F. X. Diebold, and P. Labys (2001). The distribution of exchange rate volatility.
Journal of the American Statistical Association 96, 42–55.
Andrews, D. W. K. (1991). Heteroskedasticity and autocorrelation consistent covariance matrix estimation.
Econometrica 59, 817–858.
Andrews, I., J. Stock, and L. Sun (2019). Weak instruments in IV regression: Theory and practice. Annual
Review of Economics 11, 727–753.
Andrieu, C., A. Doucet, and R. Holenstein (2010). Particle Markov chain Monte Carlo methods (with dis-
cussion). Journal of the Royal Statistical Society, Series B 72, 1–33.
Angrist, J. D., Ò. Jordà, and G. M. Kuersteiner (2018). Semiparametric estimates of monetary policy effects:
string theory revisited. Journal of Business & Economic Statistics 36, 371–387.
Angrist, J. D. and G. M. Kuersteiner (2011). Causal effects of monetary shocks: Semiparametric conditional
independence tests with a multinomial propensity score. Review of Economics and Statistics 93, 725–747.
Apostol, T. M. (1957). Mathematical Analysis. London: Addison-Wesley.
Barndorff-Nielsen, O. E. and N. Shephard (2002). Econometric analysis of realised volatility and its use in
estimating stochastic volatility models. Journal of the Royal Statistical Society, Series B 64, 253–280.
Bartlett, M. S. (1950). Periodogram analysis and continuous spectra. Biometrika 37, 1–16.
Bass, R. F. (2011). Stochastic Processes. Cambridge: Cambridge University Press.
Baum, L. E. and J. A. Eagon (1967). An inequality with applications to statistical estimation for probabilistic func-
tions of Markov processes and to a model for ecology. Bulletin of the American Mathematical Society 73,
360–363.
Baum, L. E. and T. Petrie (1966). Statistical inference for probabilistic functions of finite state Markov chains.
The Annals of Mathematical Statistics 37, 1554–1563.
Baxter, M. and R. G. King (1999). Measuring business cycles: Approximate band-pass filters for economic
time series. The Review of Economics and Statistics 81, 575–593.
Billingsley, P. (1999). Convergence of Probability Measures (2 ed.). New York: Wiley.
Bladt, M. and A. J. McNeil (2022). Time series copula models using d-vines and v-transforms. Econometrics
and Statistics 24, 27–48.
Blitzstein, J. K. and J. Hwang (2019). Introduction to Probability (2 ed.). Chapman and Hall.
Blitzstein, J. K. and N. Shephard (2023). Introduction to statistical inference. Unpublished: Stat111 lecture
notes, Harvard University.
Bojinov, I., A. Rambachan, and N. Shephard (2021). Panel experiments and dynamic causal effects: A finite
population perspective. Quantitative Economics 12, 1171–1196.
Bojinov, I. and N. Shephard (2019). Time series experiments and causal estimands: exact randomization
tests and trading. Journal of the American Statistical Association 114, 1665–1682.
Bollerslev, T. (1986). Generalised autoregressive conditional heteroskedasticity. Journal of Econometrics 31, 307–327.
Boswijk, H. P., G. Cavaliere, A. Rahbek, and A. M. Taylor (2016). Inference on co-integration parameters in
heteroskedastic vector autoregressions. Journal of Econometrics 192, 64–85.
Broadberry, S., B. M. S. Campbell, A. Klein, M. Overton, and B. van Leeuwen (2015). British Economic
Growth: 1270–1870. Cambridge University Press.
Brown, B. M. (1971). Martingale central limit theorems. Annals of Mathematical Statistics 42, 59–66.
Campigli, F., G. Bormetti, and F. Lillo (2022). Measuring price impact and information content of trades in
a time-varying setting. Unpublished paper: University of Bologna.
Carter, C. K. and R. Kohn (1994). On Gibbs sampling for state space models. Biometrika 81, 541–553.
Chopin, N. and O. Papaspiliopoulos (2020). An Introduction to Sequential Monte Carlo. Springer.
Cox, D. R. (1958a). Planning of Experiments. Oxford: Wiley.
Cox, D. R. (1958b). The regression analysis of binary sequences (with discussion). Journal of the Royal
Statistical Society, Series B 20, 215–42.
Cox, D. R. (1981). Statistical analysis of time series: some recent developments. Scandinavian Journal of
Statistics 8, 93–115.
Coyle, D. (2015). GDP: A Brief but Affectionate History. Princeton University Press.
Cramer, H. and H. Wold (1936). Some theorems on distribution functions. Journal of the London Mathematical
Society 11, 290–294.
Dai, C., J. Heng, P. E. Jacob, and N. Whiteley (2022). An invitation to sequential Monte Carlo samplers. Journal of the American Statistical Association 117, 1587–1600.
de Jong, P. and N. Shephard (1995). The simulation smoother for time series models. Biometrika 82, 339–350.
DeLong, B. J. (2022). Slouching Towards Utopia: An Economic History of the Twentieth Century. Basic
Books.
Diebold, F. X., T. A. Gunther, and T. S. Tay (1998). Evaluating density forecasts with applications to
financial risk management. International Economic Review 39, 863–883.
Doob, J. L. (1949). Application of the theory of martingales. In Actes du Colloque International Le Calcul des Probabilités et ses Applications: Lyon, 28 Juin – 3 Juillet, 1948, 23–27.
Doob, J. L. (1953). Stochastic Processes. New York: John Wiley and Sons.
Dudley, R. and R. Norvaiša (1999). Product integrals, Young integrals and p-variation. In R. Dudley and
R. Norvaiša (Eds.), Differentiability of Six Operators on Nonsmooth Functions and p-variation, pp. 73–
208. New York: Springer-Verlag. Lecture Notes in Mathematics 1703.
Duffie, D. (2001). Dynamic Asset Pricing Theory. Princeton University Press.
Durbin, J. and S. J. Koopman (2002). A simple and efficient simulation smoother for state space time series
analysis. Biometrika 89, 603–616.
Durbin, J. and S. J. Koopman (2012). Time Series Analysis by State Space Methods (2 ed.). Oxford: Oxford
University Press.
Durbin, R., S. R. Eddy, A. Krogh, and G. Mitchison (1998). Biological sequence analysis: probability models
of proteins and nucleic acids. Cambridge University Press.
Engle, R. F. (1982). Autoregressive conditional heteroskedasticity with estimates of the variance of United Kingdom inflation. Econometrica 50, 987–1007.
Engle, R. F. and C. W. J. Granger (1987). Co-integration and error correction: representation, estimation
and testing. Econometrica 55, 251–276.
Engle, R. F. and V. Ng (1993). Measuring and testing the impact of news on volatility. Journal of Finance 48,
1749–1778.
Fearnhead, P. and H. R. Kunsch (2018). Particle filters and data assimilation. Annual Review of Statistics
and Its Applications 5, 421–449.
Fisher, R. A. (1925). Statistical Methods for Research Workers (1 ed.). London: Oliver and Boyd.
Flury, T. and N. Shephard (2011). Bayesian inference based only on simulated likelihood: particle filter
analysis of dynamic economic models. Econometric Theory 27, 933–956.
Frisch, R. (1933). Propagation Problems and Impulse Problems in Dynamic Economics. London: Allen and
Unwin.
Frühwirth-Schnatter, S. (1994). Data augmentation and dynamic linear models. Journal of Time Series
Analysis 15, 183–202.
Gneiting, T. and M. Katzfuss (2014). Probabilistic forecasts. Annual Review of Statistics and Its Application 1,
125–151.
Gordon, N. J., D. J. Salmond, and A. F. M. Smith (1993). A novel approach to nonlinear and non-Gaussian
Bayesian state estimation. IEE-Proceedings F 140, 107–113.
Gordon, R. J. (2016). The Rise and Fall of American Growth: The U.S. Standard of Living since the Civil
War. Princeton University Press.
Granger, C. W. J. (1981). Some properties of time series data and their use in econometric model specification.
Journal of Econometrics 16, 121–130.
Granger, C. W. J. and A. Andersen (1978). On the invertibility of time series models. Stochastic Processes
and their Applications 8, 87–92.
Green, P. and B. W. Silverman (1994). Nonparameteric Regression and Generalized Linear Models: A Rough-
ness Penalty Approach. London: Chapman & Hall.
Grenander, U. and M. Rosenblatt (1953). Statistical spectral analysis of time-series arising from stationary
stochastic processes. Annals of Mathematical Statistics 24, 537–558.
Haavelmo, T. (1943). The statistical implications of a system of simultaneous equations. Econometrica 11,
1–12.
Hamilton, J. (1989). A new approach to the economic analysis of nonstationary time series and the business
cycle. Econometrica 57, 357–384.
Hamilton, J. D. (1994). Time Series Analysis. Princeton: Princeton University Press.
Hansen, L. P. and T. J. Sargent (2014). Recursive Models of Dynamic Linear Economies. Princeton University
Press.
Hansen, L. P. and K. J. Singleton (1982). Generalized instrumental variables estimation of nonlinear rational
expectations models. Econometrica 50, 1269–1286.
Harvey, A. C. (1989). Forecasting, Structural Time Series Models and the Kalman Filter. Cambridge: Cam-
bridge University Press.
Harvey, A. C. (1993). Time Series Models (2 ed.). Hemel Hempstead: Harvester Wheatsheaf.
Harvey, A. C., E. Ruiz, and N. Shephard (1994). Multivariate stochastic variance models. Review of Economic
Studies 61, 247–264.
Hasbrouck, J. (1991a). Measuring the information content of stock trades. The Journal of Finance 46,
179–207.
Hasbrouck, J. (1991b). The summary informativeness of stock trades: An econometric analysis. The Review
of Financial Studies 4, 571–595.
Herbst, E. and F. Schorfheide (2015). Bayesian Estimation of DSGE Models. Princeton: Princeton University
Press.
Hindrayanto, I., J. A. D. Aston, S. J. Koopman, and M. Ooms (2013). Modelling trigonometric seasonal
components for monthly economic time series. Applied Economics 45, 3024–3034.
Hodrick, R. J. and E. C. Prescott (1997). Postwar U.S. business cycles: an empirical investigation. Journal of Money, Credit, and Banking 29, 1–16.
Jacob, P. E., J. O'Leary, and Y. F. Atchadé (2020). Unbiased Markov chain Monte Carlo methods with couplings. Journal of the Royal Statistical Society, Series B 82, 543–600.
Janson, S. (2021). A central limit theorem for m-dependent variables. Unpublished paper: Department of
Mathematics, Uppsala University.
Jørgensen, B. (1982). Statistical Properties of the Generalised Inverse Gaussian Distribution. New York:
Springer-Verlag.
Jorda, O. (2005). Estimation and inference of impulse responses by local projections. American Economic Review 95, 161–182.
Kalman, R. E. (1960). A new approach to linear filtering and prediction problems. Journal of Basic Engi-
neering, Transactions ASMA, Series D 82, 35–45.
Karatzas, I. and S. E. Shreve (1998). Methods of Mathematical Finance. New York: Springer–Verlag.
Kim, C.-J. and C. R. Nelson (1999). State-Space Models with Regime Switching. Classical and Gibbs-Sampling
Approaches with Applications. Cambridge: MIT.
Kim, S., N. Shephard, and S. Chib (1998). Stochastic volatility: likelihood inference and comparison with
ARCH models. Review of Economic Studies 65, 361–393.
Kim, S. J., K. Koh, S. Boyd, and D. Gorinevsky (2009). l1 trend filtering. SIAM Review 51, 339–360.
Kimmeldorf, G. S. and G. Wahba (1970). A correspondence between Bayesian estimation on stochastic pro-
cesses and smoothing by splines. The Annals of Mathematical Statistics 41, 495–502.
Kloeden, P. E. and E. Platen (1992). Numerical Solutions to Stochastic Differential Equations. New York:
Springer.
Kong, A., J. S. Liu, and W. H. Wong (1994). Sequential imputations and Bayesian missing data problems.
Journal of the American Statistical Association 89, 278–88.
Koopmans, T. C. (1950). Statistical inference in dynamic economic models, Volume 10. Wiley. Cowles Commission.
Lazarus, E., D. J. Lewis, and J. H. Stock (2021). The size-power tradeoff in HAR inference. Econometrica 89,
2497–2516.
Li, M. and S. J. Koopman (2021). Unobserved components with stochastic volatility: Simulation-based
estimation and signal extraction. Journal of Applied Econometrics 36, 614–627.
Liu, J. S. and R. Chen (1995). Blind deconvolution via sequential imputation. Journal of the American
Statistical Association 90, 567–76.
Liu, J. S. and R. Chen (1998). Sequential Monte Carlo methods for dynamic systems. Journal of the American
Statistical Association 93, 1032–1044.
Magnus, J. R. and H. Neudecker (2019). Matrix Differential Calculus with Applications in Statistics and
Econometrics (3 ed.). New York: Wiley.
Mandelbrot, B. B. (2021). The Fractal Geometry of Nature. Echo Point Books & Media.
Merton, R. (1969). Lifetime portfolio selection under uncertainty: the continuous time case. Review of Eco-
nomics and Statistics 51, 247–257.
Mikosch, T. (1998). Elementary Stochastic Calculus with Finance in View. Singapore: World Scientific.
Miller, J. W. (2018). A detailed treatment of Doob’s theorem. Unpublished paper: Harvard University.
Newey, W. K. and K. D. West (1987). A simple positive semi-definite, heteroskedasticity and autocorrelation
consistent covariance matrix. Econometrica 55, 703–708.
Neyman, J. (1923). On the application of probability theory to agricultural experiments. Essay on principles.
section 9. Statistical Science 5, 465–472. Originally published 1923, republished in 1990, translated by
Dorota M. Dabrowska and Terence P. Speed.
Olea, J. L. M. and M. Plagborg-Moller (2021). Local projection inference is simpler and more robust than
you think. Econometrica 89, 1789–1823.
Omori, Y., S. Chib, N. Shephard, and J. Nakajima (2007). Stochastic volatility with leverage: fast and
efficient likelihood inference. Journal of Econometrics 140, 425–449.
Parzen, E. (1962). On estimation of a probability density function and mode. The Annals of Mathematical
Statistics 33, 1065–1076.
Percival, D. B. and A. T. Walden (1993). Spectral Analysis for Physical Applications. Cambridge University
Press.
Percival, D. B. and A. T. Walden (2000). Wavelet Methods for Time Series Analysis. Cambridge: Cambridge
University Press.
Pitt, M. K. and N. Shephard (1999). Filtering via simulation: auxiliary particle filter. Journal of the American
Statistical Association 94, 590–599.
Plagborg-Moller, M. and C. K. Wolf (2021). Local projections and VARs estimate the same impulse response functions. Econometrica 89, 955–980.
Priestley, M. B. (1981). Spectral Analysis and Time Series. London: Academic Press.
Protter, P. (2010). Stochastic Integration and Differential Equations: A New Approach (Third ed.). New
York: Springer-Verlag.
Rambachan, A. and N. Shephard (2021). When do common time series estimands have nonparametric causal
meaning? Unpublished paper: Department of Economics, Harvard University.
Romano, J. P. and M. Wolf (2000). A more general central limit theorem for m-dependent random variables with unbounded m. Statistics and Probability Letters 47, 115–124.
Rootzén, H. (1978). Extremes of moving averages of stable processes. Annals of Probability 6, 847–869.
Rosenblatt, M. (1952). Remarks on a multivariate transformation. Annals of Mathematical Statistics 23,
470–472.
Rosenblatt, M. (1956). Remarks on some nonparametric estimates of a density function. The Annals of
Mathematical Statistics 27, 832–837.
Shannon, C. E. (1948). A mathematical theory of communication. The Bell System Technical Journal 27,
379–423.
Shephard, N. (1994). Partial non-Gaussian state space. Biometrika 81, 115–131.
Shephard, N. (2015). Martingale unobserved component models. In S. J. Koopman and N. Shephard (Eds.),
Unobserved components and time series econometrics, Chapter 10. Oxford University Press.
Shreve, S. (2004). Stochastic Calculus for Finance II: Continuous Time Models. Springer.
Sims, C. A. (1980). Macroeconomics and reality. Econometrica 48, 1–48.
Slutsky, E. E. (1927). The summation of random causes as the source of cyclic processes. Note: Published in
Russian in 1927 and reprinted in Econometrica in 1937.
Steele, J. M. (2001). Stochastic Calculus and Financial Applications. New York: Springer.
Stock, J. H. (1994). Unit roots, structural breaks and trends. In R. F. Engle and D. L. McFadden (Eds.),
Handbook of Econometrics, Volume 4, pp. 2739–2841. Elsevier.
Stock, J. H. and M. Watson (2016a). Core inflation and trend inflation. Review of Economics and Statistics 98,
770–784.
Stock, J. H. and M. Watson (2016b). Dynamic factor models, factor-augmented vector autoregressions, and
structural vector autoregressions in macroeconomics. In J. B. Taylor and H. Uhlig (Eds.), Handbook of
Macroeconomics, Volume 2A, pp. 415–525. Elsevier.
Stock, J. H. and M. W. Watson (2007). Why has U.S. inflation become harder to forecast? Journal of Money,
Credit, and Banking 39, 3–34.
Subba Rao, S. (2022). A course in time series analysis. Unpublished: Texas A&M.
Taylor, S. J. (1982). Financial returns modelled by the product of two stochastic processes — a study of
daily sugar prices 1961-79. In O. D. Anderson (Ed.), Time Series Analysis: Theory and Practice, 1, pp.
203–226. Amsterdam: North-Holland.
Tibshirani, R. J. (2014). Adaptive piecewise polynomial estimation via trend filtering. The Annals of Statis-
tics 42, 285–323.
Watson, M. W. (2007). How accurate are real-time estimates of output trends and gaps? Economic Quar-
terly 93, 143–161.
West, M. and J. Harrison (1989). Bayesian Forecasting and Dynamic Models. New York: Springer-Verlag.
Whittaker, E. T. (1923). A new method of graduation. Proceedings of the Edinburgh Mathematical Society 41,
63–75.
Widder, D. V. (1946). The Laplace Transform. Princeton: Princeton University Press.
Williams, D. (1991). Probability with Martingales. Cambridge: Cambridge University Press.
Wold, H. (1938). A Study in the Analysis of Stationary Time Series. Uppsala: Almqvist and Wiksell.
Young, L. C. (1936). An inequality of the Hölder type, connected with Stieltjes integration. Acta Mathemat-
ica 67, 251–282.
Yule, G. U. (1921). On the time-correlation problem, with special reference to the variate-difference correlation
method. Journal of the Royal Statistical Society 84, 497–526.
Yule, G. U. (1926). Why do we sometimes get nonsense-correlations between time series? a study in sampling
and the nature of time series. Journal of the Royal Statistical Society 89, 1–63.
Yule, G. U. (1927). On a method for investigating periodicities in disturbed series with special reference to Wolfer's sunspot numbers. Philosophical Transactions of the Royal Society of London, Series A 226, 267–298.