Modern Time Series:
Description, Prediction and Causality

Neil Shephard

December 8, 2023

©2023 Neil Shephard
Contents

1 Introduction
  1.1 Overview
  1.2 Describe
      1.2.1 Lags and differences
  1.3 Predict
  1.4 Causality
  1.5 Recap

2 Time series model
  2.1 Introduction
  2.2 Prediction decomposition
  2.3 Martingale and martingale difference
      2.3.1 Definitions
      2.3.2 Five properties
      2.3.3 A taste of martingale differences in economics
      2.3.4 A taste of martingale differences in statistics
  2.4 Probability integral transform — replication
  2.5 Estimand, statistic, estimator and estimate
      2.5.1 Estimand, statistic, estimator and estimate
      2.5.2 Parametric model
      2.5.3 Kullback-Leibler divergence
      2.5.4 Parametric models and MLE
  2.6 Recap

3 Stationarity
  3.1 Strict stationarity and marginal distributions
  3.2 Moving average process
  3.3 Covariance stationarity
      3.3.1 Definition
      3.3.2 Examples of covariance stationarity
      3.3.3 Covariance stationarity and the MA(∞)
      3.3.4 Covariance stationarity and the sample mean
  3.4 Method of moments
      3.4.1 Covariance stationarity
      3.4.2 Inference under martingale differences
  3.5 Invertibility
      3.5.1 Invertibility of moving average
      3.5.2 Lag operator and lag polynomial
  3.6 Recap

4 Memory
  4.1 Markov process
      4.1.1 Basic case
      4.1.2 Companion form and K-th order Markov processes
  4.2 Autoregression
      4.2.1 AR(p) and VAR(p)
      4.2.2 Autoregressions, the MA(∞) and covariance stationarity
  4.3 m-dependence
      4.3.1 Definition
      4.3.2 m-dependence CLT
  4.4 Integration and differencing
      4.4.1 Random walk and integration
      4.4.2 Cointegration
      4.4.3 Lévy, Brownian and Poisson processes
  4.5 Inference and linear autoregression
      4.5.1 Three versions of an autoregression
      4.5.2 Least squares and AR(1)
      4.5.3 Properties of ϕ̂_LS
  4.6 Recap

5 Action: Describing via filtering and smoothing
  5.1 Seasonality
      5.1.1 Quantifying seasonality
      5.1.2 Fourier representation of a seasonal function
      5.1.3 Stochastic seasonal
      5.1.4 Stochastic cycle
  5.2 Stochastic trend
      5.2.1 Quantifying trend
      5.2.2 Stochastic trend
      5.2.3 Trend estimation
      5.2.4 Local level and local linear trend
  5.3 Hidden Markov models
      5.3.1 Introduction
      5.3.2 Three local level models: one is a HMM
      5.3.3 Stochastic volatility and the HMM
      5.3.4 Some other HMM models
  5.4 Filtering and smoothing
      5.4.1 Big picture
      5.4.2 HMM
  5.5 Computation methods
      5.5.1 Kalman filter
      5.5.2 Inference and Gaussian HMM
      5.5.3 Inference and conditionally Gaussian HMM
      5.5.4 Particle filter
  5.6 Recap

6 Linearity
  6.1 Frequency domain time series
      6.1.1 A regression model
      6.1.2 Some background
      6.1.3 Writing the regression model using complex variables
  6.2 Fourier transform of data
      6.2.1 Core ideas
      6.2.2 Setup
      6.2.3 Main event for the data
  6.3 Population quantities: building the spectral density
      6.3.1 Core ideas
      6.3.2 Main event for the population
      6.3.3 Spectral density
      6.3.4 Band filter
  6.4 Estimating the spectral density
      6.4.1 Core ideas
      6.4.2 Main event for spectrum
      6.4.3 Long AR approach
      6.4.4 Nonparametric estimation
  6.5 Wold representation
      6.5.1 Decomposing using martingales
      6.5.2 Decomposing using best linear projections
  6.6 Kalman filter as a linear projection
  6.7 Recap
  6.8 Appendix: Linear projection
      6.8.1 Building the linear projection
      6.8.2 Some properties of linear projections
      6.8.3 Updating from X to the (X, Z)

7 Action: Causality
  7.1 Causality and time series assignments
      7.1.1 Causal time series system
      7.1.2 m-order causality
      7.1.3 Stationary m-order causality
      7.1.4 Sequential assignment
      7.1.5 Estimating under sequential assignment
  7.2 Structural models: SVAR and SVMA
      7.2.1 One model, three rewrites
      7.2.2 Identification and the SVMA
      7.2.3 Wold decomposition and the above
      7.2.4 Other Factor augmented SVAR
      7.2.5 Synthetic control
  7.3 Markov decision process
      7.3.1 A preamble to MDP
      7.3.2 Defining a MDP
  7.4 Stochastic dynamic program
  7.5 Reinforcement learning
  7.6 Recap

8 Stochastic integration and time series
  8.1 Background
      8.1.1 A càdlàg function
      8.1.2 Finite p-variation
      8.1.3 Riemann–Stieltjes integral
      8.1.4 Local time
  8.2 Brownian motion
      8.2.1 Definition
      8.2.2 Continuous sample path, non-differentiable and infinite length
  8.3 Stochastic integration
      8.3.1 Simple process
      8.3.2 Limit of Ito integrals of simple processes
      8.3.3 Quadratic covariation
Chapter 1

Introduction

1.1 Overview

Welcome to the cross-listed course on Time Series, which is labelled Stat242 and Econ2142.

Recall the main goals of statistics are

• describing;

• predicting;

• causality.

The main distinctive feature of time series is the role of the past and future, but many aspects of the above
goals carry over.

• how data X in the past might help predict data Y in the future;

• how changing some assignment A might impact the future outcome Y.

Learn through

• theory and model building;

• simulation;

• applications.

Data ordered in time:
\[
y_1, \ldots, y_T .
\]
Often $\dim(y_t) > 1$, which is called a multivariate time series.


Remark 1. There are two academic disciplines which look at sequences of data in time: "time series" and "stochastic processes." They share many features and it is rather sad there is not just a single topic. Courses and writing about time series tend to be rather more statistical and applied, while stochastic processes tend to be focused on probability theory. There are a couple of dedicated time series journals: the "Journal of Time Series Analysis" and the "Journal of Forecasting." Of course you will see time series papers published in Nature, Econometrica, the Journal of Econometrics, etc. Leading stochastic process journals are "Stochastic Processes and Their Applications," "Stochastics" and "Statistical Inference for Stochastic Processes," while you will also see stochastic process papers in general probability journals like the "Annals of Probability" and the "Annals of Applied Probability." There has been a surge of time series work on global warming type problems in the last decade, e.g. leading econometricians such as Jim Stock, Lars Hansen, David Hendry and Frank Diebold have been working in that area. More traditionally, time series is seen mostly in academic economics in finance and macroeconomics. Outside economics, it is widely seen in engineering (a modern example is the automated driving of cars).

I will teach a general time series course, but my applied interest is in economics. There will be a bias towards

economic problems, both in theory and applications. I tend to think of things from a time series viewpoint,
but for a time series person I nudge towards stochastic processes due to my background in continuous time
processes.

1.2 Describe

It is sometimes helpful to think about

• temporal averages,
\[
\bar{y} = \frac{1}{T}\sum_{t=1}^{T} y_t ,
\]
or quantile versions;

• the chance of two downs in consecutive periods,
\[
\frac{1}{T}\sum_{t=1}^{T} 1(y_t < 0) 1(y_{t-1} < 0).
\]
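As a minimal R sketch of these two descriptive statistics (the series y is simulated here, purely a stand-in for observed data):

    # a minimal sketch: y is simulated, standing in for an observed series
    set.seed(1)
    T <- 500
    y <- cumsum(rnorm(T))            # a random walk, purely for illustration
    mean(y)                          # temporal average
    quantile(y, c(0.1, 0.5, 0.9))    # quantile versions
    dn <- y < 0                      # indicator of a "down" period
    mean(dn[-1] & dn[-T])            # consecutive downs (averaged over T - 1 pairs)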

1.2.1 Lags and differences

In many problems it is helpful to transform the data, to make it more meaningful. As well as the usual log,
square root and other transforms we see in Introductory statistics, there are some new and vital time series
transforms. They appear all over the subject and have some specialized notation.

• Lag: from $y_t$ back one period to $y_{t-1}$,
\[
L y_t = y_{t-1} .
\]
$L$ is called a lag operator. Each time it sees time, it goes back one period, e.g. $Lt = t - 1$, $L t^2 = (t-1)^2$. The lag operator can be applied multiple times, so for integer $S$,
\[
L^S y_t = y_{t-S} ,
\]
e.g. $L^2 y_t = y_{t-2}$ and $L^{-2} y_t = y_{t+2}$.

• Difference:
\[
\Delta y_t = y_t - y_{t-1} = (1 - L) y_t .
\]

• Logs (of strictly positive time series) and differences combine in important ways:
\[
\Delta \log y_t = \log y_t - \log y_{t-1} = \log(y_t / y_{t-1}) = \log\left(1 + \frac{y_t - y_{t-1}}{y_{t-1}}\right).
\]
Now
\[
100 \, \frac{y_t - y_{t-1}}{y_{t-1}}
\]
is the percentage change. If $(y_t - y_{t-1})/y_{t-1}$ is small, then, by a Taylor expansion,
\[
\Delta \log y_t \simeq \frac{y_t - y_{t-1}}{y_{t-1}} .
\]
So $100 \Delta \log y_t$ is roughly the percentage change.

• A more subtle form of difference is to go back multiple lags, e.g.
\[
\Delta_2 y_t = y_t - y_{t-2} = (y_t - y_{t-1}) + (y_{t-1} - y_{t-2}) = \Delta y_t + \Delta y_{t-1} .
\]
This type of difference cumulates first differences. More generally, for integer $S > 1$,
\[
\Delta_S y_t = y_t - y_{t-S} = \Delta y_t + \Delta y_{t-1} + \ldots + \Delta y_{t-S+1} .
\]



[Figure 1.1 about here: time series plot of the CPI price index against time, 1947 to 2020.]

Figure 1.1: Non-seasonally adjusted US CPI index from 1947 onwards.

Example 1.2.1. (Inflation) Let $P_t$ be the time-$t$ monthly price index, plotted in Figure 1.1 from 1947: the non-seasonally adjusted CPI, which is labelled CPIAUCNS. Then
\[
100 \, \frac{P_t - P_{t-12}}{P_{t-12}}
\]
is the annual percentage change, which is the traditional way of computing the time-$t$ inflation rate. Thus
\[
100 \, \frac{\Delta_{12} P_t}{P_{t-12}} = 100 \, \frac{\Delta P_t + \ldots + \Delta P_{t-11}}{P_{t-12}} .
\]
This is not very pretty, as this is not the sum of monthly inflation
\[
100 \, \frac{\Delta P_t}{P_{t-1}} .
\]
Could convert through
\[
\frac{\Delta_{12} P_t}{P_{t-12}} = \left(\frac{\Delta P_t}{P_{t-1}} \times \frac{P_{t-1}}{P_{t-12}}\right) + \left(\frac{\Delta P_{t-1}}{P_{t-2}} \times \frac{P_{t-2}}{P_{t-12}}\right) + \ldots + \frac{\Delta P_{t-11}}{P_{t-12}} ,
\]
which is a weighted version of the monthly inflations. If annual inflation is low these weights should all be close to one, but for high inflation the weights could be substantially larger than one.
[Figure 1.2 about here: two panels plotting monthly (LHS) and annual (RHS) geometric inflation against time.]

Figure 1.2: LHS: Monthly geometric inflation in U.S.A. from 1947, $100\Delta \log P_t$. RHS: Annual geometric inflation in U.S.A. from 1947, $100\Delta_{12} \log P_t$.

A mathematically more attractive way of thinking about the time series of inflation is to work with
\[
100 \Delta_{12} \log P_t = 100 \left(\Delta \log P_t + \ldots + \Delta \log P_{t-11}\right) = 100 \left(\log P_t - \log P_{t-12}\right), \quad \text{a telescoping sum},
\]
so that monthly inflation aggregates to yearly inflation. The left hand side of Figure 1.2 shows $100\Delta \log P_t$, while the right hand side shows $100\Delta_{12} \log P_t$. Summary measures include the average annual geometric inflation rate
\[
\frac{1}{T}\sum_{t=S+1}^{T} 100 \Delta_S \log P_t \simeq 3.43,
\]
while the corresponding quantiles show inflation is historically skewed:
\[
Q(0.1) \simeq 0.97, \quad Q(0.5) \simeq 2.82, \quad Q(0.9) \simeq 7.2 .
\]
Recent inflation targeting has been at 2%, but inflation has been above that frequently:
\[
\frac{1}{T}\sum_{t=S+1}^{T} 1(100 \Delta_S \log P_t > 2) \simeq 0.67 .
\]

Remark 2. I often download data through APIs. In R there is a nice package quantmod which allows access to FRED (the St. Louis Fed's free economic database service), Google, Yahoo and many other databases. Once this is installed, the data is obtained through the commands

    library(quantmod)
    getSymbols(Symbols = 'CPIAUCNS', src = "FRED")   # not seasonally adjusted
    head(CPIAUCNS, 15)   # head() gets us the first rows of the dataset
    tail(CPIAUCNS, 15)   # tail() gets us the last rows of the dataset
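Continuing this sketch (assuming the download above succeeded), the inflation measures of Example 1.2.1 can be reproduced along these lines:

    # a sketch continuing Remark 2; assumes CPIAUCNS was fetched as above
    pi12 <- as.numeric(100 * diff(log(CPIAUCNS), lag = 12))  # 100 Delta_12 log P_t
    mean(pi12, na.rm = TRUE)                         # average annual geometric inflation
    quantile(pi12, c(0.1, 0.5, 0.9), na.rm = TRUE)   # the skewed history
    mean(pi12 > 2, na.rm = TRUE)                     # fraction of months above 2%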

Example 1.2.2. (Stock returns) Example 1.2.1 is also important in thinking about the price (plus any reinvested dividends) $P_t$ of a risky asset at time $t$. Then over $S$ periods,
\[
100 \, \frac{P_t - P_{t-S}}{P_{t-S}}
\]
is the "percentage return." Think of $S = 2$; then
\[
\frac{\Delta_2 P_t}{P_{t-2}} = \frac{P_t - P_{t-2}}{P_{t-2}} = \frac{\Delta P_t + \Delta P_{t-1}}{P_{t-2}} = \left(\frac{\Delta P_t}{P_{t-1}} \times \frac{P_{t-1}}{P_{t-2}}\right) + \frac{\Delta P_{t-1}}{P_{t-2}} .
\]
Think of the case
\[
\frac{\Delta P_t}{P_{t-1}} = 0.05 \quad \text{but} \quad \frac{\Delta P_{t-1}}{P_{t-2}} = -0.05;
\]
then the impact of the weight $P_{t-1}/P_{t-2} < 1$ means that
\[
\frac{\Delta_2 P_t}{P_{t-2}} < 0 .
\]
Hence early negative returns outweigh later positive returns of the same size. This can be super confusing for an investor. Reporting returns through
\[
100 \Delta_S \log P_t = 100 \left(\Delta \log P_t + \ldots + \Delta \log P_{t-S+1}\right)
\]
removes this problem. These $\{100 \Delta \log P_t\}$ are called geometric returns in finance — these sum up over time. The $\{100 \Delta_S P_t / P_{t-S}\}$ are called arithmetic returns.
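A quick R check of this $S = 2$ story, with made-up prices (down 5% then up 5%):

    # made-up prices: down 5% then up 5%
    p <- c(100, 95, 99.75)                   # P_{t-2}, P_{t-1}, P_t
    100 * diff(p) / p[-3]                    # arithmetic period returns: -5, +5
    100 * (p[3] - p[1]) / p[1]               # yet the two-period return is -0.25
    g <- 100 * diff(log(p))                  # geometric period returns
    sum(g) - 100 * (log(p[3]) - log(p[1]))   # exactly 0: geometric returns telescope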

1.3 Predict

A prime goal of applied researchers using time series data is to make forecasts — what comes next?

☣ 1.3.1. ChatGPT is a large language model, a version of natural language processing. The statistical basis of this is to take some text (user input) and to predict what would come next using a corpus of existing text (data). The intellectual origins of this are due to Shannon (1948), who thought of this as a time series type problem — predicting future text from past text.

A traditional way of thinking about this is to use data to estimate statistically, e.g.,
\[
E[Y_t | Y_0, \ldots, Y_{t-1}].
\]
Of course, for some purposes computing the conditional median or alike might be better, but let us think about that later.

The conditional expectation sure looks like the kind of problems we saw in Introductory Statistics, where X is now past data, and Y is now future data. But there we saw many pairs of random variables (X, Y) in order to produce a good E[Y|X]. In time series we do not have that replication, we only have one series. This is a fundamental challenge which means applied researchers are likely to have to make stronger assumptions than you see in Introductory Statistics.

Example 1.3.2. Think about predicting next month's official inflation number. Recall from Example 1.2.1 it is
\[
\frac{\Delta_{12} P_t}{P_{t-12}} = \left(\frac{\Delta P_t}{P_{t-1}} \times \frac{P_{t-1}}{P_{t-12}}\right) + \left(\frac{\Delta P_{t-1}}{P_{t-2}} \times \frac{P_{t-2}}{P_{t-12}}\right) + \ldots + \frac{\Delta P_{t-11}}{P_{t-12}} .
\]
This looks mighty complicated but we predict it given what we know at time $t-1$ — and we know a lot! One approach is to think about
\begin{align*}
& E\left[\frac{\Delta_{12} P_t}{P_{t-12}} \,\Big|\, (P_1, \ldots, P_{t-1}) = (p_1, \ldots, p_{t-1})\right] \\
&\quad = \left(\frac{E[P_t | (P_1, \ldots, P_{t-1})] - p_{t-1}}{p_{t-1}} \times \frac{p_{t-1}}{p_{t-12}}\right) + \left(\frac{\Delta p_{t-1}}{p_{t-2}} \times \frac{p_{t-2}}{p_{t-12}}\right) + \ldots + \frac{\Delta p_{t-11}}{p_{t-12}} .
\end{align*}
Hence the only thing missing is $E[\frac{P_t - p_{t-1}}{p_{t-1}} | (P_1, \ldots, P_{t-1})]$, the last month's inflation.

Write $Y_t = \frac{P_t - p_{t-1}}{p_{t-1}}$ and then $y_{t-s} = \Delta p_{t-s} / p_{t-s-1}$ for $s = 1, 2, \ldots$. A famous forecast is
\[
y_{t-1} = \frac{\Delta p_{t-1}}{p_{t-2}},
\]
the last data point we saw. Of course many other forecasts are possible and we will explore some of them.

1.4 Causality

In Introductory Statistics a classical way of quantifying causality is to think about the instant impact on an outcome of taking action $a = 1$ instead of action $a = 0$ at time $t - s$. The corresponding pair of outcomes at time $t$ under these possible actions, $\{Y_{t,s}(0), Y_{t,s}(1)\}$ (here $Y_{t,s}(1)$ corresponds to action $a = 1$ having been taken at time $t - s$), are called lag-$s$ "potential outcomes" (the idea of potential outcomes is due to Neyman (1923)), while the lag-$s$ average treatment effect is
\[
\tau_{t,s} = E[Y_{t,s}(1)] - E[Y_{t,s}(0)].
\]


In practice we cannot see both potential outcomes, so we have no choice but to select a single action $A_{t-s} \in \{0, 1\}$, and then see the outcome $Y_t = Y_{t,s}(A_{t-s})$. One way to progress is to use a lag-$s$ sequential randomization assumption:
\[
A_{t-s} \perp\!\!\!\perp \{Y_{t,s}(0), Y_{t,s}(1)\}.
\]
This is a simple assumption to write down; understanding it takes some thought! It will be discussed extensively in Chapter 7, where this condition will also be substantially weakened.


Build up the quantity
\begin{align*}
E[Y_{t,s}(1)] &= E[Y_{t,s}(1) | A_{t-s} = 1], \quad \text{lag-}s \text{ sequential randomization} \\
&= E[Y_{t,s}(A_{t-s}) | A_{t-s} = 1] \\
&= E[Y_t | A_{t-s} = 1] \\
&= \mu_{t,s}(1),
\end{align*}
where, generically,
\[
\mu_{t,s}(a) = E[Y_t | A_{t-s} = a],
\]
a nonparametric regression of outcomes on lagged assignments. This would imply
\[
\tau_{t,s} = \mu_{t,s}(1) - \mu_{t,s}(0),
\]
the difference of two nonparametric regressions (a toy simulation sketch follows).
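Here is that toy R sketch (all numbers invented): the assignment really is randomized, so the difference of the two regressions recovers the true effect, here set to 1.

    # toy sketch: randomized lag-1 assignment, tau estimated by a difference of means
    set.seed(2)
    T <- 10000
    A <- rbinom(T, 1, 0.5)           # randomized assignments A_t
    Y0 <- rnorm(T)                   # potential outcomes Y_{t,1}(0)
    Y1 <- Y0 + 1                     # potential outcomes Y_{t,1}(1): true tau = 1
    Alag <- c(NA, A[-T])             # A_{t-1}
    Y <- ifelse(Alag == 1, Y1, Y0)   # observed outcome Y_t = Y_{t,1}(A_{t-1})
    mean(Y[Alag == 1], na.rm = TRUE) - mean(Y[Alag == 0], na.rm = TRUE)  # near 1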

1.5 Recap


This Chapter has mapped out some of the major ideas and challenges in time series. The main topics covered are listed in Table 1.1.

Formula or idea      Description or name
---------------      ----------------------------
A_t                  Assignment
Y_t - Y_{t-1}        Difference
Δ = 1 - L            Difference operator
L                    Lag operator
Y_{t,s}(a)           Lag-s potential outcome
Y_t                  Outcome
Δ_S = 1 - L^S        Seasonal difference operator

Table 1.1: Main ideas and notation in Chapter 1.


Chapter 2

Time series model

2.1 Introduction

What is a time series? A simple view of this is that it is some data
\[
y_1, \ldots, y_T
\]
ordered in time, which we might plot. But where does this data come from?

I will answer this question using the language of probability theory, defining a time series model. As usual, I will use upper case letters to denote random variables, lower case letters to denote data or arguments in integrals, etc., and bold for vectors and matrices. This can be confusing, as sometimes this notation will clash with the linear algebra convention of using capital letters for matrices and lower case for vectors and scalars. Further, I will use Y as my prime notation for a random variable, as is standard in statistics, while noting X is predominately used for the same purpose in probability theory.

☣ 2.1.1. The probabilistic viewpoint of time series is not the only viable approach. There is also an algorithmic
tradition of filters and optimization which intersects with the time series model viewpoint. I will discuss some
aspects of this in later Chapters.

The time series model views the data as a realization of a sequence of random variables. Sometimes it is
helpful to go outside the range of data we see, e.g. for the purposes of forecasting or asymptotic arguments.

Definition 2.1.2. A time series model views the data
\[
\mathbf{y} = (y_1, \ldots, y_T)
\]
as a single realization (draw) of length $T$ from the joint distribution of a sequence of random variables
\[
\mathbf{Y} = (Y_1, \ldots, Y_T),
\]
ordered in time $t = 1, \ldots, T$. Write $\dim(y_t) = d_t \geq 1$. The sequence $\mathbf{Y}$ is written compactly as $\{Y_t\}_{t=1}^{T}$ and the subsequence from time $s$ to time $t$ as $Y_{s:t} = (Y_s, Y_{s+1}, \ldots, Y_t)$, where $s \in \{1, 2, \ldots, t\}$. Sometimes it is helpful to think of $\mathbf{Y}$ as a single snippet of a longer $\{Y_t\}_{t=1}^{T+s}$, where $s > 0$, or an infinitely lived time series, written as $\{Y_t\}_{t=-\infty}^{\infty}$ or $\{Y_t\}_{t \in \mathbb{Z}}$.

Remark 3. [Calendar time] The definition of a time series model makes no reference to the calendar times at which the $Y_{1:T}$ appear, only that these random variables appear in sequence. In some applied contexts the times are crucial, e.g. times at which trades in a financial market happen. In general we will denote the corresponding times as
\[
\tau_1 < \tau_2 < \ldots < \tau_T .
\]
Sometimes scientifically it is helpful to think of the times as random variables, and include these times as an element of the time series
\[
Y_t = (\tau_t, X_t^\top)^\top, \quad t = 1, \ldots, T,
\]
e.g. $X_t$ is bivariate, the price and volume of the $t$-th transaction, while $\tau_t$ is the time of the $t$-th trade. But for most problems this level of generality is not needed and the times are regarded as a deterministic sequence, e.g. every month, and $Y_t$ does not refer to the calendar time explicitly.

☣ 2.1.3. The infinitely lived version is rarely literally believable, at least in the social and medical sciences, e.g.

the GDP of Belgium literally cannot exist after the Earth is swallowed by the expanding Sun. It is important
to not unthinkingly use such an assumption.

Example 2.1.4. An example of the use of a longer series is to focus on the law of $Y_{T+1} | (Y_{1:T} = y_{1:T})$, the distribution of the next variable in the sequence extrapolated beyond the single realization $y_{1:T}$ of $Y_{1:T}$.

The definition of a time series model excludes the possibility that Yt is directly a text or an image, but data
of that kind can be potentially included by preprocessing the text or image into a real variable.

Example 2.1.5. ChatGPT output is a forecast from a very high-dimensional time series model, coding input text to numbers.

☣ 2.1.6. [Fundamental challenge of time series: no easy replication] $y_1$ is the single data point from the random variable $Y_1$. There is no other data point with the same law as $Y_1$ — unless some additional assumptions are made. Likewise the pair $y_{1:2}$ is the only pair from the joint law of $Y_{1:2}$. This lack of straightforward replication is the fundamental statistical challenge of time series and places it apart from introductory statistics. This is summarized by the warning:

    There is no easy replication in time series without assumption.

The ordering in time makes it distinct from other dependent data such as from networks, spatial structures and images.

From the definition of expectations for continuous random variables, if it exists, the
\[
E[g(Y_{t_1}, \ldots, Y_{t_p})] = \int g(y_1, \ldots, y_p) f_{Y_{t_1}, \ldots, Y_{t_p}}(y_1, \ldots, y_p) \, dy_1 \ldots dy_p ,
\]
where $f_{Y_{t_1}, \ldots, Y_{t_p}}$ is the density of $Y_{t_1}, \ldots, Y_{t_p}$ and $(t_1, \ldots, t_p) \in \{1, \ldots, T\}^p$. The equivalent result for discrete variables can be obtained using probability mass functions.

More abstractly, the cumulative distribution function $F_{\mathbf{Y}}$ of $\mathbf{Y}$ is derived from the probability triple $(\Omega, \mathcal{F}, P)$. An important condition, which appears frequently in model building and estimation, is the existence of means and variances for each $t = 1, \ldots, T$.

Definition 2.1.7. The univariate process $\{Y_t\}$ is bounded in $L^1$ if
\[
\sup_t E[|Y_t|] < \infty
\]
("integrable") and $\{Y_t\}$ is bounded in $L^2$ if
\[
\sup_t E[Y_t^2] < \infty
\]
("square integrable"). Here the sup is over time $t$, typically $t \in \{1, \ldots, T\}$ but for asymptotics it could be $t \in \mathbb{N}$. The space of square integrable random variables is sometimes written as $L^2(\Omega, \mathcal{F}, P)$. I will refer to these as $L^1$ and $L^2$ processes as shorthand.

☣ 2.1.8. In many application areas, time series data has very heavy tails and so it is important to be careful
about the existence of higher order moments.

☣ 2.1.9. Take a univariate process $\{Y_t\}$ which is in $L^1$, and think about
\[
E[Y_2 | (Y_1 = y_1)];
\]
this is, for a single $y_1$, a real number. Calculating this conditional expectation for a variety of possible values for $y_1$ yields a real function of $y_1$. But if I write
\[
E[Y_2 | Y_1]
\]
I now usually think of this as a random variable, as this expectation is just a function of $Y_1$ having integrated out the randomness in $Y_2$. You will often see in these notes terms like
\[
E[Y_t | Y_{1:t-1}]
\]
or
\[
f_{Y_t | Y_{1:t-1}}(y_t).
\]
In both cases these are random. The term
\[
f_{Y_t | (Y_{1:t-1} = y_{1:t-1})}(y_t)
\]
is not! Once in a while I will get bored writing things out so carefully, and express the last term as $f_{Y_t | Y_{1:t-1}}(y_t)$, even when I read it as a constant, e.g. in expressing a likelihood or carrying out filtering. I hope you will forgive me for using this more compact notation, where I think the context should make things clear. This may be particularly tricky when I use a standard abstract notation for conditioning on the past, $\mathcal{F}_{t-1}$. Then I will typically think of
\[
E[Y_t | \mathcal{F}_{t-1}]
\]
as random, but not always. Sorry!

Example 2.1.10. [Mean and variance of an average] For a time series model, define the statistic $\bar{Y} = T^{-1} \sum_{t=1}^{T} Y_t$. Then, if $\mathbf{Y}$ is in $L^1$, then $E[\bar{Y}]$ exists and is
\[
E[\bar{Y}] = \frac{1}{T}\sum_{t=1}^{T} E[Y_t].
\]
If $\mathbf{Y}$ is in $L^2$, then $\operatorname{Var}(\bar{Y})$ exists and
\[
\operatorname{Var}(\bar{Y}) = \frac{1}{T^2}\sum_{t=1}^{T}\sum_{s=1}^{T} \operatorname{Cov}(Y_t, Y_s)
= \frac{1}{T^2}\sum_{t=1}^{T} \operatorname{Var}(Y_t) + \frac{1}{T^2}\sum_{t=1}^{T}\sum_{s=t+1}^{T} \left\{ \operatorname{Cov}(Y_t, Y_s) + \operatorname{Cov}(Y_t, Y_s)^\top \right\}, \tag{2.1}
\]
as $\operatorname{Cov}(Y_s, Y_t) = \operatorname{Cov}(Y_t, Y_s)^\top$. Equation (2.1) is sometimes called Bienaymé's identity. There is no reason to expect $\operatorname{Var}(\bar{Y})$ to go to $0_{d,d}$ as $T$ increases for time series models, nor for $E[\bar{Y}]$ to be scientifically interesting. Additional assumptions will be needed for this to be true.
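A small R illustration of the last point (univariate, so the transposes drop out): for i.i.d. data $\operatorname{Var}(\bar{Y})$ shrinks like $1/T$, but for a random walk the covariance terms in (2.1) keep it large.

    # Var of the sample mean: shrinks for i.i.d. noise, not for a random walk
    set.seed(3)
    T <- 200; R <- 2000
    ybar_iid <- replicate(R, mean(rnorm(T)))
    ybar_rw  <- replicate(R, mean(cumsum(rnorm(T))))
    var(ybar_iid)   # about 1/T = 0.005
    var(ybar_rw)    # about T/3: the covariances in (2.1) do not die out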

2.2 Prediction decomposition

Prediction plays a distinctive role in time series model building, statistical analysis, decision making and applied

work. In particular, one-step ahead prediction has many probabilistic features which make it a very close cousin
to the role of independence in introductory statistics.

Definition 2.2.1. [One-step ahead prediction] The law of the conditional random variable
\[
Y_t | (Y_{1:t-1} = y_{1:t-1}), \quad t \in \{1, \ldots, T\},
\]
is called the (one-step ahead) time-$t$ predictive distribution of $Y_t$ given the past data $y_{1:t-1}$. This past data, or information set, plays a central role in time series.

Remark 4. The past data $y_{1:t-1}$ is often (particularly in probability theory) labeled the natural filtration and written (using $\sigma$ as the notation for a sigma-algebra) as
\[
\mathcal{F}_{t-1}^{Y} = \sigma(Y_{1:t-1}),
\]
or, more plainly, $\mathcal{F}_{t-1}$, if the context is obvious. Note $\mathcal{F}_{t-1}^{Y} \subseteq \mathcal{F}_t^{Y} = \sigma(Y_{1:t})$. The natural filtration is a special case of a general filtration or information set
\[
\mathcal{F}_{t-1}^{X} = \sigma(X_{1:t-1}).
\]
For $Y_t$ to be adapted with respect to $\mathcal{F}_t^{X}$, there must exist a function $h_t$ such that
\[
Y_t = h_t(X_{1:t}),
\]
i.e. glimpsing $X_{1:t}$ is enough to determine $Y_t$. Thus if $Y_t$ is adapted to $\mathcal{F}_t^{X}$, then $\mathcal{F}_{t-1}^{Y} \subseteq \mathcal{F}_{t-1}^{X}$. For the process $\{Z_t\}_{t \geq 1}$ to be previsible (sometimes the label "predictable" is used instead) the $Z_t$ must be determined by $\mathcal{F}_{t-1}^{X}$ for each $t \geq 1$.

By the definition of a conditional density
\[
f_{Y_1, Y_2}(y_1, y_2) = f_{Y_1}(y_1) f_{Y_2 | Y_1 = y_1}(y_2),
\]
so using the same idea, repeatedly, produces
\[
f_{Y_{1:T}}(y_{1:T}) = \prod_{t=1}^{T} f_{Y_t | (Y_{1:t-1} = y_{1:t-1})}(y_t), \tag{2.2}
\]
which is called the prediction decomposition of the joint law.

Recall, the joint distribution being equal to the product of marginal distributions is the defining characteristic of independence. In a time series, the sequence itself is not independent, but its one-step-ahead predictions
\[
Y_t | Y_{1:t-1}, \quad t = 1, \ldots, T,
\]
have an independence-like feature. The result (2.2) is profoundly important in the study of time series.

The prediction decomposition opens up an approach to parametric modeling of Yt |Y1:t−1 .

Example 2.2.2. Write $Y_t | (Y_{1:t-1} = y_{1:t-1}) \sim N(\phi y_{t-1}, \sigma^2)$, a Gaussian autoregression model where the parameters are $\phi, \sigma^2$. Write $Y_t | (Y_{1:t-1} = y_{1:t-1}) \sim N(0, \alpha + \beta y_{t-1}^2)$, a Gaussian autoregressive conditional heteroskedasticity model (Engle (1982)) where the parameters are $\alpha, \beta$.

Following the logic expressed in introductory statistics, it is often convenient, both mathematically and computationally, to work with logs of the joint density
\[
\log f_{Y_{1:T}}(y_{1:T}) = \log f_{Y_{1:T-1}}(y_{1:T-1}) + \log f_{Y_T | Y_{1:T-1}}(y_T) = \sum_{t=1}^{T} \log f_{Y_t | Y_{1:t-1}}(y_t),
\]
rather than the joint density itself, or in ratios, comparing two joint densities $g$ to $f$ for $Y_{1:T}$:
\[
LR_T := \frac{g_{Y_{1:T}}(y_{1:T})}{f_{Y_{1:T}}(y_{1:T})} = LR_{T-1} \times \Lambda(G||F)_T, \quad \text{where} \quad \Lambda(G||F)_t := \frac{g_{Y_t | Y_{1:t-1}}(y_t)}{f_{Y_t | Y_{1:t-1}}(y_t)}.
\]
The above ratio has the fundamental property that
\[
E[LR_T | (Y_{1:T-1} = y_{1:T-1})] = LR_{T-1},
\]
if $f_{Y_{1:T}}(y_{1:T}) > 0$ for all $y_{1:T}$ (more formally, $f$ absolutely dominates $g$) as
\[
E[\Lambda(G||F)_T | (Y_{1:T-1} = y_{1:T-1})] = \int \frac{g_{Y_T | Y_{1:T-1}}(y_T)}{f_{Y_T | Y_{1:T-1}}(y_T)} f_{Y_T | Y_{1:T-1}}(y_T) \, dy_T = 1. \tag{2.3}
\]
We will return to the statistical importance of these results in Section 2.5.3.

2.3 Martingale and martingale difference


2.3.1 Definitions

The result above holds for each $t$, so
\[
E[LR_t | (Y_{1:t-1} = y_{1:t-1})] = LR_{t-1}, \quad t = 1, \ldots, T, \tag{2.4}
\]
and is an example of a fundamental sequence in time: a martingale. This section defines these objects and studies some of their properties. Before we start, note that it is traditional to write (2.4) more compactly, removing the subscripts and using filtration notation, as
\[
E[LR_t | \mathcal{F}_{t-1}] = LR_{t-1}.
\]

Definition 2.3.1. The sequence $\{Y_t\}_{t \in \mathbb{N}_{>0}}$ is a martingale process with respect to an adapted filtration $\{\mathcal{F}_t\}_{t \in \mathbb{N}_{>0}}$, iff both
\[
\text{(a)} \quad E[|Y_t|] < \infty, \qquad \text{(b)} \quad E[Y_t | \mathcal{F}_{t-1}] = Y_{t-1}
\]
hold for every $t = 1, 2, \ldots$. If (b) switches to $E[Y_t | \mathcal{F}_{t-1}] = 0$, then it is a martingale difference. If (b) switches to $E[Y_t | \mathcal{F}_{t-1}] \leq Y_{t-1}$, then it is a supermartingale. If (b) switches to $E[Y_t | \mathcal{F}_{t-1}] \geq Y_{t-1}$, then it is a submartingale.
Example 2.3.2. The process $\left\{ \frac{g_{Y_{1:t}}(y_{1:t})}{f_{Y_{1:t}}(y_{1:t})} \right\}_{t \in \mathbb{N}_{>0}}$ is a martingale process with respect to the filtration generated by $\{Y_t\}_{t \in \mathbb{N}_{>0}}$, noting that $\frac{g_{Y_{1:t}}(y_{1:t})}{f_{Y_{1:t}}(y_{1:t})} \geq 0$ and
\[
E\left[ \frac{g_{Y_{1:t}}(Y_{1:t})}{f_{Y_{1:t}}(Y_{1:t})} \right] = 1, \quad \text{for every } t = 1, 2, \ldots.
\]

Definition 2.3.3. In the martingale difference case of Definition 2.3.1, if (a) is strengthened to (a′) $\sup_t E[Y_t^2] < \infty$, then $\{Y_t\}_{t \in \{1, \ldots, T\}}$ is a MD in $L^2$.

☣ 2.3.4. You might think that being i.i.d. and symmetrically distributed about 0 is enough to be a martingale difference — but this is not true (though i.i.d. plus $E[Y_t] = 0$ is enough). For example, if $\{Y_t\}_{t \in \mathbb{N}_{>0}}$ is an i.i.d. Cauchy sequence, then it is not a martingale difference sequence as $E[|Y_t|] = \infty$. Heavy tailed processes play a much more prominent role in time series than in introductory statistics: it is important to keep track of this seemingly technical condition.

2.3.2 Five properties

Martingales and martingale differences have many important properties. Here we focus on five.

MD and correlation

Here is the first. It is my favourite! Recall that previsible means that Ct ∈ Ft−1 , i.e. it is part of the past.

Theorem 2.3.5. For $\{Y_t\}_{t \in \mathbb{N}_{>0}}$, a martingale difference with respect to the filtration $\{\mathcal{F}_t\}_{t \in \mathbb{N}_{>0}}$, and a previsible scalar $C_t$ such that $E[|C_t Y_t|] < \infty$, then, for every $t$,
\[
E[C_t Y_t] = 0.
\]
Proof. $E[C_t Y_t]$ exists by assumption. Then, by the tower property and previsibility,
\[
E[C_t Y_t] = E[C_t E[Y_t | \mathcal{F}_{t-1}]] = E[C_t \times 0] = 0,
\]
which is the stated result.

The most famous special case of this is where $C_t = 1$, which implies that for all MDs (as $E[|Y_t|] < \infty$ is part of the definition of a MD)
\[
E[Y_t] = 0.
\]
This is a super useful property.

As $E[Y_t] = 0$, the Theorem means that $Y_t$ is uncorrelated (but not independent!) with the past:
\[
\operatorname{Cov}(C_t, Y_t) = 0,
\]
so long as $E[|C_t Y_t|] < \infty$. The most used case of this is $C_t = Y_{t-s}$ for a fixed integer $s > 0$. This means, for every $t$,
\[
\operatorname{Cov}(Y_{t-s}, Y_t) = 0, \quad s = 1, 2, \ldots, t-1, \tag{2.5}
\]
so long as $E[|Y_t Y_{t-s}|] < \infty$. By Hölder's inequality $E[|Y_{t-s} Y_t|] \leq E[|Y_t|^2]^{1/2} E[|Y_{t-s}|^2]^{1/2}$, so for MDs in $L^2$, $\operatorname{Cov}(Y_{t-s}, Y_t) = 0$ always holds.

Martingale transform theorem

The next result is remarkable. It shows that martingales reproduce, when combined with previsible processes. This result plays a huge role in the theory of finance and the theory of gambling.

Theorem 2.3.6. [Martingale transform theorem] Assume that (i) $\{C_t, Y_t\}_{t \in \mathbb{N}_{>0}}$ are adapted with respect to $\{\mathcal{F}_t\}_{t \in \mathbb{N}_{>0}}$; (ii) $\{Y_t\}_{t \in \mathbb{N}_{>0}}$ is a martingale with respect to $\{\mathcal{F}_t\}_{t \in \mathbb{N}_{>0}}$; (iii) $\{C_t\}_{t \in \mathbb{N}_{>0}}$ is previsible and bounded, that is $|C_t| < c$ for all $t \in \mathbb{N}_{>0}$. Then
\[
M_t := \sum_{j=1}^{t} C_j \Delta Y_j = (C \cdot Y)_t
\]
is a martingale with respect to $\{\mathcal{F}_t\}_{t \in \mathbb{N}_{>0}}$.

Proof. Bounded $\{C_t\}_{t \in \mathbb{N}_{>0}}$ means that $E[|C_t \Delta Y_t|] < \infty$, so the only issue is
\begin{align*}
E[C_t \Delta Y_t | \mathcal{F}_{t-1}] &= C_t E[\Delta Y_t | \mathcal{F}_{t-1}], \quad \text{previsible,} \\
&= 0, \quad \text{martingale } \{Y_t\}_{t \in \mathbb{N}_{>0}}.
\end{align*}

Example 2.3.7. Suppose $\{Y_t\}$ is the value of a risky asset:
\[
Y_t = Y_{t-1}\eta_t, \quad \eta_t \text{ i.i.d.}, \quad E[\eta_t] = 1, \quad Y_0 > 0, \quad \eta_t \geq 0, \quad t = 1, 2, \ldots.
\]
Then
\begin{align*}
E[Y_t | \mathcal{F}_{t-1}] &= Y_{t-1} E[\eta_t | \mathcal{F}_{t-1}], \quad Y_{t-1} \text{ is in } \mathcal{F}_{t-1} \\
&= Y_{t-1} E[\eta_t], \quad \eta_t \text{ i.i.d. so } \eta_t \text{ independent from history} \\
&= Y_{t-1}, \quad \text{assumption } E[\eta_t] = 1,
\end{align*}
so it is a martingale. Let $C_t$ be the (bounded) share of the risky asset you hold at time $t$! Then $C_t$ must be decided at time $t-1$, so $C_t \in \mathcal{F}_{t-1}$, i.e. it is previsible (this rules out knowledge of $\eta_t$ when the investment is made, that would be insider knowledge!). Thus the change in the value of your investment during the $t$-th period is
\[
C_t \Delta Y_t .
\]
Thus the value of the investment account, through time, is
\[
M_t = M_0 + \sum_{j=1}^{t} C_j \Delta Y_j,
\]
which is a martingale with respect to $\{\mathcal{F}_t\}$. The main beauty of this result is it abstracts from the details of the investment strategy: all that matters is the timing of the investment — it is previsible.
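A small simulation sketch of this example (the lognormal choice of η_t is invented for illustration): whatever bounded previsible strategy C_t is used, the expected gain stays at zero.

    # sketch: a previsible strategy cannot shift the mean of a martingale
    set.seed(4)
    one_path <- function(T = 100) {
      eta  <- rlnorm(T, meanlog = -0.005, sdlog = 0.1)  # E[eta_t] = 1, eta_t >= 0
      Y    <- 100 * cumprod(eta)                        # risky asset, Y_0 = 100
      Ylag <- c(100, Y[-T])
      C    <- as.numeric(Ylag < 100)   # previsible rule: hold only when "cheap"
      sum(C * (Y - Ylag))              # M_T - M_0 = sum_t C_t * Delta Y_t
    }
    mean(replicate(5000, one_path()))  # close to 0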

Weak law of large numbers

Here is the third property. An implication of the $\operatorname{Cov}(Y_{t-s}, Y_t) = 0$ result is a weak law of large numbers for time series.

Theorem 2.3.8. [WLLN for MDs] If $\{Y_t\}_{t \in \mathbb{N}}$ is a martingale difference in $L^2$ with respect to the filtration $\{\mathcal{F}_t\}_{t \in \mathbb{N}_{>0}}$, then
\[
E[\bar{Y}] = 0, \quad \operatorname{Var}(\bar{Y}) = \frac{1}{T^2}\sum_{t=1}^{T} \operatorname{Var}(Y_t). \tag{2.6}
\]
Then as $T \to \infty$, the
\[
\bar{Y} \overset{p}{\to} 0.
\]
Proof. Go back to the variance of the sample average (2.1) and apply the $\operatorname{Cov}(Y_{t-s}, Y_t) = 0$ result for martingale differences in $L^2$. This yields (2.6). Then
\[
\operatorname{Var}(\bar{Y}) \leq \frac{\sup_t \operatorname{Var}(Y_t)}{T}.
\]
The property that $\sup_t \operatorname{Var}(Y_t) < \infty$ comes from the $L^2$ assumption. Then $\operatorname{Var}(\bar{Y})$ is driven to 0, and Chebyshev's inequality yields the convergence result.

Note, of course,
\[
\bar{Y} = \frac{1}{T}\sum_{t=1}^{T} Y_t = \frac{1}{T} M_T, \quad M_t = \sum_{j=1}^{t} Y_j,
\]
where $\{M_t\}_{t \in \mathbb{N}_{>0}}$ is a zero-mean martingale. This means that
\[
\frac{1}{t} M_t \overset{p}{\to} 0,
\]
as $t$ gets large. This is a profoundly simplifying result for many problems.
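A simulation sketch of the WLLN (an ARCH-type construction with invented coefficients, in the spirit of Example 2.2.2): the sequence is dependent, yet it is a martingale difference in $L^2$, so its sample mean still collapses to zero.

    # sketch: dependent but uncorrelated MD sequence; sample mean -> 0
    set.seed(5)
    T <- 100000
    y <- numeric(T); y[1] <- rnorm(1)
    for (t in 2:T)
      y[t] <- rnorm(1, sd = sqrt(0.5 + 0.5 * y[t - 1]^2))  # E[y_t | past] = 0
    mean(y)                # near 0, as the WLLN for MDs predicts
    cor(y[-1]^2, y[-T]^2)  # clearly positive: dependence without correlation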



Doob decomposition

Here is the fourth property. How do martingales relate to a broader time series $\{Y_t\}_{t \in \mathbb{N}_{>0}}$ which is adapted to the filtration $\{\mathcal{F}_t\}_{t \in \mathbb{N}_{>0}}$ and has $E[|Y_t|] < \infty$ for all $t$? The Doob (1953) decomposition says that
\[
Y_t = A_t + M_t, \quad A_t = \sum_{j=1}^{t} \left\{ E[Y_j | \mathcal{F}_{j-1}] - Y_{j-1} \right\}, \quad M_t = Y_0 + \sum_{j=1}^{t} \left\{ Y_j - E[Y_j | \mathcal{F}_{j-1}] \right\},
\]
almost surely, where the splitting into $\{A_t\}_{t \in \mathbb{N}_{>0}}$ and $\{M_t\}_{t \in \mathbb{N}_{>0}}$ is unique. The $\{M_t\}_{t \in \mathbb{N}_{>0}}$ is a martingale with $M_0 = Y_0$, and the $\{A_t\}_{t \in \mathbb{N}_{>0}}$ is previsible. It is sometimes useful to call $\{E[Y_t | \mathcal{F}_{t-1}] - Y_{t-1}\}_{t \in \mathbb{N}_{>0}}$ the drift process.

Why does this hold? It is by iteration, unwinding the process. The first step is to write
\[
Y_t = \left\{ E[Y_t | \mathcal{F}_{t-1}] - Y_{t-1} \right\} + \left[ Y_{t-1} + \left\{ Y_t - E[Y_t | \mathcal{F}_{t-1}] \right\} \right];
\]
repeating the process, working on the $Y_{t-1}$ term in the square bracket, yields the desired result. This decomposition can be shown to be unique, but I will not give a proof of that here.

Why is this useful? Every time series with $E[|Y_t|] < \infty$ has a martingale component and another important term. We have tools to understand the behaviour of martingales, which can simplify the problem, e.g. by the WLLN for martingale differences in $L^2$, the
\[
\frac{Y_t - A_t}{t} = \frac{M_t}{t} \overset{p}{\to} 0
\]
as $t \to \infty$, that is, the martingale component averages away as $t$ gets large.

Example 2.3.9. Let $Y_t = \phi Y_{t-1} + \varepsilon_t$, where $\{\varepsilon_t\}_{t \in \mathbb{N}_{>0}}$ is a martingale difference sequence with respect to $\{\mathcal{F}_t\}_{t \in \mathbb{N}_{>0}}$. Then $E[Y_t | \mathcal{F}_{t-1}] = \phi Y_{t-1}$ and so the Doob decomposition writes
\[
A_t = (\phi - 1) \sum_{j=1}^{t} Y_{j-1}, \quad M_t = Y_0 + \sum_{j=1}^{t} \varepsilon_j .
\]

Doob’s inequalities

Here is the fifth property. Let {Yt } be a martingale, then Doob’s inequalities are:

(1) for any c > 0 and p ≥ 1,


 
P sup |Ys | > c ≤ E [|Yt |p ] /cp ,
s≤t

(2) for any p ≥ 1,


p
sup Ys ≤ ∥Yt ∥p , (2.7)
s≤t p
p−1
where
1/p
∥X∥p = (E|X|p ) .
2.3. MARTINGALE AND MARTINGALE DIFFERENCE 27

Thus, taking p = 2 implies


 1/2
1/2
E| sup Ys |2 ≤ 2 E|Yt |2 ,
s≤t

so rearranging.
 
E sup Ys ≤ 4E[Yt2 ].
2
s≤t

The latter is sometimes called the Doob’s maximal quadratic inequality. The proof of (2.7) is beyond these

lectures.

Example 2.3.10. Suppose $Y_t = \sum_{j=1}^{t} \varepsilon_j$, where $\{\varepsilon_t\}$ is i.i.d., zero mean and variance $\sigma^2$. Then $\{Y_t\}$ is a martingale and
\[
P\left( \sup_{s \leq t} |Y_s| > c \right) \leq \frac{t\sigma^2}{c^2}, \quad \text{and} \quad E\left[ \sup_{s \leq t} Y_s^2 \right] \leq 4t\sigma^2 .
\]

2.3.3 A taste of martingale differences in economics

Many dynamic economics and engineering problems are phrased in terms of taking actions to maximize expected
utility, given the current information set. Leading cases are consumption based models and dynamic portfolio
analysis in finance, as well as problems of control and reinforcement learning. Here I will discuss this abstractly,

before giving a classic example from financial economics.


Think of the time-$(t-1)$ action that we can select,
\[
a_{t-1} \in \mathcal{A}.
\]
For each level of action, there is a random potential outcome
\[
Y_t(a_{t-1})
\]
(hence the output is a random function of $a_{t-1}$ and is not interfered with by $a_{1:t-2}$). Associated with each level of time-$t$ outcome $y_t \in \mathcal{Y}$ is the deterministic time-$t$ utility function
\[
U_t(y_t); \quad \frac{\partial U_t(y_t)}{\partial y_t} > 0, \quad \frac{\partial^2 U_t(y_t)}{\partial y_t^2} < 0, \quad U_t(-\infty) = -\infty.
\]
Then the random time-$t$ potential utility
\[
U_t(Y_t(a_{t-1}))
\]
varies with the action $a_{t-1}$. Throughout assume that $E[|U_t(Y_t(a_{t-1}))|] < \infty$ uniformly over $a_{t-1} \in \mathcal{A}$.

The conditional expected time-$t$ utility is
\[
E[U_t\{Y_t(a_{t-1})\} | \mathcal{F}_{t-1}],
\]
given the filtration, the information used to make the decision.



Now assume that the best action is defined as
\[
\widehat{a}_{t-1} = \arg\max_{a_{t-1}} E[U_t(Y_t(a_{t-1})) | \mathcal{F}_{t-1}].
\]
Assume that $Y_t(a_{t-1})$ is linear in $a_{t-1}$ and note that almost surely
\[
\frac{\partial^2 U_t(Y_t(a_{t-1}))}{\partial a_{t-1}^2} = \frac{\partial^2 U_t(y_t)}{\partial y_t^2}\bigg|_{y_t = Y_t(a_{t-1})} \times \left( \frac{\partial Y_t(a_{t-1})}{\partial a_{t-1}} \right)^2 < 0.
\]
Thus $E[U_t(Y_t(a_{t-1})) | \mathcal{F}_{t-1}]$ is strictly concave in the action $a_{t-1}$, with a unique maximizer given by the turning point
\[
\frac{\partial E[U_t(Y_t(a_{t-1})) | \mathcal{F}_{t-1}]}{\partial a_{t-1}}\bigg|_{a_{t-1} = \widehat{a}_{t-1}} = \frac{\partial E[U_t(Y_t) | \mathcal{F}_{t-1}]}{\partial a_{t-1}} = 0,
\]
where
\[
Y_t = Y_t(\widehat{a}_{t-1}).
\]
If we are able to interchange the order of integration and differentiation, then
\[
E[U_t' | \mathcal{F}_{t-1}] = 0, \quad \text{where} \quad U_t' = \frac{\partial U_t(Y_t)}{\partial a_{t-1}},
\]
that is $U_t'$, the marginal conditional expected utility from action $a_{t-1}$ evaluated at $\widehat{a}_{t-1}$, is a martingale difference sequence with respect to the filtration.

Example 2.3.11. [Discrete time version of Merton (1969)] Think of the portfolio allocation problem, holding $a_{t-1}$ units in the risky asset and $1 - a_{t-1}$ in the riskless asset. Then the portfolio at time $t$ is worth
\[
Y_t(a_{t-1}) = (1 + r) + a_{t-1}(R_t - r),
\]
where $R_t$ is the time-$t$ return on the risky asset and $r$ is the return on the riskless asset. To get analytic results, assume utility from the portfolio has constant absolute risk aversion (CARA):
\[
U_t(y_t) = \frac{1 - e^{-\gamma y_t}}{\gamma}; \quad \gamma \neq 0.
\]
Then $\widehat{a}_{t-1}$ is found by selecting $a_{t-1}$ to minimize (as log transforms are 1-1)
\[
\widehat{a}_{t-1} = \arg\min_{a_{t-1}} \log E[e^{-\gamma Y_t(a_{t-1})} | \mathcal{F}_{t-1}],
\]
noting that
\[
\frac{\partial E[U_t(Y_t(a_{t-1})) | \mathcal{F}_{t-1}]}{\partial a_{t-1}} = -\frac{1}{\gamma} \frac{\partial E[e^{-\gamma Y_t(a_{t-1})} | \mathcal{F}_{t-1}]}{\partial a_{t-1}} = -\frac{1}{\gamma} E[e^{-\gamma Y_t(a_{t-1})} | \mathcal{F}_{t-1}] \times \frac{\partial \log E[e^{-\gamma Y_t(a_{t-1})} | \mathcal{F}_{t-1}]}{\partial a_{t-1}}.
\]
Hence
\[
E[U_t' | \mathcal{F}_{t-1}] = 0 \quad \Leftrightarrow \quad \frac{\partial \log E[e^{-\gamma Y_t(a_{t-1})} | \mathcal{F}_{t-1}]}{\partial a_{t-1}}\bigg|_{a_{t-1} = \widehat{a}_{t-1}} = 0.
\]
If $R_t | \mathcal{F}_{t-1} \sim N(\mu_t, \sigma_t^2)$, then using the moment generating function of a normal,
\begin{align*}
\log E[e^{-\gamma Y_t(a_{t-1})} | \mathcal{F}_{t-1}] &= -\gamma(1 + r) - \gamma a_{t-1}(\mu_t - r) + \gamma^2 a_{t-1}^2 \sigma_t^2 / 2, \\
\frac{\partial \log E[e^{-\gamma Y_t(a_{t-1})} | \mathcal{F}_{t-1}]}{\partial a_{t-1}} &= -\gamma(\mu_t - r) + \gamma^2 a_{t-1} \sigma_t^2 ,
\end{align*}
so
\[
\widehat{a}_{t-1} = \frac{\mu_t - r}{\gamma \sigma_t^2},
\]
implying the excess return on the wealth portfolio, beyond the risk free rate, is
\[
\{Y_t(\widehat{a}_{t-1}) - (1 + r)\} | \mathcal{F}_{t-1} \sim N\left( \frac{(\mu_t - r)^2}{\gamma \sigma_t^2}, \frac{(\mu_t - r)^2}{\gamma^2 \sigma_t^2} \right).
\]
Thus the marginal conditional expected utility from investing, $U_t'$, is a martingale difference sequence, not the returns.
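This closed form is easy to check numerically in R (parameter values invented): minimize the log moment generating function over a and compare with (µ_t − r)/(γσ_t²).

    # numerical check of the CARA/normal portfolio rule (made-up parameters)
    mu <- 0.08; r <- 0.02; sigma2 <- 0.04; gam <- 3
    obj <- function(a)   # log E[exp(-gam * Y_t(a)) | F_{t-1}] for normal R_t
      -gam * (1 + r) - gam * a * (mu - r) + gam^2 * a^2 * sigma2 / 2
    optimize(obj, c(-10, 10))$minimum   # numerical minimizer, about 0.5
    (mu - r) / (gam * sigma2)           # closed form: 0.5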

2.3.4 A taste of martingale differences in statistics

Bayes' theorem is simple but extraordinary: combining prior knowledge of $\theta$ and the likelihood function, the result is a posterior, reflecting rational knowledge about $\theta$ gleaned from the data. But it is not that easy to use: it needs the specification of the prior and the likelihood and then the use of computer resources to manipulate the posterior.

The following is a partial shortcut, available without explicitly specifying the prior or likelihood!

Let $\theta$ be an unknown with prior satisfying $E[|\theta|] < \infty$; then the posterior expectation
\[
M_t = E[\theta | Y_{1:t}]
\]
is a martingale (Doob (1949), Miller (2018)). This is a remarkable and inspiring result. It has a direct proof:
\begin{align*}
M_{t-1} &:= E[\theta | Y_{1:t-1}] \\
&= E[E[\theta | Y_{1:t-1}, Y_t] | Y_{1:t-1}], \quad \text{Adam's law} \\
&= E[M_t | Y_{1:t-1}], \quad \text{definition of } M_t,
\end{align*}
as required. The result holds more generally: $E[h(\theta) | Y_{1:t}]$ is a martingale if $E[|h(\theta)|] < \infty$. An important example of this is $E[1(\theta \leq c) | Y_{1:t}]$, the posterior cumulative distribution function evaluated at the constant $c$.

This is a key result in time series practice. Repeated forecasts of the same object, e.g. daily nowcast updates of GDP in 2024Q3, should form a martingale sequence.
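Here is a simulation sketch in a conjugate normal–normal model (the prior and likelihood are invented for illustration), tracing the path M_t = E[θ | Y_{1:t}]:

    # sketch: the posterior mean path in a normal-normal model is a martingale
    set.seed(6)
    theta <- rnorm(1)                    # theta drawn from its N(0, 1) prior
    T <- 200
    y <- rnorm(T, mean = theta, sd = 1)  # Y_t | theta ~ iid N(theta, 1)
    M <- cumsum(y) / (1 + (1:T))         # E[theta | Y_{1:t}] = sum(y_{1:t}) / (1 + t)
    plot(1:T, M, type = "l", xlab = "t", ylab = "posterior mean M_t")
    abline(h = theta, lty = 2)           # the path settles down around theta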

☣ 2.3.12. $E[h(\theta) | Y_{1:t}]$ is not everything. Assume the prior has $E[\theta^2] < \infty$; then the posterior variance using information at time $t-1$ is
\begin{align*}
V_{t-1} &= E[(\theta - M_{t-1})^2 | Y_{1:t-1}], \quad \text{recalling } M_{t-1} = E[\theta | Y_{1:t-1}] \\
&= E[E[(\theta - M_{t-1})^2 | Y_{1:t-1}, Y_t] | Y_{1:t-1}] \\
&= E[V_t + (M_t - M_{t-1})^2 | Y_{1:t-1}] \\
&= E[V_t | Y_{1:t-1}] + E[(\Delta M_t)^2 | Y_{1:t-1}],
\end{align*}
where the difference $\Delta M_t = M_t - M_{t-1}$. Rearranging,
\[
E[V_t | Y_{1:t-1}] = V_{t-1} - E[(\Delta M_t)^2 | Y_{1:t-1}] \leq V_{t-1}, \tag{2.8}
\]
implies the posterior variance is a supermartingale (but $V_t$ is not, in general, less than or equal to $V_{t-1}$ — which superficially you might think would hold). Likewise, posterior quantiles will not, in general, be martingales.

The cumulative sum of the term in (2.8) in Biohazard 2.3.12 plays an important role in probability theory and statistics.

Definition 2.3.13. [Quadratic variation] For $\{Y_t\}_{t=1}^{T}$, a martingale in $L^2$ on the filtration $\{\mathcal{F}_t\}_{t=1}^{T}$,
\[
\langle Y, Y \rangle_t = \sum_{j=1}^{t} E[(\Delta Y_j)^2 | \mathcal{F}_{j-1}]
\]
is called the predictable quadratic variation (predictable QV) of $\{Y_t\}_{t=1}^{T}$, or the "angle-bracket" process in probability theory. Crucially $\langle Y, Y \rangle_t \in \mathcal{F}_{t-1}$, that is, it is previsible, and it is bounded due to the $L^2$ assumption.

The predictable QV drives a version of the strong law of large numbers (SLLN) for martingales.

Theorem 2.3.14. [SLLN for MDs] If $\{Y_t\}_{t=1}^{\infty}$ is a martingale and $Y_0 = 0$, then
\[
\frac{Y_t}{\langle Y, Y \rangle_t} \overset{a.s.}{\to} 0
\]
for every such $\{Y_t\}_{t=1}^{\infty}$ where, almost surely, $\langle Y, Y \rangle_t \to \infty$.

Proof. e.g. Section 12.14, Williams (1991).

Example 2.3.15. [Regression] Think about the bivariate time series $\{X_t, Y_t\}_{t \in \mathbb{N}_{>0}}$. Suppose $E[Y_t | X_t, \mathcal{F}_{t-1}] = \beta X_t$ and construct
\[
U_t = Y_t - \beta X_t,
\]
implying $E[U_t | X_t, \mathcal{F}_{t-1}] = 0$, where the filtration $\mathcal{F}_t$ is generated by $\{X_t, Y_t\}_{t \in \mathbb{N}_{>0}}$. The least squares estimator is
\[
\widehat{\beta}_T = \frac{\sum_{t=1}^{T} X_t Y_t}{\sum_{t=1}^{T} X_t^2} = \beta + \frac{\sum_{t=1}^{T} X_t U_t}{\sum_{t=1}^{T} X_t^2}.
\]
This means, by Adam's law, if $E[|X_t U_t|] < \infty$ then
\[
E[X_t U_t | \mathcal{F}_{t-1}] = 0.
\]
Thus $\{X_t U_t\}_{t \in \mathbb{N}_{>0}}$ is a MD sequence. If $E[(X_t U_t)^2] < \infty$ for all $t$, then $\{X_t U_t\}_{t \in \mathbb{N}_{>0}}$ is a MD in $L^2$ and so $\langle XU, XU \rangle_t = \sum_{j=1}^{t} E[(X_j U_j)^2 | \mathcal{F}_{j-1}]$. Then by the SLLN for MDs
\[
\frac{\sum_{t=1}^{T} X_t U_t}{\langle XU, XU \rangle_T} \overset{a.s.}{\to} 0,
\]
so long as $\langle XU, XU \rangle_T \to \infty$.

Martingale differences are also at the center of the Brown (1971) basic central limit theorem (CLT) for martingales.

Theorem 2.3.16. [CLT for MDs] If $\{Y_t\}_{t=1}^{\infty}$ is a martingale, define $S_t^2 = E[\langle Y, Y \rangle_t]$ and assume that as $t \to \infty$ the
\[
\frac{\langle Y, Y \rangle_t}{S_t^2} \overset{p}{\to} 1, \tag{2.9}
\]
and that the Lindeberg-type condition
\[
\frac{1}{S_t^2} \sum_{j=1}^{t} E\left[(\Delta Y_j)^2 1_{\{(\Delta Y_j)^2 \geq \epsilon S_t^2\}}\right] \to 0, \quad \text{for all } \epsilon > 0, \tag{2.10}
\]
holds. Then
\[
\frac{Y_t}{S_t} \overset{d}{\to} N(0, 1).
\]
Proof. See Brown (1971).

Equation (2.9) requires that the ratio of $\sum_{j=1}^{t} E[(\Delta Y_j)^2 | \mathcal{F}_{j-1}]$ to its unconditional mean
\[
S_t^2 = E[\langle Y, Y \rangle_t] = \sum_{j=1}^{t} E[(\Delta Y_j)^2] = \sum_{j=1}^{t} \operatorname{Var}(\Delta Y_j)
\]
converges to one when the sample size is large. This should happen if the $E[(\Delta Y_j)^2 | \mathcal{F}_{j-1}]$ has only a moderate degree of memory through time — we will formalize this later.

Equation (2.10) appears in the Lindeberg CLT for independent but not identically distributed random variables — time plays no special role here. The Lindeberg-type condition stops a few data points (as the data is not identically distributed through time) from dominating the average. The Lindeberg condition is implied by Lyapunov's condition, which is that for some $\delta > 0$ it is possible to show that $S_t^{-(2+\delta)} \sum_{j=1}^{t} E[|\Delta Y_j|^{2+\delta}] \to 0$. One way to satisfy this is to assume that $E[|\Delta Y_t|^{2+\delta}] < \infty$ and $d < E[\Delta Y_t^2] < \infty$, for some $d > 0$.

2.4 Probability integral transform — replication

The independence-like feature of the prediction decomposition
\[
f_{Y_{1:T}}(y_{1:T}) = \prod_{t=1}^{T} f_{Y_t | (Y_{1:t-1} = y_{1:t-1})}(y_t)
\]
has a deeper aspect.

Assume a univariate $Y_t$ comes from a continuous $F_{Y_t | Y_{1:t-1}}$, then evaluate $F_{Y_t | Y_{1:t-1}}$ at the random $Y_t$. This yields
\[
U_t := F_{Y_t | Y_{1:t-1}}(Y_t) \sim U(0, 1), \quad t = 1, \ldots, T, \tag{2.11}
\]
a standard uniform, using the universality of the uniform result (e.g. Section 5.3 of Blitzstein and Hwang (2019)).

The transform (2.11) is often called the probability integral transform. The joint density of the sequence $U_1, \ldots, U_T$ is
\[
f_{U_{1:T}}(u_{1:T}) = 1, \quad (u_1, \ldots, u_T) \in [0, 1]^T, \tag{2.12}
\]
so they are independent standard uniforms (and so any statistic $T(U_{1:T})$ is a pivot), where the independence is due to the product form of (2.12). Hence

    underneath the sequence $Y_{1:T}$, there are $T$ independent uniforms $U_{1:T}$.

At a high level this means that one-step ahead predictions reveal a pure source of replication in time series, $U_{1:T}$, providing an answer to Biohazard 2.1.6. Of course, forming these predictions in practice involves some form of modeling, so it is not easy. This contrasts with Introductory Statistics problems where blocks of data are directly assumed to be i.i.d..
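For instance, under the Gaussian AR(1) of Example 2.2.2 (model assumed known, so this is only a sketch), the PITs can be computed in R and checked for i.i.d. uniformity:

    # sketch: PITs from a correctly specified Gaussian AR(1)
    set.seed(7)
    T <- 5000; phi <- 0.8
    y <- as.numeric(arima.sim(list(ar = phi), n = T))  # simulate the AR(1)
    u <- pnorm(y[-1], mean = phi * y[-T], sd = 1)      # U_t = F_{Y_t|Y_{1:t-1}}(Y_t)
    ks.test(u, "punif")          # does not reject uniformity
    acf(u, plot = FALSE)$acf[2]  # near zero: the U_t look serially independent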

Example 2.4.1. An early use of the i.i.d. uniformity of U1 , ..., UT was in checking weather forecasting (e.g. a
model forecasts rain from 10-11am with 6% chance, then check all model forecasts through time by asking if
forecasts with a 6% chance happen 6% of the time). For an extensive review and much more see, for example,

Gneiting and Katzfuss (2014).

The result (2.12) goes back at least to Rosenblatt (1952); its use in econometrics was introduced by Diebold, Gunther, and Tay (1998), Kim, Shephard, and Chib (1998), Omori, Chib, Shephard, and Nakajima (2007) and Bladt and McNeil (2022).

To see a proof of (2.12), it is sufficient to look at the $T = 2$ case, as the $T > 2$ case has the same structure. Then
\begin{align*}
F_{U_1, U_2}(u_1, u_2) &= P(F_{Y_1}(Y_1) \leq u_1, F_{Y_2 | Y_1}(Y_2) \leq u_2) \\
&= P(Y_1 \leq F_{Y_1}^{-1}(u_1)) P(Y_2 \leq F_{Y_2 | Y_1}^{-1}(u_2)) = F_{Y_1}(F_{Y_1}^{-1}(u_1)) F_{Y_2 | Y_1}(F_{Y_2 | Y_1}^{-1}(u_2)) \\
&= u_1 u_2 .
\end{align*}

Differentiating the bivariate distribution function yields the stated joint density (Diebold, Gunther, and Tay (1998) give a different type of proof, one based on the change of variable method for densities, which uses Jacobians).

☣ 2.4.2. This argument extends to the $d$-dimensional $Y_t$ case; then
\[
U_{j,t} = F_{Y_{j,t} | Y_{1:t-1}}(Y_{j,t}), \quad j = 1, \ldots, d,
\]
where the $U_{j,t} \overset{iid}{\sim} U(0, 1)$ through time. However, the $U_{1,t}, \ldots, U_{d,t}$ are not necessarily independent of one another (they are linked through a time-$t$ copula — but that copula can, potentially, change every time period).

2.5 Estimand, statistic, estimator and estimate


2.5.1 Estimand, statistic, estimator and estimate

Just because we deal with time series does not change the definitions of estimands, statistics, estimators and estimates. They are the same as in introductory statistics. An estimand $\theta$ is an object we would like to learn from data, a statistic is a function of the data $T(\mathbf{Y})$, an estimator is a statistic aimed at learning an estimand, $\widehat{\theta} = T(\mathbf{Y})$, and an estimate is the data version of the estimator, $\widehat{\theta} = T(\mathbf{y})$.

Example 2.5.1. Examples of estimands include:

• The $\theta = E[Y_{T+2} | (\mathbf{Y} = \mathbf{y})]$, the conditional mean forecast of $Y_{T+2}$ given the past data up to time $T$, $y_1, \ldots, y_T$;

• parameters $\theta$ which index a parametric model, here expressed through the cumulative distribution function of $\mathbf{Y}$. In frequentist work this will be written as
\[
F_{\mathbf{Y}; \theta}, \quad \text{or} \quad F_{\mathbf{Y}}(\mathbf{y}; \theta).
\]
In Bayesian analysis this will be written as
\[
F_{\mathbf{Y} | \theta}, \quad \text{or} \quad F_{\mathbf{Y}}(\mathbf{y} | \theta);
\]

• if $Y_t$ is quarterly GDP growth in the $t$-th period, then
\[
\theta = \frac{1}{T-1} \sum_{t=2}^{T} P(Y_t < 0, Y_{t-1} < 0)
\]
is the chance a randomly selected pair of consecutive quarters will both have strictly negative growth (recall two consecutive contractions in GDP is one crude indicator of a recession).

2.5.2 Parametric model

One approach to learning from data is through the use of parametric models.

Definition 2.5.2. In a parametric statistical model, knowledge of each parameter value θ ∈ Θ ⊆ R^K is enough to determine the function F_{Y_1,...,Y_T;θ}(y_1, ..., y_T) completely for all y_1, ..., y_T (although we may not have enough knowledge to compute F_{Y_1,...,Y_T;θ} or f_{Y_1,...,Y_T;θ}, or to simulate from Y_1, ..., Y_T; θ). In a nonparametric statistical model K = ∞.

h 2.5.3. Many researchers in the last 30 years have made a big deal out of the difference between parametric
and nonparametric models. But some of this is nonsense. The autoregressive language model behind ChatGPT

is a parametric statistical model but it has K = 175 billion. Hence it is a super flexible parametric model. Do
not be taken in by the mantra nonparametric good, parametric bad. It is intellectually vacuous. Instead, be
aware of the difference between flexible models and very tightly constrained models. Sometimes very tightly
constrained models can yield simple but fragile results and so can be dangerous. Thought is needed.

For parametric models the log-likelihood

log L(θ) = log f_{Y_2,...,Y_T|Y_1;θ} = Σ_{t=2}^{T} log f_{Y_t|Y_{1:t−1};θ},

plays a large role.

Example 2.5.4. Suppose Y_t|Y_{1:t−1}; θ ∼ N(ϕY_{t−1}, σ²), where ϕ and σ² > 0 are unknown parameters, so θ = (ϕ, σ²)^T and Θ = R × R_{>0}. This is called a first order Gaussian autoregression. Then

log L(θ) = c − ((T − 1)/2) log σ² − (1/(2σ²)) Σ_{t=2}^{T} (Y_t − ϕY_{t−1})².
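
As a concrete numerical companion to this example, the following sketch (with illustrative, made-up parameter values) evaluates the conditional log-likelihood and exploits the fact that, for the first order Gaussian autoregression, its maximizer has a closed form: ϕ̂ is the least squares coefficient of Y_t on Y_{t−1} and σ̂² is the mean squared residual.

```python
import numpy as np

def ar1_loglik(phi, sigma2, y):
    # log L = c - (T-1)/2 * log(sigma^2) - sum (Y_t - phi*Y_{t-1})^2 / (2*sigma^2),
    # with the constant c = -(T-1)/2 * log(2*pi) made explicit here.
    resid = y[1:] - phi * y[:-1]
    n = len(resid)
    return -n / 2 * np.log(2 * np.pi * sigma2) - np.sum(resid ** 2) / (2 * sigma2)

def ar1_mle(y):
    # Closed-form maximizer of the conditional likelihood.
    phi_hat = np.sum(y[1:] * y[:-1]) / np.sum(y[:-1] ** 2)
    sigma2_hat = np.mean((y[1:] - phi_hat * y[:-1]) ** 2)
    return phi_hat, sigma2_hat

rng = np.random.default_rng(1)
y = np.zeros(5_000)
for t in range(1, len(y)):
    y[t] = 0.7 * y[t - 1] + rng.standard_normal()

phi_hat, sigma2_hat = ar1_mle(y)
print(phi_hat, sigma2_hat)                 # roughly (0.7, 1.0)
print(ar1_loglik(phi_hat, sigma2_hat, y))  # the maximized log-likelihood
```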

2.5.3 Kullback-Leibler divergence

To compare the joint model F_{Y_{1:t}} to G_{Y_{1:t}}, look at the ratio of the densities

Λ_t = f_{Y_{1:t}}(y_{1:t})/g_{Y_{1:t}}(y_{1:t}) = Λ_{t−1} × Λ(F||G)_t, Λ_0 = 1, Λ(F||G)_t = f_{Y_t|Y_{1:t−1}}(y_t)/g_{Y_t|Y_{1:t−1}}(y_t),

so

log Λ_t = log Λ_{t−1} + log Λ(F||G)_t.

This form means that if E[|log Λ_t|] < ∞, then E[log Λ_t|F_{t−1}] − log Λ_{t−1} equals E[log{Λ(F||G)_t}|F_{t−1}], so the Doob decomposition splits

log Λ_t = A_t + M_t, where A_t = Σ_{j=1}^{t} E[log{Λ(F||G)_j}|F_{j−1}],

noting log Λ_0 = 0, where {M_t} is a martingale with respect to the filtration generated by {Y_t}_{t∈N>0}, with increment ∆M_t = M_t − M_{t−1}, which equals ∆log Λ_t − E[log{Λ(F||G)_t}|F_{t−1}].

From the LLN for martingales, we know that as T gets large M_T/T goes to zero, so

T^{−1}{log Λ_T − A_T} →p 0.

So what? Is A_T interesting?

Theorem 2.5.5. Assume that E[|log{Λ(F||G)_t}|] < ∞; then the predictive Kullback-Leibler divergence

∆A_t = E[log{Λ(F||G)_t}|F_{t−1}] ≥ 0,

where the expectation is with respect to the model F.

Proof. Recall the core Kullback-Leibler divergence inequality. For any distribution functions P and Q, with densities p and q,

D_{KL}(P||Q) = E_P[log(p/q)] = −E_P[log(q/p)] ≥ −log E_P[q/p],

by Jensen's inequality. Then note E_P[q/p] = 1, yielding the celebrated result that

D_{KL}(P||Q) ≥ 0,

the Gibbs inequality. Consequently

∆A_t = ∫ log{Λ(F||G)_t} × f_{Y_t|Y_{1:t−1}}(y_t) dy_t ≥ 0, as D_{KL}(P||Q) ≥ 0.

This is the stated result.

Since ∆A_t ≥ 0 for every t, {log Λ_t} is a submartingale with respect to the filtration generated by {Y_t}_{t∈N>0} — so it will tend to drift upwards as t increases, unless F = G, when ∆A_t = 0.

2.5.4 Parametric models and MLE

Compare the data generating process F to the parametric model G = F_θ; then for each θ ∈ Θ the corresponding ∆A_t changes with θ. Write this as

∆A_t(θ) = E[log{Λ(F||F_θ)_t}|F_{t−1}] ≥ 0.

In the special case where F = F_{θ∗} with θ∗ ∈ Θ (the truth is a special case of the model), the corresponding ∆A_t changes with both θ and θ∗. Write this as

∆A_t(θ∗, θ) = E[log{Λ(F_{θ∗}||F_θ)_t}|F_{t−1}] ≥ 0,

with the equality holding iff θ = θ∗, so long as F_θ ≠ F_{θ∗} for all θ ≠ θ∗.

This inequality provides a major justification for the use of the maximum likelihood estimator (MLE) for time series models:

θ̂ = arg max_{θ∈Θ} log L(θ; y_{1:T}) = arg max_{θ∈Θ} log f_{Y_{1:T}}(y_{1:T}; θ)
  = arg min_{θ∈Θ} log{f_{Y_{1:T}}(y_{1:T}; θ∗)/f_{Y_{1:T}}(y_{1:T}; θ)}
  = arg min_{θ∈Θ} {A_T(θ∗, θ) + M_T(θ∗, θ)} = arg min_{θ∈Θ} {(1/T)A_T(θ∗, θ) + (1/T)M_T(θ∗, θ)}
  ≃ arg min_{θ∈Θ} (1/T)A_T(θ∗, θ), when T is large,
  = θ∗,

when T is large and dim(θ) is small. Obviously the large T behaviour of A_T(θ∗, θ)/T needs formalizing in this argument.

To finish, now focus on a semiparametric model where F = H_{θ∗}, which is indexed by some modelled aspect of F, e.g. E[Y_t|F_{t−1}] = θ∗Y_{t−1}. If

θ∗ = arg min_{θ∈Θ} ∆A_t(θ∗, θ),

for all t and Y_{1:t−1}, then log L(θ; y_{1:T}) is called a quasi-likelihood for the "pseudo-true" θ∗.

Example 2.5.6. Think of the Gaussian autoregressive case from Example 2.5.4 as a parametric model with known σ² and θ = ϕ. Then

E[−log f_{Y_t|Y_{1:t−1};θ}(Y_t)|F_{t−1}] = −c + (1/2)log σ² + (1/(2σ²)){E[Y_t²|F_{t−1}] − 2θY_{t−1}E[Y_t|F_{t−1}] + θ²Y_{t−1}²}.

Assume that under F = H_{θ∗} the conditional moment E[Y_t|F_{t−1}] = θ∗Y_{t−1} holds; then

∆A_t(θ∗, θ) = E[log f_{Y_t|Y_{1:t−1};θ∗}(Y_t)|F_{t−1}] − c + (1/2)log σ² + (1/(2σ²)){E[Y_t²|F_{t−1}] − 2θθ∗Y_{t−1}² + θ²Y_{t−1}²}.

The only θ-dependent terms are (θ²Y_{t−1}² − 2θθ∗Y_{t−1}²)/(2σ²), which are minimized at θ = θ∗, so for all Y_{1:t−1}

θ∗ = arg min_{θ∈Θ} ∆A_t(θ∗, θ).

Hence θ∗ has a pseudo-true interpretation. The Gaussianity of the Gaussian autoregression is not vital.
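
A small simulation illustrates the pseudo-true idea (the shock distribution and parameter values below are illustrative assumptions): fit the Gaussian quasi-likelihood, which only uses E[Y_t|F_{t−1}] = θ∗Y_{t−1}, to data generated with heavily skewed shocks.

```python
import numpy as np

# Sketch: the Gaussian quasi-likelihood estimate of theta is the least squares
# coefficient, and it recovers the pseudo-true theta* even though the shocks
# are centred exponential, i.e. very non-Gaussian.
rng = np.random.default_rng(2)
T, theta_star = 100_000, 0.6

eps = rng.exponential(1.0, T) - 1.0      # mean-zero but skewed shocks
y = np.zeros(T)
for t in range(1, T):
    y[t] = theta_star * y[t - 1] + eps[t]

theta_qmle = np.sum(y[1:] * y[:-1]) / np.sum(y[:-1] ** 2)
print(theta_qmle)    # close to 0.6 despite the misspecified Gaussian model
```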

2.6 Recap

This Chapter has covered a great deal of material, starting with using probability to define a time series model.

The Chapter has focused on the use of prediction to unlock fundamental properties of time series, moving
through the prediction decomposition and going on to martingales and martingale difference sequences. Some
of this material was used to justify an interest in the MLE.

Formula or idea | Description or name
⟨M, M⟩_t = Σ_{j=1}^{t} Var(∆M_j|F_{j−1}) | angle bracket process
CARA | constant absolute risk aversion
Y_t = A_t + M_t | Doob decomposition
 | Doob's inequality
E[sup_{s≤t} M_s²] ≤ 4E[M_t²] | Doob's maximal quadratic inequality
 | integrable
f_{Y_{1:T}}(y_{1:T}) = ∏_{t=1}^{T} f_{Y_t|(Y_{1:t−1}=y_{1:t−1})}(y_t) | prediction decomposition
E[Y_t|F_{t−1}] = Y_{t−1} | martingale
E[Y_t|F_{t−1}] = 0 | martingale difference
 | martingale transform theorem
F_{Y_t|Y_{1:t−1}}(Y_t) ∼ U(0, 1) | PIT
E[log{f_{Y_t|Y_{1:t−1}}(y_t)/g_{Y_t|Y_{1:t−1}}(y_t)}|Y_{1:t−1}] | Kullback-Leibler divergence
 | square integrable

Table 2.1: Main ideas and notation in Chapter 2.

The main topics are listed in Table 2.1.


Chapter 3

Stationarity

In introductory statistics much of the analysis is based on the i.i.d. assumption. Here we focus on the time
series analog of the identically distributed assumption. It has two versions:

• stationarity, which is about marginal and joint distributions;

• covariance stationarity, which is only about means and covariances.

Prediction does not really play any fundamental role here.

3.1 Strict stationarity and marginal distributions

Recall a time series model is the joint distribution function FY of the sequence of random variables

Y = (Y1 , ..., YT ) .

Focus on the T marginal distributions

FYt , t = 1, ..., T.

Recall we are extending to time series the idea of the identically distributed assumption. A crude version of this is that marginal distributions do not change through time,

F_{Y_t} = F_{Y_1}, for all t = 1, ..., T

(written in long hand this means that F_{Y_t}(y_1) = F_{Y_1}(y_1) for all y_1 ∈ Y), which would force time-invariance of the marginal means, variances, quantiles, etc. But for a time series this is often not enough, for the dependence between the elements in Y could change dramatically through time.

Instead, strict stationarity is often employed. This is stated for an infinitely long process. The idea is that we select some k arbitrary times t_1, ..., t_k and calculate the joint distribution of the corresponding random variables, F_{Y_{t_1},...,Y_{t_k}}. Then, if we shift all these times forward by the same period s, the process is said to be strictly stationary if the joint distribution function does not change — the Y_{t_1}, ..., Y_{t_k} have the identical joint distribution under time shifts. This is stated mathematically below.

Definition 3.1.1. The process {Y_t}_{t∈N} is said to be k-th order strictly stationary if

F_{Y_{t_1},...,Y_{t_k}} = F_{Y_{t_1+s},...,Y_{t_k+s}}, for all t_1, ..., t_k, s ∈ Z.

If this holds for any choice of k ∈ Z_{>0}, the process is simply called strictly stationary. It is the latter construction that is used here.

This is a very influential way of defining a form of stability through time.

Example 3.1.2. To help you think: compare the stationary process

Y_t = ε, t ∈ N,

where ε is a single random variable repeatedly seen, with another stationary process

Y_t = ε_t, ε_t i.i.d., t ∈ N.

They are polar opposites but both stationary. Stationarity tells you nothing about the degree of dependence inside the time series.

3.2 Moving average process

The next example is a moving average process.

Remark 5. "Moving average", very unfortunately, has two entirely different meanings in time series. First, as a process, which we will define in a moment. Second, as an algorithm: using local averages of a time series to reduce "noise." In the academic literature the second method is usually called a type of filter or smoother. Two examples of such a moving average are:

M̂_t = (1/4)Y_{t−1} + (1/2)Y_t + (1/4)Y_{t+1}, t = 2, ..., T − 1,

a smoother (which estimates something at time t, using data from the past, the current t and the future), and

M̃_t = (1/6)Y_{t−2} + (1/3)Y_{t−1} + (1/2)Y_t, t = 3, ..., T,

a filter (which estimates something at time t, using only data from the past and the current day). We will see much more of these types of filters and smoothers, e.g. the EWMA and cubic splines, in Chapter 5.

The moving average process is one of the most influential in time series, developed independently by Yule (1921, 1926, 1927) and Slutsky (1927). It was named by Wold (1938). Much of modern macroeconometrics is phrased using moving averages. This will be discussed in Chapter 7.

Definition 3.2.1. [Moving average process] The q-th order (homogeneous) linear moving average process (denoted MA(q)),

Y_t = θ_0ε_t + θ_1ε_{t−1} + ... + θ_qε_{t−q}, t ∈ N, (3.1)

is driven by the time series of "shocks" {ε_t}_{t∈Z} of i.i.d. (vector) random variables, while {θ_j}_{j=0}^{q} are d × r matrices of constants (often researchers set θ_0 = I_d in applications).

In most applications d = r.

Theorem 3.2.2. The MA(q) process is always stationary. Further, if {Y_t} is an MA(q) process, then for any r × d real matrix A the process {AY_t} is an MA(q) process (i.e. the class of processes is closed under linear combinations).

Proof. [Sketch proof only]. We will use a generic notation for the cumulant generating function (the log of the characteristic function), C{ω ‡ X} = log E[e^{iX^Tω}], of a vector random variable X, where ω is a vector of reals. Then for every t

C{ω ‡ Y_t} = log E[e^{iω^TY_t}]
= C{ω^Tθ_0 ‡ ε_t} + C{ω^Tθ_1 ‡ ε_{t−1}} + ... + C{ω^Tθ_q ‡ ε_{t−q}}, as {ε_t}_{t∈Z} are independent,
= Σ_{j=0}^{q} C{ω^Tθ_j ‡ ε_1}, by identical distribution,
= C{ω ‡ Y_{t+s}}.

So by the uniqueness theorem for characteristic functions (there is a 1-1 relationship between distributions and characteristic functions), F_{Y_t} exists and F_{Y_t} = F_{Y_{t+s}}. The same type of argument holds when we look at joint distributions. For example, think of the MA(1), taking t_2 = t_1 + 1 (when |t_2 − t_1| > 1 no shocks are shared); then

C{(ω, λ) ‡ (Y_{t_1}, Y_{t_2})} = log E[e^{i(ω^TY_{t_1} + λ^TY_{t_2})}]
= [C{ω^Tθ_0 ‡ ε_1} + C{ω^Tθ_1 ‡ ε_1}] + [C{λ^Tθ_0 ‡ ε_1} + C{λ^Tθ_1 ‡ ε_1}]
+ [C{(ω^Tθ_0 + λ^Tθ_1) ‡ ε_1} − C{ω^Tθ_0 ‡ ε_1} − C{λ^Tθ_1 ‡ ε_1}] 1_{|t_2−t_1|=1},

the correction term reflecting the single shared shock ε_{t_1}, which Y_{t_1} loads with θ_0 and Y_{t_2} loads with θ_1; this expression depends only on the gap |t_2 − t_1|, not on t_1 itself. Hence the MA(q) processes are always stationary. For the second point of the theorem,

AY_t = Aθ_0ε_t + Aθ_1ε_{t−1} + ... + Aθ_qε_{t−q} = λ_0ε_t + λ_1ε_{t−1} + ... + λ_qε_{t−q}, λ_j = Aθ_j,

which is an MA(q) process.

In macroeconomics MA(∞) models are often used. Are they strictly stationary?

Example 3.2.3. Think about when ε_1 is univariate symmetric α-stable, where α ∈ (0, 2], so that C{ω ‡ ε_1} = −|ω|^α. Then for a symmetric α-stable MA(q) process, C{ω ‡ Y_t} = −|ω|^α Σ_{j=0}^{q} |θ_j|^α. So the symmetric α-stable MA(∞) process is stationary iff Σ_{j=0}^{∞} |θ_j|^α < ∞ (see Rootzén (1978) for a more extensive discussion). This shows strict stationarity issues can be intricate for MA(∞) processes — the tail thickness of the shocks {ε_t} interacts with the dependence parameters {θ_j}.

h 3.2.4. Strict stationarity is straightforward for MA(q) for finite q, but very intricate for MA(∞) processes.
This is somewhat odd as q could be enormous, for example 100 billion, and the process is always strictly

stationary, but 100 billion is not big enough to cover the MA(∞) case.

In Chapter 2 we saw that martingale differences appear naturally in time series and play some similar roles in probability theory to zero mean independence. So it is interesting to define a zero mean MA(q) process not driven by i.i.d. shocks, but by MD sequences:

Y_t = θ_0ε_t + θ_1ε_{t−1} + ... + θ_qε_{t−q}, t ∈ N, (3.2)

where the {θ_j}_{j=0}^{q} are, again, d × d matrices of constants but now {ε_t}_{t∈Z} is a martingale difference sequence with respect to {F_t}. We will call this an MA(q) MD process, and an MA(q) MD L2 process if the {ε_t} is an MD L2 process.

Theorem 3.2.5. For an MA(q) MD process,

E[Y_t] = 0. (3.3)

For an MA(q) MD L2 process,

Cov(Y_t, Y_{t−s}) = Σ_{j=0}^{q} θ_j Var(ε_{t−j}) θ_{j−s}^T, (3.4)

with the convention that θ_j = 0 for j ∉ {0, 1, ..., q}. Replacing the MD process by i.i.d. shocks (which do not need 0 means) where Var(ε_1) < ∞, then

E[Y_t] = (Σ_{j=0}^{q} θ_j) E[ε_1], (3.5)

while (3.4) simplifies to

Cov(Y_t, Y_{t−s}) = Σ_{j=0}^{q} θ_j Var(ε_1) θ_{j−s}^T. (3.6)

In the special case where the MD L2 process has Var(ε_t) = Var(ε_1) for all t, then (3.4) is replaced by (3.6).

Proof. By the MD definition, we know that E[|ε_t|] < ∞ and E[ε_t] = 0, so

E[Y_t] = 0

always holds. For an MA(q) MD L2 process the Var(ε_t) exists (but is not necessarily constant through time) and Cov(ε_t, ε_{t−s}) = 0 for s ≠ 0, so

Cov(Y_t, Y_{t−s}) = Σ_{j=0}^{q} θ_j Var(ε_{t−j}) θ_{j−s}^T.

The same result holds for i.i.d. shocks so long as the moments exist, but now homoskedasticity is enforced, which simplifies (3.4) down to (3.6).
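
A quick Monte Carlo sketch (with illustrative θ values and unit variance i.i.d. shocks, an assumption for the demonstration) confirms the autocovariance formula (3.6) for a univariate MA(2):

```python
import numpy as np

# Check gamma_s = Var(eps) * sum_j theta_j * theta_{j-s} for an MA(2).
rng = np.random.default_rng(3)
theta = np.array([1.0, 0.5, -0.3])       # theta_0, theta_1, theta_2
T = 200_000
eps = rng.standard_normal(T + 2)         # i.i.d. shocks, Var(eps) = 1
y = theta[0] * eps[2:] + theta[1] * eps[1:-1] + theta[2] * eps[:-2]

def sample_gamma(y, s):
    # Sample autocovariance of a mean-zero process at lag s.
    return np.mean(y[s:] * y[:len(y) - s])

for s in range(4):
    theory = np.sum(theta[s:] * theta[:len(theta) - s]) if s < len(theta) else 0.0
    print(s, round(sample_gamma(y, s), 4), theory)
```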

3.3 Covariance stationarity


3.3.1 Definition

Strict stationarity requires all joint distributions to be invariant to a time shift by some period s. A different, but related, concept is covariance stationarity — which switches the focus from distributions to solely means and covariances.

Definition 3.3.1. A time series {Y_t} is covariance stationary iff

E[Y_t] = E[Y_1] and Cov(Y_t, Y_{t−s}) = Cov(Y_1, Y_{1−s})

for every t and s. The Cov(Y_1, Y_{1−s}) is called the s-th autocovariance, while the function {γ_s}_{s∈Z}, where

γ_s = Cov(Y_1, Y_{1−s}) = Cov(Y_{1−s}, Y_1)^T = Cov(Y_1, Y_{1+s})^T = γ_{−s}^T,

is called the autocovariance function. The s-th autocorrelation is, for a scalar time series,

ρ_s = Cor(Y_1, Y_{1−s}) = Cov(Y_1, Y_{1−s})/√(Var(Y_1)Var(Y_{1−s})) = γ_s/γ_0,

while the function {ρ_s}_{s∈Z} is called the autocorrelation function.

Remark 6. Just because a series is covariance stationary does not mean it is strictly stationary, and some strictly stationary processes are not covariance stationary (e.g. a sequence of i.i.d. Cauchy random variables is strictly stationary but not covariance stationary). Strict stationarity plus the existence of Var(Y_1) implies covariance stationarity. A special case of this covers the Gaussian process: there strict and covariance stationarity are the same.

3.3.2 Examples of covariance stationarity

Example 3.3.2. Y_t = ε, where Var(ε) < ∞; then {Y_t} is a covariance stationary process with γ_s = γ_0 for all s.

Example 3.3.3. A martingale difference sequence always has E[Y_t] = 0, but its unconditional Var(Y_t) is not guaranteed to exist and, even if it does exist, can vary with t. Hence martingale differences are not in general covariance stationary.

Example 3.3.4. Let λ_1 ∈ (0, 2π), a "frequency", and define the process {Y_t} by

Y_t = cos(λ_1t)β_1 + sin(λ_1t)β_2, β = (β_1, β_2)^T, E[β] = 0_2, Var(β) = σ_1²I_2;

then it is covariance stationary with

E[Y_t] = 0, γ_s = cos(λ_1s)σ_1².

Why? The mean is obvious, but

γ_s = {cos(λ_1t)cos(λ_1t − λ_1s) + sin(λ_1t)sin(λ_1t − λ_1s)}σ_1² = cos(λ_1s)σ_1²,

using the trig identity cos(α − β) = cos(α)cos(β) + sin(α)sin(β). Thus {Y_t} is a weakly stationary process (if β is Gaussian, it is strictly stationary). Notice γ_s oscillates as s increases, never settling down.

The i.i.d. assumption appears frequently in introductory statistics and drives the basic MA(q) definition. The covariance stationary version of this is called white noise.

Definition 3.3.5. [Weak white noise] A time series {Yt } is called weak white noise if

E[Yt ] = E[Y1 ], Var(Yt ) = γ0 , Cov(Yt , Yt−s ) = 0, s ̸= 0.

If the mean is 0, it is labelled zero mean weak white noise.

Sometimes the finite variance i.i.d. case is called “independent white noise.”

Example 3.3.6. [Weak white noise driven MA(q)] In the univariate zero mean weak white noise driven MA(1), that is where {ε_t} is weak white noise, then

E[Y_1] = 0, Var(Y_1) = (θ_0² + θ_1²)Var(ε_1), Cov(Y_t, Y_{t−1}) = θ_0θ_1Var(ε_1), Cov(Y_t, Y_{t−s}) = 0, s > 1,

so it is covariance stationary. Hence

Cor(Y_t, Y_{t−1}) = θ_0θ_1/(θ_0² + θ_1²),

and Cor(Y_t, Y_{t−s}) = 0, s > 1. Hence the weak white noise driven MA(1) is always covariance stationary. This also holds for the zero mean weak white noise driven vector MA(q) process when q is finite, where

E[Y_1] = 0, Cov(Y_1, Y_{1−s}) = Σ_{j=0}^{q} θ_j Var(ε_1) θ_{j−s}^T. (3.7)

The weak white noise driven MA(q) processes are often called covariance stationary MA(q) processes. The univariate version of the autocovariance,

γ_s = Var(ε_1) Σ_{j=0}^{q} θ_j θ_{j−s},

is often useful to keep in mind.

The following special case is important in statistical model building.



Definition 3.3.7. [MD white noise] Assume {Y_t} is a martingale difference sequence with respect to {F_t} and Var(Y_t) = Var(Y_1) < ∞ for every t; then {Y_t} is called a martingale difference white noise sequence — MD white noise. MD white noise is always covariance stationary, with γ_s = 0 for s ≠ 0.

MD white noise is important in statistics as it drives the properties of many sophisticated estimators (e.g. method of moments and least squares estimators). Here we collect two core MD white noise results — they are not new, they are special (relatively simple) cases of existing laws of large numbers and CLTs for martingale differences. It will be helpful to refer to these results later.

Remark 7. [MD white noise asymptotics] If {Y_t} is MD white noise then:

1. The sample average always obeys a weak law of large numbers (specializing Theorem 2.3.8):

(1/T) Σ_{t=1}^{T} Y_t →p 0.

2. The sample average obeys a CLT (specializing Theorem 2.3.16):

√T ((1/T) Σ_{t=1}^{T} Y_t) →d N(0, Var(Y_1)),

so long as:

(a) T^{−1} Σ_{t=1}^{T} σ_t² →p Var(Y_1) (that is, σ_t² is ergodic, where σ_t² = Var(Y_t|F_{t−1})), and

(b) the Lindeberg condition holds. In practice, if {Y_t} is stationary, this can be checked by making sure that E[|Y_t|^{2+δ}] < ∞ for some δ > 0.

3.3.3 Covariance stationarity and the MA(∞)

The MA(∞) is rather important from a theoretical viewpoint. Is the weak white noise driven MA(∞) covariance
stationary? The answer is sometimes!

The basic question is: do E[|Y_1|] and Var(Y_1) exist? If they do, then the autocovariance function will just be (3.7) with q = ∞. Now

Var(Y_1) = Var(ε_1) Σ_{j=0}^{∞} θ_j², using Var(ε_1) < ∞ by the weak white noise assumption,

which is finite if Σ_{j=0}^{∞} θ_j² < ∞ (squared summability). Of course if Var(Y_1) exists then so does E[Y_1] (that is, E[|Y_1|] < ∞) and

γ_s = Cov(Y_1, Y_{1−s}) = Var(ε_1) Σ_{j=0}^{∞} θ_j θ_{j−s}, s = 1, 2, ...,

as |γ_s| ≤ Var(Y_1). Thus we make the important conclusion:


the weak white noise driven MA(∞) is covariance stationary under square summability, Σ_{j=0}^{∞} θ_j² < ∞.

Later it will be convenient to work with another condition:

Σ_{j=0}^{∞} |θ_j| < ∞ (absolute summability).

It turns out this absolute summability condition on {θ_j} is enough to guarantee the absolute summability of the autocovariances {γ_s}, that is,

Σ_{s=−∞}^{∞} |γ_s| < ∞,

which is an important quantity statistically.


It is helpful to explicitly relate

Σ_{j=0}^{∞} θ_j² < ∞ to Σ_{j=0}^{∞} |θ_j| < ∞ and Σ_{s=−∞}^{∞} |γ_s| < ∞.

• Absolute summability implies squared summability. Why does this classic mathematics result hold? Assume absolute summability; then |θ_j| → 0 as j increases. Find a finite N > 0 such that for all j > N the corresponding |θ_j| < 1. Then

Σ_{j=0}^{∞} θ_j² = Σ_{j=0}^{N} θ_j² + Σ_{j=N+1}^{∞} θ_j² ≤ Σ_{j=0}^{N} θ_j² + Σ_{j=N+1}^{∞} |θ_j| ≤ Σ_{j=0}^{N} θ_j² + Σ_{j=0}^{∞} |θ_j| < ∞,

as Σ_{j=0}^{N} θ_j² is finite and Σ_{j=0}^{∞} |θ_j| < ∞, which implies the stated result.

• Absolute summability implies Σ_{s=−∞}^{∞} |γ_s| < ∞. Now (taking Var(ε_1) = 1 without loss of generality)

γ_s = Σ_{j=0}^{∞} θ_j θ_{j−s},

so

Σ_{s=0}^{∞} |γ_s| ≤ Σ_{s=0}^{∞} Σ_{j=0}^{∞} |θ_j||θ_{j−s}| ≤ (Σ_{s=0}^{∞} |θ_s|)² < ∞,

which implies the result.

3.3.4 Covariance stationarity and the sample mean

Averages are a core building block for much of statistics. The following result is helpful for inference purposes.

Theorem 3.3.8. For a covariance stationary {Y_t} process,

T × Var(Ȳ) = γ_0 + Σ_{s=1}^{T−1} (1 − |s|/T)(γ_s + γ_s^T) = Σ_{s=−(T−1)}^{T−1} (1 − |s|/T) γ_s. (3.8)

If Σ_{s=1}^{∞} |γ_s| < ∞, then as T increases

Var{√T(Ȳ − E[Y_1])} → Σ_{s=−∞}^{∞} γ_s.

The term

Σ_{s=−∞}^{∞} γ_s

is sometimes called the long-run variance.

Proof.

T × Var(Ȳ) = (1/T) Σ_{t=1}^{T} Σ_{j=1}^{T} Cov(Y_t, Y_j)
= (1/T) Σ_{j=1}^{T} Σ_{t=1}^{T} Cov(Y_1, Y_{1+(t−j)}), by covariance stationarity (3.9)
= Var(Y_1) + (1/T) Σ_{s=1}^{T−1} Σ_{t=1}^{T−s} {Cov(Y_1, Y_{1−s}) + Cov(Y_1, Y_{1+s})} (3.10)
= γ_0 + (1/T) Σ_{s=1}^{T−1} (T − |s|)(γ_s + γ_s^T) (3.11)
= γ_0 + Σ_{s=1}^{T−1} (1 − |s|/T)(γ_s + γ_s^T). (3.12)

As T goes to infinity the limit is straightforward, if it exists. Its existence is assumed here.

In time series, if the time average of a process converges to its expectation, the underlying process is said to be "ergodic." There is a vast mathematical discipline on ergodicity; here, when you hear ergodicity, you should just think that averages converge to their expectations.

Remark 8. A sufficient condition for the covariance stationary MA(∞) process to be ergodic is that Σ_{j=0}^{∞} |θ_j| < ∞. Why? We saw in the previous subsection that Σ_{j=0}^{∞} |θ_j| < ∞ is enough to force Σ_{s=1}^{∞} |γ_s| < ∞, which gives the result.

More broadly (i.e. not necessarily assuming an MA(∞) process), for covariance stationary processes a rather beautiful ergodicity result holds.

Lemma 1. For a covariance stationary {Y_t} process with lim_{s→∞} |γ(s)| = 0,

Ȳ →p E[Y_1].

Proof. Recall Cesàro's lemma, which says that for strictly positive real numbers b_n ↑ ∞ and a convergent sequence of reals v_n → v_∞ ∈ R,

(1/b_n) Σ_{k=1}^{n} (b_k − b_{k−1}) v_k → v_∞.

From (3.8),

Var(Ȳ) = (1/T){γ_0 + 2 Σ_{s=1}^{T−1} (1 − s/T)γ(s)} ≤ γ_0/T + (2/T) Σ_{s=1}^{T−1} |γ(s)| = γ_0/T + 2(1/T) Σ_{s=1}^{T−1} {s − (s − 1)}|γ(s)|
→ 0 + 2 lim_{s→∞} |γ(s)|, by Cesàro's lemma (taking b_s = s and v_s = |γ(s)|),
= 0.

Then the result follows from Chebyshev's inequality.

Example 3.3.9. Assume γ_s = |s|^{−α}, for α > 0; then

T × Var(Ȳ) = 1 + 2 Σ_{s=1}^{T−1} (1 − s/T) s^{−α}, (3.13)

while lim_{s→∞} |γ(s)| = 0. Hence by Lemma 1, Ȳ →p E[Y_1]. But at what speed does this convergence happen as T increases? The right hand side of (3.13) behaves like a "p-series" in mathematics. It diverges if α ≤ 1, that is

Σ_{s=−∞}^{∞} γ_s = ∞,

so when α ≤ 1 the Ȳ goes to E[Y_1] slower than T^{−1/2}! However, Σ_{s=−∞}^{∞} γ_s does exist if α > 1, in which case Ȳ goes to E[Y_1] at rate T^{−1/2}.

We finish with our second CLT. Here the usual driving weak white noise has been strengthened to a driving independent zero mean white noise assumption. The proof will be delayed to the next Chapter, when we discuss m-dependence. The same CLT holds if the driving noise is MD white noise.

Remark 9. Unfortunately weak white noise is not enough to drive a Gaussian CLT. Why? At first sight it is shocking, but the result is not deep. Focus on a very simple case, Y_t = Xε_t, where {ε_t} is i.i.d. with P(ε_t = 1) = P(ε_t = −1) = 1/2, X ⊥⊥ {ε_t}, Var(X) < ∞. Then E[ε_t] = 0 and so E[Y_t] = 0, while Var(Y_t) = E[X²] as ε_t² = 1. Crucially, for s ≠ 0,

E[Y_tY_{t−s}] = E[X²ε_tε_{t−s}] = E[X²]E[ε_t]E[ε_{t−s}] = 0, by independence,

so {Y_t} is weak white noise. But

√T(Ȳ − E[Y_1])/X = (1/√T) Σ_{t=1}^{T} ε_t →d U, U ∼ N(0, 1), U ⊥⊥ X.

Hence √T(Ȳ − E[Y_1]) →d U × X, a mixed Gaussian distribution, which is not Gaussian unless X is a constant. This holds even though for every T,

T × Var(Ȳ − E[Y_1]) = E[X²].

There is simply too much dependence allowed by weak white noise (here due to the common random scale X) to drive a Gaussian CLT.
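
This failure is easy to reproduce numerically; the following sketch (the exponential law for X is an illustrative assumption) exposes the heavy tails of the mixed Gaussian limit through its kurtosis:

```python
import numpy as np

# Y_t = X * eps_t with Rademacher eps_t: weak white noise, but sqrt(T)*Ybar
# converges to U*X, a mixed Gaussian, whose kurtosis is far above 3.
rng = np.random.default_rng(5)
T, reps = 1_000, 10_000

X = rng.exponential(1.0, (reps, 1))               # one common scale per path
eps = rng.choice([-1.0, 1.0], size=(reps, T))     # Rademacher shocks
z = np.sqrt(T) * (X * eps).mean(axis=1)           # = X * sum(eps)/sqrt(T)

kurt = np.mean(z ** 4) / np.var(z) ** 2
print(kurt)    # around 18 for exponential X, not the Gaussian value 3
```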

Before we state the CLT, we give a simple but very compact lemma.

Lemma 2. For a covariance stationary MA(∞) with absolute summability Σ_{j=0}^{∞} |θ_j| < ∞, define

θ(z) = Σ_{j=0}^{∞} θ_j z^j.

Then

Σ_{s=−∞}^{∞} γ_s = Var(ε_1) × θ(1)².

Proof. Set Var(ε_1) = 1; then

Ψ = Σ_{s=−∞}^{∞} Σ_{j=0}^{∞} θ_j θ_{j−s} = Σ_{j=0}^{∞} θ_j (Σ_{s=−∞}^{∞} θ_{j−s}) = (Σ_{j=0}^{∞} θ_j)(Σ_{i=0}^{∞} θ_i) = θ(1)².

Theorem 3.3.10. Assume an independent white noise driven MA(∞) with absolute summability Σ_{j=0}^{∞} |θ_j| < ∞; then, writing θ(z) = Σ_{j=0}^{∞} θ_j z^j and assuming θ(1)² > 0,

√T(Ȳ − E[Y_1]) →d N(0, Ψ), Ψ = Σ_{s=−∞}^{∞} γ_s = Var(ε_1) × θ(1)².

The Ψ is often called the long-run variance.
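
The distinction between the marginal variance and the long-run variance Ψ is easy to see by simulation; here is a minimal sketch for an MA(1) with unit variance Gaussian shocks (all values illustrative):

```python
import numpy as np

# For Y_t = eps_t + theta1*eps_{t-1}: Var(Y_1) = 1 + theta1^2, while the CLT
# variance of sqrt(T)*Ybar is Psi = theta(1)^2 = (1 + theta1)^2.
rng = np.random.default_rng(4)
theta1, T, reps = 0.5, 2_000, 5_000

eps = rng.standard_normal((reps, T + 1))
y = eps[:, 1:] + theta1 * eps[:, :-1]
z = np.sqrt(T) * y.mean(axis=1)

print(np.var(z))           # close to (1 + theta1)^2 = 2.25, the long-run variance
print(1 + theta1 ** 2)     # = 1.25, the marginal variance, which is not the limit
```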

It is helpful to take a step back and make a summary about MA(∞) processes.

• strict stationarity: it is certainly complicated!

• covariance stationarity needs Σ_{j=0}^{∞} θ_j² < ∞, when driven by weak white noise;

• ergodicity holds if Σ_{j=0}^{∞} |θ_j| < ∞, when driven by weak white noise;

• a CLT needs Σ_{j=0}^{∞} |θ_j| < ∞ plus more than weak white noise (e.g. independence or MD white noise);

• the CLT asymptotic variance is Ψ = Var(ε_1) × θ(1)², the long-run variance.

3.4 Method of moments

The method of moments is one of the three major estimation strategies discussed in Introductory Statistics.

It is due to Karl Pearson from 1894. The other two are MLE and Bayesian inference. The broad idea is to
relate the estimands to moments and then the method of moment principle replaces estimands by estimators
and moments by averages.

3.4.1 Covariance stationary

Means and covariances naturally appear in covariance stationary processes, so the method of moments is often

used for covariance stationary processes.


To start thinking, suppose {Yt } is a process and the transformed process {h(Yt:t+s−1 )} is covariance station-
ary.

Example 3.4.1. Think of s = 3 and h(Yt:t+s−1 ) = Yt Yt+2 .

Let θ ∈ Rp be an estimand and

α(θ) = EY1:s ;θ [h(Y1:s )], (3.14)

where α is a one-to-one function Rp → Rp , so

θ = α−1 (EY1:s ;θ [h(Y1:s )]) .

Here FY1:s ;θ is the marginal distribution of Y1:s (think of it as the stationary distribution) and
Z
EY1:s ;θ [h(Y1 )] = h(y1:s )fY1:s ;θ (y1:s )dy1 ...dys .

Applying the method of moments principle to (3.14) starts with

α(θ̂_MoM) = (1/(T − s)) Σ_{t=1}^{T−s} h(Y_{t:t+s}), implying θ̂_MoM = α^{−1}((1/(T − s)) Σ_{t=1}^{T−s} h(Y_{t:t+s})),

where θ̂_MoM is a method of moments estimator (typically, for any given scientific problem, there is a massive number of method of moments estimators available — so the function h should be selected with some care).

h 3.4.2. For an MA(1) process parameterized to have Var(Y_1) = 1,

γ_1 = Cov(Y_1, Y_0) = α(θ_1) = θ_1/(1 + θ_1²), θ_1 ∈ Θ = [−1, 1],

which implies that

γ_1θ_1² − θ_1 + γ_1 = 0, yielding the pair of solutions for θ_1: (1 ± √(1 − 4γ_1²))/(2γ_1).

If |γ_1| ≤ 1/2, the pair of solutions to γ_1θ_1² − θ_1 + γ_1 = 0 are real. If |γ_1| > 1/2 the solutions are a pair of complex conjugates. If γ_1 ∈ [−1/2, 1/2] then the first real solution

(1 − √(1 − 4γ_1²))/(2γ_1) ∈ [−1, 1]

(to show the range, when γ_1 ∈ [0, 1/2], note 1 − √(1 − 4γ_1²) = 1 − (1 − 2γ_1)√((1 + 2γ_1)/(1 − 2γ_1)) ≤ 2γ_1). If γ_1 ∈ (−1/2, 1/2) then the second real solution

(1 + √(1 − 4γ_1²))/(2γ_1) ∉ [−1, 1],

which is statistically redundant. Hence it makes sense to take the solution

θ_1 = α^{−1}(γ_1) = (1 − √(1 − 4γ_1²))/(2γ_1) if |γ_1| ≤ 1/2, and θ_1 = NA if |γ_1| > 1/2,

as the method of moments principle offers no guidance about how to select θ_1 when |γ_1| > 1/2 — which is a big fail (a reasonable ad hoc choice would be to replace NA with sign(γ_1), but that goes outside the method of moments principle). As a result the method of moments estimator becomes:

θ̂_1 = (1 − √(1 − 4γ̂_1²))/(2γ̂_1) if |γ̂_1| ≤ 1/2, and θ̂_1 = NA if |γ̂_1| > 1/2.

Of course if T is large then γ̂_1 should be close to θ_1/(1 + θ_1²) ∈ [−1/2, 1/2], so P(θ̂_1 = NA) will go to zero fast. However, it will not if the true value of θ_1 ∈ {−1, 1}, where the NA solution will be problematic. Famously, MoM estimators can be good, bad and ugly.
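
Here is the estimator as runnable code (a minimal sketch; the simulated θ_1 and sample size are illustrative), including the NA branch:

```python
import numpy as np

def ma1_mom(y):
    # Method of moments for the MA(1) with Var(Y_1) normalized:
    # invert gamma1_hat = theta1/(1 + theta1^2), returning NA (nan)
    # when |gamma1_hat| > 1/2.
    g1 = np.corrcoef(y[1:], y[:-1])[0, 1]
    if abs(g1) > 0.5:
        return np.nan                      # the "big fail" region
    return (1.0 - np.sqrt(1.0 - 4.0 * g1 ** 2)) / (2.0 * g1)

rng = np.random.default_rng(6)
theta1, T = 0.4, 50_000
eps = rng.standard_normal(T + 1)
y = eps[1:] + theta1 * eps[:-1]
print(ma1_mom(y))     # close to 0.4
```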

A different framing of the same idea, which is somewhat more opaque but more flexible, is to work with the expectation of a function of the data and the parameter, E_{Y_{1:s};θ_0}[g(Y_{1:s}, θ)], where θ_0 is the true value of the estimand. Then assume that the p-dimensional g function is chosen so that

E_{Y_{1:s};θ_0}[g(Y_{1:s}, θ_0)] = 0_p, (3.15)

while E_{Y_{1:s};θ_0}[g(Y_{1:s}, θ)] ≠ 0_p for all θ ∈ Θ with θ ≠ θ_0. Then (3.15) is called a moment condition and θ = θ_0 is its unique solution.

Example 3.4.3. Selecting g(Y_{1:s}, θ) = h(Y_{1:s}) − α(θ) from (3.14) reproduces the first approach when α is 1-1 (which guarantees uniqueness). Writing g(Y_{1:s}, θ) = ∂log L(θ; Y_{1:s})/∂θ (the score) produces the typical case of maximum likelihood estimation.

Applying the method of moments principle to this problem, replacing the expectation by an average and the estimand by an estimator, yields

(1/(T − s)) Σ_{t=1}^{T−s} g(Y_{t:t+s}, θ̂_MoM) = 0_p.

Often θ̂_MoM has to be found numerically (it may not be unique). This is often implemented by minimizing

θ̂_MoM = arg min_{θ∈Θ} {(1/(T − s)) Σ_{t=1}^{T−s} g(Y_{t:t+s}, θ)}^T {(1/(T − s)) Σ_{t=1}^{T−s} g(Y_{t:t+s}, θ)}.

h 3.4.4. Proving θ = θ0 is the unique θ which solves EY1:s ;θ0 [g(Y1:s , θ)] = 0p is difficult for generic MoM
problems. It is typically carried out case-by-case or not at all.

Consistency under covariance stationarity


Assume {h(Y_{t:t+s})}_{t=−∞}^{∞} is covariance stationary and write

γ_j = Cov(h(Y_{1:s}), h(Y_{1−j:s−j})), j = 0, 1, ....

If |γ_j| → 0 as j → ∞, then Lemma 1 applies, so

(1/(T − s)) Σ_{t=1}^{T−s} h(Y_{t:t+s}) →p E[h(Y_{1:s})],

and so by the continuous mapping theorem

θ̂_MoM →p α^{−1}(E[h(Y_{1:s})]) = θ,

that is, the method of moments estimator is consistent. This simple idea has been applied massively in time series over the decades.

In terms of moment conditions, assume g(Y_{t:t+s}, θ) is covariance stationary for each value of θ in Θ; then, if the autocovariances go to 0 at large lags, pointwise

(1/(T − s)) Σ_{t=1}^{T−s} g(Y_{t:t+s}, θ) →p E_{Y_{1:s};θ_0}[g(Y_{1:s}, θ)].

Hence θ̂_MoM →p θ_0 as

θ_0 = arg min_{θ∈Θ} E_{Y_{1:s};θ_0}[g(Y_{1:s}, θ)]^T E_{Y_{1:s};θ_0}[g(Y_{1:s}, θ)], under uniqueness of θ_0.

h 3.4.5. The driving assumption is that {g(Y_{t:t+s}, θ)} is covariance stationary for every θ, which may be difficult to show outside cases where g is linear in Y_t or cases where {Y_t} is assumed to be a strictly stationary process.

Inference under covariance stationarity

None of the above gives us an inference engine. The usual way of building one is by applying the delta method.

Remark 10. [Recalling the delta method] Assume λ̂ →p λ; the delta method then establishes the limit distribution of β(λ̂). Assume β is continuously differentiable; if T is large then λ̂ − λ should be small, so by a first order Taylor expansion

β(λ̂) − β(λ) ≃ (∂β(λ)/∂λ^T)(λ̂ − λ).

Hence the delta method needs a CLT for λ̂ − λ and the ability to calculate or approximate ∂β(λ)/∂λ^T. See Chapter 3 of Blitzstein and Shephard (2023) for more details.

This looks applicable for the method of moments estimator, for

θ̂_MoM = α^{−1}((1/(T − s)) Σ_{t=1}^{T−s} h(Y_{t:t+s})),

a function of an average. We need two more pieces of work:

• The derivative of an inverse function:

∂α^{−1}(λ)/∂λ^T = (∂α(λ)/∂λ^T)^{−1}, λ = E[h(Y_{1:s})],

which is simple. Notice this will be big if α(λ) is unresponsive to λ.

• A central limit theorem:

√T(λ̂ − λ) = (1/T^{1/2}) Σ_{t=1}^{T} {h(Y_{t:t+s}) − E[h(Y_{t:t+s})]} →d N(0, Ψ(h)), Ψ(h) = Σ_{j=−∞}^{∞} Cov(h(Y_{1:s}), h(Y_{1−j:s−j})),

which holds if {h(Y_{t:t+s})} obeys such a CLT. As we saw in Section 3.3.4 this is not simple! It does hold, for example, for an independent, zero mean white noise driven MA(∞) process with Σ_{j=0}^{∞} |θ_j| < ∞.

If those two pieces hold, then by Slutsky's theorem the method of moments estimator would have the CLT

√T(θ̂_MoM − θ) →d N(0, (∂α(λ)/∂λ^T)^{−1} Ψ(h) {(∂α(λ)/∂λ^T)^{−1}}^T).

If the MoM is phrased through a moment condition then the equivalent CLT would be¹

√T(θ̂_MoM − θ_0) →d N(0, H^{−1} Ψ(g) (H^{−1})^T), assuming H > 0, (3.16)

where, for covariance stationary ∂g(Y_{t:t+s}, θ_0)/∂θ^T,

(1/T) Σ_{t=1}^{T} ∂g(Y_{t:t+s}, θ_0)/∂θ^T →p H = E[∂g(Y_{1:s}, θ_0)/∂θ^T]. (3.17)

3.4.2 Inference under martingale differences

How do models which involve martingale differences map into the method of moments?
¹Recall why (3.16) holds; this is from Introductory Statistics. Now θ̂_MoM − θ_0 is small, so by a Taylor expansion

0 = Σ_{t=1}^{T} g(Y_t, θ̂_MoM) ≃ Σ_{t=1}^{T} g(Y_t, θ_0) + {Σ_{t=1}^{T} ∂g(Y_t, θ_0)/∂θ^T}(θ̂_MoM − θ_0),

so

√T(θ̂_MoM − θ_0) ≃ −{(1/T) Σ_{t=1}^{T} ∂g(Y_t, θ_0)/∂θ^T}^{−1} (1/T^{1/2}) Σ_{t=1}^{T} g(Y_t, θ_0), assuming (1/T) Σ_{t=1}^{T} ∂g(Y_t, θ_0)/∂θ^T > 0.

Hence for the CLT we need to derive two results, (3.17) and a CLT for {g(Y_t, θ_0)}; if we have those two we can apply Slutsky's theorem to yield the result. The (3.17) holds if, for example, {∂g(Y_t, θ_0)/∂θ^T} is covariance stationary and its autocovariance function goes to 0 as the lag length extends to infinity.

Example 3.4.6. In Section 2.3.3 we saw how {U_t′}, the marginal utility from a choice, follows a martingale difference condition

E[U_t′|F_{t−1}] = 0,

for each t. Think about a researcher parameterizing a utility function, e.g.

U_t(y_t; θ) = (1 − e^{−θy_t})/θ,

and using data to learn the risk aversion parameter θ. Recall the structure of this problem. The optimal choice is (assuming E[|U_t(Y_t(a_{t−1}); θ)|] < ∞ for all a_{t−1} and θ)

â_{t−1}(θ) = arg max_{a_{t−1}} E[U_t(Y_t(a_{t−1}); θ)|F_{t−1}],

then

U_t′(Y_t, θ) = ∂U_t(Y_t(a_{t−1}); θ)/∂a_{t−1} evaluated at a_{t−1} = â_{t−1}(θ_0), with Y_t = Y_t(â_{t−1}(θ_0)),

such that

E[U_t′(Y_t, θ_0)|F_{t−1}] = 0.

Hence if we see a time series of {Y_t} we can estimate θ by using the method of moments

(1/T) Σ_{t=1}^{T} U_t′(Y_t, θ̂_MoM) = 0.

One of the attractive aspects of this is that there is no need to model the law of the process {Y_t}. The assumption of the choice problem induces the martingale difference sequence, which drives the MoM estimator. This type of approach to inference on choice problems using time series data was advocated by Hansen and Singleton (1982).

Now think about this more abstractly. Suppose θ_0 is such that G_t = g(Y_t, θ_0) forms a sequence {G_t} which is a martingale difference in L2 with respect to the filtration {F_t}. Then this is a conditional moment condition, E[g(Y_t, θ_0)|F_{t−1}] = 0, and so (as it is assumed L2) it must be that

E[g(Y_t, θ_0)] = 0, t = 1, ..., T.

Recall the notation

M_t = Σ_{j=1}^{t} G_j, ⟨M, M⟩_t = Σ_{j=1}^{t} E[G_jG_j^T|F_{j−1}], S_t = E[⟨M, M⟩_t] = Σ_{j=1}^{t} E[G_jG_j^T].

Theorem 3.4.7. Assume {G_t} is an MD L2 process with respect to the filtration {F_t} and that uniqueness of the moment constraint E[g(Y_t, θ_0)] = 0 holds; then

θ̂_MoM →a.s. θ_0,

whenever ⟨M, M⟩_T^{−1} → 0.

Why? The MD SLLN in Lemma 2.3.14 says that

⟨M, M⟩_T^{−1} Σ_{t=1}^{T} g(Y_t, θ_0) →a.s. 0,

whenever ⟨M, M⟩_T^{−1} → 0. This is enough for θ̂_MoM →a.s. θ_0 (assuming uniqueness).

Further, under a Lindeberg condition and S_T^{−1}⟨M, M⟩_T →p I_p,

S_T^{−1/2} Σ_{t=1}^{T} g(Y_t, θ_0) →d N(0, I_p).

If {G_t} is covariance stationary as well as being an MD sequence, then S_t = tE[G_1G_1^T], and the familiar CLT for the MoM, (3.16), would hold with a particularly simple to estimate Ψ(g) = E[G_1G_1^T]. This is summarized in the following theorem.

Theorem 3.4.8. Assume {G_t} is a martingale difference sequence with respect to the filtration {F_t}, is covariance stationary, and that uniqueness of the moment constraint E[g(Y_t, θ_0)] = 0 holds; then

√T(θ̂_MoM − θ_0) →d N(0, E[∂g(Y_1, θ_0)/∂θ^T]^{−1} E[G_1G_1^T] {E[∂g(Y_1, θ_0)/∂θ^T]^{−1}}^T),

if the matrix inverses above exist.

Hence the CLT for a MoM based on a covariance stationary martingale difference sequence {G_t} is as if the {G_t} were i.i.d..

3.5 Invertibility

Think about a single random variable

Y = θ_0ε,

where θ_0 is a known d × d matrix of constants such that θ_0^{−1} exists. Then ε = θ_0^{−1}Y; so, informally, seeing Y implies seeing ε.

What is the time series version of this? Think about the MA(1) process

Y_t = θ_0ε_t + θ_1ε_{t−1},

and work with the natural filtration F_t, that is, the one generated by ..., Y_0, Y_1, ..., Y_t.

• Now ask: can I see ε_t from F_t, the history of the observed series up to time t? More formally, is ε_t ∈ F_t, i.e. adapted to the filtration?

• Now ask: why would I care if ε_t ∈ F_t? It is jolly useful, for it means ε_{t−1} ∈ F_{t−1}, so

(Y_t − θ_1ε_{t−1})|F_{t−1} =ᴸ θ_0ε_t,

so the predictive distribution is

f_{Y_t|F_{t−1}}(y_t) = f_{θ_0ε_t}(y_t − θ_1ε_{t−1}).

Hence we can simply convert the MA(1) into a super simple predictive form.

More broadly, perhaps ε_t ∉ F_t but knowing F_t tells us a lot about ε_t, e.g. |ε_t − E[ε_t|F_t]| is small. Here we will focus on a formalization of this idea which goes back to Granger and Andersen (1978).

Definition 3.5.1. For any η > 0, if there exists an ε̂_t ∈ F_t (e.g. ε̂_t = E[ε_t|F_t]) such that

lim_{t→∞} P({|ε_t − ε̂_t| > η}|F_t) = 0,

then {Y_t} is called an invertible process.

3.5.1 Invertibility of moving average

For the univariate MA(1), with Y_t = θ_0ε_t + θ_1ε_{t−1}, start with an arbitrary ε̂_0, and then define ε̂_t such that

θ_0ε̂_t = Y_t − θ_1ε̂_{t−1}, t = 1, ...,

so

θ_0ε̂_t = Y_t − θ_1ε̂_{t−1} = θ_0ε_t + θ_1(ε_{t−1} − ε̂_{t−1}).

Rewriting,

(ε_t − ε̂_t) = −θ_0^{−1}θ_1(ε_{t−1} − ε̂_{t−1}), a difference equation in ε_t − ε̂_t,
            = (−θ_0^{−1}θ_1)^t (ε_0 − ε̂_0),

or, letting w_t = ε_t − ε̂_t,

w_t = ϕw_{t−1} = ϕ^t w_0, ϕ = −θ_0^{−1}θ_1. (3.18)

So if |θ_1/θ_0| < 1 then the process is invertible, as |θ_1/θ_0|^t → 0. Statistically the most worked version of the MA(1) process is where θ_0 = 1, in which case the invertibility condition is that

|θ_1| < 1.
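
The recursion (with θ_0 = 1) is a two-line algorithm, and the geometric decay of ε_t − ε̂_t is visible immediately; here is a minimal sketch with illustrative values:

```python
import numpy as np

# Recover the shocks of an invertible MA(1) via eps_hat_t = Y_t - theta1*eps_hat_{t-1},
# started from a deliberately wrong eps_hat_0; the error decays like |theta1|^t.
rng = np.random.default_rng(7)
theta1, T = 0.6, 200
eps = rng.standard_normal(T + 1)
y = eps[1:] + theta1 * eps[:-1]           # Y_1, ..., Y_T

eps_hat = np.zeros(T)
prev = 5.0                                # bad starting value eps_hat_0
for t in range(T):
    eps_hat[t] = y[t] - theta1 * prev
    prev = eps_hat[t]

err = np.abs(eps_hat - eps[1:])
print(err[[0, 10, 50, 150]])              # geometric decay towards zero
```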

For the MA(q) the same argument yields

θ_0ε̂_t = Y_t − θ_1ε̂_{t−1} − ... − θ_qε̂_{t−q} = θ_0ε_t + θ_1(ε_{t−1} − ε̂_{t−1}) + ... + θ_q(ε_{t−q} − ε̂_{t−q}),

so

w_t = ϕ_1w_{t−1} + ... + ϕ_qw_{t−q}, ϕ_j = −θ_0^{−1}θ_j, w_t = ε_t − ε̂_t. (3.19)
εt = w t . (3.19)

To study its properties, define the q × 1 vector

Z_t = (w_t, w_{t−1}, ..., w_{t−q+1})^T;

then (3.19) can be written compactly in its "companion form"

Z_t = ϕZ_{t−1},

where ϕ is the q × q matrix whose first row is (ϕ_1, ϕ_2, ..., ϕ_q), with ones on the subdiagonal and zeros elsewhere. Companion forms will appear prominently in Section 4.1.2.

We will see that the MA(q) is invertible if each eigenvalue of the q × q matrix ϕ has an absolute value which is less than one.
Write the eigenvalues as λ_{1:q} and the corresponding eigenvectors as u_1, ..., u_q, so

ϕu_j = λ_ju_j, j = 1, ..., q. (3.20)

I now claim that, for any real number a_j, the sequence

Z_t = a_j(λ_j)^t u_j, t = 1, 2, ..., (3.21)

satisfies Z_t = ϕZ_{t−1}. To see this,

ϕZ_{t−1} = a_j(λ_j)^{t−1} ϕu_j, from the definition of Z_{t−1} in (3.21)
         = a_j(λ_j)^{t−1} λ_ju_j, from (3.20)
         = a_j(λ_j)^t u_j
         = Z_t, from the definition of Z_t in (3.21).

Then do this q times, so a possible solution to Z_t = ϕZ_{t−1} is

Z_t = Σ_{i=1}^{q} a_iλ_i^t u_i,

so, immediately,

ϕZ_{t−1} = Σ_{i=1}^{q} a_iλ_i^{t−1} ϕu_i = Σ_{i=1}^{q} a_iλ_i^t u_i = Z_t.

Hence if

|λ_j| < 1, j = 1, ..., q,

then |Z_t| → 0 as t → ∞, for whatever choice of constants a_{1:q}. This is enough to guarantee that the MA(q) is invertible.
invertible.

Example 3.5.2. In the MA(2) case with θ_0 = 1, so ϕ_j = −θ_j,

ϕ = [[−θ_1, −θ_2], [1, 0]],

and the eigenvalues are the roots of

0 = |λI_2 − ϕ| = λ² + θ_1λ + θ_2,

a 2nd order polynomial in λ with coefficients θ_{1:2}. Hence the pair of roots,

λ_{1:2} = (−θ_1 ± √(θ_1² − 4θ_2))/2,

need, in absolute value, to be less than one. The invertibility region of the MA(2) process in terms of θ_{1:2} is

θ_1 + θ_2 > −1, θ_1 − θ_2 < 1, θ_2 < 1,

which appears in Figure 3.1 and is the subject of Theorem 3.5.3. There are three types of root pairs: both real and distinct (θ_1² − 4θ_2 > 0); a complex conjugate pair (θ_1² − 4θ_2 < 0); and 2 real roots both equal to −θ_1/2 (θ_1² − 4θ_2 = 0). The picture shows the values of θ_{1:2} with complex roots.

[Figure 3.1 here: the (θ_1, θ_2) plane for θ_1 ∈ [−2, 2] and θ_2 ∈ [−1.5, 1.5], showing the triangular invertibility region of the MA(2), the non-invertible areas outside it, and the subregions with complex roots and with real roots.]

Figure 3.1: Region of MA(2) invertibility for {θ_1, θ_2}.

Theorem 3.5.3. For the MA(2) the θ1:2 which yield invertibility satisfy

θ1 + θ2 > −1, θ1 − θ2 < 1, θ2 < 1.


Proof. Recall the roots are (−θ_1 ± √(θ_1² − 4θ_2))/2. Imaginary roots: the roots are −θ_1/2 ± (i/2)√(−θ_1² + 4θ_2), so the squared absolute value is

θ_1²/4 + (1/4)(−θ_1² + 4θ_2) = θ_2, ⇒ θ_2 < 1.

Real roots: θ_1² − 4θ_2 > 0. The largest real root being less than one implies −θ_1 + √(θ_1² − 4θ_2) < 2, so

θ_1² − 4θ_2 < (2 + θ_1)² = 4 + 4θ_1 + θ_1², ⇒ −1 < θ_2 + θ_1.

Finally, the smallest being greater than minus one implies −2 < −θ_1 − √(θ_1² − 4θ_2), so

(θ_1 − 2)² > θ_1² − 4θ_2, ⇒ 1 > θ_1 − θ_2.

Hence the stated result holds.
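
The theorem can be verified numerically by comparing the three inequalities with a direct eigenvalue check; a minimal sketch (the draw count is illustrative):

```python
import numpy as np

def invertible_by_roots(theta1, theta2):
    # All roots of lambda^2 + theta1*lambda + theta2 strictly inside the unit circle.
    return bool(np.all(np.abs(np.roots([1.0, theta1, theta2])) < 1))

def invertible_by_region(theta1, theta2):
    # The three inequalities of Theorem 3.5.3.
    return (theta1 + theta2 > -1) and (theta1 - theta2 < 1) and (theta2 < 1)

rng = np.random.default_rng(8)
draws = rng.uniform(-2, 2, size=(20_000, 2))
agree = [invertible_by_roots(a, b) == invertible_by_region(a, b) for a, b in draws]
print(np.mean(agree))     # 1.0, up to ties exactly on the boundary
```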


Remark 11. Recall, in the complex conjugate pair case, λ_1 = b + ic and λ_2 = b − ic, where i = √−1 (when the eigenvalues appear in complex conjugate pairs, then so do the eigenvectors: u_1 = d + ig and u_2 = d − ig, where d and g are real vectors). In that case |λ_1| = √((b + ic)(b − ic)) = √(b² + c²) = |λ_2|. Hence

b² + c² < 1

is needed for |λ_1| < 1. This is sometimes described as the roots being inside the unit circle. The exponential form of the conjugate complex pair λ_1 = b + ic and λ_2 = b − ic writes λ_1 = ρe^{iθ}, with ρ = √(b² + c²) and θ = tan^{−1}(c/b), while λ_2 = ρe^{−iθ}. Then λ_1^t = ρ^t e^{iθt} and λ_2^t = ρ^t e^{−iθt}, so |λ_1^t| = ρ^t.

The above result, relating the eigenvalues to the roots of a polynomial, holds for the MA(q) process (here written with θ_0 = 1), where

0 = |λI_q − ϕ| = λ^q + θ_1λ^{q−1} + ... + θ_q, (3.22)

a q-th order polynomial. The roots will be a collection of distinct real roots, repeated real roots and pairs of complex conjugate roots. If they are all inside the unit circle then the MA(q) is invertible.

3.5.2 Lag operator and lag polynomial

Before we end, think a little more abstractly; this will help later with models like the autoregression. Recall from Chapter 1 the lag operator.

Definition 3.5.4. Define a lag operator L, which works on any time series y_t, so that

Ly_t = y_{t−1}, L^{−1}y_t = y_{t+1}.

In engineering L is often called a backshift operator (sometimes denoted by B).

Then the MA(q) has Y_t = ε_t + θ_1ε_{t−1} + ... + θ_qε_{t−q} = θ(L)ε_t, where

θ(z) = 1 + θ_1z + ... + θ_qz^q,

a q-th order lag polynomial. The θ(L) is the lag polynomial. We saw θ(z) in the CLT for MA(∞) processes in Theorem 3.3.10 and Lemma 2.

Now consider

z^qθ(z^{−1}) = z^q + θ_1z^{q−1} + ... + θ_q

instead — which is the polynomial we saw above in (3.22). Notice that if z solves θ(z) = 0, then z^{−1} solves z^qθ(z^{−1}) = 0. The roots of z^qθ(z^{−1}) = 0 are therefore 1/a_1, ..., 1/a_q, the reciprocals of the roots a_1, ..., a_q of θ(z) = 0. Then the requirement for invertibility, that the roots of z^qθ(z^{−1}) = 0 are strictly inside the unit circle, implies the roots of θ(z) = 0 are outside the unit circle. This leads to confusion: sometimes researchers talk about roots being inside, and sometimes outside, the unit circle. The reason is the swapping between these two polynomials: θ(z) and z^qθ(z^{−1}).
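
The reciprocal-roots relationship is easy to see numerically; a minimal sketch for an MA(2) with illustrative coefficients:

```python
import numpy as np

theta1, theta2 = 0.9, 0.2

# Roots of theta(z) = 1 + theta1*z + theta2*z^2 (coefficients, highest power first):
roots_theta = np.roots([theta2, theta1, 1.0])
# Roots of z^2 * theta(1/z) = z^2 + theta1*z + theta2:
roots_flip = np.roots([1.0, theta1, theta2])

print(np.sort(1 / roots_theta))   # [-0.5, -0.4]
print(np.sort(roots_flip))        # the same values: the roots are reciprocals
```

Both sets of reciprocal roots lie inside the unit circle here, so this MA(2) is invertible.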

3.6 Recap

This Chapter has covered a lot of ground! The main topics are listed in Table 3.1.

Formula or idea | Description or name
 | strict stationarity
ε_t + θε_{t−1} | moving average
 | covariance stationarity
γ_s = Cov(Y_t, Y_{t−s}) | autocovariance
E[Y_t] = 0; γ_s = 0, s ≠ 0 | weak white noise
 | method of moments
E[g(Y_t, θ_0)] = 0 | moment condition
LY_t = Y_{t−1} | lag operator
ϕ(L) = 1 − ϕ_1L − ... − ϕ_pL^p | lag polynomial
ε_t ∈ F_t | invertibility
Y_{−c:0} | initial conditions

Table 3.1: Main ideas and notation in Chapter 3.

Students often get confused as to the point of stationarity. Thinking of it as the time series version of the identically distributed assumption you see in Introductory Statistics really puts you on a sound footing. Stationarity opens up the potentially wide use of the method of moments, based upon the stationary distribution, martingale differences or covariance stationarity type properties.
Chapter 4

Memory

Recall from (2.2) the prediction decomposition

f_{Y_{1:T}}(y_{1:T}) = ∏_{t=1}^{T} f_{Y_t|(Y_{1:t−1}=y_{1:t−1})}(y_t).

This places the one-step ahead prediction at the centre of much of modern time series. This Chapter will focus on how Y_t is impacted by Y_t's past, Y_{1:t−1}, that is, the degree of memory in the process.

4.1 Markov process


4.1.1 Basic case

The most famous special case of one-step ahead prediction is a Markov process.

Definition 4.1.1. [Markov process] If Y_{1:T} has, for every t = 1, ..., T, the property

Y_t|(Y_{1:t−1} = y_{1:t−1}) =ᴸ Y_t|(Y_{t−1} = y_{t−1}),

then it is called a Markov process. An equivalent way of expressing this is that

{Y_t ⊥⊥ Y_{1:t−2}}|Y_{t−1},

where A ⊥⊥ B is the standard notation for the random variables A and B being independent, and (A ⊥⊥ B)|C notes A and B are independent conditional on C.

The implication of this is

f_{Y_{1:T}}(y_{1:T}) = ∏_{t=1}^{T} f_{Y_t|(Y_{t−1}=y_{t−1})}(y_t).

Two simple special cases of the Markov process stand out in their use in applied work.

Definition 4.1.2. Let Y_t ∈ {0, 1} for each t; then the binary Markov process {Y_t} is governed by the transition probabilities

P(Y_t = i|Y_{t−1} = j), (i, j) ∈ {0, 1}².

Definition 4.1.3. The first order linear d-dimensional autoregressive process {Y_t} (denoted VAR(1) for vector autoregression, and AR(1) in the scalar case) sets

Y_t = ϕ_1Y_{t−1} + ε_t, ε_t i.i.d., t = 1, ..., T, (4.1)

which reminds us of regression: regressing Y_t on Y_{t−1}. The case of ϕ_1 = I_d, that is,

Y_t = Y_{t−1} + ε_t, (4.2)

is called a random walk. In the random walk case the increments are

∆Y_t = ε_t.

The VAR(1) process in (4.1) can be reparameterized, placing ∆Y_t at its heart, by subtracting Y_{t−1} from both sides:

∆Y_t = γY_{t−1} + ε_t, γ = ϕ_1 − I_d
     = αβ^TY_{t−1} + ε_t, γ (d × d) = α (d × r) × β^T (r × d) (a rank factorization, non-unique; r = rank(γ) ∈ {0, 1, ..., d}).

This is called an error correction model, which appears often in applied research. When r < d this is an example of reduced rank regression: regressing the d-dimensional ∆Y_t on the r-dimensional β^TY_{t−1}.

4.1.2 Companion form and K-th order Markov processes

The K-th order Markov process extends the Markov model to

Y_t|(Y_{1:t−1} = y_{1:t−1}) =ᴸ Y_t|(Y_{t−K:t−1} = y_{t−K:t−1}),

allowing K lags to impact the prediction.

Example 4.1.4. Shannon (1948) used a K-th order Markov model of text, initiating the field of quantitative language models.

Definition 4.1.5. The p-th order linear autoregressive process (VAR(p)),

Y_t = ϕ_1Y_{t−1} + ... + ϕ_pY_{t−p} + ε_t, ε_t i.i.d.,

is a p-th order Markov process.

A K-th order Markov model can be written as a dK-dimensional Markov process. This is helpful conceptually (as Markov thinking can be immediately ported to many initially non-Markov processes) and computationally (code based on a Markov structure can be used much more widely).

Theorem 4.1.6. Assume {Yt } is a K-order Markov process and define the stacked variable Zt := Yt−K:t . Then
{Zt } is Markovian if the dimension of Zt does not grow systematically with t. If the dim(Zt ) does not change
with t then {Zt } is called a “companion form” of the non-Markovian process. If {Yt } is strictly stationary,

then so is {Zt }. If {Yt } is covariance stationary, then so is {Zt }.

Proof. Set K = 2, as the other cases follow trivially. By the 2nd order Markovian assumption

Y_t|Y_{1:t−1} =ᴸ Y_t|Y_{t−2:t−1}.

Obviously {Y_t} is not a Markov chain. Work with the discrete case; then

P(Y_t = y_t|Y_{1:t−1} = y_{1:t−1}) = P(Y_t = y_t|Y_{t−2:t−1} = y_{t−2:t−1})
= P(Z_{t,1} = y_t, Z_{t,2} = y_{t−1}|Z_{t−1} = y_{t−2:t−1}), where Z_t = (Z_{t,1}, Z_{t,2})^T,
= P(Z_t = y_{t−1:t}|Z_{t−1} = y_{t−2:t−1}),

which is the stated result. The result on strict stationarity is definitional; the covariance case is immediate.

Example 4.1.7. For the VAR(p) from Definition 4.1.5, stacking Y_{t−p+1:t} produces the companion form

Z_t = ϕZ_{t−1} + ε̄_t,

where Z_t = (Y_t^T, Y_{t−1}^T, ..., Y_{t−p+1}^T)^T, ε̄_t = (ε_t^T, 0_d^T, ..., 0_d^T)^T, and ϕ is the pd × pd block matrix whose first block row is (ϕ_1, ϕ_2, ..., ϕ_p), with identity blocks I_d on the block subdiagonal and zero blocks elsewhere: a pd-dimensional VAR(1).
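
In code the companion form is a few lines of block stacking; a minimal sketch (the coefficient matrices below are illustrative):

```python
import numpy as np

def companion(phis):
    # Build the pd x pd companion matrix from d x d blocks phi_1, ..., phi_p.
    p, d = len(phis), phis[0].shape[0]
    top = np.hstack(phis)                                       # [phi_1 ... phi_p]
    below = np.hstack([np.eye(d * (p - 1)), np.zeros((d * (p - 1), d))])
    return np.vstack([top, below])

phi1 = np.array([[0.5, 0.1], [0.0, 0.3]])
phi2 = np.array([[0.2, 0.0], [0.1, 0.1]])
Phi = companion([phi1, phi2])
print(Phi.shape)                                  # (4, 4)
print(np.max(np.abs(np.linalg.eigvals(Phi))))     # below one here: a stable companion
```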

Example 4.1.8. An impactful model (we will see it prominently in Chapter 5) is where

∆²Y_t = ε_t, ε_t i.i.d., noting ∆²Y_t = ∆Y_t − ∆Y_{t−1} = Y_t − 2Y_{t−1} + Y_{t−2}.

Now {Y_t} is 2nd order Markovian. Keep track of the first differences by writing

β_{t−1} = ∆Y_t.

Then

Y_t = Y_{t−1} + β_{t−1}, β_t = β_{t−1} + ε_{t+1}.

Stacking into a vector Z_t = (Y_t, β_t)^T, then

Z_t = [[1, 1], [0, 1]] Z_{t−1} + (0, ε_{t+1})^T,

which is in companion form — so {Z_t} is a VAR(1) and hence Markovian.



Example 4.1.9. Suppose

Y_t|G_{1:t}, Y_{1:t−1} =ᴸ Y_t|G_t, G_t = g(G_{t−1}, Y_{t−1}), t = 2, ..., T, dim(G_t) = dim(G_{t−1}).

Then {Y_t} is not Markovian, but {(Y_t, G_t)} is. A special case of this is the GARCH(1,1) model of Bollerslev (1986), which writes Y_t = G_t^{1/2}ε_t and G_t = α + βG_{t−1} + γY_{t−1}², regarding (α, β, γ) as known (or as parameters) and {ε_t} as i.i.d..

h 4.1.10. It is tempting to define Z_t = Y_{1:t} and then note that

Z_t|Z_{t−1}, Z_{t−2} =ᴸ Z_t|Z_{t−1}, t = 1, 2, ...,

and conclude that Z_t is a companion form. But it is not. It is Markovian, but notice the dimension of Z_t increases systematically through time — ruining the practical usefulness of any Markovian property.

4.2 Autoregression
4.2.1 AR(p) and VAR(p)

The VAR(p) process was set out in Definition 4.1.5. The implication from the prediction decomposition is that, conditioning on some initial values Y_{1:p},

f_{Y_{p+1:T}|Y_{1:p}}(y_{p+1:T}) = ∏_{t=p+1}^{T} f_{ε_1}(y_t − ϕ_1y_{t−1} − ... − ϕ_py_{t−p}),

which is helpful for inference, e.g. in computing the likelihood.

h 4.2.1. Let A be an r × d matrix of constants; then

AY_t = Aϕ_1Y_{t−1} + ... + Aϕ_pY_{t−p} + Aε_t,

which is not an autoregression in {AY_t} (unless A is invertible, in which case write Aϕ_1Y_{t−1} = Aϕ_1A^{−1}AY_{t−1}, etc., producing a VAR(p) in {AY_t}, or unless ϕ_j = I_d for all j = 1, ..., p). This result contrasts with the case where {Y_t} is an MA(q), where {AY_t} is always an MA(q) — see Theorem 3.2.2.

Think about the AR(1) and relate it to the MA(∞) process.

Theorem 4.2.2. For a VAR(1) process Y_t = ϕ_1Y_{t−1} + ε_t, the cumulant generating function satisfies, for each q ≥ 1,

C{ω ‡ Y_t} = Σ_{j=0}^{q} C{ω^Tϕ_1^j ‡ ε_1} + C{ω^Tϕ_1^{q+1} ‡ Y_{t−q−1}}. (4.3)

If Σ_{j=0}^{∞} C{ω^Tϕ_1^j ‡ ε_1} exists, then {Y_t} has a stationary solution

Y_t = Σ_{j=0}^{∞} ϕ_1^j ε_{t−j},

an MA(∞) process.
Remark 12. If ω is small and E[|ε_1|] < ∞, then C{ω ‡ ε_1} ≃ iω^TE[ε_1] + o(|ω|) as ω → 0, so if Σ_{j=0}^{∞} ϕ_1^j exists, then

Σ_{j=0}^{∞} C{ω^Tϕ_1^j ‡ ε_1} ≃ iω^T(Σ_{j=0}^{∞} ϕ_1^j)E[ε_1].

But Σ_{j=0}^{∞} ϕ_1^j exists if all the eigenvalues of ϕ_1 are inside the unit circle.

Proof. In the AR(1) case

Y_t = ϕ_1Y_{t−1} + ε_t = ϕ_1(ϕ_1Y_{t−2} + ε_{t−1}) + ε_t = ε_t + ϕ_1ε_{t−1} + ϕ_1²Y_{t−2}
    = ε_t + ϕ_1ε_{t−1} + ... + ϕ_1^qε_{t−q} + ϕ_1^{q+1}Y_{t−q−1} = MA(q) + ϕ_1^{q+1}Y_{t−q−1},

so taking the cumulants of the sum of independent terms yields the result (4.3). Now assume Y_{t−q−1} has the cumulant function Σ_{j=0}^{∞} C{ω^Tϕ_1^j ‡ ε_1}, which exists; then

C{ω ‡ Y_t} = Σ_{j=0}^{q} C{ω^Tϕ_1^j ‡ ε_1} + Σ_{j=0}^{∞} C{ω^Tϕ_1^{q+1}ϕ_1^j ‡ ε_1}
           = Σ_{j=0}^{q} C{ω^Tϕ_1^j ‡ ε_1} + Σ_{j=0}^{∞} C{ω^Tϕ_1^{q+1+j} ‡ ε_1}
           = Σ_{j=0}^{q} C{ω^Tϕ_1^j ‡ ε_1} + Σ_{j=q+1}^{∞} C{ω^Tϕ_1^j ‡ ε_1} = Σ_{j=0}^{∞} C{ω^Tϕ_1^j ‡ ε_1},

the MA(∞) cumulant function. Hence Y_t = Σ_{j=0}^{∞} ϕ_1^jε_{t−j} is a strictly stationary solution.

Thus the stationary AR(1) is written as a special case of the MA(∞) process.

Focus on the weights: in this MA(∞) the weights go to zero exponentially fast when the eigenvalues of ϕ_1 are all inside the unit circle. Think about the univariate case: |ϕ_1| < 1, so immediately we have that Σ_{j=0}^{∞} |θ_j| < ∞ and so Σ_{j=0}^{∞} θ_j² < ∞ and Σ_{j=0}^{∞} |γ_j| < ∞. It was Σ_{j=0}^{∞} |γ_j| < ∞ plus i.i.d. {ε_t} (or MD white noise) which drove the CLT for the sample average in Section 3.3.4, so long as Var(ε_1) < ∞, while the asymptotic variance was the long-run variance Σ_{s=−∞}^{∞} γ_s = Var(ε_1)θ(1)². This fast decay in the weights is common to all covariance stationary AR(p) processes. This makes this class relatively easy to work with.

Example 4.2.3. Return to Example 3.2.3, with ε_1 being univariate symmetric α-stable, where α ∈ (0, 2], so C{ω ‡ ε_1} = −|ω|^α. Taking the limit of Σ_{j=0}^{q} C{ωϕ_1^j ‡ ε_1} as q increases:

C{ω ‡ Y_1} = −|ω|^α Σ_{j=0}^{∞} (|ϕ_1|^α)^j = −|ω|^α/(1 − |ϕ_1|^α),

for |ϕ_1| < 1, noting Y_1 is marginally a scaled α-stable variable. Hence, when |ϕ_1| < 1, the AR(1) model with α-stable shocks is strictly stationary for all α ∈ (0, 2] — a beautifully simple result compared to the strict stationarity conditions needed for the general MA(∞) process.

4.2.2 Autoregressions, the MA(∞) and covariance stationarity

Suppose {ε_t} is not i.i.d. but zero mean weak white noise with Var(ε_1) < ∞. Then the {Y_t} which follows the univariate AR(1) process

Y_t = ϕ_1Y_{t−1} + ε_t (4.4)

is covariance stationary, written as an MA(∞) process

Y_t = Σ_{j=0}^{∞} θ_jε_{t−j}, θ_j = ϕ_1^j, Σ_{j=0}^{∞} |θ_j| < ∞,

iff

|ϕ_1| < 1.

In that case

E[Y_1] = 0, Var(Y_1) = Var(ε_1)/(1 − ϕ_1²) = Σ_{j=0}^{∞} ϕ_1^{2j} Var(ε_1), γ_s = Cov(Y_1, Y_{1−s}) = ϕ_1^s Var(Y_1).

Importantly, if {ε_t} is MD white noise with respect to F_t^Y, then the AR(1) can be written as Y_t = Σ_{j=0}^{∞} θ_jε_{t−j} with the same MD white noise {ε_t} sequence.

Why does this AR(1) result hold? Recall

Y_t = ϕ_1Y_{t−1} + ε_t = ε_t + ϕ_1ε_{t−1} + ϕ_1²Y_{t−2}
    = ε_t + ϕ_1ε_{t−1} + ... + ϕ_1^{t−2}ε_2 + ϕ_1^{t−1}Y_1 = MA(t − 2) + ϕ_1^{t−1}Y_1.

Assume |ϕ_1| < 1 and Var(Y_1) = Σ_{j=0}^{∞} ϕ_1^{2j} Var(ε_1) < ∞. Then at t = 2

Var(Y_2) = ϕ_1²Var(Y_1) + Var(ε_1) = Σ_{j=0}^{∞} ϕ_1^{2(j+1)} Var(ε_1) + Var(ε_1) = Σ_{j=0}^{∞} ϕ_1^{2j} Var(ε_1),

using the assumption Σ_{j=0}^{∞} ϕ_1^{2j} Var(ε_1) < ∞; and the same argument then holds for t = 3, 4, .... The γ_s result is trivial. Hence there exists a weakly stationary solution.
Now return to the companion form from Example 4.1.7, where

Z_t = ϕZ_{t−1} + ε̄_t,

with ϕ the pd × pd block matrix whose first block row is (ϕ_1, ϕ_2, ..., ϕ_p), with identity blocks I_d on the block subdiagonal and zero blocks elsewhere. If {Y_t} is covariance stationary then {Z_t} is covariance stationary and so Var(Z_1) satisfies the equation

Var(Z_1) = ϕVar(Z_1)ϕ^T + Var(ε̄_1). (4.5)

One way to numerically solve (4.5) is to use the vec and Kronecker product notation from matrix algebra (e.g. Chapter 2 of Magnus and Neudecker (2019)). Then in the covariance stationary case

vec{Var(Z_1)} = (ϕ ⊗ ϕ)vec{Var(Z_1)} + vec{Var(ε̄_1)},

implying

vec{Var(Z_1)} = {I_{(pd)²} − (ϕ ⊗ ϕ)}^{−1} vec{Var(ε̄_1)}. (4.6)

The first block column of Var(Z_1) reads off the matrices γ_0, γ_1, ..., γ_{p−1} of {Y_t}.
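
Formula (4.6) is directly computable; here is a minimal sketch for a univariate AR(2) (coefficients illustrative), where the companion matrix is 2 × 2:

```python
import numpy as np

def stationary_var(Phi, var_eps):
    # Solve vec(V) = (Phi kron Phi) vec(V) + vec(Var(eps)) for V = Var(Z_1), as in (4.6).
    k = Phi.shape[0]
    vecV = np.linalg.solve(np.eye(k * k) - np.kron(Phi, Phi),
                           var_eps.flatten(order="F"))
    return vecV.reshape((k, k), order="F")

# AR(2): Y_t = 0.5*Y_{t-1} + 0.2*Y_{t-2} + eps_t, Var(eps) = 1.
Phi = np.array([[0.5, 0.2],
                [1.0, 0.0]])
var_eps = np.array([[1.0, 0.0],
                    [0.0, 0.0]])          # only the first block is shocked

V = stationary_var(Phi, var_eps)
print(V[:, 0])     # gamma_0 and gamma_1 of {Y_t}: roughly [1.709, 1.068]
```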

The same type of result holds for the AR(p) process. To manipulate AR(p) processes it is helpful to write them using the lag polynomial

ϕ(L)Y_t = ε_t, ϕ(z) = 1 − ϕ_1z − ... − ϕ_pz^p,

recalling the properties of lag polynomials we analyzed in Section 3.5.2.

If all the roots of the equation z^pϕ(z^{−1}) = 0 are inside the unit circle, then

Y_t = ε_t/ϕ(L) = Σ_{j=0}^{∞} θ_jε_{t−j} = θ(L)ε_t, θ(z) = Σ_{j=0}^{∞} θ_jz^j, with Σ_{j=0}^{∞} |θ_j| < ∞,

where the {θ_j}_{j=0}^{∞} are determined by ϕ_{1:p}. The math from Section 3.5.2 immediately ports to this problem: there we saw the implied {|θ_j|}_{j=0}^{∞} declines exponentially, implying Σ_{j=0}^{∞} |θ_j| < ∞ and, crucially, Σ_{j=−∞}^{∞} |γ_j| < ∞.

For the covariance stationary AR(p) process driven by white noise it is relatively easy to go from {ϕ_j}_{j=1}^{p} to {γ_j}_{j≥p}, once γ_{0:p−1} are found, using the Yule-Walker equations given in Theorem 4.2.4 below. But γ_{0:p−1} can be determined by working through the companion form and calculating (4.6) numerically. The analytic form of γ_{0:p−1} is (in my view) not particularly illuminating.

Theorem 4.2.4. [Yule-Walker] The covariance stationary AR(p) driven by zero mean white noise {ε_t} has

γ_s = ϕ_1γ_{s−1} + ... + ϕ_pγ_{s−p}, s = 1, ..., p, (4.7)

and γ_0 = ϕ_1γ_1 + ... + ϕ_pγ_p + Var(ε_1).

Proof. Recall Y_t = ϕ_1Y_{t−1} + ... + ϕ_pY_{t−p} + ε_t; take s > 0, multiply both sides by Y_{t−s} and take expectations. Note that

Cov(Y_{t−s}, ε_t) = 0, s > 0.

Using covariance stationarity, that delivers (4.7). The s = 0 case has an extra term, due to

Cov(Y_t, ε_t) = Var(ε_1).

It is relatively easy to go from {γ_j}_{j=1}^p to the {ϕ_j}_{j=1}^p. Stacking the equation (4.7) for s = 1, ..., p, and noting γ_s = γ_{−s}, yields

(γ_1, γ_2, ..., γ_p)^T = Γ_p (ϕ_1, ϕ_2, ..., ϕ_p)^T,   so   (ϕ_1, ϕ_2, ..., ϕ_p)^T = Γ_p^{−1} (γ_1, γ_2, ..., γ_p)^T,

where Γ_p is the p × p Toeplitz matrix with (i, j)-th element γ_{|i−j|}, if that matrix is invertible.
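A hedged sketch of this inversion, assuming the autocovariances are known (or have been replaced by sample estimates); the γ values below are made up for illustration.

import numpy as np
from scipy.linalg import toeplitz

gamma = np.array([1.0, 0.6, 0.2, 0.05])    # gamma_0, ..., gamma_3, illustrative
p = 3
Gamma = toeplitz(gamma[:p])                # p x p matrix with (i, j) entry gamma_|i-j|
phi = np.linalg.solve(Gamma, gamma[1:p+1])
print(phi)                                 # phi_1, ..., phi_p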

4.3 m-dependence
4.3.1 Definition

One-step ahead predictions have many of the features of independent random variables we see in introductory
statistics. A different place independence bites in time series is m-dependence.


Definition 4.3.1. [m-dependence] The {Y_t}_{t=−∞}^∞ process is m-dependent iff

Y_t ⊥⊥ Y_s,   for all t, s with |t − s| > m.

The most basic version of this is when m = 0: then the {Y_t}_{t=−∞}^∞ are independent through time.

h 4.3.2. A Markov process has


(Yt ⊥⊥ Y1:t−2 )|Yt−1 ,

that is Yt is independent of Y1:t−2 , but only conditionally on Yt−1 . Conditional independence does not imply
unconditional independence, so m-th order Markov processes are not m-dependent.

An elegant process which exhibits m-dependence is the moving average process.

Definition 4.3.3. [Moving average process] The q-th order (non-linear) moving average process says

Y_t = g(ε_{t−q:t}),   ε_t ~ iid,   t ∈ Z,

where Y_t and ε_t are assumed to be d-dimensional for all t = 1, ..., T and g is a non-stochastic function.

Example 4.3.4. In the linear case, the q-th order moving average process becomes Yt = θ0 εt + ... + θq εt−q ,
which is q-dependent.

The moving average process has Yt depending upon εt−q:t and Yt+q+s depending upon εt+s:t+q+s for s >
0. But εt−q:t and εt+s:t+q+s have no overlap and so are independent, so Yt and Yt+q+s are probabilistically
independent. Thus this process is m-dependent with m = q.
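As a quick illustration (not in the text), one can simulate a linear MA(1) and check that the sample correlogram cuts off after lag q = 1; the parameter values are arbitrary.

import numpy as np

rng = np.random.default_rng(0)
T, theta = 100_000, 0.7
eps = rng.standard_normal(T + 1)
Y = eps[1:] + theta * eps[:-1]            # linear MA(1): Y_t = eps_t + theta*eps_{t-1}

Yc = Y - Y.mean()
acf = [Yc[s:] @ Yc[:T - s] / (Yc @ Yc) for s in range(4)]
print(np.round(acf, 3))                   # roughly [1, theta/(1+theta^2), 0, 0]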

h 4.3.5. Think of the Gaussian, linear, scalar, first order moving average Y_t = ε_t + θε_{t−1}, where ε_t ~ iid N(0, 1/(1 + θ²)) and t = 1, ..., T. When t ≥ 3, then

(Y_t, Y_{t−1}, Y_{t−2})^T ~ N( 0_3, ( 1  ρ  0
                                      ρ  1  ρ
                                      0  ρ  1 ) ),   ρ = θ/(1 + θ²) ∈ [−1/2, 1/2],

so

Y_t | Y_{t−1} ~ N(ρY_{t−1}, 1 − ρ²),   Y_t | Y_{t−2} ~ N(0, 1),   Y_t | Y_{t−2:t−1} ~ N( (ρY_{t−1} − ρ²Y_{t−2})/(1 − ρ²), (1 − 2ρ²)/(1 − ρ²) ).

Thus Y_t is independent of Y_{t−2}, but given Y_{t−1} the pair {Y_{t−2}, Y_t} are not independent. This is the opposite of a Markov process (recall the Markov process had conditional independence but not independence), see Biohazard 4.3.2. That (Y_t ⊥̸⊥ Y_{t−2}) | Y_{t−1} has substantial computational and statistical importance for moving averages, e.g. it means that the law of Y_t | Y_{1:t−1}, that is the predictive distribution, will depend on all elements of Y_{1:t−1} (in the Markov process it only depends upon Y_{t−1}) even though Y_t is independent of Y_{1:t−2}.

4.3.2 m-dependence CLT

So far we have seen CLTs based on i.i.d. and MD sequences, but crucially the CLT fails for weak white noise. What about when {Y_t} is m-dependent?
You might expect a CLT to hold as the sequence has a form of replication, due to the independence structure
beyond m-lags. This expectation does indeed hold. CLTs for m-dependent processes have a storied history in

probability theory. Here I report a beautiful version by Janson (2021).


As with most sophisticated CLTs, this CLT is stated using a triangular array of univariate random variables1 ,
linked across t, for each separate T :

{XT,t }T ≥1,1≤t≤T .

The assumption we will use says that, for each T, the time series

X_{T,1}, ..., X_{T,T}

is m-dependent. Define, for each value of T, the time series sum

S_T = Σ_{t=1}^T X_{T,t}.
1 In the probability literature only univariate and infinite dimensional CLTs are discussed. The reason for this is that for the multivariate version the corresponding results can be established by using the Cramér-Wold device (Cramer and Wold (1936)). That is, define the d-dimensional vector of constants a, then check the univariate CLT for Σ_{t=1}^T a^T X_{T,t}. If this holds for all a, then the multivariate CLT holds.

Example 4.3.6. Suppose Y_t ~ iid N(0, 1). Then, to get it into the notation used in Theorem 4.3.7 below, write

X_{T,t} = Y_t/√T,   S_T = Σ_{t=1}^T X_{T,t} ~ N(0, 1),   σ_T^* = √(Var(S_T)) = 1,   σ_T^{*−1} S_T ~ N(0, 1).

Theorem 4.3.7 is a CLT for σ_T^{*−1} S_T as T increases, allowing the data to be non-Gaussian and the i.i.d. assumption to be replaced by m-dependence. It has a complicated condition named after Lindeberg, which is standard in non-identically distributed CLTs. It will be replaced in a moment by a condition which is easier to think about.

Theorem 4.3.7. [Janson (2021)] Suppose {X_{T,t}}_{T≥1, 1≤t≤T} is an m-dependent triangular array of univariate random variables, Var(X_{T,t}) < ∞ and E[X_{T,t}] = 0. Define

σ_T^* = √(Var(S_T))

and assume that σ_T^* > 0 for all T. Assume a Lindeberg-type condition: for every ε > 0, as T → ∞,

(1/σ_T^{*2}) Σ_{t=1}^T E[X_{T,t}² 1_{|X_{T,t}| > εσ_T^*}] → 0.    (4.8)

Then, as T → ∞,

σ_T^{*−1} S_T →_d N(0, 1).

Proof. See Janson (2021).

The m = 0 case is exactly the Lindeberg-Feller CLT, one of the most celebrated CLTs which deals with
independent but not identically distributed random variables. Romano and Wolf (2000) and Janson (2021)

also have CLTs which allow m to increase with T .

Remark 13. The Lindeberg-type condition (4.8) only involves the marginal laws of X_{T,1}, ..., X_{T,T}, not their dynamics. It can be tricky to check. The Lyapunov-type condition, which says that there exists a δ > 0 such that (σ_T^*)^{−(2+δ)} Σ_{t=1}^T E[|X_{T,t}|^{2+δ}] → 0, implies the Lindeberg-type condition holds and is easier to think about for time series. Why? If |X_{T,t}| ≥ εσ_T^* then |X_{T,t}/(εσ_T^*)| ≥ 1, so |X_{T,t}/(εσ_T^*)|^δ ≥ 1. Then

(1/σ_T^{*2}) Σ_{t=1}^T E[X_{T,t}² 1_{|X_{T,t}| > εσ_T^*}] ≤ (1/σ_T^{*2}) Σ_{t=1}^T E[X_{T,t}² |X_{T,t}/(εσ_T^*)|^δ] ≤ (1/(ε^δ σ_T^{*(2+δ)})) Σ_{t=1}^T E[|X_{T,t}|^{2+δ}],

hence the RHS going to zero is enough for the Lindeberg-type condition to hold.

Theorem 4.3.7 can be applied to the MA(q) process. The result is beautifully simple and powerfully useful.

Remark 14. In my view Lemma 3 and the MD CLT are the most insightful CLTs in time series.

Lemma 3. For the scalar MA(q) process, suppose there exists a δ > 0 such that E[|Y_1|^{2+δ}] < ∞. Then

√T Ψ^{−1/2} (Ȳ − E[Y_1]) →_d N(0, 1),   Ψ = Σ_{s=−q}^q Cov(Y_1, Y_{1+s}),    (4.9)

so long as 0 < Ψ < ∞. In the linear MA(q) case, this holds if E[|ε_1|^{2+δ}] < ∞ and Σ_{j=0}^q θ_j ≠ 0, yielding

Ψ = Var(ε_1) θ(1)² = Var(ε_1) (Σ_{j=0}^q θ_j)²,   θ(x) = Σ_{j=0}^q θ_j x^j,   where θ_0 = 1.    (4.10)

Proof. The MA(q) is an m-dependent process, so we will check the Lyapunov-type condition for Theorem 4.3.7, taking X_{T,t} = Y_t − E[Y_1], so E[|X_{T,t}|^{2+δ}] = E[|Y_1 − E[Y_1]|^{2+δ}] < ∞. The remaining task is to compute σ_T^{*2}, the variance of Σ_{t=1}^T Y_t, which equals T² Var(Ȳ). Let

Ψ_T = T × Var(Ȳ) = Var(Y_1) + 2 Σ_{s=2}^T (1 − (s−1)/T) Cov(Y_1, Y_s)
    = Var(Y_1) + 2 Σ_{s=2}^{q+1} (1 − (s−1)/T) Cov(Y_1, Y_s) = Ψ − (2/T) Σ_{s=2}^{q+1} (s−1) Cov(Y_1, Y_s)
    → Ψ,   as T → ∞ for fixed q.

The m-dependent CLT focuses on

(1/σ_T^*) Σ_{t=1}^T (Y_t − E[Y_1]) = (T/σ_T^*) (1/T) Σ_{t=1}^T (Y_t − E[Y_1]) = {T² Var(Ȳ)}^{−1/2} T (Ȳ − E[Y_1]) = √T Ψ_T^{−1/2} (Ȳ − E[Y_1])
→_d N(0, 1),   by the m-dependent CLT.

So by Slutsky's Theorem, as Ψ_T → Ψ, the result follows. In the linear case E[|Y_1|^{2+δ}] < ∞ holds when E[|ε_1|^{2+δ}] < ∞ is assumed, since by Minkowski's inequality

(E[|Y_1|^{2+δ}])^{1/(2+δ)} ≤ Σ_{j=0}^q (E[|θ_j ε_{1−j}|^{2+δ}])^{1/(2+δ)} = (E[|ε_1|^{2+δ}])^{1/(2+δ)} Σ_{j=0}^q |θ_j|,

which is bounded as Σ_{j=0}^q |θ_j| < ∞ automatically as q is finite.

Notice linearity only plays a role going from (4.9) to (4.10), yielding a pretty expression for Ψ, and giving a more primitive condition for E[|Y_1|^{2+δ}] < ∞.
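A Monte Carlo sketch of Lemma 3 for a linear MA(2) with Gaussian shocks; all of the parameter values are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(1)
T, reps = 2_000, 5_000
theta = np.array([1.0, 0.4, 0.25])        # theta_0, theta_1, theta_2; Var(eps) = 1
Psi = theta.sum() ** 2                    # Psi = Var(eps) * theta(1)^2, as in (4.10)

z = np.empty(reps)
for r in range(reps):
    eps = rng.standard_normal(T + 2)
    Y = theta[0] * eps[2:] + theta[1] * eps[1:-1] + theta[2] * eps[:-2]
    z[r] = np.sqrt(T / Psi) * Y.mean()    # sqrt(T) * Psi^{-1/2} * (Ybar - 0)
print(z.std())                            # should be close to 1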

4.4 Integration and differencing

Differencing has already played a significant role, going from a martingale {Mt } to a martingale difference

sequence {∆Mt }. Undoing differencing is, oddly, called integration in the time series literature — rather than
what you might think it would be called: cumulating.

Definition 4.4.1. Start with a time series {C_t}, then the integrated version is

Y_t = Y_0 + Σ_{j=1}^t C_j,   t = 1, 2, ...

Clearly Ct = ∆Yt . In the time series literature, if {Ct } is a stationary process then the integrated version {Yt }
is sometimes denoted I(1), integrated of order 1.

h 4.4.2. Sometimes integration is written as

Y_t = C_t/(1 − L),   or   Y_t = (1 − L)^{−1} C_t,

but care needs to be taken with this for it suggests

Y_t = Σ_{j=0}^∞ C_{t−j},

which may well not exist.

4.4.1 Random walk and integration

If {C_t} is i.i.d. then the integrated version {Y_t} is a d-dimensional random walk,

Y_t = Y_{t−1} + C_t = Σ_{j=1}^t C_j,   Y_0 = 0,   t = 1, 2, ....

The random walk is a special case of a Markov chain, with cumulant function

C{ω ‡ Y_t} = log E[e^{iω^T Y_t}] = t C{ω ‡ C_1}.

If E|C_1| < ∞, then

E[Y_t | Y_{1:t−1}] = E[Y_t | Y_{t−1}] = E[C_1] + Y_{t−1}.

Hence, if E[C_1] = 0, then {Y_t} is a martingale. If Var(C_1) < ∞, then

Var(Y_t) = t Var(C_1),   Cov(Y_t, Y_s) = Var(Y_s) = s Var(C_1),   s ≤ t.

Thus the random walk is a non-stationary process, so long as at least one of the diagonal elements of Var(C_1) is strictly positive.

Typically (but not always) integrated processes, built out of covariance stationary processes, are non-stationary as the means and variances change through time.

Theorem 4.4.3. If {C_t} is a covariance stationary process then E[Y_t] = tE[C_1], and

Var(Y_t) = t Σ_{s=−(t−1)}^{t−1} (1 − |s|/t) γ_s.

If 0 < Σ_{s=−∞}^∞ γ_s < ∞, then

Var(Y_t) ≃ t Σ_{s=−∞}^∞ γ_s.    (4.11)

Proof. Trivial from Theorem 3.3.8, the result on the properties of the sample mean.

h 4.4.4. If {ε_t} is weak white noise, then C_t = (1 − L)ε_t = ε_t − ε_{t−1}, so {C_t} is a zero mean, covariance stationary process (it is a non-invertible MA(1) process), with ρ_1 = −1/2. Then Y_t = ε_t − ε_0 (taking Y_0 = 0), by telescoping, which means {Y_t} is covariance stationary. How is this compatible with Theorem 4.4.3? In this case Σ_{s=−∞}^∞ γ_s = 0, as ρ_1 = −1/2 and E[C_1] = 0. Hence the limit result (4.11) is not helpful in that case. The {C_t} process is called over-differenced.

4.4.2 Cointegration

Think about the bivariate random walk

Y_t = Σ_{j=1}^t C_j,   C_t = βε_t,   β = (β_1, β_2)^T,    (4.12)

where {ε_t} is univariate i.i.d. Then

Var(C_1) = ββ^T Var(ε_1),   Var(Y_t) = tββ^T Var(ε_1),

which have singular covariance matrices, i.e. rank(ββ^T) = 1, but both {Y_{t,1}} and {Y_{t,2}} are non-stationary. So far, nothing is very interesting.

But now think about, for any α,

α(β_2, −β_1) C_t = 0,   as   (β_2, −β_1)(β_1, β_2)^T = 0.

Thus

α(β_2, −β_1) Y_t = 0,   with probability one,

so {α(β_2, −β_1) Y_t} is (the most trivial special case of) stationary. In the time series literature α(β_2, −β_1)^T is called a cointegrating vector (not unique as it holds for any α), while {Y_t} exhibits cointegration.

Now think more abstractly about the bivariate random walk

Y_t = Σ_{j=1}^t C_j,   Var(C_1) < ∞.

Then there are 3 distinct cases:

• Var(C_1) = 0_2: then rank(Var(C_1)) = 0. In this case {Y_t} is bivariate stationary, as Y_t = 0_2.

• Var(C_1) = α²ββ^T, where β = (β_1, β_2)^T: then rank(Var(C_1)) = 1. In this case {Y_t} is bivariate non-stationary, but has cointegration with cointegrating vector α(β_2, −β_1)^T.

• Var(C_1) > 0: then rank(Var(C_1)) = 2. In this case {Y_t} is bivariate non-stationary and has no cointegration.

The general version of this is defined below.

Definition 4.4.5. Suppose all the elements of the d-dimensional time series {Y_t} are non-stationary. If there exists a 1 ≤ r < d and an r × d dimensional matrix A such that the process {AY_t} is stationary, then {Y_t} exhibits cointegration.

The concept of cointegration was formalized in Granger (1981) and Engle and Granger (1987).

Example 4.4.6. [Common trend model] Work with the bivariate process

Y_t = βμ_t + η_t,   μ_t = Σ_{j=1}^t ε_j,   β = (β_1, β_2)^T,

where the scalar sequence {ε_t} is i.i.d. with Var(ε_1) > 0 and the bivariate sequence {η_t} is stationary. The {μ_t} is a common random walk which hits both {Y_{t,1}} and {Y_{t,2}} and has the same structure as (4.12). This model is often called a common trend model — while the {μ_t} is called the common trend. Then

Y_{t,1} = β_1 μ_t + η_{t,1},   Y_{t,2} = β_2 μ_t + η_{t,2},

are non-stationary if β_1 ≠ 0 and β_2 ≠ 0, respectively. But the linear combination

(β_2, −β_1) Y_t = (β_2, −β_1) η_t

is a stationary process. Hence {Y_t} is cointegrated.
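A simulation sketch of this common trend structure (parameter values are arbitrary assumptions): the two levels wander, but the cointegrating combination does not.

import numpy as np

rng = np.random.default_rng(2)
T, beta1, beta2 = 5_000, 1.0, 0.5
mu = np.cumsum(rng.standard_normal(T))    # common trend mu_t
eta = 0.3 * rng.standard_normal((2, T))   # stationary (here i.i.d.) noise
Y1 = beta1 * mu + eta[0]
Y2 = beta2 * mu + eta[1]

resid = beta2 * Y1 - beta1 * Y2           # equals beta2*eta_1 - beta1*eta_2: stationary
print(np.var(Y1), np.var(resid))          # trending variance versus bounded variance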

Example 4.4.7. Granger (1981) has a beautiful thought experiment. Suppose {Y_{t,1}} is the number of cars since 9am which have entered a tunnel which only opened at 9am (which is one way and only has 1 exit and 1 entrance), where t increments up by 1 each second, and {Y_{t,2}} is the number of cars since 9am which have left the same tunnel. Then both {Y_{t,1}} and {Y_{t,2}} are non-stationary processes, incrementing up each second. But, by construction, Y_{t,1} − Y_{t,2} ≥ 0 is the number of cars in the tunnel at time t. It may be reasonable to model {Y_{t,1} − Y_{t,2}} as a stationary process.

4.4.3 Lévy, Brownian and Poisson processes

Recall from (4.2) that a random walk has

Y_t = Y_{t−1} + ε_t,   ε_t ~ iid.

Here we discuss taking this random walk process to times which can be recorded continuously, {Y(t)}_{t≥0}. This is useful as it expands the kinds of processes we can use to build new models (e.g. in financial econometrics), allows some analysis of problems where the data is not equally spaced through time, and some of these objects appear in important asymptotic arguments (e.g. the functional central limit theorem) in later chapters. Before we define this process we need to recall what a càdlàg function is.

A càdlàg function is a function which is right continuous with left limits. It is familiar in introductory statistics from, for example, the distribution function of a binary random variable — shown in Figure 4.2. A càdlàg process {Y(t)}_{t≥0} has

lim_{u↓t} Y(u) = Y(t),   lim_{u↑t} Y(u) = Y(t−),   t ∈ R_{>0},

while the jumps at time t are

Y(t) − Y(t−),

the difference between the right and left limits, respectively.

Definition 4.4.8. [Lévy process] A càdlàg process {Y(t)}_{t≥0}, where Y(0) = 0, is a Lévy process if it obeys two additional (beyond càdlàg) properties:

(a) [independent increments]

{Y(t) − Y(s)} ⊥⊥ {Y(b) − Y(a)},   for all 0 ≤ a < b < s < t;

(b) [stationary increments]

{Y(t + s) − Y(t)} =_L {Y(s) − Y(0)},   for all 0 ≤ s, 0 ≤ t.

Finally, any Lévy process with non-negative increments

Y(t) − Y(s) ≥ 0,   for all t > s > 0,

is called a subordinator. The Poisson version is the Poisson process; the Gaussian case is Brownian motion.

Lévy processes all have independent and stationary increments, so for t > 0 the cumulant function is

C{ω ‡ Y(t)} = t C{ω ‡ Y(1)},

while, if the mean and variance exist,

E[Y(t) − Y(s)] = (t − s)E[Y(1)],   Var[Y(t) − Y(s)] = (t − s)Var[Y(1)],   for all t > s > 0.

Example 4.4.9. A Poisson process is a special case of a Lévy process. It has Poisson increments

Y(t) − Y(s) ~_indep Po(ψ(t − s)),   ψ > 0,   t > s ≥ 0.    (4.13)

The ψ is usually called the intensity in this context. A simulated path of a Poisson process with ψ = 1 is given on the left hand side of Figure 4.1. All the jumps in the Poisson process are one, with probability one.

[Figure 4.1 here: three panels (Poisson process, Brownian motion, Gamma process), each plotting Y against Time over [0, 10].]

Figure 4.1: LHS: sample path of Poisson process with intensity ψ = 1. Middle: sample path of Brownian motion with µ = 0 and σ = 1. RHS: sample path of a gamma process.

Example 4.4.10. A Brownian motion is a special case of a Lévy process. It has Gaussian increments

Y(t) − Y(s) ~_indep N(µ(t − s), σ²(t − s)),   µ ∈ R,  σ > 0,   t > s ≥ 0.    (4.14)

The µ, σ are usually called the drift and volatility, respectively, in this context. A simulated path of a Brownian motion with µ = 0 and σ = 1 is given in the middle of Figure 4.1. It is the only Lévy process without jumps — it has a continuous sample path, but the path is nowhere differentiable. When µ = 0 and σ = 1, then {Y(t)} is called standard Brownian motion. In the standard Brownian motion case, the process {Y(t) − tY(1)}_{t∈[0,1]} is called a Brownian bridge.

Example 4.4.11. A gamma process is a special case of a Lévy process. It has gamma increments

Y(t) − Y(s) ~_indep Gamma(α(t − s), β),   α, β > 0,   t > s ≥ 0.    (4.15)

A simulated path of a gamma process with α = β = 1 is given on the right hand side of Figure 4.1. This process has, however small the time increment, strictly positive increments (as the increments are gamma) — made up of an infinity of jumps. In the gamma process case, the process {Y(t)/Y(1)}_{t∈[0,1]} is called a Dirichlet process — a continuous time version of the Dirichlet distribution. It plays a large role in modern Bayesian nonparametric statistics.
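Sample paths of all three processes can be simulated from their increments; a minimal sketch, assuming a regular time grid and treating β as a unit rate in NumPy's shape/scale parameterization of the gamma.

import numpy as np

rng = np.random.default_rng(3)
n, dt = 1_000, 0.01                                     # grid of n steps of length dt
poisson = np.cumsum(rng.poisson(1.0 * dt, n))           # intensity psi = 1, as (4.13)
brownian = np.cumsum(rng.normal(0.0, np.sqrt(dt), n))   # mu = 0, sigma = 1, as (4.14)
gamma = np.cumsum(rng.gamma(1.0 * dt, 1.0, n))          # alpha = beta = 1, as (4.15)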

The Poisson process and Brownian motion are the two most famous continuous time processes in probability. Another process which is characterized by the behaviour of its increments in continuous time is an orthogonal process (this is the continuous time version of the discrete time integration of zero mean, weak white noise). An orthogonal process is less prescriptive; it just features the first two moments.

Definition 4.4.12. [Orthogonal process] The process {Y(t)}_{t≥0} is "orthogonal" if

E[Y(t) − Y(s)] = 0,   Cov[Y(t) − Y(s), Y(u) − Y(v)] = 0,

for all t > s > u > v ≥ 0.

[Figure 4.2 here: step plot of F(y), jumping from 0.3 at y = 0 to 1 at y = 1, with the 0.6-quantile marked.]

Figure 4.2: Example of càdlàg function. Distribution function of a binary random variable with P(Y = 0) = 0.3 and P(Y = 1) = 0.7. Here we compute the 0.6-quantile, F^{−1}(0.6) = Q(0.6), which is 1.

Of course, all zero mean Lévy processes with the additional feature that Var(Y(t)) < ∞ are orthogonal processes. However, many other orthogonal processes are important.

4.5 Inference and linear autoregression


4.5.1 Three versions of an autoregression

Although we have discussed the stationary and covariance stationary properties of autoregressions, from a statistical perspective it is helpful to classify how autoregressions are discussed in the empirical literature and used to carry out inference. Often writers are unclear as to which kind of autoregression they are working with.

The classification will be into 3 buckets. Each has its own voice (typically the 1st appears in probability theory, the 2nd in statistics and the 3rd in econometrics) and vices. The focus will be on the AR(1) case to look at core issues.

Definition 4.5.1. [AR(1) process] The process {Y_t} is an AR(1) process if

Y_t = ϕ_1 Y_{t−1} + ε_t,   ε_t ~ iid,   t = 1, ..., T.

This model is a special case of the Markov process above. It has been extensively discussed above.

Definition 4.5.2. [AR(1) predictions] The process {Yt } has AR(1) predictions iff it obeys, for t = 1, ...,

(a) E[|Yt |] < ∞, (b) E[Yt |Ft−1 ] = ϕ1 Yt−1 . (4.16)

In the prediction model, if it exists, we write

σt2 = Var[Yt |Ft−1 ].



If σt2 does not change through time, this is called homoskedasticity.

The AR(1) prediction model is not a Markov process.

Definition 4.5.3. [AR(1) projection] The zero mean, covariance stationary {Y_t} has, for each Y_t, the AR(1) projection ϕ_1 Y_{t−1}, where

ϕ_1 = Cov(Y_0, Y_1)/Var(Y_0).

The AR(1) projection model is not a Markov process. It starts with a covariance stationarity assumption, and then highlights an interesting estimand ϕ_1. No additional assumptions are made.

These three autoregressions share many features, but dramatically depart in other aspects. The AR(1) process was discussed extensively above. Focus now on the AR(1) prediction and projection cases.

AR(1) prediction

If {Y_t} has E|Y_t| < ∞, then it follows that E[Y_t | F_{t−1}] exists. Then all conditional mean prediction based models decompose as

Y_t = E[Y_t | F_{t−1}] + U_t,

where {U_t} is a MD sequence adapted to {F_t}, as E|Y_t| < ∞ implies E|U_t| < ∞.

In the special case of the AR(1) prediction assumption, E[Y_t | F_{t−1}] = ϕ_1 Y_{t−1}, we can use the MD property of {U_t} to carry out inference on ϕ_1, e.g. based on MD laws of large numbers or CLTs.

AR(1) projection

In a lot of modern econometrics and statistics regression is presented as a linear projection. The time series version of this, for 1 lag, is the AR(1) projection, which comes from assuming {Y_t} to be a zero mean covariance stationary sequence and defining the projection estimand

ϕ_1 = arg min_a E[(Y_1 − aY_0)²] = E[Y_0 Y_1]/E[Y_0²] = γ_1/γ_0 ∈ [−1, 1].

Then {ϕ_1 Y_{t−1}} is the linear projection of {Y_t}. The properties are immediate, from familiar properties of regression.

Theorem 4.5.4. If {Y_t} is covariance stationary, define the projection error

U_t = Y_t − ϕ_1 Y_{t−1},   t = 1, 2, ....

This error process {U_t} is covariance stationary and has four properties (I do not know a fifth!)

E[U_1] = 0,   Var(U_1) = γ_0 (1 − ρ_1²) < ∞,   E[Y_0 U_1] = E[Y_0 Y_1] − ϕ_1 E[Y_0²] = 0,   E[Y_{t−1} U_t] = 0,   t = 1, ....

More broadly,

E[U_t U_{t−s}] ≠ 0,   s ≥ 1,

in general, so {U_t} is not weak white noise, nor i.i.d. nor a MD sequence. Thus laws of large numbers and CLTs will be tricky!

4.5.2 Least squares and AR(1)

One way to estimate ϕ_1 is

ϕ̂_LS = Σ_{t=2}^T Y_{t−1} Y_t / Σ_{t=2}^T Y_{t−1}²,

which we call the least squares estimator. It can be derived from broad inference principles in multiple ways:

• Under the AR(1) process, if ε_1 is zero mean, Gaussian, then ϕ̂_MLE = ϕ̂_LS is the MLE based upon the conditional joint density

f_{Y_{2:T} | Y_1}.

• Under the AR(1) prediction model,

E[(Y_t − ϕ_1 Y_{t−1}) | Y_{1:t−1}] = 0.    (4.17)

As Y_{t−1} is previsible, then

E[Y_{t−1}(Y_t − ϕ_1 Y_{t−1})] = E[Y_{t−1} E[(Y_t − ϕ_1 Y_{t−1}) | Y_{1:t−1}]],   by Adam's law,
                             = 0,   using (4.17),

so the method of moments principle delivers ϕ̂_MoM = ϕ̂_LS.

• Under the AR(1) projection method

ϕ_1 = arg min_a E[(Y_1 − aY_0)²] = E[Y_0 Y_1]/E[Y_0²],

the method of moments principle can be used to estimate E[Y_0 Y_1] and E[Y_0²] and so deliver ϕ̂_LS. Of course, ϕ̂_LS is also the numerical solution to the least squares principle:

ϕ̂_LS = arg min_a Σ_{t=2}^T (Y_t − aY_{t−1})².
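In code ϕ̂_LS is one line; a minimal sketch, where the AR(1) data are simulated only so that there is something to fit.

import numpy as np

rng = np.random.default_rng(4)
T, phi1 = 1_000, 0.8
Y = np.empty(T)
Y[0] = rng.standard_normal() / np.sqrt(1 - phi1 ** 2)   # start in the stationary law
for t in range(1, T):
    Y[t] = phi1 * Y[t - 1] + rng.standard_normal()

phi_hat = (Y[:-1] @ Y[1:]) / (Y[:-1] @ Y[:-1])          # phi_hat_LS
print(phi_hat)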

4.5.3 Properties of ϕ̂_LS

First some algebra, which always applies. Define U_t = Y_t − ϕ_1 Y_{t−1}; then

ϕ̂_LS − ϕ_1 = M_T / Σ_{t=2}^T Y_{t−1}²,   M_t = Σ_{j=2}^t Y_{j−1} U_j.

AR(1) process and AR(1) prediction case

First note that for an AR(1) prediction problem, if Var(Y_{t−1}) < ∞ and Var(U_t) < ∞ for every t, then {Y_{t−1}U_t} is a MD sequence with respect to F_t^Y.

Why? Check conditions (a) and (b) for a MD sequence. First (a): that is, E|Y_{t−1}U_t| < ∞ for every t. The Cauchy-Schwarz inequality says that for a pair of random variables {X, Z} where Var(X) < ∞ and Var(Z) < ∞,

E[XZ]² ≤ E[X²]E[Z²],

so {E|Y_{t−1}U_t|}² ≤ E[Y_{t−1}²]E[U_t²]. So Var(Y_{t−1}) < ∞ and Var(U_t) < ∞ for every t is enough for condition (a). Now for condition (b): that is, E[Y_{t−1}U_t | Y_{1:t−1}] = 0. The

E[Y_{t−1}U_t | Y_{1:t−1}] = Y_{t−1} E[U_t | Y_{1:t−1}] = 0,

as desired. So {Y_{t−1}U_t} is a MD sequence so long as Var(Y_{t−1}) < ∞ and Var(U_t) < ∞ for every t.

Again assuming Var(U_t) < ∞, define σ_t² = Var(U_t | Y_{1:t−1}); then under the AR(1) prediction case

⟨M, M⟩_t = Σ_{j=2}^t Var(ΔM_j | Y_{1:j−1}) = Σ_{j=2}^t Y_{j−1}² Var(U_j | Y_{1:j−1}) = Σ_{j=2}^t Y_{j−1}² σ_j²,   t = 2, 3, ...,

the angle bracket process. Thus

ϕ̂_LS − ϕ_1 = (M_T / ⟨M, M⟩_T) × (Σ_{t=2}^T Y_{t−1}² σ_t² / Σ_{t=2}^T Y_{t−1}²).

If ⟨M, M⟩_T → ∞ as T → ∞, then M_T/⟨M, M⟩_T →_{a.s.} 0 using Theorem 2.3.14, the martingale strong law of large numbers. If Σ_{t=2}^T Y_{t−1}² σ_t² / Σ_{t=2}^T Y_{t−1}² is bounded from above (e.g. this is guaranteed if σ_t² ≤ d < ∞ for all t), then this is enough to yield consistency of ϕ̂_LS for ϕ_1.

Next focus on the CLT for the prediction model, in the |ϕ_1| < 1 case, using a MD CLT. I will state a basic result; many of the conditions can be relaxed.

Theorem 4.5.5. Assume the AR(1) prediction model, |ϕ_1| < 1 and Var(U_t) = Var(U_1) < ∞ for all t. Then if Var(Y_{t−1}U_t) = Var(Y_0 U_1) < ∞ and {Y_t²} is ergodic,

√T (ϕ̂_LS − ϕ_1) →_d N(0, E(Y_0² U_1²)/E[Y_0²]²),   E[Y_0²] = Var(U_1)/(1 − ϕ_1²).

Under homoskedasticity, that is E(U_t² | Y_{1:t−1}) = E(U_1²), then

√T (ϕ̂_LS − ϕ_1) →_d N(0, Var(U_1)/E[Y_0²]) = N(0, 1 − ϕ_1²).    (4.18)

Proof. The {Y_{t−1}U_t} is a MD sequence if E[|Y_{t−1}U_t|] < ∞, and by Cauchy-Schwarz

E[|Y_{t−1}U_t|]² ≤ E[Y_{t−1}²]E[U_t²].

As |ϕ_1| < 1 and Var(U_t) = Var(U_1), {Y_t} is covariance stationary. Thus E[|Y_{t−1}U_t|] < ∞ holds, so {Y_{t−1}U_t} is a MD sequence. As we assumed Var(Y_{t−1}U_t) = Var(Y_0 U_1) < ∞, the {Y_{t−1}U_t} is a MD white noise sequence, so

(1/√T) Σ_{t=2}^T Y_{t−1}U_t →_d N(0, Var(Y_0 U_1)).

Covariance stationarity plus ergodicity implies that

(1/T) Σ_{t=2}^T Y_{t−1}² →_p E[Y_0²].

Combining the results using Slutsky's theorem yields the first displayed equation.

The typical textbook result for estimating an AR process is given in Lemma 4.

Lemma 4. Assume the AR(1) process, |ϕ_1| < 1 and Var(U_1) < ∞; then (4.18) holds.

Proof. It is the homoskedastic case, while E(Y_0² U_1²) = E[U_1²]²/(1 − ϕ_1²), while ergodicity holds as γ_s = ϕ_1^s → 0.
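A Monte Carlo sketch of (4.18) under the Gaussian AR(1) process; the sample size, replication count and ϕ_1 are arbitrary choices.

import numpy as np

rng = np.random.default_rng(5)
T, phi1, reps = 1_000, 0.8, 1_000
stats = np.empty(reps)
for r in range(reps):
    Y = np.empty(T)
    Y[0] = rng.standard_normal() / np.sqrt(1 - phi1 ** 2)
    for t in range(1, T):
        Y[t] = phi1 * Y[t - 1] + rng.standard_normal()
    phi_hat = (Y[:-1] @ Y[1:]) / (Y[:-1] @ Y[:-1])
    stats[r] = np.sqrt(T) * (phi_hat - phi1)
print(stats.std(), np.sqrt(1 - phi1 ** 2))   # both should be near 0.6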

The inference for the AR(1) prediction model hides a danger for empirical work, which is not much discussed in the literature.

Remark 15. The assumption that Var(Y_0 U_1) < ∞ is not trivial empirically. Think of (a special case of an ARCH(1) model, σ_1² = Y_0²)

U_1 = |Y_0| ε_1,   ε_t ~ iid,   E[ε_1] = 0,   Var[ε_1] = 1,

then E[U_1²] = E[Y_0²] < ∞, so {U_t} is MD white noise. But now Var(Y_0 U_1) < ∞ needs E[Y_0⁴] < ∞ — which is quite an ask for many time series problems. Under homoskedasticity this problem disappears, just needing E[Y_1²] < ∞.

Least squares and projection

The AR(1) projection only makes sense under the covariance stationarity assumption. Then recall the {U_t} has four properties — given in Theorem 4.5.4. As a result

T^{−1} M_T = (1/T) Σ_{t=2}^T Y_{t−1}U_t,   so   E[T^{−1} M_T] = (1/T) Σ_{t=2}^T E[Y_{t−1}U_t] = 0,

and

Ê[Y_0²] = (1/T) Σ_{t=2}^T Y_{t−1}²,   so   E[Ê[Y_0²]] = (1/T) Σ_{t=2}^T E[Y_{t−1}²] = E[Y_0²].

If we assume that {Y_t²} is ergodic then Ê[Y_0²] →_p E[Y_0²].

The main problem is deriving a CLT when we do not know any properties of the {Y_{t−1}U_t} sequence beyond that it has a zero mean. One popular approach is to assert some high level assumptions. This is very close to being circular — assuming the CLT for ϕ̂_LS holds directly.

Theorem 4.5.6. Assume {Y_t} is covariance stationary and think about the AR(1) projection. Assume {Y_{t−1}U_t} is a covariance stationary and ergodic sequence and that it obeys the CLT

√T ((1/T) Σ_{t=2}^T Y_{t−1}U_t) →_d N(0, Ψ(Y_0 U_1)),   Ψ(Y_0 U_1) = Σ_{s=−∞}^∞ Cov(Y_0 U_1, Y_{−s} U_{1−s}),   0 < Ψ(Y_0 U_1) < ∞.

Then

√T (ϕ̂_LS − ϕ_1) →_d N(0, Ψ(Y_0 U_1)/E[Y_0²]²).    (4.19)

Proof. Ergodicity plus stationarity means (1/T) Σ_{t=2}^T Y_{t−1}² →_p E(Y_0²). Then combining with the assumed CLT yields (4.19) by Slutsky's theorem.

h 4.5.7. This looks simple, but we have already seen CLTs of time series are not trivial, e.g. covariance stationarity is not enough to drive a CLT. Some kind of replication is needed. Replication can be obtained by using MA(∞) processes driven by i.i.d. or MD shocks or through an m-dependence assumption.

Perhaps the prettiest version of this is to assume up-front that {Y_t} is strictly stationary, which implies {Y_{t−1}U_t} is strictly stationary. If, additionally, Var(Y_1) < ∞ and Var(Y_0 U_1) < ∞, then strict stationarity implies both {Y_t} and {Y_{t−1}U_t} are covariance stationary. Then ergodicity just needs Cov(Y_0², Y_s²) → 0 as s → ∞. The CLT is still hard due to the absence of replication. If {Y_t} is additionally assumed m-dependent, then the CLT holds (and indeed ergodicity will hold) subject to a Lindeberg condition.

Theorem 4.5.8. Assume {Y_t} is strictly stationary, m-dependent with Var(Y_1) < ∞ and Var(Y_0 U_1) < ∞. Then

√T (ϕ̂_LS − ϕ_1) →_d N(0, Ψ(Y_0 U_1)/E[Y_0²]²),   Ψ(Y_0 U_1) = Σ_{s=−M}^M Cov(Y_0 U_1, Y_{−s} U_{1−s}),   0 < Ψ(Y_0 U_1) < ∞,

subject to a Lindeberg condition (e.g. E[|Y_0 U_1|^{2+δ}] < ∞ for some δ > 0).

Of course this relies on M being finite. But there are CLTs for m-dependent processes where m increases with T, or you can relax and think of M as being large and finite, e.g. m is 100 billion. So long as 0 < Σ_{s=−∞}^∞ Cov(Y_0 U_1, Y_{−s} U_{1−s}) < ∞, using this finite m case is likely to yield only tiny errors.

Of course estimating the long-run variance Σ_{s=−∞}^∞ Cov(Y_0 U_1, Y_{−s} U_{1−s}) is difficult in practice.
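One common plug-in, sketched below under my own assumptions, is a Bartlett-kernel (Newey-West style) estimator applied to the observed products x_t = Y_{t−1}U_t; the bandwidth choice is ad hoc here.

import numpy as np

def long_run_variance(x, bandwidth):
    """Bartlett-kernel estimate of sum_s Cov(x_0, x_s) for a zero mean series."""
    x = np.asarray(x, float) - np.mean(x)
    T = len(x)
    lrv = x @ x / T
    for s in range(1, bandwidth + 1):
        w = 1.0 - s / (bandwidth + 1.0)          # Bartlett weight
        lrv += 2.0 * w * (x[s:] @ x[:-s]) / T
    return lrv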

4.6 Recap

We have covered a lot of ground. Special cases of the predictive distribution play enormous roles in time series: Markov processes, AR, VAR, ECM, Brownian motion, Poisson processes, ARCH. Here some of these are developed. Sometimes they are linked back to MA(∞) processes and martingales.

Formula or idea                                        Description or name
Y_t | Y_{1:t−1} =_L Y_t | Y_{t−1}                      Markov
Y_t | Y_{1:t−1} =_L Y_t | Y_{t−K:t−1}                  K-th order Markov
Y_t ⊥⊥ Y_s for |t − s| > m                             m-dependent
Y_t = ϕ_1 Y_{t−1} + ε_t                                AR
multivariate AR                                        VAR
ΔY_t = αβ^T Y_{t−1} + ε_t                              ECM
Y(t) ~ Po(ψt)                                          Poisson process
Y(t) ~ N(µt, tσ²)                                      Brownian motion
indep and stationary increments                        Lévy process
ϕ(z) = 1 − ϕ_1 z − ... − ϕ_p z^p                       lag polynomial
Y_t = ϕ_1 Y_{t−1} + ε_t, ε_t ~ iid                     AR process
E[Y_t | Y_{1:t−1}] = ϕ_1 Y_{t−1}                       AR prediction
ϕ_1 = Cov(Y_1, Y_0)/Var(Y_0)                           AR projection
ϕ̂_1 = Σ_{t=2}^T Y_t Y_{t−1} / Σ_{t=2}^T Y_{t−1}²       least squares estimator

Table 4.1: Main ideas and notation in Chapter 4.

Table 4.1 has some of the highlighted ideas from the Chapter.
Chapter 5

Action: Describing via filtering and smoothing

5.1 Seasonality

In descriptive analysis we often try to get a fast summary of the structure of the data by reporting sample averages, sample quantiles or correlograms. But time series often have fascinating structures which are important in their own right or get in the way.

Perhaps the most famous version of this are the repeating patterns which appear due to

• the seasons (e.g. cereals harvested per month, price of fruit, public holidays),

• the work week (e.g. number of subway journeys per day) or

• the intraday (e.g. the amount of electricity generated by solar panels).

In time series these types of effects are usually called seasonal effects, where S is the number of seasons being modelled. Sometimes S can be very large (e.g. think about diurnal features where the data is recorded on a second-by-second basis, twenty-four hours a day), or small (e.g. S = 4 for quarterly data).

Example 5.1.1. The time series of electricity demand gives insight into managing a modern electrical grid, which is becoming an ever more important skill as the amount of solar and wind energy ramps up. Figure 5.1 shows demand in Great Britain every 30 minutes from 1 January 2023 until 6 September 2023. The data comes from the UK's national grid, which manages the movement of electricity over the country and between the UK and other countries (through interconnectors). The data was downloaded from https://data.nationalgrideso.com/demand/historic-demand-data

The website has the data going back to 2009, so T ≃ 122k. A wonderfully accessible way to gain an understanding of UK electricity generation, in real time, is via the website https://grid.iamkate.com/, due to data scientist Kate Morley, who displays data merged from various sources.

[Figure 5.1 here: three panels of electricity demand (GW) against days, over 4 days, 14 days and the year to date.]

Figure 5.1: Demand for electricity from the Great Britain electric grid during 2023, in gigawatts, recorded every 30 minutes.

The left hand side plots the demand on

the first 4 days of the year, showing the diurnal feature of the data. The middle graph plots the first 14 days, which shows the day of the week effect (electricity demand is lower on Sunday, not much lower on Saturday). The right hand side shows the demand so far this year, which has fallen during the summer but will ramp up again over the rest of the year. Hence to model this dataset at least 3 seasonal components would be needed: one with S = 48, one with S = 48 × 7 and one with S = 48 × 365. The data has many usages, e.g. short run forecasting of demand, longer term planning dealing with managing the peaks of demand and exploiting the troughs to recharge storage capacity (batteries or pumped storage). Similar datasets are available for electricity generation, e.g. through coal, gas, wind, solar and transferred through interconnectors.

5.1.1 Quantifying seasonality

To start quantifying, think of a process with a single season:

Y_t = γ_t + X_t,   t = 1, 2, ..., T,

where {γ_t} is the additive deterministic seasonal "component" of the time series. Here we will use the convention that seasonality cumulates to zero over the span of a season.

Definition 5.1.2. A deterministic seasonal component, with S ≥ 2 seasons, sets

γ_t = γ(t/S − ⌊t/S⌋),

driven by the function {γ(u)}_{u∈[0,1]}, where

∫_0^1 γ(u) du = 0,

while ⌊x⌋ is the floor function (the largest integer not larger than x).

One approach to dealing with seasonality is to transform the series to make it less sensitive to seasonality.

Example 5.1.3. [Inflation] Traditionally price inflation is reported every month, but is calculated as the percentage change in prices over the last year. Write {Y_t} for the price index at time t, measured in months; then arithmetic annual inflation is

100 × Δ_S Y_t / Y_{t−S},   Δ_S Y_t = Y_t − Y_{t−S},   S = 12,

where Δ_S = 1 − L^S is called the seasonal difference operator. Notice that

Δ_S Y_t = ΔY_t + ... + ΔY_{t−S+1},

the sum of S monthly changes. Why report these annual moves, rather than, say, 6 month moves? Well

100 × Δ_S Y_t / Y_{t−S} = 100 × (γ_t − γ_{t−S})/Y_{t−S} + 100 × (X_t − X_{t−S})/Y_{t−S}
                        = 100 × (X_t − X_{t−S})/Y_{t−S},   as γ_t − γ_{t−S} = 0 under a deterministic seasonal.

Hence annual inflation is robust to deterministic seasonal variations in prices, due to the use of seasonal differences.

Example 5.1.4. [Financial volatility] Set up a database of the price in a financial market, {Y_{jΔ}}_{j≥1}, measured every Δ > 0 seconds. Then the quadratic variation (QV) of the price is the time series

[Y, Y]_t = Σ_{j=1}^t (Y_{jΔ} − Y_{(j−1)Δ})².

Suppose there are S seconds in a financial day; then

Δ_S [Y, Y]_t = [Y, Y]_t − [Y, Y]_{t−S}

is the increase in the QV over the last day — differencing out the complicated diurnal patterns of trading often seen in financial markets (a high percentage of trading happens at the start and end of the trading day). In financial econometrics Δ_S [Y, Y]_t is called the realized variance and was formalized by Andersen, Bollerslev, Diebold, and Labys (2001) and Barndorff-Nielsen and Shephard (2002). It is the modern way of measuring how volatile a financial market has been using high frequency data. Its simplicity is due to it being robust to seasonal patterns, again through a seasonal difference operator.
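A minimal sketch of this calculation, where (as an assumption to keep the indexing simple) S counts observations per day rather than seconds.

import numpy as np

def realized_variance(prices, S):
    """prices: regularly sampled price array; S: number of observations per day."""
    qv = np.cumsum(np.diff(prices) ** 2)   # quadratic variation [Y, Y]_t
    return qv[S:] - qv[:-S]                # Delta_S [Y, Y]_t, the realized variance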

A second approach is to estimate the seasonal component and then remove it — allowing the "seasonally adjusted" series to be subsequently analyzed by other researchers.

Definition 5.1.5. [Seasonal adjustment] Suppose {γ̂_t} is the estimated seasonal component; then

Y_t − γ̂_t,   t = 1, ..., T,

is the (estimated) seasonally adjusted series. If γ̂_t is built using Y_{1:t}, then γ̂_t is called a filter. If it depends on Y_{1:T}, so is a historical estimate, it is called a smoother.

Seasonal adjustment is a major descriptive method in modern time series modeling.

Example 5.1.6. Most U.S. Government time series are published as “seasonally adjusted,” that is preprocessed

using a statistical model or procedure to try to remove the seasonal effects before it is published. Hence,
for example, many U.S. macroeconomists rarely discuss seasonality, although unemployment and prices have
pronounced seasonal patterns, e.g. the classic time series textbook by Hamilton (1994) has little discussion of

seasonality. In time series practice outside academic macroeconomics, on the other hand, seasonality is a very
big deal — worthy of serious thought.

5.1.2 Fourier representation of a seasonal function

A major approach to statistical model building of the seasonal function {γ(u)}_{u∈[0,1]} is to use a Fourier representation {γ_B(u)}_{u∈[0,1]}, where B determines the number of terms in a sum. This representation is

Σ_{j=0}^B {γ_{1+2(j−1)} cos(λ_j u) + γ_{2j} sin(λ_j u)},   λ_j = 2πj,   u ∈ [0, 1].

h 5.1.7. I realize there are some coefficients γ_{−1} and γ_0 with unusual subscripts. They will disappear in a moment, leaving a more conventional setup.

Definition 5.1.8. A trigonometric representation (approximation) of the function {γ(u)}_{u∈[0,1]} is

γ_B(u) = Σ_{j=0}^B {γ_{1+2(j−1)} cos(λ_j u) + γ_{2j} sin(λ_j u)},   λ_j = 2πj,   u ∈ [0, 1],

noting sin(λ_0 u) = 0, making γ_0 redundant. Because of this the series is often written as

γ_B(u) = γ_{−1} + Σ_{j=1}^B {γ_{1+2(j−1)} cos(λ_j u) + γ_{2j} sin(λ_j u)}.

As B goes off to infinity it is able to well approximate continuous periodic functions (Fourier's Theorem) on u ∈ [0, 1].

For any B ≥ 1 and γ_{−1:2B}, the trigonometric structure means that

∫_0^1 γ_B(u) du = γ_{−1} + Σ_{j=1}^B {γ_{1+2(j−1)} ∫_0^1 cos(λ_j u) du + γ_{2j} ∫_0^1 sin(λ_j u) du}
               = γ_{−1} + Σ_{j=1}^B λ_j^{−1} [γ_{1+2(j−1)} sin(2πj) − γ_{2j} {cos(2πj) − 1}]
               = γ_{−1}.

To make {γ_B(u)}_{u∈[0,1]} suitable for a seasonal model, which should integrate to 0, it makes sense to set γ_{−1} = 0, yielding what the time series literature typically calls:

Definition 5.1.9. The "trigonometric seasonal model" puts

γ_B(u) = Σ_{j=1}^B {γ_{1+2(j−1)} cos(λ_j u) + γ_{2j} sin(λ_j u)},   λ_j = 2πj,   u ∈ [0, 1].

Note, if B = S/2 then sin(λ_{S/2} u) = 0 at the data points u = t/S, so γ_{2B} is redundant — so the model has S − 1 free parameters.

Example 5.1.10. For quarterly data S = 4 and B = 2; then λ_1/S = π/2, λ_2/S = π and so

γ_2(k/4) = γ_1 cos(πk/2) + γ_2 sin(πk/2) + γ_3 (−1)^k,   k = 1, 2, ...,

or in matrix form:

( γ_2(1/4) )   (  0   1  −1 )           ( γ_1^* )
( γ_2(2/4) ) = ( −1   0   1 ) ( γ_1 )   ( γ_2^* )
( γ_2(3/4) )   (  0  −1  −1 ) ( γ_2 ) = ( γ_3^* ),   where Σ_{k=1}^S γ_k^* = 0,
( γ_2(4/4) )   (  1   0   1 ) ( γ_3 )   ( γ_4^* )

which is the same as the textbook dummy approach to seasonality. This dummy approach is often seen in simple regression analysis with small S.

A practical advantage of the Fourier approach is that it can be used even if S varies over time (e.g. S is the
number of days in a month).
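A sketch of building these S − 1 regressors as a design matrix (the function name is mine, and the construction assumes S is fixed):

import numpy as np

def trig_seasonal_regressors(T, S):
    """T x (S-1) design matrix of the trigonometric seasonal at t = 1, ..., T."""
    t = np.arange(1, T + 1)
    cols = []
    for j in range(1, S // 2 + 1):
        lam = 2.0 * np.pi * j / S
        cols.append(np.cos(lam * t))
        if 2 * j < S:                      # sin(pi * t) = 0 when S is even and j = S/2
            cols.append(np.sin(lam * t))
    return np.column_stack(cols)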

It turns out that the {γB (u)} function has a recursive structure which is helpful computationally and inspires
further model building. To see this, the following well known stacked trigonometric identity is useful.

Theorem 5.1.11. For any t and a, define

( γ_t   )    ( γ cos(λt) + γ^* sin(λt)   )                  ( γ_{t−a}   )              (  cos(λa)   sin(λa) )
( γ_t^* ) := ( −γ sin(λt) + γ^* cos(λt) ) = ϕ(λa) ( γ_{t−a}^* ),   ϕ(λa) = ( −sin(λa)   cos(λa) ).

Proof. For any t,

cos(λt) = cos{λa + λ(t − a)} = cos(λa) cos{λ(t − a)} − sin(λa) sin{λ(t − a)},
sin(λt) = sin{λa + λ(t − a)} = sin(λa) cos{λ(t − a)} + cos(λa) sin{λ(t − a)}.

Thus

γ cos(λt) + γ^* sin(λt)
  = cos(λa) [γ cos{λ(t − a)} + γ^* sin{λ(t − a)}] + sin(λa) [γ^* cos{λ(t − a)} − γ sin{λ(t − a)}]
  = (cos(λa), sin(λa)) (γ_{t−a}, γ_{t−a}^*)^T

and

−γ sin(λt) + γ^* cos(λt)
  = −sin(λa) [γ cos{λ(t − a)} + γ^* sin{λ(t − a)}] + cos(λa) [γ^* cos{λ(t − a)} − γ sin{λ(t − a)}]
  = (−sin(λa), cos(λa)) (γ_{t−a}, γ_{t−a}^*)^T.

Stacking yields the stated result.

Writing a = 1/S and building a collection of such terms, each with a different frequency:

(γ_{t,1}, γ_{t,2})^T = ϕ(λ_1/S) (γ_{t−1,1}, γ_{t−1,2})^T,

and for the 2nd frequency

(γ_{t,3}, γ_{t,4})^T = ϕ(λ_2/S) (γ_{t−1,3}, γ_{t−1,4})^T,

or writing down the results for all frequencies using the more abstract notation:

(γ_{t,1+2(j−1)}, γ_{t,2j})^T = ϕ(λ_j/S) (γ_{t−1,1+2(j−1)}, γ_{t−1,2j})^T,   j = 1, ..., B.

Then (notice the sum does not involve the γ_{t,2j} terms; they just appear in the computation)

γ_B(t/S) = Σ_{j=1}^B γ_{t,1+2(j−1)},   t = 1, 2, ..., S,

e.g. when B = 2, then γ_B(t/S) = γ_{t,1} + γ_{t,3}.

Example 5.1.12. [Continuing Definition 5.1.8] Writing a = 1/S and u = t/S, Theorem 5.1.11 implies that γ_t = γ(t/S) has the recursive structure

γ_t = Σ_{j=1}^B γ_{t,1+2(j−1)},   (γ_{t,1+2(j−1)}, γ_{t,2j})^T = ϕ(λ_j/S) (γ_{t−1,1+2(j−1)}, γ_{t−1,2j})^T,   j = 1, ..., B.

This can be written compactly as

γ_t = (1, 0, 1, 0, ..., 1, 0) α_t = ((1, 0) ⊗ ι_B)^T α_t,   α_t = (γ_{t,1}, γ_{t,2}, ..., γ_{t,2B−1}, γ_{t,2B})^T,

with the 2B × 2B block diagonal transition

α_{t+1} = diag{ϕ(λ_1/S), ϕ(λ_2/S), ..., ϕ(λ_B/S)} α_t.

The above is a 2B-dimensional VAR(1), inside of which there are B bivariate VAR(1) processes, but with no noise in sight — rather like we saw when we studied invertibility of a moving average in Section 3.5.1. Do these bivariate VAR(1)s converge to 0_2 as they are iterated through t = 1, 2, ...? From Section 3.5.1 we know this would happen if the eigenvalues were inside the unit circle. The result below shows that for each j = 1, ..., B the ϕ(λ_j/S) has a pair of complex conjugate eigenvalues on the unit circle. Hence there is no convergence to zero. They go up and down, as time evolves, never converging to 0.

Remark 16. Think of the recursion in Theorem 5.1.11 as a difference equation Z_{λu} = ϕ(aλ) Z_{λ(u−a)}. The eigenvalues, written ω_{1:2}, of ϕ(λ) solve

|I_2 ω − ϕ(λ)| = {ω − cos(λ)}² + sin²(λ) = ω² − 2ω cos(λ) + cos²(λ) + sin²(λ) = ω² − 2ω cos(λ) + 1 = 0,

which are, for any choice of λ, a complex conjugate pair of roots on the unit circle:

ω_{1:2} = (2cos(λ) ± √{4cos²(λ) − 4})/2 = cos(λ) ± i√{1 − cos²(λ)},   implying |ω_k| = 1,   k = 1, 2.

5.1.3 Stochastic seasonal

In applied time series the seasonal components are rarely unchanged through lengthy periods of time. Think of electricity demand. The advent of efficient heat pumps reduces electricity demand during harsh seasons, while the move away from gas central heating increases demand during winter. Likewise the use of electric vehicles increases electricity demand, but it may flatten the seasonal effects due to the increased spread of high powered batteries.

How can seasonal components be allowed to change through time? We start with

Y_t = γ_t + ε_t,   t = 1, 2, ..., T,

but now allow the seasonal component {γ_t} to be stochastic. It is generated by the stochastic seasonal component.

Definition 5.1.13. The stochastic seasonal component, with S ≥ 2 seasons, sets

γ_t = γ_t(t/S − ⌊t/S⌋),   where   ∫_0^1 γ_t(u) du = 0,

driven by the stochastic function {γ_t(u)}.

Example 5.1.14. [Stochastic trigonometric seasonal model] The stochastic trigonometric seasonal MD model has

γ_t = ((1, 0) ⊗ ι_B)^T α_t,

where

α_{t+1} = ϕ_1 α_t + ω_t,   ϕ_1 = diag{ϕ(λ_1/S), ϕ(λ_2/S), ..., ϕ(λ_B/S)},

and where {ω_t} is a martingale difference sequence — that is E[|ω_{t,j}|] < ∞ and

E[ω_t | F_{t−1}^Z] = 0,   where   Z_t = (Y_t, α_{t+1}^T)^T.

Taken all together, the system of data plus the stochastic seasonals can be written as

Z_t = ( 0        ((1, 0) ⊗ ι_B)^T
        0_{2B}   ϕ_1              ) Z_{t−1} + (ε_t, ω_t^T)^T.

Typically researchers impose homoskedasticity plus a diagonal structure

Var[ω_t | F_{t−1}^Z] = diag(σ_{ω,1:2B}²).

The textbook version of this model has σ_{ω,j}² = σ_ω² for all j. The Hindrayanto, Aston, Koopman, and Ooms (2013) approach has

σ_{ω,1+2(j−1)}² = σ_{ω,2j}²,   j = 1, 2, ..., B.

5.1.4 Stochastic cycle

The same mathematical structure is sometimes used to explicitly force a component to cycle.

Definition 5.1.15. [Stochastic cycle model] A stochastic cycle MD model {ψ_t} is where

ψ_t = ψ_{t,1} = (1, 0) ψ_t,   ψ_t = (ψ_{t,1}, ψ_{t,2})^T,   ψ_{t+1} = ρ_C × ϕ(λ_C) ψ_t + κ_t,   λ_C ∈ [0, π],   ρ_C ∈ [0, 1),

where κ_t = (κ_{t,1:2})^T is a bivariate MD sequence.

Remark 16 applies, so now {ψ_t} is a VAR(1) with a pair of complex conjugate roots, with absolute value |ρ_C| < 1. Hence if {κ_t} is a MD white noise process then {ψ_t} is covariance stationary. In the special case where λ_C = π, then ψ_{t+1} = −ρ_C ψ_t + κ_{t,1}, as sin(λ_C) = 0 and cos(λ_C) = −1. Typically researchers impose that Var[κ_t | F_{t−1}^Z] = σ_κ² I_2.

h 5.1.16. When a cycle model is used it is important to be careful about modeling the additional noise component, e.g.

Y_t = ψ_t + ε_t.

Once one goes beyond {ε_t} being a MD white noise process, and allows it to be covariance stationary, the model threatens trouble, as {ψ_t} is typically covariance stationary too. One approach would be to make {ε_t} an AR(p) but ensure it has no complex roots.

5.2 Stochastic trend


5.2.1 Quantifying trend

Some series show a tendency to grow through time.

Example 5.2.1. Many countries have grown wealthier through time, after adjusting for population growth. A main way of measuring this is to compute a real GDP per capita number through time, looking at the rate of growth. Of course, GDP has its own limits as a measure of human welfare, e.g. Coyle (2015). However, we sidestep that issue here. The left hand side of Figure 5.2 shows real GDP per capita in the US from 1947 onwards (seasonally adjusted), based on 2017 dollars (it is the series on FRED coded: A939RX0Q048SBEA). The middle plot shows the series in logs, which makes looking at growth rates easier.

[Figure 5.2 here: three panels, U.S. real GDP per capita, its log, and log U.K. real per capita GDP.]

Figure 5.2: Per capita real GDP. Left: US. Middle: log for US. Right: log for UK.

It is helpful to put these

growth rates in a broader historical context. Data over very long periods are rare. Broadberry, Campbell, Klein, Overton, and van Leeuwen (2015) estimated annual real GDP for England from 1270 to 1870, which can be spliced with GDP numbers for Great Britain onwards. Here the data stops in 2016. It is in the file UKGDP1270.csv, downloaded from the Our World in Data website. The right hand side of Figure 5.2 shows the result on a log-scale; the series tends to go upwards, but the rate at which it goes up has changed over time. In terms of the US data, does the middle graph indicate that the US growth rate has fallen? This is the subject of the books by Gordon (2016) and DeLong (2022). In terms of the UK on the right hand side, you can see from the early 1700s the establishment of some systematic economic growth, which seems to accelerate again after about 1820 and again from about half way through the 1900s. Another feature, which is less obvious, is that the yearly variability of the series about the long-run development of GDP seems to have fallen a great deal after the early 1700s — this is important to human welfare, as rapid falls in real GDP can be catastrophic. Finally, the first four centuries are very thought provoking. They show almost no growth at all. I find this the most stunning data I have ever seen. Economic growth is not an inevitable result of the human condition!

In a simple growth rate regression model, e.g. for the log of GDP,

Y_t = μ_t + ε_t,

where

μ_t = μ_{t−1} + β,   implying   μ_t = μ_0 + tβ,

then β would be the growth rate. The {μ_t} process will be called a trend component. It will be helpful later to get used to writing the system as a VAR(1):

Z_t = (Y_t, μ_{t+1})^T = ϕ_1 Z_{t−1} + (ε_t, β)^T,   ϕ_1 = ( 0  1
                                                            0  1 ).

Notice that the elements of the first column of ϕ_1 are all zero, so Y_{t−1} has no impact on Z_{t,2} = μ_{t+1}. All the time series dependence carries through via {μ_t}. The column of zeros in ϕ_1 means the memory in {Y_t} is induced indirectly, in this case through {μ_t}. These columns of zeros in ϕ_1 will be a common theme in the discussion below.

Assume that E[|ε_t|] < ∞; then

E[Y_{t+s} | F_{t−1}^Y] = E[μ_{t+s} | F_{t−1}^Y] + E[ε_{t+s} | F_{t−1}^Y],   s ≥ 0,
                      = E[μ_t | F_{t−1}^Y] + sβ + E[ε_{t+s} | F_{t−1}^Y].

In the special case where {ε_t} is a MD sequence with respect to F_t^Y, then

E[Y_{t+s} | F_{t−1}^Y] = E[μ_t | F_{t−1}^Y] + sβ,

so the prediction, as a function of the forecast horizon s, is a straight line with slope β.

Definition 5.2.2. [Detrending] Suppose {μ̂_t} is the estimated trend component; then

Y_t − μ̂_t,   t = 1, ..., T,

is the (estimated) detrended series. It can be carried out as a filter or as a smoother.

Again, detrending is a major descriptive method in modern time series modeling.

One popular approach to the statistical analysis of trending series is to difference away the growth:

ΔY_t = Δμ_t + Δε_t = β + Δε_t,

so for the differenced series the growth rate is the location of the differenced series. Often β might be estimated using data Y_{1:T} through a simple statistic like

β̂ = (1/(T − 1)) Σ_{t=2}^T ΔY_t
  = (Y_T − Y_1)/(T − 1),   telescoping,
  = β + (ε_T − ε_1)/(T − 1).

5.2.2 Stochastic trend

The UK data, at least, suggests the growth rates has changed over long periods of time, while Gordon (2016)

has argued at great length that U.S. growth rates have fallen in recent times — an important conclusion for
public policy if it is true. One approach to deepen the regression model to deal with this kind of question, is
to make the growth rate a stochastic process {βt }, so

µt+1 = µt + βt = µ0 + β0 + β1 + ... + βt .

Writing
βt+1 = βt + ξt ,

then      
Yt εt 0 1 0
Zt =  µt+1  = ϕ1 Zt−1 +  0  , ϕ1 =  0 1 1 . (5.1)
βt+1 ξt 0 1 1
In the special case where (εt , ξt ) is independent of Zt−1 , then {Zt } is a Markov chain, but of course {Yt } is not.
Notice again the elements of the first column of ϕ1 are all zero.

Definition 5.2.3. [Smooth trend] A core model is to assume that E[|ξ_t|] < ∞ and

E[ξ_t | F_{t−1}^Z] = 0,

that is, the {β_t} is a martingale — expressing ignorance about how the slope might change in the future. The implied {μ_t} process is called a smooth trend component. Notice that

ξ_t = β_{t+1} − β_t = Δμ_{t+2} − Δμ_{t+1} = Δ²μ_{t+2}.

Assume that E[|ε_t|] < ∞; then

E[Y_{t+s} | F_{t−1}^Y] = E[μ_{t+s} | F_{t−1}^Y] + E[ε_{t+s} | F_{t−1}^Y]
  = E[μ_t | F_{t−1}^Y] + Σ_{j=t}^{t+s−1} E[β_j | F_{t−1}^Y] + E[ε_{t+s} | F_{t−1}^Y]
  = E[μ_t | F_{t−1}^Y] + sE[β_t | F_{t−1}^Y] + E[ε_{t+s} | F_{t−1}^Y],   as {β_t} is a martingale.

In the special case where {ε_t} is a MD sequence with respect to F_t^Z, then

E[Y_{t+s} | F_{t−1}^Y] = E[μ_t | F_{t−1}^Y] + sE[β_t | F_{t−1}^Y],

so the prediction, as a function of the forecast horizon s, is a straight line with slope E[β_t | F_{t−1}^Y] — the estimated slope using current data.

5.2.3 Trend estimation

If {ε_t, ξ_t} are i.i.d., mutually independent, zero mean Gaussian variables, then the model (5.1) has

log f(Y_{1:T}, μ_{1:T} | μ_{−1:0}) = log f(Y_{1:T} | μ_{1:T}) + log f(μ_{1:T} | μ_{0,−1})
  = c − (1/(2Var(ε_1))) Σ_{t=1}^T (Y_t − μ_t)² − (1/(2Var(ξ_1))) Σ_{t=1}^T (Δ²μ_t)²
  = c − (1/(2Var(ε_1))) {Σ_{t=1}^T (Y_t − μ_t)² + (1/q_ξ) Σ_{t=1}^T (Δ²μ_t)²},   q_ξ = Var(ξ_1)/Var(ε_1),

so the posterior mean and median of μ_{1:T} | Y_{1:T} is the L2-penalized least squares estimator of μ_{1:T}, which is

μ̂_{1:T} = arg min_{μ_{1:T}} Σ_{t=1}^T (Y_t − μ_t)² + (1/q_ξ) Σ_{t=1}^T (Δ²μ_t)².

The μ̂_{1:T} is a ridge regression applied to Δ²μ_t. As q_ξ increases, the more flexible the μ_{1:T} can be. In the limits: as q_ξ → ∞, μ̂_t → Y_t, while as q_ξ → 0, Δ²μ̂_t → 0 for all t, so μ̂_{1:T} tends to a fitted linear time trend. In macroeconomics μ̂_{1:T} is sometimes called the Hodrick and Prescott (1997) filter (they advocated the universal data-free choice of q_ξ = 1/1600 for quarterly macro data), but it appeared earlier in many areas of economics, statistics and applied mathematics, e.g. Whittaker (1923), Kimmeldorf and Wahba (1970) (cubic spline), Green and Silverman (1994) (penalty), Harvey (1989) (time series smoothing).
If {ε_t, ξ_t} are i.i.d., mutually independent, zero mean random variables, with ε_t being Gaussian but ξ_t Laplace, then

log f(Y_{1:T}, μ_{1:T} | μ_{−1:0}) = c − (1/(2Var(ε_1))) Σ_{t=1}^T (Y_t − μ_t)² − (1/√(Var(ξ_1)/2)) Σ_{t=1}^T |Δ²μ_t|
  = c − (1/(2Var(ε_1))) {Σ_{t=1}^T (Y_t − μ_t)² + (1/q_ξ) Σ_{t=1}^T |Δ²μ_t|},   q_ξ = √(Var(ξ_1)/2)/(2Var(ε_1)),

so the posterior mode of μ_{1:T} | Y_{1:T} is the L1-penalized least squares estimator of μ_{1:T}, which is

μ̃_{1:T} = arg min_{μ_{1:T}} Σ_{t=1}^T (Y_t − μ_t)² + (1/q_ξ) Σ_{t=1}^T |Δ²μ_t|.

The μ̃_{1:T} is a Lasso regression applied to Δ²μ_t. It seems due to Kim, Koh, Boyd, and Gorinevsky (2009). An extensive discussion of it is given in Tibshirani (2014). Crucially the {μ̃_t} has some of the fitted Δ²μ̃_t set to zero, so {μ̃_t} is a continuous, piecewise linear function of time.
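A minimal dense-matrix sketch of the L2-penalized estimator, writing λ = 1/q_ξ (the L1 version needs a convex programming solver, so it is omitted here):

import numpy as np

def l2_trend(Y, lam):
    """Solve min sum (Y_t - mu_t)^2 + lam * sum (Delta^2 mu_t)^2, lam = 1/q_xi."""
    T = len(Y)
    D = np.diff(np.eye(T), n=2, axis=0)    # (T-2) x T second difference operator
    return np.linalg.solve(np.eye(T) + lam * D.T @ D, np.asarray(Y, float))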

5.2.4 Local level and local linear trend

Some series have time-varying levels but no particular systematic growth. We will see an example of this in Example 5.5.7.

Definition 5.2.4. The local level MD model has

Z_t = (Y_t, μ_{t+1})^T = ϕ_1 Z_{t−1} + (ε_t, η_t)^T,   ϕ_1 = ( 0  1
                                                              0  1 ),

where {ε_t, η_t} is MD with respect to F_t^Z. It is conventional to further assume that Var(ε_t) < ∞ and Var(η_t) < ∞, while

E[ε_t η_t^T | F_{t−1}^Z] = 0_{d×d}.

A local linear trend model mixes the martingale trend with the smooth trend model to yield

Z_t = (Y_t, μ_{t+1}, β_{t+1})^T = ϕ_1 Z_{t−1} + (ε_t, η_t, ξ_t)^T,   ϕ_1 = ( 0  1  0
                                                                             0  1  1
                                                                             0  0  1 ),

where {ε_t, η_t, ξ_t} is MD with respect to F_t^Z. Notice again that the elements of the first column of ϕ_1 are all zero.

In the local level model

E[Y_{t+s} | F_{t−1}^Y] = E[μ_t | F_{t−1}^Y],   as {ε_t} is MD,   s ≥ 0.

Notice that E[μ_t | F_{t−1}^Y] is not Y_{t−1}, so the {Y_t} process is not a martingale, even though {μ_t} is. This structure has been very influential in the history of time series.

For the local linear trend model

E[Y_{t+s} | F_{t−1}^Y] = E[μ_t | F_{t−1}^Y] + sE[β_t | F_{t−1}^Y],   s ≥ 0,

so here both the level and slope need to be estimated from the data in order to extrapolate into the future.

5.3 Hidden Markov models

5.3.1 Introduction

The next model structure is perhaps the most commonly used in modern time series.

Definition 5.3.1. [Hidden Markov model] The Hidden Markov model (HMM) of {Y_t, α_{t+1}} labels Y_t the t-th observation and α_t the t-th state. The HMM has two assumptions:

(a) The

Y_t | (Y_{1:t−1}, Y_{t+1:T}, α_{1:T}) =_L Y_t | α_t,

that is, conditional on the states, the observations are independent.

(b) The {α_t} is a Markov chain. In the modern literature on HMMs it is conventional to work with the law of

α_{t+1} | α_t

rather than α_t | α_{t−1} — obviously this is just a notational change (it is carried out to make some formulas more compact). Notice that although {α_t} is a Markov chain, it is mostly non-stationary in applied work.

Remark 17. HMMs have other names in the literature. Sometimes they are called state space models or parameter driven models. In the linear case, they are sometimes called dynamic linear models. HMMs are special cases of Markov random fields, which play an important role in physics and probability as models of undirected graphs.

The HMM is very extensively used in most areas of applied science, e.g. signal processing, bioinformatics, pattern recognition and economics.

The HMM induces a Markov chain on

Z_t = (Y_t, α_{t+1})

with the transition density

f(Z_t | Z_{t−1}) = f(Y_t, α_{t+1} | Y_{t−1}, α_t) = f(Y_t | α_t) f(α_{t+1} | α_t),

having a special form — with the memory in {Y_t} being carried only through the Markovian "state" α_t.

When Y_{1:T}, α_{1:T} are Gaussian this line of work started with the linear model of Kalman (1960), looking at the "linear state space system"

Z_t = (Y_t, α_{t+1}) = ϕ_{t,1} Z_{t−1} + B_t ω_t,   ϕ_{t,1} = ( 0_{d×d}  Z_t        B_t = ( H_t      0_{d×r}
                                                                0_{r×d}  T_t ),             0_{r×d}  Q_t ),

where

ω_t ~ iid,   E[ω_t] = 0_{d+r},   Var[ω_t] = I_{d+r},   r = dim(α_t),

and the matrices ϕ_{t,1}, B_t are non-stochastic. Book length treatments of the many models which are included in this class include Harvey (1989), West and Harrison (1989) and Durbin and Koopman (2012). A later section will discuss these systems when the i.i.d. assumption is replaced by weak white noise.

5.3.2 Three local level models: one is a HMM

The local level model captures many crucial features of models with state variables.
To avoid confusion when reading research papers it is useful to clearly delineate three versions of the local
level model:
5.3. HIDDEN MARKOV MODELS 99

ˆ local level martingale difference (MD) model;

ˆ local level model;

ˆ local level white noise (WN) model.

They all have a common skeleton


     
Yt 0d×d Id εt
Zt = = Z t−1 + .
µt+1 0d×d Id ηt

It is all about the assumptions placed on {εt , ηt }.

ˆ In the local level MD model, we have that Var(εt ) < ∞, Var(ηt ) < ∞ and
  
εt Z Z
E |Ft−1 = 02d , E[εt ηtT |Ft−1 ] = 0d×d .
ηt

Then {µt } is a martingale. But taken together, the local level MD model is not a HMM.

ˆ In the local level model

εt ∼ iid,  ηt ∼ iid,  {εt} ⊥⊥ {ηt}.

Then {µt} is a random walk and the local level model is a HMM with αt = µt. In practice, the noise terms are typically assumed to have second moments, but there are heavy tailed applications where this addition is not made.

ˆ In the local level WN model, {εt, ηt} is assumed to be weak zero mean white noise, with Cov(εt, ηt) = 0_{d×d}. This is not a HMM. The local level model implies the local level WN model if the noise terms have a zero mean and their variances exist. The local level MD model is a local level WN model if the noise terms are unconditionally homoskedastic, that is Var(εt) = Var(ε1) and Var(ηt) = Var(η1) for all t.

The local level WN model can be related to moving average models.

Theorem 5.3.2. Suppose the d-dimensional {Yt} follows a local level WN model

Yt = µt + εt,  µt+1 = µt + ηt,  {εt} ⊥ {ηt},

where {εt}, {ηt} are zero mean weak white noise processes with Var(εt) = Σε and Var(ηt) = Ση. Then {∆Yt} is covariance stationary and has an MA(1) representation with

E[∆Yt] = 0,  Var[∆Yt] = Ση + 2Σε,  E[∆Yt ∆Yt−1′] = −Σε,  E[∆Yt ∆Yt−s′] = 0, s > 1.

In the scalar case

Cor[∆Yt, ∆Yt−1] = −σε² / (2σε² + ση²) = −1 / (2 + qη),  qη = ση² / σε²,

so the first autocorrelation must be non-positive. qη is called the signal-to-noise ratio — a high qη yields an autocorrelation near zero (the process is then close to a random walk, as the noise is less important), while a small qη yields an autocorrelation near −1/2. For the MA(1) process Yt = ωt + θ1 ωt−1, Cor(Yt, Yt−1) = θ1/(1 + θ1²), so equating terms

−(2 + qη) = (1 + θ1²)/θ1,  implying  qη = 1/(−θ1) + (−θ1) − 2,

which is only possible if θ1 ∈ [−1, 0). Notice that qη monotonically decreases as θ1 moves from 0 to −1.

Proof. First

∆Yt = ∆µt + ∆εt = ηt−1 + ∆εt,

so E[∆Yt] = 0 and E[∆Yt ∆Yt−s′] = 0 for s > 1, while

Var[∆Yt] = Var(ηt−1 + εt − εt−1) = Ση + 2Σε, (white noise, uncorrelated {εt}, {ηt})

and

E[∆Yt ∆Yt−1′] = E[(εt − εt−1)(εt−1 − εt−2)′] = −Σε, (white noise {εt}, uncorrelated {ηt}).

Hence the {∆Yt} is a covariance stationary MA(1).
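
To make the MA(1) correspondence concrete, here is a minimal R sketch (all parameter values are illustrative, not from the text), which simulates a scalar local level model and compares the first sample autocorrelation of ∆Yt with −1/(2 + qη):

set.seed(1)
n <- 100000; sig_eps <- 1; sig_eta <- 0.5      # illustrative values
q_eta <- sig_eta^2 / sig_eps^2
mu <- cumsum(rnorm(n, 0, sig_eta))             # random walk level mu_t
y  <- mu + rnorm(n, 0, sig_eps)                # observations Y_t = mu_t + eps_t
dy <- diff(y)
acf(dy, lag.max = 1, plot = FALSE)$acf[2]      # sample Cor(dY_t, dY_{t-1})
-1 / (2 + q_eta)                               # theory: about -0.444

The two numbers should agree closely at this sample size.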

5.3.3 Stochastic volatility and the HMM

One of the main ways of building a statistical model for a martingale difference sequence is through stochastic
volatility. This is directly important in financial applications, but is super helpful as an ingredient for flexible

statistical models.

Definition 5.3.3. [Stochastic volatility] If

Yt = σt εt,  εt ∼ iid, E[εt] = 0, Var[εt] = 1, σt ≥ 0, {εt} ⊥⊥ {σs},

then {Yt} follows a univariate stochastic volatility (SV) process. The {σt} is called the volatility process. If {σt} is Markovian, then the SV process is a HMM. More broadly, if some state αt is Markovian, and σt = h(αt), then the SV model is a HMM.

The main properties of SV are immediate:

ˆ If E[σt] < ∞ for all t, then {Yt} is a MD sequence with respect to F^Y_t.


ˆ If E[σt²] = E[σ1²] < ∞ then {Yt} is a zero mean, weak white noise sequence.

ˆ If {σt} is covariance stationary then

E[Y1] = 0,  Var(Y1) = E[σ1²],  Cov(Y1, Y1−s) = 0, s > 0,

while

E[|Y1|] = E[|ε1|] E[σ1],  Var(|Y1|) = E[σ1²] − {E[|ε1|]}² {E[σ1]}²,  Cov[|Y1|, |Y1−s|] = {E[|ε1|]}² Cov(σ1, σ1−s).

Notice that if {σt} is highly autocorrelated, then so will be {|Yt|}.

The SV model can be used to parameterize a local level MD model, making it a HMM.

Example 5.3.4. [Local level SV model] The univariate local level SV model assumes

Zt = (Yt, µt+1)′ = [0 1; 0 1] Zt−1 + Bt ωt,  Bt = diag(σt,1:2),  ωt ∼ iid N(0, I2),

where the noise {ωt} ⊥⊥ {Bt}. This is a HMM for {Yt, αt+1} with

αt = (µt, σt,1, σt,2)′,

so long as {σt,1:2} is Markovian. There has been a very vibrant recent literature on the local level SV model and various extensions, including Stock and Watson (2007), Stock and Watson (2016a), Shephard (2015) and Li and Koopman (2021). The latter provides an extensive review.

Example 5.3.5. The SV process

log σt+1 = µ + ϕ1(log σt − µ) + √(1 − ϕ1²) ση ηt,  (εt, ηt)′ ∼ iid N(0₂, I₂), |ϕ1| < 1,  log σ1 ∼ N(µ, ση²),

is often called a log-normal SV model and it has its roots in the work of Taylor (1982). Then

log σt ∼ N(µ, ση²),  log σt + log σt+s ∼ N(2µ, 2ση²(1 + ϕ1^s)),

implying, using the properties of the log-normal distribution:

E[σ1] = exp(µ + ση²/2),  E[σ1²] = exp(2µ + 2ση²),

while

E[σ1 σ1−s] = exp(2µ + ση²(1 + ϕ1^s)),  Cov[σ1, σ1−s] = exp(2µ + ση²){exp(ση² ϕ1^s) − 1}.
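
A short R simulation (illustrative parameter values, not from the text) shows the key SV fingerprints: {Yt} is serially uncorrelated while {|Yt|} inherits the strong autocorrelation of {σt}:

set.seed(1)
n <- 50000; mu <- -1; phi1 <- 0.98; sig_eta <- 0.4   # illustrative values
h <- numeric(n); h[1] <- rnorm(1, mu, sig_eta)       # h_t = log sigma_t
for (t in 2:n)
  h[t] <- mu + phi1 * (h[t - 1] - mu) + sqrt(1 - phi1^2) * sig_eta * rnorm(1)
y <- exp(h) * rnorm(n)                               # Y_t = sigma_t * eps_t
acf(y,      lag.max = 5, plot = FALSE)$acf[-1]       # roughly zero
acf(abs(y), lag.max = 5, plot = FALSE)$acf[-1]       # positive and persistent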


5.3.4 Some other HMM models

Perhaps the most famous example of the HMM is where αt is a finite state Markov chain. The initial development and analysis of this model is usually credited to Leonard Baum in a series of papers (e.g. Baum and Petrie (1966) and Baum and Eagon (1967)).

Example 5.3.6. In economics it is often associated with Hamilton (1989), who uses αt ∈ {0, 1} as a recession
indicator, taking the value 1 if the economy is in recession at time t, with {αt }, a binary Markov chain, buried
inside an autoregression, e.g.

Yt = µ(αt ) + ϕ1 Yt−1 + εt .

More extensive work along this line includes Kim and Nelson (1999).

Example 5.3.7. The use of HMM is extremely common in DNA analysis, e.g. Durbin, Eddy, Krogh, and

Mitchison (1998), where, in the most basic model, αt ∈ {A, G, C, T }, where the letter A denotes adenine, G is
guanine, C is cytosine and T is thymine, while {Yt } is a high throughput initial guess at the sequence of the
letters.

Example 5.3.8. [VAR-SV] In modern macroeconomics the VAR(p)

Yt = ϕ1 Yt−1 + ... + ϕp Yt−p + diag(σt,1:d) εt,  εt ∼ iid, E[ε1] = 0, Var(ε1,j) = 1, j = 1, ..., d,  {σt,1:d} ⊥⊥ {εt},

is commonly used, allowing time-varying volatility (notice there is no assumption that the {ε1,j} are uncorrelated over j). The time series of shocks

diag(σt,1:d) εt

is a multivariate SV model introduced by Harvey, Ruiz, and Shephard (1994). The {Yt−p:t, σt,1:d} is a HMM.

Example 5.3.9. [Dynamic factor model] Suppose d is very high; then many researchers use a dynamic factor model

Yt = Z αt + εt,
αt+1 = T αt + ηt,

with a low dimensional αt and {εt, ηt} being independent white noise. Z is called a factor loading matrix. Stock and Watson (2016b) provides a survey.

5.4 Filtering and smoothing


5.4.1 Big picture

Recall from the local level MD model we had

E[Yt+s | F^Y_{t−1}] = E[µt | F^Y_{t−1}].

This is nice but it does not tell us how to calculate

E[µt | F^Y_{t−1}].

Think more broadly about using a sequence of observations {Yt} to learn a sequence of states {αt}. There are two big categories of ways of carrying this out:

ˆ filtering: calculating features (e.g. moments and quantiles) of the posterior distribution of the current

state

αt |Y1:t , t = 1, 2, ...

given current information. Its cousin is prediction:

αt+1 |Y1:t ;

ˆ smoothing: calculating features of the posterior distribution of the state

αt |Y1:T , t = 1, 2, ..., T,

given historical information.

Filtering is important as an ingredient for prediction and statistical inference. Smoothing tends to be for

historical analysis, such as how to seasonally adjust a time series.

In principle this raises no new intellectual issues — everything really is just computational. First look at
the principles.

With knowledge of the joint law of the observations and states up to time t, that is

f(Y1:t, α1:t), t = 1, 2, ...,

then the posterior for filtering is:

f(α1:t | Y1:t) = f(Y1:t, α1:t) / ∫ f(Y1:t, α1:t) dα1:t,  t = 1, 2, ..., T,

while for smoothing it is

f(α1:T | Y1:T) = f(Y1:T, α1:T) / ∫ f(Y1:T, α1:T) dα1:T.

Marginalizing these joint densities gives the law of αt | Y1:t for t = 1, ..., T, and of αt | Y1:T for t = 1, ..., T, or features (e.g. moments and quantiles) of them. Of course, we can choose to work with the conditional and marginal distributions

f (Y1:t , α1:t ) = f (Y1:t |α1:t )f (α1:t ),



but this only makes these formulas more complicated.

In special models these integrals can be computed, e.g. if Y1:t, α1:t is jointly Gaussian. Typically though we should expect to see the use of computational methods, e.g. MCMC where we generate samples

α1:t^[1], ..., α1:t^[B]

from α1:t | Y1:t, discarding most of the output to deliver draws

αt^[1], ..., αt^[B]

from αt | Y1:t, which can be used for simulation based estimation of moments and quantiles of the posterior for filtering. The same simulation strategy potentially works for the smoothing problem, drawing

α1:T^[1], ..., α1:T^[B]

from α1:T | Y1:T.

5.4.2 HMM

For HMM the law of the observations given the states and the states themselves massively simplify due to the assumed sequential nature of the conditional independence assumptions:

f(Y1:t | α1:t) = ∏_{j=1}^t f(Yj | αj),  f(α1:t) = f(α1) ∏_{j=2}^t f(αj | αj−1).

This structure allows filtering to be carried out sequentially.

Theorem 5.4.1. [Filtering for HMM] For a HMM {Yt, αt} there are three terms:
(1) prediction step

f(αt | Y1:t−1) = ∫ f(αt | αt−1) × f(αt−1 | Y1:t−1) dαt−1,

(2) updating step

f(αt | Y1:t) = {f(Yt | αt) / f(Yt | Y1:t−1)} × f(αt | Y1:t−1),

(3) prediction distribution

f(Yt | Y1:t−1) = ∫ f(Yt | αt) f(αt | Y1:t−1) dαt.

Finally, note, in terms of paths

f(α1:t | Y1:t) = {f(Yt | αt) f(αt | αt−1) / f(Yt | Y1:t−1)} × f(α1:t−1 | Y1:t−1). (5.2)

Proof. The

f(α1:t | Y1:t) = f(Y1:t | α1:t) f(α1:t) / f(Y1:t), (conditional prob)
= f(Yt | αt) f(αt | αt−1) × f(Y1:t−1 | α1:t−1) f(α1:t−1) / f(Y1:t), (HMM)
= {f(Yt | αt) f(αt | αt−1) / f(Yt | Y1:t−1)} × f(Y1:t−1 | α1:t−1) f(α1:t−1) / f(Y1:t−1), (prediction decomposition)
= {f(Yt | αt) f(αt | αt−1) / f(Yt | Y1:t−1)} × f(α1:t−1 | Y1:t−1), (conditional prob)

which is equation (5.2). Now

f(αt | Y1:t−1) = ∫ f(αt | αt−1) × f(α1:t−1 | Y1:t−1) dα1:t−1
= ∫ f(αt | αt−1) × f(αt−1 | Y1:t−1) dαt−1, integrating out α1:t−2, (5.3)

while

∫ f(α1:t | Y1:t) dα1:t−1 = ∫ {f(Yt | αt) f(αt | αt−1) / f(Yt | Y1:t−1)} f(α1:t−1 | Y1:t−1) dα1:t−1
= {f(Yt | αt) / f(Yt | Y1:t−1)} ∫ f(αt | αt−1) f(α1:t−1 | Y1:t−1) dα1:t−1
= {f(Yt | αt) / f(Yt | Y1:t−1)} × f(αt | Y1:t−1), using eqn (5.3),

again as stated. The term

f(Yt | Y1:t−1) = ∫ f(Yt | αt) f(αt | Y1:t−1) dαt.

The implication is that filtering can potentially be carried out sequentially, if one can solve 2t integrals of dimension dim(αt). This may be much easier than the problem without the HMM structure, where we had to do one ∑_{j=1}^t dim(αj)-dimensional integral.

Example 5.4.2. [Baum filter] In the case where the state αt has finite support, the integrals are replaced by sums. This special case is called the Baum filter. It takes on the form

P(αt | Y1:t−1) = ∑_{αt−1} P(αt | αt−1) P(αt−1 | Y1:t−1),  f(Yt | Y1:t−1) = ∑_{αt} f(Yt | αt) P(αt | Y1:t−1).

In economics it is often called the Hamilton (1989) filter.
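
A minimal R sketch of this filter follows; the function name and interface are mine, not the literature's. Here P is the K × K transition matrix with P[i, j] = P(αt = j | αt−1 = i), dens(y, j) evaluates f(yt | αt = j), and p0 is the initial distribution of α1:

baum_filter <- function(y, P, dens, p0) {
  K <- length(p0); n <- length(y)
  filt <- matrix(0, n, K); loglik <- 0
  pred <- p0                                   # P(alpha_1 | no data yet)
  for (t in 1:n) {
    joint <- pred * sapply(1:K, function(j) dens(y[t], j))
    fyt <- sum(joint)                          # f(y_t | y_{1:t-1})
    loglik <- loglik + log(fyt)
    filt[t, ] <- joint / fyt                   # updating step
    pred <- as.vector(filt[t, ] %*% P)         # prediction step
  }
  list(filtered = filt, loglik = loglik)
}

For instance, a two-state Gaussian regime model can be run with dens <- function(y, j) dnorm(y, mean = c(0, 2)[j], sd = 1).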

Smoothing can also be carried out recursively. After filtering, going through the data forwards t = 1, 2, ..., T, the smoother goes backwards in time, starting with the final output from the filter f(αT | Y1:T).

Theorem 5.4.3. [Smoothing for HMM] The joint distribution of the smoother is

f(α1:T | Y1:T) = f(αT | Y1:T) ∏_{j=1}^{T−1} f(αj | αj+1, Y1:j),  where  f(αt | αt+1, Y1:t) = f(αt+1 | αt) × f(αt | Y1:t) / f(αt+1 | Y1:t).

In the literature this decomposition of the joint distribution is often called the forward-backward filter. Further, the time-t smoother is

f(αt | Y1:T) = ∫ f(αt | αt+1, Y1:t) × f(αt+1 | Y1:T) dαt+1.

Proof. The forward-backward result is

f(α1:T | Y1:T) = f(αT | Y1:T) f(αT−1 | αT, Y1:T) ... f(α1 | α2:T, Y1:T), (prediction decomposition in the states, going backwards!)
= f(αT | Y1:T) f(αT−1 | αT, Y1:T−1) ... f(α1 | α2, Y1:1), (HMM)
= f(αT | Y1:T) ∏_{j=1}^{T−1} f(αj | αj+1, Y1:j), (Markov state).

Now

f(αt | αt+1, Y1:t) = f(αt, αt+1 | Y1:t) / f(αt+1 | Y1:t), (conditional prob)
= f(αt+1 | αt) f(αt | Y1:t) / f(αt+1 | Y1:t), (HMM).

This is the first result. The second:

f(αt | Y1:T) = ∫ f(αt, αt+1 | Y1:T) dαt+1
= ∫ f(αt | αt+1, Y1:T) × f(αt+1 | Y1:T) dαt+1
= ∫ f(αt | αt+1, Y1:t) × f(αt+1 | Y1:T) dαt+1, (HMM).

This is the second.

Example 5.4.4. [Baum smoother] In the case where the support of αt is finite, the integrals are replaced by sums. Then

P(αt | Y1:T) = ∑_{αt+1} P(αt | αt+1, Y1:t) × P(αt+1 | Y1:T).

The importance of the forward-backward filter was emphasized by Carter and Kohn (1994) and Frühwirth-Schnatter (1994), who used it to simulate a smoothing path

α1:T^[b] ∼ α1:T | Y1:T,  b = 1, ..., B,

using:

ˆ Simulate from αT^[b] ∼ αT | Y1:T;

ˆ Simulate from αj^[b] ∼ αj | αj+1^[b], Y1:j, for j = (T − 1), (T − 2), ..., 1.

Of course to use this we need to be able to simulate from αT |Y1:T and, then, repeatedly

f (αt |αt+1 , Y1:t ) ∝ f (αt |Y1:t )f (αt+1 |αt ).

5.5 Computation methods


5.5.1 Kalman filter

In the binary state case filtering and smoothing for the HMM are quite simple. But more broadly the presence of the integrals makes the filtering and smoothing recursions nontrivial. In the case where the joint law of α1:T, Y1:T is Gaussian, they can be analytically solved. This model is a Gaussian HMM. In the literature it is sometimes called a Gaussian state space model, a linear state space model (Harvey (1989)) or a dynamic linear model (West and Harrison (1989)).

Definition 5.5.1. Gaussian HMM (GHMM) has the pair {Yt, αt+1} following

Zt = (Yt, αt+1)′ = ϕt,1 Zt−1 + Bt ωt,  ϕt,1 = [0_{d×d} Zt; 0_{r×d} Tt],  Bt = [Ht 0_{d×r}; 0_{r×d} Qt],  ωt ∼ iid N(0, I_{d+r}),

where

α1 ∼ N(a1, P1),  {ωt} ⊥⊥ α1.

We assume

{a1, P1, {ϕt,1, Bt}_{t=1}^T} ⊥⊥ {{ωt}_{t=1}^T, α1}

(in which case we work with {Yt, αt} | {a1, P1, ϕt,1, Bt}) or assume that {a1, P1, ϕt,1, Bt} are non-stochastic. We denote this as

{Yt, αt+1} ∼ GHMM.

This linear structure was introduced by Kalman (1960), although he did not use Gaussianity; instead he focused on weak white noise assumptions on the {εt, ζt}. This weak white noise version of his model will be discussed in Section ??.

Definition 5.5.2. The Kalman filter computes {at+1 , Pt+1 } where αt+1 |Y1:t ∼ N (at+1 , Pt+1 ) assuming the
{Yt , αt+1 } ∼ GHMM.

Write the “prediction error” at time t + 1 as vt+1 = Yt+1 − E[Yt+1 |Y1:t ] and the corresponding conditional
variance Ft+1 = Var(vt+1 |Y1:t ). Then the Kalman filter runs sequentially.

Algorithm 5.5.3. [Kalman filter] Assume {Yt , αt+1 } ∼ GHMM. Then for t = 1, 2, ...

1. Compute

(a) vt = Yt − Zt at,
(b) Ft = Zt Pt Zt′ + Ht Ht′,
(c) store the Kalman gain Kt = Tt Pt Zt′ Ft^{−1}.

2. Compute

(a) at+1 = Tt at + Kt vt,
(b) Pt+1 = Tt Pt Tt′ + Qt Qt′ − Kt Ft Kt′.

The computational cost of running the Kalman filter is O(T × dim(αt )3 ).

Proof. Step 1a:

E[Yt | Y1:t−1] = Zt E[αt | Y1:t−1] + E[εt | Y1:t−1] = Zt at, where εt = (Ht, 0)ωt,

so

vt = Yt − E[Yt | Y1:t−1] = Yt − Zt at.

Step 1b:

Ft = Var(Yt | Y1:t−1) = Zt Var(αt | Y1:t−1) Zt′ + Var(εt | Y1:t−1) = Zt Pt Zt′ + Ht Ht′.

The derivation of 2a and 2b follows from writing

(αt+1, vt)′ = (Tt αt + ζt, Yt − E[Yt | F^Y_{t−1}])′ = (Tt at + Tt(αt − at) + ζt, Zt(αt − at) + εt)′,

implying that

(αt+1, vt)′ | F^Y_{t−1} ∼ N((Tt at, 0)′, [Tt Pt Tt′ + Qt Qt′, Tt Pt Zt′; Zt Pt Tt′, Ft]),

and finishing by applying Bayes theorem for a Gaussian likelihood and prior by conditioning on vt, so αt+1 | vt, Y1:t−1 ∼ N(at+1, Pt+1). Conditioning on vt, Y1:t−1 is the same as conditioning on Y1:t.

Sometimes it is helpful to record

at|t := E[αt | Y1:t] = at + Pt Zt′ Ft^{−1} vt,  Pt|t := Var(αt | Y1:t) = Pt − Pt Zt′ Ft^{−1} Zt Pt.
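
As an illustration, a minimal R implementation of Algorithm 5.5.3 for the scalar local level model (Zt = Tt = 1, Ht = σε, Qt = ση; the function name and the diffuse-ish prior are my choices, not the book's) is:

kf_local_level <- function(y, sig_eps, sig_eta, a1 = 0, P1 = 1e7) {
  n <- length(y)
  a <- P <- v <- Fv <- K <- numeric(n)
  a[1] <- a1; P[1] <- P1
  for (t in 1:n) {
    v[t]  <- y[t] - a[t]                   # prediction error v_t
    Fv[t] <- P[t] + sig_eps^2              # F_t
    K[t]  <- P[t] / Fv[t]                  # Kalman gain (T_t = Z_t = 1)
    if (t < n) {
      a[t + 1] <- a[t] + K[t] * v[t]
      P[t + 1] <- P[t] * (1 - K[t]) + sig_eta^2
    }
  }
  list(a = a, P = P, v = v, F = Fv, K = K)
}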

The corresponding smoothing result is αt | Y1:T ∼ N(at|T, Pt|T). Here we give a very simple and fast algorithm for computing at|T and/or Pt|T. It is due to Durbin and Koopman (2012), where a proof can be found. Often in this literature these kinds of smoothers are called Kalman smoothers, which is a bit odd as Kalman never discussed smoothing.

Algorithm 5.5.4. [Durbin-Koopman state smoother] Assume {Yt, αt+1} ∼ GHMM, run the Kalman filter, storing at, vt, Kt, Ft^{−1}, Pt and computing {Lt} where Lt = Tt − Kt Zt. For t = T, T − 1, ...

1. If at|T is needed, set rT = 0_r, and compute

(a) rt−1 = Zt′ Ft^{−1} vt + Lt′ rt;
(b) at|T = at + Pt rt−1.

2. If Pt|T is also needed, then set NT = 0_{r×r}, and compute backwards

(a) Nt−1 = Zt′ Ft^{−1} Zt + Lt′ Nt Lt;
(b) Pt|T = Pt − Pt Nt−1 Pt.
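
A matching R sketch of this backward recursion, consuming the output of the kf_local_level() function above (again my own minimal interface; in the scalar case Lt = 1 − Kt):

ks_local_level <- function(kf) {
  n <- length(kf$a); r <- 0; N <- 0          # r_T = 0, N_T = 0
  ahat <- V <- numeric(n)
  for (t in n:1) {
    L <- 1 - kf$K[t]
    r <- kf$v[t] / kf$F[t] + L * r           # r_{t-1}
    N <- 1 / kf$F[t] + L^2 * N               # N_{t-1}
    ahat[t] <- kf$a[t] + kf$P[t] * r         # a_{t|T}
    V[t]    <- kf$P[t] - kf$P[t]^2 * N       # P_{t|T}
  }
  list(a_smooth = ahat, P_smooth = V)
}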

To handle non-Gaussian HMM, it is sometimes helpful to draw from the joint smoothing density

α1:T^[1] ∼ α1:T | Y1:T.

This is called simulation smoothing. Early simulation smoothers are due to Carter and Kohn (1994), Frühwirth-
Schnatter (1994) and de Jong and Shephard (1995). However, we will focus on Durbin and Koopman (2002)
who have a pretty solution. At its core, their idea is not a time series result, but it is super helpful here.

Theorem 5.5.5. Suppose (X, Y) are jointly Gaussian and the task is to simulate from X | Y. Assume (X^[1], Y^[1]) ∼ (X, Y), an independent copy; then

E[X | Y] + (X^[1] − E[X^[1] | Y^[1]]) ∼ X | Y.

Proof. Now X | Y ∼ N(E[X | Y], V) but X^[1] | Y^[1] ∼ N(E[X | Y^[1]], V), as the posterior variance matrix is data independent in the Gaussian model. Hence, as required:

E[X | Y] + (X^[1] − E[X | Y^[1]]) ∼ N(E[X | Y], V).

The resulting Gaussian simulation smoother is given below.

Algorithm 5.5.6. [Durbin-Koopman simulation smoother] Simulate (α1:T^{1}, Y1:T^{1}) ∼ GHMM and then calculate

α1:T^[1] = E[α1:T | Y1:T] + {α1:T^{1} − E[α1:T^{1} | Y1:T^{1}]} ∼ α1:T | Y1:T.

To implement: run the state smoother twice: once on the real data and once on the simulated data.

What could be easier than that! Notice this simulation smoother does not use the Pt|T part of the state

smoother, nor is it necessary to twice compute {Kt , Ft , Pt } in the Kalman filter, as these terms are invariant to
the data.

5.5.2 Inference and Gaussian HMM

Think about parametric models so, in the Bayesian case

{Yt, αt+1} | θ ∼ GHMM.

The Kalman filter delivers Yt | Y1:t−1, θ and so the log-likelihood

log f_{Y1:T | θ}(y1:T) = ∑_{t=1}^T log f_{Yt | Y1:t−1, θ}(yt) = c − (1/2) ∑_{t=1}^T log |Ft| − (1/2) ∑_{t=1}^T vt′ Ft^{−1} vt

can be computed, where θ is buried inside {vt, Ft}. If dim(θ) is small or the problem has a particularly simple structure, then the above log-likelihood can be used directly to implement MLE or Bayesian inference, typically by simulating from

θ[1] , ..., θ[B] ∼ θ|Y1:T .

Taken together this approach has been applied massively in empirical work.

For larger dimensional θ there are some advantages in carrying out inference using data augmentation —
either through the EM algorithm or through Bayesian simulation methods.
Think of the Bayesian case where {Yt, αt+1} | θ ∼ GHMM and there is a prior on θ. Then the task is to simulate

θ^[1], ..., θ^[B] ∼ θ | Y1:T,

which we do by simulating from

(α1:T^[1], θ^[1]), ..., (α1:T^[B], θ^[B]) ∼ α1:T, θ | Y1:T

and then discarding the drawn states α1:T^[1], ..., α1:T^[B]. One way of doing this is via a block MCMC (e.g. Gibbs sampler) approach. It is quite simple for many problems:

ˆ simulate from α1:T |Y1:T , θ using the Gaussian simulation smoother

ˆ simulate from θ|α1:T , Y1:T .

This approach was first proposed by Frühwirth-Schnatter (1994).

Example 5.5.7. Think of a univariate local level model:

(Yt, αt+1)′ = [0 1; 0 1] (Yt−1, αt)′ + diag(σε, ση) ωt,  ωt ∼ iid N(0, I2),

with α1 ∼ N(a0, P0). Then carry out inference based on Y1:T | θ. Define

εt = Yt − αt,  ηt = αt+1 − αt,  t = 1, ..., T.

If the priors on α1, σε, ση are independent then

f(σε | α1:T, Y1:T, ση) ∝ f(σε) f(ε1, ..., εT | σε),  f(ση | α1:T, Y1:T, σε) ∝ f(ση) f(η1, ..., ηT | ση).

A simple conjugate approach is to assume σε^{−2} ∼ Ga(aε/2, bε/2) and ση^{−2} ∼ Ga(aη/2, bη/2), where X ∼ Ga(α, β) means f_X(x) ∝ x^{α−1} exp(−xβ), so E[X] = α/β and Var(X) = α/β² (in R this is accessed through the gamma(shape=α, rate=β) family of functions). Then

σε^{−2} | α1:T, Y1:T, ση ∼ Ga((aε + T)/2, (bε + ∑_{t=1}^T εt²)/2),  ση^{−2} | α1:T, Y1:T, σε ∼ Ga((aη + T)/2, (bη + ∑_{t=1}^T ηt²)/2).
In experiments I have found there is some danger in setting bη, bε too high for the data, for in some time series problems either ∑_{t=1}^T ηt² or ∑_{t=1}^T εt² can be tiny. If the priors bη, bε are too small for the data, then this typically does not matter that much as the data will dominate.
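
Putting the two conditionals together, here is a sketch of one possible Gibbs loop in R; sim_states() stands for a routine drawing α1:T | Y1:T, θ (e.g. via the simulation smoother of Algorithm 5.5.6) and is a hypothetical helper, as are the crude starting values:

gibbs_ll <- function(y, B, a_eps, b_eps, a_eta, b_eta) {
  n <- length(y); sig_eps <- sd(y); sig_eta <- sd(y) / 10   # crude starts
  out <- matrix(NA, B, 2, dimnames = list(NULL, c("sig_eps", "sig_eta")))
  for (b in 1:B) {
    alpha <- sim_states(y, sig_eps, sig_eta)    # hypothetical block draw
    eps <- y - alpha; eta <- diff(alpha)        # n and n - 1 residuals here
    sig_eps <- 1 / sqrt(rgamma(1, (a_eps + n) / 2, (b_eps + sum(eps^2)) / 2))
    sig_eta <- 1 / sqrt(rgamma(1, (a_eta + n - 1) / 2, (b_eta + sum(eta^2)) / 2))
    out[b, ] <- c(sig_eps, sig_eta)
  }
  out
}

Note the shape for ση uses n − 1 because this sketch only forms the n − 1 differences of the drawn state path.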

To illustrate this, return to Chapter 1 where we discussed US monthly geometric inflation, recorded monthly from 1947. Throughout we will work with a univariate Gaussian local level model. I expect the measurement noise to be much more substantial than the change in the rate of underlying inflation. Thus I set the independent priors

a0 = 0, P0 = 10², aε = 1, bε = 0.5², aη = 1, bη = 0.01².

I ran the algorithm with B = 2,000 iterations, initialized at σε = 4 and ση = 3, throwing out the first 1/3 of the iterations. Table 5.1 gives the simulation based estimates of the prior and posterior 0.1, 0.5 and 0.9 quantiles for σε and ση.

                  σε                       ση
quantiles   0.1     0.5     0.9      0.1      0.5     0.9
prior       0.31    0.77    3.8      0.0058   0.015   0.067
posterior   0.30    0.31    0.32     0.05     0.06    0.07

Table 5.1: Bayesian inference on local level model.

Notice how much bigger σε is than ση, while both are quite precisely estimated using this data. The left hand side of Figure 5.3 shows the smoothed estimate of the underlying inflation rate, E[µt | Y1:T, θ̂], where θ̂ is fixed at the posterior medians. The right hand side plots the same measure but focuses on the most recent data.

There are some problems with this simple analysis

ˆ The Gaussian assumption is strong; it can yield estimates which are overly sensitive to extreme data. It would be attractive to allow for outliers, e.g. by replacing the Gaussian distribution by a Laplace distribution.

[Figure 5.3 about here: two panels titled "U.S. geometric inflation", y-axis "Annualized monthly inflation"; LHS covers 1960-2020, RHS covers 2017-2024.]

Figure 5.3: Smooth estimate of underlying geometric annualized U.S. inflation, based on monthly data. LHS:
long scale of data (dots are observations, red line is underlying rate of inflation). RHS: more recent data (red
line is underlying rate of inflation).

ˆ There is some evidence that the volatility of inflation has reduced through time, at least until recently. It

would be attractive to allow for SV on the measurement error and the random walk component.

ˆ Throwing away B/3 of the iterations is a bit of a hack; there are more rigorous ways of ensuring the initial conditions of a MCMC sampler do not overly impact the statistical conclusions. The work of Jacob and John O'Leary (2020) largely solves this issue using parallel computation and coupling of Markov chains. This work is quite easy to implement in our context and is particularly inspiring, but is beyond the scope of these notes.

5.5.3 Inference and conditionally Gaussian HMM

The approach to inference when Y1:T | θ ∼ GHMM (sampling from α1:T, θ | Y1:T) extends to a vast class of non-Gaussian HMM, which have enough structure to make the computations still relatively simple.

Suppose there exists a time series β1:T (ignore the parameters θ for now; they can again be dealt with by data augmentation and raise no new issues) where

{Yt, αt+1}_{t=1}^T | {βt}_{t=1}^T ∼ GHMM,

then data augmentation suggests sampling from

α1:T , β1:T |Y1:T ,

using blocks

ˆ α1:T |Y1:T , β1:T ,

ˆ β1:T |Y1:T , α1:T .

Sometimes this structure is called a conditionally Gaussian HMM. It has long roots, e.g. Shephard (1994)
and Kim and Nelson (1999).

Example 5.5.8. [Robust filtering] A problem with GHMM is the Gaussianity of the shocks, which makes the filters super sensitive to unusual datapoints — which for some scientific problems can be a drawback. One approach would be to make the Gaussian state space model

Zt = (Yt, αt+1)′ = ϕt,1 Zt−1 + Bt ωt,  ϕt,1 = [0_{d×d} Zt; 0_{r×d} Tt],  Bt = [Ht 0_{d×r}; 0_{r×d} Qt],

have ωt not N(0, I_{d+r}) but instead

ωt | β1:T ∼ N(0, diag(βt)),

where

βt,j ∼ iid, E[βt,j] = 1,  implying  ωt,j ∼ iid, E[ωt,j] = 0, Var(ωt,j) = 1.

Then {Zt, (αt, βt)} is a HMM with Y1:T | β1:T ∼ GHMM. One parameterization is to use a mixture of two normals by setting (think, e.g., c = 40)

βt,j ∼ iid,  P(βt,j = c/(2c − 1)) = (c − 1)/c,  P(βt,j = c²/(2c − 1)) = 1/c,  c > 1,

so E[βt,j] = 1, implying, unconditionally,

ωt,j ∼ iid, E[ωt,j] = 0, Var(ωt,j) = 1.

The left hand side of Figure 5.4 shows the density function of a standard normal random variable (black line) and the corresponding density for ωt,j (red line) when c = 40. The tail behaviour is clearer when plotting the log-density, which is given in the right hand side plot in Figure 5.4. These pictures show this simple mixture model ups the chance some variables are a long way from zero. Then sampling β1:T | Y1:T, α1:T just becomes the task of sampling from the conditionally independent

P(βt,j | ωt,j) ∝ P(βt,j) βt,j^{−1/2} exp(−ωt,j²/(2βt,j)),  βt,j ∈ {c/(2c − 1), c²/(2c − 1)},

[Figure 5.4 about here: LHS panel "Density function", RHS panel "Log-density function", with x running over −6 to 6.]

Figure 5.4: Comparing a standard normal distribution with a mixture of normals which has a zero mean and unit variance. LHS: density functions; black line is the mixture random variable, the red line is the normal variable. RHS: log-density functions; black line is the mixture random variable, the red line is the normal variable.

which is sampled as a Bernoulli draw. Kim, Shephard, and Chib (1998) extend this to a mixture of many normals to model some very skew data. An alternative is to model βt,j as generalized inverse Gaussian with E[βt,j] = 1, so ωt,j are generalized hyperbolic and βt,j | ωt,j is generalized inverse Gaussian (e.g. Jørgensen (1982)). The left hand side of Figure 5.5 shows the sample path of a local level model in the Gaussian case, with the standard deviations of εt and ηt being 1 and 0.1, respectively, while T = 1,000. The right hand side is dramatically different. It uses the same signal, but now takes c = 40 in the measurement noise. Hence the second order properties of the process have not changed, but some very odd datapoints arise in the process.
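
A tiny R sketch of the Bernoulli draw of βt,j | ωt,j described above (the function name is mine):

draw_beta <- function(omega, c) {
  b  <- c(c / (2 * c - 1), c^2 / (2 * c - 1))     # the two support points
  pr <- c((c - 1) / c, 1 / c)                     # prior probabilities
  w  <- pr * b^(-1/2) * exp(-omega^2 / (2 * b))   # unnormalized posterior
  sample(b, 1, prob = w / sum(w))
}
draw_beta(0.5, 40)    # a typical shock: almost surely the common component
draw_beta(6.0, 40)    # an outlying shock favours the inflated component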

Example 5.5.9. [SV noise] Instead of the βt,j being i.i.d., the {βt} could be its own stochastic process with significant memory, delivering a GHMM-SV model. We saw a local level version of this in Example 5.3.4.

5.5.4 Particle filter

Over the last 30 years another Monte Carlo method has been extensively used to tackle HMMs. They are pretty effective as long as the state vector is only moderately long. These methods are called particle filters or sequential Monte Carlo. Attractive recent reviews include Chopin and Papaspiliopoulos (2020), Dai, Heng, Jacob, and Whiteley (2022) and Fearnhead and Kunsch (2018). The latter is a good place to start. Here we will give a simple introduction.

[Figure 5.5 about here: LHS panel "Gaussian tailed local level model", RHS panel "Heavier tailed local level model", 1,000 time points in each.]

Figure 5.5: LHS: conventional Gaussian local level model. RHS: same signal, but now the measurement error is a mixture of normals with c = 40.

Imagine we have a sample of size B directly from the filtering distribution αt | Y1:t−1

αt^[1], ..., αt^[B].

In this literature this sample is often called a swarm of particles. Then we can form the empirical distribution function of it as

F̂(αt | Y1:t−1) = (1/B) ∑_{b=1}^B 1(αt^[b] ≤ αt).
Now f(Yt | Y1:t−1) = ∫ f(Yt | αt) dF(αt | Y1:t−1). Replacing F(αt | Y1:t−1) by F̂(αt | Y1:t−1), the integral can be solved, yielding

f̂(Yt | Y1:t−1) = (1/B) ∑_{b=1}^B f(Yt | αt^[b]),

which is an unbiased estimator of f(yt | y1:t−1). As B gets large this sum should become very accurate.

This is nice, but what about the next time step? If we can simulate from αt+1 | Y1:t, then we can repeat the above, stepping through time via simulation.

How can we simulate from αt+1 | Y1:t? We could do data augmentation, using MCMC to simulate from

α1:t+1 | Y1:t

and throwing α1:t away. But the cost of this will grow with t and so will be expensive. Here we follow a different approach.

Now

f(αt+1 | Y1:t) = ∫ f(αt+1, αt | Y1:t) dαt = ∫ f(αt+1 | αt) f(αt | Y1:t) dαt = ∫ f(αt+1 | αt) {f(Yt | αt) / f(Yt | Y1:t−1)} f(αt | Y1:t−1) dαt
∝ ∫ f(αt+1 | αt) f(Yt | αt) dF(αt | Y1:t−1).

Again using F̂(αt | Y1:t−1) produces

f̂(αt+1 | Y1:t) ∝ ∑_{b=1}^B f(Yt | αt^[b]) f(αt+1 | αt^[b]).

Then the task is to simulate from this approximation! This looks tricky computationally: B is large, so evaluating f̂(αt+1 | Y1:t) is awful.

But think about this as a data augmentation problem: build a fake joint density

f̂(b, αt+1 | Y1:t) ∝ f(Yt | αt^[b]) f(αt+1 | αt^[b]);

then this marginalizes to f̂(αt+1 | Y1:t). Hence we can sample from this, which is potentially very cheap, and then throw away the sampled b variable.

One simple way to sample from f̂(b, αt+1 | Y1:t) is the bootstrap particle filter. This systematically samples b and then samples from αt+1 | αt^[b], weighting the sample using f(Yt | αt^[b]). When you first read the bootstrap particle filter, take k = 1, so C = B. Taking k > 1 improves the sampler, but it is not conceptually vital.

Algorithm 5.5.10. Bootstrap particle filter. Start with αt^[1], ..., αt^[B] from αt | Y1:t−1.
(1) Sample from

αt+1^[c] ∼ αt+1 | αt^[c′],  c′ = ((c − 1) mod B) + 1,  c = 1, 2, ..., B, ..., C,  C = kB, k ≥ 1 integer,

and compute wt^[c] = f(yt | αt^[c′]) ≥ 0. This yields the C pairs

(αt+1^[1], wt^[1]), ..., (αt+1^[C], wt^[C]),

and the weighted empirical distribution function

F̃(αt+1 | Y1:t) = ∑_{c=1}^C w̃t^[c] 1(αt+1^[c] ≤ αt+1),  w̃t^[c] = wt^[c] / ∑_{d=1}^C wt^[d].

(2) Draw B times from F̃(αt+1 | Y1:t) to yield

αt+1^[1], ..., αt+1^[B],

which is approximately from αt+1 | Y1:t.



As B and C go to infinity, F̃(αt+1 | Y1:t) converges to F(αt+1 | Y1:t).
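
A compact R version of the bootstrap particle filter (k = 1) for the scalar local level model; the interface and the crude diffuse-ish initialization are my choices:

bpf_local_level <- function(y, sig_eps, sig_eta, B = 10000) {
  n <- length(y)
  alpha <- rnorm(B, 0, 10)                  # crude draws for alpha_1
  fmean <- ll <- numeric(n)
  for (t in 1:n) {
    w <- dnorm(y[t], alpha, sig_eps)        # weights f(y_t | alpha_t^[b])
    ll[t] <- log(mean(w))                   # log f-hat(y_t | y_{1:t-1})
    fmean[t] <- sum(w * alpha) / sum(w)     # E-hat[alpha_t | y_{1:t}]
    idx <- sample.int(B, B, replace = TRUE, prob = w)  # resample step
    alpha <- alpha[idx] + rnorm(B, 0, sig_eta)         # propagate to t + 1
  }
  list(filtered_mean = fmean, loglik = sum(ll))
}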


Early work on the modern particle filter includes Gordon, Salmond, and Smith (1993) and Kong, Liu, and
Wong (1994), while a great deal of the associated deep probability theory was developed by Pierre Del Moral

in the 1990s. Influential work includes Liu and Chen (1995) and Liu and Chen (1998). To my knowledge the
first use of particle filters in economics appeared in Kim, Shephard, and Chib (1998) and Pitt and Shephard
(1999). Herbst and Schorfheide (2015) gives a textbook account of using these types of methods in some forms
of macroeconomics.

An important property of the above particle filter is that

∏_{t=1}^T f̂(Yt | Y1:t−1)

is an unbiased estimator of f(Y1:T), the likelihood function (but not of the log-likelihood). It turns out this means that this estimated likelihood can be used inside an MCMC algorithm to make inference on underlying parameters θ — the estimation error does not spoil the output from the MCMC. See Andrieu, Doucet, and Holenstein (2010) and Flury and Shephard (2011).

5.6 Recap

Table 5.2 contains the major topics covered in this Chapter.

Formula or idea Description or name


Yt − µ̂t Detrending
E[αt |Y1:t ] Filter
GHMM Gaussian hidden Markov model
HMM hidden Markov model
Computation GHMM Kalman filter
Yt = µt + ϵt , µt+1 = µt + ηt Local level model
Simulate joint distn MCMC
Monte Carlo filtering Particle filter
Removing seasonal effect Seasonal Adjustment
γt Seasonality
E[αt |Y1:T ] Smoother
∆2 µt+2 = ξt Smooth trend
µt Trend
Trigonometric seasonality
Yt − γ̂ t Seasonal adjustment
Same as particle filter Sequential Monte Carlo

Table 5.2: Main ideas and notation in Chapter 5.


Chapter 6

Linearity

6.1 Frequency domain time series

Spectral analysis is a classic, massive area of time series. Most often this area is called frequency domain time series, due to its focus on the impact of frequency, while most other areas are called time domain time series due to their focus on temporal memory. Frequency domain time series relies entirely on the covariance structure of the {Yt} process, so belongs in the category of linear time series methods. All the core methods will be transformations of statistics which are linear in the data.

An exhaustive treatment of frequency domain methods is given by Priestley (1981) and Percival and Walden
(1993). The notes by Subba Rao (2022) are also very good on this topic. It is also closely connected to wavelet

analysis, e.g. Percival and Walden (2000), but we will not discuss it here.

Spectral analysis is closely connected to Fourier analysis, so depending on your background some of the material will appear trivial or somewhat abstract or not how you would express it. However, various aspects of spectral analysis appear throughout modern time series and so knowing some of this material is important.

h 6.1.1. So far nearly all our time series thinking has been focused on prediction — the prediction decompo-
sition, Kalman filtering, martingales, autoregressions. In the frequency domain we junk that line of thinking!

Just get it out of your head. It takes some effort.

The following two examples give a glimpse.

Example 6.1.2. Think about recording electronically a voice (or music) at a very high level of resolution,
where the recording is made every microsecond, and think of that recording as a long time series! A classic
problem is:

I want to send that recording wirelessly or using the internet!


The person receiving it just wants to listen to the voice and does not need perfect resolution. There is massive
advantage in making the file much smaller before it is sent. How do you do that? Fourier theory says that


any continuous function can be represented by an infinite sum of weighted sine and cosine functions. So
one approach to compressing the voice is to think of the voice as a continuous function and approximate the
corresponding time series by a finite sum of weighted sine and cosine functions and send the corresponding large

weights in the sum not the original time series! The collection of large weights may be tiny compared to the
original series. Then at the other end a computer can take the weights and reconstruct an approximation to the
original time series — and play that as our approximation to the original recording. Approximating the time
series of the voice has nothing to do with prediction — it is more naturally phrased in the frequency domain.

Example 6.1.3. A macroeconomist might be interested in a similar idea to the voice one. If they are interested

in very long-run effects, to do with population growth and technological progress, then they may find it helpful
to simplify the time series of annual GDPs, to extract the long-run trend. Statistically this is the same problem
as Example 6.1.2. A different macroeconomist might be interested in recessions and booms, that is short-run
fluctuations. So they might be interested in the opposite of the voice problem — studying what is left over

having taken out the long-run trend.

6.1.1 A regression model

To get started think about a trigonometric regression model

Yt = αj √(2/T) cos((2πj/T) t) + γj √(2/T) sin((2πj/T) t) + εt,  t = 1, ..., T,  j ∈ {0, 1, 2, ..., T}, (6.1)

where {εt} is weak white noise, so Y1:T has a simple structure, dominated by the choice of

λj = 2πj/T ∈ [0, 2π],

a single "frequency", and how (αj, γj) magnifies the sine and cosine functions — the nomenclature "frequency domain" derives from this kind of frequency! The scaling by √(1/T) is selected to help the math be compact later.

Remark 18. For those readers familiar with the weak instrument literature in econometrics (see Andrews,
Stock, and Sun (2019)), writing down regressors which are scaled by the square root of the sample size may well
feel familiar. It leads to the same effects seen here: the coefficient βj cannot be consistently estimated, but a

great deal can be productively learnt about it.

The cosine and sine functions are periodic, with period 2π, so λj ∈ [0, 2π] is without loss. If λj is small,
then cos(λj t) cycles in t very slowly, inducing massive memory in {Yt }. If λj is close to π the impact of cos(λj t)
is close to being instantaneous.

If we have data y1:T then think about the least squares estimate of (αj, γj):

(α̂j,OLS, γ̂j,OLS)′ = [ (2/T)∑_{t=1}^T cos²(λj t), (2/T)∑_{t=1}^T cos(λj t) sin(λj t) ; (2/T)∑_{t=1}^T cos(λj t) sin(λj t), (2/T)∑_{t=1}^T sin²(λj t) ]^{−1} (√(2/T)∑_{t=1}^T cos(λj t) yt, √(2/T)∑_{t=1}^T sin(λj t) yt)′.

This is easier than it looks!


Math detour: Recall the trigonometric identity

cos(x + y) = cos(x) cos(y) − sin(x) sin(y);

set x = y, then

cos(2x) = cos(x)² − sin(x)²
= cos(x)² − {1 − cos(x)²}, as 1 = cos(x)² + sin(x)²
= 2cos(x)² − 1.

Rearranging: cos(x)² = (1 + cos(2x))/2. Likewise, recall

sin(x + y) = sin(x) cos(y) + cos(x) sin(y),

so if x = y, then sin(2x) = 2 sin(x) cos(x).

Now take cos(x)² = (1 + cos(2x))/2 out to play. It implies that

∑_{t=1}^T cos²(j 2πt/T) = (1/2) ∑_{t=1}^T (1 + cos(2j 2πt/T)) = T/2, (2π-periodicity of cosine and integer j)

and, again using 1 = cos(x)² + sin(x)², the

∑_{t=1}^T cos²(j 2πt/T) + ∑_{t=1}^T sin²(j 2πt/T) = T.

So

(2/T) ∑_{t=1}^T cos²(j 2πt/T) = (2/T) ∑_{t=1}^T sin²(j 2πt/T) = 1, (6.2)

while, using sin(x) cos(x) = sin(2x)/2, the

(1/T) ∑_{t=1}^T cos(j 2πt/T) sin(j 2πt/T) = (1/(2T)) ∑_{t=1}^T sin(2j 2πt/T) (6.3)
= 0, (2π-periodicity of sine and integer j). (6.4)

Taken together, (6.2) and (6.3) mean that the regressors are orthonormal: orthogonal, plus the sum of squares of each regressor equals one. That is

[ (2/T)∑_{t=1}^T cos²(j 2πt/T), (2/T)∑_{t=1}^T cos(j 2πt/T) sin(j 2πt/T) ; (2/T)∑_{t=1}^T cos(j 2πt/T) sin(j 2πt/T), (2/T)∑_{t=1}^T sin²(j 2πt/T) ] = I2.

How pretty is that! It is the sign of something really computationally fast.

Now apply the lessons from this detour: it yields the beautiful and practical result that

α̂j,OLS = √(2/T) ∑_{t=1}^T cos(λj t) yt = √(2/T) ∑_{t=1}^T cos(λj t)(yt − ȳ), (2π-periodicity of cosine)

and

γ̂j,OLS = √(2/T) ∑_{t=1}^T sin(λj t) yt = √(2/T) ∑_{t=1}^T sin(λj t)(yt − ȳ),

while the regression model's, that is equation (6.1)'s, fitted version is

ŷt := ȳ + α̂j,OLS √(2/T) cos(λj t) + γ̂j,OLS √(2/T) sin(λj t),  t = 1, ..., T,
so ∑_{t=1}^T (ŷt − ȳ) = 0 and

∑_{t=1}^T (ŷt − ȳ)² = α̂²j,OLS {(2/T)∑_{t=1}^T cos²(λj t)} + γ̂²j,OLS {(2/T)∑_{t=1}^T sin²(λj t)} = α̂²j,OLS + γ̂²j,OLS.

This means that

∑_{t=1}^T (yt − ȳ)² = ∑_{t=1}^T (yt − ŷt)² + ∑_{t=1}^T (ŷt − ȳ)² = ∑_{t=1}^T (yt − ŷt)² + α̂²j,OLS + γ̂²j,OLS,

indicating α̂²j,OLS and γ̂²j,OLS measure the contribution the frequency λj makes to the variation in y1:T.

This is the simple version; of course there are many frequencies. But that is the main point.
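
A quick R check (simulated data, one frequency, my own illustrative setup) that OLS with these orthonormal regressors reproduces the closed-form cosine and sine sums:

set.seed(1)
n <- 200; j <- 5; t <- 1:n; lam <- 2 * pi * j / n
xc <- sqrt(2 / n) * cos(lam * t); xs <- sqrt(2 / n) * sin(lam * t)
y <- 2 * xc + rnorm(n)                     # one trigonometric component plus noise
coef(lm(y ~ xc + xs))[2:3]                 # OLS estimates of (alpha_j, gamma_j)
c(sum(xc * y), sum(xs * y))                # the closed-form expressions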

Remark 19. You might ask why we care about sequences {Xt} of the generic form

Xt = α cos(λt) + γ sin(λt)?

Take a leap in the dark and assume {α, γ} are zero mean, uncorrelated random variables with Var(α) = Var(γ) = σλ². This is Example 3.3.4. Then

E[Xt] = 0,  Var(Xt) = σλ²,  Cov(Xt, Xt−s) = σλ² cos(λs).

Thus {Xt} is covariance stationary. If we were to add together p such components, each uncorrelated from one another, each with a different variance σj² and different frequency λj, then

Cov(Xt, Xt−s) = ∑_{j=1}^p σj² cos(λj s),  σj² ≥ 0,  λj ∈ [0, 2π],

which is a very flexible way of modeling the second order properties of a covariance stationary process. It will turn out that it will be sufficiently flexible, when set up the right way, to represent the second order properties of all covariance stationary processes.

6.1.2 Some background

To work compactly in the frequency domain we often use complex numbers and random extensions. Here we

remind ourselves of the former and introduce some aspects of the latter. I will do this in four stages:

ˆ complex numbers and Euler’s fomula;

ˆ sums of complex exponentials;

ˆ complex random numbers;

ˆ complex Brownian motion;

ˆ complex orthogonal processes.

Preparation: stage one

Recall Euler's formula, that for any real a

e^{ia} = cos(a) + i sin(a),

where the imaginary number is denoted i = √−1. Corresponding to this are

cos(a) = (1/2)(e^{ia} + e^{−ia}),  sin(a) = (i/2)(e^{−ia} − e^{ia}).

An important idea in complex numbers is the complex conjugate. Think of

x = a + ib,

where the pair a, b are real. Then the complex conjugate of x is

x∗ = Conj(x) = a − ib.

Likewise

(e^{ia})∗ = e^{−ia}.

The notation Re(x) means the real part of a complex x, so Re(x) = a = Re(x∗). Likewise Im(x) = b = −Im(x∗). Also crucial for us is the complex version of squaring. It is denoted

|x|² := x x∗ = (a + ib)(a − ib) = a² + b² = Re(x)² + Im(x)².

There is some heavy investment working in the frequency domain, but some of the results are beautiful. You
can glimpse the elegance of complex numbers through this example.

Example 6.1.4. The elegance of complex variables can be seen immediately by noting

|e^{ia}|² = e^{ia} e^{−ia} = 1
= {cos(a) + i sin(a)}{cos(a) − i sin(a)}
= cos(a)² + sin(a)².

2π-periodicity plays a big role in the frequency domain.

Example 6.1.5. For any integer t, the

e^{i2πt} = cos(2πt) + i sin(2πt) = 1.

An implication of this is that

e^{−iλt} = e^{−iλt} e^{i2πt} = e^{i(2π−λ)t}.

The term e^{−iλt} appears all over the frequency domain manipulations.

Preparation: stage two

Our second step of preparation is to define a complex function of time and frequency:

x(t, λ) = √(1/T) e^{−iλt},

and think of T different specific frequencies in the interval (0, 2π]:

λj = 2πj/T,  j = 1, ..., T.

Then the

∑_{t=1}^T x(t, λk)∗ x(t, λj) = (1/T) ∑_{t=1}^T e^{i(λk−λj)t} = (1/T) ∑_{t=1}^T e^{2i(k−j)πt/T} = { 1 if λk = λj; 0 if λk ≠ λj },

due to 2π periodicity, routing back to Example 6.1.5.

Preparation: stage three

We will need the concept of a complex random variable.



Definition 6.1.6. [Complex random variables] Let the complex random variable X be

X = A + iB,

where A, B are real random variables. The complex conjugate of X is X∗ = A − iB.

Definition 6.1.7. [Mean and variance of complex random variables] Let the pair of complex random variables (X, Y) be defined as

X = A + iB,  Y = C + iD,

where A, B, C, D are real random variables. Assume A, B possess a mean; then

E[X] = E[A] + iE[B].

If A, B, C, D each possess a variance, then

Cov(X, Y) = E[(X − E[X])(Y − E[Y])∗] = E[XY∗] − E[X]E[Y]∗,

and Var(X) = E[|X|²] − |E[X]|². Notice that Var(X) ≥ 0 is real, but Cov(X, Y) can be complex.

Example 6.1.8. Define the continuous time random process {Q(t)}t≥0 by

Q(t) = (1/√T){e^{−iλt} β + e^{iλt} β∗},  β = (1/√2)(A + iB), A ⊥⊥ B, A, B ∼ N(0, σλ²), t ∈ [0, T]
= (2/√T) Re(e^{−iλt} β), as e^{iλt} β∗ is the complex conjugate of e^{−iλt} β
= (1/√(2T)){e^{−iλt}(A + iB) + e^{iλt}(A − iB)} = (1/√(2T)){(e^{iλt} + e^{−iλt})A + i(e^{−iλt} − e^{iλt})B}
= √(2/T) {cos(λt)A + sin(λt)B}.

Now E[β] = 0 and Var(β) = σλ², so E[Q(t)] = 0 and

Cov[Q(t), Q(t + s)] = (1/T){e^{−iλt} e^{iλ(t+s)} Var(β) + e^{iλt} e^{−iλ(t+s)} Var(β)} = (σλ²/T)(e^{iλs} + e^{−iλs}) = (2σλ²/T) cos(λs).

We have seen this result before: in Example 3.3.4 and Remark 19.

Example 6.1.8 is not new. But the derivation is easier now. Why? The use of complex random variables
allows us to use eiλt rather than cosine and sine functions in the calculations. The eiλt terms are much simpler
to manipulate.

Preparation: stage four

A couple of times we will refer to a complex continuous time stochastic process {Z(t)}t≥0. A simple version is complex Brownian motion.

Definition 6.1.9. [Complex Brownian motion] Let the process {Z(t)}t≥0 be defined as

Z(t) := (1/√2){B(t) + iW(t)},  t ≥ 0,

where {B(t)}t≥0 ⊥⊥ {W(t)}t≥0 are independent standard Brownian motions. Then {Z(t)}t≥0 is called complex standard Brownian motion.



Complex standard Brownian motion has some elegant properties. The Z(t)∗ = (1/√2){B(t) − iW(t)}, while E[Z(t)] = 0, and

|Z(t)|² = (1/2){B(t) + iW(t)}{B(t) − iW(t)} = (1/2)B(t)² + (1/2)W(t)².

Thus

Var(Z(t)) = E[|Z(t)|²] = t,

and {Z(t) − Z(s)} ⊥⊥ {Z(b) − Z(a)} for all t > s ≥ b > a ≥ 0.
Sometimes we will see stochastic integrals, where the integrator is complex standard Brownian motion:

Q(t) = ∫_0^t h(s) dZ(s) = Q^[1](t) + iQ^[2](t),  Q^[1](t) = (1/√2) ∫_0^t h(s) dB(s),  Q^[2](t) = (1/√2) ∫_0^t h(s) dW(s),

where {h(t)} is a deterministic function of time with the assumption that ∫_0^t h(s)² ds < ∞. The Q^[1](t), Q^[2](t) are relatively simple Ito integrals — as {h(t)} is non-stochastic. We formally deal with these processes in Section ??.

At a basic level think about {Q^[1](t)}_{t∈[0,T]} (and the corresponding {Q^[2](t)}) as the pointwise (that is, for each individual t) probability limit of the sum

Q_B^[1](t) = (1/√2) ∑_{j=1}^B 1(j/B ≤ t/T) h((j−1)T/B) {B(jT/B) − B((j−1)T/B)},  t ∈ [0, T],

as B → ∞. As Q^[1](t) is the sum of weighted independent Gaussians (as {h(t)} is non-stochastic), it has Gaussian increments

Q^[1](t) − Q^[1](s) ∼ N(0, (1/2) ∫_s^t h(u)² du),  t > s ≥ 0,

which are independent (but not necessarily stationary): Q^[1](t) − Q^[1](s) ⊥⊥ Q^[1](b) − Q^[1](a) for all t > s ≥ b > a ≥ 0. The same holds for {Q^[2](t)}. Sometimes you will see

dQ(t) = h(t) dZ(t) = (1/√2) h(t) dB(t) + i(1/√2) h(t) dW(t),

which means that

E[dQ(t)] = 0, and Var(dQ(t)) = E[|dQ(t)|²] = h(t)² dt.

Preparation: stage five

At one point, we will need a continuous time complex orthogonal process — which drops the Gaussian assumption, noting {Q(t)}t≥0 is an orthogonal process. You saw an orthogonal process in Definition 4.4.12, but now it has to be complex.

Definition 6.1.10. [Complex orthogonal process] Let the process {Z(t)}t∈[0,T] be defined as

Z(t) := (1/√2){A(t) + iD(t)},  t ≥ 0,

where {A(t)}t∈[0,T] ⊥ {D(t)}t∈[0,T] are uncorrelated, zero mean orthogonal processes. Then {Z(t)}t∈[0,T] is a complex orthogonal process on [0, T].

Then

Var(Z(t)) = (1/2){Var(A(t)) + Var(D(t))}.

Definition 6.1.11. [Circular orthogonal process] Suppose {Z(λ)}λ∈[0,π] is a complex orthogonal process Z(λ) = (1/√2){A(λ) + iD(λ)}, where A(0) = D(0) = 0, and extend time to [0, 2π] by defining

Z(2π − λ) = Conj(Z(λ)) = (1/√2){A(λ) − iD(λ)},  λ ∈ (π, 2π].

Call {Z(λ)}λ∈[0,2π] a circular orthogonal process.

Circular orthogonal processes are useful in the frequency domain for frequencies living on [0, 2π].

Remark 20. The circular orthogonal process {Z(λ)}λ∈[0,2π] has uncorrelated increments over {Z(λ)}λ∈[0,π], but not in general:

Cov(Z(λ), Z(2π − λ) − Z(λ)) = Cov(Z(λ), Conj(Z(λ))) − Var(Z(λ))
= (1/2){E[A(λ)²] − E[D(λ)²]} − (1/2){E[A(λ)²] + E[D(λ)²]} = −E[D(λ)²].

Assume {Z(λ)}λ∈[0,2π] is a circular orthogonal process and λ ∈ [0, π]; then

e^{−iλt} Z(λ) + e^{iλt} Z(2π − λ) = e^{−iλt} Z(λ) + e^{iλt} Conj(Z(λ))
= e^{−iλt} Z(λ) + Conj(e^{−iλt} Z(λ))
= 2 Re(e^{−iλt} Z(λ))
= √2 {cos(λt)A(λ) + sin(λt)D(λ)}, using Example 6.1.8.

6.1.3 Writing the regression model using complex variables

Now let us use what we have learnt. This will put us on the launch pad for what we want.

For any frequency λ, the

α cos(λt) + γ sin(λt) = (α/2)(e^{iλt} + e^{−iλt}) + (γi/2)(e^{−iλt} − e^{iλt}) = e^{−iλt}(α + iγ)/2 + e^{iλt}(α − iγ)/2
= (e^{−iλt} β + e^{iλt} β∗)/√2 (6.5)
= √2 Re(e^{−iλt} β), (6.6)

where ∗ denotes a complex conjugate and

β = (α + iγ)/√2,  β∗ = (α − iγ)/√2,  so  α = √2 Re(β) = (1/√2)(β + β∗),  iγ = (1/√2)(β − β∗),  γ = √2 Im(β).

Using (6.5) and thinking about frequency λj, the trigonometric seasonal model is

Yt = αj √(2/T) cos(λj t) + γj √(2/T) sin(λj t) + εt = √(1/T){e^{−iλj t} β_{λj} + e^{iλj t} β∗_{λj}} + εt
= √(1/T){e^{−iλj t} β_{λj} + Conj(e^{−iλj t} β_{λj})} + εt (6.7)
= ∑_{λ∈{λj, 2π−λj}} x(t, λ) β_λ + εt,  where x(t, λ) := √(1/T) e^{−iλt},  β_{2π−λ} := Conj(β_λ), (6.8)

going back to Example 6.1.5, so x(t, 2π − λj) = x(t, λj)∗.

The complex estimate

β̂_{λj} = (α̂j,OLS + iγ̂j,OLS)/√2 = √(1/T) ∑_{t=1}^T {cos(λj t) + i sin(λj t)} yt = √(1/T) ∑_{t=1}^T e^{iλj t} yt.

Looking at the expression for β̂_{λj}, and going back to Example 6.1.5, the

x(t, 2π − λj) = x(t, λj)∗, and

β̂∗_{λj} = √(1/T) ∑_{t=1}^T e^{−iλj t} yt = √(1/T) ∑_{t=1}^T e^{i(2π−λj)t} yt = β̂_{2π−λj} = β̂_{−λj},

so the fitted value can be written two ways:

ŷt = ȳ + √(1/T) e^{−iλj t} β̂_{λj} + √(1/T) e^{iλj t} β̂_{−λj} = ȳ + x(t, λj) β̂_{λj} + x(t, −λj) β̂_{−λj}
= ȳ + ∑_{λ∈{λj, 2π−λj}} e^{−iλt} √(1/T) β̂_λ = ȳ + ∑_{λ∈{λj, 2π−λj}} x(t, λ) β̂_λ,

noting ŷt is real, even though each of the terms in the sum is complex, due to the complex conjugate structure. Notice that |β̂_{λj}|² = (α̂²j,OLS + γ̂²j,OLS)/2.

6.2 Fourier transform of data


6.2.1 Core ideas

The material covered in this Section will be the following core ideas.

ˆ – The Fourier transform of the data y1:T is the (complex) functional statistic

JT(λ) = √(1/(2πT)) ∑_{t=1}^T yt e^{iλt},  frequency λ, recall e^{iλt} = cos(λt) + i sin(λt) (6.9)
= {√(1/(2πT)) ∑_{t=1}^T yt cos(λt)} + i{√(1/(2πT)) ∑_{t=1}^T yt sin(λt)} (6.10)
= Re{JT(λ)} + i Im{JT(λ)}. (6.11)

This is typically calculated with λ ∈ [0, 2π]. Notice

Re{JT(λ)} = Re{JT(2π − λ)},  Im{JT(λ)} = −Im{JT(2π − λ)},  λ ∈ [0, 2π]. (6.12)

Further, JT(λ)∗ = Re{JT(λ)} − i Im{JT(λ)} and so

JT(λ) = JT(2π − λ)∗.

This implies all the statistical information embedded in {JT(λ)}λ∈[0,2π] appears in {JT(λ)}λ∈[0,π] — although using the full range of {JT(λ)}λ∈[0,2π] often yields nicer formulae.

– The (complex) discrete Fourier transform (DFT) evaluates JT(λ) at specific frequencies

JT(λj),  λj = 2πj/T,  j = 1, ..., T.

Inverting the Fourier transform yields the real

yt = √(2π/T) ∑_{j=1}^T e^{−iλj t} JT(λj),  λj = 2πj/T,  t = 1, ..., T. (6.13)

– The periodogram

IT(λ) = |JT(λ)|² = [Re{JT(λ)}]² + [Im{JT(λ)}]²,  recall |a + bi|² = a² + b²,

is a real functional statistic of the data y1:T. Due to (6.12), the periodogram cycles with period 2π, that is IT(λ) = IT(2π − λ), λ ∈ R. As a result of the periodicity, researchers typically only plot IT(λ) against λ ∈ [0, π]. Then

IT(λ) = (1/2π)[γ̂0 + 2 ∑_{s=1}^T ((T − s)/T) cos(sλ) γ̂s],  γ̂s = (1/(T − |s|)) ∑_{t=1}^{T−|s|} yt yt+|s|,
= (1/2π) ∑_{s=−T}^T e^{iλs} γ̃s,  γ̃s = (1/T) ∑_{t=1}^{T−|s|} yt yt+|s|,

mapping from the time domain statistics {γ̂s} to the frequency domain objects, here the periodogram. The IT(λj) is called the j-th periodogram ordinate.

6.2.2 Setup

Now we will go through this somewhat more slowly!


To start take a large step back.
Remember in Introductory Statistics sufficient statistics are a simple principled example of data compression.
We start our discussion with the reverse of this question: find a good approximation to a continuous time process

using our data y1:T .


To quantify this, ask: can we find an interpolated continuous time process

{YT(t)}t∈[0,T],

so that at the T discrete times

YT(t) = yt,  t = 1, ..., T,

where {yt}_{t=1}^T is some time series data? Obviously there are infinite ways of carrying out this interpolation. Here the interpolation will be carried out using Fourier methods.

6.2.3 Main event for the data

Now let us return to the raw Fourier representation.

Definition 6.2.1. The Fourier transform of the data y1:T is defined as

JT(λ) = √(1/(2πT)) ∑_{t=1}^T e^{iλt} yt,  λ ∈ [0, 2π],

noting the discrete Fourier transform (DFT)

JT(λj) = √(1/(2πT)) ∑_{t=1}^T e^{iλj t} yt = √(1/(2π)) β̂j;  λj = 2πj/T,  j = 1, ..., T. (6.14)

R code to compute the DFT at frequency lambda, given the data in a vector y, is:

t <- seq_along(y); iC <- 1i
sqrt(1 / (2 * pi * length(y))) * sum(exp(lambda * t * iC) * y)

It is not that hard to use modern scripting languages to do simple manipulations of complex objects.
Inverting (6.14) back to the data:

√(2π/T) ∑_{j=1}^T e^{−iλj t} JT(λj) = (1/T) ∑_{j=1}^T e^{−iλj t} ∑_{s=1}^T e^{iλj s} ys = (1/T) ∑_{s=1}^T ∑_{j=1}^T e^{2ijπ(s−t)/T} ys = yt,  t = 1, ..., T.

It is worth writing this out again, but the other way around:

yt = √(2π/T) ∑_{j=1}^T e^{−iλj t} JT(λj) (6.15)
= ∑_{j=1}^T x(t, λj) β̂j,  β̂j := √(2π) JT(λj) = √(1/T) ∑_{t=1}^T e^{iλj t} yt,  x(t, λ) = √(1/T) e^{−iλt}.

This is an exact OLS fit of a regression model with T orthonormal (complex) regressors, with T regression
coefficients (proportional to the Fourier transform of the data) and no error! The above delivers expressions
(6.9) and (6.13) above. This completes the data quantities statement.

The above implies the imputed continuous time process is

Ŷ(t) = ∑_{j=1}^T x(t, λj) β̂j,  t ≥ 0, (6.16)

which precisely goes through the data y1:T. This function has a period of T.

Remark 21. For any frequency of the form λj = 2πj/T, for j = 1, ..., T, the

∑_{t=1}^T exp(i 2πj t/T) = ∑_{t=1}^T cos(2πj t/T) + i ∑_{t=1}^T sin(2πj t/T) = { 0 if j ≠ 0; T if j = 0 },

so, for example,

JT(λj) = √(1/(2πT)) ∑_{t=1}^T (yt − ȳ) e^{iλj t}.

Hence all terms we discuss above and below can equally be stated for centered data. Although this is a vital point from an applied perspective, we ignore it here to reduce clutter.

Now focus on the square of the discrete Fourier transform:

|JT(λj)|² = (1/2π)|β̂j|² = (1/(2πT))|∑_{t=1}^T e^{iλj t} yt|² = (1/(2πT))|∑_{t=1}^T yt cos(λj t) + i ∑_{t=1}^T yt sin(λj t)|²
= (1/(2πT)){(∑_{t=1}^T yt cos(λj t))² + (∑_{t=1}^T yt sin(λj t))²},  as |a + bi|² = a² + b².

This squared term is called the j-th periodogram ordinate of the data y1:T . It is one of the most famous objects
in time series — it parallels the sample autocovariance function in the time domain.

Definition 6.2.2. The periodogram of {Yt} is, for λ ∈ [0, 2π],

IT(λ) = |JT(λ)|² = (1/(2πT)){(∑_{t=1}^T yt cos(λt))² + (∑_{t=1}^T yt sin(λt))²},  recall |a + bi|² = a² + b².

The final result in this Section appeared in Bartlett (1950). It relates the periodogram to the sample
autocovariance function.

Theorem 6.2.3. The periodogram at frequency λ ∈ [0, π], written IT(λ), can be expressed as

2πIT(λ) = γ̂0 + 2 ∑_{s=1}^T ((T − s)/T) cos(sλ) γ̂s,  where γ̂s = (1/(T − |s|)) ∑_{t=1}^{T−|s|} yt yt+|s|,  s = −T, −(T − 1), ..., T,
= γ̃0 + 2 ∑_{s=1}^T cos(sλ) γ̃s,  where γ̃s = (1/T) ∑_{t=1}^{T−|s|} yt yt+|s|,
= ∑_{s=−T}^T e^{isλ} γ̃s.

Obviously IT(λ) ≥ 0, so, in particular, 2πIT(0) = ∑_{s=−T}^T γ̃s ≥ 0.

Proof. Now

2π × IT(λ) = (1/T)|∑_{t=1}^T e^{iλt} yt|² = (1/T) ∑_{s=1}^T ∑_{t=1}^T e^{iλ(t−s)} yt ys
= (1/T) ∑_{t=1}^T yt²
+ (1/T) ∑_{t=1}^{T−1} e^{iλ} yt yt+1 + (1/T) ∑_{t=2}^T e^{−iλ} yt yt−1
+ (1/T) ∑_{t=1}^{T−2} e^{2iλ} yt yt+2 + (1/T) ∑_{t=3}^T e^{−2iλ} yt yt−2
+ ... + (1/T) e^{(T−1)iλ} y1 yT + (1/T) e^{−(T−1)iλ} yT y1
= γ̃0 + (e^{iλ} + e^{−iλ}) γ̃1 + ... + (e^{(T−1)iλ} + e^{−(T−1)iλ}) γ̃T−1
= ∑_{s=−T}^T e^{isλ} γ̃s.

In turn

∑_{s=−T}^T e^{isλ} γ̃s = γ̃0 + 2 ∑_{s=1}^T cos(sλ) γ̃s = γ̂0 + 2 ∑_{s=1}^T ((T − s)/T) cos(sλ) γ̂s.

This result allows us to go the other way: from the periodogram to the sample autocovariance function.

Lemma 5. Using the periodogram IT(λ), the

γ̃s = (2π/T) ∑_{j=1}^T e^{−isλj} IT(λj).

Proof. Start with IT(λ) = (1/2π) ∑_{s=−T}^T e^{isλ} γ̃s. Then, taking the discrete Fourier inverse gives the result. Writing this out:

(2π/T) ∑_{j=1}^T IT(λj) e^{−isλj} = (1/T) ∑_{j=1}^T ∑_{b=−T}^T e^{ibλj} γ̃b e^{−isλj}
= ∑_{b=−T}^T γ̃b {(1/T) ∑_{j=1}^T e^{i(b−s)λj}}
= ∑_{b=−T}^T γ̃b 1(b = s)
= γ̃s.

6.3 Population quantities: building the spectral density


6.3.1 Core ideas

In this Section we study some of the population properties of a Fourier representation of a stationary process and see that as the number of terms increases all covariance stationary processes can be written this way.

ˆ – If {Yt} is a covariance stationary process with the additional assumption that ∑_{s=1}^∞ |γs| < ∞, then the (real) spectral density function

fYY(λ) = (1/2π) ∑_{s=−∞}^∞ e^{iλs} γs,  λ ∈ [0, 2π], (6.17)

exists. The leading example is fYY(0) = (1/2π) ∑_{s=−∞}^∞ γs, a scaled version of the long-run variance — which appears often as the asymptotic variance of the sample average of a time series. This appears all over these notes, as it drives most CLTs used in practice by combining this result with Slutsky's Theorem and the delta method. The

FYY(λ) = ∫_0^λ fYY(ω) dω

is the spectral distribution function. Inverting the relationship (6.17),

γs = ∫_0^{2π} e^{−isλ} dFYY(λ),

e.g. γ0 = ∫_0^{2π} fYY(λ) dλ.

– Assume {Yt} is a zero mean covariance stationary process with spectral distribution function FYY. Cramér's representation (also called the spectral representation) says there exists a zero mean (complex) orthogonal process {Z(λ)}λ∈[0,2π] with

Yt = ∫_0^{2π} e^{−iλt} dZ(λ),  t ≥ 0,

where E[|Z(λ2) − Z(λ1)|²] = 2{FYY(λ2) − FYY(λ1)} and FYY(0) = 0. Definition 4.4.12, in the section on continuous time processes, says the core characteristic of an orthogonal process is that it has uncorrelated increments. The leading special case is where

dZ(λ) = {fYY(λ)}^{1/2} (1/√2){dB(λ) + i dW(λ)},  i = √−1,

where {B(λ)}λ∈[0,2π] and {W(λ)}λ∈[0,2π] are independent standard Brownian motions; then {Yt}t≥0 is a stationary Gaussian process.

6.3.2 Main event for the population

Here we will use a circular orthogonal process, which was introduced in Definition 6.1.11.

Definition 6.3.1. [Fourier representation model] Let {Z(λ)}λ∈[0,2π] be a zero mean circular orthogonal (com-
plex) process with

2
E[|Z(λ)| ] = FY Y (λ), λ ∈]0, π].

Define
T
X
YT (t) = e−iλj t {Z(λj ) − Z(λj−1 )} ,
j=1

Then we call {YT (t)}t∈[0,T ] the Fourier representation model.

PT /2
Where does this come from? Start with a sum j=1 e−iλj t {Z(λj ) − Z(λj−1 )} and its complex conjugate
PT /2
j=1 eiλj t {Z(λj )∗ − Z(λj−1 )∗ }. Add them:

T /2 T /2
X X
−iλj t
YT (t) = e {Z(λj ) − Z(λj−1 )} + eiλj t {Z(λj )∗ − Z(λj−1 )∗ } ,
j=1 j=1
T /2 T /2
X X
= e−iλj t {Z(λj ) − Z(λj−1 )} + e−i(2π−λj )t {Z(λj )∗ − Z(λj )∗ } , as eiλj t = e−i(2π−λj )t (Example 6.1.5)
j=1 j=1
T /2 T /2
X X
−iλj t
= e {Z(λj ) − Z(λj−1 )} + e−i(2π−λj )t {Z(2π − λj ) − Z(2π − λj−1 )} , circular orthogonality
j=1 j=1
T
X
= e−iλj t {Z(λj ) − Z(λj−1 )} .
j=1
6.3. POPULATION QUANTITIES: BUILDING THE SPECTRAL DENSITY 135

Likewise zero mean orthogonal process implies E[YT (t)] = 0 and


 
XT XT
Cov[YT (t), YT (t + s)] = E  e−iλj t {Z(λj ) − Z(λj−1 )} eiλk (t+s) {Z(λk )∗ − Z(λk−1 )∗ }
j=1 k=1
 
XT X
T
= E e−iλj t {Z(λj ) − Z(λj−1 )} eiλk (t+s) {Z(2π − λk ) − Z(2π − λk−1 )}
j=1 k=1
T
X
= e−iλj t eiλj (t+s) {FY Y (λj ) − FY Y (λj−1 )}
j=1
T
X
= eiλj s {FY Y (λj ) − FY Y (λj−1 )} .
j=1

As T goes to infinity we get the stochastic integral


Z π  Z 2π
Y (t) = 2 Re e−iλt dZ(λ) = e−iλt dZ(λ),
0 0

and the Riemann integral


Z 2π
Cov[Y (t), Y (t + s)] = e−iλs dFY Y (λ),
0

a Riemann-Stieltjes integral.
Further

γ(s) = Cov[Y (t), Y (t + s)]


Z 2π
FY Y (λ)
= FY Y (2π) × e−iλs dG(λ), G(λ) = ,
0 FY Y (2π)

where {G(λ)} has the properties of a probability distribution function. Then γ(0) = FY Y (2π) and
Z 2π
ρ(s) = Cor[Y (t), Y (t + s)] = e−iλs dG(λ),
0

is the characteristic function (evaluated at −s) of an artificial random λ ∼ G.

Hence the autocovariance function of {Y (t)}≥0 mathematically plays the role of a characteristic function,
which uniquely maps backwards and forwards to G via the uniqueness theorem of characteristic functions. As
T goes to infinity then we produce
Z 2π
Y (t) = e−iλt dZ(λ) (6.18)
0
Z 2π
γ(s) = e−iλs dFY Y (λ). (6.19)
0

Equation (6.18) is call Cramér’s representation. Equation (6.19) is Bochnor’s theorem.

Theorem 6.3.2. [Cramér’s representation] For any covariance stationary process {Yt } it is possible to find an
orthogonal process {Z(λ)} so that (6.18) holds with probability one.
136 CHAPTER 6. LINEARITY

h 6.3.3. Cramer’s representation is sometimes written as


Z π
Y (t) = e−iλt dZ(λ)
−π

and sometimes as
Z π Z π
Y (t) = cos(λt)dU (λ) + sin(λt)dU (λ),
0 0

where {U (λ)}λ∈[0,π] is a real orthogonal process with Var[dU (λ)] = 2fY Y (λ)dt.

We do not prove this theorem here, but the result is kind of obvious from what we have done as the
autocovariance function uniquely determines F , a result called the Wiener–Khintchine theorem in time series,
and so selects the correct orthogonal process.

Of course knowing the right orthogonal process is not enough to simulate from {Y (t)} as the class of
orthogonal process is not generative. As an example, think of all Lévy processes with the same drift and finite
variance — they yield the same orthogonal process. But everyone knows the Brownian motion and Poisson
process are quite different!

6.3.3 Spectral density

Start with some assumptions. The {Yt } is covariance stationary with {γs } being the associate autocovariance
function.
P∞
Definition 6.3.4. If {Yt } is covariance stationary process with the additional assumption that s=1 |γs | < ∞,
then the spectral density function is defined as

1 X iλs
fY Y (λ) = e γs , λ ∈ R, (6.20)
2π s=−∞

( )
1 X
= γ0 + 2 cos(λs)γs . (6.21)
2π s=1

The FY Y (λ) = 0
fY Y (ω)dω, is the spectral distribution function.

Note
P∞ P∞ 2 P∞
(a) |fY Y (λ)| ≤ s=−∞ eiλs |γs | ≤ s=−∞ |γs | as, for all a, the eia = 1. We assumed s=−∞ |γs | so

fY Y (λ) exists under that condition.


(b) {fY Y (λ)} is periodic with period 2π (as ei2πa is periodic in a with period 1) that is fY Y (λ) = fY Y (2π − λ).
Hence it is conventional to plot
fY Y (λ) against λ ∈ [0, π],

not λ ∈ [0, 2π].


(c) {fY Y (λ)} is real and symmetric about zero, fY Y (λ) = fY Y (−λ).
6.3. POPULATION QUANTITIES: BUILDING THE SPECTRAL DENSITY 137

Example 6.3.5. If {Yt } is weak white noise then

γ0 Var(Y1 )
fY Y (λ) = = ,
2π 2π

so the spectral density is flat with respect to λ.

Example 6.3.6. If {Yt } is a MA(1) driven by weak white noise Yt = θ1 (L)εt , then

γ0 + 2 cos(λ)γ1 Var(ε1 ) 1 + θ12 + 2θ1 cos(λ)
fY Y (λ) = = .
2π 2π

This is draw in the left hand side of Figure 6.1.

MA(1) AR(1) AR(2)

0.7
15
0.5

0.6
0.4

0.5
10
spectral density

spectral density

spectral density

0.4
0.3

0.3
0.2

0.2
0.1

0.1
0.0

0.0 0.5 1.0 1.5 2.0 2.5 3.0 0.0 0.5 1.0 1.5 2.0 2.5 3.0 0.0 0.5 1.0 1.5 2.0 2.5 3.0

lambda lambda lambda

Figure 6.1: LHS: spectral density fY Y (λ) for two MA(1) models, θ1 = −0.5 (red) and θ1 = 0.9 (black). Middle:
spectral density fY Y (λ) for two AR(1) models, ϕ1 = −0.5 (red) and ϕ1 = 0.9 (black). RHS: spectral density
fY Y (λ) for an AR(2) models, ϕ1 = 0.8 and ϕ1 = −0.4.

To efficiently manipulate the spectral density it is helpful to define an autocovariance generating function.

P∞
Definition 6.3.7. The autocovariance generating function (agf) is gY (z) = s=−∞ z s γs .

Then
gY (eiλ ).
fY Y (λ) =

The following Theorem about polynomials is helpful in managing MA(∞) and thus other linear models.

P∞ P∞
Theorem 6.3.8. Define as = j=0 θj θj+s , for s = ..., −1, 0, 1, ..., and s=1 |as | < ∞, then


X ∞
X
z s as = θ(z)θ(z −1 ), where θ(z) = θj z j .
s=−∞ j=0
138 CHAPTER 6. LINEARITY

Proof.

X ∞ X
X ∞ ∞
X ∞
X
z s as = z s θj θj+s = z s θj θj+s , θj = 0, for j < −1,
s=−∞ s=−∞ j=0 j=−∞ s=−j

X ∞
X
= θj θh z h−j , j + s = h;
j=−∞ h=−∞


! ∞ 
X X
= θh z h  θj z −j  = θ(z)θ(z −1 ).
h=0 j=0

P∞ ∞
j
P
Example 6.3.9. For an MA(∞) process θ(z) = j=0 θj z with |θj | < ∞ (which is enough to guarantee
j=0
P∞
that s=1 |γs | < ∞,) then
Var(ε1 ) 2
fY Y (λ) = θ(eiλ ) ,

e.g. in the MA(1) case θ(z) = 1 + θ1 z, then

2
θ(eiλ ) = (1 + θ1 eiλ )(1 + θ1 e−iλ ) = 1 + θ12 + 2θ1 cos(λ).

This is drawn in the right hand side of Figure 6.1 for θ1 ∈ {−0.5, 0.9} — when θ1 = −0.5 the shocks somewhat
cancel out through time and so most of the variation is at the higher frequencies (shocks die out really fast);
when θ1 = 0.9 the shocks appear positively twice and so this lifts a little the activity at the lower frequencies.
Pp 1
Example 6.3.10. For an AR(p) process with ϕ(z) = 1 − j=1 ϕj z j , so in the stationary case Yt = ϕ(L) εt .

Thus
Var(ε1 ) 1
fY Y (λ) = ,
2π |ϕ(eiλ )|2
e.g. in the AR(1) case ϕ(z) = 1 − ϕ1 z, then

2
ϕ(eiλ ) = (1 − ϕ1 eiλ )(1 − ϕ1 e−iλ ) = 1 + ϕ21 − 2ϕ1 cos(λ),

so
Var(ε1 ) 1
fY Y (λ) = .
2π 1 + ϕ21 − 2ϕ1 cos(λ)
This is drawn in the middle of Figure 6.1 for ϕ1 ∈ {−0.5, 0.9} — when ϕ1 = −0.5 shocks somewhat cancel out

through time and so is of higher frequencies (things move away fast); when ϕ1 = 0.9 shocks get reinforced and
so most of the action is at the lower frequencies. Notice how much higher the AR(1) spectrum goes than the
MA(1) does. The right hand side of Figure 6.1 shows the spectrum for an AR(2) process with ϕ1 = 0.8 and

ϕ2 = −0.4 — this has a pair of complex eigenvalues and so yields a process which cycles. This is nicely shown
in the spectrum which has a peak around frequency one. Although it is possible to work out the spectrum in
terms of cosines, it is easier not to do the math and just use complex variables to do the computations. Given
6.3. POPULATION QUANTITIES: BUILDING THE SPECTRAL DENSITY 139

below is an R snippet which computes the spectrum for an AR(p), implementing with p = 4. I m not claiming
this code is efficient: but I am saying it is simple to use and simple to read.

B=1000; iC = 1i; p =4; phi = c(0.6,-0.4,0.2,-0.3);


lambda = pi*seq(1,B)/B;
AC=rep(1,B);
for (j in (1:p)){AC = AC - (phi[j]*exp(j*lambda*iC));}

spAR = (1.0/(2.0*pi))/(abs(AC)^2);
plot(spAR);

Definition 6.3.4 goes from the sequence of autocovariances {γs } to the spectral density function {fY Y (λ)}λ∈[0,2π] .
What about the reverse? This is really Bochnor’s Theorem, but here it is derived.

P∞
Theorem 6.3.11. If {Yt } is covariance stationary process with the additional assumption that s=1 |γs | < ∞,
then
Z 2π
γs = fY Y (λ)e−isλ dλ.
0

Proof. Start with



1 X iλj
fY Y (λ) = e γj ,
2π j=−∞

multiple by e−isλ and integrate, yielding


Z 2π ∞ Z 2π ∞
X 1 X
fY Y (λ)e−isλ dλ = γj eiλ(j−s) dλ = γj 1j=s = γs .
0 j=−∞
2π 0 j=−∞

The important special case is the beautiful


Z 2π
Var(Y1 ) = γ0 = fY Y (λ)dλ.
0

6.3.4 Band filter

A band filter starts with the Cramer’s representation


Z 2π
Y (t) = e−iλs dZ(λ)
0

and takes out certain frequencies, e.g. a low pass filter removes higher frequencies.
Z 2π
Ya (t) = 1|λ−π|>a e−iλs dZ(λ), a ∈ [0, π).
0
140 CHAPTER 6. LINEARITY

Low pass filter

10
5
0
Series

−5
−10

0 200 400 600 800 1000

Time

Figure 6.2: Simulated autoregression (black dots) plotted against time, together with the low pass filter
(smoother) drawn as a red solid line.

This is implemented as
r T
2π X −iλj t
µ
bt|T = e JT (λj )1|λj −π|>a .
T j=1

Advocacy of this approach for macroeconomics includes Baxter and King (1999) and Watson (2007).
n o n o
Example 6.3.12. Figure 6.2 plot Yt , µ
bt|T from running a low pass filter ( µbt|T , plotted in red) through a
simulated Gaussian AR(1) ({Yt } plotted as black dots) with ϕ1 = 0.95, taking a = π, so all but the smallest
frequencies are discarded. The low pass filter roughly follows the main thread of the series, but it has some

problems at the start of the process — it will have difficulties with end effects.

6.4 Estimating the spectral density


6.4.1 Core ideas
P∞
ˆ – If {Yt } is covariance stationary process obeying the additional assumption that s=1 |γs | < ∞, then
as T increases
IT (λj ) d 1 2
→ χ2 , j = 1, ..., T,
fY Y (λj ) 2

and this ratio is asymptotically independent of the other ratios.

– For a covariance stationary process, the spectrum {fY Y (λ)} is usually estimated by fitting an AR(p)
6.4. ESTIMATING THE SPECTRAL DENSITY 141

with large p and using the fitted model to imply an estimated

Var(ε1 )
fbY Y (λ) = 2;
b iλ )
ϕ(e

or by local averaging, e.g. using a kernel estimator


 
PT λj −λ
j=1 IT (λ|j| )K h
fbY Y (λ) =   ,
PT λj −λ
j=1 K h
R
where h > 0 is a bandwidth and {K(x)}x∈R is a kernel we can design such that R
K(x)dx = 1 and
K(x) ≥ 0. Other local averaging methods (e.g. local linear regression) can also be used. These
methods give, for example, consistent estimators of fY Y (0), which is important in applied statistics,
so long as p and h are selected wisely.

6.4.2 Main event for spectrum

Suppose the estimand is


θ = fY Y (λ),

the spectral density at one frequency λ ∈ [0, π]. There are many ways of estimating θ.

6.4.3 Long AR approach

A core way of estimating θ is to assume a covariance stationary process follows a AR(p) process driven by white
noise, which approximates θ by the spectral density of the AR(p) process:

Var(ε1 )
θ ≃ θAR(p) = Pp 2,
1 − j=1 ϕj eiλj

when p is large. Now estimate the associated Var(ε1 ) and ϕ1:p , and plug the estimated parameters into the
spectral density for a AR(p), yielding
\1 )
Var(ε
θAR(p) =
b
Pp b iλj 2 .
1− ϕ e
j=1 j

Confidence intervals can then be generated by the delta method. For a Gaussian AR(p) process this procedure
is the MLE of the spectral density at frequency λ. For large p (perhaps shrinking AR coefficients at large lags
towards zero) and T , this procedure is widely used in applied problems.

6.4.4 Nonparametric estimation


Behaviour of periodogram

Here we will focus on estimating θ using the periodogram ordinates IT (λ1 ), ..., IT (λT ), with high weights for
frequencies near λ.
142 CHAPTER 6. LINEARITY

Example 6.4.1. Figure 6.3 shows a simulation experiment with T ∈ {50, 250, 1000} and an MA(1) model with
θ1 = 0.9. Notice that as T increases each ordinate of the periodogram does not get any more precise, there are
just more of them! The same setup is reported in Figure 6.4, but now for a AR(1) model with ϕ1 = 0.9.

MA(1) case, Periodogram: T= 50 MA(1) case, Periodogram: T= 250 MA(1) case, Periodogram: T= 1000
2.0

3.0

2.5
2.5
1.5

2.0
2.0
Spectrum

Spectrum

Spectrum

1.5
1.0

1.5

1.0
1.0
0.5

0.5
0.5
0.0

0.0
0.0

0.0 0.5 1.0 1.5 2.0 2.5 3.0 0.0 0.5 1.0 1.5 2.0 2.5 3.0 0.0 0.5 1.0 1.5 2.0 2.5 3.0

lambda lambda lambda

Figure 6.3: Spectrum (red line) and periodogram (bacl line) for an MA(1) model with θ1 = 0.9. LHS: T = 50.
Middle: T = 250. RHS: T = 1, 000.

It really helps to understand the properties of IT (λj ) under the Fourier representation model, for each T ,

the Fourier transform of the data at frequency λj is


T
1 X iλs t {Z(λj ) − Z(λj−1 )}
JT (λj ) = √ e yt = p ,
2πT t=1 λj − λj−1

as it provides super clear results. By orthogonal increments,

2 FY Y (λj ) − FY Y (λj−1 )
E[|JT (λj )| ] = , Cov[JT (λj ), JT (λj )] = 0; j ̸= k.
λj − λj−1

In the case where


 1/2
1
dZ(λ) = fY Y (λ) {dB(λ) + idW (λ)} , {B(λ)}λ∈[0,2π] ⊥⊥ {W (λ)}λ∈[0,2π] ,
2

then JT (λj ) ⊥
⊥ JT (λk ) and IT (λj ) ⊥
⊥ IT (λk ). In particular, rather beautifully,
! !
1 λ
Z Z λ
2 L 1
|Z(λ)| = Z(λ)Z(λ)∗ = fY Y (c)dc × B(1)2 + W (1)2 ∼ fY Y (c)dc × χ22 ,

2 0 2 0

so R λj
ind

1 2

λj−1
fY Y (λ)dλ
IT (λj ) ∼ χ2 fY Y,j , fY Y,j = , j = 1, 2, ..., T.
2 λj − λj−1
That is the periodogram ordinates have converted a covariance time series into a sequence of scaled independent
variables! Possible to show covariance stationary processes, then

IT (λj ) d 1 2
→ χ2 ,
fY Y (λj ) 2
6.4. ESTIMATING THE SPECTRAL DENSITY 143

AR(1) case, Periodogram: T= 50 AR(1) case, Periodogram: T= 250 AR(1) case, Periodogram: T= 1000

80
15
5
4

60
10
3
Spectrum

Spectrum

Spectrum

40
2

20
1

0
0

0.0 0.5 1.0 1.5 2.0 2.5 3.0 0.0 0.5 1.0 1.5 2.0 2.5 3.0 0.0 0.5 1.0 1.5 2.0 2.5 3.0

lambda lambda lambda

Figure 6.4: Spectrum (red line) and periodogram (bacl line) for an AR(1) model with ϕ1 = 0.9. LHS: T = 50.
Middle: T = 250. RHS: T = 1, 000.

and this ratio is asymptotically independent over j. If λj is close to λ, then if {fY Y (λ)} is continuously twice
differentiable, then by a second order Taylor expansion

1
fY Y,j ≃ fY Y (λ) + (λj − λ)fY′ Y (λ) + (λj − λ)2 fY′′ Y (λ). (6.22)
2

More broadly, notice that the complex JT (λ) is a weighted sum of the data, where weights cannot be extreme
2
as eiλt ≤ 1. Hence you might expect that if {Yt } is covariance stationary with spectral density {fY Y (λ)}

(and a linear model with MD, M-dependence or i.i.d. shocks, so a CLT can be driven), that JT (λ) will obey a
CLT. Corollary 11.2.1 of Subba Rao (2022) gives such a CLT
   
Re(JT (λ)) d 1
→ N 0, fY Y (λ)I2 .
Im(JT (λ)) 2

the joint limit for the real and imaginary elements of JT (λ). So the Gaussian part of the Fourier representation
model is not essential.

Kernel estimators

To get an estimator of fY Y (λ) we will use a kernel type estimator


 
PT 1 λj −λ
j=1 IT (λj ) h K h
θh (λ) =
b
PT 1  λj −λ  ,
j=1 h K h
 
PT λj −λ
j=1 I T (λ j )K h
=  
PT λj −λ
j=1 K h
R
where {K(x)}x∈R be a kernel with R
K(x)dx = 1, K(x) ≥ 0, and h > 0 is some bandwidth This appears in

the work of Grenander and Rosenblatt (1953) (later, the same idea was ported into kernel density estimation
by Parzen (1962) and Rosenblatt (1956)). Subba Rao (2022) has a good analysis of the general case.
θh (λ) ≥ 0 as the periodogram ordinates and the kernel weights are non-negative.
Notice that b
144 CHAPTER 6. LINEARITY

Example 6.4.2. The left hand side of Figure 6.5 shows the case of the MA(1) process with θ1 = 0.9 and
T = 1, 000. The red line is the true spectral density. The dotted black line is the kernel spectral density using a
Gaussian kernel, with a standard deviation of 0.07. The green line is the rectangular kernel with h = 0.04/pi.

MA(1) Kernel spectrum estimator AR(1) Kernel spectrum estimator

100
0.6

80
0.5

60
0.4
Spectrum

Spectrum
0.3

40
0.2

20
0.1
0.0

0
0.0 0.5 1.0 1.5 2.0 2.5 3.0 0.0 0.5 1.0 1.5 2.0 2.5 3.0

Lambda Lambda

Figure 6.5: Non-parametric estimator of the spectral density. Red: truth; green: rectangular kernel; black:
Gaussian kernel. Throughout T = 1, 000. LHS: MA(1) case with θ1 = 0.9 and RHS: AR(1) with ϕ1 = 0.9.

The right hand side shows the same result but for an AR(1) process with ϕ1 = 0.9. The results indicate the
superiority of the Gaussian kernel for the case where the periodogram has some sharp peaks, here at frequencies
close to 0. This is not surprising. The Gaussian kernel puts more weight closer to 0 and then damps everything
else out. The rectangular puts equal weight on all the frequencies close to 0.

Here we study in some detail the properties of a relatively simple version: the kernel estimator based on a
rectangular kernel
T
1 X 2πj
θh (λ) :=
b IT (λj )1(λj ∈ (λ − h, λ + h]), λj = ,
nh j=1 T
the effective sample size of this statistic is
T  
X 2πj Th
nh := 1 ∈ (λ − h, λ + h] ≃ 2 .
j=1
T 2π

Under the Fourier representation model


T
  1 X
bias bθh (λ) = {fY Y,j − fY Y (λ)} 1(λj ∈ (λ − h, λ + h]),
nh j=1

while under the Gaussian Fourier representation model


T
1 X 2
Var[b
θh (λ)] = 2 f 1(λj ∈ (λ − h, λ + h]).
nh j=1 Y Y,j
6.4. ESTIMATING THE SPECTRAL DENSITY 145

To gain a clearer understanding of bias[b


θh (λ)] and Var[b
θh (λ)], it is helpful to think about its behaviour under
asymptotics, allowing h to get small as T increases.

Theorem 6.4.3. Assume the spectral density {fY Y (λ)}λ∈[0,π] is twice continuously differentiable, then
  1 √ 
2
bias h−2b
θh (λ) → fY′′ Y (λ) , and Var T hbθh (λ) → πfY Y (λ) , λ ∈ [0, π],
6
h4 ′′ 2 1 π 2
as h → 0 and T h → ∞. Hence mse(b
θh (λ)) ≃ 36 fY Y (λ) + h T fY Y (λ) .

Here the bias and variance are in conflict, with the bias liking a small h, while the variance liking a large h.

The approximation to the mean square error is minimized by selecting h to be


(  2 )−1/5
36 π fY Y (λ)
hT,λ = = T −1/5 c(λ),
4 T fY′′ Y (λ)

θh ) = O(T −4/5 ).
hence the mse(b

Proof. To make expressions more compact, I will write suppress λ in the b


θh (λ) expression, so writing it as b
θh .
Write the spread of the frequencies about λ as
T
X 2
bh = (λj − λ) 1(λj ∈ (λ − h, λ + h]),
j=1

and assume the spectral density is twice continuously differentiable, then


2
1 bh f (λ)
θh ] ≃ f ′′ (λ)
bias[b , θh ] ≃
Var[b .
2 nh nh
Why? Using the Taylor expansion (6.22), the bias is
T    
1 ′ X 2πj 2πj
θh ] ≃
bias[b f (λ) −λ 1 ∈ (λ − h, λ + h]
nh j=1
T T
T  2  
1 1 ′′ X 2πj 2πj
+ f (λ) −λ 1 ∈ (λ − h, λ + h]
nh 2 j=1
T T
 T 
X    
2π Tλ Tλ Th Tλ Th
= j− 1 j∈ − , +
T j=1
2π 2π 2π 2π 2π
 2 T  2   
1 2π 1 ′′ X Tλ Tλ Th Tλ Th
+ f (λ) j− 1 j∈ − , +
nh T 2 2π 2π 2π 2π 2π
j=−T

  ⌊X⌋ Th
2π  2 ⌊X
2π ⌋
Th

2π 1 2π 1 ′′
= j+ f (λ) j2.
T nh T 2
j=−⌊ T2πh ⌋ j=−⌊ T2πh ⌋

Hence
 2 Th Th Th n
1 2π 1 ′′ 2π ( 2π + 1)(2 2π + 1) n(n + 1)(2n + 1)
X
bias[h−2b
θh ] ≃ h −2
f (λ) , j2 =
nh T 2 6 j=1
6
 −1  
Th T
≃ h−2 2 h3 f ′′ (λ) /3
2π 2π
→ f ′′ (λ) /6.
146 CHAPTER 6. LINEARITY

Likewise

√ T
1 X 2
Var[ T hb
θh ] = Th f 1(λj ∈ (λ − h, λ + h])
n2h j=1 Y Y,j
 −2  
Th 2 Th
≃ Th 2 fY Y (λ) 2
2π 2π
= πfY Y (λ)2 .

From frequency domain to time domain

Recall Bartlett’s result that the periodogram

T   T −|s|
1 X T − |s| 1 X
T
IT (λ) = cos(sλ)b
γs, γ
bs = yt yt+|s| ,
2π T T − |s| t=1
s=−T
T T −|s|
1 X 1 X T
= cos(sλ)e
γs, γ
es = yt yt+|s| ,
2π T t=1
s=−T

is a weighted sum of the sample autocovariances, while the kernel estimator


PT
j=1 IT (λj )Kj
 
λj − λ
θh (λ) =
b PT , Kj = K
j=1 Kj
h

weights the periodogram ordinates. Thus it must be the case that the kernel estimator can be written in the
time-domain as a sum
T T −|s|
1 X 1 X T
θh (λ) =
b ws (λ)e
γs, γ
es = yt yt+|s| ,
2π T t=1
s=−T

where {ws (λ)} are weights, called lag windows. Think of w0 (λ), it places weight on γ
e0 the sample variance;
while w1 (λ), it places weight on γ
e1 , the sample autocovariance at lag 1.

What are these weights? Now


PT PT T
1 j=1 s=−T cos(sλ)e
γ s Kj 1 X
θh (λ) =
b PT = ws (λ)e
γs,
2π j=1 Kj

s=−T

where the desired weights are  


PT λj −λ
K
j=1 h cos(sλj )
ws (λ) =   ,
PT λj −λ
j=1 K h

which is a local average of {cos(sλj )} for {λj } close to λ.

Example 6.4.4. Suppose K(u) ∝ 1(|u| ≤ 1), then the s-th weight is

⌊(λ+h)T /2π⌋ ⌊(λ+h)T /2π⌋


1 X X
ws (λ) = cos(s × j2π/T ), where nh (λ) = .
nh (λ)
j=⌈(λ−h)T /2π⌉ j=⌈(λ−h)T /2π⌉
6.4. ESTIMATING THE SPECTRAL DENSITY 147

From time domain to frequency domain

The opposite question can also be asked. Let


 
T T T
1 X 1 X 2π X
θh (λ)
b = es eisλ =
ws γ ws  eisλ e−isλj IT (λj )
2π 2π T j=1
s=−T s=−T
T T
!
1X X
= IT (λj ) ws e−is(λj −λ)
T j=1
s=−T
T T
1X X
= IT (λj )K(λj − λ), K(λ) = ws e−isλ .
T j=1
s=−T
T
Hence {K(λ)} is the Fourier transform of the sequence of weights {ws }s=−T .

Example 6.4.5. [Rectangular weight] Suppose ws = 1(|s| ≤ B), then


B
1 X
θh (λ) =
b es eisλ ,
γ

s=−B

and
T B B
X
−isλ
X
−isλ
X sin [(B + 1/2)λ]
K(λ) = ws e = e =1+2 cos(sλ) =
s=1
sin(λ/2)
s=−T s=−B
the Dirichlet kernel in Fourier analysis. As Figure 6.6 demonstrates (this problem does not go away as B
becomes quite large), K(λ) can have rather negative weights so b
θh (λ) can sometimes be negative, even though
the estimand is non-negative. This is problematic for applications. Why does this Dirichlet kernel hold?

Rectangular kernel, B = 2 Rectangular kernel, B = 5 Rectangular kernel, B = 10


5

20
10
4

15
3

10
weights

weights

weights
2

5
1

2
0

0
−1

−2

−5

0.0 0.5 1.0 1.5 2.0 2.5 3.0 0.0 0.5 1.0 1.5 2.0 2.5 3.0 0.0 0.5 1.0 1.5 2.0 2.5 3.0

lambda lambda lambda

Figure 6.6: Spectral weight function K(λ) corresponding to a rectangular weight function in the time domain.
LHS: B = 2; middle: B = 5 and RHS: B = 10.

For any integer n ≥ 0, let


n
X n
X n
X n−1
X n
X
s s
Dn (w) = ws = (1/w) + ws = (1/w) (1/w) + ws
s=−n s=1 s=0 s=0 s=0
1 1 − (1/w)n 1 − wn+1
= + , finite geometric progression
w 1 − 1/w 1−w
w−n − 1 1 − wn+1 w(n+1) − w−n
= + = .
1−w 1−w w−1
148 CHAPTER 6. LINEARITY

Thus
n
X n
X
1+2 cos(sλ) = eisλ , this sum is called the Dirichlet kernel.
s=1 s=−n

e(n+1)iλ − e−inλ
= Dn (eisλ ) =
eiλ − 1
(n+1/2)iλ −i(n+1/2)λ
e −e
= , scale both top and bottom by eiλ/2
e iλ/2 − e−iλ/2
sin [(n + 1/2)λ]
= .
sin(λ/2)
 
Example 6.4.6. [Bartlett weight] Suppose ws = 1 − |s| B 1(|s| ≤ B), a weight function due to Bartlett. Then

B    2
X |s| −isλ 1 sin (Bλ/2)
KB (λ) = 1− e = ,
B B sin(λ/2)
s=−B

the Fejér kernel in Fourier analysis. This is non-negative and is drawn in Figure 6.7 for a variety of values of

B. Hence b
θh (λ) is non-negative. Why? Recall

Bartlett kernel, B = 2 Bartlett kernel, B = 5 Bartlett kernel, B = 10


25
12

40
20
10

30
8

15
weights

weights

weights
6

20
10
4

10
5
2
0

0.0 0.5 1.0 1.5 2.0 2.5 3.0 0.0 0.5 1.0 1.5 2.0 2.5 3.0 0.0 0.5 1.0 1.5 2.0 2.5 3.0

lambda lambda lambda

Figure 6.7: Spectral weight function K(λ) corresponding to the Bartlett weight function in the time domain.
LHS: B = 2; middle: B = 5 and RHS: B = 10.

n
X wn+1 − w−n
Dn (w) := ws = .
s=−n
w−1

Claim: then for


n n−1
X
s
X (wn/2 − w−n/2 )
nFn (w) := (n − |s|) w = Ds (w) = 2 .
s=−n s=0 w1/2 − w−1/2
Verify:
n−1
X n−2
X
nFn (w) − (n − 1)Fn (w) = (n − |s|) ws − {(n − 1) − |s|} ws
s=−(n−1) s=−(n−2)
 
 n−2
X n−2
X 
= wn−1 + w−(n−1) + (n − |s|) ws − {(n − 1) − |s|} ws
 
s=−(n−2) s=−(n−2)
n−2
X n−1
X
= wn−1 + w−(n−1) + ws = ws = Dn−1 (w).
s=−(n−2) s=−(n−1)
6.5. WOLD REPRESENTATION 149

Then
n−1 n−1 n−1 n−1
!
X X ws+1 − w−s 1 X X
nFn (w) = Ds (w) = = ws+1 − w−s
s=0 s=0
w−1 w − 1 s=0 s=0
( n−1 n−1
)
1 − wn 1 − (1/w)n
 
1 X
s
X
s 1
= w w − (1/w) = w −
w−1 s=0 s=0
w−1 1−w 1 − 1/w
w n n w n n
= 2 {(w − 1) − (1 − (1/w) )} = 2 {w − 2 + (1/w) }
(w − 1) (w − 1)
1 n n (wn/2 − w−n/2 )
= 2 {w − 2 + (1/w) } = 2 .
w1/2 − w−1/2 w1/2 − w−1/2

This is the stated about nFn (w). Then

KB (λ) = Fn (eiλ )
1 (eiλn/2 − w−iλn/2 )
=  ,
n wiλ/2 − w−iλ/2 2

which is the stated result. This is non-negative and integrates to one. Hence b
θh (λ) is non-negative.

The Bartlett estimator


B   T −|s|
1 X |s| 1 X
θB (λ) =
b 1− es eisλ
γ s , γ
es = yt yt+|s| ,
2π B T t=1
s=−B

is used extensively in applied time series. In the case of λ = 0, then the resulting long run variance estimator
B   T −|s|
X |s| 1 X
2πb
θB (0) = 1− γ
es , γ
es = yt yt+|s| ,
B T t=1
s=−B

is a non-negative estimator.

Remark 22. In econometrics the scaled special case of the Bartlett estimator, the 2πb
θB (0), is often called the
Newey and West (1987) estimator. The Bartlett version of 2πb
θB (0) is used very extensively as an ingredient

to statistical inference procedures for problems. More extensive discussion of estimators of the zero frequency
includes Andrews (1991) and Lazarus, Lewis, and Stock (2021). Obviously, the case where the data is a vector,
then the Bartlett version of 2πb
θB (0) is a square, symmetric positive semi-definite matrix.

6.5 Wold representation


6.5.1 Decomposing using martingales

Before discussing the Wold representation, to get into the swing of things, it is elegant to talk again about
martingales.
Recall the beautiful and deep Doob martingale: if E [|θ|] < ∞ and {Ft } is a sequence of filtrations, then

Xt = E[θ|Ft ],
150 CHAPTER 6. LINEARITY

then

Xt − Xt−1 = E[θ|Ft ] − E[θ|Ft−1 ]

is a martingale difference with respect to {Ft }. We will use a similar idea here.

Think of a time series {Yt }t=−∞ adapted to the {Ft }t=−∞ , a sequence of natural filtrations going into the
infinite past. Assume for each t the

E[Yt |F−∞ ] < ∞,

(think of this as E[Yt ], expectation given you no knowing). Then at time t, the

Yt = E[Yt |Ft ]

= {E[Yt |Ft ] − E[Yt |Ft−1 ]}

+E[Yt |Ft−1 ]

= Ut,0 + E[Yt |Ft−1 ], {Ut,0 } is MD.

Now define, for each t,

Ut,j := E[Yt |Ft−j ] − E[Yt |Ft−j−1 ], j = 0, 1, 2, ...


which produces a sequence {Ut,j }j=0 which is a martingale difference in j = 0, 1, 2, ..., for each t. Then keeping
going:

Yt = Ut,0 + Ut,1 + E[Yt |Ft−2 ],




X
= E[Yt |F−∞ ] + Ut,j , telescoping, {Ut,j }j=0 is MD.
j=0

The core point is that Yt can be decomposed into a sum of MD terms.

Example 6.5.1. Suppose Yt = ϕ1 Yt−1 + εt , where {εt } is a MD sequence. Then

E[Yt |Ft−j ] = ϕj1 Yt−j ,

so, for finite j

Ut,j : = E[Yt |Ft−j ] − E[Yt |Ft−j−1 ]

= ϕj1 Yt−j − ϕj+1 j


1 Yt−j−1 = ϕ1 (Yt−j − ϕ1 Yt−j−1 )

= ϕj1 εt−j .

If |ϕ1 | < 1, then E[Yt |F−∞ ] = 0 and



X
Yt = ϕj1 εt−j .
j=0
6.5. WOLD REPRESENTATION 151

6.5.2 Decomposing using best linear projections

Instead of working with conditional expectations, work with linear projections. Before we so this in general, it

helps me at least, to think about some familar regressions.

Before we so this in general, think about a generic linear regression. Suppose the pair X, Y have a zero
mean and each has a variance. Recall, if Var(X) is non-singular, then

βY ∼X := Var(X)−1 Cov(X, Y ), Yb = βY ∼X × X, U := Y − Yb implies Cov(U, X) = 0.

In the context below, the special case where the X are uncorrelated is important. If Var(X) and, so, Var(X)−1
are diagonal, then
 T
βY ∼X1

 βY ∼X2 

βY ∼X = ..
 ,
 
 . 
 βY ∼Xp−1 
βY ∼Xp
collecting p small regressions.

We are going to apply these ideas to time series. There are two stages.

The first stage is regress Yt on Y1:t−1 , then:

Ybt := βYt ∼Y1:t−1 × Y1:t−1 , Ut := Yt − Ybt , implies Ut ⊥(Y1:t−1 ).

We will do this for each t, delivering a sequence

{Ut }t≥1 .

The second stage needs some preliminary work! For each t ≥ 2, there exists a non-stochastic (t − 1) × (t − 1)
matrix Bt−1 such that U1:t−1 = Bt−1 Y1:t−1 (just think about how Ut is built out of Yt and βYt ∼Y1:t−1 × Y1:t−1
— the same type of relationship holds for each t) so

Cov(U1:t−1 , Ut ) = Cov(Bt−1 Y1:t−1 , Ut ) = Bt Cov(Y1:t−1 , Ut ) = 0t−1 ,

yielding

Var(U1:t ) = diag(Var(U1 ), ..., Var(Ut )).

If Var(Uj ) > 0 for each j = 1, ..., t, then the regression of Yt on U1:t is


 T
βYt ∼U1

 βYt ∼U2 

βYt ∼U1:t = Var(U1:t )−1 Cov(U1:t , Yt ) =  ..
 .
 
 . 
 βYt ∼Ut−1 
βYt ∼Ut
152 CHAPTER 6. LINEARITY

For a moment I will introduce a notation which focuses on lag-length:


   
ψt,t−1 βYt ∼U1
 ψt,t−2   βYt ∼U2 
   
.. := ..
,
   

 . 


 . 
 ψt,1   βYt ∼Ut−1 
ψt,0 βYt ∼Ut

noting, in general, that βYt ∼Ut−s , for s ≥ 0, will change with t (no stationarity assumption has been made).
Thus the fitted value of the regression of Yt on U1:t is

Yet = βYt ∼U1:t × U1:t


t−1
X
= ψt,s Ut−s = Yt .
s=0

That is the end of the second stage.

There is no error, Yet = Yt as the regressors are U1:t which is a linear combination of Y1:t . The result
t−1
X
Yt = ψt,s Ut−s
s=0

is a representation of Yt in terms of the, by construction, uncorrelated {Ut }.

Intuitively, if {Yt } is covariance stationary and the first stage allowed Yt to be regressed on Y−∞:t−1 and
then Yt on U−∞:t , then you would expect the t in the notation ψt,s to cease to move those coefficients around
P∞
and yield Yt = s=0 ψs Ut−s . Of course some additional technique is needed allowing an infinite number of

regressors.

The math behind this extension is linear projection in Hilbert space, which is discussed briefly in Appendix
6.8.


Theorem 6.5.2. [Wold decomposition] Assume {Yt }t=−∞ is a covariance stationary process, then

X
Yt = ψs Ut−s + µt
s=0

where {Ut } are zero mean, weak white noise and {µt } is non-stochastic.

Proof. The best linear projection of Yt on the past Yt−1 , Yt−2 , ... is

X
Ybt := P (Yt |Y−∞:t−1 ) = a0 + aj Yt−j ,
j=1

where the sequence {aj } are time invariant due to covariance stationarity. Let {Ut } be an infinitely lived
sequence of one-step ahead errors from the linear projections,

Ut = Yt − Ybt .
6.5. WOLD REPRESENTATION 153

The {Ut } is zero mean, covariance stationary process, with

Cov(Ut , Us ) = 0, t ̸= s.

That is {Ut } is weak white noise. Write

Cov(Yt , Ut−j )
ψj = , j = 0, 1, 2, ...
Var(Ut−j )
Cov(Y0 , U−j )
= , {Yt , Ut } covariance stationary.
Var(U−j )

Then project Yt on the orthogonal (that is uncorrelated) Ut , Ut−1 , ..., so



X
Yt = ψj Ut−j + µt
j=0

where {µt } is a deterministic sequence (e.g. µt = µ or µt = α cos(λt) + β sin(λt)).

In most applications we work with zero mean covariance stationary processes and the Wold decomposition

works says the process can be written as a MA(∞) process



X
Yt = ψj Ut−j ,
j=0

driven by white noise. For it to exist we must have, of course,



X
ψj2 < ∞,
j=0

otherwise Var(Yt ) will not exist.

Example 6.5.3. Suppose Yt = ϕ1 Yt−1 + εt , where {εt } is zero mean weak white noise and |ϕ| < 1, so {Yt } is

covariance stationary. Then the best linear projection of Yt on Y−∞:t−1 is

Ybt = PY−∞:t−1 (Yt ) = ϕ1 Yt−1 ,

so

Ut = Yt − Ybt = Yt − ϕ1 Yt−1 = εt ,

where
Cov(Yt , Ut−j ) Cov(Yt , εt−j )
ψj = = = ϕj1 ,
Var(Ut−j ) Var(εt )

so

X ∞
X
Yt = ψj εt−j = ϕj1 εt−j .
j=0 j=0
154 CHAPTER 6. LINEARITY

Example 6.5.4. Suppose Yt = α cos(λt) + β sin(λt), where α, β are uncorrelated, zero mean random variables
with variance σλ2 . We saw in Example 3.3.4 that {Yt } is covariance stationary, with zero mean and γs =
σλ2 cos(λs). What is the Wold decomposition? Suppose we start at time t = 1, then
        −1  
Y1 cos(λ) sin(λ) α α cos(λ) sin(λ) Y1
= , so = .
Y2 cos(2λ) sin(2λ) β β cos(2λ) sin(2λ) Y2

So α, β is exactly deduced after 2 datapoints, while write


 −1  
 cos(λ) sin(λ) Y1
µt = cos(tλ) sin(tλ) , t = 3, 4, ...
cos(2λ) sin(2λ) Y2
= α cos(λt) + β sin(λt),

where α, β now are functions of Y1:2 ! Then


   
Y1 cos(λ) sin(λ)  
 Y2  =  cos(2λ) α
sin(2λ)  ,
β
Y3 cos(3λ) sin(3λ)

(which has a singular covariance matrix!) so

P (Y3 |Y1:2 ) = µt = α cos(3λ) + β sin(3λ),

More broadly

Ybt : = P (Yt |Y1:2 ) = α cos(λt) + β sin(λt), t = 3, 4, ...,

Ut = Yt − Ybt = 0, t = 3, 4, ...

Thus

µt = α cos(λt) + β sin(λt),

which is now viewed, by the Wold decomposition, as non-stochastic as (α, β) is determined by Y1:2 . Hence the
Wold decomposition is

Yt = µt , t = 3, 4, ...,

an entirely deterministic sequence.

6.6 Kalman filter as a linear projection

Recall the linear state space form, but now remove the Gaussian assumption. The resulting model is not a
hidden Markov model. It is the model which appears in the Kalman (1960) paper. Here we will analyse it using
linear projection theory.
6.6. KALMAN FILTER AS A LINEAR PROJECTION 155

Definition 6.6.1. [Linear state space] The LSS model has the pair {Yt , αt+1 } following
     
Yt 0d×d Zt Ht 0d×r
Zt = = ϕt,1 Z t−1 +Bt ωt , ϕt,1 = , Bt = , E[ω1 ] = 0d+r , Var(ω1 ) = Id+r ,
αt+1 0r×d Tt 0r×d Qt

where {ωt } is weak white noise

E[α1 ] = a1 , Var(α1 ) = P1 , {ωt } ⊥ α1 ,


we assume a1 , P1 , ϕt,1 , Bt are non-stochastic. We denote this as

{Yt , αt+1 } ∼ LSS.

Then the Kalman filter replaces conditional expectations by linear projections and unconditional expecta-

tions.

Theorem 6.6.2. Assume {Yt , αt+1 } ∼ LSS, then write

T
at+1 : = P (αt+1 |Y1:t ), Pt+1 := E[(αt+1 − at+1 ) (αt+1 − at+1 ) ],
 T
at|t : = P (αt |Y1:t ), Pt|t := E[ αt − at|t αt − at|t−1 ],

then the sequence {at+1 , Pt+1 } is the same as Algorithm 5.5.3, which had assumed Gaussianity.

Proof. As Var(ω1 ) = Id+r and Var(α1 ) = P1 , so Y1:T , α1:T ∈ H. Assume the result

at := P (αt |Y1:t−1 ),

then the projection error

vt : = Yt − P (Yt |Y1:t−1 )

= Yt − P (Zt αt |Y1:t−1 ) − P ((Ht , 0)ωt |Y1:t−1 ) (6.23)

= Yt − Zt P (αt |Y1:t−1 ), ωt ⊥ Y1:t−1 (6.24)

= Yt − Zt at . (6.25)

= Zt (αt − at ) + εt , εt = (Ht , 0)ωt . (6.26)

This is the first result of the Kalman filter. Then note

vt Zt (αt − at ) + εt , εt = (Ht , 0)ωt ,

so

Ft := E[vt2 ] = Zt Pt|t−1 ZtT + Ht HtT ,


156 CHAPTER 6. LINEARITY

which is the second result of the Kalman filter. Further

Cov(vt , αt − at|t−1 ) = Zt Pt|t−1 .

Now go to work! vt ⊥ Y1:t−1 so

at|t = P (αt |Y1:t )

= P (αt |Y1:t−1 ) + P (αt − at|t−1 |vt )

= at−1 + βt vt , βt = Cov(αt − at|t−1 , vt )Var(vt )−1

= at−1 + Pt|t−1 ZtT Ft−1 vt , recall Cov(vt , αt − at|t−1 ) = Zt Pt|t−1 , Var(vt ) = Ft .

and

Pt|t = Var(αt − at|t )

= Pt−1 − βt Ft βtT .

Start by moving at|t forward in time

at+1 = Tt at|t−1 + Kt vt , Kt = Tt βt = Tt Pt|t−1 ZtT Ft−1 .

These are the third and fourth result in the Kalman filter. Likewise

Pt+1 = Tt Pt TtT + Qt QTt − Tt βt Ft βtT TtT

= Tt Pt TtT + Qt QTt − Kt Ft KtT ,

which is the last result.

Example 6.6.3. Suppose {Xt , αt+1 } is a HMM with

iid
{εt } ⊥⊥ σt2 ,

Xt = σt εt , εt ∼ N (0, 1),

where {σt } is a stationary Markov chain. Then

Yt = Xt2 = σt2 + Ut , Ut = (ε2t − 1)σt2 ,

where {Ut } is a martingale difference sequence if E[σt2 ] < ∞. Further, {Ut } is zero mean, weak noise if
E[σt4 ] < ∞. If, in addition,

2
= µσ2 + ϕ1 σt2 − µσ2 + Vt ,

σt+1 {Ut } ⊥ {Vt } ,

and {Vt } is weak white noise, then {Yt , αt+1 } is in the LSS and the Kalman filter delivers

2 2 2
P (σt+1 |Y1:t ) = P (σt+1 |X1:t ).
6.7. RECAP 157

6.7 Recap

Again we have covered a lot of ground, now focusing entirely on linear methods, based on covariance stationarity.
The main topics have been spectral methods, the Wold decomposition and the Kalman filter.
Table 6.1 contains the major topics covered in this Chapter.Action

Formula or idea Description or name


A + iC Complex random variable
R 2π
Y (t) = 0 qe−iλs dZ(λ) Cramér representation
1
PT iλj t
JT (λj ) = 2πT t=1 yt e Discrete Fourier transform
λ Frequency
at+1 := P (αt+1 |Y1:t )  Kalman filter
λ −λ

PT j
j=1IT (λj )K h
θh (λ) =
b PT  λ −λ 
j
Kernel spectral density estimator
j=1 K h
 
1 − |s|
PB
s=−B B γ es Long-run variance estimator
2
IT (λ) = |JT (λ)| Periodogram
P (Y |X) P∞ Projection
1 iλs
fY Y (λ)
P∞ = 2π s=−∞ e γs Spectral density
Yt = j=0 ψj Ut−j + Vt Wold decomposition

Table 6.1: Main ideas and notation in Chapter 6.

6.8 Appendix: Linear projection


6.8.1 Building the linear projection

Take a step back from time series, returning to Introductory Statistics. Think of the collection of random
variables with a variance:
H = Z : E[Z 2 ] < ∞ .


(Then H is a special case of a Hilbert space.) In this subsection we will make three assumptions:

1. Suppose X1 , ..., Xp ∈H.

2. Let  
 p
X 
M= W = bj Xj : bj ∈ R = sp(1, X1 , ..., Xp ); X0 = 1.
 
j=0

Then M ⊆ H (that is W has a variance!).

3. Assume Y ∈ H.

Now predict Y using X, but constrain ourselves to linear efforts. Measure the error of the prediction using

expected squared error:


h 2 i
EY,X Y − bT X = Y − bT X,Y − bT X , using notation ⟨X, Y ⟩ := E[XY ],
158 CHAPTER 6. LINEARITY

then
  h 2 i D E
γ, β Y |X = arg min EY,X Y − bT X = arg min Y − Yb ,Y − Yb .
b∈Rp+1 b ∈M
Y

Then Yb is the linear least squares projection

Yb = P (Y |X) = µY + β TY |X (X − µX ),

noting

γ = µY − β TY |X µX , Cov(X, Y ) = Var (X) β Y |X .

We will write

UY |X = Y − P (Y |X)

the linear projection error.

Think of P (·|X), as an operator applied to Y . This operator is called a linear projection.

Example 6.8.1. Suppose A ∈ H, then

P (A|X) = µA + β TA|X (X − µX ), β A|X such that Cov(X, A) = Var (X) β A|X .

6.8.2 Some properties of linear projections

The following 5 properties hold for linear projections:

ˆ E[UY |X ] = 0 and E[XUY |X ] = 0p .

Var(UY |X ) = Var(Y ) − β TY |X Cov(X, Y ) = Var(Y ) − β TY |X Var(X)β Y |X

Why?

Var(UY |X ) = Var(Y ) − 2Cov(Y, P (Y |X)) + Var(P (Y |X)) = Var(Y ) − 2β TY |X Cov(X, Y ) + β TY |X Var(X)β Y |X

= Var(Y ) − β TY |X Cov(X, Y ).

When P (·|X) is applied to Z, then

P (Z|X) = µZ + β TZ|X (X − µX ), Cov(X, Z) = Var (X) β Z|X .


P   P
J J
ˆ P j=1 Yj |X = j=1 P (Yj |X).

Why? Think of {Y1 , ..., YJ } as a collection of scalar random variables, each possessing a variance. Then
 
XJ XJ XJ XJ
µZ = µZj , Cov X, Zj  = Cov (X, Zj ) , implying β PJj=1 Zj |X = β Zj |X ,
j=1 j=1 j=1 j=1
6.8. APPENDIX: LINEAR PROJECTION 159

imply    
J
X XJ
Cov X, Zj  = Var (X)  βZ j |X
,
j=1 j=1

which yields the result.

ˆ P (Xj |X) = Xj .

ˆ P (Y |X) = µY , if Cov(X, Y ) = 0p .

6.8.3 Updating from X to the (X, Z)

So far we have made projections using X. What happens if we add to our set of predictors Z. Then the linear
projection of Y using (X, Z) is
   
X µX
P (Y |X, Z) = µY + β TY |X,Z − .
Z µZ

This raises no new issues.


In the special case where Cov(X, Z) = 0 and Var(X) > 0, Var(Z) > 0, then
       
X Cov(X, Y ) Var(X) 0 β Y |X
Cov ,Y = = ,
Z Cov(Z, Y ) 0 Var(Z) β Y |Z

so β TY |X,Z splits

P (Y |X, Z) = µY + β TY |X (X − µX ) + β TY |Z (Z − µZ ) = P (Y |X) + P ((Y − µY ) |Z)

This result is specialized, requiring orthogonality, but orthogonality can be induced. Build the linear
projection errors
UZ|X = Z − P (Z|X),

then by the linear projection properties

Cov(UZ|X , X) = 0.

So then

P (Y |X, Z) = P (Y |X) + P ((Y − µY ) |UZ|X ),

while
Var(UY |X,Z ) = Var(UY |X ) − β TY −µY |UZ|X Var(UZ|X )β Y −µY |UZ|X
160 CHAPTER 6. LINEARITY
Chapter 7

Action: Causality

In introductory statistics regression is about seeing predictors X and estimating where outcome Y is likely to
land. This is passive — we see X and then see Y .

Causality in introductory statistics is about how outcome Y will change as an assignment A is moved. This
is active — we impact the world by altering A and then this pinballs to changing Y .

In time series causality is similar, but different. In prediction, we use past data, Y1:t−1 , as predictors of Yt .
In causality, we change (or imagine changing) an assignment At−s which then moves Yt , that is s ≥ 0 periods
later.

Many areas of the time series literature are causal. Example include:

ˆ treatment effects: estimating average treatment effects under some form of sequential randomization or

unconfoundedness. These treatment effects are often called impulse response functions in the time series
literature.

ˆ control and reinforcement learning: designing a sequence of {At } so a sequence of outcomes {Yt } or utilities

{Ut } obey the designer’s hopes through time.

7.1 Causality and time series assignments


7.1.1 Causal time series system

Start with formalizing the language of a time series experiment based on randomization. This language will be
expressed in terms of potential outcomes, outcomes and assignments.

First think of a setup we will quickly move away from! A basic causal system might be the sequence
T
{Zt,T }t≥1 which is made up of the random variables
 
{Yt (a1:T )}a1:T ∈AT
Zt,T :=  Xt ,
A1:T ∈ AT

161
162 CHAPTER 7. ACTION: CAUSALITY

the p-dimensional a1:T ∈ AT are possible assignment paths from time 1 to time T , the {Yt (a1:T )}a1:T ∈AT are
the time-t potential outcomes and A1:T is the assignment path, while the time-t d-dimensional outcome is
Yt = Yt (A1:T ). The predictor Xt is not causally moved by the assignments (in economics the Xt variable would

sometimes be called exogenous).


This structure is somewhat weird, as the assignments in the future can impact outcomes now. Most
often researchers in time series are context to a priori exclude this in their causal models. This is a type of
non-interference assumption (in the cross-section case noninterference was introduced by Cox (1958a)). It says:

Yt (a1:t , at+1:T ) = Yt (a1:t , a′t+1:T ), for all a1:T , a′t+1:T .

It would then be natural to redefine the notation and write the potential outcomes as Yt (a1:t ). Using that
fundamental restriction, we can define a causal time series system, which will be the center of the action in this
section.

Definition 7.1.1. The causal time series system {Zt }t≥1 is made up of the random variables

Zt := {Yt (a1:t )}a1:t ∈At , Xt , A1:t ∈ At .




Here the p-dimensional a1:t ∈ At are possible treatment paths from time 1 to time t, the {Yt (a1:t )}a1:t ∈At
are the time-t potential outcomes and A1:t is the assignment path, while the time-t -dimensional outcome is
Yt = Yt (A1:t ). The predictor Xt is not causally moved by the assignments.

Example 7.1.2. [Control and treatment] The leading case of this is univariate with A = {0, 1}, then when the
time-t assignment is At = 1 the assignment is usually said to be to treatment and when At = 0 the assignment

is said to be to control. Often researchers call At the treatment, but I prefer the more neutral nomenclature
assignment (and use treatment and control for its standard meanings). I think it is clearer. The left hand side
of Figure 7.1 shows the corresponding potential outcomes associated with a path of these binary assignments
going up to T = 3. That is it plots {Y1 (a1 )}a1 ∈A , then {Y2 (a1:2 )}a1:2 ∈A2 and finally {Y3 (a1:3 )}a1:3 ∈A3 . The

right hand side highlights the path of Yt = Yt (A1:t ), where here A1:3 = (1, 1, 0)T .

The time-t causal effect of moving from assignment path a′1:t to assignment path a1:t is the random

Yt (a1:t ) − Yt (a′1:t ), where a1:t , a′1:t ∈ At .

Bojinov and Shephard (2019) work with this causal time series system to study time series experiments.

They provide many references to the literature, noting the vast majority of the work in this literature looks
at panel data not pure time series. See also Angrist and Kuersteiner (2011), Angrist, Jordà, and Kuersteiner
(2018), Rambachan and Shephard (2021) and Bojinov, Rambachan, and Shephard (2021).
7.1. CAUSALITY AND TIME SERIES ASSIGNMENTS 163

A1:3 = (1, 1, 0)
Y3 (1, 1, 1) Y3 (1, 1, 1)

Y2 (1, 1) Y2 (1, 1)
Y3 (1, 1, 0) Y3 (1, 1, 0)

Y1 (1) Y1 (1)
Y3 (1, 0, 1) Y3 (1, 0, 1)

Y2 (1, 0) Y3 (1, 0, 0) Y2 (1, 0) Y3 (1, 0, 0)

Y2 (0, 1) Y3 (0, 1, 1) Y2 (0, 1) Y3 (0, 1, 1)

Y3 (0, 1, 0) Y3 (0, 1, 0)
Y1 (0) Y1 (0)

Y3 (0, 0, 1) Y3 (0, 0, 1)
Y2 (0, 0) Y2 (0, 0)

Y3 (0, 0, 0) Y3 (0, 0, 0)

Figure 7.1: The left figure shows all the potential outcome paths for T = 3. The right figure shows the observed
outcome path Y1:3 (A1:3 ) where A1:3 = (1, 1, 0)T , indicated by the thick blue line. The gray arrows indicate the
missing data.

Definition 7.1.3. The sequential assignment mechanism is the probability law of

At |A1:t−1 , X1:t , Y1:t−1 .

In experiments the researcher will control this law, in observational studies it will be unknown. Crucially
notice that potential outcomes do not appear in the conditioning set — so assignment is based on observables.

This is in keeping with the very broad causal literature.

Example 7.1.4. [Linear case] Hasbrouck (1991a, Hasbrouck (1991b) studied the causal impact of a buy initiated
trade (on a financial market) on the mid-price just after the buy (or sell). Think of At = 1 if trade-t is a buy
(treatment), At = 0 if sell (control). A recent discussion of the finance literature on this topic included in

Campigli, Bormetti, and Lillo (2022). Hasbrouck modelled


p
X q
X
At = ϕj At−j + βj Yt−j + εt , where p, q ≥ 0.
j=1 j=1

We initially assume we observe the sequences of assignments, predictors and outcomes:

A1:T , X1:T , Y1:T .

Some advanced methods we will discuss in Section 7.2 will try to make causal conclusions just seeing the path

Y1:T — this is traditional in empirical macroeconomics. This can be done under very strong assumptions.
In nearly all of the discussion in this section, the predictors will be entirely ignored to ease exposition.
164 CHAPTER 7. ACTION: CAUSALITY

Example 7.1.5. [Linear case] The causality time series system with linear potential outcomes has

t−1
X
Yt (a1:t ) = θj at−j + Vt ,
j=0

where {Vt } is a stochastic process not causally impacted by a1:t (e.g. it could include Xt or lagged versions of
t−1
Xt ) and {θj }j=0 is a non-stochastic sequence. If Vt = 0, with probability one, then the potential outcomes are
non-stochastic.

We often measure the causal impact of changing the first element of At . Highlighting the causal impact of
other elements of At follows the same logic and raises no new intellectual issues. It is sometimes helpful to

work with the ultra compact notation: lag-s potential outcomes are defined as

Yt,s (at−s,1 ) = Yt (A1:t−s−1 , (at−s,1 , At−s,2:p ) , At−s+1:t ), at−s,1 ∈ A1 ,

which looks at how Yt varies as at−s,1 moves.

The insightful notation {Yt,s (at−s,1 )}at−s,1 ∈A1 appears in the work of Angrist and Kuersteiner (2011) and

Angrist, Jordà, and Kuersteiner (2018). Then Yt = Yt,s (At−s,1 ), noticing that Yt,s is not the s-th element of
Yt . This lag-s potential outcome notation buries many details, so needs to be used with thought — this is not
a non-interference assumption, instead the deep dependence on A1:t−s−1 , At−s,2:p , At−s+1:t are just suppressed.

A popular causal effect is the lag-s average treatment effect.

Definition 7.1.6. [lag-s average treatment effect] Assuming the causal system, the lag-s average treatment
effect of moving the first element of At , written At−s,1 , from 0 to 1 is, at time t,

τt,s := E[Yt,s (1)] − E[Yt,s (0)].

Example 7.1.7. [Linear case] (Continuing Example 7.1.5) The causality time series system with linear potential
outcomes has
   
at−s,1 0
 0  s−1
X  At−s,2  t−1
X
Yt,s (at−s,1 ) = θs   + Ut , Ut = θj At−j + θs  + θj At−j + Vt .
   
.. .. 
 .  j=0
 .  j=s+1
0 At−s,p

Then the lag-s causal effect is the non-stochastic


 
1
 0 
Yt,s (1) − Yt,s (0) = θs  .
 
..
 . 
0
7.1. CAUSALITY AND TIME SERIES ASSIGNMENTS 165

7.1.2 m-order causality

m-order causality restricts our study of causal impacts of assignments on outcomes to not more than m periods.

Assumption 7.1.8. Assume a causal time series system. It is m-order causal if, for each t,

Yt (a1:t−m−1 , at−m:t ) = Yt (a′1:t−m−1 , at−m:t ), for all a1:t , a′1:t−m−1 .

Again it is a time series version of a non-interference assumption (Cox (1958b)). A m-th order causal time series
system write the time-t potential outcome using the shorthand:

Yt (at−m:t ),

burying the irrelevance of a1:t−m−1 .

h 7.1.9. m-order causality says nothing about the probability law of the A1:T , X1:T , Y1:T — so it has no direct
relationship to m-order Markov processes or m-dependence. However, it shares some of the spirit of those
methods, here limiting causal effects to m lags. If you are worried about this restriction, think of m as a

billion.

Under m-th order causality, the causal system {Zt }t≥1 simplifies to
 
{Yt (at−m:t )}at−m:t ∈Am+1
Zt =  Xt , Yt = Yt (At−m:t ).
At−m:t ∈ Am+1

The dimension of Zt does not change with t. Thus under m-order causality, it is possible to think of {Zt }
as infinitely lived without having to deal with the technicality of infinitely dimensional objects — opening up
the possibility of working with a stationarity asusmption. The causal effect of moving from assignment path

a′t−m:t to assignment path at−m:t is Yt (at−m:t ) − Yt (a′t−m:t ).


I (alone!) regard the m-th order causal time series system as the work horse for causal studies for time
series.

Definition 7.1.10. The (time-invariant) linear m-order causal time series system (with no predictors), has
m
X p
X q
X
Yt (at−m:t ) = θj at−j + Vt , At = ϕj At−j + βj Yt−j + εt , Yt = Yt (At−m:t ).
j=0 j=1 j=1

Notice that in the linear system time-t assignment instantly impacts the time-t potential outcome, but
outcomes only impact assignments with a lag. This is crucial. It needs to be justified in the applied context

this model is used. The outcome and causal effects are, respectively,
m
X m
X
and Yt (at−m:t ) − Yt (a′t−m:t ) = θj at−j − a′t−m:t .

Yt = θj At−j + Vt
j=0 j=0
166 CHAPTER 7. ACTION: CAUSALITY

h 7.1.11. If the {At } are i.i.d. and Vt = 0 for all t, then the time series {Yt }t=1 is a vector MA(m) process.
For a moment think of the MA(1) case to simplify the exposition, then

γ0 = Var(Yt ) = θ0 Var(A1 )θ0T + θ1 Var(A1 )θ1T , γ1 = Cov(Yt , Yt−1 ) = θ1 Var(A1 )θ0T .

In this i.i.d. assignment case, if we see the time series of assignments we can estimate Var(A1 ), while the

outcomes allow us to estimate γ0:1 . Taken together they allow us to learn θ0:1 . This carries over to the MA(m)
case. Game over, we are then causal heroes. But the reach of this strategy is limited. If we do not see
m
{At } we cannot learn Var(A1 ) and so there is no way of splitting the θs apart from Var(A1 ), so {θs }s=0 are not
identified. This is because θ0 is not constrained to be Id in the MA(m) process. This point is an animating

observation in much of the macroeconometric causal literature. Notice that θ0 should not be expected to be
the identity matrix! This is not the textbook linear moving average seen in statistics, although our treatment
of linear MA(q) processes in previous chapters has often allowed a flexible θ0 .

7.1.3 Stationary m-order causality

Under m-th order causality, if the law of the causal system {Zt } is strictly stationary, then the lag-s average
treatment effects do not depend upon t, and so write them as

τs = E[Yt,s (1)] − E[Yt,s (0)],

using the lag-s potential outcome notation. Due to the m-th order causality assumption τs = 0, s > m. Finally,

the causal triple (the assignment, the predictors and the outcome), form a sequence {At , Xt , Yt }t≥1 which is
strictly stationary.

Much of our work will be based around the m-order causal system under stationarity.

Example 7.1.12. [Linear case] (Example 7.1.4 continued) Hasbrouck (1991a, Hasbrouck (1991b) defined the
univariate outcome Yt as the observed mid-quote return (after the t-trade, (qt − qt−1 )/qt−1 , where qt is the

mid-price of the t-th trade). Hasbrouck assumed returns were linear in the buy/sell indicator. Think of the
structure as a linear m-order causal time series system, with p = 2, q = 1 and m = 1, to make keeping track
easier. Then

Yt (at−1:t ) = (θ1 , θ0 ) at−1:t + Vt , so Yt = (θ1 , θ0 ) At−1:t + Vt ,


7.1. CAUSALITY AND TIME SERIES ASSIGNMENTS 167

implying
 
1
 0 
τ0 = Yt {At−1 , (1, At,2:p )} − Yt {At−1 , (0, At,2:p )} = θ0  ,
 
..
 . 
0
 
1
 0 
τ1 = Yt {(1, At−1,2:p ) , At } − Yt {(0, At−1,2:p ) , At−1 } = θ1  ,
 
..
 . 
0
while

At = (ϕ2 , ϕ1 ) At−2:t−1 + β1 Yt−1 + εt = (ϕ1 + β1 θ0 ) At−1 + (ϕ2 + β1 θ1 ) At−2 + β1 Vt−1 + εt .

The regression of outcomes on assignments yields

βYt ∼At−1:t = Cov(Yt , At−1:t )Var(At−1:t )−1 = {Cov((θ1 , θ0 ) At−1:t , At−1:t ) + Cov(Vt , At−1:t )} Var(At−1:t )−1

= (θ1 , θ0 ) + βVt ∼At−1:t ,

where βVt ∼At−1:t is Cov(Vt , At−1:t )Var(At−1:t )−1 . This is not what we want from a causal perspective. Extra
conditions are needed to encourage βVt ∼At−1:t to be zero. This will be the topic of the next subsection.

Linearity makes things easier to think about. But it is not really the point.

Example 7.1.13. [Non-linear potential outcomes] The univariate m-th order causality system with strict
stationarity means that
Yt (at−m:t ) = g(at−m:t , Vt ),

where {Vt } is strictly stationary. The Yt (at−s,1 ) := g(At−m:t−s−1 , at−s , At−s+1:t , Vt ), so Yt,s (1) − Yt,s (0) is a

strictly stationary process (in the linear case it was non-stochastic!), while

τs = E[g(At−m:t−s−1 , 1, At−s+1:t , Vt )] − E[g(At−m:t−s−1 , 0, At−s+1:t , Vt )].

The special case of a time-separable potential outcomes


m
X
Yt (at−m:t ) = θj h(at−j ),
j=0

where h(x) ≥ 0 for every x, is important in practice. Then

Yt,s (a) − Yt,s (0) = θs {h(a) − h(0)} , a ∈ R,

which is non-stochastic, due to the time-seperability of the non-linear model. A special case of the time-

seperable structure is the ARCH(m) process (Engle (1982) and Bollerslev (1986)), when h(x) = x2 . The study
of the impact of assignments in volatility models was initiated by Engle and Ng (1993) — although it was not
really couched in a causal language and again, like in macroeconometrics, those authors assume they only see
the outcome time series.
168 CHAPTER 7. ACTION: CAUSALITY

7.1.4 Sequential assignment


Sequential randomization

Randomization in Fisher (1925)’s randomized control trials selects assignments randomly, independent of ev-
erything else in the world (e.g. by drawing assignments randomly on a computer) — and thus are independent

of potential outcomes. What is the time series version of this?

Definition 7.1.14. Assume an m-order causal time series system. Sequential randomization is where

{Yt (at−m:t )}at−m:t ∈Am+1 ⊥⊥ At−m:t .

The independence assumption is between the path of assignments At−m:t and the unobservable potential

outcomes {Yt (at−m:t )}at−m:t ∈Am+1 . Validation of this assumption must come from outside the joint distribution
of the A1:T , Y1:T , e.g. I conduct an experiment and draw the path At−m:t randomly on a computer as an autore-
gression. It is crucial to understand we do not need assignments to be i.i.d. to get sequential randomization

— just independent of the potential outcomes!

Example 7.1.15. [Linear case] (Example 7.1.12 continued) In the linear m-order causal time series system
p
X q
X
Yt (at−m:t ) = (θm , ..., θ0 ) at−m:t + Vt , At = ϕj At−j + βj Yt−j + εt , Yt = Yt (At−m:t ),
j=1 j=1

then if

Vt ⊥⊥ At−m:t , (7.1)

sequential randomization holds. One way to implying (7.1) is to assume that:

ind
{Vt , εt } ∼ over t, Vt ⊥⊥ εt .

These two assumptions play different roles:

(a) dynamics: assignments can depend upon lagged outcomes, so lagged Vt can be inside the assignments, which
ind
could induce correlation between Vt and the assignments. The {Vt , εt } ∼ assumption removes this danger.
(b) contemporaneous: if the Vt , εt are contemporaneously dependent it would immediately induce dependence
between At and Vt .

For linear models estimated using linear methods, the conditions can be expressed using second moments:
{Vt , εt } are weak white noise and Cov(Vt , εt ) = 0.

The core takeaway from Example 7.1.15 is that the sequential randomization restrictions are made on the
innovations in the model!
7.1. CAUSALITY AND TIME SERIES ASSIGNMENTS 169

Lag-s randomization

Sometimes we will see a different set of assumptions, which are written in terms of At−s,1 and the lag-s potential

outcomes {Yt,s (at−s,1 )}at−s,1 , recalling

Yt,s (at−s,1 ) = Yt (A1:t−s−1 , (at−s,1 , At−s,2:p ) , At−s+1:t ).

This is attractive as At−s,1 is the assignment we most centrally focus on in terms of causal inference and
Yt,s (at−s,1 ) is the corresponding outcome. Hence it is tempting to work with lag-s sequential randomization.

Definition 7.1.16. Assume an m-order causal time series system. Lag-s randomization is where

{Yt,s (at−s,1 )}at−s,1 ⊥⊥ At−s,1 . (7.2)

Lag-s randomization is the focus of Angrist and Kuersteiner (2011) and Angrist, Jordà, and Kuersteiner
(2018). It is pretty but it sure is hard to think about if we want to validate it in practical models. This kind
of assumption is at the heart of the assumptions made in linear causal models in macroeconometrics, discussed

in Section 7.2.

Remark 23. Sufficient for (7.2) is that

(At−m:t−s−1 , At−s,2:p , At−s+1:t , Vt ) ⊥⊥ At−s,1 .

It is difficult to see how that will hold without making the assignments $\{A_t\}_{t=1}^{T}$ and system noise $\{V_t\}$ obey something like:
(1) $A_{t-m:t}$ are independent vectors through time;
(2) $A_{t-s} \perp\!\!\!\perp V_t$;
(3) the element $A_{t-s,1} \perp\!\!\!\perp A_{t-s,2:p}$.
This would imply that the outcomes $\{Y_t\}$ follow a moving average type process.

The core takeaway from Remark 23 is that the lag-s randomization restrictions are made on the assignments

themselves!
Example 7.1.17 works through the linear model to make the lag-s randomization assumption concrete.

Example 7.1.17. [Linear case] Under an m-th order causal system with linear lag-s potential outcomes
$$Y_{t,s}(a_{t-s,1}) = \theta_s \begin{pmatrix} a_{t-s,1} \\ 0 \\ \vdots \\ 0 \end{pmatrix} + U_t, \qquad U_t = \sum_{j=0}^{s-1}\theta_j A_{t-j} + \theta_s \begin{pmatrix} 0 \\ A_{t-s,2} \\ \vdots \\ A_{t-s,p} \end{pmatrix} + \sum_{j=s+1}^{m}\theta_j A_{t-j} + V_t,$$
so lag-s randomization is achieved, for all t, s, by assuming
$$U_t \perp\!\!\!\perp A_{t-s,1}.$$

That looks easy! But there is a lot in $U_t$. Sufficient conditions for this are Assumptions (1)-(3) of Remark 23. Add to these the conditions that $V_t = 0$ for all $t$ and that $\{A_s\}_{s=1}^{T}$ are i.i.d.; then
$$Y_t = Y_{t,s}(A_{t-s}) = \sum_{j=0}^{m}\theta_j A_{t-j}, \qquad \mathrm{Var}(A_t) = D = \mathrm{diag}(\sigma_1^2, ..., \sigma_p^2),$$
assuming the variance exists. In the literature this is called a structural linear m-th order moving average (written SMA(m)) process.

The same Assumptions (1)-(3) of Remark 23 yield sequential randomization in the non-linear causal system in Example 7.1.13; linearity is not the point. The latter point seems a tad underappreciated in the econometrics literature, but it is at the heart of Angrist and Kuersteiner (2011) and Angrist, Jordà, and Kuersteiner (2018).

Sequential unconfoundedness

Unconfoundedness is at the core of most cross-sectional observational studies. There we make assignments
independent of potential outcomes conditional on the predictors. What is the time series version of that?
The time series version of unconfoundedness is sequential unconfoundedness. Here the conditioning is on the

filtration of the observations and the time t − s predictor Xt−s . This is an example of selection on observables.

Definition 7.1.18. Assume an order-m causal time series system. Think of m ≥ s. Sequential unconfoundedness is where
$$\{Y_t(A_{t-m:t-s+1}, a_{t-s:t})\}_{a_{t-s:t}\in\mathcal{A}^{s+1}} \perp\!\!\!\perp A_{t-s:t} \,\big|\, \mathcal{F}^{Y,A,X}_{t-s-1}, X_{t-s},$$
where $\mathcal{F}^{Y,A,X}_{t-s-1}$ is the history of the assignments, predictors and outcomes (not potential outcomes) up to time t − s − 1.

The really handy aspect of sequential unconfoundedness is that it only refers to the assignments At−s:t , not
all the way back to t − m.

Example 7.1.19. [Linear case] Return to the linear m-order causal time series system:
$$Y_t(a_{t-m:t}) = \sum_{j=0}^{m}\theta_j a_{t-j} + V_t, \qquad A_t = \sum_{j=1}^{p}\phi_j A_{t-j} + \sum_{j=1}^{q}\beta_j Y_{t-j} + \varepsilon_t, \qquad Y_t = Y_t(A_{t-m:t}).$$
When we condition on $\mathcal{F}^{Y,A}_{t-s-1}$, many terms above become non-stochastic. If
$$\left(V_t \perp\!\!\!\perp A_{t-s:t}\right) \big|\, \mathcal{F}^{Y,A}_{t-s-1}, \qquad (7.3)$$
then, for each t,
$$\beta_{Y_t \sim A_{t-s:t}|\mathcal{F}^{Y,A}_{t-s-1}} = \mathrm{Cov}(Y_t, A_{t-s:t}|\mathcal{F}^{Y,A}_{t-s-1})\,\mathrm{Var}(A_{t-s:t}|\mathcal{F}^{Y,A}_{t-s-1})^{-1} = (\theta_s, ..., \theta_0).$$
In the linear model, the conditions can be expressed in terms of second conditional moments:
$$\mathrm{Cov}(V_t, \varepsilon_t|\mathcal{F}^{Y,A}_{t-1}) = 0, \quad \text{and} \quad \{V_t, \varepsilon_t\} \text{ is MD with respect to } \mathcal{F}^{Y,A}_{t}.$$

Lag-s unconfoundedness

The extension of lag-s randomization to lag-s unconfoundedness is important, again knocking out the assump-

tions on the assignments At−m:t−s+1 .

Definition 7.1.20. Assume an m-order causal time series system. Lag-s unconfoundedness is where
$$\{Y_{t,s}(a_{t-s,1})\}_{a_{t-s,1}} \perp\!\!\!\perp A_{t-s,1} \,\big|\, \mathcal{F}^{Y,A,X}_{t-s-1}, X_{t-s}. \qquad (7.4)$$
Even though $A_{t-m:t-s+1}$ no longer features, very strong assumptions are still needed to validate it, now all stated conditioning on $\mathcal{F}^{Y,A,X}_{t-s-1}, X_{t-s}$. To see this, again note
$$Y_{t,s}(a_{t-s,1}) = Y_t(A_{t-m:t-s-1}, (a_{t-s,1}, A_{t-s,2:p}), A_{t-s+1:t}),$$
so now the remaining randomness includes
$$V_t, A_{t-s,2:p}, A_{t-s+1:t} \,\big|\, \mathcal{F}^{Y,A,X}_{t-s-1}, X_{t-s}.$$

7.1.5 Estimating under sequential assignment


Estimating under sequential randomization

Think about estimating the expected causal effect of moving from assignment $a'_{t-m:t}$ to $a_{t-m:t}$,
$$E[Y_t(a_{t-m:t})] - E[Y_t(a'_{t-m:t})],$$
under
(1) a strictly stationary m-th order causal time series system;
(2) sequential randomization;
(3) the data being $A_{1:T}, Y_{1:T}$. The predictors $\{X_t\}$ play no role, so they will be dropped here.
Define:
$$\begin{aligned}
\mu(a_{t-m:t}) &= E[Y_t(a_{t-m:t})], && \text{strict stationarity} \\
&= E[Y_t(a_{t-m:t})\,|\,A_{t-m:t} = a_{t-m:t}], && \text{sequential randomization} \\
&= E[Y_t(A_{t-m:t})\,|\,A_{t-m:t} = a_{t-m:t}] \\
&= E[Y_t\,|\,A_{t-m:t} = a_{t-m:t}],
\end{aligned}$$
then
$$E[Y_t(a_{t-m:t})] - E[Y_t(a'_{t-m:t})] = \mu(a_{t-m:t}) - \mu(a'_{t-m:t}).$$

This can be estimated using two non-parametric regressions. So we are back to standard time series. No new
ideas are needed going forward.

Example 7.1.21. Assume the linear m-order causal time series system and sequential randomization; if the second moments exist, then
$$\beta_{Y_t \sim A_{t-m:t}} = \mathrm{Cov}(Y_t, A_{t-m:t})\,\mathrm{Var}(A_{t-m:t})^{-1} = (\theta_m, ..., \theta_0).$$
This implies that
$$E[Y_t(a_{t-m:t})] - E[Y_t(a'_{t-m:t})] = \beta_{Y_t \sim A_{t-m:t}}\left(a_{t-m:t} - a'_{t-m:t}\right).$$


A difficulty here is that m could be quite large. Using sequential unconfoundedness would reduce this problem in practice.
If the focus is on the less ambitious lag-s average treatment effect,
$$\tau_s = E[Y_{t,s}(1)] - E[Y_{t,s}(0)],$$
then
$$\begin{aligned}
E[Y_{t,s}(a_{t-s})] &= E[Y_t(A_{t-m:t-s-1}, a_{t-s}, A_{t-s+1:t})], && \text{recalling } Y_{t,s}(a_{t-s}) = Y_t(A_{t-m:t-s-1}, a_{t-s}, A_{t-s+1:t}), \\
&= E[E[Y_t(a_{t-m:t})\,|\,A_{t-m:t} = a_{t-m:t}]\,|\,A_{t-s} = a_{t-s}], && \text{sequential randomization} \\
&= E[E[Y_t\,|\,A_{t-m:t}]\,|\,A_{t-s} = a_{t-s}] \\
&= E[\mu(A_{t-m:t})\,|\,A_{t-s} = a_{t-s}].
\end{aligned}$$
So
$$\tau_s = E[\mu(A_{t-m:t})\,|\,A_{t-s} = 1] - E[\mu(A_{t-m:t})\,|\,A_{t-s} = 0]. \qquad (7.5)$$

From a statistical perspective, this is a much easier estimand: the $A_{t-m:t-s-1}, A_{t-s+1:t}$ are averaged out, which improves statistical precision.
Think through how Equation (7.5) allows $\tau_s$ to be estimated:
• estimate $\mu(a_{t-m:t})$ non-parametrically, yielding $\hat\mu$;
• split $\hat\mu(A_{t-m:t})$ according to the observed $A_{t-s}$ (a numerical sketch of this recipe follows below), e.g.
$$\frac{\sum_{t=m+1}^{T} \hat\mu(A_{t-m:t})1(A_{t-s} = 1)}{\sum_{t=m+1}^{T} 1(A_{t-s} = 1)} - \frac{\sum_{t=m+1}^{T} \hat\mu(A_{t-m:t})1(A_{t-s} = 0)}{\sum_{t=m+1}^{T} 1(A_{t-s} = 0)}.$$
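A minimal plug-in sketch of this recipe (illustrative binary DGP with m = 1, so $\mu$ can be estimated by cell means; the θ values are made up and the t = 0 edge term is padded with zero):

```python
import numpy as np

rng = np.random.default_rng(7)
T = 100_000
A = rng.integers(0, 2, size=T).astype(float)    # i.i.d. Bernoulli(1/2) assignments
A_lag = np.concatenate([[0.0], A[:-1]])         # A_{t-1}
Y = 1.0 * A + 1.5 * A_lag + rng.normal(size=T)  # theta_0 = 1, theta_1 = 1.5

# estimate mu over the four (A_{t-1}, A_t) cells by cell means
cell = (2 * A_lag + A).astype(int)
mu_hat = np.empty(T)
for k in range(4):
    mu_hat[cell == k] = Y[cell == k].mean()

on = A_lag == 1                                 # split by A_{t-s} with s = 1
print(mu_hat[on].mean() - mu_hat[~on].mean())   # approx theta_1 = 1.5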

Example 7.1.22. Assume the linear m-order causal time series system and sequential randomization; if the second moments exist,
$$\begin{aligned}
\tau_s &= E[Y_{t,s}(1)] - E[Y_{t,s}(0)] \\
&= \beta_{Y_t \sim A_{t-m:t}}\left\{E[A_{t-m:t}\,|\,A_{t-s} = 1] - E[A_{t-m:t}\,|\,A_{t-s} = 0]\right\}.
\end{aligned}$$
The $E[A_{t-m:t}|A_{t-s} = 1]$ is easy to estimate, element by element. If m is large, the regressions could potentially be estimated with some form of shrinkage, e.g. ridge regression or the Lasso.

Estimating under lag-s randomization

Here we will think about estimating
$$\tau_s = E[Y_{t,s}(1)] - E[Y_{t,s}(0)]$$
under
(1) a strictly stationary m-th order causal time series system;
(2) lag-s randomization;
(3) the data being $A_{1:T}, Y_{1:T}$. The predictors $\{X_t\}$ play no role, so they will be dropped here.
Now
$$\begin{aligned}
\tau_s &= E[Y_{t,s}(1)] - E[Y_{t,s}(0)] \\
&= E[Y_{t,s}(1)\,|\,A_{t-s,1} = 1] - E[Y_{t,s}(0)\,|\,A_{t-s,1} = 0], && \text{lag-s randomization} \\
&= E[Y_t\,|\,A_{t-s,1} = 1] - E[Y_t\,|\,A_{t-s,1} = 0] \\
&= \mu_s(1) - \mu_s(0), && \mu_s(a) := E[Y_t\,|\,A_{t-s,1} = a], \text{ introducing notation.}
\end{aligned}$$

Under this model,
$$(A_1, Y_1), ..., (A_T, Y_T)$$
are pairs from the stationary distribution. This means the causal $\tau_s$ is just the difference of a couple of traditional time series estimands, expressed entirely in terms of observables:
$$\mu_s(1) = E[Y_t\,|\,A_{t-s,1} = 1] \quad \text{and} \quad \mu_s(0) = E[Y_t\,|\,A_{t-s,1} = 0].$$
Examples of estimators are:
• [non-parametrics] kernel regression of the output $Y_t$ on the assignment $A_{t-s,1}$ evaluated at 1 and 0, e.g.
$$\hat\tau_s = \frac{\sum_{t=s+1}^{T} Y_t 1(A_{t-s,1} \in (1-h, 1+h])}{\sum_{t=s+1}^{T} 1(A_{t-s,1} \in (1-h, 1+h])} - \frac{\sum_{t=s+1}^{T} Y_t 1(A_{t-s,1} \in (-h, h])}{\sum_{t=s+1}^{T} 1(A_{t-s,1} \in (-h, h])},$$
for a small bandwidth h > 0. If the assignment $A_{t-s,1}$ is binary, then the simpler version
$$\hat\tau_s = \frac{\sum_{t=s+1}^{T} Y_t 1(A_{t-s,1} = 1)}{\sum_{t=s+1}^{T} 1(A_{t-s,1} = 1)} - \frac{\sum_{t=s+1}^{T} Y_t 1(A_{t-s,1} = 0)}{\sum_{t=s+1}^{T} 1(A_{t-s,1} = 0)}$$
can be used.

• [linear projection] instead of estimating $\tau_s$ using conditional expectations, approximate $\mu_s(a) = E[Y_t|A_{t-s,1} = a]$ using a linear projection,
$$\mu^L_s(a) := E[Y_t] + (a - E[A_{t-s,1}])\,\frac{\mathrm{Cov}(Y_t, A_{t-s,1})}{\mathrm{Var}(A_{t-s,1})},$$
then
$$\tau^L_s := \mu^L_s(1) - \mu^L_s(0) = \frac{\mathrm{Cov}(Y_t, A_{t-s,1})}{\mathrm{Var}(A_{t-s,1})}.$$
Notice $\tau^L_s$ is not directly causal. The $\tau^L_s$ can be estimated by a least squares regression of $\{Y_t\}$ on $\{A_{t-s,1}\}$:
$$\hat\tau^L_s = \frac{\sum_{t=s+1}^{T}(Y_t - \bar{Y})(A_{t-s,1} - \bar{A})}{\sum_{t=s+1}^{T}(A_{t-s,1} - \bar{A})^2}.$$
(Both estimators are sketched in the code below.)
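A minimal sketch of the two estimators above, under illustrative scalar SMA(2) DGPs where lag-s randomization holds by construction (the θ values are made up):

```python
import numpy as np

rng = np.random.default_rng(1)
T = 100_000
theta = np.array([0.3, 1.5, -0.7])                  # theta_0, theta_1, theta_2

# (i) binary assignments: difference-of-means estimator of tau_s, s = 1
A = rng.integers(0, 2, size=T).astype(float)        # i.i.d. Bernoulli(1/2)
Y = np.convolve(A, theta)[:T] + rng.normal(size=T)  # Y_t = sum_j theta_j A_{t-j} + V_t
s = 1
on = A[: T - s] == 1                                # indicator of A_{t-s} = 1
print(Y[s:][on].mean() - Y[s:][~on].mean())         # approx theta_1 = 1.5

# (ii) continuous assignments: local projection, OLS of Y_t on A_{t-s}
A = rng.normal(size=T)
Y = np.convolve(A, theta)[:T] + rng.normal(size=T)
for s in range(3):
    y, a = Y[s:], A[: T - s]
    a_c = a - a.mean()
    print(s, (a_c @ (y - y.mean())) / (a_c @ a_c))  # approx theta_s
```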

Remark 24. Regressing outcomes on lagged assignments appears in the work of Jordà (2005) under the title "local projection." Notice it is not really local in any statistical sense (we have assumed stationarity); it is simply a linear projection, used twice. This local projection approach has been very influential in modern research. Recent papers on this include Olea and Plagborg-Moller (2021) and Plagborg-Moller and Wolf (2021).

7.2 Structural models: SVAR and SVMA

The above goes from assignments to outcomes. The basic conception is that you randomize, see the assignments and trace out the resulting outcomes. In the linear model with lag-s randomization this yields
$$Y_t = \sum_{j=0}^{m}\theta_j A_{t-j} + V_t,$$
with $V_t \perp\!\!\!\perp A_{t-s,1}$; $A_{t-s,2:p} \perp\!\!\!\perp A_{t-s,1}$; and $\{A_t\}$ being independent through time.
If we would like to infer not just the causal impact of $A_{t-s,1}$ but of all of $A_{t-s}$, in turn, then these conditions need to strengthen to $V_t \perp\!\!\!\perp A_{t-s}$ and
• the $A_{t-s,j}$ are independent over j;
• $\{A_t\}$ are independent through time.
Then
$$\beta_{Y_t \sim A_{t-s}} = \theta_s,$$

the matrix of lag-s average treatment effects. This section will have much of this spirit: the same model will pop up, but with $V_t$ set to 0. But the practice will be really different.
There will be models, and things called structural shocks will appear which look like assignments. This is mere labelling. The big idea will be that the time series properties of $\{Y_t\}$ alone, plus a causal story, are enough to yield estimators of causal quantities. There is no need to see the assignments. This is a vast and powerful step. It is also somewhat fundamentally fragile.
In the treatment below, I will cut everything back to its very simplest case, focusing on a single lag. You

have enough knowledge of time series that you know how to extend things to longer lags and really they raise
no new intellectual issues.

7.2.1 One model, three rewrites

The core models in this literature do not use the language of potential outcomes, so I will not force it upon the

models either. But I will again use the notation {At } to drive everything: now these former assignments will
be called structural shocks!

Definition 7.2.1. A d-dimensional structural first-order autoregressive process (SVAR(1); we follow the old-style convention of this literature of putting V, for vector, in the names of processes) is where
$$BY_t = \Gamma_1 Y_{t-1} + A_t, \qquad A_t \overset{iid}{\sim},$$
where B is a square matrix and the process $\{A_t\}$ are called structural shocks, obeying the constraint that the $A_{t,j}$ are independent over j and possess a variance,
$$\mathrm{Var}(A_1) = D,$$
a diagonal matrix. Typically B is constrained to have 1s on the leading diagonal and is assumed to be invertible.

Remark 25. Making B have unit leading diagonal elements, e.g. when d = 2,
$$B = \begin{pmatrix} 1 & b_{1,2} \\ b_{2,1} & 1 \end{pmatrix},$$
just constrains the scaling of the model, so can be done without any loss of flexibility, and makes the model easier to interpret. Constraining Var(A1) to be diagonal makes sense when trying to think about the shocks as assignment-type objects. We saw this type of assumption frequently above; it appeared due to the lag-s randomization and linearity assumptions.

The SVAR comes out of one of the main econometric traditions: simultaneous equation models (SEMs),
which were significantly developed in the late 1940s and 1950s, e.g. Haavelmo (1943) and Koopmans (1950).
That case is where Γ1 = 0.
Rewrite the SVAR, assuming the matrix B is invertible, into the more conventional VAR(1)
$$Y_t = \phi_1 Y_{t-1} + \varepsilon_t,$$
where
$$\phi_1 = B^{-1}\Gamma_1, \qquad \varepsilon_t = B^{-1}A_t, \qquad \Sigma = \mathrm{Var}(\varepsilon_1) = B^{-1}D\left(B^{-1}\right)^{T}.$$

This parameterization is often called "reduced form", again mimicking the nomenclature of the earlier SEM literature. If the eigenvalues of $\phi_1$ are inside the unit circle, then the VAR(1) can, of course, be written as a VMA(∞),
$$Y_t = \sum_{j=0}^{\infty}\psi_j\varepsilon_{t-j}, \qquad \psi_j = \left(B^{-1}\Gamma_1\right)^j; \quad j = 0, 1, 2, ...$$
There is nothing new here.


To get back to the original shocks $A_t$, because that is typically what a causal time series researcher will care about, we can rewrite
$$\psi_j\varepsilon_{t-j} = \psi_j B^{-1}B\varepsilon_{t-j} = \psi_j B^{-1}A_{t-j},$$
so
$$Y_t = \sum_{j=0}^{\infty}\theta_j A_{t-j}, \qquad \theta_j = \psi_j B^{-1} = \left(B^{-1}\Gamma_1\right)^j B^{-1}; \quad j = 0, 1, 2, ....$$

Crucially, recall Var(A1) = D, and now there is no reason to expect θ0 to be Id. This structure is called a structural moving average (SVMA(∞)). We have already studied its probabilistic properties at enormous length; it is just an infinite order moving average process. It raises no new issues.

The impulse response function is
$$\frac{\partial Y_t}{\partial A_{t-s}^{T}} = \theta_s = \psi_s B^{-1} = \phi_1^s B^{-1}.$$
The $\theta_s$ appears in the SVMA. It is a causal quantity, due to the assumptions in the SVAR. Of course the SVMA is not easy to estimate from the path of $\{Y_t\}$ alone, as $\theta_0$ is not the identity matrix. The $\psi_s$ appears in the VMA, so is estimable, likewise the $\phi_1$ from the VAR. But there is also the $B^{-1}$ matrix here.
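The chain of rewrites is easy to check numerically. Below is a small sketch with illustrative $(B, \Gamma_1, D)$ values (not from the text), mapping the SVAR parameters to $\phi_1$, $\Sigma$ and the causal impulse responses $\theta_s = \phi_1^s B^{-1}$.

```python
import numpy as np

B = np.array([[1.0, 0.5],
              [-0.2, 1.0]])              # unit leading diagonal, illustrative
Gamma1 = np.array([[0.5, 0.1],
                   [0.0, 0.4]])
D = np.diag([1.0, 2.0])                  # diagonal shock variance

Binv = np.linalg.inv(B)
phi1 = Binv @ Gamma1                     # VAR(1) autoregressive matrix
Sigma = Binv @ D @ Binv.T                # Var(eps_t), reduced-form variance

# causal impulse responses theta_s = phi_1^s B^{-1}
for s in range(5):
    print(f"theta_{s} =\n{np.linalg.matrix_power(phi1, s) @ Binv}")
```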

7.2.2 Identification and the SVMA

Before we resolve that pesky issue of B, let us take stock. We have defined four models, simply by reparameterizing the original SVAR! It is easy to get lost in the blizzard of models, so the summary in Table 7.1 helps (at least me).
An enormous amount of attention is paid in this literature to the difficulty of learning B, Γ1 and D from an infinite amount of data: what is often called identification in statistics. That this is tricky can be seen from a simple example.

Model                                         Name of model   Memory                                      Variance
$BY_t = \Gamma_1 Y_{t-1} + A_t$               SVAR                                                        $\mathrm{Var}(A_1) = D$ diagonal
$Y_t = \phi_1 Y_{t-1} + \varepsilon_t$        VAR             $\phi_1 = B^{-1}\Gamma_1$                   $\mathrm{Var}(\varepsilon_1) = B^{-1}D(B^{-1})^T$
$Y_t = \sum_{j=0}^{\infty}\psi_j\varepsilon_{t-j}$   VMA      $\psi_j = (B^{-1}\Gamma_1)^j$               $\mathrm{Var}(\varepsilon_1) = B^{-1}D(B^{-1})^T$
$Y_t = \sum_{j=0}^{\infty}\theta_j A_{t-j}$   SVMA            $\theta_j = (B^{-1}\Gamma_1)^j B^{-1}$      $\mathrm{Var}(A_1) = D$

Table 7.1: Reparameterizations of the SVAR model

Example 7.2.2. Think about the d = 2 case. Then B has 2 free elements (ones on the leading diagonal), Γ1 has 4 and D has 2 (diagonal matrix). Thus the SVAR(1) has 8 parameters. The VAR(1) only has 7: four from ϕ1 and three from Var(ε1) (due to the symmetry of the matrix). All the VAR(1) parameters can be estimated from the data. But that is not enough to pin down all aspects of the SVAR(1). 8 into 7 does not go, so the SVAR(1) is not identified; that is, with an infinite amount of data ϕ1 and Var(ε1) can be learnt, but that is not enough to determine all of B, Γ1 and D.

The general issue is outlined in Table 7.2, which counts parameters for the (unconstrained) SVAR(1) and the VAR(1):

Unconstrained SVAR        Upper triangular B: SVAR        VAR
B: $d^2 - d$              B: $d(d-1)/2$                   $\phi_1$: $d^2$
$\Gamma_1$: $d^2$         $\Gamma_1$: $d^2$               $\mathrm{Var}(\varepsilon_1)$: $d(d+1)/2$
D: $d$                    D: $d$
total: $2d^2$             total: $3d^2/2 + d/2$           total: $3d^2/2 + d/2$

Table 7.2: Identification of the SVAR model

Economists have labored on imposing sensible constraints on B, Γ1 and D to remove this gap.
There are a variety of strategies to enforce identification with these kinds of models.

• Strategy 1. In the most influential work in this area, Sims (1980) suggested constraining B to be upper triangular, which knocks out enough parameters to yield identification. In the d = 2 case this means that
$$B = \begin{pmatrix} 1 & b_{1,2} \\ 0 & 1 \end{pmatrix}, \qquad B^{-1} = \begin{pmatrix} 1 & -b_{1,2} \\ 0 & 1 \end{pmatrix}, \qquad B^{-1}\Gamma_1 = \begin{pmatrix} \gamma_{1,1} - b_{1,2}\gamma_{2,1} & \gamma_{1,2} - b_{1,2}\gamma_{2,2} \\ \gamma_{2,1} & \gamma_{2,2} \end{pmatrix}.$$
Table 7.2 deals with the general case. Unfortunately this messes up some issues, e.g. it means that causal inference is sensitive to the ordering of the system. If there is a compelling economic reason for the restrictions imposed by the upper triangular B this is fine. Otherwise this is quackery. Once B is estimated, then the causal impulse response functions
$$\frac{\partial Y_t}{\partial A_{t-s}^{T}} = \theta_s = \phi_1^s B^{-1}$$
can be estimated straightforwardly (a small numerical sketch is given after this list).

• Strategy 2. Suppose the variance of the shocks changes over time. The simplest version is
$$\mathrm{Var}(A_t) = \begin{cases} D & t \le T/2, \\ D^* & t > T/2; \end{cases}$$
then the SVAR has 10 parameters, gaining 2 parameters from $D^*$. But now the VAR(1) also has a switch in its variances, so the VAR(1) grows by 3 parameters to 10. Magic, the problem is solved. Of course more sophisticated volatility models are possible, e.g. using SV models for the shocks. This again leads to identification and better fitting models. Some econometricians feel uncomfortable with this strategy, as they regard only information conveyed by covariances as solid. It is unclear to me if this is sound, or simply stuck in the past, rejecting empirical progress.

• Strategy 3. For a Bayesian the lack of identification is not a dead stop, as a prior can be placed over B, Γ1 and D, and Bayes theorem still provides valid probabilistic calculations. The potential difficulty here is that the scientific conclusions may be sensitive to the prior specification, as there are aspects of the likelihood which will be flat due to the lack of identification, even with quite large samples. This is a milder version of the a priori imposition of B being upper triangular. An alternative to Bayesian inference in this context is the use of regularization on the parameters, which should limit the damage caused by the lack of identification.

7.2.3 Wold decomposition and the above

So far we started with a SVAR(1) and rewrote it as a VMA(∞) and a SVMA(∞). In this analysis the
causal structure and the assignments have been expressed using independence assumptions, e.g. sequential
randomization. This yielded a mathematically clear causal meaning to the impulse response functions, recalling

the discussion of the Wold decomposition in Section 6.5.

A different approach is to start by assuming that $\{Y_t\}$ is covariance stationary and then appeal to the Wold decomposition
$$Y_t = \sum_{j=0}^{\infty}\theta_j U_{t-j} + V_t,$$
where $\{U_t\}$ are zero mean white noise and $\{V_t\}$ is a non-stochastic sequence. Here $U_t = Y_t - P(Y_t|Y_{-\infty:t-1})$, the linear projection forecast errors.

So far, so good.

The next step is to assert that the VMA(∞) structure is causal, i.e. is a SVMA(∞) structure. This is sometimes called the Frisch (1933)-Slutsky (1927) paradigm. Although this is the conventional viewpoint of modern macroeconometric causality, I find this difficult to follow, as $U_t$ is physically built from the path of $Y_{-\infty:t}$. I literally have no way of moving $U_t$, as it follows from $Y_t$.

A different viewpoint of the Frisch (1933)-Slutsky (1927) paradigm is to say: I will build a causal model
$$Y_t = \sum_{j=0}^{\infty}\theta_j U_{t-j},$$
where $\{U_t\}$ are assignments or shocks. The Wold decomposition shows this is a superflexible approach for covariance stationary sequences. In the statistical analysis the $\{U_t\}$ are only assumed to be a zero mean white noise process. I find this more appealing, but it is a matter of taste for how science is carried out.

7.2.4 Factor augmented SVAR


7.2.5 Synthetic control

7.3 Markov decision process

Suppose a unit or an agent (e.g. an individual, a company or a society) has a time-t manipulable action variable
which can be used to impact other future variables for the unit’s benefit. This is also causality! A variable is
moved, other variables respond — here by design.

This area of study is often called control theory or more recently reinforcement learning. It appears all over

engineering and some areas of macroeconomics.

The main structure used to study this problem is a Markov decision process (MDP). It has:
• a state, rather like a hidden Markov model;
• an action, which is rather like an assignment in causal studies;
• a utility, which is a tiny bit like an observation.

7.3.1 A preamble to MDP

Like the HMM, the MDP is stated rather abstractly. Instead of having a measurement density and a transition density, it has a utility function, a policy and an environment. The environment will be quite similar to the transition density, but where the action can impact the states too.
We will start with a preamble, which I find helpful! All expositions I have seen skip this step, regarding it as all obvious. But I am slow, I need to spell it out. It also helps me relate everything back to the way we discussed causality before.

Let $A_t$ be called the time-t action or control, $S_t$ the potential states and $U_t$ the utility (utility is usually the label used in economics, minus loss in machine learning and reward in reinforcement learning; it is all the same intellectually). Collect them initially (in a moment a simpler, stripped down version will be given) into
$$Z_t' = \begin{pmatrix} U_t \\ \{S_t(a_{1:t-1})\}_{a_{1:t-1}\in\mathcal{A}^{t-1}} \\ A_t \end{pmatrix}, \qquad t = 1, ..., T.$$

In this structure we always assume a non-interference type condition on:

• the state:
$$S_t(a_{1:t-2}, a_{t-1}) = S_t(a'_{1:t-2}, a_{t-1}), \quad \text{for all } a_{1:t-2}, a'_{1:t-2},$$
so only the last action $a_{t-1}$ causally impacts the state;
• the time-t utility function:
$$U_t(\{a_{1:t-2}, a_{t-1}\}, \{s_{1:t-2}, s_{t-1:t}\}) = U_t(\{a'_{1:t-2}, a_{t-1}\}, \{s'_{1:t-2}, s_{t-1:t}\}), \quad \text{for all } a_{1:t-2}, a'_{1:t-2}, s_{1:t-2}, s'_{1:t-2},$$
so only $a_{t-1}, s_{t-1:t}$ matter.

For short-hand, the time-t state and utility function are written as
$$S_t(a_{t-1}), \qquad U_t(a_{t-1}, s_{t-1:t}),$$
while the time-t state and utility are
$$S_t = S_t(A_{t-1}), \qquad U_t = U_t(A_{t-1}, S_{t-1:t}).$$
In what follows, I will write the single path of potential states from time t to time T, as they are selected by a single path of actions, as
$$S_{t:T}(a_{t-1:T-1}) = \{S_t(a_{t-1}), S_{t+1}(a_t), ..., S_T(a_{T-1})\}.$$

h 7.3.1. St:T (at−1:T −1 ) is a powerful, but possibly confusing notation, as it looks like all the states St:T can
depend upon all the actions at−1:T −1 , but that is not the case.

Collecting all these states, we write
$$\{S_{t:T}(a_{t-1:T-1})\}_{a_{t-1:T-1}\in\mathcal{A}^{T-t+1}},$$
and then
$$S_{t:T} = S_{t:T}(A_{t-1:T-1}).$$

7.3.2 Defining a MDP

Having the above understanding, we are able to state the MDP.

Definition 7.3.2. [Markov decision process] Let $Z_t$ be
$$Z_t = \begin{pmatrix} U_t \\ \{S_t(a_{t-1})\}_{a_{t-1}\in\mathcal{A}} \\ A_t \end{pmatrix}, \qquad t = 1, ..., T,$$

and assume this follows a Markovian law (the causal understanding comes from the preamble). Now work with the decomposition
$$\begin{aligned}
P(Z_t|Z_{t-1}) &= P(\{S_t(a_{t-1})\}, A_t, U_t\,|\,Z_{t-1}) \\
&= P(\{S_t(a_{t-1})\}\,|\,Z_{t-1}) \\
&\quad \times P(A_t\,|\,Z_{t-1}, \{S_t(a_{t-1})\}) \\
&\quad \times P(U_t\,|\,Z_{t-1}, A_t, \{S_t(a_{t-1})\}).
\end{aligned}$$
A Markov decision process additionally assumes the following simplifications:
$$\begin{aligned}
P(\{S_t(a_{t-1})\}\,|\,Z_{t-1}) &= P(\{S_t(a_{t-1})\}\,|\,A_{t-1}, S_{t-1}), && \text{called the environment;} \\
P(A_t\,|\,Z_{t-1}, \{S_t(a_{t-1})\}) &= P(A_t\,|\,S_t), && \text{called the policy (controlled by the agent), recalling } S_t = S_t(A_{t-1}); \\
P(U_t\,|\,Z_{t-1}, A_t, \{S_t(a_{t-1})\}) &= P(U_t\,|\,A_{t-1}, S_{t-1:t}).
\end{aligned}$$
The time-t discounted utility function is
$$U^*_t(a_{t-1:T-1}, s_{t-1:T}) = \sum_{j=t}^{T}\delta^{j-t}U_j(a_{j-1}, s_{j-1:j}), \qquad \delta \in [0,1), \quad t = 1, ..., T,$$
while the time-t discounted utility is $U^*_t = U^*_t(A_{t-1:T-1}, S_{t-1:T})$.

This structure has a lot of similarities with the transition equation of a HMM (where $s_{t-1}$ is the time-t state), but now there is this extra action element. If there was no action, the environment would be the transition equation of the HMM. The non-interference assumption makes the time-t utility a little like a potential outcome, with the action being the assignment, but here it is lagged.
The policy is the law of the action given the state, $A_t|S_t$. Thinking of $S_t = s$, the support of possible actions is sometimes written as
$$\mathcal{A}(s),$$
and is called the action space, so $A_t$ has conditional support $\mathcal{A}(S_t)$.

One of the reasons the MDP is mathematically tractable is the time separability of the utility functions, which means that the discounted time-t utility function splits into
$$U^*_t(a_{t-1:T-1}, s_{t-1:T}) = U_t(a_{t-1}, s_{t-1:t}) + \delta U^*_{t+1}(a_{t:T-1}, s_{t:T}),$$
the current utility plus δ times the discounted time-(t+1) utility function. This equation looks a bit weird. It works backwards in time, relating terms at time t to things at time t+1. But we have seen equations which go backwards in time:

• smoothing in HMMs;
• creating the beautiful Doob martingale.

Remark 26. In many setups the policy, the environment and/or the utility will not be stochastic. In the extreme case where nothing is random and the system is time-invariant, then $z_t = (u_t, s_t, a_t)$ with the functions looping through: $s_t = s(a_{t-1}, s_{t-1})$, then $a_t = a(s_t)$, then $u_t = u(a_{t-1}, s_{t-1:t})$.

Example 7.3.3. [LQG control] The Linear-Quadratic-Gaussian (LQG) control problem sets the time-t potential states as
$$\{S_t(a_{t-1})\,|\,A_{t-1}, S_{t-1} = s\} = Ts + Ra_{t-1} + \eta_t, \quad \text{where } \eta_t \perp\!\!\!\perp (A_{t-1}, S_{t-1}).$$
This structure means that the agent selects the action, which pushes the state through to the next period. Instead of having a measurement density there is a utility function
$$U_t(a_{t-1}, s_{t-1:t}) = -a_{t-1}^{T}Qa_{t-1} - s_{t-1}^{T}Hs_{t-1}, \quad \text{for } t < T,$$
and end condition
$$U_T(a_{T-1}, s_{T-1:T}) = -s_{T-1}^{T}Fs_{T-1}.$$
So utility likes the actions $a_{t-1}$ and states $s_{t-1}$ to be small. Hansen and Sargent (2014) mostly focus on using this case for different problems in macroeconomics. When there is no $\eta_t$, this setup is called the Linear-Quadratic (LQ) control problem.

7.4 Stochastic dynamic program

One of the most influential ideas in the second half of the 20th century in engineering and dynamic economics

is the stochastic dynamic program.

Definition 7.4.1. [Stochastic dynamic program principle] Assume a MDP. If
$$\hat{a}_{t-1}(s) = \arg\max_{a_{t-1}}\left\{\max_{a_{t:T-1}} E[U^*_t(a_{t-1:T-1}, s, S_{t:T}(a_{t-1:T-1}))\,|\,S_{t-1} = s]\right\},$$
then the deterministic policy $\{\hat{a}_{t-1}(s)\}_{s\in\mathcal{S}}$ maximizes expected time-t discounted utility, for a given state. This policy is called a stochastic dynamic program. The associated
$$v_t(s) = \max_{a_{t-1:T-1}} E[U^*_t(a_{t-1:T-1}, s, S_{t:T}(a_{t-1:T-1}))\,|\,S_{t-1} = s]$$
is called the value function of the stochastic dynamic program: the highest achievable expected utility given we are in state $S_{t-1} = s$.

We will discuss this case later.

Definition 7.4.2. [Greedy policy] Assume a MDP. If
$$\tilde{a}_{t-1}(s) = \arg\max_{a_{t-1}} E[U_t(a_{t-1}, s, S_t(a_{t-1}))\,|\,S_{t-1} = s],$$
then the deterministic policy $\{\tilde{a}_{t-1}(s)\}_{s\in\mathcal{S}}$ myopically maximizes expected time-t utility, ignoring any future impacts. This policy is called greedy.

The time separability of the discounted time-t utility function means the stochastic dynamic program's value function also splits:
$$v_t(s) = \max_{a_{t-1}}\left\{E[U_t(a_{t-1}, s, S_t(a_{t-1}))\,|\,S_{t-1} = s] + \delta E[v_{t+1}(S_t(a_{t-1}))\,|\,S_{t-1} = s]\right\}.$$
This is an example of a Bellman equation, relating the value function at time t to the value function at time t+1. Why does it hold?

Proof. The first term is obvious; the second is notation heavy but quite simple:
$$\max_{a_{t-1:T-1}} E[U^*_{t+1}(a_{t:T-1}, S_{t:T}(a_{t-1:T-1}))\,|\,S_{t-1} = s] = \max_{a_{t-1}}\left\{\max_{a_{t:T-1}} E[U^*_{t+1}(a_{t:T-1}, S_{t:T}(a_{t-1:T-1}))\,|\,S_{t-1} = s]\right\},$$
while, by Adam's law,
$$\begin{aligned}
E[U^*_{t+1}(a_{t:T-1}, S_{t:T}(a_{t-1:T-1}))\,|\,S_{t-1} = s] &= E[E[U^*_{t+1}(a_{t:T-1}, S_{t:T}(a_{t-1:T-1}))\,|\,S_t(a_{t-1})]\,|\,S_{t-1} = s] \\
&= E[E[U^*_{t+1}(a_{t:T-1}, S_t(a_{t-1}), S_{t+1:T}(a_{t:T-1}))\,|\,S_t(a_{t-1})]\,|\,S_{t-1} = s].
\end{aligned}$$
Thus
$$\max_{a_{t-1:T-1}} E[U^*_{t+1}(a_{t:T-1}, S_{t:T}(a_{t-1:T-1}))\,|\,S_{t-1} = s] = \max_{a_{t-1}} E\left[\max_{a_{t:T-1}} E[U^*_{t+1}(a_{t:T-1}, S_t(a_{t-1}), S_{t+1:T}(a_{t:T-1}))\,|\,S_t(a_{t-1})]\,\middle|\,S_{t-1} = s\right] = \max_{a_{t-1}} E[v_{t+1}(S_t(a_{t-1}))\,|\,S_{t-1} = s].$$

It takes some time to get used to these value functions. Thinking about a special case helps.

Example 7.4.3. Recall from Example 7.3.3 the Linear-Quadratic-Gaussian (LQG) control problem. Then
$$v_t(s) = -s^{T}Hs + \max_{a_{t-1}}\left\{-a_{t-1}^{T}Qa_{t-1} + \delta E[v_{t+1}(Ts + Ra_{t-1} + \eta_t)\,|\,S_{t-1} = s]\right\}.$$
Notice that the only random object here is $\eta_t|S_{t-1}$.

How to progress?

Theorem 7.4.4. The stochastic dynamic program solution of the LQG problem sets:
$$\hat{a}_{t-1} = -K_{t+1}Ts, \qquad \text{where } K_{t+1} = \delta\left(Q + \delta R^{T}P_{t+1}R\right)^{-1}R^{T}P_{t+1},$$
$$P_t = H + T^{T}K_{t+1}^{T}QK_{t+1}T + \delta T^{T}\left(I - RK_{t+1}\right)^{T}P_{t+1}\left(I - RK_{t+1}\right)T.$$

Proof. Guess a solution of the form
$$v_t(s) = -s^{T}P_t s - \rho_t \qquad (7.6)$$
for every t. If this is true, then (removing all the negative signs) using that form at time t and t+1, the equality
$$s^{T}P_t s + \rho_t = s^{T}Hs + \min_{a_{t-1}}\left\{a_{t-1}^{T}Qa_{t-1} + \delta E\left[(Ts + Ra_{t-1} + \eta_t)^{T}P_{t+1}(Ts + Ra_{t-1} + \eta_t)\,|\,S_{t-1} = s\right]\right\} + \delta\rho_{t+1} \qquad (7.7)$$
holds. Find the $\hat{a}_{t-1}$ which does the minimization by solving the first-order condition, assuming $E[\eta_t|S_{t-1} = s] = 0$:
$$2Q\hat{a}_{t-1} + 2\delta R^{T}P_{t+1}(Ts + R\hat{a}_{t-1}) = 0,$$
so
$$\hat{a}_{t-1} = -K_{t+1}Ts, \qquad \text{where } K_{t+1} = \delta\left(Q + \delta R^{T}P_{t+1}R\right)^{-1}R^{T}P_{t+1}.$$
Now let us get back to $\rho_t, P_t$: can they be found to verify (7.6)? Noting $Ts + R\hat{a}_{t-1} = (I - RK_{t+1})Ts$, plugging this back into (7.7) yields
$$\begin{aligned}
s^{T}P_t s + \rho_t &= s^{T}Hs + \hat{a}_{t-1}^{T}Q\hat{a}_{t-1} + \delta E\left[(Ts + R\hat{a}_{t-1} + \eta_t)^{T}P_{t+1}(Ts + R\hat{a}_{t-1} + \eta_t)\,|\,S_{t-1} = s\right] + \delta\rho_{t+1} \\
&= s^{T}Hs + s^{T}T^{T}K_{t+1}^{T}QK_{t+1}Ts + \delta s^{T}T^{T}(I - RK_{t+1})^{T}P_{t+1}(I - RK_{t+1})Ts + \delta E[\eta_t^{T}P_{t+1}\eta_t\,|\,S_{t-1} = s] + \delta\rho_{t+1} \\
&= s^{T}P_t s + \delta\left\{E[\eta_t^{T}P_{t+1}\eta_t\,|\,S_{t-1} = s] + \rho_{t+1}\right\},
\end{aligned}$$
where
$$P_t = H + T^{T}K_{t+1}^{T}QK_{t+1}T + \delta T^{T}(I - RK_{t+1})^{T}P_{t+1}(I - RK_{t+1})T.$$
If $E[\eta_t^{T}P_{t+1}\eta_t\,|\,S_{t-1} = s]$ does not vary with s, then
$$\rho_t = \delta\left\{\rho_{t+1} + \mathrm{trace}\left(P_{t+1}E[\eta_t\eta_t^{T}]\right)\right\}.$$
At time T, the end condition gives $v_T(s) = -s^{T}Fs$, so set $P_T = F$ and $\rho_T = 0$. This completes the argument.

What is important here? The following makes that list:
• $\hat{a}_{t-1}$ only relies on the assumptions that $E[\eta_t|S_{t-1} = s] = 0$ and that $E[\eta_t^{T}P_{t+1}\eta_t|S_{t-1} = s]$ does not vary with s. The solved $P_t$ does not depend upon the properties of $\eta_t$.
• In the LQ, non-stochastic case, $\hat{a}_{t-1}, P_t$ do not change! In that case $\rho_t = 0$ for all t.
• The $\rho_t$ is a remainder term; it is not used to compute the control.
A minimal numerical sketch of the backward recursion is given below.
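This sketch implements the recursion of Theorem 7.4.4 as reconstructed above (with the δ factor in $K_{t+1}$ implied by the first-order condition); all matrices are illustrative.

```python
import numpy as np

def lqg_backward(T_, R, Q, H, F, delta, horizon):
    """Backward Riccati recursion: return gains K so that a_{t-1} = -K @ T_ @ s."""
    n = T_.shape[0]
    P = F.copy()
    gains = []
    for _ in range(horizon):
        K = delta * np.linalg.solve(Q + delta * R.T @ P @ R, R.T @ P)
        M = np.eye(n) - R @ K
        P = H + T_.T @ K.T @ Q @ K @ T_ + delta * T_.T @ M.T @ P @ M @ T_
        gains.append(K)
    return gains[::-1]   # gains in forward time order

T_ = np.array([[1.0, 0.1], [0.0, 0.9]])   # state transition, illustrative
R = np.array([[0.0], [1.0]])              # action loading
Q = np.array([[0.5]]); H = np.eye(2); F = np.eye(2)
for K in lqg_backward(T_, R, Q, H, F, delta=0.95, horizon=3):
    print(K)
```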

Example 7.4.5.
$$S_t = S_{t-1} - a_{t-1}, \qquad S_t \ge 0.$$

7.5 Reinforcement learning


7.6 Recap

The main topics are listed in Table 7.3.

Formula or idea                 Description or name
$A_t$                           assignment
                                dynamic program
                                environment
$Y_{t,s}(a_{t-s,1})$            lag-s potential outcome
                                Markov decision process (MDP)
$Y_t$                           outcome
                                policy
$Y_t(a_{t-m:t})$                potential outcome
$X_t$                           predictor
                                reinforcement learning
                                sequential randomization
                                stochastic dynamic program
                                value function

Table 7.3: Main ideas and notation in Chapter 7.
Chapter 8

Stochastic integration and time series

Stochastic integration plays a large role in modern science, engineering, economics and statistics. There are
quite a few beautiful books on this subject. Mikosch (1998) is an elegant introduction to stochastic calculus

by a very strong mathematician. His Chapter 2 on integration is extremely clear. He assumes only very basic
probability. Steele (2001) mixes finance with probability to deliver an elegant book. Protter (2010) is my
favorite book on stochastic calculus. It is very elegant, with enormous economy of effort. Duffie (2001), Shreve

(2004) and Karatzas and Shreve (1998) are classic references on mathematical finance expressed in continuous
time.

In this chapter we will focus on stochastic integrals which are driven by Brownian motion. We have already seen objects a bit like Brownian motion: random walks,
$$Y_t = Y_{t-1} + \varepsilon_t, \qquad \varepsilon_t \overset{iid}{\sim}.$$
Here we discuss taking this random walk process to times which can be recorded continuously,
$$\{Y(t)\}_{t\ge0},$$
where the increments are Gaussian. This is useful as it expands the kinds of processes we can use to build new models (e.g. in financial econometrics) and allows some analysis of problems where the data are not equally spaced through time; some of these objects also appear in important asymptotic arguments (e.g. the functional central limit theorem) in later chapters.

8.1 Background
8.1.1 A càdlàg function

A càdlàg function is a function which is right continuous with left limits. It is familiar in introductory statistics from, for example, the distribution function of a binary random variable, shown in Figure 8.1.


[Figure 8.1: Example of a càdlàg function: the distribution function of a binary random variable with P(Y = 0) = 0.3 and P(Y = 1) = 0.7. Here we compute the 0.6-quantile, F⁻¹(0.6) = Q(0.6), which is 1.]

A càdlàg process $\{Y(t)\}_{t\ge0}$ has
$$\lim_{u\downarrow t} Y(u) = Y(t), \qquad \lim_{u\uparrow t} Y(u) = Y(t-), \qquad t \in \mathbb{R}_{>0},$$
while the jump at time t is
$$Y(t) - Y(t-),$$
the difference between the right and left limits, respectively. I will use C(0,1) to denote the space (i.e., informally, the collection) of continuous functions on the unit interval, while D(0,1) denotes the corresponding space of càdlàg functions (it is called the Skorokhod space in the literature).

8.1.2 Finite p-variation

It is helpful to take a step back and remind ourselves about what we usually mean by integration. The usual integrals we see in statistics are typically Riemann integrals or their extension, Riemann–Stieltjes integrals. In our discussion we will follow some of the lines set out by Mikosch (1998).
To do this, start with p-variation. This will be based on partitions of the interval [0,1] governed by the end points:
$$\tau_n = \{0 = t_0 < t_1 < ... < t_n = 1\}.$$

Associated with this partition is the mesh or norm of $\tau_n$, written as
$$\|\tau_n\| := \max_{j\in\{1,...,n\}}\{t_j - t_{j-1}\},$$
the length of the longest subinterval; this is also the $L_\infty$ norm of the time-gaps.

Definition 8.1.1. [Finite p-variation] A real function g on [0,1] is said to be of finite p-variation (also called bounded p-variation) for some p ≥ 1 if
$$\sup_n \sup_{\tau_n} \sum_{j=1}^{n}|g(t_j) - g(t_{j-1})|^p < \infty,$$
so the supremum is over all possible partitions.

This is a pretty abstract quantity; how does it vary with p? The following shows that, for any $\infty > q \ge 0$, if g is of finite p-variation then g is of finite (p+q)-variation.
Why? For any $\{a_j > 0\}_{j=1}^{n}$ and $1 \le p < \infty$, let the p-norm of a (also called the $L_p$ norm of a) be
$$\|a\|_p := \left(\sum_{j=1}^{n}a_j^p\right)^{1/p},$$
which is non-increasing in p. Why? Let $q \ge 0$; then
$$\|a\|_{p+q}^{p+q} = \sum_{j=1}^{n}a_j^{p+q} \le \left(\max_j a_j^q\right)\sum_{j=1}^{n}a_j^p = \left(\max_j a_j^p\right)^{q/p}\sum_{j=1}^{n}a_j^p \le \left(\sum_{j=1}^{n}a_j^p\right)^{q/p}\sum_{j=1}^{n}a_j^p = \|a\|_p^q \times \|a\|_p^p = \|a\|_p^{p+q},$$
so the claim follows, as
$$\|a\|_{p+q} \le \|a\|_p,$$
setting $a_j = |g(t_j) - g(t_{j-1})|$.

By far the most famous case of finite p-variation is when p = 1, when it is called finite variation. In that case we will write
$$TV(g, \tau_n) := \sum_{j=1}^{n}|g(t_j) - g(t_{j-1})|.$$

The term
$$TV(g) := \sup_n \sup_{\tau_n} TV(g, \tau_n)$$
is often called the total variation of g.
All non-decreasing functions are of finite variation: by telescoping, $TV(g, \tau_n) = g(1) - g(0)$ for any choice of $\tau_n$.

Example 8.1.2. A distribution function $\{F(y)\}_{y\in[0,1]}$ is of finite variation as it is non-decreasing, and so
$$TV(F, \tau_n) = \sum_{j=1}^{n}\{F(t_j) - F(t_{j-1})\} = F(1) - F(0) \le 1,$$
whatever the choice of $\tau_n$. A Poisson process with intensity ψ, written $\{Y(t)\}_{t\ge0}$, is non-decreasing, so
$$TV(Y, \tau_n) = Y(1) - Y(0) \sim Po(\psi).$$
Hence the probability that $TV(Y, \tau_n)$ is less than c can be made arbitrarily close to one by taking c large. Thus Poisson processes are of finite variation with probability one.

Example 8.1.3. Assume g is continuously differentiable with bounded derivative g′. Then by the mean value theorem, for t > s,
$$|g(t) - g(s)| \le (t - s)\sup_{u\in[0,1]}|g'(u)|,$$
so
$$TV(g, \tau_n) \le \sup_{u\in[0,1]}|g'(u)| \times \sum_{j=1}^{n}(t_j - t_{j-1}) = \sup_{u\in[0,1]}|g'(u)| < \infty.$$
Thus g is of finite variation.

How does p = 1 variation relate to quadratic variation? Start with
$$QV(g, \tau_n) := \sum_{j=1}^{n}\{g(t_j) - g(t_{j-1})\}^2 \le \max_{|u-v|\le\|\tau_n\|}|g(u) - g(v)| \times \sum_{j=1}^{n}|g(t_j) - g(t_{j-1})|.$$
But if g is continuous, then
$$\max_{|u-v|\le\|\tau_n\|}|g(u) - g(v)| \to 0$$
as $\|\tau_n\| \to 0$. The quadratic variation of the function, written [g,g](1), is the mean square limit of $QV(g, \tau_n)$, that is,
$$\lim_{\|\tau_n\|\to0} E[\{QV(g, \tau_n) - [g,g](1)\}^2] = 0.$$
For any finite variation, continuous function $\{g(t)\}_{t\in[0,1]}$,
$$[g,g](1) = 0.$$
Quadratic variation is not 2-variation, as quadratic variation looks at the limit as $\|\tau_n\| \to 0$, while 2-variation looks at the sup over all partitions. These things are different.

8.1.3 Riemann–Stieltjes integral

Think of
$$S(f, g, \tau_n) = \sum_{j=1}^{n}f(t_{j-1})\{g(t_j) - g(t_{j-1})\},$$
for functions $\{f(t)\}_{t\in[0,1]}$ and $\{g(t)\}_{t\in[0,1]}$; then it turns out this has a limit, called the Riemann–Stieltjes integral and written
$$\int_0^1 f(u)\,dg(u),$$
so long as $\|\tau_n\| \to 0$ and
• f and g do not have any discontinuities at the same point on [0,1];
• f has finite p-variation and g has finite q-variation, for p, q > 0 such that
$$p^{-1} + q^{-1} > 1. \qquad (8.1)$$


In the integral, f is called the integrand and g is labelled the integrator. In the special case that $g(t_j) - g(t_{j-1}) = t_j - t_{j-1}$, this is called a Riemann integral.
This result is due to Young (1936). Some textbooks make out that the Riemann–Stieltjes integral needs g to be of finite variation, but that is not true. Mikosch (1998) has a very accessible and precise discussion of the issues. See also Dudley and Norvaiša (1999). The point about (8.1) is rather important, as Brownian motion will turn out not to have finite variation, but it does have finite p-variation for any p > 2, and finite quadratic variation.
Standard references for the basic theory of Riemann–Stieltjes integrals are Apostol (1957) and Widder (1946).

Example 8.1.4. In statistics you often see Riemann–Stieltjes integrals in the context of moments, e.g.
$$E[Y] = \int_0^1 y\,dF(y),$$
where F is a distribution function. If F is continuously differentiable, then $dF(y) = f(y)dy$. If F is purely discontinuous, then $\int_0^1 y\,dF(y) = \sum_y yP(Y = y)$.

Thinking about the Riemann–Stieltjes integral as a function of time,
$$Y(t) = \int_0^t f(u)\,dg(u), \qquad t \in [0,T],$$
then $\{Y(t)\}_{t\in[0,T]}$ is the solution to the differential equation
$$dY(t) = f(t)\,dg(t).$$
This is super powerful, as it is the basis of Newton's chain rule, applied to the transformation
$$A(t) = A\{Y(t), t\},$$
where the function A is assumed to be continuously differentiable in Y and t (written as $A \in C^{1,1}$). The chain rule says that
$$\begin{aligned}
dA(t) &= \frac{\partial A\{Y(t),t\}}{\partial t}\,dt + \frac{\partial A\{Y(t),t\}}{\partial Y(t)}\,dY(t) \\
&= A_t(t)\,dt + A_Y(t)\,dY(t), \qquad A_Y(t) := \frac{\partial A\{Y(t),t\}}{\partial Y(t)}, \quad A_t(t) := \frac{\partial A\{Y(t),t\}}{\partial t}, \\
&= A_t(t)\,dt + A_Y(t)f(t)\,dg(t),
\end{aligned}$$
so
$$A(t) = A(0) + \int_0^t A_t(u)\,du + \int_0^t A_Y(u)f(u)\,dg(u).$$
Example 8.1.5. If $g(t) = \mu t$, then $dY(t) = f(t)\mu\,dt$, and $\{A(t)\}_{t\in[0,T]}$ is the solution to the differential equation
$$dA(t) = \{A_t(t) + A_Y(t)f(t)\mu\}\,dt.$$

If $A(t) = \exp\{Y(t)\}$, then
$$dA(t) = A(t)\,dY(t).$$

8.1.4 Local time

BNS book. To be added.

8.2 Brownian motion


8.2.1 Definition

A Brownian motion is the Gaussian special case of a Lévy process, a Gaussian continuous time random walk.

[Figure 8.2: LHS: sample path of a Poisson process with intensity ψ = 1. Middle: sample path of Brownian motion with µ = 0 and σ = 1. RHS: sample path of a gamma process.]

Definition 8.2.1. The process $\{B(t)\}_{t\ge0}$ is standard Brownian motion iff it has
(i) càdlàg sample paths;
(ii) independent increments,
$$\{B(t) - B(s)\} \perp\!\!\!\perp \{B(u) - B(v)\}, \qquad t > s > u > v;$$
(iii) Gaussian increments, that is, $B(t) - B(s) \sim N(0, t-s)$.
The process $\{Y(t)\}_{t\ge0}$ where $Y(t) = \mu t + \sigma B(t)$ is called Brownian motion, where µ is the drift and σ the volatility.
If $t - s > 0$ is small, then
$$|\mu(t-s)| \ll \sigma\sqrt{t-s},$$
so the randomness in the increment $Y(t) - Y(s)$ will dominate the drift.
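A standard simulation sketch: Brownian motion with drift on a grid is built by cumulating i.i.d. N(0, dt) increments (grid size and parameters here are illustrative).

```python
import numpy as np

rng = np.random.default_rng(0)
n, T = 1_000, 10.0
dt = T / n
t = np.linspace(0.0, T, n + 1)
dB = rng.normal(0.0, np.sqrt(dt), size=n)   # B(t_j) - B(t_{j-1}) ~ N(0, dt)
B = np.concatenate([[0.0], np.cumsum(dB)])  # standard Brownian motion on the grid
mu, sigma = 0.5, 1.0
Y = mu * t + sigma * B                      # Brownian motion with drift
print(Y[-1], (dB ** 2).sum())               # Y(T), and realized QV of B (about T)
```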

Brownian motion as an integrand in a Riemann integral

Think about $\{Y(t)\}_{t\ge0}$ given by the Riemann integral
$$Y(t) = \int_0^t B(u)\,du.$$
As it is a Riemann–Stieltjes integral, $\{Y(t)\}$ is a finite variation Gaussian process, while it has
$$E[Y(t)|\mathcal{F}^Y_s] = Y(s) + (t-s)B(s), \qquad Y(t) \sim N\left(0, \frac{t^3}{3}\right), \qquad \mathrm{Cov}(Y(t), Y(s)) = s^3/3 + (t-s)s^2/2,$$
so is not a martingale, while
$$dY(t) = B(t)\,dt.$$
Why do these results hold? First, for t > s,
$$Y(t) = Y(s) + \int_s^t B(u)\,du = Y(s) + (t-s)B(s) + \int_s^t\{B(u) - B(s)\}\,du.$$
But $\{B(u) - B(s)\}_{u\ge s}$ is independent of B(s), as Brownian motion has independent increments. So
$$E[Y(t)|\mathcal{F}^Y_s] = Y(s) + (t-s)B(s),$$
so $\{Y(t)\}_{t\ge0}$ is not a martingale. Second,
$$\mathrm{Var}(Y(t)) = \int_0^t\int_0^t \mathrm{Cov}(B(u), B(v))\,du\,dv = \int_0^t\int_0^t \min(u, v)\,du\,dv.$$
But
$$\int_0^t \min(u, v)\,du = \int_0^v u\,du + \int_v^t v\,du = \frac{v^2}{2} + v(t - v) = vt - v^2/2,$$
so $\mathrm{Var}(Y(t)) = t^3/2 - t^3/6 = t^3/3$. Using the $E[Y(t)|\mathcal{F}^Y_s]$ expression,
$$\mathrm{Cov}(Y(t), Y(s)) = E[Y(t)Y(s)] = E[E[Y(t)Y(s)|\mathcal{F}^Y_s]] = E[Y(s)^2] + (t-s)E[B(s)Y(s)].$$
But
$$E[B(t)Y(t)] = \int_0^t E[B(t)B(u)]\,du = \int_0^t u\,du = t^2/2.$$
This delivers the result on Cov(Y(t), Y(s)).

8.2.2 Continuous sample paths, non-differentiability and infinite length

We have just seen that using Brownian motion type objects as the integrand in a Riemann integral raises no big issues. But what about using Brownian motion as an integrator? To answer this we need to determine some of the core properties of Brownian motion. They will be that Brownian motion
• is a martingale;
• has paths which are each a continuous function of time;
• is not differentiable with respect to time;
• is not of finite variation;
• has finite quadratic variation;
• has finite p-variation for any p > 2;
• is self-similar;
• is at the center of a process version of the Lindeberg–Lévy central limit theorem, Donsker's invariance principle.

To see that Brownian motion $\{B(t)\}_{t\ge0}$ is a martingale with respect to its natural filtration $\{\mathcal{F}(t)\}_{t\ge0}$, write, for t > s,
$$B(t) = B(s) + \{B(t) - B(s)\}.$$
Then
$$\begin{aligned}
E[B(t)|\mathcal{F}(s)] &= E[B(s)|\mathcal{F}(s)] + E[\{B(t) - B(s)\}|\mathcal{F}(s)] \\
&= E[B(s)|\mathcal{F}(s)], && \text{independent, zero mean increments} \\
&= B(s).
\end{aligned}$$
We know Var(B(t)) = t, so E[|B(t)|] exists, and thus Brownian motion satisfies the two conditions for being a martingale.

Continuous in time but not differentiable

First, think about asking if {B(t)}t≥0 is a continuous function of time. Figure 8.2 shows a simulated path of a

standard Brownian motion. The plot is made using small dots at each datapoint, not a line graph. The graph
gives the impression the Brownian motion has a continuous sample path. We can prove this is true.

Theorem 8.2.2. Brownian motion $\{B(t)\}_{t\ge0}$ has a continuous sample path.
Proof. This uses the "Kolmogorov continuity criterion", e.g. Bass (2011, Ch. 8). This says that if $\{X(t)\}_{t\in[0,1]}$ is a real valued process and there exist constants $c, \varepsilon, p > 0$ such that
$$E[|X(t) - X(s)|^p] \le c|t-s|^{1+\varepsilon},$$
where t and s live in [0,1], then with probability one $\{X(t)\}_{t\in[0,1]}$ is uniformly continuous on [0,1]. In the Brownian motion case, for fixed t and s,
$$B(t) - B(s) \overset{d}{=} \sqrt{t-s}\,U, \qquad U \sim N(0,1).$$
So
$$E[|B(t) - B(s)|^p] = \mu_p|t-s|^{p/2}, \qquad \mu_p = E|U|^p.$$
Now
$$|t-s|^{p/2} \le c|t-s|^{1+\varepsilon}$$
if $p > 2(1+\varepsilon)$, as $|t-s| < 1$. So Brownian motion has a continuous sample path.

Next: is Brownian motion differentiable? The answer is no, but this is actually pretty hard to prove rigorously. To think about it informally, suppose τ > 0 is small; then
$$B(t+\tau) - B(t) \sim N(0, \tau),$$
so
$$\frac{B(t+\tau) - B(t)}{\tau} \sim N(0, \tau^{-1}).$$
Hence it is not well behaved as τ ↓ 0. In fact $\{B(t)\}_{t\ge0}$ is nowhere differentiable.

The smoothness of the Brownian motion path can be measured using the following theorem.
Theorem 8.2.3. [Lévy's modulus of continuity] Almost surely,
$$\limsup_{h\downarrow0}\,\sup_{0\le t\le 1-h}\frac{|B(t+h) - B(t)|}{\sqrt{2h\log(1/h)}} = 1.$$
The proof of this is beyond these notes. This means that Brownian motion is everywhere locally Hölder continuous with exponent α < 1/2, almost surely, i.e. $|B(t+h) - B(t)| \le Ch^{\alpha}$.

Finite variation and quadratic variation of Brownian motion

Next: what is the length of Brownian motion? Think of
$$TV(B, \tau_n) = \sum_{j=1}^{n}|B(t_j) - B(t_{j-1})|, \quad \text{and} \quad TV(B) = \sup_n\sup_{\tau_n}\sum_{j=1}^{n}|B(t_j) - B(t_{j-1})|,$$
so $TV(B, \tau_n)$ sums up independent terms. Now $B(t) - B(s) \overset{d}{=} \sqrt{t-s}\,U$, $U \sim N(0,1)$. This means that
$$E|B(t) - B(s)| = \sqrt{\frac{2}{\pi}}\sqrt{t-s}, \qquad \mathrm{Var}|B(t) - B(s)| = E\{B(t) - B(s)\}^2 - \{E|B(t) - B(s)|\}^2 = (t-s)\{1 - (2/\pi)\}. \qquad (8.2)$$
This means that
$$E[TV(B, \tau_n)] = \sqrt{\frac{2}{\pi}}\sum_{i=1}^{n}\sqrt{t_i - t_{i-1}}, \quad \text{and} \quad \mathrm{Var}(TV(B, \tau_n)) = T\{1 - (2/\pi)\}.$$
Notice the variance does not depend upon the details of $\tau_n$, only T.

Example 8.2.4. When $t_i = Ti/n$, then
$$E[TV(B, \tau_n)] = \sqrt{\frac{2Tn}{\pi}},$$
which goes off to infinity as n increases. As the variance is invariant to $\tau_n$,
$$TV(B) = \infty$$
with probability one.

Brownian motion therefore does not have finite variation, so its use as an integrator in a Riemann–Stieltjes integral will be, at best, subtle.

Now think about quadratic variation.
Definition 8.2.5. [Y,Y](1) is the quadratic variation of a process $\{Y(t)\}_{t\in[0,1]}$. It is the mean square error limit of
$$\sum_{j=1}^{n}\{Y(t_j) - Y(t_{j-1})\}^2,$$
as $\|\tau_n\| \to 0$. More generally, the QV process is written as $\{[Y,Y](t)\}_{t\in[0,1]}$.
Now let
$$QV(B, \tau_n) = \sum_{j=1}^{n}\{B(t_j) - B(t_{j-1})\}^2;$$
then
$$E[QV(B, \tau_n)] = 1, \qquad \mathrm{Var}(QV(B, \tau_n)) = 2\sum_{j=1}^{n}(t_j - t_{j-1})^2 \le 2\|\tau_n\|\sum_{j=1}^{n}(t_j - t_{j-1}) = 2\|\tau_n\|,$$
so if $\|\tau_n\| \to 0$, then
$$\lim_{\|\tau_n\|\to0} E[\{QV(B, \tau_n) - [B,B](1)\}^2] = 0, \quad \text{where } [B,B](1) = 1,$$
and more generally
$$[B,B](t) = t.$$
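A quick numerical sketch of the contrast between $TV(B, \tau_n)$ and $QV(B, \tau_n)$ on [0,1] (each n uses a freshly simulated path rather than a refinement of a single path): the total variation blows up like $\sqrt{2n/\pi}$ while the realized quadratic variation settles at 1.

```python
import numpy as np

rng = np.random.default_rng(0)
for n in [10, 100, 1_000, 10_000]:
    dB = rng.normal(0.0, np.sqrt(1.0 / n), size=n)   # increments over [0, 1]
    tv = np.abs(dB).sum()                            # TV(B, tau_n): grows like sqrt(2n/pi)
    qv = (dB ** 2).sum()                             # QV(B, tau_n): settles at [B, B](1) = 1
    print(n, round(tv, 3), round(np.sqrt(2 * n / np.pi), 3), round(qv, 3))
```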

Finally, think about the p-variation of Brownian motion. The following theorem shows how subtly different quadratic variation is to 2-variation.
Theorem 8.2.6. Brownian motion has finite p-variation for any p > 2.
Proof. Lévy's modulus of continuity means that for p > 2 there exists a $K_p$ such that
$$|B_t - B_s| \le K_p|t-s|^{1/p},$$
so
$$\sum_{j=1}^{n}|B(t_j) - B(t_{j-1})|^p \le K_p^p\sum_{j=1}^{n}|t_j - t_{j-1}| = K_p^p < \infty$$
almost surely.

Self-similarity

The next property we discuss is that scaling Brownian motion by σ > 0 is the same as running time faster:
$$\{\sigma B(t)\}_{t\ge0} \overset{d}{=} \{B(\sigma^2 t)\}_{t\ge0}.$$
This follows from the properties of the normal distribution plus the $B(t) \sim N(0,t)$ property. This implies Brownian motion is an example of a self-similar process, with γ = 1/2 and a = σ².
Definition 8.2.7. The process $\{X(t)\}_{t\ge0}$ is said to be self-similar if there exists a γ > 0 such that, for all a > 0,
$$\{a^{\gamma}X(t)\}_{t\ge0} \overset{d}{=} \{X(at)\}_{t\ge0}.$$
Self-similarity is a typical property of fractals (e.g. Mandelbrot (2021)).

Donsker's invariance principle

Finally, think about central limit theorems. One of the most important results in probability theory is the Lindeberg–Lévy central limit theorem, which is seen in introductory statistics classes.
Theorem 8.2.8. [Lindeberg–Lévy CLT] Assume that the sequence $\{X_j\}_{j=1}^{\infty}$ are i.i.d. draws with mean µ and variance σ² < ∞. Then the scaled, centered sample sums satisfy
$$\frac{1}{\sigma n^{1/2}}\sum_{j=1}^{n}(X_j - \mu) \overset{d}{\to} N(0,1), \qquad (8.3)$$
as n → ∞.
This theorem places the Gaussian distribution at the heart of much of statistics.
Brownian motion also plays a large role, extending the Lindeberg–Lévy theorem to processes. To start, construct an artificial continuous time process $\{S_n(t)\}_{t\in[0,1]}$ formed by the partial sums
$$S_n(t) = \frac{1}{\sigma n^{1/2}}\sum_{j=1}^{\lfloor tn\rfloor}(X_j - \mu), \qquad t \in (0,1],$$
using 100t% of the data. Here ⌊x⌋ denotes the integer part of x. At each individual t, $S_n(t)$ is the statistic in (8.3) built from the first ⌊tn⌋ observations (but scaled by $\sqrt{n}$ rather than $\sqrt{\lfloor tn\rfloor}$). So as n → ∞ for fixed t, the Lindeberg–Lévy CLT applies, marginally,
$$S_n(t) \overset{d}{\to} N(0, t),$$
as n → ∞.
But what happens to the entire $\{S_n(t)\}_{t\in[0,1]}$ process? The functional central limit theorem, which is also called Donsker's invariance principle, provides the answer.
Theorem 8.2.9. [Donsker's invariance principle] Assume that the sequence $\{X_j\}_{j=1}^{\infty}$ are i.i.d. draws with mean µ and variance σ² < ∞; then on the space C(0,1) of continuous functions on the unit interval, using a sup-norm metric, $\{S_n(t)\}_{t\in[0,1]}$ converges in distribution to $\{B(t)\}_{t\in[0,1]}$. This is often written as
$$\{S_n(t)\}_{t\in[0,1]} \Rightarrow \{B(t)\}_{t\in[0,1]}.$$
Proof. See e.g. Billingsley (1999).
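A minimal sketch of the construction in Donsker's invariance principle, using deliberately non-Gaussian i.i.d. draws (the exponential distribution is an illustrative choice):

```python
import numpy as np

def partial_sum_process(X, mu, sigma):
    """t_j = j/n and S_n(t_j) = sum_{i <= j} (X_i - mu) / (sigma * sqrt(n))."""
    n = len(X)
    S = np.concatenate([[0.0], np.cumsum(X - mu)]) / (sigma * np.sqrt(n))
    return np.arange(n + 1) / n, S

rng = np.random.default_rng(3)
X = rng.exponential(scale=1.0, size=10_000)  # i.i.d., mean 1, variance 1, not Gaussian
t, S = partial_sum_process(X, mu=1.0, sigma=1.0)
print(S[-1], S[len(S) // 2])  # single draws, approximately N(0, 1) and N(0, 1/2)
```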

8.3 Stochastic integration

Recall, the Riemann–Stieltjes integral is the limit of
$$S(f, g, \tau_n) = \sum_{j=1}^{n}f(t_{j-1})\{g(t_j) - g(t_{j-1})\}, \quad \text{where } \tau_n = \{a = t_0 < t_1 < ... < t_n = b\},$$
as $\|\tau_n\| \to 0$, under the conditions:
• f, g do not have any discontinuities at the same point on [0,1];
• f has finite p-variation and g has finite q-variation, for p, q > 0 such that
$$p^{-1} + q^{-1} > 1. \qquad (8.4)$$
Now we are going to think of the integrand $\{f(t)\}_{t\in[0,1]}$ and integrator $\{B(t)\}_{t\in[0,1]}$ as continuous time stochastic processes adapted to $\{\mathcal{F}_t\}_{t\in[0,1]}$ (so this filtration must contain at least the history of the integrand and integrator). We are eventually going to get to an Ito integral,
$$\int_0^1 f(u)\,dB(u).$$
In that case the process
$$Y(t) = \int_0^t f(u)\,dB(u)$$
is a martingale so long as E[|Y(t)|] exists, which will hold if $\int_0^t E[f(u)^2]\,du < \infty$ (due to isometry). This is often written in the shorthand
$$Y(t) = f\cdot B(t).$$

However, this is not so easy. Brownian motion has continuous sample paths, so the first bullet point of the conditions for a Riemann–Stieltjes integral is dealt with. But Brownian motion is not of finite variation, so we need to be careful. All we have is q = 2. This rules out allowing p = 2, which would happen if, for example, f(t) = B(t), the same Brownian motion, i.e.
$$\int_0^1 B(u)\,dB(u).$$
h 8.3.1. If you mis-time the integrand to, for example,
$$\sum_{j=1}^{n}\left[\{f(t_{j-1}) + f(t_j)\}/2\right]\{B(t_j) - B(t_{j-1})\},$$
that yields, in the limit, a totally different integral: not an Ito integral but a Fisk-Stratonovich integral. Using $f(t_{j-1})$ is vital; it makes the integrand previsible.

8.3.1 Simple process

To tackle this problem about $\{f(t)\}_{t\in[0,1]}$, think about replacing it with a less rough version. For a fixed n, build a "simple process" $\{f^{(n)}(t)\}_{t\in[0,1]}$:
$$f^{(n)}(t) = f(t_{j-1}), \qquad t_{j-1} \le t < t_j, \quad j = 1, ..., n,$$
which is a càdlàg step function (like a distribution function). Then, for fixed n, the
$$TV(f^{(n)}) = \sup_{\tau} TV(f^{(n)}, \tau) = \sum_{j=1}^{n}|f(t_j) - f(t_{j-1})| < \infty,$$
so it is of finite variation, even if f is not!
As $\{f^{(n)}(t)\}_{t\in[0,1]}$ is of finite variation, the Riemann–Stieltjes integral
$$\sum_{j=1}^{n}f^{(n)}(t_{j-1})\{B(t_j) - B(t_{j-1})\} \qquad (8.5)$$
can be built. We will write it as
$$\int_0^1 f^{(n)}(u)\,dB(u).$$
Integrating with respect to Brownian motion is usually called an Ito integral, here integrating $\{f^{(n)}(t)\}_{t\in[0,1]}$ by $\{B(t)\}_{t\in[0,1]}$.

8.3.2 Limit of Ito integrals of simple processes

We are going to show that as n → ∞, the sequence of Riemann–Stieltjes integrals
$$\int_0^1 f^{(n)}(u)\,dB(u)$$
converges in mean square to a unique limit, the integral of $\{f(t)\}_{t\in[0,1]}$ by $\{B(t)\}_{t\in[0,1]}$, written
$$\int_0^1 f(u)\,dB(u).$$
This is an Ito integral, but now $\{f(t)\}_{t\in[0,1]}$ does not need to be of finite variation. To do this we need to make a couple of assumptions. Assume that:
• $\{f(t), B(t)\}_{t\in[0,1]}$ are adapted to $\{\mathcal{F}_t\}$;
• the
$$\int_0^1 E[f(u)^2]\,du < \infty.$$

To start the work, think of the sum (8.5). We are going to see three properties:
1. The increment $f^{(n)}(t_{j-1})\{B(t_j) - B(t_{j-1})\}$ has
$$E\left[f^{(n)}(t_{j-1})\{B(t_j) - B(t_{j-1})\}\,\middle|\,\mathcal{F}_{t_{j-1}}\right] = f^{(n)}(t_{j-1})\,E\left[\{B(t_j) - B(t_{j-1})\}\,\middle|\,\mathcal{F}_{t_{j-1}}\right] = 0,$$
hence is a martingale difference, as $E\left|f^{(n)}(t_{j-1})\{B(t_j) - B(t_{j-1})\}\right| < \infty$ due to
$$\mathrm{Var}\left[f^{(n)}(t_{j-1})\{B(t_j) - B(t_{j-1})\}\right] = E\left[f^{(n)}(t_{j-1})^2\,\mathrm{Var}\left(\{B(t_j) - B(t_{j-1})\}\,\middle|\,\mathcal{F}_{t_{j-1}}\right)\right] = (t_j - t_{j-1})\,E\left[f^{(n)}(t_{j-1})^2\right],$$
which is finite by the second bullet point.
2. From 1, the sum (8.5) is a martingale with respect to $\{\mathcal{F}_t\}$.
3. Think of (8.5) as a partial sum, where the partial sum is a martingale with respect to $\{\mathcal{F}_t\}$, and so
$$\mathrm{Var}\left(\int_0^1 f^{(n)}(u)\,dB(u)\right) = \sum_{j=1}^{n}E\left[f^{(n)}(t_{j-1})^2\right](t_j - t_{j-1}) = \int_0^1 E[f^{(n)}(u)^2]\,du.$$
Expressing the variance of $f^{(n)}\cdot B(1)$ as $\int_0^1 E[f^{(n)}(u)^2]\,du$ is called $L_2$ isometry. It is an important feature of Ito integrals.

Under the two bullet point assumptions, it turns out it is always possible to find a sequence of simple processes $\{f^{(n)}(t)\}_{t\in[0,1]}$ such that
$$\int_0^1 E\left[\left\{f^{(n)}(u) - f(u)\right\}^2\right]du \to 0.$$
A proof of this is in Kloeden and Platen (1992), Lemma 3.2.1. Thus $\{f^{(n)}(t)\}_{t\in[0,1]}$ is a Cauchy sequence, approximating $\{f(t)\}_{t\in[0,1]}$ arbitrarily well as n goes to infinity in $L_2$.
Applying Doob's maximal quadratic inequality from equation (2.7), that is, for any square integrable martingale,
$$E\left[\sup_{s\le t}Y_s^2\right] \le 4E[Y_t^2],$$
for any $k, n \in \mathbb{N}$,
$$E\left[\sup_{t\in[0,1]}\left(\int_0^t f^{(n+k)}(u)\,dB(u) - \int_0^t f^{(n)}(u)\,dB(u)\right)^2\right] \le 4E\left[\left(\int_0^1 f^{(n+k)}(u)\,dB(u) - \int_0^1 f^{(n)}(u)\,dB(u)\right)^2\right] = 4\int_0^1 E\left[\left\{f^{(n+k)}(u) - f^{(n)}(u)\right\}^2\right]du,$$
using isometry. Letting k → ∞,
$$E\left[\sup_{t\in[0,1]}\left(\int_0^t f(u)\,dB(u) - \int_0^t f^{(n)}(u)\,dB(u)\right)^2\right] \le 4\int_0^1 E\left[\left\{f(u) - f^{(n)}(u)\right\}^2\right]du \to 0.$$
Hence we can think of
$$\int_0^1 f(u)\,dB(u)$$
as a valid integral, the mean square error limit of a sequence of Riemann–Stieltjes integrals. This limit is called the Ito integral of $\{f(t)\}_{t\in[0,1]}$ by $\{B(t)\}_{t\in[0,1]}$. That was the goal of this section.

More broadly, instead of working with the sums, one can define a partial sum process
$$Y^{(n)}(t) = \sum_{j=1}^{n}1(t_j \le t)\,f^{(n)}(t_{j-1})\{B(t_j) - B(t_{j-1})\},$$
and it is possible to show, if $\int_0^T E[f(u)^2]\,du < \infty$, that
$$\lim_{\|\tau_n\|\to0}E\left[\sup_{0\le t\le T}\left(Y^{(n)}(t) - Y(t)\right)^2\right] = 0,$$
where
$$Y(t) = \int_0^t f(u)\,dB(u),$$
again an Ito integral for each t.
The Ito integral is a martingale and obeys the Ito isometry: for every t,
$$\mathrm{Var}\left(\int_0^t f(u)\,dB(u)\right) = \int_0^t E[f(u)^2]\,du.$$

Example 8.3.2. Suppose $\{f(t)\}_{t\in[0,T]}$ is $\{B(t)\}_{t\in[0,T]}$; then
$$Y(t) = \int_0^t B(u)\,dB(u).$$
Hence $\{Y(t)\}_{t\in[0,T]}$ is not a Gaussian process, but it is a martingale with
$$\mathrm{Var}(Y(t)) = \int_0^t E[f(u)^2]\,du = \int_0^t u\,du = t^2/2.$$

In more advanced work, one can replace the condition
$$\int_0^T E[f(u)^2]\,du < \infty$$
with the locally bounded condition
$$\int_0^T f(u)^2\,du < \infty,$$
but that is beyond the scope of our work here.

8.3.3 Quadratic covariation

Quadratic variation extends to quadratic covariation.
Definition 8.3.3. For the processes $\{X(t), Y(t)\}_{t\in[0,T]}$ adapted to $\{\mathcal{F}(t)\}_{t\in[0,T]}$, the mean square limit of
$$QV(X, Y, \tau_n) = \sum_{j=1}^{n}\{X(t_j) - X(t_{j-1})\}\{Y(t_j) - Y(t_{j-1})\},$$
as $\|\tau_n\| \to 0$, is the quadratic covariation $[X,Y](t)$.
By the Cauchy–Schwarz inequality,
$$QV(X, Y, \tau_n)^2 \le QV(X, X, \tau_n)\,QV(Y, Y, \tau_n),$$
so
$$[X,Y](t)^2 \le [X,X](t)\times[Y,Y](t). \qquad (8.6)$$
Another super helpful result is that
$$[X+Y, X+Y] = [X,X] + [Y,Y] + 2[X,Y],$$
and so
$$[X,Y] = \frac{1}{4}\{[X+Y, X+Y] - [X-Y, X-Y]\}.$$
The latter result is called the polarization identity.

Example 8.3.4. If $\{g(t)\}_{t\in[0,T]}$ is a continuous function of finite variation, then in Section 8.1.2 we saw that $[g,g](t) = 0$. Thus, using (8.6),
$$[g,B](t) = 0.$$
This means that if
$$Y(t) = g(t) + \int_0^t \sigma(u)\,dB(u),$$
then
$$[Y,Y](t) = \int_0^t \sigma^2(u)\,du.$$
In particular this holds if
$$Y(t) = \int_0^t \mu(u)\,du + \int_0^t \sigma(u)\,dB(u),$$
e.g. the process has time-varying drift.

8.4 Stochastic differential equation

Having defined a stochastic integral, it is possible to work on the stochastic differential equation (SDE) corresponding to this solution. In this section we will work with SDEs associated with Ito processes.
Definition 8.4.1. The $\{Y(t)\}_{t\in[0,T]}$ process
$$Y(t) = \int_0^t \mu(u)\,du + \int_0^t \sigma(u)\,dB(u) \qquad (8.7)$$
is called an Ito process, assuming $\{\mu(t), \sigma(t), B(t)\}_{t\in[0,T]}$ are adapted to $\{\mathcal{F}(t)\}_{t\in[0,T]}$.
Think of the corresponding stochastic differential equation for $\{Y(t)\}$ in terms of $\{\mu(t), \sigma(t), B(t)\}_{t\in[0,T]}$ as
$$dY(t) = \mu(t)\,dt + \sigma(t)\,dB(t), \qquad (8.8)$$
where the notation dY(t) is thought of as $Y(t) - Y(t - dt)$. The SDE (8.8) is short-hand for (8.7): the solution (8.7) is the fundamental object, and the SDE is shorthand for it.

Example 8.4.2. [Geometric Brownian motion] One of the most used Ito processes is geometric Brownian motion,
$$dY(t) = \mu Y(t)\,dt + \sigma Y(t)\,dB(t), \qquad Y(0) > 0, \quad \sigma \ge 0, \quad \mu \in \mathbb{R};$$
then $\{Y(t)\}_{t\ge0}$ is non-negative with probability one. This is a standard continuous time model of prices in financial economics. It is often written, in short-hand, as
$$\frac{dY(t)}{Y(t)} = \mu\,dt + \sigma\,dB(t),$$
or more abstractly
$$\frac{dY}{Y} = \mu\,dt + \sigma\,dB.$$
Later we will see it has the solution
$$Y(t) = Y(0)\exp\left\{(\mu - \sigma^2/2)t + \sigma B(t)\right\}.$$
At first sight, the appearance of the σ²/2 term in the solution is not obvious. More fascinating is that an innocent reading of the SDE, with shocks σY(t)dB(t), suggests the process $\{Y(t)\}_{t\ge0}$ could go negative, as the increments of Brownian motion are on the real line; however, this is not true.

h 8.4.3. Think of the Brownian increment dB(t). It is quite subtle. Go back to quadratic variation:
$$\lim_{\|\tau_n\|\to0}E\left[\left(\sum_{j=1}^{n}\{B(t_j) - B(t_{j-1})\}^2 - [B,B](t)\right)^2\right] = 0. \qquad (8.9)$$
In terms of the differential notation dB(t) ∼ N(0, dt), this means that if we see a $\{dB(t)\}^2$ then, to get the right answer for stochastic integration, we must think of (8.9) and so
$$\int_0^t\{dB(u)\}^2 = [B,B](t) = t = \int_0^t d[B,B](u),$$
or
$$\{dB(u)\}^2 = du = d[B,B](u).$$
When you first see this, it can appear mighty odd, as it looks for all the world like $\{dB(u)\}^2 \sim \chi^2_1\,du$. But the SDE notation is really just a shorthand for the fundamental object: the stochastic integral. To get the right stochastic integral you need to take $\{dB(u)\}^2 = du$. This is very important and not obvious.

8.5 Ito's formula

8.5.1 Main result

Now suppose
$$A(t) = A\{Y(t), t\},$$
a smooth function of Y(t) and t; then what SDE does $\{A(t)\}$ follow? This is a fundamental question! In short-hand, what is the chain rule for Ito processes?
To start, recall that if $A \in C^{1,1}$ and Y(t) is continuously differentiable, then Newton's chain rule says that
$$dA(t) = A_t(t)\,dt + A_Y(t)\,dY(t), \quad \text{where } A_t(t) = \frac{\partial A\{Y(t),t\}}{\partial t}, \quad A_Y(t) = \frac{\partial A\{Y(t),t\}}{\partial Y(t)}.$$
But that does not work for Ito processes, which are not differentiable! The extension of Newton's result to Ito processes is called Ito's formula (or Ito's lemma).

Lemma 6. [Ito's Formula] If A(y, t) ∈ C^{2,1}, {µ(t), σ(t), B(t)}_{t≥0} is adapted to {F_t}_{t≥0}, and A(t) = A{Y(t), t}, then Ito's Formula says that

A(t) = A(0) + ∫₀ᵗ {A_u(u) + (1/2)A_YY(u)σ²(u)} du + ∫₀ᵗ A_Y(u)dY(u),    where A_YY(t) = ∂²A{Y(t), t}/∂Y(t)².

Proof. Sketch proof. Do a formal second order expansion

dA(t) = A_t(t)dt + (1/2)A_tt(t)(dt)² + A_Y(t)dY(t) + (1/2)A_YY(t)(dY(t))² + A_Yt(t)(dY(t))(dt),

where A_tt(t) = ∂²A{Y(t), t}/∂t² and A_Yt(t) = ∂²A{Y(t), t}/∂t∂Y(t). We only keep O_p(dt) terms. Then

dA{Y(t), t} = A_t(t)dt + A_Y(t)dY(t) + (1/2)A_YY(t){dY(t)}².

But what is {dY(t)}²? From Biohazard 8.4.3,

{dY(t)}² = d[Y, Y](t) = σ²(t)dt,

so

dA{Y(t), t} = A_t(t)dt + (1/2)A_YY(t)d[Y, Y](t) + A_Y(t)dY(t).

This yields the stated result, by rearranging and integrating.

It is the crucial {dY(t)}² = d[Y, Y](t) step which is the heart of Ito's Formula.

8.5.2 Examples of the application of Ito's formula

Here we will discuss the four examples of the use of Ito's formula given in Table 8.1. Some of them just derive the relevant SDE; others use the form of the SDE to solve for the stochastic integral itself.

Idea                | Start                            | Transform A(y, t) | Result dA or A(t)
geometric BM        | dY(t) = µdt + σdB(t)             | exp(y)            | dA/A = (µ + σ²/2)dt + σdB
log of geometric BM | dY(t) = Y(t)µdt + Y(t)σdB(t)     | log(y)            | A(t) = A(0) + (µ − σ²/2)t + σB(t)
squaring BM         | dY(t) = dB(t)                    | y²                | A(t) = A(0) + t + 2∫₀ᵗ B(s)dB(s)
Ornstein-Uhlenbeck  | dY(t) = −λ{Y(t) − µ}dt + σdB(t)  | ye^{λt}           | A(t) = A(0) + µ(e^{λt} − 1) + σ∫₀ᵗ e^{λs}dB(s)

Table 8.1: Four examples of the use of Ito's Formula.

Example 8.5.1. [geometric BM] Assume that

A(y, t) = exp(y)  and  dY(t) = µdt + σdB(t).

The stochastic integral has the solution

A(t) = A(0) exp{µt + σB(t)},    A(0) = e^{Y(0)}.

But what SDE does {A(t)} follow? Here A_t = 0, A_y = A_yy = A, so

dA(t) = (1/2)A(t)σ²dt + A(t)dY(t) = A(t)(µ + σ²/2)dt + A(t)σdB(t),    (8.10)

geometric Brownian motion. In the special case where

µ = −σ²/2,

then {A(t)}_{t≥0} is a martingale with respect to {F_t}_{t≥0}.
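A simulation sketch of the martingale case (σ and t below are arbitrary illustrative choices): with µ = −σ²/2 the lognormal mean correction exactly offsets the drift, so E[A(t)] stays at A(0).

```python
# Sketch: with mu = -sigma^2/2, A(t) = A(0) exp{mu t + sigma B(t)} has mean A(0).
import numpy as np

rng = np.random.default_rng(3)
sigma, t, n_paths = 0.8, 2.0, 1_000_000
B_t = rng.normal(0.0, np.sqrt(t), n_paths)         # B(t) ~ N(0, t)
A_t = np.exp(-0.5 * sigma ** 2 * t + sigma * B_t)  # take A(0) = 1
print(A_t.mean())                                  # close to 1 = A(0)
```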

Example 8.5.2. [log of geometric BM] Assume that

A(y, t) = log(y)  and  dY(t) = Y(t)µdt + Y(t)σdB(t),

then A_t = 0, A_y = 1/y, A_yy = −1/y², so

dA(t) = −(1/2) {Y(t)σ}²/Y(t)² dt + (1/Y(t))dY(t) = (µ − σ²/2)dt + σdB(t).

Thus we have the solution, with A(t) = log{Y(t)},

A(t) = A(0) + (µ − σ²/2)t + σB(t),

a Brownian motion with drift. If µ = σ²/2 then {A(t)}_{t≥0} is a martingale with respect to {F_t}_{t≥0}. Exponentiating, geometric Brownian motion {Y(t)}_{t≥0} has the solution:

Y(t) = Y(0) exp{(µ − σ²/2)t + σB(t)}.

Example 8.5.3. [squaring BM] Assume

A(y, t) = y²  and  dY(t) = dB(t),

then A_y = 2y, A_yy = 2, A_t = 0, so

dA(t) = dt + 2B(t)dB(t).

Then the solution is

A(t) = t + 2∫₀ᵗ B(u)dB(u) = [B, B](t) + 2∫₀ᵗ B(u)dB(u),

or, beautifully,

2∫₀ᵗ B(u)dB(u) = B(t)² − t.
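A numerical sketch of this identity (seed and partition size are arbitrary): approximating the Ito integral by left-point Riemann sums reproduces B(t)² − t; evaluating B at the right end point or midpoint would give a different limit.

```python
# Sketch: check 2 * \int_0^1 B dB = B(1)^2 - 1 via left-point (Ito) sums.
import numpy as np

rng = np.random.default_rng(4)
n, t = 1_000_000, 1.0
dB = rng.normal(0.0, np.sqrt(t / n), n)
B = np.concatenate([[0.0], np.cumsum(dB)])
ito_sum = 2.0 * np.sum(B[:-1] * dB)   # integrand evaluated at the left end point
print(ito_sum, B[-1] ** 2 - t)        # the two numbers nearly coincide
```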

In more mathematical treatments of stochastic integrals,

Y(t)² − 2∫₀ᵗ Y(u)dY(u)

is often used as the definition of [Y, Y](t), the quadratic variation up to time t.

Example 8.5.4. [Ornstein-Uhlenbeck process] Start with

dY(t) = −λ{Y(t) − µ}dt + σdB(t).

The solution to this SDE is called the Ornstein-Uhlenbeck process. The form of this SDE reminds me of the error correction mechanism we saw in Definition 4.1.3, a reparameterization of an autoregression, where the differences of the series are regressed on lagged levels. How to solve this SDE? Let

A(y, t) = ye^{λt}

and A_y = e^{λt}, A_yy = 0, A_t = λye^{λt}; then

dA(t) = A_t(t)dt + A_Y(t)dY(t) = λY(t)e^{λt}dt + e^{λt}{−λY(t)dt + λµdt + σdB(t)} = e^{λt}λµdt + e^{λt}σdB(t).

So solving,

A(t) = A(0) + λµ∫₀ᵗ e^{λu}du + σ∫₀ᵗ e^{λu}dB(u) = A(0) + µ(e^{λt} − 1) + σ∫₀ᵗ e^{λu}dB(u).

Since Y(t) = e^{−λt}A(t) and A(0) = Y(0), this delivers a continuous time Gaussian autoregression:

Y(t)|Y(0) ∼ N( µ + e^{−λt}{Y(0) − µ}, (σ²/2λ)(1 − e^{−2λt}) ),    as    σ²∫₀ᵗ e^{−2λ(t−s)}ds = (σ²/2λ)(1 − e^{−2λt}).

Letting t become large shows the stationary distribution is N(µ, σ²/2λ).
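Because the conditional law is Gaussian with an autoregressive mean, the process can be simulated exactly on a grid as an AR(1). A sketch (λ, µ, σ and the grid spacing are illustrative assumptions):

```python
# Sketch: exact simulation of the OU process on a grid of spacing dt, using
# Y(t+dt)|Y(t) ~ N(mu + phi {Y(t)-mu}, sigma^2 (1-phi^2)/(2 lambda)), phi = e^{-lambda dt}.
import numpy as np

rng = np.random.default_rng(5)
lam, mu, sigma = 2.0, 1.0, 0.5     # assumed parameter values
dt, n = 0.01, 200_000
phi = np.exp(-lam * dt)            # AR(1) coefficient over one step
sd = sigma * np.sqrt((1.0 - phi ** 2) / (2.0 * lam))
Y = np.empty(n)
Y[0] = mu                          # start at the stationary mean
for i in range(1, n):
    Y[i] = mu + phi * (Y[i - 1] - mu) + sd * rng.normal()
print(Y.var(), sigma ** 2 / (2.0 * lam))   # both near the stationary variance
```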

8.6 Applications in time series


8.6.1 Unit root distribution

In Section 4.5.3 we estimated ϕ1 in an autoregression. In the stationary case, this yielded a Gaussian central

limit theorem as T → ∞.
Now suppose the data is non-stationary. This time it will come from a Gaussian random walk

Y_t = Y_{t−1} + ε_t,    ε_t ∼ N(0, σ²),    Y_0 = 0,

but we still estimate ϕ₁ by least squares:

ϕ̂_OLS = (Σ_{t=1}^T Y_{t−1}Y_t) / (Σ_{t=1}^T Y²_{t−1}) = 1 + (Σ_{t=1}^T Y_{t−1}(Y_t − Y_{t−1})) / (Σ_{t=1}^T Y²_{t−1}).

Thus

T(ϕ̂_OLS − 1) = [ Σ_{t=1}^T (Y_{t−1}/√T){(Y_t/√T) − (Y_{t−1}/√T)} ] / [ T⁻¹ Σ_{t=1}^T (Y_{t−1}/√T)² ].

Now write

Y_t/√T = (1/√T) Σ_{j=1}^t ε_j = σ Σ_{j=1}^t {B(j/T) − B((j−1)/T)} = σB(t/T),

then

Σ_{t=1}^T (Y_{t−1}/√T){(Y_t/√T) − (Y_{t−1}/√T)} = σ² Σ_{t=1}^T B((t−1)/T){B(t/T) − B((t−1)/T)},

T⁻¹ Σ_{t=1}^T (Y_{t−1}/√T)² = σ² Σ_{t=1}^T B((t−1)/T)²{(t/T) − (t−1)/T},

so recognising these as an Ito integral and a Riemann integral, we get, as T → ∞,

Σ_{t=1}^T B((t−1)/T){B(t/T) − B((t−1)/T)} → ∫₀¹ B(u)dB(u)  and  Σ_{t=1}^T B((t−1)/T)²{(t/T) − (t−1)/T} → ∫₀¹ B(u)²du,

both in mean square, so

T(ϕ̂_OLS − 1) →_d (∫₀¹ B(u)dB(u)) / (∫₀¹ B(u)²du).

Using Example 8.5.3, we can simplify the numerator, so

T(ϕ̂_OLS − 1) →_d ( (1/2){B(1)² − 1} ) / ( ∫₀¹ B(u)²du ).

The law of the right hand side is often called the unit root distribution. This is a highly skewed distribution. Some of the work on unit roots is reviewed by Stock (1994).
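A Monte Carlo sketch of this result (the sample size and replication count below are arbitrary choices): simulate Gaussian random walks, compute T(ϕ̂_OLS − 1) for each, and look at the skewness of the resulting draws.

```python
# Sketch: Monte Carlo draws of T(phi_hat - 1) under a Gaussian random walk,
# whose law approximates the unit root distribution (1/2){B(1)^2-1}/\int B^2.
import numpy as np

rng = np.random.default_rng(6)
T, n_rep = 500, 5_000
stats = np.empty(n_rep)
for r in range(n_rep):
    Y = np.cumsum(rng.normal(0.0, 1.0, T))   # random walk with Y_0 = 0
    num = np.sum(Y[:-1] * (Y[1:] - Y[:-1]))  # the Y_0 = 0 term contributes nothing
    den = np.sum(Y[:-1] ** 2)
    stats[r] = T * num / den                 # T(phi_hat - 1)
print(np.mean(stats), np.mean(stats < 0.0))  # negative mean; most mass below zero
```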
More broadly, suppose there is heteroskedasticity with

(Y_t/√T) − (Y_{t−1}/√T) = σ((t−1)/T){B(t/T) − B((t−1)/T)}.

Assume here {σ(t), B(t)}_{t∈[0,1]} is adapted to {F_t}_{t∈[0,1]} and write

X((t−1)/T) = Y_{t−1}/√T;

then

T(ϕ̂_OLS − 1) →_d (∫₀¹ X(u)σ(u)dB(u)) / (∫₀¹ X(u)²du),    where X(t) = ∫₀ᵗ σ(u)dB(u).

Work on related problems includes Boswijk, Cavaliere, Rahbek, and Taylor (2016) and the references contained within.

8.6.2 Realized volatility

To add

8.7 Recap

The main topics are listed in Table 8.2.

Formula or idea                                          | Description or name
⟨Y, Y⟩(t)                                                | Angle bracket
                                                         | Bounded variation
B(t)                                                     | Brownian motion
dA(t) = A_t(t)dt + A_Y(t)f(t)dg(t)                       | Chain rule
                                                         | Finite variation
dY/Y = µdt + σdB                                         | Geometric Brownian motion
                                                         | Ito's formula
∫₀ᵗ µ(u)du + ∫₀ᵗ σ(u)dB(u)                               | Ito process
∥τ_n∥                                                    | Norm of partition
dY(t) = −λ{Y(t) − µ}dt + σdB(t)                          | Ornstein-Uhlenbeck process
τ_n = {0 = t₀ < t₁ < ... < t_n = t}                      | Partition
lim_{∥τ_n∥→0} Σ_{j=1}^n f(t_{j−1}){g(t_j) − g(t_{j−1})}  | Riemann–Stieltjes integral
                                                         | Quadratic covariation
                                                         | Quadratic variation
                                                         | Stochastic integral

Table 8.2: Main ideas and notation in Chapter 8.


Chapter 12

Index

⟨Y, Y ⟩t . See angle bracket process

γs . See autocovariance function

ρs . See autocorrelation function

ϕ(L). See autoregressive lag polynomial

B. See Brownian motion

∆. See difference operator

Ft . See filtration

∆. See jump

D_KL(F||G). See Kullback-Leibler divergence


∫₀ᵗ σ(t)dB(t). See Ito integral

L. See lag operator

Yt,s (a). See lag-s potential outcome

θ(L). See moving average lag polynomial

∥X∥_p = E[|X|^p]^{1/p}. See norm on the space L^p

∥τn ∥. See norm of partition


∫₀ᵗ µ(t)dt. See Riemann integral

∫₀ᵗ f(t)dg(t). See Riemann-Stieltjes integral

[Y, Y ]t . See quadratic variation process

∆S . See seasonal difference operator

Absolute summability sequence. See summability

ACF. See autocorrelation

Action

Adam’s law


Adapted

Angle bracket

AR. See autoregression

ARCH. See autoregressive conditional heteroskedasticity

ARIMA. See autoregressive integrated moving average

ARMA. See autoregressive moving average

Assignment

Autocorrelation

Autocovariance

Autocovariance generating function

Autoregression

Autoregressive conditional heteroskedasticity

Autoregressive integrated moving average

Autoregressive moving average

Average treatment effect

Bandwidth

Bartlett kernel

Baum filter

Bellman equation

Bootstrap

Bounded variation. See finite variation

Brownian motion

Càdlàg

CARA. See constant absolute risk aversion

Cauchy sequence

Causal time series system

Chain rule

Characteristic function

Choice

Cointegration

Cointegration vector

Companion form

Complex conjugate

Complex random variables

Complex valued time series

Complex variables

Conditional moment condition

Consistency

Constant absolute risk aversion

Continuous function

Continuous sample path

Continuous time

Control

Controllability

Covariance stationarity

Cumulant

Cumulant function

Cramer’s representation

Cycle

DFT. See discrete Fourier transform

Difference equation

Difference

Difference operator

Dirichlet kernel

Discounted utility

Discrete Fourier transform

Doob’s decomposition

Doob’s inequality

Doob’s maximal quadratic inequality

Donsker’s theorem

Dynamic factor model

Dynamic programming

Eigenvalues

Ergodicity

Error correction mechanism

Expected loss

Expected utility

Filter

Filtration

Finite variation.

Forecast

Fourier transform

Frequency

Functional central limit theorem

Gamma process

GARCH. See generalized autoregressive conditional heteroskedasticity

Generalized autoregressive conditional heteroskedasticity

Geometric Brownian motion

Hamilton filter. See Baum filter

Hidden Markov model

Baum filter

Gaussian HMM

Gaussian state space

Kalman filter

Particle filter

Sequential Monte Carlo

Hilbert space

HMM. See Hidden Markov model

I(d). See integrated of order d

i.i.d. See independent and identically distributed

Impulse response function

Independent and identically distributed

Initial conditions

Integrated of order d

Invertible process

IRF. See impulse response function

Ito’s Formula

Ito process

Jumps

Kalman filter

Kernel spectral density estimator

Koopman-Durbin smoother

Kullback-Leibler divergence

Lag operator

Lag polynomial

Lag-s average treatment effect

Lag-s potential outcome

Lag-s randomization

Lag-s unconfoundedness

Lasso regression

Law of large numbers

Law of iterated expectations. See Adam’s law

Lead

Least squares

Lévy process

Likelihood function

Linear control

Linear projection

Linear state space model. See HMM

Local level model

Local linear trend model

Long memory

Long-run variance estimator

Loss function

MA. See moving average process

m-order dependent

m-order causality

m-order Markov

Markov chain

Markov decision process

Martingale

Martingale difference

Martingale transform

Maximum likelihood estimation

MDP. See Markov decision process

Method of moments

Moment condition

Moving average

Filter

Moving average process

Wold representation

Non-stationary

Norm of partition

Nowcasting

Optimal policy

Ornstein-Uhlenbeck process

Outcome

p-norm

p-variation

Partial autocorrelation

Particle filter

Partition

Periodogram

Poisson process

Potential outcome

Prediction decomposition

Predictor

Previsible

Principal component analysis

Probability integral transform

Quadratic covariation

Quadratic variation

Quasi-likelihood

Random walk

Realized volatility

Reinforcement learning

Riccati equation

Ridge regression

Riemann integral

Riemann-Stieltjes integral

Roots of polynomial

Sample mean

Sample autocovariance

Sample autocorrelation

SDE. See stochastic differential equation

Seasonal adjustment

Seasonal component

Seasonality

Sequential assignment mechanism

Sequential Monte Carlo. See particle filter

Sequential randomization

Shrinkage

Simple process

Smoother

Smooth trend model

Spectral density

Spline

Square summability sequence. See summability

State space model. See hidden Markov model

Stationarity

Covariance

Strict

Stochastic dynamic program

Stochastic integration

Stochastic trend model

Strict stationarity

Summability

Absolute

Square
Stochastic differential equation
Stochastic volatility

Structural moving average


Structural autoregression
SV. See stochastic volatility
Total variation

Trend
Trigonometric seasonality
Unit root

Utility
VAR. See Autoregression
Value function
Volatility. See stochastic volatility

Weak convergence
Wiener process. See Brownian motion
White noise

Wold decomposition
Yule-Walker equation
Bibliography

Andersen, T. G., T. Bollerslev, F. X. Diebold, and P. Labys (2001). The distribution of realized exchange rate volatility. Journal of the American Statistical Association 96, 42–55.
Andrews, D. W. K. (1991). Heteroskedasticity and autocorrelation consistent covariance matrix estimation.
Econometrica 59, 817–858.
Andrews, I., J. Stock, and L. Sun (2019). Weak instruments in IV regression: Theory and practice. Annual
Review of Economics 11, 727–753.
Andrieu, C., A. Doucet, and R. Holenstein (2010). Particle Markov chain Monte Carlo methods (with discussion). Journal of the Royal Statistical Society, Series B 72, 269–342.
Angrist, J. D., Ò. Jordà, and G. M. Kuersteiner (2018). Semiparametric estimates of monetary policy effects:
string theory revisited. Journal of Business & Economic Statistics 36, 371–387.
Angrist, J. D. and G. M. Kuersteiner (2011). Causal effects of monetary shocks: Semiparametric conditional
independence tests with a multinomial propensity score. Review of Economics and Statistics 93, 725–747.
Apostol, T. M. (1957). Mathematical Analysis. London: Addison-Wesley.
Barndorff-Nielsen, O. E. and N. Shephard (2002). Econometric analysis of realised volatility and its use in
estimating stochastic volatility models. Journal of the Royal Statistical Society, Series B 64, 253–280.
Bartlett, M. S. (1950). Periodogram analysis and continuous spectra. Biometrika 37, 1–16.
Bass, R. F. (2011). Stochastic Processes. Cambridge: Cambridge University Press.
Baum, L. E. and J. A. Eagon (1967). An inequality with applications to statistical estimation for probabilistic functions of Markov processes and to a model for ecology. Bulletin of the American Mathematical Society 73, 360–363.
Baum, L. E. and T. Petrie (1966). Statistical inference for probabilistic functions of finite state Markov chains.
The Annals of Mathematical Statistics 37, 1554–1563.
Baxter, M. and R. G. King (1999). Measuring business cycles: Approximate band-pass filters for economic
time series. The Review of Economics and Statistics 81, 575–593.
Billingsley, P. (1999). Convergence of Probability Measures (2 ed.). New York: Wiley.
Bladt, M. and A. J. McNeil (2022). Time series copula models using d-vines and v-transforms. Econometrics
and Statistics 24, 27–48.
Blitzstein, J. K. and J. Hwang (2019). Introduction to Probability (2 ed.). Chapman and Hall.
Blitzstein, J. K. and N. Shephard (2023). Introduction to statistical inference. Unpublished: Stat111 lecture
notes, Harvard University.
Bojinov, I., A. Rambachan, and N. Shephard (2021). Panel experiments and dynamic causal effects: A finite
population perspective. Quantitiative Economics 12, 1171–1196.
Bojinov, I. and N. Shephard (2019). Time series experiments and causal estimands: exact randomization
tests and trading. Journal of the American Statistical Association 114, 1665–1682.
Bollerslev, T. (1986). Generalized autoregressive conditional heteroskedasticity. Journal of Econometrics 31, 307–327.


Boswijk, H. P., G. Cavaliere, A. Rahbek, and A. M. Taylor (2016). Inference on co-integration parameters in heteroskedastic vector autoregressions. Journal of Econometrics 192, 64–85.
Broadberry, S., B. M. S. Campbell, A. Klein, M. Overton, and B. van Leeuwen (2015). British Economic
Growth: 1270–1870. Cambridge University Press.
Brown, B. M. (1971). Martingale central limit theorems. Annals of Mathematical Statistics 42, 59–66.
Campigli, F., G. Bormetti, and F. Lillo (2022). Measuring price impact and information content of trades in
a time-varying setting. Unpublished paper: University of Bologna.
Carter, C. K. and R. Kohn (1994). On Gibbs sampling for state space models. Biometrika 81, 541–553.
Chopin, N. and O. Papasphiliopoulos (2020). An Introduction to Sequential Monte Carlo. Springer.
Cox, D. R. (1958a). Planning of Experiments. Oxford: Wiley.
Cox, D. R. (1958b). The regression analysis of binary sequences (with discussion). Journal of the Royal
Statistical Society, Series B 20, 215–42.
Cox, D. R. (1981). Statistical analysis of time series: some recent developments. Scandinavian Journal of
Statistics 8, 93–115.
Coyle, D. (2015). GDP: A Brief but Affectionate History. Princeton University Press.
Cramer, H. and H. Wold (1936). Some theorems on distribution functions. Journal of the London Mathematical
Society 11, 290–294.
Dai, C., J. Heng, P. E. Jacob, and N. Whiteley (2022). An invitation to sequential Monte Carlo samplers. Journal of the American Statistical Association 117, 1587–1600.
de Jong, P. and N. Shephard (1995). The simulation smoother for time series models. Biometrika 82, 339–350.
DeLong, B. J. (2022). Slouching Towards Utopia: An Economic History of the Twentieth Century. Basic
Books.
Diebold, F. X., T. A. Gunther, and A. S. Tay (1998). Evaluating density forecasts with applications to financial risk management. International Economic Review 39, 863–883.
Doob, J. L. (1949). Application of the theory of martingales. In Actes du Colloque International Le Calcul
des Probabilities et ses applications: Lyon, 28 Juin – 3 Juillet, 1948 , 23–27.
Doob, J. L. (1953). Stochastic Processes. New York: John Wiley and Sons.
Dudley, R. and R. Norvaiša (1999). Product integrals, Young integrals and p-variation. In R. Dudley and
R. Norvaiša (Eds.), Differentiability of Six Operators on Nonsmooth Functions and p-variation, pp. 73–
208. New York: Springer-Verlag. Lecture Notes in Mathematics 1703.
Duffie, D. (2001). Dynamic Asset Pricing Theory. Princeton University Press.
Durbin, J. and S. J. Koopman (2002). A simple and efficient simulation smoother for state space time series
analysis. Biometrika 89, 603–616.
Durbin, J. and S. J. Koopman (2012). Time Series Analysis by State Space Methods (2 ed.). Oxford: Oxford
University Press.
Durbin, R., S. R. Eddy, A. Krogh, and G. Mitchison (1998). Biological sequence analysis: probability models
of proteins and nucleic acids. Cambridge University Press.
Engle, R. F. (1982). Autoregressive conditional heteroskedasticity with estimates of the variance of the United
Kingdom inflation. Econometrica 50, 987–1007.
Engle, R. F. and C. W. J. Granger (1987). Co-integration and error correction: representation, estimation
and testing. Econometrica 55, 251–276.
Engle, R. F. and V. Ng (1993). Measuring and testing the impact of news on volatility. Journal of Finance 48,
1749–1778.
Fearnhead, P. and H. R. Kunsch (2018). Particle filters and data assimilation. Annual Review of Statistics
and Its Applications 5, 421–449.
Fisher, R. A. (1925). Statistical Methods for Research Workers (1 ed.). London: Oliver and Boyd.

Flury, T. and N. Shephard (2011). Bayesian inference based only on simulated likelihood: particle filter
analysis of dynamic economic models. Econometric Theory 27, 933–956.
Frisch, R. (1933). Propagation Problems and Impulse Problems in Dynamic Economics. London: Allen and
Unwin.
Frühwirth-Schnatter, S. (1994). Data augmentation and dynamic linear models. Journal of Time Series
Analysis 15, 183–202.
Gneiting, T. and M. Katzfuss (2014). Probabilistic forecasts. Annual Review of Statistics and Its Application 1,
125–151.
Gordon, N. J., D. J. Salmond, and A. F. M. Smith (1993). A novel approach to nonlinear and non-Gaussian
Bayesian state estimation. IEE-Proceedings F 140, 107–113.
Gordon, R. J. (2016). The Rise and Fall of American Growth: The U.S. Standard of Living since the Civil
War. Princeton University Press.
Granger, C. W. J. (1981). Some properties of time series data and their use in econometric model specification.
Journal of Econometrics 16, 121–130.
Granger, C. W. J. and A. Andersen (1978). On the invertibility of time series models. Stochastic Processes
and their Applications 8, 87–92.
Green, P. and B. W. Silverman (1994). Nonparameteric Regression and Generalized Linear Models: A Rough-
ness Penalty Approach. London: Chapman & Hall.
Grenander, U. and M. Rosenblatt (1953). Statistical spectral analysis of time-series arising from stationary
stochastic processes. Annals of Mathematical Statistics 24, 537–558.
Haavelmo, T. (1943). The statistical implications of a system of simultaneous equations. Econometrica 11,
1–12.
Hamilton, J. (1989). A new approach to the economic analysis of nonstationary time series and the business
cycle. Econometrica 57, 357–384.
Hamilton, J. D. (1994). Time Series Analysis. Princeton: Princeton University Press.
Hansen, L. P. and T. J. Sargent (2014). Recursive Models of Dynamic Linear Economies. Princeton University
Press.
Hansen, L. P. and K. J. Singleton (1982). Generalized instrumental variables estimation of nonlinear rational
expectations models. Econometrica 50, 1269–1286.
Harvey, A. C. (1989). Forecasting, Structural Time Series Models and the Kalman Filter. Cambridge: Cam-
bridge University Press.
Harvey, A. C. (1993). Time Series Models (2 ed.). Hemel Hempstead: Harvester Wheatsheaf.
Harvey, A. C., E. Ruiz, and N. Shephard (1994). Multivariate stochastic variance models. Review of Economic
Studies 61, 247–264.
Hasbrouck, J. (1991a). Measuring the information content of stock trades. The Journal of Finance 46,
179–207.
Hasbrouck, J. (1991b). The summary informativeness of stock trades: An econometric analysis. The Review
of Financial Studies 4, 571–595.
Herbst, E. and F. Schorfheide (2015). Bayesian Estimation of DSGE Models. Princeton: Princeton University
Press.
Hindrayanto, I., J. A. D. Aston, S. J. Koopman, and M. Ooms (2013). Modelling trigonometric seasonal
components for monthly economic time series. Applied Economics 45, 3024–3034.
Hodrick, R. J. and E. C. Prescott (1997). Postwar U.S. business cycles: an empirical investigation. Journal of Money, Credit, and Banking 29, 1–16.
Jacob, P. E., J. O'Leary, and Y. F. Atchadé (2020). Unbiased Markov chain Monte Carlo methods with couplings. Journal of the Royal Statistical Society, Series B 82, 543–600.

Janson, S. (2021). A central limit theorem for m-dependent variables. Unpublished paper: Department of
Mathematics, Uppsala University.
Jørgensen, B. (1982). Statistical Properties of the Generalised Inverse Gaussian Distribution. New York:
Springer-Verlag.
Jordà, Ò. (2005). Estimation and inference of impulse responses by local projections. American Economic Review 95, 161–182.
Kalman, R. E. (1960). A new approach to linear filtering and prediction problems. Journal of Basic Engineering, Transactions ASME, Series D 82, 35–45.
Karatzas, I. and S. E. Shreve (1998). Methods of Mathematical Finance. New York: Springer–Verlag.
Kim, C.-J. and C. R. Nelson (1999). State-Space Models with Regime Switching. Classical and Gibbs-Sampling
Approaches with Applications. Cambridge: MIT.
Kim, S., N. Shephard, and S. Chib (1998). Stochastic volatility: likelihood inference and comparison with
ARCH models. Review of Economic Studies 65, 361–393.
Kim, S. J., K. Koh, S. Boyd, and D. Gorinevsky (2009). l1 trend filtering. SIAM Review 51, 339–360.
Kimmeldorf, G. S. and G. Wahba (1970). A correspondence between Bayesian estimation on stochastic pro-
cesses and smoothing by splines. The Annals of Mathematical Statistics 41, 495–502.
Kloeden, P. E. and E. Platen (1992). Numerical Solutions to Stochastic Differential Equations. New York:
Springer.
Kong, A., J. S. Liu, and W. H. Wong (1994). Sequential imputations and Bayesian missing data problems.
Journal of the American Statistical Association 89, 278–88.
Koopmans, T. C. (1950). Statistical Inference in Dynamic Economic Models, Volume 10. Wiley. Cowles Commission.
Lazarus, E., D. J. Lewis, and J. H. Stock (2021). The size-power tradeoff in HAR inference. Econometrica 89,
2497–2516.
Li, M. and S. J. Koopman (2021). Unobserved components with stochastic volatility: Simulation-based
estimation and signal extraction. Journal of Applied Econometrics 36, 614–627.
Liu, J. S. and R. Chen (1995). Blind deconvolution via sequential imputation. Journal of the American
Statistical Association 90, 567–76.
Liu, J. S. and R. Chen (1998). Sequential Monte Carlo methods for dynamic systems. Journal of the American
Statistical Association 93, 1032–1044.
Magnus, J. R. and H. Neudecker (2019). Matrix Differential Calculus with Applications in Statistics and
Econometrics (3 ed.). New York: Wiley.
Mandelbrot, B. B. (2021). The Fractal Geometry of Nature. Echo Point Books & Media.
Merton, R. (1969). Lifetime portfolio selection under uncertainty: the continuous time case. Review of Eco-
nomics and Statistics 51, 247–257.
Mikosch, T. (1998). Elementary Stochastic Calculus with Finance in View. Singapore: World Scientific.
Miller, J. W. (2018). A detailed treatment of Doob’s theorem. Unpublished paper: Harvard University.
Newey, W. K. and K. D. West (1987). A simple positive semi-definite, heteroskedasticity and autocorrelation
consistent covariance matrix. Econometrica 55, 703–708.
Neyman, J. (1923). On the application of probability theory to agricultural experiments. Essay on principles.
section 9. Statistical Science 5, 465–472. Originally published 1923, republished in 1990, translated by
Dorota M. Dabrowska and Terence P. Speed.
Olea, J. L. M. and M. Plagborg-Moller (2021). Local projection inference is simpler and more robust than
you think. Econometrica 89, 1789–1823.
Omori, Y., S. Chib, N. Shephard, and J. Nakajima (2007). Stochastic volatility with leverage: fast and
efficient likelihood inference. Journal of Econometrics 140, 425–449.

Parzen, E. (1962). On estimation of a probability density function and mode. The Annals of Mathematical
Statistics 33, 1065–1076.
Percival, D. B. and A. T. Walden (1993). Spectral Analysis for Physical Applications. Cambridge University
Press.
Percival, D. B. and A. T. Walden (2000). Wavelet Methods for Time Series Analysis. Cambridge: Cambridge
University Press.
Pitt, M. K. and N. Shephard (1999). Filtering via simulation: auxiliary particle filter. Journal of the American
Statistical Association 94, 590–599.
Plagborg-Moller, M. and C. K. Wolf (2021). Local projections and VARs estimate the same impulse reponse
functions. Econometrica 89, 955–980.
Priestley, M. B. (1981). Spectral Analysis and Time Series. London: Academic Press.
Protter, P. (2010). Stochastic Integration and Differential Equations: A New Approach (Third ed.). New
York: Springer-Verlag.
Rambachan, A. and N. Shephard (2021). When do common time series estimands have nonparametric causal
meaning? Unpublished paper: Department of Economics, Harvard University.
Romano, J. P. and M. Wolf (2000). A more general central limit theorem for m-dependent random variables with unbounded m. Statistics and Probability Letters 47, 115–124.
Rootzén, H. (1978). Extremes of moving averages of stable processes. Annals of Probability 6, 847–869.
Rosenblatt, M. (1952). Remarks on a multivariate transformation. Annals of Mathematical Statistics 23,
470–472.
Rosenblatt, M. (1956). Remarks on some nonparametric estimates of a density function. The Annals of
Mathematical Statistics 27, 832–837.
Shannon, C. E. (1948). A mathematical theory of communication. The Bell System Technical Journal 27,
379–423.
Shephard, N. (1994). Partial non-Gaussian state space. Biometrika 81, 115–131.
Shephard, N. (2015). Martingale unobserved component models. In S. J. Koopman and N. Shephard (Eds.),
Unobserved components and time series econometrics, Chapter 10. Oxford University Press.
Shreve, S. (2004). Stochastic Calculus for Finance II: Continuous Time Models. Springer.
Sims, C. A. (1980). Macroeconomics and reality. Econometrica 48, 1–48.
Slutsky, E. E. (1927). The summation of random causes as the source of cyclic processes. Note: Published in
Russian in 1927 and reprinted in Econometrica in 1937.
Steele, J. M. (2001). Stochastic Calculus and Financial Applications. New York: Springer.
Stock, J. H. (1994). Unit roots, structural breaks and trends. In R. F. Engle and D. L. McFadden (Eds.),
Handbook of Econometrics, Volume 4, pp. 2739–2841. Elsevier.
Stock, J. H. and M. Watson (2016a). Core inflation and trend inflation. Review of Economics and Statistics 98,
770–784.
Stock, J. H. and M. Watson (2016b). Dynamic factor models, factor-augmented vector autoregressions, and
structural vector autoregressions in macroeconomics. In J. B. Taylor and H. Uhlig (Eds.), Handbook of
Macroeconomics, Volume 2A, pp. 415–525. Elsevier.
Stock, J. H. and M. W. Watson (2007). Why has U.S. inflation become harder to forecast? Journal of Money,
Credit, and Banking 39, 3–34.
Subba Rao, S. (2022). A course in time series analysis. Unpublished: Texas A&M.
Taylor, S. J. (1982). Financial returns modelled by the product of two stochastic processes — a study of
daily sugar prices 1961-79. In O. D. Anderson (Ed.), Time Series Analysis: Theory and Practice, 1, pp.
203–226. Amsterdam: North-Holland.

Tibshirani, R. J. (2014). Adaptive piecewise polynomial estimation via trend filtering. The Annals of Statis-
tics 42, 285–323.
Watson, M. W. (2007). How accurate are real-time estimates of output trends and gaps? Economic Quar-
terly 93, 143–161.
West, M. and J. Harrison (1989). Bayesian Forecasting and Dynamic Models. New York: Springer-Verlag.
Whittaker, E. T. (1923). A new method of graduation. Proceedings of the Edinburgh Mathematical Society 41,
63–75.
Widder, D. V. (1946). The Laplace Transform. Princeton: Princeton University Press.
Williams, D. (1991). Probability with Martingales. Cambridge: Cambridge University Press.
Wold, H. (1938). A Study in the Analysis of Stationary Time Series. Uppsala: Almqvist and Wiksell.
Young, L. C. (1936). An inequality of the Hölder type, connected with Stieltjes integration. Acta Mathemat-
ica 67, 251–282.
Yule, G. U. (1921). On the time-correlation problem, with special reference to the variate-difference correlation
method. Journal of the Royal Statistical Society 84, 497–526.
Yule, G. U. (1926). Why do we sometimes get nonsense-correlations between time series? A study in sampling and the nature of time series. Journal of the Royal Statistical Society 89, 1–63.
Yule, G. U. (1927). On a method for investigating periodicities in disturbed series with special reference to Wolfer's sunspot numbers. Philosophical Transactions of the Royal Society of London, Series A 226, 267–298.
