
A course in Time Series Analysis

Suhasini Subba Rao


Email: [email protected]

August 29, 2022


Contents

1 Introduction 12
1.1 Time Series data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
1.2 R code . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
1.3 Filtering time series . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
1.4 Terminology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17

2 Trends in a time series 18


2.1 Parametric trend . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.1.1 Least squares estimation . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.2 Differencing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
2.3 Nonparametric methods (advanced) . . . . . . . . . . . . . . . . . . . . . . . 26
2.3.1 Rolling windows . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
2.3.2 Sieve estimators . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
2.4 What is trend and what is noise? . . . . . . . . . . . . . . . . . . . . . . . . 29
2.5 Periodic functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
2.5.1 The sine and cosine transform . . . . . . . . . . . . . . . . . . . . . . 32
2.5.2 The Fourier transform (the sine and cosine transform in disguise) . . 33
2.5.3 The discrete Fourier transform . . . . . . . . . . . . . . . . . . . . . . 36
2.5.4 The discrete Fourier transform and periodic signals . . . . . . . . . . 38
2.5.5 Smooth trends and its corresponding DFT . . . . . . . . . . . . . . . 42
2.5.6 Period detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
2.5.7 Period detection and correlated noise . . . . . . . . . . . . . . . . . . 47
2.5.8 History of the periodogram . . . . . . . . . . . . . . . . . . . . . . . . 49

2.6 Data Analysis: EEG data . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
2.6.1 Connecting Hertz and Frequencies . . . . . . . . . . . . . . . . . . . . 51
2.6.2 Data Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
2.7 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58

3 Stationary Time Series 62


3.1 Preliminaries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
3.1.1 Formal definition of a time series . . . . . . . . . . . . . . . . . . . . 65
3.2 The sample mean and its standard error . . . . . . . . . . . . . . . . . . . . 66
3.2.1 The variance of the estimated regressors in a linear regression model with correlated errors . . . . . 70
3.3 Stationary processes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
3.3.1 Types of stationarity . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
3.3.2 Towards statistical inference for time series . . . . . . . . . . . . . . . 79
3.4 What makes a covariance a covariance? . . . . . . . . . . . . . . . . . . . . . 80
3.5 Spatial covariances (advanced) . . . . . . . . . . . . . . . . . . . . . . . . . . 83
3.6 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86

4 Linear time series 87


4.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
4.2 Linear time series and moving average models . . . . . . . . . . . . . . . . . 89
4.2.1 Infinite sums of random variables . . . . . . . . . . . . . . . . . . . . 89
4.3 The AR(p) model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
4.3.1 Difference equations and back-shift operators . . . . . . . . . . . . . . 92
4.3.2 Solution of two particular AR(1) models . . . . . . . . . . . . . . . . 94
4.3.3 The solution of a general AR(p) . . . . . . . . . . . . . . . . . . . . . 97
4.3.4 Obtaining an explicit solution of an AR(2) model . . . . . . . . . . . 98
4.3.5 History of the periodogram (Part II) . . . . . . . . . . . . . . . . . . 102
4.3.6 Examples of “Pseudo” periodic AR(2) models . . . . . . . . . . . . . 104
4.3.7 Derivation of “Pseudo” periodicity functions in an AR(2) . . . . . . . 108
4.3.8 Seasonal Autoregressive models . . . . . . . . . . . . . . . . . . . . . 110

4.3.9 Solution of the general AR(∞) model (advanced) . . . . . . . . . . . 110
4.4 Simulating from an Autoregressive process . . . . . . . . . . . . . . . . . . . 114
4.5 The ARMA model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118
4.6 ARFIMA models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124
4.7 Unit roots, integrated and non-invertible processes . . . . . . . . . . . . . . . 125
4.7.1 Unit roots . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125
4.7.2 Non-invertible processes . . . . . . . . . . . . . . . . . . . . . . . . . 126
4.8 Simulating from models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127
4.9 Some diagnostics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127
4.9.1 ACF and PACF plots for checking for MA and AR behaviour . . . . 127
4.9.2 Checking for unit roots . . . . . . . . . . . . . . . . . . . . . . . . . . 128
4.10 Appendix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 130

5 A review of some results from multivariate analysis 134


5.1 Preliminaries: Euclidean space and projections . . . . . . . . . . . . . . . . . 134
5.1.1 Scalar/Inner products and norms . . . . . . . . . . . . . . . . . . . . 134
5.1.2 Projections . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 135
5.1.3 Orthogonal vectors . . . . . . . . . . . . . . . . . . . . . . . . . . . . 136
5.1.4 Projecting in multiple stages . . . . . . . . . . . . . . . . . . . . . . . 136
5.1.5 Spaces of random variables . . . . . . . . . . . . . . . . . . . . . . . . 138
5.2 Linear prediction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139
5.3 Partial correlation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 140
5.4 Properties of the precision matrix . . . . . . . . . . . . . . . . . . . . . . . . 144
5.4.1 Summary of results . . . . . . . . . . . . . . . . . . . . . . . . . . . . 144
5.4.2 Proof of results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 146
5.5 Appendix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 149

6 The autocovariance and partial covariance of a stationary time series 158


6.1 The autocovariance function . . . . . . . . . . . . . . . . . . . . . . . . . . . 158
6.1.1 The rate of decay of the autocovariance of an ARMA process . . . . . 159

6.1.2 The autocovariance of an autoregressive process and the Yule-Walker equations . . . . . 160
6.1.3 The autocovariance of a moving average process . . . . . . . . . . . . 167
6.1.4 The autocovariance of an ARMA process (advanced) . . . . . . . . . 167
6.1.5 Estimating the ACF from data . . . . . . . . . . . . . . . . . . . . . 168
6.2 Partial correlation in time series . . . . . . . . . . . . . . . . . . . . . . . . . 170
6.2.1 A general definition . . . . . . . . . . . . . . . . . . . . . . . . . . . . 170
6.2.2 Partial correlation of a stationary time series . . . . . . . . . . . . . . 171
6.2.3 Best fitting AR(p) model . . . . . . . . . . . . . . . . . . . . . . . . . 173
6.2.4 Best fitting AR(p) parameters and partial correlation . . . . . . . . . 174
6.2.5 The partial autocorrelation plot . . . . . . . . . . . . . . . . . . . . . 176
6.2.6 Using the ACF and PACF for model identification . . . . . . . . . . . 177
6.3 The variance and precision matrix of a stationary time series . . . . . . . . . 179
6.3.1 Variance matrix for AR(p) and MA(p) models . . . . . . . . . . . . . 180
6.4 The ACF of non-causal time series (advanced) . . . . . . . . . . . . . . . . . 182
6.4.1 The Yule-Walker equations of a non-causal process . . . . . . . . . . 185
6.4.2 Filtering non-causal AR models . . . . . . . . . . . . . . . . . . . . . 185

7 Prediction 188
7.1 Using prediction in estimation . . . . . . . . . . . . . . . . . . . . . . . . . . 189
7.2 Forecasting for autoregressive processes . . . . . . . . . . . . . . . . . . . . . 191
7.3 Forecasting for AR(p) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 193
7.4 Forecasting for general time series using infinite past . . . . . . . . . . . . . 195
7.4.1 Example: Forecasting yearly temperatures . . . . . . . . . . . . . . . 198
7.5 One-step ahead predictors based on the finite past . . . . . . . . . . . . . . . 204
7.5.1 Levinson-Durbin algorithm . . . . . . . . . . . . . . . . . . . . . . . . 204
7.5.2 A proof of the Durbin-Levinson algorithm based on projections . . . 206
7.5.3 Applying the Durbin-Levinson to obtain the Cholesky decomposition 208
7.6 Comparing finite and infinite predictors (advanced) . . . . . . . . . . . . . . 209
7.7 r-step ahead predictors based on the finite past . . . . . . . . . . . . . . . . 210

7.8 Forecasting for ARMA processes . . . . . . . . . . . . . . . . . . . . . . . . . 211
7.9 ARMA models and the Kalman filter . . . . . . . . . . . . . . . . . . . . . . 214
7.9.1 The Kalman filter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 214
7.9.2 The state space (Markov) representation of the ARMA model . . . . 216
7.9.3 Prediction using the Kalman filter . . . . . . . . . . . . . . . . . . . . 219
7.10 Forecasting for nonlinear models (advanced) . . . . . . . . . . . . . . . . . . 220
7.10.1 Forecasting volatility using an ARCH(p) model . . . . . . . . . . . . 221
7.10.2 Forecasting volatility using a GARCH(1, 1) model . . . . . . . . . . . 221
7.10.3 Forecasting using a BL(1, 0, 1, 1) model . . . . . . . . . . . . . . . . . 223
7.11 Nonparametric prediction (advanced) . . . . . . . . . . . . . . . . . . . . . . 224
7.12 The Wold Decomposition (advanced) . . . . . . . . . . . . . . . . . . . . . . 226
7.13 Kolmogorov’s formula (advanced) . . . . . . . . . . . . . . . . . . . . . . . . 228
7.14 Appendix: Prediction coefficients for an AR(p) model . . . . . . . . . . . . . 231
7.15 Appendix: Proof of the Kalman filter . . . . . . . . . . . . . . . . . . . . . . 239

8 Estimation of the mean and covariance 243


8.1 An estimator of the mean . . . . . . . . . . . . . . . . . . . . . . . . . . . . 245
8.1.1 The sampling properties of the sample mean . . . . . . . . . . . . . . 245
8.2 An estimator of the covariance . . . . . . . . . . . . . . . . . . . . . . . . . . 248
8.2.1 Asymptotic properties of the covariance estimator . . . . . . . . . . . 250
8.2.2 The asymptotic properties of the sample autocovariance and autocorrelation . . . . . 251
8.2.3 The covariance of the sample autocovariance . . . . . . . . . . . . . . 255
8.3 Checking for correlation in a time series . . . . . . . . . . . . . . . . . . . . . 265
8.3.1 Relaxing the assumptions: The robust Portmanteau test (advanced) . 269
8.4 Checking for partial correlation . . . . . . . . . . . . . . . . . . . . . . . . . 274
8.5 The Newey-West (HAC) estimator . . . . . . . . . . . . . . . . . . . . . . . 276
8.6 Checking for Goodness of fit (advanced) . . . . . . . . . . . . . . . . . . . . 278
8.7 Long range dependence (long memory) versus changes in the mean . . . . . 283

9 Parameter estimation 286
9.1 Estimation for Autoregressive models . . . . . . . . . . . . . . . . . . . . . . 287
9.1.1 The Yule-Walker estimator . . . . . . . . . . . . . . . . . . . . . . . . 288
9.1.2 The tapered Yule-Walker estimator . . . . . . . . . . . . . . . . . . . 292
9.1.3 The Gaussian likelihood . . . . . . . . . . . . . . . . . . . . . . . . . 293
9.1.4 The conditional Gaussian likelihood and least squares . . . . . . . . . 295
9.1.5 Burg’s algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 297
9.1.6 Sampling properties of the AR regressive estimators . . . . . . . . . . 300
9.2 Estimation for ARMA models . . . . . . . . . . . . . . . . . . . . . . . . . . 306
9.2.1 The Gaussian maximum likelihood estimator . . . . . . . . . . . . . . 307
9.2.2 The approximate Gaussian likelihood . . . . . . . . . . . . . . . . . . 308
9.2.3 Estimation using the Kalman filter . . . . . . . . . . . . . . . . . . . 310
9.2.4 Sampling properties of the ARMA maximum likelihood estimator . . 311
9.2.5 The Hannan-Rissanen AR(∞) expansion method . . . . . . . . . . . 313
9.3 The quasi-maximum likelihood for ARCH processes . . . . . . . . . . . . . . 315

10 Spectral Representations 318


10.1 How we have used Fourier transforms so far . . . . . . . . . . . . . . . . . . 319
10.2 The ‘near’ uncorrelatedness of the DFT . . . . . . . . . . . . . . . . . . . . . 324
10.2.1 Testing for second order stationarity: An application of the near decorrelation property . . . . . 325
10.2.2 Proof of Lemma 10.2.1 . . . . . . . . . . . . . . . . . . . . . . . . . . 328
10.2.3 The DFT and complete decorrelation . . . . . . . . . . . . . . . . . . 330
10.3 Summary of spectral representation results . . . . . . . . . . . . . . . . . . . 335
10.3.1 The spectral (Cramer’s) representation theorem . . . . . . . . . . . . 335
10.3.2 Bochner’s theorem . . . . . . . . . . . . . . . . . . . . . . . . . . . . 336
10.4 The spectral density and spectral distribution . . . . . . . . . . . . . . . . . 337
10.4.1 The spectral density and some of its properties . . . . . . . . . . . . 337
10.4.2 The spectral distribution and Bochner’s (Herglotz) theorem . . . . . 340
10.5 The spectral representation theorem . . . . . . . . . . . . . . . . . . . . . . . 342

10.6 The spectral density functions of MA, AR and ARMA models . . . . . . . . 345
10.6.1 The spectral representation of linear processes . . . . . . . . . . . . . 346
10.6.2 The spectral density of a linear process . . . . . . . . . . . . . . . . . 347
10.6.3 Approximations of the spectral density to AR and MA spectral densities . . . . . 349
10.7 Cumulants and higher order spectrums . . . . . . . . . . . . . . . . . . . . . 352
10.8 Extensions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 355
10.8.1 The spectral density of a time series with randomly missing observations . . . . . 355
10.9 Appendix: Some proofs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 356

11 Spectral Analysis 363


11.1 The DFT and the periodogram . . . . . . . . . . . . . . . . . . . . . . . . . 364
11.2 Distribution of the DFT and Periodogram under linearity . . . . . . . . . . . 366
11.3 Estimating the spectral density function . . . . . . . . . . . . . . . . . . . . 372
11.3.1 Spectral density estimation using a lagged window approach . . . . . 373
11.3.2 Spectral density estimation by using a discrete average periodogram approach . . . . . 378
11.3.3 The sampling properties of the spectral density estimator . . . . . . . 382
11.4 The Whittle Likelihood . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 386
11.4.1 Connecting the Whittle and Gaussian likelihoods . . . . . . . . . . . 389
11.4.2 Sampling properties of the Whittle likelihood estimator . . . . . . . . 393
11.5 Ratio statistics in Time Series . . . . . . . . . . . . . . . . . . . . . . . . . . 397
11.6 Goodness of fit tests for linear time series models . . . . . . . . . . . . . . . 404
11.7 Appendix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 405

12 Multivariate time series 408


12.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 408
12.1.1 Preliminaries 1: Sequences and functions . . . . . . . . . . . . . . . . 408
12.1.2 Preliminaries 2: Convolution . . . . . . . . . . . . . . . . . . . . . . . 409
12.1.3 Preliminaries 3: Spectral representations and mean squared errors . . 410
12.2 Multivariate time series regression . . . . . . . . . . . . . . . . . . . . . . . . 415
12.2.1 Conditional independence . . . . . . . . . . . . . . . . . . . . . . . . 416

12.2.2 Partial correlation and coherency between time series . . . . . . . . . 416
12.2.3 Cross spectral density of $\{\varepsilon_{t,Y}^{(a)}, \varepsilon_{t,Y}^{(a)}\}$: The spectral partial coherency function . . . . . 417
12.3 Properties of the inverse of the spectral density matrix . . . . . . . . . . . . 419
12.4 Proof of equation (12.6) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 422

13 Nonlinear Time Series Models 425


13.0.1 Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 427
13.1 Data Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 429
13.1.1 Yahoo data from 1996-2014 . . . . . . . . . . . . . . . . . . . . . . . 429
13.1.2 FTSE 100 from January - August 2014 . . . . . . . . . . . . . . . . . 432
13.2 The ARCH model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 433
13.2.1 Features of an ARCH . . . . . . . . . . . . . . . . . . . . . . . . . . . 434
13.2.2 Existence of a strictly stationary solution and second order stationarity of the ARCH . . . . . 435
13.3 The GARCH model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 437
13.3.1 Existence of a stationary solution of a GARCH(1, 1) . . . . . . . . . . 439
13.3.2 Extensions of the GARCH model . . . . . . . . . . . . . . . . . . . . 441
13.3.3 R code . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 441
13.4 Bilinear models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 442
13.4.1 Features of the Bilinear model . . . . . . . . . . . . . . . . . . . . . . 442
13.4.2 Solution of the Bilinear model . . . . . . . . . . . . . . . . . . . . . . 444
13.4.3 R code . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 445
13.5 Nonparametric time series models . . . . . . . . . . . . . . . . . . . . . . . . 446

14 Consistency and asymptotic normality of estimators 448


14.1 Modes of convergence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 448
14.2 Sampling properties . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 451
14.3 Showing almost sure convergence of an estimator . . . . . . . . . . . . . . . 452
14.3.1 Proof of Theorem 14.3.2 (The stochastic Ascoli theorem) . . . . . . . 454

14.4 Toy Example: Almost sure convergence of the least squares estimator for an AR(p) process . . . . . 456
14.5 Convergence in probability of an estimator . . . . . . . . . . . . . . . . . . . 459
14.6 Asymptotic normality of an estimator . . . . . . . . . . . . . . . . . . . . . . 460
14.6.1 Martingale central limit theorem . . . . . . . . . . . . . . . . . . . . 462
14.6.2 Example: Asymptotic normality of the weighted periodogram . . . . 462
14.7 Asymptotic properties of the Hannan and Rissanen estimation method . . . 463
14.7.1 Proof of Theorem 14.7.1 (A rate for $\|\hat{b}_T - b_T\|_2$) . . . . . 468
14.8 Asymptotic properties of the GMLE . . . . . . . . . . . . . . . . . . . . . . 471

15 Residual Bootstrap for estimation in autoregressive processes 481


15.1 The residual bootstrap . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 482
15.2 The sampling properties of the residual bootstrap estimator . . . . . . . . . 483

A Background 492
A.1 Some definitions and inequalities . . . . . . . . . . . . . . . . . . . . . . . . 492
A.2 Martingales . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 496
A.3 The Fourier series . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 497
A.4 Application of Burkholder’s inequality . . . . . . . . . . . . . . . . . . . . . 501
A.5 The Fast Fourier Transform (FFT) . . . . . . . . . . . . . . . . . . . . . . . 503

B Mixingales and physical dependence 508


B.1 Obtaining almost sure rates of convergence for some sums . . . . . . . . . . 509
B.2 Proof of Theorem 14.7.3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 510
B.3 Basic properties of physical dependence . . . . . . . . . . . . . . . . . . . . . 513

Preface

• The material for these notes comes from several different places, in particular:

– Brockwell and Davis (1998) (yellow book)

– Shumway and Stoffer (2006) (a shortened version is Shumway and Stoffer EZ).

– Fuller (1995)

– Pourahmadi (2001)

– Priestley (1983)

– Box and Jenkins (1970)

– Brockwell and Davis (2002) (the red book) is a very nice introduction to Time
Series, which may be useful for students who don’t have a rigorous background
in mathematics.

– Wilson Tunnicliffe et al. (2020)

– Tucker and Politis (2021)

– A whole bunch of articles.

– My own random thoughts and derivations.

• Tata Subba Rao and Piotr Fryzlewicz were very generous in giving advice and sharing
homework problems.

• When doing the homework, you are encouraged to use all materials available, including
Wikipedia and Mathematica/Maple (software which allows you to easily derive analytic
expressions; a web-based version which is not sensitive to syntax is Wolfram Alpha).

• You are encouraged to use R (see David Stoffer’s tutorial). I have tried to include
R code in the notes so that you can replicate some of the results.

• Exercise questions will be in the notes and will be set at regular intervals.

• Finally, these notes are dedicated to my wonderful Father, whose inquisitive questions,
and unconditional support inspired my quest in time series.

Chapter 1

Introduction

A time series is a series of observations xt , observed over a period of time. Typically the
observations can be over an entire interval, randomly sampled on an interval or at fixed time
points. Different types of time sampling require different approaches to the data analysis.
In this course we will focus on the case that observations are observed at fixed equidistant
time points, hence we will suppose we observe {xt : t ∈ Z} (Z = {. . . , −1, 0, 1, 2, . . .}).
Let us start with a simple example, independent, uncorrelated random variables (the
simplest example of a time series). A plot is given in Figure 1.1. We observe that there
aren’t any clear patterns in the data. Our best forecast (predictor) of the next observation
is zero (which appears to be the mean). The feature that distinguishes a time series from
classical statistics is that there is dependence in the observations. This allows us to obtain
better forecasts of future observations. Keep Figure 1.1 in mind, and compare this to the
following real examples of time series (observe in all these examples you see patterns).
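A series like the one in Figure 1.1 can be simulated directly. The following is a minimal sketch, assuming standard normal noise, a sample size of 200 and an arbitrary seed (the notes do not state how the figure was generated).

```r
# Simulate iid noise; the distribution, length and seed are assumptions.
set.seed(1)
n <- 200
whitenoise <- ts(rnorm(n))   # iid N(0,1) draws stored as a time series object

plot.ts(whitenoise)          # no clear pattern should be visible
abline(h = 0, lty = 2)       # the mean, which is also the best forecast here

mean(whitenoise)             # sample mean, should be close to zero
```

Because the draws are independent, knowing the past does not improve on forecasting the mean.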

1.1 Time Series data


Below we discuss four different data sets.

The Southern Oscillation Index from 1876-present

The Southern Oscillation Index (SOI) is an indicator of the intensity of the El Niño effect (see
Wikipedia). The SOI measures the fluctuations in surface air pressure between Tahiti and Darwin.

Figure 1.1: Plot of independent uncorrelated random variables (vertical axis: whitenoise; horizontal axis: Time)

In Figure 1.2 we give a plot of monthly SOI from January 1876 - July 2014 (note that
there is some doubt on the reliability of the data before 1930). The data was obtained
from http://www.bom.gov.au/climate/current/soihtm1.shtml. Using this data set one
major goal is to look for patterns, in particular periodicities in the data.

Figure 1.2: Plot of monthly Southern Oscillation Index, 1876-2014 (vertical axis: soi1; horizontal axis: Time)

Nasdaq Data from 1985-present

The daily closing Nasdaq price from 1st October, 1985 - 8th August, 2014 is given in Figure
1.3. The (historical) data was obtained from https://uk.finance.yahoo.com. See also
http://www.federalreserve.gov/releases/h10/Hist/. Of course with this type of data
the goal is to make money! Therefore the main objective is to forecast (predict future volatility).

Figure 1.3: Plot of daily closing price of Nasdaq 1985-2014 (vertical axis: nasdaq1; horizontal axis: Time)

Yearly sunspot data from 1700-2013

Sunspot activity is measured by the number of sunspots seen on the sun. In recent years there
has been renewed interest in it because periods of high activity cause huge disruptions
to communication networks (see Wikipedia and NASA).
In Figure 1.4 we give a plot of yearly sunspot numbers from 1700-2013. The data was
obtained from http://www.sidc.be/silso/datafiles. For this type of data the main aim
is both to look for patterns in the data and to forecast (predict future sunspot activity).

Figure 1.4: Plot of Sunspot numbers 1700-2013 (vertical axis: sunspot1; horizontal axis: Time)

Yearly and monthly average temperature data

Given that climate change is a very topical subject we consider global temperature data.
Figure 1.5 gives the yearly temperature anomalies from 1880-2013 and in Figure 1.6 we plot
the monthly temperatures from January 1996 - July 2014. The data was obtained from
http://data.giss.nasa.gov/gistemp/graphs_v3/Fig.A2.txt and
http://data.giss.nasa.gov/gistemp/graphs_v3/Fig.C.txt respectively. For this type of data one
may be trying to detect global warming (a long term change/increase in the average
temperatures). This would be done by fitting trend functions through the data. However,
sophisticated time series analysis is required to determine whether these estimators are
statistically significant.

1.2 R code
A large number of the methods and concepts will be illustrated in R. If you are not familiar
with this language please learn the basics.
Here we give the R code for making the plots above.

# assuming the data is stored in your home directory we scan the data into R
soi <- scan("~/soi.txt")
# the function ts creates a time series object; start = c(starting year, 1),
# where 1 denotes January. frequency = number of observations in a
# unit of time (year). As the data is monthly it is 12.
soi1 <- ts(soi, start=c(1876,1), frequency=12)
plot.ts(soi1)

Figure 1.5: Plot of global, yearly average, temperature anomalies, 1880 - 2013 (vertical axis: temp; horizontal axis: Time)

Figure 1.6: Plot of global, monthly average, temperatures January, 1996 - July, 2014 (vertical axis: monthlytemp1; horizontal axis: Time)

Dating plots properly is very useful. This can be done using the package zoo and the
function as.Date.
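As a rough illustration of dating a plot, the sketch below builds a monthly date index with as.Date and seq and plots against it in base R. The series x is simulated stand-in data and the start date is an assumption; the zoo call in the final comment is optional and only applies if that package is installed.

```r
# Hypothetical monthly series starting January 1876 (simulated stand-in data).
set.seed(1)
x <- rnorm(120)
dates <- seq(as.Date("1876-01-01"), by = "month", length.out = length(x))

# Base R: plotting against Date objects gives a properly dated horizontal axis.
plot(dates, x, type = "l", xlab = "Time", ylab = "x")

# If zoo is installed, the same dates can index a zoo object:
# library(zoo); plot(zoo(x, order.by = dates))
```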

1.3 Filtering time series
Often we transform data to highlight features or remove unwanted features. This is often
done by taking the log transform or a linear transform.
It is no different for time series. Often a transformed time series can be easier to analyse
or contain features not apparent in the original time series. In these notes we mainly focus
on linear transformation of the time series. Let {Xt } denote the original time series and
{Yt } the transformed time series where

\[ Y_t = \sum_{j=-\infty}^{\infty} h_j X_{t-j} \]

and {hj } are weights.
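With only finitely many nonzero weights, such a linear transform can be computed with stats::filter. The sketch below uses a made-up data vector and two illustrative choices of weights (a three-point moving average and a first difference); neither choice is taken from the notes.

```r
x <- c(2, 4, 6, 5, 3, 7, 9, 8)

# Two-sided moving average: h_{-1} = h_0 = h_1 = 1/3, all other weights zero.
ma3 <- stats::filter(x, filter = rep(1/3, 3), method = "convolution", sides = 2)

# One-sided filter: h_0 = 1, h_1 = -1, i.e. Y_t = X_t - X_{t-1} (first difference).
d1 <- stats::filter(x, filter = c(1, -1), method = "convolution", sides = 1)

ma3   # NA at both ends, where X_{t-1} or X_{t+1} is not observed
d1    # NA at t = 1; the remaining values agree with diff(x)
```

The averaging filter smooths the series, which relates to estimating a mean, while the difference filter removes slowly varying components.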


In these notes we focus on two important types of linear transforms of the time series:

(i) Linear transforms that can be used to estimate the underlying mean function.

(ii) Linear transforms that allow us to obtain a deeper understanding of the actual
stochastic/random part of the observed time series.

In the next chapter we consider estimation of a time-varying mean in a time series and
will use some of the transforms alluded to above.

1.4 Terminology
• iid (independent, identically distributed) random variables. The simplest time series
you could ever deal with!

Chapter 2

Trends in a time series

Objectives:

• Parameter estimation in parametric trend.

• The Discrete Fourier transform.

• Period estimation.

In time series, the main focus is on understanding and modelling the relationship between
observations. A typical time series model looks like

\[ Y_t = \mu_t + \varepsilon_t, \]

where µt is the underlying mean and εt are the residuals (errors) which the mean cannot
explain. Formally, we say E[Yt ] = µt . We will show later in this section that, given data,
it can be difficult to disentangle the two. However, a time series analyst usually has
a few jobs to do when given such a data set. Either (a) estimate µt (we discuss various
methods below), where the estimator is denoted $\hat{\mu}_t$, or (b) transform {Yt } in such a
way that µt “disappears”. Which method is used depends on the aims of the analysis. In many
cases it is to estimate the mean µt . But the estimated residuals

\[ \hat{\varepsilon}_t = Y_t - \hat{\mu}_t \]

also play an important role. By modelling {εt }t we can understand its dependence structure.
This knowledge will allow us to construct reliable confidence intervals for the mean µt . Thus
the residuals {εt }t play an important but peripheral role. However, for many data sets the
residuals {εt }t are important and it is the mean that is a nuisance parameter. In such
situations we find a transformation which removes the mean and focus our analysis
on the residuals εt . The main focus of this class will be on understanding the structure of
the residuals {εt }t . However, in this chapter we study ways in which to estimate the mean
µt .
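The two jobs can be compared on simulated data. This is only a sketch under assumed ingredients: a linear mean 2 + 0.5t, iid standard normal errors, least squares for job (a) and first differencing for job (b).

```r
set.seed(2)
n <- 100
tt <- 1:n
y <- 2 + 0.5 * tt + rnorm(n)       # assumed DGP: linear mean plus iid noise

# (a) Estimate the mean and form the estimated residuals.
fit <- lm(y ~ tt)
mu.hat <- fitted(fit)
eps.hat <- y - mu.hat              # equals residuals(fit)

# (b) Transform so that the time-varying mean disappears.
dy <- diff(y)                      # Y_t - Y_{t-1} = 0.5 + (eps_t - eps_{t-1})

coef(fit)                          # estimates of the intercept and slope
mean(dy)                           # close to the slope 0.5
```

Note that differencing removes the trend but changes the error from eps_t to eps_t - eps_{t-1}, so the transformed residuals have a different dependence structure.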
Shumway and Stoffer, Chapter 2, and Brockwell and Davis (2002), Chapter 1.

2.1 Parametric trend


In many situations, when we observe time series, regressors are also available. The regressors
may be an exogenous variable but it could even be time (or functions of time), since for a
time series the index t has a meaningful ordering and can be treated as a regressor. Often
the data is assumed to be generated using a parametric model. By parametric model, we
mean a model where all but a finite number of parameters is assumed known. Possibly, the
simplest model is the linear model. In time series, a commonly used linear model is

\[ Y_t = \beta_0 + \beta_1 t + \varepsilon_t, \qquad (2.1) \]

or

\[ Y_t = \beta_0 + \beta_1 t + \beta_2 t^2 + \varepsilon_t, \qquad (2.2) \]

where β0 , β1 and β2 are unknown. These models are linear because they are linear in the
regressors. An example of a popular nonlinear model is

\[ Y_t = \frac{1}{1 + \exp[\beta_0 (t - \beta_1)]} + \varepsilon_t, \qquad (2.3) \]

where β0 and β1 are unknown. As the parameters in this model are inside a function, this
is an example of a nonlinear model. The above nonlinear model (called a smooth transition
model) is used to model transitions from one state to another (as it is monotonic, increasing
or decreasing depending on the sign of β0 ).

Figure 2.1: The function Yt in (2.3) with iid noise with σ = 0.3. Dashed is the truth. Left panel (example1): β0 = 0.2 and β1 = 60. Right panel (example2): β0 = 5 and β1 = 60. Horizontal axis: Time.

Another popular model for modelling ECG data is the burst signal model (see Swagata Nandi
et al.)

\[ Y_t = A \exp\left(\beta_0 (1 - \cos(\beta_2 t))\right) \cdot \cos(\theta t) + \varepsilon_t. \qquad (2.4) \]

Both these nonlinear parametric models motivate the general nonlinear model

Yt = g(xt , θ) + εt , (2.5)

where g(xt , θ) is the nonlinear trend, g is a known function but θ is unknown. Observe that
most models include an additive noise term {εt }t to account for variation in Yt that the trend
cannot explain.
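To get a feel for the smooth transition model (2.3), it can be simulated with the parameter values quoted in the caption of Figure 2.1. The sketch below assumes Gaussian noise, 100 time points and an arbitrary seed; the helper name trend is made up for illustration.

```r
set.seed(3)
tt <- 1:100
sigma <- 0.3

# Trend function of the smooth transition model (2.3).
trend <- function(t, beta0, beta1) 1 / (1 + exp(beta0 * (t - beta1)))

mu1 <- trend(tt, beta0 = 0.2, beta1 = 60)   # slow transition
mu2 <- trend(tt, beta0 = 5,   beta1 = 60)   # abrupt transition
example1 <- ts(mu1 + sigma * rnorm(100))
example2 <- ts(mu2 + sigma * rnorm(100))

par(mfrow = c(1, 2))
plot.ts(example1); lines(mu1, lty = 2)      # dashed line is the true trend
plot.ts(example2); lines(mu2, lty = 2)
```

Here β1 locates the midpoint of the transition (the trend equals 1/2 at t = β1) and the magnitude of β0 controls how abrupt it is.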

Figure 2.2: The Burst signal (equation (2.4)) A = 1, β0 = 2, β1 = 1 and θ = π/2 with iid noise with σ = 8. Dashed is the truth. Left panel (signal1): True Signal. Right panel (Ysignal): True Signal with noise. Horizontal axis: Time.

Real data example Monthly temperature data. This time series appears to include seasonal
behaviour (for example the southern oscillation index). Seasonal behaviour is often modelled
with sines and cosines

\[ Y_t = \beta_0 + \beta_1 \sin\left(\frac{2\pi t}{P}\right) + \beta_3 \cos\left(\frac{2\pi t}{P}\right) + \varepsilon_t, \]

where P denotes the length of the period. Suppose P is known; for example, there are 12
months in a year, so for monthly data setting P = 12 is sensible. Then we are modelling
trends which repeat every 12 months and

\[ Y_t = \beta_0 + \beta_1 \sin\left(\frac{2\pi t}{12}\right) + \beta_3 \cos\left(\frac{2\pi t}{12}\right) + \varepsilon_t \qquad (2.6) \]

is an example of a linear model.

On the other hand, if P is unknown and has to be estimated from the data too, then this
is an example of a nonlinear model. We consider more general periodic functions in Section
2.5.

2.1.1 Least squares estimation

In this section we review simple estimation methods. We do not study the properties of
these estimators here; we touch on that in the next chapter.

A quick review of least squares Suppose that the variable $X_i$ is believed to influence the
response variable $Y_i$. So far the relationship is unknown, but we regress (project)
$\underline{Y}_n = (Y_1, \ldots, Y_n)'$ onto $\underline{X}_n = (X_1, \ldots, X_n)'$ using least squares. We know that this
means finding the $\alpha$ which minimises the distance

\[ \sum_{i=1}^{n} (Y_i - \alpha X_i)^2. \]

For mathematical convenience we denote the $\alpha$ which minimises the above as

\[ \widehat{\alpha}_n = \arg\min_{\alpha} \sum_{i=1}^{n} (Y_i - \alpha X_i)^2 \]

and it has the analytic solution

\[ \widehat{\alpha}_n = \frac{\langle \underline{Y}_n, \underline{X}_n \rangle}{\|\underline{X}_n\|_2^2} = \frac{\sum_{i=1}^{n} Y_i X_i}{\sum_{i=1}^{n} X_i^2}. \]

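A geometric interpretation is that the vector $\underline{Y}_n$ is projected onto $\underline{X}_n$ such that

\[ \underline{Y}_n = \widehat{\alpha}_n \underline{X}_n + \underline{\varepsilon}_n \]

where $\underline{\varepsilon}_n$ is orthogonal to $\underline{X}_n$, in other words

\[ \langle \underline{X}_n, \underline{\varepsilon}_n \rangle = \sum_{i=1}^{n} X_i \varepsilon_{i,n} = 0. \]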
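The projection can be checked numerically. The following is a minimal R sketch under assumed values (simulated data, a slope of 2 chosen only for the demonstration); it evaluates the analytic solution, compares it with lm fitted without an intercept, and confirms that the residual vector is orthogonal to the regressor.

```r
# Hypothetical data: the slope 2 and the sample size are arbitrary choices
set.seed(1)
n <- 100
X <- rnorm(n)
Y <- 2 * X + rnorm(n)

alpha_hat <- sum(Y * X) / sum(X^2)   # <Y, X> / ||X||^2
resid <- Y - alpha_hat * X           # residual of the projection

# lm without an intercept solves the same minimisation problem
alpha_lm <- unname(coef(lm(Y ~ X - 1)))

inner <- sum(X * resid)              # <X, resid>, zero up to rounding error
```

Here alpha_hat and alpha_lm agree up to numerical precision, and inner is zero up to rounding error regardless of how the data were generated, which is the point made above: the projection itself involves no statistical model.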

But so far there is no statistics: we can always project one vector onto another. We have made
no underlying assumption on what generates $Y_i$ and how $X_i$ really impacts $Y_i$. Once we do
this we are in the realm of modelling. We do this now. Let us suppose the data generating
process (often abbreviated to DGP) is

\[ Y_i = \alpha X_i + \varepsilon_i, \]

where we place the orthogonality assumption between $X_i$ and $\varepsilon_i$ by assuming that they are
uncorrelated, i.e. $\mathrm{cov}[\varepsilon_i, X_i] = 0$. This basically means $\varepsilon_i$ contains no linear information about
$X_i$. Once a model has been established, we can make more informative statements about
$\widehat{\alpha}_n$. In this case $\widehat{\alpha}_n$ is estimating $\alpha$ and $\widehat{\alpha}_n X_i$ is an estimator of the mean $\alpha X_i$.

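Multiple linear regression The above regresses $\underline{Y}_n$ onto just one regressor $\underline{X}_n$. Now
consider regressing $\underline{Y}_n$ onto several regressors $(\underline{X}_{1,n}, \ldots, \underline{X}_{p,n})$, where $\underline{X}_{j,n}' = (X_{1,j}, \ldots, X_{n,j})$.
This means projecting $\underline{Y}_n$ onto $(\underline{X}_{1,n}, \ldots, \underline{X}_{p,n})$. The coefficients in this
projection are $\widehat{\underline{\alpha}}_n$, where

\[ \widehat{\underline{\alpha}}_n = \arg\min_{\underline{\alpha}} \sum_{i=1}^{n} \Big( Y_i - \sum_{j=1}^{p} \alpha_j X_{i,j} \Big)^2 = (\mathbf{X}'\mathbf{X})^{-1} \mathbf{X}' \underline{Y}_n \]

and $\mathbf{X} = (\underline{X}_{1,n}, \ldots, \underline{X}_{p,n})$. If the vectors $\{\underline{X}_{j,n}\}_{j=1}^{p}$ are orthogonal, then $\mathbf{X}'\mathbf{X}$ is a diagonal
matrix. Then the expression for $\widehat{\underline{\alpha}}_n$ can be simplified to

\[ \widehat{\alpha}_{j,n} = \frac{\langle \underline{Y}_n, \underline{X}_{j,n} \rangle}{\|\underline{X}_{j,n}\|_2^2} = \frac{\sum_{i=1}^{n} Y_i X_{i,j}}{\sum_{i=1}^{n} X_{i,j}^2}. \]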

Orthogonality of regressors is very useful: it allows simple estimation of parameters and
avoids issues such as collinearity between regressors.
Of course we can regress $\underline{Y}_n$ onto anything. In order to make any statements at the
population level, we have to make an assumption about the true relationship between $Y_i$ and
$(X_{i,1}, \ldots, X_{i,p})$. Let us suppose the data generating process is

\[ Y_i = \sum_{j=1}^{p} \alpha_j X_{i,j} + \varepsilon_i. \]

Then $\widehat{\underline{\alpha}}_n$ is an estimator of $\underline{\alpha}$. But how good an estimator it is depends on the properties
of $\{\varepsilon_i\}_{i=1}^{n}$. Typically, we make the assumption that $\{\varepsilon_i\}_{i=1}^{n}$ are independent, identically
distributed random variables. But if $Y_i$ is observed over time, then this assumption may well
be untrue (we come to this, and the impact it may have, later).
If there is a choice of many different variables, the AIC (Akaike Information Criterion)
is usually used to select the important variables in the model (see wiki).
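As an illustration of the orthogonal case, the sketch below uses a cosine and a sine regressor observed over a whole number of periods (an assumed design, with arbitrary coefficients 1.5 and -0.5). It compares the matrix formula $(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\underline{Y}_n$ with the regressor-by-regressor formula.

```r
set.seed(2)
n <- 200
tt <- 1:n
# Two regressors that are exactly orthogonal over 25 full periods of length 8
X <- cbind(cos(2 * pi * tt / 8), sin(2 * pi * tt / 8))
Y <- drop(X %*% c(1.5, -0.5)) + rnorm(n)           # arbitrary coefficients

alpha_hat  <- drop(solve(t(X) %*% X, t(X) %*% Y))  # (X'X)^{-1} X'Y
alpha_orth <- drop(t(X) %*% Y) / colSums(X^2)      # <Y, X_j> / ||X_j||^2
off_diag   <- (t(X) %*% X)[1, 2]                   # zero: X'X is diagonal
```

Because the off-diagonal entry of $\mathbf{X}'\mathbf{X}$ vanishes, the two sets of estimates coincide up to rounding error.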

Nonlinear least squares Least squares has a nice geometric interpretation in terms of
projections. But for models like (2.3) and (2.4), where the unknown parameters are not the
coefficients of the regressors ($Y_i = g(\underline{X}_i, \theta) + \varepsilon_i$), least squares can still be used to estimate $\theta$:

\[ \widehat{\theta}_n = \arg\min_{\theta \in \Theta} \sum_{i=1}^{n} (Y_i - g(\underline{X}_i, \theta))^2. \]

Usually, for nonlinear least squares no analytic solution for $\widehat{\theta}_n$ exists and one has to
use a numerical routine to minimise the least squares criterion (such as optim in R). These
methods can be highly sensitive to initial values (especially when there are many parameters
in the system) and may only give a local minimum. However, in some situations “clever”
manipulations yield simple methods for minimising the above.
Again if the true model is $Y_i = g(\underline{X}_i, \theta) + \varepsilon_i$, then $\widehat{\theta}_n$ is an estimator of $\theta$.
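A minimal sketch of nonlinear least squares for the smooth transition model (2.3), using optim as mentioned above. The parameter values follow the left panel of Figure 2.1; the starting values c(0.1, 50) and the function names g and rss are assumptions made for this illustration.

```r
set.seed(3)
n <- 100
tt <- 1:n
# Trend in (2.3): theta[1] plays the role of beta_0, theta[2] of beta_1
g <- function(theta, t) 1 / (1 + exp(theta[1] * (t - theta[2])))
theta_true <- c(0.2, 60)
Y <- g(theta_true, tt) + rnorm(n, sd = 0.3)

# Least squares criterion to be minimised over theta
rss <- function(theta) sum((Y - g(theta, tt))^2)

start <- c(0.1, 50)                 # assumed initial values
fit <- optim(par = start, fn = rss) # Nelder-Mead is the default method
theta_hat <- fit$par
```

Rerunning with different starting values is a simple way to see the sensitivity to initial values described above; the minimiser returned is only guaranteed to be a local one.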

2.2 Differencing
Let us return to the Nasdaq data (see Figure 1.3). We observe what appears to be an
upward trend. First differencing often removes the trend in the model. For example if
Yt = β0 + β1 t + εt , then

Zt = Yt+1 − Yt = β1 + εt+1 − εt .

First differencing is also extremely helpful for models which have a stochastic
trend. A simple example is

\[ Y_t = Y_{t-1} + \varepsilon_t, \tag{2.7} \]

where $\{\varepsilon_t\}_t$ are iid random variables. It is believed that the logarithm of the Nasdaq index
data (see Figure 1.3) is an example of such a model. Again by taking first differences we
have

Zt = Yt+1 − Yt = εt+1 .

Higher order differences Taking higher order differences can remove higher order polynomials
and stochastic trends. For example if $Y_t = \beta_0 + \beta_1 t + \beta_2 t^2 + \varepsilon_t$ then

\[ Z_t^{(1)} = Y_{t+1} - Y_t = (\beta_1 + \beta_2) + 2\beta_2 t + \varepsilon_{t+1} - \varepsilon_t, \]

which still contains a (linear) trend. Taking second differences removes it:

\[ Z_t^{(2)} = Z_t^{(1)} - Z_{t-1}^{(1)} = 2\beta_2 + \varepsilon_{t+1} - 2\varepsilon_t + \varepsilon_{t-1}. \]

In general, the number of differences corresponds to the order of the polynomial. Similarly,
if a stochastic trend is of the form

\[ Y_t = 2Y_{t-1} - Y_{t-2} + \varepsilon_t, \]

where $\{\varepsilon_t\}_t$ are iid, then second differencing will return us to $\varepsilon_t$.

Warning Taking too many differences can induce “ugly” dependences in the data. This
happens with the linear trend model $Y_t = \beta_0 + \beta_1 t + \varepsilon_t$: the sequence $\{Y_t\}$ is independent
over time, but the differenced sequence $Z_t = Y_{t+1} - Y_t = \beta_1 + \varepsilon_{t+1} - \varepsilon_t$ is dependent over time since

\[ Z_t = \beta_1 + \varepsilon_{t+1} - \varepsilon_t \quad \text{and} \quad Z_{t+1} = \beta_1 + \varepsilon_{t+2} - \varepsilon_{t+1} \]

both share a common $\varepsilon_{t+1}$, which is highly undesirable (for future: $Z_t$ has an MA(1)
representation and is non-invertible). Similarly for the stochastic trend $Y_t = Y_{t-1} + \varepsilon_t$,
taking second differences gives $Z_t^{(2)} = \varepsilon_{t+1} - \varepsilon_t$. Thus we encounter the same problem.
Over-differencing induces negative persistence in a time series, and dealing with the
resulting dependence is a pain in the neck!
R code. It is straightforward to simulate a process which requires differencing. You can use
the arima.sim function in R. For example, arima.sim(list(order = c(0,1,0)), n = 200) will simulate
(2.7) and arima.sim(list(order = c(0,2,0)), n = 200) will simulate a process which requires
differencing of order two.
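The effect of differencing can also be seen directly with the base R function diff. The sketch below uses assumed coefficients (intercept 1, slope 0.5, quadratic coefficient 0.01) and shows that first differences of a linear trend are constant, second differences of a quadratic trend are constant, and first differences of a random walk return the innovations.

```r
set.seed(4)
n <- 200
tt <- 1:n
eps <- rnorm(n)

Y_lin  <- 1 + 0.5 * tt + eps                # linear trend, as in (2.1)
Y_quad <- 1 + 0.5 * tt + 0.01 * tt^2 + eps  # quadratic trend, as in (2.2)

Z1 <- diff(Y_lin)                    # Y_{t+1} - Y_t: trend replaced by the constant 0.5
Z2 <- diff(Y_quad, differences = 2)  # second differences: constant 2 * 0.01 plus noise

# Differencing the noise-free trends shows exactly what is removed
d1 <- diff(1 + 0.5 * tt)
d2 <- diff(1 + 0.5 * tt + 0.01 * tt^2, differences = 2)

# Random walk (2.7): its first differences are the iid innovations
rw <- cumsum(eps)
innov <- diff(rw)
```

Plotting Z1 and Z2 against time (for example with plot.ts) shows series without a visible trend, although, as the warning above explains, the differenced noise is now dependent.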

Exercise 2.1 (i) Import the yearly temperature data (file global mean temp.txt) into
R and fit the linear model in (2.1) to the data (use the R command lm, FitTemp =
lm(data), out = summary(FitTemp)) .

(ii) Suppose the errors in (2.1) are correlated (linear dependence between the errors). If
the errors are correlated, explain why the standard errors reported in the R output may
not be reliable.

Hint: The standard errors are usually calculated from the diagonal of

\[ \left( \sum_{t=1}^{n} (1, t)'(1, t) \right)^{-1} \frac{1}{n-2} \sum_{t=1}^{n} \widehat{\varepsilon}_t^2. \]

(iii) Make a plot of the residuals (over time) after fitting the linear model in (i).

(iv) Make a plot of the first differences of the temperature data (against time). Compare
the plot of the residuals with the plot of the first differences.

2.3 Nonparametric methods (advanced)


In Section 2.1 we assumed that the mean had a certain known parametric form. This may
not always be the case. If we have no a priori knowledge of the features in the mean, we
can estimate the mean using a nonparametric approach. Of course some assumptions on the
mean are still required, and the most common is to assume that the mean $\mu_t$ is a sample
from a ‘smooth’ function. Mathematically we write that $\mu_t$ is sampled (at regular intervals)
from a smooth function (e.g. $u^2$) with $\mu_t = \mu(t/n)$, where the function $\mu(\cdot)$ is unknown. Under
this assumption the following approaches are valid.

2.3.1 Rolling windows

Possibly one of the simplest methods is to use a ‘rolling window’. There are several windows
that one can use. We describe, below, the exponential window, since it can be ‘evaluated’
in an online way. For $t = 1$ let $\widehat{\mu}_1 = Y_1$, then for $t > 1$ define

\[ \widehat{\mu}_t = (1 - \lambda)\widehat{\mu}_{t-1} + \lambda Y_t, \]

where $0 < \lambda < 1$. The choice of $\lambda$ depends on how much weight one wants to give the
present observation. The rolling window is related to the regular window often used in
nonparametric regression. To see this, we note that it is straightforward to show that (up to
an initialisation term $(1-\lambda)^t Y_1$, which is negligible for large $t$)

\[ \widehat{\mu}_t = \sum_{j=1}^{t} (1 - \lambda)^{t-j} \lambda Y_j = [1 - \exp(-\gamma)] \sum_{j=1}^{t} \exp[-\gamma(t - j)] Y_j \]

where $1 - \lambda = \exp(-\gamma)$. Set $\gamma = (nb)^{-1}$ and $K(u) = \exp(-u) I(u \geq 0)$. Note that we treat
$n$ as a “sample size” (it is of the same order as $t$ and for convenience one can let $n = t$),
whereas $b$ is a bandwidth; the smaller $b$, the larger the weight on the current observations.
Then $\widehat{\mu}_t$ can be written as

\[ \widehat{\mu}_t = \underbrace{(1 - e^{-1/(nb)})}_{\approx (nb)^{-1}} \sum_{j=1}^{n} K\left(\frac{t-j}{nb}\right) Y_j, \]

where the above approximation is due to a Taylor expansion of $e^{-1/(nb)}$. Thus we observe that
the exponential rolling window estimator is very close to a nonparametric kernel smoother,
which typically takes the form

\[ \widetilde{\mu}_t = \sum_{j=1}^{n} \frac{1}{nb} K\left(\frac{t-j}{nb}\right) Y_j. \]

It is likely you came across such estimators in your nonparametric classes (a classical example
is the local average where $K(u) = 1$ for $u \in [-1/2, 1/2]$ but zero elsewhere). The main
difference between the rolling window estimator and the nonparametric kernel estimator is
that the kernel/window for the rolling window is not symmetric. This is because we are
trying to estimate the mean at time $t$ given only the observations up to time $t$, whereas
general nonparametric kernel estimators can use observations on both sides of $t$.
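The exponential window recursion is only a few lines of R. The sketch below is an illustration under assumed settings (a quadratic mean plus iid noise, $\lambda = 0.1$, and a helper named exp_window that is not part of any package); it also checks the weighted-sum form of the estimator at the final time point, including the initialisation term.

```r
# Exponential rolling window: mu[1] = y[1], mu[t] = (1 - lambda) mu[t-1] + lambda y[t]
exp_window <- function(y, lambda) {
  mu <- numeric(length(y))
  mu[1] <- y[1]
  for (t in 2:length(y)) mu[t] <- (1 - lambda) * mu[t - 1] + lambda * y[t]
  mu
}

set.seed(5)
n <- 200
u <- (1:n) / n
y <- 5 * (2 * u - 2.5 * u^2) + 20 + rnorm(n)   # assumed smooth mean plus iid noise

lambda <- 0.1
mu_hat <- exp_window(y, lambda)

# Weighted-sum form at t = n: sum_j lambda (1 - lambda)^(n - j) y_j,
# plus the initialisation term (1 - lambda)^n y_1
w <- lambda * (1 - lambda)^(n - (1:n))
closed <- sum(w * y) + (1 - lambda)^n * y[1]
```

Only the previous value mu[t - 1] and the new observation are needed at each step, which is what makes the estimator suitable for online evaluation.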

2.3.2 Sieve estimators
Suppose that $\{\phi_k(\cdot)\}_k$ is an orthonormal basis of $L_2[0, 1]$ (where $L_2[0, 1] = \{f; \int_0^1 f(x)^2 dx < \infty\}$,
so it includes all bounded and continuous functions; see footnote 1). Then every function in $L_2[0, 1]$ can be
represented as a linear sum of the basis. Suppose $\mu(\cdot) \in L_2[0, 1]$ (for example the function
is simply bounded). Then

\[ \mu(u) = \sum_{k=1}^{\infty} a_k \phi_k(u). \]

Examples of basis functions are the Fourier basis $\phi_k(u) = \exp(2\pi i k u)$, Haar and other wavelet functions
etc. We observe that the expansion is linear in the unknown coefficients $a_k$, with ‘regressors’ $\phi_k$. Since
$\sum_k |a_k|^2 < \infty$, $a_k \to 0$. Therefore, for a sufficiently large $M$ the finite truncation of the
above is such that

\[ Y_t \approx \sum_{k=1}^{M} a_k \phi_k\left(\frac{t}{n}\right) + \varepsilon_t. \]

Based on the above, we truncate the expansion to order $M$ and use least squares to estimate
the coefficients $\{a_k\}$, by minimising

\[ \sum_{t=1}^{n} \left[ Y_t - \sum_{k=1}^{M} a_k \phi_k\left(\frac{t}{n}\right) \right]^2. \tag{2.8} \]

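The orthonormality of the basis means that the corresponding rescaled design matrix $n^{-1}(\mathbf{X}'\mathbf{X})$ is close
to the identity, since

\[ n^{-1}(\mathbf{X}'\mathbf{X})_{k_1,k_2} = \frac{1}{n} \sum_{t=1}^{n} \phi_{k_1}\left(\frac{t}{n}\right) \phi_{k_2}\left(\frac{t}{n}\right) \approx \int_0^1 \phi_{k_1}(u)\phi_{k_2}(u)du = \begin{cases} 0 & k_1 \neq k_2 \\ 1 & k_1 = k_2 \end{cases}. \]

Footnote 1: Orthonormal basis means that for all $k$, $\int_0^1 \phi_k(u)^2 du = 1$, and for any $k_1 \neq k_2$ we have
$\int_0^1 \phi_{k_1}(u)\phi_{k_2}(u)du = 0$.

This means that the least squares estimator of $a_k$ is $\widehat{a}_k$ where

\[ \widehat{a}_k \approx \frac{1}{n} \sum_{t=1}^{n} Y_t \phi_k\left(\frac{t}{n}\right). \]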
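A small sketch of a sieve estimator follows. It assumes the real cosine basis $\phi_1(u) = 1$, $\phi_k(u) = \sqrt{2}\cos((k-1)\pi u)$ (one possible orthonormal basis of $L_2[0,1]$, chosen here because it avoids complex numbers), a truncation level $M = 4$, and a simulated quadratic mean.

```r
set.seed(6)
n <- 200
u <- (1:n) / n
Y <- 5 * (2 * u - 2.5 * u^2) + 20 + rnorm(n)   # assumed smooth mean plus iid noise

M <- 4
# Columns of Phi are the basis functions evaluated at t/n
Phi <- sapply(1:M, function(k) {
  if (k == 1) rep(1, n) else sqrt(2) * cos((k - 1) * pi * u)
})

gram     <- t(Phi) %*% Phi / n                       # n^{-1} X'X, close to the identity
a_ls     <- drop(solve(t(Phi) %*% Phi, t(Phi) %*% Y)) # exact least squares estimator
a_approx <- drop(t(Phi) %*% Y) / n                   # (1/n) sum_t Y_t phi_k(t/n)
mu_hat   <- drop(Phi %*% a_ls)                       # estimated mean function
```

The two sets of coefficients are linked exactly by a_approx = gram %*% a_ls, so they differ only through the deviation of gram from the identity matrix, which is of order 1/n for this basis.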

2.4 What is trend and what is noise?


So far we have not discussed the nature of the noise $\varepsilon_t$. In classical statistics $\varepsilon_t$ is usually
assumed to be iid (independent, identically distributed). But if the data is observed over
time, $\varepsilon_t$ could be dependent; the previous observation influences the current observation.
However, once we relax the assumption of independence in the model, problems arise. By
allowing the “noise” $\varepsilon_t$ to be dependent it becomes extremely difficult to discriminate between
mean trend and noise. In Figure 2.3 two plots are given. The top plot is a realisation from
independent normal noise; the bottom plot is a realisation from dependent noise (the AR(1)
process $X_t = 0.95X_{t-1} + \varepsilon_t$). Both realisations have zero mean (no trend), but the lower plot
does give the appearance of an underlying mean trend.
This effect becomes more problematic when analysing data where there is a mean term
plus dependent noise. The smoothness in the dependent noise may give the appearance of
additional features in the mean function. This makes estimating the mean function more difficult,
especially the choice of bandwidth $b$. To understand why, suppose the mean function is
$\mu_t = \mu(t/200)$ (the sample size $n = 200$), where $\mu(u) = 5 \times (2u - 2.5u^2) + 20$. We corrupt
this quadratic function with both iid and dependent noise (the dependent noise is the AR(2)
process defined in equation (2.19)). The plots are given in Figure 2.4. We observe that the
dependent noise looks ‘smooth’ (dependence can induce smoothness in a realisation). This
means that when the mean has been corrupted by dependent noise it is difficult to
see that the underlying trend is a simple quadratic function. In a very interesting paper, Hart
(1991) shows that cross-validation (which is the classical method for choosing the bandwidth
parameter $b$) performs very poorly when the errors are correlated.

Figure 2.3: Top: realisation from iid random noise. Bottom: realisation from dependent
noise. (Axes: independent and dependent against Time.)

Exercise 2.2 The purpose of this exercise is to understand how correlated errors in a
nonparametric model influence local smoothing estimators. We will use a simple local average.
Define the smooth signal f(u) = 5 ∗ (2u − 2.5u^2) + 20 and suppose we observe Yi =
f(i/200) + εi (n = 200). To simulate f(u) with n = 200 define temp <- c(1:200)/200 and
quadratic <- 5*(2*temp - 2.5*(temp**2)) + 20.

(i) Simulate from the above model using iid noise. You can use the code iid = rnorm(200)
and quadraticiid = (quadratic + iid).

Our aim is to estimate f. To do this take a local average (the average can have different
lengths m) (you can use mean(quadraticiid[c(k:(k+m-1))]) for k = 1, . . . , 200−m).
Make a plot of the estimate.

(ii) Simulate from the above model using correlated noise (we simulate from an AR(2)) ar2
= 0.5*arima.sim(list(order=c(2,0,0), ar = c(1.5, -0.75)), n=200) and define
quadraticar2 = (quadratic + ar2).

Again estimate f using local averages and make a plot.

Compare the plots of the estimates based on the two models above.

Figure 2.4: Top: realisations from iid random noise and dependent noise (left = iid and right
= dependent). Bottom: Quadratic trend plus corresponding noise. (Horizontal axis: temp;
vertical axes: iid, ar2, quadraticiid and quadraticar2.)

2.5 Periodic functions


Periodic mean functions arise in several applications, from ECG (which measures heart
rhythms), econometric data and geostatistical data to astrostatistics. Often the aim is to
estimate the period of a periodic function. Let us return to the monthly data example
considered in Section 2.1, equation (2.6):

\[ Y_t = \beta_0 + \beta_1 \sin\left(\frac{2\pi t}{12}\right) + \beta_3 \cos\left(\frac{2\pi t}{12}\right) + \varepsilon_t. \]

This model assumes the mean has a repetition every 12 month period. But, it assumes a
very specific type of repetition over 12 months; one that is composed of one sine and one
cosine. If one wanted to be more general and allow for any periodic sequence of period 12,
the above should be replaced with

Yt = d12 (t) + εt ,

where $d_{12} = (d_{12}(1), d_{12}(2), \ldots, d_{12}(12))$ and $d_{12}(t) = d_{12}(t + 12)$ for all $t$. This is a general
sequence which loops every 12 time points.
In the following few sections our aim is to show that all periodic functions can be written
in terms of sines and cosines.

2.5.1 The sine and cosine transform

An alternative (but equivalent) representation of this periodic sequence is by using sines and
cosines. This is very reasonable, since sines and cosines are also periodic. It can be shown
that

\[ d_{12}(t) = a_0 + \sum_{j=1}^{5} \left[ a_j \cos\left(\frac{2\pi t j}{12}\right) + b_j \sin\left(\frac{2\pi t j}{12}\right) \right] + a_6 \cos(\pi t). \tag{2.9} \]

We observe that the number of coefficients $a_j$ and $b_j$ is 12, which is exactly the number of different
elements in the sequence. Any periodic sequence of period 12 can be written in this way.
Further, equation (2.6) consists of the constant and the first cosine and sine components in this
representation. Thus the representation in (2.9) motivates why (2.6) is often used to model
seasonality. You may wonder why we use just the first components in (2.9) in the seasonal
model; this is because typically the coefficients $a_1$ and $b_1$ are far larger than $\{a_j, b_j\}_{j=2}^{6}$.
This is only a rule of thumb: if you generate several periodic sequences you will see that in
general this is true. Thus in general $a_1 \cos\left(\frac{2\pi t}{12}\right) + b_1 \sin\left(\frac{2\pi t}{12}\right)$
tends to capture the main periodic features in the sequence. Algebraic manipulation shows
that

\[ a_j = \frac{c_j}{12} \sum_{t=1}^{12} d_{12}(t) \cos\left(\frac{2\pi t j}{12}\right) \quad \text{and} \quad b_j = \frac{2}{12} \sum_{t=1}^{12} d_{12}(t) \sin\left(\frac{2\pi t j}{12}\right), \tag{2.10} \]

where $c_j = 1$ for $j \in \{0, 6\}$ and $c_j = 2$ for $1 \leq j \leq 5$. These are often called the sine and cosine transforms.
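The representation (2.9) can be verified numerically. The sketch below draws an arbitrary period-12 sequence (simulated, purely for illustration), computes the cosine and sine coefficients, and rebuilds the sequence. The normalisation used in the code is 1/12 for $j = 0$ and $j = 6$ and 2/12 for $j = 1, \ldots, 5$, which is what makes the reconstruction exact.

```r
set.seed(7)
d12 <- rnorm(12)   # an arbitrary sequence d12(1), ..., d12(12)
tt <- 1:12

# Weights: 1/12 for j = 0 and j = 6, 2/12 for j = 1, ..., 5
cj <- c(1, rep(2, 5), 1)
a <- sapply(0:6, function(j) cj[j + 1] / 12 * sum(d12 * cos(2 * pi * tt * j / 12)))
b <- sapply(1:5, function(j) 2 / 12 * sum(d12 * sin(2 * pi * tt * j / 12)))

# Rebuild d12(t) from a_0, (a_j, b_j) for j = 1, ..., 5, and a_6
recon <- sapply(tt, function(t) {
  a[1] +
    sum(a[2:6] * cos(2 * pi * t * (1:5) / 12) + b * sin(2 * pi * t * (1:5) / 12)) +
    a[7] * cos(pi * t)
})
```

The vector recon reproduces d12 up to rounding error, using exactly 12 coefficients for 12 values.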


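In general for sequences of period $P$, if $P$ is even we can write

\[ d_P(t) = a_0 + \sum_{j=1}^{P/2-1} \left[ a_j \cos\left(\frac{2\pi t j}{P}\right) + b_j \sin\left(\frac{2\pi t j}{P}\right) \right] + a_{P/2} \cos(\pi t) \tag{2.11} \]

and if $P$ is odd

\[ d_P(t) = a_0 + \sum_{j=1}^{(P-1)/2} \left[ a_j \cos\left(\frac{2\pi t j}{P}\right) + b_j \sin\left(\frac{2\pi t j}{P}\right) \right] \tag{2.12} \]

where

\[ a_j = \frac{c_j}{P} \sum_{t=1}^{P} d_P(t) \cos\left(\frac{2\pi t j}{P}\right) \quad \text{and} \quad b_j = \frac{2}{P} \sum_{t=1}^{P} d_P(t) \sin\left(\frac{2\pi t j}{P}\right), \]

with $c_j = 1$ if $j = 0$ or $j = P/2$ and $c_j = 2$ otherwise.
The above reconstructs the periodic sequence $d_P(t)$ in terms of sines and cosines. What we
will learn later on is that all sequences can be built up with sines and cosines (it does not
matter if they are periodic or not).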

2.5.2 The Fourier transform (the sine and cosine transform in disguise)

We will now introduce a tool that often invokes panic in students. But it is very useful
and is simply an alternative representation of the sine and cosine transform (which does
not invoke panic). If you tried to prove (2.10) you would have probably used several cosine
and sine identities. It is a very messy proof. A simpler method is to use an alternative
representation which combines the sine and cosine transforms and imaginary numbers. We
recall the identity

\[ e^{i\omega} = \cos(\omega) + i \sin(\omega), \]

where $i = \sqrt{-1}$. $e^{i\omega}$ contains the sine and cosine information in just one function. Thus
$\cos(\omega) = \mathrm{Re}\, e^{i\omega} = (e^{i\omega} + e^{-i\omega})/2$ and $\sin(\omega) = \mathrm{Im}\, e^{i\omega} = -i(e^{i\omega} - e^{-i\omega})/2$.
It has some very useful properties that just require basic knowledge of geometric series.
We state these below. Define the ratio $\omega_{k,n} = 2\pi k/n$ (we exchange 12 for $n$), then

\[ \sum_{k=0}^{n-1} \exp(ij\omega_{k,n}) = \sum_{k=0}^{n-1} \exp(ik\omega_{j,n}) = \sum_{k=0}^{n-1} [\exp(i\omega_{j,n})]^k. \]

Keep in mind that $j\omega_{k,n} = j2\pi k/n = k\omega_{j,n}$. If $j = 0$, then $\sum_{k=0}^{n-1} \exp(ij\omega_{k,n}) = n$. On the
other hand, if $1 \leq j \leq (n - 1)$, then $\exp(i\omega_{j,n}) = \cos(2\pi j/n) + i \sin(2\pi j/n) \neq 1$, and
we can use the geometric sum identity

\[ \sum_{k=0}^{n-1} \exp(ij\omega_{k,n}) = \sum_{k=0}^{n-1} [\exp(i\omega_{j,n})]^k = \frac{1 - \exp(in\omega_{j,n})}{1 - \exp(i\omega_{j,n})}. \]

But $\exp(in\omega_{j,n}) = \cos(2\pi j) + i \sin(2\pi j) = 1$. Thus for $1 \leq j \leq (n - 1)$ we have

\[ \sum_{k=0}^{n-1} \exp(ij\omega_{k,n}) = \frac{1 - \exp(in\omega_{j,n})}{1 - \exp(i\omega_{j,n})} = 0. \]

In summary,

\[ \sum_{k=0}^{n-1} \exp(ij\omega_{k,n}) = \begin{cases} n & j = n \text{ or } 0 \\ 0 & 1 \leq j \leq (n - 1) \end{cases} \tag{2.13} \]
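Identity (2.13) is easy to confirm numerically. The sketch below evaluates the sum for $n = 12$ and $j = 0, 1, \ldots, 12$; the helper name geom_sum is introduced only for this illustration.

```r
# sum_{k=0}^{n-1} exp(i j omega_{k,n}) with omega_{k,n} = 2 pi k / n
geom_sum <- function(j, n) {
  k <- 0:(n - 1)
  sum(exp(1i * j * 2 * pi * k / n))
}

n <- 12
sums <- sapply(0:n, geom_sum, n = n)   # entries correspond to j = 0, 1, ..., n
magnitudes <- round(Mod(sums), 10)
```

The first and last entries of magnitudes (j = 0 and j = n) equal n, and all the entries in between are zero up to rounding.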

Using the above results we now show that we can rewrite $d_{12}(t)$ in terms of $\exp(i\omega)$ (rather
than sines and cosines). This representation is a lot easier to prove, though it is in
terms of complex numbers. Set $n = 12$ and define the coefficient

\[ A_{12}(j) = \frac{1}{12} \sum_{t=0}^{11} d_{12}(t) \exp(it\omega_{j,12}). \]

$A_{12}(j)$ is complex (it has real and imaginary parts); with a little thought you can see that
$A_{12}(j) = \overline{A_{12}(12 - j)}$ (the complex conjugate). By using (2.13) it is easily shown (see below for proof) that

\[ d_{12}(\tau) = \sum_{j=0}^{11} A_{12}(j) \exp(-ij\omega_{\tau,12}). \tag{2.14} \]

This is just like the sine and cosine representation

\[ d_{12}(t) = a_0 + \sum_{j=1}^{5} \left[ a_j \cos\left(\frac{2\pi t j}{12}\right) + b_j \sin\left(\frac{2\pi t j}{12}\right) \right] + a_6 \cos(\pi t) \]

but with $\exp(ij\omega_{t,12})$ replacing $\cos(j\omega_{t,12})$ and $\sin(j\omega_{t,12})$.

Proof of equation (2.14) The proof of (2.14) is very simple and we now give it. Plugging
the expression for $A_{12}(j)$ into the right hand side of (2.14) gives

\[
\begin{aligned}
\sum_{j=0}^{11} A_{12}(j) \exp(-ij\omega_{\tau,12}) &= \frac{1}{12} \sum_{t=0}^{11} d_{12}(t) \sum_{j=0}^{11} \exp(it\omega_{j,12}) \exp(-ij\omega_{\tau,12}) \\
&= \frac{1}{12} \sum_{t=0}^{11} d_{12}(t) \sum_{j=0}^{11} \exp(i(t - \tau)\omega_{j,12}).
\end{aligned}
\]

We know from (2.13) that $\sum_{j=0}^{11} \exp(i(t - \tau)\omega_{j,12}) = 0$ unless $t = \tau$. If $t = \tau$, then
$\sum_{j=0}^{11} \exp(i(t - \tau)\omega_{j,12}) = 12$. Thus

\[ \frac{1}{12} \sum_{t=0}^{11} d_{12}(t) \sum_{j=0}^{11} \exp(i(t - \tau)\omega_{j,12}) = \frac{1}{12} \sum_{t=0}^{11} d_{12}(t) I(t = \tau) \times 12 = d_{12}(\tau), \]

this proves (2.14). □

Remember the above is just writing the sequence in terms of its sine and cosine transforms;
in fact it is simple to link the two sets of coefficients. For $1 \leq j \leq 5$,

\[ a_j = 2\,\mathrm{Re}\, A_{12}(j) = A_{12}(j) + A_{12}(12 - j) \]
\[ b_j = 2\,\mathrm{Im}\, A_{12}(j) = -i\,[A_{12}(j) - A_{12}(12 - j)], \]

while $a_0 = A_{12}(0)$ and $a_6 = A_{12}(6)$.

We give an example of a periodic function and its Fourier coefficients (real and imaginary
parts) in Figure 2.5. The peak at the zero frequency of the real part corresponds to the
mean of the periodic signal (if the mean is zero, this will be zero).
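The coefficients $A_{12}(j)$ and the inversion formula (2.14) can be computed with R's fft function. As a sketch under stated assumptions: R's forward fft uses the sign $\exp(-i \cdot)$ and no normalisation, while fft(..., inverse = TRUE) uses $\exp(+i \cdot)$, so the latter divided by 12 matches the definition of $A_{12}(j)$ above. The sequence below is simulated purely for illustration.

```r
set.seed(8)
n <- 12
d12 <- rnorm(n)   # an arbitrary real sequence d12(0), ..., d12(11)

# A12(j) = (1/12) sum_t d12(t) exp(i t omega_j): inverse = TRUE gives the +i sign
A <- fft(d12, inverse = TRUE) / n

# The same coefficients computed directly from the definition
A_direct <- sapply(0:(n - 1), function(j) {
  sum(d12 * exp(1i * (0:(n - 1)) * 2 * pi * j / n)) / n
})

# (2.14): d12(tau) = sum_j A12(j) exp(-i j omega_tau), i.e. a forward fft of A
recon <- Re(fft(A))

# Conjugate symmetry: A12(j) equals the conjugate of A12(12 - j), j = 1, ..., 11
sym_gap <- max(Mod(A[2:12] - Conj(A[12:2])))
```

Here recon recovers d12, and sym_gap is zero up to rounding because the sequence is real.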

Figure 2.5: Left: Periodic function $d_5(s) = 1$ for $s = 1, 2$, $d_5(s) = 0$ for $s = 3, 4, 5$ (period 5).
Right: The real and imaginary parts of its Fourier transform. (Axes: oscill against Time;
Re(FO) and Im(FO) against freq.)

Example 2.5.1 In the case that $d_P(t)$ is a pure sine or cosine function $\sin(2\pi t/P)$ or
$\cos(2\pi t/P)$, then $A_P(j)$ will only be non-zero at $j = 1$ and $j = P - 1$.
This is straightforward to see, but we formally prove it below. Suppose that $d_P(s) = \cos\left(\frac{2\pi s}{P}\right)$, then

\[ \frac{1}{P} \sum_{s=0}^{P-1} \cos\left(\frac{2\pi s}{P}\right) \exp\left(i\frac{2\pi s j}{P}\right) = \frac{1}{2P} \sum_{s=0}^{P-1} \left( e^{i2\pi s/P} + e^{-i2\pi s/P} \right) e^{i\frac{2\pi s j}{P}} = \begin{cases} 1/2 & j = 1 \text{ or } P - 1 \\ 0 & \text{otherwise} \end{cases} \]

Suppose that $d_P(s) = \sin\left(\frac{2\pi s}{P}\right)$, then

\[ \frac{1}{P} \sum_{s=0}^{P-1} \sin\left(\frac{2\pi s}{P}\right) \exp\left(i\frac{2\pi s j}{P}\right) = \frac{-i}{2P} \sum_{s=0}^{P-1} \left( e^{i2\pi s/P} - e^{-i2\pi s/P} \right) e^{i\frac{2\pi s j}{P}} = \begin{cases} i/2 & j = 1 \\ -i/2 & j = P - 1 \\ 0 & \text{otherwise} \end{cases} \]

2.5.3 The discrete Fourier transform

The discussion above shows that any periodic sequence can be written as the sum of
(modulated) sines and cosines up to that frequency. But the same is true for any sequence. Suppose
$\{Y_t\}_{t=1}^{n}$ is a sequence of length $n$, then it can always be represented as the superposition of
$n$ sine and cosine functions. To make calculations easier we use $\exp(ij\omega_{k,n})$ instead of sines
and cosines:

\[ Y_t = \sum_{j=0}^{n-1} A_n(j) \exp(-it\omega_{j,n}), \tag{2.15} \]

where the amplitude $A_n(j)$ is

\[ A_n(j) = \frac{1}{n} \sum_{\tau=1}^{n} Y_\tau \exp(i\tau\omega_{j,n}). \]

Here $Y_t$ is acting like $d_P(t)$; it is also periodic if we extend it beyond the boundary $[1, \ldots, n]$. By using
(2.15) as the definition of $Y_t$ we can show that $Y_{t+n} = Y_t$.
Often the factor $1/n$ is distributed evenly over the two sums and we represent $Y_t$ as

\[ Y_t = \frac{1}{\sqrt{n}} \sum_{k=0}^{n-1} J_n(\omega_{k,n}) \exp(-it\omega_{k,n}), \]

where the amplitude of $\exp(-it\omega_{k,n})$ is

\[ J_n(\omega_{k,n}) = \frac{1}{\sqrt{n}} \sum_{\tau=1}^{n} Y_\tau \exp(i\tau\omega_{k,n}). \]

This representation evenly distributes $1/\sqrt{n}$ amongst the two sums. $J_n(\omega_{k,n})$ is called the
Discrete Fourier transform (DFT) of $\{Y_t\}$. It serves a few purposes:

• Jn (ωk,n ) measures the contribution (amplitude) of exp(itωk,n ) (or cos(tωk,n ) and sin(tωk,n ))
in {Yt }.

• Jn (ωk,n ) is a linear transformation of {Yt }nt=1 .

• You can view Jn (ωk,n ) as a scalar product of {Yt } with sines and cosines, or as projection
onto sines or cosines or measuring the resonance of {Yt } at frequency ωk,n . It has the
benefit of being a microscope for detecting periods, as we will see in the next section.

For general time series, the DFT $\{J_n(\frac{2\pi k}{n}); 1 \leq k \leq n\}$ is simply a decomposition of
the time series $\{X_t; t = 1, \ldots, n\}$ into sines and cosines of different frequencies. The
magnitude of $J_n(\omega_k)$ informs on how much of the functions $\sin(t\omega_k)$ and $\cos(t\omega_k)$ are in
$\{X_t; t = 1, \ldots, n\}$. Below we define the periodogram. The periodogram effectively removes
the complex part in $J_n(\omega_k)$ and only measures the absolute magnitude.

Definition 2.5.1 (The periodogram) $J_n(\omega)$ is a complex random variable. Often the
absolute square of $J_n(\omega)$ is analyzed; this is called the periodogram

\[ I_n(\omega) = |J_n(\omega)|^2 = \frac{1}{n} \left( \sum_{t=1}^{n} X_t \cos(t\omega) \right)^2 + \frac{1}{n} \left( \sum_{t=1}^{n} X_t \sin(t\omega) \right)^2. \]

$I_n(\omega)$ combines the information in the real and imaginary parts of $J_n(\omega)$ and has the
advantage that it is real.
$I_n(\omega)$ is symmetric about $\pi$. It is also periodic with period $2\pi$, thus $I_n(\omega + 2\pi) = I_n(\omega)$.
Put together, one only needs to consider $I_n(\omega)$ in the range $[0, \pi]$ to extract all the information
from $I_n(\omega)$.
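A sketch of how the periodogram might be computed in R. The helper name periodogram is an assumption for this illustration (it is not a base R function). It relies on the fact that, for real data, the modulus of R's unnormalised fft equals the modulus of $\sum_t X_t \exp(it\omega_k)$, so only the factor $1/n$ has to be added.

```r
# Periodogram at the frequencies omega_k = 2 pi k / n, k = 0, ..., n - 1
periodogram <- function(x) {
  n <- length(x)
  # fft(x)[k + 1] = sum_t x[t] exp(-i (t - 1) omega_k); for real x its modulus
  # equals that of sum_t x[t] exp(i t omega_k)
  Mod(fft(x))^2 / n
}

set.seed(9)
n <- 100
x <- rnorm(n)
In <- periodogram(x)                 # In[k + 1] is I_n(omega_k)
omega <- 2 * pi * (0:(n - 1)) / n

# Direct evaluation from the definition at one frequency, k = 7
k <- 7
In_direct <- (sum(x * cos((1:n) * omega[k + 1]))^2 +
              sum(x * sin((1:n) * omega[k + 1]))^2) / n

# Symmetry about pi: I_n(omega_k) = I_n(omega_{n - k})
sym_gap <- max(abs(In[2:n] - rev(In[2:n])))
```

Because of the symmetry, plots of the periodogram usually show only the frequencies in $[0, \pi]$, that is, In[1:(n/2 + 1)].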

2.5.4 The discrete Fourier transform and periodic signals

In this section we consider signals with periodic trend:

\[ Y_t = d_P(t) + \varepsilon_t = \sum_{j=0}^{P-1} A_P(j) e^{-i\frac{2\pi j t}{P}} + \varepsilon_t, \qquad t = 1, \ldots, n, \]

where for all $t$, $d_P(t) = d_P(t + P)$ (assume $\{\varepsilon_t\}$ are iid). Our aim in this section is to estimate
(at least visually) the period. We use the DFT of the time series to gain some understanding of
$d_P(t)$. We show below that the linear transformation $J_n(\omega_{k,n})$ is more informative about $d_P$
than $\{Y_t\}$.
We recall that the discrete Fourier transform of $\{Y_t\}$ is

\[ J_n(\omega_{k,n}) = \frac{1}{\sqrt{n}} \sum_{t=1}^{n} Y_t [\cos(t\omega_{k,n}) + i \sin(t\omega_{k,n})] = \frac{1}{\sqrt{n}} \sum_{t=1}^{n} Y_t \exp(it\omega_{k,n}) \]

where $\{\omega_{k,n} = \frac{2\pi k}{n}\}$. We show below that when the periodicity in the cosine and sine function
matches the periodicity of the mean function $J_n(\omega)$ will be large and at other frequencies it
will be small. Thus

\[ J_n(\omega_{k,n}) = \begin{cases} \sqrt{n} A_P(r) + \frac{1}{\sqrt{n}} \sum_{t=1}^{n} \varepsilon_t e^{it\omega_{k,n}} & k = \frac{n}{P} r, \; r = 0, \ldots, P - 1 \\ \frac{1}{\sqrt{n}} \sum_{t=1}^{n} \varepsilon_t e^{it\omega_{k,n}} & k \notin \frac{n}{P}\mathbb{Z} \end{cases} \tag{2.16} \]

Assuming that $\frac{1}{\sqrt{n}}\sum_{t=1}^{n} \varepsilon_t e^{it\omega_{k,n}}$ is low lying noise (we discuss this in detail later), what
we should see are $P$ large spikes, each corresponding to $A_P(r)$. Though the above is simply
an algebraic calculation. The reason for the term $\sqrt{n}$ in (2.16) (recall $n$ is the sample size) is
because there are $n/P$ repetitions of the period.

Example We consider a simple example where $d_4(s) = (1.125, -0.375, -0.375, -0.375)$
(period = 4, total length 100, number of repetitions 25). We add noise to it (iid normal with
$\sigma = 0.4$). A plot of one realisation is given in Figure 2.7. In Figure 2.8 we superimpose the
observed signal with two different sine functions. Observe that when the sine function
matches the frequencies ($\sin(25u)$, red plot) their scalar product will be large. But when the
sine frequency does not match the periodic frequency the scalar product will be close to zero.
In Figure 2.9 we plot the signal together with its periodogram. Observe that the plot
matches equation (2.16). At the frequency of the period the signal amplitude is very large.

Figure 2.6: Left: Periodic function $d_4(s) = (1.125, -0.375, -0.375, -0.375)$ (period 4).
(Right panel title: Periodogram; axes: oscillM against t, FS against freq.)

Figure 2.7: Periodic function $d_4(s) = (1.125, -0.375, -0.375, -0.375)$ (period 4) and signal
with noise (blue line).

Figure 2.8: Left: Signal superimposed with $\sin(u)$. Right: Signal superimposed with
$\sin(25u)$.

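Figure 2.9: Left: Signal. Right: periodogram of signal (periodogram of periodic function in
red).

Proof of equation (2.16) To see why, we rewrite $J_n(\omega_k)$ (we assume $n$ is a multiple of $P$) as

\[
\begin{aligned}
J_n(\omega_k) &= \frac{1}{\sqrt{n}} \sum_{t=1}^{n} d_P(t) \exp(it\omega_k) + \frac{1}{\sqrt{n}} \sum_{t=1}^{n} \varepsilon_t e^{it\omega_k} \\
&= \frac{1}{\sqrt{n}} \sum_{t=0}^{n/P-1} \sum_{s=1}^{P} d_P(Pt + s) \exp(iPt\omega_k + is\omega_k) + \frac{1}{\sqrt{n}} \sum_{t=1}^{n} \varepsilon_t e^{it\omega_k} \\
&= \frac{1}{\sqrt{n}} \sum_{t=0}^{n/P-1} \exp(iPt\omega_k) \sum_{s=1}^{P} d_P(s) \exp(is\omega_k) + \frac{1}{\sqrt{n}} \sum_{t=1}^{n} \varepsilon_t e^{it\omega_k} \\
&= \frac{1}{\sqrt{n}} \sum_{s=1}^{P} d_P(s) \exp(is\omega_k) \sum_{t=0}^{n/P-1} \exp(iPt\omega_k) + \frac{1}{\sqrt{n}} \sum_{t=1}^{n} \varepsilon_t e^{it\omega_k}.
\end{aligned}
\]

We now use a result analogous to (2.13)

\[ \sum_{t=0}^{n/P-1} \exp(iPt\omega_k) = \begin{cases} \frac{1 - \exp(i2\pi k)}{1 - \exp(iP\omega_k)} = 0 & k \notin \frac{n}{P}\mathbb{Z} \\ n/P & k \in \frac{n}{P}\mathbb{Z} \end{cases} \]

Thus

\[ J_n(\omega_k) = \begin{cases} \sqrt{n} A_P(r) + \frac{1}{\sqrt{n}} \sum_{t=1}^{n} \varepsilon_t e^{it\omega_k} & k = \frac{n}{P} r, \; r = 0, \ldots, P - 1 \\ \frac{1}{\sqrt{n}} \sum_{t=1}^{n} \varepsilon_t e^{it\omega_k} & k \notin \frac{n}{P}\mathbb{Z} \end{cases} \]

where $A_P(r) = P^{-1} \sum_{s=1}^{P} d_P(s) \exp(2\pi i s r/P)$. This proves (2.16). □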
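The spike structure can be checked with the period-4 sequence from the example above. The sketch below computes the DFT of the noise-free signal from its definition and compares it with $\sqrt{n} A_P(r)$ at $k = (n/P) r$. The helper Jn and the seed are assumptions for this illustration; the noise level $\sigma = 0.4$ is taken from the example.

```r
d4 <- c(1.125, -0.375, -0.375, -0.375)
P <- 4
n <- 100
tt <- 1:n
signal <- rep(d4, n / P)   # d_P(t) for t = 1, ..., n (25 repetitions)

# DFT from its definition: J_n(omega_k) = n^{-1/2} sum_t y_t exp(i t omega_k)
Jn <- function(y, k) sum(y * exp(1i * tt * 2 * pi * k / n)) / sqrt(n)

# A_P(r) = P^{-1} sum_{s=1}^P d_P(s) exp(2 pi i s r / P), r = 0, ..., P - 1
A_P <- sapply(0:(P - 1), function(r) sum(d4 * exp(2i * pi * (1:P) * r / P)) / P)

J_signal  <- sapply(0:(n - 1), Jn, y = signal)
spike_idx <- (n / P) * (0:(P - 1)) + 1     # positions of k = (n/P) r in the vector
spikes    <- J_signal[spike_idx]

# With noise added, the same spikes sit on top of low lying noise
set.seed(10)
y <- signal + rnorm(n, sd = 0.4)
In_noisy <- Mod(sapply(0:(n - 1), Jn, y = y))^2
```

For the noise-free signal the DFT is zero away from the spikes. Since this particular $d_4$ has mean zero, $A_P(0) = 0$ and only the spikes at $r = 1, 2, 3$ are visible.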

Exercise 2.3 Generate your own periodic sequence of length P (you select P ). Call this
sequence {dP (t)} and generate a sequence {xt } with several replications of {dP (t)} and cal-
culate the periodogram of the periodic signal.
Add iid noise to the signal and again evaluate the periodogram (do the same for noise
with different standard deviations).

(i) Make plots of the true signal and the corrupted signal.

(ii) Compare the periodogram of the true signal with the periodogram of the corrupted signal.

2.5.5 Smooth trends and its corresponding DFT

So far we have used the DFT to search for periodicities. But the DFT/periodogram of a
smooth signal also leaves an interesting signature. Consider the quadratic signal

\[ g(t) = 6\left[ \frac{t}{100} - \left(\frac{t}{100}\right)^2 \right] - 0.7, \qquad t = 1, \ldots, 100. \]

To $g(t)$ we add iid noise, $Y_t = g(t) + \varepsilon_t$, where $\mathrm{var}[\varepsilon_t] = 0.5^2$. A realisation and its corresponding
periodogram is given in Figure 2.10. We observe that the quadratic signal is composed of
low frequencies (sines and cosines with very large periods). In general, any signal which
is “smooth” can be decomposed into sines and cosines in the very low frequencies. Thus a
periodogram with a large peak around the low frequencies suggests that the underlying
signal contains a smooth signal (either deterministically or stochastically).

Figure 2.10: Left: Signal and noise (blue). The signal is in red. Right: Periodogram of
signal plus noise (up to frequency π/5). Periodogram of signal is in red. (Panel titles:
Signal plus noise, Periodogram; axes: y against time, F1 against freq.)
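The concentration of a smooth signal at low frequencies can be quantified with a short sketch. It uses the quadratic signal and noise level from above; the helper pgram and the summary low_share (the fraction of the signal's periodogram, excluding frequency zero, that sits at the three lowest non-zero frequencies) are introduced only for this illustration.

```r
n <- 100
tt <- 1:n
g <- 6 * (tt / n - (tt / n)^2) - 0.7   # the quadratic signal
set.seed(11)
y <- g + rnorm(n, sd = 0.5)            # signal plus iid noise

pgram <- function(x) Mod(fft(x))^2 / length(x)
I_signal <- pgram(g)                   # entry k + 1 corresponds to omega_k
I_noisy  <- pgram(y)

# Share of the signal's periodogram over omega_1, ..., omega_50 that is
# carried by omega_1, omega_2 and omega_3
low_share <- sum(I_signal[2:4]) / sum(I_signal[2:51])
```

Nearly all of the noise-free signal's periodogram sits at the first few frequencies, whereas the periodogram of the noise is spread roughly evenly over all frequencies.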

2.5.6 Period detection

In this section we formalize what we have seen and derived for the periodic sequences given
above. Our aim is to estimate the period $P$. But to simplify the approach, we focus on the
case that $d_P(t)$ is a pure sine or cosine function (no mix of sines and cosines).
We will show that the visual Fourier transform method described above is equivalent
to period estimation using least squares. Suppose that the observations $\{Y_t; t = 1, \ldots, n\}$
satisfy the following regression model

\[ Y_t = A \cos(\Omega t) + B \sin(\Omega t) + \varepsilon_t = A \cos\left(\frac{2\pi t}{P}\right) + B \sin\left(\frac{2\pi t}{P}\right) + \varepsilon_t \]

where $\{\varepsilon_t\}$ are iid standard normal random variables and $0 < \Omega < \pi$ (using the periodic
notation we set $\Omega = \frac{2\pi}{P}$).
The parameters $A$, $B$, and $\Omega$ are real and unknown. Unlike the regression models given
in (2.1) the model here is nonlinear, since the unknown parameter, $\Omega$, is inside a trigonometric
function. Standard linear least squares methods cannot be used to estimate the parameters.
Assuming Gaussianity of $\{\varepsilon_t\}$ (though this assumption is not necessary), the
log-likelihood corresponding to the model is (up to an additive constant)

\[ L_n(A, B, \Omega) = -\frac{1}{2} \sum_{t=1}^{n} (Y_t - A \cos(\Omega t) - B \sin(\Omega t))^2 \]

(alternatively one can think of it in terms of least squares, whose criterion is proportional to the negative of the above).
The above criterion is a negative nonlinear least squares criterion in $A$, $B$ and $\Omega$. It does not
yield an analytic solution and would require the use of a numerical maximisation scheme.
However, using some algebraic manipulations, explicit expressions for the estimators can be
obtained (see Walker (1971) and Exercise 2.5). The result of these manipulations gives the
frequency estimator

\[ \widehat{\Omega}_n = \arg\max_{\omega} I_n(\omega) \]

where

\[ I_n(\omega) = \frac{1}{n} \left| \sum_{t=1}^{n} Y_t \exp(it\omega) \right|^2 = \frac{1}{n} \left( \sum_{t=1}^{n} Y_t \cos(t\omega) \right)^2 + \frac{1}{n} \left( \sum_{t=1}^{n} Y_t \sin(t\omega) \right)^2. \tag{2.17} \]

Using $\widehat{\Omega}_n$ we estimate $A$ and $B$ with

\[ \widehat{A}_n = \frac{2}{n} \sum_{t=1}^{n} Y_t \cos(\widehat{\Omega}_n t) \quad \text{and} \quad \widehat{B}_n = \frac{2}{n} \sum_{t=1}^{n} Y_t \sin(\widehat{\Omega}_n t). \]
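These estimators take only a few lines in R. The sketch below assumes a pure sine with amplitude 2 and period 8 observed with standard normal noise, and restricts the maximisation to the fundamental frequencies $2\pi k/n$ in $(0, \pi)$; a finer frequency grid would generally be needed when the true frequency does not lie on this grid.

```r
set.seed(12)
n <- 128
tt <- 1:n
Omega <- 2 * pi / 8                     # assumed true frequency (period 8)
y <- 2 * sin(Omega * tt) + rnorm(n)     # A = 0, B = 2

In <- Mod(fft(y))^2 / n                 # In[k + 1] is I_n(omega_k)
k_hat <- which.max(In[2:(n / 2)])       # search over k = 1, ..., n/2 - 1
Omega_hat <- 2 * pi * k_hat / n         # frequency estimator

# Amplitude estimators evaluated at the estimated frequency
A_hat <- 2 / n * sum(y * cos(Omega_hat * tt))
B_hat <- 2 / n * sum(y * sin(Omega_hat * tt))
P_hat <- 2 * pi / Omega_hat             # estimated period
```

In this design the true frequency corresponds to $k = n/8 = 16$, where the signal contributes $n(A^2 + B^2)/4 = 128$ to the periodogram, far above the level of the noise.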

43
The rather remarkable aspect of this result is that the rate of convergence of

b n − Ω| = Op (n−3/2 ),
|Ω

which is faster than the standard O(n−1/2 ) that we usually encounter (we will see this in

Example 2.5.2). This means that for even moderate sample sizes if P = Ω
is not too large,
2
then Ω
b n will be “close” to Ω. . The reason we get this remarkable result was alluded to
previously. We reiterate it again

n n 2
1 X itω 1 X itω
In (ω) ≈ [A cos (tΩ) + B sin (tΩ)] e + εt e .
n t=1 n t=1
| {z } | {z }
signal noise

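The “signal” in In(ωk) is the periodogram corresponding to the cosine and/or sine function.
For example, set Ω = 2π/P, A = 1 and B = 0, and assume that n is a multiple of P. The signal is

$$\frac{1}{n}\left|\sum_{t=1}^{n}\cos\left(\frac{2\pi t}{P}\right)e^{it\omega_k}\right|^2 = \begin{cases} \dfrac{n}{4} & k = \dfrac{n}{P} \text{ or } k = n - \dfrac{n}{P} \\ 0 & \text{otherwise.} \end{cases}$$

Observe there is a peak at the frequencies ωk = 2π/P and ωk = 2π − 2π/P, which is of order n;
elsewhere it is zero. On the other hand the noise is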

$$\frac{1}{n}\left|\sum_{t=1}^{n}\varepsilon_t e^{it\omega_k}\right|^2 = \Bigg|\underbrace{\frac{1}{\sqrt{n}}\sum_{t=1}^{n}\varepsilon_t e^{it\omega_k}}_{\text{treat as a rescaled mean}}\Bigg|^2 = O_p(1),$$

where Op(1) means that it is bounded in probability (it does not grow as n → ∞). Putting
these two facts together, we observe that the contribution of the signal dominates the
periodogram In(ω). A simulation to illustrate this effect is given in Figure ??.

² In contrast consider the iid random variables $\{X_t\}_{t=1}^{n}$, where E[Xt] = µ and var(Xt) = σ². The variance
of the sample mean $\bar{X} = n^{-1}\sum_{t=1}^{n} X_t$ is var[X̄] = σ²/n. This means |X̄ − µ| = Op(n^{−1/2}).
This means there exists a random variable U such that |X̄ − µ| ≤ n^{−1/2}U. Roughly, this means as n → ∞
the distance between X̄ and µ declines at the rate n^{−1/2}.

Remark 2.5.1 In practice, usually we evaluate Jn(ω) and In(ω) at the so called fundamental
frequencies ωk = 2πk/n and we do this with the fft function in R:

$$\{Y_t\}_{t=1}^{n} \rightarrow \left\{ J_n\left(\frac{2\pi k}{n}\right) = \frac{1}{\sqrt{n}}\sum_{t=1}^{n} Y_t\cos\left(t\frac{2\pi k}{n}\right) + i\frac{1}{\sqrt{n}}\sum_{t=1}^{n} Y_t\sin\left(t\frac{2\pi k}{n}\right) \right\}_{k=1}^{n}.$$

Jn (ωk ) is simply a linear one to one transformation of the data (nothing is lost in this
transformation). Statistical analysis can be applied on any transformation of the data (for
example Wavelet transforms). It so happens that for stationary time series this so called
Fourier transform has some advantages.
For period detection and amplitude estimation one can often obtain a better estimator of
P (or Ω) if a finer frequency resolution is used. This is done by padding the signal with
zeros and evaluating the periodogram on the grid 2πk/d where d >> n. The estimate of the period is
then evaluated by using

$$\widehat{P} = \frac{d}{\widehat{K} - 1}$$

where $\widehat{K}$ is the entry in the vector corresponding to the maximum of the periodogram.
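The link between Jn(ωk) in the remark above and the output of R's `fft` can be checked numerically. The sketch below rests on one fact about base R: `fft(y)[k + 1]` equals the sum over t of `y[t] * exp(-2i*pi*(t - 1)*k/n)` for k = 0, . . . , n − 1, so it differs from the definition above in the sign of the exponent, the time origin and the normalisation. The helper name `dft_direct` and the sample size are assumptions made for the illustration.

```r
# Hypothetical helper: J_n(omega_k), k = 1, ..., n, straight from the definition
dft_direct <- function(y) {
  n  <- length(y)
  tt <- seq_len(n)
  sapply(seq_len(n), function(k) sum(y * exp(1i * tt * 2 * pi * k / n)) / sqrt(n))
}

set.seed(1)
n <- 16
y <- rnorm(n)

J_direct <- dft_direct(y)

# Same quantity recovered from fft(): conjugate (y is real), shift the time
# origin from t - 1 to t, and divide by sqrt(n). Frequency k = n is frequency 0.
Fy  <- fft(y)
k   <- seq_len(n)
idx <- (k %% n) + 1
J_fft <- exp(1i * 2 * pi * k / n) * Conj(Fy[idx]) / sqrt(n)

max(Mod(J_direct - J_fft))       # numerically zero

# The periodogram only needs the modulus, so the phase correction is irrelevant
I_direct <- Mod(J_direct)^2
I_fft    <- Mod(Fy[idx])^2 / n
```

In other words, for the periodogram it is enough to square the modulus of `fft(y)` and divide by n.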

We consider an example below.

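Example 2.5.2 Consider the following model

$$Y_t = 2\sin\left(\frac{2\pi t}{8}\right) + \varepsilon_t, \qquad t = 1, \ldots, n, \qquad (2.18)$$

where εt are iid standard normal random variables (and for simplicity we assume n is a
multiple of 8). Note by using Remark 2.5.1 and equation (2.16) we have

$$\frac{1}{n}\left|\sum_{t=1}^{n} 2\sin\left(\frac{2\pi t}{8}\right)\exp(it\omega_{k,n})\right|^2 = \begin{cases} n & k = \dfrac{n}{8} \text{ or } n - \dfrac{n}{8} \\ 0 & \text{otherwise.} \end{cases}$$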

It is clear that {Yt} is made up of a periodic signal with period eight. In Figure 2.11 we give
a plot of one realisation (using sample size n = 128) together with a plot of the periodogram
In(ω) (defined in (2.17)). From the realisation, it is not clear what the period is (the noise has made
it difficult to see the period). On the other hand, the periodogram clearly shows a peak
at frequency 2π/8 ≈ 0.78 (where we recall that 8 is the period) and at 2π − 2π/8 (since the
periodogram is symmetric about π).
[Figure 2.11: plot of signal against Time; plot of the periodogram P against frequency.]

Figure 2.11: Left: Realisation of (2.18) plus iid noise, Right: Periodogram of signal plus iid
noise.

Searching for peaks in the periodogram is a long established method for detecting
periodicities. The method outlined above can easily be generalized to the case that there
are multiple periods. However, distinguishing between two periods which are very close in
frequency (such data arises in astronomy) is a difficult problem and requires more subtle
methods (see Quinn and Hannan (2001)).

Fisher’s g-statistic (advanced) The discussion above motivates Fisher’s test for a hidden
period, where the objective is to detect a period in the signal. The null hypothesis is H0:
the signal is just white noise with no periodicities; the alternative is H1: the signal contains
a periodicity. The original test statistic was constructed under the assumption that the noise
was iid Gaussian. As we have discussed above, if a period exists, In(ωk) will contain a few
“large” values, which correspond to the periodicities. The majority of In(ωk) will be “small”.
Based on this notion, Fisher’s g-statistic is defined as

$$\eta_n = \frac{\max_{1\leq k\leq (n-1)/2} I_n(\omega_k)}{\frac{2}{n-1}\sum_{k=1}^{(n-1)/2} I_n(\omega_k)},$$

where we note that the denominator can be treated as the average noise. Under the null
(and iid normality of the noise), this ratio is pivotal (it does not depend on any unknown
nuisance parameters).
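Because the ratio is pivotal under the null, its null distribution can be approximated by simulating Gaussian white noise. The following is a minimal sketch of that idea, not Fisher's original exact calculation: the function name `fisher_eta`, the sample size n = 127 (odd, so that (n − 1)/2 is an integer) and the 2000 Monte Carlo replications are all assumptions made for illustration.

```r
# Hypothetical helper: eta_n = max periodogram ordinate / average ordinate,
# over the frequencies omega_k = 2*pi*k/n with k = 1, ..., (n - 1)/2
fisher_eta <- function(y) {
  n  <- length(y)
  m  <- floor((n - 1) / 2)
  I  <- Mod(fft(y))^2 / n        # entry k + 1 holds the ordinate at omega_k
  Ik <- I[2:(m + 1)]
  max(Ik) / mean(Ik)
}

set.seed(2)
n <- 127
y_noise  <- rnorm(n)                                     # H0: white noise
y_signal <- 2 * sin(2 * pi * seq_len(n) / 8) + rnorm(n)  # H1: hidden period

# Monte Carlo approximation of the null distribution of eta_n
eta_null <- replicate(2000, fisher_eta(rnorm(n)))
p_value  <- function(eta) mean(eta_null >= eta)

eta_noise  <- fisher_eta(y_noise)
eta_signal <- fisher_eta(y_signal)
c(p_noise = p_value(eta_noise), p_signal = p_value(eta_signal))
```

A small Monte Carlo p-value for the series with the sinusoid, and an unremarkable one for pure noise, is the behaviour the test is designed to produce.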

2.5.7 Period detection and correlated noise

The methods described in the previous section are extremely effective if the error process
{εt} is uncorrelated. However, problems arise when the errors are correlated. To illustrate
this issue, consider again model (2.18)

$$Y_t = 2\sin\left(\frac{2\pi t}{8}\right) + \varepsilon_t, \qquad t = 1, \ldots, n,$$

but this time the errors are correlated. More precisely, they are generated by the AR(2)
model,

$$\varepsilon_t = 1.5\varepsilon_{t-1} - 0.75\varepsilon_{t-2} + \epsilon_t, \qquad (2.19)$$

where {ϵt} are iid random variables (do not worry if this does not make sense to you; we define
this class of models precisely in Chapter 4). As in the iid case we use a sample size n = 128. In
Figure 2.12 we give a plot of one realisation and the corresponding periodogram. We observe
that the peak at 2π/8 is not the highest. The correlated errors (often called coloured noise)
are masking the peak by introducing new peaks. To see what happens for larger sample sizes,
we consider exactly the same model (2.18) with the noise generated as in (2.19). But this
time we use n = 1024 (8 times the previous sample size). A plot of one realisation, together
with the periodogram, is given in Figure 2.13. In contrast to the smaller sample size, a large
peak is visible at 2π/8. These examples illustrate two important points:
[Figure 2.12: plot of signal2 against Time; plot of the periodogram P2 against frequency.]

Figure 2.12: Top: Realisation of (2.18) plus correlated noise and n = 128, Bottom:
Periodogram of signal plus correlated noise.

[Figure 2.13: plot of signal2 against Time; plot of the periodogram P2 against frequency.]

Figure 2.13: Top: Realisation of (2.18) plus correlated noise and n = 1024, Bottom:
Periodogram of signal plus correlated noise.

(i) When the noise is correlated and the sample size is relatively small it is difficult to
disentangle the deterministic period from the noise. Indeed we will show in Chapters 4
and 6 that linear time series (such as the AR(2) model described in (2.19)) can exhibit
similar types of behaviour to a periodic deterministic signal. This is a subject of
ongoing research that dates back at least 60 years (see Quinn and Hannan (2001) and
the P-statistic proposed by Priestley).

However, the similarity is only to a point. Given a large enough sample size (which
may in practice not be realistic), the deterministic frequency dominates again (as we
have seen when we increase n to 1024).

(ii) The periodogram holds important information about oscillations in both the signal
and the noise {εt}. If the noise is iid then the corresponding periodogram tends
to be flattish (see Figure 2.11). This informs us that no frequency dominates the others,
and is the reason that an iid time series (or more precisely an uncorrelated time series) is
called “white noise”.

Comparing Figure 2.11 with Figures 2.12 and 2.13, we observe that with correlated noise the
periodogram does not appear completely flat. Some frequencies tend to be far larger than
others. This is because when data is dependent, certain patterns are seen, which are
registered by the periodogram (see Section 4.3.6).

Understanding the DFT and the periodogram is called spectral analysis and is explored
in Chapters 10 and 11.
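The effect of the sample size in point (i) can be explored with a small simulation. This is a sketch under assumptions: the helper names `peak_frequency`, `simulate_model` and `hit`, the seed, and the choice of 200 replications are made up for the illustration, and the search for the maximum is restricted to frequencies in (0, π]. It records how often the largest periodogram ordinate sits exactly at 2π/8 when the noise follows (2.19).

```r
# Hypothetical helper: frequency in (0, pi] with the largest periodogram ordinate
peak_frequency <- function(y) {
  n <- length(y)
  I <- Mod(fft(y))^2 / n
  k <- 1:floor(n / 2)
  2 * pi * k[which.max(I[k + 1])] / n
}

# Hypothetical helper: one realisation of model (2.18) with AR(2) noise (2.19)
simulate_model <- function(n) {
  ar2 <- arima.sim(list(order = c(2, 0, 0), ar = c(1.5, -0.75)), n = n)
  2 * sin(2 * pi * seq_len(n) / 8) + as.numeric(ar2)
}

# Proportion of replications in which the maximum is at the true frequency
hit <- function(n, reps = 200) {
  mean(replicate(reps, abs(peak_frequency(simulate_model(n)) - 2 * pi / 8) < 1e-8))
}

set.seed(10)
hits <- c(n128 = hit(128), n1024 = hit(1024))
hits
```

Comparing the two proportions gives a rough feel for how much data is needed before the deterministic frequency stands out from the coloured noise.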

2.5.8 History of the periodogram

The use of the periodogram, In(ω), to detect periodicities in the data dates back to
Schuster in the 1890s. One of Schuster’s interests was sunspot data. He analyzed the
number of sunspots through the lens of the periodogram. A plot of the monthly time series
and corresponding periodogram is given in Figure 2.14. Let {Yt} denote the number of
sunspots at month t. Schuster fitted a periodic trend plus noise model of the type

Yt = A cos(Ωt) + B sin(Ωt) + εt ,

where Ω = 2π/P. The periodogram shows a peak at frequency Ω = 2π/(11 × 12) ≈ 0.047
(132 months), which corresponds to a period of P = 11 years. This suggests that the number
of sunspots follows a periodic cycle with a peak every P = 11 years.

Figure 2.14: Sunspot data from Jan, 1749 to Dec, 2014. There is a peak at about 30 along
the line which corresponds to 2π/P = 0.047 and P ≈ 132 months (11 years).

The general view until the 1920s was that most time series were a mix of periodic functions
with additive noise

$$Y_t = \sum_{j=1}^{P}\left[A_j\cos(t\Omega_j) + B_j\sin(t\Omega_j)\right] + \varepsilon_t.$$

However, in the 1920s, Udny Yule, a statistician, and Gilbert Walker, a meteorologist
(working in Pune, India) believed an alternative model could be used to explain the features seen
in the periodogram. We consider their proposed approach in Section 4.3.5.

2.6 Data Analysis: EEG data

2.6.1 Connecting Hertz and Frequencies

Engineers and neuroscientists often “think” in terms of oscillations or cycles per second.
Instead of the sample size they will state the sampling frequency (the number of
observations per second), which is measured in Hertz (Hz), and the number of seconds the
time series is observed. Thus the periodogram is plotted against cycles per second rather
than on the [0, 2π] scale. In the following example we connect the two.

Example Suppose that a time series is sampled at 36Hz (36 observations per second) and
the signal is g(u) = sin(2π × 4u) (u ∈ R). The observed time series in one second is

$$\left\{\sin\left(2\pi\times 4\times\frac{t}{36}\right)\right\}_{t=1}^{36}.$$

An illustration is given below.

We observe from the plot above that the period of repetition is P = 9 time points (over 36
time points the signal repeats itself every 9 points). Thus in terms of the periodogram this
corresponds to a spike at frequency ω = 2π/9. But to an engineer this means 4 repetitions
a second and a spike at 4Hz. It is the same plot, just the x-axis is different. The two plots
are given below.
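The conversion between the two x-axes in this example can be sketched in R as follows. The variable names (`fs`, `k_hat`, `omega_hat`, `hz_hat`) are chosen here for illustration; the only ingredient is that a frequency ω on the [0, 2π] scale corresponds to ω/(2π) × (sampling frequency) cycles per second.

```r
fs <- 36                                # sampling frequency in Hz
n  <- 36                                # one second of observations
tt <- seq_len(n)
x  <- sin(2 * pi * 4 * tt / fs)         # g(u) = sin(2*pi*4*u) observed at u = t/36

I <- Mod(fft(x))^2 / n                  # periodogram at omega_k = 2*pi*k/n, k = 0, ..., n - 1
k <- 0:(n / 2)                          # restrict to frequencies in [0, pi]
k_hat <- k[which.max(I[k + 1])]

omega_hat <- 2 * pi * k_hat / n         # location of the spike on the [0, 2*pi] scale
period    <- 2 * pi / omega_hat         # number of time points per cycle
hz_hat    <- omega_hat / (2 * pi) * fs  # the same spike in cycles per second
c(omega_hat = omega_hat, period = period, hz_hat = hz_hat)
```

Here the spike is at ω = 2π/9, that is, a period of 9 time points or 4Hz.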

Analysis from the perspective of time series Typically, in time series, the sampling frequency
is kept the same. Only the number of seconds over which the time series is observed grows.
This allows us to obtain a finer frequency grid on [0, 2π] and obtain a better resolution in terms
of peaks in frequencies. However, it does not allow us to identify frequencies that are higher
than the sampling rate.
Returning to the example above, suppose we observe another signal h(u) = sin(2π ×
(4 + 36)u). If the sampling frequency is 36Hz and u = 1/36, 2/36, . . . , 36/36, then

$$\sin\left(2\pi\times 4\times\frac{t}{36}\right) = \sin\left(2\pi\times(4+36)\times\frac{t}{36}\right) \quad\text{for all } t\in\mathbb{Z}.$$

Thus we cannot tell the difference between these two signals when we sample at 36Hz, even
if the observed time series is very long. This is called aliasing.

Analysis from the perspective of an engineer An engineer may be able to improve the
hardware and sample the time series at a higher temporal resolution, say, 72Hz. At this higher
temporal resolution, the two functions g(u) = sin(2π × 4 × u) and h(u) = sin(2π(4 + 36)u)
are different.

In the plot above the red line is g(u) = sin(2π4u) and the yellow line is h(u) = sin(2π(4 +
36)u). The periodogram for both signals g(u) = sin(2π × 4 × u) and h(u) = sin(2π(4 + 36)u)
is given below.

In Hz, we extend the x-axis to include more cycles. The same thing is done for the frequency
range [0, 2π]: we extend it to include higher frequencies. Thus when we observe
on a finer temporal grid, we are able to identify higher frequencies. Extending this idea, if
we observe time on R, then we can identify all frequencies on R, not just on [0, 2π].
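The aliasing effect is easy to verify numerically. The short sketch below evaluates the two signals g and h from the example on the 36Hz grid and on the 72Hz grid; the object names `u36`, `u72`, `d36` and `d72` are introduced only for this illustration.

```r
g <- function(u) sin(2 * pi * 4 * u)
h <- function(u) sin(2 * pi * (4 + 36) * u)

u36 <- (1:36) / 36     # one second sampled at 36Hz
u72 <- (1:72) / 72     # one second sampled at 72Hz

d36 <- max(abs(g(u36) - h(u36)))   # numerically zero: g and h coincide on this grid
d72 <- max(abs(g(u72) - h(u72)))   # clearly non-zero: the finer grid separates them
c(d36 = d36, d72 = d72)
```

On the 36Hz grid the two series agree up to rounding error, whereas on the 72Hz grid they differ at every odd time point.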

2.6.2 Data Analysis

In this section we conduct a preliminary analysis of an EEG data set. A plot of one EEG of
one participant at one channel (probe on skull) over 2 seconds (about 512 observations, 256
Hz) is given in Figure 2.15. The neuroscientists who analyse such data use the periodogram
to associate the EEG to different types of brain activity. A plot of the periodogram is
given in Figure 2.16. The periodogram is given in both [0, π] and Hz (cycles per second).
Observe that the EEG contains a large amount of low frequency information; this is probably
due to the slowly changing trend in the original EEG. The neurologists have grouped the
cycles into bands and associated to each band different types of brain activity (see
https://en.wikipedia.org/wiki/Alpha_wave#Brain_waves). Very low frequency waves, such
as delta, theta and to some extent alpha waves, are often associated with low level brain
activity (such as breathing). Higher frequencies (alpha and gamma waves) in the EEG are
often associated with conscious thought (though none of this is completely understood and
there are many debates on this). Studying the periodogram of the EEG in Figures 2.15
and 2.16, we observe that the low frequency information dominates the signal. Therefore,
the neuroscientists prefer to decompose the signal into different frequency bands to isolate
different parts of the signal. This is usually done by means of a band filter.
As mentioned above, higher frequencies in the EEG are believed to be associated with
conscious thought. However, the lower frequencies dominate the EEG. Therefore to put a
“microscope” on the higher frequencies in the EEG we isolate them by removing the lower
delta and theta band information. This allows us to examine the higher frequencies without
them being “drowned out” by the more prominent lower frequencies (which have a much larger
amplitude). In this data example, we use a Butterworth filter which removes most of the
low frequency and very high frequency information (by convolving the original signal with a filter,
see Remark 2.6.1). A plot of the periodogram of the original EEG together with the EEG
after processing with a filter is given in Figure 2.17. Except for a few artifacts (since the
Butterworth filter is a finite impulse response filter, and thus only has a finite number of
non-zero coefficients), the filter has completely removed the very low frequency information,
from 0 to 0.2, and the higher frequencies beyond 0.75; we see from the lower plot in Figure
2.17 that this means the focus is on 8-32Hz (Hz = number of cycles per second). We observe
that most of the frequencies in the interval [0.2, 0.75] have been captured with only a slight
amount of distortion. The processed EEG after passing it through the filter is given in
Figure 2.18; this data set corresponds to the red periodogram plot seen in Figure 2.17. The
corresponding processed EEG clearly shows the evidence of pseudo frequencies described in
the section above, and often the aim is to model this processed EEG.
The plot of the original, filtered and the differences in the EEG is given in Figure 2.19.
We see the difference (bottom plot) contains the trend in the original EEG and also the small
very high frequency fluctuations (probably corresponding to the small spike in the original
periodogram in the higher frequencies).
[Figure 2.15: plot of xtoriginal[, 1] against Time.]

Figure 2.15: Original EEG.

Remark 2.6.1 (How filtering works) A linear filter is essentially a linear combination
of the time series with some weights. The weights are moved along the time series. For
example, if {hs} is the filter, then the filtered time series {Yt} is the convolution

$$Y_t = \sum_{s=0}^{\infty} h_s X_{t-s},$$
[Figure 2.16: periodogram fftoriginal[c(1:250)] plotted against frequency and against cycles.]

Figure 2.16: Left: Periodogram of original EEG on [0, 2π]. Right: Periodogram in terms of
cycles per second.

[Figure 2.17: periodogram plotted against frequency and against cycles.]

Figure 2.17: The periodogram of original EEG overlaid with processed EEG (in red). The
same plot is given below, but the x-axis corresponds to cycles per second (measured in Hz).

note that hs can be viewed as a moving window. However, the moving window (filter)
considered in Section ?? is “smooth” and is used to isolate low frequency trend (mean) behaviour,
whereas the general filtering scheme described above can isolate any type of frequency
behaviour. To isolate high frequencies the weights {hs} should not be smooth (should not
change slowly over s). To understand the impact {hs} has on {Xt} we evaluate the Fourier
transform of {Yt}.

[Figure 2.18: plot of xtprocessed[, 1] against Time.]

Figure 2.18: Time series after processing with a Butterworth filter.

[Figure 2.19: plots of xtoriginal[, 1], xtprocessed[, 1] and xtoriginal[, 1] − xtprocessed[, 1] against Time.]

Figure 2.19: Top: Original EEG. Middle: Filtered EEG and Bottom: Difference between
Original and Filtered EEG.

The periodogram of {Yt} is

$$|J_Y(\omega)|^2 = \left|\frac{1}{\sqrt{n}}\sum_{t=1}^{n} Y_t e^{it\omega}\right|^2 \approx \left|\sum_{s=0}^{\infty} h_s e^{is\omega}\right|^2\left|\frac{1}{\sqrt{n}}\sum_{t=1}^{n} X_t e^{it\omega}\right|^2 = |H(\omega)|^2|J_X(\omega)|^2,$$

where $H(\omega) = \sum_{s} h_s e^{is\omega}$ (the approximation ignores boundary effects at the ends of the
observed series). If H(ω) is close to zero at certain frequencies it is removing those frequencies
in {Yt}. Hence using the correct choice of hs we can isolate certain frequency bands.
Note, if a filter is finite (only a finite number of coefficients), then it is impossible to
make the function H(ω) jump abruptly from zero to one. But one can approximate the step by a
smooth function (see https://en.wikipedia.org/wiki/Butterworth_filter).
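The relation between the periodogram of the filtered series and |H(ω)|² can be checked with base R. This is a minimal sketch under assumptions: the weights `h = c(0.25, 0.5, 0.25)` are an arbitrary smooth (low-pass) choice, not the Butterworth filter used for the EEG, and the convolution is taken to be circular so that the identity holds exactly rather than approximately. Note that `fft` uses the opposite sign in the exponent to H(ω) above, which does not affect the modulus for real weights.

```r
set.seed(3)
n <- 64
x <- rnorm(n)
h <- c(0.25, 0.5, 0.25)          # illustrative smooth weights h_0, h_1, h_2 (assumed)

# Circular convolution y[t] = h[1]*x[t] + h[2]*x[t-1] + h[3]*x[t-2]
y <- as.numeric(stats::filter(x, h, method = "convolution", sides = 1, circular = TRUE))

H  <- fft(c(h, rep(0, n - length(h))))   # transfer function on the grid 2*pi*k/n
IY <- Mod(fft(y))^2 / n                  # periodogram of the filtered series
IX <- Mod(fft(x))^2 / n                  # periodogram of the original series

max(abs(IY - Mod(H)^2 * IX))             # numerically zero
gain <- Mod(H)[c(1, n / 2 + 1)]          # |H| at frequency 0 and at frequency pi
gain
```

For these weights the gain is one at frequency zero and zero at frequency π, so the filter keeps slow movements and removes the fastest oscillation, in line with the remark that smooth weights isolate low frequencies.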

Remark 2.6.2 An interesting application of frequency analysis is in the comparison of
people in meditative and non-meditative states (see Gaurav et al. (2019)). A general science video
is given in this link.

2.7 Exercises
Exercise 2.4 (Understanding Fourier transforms) (i) Let Yt = 1. Plot the
Periodogram of {Yt ; t = 1, . . . , 128}.

(ii) Let Yt = 1 + εt , where {εt} are iid standard normal random variables. Plot the
Periodogram of {Yt ; t = 1, . . . , 128}.

(iii) Let Yt = µ(t/128) where µ(u) = 5 × (2u − 2.5u²) + 20. Plot the Periodogram of {Yt ; t =
1, . . . , 128}.

(iv) Let Yt = 2 × sin(2πt/8). Plot the Periodogram of {Yt ; t = 1, . . . , 128}.

(v) Let Yt = 2 × sin(2πt/8) + 4 × cos(2πt/12). Plot the Periodogram of {Yt ; t = 1, . . . , 128}.
You can locate the maximum by using the function which.max

Exercise 2.5 This exercise is aimed at statistics graduate students (or those who have
studied STAT613). If you are not a statistics graduate, then you may want help from a statistics
student.

(i) Let

$$S_n(A, B, \Omega) = \sum_{t=1}^{n} Y_t^2 - 2\sum_{t=1}^{n} Y_t\left(A\cos(\Omega t) + B\sin(\Omega t)\right) + \frac{1}{2}n(A^2 + B^2).$$

Show that

$$2L_n(A, B, \Omega) + S_n(A, B, \Omega) = -\frac{(A^2 - B^2)}{2}\sum_{t=1}^{n}\cos(2t\Omega) - AB\sum_{t=1}^{n}\sin(2t\Omega),$$

and thus $|L_n(A, B, \Omega) + \frac{1}{2}S_n(A, B, \Omega)| = O(1)$ (ie. the difference does not grow with
n).

Since $L_n(A, B, \Omega)$ and $-\frac{1}{2}S_n(A, B, \Omega)$ are asymptotically equivalent, (i) shows that we
can maximise $-\frac{1}{2}S_n(A, B, \Omega)$ instead of the likelihood $L_n(A, B, \Omega)$.

(ii) By profiling out the parameters A and B, use the profile likelihood to show that
$\widehat{\Omega}_n = \arg\max_{\omega}\left|\sum_{t=1}^{n} Y_t\exp(it\omega)\right|^2$.

(iii) By using the identity (which is the one-sided Dirichlet kernel)

$$\sum_{t=1}^{n}\exp(i\Omega t) = \begin{cases} \dfrac{\exp(\frac{1}{2}i(n+1)\Omega)\sin(\frac{1}{2}n\Omega)}{\sin(\frac{1}{2}\Omega)} & 0 < \Omega < 2\pi \\ n & \Omega = 0 \text{ or } 2\pi, \end{cases} \qquad (2.20)$$

we can show that for 0 < Ω < 2π we have

$$\sum_{t=1}^{n} t\cos(\Omega t) = O(n), \qquad \sum_{t=1}^{n} t\sin(\Omega t) = O(n),$$
$$\sum_{t=1}^{n} t^2\cos(\Omega t) = O(n^2), \qquad \sum_{t=1}^{n} t^2\sin(\Omega t) = O(n^2).$$

Using the above identities, show that the Fisher Information of Ln(A, B, Ω) (denoted
as I(A, B, Ω)) is asymptotically equivalent to

$$2I(A, B, \Omega) = \mathrm{E}\left[\frac{\partial^2 S_n}{\partial\omega^2}\right] = \begin{pmatrix} n & 0 & \frac{n^2}{2}B + O(n) \\ 0 & n & -\frac{n^2}{2}A + O(n) \\ \frac{n^2}{2}B + O(n) & -\frac{n^2}{2}A + O(n) & \frac{n^3}{3}(A^2 + B^2) + O(n^2) \end{pmatrix},$$

where the second derivatives are taken with respect to the parameter vector (A, B, Ω).

(iv) Use the Fisher information to show that $|\widehat{\Omega}_n - \Omega| = O(n^{-3/2})$.

Exercise 2.6 (i) Simulate one hundred times from the model Yt = 2 sin(2πt/8) + εt where
t = 1, . . . , n = 60 and εt are iid normal random variables. For each sample, estimate
ω, A and B. You can estimate ω, A and B using both nonlinear least squares and
also the max periodogram approach described in the previous question.

For each simulation study obtain the empirical mean squared error $\frac{1}{100}\sum_{i=1}^{100}(\hat{\theta}_i - \theta)^2$
(where θ denotes the parameter and θ̂i the estimate).

Note that the more times you simulate the more accurate the empirical standard error
will be. The empirical standard error also has an error associated with it, that will be
of order $O(1/\sqrt{\text{number of simulations}})$.

Hint 1: When estimating ω restrict the search to ω ∈ [0, π] (not [0, 2π]). Also when
estimating ω using the max periodogram approach (and A and B) do the search over two
grids (a) ω = [2πj/60, j = 1, . . . , 30] and (b) a finer grid ω = [2πj/600, j = 1, . . . , 300].
Do you see any difference in your estimates of A, B and Ω over the different grids?

Hint 2: What do you think will happen if the model were changed to Yt = 2 sin(2πt/10) +
εt for t = 1, . . . , 60 and the max periodogram approach were used to estimate the
frequency Ω = 2π/20.

(ii) Repeat the above experiment but this time using the sample size n = 300. Compare the
quality/MSE of the estimators of A, B and Ω with those in part (i).

(iii) Do the same as above (using sample size n = 60 and 300) but now use coloured noise
given in (2.19) as the errors. How do your estimates compare with (i) and (ii)?

Hint: A method for simulating dependent data is to use the arima.sim command ar2 =
arima.sim(list(order=c(2,0,0), ar = c(1.5, -0.75)), n=60). This command
simulates an AR(2) time series model Xt = 1.5Xt−1 − 0.75Xt−2 + εt (where εt are iid
normal noise).

R Code

Simulation and periodogram for model (2.18) with iid errors:

temp <- rnorm(128)
signal <- 2*sin(2*pi*c(1:128)/8) + temp # this simulates the series
# Use the command fft to make the periodogram
P <- abs(fft(signal)/128)**2
frequency <- 2*pi*c(0:127)/128
# To plot the series and periodogram
par(mfrow=c(2,1))
plot.ts(signal)
plot(frequency, P, type="o")
# The estimate of the period. The periodogram is symmetric about pi, so we
# search only over entries 2,...,65 (frequencies in (0, pi]); K1 is the
# position of the maximum in the full vector P
K1 <- which.max(P[2:65]) + 1
# Phat is the period estimate
Phat <- 128/(K1-1)
# To obtain a finer resolution, pad signal with zeros
signal2 <- c(signal, rep(0, 128*9))
frequency2 <- 2*pi*c(0:((128*10)-1))/1280
P2 <- abs(fft(signal2))**2
plot(frequency2, P2, type="o")
# To estimate the period we use the maximum of the padded periodogram P2
K2 <- which.max(P2[2:641]) + 1
# Phat2 is the period estimate
Phat2 <- 1280/(K2-1)

Simulation and periodogram for model (2.18) with correlated errors:

set.seed(10)
# AR(2) errors as in (2.19)
ar2 <- arima.sim(list(order=c(2,0,0), ar = c(1.5, -0.75)), n=128)
# Note: the amplitude used here is 1.5, whereas model (2.18) has amplitude 2
signal2 <- 1.5*sin(2*pi*c(1:128)/8) + ar2
# Periodogram of the signal plus correlated noise
P2 <- abs(fft(signal2)/128)**2
frequency <- 2*pi*c(0:127)/128
par(mfrow=c(2,1))
plot.ts(signal2)
plot(frequency, P2, type="o")
Chapter 3

Stationary Time Series

3.1 Preliminaries
The past two chapters focussed on the data. They did not study the properties at the population
level (except for a brief discussion on period estimation). By population level, we mean what
would happen if the sample size is “infinite”. We formally define the tools we will need for such an
analysis below.
Different types of convergence

(i) Almost sure convergence: $X_n \overset{a.s.}{\to} a$ as n → ∞ (in this course a will always be a constant).
This means for every ω ∈ Ω, Xn(ω) → a as n → ∞, where P(Ω) = 1 (this is the classical limit of
a sequence, see Wiki for a definition).

(ii) Convergence in probability: $X_n \overset{P}{\to} a$. This means that for every ε > 0, P(|Xn − a| > ε) → 0
as n → ∞ (see Wiki).

(iii) Convergence in mean square: $X_n \overset{2}{\to} a$. This means E|Xn − a|² → 0 as n → ∞ (see Wiki).

(iv) Convergence in distribution. This means the distribution of Xn converges to the distribution
of X, ie. for all x where FX is continuous, we have Fn(x) → FX(x) as n → ∞ (where Fn and
FX are the distribution functions of Xn and X respectively). This is the simplest definition
(see Wiki).

• Implications:

– (i), (ii) and (iii) imply (iv).

– (i) implies (ii).

– (iii) implies (ii).

• Comments:

– Central limit theorems require (iv).

– It is often easy to show (iii) (since this only requires mean and variance calculations).

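The “Op(·)” notation.

• We use the notation $|\hat{\theta}_n - \theta| = O_p(n^{-1/2})$ if there exists a random variable A (which does
not depend on n) such that $|\hat{\theta}_n - \theta| \leq A n^{-1/2}$.

Example of when you can use $O_p(n^{-1/2})$: if $\mathrm{E}[\hat{\theta}_n] = \theta$ and $\mathrm{var}[\hat{\theta}_n] \leq Cn^{-1}$, then we can say
that $\mathrm{E}|\hat{\theta}_n - \theta| \leq (\mathrm{E}|\hat{\theta}_n - \theta|^2)^{1/2} \leq C^{1/2}n^{-1/2}$ and thus $|\hat{\theta}_n - \theta| = O_p(n^{-1/2})$.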
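The step from a variance bound to the Op statement can be spelled out with Chebyshev's inequality. The short derivation below is a sketch that assumes the estimator is unbiased, $\mathrm{E}[\hat{\theta}_n] = \theta$, with $\mathrm{var}[\hat{\theta}_n] \leq Cn^{-1}$, and uses "bounded in probability" as the meaning of $O_p(1)$; the constant M is introduced only for the argument.

```latex
% For any M > 0, Chebyshev's inequality gives
\[
P\left( n^{1/2}\,|\hat{\theta}_n - \theta| > M \right)
  \;\leq\; \frac{n\,\mathrm{E}|\hat{\theta}_n - \theta|^2}{M^2}
  \;=\; \frac{n\,\mathrm{var}(\hat{\theta}_n)}{M^2}
  \;\leq\; \frac{C}{M^2}.
\]
% The bound does not depend on n and can be made as small as we like by
% choosing M large, so n^{1/2}|\hat{\theta}_n - \theta| is bounded in
% probability, that is, |\hat{\theta}_n - \theta| = O_p(n^{-1/2}).
```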

Definition of expectation

• Suppose X is a random variable with density fX , then

$$\mathrm{E}(X) = \int_{-\infty}^{\infty} x f_X(x)\,dx.$$

If E[Xi] = µ, then the sample mean $\bar{X} = n^{-1}\sum_{i=1}^{n} X_i$ is an (unbiased) estimator of µ
(unbiased because E[X̄] = µ); most estimators will have a bias (but often it is small).

• Suppose (X, Y) is a bivariate random variable with joint density fX,Y , then

$$\mathrm{E}(XY) = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} xy f_{X,Y}(x, y)\,dx\,dy.$$

Definition of covariance

• The covariance is defined as

cov(X, Y ) = E ((X − E(X))(Y − E(Y ))) = E(XY ) − E(X)E(Y ).

• The variance is var(X) = E(X − E(X))² = E(X²) − E(X)².

• Observe var(X) = cov(X, X).
• Rules of covariances. If a, b, c are finite constants and X, Y, Z are random variables with
E(X²) < ∞, E(Y²) < ∞ and E(Z²) < ∞ (which immediately implies their means are finite),
then the covariance satisfies the linearity property

cov(aX + bY + c, Z) = a cov(X, Z) + b cov(Y, Z).

Observe the shift c plays no role in the covariance (since it simply shifts the data).

• The variance of vectors. Suppose that A is a matrix and X a random vector with
variance/covariance matrix Σ. Then

$$\mathrm{var}(AX) = A\,\mathrm{var}(X)A' = A\Sigma A', \qquad (3.1)$$

which can be proved using the linearity property of covariances.

• The correlation between X and Y is

$$\mathrm{cor}(X, Y) = \frac{\mathrm{cov}(X, Y)}{\sqrt{\mathrm{var}(X)\mathrm{var}(Y)}}$$

and lies in [−1, 1]. If var(X) = var(Y) then cor(X, Y) is the coefficient of the best
linear predictor of X given Y and vice versa.
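The linearity rule and the identity (3.1) also hold exactly for sample covariances, which gives a quick numerical check. In the sketch below the simulated data, the constants `a`, `b`, `cc` and the matrix `A` are arbitrary choices made for illustration.

```r
set.seed(4)
n <- 50
X <- rnorm(n)
Y <- X + rnorm(n)
Z <- Y - 0.5 * X + rnorm(n)
a <- 2; b <- -3; cc <- 10              # arbitrary constants; cc plays the role of the shift c

# Linearity: cov(aX + bY + c, Z) = a cov(X, Z) + b cov(Y, Z)
lhs <- cov(a * X + b * Y + cc, Z)
rhs <- a * cov(X, Z) + b * cov(Y, Z)

# var(AX) = A Sigma A' with Sigma replaced by the sample covariance matrix
A <- matrix(c(1, 0,  2,
              0, 1, -1), nrow = 2, byrow = TRUE)
W <- cbind(X, Y, Z)                    # n x 3 data matrix, one observation per row
Sigma_hat <- cov(W)
var_AW    <- cov(W %*% t(A))           # sample covariance of the transformed vectors
var_rule  <- A %*% Sigma_hat %*% t(A)
```

Both pairs (`lhs`, `rhs`) and (`var_AW`, `var_rule`) agree up to rounding error, and the shift `cc` has no effect.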

What is covariance and correlation The covariance and correlation measure the linear dependence
between two random variables. If you plot realisations of the bivariate random variable (X, Y) (X
on x-axis and Y on y-axis), then the line of best fit

$$\widehat{Y} = \beta_0 + \beta_1 X$$

gives the best linear predictor of Y given X. β1 is closely related to the covariance. To see how,
consider the following example. Given the observations {(Xi , Yi ); i = 1, . . . , n} the gradient of the
line of best fit is

$$\widehat{\beta}_1 = \frac{\sum_{i=1}^{n}(X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^{n}(X_i - \bar{X})^2}.$$

As the sample size n → ∞ we recall that

$$\widehat{\beta}_1 \overset{P}{\to} \frac{\mathrm{cov}(X, Y)}{\mathrm{var}(X)} = \beta_1.$$

β1 = 0 if and only if cov(X, Y) = 0. The covariance between two random variables measures the
amount of predictive information (in terms of linear prediction) one variable contains about the
other. The coefficients in a regression are not symmetric i.e. PX(Y) = β1 X, whereas PY(X) = γ1 Y
and in general β1 ≠ γ1. The correlation

$$\mathrm{cor}(X, Y) = \frac{\mathrm{cov}(X, Y)}{\sqrt{\mathrm{var}(X)\mathrm{var}(Y)}}$$

is a symmetric measure of dependence between the two variables.
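The relation between the regression slopes, the covariance and the correlation can be seen on simulated data. This is an illustrative sketch only: the data generating model and the names `beta1` and `gamma1` are assumptions made here, with `beta1` the slope when predicting Y from X and `gamma1` the slope when predicting X from Y.

```r
set.seed(5)
n <- 200
X <- rnorm(n, sd = 2)
Y <- 1 + 0.5 * X + rnorm(n)

beta1  <- unname(coef(lm(Y ~ X))[2])   # slope of the line of best fit of Y on X
gamma1 <- unname(coef(lm(X ~ Y))[2])   # slope of the line of best fit of X on Y

# The slopes are sample covariances scaled by the variance of the predictor
c(beta1,  cov(X, Y) / var(X))
c(gamma1, cov(X, Y) / var(Y))

# The two slopes differ in general, but their product is the squared correlation
c(beta1 * gamma1, cor(X, Y)^2)
```

The asymmetry of the two slopes, against the symmetry of the correlation, is exactly the point made in the paragraph above.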

Exercise 3.1 (Covariance calculations practice) Suppose {εt} are uncorrelated random
variables with E[εt] = 0 and E[εt²] = σ².

• Let Xt = εt + 0.5εt−1 . Evaluate cov(Xt , Xt+r ) for r = 0, ±1, ±2, ±3, ±4, ±5.

• Let $X_t = \sum_{j=0}^{\infty}\rho^j\varepsilon_{t-j}$ where |ρ| < 1. Evaluate cov(Xt , Xt+r ) for r ∈ Z (0, ±1, ±2, ±3, ±4, . . .).

Cumulants: A measure of higher order dependence The covariance has a very simple geometric
interpretation. But it only measures linear dependence. In time series and many applications in
signal processing, more general measures of dependence are needed. These are called cumulants
and can simultaneously measure dependence between several variables or variables with themselves.
They generalize the notion of a covariance, but as far as I am aware don’t have the nice geometric
interpretation that a covariance has.

3.1.1 Formal definition of a time series


When we observe the time series {xt}, usually we assume that {xt} is a realisation from a random
process {Xt}. We formalise this notion below. The random process {Xt ; t ∈ Z} (where Z denotes
the integers) is defined on the probability space (Ω, F, P). We explain what these mean below:

(i) Ω is the set of all possible outcomes. Suppose that ω ∈ Ω, then {Xt(ω)} is one realisation
from the random process. For any given ω, {Xt(ω)} is not random. In time series we will
usually assume that what we observe xt = Xt(ω) (for some ω) is a typical realisation. That
is, for any other ω∗ ∈ Ω, Xt(ω∗) will be different, but its general or overall characteristics
will be similar.

(ii) F is known as a sigma algebra. It is a set of subsets of Ω (though not necessarily the set of
all subsets, as this can be too large). But it consists of all sets for which a probability can
be assigned. That is if A ∈ F, then a probability is assigned to the set A.

(iii) P is the probability measure over the sigma-algebra F. For every set A ∈ F we can define a
probability P(A).

There are strange cases, where there is a subset A of Ω, which is not in the sigma-algebra F,
where P(A) is not defined (these are called non-measurable sets). In this course, we do not have
to worry about these cases.

This is a very general definition. But it is too general for modelling. Below we define the notion
of stationarity and weak dependence, that allows for estimators to have a meaningful interpretation.

3.2 The sample mean and its standard error


We start with the simplest case, estimating the mean when the data is dependent. This is usually
estimated with the sample mean. However, for the sample mean to be estimating something
reasonable we require a very weak form of stationarity. That is the time series has the same mean
for all t i.e.

$$X_t = \underbrace{\mu}_{=\mathrm{E}(X_t)} + \underbrace{(X_t - \mu)}_{=\varepsilon_t},$$

where µ = E(Xt) for all t. This is analogous to saying that the independent random variables {Xt}
all have a common mean. Under this assumption X̄ is an unbiased estimator of µ. Next, our aim
is to obtain conditions under which X̄ is a “reasonable” estimator of the mean.
Based on just one realisation of a time series we want to make inference about the parameters
associated with the process {Xt}, such as the mean. We recall that in classical statistics we usually
assume we observe several independent realisations, {Xt}, all with the same distribution, and use
$\bar{X} = \frac{1}{n}\sum_{t=1}^{n} X_t$ to estimate the mean. Roughly speaking, with several independent realisations we
are able to sample over the entire probability space and thus obtain a “good” (meaning consistent
or close to true mean) estimator of the mean. On the other hand, if the samples were highly
dependent, then it is likely that {Xt} is concentrated over a small part of the probability space. In
this case, the sample mean will not converge to the mean (be close to the true mean) as the sample
size grows.

The mean squared error: a measure of closeness One classical measure of closeness between an
estimator and a parameter is the mean squared error

$$\mathrm{E}\left[\hat{\theta}_n - \theta\right]^2 = \mathrm{var}(\hat{\theta}_n) + \left[\mathrm{E}(\hat{\theta}_n) - \theta\right]^2.$$

If the estimator is an unbiased estimator of θ then

$$\mathrm{E}\left[\hat{\theta}_n - \theta\right]^2 = \mathrm{var}(\hat{\theta}_n).$$

Returning to the sample mean example, suppose that {Xt} is a time series where E[Xt] = µ for all
t. Then it is clear that the sample mean X̄n is an unbiased estimator of µ and

$$\mathrm{E}\left[\bar{X}_n - \mu\right]^2 = \mathrm{var}(\bar{X}_n).$$

To see whether it converges in mean square to µ we evaluate its


 
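$$\mathrm{var}(\bar{X}) = n^{-2}\,(1,\ldots,1)\,\underbrace{\mathrm{var}(\underline{X}_n)}_{\text{matrix, }\Sigma}\begin{pmatrix} 1 \\ 1 \\ \vdots \\ 1 \end{pmatrix},$$

where

$$\mathrm{var}(\underline{X}_n) = \begin{pmatrix}
\mathrm{cov}(X_1,X_1) & \mathrm{cov}(X_1,X_2) & \mathrm{cov}(X_1,X_3) & \ldots & \mathrm{cov}(X_1,X_n) \\
\mathrm{cov}(X_2,X_1) & \mathrm{cov}(X_2,X_2) & \mathrm{cov}(X_2,X_3) & \ldots & \mathrm{cov}(X_2,X_n) \\
\mathrm{cov}(X_3,X_1) & \mathrm{cov}(X_3,X_2) & \mathrm{cov}(X_3,X_3) & \ldots & \mathrm{cov}(X_3,X_n) \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
\mathrm{cov}(X_n,X_1) & \mathrm{cov}(X_n,X_2) & \ldots & \ldots & \mathrm{cov}(X_n,X_n)
\end{pmatrix}.$$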
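Thus

$$\begin{aligned}
\mathrm{var}(\bar{X}) &= \frac{1}{n^2}\sum_{t,\tau=1}^{n}\mathrm{cov}(X_t,X_\tau) = \frac{1}{n^2}\sum_{t=1}^{n}\mathrm{var}(X_t) + \frac{2}{n^2}\sum_{t=1}^{n}\sum_{\tau=t+1}^{n}\mathrm{cov}(X_t,X_\tau) \\
&= \frac{1}{n^2}\sum_{t=1}^{n}\mathrm{var}(X_t) + \frac{2}{n^2}\sum_{r=1}^{n-1}\sum_{t=1}^{n-|r|}\mathrm{cov}(X_t,X_{t+r}). \qquad (3.2)
\end{aligned}$$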
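The identity between the matrix form of var(X̄) and the lag expansion (3.2) can be checked numerically. The following is a minimal sketch in R; the covariance matrix `Sigma` is a hypothetical example (built as `A %*% t(A)` purely so that it is symmetric and positive semi-definite) and is not taken from any data set in these notes.

```r
set.seed(1)
n <- 6
# Hypothetical covariance matrix: A %*% t(A) is symmetric and positive semi-definite.
A <- matrix(rnorm(n * n), n, n)
Sigma <- A %*% t(A)

ones <- rep(1, n)
# Matrix form: n^{-2} (1,...,1) Sigma (1,...,1)'
v_matrix <- drop(t(ones) %*% Sigma %*% ones) / n^2

# Sum form (3.2): variances plus twice the covariances at lags r = 1,...,n-1
v_diag <- sum(diag(Sigma)) / n^2
v_lags <- 0
for (r in 1:(n - 1)) {
  for (tt in 1:(n - r)) {
    v_lags <- v_lags + Sigma[tt, tt + r]
  }
}
v_sum <- v_diag + 2 * v_lags / n^2

print(c(matrix_form = v_matrix, sum_form = v_sum))
```

The two printed values should agree up to rounding error, since the double sum over lags simply collects the entries above the diagonal of the covariance matrix.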

A typical time series is a halfway house between “fully” dependent data and independent data.
Unlike classical statistics, in time series, parameter estimation is based on only one realisation
$x_t = X_t(\omega)$ (not multiple, independent, replications). Therefore, it would appear impossible to
obtain a good estimator of the mean. However, good estimators of the mean are still possible,
based on just one realisation of the time series, so long as certain assumptions are satisfied: (i) the
process has a constant mean (a type of stationarity) and (ii) despite the fact that each time series is
generated from one realisation there is ‘short’ memory in the observations. That is, what is observed
today, $x_t$, has little influence on observations in the future, $x_{t+k}$ (when $k$ is relatively large). Hence,
even though we observe one trajectory, that trajectory traverses much of the probability space.
The amount of dependency in the time series determines the ‘quality’ of the estimator. There are
several ways to measure the dependency. The most common is the measure of linear
dependency, known as the covariance. Formally, the covariance in the stochastic process $\{X_t\}$ is
defined as

cov(Xt , Xt+k ) = E [(Xt − E(Xt )) (Xt+k − E (Xt+k ))] = E(Xt Xt+k ) − E(Xt )E(Xt+k ).

Note that if {Xt } has zero mean, then the above reduces to cov(Xt , Xt+k ) = E(Xt Xt+k ).

Remark 3.2.1 (Covariance in a time series) To illustrate the covariance within a time series
setting, we generate the time series

$$X_t = 1.8\cos\left(\frac{2\pi}{5}\right)X_{t-1} - 0.9^2 X_{t-2} + \varepsilon_t \qquad (3.3)$$

for $t = 1,\ldots,n$. A scatter plot of $X_t$ against $X_{t+r}$ for $r = 1,\ldots,4$ and $n = 200$ is given in Figure
3.1. The corresponding sample autocorrelation (ACF) plot (as defined in equation (3.7)) is given in
Figure 3.2. Focus on the lags $r = 1,\ldots,4$ in the ACF plot. Observe that they match what is seen
in the scatter plots.
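Plots of this kind can be reproduced with a short simulation. The sketch below is only illustrative: it assumes the lag-one coefficient in (3.3) is 1.8 cos(2π/5) and that the innovations are standard normal (the text does not state their distribution), and the seed and the variable names are arbitrary choices.

```r
set.seed(1)
n <- 200
# Assumed AR(2) coefficients for model (3.3)
phi <- c(1.8 * cos(2 * pi / 5), -0.9^2)

# The roots of 1 - phi_1 z - phi_2 z^2 lie outside the unit circle,
# so arima.sim accepts the model
root_moduli <- Mod(polyroot(c(1, -phi)))

x <- as.numeric(arima.sim(list(ar = phi), n = n))

# Scatter plots of X_t (lag0) against X_{t+r} (lag r) for r = 1,...,4
par(mfrow = c(2, 2))
for (r in 1:4) {
  plot(x[1:(n - r)], x[(1 + r):n], xlab = "lag0", ylab = paste0("lag", r))
}

# Sample autocorrelation plot
par(mfrow = c(1, 1))
acf(x, lag.max = 20)
```

A different seed gives a different realisation, so the plots will resemble, but not exactly match, those in the figures.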

Figure 3.1: From model (3.3). Plot of Xt against Xt+r for r = 1, . . . , 4 (horizontal axis: lag0;
vertical axes: lag1, lag2, lag3 and lag4). Top left: r = 1. Top right: r = 2. Bottom left: r = 3.
Bottom right: r = 4.

Figure 3.2: ACF plot of realisation from model (3.3) (horizontal axis: Lag, 0 to 20; vertical axis: ACF).

Using the expression in (3.2) we can deduce under what conditions on the time series we can
obtain a reasonable estimator of the mean. If the covariance structure decays at such a rate that
the sum over all lags is finite, that is

$$\sup_t \sum_{r=-\infty}^{\infty}|\mathrm{cov}(X_t,X_{t+r})| < \infty$$

(often called short memory), then the variance is

$$\begin{aligned}
\mathrm{var}(\bar{X}) &\le \frac{1}{n^2}\sum_{t=1}^{n}\mathrm{var}(X_t) + \frac{2}{n^2}\sum_{r=1}^{n-1}\sum_{t=1}^{n-|r|}|\mathrm{cov}(X_t,X_{t+r})| \\
&\le \frac{1}{n^2}\sum_{t=1}^{n}\mathrm{var}(X_t) + \frac{2}{n^2}\sum_{t=1}^{n-1}\underbrace{\sum_{r=1}^{\infty}|\mathrm{cov}(X_t,X_{t+r})|}_{\text{finite for all } t \text{ and } n} \le Cn^{-1} = O(n^{-1}). \qquad (3.4)
\end{aligned}$$

This rate of convergence is the same as if $\{X_t\}$ were iid/uncorrelated data. However, if the
correlations are positive, the variance will be larger than in the case where $\{X_t\}$ are uncorrelated.
However, even with this assumption we need to be able to estimate $\mathrm{var}(\bar{X})$ in order to
test/construct CIs for $\mu$. Usually this requires the stronger assumption of stationarity, which we define
in Section 3.3.

Remark 3.2.2 It is worth bearing in mind that the covariance only measures linear dependence.
For some statistical analysis, such as deriving an expression for the variance of an estimator, the
covariance is often sufficient as a measure. However, given cov(Xt , Xt+k ) we cannot say anything
about cov(g(Xt ), g(Xt+k )), where g is a nonlinear function. There are occasions where we require
a more general measure of dependence (for example, to show asymptotic normality). Examples of
more general measures include mixing (and other related notions, such as mixingales, near-epoch
dependence, approximate m-dependence, physical dependence, weak dependence), first introduced by
Rosenblatt in the 50s (Rosenblatt and Grenander (1997)). In this course we will not cover mixing.

3.2.1 The variance of the estimated regressors in a linear regression model with correlated errors

Let us return to the parametric models discussed in Section 2.1. The general model is

$$Y_t = \beta_0 + \sum_{j=1}^{p}\beta_j u_{t,j} + \varepsilon_t = \beta' u_t + \varepsilon_t,$$

where $E[\varepsilon_t] = 0$ and we will assume that $\{u_{t,j}\}$ are nonrandom regressors. Note this includes the
parametric trend models discussed in Section 2.1. We use least squares to estimate $\beta$

$$L_n(\beta) = \sum_{t=1}^{n}(Y_t - \beta' u_t)^2,$$

with

$$\hat{\beta}_n = \arg\min L_n(\beta).$$

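Using that

$$\nabla_\beta L_n(\beta) = \frac{\partial L_n(\beta)}{\partial\beta} = \begin{pmatrix} \frac{\partial L_n(\beta)}{\partial\beta_1} \\ \frac{\partial L_n(\beta)}{\partial\beta_2} \\ \vdots \\ \frac{\partial L_n(\beta)}{\partial\beta_p} \end{pmatrix} = -2\sum_{t=1}^{n}(Y_t - \beta' u_t)u_t,$$

we have

$$\hat{\beta}_n = \arg\min L_n(\beta) = \left(\sum_{t=1}^{n}u_t u_t'\right)^{-1}\sum_{t=1}^{n}Y_t u_t,$$

since we solve $\frac{\partial L_n(\hat{\beta}_n)}{\partial\beta} = 0$. To evaluate the variance of $\hat{\beta}_n$ we can either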

• Directly evaluate the variance of $\hat{\beta}_n = \left(\sum_{t=1}^{n}u_t u_t'\right)^{-1}\sum_{t=1}^{n}Y_t u_t$. But this is very special for
linear least squares.

• Or use an expansion of $\frac{\partial L_n(\beta)}{\partial\beta}$, which is a little longer but generalizes to more complicated
estimators and criteria.

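We will derive an expression for $\hat{\beta}_n - \beta$. By using $\frac{\partial L_n(\beta)}{\partial\beta}$ we can show

$$\begin{aligned}
\frac{\partial L_n(\hat{\beta}_n)}{\partial\beta} - \frac{\partial L_n(\beta)}{\partial\beta} &= -2\sum_{t=1}^{n}(Y_t - \hat{\beta}_n' u_t)u_t + 2\sum_{t=1}^{n}(Y_t - \beta' u_t)u_t \\
&= 2\left(\sum_{t=1}^{n}u_t u_t'\right)\left[\hat{\beta}_n - \beta\right]. \qquad (3.5)
\end{aligned}$$

On the other hand, because $\frac{\partial L_n(\hat{\beta}_n)}{\partial\beta} = 0$ we have

$$\begin{aligned}
\frac{\partial L_n(\hat{\beta}_n)}{\partial\beta} - \frac{\partial L_n(\beta)}{\partial\beta} &= -\frac{\partial L_n(\beta)}{\partial\beta} \\
&= 2\sum_{t=1}^{n}\underbrace{[Y_t - \beta' u_t]}_{\varepsilon_t}u_t = 2\sum_{t=1}^{n}u_t\varepsilon_t. \qquad (3.6)
\end{aligned}$$

Equating (3.5) and (3.6) gives

$$\begin{aligned}
&\left(\sum_{t=1}^{n}u_t u_t'\right)\left[\hat{\beta}_n - \beta\right] = \sum_{t=1}^{n}u_t\varepsilon_t \\
\Rightarrow\ &\left[\hat{\beta}_n - \beta\right] = \left(\sum_{t=1}^{n}u_t u_t'\right)^{-1}\sum_{t=1}^{n}u_t\varepsilon_t = \left(\frac{1}{n}\sum_{t=1}^{n}u_t u_t'\right)^{-1}\frac{1}{n}\sum_{t=1}^{n}u_t\varepsilon_t.
\end{aligned}$$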

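Using this expression we can see that

$$\mathrm{var}\left[\hat{\beta}_n - \beta\right] = \left(\frac{1}{n}\sum_{t=1}^{n}u_t u_t'\right)^{-1}\mathrm{var}\left(\frac{1}{n}\sum_{t=1}^{n}u_t\varepsilon_t\right)\left(\frac{1}{n}\sum_{t=1}^{n}u_t u_t'\right)^{-1}.$$

Finally we need only evaluate $\mathrm{var}\left(\frac{1}{n}\sum_{t=1}^{n}u_t\varepsilon_t\right)$, which is

$$\begin{aligned}
\mathrm{var}\left(\frac{1}{n}\sum_{t=1}^{n}u_t\varepsilon_t\right) &= \frac{1}{n^2}\sum_{t,\tau=1}^{n}\mathrm{cov}[\varepsilon_t,\varepsilon_\tau]u_t u_\tau' \\
&= \underbrace{\frac{1}{n^2}\sum_{t=1}^{n}\mathrm{var}[\varepsilon_t]u_t u_t'}_{\text{expression if independent}} + \underbrace{\frac{1}{n^2}\sum_{t=1}^{n}\sum_{\tau\neq t}\mathrm{cov}[\varepsilon_t,\varepsilon_\tau]u_t u_\tau'}_{\text{additional term due to correlation in the errors}}.
\end{aligned}$$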

This expression is analogous to the expression for the variance of the sample mean in (3.2) (make
a comparison of the two).
Under the assumption that $\frac{1}{n}\sum_{t=1}^{n}u_t u_t'$ is non-singular, $\sup_t \|u_t\|_1 < \infty$ and
$\sup_t \sum_{\tau=-\infty}^{\infty}|\mathrm{cov}(\varepsilon_t,\varepsilon_\tau)| < \infty$, we can see that $\mathrm{var}\left[\hat{\beta}_n - \beta\right] = O(n^{-1})$.
Estimation of the variance of $\hat{\beta}_n$ is important and requires one to estimate
$\mathrm{var}\left(\frac{1}{n}\sum_{t=1}^{n}u_t\varepsilon_t\right)$. This is often done using the HAC estimator. We describe
how this is done in Section 8.5.
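To see how much the correlation term can matter, the variance expression above can be evaluated for a known error covariance. The sketch below is a hypothetical illustration, not an estimator: it assumes regressors u_t = (1, t/n)' and AR(1)-type errors with cov(ε_t, ε_τ) = φ^{|t−τ|}/(1 − φ²), with φ = 0.6 chosen arbitrarily. In practice the error covariance is unknown and has to be estimated (for example by the HAC estimator mentioned above).

```r
n <- 100
phi <- 0.6  # hypothetical AR(1) parameter for the errors

# Nonrandom regressors u_t = (1, t/n)': intercept plus linear trend
U <- cbind(1, (1:n) / n)

# Error covariance matrix with cov(eps_t, eps_tau) = phi^|t - tau| / (1 - phi^2)
Sigma_eps <- phi^abs(outer(1:n, 1:n, "-")) / (1 - phi^2)

# (1/n) sum_t u_t u_t'
bread <- crossprod(U) / n
# var((1/n) sum_t u_t eps_t) = (1/n^2) sum_{t,tau} cov(eps_t, eps_tau) u_t u_tau'
meat <- t(U) %*% Sigma_eps %*% U / n^2

V_correlated <- solve(bread) %*% meat %*% solve(bread)

# The same formula when only the variance terms (t = tau) are kept
meat_indep <- t(U) %*% diag(diag(Sigma_eps)) %*% U / n^2
V_indep <- solve(bread) %*% meat_indep %*% solve(bread)

# Ratio of the variances of the two coefficient estimators
print(diag(V_correlated) / diag(V_indep))
```

With positively correlated errors, as assumed here, ignoring the second term understates the variance of both coefficient estimators.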

3.3 Stationary processes


We have established that one of the main features that distinguish time series analysis from classical
methods is that observations taken over time (a time series) can be dependent and this dependency
tends to decline the further apart in time two observations are. However, to do any sort of analysis
of this time series we have to assume some sort of invariance in the time series, for example the mean
or variance of the time series does not change over time. If the marginal distributions of the time
series were totally different no sort of inference would be possible (suppose in classical statistics you
were given independent random variables all with different distributions; what parameter would
you be estimating? It is not possible to estimate anything!).
The typical assumption that is made is that a time series is stationary. Stationarity is a rather
intuitive concept: it is an invariance property which means that statistical characteristics of the time
series do not change over time. For example, the yearly rainfall may vary year by year, but the
average rainfall in two equal length time intervals will be roughly the same, as would the number of
times the rainfall exceeds a certain threshold. Of course, over long periods of time this assumption
may not be so plausible. For example, the climate change that we are currently experiencing is
causing changes in the overall weather patterns (we will consider nonstationary time series towards
the end of this course). However, in many situations, including short time intervals, the assumption
of stationarity is quite plausible. Indeed, often the statistical analysis of a time series is done
under the assumption that a time series is stationary.

3.3.1 Types of stationarity


There are two definitions of stationarity, weak stationarity which only concerns the covariance of a
process and strict stationarity which is a much stronger condition and supposes the distributions
are invariant over time.

Definition 3.3.1 (Strict stationarity) The time series {Xt } is said to be strictly stationary
if for any finite sequence of integers t1 , . . . , tk and shift h the distribution of (Xt1 , . . . , Xtk ) and
(Xt1 +h , . . . , Xtk +h ) are the same.

The above assumption is often considered to be rather strong (and given a data set it is very
hard to check). Often it is possible to work under a weaker assumption called weak/second order
stationarity.

Definition 3.3.2 (Second order stationarity/weak stationarity) The time series {Xt } is said
to be second order stationary if the mean is constant for all t and if for any t and k the covariance
between Xt and Xt+k only depends on the lag difference k. In other words there exists a function
c : Z → R such that for all t and k we have

c(k) = cov(Xt , Xt+k ).

Remark 3.3.1 (Strict and second order stationarity) (i) If a process is strictly stationary
and $E|X_t^2| < \infty$, then it is also second order stationary. But the converse is not necessarily
true. To show that strict stationarity (with $E|X_t^2| < \infty$) implies second order stationarity,
suppose that $\{X_t\}$ is a strictly stationary process, then

$$\begin{aligned}
\mathrm{cov}(X_t,X_{t+k}) &= E(X_t X_{t+k}) - E(X_t)E(X_{t+k}) \\
&= \int xy\left[P_{X_t,X_{t+k}}(dx,dy) - P_{X_t}(dx)P_{X_{t+k}}(dy)\right] \\
&= \int xy\left[P_{X_0,X_k}(dx,dy) - P_{X_0}(dx)P_{X_k}(dy)\right] = \mathrm{cov}(X_0,X_k),
\end{aligned}$$

where $P_{X_t,X_{t+k}}$ and $P_{X_t}$ are the joint distribution and marginal distribution of $X_t, X_{t+k}$
respectively. The above shows that $\mathrm{cov}(X_t,X_{t+k})$ does not depend on $t$ and $\{X_t\}$ is second order
stationary.

(ii) If a process is strictly stationary but the second moment is not finite, then it is not second
order stationary.

(iii) It should be noted that a weakly stationary Gaussian time series is also strictly stationary
(this is the only case where weakly stationary implies strictly stationary).

Example 3.3.1 (The sample mean and its variance under second order stationarity) Returning
to the variance of the sample mean discussed in (3.2), if a time series is second order stationary, then
the sample mean $\bar{X}$ is estimating the mean $\mu$ and the variance of $\bar{X}$ is

$$\begin{aligned}
\mathrm{var}(\bar{X}) &= \frac{1}{n^2}\sum_{t=1}^{n}\underbrace{\mathrm{var}(X_t)}_{=c(0)} + \frac{2}{n^2}\sum_{r=1}^{n-1}\sum_{t=1}^{n-r}\underbrace{\mathrm{cov}(X_t,X_{t+r})}_{=c(r)} \\
&= \frac{1}{n}c(0) + \frac{2}{n}\sum_{r=1}^{n}\underbrace{\left(\frac{n-r}{n}\right)}_{=1-r/n}c(r),
\end{aligned}$$

where we note that the above is based on the expansion in (3.2). We approximate the above by using that
the covariances are absolutely summable, $\sum_r |c(r)| < \infty$. For all $r$, $(1-r/n)c(r) \to c(r)$ and
$\left|\sum_{r=1}^{n}(1-|r|/n)c(r)\right| \le \sum_r |c(r)|$, thus by dominated convergence (see Appendix A)
$\sum_{r=1}^{n}(1-r/n)c(r) \to \sum_{r=1}^{\infty}c(r)$. This implies that

$$\mathrm{var}(\bar{X}) \approx \frac{1}{n}c(0) + \frac{2}{n}\sum_{r=1}^{\infty}c(r) = \frac{1}{n}\sum_{r=-\infty}^{\infty}c(r) = O\left(\frac{1}{n}\right).$$

The sum $\sum_{r=-\infty}^{\infty}c(r)$ is often called the long run (or long term) variance. The above implies that

$$E(\bar{X} - \mu)^2 = \mathrm{var}(\bar{X}) \to 0, \qquad n \to \infty,$$

which we recall is convergence in mean square. This immediately implies convergence in probability,
$\bar{X} \overset{P}{\to} \mu$.

The example above illustrates how second order stationarity gives an elegant expression for the
variance and can be used to estimate the standard error associated with X̄.
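The quality of the long run variance approximation can be examined by simulation. The sketch below assumes a stationary AR(1) model X_t = φX_{t−1} + ε_t with φ = 0.5 and standard normal innovations, for which c(r) = φ^{|r|}/(1 − φ²) and the sum of c(r) over all lags is 1/(1 − φ)². The model, the sample size and the number of replications are illustrative assumptions, not choices made in the text.

```r
set.seed(123)
n <- 500
phi <- 0.5     # hypothetical AR(1) coefficient
n_rep <- 2000  # number of Monte Carlo replications

# Autocovariance of a stationary AR(1) with unit innovation variance
acov <- function(r) phi^abs(r) / (1 - phi^2)

# Finite-sample formula: c(0)/n + (2/n) sum_{r=1}^{n} (1 - r/n) c(r)
r <- 1:n
v_exact <- acov(0) / n + (2 / n) * sum((1 - r / n) * acov(r))

# Long run variance approximation: (1/n) sum_{r=-inf}^{inf} c(r) = 1 / (n (1 - phi)^2)
v_longrun <- 1 / (n * (1 - phi)^2)

# Monte Carlo estimate of var(X-bar) over independent realisations
xbar <- replicate(n_rep, mean(arima.sim(list(ar = phi), n = n)))
v_mc <- var(xbar)

print(c(exact = v_exact, long_run = v_longrun, monte_carlo = v_mc))
```

The finite-sample formula and the long run approximation should be close for this n, with the Monte Carlo value agreeing up to simulation error.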

Example 3.3.2 In Chapter 8 we consider estimation of the autocovariance function. However, for
now we rely on the R command acf. For the curious, it evaluates $\hat{\rho}(r) = \hat{c}(r)/\hat{c}(0)$, where

$$\hat{c}(r) = \frac{1}{n}\sum_{t=1}^{n-r}(X_t - \bar{X})(X_{t+r} - \bar{X}) \qquad (3.7)$$

for $r = 1,\ldots,m$ ($m$ is some value that R defines; you can change the maximum number of lags
by using acf(data, lag.max = 30), say). Observe that even if $X_t = \mu_t$ (nonconstant mean), from the
way $\hat{c}(r)$ (a sum of $(n-r)$ terms) is defined, $\hat{\rho}(r)$ will decay to zero as $r \to n$.
In Figure 3.3 we give the sample acf plots of the Southern Oscillation Index and the Sunspot
data. We observe that they are very different. The acf of the SOI decays rapidly, but there does appear
to be some sort of ‘pattern’ in the correlations. On the other hand, there is more “persistence” in
the acf of the Sunspot data. The correlations of the acf appear to decay but over a longer period of
time and there is a clear periodicity.
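Formula (3.7) can be compared directly with the output of acf. The sketch below uses a simulated series (an AR(1) with coefficient 0.5, chosen only for illustration) in place of real data; `chat` is a hypothetical helper name.

```r
set.seed(2)
x <- as.numeric(arima.sim(list(ar = 0.5), n = 300))
n <- length(x)
xbar <- mean(x)

# Sample autocovariance (3.7): note the divisor n, not n - r
chat <- function(r) sum((x[1:(n - r)] - xbar) * (x[(1 + r):n] - xbar)) / n

# Sample autocorrelations at lags 0,...,10 computed by hand
rho_manual <- sapply(0:10, function(r) chat(r) / chat(0))

# The same quantities from acf (plot = FALSE returns the values without plotting)
rho_acf <- drop(acf(x, lag.max = 10, plot = FALSE)$acf)

print(round(cbind(lag = 0:10, manual = rho_manual, acf = rho_acf), 4))
```

The two columns should coincide, which confirms that acf centres the series at the sample mean and divides by n at every lag.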

Exercise 3.2 State, with explanation, which of the following time series are second order stationary,
which are strictly stationary and which are both.

(i) {εt } are iid random variables with mean zero and variance one.

(ii) {εt } are iid random variables from a Cauchy distribution.

(iii) Xt+1 = Xt + εt , where {εt } are iid random variables with mean zero and variance one.

(iv) Xt = Y where Y is a random variable with mean zero and variance one.

(v) Xt = Ut + Ut−1 + Vt , where {(Ut , Vt )} is a strictly stationary vector time series with E[Ut2 ] < ∞
and E[Vt2 ] < ∞.

Figure 3.3: Top: ACF of Southern Oscillation data (series soi; lags 0 to 300). Bottom: ACF plot of
Sunspot data (series sunspot; lags 0 to 60).

Exercise 3.3 (i) Make an ACF plot of the monthly temperature data from 1996-2014.

(ii) Make an ACF plot of the yearly temperature data from 1880-2013.

(iii) Make an ACF plot of the residuals (after fitting a line through the data (using the command
lsfit(..)$res)) of the yearly temperature data from 1880-2013.
Briefly describe what you see.

Exercise 3.4 (i) Suppose that $\{X_t\}_t$ is a strictly stationary time series. Let

$$Y_t = \frac{1}{1 + X_t^2}.$$

Show that $\{Y_t\}$ is a second order stationary time series.

(ii) Obtain an approximate expression for the variance of the sample mean of $\{Y_t\}$ in terms of its
long run variance (stating the sufficient assumptions for the long run variance to be finite).
You do not need to give an analytic expression for the autocovariance; there is not enough
information in the question to do this.

(iii) Possibly challenging question. Suppose that

$$Y_t = g(\theta_0, t) + \varepsilon_t,$$

where $\{\varepsilon_t\}$ are iid random variables, $g(\theta_0, t)$ is a deterministic mean and $\theta_0$ is an unknown
parameter. Let

$$\hat{\theta}_n = \arg\min_{\theta\in\Theta}\sum_{t=1}^{n}(Y_t - g(\theta, t))^2.$$

Explain why the quantity

$$\hat{\theta}_n - \theta_0$$

can be expressed, approximately, as a sample mean. You can use approximations and heuristics
here.

Hint: Think derivatives and mean value theorems.

Ergodicity (Advanced)

We now motivate the concept of ergodicity. Conceptually, this is more difficult to understand
than the mean and variance. But it is a very helpful tool when analysing estimators. It allows one
to simply replace the sample mean by its expectation without the need to evaluate a variance,
which is extremely useful in some situations.
It can be difficult to evaluate the mean and variance of an estimator. Therefore, we may want
an alternative form of convergence (instead of the mean squared error). To see whether this is
possible we recall that for iid random variables we have the very useful law of large numbers

$$\frac{1}{n}\sum_{t=1}^{n}X_t \overset{a.s.}{\to} \mu$$

and in general $\frac{1}{n}\sum_{t=1}^{n}g(X_t) \overset{a.s.}{\to} E[g(X_0)]$ (if $E|g(X_0)| < \infty$). Does such a result exist in time
series? It does, but we require that the time series is ergodic (which
is a slightly stronger condition than strict stationarity).

Definition 3.3.3 (Ergodicity: Formal definition) Let $(\Omega, \mathcal{F}, P)$ be a probability space. A
transformation $T : \Omega \to \Omega$ is said to be measure preserving if for every set $A \in \mathcal{F}$, $P(T^{-1}A) = P(A)$.
Moreover, it is said to be an ergodic transformation if $T^{-1}A = A$ implies that $P(A) = 0$ or $1$.
It is not obvious what this has to do with stochastic processes, but we attempt to make a link. Let
us suppose that $X = \{X_t\}$ is a strictly stationary process defined on the probability space $(\Omega, \mathcal{F}, P)$.
By strict stationarity the transformation (shifting a sequence by one)

$$T(x_1, x_2, \ldots) = (x_2, x_3, \ldots)$$

is a measure preserving transformation. To understand ergodicity we define the set $A$, where

$$A = \{\omega : (X_1(\omega), X_0(\omega), \ldots) \in H\} = \{\omega : (X_{-1}(\omega), X_{-2}(\omega), \ldots) \in H\}.$$

The stochastic process is said to be ergodic if the only sets which satisfy the above are such that
$P(A) = 0$ or $1$. Roughly, this means there cannot be too many outcomes $\omega$ which generate sequences
which ‘repeat’ themselves (are periodic in some sense). An equivalent definition is given in (3.8). From
this definition it can be seen why “repeats” are a bad idea. If a sequence repeats, the time average
is unlikely to converge to the mean.
See Billingsley (1994), page 312-314, for examples and a better explanation.

The definition of ergodicity, given above, is quite complex and is rarely used in time series analysis.
However, one consequence of ergodicity is the ergodic theorem, which is extremely useful in time
series. It states that if $\{X_t\}$ is an ergodic stochastic process then

$$\frac{1}{n}\sum_{t=1}^{n}g(X_t) \overset{a.s.}{\to} E[g(X_0)]$$

for any function $g(\cdot)$ with $E|g(X_0)| < \infty$. And in general for any shifts $\tau_1, \ldots, \tau_k$ and function
$g : \mathbb{R}^{k+1} \to \mathbb{R}$ we have

$$\frac{1}{n}\sum_{t=1}^{n}g(X_t, X_{t+\tau_1}, \ldots, X_{t+\tau_k}) \overset{a.s.}{\to} E[g(X_0, X_{\tau_1}, \ldots, X_{\tau_k})] \qquad (3.8)$$

(often (3.8) is used as the definition of ergodicity, as it is an iff with the ergodic definition). This
result generalises the strong law of large numbers (which shows almost sure convergence for iid
random variables) to dependent random variables. It is an extremely useful result, as it shows us
that “mean-type” estimators consistently estimate their mean (without any real effort). The only
drawback is that we do not know the speed of convergence.
(3.8) gives us an idea of what constitutes an ergodic process. Suppose that $\{\varepsilon_t\}$ is an ergodic
process (a classical example is iid random variables); then any reasonable (meaning measurable)
function of $\{\varepsilon_t\}$ is also ergodic. More precisely, if $X_t$ is defined as

$$X_t = h(\ldots, \varepsilon_t, \varepsilon_{t-1}, \ldots), \qquad (3.9)$$

where $\{\varepsilon_t\}$ are iid random variables and $h(\cdot)$ is a measurable function, then $\{X_t\}$ is an ergodic
process. For full details see Stout (1974), Theorem 3.4.5.

Remark 3.3.2 As mentioned above all ergodic processes are stationary, but a stationary process
is not necessarily ergodic. Here is one simple example. Suppose that $\{\varepsilon_t\}$ are iid random variables
and $Z$ is a Bernoulli random variable with outcomes $\{1, 2\}$ (where the chance of either outcome is
half). Suppose that $Z$ stays the same for all $t$. Define

$$X_t = \begin{cases} \mu_1 + \varepsilon_t & Z = 1 \\ \mu_2 + \varepsilon_t & Z = 2. \end{cases}$$

It is clear that $E(X_t | Z = i) = \mu_i$ and $E(X_t) = \frac{1}{2}(\mu_1 + \mu_2)$. This sequence is stationary. However,
we observe that $\frac{1}{T}\sum_{t=1}^{T}X_t$ will only converge to one of the means, hence we do not have almost
sure convergence (or convergence in probability) to $\frac{1}{2}(\mu_1 + \mu_2)$.
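The failure of the time average in this example is easy to see by simulation. The sketch below assumes µ1 = −1, µ2 = 1 and standard normal ε_t; these values, the series length and the number of realisations are illustrative choices only.

```r
set.seed(3)
mu <- c(-1, 1)   # hypothetical values of mu_1 and mu_2
n <- 5000        # length of each realisation
n_real <- 200    # number of independent realisations

time_average <- function() {
  z <- sample(1:2, 1)        # Z is drawn once and kept fixed for all t
  mean(mu[z] + rnorm(n))     # time average of X_t = mu_Z + eps_t
}
avgs <- replicate(n_real, time_average())

# Each time average settles near mu_1 or mu_2, not near (mu_1 + mu_2)/2 = 0
print(table(round(avgs)))
```

Averaging over many realisations would recover (µ1 + µ2)/2, but no single realisation does, which is exactly what ergodicity rules out.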

R code

To make the plots in Figure 3.3 we use the commands

par(mfrow=c(2,1))
acf(soi,lag.max=300)
acf(sunspot,lag.max=60)

3.3.2 Towards statistical inference for time series


Returning to the sample mean in Example 3.3.1, suppose we want to construct CIs or apply statistical
tests on the mean. This requires us to estimate the long run variance (assuming stationarity)

$$\mathrm{var}(\bar{X}) \approx \frac{1}{n}c(0) + \frac{2}{n}\sum_{r=1}^{\infty}c(r).$$

There are several ways this can be done, either by fitting a model to the data and estimating the
covariance from the model, or by doing it nonparametrically. This example motivates the contents of the
course:

(i) Modelling, finding suitable time series models to fit to the data.

(ii) Forecasting, this is essentially predicting the future given current and past observations.

(iii) Estimation of the parameters in the time series model.

(iv) The spectral density function and frequency domain approaches; sometimes within the
frequency domain time series methods become extremely elegant.

(v) Analysis of nonstationary time series.

(vi) Analysis of nonlinear time series.

(vii) How to derive sampling properties.

3.4 What makes a covariance a covariance?


The covariance of a stationary process has several very interesting properties. The most important
is that it is positive semi-definite, which we define below.

Definition 3.4.1 (Positive semi-definite sequence) (i) A sequence $\{c(k); k \in \mathbb{Z}\}$ ($\mathbb{Z}$ is the
set of all integers) is said to be positive semi-definite if for any positive integer $n$ and sequence
$x = (x_1, \ldots, x_n) \in \mathbb{R}^n$ the following is satisfied

$$\sum_{i,j=1}^{n}c(i-j)x_i x_j \ge 0.$$

(ii) A sequence is said to be an even positive semi-definite sequence if (i) is satisfied and $c(k) =
c(-k)$ for all $k \in \mathbb{Z}$.

An extension of this notion is the positive semi-definite function.

Definition 3.4.2 (Positive semi-definite function) (i) A function $\{c(u); u \in \mathbb{R}\}$ is said to
be positive semi-definite if for any positive integer $n$, points $u_1, \ldots, u_n \in \mathbb{R}$ and sequence
$x = (x_1, \ldots, x_n) \in \mathbb{R}^n$ the following is satisfied

$$\sum_{i,j=1}^{n}c(u_i - u_j)x_i x_j \ge 0.$$

(ii) A function is said to be an even positive semi-definite function if (i) is satisfied and $c(u) =
c(-u)$ for all $u \in \mathbb{R}$.

Remark 3.4.1 You have probably encountered this positive definite notion before, when dealing
with positive definite matrices. Recall the $n \times n$ matrix $\Sigma_n$ is positive semi-definite if for all $x \in \mathbb{R}^n$,
$x'\Sigma_n x \ge 0$. To see how this is related to positive semi-definite sequences, suppose that the matrix $\Sigma_n$
has a special form, that is the elements of $\Sigma_n$ are $(\Sigma_n)_{i,j} = c(i-j)$. Then $x'\Sigma_n x = \sum_{i,j=1}^{n}c(i-j)x_i x_j$.
We observe that in the case that $\{X_t\}$ is a stationary process with covariance $c(k)$, the variance
covariance matrix of $\underline{X}_n = (X_1, \ldots, X_n)$ is $\Sigma_n$, where $(\Sigma_n)_{i,j} = c(i-j)$.

We now take the above remark further and show that the covariance of a stationary process is
positive semi-definite.

Theorem 3.4.1 Suppose that {Xt } is a discrete time/continuous stationary time series with co-
variance function {c(k)}, then {c(k)} is an even positive semi-definite sequence/function. Con-
versely for any even positive semi-definite sequence/function there exists a stationary time series
with this positive semi-definite sequence/function as its covariance function.

PROOF. We prove the result in the case that $\{X_t\}$ is a discrete time time series, i.e. $\{X_t; t \in \mathbb{Z}\}$.
We first show that $\{c(k)\}$ is a positive semi-definite sequence. Consider any sequence $x =
(x_1, \ldots, x_n) \in \mathbb{R}^n$, and the double sum $\sum_{i,j=1}^{n}x_i c(i-j)x_j$. Define the random variable
$Y = \sum_{i=1}^{n}x_i X_i$. It is straightforward to see that $\mathrm{var}(Y) = x'\mathrm{var}(\underline{X}_n)x = \sum_{i,j=1}^{n}c(i-j)x_i x_j$, where
$\underline{X}_n = (X_1, \ldots, X_n)$. Since for any random variable $Y$, $\mathrm{var}(Y) \ge 0$, this means that
$\sum_{i,j=1}^{n}x_i c(i-j)x_j \ge 0$, hence $\{c(k)\}$ is a positive semi-definite sequence.

To show the converse, that is for any positive semi-definite sequence $\{c(k)\}$ we can find a
corresponding stationary time series with the covariance $\{c(k)\}$, is relatively straightforward, but
depends on defining the characteristic function of a process and using Kolmogorov’s extension
theorem. We omit the details but refer an interested reader to Brockwell and Davis (1998), Section
1.5. □

In time series analysis usually the data is analysed by fitting a model to the data. The model
(so long as it is correctly specified, we will see what this means in later chapters) guarantees the
covariance function corresponding to the model (again we cover this in later chapters) is positive
semi-definite. This means that in general we do not have to worry about positive semi-definiteness of
the covariance function, as it is implied by the model.
On the other hand, in spatial statistics, often the object of interest is the covariance function
and specific classes of covariance functions are fitted to the data. In which case it is necessary to
ensure that the covariance function is positive semi-definite (noting that once a covariance function
has been found, by Theorem 3.4.1 there must exist a spatial process which has this covariance
function). It is impossible to check for positive semi-definiteness using Definitions 3.4.1 or 3.4.2. Instead
an alternative but equivalent criterion is used. The general result, which does not impose any
conditions on {c(k)}, is stated in terms of positive measures (this result is often called Bochner’s
theorem). Instead, we place some conditions on {c(k)}, and state a simpler version of the theorem.

Theorem 3.4.2 Suppose the coefficients $\{c(k); k \in \mathbb{Z}\}$ are absolutely summable (that is $\sum_k |c(k)| <
\infty$). Then the sequence $\{c(k)\}$ is positive semi-definite if and only if the function $f(\omega)$, where

$$f(\omega) = \frac{1}{2\pi}\sum_{k=-\infty}^{\infty}c(k)\exp(ik\omega),$$

is nonnegative for all $\omega \in [0, 2\pi]$.

We also state a variant of this result for positive semi-definite functions. Suppose the function
$\{c(u); u \in \mathbb{R}\}$ is absolutely integrable (that is $\int_{\mathbb{R}}|c(u)|du < \infty$). Then the function $\{c(u)\}$ is positive
semi-definite if and only if the function $f(\omega)$, where

$$f(\omega) = \frac{1}{2\pi}\int_{-\infty}^{\infty}c(u)\exp(iu\omega)du \ge 0$$

for all $\omega \in \mathbb{R}$.
The generalisation of the above result to dimension $d$ is that $\{c(u); u \in \mathbb{R}^d\}$ is a positive
semi-definite function if and only if

$$f(\omega) = \frac{1}{(2\pi)^d}\int_{\mathbb{R}^d}c(u)\exp(iu'\omega)du \ge 0$$

for all $\omega \in \mathbb{R}^d$.

PROOF. See Section 10.4.1.

Example 3.4.1 We will show that the sequence $c(0) = 1$, $c(1) = 0.5$, $c(-1) = 0.5$ and $c(k) = 0$ for
$|k| > 1$ is a positive semi-definite sequence.
From the definition of spectral density given above we see that the ‘spectral density’ corresponding
to the above sequence is

$$f(\omega) = \frac{1}{2\pi}\left[1 + 2 \times 0.5 \times \cos(\omega)\right].$$

Since $|\cos(\omega)| \le 1$, $f(\omega) \ge 0$, thus the sequence is positive semi-definite. An alternative method is to
find a model which has this as the covariance structure. Let $X_t = \varepsilon_t + \varepsilon_{t-1}$, where $\varepsilon_t$ are iid random
variables with $E[\varepsilon_t] = 0$ and $\mathrm{var}(\varepsilon_t) = 0.5$. This model has this covariance structure.
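Both arguments in this example can be checked numerically. The sketch below evaluates the spectral density criterion on a grid, checks the eigenvalues of Σn for one (arbitrarily chosen) n = 10, and computes the covariances implied by the moving average model; the grid size and variable names are illustrative.

```r
# Sequence from Example 3.4.1: c(0) = 1, c(1) = c(-1) = 0.5, c(k) = 0 for |k| > 1
c0 <- 1
c1 <- 0.5

# Criterion of Theorem 3.4.2: f(omega) = (1/(2 pi)) sum_k c(k) exp(i k omega) >= 0
omega <- seq(0, 2 * pi, length.out = 1001)
f <- (c0 + 2 * c1 * cos(omega)) / (2 * pi)
min_f <- min(f)

# Direct check for one n: smallest eigenvalue of Sigma_n with (Sigma_n)_{i,j} = c(i - j)
n <- 10
Sigma_n <- toeplitz(c(c0, c1, rep(0, n - 2)))
min_eig <- min(eigen(Sigma_n, symmetric = TRUE)$values)

# Covariances implied by X_t = eps_t + eps_{t-1} with var(eps_t) = 0.5:
# c(0) = 2 * var(eps_t) and c(1) = var(eps_t)
sigma2 <- 0.5
ma_cov <- c(lag0 = 2 * sigma2, lag1 = sigma2)

print(c(min_f = min_f, min_eig = min_eig))
print(ma_cov)
```

A grid or a single n cannot prove positive semi-definiteness, but a negative value of `min_f` or `min_eig` would be enough to rule a candidate sequence out.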

3.5 Spatial covariances (advanced)


Theorem 3.4.2 is extremely useful in finding valid spatial covariances. We recall that $c_d : \mathbb{R}^d \to \mathbb{R}$
is a positive semi-definite covariance (on the spatial plane $\mathbb{R}^d$) if there exists a positive function $f_d$
where

$$c_d(u) = \int_{\mathbb{R}^d}f_d(\omega)\exp(-iu'\omega)d\omega \qquad (3.10)$$

for all $u \in \mathbb{R}^d$ (the inverse Fourier transform of what was written). This result allows one to find
parametric covariances for spatial processes.
However, beyond dimension $d = 1$ (which can be considered a “time series”), there exist
conditions stronger than spatial (second order) stationarity. Probably the most popular is
spatial isotropy, which is even stronger than stationarity. A covariance $c_d$ is called spatially isotropic
if it is stationary and there exists a function $c : \mathbb{R} \to \mathbb{R}$ such that $c_d(u) = c(\|u\|_2)$. It is clear that
in the case $d = 1$, a stationary covariance is isotropic since $\mathrm{cov}(X_t, X_{t+1}) = c(1) = c(-1) =
\mathrm{cov}(X_t, X_{t-1}) = \mathrm{cov}(X_{t-1}, X_t)$. For $d > 1$, isotropy is a stronger condition than stationarity. The
appeal of an isotropic covariance is that the actual directional difference between two observations
does not impact the covariance; only the Euclidean distance between the two locations matters (see
picture on board). To show that the covariance $c(\cdot)$ is a valid isotropic covariance in dimension
$d$ (that is there exists a positive semi-definite function $c_d : \mathbb{R}^d \to \mathbb{R}$ such that $c(\|u\|) = c_d(u)$),
conditions analogous but not the same as (3.10) are required. We state them now.

Theorem 3.5.1 If a covariance $c_d(\cdot)$ is isotropic, its corresponding spectral density function $f_d$ is
also isotropic. That is, there exists a positive function $f : \mathbb{R} \to \mathbb{R}^+$ such that $f_d(\omega) = f(\|\omega\|_2)$.
A covariance $c(\cdot)$ is a valid isotropic covariance in $\mathbb{R}^d$ iff there exists a positive function $f(\cdot; d)$
defined on $\mathbb{R}^+$ such that

$$c(r) = (2\pi)^{d/2}\,r^{-(d/2-1)}\int_0^{\infty}\rho^{d/2}J_{(d/2)-1}(\rho r)f(\rho; d)d\rho \qquad (3.11)$$

where $J_n$ is the order $n$ Bessel function of the first kind.

PROOF. To give us some idea of where this result came from, we assume the first statement is true
and prove the second statement for the case the dimension $d = 2$.
By the spectral representation theorem we know that if $c(u_1, u_2)$ is a valid covariance then there
exists a positive function $f_2$ such that

$$c(u_1, u_2) = \int_{\mathbb{R}^2}f_2(\omega_1, \omega_2)\exp(i\omega_1 u_1 + i\omega_2 u_2)d\omega_1 d\omega_2.$$

Next we change variables moving from Euclidean coordinates to polar coordinates (see
https://en.wikipedia.org/wiki/Polar_coordinate_system), where $s = \sqrt{\omega_1^2 + \omega_2^2}$ and
$\theta = \tan^{-1}(\omega_2/\omega_1)$, so that $\omega_1 = s\cos\theta$ and $\omega_2 = s\sin\theta$.
In this way the spectral density can be written as $f_2(\omega_1, \omega_2) = f_{P,2}(s, \theta)$ and we have

$$c(u_1, u_2) = \int_0^{\infty}\int_0^{2\pi}s f_{P,2}(s, \theta)\exp(isu_1\cos\theta + isu_2\sin\theta)d\theta ds.$$

We convert the covariance to polar coordinates, $c(u_1, u_2) = c_{P,2}(r, \Omega)$ (where $u_1 = r\cos\Omega$
and $u_2 = r\sin\Omega$), to give

$$\begin{aligned}
c_{P,2}(r, \Omega) &= \int_0^{\infty}\int_0^{2\pi}s f_{P,2}(s, \theta)\exp\left[isr(\cos\Omega\cos\theta + \sin\Omega\sin\theta)\right]d\theta ds \\
&= \int_0^{\infty}\int_0^{2\pi}s f_{P,2}(s, \theta)\exp\left[isr\cos(\Omega - \theta)\right]d\theta ds. \qquad (3.12)
\end{aligned}$$

So far we have not used isotropy of the covariance, we have simply rewritten the spectral
representation in terms of polar coordinates.
Now, we consider the special case that the covariance is isotropic; this means that there exists
a function $c$ such that $c_{P,2}(r, \Omega) = c(r)$ for all $r$ and $\Omega$. Furthermore, by the first statement of the
theorem, if the covariance is isotropic, then there exists a positive function $f : \mathbb{R}^+ \to \mathbb{R}^+$ such that
$f_{P,2}(s, \theta) = f(s)$ for all $s$ and $\theta$. Using these two facts and substituting them into (3.12) gives

$$\begin{aligned}
c(r) &= \int_0^{\infty}\int_0^{2\pi}s f(s)\exp\left[isr\cos(\Omega - \theta)\right]d\theta ds \\
&= \int_0^{\infty}s f(s)\underbrace{\int_0^{2\pi}\exp\left[isr\cos(\Omega - \theta)\right]d\theta}_{=2\pi J_0(sr)}ds = 2\pi\int_0^{\infty}s f(s)J_0(sr)ds.
\end{aligned}$$

For the case $d = 2$ we have obtained the desired result. Note that the Bessel function $J_0(\cdot)$ is
effectively playing the same role as the exponential function in the general spectral representation
theorem. □
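The inner integral in the last display, namely that integrating exp[isr cos(Ω − θ)] over θ from 0 to 2π gives 2πJ0(sr), can be checked numerically. The sketch below uses base R's integrate and besselJ; the values of s, r and Ω are arbitrary illustrative choices.

```r
s <- 1.3      # hypothetical radial frequency
r <- 2.1      # hypothetical distance
Omega <- 0.7  # hypothetical angle; the result should not depend on it

# Real and imaginary parts of int_0^{2 pi} exp(i s r cos(Omega - theta)) d theta
re <- integrate(function(th) cos(s * r * cos(Omega - th)), 0, 2 * pi)$value
im <- integrate(function(th) sin(s * r * cos(Omega - th)), 0, 2 * pi)$value

# Closed form: 2 pi J_0(s r), with J_0 the order zero Bessel function of the first kind
bessel <- 2 * pi * besselJ(s * r, 0)

print(c(real_part = re, imaginary_part = im, two_pi_J0 = bessel))
```

The real part should match the Bessel expression and the imaginary part should vanish, whatever value of Ω is used; this is the sense in which the integral depends on the locations only through the distance r.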

The above result is extremely useful. It allows one to construct a valid isotropic covariance
function in dimension d with a positive function f . Furthermore, it shows that an isotropic covari-
ance c(r) may be valid in dimension in d = 1, . . . , 3, but for d > 3 it may not be valid. That is
for d > 3, there does not exist a positive function f (·; d) which satisfies (3.11). Schoenberg showed
that an isotropic covariance c(r) was valid in all dimensions d iff there exists a representation

\[
c(r) = \int_0^{\infty}\exp(-r^2t^2)\,dF(t),
\]

where F is a probability measure. In most situations the above can be written as

\[
c(r) = \int_0^{\infty}\exp(-r^2t^2)f(t)\,dt,
\]

where $f:\mathbb{R}^+\to\mathbb{R}^+$. This representation turns out to be a very fruitful method for generating parametric families of isotropic covariances which are valid in all dimensions $d$. These include the Matérn class, the Cauchy class and the powered exponential family. The feature common to all these isotropic covariance functions is that the covariances are strictly positive and strictly decreasing. In other words, the cost for an isotropic covariance to be valid in all dimensions is that it can only model positive, monotonic correlations. The use of such covariances has become very popular in modelling Gaussian processes for problems in machine learning (see http://www.gaussianprocess.org/gpml/chapters/RW1.pdf).
For an excellent review see ?, Section 2.5.

3.6 Exercises
Exercise 3.5 Which of these sequences can be used as the autocovariance function of a second order stationary time series?

(i) c(−1) = 1/2, c(0) = 1, c(1) = 1/2 and for all |k| > 1, c(k) = 0.

(ii) c(−1) = −1/2, c(0) = 1, c(1) = 1/2 and for all |k| > 1, c(k) = 0.

(iii) c(−2) = −0.8, c(−1) = 0.5, c(0) = 1, c(1) = 0.5 and c(2) = −0.8 and for all |k| > 2,
c(k) = 0.

Exercise 3.6 (i) Show that the function $c(u) = \exp(-a|u|)$, where $a > 0$, is a positive semi-definite function.

(ii) Show that the commonly used exponential spatial covariance defined on $\mathbb{R}^2$, $c(u_1,u_2) = \exp\left(-a\sqrt{u_1^2+u_2^2}\right)$, where $a > 0$, is a positive semi-definite function.

Hint: One method is to make a change of variables using Polar coordinates. You may also
want to harness the power of Mathematica or other such tools.

Chapter 4

Linear time series

Prerequisites

• Familiarity with linear models in regression.

• Solving polynomial equations. If a solution is complex, writing it in polar form $x + iy = re^{i\theta}$, where $\theta$ is the phase and $r$ the modulus or magnitude.

Objectives

• Understand what causal and invertible mean.

• Know what an AR, MA and ARMA time series model is.

• Know how to find a solution of an ARMA time series, and understand why this is impor-
tant (how the roots determine causality and why this is important to know - in terms of
characteristics in the process and also simulations).

• Understand how the roots of the AR can determine ‘features’ in the time series and covariance
structure (such as pseudo periodicities).

4.1 Motivation
The objective of this chapter is to introduce the linear time series model. Linear time series models are designed to model the covariance structure in the time series. There are two popular subgroups of linear time series models, (a) the autoregressive and (b) the moving average models, which can be combined to make the autoregressive moving average models.
We motivate the autoregressive model from the perspective of classical linear regression. We recall that one objective in linear regression is to predict the response variable given variables that are observed. To do this, typically linear dependence between the response and the variables is assumed and we model $Y_i$ as

\[
Y_i = \sum_{j=1}^{p} a_j X_{ij} + \varepsilon_i,
\]

where εi is such that E[εi |Xij ] = 0 and more commonly εi and Xij are independent. In linear
regression once the model has been defined, we can immediately find estimators of the parameters,
do model selection etc.
Returning to time series, one major objective is to predict/forecast the future given current and
past observations (just as in linear regression our aim is to predict the response given the observed
variables). At least formally, it seems reasonable to represent this as

\[
X_t = \sum_{j=1}^{p}\phi_j X_{t-j} + \varepsilon_t, \qquad t\in\mathbb{Z}, \tag{4.1}
\]

where we assume that {εt } are independent, identically distributed, zero mean random variables.
Model (4.1) is called an autoregressive model of order p (AR(p) for short). Further, it would appear
that

\[
E(X_t \mid X_{t-1},\ldots,X_{t-p}) = \sum_{j=1}^{p}\phi_j X_{t-j}. \tag{4.2}
\]

That is, the expected value of $X_t$, given that $X_{t-1},\ldots,X_{t-p}$ have already been observed, is linear in these values; thus the past values of $X_t$ have a linear influence on the conditional mean of $X_t$. However, (4.2) is not necessarily true.
Unlike the linear regression model, (4.1) is an infinite set of linear difference equations. This means that, for this system of equations to be well defined, it needs to have a solution which is meaningful. To understand why, recall that (4.1) is defined for all $t\in\mathbb{Z}$, so let us start the equation at the beginning of time ($t = -\infty$) and run it on. Without any constraint on the parameters $\{\phi_j\}$, there is no reason to believe the solution is finite (contrast this with linear regression where these issues are not relevant). Therefore, the first thing to understand is under what conditions the AR model (4.1) has a well defined stationary solution and what features in a time series the solution is able to capture.
Of course, one could ask why go to the effort. One could simply use least squares to estimate the parameters. This is possible, but there are two related problems: (a) without a proper analysis it is not clear whether the model has a meaningful solution (for example, in Section 6.4 we show that the least squares estimator can lead to misspecified models), and without such a solution it is not even possible to simulate the process; (b) it is possible that $E(\varepsilon_t \mid X_{t-p}) \neq 0$, which means that least squares is not estimating $\phi_j$ and is instead estimating an entirely different set of parameters! Therefore, there is a practical motivation behind our theoretical treatment.
In this chapter we derive conditions under which (4.1) has a strictly stationary solution. In Chapter 6 we obtain conditions for (4.1) to have a solution that is both strictly stationary and second order stationary. It is worth mentioning that it is possible to obtain a strictly stationary solution to (4.1) under weaker conditions (see Theorem 13.0.1).
How would you simulate from the following model? One simple method for understanding a model is to understand how you would simulate from it:

\[
X_t = \phi_1 X_{t-1} + \phi_2 X_{t-2} + \varepsilon_t, \qquad t = \ldots,-1,0,1,\ldots.
\]
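One possible answer to the question above, sketched in Python under assumptions that are not part of the notes: Gaussian innovations, arbitrary starting values of zero, and a burn-in period that is discarded. The function name `simulate_ar2` and the parameter values are illustrative only. The sketch also shows what goes wrong when the forward recursion is applied to parameters for which it is not stable, which is the issue the rest of the chapter makes precise.

```python
import random

def simulate_ar2(phi1, phi2, n, burn_in=500, seed=0):
    """Forward recursion X_t = phi1*X_{t-1} + phi2*X_{t-2} + eps_t.

    Assumption: N(0,1) innovations and zero starting values; the first
    `burn_in` values are dropped so the start has little influence.
    """
    rng = random.Random(seed)
    x_prev2, x_prev1 = 0.0, 0.0
    out = []
    for t in range(burn_in + n):
        eps = rng.gauss(0.0, 1.0)
        x = phi1 * x_prev1 + phi2 * x_prev2 + eps
        if t >= burn_in:
            out.append(x)
        x_prev2, x_prev1 = x_prev1, x
    return out

# Parameters for which the forward recursion settles down.
stable = simulate_ar2(0.75, -0.125, n=200)
# Parameters for which the same recursion blows up (no burn-in needed to see it).
explosive = simulate_ar2(2.5, -1.0, n=200, burn_in=0)

print(max(abs(v) for v in stable), max(abs(v) for v in explosive))
```

The second call is a reminder that the recursion is only a simulation device; whether it produces a realisation of a stationary process depends on the parameters.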

4.2 Linear time series and moving average models

4.2.1 Infinite sums of random variables


Before defining a linear time series, we define the MA(q) model, which is a subclass of linear time series. Let us suppose that $\{\varepsilon_t\}$ are iid random variables with mean zero and finite variance. The time series $\{X_t\}$ is said to have an MA(q) representation if it satisfies

\[
X_t = \sum_{j=0}^{q}\psi_j\varepsilon_{t-j},
\]

where $E(\varepsilon_t) = 0$ and $\mathrm{var}(\varepsilon_t) = 1$. It is clear that $X_t$ is a rolling finite weighted sum of $\{\varepsilon_t\}$, therefore $\{X_t\}$ must be well defined. We extend this notion and consider infinite sums of random variables. Now things become more complicated, since care must always be taken with anything involving infinite sums. More precisely, for the sum

\[
X_t = \sum_{j=-\infty}^{\infty}\psi_j\varepsilon_{t-j}
\]

to be well defined (to have a finite limit), the partial sums $S_n = \sum_{j=-n}^{n}\psi_j\varepsilon_{t-j}$ should be (almost surely) finite and the sequence $S_n$ should converge (i.e. $|S_{n_1} - S_{n_2}| \to 0$ as $n_1, n_2 \to \infty$). A random variable makes no sense if it is infinite. Therefore we must be sure that $X_t$ is finite (this is what we mean by being well defined).
Below, we give conditions under which this is true.

Lemma 4.2.1 Suppose $\sum_{j=-\infty}^{\infty}|\psi_j| < \infty$ and $\{X_t\}$ is a strictly stationary time series with $E|X_t| < \infty$. Then $\{Y_t\}$, defined by

\[
Y_t = \sum_{j=-\infty}^{\infty}\psi_j X_{t-j},
\]

is a strictly stationary time series. Furthermore, the partial sum converges almost surely, $Y_{n,t} = \sum_{j=-n}^{n}\psi_j X_{t-j} \to Y_t$. If $\mathrm{var}(X_t) < \infty$, then $\{Y_t\}$ is second order stationary and the partial sum converges in mean square (that is, $E(Y_{n,t} - Y_t)^2 \to 0$).

PROOF. See Brockwell and Davis (1998), Proposition 3.1.1 or Fuller (1995), Theorem 2.1.1 (page
31) (also Shumway and Stoffer (2006), page 86). □

Example 4.2.1 Suppose $\{X_t\}$ is a strictly stationary time series with $\mathrm{var}(X_t) < \infty$. Define $\{Y_t\}$ as the following infinite sum

\[
Y_t = \sum_{j=0}^{\infty} j^k\rho^j|X_{t-j}|,
\]

where $|\rho| < 1$. Then $\{Y_t\}$ is also a strictly stationary time series with a finite variance.
We will use this example later in the course.

Having derived conditions under which infinite sums are well defined, we can now define the
general class of linear and MA(∞) processes.

Definition 4.2.1 (The linear process and moving average (MA)(∞)) Suppose that $\{\varepsilon_t\}$ are iid random variables, $\sum_{j=0}^{\infty}|\psi_j| < \infty$ and $E(|\varepsilon_t|) < \infty$.

(i) A time series is said to be a linear time series if it can be represented as

\[
X_t = \sum_{j=-\infty}^{\infty}\psi_j\varepsilon_{t-j},
\]

where $\{\varepsilon_t\}$ are iid random variables with finite variance. Note that since these sums are well defined, by equation (3.9) $\{X_t\}$ is a strictly stationary (ergodic) time series.

This is a rather strong definition of a linear process. A more general definition is that $\{X_t\}$ has the representation

\[
X_t = \sum_{j=-\infty}^{\infty}\psi_j\varepsilon_{t-j},
\]

where $\{\varepsilon_t\}$ are uncorrelated random variables with mean zero and variance one (thus the independence assumption has been dropped).

(ii) The time series $\{X_t\}$ has an MA(∞) representation if it satisfies

\[
X_t = \sum_{j=0}^{\infty}\psi_j\varepsilon_{t-j}. \tag{4.3}
\]

The difference between an MA(∞) process and a linear process is quite subtle. A linear process involves past, present and future innovations $\{\varepsilon_t\}$, whereas the MA(∞) uses only past and present innovations.
A very interesting class of models which have MA(∞) representations are autoregressive and
autoregressive moving average models. In the following sections we prove this.
¹ Note that later on we show that all second order stationary time series $\{X_t\}$ have the representation

\[
X_t = \sum_{j=0}^{\infty}\psi_j Z_{t-j}, \tag{4.4}
\]

where $Z_t = X_t - P_{X_{t-1},X_{t-2},\ldots}(X_t)$ (and $P_{X_{t-1},X_{t-2},\ldots}(X_t)$ is the best linear predictor of $X_t$ given the past, $X_{t-1}, X_{t-2}, \ldots$). In this case $\{Z_t\}$ are uncorrelated random variables. This is called Wold's representation theorem (see Section 7.12). The representation in (4.4) has many practical advantages. For example, Krampe et al. (2016) recently used it to define the so called "MA bootstrap".

4.3 The AR(p) model
In this section we will examine under what conditions the AR(p) model has a stationary solution.

4.3.1 Difference equations and back-shift operators


The autoregressive model is defined in terms of inhomogeneous difference equations. Difference equations can often be represented in terms of backshift operators, so we start by defining them and see why this representation may be useful (and why it should work).
The time series $\{X_t\}$ is said to be an autoregressive process of order p (AR(p)) if it satisfies the equation

\[
X_t - \phi_1X_{t-1} - \ldots - \phi_pX_{t-p} = \varepsilon_t, \qquad t\in\mathbb{Z}, \tag{4.5}
\]

where $\{\varepsilon_t\}$ are zero mean, finite variance random variables. As we mentioned previously, the autoregressive model is a system of difference equations (which can be treated as an infinite number of simultaneous equations). For this system to make any sense it must have a solution.

Remark 4.3.1 (What is meant by a solution?) By solution, we mean a sequence of numbers $\{x_t\}_{t=-\infty}^{\infty}$ which satisfy the equations in (7.31). It is tempting to treat (7.31) as a recursion, where we start with an initial value $x_I$ some time far back in the past and use (7.31) to generate $\{x_t\}$ (for a given sequence $\{\varepsilon_t\}_t$). This is true for some equations but not all. To find out which, we need to obtain the solution to (7.31).
Example Let us suppose the model is

\[
X_t = \phi X_{t-1} + \varepsilon_t \quad\text{for } t\in\mathbb{Z},
\]

where $\varepsilon_t$ are iid random variables and $\phi$ is a known parameter. Let $\varepsilon_2 = 0.5$, $\varepsilon_3 = 3.1$, $\varepsilon_4 = -1.2$ etc. This gives the system of equations

\[
x_2 = \phi x_1 + 0.5, \quad x_3 = \phi x_2 + 3.1, \quad\text{and}\quad x_4 = \phi x_3 - 1.2
\]

and so forth. We see this is a system of equations in terms of the unknowns $\{x_t\}_t$. Does there exist an $\{x_t\}_t$ which satisfies this system of equations? For linear systems, the answer can easily be found. But for more complex systems the answer is not so clear. Our focus in this chapter is on linear systems.

To obtain a solution we write the autoregressive model in terms of backshift operators:

\[
X_t - \phi_1BX_t - \ldots - \phi_pB^pX_t = \varepsilon_t \quad\Rightarrow\quad \phi(B)X_t = \varepsilon_t,
\]

where $\phi(B) = 1 - \sum_{j=1}^{p}\phi_jB^j$, and $B$ is the backshift operator, defined such that $B^kX_t = X_{t-k}$. Simply rearranging $\phi(B)X_t = \varepsilon_t$ gives the 'solution' of the autoregressive difference equation to be $X_t = \phi(B)^{-1}\varepsilon_t$. However, this is just an algebraic manipulation; below we investigate whether it really has any meaning.
In the subsections below we will show:

• Let $\phi(z) = 1 - \sum_{j=1}^{p}\phi_jz^j$ be a pth order polynomial in $z$. Let $z_1,\ldots,z_p$ denote the p roots of $\phi(z)$. A solution for (7.31) will always exist if none of the p roots of $\phi(z)$ lie on the unit circle, i.e. $|z_j| \neq 1$ for $1\le j\le p$.

• If all the roots lie outside the unit circle, i.e. $|z_j| > 1$ for $1\le j\le p$, then $\{x_t\}$ can be generated by starting with an initial value $x_I$ far in the past and treating (7.31) as a recursion

\[
X_t = \phi_1X_{t-1} + \ldots + \phi_pX_{t-p} + \varepsilon_t.
\]

A time series that can be generated using the above recursion is called causal. It will have a very specific solution.

• If all the roots lie inside the unit circle, i.e. $|z_j| < 1$ for $1\le j\le p$, then we cannot directly treat (7.31) as a recursion. Instead, we need to rearrange (7.31) such that $X_{t-p}$ is written in terms of $\{X_{t-j}\}_{j=0}^{p-1}$ and $\varepsilon_t$:

\[
X_{t-p} = \phi_p^{-1}\left[-\phi_{p-1}X_{t-p+1} - \ldots - \phi_1X_{t-1} + X_t\right] - \phi_p^{-1}\varepsilon_t. \tag{4.4}
\]

$\{x_t\}$ can be generated by starting with an initial value far in the future and treating (4.4) as a backward recursion.

• If the roots lie both inside and outside the unit circle, no recursion will generate a solution. But we will show that a solution can be generated by adding recursions together.

To do this, we start with an example.
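The four cases listed above can be checked numerically. The following is a minimal Python sketch, restricted to order p = 2 so that the roots can be found with the quadratic formula; the function names `ar2_roots` and `classify_ar2`, the labels it returns, and the tolerance are illustrative choices and not notation from the notes.

```python
import cmath

def ar2_roots(phi1, phi2):
    """Roots of phi(z) = 1 - phi1*z - phi2*z**2 (assumes phi2 != 0)."""
    # phi(z) = 0 is equivalent to phi2*z^2 + phi1*z - 1 = 0.
    disc = cmath.sqrt(phi1 * phi1 + 4.0 * phi2)
    return ((-phi1 + disc) / (2.0 * phi2), (-phi1 - disc) / (2.0 * phi2))

def classify_ar2(phi1, phi2, tol=1e-10):
    """Label the AR(2) model according to where its roots lie."""
    mods = [abs(z) for z in ar2_roots(phi1, phi2)]
    if any(abs(m - 1.0) <= tol for m in mods):
        return "root on unit circle"
    if all(m > 1.0 for m in mods):
        return "causal"            # forward recursion from the past works
    if all(m < 1.0 for m in mods):
        return "purely noncausal"  # backward recursion from the future works
    return "mixed"                 # roots on both sides of the unit circle

print(classify_ar2(0.75, -0.125))  # (1 - 0.5z)(1 - 0.25z)
print(classify_ar2(2.5, -1.0))     # (1 - 2z)(1 - 0.5z)
```

For general p the same classification applies, but a numerical polynomial root finder would be needed in place of the quadratic formula.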

4.3.2 Solution of two particular AR(1) models
Below we consider two different AR(1) models and obtain their solutions.

(i) Consider the AR(1) process

\[
X_t = 0.5X_{t-1} + \varepsilon_t, \qquad t\in\mathbb{Z}. \tag{4.5}
\]

Notice this is an equation (rather like $3x^2 + 2x + 1 = 0$, or an infinite number of simultaneous equations), which may or may not have a solution. To obtain the solution we note that $X_t = 0.5X_{t-1} + \varepsilon_t$ and $X_{t-1} = 0.5X_{t-2} + \varepsilon_{t-1}$. Using this we get $X_t = \varepsilon_t + 0.5(0.5X_{t-2} + \varepsilon_{t-1}) = \varepsilon_t + 0.5\varepsilon_{t-1} + 0.5^2X_{t-2}$. Continuing this backward iteration we obtain at the kth iteration $X_t = \sum_{j=0}^{k}(0.5)^j\varepsilon_{t-j} + (0.5)^{k+1}X_{t-k-1}$. Because $(0.5)^{k+1}\to 0$ as $k\to\infty$, by taking the limit we can show that $X_t = \sum_{j=0}^{\infty}(0.5)^j\varepsilon_{t-j}$ is almost surely finite and a solution of (4.5). Of course, like any other equation, one may wonder whether it is the unique solution (recalling that $3x^2 + 2x + 1 = 0$ has two solutions). We show in Section 4.3.2 that this is the unique stationary solution of (4.5).

Let us see whether we can obtain a solution using the difference equation representation. We recall that, by crudely taking inverses, the solution is $X_t = (1 - 0.5B)^{-1}\varepsilon_t$. The obvious question is whether this has any meaning. Note that $(1 - 0.5B)^{-1} = \sum_{j=0}^{\infty}(0.5B)^j$ for $|B| < 2$, hence substituting this power series expansion into $X_t$ we have

\[
X_t = (1 - 0.5B)^{-1}\varepsilon_t = \Big(\sum_{j=0}^{\infty}(0.5B)^j\Big)\varepsilon_t = \Big(\sum_{j=0}^{\infty}0.5^jB^j\Big)\varepsilon_t = \sum_{j=0}^{\infty}(0.5)^j\varepsilon_{t-j},
\]

which corresponds to the solution above. Hence the backshift operator in this example helps us to obtain a solution. Moreover, because the solution can be written in terms of past values of $\varepsilon_t$, it is causal.

(ii) Let us consider the AR model, which we will see has a very different solution:

\[
X_t = 2X_{t-1} + \varepsilon_t. \tag{4.6}
\]

Doing what we did in (i), we find that after the kth back iteration we have $X_t = \sum_{j=0}^{k}2^j\varepsilon_{t-j} + 2^{k+1}X_{t-k-1}$. However, unlike example (i), $2^k$ does not converge as $k\to\infty$. This suggests that if we continue the iteration, $X_t = \sum_{j=0}^{\infty}2^j\varepsilon_{t-j}$ is not a quantity that is finite (when $\varepsilon_t$ are iid). Therefore $X_t = \sum_{j=0}^{\infty}2^j\varepsilon_{t-j}$ cannot be considered as a solution of (4.6). We need to write (4.6) in a slightly different way in order to obtain a meaningful solution.

Rewriting (4.6) we have $X_{t-1} = 0.5X_t - 0.5\varepsilon_t$. Forward iterating this we get $X_{t-1} = -(0.5)\sum_{j=0}^{k}(0.5)^j\varepsilon_{t+j} + (0.5)^{k+1}X_{t+k}$. Since $(0.5)^{k+1}\to 0$ as $k\to\infty$ we have

\[
X_{t-1} = -(0.5)\sum_{j=0}^{\infty}(0.5)^j\varepsilon_{t+j}
\]

as a solution of (4.6).

Let us see whether the difference equation can also offer a solution. Since $(1 - 2B)X_t = \varepsilon_t$, using the crude manipulation we have $X_t = (1 - 2B)^{-1}\varepsilon_t$. Now we see that

\[
(1 - 2B)^{-1} = \sum_{j=0}^{\infty}(2B)^j \quad\text{for } |B| < 1/2.
\]

Using this expansion gives the solution $X_t = \sum_{j=0}^{\infty}2^jB^j\varepsilon_t$, but as pointed out above this sum is not well defined. What we find is that $\phi(B)^{-1}\varepsilon_t$ only makes sense (is well defined) if the series expansion of $\phi(B)^{-1}$ converges in a region that includes the unit circle $|B| = 1$.

What we need is another series expansion of $(1 - 2B)^{-1}$ which converges in a region which includes the unit circle $|B| = 1$ (as an aside, we note that a function does not necessarily have a unique series expansion; it can have different series expansions which may converge in different regions). We now show that a convergent series expansion needs to be defined in terms of negative powers of $B$, not positive powers. Writing $(1 - 2B) = -(2B)(1 - (2B)^{-1})$, we have

\[
(1 - 2B)^{-1} = -(2B)^{-1}\sum_{j=0}^{\infty}(2B)^{-j},
\]

which converges for $|B| > 1/2$. Using this expansion we have

\[
X_t = -\sum_{j=0}^{\infty}(0.5)^{j+1}B^{-j-1}\varepsilon_t = -\sum_{j=0}^{\infty}(0.5)^{j+1}\varepsilon_{t+j+1},
\]

which we have shown above is a well defined solution of (4.6).

In summary, $(1 - 2B)^{-1}$ has two series expansions:

\[
\frac{1}{1 - 2B} = \sum_{j=0}^{\infty}(2B)^{j},
\]

which converges for $|B| < 1/2$, and

\[
\frac{1}{1 - 2B} = -(2B)^{-1}\sum_{j=0}^{\infty}(2B)^{-j},
\]

which converges for $|B| > 1/2$. The one that is useful for us is the series which converges when $|B| = 1$.
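The two solutions in (i) and (ii) can be checked numerically. The sketch below is an illustration under assumptions not made in the notes: the innovations are drawn as N(0,1), the infinite sums are truncated after `trunc` terms, and the function name `check_ar1_solutions` is hypothetical. It builds the causal sum for (4.5) and the future-dependent sum for (4.6) from the same innovations and reports how far each is from satisfying its difference equation.

```python
import random

def check_ar1_solutions(n=50, trunc=80, seed=1):
    """Largest residual of each truncated solution in its AR(1) equation."""
    rng = random.Random(seed)
    # Innovations indexed by time; extra values on both sides for the sums.
    eps = {s: rng.gauss(0.0, 1.0) for s in range(-trunc - 1, n + trunc + 2)}

    def causal(t):
        # X_t = sum_{j>=0} 0.5^j eps_{t-j}, candidate solution of X_t = 0.5 X_{t-1} + eps_t
        return sum(0.5 ** j * eps[t - j] for j in range(trunc))

    def noncausal(t):
        # X_t = -sum_{j>=0} 0.5^{j+1} eps_{t+j+1}, candidate solution of X_t = 2 X_{t-1} + eps_t
        return -sum(0.5 ** (j + 1) * eps[t + j + 1] for j in range(trunc))

    err_causal = max(abs(causal(t) - 0.5 * causal(t - 1) - eps[t]) for t in range(n))
    err_noncausal = max(abs(noncausal(t) - 2.0 * noncausal(t - 1) - eps[t]) for t in range(n))
    return err_causal, err_noncausal

print(check_ar1_solutions())
```

Both residuals are of the order of the truncation and rounding error, which is consistent with each sum solving its own equation even though one uses only past innovations and the other only future ones.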

It is clear from the above examples how to obtain the solution of a general AR(1). This solution
is unique and we show this below.

Exercise 4.1 (i) Find the stationary solution of the AR(1) model

Xt = 0.8Xt−1 + εt

where εt are iid random variables with mean zero and variance one.

(ii) Find the stationary solution of the AR(1) model

\[
X_t = \frac{5}{4}X_{t-1} + \varepsilon_t
\]

where εt are iid random variables with mean zero and variance one.

(iii) [Optional] Obtain the autocovariance function of the stationary solution for both the models
in (i) and (ii).

Uniqueness of the stationary solution of the AR(1) model (advanced)

Consider the AR(1) process $X_t = \phi X_{t-1} + \varepsilon_t$, where $|\phi| < 1$. Using the method outlined in (i), it is straightforward to show that $X_t = \sum_{j=0}^{\infty}\phi^j\varepsilon_{t-j}$ is its stationary solution. We now show that this solution is unique. This may seem obvious, but recall that many equations have multiple solutions. The techniques used here generalize to nonlinear models too.

We first show that $X_t = \sum_{j=0}^{\infty}\phi^j\varepsilon_{t-j}$ is well defined (that it is almost surely finite). We note that $|X_t| \le \sum_{j=0}^{\infty}|\phi^j|\cdot|\varepsilon_{t-j}|$. Thus we will show that $\sum_{j=0}^{\infty}|\phi^j|\cdot|\varepsilon_{t-j}|$ is almost surely finite, which will imply that $X_t$ is almost surely finite. By monotone convergence we can exchange sum and expectation and we have

\[
E(|X_t|) \le E\Big(\lim_{n\to\infty}\sum_{j=0}^{n}|\phi^j\varepsilon_{t-j}|\Big) = \lim_{n\to\infty}\sum_{j=0}^{n}|\phi^j|E|\varepsilon_{t-j}| = E(|\varepsilon_0|)\sum_{j=0}^{\infty}|\phi^j| < \infty.
\]

Therefore, since $E|X_t| < \infty$, $\sum_{j=0}^{\infty}\phi^j\varepsilon_{t-j}$ is a well defined solution of $X_t = \phi X_{t-1} + \varepsilon_t$.

To show that it is the unique stationary, causal solution, let us suppose there is another (causal) solution, call it $Y_t$. Clearly, by recursively applying the difference equation to $Y_t$, for every $s$ we have

\[
Y_t = \sum_{j=0}^{s}\phi^j\varepsilon_{t-j} + \phi^{s+1}Y_{t-s-1}.
\]

Evaluating the difference between the two solutions gives $Y_t - X_t = A_s - B_s$, where $A_s = \phi^{s+1}Y_{t-s-1}$ and $B_s = \sum_{j=s+1}^{\infty}\phi^j\varepsilon_{t-j}$ for all $s$. To show that $Y_t$ and $X_t$ coincide almost surely we will show that for every $\epsilon > 0$, $\sum_{s=1}^{\infty}P(|A_s - B_s| > \epsilon) < \infty$ (and then apply the Borel-Cantelli lemma). We note that if $|A_s - B_s| > \epsilon$, then either $|A_s| > \epsilon/2$ or $|B_s| > \epsilon/2$. Therefore $P(|A_s - B_s| > \epsilon) \le P(|A_s| > \epsilon/2) + P(|B_s| > \epsilon/2)$. To bound these two terms we use Markov's inequality. It is straightforward to show that $P(|B_s| > \epsilon/2) \le C|\phi|^s/\epsilon$. To bound $E|A_s|$, we note that $|Y_s| \le |\phi|\cdot|Y_{s-1}| + |\varepsilon_s|$; since $\{Y_t\}$ is a stationary solution, $E|Y_s|(1 - |\phi|) \le E|\varepsilon_s|$, thus $E|Y_t| \le E|\varepsilon_t|/(1 - |\phi|) < \infty$. Altogether this gives $P(|A_s - B_s| > \epsilon) \le C|\phi|^s/\epsilon$ (for some finite constant $C$). Hence $\sum_{s=1}^{\infty}P(|A_s - B_s| > \epsilon) \le \sum_{s=1}^{\infty}C|\phi|^s/\epsilon < \infty$. Thus, by the Borel-Cantelli lemma, the event $\{|A_s - B_s| > \epsilon\}$ happens only finitely often (almost surely). Since this holds for every $\epsilon > 0$, $Y_t = X_t$ almost surely. Hence $X_t = \sum_{j=0}^{\infty}\phi^j\varepsilon_{t-j}$ is (almost surely) the unique causal solution.

4.3.3 The solution of a general AR(p)


Let us now summarise our observations for the general AR(1) process $X_t = \phi X_{t-1} + \varepsilon_t$. If $|\phi| < 1$, then the solution is in terms of past values of $\{\varepsilon_t\}$; if on the other hand $|\phi| > 1$, the solution is in terms of future values of $\{\varepsilon_t\}$.
In this section we focus on the general AR(p) model

\[
X_t - \phi_1X_{t-1} - \ldots - \phi_pX_{t-p} = \varepsilon_t, \qquad t\in\mathbb{Z}. \tag{4.7}
\]

Generalising this argument to a general polynomial, if all the roots of $\phi(B)$ have absolute value greater than one, then the power series of $\phi(B)^{-1}$ (which converges for $|B| = 1$) is in terms of positive powers (hence the solution $\phi(B)^{-1}\varepsilon_t$ will be in terms of past values of $\{\varepsilon_t\}$). On the other hand, if some roots have absolute value less than one and others greater than one (but none lie on the unit circle), then the power series of $\phi(B)^{-1}$ will be in both negative and positive powers. Thus the solution $X_t = \phi(B)^{-1}\varepsilon_t$ will be in terms of both past and future values of $\{\varepsilon_t\}$. We summarize this result in a lemma below.

Lemma 4.3.1 Suppose that the AR(p) process satisfies the representation $\phi(B)X_t = \varepsilon_t$, where none of the roots of the characteristic polynomial lie on the unit circle and $E|\varepsilon_t| < \infty$. Then $\{X_t\}$ has a stationary, almost surely unique, solution

\[
X_t = \sum_{j\in\mathbb{Z}}\psi_j\varepsilon_{t-j},
\]

where $\psi(z) = \sum_{j\in\mathbb{Z}}\psi_jz^j = \phi(z)^{-1}$ (the Laurent series of $\phi(z)^{-1}$ which converges when $|z| = 1$).

We see that where the roots of the characteristic polynomial φ(B) lie defines the solution of the
AR process. We will show in Sections ?? and 6.1.2 that it not only defines the solution but also
determines some of the characteristics of the time series.

Exercise 4.2 Suppose $\{X_t\}$ satisfies the AR(p) representation

\[
X_t = \sum_{j=1}^{p}\phi_jX_{t-j} + \varepsilon_t,
\]

where $\sum_{j=1}^{p}|\phi_j| < 1$ and $E|\varepsilon_t| < \infty$. Show that $\{X_t\}$ will always have a causal stationary solution (i.e. the roots of the characteristic polynomial are outside the unit circle).

4.3.4 Obtaining an explicit solution of an AR(2) model

A worked out example

Suppose {Xt } satisfies

Xt = 0.75Xt−1 − 0.125Xt−2 + εt ,

where {εt } are iid random variables. We want to obtain a solution for the above equations.

It is not easy to use the backward (or forward) iterating technique for AR processes beyond order one. This is where using the backshift operator becomes useful. We start by writing $X_t = 0.75X_{t-1} - 0.125X_{t-2} + \varepsilon_t$ as $\phi(B)X_t = \varepsilon_t$, where $\phi(B) = 1 - 0.75B + 0.125B^2$, which leads to what is commonly known as the characteristic polynomial $\phi(z) = 1 - 0.75z + 0.125z^2$. If we can find a power series expansion of $\phi(B)^{-1}$ which is valid for $|B| = 1$, then the solution is $X_t = \phi(B)^{-1}\varepsilon_t$.
We first observe that $\phi(z) = 1 - 0.75z + 0.125z^2 = (1 - 0.5z)(1 - 0.25z)$. Therefore by using partial fractions we have

\[
\frac{1}{\phi(z)} = \frac{1}{(1 - 0.5z)(1 - 0.25z)} = \frac{2}{1 - 0.5z} - \frac{1}{1 - 0.25z}.
\]

We recall from geometric expansions that

\[
\frac{2}{1 - 0.5z} = 2\sum_{j=0}^{\infty}(0.5)^jz^j \quad (|z| < 2), \qquad \frac{-1}{1 - 0.25z} = -\sum_{j=0}^{\infty}(0.25)^jz^j \quad (|z| < 4).
\]

Putting the above together gives

\[
\frac{1}{(1 - 0.5z)(1 - 0.25z)} = \sum_{j=0}^{\infty}\{2(0.5)^j - (0.25)^j\}z^j, \qquad |z| < 2.
\]

The above expansion is valid for $|z| = 1$, because $\sum_{j=0}^{\infty}|2(0.5)^j - (0.25)^j| < \infty$ (see Lemma 4.3.2). Hence

\[
X_t = \{(1 - 0.5B)(1 - 0.25B)\}^{-1}\varepsilon_t = \sum_{j=0}^{\infty}\{2(0.5)^j - (0.25)^j\}B^j\varepsilon_t = \sum_{j=0}^{\infty}\{2(0.5)^j - (0.25)^j\}\varepsilon_{t-j},
\]

which gives a stationary solution to the AR(2) process (see Lemma 4.2.1). Moreover, since the roots lie outside the unit circle the solution is causal.
The discussion above shows how the backshift operator can be applied and how it can be used to obtain solutions to AR(p) processes.
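As a sanity check on the worked example, the coefficients of the expansion of $\phi(z)^{-1}$ can also be obtained without partial fractions: matching powers of $z$ in $\phi(z)\psi(z) = 1$ gives $\psi_0 = 1$ and $\psi_j = \sum_{k=1}^{\min(p,j)}\phi_k\psi_{j-k}$. The Python sketch below (function names are illustrative, and the recursion is only meaningful here because the model is causal) compares this recursion with the closed form $2(0.5)^j - (0.25)^j$ obtained from the partial fraction $1/\phi(z) = 2/(1-0.5z) - 1/(1-0.25z)$.

```python
def psi_recursive(phi, n):
    """First n weights of psi(z) = 1/phi(z) for a causal AR(p).

    phi = [phi_1, ..., phi_p]; uses psi_0 = 1 and
    psi_j = sum_{k=1}^{min(p, j)} phi_k * psi_{j-k}.
    """
    psi = [1.0]
    for j in range(1, n):
        psi.append(sum(phi[k] * psi[j - 1 - k] for k in range(min(len(phi), j))))
    return psi

def psi_closed_form(n):
    """Closed form for X_t = 0.75 X_{t-1} - 0.125 X_{t-2} + eps_t."""
    return [2.0 * 0.5 ** j - 0.25 ** j for j in range(n)]

rec = psi_recursive([0.75, -0.125], 20)
closed = psi_closed_form(20)
max_gap = max(abs(r - c) for r, c in zip(rec, closed))
print(max_gap)
```

The recursion is often the more convenient route in practice, since it does not require factorising the characteristic polynomial.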

The solution of a general AR(2) model

We now generalise the above to general AR(2) models

\[
X_t = (a + b)X_{t-1} - abX_{t-2} + \varepsilon_t.
\]

The characteristic polynomial of the above is $1 - (a + b)z + abz^2 = (1 - az)(1 - bz)$. This means the solution of $X_t$ is

\[
X_t = (1 - Ba)^{-1}(1 - Bb)^{-1}\varepsilon_t,
\]

thus we need an expansion of $(1 - Ba)^{-1}(1 - Bb)^{-1}$. Assuming that $a \neq b$, and using partial fractions, we have

\[
\frac{1}{(1 - za)(1 - zb)} = \frac{1}{b - a}\left(\frac{b}{1 - bz} - \frac{a}{1 - az}\right).
\]

Cases:

(1) $|a| < 1$ and $|b| < 1$; this means the roots lie outside the unit circle. Thus the expansion is

\[
\frac{1}{(1 - za)(1 - zb)} = \frac{1}{b - a}\left(b\sum_{j=0}^{\infty}b^jz^j - a\sum_{j=0}^{\infty}a^jz^j\right),
\]

which leads to the causal solution

\[
X_t = \frac{1}{b - a}\sum_{j=0}^{\infty}\left(b^{j+1} - a^{j+1}\right)\varepsilon_{t-j}. \tag{4.8}
\]
j=0

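(2) $|a| > 1$ and $|b| < 1$; this means the roots lie inside and outside the unit circle and we have the expansion

\[
\begin{aligned}
\frac{1}{(1 - za)(1 - zb)} &= \frac{1}{b - a}\left(\frac{b}{1 - bz} - \frac{a}{(az)((az)^{-1} - 1)}\right) \\
&= \frac{1}{b - a}\left(b\sum_{j=0}^{\infty}b^jz^j + z^{-1}\sum_{j=0}^{\infty}a^{-j}z^{-j}\right), \tag{4.9}
\end{aligned}
\]

which leads to the non-causal solution

\[
X_t = \frac{1}{b - a}\left(\sum_{j=0}^{\infty}b^{j+1}\varepsilon_{t-j} + \sum_{j=0}^{\infty}a^{-j}\varepsilon_{t+1+j}\right). \tag{4.10}
\]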

² Later we show that the non-causal $X_t$ has the same correlation as an AR(2) model whose characteristic polynomial is $(1 - a^{-1}z)(1 - bz)$, with roots $a$ and $b^{-1}$; since both these roots lie outside the unit circle this model has a causal solution. Moreover, it is possible to rewrite this non-causal AR(2) as an MA(∞) type process, but where the innovations are not independent but uncorrelated instead. I.e. we can write $X_t$ as

\[
(1 - a^{-1}B)(1 - bB)X_t = \tilde{\varepsilon}_t,
\]

where $\tilde{\varepsilon}_t$ are uncorrelated (and are a linear sum of the iid $\varepsilon_t$), which has the solution (assuming $b \neq a^{-1}$)

\[
X_t = \frac{1}{b - a^{-1}}\sum_{j=0}^{\infty}\left(b^{j+1} - a^{-(j+1)}\right)\tilde{\varepsilon}_{t-j}. \tag{4.11}
\]

Returning to (4.10), we see that this solution throws up additional interesting results. Let us return to the expansion in (4.9) and apply it to $X_t$:

\[
\begin{aligned}
X_t = \frac{1}{(1 - Ba)(1 - Bb)}\varepsilon_t &= \frac{1}{b - a}\Bigg(\underbrace{\frac{b}{1 - bB}\varepsilon_t}_{\text{causal AR(1)}} + \underbrace{\frac{1}{B(1 - a^{-1}B^{-1})}\varepsilon_t}_{\text{noncausal AR(1)}}\Bigg) \\
&= \frac{1}{b - a}\left(bY_t + Z_{t+1}\right),
\end{aligned}
\]

where $Y_t = bY_{t-1} + \varepsilon_t$ and $Z_{t+1} = a^{-1}Z_{t+2} + \varepsilon_{t+1}$. In other words, the noncausal AR(2) process is the sum of a causal and a 'future' AR(1) process. This is true for all noncausal time series (except when there is multiplicity in the roots) and is discussed further in Section ??.

We mention that several authors argue that noncausal time series can model features in data which causal time series cannot.

(3) $a = b$ with $|a| < 1$ (both roots are the same and lie outside the unit circle). The characteristic polynomial is $(1 - az)^2$. To obtain the convergent expansion when $|z| = 1$ we note that $(1 - az)^{-2} = \frac{d(1 - az)^{-1}}{d(az)}$. Thus

\[
\frac{1}{(1 - az)^2} = \sum_{j=1}^{\infty}j(az)^{j-1} = \sum_{j=0}^{\infty}(j + 1)a^jz^j.
\]

This leads to the causal solution

\[
X_t = \sum_{j=0}^{\infty}(j + 1)a^j\varepsilon_{t-j}.
\]

In many respects this is analogous to the Matérn covariance defined over $\mathbb{R}^d$ (and used in spatial statistics). However, unlike autocovariances defined over $\mathbb{R}^d$, the behaviour of the autocovariance at zero is not an issue.
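The non-causal solution (4.10) and the repeated-root weights can both be checked numerically. The Python sketch below is an illustration under assumptions that are not part of the notes: N(0,1) innovations, sums truncated after `trunc` terms, the particular values a = 2 and b = 0.5, and hypothetical function names. The first function builds $X_t$ from (4.10) and measures how well it satisfies $X_t = (a+b)X_{t-1} - abX_{t-2} + \varepsilon_t$; the second returns the weights $(j+1)a^j$ of $(1 - aB)^{-2}$.

```python
import random

def noncausal_ar2_residual(a=2.0, b=0.5, n=30, trunc=80, seed=2):
    """Largest |X_t - (a+b) X_{t-1} + a b X_{t-2} - eps_t| with X_t from (4.10)."""
    rng = random.Random(seed)
    eps = {s: rng.gauss(0.0, 1.0) for s in range(-trunc - 2, n + trunc + 2)}

    def x(t):
        past = sum(b ** (j + 1) * eps[t - j] for j in range(trunc))      # causal part
        future = sum(a ** (-j) * eps[t + 1 + j] for j in range(trunc))   # 'future' part
        return (past + future) / (b - a)

    return max(abs(x(t) - (a + b) * x(t - 1) + a * b * x(t - 2) - eps[t])
               for t in range(n))

def repeated_root_weights(a, n):
    """psi_j = (j + 1) * a**j, the expansion weights of (1 - a z)^{-2}."""
    return [(j + 1) * a ** j for j in range(n)]

print(noncausal_ar2_residual())
print(repeated_root_weights(0.5, 5))
```

The residual is at the level of truncation and rounding error, so the sum of a past-dependent and a future-dependent series does satisfy the AR(2) difference equation, as the decomposition above suggests.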

Exercise 4.3 Show that for the AR(2) model $X_t = \phi_1X_{t-1} + \phi_2X_{t-2} + \varepsilon_t$ to have a causal stationary solution, the parameters $\phi_1, \phi_2$ must lie in the region defined by the three conditions

\[
\phi_2 + \phi_1 < 1, \qquad \phi_2 - \phi_1 < 1, \qquad |\phi_2| < 1.
\]

Exercise 4.4 (a) Consider the AR(2) process

Xt = φ1 Xt−1 + φ2 Xt−2 + εt ,

where $\{\varepsilon_t\}$ are iid random variables with mean zero and variance one. Suppose the absolute values of the roots of the characteristic polynomial $1 - \phi_1z - \phi_2z^2$ are greater than one. Show that $|\phi_1| + |\phi_2| < 4$.

(b) Now consider a generalisation of this result. Consider the AR(p) process

\[
X_t = \phi_1X_{t-1} + \phi_2X_{t-2} + \ldots + \phi_pX_{t-p} + \varepsilon_t.
\]

Suppose the absolute values of the roots of the characteristic polynomial $1 - \phi_1z - \ldots - \phi_pz^p$ are greater than one. Show that $|\phi_1| + \ldots + |\phi_p| \le 2^p$.

4.3.5 History of the periodogram (Part II)


We now return to the development of the periodogram and the role that the AR model played in
understanding its behaviour.
The general view until the 1920s was that most time series were a mix of periodic functions with additive noise (where we treat $Y_t$ as the yearly sunspot data)

\[
Y_t = \sum_{j=1}^{P}\left[A_j\cos(t\Omega_j) + B_j\sin(t\Omega_j)\right] + \varepsilon_t.
\]

In the 1920s, Udny Yule, a statistician, and Gilbert Walker, a meteorologist (working in Pune, India), believed an alternative model could be used to explain the features seen in the periodogram. Yule fitted an autoregressive model of order two to the sunspot data and obtained the AR(2) model

\[
X_t = 1.381X_{t-1} - 0.6807X_{t-2} + \varepsilon_t.
\]

We simulate a Gaussian model with exactly this AR(2) structure. In Figure 4.2 we plot the sunspot data together with a realisation of the AR(2) process. In Figure 4.1 we plot the periodogram of the sunspot data and of a realisation from the fitted AR(2) process. One can fit a model to any data set. What makes this model so interesting is that the simple AR(2) model captures surprisingly well many of the prominent features seen in the sunspot data. From Figures 4.1 and 4.2 we see how well the AR(2), which is fully stochastic, can model periodicities.

[Figure 4.1: two periodogram panels, "sunspot" (top) and "fitted AR(2)" (bottom), each plotted against frequency (axis from 0 to 6).]

Figure 4.1: The periodogram of the Sunspot data is the top plot and the periodogram of the fitted AR(2) model is the lower plot. They do not look exactly the same, but the AR(2) model is able to capture some of the periodicities.
[Figure 4.2: two time series panels, "sunspot" (top) and "ar2" (bottom), each plotted against Time.]

Figure 4.2: Top: Sunspot, Lower: a realisation from the AR(2) process. Lines correspond to a period of P = 2π/0.579 ≈ 10.85 years.

To summarize, Schuster, and Yule and Walker, fitted two completely different models to the same data set and both models are able to mimic the periodicities observed in the sunspot data. While it is obvious how a superimposition of sines and cosines can model periodicities, it is not so clear how the AR(2) can achieve a similar effect.
In the following section we study the coefficients of the AR(2) model and how it can mimic the periodicities seen in the data.

4.3.6 Examples of “Pseudo” periodic AR(2) models


We start by studying the AR(2) model that Yule and Walker fitted to the data. We recall that the
fitted coefficients were

Xt = 1.381Xt−1 − 0.6807Xt−2 + εt .

This corresponds to the characteristic polynomial $\phi(z) = 1 - 1.381z + 0.68z^2$. The roots of this polynomial are approximately $\lambda_1 = 0.825^{-1}\exp(i0.579)$ and $\lambda_2 = 0.825^{-1}\exp(-i0.579)$ (the modulus follows from $|\lambda_1|^2 = 1/0.6807$). Cross referencing with the periodogram in Figure 4.1, we observe that the peak in the periodogram is also at around 0.57. This suggests that the phase of the roots (in polar form) determines the periodicities. If the roots are real then the phase is either 0 or π and $X_t$ either has no (pseudo) periodicities or alternates between signs.
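The modulus and phase of the roots of the fitted characteristic polynomial can be computed directly. The following Python sketch uses the quadratic formula and assumes the roots form a complex-conjugate pair; the function name `ar2_root_polar` is illustrative. It also converts the phase into a period via $P = 2\pi/\text{phase}$.

```python
import cmath
import math

def ar2_root_polar(phi1, phi2):
    """Modulus and (absolute) phase of a root of 1 - phi1*z - phi2*z**2.

    Assumes phi2 != 0 and that the two roots are complex conjugates,
    so both share the same modulus and the phases differ only in sign.
    """
    disc = cmath.sqrt(phi1 * phi1 + 4.0 * phi2)
    root = (-phi1 + disc) / (2.0 * phi2)
    return abs(root), abs(cmath.phase(root))

# Coefficients of the AR(2) model fitted to the sunspot data.
modulus, phase = ar2_root_polar(1.381, -0.6807)
period = 2.0 * math.pi / phase
print(modulus, phase, period)
```

The period obtained this way can be compared with the spacing of the vertical lines described in the caption of Figure 4.2.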
Observe that complex solutions of φ(z) must have conjugations in order to ensure φ(z) is real.
Thus if a solution of the characteristic function corresponding to an AR(2) is λ1 = r exp(iθ), then
λ2 = r exp(−iθ). Based on this φ(z) can be written as

φ(z) = (1 − r exp(iθ)z)(1 − r exp(−iθ)) = 1 − 2r cos(θ)z + r2 z 2 ,

this leads to the AR(2) model

Xt = 2r cos(θ)Xt−1 − r2 Xt−2 + εt

where {εt } are iid random variables. To ensure it is causal we set |r| < 1. In the simulations below
we consider the models

Xt = 2r cos(π/3)Xt−1 − r2 Xt−2 + εt

and

Xt = 2r cos(0)Xt−1 − r2 Xt−2 + εt

for r = 0.5 and r = 0.9. The latter model has real roots and its characteristic
polynomial is $\phi(z) = (1 - rz)^2$.
In Figures 4.3 and 4.4 we plot a typical realisation from these models with n = 200 and the
corresponding periodogram for the case θ = π/3. In Figures 4.5 and 4.6 we plot a typical
realisation and the corresponding periodogram for the case θ = 0.
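The link between the AR(2) coefficients and the polar form (r, θ) used above can be checked numerically. The following is a minimal Python sketch (the notes themselves use R); the function names `ar2_from_polar` and `polar_from_ar2` are hypothetical helpers introduced only for this illustration. It recovers r, θ and the pseudo-period 2π/θ from the coefficients of the fitted sunspot model.

```python
import math

def ar2_from_polar(r, theta):
    """AR(2) coefficients (phi1, phi2) of X_t = phi1*X_{t-1} + phi2*X_{t-2} + eps_t."""
    return 2.0 * r * math.cos(theta), -r * r

def polar_from_ar2(phi1, phi2):
    """Return (r, theta) when 1 - phi1*z - phi2*z^2 has complex roots."""
    if phi1 * phi1 + 4.0 * phi2 >= 0.0:
        raise ValueError("the characteristic polynomial has real roots")
    r = math.sqrt(-phi2)                   # the roots have modulus 1/r
    theta = math.acos(phi1 / (2.0 * r))    # phase of the roots
    return r, theta

# Coefficients of the AR(2) model fitted to the sunspot data (from the text).
r, theta = polar_from_ar2(1.381, -0.6807)
pseudo_period = 2.0 * math.pi / theta
print(r, theta, pseudo_period)
```

The same helper can be used in the other direction, for example `ar2_from_polar(0.9, math.pi / 3)` gives the coefficients of one of the models simulated in this section.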
From the realisations and the periodograms we observe a periodicity centred about frequency
π/3 or 0 (depending on the model). We also observe that the larger r is, the more pronounced the
period. For frequency 0 there is no period; it simply looks like a trend (very low frequency
behaviour). But the AR(2) is a completely stochastic (random) system, so it is strange that it exhibits
behaviour close to periodic. We explain why in the following section.

Figure 4.3: Realisation for $X_t = 2r\cos(\pi/3)X_{t-1} - r^2X_{t-2} + \varepsilon_t$ (plotted against time). Blue = r = 0.5 and red =
r = 0.9.

Figure 4.4: Periodogram for realisation from $X_t = 2r\cos(\pi/3)X_{t-1} - r^2X_{t-2} + \varepsilon_t$ (plotted against freq). Blue =
r = 0.5 and red = r = 0.9.
We conclude this section by showing what shape the periodogram is trying to mimic (but not
so well!). It will be shown later on that the expectation of the periodogram is roughly equal to
the spectral density function of the AR(2) process, which is

\[ f(\omega) = \frac{1}{|1 - \phi_1 e^{i\omega} - \phi_2 e^{i2\omega}|^2} = \frac{1}{|1 - 2r\cos\theta\, e^{i\omega} + r^2 e^{2i\omega}|^2}. \]
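To see how r and θ shape this function, the spectral density formula can be evaluated on a grid of frequencies. The sketch below is an illustration in Python under our own naming (`ar2_spectral_density` and `peak_frequency` are hypothetical helpers, not functions from the course code); it locates the largest value of f(ω) on [0, π].

```python
import cmath
import math

def ar2_spectral_density(omega, r, theta):
    """f(omega) = 1 / |1 - 2 r cos(theta) e^{i omega} + r^2 e^{2 i omega}|^2."""
    z = cmath.exp(1j * omega)
    return 1.0 / abs(1.0 - 2.0 * r * math.cos(theta) * z + r * r * z * z) ** 2

def peak_frequency(r, theta, grid_size=2000):
    """Frequency in [0, pi] (on an equally spaced grid) where f is largest."""
    grid = [math.pi * k / grid_size for k in range(grid_size + 1)]
    return max(grid, key=lambda w: ar2_spectral_density(w, r, theta))

for r in (0.5, 0.9):
    w_star = peak_frequency(r, math.pi / 3)
    print(r, w_star, ar2_spectral_density(w_star, r, math.pi / 3))
```

For r close to one the peak sits very near θ and is tall and narrow; for small r it is lower, broader and shifted, which agrees with the remark that a larger r gives a more pronounced period.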

Plots of the spectral density for θ = π/3, θ = 0 and r = 0.5 and 0.9 are given in Figures 4.7 and
4.8. Observe that the shapes in Figures 4.4 and 4.6 match those in Figures 4.7 and 4.8. But the
periodogram is very rough whereas the spectral density is smooth. This is because the periodogram
is simply a mirror of all the frequencies in the observed time series, and the actual time series does
not contain any pure frequencies. It is a mishmash of cosines and sines, hence the messiness of the
periodogram.

Figure 4.5: Realisation for $X_t = 2rX_{t-1} - r^2X_{t-2} + \varepsilon_t$ (plotted against time). Blue = r = 0.5 and red = r = 0.9.

Figure 4.6: Periodogram for realisation from $X_t = 2rX_{t-1} - r^2X_{t-2} + \varepsilon_t$ (plotted against freq). Blue = r = 0.5
and red = r = 0.9.

Figure 4.7: Spectral density for $X_t = 2r\cos(\pi/3)X_{t-1} - r^2X_{t-2} + \varepsilon_t$ (plotted against freq). Blue = r = 0.5 and red
= r = 0.9.

Figure 4.8: Spectral density for $X_t = 2rX_{t-1} - r^2X_{t-2} + \varepsilon_t$ (plotted against freq). Blue = r = 0.5 and red =
r = 0.9.

4.3.7 Derivation of “Pseudo” periodicity functions in an AR(2)


We now explain why the AR(2) (and higher orders) can characterise some very interesting behaviour
(over the rather dull AR(1)). For now we assume that Xt is a causal time series which satisfies the
AR(2) representation

Xt = φ1 Xt−1 + φ2 Xt−2 + εt

where {εt } are iid with mean zero and finite variance. We focus on the case that the characteristic
polynomial has complex roots, that is $\phi(z) = (1 - \lambda z)(1 - \bar{\lambda}z)$ with $\lambda = r\exp(i\theta)$ and
$\bar{\lambda} = r\exp(-i\theta)$. Thus our focus is on the AR(2) model

\[ X_t = 2r\cos(\theta)X_{t-1} - r^2X_{t-2} + \varepsilon_t, \qquad |r| < 1. \]

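By using equation (4.8) with $a = \lambda$ and $b = \bar{\lambda}$,

\[ X_t = \frac{1}{\lambda - \bar{\lambda}}\sum_{j=0}^{\infty}\left(\lambda^{j+1} - \bar{\lambda}^{j+1}\right)\varepsilon_{t-j}. \]

We reparameterize $\lambda = re^{i\theta}$ (noting that |r| < 1). Then

\[ X_t = \frac{1}{2r\sin\theta}\sum_{j=0}^{\infty} 2r^{j+1}\sin((j+1)\theta)\,\varepsilon_{t-j}. \tag{4.12} \]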

We can see that Xt is effectively a sum of sines with frequency θ that have been modulated
by the iid errors and exponentially damped. This is why for realisations of autoregressive processes
you will often see periodicities (depending on the roots of the characteristic polynomial). Thus an
autoregressive model offers one way to include periodicities in a time series. These arguments can be
generalised to higher order autoregressive models.
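The closed form in (4.12) says that the coefficient of εt−j is r^j sin((j + 1)θ)/ sin θ. A quick numerical check, written as a Python sketch with hypothetical helper names, compares this with the coefficients obtained by unrolling the AR(2) recursion directly.

```python
import math

def ar2_ma_coefficients(r, theta, n_terms):
    """MA(infinity) coefficients of X_t = 2 r cos(theta) X_{t-1} - r^2 X_{t-2} + eps_t,
    obtained from psi_j = phi1 * psi_{j-1} + phi2 * psi_{j-2} with psi_0 = 1."""
    phi1, phi2 = 2.0 * r * math.cos(theta), -r * r
    psi = [1.0, phi1]
    for j in range(2, n_terms):
        psi.append(phi1 * psi[j - 1] + phi2 * psi[j - 2])
    return psi[:n_terms]

def closed_form_coefficient(r, theta, j):
    """Coefficient of eps_{t-j} read off from equation (4.12)."""
    return r ** j * math.sin((j + 1) * theta) / math.sin(theta)

r, theta = 0.9, math.pi / 3
recursive = ar2_ma_coefficients(r, theta, 20)
for j in range(5):
    print(j, recursive[j], closed_form_coefficient(r, theta, j))
```

Both routes give a damped sinusoid in j, which is the source of the pseudo-periodic behaviour.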

Exercise 4.5 (a) Obtain the stationary solution of the AR(2) process

\[ X_t = \frac{7}{3}X_{t-1} - \frac{2}{3}X_{t-2} + \varepsilon_t, \]

where {εt } are iid random variables with mean zero and variance σ².

Does the solution have an MA(∞) representation?

(b) Obtain the stationary solution of the AR(2) process

\[ X_t = \frac{4\times\sqrt{3}}{5}X_{t-1} - \frac{4^2}{5^2}X_{t-2} + \varepsilon_t, \]

where {εt } are iid random variables with mean zero and variance σ².

Does the solution have an MA(∞) representation?

(c) Obtain the stationary solution of the AR(2) process

\[ X_t = X_{t-1} - 4X_{t-2} + \varepsilon_t, \]

where {εt } are iid random variables with mean zero and variance σ².

Does the solution have an MA(∞) representation?

Exercise 4.6 Construct a causal stationary AR(2) process with pseudo-period 17. Using the R
function arima.sim simulate a realisation from this process (of length 200) and make a plot of the
periodogram. What do you observe about the peak in this plot?

4.3.8 Seasonal Autoregressive models


A popular autoregressive model that is often used for modelling seasonality is the seasonal
autoregressive model (SAR). To motivate the model consider the monthly average temperatures in College
Station. Let {Xt } denote the monthly temperatures. If you have had any experience with
temperatures in College Station, using the average temperature in October (still hot) to predict the
average temperature in November (starts to cool) may not seem reasonable. It may seem more
reasonable to use the temperature last November. We can do this using the following model

\[ X_t = \phi X_{t-12} + \varepsilon_t, \]

where |φ| < 1. This is an AR(12) model in disguise. The characteristic polynomial $\phi(z) = 1 - \phi z^{12}$
has roots $\lambda_j = \phi^{-1/12}\exp(i2\pi j/12)$ for j = 0, 1, . . . , 11. As there are 5 complex conjugate pairs and
two real roots, we would expect to see 7 peaks in the periodogram and spectral density. The spectral
density is

\[ f(\omega) = \frac{1}{|1 - \phi e^{i12\omega}|^2}. \]

A realisation from the above model with φ = 0.8 and n = 200 is given in Figure 4.9. The
corresponding periodogram and spectral density are given in Figure 4.10. We observe that the
periodogram captures the general peaks in the spectral density, but is a lot messier.
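The location of the 7 peaks can be verified by evaluating the spectral density at the frequencies 2πk/12, k = 0, . . . , 6, and half way between them. The following Python sketch is only an illustration; `sar_spectral_density` is a hypothetical helper name.

```python
import cmath
import math

def sar_spectral_density(omega, phi, period=12):
    """f(omega) = 1 / |1 - phi e^{i * period * omega}|^2 for X_t = phi X_{t-period} + eps_t."""
    return 1.0 / abs(1.0 - phi * cmath.exp(1j * period * omega)) ** 2

phi = 0.8
peaks = [2.0 * math.pi * k / 12 for k in range(7)]            # 0, pi/6, ..., pi
troughs = [2.0 * math.pi * (k + 0.5) / 12 for k in range(6)]  # half way between peaks

for w in peaks:
    print("peak  ", round(w, 3), sar_spectral_density(w, phi))
for w in troughs:
    print("trough", round(w, 3), sar_spectral_density(w, phi))
```

At each peak the density equals 1/(1 − φ)², and at each trough 1/(1 + φ)², so the contrast between them grows quickly as φ approaches one.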

4.3.9 Solution of the general AR(∞) model (advanced)


The AR(∞) model generalizes the AR(p)

\[ X_t = \sum_{j=1}^{\infty}\phi_jX_{t-j} + \varepsilon_t \]

where {εt } are iid random variables. AR(∞) models are more general than the AR(p) model and
are able to model more complex behaviour, such as slower decay of the covariance structure.
In order to obtain the stationary solution of an AR(∞), we need to define an analytic function
and its inverse.

Figure 4.9: Realisation from SAR(12) with φ = 0.8 (temp plotted against time).

Figure 4.10: Left: Periodogram of realisation. Right: Spectral density of model (both plotted against freq).

Definition 4.3.1 (Analytic functions in the region Ω) Suppose that z ∈ C. φ(z) is an
analytic complex function in the region Ω, if it has a power series expansion which converges in Ω, that
is $\phi(z) = \sum_{j=-\infty}^{\infty}\phi_jz^j$.

If there exists a function $\tilde{\phi}(z) = \sum_{j=-\infty}^{\infty}\tilde{\phi}_jz^j$ such that $\tilde{\phi}(z)\phi(z) = 1$ for all z ∈ Ω, then $\tilde{\phi}(z)$
is the inverse of φ(z) in the region Ω.

Example 4.3.1 (Analytic functions) (i) Clearly a(z) = 1 − 0.5z is analytic for all z ∈ C,
and has no zeros for |z| < 2. The inverse $\frac{1}{a(z)} = \sum_{j=0}^{\infty}(0.5z)^j$ is well defined in the region
|z| < 2.

(ii) Clearly a(z) = 1 − 2z is analytic for all z ∈ C, and has no zeros for |z| > 1/2. The inverse
$\frac{1}{a(z)} = (-2z)^{-1}(1 - 1/(2z))^{-1} = (-2z)^{-1}\sum_{j=0}^{\infty}(1/(2z))^j$ is well defined in the region |z| > 1/2.

(iii) The function $a(z) = \frac{1}{(1-0.5z)(1-2z)}$ is analytic in the region 0.5 < |z| < 2.

(iv) a(z) = 1 − z is analytic for all z ∈ C, but is zero for z = 1. Hence its inverse is not well
defined for regions which involve |z| = 1 (see Example 4.7).

(v) Finite order polynomials such as $\phi(z) = \sum_{j=0}^{p}\phi_jz^j$ for Ω = C.

(vi) The expansion $(1 - 0.5z)^{-1} = \sum_{j=0}^{\infty}(0.5z)^j$ for Ω = {z; |z| < 2}.

We observe that for AR processes we can represent the equation as $\phi(B)X_t = \varepsilon_t$, which formally
gives the solution $X_t = \phi(B)^{-1}\varepsilon_t$. This raises the question, under what conditions on $\phi(B)^{-1}$ is
$\phi(B)^{-1}\varepsilon_t$ a valid solution? For $\phi(B)^{-1}\varepsilon_t$ to make sense, $\phi(B)^{-1}$ should be represented as a power
series expansion. Below, we state a technical lemma on φ(z) which we use to obtain a stationary
solution.

Lemma 4.3.2 (Technical lemma) Suppose that $\psi(z) = \sum_{j=-\infty}^{\infty}\psi_jz^j$ is finite on a region that
includes |z| = 1 (we say it is analytic in the region |z| = 1). Then $\sum_{j=-\infty}^{\infty}|\psi_j| < \infty$.

An immediate consequence of the lemma above is that if $\psi(z) = \sum_{j=-\infty}^{\infty}\psi_jz^j$ is analytic in
the region and {Xt } is a strictly stationary time series, where $E|X_t| < \infty$, then we can define the time series
$Y_t = \psi(B)X_t = \sum_{j=-\infty}^{\infty}\psi_jX_{t-j}$. By the lemma above and Lemma 4.2.1, {Yt } is an almost
surely finite and strictly stationary time series. We use this result to obtain a solution of an
AR(∞) (which includes an AR(p) as a special case).

Lemma 4.3.3 Suppose $\phi(z) = 1 - \sum_{j=1}^{\infty}\phi_jz^j$ and $\psi(z) = \sum_{j=-\infty}^{\infty}\psi_jz^j$ are analytic functions in a
region which contains |z| = 1 and $\phi(z)\psi(z) = 1$ for all |z| = 1. Then the AR(∞) process

\[ X_t = \sum_{j=1}^{\infty}\phi_jX_{t-j} + \varepsilon_t \]

has the unique solution

\[ X_t = \sum_{j=-\infty}^{\infty}\psi_j\varepsilon_{t-j}. \]

We can immediately apply the lemma to find conditions under which the AR(p) process will admit
a stationary solution. Note that this is a generalisation of Lemma 4.3.1.
Rules of the back shift operator:

(i) If a(z) is analytic in a region Ω which includes the unit circle |z| = 1 in its interior and {Yt } is
a well defined time series, then Xt defined by Xt = a(B)Yt is a well defined random variable.

(ii) The operator is commutative and associative, that is [a(B)b(B)]Xt = a(B)[b(B)Xt ] =
[b(B)a(B)]Xt (the square brackets are used to indicate which parts to multiply first). This
may seem obvious, but remember matrices are not commutative!

(iii) Suppose that a(z) and its inverse $\frac{1}{a(z)}$ are both analytic in the region Ω which includes
the unit circle |z| = 1 in its interior. If a(B)Xt = Zt , then $X_t = \frac{1}{a(B)}Z_t$.

The magic backshift operator

A precise proof of Lemma 4.3.3 and the rules of the back shift operator described above is beyond
these notes. But we briefly describe the idea, so the backshift operator feels less like a magic trick.
Equation (4.7) is an infinite dimensional matrix operation that maps $\ell_2$-sequences to $\ell_2$-sequences,
where $\Gamma : \ell_2 \to \ell_2$ and $\Gamma(x) = \varepsilon$ with $x = (\ldots, x_{-1}, x_0, x_1, \ldots)$. Thus $x = \Gamma^{-1}\varepsilon$. The objective is to
find the coefficients in the operator $\Gamma^{-1}$. It is easier to do this by transforming the operator to the
Fourier domain with the Fourier operator $F : \ell_2 \to L_2[0, 2\pi]$ and $F^* : L_2[0, 2\pi] \to \ell_2$. Thus $F\Gamma F^*$
is an integral operator with kernel $K(\lambda, \omega) = \phi(e^{i\omega})\delta_{\omega=\lambda}$. It can be shown that the inverse operator
$(F\Gamma^{-1}F^*)$ has kernel $K^{-1}(\lambda, \omega) = \phi(e^{i\omega})^{-1}\delta_{\omega=\lambda}$. One can then deduce that the coefficients of $\Gamma^{-1}$
are the Fourier coefficients $\int_0^{2\pi}\phi(e^{i\omega})^{-1}e^{-ij\omega}\,d\omega$, which correspond to the expansion of $\phi(z)^{-1}$ that
converges in the region that includes |z| = 1 (the Laurent series in this region).
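The Fourier coefficient description can be tried out numerically. The sketch below (Python, with hypothetical helper names) takes φ(z) = (1 − 0.5z)(1 − 2z), whose inverse appears in Example 4.3.1(iii), and approximates the normalised Fourier coefficients (2π)^{−1} ∫ φ(e^{iω})^{−1} e^{−ijω} dω by a Riemann sum; the factor (2π)^{−1} is our normalisation assumption. These are compared with the two-sided Laurent coefficients obtained by partial fractions, −(1/3)0.5^j for j ≥ 0 and −(4/3)0.5^{|j|} for j < 0.

```python
import cmath
import math

def phi(z):
    """Characteristic polynomial with one zero inside and one outside the unit circle."""
    return (1.0 - 0.5 * z) * (1.0 - 2.0 * z)

def inverse_coefficient(j, n_grid=512):
    """Riemann-sum approximation of (1/2pi) * integral of e^{-i j w} / phi(e^{i w}) dw."""
    total = 0j
    for k in range(n_grid):
        w = 2.0 * math.pi * k / n_grid
        total += cmath.exp(-1j * j * w) / phi(cmath.exp(1j * w))
    return (total / n_grid).real

def laurent_coefficient(j):
    """Coefficient of z^j in the expansion of 1/phi(z) valid for 0.5 < |z| < 2."""
    if j >= 0:
        return -(0.5 ** j) / 3.0
    return -(4.0 / 3.0) * 0.5 ** (-j)

for j in range(-3, 4):
    print(j, inverse_coefficient(j), laurent_coefficient(j))
```

The coefficients are non-zero for both positive and negative j, so the corresponding solution of φ(B)Xt = εt involves past and future innovations.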

AR(∞) representation of stationary time series (Advanced)

If a time series is second order stationary and its spectral density function $f(\omega) = (2\pi)^{-1}\sum_{r\in\mathbb{Z}}c(r)e^{ir\omega}$
is bounded away from zero (is not zero) and is finite on [0, π], then it will have an AR(∞)
representation

\[ X_t = \sum_{j=1}^{\infty}a_jX_{t-j} + \varepsilon_t, \]

the difference being that {εt } are uncorrelated random variables and may not be iid random
variables. This result is useful when finding the best linear predictors of Xt given the past.

4.4 Simulating from an Autoregressive process


Simulating from a Gaussian AR process

We start with the case that the innovations, {εt }, are Gaussian. In this case, by using Lemma
4.5.1(ii) we observe that all AR processes can be written as an infinite sum of the innovations. As
sums of iid Gaussian random variables are Gaussian, the resulting time series is also Gaussian.
We show in Chapter 6 that given any causal AR equation, the covariance structure of the time
series can be deduced. Since normal random variables are fully determined by their mean and
variance matrix, using a multivariate normal sampler such as mvrnorm (in the R package MASS) and
$\mathrm{var}[\underline{X}_p] = \Sigma_p$, we can simulate the first p elements
in the time series $\underline{X}_p = (X_1, \ldots, X_p)$. Then by simulating (n − p) iid random variables we can
generate Xt using the causal recursion

\[ X_t = \sum_{j=1}^{p}\phi_jX_{t-j} + \varepsilon_t. \]

Remark 4.4.1 Any non-causal system of difference equations with Gaussian innovations can al-
ways be rewritten as a causal system. This property is unique for Gaussian processes.

A worked example

We illustrate the details with an AR(1) process. Suppose Xt = φ1 Xt−1 + εt where {εt } are iid
standard normal random variables (note that for Gaussian processes it is impossible to discriminate
between causal and non-causal processes - see Section 6.4, therefore we will assume |φ1 | < 1). We
will show in Section 6.1, equation (6.1) that the autocovariance of an AR(1) is

\[ c(r) = \phi_1^r\sum_{j=0}^{\infty}\phi_1^{2j} = \frac{\phi_1^r}{1 - \phi_1^2}. \]

Therefore, the marginal distribution of Xt is Gaussian with variance $(1 - \phi_1^2)^{-1}$. Therefore, to
simulate an AR(1) Gaussian time series, we draw from a Gaussian distribution with mean zero and
variance $(1 - \phi_1^2)^{-1}$, calling this X1 . We then iterate Xt = φ1 Xt−1 + εt for t ≥ 2. This will give us
a stationary realization from an AR(1) Gaussian time series.
Note the function arima.sim is a routine in R which does the above. See below for details.
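The steps in this worked example can be sketched in a few lines. The version below is written in Python rather than R and uses only the standard library; `simulate_gaussian_ar1` is a hypothetical name chosen for this illustration, and it is not meant to reproduce what arima.sim does internally.

```python
import math
import random

def simulate_gaussian_ar1(phi1, n, seed=0):
    """Stationary Gaussian AR(1): X_1 ~ N(0, 1/(1 - phi1^2)), then X_t = phi1 X_{t-1} + eps_t."""
    if abs(phi1) >= 1.0:
        raise ValueError("need |phi1| < 1 for a causal stationary solution")
    rng = random.Random(seed)
    # Draw the first value from the stationary marginal distribution.
    x = [rng.gauss(0.0, math.sqrt(1.0 / (1.0 - phi1 ** 2)))]
    # Iterate the AR(1) recursion with iid standard normal innovations.
    for _ in range(1, n):
        x.append(phi1 * x[-1] + rng.gauss(0.0, 1.0))
    return x

path = simulate_gaussian_ar1(0.5, 200)
print(path[:5])
```

Because the first value is drawn from the stationary distribution, no burn-in period is needed.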

Simulating from a non-Gaussian causal AR model

Unlike the Gaussian AR process it is difficult to simulate an exact non-Gaussian model, but we
can obtain a very close approximation. This is because if the innovations are non-Gaussian the
distribution of Xt is not simple. Here we describe how to obtain a close approximation in the case
that the AR process is causal.
A worked example We describe a method for simulating an AR(1). Let {Xt } be an AR(1) process,
Xt = φ1 Xt−1 + εt , which has the stationary, causal solution

\[ X_t = \sum_{j=0}^{\infty}\phi_1^j\varepsilon_{t-j}. \]

To simulate from the above model, we set X̃1 = 0. Then we iterate X̃t = φ1 X̃t−1 + εt for
t ≥ 2. We note that the solution of this equation is

\[ \tilde{X}_t = \sum_{j=0}^{t-2}\phi_1^j\varepsilon_{t-j}. \]

We recall from Lemma 4.5.1 that $|X_t - \tilde{X}_t| \le |\phi_1|^{t-1}\sum_{j=0}^{\infty}|\phi_1^j\varepsilon_{1-j}|$, which converges geometrically fast
to zero. Thus if we choose a large n to allow ‘burn in’ and use {X̃t ; t ≥ n} in the simulations we
have a simulation which is close to a stationary solution from an AR(1) process. Using the same
method one can simulate causal AR(p) models too.
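The burn-in idea can be illustrated as follows. This Python sketch uses centred exponential innovations as one possible non-Gaussian choice (an assumption made here, not something specified in the notes), and `simulate_ar1_burn_in` is a hypothetical helper name. Running the recursion from two different starting values with the same innovations shows how quickly the effect of the initial value disappears.

```python
import random

def simulate_ar1_burn_in(phi1, n, burn_in=500, x_init=0.0, seed=0):
    """Iterate X_t = phi1 X_{t-1} + eps_t from x_init and discard the first burn_in values."""
    rng = random.Random(seed)
    x = x_init
    path = []
    for _ in range(burn_in + n):
        eps = rng.expovariate(1.0) - 1.0  # centred exponential: mean 0, variance 1
        x = phi1 * x + eps
        path.append(x)
    return path[burn_in:]

a = simulate_ar1_burn_in(0.7, 5, burn_in=0, x_init=0.0)
b = simulate_ar1_burn_in(0.7, 5, burn_in=0, x_init=100.0)
print([u - v for u, v in zip(a, b)])   # differences shrink by a factor 0.7 each step

a = simulate_ar1_burn_in(0.7, 5, burn_in=500, x_init=0.0)
b = simulate_ar1_burn_in(0.7, 5, burn_in=500, x_init=100.0)
print([u - v for u, v in zip(a, b)])   # after the burn-in the start value no longer matters
```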

Building AR(p) models One problem with the above approach is that the AR(p) coefficients {φj } should
be chosen such that they correspond to a causal solution. This is not so simple. It is easier to build
a causal AR(p) model from its factorisation:

\[ \phi(B) = \prod_{j=1}^{p}(1 - \lambda_jB). \]

Thus φ(B)Xt = εt can be written as

\[ \phi(B)X_t = (1 - \lambda_pB)(1 - \lambda_{p-1}B)\cdots(1 - \lambda_1B)X_t = \varepsilon_t. \]

Using the above representation Xt can be simulated using a recursion. For simplicity we assume
p = 2 and φ(B)Xt = (1 − λ2 B)(1 − λ1 B)Xt = εt . First define the AR(1) model

\[ (1 - \lambda_1B)Y_{1,t} = \varepsilon_t \;\Rightarrow\; Y_{1,t} = (1 - \lambda_1B)^{-1}\varepsilon_t. \]

This gives

\[ (1 - \lambda_2B)X_t = (1 - \lambda_1B)^{-1}\varepsilon_t = Y_{1,t}. \]

Thus we first simulate {Y1,t }t using the AR(1) method described above. We then simulate

\[ (1 - \lambda_2B)X_t = Y_{1,t}, \]

again using the AR(1) method described above, but treating {Y1,t }t as the innovations. This method can
easily be generalized to any AR(p) model (with real roots). Below we describe how to do the same
when the roots are complex.
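The cascade can be checked against the direct AR(2) recursion. In the Python sketch below (helper names are hypothetical, and both recursions start from zero rather than using a burn-in) the two-stage simulation with factors (1 − λ2 B)(1 − λ1 B) is compared with the recursion Xt = (λ1 + λ2 )Xt−1 − λ1 λ2 Xt−2 + εt driven by the same innovations.

```python
import random

def ar1_filter(lam, innovations):
    """Run Y_t = lam * Y_{t-1} + e_t from a zero start, treating `innovations` as e_t."""
    out, prev = [], 0.0
    for e in innovations:
        prev = lam * prev + e
        out.append(prev)
    return out

def ar2_direct(phi1, phi2, innovations):
    """Run X_t = phi1 X_{t-1} + phi2 X_{t-2} + e_t from a zero start."""
    x = []
    for t, e in enumerate(innovations):
        x1 = x[t - 1] if t >= 1 else 0.0
        x2 = x[t - 2] if t >= 2 else 0.0
        x.append(phi1 * x1 + phi2 * x2 + e)
    return x

rng = random.Random(0)
eps = [rng.uniform(-1.0, 1.0) for _ in range(300)]   # non-Gaussian innovations
lam1, lam2 = 0.5, -0.3

y = ar1_filter(lam1, eps)           # first stage: (1 - lam1 B) Y_t = eps_t
x_cascade = ar1_filter(lam2, y)     # second stage: (1 - lam2 B) X_t = Y_t
x_direct = ar2_direct(lam1 + lam2, -lam1 * lam2, eps)
print(max(abs(u - v) for u, v in zip(x_cascade, x_direct)))
```

Building the model from |λj | < 1 guarantees causality without having to check the roots of the expanded polynomial.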

Simulating an AR(2) with complex roots Suppose that Xt has a causal AR(2) representation. The
roots can be complex, but since Xt is real, the roots must be conjugates (λ1 = r exp(iθ) and
λ2 = r exp(−iθ)). This means Xt satisfies the representation

\[ (1 - 2r\cos(\theta)B + r^2B^2)X_t = \varepsilon_t \]

where |r| < 1. Now by using the same burn-in method described for simulating an AR(1), applied to the
recursion $X_t = 2r\cos(\theta)X_{t-1} - r^2X_{t-2} + \varepsilon_t$, we can simulate
an AR(2) model with complex roots.
In summary, by using the method for simulating AR(1) and AR(2) models we can simulate any
AR(p) model with both real and complex roots.

Simulating from a fully non-causal AR model

Suppose that {Xt } is an AR(p) model with characteristic polynomial φ(B), whose roots lie inside the
unit circle (fully non-causal). Then we can simulate Xt using the backward recursion

\[ X_{t-p} = \phi_p^{-1}\left[-\phi_{p-1}X_{t-p+1} - \ldots - \phi_1X_{t-1} + X_t\right] - \phi_p^{-1}\varepsilon_t. \tag{4.13} \]
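For p = 1 the backward recursion (4.13) reads Xt−1 = φ1^{−1}(Xt − εt ). A minimal Python sketch of this case is given below; the function name, the uniform innovations and the choice to start from zero at the far end and discard a burn-in stretch there are all assumptions made for the illustration.

```python
import random

def simulate_noncausal_ar1(phi1, n, burn_in=200, seed=0):
    """Simulate X_t = phi1 X_{t-1} + eps_t with |phi1| > 1 by running time backwards."""
    if abs(phi1) <= 1.0:
        raise ValueError("need |phi1| > 1 for the fully non-causal case")
    rng = random.Random(seed)
    total = n + burn_in
    eps = [rng.uniform(-1.0, 1.0) for _ in range(total)]
    x = [0.0] * total                      # x[total - 1] = 0 is the arbitrary end value
    for t in range(total - 1, 0, -1):
        x[t - 1] = (x[t] - eps[t]) / phi1  # backward recursion (4.13) with p = 1
    # The burn-in lies at the *end* of the series, so keep the first n values.
    return x[:n], eps[:n]

x, eps = simulate_noncausal_ar1(2.0, 100)
print(max(abs(x[t] - 2.0 * x[t - 1] - eps[t]) for t in range(1, 100)))
```

The printed value is numerically zero, which confirms that the simulated path satisfies the (non-causal) AR(1) equation while staying bounded.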

Simulating from a non-Gaussian non-causal AR model

We now describe a method for simulating AR(p) models whose roots are both inside and outside the
unit circle. The innovations should be non-Gaussian, as it makes no sense to simulate a non-causal
Gaussian model and it is impossible to distinguish it from a corresponding causal Gaussian model.
The method described below was suggested by former TAMU PhD student Furlong Li.

Worked example To simplify the description consider the AR(2) model where φ(B) = (1 − λ1 B)(1 −
µ1 B) with |λ1 | < 1 (root outside the unit circle) and |µ1 | > 1 (root inside the unit circle). Then

\[ (1 - \lambda_1B)(1 - \mu_1B)X_t = \varepsilon_t. \]

Define the non-causal AR(1) model

\[ (1 - \mu_1B)Y_{1,t} = \varepsilon_t \]

and simulate {Y1,t } using a backward recursion. Then treat {Y1,t } as the innovations and simulate
the causal AR(1)

\[ (1 - \lambda_1B)X_t = Y_{1,t} \]

using a forward recursion. This gives an AR(2) model whose roots lie inside and outside the unit
circle. The same method can be generalized to any non-causal AR(p) model.
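The two passes of this worked example can be sketched as follows (Python, with hypothetical names; uniform innovations are used as one non-Gaussian choice and burn-in stretches are discarded at both ends). The check at the end confirms that the output satisfies Xt − (λ1 + µ1 )Xt−1 + λ1 µ1 Xt−2 = εt .

```python
import random

def simulate_mixed_ar2(lam, mu, n, burn_in=200, seed=0):
    """Simulate (1 - lam B)(1 - mu B) X_t = eps_t with |lam| < 1 and |mu| > 1."""
    rng = random.Random(seed)
    total = n + 2 * burn_in
    eps = [rng.uniform(-1.0, 1.0) for _ in range(total)]
    # Backward pass: non-causal AR(1), (1 - mu B) Y_t = eps_t.
    y = [0.0] * total
    for t in range(total - 1, 0, -1):
        y[t - 1] = (y[t] - eps[t]) / mu
    # Forward pass: causal AR(1), (1 - lam B) X_t = Y_t, with Y as the innovations.
    x = [0.0] * total
    x[0] = y[0]
    for t in range(1, total):
        x[t] = lam * x[t - 1] + y[t]
    keep = slice(burn_in, burn_in + n)     # drop burn-in at both ends
    return x[keep], eps[keep]

lam, mu = 0.5, 2.0
x, eps = simulate_mixed_ar2(lam, mu, 100)
print(max(abs(x[t] - (lam + mu) * x[t - 1] + lam * mu * x[t - 2] - eps[t])
          for t in range(2, 100)))
```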

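Exercise 4.7 In the following simulations, use non-Gaussian innovations.

(i) Simulate a stationary AR(4) process with characteristic function

\[ \phi(z) = \left(1 - 0.8\exp\left(i\tfrac{2\pi}{13}\right)z\right)\left(1 - 0.8\exp\left(-i\tfrac{2\pi}{13}\right)z\right)\left(1 - 1.5\exp\left(i\tfrac{2\pi}{5}\right)z\right)\left(1 - 1.5\exp\left(-i\tfrac{2\pi}{5}\right)z\right). \]

(ii) Simulate a stationary AR(4) process with characteristic function

\[ \phi(z) = \left(1 - 0.8\exp\left(i\tfrac{2\pi}{13}\right)z\right)\left(1 - 0.8\exp\left(-i\tfrac{2\pi}{13}\right)z\right)\left(1 - \tfrac{2}{3}\exp\left(i\tfrac{2\pi}{5}\right)z\right)\left(1 - \tfrac{2}{3}\exp\left(-i\tfrac{2\pi}{5}\right)z\right). \]

Do you observe any differences between these realisations?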

R functions

Shumway and Stoffer (2006) and David Stoffer’s website give a comprehensive introduction to time
series R-functions.
The function arima.sim simulates from a Gaussian ARIMA process. For example,
arima.sim(list(order=c(2,0,0), ar = c(1.5, -0.75)), n=150) simulates from the AR(2) model
Xt = 1.5Xt−1 − 0.75Xt−2 + εt , where the innovations are Gaussian.

4.5 The ARMA model


Up to now, we have focussed on the autoregressive model. The MA(q) in many respects is a much
simpler model to understand. In this case the time series is a weighted sum of independent latent
variables

\[ X_t = \varepsilon_t + \theta_1\varepsilon_{t-1} + \ldots + \theta_q\varepsilon_{t-q} = \varepsilon_t + \sum_{j=1}^{q}\theta_j\varepsilon_{t-j}. \tag{4.14} \]

We observe that Xt is independent of any Xt−j where |j| ≥ q + 1. In contrast, for an AR(p)
model, there is dependence between Xt and the time series at all other time points (we have
shown above that if the AR(p) is causal, then it can be written as an MA(∞), thus the dependency at
all lags). There are advantages and disadvantages of using either model. The MA(q) is independent
after q lags (which may not be viewed as realistic). But for many data sets simply fitting an
AR(p) model to the data and using a model selection criterion (such as AIC) may lead to the
selection of a large order p. This means the estimation of many parameters for a relatively small
data set. The AR(p) may not be parsimonious. A large order is usually chosen when the
correlations tend to decay slowly and/or the autocorrelation structure is quite complex (not just
monotonically decaying). However, a model involving 10-15 unknown parameters is not particularly
parsimonious and more parsimonious models which can model the same behaviour would be useful.

A very useful generalisation which can be more flexible (and parsimonious) is the ARMA(p, q)
model. In this case Xt has the representation

\[ X_t - \sum_{i=1}^{p}\phi_iX_{t-i} = \varepsilon_t + \sum_{j=1}^{q}\theta_j\varepsilon_{t-j}. \]

Definition 4.5.1 (Summary of AR, ARMA and MA models) (i) The autoregressive AR(p)
model: {Xt } satisfies

\[ X_t = \sum_{i=1}^{p}\phi_iX_{t-i} + \varepsilon_t. \tag{4.15} \]

Observe we can write it as φ(B)Xt = εt .

(ii) The moving average MA(q) model: {Xt } satisfies

\[ X_t = \varepsilon_t + \sum_{j=1}^{q}\theta_j\varepsilon_{t-j}. \tag{4.16} \]

Observe we can write Xt = θ(B)εt .

(iii) The autoregressive moving average ARMA(p, q) model: {Xt } satisfies

\[ X_t - \sum_{i=1}^{p}\phi_iX_{t-i} = \varepsilon_t + \sum_{j=1}^{q}\theta_j\varepsilon_{t-j}. \tag{4.17} \]

We observe that we can write Xt as φ(B)Xt = θ(B)εt .

We now state some useful definitions.

Definition 4.5.2 (Causal and invertible) Consider the ARMA(p, q) model defined by

\[ X_t + \sum_{j=1}^{p}\psi_jX_{t-j} = \sum_{i=1}^{q}\theta_i\varepsilon_{t-i}, \]

where {εt } are iid random variables with mean zero and constant variance.

(i) An ARMA process is said to be causal if it has the representation

\[ X_t = \sum_{j=0}^{\infty}b_j\varepsilon_{t-j}. \]

(ii) An ARMA(p, q) process $X_t + \sum_{j=1}^{p}\psi_jX_{t-j} = \sum_{i=1}^{q}\theta_i\varepsilon_{t-i}$ (where {εt } are uncorrelated
random variables with mean zero and constant variance) is said to be invertible if it has the
representation

\[ X_t = \sum_{j=1}^{\infty}a_jX_{t-j} + \varepsilon_t. \]

We have already given conditions under which an AR(p) model (and consequently an ARMA(p, q)
model) is causal. We now look at when an MA(q) model is invertible (this allows us to write it as
an AR(∞) process).
A worked example Consider the MA(1) process

Xt = εt + θεt−1 ,

where {εt } are iid random variables. Our aim is to understand when Xt can have an AR(∞)
representation. We do this using the backshift notation. Recall Bεt = εt−1 ; substituting this into the
MA(1) model above gives

\[ X_t = (1 + \theta B)\varepsilon_t. \]

Thus at least formally

\[ \varepsilon_t = (1 + \theta B)^{-1}X_t. \]

We recall that the following equality holds

\[ (1 + \theta B)^{-1} = \sum_{j=0}^{\infty}(-\theta)^jB^j, \]

when |θB| < 1. Therefore if |θ| < 1, then

\[ \varepsilon_t = (1 + \theta B)^{-1}X_t = \sum_{j=0}^{\infty}(-\theta)^jB^jX_t = \sum_{j=0}^{\infty}(-\theta)^jX_{t-j}. \]

Rearranging the above gives the AR(∞) representation

\[ X_t = -\sum_{j=1}^{\infty}(-\theta)^jX_{t-j} + \varepsilon_t, \]

but observe this representation only holds if |θ| < 1.
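To see the inversion at work, the following Python sketch (hypothetical helper names, pre-sample innovation set to zero as a simplifying assumption) generates an MA(1) series and recovers εt from the truncated sum of (−θ)^j Xt−j over j = 0, . . . , m. The truncation error is of order |θ|^{m+1}, which is small when |θ| < 1.

```python
import random

def ma1(theta, eps):
    """X_t = eps_t + theta * eps_{t-1}, taking the pre-sample innovation to be zero."""
    return [eps[t] + theta * (eps[t - 1] if t >= 1 else 0.0) for t in range(len(eps))]

def recover_innovation(theta, x, t, m):
    """Truncated inversion: sum_{j=0}^{m} (-theta)^j * X_{t-j} (requires m <= t)."""
    return sum((-theta) ** j * x[t - j] for j in range(m + 1))

rng = random.Random(0)
eps = [rng.uniform(-1.0, 1.0) for _ in range(200)]
theta = 0.6
x = ma1(theta, eps)

t = 150
for m in (2, 10, 30):
    print(m, abs(recover_innovation(theta, x, t, m) - eps[t]))
```

Repeating the experiment with |θ| > 1 makes the error grow with m, which is the numerical face of non-invertibility.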

Conditions for invertibility of an MA(q) The MA(q) process can be written as

\[ X_t = \sum_{j=1}^{q}\theta_j\varepsilon_{t-j} + \varepsilon_t. \]

It will have an AR(∞) representation if the roots of the polynomial $\theta(z) = 1 + \sum_{j=1}^{q}\theta_jz^j$ lie outside
the unit circle (i.e. all the roots are greater than one in absolute value). Then we can write
$(1 + \sum_{j=1}^{q}\theta_jz^j)^{-1} = \sum_{j=0}^{\infty}\phi_jz^j$ and, setting $a_j = -\phi_j$, we have

\[ X_t = \sum_{j=1}^{\infty}a_jX_{t-j} + \varepsilon_t. \]

Causal and invertible solutions are useful in both estimation and forecasting (predicting the
future based on the current and past).
Below we give conditions for the ARMA to have a causal solution and also be invertible. We
also show that the coefficients of the MA(∞) representation of Xt will decay exponentially.

Lemma 4.5.1 Let us suppose Xt is an ARMA(p, q) process with representation given in Definition
4.5.1.

(i) If the roots of the polynomial φ(z) lie outside the unit circle, and are greater than (1 + δ) in
absolute value (for some δ > 0), then Xt almost surely has the solution

\[ X_t = \sum_{j=0}^{\infty}b_j\varepsilon_{t-j}, \tag{4.18} \]

where $\sum_j|b_j| < \infty$ (we note that really bj = bj (φ, θ) since it is a function of {φi } and {θi }).
Moreover for all j,

\[ |b_j| \le K\rho^j \tag{4.19} \]

for some finite constant K and 1/(1 + δ) < ρ < 1.

(ii) If the roots of φ(z) lie both inside and outside the unit circle and are, in absolute value, larger than
(1 + δ) or less than $(1 + \delta)^{-1}$ for some δ > 0, then we have

\[ X_t = \sum_{j=-\infty}^{\infty}b_j\varepsilon_{t-j}, \tag{4.20} \]

(a vector AR(1) is not possible), where

\[ |b_j| \le K\rho^{|j|} \tag{4.21} \]

for some finite constant K and 1/(1 + δ) < ρ < 1.

(iii) If the absolute values of the roots of $\theta(z) = 1 + \sum_{j=1}^{q}\theta_jz^j$ are greater than (1 + δ), then (4.17)
can be written as

\[ X_t = \sum_{j=1}^{\infty}a_jX_{t-j} + \varepsilon_t, \tag{4.22} \]

where

\[ |a_j| \le K\rho^j \tag{4.23} \]

for some finite constant K and 1/(1 + δ) < ρ < 1.
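The geometric bound in part (i) can be observed numerically. The Python sketch below (hypothetical helper name) computes the MA(∞) coefficients bj of a causal ARMA model by matching coefficients in φ(z)b(z) = θ(z), that is bj = θj + φ1 bj−1 + . . . + φp bj−p with b0 = 1. As an example it uses the AR polynomial (1 − 2 · 0.8 cos(π/3)z + 0.8² z²)(1 − 0.6z) and the MA polynomial 1 + 0.5z − 0.5z²; the rate ρ = 0.85 is an arbitrary choice between the largest inverse root modulus 0.8 and one.

```python
import math

def arma_ma_coefficients(ar, ma, n_terms):
    """b_j in X_t = sum_j b_j eps_{t-j} for X_t - sum_i ar[i-1] X_{t-i} = eps_t + sum_j ma[j-1] eps_{t-j}."""
    b = []
    for j in range(n_terms):
        value = 1.0 if j == 0 else (ma[j - 1] if j <= len(ma) else 0.0)
        for i, phi_i in enumerate(ar, start=1):
            if j - i >= 0:
                value += phi_i * b[j - i]
        b.append(value)
    return b

# Expand (1 - c z + 0.8^2 z^2)(1 - 0.6 z) with c = 2 * 0.8 * cos(pi/3) into AR coefficients.
c = 2.0 * 0.8 * math.cos(math.pi / 3)
ar = [c + 0.6, -(0.8 ** 2 + 0.6 * c), 0.6 * 0.8 ** 2]
ma = [0.5, -0.5]

b = arma_ma_coefficients(ar, ma, 100)
rho = 0.85
print(max(abs(bj) / rho ** j for j, bj in enumerate(b)))   # a finite K that works for this rho
```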

To compare the behaviour of AR and ARMA models we simulate from an AR(3) and an
ARMA(3, 2), where both models have the same autoregressive parameters. We simulate from the
AR(3) model (two complex roots, one real root)

\[ (1 - 2\cdot0.8\cos(\pi/3)B + 0.8^2B^2)(1 - 0.6B)X_t = \varepsilon_t \]

and the ARMA(3, 2) model

\[ (1 - 2\cdot0.8\cos(\pi/3)B + 0.8^2B^2)(1 - 0.6B)X_t = (1 + 0.5B - 0.5B^2)\varepsilon_t. \]

The realisations and corresponding periodograms are given in Figures 4.11 and 4.12. Observe that
the AR(3) model has one real root λ = 0.6; this gives rise to the perceived curve in Figure 4.11
and relatively large amplitudes at low frequencies in the corresponding periodogram (in Figure 4.12). In
contrast, the ARMA model has exactly the same AR part as the AR(3) model, but the MA part
of this model appears to cancel out some of the low frequency information! The corresponding
spectral densities of the AR(3) and ARMA(3, 2) models are

\[ f_{AR}(\omega) = \frac{1}{|1 - 1.6\cos\theta\, e^{i\omega} + 0.8^2e^{2i\omega}|^2\,|1 - 0.6e^{i\omega}|^2} \]

and

\[ f_{ARMA}(\omega) = \frac{|1 + 0.5e^{i\omega} - 0.5e^{2i\omega}|^2}{|1 - 1.6\cos\theta\, e^{i\omega} + 0.8^2e^{2i\omega}|^2\,|1 - 0.6e^{i\omega}|^2} \]

respectively (with θ = π/3). A plot of these spectral densities is given in Figure 4.13. We observe that the
periodogram maps the rough character of the spectral density. Thus the spectral density conveys more
information than simply being a positive function. It informs on where periodicities in the
time series are most likely to lie. Studying Figure 4.13 we observe that the MA part of the ARMA spectral
density appears to be dampening the low frequencies. Code for all these models is given on the
course website. Simulate different models and study their behaviour.

Figure 4.11: Realisation from Left: AR(3) and Right: ARMA(3, 2) (plotted against time).

Figure 4.12: Periodogram from realisation from Left: AR(3) and Right: ARMA(3, 2) (plotted against freq).

Figure 4.13: Spectral density from Left: AR(3) and Right: ARMA(3, 2) (plotted against freq).

4.6 ARFIMA models


We have shown in Lemma 4.5.1 that the coefficients of an ARMA process which admits a stationary
solution decay geometrically. This means that they are unable to model “persistent” behaviour
between random variables which are separated relatively far in time. However, the ARIMA offers a
suggestion on how this could be done. We recall that (1 − B)Xt = εt is a process which is nonstationary.
However, we can replace (1 − B) with $(1 - B)^d$ (where d is a fraction) and see if one can obtain a compromise
between persistence (long memory) and nonstationarity (in the sense of differencing). Suppose

\[ (1 - B)^dX_t = \varepsilon_t. \]

If 0 < d < 1/2 we have the expansions

\[ (1 - B)^d = \sum_{j=0}^{\infty}\phi_jB^j, \qquad (1 - B)^{-d} = \sum_{j=0}^{\infty}\psi_jB^j \]

where

\[ \phi_j = \frac{\Gamma(j - d)}{\Gamma(j + 1)\Gamma(-d)}, \qquad \psi_j = \frac{\Gamma(j + d)}{\Gamma(j + 1)\Gamma(d)} \]

and Γ is the Gamma function, which satisfies Γ(1 + k) = kΓ(k). Note that $\sum_{j=0}^{\infty}\psi_j^2 < \infty$ but $\sum_{j=0}^{\infty}\psi_j = \infty$. This
means that Xt has the stationary solution

\[ X_t = \sum_{j=0}^{\infty}\psi_j\varepsilon_{t-j}. \]

Note that showing the above is true requires weaker conditions than those given in Lemma 4.2.1.
The coefficients of the above process do not decay geometrically fast, and it can be shown that the
autocovariance is such that $c(r) \sim |r|^{2d-1}$ (hence is not absolutely summable).
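The slow decay of the coefficients ψj can be seen by computing them. The Python sketch below uses the recursion ψj = ψj−1 (j − 1 + d)/j with ψ0 = 1, which follows from Γ(1 + k) = kΓ(k), and compares it with the Gamma function formula; the function names are hypothetical.

```python
import math

def arfima_ma_coefficients(d, n_terms):
    """psi_j = Gamma(j + d) / (Gamma(j + 1) Gamma(d)), computed recursively from psi_0 = 1."""
    psi = [1.0]
    for j in range(1, n_terms):
        psi.append(psi[-1] * (j - 1 + d) / j)
    return psi

def psi_from_gamma(d, j):
    """Direct evaluation of the Gamma function formula (only safe for moderate j)."""
    return math.gamma(j + d) / (math.gamma(j + 1) * math.gamma(d))

d = 0.3
psi = arfima_ma_coefficients(d, 2000)
print(psi[:4], psi_from_gamma(d, 3))
# Compare with a geometric sequence: psi_j decays like a power of j, not like rho^j.
print(psi[10], psi[100], psi[1000])
```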

4.7 Unit roots, integrated and non-invertible processes

4.7.1 Unit roots


If the difference equation has a root which is one, then an (almost sure) stationary solution of
the AR model does not exist. The simplest example is the ‘random walk’ Xt = Xt−1 + εt (φ(z) =
(1 − z)). This is an example of an Autoregressive Integrated Moving Average ARIMA(0, 1, 0) model
(1 − B)Xt = εt .
To see that it does not have a stationary solution, we iterate the equation n steps backwards;
$X_t = \sum_{j=0}^{n-1}\varepsilon_{t-j} + X_{t-n}$. $S_{t,n} = \sum_{j=0}^{n-1}\varepsilon_{t-j}$ is the partial sum, but it is clear that the partial sum
$S_{t,n}$ does not have a limit, since it is not a Cauchy sequence, i.e. $|S_{t,n} - S_{t,m}|$ does not converge to zero.
However, given some initial value X0 , for t > 0 the so called “unit process” Xt = Xt−1 + εt is well
defined. Notice that the nonstationary solution of this sequence is $X_t = X_0 + \sum_{j=1}^{t}\varepsilon_j$, which has
variance var(Xt ) = var(X0 ) + t (assuming that {εt } are iid random variables with variance one and
independent of X0 ).
We observe that we can ‘stationarize’ the process by taking first differences, i.e. defining
Yt = Xt − Xt−1 = εt .

Unit roots for higher order differences The unit process described above can be generalised to taking
d differences (often denoted as an ARIMA(0, d, 0)) where $(1 - B)^dX_t = \varepsilon_t$ (by taking d-differences
we can remove d-order polynomial trends). We elaborate on this below.

To stationarize the sequence we take d differences, i.e. let Yt,0 = Xt and for 1 ≤ i ≤ d define
the iteration

\[ Y_{t,i} = Y_{t,i-1} - Y_{t-1,i-1} \]

and Yt = Yt,d will be a stationary sequence. Note that this is equivalent to

\[ Y_t = \sum_{j=0}^{d}\frac{d!}{j!(d - j)!}(-1)^jX_{t-j}. \]
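The equivalence between iterated differencing and the binomial formula can be confirmed on a small example. The Python sketch below uses hypothetical helper names and a deterministic cubic sequence, so that the output is easy to inspect; `math.comb` requires Python 3.8 or later.

```python
import math

def difference_iterated(x, d):
    """Apply Y_{t,i} = Y_{t,i-1} - Y_{t-1,i-1} d times (each pass shortens the series by one)."""
    y = list(x)
    for _ in range(d):
        y = [y[t] - y[t - 1] for t in range(1, len(y))]
    return y

def difference_binomial(x, d):
    """Y_t = sum_{j=0}^{d} (-1)^j * C(d, j) * X_{t-j} for t = d, ..., len(x) - 1."""
    return [sum((-1) ** j * math.comb(d, j) * x[t - j] for j in range(d + 1))
            for t in range(d, len(x))]

x = [t ** 3 for t in range(10)]   # a cubic trend
print(difference_iterated(x, 3))
print(difference_binomial(x, 3))
print(difference_iterated(x, 4))
```

Three differences reduce the cubic trend to a constant, and a fourth difference removes it completely.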

The ARIMA(p, d, q) model The general ARIMA(p, d, q) is defined as $(1 - B)^d\phi(B)X_t = \theta(B)\varepsilon_t$,
where φ(B) and θ(B) are p and q order polynomials respectively and the roots of φ(B) lie outside
the unit circle.
Another way of describing the above model is that after taking d differences (as detailed
above) the resulting process is an ARMA(p, q) process (see Section 4.5 for the definition of an ARMA
model).
To illustrate the difference between stationary ARMA and ARIMA processes, in Figure 4.14
we plot a realisation from an AR(2) process and its corresponding integrated process.
Suppose (1 − B)φ(B)Xt = εt and let $\tilde{\phi}(B) = (1 - B)\phi(B)$. Then we observe that $\tilde{\phi}(1) = 0$.
This property is useful when checking for unit root behaviour (see Section 4.9).
More exotic unit roots
The unit root process need not be restricted to the case that the root of the characteristic polynomial
associated with the AR model is one. If the absolute value of the root is equal to one, then a stationary
solution cannot exist. Consider the AR(2) model

\[ X_t = 2\cos\theta\, X_{t-1} - X_{t-2} + \varepsilon_t. \]

The associated characteristic polynomial is $\phi(B) = 1 - 2\cos(\theta)B + B^2 = (1 - e^{i\theta}B)(1 - e^{-i\theta}B)$.
Thus the roots are $e^{i\theta}$ and $e^{-i\theta}$, both of which lie on the unit circle. Simulate this process.

4.7.2 Non-invertible processes


In the examples above a stationary solution does not exist. We now consider an example where
the process is stationary but an autoregressive representation does not exist (this matters when we
want to forecast).

Consider the MA(1) model Xt = εt − εt−1 . We recall that this can be written as Xt = φ(B)εt
where φ(B) = 1 − B. From Example 4.3.1(iv) we know that $\phi(z)^{-1}$ is not well defined on |z| = 1, therefore
Xt does not have an AR(∞) representation since $(1 - B)^{-1}X_t = \varepsilon_t$ is not well defined.

Figure 4.14: Realisations from an AR process and its corresponding integrated process, using
N (0, 1) innovations (generated using the same seed). (a) $X_t = 1.5X_{t-1} - 0.75X_{t-2} + \varepsilon_t$;
(b) $(1 - B)Y_t = X_t$, where Xt is defined in (a). Both panels are plotted against Time.

4.8 Simulating from models

4.9 Some diagnostics


Here we discuss some guidelines which allow us to discriminate between a pure autoregressive
process and a pure moving average process, both with low orders. We also briefly discuss how to
identify a “unit root” in the time series and whether the data has been over differenced.

4.9.1 ACF and PACF plots for checking for MA and AR behaviour
The ACF and PACF plots are the autocorrelations and partial autocorrelations estimated from the
time series data (estimated assuming the time series is second order stationary). We came across
the ACF in Chapter 1; the PACF is defined in Chapter 6. Roughly speaking, it is the correlation between
two time points after removing the linear dependence on the observations in between. In R
the functions are acf and pacf. Note that the PACF at lag zero is not given (as it does not make
any sense).
The ACF and PACF of an AR(1), AR(2), MA(1) and MA(2) are given in Figures 4.15-4.18.
We observe from Figures 4.15 and 4.16 (which give the ACF of an AR(1) and an AR(2) process)
that there is correlation at all lags (though it reduces for large lags). However, the
PACF of the AR(1) has only one large coefficient, at lag one, and the PACF of the AR(2) has
two large coefficients, at lags one and two. This suggests that the ACF and PACF plots can be used
to diagnose autoregressive behaviour and its order.
Similarly, we observe from Figures 4.17 and 4.18 (which give the ACF of an MA(1) and an MA(2)
process) that there is no real correlation in the ACF plots after lag one and two respectively, but
the PACF plots are more ambiguous (there seem to be correlations at several lags).
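The cut-off behaviour described above can also be checked against the theoretical ACF and PACF, which R returns through `ARMAacf`. The sketch below uses the AR(1) and MA(1) models from the captions of Figures 4.15 and 4.17; the seed and the maximum lags are illustrative assumptions.

```r
# Theoretical ACF (lags 0..5) and PACF (lags 1..5)
acf_ar1  <- ARMAacf(ar = 0.5, lag.max = 5)
pacf_ar1 <- ARMAacf(ar = 0.5, lag.max = 5, pacf = TRUE)
acf_ma1  <- ARMAacf(ma = 0.8, lag.max = 5)
pacf_ma1 <- ARMAacf(ma = 0.8, lag.max = 5, pacf = TRUE)

# Sample versions from simulated data with n = 400 (seed is arbitrary)
set.seed(3)
ar1 <- arima.sim(model = list(ar = 0.5), n = 400)
ma1 <- arima.sim(model = list(ma = 0.8), n = 400)
sample_acf_ar1  <- acf(ar1, lag.max = 25, plot = FALSE)
sample_pacf_ar1 <- pacf(ar1, lag.max = 25, plot = FALSE)
sample_acf_ma1  <- acf(ma1, lag.max = 25, plot = FALSE)
sample_pacf_ma1 <- pacf(ma1, lag.max = 25, plot = FALSE)
```

For the AR(1) the theoretical PACF vanishes after lag one while the ACF decays geometrically; for the MA(1) the roles are reversed. Calling `plot()` on the sample objects gives plots of the kind shown in the figures.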

[Figure 4.15, two panels for the series ar1: ACF (left) and Partial ACF (right), each plotted against Lag (0 to 25).]

Figure 4.15: ACF and PACF plot of an AR(1), $X_t = 0.5X_{t-1} + \varepsilon_t$, $n = 400$

4.9.2 Checking for unit roots


We recall that for an AR(1) process, the unit root corresponds to $X_t = X_{t-1} + \varepsilon_t$, i.e. $\phi = 1$. Thus
to check for unit root type behaviour we estimate $\phi$ and see how close $\widehat{\phi}$ is to one. We can formally
turn this into a statistical test $H_0: \phi = 1$ vs. $H_A: |\phi| < 1$. There are several tests for this, the most
famous being the Dickey-Fuller test. Rather intriguingly, the distribution of $\widehat{\phi}$ (using the least squares
estimator) does not follow a normal distribution with a $\sqrt{n}$-rate!

[Figure 4.16, two panels for the series ar2: ACF (left) and Partial ACF (right), each plotted against Lag (0 to 25).]

Figure 4.16: ACF and PACF plot of an AR(2), $n = 400$

[Figure 4.17, two panels for the series ma1: ACF (left) and Partial ACF (right), each plotted against Lag (0 to 25).]

Figure 4.17: ACF and PACF plot of an MA(1), $X_t = \varepsilon_t + 0.8\varepsilon_{t-1}$, $n = 400$

Extending the unit root to the AR($p$) process, the unit root corresponds to $(1 - B)\phi(B)X_t = \varepsilon_t$,
where $\phi(B) = 1 - \sum_{j=1}^{p-1}\phi_j B^j$ is a polynomial of order $(p - 1)$ (this is the same as saying $X_t - X_{t-1}$ is a stationary
AR($p - 1$) process). Checking for a unit root is the same as checking that the sum of all the AR
coefficients is equal to one. This is easily seen by noting that $\widetilde{\phi}(1) = 0$ where $\widetilde{\phi}(B) = (1 - B)\phi(B)$,
or by expanding

$$(1 - B)\phi(B)X_t = X_t - (\phi_1 + 1)X_{t-1} - (\phi_2 - \phi_1)X_{t-2} - \ldots - (\phi_{p-1} - \phi_{p-2})X_{t-p+1} + \phi_{p-1}X_{t-p} = \varepsilon_t.$$

The AR coefficients are $(\phi_1 + 1), (\phi_2 - \phi_1), \ldots, (\phi_{p-1} - \phi_{p-2}), -\phi_{p-1}$, and this sum telescopes to one.
Therefore to check for unit root
behaviour in AR($p$) processes one can see how close the sum of the estimated AR coefficients $\sum_{j=1}^{p}\widehat{\phi}_j$
is to one. Again this can be turned into a formal test.
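The claim that the AR coefficients of a unit root process sum to one can be checked numerically. The sketch below multiplies $(1 - B)$ by the stationary AR(2) polynomial $1 - 1.5B + 0.75B^2$ used elsewhere in this chapter; the helper `poly_mult` is a hypothetical name introduced only for this illustration.

```r
# Multiply two polynomials given by their coefficients in increasing order
poly_mult <- function(a, b) {
  out <- numeric(length(a) + length(b) - 1)
  for (i in seq_along(a)) {
    for (j in seq_along(b)) {
      out[i + j - 1] <- out[i + j - 1] + a[i] * b[j]
    }
  }
  out
}

phi_poly  <- c(1, -1.5, 0.75)                  # phi(B) = 1 - 1.5B + 0.75B^2
full_poly <- poly_mult(c(1, -1), phi_poly)     # (1 - B) phi(B)
ar_coef   <- -full_poly[-1]                    # AR coefficients of the integrated model
sum_ar    <- sum(ar_coef)                      # should equal one
```

With estimated coefficients the same sum would only be close to one, and how close is what a formal test quantifies.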


[Figure 4.18, two panels for the series ma2: ACF (left) and Partial ACF (right), each plotted against Lag (0 to 25).]

Figure 4.18: ACF and PACF plot of an MA(2), $n = 400$

[Figure 4.19, two ACF panels for the series test2, each plotted against Lag (0 to 20).]

Figure 4.19: ACF of differenced data $Y_t = X_t - X_{t-1}$. Left: $X_t = \varepsilon_t$. Right: $X_t = 1.5X_{t-1} - 0.75X_{t-2} + \varepsilon_t$.

In order to remove a stochastic or deterministic trend one may difference the data. But if the
data is over differenced one can induce spurious dependence in the data which is best avoided
(estimation is terrible and prediction becomes a nightmare). One indicator of over differencing is
the appearance of negative correlation at lag one in the data. This is illustrated in Figure 4.19,
where for both data sets (the difference of iid noise and the difference of an AR(2) process) we observe a
large negative correlation at lag one.
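For the iid case the size of this negative correlation can be worked out exactly: differencing iid noise gives $Y_t = \varepsilon_t - \varepsilon_{t-1}$, an MA(1) with coefficient $-1$, whose lag one autocorrelation is $-1/2$. A minimal sketch (seed and sample size are assumptions):

```r
# Theoretical ACF of differenced iid noise (an MA(1) with coefficient -1), lags 0..3
acf_diff_noise <- ARMAacf(ma = -1, lag.max = 3)

# Sample lag-one autocorrelation of differenced simulated noise
set.seed(4)
x_noise <- rnorm(400)
y <- diff(x_noise)
sample_lag1 <- acf(y, lag.max = 1, plot = FALSE)$acf[2]
```

The sample value should be near $-0.5$, which is the kind of lag one spike to look for when over differencing is suspected.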

4.10 Appendix
Representing an AR(p) model as a VAR(1) Let us suppose $X_t$ is an AR($p$) process, with the representation

$$X_t = \sum_{j=1}^{p}\phi_j X_{t-j} + \varepsilon_t.$$

For the rest of this section we will assume that the roots of the characteristic function, $\phi(z)$, lie
outside the unit circle, thus the solution is causal. We can rewrite the above as a Vector Autoregressive
(VAR(1)) process

$$\underline{X}_t = A\underline{X}_{t-1} + \underline{\varepsilon}_t \tag{4.24}$$

where

$$A = \begin{pmatrix} \phi_1 & \phi_2 & \ldots & \phi_{p-1} & \phi_p \\ 1 & 0 & \ldots & 0 & 0 \\ 0 & 1 & \ldots & 0 & 0 \\ 0 & 0 & \ldots & 1 & 0 \end{pmatrix}, \tag{4.25}$$

$\underline{X}_t' = (X_t, \ldots, X_{t-p+1})$ and $\underline{\varepsilon}_t' = (\varepsilon_t, 0, \ldots, 0)$. It is straightforward to show that the eigenvalues of
$A$ are the inverse of the roots of $\phi(z)$, since

$$\det(zI - A) = z^p - \sum_{i=1}^{p}\phi_i z^{p-i} = \underbrace{z^p\Big(1 - \sum_{i=1}^{p}\phi_i z^{-i}\Big)}_{= z^p\phi(z^{-1})},$$

thus the eigenvalues of $A$ lie inside the unit circle. It can be shown that for any $|\lambda_{\max}(A)| < \delta < 1$,
there exists a constant $C_\delta$ such that $\|A^j\|_{spec} \leq C_\delta\delta^j$ (see Appendix A). Note that this result is
extremely obvious if the eigenvalues are distinct (in which case the spectral decomposition can be
used), in which case $\|A^j\|_{spec} \leq C_\delta|\lambda_{\max}(A)|^j$ (note that $\|A\|_{spec}$ is the spectral norm of $A$, which
is the square root of the largest eigenvalue of the symmetric matrix $AA'$).
We can apply the same back iterating that we did for the AR(1) to the vector AR(1). Iterating
(4.24) backwards $k$ times gives

$$\underline{X}_t = \sum_{j=0}^{k-1}A^j\underline{\varepsilon}_{t-j} + A^k\underline{X}_{t-k}.$$

Since $\|A^k\underline{X}_{t-k}\|_2 \leq \|A^k\|_{spec}\|\underline{X}_{t-k}\|_2 \overset{P}{\to} 0$ we have

$$\underline{X}_t = \sum_{j=0}^{\infty}A^j\underline{\varepsilon}_{t-j}.$$
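The relation between the eigenvalues of the companion matrix $A$ and the roots of $\phi(z)$, and the geometric decay of $\|A^j\|_{spec}$, can be illustrated numerically. The sketch below assumes the AR(2) coefficients $(1.5, -0.75)$ used earlier in this chapter; the construction requires $p \geq 2$.

```r
phi <- c(1.5, -0.75)
p <- length(phi)

# Companion matrix: AR coefficients in the first row, shifted identity below
A <- rbind(phi, cbind(diag(p - 1), rep(0, p - 1)))

eig   <- eigen(A)$values
roots <- polyroot(c(1, -phi))      # roots of phi(z) = 1 - phi_1 z - phi_2 z^2

# Spectral norm of A^j for j = 1, ..., 30
spec_norms <- sapply(1:30, function(j) {
  Aj <- diag(p)
  for (i in seq_len(j)) Aj <- Aj %*% A
  norm(Aj, type = "2")
})
```

Here `Mod(eig)` should agree with `1 / Mod(roots)`, and `spec_norms` should shrink roughly like $|\lambda_{\max}(A)|^j$.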
We use the above representation to prove Lemma 4.5.1.

PROOF of Lemma 4.5.1 We first prove (i). There are several ways to prove the result. The proof
we consider here uses the VAR expansion given in Section ??; thus we avoid using the backshift
operator (however the same result can easily be proved using the backshift operator). We write the ARMA
process as a vector difference equation

$$\underline{X}_t = A\underline{X}_{t-1} + \underline{\varepsilon}_t \tag{4.26}$$

where $\underline{X}_t' = (X_t, \ldots, X_{t-p+1})$ and $\underline{\varepsilon}_t' = (\varepsilon_t + \sum_{j=1}^{q}\theta_j\varepsilon_{t-j}, 0, \ldots, 0)$. Now iterating (4.26), we have

$$\underline{X}_t = \sum_{j=0}^{\infty}A^j\underline{\varepsilon}_{t-j}, \tag{4.27}$$

concentrating on the first element of the vector $\underline{X}_t$ we see that

$$X_t = \sum_{i=0}^{\infty}[A^i]_{1,1}\Big(\varepsilon_{t-i} + \sum_{j=1}^{q}\theta_j\varepsilon_{t-i-j}\Big).$$

Comparing (4.18) with the above it is clear that for $j > q$, $a_j = [A^j]_{1,1} + \sum_{i=1}^{q}\theta_i[A^{j-i}]_{1,1}$. Observe
that the above representation is very similar to the AR(1). Indeed as we will show below the $A^j$
behaves in much the same way as the $\phi^j$ in the AR(1) example. As with $\phi^j$, we will show that $A^j$
converges to zero as $j \to \infty$ (because the eigenvalues of $A$ are less than one in absolute value). We now show that
$|X_t| \leq K\sum_{j=1}^{\infty}\rho^j|\varepsilon_{t-j}|$ for some $0 < \rho < 1$, this will mean that $|a_j| \leq K\rho^j$. To bound $|X_t|$ we use
(4.27)

$$|X_t| \leq \|\underline{X}_t\|_2 \leq \sum_{j=0}^{\infty}\|A^j\|_{spec}\|\underline{\varepsilon}_{t-j}\|_2.$$

Hence, by using Gelfand's formula (see Appendix A) we have $\|A^j\|_{spec} \leq C_\rho\rho^j$ (for any $|\lambda_{\max}(A)| < \rho < 1$,
where $\lambda_{\max}(A)$ denotes the largest absolute eigenvalue of the matrix $A$), which gives the
corresponding bound for $|a_j|$.
To prove (ii) we use the backshift operator. This requires the power series expansion of $\theta(z)/\phi(z)$.
If the roots of $\phi(z)$ are distinct, then it is straightforward to write $\phi(z)^{-1}$ in terms of partial
fractions, which uses a convergent power series for $|z| = 1$. This expansion immediately gives
the linear coefficients $a_j$ and shows that $|a_j| \leq C(1 + \delta)^{-|j|}$ for some finite constant $C$. On the other
hand, if there are multiple roots, say the roots of $\phi(z)$ are $\lambda_1, \ldots, \lambda_s$ with multiplicity $m_1, \ldots, m_s$
(where $\sum_{j=1}^{s}m_j = p$) then we need to adjust the partial fraction expansion. It can be shown that
$|a_j| \leq C|j|^{\max_s|m_s|}(1 + \delta)^{-|j|}$. We note that for every $(1 + \delta)^{-1} < \rho < 1$, there exists a constant $C$
such that $|j|^{\max_s|m_s|}(1 + \delta)^{-|j|} \leq C\rho^{|j|}$, thus we obtain the desired result.
To show (iii) we use a similar proof to (i), and omit the details. $\Box$
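The formula $a_j = [A^j]_{1,1} + \sum_{i=1}^{q}\theta_i[A^{j-i}]_{1,1}$ from part (i) can be compared with the MA($\infty$) coefficients that R computes with `ARMAtoMA`. The ARMA(2,1) parameters below are illustrative assumptions, not a model taken from the notes.

```r
phi <- c(1.5, -0.75)      # AR coefficients (assumed for illustration)
theta <- 0.5              # MA coefficient (assumed for illustration)
p <- length(phi)
q <- length(theta)
A <- rbind(phi, cbind(diag(p - 1), rep(0, p - 1)))   # companion matrix

max_lag <- 10
# A_pow[[j + 1]] holds A^j for j = 0, ..., max_lag
A_pow <- vector("list", max_lag + 1)
A_pow[[1]] <- diag(p)
for (j in 1:max_lag) A_pow[[j + 1]] <- A_pow[[j]] %*% A
a11 <- sapply(A_pow, function(M) M[1, 1])            # [A^j]_{1,1}

# a_j = [A^j]_{1,1} + sum_i theta_i [A^{j-i}]_{1,1} for j > q
a_var <- sapply((q + 1):max_lag, function(j) {
  a11[j + 1] + sum(theta * a11[j - seq_len(q) + 1])
})

# Reference values: ARMAtoMA returns the coefficients at lags 1, ..., max_lag
a_ref <- ARMAtoMA(ar = phi, ma = theta, lag.max = max_lag)[(q + 1):max_lag]
```

The two vectors `a_var` and `a_ref` should agree up to rounding error, and both decay geometrically as the lemma states.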

Corollary 4.10.1 An ARMA process is invertible if the roots of θ(B) (the MA coefficients) lie
outside the unit circle and causal if the roots of φ(B) (the AR coefficients) lie outside the unit
circle.
An AR($p$) process and an MA($q$) process are identifiable (meaning there is only one model associated
with one solution). However, the ARMA is not necessarily identifiable. The problem arises when
the characteristic polynomials of the AR and MA parts of the model share common roots. A simple
example is $X_t = \varepsilon_t$, which also satisfies the representation $X_t - \phi X_{t-1} = \varepsilon_t - \phi\varepsilon_{t-1}$ etc. Therefore it
is not possible to identify common factors in the polynomials.
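The common factor example can be confirmed numerically: an ARMA(1,1) with AR coefficient $\phi$ and MA coefficient $-\phi$ has MA($\infty$) coefficients that are all zero beyond lag zero, so it is just white noise. A small sketch with $\phi = 0.5$ (an arbitrary choice):

```r
# X_t - 0.5 X_{t-1} = eps_t - 0.5 eps_{t-1}; R writes the MA part as eps_t + ma * eps_{t-1}
psi <- ARMAtoMA(ar = 0.5, ma = -0.5, lag.max = 10)   # coefficients at lags 1..10
```

Every entry of `psi` is zero, so the data cannot distinguish this model from $X_t = \varepsilon_t$.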

One of the main advantages of the invertibility property is in prediction and estimation. We will
consider this in detail below. It is worth noting that even if an ARMA process is not invertible, one
can generate a time series which has identical correlation structure but is invertible (see Section
6.4).

Chapter 5

A review of some results from multivariate analysis

5.1 Preliminaries: Euclidean space and projections


In this section we describe the notion of projections. Understanding linear predictions in terms of
the geometry of projections leads to a deeper understanding of linear predictions and also algorithms
for solving linear systems. We start with a short review of projections in Euclidean space.

5.1.1 Scalar/Inner products and norms


Suppose $x_1, \ldots, x_p \in \mathbb{R}^d$, where $p < d$. There are two important quantities associated with the
space $\mathbb{R}^d$:

• The Euclidean norm: $\|x\|_2 = \sqrt{\sum_{j=1}^{d}x_j^2}$.

When we switch to random variables the L2-norm changes to the square root of the variance.

• The scalar/inner product

$$\langle x_a, x_b\rangle = \sum_{j=1}^{d}x_{aj}x_{bj}.$$

If $x_a$ and $x_b$ are orthogonal then the angle between them is 90 degrees and $\langle x_a, x_b\rangle = 0$. It is
clear that $\langle x, x\rangle = \|x\|_2^2$.

When we switch to random variables, the inner product becomes the covariance.
Two random variables are uncorrelated if their covariance is zero.

Let $X = sp(x_1, \ldots, x_p)$ denote the space spanned by the vectors $x_1, \ldots, x_p$. This means if $z \in sp(x_1, \ldots, x_p)$,
there exist coefficients $\{\alpha_j\}_{j=1}^{p}$ where $z = \sum_{j=1}^{p}\alpha_j x_j$.

5.1.2 Projections
Let $y \in \mathbb{R}^d$. Our aim is to project $y$ onto $sp(x_1, \ldots, x_p)$. The projection will lead to an error which
is orthogonal to $sp(x_1, \ldots, x_p)$. The projection of $y$ onto $sp(x_1, \ldots, x_p)$ is the linear combination
$z = \sum_{j=1}^{p}\alpha_j x_j$ which minimises the Euclidean distance (least squares)

$$\Big\|y - \sum_{j=1}^{p}\alpha_j x_j\Big\|_2^2 = \Big\langle y - \sum_{j=1}^{p}\alpha_j x_j,\; y - \sum_{j=1}^{p}\alpha_j x_j\Big\rangle.$$

The coefficients $\{\alpha_j\}_{j=1}^{p}$ which minimise this difference correspond to the normal equations:

$$\Big\langle y - \sum_{j=1}^{p}\alpha_j x_j,\; x_\ell\Big\rangle = y'x_\ell - \sum_{j=1}^{p}\alpha_j x_j'x_\ell = 0, \qquad 1 \leq \ell \leq p. \tag{5.1}$$

The normal equations in (5.1) can be put in matrix form

$$y'x_\ell - \sum_{j=1}^{p}\alpha_j x_j'x_\ell = 0 \quad (1 \leq \ell \leq p) \quad\Rightarrow\quad XX'\alpha = Xy \tag{5.2}$$

where $X' = (x_1, \ldots, x_p)$, so that $X$ is the $p \times d$ matrix whose rows are $x_1', \ldots, x_p'$. This leads to the well known solution

$$\alpha = (XX')^{-1}Xy. \tag{5.3}$$

The above shows that the best linear predictor should be such that the error $y - \sum_{j=1}^{p}\alpha_j x_j$ and
$x_\ell$ are orthogonal (90 degrees). Let $X = sp(x_1, \ldots, x_p)$; to simplify notation we often use
$P_X(y)$ to denote the projection of $y$ onto $X$. For example, $P_X(y) = \sum_{j=1}^{p}\alpha_j x_j$, where
$\langle y - P_X(y), x_\ell\rangle = 0$ for all $1 \leq \ell \leq p$. We will often use this notation to simplify the exposition
below.
Since the projection error $y - P_X(y)$ contains no linear information on $X$, we have:

• All information on the inner product between $y$ and $x_\ell$ is contained in its projection:

$$\langle y, x_\ell\rangle = y'x_\ell = \langle P_X(y), x_\ell\rangle, \qquad 1 \leq \ell \leq p.$$

• Euclidean distance of projection error:

$$\langle y - P_X(y), y\rangle = \langle y - P_X(y),\; y - P_X(y) + P_X(y)\rangle = \langle y - P_X(y), y - P_X(y)\rangle + \underbrace{\langle y - P_X(y), P_X(y)\rangle}_{=0} = \|y - P_X(y)\|_2^2.$$
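The normal equations and the two properties of the projection error can be checked on a small example in $\mathbb{R}^3$. The vectors below are arbitrary illustrative choices; the matrix `X` is built with the vectors $x_j$ as its rows, which is an assumption about the layout made for this sketch.

```r
x1 <- c(1, 0, 0)
x2 <- c(1, 1, 0)
y  <- c(1, 2, 3)
X  <- rbind(x1, x2)                    # p x d matrix whose rows are x_1', x_2'

alpha <- solve(X %*% t(X), X %*% y)    # solves the p x p system of normal equations
proj  <- as.numeric(t(X) %*% alpha)    # P_X(y) = sum_j alpha_j x_j
err   <- y - proj                      # projection error

ortho_check <- as.numeric(X %*% err)   # <y - P_X(y), x_l> for l = 1, 2
```

Here the span is the plane of the first two coordinates, so the projection is $(1, 2, 0)$, the error is $(0, 0, 3)$, and `sum(err * y)` equals `sum(err^2)`.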

5.1.3 Orthogonal vectors


We now consider the simple, but important case that the vectors $\{x_j\}_{j=1}^{p}$ are orthogonal. In this
case, evaluation of the coefficients $\alpha' = (\alpha_1, \ldots, \alpha_p)$ is simple. From (5.3) we recall that

$$\alpha = (XX')^{-1}Xy, \tag{5.4}$$

where $X' = (x_1, \ldots, x_p)$. If $\{x_j\}_{j=1}^{p}$ are orthogonal, then $XX'$ is a diagonal matrix where

$$XX' = \mathrm{diag}\big(x_1'x_1, \ldots, x_p'x_p\big).$$

Since

$$(Xy)_i = \sum_{j=1}^{d}x_{i,j}y_j,$$

this gives the very simple, entrywise solution for $\alpha_i$

$$\alpha_i = \frac{\sum_{j=1}^{d}x_{i,j}y_j}{\sum_{j=1}^{d}x_{i,j}^2}.$$

5.1.4 Projecting in multiple stages


Suppose that $x_1, \ldots, x_p, x_{p+1} \in \mathbb{R}^d$. Let $X_p = sp(x_1, \ldots, x_p)$ and $X_{p+1} = sp(x_1, \ldots, x_{p+1})$. Observe
that $X_p$ is a subset of $X_{p+1}$. With a little thought it is clear that $X_{p+1} = sp(X_p, x_{p+1} - P_{X_p}(x_{p+1}))$.
In other words, $x_{p+1} - P_{X_p}(x_{p+1})$ is the additional information in $x_{p+1}$ that is not contained in $X_p$.
If $x_{p+1} \in X_p$, then $x_{p+1} - P_{X_p}(x_{p+1}) = 0$.
Let $y \in \mathbb{R}^d$. Our aim is to project $y$ onto $X_{p+1}$, but we do it in stages, by first projecting onto
$X_p$, then onto $X_{p+1}$. Since $x_{p+1} - P_{X_p}(x_{p+1})$ is orthogonal to $X_p$ (this is by the very definition of
$P_{X_p}(x_{p+1})$) we can write

$$y = P_{X_p}(y) + P_{x_{p+1} - P_{X_p}(x_{p+1})}(y) + \varepsilon = P_{X_p}(y) + \alpha\big(x_{p+1} - P_{X_p}(x_{p+1})\big) + \varepsilon.$$

The coefficient $\alpha$ can be deduced by minimising the Euclidean distance of the above;

$$\big\|y - P_{X_p}(y) - \alpha\big(x_{p+1} - P_{X_p}(x_{p+1})\big)\big\|_2^2.$$

Differentiating with respect to $\alpha$ leads to the normal equation

$$\big\langle y - P_{X_p}(y) - \alpha(x_{p+1} - P_{X_p}(x_{p+1})),\; x_{p+1} - P_{X_p}(x_{p+1})\big\rangle = \big\langle y - \alpha(x_{p+1} - P_{X_p}(x_{p+1})),\; x_{p+1} - P_{X_p}(x_{p+1})\big\rangle = 0,$$

where the second equality holds because $P_{X_p}(y)$ lies in $X_p$ and is therefore orthogonal to $x_{p+1} - P_{X_p}(x_{p+1})$. Thus solving the
above gives

$$\alpha = \frac{\langle y,\; x_{p+1} - P_{X_p}(x_{p+1})\rangle}{\|x_{p+1} - P_{X_p}(x_{p+1})\|_2^2}.$$

Therefore we can write $y$ as

$$y = \big(P_{X_p}(y) - \alpha P_{X_p}(x_{p+1})\big) + \alpha x_{p+1} + \varepsilon. \tag{5.5}$$

If $\alpha = 0$, then $x_{p+1}$ does not contain any additional information on $y$ over what is already in $X_p$.
The above may seem a little heavy, but a few sketches using $\mathbb{R}^3$ as an example will
make the derivations obvious. Once you are comfortable with projections in Euclidean space, the
same ideas transfer to projections of random variables where the inner product in the space is the
covariance (and not the scalar product).
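The two-stage construction can be compared with a direct projection on a toy example. The vectors and the helper name `proj_onto` below are assumptions made purely for illustration.

```r
# Projection of y onto the span of the rows of X
proj_onto <- function(X, y) {
  alpha <- solve(X %*% t(X), X %*% y)
  as.numeric(t(X) %*% alpha)
}

x1 <- c(1, 0, 0, 0)
x2 <- c(1, 1, 0, 0)
x3 <- c(1, 1, 1, 0)
y  <- c(1, 2, 3, 4)
Xp  <- rbind(x1, x2)          # spans X_p
Xp1 <- rbind(x1, x2, x3)      # spans X_{p+1}

r <- x3 - proj_onto(Xp, x3)               # part of x_{p+1} not explained by X_p
alpha <- sum(y * r) / sum(r^2)            # <y, r> / ||r||^2
two_stage <- proj_onto(Xp, y) + alpha * r
one_stage <- proj_onto(Xp1, y)
```

In this example `r` is the third unit vector, `alpha` is 3, and both routes give the projection $(1, 2, 3, 0)$.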

5.1.5 Spaces of random variables
The set-up described above can be generalized to any general vector space. Our focus will be on
spaces of random variables. We assume the random variables are defined on an appropriate probability space.
We then define the (Hilbert) space of random variables

$$H = \{X;\; X \text{ is a (real) random variable where } var(X) < \infty\}.$$

This looks complicated, but in many ways it is analogous to Euclidean space. There are a few
additional complications (such as showing the space is complete, which we ignore). In order to
define a projection in this space, we need to define the corresponding inner product and
norm for this space. Suppose $X, Y \in H$, then the inner product is the covariance

$$\langle X, Y\rangle = cov(X, Y).$$

The squared norm is clearly the variance

$$\|X\|_2^2 = \langle X, X\rangle = cov(X, X).$$

Most properties that apply to Euclidean space also apply to $H$. Suppose that $X_1, \ldots, X_p$ are
random variables in $H$. We define the subspace $sp(X_1, \ldots, X_p)$

$$sp(X_1, \ldots, X_p) = \Big\{Y;\; \text{where } Y = \sum_{j=1}^{p}a_j X_j\Big\},$$

i.e. all random variables $Y \in H$ which can be expressed as a linear combination of $\{X_j\}_{j=1}^{p}$.
Now just as in Euclidean space you can project any $y \in \mathbb{R}^d$ onto the subspace spanned by the
vectors $x_1, \ldots, x_p$, we can project $Y \in H$ onto $X = sp(X_1, \ldots, X_p)$. The projection is such that

$$P_X(Y) = \sum_{j=1}^{p}\alpha_j X_j,$$

where the $\alpha' = (\alpha_1, \ldots, \alpha_p)$ are such that

$$\Big\langle X_\ell,\; Y - \sum_{j=1}^{p}\alpha_j X_j\Big\rangle = cov\Big(X_\ell,\; Y - \sum_{j=1}^{p}\alpha_j X_j\Big) = 0, \qquad 1 \leq \ell \leq p.$$

Using the above we can show that $\alpha$ satisfies

$$\alpha = [var(\underline{X})]^{-1}cov(\underline{X}, Y),$$

where $\underline{X} = (X_1, \ldots, X_p)'$ (out of sloppiness we will often say we project onto $\underline{X}$ rather than
project onto the space spanned by $\underline{X}$, which is $sp(X_1, \ldots, X_p)$).
The properties described in Section 5.1.2 apply to $H$ too:

• Inner product between $Y$ and $X_\ell$ is contained in the projection:

$$\langle Y, X_\ell\rangle = cov(Y, X_\ell) = cov(P_X(Y), X_\ell), \qquad 1 \leq \ell \leq p. \tag{5.6}$$

• The projection error

$$cov(Y - P_X(Y), Y) = var[Y - P_X(Y)].$$

This is rather formal. We now connect this to results from multivariate analysis.

5.2 Linear prediction


Suppose $(Y, \underline{X})$, where $\underline{X} = (X_1, \ldots, X_p)$, is a random vector. The best linear predictor of $Y$ given
$\underline{X}$ is given by

$$\widehat{Y} = \sum_{j=1}^{p}\beta_j X_j$$

where $\beta = \Sigma_{XX}^{-1}\Sigma_{XY}$, with $\beta = (\beta_1, \ldots, \beta_p)$ and $\Sigma_{XX} = var(\underline{X})$, $\Sigma_{XY} = cov[\underline{X}, Y]$. The corresponding
mean squared error is

$$E\Big(Y - \sum_{j=1}^{p}\beta_j X_j\Big)^2 = E(Y^2) - \Sigma_{YX}\Sigma_{XX}^{-1}\Sigma_{XY}.$$

Reason To understand why the above is true, we need to find the $\theta$ which minimises

$$E\Big(Y - \sum_{j=1}^{p}\theta_j X_j\Big)^2,$$

where we assume that $Y$ and the $X_j$ have zero mean. Differentiating the above with respect to $\theta_i$ gives

$$-2\Big(E(YX_i) - \sum_{j=1}^{p}\theta_j E(X_j X_i)\Big), \qquad i = 1, \ldots, p.$$

Equating to zero (since we want to find the $\theta_i$ which minimises the above) gives the normal equations

$$\underbrace{E(YX_i)}_{= cov(Y, X_i)} - \sum_{j=1}^{p}\theta_j\underbrace{E(X_j X_i)}_{= cov(X_i, X_j)} = 0, \qquad i = 1, \ldots, p.$$

Writing the above as a matrix equation gives the solution

$$\beta = var(\underline{X})^{-1}cov(\underline{X}, Y) = \Sigma_{XX}^{-1}\Sigma_{XY}.$$

Substituting the above into the mean squared error gives

$$E\Big(Y - \sum_{j=1}^{p}\beta_j X_j\Big)^2 = E(Y^2) - 2E(Y\widehat{Y}) + E(\widehat{Y}^2).$$

We now use that

$$Y = \widehat{Y} + e$$

where $e$ is uncorrelated with $\{X_j\}$, thus it is uncorrelated with $\widehat{Y}$. This means $E[Y\widehat{Y}] = E[\widehat{Y}^2]$.
Therefore

$$E\Big(Y - \sum_{j=1}^{p}\beta_j X_j\Big)^2 = E(Y^2) - E(\widehat{Y}^2) = E(Y^2) - \beta'var(\underline{X})\beta = E(Y^2) - \Sigma_{YX}\Sigma_{XX}^{-1}\Sigma_{XY}.$$
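The formulas for $\beta$ and the mean squared error can be evaluated directly from a variance matrix. The numbers below are made up for illustration (zero means are assumed, so $E(Y^2) = var(Y)$); the sketch checks the closed form for the mean squared error against a direct expansion of $E(Y - \beta'\underline{X})^2$.

```r
# Illustrative (assumed) second moments of a zero mean vector (Y, X1, X2)
Sigma_XX <- matrix(c(2.0, 0.5,
                     0.5, 1.0), nrow = 2, byrow = TRUE)
Sigma_XY <- c(1.0, 0.3)
var_Y <- 2

beta <- solve(Sigma_XX, Sigma_XY)                 # beta = Sigma_XX^{-1} Sigma_XY

# Closed form: E(Y^2) - Sigma_YX Sigma_XX^{-1} Sigma_XY
mse_formula <- var_Y - sum(Sigma_XY * beta)

# Direct expansion: E(Y^2) - 2 beta' Sigma_XY + beta' Sigma_XX beta
mse_direct <- var_Y - 2 * sum(beta * Sigma_XY) +
  as.numeric(t(beta) %*% Sigma_XX %*% beta)

# Normal equations: cov(Y - beta'X, X_i) should be zero for each i
normal_eq <- Sigma_XY - as.numeric(Sigma_XX %*% beta)
```

With sample data one would replace the population moments by `var()` and `cov()` estimates; the algebra is unchanged.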

5.3 Partial correlation


Suppose $\underline{X} = (X_1, \ldots, X_d)'$ is a zero mean random vector (we impose the zero mean condition
to simplify notation but it's not necessary). The partial correlation is the correlation between
$X_i$ and $X_j$, conditioned on the other elements in the vector. In other words, it is the correlation
between the residual of $X_i$ after removing its linear dependence on $\underline{X}_{-(ij)}$ (the vector
not containing $X_i$ and $X_j$) and the residual of $X_j$ after removing its linear dependence on $\underline{X}_{-(ij)}$. To obtain an expression
for this correlation we simplify notation and let $X = X_i$, $Z = X_j$ and $Y = \underline{X}_{-(ij)}$.
The notion of partial correlation can also easily be understood through projections and linear
prediction (though there are other equivalent derivations). We describe this below. Let $P_Y(X)$
denote the projection of the random variable $X$ onto the space spanned by $Y$, i.e. $P_Y(X) = \alpha'Y$ where $\alpha$ minimises
the MSE $E[X - \alpha'Y]^2$. The partial correlation between $X$ and $Z$ given $Y$ is

$$\rho_{X,Z|Y} = \frac{cov(X - P_Y(X), Z - P_Y(Z))}{\sqrt{var(X - P_Y(X))\,var(Z - P_Y(Z))}}.$$

By using the results in the previous section we have

$$P_Y(X) = \alpha_{X,Y}'Y \quad\text{and}\quad P_Y(Z) = \alpha_{Z,Y}'Y$$

where

$$\alpha_{X,Y} = [var(Y)]^{-1}cov(X, Y) \quad\text{and}\quad \alpha_{Z,Y} = [var(Y)]^{-1}cov(Z, Y). \tag{5.7}$$

Using (5.7) we can write each of the terms in $\rho_{X,Z|Y}$ in terms of the elements of the variance matrix:
i.e.

$$cov(X - P_Y(X), Z - P_Y(Z)) = cov(X, Z) - cov(X, Y)'[var(Y)]^{-1}cov(Z, Y)$$

$$var(X - P_Y(X)) = var(X) - cov(X, Y)'[var(Y)]^{-1}cov(X, Y)$$

$$var(Z - P_Y(Z)) = var(Z) - cov(Z, Y)'[var(Y)]^{-1}cov(Z, Y).$$
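The three expressions above translate directly into a small function that returns the partial correlation from a variance matrix. The function name `partial_cor` and the $3 \times 3$ matrix are hypothetical choices for this sketch.

```r
# Partial correlation between components i and j given all remaining components
partial_cor <- function(Sigma, i, j) {
  rest <- setdiff(seq_len(nrow(Sigma)), c(i, j))
  S_rest_inv <- solve(Sigma[rest, rest, drop = FALSE])   # [var(Y)]^{-1}
  c_i <- Sigma[rest, i, drop = FALSE]                    # cov(X, Y)
  c_j <- Sigma[rest, j, drop = FALSE]                    # cov(Z, Y)
  cov_ij <- Sigma[i, j] - as.numeric(t(c_i) %*% S_rest_inv %*% c_j)
  var_i  <- Sigma[i, i] - as.numeric(t(c_i) %*% S_rest_inv %*% c_i)
  var_j  <- Sigma[j, j] - as.numeric(t(c_j) %*% S_rest_inv %*% c_j)
  cov_ij / sqrt(var_i * var_j)
}

# Illustrative (assumed) positive definite variance matrix
Sigma <- matrix(c(2.0, 0.8, 0.6,
                  0.8, 1.5, 0.4,
                  0.6, 0.4, 1.0), nrow = 3, byrow = TRUE)
rho_12 <- partial_cor(Sigma, 1, 2)
```

The function needs at least one conditioning variable, so the matrix must be at least $3 \times 3$.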

Relating partial correlation and the regression coefficients We show how the above is related to the
coefficients in linear regression. Using the two-stage projection scheme described in (5.5), but
switching from Euclidean space (and scalar products) to random variables and covariances we can
write

$$X = P_Y(X) + \beta_{Z\to X}(Z - P_Y(Z)) + \varepsilon_X \quad\text{and}\quad Z = P_Y(Z) + \beta_{X\to Z}(X - P_Y(X)) + \varepsilon_Z, \tag{5.8}$$

where

$$\beta_{Z\to X} = \frac{cov(X, Z - P_Y(Z))}{var(Z - P_Y(Z))} \quad\text{and}\quad \beta_{X\to Z} = \frac{cov(Z, X - P_Y(X))}{var(X - P_Y(X))}.$$

Since $Z - P_Y(Z)$ is orthogonal to $Y$ (and thus $cov(Z - P_Y(Z), P_Y(X)) = 0$) we have

$$cov(X, Z - P_Y(Z)) = cov(X - P_Y(X), Z - P_Y(Z)).$$

This is the partial covariance (as it is the covariance of the residuals after projecting onto $Y$). This
links $\beta_{Z\to X}$ and $\beta_{X\to Z}$ to the partial covariance, since

$$\beta_{Z\to X} = \frac{cov(X - P_Y(X), Z - P_Y(Z))}{var(Z - P_Y(Z))} \quad\text{and}\quad \beta_{X\to Z} = \frac{cov(Z - P_Y(Z), X - P_Y(X))}{var(X - P_Y(X))}.$$

To connect the regression coefficients to the partial correlations we rewrite the partial
covariance in terms of the partial correlation:

$$cov(X - P_Y(X), Z - P_Y(Z)) = \rho_{X,Z|Y}\sqrt{var(X - P_Y(X))\,var(Z - P_Y(Z))}.$$

Substituting the expression for $cov(X - P_Y(X), Z - P_Y(Z))$ into the expressions for $\beta_{Z\to X}$ and $\beta_{X\to Z}$
gives

$$\beta_{Z\to X} = \rho_{X,Z|Y}\sqrt{\frac{var(X - P_Y(X))}{var(Z - P_Y(Z))}} \quad\text{and}\quad \beta_{X\to Z} = \rho_{X,Z|Y}\sqrt{\frac{var(Z - P_Y(Z))}{var(X - P_Y(X))}}. \tag{5.9}$$

This leads to the linear regressions

$$X = \underbrace{\big(P_Y(X) - \beta_{Z\to X}P_Y(Z)\big)}_{\text{in terms of } Y} + \underbrace{\beta_{Z\to X}Z}_{\text{in terms of } Z} + \varepsilon_X$$

$$Z = \underbrace{\big(P_Y(Z) - \beta_{X\to Z}P_Y(X)\big)}_{\text{in terms of } Y} + \underbrace{\beta_{X\to Z}X}_{\text{in terms of } X} + \varepsilon_Z.$$

For below, keep in mind that $var[\varepsilon_X] = var[X - P_{Y,Z}(X)]$ and $var[\varepsilon_Z] = var[Z - P_{Y,X}(Z)]$.
The identity in (5.9) relates the regression coefficients to the partial correlation. In particular,
the partial correlation is zero if and only if the corresponding regression coefficient is zero too.
We now rewrite (5.9) in terms of $var[\varepsilon_X] = var[X - P_{Y,Z}(X)]$ and $var[\varepsilon_Z] = var[Z - P_{Y,X}(Z)]$.
This requires the following identity

$$\frac{var(X - P_{Y,Z}(X))}{var(Z - P_{Y,X}(Z))} = \frac{var(X - P_Y(X))}{var(Z - P_Y(Z))}, \tag{5.10}$$

a proof of this identity is given at the end of this section. Using this identity together with (5.9)
gives

$$\beta_{Z\to X} = \rho_{X,Z|Y}\sqrt{\frac{var(\varepsilon_X)}{var(\varepsilon_Z)}} \quad\text{and}\quad \beta_{X\to Z} = \rho_{X,Z|Y}\sqrt{\frac{var(\varepsilon_Z)}{var(\varepsilon_X)}} \tag{5.11}$$

and

$$\rho_{X,Z|Y} = \beta_{Z\to X}\sqrt{\frac{var(\varepsilon_Z)}{var(\varepsilon_X)}} = \beta_{X\to Z}\sqrt{\frac{var(\varepsilon_X)}{var(\varepsilon_Z)}}. \tag{5.12}$$

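Proof of identity (5.10) We recall that

$$X_i = P_{X_{-(i,j)}}(X_i) + \beta_{ij}\big(X_j - P_{X_{-(i,j)}}(X_j)\big) + \varepsilon_i$$

$$X_j = P_{X_{-(i,j)}}(X_j) + \beta_{ji}\big(X_i - P_{X_{-(i,j)}}(X_i)\big) + \varepsilon_j.$$

To relate $var(\varepsilon_i)$ and $var(\varepsilon_{i,-j})$ we evaluate

$$\begin{aligned}
var(\varepsilon_{i,-j}) &= var\big(X_i - P_{X_{-(i,j)}}(X_i)\big) \\
&= var\big[\beta_{ij}(X_j - P_{X_{-(i,j)}}(X_j))\big] + var(\varepsilon_i) \\
&= \beta_{ij}^2\,var\big[X_j - P_{X_{-(i,j)}}(X_j)\big] + var(\varepsilon_i) \\
&= \frac{[cov(X_i, X_j - P_{X_{-(i,j)}}(X_j))]^2}{var[X_j - P_{X_{-(i,j)}}(X_j)]} + var(\varepsilon_i) \\
&= \frac{[cov(X_i - P_{X_{-(i,j)}}(X_i), X_j - P_{X_{-(i,j)}}(X_j))]^2}{var[X_j - P_{X_{-(i,j)}}(X_j)]} + var(\varepsilon_i) \\
&= \frac{c_{ij}^2}{var(\varepsilon_{j,-i})} + var(\varepsilon_i),
\end{aligned}$$

where $c_{ij} = cov(X_i - P_{X_{-(i,j)}}(X_i), X_j - P_{X_{-(i,j)}}(X_j))$. Rearranging gives
$c_{ij}^2 = var(\varepsilon_{i,-j})var(\varepsilon_{j,-i}) - var(\varepsilon_i)var(\varepsilon_{j,-i})$. By the same argument we have

$$var(\varepsilon_{j,-i}) = \frac{c_{ij}^2}{var(\varepsilon_{i,-j})} + var(\varepsilon_j) \quad\Rightarrow\quad c_{ij}^2 = var(\varepsilon_{j,-i})var(\varepsilon_{i,-j}) - var(\varepsilon_j)var(\varepsilon_{i,-j}).$$

Putting these two equations together gives

$$var(\varepsilon_{j,-i})var(\varepsilon_{i,-j}) - var(\varepsilon_j)var(\varepsilon_{i,-j}) = var(\varepsilon_{i,-j})var(\varepsilon_{j,-i}) - var(\varepsilon_i)var(\varepsilon_{j,-i}).$$

This leads to the required identity

$$\frac{var(\varepsilon_i)}{var(\varepsilon_j)} = \frac{var(\varepsilon_{i,-j})}{var(\varepsilon_{j,-i})},$$

and the desired result. $\Box$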

Example 5.3.1 Define the three random variables $X_1$, $X_2$ and $X_3$, where $X_1$ and $X_2$ are such that

$$X_1 = X_3 + \varepsilon_1, \qquad X_2 = X_3 + \varepsilon_2,$$

where $\varepsilon_1$ is independent of $X_2$ and $X_3$ and $\varepsilon_2$ is independent of $X_1$ and $X_3$ (and of course they are
independent of each other). Then $cov(X_1, X_2) = var(X_3)$, however the partial covariance between
$X_1$ and $X_2$ conditioned on $X_3$ is zero. I.e. $X_3$ is driving the dependence between $X_1$ and $X_2$; once
it is removed they are uncorrelated and, in this example, independent.
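Example 5.3.1 can be verified from the implied variance matrix. The sketch below assumes unit variances for $X_3$, $\varepsilon_1$ and $\varepsilon_2$; any positive values lead to the same conclusion.

```r
var_X3 <- 1; var_e1 <- 1; var_e2 <- 1      # assumed variances

# Variance matrix of (X1, X2, X3) with X1 = X3 + e1 and X2 = X3 + e2
Sigma <- matrix(c(var_X3 + var_e1, var_X3,          var_X3,
                  var_X3,          var_X3 + var_e2, var_X3,
                  var_X3,          var_X3,          var_X3),
                nrow = 3, byrow = TRUE)

cov_12 <- Sigma[1, 2]                       # equals var(X3)
# Partial covariance of X1 and X2 given X3
partial_cov_12 <- Sigma[1, 2] - Sigma[1, 3] * Sigma[2, 3] / Sigma[3, 3]
```

The ordinary covariance is `var_X3`, while the partial covariance is exactly zero.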

5.4 Properties of the precision matrix

5.4.1 Summary of results


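Suppose $\underline{X}' = (X_1, \ldots, X_d)$ is a zero mean random vector (we impose the zero mean condition to
simplify notation but it is not necessary), where

$$\Sigma = var[\underline{X}] \quad\text{and}\quad \Gamma = \Sigma^{-1}.$$

$\Sigma$ is called the variance matrix, $\Gamma$ is called the precision matrix. Unless stated otherwise all vectors
are column vectors. We summarize the main results above in the bullet points below. We then
relate these quantities to the precision matrix.

• $\beta_i' = (\beta_{i,1}, \ldots, \beta_{i,d})$ are the coefficients which minimise $E[X_i - \beta_i'\underline{X}_{-i}]^2$, where
$\underline{X}_{-i}' = (X_1, \ldots, X_{i-1}, X_{i+1}, \ldots, X_d)$ (all elements in $\underline{X}$ excluding $X_i$).

• $\beta_{i,-j}'$ are the coefficients which minimise $E[X_i - \beta_{i,-j}'\underline{X}_{-(i,j)}]^2$, where $\underline{X}_{-(i,j)}$ are all elements
in $\underline{X}$ excluding $X_i$ and $X_j$.

• The partial correlation between $X_i$ and $X_j$ is defined as

$$\rho_{i,j} = cor(\varepsilon_{i,-j}, \varepsilon_{j,-i}) = \frac{cov(\varepsilon_{i,-j}, \varepsilon_{j,-i})}{\sqrt{var(\varepsilon_{i,-j})\,var(\varepsilon_{j,-i})}},$$

where

$$\varepsilon_{i,-j} = X_i - \beta_{i,-j}'\underline{X}_{-(i,j)}, \qquad \varepsilon_{j,-i} = X_j - \beta_{j,-i}'\underline{X}_{-(i,j)}.$$

It can be shown that

$$cov(\varepsilon_{i,-j}, \varepsilon_{j,-i}) = cov(X_i, X_j) - cov(X_i, \underline{X}_{-(i,j)}')\,var[\underline{X}_{-(i,j)}]^{-1}cov(X_j, \underline{X}_{-(i,j)})$$

$$var(\varepsilon_{i,-j}) = var(X_i) - cov(X_i, \underline{X}_{-(i,j)}')\,var[\underline{X}_{-(i,j)}]^{-1}cov(X_i, \underline{X}_{-(i,j)})$$

$$var(\varepsilon_{j,-i}) = var(X_j) - cov(X_j, \underline{X}_{-(i,j)}')\,var[\underline{X}_{-(i,j)}]^{-1}cov(X_j, \underline{X}_{-(i,j)}).$$

• The regression coefficients and partial correlation are related through the identity

$$\beta_{ij} = \rho_{ij}\sqrt{\frac{var(\varepsilon_i)}{var(\varepsilon_j)}}. \tag{5.13}$$

Let $\Gamma_{i,j}$ denote the $(i, j)$th entry in the precision matrix $\Gamma = \Sigma^{-1}$. Then $\Gamma_{i,j}$ satisfies the following
well known properties

$$\Gamma_{ii} = \frac{1}{E[X_i - \beta_i'\underline{X}_{-i}]^2}.$$

For $i \neq j$ we have $\Gamma_{i,j} = -\beta_{i,j}/E[X_i - \beta_i'\underline{X}_{-i}]^2$ and

$$\beta_{i,j} = -\frac{\Gamma_{i,j}}{\Gamma_{ii}} \quad\text{and}\quad \rho_{i,j} = -\frac{\Gamma_{i,j}}{\sqrt{\Gamma_{ii}\Gamma_{jj}}}.$$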

5.4.2 Proof of results
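Regression and the precision matrix The precision matrix contains many hidden treasures. We
start by showing that the entries of the precision matrix contain the regression coefficients of $X_i$
regressed on the random vector $\underline{X}_{-i} = (X_1, \ldots, X_{i-1}, X_{i+1}, \ldots, X_d)$. We will show that the $i$th
row of $\Sigma^{-1}$ is

$$\big(-\beta_{i1}/\sigma_i^2,\; -\beta_{i2}/\sigma_i^2,\; \ldots,\; 1/\sigma_i^2,\; \ldots,\; -\beta_{id}/\sigma_i^2\big),$$

where $\sigma_i^2 = E[X_i - \beta_i'\underline{X}_{-i}]^2$, $\sum_{j\neq i}\beta_{ij}X_j$ is the best linear predictor of $X_i$ given $\underline{X}_{-i}$ and the $i$th
entry is $1/\sigma_i^2$ (notation can be simplified if we set $\beta_{ii} = -1$). And equivalently the $i$th column of $\Sigma^{-1}$
is the transpose of the vector

$$\big(-\beta_{i1}/\sigma_i^2,\; -\beta_{i2}/\sigma_i^2,\; \ldots,\; -\beta_{id}/\sigma_i^2\big).$$

Though it may seem surprising at first, the result is very logical.

We recall that the coefficients $\beta_i = (\beta_{i,1}, \ldots, \beta_{i,d})$ are the coefficients which minimise
$E[X_i - \beta_i'\underline{X}_{-i}]^2$. This is equivalent to the derivative of the MSE being zero, which gives rise to the classical
normal equations

$$E\Big[\Big(X_i - \sum_{j\neq i}\beta_{i,j}X_j\Big)X_\ell\Big] = \Sigma_{i,\ell} - \sum_{j\neq i}\beta_{i,j}\Sigma_{j,\ell} = 0, \qquad 1 \leq \ell \leq d,\; \ell \neq i.$$

Further, since $X_i - \sum_{j\neq i}\beta_{i,j}X_j$ is orthogonal to $X_j$ (for $j \neq i$) we have

$$E\Big[\Big(X_i - \sum_{j\neq i}\beta_{i,j}X_j\Big)X_i\Big] = E\Big[\Big(X_i - \sum_{j\neq i}\beta_{i,j}X_j\Big)^2\Big].$$

Recall that each row in the precision matrix is orthogonal to all the columns in $\Sigma$ except one. We
show below that this corresponds to precisely the normal equation. It is easiest seen through
the simple example of a $4 \times 4$ variance matrix

$$\begin{pmatrix} c_{11} & c_{12} & c_{13} & c_{14} \\ c_{21} & c_{22} & c_{23} & c_{24} \\ c_{31} & c_{32} & c_{33} & c_{34} \\ c_{41} & c_{42} & c_{43} & c_{44} \end{pmatrix}$$

and the corresponding regression matrix

$$\begin{pmatrix} 1 & -\beta_{12} & -\beta_{13} & -\beta_{14} \\ -\beta_{21} & 1 & -\beta_{23} & -\beta_{24} \\ -\beta_{31} & -\beta_{32} & 1 & -\beta_{34} \\ -\beta_{41} & -\beta_{42} & -\beta_{43} & 1 \end{pmatrix}.$$

We recall from the definition of $\beta_1$ that the inner product between $c_1 = (c_{11}, c_{12}, c_{13}, c_{14})$ and
$\widetilde{\beta}_1 = (1, -\beta_{12}, -\beta_{13}, -\beta_{14})$ is

$$\widetilde{\beta}_1 c_1' = \langle\widetilde{\beta}_1, c_1\rangle = c_{11} - \beta_{12}c_{12} - \beta_{13}c_{13} - \beta_{14}c_{14} = E\Big[\Big(X_1 - \sum_{j=2}^{4}\beta_{1,j}X_j\Big)X_1\Big] = E\Big[\Big(X_1 - \sum_{j=2}^{4}\beta_{1,j}X_j\Big)^2\Big].$$

Similarly

$$\widetilde{\beta}_1 c_2' = \langle\widetilde{\beta}_1, c_2\rangle = c_{21} - \beta_{12}c_{22} - \beta_{13}c_{23} - \beta_{14}c_{24} = E\Big[\Big(X_1 - \sum_{j=2}^{4}\beta_{1,j}X_j\Big)X_2\Big] = 0.$$

The same is true for the other $c_j$ and $\widetilde{\beta}_j$. Based on these observations, we observe that the
regression coefficients/normal equations give the orthogonal projections and

$$\begin{pmatrix} 1 & -\beta_{12} & -\beta_{13} & -\beta_{14} \\ -\beta_{21} & 1 & -\beta_{23} & -\beta_{24} \\ -\beta_{31} & -\beta_{32} & 1 & -\beta_{34} \\ -\beta_{41} & -\beta_{42} & -\beta_{43} & 1 \end{pmatrix}\begin{pmatrix} c_{11} & c_{12} & c_{13} & c_{14} \\ c_{21} & c_{22} & c_{23} & c_{24} \\ c_{31} & c_{32} & c_{33} & c_{34} \\ c_{41} & c_{42} & c_{43} & c_{44} \end{pmatrix} = \mathrm{diag}(\sigma_1^2, \sigma_2^2, \sigma_3^2, \sigma_4^2),$$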

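where $\sigma_j^2 = E[(X_j - \sum_{i\neq j}\beta_{j,i}X_i)^2]$. Therefore the inverse of $\Sigma$ is

$$\Sigma^{-1} = \mathrm{diag}(\sigma_1^2, \sigma_2^2, \sigma_3^2, \sigma_4^2)^{-1}\begin{pmatrix} 1 & -\beta_{12} & -\beta_{13} & -\beta_{14} \\ -\beta_{21} & 1 & -\beta_{23} & -\beta_{24} \\ -\beta_{31} & -\beta_{32} & 1 & -\beta_{34} \\ -\beta_{41} & -\beta_{42} & -\beta_{43} & 1 \end{pmatrix} = \begin{pmatrix} 1/\sigma_1^2 & -\beta_{12}/\sigma_1^2 & -\beta_{13}/\sigma_1^2 & -\beta_{14}/\sigma_1^2 \\ -\beta_{21}/\sigma_2^2 & 1/\sigma_2^2 & -\beta_{23}/\sigma_2^2 & -\beta_{24}/\sigma_2^2 \\ -\beta_{31}/\sigma_3^2 & -\beta_{32}/\sigma_3^2 & 1/\sigma_3^2 & -\beta_{34}/\sigma_3^2 \\ -\beta_{41}/\sigma_4^2 & -\beta_{42}/\sigma_4^2 & -\beta_{43}/\sigma_4^2 & 1/\sigma_4^2 \end{pmatrix}.$$

By a similar argument (using the symmetry of $\Sigma^{-1}$) we have

$$\Sigma^{-1} = \begin{pmatrix} 1 & -\beta_{21} & -\beta_{31} & -\beta_{41} \\ -\beta_{12} & 1 & -\beta_{32} & -\beta_{42} \\ -\beta_{13} & -\beta_{23} & 1 & -\beta_{43} \\ -\beta_{14} & -\beta_{24} & -\beta_{34} & 1 \end{pmatrix}\mathrm{diag}(\sigma_1^2, \sigma_2^2, \sigma_3^2, \sigma_4^2)^{-1} = \begin{pmatrix} 1/\sigma_1^2 & -\beta_{21}/\sigma_2^2 & -\beta_{31}/\sigma_3^2 & -\beta_{41}/\sigma_4^2 \\ -\beta_{12}/\sigma_1^2 & 1/\sigma_2^2 & -\beta_{32}/\sigma_3^2 & -\beta_{42}/\sigma_4^2 \\ -\beta_{13}/\sigma_1^2 & -\beta_{23}/\sigma_2^2 & 1/\sigma_3^2 & -\beta_{43}/\sigma_4^2 \\ -\beta_{14}/\sigma_1^2 & -\beta_{24}/\sigma_2^2 & -\beta_{34}/\sigma_3^2 & 1/\sigma_4^2 \end{pmatrix}.$$

In summary, the normal equations give the matrix multiplication required for a diagonal matrix
(which is exactly the definition of $\Sigma\Gamma = I$, up to a change in the diagonal).
Clearly, the above proof holds for all dimensions and we have

$$\Gamma_{ii} = \frac{1}{\sigma_i^2},$$

and

$$\Gamma_{ij} = -\frac{\beta_{ij}}{\sigma_i^2} \quad\Rightarrow\quad \beta_{i,j} = -\frac{\Gamma_{ij}}{\Gamma_{ii}}.$$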

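Writing the partial correlation in terms of elements of the precision matrix By using the identity
(5.12) (and that $\beta_{ij} = \beta_{j\to i}$) we have

$$\rho_{ij} = \beta_{ij}\sqrt{\frac{var[\varepsilon_j]}{var[\varepsilon_i]}}. \tag{5.14}$$

We recall that $\Gamma_{ii} = var(X_i - P_{X_{-i}}(X_i))^{-1}$, $\Gamma_{jj} = var(X_j - P_{X_{-j}}(X_j))^{-1}$ and $\Gamma_{ij} = -\beta_{ij}\Gamma_{ii}$, which gives

$$\rho_{ij} = -\frac{\Gamma_{ij}}{\Gamma_{ii}}\sqrt{\frac{\Gamma_{ii}}{\Gamma_{jj}}} = -\frac{\Gamma_{ij}}{\sqrt{\Gamma_{ii}\Gamma_{jj}}}.$$

The above represents the partial correlation in terms of entries of the precision matrix.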
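The relations between regression coefficients, prediction variances and the precision matrix can be confirmed numerically. The $4 \times 4$ variance matrix below is an arbitrary positive definite example chosen for this sketch.

```r
# Illustrative (assumed) positive definite variance matrix
Sigma <- matrix(c(2.0, 0.8, 0.6, 0.3,
                  0.8, 1.5, 0.4, 0.2,
                  0.6, 0.4, 1.2, 0.1,
                  0.3, 0.2, 0.1, 1.2), nrow = 4, byrow = TRUE)
d <- nrow(Sigma)
Gamma <- solve(Sigma)                 # precision matrix

B <- diag(d)                          # regression matrix: 1 on the diagonal, -beta_ij elsewhere
sigma2 <- numeric(d)                  # sigma_i^2 = E[X_i - beta_i' X_{-i}]^2
for (i in seq_len(d)) {
  beta_i <- solve(Sigma[-i, -i], Sigma[-i, i])     # regress X_i on all other variables
  B[i, -i] <- -beta_i
  sigma2[i] <- Sigma[i, i] - sum(Sigma[i, -i] * beta_i)
}

Gamma_from_regression <- diag(1 / sigma2) %*% B    # diag(sigma^2)^{-1} times regression matrix
partial_cor <- -Gamma / sqrt(outer(diag(Gamma), diag(Gamma)))   # off-diagonal entries are rho_ij
```

`B %*% Sigma` is diagonal with entries `sigma2`, and `Gamma_from_regression` reproduces `Gamma` up to rounding error. The diagonal of `partial_cor` is $-1$ by construction and carries no meaning.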

5.5 Appendix

Alternative derivations based on matrix identities


The above derivations are based on properties of normal equations and some algebraic manipulations.
An alternative set of derivations is given in terms of the inversion of block matrices,
specifically with the classical matrix inversion identities

$$\begin{pmatrix} A & B \\ C & D \end{pmatrix}^{-1} = \begin{pmatrix} A^{-1} + A^{-1}BP_1^{-1}CA^{-1} & -A^{-1}BP_1^{-1} \\ -P_1^{-1}CA^{-1} & P_1^{-1} \end{pmatrix} = \begin{pmatrix} P_2^{-1} & -P_2^{-1}BD^{-1} \\ -D^{-1}CP_2^{-1} & D^{-1} + D^{-1}CP_2^{-1}BD^{-1} \end{pmatrix}, \tag{5.15}$$

where $P_1 = (D - CA^{-1}B)$ and $P_2 = (A - BD^{-1}C)$, rather than using the idea of normal equations in
projections.
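As a sanity check, the second form of the block inverse in (5.15) can be assembled by hand and compared with a direct inverse. The matrix `M` is an arbitrary symmetric positive definite example, split into $2 \times 2$ blocks.

```r
M <- matrix(c(2.0, 0.8, 0.6, 0.3,
              0.8, 1.5, 0.4, 0.2,
              0.6, 0.4, 1.2, 0.1,
              0.3, 0.2, 0.1, 1.2), nrow = 4, byrow = TRUE)
A <- M[1:2, 1:2]; B <- M[1:2, 3:4]
C <- M[3:4, 1:2]; D <- M[3:4, 3:4]

D_inv  <- solve(D)
P2     <- A - B %*% D_inv %*% C            # P_2 = A - B D^{-1} C
P2_inv <- solve(P2)

# Second form of the block inverse
M_inv_blocks <- rbind(
  cbind(P2_inv,                  -P2_inv %*% B %*% D_inv),
  cbind(-D_inv %*% C %*% P2_inv, D_inv + D_inv %*% C %*% P2_inv %*% B %*% D_inv)
)
```

`M_inv_blocks` should match `solve(M)`; when `M` is a variance matrix, `P2` is the variance matrix of the first two components after removing their linear dependence on the last two.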

The precision matrix and partial correlation

Let us suppose that $\underline{X} = (X_1, \ldots, X_d)$ is a zero mean random vector with variance $\Sigma$. The $(i, j)$th
element of $\Sigma$ is the covariance $cov(X_i, X_j) = \Sigma_{ij}$. Here we consider the inverse of $\Sigma$, and what
information the $(i, j)$th element of the inverse tells us about the correlation between $X_i$ and $X_j$. Let $\Sigma^{ij}$
denote the $(i, j)$th element of $\Sigma^{-1}$. We will show that with appropriate standardisation, $\Sigma^{ij}$ is the
negative partial correlation between $X_i$ and $X_j$. More precisely,

$$\frac{\Sigma^{ij}}{\sqrt{\Sigma^{ii}\Sigma^{jj}}} = -\rho_{ij}. \tag{5.16}$$

The proof uses the inverse of block matrices. To simplify the notation, we will focus on the $(1, 2)$th
element of $\Sigma$ and $\Sigma^{-1}$ (which concerns the correlation between $X_1$ and $X_2$).

Remark 5.5.1 Remember the reason we can always focus on the top two elements of X is because
we can always use a permutation matrix to permute the Xi and Xj such that they become the top
two elements. Since the inverse of the permutation matrix is simply its transpose everything still
holds.

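Let $\underline{X}_{1,2} = (X_1, X_2)'$, $\underline{X}_{-(1,2)} = (X_3, \ldots, X_d)'$, $\Sigma_{-(1,2)} = var(\underline{X}_{-(1,2)})$, $c_{1,2} = cov(\underline{X}_{1,2}, \underline{X}_{-(1,2)})$
and $\Sigma_{1,2} = var(\underline{X}_{1,2})$. Using this notation it is clear that

$$var(\underline{X}) = \Sigma = \begin{pmatrix} \Sigma_{1,2} & c_{1,2} \\ c_{1,2}' & \Sigma_{-(1,2)} \end{pmatrix}. \tag{5.17}$$

By using (5.15) we have

$$\Sigma^{-1} = \begin{pmatrix} P^{-1} & -P^{-1}c_{1,2}\Sigma_{-(1,2)}^{-1} \\ -\Sigma_{-(1,2)}^{-1}c_{1,2}'P^{-1} & \Sigma_{-(1,2)}^{-1} + \Sigma_{-(1,2)}^{-1}c_{1,2}'P^{-1}c_{1,2}\Sigma_{-(1,2)}^{-1} \end{pmatrix}, \tag{5.18}$$

where $P = (\Sigma_{1,2} - c_{1,2}\Sigma_{-(1,2)}^{-1}c_{1,2}')$. Comparing $P$ with (??), we see that $P$ is the $2 \times 2$ variance/covariance
matrix of the residuals of $\underline{X}_{1,2}$ conditioned on $\underline{X}_{-(1,2)}$. Thus the partial correlation
between $X_1$ and $X_2$ is

$$\rho_{1,2} = \frac{P_{1,2}}{\sqrt{P_{1,1}P_{2,2}}} \tag{5.19}$$

where $P_{ij}$ denotes the elements of the matrix $P$. Inverting $P$ (since it is a two by two matrix), we
see that

$$P^{-1} = \frac{1}{P_{1,1}P_{2,2} - P_{1,2}^2}\begin{pmatrix} P_{2,2} & -P_{1,2} \\ -P_{1,2} & P_{1,1} \end{pmatrix}. \tag{5.20}$$

Thus, by comparing (5.18) and (5.20) and by the definition of partial correlation given in (5.19) we
have

$$\frac{P^{(1,2)}}{\sqrt{P^{(1,1)}P^{(2,2)}}} = -\rho_{1,2},$$

where $P^{(i,j)}$ denotes the $(i, j)$th element of $P^{-1}$. Let $\Sigma^{ij}$ denote the $(i, j)$th element of $\Sigma^{-1}$. Thus we have shown (5.16):

$$\rho_{ij} = -\frac{\Sigma^{ij}}{\sqrt{\Sigma^{ii}\Sigma^{jj}}}. \tag{5.21}$$

In other words, the $(i, j)$th element of $\Sigma^{-1}$ divided by the square root of the product of the corresponding diagonal entries gives the negative
partial correlation. Therefore, if the partial correlation between $X_i$ and $X_j$ given $\underline{X}_{-(ij)}$ is zero, then
$\Sigma^{ij} = 0$.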

The precision matrix and the coefficients in regression

The precision matrix, $\Sigma^{-1}$, contains many other hidden treasures. For example, the coefficients of
$\Sigma^{-1}$ convey information about the best linear predictor of $X_i$ given $\underline{X}_{-i} = (X_1, \ldots, X_{i-1}, X_{i+1}, \ldots, X_d)$
(all elements of $\underline{X}$ except $X_i$). Let

$$X_i = \sum_{j\neq i}\beta_{i,j}X_j + \varepsilon_i,$$

where $\{\beta_{i,j}\}$ are the coefficients of the best linear predictor. Then it can be shown that

$$\beta_{i,j} = -\frac{\Sigma^{ij}}{\Sigma^{ii}} \quad\text{and}\quad \Sigma^{ii} = \frac{1}{E[X_i - \sum_{j\neq i}\beta_{i,j}X_j]^2}. \tag{5.22}$$

The precision matrix and the mean squared prediction error

We start with a well known expression, which expresses the prediction errors in terms of the
determinant of matrices.
We recall that the prediction error is

E[Y − Yb ]2 = σY − ΣY X Σ−1
XX ΣXY (5.23)

151
with σY = var[Y ]. Let
 
var[Y ] ΣY X
Σ= . (5.24)
ΣXY ΣXX

We show below that the prediction error can be rewritten as

det(Σ)
E[Y − Yb ]2 = σY − ΣY X Σ−1
XX ΣXY = . (5.25)
det(ΣXX )

Furthermore,

1 1
Σ−1

11
= −1 = . (5.26)
σY − ΣY X ΣXX ΣXY E[Y − Yb ]2

Proof of (5.25) and (5.26) To prove this result we use


 
A B
 = det(D) det A − BD−1 C .

det  (5.27)
C D

Applying this to (5.27) gives

det(Σ) = det(ΣXX ) σY − ΣY X Σ−1



XX ΣXY

⇒ det(Σ) = det(ΣXX )E[Y − Yb ]2 , (5.28)

thus giving (5.25).


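To prove (5.26) we use the following result on the inverse of block matrices

\[
\begin{pmatrix} A & B \\ C & D \end{pmatrix}^{-1}
= \begin{pmatrix} A^{-1} + A^{-1}BP_1^{-1}CA^{-1} & -A^{-1}BP_1^{-1} \\ -P_1^{-1}CA^{-1} & P_1^{-1} \end{pmatrix}
= \begin{pmatrix} P_2^{-1} & -P_2^{-1}BD^{-1} \\ -D^{-1}CP_2^{-1} & D^{-1} + D^{-1}CP_2^{-1}BD^{-1} \end{pmatrix}, \tag{5.29}
\]

where $P_1 = (D - CA^{-1}B)$ and $P_2 = (A - BD^{-1}C)$. This block inverse turns out to be crucial in deriving many of the interesting properties associated with the inverse of a matrix. We now show that the inverse of the matrix $\Sigma$, $\Sigma^{-1}$ (usually called the precision matrix), contains the mean squared error.

Comparing the above with (5.24) and (5.23), that is taking $A = \mathrm{var}[Y]$, $B = \Sigma_{YX}$, $C = \Sigma_{XY}$ and $D = \Sigma_{XX}$ so that the top left entry of the inverse is $P_2^{-1}$, we see that

\[
\left(\Sigma^{-1}\right)_{11} = \frac{1}{\sigma_Y - \Sigma_{YX}\Sigma_{XX}^{-1}\Sigma_{XY}} = \frac{1}{E[Y-\widehat{Y}]^2},
\]

which immediately proves (5.26).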
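The identities (5.22), (5.25) and (5.26) can be checked numerically on a small example. The following is a minimal sketch in Python using exact fractions; the 3 x 3 covariance matrix is a made-up illustration (not taken from the notes), and the helper functions `inverse` and `det` are written here only for this check.

```python
from fractions import Fraction

def inverse(m):
    """Invert a small square matrix by Gauss-Jordan elimination (exact with Fractions)."""
    n = len(m)
    # Augment the matrix with the identity.
    aug = [list(row) + [Fraction(int(i == j)) for j in range(n)] for i, row in enumerate(m)]
    for col in range(n):
        # Bring a row with a non-zero pivot into position.
        piv = next(r for r in range(col, n) if aug[r][col] != 0)
        aug[col], aug[piv] = aug[piv], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [a - f * b for a, b in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

def det(m):
    """Determinant by cofactor expansion along the first row (fine for tiny matrices)."""
    if len(m) == 1:
        return m[0][0]
    return sum((-1) ** j * m[0][j] * det([row[:j] + row[j + 1:] for row in m[1:]])
               for j in range(len(m)))

# Hypothetical covariance matrix of (Y, X1, X2); Y is the first variable.
sigma = [[Fraction(4), Fraction(2), Fraction(1)],
         [Fraction(2), Fraction(3), Fraction(1)],
         [Fraction(1), Fraction(1), Fraction(2)]]
prec = inverse(sigma)                       # the precision matrix

# Best linear predictor of Y given (X1, X2): beta = Sigma_XX^{-1} Sigma_XY.
sxx = [row[1:] for row in sigma[1:]]
sxy = sigma[0][1:]
sxx_inv = inverse(sxx)
beta = [sum(sxx_inv[i][j] * sxy[j] for j in range(2)) for i in range(2)]
mse = sigma[0][0] - sum(b * s for b, s in zip(beta, sxy))   # prediction error (5.23)

print(beta, [-prec[0][1] / prec[0][0], -prec[0][2] / prec[0][0]])  # both sides of (5.22)
print(mse, det(sigma) / det(sxx), 1 / prec[0][0])                  # (5.25) and (5.26)
```

In this example the regression coefficients, the determinant ratio and the reciprocal of the first diagonal entry of the precision matrix all agree exactly, which is what (5.22), (5.25) and (5.26) state.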

The Cholesky decomposition and the precision matrix

We now represent the precision matrix through its Cholesky decomposition. It should be mentioned
that Mohsen Pourahmadi has done a lot of interesting research in this area and he recently wrote
a review paper, which can be found here.
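We define the sequence of linear equations

\[
X_t = \sum_{j=1}^{t-1}\beta_{t,j}X_j + \varepsilon_t, \qquad t = 2,\ldots,k, \tag{5.30}
\]

where $\{\beta_{t,j}; 1\leq j\leq t-1\}$ are the coefficients of the best linear predictor of $X_t$ given $X_1,\ldots,X_{t-1}$. Let $\sigma_t^2 = \mathrm{var}[\varepsilon_t] = E[X_t - \sum_{j=1}^{t-1}\beta_{t,j}X_j]^2$ and $\sigma_1^2 = \mathrm{var}[X_1]$. We standardize (5.30) and define

\[
\sum_{j=1}^{t}\gamma_{t,j}X_j = \frac{1}{\sigma_t}\left(X_t - \sum_{j=1}^{t-1}\beta_{t,j}X_j\right), \tag{5.31}
\]

where we set $\gamma_{t,t} = 1/\sigma_t$ and for $1\leq j\leq t-1$, $\gamma_{t,j} = -\beta_{t,j}/\sigma_t$. By construction it is clear that $\mathrm{var}(LX) = I_k$, where

\[
L = \begin{pmatrix}
\gamma_{1,1} & 0 & 0 & \ldots & 0 & 0 \\
\gamma_{2,1} & \gamma_{2,2} & 0 & \ldots & 0 & 0 \\
\gamma_{3,1} & \gamma_{3,2} & \gamma_{3,3} & \ldots & 0 & 0 \\
\vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\
\gamma_{k,1} & \gamma_{k,2} & \gamma_{k,3} & \ldots & \gamma_{k,k-1} & \gamma_{k,k}
\end{pmatrix}. \tag{5.32}
\]

Since $\mathrm{var}(LX) = L\Sigma L' = I_k$, where $\Sigma = \mathrm{var}(\underline{X}_k)$, we have $L'L = \Sigma^{-1}$ (see Pourahmadi, equation (18)). Thus

\[
\Sigma^{ij} = \sum_{s=1}^{k}\gamma_{s,i}\gamma_{s,j} \quad\text{(note many of the elements will be zero).}
\]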

Remark 5.5.2 (The Cholesky decomposition of a matrix) All positive definite matrices admit a Cholesky decomposition. That is $HH' = \Sigma$, where $H$ is a lower triangular matrix. Similarly, $\Sigma^{-1} = L'L$, where $L$ is a lower triangular matrix and $L = H^{-1}$. Therefore we observe that if $\Sigma = \mathrm{var}(X)$ (where $X$ is a $p$-dimensional random vector), then

\[
\mathrm{var}(LX) = L\Sigma L' = LHH'L' = I_p.
\]

Therefore, the lower triangular matrix $L$ “finds” a linear combination of the elements of $X$ such that the resulting random vector is uncorrelated.
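As an illustration of the remark, the sketch below computes the lower triangular factor of a small covariance matrix and checks that its inverse decorrelates the vector. It is a minimal Python sketch under stated assumptions: the covariance is that of three consecutive observations from the AR(1) model $X_t = 0.5X_{t-1}+\varepsilon_t$ with unit innovation variance (chosen for illustration), the convention used is $\Sigma = HH'$ with $L = H^{-1}$, and the helper functions are written only for this example.

```python
import math

def cholesky(s):
    """Lower triangular H with H H' = s (s symmetric positive definite)."""
    n = len(s)
    h = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            acc = s[i][j] - sum(h[i][m] * h[j][m] for m in range(j))
            h[i][j] = math.sqrt(acc) if i == j else acc / h[j][j]
    return h

def lower_inverse(h):
    """Inverse of a lower triangular matrix by forward substitution."""
    n = len(h)
    inv = [[0.0] * n for _ in range(n)]
    for col in range(n):
        for i in range(col, n):
            rhs = 1.0 if i == col else 0.0
            rhs -= sum(h[i][m] * inv[m][col] for m in range(col, i))
            inv[i][col] = rhs / h[i][i]
    return inv

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    return [list(r) for r in zip(*a)]

# Covariance of (X1, X2, X3) for X_t = 0.5 X_{t-1} + e_t with var(e_t) = 1.
phi = 0.5
sigma = [[phi ** abs(i - j) / (1 - phi ** 2) for j in range(3)] for i in range(3)]

H = cholesky(sigma)
L = lower_inverse(H)
whitened = matmul(matmul(L, sigma), transpose(L))   # var(LX) = L Sigma L'
precision = matmul(transpose(L), L)                 # L'L = Sigma^{-1}

for row in L:
    print([round(v, 6) for v in row])
```

The last row of `L` comes out as (0, -0.5, 1): it is the standardized prediction error $X_3 - 0.5X_2$, in line with the construction in (5.31), and the zero in its first position reflects that $X_1$ does not help predict $X_3$ once $X_2$ is known.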

We apply these results to the analysis of the partial correlations of autoregressive processes
and the inverse of their variance/covariance matrices.

A little bit more in depth: general vector spaces

First a brief definition of a vector space. $\mathcal{X}$ is called a vector space if for every $x, y\in\mathcal{X}$ and $a, b\in\mathbb{R}$ (this can be generalised to $\mathbb{C}$), $ax + by\in\mathcal{X}$. An inner product space is a vector space which comes with an inner product, in other words for every pair of elements $x, y\in\mathcal{X}$ we can define an inner product $\langle x, y\rangle$, where $\langle\cdot,\cdot\rangle$ satisfies all the conditions of an inner product. Thus for every element $x\in\mathcal{X}$ we can define its norm as $\|x\| = \sqrt{\langle x, x\rangle}$. If the inner product space is complete (meaning the limit of every Cauchy sequence in the space is also in the space) then the inner product space is a Hilbert space (see wiki).

Example 5.5.1 (i) The Euclidean space $\mathbb{R}^n$ described above is a classical example of a Hilbert space. Here the inner product between two elements is simply the scalar product, $\langle x, y\rangle = \sum_{i=1}^{n}x_iy_i$.

(ii) The subset of the probability space $(\Omega, \mathcal{F}, P)$, where all the random variables defined on $\Omega$ have a finite second moment, i.e. $E(X^2) = \int_{\Omega}X(\omega)^2\,dP(\omega) < \infty$. This space is denoted as $L_2(\Omega, \mathcal{F}, P)$. In this case, the inner product is $\langle X, Y\rangle = E(XY)$.

(iii) The function space $L_2[\mathbb{R}, \mu]$, where $f\in L_2[\mathbb{R}, \mu]$ if $f$ is $\mu$-measurable and

\[
\int_{\mathbb{R}}|f(x)|^2\,d\mu(x) < \infty,
\]

is a Hilbert space. For this space, the inner product is defined as

\[
\langle f, g\rangle = \int_{\mathbb{R}}f(x)g(x)\,d\mu(x).
\]

It is straightforward to generalize the above to complex random variables and functions defined on $\mathbb{C}$. We simply need to remember to take conjugates when defining the inner product, i.e. $\langle X, Y\rangle = \mathrm{cov}(X, Y)$ (with the conjugate taken on $Y$) and $\langle f, g\rangle = \int_{\mathbb{C}}f(z)\overline{g(z)}\,d\mu(z)$.

In this chapter our focus will be on certain spaces of random variables which have a finite variance.

Basis

The random variables $\{X_t, X_{t-1},\ldots,X_1\}$ span the space $\mathcal{X}_t^1$ (denoted as $\mathrm{sp}(X_t, X_{t-1},\ldots,X_1)$), if for every $Y\in\mathcal{X}_t^1$, there exist coefficients $\{a_j\in\mathbb{R}\}$ such that

\[
Y = \sum_{j=1}^{t}a_jX_{t+1-j}. \tag{5.33}
\]

Moreover, $\mathrm{sp}(X_t, X_{t-1},\ldots,X_1) = \mathcal{X}_t^1$ if for every $\{a_j\in\mathbb{R}\}$, $\sum_{j=1}^{t}a_jX_{t+1-j}\in\mathcal{X}_t^1$. We now define the basis of a vector space, which is closely related to the span. The random variables $\{X_t,\ldots,X_1\}$ form a basis of the space $\mathcal{X}_t^1$, if for every $Y\in\mathcal{X}_t^1$ we have a representation (5.33) and this representation is unique. More precisely, there does not exist another set of coefficients $\{\phi_j\}$ such that $Y = \sum_{j=1}^{t}\phi_jX_{t+1-j}$. For this reason, one can consider a basis as the minimal span, that is the smallest set of elements which can span a space.

Definition 5.5.1 (Projections) The projection of the random variable $Y$ onto the space $\mathrm{sp}(X_t, X_{t-1},\ldots,X_1)$ (often denoted as $P_{X_t,X_{t-1},\ldots,X_1}(Y)$) is defined as $P_{X_t,X_{t-1},\ldots,X_1}(Y) = \sum_{j=1}^{t}c_jX_{t+1-j}$, where $\{c_j\}$ is chosen such that the difference $Y - P_{X_t,X_{t-1},\ldots,X_1}(Y)$ is uncorrelated (orthogonal/perpendicular) to any element in $\mathrm{sp}(X_t, X_{t-1},\ldots,X_1)$. In other words, $P_{X_t,X_{t-1},\ldots,X_1}(Y)$ is the best linear predictor of $Y$ given $X_t,\ldots,X_1$.

Orthogonal basis

An orthogonal basis is a basis, where every element in the basis is orthogonal to every other element
in the basis. It is straightforward to orthogonalize any given basis using the method of projections.

To simplify notation let $X_{t|t-1} = P_{X_{t-1},\ldots,X_1}(X_t)$. By definition, $X_t - X_{t|t-1}$ is orthogonal to the space $\mathrm{sp}(X_{t-1}, X_{t-2},\ldots,X_1)$. In other words $X_t - X_{t|t-1}$ and $X_s$ ($1\leq s\leq t-1$) are orthogonal ($\mathrm{cov}(X_s, X_t - X_{t|t-1}) = 0$), and by a similar argument $X_t - X_{t|t-1}$ and $X_s - X_{s|s-1}$ are orthogonal.
Thus by using projections we have created an orthogonal basis $X_1, (X_2 - X_{2|1}),\ldots,(X_t - X_{t|t-1})$ of the space $\mathrm{sp}(X_1, (X_2 - X_{2|1}),\ldots,(X_t - X_{t|t-1}))$. By construction it is clear that $\mathrm{sp}(X_1, (X_2 - X_{2|1}),\ldots,(X_t - X_{t|t-1}))$ is a subspace of $\mathrm{sp}(X_t,\ldots,X_1)$. We now show that
$\mathrm{sp}(X_1, (X_2 - X_{2|1}),\ldots,(X_t - X_{t|t-1})) = \mathrm{sp}(X_t,\ldots,X_1)$.
To do this we define the sum of spaces. If $U$ and $V$ are two orthogonal vector spaces (which share the same inner product), then $y\in U\oplus V$ if there exist $u\in U$ and $v\in V$ such that $y = u + v$. By the definition of $\mathcal{X}_t^1$, it is clear that $(X_t - X_{t|t-1})\in\mathcal{X}_t^1$, but $(X_t - X_{t|t-1})\notin\mathcal{X}_{t-1}^1$. Hence $\mathcal{X}_t^1 = \overline{\mathrm{sp}}(X_t - X_{t|t-1})\oplus\mathcal{X}_{t-1}^1$. Continuing this argument we see that $\mathcal{X}_t^1 = \overline{\mathrm{sp}}(X_t - X_{t|t-1})\oplus\overline{\mathrm{sp}}(X_{t-1} - X_{t-1|t-2})\oplus\ldots\oplus\overline{\mathrm{sp}}(X_1)$. Hence $\overline{\mathrm{sp}}(X_t,\ldots,X_1) = \overline{\mathrm{sp}}(X_t - X_{t|t-1},\ldots,X_2 - X_{2|1}, X_1)$.
Therefore for every $P_{X_t,\ldots,X_1}(Y) = \sum_{j=1}^{t}a_jX_{t+1-j}$, there exist coefficients $\{b_j\}$ such that

\[
P_{X_t,\ldots,X_1}(Y) = P_{X_t - X_{t|t-1},\ldots,X_2 - X_{2|1},X_1}(Y) = \sum_{j=1}^{t}P_{X_{t+1-j} - X_{t+1-j|t-j}}(Y) = \sum_{j=1}^{t-1}b_j(X_{t+1-j} - X_{t+1-j|t-j}) + b_tX_1,
\]

where $b_j = E[Y(X_{t+1-j} - X_{t+1-j|t-j})]/E[(X_{t+1-j} - X_{t+1-j|t-j})^2]$ (with the convention $X_{1|0} = 0$). A useful application of an orthogonal basis is the ease of obtaining the coefficients $b_j$, which avoids the inversion of a matrix. This is the underlying idea behind the innovations algorithm proposed in Brockwell and Davis (1998), Chapter 5.

Spaces spanned by infinite number of elements (advanced)


The notions above can be generalised to spaces which have an infinite number of elements in their basis. Let us now construct the space spanned by an infinite number of random variables $\{X_t, X_{t-1},\ldots\}$. As with anything that involves $\infty$ we need to define precisely what we mean by an infinite basis. To do this we construct a sequence of subspaces, each defined with a finite number of elements in the basis. We increase the number of elements in the subspace and consider the limit of this space. Let $\mathcal{X}_t^{-n} = \mathrm{sp}(X_t,\ldots,X_{-n})$; clearly if $m > n$, then $\mathcal{X}_t^{-n}\subset\mathcal{X}_t^{-m}$. We define $\mathcal{X}_t^{-\infty}$ as $\mathcal{X}_t^{-\infty} = \cup_{n=1}^{\infty}\mathcal{X}_t^{-n}$, in other words if $Y\in\mathcal{X}_t^{-\infty}$, then there exists an $n$ such that $Y\in\mathcal{X}_t^{-n}$.
However, we also need to ensure that the limits of all the sequences lie in this infinite dimensional space, therefore we close the space by defining a new space which includes the old space and also includes all the limits. To make this precise suppose the sequence of random variables is such that $Y_s\in\mathcal{X}_t^{-s}$, and $E(Y_{s_1} - Y_{s_2})^2\to 0$ as $s_1, s_2\to\infty$. Since the sequence $\{Y_s\}$ is a Cauchy sequence there exists a limit. More precisely, there exists a random variable $Y$, such that $E(Y_s - Y)^2\to 0$ as $s\to\infty$. Since the closure of the space, $\overline{\mathcal{X}}_t^{-n}$, contains the set $\mathcal{X}_t^{-n}$ and all the limits of the Cauchy sequences in this set, then $Y\in\mathcal{X}_t^{-\infty}$. We let

\[
\mathcal{X}_t^{-\infty} = \mathrm{sp}(X_t, X_{t-1},\ldots). \tag{5.34}
\]

The orthogonal basis of sp(Xt , Xt−1 , . . .)

An orthogonal basis of $\mathrm{sp}(X_t, X_{t-1},\ldots)$ can be constructed using the same method used to orthogonalize $\mathrm{sp}(X_t, X_{t-1},\ldots,X_1)$. The main difference is how to deal with the initial value, which in the case of $\mathrm{sp}(X_t, X_{t-1},\ldots,X_1)$ is $X_1$. The analogous version of the initial value in the infinite dimensional space $\mathrm{sp}(X_t, X_{t-1},\ldots)$ is $X_{-\infty}$, but this is not a well defined quantity (again we have to be careful with these pesky infinities).
Let $X_{t-1}(1)$ denote the best linear predictor of $X_t$ given $X_{t-1}, X_{t-2},\ldots$. As in Section 5.5 it is clear that $(X_t - X_{t-1}(1))$ and $X_s$ for $s\leq t-1$ are uncorrelated and $\mathcal{X}_t^{-\infty} = \mathrm{sp}(X_t - X_{t-1}(1))\oplus\mathcal{X}_{t-1}^{-\infty}$, where $\mathcal{X}_t^{-\infty} = \mathrm{sp}(X_t, X_{t-1},\ldots)$. Thus we can construct the orthogonal basis $(X_t - X_{t-1}(1)), (X_{t-1} - X_{t-2}(1)),\ldots$ and the corresponding space $\mathrm{sp}((X_t - X_{t-1}(1)), (X_{t-1} - X_{t-2}(1)),\ldots)$. It is clear that $\mathrm{sp}((X_t - X_{t-1}(1)), (X_{t-1} - X_{t-2}(1)),\ldots)\subset\mathrm{sp}(X_t, X_{t-1},\ldots)$. However, unlike the finite dimensional case it is not clear that they are equal; roughly speaking this is because $\mathrm{sp}((X_t - X_{t-1}(1)), (X_{t-1} - X_{t-2}(1)),\ldots)$ lacks the initial value $X_{-\infty}$. Of course the time $-\infty$ in the past is not really a well defined quantity. Instead, the way we overcome this issue is that we define the initial starting random variable as the intersection of the subspaces, more precisely let $\mathcal{X}_{-\infty} = \cap_{n=-\infty}^{\infty}\mathcal{X}_n^{-\infty}$.
Furthermore, we note that since $X_n - X_{n-1}(1)$ and $X_s$ (for any $s < n$) are orthogonal, then $\mathrm{sp}((X_t - X_{t-1}(1)), (X_{t-1} - X_{t-2}(1)),\ldots)$ and $\mathcal{X}_{-\infty}$ are orthogonal spaces. Using $\mathcal{X}_{-\infty}$, we have
$\oplus_{j=0}^{\infty}\mathrm{sp}(X_{t-j} - X_{t-j-1}(1))\oplus\mathcal{X}_{-\infty} = \mathrm{sp}(X_t, X_{t-1},\ldots)$.

Chapter 6

The autocovariance and partial covariance of a stationary time series

Objectives

• Be able to determine the rate of decay of an ARMA time series.

• Be able to ‘solve’ the autocovariance structure of an AR process.

• Understand what partial correlation is and how this may be useful in determining the order
of an AR model.

6.1 The autocovariance function


The autocovariance function (ACF) is defined as the sequence of covariances of a stationary process. Precisely, suppose $\{X_t\}$ is a stationary process with mean zero, then $\{c(r) : r\in\mathbb{Z}\}$ is the ACF of $\{X_t\}$ where $c(r) = \mathrm{cov}(X_0, X_r)$. The autocorrelation function is the standardized version of the autocovariance and is defined as

\[
\rho(r) = \frac{c(r)}{c(0)}.
\]

Clearly different time series give rise to different features in the ACF. We will explore some of these
features below.

Before investigating the structure of ARMA processes we state a general result connecting linear
time series and the summability of the autocovariance function.

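Lemma 6.1.1 Suppose the stationary time series $X_t$ satisfies the linear representation $X_t = \sum_{j=-\infty}^{\infty}\psi_j\varepsilon_{t-j}$. The covariance is $c(r) = \mathrm{var}[\varepsilon_t]\sum_{j=-\infty}^{\infty}\psi_j\psi_{j+r}$.

(i) If $\sum_{j=-\infty}^{\infty}|\psi_j| < \infty$, then $\sum_k|c(k)| < \infty$.

(ii) If $\sum_{j=-\infty}^{\infty}|j\psi_j| < \infty$, then $\sum_k|k\cdot c(k)| < \infty$.

(iii) If $\sum_{j=-\infty}^{\infty}|\psi_j|^2 < \infty$, then we cannot say anything about summability of the covariance.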

PROOF. It is straightforward to show that

\[
c(k) = \mathrm{var}[\varepsilon_t]\sum_j\psi_j\psi_{j-k}.
\]

Using this result, it is easy to see that $\sum_k|c(k)|\leq\mathrm{var}[\varepsilon_t]\sum_k\sum_j|\psi_j|\cdot|\psi_{j-k}| = \mathrm{var}[\varepsilon_t]\big(\sum_j|\psi_j|\big)^2$, thus $\sum_k|c(k)| < \infty$, which proves (i).
The proof of (ii) is similar. To prove (iii), we observe that $\sum_j|\psi_j|^2 < \infty$ is a weaker condition than $\sum_j|\psi_j| < \infty$ (for example the sequence $\psi_j = |j|^{-1}$ satisfies the former condition but not the latter). Thus based on the condition we cannot say anything about summability of the covariances.
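Part (i) of the lemma can be illustrated numerically. The sketch below is a minimal Python example under stated assumptions: the weights $\psi_j = 0.5^j$ for $j\geq 0$ (and zero otherwise) are a hypothetical choice, the infinite sums are truncated, and the innovation variance is set to one. It evaluates $c(k)$ from the weights and compares the sum of $|c(k)|$ with the bound $(\sum_j|\psi_j|)^2$ used in the proof.

```python
def acvf_from_ma(psi, lag, var_eps=1.0):
    """c(lag) = var_eps * sum_j psi[j] * psi[j + lag] for a one-sided, truncated filter."""
    lag = abs(lag)
    return var_eps * sum(psi[j] * psi[j + lag] for j in range(len(psi) - lag))

# Hypothetical absolutely summable weights psi_j = 0.5**j (j >= 0), truncated at 200 terms.
psi = [0.5 ** j for j in range(200)]
c = [acvf_from_ma(psi, k) for k in range(20)]

abs_sum = c[0] + 2 * sum(abs(v) for v in c[1:])   # sum of |c(k)| over k = -19, ..., 19
bound = sum(abs(p) for p in psi) ** 2             # var_eps * (sum_j |psi_j|)**2

print(c[:4])
print(abs_sum, bound)
```

For these weights $c(k) = 0.5^{|k|}/(1-0.5^2)$, so the printed covariances halve at each lag and their absolute sum stays below the bound.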


First we consider a general result on the covariance of a causal ARMA process (to obtain
the covariance we always use the MA(∞) expansion; you will see why below).

6.1.1 The rate of decay of the autocovariance of an ARMA process


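We evaluate the covariance of an ARMA process using its MA($\infty$) representation. Let us suppose that $\{X_t\}$ is a causal ARMA process, then it has the representation in (4.20) (where the roots of $\phi(z)$ have absolute value greater than $1+\delta$). Using (4.20) and the independence of $\{\varepsilon_t\}$ we have

\begin{align*}
\mathrm{cov}(X_t, X_\tau) &= \mathrm{cov}\Big(\sum_{j_1=0}^{\infty}a_{j_1}\varepsilon_{t-j_1}, \sum_{j_2=0}^{\infty}a_{j_2}\varepsilon_{\tau-j_2}\Big) \\
&= \sum_{j_1,j_2=0}^{\infty}a_{j_1}a_{j_2}\mathrm{cov}(\varepsilon_{t-j_1}, \varepsilon_{\tau-j_2}) = \sum_{j=0}^{\infty}a_ja_{j+|t-\tau|}\mathrm{var}(\varepsilon_t) \tag{6.1}
\end{align*}

(here we use the MA($\infty$) expansion). Using (4.21) we have

\[
|\mathrm{cov}(X_t, X_\tau)| \leq \mathrm{var}(\varepsilon_t)C_\rho^2\sum_{j=0}^{\infty}\rho^j\rho^{j+|t-\tau|} = \mathrm{var}(\varepsilon_t)C_\rho^2\rho^{|t-\tau|}\sum_{j=0}^{\infty}\rho^{2j} = \mathrm{var}(\varepsilon_t)C_\rho^2\frac{\rho^{|t-\tau|}}{1-\rho^2}, \tag{6.2}
\]

for any $1/(1+\delta) < \rho < 1$.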


The above bound is useful: it tells us that the ACF of an ARMA process decays exponentially
fast. In other words, there is very little memory in an ARMA process. However, it is not very
enlightening about features within the process. In the following we obtain an explicit expression for
the ACF of an autoregressive process. So far we have used the characteristic polynomial associated
with an AR process to determine whether it was causal. Now we show that the roots of the
characteristic polynomial also give information about the ACF and what a ‘typical’ realisation of
an autoregressive process could look like.

6.1.2 The autocovariance of an autoregressive process and the Yule-Walker equations

Simple worked example Let us consider the two AR(1) processes considered in Section 4.3.2. We recall that the model

\[
X_t = 0.5X_{t-1} + \varepsilon_t
\]

has the stationary causal solution

\[
X_t = \sum_{j=0}^{\infty}0.5^j\varepsilon_{t-j}.
\]

Assuming the innovations have variance one, the ACF of $X_t$ is

\[
c_X(0) = \frac{1}{1-0.5^2}, \qquad c_X(k) = \frac{0.5^{|k|}}{1-0.5^2}.
\]

The corresponding autocorrelation is

\[
\rho_X(k) = 0.5^{|k|}.
\]

Let us consider the sister model

\[
Y_t = 2Y_{t-1} + \varepsilon_t,
\]

this has the noncausal stationary solution

\[
Y_t = -\sum_{j=0}^{\infty}(0.5)^{j+1}\varepsilon_{t+j+1}.
\]

Thus the process has the ACF

\[
c_Y(0) = \frac{0.5^2}{1-0.5^2}, \qquad c_Y(k) = \frac{0.5^{2+|k|}}{1-0.5^2}.
\]

The corresponding autocorrelation is

\[
\rho_Y(k) = 0.5^{|k|}.
\]

Comparing the two ACFs, both models have an identical autocorrelation function.
Therefore, we observe an interesting feature: the non-causal time series has the same correlation structure as its dual causal time series. For every non-causal time series there exists a causal time series with the same autocorrelation function. The dual is easily constructed. Suppose an autoregressive model has characteristic polynomial $\phi(z) = 1 - \sum_{j=1}^{p}\phi_jz^j$ with roots $\lambda_1,\ldots,\lambda_p$. If all the roots lie inside the unit circle, then $\phi(z)$ corresponds to a non-causal time series. But by flipping the roots to $\lambda_1^{-1},\ldots,\lambda_p^{-1}$, all the roots now lie outside the unit circle. This means the characteristic polynomial corresponding to $\lambda_1^{-1},\ldots,\lambda_p^{-1}$ leads to a causal AR(p) model (call this $\widetilde{\phi}(z)$). Moreover the AR(p) models associated with $\phi(z)$ and $\widetilde{\phi}(z)$ have the same autocorrelation function. They are duals. In summary, autocorrelation is ‘blind’ to non-causality.
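The AR(1) example above can be verified directly from the two stationary solutions. The following minimal Python sketch computes the autocovariances of the causal model $X_t = 0.5X_{t-1}+\varepsilon_t$ and of the non-causal model $Y_t = 2Y_{t-1}+\varepsilon_t$ from their moving average weights; the truncation of the infinite sums at 200 terms and the unit innovation variance are assumptions of the sketch.

```python
def acvf(weights, lag):
    """Autocovariance at `lag` of a linear process with the given weights and unit-variance noise."""
    return sum(weights[j] * weights[j + lag] for j in range(len(weights) - lag))

n_terms = 200  # truncation point of the infinite sums (an assumption of this sketch)

# Causal model X_t = 0.5 X_{t-1} + e_t:  X_t = sum_j 0.5**j e_{t-j}.
causal = [0.5 ** j for j in range(n_terms)]
# Non-causal model Y_t = 2 Y_{t-1} + e_t:  Y_t = -sum_j 0.5**(j+1) e_{t+j+1}.
noncausal = [-(0.5 ** (j + 1)) for j in range(n_terms)]

c_x = [acvf(causal, k) for k in range(6)]
c_y = [acvf(noncausal, k) for k in range(6)]
rho_x = [v / c_x[0] for v in c_x]
rho_y = [v / c_y[0] for v in c_y]

print(c_x[0], c_y[0])   # the variances differ
print(rho_x)
print(rho_y)            # the autocorrelations coincide
```

The autocovariances of the two models differ by the factor $0.5^2$, but after standardizing by the variance the two autocorrelation sequences are the same.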

Another worked example Consider the AR(2) model

\[
X_t = 2r\cos(\theta)X_{t-1} - r^2X_{t-2} + \varepsilon_t, \tag{6.3}
\]

where $\{\varepsilon_t\}$ are iid random variables with mean zero and variance one. We assume $0 < r < 1$ (which imposes causality on the model). Note that the non-causal case ($r > 1$) will have the same autocovariance as the causal case with $r$ flipped to $r^{-1}$. The corresponding characteristic polynomial is $1 - 2r\cos(\theta)z + r^2z^2$, which has roots $r^{-1}\exp(\pm i\theta)$. By using (6.11), below, the ACF is

\[
c(k) = r^{|k|}\left(C_1\exp(ik\theta) + \overline{C}_1\exp(-ik\theta)\right).
\]

Setting $C_1 = a\exp(ib)$, the above can be written as

\[
c(k) = ar^{|k|}\left(\exp(i(b+k\theta)) + \exp(-i(b+k\theta))\right) = 2ar^{|k|}\cos(k\theta + b), \tag{6.4}
\]

where the above follows from the fact that the sum of a complex number and its conjugate is two times the real part of the complex number.
Consider the AR(2) process

\[
X_t = 1.5X_{t-1} - 0.75X_{t-2} + \varepsilon_t, \tag{6.5}
\]

where $\{\varepsilon_t\}$ are iid random variables with mean zero and variance one. The corresponding characteristic polynomial is $1 - 1.5z + 0.75z^2$, which has roots $\sqrt{4/3}\exp(\pm i\pi/6)$. Using (6.4) the autocovariance function of $\{X_t\}$ is

\[
c(k) = 2a\left(\sqrt{3/4}\right)^{|k|}\cos\left(k\frac{\pi}{6} + b\right).
\]

We see that the covariance decays at an exponential rate, but there is a periodicity within the decay. This means that observations separated by a lag $k = 12$ are more closely correlated than at other lags; this suggests a quasi-periodicity in the time series. The ACF of the process is given in Figure 6.1. Notice that it decays to zero (relatively fast) but it also undulates. A plot of a realisation of the time series is given in Figure 6.2; notice the quasi-periodicity with a period of about 12 (frequency $2\pi/12$). To measure the magnitude of the period we also give the corresponding periodogram in Figure 6.2. Observe a peak at about frequency $2\pi/12\approx 0.52$.

We now generalise the results in the above AR(1) and AR(2) examples. Let us consider the general AR(p) process

\[
X_t = \sum_{j=1}^{p}\phi_jX_{t-j} + \varepsilon_t.
\]

Suppose the roots of the corresponding characteristic polynomial are distinct and we split them into real and complex roots. Because the characteristic polynomial is comprised of real coefficients, the complex roots come in complex conjugate pairs. Hence let us suppose the real roots are $\{\lambda_j\}_{j=1}^{r}$ and the complex roots are $\{\lambda_j, \overline{\lambda}_j\}_{j=r+1}^{r+(p-r)/2}$. The covariance in (6.10) can be written as

\[
c(k) = \sum_{j=1}^{r}C_j\lambda_j^{-|k|} + \sum_{j=r+1}^{r+(p-r)/2}a_j|\lambda_j|^{-|k|}\cos(k\theta_j + b_j)
\]

where for $j > r$ we write $\lambda_j = |\lambda_j|\exp(i\theta_j)$ and $a_j$ and $b_j$ are real constants. Notice that, as in the example above, the covariance decays exponentially with lag, but there is undulation. A typical realisation from such a process will be quasi-periodic with frequencies $\theta_{r+1},\ldots,\theta_{r+(p-r)/2}$, though the magnitude of each period will vary.

Figure 6.1: The ACF of the time series $X_t = 1.5X_{t-1} - 0.75X_{t-2} + \varepsilon_t$ (horizontal axis: lag; vertical axis: acf).

Figure 6.2: Left: A realisation from the time series $X_t = 1.5X_{t-1} - 0.75X_{t-2} + \varepsilon_t$ (horizontal axis: Time). Right: The corresponding periodogram (horizontal axis: frequency).
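The damped cosine shape of the ACF of the AR(2) model (6.5) can be reproduced numerically. The sketch below is a minimal Python illustration, assuming the standard recursion $\rho(k) = \phi_1\rho(k-1) + \phi_2\rho(k-2)$ for the autocorrelation of a causal AR(2) process, started from $\rho(0) = 1$ and $\rho(1) = \phi_1/(1-\phi_2)$. It compares the recursion with the closed form $r^{k}\cos(k\theta + b)/\cos(b)$, where the phase $b$ is chosen to match $\rho(1)$.

```python
import math

phi1, phi2 = 1.5, -0.75              # X_t = 1.5 X_{t-1} - 0.75 X_{t-2} + e_t
r = math.sqrt(-phi2)                 # phi2 = -r**2
theta = math.acos(phi1 / (2 * r))    # phi1 = 2 r cos(theta)

# Autocorrelation from the recursion rho(k) = phi1 rho(k-1) + phi2 rho(k-2).
rho = [1.0, phi1 / (1 - phi2)]
for k in range(2, 40):
    rho.append(phi1 * rho[k - 1] + phi2 * rho[k - 2])

# Damped cosine form rho(k) = r**k cos(k theta + b) / cos(b); b is fixed by rho(1).
b = math.atan((math.cos(theta) - rho[1] / r) / math.sin(theta))
closed_form = [r ** k * math.cos(k * theta + b) / math.cos(b) for k in range(40)]
max_gap = max(abs(u - v) for u, v in zip(rho, closed_form))

print(theta, math.pi / 6)
print([round(v, 3) for v in rho[:13]])
print(max_gap)
```

The recovered angle is $\pi/6$, so the autocorrelation completes one cycle roughly every 12 lags while its envelope shrinks by the factor $\sqrt{3/4}$ per lag.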

Exercise 6.1 Recall the AR(2) models considered in Exercise 4.5. Now we want to derive their ACFs.

(i) (a) Obtain the ACF corresponding to

\[
X_t = \frac{7}{3}X_{t-1} - \frac{2}{3}X_{t-2} + \varepsilon_t,
\]

where $\{\varepsilon_t\}$ are iid random variables with mean zero and variance $\sigma^2$.

(b) Obtain the ACF corresponding to

\[
X_t = \frac{4\times\sqrt{3}}{5}X_{t-1} - \frac{4^2}{5^2}X_{t-2} + \varepsilon_t,
\]

where $\{\varepsilon_t\}$ are iid random variables with mean zero and variance $\sigma^2$.

(c) Obtain the ACF corresponding to

\[
X_t = X_{t-1} - 4X_{t-2} + \varepsilon_t,
\]

where $\{\varepsilon_t\}$ are iid random variables with mean zero and variance $\sigma^2$.

(ii) For all these models plot the true ACF in R. You will need to use the function ARMAacf.
BEWARE of the ACF it gives for non-causal solutions. Find a method of plotting a causal
solution in the non-causal case.

Exercise 6.2 In Exercise 4.6 you constructed a causal AR(2) process with period 17.

Load Shumway and Stoffer’s package astsa into R (use the command install.packages("astsa")
and then library("astsa")).
Use the command arma.spec to make a plot of the corresponding spectral density function. How
does your periodogram compare with the ‘true’ spectral density function?

Derivation of the ACF of general models (advanced)

Worked example Let us suppose that $X_t$ satisfies the model $X_t = (a+b)X_{t-1} - abX_{t-2} + \varepsilon_t$. We have shown that if $|a| < 1$ and $|b| < 1$, then it has the solution

\[
X_t = \frac{1}{b-a}\sum_{j=0}^{\infty}\left(b^{j+1} - a^{j+1}\right)\varepsilon_{t-j}.
\]

By matching the innovations it can be shown that for $r > 0$

\[
\mathrm{cov}(X_t, X_{t+r}) = \frac{\mathrm{var}(\varepsilon_t)}{(b-a)^2}\sum_{j=0}^{\infty}(b^{j+1} - a^{j+1})(b^{j+1+r} - a^{j+1+r}). \tag{6.6}
\]

Even by using the sum of a geometric series the above is still cumbersome. Below we derive the general solution, which can be easier to interpret.
General AR(p) models
Let us consider the zero mean AR(p) process $\{X_t\}$ where

\[
X_t = \sum_{j=1}^{p}\phi_jX_{t-j} + \varepsilon_t. \tag{6.7}
\]

From now onwards we will assume that $\{X_t\}$ is causal (the roots of $\phi(z)$ lie outside the unit circle). Evaluating the covariance of the above with respect to $X_{t-k}$ ($k\geq 1$) gives the sequence of equations

\[
\mathrm{cov}(X_t, X_{t-k}) = \sum_{j=1}^{p}\phi_j\mathrm{cov}(X_{t-j}, X_{t-k}). \tag{6.8}
\]

It is worth mentioning that if the process were not causal this equation would not hold, since $\varepsilon_t$ and $X_{t-k}$ are not uncorrelated. Let $c(r) = \mathrm{cov}(X_0, X_r)$; substituting into the above gives the sequence of difference equations

\[
c(k) - \sum_{j=1}^{p}\phi_jc(k-j) = 0, \qquad k\geq 1. \tag{6.9}
\]

The autocovariance function of $\{X_t\}$ is the solution of this difference equation. Solving (6.9) is very similar to solving homogeneous differential equations, which some of you may be familiar with (do not worry if you are not).
Recall the characteristic polynomial of the AR process $\phi(z) = 1 - \sum_{j=1}^{p}\phi_jz^j$, where the solutions of $\phi(z) = 0$ are the roots $\lambda_1,\ldots,\lambda_p$. In Section 4.3.3 we used the roots of the characteristic equation to find the stationary solution of the AR process. In this section we use the roots of the characteristic polynomial to obtain the solution of (6.9). We show below that if the roots are distinct (the roots are all different) the solution of (6.9) is

\[
c(k) = \sum_{j=1}^{p}C_j\lambda_j^{-|k|}, \tag{6.10}
\]

where the constants $\{C_j\}$ are chosen depending on the initial values $\{c(k) : 1\leq k\leq p\}$. If $\lambda_j$ is real, then $C_j$ is real. If $\lambda_j$ is complex, then its complex conjugate is also a root, say $\lambda_{j+1} = \overline{\lambda}_j$. Consequently, $C_j$ and $C_{j+1}$ will be complex conjugates of each other. This is to ensure that $\{c(k)\}_k$ is real.
Example p = 2 Suppose the roots of $\phi(z) = 1 - \phi_1z - \phi_2z^2$ are complex (and thus conjugates). Then

\[
c(k) = C_1\lambda_1^{-|k|} + C_2\lambda_2^{-|k|} = C\lambda^{-|k|} + \overline{C}\,\overline{\lambda}^{-|k|}. \tag{6.11}
\]

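Proof of (6.10) The simplest way to prove (6.10) is to use the plugin method (guess a solution and plug it in). Plugging $c(k) = \sum_{j=1}^{p}C_j\lambda_j^{-k}$ into (6.9) gives

\begin{align*}
c(k) - \sum_{i=1}^{p}\phi_ic(k-i) &= \sum_{j=1}^{p}C_j\Big(\lambda_j^{-k} - \sum_{i=1}^{p}\phi_i\lambda_j^{-(k-i)}\Big) \\
&= \sum_{j=1}^{p}C_j\lambda_j^{-k}\underbrace{\Big(1 - \sum_{i=1}^{p}\phi_i\lambda_j^{i}\Big)}_{\phi(\lambda_j) = 0} = 0,
\end{align*}

which proves that it is a solution. $\Box$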
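The solution (6.10) can be checked against a direct calculation in the real root case. The sketch below is a minimal Python example for the model $X_t = (a+b)X_{t-1} - abX_{t-2} + \varepsilon_t$, whose characteristic roots are $1/a$ and $1/b$; the values $a = 0.5$, $b = 0.8$, the unit innovation variance and the truncation of the MA($\infty$) sum are assumptions made for illustration. It computes $c(k)$ from the MA($\infty$) weights, fits the two constants from $c(0)$ and $c(1)$, and compares the two expressions.

```python
a, b = 0.5, 0.8                      # hypothetical values; roots of phi(z) are 1/a and 1/b
phi1, phi2 = a + b, -a * b           # X_t = (a+b) X_{t-1} - ab X_{t-2} + e_t

# Autocovariance from the MA(infinity) weights (b**(j+1) - a**(j+1)) / (b - a),
# truncated at 400 terms, with var(e_t) = 1.
w = [(b ** (j + 1) - a ** (j + 1)) / (b - a) for j in range(400)]
c = [sum(w[j] * w[j + k] for j in range(len(w) - k)) for k in range(12)]

# Solve c(0) = C1 + C2 and c(1) = C1*a + C2*b for the constants in (6.10).
C2 = (c[1] - a * c[0]) / (b - a)
C1 = c[0] - C2
closed_form = [C1 * a ** k + C2 * b ** k for k in range(12)]   # lambda_j**(-k) = a**k, b**k

max_gap = max(abs(u - v) for u, v in zip(c, closed_form))
# Residual of the difference equation c(k) - phi1 c(k-1) - phi2 c(k-2) for k >= 2.
residual = max(abs(c[k] - phi1 * c[k - 1] - phi2 * c[k - 2]) for k in range(2, 12))

print(C1, C2)
print(max_gap, residual)
```

Both the gap between the two expressions and the residual of the difference equation are at the level of rounding error, so two constants fixed by the initial values determine the whole autocovariance sequence.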

Non-distinct roots In the case that the roots of $\phi(z)$ are not distinct, let the roots be $\lambda_1,\ldots,\lambda_s$ with multiplicity $m_1,\ldots,m_s$ ($\sum_{k=1}^{s}m_k = p$). In this case the solution is

\[
c(k) = \sum_{j=1}^{s}\lambda_j^{-k}P_{m_j}(k),
\]

where $P_{m_j}(k)$ is an $m_j$th order polynomial and the coefficients $\{C_j\}$ are now ‘hidden’ in $P_{m_j}(k)$.

6.1.3 The autocovariance of a moving average process


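Suppose that $\{X_t\}$ satisfies

\[
X_t = \varepsilon_t + \sum_{j=1}^{q}\theta_j\varepsilon_{t-j}.
\]

The covariance is

\[
\mathrm{cov}(X_t, X_{t-k}) = \begin{cases} \mathrm{var}(\varepsilon_t)\sum_{i=0}^{q}\theta_i\theta_{i-k} & k = -q,\ldots,q \\ 0 & \text{otherwise} \end{cases}
\]

where $\theta_0 = 1$ and $\theta_i = 0$ for $i < 0$ and $i > q$. Therefore we see that there is no correlation when the lag between $X_t$ and $X_{t-k}$ is greater than $q$.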
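A minimal Python sketch of this formula is given below. The function name `ma_acvf` and the MA(2) coefficients are hypothetical choices for illustration, and the innovation variance defaults to one.

```python
def ma_acvf(theta, k, var_eps=1.0):
    """Autocovariance at lag k of X_t = e_t + sum_j theta[j-1] e_{t-j}."""
    coef = [1.0] + list(theta)          # theta_0 = 1
    k = abs(k)
    if k >= len(coef):                  # lags beyond q are uncorrelated
        return 0.0
    return var_eps * sum(coef[i] * coef[i - k] for i in range(k, len(coef)))

theta = [0.9, -0.81]                    # hypothetical MA(2) coefficients
c = [ma_acvf(theta, k) for k in range(5)]
print(c)
```

The autocovariance is non-zero at lags 0, 1 and 2 and exactly zero from lag 3 onwards, which is the cut-off property of an MA(q) process.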

6.1.4 The autocovariance of an ARMA process (advanced)


We see from the above that an MA(q) model is only really suitable when we believe that there
is no correlation between two random variables separated by more than a certain distance. Often
autoregressive models are fitted. However in several applications we find that autoregressive models
of a very high order are needed to fit the data. If a very ‘long’ autoregressive model is required
a more suitable model may be the autoregressive moving average process. It has several of the
properties of an autoregressive process, but can be more parsimonious than a ‘long’ autoregressive
process. In this section we consider the ACF of an ARMA process.
Let us suppose that the causal time series $\{X_t\}$ satisfies the equations

\[
X_t - \sum_{i=1}^{p}\phi_iX_{t-i} = \varepsilon_t + \sum_{j=1}^{q}\theta_j\varepsilon_{t-j}.
\]

We now define a recursion for the ACF, which is similar to the ACF recursion for AR processes. Let us suppose that the lag $k$ is such that $k > q$, then it can be shown that the autocovariance function of the ARMA process satisfies

\[
\mathrm{cov}(X_t, X_{t-k}) - \sum_{i=1}^{p}\phi_i\mathrm{cov}(X_{t-i}, X_{t-k}) = 0, \qquad k > q.
\]

On the other hand, if $1\leq k\leq q$, then we have

\[
\mathrm{cov}(X_t, X_{t-k}) - \sum_{i=1}^{p}\phi_i\mathrm{cov}(X_{t-i}, X_{t-k}) = \sum_{j=1}^{q}\theta_j\mathrm{cov}(\varepsilon_{t-j}, X_{t-k}) = \sum_{j=k}^{q}\theta_j\mathrm{cov}(\varepsilon_{t-j}, X_{t-k}).
\]

We recall that $X_t$ has the MA($\infty$) representation $X_t = \sum_{j=0}^{\infty}a_j\varepsilon_{t-j}$ (see (4.20)), therefore for $k\leq j\leq q$ we have $\mathrm{cov}(\varepsilon_{t-j}, X_{t-k}) = a_{j-k}\mathrm{var}(\varepsilon_t)$ (where $a(z) = \theta(z)\phi(z)^{-1}$). Altogether the above gives the difference equations

\begin{align*}
c(k) - \sum_{i=1}^{p}\phi_ic(k-i) &= \mathrm{var}(\varepsilon_t)\sum_{j=k}^{q}\theta_ja_{j-k} && \text{for } 1\leq k\leq q \\
c(k) - \sum_{i=1}^{p}\phi_ic(k-i) &= 0 && \text{for } k > q,
\end{align*}

where $c(k) = \mathrm{cov}(X_0, X_k)$. Since for $k > q$ the above is a homogeneous difference equation, it can be shown that the solution is

\[
c(k) = \sum_{j=1}^{s}\lambda_j^{-k}P_{m_j}(k),
\]

where $\lambda_1,\ldots,\lambda_s$ with multiplicity $m_1,\ldots,m_s$ ($\sum_{j=1}^{s}m_j = p$) are the roots of the characteristic polynomial $1 - \sum_{j=1}^{p}\phi_jz^j$. The coefficients in the polynomials $P_{m_j}$ are determined by the initial conditions.
Further reading: Brockwell and Davis (1998), Chapter 3.3 and Shumway and Stoffer (2006), Chapter 3.4.
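The difference equations can be checked on the simplest case, an ARMA(1,1). The sketch below is a minimal Python example under stated assumptions: the parameters $\phi = 0.7$ and $\theta = 0.4$ are hypothetical, the innovation variance is one, and the MA($\infty$) weights $a_0 = 1$, $a_j = \phi^{j-1}(\phi+\theta)$ for $j\geq 1$ (from expanding $a(z) = (1+\theta z)/(1-\phi z)$) are truncated at 400 terms.

```python
phi, theta = 0.7, 0.4      # hypothetical causal ARMA(1,1): X_t - phi X_{t-1} = e_t + theta e_{t-1}

# MA(infinity) weights of a(z) = theta(z) / phi(z): a_0 = 1, a_j = phi**(j-1) * (phi + theta).
a = [1.0] + [phi ** (j - 1) * (phi + theta) for j in range(1, 400)]

# c(k) = var(e_t) * sum_j a_j a_{j+k}, with var(e_t) = 1 and the sum truncated.
c = [sum(a[j] * a[j + k] for j in range(len(a) - k)) for k in range(10)]

# Left-hand side of the difference equations, c(k) - phi c(k-1), for k = 1, ..., 9.
lhs = [c[k] - phi * c[k - 1] for k in range(1, 10)]
print([round(v, 10) for v in lhs])
```

At lag $k = 1 = q$ the left-hand side equals $\theta_1a_0 = 0.4$, while for every $k > q$ it is zero up to rounding, so beyond lag $q$ the autocovariance follows the same recursion as that of an AR(1) process.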

6.1.5 Estimating the ACF from data


Suppose we observe $\{Y_t\}_{t=1}^{n}$; we can estimate the covariance $c(k) = \mathrm{cov}(Y_0, Y_k)$ from the observations. One such estimator is

\[
\widehat{c}_n(k) = \frac{1}{n}\sum_{t=1}^{n-|k|}(Y_t - \bar{Y}_n)(Y_{t+|k|} - \bar{Y}_n), \tag{6.12}
\]

since $E[(Y_t - \bar{Y}_n)(Y_{t+|k|} - \bar{Y}_n)]\approx c(k)$. Of course if the mean of $Y_t$ is known to be zero ($Y_t = X_t$), then the simpler covariance estimator is

\[
\widehat{c}_n(k) = \frac{1}{n}\sum_{t=1}^{n-|k|}X_tX_{t+|k|}.
\]

The sample autocorrelation is the ratio

\[
\widehat{\rho}_n(r) = \frac{\widehat{c}_n(r)}{\widehat{c}_n(0)}.
\]

Thus for $r = 0$, we have $\widehat{\rho}_n(0) = 1$. Most statistical software will have functions that evaluate the sample autocorrelation function. In R, the standard function is acf. To illustrate the differences between the true ACF and estimated ACF (with sample size $n = 100$) we consider the model

\[
X_t = 2\cdot 0.9\cos(\pi/3)X_{t-1} - 0.9^2X_{t-2} + \varepsilon_t.
\]

We make a plot of the true ACF and estimated ACF in Figure 6.3. As a contrast we consider the estimated and true ACF of the MA model

\[
X_t = \varepsilon_t + 2\cdot 0.9\cos(\pi/3)\varepsilon_{t-1} - 0.9^2\varepsilon_{t-2}. \tag{6.13}
\]

This plot is given in Figure 6.4.

Observe that the estimated autocorrelation plot contains a blue line. This blue line corresponds to $\pm 1.96/\sqrt{n}$ (where $n$ is the sample size). These are the error bars, which are constructed under the assumption the data is actually iid. We show in Section 8.2 that if $\{X_t\}$ are iid random variables then for all $h\geq 1$

\[
\sqrt{n}\,\widehat{c}_n(h)\overset{D}{\to}N(0,1). \tag{6.14}
\]

This gives rise to the critical values $\pm 1.96/\sqrt{n}$.
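The estimator (6.12) and the iid error bars are simple to compute by hand. The sketch below is a minimal Python illustration (the notes themselves use the R function acf); the function names, the short test series and the simulated Gaussian noise with an arbitrary seed are assumptions made for this example.

```python
import math
import random

def sample_acvf(y, k):
    """Sample autocovariance (6.12): the sum is divided by n, not n - |k|."""
    n, k = len(y), abs(k)
    mean = sum(y) / n
    return sum((y[t] - mean) * (y[t + k] - mean) for t in range(n - k)) / n

def sample_acf(y, max_lag):
    """Sample autocorrelations at lags 0, ..., max_lag."""
    c0 = sample_acvf(y, 0)
    return [sample_acvf(y, k) / c0 for k in range(max_lag + 1)]

# A tiny series where the arithmetic can be followed by hand.
small = sample_acf([1.0, 2.0, 3.0, 4.0, 5.0], 2)
print(small)

# Simulated iid noise: most sample autocorrelations should fall inside the band.
random.seed(1)  # arbitrary seed, only for reproducibility
n = 100
noise = [random.gauss(0.0, 1.0) for _ in range(n)]
rho_hat = sample_acf(noise, 20)
band = 1.96 / math.sqrt(n)
outside = sum(abs(r) > band for r in rho_hat[1:])   # lags falling outside the iid band
print(band, outside)
```

For the series 1, ..., 5 the deviations from the mean are -2, -1, 0, 1, 2, so the lag zero, one and two autocovariances are 2, 0.8 and -0.2. For iid data roughly one lag in twenty is expected to fall outside the band by chance.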

Figure 6.3: The AR(2) model. Left: Estimated ACF based on n = 100. Right: True ACF

Figure 6.4: The MA(2) model. Left: Estimated ACF based on n = 100. Right: True ACF

6.2 Partial correlation in time series

6.2.1 A general definition


In Section 5.3 we introduced the notion of partial correlation for multivariate data. We now apply
this notion to time series.

Definition 6.2.1 Suppose that $\{X_t\}_t$ is a time series. The partial covariance/correlation between $X_t$ and $X_{t+k+1}$ is defined as the partial covariance/correlation between $X_t$ and $X_{t+k+1}$ after conditioning out the ‘inbetween’ time series $Y' = (X_{t+1},\ldots,X_{t+k})$. We denote this as $\rho_{t,t+k+1}(k)$, where

\[
\rho_{t,t+k+1}(k) = \frac{\mathrm{cov}(X_t - P_Y(X_t), X_{t+k+1} - P_Y(X_{t+k+1}))}{\sqrt{\mathrm{var}(X_t - P_Y(X_t))\,\mathrm{var}(X_{t+k+1} - P_Y(X_{t+k+1}))}},
\]

with

\begin{align*}
\mathrm{cov}(X_t - P_Y(X_t), X_{t+k+1} - P_Y(X_{t+k+1})) &= \mathrm{cov}(X_t, X_{t+k+1}) - \mathrm{cov}(X_t, Y)'[\mathrm{var}(Y)]^{-1}\mathrm{cov}(X_{t+k+1}, Y) \\
\mathrm{var}(X_t - P_Y(X_t)) &= \mathrm{var}(X_t) - \mathrm{cov}(X_t, Y)'[\mathrm{var}(Y)]^{-1}\mathrm{cov}(X_t, Y) \\
\mathrm{var}(X_{t+k+1} - P_Y(X_{t+k+1})) &= \mathrm{var}(X_{t+k+1}) - \mathrm{cov}(X_{t+k+1}, Y)'[\mathrm{var}(Y)]^{-1}\mathrm{cov}(X_{t+k+1}, Y).
\end{align*}

The above expression is horribly unwieldy. But many simplifications can be made once we impose
the condition of second order stationarity.

6.2.2 Partial correlation of a stationary time series


If the time series is stationary, then the shift $t$ becomes irrelevant (observe $\mathrm{cov}(X_t, X_{t+k+1}) = c(k+1)$, $\mathrm{cov}(X_t, X_t) = c(0)$ etc). We can center everything about $t = 0$; the only term that is relevant is the spacing $k$, and we define

\[
\rho_{k+1|k+1} = \frac{\mathrm{cov}(X_0 - P_Y(X_0), X_{k+1} - P_Y(X_{k+1}))}{\sqrt{\mathrm{var}(X_0 - P_Y(X_0))\,\mathrm{var}(X_{k+1} - P_Y(X_{k+1}))}},
\]

where $Y' = (X_1, X_2,\ldots,X_k)$,

\begin{align*}
\mathrm{cov}(X_0 - P_Y(X_0), X_{k+1} - P_Y(X_{k+1})) &= c(k+1) - \mathrm{cov}(X_0, Y)'[\mathrm{var}(Y)]^{-1}\mathrm{cov}(X_{k+1}, Y) \\
\mathrm{var}(X_0 - P_Y(X_0)) &= c(0) - \mathrm{cov}(X_0, Y)'[\mathrm{var}(Y)]^{-1}\mathrm{cov}(X_0, Y) \\
\mathrm{var}(X_{k+1} - P_Y(X_{k+1})) &= c(0) - \mathrm{cov}(X_{k+1}, Y)'[\mathrm{var}(Y)]^{-1}\mathrm{cov}(X_{k+1}, Y).
\end{align*}

The value of the above expression is that, given the autocovariance function, one can evaluate the partial correlation. However, this involves inverting matrices. There exists another interesting trick that simplifies the above expression even further, and in Section 7.5.1 we show how partial correlation can be evaluated without inverting any matrices. We first note that by stationarity

\[
\mathrm{cov}(X_0, Y') = (c(1), c(2),\ldots,c(k)) \quad\text{and}\quad \mathrm{cov}(X_{k+1}, Y') = (c(k), c(k-1),\ldots,c(1)).
\]

Thus the two vectors $\mathrm{cov}(X_0, Y')$ and $\mathrm{cov}(X_{k+1}, Y')$ are flips/swaps of each other. The flipping action can be done with a matrix transformation $\mathrm{cov}(X_0, Y) = E_k\mathrm{cov}(X_{k+1}, Y)$ where

\[
E_k = \begin{pmatrix}
0 & 0 & 0 & \ldots & 0 & 1 \\
0 & 0 & 0 & \ldots & 1 & 0 \\
\vdots & \vdots & \vdots & & \vdots & \vdots \\
1 & 0 & 0 & \ldots & 0 & 0
\end{pmatrix}.
\]

We now describe some useful implications of this result.

Time reversibility property of stationary time series For stationary time series, predicting into the future and predicting into the past leads to the same set of prediction coefficients (they are just flipped round). More precisely, the projection of $X_{k+1}$ onto the space spanned by $Y = (X_1, X_2,\ldots,X_k)$ is the best linear predictor of $X_{k+1}$ given $X_1,\ldots,X_k$. We will denote this projection as $P_Y(X_{k+1})$. Thus

\[
P_Y(X_{k+1}) = Y'\mathrm{var}[Y]^{-1}\mathrm{cov}[X_{k+1}, Y] = Y'\Sigma_k^{-1}c_k := \sum_{j=1}^{k}\phi_{k,j}X_{k+1-j},
\]

where $\Sigma_k = \mathrm{var}(Y)$ and $c_k = \mathrm{cov}(X_{k+1}, Y)$. But by flipping/swapping the coefficients, the same construction can be used to predict into the past $X_0$:

\[
P_Y(X_0) = \sum_{j=1}^{k}\phi_{k,j}X_j = \sum_{j=1}^{k}\phi_{k,k+1-j}X_{k+1-j}. \tag{6.15}
\]

Proof of equation (6.15)

\[
P_Y(X_0) = Y'\left(\mathrm{var}[Y]^{-1}\mathrm{cov}[X_0, Y]\right).
\]

However, second order stationarity implies that $\mathrm{cov}[X_0, Y] = E_k\mathrm{cov}[X_{k+1}, Y] = E_kc_k$. Thus

\begin{align*}
P_Y(X_0) &= Y'\left(\Sigma_k^{-1}E_k\mathrm{cov}[X_{k+1}, Y]\right) \\
&= Y'\Sigma_k^{-1}E_kc_k = Y'E_k\Sigma_k^{-1}c_k := \sum_{j=1}^{k}\phi_{k,k+1-j}X_{k+1-j},
\end{align*}

where we have used that $E_k\Sigma_kE_k = \Sigma_k$ (because $\Sigma_k$ is a symmetric Toeplitz matrix), so $E_k$ and $\Sigma_k^{-1}$ commute. Thus proving (6.15). $\Box$


With a little thought, we realize the partial correlation between $X_t$ and $X_{t+k+1}$ (where $k > 0$) is the correlation between $X_0 - P_Y(X_0) = X_0 - \sum_{j=1}^{k}\phi_{k,j}X_j$ and $X_{k+1} - P_Y(X_{k+1}) = X_{k+1} - \sum_{j=1}^{k}\phi_{k,j}X_{k+1-j}$; some algebra gives

\begin{align*}
\mathrm{cov}(X_0 - P_Y(X_0), X_{k+1} - P_Y(X_{k+1})) &= c(k+1) - c_k'E_k\Sigma_k^{-1}c_k \\
\mathrm{var}(X_0 - P_Y(X_0)) = \mathrm{var}(X_{k+1} - P_Y(X_{k+1})) &= \mathrm{var}(X_0) - c_k'\Sigma_k^{-1}c_k.
\end{align*}

The last line of the above is important. It states that the variance of the prediction error in the past $X_0 - P_Y(X_0)$ is the same as the variance of the prediction error into the future $X_{k+1} - P_Y(X_{k+1})$. This is because the process is stationary.
Thus the partial correlation is

\[
\rho_{k+1|k+1} = \frac{c(k+1) - c_k'E_k\Sigma_k^{-1}c_k}{c(0) - c_k'\Sigma_k^{-1}c_k}. \tag{6.16}
\]

In the section below we show that $\rho_{k+1|k+1}$ can be expressed in terms of the best fitting AR($k+1$) parameters (which we will first have to define).
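Formula (6.16) can be evaluated directly once the autocovariance function is known. The sketch below is a minimal Python illustration; the linear solver `solve`, the function `partial_corr` and the choice of the AR(1) autocovariance $c(r) = 0.5^{|r|}/(1-0.5^2)$ are assumptions made for this example.

```python
def solve(mat, rhs):
    """Solve mat x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [list(row) + [v] for row, v in zip(mat, rhs)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[piv] = aug[piv], aug[col]
        for r in range(col + 1, n):
            f = aug[r][col] / aug[col][col]
            aug[r] = [u - f * v for u, v in zip(aug[r], aug[col])]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (aug[i][n] - sum(aug[i][j] * x[j] for j in range(i + 1, n))) / aug[i][i]
    return x

def partial_corr(c, k):
    """Partial correlation at lag k + 1 from (6.16); c(r) is the autocovariance function."""
    if k == 0:
        return c(1) / c(0)                       # nothing in between to condition on
    sigma_k = [[c(i - j) for j in range(k)] for i in range(k)]
    c_k = [c(k - j) for j in range(k)]           # cov(X_{k+1}, (X_1, ..., X_k))
    w = solve(sigma_k, c_k)                      # Sigma_k^{-1} c_k
    flipped = c_k[::-1]                          # E_k c_k = cov(X_0, (X_1, ..., X_k))
    num = c(k + 1) - sum(f * v for f, v in zip(flipped, w))
    den = c(0) - sum(u * v for u, v in zip(c_k, w))
    return num / den

def ar1_acvf(r):
    """Autocovariance of X_t = 0.5 X_{t-1} + e_t with var(e_t) = 1."""
    return 0.5 ** abs(r) / (1 - 0.5 ** 2)

pacf = [partial_corr(ar1_acvf, k) for k in range(4)]   # lags 1, 2, 3, 4
print(pacf)
```

For the AR(1) autocovariance the partial correlation is 0.5 at lag one and zero (up to rounding) at all higher lags, since once $X_1$ is known the earlier value $X_0$ carries no further linear information about $X_2$.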

6.2.3 Best fitting AR(p) model


So far we have discussed time series which are generated by an AR model. But we have not discussed fitting an AR(p) model to any stationary time series (not necessarily where the true underlying data generating mechanism is an AR(p)), which is possibly more important. We will show that the partial correlation is related to these fitted parameters. We state precisely what we mean below.
Suppose that the stationary time series is genuinely generated with the causal AR(p) model

\[
X_t = \sum_{j=1}^{p}\phi_jX_{t-j} + \varepsilon_t \tag{6.17}
\]

where $\{\varepsilon_t\}$ are iid random variables. Then the projection of $X_t$ onto $Y = (X_{t-p},\ldots,X_{t-1})$ is

\[
P_Y(X_t) = \sum_{j=1}^{p}\phi_jX_{t-j}.
\]

This is because $Y$ does not contain any (linear) information about the innovation $\varepsilon_t$; that is, $\{X_{t-j}\}_{j=1}^{p}$ are uncorrelated with $\varepsilon_t$. In fact, because (6.17) is the true model which generates the data, $\varepsilon_t$ is independent of all $\{X_{t-j}\}$ for $j\geq 1$. But this is by virtue of the model and not the projection. The projection can only ensure that $X_t - P_Y(X_t)$ and $Y$ are uncorrelated.

The best fitting AR(p) Now let us suppose that {Xt } is a general second order stationary time
series with autocovariance {c(r)}r . We consider the projection of Xt onto Y = (Xt−p , . . . , Xt−1 )
(technically onto sp(Xt−p , . . . , Xt−1 )); this is

\[
P_Y(X_t) = \sum_{j=1}^{p} \phi_{p,j} X_{t-j}.
\]

By construction Xt − PY (Xt ) and Y are uncorrelated, but Xt − PY (Xt ) is not necessarily uncorrelated
with {Xt−j } for j ≥ (p + 1). We call {φp,j } the best fitting AR(p) coefficients, because if the
true model were an AR(p) model then φp,j = φj . The best fitting AR(p) model is very important in
applications. It is often used to forecast the time series into the future. Note we have already
alluded to $\sum_{j=1}^{p} \phi_{p,j} X_{t-j}$ in the previous section, and we summarize these results again. Since
$\sum_{j=1}^{p} \phi_{p,j} X_{t-j}$ is a projection onto Y , the coefficients $\{\phi_{p,j}\}_{j=1}^{p}$ are

\[
\phi_p = [\mathrm{var}(Y)]^{-1} \mathrm{cov}(X_t, Y) = \Sigma_p^{-1} c_p,
\]

where $[\Sigma_p]_{t,\tau} = c(t - \tau)$ and $c_p' = (c(1), c(2), \ldots, c(p))$ (observe stationarity means these are invariant
to shift).
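As a small numerical illustration of $\phi_p = \Sigma_p^{-1} c_p$, the sketch below solves the linear system for the AR(2) model used later in Section 6.2.5. The helper names (`solve`, `best_fitting_ar`, `ar2_acf`) are illustrative and not part of the notes; autocorrelations are used in place of autocovariances, which leaves the solution unchanged because the scale cancels.

```python
import math

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for k in range(col, n + 1):
                M[r][k] -= f * M[col][k]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = M[i][n] - sum(M[i][k] * x[k] for k in range(i + 1, n))
        x[i] = s / M[i][i]
    return x

def best_fitting_ar(c, p):
    """Return (phi_{p,1}, ..., phi_{p,p}) = Sigma_p^{-1} c_p from c[0..p]."""
    Sigma = [[c[abs(i - j)] for j in range(p)] for i in range(p)]
    return solve(Sigma, [c[j] for j in range(1, p + 1)])

def ar2_acf(phi1, phi2, max_lag):
    """Autocorrelations of a causal AR(2) from the Yule-Walker recursion."""
    rho = [1.0, phi1 / (1.0 - phi2)]
    for k in range(2, max_lag + 1):
        rho.append(phi1 * rho[k - 1] + phi2 * rho[k - 2])
    return rho

phi1, phi2 = 2 * 0.9 * math.cos(math.pi / 3), -(0.9 ** 2)
rho = ar2_acf(phi1, phi2, 5)
fit1 = best_fitting_ar(rho, 1)   # under-fitted: a single coefficient rho(1)
fit2 = best_fitting_ar(rho, 2)   # recovers (phi1, phi2)
fit4 = best_fitting_ar(rho, 4)   # over-fitted: extra coefficients are zero
print(fit1, fit2, fit4)
```

Fitting an order below the true one gives coefficients that differ from the model parameters, while fitting a higher order pads the true parameters with zeros.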

6.2.4 Best fitting AR(p) parameters and partial correlation


We now state the main result which connects the best fitting AR(p) parameters with partial correlation.
The partial correlation at lag (p + 1) is the last best fitting AR(p + 1) coefficient $\phi_{p+1,p+1}$.
More precisely

\[
\rho_{p+1|p+1} = \phi_{p+1,p+1}. \tag{6.18}
\]

It is this identity that is used to calculate (from the true ACF) and estimate (from the estimated
ACF) the partial correlation (and not the identity in (6.16), which is more cumbersome).
Proof of identity (6.18) To prove this result, we return to the classical multivariate case (in Section 5.3),
in particular the identity (5.12) which relates the regression coefficients to the partial
correlation:

\[
\rho_{p+1|p+1} = \phi_{p+1,p+1} \sqrt{\frac{\mathrm{var}(\varepsilon_{0|X_1,\ldots,X_{p+1}})}{\mathrm{var}(\varepsilon_{p+1|X_0,\ldots,X_p})}}
\]

where

\[
\varepsilon_{0|X_1,\ldots,X_{p+1}} = X_0 - P_{X_1,\ldots,X_{p+1}}(X_0) \quad \text{and} \quad \varepsilon_{p+1|X_0,\ldots,X_p} = X_{p+1} - P_{X_0,\ldots,X_p}(X_{p+1}).
\]

Now the important observation. We recall from the previous section that the variance of the
prediction error into the past, $X_0 - P_{X_1,\ldots,X_{p+1}}(X_0)$, is the same as the variance of the prediction error
into the future, $X_{p+1} - P_{X_0,\ldots,X_p}(X_{p+1})$. Therefore $\mathrm{var}(\varepsilon_{0|X_1,\ldots,X_{p+1}}) = \mathrm{var}(\varepsilon_{p+1|X_0,\ldots,X_p})$ and

\[
\rho_{p+1|p+1} = \phi_{p+1,p+1}.
\]

This proves equation (6.18). □
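The identity can be checked numerically. The sketch below evaluates the ratio in (6.16) and the last best fitting AR(k + 1) coefficient for the autocovariance of an MA(1) with θ = 0.5 (an illustrative choice) and confirms they agree. The function names are hypothetical helpers, not notation from the notes.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for k in range(col, n + 1):
                M[r][k] -= f * M[col][k]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = M[i][n] - sum(M[i][k] * x[k] for k in range(i + 1, n))
        x[i] = s / M[i][i]
    return x

def pacf_via_616(c, k):
    """rho_{k+1|k+1} from (6.16), with Y = (X_1, ..., X_k)."""
    Sigma = [[c[abs(i - j)] for j in range(k)] for i in range(k)]
    ck = [c[k - i] for i in range(k)]      # cov(X_{k+1}, X_{i+1}) = c(k - i)
    w = solve(Sigma, ck)                   # Sigma_k^{-1} c_k
    flipped = ck[::-1]                     # E_k c_k = cov(X_0, Y)
    num = c[k + 1] - sum(f * wi for f, wi in zip(flipped, w))
    den = c[0] - sum(ci * wi for ci, wi in zip(ck, w))
    return num / den

def last_ar_coefficient(c, p):
    """phi_{p,p}: last entry of Sigma_p^{-1} c_p."""
    Sigma = [[c[abs(i - j)] for j in range(p)] for i in range(p)]
    return solve(Sigma, [c[j] for j in range(1, p + 1)])[-1]

theta = 0.5
c = [1 + theta ** 2, theta] + [0.0] * 7    # MA(1) autocovariances c(0..8)
pairs = [(pacf_via_616(c, k), last_ar_coefficient(c, k + 1)) for k in range(1, 6)]
for k, (a, b) in enumerate(pairs, start=1):
    print(k + 1, a, b)
```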

Important observation Relating the AR(p) model to the partial correlations


Suppose the true data generating process is an AR($p_0$), and we fit an AR(p) model to the data.
If $p < p_0$, then

\[
P_{X_{t-p},\ldots,X_{t-1}}(X_t) = \sum_{j=1}^{p} \phi_{p,j} X_{t-j}
\]

and $\rho_{p|p} = \phi_{p,p}$. If $p = p_0$, then

\[
P_{X_{t-p_0},\ldots,X_{t-1}}(X_t) = \sum_{j=1}^{p_0} \phi_j X_{t-j}
\]

and $\phi_{p_0,p_0} = \rho_{p_0|p_0} = \phi_{p_0}$. For any $p > p_0$, we have

\[
P_{X_{t-p},\ldots,X_{t-1}}(X_t) = \sum_{j=1}^{p_0} \phi_j X_{t-j}.
\]

Thus the coefficient is $\rho_{p|p} = \phi_{p,p} = 0$.
Thus for AR(p) models, the partial correlation of order greater than p will be zero. We visualize
this property in the plots in the following section.

6.2.5 The partial autocorrelation plot


Of course given the time series $\{X_t\}_{t=1}^{n}$ the true partial correlation is unknown. Instead it is
estimated from the data. This is done by sequentially fitting an AR(p) model of increasing order to
the time series, extracting the parameter estimator $\hat{\phi}_{p,p} = \hat{\rho}_{p|p}$ and plotting $\hat{\rho}_{p|p}$ against p.
To illustrate the differences between the true PACF and estimated PACF (with sample size n = 100)
we consider the model

\[
X_t = 2 \cdot 0.9 \cos(\pi/3) X_{t-1} - 0.9^2 X_{t-2} + \varepsilon_t.
\]

The estimated partial autocorrelation plot (n = 100) and the true partial autocorrelation are given
in Figure 6.5. As a contrast we consider the estimated (n = 100) and true PACF of the MA model

\[
X_t = \varepsilon_t + 2 \cdot 0.9 \cos(\pi/3) \varepsilon_{t-1} - 0.9^2 \varepsilon_{t-2}.
\]

The plot is given in Figure 6.6.

[Plot: panels "Estimated PACF" (Partial ACF against Lag) and "True PACF" (pacf against lag), lags 1 to 20.]

Figure 6.5: The AR(2). Left: Estimated PACF (n = 100). Right: True PACF.

Observe that the partial correlation plot contains a blue line. This blue line corresponds to
$\pm 1.96/\sqrt{n}$ (where n is the sample size).

[Plot: panels "Estimated PACF" (Partial ACF against Lag) and "True PACF" (pacf against lag), lags 1 to 20.]

Figure 6.6: The MA(2). Left: Estimated PACF (n = 100). Right: True PACF.

This blue line can be used as an aid in selecting the autoregressive order (under certain conditions
on the time series). We show in the next lecture that if {Xt } is a linear time series with an
AR(p) representation, then for h > p

\[
\sqrt{n}\, \hat{\rho}_{h|h} \xrightarrow{D} N(0, 1), \tag{6.19}
\]

which gives the critical values $\pm 1.96/\sqrt{n}$. But do not get too excited. We show that this result
does not necessarily hold for non-linear time series. More precisely, the distribution will not be
asymptotically pivotal.

6.2.6 Using the ACF and PACF for model identification


Figures 6.3, 6.4, 6.5 and 6.6 are very useful in identifying the model. We describe what we should
observe below.

Using the ACF for model identification

If the true autocovariances after a certain lag q are zero, it may be appropriate to fit an MA(q)
model to the time series. The $[-1.96n^{-1/2}, 1.96n^{-1/2}]$ error bars for an ACF plot cannot be reliably
used to determine the order of an MA(q) model.
On the other hand, the autocovariances of any AR(p) process will only decay to zero as the lag
increases (they will not be zero after a certain number of lags).
Using the PACF for model identification

If the true partial autocorrelations after a certain lag p are zero, it may be appropriate to fit an
AR(p) model to the time series.
Of course, in practice we only have the estimated partial autocorrelation at hand and not the
true one. This is why we require the error bars. In Section 8.4 we show how these error bars are
derived. The surprising result is that the error bars of a PACF can be used to determine the
order of an AR(p) process. If the order of the autoregressive process is p, then for lag r > p, the
sample partial correlation is approximately distributed as $\hat{\phi}_{r,r} \sim N(0, n^{-1})$ (thus giving rise to the $[-1.96n^{-1/2}, 1.96n^{-1/2}]$
error bars). It should be noted that there will be correlation between the sample partial
correlations.

Exercise 6.3 (The partial correlation of an invertible MA(1)) Let $\phi_{t,t}$ denote the partial correlation
between $X_{t+1}$ and $X_1$. It is well known (this is the Levinson-Durbin algorithm, which we
cover in Chapter 7) that $\phi_{t,t}$ can be deduced recursively from the autocovariance function using the
algorithm:

Step 1 $\phi_{1,1} = c(1)/c(0)$ and $r(2) = E[X_2 - X_{2|1}]^2 = E[X_2 - \phi_{1,1} X_1]^2 = c(0) - \phi_{1,1} c(1)$.

Step 2 For $t \geq 2$

\[
\phi_{t,t} = \frac{c(t) - \sum_{j=1}^{t-1} \phi_{t-1,j}\, c(t-j)}{r(t)}, \qquad
\phi_{t,j} = \phi_{t-1,j} - \phi_{t,t}\, \phi_{t-1,t-j} \quad 1 \leq j \leq t-1,
\]

and $r(t+1) = r(t)(1 - \phi_{t,t}^2)$.

(i) Use this algorithm and induction to show that the PACF of the MA(1) process $X_t = \varepsilon_t + \theta \varepsilon_{t-1}$,
where $|\theta| < 1$ (so it is invertible) is

\[
\phi_{t,t} = \frac{(-1)^{t+1} \theta^{t} (1 - \theta^2)}{1 - \theta^{2(t+1)}}.
\]
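Before attempting the induction it can help to see the recursion run. The following sketch implements Steps 1 and 2 as stated above and compares the output with the closed form for an MA(1) with θ = 0.5 (an illustrative value); it is a numerical sanity check only, not a proof, and the function name `levinson_durbin` is ours.

```python
def levinson_durbin(c, max_lag):
    """Return [phi_{1,1}, ..., phi_{max_lag,max_lag}] from autocovariances c[0..max_lag]."""
    phi_prev = [c[1] / c[0]]                 # Step 1: phi_{1,1}
    r = c[0] - phi_prev[0] * c[1]            # r(2)
    pacf = [phi_prev[0]]
    for t in range(2, max_lag + 1):          # Step 2
        # phi_{t,t} = (c(t) - sum_{j=1}^{t-1} phi_{t-1,j} c(t-j)) / r(t)
        num = c[t] - sum(phi_prev[j - 1] * c[t - j] for j in range(1, t))
        phi_tt = num / r
        # phi_{t,j} = phi_{t-1,j} - phi_{t,t} phi_{t-1,t-j}, 1 <= j <= t-1
        phi_new = [phi_prev[j - 1] - phi_tt * phi_prev[t - j - 1] for j in range(1, t)]
        phi_new.append(phi_tt)
        r *= 1.0 - phi_tt ** 2               # r(t+1)
        phi_prev = phi_new
        pacf.append(phi_tt)
    return pacf

theta, max_lag = 0.5, 8
c = [1 + theta ** 2, theta] + [0.0] * (max_lag - 1)   # MA(1) autocovariances
numeric = levinson_durbin(c, max_lag)
closed_form = [((-1) ** (t + 1)) * theta ** t * (1 - theta ** 2) / (1 - theta ** (2 * (t + 1)))
               for t in range(1, max_lag + 1)]
for t, (a, b) in enumerate(zip(numeric, closed_form), start=1):
    print(t, a, b)
```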

Exercise 6.4 (Comparing the ACF and PACF of an AR process) Compare the below plots:

(i) Compare the ACF and PACF of the AR(2) model Xt = 1.5Xt−1 − 0.75Xt−2 + εt using
ARMAacf(ar=c(1.5,-0.75),ma=0,30) and ARMAacf(ar=c(1.5,-0.75),ma=0,pacf=T,30).

(ii) Compare the ACF and PACF of the MA(1) model Xt = εt − 0.5εt−1 using ARMAacf(ar=0,ma=c(-1.5),30)
and ARMAacf(ar=0,ma=c(-1.5),pacf=T,30).

(iii) Compare the ACF and PACF of the ARMA(2, 1) model Xt − 1.5Xt−1 + 0.75Xt−2 = εt − 0.5εt−1
using ARMAacf(ar=c(1.5,-0.75),ma=c(-1.5),30) and
ARMAacf(ar=c(1.5,-0.75),ma=c(-1.5),pacf=T,30).

Exercise 6.5 Compare the ACF and PACF plots of the monthly temperature data from 1996-2014.
Would you fit an AR, MA or ARMA model to this data?

Rcode

The sample partial autocorrelation of a time series can be obtained using the command pacf.
However, remember just because the sample PACF is not zero, does not mean the true PACF is
non-zero.

6.3 The variance and precision matrix of a stationary time series

Let us suppose that {Xt } is a stationary time series. In this section we consider the variance/covariance
matrix $\mathrm{var}(X_n) = \Sigma_n$, where $X_n = (X_1, \ldots, X_n)'$. We will consider two cases (i) when
Xt follows an MA(p) model and (ii) when Xt follows an AR(p) model. The variance and inverse
of the variance matrices for both cases yield quite interesting results. We will use classical results
from multivariate analysis, stated in Chapter 5.
We recall that the variance/covariance matrix of a stationary time series has a (symmetric)
Toeplitz structure (see wiki for a definition). Let $X_n = (X_1, \ldots, X_n)'$, then

\[
\Sigma_n = \mathrm{var}(X_n) =
\begin{pmatrix}
c(0) & c(1) & c(2) & \cdots & c(n-2) & c(n-1) \\
c(1) & c(0) & c(1) & \cdots & c(n-3) & c(n-2) \\
\vdots & \vdots & \ddots & \ddots & \vdots & \vdots \\
c(n-1) & c(n-2) & \cdots & \cdots & c(1) & c(0)
\end{pmatrix}.
\]

6.3.1 Variance matrix for AR(p) and MA(p) models
(i) If {Xt } satisfies an MA(p) model and n > p, then Σn will be bandlimited, where p off-
diagonals above and below the diagonal will be non-zero and the rest of the off-diagonal will
be zero.

(ii) If {Xt } satisfies an AR(p) model, then Σn will not be bandlimited.
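The two cases can be seen on small matrices. The sketch below builds $\Sigma_n$ from the autocovariances of an MA(1) and an AR(1) (parameter values chosen only for illustration) and reports the largest $|i - j|$ with a non-zero entry; `toeplitz` and `bandwidth` are hypothetical helper names.

```python
def toeplitz(c, n):
    """Symmetric Toeplitz matrix with (i, j) entry c(|i - j|)."""
    return [[c[abs(i - j)] for j in range(n)] for i in range(n)]

def bandwidth(M, tol=1e-12):
    """Largest |i - j| whose entry is not (numerically) zero."""
    n = len(M)
    return max(abs(i - j) for i in range(n) for j in range(n) if abs(M[i][j]) > tol)

theta, phi, n = 0.5, 0.5, 6
c_ma1 = [1 + theta ** 2, theta] + [0.0] * (n - 2)      # MA(1): c(k) = 0 for k > 1
c_ar1 = [phi ** k / (1 - phi ** 2) for k in range(n)]  # AR(1): c(k) decays but never vanishes

band_ma = bandwidth(toeplitz(c_ma1, n))
band_ar = bandwidth(toeplitz(c_ar1, n))
print(band_ma, band_ar)
```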

Precision matrix for AR(p) models

We now consider the inverse of $\Sigma_n$. Warning: note that the inverse of a Toeplitz matrix is not necessarily
Toeplitz. Suppose that the time series {Xt }t has a causal AR(p) representation:

\[
X_t = \sum_{j=1}^{p} \phi_j X_{t-j} + \varepsilon_t
\]

where {εt } are iid random variables with (for simplicity) variance $\sigma^2 = 1$. Let $X_n = (X_1, \ldots, X_n)$
and suppose n > p.
Important result The inverse variance matrix $\Sigma_n^{-1}$ is banded, with p non-zero bands off the diagonal.

Proof of claim We use the results in Chapter 5. Suppose that we have an AR(p) process and we
consider the precision matrix of $X_n = (X_1, \ldots, X_n)$, where n > p. To show this we use the Cholesky
decomposition given in (5.30). This is where

\[
\Sigma_n^{-1} = L_n' L_n
\]

where $L_n$ is the lower triangular matrix:

\[
L_n =
\begin{pmatrix}
\phi_{1,0} & 0 & \cdots & 0 & \cdots & 0 \\
-\phi_{2,1} & \phi_{2,0} & \cdots & 0 & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots & & \vdots \\
-\phi_{p,p-1} & -\phi_{p,p-2} & \cdots & \phi_{p,0} & \cdots & 0 \\
\vdots & \vdots & & \vdots & \ddots & \vdots \\
-\phi_{n,n-1} & -\phi_{n,n-2} & \cdots & \cdots & -\phi_{n,1} & \phi_{n,0}
\end{pmatrix} \tag{6.20}
\]

where $\{\phi_{\ell,j}\}_{j=1}^{\ell-1}$ are the coefficients of the best linear predictor of $X_\ell$ given $\{X_{\ell-j}\}_{j=1}^{\ell-1}$ (after standardising
by the residual variance). Since Xt is an autoregressive process of order p, if t > p,
then

\[
\phi_{t,j} =
\begin{cases}
\phi_j & 1 \leq j \leq p \\
0 & j > p
\end{cases}
\]

This gives the lower triangular p-bandlimited matrix

\[
L_n =
\begin{pmatrix}
\gamma_{1,0} & 0 & \cdots & 0 & 0 & 0 & \cdots & 0 \\
-\gamma_{2,1} & \gamma_{2,0} & \cdots & 0 & 0 & 0 & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots & \vdots & \vdots & & \vdots \\
-\phi_p & -\phi_{p-1} & \cdots & -\phi_1 & 1 & 0 & \cdots & 0 \\
0 & -\phi_p & \cdots & -\phi_2 & -\phi_1 & 1 & \cdots & 0 \\
\vdots & & \ddots & & \ddots & \ddots & \ddots & \vdots \\
0 & 0 & \cdots & -\phi_p & \cdots & -\phi_2 & -\phi_1 & 1
\end{pmatrix}. \tag{6.21}
\]

Observe the above lower triangular matrix is zero after the pth off-diagonal.
Since $\Sigma_n^{-1} = L_n' L_n$ and $L_n$ is a p-bandlimited matrix, $\Sigma_n^{-1}$ is a bandlimited matrix
with the p off-diagonals either side of the diagonal non-zero. Let $\Sigma^{(i,j)}$ denote the (i, j)th element of
$\Sigma_n^{-1}$. Then we observe that $\Sigma^{(i,j)} = 0$ if $|i - j| > p$. Moreover, if $0 < |i - j| \leq p$ and either i or j is
greater than p. Further, from Section 5.4 we observe that the coefficients $\Sigma^{(i,j)}$ are the regression
coefficients of $X_i$ (after accounting for MSE).
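A minimal sketch for the simplest case, a causal AR(1) with var(εt ) = 1 and an illustrative φ = 0.6: the matrix $L_n$ has first row $\sqrt{1 - \phi^2}$ (which standardises $X_1$) and every later row equal to (−φ, 1) on the band. Forming $L_n' L_n$ and multiplying by $\Sigma_n$ should return the identity, and the result should vanish beyond the first off-diagonal. The helper names are ours.

```python
import math

def matmul(A, B):
    """Plain matrix product of two lists of lists."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B))) for j in range(len(B[0]))]
            for i in range(len(A))]

def transpose(A):
    return [list(row) for row in zip(*A)]

phi, n = 0.6, 6
L = [[0.0] * n for _ in range(n)]
L[0][0] = math.sqrt(1 - phi ** 2)       # 1/sd(X_1), since var(X_1) = 1/(1 - phi^2)
for t in range(1, n):
    L[t][t - 1], L[t][t] = -phi, 1.0    # row t gives X_t - phi X_{t-1} = eps_t

precision = matmul(transpose(L), L)     # candidate for Sigma_n^{-1}
Sigma = [[phi ** abs(i - j) / (1 - phi ** 2) for j in range(n)] for i in range(n)]
check = matmul(precision, Sigma)        # should be the identity matrix
for row in precision:
    print([round(v, 3) for v in row])
```

The bottom right entry of the printed matrix equals 1, which is $\sigma^{-2}$ for this example.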

Exercise 6.6 Suppose that the time series {Xt } has the causal AR(2) representation

Xt = φ1 Xt−1 + φ2 Xt−2 + εt .

Let $X_n' = (X_1, \ldots, X_n)$ and $\Sigma_n = \mathrm{var}(X_n)$. Suppose $L_n' L_n = \Sigma_n^{-1}$, where $L_n$ is a lower triangular
matrix.

(i) What does $L_n$ look like?

(ii) Using $L_n$ evaluate the projection of $X_t$ onto the space spanned by $\{X_{t-j}\}_{j \neq 0}$.

Remark 6.3.1 Suppose that Xt is an autoregressive process $X_t = \sum_{j=1}^{p} \phi_j X_{t-j} + \varepsilon_t$ where $\mathrm{var}[\varepsilon_t] = \sigma^2$
and {εt } are uncorrelated random variables with zero mean. Let $\Sigma_m = \mathrm{var}[X_m]$ where $X_m =
(X_1, \ldots, X_m)$. If m > p then

\[
\left[\Sigma_m^{-1}\right]_{mm} = \Sigma^{mm} = \sigma^{-2}
\]

and $\det(\Sigma_m) = \det(\Sigma_p)\sigma^{2(m-p)}$.

Exercise 6.7 Prove Remark 6.3.1.

6.4 The ACF of non-causal time series (advanced)


Here we demonstrate that it is not possible to identify whether a process is noninvertible/noncausal
from its covariance structure. The simplest way to show this result uses the spectral density
function, which we now define and then return to and study in depth in Chapter 10.

Definition 6.4.1 (The spectral density) Given the covariances c(k) (with $\sum_k |c(k)|^2 < \infty$) the
spectral density function is defined as

\[
f(\omega) = \sum_{k} c(k) \exp(ik\omega).
\]

The covariances can be obtained from the spectral density by using the inverse Fourier transform

\[
c(k) = \frac{1}{2\pi} \int_0^{2\pi} f(\omega) \exp(-ik\omega)\, d\omega.
\]

Hence the covariance yields the spectral density and vice versa.

For reference below, we point out that the spectral density function uniquely identifies the autocovariance
function.
Let us suppose that {Xt } satisfies the AR(p) representation

\[
X_t = \sum_{i=1}^{p} \phi_i X_{t-i} + \varepsilon_t
\]

where var(εt ) = 1 and the roots of $\phi(z) = 1 - \sum_{j=1}^{p} \phi_j z^j$ can lie inside and outside the unit circle,
but not on the unit circle (thus it has a stationary solution). We will show in Chapter 10 that the
spectral density of this AR process is

\[
f(\omega) = \frac{1}{\left|1 - \sum_{j=1}^{p} \phi_j \exp(ij\omega)\right|^2}. \tag{6.22}
\]

• Factorizing f (ω).

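Let us suppose the roots of the characteristic polynomial $\phi(z) = 1 - \sum_{j=1}^{p} \phi_j z^j$ are $\{\lambda_j^{-1}\}_{j=1}^{p}$,
thus we can factorize $\phi(z) = 1 - \sum_{j=1}^{p} \phi_j z^j = \prod_{j=1}^{p} (1 - \lambda_j z)$. Using this factorization,
(6.22) can be written as

\[
f(\omega) = \frac{1}{\prod_{j=1}^{p} |1 - \lambda_j \exp(i\omega)|^2}. \tag{6.23}
\]

As we have not assumed {Xt } is causal, the roots of φ(z) can lie both inside and outside the
unit circle. We separate the factors into those whose roots lie outside the unit circle, $\{\lambda_{O,j_1}; j_1 = 1, \ldots, p_1\}$,
and those whose roots lie inside the unit circle, $\{\lambda_{I,j_2}; j_2 = 1, \ldots, p_2\}$ ($p_1 + p_2 = p$). Thus

\[
\phi(z) = \Big[\prod_{j_1=1}^{p_1} (1 - \lambda_{O,j_1} z)\Big]\Big[\prod_{j_2=1}^{p_2} (1 - \lambda_{I,j_2} z)\Big]
= (-1)^{p_2} \Big(\prod_{j_2=1}^{p_2} \lambda_{I,j_2}\Big) z^{p_2} \Big[\prod_{j_1=1}^{p_1} (1 - \lambda_{O,j_1} z)\Big]\Big[\prod_{j_2=1}^{p_2} (1 - \lambda_{I,j_2}^{-1} z^{-1})\Big]. \tag{6.24}
\]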

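Thus we can rewrite the spectral density in (6.23) as

\[
f(\omega) = \frac{1}{\prod_{j_2=1}^{p_2} |\lambda_{I,j_2}|^2} \cdot \frac{1}{\prod_{j_1=1}^{p_1} |1 - \lambda_{O,j_1} \exp(i\omega)|^2 \prod_{j_2=1}^{p_2} |1 - \lambda_{I,j_2}^{-1} \exp(i\omega)|^2} \tag{6.25}
\]

(here we use $|1 - \lambda e^{i\omega}| = |\lambda|\,|1 - \bar{\lambda}^{-1} e^{i\omega}|$ and that, since the $\phi_j$ are real, complex $\lambda_{I,j_2}$ occur in conjugate pairs).

Let

\[
f_O(\omega) = \frac{1}{\prod_{j_1=1}^{p_1} |1 - \lambda_{O,j_1} \exp(i\omega)|^2 \prod_{j_2=1}^{p_2} |1 - \lambda_{I,j_2}^{-1} \exp(i\omega)|^2}.
\]

Then $f(\omega) = \prod_{j_2=1}^{p_2} |\lambda_{I,j_2}|^{-2} f_O(\omega)$.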

• A parallel causal AR(p) process with the same covariance structure always exists.

We now define a process which has the same autocovariance function as {Xt } but is causal.
Using (6.24) we define the polynomial

\[
\tilde{\phi}(z) = \Big[\prod_{j_1=1}^{p_1} (1 - \lambda_{O,j_1} z)\Big]\Big[\prod_{j_2=1}^{p_2} (1 - \lambda_{I,j_2}^{-1} z)\Big]. \tag{6.26}
\]

By construction, the roots of this polynomial lie outside the unit circle. We then define the
AR(p) process

\[
\tilde{\phi}(B) \tilde{X}_t = \varepsilon_t; \tag{6.27}
\]

from Lemma 4.3.1 we know that $\{\tilde{X}_t\}$ has a stationary, almost surely unique solution. Moreover,
because the roots lie outside the unit circle the solution is causal.

By using (6.22) the spectral density of $\{\tilde{X}_t\}$ is $\tilde{f}(\omega) = f_O(\omega)$. We know that the spectral density
function uniquely gives the autocovariance function. Comparing the spectral density of $\{\tilde{X}_t\}$
with the spectral density of {Xt } we see that they are the same up to a multiplicative
constant. Thus they have the same autocovariance structure up to a multiplicative
constant (which can be made the same, if in the definition (6.27) the innovation process has
variance $\prod_{j_2=1}^{p_2} |\lambda_{I,j_2}|^{-2}$).

Therefore, for every non-causal process, there exists a causal process with the same autoco-
variance function.

By using the same arguments as above, we can generalize the result to ARMA processes.

Definition 6.4.2 An ARMA process is said to have minimum phase when the roots of φ(z) and
θ(z) both lie outside of the unit circle.

Remark 6.4.1 For Gaussian random processes it is impossible to discriminate between a causal
and non-causal time series, this is because the mean and autocovariance function uniquely identify
the process.
However, if the innovations are non-Gaussian, even though the autocovariance function is ‘blind’
to non-causal processes, by looking for other features in the time series we are able to discriminate
between a causal and non-causal process.

6.4.1 The Yule-Walker equations of a non-causal process
Once again let us consider the zero mean AR(p) model

\[
X_t = \sum_{j=1}^{p} \phi_j X_{t-j} + \varepsilon_t,
\]

and var(εt ) < ∞. Suppose the roots of the corresponding characteristic polynomial lie outside the
unit circle, then {Xt } is strictly stationary where the solution of Xt is only in terms of past and
present values of {εt }. Moreover, it is second order stationary with covariance {c(k)}. We recall
from Section 6.1.2, equation (6.8) that we derived the Yule-Walker equations for causal AR(p)
processes, where

\[
E(X_t X_{t-k}) = \sum_{j=1}^{p} \phi_j E(X_{t-j} X_{t-k}) \;\Rightarrow\; c(k) - \sum_{j=1}^{p} \phi_j c(k-j) = 0. \tag{6.28}
\]

Let us now consider the case that the roots of the characteristic polynomial lie both outside
and inside the unit circle, thus Xt does not have a causal solution but it is still strictly and second
order stationary (with autocovariance, say {c(k)}). In the previous section we showed that there
exists a causal AR(p) $\tilde{\phi}(B)\tilde{X}_t = \varepsilon_t$ (where $\phi(B)$ and $\tilde{\phi}(B) = 1 - \sum_{j=1}^{p} \tilde{\phi}_j B^j$ are the characteristic
polynomials defined in (6.24) and (6.26)). We showed that both have the same autocovariance
structure. Therefore,

\[
c(k) - \sum_{j=1}^{p} \tilde{\phi}_j c(k-j) = 0.
\]

This means the Yule-Walker equations for {Xt } would actually give the AR(p) coefficients of {X̃t }.
Thus if the Yule-Walker equations were used to estimate the AR coefficients of {Xt }, in reality we
would be estimating the AR coefficients of the corresponding causal {X̃t }.

6.4.2 Filtering non-causal AR models


Here we discuss the surprising result that filtering a non-causal time series with the corresponding
causal AR parameters leaves a sequence which is uncorrelated but not independent. Let us suppose
that

\[
X_t = \sum_{j=1}^{p} \phi_j X_{t-j} + \varepsilon_t,
\]

where εt are iid, E(εt ) = 0 and var(εt ) < ∞. It is clear that given the input Xt , if we apply the
filter $X_t - \sum_{j=1}^{p} \phi_j X_{t-j}$ we obtain an iid sequence (which is {εt }).

Suppose that we filter {Xt } with the causal coefficients $\{\tilde{\phi}_j\}$; the output $\tilde{\varepsilon}_t = X_t - \sum_{j=1}^{p} \tilde{\phi}_j X_{t-j}$
is not an independent sequence. However, it is an uncorrelated sequence. We illustrate this with an
example.

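Example 6.4.1 Let us return to the AR(1) example, where $X_t = \phi X_{t-1} + \varepsilon_t$. Let us suppose that
φ > 1, which corresponds to a non-causal time series, then Xt has the solution

\[
X_t = -\sum_{j=1}^{\infty} \frac{1}{\phi^{j}} \varepsilon_{t+j}.
\]

The causal time series with the same covariance structure as Xt (up to a multiplicative constant) is $\tilde{X}_t = \frac{1}{\phi} \tilde{X}_{t-1} + \varepsilon_t$ (which has
backshift representation $(1 - \frac{1}{\phi} B)\tilde{X}_t = \varepsilon_t$). Suppose we pass Xt through the causal filter

\[
\tilde{\varepsilon}_t = \Big(1 - \frac{1}{\phi} B\Big) X_t = X_t - \frac{1}{\phi} X_{t-1}
= -\frac{(1 - \frac{1}{\phi} B)}{\phi B (1 - \frac{1}{\phi B})} \varepsilon_t
= \frac{1}{\phi^2} \varepsilon_t - \Big(1 - \frac{1}{\phi^2}\Big) \sum_{j=1}^{\infty} \frac{1}{\phi^{j}} \varepsilon_{t+j}.
\]

Evaluating the covariance of the above (assuming wlog that var(ε) = 1) gives, for $r \geq 1$,

\[
\mathrm{cov}(\tilde{\varepsilon}_t, \tilde{\varepsilon}_{t+r}) = -\frac{1}{\phi^2}\Big(1 - \frac{1}{\phi^2}\Big)\frac{1}{\phi^{r}} + \Big(1 - \frac{1}{\phi^2}\Big)^2 \frac{1}{\phi^{r}} \sum_{j=1}^{\infty} \frac{1}{\phi^{2j}} = 0.
\]

Thus we see that $\{\tilde{\varepsilon}_t\}$ is an uncorrelated sequence, but unless it is Gaussian it is clearly not independent.
One method to study the higher order dependence of $\{\tilde{\varepsilon}_t\}$ is by considering its higher order
cumulant structure etc.

The above result can be generalised to general AR models, and it is relatively straightforward
to prove using the Cramér representation of a stationary process (see Section 10.5, Theorem ??).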
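The cancellation in Example 6.4.1 can be confirmed numerically. The sketch below, with an illustrative φ = 2 and the infinite sums truncated at J = 200 terms (both our choices), builds the coefficients of $\tilde{\varepsilon}_t$ on the future innovations in two ways and evaluates the autocovariance at several lags, taking var(εt ) = 1.

```python
phi, J = 2.0, 200      # phi > 1 (non-causal); J truncates the infinite sums

# Coefficient of eps_{t+j} in the filtered series, from the closed form in the example
a = [1.0 / phi ** 2] + [-(1 - 1 / phi ** 2) / phi ** j for j in range(1, J + 1)]

# Cross-check: X_t = sum_{j>=1} x[j] eps_{t+j} with x[j] = -phi^{-j}, and
# X_{t-1} = sum_{m>=0} x[m+1] eps_{t+m}, so X_t - X_{t-1}/phi has coefficients:
x = [0.0] + [-1.0 / phi ** j for j in range(1, J + 1)]
direct = [x[j] - x[j + 1] / phi for j in range(J)]

def autocov(coef, r):
    """cov of sum_j coef[j] eps_{t+j} with its shift by r, for iid unit-variance eps."""
    return sum(coef[j] * coef[j + r] for j in range(len(coef) - r))

variance = autocov(a, 0)
lagged = [autocov(a, r) for r in range(1, 6)]
print(variance, lagged)
```

The variance is non-zero while every lagged covariance vanishes up to rounding, so second order properties alone cannot distinguish the filtered sequence from white noise.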

Exercise 6.8 (i) Consider the causal AR(2) process

Xt = 1.5Xt−1 − 0.75Xt−2 + εt .

Derive a parallel process with the same autocovariance structure but that is non-causal (it
should be real).

(ii) Simulate both from the causal process above and the corresponding non-causal process with
non-Gaussian innovations (see Section 4.8). Show that they have the same ACF function.

(iii) Find features which allow you to discriminate between the causal and non-causal process.

Chapter 7

Prediction

Prerequisites

• The best linear predictor.

• Difference between best linear predictors and best predictors.

[Need to explain]

• Some idea of what a basis of a vector space is.

Objectives

• Understand that prediction using a long past can be difficult because a large matrix has to
be inverted, thus alternative, recursive methods are often used to avoid direct inversion.

• Understand the derivation of the Levinson-Durbin algorithm, and why the coefficient, φt,t ,
corresponds to the partial correlation between X1 and Xt+1 .

• Understand how these predictive schemes can be used to write the space sp(Xt , Xt−1 , . . . , X1 ) in
terms of an orthogonal basis sp(Xt − PXt−1 ,Xt−2 ,...,X1 (Xt ), . . . , X1 ).

• Understand how the above leads to the Wold decomposition of a second order stationary
time series.

• Understand how to approximate the prediction for an ARMA time series by a scheme
which explicitly uses the ARMA structure, and that this approximation improves geometrically
when the past is large.

One motivation behind fitting models to a time series is to forecast future unobserved observa-
tions - which would not be possible without a model. In this chapter we consider forecasting, based
on the assumption that the model and/or autocovariance structure is known.

7.1 Using prediction in estimation


There are various reasons prediction is important. The first is that forecasting has a vast number
of applications from finance to climatology. The second reason is that it forms the basis of most
estimation schemes. To understand why forecasting is important in the latter, we now obtain the
“likelihood” of the observed time series {Xt }nt=1 . We assume the joint density of X n = (X1 , . . . , Xn )
is fn (xn ; θ). By using conditioning it is clear that the likelihood is

fn (xn ; θ) = f1 (x1 ; θ)f2 (x2 |x1 ; θ)f3 (x3 |x2 , x1 ; θ) . . . fn (xn |xn−1 , . . . , x1 ; θ)

Therefore the log-likelihood is

\[
\log f_n(x_n; \theta) = \log f_1(x_1; \theta) + \sum_{t=2}^{n} \log f_t(x_t | x_{t-1}, \ldots, x_1; \theta).
\]

The parameters may be the AR, ARMA, ARCH, GARCH etc parameters. However, usually the
conditional distributions $f_t(x_t | x_{t-1}, \ldots, x_1; \theta)$ which make up the joint density f (x; θ) are completely
unknown. However, often we can get away with assuming that the conditional distribution is
Gaussian and we can still consistently estimate the parameters so long as the model has been
correctly specified. Now, if we can “pretend” that the conditional distribution is Gaussian, then
all we need is the conditional mean and the conditional variance

E(Xt |Xt−1 , . . . , X1 ; θ) = E (Xt |Xt−1 , . . . , X1 ; θ) and V (Xt |Xt−1 , . . . , X1 , θ) = var (Xt |Xt−1 , . . . , X1 ; θ) .

Using the above and the “Gaussianity” of the conditional distribution gives (ignoring the additive constant)

\[
\log f_t(x_t | x_{t-1}, \ldots, x_1; \theta) = -\frac{1}{2} \log V(x_t | x_{t-1}, \ldots, x_1, \theta) - \frac{(x_t - E(x_t | x_{t-1}, \ldots, x_1, \theta))^2}{2 V(x_t | x_{t-1}, \ldots, x_1, \theta)}.
\]

Using the above the log density is

\[
\log f_n(x_n; \theta) = -\frac{1}{2} \sum_{t=1}^{n} \left[ \log V(x_t | x_{t-1}, \ldots, x_1, \theta) + \frac{(x_t - E(x_t | x_{t-1}, \ldots, x_1, \theta))^2}{V(x_t | x_{t-1}, \ldots, x_1, \theta)} \right].
\]

Thus the log-likelihood is

\[
L(X_n; \theta) = -\frac{1}{2} \sum_{t=1}^{n} \left[ \log V(X_t | X_{t-1}, \ldots, X_1, \theta) + \frac{(X_t - E(X_t | X_{t-1}, \ldots, X_1, \theta))^2}{V(X_t | X_{t-1}, \ldots, X_1, \theta)} \right].
\]

Therefore we observe that in order to evaluate the log-likelihood, and estimate the parameters, we
require the conditional mean and the conditional variance

E(Xt |Xt−1 , . . . , X1 ; θ) and V (Xt |Xt−1 , . . . , X1 ; θ).

This means that in order to do any form of estimation we need a clear understanding of what the
conditional mean (which is simply the best predictor of the observation tomorrow given the past)
and the conditional variance is for various models.
Note:

• Often expressions for conditional mean and variance can be extremely unwieldy. Therefore,
often we require approximations of the conditional mean and variance which are tractable (this
is reminiscent of the Box-Jenkins approach and is still used when the conditional expectation
and variance are difficult to estimate).

• Suppose we “pretend” that the time series {Xt } is Gaussian (which we can do if it is linear,
even if it is not actually Gaussian, but not if the time series is nonlinear, since nonlinear time series
are not Gaussian). Then the conditional variance var(Xt |Xt−1 , . . . , X1 ) will not be random
(this is a well known result for Gaussian random variables). If Xt is nonlinear, it can be
conditionally Gaussian but not Gaussian.

• If the model is linear usually the conditional expectation E(Xt |Xt−1 , . . . , X1 ; θ) is replaced
with the best linear predictor of Xt given Xt−1 , . . . , X1 . This means if the model is in fact
non-causal the estimator will give a causal solution instead. Though not critical it is worth
bearing in mind.
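To make the role of the conditional mean and variance concrete, here is a minimal sketch of the Gaussian log-likelihood above for a causal AR(1), where the conditional mean is $\phi x_{t-1}$ and the conditional variance is $\sigma^2$ for t ≥ 2. The treatment of the first observation (stationary variance), the simulated data, the seed and the grid search are all illustrative assumptions, not a method prescribed in the notes.

```python
import math
import random

def ar1_gaussian_loglik(x, phi, sigma2):
    """-(1/2) sum_t [log V_t + (x_t - E_t)^2 / V_t], additive constants dropped."""
    v1 = sigma2 / (1.0 - phi ** 2)            # t = 1: stationary variance, mean 0
    total = math.log(v1) + x[0] ** 2 / v1
    for t in range(1, len(x)):
        pred = phi * x[t - 1]                 # conditional mean E(X_t | past)
        total += math.log(sigma2) + (x[t] - pred) ** 2 / sigma2
    return -0.5 * total

# Simulate a causal AR(1) with Gaussian innovations (assumed setup)
random.seed(1)
phi_true, n = 0.6, 500
x = [random.gauss(0.0, math.sqrt(1.0 / (1.0 - phi_true ** 2)))]
for _ in range(n - 1):
    x.append(phi_true * x[-1] + random.gauss(0.0, 1.0))

# Crude maximisation over a grid of causal values of phi, with sigma^2 fixed at 1
grid = [k / 100 for k in range(-95, 96)]
phi_hat = max(grid, key=lambda p: ar1_gaussian_loglik(x, p, 1.0))
print(phi_hat)
```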

7.2 Forecasting for autoregressive processes
Worked example: AR(1) Let

Xt+1 = φXt + εt+1

where {εt }t are iid random variables. We will assume the process is causal, thus |φ| < 1. Since {εt }
are iid random variables, Xt contains no information about εt+1 . Therefore the best linear predictor (indeed
the best predictor) of Xt+1 given all the past information is

\[
X_t(1) = \phi X_t.
\]

To quantify the error in the prediction we use the mean squared error

\[
\sigma^2 = E[X_{t+1} - X_t(1)]^2 = E[X_{t+1} - \phi X_t]^2 = \mathrm{var}[\varepsilon_{t+1}].
\]

Xt (1) gives the one-step ahead prediction. Since

\[
X_{t+2} = \phi X_{t+1} + \varepsilon_{t+2} = \phi^2 X_t + \phi \varepsilon_{t+1} + \varepsilon_{t+2}
\]

and {εt } are iid random variables, then the best linear predictor (and best predictor) of Xt+2 given
Xt is

\[
X_t(2) = \phi X_t(1) = \phi^2 X_t.
\]

Observe it recurses on the previous best linear predictor which makes it very easy to evaluate. The
mean squared error in the forecast is

\[
E[X_{t+2} - X_t(2)]^2 = E[\phi \varepsilon_{t+1} + \varepsilon_{t+2}]^2 = (1 + \phi^2)\mathrm{var}[\varepsilon_t].
\]

Using a similar strategy we can forecast r steps into the future:

\[
X_t(r) = \phi X_t(r-1) = \phi^r X_t
\]

where the mean squared error is

\[
E[X_{t+r} - X_t(r)]^2 = E\Big[\sum_{i=0}^{r-1} \phi^i \varepsilon_{t+r-i}\Big]^2 = \mathrm{var}[\varepsilon_t] \sum_{i=0}^{r-1} \phi^{2i}.
\]
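The AR(1) forecast and its mean squared error translate directly into a few lines of code. The function name `ar1_forecast` and the numbers in the usage line are illustrative.

```python
def ar1_forecast(x_t, phi, r, sigma2=1.0):
    """Return (X_t(r), MSE) for a causal AR(1) with innovation variance sigma2."""
    pred = x_t
    for _ in range(r):
        pred = phi * pred                      # X_t(k) = phi * X_t(k-1)
    mse = sigma2 * sum(phi ** (2 * i) for i in range(r))
    return pred, mse

pred3, mse3 = ar1_forecast(2.0, 0.5, 3)
print(pred3, mse3)
```

As r grows the forecast shrinks towards the mean (zero) and the mean squared error increases towards the stationary variance $\sigma^2/(1 - \phi^2)$.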

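Worked example: AR(2) We now extend the above prediction strategy to AR(2) models (it is
straightforward to go to the AR(p) model). It is best understood using the vector AR representation
of the model. Let

\[
X_{t+1} = \phi_1 X_t + \phi_2 X_{t-1} + \varepsilon_{t+1}
\]

where {εt }t are iid random variables and the characteristic polynomial has all its roots outside the unit circle (so the process is causal). We can rewrite the
AR(2) as a VAR(1)

\[
\begin{pmatrix} X_{t+1} \\ X_t \end{pmatrix}
= \begin{pmatrix} \phi_1 & \phi_2 \\ 1 & 0 \end{pmatrix}
\begin{pmatrix} X_t \\ X_{t-1} \end{pmatrix}
+ \begin{pmatrix} \varepsilon_{t+1} \\ 0 \end{pmatrix}
\;\Rightarrow\; \underline{X}_{t+1} = \Phi \underline{X}_t + \underline{\varepsilon}_{t+1}.
\]

This looks like an AR(1) and motivates how to forecast into the future. Since εt+1 is independent
of {Xt−j }j≥0 the best linear predictor of Xt+1 can be obtained using

\[
X_t(1) = \left[ \begin{pmatrix} \phi_1 & \phi_2 \\ 1 & 0 \end{pmatrix} \begin{pmatrix} X_t \\ X_{t-1} \end{pmatrix} \right]_{(1)} = \phi_1 X_t + \phi_2 X_{t-1}.
\]

The mean squared error is $E[X_t(1) - X_{t+1}]^2 = \sigma^2$. To forecast two steps into the future we use that

\[
\underline{X}_{t+2} = \Phi^2 \underline{X}_t + \Phi \underline{\varepsilon}_{t+1} + \underline{\varepsilon}_{t+2}.
\]

Thus the best linear predictor of Xt+2 is

\[
X_t(2) = [\Phi^2 \underline{X}_t]_{(1)} = \phi_1(2) X_t + \phi_2(2) X_{t-1},
\]

where $[\cdot]_{(1)}$ denotes the first entry in the vector and $(\phi_1(2), \phi_2(2))$ is the first row vector in the
matrix $\Phi^2$. The mean squared error is

\[
E(\phi_1 \varepsilon_{t+1} + \varepsilon_{t+2})^2 = (1 + \phi_1^2)\mathrm{var}(\varepsilon_t).
\]

We continue this iteration to obtain the r-step ahead predictor

\[
X_t(r) = [\Phi \underline{X}_t(r-1)]_{(1)} = [\Phi^r \underline{X}_t]_{(1)} = \phi_1(r) X_t + \phi_2(r) X_{t-1};
\]

as above $(\phi_1(r), \phi_2(r))$ is the first row vector in the matrix $\Phi^r$. The mean squared error is

\[
E(X_{t+r} - X_t(r))^2 = E\Big(\sum_{i=0}^{r-1} [\Phi^i]_{(1,1)} \varepsilon_{t+r-i}\Big)^2 = \mathrm{var}[\varepsilon_t] \sum_{i=0}^{r-1} ([\Phi^i]_{(1,1)})^2.
\]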
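The VAR(1) construction can be sketched as follows: build the companion matrix Φ, raise it to the power r, read the forecast off the first row and accumulate $([\Phi^i]_{(1,1)})^2$ for the mean squared error. The code is written for general p, although the usage line is an AR(2) with illustrative coefficients and data; the function names are ours.

```python
def matmul(A, B):
    """Plain matrix product of two lists of lists."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B))) for j in range(len(B[0]))]
            for i in range(len(A))]

def companion(phis):
    """Companion matrix: AR coefficients in the first row, shifted identity below."""
    p = len(phis)
    return [list(phis)] + [[1.0 if j == i else 0.0 for j in range(p)] for i in range(p - 1)]

def forecast(phis, recent, r, sigma2=1.0):
    """recent = (X_t, X_{t-1}, ..., X_{t-p+1}). Returns (X_t(r), MSE)."""
    p = len(phis)
    Phi = companion(phis)
    power = [[1.0 if i == j else 0.0 for j in range(p)] for i in range(p)]   # Phi^0
    mse = 0.0
    for _ in range(r):
        mse += sigma2 * power[0][0] ** 2      # adds ([Phi^i]_{(1,1)})^2 for i = 0..r-1
        power = matmul(power, Phi)
    pred = sum(power[0][j] * recent[j] for j in range(p))   # first entry of Phi^r X_t
    return pred, mse

pred1, mse1 = forecast([0.9, -0.81], [1.0, 2.0], 1)
pred2, mse2 = forecast([0.9, -0.81], [1.0, 2.0], 2)
print(pred1, mse1, pred2, mse2)
```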

7.3 Forecasting for AR(p)


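The above iteration for calculating the best linear predictor easily generalises for any AR(p) process.
Let

\[
X_{t+1} = \phi_1 X_t + \phi_2 X_{t-1} + \ldots + \phi_p X_{t+1-p} + \varepsilon_{t+1}
\]

where {εt }t are iid random variables and the characteristic polynomial has all its roots outside the unit circle (so the process is causal). We can rewrite the
AR(p) as a VAR(1)

\[
\begin{pmatrix} X_{t+1} \\ X_t \\ \vdots \\ \vdots \\ X_{t-p+2} \end{pmatrix}
=
\begin{pmatrix}
\phi_1 & \phi_2 & \phi_3 & \cdots & \phi_p \\
1 & 0 & 0 & \cdots & 0 \\
0 & 1 & 0 & \cdots & 0 \\
\vdots & & \ddots & & \vdots \\
0 & 0 & \cdots & 1 & 0
\end{pmatrix}
\begin{pmatrix} X_t \\ X_{t-1} \\ \vdots \\ \vdots \\ X_{t-p+1} \end{pmatrix}
+
\begin{pmatrix} \varepsilon_{t+1} \\ 0 \\ \vdots \\ \vdots \\ 0 \end{pmatrix}
\;\Rightarrow\; \underline{X}_{t+1} = \Phi \underline{X}_t + \underline{\varepsilon}_{t+1}.
\]

Therefore the r step ahead predictor is

\[
X_t(r) = [\Phi \underline{X}_t(r-1)]_{(1)} = [\Phi^r \underline{X}_t]_{(1)} = \sum_{j=1}^{p} \phi_j(r) X_{t+1-j};
\]

as above $(\phi_1(r), \phi_2(r), \ldots, \phi_p(r))$ is the first row vector in the matrix $\Phi^r$. The mean squared error
is

\[
E(X_{t+r} - X_t(r))^2 = E\Big(\sum_{i=0}^{r-1} [\Phi^i]_{(1,1)} \varepsilon_{t+r-i}\Big)^2
= \mathrm{var}[\varepsilon_t] \sum_{i=0}^{r-1} ([\Phi^i]_{(1,1)})^2
= \mathrm{var}[\varepsilon_t] \sum_{i=0}^{r-1} \phi_1(i)^2,
\]

where $\phi_1(0) = [\Phi^0]_{(1,1)} = 1$.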

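The above predictors are easily obtained using a recursion. However, we now link $\{\phi_j(r)\}_{j=1}^{p}$
to the underlying AR (and MA) coefficients.

Lemma 7.3.1 Suppose Xt has a causal AR(p) representation

\[
X_{t+1} = \phi_1 X_t + \phi_2 X_{t-1} + \ldots + \phi_p X_{t+1-p} + \varepsilon_{t+1}
\]

and

\[
X_{t+1} = \Big(1 - \sum_{j=1}^{p} \phi_j B^j\Big)^{-1} \varepsilon_{t+1} = \sum_{j=0}^{\infty} \psi_j \varepsilon_{t+1-j}
\]

is its MA(∞) representation. Then the predictive coefficients are

\[
\phi_j(r) = \sum_{s=0}^{p-j} \phi_{j+s} \psi_{r-1-s} = \sum_{u=j}^{\min(p,\, j+r-1)} \phi_u \psi_{r-1+j-u} \qquad r \geq 1
\]

(with the convention $\psi_k = 0$ for $k < 0$) and the best r-ahead predictor is

\[
X_t(r) = \sum_{j=1}^{p} X_{t+1-j} \sum_{s=0}^{p-j} \phi_{j+s} \psi_{r-1-s} \qquad r \geq 1.
\]

The mean squared error is

\[
E[X_{t+r} - X_t(r)]^2 = \mathrm{var}[\varepsilon_t] \sum_{i=0}^{r-1} \psi_i^2
\]

with $\psi_0 = 1$.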
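Lemma 7.3.1 can be checked numerically against the companion matrix approach. The sketch below computes the MA(∞) weights from the recursion $\psi_k = \sum_j \phi_j \psi_{k-j}$, evaluates $\phi_j(r)$ from the lemma (treating $\psi_k$ as zero for negative k), and compares with the first row of $\Phi^r$. The AR(3) coefficients are an arbitrary causal example and the function names are ours.

```python
def psi_weights(phis, m):
    """psi_0, ..., psi_m of the MA(infinity) representation of a causal AR(p)."""
    psi = [1.0]
    for k in range(1, m + 1):
        psi.append(sum(phis[j - 1] * psi[k - j] for j in range(1, min(k, len(phis)) + 1)))
    return psi

def predictive_coefficients(phis, r):
    """phi_j(r) = sum_{s=0}^{p-j} phi_{j+s} psi_{r-1-s}, with psi_k = 0 for k < 0."""
    p = len(phis)
    psi = psi_weights(phis, r)
    return [sum(phis[j + s - 1] * psi[r - 1 - s] for s in range(p - j + 1) if r - 1 - s >= 0)
            for j in range(1, p + 1)]

def first_row_of_power(phis, r):
    """First row of Phi^r, where Phi is the companion matrix of the AR(p)."""
    p = len(phis)
    row = list(phis)                                   # first row of Phi^1
    for _ in range(r - 1):
        # (row * Phi)[j] = row[0] * phi_{j+1} + row[j+1]
        row = [row[0] * phis[j] + (row[j + 1] if j + 1 < p else 0.0) for j in range(p)]
    return row

phis = [0.5, -0.3, 0.1]          # sum of |phi_j| < 1, so the AR(3) is causal
psi = psi_weights(phis, 6)
results = [(predictive_coefficients(phis, r), first_row_of_power(phis, r)) for r in range(1, 7)]
for r, (lemma, matrix) in enumerate(results, start=1):
    print(r, lemma, matrix)
```

The first entry of each row is $\phi_1(r) = \psi_r$, which is why the two expressions for the mean squared error (in terms of $[\Phi^i]_{(1,1)}$ and in terms of $\psi_i$) coincide.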

7.4 Forecasting for general time series using infinite past

In the previous section we focussed on time series which had an AR(p) representation. We now
consider general time series models and best linear predictors (linear forecasts) for such time series.
Specifically, we focus on predicting the future given the (unrealistic situation of the) infinite past. Of
course, this is an idealized setting, and in the next section we consider linear forecasts based on the
finite past (for general stationary time series). A technical assumption we will use in this section is
that the stationary time series {Xt } has both an AR(∞) and MA(∞) representation (its spectral
density is bounded away from zero and is finite):

\[
X_{t+1} = \sum_{j=0}^{\infty} \psi_j \varepsilon_{t+1-j} = \sum_{j=1}^{\infty} a_j X_{t+1-j} + \varepsilon_{t+1}, \tag{7.1}
\]

where {εt } are iid random variables (recall Definition 4.5.2). A technical point is that the assumption
on {εt } can be relaxed to uncorrelated random variables if we are willing to consider best linear
predictors and not best predictors. Using (7.1), it is clear the best linear one-ahead predictor is

\[
X_t(1) = \sum_{j=1}^{\infty} a_j X_{t+1-j}, \tag{7.2}
\]

and the mean squared error is $E[X_{t+1} - X_t(1)]^2 = \sigma^2$. Transferring the ideas for the AR(p) model
(predicting r steps ahead), the best linear predictor r-steps ahead for the general time series is

\[
X_t(r) = \sum_{j=1}^{\infty} \phi_j(r) X_{t+1-j} \qquad r \geq 1. \tag{7.3}
\]

But analogous to Lemma 7.3.1 we can show that

\[
\phi_j(r) = \sum_{s=0}^{\infty} a_{j+s} \psi_{r-1-s} \qquad r \geq 1
\]

(with the convention $\psi_k = 0$ for $k < 0$). Substituting this into (7.3) gives

\[
X_t(r) = \sum_{j=1}^{\infty} X_{t+1-j} \sum_{s=0}^{\infty} a_{j+s} \psi_{r-1-s} \qquad r \geq 1.
\]

This is not a particularly simple method for evaluating the predictors as one goes further into the
future. Later in this section we derive a recursion for prediction. First, we obtain the mean squared
error in the prediction.
To obtain the mean squared error, we note that since Xt , Xt−1 , Xt−2 , . . . is observed, we can
obtain ετ (for τ ≤ t) by using the invertibility condition

\[
\varepsilon_\tau = X_\tau - \sum_{j=1}^{\infty} a_j X_{\tau-j}.
\]

This means that given the time series $\{X_{t-j}\}_{j=0}^{\infty}$ (and AR(∞) parameters {aj }) we can obtain all
the innovations $\{\varepsilon_{t-j}\}_{j=0}^{\infty}$ and vice versa. Based on this we revisit the problem of predicting $X_{t+r}$
given {Xτ ; τ ≤ t} but this time in terms of the innovations. Using the MA(∞) representation (since
the time series is causal) of $X_{t+r}$ we have

\[
X_{t+r} = \underbrace{\sum_{j=0}^{\infty} \psi_{j+r} \varepsilon_{t-j}}_{\text{innovations are `observed'}} + \underbrace{\sum_{j=0}^{r-1} \psi_j \varepsilon_{t+r-j}}_{\text{future innovations impossible to predict}}.
\]

Thus we can write the best predictor of $X_{t+r}$ given $\{X_{t-j}\}_{j=0}^{\infty}$ as

\[
X_t(r) = \sum_{j=0}^{\infty} \psi_{j+r} \varepsilon_{t-j} \tag{7.4}
\]
\[
= \sum_{j=0}^{\infty} \psi_{j+r} \Big( X_{t-j} - \sum_{i=1}^{\infty} a_i X_{t-j-i} \Big)
= \sum_{j=1}^{\infty} \phi_j(r) X_{t+1-j}.
\]

Using the above we see that the mean squared error is

\[
E[X_{t+r} - X_t(r)]^2 = E\Big[\sum_{j=0}^{r-1} \psi_j \varepsilon_{t+r-j}\Big]^2 = \sigma^2 \sum_{j=0}^{r-1} \psi_j^2.
\]

We now show how Xt (r) can be evaluated recursively using the invertibility assumption.

Step 1 We use invertibility in (7.2) to give
\[
X_t(1) = \sum_{i=1}^{\infty} a_i X_{t+1-i},
\]
and $E[X_{t+1} - X_t(1)]^2 = \mathrm{var}[\varepsilon_t]$.

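Step 2 To obtain the 2-step ahead predictor we note that
\begin{align*}
X_{t+2} &= \sum_{i=2}^{\infty} a_i X_{t+2-i} + a_1 X_{t+1} + \varepsilon_{t+2} \\
&= \sum_{i=2}^{\infty} a_i X_{t+2-i} + a_1\left[X_t(1) + \varepsilon_{t+1}\right] + \varepsilon_{t+2},
\end{align*}
thus it is clear that
\[
X_t(2) = \sum_{i=2}^{\infty} a_i X_{t+2-i} + a_1 X_t(1)
\]
and $E[X_{t+2} - X_t(2)]^2 = \mathrm{var}[\varepsilon_t]\left(a_1^2 + 1\right) = \mathrm{var}[\varepsilon_t]\left(1 + \psi_1^2\right)$.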

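Step 3 To obtain the 3-step ahead predictor we note that
\begin{align*}
X_{t+3} &= \sum_{i=3}^{\infty} a_i X_{t+3-i} + a_2 X_{t+1} + a_1 X_{t+2} + \varepsilon_{t+3} \\
&= \sum_{i=3}^{\infty} a_i X_{t+3-i} + a_2\left(X_t(1) + \varepsilon_{t+1}\right) + a_1\left(X_t(2) + a_1\varepsilon_{t+1} + \varepsilon_{t+2}\right) + \varepsilon_{t+3}.
\end{align*}
Thus
\[
X_t(3) = \sum_{i=3}^{\infty} a_i X_{t+3-i} + a_2 X_t(1) + a_1 X_t(2)
\]
and $E[X_{t+3} - X_t(3)]^2 = \mathrm{var}[\varepsilon_t]\left((a_2 + a_1^2)^2 + a_1^2 + 1\right) = \mathrm{var}[\varepsilon_t]\left(1 + \psi_1^2 + \psi_2^2\right)$.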

Step r Using the same arguments it can be shown that
\[
X_t(r) = \underbrace{\sum_{i=r}^{\infty} a_i X_{t+r-i}}_{\text{observed}} + \underbrace{\sum_{i=1}^{r-1} a_i X_t(r-i)}_{\text{predicted}}.
\]
And we have already shown that $E[X_{t+r} - X_t(r)]^2 = \sigma^2 \sum_{j=0}^{r-1} \psi_j^2$.

Thus the r-step ahead predictor can be recursively estimated.


We note that the predictor given above is based on the assumption that the infinite past is observed. In practice this is not a realistic assumption. However, in the special case that the time series is an autoregressive process of order p (with AR parameters $\{\phi_j\}_{j=1}^{p}$) and $X_t, \ldots, X_{t-m}$ is observed where $m \geq p - 1$, then the above scheme can be used for forecasting. More precisely,
\begin{align*}
X_t(1) &= \sum_{j=1}^{p} \phi_j X_{t+1-j} \\
X_t(r) &= \sum_{j=r}^{p} \phi_j X_{t+r-j} + \sum_{j=1}^{r-1} \phi_j X_t(r-j) \qquad \text{for } 2 \leq r \leq p \\
X_t(r) &= \sum_{j=1}^{p} \phi_j X_t(r-j) \qquad \text{for } r > p. \tag{7.5}
\end{align*}

However, in the general case more sophisticated algorithms are required when only the finite
past is known.
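The recursion (7.5) amounts to appending each forecast to the history and reusing it as if it were an observation. A minimal sketch follows; the function name `ar_forecast` and the list-based interface are assumptions made for illustration, and the AR parameters are taken as known.

```python
def ar_forecast(x, phi, steps):
    """Forecasts [X_t(1), ..., X_t(steps)] for an AR(p) model, following (7.5).

    x   : observed series, most recent value last (at least p values)
    phi : [phi_1, ..., phi_p], assumed known
    """
    p = len(phi)
    if len(x) < p:
        raise ValueError("need at least p observations")
    history = list(x[-p:])          # only the last p values matter
    forecasts = []
    for _ in range(steps):
        # phi_1 multiplies the most recent value, phi_2 the one before, ...
        pred = sum(phi[j] * history[-1 - j] for j in range(p))
        forecasts.append(pred)
        history.append(pred)        # predicted values replace unobserved ones
    return forecasts


# Illustrative AR(2) with phi = (0.5, 0.2) and last two observations 1.0, 2.0.
print(ar_forecast([1.0, 2.0], [0.5, 0.2], 3))
```

Once the horizon exceeds p the history consists only of forecasts, which is the third line of (7.5).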

7.4.1 Example: Forecasting yearly temperatures


We now fit an autoregressive model to the yearly temperatures from 1880-2008 and use this model to forecast the temperatures from 2009-2013. In Figure 7.1 we give a plot of the temperature time series together with its ACF. It is clear there is some trend in the temperature data; therefore we have taken second differences. A plot of the second differences and its ACF is given in Figure 7.2. We now use the command ar.yule(res1,order.max=10) (we will discuss in Chapter 9 how this function estimates the AR parameters) to estimate the AR parameters.

[Figure 7.1: Yearly temperature from 1880-2013 and the ACF. Left panel: temp against Time; right panel: ACF of the series global.mean against Lag.]

[Figure 7.2: Second differences of yearly temperature from 1880-2013 and its ACF. Left panel: second.differences against Time; right panel: ACF of the series diff2 against Lag.]

Remark 7.4.1 (The Yule-Walker estimator in prediction) The least squares estimator (or equivalently the conditional likelihood) is likely to give a causal estimator of the AR parameters. But it is not guaranteed. On the other hand the Yule-Walker estimator is guaranteed to give a causal solution. This will matter for prediction. We emphasize here that the least squares estimator cannot consistently estimate non-causal solutions; it is only a quirk of the estimation method that means at times the solution may be noncausal.
If the time series $\{X_t\}_t$ is linear and stationary with mean zero, then if we predict several steps into the future we would expect our predictor to be close to zero (since $E(X_t) = 0$). This is guaranteed if one uses AR parameters which are causal (since the eigenvalues of the VAR matrix are less than one in absolute value), such as the Yule-Walker estimators. On the other hand, if the parameter estimators do not correspond to a causal solution (as could happen for the least squares estimator), the predictors may explode for long term forecasts, which makes no sense.

The function ar.yule uses the AIC to select the order of the AR model. When fitting the second differences (from 1880-2008, a data set of length 127) the AIC chooses the AR(7) model
\[
X_t = -1.1472X_{t-1} - 1.1565X_{t-2} - 1.0784X_{t-3} - 0.7745X_{t-4} - 0.6132X_{t-5} - 0.3515X_{t-6} - 0.1575X_{t-7} + \varepsilon_t,
\]
with $\mathrm{var}[\varepsilon_t] = \sigma^2 = 0.02294$. An ACF plot of the estimated residuals $\{\hat\varepsilon_t\}$, obtained after fitting this model, is given in Figure 7.3. We observe that the estimated residuals 'appear' to be uncorrelated, which suggests that the AR(7) model fits the data well. Later we define the Ljung-Box test, which is a method for checking this claim. However, since the residuals are estimated residuals and not the true residuals, the results of this test need to be taken with a large pinch of salt. We will show that when the residuals are estimated from the data the error bars given in the ACF plot are not correct and the Ljung-Box test is not pivotal (as is assumed when deriving the limiting distribution under the null that the model is correct). By using the sequence of equations

[Figure 7.3: An ACF plot of the estimated residuals $\{\hat\varepsilon_t\}$ (ACF against Lag).]
\begin{align*}
\hat X_{127}(1) &= -1.1472X_{127} - 1.1565X_{126} - 1.0784X_{125} - 0.7745X_{124} - 0.6132X_{123} - 0.3515X_{122} - 0.1575X_{121} \\
\hat X_{127}(2) &= -1.1472\hat X_{127}(1) - 1.1565X_{127} - 1.0784X_{126} - 0.7745X_{125} - 0.6132X_{124} - 0.3515X_{123} - 0.1575X_{122} \\
\hat X_{127}(3) &= -1.1472\hat X_{127}(2) - 1.1565\hat X_{127}(1) - 1.0784X_{127} - 0.7745X_{126} - 0.6132X_{125} - 0.3515X_{124} - 0.1575X_{123} \\
\hat X_{127}(4) &= -1.1472\hat X_{127}(3) - 1.1565\hat X_{127}(2) - 1.0784\hat X_{127}(1) - 0.7745X_{127} - 0.6132X_{126} - 0.3515X_{125} - 0.1575X_{124} \\
\hat X_{127}(5) &= -1.1472\hat X_{127}(4) - 1.1565\hat X_{127}(3) - 1.0784\hat X_{127}(2) - 0.7745\hat X_{127}(1) - 0.6132X_{127} - 0.3515X_{126} - 0.1575X_{125}.
\end{align*}

We can use $\hat X_{127}(1), \ldots, \hat X_{127}(5)$ as forecasts of $X_{128}, \ldots, X_{132}$ (which, we recall, are the second differences), which we then use to construct forecasts of the temperatures. A plot of the second difference forecasts together with the true values is given in Figure 7.4. From the forecasts of the second differences we can obtain forecasts of the original data. Let $Y_t$ denote the temperature at time t and $X_t$ its second difference. Then $Y_t = -Y_{t-2} + 2Y_{t-1} + X_t$. Using this we have
\begin{align*}
\hat Y_{127}(1) &= -Y_{126} + 2Y_{127} + \hat X_{127}(1) \\
\hat Y_{127}(2) &= -Y_{127} + 2\hat Y_{127}(1) + \hat X_{127}(2) \\
\hat Y_{127}(3) &= -\hat Y_{127}(1) + 2\hat Y_{127}(2) + \hat X_{127}(3)
\end{align*}
and so forth.
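Undoing the second difference is a two-term recursion, so it is easy to mechanise. The sketch below is illustrative only (the function name `undo_second_difference` is hypothetical); it applies $Y_t = 2Y_{t-1} - Y_{t-2} + X_t$ repeatedly, feeding each forecast of Y back in.

```python
def undo_second_difference(y_prev2, y_prev1, x_forecasts):
    """Turn forecasts of second differences into forecasts of the original series.

    y_prev2, y_prev1 : the last two observed values of Y (older first)
    x_forecasts      : forecasts of the second differences, in time order
    """
    y_forecasts = []
    for x_hat in x_forecasts:
        y_hat = 2.0 * y_prev1 - y_prev2 + x_hat   # Y_t = 2 Y_{t-1} - Y_{t-2} + X_t
        y_forecasts.append(y_hat)
        y_prev2, y_prev1 = y_prev1, y_hat         # slide the window forward
    return y_forecasts


# Sanity check on Y_t = t^2, whose second difference is constant and equal to 2:
# starting from Y_3 = 9 and Y_4 = 16 the recursion recovers the next squares.
print(undo_second_difference(9.0, 16.0, [2.0, 2.0, 2.0]))
```

Because every step reuses earlier forecasts, errors in the second difference forecasts accumulate in the forecasts of the original series.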
We note that (??) can be used to give the mean squared error. For example
\[
E[X_{128} - \hat X_{127}(1)]^2 = \sigma^2, \qquad E[X_{129} - \hat X_{127}(2)]^2 = (1 + \phi_1^2)\sigma^2.
\]

If we believe the residuals are Gaussian we can use the mean squared error to construct confidence intervals for the predictions. Assuming for now that the parameter estimates are the true parameters (this is not the case), and $X_t = \sum_{j=0}^{\infty} \psi_j(\hat\phi)\varepsilon_{t-j}$ is the MA(∞) representation of the AR(7) model, the mean squared error for the kth ahead predictor is
\[
\sigma^2 \sum_{j=0}^{k-1} \psi_j(\hat\phi)^2 \qquad \text{(using (??))}
\]
thus the 95% CI for the prediction is
\[
\left[ X_t(k) \pm 1.96\,\sigma \Big( \sum_{j=0}^{k-1} \psi_j(\hat\phi)^2 \Big)^{1/2} \right],
\]
that is, the forecast plus or minus 1.96 times the square root of the mean squared error. However, this confidence interval does not take into account that $X_t(k)$ uses parameter estimators and not the true values. In reality we need to take into account the approximation error here too.
If the residuals are not Gaussian, the above interval is not a 95% confidence interval for the prediction. One way to account for the non-Gaussianity is to use the bootstrap. Specifically, we rewrite the AR(7) process as an MA(∞) process
\[
X_t = \sum_{j=0}^{\infty} \psi_j(\hat\phi)\varepsilon_{t-j}.
\]
Hence the best linear predictor can be rewritten as
\[
X_t(k) = \sum_{j=k}^{\infty} \psi_j(\hat\phi)\varepsilon_{t+k-j}
\]
thus giving the prediction error
\[
X_{t+k} - X_t(k) = \sum_{j=0}^{k-1} \psi_j(\hat\phi)\varepsilon_{t+k-j}.
\]
We have the prediction estimates, therefore all we need is to obtain the distribution of $\sum_{j=0}^{k-1} \psi_j(\hat\phi)\varepsilon_{t+k-j}$. This can be done by estimating the residuals and then using the bootstrap (see the footnote below) to estimate the distribution of $\sum_{j=0}^{k-1} \psi_j(\hat\phi)\varepsilon_{t+k-j}$, using the empirical distribution of $\sum_{j=0}^{k-1} \psi_j(\hat\phi)\varepsilon^{*}_{t+k-j}$. From this we can construct the 95% CI for the forecasts.

Footnote 1: Residual bootstrap is based on sampling from the empirical distribution of the residuals, i.e. construct the "bootstrap" sequence $\{\varepsilon^{*}_{t+k-j}\}_j$ by sampling from the empirical distribution $\hat F(x) = \frac{1}{n}\sum_{t=p+1}^{n} I(\hat\varepsilon_t \leq x)$ (where $\hat\varepsilon_t = X_t - \sum_{j=1}^{p} \hat\phi_j X_{t-j}$). This sequence is used to construct the bootstrap estimator $\sum_{j=0}^{k-1} \psi_j(\hat\phi)\varepsilon^{*}_{t+k-j}$. By doing this several thousand times we can evaluate the empirical distribution of $\sum_{j=0}^{k-1} \psi_j(\hat\phi)\varepsilon^{*}_{t+k-j}$ using these bootstrap samples. This is an estimator of the distribution function of $\sum_{j=0}^{k-1} \psi_j(\hat\phi)\varepsilon_{t+k-j}$.
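The residual bootstrap described above can be sketched in a few lines. This is a hedged illustration rather than the procedure used to produce the figures: the function name `bootstrap_forecast_interval` is hypothetical, the residuals are centred before resampling (an assumption not stated in the text), and the interval is read off from simple empirical quantiles of the simulated prediction errors.

```python
import random


def bootstrap_forecast_interval(residuals, psi, point_forecast,
                                n_boot=2000, level=0.95, seed=0):
    """Bootstrap interval for a k-step ahead forecast.

    residuals      : estimated residuals from the fitted model
    psi            : [psi_0, ..., psi_{k-1}], MA(infinity) weights of the fitted model
    point_forecast : the forecast X_t(k)
    """
    rng = random.Random(seed)
    mean_res = sum(residuals) / len(residuals)
    centred = [e - mean_res for e in residuals]      # assumption: centre the residuals
    # Each replicate is sum_j psi_j * eps*_j with eps* drawn with replacement.
    errors = sorted(
        sum(w * rng.choice(centred) for w in psi) for _ in range(n_boot)
    )
    lower = errors[int((1 - level) / 2 * n_boot)]
    upper = errors[min(n_boot - 1, int((1 + level) / 2 * n_boot))]
    return point_forecast + lower, point_forecast + upper


# Hypothetical residuals and weights, for illustration only.
res = [0.12, -0.20, 0.05, -0.01, 0.18, -0.14, 0.03, -0.03]
print(bootstrap_forecast_interval(res, [1.0, -1.1472], point_forecast=0.1))
```

Like the Gaussian interval, this treats the estimated parameters as if they were the true ones, so it does not capture the estimation error.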

[Figure 7.4: Forecasts of second differences. The plot shows the second difference against year (2000 to 2012), with separate markers for the forecasts and the true values.]

A small criticism of our approach is that we have fitted a rather large AR(7) model to a time series of length 127. It may be more appropriate to fit an ARMA model to this time series.

Exercise 7.1 In this exercise we analyze the Sunspot data found on the course website. In the data
analysis below only use the data from 1700 - 2003 (the remaining data we will use for prediction).
In this section you will need to use the function ar.yw in R.

(i) Fit the following models to the data and study the residuals (using the ACF). Using this decide which model
\[
X_t = \mu + A\cos(\omega t) + B\sin(\omega t) + \underbrace{\varepsilon_t}_{\text{AR}} \qquad \text{or} \qquad X_t = \mu + \underbrace{\varepsilon_t}_{\text{AR}}
\]
is more appropriate (take into account the number of parameters estimated overall).

(ii) Use these models to forecast the sunspot numbers from 2004-2013.

7.5 One-step ahead predictors based on the finite past
We return to Section 6.2.3 and recall the definition of the best fitting AR(p) model.
The best fitting AR(p) Let us suppose that $\{X_t\}$ is a general second order stationary time series with autocovariance $\{c(r)\}_r$. We consider the projection of $X_t$ onto $Y = (X_{t-p}, \ldots, X_{t-1})$ (technically we should say $\mathrm{sp}(X_{t-p}, \ldots, X_{t-1})$), this is
\[
P_Y(X_t) = \sum_{j=1}^{p} \phi_{p,j} X_{t-j}
\]
where
\[
\begin{pmatrix} \phi_{p,1} \\ \vdots \\ \phi_{p,p} \end{pmatrix} = \Sigma_p^{-1} r_p, \tag{7.6}
\]
where $(\Sigma_p)_{i,j} = c(i-j)$ and $(r_p)_i = c(i)$ for $1 \leq i, j \leq p$ (the covariance between $X_t$ and $X_{t-i}$). We recall that $X_t - P_Y(X_t)$ and $Y$ are uncorrelated but $X_t - P_Y(X_t)$ is not necessarily uncorrelated with $\{X_{t-j}\}$ for $j \geq p+1$. We call $\{\phi_{p,j}\}$ the best fitting AR(p) coefficients, because if the true model were an AR(p) model then $\phi_{p,j} = \phi_j$.
Since $X_t - P_Y(X_t)$ is uncorrelated with $Y = (X_{t-p}, \ldots, X_{t-1})$, the best linear predictor of $X_t$ given $Y = (X_{t-p}, \ldots, X_{t-1})$ is
\[
P_Y(X_t) = \sum_{j=1}^{p} \phi_{p,j} X_{t-j}.
\]

7.5.1 Levinson-Durbin algorithm


The Levinson-Durbin algorithm, which we describe below forms the basis of several estimation
algorithms for linear time series. These include (a) the Gaussian Maximum likelihood estimator,
(b) the Yule-Walker estimator and (c) the Burg algorithm. We describe these methods in Chapter
9. But we start with a description of the Levinson-Durbin algorithm.
The Levinson-Durbin algorithm is a method for evaluating $\{\phi_{p,j}\}_{j=1}^{p}$ for an increasing number of past regressors (under the assumption of second order stationarity). A brute force method is to evaluate $\{\phi_{p,j}\}_{j=1}^{p}$ using (7.6), where $\Sigma_p^{-1}$ is evaluated using standard methods, such as Gauss-Jordan elimination. To solve this system of equations requires $O(p^3)$ operations. The beauty of the Levinson-Durbin algorithm is that it exploits the (Toeplitz) structure of $\Sigma_p$ to reduce the number of operations to $O(p^2)$. It is evaluated recursively by increasing the order of lags p. It was first proposed in the 1940s by Norman Levinson (for Toeplitz equations). In the 1960s, Jim Durbin adapted the algorithm to time series and improved it. In the discussion below we switch p to t.
We recall that the aim in one-step ahead prediction is to predict $X_{t+1}$ given $X_t, X_{t-1}, \ldots, X_1$. The best linear predictor is
\[
X_{t+1|t} = P_{X_1,\ldots,X_t}(X_{t+1}) = X_{t+1|t,\ldots,1} = \sum_{j=1}^{t} \phi_{t,j} X_{t+1-j}. \tag{7.7}
\]

The notation can get a little heavy. But the important point to remember is that as t grows we
are not predicting further into the future. We are including more of the past in the one-step ahead
prediction.
We first outline the algorithm. We recall that the best linear predictor of $X_{t+1}$ given $X_t, \ldots, X_1$ is
\[
X_{t+1|t} = \sum_{j=1}^{t} \phi_{t,j} X_{t+1-j}. \tag{7.8}
\]

The mean squared error is $r(t+1) = E[X_{t+1} - X_{t+1|t}]^2$. Given the second order stationary covariance structure, the idea of the Levinson-Durbin algorithm is to recursively evaluate $\{\phi_{t,j}; j = 1, \ldots, t\}$ given $\{\phi_{t-1,j}; j = 1, \ldots, t-1\}$ (which are the coefficients of the best linear predictor of $X_t$ given $X_{t-1}, \ldots, X_1$). Let us suppose that the autocovariance function $c(k) = \mathrm{cov}[X_0, X_k]$ is known. The Levinson-Durbin algorithm is calculated using the following recursion.

Step 1 $\phi_{1,1} = c(1)/c(0)$ and $r(2) = E[X_2 - X_{2|1}]^2 = E[X_2 - \phi_{1,1}X_1]^2 = c(0) - \phi_{1,1}c(1)$.

Step 2 For $t \geq 2$
\[
\phi_{t,t} = \frac{c(t) - \sum_{j=1}^{t-1} \phi_{t-1,j}\,c(t-j)}{r(t)}
\]
\[
\phi_{t,j} = \phi_{t-1,j} - \phi_{t,t}\,\phi_{t-1,t-j}, \qquad 1 \leq j \leq t-1.
\]

Step 3 $r(t+1) = r(t)(1 - \phi_{t,t}^2)$.

We give two proofs of the above recursion.
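The three steps translate directly into code. The sketch below is a minimal illustration, not a reference implementation: the function name `levinson_durbin` and the dictionary-based return values are assumptions, and it uses the convention $r(1) = c(0)$ so that Step 1 becomes the $t = 1$ case of Steps 2 and 3.

```python
def levinson_durbin(c, n):
    """Levinson-Durbin recursion.

    c : autocovariances [c(0), c(1), ..., c(n)], assumed known
    Returns (phi, r) where phi[t] = [phi_{t,1}, ..., phi_{t,t}] for t = 1..n
    and r[t] is the one-step mean squared error r(t) for t = 1..n+1.
    """
    phi = {}
    r = {1: c[0]}                       # convention: r(1) = var(X_1) = c(0)
    for t in range(1, n + 1):
        prev = phi.get(t - 1, [])       # coefficients of order t-1 (empty when t = 1)
        num = c[t] - sum(prev[j - 1] * c[t - j] for j in range(1, t))
        ptt = num / r[t]                # phi_{t,t}
        # phi_{t,j} = phi_{t-1,j} - phi_{t,t} * phi_{t-1,t-j} for 1 <= j <= t-1
        phi[t] = [prev[j - 1] - ptt * prev[t - j - 1] for j in range(1, t)] + [ptt]
        r[t + 1] = r[t] * (1.0 - ptt ** 2)
    return phi, r


# Illustrative autocovariances c(0) = 1.25, c(1) = 0.5 and zero afterwards.
phi, r = levinson_durbin([1.25, 0.5, 0.0, 0.0], 3)
print(phi[3], r[4])
```

Each order t costs O(t) operations, so computing all orders up to n costs $O(n^2)$, in line with the operation count quoted earlier.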

Exercise 7.2 (i) Suppose $X_t = \phi X_{t-1} + \varepsilon_t$ (where $|\phi| < 1$). Use the Levinson-Durbin algorithm to deduce an expression for $\phi_{t,j}$ for $1 \leq j \leq t$.

(ii) Suppose $X_t = \phi\varepsilon_{t-1} + \varepsilon_t$ (where $|\phi| < 1$). Use the Levinson-Durbin algorithm (and possibly Maple/Matlab) to deduce an expression for $\phi_{t,j}$ for $1 \leq j \leq t$ (recall from Exercise 6.3 that you already have an analytic expression for $\phi_{t,t}$).

7.5.2 A proof of the Durbin-Levinson algorithm based on projections
Let us suppose {Xt } is a zero mean stationary time series and c(k) = E(Xk X0 ). Let PXt ,...,X2 (X1 )
denote the best linear predictor of X1 given Xt , . . . , X2 and PXt ,...,X2 (Xt+1 ) denote the best linear
predictor of Xt+1 given Xt , . . . , X2 . Stationarity means that the following predictors share the same
coefficients

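\begin{align*}
X_{t|t-1} &= \sum_{j=1}^{t-1} \phi_{t-1,j} X_{t-j}, \qquad P_{X_t,\ldots,X_2}(X_{t+1}) = \sum_{j=1}^{t-1} \phi_{t-1,j} X_{t+1-j}, \tag{7.9} \\
P_{X_t,\ldots,X_2}(X_1) &= \sum_{j=1}^{t-1} \phi_{t-1,j} X_{j+1}.
\end{align*}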

The last line is because stationarity means that flipping a time series round has the same correlation
structure. These three relations are an important component of the proof.
Recall our objective is to derive the coefficients of the best linear predictor $P_{X_t,\ldots,X_1}(X_{t+1})$ based on the coefficients of the best linear predictor $P_{X_{t-1},\ldots,X_1}(X_t)$. To do this we partition the space $\mathrm{sp}(X_t, \ldots, X_2, X_1)$ into two orthogonal spaces $\mathrm{sp}(X_t, \ldots, X_2, X_1) = \mathrm{sp}(X_t, \ldots, X_2) \oplus \mathrm{sp}(X_1 - P_{X_t,\ldots,X_2}(X_1))$. Therefore by uncorrelatedness we have the partition
\begin{align*}
X_{t+1|t} &= P_{X_t,\ldots,X_2}(X_{t+1}) + P_{X_1 - P_{X_t,\ldots,X_2}(X_1)}(X_{t+1}) \\
&= \underbrace{\sum_{j=1}^{t-1} \phi_{t-1,j} X_{t+1-j}}_{\text{by (7.9)}} + \underbrace{\phi_{t,t}\left(X_1 - P_{X_t,\ldots,X_2}(X_1)\right)}_{\text{by projection onto one variable}} \\
&= \sum_{j=1}^{t-1} \phi_{t-1,j} X_{t+1-j} + \phi_{t,t}\Big(X_1 - \underbrace{\sum_{j=1}^{t-1} \phi_{t-1,j} X_{j+1}}_{\text{by (7.9)}}\Big). \tag{7.10}
\end{align*}

We start by evaluating an expression for φt,t (which in turn will give the expression for the other
coefficients). It is straightforward to see that

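\begin{align*}
\phi_{t,t} &= \frac{E\left(X_{t+1}(X_1 - P_{X_t,\ldots,X_2}(X_1))\right)}{E\left(X_1 - P_{X_t,\ldots,X_2}(X_1)\right)^2} \tag{7.11} \\
&= \frac{E\left[(X_{t+1} - P_{X_t,\ldots,X_2}(X_{t+1}) + P_{X_t,\ldots,X_2}(X_{t+1}))(X_1 - P_{X_t,\ldots,X_2}(X_1))\right]}{E\left(X_1 - P_{X_t,\ldots,X_2}(X_1)\right)^2} \\
&= \frac{E\left[(X_{t+1} - P_{X_t,\ldots,X_2}(X_{t+1}))(X_1 - P_{X_t,\ldots,X_2}(X_1))\right]}{E\left(X_1 - P_{X_t,\ldots,X_2}(X_1)\right)^2}.
\end{align*}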

Therefore we see that the numerator of $\phi_{t,t}$ is the partial covariance between $X_{t+1}$ and $X_1$ (see Section 6.2), furthermore the denominator of $\phi_{t,t}$ is the mean squared prediction error, since by stationarity
\[
E\left(X_1 - P_{X_t,\ldots,X_2}(X_1)\right)^2 = E\left(X_t - P_{X_{t-1},\ldots,X_1}(X_t)\right)^2 = r(t). \tag{7.12}
\]
Returning to (7.11), expanding out the expectation in the numerator (where $E[X_{t+1}X_1] = c(t)$) and using (7.12) we have
\[
\phi_{t,t} = \frac{E\left(X_{t+1}(X_1 - P_{X_t,\ldots,X_2}(X_1))\right)}{r(t)} = \frac{c(t) - E\left[X_{t+1}P_{X_t,\ldots,X_2}(X_1)\right]}{r(t)} = \frac{c(t) - \sum_{j=1}^{t-1} \phi_{t-1,j}\,c(t-j)}{r(t)}, \tag{7.13}
\]

which immediately gives us the first equation in Step 2 of the Levinson-Durbin algorithm. To
obtain the recursion for φt,j we use (7.10) to give

\begin{align*}
X_{t+1|t} &= \sum_{j=1}^{t} \phi_{t,j} X_{t+1-j} \\
&= \sum_{j=1}^{t-1} \phi_{t-1,j} X_{t+1-j} + \phi_{t,t}\Big(X_1 - \sum_{j=1}^{t-1} \phi_{t-1,j} X_{j+1}\Big).
\end{align*}

To obtain the recursion we simply compare coefficients to give

φt,j = φt−1,j − φt,t φt−1,t−j 1 ≤ j ≤ t − 1.

This gives the second equation in Step 2. To obtain the recursion for the mean squared prediction error we note that by orthogonality of $\{X_t, \ldots, X_2\}$ and $X_1 - P_{X_t,\ldots,X_2}(X_1)$ we use (7.10) to give
\begin{align*}
r(t+1) = E(X_{t+1} - X_{t+1|t})^2 &= E\left[X_{t+1} - P_{X_t,\ldots,X_2}(X_{t+1}) - \phi_{t,t}(X_1 - P_{X_t,\ldots,X_2}(X_1))\right]^2 \\
&= E\left[X_{t+1} - P_{X_t,\ldots,X_2}(X_{t+1})\right]^2 + \phi_{t,t}^2 E\left[X_1 - P_{X_t,\ldots,X_2}(X_1)\right]^2 \\
&\quad - 2\phi_{t,t} E\left[(X_{t+1} - P_{X_t,\ldots,X_2}(X_{t+1}))(X_1 - P_{X_t,\ldots,X_2}(X_1))\right] \\
&= r(t) + \phi_{t,t}^2 r(t) - 2\phi_{t,t}\underbrace{E\left[X_{t+1}(X_1 - P_{X_t,\ldots,X_2}(X_1))\right]}_{= r(t)\phi_{t,t} \text{ by (7.13)}} \\
&= r(t)\left[1 - \phi_{t,t}^2\right].
\end{align*}
This gives Step 3 of the Levinson-Durbin algorithm.

7.5.3 Applying the Durbin-Levinson to obtain the Cholesky decomposition
We recall from Section 5.5 that sequentially projecting the elements of a random vector on the past elements in the vector gives rise to the Cholesky decomposition of the inverse of the variance/covariance (precision) matrix. This is exactly what was done when we derived the Durbin-Levinson algorithm. In other words,
\[
\mathrm{var}\begin{pmatrix}
\dfrac{X_1}{\sqrt{r(1)}} \\[2mm]
\dfrac{X_2 - \phi_{1,1}X_1}{\sqrt{r(2)}} \\
\vdots \\
\dfrac{X_n - \sum_{j=1}^{n-1} \phi_{n-1,j} X_{n-j}}{\sqrt{r(n)}}
\end{pmatrix} = I_n,
\]
where $r(1) = \mathrm{var}[X_1] = c(0)$. Therefore, if $\Sigma_n = \mathrm{var}[\underline{X}_n]$, where $\underline{X}_n = (X_1, \ldots, X_n)$, then $\Sigma_n^{-1} = L_n' D_n L_n$, where
\[
L_n = \begin{pmatrix}
1 & 0 & \ldots & \ldots & \ldots & 0 \\
-\phi_{1,1} & 1 & 0 & \ldots & \ldots & 0 \\
-\phi_{2,2} & -\phi_{2,1} & 1 & 0 & \ldots & 0 \\
\vdots & \vdots & \ddots & \ddots & \ddots & \vdots \\
-\phi_{n-1,n-1} & -\phi_{n-1,n-2} & -\phi_{n-1,n-3} & \ldots & \ldots & 1
\end{pmatrix} \tag{7.14}
\]
and $D_n = \mathrm{diag}(r(1)^{-1}, r(2)^{-1}, \ldots, r(n)^{-1})$.
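The factorisation $\Sigma_n^{-1} = L_n' D_n L_n$ can be verified numerically on a small example. The sketch below is illustrative: the helper names `ld_coeffs` and `precision_from_ld` are hypothetical, the autocovariances are assumed known, and plain nested lists are used instead of a matrix library to keep it self-contained.

```python
def ld_coeffs(c, n):
    """Levinson-Durbin coefficients phi[t] (t = 1..n-1) and errors r[t] (t = 1..n)."""
    phi, r = {}, {1: c[0]}
    for t in range(1, n):
        prev = phi.get(t - 1, [])
        ptt = (c[t] - sum(prev[j - 1] * c[t - j] for j in range(1, t))) / r[t]
        phi[t] = [prev[j - 1] - ptt * prev[t - j - 1] for j in range(1, t)] + [ptt]
        r[t + 1] = r[t] * (1.0 - ptt ** 2)
    return phi, r


def precision_from_ld(c, n):
    """Build L_n and D_n as in (7.14) and return L_n' D_n L_n as nested lists."""
    phi, r = ld_coeffs(c, n)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):                     # row i encodes X_{i+1} - sum_j phi_{i,j} X_{i+1-j}
        L[i][i] = 1.0
        for j in range(1, i + 1):
            L[i][i - j] = -phi[i][j - 1]
    d = [1.0 / r[i + 1] for i in range(n)]
    return [[sum(L[k][a] * d[k] * L[k][b] for k in range(n)) for b in range(n)]
            for a in range(n)]


# Check against the Toeplitz covariance matrix with c(0) = 1.25, c(1) = 0.5.
c, n = [1.25, 0.5, 0.0, 0.0], 4
P = precision_from_ld(c, n)
Sigma = [[c[abs(i - j)] for j in range(n)] for i in range(n)]
product = [[sum(P[i][k] * Sigma[k][j] for k in range(n)) for j in range(n)] for i in range(n)]
print([[round(v, 10) for v in row] for row in product])   # should be close to the identity
```

No matrix inversion is performed anywhere; the precision matrix is assembled from the prediction coefficients and the prediction errors alone.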

7.6 Comparing finite and infinite predictors (advanced)
We recall that
\[
X_{t+1|t} = P_{X_t,\ldots,X_1}(X_{t+1}) = \sum_{j=1}^{t} \phi_{t,j} X_{t+1-j},
\]
which is the best linear predictor given the finite past. However, often $\phi_{t,j}$ can be difficult to evaluate (usually with the Durbin-Levinson algorithm) in comparison to the AR(∞) parameters. Thus we define the approximation
\[
\hat X_{t+1|t} = \sum_{j=1}^{t} \phi_j X_{t+1-j}.
\]
How good an approximation $\hat X_{t+1|t}$ is of $X_{t+1|t}$ is given by Baxter's inequality.

Theorem 7.6.1 (Baxter's inequality) Suppose $\{X_t\}$ has an AR(∞) representation with parameters $\{\phi_j\}_{j=1}^{\infty}$ such that $\sum_{j=1}^{\infty} |\phi_j| < \infty$. Let $\{\phi_{n,j}\}_{j=1}^{n}$ denote the parameters of the best linear predictor of $X_{n+1}$ given $\{X_j\}_{j=1}^{n}$. Then if n is large enough we have
\[
\sum_{j=1}^{n} |\phi_{n,j} - \phi_j| \leq C \sum_{j=n+1}^{\infty} |\phi_j|,
\]
where C is a constant that depends on the underlying spectral density.

We note that since $\sum_{j=1}^{\infty} |\phi_j| < \infty$, then $\sum_{j=n+1}^{\infty} |\phi_j| \to 0$ as $n \to \infty$. Thus as n gets large
\[
\sum_{j=1}^{n} |\phi_{n,j} - \phi_j| \approx 0.
\]
We apply this result to measuring the difference between $X_{t+1|t}$ and $\hat X_{t+1|t}$. By stationarity $E|X_{t+1-j}| = E|X_t|$, so
\[
E|X_{t+1|t} - \hat X_{t+1|t}| \leq \sum_{j=1}^{t} |\phi_{t,j} - \phi_j|\, E|X_{t+1-j}| = E|X_t| \sum_{j=1}^{t} |\phi_{t,j} - \phi_j| \leq C\,E|X_t| \sum_{j=t+1}^{\infty} |\phi_j|.
\]
Therefore the best linear predictor and its approximation are "close" for large t.
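Baxter's inequality can be illustrated on a case where everything is available in closed form. The sketch below is a hedged example, not taken from the notes: it assumes an MA(1) process $X_t = \varepsilon_t + \theta\varepsilon_{t-1}$ with $\mathrm{var}(\varepsilon_t) = 1$, whose AR(∞) coefficients are $-(-\theta)^j$, and the helper names `best_linear_coeffs` and `baxter_gap` are hypothetical. It compares the finite-past coefficients $\phi_{n,j}$ (from the Levinson-Durbin recursion) with the AR(∞) coefficients.

```python
def best_linear_coeffs(c, n):
    """[phi_{n,1}, ..., phi_{n,n}] from autocovariances c[0..n] via Levinson-Durbin."""
    phi, r = [], c[0]
    for t in range(1, n + 1):
        ptt = (c[t] - sum(phi[j - 1] * c[t - j] for j in range(1, t))) / r
        phi = [phi[j - 1] - ptt * phi[t - j - 1] for j in range(1, t)] + [ptt]
        r *= 1.0 - ptt ** 2
    return phi


def baxter_gap(theta, n):
    """sum_{j<=n} |phi_{n,j} - a_j| for the MA(1) X_t = eps_t + theta * eps_{t-1}."""
    c = [1.0 + theta ** 2, theta] + [0.0] * (n - 1)      # autocovariances c(0), ..., c(n)
    a = [-(-theta) ** j for j in range(1, n + 1)]        # AR(infinity) coefficients
    return sum(abs(p - q) for p, q in zip(best_linear_coeffs(c, n), a))


theta = 0.5
for n in (2, 5, 10, 20):
    tail = theta ** (n + 1) / (1.0 - theta)              # sum_{j>n} |a_j|
    print(n, baxter_gap(theta, n), tail)
```

In this example the left-hand side of the inequality shrinks at the same geometric rate as the tail sum on the right-hand side.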

7.7 r-step ahead predictors based on the finite past
Let $Y = (X_{t-p}, \ldots, X_{t-1})$. Then
\[
P_Y(X_{t+r}) = \sum_{j=1}^{p} \phi_{p,j}(r) X_{t-j}
\]
where
\[
\begin{pmatrix} \phi_{p,1}(r) \\ \vdots \\ \phi_{p,p}(r) \end{pmatrix} = \Sigma_p^{-1} r_{p,r}, \tag{7.15}
\]
where $(\Sigma_p)_{i,j} = c(i-j)$ and $(r_{p,r})_i = c(i+r)$. This gives the best finite predictor for the time series at lag r. In practice, one often finds the best fitting AR(p) model, which gives the best finite predictor at lag one, and then uses the AR prediction method described in Section 7.3 to predict forward
\[
\hat X_t(r) = [\Phi_p \hat{\underline{X}}_t(r-1)]_{(1)} = [\Phi_p^r \underline{X}_t]_{(1)} = \sum_{j=1}^{p} \phi_j(r, p) X_{t+1-j}
\]
where
\[
\Phi_p = \begin{pmatrix}
\phi_{p,1} & \phi_{p,2} & \phi_{p,3} & \ldots & \phi_{p,p} \\
1 & 0 & 0 & \ldots & 0 \\
0 & 1 & 0 & \ldots & 0 \\
\vdots & \ddots & \ddots & \ddots & \vdots \\
0 & 0 & \ldots & 1 & 0
\end{pmatrix}.
\]

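If the true model is not an AR(p) this will not give the best linear predictor, but it will give an approximation of it. Suppose that $j > n$. For ARMA models
\[
\sum_{j=1}^{n} |\phi_j(\tau; p) - \phi_{j,n}(\tau)||X_{t-j}| =
\begin{cases}
O_p(\rho^{p}) & \tau \leq p \\
O_p\left(\rho^{p}\rho^{|\tau - p|}\right) & \tau > p.
\end{cases}
\]

Lemma 7.7.1 Suppose the MA(∞) and AR(∞) parameters satisfy $\sum_j |j^K \psi_j| < \infty$ and $\sum_j |j^K a_j| < \infty$ for some $K > 1$. Then
\[
\sum_{j=1}^{n} |\phi_j(\tau; p) - \phi_j(\tau)| =
\begin{cases}
O\left(\dfrac{1}{p^K}\right) & \tau \leq p \\[2mm]
O\left(\dfrac{1}{p^K |\tau - p|^K}\right) & \tau > p.
\end{cases}
\]

PROOF. If $\tau < p$
\[
\sum_{j=1}^{n} |\phi_j(\tau; f_p) - \phi_j(\tau, f)| = \sum_{j=1}^{n} \Big| \sum_{s=1}^{\infty} \phi_{j+s}(f_p)\psi_{\tau-s}(f_p) - \sum_{s=1}^{\infty} \phi_{j+s}(f)\psi_{\tau-s}(f) \Big| = O\left(\frac{1}{p^K}\right).
\]
If $\tau > p$
\[
\sum_{j=1}^{n} |\phi_j(\tau; f_p) - \phi_j(\tau, f)| = \sum_{j=1}^{n} \Big| \sum_{s=1}^{\infty} \phi_{j+s}(f_p)\psi_{\tau-s}(f_p) - \sum_{s=1}^{\infty} \phi_{j+s}(f)\psi_{\tau-s}(f) \Big| = O\left(\frac{1}{p^K |\tau - p|^K}\right).
\]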

7.8 Forecasting for ARMA processes


Given the autocovariance of any stationary process the Levinson-Durbin algorithm allows us to
systematically obtain one-step predictors of second order stationary time series without directly
inverting a matrix. In this section we consider the special case of ARMA(p, q) models where the
ARMA coefficients are known.
For AR(p) models prediction is especially easy, if the number of observations in the finite past,
t, is such that p ≤ t. For 1 ≤ t ≤ p one would use the Durbin-Levinson algorithm and for t > p we
use

\[
X_{t+1|t} = \sum_{j=1}^{p} \phi_j X_{t+1-j}.
\]

For ARMA(p, q) models prediction is not so straightforward, but we show below some simple
approximations can be made.
We recall that a causal invertible ARMA(p, q) has the representation
\[
X_{t+1} = \sum_{j=1}^{p} \phi_j X_{t+1-j} + \sum_{i=1}^{q} \theta_i \varepsilon_{t+1-i} + \varepsilon_{t+1}.
\]
Then if the infinite past were observed, by using equation (7.4) and the AR(∞) and MA(∞) representations of the ARMA model, the best linear predictor is
\begin{align*}
X_t(1) &= \sum_{j=1}^{\infty} \psi_j \varepsilon_{t+1-j} \\
&= \sum_{j=1}^{\infty} a_j X_{t+1-j}
\end{align*}
where $\{\psi_j\}$ and $\{a_j\}$ are the MA(∞) and AR(∞) coefficients respectively. The above representation does not explicitly use the ARMA representation. However, since $\varepsilon_{t-j} = X_{t-j} - X_{t-j-1}(1)$ it is easily seen that an alternative representation is
\[
X_t(1) = \sum_{j=1}^{p} \phi_j X_{t+1-j} + \sum_{i=1}^{q} \theta_i \left(X_{t+1-i} - X_{t-i}(1)\right).
\]

However, for finite predictors the actual one-step ahead prediction formula is not so simple. It can be shown that for $t \geq \max(p, q)$
\[
X_{t+1|t} = \sum_{j=1}^{p} \phi_j X_{t+1-j} + \sum_{i=1}^{q} \theta_{t,i}\left(X_{t+1-i} - X_{t+1-i|t-i}\right), \tag{7.16}
\]
where the coefficients $\theta_{t,i}$ can be evaluated from the autocovariance structure of the MA process. A proof is given in the appendix. It can be shown that $\theta_{t,i} \to \theta_i$ as $t \to \infty$ (see Brockwell and Davis (1998), Chapter 5).
The prediction can be simplified if we make a simple approximation (which works well if t is relatively large). For $1 \leq t \leq \max(p, q)$, set $\hat X_{t+1|t} = X_t$ and for $t > \max(p, q)$ we define the recursion
\[
\hat X_{t+1|t} = \sum_{j=1}^{p} \phi_j X_{t+1-j} + \sum_{i=1}^{q} \theta_i\left(X_{t+1-i} - \hat X_{t+1-i|t-i}\right). \tag{7.17}
\]

This approximation seems plausible, since in the exact predictor (7.16), $\theta_{t,i} \to \theta_i$. By iterating backwards, we can show that
\[
\hat X_{t+1|t} = \underbrace{\sum_{j=1}^{t-\max(p,q)} a_j X_{t+1-j}}_{\text{first part of AR}(\infty)\text{ expansion}} + \sum_{j=1}^{\max(p,q)} b_j X_j \tag{7.18}
\]
where $|b_j| \leq C\rho^t$, with $1/(1+\delta) < \rho < 1$ and the roots of $\theta(z)$ are greater in absolute value than $1+\delta$. On the other hand, the infinite predictor is
\[
X_t(1) = \sum_{j=1}^{\infty} a_j X_{t+1-j} \qquad \Big(\text{since } X_{t+1} = \sum_{j=1}^{\infty} a_j X_{t+1-j} + \varepsilon_{t+1}\Big).
\]

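Remark 7.8.1 We prove (7.18) for the MA(1) model $X_t = \theta\varepsilon_{t-1} + \varepsilon_t$. The estimated predictor is
\begin{align*}
\hat X_{t|t-1} &= \theta\left(X_{t-1} - \hat X_{t-1|t-2}\right) \\
\Rightarrow \quad X_t - \hat X_{t|t-1} &= -\theta\left(X_{t-1} - \hat X_{t-1|t-2}\right) + X_t \\
&= \sum_{j=0}^{t-2} (-\theta)^j X_{t-j} + (-\theta)^{t-1}\left(X_1 - \hat X_{1|0}\right).
\end{align*}
On the other hand, the infinite predictor satisfies the same recursion without an initial value, since $X_{t-1}(1) = \theta\varepsilon_{t-1} = \theta\left(X_{t-1} - X_{t-2}(1)\right)$. Iterating backwards indefinitely gives
\[
X_t - X_{t-1}(1) = \varepsilon_t = \sum_{j=0}^{\infty} (-\theta)^j X_{t-j}.
\]
Thus the two predictors share the first $t-1$ terms of the AR(∞) expansion and differ only through terms with geometrically decaying weights.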

In summary, we have three one-step ahead predictors. The finite past best linear predictor:
\[
X_{t+1|t} = \sum_{j=1}^{p} \phi_j X_{t+1-j} + \sum_{i=1}^{q} \theta_{t,i}\left(X_{t+1-i} - X_{t+1-i|t-i}\right) = \sum_{s=1}^{t} \phi_{t,s} X_{t+1-s}. \tag{7.19}
\]
The infinite past predictor:
\[
X_t(1) = \sum_{j=1}^{p} \phi_j X_{t+1-j} + \sum_{i=1}^{q} \theta_i\left(X_{t+1-i} - X_{t-i}(1)\right) = \sum_{s=1}^{\infty} a_s X_{t+1-s} \tag{7.20}
\]
and the approximate finite predictor:
\[
\hat X_{t+1|t} = \sum_{j=1}^{p} \phi_j X_{t+1-j} + \sum_{i=1}^{q} \theta_i\left(X_{t+1-i} - \hat X_{t+1-i|t-i}\right) = \sum_{s=1}^{t} a_s X_{t+1-s} + \sum_{s=1}^{\max(p,q)} b_s X_s. \tag{7.21}
\]
These predictors will be very useful in deriving the approximate Gaussian likelihood for the ARMA model, see Section 9.2.2. We give a bound for the differences below.

Proposition 7.8.1 Suppose $\{X_t\}$ is an ARMA process where the roots of $\phi(z)$ and $\theta(z)$ are greater in absolute value than $1+\delta$. Let $X_{t+1|t}$, $X_t(1)$ and $\hat X_{t+1|t}$ be defined as in (7.19), (7.20) and (7.21) respectively. Then
\[
E[\hat X_{t+1|t} - X_t(1)]^2 \leq K\rho^t, \tag{7.22}
\]
\[
E[X_{t+1|t} - X_t(1)]^2 \leq K\rho^t \tag{7.23}
\]
and
\[
E[X_{t+1} - X_{t+1|t}]^2 - \sigma^2 \leq K\rho^t \tag{7.24}
\]
for any $\frac{1}{1+\delta} < \rho < 1$ and $\mathrm{var}(\varepsilon_t) = \sigma^2$.
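The approximate finite predictor is simple to compute because it only needs the ARMA parameters and the previous one-step prediction errors. The following is a minimal sketch of the recursion (7.17) with its initial rule $\hat X_{t+1|t} = X_t$ for $t \leq \max(p, q)$; the function name `approx_arma_predictor` and the list-based interface are assumptions, and the ARMA parameters are taken as known.

```python
def approx_arma_predictor(x, phi, theta):
    """Approximate one-step ahead predictors following recursion (7.17).

    x     : observations [X_1, ..., X_t]
    phi   : AR parameters [phi_1, ..., phi_p] (may be empty)
    theta : MA parameters [theta_1, ..., theta_q] (may be empty)
    Returns [Xhat_{2|1}, Xhat_{3|2}, ..., Xhat_{t+1|t}].
    """
    p, q = len(phi), len(theta)
    m = max(p, q)
    pred = {}                                     # pred[s] holds Xhat_{s|s-1}
    for t in range(1, len(x) + 1):
        if t <= m:
            pred[t + 1] = x[t - 1]                # initial rule: Xhat_{t+1|t} = X_t
        else:
            # X_{t+1-j} is stored at list position t-j (the list is 0-based)
            ar = sum(phi[j - 1] * x[t - j] for j in range(1, p + 1))
            ma = sum(theta[i - 1] * (x[t - i] - pred[t + 1 - i]) for i in range(1, q + 1))
            pred[t + 1] = ar + ma
    return [pred[s] for s in range(2, len(x) + 2)]


# Illustrative MA(1) with theta = 0.5 and three observations.
print(approx_arma_predictor([1.0, 2.0, 3.0], [], [0.5]))
```

The arbitrary initial values enter only through the early prediction errors, whose weights decay geometrically, which is the content of (7.22).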

7.9 ARMA models and the Kalman filter

7.9.1 The Kalman filter


The Kalman filter can be used to define a variant of the estimated predictor $\hat X_t(1)$ described in (7.21). The Kalman filter construction is based on the state space equation
\[
X_t = F X_{t-1} + V_t
\]
where $\{X_t\}_t$ is an unobserved time series, F is a known matrix, $\mathrm{var}[V_t] = Q$ and $\{V_t\}_t$ are independent random variables that are independent of $X_{t-1}$. The observation equation is
\[
Y_t = H X_t + W_t
\]
where $\{Y_t\}_t$ is the observed time series, $\mathrm{var}[W_t] = R$, and $\{W_t\}_t$ are independent random variables that are independent of $X_t$. Moreover $\{V_t\}_t$ and $\{W_t\}_t$ are jointly independent. The parameters can be made time-dependent, but this makes the derivations notationally more cumbersome.
The standard notation is to let $\hat X_{t+1|t} = P_{Y_1,\ldots,Y_t}(X_{t+1})$ and $P_{t+1|t} = \mathrm{var}[X_{t+1} - \hat X_{t+1|t}]$ (predictive) and $\hat X_{t+1|t+1} = P_{Y_1,\ldots,Y_{t+1}}(X_{t+1})$ and $P_{t+1|t+1} = \mathrm{var}[X_{t+1} - \hat X_{t+1|t+1}]$ (update). The Kalman filter is an elegant method that iterates between the prediction steps $\hat X_{t+1|t}$ and $P_{t+1|t}$ and the update steps $\hat X_{t+1|t+1}$ and $P_{t+1|t+1}$. A proof is given at the end of the chapter. We summarise the algorithm below:

The Kalman equations

(i) Prediction step The conditional expectation
\[
\hat X_{t+1|t} = F \hat X_{t|t}
\]
and the corresponding mean squared error
\[
P_{t+1|t} = F P_{t|t} F^{*} + Q.
\]

(ii) Update step The conditional expectation
\[
\hat X_{t+1|t+1} = \hat X_{t+1|t} + K_{t+1}\left(Y_{t+1} - H\hat X_{t+1|t}\right)
\]
(note the appearance of $Y_{t+1}$, this is where the observed data plays a role in the prediction) where
\[
K_{t+1} = P_{t+1|t} H^{*}\left[H P_{t+1|t} H^{*} + R\right]^{-1}
\]
and the corresponding mean squared error
\[
P_{t+1|t+1} = P_{t+1|t} - K_{t+1} H P_{t+1|t} = (I - K_{t+1}H)P_{t+1|t}.
\]

(iii) There is also a smoothing step (which we ignore for now).

Thus we observe that if we can write a model in the above notation, then the predictors can be
recursively updated. It is worth mentioning that in order to initiate the algorithm the initial values
X0|0 and P0|0 are required.
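To make the iteration concrete, the sketch below implements the prediction and update steps in the simplest setting. It is an illustration under strong assumptions: the state and the observation are both scalars (so F, H, Q, R are numbers and the matrix inverse becomes a division), and the function name `kalman_filter` is hypothetical.

```python
def kalman_filter(y, F, H, Q, R, x0, P0):
    """Scalar Kalman filter for X_t = F X_{t-1} + V_t, Y_t = H X_t + W_t.

    y      : observations [Y_1, ..., Y_n]
    x0, P0 : initial values for the state estimate and its mean squared error
    Returns a list of tuples (x_pred, P_pred, x_upd, P_upd), one per observation.
    """
    x_upd, P_upd = x0, P0
    out = []
    for obs in y:
        # (i) prediction step
        x_pred = F * x_upd
        P_pred = F * P_upd * F + Q
        # (ii) update step: the new observation enters through the gain K
        K = P_pred * H / (H * P_pred * H + R)
        x_upd = x_pred + K * (obs - H * x_pred)
        P_upd = (1.0 - K * H) * P_pred
        out.append((x_pred, P_pred, x_upd, P_upd))
    return out


# Illustrative run: a slowly varying state observed with noise.
for step in kalman_filter([1.2, 0.9, 1.1], F=0.9, H=1.0, Q=0.1, R=0.5, x0=0.0, P0=1.0):
    print(step)
```

The mean squared errors `P_pred` and `P_upd` do not depend on the data, only on F, H, Q, R and the initial value, so they could be computed in advance.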

7.9.2 The state space (Markov) representation of the ARMA model
There is no unique state-space representation of the ARMA model. We give below the elegant construction proposed in Akaike (1977) and expanded on in Jones (1980). This construction can be used in prediction (via the Kalman filter) and to estimate the parameters via the likelihood (but keep in mind initial conditions do matter). The construction is based on the best linear predictor of the infinite past.
We will assume $\{X_t\}$ has a causal ARMA(p, q) representation where
\[
X_t = \sum_{j=1}^{p} \phi_j X_{t-j} + \sum_{i=1}^{q} \theta_i \varepsilon_{t-i} + \varepsilon_t.
\]

We now obtain a Markov-type representation of the above. It is based on best linear predictors given the infinite past. Let
\[
X(t+r|t) = P_{X_t, X_{t-1}, \ldots}(X_{t+r}),
\]
where we recall that previously we used the notation $X_t(r) = X(t+r|t)$. The reason we change notation is to keep track of the time stamps. To obtain the representation we use that the ARMA model has the MA(∞) representation
\[
X_t = \sum_{j=0}^{\infty} \psi_j \varepsilon_{t-j}
\]
where $\psi_0 = 1$. The MA(∞) coefficients can be derived from the ARMA parameters using the recursion
\[
\psi_j = \theta_j + \sum_{k=1}^{\min(j,p)} \phi_k \psi_{j-k} \qquad \text{for } j \geq 1,
\]
setting the initial value $\psi_0 = 1$ and using the convention $\theta_j = 0$ for $j > q$. Since $X(t+r|t)$ is the best linear predictor given the infinite past, by using the results from Section 7.4 we have
\begin{align*}
X(t+r|t) &= P_{X_t, X_{t-1}, \ldots}(X_{t+r}) = \sum_{j=r}^{\infty} \psi_j \varepsilon_{t+r-j} \\
X(t+r|t+1) &= P_{X_{t+1}, X_t, X_{t-1}, \ldots}(X_{t+r}) = \sum_{j=r-1}^{\infty} \psi_j \varepsilon_{t+r-j}.
\end{align*}

Thus taking differences we have

X(t + r|t + 1) − X(t + r|t) = ψr−1 εt+1 .

Rewriting the above gives

X(t + r|t + 1) = X(t + r|t) + ψr−1 εt+1 . (7.25)

The simplest example of the above is $X_{t+r} = X(t+r|t+r) = X(t+r|t+r-1) + \varepsilon_{t+r}$. Based on (7.25) we have
\[
\begin{pmatrix}
X(t+1|t+1) \\ X(t+2|t+1) \\ X(t+3|t+1) \\ \vdots \\ X(t+r-1|t+1) \\ X(t+r|t+1)
\end{pmatrix}
=
\begin{pmatrix}
0 & 1 & 0 & \ldots & 0 & 0 \\
0 & 0 & 1 & \ldots & 0 & 0 \\
0 & 0 & 0 & \ldots & 0 & 0 \\
\vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\
0 & 0 & 0 & \ldots & 0 & 1 \\
? & ? & ? & \ldots & ? & ?
\end{pmatrix}
\begin{pmatrix}
X(t|t) \\ X(t+1|t) \\ X(t+2|t) \\ \vdots \\ X(t+r-2|t) \\ X(t+r-1|t)
\end{pmatrix}
+ \varepsilon_{t+1}
\begin{pmatrix}
1 \\ \psi_1 \\ \psi_2 \\ \vdots \\ \psi_{r-2} \\ \psi_{r-1}
\end{pmatrix}.
\]

The important observation is that the two vectors on the RHS of the above are independent, which
is getting us towards a state space representation.
How to choose r in this representation and what are the ?s. Studying the last line in the above
vector equation we note that

X(t + r|t + 1) = X(t + r|t) + ψr−1 εt+1 ,

however X(t + r|t) is not explicitly in the vector. Instead we need to find a linear combination of
X(t|t), . . . , X(t + r − 1|t) which gives X(t + r|t). To do this we return to the ARMA representation

\[
X_{t+r} = \sum_{j=1}^{p} \phi_j X_{t+r-j} + \sum_{i=1}^{q} \theta_i \varepsilon_{t+r-i} + \varepsilon_{t+r}.
\]

The next part gets a little messy (you may want to look at Akaike or Jones for a better explanation).

Suppose that $r > q$, specifically let $r = q + 1$, then
\begin{align*}
P_{X_t, X_{t-1}, \ldots}(X_{t+r}) &= \sum_{j=1}^{p} \phi_j P_{X_t, X_{t-1}, \ldots}(X_{t+r-j}) + \underbrace{\sum_{i=1}^{q} \theta_i P_{X_t, X_{t-1}, \ldots}(\varepsilon_{t+r-i}) + P_{X_t, X_{t-1}, \ldots}(\varepsilon_{t+r})}_{\text{since } r > q \text{ this is } = 0} \\
&= \sum_{j=1}^{p} \phi_j P_{X_t, X_{t-1}, \ldots}(X_{t+r-j}).
\end{align*}
If $p < q + 1$, then the above reduces to
\[
X(t+r|t) = \sum_{j=1}^{p} \phi_j X(t+r-j|t).
\]
If, on the other hand, $p > r$, then
\[
X(t+r|t) = \sum_{j=1}^{r} \phi_j X(t+r-j|t) + \sum_{j=r+1}^{p} \phi_j X_{t+r-j},
\]
where the second sum involves observations strictly before time t. Building these past observations from $\{X(t|t), \ldots, X(t+r-1|t)\}$ seems unlikely (it can probably be proved that it is not possible, but a proof escapes me for now). Thus, we choose $r \geq \max(p, q+1)$ (which will then give everything in terms of the predictors). This choice gives
\[
P_{X_t, X_{t-1}, \ldots}(X_{t+r}) = \sum_{j=1}^{p} \phi_j X(t+r-j|t).
\]
This allows us to construct the recursion equations for any $r \geq \max(p, q+1)$ by using the above to build the last row of the matrix. For simplicity we set $r = m = \max(p, q+1)$. If $p < \max(p, q+1)$, then for $p+1 \leq j \leq m$ set $\phi_j = 0$. Define the recursion
\[
\begin{pmatrix}
X(t+1|t+1) \\ X(t+2|t+1) \\ X(t+3|t+1) \\ \vdots \\ X(t+m-1|t+1) \\ X(t+m|t+1)
\end{pmatrix}
=
\begin{pmatrix}
0 & 1 & 0 & \ldots & 0 & 0 \\
0 & 0 & 1 & \ldots & 0 & 0 \\
0 & 0 & 0 & \ldots & 0 & 0 \\
\vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\
0 & 0 & 0 & \ldots & 0 & 1 \\
\phi_m & \phi_{m-1} & \phi_{m-2} & \ldots & \phi_2 & \phi_1
\end{pmatrix}
\begin{pmatrix}
X(t|t) \\ X(t+1|t) \\ X(t+2|t) \\ \vdots \\ X(t+m-2|t) \\ X(t+m-1|t)
\end{pmatrix}
+ \varepsilon_{t+1}
\begin{pmatrix}
1 \\ \psi_1 \\ \psi_2 \\ \vdots \\ \psi_{m-2} \\ \psi_{m-1}
\end{pmatrix}.
\]

Let $\underline{Z}_t = (X(t|t), \ldots, X(t+m-1|t))$, and observe that $\underline{Z}_t$ is independent of $\varepsilon_{t+1}$. This yields the state space equation
\[
\underline{Z}_{t+1} = F\underline{Z}_t + \underline{V}_{t+1}
\]
where F is the matrix defined above and $\underline{V}_{t+1}' = \varepsilon_{t+1}(1, \psi_1, \ldots, \psi_{m-1}) = \varepsilon_{t+1}\underline{\psi}_m'$. By forward iterating
\[
\underline{Z}_{t+1} = F\underline{Z}_t + \underline{V}_{t+1}, \qquad t \in \mathbb{Z},
\]
from $t = -\infty$ the top entry of $\underline{Z}_t$ gives a stationary solution of the ARMA model. Of course in practice, we cannot start at $t = -\infty$ and instead start at $t = 0$, thus the initial conditions will play a role (and the solution won't precisely follow a stationary ARMA).
The observation model is
\[
Y_{t+1} = (1, 0, \ldots, 0)\underline{Z}_{t+1},
\]
where we note that $Y_{t+1} = X_{t+1}$. Thus we set $Y_t = X_t$ (where $X_t$ is the observed time series).
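The construction can be checked by simulation. The sketch below is illustrative and rests on stated assumptions: the helper names (`arma_psi`, `arma_state_space`, `simulate_state_space`, `simulate_arma`) are hypothetical, the MA(∞) weights are computed with the standard recursion $\psi_j = \theta_j + \sum_{k=1}^{\min(j,p)} \phi_k\psi_{j-k}$, and both recursions are started from zero initial conditions rather than from the stationary distribution. Driven by the same innovations, the top entry of the state vector should reproduce the ARMA recursion.

```python
def arma_psi(phi, theta, n):
    """First n MA(infinity) weights [psi_0, ..., psi_{n-1}] of an ARMA(p, q)."""
    psi = [1.0]
    for j in range(1, n):
        th = theta[j - 1] if j <= len(theta) else 0.0
        psi.append(th + sum(phi[k - 1] * psi[j - k] for k in range(1, min(j, len(phi)) + 1)))
    return psi


def arma_state_space(phi, theta):
    """Return (F, psi_m) with m = max(p, q + 1); the last row of F is (phi_m, ..., phi_1)."""
    p, q = len(phi), len(theta)
    m = max(p, q + 1)
    phi_pad = list(phi) + [0.0] * (m - p)          # phi_j = 0 for j > p
    F = [[1.0 if j == i + 1 else 0.0 for j in range(m)] for i in range(m - 1)]
    F.append([phi_pad[m - 1 - j] for j in range(m)])
    return F, arma_psi(phi, theta, m)


def simulate_state_space(phi, theta, eps):
    """Iterate Z_{t+1} = F Z_t + eps_{t+1} psi_m from Z_0 = 0; return the top entries."""
    F, psi = arma_state_space(phi, theta)
    m = len(psi)
    z, xs = [0.0] * m, []
    for e in eps:
        z = [sum(F[i][j] * z[j] for j in range(m)) + e * psi[i] for i in range(m)]
        xs.append(z[0])
    return xs


def simulate_arma(phi, theta, eps):
    """Direct ARMA recursion with zero pre-sample values, for comparison."""
    xs = []
    for t, e in enumerate(eps):
        ar = sum(phi[j] * xs[t - 1 - j] for j in range(len(phi)) if t - 1 - j >= 0)
        ma = sum(theta[i] * eps[t - 1 - i] for i in range(len(theta)) if t - 1 - i >= 0)
        xs.append(ar + ma + e)
    return xs


eps = [1.0, -0.5, 0.3, 0.0, 2.0, -1.0]
print(simulate_state_space([0.5, -0.2], [0.4], eps))
print(simulate_arma([0.5, -0.2], [0.4], eps))
```

The agreement relies on $m \geq \max(p, q+1)$: for lags $k \geq m$ the weights satisfy $\psi_k = \sum_{j=1}^{m} \phi_j\psi_{k-j}$, which is exactly what the last row of F encodes.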

7.9.3 Prediction using the Kalman filter


We use the Kalman filter described above where we set Q = var(εt )ψ 0m ψ m , R = 0, H = (1, 0, . . . , 0).
This gives The Kalman equations

(1) Start with an initial value Z 0|0 . This part is where the approximation comes into play sinceY0
is not observed. Typically a vectors of zeros are imputted for Z 0|0 and recommendations for
P0|0 are given in given in Jones (1980) and Akaiki (1978). Then for t > 0 iterate on steps (2)
and (3) below.

(2) Prediction step

Zbt+1|t = F Z
bt|t

and the corresponding mean squared error

Pt+1|t = F Pt|t F ∗ + Q.

219
(3) Update step The conditional expectation

 
Z bt+1|t + Kt+1 Yt+1 − H Zbt+1|t .
bt+1|t+1 = Z

where

Pt+1|t H ∗
Kt+1 =
HPt+1|t H ∗

and the corresponding mean squared error

Pt+1|t+1 = Pt+1|t − Kt HPt+1|t = (I − Kt H)Pt+1|t .

$\widehat{Z}_{t+1|t}$ will contain the linear predictors of $X_{t+1}, \ldots, X_{t+m}$ given $X_1, \ldots, X_t$. They are "almost" the best linear predictors, but as in Section 7.8 the initial value plays a role (which is why it is only approximately the best linear predictor). Since we do not observe the infinite past we do not know $Z_{0|0}$ (which is set to zero). The only way this can be exactly the best linear predictor is if $Z_{0|0}$ were known, which it is not. Thus the approximate one-step ahead predictor is

\[ X_{t+1|t} \approx [\widehat{Z}_{t+1|t}]_{(1)} \approx \sum_{j=1}^{t} a_j X_{t+1-j}, \]

where $\{a_j\}_{j=1}^{\infty}$ are the coefficients of the AR($\infty$) expansion corresponding to the ARMA model.

The approximate $r$-step ahead predictor is $[\widehat{Z}_{t+1|t}]_{(r)}$ (if $r \leq m$).
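The recursions above can be sketched for an ARMA(1,1), where $m = 2$, $F$ has rows $(0, 1)$ and $(0, \phi_1)$, and $\psi_2 = (1, \phi_1 + \theta_1)'$. This is a minimal illustration in plain Python (not the R used elsewhere in these notes); the function name and the choice $P_{0|0} = 0$ are assumptions made only for this sketch.

```python
def kalman_arma11_predictions(x, phi, theta, sigma2=1.0):
    """One-step predictors [Z_{t+1|t}]_(1) for an ARMA(1,1) via the Kalman recursions.

    State Z_t = (X(t|t), X(t+1|t))', H = (1, 0), R = 0, Q = sigma2 * psi psi'.
    Assumption of this sketch: Z_{0|0} = 0 and P_{0|0} = 0.
    Element t of the result predicts x[t] from x[0], ..., x[t-1].
    """
    psi = [1.0, phi + theta]
    F = [[0.0, 1.0], [0.0, phi]]
    Q = [[sigma2 * psi[i] * psi[j] for j in range(2)] for i in range(2)]
    z = [0.0, 0.0]
    P = [[0.0, 0.0], [0.0, 0.0]]
    preds = []
    for obs in x:
        # Prediction step: Z_{t+1|t} = F Z_{t|t},  P_{t+1|t} = F P_{t|t} F' + Q
        z = [F[i][0] * z[0] + F[i][1] * z[1] for i in range(2)]
        FP = [[sum(F[i][k] * P[k][j] for k in range(2)) for j in range(2)]
              for i in range(2)]
        P = [[sum(FP[i][k] * F[j][k] for k in range(2)) + Q[i][j]
              for j in range(2)] for i in range(2)]
        preds.append(z[0])
        # Update step with H = (1, 0), R = 0:  K = P H' / (H P H')
        s = P[0][0]
        K = [P[0][0] / s, P[1][0] / s]
        innovation = obs - z[0]
        z = [z[i] + K[i] * innovation for i in range(2)]
        # P_{t+1|t+1} = (I - K H) P_{t+1|t}
        P = [[P[i][j] - K[i] * P[0][j] for j in range(2)] for i in range(2)]
    return preds
```

With $P_{0|0} = 0$ the gain stays equal to $\psi_2$, and the output coincides with the recursion $\widehat{X}_{t+1|t} = \phi_1 X_t + \theta_1(X_t - \widehat{X}_{t|t-1})$ started at zero; other choices of $P_{0|0}$ change only the early predictions.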

7.10 Forecasting for nonlinear models (advanced)


In this section we consider forecasting for nonlinear models. The forecasts we construct may not formally be the best linear predictor, because the best linear predictor is based on minimising the mean squared error, which, as we recall from Chapter 13, requires the existence of higher order moments. Instead our forecast will be the conditional expectation of $X_{t+1}$ given the past (note that we can think of it as the best linear predictor). Furthermore, with the exception of the ARCH model, we will derive approximations of the conditional expectation/best linear predictor, analogous to the forecasting approximation for the ARMA model, $\widehat{X}_{t+1|t}$ (given in (7.17)).

7.10.1 Forecasting volatility using an ARCH(p) model
We recall the ARCH(p) model defined in Section 13.2

\[ X_t = \sigma_t Z_t, \qquad \sigma_t^2 = a_0 + \sum_{j=1}^{p} a_j X_{t-j}^2. \]

Using a similar calculation to those given in Section 13.2.1, we see that

\[
\begin{aligned}
E[X_{t+1}|X_t, X_{t-1}, \ldots, X_{t-p+1}] &= E(Z_{t+1}\sigma_{t+1}|X_t, X_{t-1}, \ldots, X_{t-p+1}) \\
&= \sigma_{t+1}E(Z_{t+1}|X_t, X_{t-1}, \ldots, X_{t-p+1}) \qquad (\sigma_{t+1} \text{ is a function of } X_t, \ldots, X_{t-p+1}) \\
&= \sigma_{t+1}E(Z_{t+1}) \qquad (\text{by causality}) \\
&= 0 \cdot \sigma_{t+1} = 0.
\end{aligned}
\]

In other words, past values of $X_t$ have no influence on the expected value of $X_{t+1}$. On the other hand, in Section 13.2.1 we showed that

\[ E(X_{t+1}^2|X_t, X_{t-1}, \ldots, X_{t-p+1}) = E(Z_{t+1}^2\sigma_{t+1}^2|X_t, X_{t-1}, \ldots, X_{t-p+1}) = \sigma_{t+1}^2E[Z_{t+1}^2] = \sigma_{t+1}^2 = a_0 + \sum_{j=1}^{p} a_j X_{t+1-j}^2, \]

thus $X_t$ has an influence on the conditional mean square/variance. Therefore, if we let $X_{t+k|t}^2$ denote the conditional variance of $X_{t+k}$ given $X_t, \ldots, X_{t-p+1}$, it can be derived using the following recursion

\[
\begin{aligned}
X_{t+1|t}^2 &= a_0 + \sum_{j=1}^{p} a_j X_{t+1-j}^2, \\
X_{t+k|t}^2 &= a_0 + \sum_{j=k}^{p} a_j X_{t+k-j}^2 + \sum_{j=1}^{k-1} a_j X_{t+k-j|t}^2, \qquad 2 \leq k \leq p, \\
X_{t+k|t}^2 &= a_0 + \sum_{j=1}^{p} a_j X_{t+k-j|t}^2, \qquad k > p.
\end{aligned}
\]
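The recursion can be sketched in a few lines of Python (a hedged illustration; the function name is hypothetical and the coefficients are assumed known rather than estimated). Observed squares are used while they are available and earlier forecasts are substituted once the horizon passes the last observation; the intercept $a_0$ from the definition of $\sigma_t^2$ is included at every step.

```python
def arch_variance_forecasts(x, a0, a, horizon):
    """Forecasts X^2_{t+k|t} for k = 1, ..., horizon from an ARCH(p) model.

    x       : observed series, x[-1] is X_t (needs at least p values)
    a0, a   : intercept and list [a_1, ..., a_p], assumed known in this sketch
    """
    p = len(a)
    # Start from the last p observed squares; forecasts are appended as we go.
    squares = [v * v for v in x[-p:]]
    forecasts = []
    for _ in range(horizon):
        # a_j multiplies the value j steps back (observed square or earlier forecast)
        f = a0 + sum(a[j] * squares[-1 - j] for j in range(p))
        forecasts.append(f)
        squares.append(f)
    return forecasts
```

For an ARCH(1) with $a_0 = 1$, $a_1 = 0.5$ and $X_t = 2$, the forecasts move from 3 towards the unconditional variance $a_0/(1 - a_1) = 2$.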

7.10.2 Forecasting volatility using a GARCH(1, 1) model


We recall the GARCH(1, 1) model defined in Section 13.3

\[ \sigma_t^2 = a_0 + a_1X_{t-1}^2 + b_1\sigma_{t-1}^2 = \left(a_1Z_{t-1}^2 + b_1\right)\sigma_{t-1}^2 + a_0. \]

Similar to the ARCH model it is straightforward to show that $E[X_{t+1}|X_t, X_{t-1}, \ldots] = 0$ (where we use the notation $X_t, X_{t-1}, \ldots$ to denote the infinite past, or more precisely conditioning on the sigma algebra $\mathcal{F}_t = \sigma(X_t, X_{t-1}, \ldots)$). Therefore, like the ARCH process, our aim is to predict $X_t^2$.
We recall from Example 13.3.1 that if the GARCH process is invertible (satisfied if $b_1 < 1$), then

\[ E[X_{t+1}^2|X_t, X_{t-1}, \ldots] = \sigma_{t+1}^2 = a_0 + a_1X_t^2 + b_1\sigma_t^2 = \frac{a_0}{1-b_1} + a_1\sum_{j=0}^{\infty} b_1^jX_{t-j}^2. \tag{7.26} \]

Of course, in reality we only observe the finite past $X_t, X_{t-1}, \ldots, X_1$. We can approximate $E[X_{t+1}^2|X_t, X_{t-1}, \ldots, X_1]$ using the following recursion: set $\widehat{\sigma}_{1|0}^2 = 0$, then for $t \geq 1$ let

\[ \widehat{\sigma}_{t+1|t}^2 = a_0 + a_1X_t^2 + b_1\widehat{\sigma}_{t|t-1}^2 \]

(noting that this is similar in spirit to the recursive approximate one-step ahead predictor defined in (7.18)). It is straightforward to show that

\[ \widehat{\sigma}_{t+1|t}^2 = \frac{a_0(1-b_1^{t})}{1-b_1} + a_1\sum_{j=0}^{t-1} b_1^jX_{t-j}^2, \]

taking note that this is not the same as $E[X_{t+1}^2|X_t, \ldots, X_1]$ (if the mean squared error existed, $E[X_{t+1}^2|X_t, \ldots, X_1]$ would give a smaller mean squared error), but just like the ARMA process it will closely approximate it. Furthermore, from (7.26) it can be seen that $\widehat{\sigma}_{t+1|t}^2$ closely approximates $\sigma_{t+1}^2$.
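As a check on the algebra, the recursion started at $\widehat{\sigma}_{1|0}^2 = 0$ and its unrolled form can be compared numerically. The sketch below is plain Python with hypothetical function names and parameter values assumed known; unrolling $t$ steps of the recursion produces the geometric factor $(1 - b_1^{t})/(1 - b_1)$.

```python
def garch11_variance_recursion(x, a0, a1, b1):
    """Return [sigma^2_{2|1}, ..., sigma^2_{n+1|n}] using sigma^2_{1|0} = 0."""
    s2 = 0.0
    out = []
    for xt in x:
        s2 = a0 + a1 * xt * xt + b1 * s2
        out.append(s2)
    return out


def garch11_variance_unrolled(x, a0, a1, b1):
    """sigma^2_{t+1|t} at t = len(x), written as the unrolled sum."""
    t = len(x)
    geometric = a0 * (1.0 - b1 ** t) / (1.0 - b1)
    weighted = a1 * sum(b1 ** j * x[t - 1 - j] ** 2 for j in range(t))
    return geometric + weighted
```

In practice $a_0$, $a_1$ and $b_1$ would be replaced by estimates, and the forecast is then only as good as those estimates.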

Exercise 7.3 To answer this question you need R: run install.packages("tseries"), then remember library("tseries").

(i) You will find the Nasdaq data from 4th January 2010 - 15th October 2014 on my website.

(ii) By taking log differences, fit a GARCH(1,1) model to the daily closing data (ignore the adjusted closing value) from 4th January 2010 - 30th September 2014 (use the function garch(x, order = c(1, 1)) to fit the GARCH(1, 1) model).

(iii) Using the fitted GARCH(1, 1) model, forecast the volatility $\sigma_t^2$ from October 1st-15th (noting that no trading is done during the weekends). Denote these forecasts as $\sigma_{t|0}^2$. Evaluate $\sum_{t=1}^{11}\sigma_{t|0}^2$.

(iv) Compare this to the actual volatility $\sum_{t=1}^{11}X_t^2$ (where $X_t$ are the log differences).

7.10.3 Forecasting using a BL(1, 0, 1, 1) model


We recall the Bilinear(1, 0, 1, 1) model defined in Section 13.4

\[ X_t = \phi_1X_{t-1} + b_{1,1}X_{t-1}\varepsilon_{t-1} + \varepsilon_t. \]

Assuming invertibility, so that $\varepsilon_t$ can be written in terms of $X_t$ (see Remark 13.4.2):

\[ \varepsilon_t = \sum_{j=0}^{\infty}(-b_{1,1})^j\left(\prod_{i=0}^{j-1}X_{t-1-i}\right)[X_{t-j} - \phi_1X_{t-j-1}], \]

it can be shown that

\[ X_t(1) = E[X_{t+1}|X_t, X_{t-1}, \ldots] = \phi_1X_t + b_{1,1}X_t\varepsilon_t. \]

However, just as in the ARMA and GARCH case we can obtain an approximation, by setting $\widehat{X}_{1|0} = 0$ and for $t \geq 1$ defining the recursion

\[ \widehat{X}_{t+1|t} = \phi_1X_t + b_{1,1}X_t\left(X_t - \widehat{X}_{t|t-1}\right). \]

See ? and ? for further details.
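A small simulation can illustrate how the recursive predictor behaves. The sketch below is plain Python; the function names, the standard normal innovations and the parameter values in the usage note are all assumptions made for illustration. It simulates the model (so that the innovations, and hence $X_t(1)$, are known) and then runs the recursion on a stretch of the series whose past has been discarded.

```python
import random


def simulate_bilinear(n, phi1, b11, seed=0):
    """Simulate X_t = phi1 X_{t-1} + b11 X_{t-1} e_{t-1} + e_t with N(0,1) innovations."""
    rng = random.Random(seed)
    x_prev, e_prev = 0.0, 0.0
    xs, es = [], []
    for _ in range(n):
        e = rng.gauss(0.0, 1.0)
        x = phi1 * x_prev + b11 * x_prev * e_prev + e
        xs.append(x)
        es.append(e)
        x_prev, e_prev = x, e
    return xs, es


def bilinear_predictors(x, phi1, b11):
    """Approximate predictors: element t predicts the value after x[t].

    Starts from hat X_{1|0} = 0 and iterates
    hat X_{t+1|t} = phi1 X_t + b11 X_t (X_t - hat X_{t|t-1}).
    """
    prev = 0.0
    out = []
    for xt in x:
        prev = phi1 * xt + b11 * xt * (xt - prev)
        out.append(prev)
    return out
```

For example, with $\phi_1 = 0.5$ and $b_{1,1} = 0.2$, simulating 500 values, dropping the first 100 and applying `bilinear_predictors` to the rest gives a final predictor that is numerically indistinguishable from $\phi_1X_t + b_{1,1}X_t\varepsilon_t$ computed with the true innovation.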

Remark 7.10.1 (How well does $\widehat{X}_{t+1|t}$ approximate $X_t(1)$?) We now derive conditions for $\widehat{X}_{t+1|t}$ to be a close approximation of $X_t(1)$ when $t$ is large. We use a similar technique to that used in Remark 7.8.1.

We note that $X_{t+1} - X_t(1) = \varepsilon_{t+1}$ (since a future innovation, $\varepsilon_{t+1}$, cannot be predicted). We will show that $X_{t+1} - \widehat{X}_{t+1|t}$ is 'close' to $\varepsilon_{t+1}$. Writing $b = b_{1,1}$ and subtracting $\widehat{X}_{t+1|t}$ from $X_{t+1}$ gives the recursion

\[ X_{t+1} - \widehat{X}_{t+1|t} = -b(X_t - \widehat{X}_{t|t-1})X_t + (b\varepsilon_tX_t + \varepsilon_{t+1}). \tag{7.27} \]

We will compare the above recursion to the recursion based on $\varepsilon_{t+1}$. Rearranging the bilinear equation gives

\[ \varepsilon_{t+1} = -b\varepsilon_tX_t + \underbrace{(X_{t+1} - \phi_1X_t)}_{=b\varepsilon_tX_t + \varepsilon_{t+1}}. \tag{7.28} \]

We observe that (7.27) and (7.28) are almost the same difference equation; the only difference is that an initial value is set for $\widehat{X}_{1|0}$. This gives the difference between the two equations as

\[ \varepsilon_{t+1} - [X_{t+1} - \widehat{X}_{t+1|t}] = (-1)^tb^tX_1\prod_{j=1}^{t}\varepsilon_j + (-1)^tb^t[X_1 - \widehat{X}_{1|0}]\prod_{j=1}^{t}\varepsilon_j. \]

Thus if $b^t\prod_{j=1}^{t}\varepsilon_j \overset{a.s.}{\to} 0$ as $t \to \infty$, then $\widehat{X}_{t+1|t} \overset{P}{\to} X_t(1)$ as $t \to \infty$. We now show that if $E[\log|\varepsilon_t|] < -\log|b|$, then $b^t\prod_{j=1}^{t}\varepsilon_j \overset{a.s.}{\to} 0$. Since $b^t\prod_{j=1}^{t}\varepsilon_j$ is a product, it seems appropriate to take logarithms to transform it into a sum. To ensure that it is positive, we take absolutes and $t$-roots

\[ \log\Big|b^t\prod_{j=1}^{t}\varepsilon_j\Big|^{1/t} = \log|b| + \underbrace{\frac{1}{t}\sum_{j=1}^{t}\log|\varepsilon_j|}_{\text{average of iid random variables}}. \]

Therefore by using the law of large numbers we have

\[ \log\Big|b^t\prod_{j=1}^{t}\varepsilon_j\Big|^{1/t} = \log|b| + \frac{1}{t}\sum_{j=1}^{t}\log|\varepsilon_j| \overset{P}{\to} \log|b| + E\log|\varepsilon_0| = \gamma. \]

Thus we see that $|b^t\prod_{j=1}^{t}\varepsilon_j|^{1/t} \overset{a.s.}{\to} \exp(\gamma)$. In other words, $|b^t\prod_{j=1}^{t}\varepsilon_j| \approx \exp(t\gamma)$, which will only converge to zero if $E[\log|\varepsilon_t|] < -\log|b|$.

7.11 Nonparametric prediction (advanced)


In this section we briefly consider how prediction can be achieved in the nonparametric world. Let us assume that $\{X_t\}$ is a stationary time series. Our objective is to predict $X_{t+1}$ given the past. However, we don't want to make any assumptions about the nature of $\{X_t\}$. Instead we want to obtain a predictor of $X_{t+1}$ given $X_t$ which minimises the mean squared error, $E[X_{t+1} - g(X_t)]^2$. It is well known that this is the conditional expectation $E[X_{t+1}|X_t]$ (since $E[X_{t+1} - g(X_t)]^2 = E[X_{t+1} - E(X_{t+1}|X_t)]^2 + E[g(X_t) - E(X_{t+1}|X_t)]^2$). Therefore, one can estimate

\[ E[X_{t+1}|X_t = x] = m(x) \]

nonparametrically. A classical estimator of $m(x)$ is the Nadaraya-Watson estimator

\[ \widehat{m}_n(x) = \frac{\sum_{t=1}^{n-1}X_{t+1}K\left(\frac{x - X_t}{b}\right)}{\sum_{t=1}^{n-1}K\left(\frac{x - X_t}{b}\right)}, \]

where $K : \mathbb{R} \to \mathbb{R}$ is a kernel function (see Fan and Yao (2003), Chapter 5 and 6). Under some 'regularity conditions' it can be shown that $\widehat{m}_n(x)$ is a consistent estimator of $m(x)$ and converges to $m(x)$ in mean square (with the typical mean squared rate $O(b^4 + (bn)^{-1})$). The advantage of going the nonparametric route is that we have not imposed any form of structure on the process (such as linear/(G)ARCH/Bilinear). Therefore, we do not run the risk of misspecifying the model. A disadvantage is that nonparametric estimators tend to be a lot worse than parametric estimators (in Chapter ?? we show that parametric estimators have $O(n^{-1/2})$ convergence, which is faster than the nonparametric rate $O(b^2 + (bn)^{-1/2})$). Another possible disadvantage is that if we wanted to include more past values in the predictor, i.e. $m(x_1, \ldots, x_d) = E[X_{t+1}|X_t = x_1, \ldots, X_{t-d+1} = x_d]$, then the estimator will have an extremely poor rate of convergence (due to the curse of dimensionality).
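A minimal sketch of the Nadaraya-Watson predictor in plain Python follows. The Gaussian kernel is one possible choice of $K$ (an assumption here, not something the text prescribes), and the function name is hypothetical; the normalising constant of the kernel cancels between numerator and denominator.

```python
import math


def nadaraya_watson(x, series, bandwidth):
    """Estimate m(x) = E[X_{t+1} | X_t = x] from the pairs (X_t, X_{t+1})."""
    num = 0.0
    den = 0.0
    for t in range(len(series) - 1):
        u = (x - series[t]) / bandwidth
        w = math.exp(-0.5 * u * u)   # Gaussian kernel weight, constant omitted
        num += series[t + 1] * w
        den += w
    if den == 0.0:
        raise ValueError("no observations near x; increase the bandwidth")
    return num / den
```

A very small bandwidth returns (nearly) the successor of the observation closest to $x$, while a very large bandwidth returns the plain average of $X_2, \ldots, X_n$; the bias-variance trade-off in the rate $O(b^4 + (bn)^{-1})$ sits between these extremes.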
A possible solution to the problem is to assume some structure on the nonparametric model,
and define a semi-parametric time series model. We state some examples below:

(i) An additive structure of the type

\[ X_t = \sum_{j=1}^{p}g_j(X_{t-j}) + \varepsilon_t \]

where $\{\varepsilon_t\}$ are iid random variables.

(ii) A functional autoregressive type structure

\[ X_t = \sum_{j=1}^{p}g_j(X_{t-d})X_{t-j} + \varepsilon_t. \]

(iii) The semi-parametric GARCH(1, 1)

\[ X_t = \sigma_tZ_t, \qquad \sigma_t^2 = b\sigma_{t-1}^2 + m(X_{t-1}). \]

However, once a structure has been imposed, conditions need to be derived in order that the model
has a stationary solution (just as we did with the fully-parametric models).
See ?, ?, ?, ?, ? etc.

7.12 The Wold Decomposition (advanced)


Section 5.5 leads naturally to the Wold decomposition, which we now state and prove. The Wold decomposition theorem states that any stationary process has something that appears close to an MA($\infty$) representation (though it is not). We state the theorem below and use some of the notation introduced in Section 5.5.

Theorem 7.12.1 Suppose that $\{X_t\}$ is a second order stationary time series with a finite variance (we shall assume that it has mean zero, though this is not necessary). Then $X_t$ can be uniquely expressed as

\[ X_t = \sum_{j=0}^{\infty}\psi_jZ_{t-j} + V_t, \tag{7.29} \]

where $\{Z_t\}$ are uncorrelated random variables, with $\mathrm{var}(Z_t) = E(X_t - X_{t-1}(1))^2$ (noting that $X_{t-1}(1)$ is the best linear predictor of $X_t$ given $X_{t-1}, X_{t-2}, \ldots$) and $V_t \in X_{-\infty} = \cap_{n=-\infty}^{-\infty}X_n^{-\infty}$, where $X_n^{-\infty}$ is defined in (5.34).

PROOF. First let us consider the one-step ahead prediction of $X_t$ given the infinite past, denoted $X_{t-1}(1)$. Since $\{X_t\}$ is a second order stationary process it is clear that $X_{t-1}(1) = \sum_{j=1}^{\infty}b_jX_{t-j}$, where the coefficients $\{b_j\}$ do not vary with $t$. For this reason $\{X_{t-1}(1)\}$ and $\{X_t - X_{t-1}(1)\}$ are second order stationary random variables. Furthermore, since $X_t - X_{t-1}(1)$ is uncorrelated with $X_s$ for any $s < t$, the random variables $\{X_s - X_{s-1}(1); s \in \mathbb{Z}\}$ are uncorrelated. Define $Z_s = X_s - X_{s-1}(1)$, and observe that $Z_s$ is the one-step ahead prediction error. We recall from Section 5.5 that

\[ X_t \in \overline{\mathrm{sp}}\big((X_t - X_{t-1}(1)), (X_{t-1} - X_{t-2}(1)), \ldots\big) \oplus \overline{\mathrm{sp}}(X_{-\infty}) = \oplus_{j=0}^{\infty}\mathrm{sp}(Z_{t-j}) \oplus \overline{\mathrm{sp}}(X_{-\infty}). \]

Since the spaces $\oplus_{j=0}^{\infty}\mathrm{sp}(Z_{t-j})$ and $\overline{\mathrm{sp}}(X_{-\infty})$ are orthogonal, we shall first project $X_t$ onto $\oplus_{j=0}^{\infty}\mathrm{sp}(Z_{t-j})$; due to orthogonality the difference between $X_t$ and its projection will be in $\overline{\mathrm{sp}}(X_{-\infty})$. This will lead to the Wold decomposition.

First we consider the projection of $X_t$ onto the space $\oplus_{j=0}^{\infty}\mathrm{sp}(Z_{t-j})$, which is

\[ P_{Z_t, Z_{t-1}, \ldots}(X_t) = \sum_{j=0}^{\infty}\psi_jZ_{t-j}, \]

where due to orthogonality $\psi_j = \mathrm{cov}(X_t, (X_{t-j} - X_{t-j-1}(1)))/\mathrm{var}(X_{t-j} - X_{t-j-1}(1))$. Since $X_t \in \oplus_{j=0}^{\infty}\mathrm{sp}(Z_{t-j}) \oplus \overline{\mathrm{sp}}(X_{-\infty})$, the difference $X_t - P_{Z_t, Z_{t-1}, \ldots}(X_t)$ is orthogonal to $\{Z_t\}$ and belongs in $\overline{\mathrm{sp}}(X_{-\infty})$. Hence we have

\[ X_t = \sum_{j=0}^{\infty}\psi_jZ_{t-j} + V_t, \]

where $V_t = X_t - \sum_{j=0}^{\infty}\psi_jZ_{t-j}$ is uncorrelated with $\{Z_t\}$. Hence we have shown (7.29). To show that the representation is unique we note that $Z_t, Z_{t-1}, \ldots$ are an orthogonal basis of $\mathrm{sp}(Z_t, Z_{t-1}, \ldots)$, which pretty much leads to uniqueness. $\Box$

Exercise 7.4 Consider the process Xt = A cos(Bt + U ) where A, B and U are random variables
such that A, B and U are independent and U is uniformly distributed on (0, 2π).

(i) Show that $X_t$ is second order stationary (actually it's stationary) and obtain its mean and covariance function.

(ii) Show that the distribution of $A$ and $B$ can be chosen in such a way that $\{X_t\}$ has the same covariance function as the MA(1) process $Y_t = \varepsilon_t + \phi\varepsilon_{t-1}$ (where $|\phi| < 1$) (quite amazing).

(iii) Suppose A and B have the same distribution found in (ii).

(a) What is the best predictor of Xt+1 given Xt , Xt−1 , . . .?

(b) What is the best linear predictor of Xt+1 given Xt , Xt−1 , . . .?

It is worth noting that variants on the proof can be found in Brockwell and Davis (1998),
Section 5.7 and Fuller (1995), page 94.

Remark 7.12.1 Notice that the representation in (7.29) looks like an MA(∞) process. There is,
however, a significant difference. The random variables {Zt } of an MA(∞) process are iid random
variables and not just uncorrelated.
We recall that we have already come across the Wold decomposition of some time series. In Section 6.4 we showed that a non-causal linear time series could be represented as a causal 'linear time series' with uncorrelated but dependent innovations. Another example is in Chapter 13, where we explored ARCH/GARCH processes, which have AR and ARMA type representations. Using this representation we can represent ARCH and GARCH processes as the weighted sum of $\{(Z_t^2 - 1)\sigma_t^2\}$, which are uncorrelated random variables.

Remark 7.12.2 (Variation on the Wold decomposition) In many technical proofs involving
time series, we often use results related to the Wold decomposition. More precisely, we often
decompose the time series in terms of an infinite sum of martingale differences. In particular,
we define the sigma-algebra $\mathcal{F}_t = \sigma(X_t, X_{t-1}, \ldots)$, and suppose that $E(X_t|\mathcal{F}_{-\infty}) = \mu$. Then by telescoping we can formally write $X_t$ as

\[ X_t - \mu = \sum_{j=0}^{\infty}Z_{t,j} \]

where $Z_{t,j} = E(X_t|\mathcal{F}_{t-j}) - E(X_t|\mathcal{F}_{t-j-1})$. It is straightforward to see that $Z_{t,j}$ are martingale differences, and under certain conditions (mixing, physical dependence, your favourite dependence flavour etc) it can be shown that $\sum_{j=0}^{\infty}\|Z_{t,j}\|_p < \infty$ (where $\|\cdot\|_p$ is the $L_p$-norm, $\|Z\|_p = (E|Z|^p)^{1/p}$). This means the above representation holds almost surely. Thus in several proofs we can replace $X_t - \mu$ by $\sum_{j=0}^{\infty}Z_{t,j}$. This decomposition allows us to use martingale theorems to prove results.

7.13 Kolmogorov’s formula (advanced)


Suppose $\{X_t\}$ is a second order stationary time series. Kolmogorov's(-Szegö) theorem is an expression for the error in the linear prediction of $X_t$ given the infinite past $X_{t-1}, X_{t-2}, \ldots$. It basically states that

\[ E[X_{n+1} - X_n(1)]^2 = \exp\left(\frac{1}{2\pi}\int_0^{2\pi}\log f(\omega)d\omega\right), \]

where $f$ is the spectral density of the time series. Clearly from the definition we require that the spectral density function is bounded away from zero.
To prove this result we use (5.25);

\[ \mathrm{var}[Y - \widehat{Y}] = \frac{\det(\Sigma)}{\det(\Sigma_{XX})} \]

and Szegö's theorem (see Gray's technical report, where the proof is given), which we state later on. Let $P_{X_1, \ldots, X_n}(X_{n+1}) = \sum_{j=1}^{n}\phi_{n,j}X_{n+1-j}$ (the best linear predictor of $X_{n+1}$ given $X_n, \ldots, X_1$). Then we observe that since $\{X_t\}$ is a second order stationary time series, and using (5.25), we have

\[ E\Big[X_{n+1} - \sum_{j=1}^{n}\phi_{n,j}X_{n+1-j}\Big]^2 = \frac{\det(\Sigma_{n+1})}{\det(\Sigma_n)}, \]

where $\Sigma_n = \{c(i-j); i, j = 0, \ldots, n-1\}$, and $\Sigma_n$ is a non-singular matrix.

Szegö's theorem is a general theorem concerning Toeplitz matrices. Define the sequence of Toeplitz matrices $\Gamma_n = \{c(i-j); i, j = 0, \ldots, n-1\}$ and assume the Fourier transform

\[ f(\omega) = \sum_{j \in \mathbb{Z}}c(j)\exp(ij\omega) \]

exists and is well defined ($\sum_j|c(j)|^2 < \infty$). Let $\{\gamma_{j,n}\}$ denote the eigenvalues corresponding to $\Gamma_n$. Then for any function $G$ we have

\[ \lim_{n \to \infty}\frac{1}{n}\sum_{j=1}^{n}G(\gamma_{j,n}) = \frac{1}{2\pi}\int_0^{2\pi}G(f(\omega))d\omega. \]

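To use this result we return to $E[X_{n+1} - \sum_{j=1}^{n}\phi_{n,j}X_{n+1-j}]^2$ and take logarithms

\[
\begin{aligned}
\log E\Big[X_{n+1} - \sum_{j=1}^{n}\phi_{n,j}X_{n+1-j}\Big]^2 &= \log\det(\Sigma_{n+1}) - \log\det(\Sigma_n) \\
&= \sum_{j=1}^{n+1}\log\gamma_{j,n+1} - \sum_{j=1}^{n}\log\gamma_{j,n},
\end{aligned}
\]

where the above is because $\det\Sigma_n = \prod_{j=1}^{n}\gamma_{j,n}$ (where $\gamma_{j,n}$ are the eigenvalues of $\Sigma_n$). Now we apply Szegö's theorem using $G(x) = \log(x)$; this states that

\[ \lim_{n \to \infty}\frac{1}{n}\sum_{j=1}^{n}\log(\gamma_{j,n}) = \frac{1}{2\pi}\int_0^{2\pi}\log(f(\omega))d\omega, \]

thus for large $n$

\[ \frac{1}{n+1}\sum_{j=1}^{n+1}\log\gamma_{j,n+1} \approx \frac{1}{n}\sum_{j=1}^{n}\log\gamma_{j,n}. \]

This implies that

\[ \sum_{j=1}^{n+1}\log\gamma_{j,n+1} \approx \frac{n+1}{n}\sum_{j=1}^{n}\log\gamma_{j,n}, \]

hence

\[
\begin{aligned}
\log E\Big[X_{n+1} - \sum_{j=1}^{n}\phi_{n,j}X_{n+1-j}\Big]^2 &= \sum_{j=1}^{n+1}\log\gamma_{j,n+1} - \sum_{j=1}^{n}\log\gamma_{j,n} \\
&\approx \frac{n+1}{n}\sum_{j=1}^{n}\log\gamma_{j,n} - \sum_{j=1}^{n}\log\gamma_{j,n} = \frac{1}{n}\sum_{j=1}^{n}\log\gamma_{j,n}.
\end{aligned}
\]

Thus

\[
\begin{aligned}
\lim_{n \to \infty}\log E\Big[X_{t+1} - \sum_{j=1}^{n}\phi_{n,j}X_{t+1-j}\Big]^2 &= \lim_{n \to \infty}\log E\Big[X_{n+1} - \sum_{j=1}^{n}\phi_{n,j}X_{n+1-j}\Big]^2 \\
&= \lim_{n \to \infty}\frac{1}{n}\sum_{j=1}^{n}\log\gamma_{j,n} = \frac{1}{2\pi}\int_0^{2\pi}\log(f(\omega))d\omega
\end{aligned}
\]

and

\[ \lim_{n \to \infty}E\Big[X_{t+1} - \sum_{j=1}^{n}\phi_{n,j}X_{t+1-j}\Big]^2 = \exp\left(\frac{1}{2\pi}\int_0^{2\pi}\log(f(\omega))d\omega\right). \]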

This gives a rough outline of the proof. The precise proof can be found in Gray's technical report. There exist alternative proofs (given by Kolmogorov), see Brockwell and Davis (1998), Chapter 5.
This is the reason that in many papers the assumption

\[ \int_0^{2\pi}\log f(\omega)d\omega > -\infty \]

is made. This assumption essentially ensures $X_t \notin X_{-\infty}$.

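Example 7.13.1 Consider the AR(1) process $X_t = \phi X_{t-1} + \varepsilon_t$ (assume wlog that $|\phi| < 1$) where $E[\varepsilon_t] = 0$ and $\mathrm{var}[\varepsilon_t] = \sigma^2$. We know that $X_t(1) = \phi X_t$ and

\[ E[X_{t+1} - X_t(1)]^2 = \sigma^2. \]

We now show that

\[ \exp\left(\frac{1}{2\pi}\int_0^{2\pi}\log f(\omega)d\omega\right) = \sigma^2. \tag{7.30} \]

We recall that the spectral density of the AR(1) is

\[ f(\omega) = \frac{\sigma^2}{|1 - \phi e^{i\omega}|^2} \quad\Rightarrow\quad \log f(\omega) = \log\sigma^2 - \log|1 - \phi e^{i\omega}|^2. \]

Thus

\[ \frac{1}{2\pi}\int_0^{2\pi}\log f(\omega)d\omega = \underbrace{\frac{1}{2\pi}\int_0^{2\pi}\log\sigma^2d\omega}_{=\log\sigma^2} - \underbrace{\frac{1}{2\pi}\int_0^{2\pi}\log|1 - \phi e^{i\omega}|^2d\omega}_{=0}. \]

There are various ways to prove that the second term is zero. Probably the simplest is to use basic results in complex analysis. Using the power series expansion $\log(1 - z) = -\sum_{j=1}^{\infty}z^j/j$ (valid for $|z| < 1$) with $z = \phi e^{\pm i\omega}$ we have

\[
\begin{aligned}
\frac{1}{2\pi}\int_0^{2\pi}\log|1 - \phi e^{i\omega}|^2d\omega &= \frac{1}{2\pi}\int_0^{2\pi}\log(1 - \phi e^{i\omega})d\omega + \frac{1}{2\pi}\int_0^{2\pi}\log(1 - \phi e^{-i\omega})d\omega \\
&= -\frac{1}{2\pi}\int_0^{2\pi}\sum_{j=1}^{\infty}\left(\frac{\phi^je^{ij\omega}}{j} + \frac{\phi^je^{-ij\omega}}{j}\right)d\omega = 0,
\end{aligned}
\]

since $\int_0^{2\pi}e^{\pm ij\omega}d\omega = 0$ for every $j \geq 1$. From this we immediately prove (7.30).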
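The identity (7.30) can also be checked numerically. The sketch below is plain Python with hypothetical function names; it approximates the integral by the midpoint rule, which is a convenient choice for a smooth periodic integrand, and uses the AR(1) spectral density in the normalisation of this example.

```python
import math


def kolmogorov_prediction_error(log_f, n_grid=4096):
    """exp( (1/2pi) * integral over [0, 2pi] of log f ), by the midpoint rule."""
    h = 2.0 * math.pi / n_grid
    total = sum(log_f((k + 0.5) * h) for k in range(n_grid))
    return math.exp(total * h / (2.0 * math.pi))


def ar1_log_spectral_density(phi, sigma2):
    """log f for f(w) = sigma2 / |1 - phi e^{iw}|^2."""
    def log_f(w):
        # |1 - phi e^{iw}|^2 = 1 - 2 phi cos(w) + phi^2
        return math.log(sigma2) - math.log(1.0 - 2.0 * phi * math.cos(w) + phi * phi)
    return log_f
```

For instance, `kolmogorov_prediction_error(ar1_log_spectral_density(0.7, 2.0))` returns the innovation variance 2 up to numerical error, whatever value of $|\phi| < 1$ is used.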

7.14 Appendix: Prediction coefficients for an AR(p) model

Define the $p$-dimensional random vector $X_t' = (X_t, \ldots, X_{t-p+1})$. We define the causal VAR(1) model in the vector form as

\[ X_t = \Phi X_{t-1} + \varepsilon_t \]

where $\varepsilon_t' = (\varepsilon_t, 0, \ldots, 0)$ and

\[
\Phi = \begin{pmatrix}
\phi_1 & \phi_2 & \ldots & \phi_{p-1} & \phi_p \\
1 & 0 & \ldots & 0 & 0 \\
0 & 1 & \ldots & 0 & 0 \\
\vdots & \ddots & \ddots & 0 & 0 \\
0 & 0 & \ldots & 1 & 0
\end{pmatrix}. \tag{7.31}
\]

Lemma 7.14.1 Let $\Phi$ be defined as in (7.31) where the parameters $\phi$ are such that the roots of $\phi(z) = 1 - \sum_{j=1}^{p}\phi_jz^j$ lie outside the unit circle. Then

\[ [\Phi^{|\tau|+1}X_p]_{(1)} = \sum_{\ell=1}^{p}X_\ell\sum_{s=0}^{p-\ell}\phi_{\ell+s}\psi_{|\tau|-s}, \tag{7.32} \]

where $\{\psi_j\}$ are the coefficients in the expansion $(1 - \sum_{j=1}^{p}\phi_je^{-ij\omega})^{-1} = \sum_{s=0}^{\infty}\psi_se^{-is\omega}$.

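PROOF. The proof is based on the observation that the $j$th row of $\Phi^m$ ($m \geq 1$) is the $(j-1)$th row of $\Phi^{m-1}$ (due to the structure of $\Phi$). Let $(\phi_{1,m}, \ldots, \phi_{p,m})$ denote the first row of $\Phi^m$. Using this notation we have

\[
\begin{pmatrix}
\phi_{1,m} & \phi_{2,m} & \ldots & \phi_{p,m} \\
\phi_{1,m-1} & \phi_{2,m-1} & \ldots & \phi_{p,m-1} \\
\vdots & \vdots & & \vdots \\
\phi_{1,m-p+1} & \phi_{2,m-p+1} & \ldots & \phi_{p,m-p+1}
\end{pmatrix}
=
\begin{pmatrix}
\phi_1 & \phi_2 & \ldots & \phi_{p-1} & \phi_p \\
1 & 0 & \ldots & 0 & 0 \\
0 & 1 & \ldots & 0 & 0 \\
\vdots & \ddots & \ddots & \vdots & \vdots \\
0 & 0 & \ldots & 1 & 0
\end{pmatrix}
\begin{pmatrix}
\phi_{1,m-1} & \phi_{2,m-1} & \ldots & \phi_{p,m-1} \\
\phi_{1,m-2} & \phi_{2,m-2} & \ldots & \phi_{p,m-2} \\
\vdots & \vdots & & \vdots \\
\phi_{1,m-p} & \phi_{2,m-p} & \ldots & \phi_{p,m-p}
\end{pmatrix}.
\]

From the above we observe that $\phi_{\ell,m}$ satisfies the system of equations

\[
\begin{aligned}
\phi_{\ell,m} &= \phi_\ell\phi_{1,m-1} + \phi_{\ell+1,m-1}, \qquad 1 \leq \ell \leq p-1, \\
\phi_{p,m} &= \phi_p\phi_{1,m-1}.
\end{aligned} \tag{7.33}
\]

Our aim is to obtain an expression for $\phi_{\ell,m}$ in terms of $\{\phi_j\}_{j=1}^{p}$ and $\{\psi_j\}_{j=0}^{\infty}$, which we now define. Since the roots of $\phi(\cdot)$ lie outside the unit circle, the function $(1 - \sum_{j=1}^{p}\phi_jz^j)^{-1}$ is well defined for $|z| \leq 1$ and has the power series expansion $(1 - \sum_{i=1}^{p}\phi_iz^i)^{-1} = \sum_{i=0}^{\infty}\psi_iz^i$ for $|z| \leq 1$. We use the well known result $[\Phi^m]_{1,1} = \phi_{1,m} = \psi_m$. Using this we obtain an expression for the coefficients $\{\phi_{\ell,m}; 2 \leq \ell \leq p\}$ in terms of $\{\phi_i\}$ and $\{\psi_i\}$. Solving the system of equations in (7.33), starting with $\phi_{1,1} = \psi_1$ and recursively solving for $\phi_{p,m}, \ldots, \phi_{2,m}$, we have

\[
\begin{aligned}
\phi_{p,r} &= \phi_p\psi_{r-1}, \qquad m-p \leq r \leq m, \\
\phi_{\ell,r} &= \phi_\ell\phi_{1,r-1} + \phi_{\ell+1,r-1}, \qquad 1 \leq \ell \leq p-1, \quad m-p \leq r \leq m.
\end{aligned}
\]

This gives $\phi_{p,m} = \phi_p\psi_{m-1}$; for $\ell = p-1$

\[
\begin{aligned}
\phi_{p-1,m} &= \phi_{p-1}\phi_{1,m-1} + \phi_{p,m-1} = \phi_{p-1}\psi_{m-1} + \phi_p\psi_{m-2}, \\
\phi_{p-2,m} &= \phi_{p-2}\phi_{1,m-1} + \phi_{p-1,m-1} = \phi_{p-2}\psi_{m-1} + \phi_{p-1}\psi_{m-2} + \phi_p\psi_{m-3},
\end{aligned}
\]

up to

\[ \phi_{1,m} = \phi_1\phi_{1,m-1} + \phi_{2,m-1} = \sum_{s=0}^{p-1}\phi_{1+s}\psi_{m-1-s} = (\psi_m). \]

This gives the general expression

\[ \phi_{p-r,m} = \sum_{s=0}^{r}\phi_{p-r+s}\psi_{m-1-s}, \qquad 0 \leq r \leq p-1. \]

In the last line of the above we change variables with $\ell = p - r$ to give for $m \geq 1$

\[ \phi_{\ell,m} = \sum_{s=0}^{p-\ell}\phi_{\ell+s}\psi_{m-1-s}, \qquad 1 \leq \ell \leq p, \]

where we set $\psi_0 = 1$ and for $t < 0$, $\psi_t = 0$. Therefore

\[ [\Phi^{|\tau|+1}X_p]_{(1)} = \sum_{\ell=1}^{p}X_\ell\sum_{s=0}^{p-\ell}\phi_{\ell+s}\psi_{|\tau|-s}. \]

Thus we obtain the desired result. $\Box$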
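The expression for the first row of $\Phi^m$ can be checked numerically against direct matrix multiplication. The following is a sketch in plain Python with hypothetical function names and an arbitrary illustrative choice of coefficients; the $\psi_j$ are computed from the standard recursion $\psi_j = \sum_i\phi_i\psi_{j-i}$ implied by the power series expansion.

```python
def companion_power_first_row(phi, m):
    """First row of Phi^m for the companion matrix (7.31), by repeated multiplication."""
    p = len(phi)
    Phi = [[0.0] * p for _ in range(p)]
    Phi[0] = list(phi)
    for i in range(1, p):
        Phi[i][i - 1] = 1.0
    row = list(phi)                       # first row of Phi^1
    for _ in range(m - 1):
        # first row of Phi^k equals (first row of Phi^(k-1)) times Phi
        row = [sum(row[k] * Phi[k][j] for k in range(p)) for j in range(p)]
    return row


def ma_infinity_coefficients(phi, n):
    """psi_0, ..., psi_n with psi_0 = 1 and psi_j = sum_i phi_i psi_{j-i}."""
    psi = [1.0]
    for j in range(1, n + 1):
        psi.append(sum(phi[i - 1] * psi[j - i]
                       for i in range(1, min(j, len(phi)) + 1)))
    return psi


def first_row_from_lemma(phi, m):
    """phi_{l,m} = sum_{s=0}^{p-l} phi_{l+s} psi_{m-1-s}, with psi_t = 0 for t < 0."""
    p = len(phi)
    psi = ma_infinity_coefficients(phi, m)
    return [sum(phi[l + s - 1] * psi[m - 1 - s]
                for s in range(p - l + 1) if m - 1 - s >= 0)
            for l in range(1, p + 1)]
```

Comparing the two functions for several values of $m$ (for example with $\phi = (0.5, -0.3, 0.1)$) gives agreement to rounding error, and the first entry equals $\psi_m$.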

A proof of Durbin-Levinson algorithm based on symmetric Toeplitz matrices

We now give an alternative proof which is based on properties of the (symmetric) Toeplitz matrix. We use (7.15), which is a matrix equation where

\[ \Sigma_t\begin{pmatrix}\phi_{t,1} \\ \vdots \\ \phi_{t,t}\end{pmatrix} = r_t, \tag{7.34} \]

with

\[
\Sigma_t = \begin{pmatrix}
c(0) & c(1) & c(2) & \ldots & c(t-1) \\
c(1) & c(0) & c(1) & \ldots & c(t-2) \\
\vdots & \vdots & \ddots & \vdots & \vdots \\
c(t-1) & c(t-2) & \ldots & \ldots & c(0)
\end{pmatrix}
\quad\text{and}\quad
r_t = \begin{pmatrix} c(1) \\ c(2) \\ \vdots \\ c(t) \end{pmatrix}.
\]

The proof is based on embedding $r_{t-1}$ and $\Sigma_{t-1}$ into $\Sigma_t$ and using that $\Sigma_{t-1}\phi_{t-1} = r_{t-1}$.
To do this, we define the $(t-1) \times (t-1)$ matrix $E_{t-1}$ which basically swaps round all the elements in a vector

\[
E_{t-1} = \begin{pmatrix}
0 & 0 & 0 & \ldots & 0 & 1 \\
0 & 0 & 0 & \ldots & 1 & 0 \\
\vdots & \vdots & \vdots & & \vdots & \vdots \\
1 & 0 & \ldots & 0 & 0 & 0
\end{pmatrix}
\]

(recall we came across this swapping matrix in Section 6.2). Using the above notation, we have the interesting block matrix structure

\[
\Sigma_t = \begin{pmatrix} \Sigma_{t-1} & E_{t-1}r_{t-1} \\ r_{t-1}'E_{t-1} & c(0) \end{pmatrix}
\quad\text{and}\quad r_t = (r_{t-1}', c(t))'.
\]

Returning to the matrix equations in (7.34) and substituting the above into (7.34) we have

\[
\Sigma_t\phi_t = r_t \quad\Rightarrow\quad
\begin{pmatrix} \Sigma_{t-1} & E_{t-1}r_{t-1} \\ r_{t-1}'E_{t-1} & c(0) \end{pmatrix}
\begin{pmatrix} \phi_{t-1,t} \\ \phi_{t,t} \end{pmatrix}
= \begin{pmatrix} r_{t-1} \\ c(t) \end{pmatrix},
\]

where $\phi_{t-1,t}' = (\phi_{t,1}, \ldots, \phi_{t,t-1})$. This leads to the two equations

\[ \Sigma_{t-1}\phi_{t-1,t} + E_{t-1}r_{t-1}\phi_{t,t} = r_{t-1} \tag{7.35} \]

\[ r_{t-1}'E_{t-1}\phi_{t-1,t} + c(0)\phi_{t,t} = c(t). \tag{7.36} \]

We first show that equation (7.35) corresponds to the second equation in the Levinson-Durbin algorithm. Multiplying (7.35) by $\Sigma_{t-1}^{-1}$, and rearranging the equation we have

\[ \phi_{t-1,t} = \underbrace{\Sigma_{t-1}^{-1}r_{t-1}}_{=\phi_{t-1}} - \underbrace{\Sigma_{t-1}^{-1}E_{t-1}r_{t-1}}_{=E_{t-1}\phi_{t-1}}\phi_{t,t}. \]

Thus we have

\[ \phi_{t-1,t} = \phi_{t-1} - \phi_{t,t}E_{t-1}\phi_{t-1}. \tag{7.37} \]

This proves the second equation in Step 2 of the Levinson-Durbin algorithm.

We now use (7.36) to obtain an expression for $\phi_{t,t}$, which is the first equation in Step 2. Substituting (7.37) into $\phi_{t-1,t}$ of (7.36) gives

\[ r_{t-1}'E_{t-1}\left(\phi_{t-1} - \phi_{t,t}E_{t-1}\phi_{t-1}\right) + c(0)\phi_{t,t} = c(t). \tag{7.38} \]

Thus, using $E_{t-1}E_{t-1} = I$ and solving for $\phi_{t,t}$, we have

\[ \phi_{t,t} = \frac{c(t) - r_{t-1}'E_{t-1}\phi_{t-1}}{c(0) - r_{t-1}'\phi_{t-1}}. \tag{7.39} \]

Noting that $r(t) = c(0) - r_{t-1}'\phi_{t-1}$, (7.39) is the first equation of Step 2 in the Levinson-Durbin algorithm.
Note from this proof we do not need that the (symmetric) Toeplitz matrix is positive semi-definite. See Pourahmadi (2001), Chapter 7.
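The two equations (7.37) and (7.39) translate directly into a recursion. The following plain Python sketch (hypothetical function name; the autocovariances are assumed given) returns the coefficients solving $\Sigma_t\phi_t = r_t$ together with the quantity $c(0) - r_t'\phi_t$.

```python
def levinson_durbin(c):
    """Solve Sigma_t phi_t = r_t recursively from autocovariances c = [c(0), ..., c(n)].

    Returns (phi, err): phi = [phi_{n,1}, ..., phi_{n,n}] and
    err = c(0) - sum_j phi_{n,j} c(j).
    """
    phi = []
    err = c[0]                                  # r(1) = c(0)
    for t in range(1, len(c)):
        # (7.39): phi_{t,t} = (c(t) - r'_{t-1} E_{t-1} phi_{t-1}) / r(t)
        kappa = (c[t] - sum(phi[j] * c[t - 1 - j] for j in range(t - 1))) / err
        # (7.37): phi_{t-1,t} = phi_{t-1} - phi_{t,t} E_{t-1} phi_{t-1}
        phi = [phi[j] - kappa * phi[t - 2 - j] for j in range(t - 1)] + [kappa]
        # r(t+1) = c(0) - r'_t phi_t
        err = c[0] - sum(phi[j] * c[j + 1] for j in range(t))
    return phi, err
```

For the AR(1) autocovariance $c(k) = \phi^k/(1 - \phi^2)$ the output is $(\phi, 0, \ldots, 0)$ with `err` equal to 1, as expected for unit innovation variance.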

Prediction for ARMA models

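Proof of equation (7.16) For the proof, we define the variables $\{W_t\}$, where $W_t = X_t$ for $1 \leq t \leq p$ and for $t > \max(p, q)$ let $W_t = \varepsilon_t + \sum_{i=1}^{q}\theta_i\varepsilon_{t-i}$ (which is the MA($q$) part of the process). Since $X_{p+1} = \sum_{j=1}^{p}\phi_jX_{p+1-j} + W_{p+1}$ and so forth it is clear that $\mathrm{sp}(X_1, \ldots, X_t) = \mathrm{sp}(W_1, \ldots, W_t)$ (i.e. they are linear combinations of each other). To prove the result we use the following steps:

\[
\begin{aligned}
P_{X_t, \ldots, X_1}(X_{t+1}) &= \sum_{j=1}^{p}\phi_j\underbrace{P_{X_t, \ldots, X_1}(X_{t+1-j})}_{=X_{t+1-j}} + \sum_{i=1}^{q}\theta_iP_{X_t, \ldots, X_1}(\varepsilon_{t+1-i}) \\
&= \sum_{j=1}^{p}\phi_jX_{t+1-j} + \sum_{i=1}^{q}\theta_i\underbrace{P_{X_t - X_{t|t-1}, \ldots, X_2 - X_{2|1}, X_1}(\varepsilon_{t+1-i})}_{=P_{W_t - W_{t|t-1}, \ldots, W_2 - W_{2|1}, W_1}(\varepsilon_{t+1-i})} \\
&= \sum_{j=1}^{p}\phi_jX_{t+1-j} + \sum_{i=1}^{q}\theta_iP_{W_t - W_{t|t-1}, \ldots, W_2 - W_{2|1}, W_1}(\varepsilon_{t+1-i}) \\
&= \sum_{j=1}^{p}\phi_jX_{t+1-j} + \sum_{i=1}^{q}\theta_i\underbrace{P_{W_{t+1-i} - W_{t+1-i|t-i}, \ldots, W_t - W_{t|t-1}}(\varepsilon_{t+1-i})}_{\text{since } \varepsilon_{t+1-i} \text{ is independent of } W_{t+1-i-j};\ j \geq 1} \\
&= \sum_{j=1}^{p}\phi_jX_{t+1-j} + \sum_{i=1}^{q}\theta_i\sum_{s=0}^{i-1}\underbrace{P_{W_{t+1-i+s} - W_{t+1-i+s|t-i+s}}(\varepsilon_{t+1-i})}_{\text{since } W_{t+1-i+s} - W_{t+1-i+s|t-i+s} \text{ are uncorrelated}} \\
&= \sum_{j=1}^{p}\phi_jX_{t+1-j} + \sum_{i=1}^{q}\theta_{t,i}\underbrace{(W_{t+1-i} - W_{t+1-i|t-i})}_{=X_{t+1-i} - X_{t+1-i|t-i}} \\
&= \sum_{j=1}^{p}\phi_jX_{t+1-j} + \sum_{i=1}^{q}\theta_{t,i}(X_{t+1-i} - X_{t+1-i|t-i}),
\end{aligned}
\]

this gives the desired result.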


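We prove (7.18) for the ARMA(1, 2) model. We first note that $\mathrm{sp}(X_1, X_2, \ldots, X_t) = \mathrm{sp}(W_1, W_2, \ldots, W_t)$, where $W_1 = X_1$ and for $t \geq 2$, $W_t = \theta_1\varepsilon_{t-1} + \theta_2\varepsilon_{t-2} + \varepsilon_t$. The corresponding approximating predictor is defined as $\widehat{W}_{2|1} = W_1$, $\widehat{W}_{3|2} = W_2$ and for $t > 3$

\[ \widehat{W}_{t|t-1} = \theta_1[W_{t-1} - \widehat{W}_{t-1|t-2}] + \theta_2[W_{t-2} - \widehat{W}_{t-2|t-3}]. \]

Note that by using (7.17), the above is equivalent to

\[ \underbrace{\widehat{X}_{t+1|t} - \phi_1X_t}_{=\widehat{W}_{t+1|t}} = \theta_1\underbrace{[X_t - \widehat{X}_{t|t-1}]}_{=(W_t - \widehat{W}_{t|t-1})} + \theta_2\underbrace{[X_{t-1} - \widehat{X}_{t-1|t-2}]}_{=(W_{t-1} - \widehat{W}_{t-1|t-2})}. \]

By subtracting the above from $W_{t+1}$ we have

\[ W_{t+1} - \widehat{W}_{t+1|t} = -\theta_1(W_t - \widehat{W}_{t|t-1}) - \theta_2(W_{t-1} - \widehat{W}_{t-1|t-2}) + W_{t+1}. \tag{7.40} \]

It is straightforward to rewrite $W_{t+1} - \widehat{W}_{t+1|t}$ as the matrix difference equation

\[
\underbrace{\begin{pmatrix} W_{t+1} - \widehat{W}_{t+1|t} \\ W_t - \widehat{W}_{t|t-1} \end{pmatrix}}_{=\widehat{\underline{\varepsilon}}_{t+1}}
= -\underbrace{\begin{pmatrix} \theta_1 & \theta_2 \\ -1 & 0 \end{pmatrix}}_{=Q}
\underbrace{\begin{pmatrix} W_t - \widehat{W}_{t|t-1} \\ W_{t-1} - \widehat{W}_{t-1|t-2} \end{pmatrix}}_{=\widehat{\underline{\varepsilon}}_t}
+ \underbrace{\begin{pmatrix} W_{t+1} \\ 0 \end{pmatrix}}_{=\underline{W}_{t+1}}.
\]

We now show that $\varepsilon_{t+1}$ and $W_{t+1} - \widehat{W}_{t+1|t}$ lead to the same difference equation except for some initial conditions; it is this that will give us the result. To do this we write $\varepsilon_t$ as a function of $\{W_t\}$ (the irreducible condition). We first note that $\varepsilon_t$ can be written as the matrix difference equation

\[
\underbrace{\begin{pmatrix} \varepsilon_{t+1} \\ \varepsilon_t \end{pmatrix}}_{=\underline{\varepsilon}_{t+1}}
= -\underbrace{\begin{pmatrix} \theta_1 & \theta_2 \\ -1 & 0 \end{pmatrix}}_{=Q}
\underbrace{\begin{pmatrix} \varepsilon_t \\ \varepsilon_{t-1} \end{pmatrix}}_{=\underline{\varepsilon}_t}
+ \underbrace{\begin{pmatrix} W_{t+1} \\ 0 \end{pmatrix}}_{=\underline{W}_{t+1}}. \tag{7.41}
\]

Thus iterating backwards we can write

\[ \varepsilon_{t+1} = \sum_{j=0}^{\infty}(-1)^j[Q^j]_{(1,1)}W_{t+1-j} = \sum_{j=0}^{\infty}\tilde{b}_jW_{t+1-j}, \]

where $\tilde{b}_j = (-1)^j[Q^j]_{(1,1)}$ (noting that $\tilde{b}_0 = 1$) and $[Q^j]_{(1,1)}$ denotes the $(1,1)$th element of the matrix $Q^j$ (note we did something similar in Section ??). Furthermore the same iteration shows that

\[
\begin{aligned}
\varepsilon_{t+1} &= \sum_{j=0}^{t-3}(-1)^j[Q^j]_{(1,1)}W_{t+1-j} + (-1)^{t-2}[Q^{t-2}\underline{\varepsilon}_3]_1 \\
&= \sum_{j=0}^{t-3}\tilde{b}_jW_{t+1-j} + (-1)^{t-2}[Q^{t-2}\underline{\varepsilon}_3]_1.
\end{aligned} \tag{7.42}
\]

Therefore, by comparison we see that

\[ \varepsilon_{t+1} - \sum_{j=0}^{t-3}\tilde{b}_jW_{t+1-j} = (-1)^{t-2}[Q^{t-2}\underline{\varepsilon}_3]_1 = \sum_{j=t-2}^{\infty}\tilde{b}_jW_{t+1-j}. \]

We now return to the approximation prediction in (7.40). Comparing (7.40) and (7.41) we see that they are almost the same difference equations. The only difference is the point at which the algorithm starts. $\varepsilon_t$ goes all the way back to the start of time, whereas we have set initial values for $\widehat{W}_{2|1} = W_1$, $\widehat{W}_{3|2} = W_2$, thus $\widehat{\underline{\varepsilon}}_3' = (W_3 - W_2, W_2 - W_1)$. Therefore, by iterating both (7.40) and (7.41) backwards, focusing on the first element of the vector and using (7.42) we have

\[ \varepsilon_{t+1} - \widehat{\varepsilon}_{t+1} = \underbrace{(-1)^{t-2}[Q^{t-2}\underline{\varepsilon}_3]_1}_{=\sum_{j=t-2}^{\infty}\tilde{b}_jW_{t+1-j}} + (-1)^{t-2}[Q^{t-2}\widehat{\underline{\varepsilon}}_3]_1. \]

We recall that $\varepsilon_{t+1} = W_{t+1} + \sum_{j=1}^{\infty}\tilde{b}_jW_{t+1-j}$ and that $\widehat{\varepsilon}_{t+1} = W_{t+1} - \widehat{W}_{t+1|t}$. Substituting this into the above gives

\[ \widehat{W}_{t+1|t} - \sum_{j=1}^{\infty}\tilde{b}_jW_{t+1-j} = \sum_{j=t-2}^{\infty}\tilde{b}_jW_{t+1-j} + (-1)^{t-2}[Q^{t-2}\widehat{\underline{\varepsilon}}_3]_1. \]

Replacing $W_t$ with $X_t - \phi_1X_{t-1}$ gives (7.18), where the $b_j$ can be easily deduced from $\tilde{b}_j$ and $\phi_1$.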

We now state a few results which will be useful later.

Lemma 7.14.2 Suppose {Xt } is a stationary time series with spectral density f (ω). Let X t =
(X1 , . . . , Xt ) and Σt = var(X t ).

(i) If the spectral density function is bounded away from zero (there is some $\gamma > 0$ such that $\inf_\omega f(\omega) \geq \gamma$), then for all $t$, $\lambda_{\min}(\Sigma_t) \geq \gamma$ (where $\lambda_{\min}$ and $\lambda_{\max}$ denote the smallest and largest absolute eigenvalues of the matrix).

(ii) Further, $\lambda_{\max}(\Sigma_t^{-1}) \leq \gamma^{-1}$.

(Since for symmetric matrices the spectral norm and the largest eigenvalue are the same, $\|\Sigma_t^{-1}\|_{spec} \leq \gamma^{-1}$.)

(iii) Analogously, if $\sup_\omega f(\omega) \leq M < \infty$, then $\lambda_{\max}(\Sigma_t) \leq M$ (hence $\|\Sigma_t\|_{spec} \leq M$).

PROOF. See Chapter 10. 

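Remark 7.14.1 Suppose $\{X_t\}$ is an ARMA process, where the roots of $\phi(z)$ and $\theta(z)$ have absolute value greater than $1 + \delta_1$ and less than $\delta_2$. Then the spectral density $f(\omega)$ is bounded by

\[ \mathrm{var}(\varepsilon_t)\frac{(1 - \frac{1}{\delta_2})^{2p}}{(1 - (\frac{1}{1+\delta_1}))^{2p}} \leq f(\omega) \leq \mathrm{var}(\varepsilon_t)\frac{(1 - (\frac{1}{1+\delta_1}))^{2p}}{(1 - \frac{1}{\delta_2})^{2p}}. \]

Therefore, from Lemma 7.14.2 we have that $\lambda_{\max}(\Sigma_t)$ and $\lambda_{\max}(\Sigma_t^{-1})$ are bounded uniformly over $t$.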

7.15 Appendix: Proof of the Kalman filter
In this section we prove the recursive equations used to define the Kalman filter. The proof is straightforward and uses the multi-stage projection described in Section 5.1.4 (which has already been used to prove the Levinson-Durbin algorithm and forms the basis of the Burg algorithm).
The Kalman filter construction is based on the state space equation

\[ X_t = FX_{t-1} + V_t \]

where $\{X_t\}_t$ is an unobserved time series, $F$ is a known matrix, $\mathrm{var}[V_t] = Q$ and $\{V_t\}_t$ are independent random variables that are independent of $X_{t-1}$. The observation equation is

\[ Y_t = HX_t + W_t \]

where $\{Y_t\}_t$ is the observed time series, $\mathrm{var}[W_t] = R$, and $\{W_t\}_t$ are independent random variables that are independent of $X_t$. Moreover $\{V_t\}_t$ and $\{W_t\}_t$ are jointly independent. The parameters can be made time-dependent, but this makes the derivations notationally more cumbersome.
The derivation of the Kalman equations is based on the projections discussed in Section 5.3. In particular, suppose that $X, Y, Z$ are random variables, then

\[ P_{Y,Z}(X) = P_Y(X) + \alpha_X(Z - P_Y(Z)) \tag{7.43} \]

where

\[ \alpha_X = \frac{\mathrm{cov}(X, Z - P_Y(Z))}{\mathrm{var}(Z - P_Y(Z))} \]

and

\[ \mathrm{var}[X - P_{Y,Z}(X)] = \mathrm{cov}[X, X - P_{Y,Z}(X)]; \tag{7.44} \]

these properties we have already used a number of times.

The standard notation is $\widehat{X}_{t+1|t} = P_{Y_1, \ldots, Y_t}(X_{t+1})$ and $P_{t+1|t} = \mathrm{var}[X_{t+1} - \widehat{X}_{t+1|t}]$ (predictive), and $\widehat{X}_{t+1|t+1} = P_{Y_1, \ldots, Y_{t+1}}(X_{t+1})$ and $P_{t+1|t+1} = \mathrm{var}[X_{t+1} - \widehat{X}_{t+1|t+1}]$ (update).

The Kalman equations

(i) Prediction step

The conditional expectation

\[ \widehat{X}_{t+1|t} = F\widehat{X}_{t|t} \]

and the corresponding mean squared error

\[ P_{t+1|t} = FP_{t|t}F^{*} + Q. \]

(ii) Update step

The conditional expectation

\[ \widehat{X}_{t+1|t+1} = \widehat{X}_{t+1|t} + K_{t+1}\left(Y_{t+1} - H\widehat{X}_{t+1|t}\right), \]

where

\[ K_{t+1} = P_{t+1|t}H^{*}[HP_{t+1|t}H^{*} + R]^{-1} \]

and the corresponding mean squared error

\[ P_{t+1|t+1} = P_{t+1|t} - K_{t+1}HP_{t+1|t} = (I - K_{t+1}H)P_{t+1|t}. \]

(iii) There is also a smoothing step (which we ignore for now).

The Kalman filter iteratively evaluates steps (i) and (ii) for $t = 2, 3, \ldots$. We start with $\widehat{X}_{t-1|t-1}$ and $P_{t-1|t-1}$.
Derivation of predictive equations The best linear predictor:

\[
\begin{aligned}
\widehat{X}_{t+1|t} &= P_{Y_1, \ldots, Y_t}(X_{t+1}) = P_{Y_1, \ldots, Y_t}(FX_t + V_{t+1}) \\
&= P_{Y_1, \ldots, Y_t}(FX_t) + P_{Y_1, \ldots, Y_t}(V_{t+1}) = FP_{Y_1, \ldots, Y_t}(X_t) = F\widehat{X}_{t|t}.
\end{aligned}
\]

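The mean squared error

\[
\begin{aligned}
P_{t+1|t} &= \mathrm{var}[X_{t+1} - \widehat{X}_{t+1|t}] = \mathrm{var}[FX_t + V_{t+1} - F\widehat{X}_{t|t}] \\
&= \mathrm{var}[F(X_t - \widehat{X}_{t|t}) + V_{t+1}] \\
&= \mathrm{var}[F(X_t - \widehat{X}_{t|t})] + \mathrm{var}[V_{t+1}] \\
&= F\mathrm{var}[X_t - \widehat{X}_{t|t}]F^{*} + \mathrm{var}[V_{t+1}] = FP_{t|t}F^{*} + Q.
\end{aligned}
\]

This gives the two prediction equations from the previous update equations. Next we derive the update equations (which is slightly more tricky).

Derivation of the update equations Now we expand the projection space from $\mathrm{sp}(Y_1, \ldots, Y_t)$ to $\mathrm{sp}(Y_1, \ldots, Y_t, Y_{t+1})$. But as the recursion uses $\mathrm{sp}(Y_1, \ldots, Y_t)$ we represent

\[ \mathrm{sp}(Y_1, \ldots, Y_t, Y_{t+1}) = \mathrm{sp}(Y_1, \ldots, Y_t, Y_{t+1} - P_{Y_1, \ldots, Y_t}(Y_{t+1})). \]

Note that

\[ Y_{t+1} - P_{Y_1, \ldots, Y_t}(Y_{t+1}) = Y_{t+1} - P_{Y_1, \ldots, Y_t}(HX_{t+1} + W_{t+1}) = Y_{t+1} - H\widehat{X}_{t+1|t}. \]

Thus by using (7.43) we have

\[ \widehat{X}_{t+1|t+1} = P_{Y_1, \ldots, Y_t, Y_{t+1}}(X_{t+1}) = \widehat{X}_{t+1|t} + \alpha\left(Y_{t+1} - H\widehat{X}_{t+1|t}\right) \]

where

\[ \alpha = \mathrm{cov}(X_{t+1}, Y_{t+1} - H\widehat{X}_{t+1|t})\,\mathrm{var}(Y_{t+1} - H\widehat{X}_{t+1|t})^{-1}. \]

We now find an expression for $\alpha = K_{t+1}$ ($K_{t+1}$ is the typical notation). We recall that $Y_{t+1} = HX_{t+1} + W_{t+1}$, thus $Y_{t+1} - H\widehat{X}_{t+1|t} = H(X_{t+1} - \widehat{X}_{t+1|t}) + W_{t+1}$. Thus

\[
\begin{aligned}
\mathrm{cov}(X_{t+1}, Y_{t+1} - H\widehat{X}_{t+1|t}) &= \mathrm{cov}(X_{t+1}, H(X_{t+1} - \widehat{X}_{t+1|t}) + W_{t+1}) \\
&= \mathrm{cov}(X_{t+1}, H(X_{t+1} - \widehat{X}_{t+1|t})) = \mathrm{cov}(X_{t+1}, X_{t+1} - \widehat{X}_{t+1|t})H^{*} \\
&= \mathrm{var}(X_{t+1} - \widehat{X}_{t+1|t})H^{*} = P_{t+1|t}H^{*}
\end{aligned} \tag{7.45}
\]

and

\[
\begin{aligned}
\mathrm{var}(Y_{t+1} - H\widehat{X}_{t+1|t}) &= \mathrm{var}(H(X_{t+1} - \widehat{X}_{t+1|t}) + W_{t+1}) \\
&= H\mathrm{var}(X_{t+1} - \widehat{X}_{t+1|t})H^{*} + \mathrm{var}(W_{t+1}) \\
&= HP_{t+1|t}H^{*} + R.
\end{aligned}
\]

Therefore, altogether

\[ K_{t+1} = P_{t+1|t}H^{*}[HP_{t+1|t}H^{*} + R]^{-1} \]

\[ \widehat{X}_{t+1|t+1} = \widehat{X}_{t+1|t} + K_{t+1}\left(Y_{t+1} - H\widehat{X}_{t+1|t}\right). \]

Often $K_{t+1}$ or $K_{t+1}(Y_{t+1} - H\widehat{X}_{t+1|t})$ is referred to as the Kalman gain, which is the "gain" when including the additional term $Y_{t+1}$ in the prediction. Finally we calculate the variance. Again using (7.44) we have

\[
\begin{aligned}
P_{t+1|t+1} &= \mathrm{var}[X_{t+1} - \widehat{X}_{t+1|t+1}] = \mathrm{cov}[X_{t+1}, X_{t+1} - \widehat{X}_{t+1|t+1}] \\
&= \mathrm{cov}\left[X_{t+1}, X_{t+1} - \widehat{X}_{t+1|t} - K_{t+1}\left(Y_{t+1} - H\widehat{X}_{t+1|t}\right)\right] \\
&= \mathrm{cov}\left[X_{t+1}, X_{t+1} - \widehat{X}_{t+1|t}\right] - \mathrm{cov}\left[X_{t+1}, Y_{t+1} - H\widehat{X}_{t+1|t}\right]K_{t+1}^{*} \\
&= P_{t+1|t} - P_{t+1|t}H^{*}K_{t+1}^{*} = P_{t+1|t} - K_{t+1}HP_{t+1|t} = (I - K_{t+1}H)P_{t+1|t},
\end{aligned}
\]

where the third line uses $\mathrm{cov}(X, KU) = \mathrm{cov}(X, U)K^{*}$, the fourth line follows from (7.45), and the final equality uses $P_{t+1|t}H^{*}K_{t+1}^{*} = P_{t+1|t}H^{*}[HP_{t+1|t}H^{*} + R]^{-1}HP_{t+1|t} = K_{t+1}HP_{t+1|t}$ (since $P_{t+1|t}$ and $HP_{t+1|t}H^{*} + R$ are symmetric).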
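The variance recursions can be sanity-checked in the scalar case, where all matrices are numbers. The sketch below is plain Python with a hypothetical function name and illustrative parameter values; it iterates the prediction and update variances. One way to confirm the update formula is to compare $(1 - K_{t+1}H)P_{t+1|t}$ with the variance computed directly from the update equation, $(1 - K_{t+1}H)^2P_{t+1|t} + K_{t+1}^2R$.

```python
def scalar_kalman_variances(F, H, Q, R, P0, n_steps):
    """Iterate the Kalman variance recursions for a scalar state.

    Returns a list of (P_pred, K, P_upd) triples, one per time step.
    All inputs are illustrative scalars chosen by the caller.
    """
    P = P0
    out = []
    for _ in range(n_steps):
        P_pred = F * P * F + Q                  # P_{t+1|t} = F P_{t|t} F* + Q
        K = P_pred * H / (H * P_pred * H + R)   # K_{t+1} = P H* [H P H* + R]^{-1}
        P = (1.0 - K * H) * P_pred              # P_{t+1|t+1} = (I - K_{t+1} H) P_{t+1|t}
        out.append((P_pred, K, P))
    return out
```

With $F = H = Q = R = 1$ the predictive variance settles at the positive root of $P^2 - P - 1 = 0$, which is the fixed point obtained by substituting the update step into the prediction step.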

Chapter 8

Estimation of the mean and covariance

Objectives

• To derive the sample autocovariance of a time series, and show that this is a positive definite
sequence.

• To show that the variance of the sample covariance involves fourth order cumulants, which can be unwieldy to estimate in practice. But under linearity the expression for the variance greatly simplifies.

• To show that under linearity the correlation does not involve the fourth order cumulant. This
is the Bartlett formula.

• To use the above results to construct a test for uncorrelatedness of a time series (the Portmanteau test), to understand how this test may be useful for testing for independence in various settings, and to understand situations where the test may fail.

Here we summarize the central limit theorems we will use in this chapter. The simplest is the classical central limit theorem for iid random variables. Suppose that $\{X_i\}$ are iid random variables with mean $\mu$ and variance $\sigma^2 < \infty$. Then

\[
\frac{1}{\sqrt{n}}\sum_{i=1}^{n}(X_i - \mu) \overset{D}{\to} N(0,\sigma^2).
\]

A small variant on the classical CLT is the case that $\{X_i\}$ are independent random variables (but not identically distributed). Suppose $E[X_i] = \mu_i$, $\mathrm{var}[X_i] = \sigma_i^2 < \infty$ and for every $\varepsilon > 0$

\[
\frac{1}{s_n^2}\sum_{i=1}^{n} E\Big[(X_i - \mu_i)^2\, I\big(s_n^{-1}|X_i - \mu_i| > \varepsilon\big)\Big] \to 0
\]

where $s_n^2 = \sum_{i=1}^{n}\sigma_i^2$, which is the variance of $\sum_{i=1}^{n}X_i$ (the above condition is called the Lindeberg condition). Then

\[
\frac{1}{\sqrt{\sum_{i=1}^{n}\sigma_i^2}}\sum_{i=1}^{n}(X_i - \mu_i) \overset{D}{\to} N(0,1).
\]

The Lindeberg condition looks unwieldy, however by using Chebyshev's and Hölder's inequalities it can be reduced to simple bounds on the moments.

Remark 8.0.1 (The aims of the Lindeberg condition) The Lindeberg condition essentially requires a uniform bound in the tails for all the random variables $\{X_i\}$ in the sum. For example, suppose $X_i$ are t-distributed random variables where $X_i$ has a t-distribution with $(2 + i^{-1})$ degrees of freedom. We know that the tails of the t-distribution get thicker the lower the number of df (which can be non-integer-valued). Furthermore, $E[X_i^2] < \infty$ only if $X_i$ has df greater than 2. Therefore, the second moments of $X_i$ exist. But as $i$ gets larger, $X_i$ has thicker tails, making it impossible (I believe) to find a uniform bound such that Lindeberg's condition is satisfied.

Note that the Lindeberg condition generalizes to the conditional Lindeberg condition when
dealing with martingale differences.
We now state a generalisation of this central limit theorem to triangular arrays. Suppose that $\{X_{t,n}\}$ are independent random variables with mean zero. Let $S_n = \sum_{t=1}^{n}X_{t,n}$; we assume that $\mathrm{var}[S_n] = \sum_{t=1}^{n}\mathrm{var}[X_{t,n}] = 1$. For example, in the case that $\{X_t\}$ are iid random variables we can take $S_n = \frac{1}{\sigma\sqrt{n}}\sum_{t=1}^{n}[X_t - \mu] = \sum_{t=1}^{n}X_{t,n}$, where $X_{t,n} = \sigma^{-1}n^{-1/2}(X_t - \mu)$. If for all $\varepsilon > 0$

\[
\sum_{t=1}^{n} E\big[X_{t,n}^2\, I(|X_{t,n}| > \varepsilon)\big] \to 0,
\]

then $S_n \overset{D}{\to} N(0,1)$.

8.1 An estimator of the mean
Suppose we observe $\{Y_t\}_{t=1}^{n}$, where

\[
Y_t = \mu + X_t,
\]

where $\mu$ is the finite mean, $\{X_t\}$ is a zero mean stationary time series with absolutely summable covariances ($\sum_k |\mathrm{cov}(X_0, X_k)| < \infty$). Our aim is to estimate the mean $\mu$. The most obvious estimator is the sample mean, that is $\bar{Y}_n = n^{-1}\sum_{t=1}^{n}Y_t$.

8.1.1 The sampling properties of the sample mean


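We recall from Example 3.3.1 that we obtained an expression for the variance of the sample mean. We showed that

\[
\mathrm{var}(\bar{Y}_n) = \frac{1}{n}c(0) + \frac{2}{n}\sum_{k=1}^{n}\frac{n-k}{n}\,c(k).
\]

Furthermore, if $\sum_k |c(k)| < \infty$, then in Example 3.3.1 we showed that

\[
\mathrm{var}(\bar{Y}_n) \approx \frac{1}{n}c(0) + \frac{2}{n}\sum_{k=1}^{\infty}c(k).
\]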

Thus if the time series has sufficient decay in its correlation structure, the sample mean is a mean squared consistent estimator of the mean. However, one drawback is that the dependency means that one observation will influence the next, and if the influence is positive (seen by a positive covariance), the resulting estimator may have a (much) larger variance than in the iid case.
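To see how much positive dependence can inflate the variance, the exact expression for var(Ȳn) above can be evaluated directly. The following R sketch does this for an AR(1) autocovariance c(k) = σ²φ^|k|/(1 − φ²); the function names var_sample_mean and ar1_acov are hypothetical, and the AR(1) choice is only an illustrative assumption.

```r
# Exact variance of the sample mean from an autocovariance function acov(k):
# var = c(0)/n + (2/n) * sum_{k=1}^{n-1} ((n - k)/n) c(k)
# (the k = n term of the formula is zero, so it is omitted)
var_sample_mean <- function(acov, n) {
  k <- seq_len(n - 1)
  acov(0) / n + (2 / n) * sum((n - k) / n * acov(k))
}

# Autocovariance of a causal AR(1) with coefficient phi and innovation variance sigma2
ar1_acov <- function(phi, sigma2 = 1) {
  function(k) sigma2 * phi^abs(k) / (1 - phi^2)
}

n <- 100
acov_ar1 <- ar1_acov(0.5)
v_ar1 <- var_sample_mean(acov_ar1, n)        # dependent case
v_iid <- acov_ar1(0) / n                     # iid case with the same marginal variance
v_long_run <- (acov_ar1(0) + 2 * sum(acov_ar1(1:1000))) / n   # approximation with the infinite sum
```

With φ = 0.5 the variance of the sample mean is roughly three times that of an iid sample with the same marginal variance.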
The above result does not require any more conditions on the process, besides second order stationarity and summability of its covariance. However, to obtain confidence intervals we require a stronger result, namely a central limit theorem for the sample mean. The above conditions are not enough to give a central limit theorem. To obtain a CLT for sums of the form $\sum_{t=1}^{n}X_t$ we need the following main ingredients:

(i) The variance needs to be finite.

(ii) The dependence between the $X_t$ decreases the further apart in time the observations are. However, this is more than just the correlation, it really means the dependence.

The above conditions are satisfied by linear time series, if the coefficients $\psi_j$ decay sufficiently fast. However, these conditions can also be verified for nonlinear time series (for example the (G)ARCH and Bilinear models described in Chapter 13).
We now state the asymptotic normality result for linear models.

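Theorem 8.1.1 Suppose that $X_t$ is a linear time series, of the form $X_t = \sum_{j=-\infty}^{\infty}\psi_j\varepsilon_{t-j}$, where $\varepsilon_t$ are iid random variables with mean zero and variance one, $\sum_{j=-\infty}^{\infty}|\psi_j| < \infty$ and $\sum_{j=-\infty}^{\infty}\psi_j \neq 0$. Let $Y_t = \mu + X_t$, then we have

\[
\sqrt{n}\big(\bar{Y}_n - \mu\big) \overset{D}{\to} N(0, V)
\]

where $V = c(0) + 2\sum_{k=1}^{\infty}c(k)$.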

PROOF. Later in this course we will give precise details on how to prove asymptotic normality of several different types of estimators in time series. However, we give a small flavour here by showing asymptotic normality of $\bar{Y}_n$ in the special case that $\{X_t\}_{t=1}^{n}$ satisfy an MA($q$) model, then explain how it can be extended to MA($\infty$) processes.
The main idea of the proof is to transform/approximate the average into a quantity that we know is asymptotically normal. We know if $\{\varepsilon_t\}_{t=1}^{n}$ are iid random variables with mean $\mu$ and variance one then

\[
\sqrt{n}(\bar{\varepsilon}_n - \mu) \overset{D}{\to} N(0,1). \tag{8.1}
\]

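We aim to use this result to prove the theorem. Returning to $\bar{Y}_n$, by a change of variables ($s = t - j$) we can show that

\[
\begin{aligned}
\frac{1}{n}\sum_{t=1}^{n}Y_t &= \mu + \frac{1}{n}\sum_{t=1}^{n}X_t = \mu + \frac{1}{n}\sum_{t=1}^{n}\sum_{j=0}^{q}\psi_j\varepsilon_{t-j} \\
&= \mu + \frac{1}{n}\left[\sum_{s=1}^{n-q}\varepsilon_s\Big(\sum_{j=0}^{q}\psi_j\Big) + \sum_{s=-q+1}^{0}\varepsilon_s\Big(\sum_{j=1-s}^{q}\psi_j\Big) + \sum_{s=n-q+1}^{n}\varepsilon_s\Big(\sum_{j=0}^{n-s}\psi_j\Big)\right] \\
&= \mu + \frac{n-q}{n}\Big(\sum_{j=0}^{q}\psi_j\Big)\frac{1}{n-q}\sum_{s=1}^{n-q}\varepsilon_s + \frac{1}{n}\sum_{s=-q+1}^{0}\varepsilon_s\Big(\sum_{j=1-s}^{q}\psi_j\Big) + \frac{1}{n}\sum_{s=n-q+1}^{n}\varepsilon_s\Big(\sum_{j=0}^{n-s}\psi_j\Big) \\
&:= \mu + \frac{(n-q)\Psi}{n}\bar{\varepsilon}_{n-q} + E_1 + E_2,
\end{aligned}
\tag{8.2}
\]

where $\Psi = \sum_{j=0}^{q}\psi_j$. It is straightforward to show that $E|E_1| \leq Cn^{-1}$ and $E|E_2| \leq Cn^{-1}$.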

Finally we examine $\frac{(n-q)\Psi}{n}\bar{\varepsilon}_{n-q}$. We note that if the assumptions are not satisfied and $\sum_{j=0}^{q}\psi_j = 0$ (for example the process $X_t = \varepsilon_t - \varepsilon_{t-1}$), then

\[
\frac{1}{n}\sum_{t=1}^{n}Y_t = \mu + \frac{1}{n}\sum_{s=-q+1}^{0}\varepsilon_s\Big(\sum_{j=1-s}^{q}\psi_j\Big) + \frac{1}{n}\sum_{s=n-q+1}^{n}\varepsilon_s\Big(\sum_{j=0}^{n-s}\psi_j\Big).
\]

This is a degenerate case, since $E_1$ and $E_2$ only consist of a finite number of terms and thus if $\varepsilon_t$ are non-Gaussian these terms will never be asymptotically normal. Therefore, in this case we simply have that $\frac{1}{n}\sum_{t=1}^{n}Y_t = \mu + O_p(\frac{1}{n})$ (this is why in the assumptions it was stated that $\Psi \neq 0$).

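On the other hand, if $\Psi \neq 0$, then the dominating term in $\bar{Y}_n$ is $\bar{\varepsilon}_{n-q}$. From (8.1) it is clear that $\sqrt{n-q}\,\bar{\varepsilon}_{n-q} \overset{D}{\to} N(0,1)$ as $n \to \infty$. However, for finite $q$, $\sqrt{(n-q)/n} \to 1$, therefore $\sqrt{n}\,\bar{\varepsilon}_{n-q} \overset{D}{\to} N(0,1)$. Altogether, substituting $E|E_1| \leq Cn^{-1}$ and $E|E_2| \leq Cn^{-1}$ into (8.2) gives

\[
\sqrt{n}\big(\bar{Y}_n - \mu\big) = \Psi\sqrt{n}\,\bar{\varepsilon}_{n-q} + O_p\Big(\frac{1}{\sqrt{n}}\Big) \overset{D}{\to} N\big(0, \Psi^2\big).
\]

With a little work, it can be shown that $\Psi^2 = V$.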
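The identity Ψ² = V can be checked numerically for a particular MA(q) process. The R sketch below uses an arbitrary MA(2) with unit innovation variance; the coefficients and the helper name ma_acov are illustrative assumptions, not taken from the notes.

```r
# Autocovariance of X_t = sum_{j=0}^q psi_j eps_{t-j} with var(eps_t) = 1:
# c(k) = sum_{j=0}^{q-|k|} psi_j psi_{j+|k|}, and c(k) = 0 for |k| > q
ma_acov <- function(psi, k) {
  q <- length(psi) - 1
  k <- abs(k)
  if (k > q) return(0)
  sum(psi[1:(q + 1 - k)] * psi[(1 + k):(q + 1)])
}

psi <- c(1, 0.5, -0.3)            # psi_0, psi_1, psi_2 (arbitrary illustrative values)
q <- length(psi) - 1

# Long run variance V = c(0) + 2 * sum_{k >= 1} c(k); only lags 1,...,q contribute
V <- ma_acov(psi, 0) + 2 * sum(sapply(1:q, function(k) ma_acov(psi, k)))
Psi2 <- sum(psi)^2                # Psi^2 = (sum_j psi_j)^2
```

Here c(0) = 1.34, c(1) = 0.35 and c(2) = −0.3, so V = 1.34 + 2(0.05) = 1.44, which equals Ψ² = 1.2².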


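Observe that the proof simply approximated the sum by a sum of iid random variables. In the case that the process is a MA($\infty$) or linear time series, a similar method is used. More precisely, we have

\[
\begin{aligned}
\sqrt{n}\big(\bar{Y}_n - \mu\big) &= \frac{1}{\sqrt{n}}\sum_{t=1}^{n}\sum_{j=0}^{\infty}\psi_j\varepsilon_{t-j} = \frac{1}{\sqrt{n}}\sum_{j=0}^{\infty}\psi_j\sum_{s=1-j}^{n-j}\varepsilon_s \\
&= \frac{1}{\sqrt{n}}\sum_{j=0}^{\infty}\psi_j\sum_{t=1}^{n}\varepsilon_t + R_n
\end{aligned}
\]

where

\[
\begin{aligned}
R_n &= \frac{1}{\sqrt{n}}\sum_{j=0}^{\infty}\psi_j\Big(\sum_{s=1-j}^{n-j}\varepsilon_s - \sum_{s=1}^{n}\varepsilon_s\Big) \\
&= \frac{1}{\sqrt{n}}\sum_{j=0}^{n}\psi_j\Big(\sum_{s=1-j}^{0}\varepsilon_s - \sum_{s=n-j+1}^{n}\varepsilon_s\Big) + \frac{1}{\sqrt{n}}\sum_{j=n+1}^{\infty}\psi_j\Big(\sum_{s=1-j}^{n-j}\varepsilon_s - \sum_{s=1}^{n}\varepsilon_s\Big) \\
&:= R_{n1} + R_{n2} + R_{n3} + R_{n4}.
\end{aligned}
\]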

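We will show that $E[R_{n,j}^2] = o(1)$ for $1 \leq j \leq 4$. We start with $R_{n,1}$. Since $\sum_{s=1-j}^{0}\varepsilon_s$ contains $j$ terms, each with variance one, we have

\[
\begin{aligned}
E[R_{n,1}^2] &= \frac{1}{n}\sum_{j_1,j_2=0}^{n}\psi_{j_1}\psi_{j_2}\,\mathrm{cov}\Big(\sum_{s_1=1-j_1}^{0}\varepsilon_{s_1}, \sum_{s_2=1-j_2}^{0}\varepsilon_{s_2}\Big) \\
&= \frac{1}{n}\sum_{j_1,j_2=0}^{n}\psi_{j_1}\psi_{j_2}\min[j_1, j_2] \\
&= \frac{1}{n}\sum_{j=0}^{n}j\,\psi_j^2 + \frac{2}{n}\sum_{j_1=0}^{n}\sum_{j_2=0}^{j_1-1}\psi_{j_1}\psi_{j_2}\,j_2 \\
&\leq \frac{1}{n}\sum_{j=0}^{n}j\,\psi_j^2 + \frac{2}{n}\Big(\sum_{j=0}^{\infty}|\psi_j|\Big)\sum_{j_1=0}^{n}|j_1\psi_{j_1}|.
\end{aligned}
\]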

Since $\sum_{j=0}^{\infty}|\psi_j| < \infty$ and, thus, $\sum_{j=0}^{\infty}|\psi_j|^2 < \infty$, then by dominated convergence $\sum_{j=0}^{n}[1 - j/n]|\psi_j| \to \sum_{j=0}^{\infty}|\psi_j|$ and $\sum_{j=0}^{n}[1 - j/n]\psi_j^2 \to \sum_{j=0}^{\infty}\psi_j^2$ as $n \to \infty$. This implies that $\sum_{j=0}^{n}(j/n)|\psi_j| \to 0$ and $\sum_{j=0}^{n}(j/n)\psi_j^2 \to 0$. Substituting this into the above bounds for $E[R_{n,1}^2]$ we immediately obtain $E[R_{n,1}^2] = o(1)$. Using the same argument we obtain the same bound for $R_{n,2}$, $R_{n,3}$ and $R_{n,4}$. Thus

\[
\sqrt{n}\big(\bar{Y}_n - \mu\big) = \Psi\frac{1}{\sqrt{n}}\sum_{t=1}^{n}\varepsilon_t + o_p(1)
\]

and the result then immediately follows. □

Estimation of the so called long run variance (given in Theorem 8.1.1) can be difficult. There are various methods that can be used, such as estimating the spectral density function (which we define in Chapter 10) at zero. Another approach proposed in Lobato (2001) and Shao (2010) is to use the method of so called self-normalization which circumvents the need to estimate the long run variance, by pivotalising the statistic.

8.2 An estimator of the covariance


Suppose we observe $\{Y_t\}_{t=1}^{n}$ and wish to estimate the covariance $c(k) = \mathrm{cov}(Y_0, Y_k)$ from the observations. A plausible estimator is

\[
\hat{c}_n(k) = \frac{1}{n}\sum_{t=1}^{n-|k|}(Y_t - \bar{Y}_n)(Y_{t+|k|} - \bar{Y}_n), \tag{8.3}
\]

since $E[(Y_t - \bar{Y}_n)(Y_{t+|k|} - \bar{Y}_n)] \approx c(k)$. Of course if the mean of $Y_t$ is known to be zero ($Y_t = X_t$), then the covariance estimator is

\[
\hat{c}_n(k) = \frac{1}{n}\sum_{t=1}^{n-|k|}X_tX_{t+|k|}. \tag{8.4}
\]

The eagle-eyed amongst you may wonder why we don't use $\frac{1}{n-|k|}\sum_{t=1}^{n-|k|}X_tX_{t+|k|}$, when $\hat{c}_n(k)$ is a biased estimator, whereas $\frac{1}{n-|k|}\sum_{t=1}^{n-|k|}X_tX_{t+|k|}$ is not. However $\hat{c}_n(k)$ has some very nice properties which we discuss in the lemma below. The sample autocorrelation is the ratio

\[
\hat{\rho}_n(r) = \frac{\hat{c}_n(r)}{\hat{c}_n(0)}.
\]

Most statistical software will have functions that evaluate the sample autocorrelation.
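As an illustration, the estimator (8.3) can be coded directly and compared with R's built-in acf function, which also uses the divisor n and subtracts the sample mean by default. The helper name sample_acov and the simulated data are assumptions made for this sketch.

```r
# Sample autocovariance (8.3): divisor n (not n - |k|), data centred by the sample mean
sample_acov <- function(y, k) {
  n <- length(y)
  k <- abs(k)
  if (k > n - 1) return(0)
  yc <- y - mean(y)
  sum(yc[1:(n - k)] * yc[(1 + k):n]) / n
}

set.seed(10)
y <- rnorm(50)                       # illustrative data
by_hand <- sapply(0:5, function(k) sample_acov(y, k))

# Built-in versions from the stats package
acov_builtin <- as.numeric(acf(y, lag.max = 5, type = "covariance", plot = FALSE)$acf)
acor_builtin <- as.numeric(acf(y, lag.max = 5, plot = FALSE)$acf)
acor_by_hand <- by_hand / by_hand[1] # sample autocorrelation
```

The two computations agree up to floating point error.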

Lemma 8.2.1 Suppose we define the empirical covariances

\[
\hat{c}_n(k) = \begin{cases} \frac{1}{n}\sum_{t=1}^{n-|k|}X_tX_{t+|k|} & |k| \leq n-1 \\ 0 & \text{otherwise} \end{cases}
\]

then $\{\hat{c}_n(k)\}$ is a positive definite sequence. Therefore, using Lemma 3.4.1 there exists a stationary time series $\{Z_t\}$ which has the covariance $\hat{c}_n(k)$.

PROOF. There are various ways to show that {ĉn (k)} is a positive definite sequence. One method
uses that the spectral density corresponding to this sequence is non-negative, we give this proof in
Section 10.4.1.
Here we give an alternative proof. We recall a sequence is semi-positive definite if for any vector $a = (a_1, \ldots, a_n)'$ we have

\[
\sum_{k_1,k_2=1}^{n}a_{k_1}a_{k_2}\hat{c}_n(k_1 - k_2) = a'\widehat{\Sigma}_n a \geq 0
\]

where

\[
\widehat{\Sigma}_n = \begin{pmatrix}
\hat{c}_n(0) & \hat{c}_n(1) & \hat{c}_n(2) & \ldots & \hat{c}_n(n-1) \\
\hat{c}_n(1) & \hat{c}_n(0) & \hat{c}_n(1) & \ldots & \hat{c}_n(n-2) \\
\vdots & \vdots & \ddots & \ddots & \vdots \\
\hat{c}_n(n-1) & \hat{c}_n(n-2) & \ldots & \ldots & \hat{c}_n(0)
\end{pmatrix},
\]
noting that $\hat{c}_n(k) = \frac{1}{n}\sum_{t=1}^{n-|k|}X_tX_{t+|k|}$. However, $\hat{c}_n(k)$ has a very interesting construction: it can be shown that the above covariance matrix is $\widehat{\Sigma}_n = n^{-1}\mathbf{X}_n\mathbf{X}_n'$, where $\mathbf{X}_n$ is a $n \times 2n$ matrix with

\[
\mathbf{X}_n = \begin{pmatrix}
0 & 0 & \ldots & 0 & X_1 & X_2 & \ldots & X_{n-1} & X_n \\
0 & 0 & \ldots & X_1 & X_2 & \ldots & X_{n-1} & X_n & 0 \\
\vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots \\
X_1 & X_2 & \ldots & X_{n-1} & X_n & 0 & \ldots & \ldots & 0
\end{pmatrix}.
\]

Using the above we have

\[
a'\widehat{\Sigma}_n a = n^{-1}a'\mathbf{X}_n\mathbf{X}_n'a = n^{-1}\|\mathbf{X}_n'a\|_2^2 \geq 0.
\]

This proves that $\{\hat{c}_n(k)\}$ is a positive definite sequence.

Finally, by using Theorem 3.4.1, there exists a stochastic process with $\{\hat{c}_n(k)\}$ as its autocovariance function. □
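The construction in the proof can be verified numerically. The following R sketch builds the Toeplitz matrix of sample autocovariances and a shifted data matrix for a short simulated series, and checks that the former is the scaled cross-product of the latter. The variable names and the particular padding of the shifted matrix (row i has n − i + 1 leading zeros) are assumptions made for this illustration.

```r
set.seed(1)
n <- 8
x <- rnorm(n)                      # illustrative mean zero data

# Sample autocovariances c_hat(0), ..., c_hat(n-1) with divisor n
c_hat <- sapply(0:(n - 1), function(k) sum(x[1:(n - k)] * x[(1 + k):n]) / n)
Sigma_hat <- toeplitz(c_hat)       # n x n matrix with (i, j) entry c_hat(|i - j|)

# n x 2n matrix whose rows are shifted copies of the data padded with zeros
Xn <- matrix(0, n, 2 * n)
for (i in 1:n) {
  Xn[i, (n - i + 2):(2 * n - i + 1)] <- x
}

# Sigma_hat should equal Xn Xn' / n, hence be positive semi-definite
max_diff <- max(abs(Sigma_hat - Xn %*% t(Xn) / n))
min_eig <- min(eigen(Sigma_hat, symmetric = TRUE)$values)
```

Replacing the divisor n by n − |k| in c_hat breaks this factorisation, which is one reason the biased estimator is preferred.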

8.2.1 Asymptotic properties of the covariance estimator


The main reason we construct an estimator is either for testing or constructing a confidence interval for the parameter of interest. To do this we need the variance and distribution of the estimator. It is usually impossible to derive the finite sample distribution, thus we look at the asymptotic distribution. Besides showing asymptotic normality, it is important to derive an expression for the variance. In an ideal world the variance will be simple and will not involve unknown parameters. Usually in time series this will not be the case, and the variance will involve several (often an infinite) number of parameters which are not straightforward to estimate. Later in this section we show that the variance of the sample covariance can be extremely complicated. However, a substantial simplification can arise if we consider only the sample correlation (not covariance) and assume linearity of the time series. This result is known as Bartlett's formula (you may have come across Maurice Bartlett before, besides his fundamental contributions in time series he is well known for proposing the famous Bartlett correction). This example demonstrates how the assumption of linearity can really simplify problems in time series analysis and also how we can circumvent certain problems which arise by making slight modifications of the estimator (such as going from covariance to correlation).

The following theorem gives the asymptotic sampling properties of the covariance estimator (8.3). One proof of the result can be found in Brockwell and Davis (1998), Chapter 8, and Fuller (1995), but it goes back to Bartlett (indeed it is called Bartlett's formula). We prove the result in Section ??.

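Theorem 8.2.1 Suppose $\{X_t\}$ is a mean zero linear stationary time series where

\[
X_t = \mu + \sum_{j=-\infty}^{\infty}\psi_j\varepsilon_{t-j},
\]

where $\sum_j |\psi_j| < \infty$, $\{\varepsilon_t\}$ are iid random variables with $E(\varepsilon_t) = 0$ and $E(\varepsilon_t^4) < \infty$. Suppose we observe $\{X_t : t = 1, \ldots, n\}$ and use (8.3) as an estimator of the covariance $c(k) = \mathrm{cov}(X_0, X_k)$. Define $\hat{\rho}_n(r) = \hat{c}_n(r)/\hat{c}_n(0)$ as the sample correlation. Then for each $h \in \{1, \ldots, n\}$

\[
\sqrt{n}\big(\hat{\boldsymbol{\rho}}_n(h) - \boldsymbol{\rho}(h)\big) \overset{D}{\to} N(0, W_h) \tag{8.5}
\]

where $\hat{\boldsymbol{\rho}}_n(h) = (\hat{\rho}_n(1), \ldots, \hat{\rho}_n(h))$, $\boldsymbol{\rho}(h) = (\rho(1), \ldots, \rho(h))$ and

\[
(W_h)_{ij} = \sum_{k=-\infty}^{\infty}\Big\{\rho(k+i)\rho(k+j) + \rho(k-i)\rho(k+j) + 2\rho(i)\rho(j)\rho^2(k) - 2\rho(i)\rho(k)\rho(k+j) - 2\rho(j)\rho(k)\rho(k+i)\Big\}. \tag{8.6}
\]

Equation (8.6) is known as Bartlett's formula.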
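Bartlett's formula (8.6) is straightforward to evaluate numerically once the autocorrelation function is specified. The R sketch below truncates the infinite sum at |k| ≤ K; the function name bartlett_W, the truncation point and the two example autocorrelation functions (iid noise and an AR(1)) are assumptions made for illustration.

```r
# Evaluate (W_h)_{ij} in (8.6) for a vectorised autocorrelation function rho(k),
# truncating the sum over k to -K, ..., K
bartlett_W <- function(rho, h, K = 200) {
  k <- -K:K
  W <- matrix(0, h, h)
  for (i in 1:h) {
    for (j in 1:h) {
      W[i, j] <- sum(rho(k + i) * rho(k + j) + rho(k - i) * rho(k + j) +
                     2 * rho(i) * rho(j) * rho(k)^2 -
                     2 * rho(i) * rho(k) * rho(k + j) -
                     2 * rho(j) * rho(k) * rho(k + i))
    }
  }
  W
}

rho_iid <- function(k) as.numeric(k == 0)        # uncorrelated sequence
rho_ar1 <- function(phi) function(k) phi^abs(k)  # AR(1) autocorrelation

W_iid <- bartlett_W(rho_iid, h = 3)
W_ar1 <- bartlett_W(rho_ar1(0.5), h = 3)
```

For the uncorrelated sequence W_h reduces to the identity matrix, while for the AR(1) with φ = 0.5 the (1,1) entry is 1 − φ² = 0.75, so the usual ±1.96/√n bands would be too wide at lag one.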


In Section 8.3 we apply the method for checking for correlation in a time series. We first show
how the expression for the asymptotic variance is obtained.

8.2.2 The asymptotic properties of the sample autocovariance and autocorrelation
In order to show asymptotic normality of the autocovariance and autocorrelation we require the following result. For any coefficients $\{\alpha_{r_j}\}_{j=0}^{d} \in \mathbb{R}^{d+1}$ (such that $\sigma_\alpha^2$, defined below, is non-zero) we have

\[
\sqrt{n}\left(\sum_{j=0}^{d}\alpha_{r_j}\frac{1}{n}\sum_{t=1}^{n-|r_j|}X_tX_{t+r_j} - \sum_{j=0}^{d}\alpha_{r_j}c(r_j)\right) \overset{D}{\to} N\big(0, \sigma_\alpha^2\big), \tag{8.7}
\]

for some $\sigma_\alpha^2 < \infty$. This result can be proved under a whole host of conditions including

• The time series is linear, $X_t = \sum_j \psi_j\varepsilon_{t-j}$, where $\{\varepsilon_t\}$ are iid, $\sum_j |\psi_j| < \infty$ and $E[\varepsilon_t^4] < \infty$.

• α and β-mixing with sufficient mixing rates and moment conditions (which are linked to the mixing rates).

• Physical dependence

• Other dependence measures.

All these criteria essentially show that the time series $\{X_t\}$ becomes "increasingly independent" the further apart the observations are in time. How this dependence is measured depends on the criterion, but it is essential for proving the CLT. We do not prove the above. Our focus in this section will be on the variance of the estimator.

Theorem 8.2.2 Suppose that condition (8.7) is satisfied (and $\sum_{h\in\mathbb{Z}}|c(h)| < \infty$ and $\sum_{h_1,h_2,h_3}|\kappa_4(h_1,h_2,h_3)| < \infty$; this is a cumulant, which we define in the section below), then

\[
\sqrt{n}\begin{pmatrix}
\hat{c}_n(0) - c(0) \\
\hat{c}_n(r_1) - c(r_1) \\
\vdots \\
\hat{c}_n(r_d) - c(r_d)
\end{pmatrix} \overset{D}{\to} N(0, V_{d+1})
\]

where

\[
(V_{d+1})_{i,j} = \sum_{k=-\infty}^{\infty}c(k)c(k + r_{i-1} - r_{j-1}) + \sum_{k=-\infty}^{\infty}c(k + r_{i-1})c(k - r_{j-1}) + \sum_{k=-\infty}^{\infty}\kappa_4(r_{i-1}, k, k + r_{j-1}) \tag{8.8}
\]

where we set $r_0 = 0$.

PROOF. The first part of the proof simply follows from (8.7). The derivation for $V_{d+1}$ is given in Section 8.2.3, below. □

In order to prove the results below, we partition $V_{d+1}$ into a term which contains the covariances and the term which contains the fourth order cumulants (which we have yet to define). Let $V_{d+1} = C_{d+1} + K_{d+1}$, where

\[
\begin{aligned}
(C_{d+1})_{i,j} &= \sum_{k=-\infty}^{\infty}c(k)c(k + r_{i-1} - r_{j-1}) + \sum_{k=-\infty}^{\infty}c(k + r_{i-1})c(k - r_{j-1}) \\
(K_{d+1})_{i,j} &= \sum_{k=-\infty}^{\infty}\kappa_4(r_{i-1}, k, k + r_{j-1}),
\end{aligned}
\tag{8.9}
\]

and set $r_0 = 0$. So far we have not defined $\kappa_4$. However, it is worth bearing in mind that if the time series $\{X_t\}$ is Gaussian, then this term is zero i.e. $K_{d+1} = 0$. Thus estimation of the variance of the sample covariance for Gaussian time series is relatively straightforward as it only depends on the covariance.
We now derive the sampling properties of the sample autocorrelation.
We now derive the sampling properties of the sample autocorrelation.

Lemma 8.2.2 Suppose that the conditions in Theorem 8.2.2 hold. Then

\[
\sqrt{n}\begin{pmatrix}
\hat{\rho}_n(r_1) - \rho(r_1) \\
\vdots \\
\hat{\rho}_n(r_d) - \rho(r_d)
\end{pmatrix} \overset{D}{\to} N\big(0, G(C_{d+1} + K_{d+1})G'\big)
\]

where $r_j \neq 0$, $C_{d+1}$ and $K_{d+1}$ are defined as in equation (8.9) and $G$ is a $d \times (d+1)$ dimensional matrix where

\[
G = \frac{1}{c(0)}\begin{pmatrix}
-\rho(r_1) & 1 & 0 & \ldots & 0 \\
-\rho(r_2) & 0 & 1 & \ldots & 0 \\
\vdots & \vdots & & \ddots & 0 \\
-\rho(r_d) & 0 & \ldots & \ldots & 1
\end{pmatrix}.
\]

PROOF. We define the vector function $g : \mathbb{R}^{d+1} \to \mathbb{R}^{d}$

\[
g(x_0, x_1, \ldots, x_d) = \Big(\frac{x_1}{x_0}, \ldots, \frac{x_d}{x_0}\Big).
\]

We observe that $(\hat{\rho}(r_1), \ldots, \hat{\rho}(r_d)) = g(\hat{c}_n(0), \hat{c}_n(r_1), \ldots, \hat{c}_n(r_d))$. Thus

\[
\nabla g(c(0), \ldots, c(r_d)) = \begin{pmatrix}
-\frac{c(r_1)}{c(0)^2} & \frac{1}{c(0)} & 0 & \ldots & 0 \\
-\frac{c(r_2)}{c(0)^2} & 0 & \frac{1}{c(0)} & \ldots & 0 \\
\vdots & \vdots & & \ddots & 0 \\
-\frac{c(r_d)}{c(0)^2} & 0 & \ldots & \ldots & \frac{1}{c(0)}
\end{pmatrix} = G.
\]

Therefore, by using Theorem 8.2.2 together with the continuous mapping theorem we obtain the result. □
Comparing Theorem 8.2.2 to the asymptotically pivotal result $\sqrt{n}\rho_{h,n} \overset{D}{\to} N(0, I_h)$ in (??) it is clear that additional assumptions are required for the result to be pivotal. Therefore, in the following theorem we consider the case that $\{X_t\}$ is a linear time series, which includes the special case that $\{X_t\}$ are iid random variables. First, we make some observations about $G$ and $GC_{d+1}G'$. Note that the assumption of linearity of a time series can be checked (see, for example, Subba Rao and Gabr (1980)).

Remark 8.2.1 (i) Basic algebra gives

\[
(GC_{d+1}G')_{r_1,r_2} = \sum_{k=-\infty}^{\infty}\Big\{\rho(k+r_1)\rho(k+r_2) + \rho(k-r_1)\rho(k+r_2) + 2\rho(r_1)\rho(r_2)\rho^2(k) - 2\rho(r_1)\rho(k)\rho(k+r_2) - 2\rho(r_2)\rho(k)\rho(k+r_1)\Big\}. \tag{8.10}
\]

(ii) Though it may not seem directly relevant, it is easily seen that the null space of the matrix $G$ is

\[
N(G) = \big\{\alpha c_{d+1};\ \alpha \in \mathbb{R}\big\}
\]

where $c_{d+1}' = (c(0), c(r_1), \ldots, c(r_d))$. This property will be useful in proving Bartlett's formula (below).

Theorem 8.2.3 Suppose $\{X_t\}$ is a mean zero linear stationary time series where

\[
X_t = \sum_{j=-\infty}^{\infty}\psi_j\varepsilon_{t-j},
\]

with $\sum_j |\psi_j| < \infty$, $\{\varepsilon_t\}$ are iid random variables with $E(\varepsilon_t) = 0$ and $E(\varepsilon_t^4) < \infty$. Suppose we observe $\{X_t : t = 1, \ldots, n\}$ and use (8.3) as an estimator of the covariance $c(k) = \mathrm{cov}(X_0, X_k)$. Then we have

\[
\sqrt{n}\begin{pmatrix}
\hat{\rho}_n(r_1) - \rho(r_1) \\
\vdots \\
\hat{\rho}_n(r_d) - \rho(r_d)
\end{pmatrix} \overset{D}{\to} N\big(0, GC_{d+1}G'\big),
\]

where an explicit expression for $GC_{d+1}G'$ is given in (8.10) (this is called Bartlett's formula).

PROOF. To prove the result we use Lemma 8.2.2. However, we observe that the term $GK_{d+1}G'$ has disappeared. In Section 8.2.3 we show that for (univariate) linear processes $GK_{d+1}G' = 0$. □

Remark 8.2.2 • Under linearity of the time series, Brockwell and Davis (2002), Theorem 7.2.2 show that the above theorem also holds for linear time series whose fourth moment does not exist. This result requires slightly stronger assumptions on the coefficients $\{\psi_j\}$.

• This elusive fourth cumulant term does not disappear for vector linear processes.

Using Theorem 8.2.3, we can prove (??) for iid time series, since iid random variables are a special case of a linear time series ($\psi_j = 0$ for all $j \neq 0$) with $c(r) = 0$ for all $r \neq 0$. Substituting this into Theorem 8.2.3 gives

\[
\sqrt{n}\begin{pmatrix}
\hat{\rho}_n(1) \\
\vdots \\
\hat{\rho}_n(h)
\end{pmatrix} \overset{D}{\to} N(0, I_h).
\]

Using this result we obtain the critical values in the ACF plots and the Box-Pierce test. How-
ever, from Lemma 8.2.2 we observe that the results can be misleading for time series which are
uncorrelated but not necessarily iid. Before discussing this, we first prove the above results. These
calculations are a little tedious, but they are useful in understanding how to deal with many different
types of statistics of a time series (not just the sample autocovariances).
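The way this limit is used in practice can be sketched in R. Under the iid null the sample autocorrelations are approximately independent N(0, 1/n), which gives the ±1.96/√n bands on an ACF plot and a chi-squared limit for the Box-Pierce statistic n Σ ρ̂n(r)². The simulated data and variable names below are illustrative assumptions; Box.test is the function in the stats package.

```r
set.seed(42)
n <- 200
x <- rnorm(n)                      # iid data, so the null hypothesis holds
h <- 10                            # number of lags tested

# Sample autocorrelations at lags 1, ..., h (lag 0 is dropped)
rho_hat <- as.numeric(acf(x, lag.max = h, plot = FALSE)$acf)[-1]

# Pointwise 95% band implied by sqrt(n) * rho_hat -> N(0, I_h)
band <- qnorm(0.975) / sqrt(n)
n_outside <- sum(abs(rho_hat) > band)

# Box-Pierce statistic and its asymptotic chi-squared p-value
Q <- n * sum(rho_hat^2)
p_value <- 1 - pchisq(Q, df = h)

# The same test using the built-in function
bp <- Box.test(x, lag = h, type = "Box-Pierce")
```

Because the bands and the chi-squared limit rely on the identity covariance matrix, they are calibrated for iid data; for uncorrelated but dependent series the fourth order cumulant term need not vanish and the test can be misleading.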

8.2.3 The covariance of the sample autocovariance


Our aim in this section is to derive an expression for $\mathrm{cov}(\hat{c}_n(r_1), \hat{c}_n(r_2))$. To simplify notation we focus on the variance ($r_1 = r_2$), noting that the same calculations carry over to the covariance.

Approach 1 Use the moment expansion of a covariance

\[
\begin{aligned}
\mathrm{var}[\hat{c}_n(r)] &= \frac{1}{n^2}\sum_{t,\tau=1}^{n-|r|}\mathrm{cov}(X_tX_{t+r}, X_\tau X_{\tau+r}) \\
&= \frac{1}{n^2}\sum_{t,\tau=1}^{n-|r|}\big(E(X_tX_{t+r}X_\tau X_{\tau+r}) - E(X_tX_{t+r})E(X_\tau X_{\tau+r})\big) \\
&= \frac{1}{n^2}\sum_{t,\tau=1}^{n-|r|}\big(E(X_tX_{t+r}X_\tau X_{\tau+r}) - c(r)^2\big).
\end{aligned}
\]

Studying the above and comparing it to the expansion of $\mathrm{var}(\bar{X})$ when the $\{X_t\}$ are iid, we would expect that $\mathrm{var}[\hat{c}_n(r)] = O(n^{-1})$. But it is difficult to see what is happening with this expansion. Though it is possible to use this method, we use an alternative expansion in terms of cumulants.

Approach 2 Use an expansion of the covariance of products in terms of products of cumulants.

Suppose $A$, $B$, $C$ and $D$ are zero mean (real) random variables. Then

\[
\mathrm{cov}(AB, CD) = \mathrm{cov}(A, C)\,\mathrm{cov}(B, D) + \mathrm{cov}(A, D)\,\mathrm{cov}(B, C) + \mathrm{cum}(A, B, C, D), \tag{8.11}
\]

where each covariance is itself a (second order) cumulant. This result can be generalized to higher order cumulants, see Brillinger (2001).
Below, we formally define a cumulant and explain why it is a useful tool in time series.

Background: What are cumulants?

To understand what they are and why they are used, we focus the following discussion on fourth order cumulants.
The joint cumulant of $X_t, X_{t+k_1}, X_{t+k_2}, X_{t+k_3}$ (denoted as $\mathrm{cum}(X_t, X_{t+k_1}, X_{t+k_2}, X_{t+k_3})$) is the coefficient of the term $s_1s_2s_3s_4$ in the power series expansion of

\[
K(s_1, s_2, s_3, s_4) = \log E\big[e^{is_1X_t + is_2X_{t+k_1} + is_3X_{t+k_2} + is_4X_{t+k_3}}\big].
\]

Thus

\[
\mathrm{cum}(X_t, X_{t+k_1}, X_{t+k_2}, X_{t+k_3}) = \frac{\partial^4 K(s_1, s_2, s_3, s_4)}{\partial s_1\partial s_2\partial s_3\partial s_4}\Big|_{s_1, s_2, s_3, s_4 = 0}.
\]

It looks very similar to the definition of moments and there is a one to one correspondence between the moments and the cumulants. It can be shown that the cumulant corresponding to the coefficient of $s_is_j$ is $\mathrm{cum}(X_{t+k_i}, X_{t+k_j})$ (the covariance is often called the second order cumulant).
Properties

• If $X_t$ is independent of $X_{t+k_1}, X_{t+k_2}, X_{t+k_3}$ then

\[
\mathrm{cum}(X_t, X_{t+k_1}, X_{t+k_2}, X_{t+k_3}) = 0.
\]

This is because the log of the corresponding characteristic function is

\[
\log E\big[e^{is_1X_t + is_2X_{t+k_1} + is_3X_{t+k_2} + is_4X_{t+k_3}}\big] = \log E\big[e^{is_1X_t}\big] + \log E\big[e^{is_2X_{t+k_1} + is_3X_{t+k_2} + is_4X_{t+k_3}}\big].
\]

Differentiating the above with respect to $s_1, s_2, s_3, s_4$ gives zero.

• If $X_t, X_{t+k_1}, X_{t+k_2}, X_{t+k_3}$ is multivariate Gaussian, then all cumulants higher than order 2 are zero. This is easily seen, by recalling that the characteristic function of a multivariate normal distribution is

\[
C(s_1, s_2, s_3, s_4) = \exp\Big(i\mu's - \frac{1}{2}s'\Sigma s\Big)
\]

where $\mu$ and $\Sigma$ are the mean and variance of $X_t, X_{t+k_1}, X_{t+k_2}, X_{t+k_3}$ respectively. Based on the above, we observe that $\log C(s_1, s_2, s_3, s_4)$ is an order two multivariate polynomial.

Note that this property can be used to prove CLTs.

• Cumulants satisfy the following multilinear property

\[
\mathrm{cum}(aX_1 + bY_1 + c, X_2, X_3, X_4) = a\,\mathrm{cum}(X_1, X_2, X_3, X_4) + b\,\mathrm{cum}(Y_1, X_2, X_3, X_4)
\]

where $a$, $b$ and $c$ are scalars.

• The influence of stationarity:

From the definition of the characteristic function, if the time series $\{X_t\}$ is strictly stationary, then

\[
\log E\big[e^{is_1X_t + is_2X_{t+k_1} + is_3X_{t+k_2} + is_4X_{t+k_3}}\big] = \log E\big[e^{is_1X_0 + is_2X_{k_1} + is_3X_{k_2} + is_4X_{k_3}}\big].
\]

Thus, analogous to covariances, cumulants are invariant to shift

\[
\mathrm{cum}(X_t, X_{t+k_1}, X_{t+k_2}, X_{t+k_3}) = \mathrm{cum}(X_0, X_{k_1}, X_{k_2}, X_{k_3}) = \kappa_4(k_1, k_2, k_3).
\]

Comparisons between the covariance and higher order cumulants

(a) The covariance is invariant to ordering cov[Xt , Xt+k ] = cov[Xt+k , Xt ].

Like the covariance, the joint cumulant cum[Xt , Xt+k1 , Xt+k2 , Xt+k3 ] is also invariant to order.

(b) The covariance $\mathrm{cov}[X_t, X_{t+k}]$ is a measure of linear dependence between $X_t$ and $X_{t+k}$.

The cumulant $\mathrm{cum}[X_t, X_{t+k_1}, X_{t+k_2}, X_{t+k_3}]$ is measuring the dependence between $X_t, X_{t+k_1}, X_{t+k_2}, X_{t+k_3}$ in "three directions" (though as far as I am aware, unlike the covariance it has no clear geometric interpretation). For example, if $\{X_t\}$ is a zero mean time series then

\[
\begin{aligned}
\mathrm{cum}[X_t, X_{t+k_1}, X_{t+k_2}, X_{t+k_3}] = {} & E[X_tX_{t+k_1}X_{t+k_2}X_{t+k_3}] - E[X_tX_{t+k_1}]E[X_{t+k_2}X_{t+k_3}] \\
& - E[X_tX_{t+k_2}]E[X_{t+k_1}X_{t+k_3}] - E[X_tX_{t+k_3}]E[X_{t+k_1}X_{t+k_2}].
\end{aligned}
\tag{8.12}
\]

Unlike the covariance, the cumulants do not seem to satisfy any non-negative definite conditions.

(c) In time series we usually assume that the covariance decays over time i.e. if $k > 0$

\[
|\mathrm{cov}[X_t, X_{t+k}]| \leq \alpha(k)
\]

where $\alpha(k)$ is a positive sequence such that $\sum_k \alpha(k) < \infty$. This can easily be proved for linear time series with $\sum_j |\psi_j| < \infty$ (see footnote 1).

Footnote 1: This is easily shown by noting that if $X_t = \sum_j \psi_j\varepsilon_{t-j}$ then $\mathrm{cov}(X_t, X_{t+h}) = \sigma^2\sum_j \psi_j\psi_{j+h}$. Thus

\[
\sum_{h=-\infty}^{\infty}|c(h)| = \sigma^2\sum_{h=-\infty}^{\infty}\Big|\sum_j \psi_j\psi_{j+h}\Big| \leq \sigma^2\Big(\sum_{j=-\infty}^{\infty}|\psi_j|\Big)^2 < \infty.
\]

For a large class of time series, the analogous result is true for cumulants. I.e. if $k_1 \leq k_2 \leq k_3$ then

\[
|\mathrm{cum}[X_t, X_{t+k_1}, X_{t+k_2}, X_{t+k_3}]| \leq \alpha(k_1)\alpha(k_2 - k_1)\alpha(k_3 - k_2) \tag{8.13}
\]

where $\sum_{k=-\infty}^{\infty}\alpha(k) < \infty$.

(d) Often in proofs we use the assumption $\sum_r |c(r)| < \infty$. An analogous assumption for fourth order cumulants is $\sum_{k_1,k_2,k_3}|\kappa_4(k_1, k_2, k_3)| < \infty$. Based on the inequality (8.13), this assumption is often reasonable (such assumptions are often called Brillinger-type mixing conditions).

Point (c) and (d) are very important in the derivation of sampling properties of an estimator.
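The moment expansion (8.12) and the product identity (8.11) can be illustrated numerically. In the R sketch below all expectations are taken under the empirical distribution of centred data columns, so the variables are exactly mean zero and (8.11) holds up to floating point error; the function names cum4 and cov0 and the simulated data are assumptions made for this illustration.

```r
# Fourth order joint cumulant of zero mean variables via the moment expansion (8.12);
# each argument is a vector of equally likely outcomes (an empirical distribution)
cum4 <- function(x1, x2, x3, x4) {
  mean(x1 * x2 * x3 * x4) - mean(x1 * x2) * mean(x3 * x4) -
    mean(x1 * x3) * mean(x2 * x4) - mean(x1 * x4) * mean(x2 * x3)
}
cov0 <- function(u, v) mean(u * v)   # covariance of zero mean variables

set.seed(3)
# Four skewed, dependent columns, centred so that each has mean exactly zero
raw <- matrix(rexp(4 * 50), ncol = 4)
raw[, 2] <- raw[, 2] + raw[, 1]
Z <- scale(raw, center = TRUE, scale = FALSE)
A <- Z[, 1]; B <- Z[, 2]; C <- Z[, 3]; D <- Z[, 4]

# Left and right hand sides of (8.11)
lhs <- mean(A * B * C * D) - mean(A * B) * mean(C * D)      # cov(AB, CD)
rhs <- cov0(A, C) * cov0(B, D) + cov0(A, D) * cov0(B, C) + cum4(A, B, C, D)

# A variable taking the values -1 and 1 with equal probability:
# fourth cumulant E[X^4] - 3 E[X^2]^2 = 1 - 3 = -2
rademacher <- c(-1, 1)
k4_rademacher <- cum4(rademacher, rademacher, rademacher, rademacher)
```

The second example shows that, unlike a variance, a fourth order cumulant can be negative.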

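Example 8.2.1 • We illustrate (d) for the causal AR(1) model $X_t = \phi X_{t-1} + \varepsilon_t$ (where $\{\varepsilon_t\}$ are iid random variables with finite fourth order cumulant $\kappa_4 = \mathrm{cum}(\varepsilon_t, \varepsilon_t, \varepsilon_t, \varepsilon_t)$). By using the MA($\infty$) representation $X_t = \sum_{j=0}^{\infty}\phi^j\varepsilon_{t-j}$ (assuming $0 \leq k_1 \leq k_2 \leq k_3$) we have

\[
\begin{aligned}
\mathrm{cum}[X_t, X_{t+k_1}, X_{t+k_2}, X_{t+k_3}] &= \sum_{j_0,j_1,j_2,j_3=0}^{\infty}\phi^{j_0+j_1+j_2+j_3}\,\mathrm{cum}\big[\varepsilon_{t-j_0}, \varepsilon_{t+k_1-j_1}, \varepsilon_{t+k_2-j_2}, \varepsilon_{t+k_3-j_3}\big] \\
&= \kappa_4\sum_{j=0}^{\infty}\phi^{j}\phi^{j+k_1}\phi^{j+k_2}\phi^{j+k_3} = \kappa_4\frac{\phi^{k_1+k_2+k_3}}{1 - \phi^4}.
\end{aligned}
\]

The fourth order dependence decays as the lag increases. And this rate of decay is faster than the general bound $|\mathrm{cum}[X_t, X_{t+k_1}, X_{t+k_2}, X_{t+k_3}]| \leq \alpha(k_1)\alpha(k_2 - k_1)\alpha(k_3 - k_2)$.

• If $\{X_t\}_t$ are martingale differences and $t_j$ are all different, then using (8.12) (the expansion of the fourth order cumulant in terms of moments) we have

\[
\mathrm{cum}[X_{t_1}, X_{t_2}, X_{t_3}, X_{t_4}] = 0.
\]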

Remark 8.2.3 (Cumulants and dependence measures) The summability of cumulants can
be shown under various mixing and dependent type conditions. We mention a few below.

• Conditions for summability of cumulants for mixing processes are given in Statulevicius and
Jakimavicius (1988) and Lahiri (2003).

• Conditions for summability of cumulants for physical dependence processes are given in Shao
and Wu (2007), Theorem 4.1.

Proof of equation (8.8) in Theorem 8.2.2

Our aim is to show

\[
\mathrm{var}\left[\sqrt{n}\begin{pmatrix}
\hat{c}_n(0) \\
\hat{c}_n(r_1) \\
\vdots \\
\hat{c}_n(r_d)
\end{pmatrix}\right] \to V_{d+1}
\]

where

\[
(V_{d+1})_{i,j} = \sum_{k=-\infty}^{\infty}c(k)c(k + r_{i-1} - r_{j-1}) + \sum_{k=-\infty}^{\infty}c(k + r_{i-1})c(k - r_{j-1}) + \sum_{k=-\infty}^{\infty}\kappa_4(r_{i-1}, k, k + r_{j-1}). \tag{8.14}
\]

To simplify notation we start by considering the variance

\[
\mathrm{var}[\sqrt{n}\,\hat{c}_n(r)] = \frac{1}{n}\sum_{t,\tau=1}^{n-|r|}\mathrm{cov}(X_tX_{t+r}, X_\tau X_{\tau+r}).
\]

To prove the result, we use the identity (8.11); if $A$, $B$, $C$ and $D$ are mean zero random variables, then $\mathrm{cov}[AB, CD] = \mathrm{cov}[A, C]\mathrm{cov}[B, D] + \mathrm{cov}[A, D]\mathrm{cov}[B, C] + \mathrm{cum}[A, B, C, D]$. Using this identity we have

\[
\begin{aligned}
\mathrm{var}[\sqrt{n}\,\hat{c}_n(r)]
&= \frac{1}{n}\sum_{t,\tau=1}^{n-|r|}\Big(\underbrace{\mathrm{cov}(X_t, X_\tau)}_{=c(t-\tau)}\mathrm{cov}(X_{t+r}, X_{\tau+r}) + \mathrm{cov}(X_t, X_{\tau+r})\mathrm{cov}(X_{t+r}, X_\tau) + \underbrace{\mathrm{cum}(X_t, X_{t+r}, X_\tau, X_{\tau+r})}_{=\kappa_4(r,\,\tau-t,\,\tau+r-t)}\Big) \\
&= \frac{1}{n}\sum_{t,\tau=1}^{n-|r|}c(t-\tau)^2 + \frac{1}{n}\sum_{t,\tau=1}^{n-|r|}c(t-\tau-r)c(t+r-\tau) + \frac{1}{n}\sum_{t,\tau=1}^{n-|r|}\kappa_4(r, \tau-t, \tau+r-t) \\
&:= I_n + II_n + III_n,
\end{aligned}
\]

where the above is due to strict stationarity of the time series. The benefit of using a cumulant expansion rather than a moment expansion is now apparent. Since cumulants act like covariances, they do decay as the time gaps grow. This allows us to analyse each term $I_n$, $II_n$ and $III_n$ individually. This simplifies the analysis.

We first consider $I_n$. Either (i) by changing variables and letting $k = t - \tau$ and thus changing the limits of the sum in an appropriate way or (ii) observing that $\sum_{t,\tau=1}^{n-|r|}c(t-\tau)^2$ is the sum of the elements in the $(n-|r|) \times (n-|r|)$ Toeplitz matrix

\[
\begin{pmatrix}
c(0)^2 & c(1)^2 & \ldots & c(n-|r|-1)^2 \\
c(-1)^2 & c(0)^2 & \ldots & c(n-|r|-2)^2 \\
\vdots & \vdots & \ddots & \vdots \\
c(-(n-|r|-1))^2 & c(-(n-|r|-2))^2 & \ldots & c(0)^2
\end{pmatrix},
\]

(noting that $c(-k) = c(k)$) the sum $I_n$ can be written as

\[
I_n = \frac{1}{n}\sum_{t,\tau=1}^{n-|r|}c(t-\tau)^2 = \frac{1}{n}\sum_{k=-(n-|r|)}^{n-|r|}c(k)^2\sum_{t=1}^{n-|r|-|k|}1 = \sum_{k=-(n-|r|)}^{n-|r|}\Big(\frac{n-|r|-|k|}{n}\Big)c(k)^2.
\]

To obtain the limit of the above we use dominated convergence. Precisely, since for all $k$, $\frac{n-|r|-|k|}{n}c(k)^2 \to c(k)^2$ and $\sum_{k=-(n-|r|)}^{n-|r|}\frac{n-|r|-|k|}{n}c(k)^2 \leq \sum_{k\in\mathbb{Z}}c(k)^2 < \infty$, by dominated convergence $I_n \to \sum_{k=-\infty}^{\infty}c(k)^2$. Using a similar argument we can show that

\[
\lim_{n\to\infty}II_n = \sum_{k=-\infty}^{\infty}c(k+r)c(k-r).
\]

To derive the limit of $III_n$, we change variables $k = \tau - t$ to give

\[
III_n = \sum_{k=-(n-|r|)}^{n-|r|}\Big(\frac{n-|r|-|k|}{n}\Big)\kappa_4(r, k, k+r).
\]

Again we use dominated convergence. Precisely, for all $k$, $\frac{n-|r|-|k|}{n}\kappa_4(r, k, k+r) \to \kappa_4(r, k, k+r)$ and $\big|\sum_{k=-(n-|r|)}^{n-|r|}\frac{n-|r|-|k|}{n}\kappa_4(r, k, k+r)\big| \leq \sum_{k\in\mathbb{Z}}|\kappa_4(r, k, k+r)| < \infty$ (by assumption). Thus by dominated convergence we have $III_n \to \sum_{k=-\infty}^{\infty}\kappa_4(r, k, k+r)$. Altogether the limits of $I_n$, $II_n$ and $III_n$ give

\[
\lim_{n\to\infty}\mathrm{var}[\sqrt{n}\,\hat{c}_n(r)] = \sum_{k=-\infty}^{\infty}c(k)^2 + \sum_{k=-\infty}^{\infty}c(k+r)c(k-r) + \sum_{k=-\infty}^{\infty}\kappa_4(r, k, k+r).
\]

Using a similar set of arguments we obtain

\[
\lim_{n\to\infty}\mathrm{cov}[\sqrt{n}\,\hat{c}_n(r_1), \sqrt{n}\,\hat{c}_n(r_2)] = \sum_{k=-\infty}^{\infty}c(k)c(k + r_1 - r_2) + \sum_{k=-\infty}^{\infty}c(k - r_1)c(k + r_2) + \sum_{k=-\infty}^{\infty}\kappa_4(r_1, k, k + r_2).
\]

This result gives the required variance matrix $V_{d+1}$ in Theorem 8.2.2.
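For a Gaussian time series the cumulant term vanishes, so both the finite sample sums In + IIn and their limit can be evaluated exactly from the autocovariance. The R sketch below does this for a Gaussian AR(1) with φ = 0.5 and unit innovation variance; the function names and parameter values are assumptions chosen for illustration.

```r
# AR(1) autocovariance with coefficient phi and innovation variance sigma2
ar1_acov <- function(phi, sigma2 = 1) {
  function(k) sigma2 * phi^abs(k) / (1 - phi^2)
}

# I_n + II_n: (1/n) * sum over t, tau = 1, ..., n - |r| of
# c(t - tau)^2 + c(t - tau - r) c(t + r - tau)
gauss_var_finite <- function(acov, r, n) {
  idx <- seq_len(n - abs(r))
  d <- outer(idx, idx, "-")                 # d[t, tau] = t - tau
  sum(acov(d)^2 + acov(d - r) * acov(d + r)) / n
}

# Limit: sum_k c(k)^2 + sum_k c(k + r) c(k - r), truncated at |k| <= K
gauss_var_limit <- function(acov, r, K = 500) {
  k <- -K:K
  sum(acov(k)^2 + acov(k + r) * acov(k - r))
}

acov <- ar1_acov(0.5)
v_finite <- gauss_var_finite(acov, r = 1, n = 500)
v_limit <- gauss_var_limit(acov, r = 1)
```

For this model the limit is 80/27 + 44/27 = 124/27 ≈ 4.59, and the finite sample value for n = 500 lies slightly below it because each term is weighted by (n − |r| − |k|)/n.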
Below, we show that under linearity the fourth order cumulant term has a simpler form. We will show

\[
\sqrt{n}\begin{pmatrix}
\hat{\rho}_n(r_1) - \rho(r_1) \\
\vdots \\
\hat{\rho}_n(r_d) - \rho(r_d)
\end{pmatrix} \overset{D}{\to} N\big(0, GC_{d+1}G'\big).
\]

We have already shown that in the general case the variance of the limit distribution of the sample correlations is $G(C_{d+1} + K_{d+1})G'$. Thus our objective here is to show that for linear time series the fourth order cumulant term is $GK_{d+1}G' = 0$.

Proof of Theorem 8.2.3 and the case of the vanishing fourth order cumulant

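So far we have not used the structure of the time series to derive an expression for the variance of the sample covariance. However, to prove $GK_{d+1}G' = 0$ we require an explicit expression for $K_{d+1}$. The following result only holds for linear, univariate time series. We recall that

\[
(K_{d+1})_{i,j} = \sum_{k=-\infty}^{\infty}\kappa_4(r_{i-1}, k, k + r_{j-1}).
\]

By definition $\kappa_4(r_{i-1}, k, k + r_{j-1}) = \mathrm{cum}(X_0, X_{r_{i-1}}, X_k, X_{k+r_{j-1}})$. Further, we consider the specific case that $X_t$ is a linear time series, where

\[
X_t = \sum_{j=-\infty}^{\infty}\psi_j\varepsilon_{t-j},
\]

$\sum_j |\psi_j| < \infty$, $\{\varepsilon_t\}$ are iid, $E(\varepsilon_t) = 0$, $\mathrm{var}(\varepsilon_t) = \sigma^2$ and $\kappa_4 = \mathrm{cum}_4(\varepsilon_t)$. To find an expression for $(K_{d+1})_{i,j}$, consider the general sum

\[
\begin{aligned}
\sum_{k=-\infty}^{\infty}\mathrm{cum}(X_0, X_{r_1}, X_k, X_{k+r_2})
&= \sum_{k=-\infty}^{\infty}\mathrm{cum}\Big(\sum_{j_1=-\infty}^{\infty}\psi_{j_1}\varepsilon_{-j_1}, \sum_{j_2=-\infty}^{\infty}\psi_{j_2}\varepsilon_{r_1-j_2}, \sum_{j_3=-\infty}^{\infty}\psi_{j_3}\varepsilon_{k-j_3}, \sum_{j_4=-\infty}^{\infty}\psi_{j_4}\varepsilon_{k+r_2-j_4}\Big) \\
&= \sum_{k=-\infty}^{\infty}\sum_{j_1,\ldots,j_4=-\infty}^{\infty}\psi_{j_1}\psi_{j_2}\psi_{j_3}\psi_{j_4}\,\mathrm{cum}\big(\varepsilon_{-j_1}, \varepsilon_{r_1-j_2}, \varepsilon_{k-j_3}, \varepsilon_{k+r_2-j_4}\big).
\end{aligned}
\]

We recall from Section 8.2.3 that if one of the variables above is independent of the others, then $\mathrm{cum}(\varepsilon_{-j_1}, \varepsilon_{r_1-j_2}, \varepsilon_{k-j_3}, \varepsilon_{k+r_2-j_4}) = 0$. Thus only the terms with $-j_1 = r_1 - j_2 = k - j_3 = k + r_2 - j_4$ are non-zero, which reduces the number of sums from five to two

\[
\sum_{k=-\infty}^{\infty}\mathrm{cum}(X_0, X_{r_1}, X_k, X_{k+r_2}) = \kappa_4\sum_{k=-\infty}^{\infty}\sum_{j=-\infty}^{\infty}\psi_j\psi_{j+r_1}\psi_{j+k}\psi_{j+k+r_2}.
\]

Changing variables $j_1 = j$ and $j_2 = j + k$ we have

\[
\sum_{k=-\infty}^{\infty}\mathrm{cum}(X_0, X_{r_1}, X_k, X_{k+r_2}) = \kappa_4\Big(\sum_{j_1=-\infty}^{\infty}\psi_{j_1}\psi_{j_1+r_1}\Big)\Big(\sum_{j_2=-\infty}^{\infty}\psi_{j_2}\psi_{j_2+r_2}\Big) = \kappa_4\frac{c(r_1)}{\sigma^2}\frac{c(r_2)}{\sigma^2} = \frac{\kappa_4}{\sigma^4}c(r_1)c(r_2),
\]

recalling that $\mathrm{cov}(X_t, X_{t+r}) = \sigma^2\sum_{j=-\infty}^{\infty}\psi_j\psi_{j+r}$. Thus for linear time series

\[
(K_{d+1})_{i,j} = \sum_{k=-\infty}^{\infty}\kappa_4(r_{i-1}, k, k + r_{j-1}) = \frac{\kappa_4}{\sigma^4}c(r_{i-1})c(r_{j-1})
\]

and the matrix $K_{d+1}$ is

\[
K_{d+1} = \frac{\kappa_4}{\sigma^4}c_{d+1}c_{d+1}'
\]

where $c_{d+1}' = (c(0), c(r_1), \ldots, c(r_d))$. Substituting this representation of $K_{d+1}$ into $GK_{d+1}G'$ gives

\[
GK_{d+1}G' = \frac{\kappa_4}{\sigma^4}Gc_{d+1}c_{d+1}'G'.
\]

We recall from Remark 8.2.1 that $G$ is a $d \times (d+1)$ dimensional matrix whose null space is spanned by $c_{d+1}$. This immediately gives $Gc_{d+1} = 0$ and the result.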
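The key algebraic fact, that G annihilates the vector of covariances, is easy to check numerically. The R sketch below builds G from an arbitrary covariance vector (here AR(1) autocovariances at lags 0, 1, 2, 3, an illustrative choice) and verifies that the rank one cumulant matrix drops out. The function name build_G and the value used for κ4/σ⁴ are assumptions made for this sketch.

```r
# Build the d x (d+1) matrix G of Lemma 8.2.2 from cvec = (c(0), c(r_1), ..., c(r_d))
build_G <- function(cvec) {
  d <- length(cvec) - 1
  rho <- cvec[-1] / cvec[1]              # rho(r_1), ..., rho(r_d)
  cbind(-rho, diag(d)) / cvec[1]         # (1/c(0)) * [ -rho | I_d ]
}

phi <- 0.6
cvec <- phi^(0:3) / (1 - phi^2)          # c(0), c(1), c(2), c(3) for an AR(1)
G <- build_G(cvec)

# G c_{d+1} = 0, so G K G' = 0 whenever K is proportional to c c'
Gc <- G %*% cvec
kappa_ratio <- 2.5                       # hypothetical value of kappa_4 / sigma^4
K <- kappa_ratio * cvec %*% t(cvec)
GKG <- G %*% K %*% t(G)
```

This is why the asymptotic variance of the sample autocorrelations of a linear process does not depend on the fourth order cumulant of the innovations.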

Exercise 8.1 Under the assumption that $\{X_t\}$ are iid random variables show that $\hat{c}_n(1)$ is asymptotically normal.
Hint: Let $m = n/(B+1)$ and partition the sum $\sum_{t=1}^{n-1}X_tX_{t+1}$ as follows

\[
\begin{aligned}
\sum_{t=1}^{n-1}X_tX_{t+1} &= \sum_{t=1}^{B}X_tX_{t+1} + X_{B+1}X_{B+2} + \sum_{t=B+2}^{2B+1}X_tX_{t+1} + X_{2B+2}X_{2B+3} \\
&\quad + \sum_{t=2B+3}^{3B+2}X_tX_{t+1} + X_{3B+3}X_{3B+4} + \sum_{t=3B+4}^{4B+3}X_tX_{t+1} + \ldots \\
&= \sum_{j=0}^{m-1}U_{m,j} + \sum_{j=0}^{m-1}X_{(j+1)(B+1)}X_{(j+1)(B+1)+1}
\end{aligned}
\]

where $U_{m,j} = \sum_{t=j(B+1)+1}^{j(B+1)+B}X_tX_{t+1}$. Show that the second term in the above sum is asymptotically negligible and show that the classical CLT for triangular arrays can be applied to the first term.

Exercise 8.2 Under the assumption that $\{X_t\}$ is a MA(1) process, show that $\hat{c}_n(1)$ is asymptotically
normal.

Exercise 8.3 The block bootstrap scheme is a commonly used method for estimating the finite
sample distribution of a statistic (which includes its variance). The aim in this exercise is to see
how well the bootstrap variance approximates the finite sample variance of a statistic.

(i) In R write a function to calculate the autocovariance $\hat{c}_n(1) = \frac{1}{n}\sum_{t=1}^{n-1} X_t X_{t+1}$.

Remember the function is defined as cov1 = function(x){...}

(ii) Load the boot library into R with library("boot"). We will use the block bootstrap, which partitions
the data into blocks of length l and then samples from the blocks n/l times to construct
a new bootstrap time series of length n. For each bootstrap time series the covariance is
evaluated and this is done R times. The variance is calculated based on these R bootstrap
estimates.

You will need to use the function tsboot(tseries,statistic,R=100,l=20,sim="fixed").

tseries refers to the original data, statistic to the function you wrote in part (i) (which should
only be a function of the data), R is the number of bootstrap replications and l is the length
of the block.

Note that tsboot(tseries,statistic,R=100,l=20,sim="fixed")$t will be a vector of length
R = 100 which contains the bootstrap statistics; you can calculate the variance of this
vector.

(iii) Simulate the AR(2) time series arima.sim(list(order = c(2, 0, 0), ar = c(1.5, -0.75)), n = 128)
500 times. For each realisation calculate the sample autocovariance at lag one and also
the bootstrap variance.

(iv) Calculate the mean of the bootstrap variances and also the mean squared error (compared
with the empirical variance), how does the bootstrap perform?

(v) Play around with the bootstrap block length l. Observe how the block length can influence the
result.
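As a starting point for parts (i) and (ii), the following is a minimal sketch for a single simulated series. It assumes the boot package is installed; the seed and the object names x, bt and boot_var are illustrative choices and not part of the exercise.

```r
library(boot)  # provides tsboot()

# Part (i): lag-one sample autocovariance (no mean correction, divisor n)
cov1 <- function(x) {
  x <- as.numeric(x)
  n <- length(x)
  sum(x[-n] * x[-1]) / n
}

set.seed(1)  # illustrative seed
x <- arima.sim(list(order = c(2, 0, 0), ar = c(1.5, -0.75)), n = 128)

# Part (ii): fixed-length block bootstrap with R = 100 replications and
# blocks of length l = 20; $t holds the 100 bootstrap statistics
bt <- tsboot(x, statistic = cov1, R = 100, l = 20, sim = "fixed")$t
boot_var <- var(as.vector(bt))
```

Parts (iii) to (v) would wrap the last three lines in a loop over 500 realisations and over different values of l.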

Remark 8.2.4 The above would appear to be a nice trick, but there are two major factors that
lead to the cancellation of the fourth order cumulant term

• Linearity of the time series

• Ratio between ĉn (r) and ĉn (0).

Indeed this is not a chance result; in fact there is a logical reason why this result is true (and is
true for many statistics which have a similar form, commonly called ratio statistics). It is most easily
explained in the Fourier domain. If the estimator can be written as

\[
\frac{\frac{1}{n}\sum_{k=1}^{n} \phi(\omega_k) I_n(\omega_k)}{\frac{1}{n}\sum_{k=1}^{n} I_n(\omega_k)},
\]

where $I_n(\omega)$ is the periodogram, and $\{X_t\}$ is a linear time series, then we will show later that the
asymptotic distribution of the above has a variance which is only in terms of the covariances, not
higher order cumulants. We prove this result in Section 11.5.

8.3 Checking for correlation in a time series


Bartlett’s formula is commonly used to check by ‘eye’ whether a time series is uncorrelated (there
are more sensitive tests, but this one is often used to construct confidence intervals for the sample
autocorrelations in several statistical packages). This is an important problem, for many reasons:

• Given a data set, we need to check whether there is dependence, if there is we need to analyse
it in a different way.

• Suppose we fit a linear regression to time series data. We may want to check whether the residuals
are actually uncorrelated, else the standard errors based on the assumption of uncorrelatedness
would be unreliable.

• We need to check whether a time series model is the appropriate model. To do this we fit
the model and estimate the residuals. If the residuals appear to be uncorrelated it would
seem likely that the model is correct. If they are correlated, then the model is inappropriate.
For example, we may fit an AR(1) to the data and estimate the residuals $\hat{\varepsilon}_t = X_t - \hat{\phi}X_{t-1}$. If there is
still correlation in the residuals, then the AR(1) was not the correct model, since $X_t - \hat{\phi}X_{t-1}$ is
still correlated (which it would not be, if it were the correct model).

We now apply Theorem 8.2.3 to the case that the time series are iid random variables. Suppose $\{X_t\}$
are iid random variables; then it is clear that $\{X_t\}$ is a trivial example of a (not necessarily Gaussian)
linear process. We use (8.3) as an estimator of the autocovariances.
To derive the asymptotic variance of the sample autocorrelations $\{\hat{\rho}_n(r)\}$, we recall that if $\{X_t\}$ are iid
then $\rho(k) = 0$ for $k \neq 0$. Then by using Bartlett’s formula we have

\[
(W_h)_{ij} = \begin{cases} 1 & i = j \\ 0 & i \neq j \end{cases}
\]

In other words, $\sqrt{n}\hat{\rho}_n \overset{D}{\to} N(0, I_h)$. Hence the sample autocorrelations at different lags are asymptotically
uncorrelated and, after scaling by $\sqrt{n}$, have variance one. This allows us to easily construct error bars for the
sample autocorrelations under the assumption of independence. If the vast majority of the sample
autocorrelations lie inside the error bars there is not enough evidence to suggest that the data is not
a realisation of iid random variables (often called a white noise process). An example of the
empirical ACF and error bars is given in Figure 8.1. We see that the empirical autocorrelations of
the realisation from iid random variables all lie within the error bars. In contrast in Figure 8.2
we give a plot of the sample ACF of an AR(2). We observe that a large number of the sample
autocorrelations lie outside the error bars.

Figure 8.1: The sample ACF of an iid sample with error bars (sample size n = 200).

Figure 8.2: Top: The sample ACF of the AR(2) process $X_t = 1.5X_{t-1} - 0.75X_{t-2} + \varepsilon_t$ with
error bars n = 200. Bottom: The true ACF.

Of course, simply checking by eye means that we risk misconstruing a sample coefficient that
lies outside the error bars as meaning that the time series is correlated, whereas this could simply
be a false positive (due to multiple testing). To counter this problem, we construct a test statistic
for testing uncorrelatedness. We test the hypothesis $H_0: c(r) = 0$ for all $r$ against $H_A$: at least
one $c(r) \neq 0$.
A popular method for measuring correlation is to use the squares of the sample correlations

\[
S_h = n \sum_{r=1}^{h} |\hat{\rho}_n(r)|^2. \tag{8.15}
\]

Since under the null $\sqrt{n}(\hat{\rho}_n(h) - \rho(h)) \overset{D}{\to} N(0, I)$, under the null $S_h$ will asymptotically have a $\chi^2$-
distribution with $h$ degrees of freedom; under the alternative it will be a non-central (generalised)
chi-squared. The non-centrality is what makes us reject the null if the alternative of correlatedness
is true. This is known as the Box-Pierce (or Portmanteau) test. The Ljung-Box test is a variant
on the Box-Pierce test and is defined as

\[
S_h = n(n+2) \sum_{r=1}^{h} \frac{|\hat{\rho}_n(r)|^2}{n-r}. \tag{8.16}
\]

Again under the null of no correlation, asymptotically, $S_h \overset{D}{\to} \chi^2_h$. Generally, the Ljung-Box test is
supposed to give more reliable results than the Box-Pierce test.
Of course, one needs to select $h$. In general, we do not have to use large $h$ since most correlations
will arise when the lag is small. However, the choice of $h$ will have an influence on power. If $h$ is too
large the test will lose power (since the mean of the chi-squared grows as $h \to \infty$); on the other
hand choosing $h$ too small may mean that certain correlations at higher lags are missed. How to
select $h$ is discussed in several papers, see for example Escanciano and Lobato (2009).
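The two statistics (8.15) and (8.16) can be computed directly in R. The sketch below assumes an iid Gaussian sample and h = 10 (both illustrative choices) and uses acf(), which subtracts the sample mean before computing the correlations. The built-in Box.test() from the stats package is called alongside for comparison.

```r
set.seed(1)            # illustrative seed
n <- 200
h <- 10
x <- rnorm(n)          # iid sample, so the null hypothesis holds

# Sample autocorrelations at lags 1, ..., h (the first entry of $acf is lag 0)
rho <- acf(x, lag.max = h, plot = FALSE)$acf[-1]

S_bp <- n * sum(rho^2)                          # Box-Pierce statistic (8.15)
S_lb <- n * (n + 2) * sum(rho^2 / (n - 1:h))    # Ljung-Box statistic (8.16)

# p-values from the chi-squared distribution with h degrees of freedom
p_bp <- pchisq(S_bp, df = h, lower.tail = FALSE)
p_lb <- pchisq(S_lb, df = h, lower.tail = FALSE)

# The same tests via the built-in function
bp <- Box.test(x, lag = h, type = "Box-Pierce")
lb <- Box.test(x, lag = h, type = "Ljung-Box")
```

Repeating this over many simulated series and recording how often the p-value falls below 0.05 gives an empirical check of the size of each test.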

Remark 8.3.1 (Do’s and Don’ts of the Box-Pierce or Ljung-Box test) There is a temptation
to estimate the residuals from a model and test for correlation in the estimated residuals.

• Example 1 $Y_t = \sum_{j=1}^{p} \alpha_j x_{j,t} + \varepsilon_t$. Suppose we want to know if the errors $\{\varepsilon_t\}_t$ are correlated.
We test $H_0$: errors are uncorrelated vs $H_A$: errors are correlated.

Suppose $H_0$ is true. $\{\varepsilon_t\}$ are unobserved, but they can be estimated from the data. Then on
the estimated residuals $\{\hat{\varepsilon}_t\}_t$ we can test for correlation. We estimate the correlation based
on the estimated residuals $\tilde{\rho}_n(r) = \tilde{c}_n(r)/\tilde{c}_n(0)$, where

\[
\tilde{c}_n(r) = \frac{1}{n} \sum_{t=1}^{n-|r|} \hat{\varepsilon}_t \hat{\varepsilon}_{t+r}.
\]

It can be shown that $\sqrt{n}\tilde{\rho}_n(r) \sim N(0, 1)$ asymptotically and the Box-Pierce or Ljung-Box test can be used.
I.e. $S_h \sim \chi^2_h$ even when using the estimated residuals.

• Example 2 This example is a word of warning. Suppose $Y_t = \phi Y_{t-1} + \varepsilon_t$. We want to test

$H_0$: errors are uncorrelated vs $H_A$: errors are correlated.

Suppose $H_0$ is true. $\{\varepsilon_t\}$ are unobserved, but they can be estimated from the data. We
estimate the correlation based on the estimated residuals ($\hat{\varepsilon}_t = Y_t - \hat{\phi}Y_{t-1}$), $\tilde{\rho}_n(r) = \tilde{c}_n(r)/\tilde{c}_n(0)$,
where

\[
\tilde{c}_n(r) = \frac{1}{n} \sum_{t=1}^{n-|r|} \hat{\varepsilon}_t \hat{\varepsilon}_{t+r}.
\]

$\tilde{\rho}_n(r)$ is estimating zero but $\sqrt{n}\tilde{\rho}_n(r)$ is not a standard normal. Thus $S_h$ does not follow a
standard chi-square distribution. This means the estimated residuals cannot be used, without correction, to check
for uncorrelatedness.

To understand the difference between the two examples see Section 8.6.
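The warning in Example 2 can be seen in a small simulation. The sketch below is an illustration under assumed settings (an AR(1) with $\phi = 0.5$, n = 200, 500 replications, least squares estimation of $\phi$); none of these values come from the text. It records $\sqrt{n}\tilde{\rho}_n(1)$ computed from the estimated residuals in each replication.

```r
set.seed(1)                       # illustrative seed
n <- 200; phi <- 0.5; nrep <- 500 # assumed simulation settings

stat <- replicate(nrep, {
  y <- as.numeric(arima.sim(list(ar = phi), n = n))
  # Least squares estimator of phi (regress y_t on y_{t-1})
  phi_hat <- sum(y[-1] * y[-n]) / sum(y[-n]^2)
  res <- y[-1] - phi_hat * y[-n]  # estimated residuals
  m <- length(res)
  rho1 <- sum(res[-m] * res[-1]) / sum(res^2)  # lag-one residual correlation
  sqrt(n) * rho1
})

# If sqrt(n) * rho1 were standard normal this would be close to one
var_stat <- var(stat)
```

In runs of this kind the empirical variance is typically well below one, which is consistent with the reduced error bars derived in Section 8.6.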

8.3.1 Relaxing the assumptions: The robust Portmanteau test (advanced)

One disadvantage of the Box-Pierce/Portmanteau test described above is that it requires under the
null that the time series is independent, not just uncorrelated, even though the test statistic can
only test for correlatedness and not dependence. As an illustration of this, in the left hand plot of Figure 8.3 we
give the QQplot of $S_2$ (using an ARCH process as the time series) against a chi-square distribution. We
observe that despite the null being true, the test statistic deviates considerably from a chi-square. For
this time series, we would have too many false positives despite the time series being uncorrelated.
Thus the Box-Pierce test only gives reliable results for linear time series.

In general, under the null of no correlation we have

\[
\mathrm{cov}\left(\sqrt{n}\hat{c}_n(r_1), \sqrt{n}\hat{c}_n(r_2)\right) =
\begin{cases}
\sum_k \kappa_4(r_1, k, k + r_2) & r_1 \neq r_2 \\
c(0)^2 + \sum_k \kappa_4(r, k, k + r) & r_1 = r_2 \, (= r)
\end{cases}
\]

Thus despite $\hat{c}_n(r)$ being asymptotically normal we have

\[
\sqrt{n}\frac{\hat{c}_n(r)}{c(0)} \overset{D}{\to} N\left(0, 1 + GK_2G'\right),
\]

where the cumulant term $GK_2G'$ tends to be positive. This results in the Box-Pierce test underestimating
the variance, and the true quantiles of $S_2$ (see the left hand plot of Figure 8.3) being larger than the chi-square
quantiles.
However, there is an important subset of uncorrelated time series, which are dependent, where
a slight modification of the Box-Pierce test does give reliable results. This subset includes the
aforementioned ARCH process, and the modified test is very useful in financial applications. As mentioned in
(??) ARCH and GARCH processes are uncorrelated time series which are martingale differences.
We now describe the robust Portmanteau test, which is popular in econometrics as it allows for
uncorrelated time series which are martingale differences and satisfy an additional joint moment condition
which we specify below (so long as the time series is stationary and its fourth moment exists).
We recall that {Xt }t is a martingale difference if

E(Xt |Xt−1 , Xt−2 , Xt−3 , . . .) = 0.

Martingale differences include independent random variables as a special case. Clearly, from this
definition {Xt } is uncorrelated since for r > 0 and by using the definition of a martingale difference
we have

cov(Xt , Xt+r ) = E(Xt Xt+r ) − E(Xt )E(Xt+r )

= E(Xt E(Xt+r |Xt )) − E(E(Xt |Xt−1 ))E(E(Xt+r |Xt+r−1 )) = 0.

Thus a martingale difference sequence is an uncorrelated sequence. However, martingale differences
have more structure than uncorrelated random variables, and thus allow more flexibility. For a test to
be simple we would like the sample covariances at different lags to be asymptotically uncorrelated.
This can be achieved for martingale differences plus an important additional condition:

\[
E[X_t^2 X_{s_1} X_{s_2}] = 0 \qquad t > s_1 \neq s_2. \tag{8.17}
\]

To understand why, consider the sample covariance

\[
\mathrm{cov}\left(\sqrt{n}\hat{c}_n(r_1), \sqrt{n}\hat{c}_n(r_2)\right)
 = \frac{1}{n} \sum_{t_1,t_2} \mathrm{cov}\left(X_{t_1}X_{t_1+r_1}, X_{t_2}X_{t_2+r_2}\right).
\]

Under the null, the above is

\[
\mathrm{cov}\left(\sqrt{n}\hat{c}_n(r_1), \sqrt{n}\hat{c}_n(r_2)\right)
 = \frac{1}{n} \sum_{t_1,t_2} E\left(X_{t_1}X_{t_1+r_1}X_{t_2}X_{t_2+r_2}\right).
\]

We show that under the null hypothesis, many of the above terms are zero (when $r_1 \neq r_2$), however
there are some exceptions, which require the additional moment condition.
For example, if $t_1 \neq t_2$ and suppose for simplicity $t_2 + r_2 > t_2, t_1, t_1 + r_1$. Then

\[
E(X_{t_1}X_{t_1+r_1}X_{t_2}X_{t_2+r_2}) = E\left(X_{t_1}X_{t_1+r_1}X_{t_2}E(X_{t_2+r_2}|X_{t_1}, X_{t_1+r_1}, X_{t_2})\right) = 0 \tag{8.18}
\]

and if $r_1 \neq r_2$ (assume $r_2 > r_1$) by the same argument

\[
E(X_tX_{t+r_1}X_tX_{t+r_2}) = E\Big(X_t^2X_{t+r_1}E\big(X_{t+r_2}\,|\,\underbrace{X_t^2, X_{t+r_1}}_{\subset \sigma(X_{t+r_2-1}, X_{t+r_2-2}, \ldots)}\big)\Big) = 0.
\]

However, in the case that $t_1 + r_1 = t_2 + r_2$ (with $r_1, r_2 \geq 0$; since $r_1 \neq r_2$, this implies $t_1 \neq t_2$)
we have, in general,

\[
E(X_{t_1+r_1}^2 X_{t_1}X_{t_2}) \neq 0,
\]

even when $X_t$ are martingale differences. Consequently, we do not have that $\mathrm{cov}(X_{t_1}X_{t_1+r_1}, X_{t_2}X_{t_2+r_2}) = 0$.
However, by including the additional moment condition that $E[X_t^2X_{s_1}X_{s_2}] = 0$ for $t > s_1 \neq s_2$,
we have $\mathrm{cov}(X_{t_1}X_{t_1+r_1}, X_{t_2}X_{t_2+r_2}) = 0$ for all $t_1$ and $t_2$ when $r_1 \neq r_2$.
The above results can be used to show that the variance of $\hat{c}_n(r)$ (under the assumption that
the time series are martingale differences and $E[X_t^2X_{s_1}X_{s_2}] = 0$ for $t > s_1 \neq s_2$) has a very simple form

\begin{align*}
\mathrm{var}\left(\sqrt{n}\hat{c}_n(r)\right)
 &= \frac{1}{n} \sum_{t_1,t_2=1}^{n} \mathrm{cov}\left(X_{t_1}X_{t_1+r}, X_{t_2}X_{t_2+r}\right) \\
 &= \frac{1}{n} \sum_{t_1,t_2=1}^{n} E\left(X_{t_1}X_{t_1+r}X_{t_2}X_{t_2+r}\right)
  = \frac{1}{n} \sum_{t=1}^{n} E\left(X_t^2X_{t+r}^2\right)
  = \underbrace{E(X_0^2X_r^2)}_{\text{by stationarity}}
\end{align*}

and if $r_1 \neq r_2$ then $\mathrm{cov}(\hat{c}_n(r_1), \hat{c}_n(r_2)) = 0$. Let $\sigma_r^2 = E(X_0^2X_r^2)$. Then we have that under the null
hypothesis (and suitable conditions to ensure normality)

\[
\sqrt{n}
\begin{pmatrix}
\hat{c}_n(1)/\sigma_1 \\
\hat{c}_n(2)/\sigma_2 \\
\vdots \\
\hat{c}_n(h)/\sigma_h
\end{pmatrix}
\overset{D}{\to} N(0, I_h).
\]

It is straightforward to estimate the $\sigma_r^2$ with

\[
\hat{\sigma}_r^2 = \frac{1}{n} \sum_{t=1}^{n} X_t^2 X_{t+r}^2.
\]

Thus a similar squared distance as the Box-Pierce test is used to define the Robust Portmanteau
test, which is defined as

\[
R_h = n \sum_{r=1}^{h} \frac{|\hat{c}_n(r)|^2}{\hat{\sigma}_r^2}.
\]

Under the null hypothesis (assuming stationarity and martingale differences) asymptotically $R_h \overset{D}{\to} \chi^2_h$
(for $h$ kept fixed).
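A minimal R sketch of the statistic $R_h$ is given below. The function names robust_portmanteau and simulate_arch1, the ARCH(1) parameters and the burn-in length are illustrative assumptions. Since $X_{t+r}$ is only observed for $t \leq n - r$, the sums defining $\hat{c}_n(r)$ and $\hat{\sigma}_r^2$ are taken over $t = 1, \ldots, n-r$ (still divided by n), and the series is assumed to have mean zero.

```r
# Robust Portmanteau statistic R_h and its chi-squared(h) p-value.
# Assumes x has mean zero, so no mean correction is applied.
robust_portmanteau <- function(x, h) {
  n <- length(x)
  stat <- 0
  for (r in 1:h) {
    x0 <- x[1:(n - r)]
    xr <- x[(1 + r):n]
    c_r  <- sum(x0 * xr) / n        # sample autocovariance at lag r
    s2_r <- sum(x0^2 * xr^2) / n    # estimator of sigma_r^2 = E(X_0^2 X_r^2)
    stat <- stat + n * c_r^2 / s2_r
  }
  c(statistic = stat, p.value = pchisq(stat, df = h, lower.tail = FALSE))
}

# Hypothetical helper: simulate an ARCH(1) series X_t = Z_t * sigma_t,
# sigma_t^2 = a0 + a1 * X_{t-1}^2, discarding a burn-in period.
simulate_arch1 <- function(n, a0 = 1, a1 = 0.5, burn = 100) {
  z <- rnorm(n + burn)
  x <- numeric(n + burn)
  x[1] <- sqrt(a0) * z[1]
  for (t in 2:(n + burn)) x[t] <- z[t] * sqrt(a0 + a1 * x[t - 1]^2)
  x[(burn + 1):(n + burn)]
}

set.seed(1)  # illustrative seed
x <- simulate_arch1(500)
out <- robust_portmanteau(x, h = 2)
```

Comparing out["p.value"] with the p-value from Box.test(x, lag = 2) over many replications is one way to reproduce a comparison of the kind reported in Table 8.1.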

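Remark 8.3.2 (ARCH and the Robust Portmanteau test) If I remember correctly the reason
the above condition holds for ARCH models is (we assume wlog $s_2 > s_1$)

\begin{align*}
E[X_t^2X_{s_1}X_{s_2}] &= E[\eta_t^2]E[\sigma_t^2\sigma_{s_2}\sigma_{s_1}\eta_{s_2}\eta_{s_1}] \\
 &= E[\eta_t^2]E\big[\eta_{s_2}\eta_{s_1}E[\sigma_t^2\sigma_{s_2}\sigma_{s_1}|\mathcal{F}_{s_1-1}]\big] \\
 &= E[\eta_{s_1}]E[\eta_{s_2}]E[\sigma_t^2\sigma_{s_2}\sigma_{s_1}|\mathcal{F}_{s_1-1}] = 0.
\end{align*}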


Figure 8.3: Using ARCH(1) time series over 200 replications. Left: $S_2$ against the quantiles
of a chi-square distribution with 2df for an ARCH process. Right: $R_2$ against the quantiles
of a chi-square distribution with 2df for an ARCH process.

To see how this test performs, in the right hand plot in Figure 8.3 we give the quantile quantile
plot of Rh against the chi-squared distribution. We observe that it lies pretty much on the x = y
line. Moreover, the test results at the 5% level are given in Table 8.1. We observe that it is close
to the stated 5% level and performs far better than the classical Box-Pierce test.

Time series   Test                  Proportion of rejections
ARCH          Box-Pierce            26%
ARCH          Robust Portmanteau    4.5%

Table 8.1: Proportion of rejections under the null hypothesis. Test done at the 5% level over
200 replications.

The robust Portmanteau test is a useful generalisation of the Box-Pierce test, however it still
requires that the time series under the null satisfies the martingale difference property and the
moment condition. These conditions cannot be verified. Consider for example the uncorrelated
time series

\[
X_{t+1} = \sum_{j=0}^{\infty} \phi^j \varepsilon_{t-j} - \frac{\phi}{1 - \phi^2}\varepsilon_{t+1}
\]

where {εt } are uncorrelated random variables from the ARCH process εt = Zt σt and σt2 = a0 +
a1 ε2t−1 . Despite εt being martingale differences, Xt are not martingale differences. Thus the robust
Portmanteau test will not necessarily give satisfactory results for this uncorrelated time series.

Methods have been developed for these more general time series, including:

• The robust test for white noise proposed in Dalla et al. (2019).

• Bootstrap methods. These include the block bootstrap (Künsch (1989), Liu and Singh (1992)
and Lahiri (2003)), the stationary bootstrap (Politis and Romano (1994)), the sieve bootstrap
(Kreiss (1992) and Kreiss et al. (2011)) and the spectral bootstrap (Hurvich and Zeger (1987),
Franke and Härdle (1992), Dahlhaus and Janas (1996) and Dette and Paparoditis (2009)).
Please keep in mind that this is an incomplete list.

• Estimating the variance of the sample covariance using spectral methods or long-run variance
methods (fixed-b asymptotics have been used together with these to obtain a more reliable finite
sample approximation of the distribution).

Finally a few remarks about ACF plots in general:

• It is clear that the theoretical autocorrelation function of an MA(q) process is such that
$\rho(r) = 0$ if $|r| > q$. Thus from the theoretical ACF we can determine the order of the
process. By a similar argument the variance matrix of an MA(q) will be banded, where
the band is of order q.

However, we cannot determine the order of a moving average process from the empirical
ACF plot. The critical values seen in the plot only correspond to the case that the process is iid;
they cannot be used as a guide for determining order.

• Often a model is fitted to a time series and the residuals are evaluated. To see if the model was
appropriate, an ACF plot of the empirical correlations corresponding to the estimated residuals is made.
Even if the true residuals are iid, the variance of the empirical residual correlations will not
be (??). Li (1992) shows that the variance depends on the sampling properties of the model
estimator.

• Misspecification, when the time series contains a time-dependent trend.

8.4 Checking for partial correlation


We recall that the partial correlation of a stationary time series at lag $m$ is given by the last coefficient
of the best linear predictor of $X_{m+1}$ given $\{X_j\}_{j=1}^{m}$, i.e. $\phi_m$ where $\hat{X}_{m+1|m} = \sum_{j=1}^{m} \phi_j X_{m+1-j}$. Thus
$\phi_m$ can be estimated using the Yule-Walker estimator or least squares (more of this later) and the
sampling properties of the estimator are determined by the sampling properties of the estimator of
an AR(m) process. We state these now. We assume $\{X_t\}$ is an AR(p) time series of the form

\[
X_t = \sum_{j=1}^{p} \phi_j X_{t-j} + \varepsilon_t
\]

where $\{\varepsilon_t\}$ are iid random variables with mean zero and variance $\sigma^2$. Suppose an AR(m) model is
fitted to the data using the Yule-Walker estimator; we denote this estimator as $\hat{\phi}_m = \hat{\Sigma}_m^{-1} r_m$. Let
$\hat{\phi}_m = (\hat{\phi}_{m1}, \ldots, \hat{\phi}_{mm})$; the estimator of the partial correlation at lag $m$ is $\hat{\phi}_{mm}$. Assume $m \geq p$.
Then by using Theorem 9.2.1 (see also Theorem 8.1.2, Brockwell and Davis (1998)) we have

\[
\sqrt{n}\left(\hat{\phi}_m - \phi_m\right) \overset{D}{\to} N(0, \sigma^2 \Sigma_m^{-1}),
\]

where $\phi_m$ are the true parameters. If $m > p$, then $\phi_m = (\phi_1, \ldots, \phi_p, 0, \ldots, 0)$ and the last coefficient
has the marginal distribution

\[
\sqrt{n}\hat{\phi}_{mm} \overset{D}{\to} N(0, \sigma^2 \Sigma^{mm}),
\]

where $\Sigma^{mm}$ denotes the last diagonal entry of $\Sigma_m^{-1}$. Since $m > p$, we can obtain a closed form expression
for $\Sigma^{mm}$. By using Remark 6.3.1 we have $\Sigma^{mm} = \sigma^{-2}$, thus

\[
\sqrt{n}\hat{\phi}_{mm} \overset{D}{\to} N(0, 1).
\]

Therefore, for lags $m > p$ the partial correlations will be asymptotically pivotal. The error bars in
the partial correlations are $[-1.96n^{-1/2}, 1.96n^{-1/2}]$ and these can be used as a guide in determining
the order of the autoregressive process (note there will be dependence between the partial correlations
at different lags).
This is quite a surprising result and very different to the behaviour of the sample autocorrelation
function of an MA(p) process.
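As an illustration of how these error bars are used, the sketch below simulates an AR(1) (so p = 1) and checks which sample partial correlations fall outside $\pm 1.96n^{-1/2}$. The coefficient 0.7, the sample size and the seed are illustrative assumptions; pacf() is the standard function in the stats package and returns the sample partial correlations starting at lag 1.

```r
set.seed(1)   # illustrative seed
n <- 400
x <- arima.sim(list(ar = 0.7), n = n)   # AR(1), so p = 1

# Sample partial correlations at lags 1, ..., 10
phi_hat <- as.vector(pacf(x, lag.max = 10, plot = FALSE)$acf)

# Pivotal error bars, valid for lags m > p
band <- 1.96 / sqrt(n)
outside <- which(abs(phi_hat) > band)
```

For an AR(1) one would expect lag 1 to appear in outside, while each lag beyond 1 exceeds the band with probability roughly 5%.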

Exercise 8.4

(a) Simulate a mean zero invertible MA(1) process (use Gaussian errors). Use a reasonable sample
size (say n = 200). Evaluate the sample correlation at lag 2, $\hat{\rho}_n(2)$. Note the sample correlation
at lag two is estimating 0. Do this 500 times.

• Calculate the proportion of sample correlations with $|\hat{\rho}_n(2)| > 1.96/\sqrt{n}$.

• Make a QQplot of $\sqrt{n}\hat{\rho}_n(2)$ against a standard normal distribution. What do you observe?

(b) Simulate a causal, stationary AR(1) process (use Gaussian errors). Use a reasonable sample
size (say n = 200). Evaluate the sample partial correlation at lag 2, $\hat{\phi}_n(2)$. Note the sample partial
correlation at lag two is estimating 0. Do this 500 times.

• Calculate the proportion of sample partial correlations with $|\hat{\phi}_n(2)| > 1.96/\sqrt{n}$.

• Make a QQplot of $\sqrt{n}\hat{\phi}_n(2)$ against a standard normal distribution. What do you observe?

8.5 The Newey-West (HAC) estimator


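In this section we focus on the estimation of the variance of

\[
\hat{\theta}_n = \frac{1}{n} \sum_{t=1}^{n} u_t \varepsilon_t,
\]

where $\{u_t\}_t$ are deterministic regressors and $\{\varepsilon_t\}$ is a time series. Quantities of the form $\hat{\theta}_n$ arise in
several applications. One important application is in linear regression, which we summarize below.
In Section 3.2.1 we showed that the least squares estimator of the coefficients in

\[
Y_t = \beta_0 + \sum_{j=1}^{p} \beta_j u_{t,j} + \varepsilon_t = \beta' u_t + \varepsilon_t,
\]

is

\[
\hat{\beta}_n = \arg\min L_n(\beta) = \Big(\sum_{t=1}^{n} u_t u_t'\Big)^{-1} \sum_{t=1}^{n} Y_t u_t.
\]

The variance of $\hat{\beta}_n$ is derived using

\begin{align*}
\big[\hat{\beta}_n - \beta\big]' \sum_{t=1}^{n} u_t u_t' &= \sum_{t=1}^{n} u_t' \varepsilon_t \\
\Rightarrow \big[\hat{\beta}_n - \beta\big] &= \Big(\sum_{t=1}^{n} u_t u_t'\Big)^{-1} \sum_{t=1}^{n} u_t \varepsilon_t
 = \Big(\frac{1}{n}\sum_{t=1}^{n} u_t u_t'\Big)^{-1} \frac{1}{n}\sum_{t=1}^{n} u_t \varepsilon_t.
\end{align*}
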
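Using this expression we have

\[
\mathrm{var}\big[\hat{\beta}_n\big] = \Big(\frac{1}{n}\sum_{t=1}^{n} u_t u_t'\Big)^{-1}
 \mathrm{var}\Big(\frac{1}{n}\sum_{t=1}^{n} u_t \varepsilon_t\Big)
 \Big(\frac{1}{n}\sum_{t=1}^{n} u_t u_t'\Big)^{-1}.
\]

Hence the variance of $\hat{\beta}_n$ is based on $\mathrm{var}\big(\frac{1}{n}\sum_{t=1}^{n} u_t \varepsilon_t\big)$ which is

\begin{align*}
\mathrm{var}\Big(\frac{1}{n}\sum_{t=1}^{n} u_t \varepsilon_t\Big)
 &= \frac{1}{n^2} \sum_{t,\tau=1}^{n} \mathrm{cov}[\varepsilon_t, \varepsilon_\tau] u_t u_\tau' \\
 &= \frac{1}{n^2} \sum_{t=1}^{n} \mathrm{var}[\varepsilon_t] u_t u_t'
  + \frac{1}{n^2} \sum_{t=1}^{n} \sum_{\tau \neq t} \mathrm{cov}[\varepsilon_t, \varepsilon_\tau] u_t u_\tau' \\
 &= \frac{1}{n^2} \sum_{t=1}^{n} \sum_{\tau=1}^{n} \mathrm{cov}[\varepsilon_t, \varepsilon_\tau] u_t u_\tau'.
\end{align*}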

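In the case of stationarity of $\{\varepsilon_t\}_t$, the above reduces to

\[
n\,\mathrm{var}\Big(\frac{1}{n}\sum_{t=1}^{n} u_t \varepsilon_t\Big) = \frac{1}{n} \sum_{t=1}^{n} \sum_{\tau=1}^{n} c(t - \tau) u_t u_\tau', \tag{8.19}
\]

where $c(t - \tau) = \mathrm{cov}[\varepsilon_t, \varepsilon_\tau]$.

To motivate the estimator of (8.19), we start with the special case that $u_t = 1$ for
all $t$. In this case (8.19) reduces to

\[
n\,\mathrm{var}\Big(\frac{1}{n}\sum_{t=1}^{n} \varepsilon_t\Big) = \frac{1}{n} \sum_{t=1}^{n} \sum_{\tau=1}^{n} c(t - \tau). \tag{8.20}
\]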

Since $E[\varepsilon_t \varepsilon_\tau] = c(t - \tau)$, as an estimator of the above we can potentially replace $c(t - \tau)$ with $\varepsilon_t \varepsilon_\tau$
to give the estimator

\[
\hat{\sigma}^2_{n,n} = \frac{1}{n} \sum_{t=1}^{n} \sum_{\tau=1}^{n} \varepsilon_t \varepsilon_\tau
 = \underbrace{\sum_{r=-n}^{n} \hat{c}_r}_{\text{due to a change of variables}},
\]

where $\hat{c}_r = n^{-1} \sum_{t=1}^{n-|r|} \varepsilon_t \varepsilon_{t+r}$. We recall that in Section 8.2.1 we studied the sampling properties of
$\hat{c}_r$ and showed that $\mathrm{var}[\hat{c}_r] = O(1/n)$. As $\sum_{r=-n}^{n} \hat{c}_r$ consists of the sum of all $n$ sample covariances,
this would suggest $\mathrm{var}[\hat{\sigma}^2_{n,n}] = O(\sum_{r=1}^{n} n^{-1}) = O(1)$. Thus $\hat{\sigma}^2_{n,n}$ is an inconsistent estimator of the
variance. Calculations show that this is indeed the case. We discuss a very similar issue in Section
11.3 when estimating the spectral density function.
However, $\hat{\sigma}^2_{n,n}$ suggests an alternative approach to estimation. As the autocovariance decays
as the lag $r$ grows, it is not necessary to estimate all the covariances; instead we can truncate the
number of covariances to be estimated, i.e. use

\[
\hat{\sigma}^2_{m,n} = \sum_{r=-m}^{m} \lambda_m(r) \hat{c}_r = \frac{1}{n} \sum_{t=1}^{n} \sum_{\tau=1}^{n} \lambda_m(t - \tau) \varepsilon_t \varepsilon_\tau, \tag{8.21}
\]

where $\lambda_m(r)$ is a so-called lag window which is zero for $|r| > m$. It can be shown that this
truncation technique induces a bias in the estimation scheme (i.e. $E[\hat{\sigma}^2_{m,n}] \neq n\,\mathrm{var}\big(\frac{1}{n}\sum_{t=1}^{n} \varepsilon_t\big)$) but
the variance converges to zero ($\mathrm{var}[\hat{\sigma}^2_{m,n}] = O(m/n)$). By balancing the bias and variance we can
find a suitable choice of $m$ such that $\hat{\sigma}^2_{m,n}$ is a consistent estimator of $\sigma^2$.
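The estimator $\hat{\sigma}^2_{m,n}$ can be generalized to include the case $u_t \neq 1$ and nonstationary errors $\{\varepsilon_t\}$.
We recall that

\[
n\,\mathrm{var}\Big(\frac{1}{n}\sum_{t=1}^{n} u_t \varepsilon_t\Big) = \frac{1}{n} \sum_{t=1}^{n} \sum_{\tau=1}^{n} \mathrm{cov}[\varepsilon_t, \varepsilon_\tau] u_t u_\tau'. \tag{8.22}
\]

We assume that $|\mathrm{cov}[\varepsilon_t, \varepsilon_\tau]| \leq |t - \tau|^{-\kappa}$ (where $\kappa > 1$). Since $\varepsilon_t \varepsilon_\tau$ can be treated as a “preestimator”
(an initial estimator) of $\mathrm{cov}[\varepsilon_t, \varepsilon_\tau]$ we replace $\mathrm{cov}[\varepsilon_t, \varepsilon_\tau]$ in (8.22) with $\varepsilon_t \varepsilon_\tau$ and weight it with $\lambda_m(\cdot)$
to yield the Newey-West/HAC estimator

\[
\hat{\sigma}^2_{m,n} = \frac{1}{n} \sum_{t=1}^{n} \sum_{\tau=1}^{n} \lambda_{m,n}(t - \tau) \varepsilon_t \varepsilon_\tau u_t u_\tau'. \tag{8.23}
\]

Choices of weight functions are discussed in Section 11.3.1. The estimator (8.23) is closely related
to spectral density estimation at frequency zero. The sampling properties of $\hat{\sigma}^2_{m,n}$ are similar to
those of spectral density estimation and can be found in (11.3.1). Further details can be found in
Andrews (1990).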
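For the special case $u_t = 1$, the truncated estimator (8.21) can be sketched in a few lines of R. The Bartlett (triangular) lag window $\lambda_m(r) = 1 - |r|/(m+1)$ used here is one common choice and is an assumption of this sketch, as are the function name lrv_bartlett and the simulated AR(1) errors; the text leaves the choice of window to Section 11.3.1.

```r
# Lag-window estimator of n * var(mean of e), equation (8.21) with u_t = 1.
# Assumes e has mean zero and uses the Bartlett window (an assumed choice).
lrv_bartlett <- function(e, m) {
  n <- length(e)
  stopifnot(m >= 1, m < n)
  # Sample autocovariance at lag r (no mean correction, divisor n)
  c_hat <- function(r) sum(e[1:(n - r)] * e[(1 + r):n]) / n
  lam <- 1 - (1:m) / (m + 1)   # window weights for lags 1, ..., m
  # c_hat(r) = c_hat(-r), so the two-sided sum folds into lag 0 plus twice lags 1..m
  c_hat(0) + 2 * sum(lam * sapply(1:m, c_hat))
}

set.seed(1)  # illustrative seed
e <- as.numeric(arima.sim(list(ar = 0.5), n = 500))
sigma2_hat <- lrv_bartlett(e, m = 10)
```

Varying m in the last line shows the bias and variance trade-off described above: small m ignores covariances at higher lags, while large m adds many noisy terms.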

8.6 Checking for Goodness of fit (advanced)


To check for adequacy of a model, after fitting a model to the data the sample correlation of the
estimated residuals is evaluated. If there appears to be no correlation in the estimated residuals
(so the residuals are near uncorrelated) then the model is determined to adequately fit the data.

Consider the general model

Xt = g(Yt , θ) + εt

where $\{\varepsilon_t\}$ are iid random variables and $\varepsilon_t$ is independent of $Y_t, Y_{t-1}, \ldots$. Note $Y_t$ can be a vector,
such as $Y_t = (X_{t-1}, X_{t-2}, \ldots, X_{t-p})$, and examples of models which satisfy the above include the
AR(p) process. We will assume that $\{X_t, Y_t\}$ is a stationary ergodic process. Further, to simplify
the discussion we will assume that $\theta$ is univariate; it is straightforward to generalize the discussion
below to the multivariate case.
Let $\hat{\theta}$ denote the least squares estimator of $\theta$, i.e.

\[
\hat{\theta} = \arg\min_{\theta} \sum_{t=1}^{n} (X_t - g(Y_t, \theta))^2. \tag{8.24}
\]

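Using the “usual” Taylor expansion methods (and assuming all the usual conditions are satisfied,
such as $|\hat{\theta} - \theta| = O_p(n^{-1/2})$ etc) it can be shown that

\[
\sqrt{n}\left(\hat{\theta} - \theta\right) = I^{-1} \frac{1}{\sqrt{n}} \sum_{t=1}^{n} \varepsilon_t \frac{\partial g(Y_t, \theta)}{\partial \theta} + o_p(1)
 \quad \text{where} \quad I = E\left(\frac{\partial g(Y_t, \theta)}{\partial \theta}\right)^2.
\]

$\{\varepsilon_t \frac{\partial g(Y_t, \theta)}{\partial \theta}\}$ are martingale differences, which is why $\sqrt{n}(\hat{\theta} - \theta)$ is asymptotically normal, but more
of this in the next chapter. Let $L_n(\theta)$ denote the least squares criterion. Note that the above is
true because

\[
\frac{\partial L_n(\theta)}{\partial \theta} = -2 \sum_{t=1}^{n} [X_t - g(Y_t, \theta)] \frac{\partial g(Y_t, \theta)}{\partial \theta}
\]

and

\[
\frac{\partial^2 L_n(\theta)}{\partial \theta^2} = -2 \sum_{t=1}^{n} [X_t - g(Y_t, \theta)] \frac{\partial^2 g(Y_t, \theta)}{\partial \theta^2}
 + 2 \sum_{t=1}^{n} \left(\frac{\partial g(Y_t, \theta)}{\partial \theta}\right)^2,
\]

thus at the true parameter, $\theta$,

\[
\frac{1}{n} \frac{\partial^2 L_n(\theta)}{\partial \theta^2} \overset{P}{\to} 2I.
\]
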

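Based on (8.24) we estimate the residuals using

\[
\hat{\varepsilon}_t = X_t - g(Y_t, \hat{\theta})
\]

and the sample correlation with $\hat{\rho}(r) = \hat{c}(r)/\hat{c}(0)$ where

\[
\hat{c}(r) = \frac{1}{n} \sum_{t=1}^{n-|r|} \hat{\varepsilon}_t \hat{\varepsilon}_{t+r}.
\]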

Often it is (wrongly) assumed that one can simply apply the results in Section 8.3 when checking
for adequacy of the model. That is, make an ACF plot of $\hat{\rho}(r)$ and use $[-n^{-1/2}, n^{-1/2}]$ as the error
bars. However, since the parameters have been estimated the size of the error bars will change. In
particular, under the null that the model is correct we will show that

\[
\sqrt{n}\hat{\rho}(r) \overset{D}{\to} N\Big(0, \underbrace{1}_{\text{iid part}} - \underbrace{\frac{\sigma^2}{c(0)} J_r I^{-1} J_r}_{\text{due to parameter estimation}}\Big)
\]

where $c(0) = \mathrm{var}[X_t]$, $\sigma^2 = \mathrm{var}(\varepsilon_t)$, $J_r = E\big[\frac{\partial g(Y_{t+r}, \theta)}{\partial \theta} \varepsilon_t\big]$ and $I = E\big(\frac{\partial g(Y_t, \theta)}{\partial \theta}\big)^2$ (see, for example,
Li (1992)). Thus the error bars under the null are

\[
\pm \frac{1}{\sqrt{n}} \left[1 - \frac{\sigma^2}{c(0)} J_r I^{-1} J_r\right]^{1/2}.
\]

Estimation of the parameters means the inclusion of the term $\frac{\sigma^2}{c(0)} J_r I^{-1} J_r$. If the lag $r$ is not too
small then $J_r$ will be close to zero and the $[\pm 1/\sqrt{n}]$ approximation is fine, but for small $r$, $J_r I^{-1} J_r$
can be large and positive, thus the error bars, $\pm n^{-1/2}$, are too wide. Thus one needs to be a little
cautious when interpreting the $\pm n^{-1/2}$ error bars. Note if there is no dependence between $\varepsilon_t$ and
$Y_{t+r}$ then using the usual error bars is fine.

Remark 8.6.1 The fact that the error bars get narrower after fitting a model to the data seems
a little strange. However, it is far from unusual. One explanation is that the variance of the
estimated residuals tends to be less than that of the true residuals (since the estimated residuals contain
less information about the process than the true residuals). The simplest example is iid
observations $\{X_i\}_{i=1}^{n}$ with mean $\mu$ and variance $\sigma^2$. The variance of the “estimated residual”
$X_i - \bar{X}$ is $(n - 1)\sigma^2/n$.

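We now derive the above result (using lots of Taylor expansions). By making a Taylor expansion
similar to (??) we have

\[
\sqrt{n}\left[\hat{\rho}_n(r) - \rho(r)\right] = \sqrt{n}\frac{[\hat{c}_n(r) - c(r)]}{c(0)} - \sqrt{n}\left[\hat{c}_n(0) - c(0)\right]\frac{c(r)}{c(0)^2} + O_p(n^{-1/2}).
\]

However, under the “null” that the correct model was fitted to the data we have $c(r) = 0$ for $|r| > 0$,
this gives

\[
\sqrt{n}\hat{\rho}_n(r) = \sqrt{n}\frac{\hat{c}_n(r)}{c(0)} + o_p(1),
\]

thus the sampling properties of $\hat{\rho}_n(r)$ are determined by $\hat{c}_n(r)$, and we focus on this term. It is
easy to see that

\[
\sqrt{n}\hat{c}_n(r) = \frac{1}{\sqrt{n}} \sum_{t=1}^{n-r} \left(\varepsilon_t + g(\theta, Y_t) - g(\hat{\theta}, Y_t)\right)\left(\varepsilon_{t+r} + g(\theta, Y_{t+r}) - g(\hat{\theta}, Y_{t+r})\right).
\]

Heuristically, by expanding the above, we can see that

\[
\sqrt{n}\hat{c}_n(r) \approx \frac{1}{\sqrt{n}} \sum_{t=1}^{n-r} \varepsilon_t \varepsilon_{t+r}
 + \frac{1}{\sqrt{n}} \sum_{t=1}^{n} \varepsilon_{t+r}\left(g(\theta, Y_t) - g(\hat{\theta}, Y_t)\right)
 + \frac{1}{\sqrt{n}} \sum_{t=1}^{n} \varepsilon_t\left(g(\theta, Y_{t+r}) - g(\hat{\theta}, Y_{t+r})\right),
\]

then by making a Taylor expansion of $g(\hat{\theta}, \cdot)$ about $g(\theta, \cdot)$ (to take $(\hat{\theta} - \theta)$ out of the sum)

\begin{align*}
\sqrt{n}\hat{c}_n(r) &\approx \frac{1}{\sqrt{n}} \sum_{t=1}^{n-r} \varepsilon_t \varepsilon_{t+r}
 - \frac{(\hat{\theta} - \theta)}{\sqrt{n}} \sum_{t=1}^{n} \left[\frac{\partial g(\theta, Y_t)}{\partial \theta}\varepsilon_{t+r} + \frac{\partial g(\theta, Y_{t+r})}{\partial \theta}\varepsilon_t\right] + o_p(1) \\
 &= \frac{1}{\sqrt{n}} \sum_{t=1}^{n-r} \varepsilon_t \varepsilon_{t+r}
 - \sqrt{n}(\hat{\theta} - \theta) \frac{1}{n} \sum_{t=1}^{n} \left[\frac{\partial g(\theta, Y_t)}{\partial \theta}\varepsilon_{t+r} + \frac{\partial g(\theta, Y_{t+r})}{\partial \theta}\varepsilon_t\right] + o_p(1).
\end{align*}

We make this argument precise below. Making a Taylor expansion we have

\begin{align*}
\sqrt{n}\hat{c}_n(r) &= \frac{1}{\sqrt{n}} \sum_{t=1}^{n-r}
 \left(\varepsilon_t - (\hat{\theta} - \theta)\frac{\partial g(\theta, Y_t)}{\partial \theta} - \frac{(\hat{\theta} - \theta)^2}{2}\frac{\partial^2 g(\bar{\theta}_t, Y_t)}{\partial \theta^2}\right) \times \\
 &\qquad\qquad \left(\varepsilon_{t+r} - (\hat{\theta} - \theta)\frac{\partial g(\theta, Y_{t+r})}{\partial \theta} - \frac{(\hat{\theta} - \theta)^2}{2}\frac{\partial^2 g(\bar{\theta}_{t+r}, Y_{t+r})}{\partial \theta^2}\right) \\
 &= \sqrt{n}\tilde{c}_n(r) - \sqrt{n}(\hat{\theta} - \theta)\frac{1}{n} \sum_{t=1}^{n-r}
 \left(\varepsilon_t \frac{\partial g(\theta, Y_{t+r})}{\partial \theta} + \varepsilon_{t+r}\frac{\partial g(\theta, Y_t)}{\partial \theta}\right) + O_p(n^{-1/2}) \tag{8.25}
\end{align*}
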

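where $\bar{\theta}_t$ lies between $\hat{\theta}$ and $\theta$ and

\[
\tilde{c}_n(r) = \frac{1}{n} \sum_{t=1}^{n-r} \varepsilon_t \varepsilon_{t+r}.
\]

We recall that by using ergodicity we have

\[
\frac{1}{n} \sum_{t=1}^{n-r} \left(\varepsilon_t \frac{\partial g(\theta, Y_{t+r})}{\partial \theta} + \varepsilon_{t+r}\frac{\partial g(\theta, Y_t)}{\partial \theta}\right)
 \overset{a.s.}{\to} E\left(\varepsilon_t \frac{\partial g(\theta, Y_{t+r})}{\partial \theta}\right) = J_r,
\]

where we use that $\varepsilon_{t+r}$ and $\frac{\partial g(\theta, Y_t)}{\partial \theta}$ are independent. Substituting this into (8.25) gives

\begin{align*}
\sqrt{n}\hat{c}_n(r) &= \sqrt{n}\tilde{c}_n(r) - \sqrt{n}(\hat{\theta} - \theta)J_r + o_p(1) \\
 &= \sqrt{n}\tilde{c}_n(r) - I^{-1}J_r \underbrace{\frac{1}{\sqrt{n}} \sum_{t=1}^{n-r} \frac{\partial g(Y_t, \theta)}{\partial \theta}\varepsilon_t}_{= -\frac{\sqrt{n}}{2}\frac{\partial L_n(\theta)}{\partial \theta}} + o_p(1).
\end{align*}

Asymptotic normality of $\sqrt{n}\hat{c}_n(r)$ can be shown by showing asymptotic normality of the bivariate
vector $\sqrt{n}\big(\tilde{c}_n(r), \frac{\partial L_n(\theta)}{\partial \theta}\big)$. Therefore all that remains is to obtain the asymptotic variance of the
above (which will give the desired result);

\begin{align*}
&\mathrm{var}\left(\sqrt{n}\tilde{c}_n(r) + \frac{\sqrt{n}}{2} I^{-1}J_r \frac{\partial L_n(\theta)}{\partial \theta}\right) \\
&= \underbrace{\mathrm{var}\left(\sqrt{n}\tilde{c}_n(r)\right)}_{=1}
 + 2I^{-1}J_r\,\mathrm{cov}\left(\sqrt{n}\tilde{c}_n(r), \frac{\sqrt{n}}{2}\frac{\partial L_n(\theta)}{\partial \theta}\right)
 + I^{-2}J_r^2\,\mathrm{var}\left(\frac{\sqrt{n}}{2}\frac{\partial L_n(\theta)}{\partial \theta}\right). \tag{8.26}
\end{align*}

We evaluate the two covariances above;

\begin{align*}
&\mathrm{cov}\left(\sqrt{n}\tilde{c}_n(r), -\frac{\sqrt{n}}{2}\frac{\partial L_n(\theta)}{\partial \theta}\right)
 = \frac{1}{n} \sum_{t_1,t_2=1}^{n-r} \mathrm{cov}\left(\varepsilon_{t_1}\varepsilon_{t_1+r}, \varepsilon_{t_2}\frac{\partial g(Y_{t_2}, \theta)}{\partial \theta}\right) \\
&= \frac{1}{n} \sum_{t_1,t_2=1}^{n-r} \bigg[\mathrm{cov}\{\varepsilon_{t_1}, \varepsilon_{t_2}\}\,\mathrm{cov}\left\{\varepsilon_{t_1+r}, \frac{\partial g(Y_{t_2}, \theta)}{\partial \theta}\right\}
 + \mathrm{cov}\{\varepsilon_{t_1+r}, \varepsilon_{t_2}\}\,\mathrm{cov}\left\{\varepsilon_{t_1}, \frac{\partial g(Y_{t_2}, \theta)}{\partial \theta}\right\} \\
&\qquad\qquad + \mathrm{cum}\left\{\varepsilon_{t_1}, \varepsilon_{t_1+r}, \varepsilon_{t_2}, \frac{\partial g(Y_{t_2}, \theta)}{\partial \theta}\right\}\bigg]
 = \sigma^2 E\left[\varepsilon_t \frac{\partial g(Y_{t+r}, \theta)}{\partial \theta}\right] = \sigma^2 J_r.
\end{align*}

Similarly we have

\[
\mathrm{var}\left(\frac{\sqrt{n}}{2}\frac{\partial L_n(\theta)}{\partial \theta}\right)
 = \frac{1}{n} \sum_{t_1,t_2=1}^{n} \mathrm{cov}\left(\varepsilon_{t_1}\frac{\partial g(Y_{t_1}, \theta)}{\partial \theta}, \varepsilon_{t_2}\frac{\partial g(Y_{t_2}, \theta)}{\partial \theta}\right)
 = \sigma^2 E\left(\frac{\partial g(Y_{t_1}, \theta)}{\partial \theta}\right)^2 = \sigma^2 I.
\]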


Substituting the above into (8.26) gives the asymptotic variance of $\sqrt{n}\hat{c}(r)$ to be

\[
1 - \sigma^2 J_r I^{-1} J_r.
\]

Thus we obtain the required result

\[
\sqrt{n}\hat{\rho}(r) \overset{D}{\to} N\left(0, 1 - \frac{\sigma^2}{c(0)} J_r I^{-1} J_r\right).
\]
c(0)

8.7 Long range dependence (long memory) versus changes in the mean

A process is said to have long range dependence if the autocovariances are not absolutely summable,
i.e. $\sum_k |c(k)| = \infty$. A nice historical background on long memory is given in this paper.
From a practical point of view data is said to exhibit long range dependence if the autocovari-
ances do not decay very fast to zero as the lag increases. Returning to the Yahoo data considered
in Section 13.1.1 we recall that the ACF plot of the absolute log differences, given again in Figure
8.4, appears to exhibit this type of behaviour. However, it has been argued by several authors that
the ‘appearance of long memory’ is really because a time-dependent mean has not been corrected
for. Could this be the reason we see the ‘memory’ in the log differences?

Figure 8.4: ACF plot of the absolute of the log differences.

We now demonstrate that one must be careful when diagnosing long range dependence, because
a slow or absent decay of the autocovariance could also imply a time-dependent mean that has not been
corrected for. This was shown in Bhattacharya et al. (1983), and applied to econometric data in
Mikosch and Stărică (2000) and Mikosch and Stărică (2003). A test for distinguishing between long
range dependence and change points is proposed in Berkes et al. (2006).
Suppose that Yt satisfies

Yt = µt + εt ,

where {εt } are iid random variables and the mean µt depends on t. We observe {Yt } but do not
know the mean is changing. We want to evaluate the autocovariance function, hence estimate the
autocovariance at lag k using

$$
\hat{c}_n(k) = \frac{1}{n}\sum_{t=1}^{n-|k|}(Y_t - \bar{Y}_n)(Y_{t+|k|} - \bar{Y}_n).
$$

Observe that $\bar{Y}_n$ is not really estimating the mean but the average mean! If we plotted the empirical
ACF $\{\hat{c}_n(k)\}$ we would see that the covariances do not decay as the lag grows. However, the true ACF
is zero at all lags except lag zero. The reason the empirical ACF does not decay to zero is
because we have not corrected for the time-dependent mean. Indeed it can be shown that

$$
\begin{aligned}
\hat{c}_n(k) &= \frac{1}{n}\sum_{t=1}^{n-|k|}(Y_t - \mu_t + \mu_t - \bar{Y}_n)(Y_{t+|k|} - \mu_{t+k} + \mu_{t+k} - \bar{Y}_n) \\
&\approx \frac{1}{n}\sum_{t=1}^{n-|k|}(Y_t - \mu_t)(Y_{t+|k|} - \mu_{t+k}) + \frac{1}{n}\sum_{t=1}^{n-|k|}(\mu_t - \bar{Y}_n)(\mu_{t+k} - \bar{Y}_n) \\
&\approx \underbrace{c(k)}_{\text{true autocovariance}=0} + \underbrace{\frac{1}{n}\sum_{t=1}^{n-|k|}(\mu_t - \bar{Y}_n)(\mu_{t+k} - \bar{Y}_n)}_{\text{additional term due to time-dependent mean}}.
\end{aligned}
$$

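Expanding the second term and assuming that $k \ll n$ and $\mu_t \approx \mu(t/n)$ (and is thus smooth) we
have

$$
\begin{aligned}
\frac{1}{n}\sum_{t=1}^{n-|k|}(\mu_t - \bar{Y}_n)(\mu_{t+k} - \bar{Y}_n)
&\approx \frac{1}{n}\sum_{t=1}^{n}\mu_t^2 - \left(\frac{1}{n}\sum_{t=1}^{n}\mu_t\right)^2 + o_p(1) \\
&= \frac{1}{n^2}\sum_{s=1}^{n}\sum_{t=1}^{n}\mu_t^2 - \left(\frac{1}{n}\sum_{t=1}^{n}\mu_t\right)^2 + o_p(1) \\
&= \frac{1}{n^2}\sum_{s=1}^{n}\sum_{t=1}^{n}\mu_t(\mu_t - \mu_s)
= \frac{1}{n^2}\sum_{s=1}^{n}\sum_{t=1}^{n}(\mu_t - \mu_s)^2 + \underbrace{\frac{1}{n^2}\sum_{s=1}^{n}\sum_{t=1}^{n}\mu_s(\mu_t - \mu_s)}_{= -\frac{1}{n^2}\sum_{s=1}^{n}\sum_{t=1}^{n}\mu_t(\mu_t - \mu_s)} \\
&= \frac{1}{n^2}\sum_{s=1}^{n}\sum_{t=1}^{n}(\mu_t - \mu_s)^2 + \frac{1}{2n^2}\sum_{s=1}^{n}\sum_{t=1}^{n}\mu_s(\mu_t - \mu_s) - \frac{1}{2n^2}\sum_{s=1}^{n}\sum_{t=1}^{n}\mu_t(\mu_t - \mu_s) \\
&= \frac{1}{n^2}\sum_{s=1}^{n}\sum_{t=1}^{n}(\mu_t - \mu_s)^2 + \frac{1}{2n^2}\sum_{s=1}^{n}\sum_{t=1}^{n}(\mu_s - \mu_t)(\mu_t - \mu_s)
= \frac{1}{2n^2}\sum_{s=1}^{n}\sum_{t=1}^{n}(\mu_t - \mu_s)^2.
\end{aligned}
$$

Therefore

$$
\frac{1}{n}\sum_{t=1}^{n-|k|}(\mu_t - \bar{Y}_n)(\mu_{t+k} - \bar{Y}_n) \approx \frac{1}{2n^2}\sum_{s=1}^{n}\sum_{t=1}^{n}(\mu_t - \mu_s)^2.
$$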

Thus we observe that the sample covariances are positive and don’t tend to zero for large lags.
This gives the false impression of long memory.
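
The effect can be checked numerically. The following is a minimal sketch, not part of the original notes: it assumes a hypothetical mean with a single level shift (from 0 to 2 halfway through the sample) and Gaussian iid noise, and compares the sample autocovariance of the noise alone with that of noise plus the shifting mean. The helper `sample_autocov` and all parameter values are illustrative choices.

```python
import random


def sample_autocov(y, k):
    """Sample autocovariance (1/n) * sum_{t=1}^{n-k} (y_t - ybar)(y_{t+k} - ybar)."""
    n = len(y)
    ybar = sum(y) / n
    return sum((y[t] - ybar) * (y[t + k] - ybar) for t in range(n - k)) / n


random.seed(1)
n = 2000
eps = [random.gauss(0.0, 1.0) for _ in range(n)]  # iid noise: true ACF is zero at lags != 0

# Hypothetical time-dependent mean: a level shift from 0 to 2 at the midpoint.
mu = [0.0 if t < n // 2 else 2.0 for t in range(n)]
y = [m + e for m, e in zip(mu, eps)]

# Term predicted by the derivation: (1 / (2 n^2)) * sum_s sum_t (mu_t - mu_s)^2,
# which is the same as the empirical variance of the mean sequence.
mubar = sum(mu) / n
extra = sum((m - mubar) ** 2 for m in mu) / n

for k in (1, 10, 50):
    # columns: lag, autocovariance of the noise, autocovariance of the shifted series
    print(k, round(sample_autocov(eps, k), 3), round(sample_autocov(y, k), 3))
print("predicted extra term:", extra)
```

Under these assumptions the autocovariances of the shifted series should stay close to the predicted extra term at every lag shown, while those of the pure noise stay near zero.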
It should be noted that if you study a realisation of a time series with a large amount of dependence,
it is unclear whether what you see is actually a stochastic time series or an underlying trend. This
makes disentangling a trend from data with a large amount of correlation extremely difficult.

Chapter 9

Parameter estimation

Prerequisites

• The Gaussian likelihood.

Objectives

• To be able to derive the Yule-Walker and least squares estimator of the AR parameters.

• To understand what the quasi-Gaussian likelihood for the estimation of ARMA models is,
how the Durbin-Levinson algorithm is useful in obtaining this likelihood (in practice),
and how we can approximate it by using approximations of the predictions.

• To understand that there exist alternative methods for estimating the ARMA parameters,
which exploit the fact that the ARMA can be written as an AR(∞).

We will consider various methods for estimating the parameters in a stationary time series.
We first consider estimation of the parameters of an AR and ARMA process. It is worth noting that we
will look at maximum likelihood estimators for the AR and ARMA parameters. The maximum
likelihood will be constructed as if the observations were Gaussian. However, these estimators
‘work’ both when the process is Gaussian and when it is non-Gaussian. In the non-Gaussian case, the
likelihood simply acts as a contrast function (and is commonly called the quasi-likelihood). In time
series, often the distribution of the random variables is unknown and the notion of ‘likelihood’ has
little meaning. Instead we seek methods that give good estimators of the parameters, meaning that
they are consistent and as close to efficient as possible without placing too many assumptions on
the distribution. We need to ‘free’ ourselves from the notion of likelihood acting as a likelihood
(and attaining the Cramér-Rao lower bound).

9.1 Estimation for Autoregressive models


Let us suppose that $\{X_t\}$ is a zero mean stationary time series which satisfies the AR(p) representation

$$
X_t = \sum_{j=1}^{p}\phi_j X_{t-j} + \varepsilon_t,
$$

where $E(\varepsilon_t) = 0$, $\mathrm{var}(\varepsilon_t) = \sigma^2$ and the roots of the characteristic polynomial
$1 - \sum_{j=1}^{p}\phi_j z^j$ lie outside the unit circle. We will assume that the AR(p) is causal (the techniques
discussed in this section cannot consistently estimate the parameters in the case that the process is
non-causal; they will only consistently estimate the corresponding causal model). If you use the ar
function in R to estimate the parameters, you will see that there are several different estimation
methods that one can use to estimate $\{\phi_j\}$. These include the Yule-Walker estimator, the least squares
estimator, the Gaussian likelihood estimator and the Burg algorithm. Our aim in this section is to
motivate and describe these different estimation methods.
All these methods are based on the correlation structure of the time series. Thus they are only
designed to estimate stationary, causal time series. For example, if we fit the AR(1) model
$X_t = \phi X_{t-1} + \varepsilon_t$, the methods below cannot consistently estimate non-causal parameters (when
$|\phi| > 1$). However, depending on the method used, the estimator may be non-causal. For example, classical
least squares can yield estimators where $|\phi| > 1$. This does not mean the true model is non-causal; it
simply means the minimum of the least squares criterion lies outside the parameter space (−1, 1). Similarly,
unless the parameter space of the MLE is constrained to only search for maxima inside [−1, 1], it
can give a maximum outside the natural parameter space. For the AR(1) estimator, constraining
the parameter space is quite simple. However, for higher order autoregressive models, constraining
the parameter space can be quite difficult.
On the other hand, both the Yule-Walker estimator and Burg’s algorithm will always yield a
causal estimator for any AR(p) model. There is no need to constrain the parameter space.

9.1.1 The Yule-Walker estimator
The Yule-Walker estimator is based on the Yule-Walker equations derived in (6.8) (Section 6.1.4).
We recall that the Yule-Walker equations state that if an AR process is causal, then for $i > 0$
we have

$$
E(X_t X_{t-i}) = \sum_{j=1}^{p}\phi_j E(X_{t-j}X_{t-i}) \quad\Rightarrow\quad c(i) = \sum_{j=1}^{p}\phi_j c(i-j). \tag{9.1}
$$

Putting the cases $1 \le i \le p$ together we can write the above as

$$
r_p = \Sigma_p \phi_p, \tag{9.2}
$$

where $(\Sigma_p)_{i,j} = c(i-j)$, $(r_p)_i = c(i)$ and $\phi_p' = (\phi_1, \ldots, \phi_p)$. Thus the autoregressive parameters
solve these equations. It is important to observe that $\phi_p = (\phi_1, \ldots, \phi_p)$ minimises the mean squared
error

$$
E\Big[X_{t+1} - \sum_{j=1}^{p}\phi_j X_{t+1-j}\Big]^2
$$

(see Section 5.5).


The Yule-Walker equations inspire the method of moments estimator called the Yule-Walker
estimator. We use (9.2) as the basis of the estimator. It is clear that $\hat{r}_p$ and $\hat{\Sigma}_p$ are estimators of
$r_p$ and $\Sigma_p$, where $(\hat{\Sigma}_p)_{i,j} = \hat{c}_n(i-j)$ and $(\hat{r}_p)_i = \hat{c}_n(i)$. Therefore we can use

$$
\hat{\phi}_p = \hat{\Sigma}_p^{-1}\hat{r}_p \tag{9.3}
$$

as an estimator of the AR parameters $\phi_p' = (\phi_1, \ldots, \phi_p)$. We observe that if p is large this involves
inverting a large matrix. However, we can use the Durbin-Levinson algorithm to calculate $\hat{\phi}_p$ by
recursively fitting lower order AR processes to the observations and increasing the order. This way
an explicit inversion can be avoided. We detail how the Durbin-Levinson algorithm can be used to
estimate the AR parameters below.

Step 1 Set $\hat{\phi}_{1,1} = \hat{c}_n(1)/\hat{c}_n(0)$ and $\hat{r}_n(2) = \hat{c}_n(0) - \hat{\phi}_{1,1}\hat{c}_n(1)$.

Step 2 For $2 \le t \le p$, we define the recursion

$$
\hat{\phi}_{t,t} = \frac{\hat{c}_n(t) - \sum_{j=1}^{t-1}\hat{\phi}_{t-1,j}\,\hat{c}_n(t-j)}{\hat{r}_n(t)}, \qquad
\hat{\phi}_{t,j} = \hat{\phi}_{t-1,j} - \hat{\phi}_{t,t}\hat{\phi}_{t-1,t-j} \quad (1 \le j \le t-1),
$$

and $\hat{r}_n(t+1) = \hat{r}_n(t)(1 - \hat{\phi}_{t,t}^2)$.

Step 3 We recall from (7.11) that $\phi_{t,t}$ is the partial correlation between $X_{t+1}$ and $X_1$, therefore $\hat{\phi}_{t,t}$
are estimators of the partial correlation between $X_{t+1}$ and $X_1$.

[Figure 9.1: The sample partial autocorrelation plot of the AR(2) process $X_t = 1.5X_{t-1} - 0.75X_{t-2} + \varepsilon_t$ with error bars, n = 200 (series ar2; horizontal axis: Lag, vertical axis: Partial ACF).]

As mentioned in Step 3, the Yule-Walker estimators have the useful property that the partial
correlations can easily be evaluated within the procedure. This is useful when trying to determine
the order of the model to fit to the data. In Figure 9.1 we give the partial correlation plot corre-
sponding to Figure 8.1. Notice that only the first two terms are outside the error bars. This rightly
suggests the time series comes from an autoregressive process of order two.
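
The three steps above can be sketched in a few lines. This is an illustrative implementation, not the author's code: the function name `durbin_levinson` and its return convention are assumptions, and the example feeds it the theoretical autocorrelations of the AR(2) model $X_t = 1.5X_{t-1} - 0.75X_{t-2} + \varepsilon_t$ rather than sample estimates.

```python
def durbin_levinson(c, p):
    """Durbin-Levinson recursion.

    c : list of autocovariances [c(0), c(1), ..., c(p)]
    Returns (phi, pacf), where phi = [phi_{p,1}, ..., phi_{p,p}] are the AR(p)
    coefficients and pacf = [phi_{1,1}, ..., phi_{p,p}] the partial correlations.
    """
    phi = [c[1] / c[0]]               # Step 1: phi_{1,1}
    r = c[0] - phi[0] * c[1]          # r(2): prediction error of the AR(1) fit
    pacf = [phi[0]]
    for t in range(2, p + 1):         # Step 2
        num = c[t] - sum(phi[j - 1] * c[t - j] for j in range(1, t))
        phi_tt = num / r
        # phi_{t,j} = phi_{t-1,j} - phi_{t,t} * phi_{t-1,t-j},  1 <= j <= t-1
        phi = [phi[j - 1] - phi_tt * phi[t - j - 1] for j in range(1, t)] + [phi_tt]
        r = r * (1.0 - phi_tt ** 2)   # r(t+1)
        pacf.append(phi_tt)           # Step 3: partial correlation at lag t
    return phi, pacf


# Theoretical autocorrelations of X_t = 1.5 X_{t-1} - 0.75 X_{t-2} + eps_t.
rho = [1.0, 1.5 / 1.75]
for k in range(2, 4):
    rho.append(1.5 * rho[k - 1] - 0.75 * rho[k - 2])

phi, pacf = durbin_levinson(rho, 3)
print(phi)   # approximately [1.5, -0.75, 0.0]
print(pacf)  # partial correlations; the lag-3 value is approximately 0
```

With sample autocovariances in place of `rho`, the same recursion would give the Yule-Walker estimates, and `pacf` would give the values plotted in a sample partial autocorrelation plot.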
In previous chapters it was frequently alluded to that the autocovariance is “blind” to non-causality
and that any estimator based on estimating the covariance will always be estimating the
causal solution. In Lemma 9.1.1 we show that the Yule-Walker estimator has the property that the
parameter estimates $\{\hat{\phi}_j; j = 1, \ldots, p\}$ correspond to a causal AR(p); in other words, the roots
corresponding to $\hat{\phi}(z) = 1 - \sum_{j=1}^{p}\hat{\phi}_j z^j$ lie outside the unit circle. A non-causal solution cannot arise.
The proof hinges on the fact that the Yule-Walker estimator is based on the sample autocovariances
$\{\hat{c}_n(r)\}$, which are a positive semi-definite sequence (see Lemma 8.2.1).

Remark 9.1.1 (Fitting an AR(1) using the Yule-Walker) We generalize this idea to general
AR(p) models below. However, it is straightforward to show that the Yule-Walker estimator of the
AR(1) parameter will always be less than or equal to one in absolute value. We recall that

$$
\hat{\phi}_{YW} = \frac{\sum_{t=1}^{n-1}X_t X_{t+1}}{\sum_{t=1}^{n}X_t^2}.
$$

By using Cauchy-Schwarz we have

$$
|\hat{\phi}_{YW}| \le \frac{\sum_{t=1}^{n-1}|X_t X_{t+1}|}{\sum_{t=1}^{n}X_t^2}
\le \frac{\big[\sum_{t=1}^{n-1}X_t^2\big]^{1/2}\big[\sum_{t=1}^{n-1}X_{t+1}^2\big]^{1/2}}{\sum_{t=1}^{n}X_t^2}
\le \frac{\big[\sum_{t=1}^{n}X_t^2\big]^{1/2}\big[\sum_{t=0}^{n-1}X_{t+1}^2\big]^{1/2}}{\sum_{t=1}^{n}X_t^2} = 1.
$$

We use a similar idea below, but the proof hinges on the fact that the sample covariances form a
positive semi-definite sequence.
An alternative proof uses the fact that $\{\hat{c}_n(r)\}$ is the autocovariance function of a stationary time
series $\{Z_t\}$. Then

$$
\hat{\phi}_{YW} = \frac{\hat{c}_n(1)}{\hat{c}_n(0)} = \frac{\mathrm{cov}(Z_t, Z_{t+1})}{\mathrm{var}(Z_t)} = \frac{\mathrm{cov}(Z_t, Z_{t+1})}{\sqrt{\mathrm{var}(Z_t)\mathrm{var}(Z_{t+1})}},
$$

which is a correlation and thus lies in [−1, 1].

Lemma 9.1.1 Let us suppose $\underline{Z}_{p+1} = (Z_1, \ldots, Z_{p+1})$ is a zero mean random vector, where
$\mathrm{var}[\underline{Z}_{p+1}] = \Sigma_{p+1}$ with $(\Sigma_{p+1})_{i,j} = c_n(i-j)$ (which is Toeplitz). Let $Z_{p+1|p}$ be the best linear
predictor of $Z_{p+1}$ given $Z_p, \ldots, Z_1$, where $\phi_p = (\phi_1, \ldots, \phi_p) = \Sigma_p^{-1}r_p$ are the coefficients
corresponding to the best linear predictor. Then the roots of the corresponding characteristic polynomial
$\phi(z) = 1 - \sum_{j=1}^{p}\phi_j z^j$ lie outside the unit circle.

PROOF. The proof is based on the following facts:

(i) Any sequence $\{\phi_j\}_{j=1}^{p}$ has the following reparameterisation. There exist parameters $\{a_j\}_{j=1}^{p}$
and $\lambda$ such that $a_1 = 1$, $a_j - \lambda a_{j-1} = -\phi_{j-1}$ for $2 \le j \le p$, and $\lambda a_p = \phi_p$. Using $\{a_j\}_{j=1}^{p}$ and
$\lambda$, we rewrite the linear combination of $\{Z_j\}_{j=1}^{p+1}$ as

$$
Z_{p+1} - \sum_{j=1}^{p}\phi_j Z_{p+1-j} = \sum_{j=1}^{p}a_j Z_{p+2-j} - \lambda\sum_{j=1}^{p}a_j Z_{p+1-j}.
$$

(ii) If $\phi_p = (\phi_1, \ldots, \phi_p)' = \Sigma_p^{-1}r_p$, then $\phi_p$ minimises the mean squared error, i.e. for any $\{b_j\}_{j=1}^{p}$

$$
E_{\Sigma_{p+1}}\Big(Z_{p+1} - \sum_{j=1}^{p}\phi_j Z_{p+1-j}\Big)^2 \le E_{\Sigma_{p+1}}\Big(Z_{p+1} - \sum_{j=1}^{p}b_j Z_{p+1-j}\Big)^2 \tag{9.4}
$$

where $\Sigma_{p+1} = \mathrm{var}[\underline{Z}_{p+1}]$ and $\underline{Z}_{p+1} = (Z_{p+1}, \ldots, Z_1)$.

We use these facts to prove the result. Our objective is to show that the roots of
$\phi(B) = 1 - \sum_{j=1}^{p}\phi_j B^j$ lie outside the unit circle. Using (i) we factorize $\phi(B) = (1 - \lambda B)a(B)$ where
$a(B) = \sum_{j=1}^{p}a_j B^{j-1}$. Suppose by contradiction $|\lambda| > 1$ (thus at least one root of $\phi(B)$ lies inside the unit
circle). We will show that if this were true, then by the Toeplitz nature of $\Sigma_{p+1}$, $\phi_p = (\phi_1, \ldots, \phi_p)$
cannot be the best linear predictor.
Let

$$
Y_{p+1} = \sum_{j=1}^{p}a_j B^j Z_{p+2} = \sum_{j=1}^{p}a_j Z_{p+2-j} \quad\text{and}\quad
Y_p = BY_{p+1} = B\sum_{j=1}^{p}a_j B^j Z_{p+2} = \sum_{j=1}^{p}a_j Z_{p+1-j}.
$$

By (i) it is clear that $Z_{p+1} - \sum_{j=1}^{p}\phi_j Z_{p+1-j} = Y_{p+1} - \lambda Y_p$. Furthermore, since $\{\phi_j\}$ minimises the
mean squared error in (9.4), $\lambda Y_p$ must be the best linear predictor of $Y_{p+1}$ given $Y_p$, i.e. $\lambda$
must minimise the mean squared error

$$
\lambda = \arg\min_{\beta} E_{\Sigma_{p+1}}(Y_{p+1} - \beta Y_p)^2,
$$

that is $\lambda = E[Y_{p+1}Y_p]/E[Y_p^2]$. However, we now show that $|E[Y_{p+1}Y_p]/E[Y_p^2]| \le 1$, which leads to a contradiction.
We recall that $Y_{p+1}$ is a linear combination of a stationary sequence, thus $BY_{p+1}$ has the same
variance as $Y_{p+1}$, i.e. $\mathrm{var}(Y_{p+1}) = \mathrm{var}(Y_p)$. If you want to see the exact calculation, then

$$
E[Y_p^2] = \mathrm{var}[Y_p] = \sum_{j_1,j_2=1}^{p}a_{j_1}a_{j_2}\mathrm{cov}[Z_{p+1-j_1}, Z_{p+1-j_2}] = \sum_{j_1,j_2=1}^{p}a_{j_1}a_{j_2}c(j_1 - j_2)
= \mathrm{var}[Y_{p+1}] = E[Y_{p+1}^2].
$$

In other words, since $\Sigma_{p+1}$ is a Toeplitz matrix, $E[Y_p^2] = E[Y_{p+1}^2]$ and

$$
\lambda = \frac{E[Y_{p+1}Y_p]}{(E[Y_p^2]E[Y_{p+1}^2])^{1/2}}.
$$

This means $\lambda$ measures the correlation between $Y_p$ and $Y_{p+1}$ and must be less than or equal to one
in absolute value, which contradicts $|\lambda| > 1$.
Observe this proof only works when $\Sigma_{p+1}$ is a Toeplitz matrix. If it is not, we do not have
$E[Y_p^2] = E[Y_{p+1}^2]$ and $\lambda$ cannot be interpreted as a correlation. □

From the above result we can immediately see that the Yule-Walker estimators of the AR(p)
coefficients yield a causal solution. Since the autocovariance estimators $\{\hat{c}_n(r)\}$ form a positive
semi-definite sequence, there exists a vector $\underline{Y}_{p+1}$ where $\mathrm{var}_{\hat{\Sigma}_{p+1}}[\underline{Y}_{p+1}] = \hat{\Sigma}_{p+1}$ with
$(\hat{\Sigma}_{p+1})_{i,j} = \hat{c}_n(i-j)$; thus by the above lemma $\hat{\Sigma}_p^{-1}\hat{r}_p$ are the coefficients of a causal AR process.

Remark 9.1.2 (The bias of the Yule-Walker estimator) The Yule-Walker estimator tends to have larger
bias than other estimators when the sample size is small and the spectral density corresponding
to the underlying time series has a large pronounced peak (see Shaman and Stine (1988) and
Ernst and Shaman (2019)). The large pronounced peak in the spectral density arises when the roots
of the underlying characteristic polynomial lie close to the unit circle.

9.1.2 The tapered Yule-Walker estimator


Substantial improvements to the Yule-Walker estimator can be obtained by tapering the original
time series (tapering dates back to Tukey, but its application for AR(p) estimation was first proposed
and proved in Dahlhaus (1988)).
Tapering is when the original data is downweighted towards the ends of the time series. This
is done with a positive function h : [0, 1] → R that satisfies certain smoothness properties and is
such that h(0) = h(1) = 0. The tapered time series is $h(t/n)X_t$. An illustration is given below:

In R, this can be done with the function spec.taper(x,p=0.1) (where x is the time series and p is
the proportion to be tapered). Replacing $X_t$ with $h(t/n)X_t$ we define the tapered sample covariance
as

$$
\hat{c}_{T,n}(r) = \frac{1}{\sum_{t=1}^{n}h(t/n)^2}\sum_{t=1}^{n-|r|}h\Big(\frac{t}{n}\Big)X_t\,h\Big(\frac{t+r}{n}\Big)X_{t+r}.
$$

We now use $\{\hat{c}_{T,n}(r)\}$ to define the Yule-Walker estimator for the AR(p) parameters.
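
As a rough illustration of the tapered covariance, here is a sketch under stated assumptions. The taper `cosine_taper` is a split cosine bell chosen for illustration (it satisfies h(0) = h(1) = 0); it is similar in spirit to what a routine such as spec.taper does but is not claimed to reproduce it. The function names and the test series are hypothetical.

```python
import math


def cosine_taper(u, p=0.1):
    """Split cosine bell on [0, 1]: rises from 0 over [0, p], equals 1 on [p, 1 - p],
    and falls back to 0 over [1 - p, 1]."""
    if u < p:
        return 0.5 * (1.0 - math.cos(math.pi * u / p))
    if u > 1.0 - p:
        return 0.5 * (1.0 - math.cos(math.pi * (1.0 - u) / p))
    return 1.0


def tapered_autocov(x, r, p=0.1):
    """Tapered sample autocovariance at lag r, normalised by sum_t h(t/n)^2."""
    n = len(x)
    r = abs(r)
    h = [cosine_taper(t / n, p) for t in range(1, n + 1)]   # h(t/n) for t = 1, ..., n
    norm = sum(w * w for w in h)
    return sum(h[t] * x[t] * h[t + r] * x[t + r] for t in range(n - r)) / norm


# Illustrative series; the tapered Yule-Walker estimate of an AR(1) parameter
# is the ratio of the lag-one to the lag-zero tapered autocovariance.
x = [math.sin(0.3 * t) for t in range(1, 101)]
phi_tapered = tapered_autocov(x, 1) / tapered_autocov(x, 0)
print(phi_tapered)
```

Because the tapered covariances are the ordinary sample covariances of the tapered series (up to normalisation), they remain a positive semi-definite sequence, so the resulting estimate still lies in [−1, 1].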

9.1.3 The Gaussian likelihood


Our objective here is to obtain the maximum likelihood estimator of the AR(p) parameters. We recall
that the maximum likelihood estimator is the parameter which maximises the joint density of the
observations. Since the log-likelihood often has a simpler form, we will focus on the log-likelihood.
We note that the Gaussian MLE is constructed as if the observations $\{X_t\}$ were Gaussian, though it
is not necessary that $\{X_t\}$ is Gaussian when doing the estimation. In the case that the innovations
are not Gaussian, the estimator may be less efficient (it may not attain the Cramér-Rao lower bound)
than the estimator based on the likelihood constructed as if the distribution were known.
Suppose we observe $\{X_t; t = 1, \ldots, n\}$ where $X_t$ are observations from an AR(p) process. Let
us suppose for the moment that the innovations of the AR process are Gaussian; this implies that
$\underline{X}_n = (X_1, \ldots, X_n)$ is an n-dimensional Gaussian random vector, with the corresponding log-likelihood

$$
L_n(a) = -\log|\Sigma_n(a)| - \underline{X}_n'\Sigma_n(a)^{-1}\underline{X}_n, \tag{9.5}
$$

where $\Sigma_n(a)$ is the variance covariance matrix of $\underline{X}_n$ constructed as if $\underline{X}_n$ came from an AR process
with parameters a. Of course, in practice, the likelihood in the form given above is impossible to
maximise. Therefore we need to rewrite the likelihood in a more tractable form.
We now derive a tractable form of the likelihood under the assumption that the innovations come
from an arbitrary distribution. To construct the likelihood, we use the method of conditioning,
to write the likelihood as the product of conditional likelihoods. In order to do this, we derive
the conditional distribution of $X_{t+1}$ given $X_t, \ldots, X_1$. We first note that the AR(p) process is
p-Markovian (if it is causal), therefore if $t \ge p$ all the information about $X_{t+1}$ is contained in the
past p observations, therefore

$$
P(X_{t+1} \le x \,|\, X_t, X_{t-1}, \ldots, X_1) = P(X_{t+1} \le x \,|\, X_t, X_{t-1}, \ldots, X_{t-p+1}), \tag{9.6}
$$

by causality. Since the Markov property applies to the distribution function it also applies to the
density

$$
f(X_{t+1} \,|\, X_t, \ldots, X_1) = f(X_{t+1} \,|\, X_t, \ldots, X_{t-p+1}).
$$

By using (9.6) we have

$$
P(X_{t+1} \le x \,|\, X_t, \ldots, X_1) = P(X_{t+1} \le x \,|\, X_t, \ldots, X_{t-p+1}) = P_{\varepsilon}\Big(\varepsilon \le x - \sum_{j=1}^{p}a_j X_{t+1-j}\Big), \tag{9.7}
$$

where $P_{\varepsilon}$ denotes the distribution of the innovation. Differentiating $P_{\varepsilon}$ with respect to $X_{t+1}$ gives

$$
f(X_{t+1} \,|\, X_t, \ldots, X_{t-p+1}) = \frac{\partial P_{\varepsilon}\big(\varepsilon \le X_{t+1} - \sum_{j=1}^{p}a_j X_{t+1-j}\big)}{\partial X_{t+1}}
= f_{\varepsilon}\Big(X_{t+1} - \sum_{j=1}^{p}a_j X_{t+1-j}\Big). \tag{9.8}
$$

Example 9.1.1 (AR(1)) To understand why (9.6) is true consider the simple case that p = 1
(AR(1) with $|\phi| < 1$). Studying the conditional probability gives

$$
\begin{aligned}
P(X_{t+1} \le x_{t+1} \,|\, X_t = x_t, \ldots, X_1 = x_1)
&= P(\underbrace{\phi X_t + \varepsilon_{t+1} \le x_{t+1}}_{\text{all information contained in } X_t} \,|\, X_t = x_t, \ldots, X_1 = x_1) \\
&= P_{\varepsilon}(\varepsilon_{t+1} \le x_{t+1} - \phi x_t) = P(X_{t+1} \le x_{t+1} \,|\, X_t = x_t),
\end{aligned}
$$

where $P_{\varepsilon}$ denotes the distribution function of the innovation ε.

Using (9.8) we can derive the joint density of $\{X_t\}_{t=1}^{n}$. By using conditioning we obtain

$$
\begin{aligned}
f(X_1, X_2, \ldots, X_n) &= f(X_1, \ldots, X_p)\prod_{t=p}^{n-1}f(X_{t+1} \,|\, X_t, \ldots, X_1) && \text{(by repeated conditioning)} \\
&= f(X_1, \ldots, X_p)\prod_{t=p}^{n-1}f(X_{t+1} \,|\, X_t, \ldots, X_{t-p+1}) && \text{(by the Markov property)} \\
&= f(X_1, \ldots, X_p)\prod_{t=p}^{n-1}f_{\varepsilon}\Big(X_{t+1} - \sum_{j=1}^{p}a_j X_{t+1-j}\Big) && \text{(by (9.8))}.
\end{aligned}
$$

Therefore the log likelihood is

$$
\underbrace{\log f(X_1, X_2, \ldots, X_n)}_{\text{full log-likelihood } L_n(a;\underline{X}_n)}
= \underbrace{\log f(X_1, \ldots, X_p)}_{\text{initial observations}}
+ \underbrace{\sum_{t=p}^{n-1}\log f_{\varepsilon}\Big(X_{t+1} - \sum_{j=1}^{p}a_j X_{t+1-j}\Big)}_{\text{conditional log-likelihood } = \mathcal{L}_n(a;\underline{X}_n)}.
$$

In the case that the sample sizes are large n >> p, the contribution of initial observations
log f(X1 , . . . , Xp ) is minimal and the conditional log-likelihood and full log-likelihood are asymp-
totically equivalent.
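
The conditional log-likelihood is just a sum of log innovation densities evaluated at the one-step residuals, which the following sketch makes concrete. It is illustrative only: the function names are hypothetical, and the two innovation densities (standard Gaussian and a t-distribution with 4 degrees of freedom) are example choices, since the derivation above holds for an arbitrary innovation density.

```python
import math


def conditional_loglik(x, a, log_f_eps):
    """Sum over t = p, ..., n-1 of log f_eps(X_{t+1} - sum_j a_j X_{t+1-j}).

    x is a 0-based list, so x[t] holds X_{t+1}; a = [a_1, ..., a_p].
    """
    p = len(a)
    total = 0.0
    for t in range(p, len(x)):
        pred = sum(a[j] * x[t - 1 - j] for j in range(p))   # sum_j a_j X_{t+1-j}
        total += log_f_eps(x[t] - pred)
    return total


def gaussian_logpdf(e, sigma=1.0):
    """Log density of N(0, sigma^2)."""
    return -0.5 * math.log(2.0 * math.pi * sigma ** 2) - 0.5 * (e / sigma) ** 2


def student_t4_logpdf(e):
    """Log density of a t-distribution with 4 degrees of freedom:
    f(e) = (3/8) * (1 + e^2 / 4)^(-5/2)."""
    return math.log(3.0 / 8.0) - 2.5 * math.log(1.0 + e * e / 4.0)


x = [0.3, -0.1, 0.8, 0.5, -0.4]
print(conditional_loglik(x, [0.5], gaussian_logpdf))
print(conditional_loglik(x, [0.5], student_t4_logpdf))
```

Swapping `log_f_eps` changes the criterion being maximised but not the structure of the sum; with the Gaussian density the criterion reduces to a least squares problem in the coefficients.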
So far we have not specified the distribution of $\{\varepsilon_t\}_t$. From now on we shall assume that it is
Gaussian. Thus $f(X_1, \ldots, X_n; \phi)$ and $f(X_1, \ldots, X_p; \phi)$ are multivariate normal densities with mean
zero (since we are assuming, for convenience, that the time series has zero mean) and variance
$\Sigma_n(\phi)$ and $\Sigma_p(\phi)$ respectively, where by stationarity $\Sigma_n(\phi)$ and $\Sigma_p(\phi)$ are Toeplitz matrices. Based
on this the log-likelihood (up to constants, as in (9.5)) is

$$
\begin{aligned}
L_n(a) &= -\log|\Sigma_n(a)| - \underline{X}_n'\Sigma_n(a)^{-1}\underline{X}_n \\
&= -\log|\Sigma_p(a)| - \underline{X}_p'\Sigma_p(a)^{-1}\underline{X}_p + \underbrace{\mathcal{L}_n(a;\underline{X})}_{\text{conditional likelihood}}.
\end{aligned} \tag{9.9}
$$

The maximum likelihood estimator is

$$
\hat{\phi}_n = \arg\max_{a \in \Theta}L_n(a). \tag{9.10}
$$

The parameters in the model are ‘buried’ within the covariance. By constraining the parameter
space, we can ensure the estimator corresponds to a causal AR process (but finding a suitable parameter
space is not simple). Analytic expressions do exist for $\underline{X}_p'\Sigma_p(a)^{-1}\underline{X}_p$ and $\log|\Sigma_p(a)|$ but they are
not so simple. This motivates the conditional likelihood described in the next section.

9.1.4 The conditional Gaussian likelihood and least squares


The conditional likelihood focusses on the conditional term of the Gaussian likelihood and is defined
as

$$
\mathcal{L}_n(a;\underline{X}) = -(n-p)\log\sigma^2 - \frac{1}{\sigma^2}\sum_{t=p}^{n-1}\Big(X_{t+1} - \sum_{j=1}^{p}a_j X_{t+1-j}\Big)^2,
$$

which is straightforward to maximise. Since the maximiser of the above with respect to $\{a_j\}$ does not
depend on $\sigma^2$, the conditional likelihood estimator of $\{\phi_j\}$ is simply the least squares estimator

$$
\tilde{\phi}_p = \arg\min_{a}\sum_{t=p}^{n-1}\Big(X_{t+1} - \sum_{j=1}^{p}a_j X_{t+1-j}\Big)^2 = \tilde{\Sigma}_p^{-1}\tilde{r}_p,
$$

where $(\tilde{\Sigma}_p)_{i,j} = \frac{1}{n-p}\sum_{t=p+1}^{n}X_{t-i}X_{t-j}$ and $(\tilde{r}_p)_i = \frac{1}{n-p}\sum_{t=p+1}^{n}X_t X_{t-i}$.
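
A minimal sketch of this least squares estimator follows. It is not a replacement for a routine such as ar.ols: the helper names `solve` and `ar_least_squares`, the simulated AR(2) series, the burn-in length and the seed are all illustrative assumptions, and the small Gaussian-elimination solver is included only so that the example needs nothing beyond the standard library.

```python
import random


def solve(A, b):
    """Solve A z = b by Gaussian elimination with partial pivoting (small systems only)."""
    m = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(m):
        piv = max(range(col, m), key=lambda i: abs(M[i][col]))
        M[col], M[piv] = M[piv], M[col]
        for i in range(col + 1, m):
            f = M[i][col] / M[col][col]
            for j in range(col, m + 1):
                M[i][j] -= f * M[col][j]
    z = [0.0] * m
    for i in reversed(range(m)):
        z[i] = (M[i][m] - sum(M[i][j] * z[j] for j in range(i + 1, m))) / M[i][i]
    return z


def ar_least_squares(x, p):
    """Least squares AR(p) estimate: solve Sigma_tilde * phi = r_tilde.

    x is 0-based (x[t] holds X_{t+1}), so the sums over X_{p+1}, ..., X_n
    run over the indices p, ..., n-1.
    """
    n = len(x)
    S = [[sum(x[t - i] * x[t - j] for t in range(p, n)) / (n - p)
          for j in range(1, p + 1)] for i in range(1, p + 1)]
    r = [sum(x[t] * x[t - i] for t in range(p, n)) / (n - p) for i in range(1, p + 1)]
    return solve(S, r)


# Simulate the causal AR(2) model X_t = 1.5 X_{t-1} - 0.75 X_{t-2} + eps_t.
random.seed(0)
x = [0.0, 0.0]
for _ in range(3000):
    x.append(1.5 * x[-1] - 0.75 * x[-2] + random.gauss(0.0, 1.0))
x = x[1000:]                      # discard a burn-in period

est = ar_least_squares(x, 2)
print(est)                        # should be close to (1.5, -0.75)
```

Note that the matrix `S` built here is not Toeplitz: each entry averages a slightly different stretch of the data, which is the feature that distinguishes it from the Yule-Walker matrix.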

Remark 9.1.3 (A comparison of the Yule-Walker and least squares estimators) Comparing
the least squares estimator $\tilde{\phi}_p = \tilde{\Sigma}_p^{-1}\tilde{r}_p$ with the Yule-Walker estimator $\hat{\phi}_p = \hat{\Sigma}_p^{-1}\hat{r}_p$ we see that
they are very similar. The difference lies in $\tilde{\Sigma}_p$ and $\hat{\Sigma}_p$ (and the corresponding $\tilde{r}_p$ and $\hat{r}_p$). We see
that $\hat{\Sigma}_p$ is a Toeplitz matrix, defined entirely by the positive definite sequence $\hat{c}_n(r)$. On the other
hand, $\tilde{\Sigma}_p$ is not a Toeplitz matrix; the estimator of c(r) changes subtly at each row. This means
that the proof given in Lemma 9.1.1 cannot be applied to the least squares estimator as it relies
on the matrix $\Sigma_{p+1}$ (which is a combination of $\Sigma_p$ and $r_p$) being Toeplitz (thus stationary). Thus
the characteristic polynomial corresponding to the least squares estimator will not necessarily have
roots which lie outside the unit circle.

Example 9.1.2 (Toy Example) To illustrate the difference between the Yule-Walker and least
squares estimator (at least for small samples) consider the rather artificial example that the time
series consists of two observations $X_1$ and $X_2$ (we will assume the mean is zero). We fit an AR(1)
model to the data; the least squares estimator of the AR(1) parameter is

$$
\hat{\phi}_{LS} = \frac{X_1 X_2}{X_1^2}
$$

whereas the Yule-Walker estimator of the AR(1) parameter is

$$
\hat{\phi}_{YW} = \frac{X_1 X_2}{X_1^2 + X_2^2}.
$$

It is clear that $|\hat{\phi}_{LS}| < 1$ only if $|X_2| < |X_1|$. On the other hand $|\hat{\phi}_{YW}| < 1$. Indeed since
$(|X_1| - |X_2|)^2 \ge 0$, we see that $|\hat{\phi}_{YW}| \le 1/2$.
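
The toy example is easy to check numerically. The sketch below uses hypothetical helper names and the illustrative values $X_1 = 1$, $X_2 = 3$, for which the least squares estimate is non-causal while the Yule-Walker estimate is not.

```python
def ar1_least_squares(x):
    """Least squares AR(1) estimate: sum_{t=1}^{n-1} X_t X_{t+1} / sum_{t=1}^{n-1} X_t^2."""
    num = sum(x[t] * x[t + 1] for t in range(len(x) - 1))
    den = sum(x[t] ** 2 for t in range(len(x) - 1))
    return num / den


def ar1_yule_walker(x):
    """Yule-Walker AR(1) estimate: sum_{t=1}^{n-1} X_t X_{t+1} / sum_{t=1}^{n} X_t^2."""
    num = sum(x[t] * x[t + 1] for t in range(len(x) - 1))
    den = sum(v ** 2 for v in x)
    return num / den


x = [1.0, 3.0]
print(ar1_least_squares(x))   # prints 3.0, outside [-1, 1]
print(ar1_yule_walker(x))     # prints 0.3, below the bound 1/2
```

The only difference between the two functions is the extra term $X_n^2$ in the Yule-Walker denominator, which is what keeps that estimate inside the causal region.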

Exercise 9.1 (i) In R you can estimate the AR parameters using ordinary least squares (ar.ols),
Yule-Walker (ar.yw) and (Gaussian) maximum likelihood (ar.mle).

Simulate the causal AR(2) model $X_t = 1.5X_{t-1} - 0.75X_{t-2} + \varepsilon_t$ using the routine arima.sim
(which gives Gaussian realizations) and also using innovations from a t-distribution with
4 degrees of freedom. Use the sample sizes n = 100 and n = 500 and compare the three methods through a
simulation study.

(ii) Use the $\ell_1$-norm defined as

$$
L_n(\phi) = \sum_{t=p+1}^{n}\Big|X_t - \sum_{j=1}^{p}\phi_j X_{t-j}\Big|,
$$

with $\hat{\phi}_n = \arg\min L_n(\phi)$ to estimate the AR(p) parameters.

You may need to use a Quantile Regression package to minimise the $\ell_1$ norm. I suggest using
the package quantreg and the function rq where we set τ = 0.5 (the median).

Note that so far we have only considered estimation of causal AR(p) models. Breidt et al. (2001)
propose a method for estimating parameters of a non-causal AR(p) process (see page 18).

9.1.5 Burg’s algorithm


Burg’s algorithm is an alternative method for estimating the AR(p) parameters. It is closely related
to the least squares estimator but uses properties of second order stationarity in its construction.
Like the Yule-Walker estimator it has the useful property that its estimates correspond to a causal
characteristic function. Like the Yule-Walker estimator it can recursively estimate the AR(p)
parameters by first fitting an AR(1) model and then recursively increasing the order of fit.
We start with fitting an AR(1) model to the data. Suppose that $\phi_{1,1}$ is the true best fitting
AR(1) parameter, that is

$$
X_t = P_{X_{t-1}}(X_t) + \varepsilon_{1,t} = \phi_{1,1}X_{t-1} + \varepsilon_{1,t}.
$$

Then the least squares estimator is based on estimating the projection by using the $\phi$ that
minimises

$$
\sum_{t=2}^{n}(X_t - \phi X_{t-1})^2.
$$

However, by stationarity the same parameter $\phi_{1,1}$ also describes the projection of $X_t$ onto the future value $X_{t+1}$:

$$
X_t = P_{X_{t+1}}(X_t) + \delta_{1,t} = \phi_{1,1}X_{t+1} + \delta_{1,t}.
$$

Thus by the same argument as above, an estimator of $\phi_{1,1}$ is the parameter which minimises

$$
\sum_{t=1}^{n-1}(X_t - \phi X_{t+1})^2.
$$

We can combine these two least squares criteria and define

$$
\hat{\phi}_{1,1} = \arg\min_{\phi}\left[\sum_{t=2}^{n}(X_t - \phi X_{t-1})^2 + \sum_{t=1}^{n-1}(X_t - \phi X_{t+1})^2\right].
$$

Differentiating the above with respect to $\phi$ and solving gives the explicit expression

$$
\hat{\phi}_{1,1} = \frac{\sum_{t=1}^{n-1}X_t X_{t+1} + \sum_{t=2}^{n}X_t X_{t-1}}{2\sum_{t=2}^{n-1}X_t^2 + X_1^2 + X_n^2}
= \frac{2\sum_{t=1}^{n-1}X_t X_{t+1}}{2\sum_{t=2}^{n-1}X_t^2 + X_1^2 + X_n^2}.
$$

Unlike the least squares estimator, $\hat{\phi}_{1,1}$ is guaranteed to lie in [−1, 1]. Note that $\phi_{1,1}$ is the
partial correlation at lag one, thus $\hat{\phi}_{1,1}$ is an estimator of the partial correlation. In the next step we
estimate the partial correlation at lag two. We use the projection argument described in Sections
5.1.4 and 7.5.1. That is

$$
P_{X_{t-2},X_{t-1}}(X_t) = P_{X_{t-1}}(X_t) + \rho\big(X_{t-2} - P_{X_{t-1}}(X_{t-2})\big)
$$

and

$$
\begin{aligned}
X_t &= P_{X_{t-2},X_{t-1}}(X_t) + \varepsilon_{2,t} = P_{X_{t-1}}(X_t) + \rho\big(X_{t-2} - P_{X_{t-1}}(X_{t-2})\big) + \varepsilon_{2,t} \\
&= \phi_{1,1}X_{t-1} + \rho\,(X_{t-2} - \phi_{1,1}X_{t-1}) + \varepsilon_{2,t}.
\end{aligned}
$$

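Thus we replace $\phi_{1,1}$ in the above with $\hat{\phi}_{1,1}$ and estimate ρ by minimising the least squares criterion

$$
\sum_{t=3}^{n}\Big[X_t - \hat{\phi}_{1,1}X_{t-1} - \rho\big(X_{t-2} - \hat{\phi}_{1,1}X_{t-1}\big)\Big]^2.
$$

However, just as in the estimation scheme of $\phi_{1,1}$ we can estimate ρ by predicting into the past

$$
P_{X_{t+2},X_{t+1}}(X_t) = P_{X_{t+1}}(X_t) + \rho\big(X_{t+2} - P_{X_{t+1}}(X_{t+2})\big)
$$

to give

$$
X_t = \phi_{1,1}X_{t+1} + \rho\,(X_{t+2} - \phi_{1,1}X_{t+1}) + \delta_{2,t}.
$$

This leads to an alternative estimator of ρ that minimises

$$
\sum_{t=1}^{n-2}\Big[X_t - \hat{\phi}_{1,1}X_{t+1} - \rho\big(X_{t+2} - \hat{\phi}_{1,1}X_{t+1}\big)\Big]^2.
$$

The Burg algorithm estimator of ρ minimises both the forward and backward prediction errors simultaneously:

$$
\hat{\rho}_2 = \arg\min_{\rho}\left(\sum_{t=3}^{n}\Big[X_t - \hat{\phi}_{1,1}X_{t-1} - \rho\big(X_{t-2} - \hat{\phi}_{1,1}X_{t-1}\big)\Big]^2
+ \sum_{t=1}^{n-2}\Big[X_t - \hat{\phi}_{1,1}X_{t+1} - \rho\big(X_{t+2} - \hat{\phi}_{1,1}X_{t+1}\big)\Big]^2\right).
$$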

Differentiating the above with respect to ρ and solving gives an explicit solution for $\hat{\rho}_2$. Moreover we can show
that $|\hat{\rho}_2| \le 1$. The estimators of the best fitting AR(2) parameters $(\phi_{1,2}, \phi_{2,2})$ are

$$
\hat{\phi}_{1,2} = \hat{\phi}_{1,1} - \hat{\rho}_2\hat{\phi}_{1,1} \quad\text{and}\quad \hat{\phi}_{2,2} = \hat{\rho}_2.
$$

Using the same method we can obtain estimators for $\{\hat{\phi}_{r,r}\}_r$ which can be used to construct the
estimates of the best fitting AR(p) parameters $\{\hat{\phi}_{j,p}\}_{j=1}^{p}$. It can be shown that the parameters
$\{\hat{\phi}_{j,p}\}_{j=1}^{p}$ correspond to a causal AR(p) model.
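
The first two steps of this construction can be sketched as follows. This is an illustration of the order-one and order-two steps described above, not a full Burg implementation such as the one behind R's ar function; the function name `burg_ar2`, the simulated series and the seed are assumptions. Setting the derivative of the combined criterion to zero gives $\hat{\rho}_2 = 2\sum f_t b_t / \sum (f_t^2 + b_t^2)$, where $f_t = X_t - \hat{\phi}_{1,1}X_{t-1}$ and $b_t = X_{t-2} - \hat{\phi}_{1,1}X_{t-1}$ are the forward and backward residuals.

```python
import random


def burg_ar2(x):
    """First two Burg steps; returns (phi11, rho2, (phi_{1,2}, phi_{2,2}))."""
    n = len(x)
    # Order 1: phi11 = 2 sum X_t X_{t+1} / (X_1^2 + X_n^2 + 2 sum_{t=2}^{n-1} X_t^2)
    num = 2.0 * sum(x[t] * x[t + 1] for t in range(n - 1))
    den = x[0] ** 2 + x[-1] ** 2 + 2.0 * sum(v ** 2 for v in x[1:-1])
    phi11 = num / den
    # Order 2: forward residual of X_t and backward residual of X_{t-2},
    # both after removing the AR(1) projection onto X_{t-1}.
    f = [x[t] - phi11 * x[t - 1] for t in range(2, n)]
    b = [x[t - 2] - phi11 * x[t - 1] for t in range(2, n)]
    rho2 = 2.0 * sum(u * v for u, v in zip(f, b)) / sum(u * u + v * v for u, v in zip(f, b))
    return phi11, rho2, (phi11 - rho2 * phi11, rho2)


# Simulate X_t = 1.5 X_{t-1} - 0.75 X_{t-2} + eps_t and discard a burn-in period.
random.seed(0)
x = [0.0, 0.0]
for _ in range(3000):
    x.append(1.5 * x[-1] - 0.75 * x[-2] + random.gauss(0.0, 1.0))
x = x[1000:]

phi11, rho2, ar2 = burg_ar2(x)
print(phi11, rho2)   # both lie in [-1, 1]
print(ar2)           # should be close to (1.5, -0.75)
```

Both `phi11` and `rho2` have the form $2\sum u_t v_t / \sum (u_t^2 + v_t^2)$, so the same pairing argument bounds each of them by one in absolute value.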

Proof that $0 \le |\hat{\phi}_{1,1}| \le 1$ To prove the result we pair the terms in the estimator

$$
\hat{\phi}_{1,1} = \frac{2\,[X_1X_2 + X_2X_3 + \ldots + X_{n-1}X_n]}{(X_1^2 + X_2^2) + (X_2^2 + X_3^2) + \ldots + (X_{n-2}^2 + X_{n-1}^2) + (X_{n-1}^2 + X_n^2)}.
$$

Each term in the numerator can be paired with a term in the denominator, i.e. using that
$(|X_t| - |X_{t+1}|)^2 \ge 0$ we have

$$
2|X_t X_{t+1}| \le X_t^2 + X_{t+1}^2, \qquad 1 \le t \le n-1.
$$

Thus the absolute value of the numerator is no larger than the denominator and we have

$$
|\hat{\phi}_{1,1}| \le \frac{2\,[|X_1X_2| + |X_2X_3| + \ldots + |X_{n-1}X_n|]}{(X_1^2 + X_2^2) + (X_2^2 + X_3^2) + \ldots + (X_{n-2}^2 + X_{n-1}^2) + (X_{n-1}^2 + X_n^2)} \le 1.
$$

This proves the claim. □
