STAT 520 Forecasting and Time Series: Lecture Notes

STAT 520
FORECASTING AND TIME SERIES
Fall, 2013
Lecture Notes
Joshua M. Tebbs
Department of Statistics
University of South Carolina
TABLE OF CONTENTS STAT 520, J. TEBBS
Contents
1 Introduction and Examples
2 Fundamental Concepts
2.1 Summary of important distribution theory
2.1.1 Univariate random variables
2.1.2 Bivariate random vectors
2.1.3 Multivariate extensions and linear combinations
2.1.4 Miscellaneous
2.2 Time series and stochastic processes
2.3 Means, variances, and covariances
2.4 Some (named) stochastic processes
2.5 Stationarity
3 Modeling Deterministic Trends
3.1 Introduction
3.2 Estimation of a constant mean
3.3 Regression methods
3.3.1 Straight line regression
3.3.2 Polynomial regression
3.3.3 Seasonal means model
3.3.4 Cosine trend model
3.4 Interpreting regression output
3.5 Residual analysis (model diagnostics)
3.5.1 Assessing normality
3.5.2 Assessing independence
3.5.3 Sample autocorrelation function
4 Models for Stationary Time Series
4.1 Introduction
4.2 Moving average processes
4.2.1 MA(1) process
4.2.2 MA(2) process
4.2.3 MA(q) process
4.3 Autoregressive processes
4.3.1 AR(1) process
4.3.2 AR(2) process
4.3.3 AR(p) process
4.4 Invertibility
4.5 Autoregressive moving average (ARMA) processes
5 Models for Nonstationary Time Series
5.1 Introduction
5.2 Autoregressive integrated moving average (ARIMA) models
5.2.1 IMA(1,1) process
5.2.2 IMA(2,2) process
5.2.3 ARI(1,1) process
5.2.4 ARIMA(1,1,1) process
5.3 Constant terms in ARIMA models
5.4 Transformations
6 Model Specification
6.1 Introduction
6.2 The sample autocorrelation function
6.3 The partial autocorrelation function
6.4 The extended autocorrelation function
6.5 Nonstationarity
6.6 Other model selection methods
6.7 Summary
7 Estimation
7.1 Introduction
7.2 Method of moments
7.2.1 Autoregressive models
7.2.2 Moving average models
7.2.3 Mixed ARMA models
7.2.4 White noise variance
7.2.5 Examples
7.3 Least squares estimation
7.3.1 Autoregressive models
7.3.2 Moving average models
7.3.3 Mixed ARMA models
7.3.4 White noise variance
7.3.5 Examples
7.4 Maximum likelihood estimation
7.4.1 Large-sample properties of MLEs
7.4.2 Examples
8 Model Diagnostics
8.1 Introduction
8.2 Residual analysis
8.2.1 Normality and independence
8.2.2 Residual ACF
8.3 Overfitting
9 Forecasting
9.1 Introduction
9.2 Deterministic trend models
9.3 ARIMA models
9.3.1 AR(1)
9.3.2 MA(1)
9.3.3 ARMA(p, q)
9.3.4 Nonstationary models
9.4 Prediction intervals
9.4.1 Deterministic trend models
9.4.2 ARIMA models
9.5 Forecasting transformed series
9.5.1 Differencing
9.5.2 Log-transformed series
10 Seasonal ARIMA Models
10.1 Introduction
10.2 Purely seasonal (stationary) ARMA models
10.2.1 MA(Q)_s
10.2.2 AR(P)_s
10.2.3 ARMA(P, Q)_s
10.3 Multiplicative seasonal (stationary) ARMA models
10.4 Nonstationary seasonal ARIMA (SARIMA) models
10.5 Additional topics
1 Introduction and Examples
Complementary reading: Chapter 1 (CC).
TERMINOLOGY : A time series is a sequence of ordered data. The “ordering” refers
generally to time, but other orderings could be envisioned (e.g., over space, etc.). In this
class, we will be concerned exclusively with time series that are
• measured on a single continuous random variable Y
• equally spaced in discrete time; that is, we will have a single realization of Y at
each second, hour, day, month, year, etc.
UBIQUITY : Time series data arise in a variety of fields. Here are just a few examples.
• In business, we observe daily stock prices, weekly interest rates, quarterly sales,
monthly supply figures, annual earnings, etc.
• In agriculture, we observe annual yields (e.g., crop production), daily crop prices,
annual herd sizes, etc.
• In engineering, we observe electric signals, voltage measurements, etc.
• In natural sciences, we observe chemical yields, turbulence in ocean waves, earth
tectonic plate positions, etc.
• In medicine, we observe EKG measurements on patients, drug concentrations,
blood pressure readings, etc.
• In epidemiology, we observe the number of flu cases per day, the number of
health-care clinic visits per week, annual tuberculosis counts, etc.
• In meteorology, we observe daily high temperatures, annual rainfall, hourly wind
speeds, earthquake frequency, etc.
• In social sciences, we observe annual birth and death rates, accident frequencies,
crime rates, school enrollments, etc.
[Plot: Global temperature deviations (vertical axis) versus Year (horizontal axis).]
Figure 1.1: Global temperature data. The data are a combination of land-air average
temperature anomalies, measured in degrees Centigrade.
Example 1.1. Global temperature data. “Global warming” refers to an increase in
the average temperature of the Earth’s near-surface air and oceans since the mid-20th
century and its projected continuation. The data in Figure 1.1 are annual temperature
deviations (1856-1997) in deg C, measured from a baseline average.
• Data file: globaltemps
• There are n = 142 observations.
• Measurements are taken each year.
• What are the noticeable patterns?
• Predictions? (Are we doomed?)
[Plot: Amount of milk produced (vertical axis) versus Year (horizontal axis).]
Figure 1.2: United States milk production data. Monthly production figures, measured
in millions of pounds, from January, 1994 to December, 2005.
Example 1.2. Milk production data. Commercial dairy farming produces the vast
majority of milk in the United States. The data in Figure 1.2 are the monthly U.S. milk
production (in millions of pounds) from January, 1994 to December, 2005.
• Data file: milk (TSA)
• Measurements are taken each month.
• Predictions?
[Plot: CREF stock values (vertical axis) versus Time (horizontal axis).]
Figure 1.3: CREF stock data. Daily values of one unit of the CREF stock fund from August
26, 2004 to August 15, 2006.
Example 1.3. CREF stock data. TIAA-CREF is the leading provider of retirement
accounts and products to employees in academic, research, medical, and cultural in-
stitutions. The data in Figure 1.3 are daily values of one unit of the CREF (College
Retirement Equity Fund) stock fund from 8/26/04 to 8/15/06.
• Data file: CREF (TSA)
• Measurements are taken each trading day.
• Predictions? (My retirement depends on these!)
[Plot: Number of homeruns (vertical axis) versus Year (horizontal axis).]
Figure 1.4: Homerun data. Number of homeruns hit by the Boston Red Sox each year
during 1909-2010.
Example 1.4. Homerun data. The Boston Red Sox are a professional baseball team
based in Boston, Massachusetts, and a member of Major League Baseball’s American
League Eastern Division. The data in Figure 1.4 are the number of homeruns hit by the
team each year from 1909 to 2010. Source: Ted Hornback (Spring, 2010).
• Data file: homeruns
• Predictions?
[Plot: Number of earthquakes (7.0 or greater) (vertical axis) versus Year (horizontal axis).]
Figure 1.5: Earthquake data. Number of “large” earthquakes per year from 1900-1998.
Example 1.5. Earthquake data. An earthquake occurs when there is a sudden release
of energy in the Earth’s crust. Earthquakes are caused mostly by rupture of geological
faults, but also by other events such as volcanic activity, landslides, mine blasts, and
nuclear tests. The data in Figure 1.5 are the number of global earthquakes annually
(with intensities of 7.0 or greater) during 1900-1998. Source: Craig Whitlow (Spring,
2010).
• Data file: earthquake
• Predictions?
[Plot: USC Columbia fall enrollment (vertical axis) versus Year (horizontal axis).]
Figure 1.6: University of South Carolina fall enrollment data. Number of students reg-
istered for classes on the Columbia campus during 1954-2010.
Example 1.6. Enrollment data. The data in Figure 1.6 are the annual fall enrollment
counts for USC (Columbia campus only, 1954-2010). The data were obtained from
the USC website http://www.ipr.sc.edu/enrollment/, which contains the enrollment
counts for all campuses in the USC system.
• Data file: enrollment
• Predictions?
[Plot: Star brightness (vertical axis) versus Time (horizontal axis).]
Figure 1.7: Star brightness data. Measurements for a single star taken over 600 consec-
utive nights.
Example 1.7. Star brightness data. Two factors determine the brightness of a star:
its luminosity (how much energy it puts out in a given time) and its distance from the
Earth. The data in Figure 1.7 are nightly brightness measurements (in magnitude) of a
single star over a period of 600 nights.
• Data file: star (TSA)
• Measurements are taken each night.
• Predictions?
[Plot: Airline miles (vertical axis) versus Year (horizontal axis).]
Figure 1.8: Airline passenger mile data. The number of miles, in thousands, traveled by
passengers in the United States from January, 1996 to May, 2005.
Example 1.8. Airline mile data. The Bureau of Transportation Statistics publishes
monthly passenger traffic data reflecting 100 percent of scheduled operations for airlines
in the United States. The data in Figure 1.8 are monthly U.S. airline passenger miles
traveled from 1/1996 to 5/2005.
• Data file: airmiles (TSA)
• Predictions?
[Plot: SP500 Index (vertical axis) versus Time (horizontal axis).]
Figure 1.9: S&P Index price data. Daily values of the index from June 6, 1999 to June
5, 2000.
Example 1.9. S&P500 Index data. The S&P500 is a capitalization-weighted index
(published since 1957) of the prices of 500 large-cap common stocks actively traded in
the United States. The data in Figure 1.9 are the daily S&P500 Index prices measured
from June 6, 1999 to June 5, 2000.
• Data file: sp500
• Measurements are taken each trading day.
• Predictions?
[Plot: Ventilation (L/min) (vertical axis) versus Observation time (horizontal axis).]
Figure 1.10: Ventilation data. Ventilation measurements on a single cyclist at 15 second
intervals.
Example 1.10. Ventilation data. Collecting expired gases during exercise allows one to
quantify many outcomes during an exercise test. One such outcome is the ventilatory
threshold; i.e., the point at which lactate begins to accumulate in the blood. The data
in Figure 1.10 are ventilation observations (L/min) on a single cyclist during exercise.
Observations are recorded every 15 seconds. Source: Joe Alemany (Spring, 2010).
• Data file: ventilation
• Measurements are taken every 15 seconds.
• Predictions?
[Plot: British pounds (vertical axis) versus Year (horizontal axis).]
Figure 1.11: Exchange rate data. Weekly exchange rate of US dollar compared to the
British pound, from 1980-1988.
Example 1.11. Exchange rate data. The pound sterling, often simply called “the
pound,” is the currency of the United Kingdom and many of its territories. The data in
Figure 1.11 are weekly exchange rates of the US dollar and the British pound between
the years 1980 and 1988.
• Data file: exchangerate
• Measurements are taken each week.
• Predictions?
[Plot: Oil prices (vertical axis) versus Year (horizontal axis).]
Figure 1.12: Crude oil price data. Monthly spot prices in dollars from Cushing, OK,
from 1/1986 to 1/2006.
Example 1.12. Oil price data. Crude oil prices behave much as any other commodity
with wide price swings in times of shortage or oversupply. The crude oil price cycle may
extend over several years responding to changes in demand. The data in Figure 1.12 are
monthly spot prices for crude oil (measured in U.S. dollars per barrel) from Cushing,
OK.
• Data file: oil.price (TSA)
• Predictions?
[Plot: LA rainfall amounts (vertical axis) versus Year (horizontal axis).]
Figure 1.13: Los Angeles rainfall data. Annual precipitation measurements, in inches,
during 1878-1992.
Example 1.13. Annual rainfall data. Los Angeles averages 15 inches of precipitation
annually, which mainly occurs during the winter and spring (November through April)
with generally light rain showers, but sometimes as heavy rainfall and thunderstorms.
The data in Figure 1.13 are annual rainfall totals for Los Angeles during 1878-1992.
• Data file: larain (TSA)
• Predictions?
[Plot: Australian clay brick production (in millions) (vertical axis) versus Time (horizontal axis).]
Figure 1.14: Australian clay brick production data. Number of bricks (in millions)
produced from 1956-1994.
Example 1.14. Brick production data. Clay bricks remain extremely popular for the
cladding of houses and small commercial buildings throughout Australia due to their
versatility of use, tensile strength, thermal properties and attractive appearance. The
data in Figure 1.14 represent the number of bricks produced in Australia (in millions)
during 1956-1994. The data are quarterly.
• Data file: brick
• Measurements are taken each quarter.
• Predictions?
[Plot: Percentage granted review (vertical axis) versus Time (horizontal axis).]
Figure 1.15: United States Supreme Court data. Percent of cases granted review during
1926-2004.
Example 1.15. Supreme Court data. The Supreme Court of the United States has
ultimate (but largely discretionary) appellate jurisdiction over all state and federal courts,
and original jurisdiction over a small range of cases. The data in Figure 1.15 represent
the acceptance rate of cases appealed to the Supreme Court during 1926-2004. Source:
Jim Manning (Spring, 2010).
• Data file: supremecourt
• Predictions?
IMPORTANCE : The purpose of time series analysis is twofold:
1. to model the stochastic (random) mechanism that gives rise to the series of data
2. to predict (forecast) the future values of the series based on the previous history.
NOTES : The analysis of time series data calls for a “new way of thinking” when compared
to other statistical methods courses. Essentially, we get to see only a single measurement
from a population (at time t) instead of a sample of measurements at a fixed point in
time (cross-sectional data).
• The special feature of time series data is that they are not independent! Instead,
observations are correlated through time.
– Correlated data are generally more difficult to analyze.
– Statistical theory in the absence of independence becomes markedly more
difficult.
• Most classical statistical methods (e.g., regression, analysis of variance, etc.) assume
that observations are statistically independent. For example, in the simple
linear regression model
$$Y_i = \beta_0 + \beta_1 x_i + \epsilon_i,$$
or an ANOVA model like
$$Y_{ijk} = \mu + \alpha_i + \beta_j + (\alpha\beta)_{ij} + \epsilon_{ijk},$$
we typically assume that the $\epsilon$ error terms are independent and identically
distributed (iid) normal random variables with mean 0 and constant variance.
• There can be additional trends or seasonal variation patterns (seasonality) that
may be difficult to identify and model.
• The data may be highly non-normal in appearance and be possibly contaminated
by outliers.
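ILLUSTRATION : A minimal Python sketch of what "correlated through time" can look like, assuming simulated data rather than any of the course data files. The seed, the series length of 500, and the helper name lag1_autocorrelation are hypothetical choices made only for this illustration. The sketch compares the sample correlation between consecutive observations for independent noise and for the running sum of that same noise.

```python
import random


def lag1_autocorrelation(series):
    """Sample correlation between consecutive observations (Y_t, Y_{t-1})."""
    n = len(series)
    mean = sum(series) / n
    numerator = sum((series[t] - mean) * (series[t - 1] - mean) for t in range(1, n))
    denominator = sum((y - mean) ** 2 for y in series)
    return numerator / denominator


random.seed(520)  # arbitrary seed, used only for reproducibility
noise = [random.gauss(0.0, 1.0) for _ in range(500)]  # independent observations

# Running sum of the noise: every value carries all earlier shocks with it,
# so neighboring observations share most of their history.
walk = []
total = 0.0
for shock in noise:
    total += shock
    walk.append(total)

r_noise = lag1_autocorrelation(noise)
r_walk = lag1_autocorrelation(walk)
print("lag-1 correlation, independent noise:", round(r_noise, 3))
print("lag-1 correlation, running sum:      ", round(r_walk, 3))
```

The first value should be near zero and the second near one. Methods that assume independence would be reasonable for the first series and badly misleading for the second.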
MODELING: Our overarching goal in this course is to build (and use) time series models
for data. This breaks down into different parts.
1. Model specification (identification)
• Consider different classes of time series models for stationary processes.
• Use descriptive statistics, graphical displays, subject matter knowledge, etc.
to make sensible candidate selections.
• Abide by the Principle of Parsimony.
2. Model fitting
• Once a candidate model is chosen, estimate the parameters in the model.
• We will use least squares and/or maximum likelihood to do this.
3. Model diagnostics
• Use statistical inference and graphical displays to check how well the model
fits the data.
• This part of the analysis may suggest the candidate model is inadequate and
may point to more appropriate models.
TIME SERIES PLOT : The time series plot is the most basic graphical display in the
analysis of time series data. The plot is basically a scatterplot of $Y_t$ versus $t$, with
straight lines connecting the points. Notationally,
$Y_t$ = value of the variable $Y$ at time $t$, for $t = 1, 2, ..., n$.
The subscript $t$ tells us to which time point the measurement $Y_t$ corresponds. Note that
in the sequence $Y_1, Y_2, ..., Y_n$, the subscripts are very important because they correspond
to a particular ordering of the data. This is perhaps a change in mindset from other
methods courses where the time element is ignored.
[Plot: Airline miles (vertical axis) versus Year (horizontal axis), with each point drawn as the first letter of its month.]
Figure 1.16: Airline passenger mile data. The number of miles, in thousands, traveled
by passengers in the United States from January, 1996 to May, 2005. Monthly plotting
symbols have been added.
GRAPHICS : The time series plot is vital, both to describe the data and to help formulate
a sensible model. Here are some simple, but important, guidelines for constructing
these plots.
• Give a clear, self-explanatory title or figure caption.
• State the units of measurement in the axis labels or figure caption.
• Choose the scales carefully (including the size of the intercept). Default settings
from software may be sufficient.
• Label axes clearly.
• Use special plotting symbols where appropriate; e.g., months of the year, days of
the week, actual numerical values for outlying values, etc.
2 Fundamental Concepts
2.1 Summary of important distribution theory
DISCLAIMER: Going forward, we must be familiar with the following results from prob-
ability and distribution theory (e.g., STAT 511, etc.). If you have not had this material,
you should find a suitable reference and study up on your own. See also pp 24-26 (CC).
REVIEW : Informally, a random variable Y is a variable whose value cannot be
predicted with certainty. Instead, the variable is said to vary according to a probability
distribution which describes which values Y can assume and with what probability
it assumes those values. There are basically two types of random variables. Discrete
random variables take on specific values with positive probability. Continuous random
variables have positive probability assigned to intervals of possible values. In this course,
we will restrict attention to random variables Y which are best viewed as continuous (or
at least quantitative).
2.1.1 Univariate random variables
DEFINITION : The (cumulative) distribution function (cdf) of a random variable
$Y$, denoted by $F_Y(y)$, is a function that gives the probability $F_Y(y) = P(Y \le y)$, for all
$-\infty < y < \infty$. Mathematically, a random variable $Y$ is said to be continuous if its cdf
$F_Y(y)$ is a continuous function of $y$.
TERMINOLOGY : Let $Y$ be a continuous random variable with cdf $F_Y(y)$. The probability
density function (pdf) for $Y$, denoted by $f_Y(y)$, is given by
$$f_Y(y) = \frac{d}{dy} F_Y(y),$$
provided that this derivative exists.
PROPERTIES : Suppose that $Y$ is a continuous random variable with pdf $f_Y(y)$ and
support $R$ (that is, the set of all values that $Y$ can assume). Then
(1) $f_Y(y) > 0$, for all $y \in R$,
(2) the function $f_Y(y)$ satisfies $\int_R f_Y(y)\,dy = 1$.
RESULT : Suppose $Y$ is a continuous random variable with pdf $f_Y(y)$ and cdf $F_Y(y)$.
Then
$$P(a < Y < b) = \int_a^b f_Y(y)\,dy = F_Y(b) - F_Y(a).$$
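ILLUSTRATION : This result can be checked numerically. The Python sketch below assumes, purely for illustration, the pdf $f_Y(y) = e^{-y}$ for $y > 0$, whose cdf is $F_Y(y) = 1 - e^{-y}$. The distribution, the endpoints a = 0.5 and b = 2, and the helper names are hypothetical choices, not part of the notes. The sketch compares a midpoint-rule approximation of the integral with the difference of cdf values.

```python
import math


def pdf(y):
    """Assumed pdf for illustration: f(y) = exp(-y) for y > 0, and 0 otherwise."""
    return math.exp(-y) if y > 0 else 0.0


def cdf(y):
    """Matching cdf: F(y) = 1 - exp(-y) for y > 0, and 0 otherwise."""
    return 1.0 - math.exp(-y) if y > 0 else 0.0


def integrate(f, lower, upper, steps=100_000):
    """Midpoint-rule approximation of the integral of f over [lower, upper]."""
    width = (upper - lower) / steps
    return width * sum(f(lower + (i + 0.5) * width) for i in range(steps))


a, b = 0.5, 2.0
by_integration = integrate(pdf, a, b)
by_cdf = cdf(b) - cdf(a)
print("integral of the pdf over (a, b):", by_integration)
print("F(b) - F(a):                    ", by_cdf)
```

The two printed values should agree to many decimal places, which is why interval probabilities are usually computed from the cdf when it is available in closed form.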
TERMINOLOGY : Let $Y$ be a continuous random variable with pdf $f_Y(y)$ and support
$R$. The expected value (or mean) of $Y$ is given by
$$E(Y) = \int_R y f_Y(y)\,dy.$$
Mathematically, we require that
$$\int_R |y| f_Y(y)\,dy < \infty.$$
If this is not true, then we say that $E(Y)$ does not exist. If $g$ is a real-valued function,
then $g(Y)$ is a random variable and
$$E[g(Y)] = \int_R g(y) f_Y(y)\,dy,$$
provided that this integral exists.
PROPERTIES OF EXPECTATIONS : Let $Y$ be a random variable with pdf $f_Y(y)$ and
support $R$, suppose that $g, g_1, g_2, ..., g_k$ are real-valued functions, and let $a$ be any real
constant. Then
(a) $E(a) = a$
(b) $E[a g(Y)] = a E[g(Y)]$
(c) $E[\sum_{j=1}^k g_j(Y)] = \sum_{j=1}^k E[g_j(Y)]$.
TERMINOLOGY : Let $Y$ be a continuous random variable with pdf $f_Y(y)$, support $R$,
and mean $E(Y) = \mu$. The variance of $Y$ is given by
$$\text{var}(Y) = E[(Y - \mu)^2] = \int_R (y - \mu)^2 f_Y(y)\,dy.$$
In general, it will be easier to use the variance computing formula
$$\text{var}(Y) = E(Y^2) - [E(Y)]^2.$$
We will often use the statistical symbol $\sigma^2$ or $\sigma_Y^2$ to denote $\text{var}(Y)$.
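DERIVATION : The variance computing formula follows from expanding the square and applying the linearity properties of expectation. A short sketch, writing $\mu = E(Y)$ and assuming that $E(Y^2)$ exists:

```latex
\begin{align*}
\text{var}(Y) &= E[(Y - \mu)^2] \\
              &= E(Y^2 - 2\mu Y + \mu^2) \\
              &= E(Y^2) - 2\mu E(Y) + \mu^2 \\
              &= E(Y^2) - 2\mu^2 + \mu^2 \\
              &= E(Y^2) - [E(Y)]^2.
\end{align*}
```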
FACTS :
(a) $\text{var}(Y) \ge 0$. $\text{var}(Y) = 0$ if and only if the random variable $Y$ has a degenerate
distribution; i.e., all the probability mass is located at one support point.
(b) The larger (smaller) $\text{var}(Y)$ is, the more (less) spread in the possible values of $Y$
about the mean $\mu = E(Y)$.
(c) $\text{var}(Y)$ is measured in (units)$^2$. The standard deviation of $Y$ is
$\sigma = \sqrt{\sigma^2} = \sqrt{\text{var}(Y)}$ and is measured in the original units of $Y$.
IMPORTANT RESULT : Let $Y$ be a random variable, and suppose that $a$ and $b$ are fixed
constants. Then
$$\text{var}(a + bY) = b^2\,\text{var}(Y).$$
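DERIVATION : A brief sketch of why the additive constant drops out and the multiplier is squared. It uses only the definition of variance and the fact that $E(a + bY) = a + b\mu$, where $\mu = E(Y)$:

```latex
\begin{align*}
\text{var}(a + bY) &= E\{[(a + bY) - (a + b\mu)]^2\} \\
                   &= E[b^2 (Y - \mu)^2] \\
                   &= b^2 E[(Y - \mu)^2] \\
                   &= b^2\,\text{var}(Y).
\end{align*}
```

Shifting a random variable by $a$ moves its mean but not its spread, while rescaling by $b$ multiplies the standard deviation by $|b|$.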
2.1.2 Bivariate random vectors
TERMINOLOGY : Let $X$ and $Y$ be continuous random variables. $(X, Y)$ is called a
continuous random vector, and the joint probability density function (pdf) of
$X$ and $Y$ is denoted by $f_{X,Y}(x, y)$.
PROPERTIES : The function $f_{X,Y}(x, y)$ has the following properties:
(1) $f_{X,Y}(x, y) > 0$, for all $(x, y) \in R \subseteq \mathbb{R}^2$
(2) The function $f_{X,Y}(x, y)$ satisfies $\int\int_R f_{X,Y}(x, y)\,dx\,dy = 1$.
RESULT : Suppose $(X, Y)$ is a continuous random vector with joint pdf $f_{X,Y}(x, y)$. Then
$$P[(X, Y) \in B] = \int\int_B f_{X,Y}(x, y)\,dx\,dy,$$
for any set $B \subset \mathbb{R}^2$.
TERMINOLOGY : Suppose that $(X, Y)$ is a continuous random vector with joint pdf
$f_{X,Y}(x, y)$. The joint cumulative distribution function (cdf) for $(X, Y)$ is given by
$$F_{X,Y}(x, y) = P(X \le x, Y \le y) = \int_{-\infty}^{x} \int_{-\infty}^{y} f_{X,Y}(t, s)\,ds\,dt,$$
for all $(x, y) \in \mathbb{R}^2$. It follows upon differentiation that the joint pdf is given by
$$f_{X,Y}(x, y) = \frac{\partial^2}{\partial x \partial y} F_{X,Y}(x, y),$$
wherever this mixed partial derivative is defined.
RESULT : Suppose that $(X, Y)$ has joint pdf $f_{X,Y}(x, y)$ and support $R$. Let $g(X, Y)$ be
a real-valued function of $(X, Y)$; i.e., $g : \mathbb{R}^2 \to \mathbb{R}$. Then
$$E[g(X, Y)] = \int\int_R g(x, y) f_{X,Y}(x, y)\,dx\,dy.$$
If this quantity is not finite, then we say that $E[g(X, Y)]$ does not exist.
PROPERTIES OF EXPECTATIONS : Let $(X, Y)$ be a random vector, suppose that
$g, g_1, g_2, ..., g_k$ are real-valued functions from $\mathbb{R}^2 \to \mathbb{R}$, and let $a$ be any real
constant. Then
(a) $E(a) = a$
(b) $E[a g(X, Y)] = a E[g(X, Y)]$
(c) $E[\sum_{j=1}^k g_j(X, Y)] = \sum_{j=1}^k E[g_j(X, Y)]$.
TERMINOLOGY : Suppose that $(X, Y)$ is a continuous random vector with joint cdf
$F_{X,Y}(x, y)$, and denote the marginal cdfs of $X$ and $Y$ by $F_X(x)$ and $F_Y(y)$, respectively.
The random variables $X$ and $Y$ are independent if and only if
$$F_{X,Y}(x, y) = F_X(x) F_Y(y),$$
for all values of $x$ and $y$. It can hence be shown that $X$ and $Y$ are independent if and
only if
$$f_{X,Y}(x, y) = f_X(x) f_Y(y),$$
for all values of $x$ and $y$. That is, the joint pdf $f_{X,Y}(x, y)$ factors into the product of the
marginal pdfs $f_X(x)$ and $f_Y(y)$.
RESULT : Suppose that X and Y are independent random variables. Let g(X) be a
function of X only, and let h(Y ) be a function of Y only. Then
E[g(X)h(Y )] = E[g(X)]E[h(Y )],
provided that all expectations exist. Taking g(X) = X and h(Y ) = Y , we get
E(XY ) = E(X)E(Y ).
TERMINOLOGY : Suppose that $X$ and $Y$ are random variables with means $E(X) = \mu_X$
and $E(Y) = \mu_Y$, respectively. The covariance between $X$ and $Y$ is
$$\text{cov}(X, Y) = E[(X - \mu_X)(Y - \mu_Y)] = E(XY) - E(X)E(Y).$$
The latter expression is called the covariance computing formula. The covariance is
a numerical measure that describes how two variables are linearly related.
• If cov(X, Y ) > 0, then X and Y are positively linearly related.
• If cov(X, Y ) < 0, then X and Y are negatively linearly related.
• If cov(X, Y ) = 0, then X and Y are not linearly related.
RESULT : If X and Y are independent, then cov(X, Y ) = 0. The converse is not neces-
sarily true.
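ILLUSTRATION : A standard counterexample showing that zero covariance does not imply independence, sketched in Python with exact fractions. It assumes a discrete random variable (X takes the values -1, 0, 1 with equal probability, and Y = X^2), so expectations are sums rather than integrals. This example and its variable names are hypothetical and are not taken from the notes.

```python
from fractions import Fraction

# Assumed example: X is -1, 0, or 1, each with probability 1/3, and Y = X^2.
support = [-1, 0, 1]
p = Fraction(1, 3)

e_x = sum(p * x for x in support)            # E(X)
e_y = sum(p * x**2 for x in support)         # E(Y) = E(X^2)
e_xy = sum(p * x * x**2 for x in support)    # E(XY) = E(X^3)
covariance = e_xy - e_x * e_y                # covariance computing formula

# Independence would require P(X = 0, Y = 0) = P(X = 0) * P(Y = 0).
p_joint = p          # X = 0 forces Y = 0, so the joint probability is 1/3
p_product = p * p    # P(X = 0) = 1/3 and P(Y = 0) = 1/3

print("cov(X, Y) =", covariance)
print("P(X=0, Y=0) =", p_joint, " but P(X=0)P(Y=0) =", p_product)
```

Here Y is completely determined by X, yet the covariance is zero because the relationship is not linear.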
RESULT : Suppose that X and Y are random variables.
var(X + Y ) = var(X) + var(Y ) + 2cov(X, Y )
var(X − Y ) = var(X) + var(Y ) − 2cov(X, Y ).
RESULT : Suppose that X and Y are independent random variables.
var(X + Y ) = var(X) + var(Y )
var(X − Y ) = var(X) + var(Y ).
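ILLUSTRATION : The same identities hold for sample moments when the variance and covariance use a common divisor, which gives a quick numerical check. The Python sketch below uses simulated data. The seed, the sample size of 200, the coefficient 0.6, and the helper names are hypothetical choices made only for illustration.

```python
import random


def mean(values):
    return sum(values) / len(values)


def cov(xs, ys):
    """Sample covariance with divisor n; cov(xs, xs) is the sample variance."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)


random.seed(2013)  # arbitrary seed, used only for reproducibility
x = [random.gauss(0.0, 1.0) for _ in range(200)]
y = [0.6 * xi + random.gauss(0.0, 1.0) for xi in x]  # built to be correlated with x

total = [xi + yi for xi, yi in zip(x, y)]
difference = [xi - yi for xi, yi in zip(x, y)]

var_sum = cov(total, total)
var_diff = cov(difference, difference)
rhs_sum = cov(x, x) + cov(y, y) + 2 * cov(x, y)
rhs_diff = cov(x, x) + cov(y, y) - 2 * cov(x, y)

print("var(X + Y):", var_sum, " formula:", rhs_sum)
print("var(X - Y):", var_diff, " formula:", rhs_diff)
```

Dropping the covariance term, as in the independent case, would misstate both variances for data like these.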
RESULTS : Suppose that X and Y are random variables. The covariance operator sat-
isfies the following:
(a) cov(X, Y ) = cov(Y, X)
(b) cov(X, X) = var(X).
(c) cov(a + bX, c + dY ) = bd cov(X, Y ), for any constants a, b, c, and d.
DEFINITION : Suppose that $X$ and $Y$ are random variables. The correlation between
$X$ and $Y$ is defined by
$$\rho = \text{corr}(X, Y) = \frac{\text{cov}(X, Y)}{\sigma_X \sigma_Y}.$$
NOTES :
(1) $-1 \le \rho \le 1$.
(2) If $\rho = 1$, then $Y = \beta_0 + \beta_1 X$, where $\beta_1 > 0$. That is, $X$ and $Y$ are perfectly
positively linearly related; i.e., the bivariate probability distribution of $(X, Y)$ lies
entirely on a straight line with positive slope.
(3) If $\rho = -1$, then $Y = \beta_0 + \beta_1 X$, where $\beta_1 < 0$. That is, $X$ and $Y$ are perfectly
negatively linearly related; i.e., the bivariate probability distribution of $(X, Y)$ lies
entirely on a straight line with negative slope.
(4) If $\rho = 0$, then $X$ and $Y$ are not linearly related.
RESULT : If X and Y are independent, then ρ = ρX,Y = 0. The converse is not true in
general. However,
ρ = corr(X, Y ) = 0 =⇒ X and Y independent
when (X, Y ) has a bivariate normal distribution.
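ILLUSTRATION : The boundary cases of the correlation can be seen with the sample analogue of $\rho$. The Python sketch below assumes a small made-up data set and the hypothetical helper correlation; the intercept 3 and slopes 2 and -2 are arbitrary. An exact linear relationship gives a correlation of 1 or -1 according to the sign of the slope.

```python
import math


def correlation(xs, ys):
    """Sample correlation: covariance divided by the product of standard deviations."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov_xy = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n
    sd_x = math.sqrt(sum((x - mx) ** 2 for x in xs) / n)
    sd_y = math.sqrt(sum((y - my) ** 2 for y in ys) / n)
    return cov_xy / (sd_x * sd_y)


x = [1.0, 2.0, 4.0, 7.0, 11.0]                        # made-up values
rho_positive = correlation(x, [3.0 + 2.0 * xi for xi in x])   # slope > 0
rho_negative = correlation(x, [3.0 - 2.0 * xi for xi in x])   # slope < 0

print("positive slope:", rho_positive)
print("negative slope:", rho_negative)
```

The size of the slope does not matter here. Only its sign determines whether the correlation is 1 or -1.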
2.1.3 Multivariate extensions and linear combinations
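EXTENSION : We use the notation $\mathbf{Y} = (Y_1, Y_2, ..., Y_n)$ and $\mathbf{y} = (y_1, y_2, ..., y_n)$. The joint
cdf of $\mathbf{Y}$ is
$$F_{\mathbf{Y}}(\mathbf{y}) = P(Y_1 \le y_1, Y_2 \le y_2, ..., Y_n \le y_n)
= \int_{-\infty}^{y_1} \int_{-\infty}^{y_2} \cdots \int_{-\infty}^{y_n} f_{\mathbf{Y}}(\mathbf{t})\,dt_n \cdots dt_2\,dt_1,$$
where $\mathbf{t} = (t_1, t_2, ..., t_n)$ and $f_{\mathbf{Y}}(\mathbf{y})$ denotes the joint pdf of $\mathbf{Y}$.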
EXTENSION : Suppose that the random vector $\mathbf{Y} = (Y_1, Y_2, ..., Y_n)$ has joint cdf $F_{\mathbf{Y}}(\mathbf{y})$,
and suppose that the random variable $Y_i$ has cdf $F_{Y_i}(y_i)$, for $i = 1, 2, ..., n$. Then,
$Y_1, Y_2, ..., Y_n$ are independent random variables if and only if
$$F_{\mathbf{Y}}(\mathbf{y}) = \prod_{i=1}^n F_{Y_i}(y_i);$$
that is, the joint cdf can be factored into the product of the marginal cdfs. Alternatively,
$Y_1, Y_2, ..., Y_n$ are independent random variables if and only if
$$f_{\mathbf{Y}}(\mathbf{y}) = \prod_{i=1}^n f_{Y_i}(y_i);$$
that is, the joint pdf can be factored into the product of the marginal pdfs.
MATHEMATICAL EXPECTATION : Suppose that $Y_1, Y_2, ..., Y_n$ are (mutually) independent
random variables. For real-valued functions $g_1, g_2, ..., g_n$,
$$E[g_1(Y_1) g_2(Y_2) \cdots g_n(Y_n)] = E[g_1(Y_1)] E[g_2(Y_2)] \cdots E[g_n(Y_n)],$$
provided that each expectation exists.
TERMINOLOGY : Suppose that $Y_1, Y_2, ..., Y_n$ are random variables and that $a_1, a_2, ..., a_n$
are constants. The function
$$U = \sum_{i=1}^n a_i Y_i = a_1 Y_1 + a_2 Y_2 + \cdots + a_n Y_n$$
is called a linear combination of the random variables $Y_1, Y_2, ..., Y_n$.
REMARK : Linear combinations are commonly seen in the theoretical development of
time series models. We therefore must be familiar with the following results.
EXPECTED VALUE OF A LINEAR COMBINATION :
$$E(U) = E\left(\sum_{i=1}^n a_i Y_i\right) = \sum_{i=1}^n a_i E(Y_i)$$
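VARIANCE OF A LINEAR COMBINATION :
$$\text{var}(U) = \text{var}\left(\sum_{i=1}^n a_i Y_i\right) = \sum_{i=1}^n a_i^2\,\text{var}(Y_i) + 2 \sum_{i<j} a_i a_j\,\text{cov}(Y_i, Y_j)$$
$$= \sum_{i=1}^n a_i^2\,\text{var}(Y_i) + \sum_{i \ne j} a_i a_j\,\text{cov}(Y_i, Y_j)$$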
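ILLUSTRATION : The two forms of the variance formula differ only in how the off-diagonal covariances are counted: once for each pair with i < j and then doubled, or once for each ordered pair with i not equal to j. The Python sketch below evaluates both for n = 3. The covariance values in sigma and the coefficients in a are hypothetical numbers chosen for illustration.

```python
# Assumed covariances: sigma[i][j] = cov(Y_i, Y_j), with variances on the diagonal.
sigma = [
    [4.0, 1.0, 0.5],
    [1.0, 9.0, -2.0],
    [0.5, -2.0, 1.0],
]
a = [1.0, -2.0, 3.0]  # assumed coefficients a_1, a_2, a_3
n = len(a)

# First term: sum of a_i^2 var(Y_i).
variance_part = sum(a[i] ** 2 * sigma[i][i] for i in range(n))

# Form 1: each unordered pair i < j counted once, then doubled.
pairs_once = sum(a[i] * a[j] * sigma[i][j] for i in range(n) for j in range(i + 1, n))
var_form1 = variance_part + 2 * pairs_once

# Form 2: every ordered pair with i != j, so each pair appears twice.
pairs_ordered = sum(a[i] * a[j] * sigma[i][j] for i in range(n) for j in range(n) if i != j)
var_form2 = variance_part + pairs_ordered

print("2 * sum over i < j:", var_form1)
print("sum over i != j:   ", var_form2)
```

Both forms give the same value because cov(Y_i, Y_j) = cov(Y_j, Y_i).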
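COVARIANCE BETWEEN TWO LINEAR COMBINATIONS : Suppose that
\[
\begin{aligned}
U_1 &= \sum_{i=1}^{n} a_i Y_i = a_1 Y_1 + a_2 Y_2 + \cdots + a_n Y_n \\
U_2 &= \sum_{j=1}^{m} b_j X_j = b_1 X_1 + b_2 X_2 + \cdots + b_m X_m .
\end{aligned}
\]
Then,
\[
\operatorname{cov}(U_1, U_2) = \sum_{i=1}^{n} \sum_{j=1}^{m} a_i b_j \operatorname{cov}(Y_i, X_j).
\]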
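The variance and covariance formulas for linear combinations can be checked numerically. The following is a minimal Python sketch, not part of the original notes; the function names and the example covariance matrix are illustrative assumptions. It evaluates var(U) with the "variances plus twice the i < j covariances" form and compares it with cov(U, U) computed from the double-sum formula.

```python
def cov_linear_combinations(a, b, cov_yx):
    """cov(U1, U2) = sum_i sum_j a_i * b_j * cov(Y_i, X_j)."""
    return sum(a[i] * b[j] * cov_yx[i][j]
               for i in range(len(a)) for j in range(len(b)))


def var_linear_combination(a, cov_y):
    """var(U) = sum_i a_i^2 var(Y_i) + 2 * sum_{i<j} a_i a_j cov(Y_i, Y_j)."""
    n = len(a)
    variances = sum(a[i] ** 2 * cov_y[i][i] for i in range(n))
    cross = sum(a[i] * a[j] * cov_y[i][j]
                for i in range(n) for j in range(i + 1, n))
    return variances + 2 * cross


# Hypothetical covariance matrix of (Y1, Y2, Y3), chosen only for illustration.
cov_y = [[4.0, 1.0, 0.5],
         [1.0, 9.0, -2.0],
         [0.5, -2.0, 1.0]]
a = [1.0, -2.0, 3.0]

var_u = var_linear_combination(a, cov_y)
# var(U) should agree with cov(U, U) from the double-sum formula.
cov_uu = cov_linear_combinations(a, a, cov_y)
print(var_u, cov_uu)
```

Both routes give the same number because the sum over i ≠ j counts each pair (i, j) twice, which is exactly the factor of 2 in front of the i < j sum.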
2.1.4 Miscellaneous
GEOMETRIC SUMS : Suppose that a is any real number and that |r| < 1. Then, the
finite geometric sum
\[
\sum_{j=0}^{n} a r^j = \frac{a(1 - r^{n+1})}{1 - r}.
\]
Taking limits of both sides, we get
\[
\sum_{j=0}^{\infty} a r^j = \frac{a}{1 - r}.
\]
These formulas should be committed to memory.
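As a quick sanity check on the two geometric sum formulas, here is a small Python sketch (the function names and the values a = 2, r = 0.75 are assumptions chosen for illustration). It compares the closed forms with a direct term-by-term sum.

```python
def finite_geometric_sum(a, r, n):
    """Closed form of sum_{j=0}^{n} a * r**j (requires r != 1)."""
    return a * (1 - r ** (n + 1)) / (1 - r)


def infinite_geometric_sum(a, r):
    """Closed form of sum_{j=0}^{infinity} a * r**j, valid only for |r| < 1."""
    if abs(r) >= 1:
        raise ValueError("the infinite sum converges only when |r| < 1")
    return a / (1 - r)


a, r, n = 2.0, 0.75, 10
direct = sum(a * r ** j for j in range(n + 1))   # term-by-term sum
closed = finite_geometric_sum(a, r, n)           # closed form
limit = infinite_geometric_sum(a, r)             # a / (1 - r)
print(direct, closed, limit)
```

For large n the finite sum approaches the infinite one because the term r^(n+1) shrinks to zero when |r| < 1.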
2.2 Time series and stochastic processes
TERMINOLOGY : The sequence of random variables {Yt : t = 0, 1, 2, ...}, or more simply {Yt }, is called a stochastic process. It is a collection of random variables indexed by time t; that is,
Y0 = value of the process at time t = 0
...
Yn = value of the process at time t = n.
The subscripts are important because they indicate the time period at which the value of Y is measured. A stochastic process can be described as “a statistical phenomenon that evolves through time according to a set of probabilistic laws.”
• A complete probabilistic time series model for {Yt }, in fact, would specify all of the
joint distributions of random vectors Y = (Y1 , Y2 , ..., Yn ), for all n = 1, 2, ..., or,
equivalently, specify the joint probabilities
P (Y1 ≤ y1 , Y2 ≤ y2 , ..., Yn ≤ yn ),
for all y = (y1 , y2 , ..., yn ) and n = 1, 2, ....
• This specification is not generally needed in practice. In this course, we specify

only the first and second-order moments; i.e., expectations of the form E(Yt ) and
E(Yt Yt−k ), for k = 0, 1, 2, ..., and t = 0, 1, 2, ....
• Much of the important information in most time series processes is captured in

these first and second moments (or, equivalently, in the means, variances, and
covariances).
2.3 Means, variances, and covariances
TERMINOLOGY : For the stochastic process {Yt : t = 0, 1, 2, ..., }, the mean function
is defined as
µt = E(Yt ),
for t = 0, 1, 2, .... That is, µt is the theoretical (or population) mean for the series at
time t. The autocovariance function is defined as
γt,s = cov(Yt , Ys ),
for t, s = 0, 1, 2, ..., where cov(Yt , Ys ) = E(Yt Ys ) − E(Yt )E(Ys ). The autocorrelation

function is given by
ρt,s = corr(Yt , Ys ),
where
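\[
\operatorname{corr}(Y_t, Y_s) = \frac{\operatorname{cov}(Y_t, Y_s)}{\sqrt{\operatorname{var}(Y_t)\operatorname{var}(Y_s)}} = \frac{\gamma_{t,s}}{\sqrt{\gamma_{t,t}\,\gamma_{s,s}}}.
\]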
• Values of ρt,s near ±1 =⇒ strong linear dependence between Yt and Ys .
• Values of ρt,s near 0 =⇒ weak linear dependence between Yt and Ys .
• Values of ρt,s = 0 =⇒ Yt and Ys are uncorrelated.
2.4 Some (named) stochastic processes
Example 2.1. A stochastic process {et : t = 0, 1, 2, ..., } is called a white noise process
if it is a sequence of independent and identically distributed (iid) random variables with
E(et ) = µe
var(et ) = σe2 .
• Both µe and σe2 are constant (free of t).
[Plot omitted: simulated white noise process versus time.]
Figure 2.1: A simulated white noise process et ∼ iid N (0, σe2 ), where n = 150 and σe2 = 1.
• It is often assumed that µe = 0; that is, {et } is a zero mean process.
• A slightly less restrictive definition would require only that the et ’s be uncorrelated (rather than independent). However, under normality, i.e., et ∼ iid N (0, $\sigma_e^2$), this distinction becomes vacuous (for linear time series models).
AUTOCOVARIANCE FUNCTION : For t = s,
cov(et , es ) = cov(et , et ) = var(et ) = σe2 .
For t ̸= s,
cov(et , es ) = 0,
because the et ’s are independent. Thus, the autocovariance function of {et } is
\[
\gamma_{t,s} =
\begin{cases}
\sigma_e^2, & |t - s| = 0 \\
0, & |t - s| \ne 0.
\end{cases}
\]
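AUTOCORRELATION FUNCTION : For t = s,
\[
\rho_{t,s} = \operatorname{corr}(e_t, e_s) = \operatorname{corr}(e_t, e_t) = \frac{\gamma_{t,t}}{\sqrt{\gamma_{t,t}\,\gamma_{t,t}}} = 1.
\]
For t ≠ s,
\[
\rho_{t,s} = \operatorname{corr}(e_t, e_s) = \frac{\gamma_{t,s}}{\sqrt{\gamma_{t,t}\,\gamma_{s,s}}} = 0.
\]
Thus, the autocorrelation function is
\[
\rho_{t,s} =
\begin{cases}
1, & |t - s| = 0 \\
0, & |t - s| \ne 0.
\end{cases}
\]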
REMARK : A white noise process, by itself, is rather uninteresting for modeling real
data. However, white noise processes still play a crucial role in the analysis of time series
data! Time series processes {Yt } generally contain two different types of variation:
• systematic variation (that we would like to capture and model; e.g., trends, seasonal components, etc.)
• random variation (that is just inherent background noise in the process).
Our goal as data analysts is to extract the systematic part of the variation in the data (and
incorporate this into our model). If we do an adequate job of extracting the systematic
part, then the only part “left over” should be random variation, which can be modeled
as white noise.
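To see what "white noise" looks like numerically, the sketch below simulates a normal white noise series in Python and computes its sample autocorrelations at a few lags. This is an illustration only: the seed, the series length, and the helper name `sample_autocorrelation` are assumptions, and the estimator used is the usual sample autocorrelation (a quantity the notes take up later).

```python
import random


def sample_autocorrelation(x, k):
    """Sample lag-k autocorrelation of the sequence x."""
    n = len(x)
    mean = sum(x) / n
    denom = sum((v - mean) ** 2 for v in x)
    numer = sum((x[t] - mean) * (x[t - k] - mean) for t in range(k, n))
    return numer / denom


rng = random.Random(520)          # arbitrary seed, for reproducibility only
n = 5000
e = [rng.gauss(0.0, 1.0) for _ in range(n)]   # e_t ~ iid N(0, 1)

# Lag 0 is exactly 1; lags 1, 2, 3 should be close to 0 for white noise.
r = [sample_autocorrelation(e, k) for k in range(4)]
print(r)
```

The lag 0 value is 1 by construction, while the other lags hover near zero, mirroring the theoretical autocorrelation function of a white noise process.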
Example 2.2. Suppose that {et } is a zero mean white noise process with var(et ) = σe2 .
Define
\[
\begin{aligned}
Y_1 &= e_1 \\
Y_2 &= e_1 + e_2 \\
    &\;\;\vdots \\
Y_n &= e_1 + e_2 + \cdots + e_n .
\end{aligned}
\]
By this definition, note that we can write, for t > 1,
Yt = Yt−1 + et ,
where E(et ) = 0 and var(et ) = σe2 . The process {Yt } is called a random walk process.
Random walk processes are used to model stock prices, movements of molecules in gases
and liquids, animal locations, etc.
MEAN FUNCTION : The mean of Yt is
µt = E(Yt )
= E(e1 + e2 + · · · + et )
= E(e1 ) + E(e2 ) + · · · + E(et ) = 0.
That is, {Yt } is a zero mean process.
VARIANCE FUNCTION : The variance of Yt is
var(Yt ) = var(e1 + e2 + · · · + et )
= var(e1 ) + var(e2 ) + · · · + var(et ) = tσe2 ,
because var(e1 ) = var(e2 ) = · · · = var(et ) = σe2 and cov(et , es ) = 0 for all t ̸= s.
AUTOCOVARIANCE FUNCTION : For t ≤ s, the autocovariance of Yt and Ys is
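\[
\begin{aligned}
\gamma_{t,s} = \operatorname{cov}(Y_t, Y_s)
  &= \operatorname{cov}(e_1 + e_2 + \cdots + e_t,\; e_1 + e_2 + \cdots + e_t + e_{t+1} + \cdots + e_s) \\
  &= \operatorname{cov}(e_1 + e_2 + \cdots + e_t,\; e_1 + e_2 + \cdots + e_t)
     + \operatorname{cov}(e_1 + e_2 + \cdots + e_t,\; e_{t+1} + \cdots + e_s) \\
  &= \sum_{i=1}^{t} \operatorname{cov}(e_i, e_i) + \mathop{\sum\sum}_{1 \le i \ne j \le t} \operatorname{cov}(e_i, e_j) \\
  &= \sum_{i=1}^{t} \operatorname{var}(e_i) = \sigma_e^2 + \sigma_e^2 + \cdots + \sigma_e^2 = t\sigma_e^2 .
\end{aligned}
\]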
Because γt,s = γs,t , the autocovariance function for a random walk process is
γt,s = tσe2 , for 1 ≤ t ≤ s.
[Plot omitted: simulated random walk process versus time.]
Figure 2.2: A simulated random walk process Yt = Yt−1 + et , where et ∼ iid N (0, σe2 ),
n = 150, and σe2 = 1. This process has been constructed from the simulated white noise
process {et } in Figure 2.1.
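AUTOCORRELATION FUNCTION : For 1 ≤ t ≤ s, the autocorrelation function for a random walk process is
\[
\rho_{t,s} = \operatorname{corr}(Y_t, Y_s) = \frac{\gamma_{t,s}}{\sqrt{\gamma_{t,t}\,\gamma_{s,s}}} = \frac{t\sigma_e^2}{\sqrt{t\sigma_e^2 \, s\sigma_e^2}} = \sqrt{\frac{t}{s}} .
\]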
• Note that when t is closer to s, the autocorrelation ρt,s is closer to 1. That is,
two observations Yt and Ys close together in time are likely to be close together,
especially when t and s are both large (later on in the series).
• On the other hand, when t is far away from s (that is, for two points Yt and Ys far
apart in time), the autocorrelation is closer to 0.
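The random walk results var(Yt) = tσ², cov(Yt, Ys) = tσ² and corr(Yt, Ys) = √(t/s) for t ≤ s can be illustrated by simulation. The Python sketch below is a rough Monte Carlo check, not a proof; the seed, the number of replications, and the choice t = 10, s = 40 with σ² = 1 are assumptions made for the example.

```python
import random


def simulate_random_walk(n, sigma, rng):
    """Return Y_1, ..., Y_n with Y_t = e_1 + ... + e_t, e_t ~ iid N(0, sigma^2)."""
    y, total = [], 0.0
    for _ in range(n):
        total += rng.gauss(0.0, sigma)
        y.append(total)
    return y


def sample_cov(u, v):
    """Sample covariance of two equal-length lists."""
    m = len(u)
    ubar, vbar = sum(u) / m, sum(v) / m
    return sum((a - ubar) * (b - vbar) for a, b in zip(u, v)) / (m - 1)


rng = random.Random(2013)         # arbitrary seed
t, s, reps = 10, 40, 4000
walks = [simulate_random_walk(s, 1.0, rng) for _ in range(reps)]
y_t = [w[t - 1] for w in walks]   # value of each walk at time t
y_s = [w[s - 1] for w in walks]   # value of each walk at time s

var_t = sample_cov(y_t, y_t)                              # theory: t = 10
cov_ts = sample_cov(y_t, y_s)                             # theory: t = 10
corr_ts = cov_ts / (var_t * sample_cov(y_s, y_s)) ** 0.5  # theory: sqrt(10/40) = 0.5
print(var_t, cov_ts, corr_ts)
```

The estimates vary from run to run (and with the seed), but with a few thousand replications they should land near the theoretical values.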
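Example 2.3. Suppose that {et } is a zero mean white noise process with var(et ) = $\sigma_e^2$. Define
\[
Y_t = \frac{1}{3}(e_t + e_{t-1} + e_{t-2}),
\]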
that is, Yt is a running (or moving) average of the white noise process (averaged across
the most recent 3 time periods). Note that this example is slightly different than that
on pp 14-15 (CC).
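MEAN FUNCTION : The mean of Yt is
\[
\begin{aligned}
\mu_t = E(Y_t) &= E\left[ \frac{1}{3}(e_t + e_{t-1} + e_{t-2}) \right] \\
  &= \frac{1}{3}\left[ E(e_t) + E(e_{t-1}) + E(e_{t-2}) \right] = 0,
\end{aligned}
\]
because {et } is a zero mean process. That is, {Yt } is a zero mean process.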
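VARIANCE FUNCTION : The variance of Yt is
\[
\begin{aligned}
\operatorname{var}(Y_t) &= \operatorname{var}\left[ \frac{1}{3}(e_t + e_{t-1} + e_{t-2}) \right] \\
  &= \frac{1}{9}\operatorname{var}(e_t + e_{t-1} + e_{t-2}) \\
  &= \frac{1}{9}\left[ \operatorname{var}(e_t) + \operatorname{var}(e_{t-1}) + \operatorname{var}(e_{t-2}) \right] = \frac{3\sigma_e^2}{9} = \frac{\sigma_e^2}{3},
\end{aligned}
\]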
because var(et ) = σe2 for all t and because et , et−1 , and et−2 are independent (all covariance
terms are zero).
AUTOCOVARIANCE FUNCTION : We need to consider different cases.
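Case 1: If s = t, then
\[
\gamma_{t,s} = \gamma_{t,t} = \operatorname{cov}(Y_t, Y_t) = \operatorname{var}(Y_t) = \frac{\sigma_e^2}{3}.
\]
Case 2: If s = t + 1, then
\[
\begin{aligned}
\gamma_{t,s} = \gamma_{t,t+1} &= \operatorname{cov}(Y_t, Y_{t+1}) \\
  &= \operatorname{cov}\left[ \frac{1}{3}(e_t + e_{t-1} + e_{t-2}),\; \frac{1}{3}(e_{t+1} + e_t + e_{t-1}) \right] \\
  &= \frac{1}{9}\left[ \operatorname{cov}(e_t, e_t) + \operatorname{cov}(e_{t-1}, e_{t-1}) \right] \\
  &= \frac{1}{9}\left[ \operatorname{var}(e_t) + \operatorname{var}(e_{t-1}) \right] = \frac{2\sigma_e^2}{9}.
\end{aligned}
\]
Case 3: If s = t + 2, then
\[
\begin{aligned}
\gamma_{t,s} = \gamma_{t,t+2} &= \operatorname{cov}(Y_t, Y_{t+2}) \\
  &= \operatorname{cov}\left[ \frac{1}{3}(e_t + e_{t-1} + e_{t-2}),\; \frac{1}{3}(e_{t+2} + e_{t+1} + e_t) \right] \\
  &= \frac{1}{9}\operatorname{cov}(e_t, e_t) \\
  &= \frac{1}{9}\operatorname{var}(e_t) = \frac{\sigma_e^2}{9}.
\end{aligned}
\]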
Case 4: If s > t + 2, then γt,s = 0 because Yt and Ys will have no common white noise
error terms.
Because γt,s = γs,t , the autocovariance function can be written as
\[
\gamma_{t,s} =
\begin{cases}
\sigma_e^2/3, & |t - s| = 0 \\
2\sigma_e^2/9, & |t - s| = 1 \\
\sigma_e^2/9, & |t - s| = 2 \\
0, & |t - s| > 2.
\end{cases}
\]
AUTOCORRELATION FUNCTION : Recall that the autocorrelation function is
\[
\rho_{t,s} = \operatorname{corr}(Y_t, Y_s) = \frac{\gamma_{t,s}}{\sqrt{\gamma_{t,t}\,\gamma_{s,s}}}.
\]
Because $\gamma_{t,t} = \gamma_{s,s} = \sigma_e^2/3$, the autocorrelation function for this process is
\[
\rho_{t,s} =
\begin{cases}
1, & |t - s| = 0 \\
2/3, & |t - s| = 1 \\
1/3, & |t - s| = 2 \\
0, & |t - s| > 2.
\end{cases}
\]
• Observations Yt and Ys that are 1 unit apart in time have the same autocorrelation
regardless of the values of t and s.
• Observations Yt and Ys that are 2 units apart in time have the same autocorrelation
regardless of the values of t and s.
• Observations Yt and Ys that are more than 2 units apart in time are uncorrelated.
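The four cases above all follow one pattern: only white noise terms shared by Yt and Ys contribute to the covariance. The Python sketch below encodes that pattern with exact fractions, taking σ² = 1; the function name `ma_autocovariance` and the general "weights" formulation are illustrative assumptions rather than notation from the notes.

```python
from fractions import Fraction


def ma_autocovariance(weights, k, sigma2=1):
    """cov(Y_t, Y_{t+k}) for Y_t = sum_j weights[j] * e_{t-j}, e_t white noise.

    The term e_{t-j} appears in Y_{t+k} with weight weights[j + k], so the
    covariance is sigma2 * sum_j weights[j] * weights[j + k].
    """
    k = abs(k)
    return sigma2 * sum(weights[j] * weights[j + k]
                        for j in range(len(weights) - k))


weights = [Fraction(1, 3)] * 3    # Y_t = (e_t + e_{t-1} + e_{t-2}) / 3
gamma = [ma_autocovariance(weights, k) for k in range(5)]
rho = [g / gamma[0] for g in gamma]
print(gamma)   # lags 0..4, in units of sigma_e^2
print(rho)
```

With equal weights of 1/3 this reproduces γ = 1/3, 2/9, 1/9, 0 (times σ²) and ρ = 1, 2/3, 1/3, 0, and none of the values depend on t.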
[Plot omitted: simulated moving average process versus time.]
Figure 2.3: A simulated moving average process $Y_t = \frac{1}{3}(e_t + e_{t-1} + e_{t-2})$, where $e_t \sim$ iid $N(0, \sigma_e^2)$, n = 150, and $\sigma_e^2 = 1$. This process has been constructed from the simulated white noise process {et } in Figure 2.1.
Example 2.4. Consider the stochastic process defined by
Yt = 0.75Yt−1 + et ,
that is, Yt is directly related to the (downweighted) previous value of the process Yt−1
and the random error et (a “shock” or “innovation” that occurs at time t). This is called
an autoregressive model. Autoregression means “regression on itself.” Essentially, we
can envision “regressing” Yt on Yt−1 .
NOTE : We will postpone mean, variance, autocovariance, and autocorrelation calculations for this process until Chapter 4 when we discuss autoregressive models in more detail. A simulated realization of this process appears in Figure 2.4.
[Plot omitted: simulated autoregressive process versus time.]
Figure 2.4: A simulated autoregressive process Yt = 0.75Yt−1 +et , where et ∼ iid N (0, σe2 ),
n = 150, and σe2 = 1.
Example 2.5. Many time series exhibit seasonal patterns that correspond to different
weeks, months, years, etc. One way to describe seasonal patterns is to use models with
deterministic parts which are trigonometric in nature. Suppose that {et } is a zero mean
white noise process with var(et ) = σe2 . Consider the process defined by
Yt = a sin(2πωt + ϕ) + et .
In this model, a is the amplitude, ω is the frequency of oscillation, and ϕ controls the
phase shift. With a = 2, ω = 1/52 (one cycle/52 time points), and ϕ = 0.6π, note that
E(Yt ) = 2 sin(2πt/52 + 0.6π),
since E(et ) = 0. Also, var(Yt ) = var(et ) = $\sigma_e^2$. The mean function and three realizations of this process (one each for $\sigma_e^2 = 1$, $\sigma_e^2 = 4$, and $\sigma_e^2 = 16$) are depicted in Figure 2.5.
[Plot omitted: four panels, each plotted against time.]
Figure 2.5: Sinusoidal model illustration. Top left: E(Yt ) = 2 sin(2πt/52 + 0.6π). The
other plots are simulated realizations of this process with σe2 = 1 (top right), σe2 = 4
(bottom left), and σe2 = 16 (bottom right). In each simulated realization, n = 156.
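A realization of the sinusoidal model in Example 2.5 can be generated along the following lines. This Python sketch is an illustration under stated assumptions: the seed and the helper names are invented here, the errors are taken to be normal, and time is indexed t = 1, ..., 156.

```python
import math
import random


def mean_function(t, a=2.0, omega=1 / 52, phi=0.6 * math.pi):
    """E(Y_t) = a * sin(2*pi*omega*t + phi) for the sinusoidal model."""
    return a * math.sin(2 * math.pi * omega * t + phi)


def simulate_sinusoid(n, sigma2, rng):
    """Y_t = E(Y_t) + e_t for t = 1, ..., n, with e_t ~ iid N(0, sigma2)."""
    sd = math.sqrt(sigma2)
    return [mean_function(t) + rng.gauss(0.0, sd) for t in range(1, n + 1)]


rng = random.Random(1)            # arbitrary seed
# One realization for each error variance used in Figure 2.5.
realizations = {s2: simulate_sinusoid(156, s2, rng) for s2 in (1, 4, 16)}

# The mean function repeats every 52 time points (omega = 1/52).
print(mean_function(5), mean_function(57))
```

As the error variance grows from 1 to 16, the same mean function is increasingly buried in noise, which is the point the four panels of Figure 2.5 make visually.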
2.5 Stationarity
NOTE : Stationarity is a very important concept in the analysis of time series data.
Broadly speaking, a time series is said to be stationary if there is no systematic change
in mean (no trend), if there is no systematic change in variance, and if strictly periodic
variations have been removed. In other words, the properties of one section of the data
are much like those of any other section.
IMPORTANCE : Much of the theory of time series is concerned with stationary time
series. For this reason, time series analysis often requires one to transform a nonstationary
time series into a stationary one to use this theory. For example, it may be of interest
to remove the trend and seasonal variation from a set of data and then try to model
the variation in the residuals (the pieces “left over” after this removal) by means of a
stationary stochastic process.
STATIONARITY : The stochastic process {Yt : t = 0, 1, 2, ..., n} is said to be strictly stationary if the joint distribution of
\[
Y_{t_1}, Y_{t_2}, \ldots, Y_{t_n}
\]
is the same as
\[
Y_{t_1 - k}, Y_{t_2 - k}, \ldots, Y_{t_n - k}
\]
for all time points t1 , t2 , ..., tn and for all time lags k. In other words, shifting the time
origin by an amount k has no effect on the joint distributions, which must therefore
depend only on the intervals between t1 , t2 , ..., tn . This is a very strong condition.
IMPLICATION : Since the above condition holds for all sets of time points t1 , t2 , ..., tn ,
it must hold when n = 1; i.e., there is only one time point.
• This implies Yt and Yt−k have the same marginal distribution for all t and k.
• Because these marginal distributions are the same,
E(Yt ) = E(Yt−k )
var(Yt ) = var(Yt−k ),
for all t and k.
• Therefore, for a strictly stationary process, both µt = E(Yt ) and γt,t = var(Yt ) are
constant over time.
ADDITIONAL IMPLICATION : Since the above condition holds for all sets of time
points t1 , t2 , ..., tn , it must hold when n = 2; i.e., there are only two time points.
• This implies (Yt , Ys ) and (Yt−k , Ys−k ) have the same joint distribution for all t,
s, and k.
• Because these joint distributions are the same,
cov(Yt , Ys ) = cov(Yt−k , Ys−k ),
for all t, s, and k.
• Therefore, for a strictly stationary process, for k = s,
γt,s = cov(Yt , Ys ) = cov(Yt−s , Y0 ) = cov(Y0 , Yt−s ).
But, also, for k = t, we have
cov(Yt , Ys ) = cov(Y0 , Ys−t ).
Putting the last two results together, we have
γt,s = cov(Yt , Ys ) = cov(Y0 , Y|t−s| ) = γ0,|t−s| .
This means that the covariance between Yt and Ys does not depend on the actual
values of t and s; it only depends on the time difference |t − s|.
NEW NOTATION : For a (strictly) stationary process, the covariance γt,s depends only on the time difference |t − s|. The quantity |t − s| is the distance between the time points t and s. In other words, the covariance between Yt and any observation k = |t − s| time points from it depends only on the lag k. Therefore, we write
γk = cov(Yt , Yt−k )
ρk = corr(Yt , Yt−k ).
We use this simpler notation only when we refer to a process which is stationary. Note
that by taking k = 0, we have
γ0 = cov(Yt , Yt ) = var(Yt ).
Also,
\[
\rho_k = \operatorname{corr}(Y_t, Y_{t-k}) = \frac{\gamma_k}{\gamma_0}.
\]
SUMMARY : For a process which is (strictly) stationary,
1. The mean function µt = E(Yt ) is constant throughout time; i.e., µt is free of t.
2. The covariance between any two observations depends only on the time lag between them; i.e., γt,t−k depends only on k (not on t).
REMARK : Strict stationarity is a condition that is much too restrictive for most applications. Moreover, it is difficult to assess the validity of this assumption in practice. Rather than impose conditions on all possible (marginal and joint) distributions of a process, we will use a milder form of stationarity that only deals with the first two moments.
DEFINITION : The stochastic process {Yt : t = 0, 1, 2, ..., n} is said to be weakly stationary (or second-order stationary) if
1. The mean function µt = E(Yt ) is constant throughout time; i.e., µt is free of t.
2. The covariance between any two observations depends only on the time lag between them; i.e., γt,t−k depends only on k (not on t).
Nothing is assumed about the collection of joint distributions of the process. Instead, we
only are specifying the characteristics of the first two moments of the process.
REALIZATION : Clearly, strict stationarity implies weak stationarity. It is also clear that
the converse of this statement is not true, in general. However, if we append the additional
assumption of multivariate normality (for the Yt process), then the two definitions do
coincide; that is,
weak stationarity + multivariate normality =⇒ strict stationarity.
CONVENTION : For the purpose of modeling time series data in this course, we will
rarely (if ever) make the distinction between strict stationarity and weak stationarity.
When we use the term “stationary process,” this is understood to mean that the process
is weakly stationary.
EXAMPLES : We now reexamine the time series models introduced in the last section.
• Suppose that {et } is a white noise process. That is, {et } consists of iid random variables with E(et ) = µe and var(et ) = $\sigma_e^2$, both constant (free of t). In addition, the autocovariance function γk = cov(et , et−k ) is given by
\[
\gamma_k =
\begin{cases}
\sigma_e^2, & k = 0 \\
0, & k \ne 0,
\end{cases}
\]
which is free of time t (i.e., γk depends only on k). Thus, a white noise process is
stationary.
• Suppose that {Yt } is a random walk process. That is,
Yt = Yt−1 + et ,
where {et } is white noise with E(et ) = 0 and var(et ) = σe2 . We calculated µt =
E(Yt ) = 0, for all t, which is free of t. However,
cov(Yt , Yt−k ) = cov(Yt−k , Yt ) = (t − k)σe2 ,
which clearly depends on time t. Thus, a random walk process is not stationary.
• Suppose that {Yt } is a moving average process given by
\[
Y_t = \frac{1}{3}(e_t + e_{t-1} + e_{t-2}),
\]
where {et } is zero mean white noise with var(et ) = $\sigma_e^2$. We calculated µt = E(Yt ) = 0 (which is free of t) and γk = cov(Yt , Yt−k ) to be
\[
\gamma_k =
\begin{cases}
\sigma_e^2/3, & k = 0 \\
2\sigma_e^2/9, & k = 1 \\
\sigma_e^2/9, & k = 2 \\
0, & k > 2.
\end{cases}
\]
Because cov(Yt , Yt−k ) is free of time t, this moving average process is stationary.
• Suppose that {Yt } is the autoregressive process
Yt = 0.75Yt−1 + et ,
where {et } is zero mean white noise with var(et ) = σe2 . We avoided the calculation
of µt = E(Yt ) and cov(Yt , Yt−k ) for this process, so we will not make a definite
determination here. However, it turns out that if et is independent of Yt−1 , Yt−2 , ...,
and if σe2 > 0, then this autoregressive process is stationary (details coming later).
• Suppose that {Yt } is the sinusoidal process defined by
Yt = a sin(2πωt + ϕ) + et ,
where {et } is zero mean white noise with var(et ) = σe2 . Clearly µt = E(Yt ) =
a sin(2πωt + ϕ) is not free of t, so this sinusoidal process is not stationary.
• Consider the random cosine wave process
\[
Y_t = \cos\left[ 2\pi\left( \frac{t}{12} + \Phi \right) \right],
\]
where Φ is a uniform random variable from 0 to 1; i.e., Φ ∼ U(0, 1). The calculations
on pp 18-19 (CC) show that this process is (perhaps unexpectedly) stationary.
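The stationarity of the random cosine wave can be checked numerically without reproducing the full calculation. The Python sketch below approximates expectations over Φ ∼ U(0, 1) with a midpoint rule; the helper names and the grid size are assumptions made for this illustration. It shows that the mean is 0 at every t and that the covariance depends only on the lag, agreeing with the closed form (1/2)cos(2πk/12) at lag k.

```python
import math


def y(t, phi):
    """Random cosine wave Y_t = cos(2*pi*(t/12 + phi)) for a given phi."""
    return math.cos(2 * math.pi * (t / 12 + phi))


def expectation_over_phi(g, m=1000):
    """Approximate E[g(Phi)], Phi ~ U(0, 1), by a midpoint rule with m cells."""
    return sum(g((i + 0.5) / m) for i in range(m)) / m


def mean_at(t):
    return expectation_over_phi(lambda p: y(t, p))


def cov_at(t, s):
    cross = expectation_over_phi(lambda p: y(t, p) * y(s, p))
    return cross - mean_at(t) * mean_at(s)


# The mean is (numerically) zero at every t.
print(mean_at(3), mean_at(17))
# Same lag (2), different time points: the covariances agree.
print(cov_at(5, 3), cov_at(20, 18), 0.5 * math.cos(2 * math.pi * 2 / 12))
```

Averaging over the random phase Φ removes the dependence on t from both the mean and the covariance, so the process satisfies both conditions for weak stationarity.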
IMPORTANT : In order to start thinking about viable stationary time series models for
real data, we need to have a stationary process. However, as we have just seen, many
data sets exhibit nonstationary behavior. A simple, but effective, technique to convert a
nonstationary process into a stationary one is to examine data differences.
DEFINITION : Consider the process {Yt : t = 0, 1, 2, ..., n}. The (first) difference
process of {Yt } is defined by
∇Yt = Yt − Yt−1 ,
for t = 1, 2, ..., n. In many situations, a nonstationary process {Yt } can be “transformed” into a stationary process by taking (first) differences. For example, the random walk
Yt = Yt−1 + et , where et ∼ iid N (0, σe2 ), is not stationary. However, the first difference
process ∇Yt = Yt − Yt−1 = et is zero mean white noise, which is stationary!
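Differencing is simple to carry out in code. The Python sketch below builds a random walk from simulated white noise and confirms that first differences recover the white noise terms; the seed, the series length, and the function name `difference` are assumptions for this example.

```python
import random


def difference(y):
    """First differences: returns [y[1] - y[0], y[2] - y[1], ...]."""
    return [y[t] - y[t - 1] for t in range(1, len(y))]


rng = random.Random(7)            # arbitrary seed
e = [rng.gauss(0.0, 1.0) for _ in range(150)]   # white noise e_1, ..., e_150

# Random walk: Y_t = e_1 + ... + e_t.
y, total = [], 0.0
for shock in e:
    total += shock
    y.append(total)

dy = difference(y)                # dy[i] corresponds to e[i + 1]
print(len(dy), max(abs(d - s) for d, s in zip(dy, e[1:])))
```

One observation is lost when differencing (150 values give 149 differences), and the differences match e_2, ..., e_150 up to floating-point rounding.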
3 Modeling Deterministic Trends
3.1 Introduction
DISCUSSION : In this course, we consider time series models for realizations of a stochastic process {Yt : t = 0, 1, ..., n}. This will largely center around models for stationary
processes. However, as we have seen, many time series data sets exhibit a trend; i.e., a
long-term change in the mean level. We know that such series are not stationary because
the mean changes with time.
• An obvious difficulty with the definition of a trend is deciding what is meant by the
phrase “long-term.” For example, climatic processes can display cyclical variation
over a long period of time, say, 1000 years. However, if one has just 40-50 years of
data, this long-term cyclical pattern might be missed and be interpreted as a trend
which is linear.
• Trends can be “elusive,” and an analyst may mistakenly conjecture that a trend
exists when it really does not. For example, in Figure 2.2 (page 33), we have a
realization of a random walk process
Yt = Yt−1 + et ,
where et ∼ iid N (0, 1). There is no trend in the mean of this random walk process.
Recall that µt = E(Yt ) = 0, for all t. However, it would be easy to incorrectly
assert that true downward and upward trends are present.
• On the other hand, it may be hard to detect trends if the data are very noisy. For
example, the lower right plot in Figure 2.5 (page 38) is a noisy realization of a
sinusoidal process considered in the last chapter. It is easy to miss the true cyclical
structure from looking at the plot.
DETERMINISTIC TREND MODELS : In this chapter, we consider models of the form
Yt = µt + Xt ,
where µt is a deterministic function that describes the trend and Xt is random error.
Note that if, in addition, E(Xt ) = 0 for all t (a common assumption), then
E(Yt ) = µt
is the mean function for the process {Yt }. In practice, different deterministic trend
functions could be considered. One popular choice is
µt = β0 + β1 t,
which says that the mean function increases (decreases) linearly with time. The function
\[
\mu_t = \beta_0 + \beta_1 t + \beta_2 t^2
\]
is appropriate if there is a quadratic trend present. More generally, if the deterministic trend can be described by a kth order polynomial in time, we can consider
\[
\mu_t = \beta_0 + \beta_1 t + \beta_2 t^2 + \cdots + \beta_k t^k .
\]
If the deterministic trend is cyclical, we could consider functions of the form
\[
\mu_t = \beta_0 + \sum_{j=1}^{m} (\alpha_j \cos \omega_j t + \beta_j \sin \omega_j t),
\]
where the αj ’s and βj ’s are regression parameters and the ωj ’s are related to frequencies
of the trigonometric functions cos ωj t and sin ωj t. Fitting these and other deterministic
trend models (and even combinations of them) can be accomplished using the method
of least squares, as we will demonstrate later in this chapter.
LOOKING AHEAD: In this course, we want to deal with stationary time series models
for data. Therefore, if there is a deterministic trend present in the process, we want to
remove it. There are two general ways to do this.
1. Estimate the trend and then subtract the estimated trend from the data (perhaps after transforming the data). Specifically, estimate µt with $\hat{\mu}_t$ and then model the residuals
\[
\hat{X}_t = Y_t - \hat{\mu}_t
\]
as a stationary process. We can use regression methods to estimate µt and then implement standard diagnostics on the residuals $\hat{X}_t$ to check for violations of stationarity and other assumptions.
• If the residuals are stationary, we can use a stationary time series model (Chapter 4) to describe their behavior.
• Forecasting takes place by first forecasting the residual process {$\hat{X}_t$} and then inverting the transformations described above to arrive back at forecasts for the original series {Yt }. We will pursue forecasting techniques in Chapter 9.
IMPORTANT : If we assert that a trend exists and we fit a deterministic model

that incorporates it, we are implicitly assuming that the trend lasts “forever.” In
some applications, this might be reasonable, but probably not in most.
2. Another approach, developed extensively by Box and Jenkins, is to apply differencing repeatedly to the series {Yt } until the differenced observations resemble a realization of a stationary time series. We can then use the theory of stationary processes for the modeling, analysis, and prediction of the stationary series and then transform this analysis back in terms of the original series {Yt }. This approach is studied in Chapter 5.
3.2 Estimation of a constant mean
A CONSTANT “TREND”: We first consider the most elementary type of trend, namely,
a constant trend. Specifically, we consider the model
Yt = µ + Xt ,
where µ is constant (free of t) and where E(Xt ) = 0. Note that, under this zero mean
error assumption, we have
E(Yt ) = µ.
That is, the process {Yt } has an overall population mean function µt = µ, for all t. The most common estimate of µ is
\[
\bar{Y} = \frac{1}{n} \sum_{t=1}^{n} Y_t ,
\]
the sample mean. It is easy to check that $\bar{Y}$ is an unbiased estimator of µ; i.e., E($\bar{Y}$) = µ. This is true because
\[
E(\bar{Y}) = E\left( \frac{1}{n} \sum_{t=1}^{n} Y_t \right) = \frac{1}{n} \sum_{t=1}^{n} E(Y_t) = \frac{1}{n} \sum_{t=1}^{n} \mu = \frac{n\mu}{n} = \mu .
\]
Therefore, under the minimal assumption that E(Xt ) = 0, we see that $\bar{Y}$ is an unbiased estimator of µ. To assess the precision of $\bar{Y}$ as an estimator of µ, we examine var($\bar{Y}$).
RESULT : If {Yt } is a stationary process with autocorrelation function ρk , then
\[
\operatorname{var}(\bar{Y}) = \frac{\gamma_0}{n} \left[ 1 + 2 \sum_{k=1}^{n-1} \left( 1 - \frac{k}{n} \right) \rho_k \right],
\]
where var(Yt ) = γ0 .
RECALL: If {Yt } is an iid process, that is, Y1 , Y2 , ..., Yn is an iid (random) sample, then
\[
\operatorname{var}(\bar{Y}) = \frac{\gamma_0}{n}.
\]
Therefore, var($\bar{Y}$), in general, can be larger than or smaller than γ0 /n depending on the values of ρk through
\[
\frac{\gamma_0}{n} \left[ 1 + 2 \sum_{k=1}^{n-1} \left( 1 - \frac{k}{n} \right) \rho_k \right] - \frac{\gamma_0}{n}
  = \frac{2\gamma_0}{n} \sum_{k=1}^{n-1} \left( 1 - \frac{k}{n} \right) \rho_k .
\]
• If this quantity is smaller than zero, then $\bar{Y}$ is a better estimator of µ than $\bar{Y}$ is in an iid sampling context; that is, var($\bar{Y}$) < γ0 /n.
• If this quantity is larger than zero, then $\bar{Y}$ is a worse estimator of µ than $\bar{Y}$ is in an iid sampling context; that is, var($\bar{Y}$) > γ0 /n.
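The variance formula for the sample mean is easy to evaluate for any given autocorrelation function. The Python sketch below is a minimal illustration; the function name `var_sample_mean` and the two lag-1-only autocorrelation functions (with ρ1 = ±0.4) are hypothetical choices made only to show the direction of the effect.

```python
def var_sample_mean(gamma0, rho, n):
    """var(Ybar) = (gamma0 / n) * [1 + 2 * sum_{k=1}^{n-1} (1 - k/n) * rho(k)].

    rho is a function returning the lag-k autocorrelation rho_k.
    """
    correction = sum((1 - k / n) * rho(k) for k in range(1, n))
    return gamma0 / n * (1 + 2 * correction)


def rho_iid(k):
    return 0.0                            # uncorrelated observations


def rho_positive(k):
    return 0.4 if k == 1 else 0.0         # hypothetical: only rho_1 = 0.4


def rho_negative(k):
    return -0.4 if k == 1 else 0.0        # hypothetical: only rho_1 = -0.4


gamma0, n = 1.0, 50
v_iid = var_sample_mean(gamma0, rho_iid, n)        # equals gamma0 / n
v_pos = var_sample_mean(gamma0, rho_positive, n)   # larger than gamma0 / n
v_neg = var_sample_mean(gamma0, rho_negative, n)   # smaller than gamma0 / n
print(v_iid, v_pos, v_neg)
```

Positive autocorrelation inflates the variance of the sample mean relative to the iid case, and negative autocorrelation deflates it, matching the two bullets above.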
Example 3.1. Suppose that {Yt } is a moving average process given by
\[
Y_t = \frac{1}{3}(e_t + e_{t-1} + e_{t-2}),
\]
where {et } is zero mean white noise with var(et ) = $\sigma_e^2$. In the last chapter, we calculated
\[
\gamma_k =
\begin{cases}
\sigma_e^2/3, & k = 0 \\
2\sigma_e^2/9, & k = 1 \\
\sigma_e^2/9, & k = 2 \\
0, & k > 2.
\end{cases}
\]
The lag 1 autocorrelation for this process is
\[
\rho_1 = \frac{\gamma_1}{\gamma_0} = \frac{2\sigma_e^2/9}{\sigma_e^2/3} = 2/3.
\]
The lag 2 autocorrelation for this process is
\[
\rho_2 = \frac{\gamma_2}{\gamma_0} = \frac{\sigma_e^2/9}{\sigma_e^2/3} = 1/3.
\]
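Also, ρk = 0 for all k > 2. Therefore,
\[
\begin{aligned}
\operatorname{var}(\bar{Y}) &= \frac{\gamma_0}{n} \left[ 1 + 2 \sum_{k=1}^{n-1} \left( 1 - \frac{k}{n} \right) \rho_k \right] \\
  &= \frac{\gamma_0}{n} + \frac{4(n-1)\gamma_0 + 2(n-2)\gamma_0}{3n^2} > \frac{\gamma_0}{n}.
\end{aligned}
\]
Therefore, we lose efficiency in estimating µ with $\bar{Y}$ when compared to using $\bar{Y}$ in an iid sampling context. The positive autocorrelations make estimation of µ less precise.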
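Example 3.2. Suppose that {Yt } is a stationary process with autocorrelation function $\rho_k = \phi^k$, where −1 < ϕ < 1. For this process, the autocorrelation decays exponentially in magnitude as the lag k increases. As we will see in Chapter 4, the autoregressive process of order 1, AR(1), possesses this autocorrelation function. To examine the effect of estimating µ with $\bar{Y}$ in this situation, we use an approximation for var($\bar{Y}$) for large n, specifically,
\[
\operatorname{var}(\bar{Y}) = \frac{\gamma_0}{n} \left[ 1 + 2 \sum_{k=1}^{n-1} \left( 1 - \frac{k}{n} \right) \rho_k \right]
  \approx \frac{\gamma_0}{n} \left( 1 + 2 \sum_{k=1}^{\infty} \rho_k \right),
\]
where we have taken (1 − k/n) ≈ 1 for n large. Therefore, with $\rho_k = \phi^k$, we have
\[
\begin{aligned}
\operatorname{var}(\bar{Y}) &\approx \frac{\gamma_0}{n} \left( 1 + 2 \sum_{k=1}^{\infty} \rho_k \right) \\
  &= \frac{\gamma_0}{n} \left[ 1 + 2 \left( \sum_{k=0}^{\infty} \phi^k - 1 \right) \right] \\
  &= \frac{\gamma_0}{n} \left[ 1 + 2 \left( \frac{1}{1 - \phi} - 1 \right) \right] = \left( \frac{1 + \phi}{1 - \phi} \right) \frac{\gamma_0}{n}.
\end{aligned}
\]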
For example, if ϕ = −0.6, then
\[
\operatorname{var}(\bar{Y}) \approx 0.25 \left( \frac{\gamma_0}{n} \right).
\]
Using $\bar{Y}$ produces a more precise estimate of µ than in an iid (random) sampling context. The negative autocorrelations $\rho_1 = -0.6$, $\rho_3 = (-0.6)^3$, etc., “outweigh” the positive ones $\rho_2 = (-0.6)^2$, $\rho_4 = (-0.6)^4$, etc., making var($\bar{Y}$) smaller than γ0 /n.
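To get a feel for how good the large-n approximation in Example 3.2 is, the Python sketch below compares the exact multiplier of γ0/n with the approximate multiplier (1 + ϕ)/(1 − ϕ). The function names and the sample sizes are assumptions for this illustration.

```python
def exact_factor(phi, n):
    """1 + 2 * sum_{k=1}^{n-1} (1 - k/n) * phi**k, the exact multiplier of gamma0/n."""
    return 1 + 2 * sum((1 - k / n) * phi ** k for k in range(1, n))


def approx_factor(phi):
    """Large-n approximation (1 + phi) / (1 - phi), valid for |phi| < 1."""
    return (1 + phi) / (1 - phi)


phi = -0.6
for n in (10, 100, 1000):
    # The exact multiplier approaches the approximation as n grows.
    print(n, exact_factor(phi, n), approx_factor(phi))
```

With ϕ = −0.6 the approximate multiplier is 0.25, and the exact multiplier moves toward it as n increases; with a positive ϕ the multiplier exceeds 1, so the sample mean is less precise than under iid sampling.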
Example 3.3. In Examples 3.1 and 3.2, we considered stationary processes in examining the precision of $\bar{Y}$ as an estimator for µ. In this example, we have the same goal, but we consider the random walk process Yt = Yt−1 + et , where {et } is a zero mean white noise process with var(et ) = $\sigma_e^2$. As we already know, this process is not stationary, so we cannot use the var($\bar{Y}$) formula presented earlier. However, recall that this process can be written out as
\[
Y_1 = e_1, \quad Y_2 = e_1 + e_2, \quad \ldots, \quad Y_n = e_1 + e_2 + \cdots + e_n ,
\]
so that
\[
\bar{Y} = \frac{1}{n} \sum_{t=1}^{n} Y_t = \frac{1}{n} \left[ n e_1 + (n-1) e_2 + (n-2) e_3 + \cdots + 2 e_{n-1} + 1 e_n \right].
\]
Therefore, we can derive an expression for var($\bar{Y}$) directly:
\[
\begin{aligned}
\operatorname{var}(\bar{Y}) &= \frac{1}{n^2} \left[ n^2 \operatorname{var}(e_1) + (n-1)^2 \operatorname{var}(e_2) + \cdots + 2^2 \operatorname{var}(e_{n-1}) + 1^2 \operatorname{var}(e_n) \right] \\
  &= \frac{\sigma_e^2}{n^2} \left[ 1^2 + 2^2 + \cdots + (n-1)^2 + n^2 \right] \\
  &= \frac{\sigma_e^2}{n^2} \left[ \frac{n(n+1)(2n+1)}{6} \right] = \frac{\sigma_e^2}{n} \left[ \frac{(n+1)(2n+1)}{6} \right].
\end{aligned}
\]
• This result is surprising! Note that as n increases, so does var($\bar{Y}$). That is, averaging a larger sample produces a worse (i.e., more variable) estimate of µ than averaging a smaller one!!
• This is quite different than the results obtained for stationary processes. The
nonstationarity in the data causes very bad things to happen, even in the relatively
simple task of estimating an overall process mean.
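The growth of var(Ȳ) with n for the random walk can be confirmed exactly. In the Python sketch below (function names are assumptions, and σ² is set to 1), the variance is computed directly from the weights that each white noise term receives in the sample mean and then compared with the closed form derived in Example 3.3.

```python
from fractions import Fraction


def var_mean_random_walk(n, sigma2=1):
    """var(Ybar) for a random walk, from the weights (n - j + 1)/n on e_j."""
    return sigma2 * sum(Fraction(n - j + 1, n) ** 2 for j in range(1, n + 1))


def closed_form(n, sigma2=1):
    """(sigma2 / n) * (n + 1) * (2n + 1) / 6."""
    return sigma2 * Fraction((n + 1) * (2 * n + 1), 6 * n)


for n in (1, 5, 50, 500):
    # The two columns agree exactly, and both increase with n.
    print(n, var_mean_random_walk(n), closed_form(n))
```

Exact fractions are used so that the agreement between the direct sum and the closed form is not blurred by floating-point rounding.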
RESULT : Suppose that Yt = µ + Xt , where µ is constant, Xt ∼ N (0, γ0 ), and {Xt } is a stationary process. Under these assumptions, Yt ∼ N (µ, γ0 ) and {Yt } is stationary. Therefore,
\[
\bar{Y} \sim N\left\{ \mu,\; \frac{\gamma_0}{n} \left[ 1 + 2 \sum_{k=1}^{n-1} \left( 1 - \frac{k}{n} \right) \rho_k \right] \right\}
\]
so that
\[
Z = \frac{\bar{Y} - \mu}{\sqrt{\dfrac{\gamma_0}{n} \left[ 1 + 2 \sum_{k=1}^{n-1} \left( 1 - \dfrac{k}{n} \right) \rho_k \right]}} \sim N(0, 1).
\]
Since the sampling distribution of Z does not depend on any unknown parameters, we say that Z is a pivotal quantity (or, more simply, a pivot). If γ0 and the ρk ’s are known, then a 100(1 − α) percent confidence interval for µ is
\[
\bar{Y} \pm z_{\alpha/2} \sqrt{\frac{\gamma_0}{n} \left[ 1 + 2 \sum_{k=1}^{n-1} \left( 1 - \frac{k}{n} \right) \rho_k \right]},
\]
where $z_{\alpha/2}$ is the upper α/2 quantile from the standard normal distribution.
REMARK : Note that if ρk = 0, for all k, then $\bar{Y}$ ∼ N (µ, γ0 /n), and the confidence interval formula just presented reduces to
\[
\bar{Y} \pm z_{\alpha/2} \sqrt{\frac{\gamma_0}{n}},
\]
which we recognize as the confidence interval for µ when random sampling is used. The autocorrelations ρk affect the confidence interval in the same way that they affect var($\bar{Y}$). That is, more negative autocorrelations ρk will make the standard error
\[
\operatorname{se}(\bar{Y}) = \sqrt{\frac{\gamma_0}{n} \left[ 1 + 2 \sum_{k=1}^{n-1} \left( 1 - \frac{k}{n} \right) \rho_k \right]}
\]
smaller, which will make the confidence interval more precise (i.e., shorter). On the other hand, positive autocorrelations will make this quantity larger, thereby lengthening the interval, making it less informative.
REMARK : Of course, in real life, rarely will anyone tell us the values of γ0 and the ρk ’s.
These are model (population) parameters. However, if the sample size n is large and
“good” (large-sample) estimates of these quantities can be calculated, we would expect
this interval to be approximately valid when the estimates are substituted in for the true
values. We will talk about estimation of γ0 and the autocorrelations later.
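When γ0 and the ρk's are treated as known, the confidence interval for µ can be computed as in the Python sketch below. All numerical inputs (the sample mean 10, γ0 = 4, n = 100, and the autocorrelation function 0.5^k) are hypothetical values chosen for illustration, and the function name is an assumption; `statistics.NormalDist` from the Python standard library supplies the normal quantile.

```python
import math
from statistics import NormalDist


def mean_confidence_interval(ybar, gamma0, rho, n, alpha=0.05):
    """100(1 - alpha)% interval for mu with known gamma0 and rho_k.

    rho is a function returning the lag-k autocorrelation rho_k.
    """
    factor = 1 + 2 * sum((1 - k / n) * rho(k) for k in range(1, n))
    se = math.sqrt(gamma0 / n * factor)          # standard error of Ybar
    z = NormalDist().inv_cdf(1 - alpha / 2)      # upper alpha/2 quantile
    return ybar - z * se, ybar + z * se


ybar, gamma0, n = 10.0, 4.0, 100                 # hypothetical inputs

lo_iid, hi_iid = mean_confidence_interval(ybar, gamma0, lambda k: 0.0, n)
lo_pos, hi_pos = mean_confidence_interval(ybar, gamma0, lambda k: 0.5 ** k, n)

half_iid = (hi_iid - lo_iid) / 2   # z * sqrt(gamma0 / n) under independence
half_pos = (hi_pos - lo_pos) / 2   # wider, because every rho_k is positive
print(half_iid, half_pos)
```

With all ρk = 0 the interval reduces to the familiar random-sampling interval, while positive autocorrelations lengthen it.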
3.3 Regression methods
3.3.1 Straight line regression
STRAIGHT LINE MODEL: We now consider the deterministic time trend model
Yt = µt + Xt
= β0 + β1 t + Xt ,
where µt = β0 + β1 t and where E(Xt ) = 0. We are considering a simple linear regression model for the process {Yt }, where time t is the predictor. By “fitting this model,” we mean that we would like to estimate the regression parameters β0 and β1 (the intercept and slope, respectively) using the observed data Y1 , Y2 , ..., Yn . The Xt ’s are random errors and are not observed.
LEAST SQUARES ESTIMATION : To estimate β0 and β1 , we will use the method of least squares. Specifically, we find the values of β0 and β1 that minimize the objective function
\[
\begin{aligned}
Q(\beta_0, \beta_1) &= \sum_{t=1}^{n} (Y_t - \mu_t)^2 \\
  &= \sum_{t=1}^{n} \left[ Y_t - (\beta_0 + \beta_1 t) \right]^2 = \sum_{t=1}^{n} (Y_t - \beta_0 - \beta_1 t)^2 .
\end{aligned}
\]
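This can be done using a multivariable calculus argument. Specifically, the partial derivatives of Q(β0 , β1 ) are given by
\[
\begin{aligned}
\frac{\partial Q(\beta_0, \beta_1)}{\partial \beta_0} &= -2 \sum_{t=1}^{n} (Y_t - \beta_0 - \beta_1 t) \\
\frac{\partial Q(\beta_0, \beta_1)}{\partial \beta_1} &= -2 \sum_{t=1}^{n} t (Y_t - \beta_0 - \beta_1 t).
\end{aligned}
\]
Setting these derivatives equal to zero and jointly solving for β0 and β1 , we get
\[
\begin{aligned}
\hat{\beta}_0 &= \bar{Y} - \hat{\beta}_1 \bar{t} \\
\hat{\beta}_1 &= \frac{\sum_{t=1}^{n} (t - \bar{t}) Y_t}{\sum_{t=1}^{n} (t - \bar{t})^2}.
\end{aligned}
\]
These are the least squares estimators of β0 and β1 .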
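The closed-form least squares estimators translate directly into code. The Python sketch below is an illustration only: the function name `fit_straight_line` and the noise-free example data (intercept 3, slope 0.5) are assumptions, chosen so that the fitted values can be checked by eye.

```python
def fit_straight_line(times, y):
    """Least squares estimates (b0, b1) for the model Y_t = b0 + b1 * t + X_t."""
    n = len(y)
    tbar = sum(times) / n                      # average of the time points
    ybar = sum(y) / n                          # sample mean of the series
    sxx = sum((t - tbar) ** 2 for t in times)  # sum of (t - tbar)^2
    b1 = sum((t - tbar) * yt for t, yt in zip(times, y)) / sxx
    b0 = ybar - b1 * tbar
    return b0, b1


# Hypothetical noise-free data lying exactly on the line 3 + 0.5 * t.
times = list(range(1, 11))
y = [3.0 + 0.5 * t for t in times]

b0, b1 = fit_straight_line(times, y)
print(b0, b1)   # should recover the intercept and slope of the line
```

With real data the errors Xt are not zero, so the estimates differ from the true parameters; the formulas themselves are unchanged.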
PROPERTIES : The following results can be established algebraically. Note carefully which statistical assumptions are needed for each result.
• Under just the mild assumption of E(Xt ) = 0, for all t, the least squares estimators are unbiased. That is, E($\hat{\beta}_0$) = β0 and E($\hat{\beta}_1$) = β1 .
• Under the assumptions that E(Xt ) = 0, {Xt } independent, and var(Xt ) = γ0 (a constant, free of t), then
\[
\begin{aligned}
\operatorname{var}(\hat{\beta}_0) &= \gamma_0 \left[ \frac{1}{n} + \frac{\bar{t}^2}{\sum_{t=1}^{n} (t - \bar{t})^2} \right] \\
\operatorname{var}(\hat{\beta}_1) &= \frac{\gamma_0}{\sum_{t=1}^{n} (t - \bar{t})^2}.
\end{aligned}
\]
Note that a zero mean white noise process {Xt } satisfies these assumptions.
• In addition to the assumptions E(Xt ) = 0, {Xt } independent, and var(Xt ) = γ0 , if we also assume that the Xt ’s are normally distributed, then
\[
\hat{\beta}_0 \sim N\left\{ \beta_0,\; \gamma_0 \left[ \frac{1}{n} + \frac{\bar{t}^2}{\sum_{t=1}^{n} (t - \bar{t})^2} \right] \right\}
\]
and
\[
\hat{\beta}_1 \sim N\left[ \beta_1,\; \frac{\gamma_0}{\sum_{t=1}^{n} (t - \bar{t})^2} \right].
\]
[Plot omitted: global temperature deviations (since 1900) versus year.]
Figure 3.1: Global temperature data. The data are a combination of land-air average
temperature anomalies, measured in degrees Centigrade. Time period: 1900-1997.
IMPORTANT : You should recall that these four assumptions on the errors Xt , that
is, zero mean, independence, homoscedasticity, and normality, are the usual as-
sumptions on the errors in a standard regression setting. However, with most time series
data sets, at least one of these assumptions will be violated. The implication, then, is
that standard errors of the estimators, confidence intervals, t tests, probability values,
etc., quantities that are often provided in computing packages (e.g., R, etc.), will not be
meaningful. Proper usage of this output requires the four assumptions mentioned above
to hold. The only instance in which these are exactly true is if {Xt } is a zero-mean
normal white noise process (an assumption you likely made in your previous methods
courses where regression was discussed).
Example 3.4. Consider the global temperature data from Example 1.1 (notes), but let’s
restrict attention to the time period 1900-1997. These data are depicted in Figure 3.1.
[Plot omitted. x-axis: Year.]
Figure 3.2: Global temperature data (1900-1997) with a straight line trend fit.
Over this time period, there is an apparent upward trend in the series. Suppose that we
estimate this trend by fitting the straight line regression model
Yt = β0 + β1 t + Xt ,
for t = 1900, 1901, ..., 1997, where E(Xt ) = 0. Here is the output from fitting this model
in R.
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.219e+01 9.032e-01 -13.49 <2e-16 ***
time(globaltemps.1900) 6.209e-03 4.635e-04 13.40 <2e-16 ***
Residual standard error: 0.1298 on 96 degrees of freedom

Multiple R-squared: 0.6515, Adjusted R-squared: 0.6479
F-statistic: 179.5 on 1 and 96 DF, p-value: < 2.2e-16
[Plot omitted. x-axis: Year; y-axis: Residuals.]
Figure 3.3: Global temperature data (1900-1997). Residuals from the straight line trend
model fit.
ANALYSIS : We interpret the regression coefficient output only. As we have learned,
standard errors, t tests, and probability values may not be meaningful! The least squares
estimates are $\hat{\beta}_0 = -12.19$ and $\hat{\beta}_1 = 0.0062$ so that the fitted regression model is
$$
\hat{Y}_t = -12.19 + 0.0062t.
$$
This is the equation of the line superimposed over the series in Figure 3.2.
RESIDUALS : The residuals from the least squares fit are given by
$$
\hat{X}_t = Y_t - \hat{Y}_t,
$$
that is, the observed data Yt minus the fitted values $\hat{Y}_t$. In
this example (with the straight line model fit), the residuals are given by
$$
\hat{X}_t = Y_t - \hat{Y}_t = Y_t + 12.19 - 0.0062t,
$$
for t = 1900, 1901, ..., 1997. Remember that one of the main reasons for fitting the
straight line model was to capture the linear trend. Now that we have done this, the
residual process defined by
$$
\hat{X}_t = Y_t + 12.19 - 0.0062t
$$
contains information in the data that is not accounted for in the straight line trend
model. For this reason, it is called the detrended series. This series is plotted in
Figure 3.3. Essentially, this is a time series plot of the residuals from the straight line fit
versus time, the predictor variable in the model. This detrended series does appear to
be somewhat stationary, at least much more so than the original series {Yt }. However,
just from looking at the plot, it is a safe bet that the residuals are not white noise.
DIFFERENCING: Instead of fitting the deterministic model to the global temperature

series to remove the linear trend, suppose that we had examined the first difference
process {∇Yt }, where
∇Yt = Yt − Yt−1 .
We have learned that taking differences can be an effective means to remove non-
stationary patterns. Doing so here, as evidenced in Figure 3.4, produces a new process
that does appear to be somewhat stationary.
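To see why first differences remove a linear trend, note that ∇(β0 + β1 t) = β1 for every t, so only a constant survives. The sketch below illustrates this in Python with NumPy on hypothetical data (a made-up line plus noise, not the global temperature series).

```python
import numpy as np

rng = np.random.default_rng(1)
t = np.arange(1900, 1998)
trend = -12.0 + 0.006 * t                        # hypothetical linear trend
y = trend + rng.normal(scale=0.1, size=t.size)   # trend plus noise

# Differencing the pure trend leaves the constant slope 0.006.
diff_trend = np.diff(trend)

# First differences Y_t - Y_{t-1} of the noisy series; one observation is lost.
diff_y = np.diff(y)
```

The differenced series has n − 1 = 97 values and fluctuates around the slope rather than along a line.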
DISCUSSION : We have just seen, by means of an example, that both detrending

(using regression to fit the trend) and differencing can be helpful in transforming a
nonstationary process into one which is (or at least appears) stationary.
• One advantage of differencing over detrending to remove trend is that no parameters

are estimated in taking differences.
• One disadvantage of differencing is that it does not provide an “estimate” of the

error process Xt .
• If an estimate of the error process is crucial, detrending may be more appropriate.

If the goal is only to coerce the data to stationarity, differencing may be preferred.
[Plot omitted. x-axis: Year; y-axis: Global temperature deviation differences.]
Figure 3.4: Global temperature first data differences (1900-1997).
3.3.2 Polynomial regression
POLYNOMIAL REGRESSION : We now consider the deterministic time trend model
$$
Y_t = \mu_t + X_t = \beta_0 + \beta_1 t + \beta_2 t^2 + \cdots + \beta_k t^k + X_t,
$$
where $\mu_t = \beta_0 + \beta_1 t + \beta_2 t^2 + \cdots + \beta_k t^k$ and where E(Xt ) = 0. The mean function µt is a
polynomial function with degree k ≥ 1.
• If k = 1, $\mu_t = \beta_0 + \beta_1 t$ is a linear trend function.
• If k = 2, $\mu_t = \beta_0 + \beta_1 t + \beta_2 t^2$ is a quadratic trend function.
• If k = 3, $\mu_t = \beta_0 + \beta_1 t + \beta_2 t^2 + \beta_3 t^3$ is a cubic trend function, and so on.
LEAST SQUARES ESTIMATION : The least squares estimates of β0 , β1 , β2 , ..., βk are
obtained in the same way as in the k = 1 case; namely, the estimates are obtained by
minimizing the objective function
$$
Q(\beta_0, \beta_1, \beta_2, \ldots, \beta_k)
= \sum_{t=1}^{n} [Y_t - (\beta_0 + \beta_1 t + \beta_2 t^2 + \cdots + \beta_k t^k)]^2
= \sum_{t=1}^{n} (Y_t - \beta_0 - \beta_1 t - \beta_2 t^2 - \cdots - \beta_k t^k)^2
$$
with respect to β0 , β1 , β2 , ..., βk . Here, there are k + 1 partial derivatives and k + 1
equations to solve (in simple linear regression, k = 1, so there were 2 equations to solve).
• Unfortunately (without the use of more advanced notation), there are no conve-
nient, closed-form expressions for the least squares estimators when k > 1. This
turns out not to be a major distraction, because we use computing to fit the model
anyway.
• Under the mild assumption that the errors have zero mean; i.e., that E(Xt ) = 0, it
follows that the least squares estimators $\hat{\beta}_0, \hat{\beta}_1, \hat{\beta}_2, \ldots, \hat{\beta}_k$ are unbiased estimators
of their population analogues; i.e., $E(\hat{\beta}_i) = \beta_i$, for i = 0, 1, 2, ..., k.
• As in the simple linear regression case (k = 1), additional assumptions on the errors
Xt are needed to derive the sampling distribution of the least squares estimators,
namely, independence, constant variance, and normality.
• Regression output (e.g., in R, etc.) is correct only under these additional assumptions.
The analyst must keep this in mind.
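Since the k + 1 equations are solved by computer in practice, here is a minimal sketch of how that might look in Python with NumPy (the notes themselves use R). The helper `poly_trend_fit`, the design-matrix approach, and the simulated quadratic series are assumptions made for illustration; the data are not the gold prices of Example 3.5.

```python
import numpy as np

def poly_trend_fit(y, t, k):
    """Least squares estimates (b0, ..., bk) for a degree-k polynomial trend."""
    t = np.asarray(t, dtype=float)
    # Design matrix with columns 1, t, t^2, ..., t^k.
    X = np.vander(t, N=k + 1, increasing=True)
    beta_hat, *_ = np.linalg.lstsq(X, np.asarray(y, dtype=float), rcond=None)
    return beta_hat, X

# Hypothetical quadratic trend plus noise.
rng = np.random.default_rng(2005)
t = np.arange(1, 255)
y = 430.0 - 0.4 * t + 0.003 * t**2 + rng.normal(scale=5.0, size=t.size)

beta_hat, X = poly_trend_fit(y, t, k=2)
fitted = X @ beta_hat
residuals = y - fitted          # the detrended series
```

`np.linalg.lstsq` minimizes the same sum of squares Q, so `beta_hat` holds the least squares estimates in the order b0, b1, b2.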
Example 3.5. Data file: gold (TSA). Of all the precious metals, gold is the most
popular as an investment. Like most commodities, the price of gold is driven by supply
and demand as well as speculation. Figure 3.5 contains a time series of n = 254 daily
observations on the price of gold (per troy ounce) in US dollars during the year 2005.
There is a clear nonlinear trend in the data, so a straight-line model would not be
appropriate.
[Plot omitted. x-axis: Time; y-axis: Price.]
Figure 3.5: Gold price data. Daily price in US dollars per troy ounce: 1/4/05-12/30/05.
In this example, we use R to detrend the data by fitting the quadratic regression
model
$$
Y_t = \beta_0 + \beta_1 t + \beta_2 t^2 + X_t,
$$
for t = 1, 2, ..., 254, where E(Xt ) = 0. Here is the output from fitting this model in R.
Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.346e+02  1.771e+00  245.38   <2e-16 ***
t          -3.618e-01  3.233e-02  -11.19   <2e-16 ***
t.sq        2.637e-03  1.237e-04   21.31   <2e-16 ***

[Plot omitted. x-axis: Time; y-axis: Price.]
Figure 3.6: Gold price data with a quadratic trend fit.
ANALYSIS : Again, we focus only on the values of the least squares estimates. The fitted
regression equation is
$$
\hat{Y}_t = 434.6 - 0.362t + 0.00264t^2,
$$
for t = 1, 2, ..., 254. This fitted model is superimposed over the time series in Figure 3.6.
RESIDUALS : The residual process is
$$
\hat{X}_t = Y_t - \hat{Y}_t = Y_t - 434.6 + 0.362t - 0.00264t^2,
$$
for t = 1, 2, ..., 254, and is depicted in Figure 3.7. This detrended series appears to be
somewhat stationary, at least, much more so than the original time series. However, it
should be obvious that the detrended (residual) process is not white noise. There is still
an enormous amount of momentum left in the residuals. Of course, we know that this
renders most of the R output on the previous page meaningless.
[Plot omitted. x-axis: Time; y-axis: Residuals.]
Figure 3.7: Gold price data. Residuals from the quadratic trend fit.
3.3.3 Seasonal means model
SEASONAL MEANS MODEL: Consider the deterministic trend model
Yt = µt + Xt ,
where E(Xt ) = 0 and where the mean function
$$
\mu_t =
\begin{cases}
\beta_1, & t = 1, 13, 25, \ldots \\
\beta_2, & t = 2, 14, 26, \ldots \\
\ \vdots & \\
\beta_{12}, & t = 12, 24, 36, \ldots
\end{cases}
$$
The regression parameters β1 , β2 , ..., β12 are fixed constants. This is called a seasonal
means model. This model does not take the shape of the seasonal trend into account;
instead, it merely says that observations 12 months apart have the same mean, and
this mean does not change through time. Other seasonal means models with a different
number of parameters could be specified. For instance, for quarterly data, we could
use a mean function with 4 regression parameters β1 , β2 , β3 , and β4 .
FITTING THE MODEL: We can still use least squares to fit the seasonal means model.
The least squares estimates of the regression parameters are simple to compute, but
difficult to write mathematically. In particular,
$$
\hat{\beta}_1 = \frac{1}{n_1} \sum_{t \in A_1} Y_t,
$$
where the set $A_1 = \{t : t = 1 + 12j,\ j = 0, 1, 2, \ldots\}$. In essence, to compute $\hat{\beta}_1$, we sum
the values Y1 , Y13 , Y25 , ..., and then divide by n1 , the number of observations in month 1
(e.g., January). Similarly,
$$
\hat{\beta}_2 = \frac{1}{n_2} \sum_{t \in A_2} Y_t,
$$
where the set $A_2 = \{t : t = 2 + 12j,\ j = 0, 1, 2, \ldots\}$. Again, we sum the values
Y2 , Y14 , Y26 , ..., and then divide by n2 , the number of observations in month 2 (e.g., February).
In general,
$$
\hat{\beta}_i = \frac{1}{n_i} \sum_{t \in A_i} Y_t,
$$
where the set $A_i = \{t : t = i + 12j,\ j = 0, 1, 2, \ldots\}$, for i = 1, 2, ..., 12, where ni is the
number of observations in month i.
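The claim that the least squares estimates are just the monthly sample means can be checked numerically. The sketch below, in Python with NumPy, computes the monthly means directly and then refits the same model as a regression on 12 indicator columns with no intercept. The helper `seasonal_means`, the "true" means, and the simulated series are all hypothetical; they are not the beer sales data.

```python
import numpy as np

def seasonal_means(y, period=12):
    """Sample mean of the observations in each season (month)."""
    y = np.asarray(y, dtype=float)
    season = np.arange(y.size) % period      # 0 for t = 1, 13, 25, ...
    return np.array([y[season == i].mean() for i in range(period)])

# Hypothetical monthly means and 11 years of simulated data.
rng = np.random.default_rng(80)
true_means = np.array([13.0, 13.0, 15.0, 15.5, 17.0, 17.0,
                       17.0, 16.5, 14.5, 14.0, 13.0, 12.0])
n_years = 11
y = np.tile(true_means, n_years) + rng.normal(scale=0.5, size=12 * n_years)

beta_hat = seasonal_means(y)

# Cross-check: least squares on 12 indicator (dummy) columns, no intercept.
season = np.arange(y.size) % 12
X = (season[:, None] == np.arange(12)[None, :]).astype(float)
beta_ls, *_ = np.linalg.lstsq(X, y, rcond=None)
```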
Example 3.6. Data file: beersales (TSA). The data in Figure 3.8 are monthly beer
sales (in millions of barrels) in the United States from 1/80 through 12/90. This time
series has a relatively constant mean overall (i.e., there are no apparent linear trends and
the repeating patterns are relatively constant over time), so a seasonal means model may
be appropriate. Fitting the model can be done in R; here are the results.
Coefficients:
          Estimate Std. Error t value Pr(>|t|)
January    13.1608     0.1647   79.90   <2e-16 ***
February   13.0176     0.1647   79.03   <2e-16 ***
March      15.1058     0.1647   91.71   <2e-16 ***
April      15.3981     0.1647   93.48   <2e-16 ***
May        16.7695     0.1647  101.81   <2e-16 ***
June       16.8792     0.1647  102.47   <2e-16 ***
July       16.8270     0.1647  102.16   <2e-16 ***
August     16.5716     0.1647  100.61   <2e-16 ***
September  14.4045     0.1647   87.45   <2e-16 ***
October    14.2848     0.1647   86.72   <2e-16 ***
November   12.8943     0.1647   78.28   <2e-16 ***
December   12.3404     0.1647   74.92   <2e-16 ***
[Plot omitted. x-axis: Year; y-axis: Sales; plotting symbols are the first letters of the months.]
Figure 3.8: Monthly US beer sales from 1980-1990. The data are measured in millions
of barrels.
DISCUSSION : The only quantities that have relevance are the least squares estimates.
The estimate $\hat{\beta}_i$ is simply the sample mean of the observations for month i; thus, $\hat{\beta}_i$ is an
unbiased estimate of the ith (population) mean monthly sales βi . The test statistics and
p-values are used to test H0 : βi = 0, a largely nonsensical hypothesis in this example.
[Plot omitted. x-axis: Time; y-axis: Residuals.]
Figure 3.9: Beer sales data. Residuals from the seasonal means model fit.
RESIDUALS : A plot of the residuals from the seasonal means model fit, that is,
$$
\hat{X}_t = Y_t - \hat{Y}_t = Y_t - \sum_{i=1}^{12} \hat{\beta}_i I_{A_i}(t),
$$
is in Figure 3.9. The expression $\sum_{i=1}^{12} \hat{\beta}_i I_{A_i}(t)$, where $I_{A_i}(\cdot)$ is the indicator function of
the set $A_i$, is simply the sample mean of the observations that fall in the same month as
time t. This residual process looks somewhat stationary, although I can detect a slightly
increasing trend.
3.3.4 Cosine trend model
REMARK : The seasonal means model is somewhat simplistic in that it does not take the
shape of the seasonal trend into account. We now consider a more elaborate regression
equation that can be used to model data with seasonal trends.
COSINE TREND MODEL: Consider the deterministic time trend model
Yt = µt + Xt
= β cos(2πf t + Φ) + Xt ,
where µt = β cos(2πf t + Φ) and where E(Xt ) = 0. The trigonometric mean function µt

consists of different parts:
• β is the amplitude. The function µt oscillates between −β and β.
• f is the frequency =⇒ 1/f is the period (the time it takes to complete one
full cycle of the function). For monthly data, the period is 12 months; i.e., the
frequency is f = 1/12.
• Φ controls the phase shift. This represents a horizontal shift in the mean function.
MODEL FITTING: Fitting this model is difficult unless we transform the mean function
into a simpler expression. We use the trigonometric identity
cos(a + b) = cos(a) cos(b) − sin(a) sin(b)
to write
β cos(2πf t + Φ) = β cos(2πf t) cos(Φ) − β sin(2πf t) sin(Φ)
= β1 cos(2πf t) + β2 sin(2πf t),
where β1 = β cos Φ and β2 = −β sin Φ, so that the phase shift parameter
$$
\Phi = \tan^{-1}\left( -\frac{\beta_2}{\beta_1} \right)
$$
and the amplitude $\beta = \sqrt{\beta_1^2 + \beta_2^2}$. The rewritten expression,
µt = β1 cos(2πf t) + β2 sin(2πf t),
is a linear function of β1 and β2 , where cos(2πf t) and sin(2πf t) play the roles of predictor
variables. Adding an intercept term for flexibility, say β0 , we get
Yt = β0 + β1 cos(2πf t) + β2 sin(2πf t) + Xt .
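The linearized model can be fit by ordinary least squares with cos(2πft) and sin(2πft) as predictors, after which the amplitude and phase are recovered from the two slope estimates. The Python/NumPy sketch below is one way to do this; the function `fit_cosine_trend` and the noise-free test series are hypothetical. One assumption deserves a flag: the phase is computed with the two-argument arctangent, `arctan2(-b2, b1)`, because the one-argument form tan⁻¹(−β2/β1) alone cannot distinguish Φ from Φ ± π when β1 < 0.

```python
import numpy as np

def fit_cosine_trend(y, t, f):
    """Fit Y_t = b0 + b1*cos(2*pi*f*t) + b2*sin(2*pi*f*t) + X_t by least squares."""
    t = np.asarray(t, dtype=float)
    X = np.column_stack([np.ones(t.size),
                         np.cos(2 * np.pi * f * t),
                         np.sin(2 * np.pi * f * t)])
    (b0, b1, b2), *_ = np.linalg.lstsq(X, np.asarray(y, dtype=float), rcond=None)
    amplitude = np.hypot(b1, b2)      # sqrt(b1^2 + b2^2)
    phase = np.arctan2(-b2, b1)       # uses b1 = beta*cos(Phi), b2 = -beta*sin(Phi)
    return b0, b1, b2, amplitude, phase

# Hypothetical noise-free monthly series: intercept 14.8, amplitude 2, phase 2.5.
t = np.arange(1, 133)
y = 14.8 + 2.0 * np.cos(2 * np.pi * t / 12 + 2.5)

b0, b1, b2, amplitude, phase = fit_cosine_trend(y, t, f=1 / 12)
```

With the generic time index t = 1, 2, ..., the frequency is f = 1/12 for monthly data.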
REMARK : When we fit this model, we must be aware of the values used for the time t,
as it has a direct impact on how we specify the frequency f . For example,
• if we have monthly data and use the generic time specification t = 1, 2, ..., 12, 13, ...,
then we specify f = 1/12.
• if we have monthly data, but we use the years themselves as predictors; i.e., t =
1990, 1991, 1992, etc., we use f = 1, because 12 observations arrive each year.
Example 3.6 (continued). We now use R to fit the cosine trend model to the beer sales
data. Because the predictor variable t is measured in years 1980, 1981, ..., 1990 (with 12
observations each year), we use f = 1. Here is the output:
Coefficients:
                Estimate Std. Error t value Pr(>|t|)
(Intercept)     14.80446    0.05624  263.25   <2e-16 ***
har.cos(2*pi*t) -2.04362    0.07953  -25.70   <2e-16 ***
har.sin(2*pi*t)  0.92820    0.07953   11.67   <2e-16 ***

ANALYSIS : The fitted model
$$
\hat{Y}_t = 14.8 - 2.04 \cos(2\pi t) + 0.93 \sin(2\pi t),
$$
is superimposed over the data in Figure 3.10. The least squares estimates $\hat{\beta}_0 = 14.8$,
$\hat{\beta}_1 = -2.04$, and $\hat{\beta}_2 = 0.93$ are the only useful pieces of information in the output.
RESIDUALS : The (detrended) residual process is
$$
\hat{X}_t = Y_t - \hat{Y}_t = Y_t - 14.8 + 2.04 \cos(2\pi t) - 0.93 \sin(2\pi t),
$$
which is depicted in Figure 3.11. The residuals from the cosine trend fit appear to be
somewhat stationary, but are probably not white noise.
[Plot omitted. x-axis: Time; y-axis: Sales.]
Figure 3.10: Beer sales data with a cosine trend model fit.
REMARK : The seasonal means and cosine trend models are competing models; that is,
both models are useful for seasonal data.
• The cosine trend model is more parsimonious; i.e., it is a simpler model because
there are 3 regression parameters to estimate. On the other hand, the (monthly)
seasonal means model has 12 parameters that need to be estimated!
• Remember, regression parameters (in any model) are estimated with the data. The
more parameters we have in a model, the more data we need to use to estimate them.
This leaves us with less information to estimate other quantities (e.g., residual
variance, etc.). In the end, we have regression estimates that are less precise.
• The mathematical argument on pp 36-39 (CC) should convince you of this result.
[Plot omitted. x-axis: Time; y-axis: Residuals.]
Figure 3.11: Beer sales data. Residuals from the cosine trend model fit.
3.4 Interpreting regression output
RECALL: In fitting the deterministic model
Yt = µt + Xt ,
we have learned the following:
• for least squares estimates to be unbiased, all we need is E(Xt ) = 0, for all t.
• for the variances of the least squares estimates (and standard errors) seen in R
output to be meaningful, we need E(Xt ) = 0, {Xt } independent, and var(Xt ) = γ0
(constant). These assumptions are met if {Xt } is a white noise process.
• for t tests and probability values to be valid, we need the last three assumptions to
hold; in addition, normality is needed on the error process {Xt }.
NEW RESULT : If var(Xt ) = γ0 is constant, an estimate of γ0 is given by
$$
S^2 = \frac{1}{n - p} \sum_{t=1}^{n} (Y_t - \hat{\mu}_t)^2,
$$
where $\hat{\mu}_t$ is the least squares estimate of µt and p is the number of regression parameters
in µt . The term n − p is called the error degrees of freedom. If {Xt } is independent,
then $E(S^2) = \gamma_0$; i.e., $S^2$ is an unbiased estimator of γ0 . The residual standard
deviation is defined by
$$
S = \sqrt{S^2} = \sqrt{ \frac{1}{n - p} \sum_{t=1}^{n} (Y_t - \hat{\mu}_t)^2 },
$$
the (positive) square root of $S^2$.
• The smaller S is, the better the fit of the model. Therefore, in comparing two model
fits (for two different models), we can look at the value of S in each model to judge
which model may be preferred (caution is needed in doing this).
• The larger S is, the noisier the error process likely is. This makes the least squares
estimates more variable and predictions less precise.
RESULT : For any data set {Yt : t = 1, 2, ..., n}, we can write algebraically
$$
\underbrace{\sum_{t=1}^{n} (Y_t - \bar{Y})^2}_{\text{SST}}
= \underbrace{\sum_{t=1}^{n} (\hat{Y}_t - \bar{Y})^2}_{\text{SSR}}
+ \underbrace{\sum_{t=1}^{n} (Y_t - \hat{Y}_t)^2}_{\text{SSE}}.
$$
These quantities are called sums of squares and form the basis for the following analysis
of variance (ANOVA) table.
Source   df      SS    MS                    F
Model    p − 1   SSR   MSR = SSR/(p − 1)     F = MSR/MSE
Error    n − p   SSE   MSE = SSE/(n − p)
Total    n − 1   SST
COEFFICIENT OF DETERMINATION : Since SST = SSR + SSE, it follows that the

proportion of the total variation in the data explained by the deterministic model is
$$
R^2 = \frac{\text{SSR}}{\text{SST}} = 1 - \frac{\text{SSE}}{\text{SST}},
$$
the coefficient of determination. The larger R2 is, the better the deterministic part
of the model explains the variability in the data. Clearly, 0 ≤ R2 ≤ 1.
IMPORTANT : It is critical to understand what R2 does and does not measure. Its value
is computed under the assumption that the deterministic trend model is correct and
assesses how much of the variation in the data may be attributed to that relationship
rather than just to inherent variation.
• If R2 is small, it may be that there is a lot of random inherent variation in the data,
so that, although the deterministic trend model is reasonable, it can only explain
so much of the observed overall variation.
• Alternatively, R2 may be close to 1, but a particular model may not be the best
model. In fact, R2 could be very “high,” but not relevant because a better model
may exist.
ADJUSTED R2 : A slight variant of the coefficient of determination is
$$
\bar{R}^2 = 1 - \frac{\text{SSE}/(n - p)}{\text{SST}/(n - 1)}.
$$
This is called the adjusted R2 statistic. It is useful for comparing models with different
numbers of parameters.
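The quantities in this section (S, the sums of squares, R2, adjusted R2, and F) are simple functions of the fitted values. The following Python/NumPy sketch computes them for any design matrix that includes an intercept column; the helper `fit_summary` and the simulated straight-line data are assumptions for illustration. The last line reworks the adjusted R2 from the reported output of Example 3.4 (R2 = 0.6515, n = 98, p = 2), using the identity adjusted R2 = 1 − (1 − R2)(n − 1)/(n − p), which follows from the two definitions above.

```python
import numpy as np

def fit_summary(y, X):
    """Sums of squares, S, R^2, adjusted R^2 and F for a least squares fit of y on X."""
    y = np.asarray(y, dtype=float)
    n, p = X.shape
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    fitted = X @ beta_hat
    sst = np.sum((y - y.mean()) ** 2)
    ssr = np.sum((fitted - y.mean()) ** 2)
    sse = np.sum((y - fitted) ** 2)
    mse = sse / (n - p)
    return {
        "SST": sst, "SSR": ssr, "SSE": sse,
        "S": np.sqrt(mse),
        "R2": 1 - sse / sst,
        "adjR2": 1 - mse / (sst / (n - 1)),
        "F": (ssr / (p - 1)) / mse,
    }

# Hypothetical straight line data with an intercept column in the design matrix.
rng = np.random.default_rng(7)
t = np.arange(1, 99)
y = 0.006 * t + rng.normal(scale=0.13, size=t.size)
X = np.column_stack([np.ones(t.size), t])
out = fit_summary(y, X)

# Adjusted R^2 implied by the reported R^2 in Example 3.4 (n = 98, p = 2).
adj_from_r2 = 1 - (1 - 0.6515) * (98 - 1) / (98 - 2)
```

Rounded to four decimals, `adj_from_r2` agrees with the value 0.6479 printed in the R output for Example 3.4.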
3.5 Residual analysis (model diagnostics)
RESIDUALS : Consider the deterministic trend model
Yt = µt + Xt ,
where E(Xt ) = 0. In this chapter, we have talked about using the method of least squares
to fit models of this type (e.g., straight line regression, polynomial regression, seasonal
means, cosine trends, etc.). The fitted model is $\hat{Y}_t = \hat{\mu}_t$ and the residual process is
$$
\hat{X}_t = Y_t - \hat{Y}_t.
$$
The residuals from the model fit are important. In essence, they serve as proxies (pre-
dictions) for the true errors Xt , which are not observed. The residuals can help us learn
about the validity of the assumptions made in our model.
STANDARDIZED RESIDUALS : If the model above is fit using least squares (and there
is an intercept term in the model), then algebraically,
$$
\sum_{t=1}^{n} (Y_t - \hat{Y}_t) = 0,
$$
that is, the sum of the residuals is equal to zero. Thus, the residuals have mean zero and
the standardized residuals, defined by
$$
\hat{X}_t^* = \frac{\hat{X}_t}{S},
$$
are unitless quantities. If desired, we can use the standardized residuals for model diag-
nostic purposes. The standardized residuals defined here are not exactly zero mean, unit
variance quantities, but they are approximately so. Thus, if the model is adequate, we
would expect most standardized residuals to fall between −3 and 3.
3.5.1 Assessing normality
NORMALITY : If the error process {Xt } is normally distributed, then we would expect
the residuals to also be approximately normally distributed. We can therefore diag-
nose this assumption by examining the (standardized) residuals and looking for evidence
of normality. We can use histograms and normal probability plots (also known as
quantile-quantile, or qq plots) to do this.
• Histograms which resemble heavily skewed empirical distributions are evidence

against normality.
• A normal probability plot is a scatterplot of ordered residuals $\hat{X}_t$ (or standardized
residuals $\hat{X}_t^*$) versus the ordered theoretical normal quantiles (or normal scores).
The idea behind this plot is simple. If the residuals are normally distributed, then
plotting them versus the corresponding normal quantiles (i.e., values from a normal
distribution) should produce a straight line (or at least close).
Example 3.4 (continued). In Example 3.4, we fit a straight line trend model to the
global temperature data. Below are the histogram and qq plot for the standardized
residuals. Does normality seem to be supported?
[Plots omitted. Left panel: Histogram of standardized residuals (x-axis: Standardized residuals;
y-axis: Frequency). Right panel: QQ plot of standardized residuals (x-axis: Theoretical
Quantiles; y-axis: Sample Quantiles).]
SHAPIRO-WILK TEST : Histograms and qq plots provide only visual evidence of nor-
mality. The Shapiro-Wilk test is a formal hypothesis test that can be used to test
H0 : the (standardized) residuals are normally distributed

versus
H1 : the (standardized) residuals are not normally distributed.
The test is carried out by calculating a statistic W approximately equal to the sample
correlation between the ordered (standardized) residuals and the normal scores. The
higher this correlation, the higher the value of W . Therefore, small values of W are
evidence against H0 . The null distribution of W is very complicated, but probability
values (p-values) are produced in R automatically. If the p-value is smaller than the
significance level for the test (e.g., α = 0.05, etc.), then we reject H0 and conclude that
there is a violation in the normality assumption. Otherwise, we do not reject H0 .
Example 3.4 (continued). Recall the straight line trend model fit to the
global temperature data. The Shapiro-Wilk test on the standardized residuals produces
the following output:
> shapiro.test(rstudent(fit))
Shapiro-Wilk normality test
data: rstudent(fit)
W = 0.9934, p-value = 0.915
Because the p-value for the test is not small, we do not reject H0 . This test does not
provide evidence of non-normality for the standardized residuals.
3.5.2 Assessing independence
INDEPENDENCE : Plotting the residuals versus time can provide visual insight on
whether or not the (standardized) residuals exhibit independence (although it is often
easier to detect gross violations of independence). Residuals that “hang together” are not
what we would expect to see from a sequence of independent random variables. Similarly,
residuals that oscillate back and forth too notably also do not resemble this sequence.
RUNS TEST : A runs test is a nonparametric test which calculates the number of runs
in the (standardized) residuals. The formal test is
H0 : the (standardized) residuals are independent

versus
H1 : the (standardized) residuals are not independent.
[Plot omitted. x-axis: Year; y-axis: Residuals.]
Figure 3.12: Standardized residuals from the straight line trend model fit for the global
temperature data. A horizontal line at zero has been added.
In particular, the test examines the (standardized) residuals in sequence to look for
patterns that would give evidence against independence. Runs above or below 0 (the
approximate median of the residuals) are counted.
• A small number of runs would indicate that neighboring values are positively
dependent and tend to hang together over time.
• Too many runs would indicate that the data oscillate back and forth across their
median. This suggests that neighboring residuals are negatively dependent.
• Therefore, either too few or too many runs lead us to reject independence.
Example 3.4 (continued). Recall the straight line trend model fit to the
global temperature data. A runs test on the standardized residuals produces the following
output:
> runs(rstudent(fit))
$pvalue
[1] 3.65e-06
$observed.runs
[1] 27
$expected.runs
[1] 49.81633
The p-value for the test is extremely small, so we would reject H0 . The evidence suggests
that the standardized residuals are not independent. The R output also reports the
expected number of runs (computed under the assumption of independence). The observed
number of runs (27) is far lower than the expected number (about 49.8), which is not
consistent with independence.
BACKGROUND: If the (standardized) residuals are truly independent, it is possible to

write out the probability mass function of R, the number of runs. This mass function is
$$
f_R(r) =
\begin{cases}
\dfrac{2 \binom{n_1 - 1}{(r/2) - 1} \binom{n_2 - 1}{(r/2) - 1}}{\binom{n_1 + n_2}{n_1}}, & \text{if } r \text{ is even} \\[2ex]
\dfrac{\binom{n_1 - 1}{(r-1)/2} \binom{n_2 - 1}{(r-3)/2} + \binom{n_1 - 1}{(r-3)/2} \binom{n_2 - 1}{(r-1)/2}}{\binom{n_1 + n_2}{n_1}}, & \text{if } r \text{ is odd,}
\end{cases}
$$
where
• n1 = the number of residuals less than zero
• n2 = the number of residuals greater than zero
• r1 = the number of runs below zero
• r2 = the number of runs above zero
• r = r1 + r2 .
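For small n1 and n2 the exact mass function can be evaluated directly. The sketch below is a Python illustration (the function name `runs_pmf` is hypothetical) using the standard form of the runs distribution, in which the even case carries a factor of 2 because a sequence with an even number of runs can start either below or above zero. It checks that the probabilities sum to one and computes the exact mean for one small case.

```python
from math import comb

def runs_pmf(r, n1, n2):
    """P(R = r) under independence, with n1 values below zero and n2 above zero (r >= 2)."""
    total = comb(n1 + n2, n1)
    if r % 2 == 0:
        k = r // 2
        return 2 * comb(n1 - 1, k - 1) * comb(n2 - 1, k - 1) / total
    k = (r - 1) // 2          # so that k - 1 = (r - 3) / 2
    return (comb(n1 - 1, k) * comb(n2 - 1, k - 1)
            + comb(n1 - 1, k - 1) * comb(n2 - 1, k)) / total

# Small hypothetical case: 5 negative and 7 positive residuals.
n1, n2 = 5, 7
support = range(2, n1 + n2 + 1)
probs = [runs_pmf(r, n1, n2) for r in support]
total_prob = sum(probs)
mean_runs = sum(r * p for r, p in zip(support, probs))
```

In this case `mean_runs` should equal 1 + 2 n1 n2 / (n1 + n2), the expected number of runs under independence.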
IMPLEMENTATION : When n1 and n2 are large, the number of runs R is approximately
normally distributed with mean
$$
\mu_R = 1 + \frac{2 n_1 n_2}{n}
$$
and variance
$$
\sigma_R^2 = \frac{2 n_1 n_2 (2 n_1 n_2 - n)}{n^2 (n - 1)},
$$
where n = n1 + n2 . Therefore, values of
$$
Z = \frac{|R - \mu_R|}{\sigma_R} > z_{\alpha/2}
$$
lead to the rejection of H0 . The notation $z_{\alpha/2}$ denotes the upper α/2 quantile of the
N (0, 1) distribution.
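A minimal Python sketch of the large-sample runs test follows. The function `runs_test_normal` is hypothetical, and dropping residuals exactly equal to zero is a convention assumed here, not one stated in the notes. For Example 3.4, the counts n1 = 46 and n2 = 52 are not reported in the output; they are inferred values that reproduce the printed expected number of runs. The p-value from this normal approximation need not match the one printed by R, whose exact computation is not described in the notes.

```python
import math

def runs_test_normal(values):
    """Large-sample runs test about zero: (runs, mu_R, sigma_R, Z, two-sided p-value)."""
    signs = [v > 0 for v in values if v != 0]   # drop exact zeros (assumed convention)
    n1 = signs.count(False)                     # number of values below zero
    n2 = signs.count(True)                      # number of values above zero
    n = n1 + n2
    runs = 1 + sum(a != b for a, b in zip(signs, signs[1:]))
    mu = 1 + 2 * n1 * n2 / n
    var = 2 * n1 * n2 * (2 * n1 * n2 - n) / (n ** 2 * (n - 1))
    z = abs(runs - mu) / math.sqrt(var)
    p_value = math.erfc(z / math.sqrt(2))       # P(|N(0,1)| > z)
    return runs, mu, math.sqrt(var), z, p_value

# A strictly alternating (hypothetical) sequence has far too many runs.
alternating = [1.0 if i % 2 == 0 else -1.0 for i in range(20)]
runs, mu, sigma, z, p_value = runs_test_normal(alternating)

# Expected runs for Example 3.4 with the inferred counts n1 = 46, n2 = 52 (n = 98).
mu_example = 1 + 2 * 46 * 52 / 98
```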
3.5.3 Sample autocorrelation function
RECALL: Consider the stationary stochastic process {Yt : t = 1, 2, ..., n}. In Chapter 2,
we defined the autocorrelation function to be
$$
\rho_k = \mathrm{corr}(Y_t, Y_{t-k}) = \frac{\gamma_k}{\gamma_0},
$$
where γk = cov(Yt , Yt−k ) and γ0 = var(Yt ). Perhaps more aptly named, ρk is the pop-
ulation autocorrelation function because it depends on the true parameters for the
process {Yt }. In real life (that is, with real data) these population parameters are un-
known, so we don’t get to know the true ρk . However, we can estimate it. This leads to
the definition of the sample autocorrelation function.
TERMINOLOGY : For a set of time series data Y1 , Y2 , ..., Yn , we define the sample
autocorrelation function, at lag k, by
$$
r_k = \frac{\sum_{t=k+1}^{n} (Y_t - \bar{Y})(Y_{t-k} - \bar{Y})}{\sum_{t=1}^{n} (Y_t - \bar{Y})^2},
$$
where $\bar{Y}$ is the sample mean of Y1 , Y2 , ..., Yn (i.e., all the data are used to compute $\bar{Y}$).
The sample version rk is a point estimate of the true (population) autocorrelation ρk .
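The definition of rk translates directly into code. The Python/NumPy sketch below (the function name `sample_acf` is an assumption) computes r1, ..., rK using the overall sample mean in both the numerator and the denominator, as in the definition above, and evaluates it on a tiny series that can be checked by hand.

```python
import numpy as np

def sample_acf(y, max_lag):
    """Sample autocorrelations r_1, ..., r_max_lag of the series y."""
    y = np.asarray(y, dtype=float)
    d = y - y.mean()                 # deviations from the overall sample mean
    denom = np.sum(d ** 2)
    # d[k:] holds Y_t - Ybar for t = k+1..n; d[:-k] holds Y_{t-k} - Ybar.
    return np.array([np.sum(d[k:] * d[:-k]) / denom
                     for k in range(1, max_lag + 1)])

# Hand-checkable example: deviations are -2, -1, 0, 1, 2 and the denominator is 10.
r = sample_acf([1, 2, 3, 4, 5], max_lag=2)
```

Here the lag 1 numerator is 2 + 0 + 0 + 2 = 4 and the lag 2 numerator is 0 − 1 + 0 = −1, so r1 = 0.4 and r2 = −0.1.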
USAGE WITH STANDARDIZED RESIDUALS : Because we are talking about using
standardized residuals to check regression model assumptions, we can examine the sample
autocorrelation function of the standardized residual process $\{\hat{X}_t^*\}$. Replacing $Y_t$ with
$\hat{X}_t^*$ and $\bar{Y}$ with $\overline{\hat{X}^*}$ in the above definition, we get
$$
r_k^* = \frac{\sum_{t=k+1}^{n} (\hat{X}_t^* - \overline{\hat{X}^*})(\hat{X}_{t-k}^* - \overline{\hat{X}^*})}{\sum_{t=1}^{n} (\hat{X}_t^* - \overline{\hat{X}^*})^2}.
$$
Note that when the sum of the standardized residuals equals zero (which occurs when
least squares is used and when an intercept is included in the model), we also have
$\overline{\hat{X}^*} = 0$. Therefore, the formula above reduces to
$$
r_k^* = \frac{\sum_{t=k+1}^{n} \hat{X}_t^* \hat{X}_{t-k}^*}{\sum_{t=1}^{n} (\hat{X}_t^*)^2}.
$$
IMPORTANT : If the standardized residual process $\{\hat{X}_t^*\}$ is white noise, then
$$
r_k^* \sim \mathcal{AN}\left( 0, \frac{1}{n} \right),
$$
for n large. The notation $\mathcal{AN}$ is read “approximately normal.” For k ≠ l, it also turns
out that $\mathrm{cov}(r_k^*, r_l^*) \approx 0$. These facts are established in Chapter 6.
• If the standardized residuals are truly white noise, then we would expect $r_k^*$ to fall
within 2 standard errors of 0. That is, values of $r_k^*$ within $\pm 2/\sqrt{n}$ are within the
margin of error under the white noise assumption.
• Values of $r_k^*$ larger than $2/\sqrt{n}$ in absolute value are outside the margin of error,
and, thus, are not consistent with what we would see from a white noise process.
More specifically, this would suggest that there is dependence (autocorrelation) at
lag k in the standardized residual process.
GRAPHICAL TOOL: The plot of rk (or $r_k^*$ if we are examining standardized residuals)
versus k is called a correlogram. If we are assessing whether or not the process is white
noise, it is helpful to put horizontal dashed lines at $\pm 2/\sqrt{n}$ so we can easily see if the
sample autocorrelations fall outside the margin of error.
Example 3.4 (continued). Recall the straight line trend model fit to the
global temperature data. In Figure 3.13, we display the correlogram for the standardized
residuals $\{\hat{X}_t^*\}$ from the straight line fit.
• Note that many of the sample estimates $r_k^*$ fall outside the $\pm 2/\sqrt{n}$ margin of error
cutoff. These residuals likely do not resemble a white noise process.
[Plot omitted. Title: Sample ACF for standardized residuals; x-axis: Lag; y-axis: ACF.]
Figure 3.13: Global temperature data. Sample autocorrelation function for the standard-
ized residuals from the straight line model fit.
• There is still a substantial amount of structure left in the residuals. In particular,

there is strong positive autocorrelation at early lags and the sample ACF tends to
decay somewhat as k increases.
SIMULATION EXERCISE : Let’s generate some white noise processes and examine their
sample autocorrelation functions! Figure 3.14 (left) displays two simulated white noise
processes et ∼ iid N (0, 1), where n = 100. With n = 100, the margin of error for each
sample autocorrelation rk is
$$
\text{margin of error} = \pm 2/\sqrt{100} = \pm 0.2.
$$
Figure 3.14 (right) displays the sample correlograms (one for each simulated white noise
series) with horizontal lines at the ±0.2 margin of error cutoffs.
[Plots omitted. Left panels: White noise process.1 and White noise process.2 versus Time.
Right panels: Sample ACF (x-axis: Lag; y-axis: ACF).]
Figure 3.14: Two simulated standard normal white noise processes with their associated
sample autocorrelation functions.
Even though the generated data are truly white noise, we still do see some values of rk
(one for each realization) that fall outside the margin of error cutoffs. Why does this happen?
• In essence, every time we compare rk to its margin of error cutoffs $\pm 2/\sqrt{n}$, we are
performing a hypothesis test, namely, we are testing H0 : ρk = 0 at a significance
level of approximately α = 0.05.
• Therefore, 5 percent of the time on average, we will observe a significant result

which is really a “false alarm” (i.e., a Type I Error).
• When you are interpreting correlograms, keep this in mind. If there are patterns
in the values of rk and many which extend beyond the margin of error (especially
at early lags), the series is probably not white noise. On the other hand, a stray
statistically significant value of rk at, say, lag k = 17 is likely just a false alarm.
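The false alarm phenomenon is easy to reproduce by simulation. The Python/NumPy sketch below generates many standard normal white noise series of length n = 100, computes r1, ..., r20 for each, and records how often a value falls outside ±2/√n. The number of series, the seed, and the helper `sample_acf` are arbitrary choices made for illustration, not part of the notes.

```python
import numpy as np

def sample_acf(y, max_lag):
    """Sample autocorrelations r_1, ..., r_max_lag of the series y."""
    y = np.asarray(y, dtype=float)
    d = y - y.mean()
    denom = np.sum(d ** 2)
    return np.array([np.sum(d[k:] * d[:-k]) / denom
                     for k in range(1, max_lag + 1)])

rng = np.random.default_rng(314)
n, max_lag, n_series = 100, 20, 500
cutoff = 2 / np.sqrt(n)                       # margin of error, 0.2 when n = 100

exceed = 0
for _ in range(n_series):
    e = rng.normal(size=n)                    # one white noise realization
    r = sample_acf(e, max_lag)
    exceed += int(np.sum(np.abs(r) > cutoff))

# Proportion of (series, lag) pairs flagged even though the data are white noise.
false_alarm_rate = exceed / (n_series * max_lag)
```

The resulting rate should be a few percent, in the neighborhood of the nominal 5 percent level, which is why a single stray significant value at a late lag is usually not cause for concern.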
4 Models for Stationary Time Series
4.1 Introduction
RECALL: In the last chapter, we used regression to “detrend” time series data with
the hope of removing non-stationary patterns and producing residuals that resembled
a stationary process. We also learned that differencing can be an effective technique
to transform a non-stationary process into one which is stationary. In this chapter, we
consider (linear) time series models for stationary processes. Recall that stationary time
series are those whose statistical properties do not change over time.
TERMINOLOGY : Suppose {et } is a zero mean white noise process with var(et ) = σe2 .
A general linear process is defined by
Yt = et + Ψ1 et−1 + Ψ2 et−2 + Ψ3 et−3 + · · · .
That is, Yt , the value of the process at time t, is a weighted linear combination of white
noise terms at the current and past times. The processes that we examine in this chapter
are special cases of this general linear process. In general, E(Yt ) = 0 and
$$\gamma_k = \mathrm{cov}(Y_t, Y_{t-k}) = \sigma_e^2 \sum_{i=0}^{\infty} \Psi_i \Psi_{i+k},$$
for k ≥ 0, where we set Ψ0 = 1.
• For mathematical reasons (to ensure stationarity), we will assume that the Ψi ’s are
square summable, that is,
$$\sum_{i=1}^{\infty} \Psi_i^2 < \infty.$$
• A nonzero mean µ could be added to the right-hand side of the general linear
process above; this would not affect the stationarity properties of {Yt }. Therefore,
there is no harm in assuming that the process {Yt } has zero mean.
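The autocovariance formula for the general linear process can be evaluated numerically once the infinite sum is truncated. The following is a sketch in Python; the function name `glp_autocov` and the weights Ψi = 0.5^i are hypothetical choices made only for illustration. For these weights the exact answer is γk = 0.5^k / 0.75, so the truncation can be checked directly.

```python
def glp_autocov(psi, k, sigma2=1.0):
    """Truncated gamma_k = sigma2 * sum_i psi[i] * psi[i + k].

    psi holds Psi_0, Psi_1, ... (with Psi_0 = 1); weights beyond the end of
    the list are treated as zero.
    """
    return sigma2 * sum(psi[i] * psi[i + k] for i in range(len(psi) - k))

# Hypothetical square summable weights: Psi_i = 0.5**i, cut off after 60 terms
psi = [0.5 ** i for i in range(60)]
for k in range(4):
    print(k, round(glp_autocov(psi, k), 6), round(0.5 ** k / 0.75, 6))
```

A finite list of weights needs no truncation at all: `glp_autocov([1.0, 0.9], 1)` returns 0.9, the lag 1 autocovariance of a process with Ψ1 = 0.9 and all later weights equal to zero.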
4.2 Moving average processes
The process
Yt = et − θ1 et−1 − θ2 et−2 − · · · − θq et−q
is called a moving average process of order q, denoted by MA(q). Note that this
is a special case of the general linear process with Ψ0 = 1, Ψ1 = −θ1 , Ψ2 = −θ2 , ...,
Ψq = −θq , and Ψj = 0 for all j > q.
4.2.1 MA(1) process
TERMINOLOGY : With q = 1, the moving average process defined above becomes
Yt = et − θet−1 .
This is called an MA(1) process. For this process, the mean is
E(Yt ) = E(et − θet−1 ) = E(et ) − θE(et−1 ) = 0.
The variance is
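$$\begin{aligned}
\gamma_0 = \mathrm{var}(Y_t) &= \mathrm{var}(e_t - \theta e_{t-1}) \\
&= \mathrm{var}(e_t) + \theta^2\,\mathrm{var}(e_{t-1}) - 2\theta\,\mathrm{cov}(e_t, e_{t-1}) \\
&= \sigma_e^2 + \theta^2\sigma_e^2 = \sigma_e^2(1 + \theta^2).
\end{aligned}$$
The autocovariance at lag 1 is given by
$$\begin{aligned}
\gamma_1 = \mathrm{cov}(Y_t, Y_{t-1}) &= \mathrm{cov}(e_t - \theta e_{t-1},\ e_{t-1} - \theta e_{t-2}) \\
&= \mathrm{cov}(e_t, e_{t-1}) - \theta\,\mathrm{cov}(e_t, e_{t-2}) - \theta\,\mathrm{cov}(e_{t-1}, e_{t-1}) + \theta^2\,\mathrm{cov}(e_{t-1}, e_{t-2}) \\
&= -\theta\,\mathrm{var}(e_{t-1}) = -\theta\sigma_e^2.
\end{aligned}$$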
For any lag k > 1, γk = cov(Yt , Yt−k ) = 0, because no white noise subscripts in Yt and
Yt−k will overlap.
AUTOCOVARIANCE FUNCTION : For an MA(1) process,
$$\gamma_k = \begin{cases}
\sigma_e^2(1 + \theta^2), & k = 0 \\
-\theta\sigma_e^2, & k = 1 \\
0, & k > 1.
\end{cases}$$
AUTOCORRELATION FUNCTION : For an MA(1) process,
$$\rho_k = \frac{\gamma_k}{\gamma_0} = \begin{cases}
1, & k = 0 \\
-\dfrac{\theta}{1 + \theta^2}, & k = 1 \\
0, & k > 1.
\end{cases}$$
IMPORTANT : The MA(1) process has zero correlation beyond lag k = 1!
Observations one time unit apart are correlated, but observations more than one time unit
apart are not. This is important to keep in mind when we entertain models for real data
using empirical evidence (e.g., sample autocorrelations rk , etc.).
FACTS : The following theoretical results hold for an MA(1) process.
• When θ = 0, the MA(1) process reduces to a white noise process.
• As θ ranges from −1 to 1, the (population) lag 1 autocorrelation ρ1 ranges from
0.5 to −0.5; see p. 58 (CC).
• The largest ρ1 can be is 0.5 (when θ = −1) and the smallest ρ1 can be is −0.5
(when θ = 1). Therefore, if we were to observe a sample lag 1 autocorrelation r1
that was well outside [−0.5, 0.5], this would be inconsistent with the MA(1) model.
• The population lag 1 autocorrelation
$$\rho_1 = -\frac{\theta}{1 + \theta^2}$$
remains the same if θ is replaced by 1/θ (for θ ≠ 0). Therefore, if someone told you
the value of ρ1 for an MA(1) process, you could not identify the corresponding value
of θ uniquely. This is somewhat problematic and will have consequences in due course
(e.g., when we discuss invertibility).
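The facts above are easy to check numerically. A small sketch in Python follows; the function name `ma1_rho1` and the particular θ values are illustrative only. Note that θ = 0.5 and θ = 2 = 1/0.5 give the same ρ1, and that ρ1 never leaves [−0.5, 0.5].

```python
def ma1_rho1(theta):
    """Lag 1 autocorrelation of the MA(1) process Y_t = e_t - theta * e_{t-1}."""
    return -theta / (1 + theta ** 2)

for theta in (-1.0, -0.9, 0.0, 0.5, 2.0, 1.0):
    print(f"theta = {theta:5.2f}   rho_1 = {ma1_rho1(theta):7.4f}")
```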
Figure 4.1: Upper left: MA(1) simulation with θ = −0.9, n = 100, and $\sigma_e^2 = 1$. Upper
right: Sample autocorrelation function rk . Lower left: Lag 1 scatterplot of Yt versus Yt−1 .
Lower right: Lag 2 scatterplot of Yt versus Yt−2 .
Example 4.1. We use R to simulate the MA(1) process Yt = et − θet−1 , where θ = −0.9,
n = 100, and et ∼ iid N (0, 1).
• Note that
$$\theta = -0.9 \implies \rho_1 = \frac{-(-0.9)}{1 + (-0.9)^2} \approx 0.497.$$
• There is a moderately strong positive autocorrelation at lag 1. Of course, ρk = 0,
for all k > 1.
• The sample ACF in Figure 4.1 (upper right) looks like what we would expect from
the MA(1) theory. There is a pronounced “spike” at k = 1 in the sample ACF and
little action elsewhere (for k > 1). The error bounds at ±2/√100 = 0.2 correspond
to those for a white noise process, not an MA(1) process.
• The lag 1 scatterplot, i.e., the scatterplot of Yt versus Yt−1 , shows a moderate
increasing linear relationship. This is expected because of the moderately strong
positive lag 1 autocorrelation.
• The lag 2 scatterplot, i.e., the scatterplot of Yt versus Yt−2 , shows no linear
relationship. This is expected because ρ2 = 0 for an MA(1) process.
• Figure 4.2 displays a second MA(1) simulation, except with θ = 0.9. In this model,
ρ1 ≈ −0.497 and ρk = 0, for all k > 1. Compare Figure 4.2 with Figure 4.1.
Figure 4.2: Upper left: MA(1) simulation with θ = 0.9, n = 100, and $\sigma_e^2 = 1$. Upper
right: Sample autocorrelation function rk . Lower left: Scatterplot of Yt versus Yt−1 .
Lower right: Scatterplot of Yt versus Yt−2 .
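The notes carry out this simulation in R. As a rough stand-in, here is a sketch in Python using only the standard library; the helper names `simulate_ma1` and `sample_acf` and the seed are assumptions made for this illustration, and the simulated values will not match the series plotted in the figures.

```python
import random

def simulate_ma1(theta, n, seed=None):
    """Simulate Y_t = e_t - theta * e_{t-1} with e_t iid N(0, 1)."""
    rng = random.Random(seed)
    e = [rng.gauss(0, 1) for _ in range(n + 1)]   # one extra draw serves as e_0
    return [e[t] - theta * e[t - 1] for t in range(1, n + 1)]

def sample_acf(y, max_lag):
    """Sample autocorrelations r_1, ..., r_max_lag of the series y."""
    n = len(y)
    ybar = sum(y) / n
    denom = sum((v - ybar) ** 2 for v in y)
    return [
        sum((y[t] - ybar) * (y[t - k] - ybar) for t in range(k, n)) / denom
        for k in range(1, max_lag + 1)
    ]

y = simulate_ma1(theta=-0.9, n=100, seed=1)
print("r_1, ..., r_5:", [round(v, 3) for v in sample_acf(y, 5)])
print("theoretical rho_1:", round(0.9 / (1 + 0.9 ** 2), 3))
```

With n = 100 the value of r1 will wander around 0.497 from run to run; increasing n tightens the agreement, which is the sampling variability the error bounds are meant to convey.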
4.2.2 MA(2) process
The process
Yt = et − θ1 et−1 − θ2 et−2
is a moving average process of order 2, denoted by MA(2). For this process, the
mean is
E(Yt ) = E(et − θ1 et−1 − θ2 et−2 ) = E(et ) − θ1 E(et−1 ) − θ2 E(et−2 ) = 0.
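The variance is
$$\begin{aligned}
\gamma_0 = \mathrm{var}(Y_t) &= \mathrm{var}(e_t - \theta_1 e_{t-1} - \theta_2 e_{t-2}) \\
&= \mathrm{var}(e_t) + \theta_1^2\,\mathrm{var}(e_{t-1}) + \theta_2^2\,\mathrm{var}(e_{t-2}) + \underbrace{6 \text{ covariance terms}}_{\text{all } = 0} \\
&= \sigma_e^2 + \theta_1^2\sigma_e^2 + \theta_2^2\sigma_e^2 = \sigma_e^2(1 + \theta_1^2 + \theta_2^2).
\end{aligned}$$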
γ1 = cov(Yt , Yt−1 ) = cov(et − θ1 et−1 − θ2 et−2 , et−1 − θ1 et−2 − θ2 et−3 )
= cov(−θ1 et−1 , et−1 ) + cov(−θ2 et−2 , −θ1 et−2 )
= −θ1 var(et−1 ) + (−θ2 )(−θ1 )var(et−2 )
= −θ1 σe2 + θ1 θ2 σe2 = (−θ1 + θ1 θ2 )σe2 .
γ2 = cov(Yt , Yt−2 ) = cov(et − θ1 et−1 − θ2 et−2 , et−2 − θ1 et−3 − θ2 et−4 )
= cov(−θ2 et−2 , et−2 )
= −θ2 var(et−2 ) = −θ2 σe2 .
For any lag k > 2,

γk = cov(Yt , Yt−k ) = 0,
because no white noise subscripts in Yt and Yt−k will overlap.
AUTOCOVARIANCE FUNCTION : For an MA(2) process,
$$\gamma_k = \begin{cases}
\sigma_e^2(1 + \theta_1^2 + \theta_2^2), & k = 0 \\
(-\theta_1 + \theta_1\theta_2)\sigma_e^2, & k = 1 \\
-\theta_2\sigma_e^2, & k = 2 \\
0, & k > 2.
\end{cases}$$
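AUTOCORRELATION FUNCTION : For an MA(2) process,
$$\rho_k = \frac{\gamma_k}{\gamma_0} = \begin{cases}
1, & k = 0 \\
\dfrac{-\theta_1 + \theta_1\theta_2}{1 + \theta_1^2 + \theta_2^2}, & k = 1 \\
\dfrac{-\theta_2}{1 + \theta_1^2 + \theta_2^2}, & k = 2 \\
0, & k > 2.
\end{cases}$$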
IMPORTANT : The MA(2) process has zero correlation beyond lag k = 2!
Observations 1 or 2 time units apart are correlated. Observations more than two time units
apart are not correlated.
Example 4.2. We use R to simulate the MA(2) process
Yt = et − θ1 et−1 − θ2 et−2 ,
where θ1 = 0.9, θ2 = −0.7, n = 100, and et ∼ iid N (0, 1). For this process,
$$\rho_1 = \frac{-\theta_1 + \theta_1\theta_2}{1 + \theta_1^2 + \theta_2^2} = \frac{-0.9 + (0.9)(-0.7)}{1 + (0.9)^2 + (-0.7)^2} \approx -0.665$$
and
$$\rho_2 = \frac{-\theta_2}{1 + \theta_1^2 + \theta_2^2} = \frac{-(-0.7)}{1 + (0.9)^2 + (-0.7)^2} \approx 0.304.$$
Figure 4.3 displays the simulated MA(2) time series, the sample ACF, and the lag 1
and 2 scatterplots. There are pronounced “spikes” at k = 1 and k = 2 in the sample
ACF and little action elsewhere (for k > 2). The lagged scatterplots display negative
(positive) autocorrelation at lag 1 (2). All of these observations are consistent with the
MA(2) theory. Note that the error bounds at ±2/√100 = 0.2 correspond to those for a
white noise process, not an MA(2) process.
Figure 4.3: Upper left: MA(2) simulation with θ1 = 0.9, θ2 = −0.7, n = 100, and $\sigma_e^2 = 1$.
Upper right: Sample autocorrelation function rk . Lower left: Scatterplot of Yt versus
Yt−1 . Lower right: Scatterplot of Yt versus Yt−2 .
4.2.3 MA(q) process
MODEL: Suppose {et } is a zero mean white noise process with var(et ) = σe2 . The MA(q)
process is
Yt = et − θ1 et−1 − θ2 et−2 − · · · − θq et−q .
Standard calculations show that E(Yt ) = 0 and
$$\gamma_0 = \mathrm{var}(Y_t) = \sigma_e^2(1 + \theta_1^2 + \theta_2^2 + \cdots + \theta_q^2).$$
AUTOCORRELATION FUNCTION : For an MA(q) process,
$$\rho_k = \begin{cases}
1, & k = 0 \\
\dfrac{-\theta_k + \theta_1\theta_{k+1} + \theta_2\theta_{k+2} + \cdots + \theta_{q-k}\theta_q}{1 + \theta_1^2 + \theta_2^2 + \cdots + \theta_q^2}, & k = 1, 2, ..., q - 1 \\
\dfrac{-\theta_q}{1 + \theta_1^2 + \theta_2^2 + \cdots + \theta_q^2}, & k = q \\
0, & k > q.
\end{cases}$$
The salient feature is that the (population) ACF ρk is, in general, nonzero for lags
k = 1, 2, ..., q (individual values can be zero for particular choices of the θ’s).
For all lags k > q, the ACF ρk = 0.
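The MA(q) autocorrelation formula can be coded directly by viewing the process as a general linear process with weights Ψ0 = 1 and Ψj = −θj. The sketch below is in Python; the function name `ma_acf` is an illustrative choice. It is checked against the MA(2) values θ1 = 0.9 and θ2 = −0.7 from Example 4.2.

```python
def ma_acf(thetas, k):
    """rho_k for Y_t = e_t - theta_1 e_{t-1} - ... - theta_q e_{t-q}."""
    psi = [1.0] + [-th for th in thetas]      # Psi_0 = 1, Psi_j = -theta_j
    q = len(thetas)
    if k > q:
        return 0.0                            # the ACF cuts off after lag q
    gamma_k = sum(psi[i] * psi[i + k] for i in range(q - k + 1))
    gamma_0 = sum(p * p for p in psi)
    return gamma_k / gamma_0

# Example 4.2: theta_1 = 0.9, theta_2 = -0.7
print([round(ma_acf([0.9, -0.7], k), 3) for k in range(5)])
```

The common factor $\sigma_e^2$ cancels in the ratio, which is why it does not appear in the function.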
4.3 Autoregressive processes
The process
Yt = ϕ1 Yt−1 + ϕ2 Yt−2 + · · · + ϕp Yt−p + et
is called an autoregressive process of order p, denoted by AR(p).
• In this model, the value of the process at time t, Yt , is a weighted linear combination
of the values of the process from the previous p time points plus a “shock” or
“innovation” term et at time t.
• We assume that et , the innovation at time t, is independent of all previous process
values Yt−1 , Yt−2 , ....
• We continue to assume that E(Yt ) = 0. A nonzero mean µ could be added to the
model by replacing Yt with Yt − µ (for all t). This would not affect the stationarity
properties.
• This process (assuming that it is stationary) is a special case of the general linear
process defined at the beginning of this chapter.
4.3.1 AR(1) process
TERMINOLOGY : Take p = 1 in the general AR(p) process and we get
Yt = ϕYt−1 + et .
This is an AR(1) process. Note that if ϕ = 1, this process reduces to a random walk.
If ϕ = 0, this process reduces to white noise.
VARIANCE : Assuming that this process is stationary (it isn’t always), the variance of
Yt can be obtained in the following way. In the AR(1) equation, take variances of both
sides to get
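$$\begin{aligned}
\mathrm{var}(Y_t) &= \mathrm{var}(\phi Y_{t-1} + e_t) \\
&= \phi^2\,\mathrm{var}(Y_{t-1}) + \mathrm{var}(e_t) + \underbrace{2\phi\,\mathrm{cov}(Y_{t-1}, e_t)}_{=\,0} \\
&= \phi^2\,\mathrm{var}(Y_{t-1}) + \sigma_e^2.
\end{aligned}$$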
Assuming stationarity, var(Yt ) = var(Yt−1 ) = γ0 . Therefore, we have
$$\gamma_0 = \phi^2\gamma_0 + \sigma_e^2 \implies \gamma_0 = \frac{\sigma_e^2}{1 - \phi^2}.$$
Because γ0 > 0, this equation implies that $\phi^2 < 1$, that is, −1 < ϕ < 1.
AUTOCOVARIANCE : To find the autocovariance function γk , multiply both sides of
the AR(1) equation by Yt−k to get
Yt Yt−k = ϕYt−1 Yt−k + et Yt−k .
Taking expectations of both sides, we have
E(Yt Yt−k ) = ϕE(Yt−1 Yt−k ) + E(et Yt−k ).
We now make the following observations:
• Because et is independent of Yt−k (by assumption), we have
E(et Yt−k ) = E(et )E(Yt−k ) = 0.
• Because {Yt } is a zero mean process (by assumption), we have
γk = cov(Yt , Yt−k ) = E(Yt Yt−k ) − E(Yt )E(Yt−k ) = E(Yt Yt−k )
γk−1 = cov(Yt−1 , Yt−k ) = E(Yt−1 Yt−k ) − E(Yt−1 )E(Yt−k ) = E(Yt−1 Yt−k ).
From these two observations, we have established the following (recursive) relationship
for an AR(1) process:
γk = ϕγk−1 .
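When k = 1,
$$\gamma_1 = \phi\gamma_0 = \phi\left(\frac{\sigma_e^2}{1 - \phi^2}\right).$$
When k = 2,
$$\gamma_2 = \phi\gamma_1 = \phi^2\left(\frac{\sigma_e^2}{1 - \phi^2}\right).$$
This pattern continues for larger k. In general, the autocovariance function for an AR(1)
process is
$$\gamma_k = \phi^k\left(\frac{\sigma_e^2}{1 - \phi^2}\right), \quad \text{for } k = 0, 1, 2, ....$$
AUTOCORRELATION : For an AR(1) process,
$$\rho_k = \frac{\gamma_k}{\gamma_0} = \frac{\phi^k\left(\frac{\sigma_e^2}{1 - \phi^2}\right)}{\frac{\sigma_e^2}{1 - \phi^2}} = \phi^k, \quad \text{for } k = 0, 1, 2, ....$$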
IMPORTANT : For an AR(1) process, because −1 < ϕ < 1, the (population) ACF
$\rho_k = \phi^k$ decays exponentially (in magnitude) as k increases.
• If ϕ is close to ±1, then the decay will be slow.
• If ϕ is not close to ±1, then the decay will take place rapidly.
• If ϕ > 0, then all of the autocorrelations will be positive.
• If ϕ < 0, then the autocorrelations will alternate from negative (k = 1), to positive
(k = 2), to negative (k = 3), to positive (k = 4), and so on.
• Remember these theoretical patterns so that when we see sample ACFs (from real
data!), we can make sensible decisions about potential model selection.
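These patterns can be tabulated in a few lines. Below is a sketch in Python; the function names `ar1_acf` and `ar1_variance` are illustrative, and the four ϕ values are simply convenient choices for showing slow versus fast decay and constant versus alternating signs.

```python
def ar1_acf(phi, k):
    """rho_k = phi**k for a stationary AR(1) process (requires |phi| < 1)."""
    if abs(phi) >= 1:
        raise ValueError("stationarity requires |phi| < 1")
    return phi ** k

def ar1_variance(phi, sigma2=1.0):
    """gamma_0 = sigma2 / (1 - phi**2) for a stationary AR(1) process."""
    if abs(phi) >= 1:
        raise ValueError("stationarity requires |phi| < 1")
    return sigma2 / (1 - phi ** 2)

for phi in (0.9, -0.9, 0.5, -0.5):
    acf = [round(ar1_acf(phi, k), 3) for k in range(1, 7)]
    print(f"phi = {phi:4.1f}  gamma_0 = {ar1_variance(phi):.3f}  rho_1..rho_6 = {acf}")
```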
Figure 4.4: Population ACFs for AR(1) processes (autocorrelation plotted against lag k).
Upper left: ϕ = 0.9. Upper right: ϕ = −0.9. Lower left: ϕ = 0.5. Lower right: ϕ = −0.5.
Example 4.3. We use R to simulate four different AR(1) processes
Yt = ϕYt−1 + et ,
with et ∼ iid N (0, 1) and n = 100. We choose
• ϕ = 0.9 (|ρ1 | large, ACF should decay slowly, all ρk positive)
• ϕ = −0.9 (|ρ1 | large, ACF should decay slowly, ρk alternating in sign)
• ϕ = 0.5 (|ρ1 | moderate, ACF should decay more quickly, all ρk positive)
• ϕ = −0.5 (|ρ1 | moderate, ACF should decay more quickly, ρk alternating in sign).
These choices of ϕ are consistent with those in Figure 4.4, which depicts the true
(population) AR(1) autocorrelation functions.
Figure 4.5: AR(1) simulations with n = 100 and $\sigma_e^2 = 1$ (Yt plotted against time). Upper
left: ϕ = 0.9. Upper right: ϕ = −0.9. Lower left: ϕ = 0.5. Lower right: ϕ = −0.5.
• In Figure 4.5, note the differences between the series on the left (ϕ > 0) and the
series on the right (ϕ < 0).
– When ϕ > 0, the series tends to “hang together” (since ρ1 > 0).
– When ϕ < 0, there is more oscillation (since ρ1 < 0).
• In Figure 4.6, we display the sample autocorrelation functions. Compare the sample
ACFs to the theoretical ACFs in Figure 4.4. The fact that these figures do not agree
completely is a byproduct of the sample autocorrelations rk exhibiting sampling
variability. The error bounds at ±2/√100 = 0.2 correspond to those for a white
noise process, not an AR(1) process.
Figure 4.6: Sample ACFs for AR(1) simulations with n = 100 and $\sigma_e^2 = 1$. Upper left:
ϕ = 0.9. Upper right: ϕ = −0.9. Lower left: ϕ = 0.5. Lower right: ϕ = −0.5.
OBSERVATION : Suppose {et } is a zero mean white noise process with var(et ) = σe2 . We
now show that the AR(1) process
Yt = ϕYt−1 + et
can be written in the form of a general linear process
Yt = et + Ψ1 et−1 + Ψ2 et−2 + Ψ3 et−3 + · · · .
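To show this, write Yt−1 = ϕYt−2 + et−1 so that
$$\begin{aligned}
Y_t &= \phi Y_{t-1} + e_t \\
&= \phi(\phi Y_{t-2} + e_{t-1}) + e_t \\
&= e_t + \phi e_{t-1} + \phi^2 Y_{t-2}.
\end{aligned}$$
Substituting in Yt−2 = ϕYt−3 + et−2 , we get
$$\begin{aligned}
Y_t &= e_t + \phi e_{t-1} + \phi^2(\phi Y_{t-3} + e_{t-2}) \\
&= e_t + \phi e_{t-1} + \phi^2 e_{t-2} + \phi^3 Y_{t-3}.
\end{aligned}$$
Continuing this type of substitution indefinitely (the leftover term $\phi^m Y_{t-m}$ becomes
negligible as m grows when |ϕ| < 1), we get
$$Y_t = e_t + \phi e_{t-1} + \phi^2 e_{t-2} + \phi^3 e_{t-3} + \cdots.$$
Therefore, the AR(1) process is a special case of the general linear process with
$\Psi_j = \phi^j$, for j = 0, 1, 2, ....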
STATIONARITY CONDITION : The AR(1) process
Yt = ϕYt−1 + et
is stationary if and only if |ϕ| < 1, that is, if −1 < ϕ < 1. If |ϕ| ≥ 1, then the AR(1)
process is not stationary.
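The repeated substitution argument for the AR(1) process can be checked numerically. After m substitutions, Yt equals the first m terms of the linear process form plus the leftover term ϕ^m Yt−m, and that leftover term dies out only when |ϕ| < 1. The sketch below is in Python; the function names, the seed, and the starting value Y = 0 before the first shock are assumptions made for the illustration.

```python
import random

def ar1_path(phi, e, y0=0.0):
    """Run Y_t = phi * Y_{t-1} + e_t forward from the starting value y0."""
    y, prev = [], y0
    for shock in e:
        prev = phi * prev + shock
        y.append(prev)
    return y

def truncated_linear_form(phi, e, t, m):
    """First m terms of the expansion: sum_{j=0}^{m-1} phi**j * e_{t-j}."""
    return sum(phi ** j * e[t - j] for j in range(m))

rng = random.Random(0)
phi = 0.9
e = [rng.gauss(0, 1) for _ in range(300)]
y = ar1_path(phi, e)

t = 299
for m in (5, 20, 80):
    gap = y[t] - truncated_linear_form(phi, e, t, m)
    # The gap equals phi**m * Y_{t-m}, which shrinks geometrically when |phi| < 1
    print(m, round(gap, 6), round(phi ** m * y[t - m], 6))
```

Rerunning with a value such as ϕ = 1.05 shows the opposite behavior: the gap grows with m, which is one way to see why the linear process form is not available outside the stationary range.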
4.3.2 AR(2) process
The AR(2) process is
Yt = ϕ1 Yt−1 + ϕ2 Yt−2 + et .
• The current value of the process, Yt , is a weighted linear combination of the values
of the process from the previous two time periods, plus a random innovation (error)
at the current time.
• We continue to assume that E(Yt ) = 0. A nonzero mean µ could be added to the model
by replacing Yt with Yt − µ for all t.
• We continue to assume that et is independent of Yt−k , for all k = 1, 2, ....
• Just as the AR(1) model requires certain conditions for stationarity, the AR(2)
model does too. A thorough discussion of stationarity for the AR(2) model, and
higher order AR models, becomes very theoretical. We highlight only the basic
points.
TERMINOLOGY : First, we define the operator B to satisfy
BYt = Yt−1 ,
that is, B “backs up” the current value Yt one time unit to Yt−1 . For this reason, we call
B the backshift operator. Similarly,
$$B^2 Y_t = BBY_t = BY_{t-1} = Y_{t-2}.$$
In general, $B^k Y_t = Y_{t-k}$. Using this new notation, we can rewrite the AR(2) model
$$Y_t = \phi_1 Y_{t-1} + \phi_2 Y_{t-2} + e_t$$
in the following way:
$$Y_t = \phi_1 BY_t + \phi_2 B^2 Y_t + e_t.$$
Rewriting this equation, we get
$$Y_t - \phi_1 BY_t - \phi_2 B^2 Y_t = e_t \iff (1 - \phi_1 B - \phi_2 B^2)Y_t = e_t.$$
Finally, treating the B as a dummy variable for algebraic reasons (and using the more
conventional algebraic symbol x), we define the AR(2) characteristic polynomial as
$$\phi(x) = 1 - \phi_1 x - \phi_2 x^2$$
and the corresponding AR(2) characteristic equation to be
$$\phi(x) = 1 - \phi_1 x - \phi_2 x^2 = 0.$$
IMPORTANT : Characterizing the stationarity conditions for the AR(2) model is done by
examining this equation and the solutions to it; i.e., the roots of
$\phi(x) = 1 - \phi_1 x - \phi_2 x^2$.
NOTE : Applying the quadratic formula to the AR(2) characteristic equation, we see that
the roots of $\phi(x) = 1 - \phi_1 x - \phi_2 x^2$ are
$$x = \frac{\phi_1 \pm \sqrt{\phi_1^2 + 4\phi_2}}{-2\phi_2}.$$
• The roots are both real if $\phi_1^2 + 4\phi_2 > 0$.
• The roots are both complex if $\phi_1^2 + 4\phi_2 < 0$.
• There is a single real root with multiplicity 2 if $\phi_1^2 + 4\phi_2 = 0$.
STATIONARITY CONDITIONS : The AR(2) process is stationary when the roots of
$\phi(x) = 1 - \phi_1 x - \phi_2 x^2$ both exceed 1 in absolute value (or in modulus if the roots
are complex). This occurs if and only if
$$\phi_1 + \phi_2 < 1, \qquad \phi_2 - \phi_1 < 1, \qquad |\phi_2| < 1$$
(see Appendix B, p. 84, CC). These are the stationarity conditions for the AR(2)
model. A sketch of this stationarity region (in the ϕ1 -ϕ2 plane) appears in Figure 4.7.
RECALL: Define $i = \sqrt{-1}$ so that z = a + bi is a complex number. The modulus of
z = a + bi is
$$|z| = \sqrt{a^2 + b^2}.$$
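The two ways of checking AR(2) stationarity (roots of the characteristic polynomial versus the three inequalities) can be compared numerically. The sketch below is in Python; the function names are illustrative, and the parameter pairs in the loop are hypothetical test values. Python's `abs` applied to a complex number returns its modulus, so one check covers both real and complex roots.

```python
import cmath

def ar2_roots(phi1, phi2):
    """Roots of 1 - phi1*x - phi2*x**2 = 0 (requires phi2 != 0)."""
    disc = cmath.sqrt(phi1 ** 2 + 4 * phi2)
    return (phi1 + disc) / (-2 * phi2), (phi1 - disc) / (-2 * phi2)

def stationary_by_roots(phi1, phi2):
    """True when both roots lie outside the unit circle."""
    return all(abs(x) > 1 for x in ar2_roots(phi1, phi2))

def stationary_by_triangle(phi1, phi2):
    """True when (phi1, phi2) satisfies the three stationarity inequalities."""
    return phi1 + phi2 < 1 and phi2 - phi1 < 1 and abs(phi2) < 1

# Hypothetical parameter pairs; the last one violates phi1 + phi2 < 1
for phi1, phi2 in [(0.5, -0.5), (1.1, -0.3), (-0.5, 0.25), (1.0, -0.5), (0.5, 0.6)]:
    moduli = [round(abs(x), 3) for x in ar2_roots(phi1, phi2)]
    print((phi1, phi2), moduli,
          stationary_by_roots(phi1, phi2), stationary_by_triangle(phi1, phi2))
```

The two Boolean columns should agree for every pair away from the boundary of the triangle; right on the boundary, floating point rounding can tip either check.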
AUTOCORRELATION FUNCTION : To derive the population ACF for an AR(2) process,
start with the AR(2) model equation
Yt = ϕ1 Yt−1 + ϕ2 Yt−2 + et
and multiply both sides by Yt−k to get
Yt Yt−k = ϕ1 Yt−1 Yt−k + ϕ2 Yt−2 Yt−k + et Yt−k .
Taking expectations of both sides gives
E(Yt Yt−k ) = ϕ1 E(Yt−1 Yt−k ) + ϕ2 E(Yt−2 Yt−k ) + E(et Yt−k ).
Figure 4.7: Stationarity region for the AR(2) model (ϕ2 plotted against ϕ1 , with regions
labeled “Real roots,” “Complex roots,” and “Outside stationarity region”). The point
(ϕ1 , ϕ2 ) must fall inside the triangular region to satisfy the stationarity conditions.
Points falling below the curve $\phi_1^2 + 4\phi_2 = 0$ correspond to complex roots. Those falling
above $\phi_1^2 + 4\phi_2 = 0$ correspond to real roots.
Because {Yt } is a zero mean process, E(Yt Yt−k ) = γk , E(Yt−1 Yt−k ) = γk−1 , and
E(Yt−2 Yt−k ) = γk−2 . Because et is independent of Yt−k , E(et Yt−k ) = E(et )E(Yt−k ) = 0.
This proves that
γk = ϕ1 γk−1 + ϕ2 γk−2 .
Dividing through by γ0 = var(Yt ) gives
ρk = ϕ1 ρk−1 + ϕ2 ρk−2 .
These are called the Yule-Walker equations for the AR(2) process.
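NOTE : For k = 1 and k = 2, the Yule-Walker equations provide
$$\begin{aligned}
\rho_1 &= \phi_1 + \phi_2\rho_1 \\
\rho_2 &= \phi_1\rho_1 + \phi_2,
\end{aligned}$$
where ρ0 = 1. Solving this system for ρ1 and ρ2 , we get
$$\rho_1 = \frac{\phi_1}{1 - \phi_2} \quad \text{and} \quad \rho_2 = \frac{\phi_1^2 + \phi_2 - \phi_2^2}{1 - \phi_2}.$$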
• Therefore, we have closed-form expressions for ρ1 and ρ2 in terms of ϕ1 and ϕ2 .
• If we want to find higher lag autocorrelations, we can use the (recursive) relation
ρk = ϕ1 ρk−1 + ϕ2 ρk−2 .
For example, ρ3 = ϕ1 ρ2 + ϕ2 ρ1 , ρ4 = ϕ1 ρ3 + ϕ2 ρ2 , and so on.
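The recursion is straightforward to code. Here is a minimal sketch in Python; the function name `ar2_acf` and the parameter values (0.5, −0.5) are illustrative choices. It starts from ρ0 = 1 and ρ1 = ϕ1/(1 − ϕ2) and then applies ρk = ϕ1 ρk−1 + ϕ2 ρk−2.

```python
def ar2_acf(phi1, phi2, max_lag):
    """rho_0, ..., rho_max_lag for a stationary AR(2) process via Yule-Walker."""
    rho = [1.0, phi1 / (1 - phi2)]
    for k in range(2, max_lag + 1):
        rho.append(phi1 * rho[k - 1] + phi2 * rho[k - 2])
    return rho[: max_lag + 1]

phi1, phi2 = 0.5, -0.5
rho = ar2_acf(phi1, phi2, 6)
print([round(r, 4) for r in rho])
# rho_2 from the recursion should match the closed-form expression
print(round((phi1 ** 2 + phi2 - phi2 ** 2) / (1 - phi2), 4))
```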
REMARK : For those of you that like formulas, it is possible to write out closed-form
expressions for the autocorrelations in an AR(2) process. Denote the roots of the AR(2)
characteristic polynomial by 1/G1 and 1/G2 and assume that these roots both exceed 1
in absolute value (or modulus). Straightforward algebra shows that
$$G_1 = \frac{\phi_1 - \sqrt{\phi_1^2 + 4\phi_2}}{2}, \qquad G_2 = \frac{\phi_1 + \sqrt{\phi_1^2 + 4\phi_2}}{2}.$$
• If G1 ≠ G2 , then
$$\rho_k = \frac{(1 - G_2^2)G_1^{k+1} - (1 - G_1^2)G_2^{k+1}}{(G_1 - G_2)(1 + G_1 G_2)}.$$
• If 1/G1 and 1/G2 are complex (i.e., when $\phi_1^2 + 4\phi_2 < 0$), then
$$\rho_k = R^k\,\frac{\sin(\Theta k + \Phi)}{\sin(\Phi)},$$
where $R = \sqrt{-\phi_2}$, $\Theta = \cos^{-1}\!\left(\phi_1 / (2\sqrt{-\phi_2})\right)$, and
$\Phi = \tan^{-1}\{[(1 - \phi_2)/(1 + \phi_2)]\tan\Theta\}$. (The factor tan Θ is needed for the
formula to return ρ1 = ϕ1 /(1 − ϕ2 ) at k = 1.)
• If G1 = G2 (i.e., when $\phi_1^2 + 4\phi_2 = 0$), then
$$\rho_k = \left[1 + k\left(\frac{1 + \phi_2}{1 - \phi_2}\right)\right]\left(\frac{\phi_1}{2}\right)^k.$$
DISCUSSION : Personally, I don’t think these formulas are all that helpful for computa-
tion purposes. So, why present them? After all, we could use the Yule-Walker equations
for computation.
• The formulas are helpful in that they reveal typical shapes of the AR(2) population
ACFs. This is important because when we see these shapes with real data (through
the sample ACFs), this will aid us in model selection/identification.
• Denote the roots of the AR(2) characteristic polynomial by 1/G1 and 1/G2 . If the
AR(2) process is stationary, then both of these roots are larger than 1 (in absolute
value or modulus). However,
|1/G1 | > 1, |1/G2 | > 1 =⇒ |G1 | < 1, |G2 | < 1.
Therefore, each of
$$\rho_k = \frac{(1 - G_2^2)G_1^{k+1} - (1 - G_1^2)G_2^{k+1}}{(G_1 - G_2)(1 + G_1 G_2)}$$
$$\rho_k = R^k\,\frac{\sin(\Theta k + \Phi)}{\sin(\Phi)}$$
$$\rho_k = \left[1 + k\left(\frac{1 + \phi_2}{1 - \phi_2}\right)\right]\left(\frac{\phi_1}{2}\right)^k$$
satisfies the following:
$$\rho_k \to 0, \quad \text{as } k \to \infty.$$
• Therefore, in an AR(2) process, the population autocorrelations ρk (in magnitude)
decay towards zero as k increases. Further inspection reveals that the decay is
exponential in nature. In addition, when the roots are complex, the values of ρk
resemble a sinusoidal pattern that dampens out as k increases.
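The damped sinusoid can be compared with the Yule-Walker recursion when the roots are complex. The sketch below is in Python, with illustrative function names. It takes the phase as tan Φ = [(1 − ϕ2)/(1 + ϕ2)] tan Θ, which is the choice that reproduces ρ1 = ϕ1/(1 − ϕ2) at k = 1, and it uses `atan2` so that the case Θ = π/2 (that is, ϕ1 = 0) causes no division by zero.

```python
import math

def ar2_acf_recursive(phi1, phi2, max_lag):
    """rho_0, ..., rho_max_lag from rho_k = phi1*rho_{k-1} + phi2*rho_{k-2}."""
    rho = [1.0, phi1 / (1 - phi2)]
    for k in range(2, max_lag + 1):
        rho.append(phi1 * rho[k - 1] + phi2 * rho[k - 2])
    return rho[: max_lag + 1]

def ar2_acf_complex(phi1, phi2, k):
    """R**k * sin(Theta*k + Phi) / sin(Phi); needs phi1**2 + 4*phi2 < 0."""
    if phi1 ** 2 + 4 * phi2 >= 0:
        raise ValueError("the characteristic roots are not complex")
    R = math.sqrt(-phi2)
    theta = math.acos(phi1 / (2 * R))
    # tan(Phi) = [(1 - phi2) / (1 + phi2)] * tan(Theta), written with atan2
    Phi = math.atan2((1 - phi2) * math.sin(theta), (1 + phi2) * math.cos(theta))
    return R ** k * math.sin(theta * k + Phi) / math.sin(Phi)

phi1, phi2 = 1.0, -0.5          # complex roots, since 1 - 2 < 0
recursive = ar2_acf_recursive(phi1, phi2, 10)
for k in range(0, 11, 2):
    print(k, round(recursive[k], 5), round(ar2_acf_complex(phi1, phi2, k), 5))
```

The factor R^k = (−ϕ2)^(k/2) supplies the exponential decay, while the sine term supplies the oscillation.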
Figure 4.8: Population ACFs for AR(2) processes (autocorrelation plotted against lag k).
Upper left: (ϕ1 , ϕ2 ) = (0.5, −0.5). Upper right: (ϕ1 , ϕ2 ) = (1.1, −0.3). Lower left:
(ϕ1 , ϕ2 ) = (−0.5, 0.25). Lower right: (ϕ1 , ϕ2 ) = (1, −0.5).
Example 4.4. We use R to simulate four AR(2) processes Yt = ϕ1 Yt−1 + ϕ2 Yt−2 + et ,
with et ∼ iid N (0, 1) and n = 100. We choose
• (ϕ1 , ϕ2 ) = (0.5, −0.5). CP: $\phi(x) = 1 - 0.5x + 0.5x^2$. Complex roots.
• (ϕ1 , ϕ2 ) = (1.1, −0.3). CP: $\phi(x) = 1 - 1.1x + 0.3x^2$. Two distinct (real) roots.
• (ϕ1 , ϕ2 ) = (−0.5, 0.25). CP: $\phi(x) = 1 + 0.5x - 0.25x^2$. Two distinct (real) roots.
• (ϕ1 , ϕ2 ) = (1, −0.5). CP: $\phi(x) = 1 - x + 0.5x^2$. Complex roots.
These choices of (ϕ1 , ϕ2 ) are consistent with those in Figure 4.8 that depict the true
(population) AR(2) autocorrelation functions.
Figure 4.9: AR(2) simulations with n = 100 and $\sigma_e^2 = 1$ (Yt plotted against time). Upper
left: (ϕ1 , ϕ2 ) = (0.5, −0.5). Upper right: (ϕ1 , ϕ2 ) = (1.1, −0.3). Lower left:
(ϕ1 , ϕ2 ) = (−0.5, 0.25). Lower right: (ϕ1 , ϕ2 ) = (1, −0.5).
• Consistent with the theory (see the population ACFs in Figure 4.8), the first (upper
left), second (upper right), and the fourth (lower right) series do “hang together;”
this is because of the positive lag 1 autocorrelation. The third series (lower left)
tends to oscillate, as we would expect since ρ1 < 0.
• The sample ACFs in Figure 4.10 resemble somewhat their theoretical counterparts
(at least at the first lag). Later lags generally deviate from the known theoretical
autocorrelations because the rk are estimates and exhibit sampling variability. The
error bounds at ±2/√100 = 0.2 correspond to those for a white noise process, not an
AR(2) process.
Figure 4.10: Sample ACFs for AR(2) simulations with n = 100 and $\sigma_e^2 = 1$. Upper
left: (ϕ1 , ϕ2 ) = (0.5, −0.5). Upper right: (ϕ1 , ϕ2 ) = (1.1, −0.3). Lower left:
(ϕ1 , ϕ2 ) = (−0.5, 0.25). Lower right: (ϕ1 , ϕ2 ) = (1, −0.5).
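VARIANCE : For the AR(2) process,
$$\gamma_0 = \mathrm{var}(Y_t) = \left(\frac{1 - \phi_2}{1 + \phi_2}\right)\frac{\sigma_e^2}{(1 - \phi_2)^2 - \phi_1^2}.$$
NOTE : The AR(2) model can be expressed as a general linear process
$$Y_t = e_t + \Psi_1 e_{t-1} + \Psi_2 e_{t-2} + \Psi_3 e_{t-3} + \cdots.$$
If 1/G1 and 1/G2 are the roots of the AR(2) characteristic polynomial, then
$$\Psi_j = \frac{G_1^{j+1} - G_2^{j+1}}{G_1 - G_2}, \qquad \Psi_j = R^j\,\frac{\sin[(j + 1)\Theta]}{\sin(\Theta)}, \qquad \Psi_j = (1 + j)\left(\frac{\phi_1}{2}\right)^j,$$
depending on whether G1 ≠ G2 , G1 and G2 are complex, or G1 = G2 , respectively. (In the
last case G1 = G2 = ϕ1 /2, which gives Ψ1 = ϕ1 as required.)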
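The Ψ weights can also be generated without any closed form. Matching coefficients in (1 − ϕ1 B − ϕ2 B²)(1 + Ψ1 B + Ψ2 B² + ···) = 1 gives Ψ0 = 1, Ψ1 = ϕ1, and Ψj = ϕ1 Ψj−1 + ϕ2 Ψj−2; this recursion is a standard identity that the notes do not state explicitly. The Python sketch below (function names are illustrative) uses it to check the variance formula through γ0 = σe² ΣΨj², and to check the repeated-root case, where the weights reduce to (1 + j)(ϕ1/2)^j.

```python
def ar2_psi(phi1, phi2, n_weights):
    """Psi_0, ..., Psi_{n_weights-1} from Psi_j = phi1*Psi_{j-1} + phi2*Psi_{j-2}."""
    psi = [1.0, phi1]
    for j in range(2, n_weights):
        psi.append(phi1 * psi[j - 1] + phi2 * psi[j - 2])
    return psi[:n_weights]

def ar2_variance(phi1, phi2, sigma2=1.0):
    """Closed-form gamma_0 for a stationary AR(2) process."""
    return ((1 - phi2) / (1 + phi2)) * sigma2 / ((1 - phi2) ** 2 - phi1 ** 2)

phi1, phi2 = 1.0, -0.5
psi = ar2_psi(phi1, phi2, 400)
print("sum of squared weights:", round(sum(p * p for p in psi), 6))
print("closed-form variance:  ", round(ar2_variance(phi1, phi2), 6))

# Repeated-root case: phi1**2 + 4*phi2 = 0, e.g. (phi1, phi2) = (1, -0.25)
psi_eq = ar2_psi(1.0, -0.25, 6)
print([round(p, 4) for p in psi_eq], [round((1 + j) * 0.5 ** j, 4) for j in range(6)])
```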
4.3.3 AR(p) process
RECALL: Suppose {et } is a zero mean white noise process with var(et ) = σe2 . The general
autoregressive process of order p, denoted AR(p), is
Yt = ϕ1 Yt−1 + ϕ2 Yt−2 + · · · + ϕp Yt−p + et .
In backshift operator notation, we can write the model as
$$(1 - \phi_1 B - \phi_2 B^2 - \cdots - \phi_p B^p)Y_t = e_t,$$
yielding the AR(p) characteristic equation
$$\phi(x) = 1 - \phi_1 x - \phi_2 x^2 - \cdots - \phi_p x^p = 0.$$
IMPORTANT : An AR(p) process is stationary if and only if the p roots of ϕ(x) each
exceed 1 in absolute value (or in modulus if the roots are complex).
• Consider an AR(1) process
$$Y_t = \phi Y_{t-1} + e_t \iff (1 - \phi B)Y_t = e_t.$$
The AR(1) characteristic polynomial is ϕ(x) = 1 − ϕx. Therefore,
$$\phi(x) = 1 - \phi x = 0 \implies x = \frac{1}{\phi}.$$
Clearly,
$$|x| = \left|\frac{1}{\phi}\right| > 1 \iff |\phi| < 1,$$
which was the stated stationarity condition for the AR(1) process.
• Consider an AR(2) process
$$Y_t = \phi_1 Y_{t-1} + \phi_2 Y_{t-2} + e_t \iff (1 - \phi_1 B - \phi_2 B^2)Y_t = e_t.$$
The AR(2) characteristic polynomial is $\phi(x) = 1 - \phi_1 x - \phi_2 x^2$, whose two roots are
$$x = \frac{\phi_1 \pm \sqrt{\phi_1^2 + 4\phi_2}}{-2\phi_2}.$$
The AR(2) process is stationary if and only if both roots are larger than 1 in
absolute value (or in modulus if complex). That is, both roots must lie outside the
unit circle.
• The same condition on the roots of ϕ(x) is needed for stationarity with any AR(p)
process.
YULE-WALKER EQUATIONS : Assuming stationarity and zero means, consider the
AR(p) process equation
$$Y_t = \phi_1 Y_{t-1} + \phi_2 Y_{t-2} + \cdots + \phi_p Y_{t-p} + e_t.$$
Multiplying both sides by Yt−k gives
$$Y_t Y_{t-k} = \phi_1 Y_{t-1}Y_{t-k} + \phi_2 Y_{t-2}Y_{t-k} + \cdots + \phi_p Y_{t-p}Y_{t-k} + e_t Y_{t-k}.$$
Taking expectations (and using E(et Yt−k ) = 0 for k ≥ 1) gives
$$\gamma_k = \phi_1\gamma_{k-1} + \phi_2\gamma_{k-2} + \cdots + \phi_p\gamma_{k-p}$$
and dividing through by the process variance γ0 , we get
$$\rho_k = \phi_1\rho_{k-1} + \phi_2\rho_{k-2} + \cdots + \phi_p\rho_{k-p}.$$
Plugging in k = 1, 2, ..., p, and using the fact that ρk = ρ−k , we get
$$\begin{aligned}
\rho_1 &= \phi_1 + \phi_2\rho_1 + \phi_3\rho_2 + \cdots + \phi_p\rho_{p-1} \\
\rho_2 &= \phi_1\rho_1 + \phi_2 + \phi_3\rho_1 + \cdots + \phi_p\rho_{p-2} \\
&\;\;\vdots \\
\rho_p &= \phi_1\rho_{p-1} + \phi_2\rho_{p-2} + \phi_3\rho_{p-3} + \cdots + \phi_p.
\end{aligned}$$
These are the Yule-Walker equations. For known values of ϕ1 , ϕ2 , ..., ϕp , we can
compute the first p autocorrelations ρ1 , ρ2 , ..., ρp by solving this linear system.
Values of ρk , for k > p, can be obtained by using the recursive relation above. The
AR(p) ACF tails off as k gets larger. It does so as a mixture of exponential decays
and/or damped sine waves, depending on whether the roots are real or complex.
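For a general AR(p), the Yule-Walker equations form a p × p linear system in ρ1, ..., ρp. The sketch below is in Python with the standard library only, so a small Gaussian elimination routine is written out by hand. The function names `solve_linear` and `ar_acf` and the AR(3) coefficients (0.5, −0.3, 0.1) are hypothetical choices for illustration; that model is stationary because |ϕ1| + |ϕ2| + |ϕ3| < 1.

```python
def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]   # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):                   # back substitution
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x

def ar_acf(phis, max_lag):
    """rho_0, ..., rho_max_lag for a stationary AR(p) process."""
    p = len(phis)
    a = [[0.0] * p for _ in range(p)]
    for k in range(1, p + 1):
        a[k - 1][k - 1] += 1.0
        for i in range(1, p + 1):
            lag = abs(k - i)                         # uses rho_{-j} = rho_j
            if lag > 0:
                a[k - 1][lag - 1] -= phis[i - 1]
    # Terms with lag 0 involve rho_0 = 1 and move to the right-hand side
    rho = [1.0] + solve_linear(a, list(phis))
    for k in range(p + 1, max_lag + 1):
        rho.append(sum(phis[i] * rho[k - 1 - i] for i in range(p)))
    return rho[: max_lag + 1]

print([round(r, 4) for r in ar_acf([0.5, -0.3, 0.1], 8)])
```

For p = 2 the system collapses to the two equations solved by hand earlier, so `ar_acf([0.5, -0.5], 4)` offers a quick consistency check.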
4.4 Invertibility
TERMINOLOGY : We define a process {Yt } to be invertible if it can be written as a
“mathematically meaningful” autoregressive process (possibly of infinite order).
Invertibility is an important theoretical property. For prediction purposes, it is
important to restrict our attention to the class of invertible models.
ILLUSTRATION : From the definition, we see that stationary autoregressive models are
automatically invertible. However, moving average models may not be. For example,
consider the MA(1) model
Yt = et − θet−1 ,
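or, slightly rewritten,
$$e_t = Y_t + \theta e_{t-1}.$$
Note that we can write
$$e_t = Y_t + \theta\underbrace{(Y_{t-1} + \theta e_{t-2})}_{=\,e_{t-1}} = Y_t + \theta Y_{t-1} + \theta^2 e_{t-2}.$$
Repeated similar substitution reveals that
$$e_t = Y_t + \theta Y_{t-1} + \theta^2 Y_{t-2} + \theta^3 Y_{t-3} + \cdots,$$
or slightly rewritten
$$Y_t = \underbrace{-\theta Y_{t-1} - \theta^2 Y_{t-2} - \theta^3 Y_{t-3} - \cdots + e_t}_{\text{“AR}(\infty)\text{”}}.$$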
• For this autoregressive representation to be “mathematically meaningful,” we need
the infinite series of θ coefficients to converge; that is, we need
$\sum_{j=1}^{\infty} |\theta|^j < \infty$. This occurs if and only if |θ| < 1.
• We have expressed an MA(1) as an infinite-order AR model. The MA(1) process
is invertible if and only if |θ| < 1.
• Compare this MA(1) “invertibility condition” with the stationarity condition of
|ϕ| < 1 for the AR(1) model.
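One practical reading of invertibility is that the unobserved shock et can be recovered from current and past observations. After m substitutions, et = Yt + θYt−1 + ··· + θ^(m−1) Yt−m+1 + θ^m et−m, so the truncation error is θ^m et−m. The Python sketch below (function names, seed, and θ values are illustrative assumptions) shows the error vanishing for θ = 0.5 and exploding for θ = 2.

```python
import random

def simulate_ma1_with_shocks(theta, n, seed=None):
    """Return (y, e) with y[t] = e_t - theta * e_{t-1}; y[t] pairs with e[t]."""
    rng = random.Random(seed)
    e = [rng.gauss(0, 1) for _ in range(n + 1)]
    y = [e[t] - theta * e[t - 1] for t in range(1, n + 1)]
    return y, e[1:]          # drop the start-up shock so the indices line up

def recover_shock(y, t, theta, m):
    """Approximate e_t by Y_t + theta*Y_{t-1} + ... + theta**(m-1) * Y_{t-m+1}."""
    return sum(theta ** j * y[t - j] for j in range(m))

errors = {}
for theta in (0.5, 2.0):
    y, e = simulate_ma1_with_shocks(theta, 200, seed=3)
    t = 199
    errors[theta] = abs(recover_shock(y, t, theta, 40) - e[t])
    print(f"theta = {theta}: |recovered - true e_t| = {errors[theta]:.3g}")
```

This is the computational face of the invertibility condition: with |θ| < 1 distant observations receive negligible weight, whereas with |θ| > 1 the weights grow without bound.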
IMPORTANCE : A model must be invertible for us to be able to identify the model
parameters associated with it. For example, for an MA(1) model, it is straightforward
to show that both of the following processes have the same autocorrelation function:
$$\begin{aligned}
Y_t &= e_t - \theta e_{t-1} \\
Y_t &= e_t - \frac{1}{\theta}\,e_{t-1}.
\end{aligned}$$
Put another way, if we knew the common ACF, we could not say if the MA(1) model
parameter was θ or 1/θ. Thus, we impose the condition that |θ| < 1 to ensure invertibility
(identifiability). Note that under this condition, the second MA(1) model, rewritten
$$Y_t = -\left(\frac{1}{\theta}\right)Y_{t-1} - \left(\frac{1}{\theta}\right)^2 Y_{t-2} - \left(\frac{1}{\theta}\right)^3 Y_{t-3} - \cdots + e_t,$$
is no longer meaningful because the series $\sum_{j=1}^{\infty} (1/\theta)^j$ diverges.
NOTE : Rewriting the MA(1) model using backshift notation, we see that
Yt = (1 − θB)et .
The function θ(x) = 1 − θx is called the MA(1) characteristic polynomial and
θ(x) = 1 − θx = 0
is called the MA(1) characteristic equation. The root of this equation is
$$x = \frac{1}{\theta}.$$
For this process to be invertible, we require the root of the characteristic equation to
exceed 1 (in absolute value). Doing so implies that |θ| < 1.
GENERALIZATION : The MA(q) process
$$\begin{aligned}
Y_t &= e_t - \theta_1 e_{t-1} - \theta_2 e_{t-2} - \cdots - \theta_q e_{t-q} \\
&= (1 - \theta_1 B - \theta_2 B^2 - \cdots - \theta_q B^q)e_t
\end{aligned}$$
is invertible if and only if the roots of the MA(q) characteristic polynomial
$\theta(x) = 1 - \theta_1 x - \theta_2 x^2 - \cdots - \theta_q x^q$ all exceed 1 in absolute value (or modulus).
SUMMARY : We have discussed two important theoretical properties of autoregressive
(AR) and moving average (MA) models, namely, stationarity and invertibility. Here
is a summary of the important findings.
• For an AR(p) process to be stationary, we need the roots of the AR characteristic
polynomial
$$\phi(x) = 1 - \phi_1 x - \phi_2 x^2 - \cdots - \phi_p x^p$$
to all exceed 1 in absolute value (or modulus).
• For an MA(q) process to be invertible, we need the roots of the MA characteristic
polynomial
$$\theta(x) = 1 - \theta_1 x - \theta_2 x^2 - \cdots - \theta_q x^q$$
to all exceed 1 in absolute value (or modulus).
• All invertible MA processes are stationary.
• All stationary AR processes are invertible.
• Any invertible MA(q) process corresponds to an infinite order AR process.
• Any stationary AR(p) process corresponds to an infinite order MA process.
4.5 Autoregressive moving average (ARMA) processes
The process
Yt = ϕ1 Yt−1 + ϕ2 Yt−2 + · · · + ϕp Yt−p + et − θ1 et−1 − θ2 et−2 − · · · − θq et−q
is an autoregressive moving average process of orders p and q, written ARMA(p, q).

AR(p) and MA(q) processes are each special cases of the ARMA(p, q) process.
• An ARMA(p, 0) process is the same as an AR(p) process.
• An ARMA(0, q) process is the same as an MA(q) process.
REMARK : A stationary time series may often be adequately modeled by an ARMA
model involving fewer parameters than a pure MA or AR process by itself. This is an
example of the Principle of Parsimony; i.e., finding a model with as few parameters
as possible, but which gives an adequate representation of the data.
BACKSHIFT NOTATION : The ARMA(p, q) process, expressed using backshift notation, is
$$(1 - \phi_1 B - \phi_2 B^2 - \cdots - \phi_p B^p)Y_t = (1 - \theta_1 B - \theta_2 B^2 - \cdots - \theta_q B^q)e_t$$
or, more succinctly, as
$$\phi(B)Y_t = \theta(B)e_t,$$
where
$$\begin{aligned}
\phi(B) &= 1 - \phi_1 B - \phi_2 B^2 - \cdots - \phi_p B^p \\
\theta(B) &= 1 - \theta_1 B - \theta_2 B^2 - \cdots - \theta_q B^q.
\end{aligned}$$
• For the ARMA(p, q) process to be stationary, we need the roots of the AR
characteristic polynomial $\phi(x) = 1 - \phi_1 x - \phi_2 x^2 - \cdots - \phi_p x^p$ to all exceed 1 in
absolute value (or modulus).
• For the ARMA(p, q) process to be invertible, we need the roots of the MA
characteristic polynomial $\theta(x) = 1 - \theta_1 x - \theta_2 x^2 - \cdots - \theta_q x^q$ to all exceed 1 in
absolute value (or modulus).
Example 4.5. Write each of the models
(i) Yt = 0.3Yt−1 + et
(ii) Yt = et − 1.3et−1 + 0.4et−2
(iii) Yt = 0.5Yt−1 + et − 0.3et−1 + 1.2et−2
(iv) Yt = 0.4Yt−1 + 0.45Yt−2 + et + et−1 + 0.25et−2
using backshift notation and determine whether the model is stationary and/or invertible.
Solutions.
(i) The model in (i) is an AR(1) with ϕ = 0.3. In backshift notation, this model is
(1 − 0.3B)Yt = et . The characteristic polynomial is
ϕ(x) = 1 − 0.3x,
which has the root x = 10/3. Because this root exceeds 1 in absolute value, this
process is stationary. The process is also invertible since it is a stationary AR
process.
(ii) The model in (ii) is an MA(2) with θ1 = 1.3 and θ2 = −0.4. In backshift notation,
this model is $Y_t = (1 - 1.3B + 0.4B^2)e_t$. The characteristic polynomial is
$$\theta(x) = 1 - 1.3x + 0.4x^2,$$
which has roots x = 2 and x = 1.25. Because these roots both exceed 1 in absolute
value, this process is invertible. The process is also stationary since it is an invertible
MA process.
(iii) The model in (iii) is an ARMA(1,2) with ϕ1 = 0.5, θ1 = 0.3 and θ2 = −1.2. In
backshift notation, this model is $(1 - 0.5B)Y_t = (1 - 0.3B + 1.2B^2)e_t$. The AR
characteristic polynomial is
$$\phi(x) = 1 - 0.5x,$$
which has the root x = 2. Because this root is greater than 1, this process is
stationary. The MA characteristic polynomial is
$$\theta(x) = 1 - 0.3x + 1.2x^2,$$
which has roots x ≈ 0.125 ± 0.904i. The modulus of each root is
$$|x| \approx \sqrt{(0.125)^2 + (0.904)^2} \approx 0.913,$$
which is less than 1. Therefore, this process is not invertible.
PAGE 109
(iv) The model in (iv), at first glance, appears to be an ARMA(2,2) with ϕ1 = 0.4,
ϕ2 = 0.45, θ1 = −1, and θ2 = −0.25. In backshift notation, this model is written
as
(1 − 0.4B − 0.45B 2 )Yt = (1 + B + 0.25B 2 )et .
However, the AR and MA characteristic polynomials in this instance factor as
(1 + 0.5B)(1 − 0.9B)Yt = (1 + 0.5B)(1 + 0.5B)et .
In (mixed) ARMA models, the AR and MA characteristic polynomials cannot
share any common factors. Here, they do; namely, (1 + 0.5B). Canceling, we have
(1 − 0.9B)Yt = (1 + 0.5B)et ,
which we identify as an ARMA(1,1) model with ϕ1 = 0.9 and θ1 = −0.5. This
process is stationary since the root of ϕ(x) = 1 − 0.9x is x = 10/9 > 1. This process
is invertible since the root of θ(x) = 1 + 0.5x is x = −2, which exceeds 1 in absolute
value.
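ILLUSTRATION : The root checks in Example 4.5 can be reproduced numerically. The following is a minimal Python sketch (Python is used here only for illustration; the function names char_roots and all_outside_unit_circle are hypothetical). It assumes a characteristic polynomial of degree 1 or 2 with a nonzero leading coefficient, written as 1 − c1 x − c2 x^2.

```python
import cmath

def char_roots(coefs):
    """Roots of 1 - c1*x (one coefficient) or 1 - c1*x - c2*x^2 (two coefficients).

    Assumes the highest-order coefficient is nonzero.
    """
    if len(coefs) == 1:
        return [1.0 / coefs[0]]
    c1, c2 = coefs
    # Solve -c2*x^2 - c1*x + 1 = 0 with the quadratic formula.
    a, b, c = -c2, -c1, 1.0
    disc = cmath.sqrt(b * b - 4 * a * c)
    return [(-b + disc) / (2 * a), (-b - disc) / (2 * a)]

def all_outside_unit_circle(coefs):
    """True when every root exceeds 1 in absolute value (modulus)."""
    return all(abs(r) > 1 for r in char_roots(coefs))

# Example 4.5 (iii): phi1 = 0.5, theta1 = 0.3, theta2 = -1.2
stationary = all_outside_unit_circle([0.5])        # AR root is x = 2
invertible = all_outside_unit_circle([0.3, -1.2])  # MA roots have modulus below 1
```

Passing the AR coefficients checks stationarity and passing the MA coefficients checks invertibility, since both conditions have the same form.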
AUTOCORRELATION FUNCTION : Take the ARMA(p, q) model equation
Yt = ϕ1 Yt−1 + ϕ2 Yt−2 + · · · + ϕp Yt−p + et − θ1 et−1 − θ2 et−2 − · · · − θq et−q
and multiply both sides by Yt−k to get
Yt Yt−k = ϕ1 Yt−1 Yt−k + ϕ2 Yt−2 Yt−k + · · · + ϕp Yt−p Yt−k
+ et Yt−k − θ1 et−1 Yt−k − θ2 et−2 Yt−k − · · · − θq et−q Yt−k .
Now take expectations of both sides. For k > q, we have
E(et Yt−k ) = E(et−1 Yt−k ) = · · · = E(et−q Yt−k ) = 0 so that
γk = ϕ1 γk−1 + ϕ2 γk−2 + · · · + ϕp γk−p .
Dividing through by the process variance γ0 , we get, for k > q,
ρk = ϕ1 ρk−1 + ϕ2 ρk−2 + · · · + ϕp ρk−p .
Plugging in k = 1, 2, ..., p, and using the fact that ρk = ρ−k , we arrive again at the
Yule-Walker equations:
ρ1 = ϕ1 + ϕ2 ρ1 + ϕ3 ρ2 + · · · + ϕp ρp−1
ρ2 = ϕ1 ρ1 + ϕ2 + ϕ3 ρ1 + · · · + ϕp ρp−2
⋮
ρp = ϕ1 ρp−1 + ϕ2 ρp−2 + ϕ3 ρp−3 + · · · + ϕp .
A similar system can be derived which involves θ1 , θ2 , ..., θq .
• The R function ARMAacf can compute autocorrelations numerically for any station-
ary ARMA(p, q) process (including those that are purely AR or MA).
• The ACF for the ARMA(p, q) process tails off after lag q in a manner similar to
the AR(p) process.
• However, unlike the AR(p) process, the first q autocorrelations depend on both
θ1 , θ2 , ..., θq and ϕ1 , ϕ2 , ..., ϕp .
SPECIAL CASE : Suppose that {et } is a zero mean white noise process with var(et ) = σe2 .
The process
Yt = ϕYt−1 + et − θet−1
is called an ARMA(1, 1) process. This is a special case of the ARMA(p, q) process
with p = q = 1. In backshift notation, the process can be written as
(1 − ϕB)Yt = (1 − θB)et
yielding ϕ(x) = 1 − ϕx and θ(x) = 1 − θx as the AR and MA characteristic polynomials,
respectively. As usual, the conditions for stationarity and invertibility are that the roots
of both polynomials exceed 1 in absolute value; that is, |ϕ| < 1 and |θ| < 1.
MOMENTS : The calculations on pp 78-79 (CC) show that
$$ \gamma_0 = \left( \frac{1 - 2\phi\theta + \theta^2}{1 - \phi^2} \right) \sigma_e^2, $$
γ1 = ϕγ0 − θσe2 , and γk = ϕγk−1 , for k ≥ 2. The autocorrelation function is shown to
satisfy
$$ \rho_k = \left[ \frac{(1 - \theta\phi)(\phi - \theta)}{1 - 2\theta\phi + \theta^2} \right] \phi^{k-1}, \quad k \geq 1. $$
Figure 4.11: Population ACFs for ARMA(1,1) processes. Upper left: (ϕ, θ) =
(0.9, −0.25). Upper right: (ϕ, θ) = (−0.9, −0.25). Lower left: (ϕ, θ) = (0.5, −0.25).
Lower right: (ϕ, θ) = (−0.5, −0.25).
Note that when k = 1, ρ1 is equal to a quantity that depends on both ϕ and θ. This is
different from the AR(1) process, where ρ1 depends on ϕ only. However, as k gets larger,
the autocorrelation ρk decays in a manner similar to the AR(1) process. Figure 4.11
displays some different ARMA(1,1) ACFs.
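ILLUSTRATION : The population ACF of an ARMA(1,1) process is easy to tabulate from its closed form. The Python sketch below is one way to do so (the function name arma11_acf is hypothetical, and the process is assumed stationary, |ϕ| < 1). It evaluates ρ1 from ϕ and θ and then applies the geometric decay ρk = ϕρk−1.

```python
def arma11_acf(phi, theta, max_lag):
    """Return [rho_1, ..., rho_max_lag] for a stationary ARMA(1,1) process."""
    if abs(phi) >= 1:
        raise ValueError("stationarity requires |phi| < 1")
    # Lag-1 autocorrelation depends on both phi and theta.
    rho1 = (1 - theta * phi) * (phi - theta) / (1 - 2 * theta * phi + theta ** 2)
    # Later lags decay geometrically at rate phi, as in an AR(1).
    return [rho1 * phi ** (k - 1) for k in range(1, max_lag + 1)]

# Parameter values from the upper-left panel of Figure 4.11.
acf = arma11_acf(0.9, -0.25, 20)
```

Setting theta = 0 reduces the result to the AR(1) autocorrelations ϕ, ϕ^2, ϕ^3, and so on.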
REMARK : That the ARMA(1,1) model can be written in the general linear process form
defined at the beginning of the chapter is shown on pp 78-79 (CC).
5 Models for Nonstationary Time Series
5.1 Introduction
RECALL: Suppose {et } is a zero mean white noise process with variance var(et ) = σe2 .
In the last chapter, we considered the class of ARMA models
Yt = ϕ1 Yt−1 + ϕ2 Yt−2 + · · · + ϕp Yt−p + et − θ1 et−1 − θ2 et−2 − · · · − θq et−q ,
or, expressed more succinctly,

ϕ(B)Yt = θ(B)et ,
where the AR and MA characteristic operators are
ϕ(B) = (1 − ϕ1 B − ϕ2 B 2 − · · · − ϕp B p )
θ(B) = (1 − θ1 B − θ2 B 2 − · · · − θq B q ).
• We learned that a process {Yt } in this class is stationary if and only if the roots of
the AR characteristic polynomial ϕ(x) all exceed 1 in absolute value (or modulus).
• We learned that a process {Yt } in this class is invertible if and only if the roots of
the MA characteristic polynomial θ(x) all exceed 1 in absolute value (or modulus).
• In this chapter, we extend this class of models to handle processes which are non-
stationary. We accomplish this by generalizing the class of ARMA models to
include differencing.
• Doing so gives rise to a much larger class of models, the autoregressive inte-
grated moving average (ARIMA) class. This class incorporates a wide range
of nonstationary time series processes.
TERMINOLOGY : Suppose that {Yt } is a stochastic process. The first difference
process {∇Yt } consists of
∇Yt = Yt − Yt−1 .
The second difference process {∇^2 Yt } consists of
∇^2 Yt = ∇(∇Yt ) = ∇Yt − ∇Yt−1
= (Yt − Yt−1 ) − (Yt−1 − Yt−2 )
= Yt − 2Yt−1 + Yt−2 .
In general, the dth difference process {∇^d Yt } consists of
∇^d Yt = ∇(∇^(d−1) Yt ) = ∇^(d−1) Yt − ∇^(d−1) Yt−1 ,
for d = 1, 2, .... We take ∇^0 Yt = Yt by convention.
Example 5.1. Suppose that {Yt } is a random walk process
Yt = Yt−1 + et ,
where {et } is zero mean white noise with variance var(et ) = σe2 . We know that {Yt } is not
stationary because its autocovariance function depends on t (see Chapter 2). However,
the first difference process
∇Yt = Yt − Yt−1 = et
is white noise, which is stationary.
• In Figure 5.1 (top), we display a simulated random walk process with n = 150 and
σe2 = 1. Note how the sample ACF of the series decays very slowly as the lag increases.
This is typical of a nonstationary series.
• The first difference (white noise) process also appears in Figure 5.1 (bottom), along
with its sample ACF. As we would expect from a white noise process, nearly all of
the sample autocorrelations rk are within the ±2/√n bounds.
• As this simple example shows, it is possible to “transform” a nonstationary process
into one that is stationary by taking differences.
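ILLUSTRATION : The idea in Example 5.1 can be sketched in Python (used here only for illustration). The helper difference is a hypothetical name, and the seed and starting value Y0 = 0 are arbitrary assumptions. The sketch simulates a random walk by accumulating white noise and then confirms that first differencing returns the white noise terms.

```python
import random

def difference(y, d=1):
    """Apply the difference operator d times; each pass shortens the series by one."""
    for _ in range(d):
        y = [y[t] - y[t - 1] for t in range(1, len(y))]
    return y

random.seed(520)                      # arbitrary seed, for reproducibility only
n = 150
e = [random.gauss(0.0, 1.0) for _ in range(n)]   # white noise with variance 1

# Random walk: Y_t = Y_{t-1} + e_t, started from Y_0 = 0.
y = []
level = 0.0
for shock in e:
    level += shock
    y.append(level)

w = difference(y)   # first differences; w[i] matches e[i + 1] up to rounding
```

Calling difference(y, 2) gives the second differences Yt − 2Yt−1 + Yt−2.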
Figure 5.1: Top: A simulated random walk process {Yt } and its sample ACF, with
n = 150 and σe2 = 1. Bottom: The first difference process {∇Yt } and its sample ACF.
LINEAR TREND MODELS : In Chapter 3, we talked about how to use regression meth-
ods to fit models of the form
Yt = µt + Xt ,
where µt is a deterministic trend function and where {Xt } is a stochastic process with
E(Xt ) = 0. Suppose that {Xt } is stationary and that the true trend function is
µt = β0 + β1 t,
a linear function of time. Clearly, {Yt } is not a stationary process because
E(Yt ) = E(β0 + β1 t + Xt )
= β0 + β1 t + E(Xt ) = β0 + β1 t,
which depends on t. The first differences are given by
∇Yt = Yt − Yt−1 = (β0 + β1 t + Xt ) − [β0 + β1 (t − 1) + Xt−1 ] = β1 + Xt − Xt−1 .
Note that
E(∇Yt ) = E(β1 + Xt − Xt−1 ) = β1 + E(Xt ) − E(Xt−1 ) = β1 .
Also,
cov(∇Yt , ∇Yt−k ) = cov(β1 + Xt − Xt−1 , β1 + Xt−k − Xt−k−1 )
= cov(Xt , Xt−k ) − cov(Xt , Xt−k−1 )
− cov(Xt−1 , Xt−k ) + cov(Xt−1 , Xt−k−1 ).
Because {Xt } is stationary, each of these covariance terms does not depend on t.
Therefore, both E(∇Yt ) and cov(∇Yt , ∇Yt−k ) are free of t; i.e., {∇Yt } is a stationary process.
Taking first differences removes a linear deterministic trend.
QUADRATIC TRENDS : Suppose that the true deterministic trend model is
µt = β0 + β1 t + β2 t^2 ,
a quadratic function of time. Clearly, {Yt } is not a stationary process since E(Yt ) = µt .
The first difference process consists of
∇Yt = Yt − Yt−1 = (β0 + β1 t + β2 t^2 + Xt ) − [β0 + β1 (t − 1) + β2 (t − 1)^2 + Xt−1 ]
= (β1 − β2 ) + 2β2 t + Xt − Xt−1
and E(∇Yt ) = β1 − β2 + 2β2 t, which depends on t. Therefore, {∇Yt } is not a stationary
process. The second difference process consists of
∇^2 Yt = ∇Yt − ∇Yt−1
= [(β1 − β2 ) + 2β2 t + Xt − Xt−1 ] − [(β1 − β2 ) + 2β2 (t − 1) + Xt−1 − Xt−2 ]
= 2β2 + Xt − 2Xt−1 + Xt−2 .
Therefore, E(∇^2 Yt ) = 2β2 and cov(∇^2 Yt , ∇^2 Yt−k ) are free of t. This shows that {∇^2 Yt }
is stationary. Taking second differences removes a quadratic deterministic trend.
Figure 5.2: Ventilation measurements at 15 second intervals. Top: Ventilation series {Yt }
with sample ACF. Bottom: First difference process {∇Yt } with sample ACF.
GENERALIZATION : Suppose that Yt = µt + Xt , where µt is a deterministic trend
function and {Xt } is a stationary process with E(Xt ) = 0. In general, if
µt = β0 + β1 t + β2 t^2 + · · · + βd t^d
is a polynomial in t of degree d, then the dth difference process {∇^d Yt } is stationary.
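ILLUSTRATION : The effect of differencing on the deterministic part of the model can be checked numerically. The Python sketch below uses a hypothetical cubic trend (the coefficients in beta are made up) and a hypothetical helper difference. It differences the trend d = 3 times and shows that what remains is a constant, namely d! times the leading coefficient, in line with the values β1 (for d = 1) and 2β2 (for d = 2) found above.

```python
from math import factorial

def difference(y, d=1):
    """Apply the difference operator d times."""
    for _ in range(d):
        y = [y[t] - y[t - 1] for t in range(1, len(y))]
    return y

beta = [5.0, -2.0, 0.5, 0.25]   # hypothetical coefficients beta_0, ..., beta_3
d = len(beta) - 1               # degree of the polynomial trend

# Deterministic trend mu_t evaluated at t = 1, ..., 20.
mu = [sum(b * t ** j for j, b in enumerate(beta)) for t in range(1, 21)]

dth_diff = difference(mu, d)               # every entry is the same constant
expected_constant = factorial(d) * beta[d]
```

Differencing fewer than d times leaves a trend that still depends on t, which is why {∇Yt } was not stationary in the quadratic case.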
Example 5.2. The data in Figure 5.2 are ventilation observations (L/min) on a single
cyclist recorded every 15 seconds during exercise. Source: Joe Alemany (Spring, 2010).
• The ventilation time series {Yt } does not resemble a stationary process. There is a
pronounced increasing linear trend over time. The sample ACF for the series
reinforces this impression: it decays very slowly as the lag increases, which is a
strong indication of nonstationarity.
• The first difference series {∇Yt } does resemble a process with a constant mean. In
fact, the sample ACF for {∇Yt } looks like what we would expect from an MA(1)
process (i.e., a pronounced spike at k = 1 and little action elsewhere).
• To summarize, the evidence in Figure 5.2 suggests an MA(1) model for the differ-
ence process {∇Yt }.
5.2 Autoregressive integrated moving average (ARIMA) models
TERMINOLOGY : A stochastic process {Yt } is said to follow an autoregressive
integrated moving average (ARIMA) model if the dth differences Wt = ∇^d Yt follow
a stationary ARMA model. There are three important values which characterize an
ARIMA process:
• p, the order of the autoregressive component
• d, the number of differences needed to arrive at a stationary ARMA(p, q) process
• q, the order of the moving average component.
In particular, we have the general relationship:
Yt is ARIMA(p, d, q) ⇐⇒ Wt = ∇^d Yt is ARMA(p, q).
RECALL: A stationary ARMA(p, q) process can be represented as
(1 − ϕ1 B − ϕ2 B 2 − · · · − ϕp B p )Yt = (1 − θ1 B − θ2 B 2 − · · · − θq B q )et
or, more succinctly, as
ϕ(B)Yt = θ(B)et ,
where {et } is zero mean white noise with variance var(et ) = σe2 . In the ARIMA(p, d, q)
family, take d = 1 so that
Wt = ∇Yt = Yt − Yt−1 = Yt − BYt = (1 − B)Yt
follows an ARMA(p, q) model. Therefore, an ARIMA(p, 1, q) process can be written
succinctly as
ϕ(B)(1 − B)Yt = θ(B)et .
Similarly, take d = 2 so that
Wt = ∇^2 Yt = Yt − 2Yt−1 + Yt−2
= Yt − 2BYt + B^2 Yt
= (1 − 2B + B^2 )Yt = (1 − B)^2 Yt
follows an ARMA(p, q) model. Therefore, an ARIMA(p, 2, q) process can be written as
ϕ(B)(1 − B)^2 Yt = θ(B)et .
In general, an ARIMA(p, d, q) process can be written as
ϕ(B)(1 − B)^d Yt = θ(B)et .
IMPORTANT : In practice (with real data), there will rarely be a need to consider values
of the differencing order d > 2. Most real time series data can be coerced into a
stationary ARMA process by taking one difference or occasionally two differences (perhaps
after transforming the series initially).
REMARK : Autoregressive (AR) models, moving average (MA) models, and autoregres-
sive moving average (ARMA) models are all members of the ARIMA(p, d, q) family. In
particular,
• AR(p) ←→ ARIMA(p, 0, 0)
• MA(q) ←→ ARIMA(0, 0, q)
• ARMA(p, q) ←→ ARIMA(p, 0, q)
• ARI(p, d) ←→ ARIMA(p, d, 0)
• IMA(d, q) ←→ ARIMA(0, d, q).
Figure 5.3: Top: ARI(1,1) simulation, with ϕ = 0.7, n = 150, and σe2 = 1, and the
sample ACF. Bottom: First difference process with sample ACF.
Example 5.3. Suppose {et } is a zero mean white noise process. Identify each model
(a) Yt = 1.7Yt−1 − 0.7Yt−2 + et
(b) Yt = 1.5Yt−1 − 0.5Yt−2 + et − et−1 + 0.25et−2
as an ARIMA(p, d, q) process. That is, specify the values of p, d, and q.
Solutions.
(a) Upon first glance,
Yt = 1.7Yt−1 − 0.7Yt−2 + et
looks like an AR(2) process with ϕ1 = 1.7 and ϕ2 = −0.7. However, upon closer
inspection, we see this process is not stationary because the AR(2) stationarity
conditions
ϕ1 + ϕ2 < 1, ϕ2 − ϕ1 < 1, |ϕ2 | < 1
are not met with ϕ1 = 1.7 and ϕ2 = −0.7 (in particular, the first condition is not
met, since ϕ1 + ϕ2 = 1). However, note that we can write this process as
Yt − 1.7Yt−1 + 0.7Yt−2 = et ⇐⇒ Yt − 1.7BYt + 0.7B^2 Yt = et
⇐⇒ (1 − 1.7B + 0.7B^2 )Yt = et
⇐⇒ (1 − 0.7B)(1 − B)Yt = et
⇐⇒ (1 − 0.7B)Wt = et ,
where
Wt = (1 − B)Yt = Yt − Yt−1
are the first differences. We identify {Wt } as a stationary AR(1) process with
ϕ = 0.7. Therefore, {Yt } is an ARIMA(1,1,0) ⇐⇒ ARI(1,1) process with ϕ = 0.7.
This ARI(1,1) process is simulated in Figure 5.3.
(b) Upon first glance,
Yt = 1.5Yt−1 − 0.5Yt−2 + et − et−1 + 0.25et−2
looks like an ARMA(2,2) process, but this process is not stationary either. To see
why, note that we can write this process as
Yt − 1.5Yt−1 + 0.5Yt−2 = et − et−1 + 0.25et−2
⇐⇒ (1 − 1.5B + 0.5B^2 )Yt = (1 − B + 0.25B^2 )et
⇐⇒ (1 − 0.5B)(1 − B)Yt = (1 − 0.5B)^2 et
⇐⇒ (1 − B)Yt = (1 − 0.5B)et
⇐⇒ Wt = (1 − 0.5B)et ,
where Wt = (1 − B)Yt = Yt − Yt−1 . Here, the first differences {Wt } follow an MA(1)
model with θ = 0.5. Therefore, {Yt } is an ARIMA(0,1,1) ⇐⇒ IMA(1,1) process
with θ = 0.5. A realization of this IMA(1,1) process is shown in Figure 5.4.
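ILLUSTRATION : The factoring in Example 5.3 can be checked mechanically. The Python sketch below (function names poly_mul and split_unit_roots are hypothetical) stores a polynomial in B as a list of coefficients in ascending powers of B. It divides out factors of (1 − B) for as long as x = 1 is a root; the number of factors removed is the differencing order d. The tolerance tol is an assumption made to absorb floating-point rounding.

```python
def poly_mul(a, b):
    """Multiply two polynomials in B (coefficients in ascending powers)."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] += ai * bj
    return out

def split_unit_roots(ar_poly, tol=1e-8):
    """Divide out (1 - B) while x = 1 is a root; return (d, remaining polynomial)."""
    d = 0
    poly = list(ar_poly)
    # x = 1 is a root exactly when the coefficients sum to zero.
    while len(poly) > 1 and abs(sum(poly)) < tol:
        # If p(B) = (1 - B) q(B), then q_0 = p_0 and q_i = p_i + q_{i-1}.
        quotient, acc = [], 0.0
        for coef in poly[:-1]:
            acc += coef
            quotient.append(acc)
        poly = quotient
        d += 1
    return d, poly

# Part (a): 1 - 1.7B + 0.7B^2 = (1 - 0.7B)(1 - B), so d = 1 and p = 1.
d_a, ar_a = split_unit_roots([1.0, -1.7, 0.7])

# Part (b): the AR side is (1 - 0.5B)(1 - B); the MA side is (1 - 0.5B)^2.
d_b, ar_b = split_unit_roots([1.0, -1.5, 0.5])
ma_check = poly_mul([1.0, -0.5], [1.0, -0.5])   # equals [1, -1, 0.25]
```

In part (b) the remaining AR factor (1 − 0.5B) also appears on the MA side, so it cancels and the model reduces to an IMA(1,1). The sketch does not perform that cancellation automatically.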
Figure 5.4: Top: IMA(1,1) simulation, with θ = 0.5, n = 150, and σe2 = 1, and the
sample ACF. Bottom: First difference process with sample ACF.
5.2.1 IMA(1,1) process
An ARIMA(p, d, q) process with p = 0, d = 1, and q = 1 is called an IMA(1,1) process
and is given by
Yt = Yt−1 + et − θet−1 .
This model is very popular in economics applications. Note that if θ = 0, the IMA(1,1)
process reduces to a random walk.
REMARK : We first note that an IMA(1,1) process can be written as
(1 − B)Yt = (1 − θB)et .
If we (mistakenly) treated this as an ARMA(1,1) process with characteristic operators
ϕ(B) = 1 − B
θ(B) = 1 − θB,
it would be clear that this process is not stationary since the AR characteristic polynomial
ϕ(x) = 1 − x has a unit root, that is, the root of ϕ(x) is x = 1. More appropriately, we
write
(1 − B)Yt = (1 − θB)et ⇐⇒ Wt = (1 − θB)et ,
and note that the first differences
Wt = (1 − B)Yt = Yt − Yt−1
follow an MA(1) model with parameter θ. From Chapter 4, we know that the first
difference process {Wt } is invertible if and only if |θ| < 1. To summarize,
{Yt } follows an IMA(1,1) ⇐⇒ {Wt } follows an MA(1).
5.2.2 IMA(2,2) process
An ARIMA(p, d, q) process with p = 0, d = 2, and q = 2 is called an IMA(2,2) process
and can be expressed as
(1 − B)^2 Yt = (1 − θ1 B − θ2 B^2 )et ,
or, equivalently,
∇^2 Yt = et − θ1 et−1 − θ2 et−2 .
In an IMA(2,2) process, the second differences
Wt = ∇^2 Yt = (1 − B)^2 Yt
follow an MA(2) model. Invertibility is assessed by examining the MA characteristic
operator θ(B) = 1 − θ1 B − θ2 B^2 . An IMA(2,2) process is simulated in Figure 5.5.
Figure 5.5: Top: IMA(2,2) simulation with n = 150, θ1 = 0.3, θ2 = −0.3, and σe2 = 1.
Middle: First difference process. Bottom: Second difference process.
• The defining characteristic of an IMA(2,2) process is its very strong autocorrelation
at all lags. This is also seen in the sample ACF.
• The first difference process {∇Yt }, which is that of an IMA(1,2), is also clearly
nonstationary to the naked eye. This is also seen in the sample ACF.
• The second difference process {∇^2 Yt } is an (invertible) MA(2) process. This is
suggested in the sample ACF for the second differences. Note how there are clear
spikes in the ACF at lags k = 1 and k = 2.
5.2.3 ARI(1,1) process
An ARIMA(p, d, q) process with p = 1, d = 1, and q = 0 is called an ARI(1,1) process
and can be expressed as
(1 − ϕB)(1 − B)Yt = et ,
or, equivalently,
Yt = (1 + ϕ)Yt−1 − ϕYt−2 + et .
Note that the first differences Wt = (1 − B)Yt satisfy the model
(1 − ϕB)Wt = et ,
which we recognize as an AR(1) process with parameter ϕ. The first difference process
{Wt } is stationary if and only if |ϕ| < 1.
REMARK : Upon first glance, the process
Yt = (1 + ϕ)Yt−1 − ϕYt−2 + et
looks like an AR(2) model. However, this process is not stationary since the coefficients
satisfy (1 + ϕ) − ϕ = 1; this violates the AR(2) stationarity requirement that the
two coefficients sum to less than 1.
An ARI(1,1) process is simulated in Figure 5.3.
5.2.4 ARIMA(1,1,1) process
An ARIMA(p, d, q) process with p = 1, d = 1, and q = 1 is called an ARIMA(1,1,1)
process and can be expressed as
(1 − ϕB)(1 − B)Yt = (1 − θB)et ,
or, equivalently,
Yt = (1 + ϕ)Yt−1 − ϕYt−2 + et − θet−1 .
Figure 5.6: Top: ARIMA(1,1,1) simulation, with n = 150, ϕ = 0.5, θ = −0.5, and
σe2 = 1, and the sample ACF. Bottom: First difference process with sample ACF.
Note that the first differences Wt = (1 − B)Yt satisfy the model
(1 − ϕB)Wt = (1 − θB)et ,
which we recognize as an ARMA(1,1) process with parameters ϕ and θ.
• The first difference process {Wt } is stationary if and only if |ϕ| < 1. The first
difference process {Wt } is invertible if and only if |θ| < 1.
• A simulated ARIMA(1,1,1) process appears in Figure 5.6. The ARIMA(1,1,1)
simulated series Yt is clearly nonstationary. The first difference series Wt = ∇Yt
appears to have a constant mean, and its sample ACF resembles that of a stationary
ARMA(1,1) process (as it should).
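ILLUSTRATION : The two equivalent forms of the ARIMA(1,1,1) model can be compared by simulation. The Python sketch below is only a rough illustration: the seed is arbitrary, and the start-up convention (all pre-sample values of Yt , Wt , and et set to zero) is an assumption, so the early values are not draws from the stationary distribution of {Wt }. Route 1 simulates the ARMA(1,1) differences and accumulates them; Route 2 applies the recursion for Yt directly.

```python
import random

random.seed(1)                       # arbitrary seed
phi, theta, n = 0.5, -0.5, 150       # parameter values used in Figure 5.6
e = [random.gauss(0.0, 1.0) for _ in range(n)]

# Route 1: W_t = phi*W_{t-1} + e_t - theta*e_{t-1}, then Y_t = Y_{t-1} + W_t.
w = [0.0] * n
w[0] = e[0]                          # pre-sample W and e taken to be zero
for t in range(1, n):
    w[t] = phi * w[t - 1] + e[t] - theta * e[t - 1]
y_cum, total = [], 0.0
for wt in w:
    total += wt
    y_cum.append(total)

# Route 2: Y_t = (1 + phi)*Y_{t-1} - phi*Y_{t-2} + e_t - theta*e_{t-1}.
y_rec = [0.0] * n
y_rec[0] = e[0]
y_rec[1] = (1 + phi) * y_rec[0] + e[1] - theta * e[0]
for t in range(2, n):
    y_rec[t] = ((1 + phi) * y_rec[t - 1] - phi * y_rec[t - 2]
                + e[t] - theta * e[t - 1])

max_gap = max(abs(a - b) for a, b in zip(y_cum, y_rec))   # rounding error only
```

The two routes agree up to floating-point rounding, which reflects the fact that {Yt } is ARIMA(1,1,1) exactly when its first differences are ARMA(1,1).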
5.3 Constant terms in ARIMA models
RECALL: An ARIMA(p, d, q) process can be written as
ϕ(B)(1 − B)^d Yt = θ(B)et ,
where {et } is zero mean white noise with var(et ) = σe2 . An extension of this model is
ϕ(B)(1 − B)^d Yt = θ0 + θ(B)et ,
where the parameter θ0 is a constant term.
IMPORTANT : The parameter θ0 plays very different roles when
• d = 0 (a stationary ARMA model)
• d > 0 (a nonstationary model).
STATIONARY CASE : Suppose that d = 0, in which case the no-constant model becomes
ϕ(B)Yt = θ(B)et ,
a stationary ARMA process, where the AR and MA characteristic operators are
ϕ(B) = (1 − ϕ1 B − ϕ2 B 2 − · · · − ϕp B p )
θ(B) = (1 − θ1 B − θ2 B 2 − · · · − θq B q ).
To examine the effects of adding a constant term, suppose that we replace Yt with Yt − µ,
where µ = E(Yt ). The model becomes
ϕ(B)(Yt − µ) = θ(B)et =⇒ ϕ(B)Yt − ϕ(B)µ = θ(B)et
=⇒ ϕ(B)Yt − (1 − ϕ1 − ϕ2 − · · · − ϕp )µ = θ(B)et
=⇒ ϕ(B)Yt = (1 − ϕ1 − ϕ2 − · · · − ϕp )µ + θ(B)et .
Writing θ0 for the constant (1 − ϕ1 − ϕ2 − · · · − ϕp )µ, we have
$$ \theta_0 = (1 - \phi_1 - \phi_2 - \cdots - \phi_p)\mu \iff \mu = \frac{\theta_0}{1 - \phi_1 - \phi_2 - \cdots - \phi_p}. $$
IMPORTANT : In a stationary ARMA process {Yt }, adding a constant term θ0 to the
model does not affect the stationarity properties of {Yt }.
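ILLUSTRATION : In the stationary case, the constant term θ0 and the process mean µ determine each other. A minimal Python sketch of the conversion follows; the function names are hypothetical, and the numerical values are made up for illustration.

```python
def mean_from_constant(theta0, phis):
    """Process mean mu implied by the constant theta0 and AR coefficients phis."""
    denom = 1 - sum(phis)
    if denom == 0:
        raise ValueError("AR coefficients sum to 1; the process is not stationary")
    return theta0 / denom

def constant_from_mean(mu, phis):
    """Constant term theta0 implied by the mean mu and AR coefficients phis."""
    return (1 - sum(phis)) * mu

# Hypothetical AR(1) with phi = 0.5 and theta0 = 2: the implied mean is 4.
mu = mean_from_constant(2.0, [0.5])
theta0 = constant_from_mean(mu, [0.5])
```

For a pure MA model the list of AR coefficients is empty, so θ0 and µ coincide.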
NONSTATIONARY CASE : The impact of adding a constant term θ0 to the model when
d > 0 is quite different. As the simplest example in the ARIMA(p, d, q) family, take
p = q = 0 and d = 1 so that
(1 − B)Yt = θ0 + et ⇐⇒ Yt = θ0 + Yt−1 + et .
This model is called a random walk with drift; see p. 22 (CC). Note that we can
write via successive substitution
Yt = θ0 + Yt−1 + et
= θ0 + (θ0 + Yt−2 + et−1 ) + et
= 2θ0 + Yt−2 + et + et−1
⋮
= (t − k)θ0 + Yk + et + et−1 + · · · + ek+1 .
Therefore, the process {Yt } contains a linear deterministic trend with slope θ0 .
IMPORTANT : The previous finding holds for any (nonstationary) ARIMA(p, 1, q) model,
that is, adding a constant term θ0 induces a linear deterministic trend. Also,
• adding a constant term θ0 to an ARIMA(p, 2, q) model induces a quadratic
deterministic trend,
• adding a constant term θ0 to an ARIMA(p, 3, q) model induces a cubic
deterministic trend, and so on.
Note that for very large t, the constant (deterministic trend) term can become very
dominating so that it forces the time series to follow a nearly deterministic pattern.
Therefore, a constant term should be added to a nonstationary ARIMA model (i.e.,
d > 0) only if it is strongly warranted.
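ILLUSTRATION : The linear trend induced by θ0 in a random walk with drift can be seen by simulation. In the Python sketch below, the drift value, seed, and starting value Y0 = 0 are arbitrary assumptions, and closed_form is a hypothetical helper. It compares the recursively generated series with the expression obtained by successive substitution: (t − k)θ0 , plus Yk , plus the shocks accumulated after time k.

```python
import random

random.seed(7)            # arbitrary seed
theta0 = 0.2              # hypothetical drift (constant term)
n = 100

# e[1], ..., e[n] are the shocks; e[0] is an unused placeholder.
e = [0.0] + [random.gauss(0.0, 1.0) for _ in range(n)]

# y[0] is the starting value Y_0; then Y_t = theta0 + Y_{t-1} + e_t.
y = [0.0] * (n + 1)
for t in range(1, n + 1):
    y[t] = theta0 + y[t - 1] + e[t]

def closed_form(t, k):
    """Y_t written in terms of Y_k, the drift, and the shocks after time k."""
    return (t - k) * theta0 + y[k] + sum(e[k + 1:t + 1])

gap = abs(closed_form(100, 10) - y[100])   # zero up to rounding
```

The term (t − k)θ0 grows without bound in t, which is why a nonzero constant eventually dominates the behavior of the series.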
5.4 Transformations
REVIEW : If we are trying to model a nonstationary time series, it may be helpful to
transform the data first before we examine any data differences (or before “detrending”
the data if we use regression methods from Chapter 3).
• For example, if there is clear evidence of nonconstant variance over time (e.g., the
variance increases over time, etc.), then a suitable transformation to the data
might remove (or lessen the impact of) the nonconstant variance pattern.
• Applying a transformation to address nonconstant variance is regarded as a “first
step.” This is done before using differencing as a means to achieve stationarity.
Example 5.4. Data file: electricity (TSA). Figure 5.7 displays monthly electricity
usage in the United States (usage from coal, natural gas, nuclear, petroleum, and wind)
between January, 1973 and December, 2005.
• From the plot, we can see that there is increasing variance over time; e.g., the series
is much more variable at later years than it is in earlier years.
• Time series that exhibit this “fanning out” shape are not stationary because the
variance changes over time.
• Before we try to model these data, we should first apply a transformation to make
the variance constant (that is, we would like to first “stabilize” the variance).
Figure 5.7: Electricity data. Monthly U.S. electricity generation, measured in millions of
kilowatt hours, from 1/1973 to 12/2005.
THEORY : Suppose that the variance of a nonstationary process {Yt } can be written as
var(Yt ) = c0 f (µt ),
where µt = E(Yt ) and c0 is a positive constant free of µt . Therefore, the variance is not
constant because it is a function of µt , which is changing over time. Our goal is to find a
function T so that the transformed series T (Yt ) has constant variance. Such a function is
called a variance stabilizing transformation function. Consider approximating the
function T by a first-order Taylor-series expansion about the point µt , that is,
T (Yt ) ≈ T (µt ) + T ′ (µt )(Yt − µt ),
where T ′ (µt ) is the first derivative of T , evaluated at µt . Now, note that
var[T (Yt )] ≈ var[T (µt ) + T ′ (µt )(Yt − µt )]
= [T ′ (µt )]^2 var(Yt ) = c0 [T ′ (µt )]^2 f (µt ).
Therefore, we want to find the function T which satisfies
$$ \mathrm{var}[T(Y_t)] \approx c_0 [T'(\mu_t)]^2 f(\mu_t) \stackrel{\text{set}}{=} c_1, $$
where c1 is a constant free of µt . Solving this expression for T ′ (µt ), we get the differential
equation
$$ T'(\mu_t) = \sqrt{\frac{c_1}{c_0 f(\mu_t)}} = \frac{c_2}{\sqrt{f(\mu_t)}}, $$
where c2 = √(c1 /c0 ) is free of µt . Integrating both sides, we get
$$ T(\mu_t) = \int \frac{c_2}{\sqrt{f(\mu_t)}} \, d\mu_t + c_3, $$
where c3 is a constant free of µt . In the calculations below, the values of c2 and c3 can
be taken to be anything, as long as they are free of µt .
• If var(Yt ) = c0 µt , so that the variance of the series is proportional to the mean,
then
$$ T(\mu_t) = \int \frac{c_2}{\sqrt{\mu_t}} \, d\mu_t = 2c_2\sqrt{\mu_t} + c_3, $$
where c3 is a constant free of µt . If we take c2 = 1/2 and c3 = 0, we see that the
square root of the series, T (Yt ) = √Yt , will provide a constant variance.
• If var(Yt ) = c0 µt^2 , so that the standard deviation of the series is proportional to the
mean, then
$$ T(\mu_t) = \int \frac{c_2}{\sqrt{\mu_t^2}} \, d\mu_t = c_2 \ln(\mu_t) + c_3, $$
where c3 is a constant free of µt . If we take c2 = 1 and c3 = 0, we see that the
logarithm of the series, T (Yt ) = ln(Yt ), will provide a constant variance.
• If var(Yt ) = c0 µt^4 , so that the standard deviation of the series is proportional to the
square of the mean, then
$$ T(\mu_t) = \int \frac{c_2}{\sqrt{\mu_t^4}} \, d\mu_t = c_2 \left( -\frac{1}{\mu_t} \right) + c_3, $$
where c3 is a constant free of µt . If we take c2 = −1 and c3 = 0, we see that the
reciprocal of the series, T (Yt ) = 1/Yt , will provide a constant variance.
BOX-COX TRANSFORMATIONS : More generally, we can use a power transformation
introduced by Box and Cox (1964). The transformation is defined by
$$ T(Y_t) = \begin{cases} \dfrac{Y_t^{\lambda} - 1}{\lambda}, & \lambda \neq 0 \\ \ln(Y_t), & \lambda = 0, \end{cases} $$
where λ is called the transformation parameter. Some common values of λ, and their
implied transformations are given in Table 5.1.
Table 5.1: Box-Cox transformation parameters λ and their associated transformations.
λ       T (Yt )     Description
−2.0    1/Yt^2      Inverse square
−1.0    1/Yt        Reciprocal
−0.5    1/√Yt       Inverse square root
 0.0    ln(Yt )     Logarithm
 0.5    √Yt         Square root
 1.0    Yt          Identity (no transformation)
 2.0    Yt^2        Square
NOTE : To see why the logarithm transformation T (Yt ) = ln(Yt ) is used when λ = 0,
note that by L’Hôpital’s Rule (from calculus),
$$ \lim_{\lambda \to 0} \frac{Y_t^{\lambda} - 1}{\lambda} = \lim_{\lambda \to 0} \frac{Y_t^{\lambda} \ln(Y_t)}{1} = \ln(Y_t). $$
• A variance stabilizing transformation can only be performed on a positive series,
that is, when Yt > 0, for all t. This turns out not to be prohibitive, because if some
or all of the series Yt is negative, we can simply add (the same) positive constant c
to each observation, where c is chosen so that everything becomes positive. Adding
c will not affect the (non)stationarity properties of {Yt }.
• Remember, a variance stabilizing transformation, if needed, should be performed
before taking any data differences.
• Frequently, a transformation performed to stabilize the variance will also improve
an approximation of normality. We will discuss the normality assumption later
(Chapters 7-8) when we address issues in statistical inference.
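ILLUSTRATION : The Box-Cox power transformation is simple to code directly. The Python sketch below (the function name box_cox is hypothetical) applies the transformation to a single positive value and treats λ = 0 as the logarithm. The last line evaluates the power form at a very small λ so that it can be compared with ln(y), illustrating the limiting argument numerically.

```python
import math

def box_cox(y, lam):
    """Box-Cox power transformation of a positive value y."""
    if y <= 0:
        raise ValueError("the transformation requires y > 0")
    if lam == 0:
        return math.log(y)
    return (y ** lam - 1) / lam

sqrt_scale = box_cox(4.0, 0.5)      # (sqrt(4) - 1) / 0.5
log_scale = box_cox(5.0, 0.0)       # ln(5)
near_zero = box_cox(5.0, 1e-6)      # close to ln(5)
```

Note that for λ ≠ 0 the result is a linear rescaling of the simple power y^λ, so, for example, λ = 0.5 stabilizes the variance in the same way that the plain square root does.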
Figure 5.8: Electricity data. Log-likelihood function versus λ. Note that λ is on the
horizontal axis. A 95 percent confidence interval for λ is also depicted.
DETERMINING λ: We can let the data “suggest” a suitable transformation in the
Box-Cox power family.
• We do this by treating λ as a parameter, writing the log-likelihood function of the
data (under the normality assumption), and finding the value of λ which maximizes
the log-likelihood function; i.e., the maximum likelihood estimate (MLE) of λ.
• There is an R function BoxCox.ar that does all of the calculations. The func-
tion also provides an approximate 95 percent confidence interval for λ, which is
constructed using the large sample properties of MLEs.
• The computations needed to produce a figure like the one in Figure 5.8 can be time
consuming if the series is long (i.e., n is large). Also, the profile log-likelihood is
not always as “smooth” as that seen in Figure 5.8.
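ILLUSTRATION : The idea of profiling the log-likelihood over λ can be sketched in Python under a much simpler model than the one a time series routine would fit. The sketch below assumes the transformed observations are independent normal with a constant mean, which ignores autocorrelation entirely; it is not a reimplementation of BoxCox.ar. The data are simulated (hypothetical), the function names are made up, and the cutoff 1.92 (half of the 0.95 quantile of a chi-square distribution with 1 degree of freedom) is the usual large-sample choice for an approximate 95 percent interval.

```python
import math
import random

def box_cox(y, lam):
    """Box-Cox power transformation of a positive value y."""
    return math.log(y) if lam == 0 else (y ** lam - 1) / lam

def profile_loglik(ys, lam):
    """Profile log-likelihood of lambda (up to an additive constant).

    Assumes independent normal data with constant mean on the transformed scale.
    """
    n = len(ys)
    z = [box_cox(y, lam) for y in ys]
    zbar = sum(z) / n
    s2 = sum((v - zbar) ** 2 for v in z) / n          # MLE of the variance
    jacobian = (lam - 1) * sum(math.log(y) for y in ys)
    return -0.5 * n * math.log(s2) + jacobian

random.seed(42)                                        # arbitrary seed
# Hypothetical positive data whose logarithm is normally distributed.
ys = [math.exp(random.gauss(1.0, 0.5)) for _ in range(500)]

grid = [k / 20 for k in range(-40, 41)]                # lambda from -2 to 2
loglik = {lam: profile_loglik(ys, lam) for lam in grid}
lam_hat = max(grid, key=lambda lam: loglik[lam])

# Approximate 95 percent interval: lambdas within 1.92 of the maximum.
interval = [lam for lam in grid if loglik[lam] >= loglik[lam_hat] - 1.92]
```

Because the simulated data are normal on the log scale, the maximizing value should land near λ = 0. With real, autocorrelated data the likelihood must account for the dependence, which is part of why the computation can be slow for long series.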
Figure 5.9: Electricity data (transformed). Monthly U.S. electricity generation measured
on the log scale.
Example 5.4 (continued). Figure 5.8 displays the profile log-likelihood of λ for the
electricity data. The value of λ (on the horizontal axis) that maximizes the log-likelihood
function looks to be λ ≈ −0.1, suggesting the transformation
T (Yt ) = Yt^(−0.1) .
However, this transformation makes little practical sense. An approximate 95 percent
confidence interval for λ looks to be about (−0.4, 0.2). Because λ = 0 is in this interval,
a log transformation T (Yt ) = ln(Yt ) is not unreasonable.
• The log-transformed series {ln Yt } is displayed in Figure 5.9. We see that applying
the log transformation has notably lessened the nonconstant variance (although
there still is a mild increase in the variance over time).
Figure 5.10: Electricity data. Left: Wt = log Yt − log Yt−1 , the first differences of the
log-transformed data. Right: The sample autocorrelation function of the {Wt } data.
• Now that we have applied the transformation, we can return to our previous
modeling techniques. For the log-transformed series, there is still a pronounced
linear trend over time. Therefore, we consider the first difference process (on the
log scale), given by
Wt = log Yt − log Yt−1 = ∇ log Yt .
• The {Wt } series is plotted in Figure 5.10 (left) along with the sample ACF of the
{Wt } series (right). The {Wt } series appears to have a constant mean.
• However, the sample ACF suggests that there is still a large amount of structure
in the data that remains after differencing the log-transformed series.
• In particular, there appear to be significant autocorrelations that arise according to
a seasonal pattern. We will consider seasonal processes that model this type of
variability in Chapter 10.
REMARK : Taking the differences of a log-transformed series, as we have done in this
example, often arises in financial applications where Yt (e.g., stock price, portfolio return,
etc.) tends to have stable percentage changes over time. See p. 99 (CC).
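ILLUSTRATION : The link between log differences and percentage changes rests on the approximation ln(1 + x) ≈ x for small x. The short Python sketch below uses a made-up (hypothetical) price series to compare the two quantities.

```python
import math

prices = [100.0, 102.0, 101.0, 104.0]   # hypothetical series Y_t

# First differences on the log scale: log Y_t - log Y_{t-1}.
log_diffs = [math.log(prices[t]) - math.log(prices[t - 1])
             for t in range(1, len(prices))]

# Proportional changes: (Y_t - Y_{t-1}) / Y_{t-1}.
pct_changes = [(prices[t] - prices[t - 1]) / prices[t - 1]
               for t in range(1, len(prices))]

largest_gap = max(abs(a - b) for a, b in zip(log_diffs, pct_changes))
```

The agreement is close when period-to-period changes are a few percent and deteriorates as the changes become large.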
6 Model Specification
6.1 Introduction
RECALL: Suppose that {et } is zero mean white noise with var(et ) = σe2 . In general, an
ARIMA(p, d, q) process can be written as
ϕ(B)(1 − B)^d Yt = θ(B)et ,
where
ϕ(B) = (1 − ϕ1 B − ϕ2 B^2 − · · · − ϕp B^p )
θ(B) = (1 − θ1 B − θ2 B^2 − · · · − θq B^q )
and
(1 − B)^d Yt = ∇^d Yt
is the series of dth differences. In this chapter, we discuss techniques on how to choose
suitable values of p, d, and q for an observed (or transformed) time series. We want our
choices to be consistent with the underlying structure of the observed data. Bad choices
of p, d, and q lead to bad models, which, in turn, lead to bad predictions (forecasts) of
future values.
6.2 The sample autocorrelation function
RECALL: For time series data Y1 , Y2 , ..., Yn , the sample autocorrelation function
(ACF), at lag k, is given by
$$ r_k = \frac{\sum_{t=k+1}^{n} (Y_t - \bar{Y})(Y_{t-k} - \bar{Y})}{\sum_{t=1}^{n} (Y_t - \bar{Y})^2}, $$
where Ȳ is the sample mean of Y1 , Y2 , ..., Yn .
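ILLUSTRATION : The sample ACF can be computed directly from its definition. The Python sketch below (the function name sample_acf is hypothetical) returns r1 , ..., rK for a list of observations. Python lists are indexed from 0, so the sum over t = k + 1, ..., n becomes a loop over positions k, ..., n − 1.

```python
def sample_acf(y, max_lag):
    """Sample autocorrelations r_1, ..., r_max_lag of the series y."""
    n = len(y)
    ybar = sum(y) / n
    dev = [v - ybar for v in y]                 # deviations from the sample mean
    denom = sum(d * d for d in dev)             # sum over all n observations
    return [sum(dev[t] * dev[t - k] for t in range(k, n)) / denom
            for k in range(1, max_lag + 1)]

# Tiny worked case: deviations are -2, -1, 0, 1, 2 and the denominator is 10.
r = sample_acf([1, 2, 3, 4, 5], 2)
```

For this series the lag-1 numerator is 2 + 0 + 0 + 2 = 4 and the lag-2 numerator is 0 − 1 + 0 = −1, giving r1 = 0.4 and r2 = −0.1.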
IMPORTANT : The sample autocorrelation rk is an estimate of the true (population)
autocorrelation ρk . As with any statistic, rk has a sampling distribution which
describes how it varies from sample to sample. We would like to know this distribution so
we can quantify the uncertainty in values of rk that we might see in practice.
THEORY : For a stationary ARMA(p, q) process,
$$ \sqrt{n}(r_k - \rho_k) \xrightarrow{d} N(0, c_{kk}), $$
as n → ∞, where
$$ c_{kk} = \sum_{l=-\infty}^{\infty} \left( \rho_l^2 + \rho_{l-k}\rho_{l+k} - 4\rho_k\rho_l\rho_{l-k} + 2\rho_k^2\rho_l^2 \right). $$
In other words, when the sample size n is large, the sample autocorrelation rk is
approximately normally distributed with mean ρk and variance ckk /n; i.e.,
$$ r_k \sim AN\left( \rho_k, \frac{c_{kk}}{n} \right). $$
We now examine some specific models and specialize this general result to those models.
1. WHITE NOISE: For a white noise process, the formula for ckk simplifies consid-
erably because nearly all the terms in the sum above are zero. For large n,
( )
1
rk ∼ AN 0, ,
n
√
for k = 1, 2, ...,. This explains why ±2/ n serve as approximate margin of error
bounds for rk . Values of rk outside these bounds would be “unusual” under the
white noise model assumption.
2. AR(1): For a stationary AR(1) process Yt = ϕYt−1 + et , the formula for ckk also
reduces considerably. For large n,
rk ∼ AN (ρk , σr2k ),
where ρk = ϕk and
[ ]
1 (1 + ϕ2 )(1 − ϕ2k )
σr2k = − 2kϕ .
2k
n 1 − ϕ2
PAGE 137
3. MA(1): For an invertible MA(1) process Yt = et − θet−1 , we treat the k = 1 and
k > 1 cases separately.
• Case 1: Lag k = 1. For large n,
$$r_1 \sim AN(\rho_1, \sigma^2_{r_1}),$$
where $\rho_1 = -\theta/(1+\theta^2)$ and
$$\sigma^2_{r_1} = \frac{1 - 3\rho_1^2 + 4\rho_1^4}{n}.$$
• Case 2: Lag k > 1. For large n,
$$r_k \sim AN(0, \sigma^2_{r_k}),$$
where
$$\sigma^2_{r_k} = \frac{1 + 2\rho_1^2}{n}.$$
4. MA(q): For an invertible MA(q) process,
Yt = et − θ1 et−1 − θ2 et−2 − · · · − θq et−q ,
the sample autocorrelation rk , for all k > q, satisfies
$$r_k \sim AN\left[ 0, \frac{1}{n}\left( 1 + 2\sum_{j=1}^{q} \rho_j^2 \right) \right],$$
when n is large.
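The four special cases above can be collected into small R helpers. This is only a sketch; the function names (`se_white_noise`, `se_ar1`, `se_ma1`, `se_maq`) are hypothetical and simply transcribe the large-sample variance formulas stated in the list.

```r
# Large-sample standard errors of r_k under the models listed above.
se_white_noise <- function(n) sqrt(1 / n)

se_ar1 <- function(phi, k, n) {
  sqrt(((1 + phi^2) * (1 - phi^(2 * k)) / (1 - phi^2) - 2 * k * phi^(2 * k)) / n)
}

se_ma1 <- function(theta, k, n) {
  rho1 <- -theta / (1 + theta^2)
  if (k == 1) sqrt((1 - 3 * rho1^2 + 4 * rho1^4) / n) else sqrt((1 + 2 * rho1^2) / n)
}

# For lags k > q in an MA(q) model; rho holds rho_1, ..., rho_q.
se_maq <- function(rho, n) sqrt((1 + 2 * sum(rho^2)) / n)

n <- 200
se_white_noise(n)
se_ar1(phi = 0.9, k = 1, n = n)
se_ma1(theta = -0.7, k = 1, n = n)
se_ma1(theta = -0.7, k = 2, n = n)
```

Setting ϕ = 0 in the AR(1) helper recovers the white noise standard error, which is a quick sanity check on the formula.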
REMARK : The MA(q) result above suggests a natural large-sample test for
H0 : MA(q) process is appropriate

versus
H1 : MA(q) process is not appropriate.
If H0 is true, then the sample autocorrelation
$$r_{q+1} \sim AN\left[ 0, \frac{1}{n}\left( 1 + 2\sum_{j=1}^{q} \rho_j^2 \right) \right].$$
Therefore, the random variable
$$Z = \frac{r_{q+1}}{\sqrt{\frac{1}{n}\left( 1 + 2\sum_{j=1}^{q} \rho_j^2 \right)}} \sim AN(0, 1).$$
We cannot use Z as a test statistic to test H0 versus H1 because Z depends on ρ1 , ρ2 , ..., ρq
which, in practice, are unknown. However, when n is large, we can use rj as an estimate
for ρj . This should not severely impact the large sample distribution of Z because rj
should be “close” to ρj when n is large. Making this substitution gives the large-sample
test statistic
$$Z^* = \frac{r_{q+1}}{\sqrt{\frac{1}{n}\left( 1 + 2\sum_{j=1}^{q} r_j^2 \right)}}.$$
When H0 is true, Z ∗ ∼ AN (0, 1). Therefore, a level α decision rule is to reject H0 in

favor of H1 when
|Z ∗ | > zα/2 ,
where zα/2 is the upper α/2 quantile from the N (0, 1) distribution. This is a two-
sided test. Of course, an equivalent decision rule is to reject H0 when the (two-sided)
probability value is less than α.
Example 6.1. From a time series of n = 200 observations, we calculate r1 = −0.49,

r2 = 0.31, r3 = −0.13, r4 = 0.07, and |rk | < 0.09 for k > 4. Which moving average (MA)
model is most consistent with these sample autocorrelations?
Solution. To test
H0 : MA(1) process is appropriate
versus
H1 : MA(1) process is not appropriate
we compute
$$z^* = \frac{r_2}{\sqrt{\frac{1}{n}(1 + 2r_1^2)}} = \frac{0.31}{\sqrt{\frac{1}{200}\left[ 1 + 2(-0.49)^2 \right]}} \approx 3.60.$$
This is not a reasonable value of Z ∗ under H0 ; e.g., the p-value is
pr(|Z ∗ | > 3.60) ≈ 0.0003.
Therefore, we would reject H0 and conclude that the MA(1) model is not appropriate.
To test
H0 : MA(2) process is appropriate

versus
H1 : MA(2) process is not appropriate
we compute
$$z^* = \frac{r_3}{\sqrt{\frac{1}{n}(1 + 2r_1^2 + 2r_2^2)}} = \frac{-0.13}{\sqrt{\frac{1}{200}\left[ 1 + 2(-0.49)^2 + 2(0.31)^2 \right]}} \approx -1.42.$$
This is not an unreasonable value of Z ∗ under H0 ; e.g., the p-value is
pr(|Z ∗ | > 1.42) ≈ 0.16.
Therefore, we would not reject H0 . An MA(2) model is not inconsistent with these
sample autocorrelations.
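The calculations in Example 6.1 can be reproduced with a few lines of R. This is a minimal sketch; the function name `ma_order_test` is hypothetical, and it only implements the large-sample statistic Z∗ and its two-sided p-value as defined above.

```r
# Large-sample test of H0: MA(q) is appropriate, using r_1, ..., r_{q+1}.
ma_order_test <- function(r, q, n) {
  se <- sqrt((1 + 2 * sum(r[seq_len(q)]^2)) / n)   # estimated standard error of r_{q+1}
  z  <- r[q + 1] / se
  c(z = z, p.value = 2 * pnorm(-abs(z)))           # two-sided p-value
}

r <- c(-0.49, 0.31, -0.13, 0.07)    # sample autocorrelations from Example 6.1
n <- 200
test_ma1 <- ma_order_test(r, q = 1, n = n)
test_ma2 <- ma_order_test(r, q = 2, n = n)
round(test_ma1, 4)
round(test_ma2, 4)
```

With q = 0 the sum is empty, so the same function gives the white noise test of r1 against $\pm z_{\alpha/2}/\sqrt{n}$.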
Example 6.2. Monte Carlo simulation. Consider the model
Yt = et + 0.7et−1 ,
an MA(1) process with θ = −0.7, where et ∼ iid N (0, 1) and n = 200. In this exam-
ple, we use a technique known as Monte Carlo simulation to simulate the sampling
distributions of the sample autocorrelations r1 , r2 , r5 , and r10 . Here is how this is done:
• We simulate an MA(1) process with θ = −0.7 and compute r1 with the simulated
data. Note that the R function arima.sim can be used to simulate this process.
• We repeat this simulation exercise a large number of times, say, M times. With
each simulated series, we compute r1 .
• If we simulate M different series, we will have M corresponding values of r1 .
• We can then plot the M values of r1 in a histogram. This histogram represents the
Monte Carlo sampling distribution of r1 .
• For each simulation, we can also record the values of r2 , r5 , and r10 . We can then
construct their corresponding histograms.
Figure 6.1: Monte Carlo simulation. Histograms of sample autocorrelations based on
M = 2000 Monte Carlo samples of size n = 200 taken from an MA(1) process with
θ = −0.7. Upper left: r1 . Upper right: r2 . Lower left: r5 . Lower right: r10 . The
histograms are approximations to the true sampling distributions when n = 200.
• Note that the approximate sampling distribution of r1 is centered around
$$\rho_1 = \frac{-(-0.7)}{1 + (-0.7)^2} \approx 0.47.$$
The other sampling distributions are centered around ρ2 = 0, ρ5 = 0, and ρ10 = 0,
as expected. All distributions take on a normal shape, also as expected.
• Important: The true large-sample distribution result
$$\sqrt{n}\,(r_k - \rho_k) \stackrel{d}{\longrightarrow} N(0, c_{kk})$$
is a result that requires the sample size n → ∞. With n = 200, we see that the
normal distribution (large-sample) property has largely taken shape.
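The Monte Carlo procedure in Example 6.2 might be coded along the following lines. This is a sketch, not necessarily the code used to produce Figure 6.1; the seed and object names are arbitrary. Note the sign convention: `arima.sim` writes the MA(1) model as Yt = et + (ma)et−1 , so θ = −0.7 in the notation of these notes corresponds to `ma = 0.7`.

```r
set.seed(520)                  # arbitrary seed, for reproducibility only
M <- 2000                      # number of Monte Carlo samples
n <- 200                       # length of each simulated series
lags <- c(1, 2, 5, 10)

# Each replicate: simulate an MA(1) series and record r_1, r_2, r_5, r_10.
r_sim <- t(replicate(M, {
  y <- arima.sim(model = list(ma = 0.7), n = n)
  acf(y, lag.max = 10, plot = FALSE)$acf[lags + 1]   # position 1 is lag 0
}))
colnames(r_sim) <- paste0("r", lags)

colMeans(r_sim)                # compare with rho_1 = 0.47 and rho_k = 0 for k > 1
apply(r_sim, 2, sd)            # Monte Carlo standard errors

op <- par(mfrow = c(2, 2))     # 2 x 2 grid of histograms
for (j in seq_along(lags)) hist(r_sim[, j], main = "", xlab = colnames(r_sim)[j])
par(op)
```

The Monte Carlo standard deviations can be compared with the large-sample standard errors for the MA(1) model given in Section 6.2.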
Figure 6.2: Simulated MA(1) and MA(2) processes with n = 100 and $\sigma_e^2 = 1$. Top: MA(1)
process (left) and its sample ACF (right). Bottom: MA(2) process (left) and its sample
ACF (right). Moving average error bounds are used in the corresponding sample ACFs,
not the white noise error bounds $\pm 2/\sqrt{n}$.
Example 6.3. We use R to generate data from two moving average processes:
1. Yt = et − 0.5et−1 ⇐⇒ MA(1), with θ = 0.5
2. Yt = et − 0.5et−1 + 0.5et−2 ⇐⇒ MA(2), with θ1 = 0.5 and θ2 = −0.5.
We take et ∼ iid N (0, 1) and n = 100. In Figure 6.2, we display the realized time series
and the corresponding sample autocorrelation functions (ACFs).
• However, instead of using the white noise margin of error bounds, that is,
$$\pm\frac{2}{\sqrt{n}} = \pm\frac{2}{\sqrt{100}} = \pm 0.2,$$
we use the more precise error bounds from the large sample distribution
$$r_k \sim AN\left[ 0, \frac{1}{n}\left( 1 + 2\sum_{j=1}^{q} \rho_j^2 \right) \right].$$
• In particular, for each lag k, the (estimated) standard error bounds are placed at
$$\pm 1.96\sqrt{\frac{1}{100}\left( 1 + 2\sum_{j=1}^{k-1} r_j^2 \right)}.$$
• That is, error bounds at lag k are computed assuming that the MA(k − 1) model is
appropriate. Values of rk which exceed these bounds are deemed to be statistically
significant. Note that the MA error bounds are not constant, unlike those computed
under the white noise assumption.
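The lag-dependent bounds described in Example 6.3 can be computed by hand. The sketch below simulates the MA(1) process from the example (with an arbitrary seed, so the series will differ from the one in Figure 6.2) and evaluates the bound at each lag. As far as I know, calling `acf` with `ci.type = "ma"` draws bounds of this type on the plot; the manual calculation is shown so the formula is explicit.

```r
set.seed(520)                           # arbitrary seed
n <- 100
# arima.sim uses Y_t = e_t + ma * e_{t-1}; theta = 0.5 in the notes is ma = -0.5.
y <- arima.sim(model = list(ma = -0.5), n = n)

lag_max <- 20
r <- acf(y, lag.max = lag_max, plot = FALSE)$acf[-1]   # r_1, ..., r_20

# Bound at lag k assumes an MA(k - 1) model, so it uses r_1, ..., r_{k-1}.
cum_r2   <- c(0, cumsum(r^2))[1:lag_max]
ma_bound <- 1.96 * sqrt((1 + 2 * cum_r2) / n)
wn_bound <- 2 / sqrt(n)                                # white noise bound, for comparison

data.frame(lag = 1:lag_max, r = round(r, 3), ma_bound = round(ma_bound, 3),
           significant = abs(r) > ma_bound)
```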
6.3 The partial autocorrelation function
RECALL: We have seen that for MA(q) models, the population ACF ρk is nonzero for
lags k ≤ q and ρk = 0 for lags greater than q. That is, the ACF for an MA(q) process
“drops off” to zero after lag q.
• Therefore, the ACF provides a considerable amount of information about the order
of the dependence when the process is truly a moving average.
• On the other hand, if the process is autoregressive (AR), then the ACF may not
tell us much about the order of the dependence.
• It is therefore worthwhile to develop a function that will behave like the ACF for
MA models, but for use with AR models instead. This function is called the partial
autocorrelation function (PACF).
MOTIVATION : To fix ideas, consider a stationary, zero mean AR(1) process
Yt = ϕYt−1 + et ,
where {et } is zero mean white noise. The autocovariance between Yt and Yt−2 is
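$$\begin{aligned}
\gamma_2 &= \mathrm{cov}(Y_t, Y_{t-2}) \\
&= \mathrm{cov}(\phi Y_{t-1} + e_t, Y_{t-2}) \\
&= \mathrm{cov}[\phi(\phi Y_{t-2} + e_{t-1}) + e_t, Y_{t-2}] \\
&= \mathrm{cov}(\phi^2 Y_{t-2} + \phi e_{t-1} + e_t, Y_{t-2}) \\
&= \phi^2\,\mathrm{cov}(Y_{t-2}, Y_{t-2}) + \phi\,\mathrm{cov}(e_{t-1}, Y_{t-2}) + \mathrm{cov}(e_t, Y_{t-2}) \\
&= \phi^2\,\mathrm{var}(Y_{t-2}) + 0 + 0 = \phi^2\gamma_0,
\end{aligned}$$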
where γ0 = var(Yt ) = var(Yt−2 ). Recall that et−1 and et are independent of Yt−2 .
• Note that if Yt followed an MA(1) process, then γ2 = cov(Yt , Yt−2 ) = 0.
• This is not true for an AR(1) process because Yt depends on Yt−2 through Yt−1 .
STRATEGY : Suppose that we “break” the dependence between Yt and Yt−2 in an AR(1)
process by removing (or partialing out) the effect of Yt−1 . To do this, consider the
quantities Yt − ϕYt−1 and Yt−2 − ϕYt−1 . Note that
cov(Yt − ϕYt−1 , Yt−2 − ϕYt−1 ) = cov(et , Yt−2 − ϕYt−1 ) = 0,
because et is independent of Yt−1 and Yt−2 . Now, we make the following observations.
• In the AR(1) model, if ϕ is known, we can think of
Yt − ϕYt−1
as the prediction error from regressing Yt on Yt−1 (with no intercept; this is not
needed because we are assuming a zero mean process).
• Similarly, the quantity

Yt−2 − ϕYt−1
can be thought of as the prediction error from regressing Yt−2 on Yt−1 , again with
no intercept.
• Both of these prediction errors are uncorrelated with the intervening variable
Yt−1 . To see why, note that
cov(Yt − ϕYt−1 , Yt−1 ) = cov(Yt , Yt−1 ) − ϕcov(Yt−1 , Yt−1 )
= γ1 − ϕγ0 = 0,
because γ1 = ϕγ0 in the AR(1) model. An identical argument shows that
cov(Yt−2 − ϕYt−1 , Yt−1 ) = γ1 − ϕγ0 = 0.
AR(2): Consider a stationary, zero mean AR(2) process
Yt = ϕ1 Yt−1 + ϕ2 Yt−2 + et ,
where {et } is zero mean white noise. Suppose that we “break” the dependence between
Yt and Yt−3 in the AR(2) process by removing the effects of both Yt−1 and Yt−2 . That
is, consider the quantities
Yt − ϕ1 Yt−1 − ϕ2 Yt−2
and
Yt−3 − ϕ1 Yt−1 − ϕ2 Yt−2 .
Note that
cov(Yt − ϕ1 Yt−1 − ϕ2 Yt−2 , Yt−3 − ϕ1 Yt−1 − ϕ2 Yt−2 ) = cov(et , Yt−3 − ϕ1 Yt−1 − ϕ2 Yt−2 ) = 0,
because et is independent of Yt−1 , Yt−2 , and Yt−3 . Again, we note the following:
• In the AR(2) case, if ϕ1 and ϕ2 are known, then the quantity
Yt − ϕ1 Yt−1 − ϕ2 Yt−2
can be thought of as the prediction error from regressing Yt on Yt−1 and Yt−2
(with no intercept).
• Similarly, the quantity

Yt−3 − ϕ1 Yt−1 − ϕ2 Yt−2
can be thought of as the prediction error from regressing Yt−3 on Yt−1 and Yt−2 ,
again with no intercept.
• Both of these prediction errors are uncorrelated with the intervening variables
Yt−1 and Yt−2 .
TERMINOLOGY : For a zero mean time series, let $\hat{Y}_t^{(k-1)}$ denote the population
regression of Yt on the variables Yt−1 , Yt−2 , ..., Yt−(k−1) , that is,
$$\hat{Y}_t^{(k-1)} = \beta_1 Y_{t-1} + \beta_2 Y_{t-2} + \cdots + \beta_{k-1} Y_{t-(k-1)}.$$
Let $\hat{Y}_{t-k}^{(k-1)}$ denote the population regression of Yt−k on the variables Yt−1 , Yt−2 , ..., Yt−(k−1) ,
that is,
$$\hat{Y}_{t-k}^{(k-1)} = \beta_1 Y_{t-(k-1)} + \beta_2 Y_{t-(k-2)} + \cdots + \beta_{k-1} Y_{t-1}.$$
The partial autocorrelation function (PACF) of a stationary process {Yt }, denoted
by ϕkk , satisfies ϕ11 = ρ1 and
$$\phi_{kk} = \mathrm{corr}\left( Y_t - \hat{Y}_t^{(k-1)},\; Y_{t-k} - \hat{Y}_{t-k}^{(k-1)} \right),$$
for k = 2, 3, ....
• With regards to Yt and Yt−k , the quantities $\hat{Y}_t^{(k-1)}$ and $\hat{Y}_{t-k}^{(k-1)}$ are linear functions
of the intervening variables Yt−1 , Yt−2 , ..., Yt−(k−1) .
• The quantities $Y_t - \hat{Y}_t^{(k-1)}$ and $Y_{t-k} - \hat{Y}_{t-k}^{(k-1)}$ are called the prediction errors.
The PACF at lag k is defined to be the correlation between these errors.
• If the underlying process {Yt } is normal, then an equivalent definition is
ϕkk = corr(Yt , Yt−k |Yt−1 , Yt−2 , ..., Yt−(k−1) ),
the correlation between Yt and Yt−k , conditional on the intervening variables

Yt−1 , Yt−2 , ..., Yt−(k−1) .
• That is, ϕkk measures the correlation between Yt and Yt−k after removing the linear
effects of Yt−1 , Yt−2 , ..., Yt−(k−1) .
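To make the "correlation between prediction errors" definition concrete, here is a small R sketch for lag k = 2. It assumes a simulated AR(1) series with ϕ = 0.9 and a large n, so that sample regressions stand in for the population regressions; the seed and object names are illustrative only.

```r
set.seed(520)                                # arbitrary seed
n <- 5000                                    # large n: sample quantities near population values
y <- as.numeric(arima.sim(model = list(ar = 0.9), n = n))

# Align Y_t, Y_{t-1}, and Y_{t-2}
yt  <- y[3:n]
yt1 <- y[2:(n - 1)]
yt2 <- y[1:(n - 2)]

# Prediction errors from no-intercept regressions on the intervening variable Y_{t-1}
err_t  <- resid(lm(yt  ~ 0 + yt1))
err_t2 <- resid(lm(yt2 ~ 0 + yt1))

phi22_hat   <- cor(err_t, err_t2)            # correlation between the prediction errors
phi22_builtin <- pacf(y, lag.max = 2, plot = FALSE)$acf[2]
c(regression = phi22_hat, pacf = phi22_builtin)
```

Both values should be close to zero for an AR(1) series; they need not agree exactly, because `pacf` works from the sample autocorrelations rather than from fitted regressions.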
RECALL: We now revisit our AR(1) calculations. Consider the model
Yt = ϕYt−1 + et .
We showed that
cov(Yt − ϕYt−1 , Yt−2 − ϕYt−1 ) = cov(et , Yt−2 − ϕYt−1 ) = 0.
In this example, the quantities Yt − ϕYt−1 and Yt−2 − ϕYt−1 are the prediction errors from
regressing Yt on Yt−1 and Yt−2 on Yt−1 , respectively. That is, with k = 2, the general
expressions
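$$\hat{Y}_t^{(k-1)} = \beta_1 Y_{t-1} + \beta_2 Y_{t-2} + \cdots + \beta_{k-1} Y_{t-(k-1)}$$
$$\hat{Y}_{t-k}^{(k-1)} = \beta_1 Y_{t-(k-1)} + \beta_2 Y_{t-(k-2)} + \cdots + \beta_{k-1} Y_{t-1}$$
become
$$\hat{Y}_t^{(2-1)} = \phi Y_{t-1}$$
$$\hat{Y}_{t-2}^{(2-1)} = \phi Y_{t-1}.$$
Therefore, we have shown that for the AR(1) model,
$$\phi_{22} = \mathrm{corr}\left( Y_t - \hat{Y}_t^{(2-1)},\; Y_{t-2} - \hat{Y}_{t-2}^{(2-1)} \right) = 0$$
because
$$\mathrm{cov}\left( Y_t - \hat{Y}_t^{(2-1)},\; Y_{t-2} - \hat{Y}_{t-2}^{(2-1)} \right) = \mathrm{cov}(Y_t - \phi Y_{t-1},\; Y_{t-2} - \phi Y_{t-1}) = 0.$$
IMPORTANT : For the AR(1) model, it follows that ϕ11 ≠ 0 (ϕ11 = ρ1 ) and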
ϕ22 = ϕ33 = ϕ44 = · · · = 0.
That is, ϕkk = 0, for all k > 1.
RECALL: We now revisit our AR(2) calculations. Consider the model
Yt = ϕ1 Yt−1 + ϕ2 Yt−2 + et .
We showed that
cov(Yt − ϕ1 Yt−1 − ϕ2 Yt−2 , Yt−3 − ϕ1 Yt−1 − ϕ2 Yt−2 ) = 0.
Note that in this example, the quantities Yt − ϕ1 Yt−1 − ϕ2 Yt−2 and Yt−3 − ϕ1 Yt−1 − ϕ2 Yt−2
are the prediction errors from regressing Yt on Yt−1 and Yt−2 and Yt−3 on Yt−1 and Yt−2 ,
respectively. That is, with k = 3, the general expressions
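$$\hat{Y}_t^{(k-1)} = \beta_1 Y_{t-1} + \beta_2 Y_{t-2} + \cdots + \beta_{k-1} Y_{t-(k-1)}$$
$$\hat{Y}_{t-k}^{(k-1)} = \beta_1 Y_{t-(k-1)} + \beta_2 Y_{t-(k-2)} + \cdots + \beta_{k-1} Y_{t-1}$$
become
$$\hat{Y}_t^{(3-1)} = \phi_1 Y_{t-1} + \phi_2 Y_{t-2}$$
$$\hat{Y}_{t-3}^{(3-1)} = \phi_1 Y_{t-1} + \phi_2 Y_{t-2}.$$
Therefore, we have shown that for the AR(2) model,
$$\phi_{33} = \mathrm{corr}\left( Y_t - \hat{Y}_t^{(3-1)},\; Y_{t-3} - \hat{Y}_{t-3}^{(3-1)} \right) = 0$$
because
$$\mathrm{cov}\left( Y_t - \hat{Y}_t^{(3-1)},\; Y_{t-3} - \hat{Y}_{t-3}^{(3-1)} \right) = \mathrm{cov}(Y_t - \phi_1 Y_{t-1} - \phi_2 Y_{t-2},\; Y_{t-3} - \phi_1 Y_{t-1} - \phi_2 Y_{t-2}) = 0.$$
IMPORTANT : For the AR(2) model, it follows that ϕ11 ≠ 0, ϕ22 ≠ 0, and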
ϕ33 = ϕ44 = ϕ55 = · · · = 0.
That is, ϕkk = 0, for all k > 2.
GENERAL RESULT : For an AR(p) process, we have the following results:
• ϕ11 ≠ 0, ϕ22 ≠ 0, ..., ϕpp ≠ 0; i.e., the first p partial autocorrelations are nonzero
• ϕkk = 0, for all k > p.
For an AR(p) model, the PACF “drops off” to zero after the pth lag. Therefore,
the PACF can help to determine the order of an AR(p) process just like the ACF helps
to determine the order of an MA(q) process!
Figure 6.3: Top: AR(1) model with ϕ = 0.9; population ACF (left) and population
PACF (right). Bottom: AR(2) model with ϕ1 = −0.5 and ϕ2 = 0.25; population ACF
(left) and population PACF (right).
Example 6.4. We use R to generate observations from two autoregressive processes:
(i) Yt = 0.9Yt−1 + et ⇐⇒ AR(1), with ϕ = 0.9
(ii) Yt = −0.5Yt−1 + 0.25Yt−2 + et ⇐⇒ AR(2), with ϕ1 = −0.5 and ϕ2 = 0.25.
We take et ∼ iid N (0, 1) and n = 150. Figure 6.3 displays the true (population) ACF
and PACF for these processes. Figure 6.4 displays the simulated time series from each
AR model and the sample ACF/PACF.
• The population PACFs in Figure 6.3 display the characteristics that we have just
derived; that is, the AR(1) PACF drops off to zero when the lag k > 1. The AR(2)
PACF drops off to zero when the lag k > 2.
Figure 6.4: Left: AR(1) simulation with et ∼ iid N (0, 1) and n = 150; sample ACF
(middle), and sample PACF (bottom). Right: AR(2) simulation with et ∼ iid N (0, 1)
and n = 150; sample ACF (middle), and sample PACF (bottom).
• Figure 6.4 displays the sample ACF/PACFs. Just as the sample ACF is an esti-
mate of the true (population) ACF, the sample PACF is an estimate of the true
(population) PACF.
• Note that the sample PACF for the AR(1) simulation declares $\hat{\phi}_{kk}$ insignificant for
k > 1. The estimates of ϕkk , for k > 1, are all within the margin of error bounds.
The sample PACF for the AR(2) simulation declares $\hat{\phi}_{kk}$ insignificant for k > 2.
• We will soon discuss why the PACF error bounds here are correct.
Figure 6.5: Top: MA(1) model with θ = 0.9; population ACF (left) and population
PACF (right). Bottom: MA(2) model with θ1 = −0.5 and θ2 = 0.25; population ACF
(left) and population PACF (right).
CURIOSITY : How does the PACF behave for a moving average process? To answer
this, consider the invertible MA(1) model, Yt = et − θet−1 . For this process, it can be
shown that
$$\phi_{kk} = \frac{\theta^k(\theta^2 - 1)}{1 - \theta^{2(k+1)}},$$
for k ≥ 1. Because |θ| < 1 (invertibility requirement), note that
$$\lim_{k\to\infty} \phi_{kk} = \lim_{k\to\infty} \frac{\theta^k(\theta^2 - 1)}{1 - \theta^{2(k+1)}} = 0.$$
That is, the PACF for the MA(1) process decays to zero as the lag k increases, much like
the ACF decays to zero for the AR(1). The same happens in higher order MA models.
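The closed-form MA(1) expression can be checked against R's `ARMAacf` function. This is a short sketch with θ = 0.9 chosen for illustration. As with `arima.sim`, `ARMAacf` writes the MA(1) model as Yt = et + (ma)et−1 , so the notes' θ enters as `ma = -theta`.

```r
theta <- 0.9
k <- 1:10

# Closed-form PACF of the invertible MA(1) process Y_t = e_t - theta * e_{t-1}
phi_kk_formula <- theta^k * (theta^2 - 1) / (1 - theta^(2 * (k + 1)))

# Same quantity from ARMAacf (note the sign convention: ma = -theta)
phi_kk_R <- as.numeric(ARMAacf(ma = -theta, lag.max = 10, pacf = TRUE))

round(cbind(k, formula = phi_kk_formula, ARMAacf = phi_kk_R), 4)
```

At k = 1 both columns equal ρ1 = −θ/(1 + θ2), and the values shrink toward zero as k grows.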
Figure 6.6: Left: MA(1) simulation with et ∼ iid N (0, 1) and n = 150; sample ACF
(middle), and sample PACF (bottom). Right: MA(2) simulation with et ∼ iid N (0, 1)
and n = 150; sample ACF (middle), and sample PACF (bottom).
IMPORTANT : The PACF for an MA process behaves much like the ACF for
an AR process of the same order.
Example 6.5. We use R to generate observations from two moving average processes:
(i) Yt = et − 0.9et−1 ⇐⇒ MA(1), with θ = 0.9
(ii) Yt = et + 0.5et−1 − 0.25et−2 ⇐⇒ MA(2), with θ1 = −0.5 and θ2 = 0.25.
We take et ∼ iid N (0, 1) and n = 150. Figure 6.5 displays the true (population) ACF
and PACF for these processes. Figure 6.6 displays the simulated time series from each
MA model and the sample ACF/PACF.
• The population ACFs in Figure 6.5 display the well-known characteristics; that is,
the MA(1) ACF drops off to zero when the lag k > 1. The MA(2) ACF drops off
to zero when the lag k > 2.
• The population PACF in Figure 6.5 for both the MA(1) and MA(2) decays to zero
as the lag k increases. This is the theoretical behavior exhibited in the ACF for an
AR process.
• The sample versions in Figure 6.6 largely agree with what we know to be true
theoretically.
COMPARISON : The following table succinctly summarizes the behavior of the ACF and
PACF for moving average and autoregressive processes.
        AR(p)                   MA(q)
ACF     Tails off               Cuts off after lag q
PACF    Cuts off after lag p    Tails off
Therefore, the ACF is the key tool to help determine the order of an MA process. The
PACF is the key tool to help determine the order of an AR process. For mixed ARMA
processes, we need a different tool (coming up).
COMPUTATION : For any stationary ARMA process, it is possible to compute the

theoretical PACF values ϕkk , for k = 1, 2, ...,. For a fixed k, we have the following
Yule-Walker equations:
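$$\begin{aligned}
\rho_1 &= \phi_{k,1} + \rho_1\phi_{k,2} + \rho_2\phi_{k,3} + \cdots + \rho_{k-1}\phi_{kk} \\
\rho_2 &= \rho_1\phi_{k,1} + \phi_{k,2} + \rho_1\phi_{k,3} + \cdots + \rho_{k-2}\phi_{kk} \\
&\;\;\vdots \\
\rho_k &= \rho_{k-1}\phi_{k,1} + \rho_{k-2}\phi_{k,2} + \rho_{k-3}\phi_{k,3} + \cdots + \phi_{kk},
\end{aligned}$$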
where
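$$\begin{aligned}
\rho_j &= \mathrm{corr}(Y_t, Y_{t-j}) \\
\phi_{k,j} &= \phi_{k-1,j} - \phi_{kk}\phi_{k-1,k-j}, \quad j = 1, 2, \ldots, k-1 \\
\phi_{kk} &= \mathrm{corr}(Y_t, Y_{t-k} \mid Y_{t-1}, Y_{t-2}, \ldots, Y_{t-(k-1)}).
\end{aligned}$$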
For known ρ1 , ρ2 , ..., ρk , we can solve this system for ϕk,1 , ϕk,2 , ..., ϕk,k−1 , ϕkk , and keep the
value of ϕkk .
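One way to carry out this computation is to write the Yule-Walker system in matrix form and solve it directly. The sketch below does this for an AR(2) model with ϕ1 = 0.6 and ϕ2 = −0.4, chosen for illustration; the helper name `pacf_from_acf` is hypothetical. The autocorrelations are taken from `ARMAacf`, and the result is compared with `ARMAacf(..., pacf = TRUE)`.

```r
# phi_kk from the order-k Yule-Walker system R_k %*% phi = (rho_1, ..., rho_k)',
# where R_k is the k x k Toeplitz matrix built from rho_0 = 1, rho_1, ..., rho_{k-1}.
pacf_from_acf <- function(rho, k) {
  # rho: autocorrelations at lags 1, 2, ..., with length(rho) >= k
  R <- toeplitz(c(1, rho)[1:k])
  solve(R, rho[1:k])[k]            # keep only the last coefficient, phi_kk
}

rho <- as.numeric(ARMAacf(ar = c(0.6, -0.4), lag.max = 10))[-1]   # drop lag 0
phi_kk <- sapply(1:10, function(k) pacf_from_acf(rho, k))
phi_kk_R <- as.numeric(ARMAacf(ar = c(0.6, -0.4), lag.max = 10, pacf = TRUE))

round(cbind(k = 1:10, yule_walker = phi_kk, ARMAacf = phi_kk_R), 3)
```

Solving a fresh linear system for every k is wasteful for large k; the recursion for ϕk,j stated above avoids this, but the direct solve keeps the sketch short.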
Example 6.6. The ARMAacf function in R will compute partial autocorrelations for any
stationary ARMA model. For example, for the AR(2) model
Yt = 0.6Yt−1 − 0.4Yt−2 + et ,
we compute the first ten (theoretical) autocorrelations ρk and partial autocorrelations

ϕkk . Note that I use the round function for aesthetic reasons.
> round(ARMAacf(ar = c(0.6,-0.4), lag.max = 10),digits=3)

0 1 2 3 4 5 6 7 8 9 10
1.000 0.429 -0.143 -0.257 -0.097 0.045 0.066 0.022 -0.013 -0.017 -0.005
> round(ARMAacf(ar = c(0.6,-0.4), lag.max = 10, pacf=TRUE),digits=3)

[1] 0.429 -0.400 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
Similarly, for the MA(2) model
Yt = et + 0.6et−1 − 0.4et−2 ,
we compute the first ten (theoretical) autocorrelations and partial autocorrelations.
> round(ARMAacf(ma = c(0.6,-0.4), lag.max = 10),digits=3)

0 1 2 3 4 5 6 7 8 9 10
1.000 0.237 -0.263 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
> round(ARMAacf(ma = c(0.6,-0.4), lag.max = 10, pacf=TRUE),digits=3)

[1] 0.237 -0.338 0.196 -0.189 0.149 -0.134 0.116 -0.105 0.095 -0.086
ESTIMATION : The partial autocorrelation ϕkk can be estimated by taking the Yule-
Walker equations and substituting rk in for the true autocorrelations ρk , that is,
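$$\begin{aligned}
r_1 &= \phi_{k,1} + r_1\phi_{k,2} + r_2\phi_{k,3} + \cdots + r_{k-1}\phi_{kk} \\
r_2 &= r_1\phi_{k,1} + \phi_{k,2} + r_1\phi_{k,3} + \cdots + r_{k-2}\phi_{kk} \\
&\;\;\vdots \\
r_k &= r_{k-1}\phi_{k,1} + r_{k-2}\phi_{k,2} + r_{k-3}\phi_{k,3} + \cdots + \phi_{kk}.
\end{aligned}$$
This system can then be solved for ϕk,1 , ϕk,2 , ..., ϕk,k−1 , ϕkk as before, but now the solutions
are estimates $\hat{\phi}_{k,1}, \hat{\phi}_{k,2}, \ldots, \hat{\phi}_{k,k-1}, \hat{\phi}_{kk}$. This can be done for each k = 1, 2, ....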
RESULT : When the AR(p) model is correct, then for large n,
$$\hat{\phi}_{kk} \sim AN\left( 0, \frac{1}{n} \right),$$
for all k > p. Therefore, we can use $\pm z_{\alpha/2}/\sqrt{n}$ as “critical points” to test, at level α,
H0 : AR(p) model is appropriate

versus
H1 : AR(p) model is not appropriate
in the same way that we tested whether or not a specific MA model was appropriate
using the sample autocorrelations rk . See Example 6.1 (notes).
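The estimation step and the critical points can be illustrated together in R. The sketch below assumes a simulated AR(2) series (ϕ1 = 0.6, ϕ2 = −0.4, n = 150, arbitrary seed); it solves the Yule-Walker systems with rk in place of ρk and flags the lags where the estimate falls outside $\pm z_{\alpha/2}/\sqrt{n}$.

```r
set.seed(520)                          # arbitrary seed
n <- 150
y <- arima.sim(model = list(ar = c(0.6, -0.4)), n = n)

lag_max <- 10
r <- acf(y, lag.max = lag_max, plot = FALSE)$acf[-1]    # r_1, ..., r_10

# Solve the order-k Yule-Walker system with r in place of rho; keep the last coefficient.
phi_hat <- sapply(1:lag_max, function(k) {
  solve(toeplitz(c(1, r)[1:k]), r[1:k])[k]
})

alpha <- 0.05
crit <- qnorm(1 - alpha / 2) / sqrt(n)                  # z_{alpha/2} / sqrt(n)
data.frame(lag = 1:lag_max, phi_hat = round(phi_hat, 3),
           exceeds = abs(phi_hat) > crit)

phi_hat_builtin <- pacf(y, lag.max = lag_max, plot = FALSE)$acf[, 1, 1]
```

The vector `phi_hat_builtin` from R's `pacf` function should match `phi_hat`, since `pacf` is also computed from the sample autocorrelations.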
6.4 The extended autocorrelation function
REMARK : We have learned that the autocorrelation function (ACF) can help us deter-
mine the order of an MA(q) process because ρk = 0, for all lags k > q. Similarly, the
partial autocorrelation function (PACF) can help us determine the order of an AR(p)
process because ϕkk = 0, for all lags k > p. Therefore, in the sample versions of the
ACF and PACF, we can look for values of rk and $\hat{\phi}_{kk}$, respectively, that are consistent
with this theory. We have also discussed formal testing procedures that can be used to
determine if a given MA(q) or AR(p) model is appropriate. A problem, however, is that
neither the sample ACF nor sample PACF is all that helpful if the underlying process
is a mixture of autoregressive and moving average parts, that is, an ARMA process.
Therefore, we introduce a new function to help us identify the orders of an ARMA(p, q)
process, the extended autocorrelation function.
MOTIVATION : Suppose {et } is a zero mean white noise process with $\mathrm{var}(e_t) = \sigma_e^2$.
Recall that a stationary ARMA(p, q) process can be expressed as
$$\phi(B)Y_t = \theta(B)e_t,$$
where
$$\phi(B) = 1 - \phi_1 B - \phi_2 B^2 - \cdots - \phi_p B^p$$
$$\theta(B) = 1 - \theta_1 B - \theta_2 B^2 - \cdots - \theta_q B^q.$$
To start our discussion, note that
$$\begin{aligned}
W_t \equiv \phi(B)Y_t &= (1 - \phi_1 B - \phi_2 B^2 - \cdots - \phi_p B^p)Y_t \\
&= Y_t - \phi_1 Y_{t-1} - \phi_2 Y_{t-2} - \cdots - \phi_p Y_{t-p}
\end{aligned}$$
follows an MA(q) model, that is,
$$W_t = (1 - \theta_1 B - \theta_2 B^2 - \cdots - \theta_q B^q)e_t.$$
Of course, the {Wt } process is not observed because Wt depends on ϕ1 , ϕ2 , ..., ϕp , which
are unknown parameters.
STRATEGY : Suppose that we regress Yt on Yt−1 , Yt−2 , ..., Yt−p (that is, use the p lagged
versions of Yt as independent variables in a multiple linear regression) and use ordinary
least squares to fit the no-intercept model
Yt = ϕ1 Yt−1 + ϕ2 Yt−2 + · · · + ϕp Yt−p + ϵt ,
where ϵt denotes a generic error term (not the white noise term in the MA process). This
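would produce estimates $\hat{\phi}_1, \hat{\phi}_2, \ldots, \hat{\phi}_p$ from which we could compute
$$\begin{aligned}
\hat{W}_t &= (1 - \hat{\phi}_1 B - \hat{\phi}_2 B^2 - \cdots - \hat{\phi}_p B^p)Y_t \\
&= Y_t - \hat{\phi}_1 Y_{t-1} - \hat{\phi}_2 Y_{t-2} - \cdots - \hat{\phi}_p Y_{t-p}.
\end{aligned}$$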
These values (which are merely the residuals from the regression) serve as proxies for the
true {Wt } process, and we could now treat these residuals as our “data.”
• In particular, we could construct the sample ACF for the $\hat{W}_t$ data so that we can
learn about the order q of the MA part of the process.
• For example, if we fit an AR(2) model Yt = ϕ1 Yt−1 + ϕ2 Yt−2 + ϵt and the residuals $\hat{W}_t$
look to follow an MA(2) process, then this would suggest that a mixed ARMA(2,2)
model is worthy of consideration.
PROBLEM : We have just laid out a sensible strategy on how to select candidate ARMA
models; i.e., choosing values for p and q. The problem is that ordinary least squares
regression estimates $\hat{\phi}_1, \hat{\phi}_2, \ldots, \hat{\phi}_p$ are inconsistent estimates of ϕ1 , ϕ2 , ..., ϕp when the
underlying process is ARMA(p, q). Inconsistency means that the estimates $\hat{\phi}_1, \hat{\phi}_2, \ldots, \hat{\phi}_p$
estimate the wrong things (in a large-sample sense). Therefore, the strategy that we have
just described could lead to incorrect identification of p and q.
ADJUSTMENT : We now describe an “algorithm” to repair the approach just outlined.
0. Consider using ordinary least squares to fit the same no-intercept AR(p) model
Yt = ϕ1 Yt−1 + ϕ2 Yt−2 + · · · + ϕp Yt−p + ϵt ,
where ϵt denotes the error term (not the white noise term in an MA process). If the
true process is an ARMA(p, q), then the least squares estimates from the regression,
say,
$$\hat{\phi}_1^{(0)}, \hat{\phi}_2^{(0)}, \ldots, \hat{\phi}_p^{(0)}$$
will be inconsistent and the least squares residuals
$$\hat{\epsilon}_t^{(0)} = Y_t - \hat{\phi}_1^{(0)} Y_{t-1} - \hat{\phi}_2^{(0)} Y_{t-2} - \cdots - \hat{\phi}_p^{(0)} Y_{t-p}$$
will not be white noise. In fact, if q ≥ 1 (so that the true process is ARMA), then
the residuals $\hat{\epsilon}_t^{(0)}$ and lagged versions of them will contain information about the
process {Yt }.
1. Because the residuals $\hat{\epsilon}_t^{(0)}$ contain information about the value of q, we first fit the
model
$$Y_t = \phi_1^{(1)} Y_{t-1} + \phi_2^{(1)} Y_{t-2} + \cdots + \phi_p^{(1)} Y_{t-p} + \beta_1^{(1)} \hat{\epsilon}_{t-1}^{(0)} + \epsilon_t^{(1)}.$$
Note that we have added the lag 1 residuals $\hat{\epsilon}_{t-1}^{(0)}$ from the initial model fit as a
predictor in the regression.
• If the order of the MA part of the ARMA process is truly q = 1, then the
least squares estimates
$$\hat{\phi}_1^{(1)}, \hat{\phi}_2^{(1)}, \ldots, \hat{\phi}_p^{(1)}$$
will be consistent; i.e., they will estimate the true AR parameters in large
samples.
• If q > 1, then the estimates will be inconsistent and the residual process $\{\hat{\epsilon}_t^{(1)}\}$
will not be white noise.
2. If q > 1, then the residuals from the most recent regression $\hat{\epsilon}_t^{(1)}$ still contain
information about the value of q, so we next fit the model
$$Y_t = \phi_1^{(2)} Y_{t-1} + \phi_2^{(2)} Y_{t-2} + \cdots + \phi_p^{(2)} Y_{t-p} + \beta_1^{(2)} \hat{\epsilon}_{t-1}^{(1)} + \beta_2^{(2)} \hat{\epsilon}_{t-2}^{(0)} + \epsilon_t^{(2)}.$$
Note that in this model, we have added the lag 2 residuals $\hat{\epsilon}_{t-2}^{(0)}$ from the initial
model fit as well as the lag 1 residuals $\hat{\epsilon}_{t-1}^{(1)}$ from the most recent fit.
• If the order of the MA part of the ARMA process is truly q = 2, then the
least squares estimates
$$\hat{\phi}_1^{(2)}, \hat{\phi}_2^{(2)}, \ldots, \hat{\phi}_p^{(2)}$$
will be consistent; i.e., they will estimate the true AR parameters in large
samples.
• If q > 2, then the estimates will be inconsistent and the residual process $\{\hat{\epsilon}_t^{(2)}\}$
will not be white noise.
3. We continue this iterative process, at each step, adding the residuals from the most
recent fit in the same fashion. For example, at the next step, we would fit
$$Y_t = \phi_1^{(3)} Y_{t-1} + \phi_2^{(3)} Y_{t-2} + \cdots + \phi_p^{(3)} Y_{t-p} + \beta_1^{(3)} \hat{\epsilon}_{t-1}^{(2)} + \beta_2^{(3)} \hat{\epsilon}_{t-2}^{(1)} + \beta_3^{(3)} \hat{\epsilon}_{t-3}^{(0)} + \epsilon_t^{(3)}.$$
We continue fitting higher order models until residuals (from the most recent fit)
resemble a white noise process.
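Steps 0 and 1 of this algorithm can be illustrated on simulated data. The sketch below assumes an ARMA(1,1) process with ϕ = 0.6 and θ = −0.8 (so `ma = 0.8` in `arima.sim`'s sign convention) and a large n so that the inconsistency in step 0 is visible; the seed and all object names are illustrative.

```r
set.seed(520)                     # arbitrary seed
n <- 5000
y <- as.numeric(arima.sim(model = list(ar = 0.6, ma = 0.8), n = n))

# Step 0: no-intercept AR(1) regression of Y_t on Y_{t-1}
yt  <- y[2:n]
yt1 <- y[1:(n - 1)]
fit0 <- lm(yt ~ 0 + yt1)
phi_step0 <- unname(coef(fit0)[1])
e0 <- resid(fit0)                 # e0[i] is the step-0 residual at time t = i + 1

# Step 1: add the lag 1 residuals from step 0 as a predictor
m <- length(e0)                   # m = n - 1
yt_1  <- yt[2:m]                  # Y_t for t = 3, ..., n
yt1_1 <- yt1[2:m]                 # Y_{t-1}
e0_l1 <- e0[1:(m - 1)]            # step-0 residual at time t - 1
fit1 <- lm(yt_1 ~ 0 + yt1_1 + e0_l1)
phi_step1 <- unname(coef(fit1)["yt1_1"])

c(step0 = phi_step0, step1 = phi_step1)   # the true value is phi = 0.6
```

Because q = 1 here, the step 1 estimate should land close to 0.6, while the step 0 estimate settles near ρ1 instead.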
EXTENDED ACF : In practice, the true orders p and q of the ARMA(p, q) model are
unknown and have to be estimated. Based on the strategy outlined, however, we can
estimate p and q using a new type of function. For an AR(m) model fit, define the mth
sample extended autocorrelation function (EACF) $\hat{\rho}_j^{(m)}$ as the sample ACF for
the residual process
$$\begin{aligned}
\hat{W}_t^{(j)} &= (1 - \hat{\phi}_1^{(j)} B - \hat{\phi}_2^{(j)} B^2 - \cdots - \hat{\phi}_m^{(j)} B^m)Y_t \\
&= Y_t - \hat{\phi}_1^{(j)} Y_{t-1} - \hat{\phi}_2^{(j)} Y_{t-2} - \cdots - \hat{\phi}_m^{(j)} Y_{t-m},
\end{aligned}$$
for m = 0, 1, 2, ..., and j = 0, 1, 2, .... Here, the subscript j refers to the iteration number
in the aforementioned sequential fitting process (hence, j refers to the order of the MA
part). The value m refers to the AR part of the process. Usually the maximum values
of m and j are taken to be 10 or so.
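$$\begin{array}{c|cccccc}
\text{AR} \backslash \text{MA} & 0 & 1 & 2 & 3 & 4 & \cdots \\ \hline
0 & \hat{\rho}_1^{(0)} & \hat{\rho}_2^{(0)} & \hat{\rho}_3^{(0)} & \hat{\rho}_4^{(0)} & \hat{\rho}_5^{(0)} & \cdots \\
1 & \hat{\rho}_1^{(1)} & \hat{\rho}_2^{(1)} & \hat{\rho}_3^{(1)} & \hat{\rho}_4^{(1)} & \hat{\rho}_5^{(1)} & \cdots \\
2 & \hat{\rho}_1^{(2)} & \hat{\rho}_2^{(2)} & \hat{\rho}_3^{(2)} & \hat{\rho}_4^{(2)} & \hat{\rho}_5^{(2)} & \cdots \\
3 & \hat{\rho}_1^{(3)} & \hat{\rho}_2^{(3)} & \hat{\rho}_3^{(3)} & \hat{\rho}_4^{(3)} & \hat{\rho}_5^{(3)} & \cdots \\
4 & \hat{\rho}_1^{(4)} & \hat{\rho}_2^{(4)} & \hat{\rho}_3^{(4)} & \hat{\rho}_4^{(4)} & \hat{\rho}_5^{(4)} & \cdots \\
\vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \ddots
\end{array}$$
REPRESENTATION : It is useful to arrange the estimates $\hat{\rho}_j^{(m)}$ in a two-way table
where one direction corresponds to the AR part and the other direction corresponds to
the MA part. Mathematical arguments show that, as n → ∞,
$$\hat{\rho}_j^{(m)} \longrightarrow 0, \quad \text{for } 0 \le m - p < j - q$$
$$\hat{\rho}_j^{(m)} \longrightarrow c \ne 0, \quad \text{otherwise}.$$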
Therefore, the true large-sample extended autocorrelation function (EACF) table for an
ARMA(1, 1) process, for example, looks like
AR/MA  0  1  2  3  4  5  ···
0      x  x  x  x  x  x  ···
1      x  0  0  0  0  0  ···
2      x  x  0  0  0  0  ···
3      x  x  x  0  0  0  ···
4      x  x  x  x  0  0  ···
5      x  x  x  x  x  0  ···
⋮      ⋮  ⋮  ⋮  ⋮  ⋮  ⋮
In this table, the “0” entries correspond to the zero limits of $\hat{\rho}_j^{(m)}$. The “x” entries
correspond to limits of $\hat{\rho}_j^{(m)}$ which are nonzero. Therefore, the geometric pattern formed
by the zeros is a “wedge” with a tip at (1,1). This tip corresponds to the values of p = 1
and q = 1 in the ARMA model.
The true large-sample EACF table for an ARMA(2, 2) process looks like
AR/MA  0  1  2  3  4  5  ···
0      x  x  x  x  x  x  ···
1      x  x  x  x  x  x  ···
2      x  x  0  0  0  0  ···
3      x  x  x  0  0  0  ···
4      x  x  x  x  0  0  ···
5      x  x  x  x  x  0  ···
⋮      ⋮  ⋮  ⋮  ⋮  ⋮  ⋮
In this table, we see that the tip of the wedge is at the point (2,2). This tip corresponds
to the values of p = 2 and q = 2 in the ARMA model.
The true large-sample EACF table for an ARMA(2, 1) process looks like
AR/MA  0  1  2  3  4  5  ···
0      x  x  x  x  x  x  ···
1      x  x  x  x  x  x  ···
2      x  0  0  0  0  0  ···
3      x  x  0  0  0  0  ···
4      x  x  x  0  0  0  ···
5      x  x  x  x  0  0  ···
⋮      ⋮  ⋮  ⋮  ⋮  ⋮  ⋮
In this table, we see that the tip of the wedge is at the point (2,1). This tip corresponds
to the values of p = 2 and q = 1 in the ARMA model.
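The wedge pattern in these tables can be generated for any (p, q). The R sketch below is based on one assumption about the table layout, namely that the column labeled "MA = c" in row "AR = m" holds the limit of the EACF at j = c + 1, so that the limit is zero exactly when 0 ≤ m − p ≤ c − q. The function name `eacf_pattern` is hypothetical.

```r
# Theoretical large-sample EACF pattern ("0" = zero limit, "x" = nonzero limit)
# for an ARMA(p, q) process, with rows AR = 0..ar_max and columns MA = 0..ma_max.
eacf_pattern <- function(p, q, ar_max = 5, ma_max = 5) {
  tab <- outer(0:ar_max, 0:ma_max, function(m, col) {
    ifelse(m - p >= 0 & col - q >= m - p, "0", "x")
  })
  dimnames(tab) <- list(AR = 0:ar_max, MA = 0:ma_max)
  tab
}

print(eacf_pattern(1, 1), quote = FALSE)   # wedge with tip at (1,1)
print(eacf_pattern(2, 2), quote = FALSE)   # wedge with tip at (2,2)
print(eacf_pattern(2, 1), quote = FALSE)   # wedge with tip at (2,1)
```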
DISCLAIMER: The tables shown above represent theoretical results for infinitely large
sample sizes. Of course, with real data, we would not expect the tables to follow such a
clear cut pattern. Remember, the sample EACF values $\hat{\rho}_j^{(m)}$ are estimates, so they have
inherent sampling variation! This is important to keep in mind. For some data sets, the
sample EACF table may reveal 2 or 3 models which are consistent with the estimates.
In other situations, the sample EACF may be completely ambiguous and give little or
no information, especially if the sample size n is small.
SAMPLING DISTRIBUTION : When the residual process
$$\hat{W}_t^{(j)} = (1 - \hat{\phi}_1^{(j)} B - \hat{\phi}_2^{(j)} B^2 - \cdots - \hat{\phi}_m^{(j)} B^m)Y_t$$
is truly white noise, then the sample extended autocorrelation function estimator
$$\hat{\rho}_j^{(m)} \sim AN\left( 0, \frac{1}{n - m - j} \right),$$
when n is large. Therefore, we would expect 95 percent of the estimates $\hat{\rho}_j^{(m)}$ to fall
within $\pm 1.96/\sqrt{n - m - j}$. Values outside these cutoffs are classified with an “x” in the
sample EACF. Values within these bounds are classified with a “0.”
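As a small illustration of the classification rule, the following R sketch evaluates the cutoff for given n, m, and j. The helper names and the EACF values passed in at the end are hypothetical, made up only to exercise the functions.

```r
# Large-sample 95 percent cutoff for the sample EACF at AR order m and index j.
eacf_cutoff <- function(n, m, j) 1.96 / sqrt(n - m - j)

# "x" if the estimate falls outside the cutoff, "0" otherwise.
eacf_symbol <- function(rho_hat, n, m, j) {
  ifelse(abs(rho_hat) > eacf_cutoff(n, m, j), "x", "0")
}

eacf_cutoff(n = 200, m = 0, j = 1)
eacf_cutoff(n = 200, m = 7, j = 13)
eacf_symbol(c(0.16, 0.05), n = 200, m = 2, j = 1)   # hypothetical estimates
```

For n = 200 the cutoff changes only slightly across the table, so the classification is driven mainly by the sizes of the estimates themselves.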
Example 6.7. We use R to simulate data from three different ARMA(p, q) processes
and examine the sample EACF produced in R. The first simulation is an
• ARMA(1,1), with n = 200, ϕ = 0.6, θ = −0.8, and et ∼ iid N (0, 1).
The sample EACF produced from the simulation was
AR/MA 0 1 2 3 4 5 6 7 8 9 10 11 12 13
0 x x x x x o o o o o o o o o
1 x o o o o o o o o o o o o o
2 x o o o x o o o o o o o o o
3 x x x o o o o o o o o o o o
4 x o x o x o o o o o o o o o
5 x x x x o o o o o o o o o o
6 x x o x x o o o o o o o o o
7 x x o x o x o o o o o o o o
INTERPRETATION : This sample EACF agrees largely with the theory, which says that
there should be a wedge of zeros with tip at (1,1); the “x”s at (2,4) and (4,4) may be false
positives. If one is willing to additionally assume that the “x” at (3,2) is a false positive,
then an ARMA(2,1) model would also be deemed consistent with these estimates.
The second simulation is an
• ARMA(2,2), with n = 200, ϕ1 = 0.5, ϕ2 = −0.5, θ1 = −0.8, θ2 = 0.2, and

et ∼ iid N (0, 1).
AR/MA 0 1 2 3 4 5 6 7 8 9 10 11 12 13
0 x x x o x o o o o o x o o o
1 x x x o x o o o o o x o x o
2 x o o o o o o o o o x o o o
3 x x o o o o o o o o o o o o
4 x x o x o o o o o o o o o o
5 x x x x o o o o o o o o o o
6 x x x x o o o o x o o o o o
7 x o x x x o o o o o o o o o
INTERPRETATION : This sample EACF also agrees largely with the theory, which says
that there should be a wedge of zeros with tip at (2,2). If one is willing to additionally
assume that the “x” at (4,3) is a false positive, then an ARMA(2,1) model would also be
deemed consistent with these estimates.
Finally, we use an
• ARMA(3,3), with n = 200, ϕ1 = 0.8, ϕ2 = 0.8, ϕ3 = −0.9, θ1 = 0.9, θ2 = −0.8,

θ3 = 0.2, and et ∼ iid N (0, 1).
AR/MA 0 1 2 3 4 5 6 7 8 9 10 11 12 13
0 x x x x x x x x x x x x x x
1 x x x o x x x x x x x o x x
2 x x x o x x x x x x x o x x
3 x x o x x x x x o o o o o o
4 x x o x o o o o o o o o o o
5 x o o x o o o o o o o o o o
6 x o o x o x o o o o o o o o
7 x o o x o o o o o o o o o o
INTERPRETATION : This sample EACF does not agree with the theory, which says
that there should be a wedge of zeros with tip at (3,3). There is more of a “block” of
zeros; not a wedge. If we saw this EACF in practice, it would not be all that helpful in
model selection.
6.5 Nonstationarity
REVIEW : In general, an ARIMA(p, d, q) process can be written as
$$\phi(B)(1 - B)^d Y_t = \theta(B)e_t,$$
where
$$\phi(B) = 1 - \phi_1 B - \phi_2 B^2 - \cdots - \phi_p B^p$$
$$\theta(B) = 1 - \theta_1 B - \theta_2 B^2 - \cdots - \theta_q B^q$$
and
$$(1 - B)^d Y_t = \nabla^d Y_t.$$
Up until now, we have discussed three functions to help us identify possible values for p
and q in stationary ARMA processes.
• The sample ACF can be used to determine the order q of a purely MA process.
• The sample PACF can be used to determine the order p of a purely AR process.
• The sample EACF can be used to determine the orders p and q of a mixed ARMA
process.
DIFFERENCING: For a series of data, a clear indicator of nonstationarity is that the
sample ACF exhibits a very slow decay across lags. This occurs because in a nonstationary
process, the series tends to “hang together” and displays “trends.”
• When there is a clear trend in the data (e.g., linear) and the sample ACF for a
series decays very slowly, take first differences.
• If the sample ACF for the first differences resembles that of a stationary ARMA
process (the ACF decays quickly), then take d = 1 in the ARIMA(p, d, q) family
and use the ACF, PACF, and EACF (on the first differences) to identify plausible
values of p and q.
• If the sample ACF for the first differences still exhibits a slow decay across lags,
take second differences and use d = 2. One can then use the ACF, PACF, and
EACF (on the second differences) to identify plausible values of p and q. There
should rarely be a need to consider values of d > 2. In fact, I have found that it is
not all that often that even second differences (d = 2) are needed.
• If a transformation is warranted (e.g., because of clear evidence of heteroscedasticity),
implement it up front before taking any differences. Then, use these guidelines
to choose p, d, and q for the transformed series.
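The differencing guidelines above can be sketched numerically. The following is a minimal illustration in Python (an assumption of this sketch; the notes themselves use R), with hypothetical helper names `diff` and `sample_acf`. It simulates a random walk, for which d = 1 is correct, and compares the sample ACF of the levels with that of the first differences.

```python
import random

def diff(y, d=1):
    """Return the d-th differences of the series y."""
    for _ in range(d):
        y = [y[t] - y[t - 1] for t in range(1, len(y))]
    return y

def sample_acf(y, max_lag):
    """Sample autocorrelations r_1, ..., r_max_lag."""
    n = len(y)
    ybar = sum(y) / n
    c0 = sum((v - ybar) ** 2 for v in y)
    return [sum((y[t] - ybar) * (y[t - k] - ybar) for t in range(k, n)) / c0
            for k in range(1, max_lag + 1)]

# Simulated random walk Y_t = Y_{t-1} + e_t (illustrative data, not from the notes)
random.seed(520)
y, level = [], 0.0
for _ in range(500):
    level += random.gauss(0.0, 1.0)
    y.append(level)

acf_level = sample_acf(y, 10)        # decays very slowly
acf_diff = sample_acf(diff(y), 10)   # resembles white noise
print([round(r, 2) for r in acf_level])
print([round(r, 2) for r in acf_diff])
```

In this sketch the slow decay in the first list is the signal to difference, and the near-zero values in the second list suggest that no further differencing is needed.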
TERMINOLOGY : Overdifferencing occurs when we choose d to be too large. For

example, suppose that the correct model for a process {Yt } is an IMA(1,1), that is,
Yt = Yt−1 + et − θet−1 ,
where |θ| < 1 and {et } is zero mean white noise. The first differences are given by
∇Yt = Yt − Yt−1 = et − θet−1 ,
which is a stationary and invertible MA(1) process. The second differences are given by
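$$\begin{aligned}
\nabla^2 Y_t &= \nabla Y_t - \nabla Y_{t-1} \\
&= (e_t - \theta e_{t-1}) - (e_{t-1} - \theta e_{t-2}) \\
&= e_t - (1 + \theta) e_{t-1} + \theta e_{t-2} \\
&= [1 - (1 + \theta) B + \theta B^2] e_t.
\end{aligned}$$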
The second difference process is not invertible because
$$\theta(x) = 1 - (1 + \theta)x + \theta x^2 = (1 - x)(1 - \theta x)$$
has a unit root x = 1. Therefore, by unnecessarily taking second differences, we have

created a problem. Namely, we have differenced an invertible MA(1) process (for first
differences) into one which is not invertible. Recall that if a process is not invertible
(here, the second differences), then the parameters in the model can not be estimated
uniquely. In this example, the correct value of d is d = 1. Taking d = 2 would be an
example of overdifferencing.
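The unit root created by overdifferencing can be checked numerically. This is a minimal sketch in Python (an assumption; the notes use R), with a hypothetical function name. It solves the quadratic 1 − (1 + θ)x + θx² = 0 for a chosen θ.

```python
import cmath

def second_difference_ma_roots(theta):
    """Roots of 1 - (1 + theta) x + theta x^2, the MA polynomial obtained
    by differencing an IMA(1,1) process a second time."""
    a, b, c = theta, -(1.0 + theta), 1.0     # a x^2 + b x + c
    if a == 0:
        return [-c / b]                      # polynomial reduces to 1 - x
    disc = cmath.sqrt(b * b - 4 * a * c)
    return [(-b + disc) / (2 * a), (-b - disc) / (2 * a)]

roots = second_difference_ma_roots(0.5)
print(roots)   # one root equals 1 (the unit root); the other equals 1/theta
```

Whatever value of θ is used, one root sits exactly on the unit circle, which is why the second-difference process is not invertible.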
INFERENCE : Instead of relying on the sample ACF, which may be subjective in “bor-
derline cases,” we can formally test whether or not an observed time series is stationary
using the methodology proposed by Dickey and Fuller (1979).
DEVELOPMENT : To fix ideas, consider the model
Yt = αYt−1 + Xt ,
where {Xt } is a stationary AR(k) process, that is,
Xt = ϕ1 Xt−1 + ϕ2 Xt−2 + · · · + ϕk Xt−k + et ,
where {et } is zero mean white noise. Therefore,
Yt = αYt−1 + ϕ1 Xt−1 + ϕ2 Xt−2 + · · · + ϕk Xt−k + et
= αYt−1 + ϕ1 (Yt−1 − αYt−2 ) + ϕ2 (Yt−2 − αYt−3 ) + · · · + ϕk (Yt−k − αYt−k−1 ) + et .
After some algebra, we can rewrite this model for Yt as
ϕ∗ (B)Yt = et ,
where
$$\phi^*(B) = \phi(B)(1 - \alpha B)$$
and where $\phi(B) = 1 - \phi_1 B - \phi_2 B^2 - \cdots - \phi_k B^k$ is the usual AR characteristic operator
of order k. Note that
• if α = 1, then ϕ∗ (B) = ϕ(B)(1 − B), that is, ϕ∗ (x), a polynomial of degree k + 1,
has a unit root and {Yt } is not stationary.
• if −1 < α < 1, then ϕ∗ (x) does not have a unit root, and {Yt } is a stationary
AR(k + 1) process.
The augmented Dickey-Fuller (ADF) unit root test therefore tests
H0 : α = 1 (nonstationarity)
versus
H1 : α < 1 (stationarity).
IMPLEMENTATION : Dickey and Fuller advocated that this test could be carried out
using least squares regression. To see how, note that when H0 : α = 1 is true (i.e., the
process is nonstationary), the model for Yt can be written as
Yt − Yt−1 = ϕ1 (Yt−1 − Yt−2 ) + ϕ2 (Yt−2 − Yt−3 ) + · · · + ϕk (Yt−k − Yt−k−1 ) + et
= aYt−1 + ϕ1 (Yt−1 − Yt−2 ) + ϕ2 (Yt−2 − Yt−3 ) + · · · + ϕk (Yt−k − Yt−k−1 ) + et ,
where a = α − 1. Note that a = 0 when α = 1. That is,
H0 : α = 1 is true ⇐⇒ H0 : a = 0 is true.
Using difference notation, the model under H0 : α = 1 is
∇Yt = aYt−1 + ϕ1 ∇Yt−1 + ϕ2 ∇Yt−2 + · · · + ϕk ∇Yt−k + et .
Therefore, we carry out the test by regressing ∇Yt on Yt−1 , ∇Yt−1 , ∇Yt−2 , ..., ∇Yt−k . We
can then decide between H0 and H1 by examining the size of the least-squares estimate
of a. In particular,
• if the least squares regression estimate of a is significantly less than 0 (the
alternative H1 : α < 1 corresponds to a < 0), we reject H0 and conclude that the
process is stationary.
• if the least squares regression estimate of a is not significantly less than 0, we
do not reject H0 . This decision would suggest the process {Yt } is nonstationary.
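The regression behind the ADF test can be sketched directly. The following Python code is an illustration only (Python and all function names are assumptions of this sketch; it is not the R routine used in these notes). It regresses ∇Yt on Yt−1, ∇Yt−1, ..., ∇Yt−k with no intercept, as in the model above, and reports the estimate of a with its usual least-squares t ratio. The ratio must be compared with Dickey-Fuller critical values, not with t quantiles, so no p-value is computed here.

```python
import math
import random

def solve(A, b):
    """Solve the square system A x = b by Gauss-Jordan elimination."""
    n = len(b)
    M = [A[i][:] + [b[i]] for i in range(n)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(n):
            if r != col:
                f = M[r][col] / M[col][col]
                M[r] = [u - f * v for u, v in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]

def adf_regression(y, k):
    """Least squares fit of diff(Y)_t on Y_{t-1} and k lagged differences.
    Returns (a_hat, standard error, t ratio) for the coefficient on Y_{t-1}."""
    def dy(t):
        return y[t] - y[t - 1]
    ts = range(k + 1, len(y))
    X = [[y[t - 1]] + [dy(t - j) for j in range(1, k + 1)] for t in ts]
    z = [dy(t) for t in ts]
    p = k + 1
    XtX = [[sum(row[i] * row[j] for row in X) for j in range(p)] for i in range(p)]
    Xtz = [sum(row[i] * zt for row, zt in zip(X, z)) for i in range(p)]
    beta = solve(XtX, Xtz)
    resid = [zt - sum(b * x for b, x in zip(beta, row)) for row, zt in zip(X, z)]
    s2 = sum(u * u for u in resid) / (len(z) - p)
    v00 = solve(XtX, [1.0] + [0.0] * k)[0]   # (0,0) entry of the inverse of X'X
    se = math.sqrt(s2 * v00)
    return beta[0], se, beta[0] / se

def simulate(alpha, n, seed):
    """Simulate Y_t = alpha Y_{t-1} + e_t with standard normal white noise."""
    rng = random.Random(seed)
    y, prev = [], 0.0
    for _ in range(n):
        prev = alpha * prev + rng.gauss(0.0, 1.0)
        y.append(prev)
    return y

a_rw, se_rw, t_rw = adf_regression(simulate(1.0, 500, seed=1), k=2)
a_st, se_st, t_st = adf_regression(simulate(0.2, 500, seed=1), k=2)
print("random walk:", round(a_rw, 3), round(t_rw, 2))
print("stationary :", round(a_st, 3), round(t_st, 2))
```

For the simulated random walk the estimate of a should be close to 0, while for the stationary series it should be clearly negative with a large negative ratio.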
REMARK : The test statistic needed to test H0 versus H1 , and its large-sample distri-
bution, are complicated (the test statistic is similar to the t test statistic from ordinary
least squares regression; however, the large-sample distribution is not t). Fortunately,
there is an R function to implement the test automatically. The only thing we need to
do is choose a value of k in the model
∇Yt = aYt−1 + ϕ1 ∇Yt−1 + ϕ2 ∇Yt−2 + · · · + ϕk ∇Yt−k + et ,
that is, the value k is the order of the AR process for ∇Yt . Of course, the true value
of k is unknown. However, we can have R determine the “best value” of k using model
selection criteria that we will discuss in the next subsection.
[Figure 6.7 image omitted. Left panel: global temperature deviations versus year. Right panel: LA rainfall amounts versus year.]
Figure 6.7: Left: Global temperature data. Right: Los Angeles annual rainfall data.
Example 6.8. We illustrate the ADF test using two data sets from Chapter 1, the global
temperature data set (Example 1.1, pp 2, notes) and the Los Angeles annual rainfall
data set (Example 1.13, pp 14, notes). For the global temperature data, the command
ar(diff(globtemp)) is used to determine the “best” value of k for the differences. Here,
it is k = 3. The ADF test output is
Null hypothesis: Unit root.

Alternative hypothesis: Stationarity.
ADF statistic:
adf.reg -0.031 0.049 -0.636 0.1
Lag orders: 1 2 3
Number of available observations: 138
In particular, the output automatically produces the p-value for the test
H0 : α = 1 (nonstationarity)
versus
H1 : α < 1 (stationarity).
The large p-value here (> 0.10) does not refute H0 : α = 1. There is insufficient evidence
to conclude that the global temperature process is stationary. For the LA rainfall data,
the command ar(diff(larain)) is used to determine the best value of k, which is k = 4.
Null hypothesis: Unit root.

Alternative hypothesis: Stationarity.
ADF statistic:
adf.reg -0.702 0.207 -3.385 0.015
Lag orders: 1 2 3 4
Number of available observations: 110
The small p-value here (p = 0.015) indicates strong evidence against H0 : α = 1. There
is sufficient evidence to conclude that the LA rainfall process is stationary.
DISCUSSION : When performing the ADF test, some words of caution are in order.
• When H0 : α = 1 is true, the AR characteristic polynomial ϕ∗ (B) = ϕ(B)(1 − αB)
contains a unit root. In other words, {Yt } is nonstationary, but {∇Yt } is
stationary. This is called difference nonstationarity. The ADF procedure we
have described, more precisely, tests for difference nonstationarity.
• Because of this, the ADF test outlined here may not have sufficient power to reject
H0 when the process is truly stationary. In addition, the test may reject H0 incor-
rectly because a different form of nonstationarity is present (one that can not be
overcome merely by taking first differences).
• The ADF test outcome must be interpreted with these points in mind, especially
when the sample size n is small. In other words, do not blindly interpret the ADF
test outcome as a yes/no indicator of nonstationarity.
IMPORTANT : To implement the ADF test in R, we need to install the uroot package.
Installing this package has to be done manually.
6.6 Other model selection methods
TERMINOLOGY : Akaike’s Information Criterion (AIC) says to select the
ARMA(p, q) model which minimizes
AIC = −2 ln L + 2k,
where ln L is the natural logarithm of the maximized likelihood function (computed under
a distributional assumption for Y1 , Y2 , ..., Yn ) and k is the number of parameters in the
model (excluding the white noise variance). In a stationary no-intercept ARMA(p, q)
model, there are k = p + q parameters.
• The likelihood function gives (loosely speaking) the “probability of the data,” so
we would like for it to be as large as possible. This is equivalent to wanting −2 ln L
to be as small as possible.
• The 2k term serves as a penalty, namely, we do not want models with too many
parameters (adhering to the Principle of Parsimony).
• The AIC is an estimator of the expected Kullback-Leibler divergence, which
measures the closeness of a candidate model to the truth. The smaller this divergence,
the better the model. See pp 130 (CC).
• The AIC is used more generally for model selection in statistics (not just in the
analysis of time series data). Herein, we restrict attention to its use in selecting
candidate stationary ARMA(p, q) models.
TERMINOLOGY : The Bayesian Information Criterion (BIC) says to select the

ARMA(p, q) model which minimizes
BIC = −2 ln L + k ln n,
where ln L is the natural logarithm of the maximized likelihood function and k is the
number of parameters in the model (excluding the white noise variance). In a stationary
no-intercept ARMA(p, q) model, there are k = p + q parameters.
[Figure 6.8 image omitted. Left panel: ventilation (L/min) versus observation time. Right panel: ventilation differences versus observation time.]
Figure 6.8: Ventilation measurements at 15 second intervals. Left: Ventilation data.
Right: First differences.
• Both AIC and BIC require the maximization of a log likelihood function (we assume
normality). When compared to AIC, BIC offers a stiffer penalty for overparameterized
models since ln n exceeds 2 whenever n ≥ 8.
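The two criteria are simple functions of the maximized log likelihood. Here is a minimal sketch in Python (an assumption; the notes use R). The log likelihood values in the dictionary are hypothetical numbers chosen only to show the comparison; they do not come from any data set in these notes.

```python
import math

def aic(loglik, k):
    """AIC = -2 ln L + 2k, with k = p + q for a no-intercept ARMA(p, q)."""
    return -2.0 * loglik + 2.0 * k

def bic(loglik, k, n):
    """BIC = -2 ln L + k ln n."""
    return -2.0 * loglik + k * math.log(n)

# Hypothetical maximized log likelihoods (illustrative only): name -> (ln L, k)
candidates = {"ARMA(1,0)": (-142.3, 1), "ARMA(1,1)": (-141.9, 2)}
n = 100
for name, (ll, k) in candidates.items():
    print(name, round(aic(ll, k), 2), round(bic(ll, k, n), 2))
```

With these made-up values the extra parameter improves the log likelihood only slightly, so both criteria favor the smaller model, and the BIC does so by a wider margin.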
Example 6.9. We use the BIC as a means for model selection with the ventilation data
in Example 1.10 (pp 11, notes); see also Example 5.2 (pp 117, notes). Figure 6.8 shows
the original series (left) and the first difference process (right). The BIC output (next
page) is provided by R. Remember that the smaller the BIC, the better the model.
• The original ventilation series displays a clear linear trend. The ADF test (results
not shown) provides a p-value of p > 0.10, indicating that the series is difference
nonstationary.
• We therefore find the “best” ARMA(p, q) model for the first differences; that is, we
are taking d = 1, so we are essentially finding the “best” ARIMA(p, 1, q) model.
• The BIC output in Figure 6.9 shows that the best model (smallest BIC) for the
differences contains a lag 1 error component; i.e., q = 1.
[Figure 6.9 image omitted. Candidate terms shown: diff.temp-lag1 through diff.temp-lag6, error-lag1 through error-lag6, and (Intercept). BIC axis values: −61, −58, −56, −52, −48, −44, −39, −33.]
Figure 6.9: Ventilation data. ARMA best subsets output for the first difference process
{∇Yt } using the BIC.
• Therefore, the model that provides the smallest BIC for {∇Yt } is an MA(1).
• In other words, the “best” model for the original ventilation series, as judged by
the BIC, is an ARIMA(0,1,1); i.e., an IMA(1,1).
DISCLAIMER: Model selection according to BIC (or AIC) does not always provide
“selected” models that are easily interpretable. Therefore, while AIC and BIC are model
selection tools, they are not the only tools available to us. The ACF, PACF, and EACF
may direct us to models that are different than those deemed “best” by the AIC/BIC.
6.7 Summary
SUMMARY : Here is a summary of the techniques that we have reviewed in this chapter.
This summary is presented in an “algorithm” format to help guide the data analyst
through the ARIMA model selection phase. Advice is interspersed throughout.
1. Plot the data and identify an appropriate transformation if needed.
• Examining the time series plot, we can get an idea about whether the series
contains a trend, seasonality, outliers, nonconstant variance, etc. This under-
standing often provides a basis for postulating a possible data transformation.
• Examine the time series plot for nonconstant variance and perform a suitable
transformation (from the Box-Cox family); see Chapter 5. Alternatively, the
data analyst can try several transformations and choose the one that does the
best at stabilizing the variance.
• Always implement a transformation before taking any data differences.
2. Compute the sample ACF and the sample PACF of the original series (or trans-
formed series) and further confirm the need for differencing.
• If the sample ACF decays very, very slowly, this usually indicates that it is a
good idea to take first differences.
• Tests for stationarity (ADF test) can also be implemented at this point on the
original or transformed series. In a borderline case, differencing is generally
recommended.
• Higher order differencing may be needed (however, I have found that it gen-
erally is not). One can perform an ADF test for stationarity of the first
differences to see if taking second differences is warranted. In nearly all cases,
d is not larger than 2 (i.e., taking second differences).
• Some authors argue that the consequences of overdifferencing are much less
serious than those of underdifferencing. However, overdifferencing can create
model identifiability problems.
3. Compute the sample ACF, the sample PACF, and the sample EACF of the original,
properly transformed, properly differenced, or properly transformed/differenced se-
ries to identify the orders of p and q.
• Usually, p and q are not larger than 4 (excluding seasonal models, which we
have yet to discuss).
• Use knowledge of the patterns for theoretical versions of these functions; i.e.,
– the ACF for an MA(q) drops off after lag q

– the PACF for an AR(p) drops off after lag p
– the “tip” in the EACF identifies the proper ARMA(p, q) model.
• We identify the orders p and q by matching the patterns in the sample

ACF/PACF/EACF with the theoretical patterns of known models.
• To build a reasonable model, ideally, we need a minimum of about n = 50
observations, and the number of sample ACF and PACF lags to be calculated
should be about n/4 (a rough guideline). It might be hard to identify an
adequate model with smaller data sets.
• “The art of model selection is very much like the method of an FBI’s agent
criminal search. Most criminals disguise themselves to avoid being recog-
nized.” This is also true of the ACF, PACF, and EACF. Sampling variation
can disguise the theoretical ACF/PACF/EACF patterns.
• BIC and AIC can also be used to identify models consistent with the data.
REMARK : It is rare, after going through all of this, that the analyst will be able to
identify a single model that is a “clear-cut” choice. It is more likely that a small number
of candidate models have been identified from the steps above.
NEXT STEP : With our (hopefully small) set of candidate models, we then move forward
to parameter estimation and model diagnostics (model checking). These topics are the
subjects of Chapter 7 and Chapter 8, respectively. Once a final model has been chosen,
fit, and diagnosed, forecasting then becomes the central focus (Chapter 9).
7 Estimation
7.1 Introduction
RECALL: Suppose that {et } is a zero mean white noise process with var(et ) = $\sigma_e^2$. In
general, an ARIMA(p, d, q) process can be written as
$$\phi(B)(1 - B)^d Y_t = \theta(B) e_t,$$
where
$$\phi(B) = 1 - \phi_1 B - \phi_2 B^2 - \cdots - \phi_p B^p$$
$$\theta(B) = 1 - \theta_1 B - \theta_2 B^2 - \cdots - \theta_q B^q$$
and
$$(1 - B)^d Y_t = \nabla^d Y_t$$
is the series of dth differences. In the last chapter, we were primarily concerned with
selecting values of p, d, and q which were consistent with the observed (or suitably
transformed) data, that is, we were concerned with model selection.
PREVIEW : In this chapter, our efforts are directed towards estimating parameters in
this class of models. In doing so, it suffices to restrict attention to stationary ARMA(p, q)
models. If d > 0 (which corresponds to a nonstationary process), the methodology de-
scribed herein can be applied to the suitably differenced process (1 − B)d Yt = ∇d Yt .
Therefore, when we write Y1 , Y2 , ..., Yn to represent our “data” in this chapter, it is
understood that Y1 , Y2 , ..., Yn may denote the original data, the differenced data, trans-
formed data (e.g., log-transformed, etc.), or possibly data that have been transformed
and differenced.
PREVIEW : We will discuss three estimation techniques: method of moments, least

squares, and maximum likelihood.
7.2 Method of moments
TERMINOLOGY : The method of moments (MOM) approach to estimation consists

of equating sample moments to the corresponding population (theoretical) moments and
solving the resulting system of equations for the model parameters.
7.2.1 Autoregressive models
AR(1): Consider the stationary AR(1) model
Yt = ϕYt−1 + et ,
where {et } is zero mean white noise with var(et ) = $\sigma_e^2$. In this model, there are two
parameters: ϕ and σe2 . The MOM estimator of ϕ is obtained by setting the population
lag one autocorrelation ρ1 equal to the sample lag one autocorrelation r1 and solving for
ϕ, that is,
$$\rho_1 \overset{\text{set}}{=} r_1.$$
For this model, we know ρ1 = ϕ (see Chapter 4). Therefore, the MOM estimator of ϕ is
$$\hat{\phi} = r_1.$$
AR(2): For the AR(2) model,
Yt = ϕ1 Yt−1 + ϕ2 Yt−2 + et ,
there are three parameters: ϕ1 , ϕ2 , and σe2 . To find the MOM estimators of ϕ1 and ϕ2 ,
recall the Yule-Walker equations (derived in Chapter 4) for the AR(2):
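$$\begin{aligned}
\rho_1 &= \phi_1 + \rho_1 \phi_2 \\
\rho_2 &= \rho_1 \phi_1 + \phi_2.
\end{aligned}$$
Setting ρ1 = r1 and ρ2 = r2 , we have
$$\begin{aligned}
r_1 &= \phi_1 + r_1 \phi_2 \\
r_2 &= r_1 \phi_1 + \phi_2.
\end{aligned}$$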
Solving this system for ϕ1 and ϕ2 produces the MOM estimators
$$\hat{\phi}_1 = \frac{r_1(1 - r_2)}{1 - r_1^2}, \qquad \hat{\phi}_2 = \frac{r_2 - r_1^2}{1 - r_1^2}.$$
AR(p): For the general AR(p) process,
Yt = ϕ1 Yt−1 + ϕ2 Yt−2 + · · · + ϕp Yt−p + et ,
there are p + 1 parameters: ϕ1 , ϕ2 , ..., ϕp and σe2 . We again recall the Yule-Walker
equations from Chapter 4:
$$\begin{aligned}
\rho_1 &= \phi_1 + \phi_2 \rho_1 + \phi_3 \rho_2 + \cdots + \phi_p \rho_{p-1} \\
\rho_2 &= \phi_1 \rho_1 + \phi_2 + \phi_3 \rho_1 + \cdots + \phi_p \rho_{p-2} \\
&\;\;\vdots \\
\rho_p &= \phi_1 \rho_{p-1} + \phi_2 \rho_{p-2} + \phi_3 \rho_{p-3} + \cdots + \phi_p.
\end{aligned}$$
Just as in the AR(2) case, we set ρ1 = r1 , ρ2 = r2 , ..., ρp = rp to obtain
$$\begin{aligned}
r_1 &= \phi_1 + \phi_2 r_1 + \phi_3 r_2 + \cdots + \phi_p r_{p-1} \\
r_2 &= \phi_1 r_1 + \phi_2 + \phi_3 r_1 + \cdots + \phi_p r_{p-2} \\
&\;\;\vdots \\
r_p &= \phi_1 r_{p-1} + \phi_2 r_{p-2} + \phi_3 r_{p-3} + \cdots + \phi_p.
\end{aligned}$$
The MOM estimators $\hat{\phi}_1, \hat{\phi}_2, ..., \hat{\phi}_p$ solve this system of equations.
REMARK : Calculating MOM estimates (or any estimates) in practice should be done
using software. The MOM approach may produce estimates $\hat{\phi}_1, \hat{\phi}_2, ..., \hat{\phi}_p$ that fall
“outside” the stationarity region, even if the process is truly stationary! That is, the
estimated AR(p) polynomial, say,
$$\hat{\phi}_{\text{MOM}}(x) = 1 - \hat{\phi}_1 x - \hat{\phi}_2 x^2 - \cdots - \hat{\phi}_p x^p$$
may possess roots which do not exceed 1 in absolute value (or modulus).
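The sample Yule-Walker system can be solved recursively. The sketch below is in Python (an assumption; the notes use R's `ar` function) and uses the Levinson-Durbin recursion, which is one standard way to solve these equations and is not necessarily what any particular software does internally. The function names are hypothetical. The values r1 = 0.831 and r2 = 0.643 are the Lake Huron statistics reported in Example 7.2 of these notes.

```python
def yule_walker(r):
    """Solve the sample Yule-Walker equations for an AR(p) model.
    r = [r_1, ..., r_p]; returns the MOM estimates [phi_1, ..., phi_p]."""
    phi = []
    for k in range(1, len(r) + 1):
        num = r[k - 1] - sum(phi[j] * r[k - 2 - j] for j in range(k - 1))
        den = 1.0 - sum(phi[j] * r[j] for j in range(k - 1))
        phi_kk = num / den
        phi = [phi[j] - phi_kk * phi[k - 2 - j] for j in range(k - 1)] + [phi_kk]
    return phi

def ar2_is_stationary(phi1, phi2):
    """Stationarity region for an AR(2): phi1 + phi2 < 1, phi2 - phi1 < 1, |phi2| < 1."""
    return phi1 + phi2 < 1 and phi2 - phi1 < 1 and abs(phi2) < 1

phi_hat = yule_walker([0.831, 0.643])
print([round(p, 3) for p in phi_hat], ar2_is_stationary(*phi_hat))
```

For p = 2 the recursion reproduces the closed-form AR(2) estimators given earlier in this section, and the final check confirms that this particular fit lies inside the stationarity region.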
7.2.2 Moving average models
MA(1): Consider the invertible MA(1) process
Yt = et − θet−1 ,
where {et } is zero mean white noise with var(et ) = $\sigma_e^2$. In this model, there are two
parameters: θ and $\sigma_e^2$. To find the MOM estimator of θ, we solve
$$\rho_1 = \frac{-\theta}{1 + \theta^2} \overset{\text{set}}{=} r_1 \iff r_1 \theta^2 + \theta + r_1 = 0$$
for θ. Using the quadratic formula, we find that the solutions to this equation are
$$\theta = \frac{-1 \pm \sqrt{1 - 4r_1^2}}{2r_1}.$$
• If |r1 | > 0.5, then no real solutions for θ exist.
• If |r1 | = 0.5, then the solutions for θ are ±1, which corresponds to an MA(1) model
that is not invertible.
• If |r1 | < 0.5, the invertible solution for θ is the MOM estimator
$$\hat{\theta} = \frac{-1 + \sqrt{1 - 4r_1^2}}{2r_1}.$$
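The three cases above translate into a short function. This is a minimal sketch in Python (an assumption; the notes use R), with a hypothetical function name. The value r1 = 0.458 is the Göta River statistic reported in Example 7.3 of these notes.

```python
import math

def ma1_mom(r1):
    """Invertible MOM estimate of theta in Y_t = e_t - theta e_{t-1}.
    Returns None when no invertible real solution exists (|r1| >= 0.5)."""
    if r1 == 0:
        return 0.0              # rho_1 = 0 corresponds to theta = 0
    if abs(r1) >= 0.5:
        return None
    return (-1.0 + math.sqrt(1.0 - 4.0 * r1 * r1)) / (2.0 * r1)

print(ma1_mom(0.458))   # Gota River value of r1; about -0.654
print(ma1_mom(0.6))     # None: the estimate does not exist
```

Returning None rather than raising an error is a design choice of this sketch; it makes it easy to count how often the estimator fails to exist in a simulation.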
NOTE : For higher order MA models, the difficulties become more pronounced. For the
general MA(q) case, we are left to solve the highly nonlinear system
$$\rho_k = \frac{-\theta_k + \theta_1 \theta_{k+1} + \theta_2 \theta_{k+2} + \cdots + \theta_{q-k} \theta_q}{1 + \theta_1^2 + \theta_2^2 + \cdots + \theta_q^2} \overset{\text{set}}{=} r_k, \quad k = 1, 2, ..., q - 1$$
$$\rho_q = \frac{-\theta_q}{1 + \theta_1^2 + \theta_2^2 + \cdots + \theta_q^2} \overset{\text{set}}{=} r_q,$$
for θ1 , θ2 , ..., θq . Just as in the MA(1) case, there will likely be multiple solutions,
of which at most one will correspond to a fitted invertible model.
IMPORTANT : MOM estimates are not recommended for use with MA models. They
are hard to obtain and (as we will see) they are not necessarily “good” estimates.
7.2.3 Mixed ARMA models
ARMA(1,1): Consider the ARMA(1,1) process
Yt = ϕYt−1 + et − θet−1 ,
where {et } is zero mean white noise with var(et ) = σe2 . In this model, there are three
parameters: ϕ, θ, and σe2 . Recall from Chapter 4 that
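$$\rho_k = \left[ \frac{(1 - \theta\phi)(\phi - \theta)}{1 - 2\theta\phi + \theta^2} \right] \phi^{k-1}.$$
It follows directly that
$$\frac{\rho_2}{\rho_1} = \phi.$$
Setting ρ1 = r1 and ρ2 = r2 , the MOM estimator of ϕ is given by
$$\hat{\phi} = \frac{r_2}{r_1}.$$
The MOM estimator of θ then solves
$$r_1 = \frac{(1 - \theta\hat{\phi})(\hat{\phi} - \theta)}{1 - 2\theta\hat{\phi} + \theta^2}.$$
This is a quadratic equation in θ, so there are two solutions. The invertible solution $\hat{\theta}$ (if
any) is kept; i.e., $\hat{\theta}_{\text{MOM}}(x) = 1 - \hat{\theta}x$ has root x larger than 1 in absolute value.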
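Multiplying out the equation for θ gives the quadratic (r1 − ϕ̂)θ² + (1 + ϕ̂² − 2r1ϕ̂)θ + (r1 − ϕ̂) = 0, whose two roots are reciprocals of one another. A minimal sketch of the resulting estimator in Python (an assumption; the notes use R) follows. The function name and the parameter values used to check it are hypothetical.

```python
import math

def arma11_mom(r1, r2):
    """MOM estimates (phi_hat, theta_hat) for Y_t = phi Y_{t-1} + e_t - theta e_{t-1}.
    theta_hat is None when no invertible real solution exists."""
    phi = r2 / r1
    a = r1 - phi                                # coefficient of theta^2 and constant term
    b = 1.0 + phi * phi - 2.0 * r1 * phi        # coefficient of theta
    if a == 0:
        return phi, 0.0                         # r1 = phi_hat means theta = 0
    disc = b * b - 4.0 * a * a
    if disc < 0:
        return phi, None
    roots = [(-b + s * math.sqrt(disc)) / (2.0 * a) for s in (1.0, -1.0)]
    invertible = [t for t in roots if abs(t) < 1.0]
    return phi, (invertible[0] if invertible else None)

# Check against population autocorrelations for hypothetical phi = 0.6, theta = 0.3
rho1 = (1 - 0.3 * 0.6) * (0.6 - 0.3) / (1 - 2 * 0.3 * 0.6 + 0.3 ** 2)
rho2 = 0.6 * rho1
print(arma11_mom(rho1, rho2))   # recovers approximately (0.6, 0.3)
```

Feeding in the exact population autocorrelations is only a consistency check on the algebra; with sample autocorrelations the estimates will of course differ from the true values.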
7.2.4 White noise variance
GOAL: We now wish to estimate the white noise variance $\sigma_e^2$. To do this, we first note
that for any stationary ARMA model, the process variance γ0 = var(Yt ) can be estimated
by the sample variance
$$S^2 = \frac{1}{n - 1} \sum_{t=1}^{n} (Y_t - \bar{Y})^2.$$
• For a general AR(p) process, we recall from Chapter 4 that
$$\gamma_0 = \frac{\sigma_e^2}{1 - \phi_1 \rho_1 - \phi_2 \rho_2 - \cdots - \phi_p \rho_p} \implies \sigma_e^2 = (1 - \phi_1 \rho_1 - \phi_2 \rho_2 - \cdots - \phi_p \rho_p)\gamma_0.$$
Therefore, the MOM estimator of $\sigma_e^2$ is obtained by substituting in $\hat{\phi}_k$ for ϕk , rk
for ρk , and S² for γ0 . We obtain
$$\hat{\sigma}_e^2 = (1 - \hat{\phi}_1 r_1 - \hat{\phi}_2 r_2 - \cdots - \hat{\phi}_p r_p) S^2.$$
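• For a general MA(q) process, we recall from Chapter 4 that
$$\gamma_0 = (1 + \theta_1^2 + \theta_2^2 + \cdots + \theta_q^2)\sigma_e^2 \implies \sigma_e^2 = \frac{\gamma_0}{1 + \theta_1^2 + \theta_2^2 + \cdots + \theta_q^2}.$$
Therefore, the MOM estimator of $\sigma_e^2$ is obtained by substituting in $\hat{\theta}_k$ for θk and
S² for γ0 . We obtain
$$\hat{\sigma}_e^2 = \frac{S^2}{1 + \hat{\theta}_1^2 + \hat{\theta}_2^2 + \cdots + \hat{\theta}_q^2}.$$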
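• For an ARMA(1,1) process,
$$\gamma_0 = \left( \frac{1 - 2\phi\theta + \theta^2}{1 - \phi^2} \right) \sigma_e^2 \implies \sigma_e^2 = \left( \frac{1 - \phi^2}{1 - 2\phi\theta + \theta^2} \right) \gamma_0.$$
Therefore, the MOM estimator of $\sigma_e^2$ is obtained by substituting in $\hat{\theta}$ for θ, $\hat{\phi}$ for ϕ,
and S² for γ0 . We obtain
$$\hat{\sigma}_e^2 = \left( \frac{1 - \hat{\phi}^2}{1 - 2\hat{\phi}\hat{\theta} + \hat{\theta}^2} \right) S^2.$$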
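The three variance formulas above are plug-in calculations. A minimal sketch in Python (an assumption; the notes use R) is given below, with hypothetical function names. The numerical inputs are the Lake Huron and Göta River statistics reported in Examples 7.2 and 7.3 of these notes.

```python
def sigma2_ar(phis, rs, s2):
    """AR(p): (1 - phi_1 r_1 - ... - phi_p r_p) S^2."""
    return (1.0 - sum(p * r for p, r in zip(phis, rs))) * s2

def sigma2_ma(thetas, s2):
    """MA(q): S^2 / (1 + theta_1^2 + ... + theta_q^2)."""
    return s2 / (1.0 + sum(t * t for t in thetas))

def sigma2_arma11(phi, theta, s2):
    """ARMA(1,1): [(1 - phi^2) / (1 - 2 phi theta + theta^2)] S^2."""
    return (1.0 - phi * phi) / (1.0 - 2.0 * phi * theta + theta * theta) * s2

print(sigma2_ar([0.831], [0.831], 1.783978))                  # Huron AR(1)
print(sigma2_ar([0.959, -0.154], [0.831, 0.643], 1.783978))   # Huron AR(2)
print(sigma2_ma([-0.654], 9457.164))                          # Gota MA(1)
```

These calls reproduce, up to rounding, the hand calculations of approximately 0.552, 0.539, and 6624 shown in the examples.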
7.2.5 Examples
Example 7.1. Suppose {et } is zero mean white noise with var(et ) = σe2 . In this example,
we use Monte Carlo simulation to approximate the sampling distributions of the MOM
estimators of θ and σe2 in the MA(1) model
Yt = et − θet−1 .
We take θ = 0.7, σe2 = 1, and n = 100. Recall that the MOM approach is generally not
recommended for use with MA models. We will now see why this is true.
• We simulate M = 2000 MA(1) time series, each of length n = 100, with θ = 0.7
and σe2 = 1.
[Figure 7.1 image omitted. Two histograms of frequency: estimates of θ (left) and estimates of $\sigma_e^2$ (right).]
Figure 7.1: Monte Carlo simulation. Left: Histogram of MOM estimates of θ in the
MA(1) model. Right: Histogram of MOM estimates of σe2 . The true values are θ = 0.7
and σe2 = 1. The sample size is n = 100.
• For each simulated series, we compute the MA(1) MOM estimates
$$\hat{\theta} = \frac{-1 + \sqrt{1 - 4r_1^2}}{2r_1}, \qquad \hat{\sigma}_e^2 = \frac{S^2}{1 + \hat{\theta}^2},$$
if they exist. Recall the formula for $\hat{\theta}$ only makes sense when |r1 | < 0.5.
• Of the M = 2000 simulated series, only 1388 produced a value of |r1 | < 0.5. For
the other 612 simulated series, the MOM estimates do not exist (therefore, the
histograms in Figure 7.1 contain only 1388 estimates).
• The Monte Carlo distribution of θb illustrates why MOM estimation is not recom-
mended for MA models. The sampling distribution is not even centered at the true
value of θ = 0.7. The MOM estimator θb is negatively biased.
• The Monte Carlo distribution of $\hat{\sigma}_e^2$ is slightly skewed to the right with mean larger
than $\sigma_e^2 = 1$. The MOM estimator $\hat{\sigma}_e^2$ looks to be slightly positively biased.
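A simulation of this kind can be sketched as follows. The code is in Python (an assumption; the notes use R), the function names and the seed are hypothetical, and the random number stream differs from the one behind Figure 7.1, so the counts and averages will not match the notes exactly.

```python
import math
import random

def simulate_ma1(n, theta, rng):
    """One MA(1) series Y_t = e_t - theta e_{t-1} with N(0, 1) white noise."""
    e = [rng.gauss(0.0, 1.0) for _ in range(n + 1)]
    return [e[t] - theta * e[t - 1] for t in range(1, n + 1)]

def lag1_autocorr(y):
    """Lag one sample autocorrelation r_1."""
    n = len(y)
    ybar = sum(y) / n
    num = sum((y[t] - ybar) * (y[t - 1] - ybar) for t in range(1, n))
    return num / sum((v - ybar) ** 2 for v in y)

def mom_monte_carlo(M=2000, n=100, theta=0.7, seed=520):
    """Return the MOM estimates of theta and sigma_e^2 from the series where they exist."""
    rng = random.Random(seed)
    theta_hats, sigma2_hats = [], []
    for _ in range(M):
        y = simulate_ma1(n, theta, rng)
        r1 = lag1_autocorr(y)
        if 0 < abs(r1) < 0.5:                      # estimate exists
            th = (-1.0 + math.sqrt(1.0 - 4.0 * r1 * r1)) / (2.0 * r1)
            ybar = sum(y) / n
            s2 = sum((v - ybar) ** 2 for v in y) / (n - 1)
            theta_hats.append(th)
            sigma2_hats.append(s2 / (1.0 + th * th))
    return theta_hats, sigma2_hats

theta_hats, sigma2_hats = mom_monte_carlo()
print("series with an estimate:", len(theta_hats))
print("mean theta_hat:", round(sum(theta_hats) / len(theta_hats), 3))
print("mean sigma2_hat:", round(sum(sigma2_hats) / len(sigma2_hats), 3))
```

The number of series for which the estimate exists, and the averages of the two sets of estimates, can then be compared with the bullets above.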
[Figure 7.2 image omitted. Time series plot of elevation level (in feet) versus year.]
Figure 7.2: Lake Huron data. Average July water surface elevation (measured in feet)
during 1880-2006.
Example 7.2. Data file: huron. Figure 7.2 displays the average July water surface
elevation (measured in feet) from 1880-2006 at Harbor Beach, Michigan, on Lake Huron.
The sample ACF and PACF for the series, both given in Figure 7.3, suggest that an
AR(1) model or possibly an AR(2) model may be appropriate.
AR(1): First, we consider the AR(1) model
Yt − µ = ϕ(Yt−1 − µ) + et .
Note that this model includes a parameter µ for the overall mean. By inspection, it is
clear that {Yt } is not a zero mean process. I used R to compute the sample statistics
r1 = 0.831, r2 = 0.643, ȳ = 579.309, s² = 1.783978.
For these data, the AR(1) MOM estimate of ϕ is
$$\hat{\phi} = r_1 = 0.831.$$
[Figure 7.3 image omitted. Left panel: Sample ACF versus lag. Right panel: Sample PACF (partial ACF) versus lag.]
Figure 7.3: Lake Huron data. Left: Sample ACF. Right: Sample PACF.
Using ȳ as an (unbiased) estimate of µ, the fitted AR(1) model is
Yt − 579.309 = 0.831(Yt−1 − 579.309) + et ,
or, equivalently (after simplifying),
Yt = 97.903 + 0.831Yt−1 + et .
The AR(1) MOM estimate of the white noise variance is
$$\hat{\sigma}_e^2 = (1 - \hat{\phi} r_1) s^2 = [1 - (0.831)(0.831)](1.783978) \approx 0.552.$$
We can have R automate the estimation process. Here is the output:
> ar(huron,order.max=1,AIC=F,method='yw') # method of moments

Coefficients:
1
0.8315
Order selected 1 sigma^2 estimated as 0.5551
AR(2): Consider the AR(2) model
Yt − µ = ϕ1 (Yt−1 − µ) + ϕ2 (Yt−2 − µ) + et .
For these data, the AR(2) MOM estimates of ϕ1 and ϕ2 are
$$\hat{\phi}_1 = \frac{r_1(1 - r_2)}{1 - r_1^2} = \frac{0.831(1 - 0.643)}{1 - (0.831)^2} \approx 0.959$$
$$\hat{\phi}_2 = \frac{r_2 - r_1^2}{1 - r_1^2} = \frac{0.643 - (0.831)^2}{1 - (0.831)^2} \approx -0.154$$
so the fitted AR(2) model is
Yt − 579.309 = 0.959(Yt−1 − 579.309) − 0.154(Yt−2 − 579.309) + et ,
or, equivalently (after simplifying),
Yt = 112.965 + 0.959Yt−1 − 0.154Yt−2 + et .
The AR(2) MOM estimate of the white noise variance is
$$\hat{\sigma}_e^2 = (1 - \hat{\phi}_1 r_1 - \hat{\phi}_2 r_2) s^2 = [1 - (0.959)(0.831) - (-0.154)(0.643)](1.783978) \approx 0.539.$$
In R, fitting the AR(2) model gives
> ar(huron,order.max=2,AIC=F,method='yw') # method of moments

Coefficients:
1 2
0.9617 -0.1567
Order selected 2 sigma^2 estimated as 0.5458
REMARK : Note that there are minor differences in the estimates obtained “by hand”
and those from using R’s automated procedure. These are likely due to rounding error
and/or computational errors (e.g., in solving the Yule Walker equations, etc.). It should
also be noted that the R command ar(huron,order.max=1,AIC=F,method='yw') fits
the model (via MOM) by centering all observations first about an estimate of the overall
mean. This is why no “intercept” output is given.
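The "after simplifying" step in the fitted AR(1) and AR(2) models amounts to converting the mean-centered form into an intercept form: the intercept equals µ(1 − ϕ1 − · · · − ϕp). A minimal sketch in Python (an assumption; the notes use R), with a hypothetical function name:

```python
def ar_intercept(mu, phis):
    """Intercept of Y_t = intercept + phi_1 Y_{t-1} + ... + phi_p Y_{t-p} + e_t
    implied by the centered form with mean mu."""
    return mu * (1.0 - sum(phis))

print(round(ar_intercept(579.309, [0.831]), 3))           # AR(1) fit
print(round(ar_intercept(579.309, [0.959, -0.154]), 3))   # AR(2) fit
```

The two values agree with the intercepts 97.903 and 112.965 in the fitted Lake Huron models.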
7.3 Least squares estimation
REMARK : The MOM approach to estimation in stationary ARMA models is not always
satisfactory. In fact, the authors of your text recommend avoiding MOM estimation in any
model with moving average components. We therefore consider other estimation approaches,
starting with conditional least squares (CLS).
7.3.1 Autoregressive models
AR(1): Consider the stationary AR(1) model
Yt − µ = ϕ(Yt−1 − µ) + et ,
where a nonzero mean µ = E(Yt ) has been added for flexibility. For this model,
the conditional sum of squares function is
$$S_C(\phi, \mu) = \sum_{t=2}^{n} [(Y_t - \mu) - \phi(Y_{t-1} - \mu)]^2.$$
• With a sample of time series data Y1 , Y2 , ..., Yn , note that the t = 1 term does not
make sense because there is no Y0 observation.
• The principle of least squares says to choose the values of ϕ and µ that will minimize
SC (ϕ, µ).
For the AR(1) model, this amounts to solving
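$$\frac{\partial S_C(\phi, \mu)}{\partial \phi} \overset{\text{set}}{=} 0, \qquad \frac{\partial S_C(\phi, \mu)}{\partial \mu} \overset{\text{set}}{=} 0$$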
for ϕ and µ. This is a multivariate calculus problem and the details of its solution are
shown on pp 154-155 (CC).
• In the AR(1) model, the CLS estimators are
$$\hat{\phi} = \frac{\sum_{t=2}^{n} (Y_t - \bar{Y})(Y_{t-1} - \bar{Y})}{\sum_{t=2}^{n} (Y_{t-1} - \bar{Y})^2}, \qquad \hat{\mu} \approx \bar{Y}.$$
• For this AR(1) model, the CLS estimator $\hat{\phi}$ is approximately equal to r1 , the lag
one sample autocorrelation (the only difference is that the denominator sums
$(Y_{t-1} - \bar{Y})^2$ over t = 2, ..., n and so does not include the $(Y_n - \bar{Y})^2$ term). We
would therefore expect the difference between $\hat{\phi}$ and r1 (the MOM estimator) to be
negligible when the sample size n is large.
• The CLS estimator $\hat{\mu}$ is only approximately equal to the sample mean $\bar{Y}$, but the
approximation should be adequate when the sample size n is large.
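The closeness of the CLS estimate and r1 can be seen on simulated data. This is a minimal sketch in Python (an assumption; the notes use R), with hypothetical function names and a simulated series that is not from the notes. The CLS denominator here sums the squared deviations of Yt−1 over t = 2, ..., n, which is what minimizing the conditional sum of squares over ϕ gives once µ is replaced by the sample mean.

```python
import random

def cls_ar1(y):
    """Approximate CLS estimates (phi_hat, mu_hat) for the AR(1) model, using mu_hat = ybar."""
    n = len(y)
    ybar = sum(y) / n
    num = sum((y[t] - ybar) * (y[t - 1] - ybar) for t in range(1, n))
    den = sum((y[t - 1] - ybar) ** 2 for t in range(1, n))
    return num / den, ybar

def lag1_autocorr(y):
    """Lag one sample autocorrelation r_1 (the MOM estimate of phi)."""
    n = len(y)
    ybar = sum(y) / n
    num = sum((y[t] - ybar) * (y[t - 1] - ybar) for t in range(1, n))
    return num / sum((v - ybar) ** 2 for v in y)

# Simulated AR(1) with mean 10 and phi = 0.8 (illustrative values)
rng = random.Random(7)
mu, phi = 10.0, 0.8
y, dev = [], 0.0
for _ in range(500):
    dev = phi * dev + rng.gauss(0.0, 1.0)
    y.append(mu + dev)

phi_hat, mu_hat = cls_ar1(y)
r1 = lag1_autocorr(y)
print(round(phi_hat, 4), round(r1, 4), round(mu_hat, 3))
```

The two estimates of ϕ share the same numerator, so they differ only through the one squared deviation left out of the CLS denominator.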
AR(p): In the general AR(p) model, the conditional sum of squares function is
$$S_C(\phi_1, \phi_2, ..., \phi_p, \mu) = \sum_{t=p+1}^{n} [(Y_t - \mu) - \phi_1(Y_{t-1} - \mu) - \phi_2(Y_{t-2} - \mu) - \cdots - \phi_p(Y_{t-p} - \mu)]^2,$$
a function of p + 1 parameters. The sum starts at t = p + 1 because estimates are based
on the sample Y1 , Y2 , ..., Yn (terms with t ≤ p would involve unobserved values such as Y0 ).
Despite being more complex, the CLS estimators are found
in the same way, that is, ϕ1 , ϕ2 , ..., ϕp and µ are chosen to minimize SC (ϕ1 , ϕ2 , ..., ϕp , µ).
The CLS estimator of µ is
$$\hat{\mu} \approx \bar{Y},$$
an approximation when n is large (i.e., much larger than p). The CLS estimators for ϕ1 ,
ϕ2 , ..., ϕp are well approximated by the solutions to the sample Yule-Walker equations:
$$\begin{aligned}
r_1 &= \phi_1 + \phi_2 r_1 + \phi_3 r_2 + \cdots + \phi_p r_{p-1} \\
r_2 &= \phi_1 r_1 + \phi_2 + \phi_3 r_1 + \cdots + \phi_p r_{p-2} \\
&\;\;\vdots \\
r_p &= \phi_1 r_{p-1} + \phi_2 r_{p-2} + \phi_3 r_{p-3} + \cdots + \phi_p.
\end{aligned}$$
Therefore, in stationary AR models, the MOM and CLS estimates should be approxi-
mately equal.
7.3.2 Moving average models
MA(1): We first consider the zero mean invertible MA(1) model
Yt = et − θet−1 ,
where {et } is a zero mean white noise process. Recall from Chapter 4 that we can rewrite
an invertible MA(1) model as an infinite-order AR model; i.e.,
$$Y_t = \underbrace{-\theta Y_{t-1} - \theta^2 Y_{t-2} - \theta^3 Y_{t-3} - \cdots}_{\text{“AR}(\infty)\text{”}} + e_t.$$
Therefore, the CLS estimator of θ is the value of θ which minimizes
$$S_C(\theta) = \sum e_t^2 = \sum (Y_t + \theta Y_{t-1} + \theta^2 Y_{t-2} + \theta^3 Y_{t-3} + \cdots)^2.$$
Unfortunately, minimizing SC (θ) as stated is not a practical exercise, because we have

only the observed sample Y1 , Y2 , ..., Yn . We therefore rewrite the MA(1) model as
et = Yt + θet−1 ,
and take e0 ≡ 0. Then, conditional on e0 = 0, we can write
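$$\begin{aligned}
e_1 &= Y_1 \\
e_2 &= Y_2 + \theta e_1 \\
e_3 &= Y_3 + \theta e_2 \\
&\;\;\vdots \\
e_n &= Y_n + \theta e_{n-1}.
\end{aligned}$$
Using these expressions for e1 , e2 , ..., en , we can now find the value of θ that minimizes
$$S_C(\theta) = \sum_{t=1}^{n} e_t^2.$$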
This minimization problem can be carried out numerically, searching over a grid of θ
values in (−1, 1) and selecting the value of θ that produces the smallest possible SC (θ).
This minimizer is the CLS estimator of θ in the MA(1) model.
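The recursion and grid search just described can be sketched in a few lines. This is a minimal illustration in Python (an assumption; the notes use R's `arima` with the CSS method, which uses a numerical optimizer and not a grid). The function names, grid step, and simulated data are all hypothetical.

```python
import random

def css_ma1(theta, y):
    """Conditional sum of squares S_C(theta) for a zero mean MA(1), taking e_0 = 0."""
    e_prev, total = 0.0, 0.0
    for obs in y:
        e_t = obs + theta * e_prev      # e_t = Y_t + theta e_{t-1}
        total += e_t * e_t
        e_prev = e_t
    return total

def cls_ma1(y, step=0.001):
    """Grid search for the minimizer of S_C(theta) over theta in (-1, 1)."""
    grid = [-0.999 + step * i for i in range(int(round(1.998 / step)) + 1)]
    return min(grid, key=lambda th: css_ma1(th, y))

# Simulated zero mean MA(1) with theta = 0.7 (illustrative data)
rng = random.Random(11)
e = [rng.gauss(0.0, 1.0) for _ in range(501)]
y = [e[t] - 0.7 * e[t - 1] for t in range(1, 501)]

theta_hat = cls_ma1(y)
print(round(theta_hat, 3))
```

A grid is the simplest way to mirror the description in the text; a finer step or a one-dimensional optimizer would refine the estimate beyond three decimals.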
MA(q): The technique just described for MA(1) estimation via CLS can be carried
out for any higher-order MA(q) model in the same fashion. When q > 1, the problem
becomes finding the values of θ1 , θ2 , ..., θq such that
$$S_C(\theta_1, \theta_2, ..., \theta_q) = \sum_{t=1}^{n} e_t^2 = \sum_{t=1}^{n} (Y_t + \theta_1 e_{t-1} + \theta_2 e_{t-2} + \cdots + \theta_q e_{t-q})^2$$
is minimized, subject to the initial conditions that e0 = e−1 = · · · = e−q = 0. This can
be done numerically, searching over all possible values of θ1 , θ2 , ..., θq which yield an
invertible solution.
7.3.3 Mixed ARMA models
ARMA(1,1): We again consider only the zero mean ARMA(1,1) process
Yt = ϕYt−1 + et − θet−1 ,
where {et } is zero mean white noise. We first rewrite the model as
et = Yt − ϕYt−1 + θet−1 ,
with the goal of minimizing
$$S_C(\phi, \theta) = \sum_{t=1}^{n} e_t^2.$$
There are now two “startup” problems, namely, specifying values for e0 and Y0 . The
authors of your text recommend avoiding specifying Y0 , taking e1 = 0, and minimizing
$$S_C^*(\phi, \theta) = \sum_{t=2}^{n} e_t^2$$
with respect to ϕ and θ instead. Similar modification is recommended for ARMA models
when p > 1 and/or when q > 1. See pp 157-158 (CC).
7.3.4 White noise variance
NOTE : Nothing changes with our formulae for the white noise variance estimates that
we saw previously when discussing the MOM approach. The only difference is that now
CLS estimates for the ϕ’s and θ’s are used in place of MOM estimates.
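• AR(p):
$$\hat{\sigma}_e^2 = (1 - \hat{\phi}_1 r_1 - \hat{\phi}_2 r_2 - \cdots - \hat{\phi}_p r_p) S^2.$$
• MA(q):
$$\hat{\sigma}_e^2 = \frac{S^2}{1 + \hat{\theta}_1^2 + \hat{\theta}_2^2 + \cdots + \hat{\theta}_q^2}.$$
• ARMA(1,1):
$$\hat{\sigma}_e^2 = \left( \frac{1 - \hat{\phi}^2}{1 - 2\hat{\phi}\hat{\theta} + \hat{\theta}^2} \right) S^2.$$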
7.3.5 Examples
Example 7.3. Data file: gota. The Göta River is located in western Sweden near
Göteborg. The annual discharge rates (volume, measured in m³/s) from 1807-1956 are
depicted in Figure 7.4. The sample ACF and PACF are given in Figure 7.5.
• The sample ACF suggests that an MA(1) model
Yt = µ + et − θet−1
is worth considering. Note that this model includes an intercept term µ for the
overall mean. Clearly, {Yt } is not a zero mean process.
• The sample PACF suggests that an AR(2) model
Yt − µ = ϕ1 (Yt−1 − µ) + ϕ2 (Yt−2 − µ) + et
is also worth considering.
PAGE 189
[Figure 7.4 graphic: time series plot of water discharge rate (vertical axis) against year (horizontal axis).]
Figure 7.4: Göta River data. Water flow discharge rates (volume, measured in m3 /s)
from 1807-1956.
• We will fit an MA(1) model in this example using both MOM and CLS.
MOM: I used R to compute the following: r1 = 0.458, ȳ = 535.4641, and s² = 9457.164.
For the Göta River discharge data, the MOM estimate of θ is
$$\hat{\theta} = \frac{-1 + \sqrt{1 - 4r_1^2}}{2r_1} = \frac{-1 + \sqrt{1 - 4(0.458)^2}}{2(0.458)} \approx -0.654.$$
Therefore, the fitted MA(1) model for the discharge rate process is
Yt = 535.4641 + et + 0.654et−1 .
The white noise variance is estimated to be
$$\hat{\sigma}_e^2 = \frac{s^2}{1 + \hat{\theta}^2} = \frac{9457.164}{1 + (-0.654)^2} \approx 6624.$$
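The MOM arithmetic above can be reproduced in R from the reported summary statistics. The helper name mom_ma1 is hypothetical and introduced only for this sketch; the inputs r1 and s2 are the values quoted in the text.

```r
# MOM estimate of theta in an MA(1) model: the invertible root of
# r1 = -theta / (1 + theta^2). A real solution requires 0 < |r1| < 0.5.
mom_ma1 <- function(r1) {
  stopifnot(r1 != 0, abs(r1) < 0.5)
  (-1 + sqrt(1 - 4 * r1^2)) / (2 * r1)
}

r1 <- 0.458       # lag-1 sample autocorrelation (from the text)
s2 <- 9457.164    # sample variance (from the text)

theta_hat <- mom_ma1(r1)
sigma2_hat <- s2 / (1 + theta_hat^2)   # white noise variance estimate
round(theta_hat, 3)
```

Because theta_hat is carried at full precision here, sigma2_hat differs slightly (by roughly 1 to 2 units) from the value obtained in the text with the rounded estimate −0.654.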
PAGE 190
[Figure 7.5 graphic: two panels, sample ACF (left) and sample partial ACF (right), each plotted against lag (1 to 20).]
Figure 7.5: Göta River data. Left: Sample ACF. Right: Sample PACF.
CLS: Here is the R output summarizing the CLS fit:
> arima(gota,order=c(0,0,1),method='CSS') # conditional least squares

Coefficients:
ma1 intercept
0.5353 534.7199
s.e. 0.0593 10.4303
sigma^2 estimated as 6973: part log likelihood = -876.57
The CLS estimates are $\hat{\theta} = -0.5353$ (remember, R negates MA parameters/estimates)
and $\hat{\mu} = 534.7199$, which gives the fitted MA(1) model
Yt = 534.7199 + et + 0.5353et−1 .
The white noise variance estimate is $\hat{\sigma}_e^2 \approx 6973$.
• The R output gives estimated standard errors of the CLS estimates, so we can
assess their significance.
• We will learn later that CLS estimates are approximately normal in large samples.
PAGE 191
• Therefore, an approximate 95 percent confidence interval for θ is
−0.5353 ± 1.96(0.0593) =⇒ (−0.652, −0.419).
We are 95 percent confident that θ is between −0.652 and −0.419. Note that this
confidence interval does not include 0.
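The interval above is a standard "estimate ± z × standard error" calculation, which can be checked in R. The helper name wald_ci is hypothetical; the estimate and standard error are the values from the CLS output, with the sign of the MA estimate flipped to match the notes' parameterization.

```r
# Approximate large-sample interval: estimate +/- z * (estimated standard error)
wald_ci <- function(est, se, level = 0.95) {
  z <- qnorm(1 - (1 - level) / 2)   # about 1.96 for a 95 percent interval
  c(lower = est - z * se, upper = est + z * se)
}

# theta_hat = -0.5353 (R reports ma1 = 0.5353), s.e. = 0.0593
ci_theta <- wald_ci(-0.5353, 0.0593)
round(ci_theta, 3)
```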
COMPARISON : It is instructive to compare the MOM and CLS estimates for the Göta
River discharge data. This comparison (to 3 decimal places) is summarized below.
Method    $\hat{\mu}$    $\hat{\theta}$    $\hat{\sigma}_e^2$
MOM       535.464        −0.654            6624
CLS       534.720        −0.535            6973
• The estimates for µ are very close. The MOM estimate is equal to ȳ, whereas the
CLS estimate is only approximately equal to ȳ. See pp 155 (CC).
• The estimates for θ are not close. As previously mentioned, the MOM approach
for MA models is generally not recommended.
• The estimates for σe2 are notably different as well.
Example 7.4. We now revisit the Lake Huron water surface elevation data in Example
7.2 and use R to fit AR(1) and AR(2) models
Yt − µ = ϕ(Yt−1 − µ) + et
and
Yt − µ = ϕ1 (Yt−1 − µ) + ϕ2 (Yt−2 − µ) + et ,
respectively, using conditional least squares (CLS). Recall that in Example 7.2 we fit
both the AR(1) and AR(2) models using MOM.
AR(1): Here is the R output summarizing the CLS fit:
PAGE 192
> arima(huron,order=c(1,0,0),method='CSS') # conditional least squares

Coefficients:
ar1 intercept
0.8459 579.2788
s.e. 0.0469 0.4027
sigma^2 estimated as 0.489: part log likelihood = -134.77
The fitted AR(1) model, using CLS, is
Yt − 579.2788 = 0.8459(Yt−1 − 579.2788) + et ,
or, equivalently (to 3 decimal places),
Yt = 89.267 + 0.846Yt−1 + et .
The white noise variance estimate, using CLS, is $\hat{\sigma}_e^2 \approx 0.489$.
AR(2): Here is the R output summarizing the CLS fit:
> arima(huron,order=c(2,0,0),method='CSS') # conditional least squares

Coefficients:
ar1 ar2 intercept
0.9874 -0.1702 579.2691
s.e. 0.0878 0.0871 0.3355
The fitted AR(2) model, using CLS, is
Yt − 579.2691 = 0.9874(Yt−1 − 579.2691) − 0.1702(Yt−2 − 579.2691) + et ,
or, equivalently (to 3 decimal places),
Yt = 105.890 + 0.987Yt−1 − 0.170Yt−2 + et .
The white noise variance estimate, using CLS, is $\hat{\sigma}_e^2 \approx 0.4776$.
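The "equivalent" forms above come from expanding the mean-centered model, which gives an intercept of µ(1 − ϕ1 − · · · − ϕp). A short R check using the CLS estimates reported above is sketched here; the helper name ar_intercept is hypothetical.

```r
# Intercept of the expanded AR(p) model: theta_0 = mu * (1 - phi_1 - ... - phi_p)
ar_intercept <- function(mu, phi) mu * (1 - sum(phi))

theta0_ar1 <- ar_intercept(579.2788, 0.8459)              # AR(1) CLS fit
theta0_ar2 <- ar_intercept(579.2691, c(0.9874, -0.1702))  # AR(2) CLS fit
round(c(AR1 = theta0_ar1, AR2 = theta0_ar2), 3)
```

Note that the "intercept" R reports for arima fits is the estimate of the process mean µ, not this intercept term, which is why the conversion is needed.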
PAGE 193
COMPARISON : It is instructive to compare the MOM and CLS estimates for the Lake
Huron data. This comparison (to 3 decimal places) is summarized below.
          AR(1)                                          AR(2)
Method    $\hat{\mu}$   $\hat{\phi}$   $\hat{\sigma}_e^2$     $\hat{\mu}$   $\hat{\phi}_1$   $\hat{\phi}_2$   $\hat{\sigma}_e^2$
MOM       579.309       0.831          0.552                  579.309       0.959            −0.154           0.539
CLS       579.279       0.846          0.489                  579.269       0.987            −0.170           0.478
• Note that the MOM and CLS estimates for µ and the ϕ’s are largely in agreement.
This is common in purely AR models (not in models with MA components).
QUESTION : For the Lake Huron data, which model is preferred: AR(1) or AR(2)?
• The $\hat{\sigma}_e^2$ estimate is slightly smaller in the AR(2) fit, but only marginally.
• Using the CLS estimates, note that an approximate 95 percent confidence interval
for ϕ2 in the AR(2) model is
−0.1702 ± 1.96(0.0871) =⇒ (−0.341, 0.001).
This interval does (barely) include 0, indicating that $\hat{\phi}_2$ is not statistically different
from 0.
• Note also that the estimated standard error of $\hat{\phi}_1$ (in the CLS output) is almost
twice as large in the AR(2) model as in the AR(1) model. Reason: When we fit
a higher-order model, we lose precision in the other model estimates (especially if
the higher-order terms are not needed).
• It is worth noting that the AR(1) model is the ARMA model identified as having
the smallest BIC (using armasubsets in R; see Chapter 6).
• For the last three reasons, and with an interest in being parsimonious, I would pick
the AR(1) if I had to choose between the two.
PAGE 194
[Figure 7.6 graphic: time series plot of blood sugar level (mg/100ml blood, vertical axis) against days (horizontal axis).]
Figure 7.6: Bovine blood sugar data. Blood sugar levels (mg/100ml blood) for a single
cow measured for n = 176 consecutive days.
Example 7.5. Data file: cows. The data in Figure 7.6 represent daily blood sugar
concentrations (measured in mg/100ml of blood) on a single cow being dosed intramuscularly
with 10 mg of dexamethasone (commonly given to increase milk production).
• The sample ACF in Figure 7.7 shows an AR-type decay, while the PACF in Figure
7.7 also shows an MA-type (oscillating) decay with “spikes” at the first three lags.
• ARMA(1,1) and AR(3) models are consistent with the sample ACF/PACF.
Consider using an ARMA(1,1) model
Yt − µ = ϕ(Yt−1 − µ) + et − θet−1
to represent this process. Note that we have added an overall mean µ parameter in the
model. Clearly, {Yt } is not a zero mean process. Therefore, there are three parameters
to estimate and we do so using conditional least squares (CLS).
PAGE 195
[Figure 7.7 graphic: two panels, sample ACF (left) and sample partial ACF (right), each plotted against lag (1 to 20).]
Figure 7.7: Bovine blood sugar data. Left: Sample ACF. Right: Sample PACF.
ARMA(1,1): Here is the R output summarizing the CLS fit:
> arima(cows,order=c(1,0,1),method='CSS') # conditional least squares

Coefficients:
ar1 ma1 intercept
0.6625 0.6111 58.7013
s.e. 0.0616 0.0670 1.6192
Therefore, the fitted ARMA(1,1) model is
Yt − 58.7013 = 0.6625(Yt−1 − 58.7013) + et + 0.6111et−1
or, equivalently,
Yt = 19.8117 + 0.6625Yt−1 + et + 0.6111et−1 .
The white noise variance estimate, using CLS, is $\hat{\sigma}_e^2 \approx 20.38$. From examining the
(estimated) standard errors in the output, it is easy to see that both CLS estimates
$\hat{\phi} = 0.6625$ and $\hat{\theta} = -0.6111$ are significantly different from 0.
PAGE 196
7.4 Maximum likelihood estimation
TERMINOLOGY : The method of maximum likelihood is the most commonly used
technique to estimate unknown parameters (not just in time series models, but in nearly
all statistical models).
• An advantage of maximum likelihood in fitting time series models is that parameter

estimates are based on the entire observed sample Y1 , Y2 , ..., Yn . There is no need
to worry about “start up” values.
• Another advantage is that maximum likelihood estimators have very nice large-
sample distributional properties. This makes statistical inference proceed in a
straightforward manner.
• The main disadvantage is that we have to specify a joint probability distribution for
the random variables in the sample. This makes the method more mathematical.
TERMINOLOGY : The likelihood function L is a function that describes the joint

distribution of the data Y1 , Y2 , ..., Yn . However, it is viewed as a function of the model
parameters with the observed data being fixed.
• Therefore, when we maximize the likelihood function with respect to the model
parameters, we are finding the values of the parameters (i.e., the estimates) that
are most consistent with the observed data.
AR(1): To illustrate how maximum likelihood estimates are obtained, consider the
AR(1) model
Yt − µ = ϕ(Yt−1 − µ) + et ,
where {et } is a normal zero mean white noise process with var(et ) = σe2 and where
µ = E(Yt ) is the overall (process) mean. There are three parameters in this model: ϕ, µ,
and σe2 . The probability density function (pdf) of et ∼ N (0, σe2 ) is
$$f(e_t) = \frac{1}{\sqrt{2\pi}\,\sigma_e} \exp(-e_t^2/2\sigma_e^2),$$
PAGE 197
for all −∞ < et < ∞, where exp(·) denotes the exponential function. Because e1 , e2 , ..., en
are independent, the joint pdf of e2 , e3 , ..., en is given by
$$f(e_2, e_3, \ldots, e_n) = \prod_{t=2}^{n} f(e_t) = \prod_{t=2}^{n} \frac{1}{\sqrt{2\pi}\,\sigma_e} \exp(-e_t^2/2\sigma_e^2) = (2\pi\sigma_e^2)^{-(n-1)/2} \exp\left( -\frac{1}{2\sigma_e^2} \sum_{t=2}^{n} e_t^2 \right).$$
To write out the joint pdf of Y = (Y1 , Y2 , ..., Yn ), we can first perform a multivariate
transformation using
Y2 = µ + ϕ(Y1 − µ) + e2
Y3 = µ + ϕ(Y2 − µ) + e3
⋮
Yn = µ + ϕ(Yn−1 − µ) + en ,
with Y1 = y1 (fixed). This will give us the (conditional) joint distribution of Y2 , Y3 , ..., Yn ,
given Y1 = y1 . Applying the laws of conditioning, the joint pdf of Y; i.e., the likelihood
function L ≡ L(ϕ, µ, σe2 |y), is given by
L = L(ϕ, µ, σe2 |y) = f (y2 , y3 , ..., yn |y1 )f (y1 ).
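The details on pp 159 (CC) show that
$$f(y_2, y_3, \ldots, y_n | y_1) = (2\pi\sigma_e^2)^{-(n-1)/2} \exp\left\{ -\frac{1}{2\sigma_e^2} \sum_{t=2}^{n} [(y_t - \mu) - \phi(y_{t-1} - \mu)]^2 \right\}$$
$$f(y_1) = \left[ \frac{1}{2\pi\sigma_e^2/(1 - \phi^2)} \right]^{1/2} \exp\left[ -\frac{(y_1 - \mu)^2}{2\sigma_e^2/(1 - \phi^2)} \right].$$
Multiplying these pdfs and simplifying, we get
$$L = L(\phi, \mu, \sigma_e^2 | y) = (2\pi\sigma_e^2)^{-n/2} (1 - \phi^2)^{1/2} \exp\left[ -\frac{S(\phi, \mu)}{2\sigma_e^2} \right],$$
where
$$S(\phi, \mu) = (1 - \phi^2)(y_1 - \mu)^2 + \sum_{t=2}^{n} [(y_t - \mu) - \phi(y_{t-1} - \mu)]^2.$$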
For this AR(1) model, the maximum likelihood estimators (MLEs) of ϕ, µ, and σe2 are
the values which maximize L(ϕ, µ, σe2 |y).
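As a numerical check on the AR(1) likelihood, the following R sketch codes S(ϕ, µ) and log L directly and evaluates them at the estimates returned by arima with method = "ML". The function names uss_ar1 and loglik_ar1, the simulated series, and the seed are assumptions made for illustration; the expectation that the hand-coded value agrees with the log likelihood reported by arima rests on arima maximizing this same exact Gaussian likelihood.

```r
# Unconditional sum of squares S(phi, mu) for the AR(1) model
uss_ar1 <- function(phi, mu, y) {
  n <- length(y)
  (1 - phi^2) * (y[1] - mu)^2 +
    sum(((y[-1] - mu) - phi * (y[-n] - mu))^2)
}

# Exact Gaussian log-likelihood: log L(phi, mu, sigma2 | y)
loglik_ar1 <- function(phi, mu, sigma2, y) {
  n <- length(y)
  -(n / 2) * log(2 * pi * sigma2) + 0.5 * log(1 - phi^2) -
    uss_ar1(phi, mu, y) / (2 * sigma2)
}

set.seed(520)  # arbitrary seed (assumption)
y <- as.numeric(arima.sim(model = list(ar = 0.8), n = 300)) + 10  # mean 10

fit <- arima(y, order = c(1, 0, 0), method = "ML")
ll_hand <- loglik_ar1(coef(fit)[["ar1"]], coef(fit)[["intercept"]],
                      fit$sigma2, y)
c(by_hand = ll_hand, arima = fit$loglik)
```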
PAGE 198
REMARK : In this AR(1) model, the function S(ϕ, µ) is called the unconditional
sum-of-squares function. Note that when S(ϕ, µ) is viewed as random,
$$S(\phi, \mu) = (1 - \phi^2)(Y_1 - \mu)^2 + S_C(\phi, \mu),$$
where SC (ϕ, µ) is the conditional sum of squares function defined in Section 7.3.1 (notes)
for the same AR(1) model.
• We have already seen in Section 7.3.1 (notes) that the conditional least squares
(CLS) estimates of ϕ and µ are found by minimizing SC (ϕ, µ).
• The unconditional least squares (ULS) estimates of ϕ and µ are found by

minimizing S(ϕ, µ). ULS is regarded as a “compromise” between CLS and the
method of maximum likelihood.
• We will not pursue the ULS approach.
NOTE : The approach to finding MLEs in any stationary ARMA(p, q) model is the same
as what we have just outlined in the special AR(1) case. The likelihood function L
becomes more complex in larger models. However, this turns out not to be a big deal
for us because we will use software to do the estimation. R can compute MLEs in any
stationary ARMA(p, q) model using the arima function. This function also provides
(estimated) standard errors of the MLEs.
DISCUSSION : We have talked about three methods of estimation: method of moments,

least squares (conditional and unconditional), and maximum likelihood. Going forward,
which procedure should we use? To answer this, Box, Jenkins, and Reinsel (1994) write
“Generally, the conditional and unconditional least squares estimators serve
as satisfactory approximations to the maximum likelihood estimator for large
sample sizes. However, simulation evidence suggests a preference for the
maximum likelihood estimator for small or moderate sample sizes, especially if the
moving average operator has a root close to the boundary of the invertibility
region.”
PAGE 199
7.4.1 Large-sample properties of MLEs
THEORY : Suppose that {et } is a normal zero mean white noise process with var(et ) = σe2 .
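Consider a stationary ARMA(p, q) process
$$\phi(B) Y_t = \theta(B) e_t,$$
where
$$\phi(B) = (1 - \phi_1 B - \phi_2 B^2 - \cdots - \phi_p B^p)$$
$$\theta(B) = (1 - \theta_1 B - \theta_2 B^2 - \cdots - \theta_q B^q).$$
The maximum likelihood estimators $\hat{\phi}_j$ and $\hat{\theta}_k$ satisfy
$$\sqrt{n}(\hat{\phi}_j - \phi_j) \stackrel{d}{\longrightarrow} N(0, \sigma^2_{\hat{\phi}_j}), \quad \text{for } j = 1, 2, \ldots, p,$$
and
$$\sqrt{n}(\hat{\theta}_k - \theta_k) \stackrel{d}{\longrightarrow} N(0, \sigma^2_{\hat{\theta}_k}), \quad \text{for } k = 1, 2, \ldots, q,$$
respectively, as n → ∞. In other words, for large n,
$$\hat{\phi}_j \sim \mathcal{AN}(\phi_j, \sigma^2_{\hat{\phi}_j}/n)$$
$$\hat{\theta}_k \sim \mathcal{AN}(\theta_k, \sigma^2_{\hat{\theta}_k}/n),$$
for all j = 1, 2, ..., p and k = 1, 2, ..., q. Implication: Maximum likelihood estimators
are consistent and asymptotically normal.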
SPECIFIC CASES :
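• AR(1).
$$\hat{\phi} \sim \mathcal{AN}\left( \phi, \frac{1 - \phi^2}{n} \right)$$
• AR(2).
$$\hat{\phi}_1 \sim \mathcal{AN}\left( \phi_1, \frac{1 - \phi_2^2}{n} \right)$$
$$\hat{\phi}_2 \sim \mathcal{AN}\left( \phi_2, \frac{1 - \phi_2^2}{n} \right)$$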
PAGE 200
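• MA(1).
$$\hat{\theta} \sim \mathcal{AN}\left( \theta, \frac{1 - \theta^2}{n} \right)$$
• MA(2).
$$\hat{\theta}_1 \sim \mathcal{AN}\left( \theta_1, \frac{1 - \theta_2^2}{n} \right)$$
$$\hat{\theta}_2 \sim \mathcal{AN}\left( \theta_2, \frac{1 - \theta_2^2}{n} \right)$$
• ARMA(1,1).
$$\hat{\phi} \sim \mathcal{AN}\left[ \phi, \frac{c(\phi, \theta)(1 - \phi^2)}{n} \right]$$
$$\hat{\theta} \sim \mathcal{AN}\left[ \theta, \frac{c(\phi, \theta)(1 - \theta^2)}{n} \right],$$
where c(ϕ, θ) = [(1 − ϕθ)/(ϕ − θ)]².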
REMARK : In multi-parameter models; e.g., AR(2), MA(2), ARMA(1,1), etc., the MLEs
are (asymptotically) correlated. This correlation can also be large; see pp 161 (CC) for
further description.
IMPORTANT : The large-sample distributional results above make getting large-sample
confidence intervals for ARMA model parameters easy. For example, an approximate
100(1 − α) percent confidence interval for ϕ in an AR(1) model is
$$\hat{\phi} \pm z_{\alpha/2} \sqrt{\frac{1 - \hat{\phi}^2}{n}}.$$
An approximate 100(1 − α) percent confidence interval for θ in an MA(1) model is
$$\hat{\theta} \pm z_{\alpha/2} \sqrt{\frac{1 - \hat{\theta}^2}{n}}.$$
Note the form of these intervals. In words, the form is
“ML point estimate ± zα/2 (estimated standard error).”
• Approximate confidence intervals for the other ARMA model parameters are
computed in the same way.
PAGE 201
• The nice thing about R is that ML estimates and their (estimated) standard errors
are given in the output (as they were for CLS estimates), so we have to do almost
no calculation by hand.
• Furthermore, examining these confidence intervals can give us information about

which estimates are statistically different from zero. This is a key part of assessing
model adequacy.
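The AR(1) interval formula above can be sketched in R as follows. The function name ci_ar1 and the inputs (an estimate of 0.8 from n = 100 observations) are made-up values chosen only to keep the arithmetic easy to follow; the same function applies to θ in an MA(1) model.

```r
# Large-sample interval for phi in an AR(1) model, using the
# asymptotic standard error sqrt((1 - phi_hat^2) / n).
ci_ar1 <- function(phi_hat, n, alpha = 0.05) {
  se <- sqrt((1 - phi_hat^2) / n)
  z <- qnorm(1 - alpha / 2)
  c(lower = phi_hat - z * se, upper = phi_hat + z * se, se = se)
}

out <- ci_ar1(0.8, 100)   # hypothetical estimate and sample size
round(out, 4)
```

The standard errors printed by arima need not coincide exactly with this formula, since software typically computes them numerically from the fitted likelihood rather than from the closed-form asymptotic variance.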
NOTE : Maximum likelihood estimators (MLEs) and least-squares estimators (both CLS
and ULS) have the same large-sample distributions. Large sample distributions of MOM
estimators can be quite different for purely MA models (although they are the same for
purely AR models). See pp 162 (CC).
7.4.2 Examples
Example 7.6. We revisit the Göta River discharge data in Example 7.3 (notes) and use
R to fit an MA(1) model
Yt = µ + et − θet−1 ,
using the method of maximum likelihood. Here is the output from R:
> arima(gota,order=c(0,0,1),method='ML') # maximum likelihood

Coefficients:
ma1 intercept
0.5350 535.0311
s.e. 0.0594 10.4300
sigma^2 estimated as 6957: log likelihood = -876.58, aic = 1757.15
ESTIMATES : The ML estimates are $\hat{\theta} = -0.5350$ (remember, R negates the MA
parameters/estimates) and $\hat{\mu} = 535.0311$, which gives the fitted model
Yt = 535.0311 + et + 0.5350et−1 .
PAGE 202
The white noise variance estimate, using maximum likelihood, is $\hat{\sigma}_e^2 \approx 6957$.
An approximate 95 percent confidence interval for θ is
−0.5350 ± 1.96(0.0594) =⇒ (−0.651, −0.419).
We are 95 percent confident that θ is between −0.651 and −0.419. This interval is almost
identical to the one based on the CLS estimate; see Example 7.3.
COMPARISON : We compare the estimates from all three methods (MOM, CLS, and
MLE) with the Göta River discharge data. This comparison (to 3 decimal places) is
summarized below.
Method    $\hat{\mu}$    $\hat{\theta}$    $\hat{\sigma}_e^2$
MOM       535.464        −0.654            6624
CLS       534.720        −0.535            6973
MLE       535.031        −0.535            6957
Note that the CLS and ML estimates of θ are identical (to three decimal places). The
MOM estimate of θ is noticeably different. Recall that MOM estimation is not advised
for models with MA components.
Example 7.7. The data in Figure 7.8 (left) are the number of global earthquakes
annually (with intensities of 7.0 or greater) during 1900-1998. Source: Craig Whitlow
(Spring, 2010). We examined these data in Chapter 1 (Example 1.5, pp 6).
• Because the data (number of earthquakes) are “counts,” this suggests that a trans-
formation is needed. The Box-Cox transformation output in Figure 7.8 (right)
shows that λ = 0.5 resides in an approximate 95 percent confidence interval for λ.
Recall that λ = 0.5 corresponds to the square-root transformation.
• R output for the square-root transformed series is given in Figure 7.9. The
armasubsets output, which ranks competing ARMA models according to their
BIC, selects an ARMA(1,1) model. This model is also consistent with the sample
ACF and PACF.
PAGE 203
[Figure 7.8 graphic: left panel, number of earthquakes (7.0 or greater) against year; right panel, Box-Cox log likelihood against λ, with the 95% interval marked.]
Figure 7.8: Earthquake data. Left: Number of “large” earthquakes per year from 1900-
1998. Right: Box-Cox transformation output (profile log-likelihood function of λ).
• We therefore fit an ARMA(1,1) model to the $\{\sqrt{Y_t}\}$ process, that is,
$$\sqrt{Y_t} - \mu = \phi(\sqrt{Y_{t-1}} - \mu) + e_t - \theta e_{t-1}.$$
• We will use maximum likelihood. The R output is given below:
> arima(sqrt(earthquake),order=c(1,0,1),method='ML') # maximum likelihood

Coefficients:
ar1 ma1 intercept
0.8352 -0.4295 4.3591
s.e. 0.0811 0.1277 0.2196
sigma^2 estimated as 0.4294: log likelihood = -98.88, aic = 203.76
For this model, the maximum likelihood estimates based on these data are $\hat{\phi} = 0.8352$,
$\hat{\theta} = 0.4295$, and $\hat{\mu} = 4.3591$. The fitted model is
$$\sqrt{Y_t} - 4.3591 = 0.8352(\sqrt{Y_{t-1}} - 4.3591) + e_t - 0.4295 e_{t-1}$$
PAGE 204
[Figure 7.9 graphic: four panels. Time series plot of the number of earthquakes on the square-root scale; armasubsets BIC display with columns sqrt(e.qu.)-lag1 through lag6, error-lag1 through lag6, and (Intercept); sample ACF; sample partial ACF.]
Figure 7.9: Earthquake data. Upper left: Time series plot of $\{\sqrt{Y_t}\}$ process. Upper
right: armasubsets output (on square-root scale). Lower left: Sample ACF of $\{\sqrt{Y_t}\}$.
Lower right: Sample PACF of $\{\sqrt{Y_t}\}$.
or, equivalently,
$$\sqrt{Y_t} = 0.7184 + 0.8352\sqrt{Y_{t-1}} + e_t - 0.4295 e_{t-1}.$$
The white noise variance estimate, using maximum likelihood, is $\hat{\sigma}_e^2 \approx 0.4294$. From
examining the (estimated) standard errors in the output, it is easy to see that both ML
estimates $\hat{\phi} = 0.8352$ and $\hat{\theta} = 0.4295$ are significantly different from 0. Approximate
95 percent confidence intervals for ϕ and θ, computed separately, are (0.676, 0.994) and
(0.179, 0.680), respectively.
PAGE 205
[Figure 7.10 graphic: four panels. Percentage granted review against time; Box-Cox log likelihood against λ, with the 95% interval marked; percentage granted review (log scale) against year; difference of logarithms against year.]
Figure 7.10: U.S. Supreme Court data. Upper left: Percent of cases granted review during
1926-2004. Upper right: Box-Cox transformation output. Lower left: Log-transformed
data {log Yt }. Lower right: First differences of log-transformed data {∇ log Yt }.
Example 7.8. The data in Figure 7.10 (upper left) represent the acceptance rate of
cases appealed to the Supreme Court during 1926-2004. Source: Jim Manning (Spring,
2010). We examined these data in Chapter 1 (Example 1.15, pp 16).
• The time series plot suggests that this process {Yt } is not stationary. There is a
clear linear downward trend. There is also a notable nonconstant variance problem.
• The BoxCox.ar transformation output in Figure 7.10 (upper right) suggests a log-
transformation is appropriate; note that λ ≈ 0.
PAGE 206
• The log-transformed series {log Yt } in Figure 7.10 (lower left) still displays the linear
trend, as expected. However, the variance in the {log Yt } process is more constant
than in the original series. It looks like the log-transformation has “worked.”
• The lower right plot in Figure 7.10 gives the first differences of the log-transformed
process {∇ log Yt }. This process appears to be stationary.
• The sample ACF, PACF, EACF, and armasubsets results (not shown) suggest
that an MA(1) model for {∇ log Yt } ⇐⇒ an IMA(1,1) model for {log Yt }, that is,
∇ log Yt = et − θet−1 ,
may be appropriate. Here is the R output from fitting this model:
> arima(log(supremecourt),order=c(0,1,1),method='ML') # ML
Coefficients:
ma1
-0.3556
s.e. 0.0941
sigma^2 estimated as 0.03408: log likelihood = 21.04, aic = -40.08
Therefore, the fitted model is
∇ log Yt = et − 0.3556et−1 ,
or, equivalently,
log Yt = log Yt−1 + et − 0.3556et−1 .
The white noise variance estimate, using maximum likelihood, is $\hat{\sigma}_e^2 \approx 0.03408$. From
examining the (estimated) standard error in the output, it is easy to see that the ML
estimate $\hat{\theta} = 0.3556$ is significantly different from 0.
COMMENT : Note that there is no estimated intercept term in the output above. Recall
that in ARIMA(p, d, q) models with d > 0, intercept terms are generally not used.
PAGE 207
8 Model Diagnostics
8.1 Introduction
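RECALL: An ARIMA(p, d, q) process can be written as
$$\phi(B)(1 - B)^d Y_t = \theta(B) e_t,$$
where {et } is a zero mean white noise process, where
$$\phi(B) = (1 - \phi_1 B - \phi_2 B^2 - \cdots - \phi_p B^p)$$
$$\theta(B) = (1 - \theta_1 B - \theta_2 B^2 - \cdots - \theta_q B^q)$$
and
$$(1 - B)^d Y_t = \nabla^d Y_t$$
is the series of dth differences. Until now, we have discussed the following topics: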
• Model specification (model selection). This deals with specifying the values of
p, d, and q that are most consistent with the observed (or possibly transformed)
data. This was the topic of Chapter 6.
• Model fitting (parameter estimation). This deals with estimating model param-
eters in the ARIMA(p, d, q) class. This was the topic of Chapter 7.
PREVIEW : In this chapter, we are now concerned with model diagnostics, which
generally means that we are “checking the fit of the model.” We were exposed to this
topic in Chapter 3, where we encountered deterministic trend models of the form
Yt = µt + Xt ,
where E(Xt ) = 0. We apply many of the same techniques we used then to our situation
now, that is, to diagnose the fit of ARIMA(p, d, q) models.
PAGE 208
8.2 Residual analysis
TERMINOLOGY : Residuals are random quantities which describe the part of the
variation in {Yt } that is not explained by the fitted model. The general relationship
(not just in time series models, but in nearly all statistical models) is
Residualt = Observed Yt − Predicted Yt .
Calculating residuals from an ARIMA(p, d, q) model fit based on an observed sample
Y1 , Y2 , ..., Yn can be difficult. It is most straightforward with purely AR models, so we
start there.
AR(p): Consider the stationary AR(p) model
Yt − µ = ϕ1 (Yt−1 − µ) + ϕ2 (Yt−2 − µ) + · · · + ϕp (Yt−p − µ) + et ,
where µ = E(Yt ) is the overall (process) mean and where {et } is a zero mean white noise
process. This model can be reparameterized as
Yt = θ0 + ϕ1 Yt−1 + ϕ2 Yt−2 + · · · + ϕp Yt−p + et ,
where θ0 = µ(1 − ϕ1 − ϕ2 − · · · − ϕp ) is the intercept term. For this model, the residual
at time t is
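$$\hat{e}_t = Y_t - \hat{Y}_t = Y_t - (\hat{\theta}_0 + \hat{\phi}_1 Y_{t-1} + \hat{\phi}_2 Y_{t-2} + \cdots + \hat{\phi}_p Y_{t-p}) = Y_t - \hat{\theta}_0 - \hat{\phi}_1 Y_{t-1} - \hat{\phi}_2 Y_{t-2} - \cdots - \hat{\phi}_p Y_{t-p},$$
where $\hat{\phi}_j$ is an estimator of ϕj (e.g., ML, CLS, etc.), for j = 1, 2, ..., p, and where
$$\hat{\theta}_0 = \hat{\mu}(1 - \hat{\phi}_1 - \hat{\phi}_2 - \cdots - \hat{\phi}_p)$$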
is the estimated intercept. Therefore, once we observe the values of Y1 , Y2 , ..., Yn in our
sample, we can compute the n residuals.
SUBTLETY : The first p residuals must be computed using backcasting, which is a

mathematical technique used to “reverse predict” the unseen values of Y0 , Y−1 , ..., Y1−p ,
PAGE 209
that is, the p values of the process {Yt } before time t = 1. We will not discuss backcasting
in detail, but be aware that it is needed to compute early residuals in the process.
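The AR(p) residual formula can be sketched in R as follows. This illustration deliberately skips backcasting: residuals are computed only for t = p + 1, ..., n, and the first p entries are left as NA. The function name ar_residuals and the toy inputs are assumptions made for the example.

```r
# Residuals e_hat_t = y_t - theta0 - phi_1*y_{t-1} - ... - phi_p*y_{t-p},
# for t = p+1, ..., n only (the first p residuals would require backcasting).
ar_residuals <- function(y, mu, phi) {
  p <- length(phi)
  n <- length(y)
  theta0 <- mu * (1 - sum(phi))     # estimated intercept
  res <- rep(NA_real_, n)
  for (t in (p + 1):n) {
    res[t] <- y[t] - theta0 - sum(phi * y[(t - 1):(t - p)])
  }
  res
}

# Toy AR(1) example with mu = 0 and phi = 0.5
ar_residuals(c(1, 2, 3, 4), mu = 0, phi = 0.5)
```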
ARMA(p, q): To define residuals for an invertible ARMA model containing moving
average terms, we exploit the fact that the model can be written as an inverted autore-
gressive process. To be specific, recall that any zero-mean invertible ARMA(p, q) model
can be written as
Yt = π1 Yt−1 + π2 Yt−2 + π3 Yt−3 + · · · + et ,
where the π coefficients are functions of the ϕ and θ parameters in the specific ARMA(p, q)
model. Residuals are of the form
$$\hat{e}_t = Y_t - \hat{\pi}_1 Y_{t-1} - \hat{\pi}_2 Y_{t-2} - \hat{\pi}_3 Y_{t-3} - \cdots,$$
where $\hat{\pi}_j$ is an estimator for πj , for j = 1, 2, ....
IMPORTANT : The observed residuals $\hat{e}_t$ serve as “proxies” for the white noise terms et .
We can therefore learn about the quality of the model fit by examining the residuals.
• If the model is correctly specified and our estimates are “reasonably close” to the
true parameters, then the residuals should behave roughly like an iid normal white
noise process, that is, a sequence of independent, normal random variables with zero
mean and constant variance.
• If the model is not correctly specified, then the residuals will not behave roughly like
an iid normal white noise process. Furthermore, examining the residuals carefully
may help us identify a better model.
TERMINOLOGY : It is very common to instead work with residuals which have been
standardized, that is,
$$\hat{e}_t^* = \frac{\hat{e}_t}{\hat{\sigma}_e},$$
where $\hat{\sigma}_e^2$ is an estimate of the white noise error variance $\sigma_e^2$. We call these standardized
residuals.
PAGE 210
• If the model is correctly specified, then the standardized residuals $\{\hat{e}_t^*\}$, like their
unstandardized counterparts, should behave roughly like an iid normal white noise
process.
• From the standard normal distribution, we know then that most of the standardized
residuals $\{\hat{e}_t^*\}$ should fall between −3 and 3.
• Standardized residuals that fall outside this range could correspond to observations
which are “outlying” in some sense; we’ll make this more concrete later. If many
standardized residuals fall outside (−3, 3), this suggests that the error process {et }
has a heavy-tailed distribution (common in financial time series applications).
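A minimal R sketch of standardizing residuals and flagging values outside (−3, 3) is shown below. It uses a simulated AR(1) series and base R only; the seed, the series, and the object names are assumptions for illustration. Dividing the residuals by the square root of the estimated white noise variance follows the definition given above.

```r
set.seed(520)  # arbitrary seed (assumption)
y <- arima.sim(model = list(ar = 0.6), n = 200)      # simulated AR(1) data
fit <- arima(y, order = c(1, 0, 0), method = "ML")

# Standardized residuals: e_hat_t / sigma_hat_e
std_res <- as.numeric(residuals(fit)) / sqrt(fit$sigma2)

# Indices of standardized residuals outside (-3, 3), if any
outlying <- which(abs(std_res) > 3)
outlying
```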
8.2.1 Normality and independence
DIAGNOSTICS : Histograms and qq plots of the residuals can be used to assess the
normality assumption visually. Time series plots of the residuals can be helpful to detect
“patterns” which violate the independence assumption.
• We can also apply the hypothesis tests for normality (Shapiro-Wilk) and indepen-
dence (runs test) with the standardized residuals, just as we did in Chapter 3 with
the deterministic trend models.
• The Shapiro-Wilk test formally tests
H0 : the (standardized) residuals are normally distributed

versus
H1 : the (standardized) residuals are not normally distributed.
• The runs test formally tests
H0 : the (standardized) residuals are independent

versus
H1 : the (standardized) residuals are not independent.
• For either test, small p-values lead to the rejection of H0 in favor of H1 .
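Both checks can be sketched in base R. shapiro.test is part of base R's stats package; the runs function used in these notes comes from an add-on package, so the sketch below instead counts runs above and below a threshold by hand and reports the number expected under independence, 2·n1·n2/(n1 + n2) + 1, where n1 and n2 are the counts above and below the threshold. The helper name count_runs and the simulated input are assumptions, and no p-value for the runs test is computed here.

```r
# Observed number of runs above/below a threshold, and the number
# expected under independence: 2*n1*n2/(n1 + n2) + 1.
count_runs <- function(x, threshold = 0) {
  above <- x > threshold
  n1 <- sum(above)
  n2 <- sum(!above)
  observed <- 1 + sum(above[-1] != above[-length(above)])
  expected <- 2 * n1 * n2 / (n1 + n2) + 1
  c(observed = observed, expected = expected)
}

set.seed(520)     # arbitrary seed (assumption)
z <- rnorm(150)   # stand-in for a set of standardized residuals

sw <- shapiro.test(z)   # Shapiro-Wilk test of normality
sw$p.value
count_runs(z)
```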
PAGE 211
[Figure 8.1 graphic: four panels. Discharge rate against year; standardised residuals against time; histogram (frequency) of the standardised residuals; normal QQ plot (sample quantiles against theoretical quantiles).]
Figure 8.1: Göta River discharge data. Upper left: Discharge rate time series. Up-
per right: Standardized residuals from an MA(1) fit with zero line added. Lower left:
Histogram of the standardized residuals from MA(1) fit. Lower right: QQ plot of the
standardized residuals from MA(1) fit.
Example 8.1. In Example 7.3 (pp 189, notes), we examined the Göta River discharge
rate data and used an MA(1) process to model them. The fit using maximum likelihood
in Example 7.6 (pp 202, notes) was
Yt = 535.0311 + et + 0.5350et−1 .
Figure 8.1 displays the time series plot (upper right), the histogram (lower left), and the
qq plot (lower right) of the standardized residuals. The histogram and the qq plot show
no gross departures from normality. This observation is supported by the Shapiro-Wilk
test for normality, which we perform in R. Here is the output:
PAGE 212
> shapiro.test(rstandard(gota.ma1.fit))
W = 0.9951, p-value = 0.8975
The large p-value is not evidence against normality (i.e., we do not reject H0 ). To examine
the independence assumption, note that the time series of the residuals in Figure 8.1
(upper right) displays no discernible patterns and looks to be random in appearance.
This observation is supported by the runs test for independence, which we also perform
in R. Here is the output:
> runs(rstandard(gota.ma1.fit))
$pvalue
[1] 0.29
$observed.runs
[1] 69
$expected.runs
[1] 75.94667
Therefore, we do not have evidence against independence (i.e., we do not reject H0 ).
CONCLUSION : For the Göta River discharge data, (standardized) residuals from an
MA(1) fit appear to satisfy the normality and independence assumptions reasonably well.
Example 8.2. In Example 7.2 (pp 182, notes), we examined the Lake Huron elevation
data and considered using an AR(1) process to model them. Here is the R output from
fitting an AR(1) model via maximum likelihood:
> huron.ar1.fit = arima(huron,order=c(1,0,0),method='ML')

> huron.ar1.fit
Coefficients:
ar1 intercept
0.8586 579.4921
s.e. 0.0465 0.4268
PAGE 213
[Figure 8.2 graphic: four panels. Elevation against year; standardized residuals against time; histogram (frequency) of the standardized residuals; normal QQ plot (sample quantiles).]
Figure 8.2: Lake Huron elevation data. Upper left: Elevation time series. Upper right:
Standardized residuals from an AR(1) fit with zero line added. Lower left: Histogram
of the standardized residuals from AR(1) fit. Lower right: QQ plot of the standardized
residuals from AR(1) fit.
Therefore, the fitted AR(1) model is
Yt − 579.4921 = 0.8586(Yt−1 − 579.4921) + et
or, equivalently,
Yt = 81.9402 + 0.8586Yt−1 + et .
Figure 8.2 displays the time series plot (upper right), the histogram (lower left), and the
qq plot (lower right) of the standardized residuals. The histogram and the qq plot show
no gross departures from normality. The time series plot of the standardized residuals
displays no noticeable patterns and looks like a stationary random process.
PAGE 214
The R output for the Shapiro-Wilk and runs tests is given below:
> shapiro.test(rstandard(huron.ar1.fit))
W = 0.9946, p-value = 0.9156
> runs(rstandard(huron.ar1.fit))
$pvalue
[1] 0.373
$observed.runs
[1] 59
$expected.runs
[1] 64.49606
CONCLUSION : For the Lake Huron elevation data, (standardized) residuals from an
AR(1) fit appear to satisfy the normality and independence assumptions reasonably well.
8.2.2 Residual ACF
RECALL: In Chapter 6, we discovered that for a white noise process, the sample
autocorrelation satisfies
$$r_k \sim \mathcal{AN}\left( 0, \frac{1}{n} \right),$$
for large n. Furthermore, the sample autocorrelations rj and rk , for j ≠ k, are
approximately uncorrelated.
• Therefore, to further check the adequacy of a fitted ARIMA(p, d, q) model, it is a

good idea to examine the sample autocorrelation function (ACF) of the residuals.
• To separate our discussion in Chapter 6 from now, we will denote
$\hat{r}_k$ = kth sample autocorrelation of the residuals $\hat{e}_t$,
for k = 1, 2, .... That is, the “hat” symbol in $\hat{r}_k$ will remind us that we are now
dealing with residuals.
PAGE 215
• We remarked earlier in this chapter that
“If the model is correctly specified and our estimates are “reasonably
close” to the true parameters, then the residuals should behave roughly
like an iid normal white noise process.”
• We say “roughly,” because even if the correct model is fit, the sample
autocorrelations of the residuals, $\hat{r}_k$, have sampling distributions that are a little different
from that of white noise (most prominently at early lags).
• In addition, $\hat{r}_j$ and $\hat{r}_k$, for j ≠ k, are correlated, notably so at early lags and more
weakly at later lags.
RESULTS : Suppose that {et } is a zero mean white noise process with var(et ) = σe2 . In
addition, suppose that we have identified and fit the correct ARIMA(p, d, q) model
ϕ(B)(1 − B)d Yt = θ(B)et
using maximum likelihood. All of the following are large-sample results (i.e., they are
approximate for large n).
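• MA(1).
$$\mathrm{var}(\hat{r}_1) \approx \frac{\theta^2}{n}$$
$$\mathrm{var}(\hat{r}_k) \approx \frac{1 - (1 - \theta^2)\theta^{2k-2}}{n}, \quad \text{for } k > 1$$
$$\mathrm{corr}(\hat{r}_1, \hat{r}_k) \approx -\mathrm{sign}(\theta) \left[ \frac{(1 - \theta^2)\theta^{k-2}}{1 - (1 - \theta^2)\theta^{2k-2}} \right], \quad \text{for } k > 1,$$
where sign(θ) = 1, if θ > 0 and sign(θ) = −1, if θ < 0.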
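• MA(2).
$$\mathrm{var}(\hat{r}_1) \approx \frac{\theta_2^2}{n}$$
$$\mathrm{var}(\hat{r}_2) \approx \frac{\theta_2^2 + \theta_1^2 (1 + \theta_2)^2}{n}$$
$$\mathrm{var}(\hat{r}_k) \approx \frac{1}{n}, \quad \text{for } k > 2.$$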
PAGE 216
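• AR(1).
$$\mathrm{var}(\hat{r}_1) \approx \frac{\phi^2}{n}$$
$$\mathrm{var}(\hat{r}_k) \approx \frac{1 - (1 - \phi^2)\phi^{2k-2}}{n}, \quad \text{for } k > 1$$
$$\mathrm{corr}(\hat{r}_1, \hat{r}_k) \approx -\mathrm{sign}(\phi) \left[ \frac{(1 - \phi^2)\phi^{k-2}}{1 - (1 - \phi^2)\phi^{2k-2}} \right], \quad \text{for } k > 1.$$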
• AR(2).
$$\mathrm{var}(\hat{r}_1) \approx \frac{\phi_2^2}{n}$$
$$\mathrm{var}(\hat{r}_2) \approx \frac{\phi_2^2 + \phi_1^2 (1 + \phi_2)^2}{n}$$
$$\mathrm{var}(\hat{r}_k) \approx \frac{1}{n}, \quad \text{for } k > 2.$$
NOTE : The MA(2) result that $\mathrm{var}(\hat{r}_k) \approx 1/n$, for k > 2, may not hold if (θ1 , θ2 ) is
“close” to the boundary of the invertibility region for the MA(2) model. The same is
true for the AR(2) if (ϕ1 , ϕ2 ) is “close” to the boundary of the stationarity region.
MAIN POINT : Even if we fit the correct ARIMA(p, d, q) model, the residuals from the
fit will not follow a white noise process exactly. At very early lags, there are noticeable
differences from a white noise process. For larger lags, the differences become negligible.
Example 8.3. In Example 8.1, we examined the residuals from an MA(1) fit to the
Göta River discharge data (via ML).
• The sample ACF of the MA(1) residuals is depicted in Figure 8.3 with margin of
error bounds at
$$\frac{2}{\sqrt{n}} = \frac{2}{\sqrt{150}} \approx 0.163.$$
That is, the margin of error bounds in Figure 8.3 are computed under the white
noise assumption.
• In this example, we calculate estimates of $\mathrm{var}(\hat{r}_k)$, for k = 1, 2, ..., 10.
Figure 8.3: Göta River discharge data. Sample ACF of the residuals from an MA(1)
model fit. [Plot omitted: “Sample ACF for MA(1) residuals,” ACF versus Lag.]
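• For an MA(1) model fit,
$$\widehat{\mathrm{var}}(\hat{r}_1) \approx \frac{\hat{\theta}^2}{n}$$
$$\widehat{\mathrm{var}}(\hat{r}_k) \approx \frac{1 - (1 - \hat{\theta}^2)\hat{\theta}^{2k-2}}{n}, \quad \text{for } k > 1.$$
Note that in these formulae, $\hat{\theta}$ replaces θ, making these estimates of the true
variances stated earlier.
Recall that the MA(1) model fit to these data (via ML) was
$$Y_t = 535.0311 + e_t + 0.5350 e_{t-1},$$
so that $\hat{\theta} = -0.5350$. Therefore,
$$\widehat{\mathrm{var}}(\hat{r}_1) \approx \frac{(-0.5350)^2}{150} \approx 0.001908$$
$$\widehat{\mathrm{var}}(\hat{r}_k) \approx \frac{1 - [1 - (-0.5350)^2](-0.5350)^{2k-2}}{150}, \quad \text{for } k > 1.$$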
Here are the first 10 sample autocorrelations for the residuals from the MA(1) fit:
> acf(residuals(gota.ma1.fit),plot=F,lag.max=10)
1 2 3 4 5 6 7 8 9 10
0.059 0.020 -0.115 0.021 -0.074 0.041 -0.009 0.019 -0.076 0.042
We now construct a table which displays these sample autocorrelations, along with their
±2 estimated standard errors
$$\pm 2\,\widehat{\mathrm{se}}(\hat{r}_k) = \pm 2\sqrt{\widehat{\mathrm{var}}(\hat{r}_k)},$$
for k = 1, 2, ..., 10. Values of $\hat{r}_k$ more than 2 (estimated) standard errors away from 0
would be considered inconsistent with the fitted model.

k                                  1      2      3       4      5       6      7       8      9       10
$\hat{r}_k$                        0.059  0.020  −0.115  0.021  −0.074  0.041  −0.009  0.019  −0.076  0.042
$2\,\widehat{\mathrm{se}}(\hat{r}_k)$   0.087  0.146  0.158   0.162  0.163   0.163  0.163   0.163  0.163   0.163

• Note that as k gets larger, $2\,\widehat{\mathrm{se}}(\hat{r}_k)$ approaches
$$\frac{2}{\sqrt{n}} = \frac{2}{\sqrt{150}} \approx 0.163,$$
the white noise margin of error bounds.
• None of the sample autocorrelations fall outside the $\pm 2\,\widehat{\mathrm{se}}(\hat{r}_k)$ bounds.
• This finding further supports the MA(1) model choice for these data.
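The standard error row of the table can be reproduced with a few lines of R. This is only a sketch: it hard-codes the rounded values θ̂ = −0.5350 and n = 150 from the MA(1) fit, and the object names (theta.hat, var.rk, two.se) are illustrative rather than taken from the course code.

```r
# Estimated large-sample variances of the residual autocorrelations
# for an MA(1) fit. Values below are the rounded estimates quoted in
# Example 8.3; all object names are illustrative.
theta.hat <- -0.5350
n <- 150
k <- 1:10

# At k = 1 the general expression reduces to theta.hat^2 / n,
# so a single formula covers every lag.
var.rk <- (1 - (1 - theta.hat^2) * theta.hat^(2 * k - 2)) / n
two.se <- 2 * sqrt(var.rk)

print(round(two.se, 3))       # estimated 2 x standard error at each lag
print(round(2 / sqrt(n), 3))  # white noise bound, for comparison
```

Comparing the two printed lines shows how quickly the model-based bounds approach the white noise bound as the lag increases.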
REMARK: In addition to examining the sample autocorrelations of the residuals
individually, it is useful to consider them as a group.
• Although sample autocorrelations may be moderate individually; e.g., each within
the $\pm 2\,\widehat{\mathrm{se}}(\hat{r}_k)$ bounds, it could be that as a group the sample autocorrelations are
“excessive,” and therefore inconsistent with the fitted model.
• To address this potential occurrence, Ljung and Box (1978) developed a procedure,
based on the sample autocorrelations of the residuals, to test formally whether or
not a certain model in the ARMA(p, q) family was appropriate.
LJUNG-BOX TEST: In particular, the modified Ljung-Box test statistic
$$Q_* = n(n + 2)\sum_{k=1}^{K}\frac{\hat{r}_k^2}{n - k}$$
can be used to test
H0 : the ARMA(p, q) model is appropriate
versus
H1 : the ARMA(p, q) model is not appropriate.
• The sample autocorrelations $\hat{r}_k$, for k = 1, 2, ..., K, are computed under the
ARMA(p, q) model assumption in H0. If a nonstationary model is fit (d > 0),
then the ARMA(p, q) model refers to the suitably differenced process.
• The value K is called the maximum lag; its choice is somewhat arbitrary.
• Somewhat diaphanously, the authors of your text recommend that K be chosen
so that the $\Psi_j$ weights of the general linear process representation of the
ARMA(p, q) model (under H0) are negligible for all j > K. Recall that any
stationary ARMA(p, q) process can be written as
$$Y_t = e_t + \Psi_1 e_{t-1} + \Psi_2 e_{t-2} + \cdots,$$
where $\{e_t\}$ is a zero mean white noise process.
• Typically one can simply compute $Q_*$ for various choices of K and determine if the
same decision is reached for all values of K.
• For a fixed K, a level α decision rule is to reject H0 if the value of $Q_*$ exceeds the
upper α quantile of the $\chi^2$ distribution with K − p − q degrees of freedom, that is,
$$\text{Reject } H_0 \text{ if } Q_* > \chi^2_{K-p-q,\alpha}.$$
• Fitting an erroneous model tends to inflate $Q_*$, so this is a one-sided test. R
produces p-values for this test automatically.
• The tsdiag function in R will compute $Q_*$ at all lags specified by the user.
Example 8.4. In Example 8.1, we examined the residuals from an MA(1) fit to the Göta
River discharge data (via ML). Here we illustrate the use of the modified Ljung-Box test
for the MA(1) model. Recall that we computed the first 10 sample autocorrelations:
> acf(residuals(gota.ma1.fit),plot=F,lag.max=10)
1 2 3 4 5 6 7 8 9 10
0.059 0.020 -0.115 0.021 -0.074 0.041 -0.009 0.019 -0.076 0.042
Taking K = 10 and n = 150, the modified Ljung-Box statistic is
$$Q_* = 150(150 + 2)\left[\frac{(0.059)^2}{150 - 1} + \frac{(0.020)^2}{150 - 2} + \cdots + \frac{(0.042)^2}{150 - 10}\right] \approx 5.13.$$
To test MA(1) model adequacy, we compare $Q_*$ to the upper α quantile of a $\chi^2$
distribution with K − p − q = 10 − 0 − 1 = 9 degrees of freedom and reject the MA(1) model
if $Q_*$ exceeds this quantile. With α = 0.05,
$$\chi^2_{9,0.05} = 16.91898,$$
which I found using the qchisq(0.95,9) command in R. Because the test statistic $Q_*$
does not exceed this upper quantile, we do not reject H0.
REMARK : Note that R can perform the modified Ljung-Box test automatically. Here
is the output:
> Box.test(residuals(gota.ma1.fit),lag=10,type="Ljung-Box",fitdf=1)
Box-Ljung test
X-squared = 5.1305, df = 9, p-value = 0.8228
We do not have evidence against MA(1) model adequacy for these data when K = 10.
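The hand calculation can also be checked directly in R. The sketch below uses the ten rounded residual autocorrelations printed above, so its result agrees with the Box.test output only to about two decimal places; the object names (r.hat, Q.star, crit, p.value) are illustrative.

```r
# Modified Ljung-Box statistic computed by hand from the rounded
# residual autocorrelations of the MA(1) fit (object names illustrative).
r.hat <- c(0.059, 0.020, -0.115, 0.021, -0.074,
           0.041, -0.009, 0.019, -0.076, 0.042)
n <- 150            # length of the series
K <- length(r.hat)  # maximum lag
p <- 0              # AR order of the fitted model
q <- 1              # MA order of the fitted model

Q.star <- n * (n + 2) * sum(r.hat^2 / (n - 1:K))
df <- K - p - q
crit <- qchisq(0.95, df)                            # upper 0.05 quantile
p.value <- pchisq(Q.star, df, lower.tail = FALSE)   # one-sided p-value

print(round(c(Q.star = Q.star, crit = crit, p.value = p.value), 4))
```

Because the p-value is taken from the upper tail only, this matches the one-sided nature of the test described earlier.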
Figure 8.4: Göta River discharge data. Residual graphics and modified Ljung-Box
p-values for MA(1) fit. This figure was created using the tsdiag function in R.
[Plot omitted. Panels: Standardized Residuals versus Time; ACF of Residuals versus Lag;
P-values versus Number of lags.]
GRAPHICS : The R function tsdiag produces the plot in Figure 8.4.
• The top plot displays the residuals plotted through time (without connecting lines).
• The middle plot displays the sample ACF of the residuals.
• The bottom plot displays the p-values of the modified Ljung-Box test for various
values of K. A horizontal line at α = 0.05 is added.
For the Göta River discharge data, we see in Figure 8.4 that all of the modified Ljung-Box
test p-values are larger than 0.05, lending further support to the MA(1) model.
Figure 8.5: Earthquake data. Residual graphics and modified Ljung-Box p-values for
ARMA(1,1) fit to the square-root transformed data.
[Plot omitted. Panels: standardized residuals versus Time; ACF of Residuals versus Lag;
P-values versus Number of lags.]
Example 8.5. In Example 7.7 (pp 203, notes), we fit an ARMA(1,1) model to the
(square-root transformed) earthquake data using maximum likelihood. Figure 8.5 dis-
plays the tsdiag output for the ARMA(1,1) model fit.
• The Shapiro-Wilk test does not reject normality (p-value = 0.7202). The runs test
does not reject independence (p-value = 0.679). Both the Shapiro-Wilk and runs
tests were applied to the standardized residuals.
• The residual output in Figure 8.5 fully supports the ARMA(1,1) model.
Figure 8.6: U.S. Supreme Court data. Residual graphics and modified Ljung-Box
p-values for IMA(1,1) fit to the log transformed data.
[Plot omitted. Panels: standardized residuals versus Time; ACF of Residuals versus Lag;
P-values versus Number of lags.]
Example 8.6. In Example 7.8 (pp 206, notes), we fit an IMA(1,1) model to the (log
transformed) Supreme Court data using maximum likelihood. Figure 8.6 displays the
tsdiag output for the IMA(1,1) model fit.
• The Shapiro-Wilk test does not reject normality (p-value = 0.5638). The runs test
does not reject independence (p-value = 0.864). Both the Shapiro-Wilk and runs
tests were applied to the standardized residuals.
• The modified Ljung-Box test p-values in Figure 8.6 raise serious concerns over the
adequacy of the IMA(1,1) model fit.
Figure 8.7: Crude oil price data. Monthly spot prices in dollars from Cushing, OK, from
1/1986 to 1/2006. [Plot omitted: Oil prices versus Year.]
Example 8.7. The data in Figure 8.7 are monthly spot prices for crude oil (measured
in U.S. dollars per barrel). We examined these data in Chapter 1 (Example 1.12, pp 13).
In this example, we assess the fit of an IMA(1,1) model for {log Yt }; i.e.,
∇ log Yt = et − θet−1 .
I have arrived at this candidate model using our established techniques from Chapter 6;
these details are omitted for brevity. I used maximum likelihood to fit the model.
• In Figure 8.8, we display the {∇ log Yt } process (upper left), along with plots of
the standardized residuals from the IMA(1,1) fit.
• It is difficult to notice a pattern in the time series plot of the residuals, although
there are notable outliers on the low and high sides.
Figure 8.8: Oil price data with IMA(1,1) fit to {log Yt }. Upper left: {∇ log Yt } process.
Upper right: Standardized residuals with zero line added. Lower left: Histogram of the
standardized residuals. Lower right: QQ plot of the standardized residuals.
[Plots omitted. Axes: 1st differences of logarithms versus Year; standardized residuals
versus Time; Frequency (histogram); Sample Quantiles (QQ plot).]
• The Shapiro-Wilk test strongly rejects normality of the residuals (p-value < 0.0001).
This is likely due to the extreme outliers on each side, which are not “expected”
under the assumption of normality. The runs test does not reject independence
(p-value = 0.341).
• The tsdiag output for the IMA(1,1) residuals is given in Figure 8.9. The top plot
displays the residuals from the IMA(1,1) fit with “outlier limits” at
$$z_{0.025/241} \approx 3.709744,$$
which is the upper 0.025/241 quantile of the N(0, 1) distribution; that is, the
1 − 0.05/[2(241)] quantile.
• R is implementing a “Bonferroni” correction to test each residual as an outlier.
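The Bonferroni limit quoted above can be recovered with qnorm. This is a minimal sketch that assumes 241 residuals are being tested at overall level 0.05, as stated in the notes; the object names are illustrative.

```r
# Bonferroni-adjusted two-sided outlier limit for standardized residuals.
# n.resid and alpha are taken from the text; names are illustrative.
n.resid <- 241
alpha <- 0.05

# Each residual is tested at level alpha / n.resid, split over two tails
z.limit <- qnorm(1 - alpha / (2 * n.resid))
print(z.limit)
```

A standardized residual larger than this limit in absolute value would be flagged as an outlier under this criterion.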
Figure 8.9: Oil price data. Residual graphics and modified Ljung-Box p-values for
IMA(1,1) fit to the log transformed data.
[Plot omitted. Panels: standardized residuals versus Time; ACF of Residuals versus Lag;
P-values versus Number of lags.]
• According to the Bonferroni criterion, residuals which exceed this value (3.709744)
in absolute value would be classified as outliers. The one around 1991 likely corre-
sponds to the U.S. invasion of Iraq (the first one).
• The sample ACF for the residuals raises concern, but the modified Ljung-Box p-
values do not suggest lack of fit (although it becomes interesting for large K).
• The IMA(1,1) model for the log-transformed data appears to do a fairly good job.
I am a little concerned about the outliers and the residual ACF. Intervention
analysis (Chapter 11) may help to adjust for the outlying observations.
8.3 Overfitting
REMARK: In addition to performing a thorough residual analysis, overfitting can be a
useful diagnostic technique to further assess the validity of an assumed model. Basically,
“overfitting” refers to the process of fitting a model more complicated than the one
under investigation and then
(a) examining the significance of the additional parameter estimates
(b) examining the change in the estimates from the assumed model.
EXAMPLE : Suppose that, after the model specification phase and residual diagnostics,
we are strongly considering an AR(2) model for our data, that is,
Yt = θ0 + ϕ1 Yt−1 + ϕ2 Yt−2 + et .
To perform an overfit, we would fit the following two models:
• AR(3):
Yt = θ0 + ϕ1 Yt−1 + ϕ2 Yt−2 + ϕ3 Yt−3 + et
• ARMA(2,1):
Yt = θ0 + ϕ1 Yt−1 + ϕ2 Yt−2 + et − θet−1 .
• If the additional AR parameter estimate $\hat{\phi}_3$ is significantly different than zero, then
this would be evidence that an AR(3) model is worthy of investigation. If $\hat{\phi}_3$ is
not significantly different than zero and the estimates of $\phi_1$ and $\phi_2$ do not change
much from their values in the AR(2) model fit, this would be evidence that the
more complicated AR(3) model is not needed.
• If the additional MA parameter estimate $\hat{\theta}$ is significantly different than zero, then
this would be evidence that an ARMA(2,1) model is worthy of investigation. If $\hat{\theta}$ is
not significantly different than zero and the estimates of $\phi_1$ and $\phi_2$ do not change
much from their values in the AR(2) model fit, this would be evidence that the
more complicated ARMA(2,1) model is not needed.
IMPORTANT : When overfitting an ARIMA(p, d, q) model, we consider the following

two models:
(a) ARIMA(p + 1, d, q)
(b) ARIMA(p, d, q + 1).
That is, one overfit model increases p by 1. The other increases q by 1.
Example 8.8. Our residual analysis this chapter suggests that an MA(1) model for the
Göta River discharge data is very reasonable. We now overfit using an MA(2) model and
an ARMA(1,1) model. Here is the R output from all three model fits:
> gota.ma1.fit
Call: arima(x = gota, order = c(0, 0, 1), method = "ML")
Coefficients:
ma1 intercept
0.5350 535.0311
s.e. 0.0594 10.4300
> gota.ma2.overfit
Coefficients:
ma1 ma2 intercept
0.6153 0.1198 534.8117
s.e. 0.0861 0.0843 11.7000
> gota.arma11.overfit
Coefficients:
ar1 ma1 intercept
0.1574 0.4367 534.8004
s.e. 0.1292 0.1100 11.5217
ANALYSIS: In the MA(2) overfit, we see that a 95 percent confidence interval for $\theta_2$,
the additional MA model parameter, is
$$-0.1198 \pm 1.96(0.0843) \implies (-0.285, 0.045),$$
which does include 0. Therefore, $\hat{\theta}_2$ is not statistically different than zero, which suggests
that the MA(2) model is not necessary. In the ARMA(1,1) overfit, we see that a 95
percent confidence interval for ϕ, the additional AR model parameter, is
$$0.1574 \pm 1.96(0.1292) \implies (-0.096, 0.411),$$
which also includes 0. Therefore, $\hat{\phi}$ is not statistically different than zero, which suggests
that the ARMA(1,1) model is not necessary. The following table summarizes the R output
shown above:

Model       $\hat{\theta}$ ($\widehat{\mathrm{se}}$)   Additional estimate   Significant?   $\hat{\sigma}_e^2$   AIC
MA(1)       0.5350 (0.0594)   −−                    −−             6957    1757.15
MA(2)       0.6153 (0.0861)   $\hat{\theta}_2$              no             6864    1757.18
ARMA(1,1)   0.4367 (0.1100)   $\hat{\phi}$                no             6891    1757.74

Because the additional estimates in the overfit models are not statistically different from
zero, there is no reason to further consider either model. Note also how the estimate of
θ becomes less precise in the two larger models.
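The two confidence intervals in the analysis are ordinary large-sample (Wald) intervals, so they are easy to recompute. The sketch below defines a small helper, wald.ci, which is a hypothetical convenience function and not part of R or the course code. Note that R's arima reports MA coefficients with the opposite sign to the convention used in these notes, which is why the ma2 estimate 0.1198 enters as −0.1198.

```r
# Large-sample (Wald) confidence interval: estimate +/- z * standard error.
# wald.ci is a hypothetical helper written for this illustration.
wald.ci <- function(est, se, level = 0.95) {
  z <- qnorm(1 - (1 - level) / 2)
  c(lower = est - z * se, upper = est + z * se)
}

# MA(2) overfit: theta2 = -(ma2 coefficient reported by R)
ci.theta2 <- wald.ci(-0.1198, 0.0843)
# ARMA(1,1) overfit: phi = ar1 coefficient reported by R
ci.phi <- wald.ci(0.1574, 0.1292)

print(round(rbind(theta2 = ci.theta2, phi = ci.phi), 3))
```

Both intervals contain zero, which is the basis for declaring the additional estimates not significant.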
DISCUSSION : We have finished our discussions on model specification, model fitting,

and model diagnostics. Having done so, you are now well-versed in modeling time se-
ries data in the ARIMA(p, d, q) family. Hopefully, you have realized that the process
of building a model is not always clear cut and that some “give and take” is necessary.
Remember, no model is perfect! Furthermore, model building takes creativity and pa-
tience; it is not a black box exercise. Overall, our goal as data analysts is to find the best
possible model which explains the variation in the data in a clear and concise manner.
Having done this, our task now moves to using the fitted model for forecasting.
9 Forecasting
9.1 Introduction
RECALL: We have discussed two types of statistical models for time series data, namely,
deterministic trend models (Chapter 3) of the form
$$Y_t = \mu_t + X_t,$$
where $\{X_t\}$ is a zero mean stochastic process, and ARIMA(p, d, q) models of the form
$$\phi(B)(1 - B)^d Y_t = \theta(B) e_t,$$
where $\{e_t\}$ is zero mean white noise. For both types of models, we have studied model
specification, model fitting, and diagnostic procedures to assess model fit.
PREVIEW : We now switch our attention to forecasting.
• We start with a sample of process values up until time t, say, Y1 , Y2 , ..., Yt . These
are our observed data.
• Forecasting refers to the technique of predicting future values of the process, i.e.,
Yt+1 , Yt+2 , Yt+3 , ..., .
In general, Yt+l is the value of the process at time t + l, where l ≥ 1.
• We call t the forecast origin and l the lead time. The value Yt+l is “l steps
ahead” of the most recently observed value Yt .
IMPORTANT : By “forecasting,” we mean that we are trying to predict the value of

a future random variable Yt+l . In general, prediction is a more challenging problem
than, say, estimating a population (model) parameter. Model parameters are fixed (but
unknown) values. Random variables are not fixed; they are random.
APPROACH: We need to adopt a formal mathematical criterion to calculate model
forecasts. The criterion that we will use is based on the mean squared error of
prediction
$$\mathrm{MSEP} = E\{[Y_{t+l} - h(Y_1, Y_2, ..., Y_t)]^2\}.$$
• Suppose that we have a sample of observed data $Y_1, Y_2, ..., Y_t$ and that we would
like to predict $Y_{t+l}$.
• The approach we take is to choose the function $h(Y_1, Y_2, ..., Y_t)$ that minimizes
MSEP. This function will be our forecasted value of $Y_{t+l}$.
• The general solution to this minimization problem is
$$h(Y_1, Y_2, ..., Y_t) = E(Y_{t+l} \mid Y_1, Y_2, ..., Y_t),$$
the conditional expectation of $Y_{t+l}$, given the observed data $Y_1, Y_2, ..., Y_t$ (see
Appendices E and F, CC).
• Adopting conventional notation, we write
$$\hat{Y}_t(l) = E(Y_{t+l} \mid Y_1, Y_2, ..., Y_t).$$
This is called the minimum mean squared error (MMSE) forecast. That is,
$\hat{Y}_t(l)$ is the MMSE forecast of $Y_{t+l}$.
Conditional Expectation rules:
• The conditional expectation E(Z|Y1 , Y2 , ..., Yt ) is a function of Y1 , Y2 , ..., Yt .
• If c is a constant, then E(c|Y1 , Y2 , ..., Yt ) = c.
• If Z1 and Z2 are random variables, then
E(Z1 + Z2 |Y1 , Y2 , ..., Yt ) = E(Z1 |Y1 , Y2 , ..., Yt ) + E(Z2 |Y1 , Y2 , ..., Yt );
i.e., conditional expectation is additive (just like unconditional expectation).
• If Z is a function of Y1 , Y2 , ..., Yt , say, Z = f (Y1 , Y2 , ..., Yt ), then
E(Z|Y1 , Y2 , ..., Yt ) = E[f (Y1 , Y2 , ..., Yt )|Y1 , Y2 , ..., Yt ] = f (Y1 , Y2 , ..., Yt ).
In other words, once you condition on Y1 , Y2 , ..., Yt , any function of Y1 , Y2 , ..., Yt acts
as a constant.
• If Z is independent of Y1 , Y2 , ..., Yt , then
E(Z|Y1 , Y2 , ..., Yt ) = E(Z).
9.2 Deterministic trend models
RECALL: Consider the model
$$Y_t = \mu_t + X_t,$$
where $\mu_t$ is a deterministic (non-random) trend function and where $\{X_t\}$ is assumed to be
a white noise process with $E(X_t) = 0$ and $\mathrm{var}(X_t) = \gamma_0$ (constant). By direct calculation,
the l-step ahead forecast is
$$\begin{aligned}
\hat{Y}_t(l) &= E(Y_{t+l} \mid Y_1, Y_2, ..., Y_t) \\
&= E(\mu_{t+l} + X_{t+l} \mid Y_1, Y_2, ..., Y_t) \\
&= \underbrace{E(\mu_{t+l} \mid Y_1, Y_2, ..., Y_t)}_{=\, \mu_{t+l}} + \underbrace{E(X_{t+l} \mid Y_1, Y_2, ..., Y_t)}_{=\, E(X_{t+l}) \,=\, 0} = \mu_{t+l},
\end{aligned}$$
because $\mu_{t+l}$ is constant and because $X_{t+l}$ is a zero mean random variable independent
of $Y_1, Y_2, ..., Y_t$. Therefore,
$$\hat{Y}_t(l) = \mu_{t+l}$$
is the MMSE forecast.
• For example, if $\mu_t = \beta_0 + \beta_1 t$, a linear trend model, then
$$\hat{Y}_t(l) = \mu_{t+l} = \beta_0 + \beta_1(t + l).$$
• If $\mu_t = \beta_0 + \beta_1\cos(2\pi f t) + \beta_2\sin(2\pi f t)$, a cosine trend model, then
$$\hat{Y}_t(l) = \mu_{t+l} = \beta_0 + \beta_1\cos[2\pi f(t + l)] + \beta_2\sin[2\pi f(t + l)].$$
ESTIMATION: Of course, MMSE forecasts must be estimated! For example, in the
linear trend model, $\hat{Y}_t(l)$ is estimated by
$$\hat{\mu}_{t+l} = \hat{\beta}_0 + \hat{\beta}_1(t + l),$$
where $\hat{\beta}_0$ and $\hat{\beta}_1$ are the least squares estimates of $\beta_0$ and $\beta_1$, respectively. In the cosine
trend model, $\hat{Y}_t(l)$ is estimated by
$$\hat{\mu}_{t+l} = \hat{\beta}_0 + \hat{\beta}_1\cos[2\pi f(t + l)] + \hat{\beta}_2\sin[2\pi f(t + l)],$$
where $\hat{\beta}_0$, $\hat{\beta}_1$, and $\hat{\beta}_2$ are the least squares estimates.
Example 9.1. In Example 3.4 (pp 53, notes), we fit a straight line trend model to the
global temperature deviation data. The fitted model is
$$\hat{Y}_t = -12.19 + 0.0062t,$$
where t = 1900, 1901, ..., 1997, depicted visually in Figure 9.1. Here are examples of
forecasting with this estimated trend model:
• In 1997, we could have used the model to predict for 1998,
$$\hat{\mu}_{1998} = \hat{\mu}_{1997+1} = -12.19 + 0.0062(1997 + 1) \approx 0.198.$$
• For 2005 (8 steps ahead of 1997),
$$\hat{\mu}_{2005} = \hat{\mu}_{1997+8} = -12.19 + 0.0062(1997 + 8) \approx 0.241.$$
• For 2020 (23 steps ahead of 1997),
$$\hat{\mu}_{2020} = \hat{\mu}_{1997+23} = -12.19 + 0.0062(1997 + 23) \approx 0.334.$$
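These three forecasts can be reproduced by evaluating the fitted line at t + l. The sketch below hard-codes the rounded coefficients reported in the example; trend.forecast is a hypothetical helper name used only for this illustration.

```r
# Estimated MMSE forecasts from the fitted straight-line trend in Example 9.1.
# Coefficients are the rounded values quoted in the text.
b0 <- -12.19
b1 <- 0.0062

# trend.forecast is a hypothetical helper: fitted trend evaluated at t + l
trend.forecast <- function(t, l) b0 + b1 * (t + l)

lead <- c(1, 8, 23)   # lead times for 1998, 2005, and 2020
fc <- trend.forecast(1997, lead)
print(round(fc, 3))
```

Because the forecast depends only on t + l, the same function gives the fitted trend value for any year, inside or outside the observed range.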
Figure 9.1: Global temperature data. The least squares straight line fit is superimposed.
[Plot omitted: series versus Year.]
Example 9.2. In Example 3.6 (pp 66, notes), we fit a cosine trend model to the monthly
US beer sales data (in millions of barrels), which produced the fitted model
$$\hat{Y}_t = 14.8 - 2.04\cos(2\pi t) + 0.93\sin(2\pi t),$$
where t = 1980, 1980.083, 1980.166, ..., 1990.916. Note that
• t = 1980 refers to January, 1980,
• t = 1980.083 refers to February, 1980,
• t = 1980.166 refers to March, 1980, and so on.
• These values for t are used because data arrive monthly and “year” is used as a
predictor in the regression.
• This fitted model is depicted in Figure 9.2.
Figure 9.2: Beer sales data. The least squares cosine trend fit is superimposed.
[Plot omitted: Sales versus Time.]
• In December, 1990, we could have used the model to predict for January, 1991,
$$\hat{\mu}_{1991} = 14.8 - 2.04\cos[2\pi(1991)] + 0.93\sin[2\pi(1991)] \approx 12.76.$$
• For June, 1992,
$$\hat{\mu}_{1992.416} = 14.8 - 2.04\cos[2\pi(1992.416)] + 0.93\sin[2\pi(1992.416)] \approx 17.03.$$
Note that the beginning of June, 1992 corresponds to t = 1992.416.
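The cosine trend forecasts can be checked the same way. This sketch evaluates the fitted model, with its rounded coefficients, at the two time points used above; beer.trend is a hypothetical helper name.

```r
# Estimated MMSE forecasts from the fitted cosine trend in Example 9.2.
# beer.trend is a hypothetical helper using the rounded coefficients quoted above.
beer.trend <- function(t) {
  14.8 - 2.04 * cos(2 * pi * t) + 0.93 * sin(2 * pi * t)
}

# January 1991 and June 1992, with time measured in years
fc <- beer.trend(c(1991, 1992.416))
print(round(fc, 2))
```

Since the fitted trend has period one year, the forecast for any given month is identical in every future year.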
REMARK : One major drawback with predictions made from deterministic trend models
is that they are based only on the least squares model fit, that is, the forecast for Yt+l
ignores the correlation between Yt+l and Y1 , Y2 , ..., Yt . Therefore, the analyst who makes
these predictions is ignoring this correlation and, in addition, is assuming that the fitted
trend is applicable indefinitely into the future; i.e., “the trend lasts forever.”
TERMINOLOGY: For deterministic trend models of the form
$$Y_t = \mu_t + X_t,$$
where $E(X_t) = 0$ and $\mathrm{var}(X_t) = \gamma_0$ (constant), the forecast error at lead time l, denoted
by $e_t(l)$, is the difference between the value of the process at time t + l and the MMSE
forecast at this time. Mathematically,
$$\begin{aligned}
e_t(l) &= Y_{t+l} - \hat{Y}_t(l) \\
&= \mu_{t+l} + X_{t+l} - \mu_{t+l} = X_{t+l}.
\end{aligned}$$
For all l ≥ 1,
$$E[e_t(l)] = E(X_{t+l}) = 0$$
$$\mathrm{var}[e_t(l)] = \mathrm{var}(X_{t+l}) = \gamma_0.$$
• The first equation implies that forecasts are unbiased because the forecast error
is an unbiased estimator of 0.
• The second equation implies that the forecast error variance is constant for all
lead times l.
• These facts will be useful in deriving prediction intervals for future values.
9.3 ARIMA models
GOAL: We now discuss forecasting methods with ARIMA models. Recall that an
ARIMA(p, d, q) process can be written generally as
$$\phi(B)(1 - B)^d Y_t = \theta_0 + \theta(B) e_t,$$
where $\theta_0$ is an intercept term. We first focus on stationary ARMA(p, q) models, that
is, ARIMA(p, d, q) models with d = 0. Special cases are treated in detail.
9.3.1 AR(1)
AR(1): Suppose that $\{e_t\}$ is zero mean white noise with $\mathrm{var}(e_t) = \sigma_e^2$. Consider the
AR(1) model
$$Y_t - \mu = \phi(Y_{t-1} - \mu) + e_t,$$
where the overall (process) mean $\mu = E(Y_t)$.
1-step ahead forecast: The MMSE forecast of $Y_{t+1}$, the 1-step ahead forecast, is
$$\begin{aligned}
\hat{Y}_t(1) &= E(Y_{t+1} \mid Y_1, Y_2, ..., Y_t) \\
&= E[\underbrace{\mu + \phi(Y_t - \mu) + e_{t+1}}_{=\, Y_{t+1}} \mid Y_1, Y_2, ..., Y_t] \\
&= E(\mu \mid Y_1, Y_2, ..., Y_t) + E[\phi(Y_t - \mu) \mid Y_1, Y_2, ..., Y_t] + E(e_{t+1} \mid Y_1, Y_2, ..., Y_t).
\end{aligned}$$
From the properties of conditional expectation, we note the following:
• $E(\mu \mid Y_1, Y_2, ..., Y_t) = \mu$, because µ is a constant.
• $E[\phi(Y_t - \mu) \mid Y_1, Y_2, ..., Y_t] = \phi(Y_t - \mu)$, because $\phi(Y_t - \mu)$ is a function of $Y_1, Y_2, ..., Y_t$.
• $E(e_{t+1} \mid Y_1, Y_2, ..., Y_t) = E(e_{t+1}) = 0$, because $e_{t+1}$ is independent of $Y_1, Y_2, ..., Y_t$.
Therefore, the MMSE forecast of $Y_{t+1}$ is
$$\hat{Y}_t(1) = \mu + \phi(Y_t - \mu).$$
2-step ahead forecast: The MMSE forecast of $Y_{t+2}$, the 2-step ahead forecast, is
$$\begin{aligned}
\hat{Y}_t(2) &= E(Y_{t+2} \mid Y_1, Y_2, ..., Y_t) \\
&= E[\underbrace{\mu + \phi(Y_{t+1} - \mu) + e_{t+2}}_{=\, Y_{t+2}} \mid Y_1, Y_2, ..., Y_t] \\
&= \underbrace{E(\mu \mid Y_1, Y_2, ..., Y_t)}_{=\, \mu} + \underbrace{E[\phi(Y_{t+1} - \mu) \mid Y_1, Y_2, ..., Y_t]}_{=\, (**)} + \underbrace{E(e_{t+2} \mid Y_1, Y_2, ..., Y_t)}_{=\, E(e_{t+2}) \,=\, 0}.
\end{aligned}$$
Now, the expression in (∗∗) is equal to
$$\begin{aligned}
E[\phi(Y_{t+1} - \mu) \mid Y_1, Y_2, ..., Y_t] &= E\{\phi[\underbrace{\mu + \phi(Y_t - \mu) + e_{t+1}}_{=\, Y_{t+1}} - \mu] \mid Y_1, Y_2, ..., Y_t\} \\
&= E\{\phi[\phi(Y_t - \mu) + e_{t+1}] \mid Y_1, Y_2, ..., Y_t\} \\
&= E[\phi^2(Y_t - \mu) \mid Y_1, Y_2, ..., Y_t] + E(\phi e_{t+1} \mid Y_1, Y_2, ..., Y_t).
\end{aligned}$$
From the properties of conditional expectation, we again note the following:
• $E[\phi^2(Y_t - \mu) \mid Y_1, Y_2, ..., Y_t] = \phi^2(Y_t - \mu)$, because $\phi^2(Y_t - \mu)$ is a function of
$Y_1, Y_2, ..., Y_t$.
• $E(\phi e_{t+1} \mid Y_1, Y_2, ..., Y_t) = \phi E(e_{t+1} \mid Y_1, Y_2, ..., Y_t) = \phi E(e_{t+1}) = 0$, because $e_{t+1}$ is
independent of $Y_1, Y_2, ..., Y_t$.
Therefore, the MMSE forecast of $Y_{t+2}$ is
$$\hat{Y}_t(2) = \mu + \phi^2(Y_t - \mu).$$
l-step ahead forecast: For larger lead times, this pattern continues. In general, the
MMSE forecast of $Y_{t+l}$, for all l ≥ 1, is
$$\hat{Y}_t(l) = \mu + \phi^l(Y_t - \mu).$$
• When −1 < ϕ < 1 (stationarity condition), note that $\phi^l \approx 0$ when l is large.
• Therefore, as l increases without bound, the l-step ahead MMSE forecast
$$\hat{Y}_t(l) \to \mu.$$
In other words, MMSE forecasts will “converge” to the overall process mean µ as
the lead time l increases.
IMPORTANT: That the MMSE forecast $\hat{Y}_t(l) \to \mu$ as l → ∞ is a characteristic of all
stationary ARMA(p, q) models.
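The claim that the AR(1) pattern continues for larger lead times can be justified with a short recursion. The following is a sketch of the standard argument, using only the conditional expectation rules from Section 9.1. For l ≥ 2, the term $e_{t+l}$ is independent of $Y_1, Y_2, ..., Y_t$, so
$$\hat{Y}_t(l) = E[\mu + \phi(Y_{t+l-1} - \mu) + e_{t+l} \mid Y_1, Y_2, ..., Y_t] = \mu + \phi[\hat{Y}_t(l-1) - \mu].$$
Applying this relation repeatedly, and using $\hat{Y}_t(1) - \mu = \phi(Y_t - \mu)$, gives
$$\hat{Y}_t(l) - \mu = \phi[\hat{Y}_t(l-1) - \mu] = \phi^2[\hat{Y}_t(l-2) - \mu] = \cdots = \phi^{l-1}[\hat{Y}_t(1) - \mu] = \phi^l(Y_t - \mu).$$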
FORECAST ERROR: In the AR(1) model, the 1-step ahead forecast error is
$$\begin{aligned}
e_t(1) &= Y_{t+1} - \hat{Y}_t(1) \\
&= \underbrace{\mu + \phi(Y_t - \mu) + e_{t+1}}_{=\, Y_{t+1}} - \underbrace{[\mu + \phi(Y_t - \mu)]}_{=\, \hat{Y}_t(1)} \\
&= e_{t+1}.
\end{aligned}$$
Therefore,
$$E[e_t(1)] = E(e_{t+1}) = 0$$
$$\mathrm{var}[e_t(1)] = \mathrm{var}(e_{t+1}) = \sigma_e^2.$$
Because the 1-step ahead forecast error $e_t(1)$ is an unbiased estimator of 0, we say that
the 1-step ahead forecast $\hat{Y}_t(1)$ is unbiased. The second equation says that the 1-step
ahead forecast error $e_t(1)$ has constant variance. To find the l-step ahead forecast
error, $e_t(l)$, we first remind ourselves (pp 94, notes) that a zero mean AR(1) process can
be written as an infinite order MA model, that is,
$$Y_t - \mu = e_t + \phi e_{t-1} + \phi^2 e_{t-2} + \phi^3 e_{t-3} + \cdots.$$
Therefore,
$$Y_{t+l} - \mu = e_{t+l} + \phi e_{t+l-1} + \phi^2 e_{t+l-2} + \cdots + \phi^{l-1} e_{t+1} + \phi^l e_t + \cdots.$$
The l-step ahead forecast error is
$$\begin{aligned}
e_t(l) &= Y_{t+l} - \hat{Y}_t(l) \\
&= Y_{t+l} - \mu + \mu - \hat{Y}_t(l) \\
&= \underbrace{e_{t+l} + \phi e_{t+l-1} + \phi^2 e_{t+l-2} + \cdots + \phi^{l-1} e_{t+1} + \phi^l e_t + \cdots}_{=\, Y_{t+l} - \mu} \; \underbrace{-\, \phi^l(Y_t - \mu)}_{=\, \mu - \hat{Y}_t(l)} \\
&= e_{t+l} + \phi e_{t+l-1} + \phi^2 e_{t+l-2} + \cdots + \phi^{l-1} e_{t+1} + \phi^l e_t + \cdots \\
&\qquad - \phi^l \underbrace{(e_t + \phi e_{t-1} + \phi^2 e_{t-2} + \phi^3 e_{t-3} + \cdots)}_{=\, Y_t - \mu} \\
&= e_{t+l} + \phi e_{t+l-1} + \phi^2 e_{t+l-2} + \cdots + \phi^{l-1} e_{t+1}.
\end{aligned}$$
Therefore, the l-step ahead forecast error has mean
$$E[e_t(l)] = E(e_{t+l} + \phi e_{t+l-1} + \phi^2 e_{t+l-2} + \cdots + \phi^{l-1} e_{t+1}) = 0,$$
i.e., forecasts are unbiased. The variance of the l-step ahead forecast error is
$$\begin{aligned}
\mathrm{var}[e_t(l)] &= \mathrm{var}(e_{t+l} + \phi e_{t+l-1} + \phi^2 e_{t+l-2} + \cdots + \phi^{l-1} e_{t+1}) \\
&= \mathrm{var}(e_{t+l}) + \phi^2 \mathrm{var}(e_{t+l-1}) + \phi^4 \mathrm{var}(e_{t+l-2}) + \cdots + \phi^{2(l-1)} \mathrm{var}(e_{t+1}) \\
&= \sigma_e^2 + \phi^2\sigma_e^2 + \phi^4\sigma_e^2 + \cdots + \phi^{2(l-1)}\sigma_e^2 \\
&= \sigma_e^2 \sum_{k=0}^{l-1} \phi^{2k} = \sigma_e^2\left(\frac{1 - \phi^{2l}}{1 - \phi^2}\right).
\end{aligned}$$
Assuming stationarity, note that $\phi^{2l} \to 0$ as l → ∞ (because −1 < ϕ < 1) and
$$\mathrm{var}[e_t(l)] \to \frac{\sigma_e^2}{1 - \phi^2} = \gamma_0 = \mathrm{var}(Y_t).$$
IMPORTANT: That $\mathrm{var}[e_t(l)] \to \gamma_0 = \mathrm{var}(Y_t)$ as l → ∞ is a characteristic of all
stationary ARMA(p, q) models.
Example 9.3. In Example 8.2 (pp 213, notes), we examined the Lake Huron elevation
data (from 1880-2006) and we used an AR(1) process to model them.
• The fit using maximum likelihood is
$$Y_t - 579.4921 = 0.8586(Y_{t-1} - 579.4921) + e_t,$$
so that $\hat{\mu} = 579.4921$, $\hat{\phi} = 0.8586$, and the white noise error variance estimate
$\hat{\sigma}_e^2 = 0.4951$. The last value observed was $Y_t = 581.27$ (the elevation for 2006).
• With l = 1, the (estimated) MMSE forecast for $Y_{t+1}$ (for 2007) is
$$\hat{Y}_t(1) = 579.4921 + 0.8586(581.27 - 579.4921) \approx 581.02.$$
• With l = 2, the (estimated) MMSE forecast for $Y_{t+2}$ (for 2008) is
$$\hat{Y}_t(2) = 579.4921 + (0.8586)^2(581.27 - 579.4921) \approx 580.80.$$
• With l = 10, the (estimated) MMSE forecast for $Y_{t+10}$ (for 2016) is
$$\hat{Y}_t(10) = 579.4921 + (0.8586)^{10}(581.27 - 579.4921) \approx 579.88.$$
NOTE : The R function predict provides (estimated) MMSE forecasts and (estimated)
standard errors of the forecast error for any ARIMA(p, d, q) model fit. For example,
consider the Lake Huron data with lead times l = 1, 2, ..., 20 (which corresponds to years
2007, 2008, ..., 2026). R produces the following output:
> huron.ar1.predict <- predict(huron.ar1.fit,n.ahead=20)

> round(huron.ar1.predict$pred,3)
Start = 2007
End = 2026
[1] 581.019 580.803 580.618 580.458 580.322 580.205 580.104 580.017 579.943 579.879
[11] 579.825 579.778 579.737 579.703 579.673 579.647 579.625 579.607 579.590 579.576
> round(huron.ar1.predict$se,3)
Start = 2007
End = 2026
[1] 0.704 0.927 1.063 1.152 1.214 1.258 1.289 1.311 1.328 1.340 1.349 1.355 1.360
[14] 1.363 1.366 1.367 1.369 1.370 1.371 1.371
• In Figure 9.3, we display the Lake Huron data. The full data set is from 1880-2006
(one elevation reading per year).
• However, for aesthetic reasons (to emphasize the MMSE forecasts), we start the
series in the plot at year 1940.
• The estimated MMSE forecasts in the R predict output are computed using
$$\hat{Y}_t(l) = \hat{\mu} + \hat{\phi}^l(Y_t - \hat{\mu}),$$
for l = 1, 2, ..., 20, starting with $Y_t = 581.27$, the observed elevation in 2006.
Therefore, the forecasts in Figure 9.3 start at 2007 and end in 2026.
• In the output above, note how MMSE forecasts $\hat{Y}_t(l)$ approach the estimated mean
$\hat{\mu} = 579.492$, as l increases. This can also be clearly seen in Figure 9.3.
• The (estimated) standard errors of the forecast error (in the predict output above)
are used to construct prediction intervals. We will discuss their construction in
due course.
Figure 9.3: Lake Huron elevation data. The full data set is from 1880-2006. This figure
starts the series at 1940. AR(1) estimated MMSE forecasts and 95 percent prediction
limits are given for lead times l = 1, 2, ..., 20. These lead times correspond to years
2007-2026. [Plot omitted: elevation versus Year.]
• Specifically, the (estimated) standard errors of the forecast error (in the predict
output above) are given by
$$\widehat{\mathrm{se}}[e_t(l)] = \sqrt{\widehat{\mathrm{var}}[e_t(l)]} = \sqrt{\hat{\sigma}_e^2\left(\frac{1 - \hat{\phi}^{2l}}{1 - \hat{\phi}^2}\right)},$$
where $\hat{\sigma}_e^2 = 0.4951$ and $\hat{\phi} = 0.8586$.
• Note how these (estimated) standard errors approach
$$\lim_{l \to \infty} \widehat{\mathrm{se}}[e_t(l)] = \sqrt{\frac{\hat{\sigma}_e^2}{1 - \hat{\phi}^2}} = \sqrt{\frac{0.4951}{1 - (0.8586)^2}} \approx 1.373.$$
This value (1.373) is the square root of the estimated AR(1) process variance $\hat{\gamma}_0$.
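The predict output for the Lake Huron fit can be approximated directly from the closed-form AR(1) expressions. The sketch below plugs in the rounded estimates quoted in Example 9.3, so the results may differ from the predict output in the last decimal place; the object names are illustrative, and the data set itself is not needed.

```r
# AR(1) forecasts and forecast standard errors from the closed-form results,
# using the rounded estimates from Example 9.3 (object names illustrative).
mu.hat <- 579.4921
phi.hat <- 0.8586
sigma2.hat <- 0.4951
y.last <- 581.27     # observed elevation in 2006
l <- 1:20            # lead times for 2007 through 2026

pred <- mu.hat + phi.hat^l * (y.last - mu.hat)
se <- sqrt(sigma2.hat * (1 - phi.hat^(2 * l)) / (1 - phi.hat^2))
se.limit <- sqrt(sigma2.hat / (1 - phi.hat^2))   # limit as l grows

print(round(pred, 3))
print(round(se, 3))
print(round(se.limit, 3))
```

The forecasts decay geometrically toward the estimated mean while the standard errors grow toward the limiting value, which is the behavior described for all stationary ARMA(p, q) models.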
9.3.2 MA(1)
MA(1): Suppose that $\{e_t\}$ is zero mean white noise with $\mathrm{var}(e_t) = \sigma_e^2$. Consider the
invertible MA(1) process
$$Y_t = \mu + e_t - \theta e_{t-1},$$
where the overall (process) mean $\mu = E(Y_t)$.
1-step ahead forecast: The MMSE forecast of $Y_{t+1}$, the 1-step ahead forecast, is
$$\begin{aligned}
\hat{Y}_t(1) &= E(Y_{t+1} \mid Y_1, Y_2, ..., Y_t) \\
&= E(\mu + e_{t+1} - \theta e_t \mid Y_1, Y_2, ..., Y_t) \\
&= \underbrace{E(\mu \mid Y_1, Y_2, ..., Y_t)}_{=\, \mu} + \underbrace{E(e_{t+1} \mid Y_1, Y_2, ..., Y_t)}_{=\, E(e_{t+1}) \,=\, 0} - \underbrace{E(\theta e_t \mid Y_1, Y_2, ..., Y_t)}_{=\, (**)}.
\end{aligned}$$
To compute (∗∗), recall (pp 105, notes) that a zero mean invertible MA(1) process can
be written in its “AR(∞)” expansion
$$e_t = (Y_t - \mu) + \theta(Y_{t-1} - \mu) + \theta^2(Y_{t-2} - \mu) + \theta^3(Y_{t-3} - \mu) + \cdots,$$
a weighted (theoretically infinite) linear combination of $Y_{t-j} - \mu$, for j = 0, 1, 2, .... That
is, $e_t$ can be expressed as a function of $Y_1, Y_2, ..., Y_t$, and hence
$$E(\theta e_t \mid Y_1, Y_2, ..., Y_t) = \theta e_t.$$
Therefore, the 1-step ahead forecast is
$$\hat{Y}_t(1) = \mu - \theta e_t.$$
From the representation above, note that the white noise term $e_t$ can be “computed” in
the 1-step ahead forecast as a byproduct of estimating θ and µ in the MA(1) fit.
l-step ahead forecast: The MMSE prediction for $Y_{t+l}$, l > 1, is given by
$$\begin{aligned}
\hat{Y}_t(l) &= E(Y_{t+l} \mid Y_1, Y_2, ..., Y_t) \\
&= E(\mu + e_{t+l} - \theta e_{t+l-1} \mid Y_1, Y_2, ..., Y_t) \\
&= \underbrace{E(\mu \mid Y_1, Y_2, ..., Y_t)}_{=\, \mu} + \underbrace{E(e_{t+l} \mid Y_1, Y_2, ..., Y_t)}_{=\, E(e_{t+l}) \,=\, 0} - \underbrace{E(\theta e_{t+l-1} \mid Y_1, Y_2, ..., Y_t)}_{=\, \theta E(e_{t+l-1}) \,=\, 0},
\end{aligned}$$
because $e_{t+l}$ and $e_{t+l-1}$ are both independent of $Y_1, Y_2, ..., Y_t$, when l > 1. Therefore, we
have shown that for the MA(1) model, MMSE forecasts are
$$\hat{Y}_t(l) = \begin{cases} \mu - \theta e_t, & l = 1 \\ \mu, & l > 1. \end{cases}$$
The key feature of an MA(1) process is that observations one unit apart in time are
correlated, whereas observations l > 1 units apart in time are not. For l > 1, there is no
autocorrelation to exploit in making a prediction; this is why a constant mean prediction
is made. Note: More generally, for any purely MA(q) process, the MMSE forecast is
$\hat{Y}_t(l) = \mu$ at all lead times l > q.
REMARK : Just as we saw in the AR(1) model case, note that Ybt (l) → µ as l → ∞. This
is a characteristic of Ybt (l) in all stationary ARMA(p, q) models.
FORECAST ERROR: In the MA(1) model, the 1-step ahead forecast error is
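\begin{align*}
e_t(1) &= Y_{t+1} - \hat{Y}_t(1) \\
&= \underbrace{\mu + e_{t+1} - \theta e_t}_{=\,Y_{t+1}} - \underbrace{(\mu - \theta e_t)}_{=\,\hat{Y}_t(1)} \\
&= e_{t+1}.
\end{align*}
Therefore,
\begin{align*}
E[e_t(1)] &= E(e_{t+1}) = 0 \\
\mathrm{var}[e_t(1)] &= \mathrm{var}(e_{t+1}) = \sigma_e^2.
\end{align*}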
As in the AR(1) model, 1-step ahead forecasts are unbiased and the variance of the 1-step
ahead forecast error is constant. The variance of the l-step ahead prediction error
et (l), for l > 1, is given by
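\begin{align*}
\mathrm{var}[e_t(l)] &= \mathrm{var}[Y_{t+l} - \hat{Y}_t(l)] \\
&= \mathrm{var}(\underbrace{\mu + e_{t+l} - \theta e_{t+l-1}}_{=\,Y_{t+l}} - \mu) \\
&= \mathrm{var}(e_{t+l} - \theta e_{t+l-1}) \\
&= \mathrm{var}(e_{t+l}) + \theta^2 \mathrm{var}(e_{t+l-1}) - 2\theta \underbrace{\mathrm{cov}(e_{t+l}, e_{t+l-1})}_{=\,0} \\
&= \sigma_e^2 + \theta^2 \sigma_e^2 = \sigma_e^2 (1 + \theta^2).
\end{align*}
Summarizing,
\[ \mathrm{var}[e_t(l)] = \begin{cases} \sigma_e^2, & l = 1 \\ \sigma_e^2 (1 + \theta^2), & l > 1. \end{cases} \]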
REMARK : In the MA(1) model, note that as l → ∞,
var[et (l)] → γ0 = var(Yt ).
This is a characteristic of var[et (l)] in all stationary ARMA(p, q) models.
Example 9.4. In Example 7.6 (pp 202, notes), we examined the Göta River discharge
rate data (1807-1956) and used an MA(1) process to model them. The fitted model
(using ML) is
\[ Y_t = 535.0311 + e_t + 0.5350 e_{t-1}, \]
so that $\hat{\mu} = 535.0311$, $\hat{\theta} = -0.5350$, and $\hat{\sigma}_e^2 = 6957$. Here are the MA(1) forecasts for lead
times $l = 1, 2, \ldots, 10$, computed using the predict function in R:
> gota.ma1.predict <- predict(gota.ma1.fit,n.ahead=10)

> round(gota.ma1.predict$pred,3)
Start = 1957
End = 1966
[1] 510.960 535.031 535.031 535.031 535.031 535.031 535.031 535.031 535.031 535.031
> round(gota.ma1.predict$se,3)
Start = 1957
End = 1966
[1] 83.411 94.599 94.599 94.599 94.599 94.599 94.599 94.599 94.599 94.599
• In Figure 9.4, we display the Göta River data. The full data set is from 1807-1956
(one discharge reading per year). However, to emphasize the MMSE forecasts in
the plot, we start the series at year 1890.
• With l = 1, 2, ..., 10, the MMSE forecasts in the predict output and in Figure 9.4
start at 1957 and end in 1966.
• From the predict output, note that $\hat{Y}_t(1) = 510.960$, the 1-step ahead forecast,
is the only “informative” one. Forecasts for $l > 1$ are $\hat{Y}_t(l) = \hat{\mu} \approx 535.0311$.
[Plot omitted. Horizontal axis: Year.]
Figure 9.4: Göta River discharge data. The full data set is from 1807-1956. This figure
starts the series at 1890. MA(1) estimated MMSE forecasts and 95 percent prediction
limits are given for lead times l = 1, 2, ..., 10. These lead times correspond to years
1957-1966.
• Recall that MA(1) forecasts only exploit the autocorrelation at the l = 1 lead time!
In the MA(1) process, there is no autocorrelation after the first lag. All future
forecasts (after the first) will revert to the process mean estimate.
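• For lead time $l = 1$,
\[ \widehat{\mathrm{se}}[e_t(1)] = \sqrt{\widehat{\mathrm{var}}[e_t(1)]} = \sqrt{\hat{\sigma}_e^2} = \sqrt{6957} \approx 83.411. \]
• For any lead time $l > 1$,
\[ \widehat{\mathrm{se}}[e_t(l)] = \sqrt{\widehat{\mathrm{var}}[e_t(l)]} = \sqrt{\hat{\sigma}_e^2 (1 + \hat{\theta}^2)} \approx \sqrt{6957[1 + (-0.5350)^2]} \approx 94.599. \]
This value (94.599) is the square root of the estimated MA(1) process variance $\hat{\gamma}_0$.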
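As a quick check, the two standard errors above can be recomputed by hand in R from the rounded estimates reported in Example 9.4. The variable names are illustrative only. Small discrepancies in the third decimal place relative to the predict output are expected, because the estimates entered here are rounded.

```r
# Rounded ML estimates reported in Example 9.4
sigma2_hat <- 6957
theta_hat  <- -0.5350

# Standard error of the forecast error at lead 1 and at any lead l > 1
se_lead1   <- sqrt(sigma2_hat)
se_lead_gt <- sqrt(sigma2_hat * (1 + theta_hat^2))

round(c(se_lead1, se_lead_gt), 2)
```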
9.3.3 ARMA(p, q)
ARMA(p, q): Suppose that {et } is zero mean white noise with var(et ) = σe2 and consider
the ARMA(p, q) process
Yt = θ0 + ϕ1 Yt−1 + ϕ2 Yt−2 + · · · + ϕp Yt−p + et − θ1 et−1 − θ2 et−2 − · · · − θq et−q .
To calculate the l-step ahead MMSE forecast, replace the time index t with t + l and take
conditional expectations of both sides (given the process history Y1 , Y2 , ..., Yt ). Doing this
leads directly to the following difference equation:
Ybt (l) = θ0 + ϕ1 Ybt (l − 1) + ϕ2 Ybt (l − 2) + · · · + ϕp Ybt (l − p)
− θ1 E(et+l−1 |Y1 , Y2 , ..., Yt ) − θ2 E(et+l−2 |Y1 , Y2 , ..., Yt ) − · · ·
− θq E(et+l−q |Y1 , Y2 , ..., Yt ).
For a general ARMA(p, q) process, MMSE forecasts are calculated using this equation.
• In the expression above,
Ybt (l − j) = E(Yt+l−j |Y1 , Y2 , ..., Yt ),
for j = 1, 2, ..., p. General recursive formulas can be derived to compute this con-
ditional expectation, as we saw in the AR(1) case.
• In the expression above,
\[ E(e_{t+l-k} \mid Y_1, Y_2, \ldots, Y_t) = \begin{cases} 0, & l - k > 0 \\ e_{t+l-k}, & l - k \le 0, \end{cases} \]
for $k = 1, 2, \ldots, q$. When $l - k \le 0$, the conditional expectation
\[ E(e_{t+l-k} \mid Y_1, Y_2, \ldots, Y_t) = e_{t+l-k}, \]
which can be approximated using infinite AR representations for invertible models
(see pp 80, CC). This is only necessary for MMSE forecasts at early lags $l \le q$
when $q$ is larger than or equal to 1.
SPECIAL CASE : Consider the ARMA(1,1) process
Yt = θ0 + ϕYt−1 + et − θet−1 .
For l = 1, we have
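\begin{align*}
\hat{Y}_t(1) &= E(Y_{t+1} \mid Y_1, Y_2, \ldots, Y_t) \\
&= E(\theta_0 + \phi Y_t + e_{t+1} - \theta e_t \mid Y_1, Y_2, \ldots, Y_t) \\
&= \underbrace{E(\theta_0 \mid Y_1, Y_2, \ldots, Y_t)}_{=\,\theta_0} + \underbrace{E(\phi Y_t \mid Y_1, Y_2, \ldots, Y_t)}_{=\,\phi Y_t} + \underbrace{E(e_{t+1} \mid Y_1, Y_2, \ldots, Y_t)}_{=\,E(e_{t+1})\,=\,0} \\
&\qquad - \underbrace{E(\theta e_t \mid Y_1, Y_2, \ldots, Y_t)}_{=\,\theta e_t} \\
&= \theta_0 + \phi Y_t - \theta e_t.
\end{align*}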
For l = 2, we have
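\begin{align*}
\hat{Y}_t(2) &= E(Y_{t+2} \mid Y_1, Y_2, \ldots, Y_t) \\
&= E(\theta_0 + \phi Y_{t+1} + e_{t+2} - \theta e_{t+1} \mid Y_1, Y_2, \ldots, Y_t) \\
&= \underbrace{E(\theta_0 \mid Y_1, Y_2, \ldots, Y_t)}_{=\,\theta_0} + \underbrace{E(\phi Y_{t+1} \mid Y_1, Y_2, \ldots, Y_t)}_{=\,\phi \hat{Y}_t(1)} + \underbrace{E(e_{t+2} \mid Y_1, Y_2, \ldots, Y_t)}_{=\,E(e_{t+2})\,=\,0} \\
&\qquad - \underbrace{E(\theta e_{t+1} \mid Y_1, Y_2, \ldots, Y_t)}_{=\,\theta E(e_{t+1})\,=\,0} \\
&= \theta_0 + \phi \hat{Y}_t(1).
\end{align*}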
It is easy to see that this pattern continues for larger lead times l; in general,
Ybt (l) = θ0 + ϕYbt (l − 1),
for all lead times l > 1. It is important to make the following observations in this special
ARMA(p = 1, q = 1) case:
• The MMSE forecast Ybt (l) depends on the MA components only when l ≤ q = 1.
• When l > q = 1, the MMSE forecast Ybt (l) depends only on the AR components.
• This is also true of MMSE forecasts in higher order ARMA(p, q) models.
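The ARMA(1,1) recursion derived above can be sketched in a few lines of R. The function name `arma11_forecast` and all numeric inputs are hypothetical; the last shock `e_t` is treated as known, whereas in practice it would be approximated from the fitted model. The sketch also uses the standard relation $\theta_0 = \mu(1 - \phi)$ for a stationary process to show the forecasts converging to $\mu$.

```r
# Sketch of the ARMA(1,1) MMSE forecast recursion (inputs are hypothetical).
arma11_forecast <- function(theta0, phi, theta, y_t, e_t, n_ahead) {
  f <- numeric(n_ahead)
  f[1] <- theta0 + phi * y_t - theta * e_t      # l = 1: the MA term contributes
  if (n_ahead > 1) {
    for (l in 2:n_ahead) {
      f[l] <- theta0 + phi * f[l - 1]           # l > 1: AR recursion only
    }
  }
  f
}

f  <- arma11_forecast(theta0 = 2, phi = 0.5, theta = 0.4,
                      y_t = 5, e_t = 1, n_ahead = 50)
mu <- 2 / (1 - 0.5)   # process mean implied by theta0 = mu * (1 - phi)
head(f, 3)
```

With these inputs, the forecasts decay geometrically (at rate $\phi$) toward the process mean, which is the behavior described in the bullets above.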
SUMMARY : The following notes summarize MMSE forecast calculations in general

ARMA(p, q) models:
• When l ≤ q, MMSE forecasts depend on both the AR and MA parts of the model.
• When l > q, the MA contributions vanish and forecasts will depend solely on the
recursion identified in the AR part. That is, when l > q,
Ybt (l) = θ0 + ϕ1 Ybt (l − 1) + ϕ2 Ybt (l − 2) + · · · + ϕp Ybt (l − p).
• It is insightful to note that the last expression, for $l > q$, can be written as
\[ \hat{Y}_t(l) - \mu = \phi_1 [\hat{Y}_t(l-1) - \mu] + \phi_2 [\hat{Y}_t(l-2) - \mu] + \cdots + \phi_p [\hat{Y}_t(l-p) - \mu]. \]
Therefore,
– as a function of $l$, $\hat{Y}_t(l) - \mu$ follows the same Yule-Walker recursion as the
autocorrelation function $\rho_k$.
– the roots of $\phi(x) = 1 - \phi_1 x - \phi_2 x^2 - \cdots - \phi_p x^p$ determine the behavior of
$\hat{Y}_t(l) - \mu$, when $l > q$; e.g., exponential decay, damped sine waves, etc.
• For any stationary ARMA(p, q) process,
\[ \lim_{l \to \infty} \hat{Y}_t(l) = \mu, \]
where $\mu = E(Y_t)$. Therefore, for large lead times $l$, MMSE forecasts will be
approximately equal to the process mean.
• For any stationary ARMA(p, q) process, the variance of the l-step ahead
forecast error satisfies
\[ \lim_{l \to \infty} \mathrm{var}[e_t(l)] = \gamma_0, \]
where $\gamma_0 = \mathrm{var}(Y_t)$. That is, for large lead times $l$, the variance of the forecast error
will be close to the process variance.
• The predict function in R automates the entire forecasting process, providing

(estimated) MMSE forecasts and standard errors of the forecast error.
Example 9.5. In Example 7.5 (pp 195, notes), we examined the bovine blood sugar
data (176 observations) and we used an ARMA(1,1) process to model them. The fitted
ARMA(1,1) model (using ML) is
\[ Y_t - 59.0071 = 0.6623 (Y_{t-1} - 59.0071) + e_t + 0.6107 e_{t-1}, \]
so that $\hat{\mu} = 59.0071$, $\hat{\phi} = 0.6623$, $\hat{\theta} = -0.6107$, and the white noise variance estimate
$\hat{\sigma}_e^2 = 20.43$. Here are the ARMA(1,1) forecasts for lead times $l = 1, 2, \ldots, 10$, computed
using the predict function in R:
> cows.arma11.predict <- predict(cows.arma11.fit,n.ahead=10)

> round(cows.arma11.predict$pred,3)
Start = 177
End = 186
[1] 58.643 58.766 58.847 58.901 58.937 58.961 58.976 58.987 58.994 58.998
> round(cows.arma11.predict$se,3)
Start = 177
End = 186
[1] 4.520 7.316 8.249 8.627 8.787 8.856 8.887 8.900 8.906 8.908
• In Figure 9.5, we display the bovine data. The full data set is from day 1-176 (one
blood sugar reading per day). However, to emphasize the MMSE forecasts in the
plot, we start the series at day 81.
• With l = 1, 2, ..., 10, the MMSE forecasts in the predict output and in Figure 9.5
start at day 177 and end at day 186.
• From the predict output and Figure 9.5, note that the predictions are all close to
$\hat{\mu} = 59.0071$, the estimated process mean. This happens because the last observed
data value was $Y_{176} = 55.91133$, which is already somewhat close to $\hat{\mu} = 59.0071$.
• Close inspection reveals that the forecasts decay (quickly) towards $\hat{\mu} = 59.0071$ as
expected.
[Plot omitted. Vertical axis: Blood sugar level (mg/100ml blood). Horizontal axis: Day.]
Figure 9.5: Bovine blood sugar data. The full data set is from day 1-176. This figure
starts the series at day 81. ARMA(1,1) estimated MMSE forecasts and 95 percent
prediction limits are given for lead times l = 1, 2, ..., 10. These lead times correspond to
days 177-186.
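• The variance of the l-step ahead prediction error $e_t(l)$ should satisfy
\[ \lim_{l \to \infty} \mathrm{var}[e_t(l)] = \gamma_0 = \left( \frac{1 - 2\phi\theta + \theta^2}{1 - \phi^2} \right) \sigma_e^2, \]
which, with $\hat{\phi} = 0.6623$, $\hat{\theta} = -0.6107$, and $\hat{\sigma}_e^2 = 20.43$, is estimated to be
\begin{align*}
\hat{\gamma}_0 &= \left( \frac{1 - 2\hat{\phi}\hat{\theta} + \hat{\theta}^2}{1 - \hat{\phi}^2} \right) \hat{\sigma}_e^2 \\
&= \left[ \frac{1 - 2(0.6623)(-0.6107) + (-0.6107)^2}{1 - (0.6623)^2} \right] (20.43) \approx 79.407.
\end{align*}
• As $l$ increases, note that the estimated standard errors $\widehat{\mathrm{se}}[e_t(l)]$ from the predict
output, as expected, get very close to $\sqrt{\hat{\gamma}_0} \approx \sqrt{79.407} \approx 8.911$.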
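The standard errors in the predict output can be approximately reproduced from the $\Psi$ weights of the fitted ARMA(1,1) model, using $\mathrm{var}[e_t(l)] = \sigma_e^2 \sum_{j=0}^{l-1} \Psi_j^2$. The sketch below uses `ARMAtoMA` from base R's stats package with the rounded estimates from Example 9.5. One assumption to note: R writes moving average terms with plus signs, so $\hat{\theta} = -0.6107$ in the notes' notation is entered as `ma = 0.6107`. Variable names are illustrative.

```r
# Rounded ML estimates reported in Example 9.5
phi_hat    <- 0.6623
ma_R       <- 0.6107    # equals -theta_hat under R's sign convention for MA terms
sigma2_hat <- 20.43

# Psi_1, ..., Psi_9 (Psi_0 = 1 is added explicitly below)
psi <- ARMAtoMA(ar = phi_hat, ma = ma_R, lag.max = 9)

# se of the l-step ahead forecast error for l = 1, ..., 10
se <- sqrt(sigma2_hat * cumsum(c(1, psi^2)))

# Limiting value sqrt(gamma_0) from the closed-form ARMA(1,1) variance
theta_hat  <- -ma_R
gamma0_hat <- sigma2_hat * (1 - 2 * phi_hat * theta_hat + theta_hat^2) / (1 - phi_hat^2)

round(se, 3)
round(sqrt(gamma0_hat), 3)
```

Because the inputs are rounded, agreement with the predict output is expected only to about two or three decimal places.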
9.3.4 Nonstationary models
NOTE : For invertible ARIMA(p, d, q) models with d ≥ 1, MMSE forecasts are computed
using the same approach as in the stationary case. To see why, suppose that d = 1, so
that the model is
ϕ(B)(1 − B)Yt = θ(B)et ,
where (1 − B)Yt = ∇Yt is the series of first differences. Note that
\begin{align*}
\phi(B)(1 - B) &= (1 - \phi_1 B - \phi_2 B^2 - \cdots - \phi_p B^p)(1 - B) \\
&= (1 - \phi_1 B - \phi_2 B^2 - \cdots - \phi_p B^p) - (B - \phi_1 B^2 - \phi_2 B^3 - \cdots - \phi_p B^{p+1}) \\
&= \underbrace{1 - (1 + \phi_1) B - (\phi_2 - \phi_1) B^2 - \cdots - (\phi_p - \phi_{p-1}) B^p + \phi_p B^{p+1}}_{=\,\phi^*(B),\ \text{say}}.
\end{align*}
We can therefore rewrite the ARIMA(p, 1, q) model as
\[ \phi^*(B) Y_t = \theta(B) e_t, \]
a nonstationary ARMA(p + 1, q) model. We then calculate MMSE forecasts the same
way as in the stationary case.
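The expansion of $\phi(B)(1 - B)$ is a polynomial multiplication, which can be checked numerically. The helper `poly_mult` below is a hypothetical name, and the AR(2) coefficients are made up for illustration. Polynomials in $B$ are stored as coefficient vectors with the constant term first.

```r
# Multiply two polynomials in B given as coefficient vectors (constant term first).
poly_mult <- function(a, b) {
  out <- numeric(length(a) + length(b) - 1)
  for (i in seq_along(a)) {
    for (j in seq_along(b)) {
      out[i + j - 1] <- out[i + j - 1] + a[i] * b[j]
    }
  }
  out
}

phi <- c(0.5, -0.3)                            # hypothetical phi_1, phi_2
phi_star <- poly_mult(c(1, -phi), c(1, -1))    # phi(B) * (1 - B)

# Coefficients predicted by the formula:
# 1, -(1 + phi_1), -(phi_2 - phi_1), +phi_2
by_formula <- c(1, -(1 + phi[1]), -(phi[2] - phi[1]), phi[2])
rbind(phi_star, by_formula)
```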
EXAMPLE : Suppose p = d = q = 1 so that we have an ARIMA(1,1,1) process
(1 − ϕB)(1 − B)Yt = (1 − θB)et .
This can be written as
Yt = (1 + ϕ)Yt−1 − ϕYt−2 + et − θet−1 ,
a nonstationary ARMA(2,1) model. If l = 1, then
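\begin{align*}
\hat{Y}_t(1) &= E(Y_{t+1} \mid Y_1, Y_2, \ldots, Y_t) \\
&= E[(1 + \phi) Y_t - \phi Y_{t-1} + e_{t+1} - \theta e_t \mid Y_1, Y_2, \ldots, Y_t] \\
&= \underbrace{E[(1 + \phi) Y_t \mid Y_1, Y_2, \ldots, Y_t]}_{=\,(1+\phi) Y_t} - \underbrace{E(\phi Y_{t-1} \mid Y_1, Y_2, \ldots, Y_t)}_{=\,\phi Y_{t-1}} + \underbrace{E(e_{t+1} \mid Y_1, Y_2, \ldots, Y_t)}_{=\,E(e_{t+1})\,=\,0} \\
&\qquad - \underbrace{E(\theta e_t \mid Y_1, Y_2, \ldots, Y_t)}_{=\,\theta e_t} \\
&= (1 + \phi) Y_t - \phi Y_{t-1} - \theta e_t.
\end{align*}
If $l = 2$, then
\begin{align*}
\hat{Y}_t(2) &= E(Y_{t+2} \mid Y_1, Y_2, \ldots, Y_t) \\
&= E[(1 + \phi) Y_{t+1} - \phi Y_t + e_{t+2} - \theta e_{t+1} \mid Y_1, Y_2, \ldots, Y_t] \\
&= \underbrace{E[(1 + \phi) Y_{t+1} \mid Y_1, Y_2, \ldots, Y_t]}_{=\,(1+\phi) \hat{Y}_t(1)} - \underbrace{E(\phi Y_t \mid Y_1, Y_2, \ldots, Y_t)}_{=\,\phi Y_t} + \underbrace{E(e_{t+2} \mid Y_1, Y_2, \ldots, Y_t)}_{=\,E(e_{t+2})\,=\,0} \\
&\qquad - \underbrace{E(\theta e_{t+1} \mid Y_1, Y_2, \ldots, Y_t)}_{=\,\theta E(e_{t+1})\,=\,0} \\
&= (1 + \phi) \hat{Y}_t(1) - \phi Y_t.
\end{align*}
For $l > 2$, it follows similarly that
\begin{align*}
\hat{Y}_t(l) &= E(Y_{t+l} \mid Y_1, Y_2, \ldots, Y_t) \\
&= E[(1 + \phi) Y_{t+l-1} - \phi Y_{t+l-2} + e_{t+l} - \theta e_{t+l-1} \mid Y_1, Y_2, \ldots, Y_t] \\
&= (1 + \phi) \hat{Y}_t(l-1) - \phi \hat{Y}_t(l-2).
\end{align*}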
Writing recursive expressions for MMSE forecasts in any invertible ARIMA(p, d, q) model
can be done similarly.
RESULT : The l-step ahead forecast error et (l) = Yt+l − Ybt (l) for any invertible
ARIMA(p, d, q) model has the following characteristics:
\begin{align*}
E[e_t(l)] &= 0 \\
\mathrm{var}[e_t(l)] &= \sigma_e^2 \sum_{j=0}^{l-1} \Psi_j^2,
\end{align*}
where the $\Psi$ weights correspond to those in the truncated linear process representation
of the ARIMA(p, d, q) model; see pp 200 (CC).
• The first equation implies that MMSE ARIMA forecasts are unbiased.
• The salient feature in the second equation is that for nonstationary models, the
Ψ weights do not “die out” as they do with stationary models.
• Therefore, for nonstationary models, the variance of the forecast error var[et (l)]
continues to increase as l does. This is not surprising given that the process is not
stationary.
Example 9.6. In Example 8.7 (pp 225, notes), we examined monthly spot prices for
crude oil (measured in U.S. dollars per barrel) from 1/86 to 1/06, and we used a log-
transformed IMA(1,1) process to model them. The model fit (using ML) is
\[ \log Y_t = \log Y_{t-1} + e_t + 0.2956 e_{t-1}, \]
so that $\hat{\theta} = -0.2956$ and the white noise variance estimate is $\hat{\sigma}_e^2 = 0.006689$. The
estimated forecasts and standard errors (on the log scale) are given for lead times
$l = 1, 2, \ldots, 12$ in the predict output below:
> ima11.log.oil.predict <- predict(ima11.log.oil.fit,n.ahead=12)

> round(ima11.log.oil.predict$pred,3)
Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
2006 4.208 4.208 4.208 4.208 4.208 4.208 4.208 4.208 4.208 4.208 4.208
2007 4.208
> round(ima11.log.oil.predict$se,3)
2006 0.082 0.134 0.171 0.201 0.227 0.251 0.272 0.292 0.311 0.328 0.345
2007 0.361
• In Figure 9.6, we display the oil price data. The full data set is from 1/86 to 1/06
(one observation per month). However, to emphasize the MMSE forecasts in the
plot, we start the series at month 1/98.
• With l = 1, 2, ..., 12, the estimated MMSE forecasts in the predict output and in
Figure 9.6 start at 2/06 and end in 1/07.
• From the predict output, note that Ybt (1) = Ybt (2) = · · · = Ybt (12) = 4.208. It is
important to remember that these forecasts are on the log scale.
• On the original scale (in dollars), we will see later that MMSE forecasts are not
constant.
• As expected from a nonstationary process, the estimated standard errors (also on

the log scale) increase as l increases.
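The growth of these standard errors can be reproduced by hand. For the IMA(1,1) model, the $\Psi$ weights are $\Psi_0 = 1$ and $\Psi_j = 1 - \theta$ for $j \ge 1$ (a standard result, stated here without derivation), so that $\mathrm{var}[e_t(l)] = \sigma_e^2 [1 + (l-1)(1-\theta)^2]$. The R sketch below plugs in the rounded estimates from Example 9.6; variable names are illustrative.

```r
# Rounded ML estimates reported in Example 9.6 (log scale)
theta_hat  <- -0.2956
sigma2_hat <- 0.006689

l <- 1:12
# Psi_0 = 1 and Psi_j = 1 - theta for j >= 1, so the sum of squared
# Psi weights up to lag l - 1 is 1 + (l - 1) * (1 - theta)^2
se <- sqrt(sigma2_hat * (1 + (l - 1) * (1 - theta_hat)^2))

round(se, 3)
```

The variance grows linearly in $l$ without bound, which is why the prediction limits in Figure 9.6 keep widening.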
[Plot omitted. Vertical axis: Oil prices (on log scale). Horizontal axis: Year.]
Figure 9.6: Oil price data (log-transformed). The full data set is from 1/86 to 1/06. This
figure starts the series at 1/98. IMA(1,1) estimated MMSE forecasts and 95 percent
prediction limits (on the log scale) are given for lead times l = 1, 2, ..., 12. These lead
times correspond to months 2/06-1/07.
Example 9.7. In Example 1.6 (pp 7, notes), we examined the USC fall enrollment data
(Columbia campus) from 1954-2010. An ARI(1,1) process provides a good fit to these
data; fitting this model in R (using ML) gives the following output:
> enrollment.ari11.fit = arima(enrollment,order=c(1,1,0),method='ML')

> enrollment.ari11.fit
Coefficients:
ar1
0.3637
s.e. 0.1236
The fitted ARI(1,1) model is therefore
\[ Y_t - Y_{t-1} = 0.3637 (Y_{t-1} - Y_{t-2}) + e_t, \]
so that $\hat{\phi} = 0.3637$ and the white noise variance estimate $\hat{\sigma}_e^2 = 1119849$. The predict
output from R is given below:
> enrollment.ari11.predict <- predict(enrollment.ari11.fit,n.ahead=10)

> round(enrollment.ari11.predict$pred,3)
Start = 2011
End = 2020
[1] 28842.12 28973.44 29021.20 29038.56 29044.88 29047.18 29048.01 29048.32 29048.43
[10] 29048.47
> round(enrollment.ari11.predict$se,3)
Start = 2011
End = 2020
[1] 1058.229 1789.494 2389.190 2894.460 3332.925 3723.059 4077.018 4402.947 4706.473
[10] 4991.615
• In Figure 9.7, we display the USC enrollment data. The full data set is from 1954-2010
(one enrollment count per year). However, to emphasize the MMSE forecasts
in the plot, we start the series at year 1974.
• With l = 1, 2, ..., 10, the MMSE forecasts in the predict output and in
Figure 9.7 start at 2011 and end at 2020.
• From the predict output, note that the estimated MMSE forecasts for the next
10 years, based on the ARI(1,1) fit, increase slightly and then level off near 29,048.
• As expected from a nonstationary process, the estimated standard errors increase

as l increases.
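One way to see the ARI(1,1) structure in these forecasts: setting $\theta = 0$ in the ARIMA(1,1,1) recursion from Section 9.3.4 gives $\hat{Y}_t(l) - \hat{Y}_t(l-1) = \phi [\hat{Y}_t(l-1) - \hat{Y}_t(l-2)]$, so successive forecast increments should shrink by the factor $\hat{\phi} = 0.3637$. The R sketch below checks this using the first five forecasts copied from the predict output. Later increments are too small relative to the two-decimal rounding to give stable ratios, so they are left out.

```r
# First five forecasts copied from the predict output (years 2011-2015)
pred <- c(28842.12, 28973.44, 29021.20, 29038.56, 29044.88)

incr  <- diff(pred)                          # year-to-year forecast increments
ratio <- incr[-1] / incr[-length(incr)]      # ratio of successive increments

round(ratio, 3)
```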
REMARK : As we have seen in the forecasting examples up to now, prediction limits

are used to assess uncertainty in the calculated MMSE forecasts. We now discuss how
these limits are obtained.
[Plot omitted. Vertical axis: USC Columbia fall enrollment. Horizontal axis: Year.]
Figure 9.7: University of South Carolina fall enrollment data. The full data set is from
1954-2010. This figure starts the series at 1974. ARI(1,1) estimated MMSE forecasts
and 95 percent prediction limits are given for lead times l = 1, 2, ..., 10. These lead times
correspond to years 2011-2020.
9.4 Prediction intervals
TERMINOLOGY : A 100(1 − α) percent prediction interval for $Y_{t+l}$ is an interval
$(\hat{Y}_{t+l}^{(L)}, \hat{Y}_{t+l}^{(U)})$ which satisfies
\[ \mathrm{pr}(\hat{Y}_{t+l}^{(L)} < Y_{t+l} < \hat{Y}_{t+l}^{(U)}) = 1 - \alpha. \]
We now derive prediction intervals for future responses with deterministic trend and
ARIMA models.
NOTE : Prediction intervals and confidence intervals, while similar in spirit, have very
different interpretations. A confidence interval is for a population (model) parameter,
which is fixed. A prediction interval is constructed for a random variable.
9.4.1 Deterministic trend models
RECALL: Recall our deterministic trend model of the form
Yt = µt + Xt ,
where µt is a non-random trend function and where we assume (for purposes of the current
discussion) that {Xt } is a normally distributed stochastic process with E(Xt ) = 0 and
var(Xt ) = γ0 (constant). We have already shown the following:
Ybt (l) = µt+l
E[et (l)] = 0
var[et (l)] = γ0 ,
where et (l) = Yt+l − Ybt (l) is the l-step ahead prediction error. Under the assumption of
normality, the random variable
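\[ Z = \frac{e_t(l)}{\sqrt{\mathrm{var}[e_t(l)]}} = \frac{Y_{t+l} - \hat{Y}_t(l)}{\sqrt{\mathrm{var}[e_t(l)]}} = \frac{Y_{t+l} - \hat{Y}_t(l)}{\mathrm{se}[e_t(l)]} \sim N(0, 1). \]
Therefore, $Z$ is a pivotal quantity and
\[ \mathrm{pr}\left( -z_{\alpha/2} < \frac{Y_{t+l} - \hat{Y}_t(l)}{\mathrm{se}[e_t(l)]} < z_{\alpha/2} \right) = 1 - \alpha. \]
Using algebra to rearrange the event inside the probability symbol, we have
\[ \mathrm{pr}\left( \hat{Y}_t(l) - z_{\alpha/2}\, \mathrm{se}[e_t(l)] < Y_{t+l} < \hat{Y}_t(l) + z_{\alpha/2}\, \mathrm{se}[e_t(l)] \right) = 1 - \alpha. \]
This shows that
\[ \left( \hat{Y}_t(l) - z_{\alpha/2}\, \mathrm{se}[e_t(l)],\ \hat{Y}_t(l) + z_{\alpha/2}\, \mathrm{se}[e_t(l)] \right) \]
is a 100(1 − α) percent prediction interval for $Y_{t+l}$.
REMARK : The form of the prediction interval includes the quantities $\hat{Y}_t(l) = \mu_{t+l}$ and
$\mathrm{se}[e_t(l)] = \sqrt{\gamma_0}$. Of course, these are population parameters that must be estimated
using the data.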
Example 9.8. Consider the global temperature data from Example 3.4 (pp 53, notes).
Fitting a linear deterministic trend model Yt = β0 + β1 t + Xt , for t = 1900, 1901, ..., 1997,
produces the following output in R:
Coefficients: Estimate Std. Error t value Pr(>|t|)

(Intercept) -1.219e+01 9.032e-01 -13.49 <2e-16 ***
time(globaltemps.1900) 6.209e-03 4.635e-04 13.40 <2e-16 ***

Suppose that {Xt } is a normal white noise process with (constant) variance γ0 . The
analysis in Section 3.5.1 (notes, pp 72-73) does support the normality assumption.
• The fitted model is
\[ \hat{Y}_t = -12.19 + 0.0062 t, \]
for $t = 1900, 1901, \ldots, 1997$.
• An estimate of the (assumed constant) variance of $X_t$ is
\[ \hat{\gamma}_0 \approx (0.1298)^2 \approx 0.0168. \]
• The 1-step ahead MMSE forecast for 1998 is estimated to be
\[ \hat{Y}_t(1) = -12.19 + 0.0062 (1997 + 1) \approx 0.198. \]
• Therefore, with
\[ \widehat{\mathrm{se}}[e_t(1)] \approx \sqrt{\hat{\gamma}_0} \approx 0.1298, \]
a 95 percent prediction interval for 1998 is
\[ 0.198 \pm 1.96 \times 0.1298 \implies (-0.056, 0.452). \]
• If we had made this prediction in 1997, we would have been 95 percent confident
that the temperature deviation for 1998, $Y_{1998}$, falls between −0.056 and 0.452.
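The interval above can be reproduced in R. This sketch uses the rounded coefficients shown in the fitted model (not the full-precision regression output) and rounds the point forecast to three decimals, as the notes do. Variable names are illustrative.

```r
# Point forecast for 1998 from the rounded fitted line, rounded as in the notes
forecast_1998 <- round(-12.19 + 0.0062 * (1997 + 1), 3)

se_hat <- 0.1298           # estimated sqrt(gamma_0)
z      <- qnorm(0.975)     # upper 0.025 quantile of N(0, 1), about 1.96

pi_95 <- forecast_1998 + c(-1, 1) * z * se_hat
round(pi_95, 3)
```

Because the margin of error does not depend on the lead time, the same half-width would apply to a forecast for any later year under this model.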
IMPORTANT : The formation of prediction intervals from deterministic trend models
requires that the stochastic component $X_t$ is normally distributed with constant
variance. This may or may not be true in practice, but the validity of the prediction
limits requires it to be true. Note also that since the margin of error
\[ z_{\alpha/2}\, \mathrm{se}[e_t(l)] = z_{\alpha/2} \sqrt{\gamma_0} \]
is free of $l$, prediction intervals have the same width indefinitely into the future.
9.4.2 ARIMA models
Consider the general ARIMA(p, d, q) model
\[ \phi(B)(1 - B)^d Y_t = \theta_0 + \theta(B) e_t. \]
We have seen that the l-step ahead forecast error $e_t(l) = Y_{t+l} - \hat{Y}_t(l)$ for any invertible
ARIMA(p, d, q) model has the following characteristics:
\begin{align*}
E[e_t(l)] &= 0 \\
\mathrm{var}[e_t(l)] &= \sigma_e^2 \sum_{j=0}^{l-1} \Psi_j^2,
\end{align*}
where the $\Psi$ weights are unique to the specific model under investigation. If we
additionally assume that the white noise process $\{e_t\}$ is normally distributed, then
\[ Z = \frac{e_t(l)}{\sqrt{\mathrm{var}[e_t(l)]}} = \frac{Y_{t+l} - \hat{Y}_t(l)}{\sqrt{\mathrm{var}[e_t(l)]}} = \frac{Y_{t+l} - \hat{Y}_t(l)}{\mathrm{se}[e_t(l)]} \sim N(0, 1). \]
This implies that
\[ \left( \hat{Y}_t(l) - z_{\alpha/2}\, \mathrm{se}[e_t(l)],\ \hat{Y}_t(l) + z_{\alpha/2}\, \mathrm{se}[e_t(l)] \right) \]
is a 100(1 − α) percent prediction interval for $Y_{t+l}$. As we have seen in the examples
so far, R gives (estimated) MMSE forecasts and standard errors; i.e., estimates of $\hat{Y}_t(l)$
and $\mathrm{se}[e_t(l)]$, so we can compute prediction intervals associated with any ARIMA(p, d, q)
model. It is important to emphasize that normality is assumed.
Example 9.9. In Example 9.3, we examined the Lake Huron elevation data (from 1880-
2006) and calculated the (estimated) MMSE forecasts based on an AR(1) model fit with
lead times l = 1, 2, ..., 20. These forecasts, along with 95 percent prediction intervals
(limits) were depicted visually in Figure 9.3. Here are the numerical values of these
prediction intervals from R (I display only out to lead time l = 10 for brevity):
> # Create lower and upper prediction interval bounds

> lower.pi<-huron.ar1.predict$pred-qnorm(0.975,0,1)*huron.ar1.predict$se
> upper.pi<-huron.ar1.predict$pred+qnorm(0.975,0,1)*huron.ar1.predict$se
> # Display prediction intervals (2007-2026)
> data.frame(Year=c(2007:2026),lower.pi,upper.pi)
Year lower.pi upper.pi
1 2007 579.6395 582.3978
2 2008 578.9851 582.6206
3 2009 578.5347 582.7004
4 2010 578.2000 582.7168
5 2011 577.9423 582.7014
6 2012 577.7395 582.6696
7 2013 577.5776 582.6301
8 2014 577.4469 582.5878
9 2015 577.3406 582.5456
10 2016 577.2534 582.5053
• In the R code, $pred extracts the estimated MMSE forecasts and $se extracts the
estimated standard error of the forecast error. The expression qnorm(0.975,0,1)
gives the upper 0.025 quantile of the N (0, 1) distribution (approximately 1.96).
• For example, we are 95 percent confident that the Lake Huron elevation level for
2015 will be between 577.3406 and 582.5456 feet.
• Note how the prediction limits (lower and upper) start to stabilize as the lead
time l increases. This is typical of a stationary process. Prediction limits from
nonstationary model fits do not stabilize as l increases.
• Important: The validity of prediction intervals depends on the white noise process
{et } being normally distributed.
9.5 Forecasting transformed series
9.5.1 Differencing
DISCOVERY : Calculating MMSE forecasts from nonstationary ARIMA models (i.e.,

d ≥ 1) poses no additional methodological challenges beyond those of stationary ARMA
models. It is easy to take this fact for granted, because as we have already seen, R auto-
mates the entire forecasting process for stationary and nonstationary models. However,
it is important to understand why this is true so we investigate by means of an example.
EXAMPLE : Suppose that {et } is a zero mean white noise process with var(et ) = σe2 and
consider the IMA(1,1) model
Yt = Yt−1 + et − θet−1 .
The 1-step ahead MMSE forecast is
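\begin{align*}
\hat{Y}_t(1) &= E(Y_{t+1} \mid Y_1, Y_2, \ldots, Y_t) \\
&= E(Y_t + e_{t+1} - \theta e_t \mid Y_1, Y_2, \ldots, Y_t) \\
&= \underbrace{E(Y_t \mid Y_1, Y_2, \ldots, Y_t)}_{=\,Y_t} + \underbrace{E(e_{t+1} \mid Y_1, Y_2, \ldots, Y_t)}_{=\,E(e_{t+1})\,=\,0} - \underbrace{E(\theta e_t \mid Y_1, Y_2, \ldots, Y_t)}_{=\,\theta e_t} \\
&= Y_t - \theta e_t.
\end{align*}
The l-step ahead MMSE forecast, for $l > 1$, is
\begin{align*}
\hat{Y}_t(l) &= E(Y_{t+l} \mid Y_1, Y_2, \ldots, Y_t) \\
&= E(Y_{t+l-1} + e_{t+l} - \theta e_{t+l-1} \mid Y_1, Y_2, \ldots, Y_t) \\
&= \underbrace{E(Y_{t+l-1} \mid Y_1, Y_2, \ldots, Y_t)}_{=\,\hat{Y}_t(l-1)} + \underbrace{E(e_{t+l} \mid Y_1, Y_2, \ldots, Y_t)}_{=\,E(e_{t+l})\,=\,0} - \underbrace{E(\theta e_{t+l-1} \mid Y_1, Y_2, \ldots, Y_t)}_{=\,\theta E(e_{t+l-1})\,=\,0} \\
&= \hat{Y}_t(l-1).
\end{align*}
Therefore, we have shown that for the IMA(1,1) model, MMSE forecasts are
\[ \hat{Y}_t(l) = \begin{cases} Y_t - \theta e_t, & l = 1 \\ \hat{Y}_t(l-1), & l > 1. \end{cases} \]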
Now, let $W_t = \nabla Y_t = Y_t - Y_{t-1}$, so that $W_t$ follows a zero-mean MA(1) model; i.e.,
\[ W_t = e_t - \theta e_{t-1}. \]
We have already shown that for an MA(1) process with $\mu = 0$,
\[ \hat{W}_t(l) = \begin{cases} -\theta e_t, & l = 1 \\ 0, & l > 1. \end{cases} \]
• When $l = 1$, note that
\[ \hat{W}_t(1) = -\theta e_t = \underbrace{Y_t - \theta e_t}_{=\,\hat{Y}_t(1)} - Y_t = \hat{Y}_t(1) - Y_t. \]
• When $l > 1$, note that
\[ \hat{W}_t(l) = 0 = \hat{Y}_t(l) - \hat{Y}_t(l-1). \]
Therefore, we have shown that with the IMA(1,1) model,
(a) forecasting the original nonstationary series Yt
(b) forecasting the stationary differenced series Wt = ∇Yt and then summing to
obtain the forecast in original terms
are equivalent procedures. In fact, this equivalence holds when forecasting for any
ARIMA(p, d, q) model!
• That is, the analyst can calculate predictions with the nonstationary model for Yt
or with the stationary model for Wt = ∇d Yt (and then convert back to the original
scale by adding).
• The predictions in both cases will be equal (hence, the resulting standard errors
will be the same too).
• The reason this occurs is that differencing is a linear operation (just as conditional
expectation is).
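The equivalence of procedures (a) and (b) for the IMA(1,1) model can be illustrated numerically. In the R sketch below, the last observation, last shock, and MA parameter are hypothetical values chosen for illustration; the shock is treated as known.

```r
# Hypothetical inputs: last observation, last shock, MA parameter, horizon
y_t <- 100; e_t <- 2; theta <- 0.4; n_ahead <- 5

# (a) Forecast the nonstationary series Y directly with the IMA(1,1) recursion
y_direct <- numeric(n_ahead)
y_direct[1] <- y_t - theta * e_t
for (l in 2:n_ahead) y_direct[l] <- y_direct[l - 1]

# (b) Forecast the differences W (zero-mean MA(1)), then sum back to the Y scale
w_hat      <- c(-theta * e_t, rep(0, n_ahead - 1))
y_via_diff <- y_t + cumsum(w_hat)

all.equal(y_direct, y_via_diff)
```

The `cumsum` step is the "summing" referred to in (b): it undoes the first difference, which is possible because differencing is linear.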
9.5.2 Log-transformed series
RECALL: In Chapter 5, we discussed the Box-Cox family of transformations
\[ T(Y_t) = \begin{cases} \dfrac{Y_t^{\lambda} - 1}{\lambda}, & \lambda \neq 0 \\ \ln(Y_t), & \lambda = 0, \end{cases} \]
where $\lambda$ is the transformation parameter. Many time series processes $\{Y_t\}$ exhibit
nonconstant variability that can be stabilized by taking logarithms. However, the
function $T(x) = \ln x$ is not a linear function, so transformations on the log scale cannot
be “undone” as easily as with differenced series (differencing is a linear
transformation). MMSE forecasts are not preserved under exponentiation.
THEORY : For notational purposes, set
Zt = ln Yt ,
and denote the MMSE forecast for Zt+l by Zbt (l), that is, Zbt (l) is the l-step ahead MMSE
forecast on the log scale.
• The MMSE forecast for $Y_{t+l}$ is not $\hat{Y}_t(l) = e^{\hat{Z}_t(l)}$!! This is sometimes called the
naive forecast of $Y_{t+l}$.
• The theoretical argument on pp 210 (CC) shows that the corresponding MMSE
forecast of $Y_{t+l}$ is
\[ \hat{Y}_t(l) = \exp\left\{ \hat{Z}_t(l) + \frac{1}{2} \mathrm{var}[e_t(l)] \right\}, \]
where $\mathrm{var}[e_t(l)]$ is the variance of the l-step ahead forecast error $e_t(l) = Z_{t+l} - \hat{Z}_t(l)$.
Example 9.10. In Example 9.6, we examined the monthly oil price data (1/86-1/06)
and we computed MMSE forecasts and prediction limits for l = 1, 2, ..., 12 (i.e., for 2/06
to 1/07), based on an IMA(1,1) fit for Zt = ln Yt . The estimated MMSE forecasts (on
the log scale) are depicted visually in Figure 9.6. The estimated MMSE forecasts, both
on the log scale and on the original scale (back-transformed), are given below:
> ima11.log.oil.predict <- predict(ima11.log.oil.fit,n.ahead=12)

> round(ima11.log.oil.predict$pred,3)
2006 4.208 4.208 4.208 4.208 4.208 4.208 4.208 4.208 4.208 4.208 4.208
2007 4.208
> round(ima11.log.oil.predict$se,3)
2006 0.082 0.134 0.171 0.201 0.227 0.251 0.272 0.292 0.311 0.328 0.345
2007 0.361
> # MMSE forecasts back-transformed (to original scale)
> oil.price.predict <-
round(exp(ima11.log.oil.predict$pred + (1/2)*(ima11.log.oil.predict$se)^2),3)
> oil.price.predict
2006 67.417 67.796 68.178 68.562 68.948 69.336 69.726 70.119 70.513 70.910 71.310
2007 71.711
For example, the MMSE forecast (on the original scale) for June, 2006 is given by
\[ \hat{Y}_t(5) = \exp\left\{ 4.208 + \frac{1}{2} (0.227)^2 \right\} \approx 68.948. \]
NOTE : A 100(1 − α) percent prediction interval for $Y_{t+l}$ can be formed by
exponentiating the endpoints of the prediction interval for $Z_{t+l} = \log Y_{t+l}$. This is true because
\[ 1 - \alpha = \mathrm{pr}\left( \hat{Z}_{t+l}^{(L)} < Z_{t+l} < \hat{Z}_{t+l}^{(U)} \right) = \mathrm{pr}\left( e^{\hat{Z}_{t+l}^{(L)}} < Y_{t+l} < e^{\hat{Z}_{t+l}^{(U)}} \right); \]
that is, because the exponential function $f(x) = e^x$ is strictly increasing, the two
probabilities above are the same.
• For example, a 95 percent prediction interval for June, 2006 (on the log scale) is
\[ 4.208 \pm 1.96 (0.227) \implies (3.763, 4.653). \]
• A 95 percent prediction interval for June, 2006 on the original scale (in dollars) is
\[ (e^{3.763}, e^{4.653}) \implies (43.08, 104.90). \]
Therefore, we are 95 percent confident that the June, 2006 oil price (had we made
this prediction in January, 2006) would fall between 43.08 and 104.90 dollars.
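The back-transformation steps for June, 2006 can be collected in one short R sketch. It uses the rounded log-scale forecast and standard error from the predict output and the rounded quantile 1.96, so results agree with the values above only to rounding. Variable names are illustrative.

```r
# Rounded log-scale forecast and standard error for l = 5 (June, 2006)
z_hat  <- 4.208
se_hat <- 0.227

# 95 percent prediction interval on the log scale, then exponentiate the endpoints
pi_log     <- z_hat + c(-1, 1) * 1.96 * se_hat
pi_dollars <- exp(pi_log)

# Naive forecast versus the MMSE forecast on the original scale
naive <- exp(z_hat)
mmse  <- exp(z_hat + 0.5 * se_hat^2)

round(pi_dollars, 2)
round(c(naive = naive, mmse = mmse), 2)
```

Note the asymmetry: the interval endpoints are transformed directly, while the point forecast needs the variance correction, so the MMSE forecast always exceeds the naive one.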
10 Seasonal ARIMA Models
10.1 Introduction
PREVIEW : In this chapter, we introduce new ARIMA models that incorporate seasonal
patterns occurring over time. With seasonal data, dependence with the past occurs most
prominently at multiples of an underlying seasonal lag, denoted by s. Consider the
following examples:
• With monthly data, there can be strong autocorrelation at lags that are multiples
of s = 12. For example, January observations tend to be “alike” across years,
February observations tend to be “alike,” and so on.
• With quarterly data, there can be strong autocorrelation at lags that are multiples
of s = 4. For example, first quarter sales tend to be “alike” across years, second
quarter sales tend to be “alike,” and so on.
UBIQUITY : Many physical, biological, epidemiological, and economic processes tend to

elicit seasonal patterns over time. We therefore wish to study new time series models
which can account explicitly for these types of patterns. We refer to this new class of
models generally as seasonal ARIMA models.
Example 10.1. In Example 1.2 (pp 3, notes), we examined the monthly U.S. milk
production data (in millions of pounds) from January, 1994 to December, 2005.
• In Figure 10.1, we see that there are two types of trend in the milk production
data:
– an upward linear trend (across the years)
– a seasonal trend (within years).
[Plot omitted. Horizontal axis: Year.]
Figure 10.1: United States milk production data. Monthly production figures, measured
in millions of pounds, from January, 1994 to December, 2005.
• We know the upward linear trend can be “removed” by working with first differences
∇Yt = Yt − Yt−1 . This is how we removed linear trends with nonseasonal data.
• Figure 10.2 displays the series of first differences ∇Yt . From this plot, it is clear that
the upward linear trend over time has been removed. That is, the first differences
∇Yt look stationary in the mean level.
• However, the first difference process {∇Yt } still displays a pronounced seasonal
pattern that repeats itself every s = 12 months. This is easily seen from the
monthly plotting symbols that I have added. How can we “handle” this type of
pattern? Is it possible to “remove” it as well?
GOAL: We wish to enlarge our class of ARIMA(p, d, q) models to handle seasonal data
such as these.
[Plot omitted. Vertical axis: Amount of milk produced: First differences. Horizontal axis: Year. Points are labeled with monthly plotting symbols.]
Figure 10.2: United States milk production data. First differences ∇Yt = Yt − Yt−1 .
Monthly plotting symbols have been added.
10.2 Purely seasonal (stationary) ARMA models
10.2.1 MA(Q)s
TERMINOLOGY : Suppose {et } is a zero mean white noise process with var(et ) = σe2 . A
seasonal moving average (MA) model of order Q with seasonal period s, denoted
by MA(Q)s , is
Yt = et − Θ1 et−s − Θ2 et−2s − · · · − ΘQ et−Qs .
A nonzero mean µ could be added for flexibility (as with nonseasonal models), but we
take µ = 0 for simplicity.
MA(1)12 : When Q = 1 and s = 12, we have
Yt = et − Θet−12 .
CALCULATIONS : For an MA(1)12 process, note that
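\[ \mu = E(Y_t) = E(e_t - \Theta e_{t-12}) = E(e_t) - \Theta E(e_{t-12}) = 0. \]
The process variance is
\begin{align*}
\gamma_0 = \mathrm{var}(Y_t) &= \mathrm{var}(e_t - \Theta e_{t-12}) \\
&= \mathrm{var}(e_t) + \Theta^2 \mathrm{var}(e_{t-12}) - 2\Theta \underbrace{\mathrm{cov}(e_t, e_{t-12})}_{=\,0} \\
&= \sigma_e^2 + \Theta^2 \sigma_e^2 = \sigma_e^2 (1 + \Theta^2).
\end{align*}
The lag 1 autocovariance is
\[ \gamma_1 = \mathrm{cov}(Y_t, Y_{t-1}) = \mathrm{cov}(e_t - \Theta e_{t-12}, e_{t-1} - \Theta e_{t-13}) = 0, \]
because no white noise subscripts match. In fact, it is easy to see that $\gamma_k = 0$ for all $k \geq 1$,
except when $k = s = 12$. Note that
\begin{align*}
\gamma_{12} = \mathrm{cov}(Y_t, Y_{t-12}) &= \mathrm{cov}(e_t - \Theta e_{t-12}, e_{t-12} - \Theta e_{t-24}) \\
&= -\Theta\, \mathrm{var}(e_{t-12}) = -\Theta \sigma_e^2.
\end{align*}
Therefore, the autocovariance function for an MA(1)$_{12}$ process is
\[ \gamma_k = \begin{cases} \sigma_e^2 (1 + \Theta^2), & k = 0 \\ -\Theta \sigma_e^2, & k = 12 \\ 0, & \text{otherwise.} \end{cases} \]
Because $E(Y_t) = 0$ and $\gamma_k$ are both free of $t$, an MA(1)$_{12}$ process is stationary. The
autocorrelation function (ACF) for an MA(1)$_{12}$ process is
\[ \rho_k = \frac{\gamma_k}{\gamma_0} = \begin{cases} 1, & k = 0 \\ -\dfrac{\Theta}{1 + \Theta^2}, & k = 12 \\ 0, & \text{otherwise.} \end{cases} \]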
NOTE : The form of the MA(1)12 ACF is identical to the form of the nonseasonal MA(1)
ACF from Chapter 4. For the MA(1)12 , the only nonzero autocorrelation occurs at the
first seasonal lag k = 12, as opposed to at k = 1 in the nonseasonal MA(1).
[Plot omitted. Vertical axis: Yt. Horizontal axis: Time.]
Figure 10.3: MA(1)12 simulation with Θ = −0.9, n = 200, and σe2 = 1.
NOTE : A seasonal MA(1)12 process is mathematically equivalent to a nonseasonal

MA(12) process with
θ1 = θ2 = · · · = θ11 = 0
and θ12 = Θ. Because of this equivalence (which occurs here and with other seasonal
models), we can use our already-established methods to specify, fit, diagnose, and forecast
seasonal models.
Example 10.2. We use R to simulate one realization of an MA(1)12 process with
Θ = −0.9, that is,
Yt = et + 0.9et−12 ,
where et ∼ iid N (0, 1) and n = 200. This realization is displayed in Figure 10.3. In
Figure 10.4, we display the population (theoretical) ACF and PACF for this MA(1)12
process and the sample versions that correspond to the simulation in Figure 10.3.
Figure 10.4: MA(1)12 with Θ = −0.9. Upper left: Population ACF. Upper right:
Population PACF. Lower left (right): Sample ACF (PACF) using data in Figure 10.3.
• The population ACF and PACF (Figure 10.4; top) display the same patterns as
the nonseasonal MA(1), except that now these patterns occur at seasonal lags.
– The population ACF displays nonzero autocorrelation only at the first (sea-
sonal) lag k = 12. In other words, observations 12 units apart in time are
correlated, whereas all other observations are not.
– The population PACF shows a decay across seasonal lags k = 12, 24, 36, ....
• The sample ACF and PACF reveal these same patterns overall. The margin of error
bounds in the sample ACF/PACF are for white noise, not for an MA(1)12 process.
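A realization like the one in Figure 10.3 can be generated along the following lines. This is a sketch only: it assumes base R's arima.sim(), acf(), and pacf(); the seed and object names are arbitrary choices made here, so the simulated series will not reproduce the figure exactly.

```r
# Simulate an MA(1)_12 process with Theta = -0.9, i.e.,
# Y_t = e_t + 0.9 e_{t-12}, as an MA(12) with zeros at lags 1-11.
# (R's MA sign convention already uses plus signs, so we pass +0.9.)
set.seed(520)  # arbitrary seed, for reproducibility only
ma1.12.sim <- arima.sim(model = list(ma = c(rep(0, 11), 0.9)), n = 200)

# Sample ACF and PACF out to lag 50. Dropping plot = FALSE draws the
# usual plots, with bounds that are appropriate for white noise.
r   <- acf(ma1.12.sim, lag.max = 50, plot = FALSE)$acf    # lags 0, ..., 50
phi <- pacf(ma1.12.sim, lag.max = 50, plot = FALSE)$acf   # lags 1, ..., 50

# Sample autocorrelation at the first seasonal lag (position 13,
# because lag 0 occupies position 1)
r[13]
```

The population value at lag 12 is −Θ/(1 + Θ²) ≈ 0.497, so the sample value should be in that neighborhood, subject to sampling variability.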
MA(2)12 : A seasonal MA model of order Q = 2 with seasonal lag s = 12 is
Yt = et − Θ1 et−12 − Θ2 et−24 .
For an MA(2)12 process, it is easy to show that E(Yt ) = 0 and
$$\gamma_k = \begin{cases}
\sigma_e^2 (1 + \Theta_1^2 + \Theta_2^2), & k = 0 \\
(-\Theta_1 + \Theta_1 \Theta_2)\sigma_e^2, & k = 12 \\
-\Theta_2 \sigma_e^2, & k = 24 \\
0, & \text{otherwise.}
\end{cases}$$
Therefore, an MA(2)12 process is stationary. The autocorrelation function (ACF)
for an MA(2)12 process is
$$\rho_k = \frac{\gamma_k}{\gamma_0} = \begin{cases}
1, & k = 0 \\
\dfrac{-\Theta_1 + \Theta_1 \Theta_2}{1 + \Theta_1^2 + \Theta_2^2}, & k = 12 \\
\dfrac{-\Theta_2}{1 + \Theta_1^2 + \Theta_2^2}, & k = 24 \\
0, & \text{otherwise.}
\end{cases}$$
NOTE : The ACF for an MA(2)12 process has the same form as the ACF for a nonseasonal
MA(2). The only difference is that nonzero autocorrelations occur at the first two
seasonal lags k = 12 and k = 24, as opposed to at k = 1 and k = 2 in the nonseasonal
MA(2).
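For completeness, the MA(2)12 autocovariances at lags 12 and 24 follow from matching white noise subscripts, just as in the MA(1)12 calculation. A short derivation, using only the model definition and the fact that the white noise terms are uncorrelated:

```latex
\begin{aligned}
\gamma_{12} &= \mathrm{cov}(Y_t, Y_{t-12}) \\
&= \mathrm{cov}(e_t - \Theta_1 e_{t-12} - \Theta_2 e_{t-24},\;
                e_{t-12} - \Theta_1 e_{t-24} - \Theta_2 e_{t-36}) \\
&= -\Theta_1 \mathrm{var}(e_{t-12}) + \Theta_1 \Theta_2 \mathrm{var}(e_{t-24})
 = (-\Theta_1 + \Theta_1 \Theta_2)\sigma_e^2, \\[4pt]
\gamma_{24} &= \mathrm{cov}(e_t - \Theta_1 e_{t-12} - \Theta_2 e_{t-24},\;
                e_{t-24} - \Theta_1 e_{t-36} - \Theta_2 e_{t-48}) \\
&= -\Theta_2 \mathrm{var}(e_{t-24}) = -\Theta_2 \sigma_e^2.
\end{aligned}
```

At any lag other than 0, 12, and 24, no subscripts match and the autocovariance is zero.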
NOTE : A seasonal MA(2)12 process is mathematically equivalent to a nonseasonal
MA(24) process with
θ1 = θ2 = · · · = θ11 = θ13 = θ14 = · · · = θ23 = 0,
θ12 = Θ1 , and θ24 = Θ2 . This again reveals that we can use our already-established
methods to specify, fit, diagnose, and forecast seasonal models.
BACKSHIFT NOTATION : In general, a seasonal MA(Q)s process
Yt = et − Θ1 et−s − Θ2 et−2s − · · · − ΘQ et−Qs
can be expressed using backshift notation as
$$\begin{aligned}
Y_t &= e_t - \Theta_1 B^s e_t - \Theta_2 B^{2s} e_t - \cdots - \Theta_Q B^{Qs} e_t \\
&= (1 - \Theta_1 B^s - \Theta_2 B^{2s} - \cdots - \Theta_Q B^{Qs}) e_t \equiv \Theta_Q(B^s) e_t,
\end{aligned}$$
where $\Theta_Q(B^s) = 1 - \Theta_1 B^s - \Theta_2 B^{2s} - \cdots - \Theta_Q B^{Qs}$ is called the
seasonal MA characteristic operator. The operator $\Theta_Q(B^s)$ can be viewed as a
polynomial (in B) of degree Qs.
• As with nonseasonal processes, a seasonal MA(Q)s process is invertible if and only
if each of the Qs roots of $\Theta_Q(B^s)$ exceeds 1 in absolute value (or modulus).
• All seasonal MA(Q)s processes are stationary.
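The root condition for invertibility can be checked numerically. The sketch below assumes base R's polyroot(), which takes polynomial coefficients in increasing order of power, and uses the MA(1)12 model with Θ = −0.9 from Example 10.2; the object names are illustrative.

```r
# Roots of the seasonal MA characteristic polynomial 1 - Theta * B^12,
# viewed as a degree-12 polynomial in B.
Theta <- -0.9
roots <- polyroot(c(1, rep(0, 11), -Theta))

# Each of the 12 roots has modulus |Theta|^(-1/12).
Mod(roots)
abs(Theta)^(-1/12)

# Invertible if and only if every root lies outside the unit circle.
invertible <- all(Mod(roots) > 1)
invertible
```

Because every root has the same modulus, the check reduces to |Θ| < 1 for an MA(1)12 process, in line with the nonseasonal MA(1) condition.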
10.2.2 AR(P )s
TERMINOLOGY : Suppose {et } is a zero mean white noise process with var(et ) = σe2 . A
seasonal autoregressive (AR) model of order P with seasonal period s, denoted
by AR(P )s , is
Yt = Φ1 Yt−s + Φ2 Yt−2s + · · · + ΦP Yt−P s + et .
AR(1)12 : When P = 1 and s = 12, we have
Yt = ΦYt−12 + et .
• Similar to a nonseasonal AR(1) process, a seasonal AR(1)12 process is stationary
if and only if −1 < Φ < 1. An AR(1)12 process is automatically invertible.
• For an AR(1)12 process,
$$E(Y_t) = 0 \quad \text{and} \quad \gamma_0 = \mathrm{var}(Y_t) = \frac{\sigma_e^2}{1 - \Phi^2}.$$
Figure 10.5: AR(1)12 simulation with Φ = 0.9, n = 200, and $\sigma_e^2 = 1$.
• The AR(1)12 autocorrelation function (ACF) is given by
$$\rho_k = \begin{cases}
\Phi^{k/12}, & k = 0, 12, 24, 36, \ldots \\
0, & \text{otherwise.}
\end{cases}$$
• That is, ρ0 = 1, ρ12 = Φ, ρ24 = Φ2 , ρ36 = Φ3 , and so on, similar to the nonseasonal
AR(1). The ACF ρk = 0 at all lags k that are not multiples of s = 12.
• A seasonal AR(1)12 process is mathematically equivalent to a nonseasonal AR(12)
process with
ϕ1 = ϕ2 = · · · = ϕ11 = 0
and ϕ12 = Φ.
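The AR(1)12 ACF and the AR(12) equivalence can be illustrated together. This is a minimal sketch assuming base R's ARMAacf(), with Φ = 0.9 chosen to match Figure 10.5; the object names are illustrative.

```r
# Population ACF and PACF of an AR(1)_12 process with Phi = 0.9,
# treated as an AR(12) with zero coefficients at lags 1-11.
Phi <- 0.9
ar.coef <- c(rep(0, 11), Phi)
rho  <- ARMAacf(ar = ar.coef, lag.max = 36)               # lags 0, 1, ..., 36
phkk <- ARMAacf(ar = ar.coef, lag.max = 36, pacf = TRUE)  # lags 1, ..., 36

# ACF at the seasonal lags 12, 24, 36 versus Phi^(k/12)
cbind(ARMAacf = unname(rho[c(13, 25, 37)]), formula = Phi^(1:3))

# PACF at lags 1, 12, and 24: only the lag 12 value should be nonzero
round(unname(phkk[c(1, 12, 24)]), 6)
```

The ACF decays geometrically across the seasonal lags and vanishes elsewhere, while the PACF is nonzero only at lag 12.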
Example 10.3. We use R to simulate one realization of an AR(1)12 process with Φ = 0.9,
that is,
Yt = 0.9Yt−12 + et ,
where et ∼ iid N (0, 1) and n = 200. This realization is displayed in Figure 10.5. In
Figure 10.6, we display the population (theoretical) ACF and PACF for this AR(1)12
process and the sample versions that correspond to the simulation in Figure 10.5.
Figure 10.6: AR(1)12 with Φ = 0.9. Upper left: Population ACF. Upper right:
Population PACF. Lower left (right): Sample ACF (PACF) using data in Figure 10.5.
• The population ACF and PACF (Figure 10.6; top) display the same patterns as
the nonseasonal AR(1), except that now these patterns occur at seasonal lags.
– The population ACF displays a slow decay across the seasonal lags k =
12, 24, 36, 48, .... In other words, observations that are 12, 24, 36, 48, etc.
units apart in time are correlated, whereas all other observations are not.
– The population PACF is nonzero at the first seasonal lag k = 12. The PACF
is zero at all other lags. This is analogous to the PACF for an AR(1) being
nonzero when k = 1 and zero elsewhere.
• The sample ACF and PACF reveal these same patterns overall. The margin of error
bounds in the sample ACF/PACF are for white noise, not for an AR(1)12 process.
AR(2)12 : A seasonal AR model of order P = 2 with seasonal lag s = 12; i.e., AR(2)12 , is
Yt = Φ1 Yt−12 + Φ2 Yt−24 + et .
• A seasonal AR(2)12 behaves like the nonseasonal AR(2) at the seasonal lags.
– In particular, the ACF ρk displays exponential decay or damped sinusoidal
patterns across the seasonal lags k = 12, 24, 36, 48, ....
– The PACF ϕkk is nonzero at lags k = 12 and k = 24; it is zero at all other
lags.
• A seasonal AR(2)12 process is mathematically equivalent to a nonseasonal AR(24)
process with
ϕ1 = ϕ2 = · · · = ϕ11 = ϕ13 = ϕ14 = · · · = ϕ23 = 0,
ϕ12 = Φ1 , and ϕ24 = Φ2 .
BACKSHIFT NOTATION : In general, a seasonal AR(P )s process
Yt = Φ1 Yt−s + Φ2 Yt−2s + · · · + ΦP Yt−P s + et
can be expressed as
Yt − Φ1 Yt−s − Φ2 Yt−2s − · · · − ΦP Yt−P s = et
⇐⇒ (1 − Φ1 B s − Φ2 B 2s − · · · − ΦP B P s )Yt = et ⇐⇒ ΦP (B s )Yt = et ,
where $\Phi_P(B^s) = 1 - \Phi_1 B^s - \Phi_2 B^{2s} - \cdots - \Phi_P B^{Ps}$ is the seasonal AR
characteristic operator. The operator $\Phi_P(B^s)$ can be viewed as a polynomial (in B)
of degree Ps.
• As with nonseasonal processes, a seasonal AR(P )s process is stationary if and
only if each of the Ps roots of $\Phi_P(B^s)$ exceeds 1 in absolute value (or modulus).
• All seasonal AR(P )s processes are invertible.
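To see how the root condition reduces to −1 < Φ < 1 in the AR(1)12 case, here is a brief worked step that uses only the definition of the seasonal AR characteristic operator (assuming Φ ≠ 0):

```latex
\Phi_1(B^{12}) = 1 - \Phi B^{12} = 0
\iff B^{12} = \frac{1}{\Phi}
\implies |B| = |\Phi|^{-1/12} \quad \text{for each of the 12 roots.}
```

All 12 roots share the same modulus, which exceeds 1 exactly when |Φ| < 1.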
10.2.3 ARMA(P, Q)s
A seasonal autoregressive moving average (ARMA) model of orders P and Q
with seasonal period s, denoted by ARMA(P, Q)s , is
Yt = Φ1 Yt−s + Φ2 Yt−2s + · · · + ΦP Yt−P s + et − Θ1 et−s − Θ2 et−2s − · · · − ΘQ et−Qs .
• An ARMA(P, Q)s process is the seasonal analogue of the nonseasonal ARMA(p, q)
process with nonzero autocorrelations at lags k = s, 2s, 3s, ....
• Using backshift notation, this model can be expressed as
ΦP (B s )Yt = ΘQ (B s )et ,
where the seasonal AR and MA characteristic operators are
ΦP (B s ) = 1 − Φ1 B s − Φ2 B 2s − · · · − ΦP B P s
ΘQ (B s ) = 1 − Θ1 B s − Θ2 B 2s − · · · − ΘQ B Qs .
• Analogous to a nonseasonal ARMA(p, q) process,
– the ARMA(P, Q)s process is stationary if and only if the roots of ΦP (B s )
each exceed 1 in absolute value (or modulus)
– the ARMA(P, Q)s process is invertible if and only if the roots of ΘQ (B s )
each exceed 1 in absolute value (or modulus).
• A seasonal ARMA(P, Q)s process is mathematically equivalent to a nonseasonal
ARMA(P s, Qs) process with
ϕs = Φ1 , ϕ2s = Φ2 , ..., ϕP s = ΦP , θs = Θ1 , θ2s = Θ2 , ..., θQs = ΘQ ,
and all other ϕ and θ parameters set equal to 0.
• The following table succinctly summarizes the behavior of the population ACF and
PACF for seasonal ARMA(P, Q)s processes:
        AR(P)s                   MA(Q)s                   ARMA(P, Q)s
ACF     Tails off at lags ks,    Cuts off after           Tails off at lags ks,
        k = 1, 2, ...            lag Qs                   k = 1, 2, ...
PACF    Cuts off after           Tails off at lags ks,    Tails off at lags ks,
        lag Ps                   k = 1, 2, ...            k = 1, 2, ...
SUMMARY : We have broadened the class of stationary ARMA(p, q) models to incorporate
the same type of ARMA(p, q) behavior at seasonal lags, the so-called seasonal
ARMA(P, Q)s class of models.
• In many ways, this “extension” is not that much of an extension, because the
seasonal ARMA(P, Q)s model is essentially an ARMA(p, q) model restricted to the
seasonal lags k = s, 2s, 3s, ....
• That is, an ARMA(P, Q)s model, which incorporates autocorrelation at seasonal
lags and nowhere else, is likely limited in application for stationary processes.
• However, if we combine these new seasonal ARMA(P, Q)s models with our tradi-
tional nonseasonal ARMA(p, q) models, we create a larger class of models applicable
for use with stationary processes that exhibit seasonality.
• We now examine this new class of models, the so-called multiplicative seasonal
ARMA class.
10.3 Multiplicative seasonal (stationary) ARMA models
MA(1) × MA(1)12 : Suppose {et } is a zero mean white noise process with var(et ) = σe2 .
Consider the nonseasonal MA(1) model
Yt = et − θet−1 ⇐⇒ Yt = (1 − θB)et
and the seasonal MA(1)12 model
Yt = et − Θet−12 ⇐⇒ Yt = (1 − ΘB 12 )et .
• The defining characteristic of the nonseasonal MA(1) process is that the only
nonzero autocorrelation occurs at lag k = 1.
• The defining characteristic of the seasonal MA(1)12 process is that the only nonzero
autocorrelation occurs at lag k = 12.
COMBINING THE MODELS : Consider taking the nonseasonal MA characteristic operator
θ(B) = 1 − θB and the seasonal one Θ(B 12 ) = 1 − ΘB 12 and multiplying them
together to get the new model
Yt = (1 − θB)(1 − ΘB 12 )et
= (1 − θB − ΘB 12 + θΘB 13 )et ,
or, equivalently,
Yt = et − θet−1 − Θet−12 + θΘet−13 .
We call this a multiplicative seasonal MA(1) × MA(1)12 model. The term “multi-
plicative” arises because the MA characteristic operator 1 − θB − ΘB 12 + θΘB 13 is the
product of (1 − θB) and (1 − ΘB 12 ). An MA(1) × MA(1)12 process has E(Yt ) = 0 and
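$$\rho_1 = -\frac{\theta}{1 + \theta^2}, \qquad \rho_{11} = \frac{\theta\Theta}{(1 + \theta^2)(1 + \Theta^2)},$$
$$\rho_{12} = -\frac{\Theta}{1 + \Theta^2}, \qquad \rho_{13} = \frac{\theta\Theta}{(1 + \theta^2)(1 + \Theta^2)}.$$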
• The MA(1)×MA(1)12 process has nonzero autocorrelation at lags k = 1 and k = 12
from the nonseasonal and seasonal MA models individually.
• It has additional nonzero autocorrelation at lags k = 11 and k = 13 which arises
from the multiplicative effect of the two models.
• The MA(1) × MA(1)12 process
Yt = et − θet−1 − Θet−12 + θΘet−13
is mathematically equivalent to a nonseasonal MA(13) process with parameters
θ1 = θ, θ2 = θ3 = · · · = θ11 = 0, θ12 = Θ, and θ13 = −θΘ.
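The four nonzero autocorrelations of the MA(1) × MA(1)12 process can be verified through its MA(13) representation. The sketch below assumes base R's ARMAacf() and uses θ = 0.5 and Θ = 0.9 (the values in Figure 10.7); the object names are illustrative.

```r
# MA(1) x MA(1)_12 written as a nonseasonal MA(13):
# Y_t = e_t - theta e_{t-1} - Theta e_{t-12} + theta*Theta e_{t-13}.
# R uses plus signs for MA coefficients, so each coefficient is entered
# with the sign it carries in the line above.
theta <- 0.5
Theta <- 0.9
ma.coef <- c(-theta, rep(0, 10), -Theta, theta * Theta)
rho <- ARMAacf(ma = ma.coef, lag.max = 20)   # rho[k + 1] is lag k

# Closed-form values from the text
rho1  <- -theta / (1 + theta^2)
rho12 <- -Theta / (1 + Theta^2)
rho11 <- theta * Theta / ((1 + theta^2) * (1 + Theta^2))   # also rho13

cbind(lag     = c(1, 11, 12, 13),
      ARMAacf = unname(rho[c(2, 12, 13, 14)]),
      formula = c(rho1, rho11, rho12, rho11))
```

The ARMAacf and formula columns should agree, and the autocorrelation at every other nonzero lag should be zero up to rounding error.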
MA(1) × AR(1)12 : Suppose {et } is a zero mean white noise process with var(et ) = σe2 .
Consider the two models
Yt = et − θet−1 ⇐⇒ Yt = (1 − θB)et
and
Yt = ΦYt−12 + et ⇐⇒ (1 − ΦB 12 )Yt = et ,
a nonseasonal MA(1) and a seasonal AR(1)12 , respectively.
• The defining characteristic of the nonseasonal MA(1) is that the only nonzero
autocorrelation occurs at lag k = 1.
• The defining characteristic of the seasonal AR(1)12 is that the autocorrelation de-
cays across seasonal lags k = 12, 24, 36, ....
COMBINING THE MODELS : Consider combining these two models to form
(1 − ΦB 12 )Yt = (1 − θB)et ,
or, equivalently,
Yt = ΦYt−12 + et − θet−1 .
We call this a multiplicative seasonal MA(1) × AR(1)12 process. By combining a
nonseasonal MA(1) with a seasonal AR(1)12 , we create a new process which possesses
the following:
• AR-type autocorrelation at seasonal lags k = 12, 24, 36, ...
• additional MA-type autocorrelation at lag k = 1 and at lags one unit in time from
the seasonal lags, that is, at k = 11 and k = 13, k = 23 and k = 25, and so on.
• The MA(1) × AR(1)12 process
Yt = ΦYt−12 + et − θet−1 ,
is mathematically equivalent to a nonseasonal ARMA(12,1) process with parameters
θ1 = θ, ϕ1 = ϕ2 = · · · = ϕ11 = 0, and ϕ12 = Φ.
Figure 10.7: Top: Population ACF/PACF for MA(1)×MA(1)12 process with θ = 0.5 and
Θ = 0.9. Bottom: Population ACF/PACF for MA(1) × AR(1)12 process with θ = 0.5
and Φ = 0.9.
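The autocorrelation pattern of the MA(1) × AR(1)12 process can be computed from its ARMA(12,1) representation. This is a sketch assuming base R's ARMAacf(), with θ = 0.5 and Φ = 0.9 as in Figure 10.7; the object names are illustrative.

```r
# MA(1) x AR(1)_12 written as a nonseasonal ARMA(12, 1).
# R's MA sign convention uses plus signs, so the MA coefficient
# is entered as -theta.
theta <- 0.5
Phi   <- 0.9
rho <- ARMAacf(ar = c(rep(0, 11), Phi), ma = -theta, lag.max = 50)

# Autocorrelations at lag 1 and around the first two seasonal lags
# (rho[k + 1] is the lag-k autocorrelation)
lags <- c(1, 11, 12, 13, 23, 24, 25)
round(cbind(lag = lags, rho = unname(rho[lags + 1])), 4)
```

The output shows the AR-type decay at lags 12 and 24 together with the MA-type spikes at lag 1 and at the lags adjacent to each seasonal lag.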
In general, we can combine a nonseasonal ARMA(p, q) process
ϕ(B)Yt = θ(B)et
with a seasonal ARMA(P, Q)s process
ΦP (B s )Yt = ΘQ (B s )et
to create the model
$$\phi(B)\Phi_P(B^s) Y_t = \theta(B)\Theta_Q(B^s) e_t.$$
We call this a multiplicative seasonal (stationary) ARMA(p, q) × ARMA(P, Q)s
model with seasonal period s.
• This is a very flexible family of models for stationary seasonal processes.
– The MA(1)×MA(1)12 and MA(1)×AR(1)12 processes (that we have discussed
explicitly) are special cases.
• An ARMA(p, q) × ARMA(P, Q)s process is mathematically equivalent to a nonseasonal
ARMA process with AR characteristic operator ϕ∗ (B) = ϕ(B)ΦP (B s ) and
MA characteristic operator θ∗ (B) = θ(B)ΘQ (B s ).
– Stationarity and invertibility conditions can be characterized in terms of the
roots of ϕ∗ (B) and θ∗ (B), respectively.
• Because of this equivalence, we can use our already-established methods to specify,
fit, diagnose, and forecast seasonal stationary models.
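The product operators ϕ∗(B) and θ∗(B) can be formed by ordinary polynomial multiplication. The following sketch uses base R only; poly.mult is a helper name introduced here for illustration (it is not part of R or of the notes), and the example uses the MA side of the MA(1) × MA(1)12 model with θ = 0.5 and Θ = 0.9.

```r
# Multiply two polynomials in B, each given by its coefficients in
# increasing order of power (constant term first).
# poly.mult is a hypothetical helper defined for this illustration.
poly.mult <- function(a, b) {
  out <- numeric(length(a) + length(b) - 1)
  for (i in seq_along(a)) {
    idx <- i + seq_along(b) - 1
    out[idx] <- out[idx] + a[i] * b
  }
  out
}

# theta*(B) = (1 - theta B)(1 - Theta B^12)
theta <- 0.5
Theta <- 0.9
theta.star <- poly.mult(c(1, -theta), c(1, rep(0, 11), -Theta))
theta.star   # coefficients of B^0, B^1, ..., B^13

# Invertibility: all roots of theta*(B) must exceed 1 in modulus
all(Mod(polyroot(theta.star)) > 1)
```

The same helper applies to the AR side; stationarity is then checked from the roots of ϕ∗(B) in the same way.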
Example 10.4. Data file: boardings (TSA). Figure 10.8 displays the number of public
transit boardings (mostly for bus and light rail) in Denver, Colorado from 8/2000 to
3/2006. The data have been log-transformed.
• From the plot, the boarding process appears to be relatively stationary in the mean
level; that is, there are no pronounced shifts in mean level over time.
Figure 10.8: Denver public transit data. Monthly number of public transit boardings
(log-transformed) in Denver from 8/2000 to 3/2006. Monthly plotting symbols have
been added.
• Therefore, a member of the stationary ARMA(p, q) × ARMA(P, Q)s family of models
may be reasonable for these data. The seasonal lag is s = 12 (the data are
monthly).
• In Figure 10.9, we display the sample ACF and PACF for the boardings data. Note
that the margin of error bounds in the plot are for a white noise process.
– The sample ACF shows a pronounced sample autocorrelation at lag k = 12
and a decay afterward at seasonal lags k = 24 and k = 36.
– The sample PACF shows a pronounced sample partial autocorrelation at lag
k = 12 and none at higher seasonal lags.
– These two observations together suggest a seasonal AR(1)12 component.
– Around the seasonal lags k = 12, k = 24, and k = 36 (in the ACF), there are
noticeable autocorrelations 3 time units in both directions. This suggests a
nonseasonal MA(3) component.
Figure 10.9: Denver public transit data. Left: Sample ACF. Right: Sample PACF.
• We therefore specify an ARMA(0, 3) × ARMA(1, 0)12 model for these data. Of
course, this model is tentative at this point and is subject to further investigation
and scrutiny.
MODEL FITTING: We use R to fit an ARMA(0, 3) × ARMA(1, 0)12 model using
maximum likelihood. Here is the output:
> boardings.arma03.arma10 = arima(boardings,order=c(0,0,3),method='ML',
seasonal=list(order=c(1,0,0),period=12))
> boardings.arma03.arma10
Coefficients:
ma1 ma2 ma3 sar1 intercept
0.7288 0.6115 0.2951 0.8777 12.5455
s.e. 0.1186 0.1172 0.1118 0.0507 0.0354
sigma^2 estimated as 0.0006542: log likelihood = 143.54, aic = -277.09
Figure 10.10: Denver public transit data. Standardized residuals from ARMA(0, 3) ×
ARMA(1, 0)12 model fit.
Note that each of the parameter estimates is statistically different from zero. The fitted
ARMA(0, 3) × ARMA(1, 0)12 model (on the log scale) is
$$(1 - 0.8777B^{12})(Y_t - 12.5455) = (1 + 0.7288B + 0.6115B^2 + 0.2951B^3) e_t,$$
or equivalently,
$$Y_t = 1.5343 + 0.8777Y_{t-12} + e_t + 0.7288e_{t-1} + 0.6115e_{t-2} + 0.2951e_{t-3}.$$
The white noise variance estimate is $\hat{\sigma}_e^2 \approx 0.0006542$.
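The significance claim and the constant term in the second form of the fitted model can both be checked from the reported output. This is a rough sketch, assuming the usual large-sample normal approximation for maximum likelihood estimates; the numbers are copied from the arima() output above, and the object names are illustrative.

```r
# Estimates and standard errors copied from the arima() output
est <- c(ma1 = 0.7288, ma2 = 0.6115, ma3 = 0.2951, sar1 = 0.8777)
se  <- c(ma1 = 0.1186, ma2 = 0.1172, ma3 = 0.1118, sar1 = 0.0507)

# Approximate z statistics; |z| > 1.96 indicates significance at the
# 5 percent level under the normal approximation.
z <- est / se
round(z, 2)

# Constant term in the "or equivalently" form: (1 - Phi.hat) * mu.hat
mu.hat <- 12.5455
const  <- (1 - est["sar1"]) * mu.hat
round(unname(const), 4)
```

The constant arises because (1 − 0.8777B¹²) applied to the mean 12.5455 leaves (1 − 0.8777)(12.5455).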
MODEL DIAGNOSTICS : The histogram and qq plot of the standardized residuals in
Figure 10.10 generally support the normality assumption. In addition, when further
examining the standardized residuals,
• the Shapiro-Wilk test does not reject normality (p-value = 0.6187)
• the runs test does not reject independence (p-value = 0.385).
Finally, the tsdiag output in Figure 10.11 shows no notable problems with the ARMA(0, 3)×
ARMA(1, 0)12 model.
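The same diagnostic workflow can be sketched in base R on simulated data. Several assumptions are made here and should be read as such: the series is simulated (the boardings data require the TSA package), the parameter values and seed are arbitrary, and the Ljung-Box test from Box.test() is swapped in for the runs test because a runs test is not part of base R.

```r
# Sketch on SIMULATED data: generate an MA(1) x AR(1)_12 series, fit
# the same kind of model by maximum likelihood, and examine residuals.
set.seed(520)  # arbitrary seed
y <- arima.sim(model = list(ar = c(rep(0, 11), 0.5), ma = 0.6), n = 240)

fit <- arima(y, order = c(0, 0, 1), method = "ML",
             seasonal = list(order = c(1, 0, 0), period = 12))

# Standardized residuals: residuals divided by the estimated
# white noise standard deviation
std.res <- residuals(fit) / sqrt(fit$sigma2)

# Normality: Shapiro-Wilk test
sw <- shapiro.test(std.res)

# Independence: Ljung-Box test at lag 20; fitdf is the number of
# estimated ARMA coefficients (here 2: ma1 and sar1)
lb <- Box.test(std.res, lag = 20, type = "Ljung-Box", fitdf = 2)

c(shapiro.p = sw$p.value, ljung.box.p = lb$p.value)
```

Large p-values in both tests would be consistent with normal, uncorrelated residuals; calling tsdiag(fit) would additionally produce plots of the kind shown in Figure 10.11.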
Figure 10.11: Denver public transit data. ARMA(0, 3) × ARMA(1, 0)12 tsdiag output.
OVERFITTING: For an ARMA(0, 3)×ARMA(1, 0)12 model, there are 4 overfitted models.
Here are the models and the results from overfitting the boardings data:
ARMA(1, 3) × ARMA(1, 0)12 =⇒ $\hat{\phi}$ significant
ARMA(0, 4) × ARMA(1, 0)12 =⇒ $\hat{\theta}_4$ not significant
ARMA(0, 3) × ARMA(2, 0)12 =⇒ $\hat{\Phi}_2$ not significant
ARMA(0, 3) × ARMA(1, 1)12 =⇒ $\hat{\Theta}$ not significant
The ARMA(1, 3) × ARMA(1, 0)12 fit declares a nonseasonal AR component at lag k = 1
to be significant, but the MA estimates at lags k = 1, 2, and 3 (which were all highly
significant in the original fit) become insignificant in this overfitted model! Therefore,
the ARMA(1, 3) × ARMA(1, 0)12 overfitted model is not considered further.
Figure 10.12: Denver public transit data. The full data set is from 8/2000-3/2006. This
figure starts the series at 1/2003. ARMA(0, 3)×ARMA(1, 0)12 estimated MMSE forecasts
and 95 percent prediction limits are given for lead times l = 1, 2, ..., 12. These lead times
correspond to months 4/2006-3/2007.
FORECASTING: The estimated forecasts and standard errors (on the log scale) are
given for lead times l = 1, 2, ..., 12 in the predict output below:
> boardings.arma03.arma10.predict <- predict(boardings.arma03.arma10,n.ahead=12)
> round(boardings.arma03.arma10.predict$pred,3)
2006 12.613 12.588 12.531 12.520 12.575 12.679 12.650 12.628 12.529
2007 12.594 12.619 12.606
> round(boardings.arma03.arma10.predict$se,3)
2006 0.026 0.032 0.035 0.036 0.036 0.036 0.036 0.036 0.036
2007 0.036 0.036 0.036
• In Figure 10.12, we display the Denver boardings data. The full data set is from
8/00 to 3/06 (one observation per month). However, to emphasize the MMSE
forecasts in the plot, we start the series at month 1/03.
• With l = 1, 2, ..., 12, the estimated MMSE forecasts in the predict output and
in Figure 10.12 start at 4/06 and end in 3/07. It is important to remember that
these forecasts are on the log scale. MMSE forecasts on the original scale and 95
percent prediction intervals are given below.
> # MMSE forecasts back-transformed (to original scale)
> denver.boardings.predict <- round(exp(boardings.arma03.arma10.predict$pred +
(1/2)*(boardings.arma03.arma10.predict$se)^2),3)
> denver.boardings.predict
          Jan      Feb      Mar      Apr      May      Jun      Jul      Aug      Sep
2006                            300411.9 293085.1 276937.5 273911.8 289321.6 321123.3
2007 294962.7 302347.3 298397.8
          Oct      Nov      Dec
2006 312037.2 305125.6 276521.7
2007
> # Compute prediction intervals (on original scale)
> data.frame(Month=year.temp,lower.pi=exp(lower.pi),upper.pi=exp(upper.pi))
Month lower.pi upper.pi
1 2006.250 285630.0 315752.2
2 2006.333 275318.8 311685.4
3 2006.416 258262.2 296593.3
4 2006.500 255034.1 293803.7
5 2006.583 269381.8 310332.5
6 2006.666 298991.8 344443.8
7 2006.750 290531.9 334697.8
8 2006.833 284096.7 327284.3
9 2006.916 257464.1 296603.1
10 2007.000 274634.1 316383.3
11 2007.083 281509.8 324304.2
12 2007.166 277832.5 320067.9
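The back-transformation can be reproduced approximately from the rounded log-scale output. This sketch assumes normally distributed forecast errors on the log scale, so that the MMSE forecast on the original scale is a lognormal mean; because the inputs are rounded to three decimals, the results will differ slightly from the values printed above. The object names are illustrative.

```r
# Log-scale forecasts and standard errors for the first three lead
# times (4/2006-6/2006), copied (rounded) from the predict output
pred <- c(12.613, 12.588, 12.531)
se   <- c(0.026, 0.032, 0.035)

# MMSE forecast on the original scale: mean of a lognormal distribution
mmse.orig <- exp(pred + se^2 / 2)

# 95 percent prediction limits: form them on the log scale, then
# exponentiate the endpoints
z <- qnorm(0.975)
lower.pi <- exp(pred - z * se)
upper.pi <- exp(pred + z * se)

round(data.frame(lead = 1:3, mmse.orig, lower.pi, upper.pi), 1)
```

Note that the back-transformed intervals are not symmetric about the forecast, which is expected after exponentiating.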
SUMMARY : The multiplicative seasonal (stationary) ARMA(p, q) × ARMA(P, Q)s family
of models
$$\phi(B)\Phi_P(B^s) Y_t = \theta(B)\Theta_Q(B^s) e_t$$
is a flexible class of time series models for stationary seasonal processes. The next step
is to extend this class of models to handle two types of nonstationarity:
• Nonseasonal nonstationarity over time (e.g., increasing linear trends)
• Seasonal nonstationarity, that is, additional changes in the seasonal mean level,
even after possibly adjusting for nonseasonal nonstationarity over time.
10.4 Nonstationary seasonal ARIMA (SARIMA) models
REVIEW : For a stochastic process {Yt }, the first differences are
∇Yt = Yt − Yt−1 = (1 − B)Yt .
This definition can be generalized to any number of differences; in general, the dth
differences are given by
∇d Yt = (1 − B)d Yt .
We know that taking d = 1 or (usually at most) d = 2 can coerce a (nonseasonal)
nonstationary process into stationarity.
EXAMPLE : Suppose that we have a stochastic process defined by
Yt = St + et ,
where {et } is zero mean white noise and where
St = St−12 + ut ,
where {ut } is zero mean white noise that is uncorrelated with {et }. That is, {St } is a
zero mean random walk with period s = 12. For this process, taking nonseasonal
differences (as we have done up until now) will not have an effect on the seasonal
nonstationarity. For example, with d = 1, we have
$$\begin{aligned}
\nabla Y_t &= \nabla S_t + \nabla e_t \\
&= \nabla S_{t-12} + \nabla u_t + \nabla e_t \\
&= S_{t-12} - S_{t-13} + u_t - u_{t-1} + e_t - e_{t-1}.
\end{aligned}$$
The first difference process {∇Yt } is still nonstationary because {St } is a random walk
across seasons; i.e., across time points t = 12k.
• That is, taking (nonseasonal) differences has only produced a more complicated
model, one which is still nonstationary across seasons.
• We therefore need to define a new differencing operator that can remove nonsta-
tionarity across seasonal lags.
TERMINOLOGY : The seasonal difference operator ∇s is defined by
∇s Yt = Yt − Yt−s = (1 − B s )Yt ,
for a seasonal period s. For example, with s = 12 and monthly data, the first seasonal
differences are
∇12 Yt = Yt − Yt−12 = (1 − B 12 )Yt ,
that is, the first differences of the January observations, the first differences of the Febru-
ary observations, and so on.
EXAMPLE : For the stochastic process defined earlier, that is,
Yt = St + et ,
where St = St−12 + ut , taking first seasonal differences yields
∇12 Yt = ∇12 St + ∇12 et
= St − St−12 + et − et−12 = ut + et − et−12 .
It can be shown that this process has the same ACF as a stationary seasonal MA(1)12 .
That is, taking first seasonal differences has coerced the {Yt } process into stationarity.
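The MA(1)12-type behavior of the seasonally differenced process can be seen in a simulation. Writing var(ut) = σu² (a symbol introduced here; the notes do not name this variance), the process ut + et − et−12 has γ0 = σu² + 2σe², γ12 = −σe², and zero autocovariance elsewhere, so ρ12 = −σe²/(σu² + 2σe²). The sketch below assumes both variances equal 1, giving ρ12 = −1/3; the seed, sample size, and object names are arbitrary.

```r
# Simulate Y_t = S_t + e_t, where S_t = S_{t-12} + u_t is a random
# walk across seasons. Both noise variances are set to 1 (an assumption
# made for illustration only).
set.seed(520)  # arbitrary seed
s <- 12
n <- 12 * 2000
u <- rnorm(n)
e <- rnorm(n)

S <- u                                   # first 12 values start the walk
for (i in (s + 1):n) S[i] <- S[i - s] + u[i]
y <- S + e

# First seasonal differences: u_t + e_t - e_{t-12}
d12 <- diff(y, lag = s)

# Sample ACF of the seasonally differenced series (r[k + 1] is lag k)
r <- acf(d12, lag.max = 24, plot = FALSE)$acf
round(c(lag1 = r[2], lag12 = r[13], lag24 = r[25]), 3)
```

With a long series, the lag 12 value should be close to −1/3 and the other lags close to zero, matching the ACF form of a seasonal MA(1)12.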
Figure 10.13: United States milk production data. Upper left: Original series {Yt }.
Upper right: First (nonseasonal) differences ∇Yt = Yt − Yt−1 . Lower left: First (seasonal)
differences ∇12 Yt = Yt − Yt−12 . Lower right: Combined first (seasonal and nonseasonal)
differences ∇∇12 Yt .
Example 10.5. Consider the monthly U.S. milk production data from Example 10.1.
Figure 10.13 displays the time series plot of the data (upper left), the first
difference process ∇Yt (upper right), the first seasonal difference process ∇12 Yt (lower
left), and the combined difference process ∇∇12 Yt (lower right). The combined difference
process ∇∇12 Yt is given by
∇∇12 Yt = (1 − B)(1 − B 12 )Yt
= (1 − B − B 12 + B 13 )Yt
= Yt − Yt−1 − Yt−12 + Yt−13 .
• The milk series (Figure 10.13; upper left) displays two trends: nonstationarity over
time and a within-year seasonal pattern. A Box-Cox analysis (results not shown)
suggests that no transformation is necessary for variance stabilization purposes.
• Taking first (nonseasonal) differences; i.e., computing ∇Yt , (Figure 10.13; upper
right) has removed the upward linear trend (as expected), but the process {∇Yt }
still displays notable seasonality.
• Taking first (seasonal) differences; i.e., computing ∇12 Yt , (Figure 10.13; lower left)
has seemingly removed the seasonality (as expected), but the process {∇12 Yt } still
displays strong momentum over time.
– The sample ACF of {∇12 Yt } (not shown) displays a slow decay, a sign of
nonstationarity over time.
• The combined first differences ∇∇12 Yt (Figure 10.13; lower right) look to resemble
a stationary process (at least in the mean level).
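The expansion of the combined difference ∇∇12 Yt given in this example can be checked directly with base R's diff(). The series below is arbitrary simulated data used only for the check; with the milk series loaded, the same two calls to diff() would produce the combined differences.

```r
# Check that differencing once at lag 12 and once at lag 1 reproduces
# Y_t - Y_{t-1} - Y_{t-12} + Y_{t-13}.
set.seed(520)  # arbitrary seed
y <- cumsum(rnorm(60))
n <- length(y)

# (1 - B)(1 - B^12) Y_t via two calls to diff()
combined <- diff(diff(y, lag = 12), lag = 1)

# Direct evaluation for t = 14, ..., n
tt <- 14:n
direct <- y[tt] - y[tt - 1] - y[tt - 12] + y[tt - 13]

all.equal(combined, direct)
length(combined)   # n - 13 values remain after differencing
```

The order of the two differencing operations does not matter, because (1 − B) and (1 − B¹²) commute.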
REMARK : From this example, it should be clear that we can now extend the multiplica-
tive seasonal (stationary) ARMA(p, q) × ARMA(P, Q)s model
ϕ(B)ΦP (B s )Yt = θ(B)ΘQ (B s )et
to incorporate the two types of nonstationarity: nonseasonal and seasonal. This leads to
the definition of our largest class of ARIMA models.
TERMINOLOGY : Suppose that {et } is zero mean white noise with var(et ) = $\sigma_e^2$. The
multiplicative seasonal autoregressive integrated moving average (SARIMA)
model with seasonal period s, denoted by ARIMA(p, d, q) × ARIMA(P, D, Q)s , is
$$\phi(B)\Phi_P(B^s)\nabla^d \nabla_s^D Y_t = \theta(B)\Theta_Q(B^s) e_t,$$
where the nonseasonal AR and MA characteristic operators are
$$\begin{aligned}
\phi(B) &= 1 - \phi_1 B - \phi_2 B^2 - \cdots - \phi_p B^p \\
\theta(B) &= 1 - \theta_1 B - \theta_2 B^2 - \cdots - \theta_q B^q,
\end{aligned}$$
the seasonal AR and MA characteristic operators are
$$\begin{aligned}
\Phi_P(B^s) &= 1 - \Phi_1 B^s - \Phi_2 B^{2s} - \cdots - \Phi_P B^{Ps} \\
\Theta_Q(B^s) &= 1 - \Theta_1 B^s - \Theta_2 B^{2s} - \cdots - \Theta_Q B^{Qs},
\end{aligned}$$
and
$$\nabla^d \nabla_s^D Y_t = (1 - B)^d (1 - B^s)^D Y_t.$$
In this model,
• d denotes the number of nonseasonal differences. Usually d = 1 or (at most)
d = 2 will provide nonseasonal stationarity (as we have seen before).
• D denotes the number of seasonal differences. Usually D = 1 will achieve
seasonal stationarity.
• For many nonstationary seasonal time series data sets (at least for the ones I have
seen), the most common choice for (d, D) is (1, 1).
NOTE : We have the following relationship:
$$Y_t \sim \mathrm{ARIMA}(p, d, q) \times \mathrm{ARIMA}(P, D, Q)_s \iff \nabla^d \nabla_s^D Y_t \sim \mathrm{ARMA}(p, q) \times \mathrm{ARMA}(P, Q)_s.$$
The SARIMA class is very flexible. Many time series can be adequately fit by these
models, usually with a small number of parameters, often fewer than five.
Figure 10.14: United States milk production data. Left: Sample ACF for {∇∇12 Yt }.
Right: Sample PACF for {∇∇12 Yt }.
Example 10.5 (continued). For the milk production data in Example 10.1, we have
seen that the combined difference process {∇∇12 Yt } looks to be relatively stationary.
In Figure 10.14, we display the sample ACF (left) and sample PACF (right) of the
{∇∇12 Yt } process. Examining these two plots will help us identify which ARMA(p, q) ×
ARMA(P, Q)12 model is appropriate for {∇∇12 Yt }.
• The sample ACF for {∇∇12 Yt } has a pronounced spike at seasonal lag k = 12 and
one at k = 48 (but none at k = 24 and k = 36).
• The sample PACF for {∇∇12 Yt } displays pronounced spikes at seasonal lags k =
12, 24 and 36.
• The last two observations are consistent with the following choices:
– (P, Q) = (0, 1) if one is willing to ignore the ACF at k = 48. Also, if (P, Q) =
(0, 1), we would expect the PACF to decay at lags k = 12, 24 and 36.
There is actually not that much of a decay.
– (P, Q) = (3, 0), if one is willing to place strong emphasis on the sample PACF.
• There does not appear to be “anything happening” around seasonal lags in the
ACF, and the ACF at k = 1 is borderline. We therefore take p = 0 and q = 0.
• Therefore, there are two models which emerge as strong possibilities:
– (P, Q) = (0, 1): MA(1)12 model for {∇∇12 Yt }
– (P, Q) = (3, 0): AR(3)12 model for {∇∇12 Yt }.
• I have carefully examined both models. The AR(3)12 model provides a much better
fit to the {∇∇12 Yt } process than the MA(1)12 model.
– The AR(3)12 model for {∇∇12 Yt } provides a smaller AIC, a smaller estimate
of the white noise variance, and superior residual diagnostics; e.g., the Ljung-
Box test strongly discounts the MA(1)12 model for {∇∇12 Yt } at all lags.
• For illustrative purposes, we therefore tentatively adopt an ARIMA(0, 1, 0) ×
ARIMA(3, 1, 0)12 model for the milk production data.
MODEL FITTING: We use R to fit this ARIMA(0, 1, 0) × ARIMA(3, 1, 0)12 model using
maximum likelihood. Here is the output:
> milk.arima010.arima310 =
arima(milk,order=c(0,1,0),method='ML',seasonal=list(order=c(3,1,0),period=12))
> milk.arima010.arima310
Coefficients:
sar1 sar2 sar3
-0.9133 -0.8146 -0.6002
s.e. 0.0696 0.0776 0.0688
The fitted model is
$$(1 + 0.9133B^{12} + 0.8146B^{24} + 0.6002B^{36}) \underbrace{(1 - B)(1 - B^{12}) Y_t}_{=\,\nabla\nabla_{12} Y_t} = e_t.$$
The white noise variance estimate is $\hat{\sigma}_e^2 \approx 121.4$. Note that all parameter
estimates ($\hat{\Phi}_1$, $\hat{\Phi}_2$, and $\hat{\Phi}_3$) are statistically different from zero (by a very large amount).
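It can help to see the fitted model as a single polynomial in B acting on Yt, since that is the difference-equation form used when forecasts are computed recursively. The sketch below multiplies the three factors using base R only; poly.mult is a helper name introduced here for illustration, and the coefficients are the estimates reported in the output above.

```r
# Multiply two polynomials in B (coefficients in increasing order of
# power). poly.mult is a hypothetical helper defined for illustration.
poly.mult <- function(a, b) {
  out <- numeric(length(a) + length(b) - 1)
  for (i in seq_along(a)) {
    idx <- i + seq_along(b) - 1
    out[idx] <- out[idx] + a[i] * b
  }
  out
}

sar <- c(-0.9133, -0.8146, -0.6002)    # sar1, sar2, sar3 estimates
Phi.poly <- numeric(37)
Phi.poly[1] <- 1
Phi.poly[c(13, 25, 37)] <- -sar        # 1 - Phi1 B^12 - Phi2 B^24 - Phi3 B^36
d1  <- c(1, -1)                        # 1 - B
d12 <- c(1, rep(0, 11), -1)            # 1 - B^12

full <- poly.mult(poly.mult(Phi.poly, d1), d12)

# Nonzero coefficients and the powers of B they multiply
nz <- which(abs(full) > 1e-12)
data.frame(power = nz - 1, coef = full[nz])
```

The expanded polynomial has degree 49, so each observation is expressed in terms of values up to 49 months back plus the current white noise term.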
Figure 10.15: United States milk production data. Standardized residuals from
ARIMA(0, 1, 0) × ARIMA(3, 1, 0)12 model fit.
MODEL DIAGNOSTICS : The histogram and qq plot of the standardized residuals in
Figure 10.15 generally support the normality assumption. In addition, when further
examining the standardized residuals from the model fit,
• the Shapiro-Wilk test does not reject normality (p-value = 0.6619)
• the runs test does not reject independence (p-value = 0.112).
Finally, the tsdiag output in Figure 10.16 supports the ARIMA(0, 1, 0)×ARIMA(3, 1, 0)12
model choice.
OVERFITTING: For an ARIMA(0, 1, 0)×ARIMA(3, 1, 0)12 model, there are 4 overfitted
models. Here are the models and the results from overfitting:
ARIMA(1, 1, 0) × ARIMA(3, 1, 0)12 =⇒ $\hat{\phi}$ not significant
ARIMA(0, 1, 1) × ARIMA(3, 1, 0)12 =⇒ $\hat{\theta}$ not significant
ARIMA(0, 1, 0) × ARIMA(4, 1, 0)12 =⇒ $\hat{\Phi}_4$ not significant
ARIMA(0, 1, 0) × ARIMA(3, 1, 1)12 =⇒ $\hat{\Theta}$ not significant.
Figure 10.16: United States milk production data. ARIMA(0, 1, 0) × ARIMA(3, 1, 0)12
tsdiag output.
CONCLUSION : The ARIMA(0, 1, 0) × ARIMA(3, 1, 0)12 model does a good job at de-
scribing the U.S. milk production data. With this model, we move forward with fore-
casting future observations.
FORECASTING: We use R to compute forecasts and prediction limits for the lead times
l = 1, 2, ..., 24 (two years ahead) based on the ARIMA(0, 1, 0) × ARIMA(3, 1, 0)12 model
fit. Here are the estimated MMSE forecasts and 95 percent prediction limits:
# MMSE forecasts
> milk.arima010.arima310.predict <- predict(milk.arima010.arima310,n.ahead=24)
> round(milk.arima010.arima310.predict$pred,3)
Jan Feb Mar Apr May Jun Jul Aug Sep
2006 1702.409 1584.302 1760.356 1728.246 1783.487 1698.330 1694.116 1680.528 1610.895
2007 1725.769 1608.022 1775.653 1742.424 1792.538 1715.007 1717.981 1695.297 1631.562
Oct Nov Dec
2006 1655.054 1610.777 1689.084
2007 1679.871 1634.033 1712.183
> round(milk.arima010.arima310.predict$se,3)
2006 11.018 15.581 19.083 22.035 24.636 26.988 29.150 31.162 33.053 34.841 36.541 38.166
2007 40.000 41.753 43.436 45.056 46.620 48.132 49.599 51.024 52.410 53.760 55.077 56.363
# Compute prediction intervals
lower.pi <- milk.arima010.arima310.predict$pred -
            qnorm(0.975,0,1)*milk.arima010.arima310.predict$se
upper.pi <- milk.arima010.arima310.predict$pred +
            qnorm(0.975,0,1)*milk.arima010.arima310.predict$se
## For brevity (in the notes), I display estimated MMSE forecasts only 12 months ahead.
Month lower.pi upper.pi
1 2006.000 1680.815 1724.003
2 2006.083 1553.763 1614.840
3 2006.166 1722.954 1797.758
4 2006.250 1685.058 1771.434
5 2006.333 1735.201 1831.773
6 2006.416 1645.436 1751.225
7 2006.500 1636.983 1751.249
8 2006.583 1619.450 1741.605
9 2006.666 1546.113 1675.678
10 2006.750 1586.767 1723.340
11 2006.833 1539.158 1682.397
12 2006.916 1614.280 1763.888
• In Figure 10.17, we display the U.S. milk production data. The full data set is from
1/94 to 12/05 (one observation per month). However, to emphasize the MMSE
forecasts in the plot, we start the series at month 1/04.
[Plot: vertical axis from 1600 to 1800; horizontal axis Year, 2004 to 2008.]
Figure 10.17: U.S. milk production data. The full data set is from 1/1994-12/2005. This
figure starts the series at 1/2004. ARIMA(0, 1, 0) × ARIMA(3, 1, 0)12 estimated MMSE
forecasts and 95 percent prediction limits are given for lead times l = 1, 2, ..., 24. These
lead times correspond to years 1/2006-12/2007.
• The estimated MMSE forecasts in Figure 10.17 start at 1/06 and end at 12/07 (24 months).
• Numerical values of the 95 percent prediction intervals are given for 1/06-12/06 in the prediction interval output. Note how the interval lengths increase as l does. This is a byproduct of nonstationarity. In Figure 10.17, the impact of nonstationarity is also easily seen as l increases (prediction limits become wider).
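SKETCH: The widening of the prediction limits with the lead time l can be checked numerically. The code below is a minimal sketch, assuming the built-in AirPassengers series (log scale) and an ARIMA(0, 1, 1) × ARIMA(0, 1, 1)12 fit as stand-ins for the milk data and its fitted model; the object names fit, fc, and pi.table are illustrative only.

```r
# ASSUMPTION: log(AirPassengers) with an ARIMA(0,1,1) x ARIMA(0,1,1)_12 fit
# is a stand-in for the milk series and the model fitted in the notes.
fit <- arima(log(AirPassengers), order = c(0, 1, 1), method = "ML",
             seasonal = list(order = c(0, 1, 1), period = 12))

# $pred holds the forecasts, $se their standard errors
fc <- predict(fit, n.ahead = 24)

# 95 percent prediction limits: forecast +/- z * standard error
z <- qnorm(0.975)
pi.table <- data.frame(
  month    = as.numeric(time(fc$pred)),
  forecast = as.numeric(fc$pred),
  lower.pi = as.numeric(fc$pred - z * fc$se),
  upper.pi = as.numeric(fc$pred + z * fc$se))

# Interval width at each lead time l = 1, ..., 24
pi.table$width <- pi.table$upper.pi - pi.table$lower.pi

print(round(head(pi.table, 12), 3))
```

The width column grows with the lead time, which is the same behavior seen in the milk production intervals.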
NOTE : Although we did not state so explicitly, determining MMSE forecasts and prediction limits for seasonal models is exactly analogous to the nonseasonal cases we studied in Chapter 9. Formulae for seasonal MMSE forecasts are given in Section 10.5 (CC) for special cases.
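REMINDER: As a brief sketch of the general ARIMA result being relied on here (the symbols $\hat{Y}_t(l)$, $e_t(l)$, and $\psi_j$ are assumed to match the Chapter 9 notation), approximate 95 percent prediction limits and the forecast error variance take the form

```latex
\[
\hat{Y}_t(l) \pm z_{0.025}\,\sqrt{\widehat{\mathrm{var}}[e_t(l)]},
\qquad
\mathrm{var}[e_t(l)] = \sigma_e^2 \sum_{j=0}^{l-1} \psi_j^2 ,
\]
```

where $z_{0.025} \approx 1.96$. The sum can only grow as l increases, and for a nonstationary model the $\psi_j$ weights do not die out, so the limits keep widening.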
[Plot: Australian clay brick production (in millions), roughly 200 to 600, versus Time, 1960 to 1990.]
Figure 10.18: Australian clay brick production data. Number of bricks (in millions)
produced from 1956-1994.
Example 10.6. In this example, we revisit the Australian brick production data in Example 1.14 (p. 15 of the notes). The data in Figure 10.18 represent the number of bricks produced in Australia (in millions) during 1956-1994. The data are quarterly, so the underlying seasonal lag of interest is s = 4.
INITIAL ANALYSIS : The first thing we should do is a Box-Cox analysis to see if a variance-stabilizing transformation is needed (there is evidence of heteroscedasticity from examining the original series in Figure 10.18).
• The BoxCox.ar function in R (output not shown) suggests a Box-Cox transformation parameter of λ ≈ 0.5.
• This suggests that a square-root transformation is warranted.
• We now examine the transformed data and the relevant differenced series.
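SKETCH: The transformed and differenced series described above can be computed with base R. This is a minimal sketch, assuming the built-in quarterly series UKgas as a stand-in for the brick data (which are not available here); the names d1, d4, and d14 are illustrative.

```r
# ASSUMPTION: UKgas (quarterly, built into R) stands in for the brick series.
y <- sqrt(UKgas)                 # square-root transformation (lambda = 0.5)

d1  <- diff(y)                   # first differences:          Y_t - Y_{t-1}
d4  <- diff(y, lag = 4)          # first seasonal differences: Y_t - Y_{t-4}
d14 <- diff(diff(y, lag = 4))    # combined first differences

# The combined difference (1 - B)(1 - B^4) Y_t expands to
# Y_t - Y_{t-1} - Y_{t-4} + Y_{t-5}; verify this directly.
n <- length(y)
v <- as.numeric(y)
manual <- v[6:n] - v[5:(n - 1)] - v[2:(n - 4)] + v[1:(n - 5)]
print(all.equal(as.numeric(d14), manual))

# Each differencing step costs observations: 1, 4, and 5 respectively.
print(c(n = n, d1 = length(d1), d4 = length(d4), d14 = length(d14)))
```

Plotting y, d1, d4, and d14 with plot() would give a four-panel display analogous to the one used for the brick data.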
[Figure 10.19 panels, each plotted against time (1960 to 1990): Brick production (Square-root transformed); First differences; First seasonal differences (s=4); Combined first differences.]
Figure 10.19: Australian clay brick production data (square-root transformed). Upper left: Original series $\{Y_t\}$. Upper right: First (nonseasonal) differences $\nabla Y_t = Y_t - Y_{t-1}$. Lower left: First (seasonal) differences $\nabla_4 Y_t = Y_t - Y_{t-4}$. Lower right: Combined first (seasonal and nonseasonal) differences $\nabla\nabla_4 Y_t$.
[Figure 10.20 panels: ACF versus Lag (0 to 30); Partial ACF versus Lag (0 to 30).]
Figure 10.20: Australian clay brick production data (square-root transformed). Left: Sample ACF for $\{\nabla\nabla_4 Y_t\}$. Right: Sample PACF for $\{\nabla\nabla_4 Y_t\}$.
NOTE : The combined difference process $\nabla\nabla_4 Y_t$ in Figure 10.19 looks stationary in the mean level. The sample ACF/PACF for the $\nabla\nabla_4 Y_t$ series is given in Figure 10.20. Recall that our analysis is now on the square-root transformed scale.
ANALYSIS : Examining the sample ACF/PACF for the $\nabla\nabla_4 Y_t$ data does not lead us to one single model as a “clear favorite.” In fact, there are ambiguities that emerge; e.g., a spike in the ACF at lag k = 25 (this is not a seasonal lag), a spike in the PACF at the seventh seasonal lag k = 28, etc.
• The PACF does display spikes at the first 4 seasonal lags k = 4, k = 8, k = 12,
and k = 16.
• The ACF does not display consistent “action” around these seasonal lags in either
direction.
• These two observations lead us to tentatively consider an AR(4)4 model for the combined difference process $\{\nabla\nabla_4 Y_t\}$; i.e., an ARIMA(0, 1, 0) × ARIMA(4, 1, 0)4 for the square-root transformed series.
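SKETCH: The visual check of spikes at the seasonal lags can also be done numerically. The code below is a minimal sketch, assuming the built-in quarterly series UKgas as a stand-in for the brick data and using the approximate white-noise limits ±2/√n; the names w, bound, and spikes are illustrative.

```r
# ASSUMPTION: UKgas (quarterly, built into R) stands in for the brick series.
w <- diff(diff(sqrt(UKgas), lag = 4))     # combined first differences
n <- length(w)
bound <- 2 / sqrt(n)                      # approximate white-noise limits

max.lag <- 28
# acf() returns lag 0 first, so drop it; pacf() starts at lag 1.
r   <- acf(w,  lag.max = max.lag, plot = FALSE)$acf[-1]
phi <- pacf(w, lag.max = max.lag, plot = FALSE)$acf

# Seasonal lags for quarterly data: k = 4, 8, ..., 28
seasonal <- seq(4, max.lag, by = 4)
spikes <- data.frame(
  lag        = seasonal,
  acf        = round(r[seasonal], 3),
  acf.spike  = abs(r[seasonal]) > bound,
  pacf       = round(phi[seasonal], 3),
  pacf.spike = abs(phi[seasonal]) > bound)
print(spikes)
```

A run of PACF spikes at the first few seasonal lags, without matching persistence in the ACF, is the pattern that pointed toward a seasonal AR specification for the brick data.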
[Figure 10.21 panels: histogram (vertical axis Frequency); normal Q-Q plot (vertical axis Sample Quantiles).]
Figure 10.21: Australian clay brick production data (square-root transformed). Standardized residuals from ARIMA(0, 1, 0) × ARIMA(4, 1, 0)4 model fit.
MODEL FITTING: We use R to fit this ARIMA(0, 1, 0) × ARIMA(4, 1, 0)4 model using maximum likelihood. Here is the output:
> sqrt.brick.arima010.arima410 =
  arima(sqrt.brick,order=c(0,1,0),method='ML',seasonal=list(order=c(4,1,0),period=4))
> sqrt.brick.arima010.arima410
Coefficients:
         sar1     sar2     sar3     sar4
      -0.8249  -0.8390  -0.5330  -0.3290
s.e.   0.0780   0.0935   0.0936   0.0772
The fitted model is
\[
(1 + 0.8249B^4 + 0.8390B^8 + 0.5330B^{12} + 0.3290B^{16}) \underbrace{(1 - B)(1 - B^4)Y_t}_{=\, \nabla\nabla_4 Y_t} = e_t.
\]
The white noise variance estimate is $\hat{\sigma}_e^2 \approx 0.2889$. Note that all parameter estimates ($\hat{\Phi}_1$, $\hat{\Phi}_2$, $\hat{\Phi}_3$, and $\hat{\Phi}_4$) are statistically different from zero (by a very large amount).
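SKETCH: To see the difference equation implied by the fitted model, the three backshift polynomials can be multiplied out. The code below is a minimal sketch that uses the four sar estimates from the output above; polymul is a hypothetical helper written for this illustration, not an R built-in.

```r
# Hypothetical helper: multiply two polynomials in B, each given as a
# coefficient vector ordered B^0, B^1, B^2, ...
polymul <- function(a, b) {
  out <- numeric(length(a) + length(b) - 1)
  for (i in seq_along(a)) {
    idx <- i + seq_along(b) - 1
    out[idx] <- out[idx] + a[i] * b
  }
  out
}

# Seasonal AR estimates (sar1, ..., sar4) from the fitted model output
sar <- c(-0.8249, -0.8390, -0.5330, -0.3290)

# Seasonal AR operator: 1 - sar1*B^4 - sar2*B^8 - sar3*B^12 - sar4*B^16
Phi <- numeric(17)
Phi[1] <- 1
Phi[c(5, 9, 13, 17)] <- -sar

# Multiply by (1 - B) and then by (1 - B^4)
full <- polymul(polymul(Phi, c(1, -1)), c(1, 0, 0, 0, -1))
names(full) <- paste0("B^", 0:21)

# Nonzero coefficients of the expanded operator applied to Y_t
print(round(full[full != 0], 4))
```

The expanded operator has degree 21, so on the square-root scale the current value is tied to observations up to 21 quarters back. Its coefficients sum to zero because of the (1 − B) factor.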
[Figure 10.22 panels, top to bottom: standardized residuals versus Time (1960 to 1990); ACF of Residuals versus Lag; P-values versus Number of lags.]
Figure 10.22: Australian clay brick production data (square-root transformed). ARIMA(0, 1, 0) × ARIMA(4, 1, 0)4 tsdiag output.
DIAGNOSTICS : The tsdiag output in Figure 10.22 does not strongly refute the
ARIMA(0, 1, 0) × ARIMA(4, 1, 0)4 model choice, and overfitting (results not shown) does
not lead us to consider a higher order model. However, the qq plot of the standardized
residuals in Figure 10.21 reveals major problems with the normality assumption, and the
Shapiro-Wilk test strongly rejects normality (p-value < 0.0001).
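SKETCH: The numerical side of these diagnostics can be reproduced with base R. The code below is a minimal sketch, assuming the built-in AirPassengers series (log scale) and an ARIMA(0, 1, 1) × ARIMA(0, 1, 1)12 fit as stand-ins for the brick data and model; dividing the residuals by the square root of the estimated white noise variance is used here as a simple standardization.

```r
# ASSUMPTION: log(AirPassengers) with an ARIMA(0,1,1) x ARIMA(0,1,1)_12 fit
# is a stand-in for the brick series and the model fitted in the notes.
fit <- arima(log(AirPassengers), order = c(0, 1, 1), method = "ML",
             seasonal = list(order = c(0, 1, 1), period = 12))

# Residuals scaled by the estimated white noise standard deviation
res <- as.numeric(residuals(fit)) / sqrt(fit$sigma2)

# Shapiro-Wilk test; H0: the residuals are normally distributed
sw <- shapiro.test(res)

# Ljung-Box test; H0: no residual autocorrelation up to lag 24.
# fitdf = 2 accounts for the two estimated ARMA parameters.
lb <- Box.test(res, lag = 24, type = "Ljung-Box", fitdf = 2)

print(c(shapiro.p = sw$p.value, ljung.box.p = lb$p.value))

# Graphical versions (uncomment to draw):
# qqnorm(res); qqline(res)   # normal Q-Q plot of the residuals
# tsdiag(fit)                # three-panel diagnostic display
```

A very small Shapiro-Wilk p-value, as reported for the brick model, signals that prediction limits based on normal quantiles should be treated with caution.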
CONCLUSION : The ARIMA(0, 1, 0) × ARIMA(4, 1, 0)4 model for the Australian brick production data (square-root transformed) is not completely worthless, but I would hesitate to use this model for forecasting purposes (since the normality assumption is so grossly violated). The search for a better model should continue!
10.5 Additional topics
DISCUSSION : In this course, we have covered the first 10 chapters of Cryer and Chan
(2008). This material provides you with a powerful arsenal of techniques to analyze many
time series data sets that are seen in practice. These chapters also lay the foundation for
further study in time series analysis.
• Chapter 11. This chapter provides an introduction to intervention analysis, which deals with incorporating external events in modeling time series data (e.g., a change in production methods, natural disasters, terrorist attacks, etc.). Techniques for incorporating external covariate information and analyzing multiple time series are also presented.
• Chapter 12. This chapter deals explicitly with modeling financial time series data (e.g., stock prices, portfolio returns, etc.), mainly with the commonly used ARCH and GARCH models. The key feature of these models is that they incorporate the additional heteroscedasticity that is common in financial data.
• Chapter 13. This chapter deals with frequency domain methods (spectral
analysis) for periodic data which arise in physics, biomedicine, engineering, etc.
The periodogram and spectral density are introduced. These methods use linear
combinations of sine and cosine functions to model underlying (possibly multiple)
frequencies.
• Chapter 14. This chapter is an extension of Chapter 13 which studies the sampling
characteristics of the spectral density estimator.
• Chapter 15. This chapter discusses nonlinear models for time series data.
This class of models assumes that current data are nonlinear functions of past
observations, which can be a result of nonnormality.