STAT 520 Forecasting and Time Series: Lecture Notes
Joshua M. Tebbs
Department of Statistics
University of South Carolina
• equally spaced in discrete time; that is, we will have a single realization of Y at
each second, hour, day, month, year, etc.
UBIQUITY : Time series data arise in a variety of fields. Here are just a few examples.
• In business, we observe daily stock prices, weekly interest rates, quarterly sales,
monthly supply figures, annual earnings, etc.
• In agriculture, we observe annual yields (e.g., crop production), daily crop prices,
annual herd sizes, etc.
• In social sciences, we observe annual birth and death rates, accident frequencies,
crime rates, school enrollments, etc.
Figure 1.1: Global temperature data. The data are a combination of land-air average
temperature anomalies, measured in degrees Centigrade.
Figure 1.2: United States milk production data. Monthly production figures, measured
in millions of pounds, from January, 1994 to December, 2005.
Example 1.2. Milk production data. Commercial dairy farming produces the vast
majority of milk in the United States. The data in Figure 1.2 are the monthly U.S. milk
production (in millions of pounds) from January, 1994 to December, 2005.
• Predictions?
Figure 1.3: CREF stock data. Daily values of one unit of CREF stock values: August
26, 2004 to August 15, 2006.
Example 1.3. CREF stock data. TIAA-CREF is the leading provider of retirement
accounts and products to employees in academic, research, medical, and cultural in-
stitutions. The data in Figure 1.3 are daily values of one unit of the CREF (College
Retirement Equity Fund) stock fund from 8/26/04 to 8/15/06.
Figure 1.4: Homerun data. Number of homeruns hit by the Boston Red Sox each year
during 1909-2010.
Example 1.4. Homerun data. The Boston Red Sox are a professional baseball team
based in Boston, Massachusetts, and a member of Major League Baseball’s American
League Eastern Division. The data in Figure 1.4 are the number of homeruns hit by the
team each year from 1909 to 2010. Source: Ted Hornback (Spring, 2010).
• Predictions?
Figure 1.5: Earthquake data. Number of “large” earthquakes per year from 1900-1998.
Example 1.5. Earthquake data. An earthquake occurs when there is a sudden release
of energy in the Earth’s crust. Earthquakes are caused mostly by rupture of geological
faults, but also by other events such as volcanic activity, landslides, mine blasts, and
nuclear tests. The data in Figure 1.5 are the number of global earthquakes annually
(with intensities of 7.0 or greater) during 1900-1998. Source: Craig Whitlow (Spring,
2010).
• Predictions?
Figure 1.6: University of South Carolina fall enrollment data. Number of students reg-
istered for classes on the Columbia campus during 1954-2010.
Example 1.6. Enrollment data. The data in Figure 1.6 are the annual fall enroll-
ment counts for USC (Columbia campus only, 1954-2010). The data were obtained from
the USC website http://www.ipr.sc.edu/enrollment/, which contains the enrollment
counts for all campuses in the USC system.
• Predictions?
Figure 1.7: Star brightness data. Measurements for a single star taken over 600 consec-
utive nights.
Example 1.7. Star brightness data. Two factors determine the brightness of a star:
its luminosity (how much energy it puts out in a given time) and its distance from the
Earth. The data in Figure 1.7 are nightly brightness measurements (in magnitude) of a
single star over a period of 600 nights.
• Predictions?
Figure 1.8: Airline passenger mile data. The number of miles, in thousands, traveled by
passengers in the United States from January, 1996 to May, 2005.
Example 1.8. Airline mile data. The Bureau of Transportation Statistics publishes
monthly passenger traffic data reflecting 100 percent of scheduled operations for airlines
in the United States. The data in Figure 1.8 are monthly U.S. airline passenger miles
traveled from 1/1996 to 5/2005.
• Predictions?
Figure 1.9: S&P Index price data. Daily values of the index from June 6, 1999 to June
5, 2000.
• Predictions?
Figure 1.10: Ventilation data. Ventilation (L/min) for a single cyclist, plotted by observation time.
Example 1.10. Ventilation data. Collecting expired gases during exercise allows one to
quantify many outcomes during an exercise test. One such outcome is the ventilatory
threshold; i.e., the point at which lactate begins to accumulate in the blood. The data
in Figure 1.10 are ventilation observations (L/min) on a single cyclist during exercise.
Observations are recorded every 15 seconds. Source: Joe Alemany (Spring, 2010).
• Predictions?
Figure 1.11: Exchange rate data. Weekly exchange rate of US dollar compared to the
British pound, from 1980-1988.
Example 1.11. Exchange rate data. The pound sterling, often simply called “the
pound,” is the currency of the United Kingdom and many of its territories. The data in
Figure 1.11 are weekly exchange rates of the US dollar and the British pound between
the years 1980 and 1988.
• Predictions?
Figure 1.12: Crude oil price data. Monthly spot prices in dollars from Cushing, OK,
from 1/1986 to 1/2006.
Example 1.12. Oil price data. Crude oil prices behave much as any other commodity
with wide price swings in times of shortage or oversupply. The crude oil price cycle may
extend over several years responding to changes in demand. The data in Figure 1.12 are
monthly spot prices for crude oil (measured in U.S. dollars per barrel) from Cushing,
OK.
• Predictions?
Figure 1.13: Los Angeles rainfall data. Annual precipitation measurements, in inches,
during 1878-1992.
Example 1.13. Annual rainfall data. Los Angeles averages 15 inches of precipitation
annually, which mainly occurs during the winter and spring (November through April)
with generally light rain showers, but sometimes as heavy rainfall and thunderstorms.
The data in Figure 1.13 are annual rainfall totals for Los Angeles during 1878-1992.
• Predictions?
Figure 1.14: Australian clay brick production data. Number of bricks (in millions)
produced from 1956-1994.
Example 1.14. Brick production data. Clay bricks remain extremely popular for the
cladding of houses and small commercial buildings throughout Australia due to their
versatility of use, tensile strength, thermal properties and attractive appearance. The
data in Figure 1.14 represent the number of bricks produced in Australia (in millions)
during 1956-1994. The data are quarterly.
• Predictions?
Figure 1.15: United States Supreme Court data. Percent of cases granted review during
1926-2004.
Example 1.15. Supreme Court data. The Supreme Court of the United States has
ultimate (but largely discretionary) appellate jurisdiction over all state and federal courts,
and original jurisdiction over a small range of cases. The data in Figure 1.15 represent
the acceptance rate of cases appealed to the Supreme Court during 1926-2004. Source:
Jim Manning (Spring, 2010).
• Predictions?
GOALS : There are two main goals in the analysis of time series data:
1. to model the stochastic (random) mechanism that gives rise to the series of data
2. to predict (forecast) the future values of the series based on the previous history.
NOTES : The analysis of time series data calls for a “new way of thinking” when compared
to other statistical methods courses. Essentially, we get to see only a single measurement
from a population (at time t) instead of a sample of measurements at a fixed point in
time (cross-sectional data).
• The special feature of time series data is that they are not independent! Instead,
observations are correlated through time.
• Most classical statistical methods (e.g., regression, analysis of variance, etc.) as-
sume that observations are statistically independent. For example, in the simple
linear regression model
Yi = β0 + β1 xi + ϵi ,
we typically assume that the error terms ϵi are independent and identically distributed (iid) normal random variables with mean 0 and constant variance.
MODELING: Our overarching goal in this course is to build (and use) time series models
for data. This breaks down into three parts.
1. Model specification (identification)
2. Model fitting
3. Model diagnostics
• Use statistical inference and graphical displays to check how well the model
fits the data.
• This part of the analysis may suggest the candidate model is inadequate and
may point to more appropriate models.
TIME SERIES PLOT : The time series plot is the most basic graphical display in the analysis of time series data. The plot is basically a scatterplot of Yt versus t, with straight lines connecting the points. Notationally, the observed series is written Y1, Y2, ..., Yn.
The subscript t tells us to which time point the measurement Yt corresponds. Note that
in the sequence Y1 , Y2 , ..., Yn , the subscripts are very important because they correspond
to a particular ordering of the data. This is perhaps a change in mind set from other
methods courses where the time element is ignored.
Figure 1.16: Airline passenger mile data. The number of miles, in thousands, traveled
by passengers in the United States from January, 1996 to May, 2005. Monthly plotting
symbols have been added.
GRAPHICS : The time series plot is vital, both to describe the data and to help in formulating a sensible model. Here are some simple, but important, guidelines when constructing these plots.
• Choose the scales carefully (including the size of the intercept). Default settings
from software may be sufficient.
• Use special plotting symbols where appropriate; e.g., months of the year, days of
the week, actual numerical values for outlying values, etc.
2 Fundamental Concepts
DISCLAIMER: Going forward, we must be familiar with the following results from prob-
ability and distribution theory (e.g., STAT 511, etc.). If you have not had this material,
you should find a suitable reference and study up on your own. See also pp 24-26 (CC).
TERMINOLOGY : Let Y be a continuous random variable with cdf FY (y). The prob-
ability density function (pdf ) for Y , denoted by fY (y), is given by
fY(y) = (d/dy) FY(y),
provided that this derivative exists.
PROPERTIES : Suppose that Y is a continuous random variable with pdf fY(y) and support R (that is, the set of all values that Y can assume). Then fY(y) ≥ 0, for all y, and ∫_R fY(y) dy = 1.
RESULT : Suppose Y is a continuous random variable with pdf fY (y) and cdf FY (y).
Then P(a < Y < b) = ∫_{a}^{b} fY(y) dy = FY(b) − FY(a).
TERMINOLOGY : Let Y be a continuous random variable with pdf fY (y) and support
R. The expected value (or mean) of Y is given by
E(Y) = ∫_R y fY(y) dy, provided that E|Y| = ∫_R |y| fY(y) dy < ∞.
If this is not true, then we say that E(Y ) does not exist. If g is a real-valued function,
then g(Y ) is a random variable and
E[g(Y)] = ∫_R g(y) fY(y) dy, provided that this integral exists.
(a) E(a) = a
FACTS :
(a) var(Y ) ≥ 0. var(Y ) = 0 if and only if the random variable Y has a degenerate
distribution; i.e., all the probability mass is located at one support point.
(b) The larger (smaller) var(Y ) is, the more (less) spread in the possible values of Y
about the mean µ = E(Y ).
(c) var(Y) is measured in (units)². The standard deviation of Y is σ = √(σ²) = √(var(Y)) and is measured in the original units of Y.
IMPORTANT RESULT : Let Y be a random variable, and suppose that a and b are fixed
constants. Then
var(a + bY ) = b2 var(Y ).
RESULT : Suppose (X, Y ) is a continuous random vector with joint pdf fX,Y (x, y). Then
P[(X, Y) ∈ B] = ∫∫_B fX,Y(x, y) dx dy, for any set B ⊆ R².
TERMINOLOGY : Suppose that (X, Y ) is a continuous random vector with joint pdf
fX,Y (x, y). The joint cumulative distribution function (cdf ) for (X, Y ) is given by
FX,Y(x, y) = P(X ≤ x, Y ≤ y) = ∫_{−∞}^{x} ∫_{−∞}^{y} fX,Y(t, s) dt ds,
for all (x, y) ∈ R2 . It follows upon differentiation that the joint pdf is given by
fX,Y(x, y) = ∂²FX,Y(x, y)/(∂x ∂y),
wherever this mixed partial derivative is defined.
RESULT : Suppose that (X, Y ) has joint pdf fX,Y (x, y) and support R. Let g(X, Y ) be
a real-valued function of (X, Y); i.e., g : R² → R. Then E[g(X, Y)] = ∫∫_R g(x, y) fX,Y(x, y) dx dy.
If this quantity is not finite, then we say that E[g(X, Y )] does not exist.
(a) E(a) = a
TERMINOLOGY : Suppose that (X, Y ) is a continuous random vector with joint cdf
FX,Y (x, y), and denote the marginal cdfs of X and Y by FX (x) and FY (y), respectively.
The random variables X and Y are independent if and only if
FX,Y(x, y) = FX(x) FY(y),
for all values of x and y. It can hence be shown that X and Y are independent if and
only if
fX,Y (x, y) = fX (x)fY (y),
for all values of x and y. That is, the joint pdf fX,Y(x, y) factors into the product of the marginal pdfs fX(x) and fY(y), respectively.
RESULT : Suppose that X and Y are independent random variables. Let g(X) be a
function of X only, and let h(Y) be a function of Y only. Then E[g(X)h(Y)] = E[g(X)]E[h(Y)], provided that all expectations exist. Taking g(X) = X and h(Y) = Y, we get
E(XY ) = E(X)E(Y ).
TERMINOLOGY : Suppose that X and Y are random variables with means E(X) = µX
and E(Y) = µY, respectively. The covariance between X and Y is cov(X, Y) = E[(X − µX)(Y − µY)] = E(XY) − E(X)E(Y).
The latter expression is called the covariance computing formula. The covariance is
a numerical measure that describes how two variables are linearly related.
RESULT : If X and Y are independent, then cov(X, Y ) = 0. The converse is not neces-
sarily true.
RESULTS : Suppose that X and Y are random variables and that a, b, c, and d are constants. The covariance operator satisfies the following: (a) cov(X, X) = var(X); (b) cov(X, Y) = cov(Y, X); (c) cov(aX + b, cY + d) = ac cov(X, Y).
DEFINITION : Suppose that X and Y are random variables. The correlation between
X and Y is defined by
ρ = corr(X, Y) = cov(X, Y)/(σX σY), where σX and σY denote the standard deviations of X and Y, respectively.
NOTES :
(1) −1 ≤ ρ ≤ 1.
(2) If ρ = 1, then Y = β0 + β1X, where β1 > 0. That is, X and Y are perfectly positively linearly related; i.e., the bivariate probability distribution of (X, Y) lies entirely on a straight line with positive slope.
(3) If ρ = −1, then Y = β0 + β1X, where β1 < 0. That is, X and Y are perfectly negatively linearly related; i.e., the bivariate probability distribution of (X, Y) lies entirely on a straight line with negative slope.
RESULT : If X and Y are independent, then ρ = ρX,Y = 0. The converse is not true in
general. However, if (X, Y) has a bivariate normal distribution, then ρ = 0 does imply that X and Y are independent.
EXTENSION : We use the notation Y = (Y1 , Y2 , ..., Yn ) and y = (y1 , y2 , ..., yn ). The joint
cdf of Y is FY(y) = P(Y1 ≤ y1, Y2 ≤ y2, ..., Yn ≤ yn).
EXTENSION : Suppose that the random vector Y = (Y1 , Y2 , ..., Yn ) has joint cdf FY (y),
and suppose that the random variable Yi has cdf FYi (yi ), for i = 1, 2, ..., n. Then,
Y1 , Y2 , ..., Yn are independent random variables if and only if
FY(y) = ∏_{i=1}^{n} FYi(yi);
that is, the joint cdf can be factored into the product of the marginal cdfs. Alternatively,
Y1 , Y2 , ..., Yn are independent random variables if and only if
fY(y) = ∏_{i=1}^{n} fYi(yi);
that is, the joint pdf can be factored into the product of the marginal pdfs.
RESULT : If Y1, Y2, ..., Yn are independent random variables and g1, g2, ..., gn are real-valued functions, then E[g1(Y1)g2(Y2) ··· gn(Yn)] = E[g1(Y1)]E[g2(Y2)] ··· E[gn(Yn)], provided that all expectations exist.
TERMINOLOGY : Suppose that Y1 , Y2 , ..., Yn are random variables and that a1 , a2 , ..., an
are constants. The function U = Σ_{i=1}^{n} ai Yi = a1Y1 + a2Y2 + ··· + anYn is called a linear combination of Y1, Y2, ..., Yn.
RESULT : Suppose that U1 = Σ_{i=1}^{n} ai Yi and U2 = Σ_{j=1}^{m} bj Xj are linear combinations of the random variables Y1, Y2, ..., Yn and X1, X2, ..., Xm, respectively. Then,
cov(U1, U2) = Σ_{i=1}^{n} Σ_{j=1}^{m} ai bj cov(Yi, Xj).
2.1.4 Miscellaneous
GEOMETRIC SUMS : Suppose that a is any real number and that |r| < 1. Then, the finite geometric sum Σ_{j=0}^{n} a r^j = a(1 − r^{n+1})/(1 − r), and the infinite geometric sum Σ_{j=0}^{∞} a r^j = a/(1 − r).
The subscripts are important because they indicate the time period in which the value of Y is measured. A stochastic process can be described as “a statistical phenomenon that
evolves through time according to a set of probabilistic laws.”
• A complete probabilistic time series model for {Yt }, in fact, would specify all of the
joint distributions of random vectors Y = (Y1 , Y2 , ..., Yn ), for all n = 1, 2, ..., or,
equivalently, specify the joint probabilities
P (Y1 ≤ y1 , Y2 ≤ y2 , ..., Yn ≤ yn ),
TERMINOLOGY : For the stochastic process {Yt : t = 0, 1, 2, ..., }, the mean function
is defined as
µt = E(Yt ),
for t = 0, 1, 2, .... That is, µt is the theoretical (or population) mean for the series at time t. The autocovariance function is defined as γt,s = cov(Yt, Ys), and the autocorrelation function is defined as ρt,s = corr(Yt, Ys), where corr(Yt, Ys) = cov(Yt, Ys)/√(var(Yt)var(Ys)) = γt,s/√(γt,t γs,s).
Example 2.1. A stochastic process {et : t = 0, 1, 2, ..., } is called a white noise process
if it is a sequence of independent and identically distributed (iid) random variables with
E(et ) = µe
var(et ) = σe2 .
Figure 2.1: A simulated white noise process et ∼ iid N (0, σe2 ), where n = 150 and σe2 = 1.
• A slightly less restrictive definition would require that the et ’s are uncorrelated (not
independent). However, under normality; i.e., et ∼ iid N (0, σe2 ), this distinction
becomes vacuous (for linear time series models).
For t ≠ s, cov(et, es) = 0, since et and es are independent. Therefore, for t = s, ρt,s = corr(et, et) = γt,t/√(γt,t γt,t) = 1.
For t ≠ s, ρt,s = corr(et, es) = γt,s/√(γt,t γs,s) = 0.
Thus, the autocorrelation function is ρt,s = 1, for |t − s| = 0, and ρt,s = 0, for |t − s| ≠ 0.
REMARK : A white noise process, by itself, is rather uninteresting for modeling real
data. However, white noise processes still play a crucial role in the analysis of time series
data! Time series processes {Yt } generally contain two different types of variation:
• systematic variation (that we would like to capture and model; e.g., trends, seasonal components, etc.)
• random variation (unsystematic “noise” that we cannot explain with a deterministic structure).
Our goal as data analysts is to extract the systematic part of the variation in the data (and
incorporate this into our model). If we do an adequate job of extracting the systematic
part, then the only part “left over” should be random variation, which can be modeled
as white noise.
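A realization like the one in Figure 2.1 is easy to generate. The following R sketch is an illustration only (the seed and the choices n = 150 and σe² = 1 are assumptions matching the figure, not code from the notes):

set.seed(520)                      # arbitrary seed, for reproducibility
e <- rnorm(150)                    # e_t ~ iid N(0,1), so sigma_e^2 = 1
plot(e, type = "o", xlab = "Time", ylab = "Simulated white noise process")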
Example 2.2. Suppose that {et } is a zero mean white noise process with var(et ) = σe2 .
Define
Y1 = e1
Y2 = e1 + e2
..
.
Yn = e1 + e2 + · · · + en .
Yt = Yt−1 + et ,
where E(et ) = 0 and var(et ) = σe2 . The process {Yt } is called a random walk process.
Random walk processes are used to model stock prices, movements of molecules in gases
and liquids, animal locations, etc.
The mean and variance functions are
µt = E(Yt) = E(e1 + e2 + ··· + et) = 0
and
var(Yt) = var(e1 + e2 + ··· + et) = Σ_{i=1}^{t} cov(ei, ei) + ΣΣ_{1≤i≠j≤t} cov(ei, ej) = Σ_{i=1}^{t} var(ei) = σe² + σe² + ··· + σe² = tσe².
For t < s, the autocovariance is
γt,s = cov(Yt, Ys) = cov(e1 + ··· + et, e1 + ··· + et) + cov(e1 + ··· + et, et+1 + ··· + es) = var(Yt) + 0 = tσe².
Because γt,s = γs,t, the autocovariance function for a random walk process is γt,s = tσe², for 1 ≤ t ≤ s, and the autocorrelation function is ρt,s = γt,s/√(γt,t γs,s) = √(t/s), for 1 ≤ t ≤ s.
Figure 2.2: A simulated random walk process Yt = Yt−1 + et , where et ∼ iid N (0, σe2 ),
n = 150, and σe2 = 1. This process has been constructed from the simulated white noise
process {et } in Figure 2.1.
• Note that when t is closer to s, the autocorrelation ρt,s is closer to 1. That is,
two observations Yt and Ys close together in time are likely to be close together,
especially when t and s are both large (later on in the series).
• On the other hand, when t is far away from s (that is, for two points Yt and Ys far
apart in time), the autocorrelation is closer to 0.
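Since Yt = e1 + e2 + ··· + et, a random walk realization is simply the cumulative sum of a white noise realization. The sketch below (an illustration, not code from the notes) reuses the simulated e from the previous sketch to construct a series like the one in Figure 2.2:

Y <- cumsum(e)                     # Y_t = e_1 + e_2 + ... + e_t
plot(Y, type = "o", xlab = "Time", ylab = "Simulated random walk process")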
Example 2.3. Suppose that {et } is a zero mean white noise process with var(et ) = σe2 .
Define
Yt = (1/3)(et + et−1 + et−2),
that is, Yt is a running (or moving) average of the white noise process (averaged across
the most recent 3 time periods). Note that this example is slightly different than that
on pp 14-15 (CC).
Case 1: If s = t, then γt,s = γt,t = cov(Yt, Yt) = var(Yt) = (1/9)(3σe²) = σe²/3.
Case 2: If s = t + 1, then γt,s = cov((1/3)(et + et−1 + et−2), (1/3)(et+1 + et + et−1)) = (1/9)[var(et) + var(et−1)] = 2σe²/9.
Case 3: If s = t + 2, then γt,s = cov((1/3)(et + et−1 + et−2), (1/3)(et+2 + et+1 + et)) = (1/9)var(et) = σe²/9.
Because γt,t = γs,s = σe²/3, the autocorrelation function for this process is ρt,s = 1, for |t − s| = 0; ρt,s = 2/3, for |t − s| = 1; ρt,s = 1/3, for |t − s| = 2; and ρt,s = 0, for |t − s| > 2.
• Observations Yt and Ys that are 1 unit apart in time have the same autocorrelation
regardless of the values of t and s.
• Observations Yt and Ys that are 2 units apart in time have the same autocorrelation
regardless of the values of t and s.
• Observations Yt and Ys that are more than 2 units apart in time are uncorrelated.
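The 3-point moving average can be constructed from the same simulated white noise; filter() in R computes the one-sided running average (1/3)(et + et−1 + et−2). This is an illustrative sketch, not code from the notes:

Y.ma <- stats::filter(e, filter = rep(1/3, 3), sides = 1)   # NA for t = 1, 2
plot(Y.ma, type = "o", xlab = "Time", ylab = "Simulated moving average process")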
Figure 2.3: A simulated moving average process Yt = (1/3)(et + et−1 + et−2), where et ∼ iid N(0, σe²), n = 150, and σe² = 1. This process has been constructed from the simulated white noise process {et} in Figure 2.1.
Example 2.4. Suppose that {et } is a zero mean white noise process with var(et ) = σe2 .
Consider the stochastic process defined by
Yt = 0.75Yt−1 + et ,
that is, Yt is directly related to the (downweighted) previous value of the process Yt−1
and the random error et (a “shock” or “innovation” that occurs at time t). This is called
an autoregressive model. Autoregression means “regression on itself.” Essentially, we
can envision “regressing” Yt on Yt−1 .
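A realization of this process, like the one in Figure 2.4, can be simulated with arima.sim() (or with a simple loop). A minimal sketch, assuming et ∼ iid N(0, 1):

Y.ar <- arima.sim(model = list(ar = 0.75), n = 150)   # Y_t = 0.75*Y_{t-1} + e_t
plot(Y.ar, type = "o", xlab = "Time", ylab = "Simulated autoregressive process")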
Figure 2.4: A simulated autoregressive process Yt = 0.75Yt−1 +et , where et ∼ iid N (0, σe2 ),
n = 150, and σe2 = 1.
Example 2.5. Many time series exhibit seasonal patterns that correspond to different
weeks, months, years, etc. One way to describe seasonal patterns is to use models with
deterministic parts which are trigonometric in nature. Suppose that {et } is a zero mean
white noise process with var(et ) = σe2 . Consider the process defined by
Yt = a sin(2πωt + ϕ) + et .
In this model, a is the amplitude, ω is the frequency of oscillation, and ϕ controls the
phase shift. With a = 2, ω = 1/52 (one cycle per 52 time points), and ϕ = 0.6π, note that µt = E(Yt) = 2 sin(2πt/52 + 0.6π), since E(et) = 0. Also, var(Yt) = var(et) = σe². The mean function, and three realizations
of this process (one realization corresponding to σe2 = 1, σe2 = 4, and σe2 = 16) are
depicted in Figure 2.5.
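A realization of this sinusoidal-plus-noise model is also easy to simulate. The sketch below uses a = 2, ω = 1/52, ϕ = 0.6π, n = 156, and σe² = 1, matching one panel of Figure 2.5 (the code itself is an illustration, not from the notes):

t <- 1:156
mu <- 2 * sin(2 * pi * t / 52 + 0.6 * pi)    # deterministic mean function
Y.sin <- mu + rnorm(156)                     # add N(0,1) white noise
plot(t, Y.sin, type = "o", xlab = "Time", ylab = "Y")
lines(t, mu, lty = 2, lwd = 2)               # overlay the true mean function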
Figure 2.5: Sinusoidal model illustration. Top left: E(Yt ) = 2 sin(2πt/52 + 0.6π). The
other plots are simulated realizations of this process with σe2 = 1 (top right), σe2 = 4
(bottom left), and σe2 = 16 (bottom right). In each simulated realization, n = 156.
2.5 Stationarity
NOTE : Stationarity is a very important concept in the analysis of time series data.
Broadly speaking, a time series is said to be stationary if there is no systematic change
in mean (no trend), if there is no systematic change in variance, and if strictly periodic
variations have been removed. In other words, the properties of one section of the data
are much like those of any other section.
IMPORTANCE : Much of the theory of time series is concerned with stationary time
series. For this reason, time series analysis often requires one to transform a nonstationary
time series into a stationary one to use this theory. For example, it may be of interest
to remove the trend and seasonal variation from a set of data and then try to model
the variation in the residuals (the pieces “left over” after this removal) by means of a
stationary stochastic process.
TERMINOLOGY : The process {Yt} is said to be strictly stationary if the joint distribution of Yt1, Yt2, ..., Ytn is the same as the joint distribution of Yt1−k, Yt2−k, ..., Ytn−k
for all time points t1 , t2 , ..., tn and for all time lags k. In other words, shifting the time
origin by an amount k has no effect on the joint distributions, which must therefore
depend only on the intervals between t1 , t2 , ..., tn . This is a very strong condition.
IMPLICATION : Since the above condition holds for all sets of time points t1 , t2 , ..., tn ,
it must hold when n = 1; i.e., there is only one time point.
• This implies Yt and Yt−k have the same marginal distribution for all t and k.
In particular, E(Yt) = E(Yt−k) and var(Yt) = var(Yt−k), for all t and k.
• Therefore, for a strictly stationary process, both µt = E(Yt ) and γt,t = var(Yt ) are
constant over time.
ADDITIONAL IMPLICATION : Since the above condition holds for all sets of time
points t1 , t2 , ..., tn , it must hold when n = 2; i.e., there are only two time points.
• This implies (Yt , Ys ) and (Yt−k , Ys−k ) have the same joint distribution for all t,
s, and k.
In particular, cov(Yt, Ys) = cov(Yt−k, Ys−k), for all t, s, and k. This means that the covariance between Yt and Ys does not depend on the actual values of t and s; it only depends on the time difference |t − s|.
NEW NOTATION : For a (strictly) stationary process, the covariance γt,s depends only
on the time difference |t − s|. The quantity |t − s| is the distance between time points Yt
and Ys . In other words, the covariance between Yt and any observation k = |t − s| time
points from it only depends on the lag k. Therefore, we write
γk = cov(Yt , Yt−k )
ρk = corr(Yt , Yt−k ).
We use this simpler notation only when we refer to a process which is stationary. Note
that by taking k = 0, we have
γ0 = cov(Yt , Yt ) = var(Yt ).
Also,
ρk = corr(Yt, Yt−k) = γk/γ0.
To summarize, a strictly stationary process satisfies the following:
1. The mean function µt = E(Yt) is constant over time (it does not depend on t).
2. The covariance between any two observations depends only on the time lag between them; i.e., γt,t−k depends only on k (not on t).
REMARK : Strict stationarity is a condition that is much too restrictive for most applica-
tions. Moreover, it is difficult to assess the validity of this assumption in practice. Rather
than impose conditions on all possible (marginal and joint) distributions of a process, we
will use a milder form of stationarity that only deals with the first two moments.
TERMINOLOGY : A stochastic process {Yt} is said to be weakly (second-order) stationary if
1. the mean function µt = E(Yt) is constant over time, and
2. the covariance between any two observations depends only on the time lag between them; i.e., γt,t−k depends only on k (not on t).
Nothing is assumed about the collection of joint distributions of the process. Instead, we
only are specifying the characteristics of the first two moments of the process.
REALIZATION : Clearly, strict stationarity implies weak stationarity. It is also clear that the converse statement is not true, in general. However, if we append the additional assumption of multivariate normality (for the Yt process), then the two definitions do coincide; that is, a weakly stationary Gaussian process is also strictly stationary.
CONVENTION : For the purpose of modeling time series data in this course, we will
rarely (if ever) make the distinction between strict stationarity and weak stationarity.
When we use the term “stationary process,” this is understood to mean that the process
is weakly stationary.
EXAMPLES : We now reexamine the time series models introduced in the last section.
• Suppose that {et } is a white noise process. That is, {et } consists of iid random
variables with E(et ) = µe and var(et ) = σe2 , both constant (free of t). In addition,
the autocovariance function γk = cov(Yt , Yt−k ) is given by
γk = σe², for k = 0, and γk = 0, for k ≠ 0,
which is free of time t (i.e., γk depends only on k). Thus, a white noise process is
stationary.
• Recall the random walk process Yt = Yt−1 + et, where {et} is white noise with E(et) = 0 and var(et) = σe². We calculated µt = E(Yt) = 0, for all t, which is free of t. However, γt,s = cov(Yt, Ys) = tσe², for 1 ≤ t ≤ s, which clearly depends on time t. Thus, a random walk process is not stationary.
• Recall the moving average process Yt = (1/3)(et + et−1 + et−2), where {et} is zero mean white noise with var(et) = σe². We calculated µt = E(Yt) = 0 (which is free of t) and γk = cov(Yt, Yt−k) to be γk = σe²/3, for k = 0; 2σe²/9, for k = 1; σe²/9, for k = 2; and 0, for k > 2.
Because cov(Yt , Yt−k ) is free of time t, this moving average process is stationary.
• Recall the autoregressive process Yt = 0.75Yt−1 + et,
where {et } is zero mean white noise with var(et ) = σe2 . We avoided the calculation
of µt = E(Yt ) and cov(Yt , Yt−k ) for this process, so we will not make a definite
determination here. However, it turns out that if et is independent of Yt−1 , Yt−2 , ...,
and if σe2 > 0, then this autoregressive process is stationary (details coming later).
• Recall the sinusoidal process Yt = a sin(2πωt + ϕ) + et,
where {et } is zero mean white noise with var(et ) = σe2 . Clearly µt = E(Yt ) =
a sin(2πωt + ϕ) is not free of t, so this sinusoidal process is not stationary.
IMPORTANT : In order to start thinking about viable stationary time series models for
real data, we need to have a stationary process. However, as we have just seen, many
data sets exhibit nonstationary behavior. A simple, but effective, technique to convert a
nonstationary process into a stationary one is to examine data differences.
DEFINITION : Consider the process {Yt : t = 0, 1, 2, ..., n}. The (first) difference
process of {Yt } is defined by
∇Yt = Yt − Yt−1, for t = 1, 2, ..., n.
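In R, first differences are computed with diff(). As a quick illustration (a sketch, not from the notes), differencing the random walk simulated earlier recovers a series that behaves like white noise, since ∇Yt = Yt − Yt−1 = et:

dY <- diff(Y)          # Y is the simulated random walk from the earlier sketch
plot(dY, type = "o", xlab = "Time", ylab = "First differences")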
3.1 Introduction
DISCUSSION : In this course, we consider time series models for realizations of a stochas-
tic process {Yt : t = 0, 1, ..., n}. This will largely center around models for stationary
processes. However, as we have seen, many time series data sets exhibit a trend; i.e., a
long-term change in the mean level. We know that such series are not stationary because
the mean changes with time.
• An obvious difficulty with the definition of a trend is deciding what is meant by the
phrase “long-term.” For example, climatic processes can display cyclical variation
over a long period of time, say, 1000 years. However, if one has just 40-50 years of
data, this long-term cyclical pattern might be missed and be interpreted as a trend
which is linear.
• Trends can be “elusive,” and an analyst may mistakenly conjecture that a trend
exists when it really does not. For example, in Figure 2.2 (page 33), we have a
realization of a random walk process
Yt = Yt−1 + et ,
where et ∼ iid N (0, 1). There is no trend in the mean of this random walk process.
Recall that µt = E(Yt ) = 0, for all t. However, it would be easy to incorrectly
assert that true downward and upward trends are present.
• On the other hand, it may be hard to detect trends if the data are very noisy. For
example, the lower right plot in Figure 2.5 (page 38) is a noisy realization of a
sinusoidal process considered in the last chapter. It is easy to miss the true cyclical
structure from looking at the plot.
TREND MODELS : In this chapter, we consider models of the form Yt = µt + Xt,
where µt is a deterministic function that describes the trend and Xt is random error.
Note that if, in addition, E(Xt ) = 0 for all t (a common assumption), then
E(Yt ) = µt
is the mean function for the process {Yt }. In practice, different deterministic trend
functions could be considered. One popular choice is
µt = β0 + β1 t,
which says that the mean function increases (decreases) linearly with time. The quadratic function µt = β0 + β1t + β2t², and, more generally, the kth-order polynomial µt = β0 + β1t + β2t² + ··· + βk t^k, allow for curvature in the trend. Seasonal patterns can be described with trend functions built from sums of trigonometric terms of the form αj cos(ωjt) + βj sin(ωjt), where the αj's and βj's are regression parameters and the ωj's are related to frequencies
of the trigonometric functions cos ωj t and sin ωj t. Fitting these and other deterministic
trend models (and even combinations of them) can be accomplished using the method
of least squares, as we will demonstrate later in this chapter.
LOOKING AHEAD: In this course, we want to deal with stationary time series models
for data. Therefore, if there is a deterministic trend present in the process, we want to
remove it. There are two general ways to do this.
1. Estimate the trend and then subtract the estimated trend from the data (perhaps after transforming the data). Specifically, estimate µt with µ̂t and then model the residuals
X̂t = Yt − µ̂t.
• If the residuals are stationary, we can use a stationary time series model (Chapter 4) to describe their behavior.
• Forecasting takes place by first forecasting the residual process {X̂t} and then inverting the transformations described above to arrive back at forecasts for the original series {Yt}. We will pursue forecasting techniques in Chapter 9.
2. Use differencing to remove the trend; that is, analyze a differenced series (e.g., ∇Yt = Yt − Yt−1) instead of the original series, and model it with a stationary process.
A CONSTANT “TREND”: We first consider the most elementary type of trend, namely, a constant trend. Specifically, we consider the model Yt = µ + Xt,
where µ is constant (free of t) and where E(Xt ) = 0. Note that, under this zero mean
error assumption, we have
E(Yt ) = µ.
That is, the process {Yt } has an overall population mean function µt = µ, for all t. The
most common estimate of µ is
Ȳ = (1/n) Σ_{t=1}^{n} Yt,
the sample mean. It is easy to check that Y is an unbiased estimator of µ; i.e.,
E(Y ) = µ. This is true because
E(Ȳ) = E[(1/n) Σ_{t=1}^{n} Yt] = (1/n) Σ_{t=1}^{n} E(Yt) = (1/n) Σ_{t=1}^{n} µ = nµ/n = µ.
Therefore, under the minimal assumption that E(Xt ) = 0, we see that Y is an unbiased
estimator of µ. To assess the precision of Ȳ as an estimator of µ, we examine var(Ȳ). For a stationary process {Yt} with autocorrelation function ρk, it can be shown that var(Ȳ) = (γ0/n)[1 + 2 Σ_{k=1}^{n−1} (1 − k/n) ρk], where var(Yt) = γ0.
RECALL: If {Yt} is an iid process, that is, Y1, Y2, ..., Yn is an iid (random) sample, then var(Ȳ) = γ0/n.
Therefore, var(Ȳ), in general, can be larger than or smaller than γ0/n depending on the values of the ρk, through the identity (γ0/n)[1 + 2 Σ_{k=1}^{n−1} (1 − k/n) ρk] − γ0/n = (2γ0/n) Σ_{k=1}^{n−1} (1 − k/n) ρk.
Example 3.1. Suppose that {Yt} is the moving average process Yt = (1/3)(et + et−1 + et−2),
where {et } is zero mean white noise with var(et ) = σe2 . In the last chapter, we calculated
γk = σe²/3, for k = 0; 2σe²/9, for k = 1; σe²/9, for k = 2; and 0, for k > 2.
Therefore, ρ1 = γ1/γ0 = (2σe²/9)/(σe²/3) = 2/3 and ρ2 = γ2/γ0 = (σe²/9)/(σe²/3) = 1/3.
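These values can be plugged into the variance formula for Ȳ. The sketch below (an illustration; n = 150 and σe² = 1 are assumptions) evaluates var(Ȳ) = (γ0/n)[1 + 2 Σ_{k=1}^{n−1} (1 − k/n) ρk] for this moving average process; for large n the result is close to 3γ0/n, three times the iid value:

n <- 150
gamma0 <- 1/3                            # gamma_0 = sigma_e^2/3 with sigma_e^2 = 1
rho <- c(2/3, 1/3, rep(0, n - 3))        # rho_1, rho_2, and rho_k = 0 for k > 2
k <- 1:(n - 1)
(gamma0/n) * (1 + 2 * sum((1 - k/n) * rho))   # approximately 3*gamma0/n = 1/150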
Example 3.2. Suppose that {Yt } is a stationary process with autocorrelation function
ρk = ϕk , where −1 < ϕ < 1. For this process, the autocorrelation decays exponentially as
the lag k increases. As we will see in Chapter 4, the first-order autoregressive, AR(1),
process possesses this autocorrelation function. To examine the effect of estimating µ
with Y in this situation, we use an approximation for var(Y ) for large n, specifically,
var(Ȳ) = (γ0/n)[1 + 2 Σ_{k=1}^{n−1} (1 − k/n) ρk] ≈ (γ0/n)[1 + 2 Σ_{k=1}^{∞} ρk].
Since ρk = ϕ^k, we have Σ_{k=1}^{∞} ρk = ϕ/(1 − ϕ), so that, for large n, var(Ȳ) ≈ (γ0/n)[1 + 2ϕ/(1 − ϕ)] = (γ0/n)(1 + ϕ)/(1 − ϕ). For example, with ϕ = −0.6, var(Ȳ) ≈ (γ0/n)(0.4/1.6) = 0.25(γ0/n).
Using Y produces a more precise estimate of µ than in an iid (random) sampling context.
The negative autocorrelations ρ1 = −0.6, ρ3 = (−0.6)3 , etc., “outweigh” the positive ones
ρ2 = (−0.6)2 , ρ4 = (−0.6)4 , etc., making var(Y ) smaller than γ0 /n.
Example 3.3. In Examples 3.1 and 3.2, we considered stationary processes in examining
the precision of Y as an estimator for µ. In this example, we have the same goal, but
we consider the random walk process Yt = Yt−1 + et , where {et } is a zero mean white
noise process with var(et ) = σe2 . As we already know, this process is not stationary, so
we can not use the var(Y ) formula presented earlier. However, recall that this process
can be written out as
Y1 = e1 , Y2 = e1 + e2 , ..., Yn = e1 + e2 + · · · + en ,
so that Ȳ = (1/n) Σ_{t=1}^{n} Yt = (1/n)[n e1 + (n − 1)e2 + (n − 2)e3 + ··· + 2en−1 + 1en].
Therefore, because the et's are independent, var(Ȳ) = (1/n²)[n² var(e1) + (n − 1)² var(e2) + ··· + 2² var(en−1) + 1² var(en)] = (σe²/n²)[1² + 2² + ··· + (n − 1)² + n²] = (σe²/n²)[n(n + 1)(2n + 1)/6] = (σe²/n)[(n + 1)(2n + 1)/6].
• This result is surprising! Note that as n increases, so does var(Y ). That is, av-
eraging a larger sample produces a worse (i.e., more variable) estimate of µ than
averaging a smaller one!!
• This is quite different than the results obtained for stationary processes. The
nonstationarity in the data causes very bad things to happen, even in the relatively
simple task of estimating an overall process mean.
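A small Monte Carlo study illustrates this. The sketch below (an illustration, not from the notes; σe² = 1 and the seed are arbitrary) estimates var(Ȳ) by simulation for two random walk lengths and can be compared with the formula (σe²/n)(n + 1)(2n + 1)/6:

set.seed(1)                                   # arbitrary seed
var.ybar.rw <- function(n, nrep = 5000) {
  # each replication: simulate a random walk of length n, record its sample mean
  var(replicate(nrep, mean(cumsum(rnorm(n)))))
}
var.ybar.rw(50)     # theory: (1/50)*51*101/6   = 17.17
var.ybar.rw(200)    # theory: (1/200)*201*401/6 = 67.17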
RESULT : If {Yt} is a stationary Gaussian process (or if n is large), then Z = (Ȳ − µ)/√(var(Ȳ)) ∼ N(0, 1), at least approximately. Since the sampling distribution of Z does not depend on any unknown parameters, we say that Z is a pivotal quantity (or, more simply, a pivot). If γ0 and the ρk's are known, then a 100(1 − α) percent confidence interval for µ is
Ȳ ± zα/2 √{(γ0/n)[1 + 2 Σ_{k=1}^{n−1} (1 − k/n) ρk]},
where zα/2 is the upper α/2 quantile from the standard normal distribution.
REMARK : Note that if ρk = 0, for all k, then Y ∼ N (µ, γ0 /n), and the confidence
interval formula just presented reduces to
Ȳ ± zα/2 √(γ0/n),
which we recognize as the confidence interval for µ when random sampling is used. The
impact of the autocorrelations ρk on the confidence interval parallels their impact on var(Ȳ). That is,
more negative autocorrelations ρk will make the standard error
se(Ȳ) = √{(γ0/n)[1 + 2 Σ_{k=1}^{n−1} (1 − k/n) ρk]}
smaller, which will make the confidence interval more precise (i.e., shorter). On the other
hand, positive autocorrelations will make this quantity larger, thereby lengthening the
interval, making it less informative.
REMARK : Of course, in real life, rarely will anyone tell us the values of γ0 and the ρk ’s.
These are model (population) parameters. However, if the sample size n is large and
“good” (large-sample) estimates of these quantities can be calculated, we would expect
this interval to be approximately valid when the estimates are substituted in for the true
values. We will talk about estimation of γ0 and the autocorrelations later.
STRAIGHT LINE MODEL: We now consider the deterministic time trend model
Yt = µt + Xt
= β0 + β1t + Xt, where E(Xt) = 0. The least squares estimators β̂0 and β̂1 are the values of β0 and β1 that minimize Q(β0, β1) = Σ_{t=1}^{n} (Yt − β0 − β1t)².
This can be done using a multivariable calculus argument. Specifically, the partial deriva-
tives of Q(β0 , β1 ) are given by
∂Q(β0, β1)/∂β0 = −2 Σ_{t=1}^{n} (Yt − β0 − β1t) and ∂Q(β0, β1)/∂β1 = −2 Σ_{t=1}^{n} t(Yt − β0 − β1t).
Setting these derivatives equal to zero and jointly solving for β0 and β1 , we get
β̂1 = Σ_{t=1}^{n} (t − t̄)Yt / Σ_{t=1}^{n} (t − t̄)² and β̂0 = Ȳ − β̂1 t̄, where t̄ is the sample mean of the time points.
• Under just the mild assumption of E(Xt) = 0, for all t, the least squares estimators are unbiased. That is, E(β̂0) = β0 and E(β̂1) = β1.
• If, in addition, the Xt's are independent with constant variance γ0, then
var(β̂1) = γ0 / Σ_{t=1}^{n} (t − t̄)².
Note that a zero mean white noise process {Xt } satisfies these assumptions.
• If, furthermore, the Xt's are normally distributed, then the least squares estimators are also normally distributed; in particular, β̂1 ∼ N(β1, γ0 / Σ_{t=1}^{n} (t − t̄)²).
Figure 3.1: Global temperature data. The data are a combination of land-air average
temperature anomalies, measured in degrees Centigrade. Time period: 1900-1997.
IMPORTANT : You should recall that these four assumptions on the errors Xt , that
is, zero mean, independence, homoscedasticity, and normality, are the usual as-
sumptions on the errors in a standard regression setting. However, with most time series
data sets, at least one of these assumptions will be violated. The implication, then, is
that standard errors of the estimators, confidence intervals, t tests, probability values,
etc., quantities that are often provided in computing packages (e.g., R, etc.), will not be
meaningful. Proper usage of this output requires the four assumptions mentioned above
to hold. The only instance in which these are exactly true is if {Xt } is a zero-mean
normal white noise process (an assumption you likely made in your previous methods
courses where regression was discussed).
Example 3.4. Consider the global temperature data from Example 1.1 (notes), but let’s
restrict attention to the time period 1900-1997. These data are depicted in Figure 3.1.
Figure 3.2: Global temperature data (1900-1997) with a straight line trend fit.
Over this time period, there is an apparent upward trend in the series. Suppose that we
estimate this trend by fitting the straight line regression model
Yt = β0 + β1 t + Xt ,
for t = 1900, 1901, ..., 1997, where E(Xt ) = 0. Here is the output from fitting this model
in R.
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.219e+01 9.032e-01 -13.49 <2e-16 ***
time(globaltemps.1900) 6.209e-03 4.635e-04 13.40 <2e-16 ***
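For completeness, here is a sketch of how such a fit might be produced in R (assuming globaltemps.1900, the series named in the output above, is stored as a ts object indexed by calendar year):

fit <- lm(globaltemps.1900 ~ time(globaltemps.1900))   # straight line trend fit
summary(fit)                                           # coefficient table as above
plot(globaltemps.1900, type = "o")
abline(fit)                                            # superimpose the fitted line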
Figure 3.3: Global temperature data (1900-1997). Residuals from the straight line trend
model fit.
ANALYSIS : We focus only on the values of the least squares estimates. The fitted regression equation is Ŷt = −12.19 + 0.0062t, for t = 1900, 1901, ..., 1997. This is the equation of the line superimposed over the series in Figure 3.2.
RESIDUALS : The residuals from the model fit are X̂t = Yt − Ŷt, that is, the observed data Yt minus the fitted values Ŷt. In this example (with the straight line model fit), the residuals are given by X̂t = Yt − Ŷt = Yt + 12.19 − 0.0062t,
for t = 1900, 1901, ..., 1997. Remember that one of the main reasons for fitting the
straight line model was to capture the linear trend. Now that we have done this, the
residual process defined by
X̂t = Yt + 12.19 − 0.0062t
contains information in the data that is not accounted for in the straight line trend
model. For this reason, it is called the detrended series. This series is plotted in
Figure 3.3. Essentially, this is a time series plot of the residuals from the straight line fit
versus time, the predictor variable in the model. This detrended series does appear to
be somewhat stationary, at least much more so than the original series {Yt }. However,
just from looking at the plot, it is a safe bet that the residuals are not white noise.
We have learned that taking differences can be an effective means to remove non-
stationary patterns. Doing so here, as evidenced in Figure 3.4, produces a new process
that does appear to be somewhat stationary.
Figure 3.4: Global temperature data (1900-1997). First differences of the global temperature deviations, plotted by year.
POLYNOMIAL TRENDS : More generally, consider the deterministic trend model Yt = µt + Xt = β0 + β1t + β2t² + ··· + βk t^k + Xt, where E(Xt) = 0.
• Unfortunately (without the use of more advanced notation), there are no conve-
nient, closed-form expressions for the least squares estimators when k > 1. This
turns out not to be a major distraction, because we use computing to fit the model
anyway.
• Under the mild assumption that the errors have zero mean; i.e., that E(Xt ) = 0, it
follows that the least squares estimators βb0 , βb1 , βb2 , ..., βbk are unbiased estimators
of their population analogues; i.e., E(βbi ) = βi , for i = 0, 1, 2, ..., k.
• As in the simple linear regression case (k = 1), additional assumptions on the errors
Xt are needed to derive the sampling distribution of the least squares estimators,
namely, independence, constant variance, and normality.
• Regression output (e.g., in R, etc.) is correct only under these additional assump-
tions. The analyst must keep this in mind.
Example 3.5. Data file: gold (TSA). Of all the precious metals, gold is the most
popular as an investment. Like most commodities, the price of gold is driven by supply
and demand as well as speculation. Figure 3.5 contains a time series of n = 254 daily
observations on the price of gold (per troy ounce) in US dollars during the year 2005.
There is a clear nonlinear trend in the data, so a straight-line model would not be
appropriate.
Figure 3.5: Gold price data. Daily price in US dollars per troy ounce: 1/4/05-12/30/05.
In this example, we use R to detrend the data by fitting the quadratic regression
model
Yt = β0 + β1 t + β2 t2 + Xt ,
for t = 1, 2, ..., 254, where E(Xt ) = 0. Here is the output from fitting this model in R.
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.346e+02 1.771e+00 245.38 <2e-16 ***
t -3.618e-01 3.233e-02 -11.19 <2e-16 ***
t.sq 2.637e-03 1.237e-04 21.31 <2e-16 ***
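A sketch of how this quadratic fit might be produced in R (gold is the TSA data set named above; the predictor names t and t.sq match the labels in the output):

library(TSA)               # package accompanying CC
data(gold)
t <- 1:254                 # generic time index
t.sq <- t^2
fit.gold <- lm(gold ~ t + t.sq)
summary(fit.gold)          # coefficient table as above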
Figure 3.6: Gold price data with the quadratic trend model fit superimposed.
ANALYSIS : Again, we focus only on the values of the least squares estimates. The fitted
regression equation is
Ŷt = 434.6 − 0.362t + 0.00264t²,
for t = 1, 2, ..., 254. This fitted model is superimposed over the time series in Figure 3.6.
The detrended series (residual process) is X̂t = Yt − Ŷt, for t = 1, 2, ..., 254, and is depicted in Figure 3.7. This detrended series appears to be
somewhat stationary, at least, much more so than the original time series. However, it
should be obvious that the detrended (residual) process is not white noise. There is still
an enormous amount of momentum left in the residuals. Of course, we know that this
renders most of the R output on the previous page meaningless.
Figure 3.7: Gold price data. Residuals from the quadratic trend fit.
SEASONAL MEANS : For monthly seasonal data, consider the model Yt = µt + Xt, where E(Xt) = 0 and the mean function µt takes one of 12 values β1, β2, ..., β12, with µt = βi whenever time t falls in month i.
The regression parameters β1 , β2 , ..., β12 are fixed constants. This is called a seasonal
means model. This model does not take the shape of the seasonal trend into account;
instead, it merely says that observations 12 months apart have the same mean, and
this mean does not change through time. Other seasonal means models with a different
number of parameters could be specified. For instance, for quarterly data, we could
use a mean function with 4 regression parameters β1 , β2 , β3 , and β4 .
FITTING THE MODEL: We can still use least squares to fit the seasonal means model.
The least squares estimates of the regression parameters are simple to compute, but
difficult to write mathematically. In particular,
β̂i = (1/ni) Σ_{t∈Ai} Yt, where the set Ai = {t : t = i + 12j, j = 0, 1, 2, ...}, for i = 1, 2, ..., 12, and ni is the number of observations in month i.
Example 3.6. Data file: beersales (TSA). The data in Figure 3.8 are monthly beer
sales (in millions of barrels) in the United States from 1/80 through 12/90. This time
series has a relatively constant mean overall (i.e., there are no apparent linear trends and
the repeating patterns are relatively constant over time), so a seasonal means model may
be appropriate. Fitting the model can be done in R; here are the results.
Coefficients:
Estimate Std. Error t value Pr(>|t|)
January 13.1608 0.1647 79.90 <2e-16 ***
February 13.0176 0.1647 79.03 <2e-16 ***
March 15.1058 0.1647 91.71 <2e-16 ***
Figure 3.8: Monthly US beer sales from 1980-1990. The data are measured in millions
of barrels.
DISCUSSION : The only quantities that have relevance are the least squares estimates.
The estimate βbi is simply the sample mean of the observations for month i; thus, βbi is an
unbiased estimate of the ith (population) mean monthly sales βi . The test statistics and
p-values are used to test H0 : βi = 0, a largely nonsensical hypothesis in this example.
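A sketch of how the seasonal means fit might be carried out in R, using the season() function from the TSA package (it creates a factor of month labels; removing the intercept with -1 gives one mean per month, matching the output above):

library(TSA)
data(beersales)
month. <- season(beersales)            # factor: January, February, ..., December
fit.sm <- lm(beersales ~ month. - 1)   # no intercept: one mean per month
summary(fit.sm)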
Figure 3.9: Beer sales data. Residuals from the seasonal means model fit.
RESIDUALS : A plot of the residuals from the seasonal means model fit, that is, X̂t = Yt − Ŷt = Yt − Σ_{i=1}^{12} β̂i IAi(t), is in Figure 3.9. The expression Σ_{i=1}^{12} β̂i IAi(t), where IAi(·) is the indicator function of the set Ai, is simply the sample mean for the observations from the same month as time t.
somewhat stationary, although I can detect a slightly increasing trend.
REMARK : The seasonal means model is somewhat simplistic in that it does not take the
shape of the seasonal trend into account. We now consider a more elaborate regression
equation that can be used to model data with seasonal trends.
COSINE TRENDS : Consider the deterministic trend model Yt = µt + Xt = β cos(2πft + Φ) + Xt, where E(Xt) = 0 and β is the amplitude of the cosine trend.
• f is the frequency =⇒ 1/f is the period (the time it takes to complete one
full cycle of the function). For monthly data, the period is 12 months; i.e., the
frequency is f = 1/12.
• Φ controls the phase shift. This represents a horizontal shift in the mean function.
MODEL FITTING: Fitting this model is difficult unless we transform the mean function into a simpler expression. We use the trigonometric identity cos(A + B) = cos(A)cos(B) − sin(A)sin(B) to write β cos(2πft + Φ) = β1 cos(2πft) + β2 sin(2πft), where β1 = β cos(Φ) and β2 = −β sin(Φ). The reparameterized mean function is a linear function of β1 and β2, where cos(2πft) and sin(2πft) play the roles of predictor
variables. Adding an intercept term for flexibility, say β0 , we get
Yt = β0 + β1 cos(2πf t) + β2 sin(2πf t) + Xt .
REMARK : When we fit this model, we must be aware of the values used for the time t,
as it has a direct impact on how we specify the frequency f . For example,
• if we have monthly data and use the generic time specification t = 1, 2, ..., 12, 13, ...,
then we specify f = 1/12.
• if we have monthly data, but we use the years themselves as predictors; i.e., t =
1990, 1991, 1992, etc., we use f = 1, because 12 observations arrive each year.
Example 3.6 (continued). We now use R to fit the cosine trend model to the beer sales
data. Because the predictor variable t is measured in years 1980, 1981, ..., 1990 (with 12
observations each year), we use f = 1. Here is the output:
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 14.80446 0.05624 263.25 <2e-16 ***
har.cos(2*pi*t) -2.04362 0.07953 -25.70 <2e-16 ***
har.sin(2*pi*t) 0.92820 0.07953 11.67 <2e-16 ***
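A sketch of how this cosine trend fit might be carried out in R, using harmonic() from the TSA package to build the cos(2πt) and sin(2πt) predictors (this is consistent with the har.cos and har.sin labels in the output above):

har. <- harmonic(beersales, 1)                    # cos(2*pi*t) and sin(2*pi*t) terms
fit.cos <- lm(beersales ~ har.)
summary(fit.cos)
plot(beersales, type = "o")
lines(as.numeric(time(beersales)), fitted(fit.cos), lty = 2)   # fitted cosine trend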
The fitted regression equation Ŷt = 14.80 − 2.04 cos(2πt) + 0.93 sin(2πt) is superimposed over the data in Figure 3.10. The least squares estimates β̂0 = 14.8, β̂1 = −2.04, and β̂2 = 0.93 are the only useful pieces of information in the output.
The residual process from this fit is X̂t = Yt − Ŷt,
Figure 3.10: Beer sales data with a cosine trend model fit.
which is depicted in Figure 3.11. The residuals from the cosine trend fit appear to be
somewhat stationary, but are probably not white noise.
REMARK : The seasonal means and cosine trend models are competing models; that is,
both models are useful for seasonal data.
• The cosine trend model is more parsimonious; i.e., it is a simpler model because
there are 3 regression parameters to estimate. On the other hand, the (monthly)
seasonal means model has 12 parameters that need to be estimated!
• Remember, regression parameters (in any model) are estimated with the data. The
more parameters we have in a model, the more data we need to use to estimate them.
This leaves us with less information to estimate other quantities (e.g., residual
variance, etc.). In the end, we have regression estimates that are less precise.
• The mathematical argument on pp 36-39 (CC) should convince you of this result.
Figure 3.11: Beer sales data. Residuals from the cosine trend model fit.
SUMMARY : In the deterministic trend model Yt = µt + Xt, recall that
• for least squares estimates to be unbiased, all we need is E(Xt ) = 0, for all t.
• for the variances of the least squares estimates (and standard errors) seen in R
output to be meaningful, we need E(Xt ) = 0, {Xt } independent, and var(Xt ) = γ0
(constant). These assumptions are met if {Xt } is a white noise process.
• for t tests and probability values to be valid, we need the last three assumptions to
hold; in addition, normality is needed on the error process {Xt }.
RESIDUAL VARIANCE : An estimate of the error variance γ0 = var(Xt) is given by the residual mean square S² = (1/(n − p)) Σ_{t=1}^{n} (Yt − µ̂t)², where p is the number of regression parameters in the model.
• The smaller S is, the better the fit of the model. Therefore, in comparing two model fits (for two different models), we can look at the value of S in each model to judge which model may be preferred (caution is needed in doing this).
• The larger S is, the noisier the error process likely is. This makes the least squares
estimates more variable and predictions less precise.
RESULT : For any data set {Yt : t = 1, 2, ..., n}, we can write algebraically
Σ_{t=1}^{n} (Yt − Ȳ)² = Σ_{t=1}^{n} (Ŷt − Ȳ)² + Σ_{t=1}^{n} (Yt − Ŷt)², that is, SST = SSR + SSE.
These quantities are called sums of squares and form the basis for the following anal-
ysis of variance (ANOVA) table.
Source    df       SS     MS                     F
Model     p − 1    SSR    MSR = SSR/(p − 1)      F = MSR/MSE
Error     n − p    SSE    MSE = SSE/(n − p)
Total     n − 1    SST
A related summary measure is R² = SSR/SST = 1 − SSE/SST, the coefficient of determination. The larger R² is, the better the deterministic part of the model explains the variability in the data. Clearly, 0 ≤ R² ≤ 1.
IMPORTANT : It is critical to understand what R2 does and does not measure. Its value
is computed under the assumption that the deterministic trend model is correct and
assesses how much of the variation in the data may be attributed to that relationship
rather than just to inherent variation.
• If R2 is small, it may be that there is a lot of random inherent variation in the data,
so that, although the deterministic trend model is reasonable, it can only explain
so much of the observed overall variation.
• Alternatively, R2 may be close to 1, but a particular model may not be the best
model. In fact, R2 could be very “high,” but not relevant because a better model
may exist.
R̄² = 1 − [SSE/(n − p)] / [SST/(n − 1)].
This is called the adjusted R2 statistic. It is useful for comparing models with different
numbers of parameters.
RESIDUAL ANALYSIS : Consider again the deterministic trend model Yt = µt + Xt,
where E(Xt ) = 0. In this chapter, we have talked about using the method of least squares
to fit models of this type (e.g., straight line regression, polynomial regression, seasonal
means, cosine trends, etc.). The fitted model is Ŷt = µ̂t and the residual process is X̂t = Yt − Ŷt.
The residuals from the model fit are important. In essence, they serve as proxies (pre-
dictions) for the true errors Xt , which are not observed. The residuals can help us learn
about the validity of the assumptions made in our model.
STANDARDIZED RESIDUALS : If the model above is fit using least squares (and there
is an intercept term in the model), then algebraically,
Σ_{t=1}^{n} (Yt − Ŷt) = 0,
that is, the sum of the residuals is equal to zero. Thus, the residuals have mean zero and
the standardized residuals, defined by
X̂t* = X̂t / S,
are unitless quantities. If desired, we can use the standardized residuals for model diag-
nostic purposes. The standardized residuals defined here are not exactly zero mean, unit
variance quantities, but they are approximately so. Thus, if the model is adequate, we
would expect most standardized residuals to fall between −3 and 3.
NORMALITY : If the error process {Xt } is normally distributed, then we would expect
the residuals to also be approximately normally distributed. We can therefore diag-
nose this assumption by examining the (standardized) residuals and looking for evidence
of normality. We can use histograms and normal probability plots (also known as
quantile-quantile, or qq plots) to do this.
• A normal probability plot is a scatterplot of the ordered residuals X̂t (or standardized residuals X̂t*) versus the ordered theoretical normal quantiles (or normal scores).
The idea behind this plot is simple. If the residuals are normally distributed, then
plotting them versus the corresponding normal quantiles (i.e., values from a normal
distribution) should produce a straight line (or at least close).
Example 3.4 (continued). In Example 3.4, we fit a straight line trend model to the
global temperature data. Below are the histogram and qq plot for the standardized
residuals. Does normality seem to be supported?
Histogram (left) and normal qq plot (right) of the standardized residuals from the straight line trend fit to the global temperature data.
SHAPIRO-WILK TEST : Histograms and qq plots provide only visual evidence of nor-
mality. The Shapiro-Wilk test is a formal hypothesis test that can be used to test H0 : the error process is normally distributed versus H1 : the error process is not normally distributed. The test is carried out by calculating a statistic W approximately equal to the sample
correlation between the ordered (standardized) residuals and the normal scores. The
higher this correlation, the higher the value of W . Therefore, small values of W are
evidence against H0 . The null distribution of W is very complicated, but probability
values (p-values) are produced in R automatically. If the p-value is smaller than the
significance level for the test (e.g., α = 0.05, etc.), then we reject H0 and conclude that
there is a violation in the normality assumption. Otherwise, we do not reject H0 .
Example 3.4 (continued). In Example 3.4, we fit a straight line trend model to the
global temperature data. The Shapiro-Wilk test on the standardized residuals produces
the following output:
> shapiro.test(rstudent(fit))
Shapiro-Wilk normality test
data: rstudent(fit)
W = 0.9934, p-value = 0.915
Because the p-value for the test is not small, we do not reject H0 . This test does not
provide evidence of non-normality for the standardized residuals.
INDEPENDENCE : Plotting the residuals versus time can provide visual insight on
whether or not the (standardized) residuals exhibit independence (although it is often
easier to detect gross violations of independence). Residuals that “hang together” are not
what we would expect to see from a sequence of independent random variables. Similarly,
residuals that oscillate back and forth too notably also do not resemble this sequence.
RUNS TEST : A runs test is a nonparametric test which calculates the number of runs
in the (standardized) residuals. The formal test is of H0: the (standardized) residuals are independent versus H1: the (standardized) residuals are not independent.
Figure 3.12: Standardized residuals from the straight line trend model fit for the global
temperature data. A horizontal line at zero has been added.
In particular, the test examines the (standardized) residuals in sequence to look for
patterns that would give evidence against independence. Runs above or below 0 (the
approximate median of the residuals) are counted.
• A small number of runs would indicate that neighboring values are positively
dependent and tend to hang together over time.
• Too many runs would indicate that the data oscillate back and forth across their
median. This suggests that neighboring residuals are negatively dependent.
• Therefore, either too few or too many runs lead us to reject independence.
Example 3.4 (continued). In Example 3.4, we fit a straight line trend model to the
global temperature data. A runs test on the standardized residuals produces the following
output:
> runs(rstudent(fit))
$pvalue
[1] 3.65e-06
$observed.runs
[1] 27
$expected.runs
[1] 49.81633
The p-value for the test is extremely small, so we would reject H0 . The evidence points
to the standardized residuals being not independent. The R output also produces the ex-
pected number of runs (computed under the assumption of independence). The observed
number of runs is far lower than the expected number, which does not support independence.
NOTE: When the number of residuals is large, the number of runs R is approximately normal under independence; that is, R ∼ AN(µR, σ²R), where
µR = 2r1r2/r + 1   and   σ²R = 2r1r2(2r1r2 − r) / [r²(r − 1)],
r1 is the number of residuals above the median, r2 is the number of residuals below the median, and r = r1 + r2.
Therefore, values of
Z = |R − µR| / σR > zα/2
lead to the rejection of H0 . The notation zα/2 denotes the upper α/2 quantile of the
N (0, 1) distribution.
RECALL: Consider the stationary stochastic process {Yt : t = 1, 2, ..., n}. In Chapter 2,
we defined the autocorrelation function to be
ρk = corr(Yt, Yt−k) = γk/γ0 ,
where γk = cov(Yt , Yt−k ) and γ0 = var(Yt ). Perhaps more aptly named, ρk is the pop-
ulation autocorrelation function because it depends on the true parameters for the
process {Yt }. In real life (that is, with real data) these population parameters are un-
known, so we don’t get to know the true ρk . However, we can estimate it. This leads to
the definition of the sample autocorrelation function.
TERMINOLOGY : For a set of time series data Y1 , Y2 , ..., Yn , we define the sample
autocorrelation function, at lag k, by
rk = Σ_{t=k+1}^{n} (Yt − Ȳ)(Yt−k − Ȳ) / Σ_{t=1}^{n} (Yt − Ȳ)² ,
where Y is the sample mean of Y1 , Y2 , ..., Yn (i.e., all the data are used to compute Y ).
The sample version rk is a point estimate of the true (population) autocorrelation ρk .
NOTE: We can also compute the sample autocorrelation function of the standardized residuals, namely,
r*k = Σ_{t=k+1}^{n} (X̂*t − X̄*)(X̂*t−k − X̄*) / Σ_{t=1}^{n} (X̂*t − X̄*)² .
Note that when the sum of the standardized residuals equals zero (which occurs when
least squares is used and when an intercept is included in the model), we also have
X̄* = 0. Therefore, the formula above reduces to
r*k = Σ_{t=k+1}^{n} X̂*t X̂*t−k / Σ_{t=1}^{n} (X̂*t)² .
RESULT: If the (standardized) residuals behave like a white noise process, then r*k ∼ AN(0, 1/n), for n large. The notation AN is read “approximately normal.” For k ≠ l, it also turns out that cov(r*k, r*l) ≈ 0. These facts are established in Chapter 6.
• If the standardized residuals are truly white noise, then we would expect r*k to fall within 2 standard errors of 0. That is, values of r*k within ±2/√n are within the margin of error under the white noise assumption.
• Values of r*k larger than 2/√n in absolute value are outside the margin of error, and, thus, are not consistent with what we would see from a white noise process.
More specifically, this would suggest that there is dependence (autocorrelation) at
lag k in the standardized residual process.
GRAPHICAL TOOL: The plot of rk (or r*k if we are examining standardized residuals) versus k is called a correlogram. If we are assessing whether or not the process is white noise, it is helpful to put horizontal dashed lines at ±2/√n so we can easily see if the sample autocorrelations fall outside the margin of error.
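R SKETCH: The following minimal sketch shows one way to produce such a correlogram for the standardized residuals from a fitted trend model; the object name fit is a placeholder for a model produced by lm().
resid.std <- rstudent(fit)       # standardized (studentized) residuals from the lm fit
n <- length(resid.std)
acf(resid.std, main="Sample ACF of standardized residuals")
# acf() draws dashed horizontal bounds at roughly +/- 1.96/sqrt(n) by default,
# i.e., the white noise margin of error described above.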
Example 3.4 (continued). In Example 3.4, we fit a straight line trend model to the global temperature data. In Figure 3.13, we display the correlogram for the standardized residuals {X̂*t} from the straight line fit.
• Note that many of the sample estimates r*k fall outside the ±2/√n margin of error cutoff. These residuals likely do not resemble a white noise process.
Figure 3.13: Global temperature data. Sample autocorrelation function for the standard-
ized residuals from the straight line model fit.
SIMULATION EXERCISE : Let’s generate some white noise processes and examine their
sample autocorrelation functions! Figure 3.14 (left) displays two simulated white noise
processes et ∼ iid N (0, 1), where n = 100. With n = 100, the margin of error for each
sample autocorrelation rk is
margin of error = ±2/√100 = ±0.2.
Figure 3.14 (right) displays the sample correlograms (one for each simulated white noise
series) with horizontal lines at the ±0.2 margin of error cutoffs. Even though the gener-
ated data are truly white noise, we still do see some values of rk (one for each realization)
Figure 3.14: Two simulated standard normal white noise processes with their associated
sample autocorrelation functions.
that fall outside the margin of error cutoffs. Why does this happen?
• In essence, every time we compare rk to its margin of error cutoffs ±2/√n, we are
performing a hypothesis test, namely, we are testing H0 : ρk = 0 at a significance
level of approximately α = 0.05.
• When you are interpreting correlograms, keep this in mind. If there are patterns
in the values of rk and many which extend beyond the margin of error (especially
at early lags), the series is probably not white noise. On the other hand, a stray
statistically significant value of rk at, say, lag k = 17 is likely just a false alarm.
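R SKETCH: A minimal sketch of the simulation exercise above: generate standard normal white noise and examine its sample ACF against the ±0.2 cutoffs. The seed value is arbitrary and is included only so the sketch is reproducible.
set.seed(520)
e <- rnorm(100)                  # simulated white noise, n = 100
plot.ts(e, ylab="White noise process")
acf(e, main="Sample ACF")        # dashed bounds near +/- 0.2 appear automatically
# Roughly 5 percent of the r_k values should fall outside the bounds by chance alone.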
4.1 Introduction
RECALL: In the last chapter, we used regression to “detrend” time series data with
the hope of removing non-stationary patterns and producing residuals that resembled
a stationary process. We also learned that differencing can be an effective technique
to transform a non-stationary process into one which is stationary. In this chapter, we
consider (linear) time series models for stationary processes. Recall that stationary time
series are those whose statistical properties do not change over time.
TERMINOLOGY : Suppose {et } is a zero mean white noise process with var(et ) = σe2 .
A general linear process is defined by
Yt = et + Ψ1 et−1 + Ψ2 et−2 + Ψ3 et−3 + ··· .
That is, Yt , the value of the process at time t, is a weighted linear combination of white
noise terms at the current and past times. The processes that we examine in this chapter
are special cases of this general linear process. In general, E(Yt ) = 0 and
γk = cov(Yt, Yt−k) = σe² Σ_{i=0}^{∞} Ψi Ψi+k , for k ≥ 0.
• For mathematical reasons (to ensure stationarity), we will assume that the Ψi's are square summable, that is, Σ_{i=1}^{∞} Ψi² < ∞.
• A nonzero mean µ could be added to the right-hand side of the general linear
process above; this would not affect the stationarity properties of {Yt }. Therefore,
there is no harm in assuming that the process {Yt } has zero mean.
TERMINOLOGY : Suppose {et } is a zero mean white noise process with var(et ) = σe2 .
The process
Yt = et − θ1 et−1 − θ2 et−2 − · · · − θq et−q
is called a moving average process of order q, denoted by MA(q). Note that this
is a special case of the general linear process with Ψ0 = 1, Ψ1 = −θ1 , Ψ2 = −θ2 , ...,
Ψq = −θq , and Ψq∗ = 0 for all q ∗ > q.
SPECIAL CASE: Taking q = 1 gives the MA(1) process
Yt = et − θet−1 .
The mean is E(Yt) = 0 and the variance is
γ0 = var(Yt) = var(et − θet−1) = σe²(1 + θ²).
The lag 1 autocovariance is
γ1 = cov(Yt, Yt−1) = cov(et − θet−1, et−1 − θet−2)
= −θvar(et−1) = −θσe² ,
so that the lag 1 autocorrelation is ρ1 = γ1/γ0 = −θ/(1 + θ²).
For any lag k > 1, γk = cov(Yt , Yt−k ) = 0, because no white noise subscripts in Yt and
Yt−k will overlap.
IMPORTANT : The MA(1) process has zero correlation beyond lag k = 1! Ob-
servations one time unit apart are correlated, but observations more than one time unit
apart are not. This is important to keep in mind when we entertain models for real data
using empirical evidence (e.g., sample autocorrelations rk , etc.).
• The largest ρ1 can be is 0.5 (when θ = −1) and the smallest ρ1 can be is −0.5
(when θ = 1). Therefore, if we were to observe a sample lag 1 autocorrelation r1
that was well outside [−0.5, 0.5], this would be inconsistent with the MA(1) model.
Figure 4.1: Upper left: MA(1) simulation with θ = −0.9, n = 100, and σe2 = 1. Upper
right: Sample autocorrelation function rk . Lower left: Scatterplot of Yt versus Yt−1 .
Lower right: Scatterplot of Yt versus Yt−2 .
Example 4.1. We use R to simulate the MA(1) process Yt = et − θet−1 , where θ = −0.9,
n = 100, and et ∼ iid N (0, 1).
• Note that
θ = −0.9 =⇒ ρ1 = −(−0.9)/[1 + (−0.9)²] ≈ 0.497.
• The sample ACF in Figure 4.1 (upper right) looks like what we would expect from the MA(1) theory. There is a pronounced “spike” at k = 1 in the sample ACF and little action elsewhere (for k > 1). The error bounds at ±2/√100 = 0.2 correspond to those for a white noise process; not an MA(1) process.
Figure 4.2: Upper left: MA(1) simulation with θ = 0.9, n = 100, and σe2 = 1. Upper
right: Sample autocorrelation function rk . Lower left: Scatterplot of Yt versus Yt−1 .
Lower right: Scatterplot of Yt versus Yt−2 .
• The lag 1 scatterplot; i.e., the scatterplot of Yt versus Yt−1 , shows a moderate
increasing linear relationship. This is expected because of the moderately strong
positive lag 1 autocorrelation.
• The lag 2 scatterplot; i.e., the scatterplot of Yt versus Yt−2 , shows no linear rela-
tionship. This is expected because ρ2 = 0 for an MA(1) process.
• Figure 4.2 displays a second MA(1) simulation, except with θ = 0.9. In this model,
ρ1 ≈ −0.497 and ρk = 0, for all k > 1. Compare Figure 4.2 with Figure 4.1.
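R SKETCH: One way the MA(1) simulations in Figures 4.1 and 4.2 can be reproduced is with arima.sim; note that arima.sim writes the MA part with a plus sign, so θ = −0.9 in the notes' parameterization corresponds to ma = 0.9 in R.
set.seed(1)
y <- arima.sim(model=list(ma=0.9), n=100)   # Y_t = e_t - theta*e_{t-1} with theta = -0.9
plot.ts(y, ylab="Y")                         # simulated series
acf(y, main="Sample ACF")                    # expect a spike near rho_1 = 0.497 at lag 1
plot(y[1:99], y[2:100], xlab="Y(t-1)", ylab="Y(t)")   # lag 1 scatterplot
plot(y[1:98], y[3:100], xlab="Y(t-2)", ylab="Y(t)")   # lag 2 scatterplot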
TERMINOLOGY : Suppose {et } is a zero mean white noise process with var(et ) = σe2 .
The process
Yt = et − θ1 et−1 − θ2 et−2
is a moving average process of order 2, denoted by MA(2). For this process, the
mean is E(Yt) = 0. The variance is
γ0 = var(Yt) = σe²(1 + θ1² + θ2²),
and the autocorrelations are
ρ1 = (−θ1 + θ1θ2)/(1 + θ1² + θ2²)   and   ρ2 = −θ2/(1 + θ1² + θ2²).
IMPORTANT : The MA(2) process has zero correlation beyond lag k = 2! Ob-
servations 1 or 2 time units apart are correlated. Observations more than two time units
apart are not correlated.
EXAMPLE: We use R to simulate the MA(2) process
Yt = et − θ1 et−1 − θ2 et−2 ,
where θ1 = 0.9, θ2 = −0.7, n = 100, and et ∼ iid N(0, 1). For this process,
ρ1 = (−θ1 + θ1θ2)/(1 + θ1² + θ2²) = [−0.9 + (0.9)(−0.7)]/[1 + (0.9)² + (−0.7)²] ≈ −0.665
and
ρ2 = −θ2/(1 + θ1² + θ2²) = −(−0.7)/[1 + (0.9)² + (−0.7)²] ≈ 0.304.
Figure 4.3 displays the simulated MA(2) time series, the sample ACF, and the lag 1
and 2 scatterplots. There are pronounced “spikes” at k = 1 and k = 2 in the sample
ACF and little action elsewhere (for k > 2). The lagged scatterplots display negative
(positive) autocorrelation at lag 1 (2). All of these observations are consistent with the MA(2) theory. Note that the error bounds at ±2/√100 = 0.2 correspond to those for a white noise process; not an MA(2) process.
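R CHECK: The theoretical values ρ1 ≈ −0.665 and ρ2 ≈ 0.304 can be verified numerically with ARMAacf; again, the signs of the MA coefficients must be flipped relative to the notes' parameterization.
# Population ACF of Y_t = e_t - 0.9 e_{t-1} + 0.7 e_{t-2}
# (theta1 = 0.9, theta2 = -0.7 in the notes; ma = c(-0.9, 0.7) in R's convention)
ARMAacf(ma=c(-0.9, 0.7), lag.max=5)
# returns 1, -0.665, 0.304, 0, 0, 0 for lags 0 through 5, i.e., zero beyond lag 2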
Figure 4.3: Upper left: MA(2) simulation with θ1 = 0.9, θ2 = −0.7, n = 100, and σe2 = 1.
Upper right: Sample autocorrelation function rk . Lower left: Scatterplot of Yt versus
Yt−1 . Lower right: Scatterplot of Yt versus Yt−2 .
MODEL: Suppose {et } is a zero mean white noise process with var(et ) = σe2 . The MA(q)
process is
Yt = et − θ1 et−1 − θ2 et−2 − · · · − θq et−q .
The mean is E(Yt) = 0 and the variance is
γ0 = var(Yt) = σe²(1 + θ1² + θ2² + ··· + θq²).
The autocorrelation function is
ρk = (−θk + θ1θk+1 + θ2θk+2 + ··· + θq−kθq)/(1 + θ1² + θ2² + ··· + θq²), for k = 1, 2, ..., q.
The salient feature is that the (population) ACF ρk is nonzero for lags k = 1, 2, ..., q.
For all lags k > q, the ACF ρk = 0.
TERMINOLOGY : Suppose {et } is a zero mean white noise process with var(et ) = σe2 .
The process
Yt = ϕ1 Yt−1 + ϕ2 Yt−2 + ··· + ϕp Yt−p + et
is called an autoregressive process of order p, denoted by AR(p).
• In this model, the value of the process at time t, Yt , is a weighted linear combination
of the values of the process from the previous p time points plus a “shock” or
“innovation” term et at time t.
• This process (assuming that it is stationary) is a special case of the general linear
process defined at the beginning of this chapter.
SPECIAL CASE: Consider the process
Yt = ϕYt−1 + et .
This is an AR(1) process. Note that if ϕ = 1, this process reduces to a random walk.
If ϕ = 0, this process reduces to white noise.
VARIANCE : Assuming that this process is stationary (it isn’t always), the variance of
Yt can be obtained in the following way. In the AR(1) equation, take variances of both
sides to get
var(Yt) = var(ϕYt−1 + et) = ϕ² var(Yt−1) + σe² ,
because et is independent of Yt−1. Because the process is stationary, var(Yt) = var(Yt−1) = γ0, and solving γ0 = ϕ²γ0 + σe² gives
γ0 = σe²/(1 − ϕ²).
AUTOCOVARIANCE: Next, multiply both sides of the AR(1) equation by Yt−k (for k ≥ 1) and take expectations; note that E(et Yt−k) = 0, E(Yt Yt−k) = γk, and E(Yt−1 Yt−k) = γk−1 for this zero mean process.
From these two observations, we have established the following (recursive) relationship
for an AR(1) process:
γk = ϕγk−1 .
When k = 1,
γ1 = ϕγ0 = ϕ σe²/(1 − ϕ²).
When k = 2,
γ2 = ϕγ1 = ϕ² σe²/(1 − ϕ²).
This pattern continues for larger k. In general, the autocovariance function for an AR(1) process is
γk = ϕ^k σe²/(1 − ϕ²), for k = 0, 1, 2, ... .
AUTOCORRELATION: For an AR(1) process,
ρk = γk/γ0 = [ϕ^k σe²/(1 − ϕ²)] / [σe²/(1 − ϕ²)] = ϕ^k, for k = 0, 1, 2, ... .
IMPORTANT : For an AR(1) process, because −1 < ϕ < 1, the (population) ACF
ρk = ϕk decays exponentially as k increases.
• If ϕ is not close to ±1, then the decay will take place rapidly.
• If ϕ < 0, then the autocorrelations will alternate from negative (k = 1), to positive
(k = 2), to negative (k = 3), to positive (k = 4), and so on.
• Remember these theoretical patterns so that when we see sample ACFs (from real
data!), we can make sensible decisions about potential model selection.
Figure 4.4: Population ACFs for AR(1) processes. Upper left: ϕ = 0.9. Upper right:
ϕ = −0.9. Lower left: ϕ = 0.5. Lower right: ϕ = −0.5.
EXAMPLE: We use R to simulate four AR(1) processes
Yt = ϕYt−1 + et ,
where et ∼ iid N(0, 1), n = 100, and ϕ = 0.9, −0.9, 0.5, −0.5 (one value for each realization). These choices of ϕ are consistent with those in Figure 4.4, which depicts the true (population) AR(1) autocorrelation functions.
Figure 4.5: AR(1) simulations with n = 100 and σe2 = 1. Upper left: ϕ = 0.9. Upper
right: ϕ = −0.9. Lower left: ϕ = 0.5. Lower right: ϕ = −0.5.
• In Figure 4.5, note the differences between the series on the left (ϕ > 0) and the
series on the right (ϕ < 0).
– When ϕ > 0, the series tends to “hang together” (since ρ1 > 0).
– When ϕ < 0, the series tends to oscillate back and forth across its mean (since ρ1 < 0).
• In Figure 4.6, we display the sample autocorrelation functions. Compare the sample
ACFs to the theoretical ACFs in Figure 4.4. The fact that these figures do not agree
completely is a byproduct of the sample autocorrelations rk exhibiting sampling
variability. The error bounds at ±2/√100 = 0.2 correspond to those for a white noise process; not an AR(1) process.
Figure 4.6: Sample ACFs for AR(1) simulations with n = 100 and σe2 = 1. Upper left:
ϕ = 0.9. Upper right: ϕ = −0.9. Lower left: ϕ = 0.5. Lower right: ϕ = −0.5.
OBSERVATION: Suppose {et} is a zero mean white noise process with var(et) = σe². We now show that the (stationary) AR(1) process
Yt = ϕYt−1 + et
can be written as a general linear process. Substituting repeatedly for the lagged values of the process, we have
Yt = ϕYt−1 + et
= ϕ(ϕYt−2 + et−1) + et
= et + ϕet−1 + ϕ²Yt−2 .
Continuing this substitution indefinitely (which is justified when |ϕ| < 1) gives
Yt = et + ϕet−1 + ϕ²et−2 + ϕ³et−3 + ··· .
Therefore, the AR(1) process is a special case of the general linear process with Ψj = ϕ^j, for j = 0, 1, 2, ... .
STATIONARITY: It can be shown that the AR(1) process
Yt = ϕYt−1 + et
is stationary if and only if |ϕ| < 1, that is, if −1 < ϕ < 1. If |ϕ| ≥ 1, then the AR(1)
process is not stationary.
TERMINOLOGY : Suppose {et } is a zero mean white noise process with var(et ) = σe2 .
The AR(2) process is
Yt = ϕ1 Yt−1 + ϕ2 Yt−2 + et .
• The current value of the process, Yt , is a weighted linear combination of the values
of the process from the previous two time periods, plus a random innovation (error)
at the current time.
• Just as the AR(1) model requires certain conditions for stationarity, the AR(2)
model does too. A thorough discussion of stationarity for the AR(2) model, and
higher order AR models, becomes very theoretical. We highlight only the basic
points.
NOTATION: Define the operator B by
BYt = Yt−1 ,
that is, B “backs up” the current value Yt one time unit to Yt−1. For this reason, we call B the backshift operator. Similarly,
B²Yt = B(BYt) = BYt−1 = Yt−2 .
In general, B^k Yt = Yt−k. Using this new notation, we can rewrite the AR(2) model
Yt = ϕ1 Yt−1 + ϕ2 Yt−2 + et
as
Yt − ϕ1 BYt − ϕ2 B²Yt = et ⇐⇒ (1 − ϕ1 B − ϕ2 B²)Yt = et .
Finally, treating the B as a dummy variable for algebraic reasons (and using the more
conventional algebraic symbol x), we define the AR(2) characteristic polynomial as
ϕ(x) = 1 − ϕ1 x − ϕ2 x²
and the corresponding AR(2) characteristic equation as
ϕ(x) = 1 − ϕ1 x − ϕ2 x² = 0.
IMPORTANT : Characterizing the stationarity conditions for the AR(2) model is done by
examining this equation and the solutions to it; i.e., the roots of ϕ(x) = 1 − ϕ1 x − ϕ2 x2 .
NOTE : Applying the quadratic formula to the AR(2) characteristic equation, we see that
the roots of ϕ(x) = 1 − ϕ1 x − ϕ2 x2 are
x = [ϕ1 ± √(ϕ1² + 4ϕ2)] / (−2ϕ2).
For stationarity, both roots must exceed 1 in absolute value (or in modulus if complex); it can be shown that this is equivalent to requiring ϕ1 + ϕ2 < 1, ϕ2 − ϕ1 < 1, and |ϕ2| < 1 (see Appendix B, pp 84, CC). These are the stationarity conditions for the AR(2)
model. A sketch of this stationarity region (in the ϕ1 -ϕ2 plane) appears in Figure 4.7.
RECALL: Define i = √−1 so that z = a + bi is a complex number. The modulus of z = a + bi is
|z| = √(a² + b²).
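R SKETCH: A short sketch that applies the quadratic formula above (allowing for a negative discriminant) and checks whether both roots exceed 1 in modulus; the pair (ϕ1, ϕ2) = (1, −0.5) is used purely as an illustration.
ar2.roots <- function(phi1, phi2) {
  disc <- complex(real = phi1^2 + 4*phi2)        # allow a negative discriminant
  (phi1 + c(1, -1)*sqrt(disc)) / (-2*phi2)       # the two (possibly complex) roots
}
r <- ar2.roots(1, -0.5)    # (phi1, phi2) = (1, -0.5): complex roots 1 + i and 1 - i
Mod(r)                     # both moduli equal sqrt(2) > 1  =>  stationary AR(2)
# Equivalently, Mod(polyroot(c(1, -1, 0.5))) computes the roots of 1 - x + 0.5x^2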
AUTOCOVARIANCE: To find the autocovariance function of the AR(2) process, multiply both sides of
Yt = ϕ1 Yt−1 + ϕ2 Yt−2 + et
by Yt−k (for k ≥ 1) and take expectations to get
E(Yt Yt−k) = ϕ1 E(Yt−1 Yt−k) + ϕ2 E(Yt−2 Yt−k) + E(et Yt−k).
Figure 4.7: Stationarity region for the AR(2) model. The point (ϕ1 , ϕ2 ) must fall inside
the triangular region to satisfy the stationarity conditions. Points falling below the curve
ϕ21 + 4ϕ2 = 0 are complex solutions. Those falling above ϕ21 + 4ϕ2 = 0 are real solutions.
Because {Yt } is a zero mean process, E(Yt Yt−k ) = γk , E(Yt−1 Yt−k ) = γk−1 , and
E(Yt−2 Yt−k ) = γk−2 . Because et is independent of Yt−k , E(et Yt−k ) = E(et )E(Yt−k ) = 0.
This proves that
γk = ϕ1 γk−1 + ϕ2 γk−2 .
Dividing both sides by γ0 gives
ρk = ϕ1 ρk−1 + ϕ2 ρk−2 , for k ≥ 1.
These are called the Yule-Walker equations for the AR(2) process.
Taking k = 1 and k = 2 in this relation (and using ρ0 = 1 and ρ−1 = ρ1) gives
ρ1 = ϕ1 + ϕ2 ρ1
ρ2 = ϕ1 ρ1 + ϕ2 ,
which, when solved, yield
ρ1 = ϕ1/(1 − ϕ2)   and   ρ2 = (ϕ1² + ϕ2 − ϕ2²)/(1 − ϕ2).
• If we want to find higher lag autocorrelations, we can use the (recursive) relation
ρk = ϕ1 ρk−1 + ϕ2 ρk−2 .
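R SKETCH: A sketch that computes ρ1 and ρ2 from the formulas above and then applies the recursion for higher lags; the pair (ϕ1, ϕ2) = (0.5, −0.5) is an arbitrary stationary choice. The built-in ARMAacf gives the same values.
phi1 <- 0.5; phi2 <- -0.5                   # an illustrative stationary AR(2) pair
rho <- numeric(10)
rho[1] <- phi1/(1 - phi2)                   # rho_1 from the Yule-Walker solution
rho[2] <- (phi1^2 + phi2 - phi2^2)/(1 - phi2)
for (k in 3:10)                             # recursion rho_k = phi1*rho_{k-1} + phi2*rho_{k-2}
  rho[k] <- phi1*rho[k-1] + phi2*rho[k-2]
rho
ARMAacf(ar=c(phi1, phi2), lag.max=10)       # agrees (a lag 0 value of 1 is included)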
REMARK : For those of you that like formulas, it is possible to write out closed-form
expressions for the autocorrelations in an AR(2) process. Denote the roots of the AR(2)
characteristic polynomial by 1/G1 and 1/G2 and assume that these roots both exceed 1
in absolute value (or modulus). Straightforward algebra shows that
G1 = [ϕ1 − √(ϕ1² + 4ϕ2)] / 2
G2 = [ϕ1 + √(ϕ1² + 4ϕ2)] / 2.
• If G1 ≠ G2, then
ρk = [(1 − G2²)G1^{k+1} − (1 − G1²)G2^{k+1}] / [(G1 − G2)(1 + G1G2)].
• If 1/G1 and 1/G2 are complex (i.e., when ϕ1² + 4ϕ2 < 0), then
ρk = R^k sin(Θk + Φ)/sin(Φ),
where R = √(−ϕ2) and the angles Θ and Φ satisfy cos(Θ) = ϕ1/(2√(−ϕ2)) and tan(Φ) = [(1 − ϕ2)/(1 + ϕ2)] tan(Θ).
DISCUSSION : Personally, I don’t think these formulas are all that helpful for computa-
tion purposes. So, why present them? After all, we could use the Yule-Walker equations
for computation.
• The formulas are helpful in that they reveal typical shapes of the AR(2) population
ACFs. This is important because when we see these shapes with real data (through
the sample ACFs), this will aid us in model selection/identification.
• Denote the roots of the AR(2) characteristic polynomial by 1/G1 and 1/G2. If the AR(2) process is stationary, then both of these roots are larger than 1 (in absolute value or modulus). However, this means that G1 and G2 are both smaller than 1 in absolute value (or modulus). Therefore, each of
ρk = [(1 − G2²)G1^{k+1} − (1 − G1²)G2^{k+1}] / [(G1 − G2)(1 + G1G2)]   (distinct real roots)
ρk = R^k sin(Θk + Φ)/sin(Φ)   (complex roots)
ρk = [1 + k(1 + ϕ2)/(1 − ϕ2)] (ϕ1/2)^k   (equal real roots)
decays to zero as the lag k increases, as a mixture of exponential decay and/or damped sine waves.
Figure 4.8: Population ACFs for AR(2) processes. Upper left: (ϕ1 , ϕ2 ) = (0.5, −0.5).
Upper right: (ϕ1 , ϕ2 ) = (1.1, −0.3). Lower left: (ϕ1 , ϕ2 ) = (−0.5, 0.25). Lower right:
(ϕ1 , ϕ2 ) = (1, −0.5).
EXAMPLE: We use R to simulate four AR(2) processes Yt = ϕ1 Yt−1 + ϕ2 Yt−2 + et, where et ∼ iid N(0, 1) and n = 100, using the following values of (ϕ1, ϕ2):
• (ϕ1, ϕ2) = (0.5, −0.5). CP: ϕ(x) = 1 − 0.5x + 0.5x². Two complex roots.
• (ϕ1, ϕ2) = (1.1, −0.3). CP: ϕ(x) = 1 − 1.1x + 0.3x². Two distinct (real) roots.
• (ϕ1, ϕ2) = (−0.5, 0.25). CP: ϕ(x) = 1 + 0.5x − 0.25x². Two distinct (real) roots.
• (ϕ1, ϕ2) = (1, −0.5). CP: ϕ(x) = 1 − x + 0.5x². Two complex roots.
These choices of (ϕ1 , ϕ2 ) are consistent with those in Figure 4.8 that depict the true
(population) AR(2) autocorrelation functions.
Figure 4.9: AR(2) simulations with n = 100 and σe2 = 1. Upper left: (ϕ1 , ϕ2 ) =
(0.5, −0.5). Upper right: (ϕ1 , ϕ2 ) = (1.1, −0.3). Lower left: (ϕ1 , ϕ2 ) = (−0.5, 0.25).
Lower right: (ϕ1 , ϕ2 ) = (1, −0.5).
• Consistent with the theory (see the population ACFs in Figure 4.8), the first (upper
left), second (upper right), and the fourth (lower right) series do “hang together;”
this is because of the positive lag 1 autocorrelation. The third series (lower left)
tends to oscillate, as we would expect since ρ1 < 0.
• The sample ACFs in Figure 4.10 resemble somewhat their theoretical counterparts
(at least at the first lag). Later lags generally deviate from the known theoretical
autocorrelations (there is a good reason for this). The error bounds at ±2/√100 = 0.2 correspond to those for a white noise process; not an AR(2) process.
Figure 4.10: Sample ACFs for AR(2) simulations with n = 100 and σe2 = 1. Upper
left: (ϕ1 , ϕ2 ) = (0.5, −0.5). Upper right: (ϕ1 , ϕ2 ) = (1.1, −0.3). Lower left: (ϕ1 , ϕ2 ) =
(−0.5, 0.25). Lower right: (ϕ1 , ϕ2 ) = (1, −0.5).
If 1/G1 and 1/G2 are the roots of the AR(2) characteristic polynomial, then
RECALL: Suppose {et } is a zero mean white noise process with var(et ) = σe2 . The general
autoregressive process of order p, denoted AR(p), is
(1 − ϕ1 B − ϕ2 B² − ··· − ϕp B^p)Yt = et ,
written in backshift notation. The AR(p) characteristic equation is
ϕ(x) = 1 − ϕ1 x − ϕ2 x² − ··· − ϕp x^p = 0.
IMPORTANT : An AR(p) process is stationary if and only if the p roots of ϕ(x) each
exceed 1 in absolute value (or in modulus if the roots are complex).
EXAMPLE: The AR(1) model can be written as
Yt = ϕYt−1 + et ⇐⇒ (1 − ϕB)Yt = et .
The AR(1) characteristic equation is
ϕ(x) = 1 − ϕx = 0 =⇒ x = 1/ϕ.
Clearly,
|x| = 1/|ϕ| > 1 ⇐⇒ |ϕ| < 1,
which was the stated stationarity condition for the AR(1) process.
The AR(2) process is stationary if and only if both roots are larger than 1 in
absolute value (or in modulus if complex). That is, both roots must lie outside the
unit circle.
• The same condition on the roots of ϕ(x) is needed for stationarity with any AR(p)
process.
AUTOCORRELATION: For a stationary AR(p) process, the same argument used for the AR(2) model gives the recursion ρk = ϕ1 ρk−1 + ϕ2 ρk−2 + ··· + ϕp ρk−p, for k ≥ 1. Setting k = 1, 2, ..., p produces
ρ1 = ϕ1 + ϕ2 ρ1 + ϕ3 ρ2 + ··· + ϕp ρp−1
ρ2 = ϕ1 ρ1 + ϕ2 + ϕ3 ρ1 + ··· + ϕp ρp−2
⋮
ρp = ϕ1 ρp−1 + ϕ2 ρp−2 + ϕ3 ρp−3 + ··· + ϕp .
These are the Yule-Walker equations. For known values of ϕ1 , ϕ2 , ..., ϕp , we can com-
pute the first lag p autocorrelations ρ1 , ρ2 , ..., ρp . Values of ρk , for k > p, can be obtained
by using the recursive relation above. The AR(p) ACF tails off as k gets larger. It does
so as a mixture of exponential decays and/or damped sine waves, depending on if roots
are real or complex.
4.4 Invertibility
TERMINOLOGY: A process {Yt} is said to be invertible if it can be written as an (infinite-order) autoregression on its own past; that is, if the white noise term et can be expressed as a convergent linear combination of Yt, Yt−1, Yt−2, ... .
ILLUSTRATION: From this definition, we see that stationary autoregressive models are automatically invertible. However, moving average models may not be. For example,
consider the MA(1) model
Yt = et − θet−1 ,
or, slightly rewritten (by substituting repeatedly for the lagged white noise terms),
Yt = −θYt−1 − θ²Yt−2 − θ³Yt−3 − ··· + et ,
an infinite-order autoregressive representation.
NONUNIQUENESS: It is straightforward to show that the two MA(1) models
Yt = et − θet−1     and     Yt = et − (1/θ)et−1
have exactly the same autocorrelation function.
Put another way, if we knew the common ACF, we could not say if the MA(1) model
parameter was θ or 1/θ. Thus, we impose the condition that |θ| < 1 to ensure invertibility
(identifiability). Note that under this condition, the second MA(1) model, rewritten
Yt = −(1/θ)Yt−1 − (1/θ)²Yt−2 − (1/θ)³Yt−3 − ··· + et ,
is no longer meaningful because the series Σ_{j=1}^{∞} (1/θ)^j diverges.
NOTE : Rewriting the MA(1) model using backshift notation, we see that
Yt = (1 − θB)et .
The MA(1) characteristic equation is
θ(x) = 1 − θx = 0,
which has root x = 1/θ.
For this process to be invertible, we require the root of the characteristic equation to
exceed 1 (in absolute value). Doing so implies that |θ| < 1.
GENERALIZATION: The MA(q) process
Yt = et − θ1 et−1 − θ2 et−2 − ··· − θq et−q = (1 − θ1 B − θ2 B² − ··· − θq B^q)et
is invertible if and only if the roots of the MA(q) characteristic polynomial θ(x) =
1 − θ1 x − θ2 x2 − · · · − θq xq all exceed 1 in absolute value (or modulus).
TERMINOLOGY : Suppose {et } is a zero mean white noise process with var(et ) = σe2 .
The process
Yt = ϕ1 Yt−1 + ϕ2 Yt−2 + ··· + ϕp Yt−p + et − θ1 et−1 − θ2 et−2 − ··· − θq et−q
is called an autoregressive moving average process of order (p, q), denoted by ARMA(p, q). In backshift notation, this model can be written as
(1 − ϕ1 B − ϕ2 B 2 − · · · − ϕp B p )Yt = (1 − θ1 B − θ2 B 2 − · · · − θq B q )et
where
ϕ(B) = 1 − ϕ1 B − ϕ2 B 2 − · · · − ϕp B p
θ(B) = 1 − θ1 B − θ2 B 2 − · · · − θq B q .
• For the ARMA(p, q) process to be stationary, we need the roots of the AR char-
acteristic polynomial ϕ(x) = 1 − ϕ1 x − ϕ2 x2 − · · · − ϕp xp to all exceed 1 in absolute
value (or modulus).
• For the ARMA(p, q) process to be invertible, we need the roots of the MA char-
acteristic polynomial θ(x) = 1 − θ1 x − θ2 x2 − · · · − θq xq to all exceed 1 in absolute
value (or modulus).
EXERCISE: Write each of the following models
(i) Yt = 0.3Yt−1 + et
(ii) Yt = et − 1.3et−1 + 0.4et−2
(iii) Yt = 0.5Yt−1 + et − 0.3et−1 + 1.2et−2
(iv) Yt = 0.4Yt−1 + 0.45Yt−2 + et + et−1 + 0.25et−2
using backshift notation and determine whether the model is stationary and/or invertible.
Solutions.
(i) The model in (i) is an AR(1) with ϕ = 0.3. In backshift notation, this model is
(1 − 0.3B)Yt = et . The characteristic polynomial is
ϕ(x) = 1 − 0.3x,
which has the root x = 10/3. Because this root exceeds 1 in absolute value, this
process is stationary. The process is also invertible since it is a stationary AR
process.
(ii) The model in (ii) is an MA(2) with θ1 = 1.3 and θ2 = −0.4. In backshift notation,
this model is Yt = (1 − 1.3B + 0.4B²)et. The characteristic polynomial is
θ(x) = 1 − 1.3x + 0.4x²,
which has roots x = 2 and x = 1.25. Because these roots both exceed 1 in absolute
value, this process is invertible. The process is also stationary since it is an invertible
MA process.
(iii) The model in (iii) is an ARMA(1,2) with ϕ1 = 0.5, θ1 = 0.3 and θ2 = −1.2. In
backshift notation, this model is (1 − 0.5B)Yt = (1 − 0.3B + 1.2B 2 )et . The AR
characteristic polynomial is
ϕ(x) = 1 − 0.5x,
which has the root x = 2. Because this root is greater than 1, this process is
stationary. The MA characteristic polynomial is
θ(x) = 1 − 0.3x + 1.2x²,
which has two complex roots, each with modulus √(1/1.2) ≈ 0.91. Because these roots do not exceed 1 in modulus, this process is not invertible.
(iv) The model in (iv), at first glance, appears to be an ARMA(2,2) with ϕ1 = 0.4,
ϕ2 = 0.45, θ1 = −1, and θ2 = −0.25. In backshift notation, this model is written
as
(1 − 0.4B − 0.45B²)Yt = (1 + B + 0.25B²)et .
However, the AR and MA polynomials share the common factor (1 + 0.5B), since 1 − 0.4B − 0.45B² = (1 + 0.5B)(1 − 0.9B) and 1 + B + 0.25B² = (1 + 0.5B)². Cancelling this common factor gives
(1 − 0.9B)Yt = (1 + 0.5B)et ,
which we recognize as an ARMA(1,1) model with ϕ = 0.9 and θ = −0.5. The AR characteristic root is x = 1/0.9 ≈ 1.11 and the MA characteristic root is x = −2; both exceed 1 in absolute value, so this process is stationary and invertible.
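R CHECK: The root calculations in (i)-(iv) can be verified with polyroot, which returns the (possibly complex) roots of a polynomial whose coefficients are listed in increasing order of power; Mod() gives the modulus.
Mod(polyroot(c(1, -0.3)))          # (i)  AR: root 3.33 > 1            => stationary
Mod(polyroot(c(1, -1.3, 0.4)))     # (ii) MA: roots 1.25 and 2 > 1     => invertible
Mod(polyroot(c(1, -0.3, 1.2)))     # (iii) MA: both moduli about 0.91  => not invertible
Mod(polyroot(c(1, -0.4, -0.45)))   # (iv) AR part: moduli 1.11 and 2 (before cancellation)
Mod(polyroot(c(1, 1, 0.25)))       # (iv) MA part: repeated root with modulus 2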
AUTOCORRELATION: For a stationary ARMA(p, q) process, multiply both sides of the defining equation by Yt−k and take expectations. For k > q, we have E(et Yt−k) = E(et−1 Yt−k) = ··· = E(et−q Yt−k) = 0 so that
γk = ϕ1 γk−1 + ϕ2 γk−2 + ··· + ϕp γk−p ; equivalently, ρk = ϕ1 ρk−1 + ϕ2 ρk−2 + ··· + ϕp ρk−p , for k > q.
Plugging in k = 1, 2, ..., p, and using the fact that ρk = ρ−k, we arrive again at the Yule-Walker equations:
ρ1 = ϕ1 + ϕ2 ρ1 + ϕ3 ρ2 + · · · + ϕp ρp−1
ρ2 = ϕ1 ρ1 + ϕ2 + ϕ3 ρ1 + · · · + ϕp ρp−2
..
.
• The R function ARMAacf can compute autocorrelations numerically for any stationary ARMA(p, q) process (including those that are purely AR or MA); see the sketch below.
• The ACF for the ARMA(p, q) process tails off after lag q in a manner similar to
the AR(p) process.
• However, unlike the AR(p) process, the first q autocorrelations depend on both
θ1 , θ2 , ..., θq and ϕ1 , ϕ2 , ..., ϕp .
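R SKETCH: A sketch of ARMAacf for an ARMA(1,1) model; the values (ϕ, θ) = (0.9, −0.25) match the upper left panel of Figure 4.11. Remember that R's MA sign convention is opposite to the one used in these notes, so θ = −0.25 corresponds to ma = 0.25.
# Population ACF of the ARMA(1,1) model with phi = 0.9 and theta = -0.25
rho <- ARMAacf(ar=0.9, ma=0.25, lag.max=20)
plot(1:20, rho[-1], type="h", xlab="k", ylab="Autocorrelation")   # drop the lag 0 value
abline(h=0)
# The ACF decays exponentially at rate phi = 0.9 from its lag 1 value.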
SPECIAL CASE : Suppose that {et } is a zero mean white noise process with var(et ) = σe2 .
The process
Yt = ϕYt−1 + et − θet−1
is called an ARMA(1,1) process. In backshift notation, it can be written as
(1 − ϕB)Yt = (1 − θB)et .
For this process, it can be shown that
γ0 = σe²(1 − 2θϕ + θ²)/(1 − ϕ²)   and   ρk = [(1 − θϕ)(ϕ − θ)/(1 − 2θϕ + θ²)] ϕ^{k−1}, for k ≥ 1,
so that the ACF decays exponentially (at rate ϕ) from its lag 1 value.
Figure 4.11: Population ACFs for ARMA(1,1) processes. Upper left: (ϕ, θ) =
(0.9, −0.25). Upper right: (ϕ, θ) = (−0.9, −0.25). Lower left: (ϕ, θ) = (0.5, −0.25).
Lower right: (ϕ, θ) = (−0.5, −0.25).
REMARK : That the ARMA(1,1) model can be written in the general linear process form
defined at the beginning of the chapter is shown on pp 78-79 (CC).
5.1 Introduction
RECALL: Suppose {et } is a zero mean white noise process with variance var(et ) = σe2 .
In the last chapter, we considered the class of ARMA models
ϕ(B)Yt = θ(B)et ,
where
ϕ(B) = (1 − ϕ1 B − ϕ2 B 2 − · · · − ϕp B p )
θ(B) = (1 − θ1 B − θ2 B 2 − · · · − θq B q ).
• We learned that a process {Yt } in this class is stationary if and only if the roots of
the AR characteristic polynomial ϕ(x) all exceed 1 in absolute value (or modulus).
• We learned that a process {Yt } in this class is invertible if and only if the roots of
the MA characteristic polynomial θ(x) all exceed 1 in absolute value (or modulus).
• In this chapter, we extend this class of models to handle processes which are non-
stationary. We accomplish this by generalizing the class of ARMA models to
include differencing.
• Doing so gives rise to a much larger class of models, the autoregressive inte-
grated moving average (ARIMA) class. This class incorporates a wide range
of nonstationary time series processes.
DIFFERENCING: Recall that the first difference of a process {Yt} is ∇Yt = Yt − Yt−1 and that the second difference is the first difference of the first differences; that is,
∇²Yt = ∇(∇Yt) = (Yt − Yt−1) − (Yt−1 − Yt−2) = Yt − 2Yt−1 + Yt−2 .
EXAMPLE: Consider the random walk process
Yt = Yt−1 + et ,
where {et } is zero mean white noise with variance var(et ) = σe2 . We know that {Yt } is not
stationary because its autocovariance function depends on t (see Chapter 2). However,
the first difference process
∇Yt = Yt − Yt−1 = et
is white noise and, therefore, is stationary.
• In Figure 5.1 (top), we display a simulated random walk process with n = 150 and
σe2 = 1. Note how the sample ACF of the series decays very, very slowly over time.
This is typical of a nonstationary series.
• The first difference (white noise) process also appears in Figure 5.1 (bottom), along
with its sample ACF. As we would expect from a white noise process, nearly all of the sample autocorrelations rk are within the ±2/√n bounds.
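R SKETCH: A minimal sketch of the simulation in Figure 5.1: build a random walk by cumulatively summing white noise, then difference it.
set.seed(520)
e <- rnorm(150)                   # white noise, n = 150, sigma_e^2 = 1
y <- cumsum(e)                    # random walk: Y_t = Y_{t-1} + e_t
dy <- diff(y)                     # first differences recover the white noise
par(mfrow=c(2,2))
plot.ts(y);  acf(y)               # slowly decaying sample ACF (nonstationary)
plot.ts(dy); acf(dy)              # white noise behaviour after differencing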
Figure 5.1: Top: A simulated random walk process {Yt } and its sample ACF, with
n = 150 and σe2 = 1. Bottom: The first difference process {∇Yt } and its sample ACF.
LINEAR TREND MODELS : In Chapter 3, we talked about how to use regression meth-
ods to fit models of the form
Yt = µt + Xt ,
where µt is a deterministic trend function and where {Xt } is a stochastic process with
E(Xt ) = 0. Suppose that {Xt } is stationary and that the true trend function is
µt = β0 + β1 t,
a linear function of time. The mean of the process is
E(Yt) = E(β0 + β1 t + Xt) = β0 + β1 t + E(Xt) = β0 + β1 t,
which depends on t; that is, {Yt} is not stationary. However, consider the first difference process
∇Yt = Yt − Yt−1 = β1 + Xt − Xt−1 .
Note that
E(∇Yt) = β1 + E(Xt) − E(Xt−1) = β1 .
Also,
cov(∇Yt, ∇Yt−k) = cov(Xt − Xt−1, Xt−k − Xt−k−1)
= cov(Xt, Xt−k) − cov(Xt, Xt−k−1) − cov(Xt−1, Xt−k) + cov(Xt−1, Xt−k−1).
Because {Xt} is stationary, each of these covariance terms does not depend on t. Therefore, both E(∇Yt) and cov(∇Yt, ∇Yt−k) are free of t; i.e., {∇Yt} is a stationary process. Taking first differences removes a linear deterministic trend.
QUADRATIC TREND: Now suppose that the true trend function is
µt = β0 + β1 t + β2 t²,
a quadratic function of time. Clearly, {Yt} is not a stationary process since E(Yt) = µt. The first difference process ∇Yt = β1 + β2(2t − 1) + Xt − Xt−1 still has a mean which depends on t. The second difference process consists of
∇²Yt = ∇Yt − ∇Yt−1 = 2β2 + Xt − 2Xt−1 + Xt−2 .
Therefore, E(∇2 Yt ) = 2β2 and cov(∇2 Yt , ∇2 Yt−k ) are free of t. This shows that {∇2 Yt }
is stationary. Taking second differences removes a quadratic deterministic trend.
Figure 5.2: Ventilation measurements at 15 second intervals. Top: Ventilation series {Yt }
with sample ACF. Bottom: First difference process {∇Yt } with sample ACF.
GENERALIZATION: In general, taking d successive differences removes a polynomial deterministic trend of the form
µt = β0 + β1 t + β2 t² + ··· + βd t^d .
Example 5.2. The data in Figure 5.2 are ventilation observations (L/min) on a single
cyclist recorded every 15 seconds during exercise. Source: Joe Alemany (Spring, 2010).
• The ventilation time series {Yt } does not resemble a stationary process. There is a
pronounced increasing linear trend over time. Nonstationarity is also reinforced
by examining the sample ACF for the series. In particular, the sample ACF decays
very, very slowly (a sure sign of nonstationarity).
• The first difference series {∇Yt } does resemble a process with a constant mean. In
fact, the sample ACF for {∇Yt } looks like what we would expect from an MA(1)
process (i.e., a pronounced spike at k = 1 and little action elsewhere).
• To summarize, the evidence in Figure 5.2 suggests an MA(1) model for the differ-
ence process {∇Yt }.
TERMINOLOGY: Suppose {et} is a zero mean white noise process with var(et) = σe². Recall the stationary ARMA(p, q) model
(1 − ϕ1 B − ϕ2 B² − ··· − ϕp B^p)Yt = (1 − θ1 B − θ2 B² − ··· − θq B^q)et .
A process {Yt} is called an autoregressive integrated moving average process of order (p, d, q), denoted ARIMA(p, d, q), if its dth difference Wt = ∇^d Yt follows this stationary ARMA(p, q) model. In the ARIMA(p, d, q) family, take d = 1 so that
Wt = ∇Yt = Yt − Yt−1 = (1 − B)Yt ,
and the model can be written as ϕ(B)(1 − B)Yt = θ(B)et. Similarly, taking d = 2 gives
Wt = ∇²Yt = Yt − 2Yt−1 + Yt−2 = Yt − 2BYt + B²Yt = (1 − 2B + B²)Yt = (1 − B)²Yt ,
so that the ARIMA(p, 2, q) model can be written as ϕ(B)(1 − B)²Yt = θ(B)et.
IMPORTANT : In practice (with real data), there will rarely be a need to consider values
of the differencing order d > 2. Most real time series data can be coerced into a station-
arity ARMA process by taking one difference or occasionally two differences (perhaps
after transforming the series initially).
REMARK : Autoregressive (AR) models, moving average (MA) models, and autoregres-
sive moving average (ARMA) models are all members of the ARIMA(p, d, q) family. In
particular,
• AR(p) ←→ ARIMA(p, 0, 0)
• MA(q) ←→ ARIMA(0, 0, q)
• ARMA(p, q) ←→ ARIMA(p, 0, q)
• ARI(p, d) ←→ ARIMA(p, d, 0)
• IMA(d, q) ←→ ARIMA(0, d, q)
Figure 5.3: Top: ARI(1,1) simulation, with ϕ = 0.7, n = 150, and σe2 = 1, and the
sample ACF. Bottom: First difference process with sample ACF.
Example 5.3. Suppose {et } is a zero mean white noise process. Identify each model
Solutions.
(a) The first model, Yt = 1.7Yt−1 − 0.7Yt−2 + et, looks like an AR(2) process with ϕ1 = 1.7 and ϕ2 = −0.7. However, upon closer
inspection, we see this process is not stationary because the AR(2) stationary con-
ditions
ϕ1 + ϕ2 < 1 ϕ2 − ϕ1 < 1 |ϕ2 | < 1
are not met with ϕ1 = 1.7 and ϕ2 = −0.7 (in particular, the first condition is not
met). However, note that we can write this process as
(1 − 1.7B + 0.7B²)Yt = et ⇐⇒ (1 − 0.7B)(1 − B)Yt = et
⇐⇒ (1 − 0.7B)Wt = et ,
where
Wt = (1 − B)Yt = Yt − Yt−1
are the first differences. We identify {Wt } as a stationary AR(1) process with
ϕ = 0.7. Therefore, {Yt } is an ARIMA(1,1,0) ⇐⇒ ARI(1,1) process with ϕ = 0.7.
This ARI(1,1) process is simulated in Figure 5.3.
(b) The second model looks like an ARMA(2,2) process, but this process is not stationary either. To see why, note that we can write this process as
(1 − B)Yt = (1 − 0.5B)et ⇐⇒ Wt = (1 − 0.5B)et ,
where Wt = (1 − B)Yt = Yt − Yt−1 . Here, the first differences {Wt } follow an MA(1)
model with θ = 0.5. Therefore, {Yt } is an ARIMA(0,1,1) ⇐⇒ IMA(1,1) process
with θ = 0.5. A realization of this IMA(1,1) process is shown in Figure 5.4.
Figure 5.4: Top: IMA(1,1) simulation, with θ = 0.5, n = 150, and σe2 = 1, and the
sample ACF. Bottom: First difference process with sample ACF.
TERMINOLOGY : Suppose {et } is a zero mean white noise process with var(et ) = σe2 .
An ARIMA(p, d, q) process with p = 0, d = 1, and q = 1 is called an IMA(1,1) process
and is given by
Yt = Yt−1 + et − θet−1 .
This model is very popular in economics applications. Note that if θ = 0, the IMA(1,1)
process reduces to a random walk.
NOTE: In backshift notation, the IMA(1,1) model can be written as
(1 − B)Yt = (1 − θB)et .
If we were to view this as an “ARMA(1,1)” model with
ϕ(B) = 1 − B
θ(B) = 1 − θB,
it would be clear that this process is not stationary since the AR characteristic polynomial
ϕ(x) = 1 − x has a unit root, that is, the root of ϕ(x) is x = 1. More appropriately, we
write
(1 − B)Yt = (1 − θB)et ⇐⇒ Wt = (1 − θB)et ,
where the first differences
Wt = (1 − B)Yt = Yt − Yt−1
follow an MA(1) model with parameter θ. From Chapter 4, we know that the first
difference process {Wt} is invertible if and only if |θ| < 1. To summarize, the IMA(1,1) process itself is not stationary, but its first difference process is a stationary MA(1) process which is invertible whenever |θ| < 1.
TERMINOLOGY : Suppose {et } is a zero mean white noise process with var(et ) = σe2 .
An ARIMA(p, d, q) process with p = 0, d = 2, and q = 2 is called an IMA(2,2) process
and can be expressed as
(1 − B)2 Yt = (1 − θ1 B − θ2 B 2 )et ,
or, equivalently,
∇2 Yt = et − θ1 et−1 − θ2 et−2 .
The second differences
Wt = ∇²Yt = (1 − B)²Yt
follow an MA(2) model with parameters θ1 and θ2.
Figure 5.5: Top: IMA(2,2) simulation with n = 150, θ1 = 0.3, θ2 = −0.3, and σe2 = 1.
Middle: First difference process. Bottom: Second difference process.
An IMA(2,2) process with θ1 = 0.3 and θ2 = −0.3 is simulated in Figure 5.5 (top); the series itself is clearly nonstationary.
• The first difference process {∇Yt}, which is that of an IMA(1,2), is also clearly nonstationary to the naked eye. This is also seen in the sample ACF.
• The second difference process {∇²Yt} (bottom) does resemble a stationary process.
TERMINOLOGY : Suppose {et } is a zero mean white noise process with var(et ) = σe2 .
An ARIMA(p, d, q) process with p = 1, d = 1, and q = 0 is called an ARI(1,1) process
and can be expressed as
(1 − ϕB)(1 − B)Yt = et ,
or, equivalently,
Yt = (1 + ϕ)Yt−1 − ϕYt−2 + et .
Writing Wt = (1 − B)Yt for the first differences, the model becomes
(1 − ϕB)Wt = et ,
which we recognize as an AR(1) process with parameter ϕ. The first difference process
{Wt } is stationary if and only if |ϕ| < 1.
NOTE: Written out, the ARI(1,1) model
Yt = (1 + ϕ)Yt−1 − ϕYt−2 + et
looks like an AR(2) model. However, this process is not stationary since the coefficients
satisfy (1 + ϕ) − ϕ = 1; this violates the stationarity requirements for the AR(2) model.
An ARI(1,1) process is simulated in Figure 5.3.
TERMINOLOGY : Suppose {et } is a zero mean white noise process with var(et ) = σe2 .
An ARIMA(p, d, q) process with p = 1, d = 1, and q = 1 is called an ARIMA(1,1,1)
process and can be expressed as
(1 − ϕB)(1 − B)Yt = (1 − θB)et ,
or, equivalently,
Yt = (1 + ϕ)Yt−1 − ϕYt−2 + et − θet−1 .
Figure 5.6: Top: ARIMA(1,1,1) simulation, with n = 150, ϕ = 0.5, θ = −0.5, and
σe2 = 1, and the sample ACF. Bottom: First difference process with sample ACF.
Writing Wt = (1 − B)Yt for the first differences, the ARIMA(1,1,1) model becomes
(1 − ϕB)Wt = (1 − θB)et ,
an ARMA(1,1) model for {Wt}.
• The first difference process {Wt } is stationary if and only if |ϕ| < 1. The first
difference process {Wt } is invertible if and only if |θ| < 1.
CONSTANT TERMS: Recall the general ARIMA(p, d, q) model
ϕ(B)(1 − B)^d Yt = θ(B)et ,
where {et} is zero mean white noise with var(et) = σe². An extension of this model is
ϕ(B)(1 − B)^d Yt = θ0 + θ(B)et ,
where θ0 is a constant term.
STATIONARY CASE : Suppose that d = 0, in which case the no-constant model becomes
ϕ(B)Yt = θ(B)et ,
ϕ(B) = (1 − ϕ1 B − ϕ2 B 2 − · · · − ϕp B p )
θ(B) = (1 − θ1 B − θ2 B 2 − · · · − θq B q ).
To examine the effects of adding a constant term, suppose that we replace Yt with Yt − µ,
where µ = E(Yt ). The model becomes
ϕ(B)(Yt − µ) = θ(B)et
=⇒ ϕ(B)Yt − (1 − ϕ1 − ϕ2 − ··· − ϕp)µ = θ(B)et
=⇒ ϕ(B)Yt = (1 − ϕ1 − ϕ2 − ··· − ϕp)µ + θ(B)et ,
where the term (1 − ϕ1 − ϕ2 − ··· − ϕp)µ plays the role of the constant θ0,
so that
θ0 = (1 − ϕ1 − ϕ2 − ··· − ϕp)µ ⇐⇒ µ = θ0 / (1 − ϕ1 − ϕ2 − ··· − ϕp).
NONSTATIONARY CASE : The impact of adding a constant term θ0 to the model when
d > 0 is quite different. As the simplest example in the ARIMA(p, d, q) family, take
p = q = 0 and d = 1 so that
(1 − B)Yt = θ0 + et ⇐⇒ Yt = θ0 + Yt−1 + et .
This model is called a random walk with drift; see pp 22 (CC). Note that we can
write via successive substitution
Yt = θ0 + Yt−1 + et
= 2θ0 + Yt−2 + et + et−1
⋮
= tθ0 + Y0 + et + et−1 + ··· + e1 .
Therefore, the process {Yt} contains a linear deterministic trend with slope θ0.
IMPORTANT : The previous finding holds for any (nonstationary) ARIMA(p, 1, q) model,
that is, adding a constant term θ0 induces a linear deterministic trend. Also, note that for very large t, the constant (deterministic trend) term can become very
dominating so that it forces the time series to follow a nearly deterministic pattern.
Therefore, a constant term should be added to a nonstationary ARIMA model (i.e.,
d > 0) only if it is strongly warranted.
5.4 Transformations
• For example, if there is clear evidence of nonconstant variance over time (e.g., the
variance increases over time, etc.), then a suitable transformation to the data
might remove (or lessen the impact of) the nonconstant variance pattern.
Example 5.4. Data file: electricity (TSA). Figure 5.7 displays monthly electricity
usage in the United States (usage from coal, natural gas, nuclear, petroleum, and wind)
between January, 1973 and December, 2005.
• From the plot, we can see that there is increasing variance over time; e.g., the series
is much more variable at later years than it is in earlier years.
• Time series that exhibit this “fanning out” shape are not stationary because the
variance changes over time.
• Before we try to model these data, we should first apply a transformation to make
the variance constant (that is, we would like to first “stabilize” the variance).
THEORY : Suppose that the variance of nonstationary process {Yt } can be written as
var(Yt ) = c0 f (µt ),
where µt = E(Yt ) and c0 is a positive constant free of µt . Therefore, the variance is not
constant because it is a function of µt , which is changing over time. Our goal is to find a
function T so that the transformed series T(Yt) has constant variance (such a function is called a variance-stabilizing transformation). A first-order Taylor expansion of T(Yt) about µt gives
T(Yt) ≈ T(µt) + T′(µt)(Yt − µt),
Figure 5.7: Electricity data. Monthly U.S. electricity generation, measured in millions of
kilowatt hours, from 1/1973 to 12/2005.
where T′(µt) is the first derivative of T(Yt), evaluated at µt. Now, note that
var[T(Yt)] ≈ [T′(µt)]² var(Yt) = c0 f(µt)[T′(µt)]².
Setting this approximate variance equal to a constant c1,
c0 f(µt)[T′(µt)]² = c1 ,
where c1 is a constant free of µt. Solving this expression for T′(µt), we get the differential equation
T′(µt) = √(c1 / [c0 f(µt)]) = c2 / √f(µt) ,
where c2 = √(c1/c0) is free of µt. Integrating both sides, we get
T(µt) = ∫ c2/√f(µt) dµt + c3 ,
where c3 is a constant free of µt . In the calculations below, the values of c2 and c3 can
be taken to be anything, as long as they are free of µt .
• If var(Yt) = c0 µt², so that the standard deviation of the series is proportional to the mean, then
T(µt) = ∫ c2/√(µt²) dµt = c2 ln(µt) + c3 ,
where c3 is a constant free of µt. If we take c2 = 1 and c3 = 0, we see that the logarithm of the series, T(Yt) = ln(Yt), will provide a constant variance.
• If var(Yt) = c0 µt⁴, so that the standard deviation of the series is proportional to the square of the mean, then
T(µt) = ∫ c2/√(µt⁴) dµt = c2(−1/µt) + c3 ,
so that taking c2 = −1 and c3 = 0 shows the reciprocal transformation T(Yt) = 1/Yt will provide a constant variance.
BOX-COX TRANSFORMATIONS: A useful family of power transformations is
T(Yt) = (Yt^λ − 1)/λ,  for λ ≠ 0;      T(Yt) = ln(Yt),  for λ = 0,
Table 5.1: Common power transformations.
λ        T(Yt)        Description
−2.0     1/Yt²        Inverse square
−1.0     1/Yt         Reciprocal
−0.5     1/√Yt        Inverse square root
0.0      ln(Yt)       Logarithm
0.5      √Yt          Square root
1.0      Yt           Identity (no transformation)
2.0      Yt²          Square
where λ is called the transformation parameter. Some common values of λ, and their
implied transformations are given in Table 5.1.
NOTE: To see why the logarithm transformation T(Yt) = ln(Yt) is used when λ = 0, note that by L'Hôpital's Rule (from calculus),
lim_{λ→0} (Yt^λ − 1)/λ = lim_{λ→0} [Yt^λ ln(Yt)] / 1 = ln(Yt).
Figure 5.8: Electricity data. Log-likelihood function versus λ. Note that λ is on the
horizontal axis. A 95 percent confidence interval for λ is also depicted.
• There is an R function BoxCox.ar that does all of the calculations; a sketch of its use appears below. The function also provides an approximate 95 percent confidence interval for λ, which is constructed using the large sample properties of MLEs.
• The computations needed to produce a figure like the one in Figure 5.8 can be time
consuming if the series is long (i.e., n is large). Also, the profile log-likelihood is
not always as “smooth” as that seen in Figure 5.8.
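R SKETCH: A sketch of the Box-Cox calculation for the electricity data using BoxCox.ar from the TSA package (the same function used to produce Figure 5.8).
library(TSA)
data(electricity)                 # monthly U.S. electricity generation
# Profile log-likelihood of lambda with an approximate 95 percent confidence
# interval, as in Figure 5.8 (this computation can be slow for long series)
BoxCox.ar(electricity)
# Since lambda = 0 is plausible here, work with the log-transformed series:
log.elec <- log(electricity)
plot(log.elec, ylab="(Log) electricity usage")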
Figure 5.9: Electricity data (transformed). Monthly U.S. electricity generation measured
on the log scale.
Example 5.4 (continued). Figure 5.8 displays the profile log-likelihood of λ for the
electricity data. The value of λ (on the horizontal axis) that maximizes the log-likelihood
function looks to be λ ≈ −0.1, suggesting the transformation T(Yt) = Yt^{−0.1}. However, because λ = 0 falls within the approximate 95 percent confidence interval, we use the more interpretable log transformation T(Yt) = ln(Yt) instead.
• The log-transformed series {ln Yt } is displayed in Figure 5.9. We see that applying
the log transformation has notably lessened the nonconstant variance (although
there still is a mild increase in the variance over time).
• Now that we have applied the transformation, we can return to our previous strategy of differencing to remove the trend; that is, we examine the first differences of the log-transformed series, Wt = log Yt − log Yt−1.
Figure 5.10: Electricity data. Left: Wt = log Yt − log Yt−1 , the first differences of the
log-transformed data. Right: The sample autocorrelation function of the {Wt } data.
• The {Wt } series is plotted in Figure 5.10 (left) along with the sample ACF of the
{Wt } series (right). The {Wt } series appears to have a constant mean.
• However, the sample ACF suggests that there is still a large amount of structure
in the data that remains after differencing the log-transformed series.
6 Model Specification
6.1 Introduction
RECALL: Suppose that {et } is zero mean white noise with var(et ) = σe2 . In general, an
ARIMA(p, d, q) process can be written as
ϕ(B)(1 − B)^d Yt = θ(B)et ,
where
ϕ(B) = (1 − ϕ1 B − ϕ2 B 2 − · · · − ϕp B p )
θ(B) = (1 − θ1 B − θ2 B 2 − · · · − θq B q )
and
(1 − B)d Yt = ∇d Yt
is the series of dth differences. In this chapter, we discuss techniques on how to choose
suitable values of p, d, and q for an observed (or transformed) time series. We want our
choices to be consistent with the underlying structure of the observed data. Bad choices
of p, d, and q lead to bad models, which, in turn, lead to bad predictions (forecasts) of
future values.
RECALL: For time series data Y1 , Y2 , ..., Yn , the sample autocorrelation function
(ACF), at lag k, is given by
rk = Σ_{t=k+1}^{n} (Yt − Ȳ)(Yt−k − Ȳ) / Σ_{t=1}^{n} (Yt − Ȳ)² ,
where Ȳ is the sample mean of Y1, Y2, ..., Yn.
LARGE-SAMPLE RESULT: For a stationary ARMA process, it turns out that √n(rk − ρk) converges in distribution to a normal random variable with mean 0 and variance ckk
as n → ∞, where
ckk = Σ_{l=−∞}^{∞} (ρl² + ρl−k ρl+k − 4ρk ρl ρl−k + 2ρk² ρl²).
In other words, when the sample size n is large, the sample autocorrelation rk is ap-
proximately normally distributed with mean ρk and variance ckk /n; i.e.,
rk ∼ AN(ρk, ckk/n).
We now examine some specific models and specialize this general result to those models.
1. WHITE NOISE: For a white noise process, the formula for ckk simplifies consid-
erably because nearly all the terms in the sum above are zero. For large n,
rk ∼ AN(0, 1/n),
for k = 1, 2, ... . This explains why ±2/√n serve as approximate margin of error bounds for rk. Values of rk outside these bounds would be “unusual” under the
white noise model assumption.
2. AR(1): For a stationary AR(1) process Yt = ϕYt−1 + et , the formula for ckk also
reduces considerably. For large n,
rk ∼ AN (ρk , σr2k ),
where ρk = ϕk and
σ²rk = (1/n) [ (1 + ϕ²)(1 − ϕ^{2k})/(1 − ϕ²) − 2kϕ^{2k} ].
3. MA(1): For an invertible MA(1) process and large n,
r1 ∼ AN(ρ1, σ²r1),
and, for k > 1,
rk ∼ AN(0, σ²rk),
where
σ²rk = (1 + 2ρ1²)/n.
4. MA(q): For an invertible MA(q) process and for lags k > q,
rk ∼ AN(0, σ²rk),    where σ²rk = (1/n)(1 + 2ρ1² + 2ρ2² + ··· + 2ρq²),
when n is large.
REMARK: The MA(q) result above suggests a natural large-sample test of H0: the MA(q) model is appropriate (i.e., ρk = 0 for all k > q) versus H1: the MA(q) model is not appropriate, based on the statistic
Z = r_{q+1} / √[(1/n)(1 + 2 Σ_{j=1}^{q} ρj²)] .
We can not use Z as a test statistic to test H0 versus H1 because Z depends on ρ1 , ρ2 , ..., ρq
which, in practice, are unknown. However, when n is large, we can use rj as an estimate
for ρj . This should not severely impact the large sample distribution of Z because rj
should be “close” to ρj when n is large. Making this substitution gives the large-sample
test statistic
Z* = r_{q+1} / √[(1/n)(1 + 2 Σ_{j=1}^{q} rj²)] .
We reject H0 when |Z*| > zα/2, where zα/2 is the upper α/2 quantile from the N(0, 1) distribution. This is a two-
sided test. Of course, an equivalent decision rule is to reject H0 when the (two-sided)
probability value is less than α.
EXAMPLE: Suppose that for an observed series of length n = 200, we compute r1 = −0.49, r2 = 0.31, and r3 = −0.13. To test whether an MA(1) model is appropriate, that is, H0: ρk = 0, for all k > 1, we compute
z* = r2 / √[(1/200)(1 + 2(−0.49)²)] ≈ 0.31/0.086 ≈ 3.60.
Because |z*| ≈ 3.60 > z0.025 = 1.96, we would reject H0 and conclude that the MA(1) model is not appropriate.
To test whether an MA(2) model is appropriate, that is, H0: ρk = 0, for all k > 2,
we compute
z* = r3 / √[(1/n)(1 + 2r1² + 2r2²)] = −0.13 / √[(1/200)(1 + 2(−0.49)² + 2(0.31)²)] ≈ −1.42.
Therefore, we would not reject H0 . An MA(2) model is not inconsistent with these
sample autocorrelations.
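R SKETCH: The two test statistics above are easy to compute directly; a minimal sketch using the sample autocorrelations quoted in this example.
n <- 200
r <- c(-0.49, 0.31, -0.13)                               # r1, r2, r3
z.ma1 <- r[2] / sqrt((1/n)*(1 + 2*r[1]^2))               # tests H0: MA(1) is appropriate
z.ma2 <- r[3] / sqrt((1/n)*(1 + 2*r[1]^2 + 2*r[2]^2))    # tests H0: MA(2) is appropriate
c(z.ma1, z.ma2)                                          # approximately 3.60 and -1.42
2*pnorm(-abs(c(z.ma1, z.ma2)))                           # two-sided p-values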
EXAMPLE: Consider the process
Yt = et + 0.7et−1 ,
an MA(1) process with θ = −0.7, where et ∼ iid N (0, 1) and n = 200. In this exam-
ple, we use a technique known as Monte Carlo simulation to simulate the sampling
distributions of the sample autocorrelations r1 , r2 , r5 , and r10 . Here is how this is done:
• We simulate an MA(1) process with θ = −0.7 and compute r1 with the simulated
data. Note that the R function arima.sim can be used to simulate this process.
• We repeat this simulation exercise a large number of times, say, M times. With
each simulated series, we compute r1 .
• We can then plot the M values of r1 in a histogram. This histogram represents the
Monte Carlo sampling distribution of r1 .
• For each simulation, we can also record the values of r2 , r5 , and r10 . We can then
construct their corresponding histograms.
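R SKETCH: A sketch of the Monte Carlo exercise just described; M = 2000 replications is an arbitrary choice.
set.seed(520)
M <- 2000                                    # number of Monte Carlo replications
r.sim <- matrix(0, M, 10)
for (m in 1:M) {
  y <- arima.sim(model=list(ma=0.7), n=200)  # Y_t = e_t + 0.7 e_{t-1}
  r.sim[m, ] <- acf(y, lag.max=10, plot=FALSE)$acf[-1]   # r_1, ..., r_10
}
par(mfrow=c(2,2))
hist(r.sim[,1], main="r1"); hist(r.sim[,2], main="r2")
hist(r.sim[,5], main="r5"); hist(r.sim[,10], main="r10")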
Figure: Monte Carlo sampling distributions (histograms) of r1, r2, r5, and r10 for the simulated MA(1) process.
• Recall that rk ∼ AN(ρk, ckk/n) is a result that requires the sample size n → ∞. With n = 200, we see that the
normal distribution (large-sample) property has largely taken shape.
Figure 6.2: Simulated MA(1) and MA(2) processes with n = 100 and σe² = 1. Moving average error bounds are used in the corresponding sample ACFs; not the white noise error bounds ±2/√n.
Example 6.3. We use R to generate data from two moving average processes, one MA(1) and one MA(2).
We take et ∼ iid N (0, 1) and n = 100. In Figure 6.2, we display the realized time series
and the corresponding sample autocorrelation functions (ACFs).
• However, instead of using the white noise margin of error bounds, that is,
±2/√n = ±2/√100 = ±0.2,
we use the more precise error bounds from the large sample distribution
rk ∼ AN(0, (1/n)(1 + 2 Σ_{j=1}^{q} ρj²)).
• In particular, for each lag k, the (estimated) standard error bounds are placed at
±1.96 √[(1/100)(1 + 2 Σ_{j=1}^{k−1} rj²)].
• That is, error bounds at lag k are computed assuming that the MA(k − 1) model is
appropriate. Values of rk which exceed these bounds are deemed to be statistically
significant. Note that the MA error bounds are not constant, unlike those computed
under the white noise assumption.
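R SKETCH: The acf function can draw exactly these moving average error bounds via its ci.type argument; a sketch using a simulated MA(1) series for illustration.
set.seed(520)
y <- arima.sim(model=list(ma=0.9), n=100)   # an illustrative MA(1) realization
acf(y, ci.type="ma")    # bounds at lag k use the MA(k-1) standard errors
acf(y)                  # default: constant white noise bounds near +/- 1.96/sqrt(n)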
RECALL: We have seen that for MA(q) models, the population ACF ρk is nonzero for
lags k ≤ q and ρk = 0 for lags greater than q. That is, the ACF for an MA(q) process
“drops off” to zero after lag q.
• Therefore, the ACF provides a considerable amount of information about the order
of the dependence when the process is truly a moving average.
• On the other hand, if the process is autoregressive (AR), then the ACF may not
tell us much about the order of the dependence.
• It is therefore worthwhile to develop a function that will behave like the ACF for
MA models, but for use with AR models instead. This function is called the partial
autocorrelation function (PACF).
MOTIVATION : To set our ideas, consider a stationary, zero mean AR(1) process
Yt = ϕYt−1 + et ,
where {et } is zero mean white noise. The autocovariance between Yt and Yt−2 is
γ2 = cov(Yt, Yt−2) = cov(ϕYt−1 + et, Yt−2)
= cov(ϕ(ϕYt−2 + et−1) + et, Yt−2)
= ϕ² var(Yt−2) + 0 + 0 = ϕ²γ0 ,
where γ0 = var(Yt ) = var(Yt−2 ). Recall that et−1 and et are independent of Yt−2 .
• For an MA(1) process, recall that Yt and Yt−2 are uncorrelated. This is not true for an AR(1) process, because Yt depends on Yt−2 through the intervening variable Yt−1.
STRATEGY : Suppose that we “break” the dependence between Yt and Yt−2 in an AR(1)
process by removing (or partialing out) the effect of Yt−1 . To do this, consider the
quantities Yt − ϕYt−1 and Yt−2 − ϕYt−1. Note that
cov(Yt − ϕYt−1, Yt−2 − ϕYt−1) = cov(et, Yt−2 − ϕYt−1) = 0,
because et is independent of Yt−1 and Yt−2. Now, we make the following observations.
• We can think of the quantity
Yt − ϕYt−1
as the prediction error from regressing Yt on Yt−1 (with no intercept; this is not
needed because we are assuming a zero mean process).
• Similarly, the quantity Yt−2 − ϕYt−1
can be thought of as the prediction error from regressing Yt−2 on Yt−1, again with
no intercept.
• Both of these prediction errors are uncorrelated with the intervening variable Yt−1. To see why, note that
cov(Yt − ϕYt−1, Yt−1) = cov(Yt, Yt−1) − ϕ var(Yt−1) = γ1 − ϕγ0 = 0,
because γ1 = ϕγ0 for an AR(1) process; the same argument shows cov(Yt−2 − ϕYt−1, Yt−1) = 0.
EXTENSION: Now consider a stationary, zero mean AR(2) process
Yt = ϕ1 Yt−1 + ϕ2 Yt−2 + et ,
where {et } is zero mean white noise. Suppose that we “break” the dependence between
Yt and Yt−3 in the AR(2) process by removing the effects of both Yt−1 and Yt−2 . That
is, consider the quantities
Yt − ϕ1 Yt−1 − ϕ2 Yt−2
and
Yt−3 − ϕ1 Yt−1 − ϕ2 Yt−2 .
Note that
cov(Yt − ϕ1 Yt−1 − ϕ2 Yt−2 , Yt−3 − ϕ1 Yt−1 − ϕ2 Yt−2 ) = cov(et , Yt−3 − ϕ1 Yt−1 − ϕ2 Yt−2 ) = 0,
because et is independent of Yt−1 , Yt−2 , and Yt−3 . Again, we note the following:
• We can think of the quantity
Yt − ϕ1 Yt−1 − ϕ2 Yt−2
can be thought of as the prediction error from regressing Yt on Yt−1 and Yt−2
(with no intercept).
• Similarly, the quantity Yt−3 − ϕ1 Yt−1 − ϕ2 Yt−2
can be thought of as the prediction error from regressing Yt−3 on Yt−1 and Yt−2,
again with no intercept.
• Both of these prediction errors are uncorrelated with the intervening variables
Yt−1 and Yt−2 .
TERMINOLOGY: Consider a stationary, zero mean process {Yt}. Let Ŷt^(k−1) denote the population regression of Yt on the intervening variables Yt−1, Yt−2, ..., Yt−(k−1), that is,
Ŷt^(k−1) = β1 Yt−1 + β2 Yt−2 + ··· + βk−1 Yt−(k−1) .
Let Ŷt−k^(k−1) denote the population regression of Yt−k on the variables Yt−1, Yt−2, ..., Yt−(k−1), that is,
Ŷt−k^(k−1) = β1 Yt−(k−1) + β2 Yt−(k−2) + ··· + βk−1 Yt−1 .
The partial autocorrelation function (PACF) at lag k is defined by ϕ11 = ρ1 and
ϕkk = corr(Yt − Ŷt^(k−1), Yt−k − Ŷt−k^(k−1)),
for k = 2, 3, ... .
• With regards to Yt and Yt−k, the quantities Ŷt^(k−1) and Ŷt−k^(k−1) are linear functions of the intervening variables Yt−1, Yt−2, ..., Yt−(k−1).
• The quantities Yt − Ŷt^(k−1) and Yt−k − Ŷt−k^(k−1) are called the prediction errors.
• That is, ϕkk measures the correlation between Yt and Yt−k after removing the linear effects of Yt−1, Yt−2, ..., Yt−(k−1).
EXAMPLE: Consider again the stationary, zero mean AR(1) process
Yt = ϕYt−1 + et .
We showed that
cov(Yt − ϕYt−1, Yt−2 − ϕYt−1) = 0.
In this example, the quantities Yt − ϕYt−1 and Yt−2 − ϕYt−1 are the prediction errors from
regressing Yt on Yt−1 and Yt−2 on Yt−1 , respectively. That is, with k = 2, the general
expressions
Ŷt^(k−1) = β1 Yt−1 + β2 Yt−2 + ··· + βk−1 Yt−(k−1)
Ŷt−k^(k−1) = β1 Yt−(k−1) + β2 Yt−(k−2) + ··· + βk−1 Yt−1
become
Ŷt^(2−1) = ϕYt−1
Ŷt−2^(2−1) = ϕYt−1 .
because
IMPORTANT : For the AR(1) model, it follows that ϕ11 ̸= 0 (ϕ11 = ρ1 ) and
Yt = ϕ1Yt−1 + ϕ2Yt−2 + et.

We showed that

cov(Yt − ϕ1Yt−1 − ϕ2Yt−2, Yt−3 − ϕ1Yt−1 − ϕ2Yt−2) = cov(et, Yt−3 − ϕ1Yt−1 − ϕ2Yt−2) = 0.

Note that in this example, the quantities Yt − ϕ1Yt−1 − ϕ2Yt−2 and Yt−3 − ϕ1Yt−1 − ϕ2Yt−2 are the prediction errors from regressing Yt on Yt−1 and Yt−2 and Yt−3 on Yt−1 and Yt−2, respectively. That is, with k = 3, the general expressions

Ŷt^(k−1) = β1Yt−1 + β2Yt−2 + · · · + βk−1Yt−(k−1)
Ŷt−k^(k−1) = β1Yt−(k−1) + β2Yt−(k−2) + · · · + βk−1Yt−1

become

Ŷt^(3−1) = ϕ1Yt−1 + ϕ2Yt−2
Ŷt−3^(3−1) = ϕ1Yt−1 + ϕ2Yt−2.

Therefore, the lag 3 partial autocorrelation is

ϕ33 = corr(Yt − Ŷt^(3−1), Yt−3 − Ŷt−3^(3−1)) = 0,

because

cov(Yt − Ŷt^(3−1), Yt−3 − Ŷt−3^(3−1)) = cov(Yt − ϕ1Yt−1 − ϕ2Yt−2, Yt−3 − ϕ1Yt−1 − ϕ2Yt−2) = 0.

IMPORTANT: For the AR(2) model, it follows that ϕ11 ≠ 0, ϕ22 ≠ 0, and ϕkk = 0 for all k > 2.
GENERALIZATION: For a stationary AR(p) process, it can be shown that

• ϕ11 ≠ 0, ϕ22 ≠ 0, ..., ϕpp ≠ 0; i.e., the first p partial autocorrelations are nonzero

• ϕkk = 0 for all k > p.

In other words, for an AR(p) model, the PACF "drops off" to zero after the pth lag. Therefore, the PACF can help to determine the order of an AR(p) process just like the ACF helps to determine the order of an MA(q) process!
Figure 6.3: Top: AR(1) model with ϕ = 0.9; population ACF (left) and population
PACF (right). Bottom: AR(2) model with ϕ1 = −0.5 and ϕ2 = 0.25; population ACF
(left) and population PACF (right).
We use R to generate observations from two autoregressive processes: an AR(1) with ϕ = 0.9 and an AR(2) with ϕ1 = −0.5 and ϕ2 = 0.25 (the processes in Figure 6.3). We take et ∼ iid N(0, 1) and n = 150. Figure 6.3 displays the true (population) ACF and PACF for these processes. Figure 6.4 displays the simulated time series from each AR model and the sample ACF/PACF.
• The population PACFs in Figure 6.3 display the characteristics that we have just
derived; that is, the AR(1) PACF drops off to zero when the lag k > 1. The AR(2)
PACF drops off to zero when the lag k > 2.
Figure 6.4: Left: AR(1) simulation with et ∼ iid N (0, 1) and n = 150; sample ACF
(middle), and sample PACF (bottom). Right: AR(2) simulation with et ∼ iid N (0, 1)
and n = 150; sample ACF (middle), and sample PACF (bottom).
• Figure 6.4 displays the sample ACF/PACFs. Just as the sample ACF is an esti-
mate of the true (population) ACF, the sample PACF is an estimate of the true
(population) PACF.
• Note that the sample PACF for the AR(1) simulation declares ϕ̂kk insignificant for k > 1. The estimates of ϕkk, for k > 1, are all within the margin of error bounds. The sample PACF for the AR(2) simulation declares ϕ̂kk insignificant for k > 2.
• We will soon discuss why the PACF error bounds here are correct.
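R ILLUSTRATION: The following sketch (not from the notes; object names are illustrative) shows how simulations like those summarized in Figures 6.3-6.4 can be generated and examined in R.

# Simulate the two AR processes in Figure 6.3 and examine the sample ACF/PACF
set.seed(520)                                                    # arbitrary seed
ar1.sim <- arima.sim(model = list(ar = 0.9), n = 150)            # AR(1), phi = 0.9
ar2.sim <- arima.sim(model = list(ar = c(-0.5, 0.25)), n = 150)  # AR(2), phi1 = -0.5, phi2 = 0.25
par(mfrow = c(2, 2))
acf(ar1.sim);  pacf(ar1.sim)   # sample PACF should cut off after lag 1
acf(ar2.sim);  pacf(ar2.sim)   # sample PACF should cut off after lag 2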
Figure 6.5: Top: MA(1) model with θ = 0.9; population ACF (left) and population
PACF (right). Bottom: MA(2) model with θ1 = −0.5 and θ2 = 0.25; population ACF
(left) and population PACF (right).
CURIOSITY : How does the PACF behave for a moving average process? To answer
this, consider the invertible MA(1) model, Yt = et − θet−1 . For this process, it can be
shown that
ϕkk = θ^k (θ² − 1) / (1 − θ^(2(k+1))),

for k ≥ 1. Because |θ| < 1 (invertibility requirement), note that

lim(k→∞) ϕkk = lim(k→∞) θ^k (θ² − 1) / (1 − θ^(2(k+1))) = 0.
That is, the PACF for the MA(1) process decays to zero as the lag k increases, much like
the ACF decays to zero for the AR(1). The same happens in higher order MA models.
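R ILLUSTRATION: A quick numerical check (not from the notes) of the MA(1) PACF formula above against R's ARMAacf function. Note that R writes the MA(1) as Yt = et + θet−1, so the model Yt = et − θet−1 corresponds to ma = -theta.

theta <- 0.9
k <- 1:10
pacf.formula <- theta^k * (theta^2 - 1) / (1 - theta^(2 * (k + 1)))
pacf.R <- ARMAacf(ma = -theta, lag.max = 10, pacf = TRUE)
round(cbind(pacf.formula, pacf.R), 4)   # the two columns agree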
Figure 6.6: Left: MA(1) simulation with et ∼ iid N (0, 1) and n = 150; sample ACF
(middle), and sample PACF (bottom). Right: MA(2) simulation with et ∼ iid N (0, 1)
and n = 150; sample ACF (middle), and sample PACF (bottom).
IMPORTANT : The PACF for an MA process behaves much like the ACF for
an AR process of the same order.
Example 6.5. We use R to generate observations from two moving average processes: an MA(1) with θ = 0.9 and an MA(2) with θ1 = −0.5 and θ2 = 0.25 (the processes in Figure 6.5). We take et ∼ iid N(0, 1) and n = 150. Figure 6.5 displays the true (population) ACF and PACF for these processes. Figure 6.6 displays the simulated time series from each MA model and the sample ACF/PACF.
• The population ACFs in Figure 6.5 display the well-known characteristics; that is,
the MA(1) ACF drops off to zero when the lag k > 1. The MA(2) ACF drops off
to zero when the lag k > 2.
• The population PACF in Figure 6.5 for both the MA(1) and MA(2) decays to zero
as the lag k increases. This is the theoretical behavior exhibited in the ACF for an
AR process.
• The sample versions in Figure 6.6 largely agree with what we know to be true
theoretically.
COMPARISON : The following table succinctly summarizes the behavior of the ACF and
PACF for moving average and autoregressive processes.
AR(p) MA(q)
ACF Tails off Cuts off after lag q
PACF Cuts off after lag p Tails off
Therefore, the ACF is the key tool to help determine the order of a MA process. The
PACF is the key tool to help determine the order of an AR process. For mixed ARMA
processes, we need a different tool (coming up).
COMPUTATION: For a general stationary process, the lag k partial autocorrelation ϕkk can be obtained from the system of equations

ρj = ϕk,1 ρj−1 + ϕk,2 ρj−2 + · · · + ϕk,k ρj−k,   j = 1, 2, ..., k,

where

ρj = corr(Yt, Yt−j).

For known ρ1, ρ2, ..., ρk, we can solve this system for ϕk,1, ϕk,2, ..., ϕk,k−1, ϕkk, and keep the value of ϕkk.

Example 6.6. The ARMAacf function in R will compute partial autocorrelations for any stationary ARMA model. For example, we can compute the PACF for the AR(2) model

Yt = 0.6Yt−1 − 0.4Yt−2 + et,

and for the MA(2) model

Yt = et + 0.6et−1 − 0.4et−2.
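R ILLUSTRATION: A sketch of the ARMAacf calls for these two models (R writes the MA part with plus signs, matching the MA(2) model as displayed above).

ARMAacf(ar = c(0.6, -0.4), lag.max = 10, pacf = TRUE)  # AR(2): PACF cuts off after lag 2
ARMAacf(ma = c(0.6, -0.4), lag.max = 10, pacf = TRUE)  # MA(2): PACF tails off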
ESTIMATION: The partial autocorrelation ϕkk can be estimated by taking the Yule-Walker equations and substituting rj in for the true autocorrelations ρj, that is,

rj = ϕk,1 rj−1 + ϕk,2 rj−2 + · · · + ϕk,k rj−k,   j = 1, 2, ..., k.

This system can then be solved for ϕk,1, ϕk,2, ..., ϕk,k−1, ϕkk as before, but now the solutions are estimates ϕ̂k,1, ϕ̂k,2, ..., ϕ̂k,k−1, ϕ̂kk. This can be done for each k = 1, 2, .... If the true process is AR(p), then, for lags k > p, the estimators ϕ̂kk are approximately normal with mean 0 and variance 1/n when n is large. Therefore, ±1.96/√n serve as margin of error bounds for judging whether ϕ̂kk is significantly different from zero, and we can assess whether a specific AR model is appropriate in the same way that we tested whether or not a specific MA model was appropriate using the sample autocorrelations rk. See Example 6.1 (notes).
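R ILLUSTRATION: The following sketch (an illustrative helper, not from the notes) solves the sample Yule-Walker equations for ϕ̂kk and compares the result with R's built-in pacf function.

sample.pacf <- function(y, K = 10) {
  r <- acf(y, lag.max = K, plot = FALSE)$acf[-1]      # r_1, ..., r_K
  sapply(1:K, function(k) {
    if (k == 1) return(r[1])
    R <- toeplitz(c(1, r[1:(k - 1)]))                 # k x k matrix of autocorrelations
    solve(R, r[1:k])[k]                               # keep the last coefficient, phi_kk hat
  })
}
set.seed(520)
y <- arima.sim(model = list(ar = 0.9), n = 150)
round(cbind(sample.pacf(y), pacf(y, lag.max = 10, plot = FALSE)$acf[, 1, 1]), 4)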
REMARK : We have learned that the autocorrelation function (ACF) can help us deter-
mine the order of an MA(q) process because ρk = 0, for all lags k > q. Similarly, the
partial autocorrelation function (PACF) can help us determine the order of an AR(p)
process because ϕkk = 0, for all lags k > p. Therefore, in the sample versions of the
ACF and PACF, we can look for values of rk and ϕ̂kk, respectively, that are consistent with this theory. We have also discussed formal testing procedures that can be used to aid in these determinations.

6.4 The Extended Autocorrelation Function
MOTIVATION : Suppose {et } is a zero mean white noise process with var(et ) = σe2 .
Recall that a stationary ARMA(p, q) process can be expressed as

ϕ(B)Yt = θ(B)et,

where

ϕ(B) = (1 − ϕ1B − ϕ2B² − · · · − ϕpB^p)
θ(B) = (1 − θ1B − θ2B² − · · · − θqB^q).

If the AR coefficients ϕ1, ϕ2, ..., ϕp were known, then the filtered process

Wt ≡ ϕ(B)Yt = (1 − ϕ1B − ϕ2B² − · · · − ϕpB^p)Yt

would follow a pure MA(q) process

Wt = (1 − θ1B − θ2B² − · · · − θqB^q)et.

Of course, the {Wt} process is not observed because Wt depends on ϕ1, ϕ2, ..., ϕp, which are unknown parameters.

STRATEGY: Suppose that we regress Yt on Yt−1, Yt−2, ..., Yt−p (that is, use the p lagged versions of Yt as independent variables in a multiple linear regression) and use ordinary least squares to fit the no-intercept model

Yt = ϕ1Yt−1 + ϕ2Yt−2 + · · · + ϕpYt−p + ϵt,

where ϵt denotes a generic error term (not the white noise term in the MA process). This would produce estimates ϕ̂1, ϕ̂2, ..., ϕ̂p from which we could compute

Ŵt = Yt − ϕ̂1Yt−1 − ϕ̂2Yt−2 − · · · − ϕ̂pYt−p.

These values (which are merely the residuals from the regression) serve as proxies for the true {Wt} process, and we could now treat these residuals as our "data."

• For example, if we fit an AR(2) model Yt = ϕ1Yt−1 + ϕ2Yt−2 + ϵt and the residuals Ŵt look to follow an MA(2) process, then this would suggest that a mixed ARMA(2,2) model is worthy of consideration.
PROBLEM : We have just laid out a sensible strategy on how to select candidate ARMA
models; i.e., choosing values for p and q. The problem is that ordinary least squares
regression estimates ϕb1 , ϕb2 , ..., ϕbp are inconsistent estimates of ϕ1 , ϕ2 , ..., ϕp when the
underlying process is ARMA(p, q). Inconsistency means that the estimates ϕb1 , ϕb2 , ..., ϕbp
estimate the wrong things (in a large-sample sense). Therefore, the strategy that we have
just described could lead to incorrect identification of p and q.
SOLUTION: The following sequential regression strategy gets around this problem.

0. Consider using ordinary least squares to fit the same no-intercept AR(p) model

Yt = ϕ1Yt−1 + ϕ2Yt−2 + · · · + ϕpYt−p + ϵt,

where ϵt denotes the error term (not the white noise term in an MA process). If the true process is an ARMA(p, q), then the least squares estimates from the regression, say,

ϕ̂1^(0), ϕ̂2^(0), ..., ϕ̂p^(0),

will be inconsistent, and the resulting residuals ϵ̂t^(0) will not be white noise. In fact, if q ≥ 1 (so that the true process is ARMA), then the residuals ϵ̂t^(0) and lagged versions of them will contain information about the process {Yt}.

1. Because the residuals ϵ̂t^(0) contain information about the value of q, we first fit the model

Yt = ϕ1^(1)Yt−1 + ϕ2^(1)Yt−2 + · · · + ϕp^(1)Yt−p + β1^(1)ϵ̂t−1^(0) + ϵt^(1).

Note that we have added the lag 1 residuals ϵ̂t−1^(0) from the initial model fit as a predictor in the regression.

• If the order of the MA part of the ARMA process is truly q = 1, then the least squares estimates ϕ̂1^(1), ϕ̂2^(1), ..., ϕ̂p^(1) will be consistent; i.e., they will estimate the true AR parameters in large samples.

• If q > 1, then the estimates will be inconsistent and the residual process {ϵ̂t^(1)} will not be white noise.

2. If q > 1, then the residuals ϵ̂t^(1) from the most recent regression still contain information about the value of q, so we next fit the model

Yt = ϕ1^(2)Yt−1 + ϕ2^(2)Yt−2 + · · · + ϕp^(2)Yt−p + β1^(2)ϵ̂t−1^(1) + β2^(2)ϵ̂t−2^(0) + ϵt^(2).

Note that in this model, we have added the lag 2 residuals ϵ̂t−2^(0) from the initial model fit as well as the lag 1 residuals ϵ̂t−1^(1) from the most recent fit.

• If the order of the MA part of the ARMA process is truly q = 2, then the least squares estimates ϕ̂1^(2), ϕ̂2^(2), ..., ϕ̂p^(2) will be consistent; i.e., they will estimate the true AR parameters in large samples.

• If q > 2, then the estimates will be inconsistent and the residual process {ϵ̂t^(2)} will not be white noise.

3. We continue this iterative process, at each step adding the residuals from the most recent fit in the same fashion. For example, at the next step, we would fit

Yt = ϕ1^(3)Yt−1 + ϕ2^(3)Yt−2 + · · · + ϕp^(3)Yt−p + β1^(3)ϵ̂t−1^(2) + β2^(3)ϵ̂t−2^(1) + β3^(3)ϵ̂t−3^(0) + ϵt^(3).
We continue fitting higher order models until residuals (from the most recent fit)
resemble a white noise process.
EXTENDED ACF: In practice, the true orders p and q of the ARMA(p, q) model are unknown and have to be estimated. Based on the strategy outlined, however, we can estimate p and q using a new type of function. For an AR(m) model fit, define the mth sample extended autocorrelation function (EACF) ρ̂j^(m) as the sample ACF of the residual process obtained at the jth iteration of the sequential fitting procedure just described, for m = 0, 1, 2, ... and j = 0, 1, 2, .... Here, the subscript j refers to the iteration number in the aforementioned sequential fitting process (hence, j refers to the order of the MA part). The value m refers to the AR part of the process. Usually the maximum values of m and j are taken to be 10 or so.
REPRESENTATION: It is useful to arrange the estimates ρ̂j^(m) in a two-way table where one direction corresponds to the AR part and the other direction corresponds to the MA part:

                        MA
AR      0        1        2        3        4      · · ·
0    ρ̂1^(0)   ρ̂2^(0)   ρ̂3^(0)   ρ̂4^(0)   ρ̂5^(0)   · · ·
1    ρ̂1^(1)   ρ̂2^(1)   ρ̂3^(1)   ρ̂4^(1)   ρ̂5^(1)   · · ·
2    ρ̂1^(2)   ρ̂2^(2)   ρ̂3^(2)   ρ̂4^(2)   ρ̂5^(2)   · · ·
3    ρ̂1^(3)   ρ̂2^(3)   ρ̂3^(3)   ρ̂4^(3)   ρ̂5^(3)   · · ·
4    ρ̂1^(4)   ρ̂2^(4)   ρ̂3^(4)   ρ̂4^(4)   ρ̂5^(4)   · · ·
⋮       ⋮        ⋮        ⋮        ⋮        ⋮      · · ·

Mathematical arguments show that, as n → ∞,

ρ̂j^(m) −→ 0,        for 0 ≤ m − p < j − q
ρ̂j^(m) −→ c ≠ 0,    otherwise.
Therefore, the true large-sample extended autocorrelation function (EACF) table for an
ARMA(1, 1) process, for example, looks like
MA
AR 0 1 2 3 4 5 ···
0 x x x x x x ···
1 x 0 0 0 0 0 ···
2 x x 0 0 0 0 ···
3 x x x 0 0 0 ···
4 x x x x 0 0 ···
5 x x x x x 0 ···
.. .. .. .. .. .. ..
. . . . . . . ···
In this table, the "0" entries correspond to the zero limits of ρ̂j^(m). The "x" entries correspond to limits of ρ̂j^(m) which are nonzero. Therefore, the geometric pattern formed by the zeros is a "wedge" with a tip at (1,1). This tip corresponds to the values of p = 1 and q = 1 in the ARMA model.
The true large-sample EACF table for an ARMA(2, 2) process looks like
MA
AR 0 1 2 3 4 5 ···
0 x x x x x x ···
1 x x x x x x ···
2 x x 0 0 0 0 ···
3 x x x 0 0 0 ···
4 x x x x 0 0 ···
5 x x x x x 0 ···
.. .. .. .. .. .. ..
. . . . . . . ···
In this table, we see that the tip of the wedge is at the point (2,2). This tip corresponds
to the values of p = 2 and q = 2 in the ARMA model.
The true large-sample EACF table for an ARMA(2, 1) process looks like
MA
AR 0 1 2 3 4 5 ···
0 x x x x x x ···
1 x x x x x x ···
2 x 0 0 0 0 0 ···
3 x x 0 0 0 0 ···
4 x x x 0 0 0 ···
5 x x x x 0 0 ···
.. .. .. .. .. .. ..
. . . . . . . ···
In this table, we see that the tip of the wedge is at the point (2,1). This tip corresponds
to the values of p = 2 and q = 1 in the ARMA model.
DISCLAIMER: The tables shown above represent theoretical results for infinitely large
sample sizes. Of course, with real data, we would not expect the tables to follow such a
clear-cut pattern. Remember, the sample EACF values ρ̂j^(m) are estimates, so they have
inherent sampling variation! This is important to keep in mind. For some data sets, the
sample EACF table may reveal 2 or 3 models which are consistent with the estimates.
In other situations, the sample EACF may be completely ambiguous and give little or
no information, especially if the sample size n is small.
ERROR BOUNDS: If the corresponding residual process is truly white noise, then the sample extended autocorrelation function estimator satisfies

ρ̂j^(m) ∼ AN(0, 1/(n − m − j)),

when n is large. Therefore, we would expect 95 percent of the estimates ρ̂j^(m) to fall within ±1.96/√(n − m − j). Values outside these cutoffs are classified with an "x" in the sample EACF. Values within these bounds are classified with a "0."
Example 6.7. We use R to simulate data from three different ARMA(p, q) processes and examine the sample EACF produced in R. The first simulation is from an ARMA(1,1) process; its sample EACF is shown below.
AR/MA 0 1 2 3 4 5 6 7 8 9 10 11 12 13
0 x x x x x o o o o o o o o o
1 x o o o o o o o o o o o o o
2 x o o o x o o o o o o o o o
3 x x x o o o o o o o o o o o
4 x o x o x o o o o o o o o o
5 x x x x o o o o o o o o o o
6 x x o x x o o o o o o o o o
7 x x o x o x o o o o o o o o
INTERPRETATION : This sample EACF agrees largely with the theory, which says that
there should be a wedge of zeros with tip at (1,1); the “x”s at (2,4) and (4,4) may be false
positives. If one is willing to additionally assume that the “x” at (3,2) is a false positive,
then an ARMA(2,1) model would also be deemed consistent with these estimates.

The second simulation is from an ARMA(2,2) process; its sample EACF is shown below.
AR/MA 0 1 2 3 4 5 6 7 8 9 10 11 12 13
0 x x x o x o o o o o x o o o
1 x x x o x o o o o o x o x o
2 x o o o o o o o o o x o o o
3 x x o o o o o o o o o o o o
4 x x o x o o o o o o o o o o
5 x x x x o o o o o o o o o o
6 x x x x o o o o x o o o o o
7 x o x x x o o o o o o o o o
INTERPRETATION : This sample EACF also agrees largely with the theory, which says
that there should be a wedge of zeros with tip at (2,2). If one is willing to additionally
assume that the “x” at (4,3) is a false positive, then an ARMA(2,1) model would also be
deemed consistent with these estimates.
Finally, the third simulation is from an ARMA(3,3) process; its sample EACF is shown below.
AR/MA 0 1 2 3 4 5 6 7 8 9 10 11 12 13
0 x x x x x x x x x x x x x x
1 x x x o x x x x x x x o x x
2 x x x o x x x x x x x o x x
3 x x o x x x x x o o o o o o
4 x x o x o o o o o o o o o o
5 x o o x o o o o o o o o o o
6 x o o x o x o o o o o o o o
7 x o o x o o o o o o o o o o
INTERPRETATION : This sample EACF does not agree with the theory, which says
that there should be a wedge of zeros with tip at (3,3). There is more of a “block” of
zeros; not a wedge. If we saw this EACF in practice, it would not be all that helpful in
model selection.
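R ILLUSTRATION: Sample EACF tables like those above can be produced with the eacf function in the TSA package. The sketch below is illustrative; the parameter values are not necessarily those used to generate the tables shown.

library(TSA)
set.seed(520)
y <- arima.sim(model = list(ar = 0.7, ma = 0.4), n = 200)   # an ARMA(1,1) simulation
eacf(y, ar.max = 7, ma.max = 13)    # prints an x/o table; look for the wedge tip at (1,1)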
6.5 Nonstationarity
RECALL: Suppose {et} is a zero mean white noise process with var(et) = σe². An ARIMA(p, d, q) process can be written as

ϕ(B)(1 − B)^d Yt = θ(B)et,

where

ϕ(B) = (1 − ϕ1B − ϕ2B² − · · · − ϕpB^p)
θ(B) = (1 − θ1B − θ2B² − · · · − θqB^q)

and

(1 − B)^d Yt = ∇^d Yt

is the series of dth differences.
Up until now, we have discussed three functions to help us identify possible values for p
and q in stationary ARMA processes.
• The sample ACF can be used to determine the order q of a purely MA process.
• The sample PACF can be used to determine the order p of a purely AR process.
• The sample EACF can be used to determine the orders p and q of a mixed ARMA
process.
• When there is a clear trend in the data (e.g., linear) and the sample ACF for a
series decays very slowly, take first differences.
• If the sample ACF for the first differences resembles that of a stationary ARMA process (the ACF decays quickly), then take d = 1 in the ARIMA(p, d, q) family and use the ACF, PACF, and EACF (on the first differences) to identify plausible values of p and q.
• If the sample ACF for the first differences still exhibits a slow decay across lags,
take second differences and use d = 2. One can then use the ACF, PACF, and
EACF (on the second differences) to identify plausible values of p and q. There
should rarely be a need to consider values of d > 2. In fact, I have found that it is
not all that often that even second differences (d = 2) are needed.
EXAMPLE: To see the effect of differencing (and of overdifferencing), consider the IMA(1,1) process

Yt = Yt−1 + et − θet−1,

where |θ| < 1 and {et} is zero mean white noise. The first differences are given by

∇Yt = Yt − Yt−1 = et − θet−1,

which is a stationary and invertible MA(1) process. The second differences are given by

∇²Yt = ∇Yt − ∇Yt−1 = et − (1 + θ)et−1 + θet−2 = [1 − (1 + θ)B + θB²]et = (1 − B)(1 − θB)et,

which is an MA(2) process that is not invertible (its MA polynomial contains a unit root). Differencing a second time here is unnecessary and only complicates the model.
INFERENCE : Instead of relying on the sample ACF, which may be subjective in “bor-
derline cases,” we can formally test whether or not an observed time series is stationary
using the methodology proposed by Dickey and Fuller (1979).
SETUP: Consider the model

Yt = αYt−1 + Xt,

where {Xt} is a stationary AR(k) process satisfying ϕ(B)Xt = et. The process {Yt} then satisfies

ϕ*(B)Yt = et,

where

ϕ*(B) = ϕ(B)(1 − αB).

• If α = 1, then ϕ*(x) has a unit root (at x = 1), and {Yt} is nonstationary; its first differences ∇Yt = Xt follow a stationary AR(k) model.

• If −1 < α < 1, then ϕ*(x) does not have a unit root, and {Yt} is a stationary AR(k + 1) process.

The augmented Dickey-Fuller (ADF) test is a test of

H0 : α = 1 (nonstationarity)

versus

H1 : α < 1 (stationarity).
IMPLEMENTATION: Dickey and Fuller advocated that this test could be carried out using least squares regression. To see how, note that the model for Yt can be rewritten as

∇Yt = aYt−1 + b1∇Yt−1 + b2∇Yt−2 + · · · + bk∇Yt−k + et,

where the coefficient a equals zero exactly when α = 1; that is,

H0 : α = 1 is true ⇐⇒ H0 : a = 0 is true.

Therefore, we carry out the test by regressing ∇Yt on Yt−1, ∇Yt−1, ∇Yt−2, ..., ∇Yt−k. We can then decide between H0 and H1 by examining the size of the least-squares estimate of a; loosely speaking, an estimate close to zero is consistent with H0 (nonstationarity), while a clearly negative estimate is evidence for H1 (stationarity).

REMARK: The test statistic needed to test H0 versus H1, and its large-sample distribution, are complicated (the test statistic is similar to the t test statistic from ordinary least squares regression; however, the large-sample distribution is not t). Fortunately, there is an R function to implement the test automatically. The only thing we need to do is choose a value of k in the model above; that is, the value k is the order of the AR process for ∇Yt. Of course, the true value of k is unknown. However, we can have R determine the "best value" of k using model selection criteria that we will discuss in the next subsection.
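R ILLUSTRATION: A sketch of the ADF test using the adf.test function in the tseries package (an alternative to the uroot implementation mentioned below); the simulated series is illustrative only.

library(tseries)
set.seed(520)
y <- cumsum(rnorm(150))                   # a random walk (nonstationary)
k <- ar(diff(y))$order                    # choose k from an AR fit to the differences
adf.test(y, alternative = "stationary", k = k)
# A large p-value does not refute H0: alpha = 1 (nonstationarity).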
Figure 6.7: Left: Global temperature data. Right: Los Angeles annual rainfall data.
Example 6.8. We illustrate the ADF test using two data sets from Chapter 1, the global
temperature data set (Example 1.1, pp 2, notes) and the Los Angeles annual rainfall
data set (Example 1.13, pp 14, notes). For the global temperature data, the command
ar(diff(globtemp)) is used to determine the “best” value of k for the differences. Here,
it is k = 3. The ADF test output (not reproduced here) automatically produces the p-value for the test
H0 : α = 1 (nonstationarity)
versus
H1 : α < 1 (stationarity).
The large p-value here (> 0.10) does not refute H0 : α = 1. There is insufficient evidence
to conclude that the global temperature process is stationary. For the LA rainfall data,
the command ar(diff(larain)) is used to determine the best value of k, which is k = 4.
The small p-value here (p = 0.015) indicates strong evidence against H0 : α = 1. There
is sufficient evidence to conclude that the LA rainfall process is stationary.
DISCUSSION : When performing the ADF test, some words of caution are in order.
• Unit root tests are known to have relatively low power, especially when the sample size is small or the process is stationary but close to having a unit root. Because of this, the ADF test outlined here may not have sufficient power to reject H0 when the process is truly stationary. In addition, the test may reject H0 incorrectly because a different form of nonstationarity is present (one that cannot be overcome merely by taking first differences).
• The ADF test outcome must be interpreted with these points in mind, especially
when the sample size n is small. In other words, do not blindly interpret the ADF
test outcome as a yes/no indicator of nonstationarity.
IMPORTANT : To implement the ADF test in R, we need to install the uroot package.
Installing this package has to be done manually.
6.6 Other Model Selection Criteria

TERMINOLOGY: Akaike's information criterion (AIC) is defined as

AIC = −2 ln L + 2k,
where ln L is the natural logarithm of the maximized likelihood function (computed under
a distributional assumption for Y1 , Y2 , ..., Yn ) and k is the number of parameters in the
model (excluding the white noise variance). In a stationary no-intercept ARMA(p, q)
model, there are k = p + q parameters.
• The likelihood function gives (loosely speaking) the “probability of the data,” so
we would like for it to be as large as possible. This is equivalent to wanting −2 ln L
to be as small as possible.
• The 2k term serves as a penalty, namely, we do not want models with too many
parameters (adhering to the Principle of Parsimony).
• The AIC is used more generally for model selection in statistics (not just in the
analysis of time series data). Herein, we restrict attention to its use in selecting
candidate stationary ARMA(p, q) models.
TERMINOLOGY: The Bayesian information criterion (BIC) is defined as

BIC = −2 ln L + k ln n,
where ln L is the natural logarithm of the maximized likelihood function and k is the
number of parameters in the model (excluding the white noise variance). In a stationary
no-intercept ARMA(p, q) model, there are k = p + q parameters.
Figure 6.8: Ventilation data. Left: Original series, ventilation (L/min). Right: First differences of the series.
• Both AIC and BIC require the maximization of a log likelihood function (we assume
normality). When compared to AIC, BIC offers a stiffer penalty for overparame-
terized models since ln n will often exceed 2.
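R ILLUSTRATION: A sketch (not from the notes) comparing candidate ARMA models by AIC and BIC. R's parameter count includes the intercept and σe²; because these appear in every candidate model below, the rankings are unaffected.

set.seed(520)
y <- arima.sim(model = list(ar = 0.6, ma = -0.4), n = 150)
for (ord in list(c(1, 0, 0), c(0, 0, 1), c(1, 0, 1), c(2, 0, 1))) {
  fit <- arima(y, order = ord, method = "ML")
  ll  <- logLik(fit)
  bic <- -2 * as.numeric(ll) + attr(ll, "df") * log(length(y))   # BIC = -2 ln L + k ln n
  cat(sprintf("ARMA(%d,%d): AIC = %.2f, BIC = %.2f\n", ord[1], ord[3], AIC(fit), bic))
}
# Smaller values are better; BIC penalizes extra parameters more heavily than AIC.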
Example 6.9. We use the BIC as a means for model selection with the ventilation data
in Example 1.10 (pp 11, notes); see also Example 5.2 (pp 117, notes). Figure 6.8 shows
the original series (left) and the first difference process (right). The BIC output (next
page) is provided by R. Remember that the smaller the BIC, the better the model.
• The original ventilation series displays a clear linear trend. The ADF test (results
not shown) provides a p-value of p > 0.10, indicating that the series is difference
nonstationary.
• We therefore find the “best” ARMA(p, q) model for the first differences; that is, we
are taking d = 1, so we are essentially finding the “best” ARIMA(p, 1, q) model.
• The BIC output in Figure 6.9 shows that the best model (smallest BIC) for the differences contains a lag 1 error component; i.e., q = 1.
Figure 6.9: Ventilation data. ARMA best subsets output for the first difference process
{∇Yt } using the BIC.
• Therefore, the model that provides the smallest BIC for {∇Yt } is an MA(1).
• In other words, the “best” model for the original ventilation series, as judged by
the BIC, is an ARIMA(0,1,1); i.e., an IMA(1,1).
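R ILLUSTRATION: Output like Figure 6.9 can be produced with the armasubsets function in the TSA package. The sketch below is illustrative; "ventilation" is a placeholder name for the series used in Example 6.9.

library(TSA)
res <- armasubsets(y = diff(ventilation), nar = 6, nma = 6,
                   y.name = "diff.vent", ar.method = "ols")
plot(res)    # rows are ordered by BIC; shaded cells mark included lags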
DISCLAIMER: Model selection according to BIC (or AIC) does not always provide
“selected” models that are easily interpretable. Therefore, while AIC and BIC are model
selection tools, they are not the only tools available to us. The ACF, PACF, and EACF
may direct us to models that are different than those deemed “best” by the AIC/BIC.
6.7 Summary
SUMMARY : Here is a summary of the techniques that we have reviewed this chapter.
This summary is presented in an “algorithm” format to help guide the data analyst
through the ARIMA model selection phase. Advice is interspersed throughout.
1. Plot the time series data.

• Examining the time series plot, we can get an idea about whether the series
contains a trend, seasonality, outliers, nonconstant variance, etc. This under-
standing often provides a basis for postulating a possible data transformation.
• Examine the time series plot for nonconstant variance and perform a suitable
transformation (from the Box-Cox family); see Chapter 5. Alternatively, the
data analyst can try several transformations and choose the one that does the
best at stabilizing the variance.
2. Compute the sample ACF and the sample PACF of the original series (or trans-
formed series) and further confirm the need for differencing.
• If the sample ACF decays very, very slowly, this usually indicates that it is a
good idea to take first differences.
• Tests for stationarity (ADF test) can also be implemented at this point on the
original or transformed series. In a borderline case, differencing is generally
recommended.
• Higher order differencing may be needed (however, I have found that it gen-
erally is not). One can perform an ADF test for stationarity of the first
differences to see if taking second differences is warranted. In nearly all cases,
d is not larger than 2 (i.e., taking second differences).
• Some authors argue that the consequences of overdifferencing are much less
serious than those of underdifferencing. However, overdifferencing can create
model identifiability problems.
3. Compute the sample ACF, the sample PACF, and the sample EACF of the original,
properly transformed, properly differenced, or properly transformed/differenced se-
ries to identify the orders of p and q.
• Usually, p and q are not larger than 4 (excluding seasonal models, which we
have yet to discuss).
• Use knowledge of the patterns of the theoretical versions of these functions (summarized earlier in this chapter) to identify candidate models.

• "The art of model selection is very much like an FBI agent's criminal search. Most criminals disguise themselves to avoid being recognized." This is also true of the ACF, PACF, and EACF. Sampling variation can disguise the theoretical ACF/PACF/EACF patterns.
• BIC and AIC can also be used to identify models consistent with the data.
REMARK : It is rare, after going through all of this, that the analyst will be able to
identify a single model that is a “clear-cut” choice. It is more likely that a small number
of candidate models have been identified from the steps above.
NEXT STEP : With our (hopefully small) set of candidate models, we then move forward
to parameter estimation and model diagnostics (model checking). These topics are the
subjects of Chapter 7 and Chapter 8, respectively. Once a final model has been chosen,
fit, and diagnosed, forecasting then becomes the central focus (Chapter 9).
7 Estimation
7.1 Introduction
RECALL: Suppose that {et } is a zero mean white noise process with var(et ) = σe2 . In
general, an ARIMA(p, d, q) process can be written as

ϕ(B)(1 − B)^d Yt = θ(B)et,

where

ϕ(B) = (1 − ϕ1B − ϕ2B² − · · · − ϕpB^p)
θ(B) = (1 − θ1B − θ2B² − · · · − θqB^q)

and

(1 − B)^d Yt = ∇^d Yt
is the series of dth differences. In the last chapter, we were primarily concerned with
selecting values of p, d, and q which were consistent with the observed (or suitably
transformed) data, that is, we were concerned with model selection.
PREVIEW : In this chapter, our efforts are directed towards estimating parameters in
this class of models. In doing so, it suffices to restrict attention to stationary ARMA(p, q)
models. If d > 0 (which corresponds to a nonstationary process), the methodology de-
scribed herein can be applied to the suitably differenced process (1 − B)d Yt = ∇d Yt .
Therefore, when we write Y1 , Y2 , ..., Yn to represent our “data” in this chapter, it is
understood that Y1 , Y2 , ..., Yn may denote the original data, the differenced data, trans-
formed data (e.g., log-transformed, etc.), or possibly data that have been transformed
and differenced.
7.2 Method of Moments

AR(1): Consider the AR(1) model

Yt = ϕYt−1 + et,
where {et } is zero mean white noise with var(et ) = σe2 . In this model, there are two
parameters: ϕ and σe2 . The MOM estimator of ϕ is obtained by setting the population
lag one autocorrelation ρ1 equal to the sample lag one autocorrelation r1 and solving for
ϕ, that is,
ρ1 = r1.

For this model, we know ρ1 = ϕ (see Chapter 4). Therefore, the MOM estimator of ϕ is

ϕ̂ = r1.

AR(2): In the AR(2) model

Yt = ϕ1Yt−1 + ϕ2Yt−2 + et,
there are three parameters: ϕ1 , ϕ2 , and σe2 . To find the MOM estimators of ϕ1 and ϕ2 ,
recall the Yule-Walker equations (derived in Chapter 4) for the AR(2):
ρ 1 = ϕ1 + ρ 1 ϕ2
ρ 2 = ρ 1 ϕ1 + ϕ2 .
Substituting the sample autocorrelations r1 and r2 for ρ1 and ρ2 gives

r1 = ϕ1 + r1ϕ2
r2 = r1ϕ1 + ϕ2.
Solving this system for ϕ1 and ϕ2 gives the MOM estimators

ϕ̂1 = r1(1 − r2)/(1 − r1²)
ϕ̂2 = (r2 − r1²)/(1 − r1²).

AR(p): In the general AR(p) model, there are p + 1 parameters: ϕ1, ϕ2, ..., ϕp and σe². We again recall the Yule-Walker
equations from Chapter 4:
ρ1 = ϕ1 + ϕ2ρ1 + ϕ3ρ2 + · · · + ϕpρp−1
ρ2 = ϕ1ρ1 + ϕ2 + ϕ3ρ1 + · · · + ϕpρp−2
⋮
ρp = ϕ1ρp−1 + ϕ2ρp−2 + · · · + ϕp.

Substituting the sample autocorrelations rk for the ρk gives the system

r1 = ϕ1 + ϕ2r1 + ϕ3r2 + · · · + ϕprp−1
r2 = ϕ1r1 + ϕ2 + ϕ3r1 + · · · + ϕprp−2
⋮
rp = ϕ1rp−1 + ϕ2rp−2 + · · · + ϕp.
The MOM estimators ϕb1 , ϕb2 , ..., ϕbp solve this system of equations.
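R ILLUSTRATION: A sketch (an illustrative helper, not from the notes) that solves the sample Yule-Walker equations for an AR(p) MOM fit; in practice, use ar(..., method = "yw") as in Example 7.2.

yw.mom <- function(y, p) {
  r   <- acf(y, lag.max = p, plot = FALSE)$acf[-1]       # r_1, ..., r_p
  phi <- solve(toeplitz(c(1, r[seq_len(p - 1)])), r)     # sample Yule-Walker solution
  list(phi = phi, sigma2 = (1 - sum(phi * r)) * var(y))  # MOM white noise variance
}
set.seed(520)
y <- arima.sim(model = list(ar = c(0.6, -0.4)), n = 200)
yw.mom(y, p = 2)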
REMARK : Calculating MOM estimates (or any estimates) in practice should be done
using software. The MOM approach may produce estimates ϕ̂1, ϕ̂2, ..., ϕ̂p that fall "outside" the stationarity region, even if the process is truly stationary! That is, the estimated AR(p) polynomial, say,

ϕ̂(x) = 1 − ϕ̂1x − ϕ̂2x² − · · · − ϕ̂px^p,

may possess roots which do not exceed 1 in absolute value (or modulus).

MA(1): Consider the MA(1) model

Yt = et − θet−1,
where {et } is zero mean white noise with var(et ) = σe2 . In this model, there are two
parameters: θ and σe2 . To find the MOM estimator of θ, we solve
ρ1 = −θ/(1 + θ²) = r1 ⇐⇒ r1θ² + θ + r1 = 0

for θ. Using the quadratic formula, we find that the solutions to this equation are

θ = [−1 ± √(1 − 4r1²)] / (2r1).

• If |r1| > 0.5, then there are no real solutions, so the MOM estimate of θ does not exist.
• If |r1 | = 0.5, then the solutions for θ are ±1, which corresponds to an MA(1) model
that is not invertible.
• If |r1| < 0.5, the invertible solution for θ is the MOM estimator

θ̂ = [−1 + √(1 − 4r1²)] / (2r1).
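R ILLUSTRATION: A small helper (not from the notes) implementing the MA(1) MOM estimator just described.

mom.ma1 <- function(y) {
  r1 <- acf(y, lag.max = 1, plot = FALSE)$acf[2]     # lag one sample autocorrelation
  if (abs(r1) >= 0.5) return(NA)                     # no invertible MOM estimate exists
  (-1 + sqrt(1 - 4 * r1^2)) / (2 * r1)               # invertible root of r1*theta^2 + theta + r1 = 0
}
set.seed(520)
y <- arima.sim(model = list(ma = -0.7), n = 100)     # R convention: ma = -theta, so theta = 0.7
mom.ma1(y)                                           # should be near 0.7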
NOTE: For higher order MA models, the difficulties become more pronounced. For the general MA(q) case, we are left to solve the highly nonlinear system of equations

rk = (−θk + θ1θk+1 + θ2θk+2 + · · · + θq−kθq) / (1 + θ1² + θ2² + · · · + θq²),   k = 1, 2, ..., q,

for θ1, θ2, ..., θq. Just as in the MA(1) case, there will likely be multiple solutions, of which at most one will correspond to a fitted invertible model.
IMPORTANT : MOM estimates are not recommended for use with MA models. They
are hard to obtain and (as we will see) they are not necessarily “good” estimates.
ARMA(1,1): Consider the ARMA(1,1) model

Yt = ϕYt−1 + et − θet−1,

where {et} is zero mean white noise with var(et) = σe². In this model, there are three parameters: ϕ, θ, and σe². Recall from Chapter 4 that

ρk = [(1 − θϕ)(ϕ − θ) / (1 − 2θϕ + θ²)] ϕ^(k−1),

for k ≥ 1. Because ρ2/ρ1 = ϕ, the MOM estimator of ϕ is

ϕ̂ = r2/r1.

Substituting ϕ̂ and r1 into the k = 1 equation then leaves one equation to solve for θ. This is a quadratic equation in θ, so there are two solutions. The invertible solution θ̂ (if any) is kept; i.e., the solution for which the estimated MA polynomial θ̂(x) = 1 − θ̂x has its root larger than 1 in absolute value.
GOAL: We now wish to estimate the white noise variance σe². To do this, we first note that for any stationary ARMA model, the process variance γ0 = var(Yt) can be estimated by the sample variance

S² = [1/(n − 1)] Σ_{t=1}^{n} (Yt − Ȳ)².

• For an AR(p) process, recall that

γ0 = σe² / (1 − ϕ1ρ1 − ϕ2ρ2 − · · · − ϕpρp) =⇒ σe² = (1 − ϕ1ρ1 − ϕ2ρ2 − · · · − ϕpρp)γ0,

so the MOM estimator of σe² is obtained by substituting ϕ̂k for ϕk, rk for ρk, and S² for γ0; i.e.,

σ̂e² = (1 − ϕ̂1r1 − ϕ̂2r2 − · · · − ϕ̂prp)S².

• For an MA(q) process, recall that

γ0 = (1 + θ1² + θ2² + · · · + θq²)σe².
Therefore, the MOM estimator of σe² is obtained by substituting θ̂k for θk and S² for γ0. We obtain

σ̂e² = S² / (1 + θ̂1² + θ̂2² + · · · + θ̂q²).
7.2.5 Examples
Example 7.1. Suppose {et } is zero mean white noise with var(et ) = σe2 . In this example,
we use Monte Carlo simulation to approximate the sampling distributions of the MOM
estimators of θ and σe2 in the MA(1) model
Yt = et − θet−1 .
We take θ = 0.7, σe2 = 1, and n = 100. Recall that the MOM approach is generally not
recommended for use with MA models. We will now see why this is true.
• We simulate M = 2000 MA(1) time series, each of length n = 100, with θ = 0.7
and σe2 = 1.
Figure 7.1: Monte Carlo simulation. Left: Histogram of MOM estimates of θ in the
MA(1) model. Right: Histogram of MOM estimates of σe2 . The true values are θ = 0.7
and σe2 = 1. The sample size is n = 100.
• For each simulated series, we compute the MOM estimates of θ and σe², if they exist. Recall that the formula for θ̂ only makes sense when |r1| < 0.5.

• Of the M = 2000 simulated series, only 1388 produced a value of |r1| < 0.5. For the other 612 simulated series, the MOM estimates do not exist (therefore, the histograms in Figure 7.1 contain only 1388 estimates).

• The Monte Carlo distribution of θ̂ illustrates why MOM estimation is not recommended for MA models. The sampling distribution is not even centered at the true value of θ = 0.7. The MOM estimator θ̂ is negatively biased.
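R ILLUSTRATION: A Monte Carlo sketch (not the exact code used for Figure 7.1) of the simulation study in Example 7.1.

set.seed(520)
M <- 2000
theta.hat <- rep(NA, M)
for (m in 1:M) {
  y  <- arima.sim(model = list(ma = -0.7), n = 100)   # Y_t = e_t - 0.7 e_{t-1}
  r1 <- acf(y, lag.max = 1, plot = FALSE)$acf[2]
  if (abs(r1) < 0.5)                                  # MOM estimate exists
    theta.hat[m] <- (-1 + sqrt(1 - 4 * r1^2)) / (2 * r1)
}
mean(is.na(theta.hat))                                # proportion of series with no MOM estimate
hist(theta.hat, xlab = "MOM estimate of theta", main = "")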
Figure 7.2: Lake Huron data. Average July water surface elevation (measured in feet)
during 1880-2006.
Example 7.2. Data file: huron. Figure 7.2 displays the average July water surface
elevation (measured in feet) from 1880-2006 at Harbor Beach, Michigan, on Lake Huron.
The sample ACF and PACF for the series, both given in Figure 7.3, suggest that an
AR(1) model or possibly an AR(2) model may be appropriate.
We first consider the AR(1) model

Yt − µ = ϕ(Yt−1 − µ) + et.

Note that this model includes a parameter µ for the overall mean. By inspection, it is clear that {Yt} is not a zero mean process. I used R to compute the sample statistics ȳ = 579.309 (the MOM estimate of µ), s² = 1.783978, and r1 = 0.831, so the MOM estimate of ϕ is

ϕ̂ = r1 = 0.831
Figure 7.3: Lake Huron data. Left: Sample ACF. Right: Sample PACF.
The fitted AR(1) model, written with an intercept, is

Yt = 97.903 + 0.831Yt−1 + et.

The MOM estimate of the white noise variance is

σ̂e² = (1 − ϕ̂r1)s² = [1 − (0.831)(0.831)](1.783978) ≈ 0.552.

We also consider the AR(2) model

Yt − µ = ϕ1(Yt−1 − µ) + ϕ2(Yt−2 − µ) + et,

which is fit in the same way using the sample Yule-Walker equations; the resulting MOM estimates are summarized in the comparison table in Example 7.4.
REMARK : Note that there are minor differences in the estimates obtained “by hand”
and those from using R’s automated procedure. These are likely due to rounding error
and/or computational errors (e.g., in solving the Yule Walker equations, etc.). It should
also be noted that the R command ar(huron,order.max=1,AIC=F,method=’yw’) fits
the model (via MOM) by centering all observations first about an estimate of the overall
mean. This is why no “intercept” output is given.
REMARK: The MOM approach to estimation in stationary ARMA models is not always satisfactory. In fact, the authors of your text recommend avoiding MOM estimation in any model with moving average components. We therefore consider other estimation approaches, starting with conditional least squares (CLS).
7.3 Least Squares Estimation

AR(1): Consider the AR(1) model

Yt − µ = ϕ(Yt−1 − µ) + et,

where note that a nonzero mean µ = E(Yt) has been added for flexibility. For this model, the conditional sum of squares function is
SC(ϕ, µ) = Σ_{t=2}^{n} [(Yt − µ) − ϕ(Yt−1 − µ)]².
• With a sample of time series data Y1 , Y2 , ..., Yn , note that the t = 1 term does not
make sense because there is no Y0 observation.
• The principle of least squares says to choose the values of ϕ and µ that will minimize
SC (ϕ, µ).
• To do this, we set the partial derivatives ∂SC(ϕ, µ)/∂ϕ and ∂SC(ϕ, µ)/∂µ equal to zero and solve the resulting equations for ϕ and µ. This is a multivariate calculus problem and the details of its solution are shown on pp 154-155 (CC).

• For large n, the CLS estimator of µ satisfies µ̂ ≈ Ȳ.

• For this AR(1) model, the CLS estimator ϕ̂ is approximately equal to r1, the lag one sample autocorrelation (the only difference is that the denominator does not include the t = 1 term). We would therefore expect the difference between ϕ̂ and r1 (the MOM estimator) to be negligible when the sample size n is large.

AR(p): In the general AR(p) model, the conditional sum of squares function is

SC(ϕ1, ϕ2, ..., ϕp, µ) = Σ_{t=p+1}^{n} [(Yt − µ) − ϕ1(Yt−1 − µ) − ϕ2(Yt−2 − µ) − · · · − ϕp(Yt−p − µ)]².

Minimizing this function again gives µ̂ ≈ Ȳ, an approximation when n is large (i.e., much larger than p). The CLS estimators for ϕ1, ϕ2, ..., ϕp are well approximated by the solutions to the sample Yule-Walker equations:

r1 = ϕ1 + ϕ2r1 + ϕ3r2 + · · · + ϕprp−1
r2 = ϕ1r1 + ϕ2 + ϕ3r1 + · · · + ϕprp−2
⋮
rp = ϕ1rp−1 + ϕ2rp−2 + · · · + ϕp.
Therefore, in stationary AR models, the MOM and CLS estimates should be approxi-
mately equal.
MA(1): Consider the MA(1) model

Yt = et − θet−1,

where {et} is a zero mean white noise process. Recall from Chapter 4 that we can rewrite an invertible MA(1) model as an infinite-order AR model; i.e.,

Yt = −θYt−1 − θ²Yt−2 − θ³Yt−3 − · · · + et.

Equivalently, the white noise terms satisfy the recursion

et = Yt + θet−1,

so that, conditional on the startup value e0 = 0,

e1 = Y1
e2 = Y2 + θe1
e3 = Y3 + θe2
⋮
en = Yn + θen−1.
Using these expressions for e1 , e2 , ..., en , we can now find the value of θ that minimizes
SC(θ) = Σ_{t=1}^{n} et².
This minimization problem can be carried out numerically, searching over a grid of θ
values in (−1, 1) and selecting the value of θ that produces the smallest possible SC (θ).
This minimizer is the CLS estimator of θ in the MA(1) model.
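R ILLUSTRATION: A sketch (not from the notes) of this grid-search idea; in practice, arima(y, order = c(0,0,1), method = "CSS") performs the minimization numerically.

Sc.ma1 <- function(theta, y) {
  e <- numeric(length(y))
  e[1] <- y[1]                                         # startup: e_0 = 0
  for (t in 2:length(y)) e[t] <- y[t] + theta * e[t - 1]
  sum(e^2)                                             # conditional sum of squares
}
set.seed(520)
y    <- arima.sim(model = list(ma = -0.7), n = 100)    # true theta = 0.7
grid <- seq(-0.99, 0.99, by = 0.01)
grid[which.min(sapply(grid, Sc.ma1, y = y))]           # CLS estimate of theta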
MA(q): The technique just described for MA(1) estimation via CLS can be carried
out for any higher-order MA(q) model in the same fashion. When q > 1, the problem
becomes finding the values of θ1 , θ2 , ..., θq such that
SC(θ1, θ2, ..., θq) = Σ_{t=1}^{n} et² = Σ_{t=1}^{n} (Yt + θ1et−1 + θ2et−2 + · · · + θqet−q)²,
is minimized, subject to the initial conditions that e0 = e−1 = · · · = e−q = 0. This can
be done numerically, searching over all possible values of θ1 , θ2 , ..., θq which yield an
invertible solution.
ARMA(1,1): Consider the ARMA(1,1) model

Yt = ϕYt−1 + et − θet−1,
where {et } is zero mean white noise. We first rewrite the model as
et = Yt − ϕYt−1 + θet−1 ,
There are now two “startup” problems, namely, specifying values for e0 and Y0 . The
authors of your text recommend avoiding specifying Y0 , taking e1 = 0, and minimizing
SC*(ϕ, θ) = Σ_{t=2}^{n} et²
with respect to ϕ and θ instead. Similar modification is recommended for ARMA models
when p > 1 and/or when q > 1. See pp 157-158 (CC).
NOTE : Nothing changes with our formulae for the white noise variance estimates that
we saw previously when discussing the MOM approach. The only difference is that now
CLS estimates for the ϕ’s and θ’s are used in place of MOM estimates.
• AR(p):

σ̂e² = (1 − ϕ̂1r1 − ϕ̂2r2 − · · · − ϕ̂prp)S².

• MA(q):

σ̂e² = S² / (1 + θ̂1² + θ̂2² + · · · + θ̂q²).

• ARMA(1,1):

σ̂e² = [(1 − ϕ̂²) / (1 − 2ϕ̂θ̂ + θ̂²)] S².
7.3.5 Examples
Example 7.3. Data file: gota. The Göta River is located in western Sweden near
Göteburg. The annual discharge rates (volume, measured in m3 /s) from 1807-1956 are
depicted in Figure 7.4. The sample ACF and PACF are given in Figure 7.5.
• The sample ACF suggests that an MA(1) model

Yt = µ + et − θet−1

is worth considering. Note that this model includes an intercept term µ for the overall mean. Clearly, {Yt} is not a zero mean process.

• The sample PACF suggests that an AR(2) model

Yt − µ = ϕ1(Yt−1 − µ) + ϕ2(Yt−2 − µ) + et

might also be worth considering.
Figure 7.4: Göta River data. Water flow discharge rates (volume, measured in m3 /s)
from 1807-1956.
• We will fit an MA(1) model in this example using both MOM and CLS.

MOM: The MOM estimates are µ̂ = ȳ = 535.4641 and θ̂ = −0.654. Therefore, the fitted MA(1) model for the discharge rate process is

Yt = 535.4641 + et + 0.654et−1.

The MOM estimate of the white noise variance is

σ̂e² = s² / (1 + θ̂²) = 9457.164 / [1 + (−0.654)²] ≈ 6624.
Figure 7.5: Göta River data. Left: Sample ACF. Right: Sample PACF.
CLS: Fitting the MA(1) model using conditional least squares in R gives

Yt = 534.7199 + et + 0.5353et−1.

The white noise variance estimate is σ̂e² ≈ 6973.
• The R output gives estimated standard errors of the CLS estimates, so we can
assess their significance.
• We will learn later that CLS estimates are approximately normal in large samples.
Using the CLS estimate of θ and its standard error from the R output, an approximate 95 percent confidence interval for θ is θ̂ ± 1.96 se(θ̂). We are 95 percent confident that θ is between −0.652 and −0.419. Note that this confidence interval does not include 0.
COMPARISON : It is instructive to compare the MOM and CLS estimates for the Göta
River discharge data. This comparison (to 3 decimal places) is summarized below.
Method     µ̂          θ̂         σ̂e²
MOM        535.464    −0.654    6624
CLS        534.720    −0.535    6973
• The estimates for µ are very close. The MOM estimate is exactly equal to ȳ, whereas the CLS estimate is only approximately equal to ȳ. See pp 155 (CC).
• The estimates for θ are not close. As previously mentioned, the MOM approach
for MA models is generally not recommended.
Example 7.4. We now revisit the Lake Huron water surface elevation data in Example
7.2 and use R to fit AR(1) and AR(2) models
Yt − µ = ϕ(Yt−1 − µ) + et
and
Yt − µ = ϕ1 (Yt−1 − µ) + ϕ2 (Yt−2 − µ) + et ,
respectively, using conditional least squares (CLS). Recall that in Example 7.2 we fit
both the AR(1) and AR(2) models using MOM.
The fitted AR(1) model, using CLS, is

Yt = 89.267 + 0.846Yt−1 + et.

The white noise variance estimate, using CLS, is σ̂e² ≈ 0.489. Fitting the AR(2) model by CLS gives the estimates summarized in the comparison table below; the corresponding white noise variance estimate is σ̂e² ≈ 0.4776.
COMPARISON : It is instructive to compare the MOM and CLS estimates for the Lake
Huron data. This comparison (to 3 decimal places) is summarized below.
                      AR(1)                            AR(2)
Method     µ̂          ϕ̂        σ̂e²        µ̂          ϕ̂1        ϕ̂2         σ̂e²
MOM        579.309    0.831    0.552      579.309    0.959    −0.154     0.539
CLS        579.279    0.846    0.489      579.269    0.987    −0.170     0.478
• Note that the MOM and CLS estimates for µ and the ϕ’s are in large agreement.
This is common in purely AR models (not in models with MA components).
QUESTION : For the Lake Huron data, which model is preferred: AR(1) or AR(2)?
• The σ̂e² estimate is slightly smaller in the AR(2) fit, but only marginally.

• Using the CLS estimates, an approximate 95 percent confidence interval for ϕ2 in the AR(2) model, computed as ϕ̂2 ± 1.96 se(ϕ̂2), does (barely) include 0, indicating that ϕ̂2 is not statistically different from 0.
• Note also that the estimated standard error of ϕb1 (in the CLS output) is almost
twice as large in the AR(2) model as in the AR(1) model. Reason: When we fit
a higher-order model, we lose precision in the other model estimates (especially if
the higher-order terms are not needed).
• It is worth noting that the AR(1) model is the ARMA model identified as having
the smallest BIC (using armasubsets in R; see Chapter 6).
• For the last three reasons, and with an interest in being parsimonious, I would pick
the AR(1) if I had to choose between the two.
Figure 7.6: Bovine blood sugar data. Blood sugar levels (mg/100ml blood) for a single
cow measured for n = 176 consecutive days.
Example 7.5. Data file: cows. The data in Figure 7.6 represent daily blood sugar con-
centrations (measured in mg/100ml of blood) on a single cow being dosed intermuscularly
with 10 mg of dexamethasone (commonly given to increase milk production).
• The sample ACF in Figure 7.7 shows an AR-type decay, while the PACF in Figure
7.7 also shows an MA-type (oscillating) decay with “spikes” at the first three lags.
• ARMA(1,1) and AR(3) models are consistent with the sample ACF/PACF.
We use the ARMA(1,1) model

Yt − µ = ϕ(Yt−1 − µ) + et − θet−1
to represent this process. Note that we have added an overall mean µ parameter in the
model. Clearly, {Yt } is not a zero mean process. Therefore, there are three parameters
to estimate and we do so using conditional least squares (CLS).
Figure 7.7: Bovine blood sugar data. Left: Sample ACF. Right: Sample PACF.
The CLS estimates are ϕ̂ = 0.6625, θ̂ = −0.6111, and µ̂ ≈ 58.70, so the fitted ARMA(1,1) model is

Yt − 58.70 = 0.6625(Yt−1 − 58.70) + et + 0.6111et−1,

or, equivalently,

Yt = 19.8117 + 0.6625Yt−1 + et + 0.6111et−1.
7.4 Maximum Likelihood Estimation

REMARK: We now consider maximum likelihood (ML) estimation of the parameters in a stationary ARMA model.

• One advantage of maximum likelihood is that it uses all of the information in the data, not just the first few sample autocorrelations.

• Another advantage is that maximum likelihood estimators have very nice large-sample distributional properties. This makes statistical inference proceed in a straightforward manner.
• The main disadvantage is that we have to specify a joint probability distribution for
the random variables in the sample. This makes the method more mathematical.
• Therefore, when we maximize the likelihood function with respect to the model
parameters, we are finding the values of the parameters (i.e., the estimates) that
are most consistent with the observed data.
AR(1): To illustrate how maximum likelihood estimates are obtained, consider the
AR(1) model
Yt − µ = ϕ(Yt−1 − µ) + et ,
where {et } is a normal zero mean white noise process with var(et ) = σe2 and where
µ = E(Yt ) is the overall (process) mean. There are three parameters in this model: ϕ, µ,
and σe2 . The probability density function (pdf) of et ∼ N (0, σe2 ) is
f(et) = (2πσe²)^(−1/2) exp[−et²/(2σe²)],
for all −∞ < et < ∞, where exp(·) denotes the exponential function. Because e1 , e2 , ..., en
are independent, the joint pdf of e2 , e3 , ..., en is given by
f(e2, e3, ..., en) = ∏_{t=2}^{n} f(et) = ∏_{t=2}^{n} (2πσe²)^(−1/2) exp[−et²/(2σe²)]
                  = (2πσe²)^(−(n−1)/2) exp[ −(1/(2σe²)) Σ_{t=2}^{n} et² ].
To write out the joint pdf of Y = (Y1 , Y2 , ..., Yn ), we can first perform a multivariate
transformation using
Y2 = µ + ϕ(Y1 − µ) + e2
Y3 = µ + ϕ(Y2 − µ) + e3
⋮
Yn = µ + ϕ(Yn−1 − µ) + en,
with Y1 = y1 (fixed). This will give us the (conditional) joint distribution of Y2, Y3, ..., Yn, given Y1 = y1. Applying the laws of conditioning (i.e., multiplying by the marginal pdf of Y1), the joint pdf of Y; i.e., the likelihood function L ≡ L(ϕ, µ, σe²|y), is given by

L(ϕ, µ, σe²|y) = (2πσe²)^(−n/2) (1 − ϕ²)^(1/2) exp[−S(ϕ, µ)/(2σe²)],

where

S(ϕ, µ) = Σ_{t=2}^{n} [(Yt − µ) − ϕ(Yt−1 − µ)]² + (1 − ϕ²)(Y1 − µ)².

For this AR(1) model, the maximum likelihood estimators (MLEs) of ϕ, µ, and σe² are the values which maximize L(ϕ, µ, σe²|y).

REMARK: In this AR(1) model, the function S(ϕ, µ) is called the unconditional sum-of-squares function. Note that

S(ϕ, µ) = SC(ϕ, µ) + (1 − ϕ²)(Y1 − µ)²,

where SC(ϕ, µ) is the conditional sum of squares function defined in Section 7.3.1 (notes) for the same AR(1) model. When S(ϕ, µ) is viewed as random, the extra term (1 − ϕ²)(Y1 − µ)² is typically small relative to SC(ϕ, µ), so minimizing S and minimizing SC usually lead to similar estimates, especially when n is large.
• We have already seen in Section 7.3.1 (notes) that the conditional least squares
(CLS) estimates of ϕ and µ are found by minimizing SC (ϕ, µ).
NOTE : The approach to finding MLEs in any stationary ARMA(p, q) model is the same
as what we have just outlined in the special AR(1) case. The likelihood function L
becomes more complex in larger models. However, this turns out not to be a big deal
for us because we will use software to do the estimation. R can compute MLEs in any
stationary ARMA(p, q) model using the arima function. This function also provides
(estimated) standard errors of the MLEs.
THEORY : Suppose that {et } is a normal zero mean white noise process with var(et ) = σe2 .
Consider a stationary ARMA(p, q) process

ϕ(B)Yt = θ(B)et,

where

ϕ(B) = (1 − ϕ1B − ϕ2B² − · · · − ϕpB^p)
θ(B) = (1 − θ1B − θ2B² − · · · − θqB^q).

For large n, the maximum likelihood estimators are approximately normally distributed; more precisely,

√n (ϕ̂j − ϕj) −→d N(0, σ²_{ϕ̂j}),   for j = 1, 2, ..., p,

and

√n (θ̂k − θk) −→d N(0, σ²_{θ̂k}),   for k = 1, 2, ..., q,

where the asymptotic variances σ²_{ϕ̂j} and σ²_{θ̂k} depend on the model parameters.

SPECIFIC CASES:

• AR(1): ϕ̂ ∼ AN(ϕ, (1 − ϕ²)/n).

• AR(2): ϕ̂1 ∼ AN(ϕ1, (1 − ϕ2²)/n) and ϕ̂2 ∼ AN(ϕ2, (1 − ϕ2²)/n).

• MA(1): θ̂ ∼ AN(θ, (1 − θ²)/n).

• MA(2): θ̂1 ∼ AN(θ1, (1 − θ2²)/n) and θ̂2 ∼ AN(θ2, (1 − θ2²)/n).

• ARMA(1,1): ϕ̂ ∼ AN(ϕ, c(ϕ, θ)(1 − ϕ²)/n) and θ̂ ∼ AN(θ, c(ϕ, θ)(1 − θ²)/n), where c(ϕ, θ) = [(1 − ϕθ)/(ϕ − θ)]².
REMARK : In multi-parameter models; e.g., AR(2), MA(2), ARMA(1,1), etc., the MLEs
are (asymptotically) correlated. This correlation can also be large; see pp 161 (CC) for
further description.
• For example, an approximate 95 percent confidence interval for ϕ in the AR(1) model is ϕ̂ ± 1.96 √[(1 − ϕ̂²)/n]. Approximate confidence intervals for the other ARMA model parameters are computed in the same way.
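R ILLUSTRATION: A sketch (not from the notes) of such a confidence interval calculation for an AR(1) fit.

set.seed(520)
y   <- arima.sim(model = list(ar = 0.8), n = 150)
fit <- arima(y, order = c(1, 0, 0), method = "ML")
phi.hat <- coef(fit)["ar1"]
phi.hat + c(-1.96, 1.96) * sqrt((1 - phi.hat^2) / length(y))   # large-sample interval
phi.hat + c(-1.96, 1.96) * sqrt(vcov(fit)["ar1", "ar1"])       # interval from R's standard error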
• The nice thing about R is that ML estimates and their (estimated) standard errors
are given in the output (as they were for CLS estimates), so we have to do almost
no calculation by hand.
NOTE : Maximum likelihood estimators (MLEs) and least-squares estimators (both CLS
and ULS) have the same large-sample distributions. Large sample distributions of MOM
estimators can be quite different for purely MA models (although they are the same for
purely AR models). See pp 162 (CC).
7.4.2 Examples
Example 7.6. We revisit the Göta River discharge data in Example 7.3 (notes) and use
R to fit an MA(1) model
Yt = µ + et − θet−1,

using maximum likelihood. The fitted model is

Yt = 535.0311 + et + 0.5350et−1.

Based on the ML estimate of θ and its standard error, an approximate 95 percent confidence interval for θ is (−0.651, −0.419); that is, we are 95 percent confident that θ is between −0.651 and −0.419. This interval is almost identical to the one based on the CLS estimate; see Example 7.3.
COMPARISON : We compare the estimates from all three methods (MOM, CLS, and
MLE) with the Göta River discharge data. This comparison (to 3 decimal places) is
summarized below.
Method     µ̂          θ̂         σ̂e²
MOM        535.464    −0.654    6624
CLS        534.720    −0.535    6973
MLE        535.031    −0.535    6957
Note that the CLS and ML estimates of θ are identical (to three decimal places). The
MOM estimate of θ is noticeably different. Recall that MOM estimation is not advised
for models with MA components.
Example 7.7. The data in Figure 7.8 (left) are the number of global earthquakes
annually (with intensities of 7.0 or greater) during 1900-1998. Source: Craig Whitlow
(Spring, 2010). We examined these data in Chapter 1 (Example 1.5, pp 6).
• Because the data (number of earthquakes) are “counts,” this suggests that a trans-
formation is needed. The Box-Cox transformation output in Figure 7.8 (right)
shows that λ = 0.5 resides in an approximate 95 percent confidence interval for λ.
Recall that λ = 0.5 corresponds to the square-root transformation.
• R output for the square-root transformed series is given in Figure 7.9. The
armasubsets output, which ranks competing ARMA models according to their
BIC, selects an ARMA(1,1) model. This model is also consistent with the sample
ACF and PACF.
Figure 7.8: Earthquake data. Left: Number of “large” earthquakes per year from 1900-
1998. Right: Box-Cox transformation output (profile log-likelihood function of λ).
• We therefore fit an ARMA(1,1) model to the {√Yt} process, that is,

√Yt − µ = ϕ(√Yt−1 − µ) + et − θet−1.

For this model, the maximum likelihood estimates based on these data are ϕ̂ = 0.8352, θ̂ = 0.4295, and µ̂ = 4.3591. The fitted model is

√Yt − 4.3591 = 0.8352(√Yt−1 − 4.3591) + et − 0.4295et−1
Figure 7.9: Earthquake data. Upper left: Time series plot of the {√Yt} process. Upper right: armasubsets output (on the square-root scale). Lower left: Sample ACF of {√Yt}. Lower right: Sample PACF of {√Yt}.
or, equivalently,

√Yt = 0.7184 + 0.8352√Yt−1 + et − 0.4295et−1.
Figure 7.10: U.S. Supreme Court data. Upper left: Percent of cases granted review during
1926-2004. Upper right: Box-Cox transformation output. Lower left: Log-transformed
data {log Yt }. Lower right: First differences of log-transformed data {∇ log Yt }.
Example 7.8. The data in Figure 7.10 (upper left) represent the acceptance rate of
cases appealed to the Supreme Court during 1926-2004. Source: Jim Manning (Spring,
2010). We examined these data in Chapter 1 (Example 1.15, pp 16).
• The time series plot suggests that this process {Yt } is not stationary. There is a
clear linear downward trend. There is also a notable nonconstant variance problem.
• The BoxCox.ar transformation output in Figure 7.10 (upper right) suggests a log-
transformation is appropriate; note that λ ≈ 0.
PAGE 206
CHAPTER 7 STAT 520, J. TEBBS
• The log-transformed series {log Yt } in Figure 7.10 (lower left) still displays the linear
trend, as expected. However, the variance in the {log Yt } process is more constant
than in the original series. It looks like the log-transformation has “worked.”
• The lower right plot in Figure 7.10 gives the first differences of the log-transformed
process {∇ log Yt }. This process appears to be stationary.
• The sample ACF, PACF, EACF, and armasubsets results (not shown) suggest an MA(1) model for {∇ log Yt} ⇐⇒ an IMA(1,1) model for {log Yt}; that is,

∇ log Yt = et − θet−1,

where {et} is zero mean white noise. We fit this model in R using maximum likelihood:
> arima(log(supremecourt),order=c(0,1,1),method=’ML’) # ML
Coefficients:
ma1
-0.3556
s.e. 0.0941
sigma^2 estimated as 0.03408: log likelihood = 21.04, aic = -40.08
The fitted model is

∇ log Yt = et − 0.3556et−1,
or, equivalently,
log Yt = log Yt−1 + et − 0.3556et−1 .
COMMENT : Note that there is no estimated intercept term in the output above. Recall
that in ARIMA(p, d, q) models with d > 0, intercept terms are generally not used.
8 Model Diagnostics
8.1 Introduction
RECALL: Suppose that {et } is a zero mean white noise process with var(et ) = σe2 . In
general, an ARIMA(p, d, q) process can be written as

ϕ(B)(1 − B)^d Yt = θ(B)et,

where

ϕ(B) = (1 − ϕ1B − ϕ2B² − · · · − ϕpB^p)
θ(B) = (1 − θ1B − θ2B² − · · · − θqB^q)

and

(1 − B)^d Yt = ∇^d Yt
is the series of dth differences. Until now, we have discussed the following topics:
• Model specification (model selection). This deals with specifying the values of
p, d, and q that are most consistent with the observed (or possibly transformed)
data. This was the topic of Chapter 6.
• Model fitting (parameter estimation). This deals with estimating model param-
eters in the ARIMA(p, d, q) class. This was the topic of Chapter 7.
PREVIEW : In this chapter, we are now concerned with model diagnostics, which
generally means that we are “checking the fit of the model.” We were exposed to this
topic in Chapter 3, where we encountered deterministic trend models of the form
Yt = µt + Xt ,
where E(Xt ) = 0. We apply many of the same techniques we used then to our situation
now, that is, to diagnose the fit of ARIMA(p, d, q) models.
TERMINOLOGY: Residuals are random quantities which describe the part of the variation in {Yt} that is not explained by the fitted model. In general, we have the relationship (not just in time series models, but in nearly all statistical models):

Residual = Observed value − Predicted value.

AR(p): Consider the AR(p) model

Yt − µ = ϕ1(Yt−1 − µ) + ϕ2(Yt−2 − µ) + · · · + ϕp(Yt−p − µ) + et,

where µ = E(Yt) is the overall (process) mean and where {et} is a zero mean white noise process. This model can be reparameterized as

Yt = θ0 + ϕ1Yt−1 + ϕ2Yt−2 + · · · + ϕpYt−p + et,

where θ0 = µ(1 − ϕ1 − ϕ2 − · · · − ϕp) is the intercept term. For this model, the residual at time t is

êt = Yt − Ŷt = Yt − θ̂0 − ϕ̂1Yt−1 − ϕ̂2Yt−2 − · · · − ϕ̂pYt−p,

where ϕ̂j is an estimator of ϕj (e.g., ML, CLS, etc.), for j = 1, 2, ..., p, and where

θ̂0 = µ̂(1 − ϕ̂1 − ϕ̂2 − · · · − ϕ̂p)

is the estimated intercept. Therefore, once we observe the values of Y1, Y2, ..., Yn in our sample, we can compute the residuals.

NOTE: Computing the first few residuals requires values of Y0, Y−1, ..., Y1−p; that is, the p values of the process {Yt} before time t = 1. These values can be obtained by backcasting. We will not discuss backcasting in detail, but be aware that it is needed to compute early residuals in the process.
ARMA(p, q): To define residuals for an invertible ARMA model containing moving
average terms, we exploit the fact that the model can be written as an inverted autore-
gressive process. To be specific, recall that any zero-mean invertible ARMA(p, q) model
can be written as
Yt = π1 Yt−1 + π2 Yt−2 + π3 Yt−3 + · · · + et ,
where the π coefficients are functions of the ϕ and θ parameters in the specific ARMA(p, q)
model. Residuals are of the form
êt = Yt − π̂1 Yt−1 − π̂2 Yt−2 − π̂3 Yt−3 − · · · ,
where the π̂ coefficients are computed by plugging the parameter estimates into the π weights.
IMPORTANT : The observed residuals êt serve as “proxies” for the white noise terms et .
We can therefore learn about the quality of the model fit by examining the residuals.
• If the model is correctly specified and our estimates are “reasonably close” to the
true parameters, then the residuals should behave roughly like an iid normal white
noise process, that is, a sequence of independent, normal random variables with zero
mean and constant variance.
• If the model is not correctly specified, then the residuals will not behave roughly like
an iid normal white noise process. Furthermore, examining the residuals carefully
may help us identify a better model.
TERMINOLOGY : It is very common to instead work with residuals which have been
standardized, that is,
ê∗t = êt /σ̂e ,
where σ̂e2 is an estimate of the white noise error variance σe2 . We call these standardized
residuals.
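In R, standardized residuals can be computed directly from any arima() fit; a minimal sketch
(using the Göta River MA(1) fit discussed in the examples below):

gota.ma1.fit <- arima(gota, order = c(0, 0, 1), method = "ML")
e.hat <- residuals(gota.ma1.fit)            # residuals e-hat_t
e.std <- e.hat / sqrt(gota.ma1.fit$sigma2)  # standardized residuals e-hat_t / sigma-hat_e
# The TSA function rstandard(), used in the output below, returns these standardized residuals.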
• From the standard normal distribution, we know then that most of the standardized
residuals {ê∗t } should fall between −3 and 3.
• Standardized residuals that fall outside this range could correspond to observations
which are “outlying” in some sense; we’ll make this more concrete later. If many
standardized residuals fall outside (−3, 3), this suggests that the error process {et }
has a heavy-tailed distribution (common in financial time series applications).
DIAGNOSTICS : Histograms and qq plots of the residuals can be used to assess the
normality assumption visually. Time series plots of the residuals can be helpful to detect
“patterns” which violate the independence assumption.
• We can also apply the hypothesis tests for normality (Shapiro-Wilk) and indepen-
dence (runs test) with the standardized residuals, just as we did in Chapter 3 with
the deterministic trend models.
Figure 8.1: Göta River discharge data. Upper left: Discharge rate time series. Up-
per right: Standardized residuals from an MA(1) fit with zero line added. Lower left:
Histogram of the standardized residuals from MA(1) fit. Lower right: QQ plot of the
standardized residuals from MA(1) fit.
Example 8.1. In Example 7.3 (pp 189, notes), we examined the Göta River discharge
rate data and used an MA(1) process to model them. The fit using maximum likelihood
in Example 7.6 (pp 202, notes) was
Yt = 535.0311 + et + 0.5350et−1 .
Figure 8.1 displays the time series plot (upper right), the histogram (lower left), and the
qq plot (lower right) of the standardized residuals. The histogram and the qq plot show
no gross departures from normality. This observation is supported by the Shapiro-Wilk
test for normality, which we perform in R. Here is the output:
> shapiro.test(rstandard(gota.ma1.fit))
Shapiro-Wilk normality test
W = 0.9951, p-value = 0.8975
The large p-value is not evidence against normality (i.e., we do not reject H0 ). To examine
the independence assumption, note that the time series of the residuals in Figure 8.1
(upper right) displays no discernible patterns and looks to be random in appearance.
This observation is supported by the runs test for independence, which we also perform
in R. Here is the output:
> runs(rstandard(gota.ma1.fit))
$pvalue
[1] 0.29
$observed.runs
[1] 69
$expected.runs
[1] 75.94667
CONCLUSION : For the Göta River discharge data, the (standardized) residuals from an
MA(1) fit appear to reasonably satisfy the normality and independence assumptions.
Example 8.2. In Example 7.2 (pp 182, notes), we examined the Lake Huron elevation
data and considered using an AR(1) process to model them. Here is the R output from
fitting an AR(1) model via maximum likelihood:
Figure 8.2: Lake Huron elevation data. Upper left: Elevation time series. Upper right:
Standardized residuals from an AR(1) fit with zero line added. Lower left: Histogram
of the standardized residuals from AR(1) fit. Lower right: QQ plot of the standardized
residuals from AR(1) fit.
The fitted AR(1) model is Yt − 579.492 = 0.8586(Yt−1 − 579.492) + et , or, equivalently,
Yt = 81.9402 + 0.8586Yt−1 + et .
Figure 8.2 displays the time series plot (upper right), the histogram (lower left), and the
qq plot (lower right) of the standardized residuals. The histogram and the qq plot show
no gross departures from normality. The time series plot of the standardized residuals
displays no noticeable patterns and looks like a stationary random process.
The R output for the Shapiro-Wilk and runs tests is given below:
> shapiro.test(rstandard(huron.ar1.fit))
Shapiro-Wilk normality test
W = 0.9946, p-value = 0.9156
> runs(rstandard(huron.ar1.fit))
$pvalue
[1] 0.373
$observed.runs
[1] 59
$expected.runs
[1] 64.49606
CONCLUSION : For the Lake Huron elevation data, the (standardized) residuals from an
AR(1) fit appear to reasonably satisfy the normality and independence assumptions.
RECALL: In Chapter 6, we discovered that for a white noise process, the sample
autocorrelation satisfies
rk ∼ AN(0, 1/n),
for large n. Furthermore, the sample autocorrelations rj and rk , for j ̸= k, are approx-
imately uncorrelated.
NOTATION : We now consider the sample autocorrelations of the residuals from a fitted
model, which we denote by r̂k , for k = 1, 2, ...,. That is, the “hat” symbol in r̂k will remind us that we are now
dealing with residuals.
“If the model is correctly specified and our estimates are “reasonably
close” to the true parameters, then the residuals should behave roughly
like an iid normal white noise process.”
• We say “roughly,” because even if the correct model is fit, the sample autocorre-
lations of the residuals, r̂k , have sampling distributions that differ slightly from
those of white noise (most prominently at early lags).
• In addition, r̂j and r̂k , for j ̸= k, are correlated, notably so at early lags and more
weakly at later lags.
RESULTS : Suppose that {et } is a zero mean white noise process with var(et ) = σe2 . In
addition, suppose that we have identified and fit the correct ARIMA(p, d, q) model
using maximum likelihood. All of the following are large-sample results (i.e., they are
approximate for large n).
• MA(1).
var(r̂1 ) ≈ θ2 /n
var(r̂k ) ≈ [1 − (1 − θ2 )θ^(2k−2) ]/n, for k > 1
corr(r̂1 , r̂k ) ≈ −sign(θ) (1 − θ2 )θ^(k−2) / √[1 − (1 − θ2 )θ^(2k−2) ], for k > 1.
• MA(2).
var(r̂1 ) ≈ θ2^2 /n
var(r̂2 ) ≈ [θ2^2 + θ1^2 (1 + θ2 )2 ]/n
var(r̂k ) ≈ 1/n, for k > 2.
• AR(1).
var(r̂1 ) ≈ ϕ2 /n
var(r̂k ) ≈ [1 − (1 − ϕ2 )ϕ^(2k−2) ]/n, for k > 1
corr(r̂1 , r̂k ) ≈ −sign(ϕ) (1 − ϕ2 )ϕ^(k−2) / √[1 − (1 − ϕ2 )ϕ^(2k−2) ], for k > 1.
• AR(2).
var(r̂1 ) ≈ ϕ2^2 /n
var(r̂2 ) ≈ [ϕ2^2 + ϕ1^2 (1 + ϕ2 )2 ]/n
var(r̂k ) ≈ 1/n, for k > 2.
MAIN POINT : Even if we fit the correct ARIMA(p, d, q) model, the residuals from the
fit will not follow a white noise process exactly. At very early lags, there are noticeable
differences from a white noise process. For larger lags, the differences become negligible.
Example 8.3. In Example 8.1, we examined the residuals from an MA(1) fit to the
Göta River discharge data (via ML).
• The sample ACF of the MA(1) residuals is depicted in Figure 8.3 with margin of
error bounds at
±2/√n = ±2/√150 ≈ ±0.163.
That is, the margin of error bounds in Figure 8.3 are computed under the white
noise assumption.
Figure 8.3: Göta River discharge data. Sample ACF of the residuals from an MA(1)
model fit.
Recall that the MA(1) model fit to these data (via ML) was
Yt = 535.0311 + et + 0.5350et−1 .
Here are first 10 sample autocorrelations for the residuals from the MA(1) fit:
> acf(residuals(gota.ma1.fit),plot=F,lag.max=10)
1 2 3 4 5 6 7 8 9 10
0.059 0.020 -0.115 0.021 -0.074 0.041 -0.009 0.019 -0.076 0.042
We now construct a table which displays these sample autocorrelations, along with their
±2 estimated standard errors
±2 ŝe(r̂k ) = ±2 √var̂(r̂k ),
for k = 1, 2, ..., 10. Values of r̂k more than 2 (estimated) standard errors away from 0
would be considered inconsistent with the fitted model.

k           1      2      3       4      5       6      7       8      9       10
r̂k         0.059  0.020  −0.115  0.021  −0.074  0.041  −0.009  0.019  −0.076  0.042
2ŝe(r̂k )   0.087  0.146  0.158   0.162  0.163   0.163  0.163   0.163  0.163   0.163

• Note that none of the r̂k exceeds its ±2 standard error bound in absolute value; for k ≥ 5,
the bounds have essentially reached the white noise value 2/√n = 2/√150 ≈ 0.163.
• This finding further supports the MA(1) model choice for these data.
• Even when each residual autocorrelation r̂k is individually unremarkable, the r̂k may
collectively suggest that the fitted model is inadequate. To address this potential occurrence,
Ljung and Box (1978) developed a procedure, based on the sample autocorrelations of the
residuals, to test formally whether or not a certain model in the ARMA(p, q) family is appropriate.
• The (modified) Ljung-Box test statistic is
Q∗ = n(n + 2)[ r̂1^2 /(n − 1) + r̂2^2 /(n − 2) + · · · + r̂K^2 /(n − K) ].
Large values of Q∗ indicate that the residual autocorrelations are collectively too large to be
consistent with white noise errors.
• The sample autocorrelations r̂k , for k = 1, 2, ..., K, are computed under the
ARMA(p, q) model assumption in H0 . If a nonstationary model is fit (d > 0),
then the ARMA(p, q) model refers to the suitably differenced process. (The large-sample
distribution of Q∗ applies when the fitted model can be written in its general linear process
form Yt = et + Ψ1 et−1 + Ψ2 et−2 + · · · .)
• The value K is called the maximum lag; its choice is somewhat arbitrary.
Typically one can simply compute Q∗ for various choices of K and determine if the
same decision is reached for all values of K.
• For a fixed K, a level α decision rule is to reject H0 if the value of Q∗ exceeds the
upper α quantile of the χ2 distribution with K − p − q degrees of freedom.
• The tsdiag function in R will compute Q∗ at all lags specified by the user.
Example 8.4. In Example 8.1, we examined the residuals from an MA(1) fit to the Göta
River discharge data (via ML). Here we illustrate the use of the modified Ljung-Box test
for the MA(1) model. Recall that we computed the first 10 sample autocorrelations:
> acf(residuals(gota.ma1.fit),plot=F,lag.max=10)
1 2 3 4 5 6 7 8 9 10
0.059 0.020 -0.115 0.021 -0.074 0.041 -0.009 0.019 -0.076 0.042
With K = 10 and the MA(1) fit (p = 0, q = 1), the statistic Q∗ is compared to the upper
α = 0.05 quantile of the χ2 distribution with K − p − q = 9 degrees of freedom,
χ2 9,0.05 = 16.91898,
which I found using the qchisq(0.95,9) command in R. Because the test statistic Q∗ ≈ 5.13
does not exceed this upper quantile, we do not reject H0 .
REMARK : Note that R can perform the modified Ljung-Box test automatically. Here
is the output:
> Box.test(residuals(gota.ma1.fit),lag=10,type="Ljung-Box",fitdf=1)
Box-Ljung test
X-squared = 5.1305, df = 9, p-value = 0.8228
We do not have evidence against MA(1) model adequacy for these data when K = 10.
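For completeness, here is a minimal sketch showing how Q∗ can be computed “by hand” from the
residual autocorrelations and compared with the Box.test() output above:

r.hat <- acf(residuals(gota.ma1.fit), plot = FALSE, lag.max = 10)$acf[-1]
n <- length(residuals(gota.ma1.fit))
K <- 10
Q.star <- n * (n + 2) * sum(r.hat^2 / (n - 1:K))
Q.star                                # approximately 5.13, as in the Box.test() output
1 - pchisq(Q.star, df = K - 0 - 1)    # p-value with K - p - q = 10 - 0 - 1 = 9 df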
Figure 8.4: Göta River discharge data. Residual graphics and modified Ljung-Box p-
values for MA(1) fit. This figure was created using the tsdiag function in R.
• The top plot displays the residuals plotted through time (without connecting lines).
• The bottom plot displays the p-values of the modified Ljung-Box test for various
values of K. A horizontal line at α = 0.05 is added.
For the Göta River discharge data, we see in Figure 8.4 that all of the modified Ljung-Box
test p-values are larger than 0.05, lending further support of the MA(1) model.
Figure 8.5: Earthquake data. Residual graphics and modified Ljung-Box p-values for
ARMA(1,1) fit to the square-root transformed data.
Example 8.5. In Example 7.7 (pp 203, notes), we fit an ARMA(1,1) model to the
(square-root transformed) earthquake data using maximum likelihood. Figure 8.5 dis-
plays the tsdiag output for the ARMA(1,1) model fit.
• The Shapiro-Wilk test does not reject normality (p-value = 0.7202). The runs test
does not reject independence (p-value = 0.679). Both the Shapiro-Wilk and runs
tests were applied to the standardized residuals.
• The residual output in Figure 8.5 fully supports the ARMA(1,1) model.
Figure 8.6: U.S. Supreme Court data. Residual graphics and modified Ljung-Box p-
values for IMA(1,1) fit to the log transformed data.
Example 8.6. In Example 7.8 (pp 206, notes), we fit an IMA(1,1) model to the (log
transformed) Supreme Court data using maximum likelihood. Figure 8.6 displays the
tsdiag output for the IMA(1,1) model fit.
• The Shapiro-Wilk test does not reject normality (p-value = 0.5638). The runs test
does not reject independence (p-value = 0.864). Both the Shapiro-Wilk and runs
tests were applied to the standardized residuals.
• The modified Ljung-Box test p-values in Figure 8.6 raise serious concerns over the
adequacy of the IMA(1,1) model fit.
Figure 8.7: Crude oil price data. Monthly spot prices in dollars from Cushing, OK, from
1/1986 to 1/2006.
Example 8.7. The data in Figure 8.7 are monthly spot prices for crude oil (measured
in U.S. dollars per barrel). We examined these data in Chapter 1 (Example 1.12, pp 13).
In this example, we assess the fit of an IMA(1,1) model for {log Yt }; i.e.,
∇ log Yt = et − θet−1 .
I have arrived at this candidate model using our established techniques from Chapter 6;
these details are omitted for brevity. I used maximum likelihood to fit the model.
• In Figure 8.8, we display the {∇ log Yt } process (upper left), along with plots of
the standardized residuals from the IMA(1,1) fit.
• It is difficult to notice a pattern in the time series plot of the residuals, although
there are notable outliers on the low and high sides.
Figure 8.8: Oil price data with IMA(1,1) fit to {log Yt }. Upper left: {∇ log Yt } process.
Upper right: Standardized residuals with zero line added. Lower left: Histogram of the
standardized residuals. Lower right: QQ plot of the standardized residuals.
• The Shapiro-Wilk test strongly rejects normality of the residuals (p-value < 0.0001).
This is likely due to the extreme outliers on each side, which are not “expected”
under the assumption of normality. The runs test does not reject independence
(p-value = 0.341).
• The tsdiag output for the IMA(1,1) residuals is given in Figure 8.9. The top plot
displays the residuals from the IMA(1,1) fit with “outlier limits” at
z0.025/241 ≈ 3.709744,
Figure 8.9: Oil price data. Residual graphics and modified Ljung-Box p-values for
IMA(1,1) fit to the log transformed data.
• According to the Bonferroni criterion, residuals which exceed this value (3.709744)
in absolute value would be classified as outliers. The one around 1991 likely corre-
sponds to the U.S. invasion of Iraq (the first one).
• The sample ACF for the residuals raises some concern, but the modified Ljung-Box p-
values do not suggest lack of fit (although the evidence against adequacy grows somewhat for large K).
• The IMA(1,1) model for the log-transformed data appears to do a fairly good job.
I am a little concerned about the outliers and the residual ACF. Intervention
analysis (Chapter 11) may help to adjust for the outlying observations.
8.3 Overfitting

TERMINOLOGY : A third diagnostic tool is overfitting. After fitting what we believe to be
an adequate model, we fit a slightly more general model (one that contains the original model
as a special case) and then check the fit of the larger model by
(a) assessing whether the additional parameter estimate is significantly different from zero, and
(b) examining the change in the estimates from the assumed model.
EXAMPLE : Suppose that, after the model specification phase and residual diagnostics,
we are strongly considering an AR(2) model for our data, that is,
Yt = θ0 + ϕ1 Yt−1 + ϕ2 Yt−2 + et .
Two natural overfitted models to examine are the following:
• AR(3):
Yt = θ0 + ϕ1 Yt−1 + ϕ2 Yt−2 + ϕ3 Yt−3 + et
• ARMA(2,1):
Yt = θ0 + ϕ1 Yt−1 + ϕ2 Yt−2 + et − θet−1 .
• If the additional AR parameter estimate ϕ̂3 is significantly different from zero, then
this would be evidence that an AR(3) model is worthy of investigation. If ϕ̂3 is
not significantly different from zero and the estimates of ϕ1 and ϕ2 do not change
much from their values in the AR(2) model fit, this would be evidence that the
more complicated AR(3) model is not needed.
GENERAL: For an ARIMA(p, d, q) working model, the two natural overfitted models are
(a) ARIMA(p + 1, d, q)
(b) ARIMA(p, d, q + 1).
Example 8.8. Our residual analysis this chapter suggests that an MA(1) model for the
Göta River discharge data is very reasonable. We now overfit using an MA(2) model and
an ARMA(1,1) model. Here is the R output from all three model fits:
> gota.ma1.fit
Call: arima(x = gota, order = c(0, 0, 1), method = "ML")
Coefficients:
ma1 intercept
0.5350 535.0311
s.e. 0.0594 10.4300
sigma^2 estimated as 6957: log likelihood = -876.58, aic = 1757.15
> gota.ma2.overfit
Call: arima(x = gota, order = c(0, 0, 2), method = "ML")
Coefficients:
ma1 ma2 intercept
0.6153 0.1198 534.8117
s.e. 0.0861 0.0843 11.7000
sigma^2 estimated as 6864: log likelihood = -875.59, aic = 1757.18
> gota.arma11.overfit
Call: arima(x = gota, order = c(1, 0, 1), method = "ML")
Coefficients:
ar1 ma1 intercept
0.1574 0.4367 534.8004
s.e. 0.1292 0.1100 11.5217
sigma^2 estimated as 6891: log likelihood = -875.87, aic = 1757.74
ANALYSIS : In the MA(2) overfit, a 95 percent confidence interval for θ2 ,
the additional MA model parameter, is
0.1198 ± 1.96(0.0843) =⇒ (−0.045, 0.285),
which does include 0. Therefore, θ̂2 is not statistically different from zero, which suggests
that the MA(2) model is not necessary. In the ARMA(1,1) overfit, a 95
percent confidence interval for ϕ, the additional AR model parameter, is
0.1574 ± 1.96(0.1292) =⇒ (−0.096, 0.411),
which also includes 0. Therefore, ϕ̂ is not statistically different from zero, which suggests
that the ARMA(1,1) model is not necessary. The following table summarizes the three
model fits:
Model        θ̂ (ŝe)            Additional estimate   Significant?   σ̂e2    AIC
MA(1)        0.5350 (0.0594)    −−                    −−             6957    1757.15
MA(2)        0.6153 (0.0861)    θ̂2                    no             6864    1757.18
ARMA(1,1)    0.4367 (0.1100)    ϕ̂                     no             6891    1757.74
Because the additional estimates in the overfit models are not statistically different from
zero, there is no reason to further consider either model. Note also how the estimate of
θ becomes less precise in the two larger models.
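A minimal R sketch of these overfitting comparisons (using the same data and fit names as above):

gota.ma1.fit        <- arima(gota, order = c(0, 0, 1), method = "ML")
gota.ma2.overfit    <- arima(gota, order = c(0, 0, 2), method = "ML")
gota.arma11.overfit <- arima(gota, order = c(1, 0, 1), method = "ML")
# Approximate 95 percent CI for the added MA(2) parameter theta_2:
est <- gota.ma2.overfit$coef["ma2"]
se  <- sqrt(diag(gota.ma2.overfit$var.coef))["ma2"]
est + c(-1.96, 1.96) * se        # interval contains 0 => theta_2 not significant
AIC(gota.ma1.fit); AIC(gota.ma2.overfit); AIC(gota.arma11.overfit)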
9 Forecasting
9.1 Introduction
RECALL: We have discussed two types of statistical models for time series data, namely,
deterministic trend models (Chapter 3) of the form
Yt = µt + Xt ,
where {Xt } is a zero mean stochastic process, and ARIMA(p, d, q) models of the form
ϕ(B)(1 − B)d Yt = θ(B)et ,
where {et } is zero mean white noise. For both types of models, we have studied model
specification, model fitting, and diagnostic procedures to assess model fit.
• We start with a sample of process values up until time t, say, Y1 , Y2 , ..., Yt . These
are our observed data.
• Forecasting refers to the technique of predicting future values of the process, that is,
predicting Yt+l for lead times l = 1, 2, ....
• We call t the forecast origin and l the lead time. The value Yt+l is “l steps
ahead” of the most recently observed value Yt .
REMARK : Predicting the future value Yt+l , a random variable, is a different problem
than, say, estimating a population (model) parameter. Model parameters are fixed (but
unknown) values. Random variables are not fixed; they are random.
• Suppose that we have a sample of observed data Y1 , Y2 , ..., Yt and that we would
like to predict Yt+l . For any candidate predictor h(Y1 , Y2 , ..., Yt ), define the mean
squared error of prediction
MSEP = E{[Yt+l − h(Y1 , Y2 , ..., Yt )]2 }.
• The approach we take is to choose the function h(Y1 , Y2 , ..., Yt ) that minimizes
MSEP. This function will be our forecasted value of Yt+l .
• It can be shown that the minimizing choice is
Ŷt (l) = E(Yt+l |Y1 , Y2 , ..., Yt ),
the conditional expectation of Yt+l , given the observed data Y1 , Y2 , ..., Yt (see
Appendices E and F, CC). This is called the minimum mean squared error (MMSE)
forecast. That is, Ŷt (l) is the MMSE forecast of Yt+l .
In other words, once you condition on Y1 , Y2 , ..., Yt , any function of Y1 , Y2 , ..., Yt acts
as a constant. For the deterministic trend model, the MMSE forecast is therefore
Ŷt (l) = E(Yt+l |Y1 , Y2 , ..., Yt ) = E(µt+l + Xt+l |Y1 , Y2 , ..., Yt ) = µt+l ,
because µt+l is constant and because Xt+l is a zero mean random variable independent
of Y1 , Y2 , ..., Yt . Therefore, Ŷt (l) = µt+l .
In practice, µt+l must be estimated from the data. In the straight line trend model, Ŷt (l)
is estimated by
µ̂t+l = β̂0 + β̂1 (t + l),
where β̂0 and β̂1 are the least squares estimates of β0 and β1 , respectively. In the cosine
trend model, Ŷt (l) is estimated by
µ̂t+l = β̂0 + β̂1 cos[2πf (t + l)] + β̂2 sin[2πf (t + l)],
where β̂0 , β̂1 , and β̂2 are the least squares estimates.
Example 9.1. In Example 3.4 (pp 53, notes), we fit a straight line trend model to the
global temperature deviation data. The fitted model is
µ̂t = −12.19 + 0.0062t,
for t = 1900, 1901, ..., 1997, depicted visually in Figure 9.1. Here are examples of
forecasting with this estimated trend model:
µ̂1998 = µ̂1997+1 = −12.19 + 0.0062(1997 + 1) ≈ 0.198
µ̂2005 = µ̂1997+8 = −12.19 + 0.0062(1997 + 8) ≈ 0.241
µ̂2020 = µ̂1997+23 = −12.19 + 0.0062(1997 + 23) ≈ 0.334.
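In R, these trend forecasts are just extrapolations of the least squares fit; a minimal sketch
(the object name globtemp for the 1900-1997 deviation series is illustrative):

year <- 1900:1997
fit.trend <- lm(globtemp ~ year)
predict(fit.trend, newdata = data.frame(year = c(1998, 2005, 2020)))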
Figure 9.1: Global temperature data. The least squares straight line fit is superimposed.
Example 9.2. In Example 3.6 (pp 66, notes), we fit a cosine trend model to the monthly
US beer sales data (in millions of barrels), which produced the fitted model
• These values for t are used because data arrive monthly and “year” is used as a
predictor in the regression.
Figure 9.2: Beer sales data. The least squares cosine trend fit is superimposed.
• In December, 1990, we could have used the model to predict for January, 1991,
REMARK : One major drawback with predictions made from deterministic trend models
is that they are based only on the least squares model fit, that is, the forecast for Yt+l
ignores the correlation between Yt+l and Y1 , Y2 , ..., Yt . Therefore, the analyst who makes
these predictions is ignoring this correlation and, in addition, is assuming that the fitted
trend is applicable indefinitely into the future; i.e., “the trend lasts forever.”
TERMINOLOGY : In the deterministic trend model
Yt = µt + Xt ,
where E(Xt ) = 0 and var(Xt ) = γ0 (constant), the forecast error at lead time l, denoted
by et (l), is the difference between the value of the process at time t + l and the MMSE
forecast at this time. Mathematically,
et (l) = Yt+l − Ŷt (l) = (µt+l + Xt+l ) − µt+l = Xt+l .
For all l ≥ 1, it follows that
E[et (l)] = E(Xt+l ) = 0 and var[et (l)] = var(Xt+l ) = γ0 .
• The first equation implies that forecasts are unbiased because the forecast error
is an unbiased estimator of 0.
• The second equation implies that the forecast error variance is constant for all
lead times l.
• These facts will be useful in deriving prediction intervals for future values.
GOAL: We now discuss forecasting methods with ARIMA models. Recall that an
ARIMA(p, d, q) process can be written generally as ϕ(B)(1 − B)d Yt = θ(B)et .
9.3.1 AR(1)
AR(1): Suppose that {et } is zero mean white noise with var(et ) = σe2 . Consider the
AR(1) model
Yt − µ = ϕ(Yt−1 − µ) + et ,
1-step ahead forecast: The MMSE forecast of Yt+1 , the 1-step ahead forecast, is
Ŷt (1) = E(Yt+1 |Y1 , Y2 , ..., Yt ) = E[µ + ϕ(Yt − µ) + et+1 |Y1 , Y2 , ..., Yt ] = µ + ϕ(Yt − µ),
because
• E[ϕ(Yt −µ)|Y1 , Y2 , ..., Yt ] = ϕ(Yt −µ), because ϕ(Yt −µ) is a function of Y1 , Y2 , ..., Yt .
• E(et+1 |Y1 , Y2 , ..., Yt ) = E(et+1 ) = 0, because et+1 is independent of Y1 , Y2 , ..., Yt .
2-step ahead forecast: The MMSE forecast of Yt+2 , the 2-step ahead forecast, is
Ŷt (2) = E(Yt+2 |Y1 , Y2 , ..., Yt ) = µ + ϕ[Ŷt (1) − µ] = µ + ϕ2 (Yt − µ).
l-step ahead forecast: For larger lead times, this pattern continues. In general, the
MMSE forecast of Yt+l , for all l ≥ 1, is
Ŷt (l) = µ + ϕ^l (Yt − µ).
Because |ϕ| < 1, note that as the lead time l increases,
Ŷt (l) → µ.
In other words, MMSE forecasts will “converge” to the overall process mean µ as
the lead time l increases.
FORECAST ERROR: In the AR(1) model, the 1-step ahead forecast error is
et (1) = Yt+1 − Ŷt (1) = [µ + ϕ(Yt − µ) + et+1 ] − [µ + ϕ(Yt − µ)] = et+1 .
Therefore,
E[et (1)] = E(et+1 ) = 0 and var[et (1)] = var(et+1 ) = σe2 .
Because the 1-step ahead forecast error et (1) is an unbiased estimator of 0, we say that
the 1-step ahead forecast Ŷt (1) is unbiased. The second equation says that the 1-step
ahead forecast error et (1) has constant variance. To find the l-step ahead forecast
error, et (l), we first remind ourselves (pp 94, notes) that a zero mean AR(1) process can
be written as an infinite order MA model, that is,
Yt = et + ϕet−1 + ϕ2 et−2 + ϕ3 et−3 + · · · .
Therefore, the l-step ahead forecast error is
et (l) = Yt+l − Ŷt (l) = et+l + ϕet+l−1 + ϕ2 et+l−2 + · · · + ϕ^(l−1) et+1 ,
so that E[et (l)] = 0; i.e., forecasts are unbiased. The variance of the l-step ahead forecast error is
var[et (l)] = σe2 (1 + ϕ2 + ϕ4 + · · · + ϕ^(2l−2) ) = σe2 (1 − ϕ^(2l) )/(1 − ϕ2 ).
Note that as the lead time l increases,
var[et (l)] → σe2 /(1 − ϕ2 ) = γ0 = var(Yt ).
Example 9.3. In Example 8.2 (pp 213, notes), we examined the Lake Huron elevation
data (from 1880-2006) and we used an AR(1) process to model them.
• With l = 10, the (estimated) MMSE forecast for Yt+10 (for 2016) is
Ŷt (10) = µ̂ + ϕ̂^10 (Yt − µ̂) = 579.492 + (0.8586)^10 (581.27 − 579.492) ≈ 579.88.
NOTE : The R function predict provides (estimated) MMSE forecasts and (estimated)
standard errors of the forecast error for any ARIMA(p, d, q) model fit. For example,
consider the Lake Huron data with lead times l = 1, 2, ..., 20 (which corresponds to years
2007, 2008, ..., 2026). R produces the following output:
> round(huron.ar1.predict$se,3)
Start = 2007
End = 2026
[1] 0.704 0.927 1.063 1.152 1.214 1.258 1.289 1.311 1.328 1.340 1.349 1.355 1.360
[14] 1.363 1.366 1.367 1.369 1.370 1.371 1.371
• In Figure 9.3, we display the Lake Huron data. The full data set is from 1880-2006
(one elevation reading per year).
• However, for aesthetic reasons (to emphasize the MMSE forecasts), we start the
series in the plot at year 1940.
• The estimated MMSE forecasts in the R predict output are computed using
Ŷt (l) = µ̂ + ϕ̂^l (Yt − µ̂),
for l = 1, 2, ..., 20, starting with Yt = 581.27, the observed elevation in 2006. There-
fore, the forecasts in Figure 9.3 start at 2007 and end in 2026.
• In the output above, note how the MMSE forecasts Ŷt (l) approach the estimated mean
µ̂ = 579.492 as l increases. This can also be clearly seen in Figure 9.3.
• The (estimated) standard errors of the forecast error (in the predict output above)
are used to construct prediction intervals. We will discuss their construction in
due course.
Figure 9.3: Lake Huron elevation data. The full data set is from 1880-2006. This figure
starts the series at 1940. AR(1) estimated MMSE forecasts and 95 percent prediction
limits are given for lead times l = 1, 2, ..., 20. These lead times correspond to years
2007-2026.
• Specifically, the (estimated) standard errors of the forecast error (in the predict
output above) are given by
ŝe[et (l)] = √var̂[et (l)] = σ̂e √[(1 − ϕ̂^(2l) )/(1 − ϕ̂2 )],
for l = 1, 2, ..., 20. Note how these standard errors increase with l and approach 1.373.
This value (1.373) is the square root of the estimated AR(1) process variance γ̂0 .
9.3.2 MA(1)
MA(1): Suppose that {et } is zero mean white noise with var(et ) = σe2 . Consider the
invertible MA(1) process
Yt = µ + et − θet−1 ,
1-step ahead forecast: The MMSE forecast of Yt+1 , the 1-step ahead forecast, is
Ŷt (1) = E(Yt+1 |Y1 , Y2 , ..., Yt ) = E(µ + et+1 − θet |Y1 , Y2 , ..., Yt ) = µ − θE(et |Y1 , Y2 , ..., Yt ). (∗∗)
To compute (∗∗), recall (pp 105, notes) that a zero mean invertible MA(1) process can
be written in its “AR(∞)” expansion
Yt = −θYt−1 − θ2 Yt−2 − θ3 Yt−3 − · · · + et ,
so that et can be written as a function of Yt , Yt−1 , Yt−2 , .... Therefore, E(et |Y1 , Y2 , ..., Yt ) = et
and the 1-step ahead forecast is
Ŷt (1) = µ − θet .
From the representation above, note that the white noise term et can be “computed” in
the 1-step ahead forecast as a byproduct of estimating θ and µ in the MA(1) fit.
l-step ahead forecast: The MMSE prediction for Yt+l , l > 1, is given by
Ŷt (l) = E(Yt+l |Y1 , Y2 , ..., Yt ) = E(µ + et+l − θet+l−1 |Y1 , Y2 , ..., Yt ) = µ,
because et+l and et+l−1 are both independent of Y1 , Y2 , ..., Yt , when l > 1. Therefore, we
have shown that for the MA(1) model, MMSE forecasts are
Ŷt (l) = µ − θet , l = 1
Ŷt (l) = µ, l > 1.
The key feature of an MA(1) process is that observations one unit apart in time are
correlated, whereas observations l > 1 units apart in time are not. For l > 1, there is no
autocorrelation to exploit in making a prediction; this is why a constant mean prediction
is made. Note: More generally, for any purely MA(q) process, the MMSE forecast is
Ybt (l) = µ at all lead times l > q.
REMARK : Just as we saw in the AR(1) model case, note that Ybt (l) → µ as l → ∞. This
is a characteristic of Ybt (l) in all stationary ARMA(p, q) models.
FORECAST ERROR: In the MA(1) model, the 1-step ahead forecast error is
et (1) = Yt+1 − Ŷt (1) = (µ + et+1 − θet ) − (µ − θet ) = et+1 .
Therefore,
E[et (1)] = E(et+1 ) = 0 and var[et (1)] = var(et+1 ) = σe2 .
As in the AR(1) model, 1-step ahead forecasts are unbiased and the variance of the 1-step
ahead forecast error is constant. The variance of the l-step ahead prediction error
et (l), for l > 1, is given by
var[et (l)] = var(Yt+l − µ) = var(et+l − θet+l−1 ) = σe2 (1 + θ2 ).
Summarizing,
var[et (l)] = σe2 , l = 1
var[et (l)] = σe2 (1 + θ2 ), l > 1.
Example 9.4. In Example 7.6 (pp 202, notes), we examined the Göta River discharge
rate data (1807-1956) and used an MA(1) process to model them. The fitted model
(using ML) is
Yt = 535.0311 + et + 0.5350et−1 .
> round(gota.ma1.predict$se,3)
Start = 1957
End = 1966
[1] 83.411 94.599 94.599 94.599 94.599 94.599 94.599 94.599 94.599 94.599
• In Figure 9.4, we display the Göta River data. The full data set is from 1807-1956
(one discharge reading per year). However, to emphasize the MMSE forecasts in
the plot, we start the series at year 1890.
• With l = 1, 2, ..., 10, the MMSE forecasts in the predict output and in Figure 9.4
start at 1957 and end in 1966.
• From the predict output, note that Ŷt (1) = 510.960, the 1-step ahead forecast,
is the only “informative” one. Forecasts for l > 1 are Ŷt (l) = µ̂ ≈ 535.0311.
Figure 9.4: Göta River discharge data. The full data set is from 1807-1956. This figure
starts the series at 1890. MA(1) estimated MMSE forecasts and 95 percent prediction
limits are given for lead times l = 1, 2, ..., 10. These lead times correspond to years
1957-1966.
• Recall that MA(1) forecasts only exploit the autocorrelation at the l = 1 lead time!
In the MA(1) process, there is no autocorrelation after the first lag. All future
forecasts (after the first) will revert to the process mean estimate.
• The (estimated) standard errors of the forecast error are 83.411 for l = 1 and 94.599
for l > 1. This value (94.599) is the square root of the estimated MA(1) process variance γ̂0 .
9.3.3 ARMA(p, q)
ARMA(p, q): Suppose that {et } is zero mean white noise with var(et ) = σe2 and consider
the ARMA(p, q) process
Yt = θ0 + ϕ1 Yt−1 + ϕ2 Yt−2 + · · · + ϕp Yt−p + et − θ1 et−1 − θ2 et−2 − · · · − θq et−q .
To calculate the l-step ahead MMSE forecast, replace the time index t with t + l and take
conditional expectations of both sides (given the process history Y1 , Y2 , ..., Yt ). Doing this
leads directly to the following difference equation:
Ŷt (l) = θ0 + ϕ1 Ŷt (l − 1) + ϕ2 Ŷt (l − 2) + · · · + ϕp Ŷt (l − p)
− θ1 E(et+l−1 |Y1 , ..., Yt ) − θ2 E(et+l−2 |Y1 , ..., Yt ) − · · · − θq E(et+l−q |Y1 , ..., Yt ).
For a general ARMA(p, q) process, MMSE forecasts are calculated using this equation.
• In the AR part, the convention is Ŷt (l − j) = Yt+l−j whenever l − j ≤ 0; that is, “forecasts”
of values already observed are the observed values themselves, for j = 1, 2, ..., p. General
recursive formulas can be derived to compute this conditional expectation, as we saw in
the AR(1) case.
• In the MA part, E(et+l−j |Y1 , Y2 , ..., Yt ) = 0 when l − j > 0, and E(et+l−j |Y1 , Y2 , ..., Yt ) =
et+l−j when l − j ≤ 0.
EXAMPLE : Consider the ARMA(1,1) model
Yt = θ0 + ϕYt−1 + et − θet−1 .
For l = 1, we have
Ŷt (1) = E(Yt+1 |Y1 , Y2 , ..., Yt ) = θ0 + ϕYt − θet .
For l = 2, we have
Ŷt (2) = E(Yt+2 |Y1 , Y2 , ..., Yt ) = θ0 + ϕŶt (1).
It is easy to see that this pattern continues for larger lead times l; in general,
Ŷt (l) = θ0 + ϕŶt (l − 1),
for all lead times l > 1. It is important to make the following observations in this special
ARMA(p = 1, q = 1) case:
• The MMSE forecast Ybt (l) depends on the MA components only when l ≤ q = 1.
• When l > q = 1, the MMSE forecast Ybt (l) depends only on the AR components.
• When l ≤ q, MMSE forecasts depend on both the AR and MA parts of the model.
• When l > q, the MA contributions vanish and forecasts will depend solely on the
recursion identified in the AR part. That is, when l > q,
Ŷt (l) = θ0 + ϕ1 Ŷt (l − 1) + ϕ2 Ŷt (l − 2) + · · · + ϕp Ŷt (l − p).
• It is insightful to note that the last expression, for l > q, can be written as
Ŷt (l) − µ = ϕ1 [Ŷt (l − 1) − µ] + ϕ2 [Ŷt (l − 2) − µ] + · · · + ϕp [Ŷt (l − p) − µ],
where µ = E(Yt ). For a stationary process, Ŷt (l) − µ → 0 as the lead time l increases;
that is, for large lead times l, MMSE forecasts will be approximately equal to the process mean.
• For any stationary ARMA(p, q) process, the variance of the l-step ahead fore-
cast error satisfies
lim var[et (l)] = γ0 ,
l→∞
where γ0 = var(Yt ). That is, for large lead times l, the variance of the forecast error
will be close to the process variance.
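A small sketch of the AR recursion for l > q, iterated numerically in the ARMA(1,1) case
(the parameter values below are illustrative placeholders, not estimates from these notes):

arma11.forecast <- function(theta0, phi, yhat1, L) {
  out <- numeric(L)
  out[1] <- yhat1                       # 1-step forecast supplied separately
  for (l in 2:L) out[l] <- theta0 + phi * out[l - 1]
  out
}
arma11.forecast(theta0 = 2, phi = 0.6, yhat1 = 7, L = 10)  # converges to 2/(1 - 0.6) = 5, the mean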
Example 9.5. In Example 7.5 (pp 195, notes), we examined the bovine blood sugar
data (176 observations) and we used an ARMA(1,1) process to model them. The fitted
ARMA(1,1) model (using ML) is
> round(cows.arma11.predict$se,3)
Start = 177
End = 186
[1] 4.520 7.316 8.249 8.627 8.787 8.856 8.887 8.900 8.906 8.908
• In Figure 9.5, we display the bovine data. The full data set is from day 1-176 (one
blood sugar reading per day). However, to emphasize the MMSE forecasts in the
plot, we start the series at day 81.
• With l = 1, 2, ..., 10, the MMSE forecasts in the predict output and in Figure 9.5
start at day 177 and end at day 186.
• From the predict output and Figure 9.5, note that the predictions are all close to
µ̂ = 59.0071, the estimated process mean. This happens because the last observed
data value was Y176 = 55.91133, which is already somewhat close to µ̂ = 59.0071.
Figure 9.5: Bovine blood sugar data. The full data set is from day 1-176. This figure
starts the series at day 81. ARMA(1,1) estimated MMSE forecasts and 95 percent
prediction limits are given for lead times l = 1, 2, ..., 10. These lead times correspond to
days 177-186.
• The variance of the l-step ahead prediction error et (l) should satisfy
lim l→∞ var[et (l)] = γ0 = σe2 (1 − 2ϕθ + θ2 )/(1 − ϕ2 ),
the ARMA(1,1) process variance.
NOTE : For invertible ARIMA(p, d, q) models with d ≥ 1, MMSE forecasts are computed
using the same approach as in the stationary case. To see why, suppose that d = 1, so
that the model is
ϕ(B)(1 − B)Yt = θ(B)et ,
where
ϕ(B)(1 − B) = (1 − ϕ1 B − ϕ2 B 2 − · · · − ϕp B p )(1 − B)
= (1 − ϕ1 B − ϕ2 B 2 − · · · − ϕp B p ) − (B − ϕ1 B 2 − ϕ2 B 3 − · · · − ϕp B p+1 )
≡ ϕ∗ (B),
say, a polynomial in B of degree p + 1. The model can then be written as
ϕ∗ (B)Yt = θ(B)et ,
which has the same form as a (nonstationary) ARMA(p + 1, q) model, so forecasts follow from
the same conditioning arguments as before. For example, consider the ARI(1,1) model
Yt = (1 + ϕ)Yt−1 − ϕYt−2 + et . If l = 1, then
Ŷt (1) = E(Yt+1 |Y1 , Y2 , ..., Yt )
= E[(1 + ϕ)Yt |Y1 , Y2 , ..., Yt ] − E(ϕYt−1 |Y1 , Y2 , ..., Yt ) + E(et+1 |Y1 , Y2 , ..., Yt )
= (1 + ϕ)Yt − ϕYt−1 .
If l = 2, then
Ŷt (2) = E(Yt+2 |Y1 , Y2 , ..., Yt )
= E[(1 + ϕ)Yt+1 |Y1 , Y2 , ..., Yt ] − E(ϕYt |Y1 , Y2 , ..., Yt ) + E(et+2 |Y1 , Y2 , ..., Yt )
= (1 + ϕ)Ŷt (1) − ϕYt .
Writing recursive expressions for MMSE forecasts in any invertible ARIMA(p, d, q) model
can be done similarly.
RESULT : The l-step ahead forecast error et (l) = Yt+l − Ybt (l) for any invertible
ARIMA(p, d, q) model has the following characteristics:
E[et (l)] = 0
var[et (l)] = σe2 (Ψ0^2 + Ψ1^2 + · · · + Ψl−1^2 ),
where the Ψ weights correspond to those in the truncated linear process representation
of the ARIMA(p, d, q) model; see pp 200 (CC).
• The first equation implies that MMSE ARIMA forecasts are unbiased.
• The salient feature in the second equation is that for nonstationary models, the
Ψ weights do not “die out” as they do with stationary models.
• Therefore, for nonstationary models, the variance of the forecast error var[et (l)]
continues to increase as l does. This is not surprising given that the process is not
stationary.
Example 9.6. In Example 8.7 (pp 225, notes), we examined monthly spot prices for
crude oil (measured in U.S. dollars per barrel) from 1/86 to 1/06, and we used a log-
transformed IMA(1,1) process to model them. The model fit (using ML) is
> round(ima11.log.oil.predict$se,3)
Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
2006 0.082 0.134 0.171 0.201 0.227 0.251 0.272 0.292 0.311 0.328 0.345
2007 0.361
• In Figure 9.6, we display the oil price data. The full data set is from 1/86 to 1/06
(one observation per month). However, to emphasize the MMSE forecasts in the
plot, we start the series at month 1/98.
• With l = 1, 2, ..., 12, the estimated MMSE forecasts in the predict output and in
Figure 9.6 start at 2/06 and end in 1/07.
• From the predict output, note that Ybt (1) = Ybt (2) = · · · = Ybt (12) = 4.208. It is
important to remember that these forecasts are on the log scale.
• On the original scale (in dollars), we will see later that MMSE forecasts are not
constant.
Figure 9.6: Oil price data (log-transformed). The full data set is from 1/86 to 1/06. This
figure starts the series at 1/98. IMA(1,1) estimated MMSE forecasts and 95 percent
prediction limits (on the log scale) are given for lead times l = 1, 2, ..., 12. These lead
times correspond to months 2/06-1/07.
Example 9.7. In Example 1.6 (pp 7, notes), we examined the USC fall enrollment data
(Columbia campus) from 1954-2010. An ARI(1,1) process provides a good fit to these
data; fitting this model in R (using ML) gives the following output:
> round(enrollment.ari11.predict$se,3)
Start = 2011
End = 2020
[1] 1058.229 1789.494 2389.190 2894.460 3332.925 3723.059 4077.018 4402.947 4706.473
[10] 4991.615
• In Figure 9.7, we display the USC enrollment data. The full data set is from 1954-
2010 (one enrollment count per year). However, to emphasize the MMSE forecasts
in the plot, we start the series at year 1974.
• With l = 1, 2, ..., 10, the estimated MMSE forecasts in the predict output and in
Figure 9.7 start at 2011 and end at 2020.
• From the predict output, note that the estimated MMSE forecasts for the next
10 years, based on the ARI(1,1) fit, fluctuate slightly.
Figure 9.7: University of South Carolina fall enrollment data. The full data set is from
1954-2010. This figure starts the series at 1974. ARI(1,1) estimated MMSE forecasts
and 95 percent prediction limits are given for lead times l = 1, 2, ..., 10. These lead times
correspond to years 2011-2020.
We now derive prediction intervals for future responses with deterministic trend and
ARIMA models.
NOTE : Prediction intervals and confidence intervals, while similar in spirit, have very
different interpretations. A confidence interval is for a population (model) parameter,
which is fixed. A prediction interval is constructed for a random variable.
TERMINOLOGY : Consider the deterministic trend model
Yt = µt + Xt ,
where µt is a non-random trend function and where we assume (for purposes of the current
discussion) that {Xt } is a normally distributed stochastic process with E(Xt ) = 0 and
var(Xt ) = γ0 (constant). We have already shown the following:
E[et (l)] = 0
var[et (l)] = γ0 ,
where et (l) = Yt+l − Ŷt (l) is the l-step ahead prediction error. Under the assumption of
normality, the random variable
Z = [Yt+l − Ŷt (l)]/se[et (l)] ∼ N (0, 1),
so that
pr( −zα/2 < [Yt+l − Ŷt (l)]/se[et (l)] < zα/2 ) = 1 − α,
where zα/2 is the upper α/2 quantile of the N (0, 1) distribution.
Using algebra to rearrange the event inside the probability symbol, we have
pr( Ŷt (l) − zα/2 se[et (l)] < Yt+l < Ŷt (l) + zα/2 se[et (l)] ) = 1 − α.
Therefore, Ŷt (l) ± zα/2 se[et (l)] is a 100(1 − α) percent prediction interval for Yt+l .
REMARK : The form of the prediction interval includes the quantities Ŷt (l) = µt+l and
se[et (l)] = √γ0 . Of course, these are population parameters that must be estimated
using the data.
Example 9.8. Consider the global temperature data from Example 3.4 (pp 53, notes).
Fitting a linear deterministic trend model Yt = β0 + β1 t + Xt , for t = 1900, 1901, ..., 1997,
produces the following output in R:
Suppose that {Xt } is a normal white noise process with (constant) variance γ0 . The
analysis in Section 3.5.1 (notes, pp 72-73) does support the normality assumption.
• The estimate of γ0 is
γ̂0 ≈ (0.1298)2 ≈ 0.0168.
• Therefore, with
ŝe[et (1)] ≈ √γ̂0 ≈ 0.1298,
an (estimated) 95 percent prediction interval for Y1998 , the 1998 temperature deviation, is
µ̂1998 ± 1.96(0.1298) ≈ 0.198 ± 0.254 =⇒ (−0.056, 0.452).
• If we had made this prediction in 1997, we would have been 95 percent confident
that the temperature deviation for 1998, Y1998 , falls between −0.056 and 0.452.
REMARK : Because the half-width of the prediction interval,
zα/2 se[et (l)] = zα/2 √γ0 ,
is free of l, prediction intervals from the deterministic trend model have the same width
indefinitely into the future.
RECALL: Suppose that {et } is a zero mean white noise process with var(et ) = σe2 . In
general, an ARIMA(p, d, q) process can be written as ϕ(B)(1 − B)d Yt = θ(B)et .
We have seen that the l-step ahead forecast error et (l) = Yt+l − Ybt (l) for any invertible
ARIMA(p, d, q) model has the following characteristics:
E[et (l)] = 0
var[et (l)] = σe2 (Ψ0^2 + Ψ1^2 + · · · + Ψl−1^2 ),
where the Ψ weights are unique to the specific model under investigation. If we addi-
tionally assume that the white noise process {et } is normally distributed, then
Ŷt (l) ± zα/2 se[et (l)]
is a 100(1 − α) percent prediction interval for Yt+l . As we have seen in the examples
so far, R gives (estimated) MMSE forecasts and standard errors; i.e., estimates of Ybt (l)
and se[et (l)], so we can compute prediction intervals associated with any ARIMA(p, d, q)
model. It is important to emphasize that normality is assumed.
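A minimal sketch of this calculation in R, using the Lake Huron AR(1) fit:

fc <- predict(huron.ar1.fit, n.ahead = 20)
z  <- qnorm(0.975)                      # approximately 1.96
lower <- fc$pred - z * fc$se
upper <- fc$pred + z * fc$se
cbind(forecast = fc$pred, lower, upper) # 95 percent prediction intervals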
Example 9.9. In Example 9.3, we examined the Lake Huron elevation data (from 1880-
2006) and calculated the (estimated) MMSE forecasts based on an AR(1) model fit with
lead times l = 1, 2, ..., 20. These forecasts, along with 95 percent prediction intervals
(limits) were depicted visually in Figure 9.3. Here are the numerical values of these
prediction intervals from R (I display only out to lead time l = 10 for brevity):
• In the R code, $pred extracts the estimated MMSE forecasts and $se extracts the
estimated standard error of the forecast error. The expression qnorm(0.975,0,1)
gives the upper 0.025 quantile of the N (0, 1) distribution (approximately 1.96).
• For example, we are 95 percent confident that the Lake Huron elevation level for
2015 will be between 577.3406 and 582.5456 feet.
• Note how the prediction limits (lower and upper) start to stabilize as the lead
time l increases. This is typical of a stationary process. Prediction limits from
nonstationary model fits do not stabilize as l increases.
• Important: The validity of prediction intervals depends on the white noise process
{et } being normally distributed.
9.5.1 Differencing
EXAMPLE : Suppose that {et } is a zero mean white noise process with var(et ) = σe2 and
consider the IMA(1,1) model
Yt = Yt−1 + et − θet−1 .
For l = 1,
Ŷt (1) = E(Yt+1 |Y1 , Y2 , ..., Yt ) = E(Yt + et+1 − θet |Y1 , Y2 , ..., Yt ) = Yt − θet .
For l > 1,
Ŷt (l) = E(Yt+l |Y1 , Y2 , ..., Yt ) = E(Yt+l−1 + et+l − θet+l−1 |Y1 , Y2 , ..., Yt ) = Ŷt (l − 1).
Therefore, we have shown that for the IMA(1,1) model, MMSE forecasts are
Ŷt (l) = Yt − θet , l = 1
Ŷt (l) = Ŷt (l − 1), l > 1;
that is, all forecasts are equal to the 1-step ahead forecast Yt − θet .
Now, let Wt = ∇Yt = Yt − Yt−1 , so that Wt follows a zero-mean MA(1) model; i.e.,
Wt = et − θet−1 .
It can be shown that
(a) forecasting the original nonstationary series {Yt } directly, and
(b) forecasting the stationary differenced series Wt = ∇Yt and then summing to
obtain the forecast in original terms
are equivalent procedures. In fact, this equivalence holds when forecasting for any
ARIMA(p, d, q) model!
• That is, the analyst can calculate predictions with the nonstationary model for Yt
or with the stationary model for Wt = ∇d Yt (and then convert back to the original
scale by adding).
• The predictions in both cases will be equal (hence, the resulting standard errors
will be the same too).
• The reason this occurs is that differencing is a linear operation (just as conditional
expectation is).
RECALL: The power (Box-Cox) family of transformations is
T (Yt ) = (Yt^λ − 1)/λ, λ ̸= 0
T (Yt ) = ln(Yt ), λ = 0,
where λ is the transformation parameter. Many time series processes {Yt } exhibit
nonconstant variability that can be stabilized by taking logarithms. However, the func-
tion T (x) = ln x is not a linear function, so transformations on the log scale can not
simply be “undone” as easily as with differenced series (differencing is a linear transfor-
mation). MMSE forecasts are not preserved under exponentiation.
NOTATION : Let
Zt = ln Yt ,
and denote the MMSE forecast for Zt+l by Ẑt (l), that is, Ẑt (l) is the l-step ahead MMSE
forecast on the log scale.
• The MMSE forecast for Yt+l is not Ŷt (l) = exp{Ẑt (l)}!! This back-transformed value is
sometimes called the naive forecast.
• The theoretical argument on pp 210 (CC) shows that the corresponding MMSE
forecast of Yt+l is
Ŷt (l) = exp{ Ẑt (l) + (1/2) var[et (l)] },
where var[et (l)] is the variance of the l-step ahead forecast error et (l) = Zt+l − Ẑt (l).
Example 9.10. In Example 9.6, we examined the monthly oil price data (1/86-1/06)
and we computed MMSE forecasts and prediction limits for l = 1, 2, ..., 12 (i.e., for 2/06
to 1/07), based on an IMA(1,1) fit for Zt = ln Yt . The estimated MMSE forecasts (on
the log scale) are depicted visually in Figure 9.6. The estimated MMSE forecasts, both
on the log scale and on the original scale (back-transformed), are given below:
For example, the MMSE forecast (on the original scale) for June, 2006 is given by
Ŷt (5) = exp{ 4.208 + (1/2)(0.227)2 } ≈ 68.948.
NOTE : A 100(1 − α) percent prediction interval for Yt+l can be formed by exponen-
tiating the endpoints of the prediction interval for Zt+l = ln Yt+l . This is true because
1 − α = pr( Ẑ(L) < Zt+l < Ẑ(U ) ) = pr( exp{Ẑ(L)} < Yt+l < exp{Ẑ(U )} ),
where Ẑ(L) and Ẑ(U ) denote the lower and upper prediction limits for Zt+l ;
that is, because the exponential function f (x) = ex is strictly increasing, the two proba-
bilities above are the same.
• For example, a 95 percent prediction interval for June, 2006 (on the log scale) is
4.208 ± 1.96(0.227) =⇒ (3.763, 4.653).
• A 95 percent prediction interval for June, 2006 on the original scale (in dollars) is
(e^3.763 , e^4.653 ) ≈ (43.08, 104.90).
Therefore, we are 95 percent confident that the June, 2006 oil price (had we made
this prediction in January, 2006) would fall between 43.08 and 104.90 dollars.
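A minimal sketch of these back-transformations in R, assuming an IMA(1,1) fit to the logged
oil prices (e.g., fit <- arima(log(oil.price), order = c(0, 1, 1), method = "ML")):

fc <- predict(fit, n.ahead = 12)
exp(fc$pred + 0.5 * fc$se^2)        # MMSE forecasts on the original (dollar) scale
exp(fc$pred)                        # naive back-transform (smaller than the MMSE forecast)
exp(fc$pred - 1.96 * fc$se)         # lower 95 percent prediction limits
exp(fc$pred + 1.96 * fc$se)         # upper 95 percent prediction limits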
10 Seasonal ARIMA Models
10.1 Introduction
PREVIEW : In this chapter, we introduce new ARIMA models that incorporate seasonal
patterns occurring over time. With seasonal data, dependence with the past occurs most
prominently at multiples of an underlying seasonal lag, denoted by s. Consider the
following examples:
• With monthly data, there can be strong autocorrelation at lags that are multiples
of s = 12. For example, January observations tend to be “alike” across years,
February observations tend to be “alike,” and so on.
• With quarterly data, there can be strong autocorrelation at lags that are multiples
of s = 4. For example, first quarter sales tend to be “alike” across years, second
quarter sales tend to be “alike,” and so on.
Example 10.1. In Example 1.2 (pp 3, notes), we examined the monthly U.S. milk
production data (in millions of pounds) from January, 1994 to December, 2005.
• In Figure 10.1, we see that there are two types of trend in the milk production
data: an upward linear trend over time and a seasonal pattern that repeats itself every 12 months.
Figure 10.1: United States milk production data. Monthly production figures, measured
in millions of pounds, from January, 1994 to December, 2005.
• We know the upward linear trend can be “removed” by working with first differences
∇Yt = Yt − Yt−1 . This is how we removed linear trends with nonseasonal data.
• Figure 10.2 displays the series of first differences ∇Yt . From this plot, it is clear that
the upward linear trend over time has been removed. That is, the first differences
∇Yt look stationary in the mean level.
• However, the first difference process {∇Yt } still displays a pronounced seasonal
pattern that repeats itself every s = 12 months. This is easily seen from the
monthly plotting symbols that I have added. How can we “handle” this type of
pattern? Is it possible to “remove” it as well?
GOAL: We wish to enlarge our class of ARIMA(p, d, q) models to handle seasonal data
such as these.
Figure 10.2: United States milk production data. First differences ∇Yt = Yt − Yt−1 .
Monthly plotting symbols have been added.
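A plot like Figure 10.2 can be produced with a few lines of R; a minimal sketch (assuming the
monthly series is stored as milk and that the TSA package is loaded for season()):

dmilk <- diff(milk)
plot(dmilk, ylab = "Amount of milk produced: First differences", type = "l")
points(time(dmilk), dmilk, pch = substr(as.vector(season(dmilk)), 1, 1), cex = 0.8)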
10.2.1 MA(Q)s
TERMINOLOGY : Suppose {et } is a zero mean white noise process with var(et ) = σe2 . A
seasonal moving average (MA) model of order Q with seasonal period s, denoted
by MA(Q)s , is
Yt = et − Θ1 et−s − Θ2 et−2s − · · · − ΘQ et−Qs .
A nonzero mean µ could be added for flexibility (as with nonseasonal models), but we
take µ = 0 for simplicity.
MA(1)12 : With Q = 1 and s = 12, the MA(1)12 model is
Yt = et − Θet−12 .
The autocovariance function is easy to compute. For example,
γ1 = cov(Yt , Yt−1 ) = cov(et − Θet−12 , et−1 − Θet−13 ) = 0,
because no white noise subscripts match. In fact, it is easy to see that γk = 0 for all k > 0,
except when k = s = 12. Note that
γ0 = var(Yt ) = var(et − Θet−12 ) = (1 + Θ2 )σe2
and
γ12 = cov(Yt , Yt−12 ) = cov(et − Θet−12 , et−12 − Θet−24 ) = −Θvar(et−12 ) = −Θσe2 .
Because E(Yt ) = 0 and γk are both free of t, an MA(1)12 process is stationary. The
autocorrelation function (ACF) for an MA(1)12 process is
ρk = 1, k = 0
ρk = −Θ/(1 + Θ2 ), k = 12
ρk = 0, otherwise.
NOTE : The form of the MA(1)12 ACF is identical to the form of the nonseasonal MA(1)
ACF from Chapter 4. For the MA(1)12 , the only nonzero autocorrelation occurs at the
first seasonal lag k = 12, as opposed to at k = 1 in the nonseasonal MA(1).
Figure 10.3: One simulated realization (n = 200) of the MA(1)12 process in Example 10.2.
REMARK : Note that an MA(1)12 process can be viewed as a special nonseasonal MA(12)
process with θ1 = θ2 = · · · = θ11 = 0 and θ12 = Θ. Because of this equivalence (which occurs
here and with other seasonal models), we can use our already-established methods to specify,
fit, diagnose, and forecast seasonal models.
Example 10.2. We use R to simulate one realization of an MA(1)12 process with Θ = −0.9,
that is,
Yt = et + 0.9et−12 ,
where et ∼ iid N (0, 1) and n = 200. This realization is displayed in Figure 10.3. In
Figure 10.4, we display the population (theoretical) ACF and PACF for this MA(1)12
process and the sample versions that correspond to the simulation in Figure 10.3.
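A minimal sketch of this simulation (the seed is arbitrary):

set.seed(520)
n <- 200
e <- rnorm(n + 12)                       # white noise, with 12 start-up values
Y <- ts(e[13:(n + 12)] + 0.9 * e[1:n])   # Y_t = e_t + 0.9 e_{t-12}, i.e., Theta = -0.9
acf(Y, lag.max = 50)                     # single spike near lag 12
pacf(Y, lag.max = 50)                    # decay across lags 12, 24, 36, ...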
Figure 10.4: MA(1)12 with Θ = −0.9. Upper left: Population ACF. Upper right: Popu-
lation PACF. Lower left (right): Sample ACF (PACF) using data in Figure 10.3.
• The population ACF and PACF (Figure 10.4; top) display the same patterns as
the nonseasonal MA(1), except that now these patterns occur at seasonal lags.
– The population ACF displays nonzero autocorrelation only at the first (sea-
sonal) lag k = 12. In other words, observations 12 units apart in time are
correlated, whereas all other observations are not.
– The population PACF shows a decay across seasonal lags k = 12, 24, 36, ...,.
• The sample ACF and PACF reveal these same patterns overall. Margin of error
bounds in the sample ACF/PACF are for white noise; not an MA(1)12 process.
MA(2)12 : A seasonal MA model of order Q = 2 with seasonal period s = 12 is
Yt = et − Θ1 et−12 − Θ2 et−24 .
NOTE : The ACF for an MA(2)12 process has the same form as the ACF for a nonseasonal
MA(2). The only difference is that nonzero autocorrelations occur at the first two
seasonal lags k = 12 and k = 24, as opposed to at k = 1 and k = 2 in the nonseasonal
MA(2).
REMARK : An MA(2)12 process can likewise be viewed as a special nonseasonal MA(24)
process whose only nonzero coefficients are θ12 = Θ1 and θ24 = Θ2 . This again reveals that we
can use our already-established methods to specify, fit, diagnose, and forecast seasonal models.
NOTE : In general, the MA(Q)s model can be written in terms of the backshift operator B as
Yt = et − Θ1 B s et − Θ2 B 2s et − · · · − ΘQ B Qs et
= (1 − Θ1 B s − Θ2 B 2s − · · · − ΘQ B Qs )et ≡ ΘQ (B s )et ,
where ΘQ (B s ) is the seasonal MA characteristic operator.
10.2.2 AR(P )s
TERMINOLOGY : Suppose {et } is a zero mean white noise process with var(et ) = σe2 . A
seasonal autoregressive (AR) model of order P with seasonal period s, denoted
by AR(P )s , is
Yt = Φ1 Yt−s + Φ2 Yt−2s + · · · + ΦP Yt−P s + et .
A nonzero mean µ could be added for flexibility (as with nonseasonal models), but we
take µ = 0 for simplicity.
AR(1)12 : With P = 1 and s = 12, the AR(1)12 model is
Yt = ΦYt−12 + et .
Figure 10.5: One simulated realization (n = 200) of the AR(1)12 process in Example 10.3.
NOTE : It can be shown that the ACF of an AR(1)12 process is ρ12k = Φ^k , for k = 0, 1, 2, ....
• That is, ρ0 = 1, ρ12 = Φ, ρ24 = Φ2 , ρ36 = Φ3 , and so on, similar to the nonseasonal
AR(1). The ACF ρk = 0 at all lags k that are not multiples of s = 12.
• An AR(1)12 process can be viewed as a special nonseasonal AR(12) process with
ϕ1 = ϕ2 = · · · = ϕ11 = 0 and ϕ12 = Φ.
Example 10.3. We use R to simulate one realization of an AR(1)12 process with Φ = 0.9,
that is,
Yt = 0.9Yt−12 + et ,
Figure 10.6: AR(1)12 with Φ = 0.9. Upper left: Population ACF. Upper right: Popu-
lation PACF. Lower left (right): Sample ACF (PACF) using data in Figure 10.5.
where et ∼ iid N (0, 1) and n = 200. This realization is displayed in Figure 10.5. In
Figure 10.6, we display the population (theoretical) ACF and PACF for this AR(1)12
process and the sample versions that correspond to the simulation in Figure 10.5.
• The population ACF and PACF (Figure 10.6; top) display the same patterns as
the nonseasonal AR(1), except that now these patterns occur at seasonal lags.
– The population ACF displays a slow decay across the seasonal lags k =
12, 24, 36, 48, ...,. In other words, observations that are 12, 24, 36, 48, etc.
units apart in time are correlated, whereas all other observations are not.
– The population PACF is nonzero at the first seasonal lag k = 12. The PACF
is zero at all other lags. This is analogous to the PACF for an AR(1) being
nonzero when k = 1 and zero elsewhere.
• The sample ACF and PACF reveal these same patterns overall. Margin of error
bounds in the sample ACF/PACF are for white noise; not an AR(1)12 process.
AR(2)12 : A seasonal AR model of order P = 2 with seasonal period s = 12; i.e., AR(2)12 , is
Yt = Φ1 Yt−12 + Φ2 Yt−24 + et .
• A seasonal AR(2)12 behaves like the nonseasonal AR(2) at the seasonal lags.
– The PACF ϕkk is nonzero at lags k = 12 and k = 24; it is zero at all other
lags.
NOTE : In general, the AR(P )s model
Yt = Φ1 Yt−s + Φ2 Yt−2s + · · · + ΦP Yt−P s + et
can be expressed as
(1 − Φ1 B s − Φ2 B 2s − · · · − ΦP B P s )Yt = et ⇐⇒ ΦP (B s )Yt = et ,
where ΦP (B s ) is the seasonal AR characteristic operator.
TERMINOLOGY : Suppose {et } is a zero mean white noise process with var(et ) = σe2 .
A seasonal autoregressive moving average (ARMA) model of orders P and Q
with seasonal period s, denoted by ARMA(P, Q)s , is
Yt = Φ1 Yt−s + Φ2 Yt−2s + · · · + ΦP Yt−P s + et − Θ1 et−s − Θ2 et−2s − · · · − ΘQ et−Qs .
A nonzero mean µ could be added for flexibility (as with nonseasonal models), but we
take µ = 0 for simplicity.
In terms of the backshift operator, the ARMA(P, Q)s model is
ΦP (B s )Yt = ΘQ (B s )et ,
where
ΦP (B s ) = 1 − Φ1 B s − Φ2 B 2s − · · · − ΦP B P s
ΘQ (B s ) = 1 − Θ1 B s − Θ2 B 2s − · · · − ΘQ B Qs .
• The following table succinctly summarizes the behavior of the population ACF and
PACF for seasonal ARMA(P, Q)s processes (the behavior described is at the seasonal
lags k = s, 2s, 3s, ...):

Model           ACF                       PACF
AR(P )s         tails off at lags ks      cuts off after lag P s
MA(Q)s          cuts off after lag Qs     tails off at lags ks
ARMA(P, Q)s     tails off at lags ks      tails off at lags ks
• In many ways, this “extension” is not that much of an extension, because the
seasonal ARMA(P, Q)s model is essentially an ARMA(p, q) model restricted to the
seasonal lags k = s, 2s, 3s, ....
• However, if we combine these new seasonal ARMA(P, Q)s models with our tradi-
tional nonseasonal ARMA(p, q) models, we create a larger class of models applicable
for use with stationary processes that exhibit seasonality.
• We now examine this new class of models, the so-called multiplicative seasonal
ARMA class.
MA(1) × MA(1)12 : Suppose {et } is a zero mean white noise process with var(et ) = σe2 .
Consider the nonseasonal MA(1) model
Yt = et − θet−1 ⇐⇒ Yt = (1 − θB)et
Yt = et − Θet−12 ⇐⇒ Yt = (1 − ΘB 12 )et .
• The defining characteristic of the nonseasonal MA(1) process is that the only
nonzero autocorrelation occurs at lag k = 1.
• The defining characteristic of the seasonal MA(1)12 process is that the only nonzero
autocorrelation occurs at lag k = 12.
Consider the new process formed by multiplying the two MA characteristic operators:
Yt = (1 − θB)(1 − ΘB 12 )et
= (1 − θB − ΘB 12 + θΘB 13 )et ,
or, equivalently,
Yt = et − θet−1 − Θet−12 + θΘet−13 .
We call this a multiplicative seasonal MA(1) × MA(1)12 model. The term “multi-
plicative” arises because the MA characteristic operator 1 − θB − ΘB 12 + θΘB 13 is the
product of (1 − θB) and (1 − ΘB 12 ). An MA(1) × MA(1)12 process has E(Yt ) = 0 and
nonzero autocorrelations at lags 1, 11, 12, and 13 only:
ρ1 = −θ/(1 + θ2 )
ρ11 = θΘ/[(1 + θ2 )(1 + Θ2 )]
ρ12 = −Θ/(1 + Θ2 )
ρ13 = θΘ/[(1 + θ2 )(1 + Θ2 )].
MA(1) × AR(1)12 : Suppose {et } is a zero mean white noise process with var(et ) = σe2 .
Consider the two models
Yt = et − θet−1 ⇐⇒ Yt = (1 − θB)et
and
Yt = ΦYt−12 + et ⇐⇒ (1 − ΦB 12 )Yt = et ,
• The defining characteristic of the nonseasonal MA(1) is that the only nonzero
autocorrelation occurs at lag k = 1.
• The defining characteristic of the seasonal AR(1)12 is that the autocorrelation de-
cays across seasonal lags k = 12, 24, 36, ...,.
Consider the new process defined by combining the two:
(1 − ΦB 12 )Yt = (1 − θB)et ,
or, equivalently,
Yt = ΦYt−12 + et − θet−1 .
We call this a multiplicative seasonal MA(1) × AR(1)12 model.
Figure 10.7: Top: Population ACF/PACF for MA(1)×MA(1)12 process with θ = 0.5 and
Θ = 0.9. Bottom: Population ACF/PACF for MA(1) × AR(1)12 process with θ = 0.5
and Φ = 0.9.
NOTE : The ACF of the MA(1) × AR(1)12 process displays
• AR-type autocorrelation that decays across the seasonal lags k = 12, 24, 36, ..., and
• additional MA-type autocorrelation at lag k = 1 and at lags one unit in time from
the seasonal lags, that is, at k = 11 and k = 13, k = 23 and k = 25, and so on.
This behavior is evident in the population ACF/PACF for the model
Yt = ΦYt−12 + et − θet−1 ,
with θ = 0.5 and Φ = 0.9, shown in the bottom panels of Figure 10.7.
TERMINOLOGY : Suppose {et } is a zero mean white noise process with var(et ) = σe2 .
In general, we can combine a nonseasonal ARMA(p, q) process
ϕ(B)Yt = θ(B)et
with a seasonal ARMA(P, Q)s process
ΦP (B s )Yt = ΘQ (B s )et
to form the multiplicative seasonal ARMA(p, q) × (P, Q)s model
ΦP (B s )ϕ(B)Yt = ΘQ (B s )θ(B)et .
Example 10.4. Data file: boardings (TSA). Figure 10.8 displays the number of public
transit boardings (mostly for bus and light rail) in Denver, Colorado from 8/2000 to
3/2006. The data have been log-transformed.
• From the plot, the boarding process appears to be relatively stationary in the mean
level; that is, there are no pronounced shifts in mean level over time.
Figure 10.8: Denver public transit data. Monthly number of public transit boardings
(log-transformed) in Denver from 8/2000 to 3/2006. Monthly plotting symbols have
been added.
• In Figure 10.9, we display the sample ACF and PACF for the boardings data. Note
that the margin of error bounds in the plot are for a white noise process.
Figure 10.9: Denver public transit data. Left: Sample ACF. Right: Sample PACF.
– Around the seasonal lags k = 12, k = 24, and k = 36 in the ACF, there are noticeable autocorrelations within 3 time units in both directions. This suggests a nonseasonal MA(3) component, and, combined with the behavior at the seasonal lags themselves, leads us to consider the ARMA(0, 3) × ARMA(1, 0)12 model fit below.
Figure 10.10: Denver public transit data. Standardized residuals from ARMA(0, 3) ×
ARMA(1, 0)12 model fit.
Note that each of the parameter estimates is statistically different from zero. The white noise variance estimate for the fitted ARMA(0, 3) × ARMA(1, 0)12 model (on the log scale) is σ̂e^2 ≈ 0.0006542.
Finally, the tsdiag output in Figure 10.11 shows no notable problems with the ARMA(0, 3)×
ARMA(1, 0)12 model.
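The model fit referenced above might be obtained along the following lines; this is a sketch assuming the log-transformed boardings series is stored in an object named log.boardings (the object names here are illustrative, chosen to match the predict output shown later).

# Sketch: fit an ARMA(0,3) x ARMA(1,0)_12 model to the log-boardings series
boardings.arma03.arma10.fit <-
  arima(log.boardings, order = c(0, 0, 3), method = 'ML',
        seasonal = list(order = c(1, 0, 0), period = 12))
boardings.arma03.arma10.fit          # estimates, s.e.'s, sigma^2, AIC
tsdiag(boardings.arma03.arma10.fit)  # residual diagnostics (cf. Figure 10.11)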
Figure 10.11: Denver public transit data. ARMA(0, 3) × ARMA(1, 0)12 tsdiag output.
OVERFITTING: For an ARMA(0, 3) × ARMA(1, 0)12 model, there are 4 overfitted models, each obtained by increasing one of p, q, P, or Q by one: ARMA(1, 3) × ARMA(1, 0)12, ARMA(0, 4) × ARMA(1, 0)12, ARMA(0, 3) × ARMA(2, 0)12, and ARMA(0, 3) × ARMA(1, 1)12. Fitting these to the boardings data (results not shown) provides a check on the adequacy of the ARMA(0, 3) × ARMA(1, 0)12 model.
Figure 10.12: Denver public transit data. The full data set is from 8/2000-3/2006. This
figure starts the series at 1/2003. ARMA(0, 3)×ARMA(1, 0)12 estimated MMSE forecasts
and 95 percent prediction limits are given for lead times l = 1, 2, ..., 12. These lead times
correspond to years 4/2006-3/2007.
FORECASTING: The estimated MMSE forecasts and standard errors (on the log scale) are computed for lead times l = 1, 2, ..., 12; the estimated standard errors from the predict output are shown below, and the forecasts themselves are plotted in Figure 10.12:
> round(boardings.arma03.arma10.predict$se,3)
Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
2006 0.026 0.032 0.035 0.036 0.036 0.036 0.036 0.036 0.036
2007 0.036 0.036 0.036
• In Figure 10.12, we display the Denver boardings data. The full data set is from
8/00 to 3/06 (one observation per month). However, to emphasize the MMSE
forecasts in the plot, we start the series at month 1/03.
• With l = 1, 2, ..., 12, the estimated MMSE forecasts in the predict output and in Figure 10.12 start at 4/06 and end at 3/07. It is important to remember that these forecasts are on the log scale. MMSE forecasts and 95 percent prediction intervals on the original scale can be obtained by back-transforming, as sketched below.
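A minimal sketch of the back-transformation, using the predict output above (stored in boardings.arma03.arma10.predict): because the exponential function is monotone, the 95 percent limits on the log scale can be exponentiated directly, while the MMSE forecast on the original scale picks up the usual lognormal mean correction exp(pred + se^2/2) under normality on the log scale.

# Sketch: back-transform log-scale forecasts to the original (boardings) scale
pred <- boardings.arma03.arma10.predict$pred
se   <- boardings.arma03.arma10.predict$se
lower <- exp(pred - 1.96 * se)   # 95% prediction limits: exponentiate the
upper <- exp(pred + 1.96 * se)   #   log-scale limits directly
mmse  <- exp(pred + se^2 / 2)    # MMSE forecast (lognormal mean correction)
cbind(lower, mmse, upper)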
REMARK: The multiplicative seasonal ARMA(p, q) × ARMA(P, Q)s model is a flexible class of time series models for stationary seasonal processes. The next step is to extend this class of models to handle two types of nonstationarity:

• Nonseasonal nonstationarity, that is, changes in the overall mean level over time (the type of nonstationarity we have handled with differencing up to now).

• Seasonal nonstationarity, that is, additional changes in the seasonal mean level, even after possibly adjusting for nonseasonal nonstationarity over time.
RECALL: The (nonseasonal) first difference of {Yt} is ∇Yt = Yt − Yt−1 = (1 − B)Yt. This definition can be generalized to any number of differences; in general, the dth differences are given by

∇^d Yt = (1 − B)^d Yt.
To illustrate the second type of nonstationarity, consider the process

Yt = St + et,

where

St = St−12 + ut,

and {ut} is zero mean white noise that is uncorrelated with {et}. That is, {St} is a zero mean random walk across seasons with period s = 12. For this process, taking nonseasonal
differences (as we have done up until now) will not have an effect on the seasonal
nonstationarity. For example, with d = 1, we have

∇Yt = Yt − Yt−1 = (St − St−1) + (et − et−1).

The first difference process {∇Yt} is still nonstationary because {St} is a random walk across seasons; i.e., across time points that are 12 units apart.
• That is, taking (nonseasonal) differences has only produced a more complicated
model, one which is still nonstationary across seasons.
• We therefore need to define a new differencing operator that can remove nonsta-
tionarity across seasonal lags.
TERMINOLOGY: The first seasonal difference of {Yt} is

∇s Yt = Yt − Yt−s = (1 − B^s)Yt,

for a seasonal period s. For example, with s = 12 and monthly data, the first seasonal differences are

∇12 Yt = Yt − Yt−12 = (1 − B^12)Yt;

that is, the first differences of the January observations, the first differences of the February observations, and so on.
Returning to the process Yt = St + et, with St = St−12 + ut, the first seasonal differences are

∇12 Yt = (St − St−12) + (et − et−12) = ut + et − et−12.

It can be shown that this process has the same ACF as a stationary seasonal MA(1)12 process. That is, taking first seasonal differences has coerced the {Yt} process into stationarity.
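A quick simulation illustrates the point. This sketch (seed and sample size are arbitrary) generates Yt = St + et with St = St−12 + ut and compares the sample ACFs of {Yt}, {∇Yt}, and {∇12 Yt}; only the seasonally differenced series should look like a stationary MA(1)12, with a lone ACF spike at lag 12.

# Sketch: simulate Y_t = S_t + e_t, where S_t = S_{t-12} + u_t is a seasonal
# random walk, and compare nonseasonal vs. seasonal differencing
set.seed(520)
n <- 240
u <- rnorm(n); e <- rnorm(n)
S <- numeric(n)
S[1:12] <- u[1:12]                      # start each "month" at its first shock
for (t in 13:n) S[t] <- S[t - 12] + u[t]
Y <- ts(S + e, frequency = 12)
par(mfrow = c(1, 3))
acf(Y, lag.max = 48, main = "Y")                             # nonstationary
acf(diff(Y), lag.max = 48, main = "First differences")       # still nonstationary
acf(diff(Y, lag = 12), lag.max = 48, main = "Seasonal differences")  # spike at 12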
Figure 10.13: United States milk production data. Upper left: Original series {Yt }.
Upper right: First (nonseasonal) differences ∇Yt = Yt − Yt−1 . Lower left: First (seasonal)
differences ∇12 Yt = Yt − Yt−12 . Lower right: Combined first (seasonal and nonseasonal)
differences ∇∇12 Yt .
Example 10.5. Consider the monthly U.S. milk production data from Example 10.1.
Figure 10.13 (last page) displays the time series plot of the data (upper left), the first
difference process ∇Yt (upper right), the first seasonal difference process ∇12 Yt (lower
left), and the combined difference process ∇∇12 Yt (lower right). The combined difference
process ∇∇12 Yt is given by

∇∇12 Yt = (1 − B)(1 − B^12)Yt = (1 − B − B^12 + B^13)Yt = Yt − Yt−1 − Yt−12 + Yt−13.
• The milk series (Figure 10.13; upper left) displays two trends: nonstationarity over
time and a within-year seasonal pattern. A Box-Cox analysis (results not shown)
suggests that no transformation is necessary for variance stabilization purposes.
• Taking first (nonseasonal) differences, i.e., computing ∇Yt (Figure 10.13; upper right), has removed the upward linear trend (as expected), but the process {∇Yt} still displays notable seasonality.
• Taking first (seasonal) differences, i.e., computing ∇12 Yt (Figure 10.13; lower left), has seemingly removed the seasonality (as expected), but the process {∇12 Yt} still displays strong momentum over time.
– The sample ACF of {∇12 Yt } (not shown) displays a slow decay, a sign of
nonstationarity over time.
• The combined first differences ∇∇12 Yt (Figure 10.13; lower right) appear to resemble a stationary process (at least in the mean level); see the R sketch below.
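As a check, the four panels of Figure 10.13 can be reproduced with the diff function; this sketch assumes the milk series is available as an object named milk (e.g., loaded from the TSA package that accompanies CC).

# Sketch: the milk series and its differenced versions (cf. Figure 10.13)
library(TSA)                     # assumed source of the milk data
data(milk)
par(mfrow = c(2, 2))
plot(milk, ylab = 'Amount of milk produced')
plot(diff(milk), ylab = 'First differences')
plot(diff(milk, lag = 12), ylab = 'First seasonal differences (s=12)')
plot(diff(diff(milk, lag = 12)), ylab = 'Combined differences')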
REMARK : From this example, it should be clear that we can now extend the multiplica-
tive seasonal (stationary) ARMA(p, q) × ARMA(P, Q)s model

ϕ(B)ΦP(B^s)Yt = θ(B)ΘQ(B^s)et

to incorporate the two types of nonstationarity: nonseasonal and seasonal. This leads to
the definition of our largest class of ARIMA models.
TERMINOLOGY : Suppose that {et } is zero mean white noise with var(et ) = σe2 . The
multiplicative seasonal autoregressive integrated moving average (SARIMA)
model with seasonal period s, denoted by ARIMA(p, d, q) × ARIMA(P, D, Q)s , is
ϕ(B)ΦP(B^s) ∇^d ∇s^D Yt = θ(B)ΘQ(B^s)et,

where

ϕ(B) = 1 − ϕ1 B − ϕ2 B^2 − · · · − ϕp B^p,
θ(B) = 1 − θ1 B − θ2 B^2 − · · · − θq B^q,
ΦP(B^s) = 1 − Φ1 B^s − Φ2 B^2s − · · · − ΦP B^Ps,
ΘQ(B^s) = 1 − Θ1 B^s − Θ2 B^2s − · · · − ΘQ B^Qs,

and

∇^d ∇s^D Yt = (1 − B)^d (1 − B^s)^D Yt.
In this model, d and D denote the orders of nonseasonal and seasonal differencing, respectively.
• For many nonstationary seasonal time series data sets (at least for the ones I have
seen), the most common choice for (d, D) is (1, 1).
The SARIMA class is very flexible. Many time series can be adequately fit by these models, usually with a small number of parameters, often fewer than five.
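In R, an ARIMA(p, d, q) × ARIMA(P, D, Q)s model maps directly onto the order and seasonal arguments of arima. The sketch below illustrates the correspondence, for concreteness only, with R's built-in AirPassengers series and the ARIMA(0, 1, 1) × ARIMA(0, 1, 1)12 "airline" model; this example is not from the text.

# Sketch: generic SARIMA fit in R
# order    = c(p, d, q) gives the nonseasonal AR, differencing, and MA orders;
# seasonal = list(order = c(P, D, Q), period = s) gives the seasonal counterparts
fit <- arima(log(AirPassengers),
             order    = c(0, 1, 1),                            # (p, d, q)
             seasonal = list(order = c(0, 1, 1), period = 12), # (P, D, Q)_s
             method   = 'ML')
fit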
Figure 10.14: United States milk production data. Left: Sample ACF for {∇∇12 Yt }.
Right: Sample PACF for {∇∇12 Yt }.
Example 10.5 (continued). For the milk production data in Example 10.1, we have
seen that the combined difference process {∇∇12 Yt } looks to be relatively stationary.
In Figure 10.14, we display the sample ACF (left) and sample PACF (right) of the
{∇∇12 Yt } process. Examining these two plots will help us identify which ARMA(p, q) ×
ARMA(P, Q)12 model is appropriate for {∇∇12 Yt }.
• The sample ACF for {∇∇12 Yt } has a pronounced spike at seasonal lag k = 12 and
one at k = 48 (but none at k = 24 and k = 36).
• The sample PACF for {∇∇12 Yt } displays pronounced spikes at seasonal lags k =
12, 24 and 36.
• The last two observations are consistent with the following choices:
– (P, Q) = (0, 1), if one is willing to ignore the ACF at k = 48. Also, if (P, Q) = (0, 1), we would expect the PACF to decay at lags k = 12, 24, and 36. There is actually not that much of a decay.
– (P, Q) = (3, 0), if one is willing to place strong emphasis on the sample PACF.
• There does not appear to be “anything happening” around seasonal lags in the
ACF, and the ACF at k = 1 is borderline. We therefore take p = 0 and q = 0.
• I have carefully examined both models. The AR(3)12 model provides a much better
fit to the {∇∇12 Yt } process than the MA(1)12 model.
– The AR(3)12 model for {∇∇12 Yt} provides a smaller AIC, a smaller estimate of the white noise variance, and superior residual diagnostics; e.g., the Ljung-Box test strongly rejects the adequacy of the MA(1)12 model for {∇∇12 Yt} at all lags. A sketch of this comparison is given below.
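A sketch of the comparison (the alternative-model object name is illustrative): fit the competing ARIMA(0, 1, 0) × ARIMA(0, 1, 1)12 model and compare its AIC, white noise variance estimate, and Ljung-Box statistics with those of the AR(3)12 fit reported next.

# Sketch: the competing MA(1)_12 model for the combined differences
milk.arima010.arima011 <-
  arima(milk, order = c(0, 1, 0), method = 'ML',
        seasonal = list(order = c(0, 1, 1), period = 12))
milk.arima010.arima011$aic      # compare with aic = 1030.05 for the AR(3)_12 fit
milk.arima010.arima011$sigma2   # compare white noise variance estimates
Box.test(residuals(milk.arima010.arima011), lag = 24,
         type = "Ljung-Box", fitdf = 1)   # residual autocorrelation check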
> milk.arima010.arima310 =
arima(milk,order=c(0,1,0),method=’ML’,seasonal=list(order=c(3,1,0),period=12))
> milk.arima010.arima310
Coefficients:
sar1 sar2 sar3
-0.9133 -0.8146 -0.6002
s.e. 0.0696 0.0776 0.0688
sigma^2 estimated as 121.4: log likelihood = -512.03, aic = 1030.05
The white noise variance estimate is σ̂e^2 ≈ 121.4. Note that all parameter estimates (Φ̂1, Φ̂2, and Φ̂3) are statistically different from zero (by a very large amount); e.g., the z ratio for Φ̂1 is −0.9133/0.0696 ≈ −13.1.
Figure 10.15: United States milk production data. Standardized residuals from
ARIMA(0, 1, 0) × ARIMA(3, 1, 0)12 model fit.
Finally, the tsdiag output in Figure 10.16 supports the ARIMA(0, 1, 0)×ARIMA(3, 1, 0)12
model choice.
Figure 10.16: United States milk production data. ARIMA(0, 1, 0) × ARIMA(3, 1, 0)12
tsdiag output.
CONCLUSION : The ARIMA(0, 1, 0) × ARIMA(3, 1, 0)12 model does a good job at de-
scribing the U.S. milk production data. With this model, we move forward with fore-
casting future observations.
FORECASTING: We use R to compute forecasts and prediction limits for the lead times
l = 1, 2, ..., 24 (two years ahead) based on the ARIMA(0, 1, 0) × ARIMA(3, 1, 0)12 model
fit. Here are the estimated MMSE forecasts and 95 percent prediction limits:
# MMSE forecasts
> milk.arima010.arima310.predict <- predict(milk.arima010.arima310,n.ahead=24)
> round(milk.arima010.arima310.predict$pred,3)
Jan Feb Mar Apr May Jun Jul Aug Sep
2006 1702.409 1584.302 1760.356 1728.246 1783.487 1698.330 1694.116 1680.528 1610.895
2007 1725.769 1608.022 1775.653 1742.424 1792.538 1715.007 1717.981 1695.297 1631.562
Oct Nov Dec
2006 1655.054 1610.777 1689.084
2007 1679.871 1634.033 1712.183
> round(milk.arima010.arima310.predict$se,3)
Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
2006 11.018 15.581 19.083 22.035 24.636 26.988 29.150 31.162 33.053 34.841 36.541 38.166
2007 40.000 41.753 43.436 45.056 46.620 48.132 49.599 51.024 52.410 53.760 55.077 56.363
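The 95 percent prediction limits plotted in Figure 10.17 are of the form forecast ± 1.96 × (standard error); a minimal sketch using the predict output above:

# Sketch: 95% prediction limits from the predict() output
pred <- milk.arima010.arima310.predict$pred
se   <- milk.arima010.arima310.predict$se
lower <- pred - 1.96 * se
upper <- pred + 1.96 * se
round(cbind(lower, pred, upper), 1)
# e.g., for 1/2006: 1702.409 -/+ 1.96(11.018), i.e., roughly (1680.8, 1724.0)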
• In Figure 10.17, we display the U.S. milk production data. The full data set is from
1/94 to 12/05 (one observation per month). However, to emphasize the MMSE
forecasts in the plot, we start the series at month 1/04.
Figure 10.17: U.S. milk production data. The full data set is from 1/1994-12/2005. This
figure starts the series at 1/2004. ARIMA(0, 1, 0) × ARIMA(3, 1, 0)12 estimated MMSE
forecasts and 95 percent prediction limits are given for lead times l = 1, 2, ..., 24. These
lead times correspond to years 1/2006-12/2007.
• With l = 1, 2, ..., 24, the estimated MMSE forecasts in the predict output and in
Figure 10.17 start at 1/06 and end in 12/07 (24 months).
• Numerical values of the 95 percent prediction intervals are given for 1/06-12/06 in
the prediction interval output. Note how the interval lengths increase as l does.
This is a byproduct of nonstationarity. In Figure 10.17, the impact of nonsta-
tionarity is also easily seen as l increases (prediction limits become wider).
NOTE : Although we did not state so explicitly, determining MMSE forecasts and predic-
tion limits for seasonal models is exactly analogous to the nonseasonal cases we studied
in Chapter 9. Formulae for seasonal MMSE forecasts are given in Section 10.5 (CC) for
special cases.
Figure 10.18: Australian clay brick production data. Number of bricks (in millions)
produced from 1956-1994.
Example 10.6. In this example, we revisit the Australian brick production data in
Example 1.14 (pp 15, notes). The data in Figure 10.18 represent the number of bricks
produced in Australia (in millions) during 1956-1994. The data are quarterly, so the
underlying seasonal lag of interest is s = 4.
• Using the BoxCox.ar function in R (output not shown) suggests that the Box-Cox transformation parameter λ ≈ 0.5; i.e., a square-root transformation (see the sketch below).

• We now examine the transformed data and the relevant differenced series.
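A sketch of the transformation step, assuming the quarterly production series is stored in an object named brick (the name is illustrative): BoxCox.ar from the TSA package produces the profile log-likelihood for λ, and λ ≈ 0.5 corresponds to a square-root transformation.

# Sketch: Box-Cox analysis and square-root transformation of the brick series
library(TSA)
BoxCox.ar(brick)            # profile likelihood for lambda; peaks near 0.5
sqrt.brick <- sqrt(brick)   # lambda = 0.5 <=> square-root transformation
plot(sqrt.brick, ylab = 'Brick production (square-root transformed)')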
Figure 10.19: Australian clay brick production data (square-root transformed). Upper
left: Original series {Yt }. Upper right: First (nonseasonal) differences ∇Yt = Yt − Yt−1 .
Lower left: First (seasonal) differences ∇4 Yt = Yt − Yt−4 . Lower right: Combined first
(seasonal and nonseasonal) differences ∇∇4 Yt .
Figure 10.20: Australian clay brick production data (square-root transformed). Left:
Sample ACF for {∇∇4 Yt }. Right: Sample PACF for {∇∇4 Yt }.
NOTE : The combined difference process ∇∇4 Yt in Figure 10.19 looks stationary in the
mean level. The sample ACF/PACF for the ∇∇4 Yt series is given in Figure 10.20. Recall
that our analysis is now on the square-root transformed scale.
ANALYSIS : Examining the sample ACF/PACF for the ∇∇4 Yt data does not lead us to
one single model as a “clear favorite.” In fact, there are ambiguities that emerge; e.g., a
spike in the ACF at lag k = 25 (this is not a seasonal lag), a spike in the PACF at the
seventh seasonal lag k = 28, etc.
• The PACF does display spikes at the first 4 seasonal lags k = 4, k = 8, k = 12,
and k = 16.
• The ACF does not display consistent “action” around these seasonal lags in either
direction.
• These two observations lead us to tentatively consider an AR(4)4 model for the
combined difference process {∇∇4 Yt }; i.e., an ARIMA(0, 1, 0) × ARIMA(4, 1, 0)4
for the square-root transformed series.
Figure 10.21: Australian clay brick production data (square-root transformed). Stan-
dardized residuals from ARIMA(0, 1, 0) × ARIMA(4, 1, 0)4 model fit.
> sqrt.brick.arima010.arima410 =
arima(sqrt.brick,order=c(0,1,0),method=’ML’,seasonal=list(order=c(4,1,0),period=4))
> sqrt.brick.arima010.arima410
Coefficients:
sar1 sar2 sar3 sar4
-0.8249 -0.8390 -0.5330 -0.3290
s.e. 0.0780 0.0935 0.0936 0.0772
sigma^2 estimated as 0.2889: log likelihood = -122.47, aic = 252.94
The white noise variance estimate is σ̂e^2 ≈ 0.2889. Note that all parameter estimates (Φ̂1, Φ̂2, Φ̂3, and Φ̂4) are statistically different from zero (by a very large amount).
Figure 10.22: Australian clay brick production data (square-root transformed). ARIMA(0, 1, 0) × ARIMA(4, 1, 0)4 tsdiag output.
DIAGNOSTICS : The tsdiag output in Figure 10.22 does not strongly refute the
ARIMA(0, 1, 0) × ARIMA(4, 1, 0)4 model choice, and overfitting (results not shown) does
not lead us to consider a higher order model. However, the qq plot of the standardized
residuals in Figure 10.21 reveals major problems with the normality assumption, and the
Shapiro-Wilk test strongly rejects normality (p-value < 0.0001).
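A sketch of these normality checks, using the fit object from the previous page and the rstandard method for arima fits provided by the TSA package:

# Sketch: assess normality of the standardized residuals (cf. Figure 10.21)
resid.std <- rstandard(sqrt.brick.arima010.arima410)
hist(resid.std, main = '', xlab = 'Standardized residuals')
qqnorm(resid.std); qqline(resid.std)   # heavy tails / outliers show up here
shapiro.test(resid.std)                # small p-value ==> reject normality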
CONCLUSION : The ARIMA(0, 1, 0) × ARIMA(4, 1, 0)4 model for the Australian brick
production data (square-root transformed) is not completely worthless, but I would hes-
itate to use this model for forecasting purposes (since the normality assumption is so
grossly violated). The search for a better model should continue!
DISCUSSION : In this course, we have covered the first 10 chapters of Cryer and Chan
(2008). This material provides you with a powerful arsenal of techniques to analyze many
time series data sets that are seen in practice. These chapters also lay the foundation for
further study in time series analysis.
• Chapter 12. This chapter deals explicitly with modeling financial time series data (e.g., stock prices, portfolio returns, etc.), mainly with the commonly used ARCH and GARCH models. The key feature of these models is that they incorporate the conditional heteroscedasticity (time-varying volatility) that is common in financial data.
• Chapter 13. This chapter deals with frequency domain methods (spectral
analysis) for periodic data which arise in physics, biomedicine, engineering, etc.
The periodogram and spectral density are introduced. These methods use linear
combinations of sine and cosine functions to model underlying (possibly multiple)
frequencies.
• Chapter 14. This chapter is an extension of Chapter 13 which studies the sampling
characteristics of the spectral density estimator.
• Chapter 15. This chapter discusses nonlinear models for time series data. This class of models allows the current value of the series to be a nonlinear function of past observations; such nonlinearity often manifests itself through nonnormal behavior.