1 Basic Concepts
1.1 Introduction
Time series analysis refers to statistical theories and methods used to analyze data
indexed by time.
In time series analysis, the term time series has two related meanings:
(i) A data set, actual or simulated, of observations indexed by time.
(ii) A stochastic process (see Definition 1.3) used to model observed time series data.
For clarity, we may use time series data for (i) and time series model for (ii). We
typically denote random variables and stochastic processes in uppercase (e.g., Xt ),
and observed/realized values in lowercase (e.g., xt ). That time is linearly ordered and
directed is crucial in the statistical analysis of time series data (as well as in real life).
It is essential to gain experience by examining a wide variety of real data-sets. We
give two examples to illustrate some general ideas and terminology.
Example 1.1. The S&P/TSX Composite Index represents the overall performance of the stocks listed on the Toronto Stock Exchange (TSX).¹ We consider the daily closing value of the index, which is available on, say, Yahoo Finance² and can be downloaded either directly from Yahoo Finance or through R using the function getSymbols() in the package quantmod. Familiarity with the underlying subject matter, or domain knowledge, is essential to meaningful analysis of the data.
[Figure 1.1: Left: Daily closing values of the S&P/TSX Composite Index. Right: Daily logarithmic returns. Horizontal axes: Date, 2000–2025.]
It is always a good idea to plot the data to examine its features. In Figure 1.1(left)
we plot the daily closing values from 2000-01-04 to 2024-06-14. We say that the
frequency is daily. Other common frequencies (at least for economic data) are weekly,
monthly, quarterly and yearly. While we plot the data in calendar time, the data is indexed by trading day, which skips weekends and holidays. For example, 2000-01-07 is a Friday and the next trading day is 2000-01-10 (Monday). If we represent the series by $y_t$, $t = 1, 2, \ldots, T$, where $T$ is the length of the series, we interpret $t$ as the $t$-th trading day of the data-set.

¹ More precisely, it is a capitalization-weighted market index where the influence of a constituent stock is proportional to its market capitalization.
² See https://ca.finance.yahoo.com/. The ticker symbol (unique identifier) of the S&P/TSX Composite index is ^GSPTSE.
We say that this time series is in discrete time since the time index $t$ takes values in a discrete set (here $t \in \mathbb{T} = \{1, \ldots, T\}$). For visualization purposes we often interpolate linearly between the data points, as done in the figure. Some processes, such as high-frequency trading, speech, brain activity and weather, may be considered to occur in continuous time (say $t \in \mathbb{T} = [a, b] \subset \mathbb{R}$ for some interval), although for practical
reasons they are usually sampled in discrete time.
In Figure 1.1(left), we observe that the series $y_t$ evolves in a zig-zag manner but
exhibits an overall increasing trend (in finance, a key question is the trade-off between
reward and risk). It is often useful to transform the data to reveal hidden structures
and/or make it more suitable for statistical analysis; some techniques are discussed in
Section 2. In Figure 1.1(right), we consider the daily logarithmic return (which forms another time series) given by
$$\nabla \log y_t = \log y_t - \log y_{t-1},$$
where $\nabla$, defined by
$$\nabla x_t = x_t - x_{t-1}, \tag{1.1}$$
is the difference operator. This transformation removes the apparent trend in the data: the log returns are roughly symmetric about zero. Also observe that there are
periods of high volatility (e.g., in 2008 (financial crisis) and 2020 (COVID-19)) and
they seem to cluster (rather than spread evenly across time). This phenomenon is
called volatility clustering and is observed in many financial time series.
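As an illustration of this workflow, the following R sketch downloads the index with quantmod and computes the log returns plotted in Figure 1.1(right). It assumes the quantmod package is installed and that Yahoo Finance is reachable; the date range matches the one used above.

```r
# Minimal sketch: download the S&P/TSX Composite Index and plot its log returns.
library(quantmod)

getSymbols("^GSPTSE", src = "yahoo", from = "2000-01-01", to = "2024-06-14")

px     <- Cl(GSPTSE)                 # daily closing values (an xts object)
logret <- na.omit(diff(log(px)))     # daily log returns: log(y_t) - log(y_{t-1})

op <- par(mfrow = c(1, 2))
plot(index(px), as.numeric(px), type = "l", xlab = "Date", ylab = "Closing value")
plot(index(logret), as.numeric(logret), type = "l", xlab = "Date", ylab = "Log return")
par(op)
```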
Example 1.2. We consider Canadian climate data which is publicly available on https://climate.weather.gc.ca/. Here, we consider the daily mean temperature, in degrees Celsius, recorded at a meteorological station (id: 6158731) within the Toronto Pearson International Airport (YYZ). (The concept “temperature of Toronto” is useful
in daily life but is not very specific. An observation must be recorded sometime
somewhere.) The data set is plotted in Figure 1.2. We observe immediately a strong
seasonal pattern (i.e., periodic behaviour) whose fluctuations differ from year to year.
[Figure 1.2: Daily mean temperature (degrees Celsius) at Toronto Pearson International Airport. Horizontal axis: Date.]
A raw data-set must be suitably cleaned and preprocessed before formal statistical
analysis begins. For example, there may be errors in data entry possibly revealed by
values which are unreasonably large or small. The series of interest may be derived
from other data. Here, the daily mean temperature is derived from the daily maximum
and minimum temperatures. In this data-set there are 21 missing values. Missing values may be treated by various methods, often on a case-by-case basis. Since here the number of missing values is small (21 in 4013 data points), a reasonable approach is to estimate a missing value $y_t$ by averaging nearby values, say $\frac{1}{2} y_{t-1} + \frac{1}{2} y_{t+1}$. In other cases missing, or unobserved, values are important and may be accounted for by a suitable model. It is concerning if the results of an analysis are sensitive to the
conventions used.
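As a concrete illustration of the averaging rule above, here is a minimal R sketch. The function name and the toy vector are hypothetical, and the sketch assumes the missing values are isolated (no two consecutive) and not at the endpoints.

```r
# Impute isolated missing values by averaging the neighbouring observations:
# y_t <- (y_{t-1} + y_{t+1}) / 2.
impute_isolated_na <- function(y) {
  idx <- which(is.na(y))
  for (t in idx) {
    y[t] <- (y[t - 1] + y[t + 1]) / 2
  }
  y
}

y <- c(3.1, 2.8, NA, 4.0, 5.2)   # hypothetical daily mean temperatures
impute_isolated_na(y)            # the NA is replaced by (2.8 + 4.0) / 2 = 3.4
```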
The key idea of time series analysis is to regard an observed time series (possibly
after a suitable transformation) as the realized value of a stochastic process. Statistical
methods are used to infer the properties of the process. We recall here the general
definition of a stochastic process.
Definition 1.3. Let T ⊂ R be a nonempty index set.
(i) (Stochastic process) A stochastic process indexed by T is a collection (Xt )t∈T of
random variables defined on some probability space (Ω, F, P). We simply write
(Xt ) if T is clear from the context. We say that (Xt ) is d-dimensional if each
Xt = (X1,t , . . . , Xd,t ) is a d-dimensional random vector.
(ii) (Sample path) If $(X_t)_{t\in\mathbb{T}}$ is a stochastic process, the sample path corresponding to the sample point $\omega \in \Omega$ is the function $t \in \mathbb{T} \mapsto X_t(\omega)$.
For the most part we focus on univariate processes ($d = 1$) in discrete time, where $\mathbb{T}$ is an interval of $\mathbb{Z}$ (e.g., $\mathbb{T} = \mathbb{Z}_+ = \{0, 1, 2, \ldots\}$ or $\mathbb{Z}$). A distinctive feature of
time series is that successive data points are typically dependent, so we cannot regard
them as i.i.d. samples as in “conventional statistics”. In a nutshell, time series analysis
is concerned with the modelling of dependence for stochastic processes.
Figure 1.3: Left: A sample path of Gaussian white noise $X_t \overset{\mathrm{i.i.d.}}{\sim} N(0, 1)$. Right: Ten sample paths of a Gaussian random walk where $X_t \overset{\mathrm{i.i.d.}}{\sim} N(0, 1)$.
An i.i.d. process with mean zero and finite variance $\sigma^2$, written $(X_t) \sim \mathrm{IID}(0, \sigma^2)$, is in particular a white noise process, written $(X_t) \sim \mathrm{WN}(0, \sigma^2)$: a process of uncorrelated random variables with mean zero and common variance $\sigma^2$. The converse is generally false but holds if $(X_t)$ is a Gaussian process, i.e., if $(X_{t_1}, \ldots, X_{t_k})$ has a multivariate normal distribution for all $k \geq 1$ and $t_1, \ldots, t_k \in \mathbb{T}$.
Example 1.6 (A white noise which is not i.i.d.). Let $R \geq 0$ be a non-negative, non-constant random variable with $E[R^2] = \sigma^2$, and let $(Z_t)$ be an i.i.d. process, independent of $R$, such that $P(Z_t = 1) = P(Z_t = -1) = \frac{1}{2}$. Now define
$$X_t = R Z_t.$$
By independence, we have
$$E[X_t] = E[R]\,E[Z_t] = 0.$$
Next compute, for $s \neq t$,
$$\mathrm{Cov}(X_s, X_t) = E[R^2 Z_s Z_t] = E[R^2]\,E[Z_s]\,E[Z_t] = 0,$$
while $\mathrm{Var}(X_t) = E[R^2 Z_t^2] = E[R^2] = \sigma^2$. Thus $(X_t) \sim \mathrm{WN}(0, \sigma^2)$. However, $(X_t)$ is not an i.i.d. process: $|X_t| = R$ for all $t$, so, since $R$ is non-constant, $X_s$ and $X_t$ are dependent.

Example 1.7 (Random walk). Given a starting value $S_0$ (e.g., $S_0 = 0$), define
$$S_t = S_0 + X_1 + \cdots + X_t, \quad t = 1, 2, \ldots, \tag{1.3}$$
where $(X_t)_{t=1}^{\infty}$ is a collection of i.i.d. random variables (e.g., $(X_t) \sim \mathrm{IID}(0, \sigma^2)$). See Figure 1.3(right) for an illustration. Assuming $X_t$ has finite variance, we may write $X_t = \mu + \sigma \epsilon_t$, where $\mu = E[X_t]$, $\sigma^2 = \mathrm{Var}(X_t)$ and $(\epsilon_t) \sim \mathrm{IID}(0, 1)$. Then
$$E[S_t] = S_0 + \mu t, \tag{1.4}$$
which has a linear trend.
[Figure 1.4: A sample path of an MA(3) process.]
We may recover $(X_t)$ from $(S_t)$ by taking the first difference:
$$\nabla S_t = S_t - S_{t-1} = X_t = \mu + \sigma \epsilon_t. \tag{1.5}$$
Here differencing removes the trend. Observe that the TSX series in Figure 1.1(left)
looks qualitatively somewhat similar to a random walk if we neglect the big jumps
and volatility clustering.
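Sample paths like those in Figure 1.3 are easy to simulate; a minimal R sketch (300 time points and standard normal noise, matching the setup described in the figure caption):

```r
# Simulate a Gaussian white noise path and a Gaussian random walk (cf. Figure 1.3).
set.seed(1)
n <- 300
x <- rnorm(n)       # X_t ~ i.i.d. N(0, 1)
s <- cumsum(x)      # S_t = X_1 + ... + X_t, a random walk with S_0 = 0

op <- par(mfrow = c(1, 2))
plot(x, type = "l", xlab = "t", ylab = expression(X[t]), main = "Gaussian white noise")
plot(s, type = "l", xlab = "t", ylab = expression(S[t]), main = "Gaussian random walk")
par(op)
```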
Example 1.8 (Moving average process). Let $(Z_t)_{t\in\mathbb{Z}} \sim \mathrm{WN}(0, \sigma^2)$. Given a constant $\theta \in \mathbb{R}$, define $(X_t)_{t\in\mathbb{Z}}$ by
$$X_t = Z_t + \theta Z_{t-1}, \quad t \in \mathbb{Z}. \tag{1.6}$$
More generally, given an integer $q \geq 1$ and $\theta_1, \ldots, \theta_q \in \mathbb{R}$, we may define $(X_t)_{t\in\mathbb{Z}}$ by
$$X_t = Z_t + \theta_1 Z_{t-1} + \cdots + \theta_q Z_{t-q}, \quad t \in \mathbb{Z}. \tag{1.7}$$
If $\theta_q \neq 0$, we call $(X_t)$ defined by (1.7) a moving average process of order $q$, or simply an MA($q$) process. Thus (1.6) defines an MA(1) process if $\theta \neq 0$. The idea is that $X_t$ is given as a weighted sum of the current noise $Z_t$ and up to $q$ previous noises $Z_{t-1}, \ldots, Z_{t-q}$. See Figure 1.4 for an example.
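A path like the one in Figure 1.4 can be simulated directly from the definition (1.7); a minimal R sketch with arbitrary illustrative coefficients:

```r
# Simulate an MA(3) process X_t = Z_t + theta1*Z_{t-1} + theta2*Z_{t-2} + theta3*Z_{t-3}.
set.seed(2)
n     <- 300
theta <- c(0.8, 0.5, 0.3)          # arbitrary illustrative MA coefficients
z     <- rnorm(n + length(theta))  # Gaussian white noise, with extra initial values

x <- numeric(n)
for (t in seq_len(n)) {
  k    <- t + length(theta)        # index into z so that all lags are available
  x[t] <- z[k] + sum(theta * z[(k - 1):(k - 3)])
}
plot(x, type = "l", main = "A sample path of an MA(3) process",
     xlab = "t", ylab = expression(X[t]))

# Equivalently: x <- arima.sim(model = list(ma = theta), n = n)
```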
Unlike a white noise process, successive values of a moving average process are
correlated, up to lag q. For example, for an MA(1) process (1.6) we have, for any t,
$$\begin{aligned}
\mathrm{Cov}(X_{t+1}, X_t) &= \mathrm{Cov}(Z_{t+1} + \theta Z_t,\; Z_t + \theta Z_{t-1}) \\
&= \mathrm{Cov}(Z_{t+1}, Z_t) + \theta\,\mathrm{Cov}(Z_{t+1}, Z_{t-1}) + \theta\,\mathrm{Cov}(Z_t, Z_t) + \theta^2\,\mathrm{Cov}(Z_t, Z_{t-1}) \\
&= 0 + \theta \cdot 0 + \theta \cdot \sigma^2 + \theta^2 \cdot 0 = \theta\sigma^2,
\end{aligned}$$
and, by a similar argument,
Cov(Xt+2 , Xt ) = Cov(Xt+3 , Xt ) = · · · = 0.
Also, we have
$$\mathrm{Var}(X_t) = \mathrm{Cov}(Z_t + \theta Z_{t-1},\; Z_t + \theta Z_{t-1}) = (1 + \theta^2)\sigma^2.$$
In summary,
$$\gamma_X(t+h, t) = \mathrm{Cov}(X_{t+h}, X_t) = \begin{cases} (1 + \theta^2)\sigma^2 & \text{if } h = 0; \\ \theta\sigma^2 & \text{if } |h| = 1; \\ 0 & \text{if } |h| \geq 2, \end{cases} \tag{1.8}$$
which does not depend on $t$.
1.3 Stationarity
In time series analysis, we often observe only a single series (of some length) regarded
as the realized sample path of a stochastic process. Without some assumption of
stationarity, statistical inference is not possible. For example, suppose we observe one
sample (x1 , . . . , xT ) from N (a, I) where a = (a1 , . . . , aT ) ∈ RT is arbitrary. Regardless
of the length T we cannot expect to estimate a accurately. A key idea of time series
is that the underlying process is built up from some stationary process for which
statistical inference is possible. In the following we let the index set T be either Z+
or Z.
Definition 1.9. A stochastic process (Xt )t∈T is strictly stationary if for all k ≥ 1,
distinct t1 , . . . , tk , and h, we have
d
(Xt1 , . . . , Xtk ) = (Xt1 +h , . . . , Xtk +h ).
That is, the finite dimensional distributions of (Xt ) are invariant under time shifts.
Example 1.10 (Simple examples of strictly stationary processes).
(i) Any i.i.d. process is strictly stationary.
(ii) Let (Xt )t∈Z+ be a time homogeneous Markov chain. Suppose X0 ∼ π where π
is stationary for the chain, i.e., if X0 ∼ π then Xt ∼ π for all t. Then (Xt )
is strictly stationary. To give a concrete example, suppose the state space is
X = {0, 1} and the transition matrix P (x, y) = P(Xt+1 = y|Xt = x), x, y ∈ X ,
is given by
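To make construction (ii) concrete, here is a minimal R sketch that uses a hypothetical two-state transition matrix (an arbitrary choice for illustration, not necessarily the matrix displayed in the notes), computes its stationary distribution and simulates the chain started from it:

```r
# Simulate a two-state Markov chain started from its stationary distribution.
set.seed(5)
P <- matrix(c(0.9, 0.1,
              0.3, 0.7), nrow = 2, byrow = TRUE)  # P[x+1, y+1] = P(X_{t+1} = y | X_t = x)

# Stationary distribution: solve pi P = pi, i.e., a left eigenvector of P for eigenvalue 1.
e       <- eigen(t(P))
pi_stat <- Re(e$vectors[, 1])
pi_stat <- pi_stat / sum(pi_stat)                 # for this P, pi_stat = (0.75, 0.25)

# Simulate the chain with X_0 ~ pi_stat; the resulting process is strictly stationary.
n <- 1000
x <- numeric(n + 1)
x[1] <- sample(0:1, 1, prob = pi_stat)
for (t in 1:n) x[t + 1] <- sample(0:1, 1, prob = P[x[t] + 1, ])
```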
Definition 1.11 (Weak stationarity). Let $(X_t)_{t\in\mathbb{T}}$ be a stochastic process with $E[X_t^2] < \infty$ for all $t$, with its mean function defined by $\mu_X(t) = E[X_t]$ and its autocovariance function defined by
$$\gamma_X(s, t) = \mathrm{Cov}(X_s, X_t), \quad s, t \in \mathbb{T}.$$
We say that $(X_t)$ is weakly stationary if
(i) $\mu_X(t)$ does not depend on $t$; and
(ii) $\gamma_X(s, t) = \gamma_X(s + h, t + h)$ for all $s, t$ and $h$.

Proposition 1.12. (i) A strictly stationary process with finite second moments is weakly stationary. (ii) A weakly stationary Gaussian process is strictly stationary.
Proof. The first statement is clear. The second statement holds since a (multivariate)
normal distribution is completely specified by the mean and covariance matrix.
From now on, by stationarity we mean weak stationarity unless otherwise specified.
In (ii) above, replace $t$ by $s + h$. So (ii) is equivalent to
$$\gamma_X(s, s + h) = \gamma_X(t, t + h) \quad \text{for all } s, t \text{ and } h.$$
That is, the covariance function depends only on the lag $h$. When $h = 0$,
$$\gamma_X(t, t) = \mathrm{Var}(X_t)$$
is the common variance of the $X_t$.
Definition 1.13. Let $(X_t)_{t\in\mathbb{T}}$, $\mathbb{T} = \mathbb{Z}_+$ or $\mathbb{Z}$, be a stationary process.
(i) (Autocovariance function) The autocovariance function (ACVF) of $X$ is defined by
$$\gamma_X(h) = \mathrm{Cov}(X_t, X_{t+h}), \quad h \in \mathbb{Z}. \tag{1.12}$$
(ii) (Autocorrelation function) The autocorrelation function (ACF) of $X$ is defined (when $\gamma_X(0) > 0$) by
$$\rho_X(h) = \mathrm{Cor}(X_t, X_{t+h}) = \frac{\mathrm{Cov}(X_t, X_{t+h})}{\sqrt{\mathrm{Var}(X_t)}\,\sqrt{\mathrm{Var}(X_{t+h})}} = \frac{\gamma_X(h)}{\gamma_X(0)}, \quad h \in \mathbb{Z}. \tag{1.13}$$
Remark 1.14. The autocovariance $\gamma_X(h) = \mathrm{Cov}(X_t, X_{t+h})$ only measures linear dependence between $X_t$ and $X_{t+h}$. $X_t$ and $X_{t+h}$ can be dependent even if $\mathrm{Cov}(X_t, X_{t+h}) = 0$.
Let $X$ and $Y$ be real-valued random variables. It can be shown that $X$ and $Y$ are independent if and only if
$$E[f(X)g(Y)] = E[f(X)]\,E[g(Y)] \tag{1.14}$$
for all (bounded) functions $f$ and $g$. Uncorrelatedness of $X$ and $Y$ only requires (1.14) to hold when $f$ and $g$ are affine functions (i.e., functions of the form $ax + b$).
Proposition 1.15 (Properties of ACVF and ACF). Let (Xt ) be a stationary process.
Then:
(i) (Normalization) ρX (0) = 1.
(ii) (Symmetry) γX (h) = γX (−h) and ρX (h) = ρX (−h). (Thus in (1.12) and (1.13)
we may restrict to h ≥ 0.)
(iii) (Positive semidefiniteness) For $k \geq 1$, lags $h_1, \ldots, h_k$ and constants $a_1, \ldots, a_k \in \mathbb{R}$, we have
$$\sum_{i,j=1}^{k} a_i a_j \gamma_X(h_i - h_j) \geq 0. \tag{1.15}$$
Proof. (i) Obvious since ρX (0) = Cor(Xt , Xt ) = 1.
(ii) By symmetry of the covariance, we have
$$\gamma_X(h) = \mathrm{Cov}(X_t, X_{t+h}) = \mathrm{Cov}(X_{t+h}, X_t) = \gamma_X(-h).$$
(iii) Observe that the left hand side of (1.15) is the variance of the linear combination $\sum_{i=1}^{k} a_i X_{t+h_i}$, which is non-negative:
$$0 \leq \mathrm{Var}\!\left(\sum_{i=1}^{k} a_i X_{t+h_i}\right) = \sum_{i,j=1}^{k} a_i a_j \mathrm{Cov}(X_{t+h_i}, X_{t+h_j}) = \sum_{i,j=1}^{k} a_i a_j \gamma_X(h_i - h_j).$$
(ii) Let $(X_t)$ be an MA(1) process as in (1.6) in Example 1.8. From (1.8), $(X_t)$ is stationary and its ACF is given by
$$\rho_X(h) = \begin{cases} 1 & \text{if } h = 0; \\ \dfrac{\theta}{1 + \theta^2} & \text{if } |h| = 1; \\ 0 & \text{if } |h| \geq 2. \end{cases} \tag{1.17}$$
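As a quick numerical check of (1.17), one can simulate a long MA(1) path and compare the sample ACF at lag 1 with $\theta/(1 + \theta^2)$; the value $\theta = 0.6$ below is an arbitrary illustrative choice.

```r
# Numerical check of (1.17): rho(1) = theta / (1 + theta^2), rho(h) = 0 for |h| >= 2.
set.seed(4)
theta <- 0.6
x <- arima.sim(model = list(ma = theta), n = 10000)

round(acf(x, lag.max = 3, plot = FALSE)$acf[2:4], 3)  # sample rho(1), rho(2), rho(3)
theta / (1 + theta^2)                                 # theoretical rho(1), about 0.441
```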
Thus for $s, t \geq 0$ we have $\gamma_S(s, t) = \min\{s, t\}$, which is not a function of the lag $s - t$. We conclude that $S$ is not stationary. Consider the correlation between $S_t$ and $S_{t+h}$. For $h \geq 0$, we have
$$\mathrm{Cor}(S_t, S_{t+h}) = \frac{t}{\sqrt{t}\,\sqrt{t+h}} = \frac{\sqrt{t}}{\sqrt{t+h}}, \tag{1.18}$$
which is close to 1 when $t$ is large relative to $h$.
Example 1.19 (Autoregressive process of order 1). Let $(Z_t)_{t\in\mathbb{Z}} \sim \mathrm{WN}(0, \sigma^2)$, and let $\phi \in \mathbb{R}$ be a constant satisfying $|\phi| < 1$. Define a process $(X_t)_{t\in\mathbb{Z}}$ by
$$X_t = \sum_{j=0}^{\infty} \phi^j Z_{t-j} = Z_t + \phi Z_{t-1} + \phi^2 Z_{t-2} + \cdots, \tag{1.19}$$
which is a moving average process of infinite order (an instance of an MA($\infty$) process). Since $\sum_{j=0}^{\infty} |\phi^j| < \infty$ and $(Z_t) \sim \mathrm{WN}(0, \sigma^2)$, the series converges absolutely with probability 1. To show this, consider
$$E\left[\sum_{j=0}^{\infty} |\phi^j Z_{t-j}|\right] = \sum_{j=0}^{\infty} |\phi^j|\, E[|Z_{t-j}|],$$
where the equality holds by the monotone convergence theorem. On the other hand, since $(Z_t) \sim \mathrm{WN}(0, \sigma^2)$, we have, by the Cauchy-Schwarz inequality,
$$E[|Z_{t-j}|] = E[|Z_{t-j}| \cdot 1] \leq \sqrt{E[Z_{t-j}^2]}\,\sqrt{E[1^2]} = \sigma.$$
Thus $E\bigl[\sum_{j=0}^{\infty} |\phi^j Z_{t-j}|\bigr] \leq \sigma \sum_{j=0}^{\infty} |\phi|^j < \infty$, so with probability 1 we have $\sum_{j=0}^{\infty} |\phi^j Z_{t-j}| < \infty$, i.e., the sum converges absolutely.
Since
$$X_t = Z_t + \phi \underbrace{(Z_{t-1} + \phi Z_{t-2} + \cdots)}_{X_{t-1}} = Z_t + \phi X_{t-1},$$
we obtain the recursion
$$X_t = \phi X_{t-1} + Z_t, \quad t \in \mathbb{Z}. \tag{1.20}$$
Thus Xt is a weighted sum of its previous value Xt−1 and the current noise Zt .
We say that (Xt ) follows an autoregressive process of order 1 (denoted as AR(1)).
Autoregressive processes of higher orders will be introduced later.
Let's verify that $(X_t)$ is stationary. Since $\sum_{j=0}^{\infty} |\phi^j| < \infty$ and $(Z_t) \sim \mathrm{WN}(0, \sigma^2)$, it is possible to exchange sums with expectation/covariance operators.⁶ We have
$$E[X_t] = E\left[\sum_{j=0}^{\infty} \phi^j Z_{t-j}\right] = \sum_{j=0}^{\infty} \phi^j E[Z_{t-j}] = \sum_{j=0}^{\infty} \phi^j \cdot 0 = 0,$$
⁶ We omit the proof, which can be found in [1, Section 3.1].
⁷ Letting $X_t = \mu + \sum_{j=0}^{\infty} \phi^j Z_{t-j}$, where $\mu \in \mathbb{R}$, shifts the mean to $\mu$. The recursion (1.20) becomes $X_t - \mu = \phi(X_{t-1} - \mu) + Z_t$.
[Figure 1.5: Sample paths of AR(1) processes with φ = 0.9 (left) and φ = −0.9 (right).]
A similar computation gives
$$\gamma_X(t, t + h) = \phi^h \gamma_X(t, t) = \phi^h \frac{\sigma^2}{1 - \phi^2}, \quad h \geq 0,$$
so that $\rho_X(h) = \phi^{|h|}$,
which decays geometrically. When φ ∈ (0, 1), ρX (h) decays monotonically. When
$\phi \in (-1, 0)$, the sign of $\rho_X(h)$ alternates. See Figure 1.5 to get a feel for how this relates to the behaviour of the sample paths. The ACFs (theoretical and sample) are
plotted in Figure 1.6.
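Plots in the style of Figures 1.5 and 1.6 can be produced with a few lines of R; a minimal sketch (300 time points and unit noise variance, an arbitrary illustrative setup):

```r
# Simulate AR(1) paths X_t = phi X_{t-1} + Z_t and compare the sample ACF
# with the theoretical ACF rho(h) = phi^h.
set.seed(3)
op <- par(mfrow = c(1, 2))
for (phi in c(0.9, -0.9)) {
  x <- arima.sim(model = list(ar = phi), n = 300)   # Gaussian AR(1)
  acf(x, lag.max = 20, main = paste("phi =", phi))  # sample ACF with default confidence bands
  points(0:20, phi^(0:20), pch = 16, col = "red")   # theoretical ACF
}
par(op)
```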
[Figure 1.6: Theoretical and sample ACFs for the AR(1) processes with φ = 0.9 (left) and φ = −0.9 (right). Horizontal axis: Lag.]
Definition 1.20. Let $(x_1, \ldots, x_T)$ be observed time series data.
(i) (Sample mean) The sample mean is $\bar{x} = \frac{1}{T} \sum_{t=1}^{T} x_t$.
(ii) (Sample autocovariance function) The sample autocovariance function is defined for $0 \leq h < T$ by
$$\hat\gamma(h) = \hat\gamma(-h) = \frac{1}{T} \sum_{t=1}^{T-h} (x_{t+h} - \bar{x})(x_t - \bar{x}). \tag{1.23}$$
(iii) (Sample correlation function) The sample correlation function is defined for $0 \leq h < T$ by
$$\hat\rho(h) = \hat\rho(-h) = \frac{\hat\gamma(h)}{\hat\gamma(0)}. \tag{1.24}$$
Note that in (1.23) we divide by T rather than T − h (which varies with the lag
h). This ensures that γ̂ is positive semidefinite in the sense of Proposition 1.15(iii).
Example 1.21. In Figure 1.6 we plot the sample ACFs of the two simulated paths of the AR(1) process (for φ = 0.9 and φ = −0.9 respectively). Observe that the patterns of the sample ACFs match, up to sampling errors, those of the theoretical ones.
Example 1.22. In Figure 1.7 we plot a simulated path of a random walk (as in Figure
1.3) and its sample ACF. Observe that the sample ACF stays positive for all lags
shown and decays rather slowly (cf. (1.18)). These behaviours usually indicate that
the underlying process is nonstationary.
In time series analysis it is frequently useful to know whether a process is approximately a white noise. Naturally, we may examine this by the sample ACF. Even if
(xt ) is a realization of a white noise process, due to sampling errors the sample ACF is
likely to be non-zero for non-zero lags. The typical magnitude of fluctuations is given by the following theorem, whose proof is beyond the scope of the course. We use $\xrightarrow{d}$ to denote convergence in distribution.
Theorem 1.23 (Asymptotic distribution of sample ACF). Let $(X_t) \sim \mathrm{IID}(0, \sigma^2)$. Let $\hat\rho_T$ be the sample ACF computed from $(X_1, \ldots, X_T)$. Under additional technical conditions,⁸ we have, for each fixed $h \geq 1$,
$$\sqrt{T}\,\bigl(\hat\rho_T(1), \ldots, \hat\rho_T(h)\bigr)^{\top} \xrightarrow{d} N(0, I_h), \tag{1.25}$$
where $N(0, I_h)$ is the standard normal distribution on $\mathbb{R}^h$. Thus, for any $h \neq 0$, $\hat\rho_T(h)$ is approximately distributed as $N(0, \frac{1}{T})$ when $T$ is sufficiently large.

[Figure 1.7: Left: A sample path of a random walk. Right: Its sample ACF.]
Here is a straightforward application of the theorem. Fix a lag $h$, say $h = 1$. If $(X_t)$ is an i.i.d. noise (more precisely, we mean that it satisfies the assumptions of Theorem 1.23), then $\hat\rho_T(h)$ is approximately distributed as $N(0, \frac{1}{T})$. If we observe $|\hat\rho_T(h)| > \frac{1.96}{\sqrt{T}} \approx \frac{2}{\sqrt{T}}$, then we can reject the null hypothesis that $(X_t)$ is an i.i.d. noise at the 5% significance level. The lines at $\pm \frac{1.96}{\sqrt{T}}$ are typically provided in a plot of the ACF and are useful for visual inspection of the sample ACF. For example, in Figures 1.6 and 1.7 the sample ACF is significant (by which we mean larger than $1.96/\sqrt{T}$ in magnitude) at many lags. This is strong evidence that the series is not a white noise.
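A minimal R sketch of this lag-1 check on simulated i.i.d. noise (the sample size 500 is an arbitrary choice):

```r
# Test H0: the series is i.i.d. noise, using the lag-1 sample autocorrelation.
set.seed(6)
n <- 500            # plays the role of T in the text
x <- rnorm(n)       # here H0 is true by construction

rho1 <- acf(x, lag.max = 1, plot = FALSE)$acf[2]  # sample rho_hat(1)
abs(rho1) > 1.96 / sqrt(n)                        # TRUE would reject H0 at the 5% level
```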
Corollary 1.24. Let $h \geq 1$ be a given maximum lag under consideration. Under the assumptions of Theorem 1.23, we have
$$Q_T := T \sum_{j=1}^{h} \hat\rho_T^2(j) \xrightarrow{d} \chi^2_h, \tag{1.26}$$
where $\chi^2_h$ denotes the chi-squared distribution with $h$ degrees of freedom.
Proof. This follows from the continuous mapping theorem applied to (1.25).
Given a lag $h$ and a significance level $\alpha$, the Portmanteau test⁹ (also called the Box–Pierce test) rejects the null hypothesis that the series is an i.i.d. noise if the test statistic $Q_T$ defined by (1.26) exceeds the $(1 - \alpha)$-quantile of $\chi^2_h$. Note that Corollary 1.24 is an asymptotic result, so it may not be accurate if $T$ is not sufficiently
⁸ A sufficient condition is that the fourth moment of $X_t$ is finite: $E[X_t^4] < \infty$.
⁹ The term “Portmanteau test” is also used for related tests that use a similar test statistic.
large. A refinement of the Portmanteau test, called the Ljung–Box test, uses instead the test statistic
$$Q_T := T(T + 2) \sum_{j=1}^{h} \frac{\hat\rho_T^2(j)}{T - j}. \tag{1.27}$$
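Both the Box–Pierce and Ljung–Box statistics are available in base R through Box.test(); a minimal sketch on simulated data (the lag h = 10 and the AR coefficient 0.9 are arbitrary choices):

```r
# Portmanteau (Box-Pierce) and Ljung-Box tests with maximum lag h = 10.
set.seed(7)
z <- rnorm(500)                                   # i.i.d. noise: large p-values expected
x <- arima.sim(model = list(ar = 0.9), n = 500)   # AR(1): small p-values expected

Box.test(z, lag = 10, type = "Box-Pierce")
Box.test(z, lag = 10, type = "Ljung-Box")
Box.test(x, lag = 10, type = "Ljung-Box")
```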