Lecture notes 3
3 Basics of forecasting
3.1 First steps
3.2 Decompositions
3.3 Deterministic prediction
3.4 Probabilistic prediction
3.5 Statistical models
3 Basics of forecasting
Here we delve into the heart of the matter. Throughout this chapter we jointly consider two tasks:
task 1 Predicting the next value yn+1 after observing a sequence y1 , . . . , yn . This is the “time
series forecasting” problem.
task 2 After observing pairs (x1 , y1 ), . . . , (xn , yn ), we observe a new xn+1 and we want to predict
the new yn+1 . This is the “regression” problem. We can think of y as the quantity of interest,
and of x as information that we find relevant for the prediction of y.
We could start with task 1 only, but task 2 is fundamental and both tasks are deeply intertwined.
An introduction to time series forecasting is Forecasting: Principles and Practice by Hyndman &
Athanasopoulos. An introductory resource on regression is the book An Introduction to Statistical
Learning by James, Witten, Hastie & Tibshirani, Chapter 3 in particular. Both books are free
online and include R code.
3.1 First steps

Adjustments. Visualizing data might suggest simple transformations that will simplify subsequent analyses. The term "adjustment" is used to designate common-sense modifications of raw series, such as calendar adjustments (months can have 28, 29, 30 or 31 days, which affects data that represent monthly counts), population adjustments, inflation adjustments, etc. See Figure 3.2.

Figure 3.1: Sales of lemonade and outside temperature over one year. Panel (a): the two series over time; panel (b): sales against temperature. Synthetic data downloaded from https://github.com/WHPAN0108/DurstExpress_exercise.
Transformations. It is common to transform each element in the series through some function,
such as the logarithm if the series is made of positive values, which helps to identify whether a trend
is polynomial or exponential, and to make the variations around a trend more stable over time.
Another simple transformation is “differencing”, which refers to ∇yt = yt − yt−1 for all t ≥ 2.
The ∇ “nabla” sign before a variable indexed by t indicates differencing. Differencing can be
iterated, for example ∇2 yt = (yt − yt−1 ) − (yt−1 − yt−2 ), for all t ≥ 3. Seasonal differencing refers
to the computation of yt − yt−m , for a period m. Differencing can be used to remove trends. For
positive series it is common to apply the logarithmic transform and then the differencing operator,
i.e. to consider ∇ log yt = log yt − log yt−1 instead of yt . See Figure 3.2.
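As an illustration, these transformations are one-liners in R; a minimal sketch, assuming y is a positive numeric vector or ts object, and taking 12 as an illustrative seasonal period:

log_y   <- log(y)                    # log transform, for positive series
d_y     <- diff(y)                   # first differences y_t - y_{t-1}
d2_y    <- diff(y, differences = 2)  # iterated (second-order) differencing
d_log_y <- diff(log(y))              # log then difference
ds_y    <- diff(y, lag = 12)         # seasonal differencing y_t - y_{t-12}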
3.2 Decompositions
We can consider a variety of manipulations to gain insight into the features of the data.
Figure 3.2: Quarterly US GDP. The top panel shows the nominal values, and those adjusted for
inflation (“real”). The middle panel shows the log transforms, with fitted linear trends. The bottom
panel shows the first differences. Note that the 1970s were a period of high inflation in the US; this
is particularly apparent in the bottom panel.
Moving averages. Also called "rolling window" averages, they refer to the construction

ỹt = (1/m) ∑_{j=−k}^{k} yt+j ,    (3.1)

where m = 2k + 1 is the number of terms being averaged.
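In R, such a centered moving average can be computed with stats::filter; a minimal sketch, assuming y is a numeric vector or ts object, with the half-width k = 2 as an illustrative choice:

k <- 2
m <- 2 * k + 1
y_tilde <- stats::filter(y, filter = rep(1 / m, m), sides = 2)  # NAs at both ends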
Trend, seasonal and residual components. We might want to find the decomposition yt = ỹt + st + rt , where
• ỹt refers to a trend component, for example estimated by a moving average as in (3.1),
• st refers to a seasonal component, typically made of m values repeated across periods, where the periodicity is denoted by m and chosen by the analyst. In math notation: st = st+m for all t,
• rt refers to a residual component, i.e. whatever remains once the trend and seasonal components are removed.
See Figure 3.3. The series ỹt + rt , or equivalently yt − st , is the "seasonally adjusted" series.
Figure 3.3: Decomposition of the classic Box & Jenkins airline data: monthly totals of international
airline passengers, 1949 to 1960. On the left, the additive decomposition is done on the original
series, while it is applied to the log on the right.
The word “typically” appears in the above description because many variations exist, under the
names of X-11, SEATS, or STL (stl function in R). To forecast yn+1 , we can separately forecast
the trend, seasonal and residual components, and then combine the three predictions.
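As an illustration, a decomposition in the spirit of Figure 3.3 can be obtained with base R; a minimal sketch using the built-in AirPassengers series (the Box & Jenkins airline data):

dec1 <- decompose(AirPassengers)        # additive decomposition of the original series
dec2 <- decompose(log(AirPassengers))   # same decomposition applied to the log
plot(dec2)                              # panels: observed, trend, seasonal, random
fit <- stl(log(AirPassengers), s.window = "periodic")   # STL as an alternative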
3.3 Deterministic prediction

Ask an expert. Experts can be people with domain expertise, and we can gather multiple experts, hoping that a group would be more effective than individuals. The "Delphi method" is a structured procedure to elicit a consensus from various experts. In recent times, prediction markets, where many anonymous individuals bet on future outcomes, have sometimes been interpreted as a way of aggregating the opinions of many people.
Baseline strategies. How might we forecast yn+1 using y1:n ? Two basic strategies are:
average: we report the empirical mean n−1 ∑_{t=1}^{n} yt as a prediction for yn+1 .
exponential smoothing: we report a weighted average of the past values, with geometrically decaying weights,

ŷn+1 = α yn + α(1 − α) yn−1 + α(1 − α)² yn−2 + · · · ,    (3.2)

where the value α ∈ [0, 1] is a tuning parameter to be chosen. Since α ∑_{j≥0} (1 − α)^j = 1, the forecast can be interpreted as a weighted average of all past values. We can re-express the forecast in a recursive form,

ŷt+1 = αyt + (1 − α)ŷt ,    (3.3)

for t = 1, 2, . . . , n, where ŷ1 has to be set somehow. Both ŷ1 and α need to be chosen by the analyst.
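A minimal sketch of the recursion (3.3) in R, assuming y is a numeric vector; setting ŷ1 to the first observation is one possible choice among others:

ses_forecast <- function(y, alpha, yhat1 = y[1]) {
  n <- length(y)
  yhat <- numeric(n + 1)
  yhat[1] <- yhat1                                        # initial value, chosen by the analyst
  for (t in 1:n) {
    yhat[t + 1] <- alpha * y[t] + (1 - alpha) * yhat[t]   # recursion (3.3)
  }
  yhat[n + 1]                                             # one-step-ahead forecast of y_{n+1}
}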
Adding trends and seasonalities. If we want to forecast yn+2 and plug ŷn+1 in place of yn+1
in the exponential smoothing recipe, we obtain ŷn+2 = ŷn+1 . Similarly, the forecast of all future
yn+h is equal to the forecast of yn+1 . In other words, our prediction of the future looks like a flat
line. To come up with a more plausible forecast, we can go beyond the above “simple exponential
smoothing” technique, and consider the inclusion of trend and seasonal components, as follows.
We first write the above model as ŷt+h = ℓt , and ℓt = αyt + (1 − α)ℓt−1 , where ℓt is called the "level" at time t, and h is the horizon we want to predict over. We can then include a trend, with
ŷt+h = ℓt + h bt ,
ℓt = αyt + (1 − α)(ℓt−1 + bt−1 ),
bt = γ(ℓt − ℓt−1 ) + (1 − γ)bt−1 ,

with two parameters, α, γ ∈ [0, 1]. Here bt represents the slope of a linear trend at time t. Similarly we can add equations and parameters to represent seasonality.

Figure 3.4: Atmospheric concentrations (monthly) of CO2 at Mauna Loa, expressed in parts per million (ppm), with predictions obtained with the HoltWinters function in R.
This strategy, generally known as exponential smoothing or “Holt–Winters”, introduced in two
articles in the late 1950s, remains widely used today. A comprehensive treatment of exponential
smoothing is the book “Forecasting with Exponential Smoothing” by Hyndman, Koehler, Ord and
Snyder, 2008. Figure 3.4 represents the forecast provided by the HoltWinters function in R, a
prediction made on a series of concentrations of CO2 in Hawaii. The figure shows a convincing
extrapolation of the monthly series onto two future years.
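As an illustration, a forecast in the spirit of Figure 3.4 can be obtained in a few lines of R; a minimal sketch using the built-in co2 series (monthly Mauna Loa concentrations):

fit   <- HoltWinters(co2)            # estimates level, trend and seasonal components
preds <- predict(fit, n.ahead = 24)  # point forecasts for the next two years
plot(fit, preds)                     # fitted values and forecasts, similar to Figure 3.4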
Those are deterministic recipes: you plug the observed series as an input, and some calculation
yields a series of predicted values. There is no prediction interval, no probability distribution of
the future values, no quantified uncertainty. We will see later in the course that we can revisit
exponential smoothing in the paradigm of “ARIMA” models and “state space models” and that it
enables the construction of prediction intervals (e.g. as implemented in HoltWinters).
Basic linear regression. Next consider the case where we want to predict y using x ("task 2"), see Figure 3.1(b). Given the pairs (x1 , y1 ), . . . , (xn , yn ), and given a new xn+1 , how would we predict yn+1 ? We can try to learn the relationship between yt and xt , by finding a function f such that yt is approximately f (xt ). To make things simpler, we can restrict the search to the family of linear functions: we want to minimize

∑_{t=1}^{n} (yt − (α + βxt ))² ,    (3.4)

with respect to α, β. Indeed "yt ≈ α + βxt " is equivalent to "(yt − (α + βxt ))² is small". Denote by α̂, β̂ the minimizers of (3.4). Given xn+1 , our forecast of yn+1 is then given by α̂ + β̂xn+1 .
With a bit of calculus (differentiating (3.4) with respect to α, β and equating the derivatives to zero), we find

α̂ = ȳn − β̂ x̄n ,    β̂ = ∑_{t=1}^{n} (xt − x̄n )(yt − ȳn ) / ∑_{t=1}^{n} (xt − x̄n )² .    (3.5)

Recall that x̄n and ȳn refer to the empirical means n−1 ∑_{t=1}^{n} xt and n−1 ∑_{t=1}^{n} yt . The estimates α̂ and β̂ are often called "ordinary least squares" (OLS). Compare β̂ with the correlation coefficient Ĉov(x1:n , y1:n ) defined in Chapter 2: they are not identical but still very similar.
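A minimal sketch in R, assuming numeric vectors x and y (for instance temperature and sales, as in Figure 3.5) and a hypothetical new value x_new at which to predict:

beta_hat  <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)  # as in (3.5)
alpha_hat <- mean(y) - beta_hat * mean(x)
fit <- lm(y ~ x)                                 # same estimates via lm()
predict(fit, newdata = data.frame(x = x_new))    # forecast of y given x = x_new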
Figure 3.5: Linear regression of sales of lemonade on the outside temperature. Left: regression
line obtained by minimizing (3.4). Right: various regression lines obtained by “bootstrap”.
The basic linear regression described above is a deterministic way of addressing task 2. How do
we know whether it works? How do we assess the error, and construct prediction intervals? Figure 3.5
illustrates linear regression using the data shown in Figure 3.1, as well as the uncertainty about the
regression line (on the right), obtained through a probabilistic approach.
3.4 Probabilistic prediction

Guessing a random variable. Consider the task of predicting a real-valued random variable
Y , with a guess c ∈ R. To evaluate our guess, we use a loss function (c, y) ↦ L(c, y). The loss
function, a central concept in decision theory, takes as arguments the guess c and a realization y of
the object to be predicted Y . The loss is larger if c is further away from y. A commonly-used loss
is L(c, y) = (y − c)2 , called the squared loss.
Our objective is to formulate a guess c that makes L(c, y) small on average: we minimize
E[L(c, Y )] with respect to c. With the squared loss this is E[(Y − c)2 ], called the mean squared error
(MSE). Some calculations show that c = E[Y ] minimizes the MSE: indeed E[(Y − c)²] = V[Y ] + (E[Y ] − c)², which is smallest when c = E[Y ]. This provides a justification for the first baseline strategy mentioned in Section 3.3.
Conditioning on observed data. Next, suppose that we observe X and we want to predict
Y given X. Any function of X, denoted by c(X), could be used to predict Y . By the “tower
property” (Equation (2.8)), we can write E[(Y − c(X))2 ] = E[E[(Y − c(X))2 |X]]. We can always
write Y − c(X) as Y − E[Y |X] + E[Y |X] − c(X), and then we can expand the square to obtain,
E[(Y − c(X))2 ] = E[E[(Y − E[Y |X])2 |X]] + E[(c(X) − E[Y |X])2 ]. (3.6)
We want to minimize that error. The first term is constant in c. The second term is minimized by
c(X) = E[Y |X]. Thus, the optimal prediction of Y given X is the conditional expectation E[Y |X].
Again, this is useful if we can approximate such quantity, using observed data. See Figure 3.6.
Linear regression, again. It is quite difficult to precisely estimate the conditional expectation
E[Y |X], even if we have data (x1 , y1 ), . . . , (xn , yn ). This is the topic of "nonparametric regression". We can
look at a simpler task: the best linear approximation. That is, we restrict the function x 7→ c(x)
to be a linear function of x, so c(x) = α + βx, and we find coefficients α, β ∈ R that minimize
the expected error E[(Y − (α + βX))2 ]. By differentiating with respect to α and β, we obtain two
equations:
E [(Y − (α + βX))] = 0 and E [X (Y − (α + βX))] = 0. (3.7)
Figure 3.6: Left: joint density of some pair of variables (X, Y ), and conditional mean E[Y |X = x] as a function of x in dashed line. Right: scatter plot of independent samples (x1 , y1 ), . . . , (xn , yn ) following that distribution, and regression line of Y onto X.
Two unknowns (α, β), two equations: we can solve them and find

β⋆ = Cov(X, Y ) / V[X] ,    α⋆ = E[Y ] − β⋆ E[X] .    (3.8)
This provides practical guidance, as we might be able to estimate these quantities using observed
data: we can estimate E[X] with the “empirical mean” x̄n , and likewise we can estimate variances
and covariances. Doing so, we retrieve the expressions obtained in (3.5). But now we know that α̂, β̂
might be approximations of α⋆ , β⋆ , and we might be interested in the approximation error. We also
note that, starting from the problem of predicting Y given X, and focusing on linear predictions
for simplicity, the concept of covariance between X and Y naturally appears.
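A minimal check in R, assuming numeric vectors x and y: the plug-in estimates of β⋆ and α⋆ coincide with the OLS estimates returned by lm().

beta_star_hat  <- cov(x, y) / var(x)                # estimate of Cov(X, Y) / V[X]
alpha_star_hat <- mean(y) - beta_star_hat * mean(x)
coef(lm(y ~ x))                                     # same values: intercept and slope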
Deriving methods from objectives. In passing, the above derivations show that one can define
an objective (minimize some expected prediction error) to arrive at a method (linear regression).
By modifying the objective we can derive other methods. It is very satisfying: we can specify
what we want to achieve, and derive a method that achieves that goal. If we are interested in
probabilistic prediction instead of point prediction, we can use a “scoring rule” as a loss function,
instead of the mean squared error. We’ll see more about scoring rules later in the course.
3.5 Statistical models

Linear regression as a model. Here we think of the sequence x1:n as given, fixed, constant.
Consider the following equation,
Yt = α + βxt + εt . (3.9)
It relates xt to Yt , it involves α, β which are called coefficients or parameters, and there is a term
εt called the residual. By re-writing εt = Yt − (α + βxt ) we see that the residual represents the
difference between Yt and a linear function of xt . Compared to basic linear regression where we
wrote yt ≈ α + βxt , here we define εt to be the discrepancy between Yt (the quantity of interest, a
random variable) and α + βxt (a linear function of xt ). The residual εt is seen as a random variable.
In linear regression we often assume that εt has mean zero (E[εt ] = 0), and that εt and εs are
uncorrelated for any times t ≠ s (Cov(εt , εs ) = 0). Such assumptions are required to validate the
construction of confidence intervals on the parameters, and in turn the construction of prediction
intervals for future values of Y . There are many distributions for the sequence (εt )t≥1 that would
satisfy the two conditions: E[εt ] = 0 and Cov(εt , εs ) = 0 for t ≠ s. We can be more explicit
about the distribution of εt , for example by assuming that (εt )t≥1 are independent Normal(0, σ 2 )
variables, which leads to the alternative model specification:
Yt ∼ N (α + βxt , σ 2 ), (3.10)
in words: Yt is Normal, centered at α + βxt , with variance σ 2 , and Y1:n are independent of one
another. The variance σ 2 is now included in the model parameters.
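As an illustration, we can simulate synthetic data from model (3.10) in R, with illustrative parameter values:

set.seed(2)
n <- 100; alpha <- 1; beta <- 2; sigma <- 0.5        # illustrative values
x <- runif(n, min = -1, max = 1)                     # fixed covariates
y <- rnorm(n, mean = alpha + beta * x, sd = sigma)   # draws from (3.10)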
Likelihood associated with a model. If the residuals are given a specific distribution, we can
write the likelihood function associated with the data (xt , yt )nt=1 and with the parameters α, β, σ 2 .
By definition the likelihood is the probability density function of the model evaluated at the observed
data. It is then viewed as a function of the parameters. Here, using R notation, and factorizing the joint density over the independent observations, the likelihood reads

∏_{t=1}^{n} dnorm(yt , mean = α + βxt , sd = σ),

with dnorm(x, mean = µ, sd = σ) = (2πσ²)^{−1/2} exp(−(x − µ)²/(2σ²)). Here maximizing the likeli-
hood with respect to α, β, σ 2 corresponds exactly to the ordinary least squares estimates in (3.5)
for α and β. On top of that, we have an estimate of σ 2 which will be useful for prediction. The
approach is flexible: if we specify another distribution for the residuals (for example a Laplace or a
Student distribution), we can still maximize the likelihood to obtain parameter estimates. The esti-
mates are called “maximum likelihood estimates”, and they are shown to have appealing statistical
properties in general.
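A minimal sketch of maximum likelihood for this model in R, assuming the numeric vectors x and y from above; the standard deviation is parametrized on the log scale to keep it positive:

loglik <- function(theta, x, y) {
  alpha <- theta[1]; beta <- theta[2]; sigma <- exp(theta[3])     # sigma > 0
  sum(dnorm(y, mean = alpha + beta * x, sd = sigma, log = TRUE))  # log-likelihood
}
fit <- optim(c(0, 0, 0), function(th) -loglik(th, x, y))          # minimize -loglik
c(alpha = fit$par[1], beta = fit$par[2], sigma = exp(fit$par[3]))

The estimates of α and β should match coef(lm(y ~ x)) up to numerical error.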
Time series modeling. We have just introduced a model to address task 2. Let’s return to
task 1: the prediction of yn+1 using y1:n . We have covered deterministic methods in Section 3.3.
Let’s take a probabilistic perspective. Denote by (Wt ) a sequence of independent Normal(0, σ 2 )
variables, called the “noise terms”.
Consider first the random walk (RW) model,
Yt = δ + Yt−1 + Wt . (3.13)
The parameter δ is called the drift. The initial condition can be set as Y1 = W1 . Here the variables
Y1:n are not independent; but we can still write down the likelihood function. We can check that
Yt = (t − 1)δ + ∑_{s=1}^{t} Ws . We can then compute E[Yt ] = (t − 1)δ, by linearity of expectation, and
V [Yt ] = tσ 2 , using the independence between the noise terms. Prediction intervals constructed
using the random walk model have a width that increases as we predict further into the future.
Consider next the autoregressive model (AR),
Yt = δ + ρYt−1 + Wt . (3.14)
Here ρ is called the autoregressive coefficient. We recover the random walk model if ρ = 1. Other-
wise, it looks like a linear regression model where Yt is predicted using Xt = Yt−1 . The properties
of the process are very different if |ρ| < 1, compared to ρ = 1 or |ρ| > 1, as can be seen with simple
simulations. In the next chapter we will find that the AR model is a building block for many time
series models.
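A minimal sketch of such simulations in R, with illustrative values δ = 0.1, ρ = 0.8, σ = 1:

set.seed(1)
n <- 200; delta <- 0.1; rho <- 0.8; sigma <- 1
w  <- rnorm(n, mean = 0, sd = sigma)                     # noise terms W_t
rw <- cumsum(delta + w) - delta                          # random walk (3.13), with Y_1 = W_1
ar <- numeric(n); ar[1] <- w[1]
for (t in 2:n) ar[t] <- delta + rho * ar[t - 1] + w[t]   # AR model (3.14)
matplot(cbind(rw, ar), type = "l", lty = 1, ylab = "Y_t")

Re-running the simulation with ρ close to 1, equal to 1, or larger than 1 illustrates the different behaviours mentioned above.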