Forecasting Time Series With Missing Data Using Holt's Model

J.D. Bermúdez, A. Corberán-Vallet, E. Vercher

Journal of Statistical Planning and Inference 139 (2009) 2791–2799. doi:10.1016/j.jspi.2009.01.004

Article history: Received 4 May 2007; received in revised form 29 December 2008; accepted 1 January 2009; available online 14 January 2009.

Keywords: Forecasting; Exponential smoothing; Linear model; EM algorithm; Data transformation

Abstract: This paper deals with the prediction of time series with missing data using an alternative formulation for Holt's model with additive errors. This formulation simplifies both the calculation of the maximum likelihood estimates of all the unknowns in the model and the computation of point forecasts. In the presence of missing data, the EM algorithm is used to obtain maximum likelihood estimates and point forecasts. Based on this application we propose a leave-one-out algorithm for the data transformation selection problem, which allows us to analyse Holt's model with multiplicative errors. Some numerical results show the performance of these procedures for obtaining robust forecasts.
1. Introduction
Most time series models assume that the observations are sampled with the same frequency, but it is common to find time
series with missing data. In order to carry out a precise analysis of these time series and obtain reliable forecasts, it is necessary
to deal effectively with the missing data.
The missing data problem has been dealt with successfully using the state-space methodology. Jones (1980) obtained the
maximum likelihood estimates of the parameters of an ARMA model in the presence of missing data using the Kalman filter. Kohn
and Ansley (1986) proposed a modified Kalman filter to generalise those previous results to the case of ARIMA models. Gómez
and Maravall (1994) showed a new definition of the likelihood of an ARIMA model with missing observations that permits the
use of the ordinary Kalman filter. Recently, Gómez et al. (1999) proposed filling in the holes in the missing data with arbitrary
values and carrying out the maximum likelihood estimation with additive outliers.
Analogously, Wright (1986) suggested an extension for simple exponential smoothing and Holt's method in the case of missing
data. Cipra et al. (1995) extended the previous approach to the Holt–Winters method: the level, trend and seasonal terms are
updated each time a new observation becomes available, using modified transition equations which take into account the possible
presence of missing data between two consecutive observations. Cipra and Romera (1997) proposed new transition equations for
the Holt–Winters method in the presence of missing data using a robust version of the Kalman filter based on the M-estimation
methodology: if the observation at one time point is missing, the estimated level, trend and seasonality remain unchanged. The
problem of missing data in time series prediction has also been dealt with using neural networks (Hofmann and Tresp, 1998) and
Monte Carlo methods (Chen and Liu, 1998).
Since exponential smoothing methods are widely used for short-term prediction in business and industry (Gardner, 2006),
in this paper, we present a new approach to the prediction of time series with missing data based on an alternative formulation
for Holt's model with additive errors. In this new formulation the stochastic component of the model is introduced by means
of additive, independent, homoscedastic and normal errors. Then the data vector is multivariate normal and its mean and
covariance matrix are functions of the model parameters. Hence Holt's model can be formulated as a heteroscedastic linear model whose coefficients are given by the initial conditions and whose covariance matrix depends on the smoothing parameters. This
formulation allows us to obtain the maximum likelihood estimates of all the unknowns, the smoothing parameters and the initial
values of level and trend jointly, as in some other proposals (Harvey, 1989; Ord et al., 1997; Segura and Vercher, 2001; Bermúdez
et al., 2006a,b; Bermúdez et al., 2007, 2008). In the presence of missing data, the EM algorithm (Dempster et al., 1977) is used to estimate the model parameters and to compute point forecasts.
The paper is organised as follows. In Section 2 we define the multivariate linear model formulation of Holt's model and give the formulae for the calculation of the maximum likelihood estimators and point forecasts. In Section 3 we apply the EM algorithm to the estimation of the model parameters and the computation of point forecasts when missing data are present, and show the performance of the procedure with some numerical results. In Section 4 we develop a new data transformation selection method based on the leave-one-out technique and present the results corresponding to the prediction of the yearly time series of the M3 Competition when the proposed data transformation selection mechanism is used. The last section contains some concluding remarks.
2. Holt's model with additive errors

We assume that {y_t}, t = 1, ..., n, are the observed data. Holt's model with additive errors assumes that the observation at time t comes from the random variable

    Y_t = μ_t + ε_t                                                    (1)

where μ_t = a_{t−1} + b_{t−1}, a_t and b_t are the level and trend at time t, respectively, and {ε_t} are independent homoscedastic normal random variables, N(0, σ²). When a new observation becomes available, the level and trend terms are updated through the transition equations
    a_t = μ_t + α ε_t                                                  (2)
    b_t = b_{t−1} + α β ε_t                                            (3)
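To make the data-generating mechanism concrete, the following R sketch simulates a series from Eqs. (1)–(3); it is an illustration written for this text (the function name and argument names are ours), not code from the original study.

    simulate_holt <- function(n, alpha, beta, a0, b0, sigma) {
      a <- a0; b <- b0
      y <- numeric(n)
      for (t in 1:n) {
        mu  <- a + b                  # one-step-ahead mean, Eq. (1)
        eps <- rnorm(1, 0, sigma)     # additive normal error
        y[t] <- mu + eps
        a <- mu + alpha * eps         # level update, Eq. (2)
        b <- b + alpha * beta * eps   # trend update, Eq. (3)
      }
      y
    }

For example, simulate_holt(60, 0.3, 0.1, 100, 2, 5) produces a series of length 60 whose underlying trend has slope close to 2.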
Let Y = (y_1, y_2, ..., y_n)′ be the data vector, θ = (α, β)′ the vector of smoothing parameters and δ = (a_0, b_0)′ the vector of initial values. Given the initial values and applying the transition equations recursively, the h-steps-ahead prediction is usually obtained as ŷ_{n+h} = a_n + h b_n for h ≥ 1. In practice, both the θ and δ vectors are unknown and have to be estimated from the data. Applying (1), (2) and (3) recursively, the data can be written (Bermúdez et al., 2007) as
    Y_1 = a_0 + b_0 + ε_1
    ...
    Y_t = a_0 + t b_0 + Σ_{r=1}^{t−1} α(1 + β(t − r)) ε_r + ε_t

or, in matrix form,

    Y = Aδ + Lε                                                        (4)
where A is the n × 2 matrix whose first column is the vector (1, 1, ..., 1)′ and whose second column is the vector (1, 2, ..., n)′; L is the n × n lower triangular matrix whose elements in the main diagonal are equal to 1 and l_{i,j} = α(1 + β(i − j)) if i > j; and ε = (ε_1, ε_2, ..., ε_n)′ is the error vector. Therefore, the data vector follows a multivariate normal distribution with mean E(Y) = Aδ and covariance matrix V(Y) = σ² LL′. This covariance matrix depends on the smoothing parameters but is always positive-definite because it is symmetric and its determinant is always positive: |V(Y)| = σ^{2n} |L|² = σ^{2n} > 0, since |L| = 1. We assume that the value of each smoothing parameter lies in the interval [0, 1], although this restriction is not necessary.
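To fix ideas, for n = 3 the matrix L implied by the definition above is

    L = [ 1            0           0
          α(1 + β)     1           0
          α(1 + 2β)    α(1 + β)    1 ]

so each error ε_r propagates to every later observation through the level and trend updates.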
From Eq. (4), the log-likelihood function of the data vector Y is given by the logarithm of the multivariate normal density function, which, up to an additive constant, is

    −(n/2) ln σ² − (1/(2σ²)) (Y − Aδ)′ (LL′)^{−1} (Y − Aδ)             (5)

The quadratic form in (5), (Y − Aδ)′(LL′)^{−1}(Y − Aδ), can be decomposed as (δ̃ − δ)′X′X(δ̃ − δ) + (L^{−1}Y)′(I − P_X)(L^{−1}Y), where X is the matrix L^{−1}A, P_X = X(X′X)^{−1}X′ is the orthogonal projection matrix onto the vector space generated by the columns of X, and δ̃ = (X′X)^{−1}X′L^{−1}Y is the least-squares estimator of δ when θ is known. So the log-likelihood function can be
expressed as
    −(n/2) ln σ² − (1/(2σ²)) (δ̃ − δ)′ X′X (δ̃ − δ) − (1/(2σ²)) (L^{−1}Y)′ (I − P_X) (L^{−1}Y)      (6)
The first quadratic form that appears in (6) can always be annulled, whatever the value of θ is, while the second quadratic form involves only the smoothing vector θ. The maximum likelihood estimator of θ, denoted by θ̂, can therefore be obtained by minimising

    (L^{−1}Y)′ (I − P_X) (L^{−1}Y)                                     (7)

Once θ̂ has been obtained, let L̂ be the matrix L computed at θ̂ and X̂ = L̂^{−1}A. The maximum likelihood estimator of δ is

    δ̂ = (X̂′X̂)^{−1} X̂′ L̂^{−1} Y                                        (8)

and the maximum likelihood estimator of σ² is

    σ̂² = (1/n) (L̂^{−1}Y)′ (I − P_X̂) L̂^{−1}Y                           (9)
Using this approach we only have to solve one optimisation problem, (7), with respect to θ; once its maximum likelihood estimator has been calculated, the estimators of δ and σ² are obtained analytically using (8) and (9).
Notice that to solve the optimisation problem (7) we need nonlinear optimisation procedures that handle box constraints, since we assume that the values of the smoothing parameters lie in the interval [0, 1]. To do that we use the `L-BFGS-B' method (Byrd et al., 1995) in the R language. It is important to use a multi-start strategy because the routines used do not guarantee that the global optimum is reached when the objective function is not convex. On the other hand, the computation of the matrix P_X can cause numerical problems, which can be avoided by using the singular value decomposition of the matrix X. So if X = UDV′ is the singular value decomposition of X, then P_X = UU′ and the matrix (X′X)^{−1}X′ needed to obtain the maximum likelihood estimator of δ is equal to VD^{−1}U′. We work with the R software (R Development Core Team, 2008), in which the singular value decomposition of a matrix and the optimisation problem (7) are easily carried out using standard commands.
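The following R sketch illustrates this estimation scheme (a minimal illustration written for this text, not the authors' code; function names such as build_L and fit_holt are ours): the profile objective is expression (7), minimised over the smoothing parameters with the `L-BFGS-B' routine and a small multi-start, and (8)–(9) then give the remaining maximum likelihood estimates via the singular value decomposition.

    build_L <- function(theta, n) {
      alpha <- theta[1]; beta <- theta[2]
      L <- diag(n)
      for (i in 2:n)
        for (j in 1:(i - 1))
          L[i, j] <- alpha * (1 + beta * (i - j))   # l_ij = alpha(1 + beta(i - j)), i > j
      L
    }

    profile_obj <- function(theta, y) {             # expression (7)
      n <- length(y); A <- cbind(1, 1:n)
      L <- build_L(theta, n)
      z <- forwardsolve(L, y)                       # L^{-1} Y
      X <- forwardsolve(L, A)                       # X = L^{-1} A
      U <- svd(X)$u                                 # P_X = U U'
      r <- z - U %*% crossprod(U, z)                # (I - P_X) L^{-1} Y
      sum(r^2)
    }

    fit_holt <- function(y, starts = list(c(0.5, 0.1), c(0.9, 0.5), c(0.1, 0.9))) {
      n <- length(y); A <- cbind(1, 1:n)
      runs <- lapply(starts, function(s)            # multi-start with box constraints
        optim(s, profile_obj, y = y, method = "L-BFGS-B",
              lower = c(0, 0), upper = c(1, 1)))
      best  <- runs[[which.min(sapply(runs, `[[`, "value"))]]
      theta <- best$par
      Lhat  <- build_L(theta, n)
      s     <- svd(forwardsolve(Lhat, A))           # X = U D V'
      delta <- drop(s$v %*% (crossprod(s$u, forwardsolve(Lhat, y)) / s$d))  # Eq. (8)
      list(theta = theta, delta = delta, sigma2 = best$value / n)           # Eq. (9)
    }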
Once the parameters of the model have been estimated, forecasts of future values can be obtained as follows. Let Y_1 be the n × 1 data vector and Y_2 the h × 1 vector of future values. If we assume that the joint (n + h) × 1 vector Y_e = (Y_1′, Y_2′)′ satisfies Eq. (4), we have

    Y_e = ( Y_1 )  =  ( A_1 ) δ  +  ( L_1    0  ) ( ε_1 )
          ( Y_2 )     ( A_2 )      ( L_21  L_2 ) ( ε_2 )

where the vector ε and the matrices A and L have been partitioned in a similar way to the vector Y_e. Therefore (see for example Seber, 1984, p. 19), the conditional distribution of Y_2 given Y_1 is multivariate normal with mean

    E(Y_2 | Y_1) = A_2 δ + L_21 L_1^{−1} (Y_1 − A_1 δ)                 (10)

which, evaluated at the maximum likelihood estimates, provides the point forecasts.
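A point-forecast routine following Eq. (10) could then look as follows (again a sketch built on the previous one; forecast_holt is our name):

    forecast_holt <- function(y, h) {
      fit <- fit_holt(y)
      n   <- length(y)
      Ae  <- cbind(1, 1:(n + h))                    # extended matrix A for (Y1', Y2')'
      Le  <- build_L(fit$theta, n + h)              # extended lower-triangular matrix L
      me  <- drop(Ae %*% fit$delta)                 # E(Ye) = Ae delta
      r   <- forwardsolve(Le[1:n, 1:n], y - me[1:n])              # L1^{-1}(Y1 - A1 delta)
      drop(me[(n + 1):(n + h)] + Le[(n + 1):(n + h), 1:n, drop = FALSE] %*% r)
    }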
3. Forecasting with missing data: the EM algorithm

The missing data problem has usually been dealt with using the Kalman filter (Kohn and Ansley, 1986; Gómez and Maravall,
1994; Cipra and Romera, 1997). In this section, we present a new approach to the prediction of time series with missing data
based on the formulation for Holt's model proposed in the previous section and using the EM algorithm for the estimation of all
the unknowns in the model: smoothing parameters and starting values.
The EM algorithm is an iterative optimisation method for finding the maximum likelihood estimates of the parameters of
a probabilistic model. It is appropriate when the given data are incomplete or have missing values, or when optimising the
likelihood function is analytically intractable but can be simplified by assuming the existence of unobserved latent variables.
Suppose that Ψ^(i) denotes the current value of the vector of all unknown parameters, Ψ, after i iterations of the algorithm. The next iteration can be described in two steps, the expectation and maximisation steps, which give the algorithm its name:
E-step (expectation step). Obtain the expected value of the complete data log-likelihood with respect to the unknown data
given the observed data and the current parameter estimates.
M-step (maximisation step). Choose (i+1) to be a value which maximises the above expectation.
In our case, let Y_obs = (y_1, ..., y_{m_1−1}, y_{m_1+1}, ..., y_{m_f−1}, y_{m_f+1}, ..., y_n)′ be the observed data vector, where the observations at times m_1, ..., m_f are missing, and let Y_com = (y_1, y_2, ..., y_{m_1−1}, y_{m_1}, y_{m_1+1}, ..., y_n)′ be the complete data vector. Let us assume that the vector Y_com is the outcome of Holt's model with additive errors, so Y_com = Aδ + Lε.
As with any iterative algorithm, the EM algorithm needs a starting point, that is, Ψ^(0) = (θ^(0), δ^(0), (σ²)^(0)). To obtain the starting point we propose analysing the observed data in the time series as if there were no missing data between two consecutive observations, and calculating the maximum likelihood estimates of the model parameters using Eqs. (7)–(9). The output of the first iteration, Ψ^(1), is a new estimate of the model parameters used to begin the next iteration, and so on until the stopping rule is satisfied.
Let Ψ^(i) = (θ^(i), δ^(i), (σ²)^(i)) be the estimate obtained in the i-th iteration. The next iteration is given by the following two steps:
E-step. The log-likelihood function for the complete data is given by Eq. (5), and its expectation is

    Q(θ, δ, σ² | Ψ^(i)) = −(n/2) ln σ² − (1/(2σ²)) E[(Y_com − Aδ)′ (LL′)^{−1} (Y_com − Aδ)]

where E stands for the expectation operator with respect to the distribution of Y_com given Y_obs and Ψ = Ψ^(i). After some simple algebra, the expected quadratic form becomes

    E[(Y_com − Aδ)′ (LL′)^{−1} (Y_com − Aδ)] = 1′[(LL′)^{−1} ∘ V(Y_com)]1 + (E(Y_com) − Aδ)′ (LL′)^{−1} (E(Y_com) − Aδ)

where ∘ denotes the Hadamard (element-wise) product and 1 the vector of ones, so that

    Q(θ, δ, σ² | Ψ^(i)) = −(n/2) ln σ² − (1/(2σ²)) 1′[(LL′)^{−1} ∘ V(Y_com)]1 − (1/(2σ²)) (E(Y_com) − Aδ)′ (LL′)^{−1} (E(Y_com) − Aδ)      (11)
In order to compute E(Y_com) and V(Y_com), let Y_R be the complete data vector reordered so that the missing data {y_{m_1}, ..., y_{m_f}} are in the last positions, Y_R = (y_1, ..., y_{m_1−1}, y_{m_1+1}, ..., y_n, y_{m_1}, ..., y_{m_f})′, and let A_R and L_R be the matrices A and L reordered in a similar way. Let Y_{R.1} and Y_{R.2} be the subvectors of Y_R corresponding to the observed and missing data, respectively. Partitioning the matrices A_R and L_R in a similar way, Eq. (4) becomes

    Y_R = ( Y_{R.1} )  =  ( A_{R.1} ) δ  +  ( L_{R.11}  L_{R.12} ) ( ε_{R.1} )
          ( Y_{R.2} )     ( A_{R.2} )      ( L_{R.21}  L_{R.22} ) ( ε_{R.2} )
As the joint distribution of Y_R is multivariate normal, the conditional distribution of Y_{R.2} given Y_{R.1} is also multivariate normal, with mean and covariance matrix given by (see for example Seber, 1984, p. 19)

    E(Y_{R.2} | Y_{R.1}) = A_{R.2} δ + Σ_{21} Σ_{11}^{−1} (Y_{R.1} − A_{R.1} δ)
    V(Y_{R.2} | Y_{R.1}) = Σ_{22} − Σ_{21} Σ_{11}^{−1} Σ_{12}

where

    V(Y_R) = ( Σ_{11}  Σ_{12} )  =  σ² ( L_{R.11} L_{R.11}′ + L_{R.12} L_{R.12}′    L_{R.11} L_{R.21}′ + L_{R.12} L_{R.22}′ )
             ( Σ_{21}  Σ_{22} )        ( L_{R.21} L_{R.11}′ + L_{R.22} L_{R.12}′    L_{R.21} L_{R.21}′ + L_{R.22} L_{R.22}′ )
Therefore, E(Y_com) = (y_1, ..., y_{m_j−1}, E(y_{m_j} | Y_obs, Ψ^(i)), y_{m_j+1}, ..., y_n)′, where E(y_{m_j} | Y_obs, Ψ^(i)) is the j-th component of the vector E(Y_{R.2} | Y_{R.1}) computed at Ψ = Ψ^(i). In a similar way, the component (m_l, m_j) of the matrix V(Y_com) is equal to the (l, j) component of the matrix V(Y_{R.2} | Y_{R.1}) computed at Ψ = Ψ^(i), for all l, j = 1, ..., f, all other components being zero, which implies that 1′[(LL′)^{−1} ∘ V(Y_com)]1 = 1′[((LL′)^{−1})_{R.22} ∘ V(Y_{R.2} | Y_{R.1})]1, where ((LL′)^{−1})_{R.22} denotes the submatrix of the reordered matrix (LL′)^{−1} corresponding to the missing positions.
Moreover, the quadratic form in the last term of Eq. (11) can be decomposed as the quadratic form in (5) was, with X = L^{−1}A and P_X its orthogonal projection matrix and now δ̃ = (X′X)^{−1}X′L^{−1}E(Y_com), so

    Q(Ψ | Ψ^(i)) = −(n/2) ln σ² − (1/(2σ²)) 1′[((LL′)^{−1})_{R.22} ∘ V(Y_{R.2} | Y_{R.1})]1
                   − (1/(2σ²)) (δ̃ − δ)′ X′X (δ̃ − δ) − (1/(2σ²)) (L^{−1}E(Y_com))′ (I − P_X) (L^{−1}E(Y_com))      (12)
M-step. Maximise the function Q(Ψ | Ψ^(i)) obtained in the previous E-step over Ψ = (θ, δ, σ²)′.
The quadratic form corresponding to δ in Eq. (12) can always be annulled, so the problem reduces to minimising, with respect to θ, the expression

    1′[((LL′)^{−1})_{R.22} ∘ V(Y_{R.2} | Y_{R.1})]1 + (L^{−1}E(Y_com))′ (I − P_X) (L^{−1}E(Y_com))      (13)
Once θ^(i+1) is obtained, let L̂ be the matrix L computed at θ^(i+1) and X̂ = L̂^{−1}A. The new estimate of δ, δ^(i+1), is then given by

    δ^(i+1) = (X̂′X̂)^{−1} X̂′ L̂^{−1} E(Y_com)                           (14)

and the new estimate of σ² is, as before, the mean of the squared fitting errors,

    (σ²)^(i+1) = (1/n) 1′[((L̂L̂′)^{−1})_{R.22} ∘ V̂(Y_{R.2} | Y_{R.1})]1 + (1/n) (L̂^{−1}E(Y_com))′ (I − P_X̂) L̂^{−1}E(Y_com)      (15)
The algorithm ends when the stopping rule is reached. We propose a stopping rule based on a relative comparison of the mean squared fitting errors in two consecutive iterations: the algorithm stops when |(σ²)^(i+1) − (σ²)^(i)| / (σ²)^(i) is smaller than a prefixed small tolerance.
Notice the similarity between the optimisation problem (13) and Eqs. (14) and (15) and their counterparts in Section 2 when no missing data are present, namely expression (7) and Eqs. (8) and (9). The main difference is that a new term appears in problem (13), containing the conditional covariance matrix of the missing data.
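A schematic R implementation of these iterations, continuing the sketches above (it reuses build_L), is given next; it is our own illustration rather than the authors' code, and the starting values (a regression fit on the observed points and θ = (0.5, 0.1)) deliberately simplify the initialisation described earlier. Missing observations are marked with NA in the input vector.

    em_holt <- function(y, tol = 1e-6, max_iter = 100) {
      n <- length(y); A <- cbind(1, 1:n)
      mis <- which(is.na(y)); obs <- which(!is.na(y))
      theta  <- c(0.5, 0.1)                                   # simplified starting values
      delta  <- coef(lm(y[obs] ~ 0 + A[obs, ]))
      sigma2 <- var(y[obs])
      for (it in seq_len(max_iter)) {
        ## E-step: conditional mean and variance of the missing data
        L <- build_L(theta, n); S <- sigma2 * tcrossprod(L)   # V(Ycom) = sigma^2 LL'
        m <- drop(A %*% delta)
        B <- S[mis, obs, drop = FALSE] %*% solve(S[obs, obs])
        Ey <- y
        Ey[mis] <- m[mis] + drop(B %*% (y[obs] - m[obs]))                      # E(Ycom)
        Vmis <- S[mis, mis, drop = FALSE] - B %*% S[obs, mis, drop = FALSE]    # V(YR.2 | YR.1)
        ## M-step: minimise expression (13) over theta
        obj13 <- function(th) {
          L    <- build_L(th, n)
          Linv <- forwardsolve(L, diag(n))
          Minv <- crossprod(Linv)                             # (LL')^{-1}
          U <- svd(Linv %*% A)$u
          z <- Linv %*% Ey; r <- z - U %*% crossprod(U, z)
          sum(Minv[mis, mis] * Vmis) + sum(r^2)               # expression (13)
        }
        theta <- optim(theta, obj13, method = "L-BFGS-B",
                       lower = c(0, 0), upper = c(1, 1))$par
        ## Updates (14) and (15)
        Lhat  <- build_L(theta, n)
        s     <- svd(forwardsolve(Lhat, A))
        delta <- drop(s$v %*% (crossprod(s$u, forwardsolve(Lhat, Ey)) / s$d))  # Eq. (14)
        sigma2_new <- obj13(theta) / n                                         # Eq. (15)
        if (abs(sigma2_new - sigma2) / sigma2 < tol) { sigma2 <- sigma2_new; break }
        sigma2 <- sigma2_new
      }
      list(theta = theta, delta = delta, sigma2 = sigma2, fitted = Ey)
    }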
Let Ll(Ψ) be the log-likelihood function of the observed data Y_obs. The purpose of the EM algorithm is to maximise Ll(Ψ) over Ω, the parametric space, but by an iterative procedure in which the function maximised at each iteration is Q(Ψ | Ψ^(i)), where Ψ^(i) is a fixed known constant that changes from one iteration to the next. It is well known (Dempster et al., 1977) that for a bounded sequence Ll(Ψ_p) from the EM algorithm, Ll(Ψ_p) converges monotonically to some value Ll*. It is not evident whether Ll* is a global maximum of Ll(Ψ) over Ω. Wu (1983) proves some convergence results for EM sequences that are of interest in our application. Specifically, Theorem 2 of Wu (1983) states that all the limit points of any instance {Ψ_p} of an EM algorithm are stationary points
of Ll(Ψ), and that Ll(Ψ_p) converges monotonically to Ll* = Ll(Ψ*) for some stationary point Ψ*, if the following conditions are satisfied: (i) Q(Ψ | Φ) is continuous in both Ψ and Φ; (ii) Ll(Ψ) is continuous in Ω and differentiable in the interior of Ω; and (iii) the set {Ψ ∈ Ω : Ll(Ψ) ≥ Ll(Ψ^(0))} is compact for any starting value Ψ^(0) with Ll(Ψ^(0)) > −∞.
In our proposal, all those conditions are easily verified if the smoothing parameter vector θ is assumed to be known: matrix L is then a known constant matrix and the model is equivalent to a linear homoscedastic normal model. In the general case, matrix L is a function of the smoothing parameters (recall that L is the n × n lower triangular matrix whose elements in the main diagonal are equal to 1 and l_{i,j} = α(1 + β(i − j)) if i > j), and each component of L is continuous and differentiable with respect to θ = (α, β)′. The components of L^{−1} are bivariate polynomial functions of (α, β); therefore, they are also continuous and differentiable, and likewise for the components of X = L^{−1}A and P_X. Hence the function Q(Ψ | Ψ^(i)), expression (12), is continuous and differentiable. Using similar reasoning, the same properties can be verified for the function Ll(Ψ), which is the log-likelihood of a multivariate normal with covariance matrix given by a submatrix of σ²LL′. Finally, compactness follows from the continuity of Ll(Ψ) and the bound restrictions on the smoothing parameters (α, β), which are restricted to the unit square.
If the log-likelihood Ll(Ψ) is unimodal in Ω with Ψ* being the only stationary point, then for any EM sequence {Ψ_p}, Ψ_p converges to the unique maximiser Ψ* of Ll(Ψ). In general, however, the log-likelihood Ll(Ψ) could have several maxima and stationary points, and in such a case the convergence of the EM sequence to either type of point depends on the choice of starting points. A multi-start strategy, as is usual for any other general optimisation algorithm, is also advisable for the EM algorithm.
Fig. 1. Time plot together with the predictions obtained for the time series number 160 of the `other' series of the M3 Competition.
Table 1
Prediction errors for eight steps ahead for time series number 160, with and without missing values.
Table 2
Average SMAPE for the 174 `other' series (columns give the forecasting horizons 1–6 and 8, and the averages over horizons 1–4, 1–6 and 1–8).

Method           1    2    3    4    5    6    8    1–4   1–6   1–8
Holt             1.9  2.9  3.9  4.7  5.8  5.6  7.2  3.32  4.13  4.81
Dampen Holt      1.8  2.7  3.9  4.7  5.8  5.4  6.6  3.28  4.06  4.61
Complete data    1.9  2.4  2.9  3.4  3.9  4.1  4.9  2.68  3.13  3.51
Missing data     1.9  2.4  2.9  3.4  3.9  4.1  4.9  2.67  3.12  3.51
With the aim of checking the performance of our approach to the prediction of time series with missing data, we consider
the `other' series from the M3 Competition (Makridakis et al., 2000). The M3 Competition, the last of the M-Competitions, was
an empirical study which compared the performance of 24 forecasting methods proposed by experts. The competition was
composed of 3003 real time series, mostly economic and business ones, classified as yearly series (645 series), quarterly series
(756 series), monthly series (1428 series) and `other' series (174 series). The series are all strictly positive and their length is quite
short. Specifically, the median length for the 645 yearly series is 19 observations, with a minimum of 14 observations. Those
values are 44 and 16 observations for quarterly series, 115 and 48 observations for monthly series and 63 and 60 observations
for `other' series. The series are available on the web page of the International Institute of Forecasters.
First we present the results obtained in the prediction of time series number 160 of the `other' series. The length of the time series is 63 observations and we suppose that the observations at times 48, 53, 59 and 61 are missing. We consider values
from the end of the time series as missing observations since those values are the most significant ones in the prediction of
future values. Fig. 1 shows the time plot of this time series together with the point forecasts for the next eight steps. The points
emphasised correspond to the four missing observations.
The results in Table 1 give the prediction errors obtained using the missing-values approach presented above, together with the prediction errors obtained when the complete time series is considered and the results from Section 2 are applied. Using the complete data series the prediction fit is slightly better than when missing data are present, but the difference is smaller than might be expected.
Table 2 shows the average prediction error for the 174 `other' time series of the M3 Competition when we assume that the time series have missing data and we apply the EM procedure introduced here. We treat as missing the values at positions n − 15, n − 10, n − 4 and n − 2, where n is the length of the time series. We compare these results with those obtained using the complete data series, from both the approach introduced in Section 2 and the exponential smoothing methods that took part in the M3 Competition: Holt (an automatic Holt linear exponential smoothing method) and Dampen Holt (a damped-trend exponential smoothing method).
To measure the fitting and forecast accuracy we use the symmetric mean absolute percentage error, SMAPE, defined as

    SMAPE = (200/k) Σ_{t=n−k+1}^{n} |y_t − ŷ_t| / (y_t + ŷ_t)
where yt is the observation at time t and ŷt is its forecast. We choose the SMAPE because it is scale independent, symmetric and
bounded: it fluctuates between −200% and 200%. It is also the measure of accuracy used in the M3 Competition.
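As a direct transcription of this definition, a small R helper (ours) is

    smape <- function(y, yhat) {
      200 * mean(abs(y - yhat) / (y + yhat))   # symmetric MAPE, in percent
    }

applied to the k values being evaluated.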
Table 2 shows that the Holt and Dampen Holt methods, whose results are reported in Makridakis and Hibon (2000), have worse forecast accuracy than the methods introduced in this paper. This is mainly because we use a complete maximum likelihood estimation procedure: both the initial conditions and the smoothing parameters of the model are treated as decision variables when maximising the likelihood function.
4. Data transformation selection

The Holt model (4) assumes that the errors are homoscedastic, but on some occasions it is more suitable to assume that the error variance depends on the level, that is, V(ε*_t) = (μ_t σ)², so that the variance increases with the level. In those cases, the observation at time t comes from the random variable

    Y_t = μ_t + ε*_t = μ_t + μ_t (ε*_t / μ_t) = μ_t (1 + ε*_t / μ_t) = μ_t (1 + ε_t)      (16)

with ε_t = ε*_t / μ_t being an error with variance σ², for t = 1, 2, ..., n. This model is known as Holt's model with multiplicative errors.
Taking logarithms in the previous equation, we obtain

    ln Y_t = ln μ_t + ln(1 + ε_t) = ln μ_t + e_t

where e_t = ln(1 + ε_t) is a random variable that can be assumed to follow a normal distribution with zero mean.
Holt's model with additive errors is then adequate, not for the raw data, but when we work with the logarithm of the given data. It may, therefore, be suitable to forecast using the transformed time series and then obtain the forecasts for the original time series by applying the inverse transformation. Other transformations, not only the logarithmic one, could also be of interest.
When it is advisable to work with data transformations it is necessary to have a mechanism which allows us to decide what the
best transformation is. Until now, the mechanism that has generally been used is to select the transformation with the minimum
fitting error from the time series data. This choice, however, guarantees the best fitting but not necessarily the best prediction.
Alternatively, a cross-validation study could be of interest.
The usual way of doing cross-validation is to select a proper subset of data as the training set, using it to estimate the
parameters of the model, and cross-validate the results using the remaining data. When the data set only contains a few cases,
a different method of cross-validation is necessary. The `leave-one-out' method leaves out only one data point from the training
set, uses it to cross-validate, and repeats the procedure several times with a different validation data point each time.
Most of the time series to be forecast in applications, especially industrial applications, are very short, so the training set has to be almost the complete time series. For this reason, we propose a leave-one-out procedure to select the data transformation in time series analysis. The training set of such a procedure consists of a time series with one missing value, and the EM algorithm proposed in the previous section can be applied to analyse it. An empirical study using the yearly data series
from the M3 Competition shows the performance of our transformation selection procedure. All those 645 yearly time series are
very short, their median length being 19 observations.
Let n be the length of the time series and let k be an integer such that 0 < k ≤ n. Our data transformation selection procedure can be described as follows. For each transformed time series, each of the last k observations is in turn treated as missing, the EM algorithm is used to predict it, and the prediction error is accumulated on the original scale; the transformation with the smallest accumulated error is selected (a schematic implementation is given below).
Table 3
Average SMAPE for the 645 yearly series.

Forecasting horizon:  1   2   3   4   5   6    Average:  1–4   1–6
The value of k is usually set equal to n in leave-one-out procedures, although smaller values are also common in order to speed up the procedure. In this application, as the first data in a time series are not very informative for forecasting, we propose to use only the most recent observations in our leave-one-out procedure.
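The following R sketch summarises our reading of the procedure, building on the em_holt sketch of Section 3; the candidate transformations are the four considered below (raw data, logarithm, square root and square), while the use of the squared error on the original scale and the function names are our assumptions where the text does not fix them.

    select_transformation <- function(y, k = 10) {
      n <- length(y)
      trans <- list(
        identity = list(f = identity,           inv = identity),
        log      = list(f = log,                inv = exp),
        sqrt     = list(f = sqrt,               inv = function(z) z^2),
        square   = list(f = function(z) z^2,    inv = sqrt)
      )
      cv_error <- sapply(trans, function(tr) {
        z <- tr$f(y)
        errs <- sapply((n - k + 1):n, function(t) {
          z_mis <- z; z_mis[t] <- NA              # leave observation t out
          fit <- em_holt(z_mis)                   # EM algorithm of Section 3
          (tr$inv(fit$fitted[t]) - y[t])^2        # error on the original scale
        })
        mean(errs)
      })
      names(which.min(cv_error))                  # transformation with smallest error
    }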
The row `DT Holt' in Table 3 shows the forecasting accuracy obtained by our data transformation selection mechanism in the
prediction of the 645 yearly time series in the M3 Competition. We consider the raw data, the logarithm transformation, the
square root and the square of the data and set k = 10, although we have obtained very similar results with values of k from 8
to 14, the length of the shortest yearly series in the M3 Competition. Our results are compared with those obtained from the
methods with best forecasting accuracy in the M3 Competition: RBF, ForecastX, Autobox2, Theta and Robust-Trend. Table 3 also
shows the results from the Dampen Holt method, the best exponential smoothing method in the M3 Competition.
RBF is a rule-based forecasting procedure that uses three methods (random walk, linear regression and Holt's) to estimate
level and trend, involving corrections, simplification, automatic feature identification and re-calibration. ForecastX and Autobox2
are commercially available forecasting packages. ForecastX runs tests for seasonality and outliers and selects from among sev-
eral methods: exponential smoothing, Box–Jenkins and Croston's method. Autobox2 is a robust ARIMA univariate Box–Jenkins
with/without intervention detection. The Theta method is based on a specific decomposition technique, projection and combi-
nation of the individual components. It has been proved that the method is a special case of single exponential smoothing with
drift where the drift parameter is half the slope of the linear trend fitted to the data. Robust-Trend is a non-parametric version
of Holt's linear model with a median-based estimate of trend. It is worth pointing out here that all the competitors in the M3
Competition had complete freedom to manipulate the data looking for transformations, outliers, etc.
Note that, except for the one-year forecasting horizon, the average of the SMAPE prediction errors that we obtain is smaller
than that obtained by the other methods. Therefore, the data transformation selection procedure proposed in this paper, which
is easy to implement, obtains better results than the usual methods.
5. Concluding remarks
With our formulation for Holt's model with additive errors, we can obtain the maximum likelihood estimators of the smoothing
parameters and the initial conditions jointly. We only need to solve one optimisation problem with respect to the smoothing
parameters, and then the maximum likelihood estimators of the other parameters are found analytically. Point forecasts are then
easily computed by plugging the maximum likelihood estimates of the parameters into the expected-value forecasts. This formulation also allows us to solve the problem of missing data, using the EM algorithm to obtain the maximum likelihood estimates of the unknowns in the model while taking into account the uncertainty caused by the missing data.
The numerical results obtained in the prediction of the yearly time series in the M3 Competition show that the algorithm
proposed in this paper performs well. Except for the one-year forecasting horizon, the average SMAPE prediction error obtained
with our algorithm is smaller than the average given by the other methods. The data transformation selection procedure obtains better results than the usual methods, is easy to implement and has a moderate computational cost; therefore, its use is recommended.
Acknowledgements
We would like to acknowledge Grant no. MTM2008-03993 from the Ministerio de Ciencia e Innovación of Spain. Ana Corberán-
Vallet's research was supported by the Generalitat Valenciana, Grant CTBPRB/2005/006. We are also indebted to an anonymous
referee for all their helpful comments, which have improved our paper.
References
Bermúdez, J.D., Segura, J.V., Vercher, E., 2006a. Improving demand forecasting accuracy using non-linear programming software. Journal of the Operational
Research Society 57, 94–100.
Bermúdez, J.D., Segura, J.V., Vercher, E., 2006b. A decision support system methodology for forecasting of time series based on soft computing. Computational
Statistics and Data Analysis 51, 177–191.
Bermúdez, J.D., Segura, J.V., Vercher, E., 2007. Holt–Winters forecasting: an alternative formulation applied to UK air passenger data. Journal of Applied Statistics
34, 1075–1090.
Bermúdez, J.D., Segura, J.V., Vercher, E., 2008. SIOPRED: a prediction and optimisation integrated system for demand. TOP 16, 258–271.
Byrd, R.H., Lu, P., Nocedal, J., Zhu, C., 1995. A limited memory algorithm for bound constrained optimization. SIAM Journal on Scientific Computing 16,
1190–1208.
Chen, R., Liu, J.S., 1998. Sequential Monte Carlo methods for dynamic systems. Journal of the American Statistical Association 93, 1032–1044.
Cipra, T., Rubio, A., Trujillo, J., 1995. Holt–Winters method with missing observations. Management Science 41, 174–178.
Cipra, T., Romera, R., 1997. Kalman filter with outliers and missing observations. Test 6, 379–395.
Dempster, A.P., Laird, N.M., Rubin, D.B., 1977. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B 39, 1–38.
Gardner Jr., E.S., 2006. Exponential smoothing: the state of the art, Part II. International Journal of Forecasting 22, 637–666.
Gómez, V., Maravall, A., 1994. Estimation, prediction, and interpolation for nonstationary series with the Kalman filter. Journal of the American Statistical
Association 89, 611–624.
Gómez, V., Maravall, A., Peña, D., 1999. Missing observations in ARIMA models: skipping approach versus additive outlier approach. Journal of Econometrics 88,
341–363.
Harvey, A.C., 1989. Forecasting, Structural Time Series Models and the Kalman Filter. Cambridge University Press, Cambridge.
Hofmann, R., Tresp, V., 1998. Nonlinear time-series prediction with missing and noisy data. Neural Computation 10, 731–747.
Jones, R.H., 1980. Maximum likelihood fitting of ARMA models to time series with missing observations. Technometrics 22, 389–395.
Kohn, R., Ansley, C.F., 1986. Estimation, prediction, and interpolation for ARIMA models with missing data. Journal of the American Statistical Association 81,
751–761.
Makridakis, S., Hibon, M., 2000. The M3-competition: results, conclusions and implications. International Journal of Forecasting 16, 451–476.
Makridakis, S., Ord, K., Hibon, M., 2000. The M3-competition. International Journal of Forecasting 16, 433–436.
Ord, J.K., Koehler, A.B., Snyder, R.D., 1997. Estimation and prediction for a class of dynamic nonlinear statistical models. Journal of the American Statistical
Association 92, 1621–1629.
R Development Core Team, 2008. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria.
Seber, G.A.F., 1984. Multivariate Observations. Wiley, New York.
Segura, J.V., Vercher, E., 2001. A spreadsheet modeling approach to the Holt–Winters optimal forecasting. European Journal of Operational Research 131,
375–388.
Wright, D.J., 1986. Forecasting data published at irregular time intervals using extension of Holt's method. Management Science 32, 499–510.
Wu, C.F.J., 1983. On the convergence properties of the EM algorithm. The Annals of Statistics 11, 95–103.