Introduction to Time Series
Introduction
An important step in analyzing Time Series data is to consider the types of data
patterns, so that the models most appropriate to those patterns can be utilized. Four
types of time series components can be distinguished. They are
(i) Horizontal, when data values fluctuate around a constant value
(ii) Trend, when there is a long-term increase or decrease in the data
(iii) Seasonal, when a series is influenced by seasonal factors and recurs on a regular periodic basis
(iv) Cyclical, when the data exhibit rises and falls that are not of a fixed period
Note that many data series include combinations of the preceding patterns. After separating out the identifiable patterns in any time series data, whatever remains unidentified forms the 'random' or 'error' component. A time plot (data plotted over time) and a seasonal plot (data plotted against the individual seasons in which the data were observed) help in visualizing these patterns while exploring the data. A crude yet practical way of decomposing the original data (ignoring the cyclical pattern) is to go for a seasonal decomposition, assuming either an additive or a multiplicative model, viz.
Yt = Tt + St + Et (additive) or Yt = Tt × St × Et (multiplicative)
where Yt is the observed value, Tt is the trend component, St is the seasonal component and Et is the error (random) component at time t.
If the magnitude of a Time Series varies with the level of the series then one has to go
for a multiplicative model else an additive model. This decomposition may enable one
to study the Time Series components separately or will allow workers to de-trend or to
do seasonal adjustments if needed for further analysis.
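As a rough illustration of such a decomposition, the following sketch uses Python's statsmodels (the write-up itself does not prescribe any software); the monthly series, its period of 12 and all variable names are hypothetical.

import numpy as np
import pandas as pd
from statsmodels.tsa.seasonal import seasonal_decompose

# Hypothetical monthly series with a trend and a seasonal pattern
rng = np.random.default_rng(0)
idx = pd.date_range("2000-01", periods=120, freq="MS")
y = pd.Series(50 + 0.5 * np.arange(120)
              + 10 * np.sin(2 * np.pi * np.arange(120) / 12)
              + rng.normal(0, 2, 120), index=idx)

add = seasonal_decompose(y, model="additive", period=12)        # Y = T + S + E
mul = seasonal_decompose(y, model="multiplicative", period=12)  # Y = T x S x E

detrended = y - add.trend               # de-trended series
seasonally_adjusted = y - add.seasonal  # seasonally adjusted series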
A moving average is simply a numerical average of the last N data points. Variants such as the prior moving average and the centered moving average are used in time series work. In general, the moving average at
time t, taken
over N periods, is given by
Mt[1] = (Yt + Yt-1 + … + Yt-N+1) / N
where Yt is the observed response at time t. Another way of stating the above equation is
Mt[1] = Mt-1[1] + (Yt − Yt-N) / N
At each successive time period the most recent observation is included and the farthest
observation is excluded for computing the average. Hence the name ‘moving’ averages.
The simple moving average is intended for data with a constant level and no trend. If the data have a linear or quadratic trend, the simple moving average will be misleading.
In order to correct for the bias and develop an improved forecasting equation, the
double moving average can be calculated. To calculate this, simply treat the moving
averages Mt[1] over time as individual data points and obtain a moving average of these
averages.
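A minimal sketch of the single and double moving averages, assuming pandas is available and y is a pandas Series (the data values below are hypothetical):

import pandas as pd

def moving_average(y: pd.Series, N: int) -> pd.Series:
    # Mt[1] = (Yt + Yt-1 + ... + Yt-N+1) / N
    return y.rolling(window=N).mean()

def double_moving_average(y: pd.Series, N: int) -> pd.Series:
    # Treat the single moving averages as data points and average them again
    return moving_average(moving_average(y, N), N)

y = pd.Series([112.0, 118, 132, 129, 121, 135, 148, 148, 136, 119, 104, 118])
print(moving_average(y, 3))
print(double_moving_average(y, 3))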
Let the time series data be denoted by Y1, Y2,…,Yt. Suppose we wish to forecast
the next value of our time series Yt+1 that is yet to be observed with forecast for Yt
denoted by Ft. Then the forecast Ft+1 is based on weighting the most recent observation Yt with a weight α and weighting the most recent forecast Ft with a weight of (1 − α), where α is a smoothing constant (weight) between 0 and 1. Thus the forecast for the period t+1 is given by
Ft+1 = Ft + α(Yt − Ft)
Note that the choice of α has considerable impact on the forecast. A large value of α (say 0.9) gives very little smoothing in the forecast, whereas a small value of α (say 0.1) gives considerable smoothing. Alternatively, one can choose α from a grid of values (say α = 0.1, 0.2, …, 0.9) and pick the value that yields the smallest MSE.
If you expand the above model recursively, then Ft+1 comes out to be a function of α, the past Yt values and F1. So, with α and the past values of Yt known, the point of concern is how to initialize F1. One method of initialization is to use the
first observed value Y1 as the first forecast (F1=Y1) and then proceed. Another
possibility would be to average the first four or five values in the data set and use this as
the initial forecast. However, because the weight attached to this user-defined F1 is
minimal, its effect on Ft+1 is negligible.
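The recursion Ft+1 = Ft + α(Yt − Ft) can be written out directly; the sketch below (plain NumPy, hypothetical data) also runs the grid search over α suggested above, initializing with F1 = Y1.

import numpy as np

def ses_forecasts(y, alpha, f1=None):
    # f[t] is the forecast of y[t]; f[-1] is the forecast of the next, unobserved value
    f = np.empty(len(y) + 1)
    f[0] = y[0] if f1 is None else f1                 # F1 = Y1 unless supplied
    for t in range(len(y)):
        f[t + 1] = f[t] + alpha * (y[t] - f[t])       # Ft+1 = Ft + alpha*(Yt - Ft)
    return f

y = np.array([71.0, 70, 69, 68, 64, 65, 72, 78, 75, 75, 75, 70])
best = min((np.mean((y[1:] - ses_forecasts(y, a)[1:-1]) ** 2), a)
           for a in np.arange(0.1, 1.0, 0.1))
print("alpha with smallest MSE:", round(best[1], 1))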
Holt's linear exponential smoothing allows forecasting of data with trends. Its forecast is found by adding two more equations to the SES model, one for the level and one for the trend. The smoothing parameters (weights) α and β can be chosen from a grid of values (say, each combination of α = 0.1, 0.2, …, 0.9 and β = 0.1, 0.2, …, 0.9), selecting the combination of α and β that corresponds to the lowest MSE.
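The level and trend equations are not displayed in the text; the sketch below uses the standard textbook forms of Holt's method (an assumption), together with the grid search over α and β described above. Data and initial values are hypothetical.

import numpy as np

def holt_forecasts(y, alpha, beta):
    level, trend = y[0], y[1] - y[0]       # a common initialization choice
    forecasts = [y[0], level + trend]      # forecasts[t] is the forecast of y[t]
    for t in range(1, len(y)):
        prev_level = level
        level = alpha * y[t] + (1 - alpha) * (prev_level + trend)  # level equation
        trend = beta * (level - prev_level) + (1 - beta) * trend   # trend equation
        forecasts.append(level + trend)    # one-step-ahead forecast
    return np.array(forecasts)

y = np.array([143.0, 152, 161, 139, 137, 174, 142, 141, 162, 180])
grid = [(a, b) for a in np.arange(0.1, 1.0, 0.1) for b in np.arange(0.1, 1.0, 0.1)]
mse = {(a, b): np.mean((y[1:] - holt_forecasts(y, a, b)[1:len(y)]) ** 2)
       for a, b in grid}
print("best (alpha, beta) by MSE:", min(mse, key=mse.get))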
The Dickey-Fuller test for stationarity is based on the regression
y't = φ yt-1 + b1 y't-1 + … + bp y't-p
where y't denotes the differenced series (yt − yt-1). The number of terms in the regression, p, is usually set to be about 3. Then if φ is nearly zero, the original series yt needs differencing, and if φ < 0, then yt is already stationary.
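In practice the test can be run with statsmodels (an assumption; the text only names the test). The null hypothesis is non-stationarity, so a large p-value suggests the series should be differenced. The random-walk series below is hypothetical.

import numpy as np
from statsmodels.tsa.stattools import adfuller

rng = np.random.default_rng(0)
y = np.cumsum(rng.normal(size=200))          # a random walk, hence non-stationary

adf_stat, p_value, *rest = adfuller(y, maxlag=3)
print(f"ADF statistic = {adf_stat:.3f}, p-value = {p_value:.3f}")
if p_value > 0.05:
    print("Series appears non-stationary; difference it and test again.")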
Autocorrelation functions
Autocorrelation
Autocorrelation refers to the way the observations in a time series are related to each other and is measured by the simple correlation between the current observation (Yt) and the observation from p periods before the current one (Yt-p). That is, for a given series Yt, the autocorrelation at lag p = correlation(Yt, Yt-p) and is given by
rp = Σ (Yt − Ȳ)(Yt+p − Ȳ) / Σ (Yt − Ȳ)²
where the numerator sum runs over t = 1 to n−p, the denominator sum runs over t = 1 to n, and Ȳ is the mean of the series.
It ranges from −1 to +1. Box and Jenkins have suggested that the maximum number of useful rp is roughly n/4, where n is the number of periods for which information on Yt is available.
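The formula above translates directly into code; a sketch with NumPy and a hypothetical series:

import numpy as np

def autocorr(y: np.ndarray, p: int) -> float:
    ybar = y.mean()
    num = np.sum((y[:len(y) - p] - ybar) * (y[p:] - ybar))   # sum over t = 1..n-p
    den = np.sum((y - ybar) ** 2)                            # sum over t = 1..n
    return num / den

y = np.array([13.0, 8, 15, 4, 4, 12, 11, 7, 14, 12])
r = [autocorr(y, p) for p in range(1, len(y) // 4 + 1)]      # roughly n/4 useful lags
print(np.round(r, 3))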
Partial autocorrelation
Partial autocorrelations are used to measure the degree of association between yt and yt-p when the effects of the other time lags 1, 2, 3, …, p−1 are removed. Note that usually an order of up to 2 for p, d, or q is sufficient for developing a good model in practice.
Theoretical ACFs and PACFs (Autocorrelations versus lags) are available for the
various candidate models. Thus, compare the correlograms (plots of sample ACFs versus lags) with these theoretical ACFs/PACFs to find a reasonably good match, and tentatively select one or more ARIMA models. The general characteristics of theoretical ACFs and PACFs are as follows (here a 'spike' represents the line at a given lag in the plot, with length equal to the magnitude of the autocorrelation):
(i) AR(p): the ACF declines gradually (tails off), while the PACF has spikes up to lag p and then cuts off.
(ii) MA(q): the ACF has spikes up to lag q and then cuts off, while the PACF declines gradually.
(iii) ARMA(p, q): both the ACF and the PACF tail off gradually.
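Sample correlograms for a candidate series can be drawn as in the sketch below (statsmodels and matplotlib are assumptions, and the AR(1) series is simulated purely for illustration); the plots are then matched against the theoretical patterns listed above.

import numpy as np
import matplotlib.pyplot as plt
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf

rng = np.random.default_rng(1)
y = np.zeros(200)
for t in range(1, 200):                    # simulate an AR(1) series with phi = 0.7
    y[t] = 0.7 * y[t - 1] + rng.normal()

fig, axes = plt.subplots(2, 1, figsize=(7, 6))
plot_acf(y, lags=20, ax=axes[0])           # ACF: expect gradual (geometric) decay
plot_pacf(y, lags=20, ax=axes[1])          # PACF: expect a single spike at lag 1
plt.tight_layout()
plt.show()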
ARIMA modeling
In general, an ARIMA model is characterized by the notation ARIMA(p, d, q), where p, d and q denote the orders of auto-regression, integration (differencing) and moving average, respectively. In ARIMA parlance, the time series is a linear function of past actual
values and random shocks. For instance, given a time series process {y t}, a first order
auto-regressive process is denoted by ARIMA (1,0,0) or simply AR(1) and is given by
yt = μ + φ1 yt-1 + εt
and a first order moving average process is denoted by ARIMA (0,0,1) or simply MA(1)
and is given by
yt = μ − θ1 εt-1 + εt
Alternatively, the model ultimately derived, may be a mixture of these processes and of
higher orders as well. Thus a stationary ARMA (p, q) process is defined by the
equation
yt = φ1 yt-1 + φ2 yt-2 + … + φp yt-p − θ1 εt-1 − θ2 εt-2 − … − θq εt-q + εt
where the εt's are independently and normally distributed with zero mean and constant variance σ² for t = 1, 2, …, n. Note here that the values of p and q, in practice, lie between
0 and 3.
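The AR(1) and MA(1) processes written above can be simulated with statsmodels (an assumption); ArmaProcess expects lag-polynomial coefficients, so the signs below follow that convention rather than the one used in the equations. The parameter values are hypothetical.

import numpy as np
from statsmodels.tsa.arima_process import ArmaProcess

# AR(1) with phi1 = 0.6: lag polynomial (1 - 0.6B)
ar1 = ArmaProcess(ar=np.array([1, -0.6]), ma=np.array([1]))
# MA(1) with theta1 = 0.4 in the text's sign convention (yt = -theta1*eps_{t-1} + eps_t)
ma1 = ArmaProcess(ar=np.array([1]), ma=np.array([1, -0.4]))

y_ar1 = ar1.generate_sample(nsample=200)
y_ma1 = ma1.generate_sample(nsample=200)
print(y_ar1[:5], y_ma1[:5])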
Identification of relevant models and inclusion of suitable seasonal variables are
necessary for seasonal modeling and their applications, say, forecasting production of
crops. Seasonal forecasts of the production of principal crops are of great utility to planners, administrators and researchers alike. Agricultural seasons vary significantly among the states of India. For example, rice is grown in three seasons in some states, whereas in other states the crop is cultivated in two seasons. Forecasts of rice and
other seasonal crops’ production can be made by developing seasonal ARIMA models.
Identification
The foremost step in the process of modeling is to check for the stationarity of
the series, as the estimation procedures are available only for stationary series. There
are two kinds of stationarity, viz., stationarity in ‘mean’ and stationarity in ‘variance’. A
cursory look at the graph of the data and the structure of the autocorrelation and partial autocorrelation coefficients may provide clues to the presence of stationarity. Another way of checking for stationarity is to fit a first-order autoregressive model to the raw data and test whether the coefficient φ1 is less than one. If the model is found to be non-stationary, stationarity can mostly be achieved by differencing the series; alternatively, a Dickey-Fuller test can be used (see section 4). Stationarity in variance can be achieved by some mode of transformation, say, a log transformation. This is applicable for both seasonal and non-seasonal stationarity.
Thus, if ‘X t’ denotes the original series, the non-seasonal difference of first order is
Y t = X t – X t-1
followed by the seasonal differencing (if needed)
Z t = Yt – Y t—s = (X t – X t-1) – (X t-s - Xt-s-1)
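With pandas (an assumption) the two differencing steps are one line each; the monthly series X and seasonal period s = 12 below are hypothetical.

import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
X = pd.Series(rng.normal(size=60).cumsum() + np.tile(np.arange(12.0), 5),
              index=pd.date_range("2005-01", periods=60, freq="MS"))

Y = X.diff(1)      # Yt = Xt - Xt-1   (non-seasonal first difference)
Z = Y.diff(12)     # Zt = Yt - Yt-12  (seasonal difference, s = 12)
print(Z.dropna().head())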
The next step in the identification process is to find the initial values for the orders of
seasonal and non-seasonal parameters, p, q, and P, Q. They could be obtained by
looking for significant autocorrelation and partial autocorrelation coefficients (see
section 5 (iii)). Say, if the second-order autocorrelation coefficient is significant, then an AR(2), MA(2) or ARMA(2) model could be tried to start with. This is not a hard and fast rule, as sample autocorrelation coefficients are poor estimates of the population autocorrelation coefficients. Still, they can be used as initial values, with the final models arrived at after going through the stages repeatedly.
Estimation
At the identification stage one or more models are tentatively chosen that seem
to provide statistically adequate representations of the available data. Then we attempt
to obtain precise estimates of the parameters of the model by least squares, as advocated by Box and Jenkins. Standard computer packages like SAS are available for finding the estimates of the relevant parameters using iterative procedures. The methods of estimation are not discussed here for brevity.
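As a sketch of the estimation step with freely available software (statsmodels rather than SAS, an assumption), a tentatively identified ARIMA(1,1,1) can be fitted as below; the series and the order are purely illustrative.

import numpy as np
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

y = pd.Series(np.random.default_rng(3).normal(size=120).cumsum())

model = ARIMA(y, order=(1, 1, 1))        # tentative ARIMA(p=1, d=1, q=1)
result = model.fit()                     # iterative maximum-likelihood estimation
print(result.summary())                  # parameter estimates and standard errors
print(result.forecast(steps=4))          # forecasts from the fitted model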
Diagnostics
A good model is indicated by low values of the Akaike Information Criterion (AIC), the Bayesian Information Criterion (BIC) and the Schwarz-Bayesian Criterion (SBC).
AIC is given by AIC = (−2 log L + 2m), where m = p + q + P + Q and L is the likelihood function. Since −2 log L is approximately equal to {n(1 + log 2π) + n log σ²}, where σ² is the model MSE, AIC can be written as AIC = {n(1 + log 2π) + n log σ² + 2m}, and because the first term in this equation is a constant, it is usually omitted while comparing models. As an alternative to AIC, the SBC is sometimes used, which is given by SBC = log σ² + (m log n)/n.
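A sketch of comparing candidate orders by AIC/BIC, again assuming statsmodels and a hypothetical series; the grid 0-2 reflects the earlier remark that orders above 2 or 3 are rarely needed.

import numpy as np
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

y = pd.Series(np.random.default_rng(4).normal(size=150).cumsum())

results = []
for p in range(3):
    for q in range(3):
        fit = ARIMA(y, order=(p, 1, q)).fit()
        results.append(((p, 1, q), fit.aic, fit.bic))

best = min(results, key=lambda row: row[1])   # lowest AIC
print("best (p, d, q) by AIC:", best)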
After a tentative model has been fitted to the data, it is important to perform diagnostic checks to test the adequacy of the model and, if need be, to suggest potential improvements. One way to accomplish this is through the analysis of residuals. It has been found effective to measure the overall adequacy of the chosen model by examining a quantity Q known as the Box-Pierce statistic (a function of the autocorrelations of the residuals), whose approximate distribution is chi-square and which is computed as follows:
Q = n Σ r²(j)
where the summation extends from j = 1 to k, with k as the maximum lag considered, n is the number of observations in the series and r(j) is the estimated autocorrelation at lag j; k can be any positive integer and is usually around 20. Q follows a chi-square distribution with (k − m1) degrees of freedom, where m1 is the number of parameters estimated in the model. A modified Q statistic is the Ljung-Box statistic, which is given by
Q* = n(n + 2) Σ r²(j)/(n − j)
with the summation again extending from j = 1 to k.
The Q statistic is compared to critical values from the chi-square distribution. If the model is correctly specified, the residuals should be uncorrelated and Q should be small (the probability value should be large). A significant value indicates that the chosen model does not fit well. All these stages require considerable care and work, and they themselves are not exhaustive.
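The Box-Pierce and Ljung-Box checks on the residuals of a fitted model can be run as follows (a sketch assuming statsmodels; the model and data are hypothetical, and lags up to about 20 follow the suggestion above).

import numpy as np
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA
from statsmodels.stats.diagnostic import acorr_ljungbox

y = pd.Series(np.random.default_rng(5).normal(size=150).cumsum())
resid = ARIMA(y, order=(1, 1, 0)).fit().resid

lb = acorr_ljungbox(resid, lags=[10, 20], boxpierce=True)
print(lb)    # small statistics / large p-values indicate an adequate model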
Artificial neural networks (ANNs) have recently received a great deal of attention in
many fields of study, like engineering, medical science, and economics. Excitement
stems from the fact that these networks are attempts to model the capabilities of the human brain, which has approximately ten billion neurons acting in parallel. Neurons, the basic computational units of the brain, are highly interconnected, with a typical neuron being connected to several thousand other neurons. As opposed to this, ANNs rarely have
more than a few hundred or a few thousand neurons. So, networks comparable to a
human brain in complexity are still far beyond the capability of the fastest, most highly
parallel computers in existence. An ANN is a set of simple computational units (also
called nodes) that are highly interconnected. ANNs have been used for a wide variety of
applications where statistical methods are traditionally employed. They have been used
in classification problems, such as identifying underwater sonar currents, recognizing
speech, and predicting heart problems in patients. In time-series applications, ANNs
have been used in predicting stock market performance. They are currently a preferred tool for predicting protein secondary structures. Statisticians and users of statistics normally solve these problems through classical statistical methods, such as discriminant analysis, logistic regression, Bayes analysis, multiple regression, and ARIMA time-series models. It is, therefore, time to recognize the ANN as a powerful tool for
data analysis. An excellent overview of various aspects of ANN is provided by Warner
and Misra (1996) and Cheng and Titterington (1994).
The general form of an ANN is a “black box” model of a type that is often used to
model high-dimensional, nonlinear data. However, most ANNs are used to solve
prediction problems for some system, as opposed to formal model-building or
development of underlying knowledge of how the system works. For example, a
computer company might want to develop a procedure for automatically reading
handwriting and converting it to typescript. If the procedure can do this quickly and
accurately, the company may have little interest in the specific model used to do it.
Features of ANN
Third, ANNs are universal functional approximators. It has been shown that an
ANN can approximate any continuous function to any desired degree of accuracy.
ANNs have more general and flexible functional forms than the traditional statistical
methods can effectively deal with. Any forecasting model assumes that there exists an
underlying relationship between inputs and outputs. Frequently, traditional statistical
forecasting models have limitations in estimating this function due to complexity of the
real system. ANNs can be a good alternative method to identify this function.
Finally, ANNs are nonlinear. It is now well recognized that real world systems are by
and large nonlinear. During the last two decades or so, several nonlinear time-series
models, such as bilinear model, threshold autoregressive model, and autoregressive
conditional heteroscedastic model, have been developed. However, these nonlinear
models are still limited in that an explicit relationship for data series at hand has to be
hypothesized, with little knowledge of the underlying law. In fact, formulating a nonlinear model for a particular data set is a very difficult task, since there are too many possible nonlinear patterns and a prescribed nonlinear model may not be general enough to capture all the important features. The ANN, which is a nonlinear data-driven approach as opposed to the above model-based approach, is capable of performing nonlinear modelling without a priori knowledge about the relationships between input and output variables. Thus,
ANN is a more general and flexible modelling tool for forecasting.
ANN Modelling
Input into a node is a weighted sum of outputs from nodes connected to it. Thus
net input into a node is
Netinputi = Σj wij outputj + ui          (3)
where wij are the weights connecting neuron j to neuron i, outputj is the output from unit j and ui is a threshold for neuron i. The threshold term is the baseline input to a node in the absence of any other inputs. If a weight wij is negative, it is termed inhibitory because it decreases the net input; otherwise it is called excitatory.
Each unit takes its net input and applies an activation function to it. For example, the output of the j-th unit, also called the activation value of the unit, is g(Σi wji xi), where g(·) is the activation function and xi is the output of the i-th unit connected to unit j. A
number of nonlinear functions have been used in the literature as activation
functions. However, the most common choice is a sigmoid function, such as the logistic function
g(netinput) = 1/{1 + exp(−netinput)}          (2)
or
g(netinput) = tanh(netinput)
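Written out in code (plain NumPy, purely for reference), the two activation functions are:

import numpy as np

def logistic(netinput):
    return 1.0 / (1.0 + np.exp(-netinput))   # eq. (2): output in (0, 1)

def tanh_activation(netinput):
    return np.tanh(netinput)                 # output in (-1, 1)

x = np.linspace(-4, 4, 5)
print(logistic(x), tanh_activation(x))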
With no hidden units, an ANN can classify only linearly separable problems
(ones for which possible output values can be separated by global hyperplanes).
These are called perceptrons, and this negative result led people to assume that
nonlinear ANNs are not useful for general tasks. However, it has since been shown
that with one hidden layer, an ANN can describe any continuous function (if there
are enough hidden units), and that with two hidden layers, it can describe any
function.
ANNs discussed so far are constructed with layers of units, and thus are
termed multilayered ANNs. A layer of units in such an ANN is composed of units
that perform similar tasks. A feed-forward ANN is one where units in one layer are
connected only to units in the next layer, and not to units in a preceding layer or
units in the same layer. ANNs where the units are connected to other units in the
same layer, to units in the preceding layer, or even to themselves are termed
recurrent ANNs. Feed-forward ANNs can be viewed as a special case of recurrent
ANNs.
The first layer of a multilayer ANN consists of the input units, denoted by xi. These units are known as independent variables in the statistical literature. The last layer contains the output units, denoted by yk. In statistical nomenclature, these units are known as dependent or response variables. All other units in the model are called hidden units, hj, and constitute the hidden layers. A feed-forward ANN can have any number of hidden layers, with a variable number of hidden units per layer. When counting layers, it is common practice not to count the input layer, because it does not perform any computation but simply passes data on to the next layer. So an ANN with an input layer, one hidden layer and an output layer is termed a two-layer ANN.
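A minimal forward pass for such a two-layer feed-forward ANN (one hidden layer, logistic activations) might look as follows; the layer sizes, weights and input pattern are hypothetical and NumPy is an assumption.

import numpy as np

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
n_inputs, n_hidden, n_outputs = 3, 4, 2

w = rng.normal(scale=0.1, size=(n_hidden, n_inputs))    # input-to-hidden weights wji
W = rng.normal(scale=0.1, size=(n_outputs, n_hidden))   # hidden-to-output weights Wkj
u = np.zeros(n_hidden)                                   # hidden-unit thresholds
U = np.zeros(n_outputs)                                  # output-unit thresholds

x = np.array([0.2, 0.7, 0.1])   # one input pattern
h = w @ x + u                   # net input to the hidden units
v = logistic(h)                 # hidden-unit outputs
f = W @ v + U                   # net input to the output units
Y = logistic(f)                 # network outputs
print(Y)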
Back-Propagation algorithm
As the input units simply pass information on to the hidden units, the net input into the j-th hidden unit, by eq. (3), is
hpj = Σi wji xpi          (5)
Here N is the total number of input nodes, wji is the weight from input unit i to hidden unit j, and xpi is the value of the i-th input for pattern p. The j-th unit applies an activation function to its net input and outputs:
vpj = g(hpj) = 1/{1 + exp(−hpj)}          (6)
assuming g(·) is the sigmoid function given by eq. (2). Similarly, output unit k receives a net input of
fpk = Σj Wkj vpj          (7)
Here M is the number of hidden units, and Wkj represents the weight from hidden unit j to output unit k. The unit then outputs the quantity
Ypk = g(fpk) = 1/{1 + exp(−fpk)}          (8)
Eqs. (5) to (8) demonstrate that the objective (error) function E given by eq. (4) is a function of the unknown weights wji and Wkj. So we evaluate the partial derivative of the objective function with respect to the weights, and then move the weights in a direction down the slope, continuing until the error function no longer decreases. Mathematically, this can be expressed as
ΔWkj = −η ∂E/∂Wkj          (9)
The term η is known as the learning rate and simply scales the step size. Substituting eqs. (5) to (8) in eq. (4) and expanding eq. (9) using the chain rule, we get
∂Ypk/∂fpk = g′(fpk) = Ypk (1 − Ypk)
and
∂fpk/∂Wkj = vpj
Substituting these results back in eq. (9), the change in the weights from the hidden units to the output units is given by
ΔWkj = −η [−(ypk − Ypk)] Ypk (1 − Ypk) vpj = η (ypk − Ypk) Ypk (1 − Ypk) vpj
Wkj(t+1) = Wkj(t) + ΔWkj
Similarly, calculations for weights from inputs to hidden units can be carried out
as given in Warner and Misra (1996). Finally, the algorithms adopted from Hertz et
al. (1991) are as follows:
(i) Initialize the weights to small random values. This puts the output of each unit
around 0.5.
(ii) Choose a pattern p and propagate it forward. This yields values for vpj and Ypk, the outputs from the hidden layer and the output layer.
(iii) Compute the output errors: δpk = (ypk − Ypk) g′(fpk)
(iv) Compute the hidden-layer errors: δpj = [Σk δpk Wkj] vpj (1 − vpj)
(v) To update the weights, compute
ΔWkj = η δpk vpj   and   Δwji = η δpj xpi
(vi) Repeat the steps for each pattern.
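A compact sketch of steps (i) to (vi) for a single hidden layer, logistic activations and a squared-error objective (assumed, as in standard back-propagation); thresholds are omitted for brevity, and the data, targets, learning rate and layer sizes are hypothetical.

import numpy as np

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(1)
X = rng.uniform(size=(50, 3))                      # 50 hypothetical input patterns
T = np.column_stack([X.sum(axis=1) / 3,            # hypothetical targets in (0, 1)
                     X.prod(axis=1)])

n_hidden, eta = 5, 0.5
w = rng.normal(scale=0.1, size=(n_hidden, X.shape[1]))   # (i) small random weights
W = rng.normal(scale=0.1, size=(T.shape[1], n_hidden))

for epoch in range(500):
    for x, t in zip(X, T):                         # (ii) propagate a pattern forward
        v = logistic(w @ x)                        # hidden outputs vpj
        Y = logistic(W @ v)                        # network outputs Ypk
        delta_out = (t - Y) * Y * (1 - Y)          # (iii) output errors
        delta_hid = (W.T @ delta_out) * v * (1 - v)    # (iv) hidden-layer errors
        W += eta * np.outer(delta_out, v)          # (v) update hidden-to-output weights
        w += eta * np.outer(delta_hid, x)          #     and input-to-hidden weights
    # (vi) repeating over all patterns constitutes one training epoch

Y_all = logistic(logistic(X @ w.T) @ W.T)
print("final mean squared error:", np.mean((T - Y_all) ** 2))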
ANN Illustration
Artificial Neural Network (ANN) modelling methodology is utilized for modelling time-
series data and further forecasting. Based on data for the period 1950-51 to 2008-09, several ANN models were developed, making use of 70% of the data for training the model, 20% for testing and the remaining 10% for validating the model. Forecasting was carried out with the most efficient model, which put rice production for 2009-10 at 82.19 million tons, 16.9% lower than the rice production during 2008-09. This pre-harvest forecast of rice production was worked out with an ANN model using rainfall data for the months of June and July over the above period. Using zone-wise AICRIP multilocational trial data,
clustering of the various zones was carried out using ANN models for clustering and classification. Here, the target variable was estimated using classification function analysis (discriminant functions). The classification functions are used to determine to which group each case most likely belongs, and classification scores were also worked out for the grouping. Statistical models using the ANN modelling procedure for forecasting yield in the presence of pest incidence are also being worked out with the help of available data. Pal
et al. (2002) have proposed an ANN based forecasting model for maximum and
minimum temperatures at ground level at the Dum Dum station, Kolkata. It was assumed that the last two days' information is sufficient to forecast both maximum and minimum temperatures. However, temperature variation is also affected by several other factors, such as pollution. Standard statistical models are not able to forecast in such a situation; ANN methodology is capable of handling this problem. Daily data on several
variables, such as Mean sea level pressure, Vapour pressure, Relative humidity,
Maximum temperature, Minimum temperature, Rainfall, Direct radiation, and Diffuse
radiation for the period 1989-95 was considered. A total of 2285 records were used for
training the ANN and the remaining 270 records for testing it. In the training data, the first 25 fields of a record acted as the input data set for the input nodes of the ANN, and the remaining 2 fields of the record acted as the target data set for the output nodes in the output layer. After a good deal of effort, it was found that a two-layer feedforward ANN with one hidden layer of 20 nodes, trained with the back-propagation algorithm, is quite appropriate, as the average error is less than 2 °C for about 80% of the test and training cases.
Concluding Remarks
The general modeling procedures for time series and ANN modeling, and the subsequent forecasting, are discussed in this write-up. Many other concepts, like stationarity conditions, multivariate time series modelling and the cross-correlation function, and methods like regression with ARMA errors, ARIMA modeling with independent variables, transfer function analysis, intervention analysis, state space modeling and structural time series modelling, have immense application in agriculture and fisheries. Real-life data on fisheries can be utilized for the development of models, and further forecasting can be explored by making use of the time series family of models.
References
Blank, D.S. (1986). SAS system for forecasting time series. SAS Institute Inc., USA.
Box, G.E.P., Jenkins, G.M. and Reinsel, G.C. (1994). Time series analysis: Forecasting and control. Pearson Education, Delhi.
Hertz, J., Krogh, A. and Palmer, R.G. (1991). Introduction to the theory of neural computation. Santa Fe Institute Studies in the Sciences of Complexity (Vol. 1). Redwood City, CA: Addison-Wesley.
Pal, S., Das, J., Sengupta, P. and Banerjee, S.K. (2002). Short term prediction of atmospheric temperature using neural networks. Mausam, 53, 471-80.
Pankratz, A. (1983). Forecasting with univariate Box-Jenkins models: Concepts and cases. New York: John Wiley & Sons.
Zhang, G., Patuwo, B.E. and Hu, M.Y. (1998). Forecasting with artificial neural networks: The state of the art. International Journal of Forecasting, 14, 35-62.