1 What Is ARIMA?: 1.1 A Little Historical Background
Arima is the easternmost and second largest in area of the three boroughs of Trinidad and
Tobago. It is geographically adjacent to – wait, just kidding!
ARIMA stands for auto-regressive integrated moving average. It’s a way of modelling time
series data for forecasting (i.e., for predicting future points in the series), in such a way that:
- a pattern of growth/decline in the data is accounted for (hence the “auto-regressive” part)
- the rate of change of the growth/decline in the data is accounted for (hence the “integrated” part)
- noise between consecutive time points is accounted for (hence the “moving average” part)
Just as a reminder, “time series data” = data that is made up of a sequence of data points taken at
successive equally spaced points in time.
More specifically, their approach involved considering a value Y at time point t and
adding/subtracting based on the Y values at previous time points (e.g., t-1, t-2, etc.), and also
adding/subtracting error terms from previous time points.
$$Y_t = c + \phi_1 Y^d_{t-1} + \dots + \phi_p Y^d_{t-p} + \theta_1 e_{t-1} + \dots + \theta_q e_{t-q} + e_t$$
Where “e” is an error term and “c” is a constant.
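To make the recursion concrete, here is a minimal sketch that builds a series by hand for the simplest case, with one lagged Y value and one lagged error term, and no differencing. The coefficient values, the series length, and all variable names are illustrative assumptions, not taken from the text.

```r
# Hand-rolled version of the recursion with one lagged Y and one lagged error:
#   y[i] = const + phi1 * y[i-1] + theta1 * e[i-1] + e[i]
# All numeric values below are made up for illustration.
set.seed(42)
n      <- 200
const  <- 0.5   # the constant "c" (named const to avoid masking base::c)
phi1   <- 0.6   # weight on the previous Y value
theta1 <- 0.3   # weight on the previous error term

e <- rnorm(n)   # error terms
y <- numeric(n)
y[1] <- const + e[1]   # no earlier values are available at the first point

for (i in 2:n) {
  y[i] <- const + phi1 * y[i - 1] + theta1 * e[i - 1] + e[i]
}

head(y)
```

Each new value is the constant, plus a weighted copy of the previous value, plus a weighted copy of the previous error, plus a fresh error term.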
ARIMA models are typically expressed like “ARIMA(p,d,q)”, with the three terms p, d, and q
defined as follows:
This basically takes a vector and, for each value in the vector, subtracts the previous value. So if
you have:
5 8 1 6 4
then differencing gives you:
3 -7 5 -2
Of course, since the first value in the original vector did NOT have a previous number, this one
doesn’t get a corresponding value in the new, differenced vector. So, the differenced vector will
have one less data point.
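The small example above can be reproduced with base R's diff() function. This is just a sketch; the variable names are arbitrary.

```r
# First-order differencing of the example vector using base R's diff()
x  <- c(5, 8, 1, 6, 4)
d1 <- diff(x)            # each value minus the one before it

print(d1)                # 3 -7 5 -2
length(x) - length(d1)   # 1: the differenced vector is one point shorter
```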
Now, our first step will only take what is known as the “first-order difference”, that is, the
difference you get when you subtract the previous Y value just once. In more formal mathematical
terms,
$$Y^d_t = Y_t - Y_{t-1}$$
This lingering nonstationarity is because our original data involved an exponential curve; that
is, the “change of the change” changed! Differencing once only took out one level of change.
Let’s try taking the second-order difference instead. This basically takes our once-differenced
data and differences it a second time. Put formally:
$$Y^{d2}_t = Y^d_t - Y^d_{t-1} = (Y_t - Y_{t-1}) - (Y_{t-1} - Y_{t-2})$$
In effect, this gives you:
$$Y^{d2}_t = Y_t - 2Y_{t-1} + Y_{t-2}$$
In a sense, this gives us a measure of “the change of the change,” similar to the concept of
“acceleration” (as opposed to “velocity”). One could even get a little crazy and take a third-order
difference (corresponding to what physicists call “jerk”, or a change in acceleration over time), a
fourth-order difference, etc.
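As a quick check of the second-order difference, the sketch below computes it three ways on a small made-up vector: with the differences argument of base R's diff(), by differencing twice, and by applying the expanded formula directly. The vector and variable names are illustrative only.

```r
# Second-order differencing computed three equivalent ways
x <- c(5, 8, 1, 6, 4)
n <- length(x)

d2_builtin <- diff(x, differences = 2)   # diff() applied twice internally
d2_twice   <- diff(diff(x))              # difference of the differences
d2_formula <- x[3:n] - 2 * x[2:(n - 1)] + x[1:(n - 2)]   # Y_t - 2*Y_{t-1} + Y_{t-2}

print(d2_builtin)   # -10 12 -7
```

Note that the twice-differenced vector is two points shorter than the original.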
(Remember: p is the number of previous/lagged Y values accounted for at each time point in our
model, and q is the number of previous/lagged error values accounted for at each time point in
our model.)
To do so, you need to examine the “correlogram” and “partial correlogram” of the stationary
time series.
A partial correlogram is basically the same thing, except that it removes the effect of shorter
autocorrelation lags when calculating the correlation at longer lags. To be more precise, the
partial correlation at lag k is the autocorrelation between $Y_t$ and $Y_{t-k}$ that is NOT accounted for
by the autocorrelations from the 1st to the (k-1)th lags.
To plot a correlogram and partial correlogram, you can use the “acf()” and “pacf()” functions in
R, respectively. F.Y.I., if you just want the actual values of the autocorrelations and partial
autocorrelations without the plot, you can set “plot=FALSE” in the “acf()” and “pacf()” functions.
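Here is a minimal sketch of pulling out the numbers without a plot. The series x is a made-up noisy sine wave, used only as a stand-in; it is not the data used elsewhere in this tutorial.

```r
# Autocorrelations and partial autocorrelations as numbers, without plotting.
# x is a hypothetical stand-in series (noisy sine wave with period 12).
set.seed(1)
x <- sin(2 * pi * (1:120) / 12) + rnorm(120, sd = 0.1)

acf_vals  <- acf(x, lag.max = 10, plot = FALSE)
pacf_vals <- pacf(x, lag.max = 10, plot = FALSE)

print(acf_vals)    # autocorrelations at lags 0 to 10 (lag 0 is always 1)
print(pacf_vals)   # partial autocorrelations at lags 1 to 10

# The raw values live in the $acf component of each result
acf_numbers  <- as.numeric(acf_vals$acf)
pacf_numbers <- as.numeric(pacf_vals$acf)
```

One thing to watch for: acf() starts at lag 0, while pacf() starts at lag 1, so the two vectors differ in length by one.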
For the purposes of this demonstration, let’s get the autocorrelations for the original,
non-stationary data as well as the twice-differenced, stationary data.
Note that you can specify the maximum number of lags to be shown in the plot by specifying a
“lag.max” value:
acf(Diff2TwoSinesGoingUpExponentially, lag.max=30)
The little dotted blue lines mark the significance bounds: any autocorrelation that extends beyond
them is significantly different from zero.
In our case, it looks like the autocorrelations of our time series repeatedly exceed these bounds
at certain lag points. There’s a recurring pattern involved. Not good!
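If you would rather check this numerically than by eye, the sketch below lists the lags whose autocorrelation falls outside approximate 95% bounds of plus or minus qnorm(0.975)/sqrt(n), which is the default used when acf() draws its plot. The series x is a made-up noisy sine wave with period 12, an assumption for illustration and not the tutorial's data.

```r
# Find the lags whose autocorrelation exceeds the approximate 95% bounds.
# x is a hypothetical stand-in series (noisy sine wave with period 12).
set.seed(1)
n <- 120
x <- sin(2 * pi * (1:n) / 12) + rnorm(n, sd = 0.1)

r     <- acf(x, lag.max = 30, plot = FALSE)
bound <- qnorm(0.975) / sqrt(r$n.used)   # approximate 95% significance bound

lags <- as.numeric(r$lag)[-1]   # drop lag 0, which is always 1
vals <- as.numeric(r$acf)[-1]

significant_lags <- lags[abs(vals) > bound]
print(significant_lags)
```

For a series like this one, the significant lags cluster around multiples of half the period, which is the kind of recurring pattern described above.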
pacf(Diff2TwoSinesGoingUpExponentially, lag.max=30)
Again, our data seems to follow a pattern at regular lag intervals. This is a sign that our data
involves some kind of seasonal component, which brings us to…