
1 What is ARIMA?

Arima is the easternmost and second largest in area of the three boroughs of Trinidad and
Tobago. It is geographically adjacent to – wait, just kidding!

ARIMA stands for auto-regressive integrated moving average. It’s a way of modelling time
series data for forecasting (i.e., for predicting future points in the series), in such a way that:

• a pattern of growth/decline in the data is accounted for (hence the “auto-regressive” part)
• the rate of change of the growth/decline in the data is accounted for (hence the “integrated” part)
• noise between consecutive time points is accounted for (hence the “moving average” part)

Just as a reminder, “time series data” = data that is made up of a sequence of data points taken at
successive equally spaced points in time.
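Just to make that concrete, here is one way to represent such a sequence in R using the built-in ts() function (the numbers are made up purely for illustration, not taken from the article):

# A made-up monthly series: twelve equally spaced observations starting January 2020
y <- ts(c(10, 12, 15, 14, 18, 21, 19, 23, 27, 26, 30, 33),
        start = c(2020, 1), frequency = 12)
y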

1.1 A little historical background


Many decades ago, in a galaxy not too far away from here, statisticians analyzed time series data without taking into account how “nonstationarity” (read: growth/decline over time) might affect their analyses. Then George E. P. Box and Gwilym Jenkins came along and published a famous monograph called “Time Series Analysis: Forecasting and Control” in which they showed that nonstationary data could be made stationary (read: steady over time) by “differencing” the series. In this way, they could pull apart a juicy trend at a specific time period from a growth/decline that would be expected anyway, given the nonstationarity of the data.

More specifically, their approach involved considering a value Y at time point t and
adding/subtracting based on the Y values at previous time points (e.g., t-1, t-2, etc.), and also
adding/subtracting error terms from previous time points.

The formula itself looks like this:

Y^d_t = c + ϕ_1 Y^d_{t−1} + … + ϕ_p Y^d_{t−p} + θ_1 e_{t−1} + … + θ_q e_{t−q} + e_t
where “e” is an error term, “c” is a constant, and the superscript d marks the (possibly differenced) series.
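To get a feel for what such a process can look like, here is a small sketch (not from the article; the coefficients are arbitrary) that simulates an ARIMA(1,1,1) series with R’s built-in arima.sim() function:

set.seed(123)  # for reproducibility
# arbitrary illustrative coefficients: ar = 0.6, ma = 0.4, with d = 1
simulated <- arima.sim(model = list(order = c(1, 1, 1), ar = 0.6, ma = 0.4), n = 100)
plot(simulated)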

ARIMA models are typically expressed like “ARIMA(p,d,q)”, with the three terms p, d, and q
defined as follows:

• p means the number of preceding (“lagged”) Y values that have to be added/subtracted to Y in the model, so as to make better predictions based on local periods of growth/decline in our data. This captures the “autoregressive” nature of ARIMA.
• d represents the number of times that the data have to be “differenced” to produce a stationary signal (i.e., a signal that has a constant mean over time). This captures the “integrated” nature of ARIMA. If d=0, our data does not tend to go up/down in the long term (i.e., it is already “stationary”), and technically you are performing just ARMA, not AR-I-MA. If d is 1, the data tends to go up/down roughly linearly. If d is 2, the rate of that growth/decline is itself changing (as with an exponential-looking curve). More on this below…
• q represents the number of preceding/lagged values for the error term that are added/subtracted to Y. This captures the “moving average” part of ARIMA. (A short example of fitting such a model in R follows just after this list.)
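Once you have settled on p, d, and q, fitting the model in R can be done with the built-in arima() function. Here is a minimal sketch; the series and the (1,1,1) order are placeholders chosen purely for illustration:

# placeholder data: a made-up upward-drifting series
my_series <- ts(cumsum(rnorm(120, mean = 0.5)))
fit <- arima(my_series, order = c(1, 1, 1))   # order = c(p, d, q)
fit                                           # prints the fitted coefficients and AIC
predict(fit, n.ahead = 10)                    # forecast the next 10 time points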

4.2 Finding the d value - a.k.a., differencing the data to achieve stationarity
Given that we have non-stationary data, we will need to “difference” the data until we obtain a
stationary time series. We can do this with the “diff” function in R.

This basically takes a vector and, for each value in the vector, subtracts the previous value. So if
you have:

5 8 1 6 4

… then the “differenced” vector would be:

3 -7 5 -2

Of course, since the first value in the original vector did NOT have a previous number, this one
doesn’t get a corresponding value in the new, differenced vector. So, the differenced vector will
have one less data point.
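In R, that looks like this (a quick sketch using the toy vector above):

x <- c(5, 8, 1, 6, 4)
diff(x)   # returns 3 -7 5 -2, one fewer data point than x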

Now, our first step will only take what is known as the “first-order difference” – that is, the result of subtracting the previous Y value just once. In more formal mathematical terms,

Y^d_t = Y_t − Y_{t−1}
When we apply this to our example data, though, the result is still not stationary. This lingering nonstationarity is because our original data involved an exponential curve – that is, the “change of the change” changed! Differencing once only took out one level of change.
Let’s try taking the second-order difference instead. This basically takes our once-differenced
data and differences it a second time. Put formally:

Y^{d2}_t = Y^d_t − Y^d_{t−1} = (Y_t − Y_{t−1}) − (Y_{t−1} − Y_{t−2})
In effect, this gives you:

Y^{d2}_t = Y_t − 2Y_{t−1} + Y_{t−2}
In a sense, this gives us a measure of “the change of the change,” similar to the concept of
“acceleration” (as opposed to “velocity”). One could even get a little crazy and take a third-order
difference (corresponding to what physicists call “jerk”, or a change in acceleration over time), a
fourth-order difference, etc.
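In R, you can take the second-order difference either by calling diff() twice or by using its differences argument; here is a quick sketch with the same toy vector as before:

x <- c(5, 8, 1, 6, 4)
diff(diff(x))              # difference the once-differenced data
diff(x, differences = 2)   # same thing: returns -10 12 -7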

5 Using correlograms and partial correlograms to determine our p and q values
If/once you have a stationary time series, the next step is to select the appropriate ARIMA
model. This means finding the most appropriate values for p and q in the ARIMA(p,d,q) model.

(Remember: p refers to how many previous/lagged Y values are accounted for at each time point in our model, and q refers to how many previous/lagged error values are accounted for at each time point in our model.)

To do so, you need to examine the “correlogram” and “partial correlogram” of the stationary
time series.

A correlogram shows the AUTOCORRELATION FUNCTION. It’s just like a correlation, except that, rather than correlating two completely different variables, it’s correlating a variable at time t and that same variable at time t−k.

A partial correlogram is basically the same thing, except that it removes the effect of shorter autocorrelation lags when calculating the correlation at longer lags. To be more precise, the partial correlation at lag k is the autocorrelation between Y_t and Y_{t−k} that is NOT accounted for by the autocorrelations at lags 1 through k−1.

To plot a correlogram and partial correlogram, we can use the “acf()” and “pacf()” functions in R, respectively. F.Y.I., if you just want the actual values of the autocorrelations and partial autocorrelations without the plot, you can set “plot=FALSE” in the “acf()” and “pacf()” functions.
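For example (with a placeholder series made up just for illustration):

z <- ts(rnorm(100))     # placeholder series for illustration
acf(z, plot = FALSE)    # autocorrelation values only, no plot
pacf(z, plot = FALSE)   # partial autocorrelation values only, no plot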

For the purposes of this demonstration, let’s get the autocorrelations for the original, non-
stationary data as well as the once-differenced, stationary data.

Note that you can specify the maximum number of lags to be shown in the plot by specifying a
“lag.max” value:

acf(Diff2TwoSinesGoingUpExponentially, lag.max=30)
The little dotted blue lines mark the significance bounds; autocorrelations that extend beyond them are statistically significant.

In our case, it looks like our time series data repeatedly exceeds these bounds at certain lag points. There’s a recurring pattern involved. Not good!

Let’s check the partial correlogram, too:

pacf(Diff2TwoSinesGoingUpExponentially, lag.max=30)
Again, our data seems to follow a pattern at regular lag intervals. This is a sign that our data
involves some kind of seasonal component, which brings us to…
