
ARMA Modeling In SAS

An Example Using The Share Price Of GE


David Blankley
STA9750 Software Tools



[Figure: GE share price (vertical axis, 12 to 22) plotted against time (horizontal axis, 01JAN2010 to 28JAN2011)]

Table of Contents
ARMA Modeling In SAS
Table of Contents
Summary
Methodology and Results
    Data
    Model Identification
    Model Parameterization
    Prediction
    Creating Quality Graphical Output
Conclusion
Appendix I SAS Code
Appendix II Sources



Summary
The goal of this project is to extend knowledge of how to model time series in R to SAS.
For the underlying process, the project will build on the stock market homework
assignments of 9750 that developed a set of weekly returns for GE and compared them
to the SP500.

An Autoregressive Moving Average (ARMA) model will be identified for the stock price
of GE, and that model will then be used to forecast GE's share price one, two, three
and four weeks into the future. Finally, as graphical results are an important part of
explaining statistical analyses, a chart with several bells and whistles specific to time
series forecasting will be implemented.

Methodology and Results
ARMA modeling can be decomposed into several steps: identify the order (p, q) of the
ARMA model, estimate the coefficients of the parameters associated with the model,
check the parameterization against several diagnostic tools, and finally make
predictions with the model.

Data
Of course, there is an unmentioned first step: get data. For this study, the weekly
closing prices for GE were downloaded from finance.yahoo.com over the period
November 27th, 2000 to the present, December 6th, 2010. For closing prices, the
adjusted close field is used rather than the actual printed price on each date.
This accounts for dividends and splits and ensures the price represents the return a cash
investor in the stock would achieve. Additionally, it is well known that market data
exhibits non-constant variance. To account for this, modeling is performed on the
natural log of the adjusted closes.

Model Identification
An ARMA model can be described mathematically as:
$$Y_t = \sum_{i=1}^{p} \phi_i Y_{t-i} + \sum_{j=1}^{q} \theta_j e_{t-j} + e_t, \quad \text{where } \{e_t\} \sim WN(0, \sigma^2)$$
The first goal is to identify the order of the two summations in the equation, p and q.
This is accomplished by examining the plots of the sample auto-correlation and partial
auto-correlation.

The auto-correlation function (ACF) measures the level of correlation in the series at
different lags, or set points of time apart, and is effective at identifying the order q of
the moving average portion of the equation. Mathematically, the ACF at a set lag h is
equal to:
$$\rho(h) = \mathrm{Corr}(Y_t, Y_{t-h})$$
The plot examines the ACF at increasing values of h.

Similarly, the partial auto-correlation function (PACF) measures the level of correlation
at a lag when conditioning on the effect of the intervening variables. That is, we are
looking for
$$\phi_{kk} = \mathrm{Corr}(Y_t, Y_{t-k} \mid Y_{t-1}, Y_{t-2}, \ldots, Y_{t-k+1})$$
The PACF is useful for identifying the order p in the base ARMA equation. In both
instances, the plot is used to evaluate at each lag the hypothesis:
ACF:  $H_0: \rho(h) = 0$ versus $H_a: \rho(h) \neq 0$
PACF: $H_0: \phi_{kk} = 0$ versus $H_a: \phi_{kk} \neq 0$
Lags that reject the null are considered good candidates for the order of p or q.
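
As a rough guide to how these hypotheses are judged from the plots, the textbook sample
ACF and its approximate large-sample bound can be written as below. This is a sketch of
the usual approximation, not necessarily the exact standard errors SAS draws on its
charts; n (the number of observations) and \bar{Y} (the sample mean) are symbols
introduced here and do not appear in the text above.

```latex
r_h = \frac{\sum_{t=h+1}^{n} (Y_t - \bar{Y})(Y_{t-h} - \bar{Y})}{\sum_{t=1}^{n} (Y_t - \bar{Y})^2},
\qquad
\text{reject } H_0 \text{ at about the 5\% level when } |r_h| > \frac{1.96}{\sqrt{n}}
```

With roughly 520 weekly observations, as in this study, the bound is about
1.96 / \sqrt{520} \approx 0.086, so even fairly small sample correlations can be flagged.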

The PROC ARIMA statement automatically generates the ACF and PACF charts
as part of the standard output. These can be generated with the code:
proc arima;
identify var=lnGE
nlag=30
center;
In this code the var= option identifies the variable being modeled, nlag is the number
of lags to calculate the PACF and ACF to, and center ensures the calculations are
performed on a 0 mean series. The graphical output from this will be a 2x2 panel which
includes the series, the ACF, the PACF, and the inverse autocorrelation function (IACF),
which is used in ARIMA modeling.

Alternatively, the PROC TIMESERIES SAS statement can be used to generate the charts
independently. The code to do so is:
proc timeseries print=summary plots=pacf;
var lnGE;
proc timeseries print=summary plots=acf;
var lnGE;
run;

The first output from SAS for the ACF and PACF is:

What these charts tell us is that the data is not stationary. Stationarity is the condition
that the statistical properties of the series (its mean, variance and autocorrelations) do
not depend on time. In this instance the data is not stationary because each time point
is very strongly dependent on the previous time steps. This is most clearly seen in the
PACF chart, which shows a very high PACF at lag 1. The solution to this problem is to
difference the data and then recalculate the ACF and PACF on this new time series.
Differencing is a complicated sounding term, but is really just creating a new series by
subtracting the previous data point from the current one. Mathematically:
$$X_t = Y_t - Y_{t-1}$$

However, SAS will do this automatically as part of the ARIMA statement if we adjust our
earlier command with:
proc arima;
identify var=lnGE(1)
nlag=30
center stationarity=(adf=(1,2));
Notice the (1) after lnGE on line 2; this requests first differencing. A second point to
make on this code is the addition of the stationarity option. It will not always be so
obvious that the data should be differenced, and in these cases the researcher can
resort to an augmented Dickey-Fuller test. In this example, the test is run at lags of
one and two on the differenced data.
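
For context, the augmented Dickey-Fuller test is commonly described through the
regression below. This is the textbook form, offered as an assumption about what the lag
list controls rather than a transcription of SAS's internals; the symbols \alpha, \gamma,
\delta_i and k are introduced here and do not appear in the original text.

```latex
\Delta Y_t = \alpha + \gamma Y_{t-1} + \sum_{i=1}^{k} \delta_i \, \Delta Y_{t-i} + e_t,
\qquad H_0: \gamma = 0 \ (\text{unit root}), \quad H_a: \gamma < 0
```

Here k is the number of lagged differences included (one and two in the code above).
Rejecting the null suggests the series supplied to the test is stationary; failing to
reject it suggests further differencing is needed.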

The Dickey-Fuller test and the syntactic sugar of PROC ARIMA are not available in
PROC TIMESERIES. However, a new series can be created in the DATA step to
handle the differencing issue as follows:
* Create a lagged version of the log GE closes with lag=1;
lnGELagged = lag(lnGE);
* Future calls to PROC TIMESERIES should use this var;
GEReturn = lnGE-lnGELagged;
Either way, the result will be a model based on the continuously compounded rate of
return.
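
A slightly shorter way to build the same series is the DIF function, which returns a
value minus its lag. The sketch below assumes the data set is named GEData, contains
lnGE, and is sorted by ascending date, as in the appendix; the output data set name
GEDiff is hypothetical.

```sas
data GEDiff;
    set GEData;
    /* dif(lnGE) equals lnGE - lag(lnGE), i.e. the weekly log return.   */
    /* The first observation is missing because it has no predecessor. */
    GEReturn = dif(lnGE);
run;

/* ACF and PACF of the differenced series (requires ODS graphics to be on) */
proc timeseries data=GEDiff print=summary plots=(acf pacf);
    var GEReturn;
run;
```

LAG and DIF keep a queue of the values from previous calls, so they are best called
unconditionally on every row rather than inside an IF block.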
The output for the updated plots is:
From these it looks like an MA(3), MA(4), AR(3), AR(4), or possibly a simpler combination
of the two such as ARMA(1,1) may all work. More quantitative methods of determining
the order have been developed by Tsay and Tiao. These include both the Extended
Sample Autocorrelation Function (ESACF) and the Smallest Canonical Correlation
Method (SCAN). SAS implements both of these by adding the options SCAN and ESACF
to the IDENTIFY statement of PROC ARIMA as follows:
identify var=lnGE(1) scan esacf;
The output from this additional command is:
ARMA(p+d,q) Tentative Order Selection Tests
(5% Significance Level)

    SCAN          ESACF
  p+d    q      p+d    q
   1     1       2     2
   4     0       0     4
   0     4       1     4

SCAN proposes an ARMA(1,1) as a first choice while the ESACF proposes ARMA(2,2).
They also differ in their second choice candidate. SCAN prefers the autoregressive
model with order p=4, abbreviated AR(4), while ESACF prefers the moving average
model with order q=4, abbreviated MA(4).

In general, simpler models are preferred over more complex ones. The initial model to
explore is therefore the ARMA(1,1), and the remainder of this paper focuses on the
ARMA(1,1) model.
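
One way to check the preference for the simpler model against the other candidates,
sketched here as an optional extra step rather than something this study performed, is
to fit each candidate in a single PROC ARIMA call and compare the AIC and SBC values
that each ESTIMATE statement prints. The sketch assumes the GEData data set and lnGE
variable from the appendix.

```sas
proc arima data=GEData;
    identify var=lnGE(1) nlag=30 center;
    /* Each ESTIMATE fits one candidate; lower AIC/SBC is preferred */
    estimate p=1 q=1 noint method=ml;   /* ARMA(1,1), SCAN first choice  */
    estimate p=2 q=2 noint method=ml;   /* ARMA(2,2), ESACF first choice */
    estimate p=4     noint method=ml;   /* AR(4) */
    estimate q=4     noint method=ml;   /* MA(4) */
run;
quit;
```

SBC penalizes extra parameters more heavily than AIC, so it tends to side with the
preference for simpler models stated above.
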
Model Parameterization
The equation for an ARMA(1,1) model is:
$$Y_t = \phi_1 Y_{t-1} + \theta_1 \varepsilon_{t-1} + \varepsilon_t$$
where $\{\varepsilon_t\} \sim WN(0, \sigma_e^2)$. Given this model, the next stage of the
analysis is to determine estimates for the parameters $\phi_1$ and $\theta_1$, which,
when denoting the estimate, will be represented as $\hat{\phi}_1$ and $\hat{\theta}_1$.

To estimate the parameters SAS uses the ESTIMATE statement as part of PROC ARIMA.
Some important options to this statement that are used are: noint to set the intercept
to 0, method to set the use of the maximum-likelihood methodology, and p and q to tell
the SAS system the order of the ARMA equation.

In this study we use noint to set the intercept to 0. This is because we have already
transformed the series to a mean 0 untrended process. The intercept should be 0.

Maximum-likelihood estimation (MLE) is a common method for calculating
parameter estimates, and has been shown to have good properties as an estimator. For
this study, there's no strong reason to move away from using MLEs.

The final two modifiers mentioned are used to specify the AR order p and the MA order
q. In order to provide the researcher with control over the specific $\phi_i$ and
$\theta_i$ estimated, each is actually a list in parentheses. For example, to model an
AR(5) with no $\phi_2$ or $\phi_4$ term the command would be:
p=(1,3,5)
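
Written out in the sign convention used in this paper, that lag list corresponds to the
subset model below, which is simply an AR(5) with $\phi_2 = \phi_4 = 0$:

```latex
Y_t = \phi_1 Y_{t-1} + \phi_3 Y_{t-3} + \phi_5 Y_{t-5} + \varepsilon_t
```
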
Putting all this information together, the final form of the ESTIMATE statement used for
an ARMA(1,1) model is:
ESTIMATE p=(1) q=(1) noint method=ml;
And when used in conjunction with PROC ARIMA:
PROC ARIMA DATA=GEData;
IDENTIFY var=lnGE(1)
nlag=30
center;
ESTIMATE p=(1) q=(1) noint method=ml;
run;
As an aside, notice the use of the DATA= option. While not required, many SAS PROCs
output new data objects and thus, unbeknownst to the programmer, alter the last called
data object. Thus, as a defensive measure, it is good software engineering practice to
always explicitly define what data object is being used. This has the additional
benefit of making the code easier to maintain in the future.

The result of the above SAS statement is $\hat{\phi}_1 = -.89396$ with
$se_{\hat{\phi}_1} = .08490$ and $\hat{\theta}_1 = .83655$ with
$se_{\hat{\theta}_1} = .10416$. At this point care must be taken to establish what
convention the SAS system is using to define these estimates. Some programs use the
convention $Y_t + \sum \phi_i Y_{t-i} = \sum \theta_i \varepsilon_{t-i} + \varepsilon_t$
while others use the alternate form
$Y_t = \sum \phi_i Y_{t-i} + \sum \theta_i \varepsilon_{t-i} + \varepsilon_t$. SAS uses the
first form, so to convert to the same form as originally stated we have to change the
sign of $\hat{\phi}_1$. The resulting model for the differenced ln(GE) series through
December 6th is:
$$\hat{Y}_t = .894 Y_{t-1} + .837 \varepsilon_{t-1}$$

Prediction
Once we have the model, forecasting becomes an exercise in conditional expectation.
For the ARMA(1,1) the resulting forecast is:
$$\hat{Y}_t = \hat{\phi}_1 \hat{Y}_{t-1}$$
Notice that the MA term has dropped off as it is multiplied by a term with an expected
value of 0.
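
Spelled out by forecast horizon, the conditional expectation can be sketched as follows.
This uses the sign convention of this paper and two symbols introduced here:
$\hat{\varepsilon}_t$ for the last in-sample residual and h for the number of steps ahead.

```latex
\hat{Y}_{t+1} = \hat{\phi}_1 Y_t + \hat{\theta}_1 \hat{\varepsilon}_t,
\qquad
\hat{Y}_{t+h} = \hat{\phi}_1 \hat{Y}_{t+h-1} = \hat{\phi}_1^{\,h-1} \hat{Y}_{t+1} \quad (h \ge 2)
```

The MA term contributes only to the one-step forecast, where the most recent shock can
be estimated from the residuals; every later shock has expected value 0. Because
$|\hat{\phi}_1| < 1$, the forecasts of the differenced series shrink geometrically toward 0
as h grows.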
We can use SAS to generate this prediction with the FORECAST statement of PROC
ARIMA. Two useful options to this statement are LEAD and OUT.

LEAD allows the researcher to set the number of time steps into the future to forecast.
OUT is used to specify the DATA object to put the results into. As this model is working
with a transformed time series, this option is necessary to enable translation of
forecasts into the original terms.

The complete forecast statement is:
proc arima data=GEData;
identify var=lnGE(1) scan esacf nlag=30 center;
estimate p=(1) q=(1) noint method=ml;
forecast lead=4 out=predictOut;
run;
Note that the FORECAST statement must come after the ESTIMATE as the results of
estimate are used to specify the model.

Using this command will result in both an estimate and a nice confidence interval for the
process.
However, this still needs to be transformed back to meaningful units. This is
accomplished with the following data step:
data predictOut;
set predictOut;
* Undo the log transform so all three values are in dollars;
l95 = exp( l95 );
u95 = exp( u95 );
forecast = exp( forecast );
run;
This results in a point estimate of next week's closing price of $17.75 with a 95%
confidence interval of ($16.23, $19.42). The actual closing price of GE on December 13th
was $17.62.
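
The back-transformation can be checked against the reported numbers. The sketch below
assumes the forecast errors of the log series are approximately normal; m and s, the
forecast and its standard error on the log scale, are symbols introduced here.

```latex
\left[\, e^{\,m - 1.96 s},\; e^{\,m + 1.96 s} \,\right],
\qquad
e^{m} = \sqrt{e^{\,m - 1.96 s} \cdot e^{\,m + 1.96 s}},
\qquad
\sqrt{16.23 \times 19.42} \approx 17.75
```

The interval is therefore symmetric around the point estimate in ratio terms rather than
in dollars. Under the same normality assumption, $e^{m}$ is the median of the predicted
price distribution; the mean would be $e^{\,m + s^2/2}$, which is slightly higher.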

Creating Quality Graphical Output
A plot of the prediction is also of value; however, a significant amount of manipulation
of the data is required to achieve a professional look. Among the challenges are: the
output of FORECAST does not provide a date for each predicted time step, the plot
should include both the original GE data and the forecast, and a shaded region is
needed to depict the prediction interval. Accomplishing these requires some
manipulation in the DATA step. The following DATA step merges the two data
objects, fixes the timestamp problems and creates two new variables, FL95 and FU95, for
the forecast in the prediction time period. Additionally, it creates an extra row that is
used to create dummy values of FL95 and FU95 so that a shaded region can be drawn:
data allData;
MERGE GEData predictOut;
*by TradeDate;
* Forecast rows have no date, so step forward 7 days per row past observation 523;
IF TradeDate EQ . THEN TradeDate='06DEC2010'D + (_n_-523)*7;
* Anchor both bounds at the last observed price so the shaded region starts there;
IF TradeDate EQ '06DEC2010'D THEN
DO;
FL95=AdjClose;
FU95=AdjClose;
END;
IF TradeDate GE '08DEC2010'D THEN FL95=L95;
IF TradeDate GE '08DEC2010'D THEN FU95=U95;
FORMAT TradeDate Date9.;
IF TradeDate = '03JAN2011'D THEN DO; *Create extra row for shading;
Output;
TradeDate = '03JAN2011'D;
FL95=17.72;
FU95=17.72;
END;
OUTPUT;
run;

[Figure: GE share price (vertical axis, 12 to 22) plotted against time (horizontal axis, 01JAN2010 to 28JAN2011)]

Most of this is straightforward. The lone exception is the creation of the extra row. The
reason for creating this is so the drawing algorithm can accurately determine the
bounds of the polygon it is drawing. The start of the code to create the extra row begins
on the line with the associated comment. The output statement saves the current copy,
and a new one is also created. Then the values for that new row are adjusted as
needed, specifically setting the second set of bounds for FL95 and FU95.

Finally, it is important to note that this code is not particularly maintainable. First, the
use of 06DEC2010, 08DEC2010, and 523 ties the DATA step specifically to the current
data set. Second, the time shift is hard-coded as seven days. Production code
should seek to address these issues. A good starting point would be the article by
Croker referenced in the bibliography.
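
As one possible first step toward removing the hard-coded values, the last observed date
and the row count can be read from the data itself. This is only a sketch of the date
logic, not the full DATA step above: it assumes GEData is sorted by ascending TradeDate
with one row per week, and the macro variable names lastDate and nObs are hypothetical.

```sas
/* Read the last observed date and the row count from the data itself */
proc sql noprint;
    select max(TradeDate) format=best12., count(*) format=best12.
        into :lastDate trimmed, :nObs trimmed
        from GEData;
quit;

data allData;
    merge GEData predictOut;
    /* Forecast rows have no date: move forward one week per row, */
    /* keeping the weekday of the last observed date              */
    if TradeDate = . then
        TradeDate = intnx('week', &lastDate, _n_ - &nObs, 'same');
    format TradeDate date9.;
run;
```

The same two macro variables could replace the date literals in the remaining IF
statements, so that the step no longer needs editing when new weeks of data arrive.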
Conclusion
Implementing Time Series analysis in SAS is surprisingly easy. The commands are
relatively straightforward and implement the core algorithms needed for ARMA
modeling, as well as providing the accompanying plots.

Appendix I SAS Code
data GEData;
infile 'F:\SASAssignmentNotes\Project\ge_data.csv' DLM=',' FIRSTOBS=2;
input TradeDate :MMDDYY10. Open High Low Close Volume
AdjClose;
format TradeDate Date9.;
lnGE = log(AdjClose);
lnGELagged = lag(lnGE);
GEReturn = lnGE-lnGELagged;
GELagged = lag(AdjClose);
DiffedGE = AdjClose - GELagged;
output;
run;

proc print;run;
ods graphics on;

proc timeseries print=summary plots=pacf data=GEData;
var lnGE;
proc timeseries print=summary plots=acf data=GEData;
var lnGE;
run;

proc arima data=GEData;
identify var=lnGE(1) scan esacf nlag=30 center;
estimate p=(1) q=(1) noint method=ml;
forecast lead=4 out=predictOut;
run;

data predictOut;
set predictOut;
l95 = exp( l95 );
u95 = exp( u95 );
forecast = exp( forecast );
proc print data=predictOut;
run;

data allData;
MERGE GEData predictOut;
IF TradeDate EQ . THEN TradeDate='06DEC2010'D + (_n_-523)*7;
IF TradeDate EQ '06DEC2010'D THEN
DO;
FL95=AdjClose;
FU95=AdjClose;
END;
IF TradeDate GE '08DEC2010'D THEN FL95=L95;
IF TradeDate GE '08DEC2010'D THEN FU95=U95;
FORMAT TradeDate Date9.;
IF TradeDate = '03JAN2011'D THEN DO;
Output;
TradeDate = '03JAN2011'D;
FL95=17.72;
FU95=17.72;
END;
OUTPUT;

proc print data=allData;
run;

goptions reset=all;
symbol1 value=none i=join line=1 c=black co=libgr;
symbol2 value=none i=join line=3 c=blue co=libgr;
*symbol2 value=none i=join line=3 c=CX803009 co=libgr; *if you prefer
orange;
symbol3 value=none I=ms co=libgr c=gwh;
symbol3 value=none I=ms co=libgr c=CXD9A465;
symbol3 value=none I=ms co=libgr c=CXE5C5C2; * or pink...;

axis1 label=("Time Axis" )
order=('01JAN2010'D to '29JAN2011'D by 56)
value=(h=1 angle=0 rotate=0) ;
* angle MUST come before the text or the text won't be rotated;
axis2 label=(angle=90 rotate=0 "GE Share Price") order=(12 to 22);

Proc Gplot data=allData;
PLOT FL95*TradeDate=3 FU95*TradeDate=3
AdjClose*TradeDate=1 Forecast*TradeDate=2
/overlay haxis=axis1 vaxis=axis2;
run;
quit;

Appendix II Sources
- Presentation Quality Forecast Visualization with SAS/Graph by Samuel T. Croker
http://www.nesug.org/proceedings/nesug07/np/np04.pdf

- Time Series Analysis: With Applications In R by Jonathan D. Cryer and Kung-Sik Chan

- Time Series Analysis and Its Applications: With R Examples by Robert Shumway and
David Stoffer
