
TS Gas Report


Uploaded by

sravanthi m
Mini Project – Gas (Australian monthly gas production)

Sravanthi.M

Table of Contents
1. Project Objective
2. Assumptions
3. Exploratory Data Analysis – Step by step approach
3.1. Environment Set up and Data Import
3.1.1. Install necessary Packages and Invoke Libraries
3.1.2. Cleaning up data
3.1.3. Reading the Data and visualization
3.2. Variable Identification
4. Conclusion
5. Detailed Explanation of Findings

5.1 Read the data as a time series object in R. Plot the data

5.2 What do you observe? Which components of the time series are present in this dataset?

5.3 What is the periodicity of dataset?

5.4 Is the time series Stationary? Inspect visually as well as conduct an ADF test? Write down the
null and alternate hypothesis for the stationarity test? De-seasonalize the series if seasonality is
present?

5.5 Develop an ARIMA Model to forecast for next 12 periods. Use both manual and auto.arima
(Show & explain all the steps)

5.6 Report the accuracy of the model

6. Source Code
1 Project Objective
The objective of this report is to analyze the Australian monthly gas production dataset “gas”
from the “forecast” package, which covers monthly gas production in Australia between 1956 and
1995. The data is read from the forecast package and analyzed by plotting, observing, and
conducting the applicable tests. The project also requires building ARIMA and Auto ARIMA
models and forecasting the next 12 periods. The best model for prediction is chosen by
comparing the performance measures of the models.

The dataset is structured as follows:

Variable          Description
Year              Year of production
Month             Month of production
Gas Production    Number of units produced during the specified month and year

2 Assumptions
• The sample size is adequate for the techniques applicable to a time series dataset.
• The Australian gas production time series was taken from the “forecast” package in R.
• The components of the time series are not known.
• The stationarity of the time series is not known.
• The seasonality of the time series is not known.
3 Exploratory Data Analysis – Step by step approach
A Typical Data exploration activity consists of the following steps:

1. Load the data and visualize it
2. Preprocess the data
3. Check/Make the data
4. Do hypothesis testing. If the time series is non-stationary, then stationarize it
5. Do the Augmented Dickey-Fuller test
6. Determine the d, p and q values
7. Create ACF and PACF plots
8. Create the ARIMA model
9. Fit the ARIMA model

We shall follow these steps in exploring the provided dataset.

3.1 Reading Data


3.1.1 Install necessary Packages and Invoke Libraries
Use this section to install the necessary packages and invoke the associated libraries. Keeping
all the packages in one place improves code readability. For installation we will use
install.packages("Package name").

3.2 Variable Identification


We are using:
• summary(): a generic function used to produce summaries of the results of
various model fitting functions. The function invokes particular methods which
depend on the class of the first argument.
• hist(): to plot a histogram
• data("gas", package = "forecast"): to load the dataset

4 Conclusion
The manual ARIMA model is neck and neck with the Auto ARIMA model on most of the accuracy measures, such as:
• Root mean squared error
• Mean absolute percentage error
However, when the models are actually compared, the one based on Auto ARIMA performs better
than the manual models. A more complex, better fitting model can always be built by using higher-order parameters.
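The two accuracy measures named above can be written out directly. The following is a minimal sketch in base R; the helper names rmse and mape and the three sample values are assumptions made for illustration and are not taken from the report or from the forecast package.

```r
# Hypothetical helper names; not part of the forecast package.
rmse <- function(actual, predicted) {
  # Root mean squared error, in the units of the series
  sqrt(mean((actual - predicted)^2))
}

mape <- function(actual, predicted) {
  # Mean absolute percentage error, in percent; undefined if any actual is zero
  mean(abs((actual - predicted) / actual)) * 100
}

# Illustrative values only, not gas production figures
actual    <- c(100, 200, 400)
predicted <- c(110, 190, 400)

rmse(actual, predicted)
mape(actual, predicted)
```

Because MAPE is relative to the level of the series, it can only be compared across models that are evaluated on the same scale.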

5 Detailed Explanation of Findings

5.1 Read the data as a time series object in R. Plot the data
Ans: For a basic data summary we read the data as described in section 3.2 and
use the functions listed there to analyze it.

Output:

5.2 What do you observe? Which components of the time series are present in
this dataset?
The production of gas in Australia has increased significantly over a long period
of time (40 years). There is a significant upward trend, and there seems to be
some seasonality, but the plot also shows extremely high variance. Because the
timeline spans 40 years, it remains to be seen how relevant the older
historical data is.

Output:

A large number of the observations showing lower gas production are from the
early years, prior to the 1970s. Whether these lower values would aid in
accurately forecasting production in 1996 remains to be seen.

The lowest gas production was recorded in Feb 1956, in the early years of
production, and the highest monthly production to date was recorded in Jul
1995.

There is a huge gap in production levels over the years. Current levels are
nowhere near the mean (21415) or the median (16788).
5.3 What is the periodicity of dataset?

Ans: The month plot for the Australian gas production data shows a clear
increasing trend in the production of gas during each month from 1956 to 1995.
There is a clear upward trend, with visible variations seen mainly during the
last 5-10 years.

Frequency:

A time series with one observation each month has a monthly sampling
frequency or monthly periodicity and so is called a monthly time series. Data
periodicity is described by specifying periodic time intervals into which the dates
of observations fall. Using the frequency() function we can determine the
periodicity of the time series.
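As a small illustration of how periodicity is stored in a time series object, the sketch below builds a toy monthly series in base R. The 24 placeholder values are an assumption for illustration; they are not the gas data.

```r
# Toy monthly series: 24 placeholder values starting in January 1956
x <- ts(seq_len(24), start = c(1956, 1), frequency = 12)

frequency(x)   # number of observations per year
cycle(x)       # position of each observation within its year (1 = January)
start(x)       # first (year, period)
end(x)         # last (year, period)
```

For a monthly series such as gas, frequency() is expected to report 12 in the same way.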

Output:

5.4 Is the time series Stationary? Inspect visually as well as conduct an ADF
test? Write down the null and alternate hypothesis for the stationarity test? De-
seasonalize the series if seasonality is present?
Ans: From the above plot we can observe an upward trend with a semi-annual
seasonality that is visible throughout the time series. The seasonal component
at the beginning of the series is smaller than the seasonal component later in
the series. To account for this, we need to log-transform the data.

Log Transformation:

Plot the data against time. If the variance increases with the level of the
series, take logs. If not, model the original data.

Output:

Compared to the plot of the original time series, the log-transformed series
has a variance that is much more uniform throughout.
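The effect of the log transformation can be sketched on a simulated series whose seasonal swing grows with its level. Everything below (the names idx, trend, seasonal and the chosen growth rate) is an assumption made for illustration, not a model of the gas data.

```r
idx      <- 1:120
trend    <- 1000 * exp(0.02 * idx)             # level that grows over time
seasonal <- 1 + 0.2 * sin(2 * pi * idx / 12)   # multiplicative seasonal factor
y        <- trend * seasonal

# On the original scale the within-year swing grows with the level
swing_first <- diff(range(y[1:12]))
swing_last  <- diff(range(y[109:120]))
c(first_year = swing_first, last_year = swing_last)

# On the log scale the two components simply add up, so the seasonal part
# log(seasonal) has the same amplitude in every year
log_y <- log(y)
all.equal(log_y, log(trend) + log(seasonal))
```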

Decomposition:

We have to decide whether an additive or a multiplicative model describes the
data appropriately. The size of the seasonal and random fluctuations increases
as time goes on, which indicates that an additive model is not appropriate for
the original data. Hence the data is better described by a multiplicative
model. Because we have already applied the log transformation, which turns
multiplicative components into additive ones, we can use an additive model for
the decomposition.
The time series includes a seasonal (semi-annual) component, an upward
trend component, and residuals (error). The extent of each component can be
deduced by decomposing the data with the stl() function. Decomposition
also allows us to remove the seasonal component from the data.
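A minimal sketch of an stl() decomposition and of removing the seasonal component is given here on a simulated monthly series, so that it runs with base R alone. The simulated series and the variable names are assumptions for illustration only.

```r
set.seed(1)
idx <- 1:120
# Simulated monthly series: level + slow trend + yearly cycle + noise
y <- ts(5 + 0.01 * idx + 0.2 * sin(2 * pi * idx / 12) + rnorm(120, sd = 0.05),
        frequency = 12)

fit <- stl(y, s.window = "periodic")
components <- fit$time.series     # columns: seasonal, trend, remainder

# Removing the seasonal column leaves trend + remainder
deseasonalized <- components[, "trend"] + components[, "remainder"]

# The three components add back up to the series, so this matches y - seasonal
all.equal(as.numeric(y - components[, "seasonal"]), as.numeric(deseasonalized))
```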

Output:

From the decomposition plot above we can observe that there is definitely a
trend, as noted in our visual inspection, along with a semi-annual seasonal
component and residuals (the remainder). The trend component is the most
significant.

Seasonal Plot:

As observed earlier during our visual inspection, there is semi-annual
seasonality present in the time series along with an upward trend.
Deseasonalization involves removing the seasonal component from the time series,
which helps us understand the effect of the other components on the series.

The deseasonalized plot is then compared to the original plot.

Output:

A seasonal plot allows the underlying seasonal pattern to be seen more clearly
and is especially useful in identifying years in which the pattern changes. We
can clearly see that there is a jump in production in July and August each year,
and that production in the later years is considerably higher than in the
earlier years.

Output:

Output:

Output:

Output:

Stationary:

Fitting an ARIMA model requires the series to be stationary. A series is said to
be stationary when its mean, variance, and autocovariance are time invariant.
• We need to determine whether our time series is stationary.
• That is, whether the mean is generally constant throughout the time series, as
opposed to going up or down over time.

Output:

• This does not look stationary at all, as the mean tends to go up over
time. We can do a formal test to determine stationarity.
Adf.test:

For this we can use the Augmented Dickey-Fuller (ADF) test, which tests the
null hypothesis that the series is non-stationary.

Hypothesis:

H0: Non- stationary


Ha: Stationary

If the p-value is more than 0.05, we fail to reject the null hypothesis (H0) and
treat the data as non-stationary. If it is below 0.05, we reject H0 in favor of
the alternative (Ha) that the data is stationary.
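The decision rule can be sketched as a small helper. The function name adf_decision and the p-value 0.27 are hypothetical and used only for illustration; in practice the p-value would come from adf.test() in the tseries package.

```r
# Hypothetical helper: turns an ADF p-value into the wording of the test decision
adf_decision <- function(p_value, alpha = 0.05) {
  if (p_value < alpha) {
    "reject H0: evidence that the series is stationary"
  } else {
    "fail to reject H0: no evidence against non-stationarity"
  }
}

adf_decision(0.27)   # illustrative p-value above 0.05
adf_decision(0.01)   # illustrative p-value below 0.05

# With the tseries package installed, the p-value could be taken from the test:
# p <- tseries::adf.test(gas)$p.value
```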

Output:

• From the above output, the p-value is above 0.05, so we cannot reject the
null hypothesis H0.
• We therefore conclude that the data is non-stationary.

Stationary – Differencing:

We have non-stationary data. We need to difference the data until we obtain a
stationary time series. We can do this with the diff() function.

Output:

Determining d value:
• After differencing, we again run adf.test to check the stationarity of the data.
• The p-value is 0.01, which is less than 0.05; hence the null hypothesis H0 is
rejected in favor of the alternative hypothesis Ha.
• The differenced data is now stationary.
• Given that we had to difference the data once, the d value for our
ARIMA model is 1.
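What first differencing does can be sketched on a simulated random walk, a standard example of a non-stationary series. The simulated data is an assumption for illustration and is unrelated to the gas series.

```r
set.seed(42)
shocks <- rnorm(200)
random_walk <- ts(cumsum(shocks))   # non-stationary: the level wanders over time

d1 <- diff(random_walk)             # first difference, corresponding to d = 1

# Differencing a random walk recovers the underlying stationary shocks
all.equal(as.numeric(d1), shocks[-1])

# Seasonal differencing of a monthly series would use lag = 12 instead:
# diff(x, lag = 12)
```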

Determine p and q value:

Output:

• We observe from the above ACF (q) and PACF (p) correlation plots that
a large amount of correlation exists.
• Looking at the ACF plot, we can see a seasonal pattern.
• The p and q values, read from the PACF and ACF plots respectively, would be
2 and 2.
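Reading candidate orders off the correlation plots amounts to checking which lags fall outside the approximate 95% band. A minimal sketch in base R follows, using a simulated AR(1) series as an assumed stand-in for the differenced data.

```r
set.seed(7)
x <- arima.sim(model = list(ar = 0.7), n = 300)   # simulated AR(1), illustration only

acf_vals  <- acf(x,  lag.max = 20, plot = FALSE)
pacf_vals <- pacf(x, lag.max = 20, plot = FALSE)

# Approximate 95% band drawn on the plots: +/- 1.96 / sqrt(n)
band <- qnorm(0.975) / sqrt(length(x))

# Lags whose partial autocorrelation is outside the band (candidates for p)
significant_pacf <- which(abs(pacf_vals$acf) > band)
# Lags whose autocorrelation is outside the band (candidates for q); lag 0 dropped
significant_acf  <- which(abs(acf_vals$acf[-1]) > band)

significant_pacf
significant_acf
```

The band is only a rough guide, so the orders chosen this way are usually checked against the residual diagnostics of the fitted model.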

5.5 Develop an ARIMA Model to forecast for next 12 periods. Use both manual
and auto.arima (Show & explain all the steps)
Ans: Fitting ARIMA Model

Output:

Output:

Output:

Output:

Conclusion of ARIMA Model built on (p,d,q) values of (2,1,2)

• MAPE observed is 6.992583
• The histogram suggests that the residuals are approximately normally distributed
• Compared to the original plot, the fit of the current model is good
• The autocorrelation plot of the residuals shows that correlation exists at
various lags from lag 4 onward
• The Box-Ljung test p-value is well below 0.05, so the hypothesis of independent
residuals is rejected: the residuals are dependent

Now we build another ARIMA model by adding seasonality to the previous
model.
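How the Box-Ljung test reacts to dependent values can be sketched with base R on simulated data. Both series below are assumptions made for illustration; they are not residuals from the models in this report.

```r
set.seed(123)
white_noise <- rnorm(500)                                   # independent values
ar_series   <- arima.sim(model = list(ar = 0.8), n = 500)   # strongly autocorrelated

# H0: no autocorrelation up to the chosen lag
lb_wn <- Box.test(white_noise, lag = 10, type = "Ljung-Box")
lb_ar <- Box.test(ar_series,   lag = 10, type = "Ljung-Box")

lb_wn$p.value
lb_ar$p.value   # very small, so H0 is rejected for the autocorrelated series

# For residuals of a fitted ARIMA(p,d,q) model, fitdf = p + q adjusts the
# degrees of freedom, for example with p = 2 and q = 2:
# Box.test(residuals(fit), lag = 30, type = "Ljung-Box", fitdf = 4)
```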

Output:

Output:

Conclusion of the SARIMA model built on (p,d,q)(P,D,Q) values of (2,1,2)(1,1,2):
• MAPE obtained is 3.9086
• The histogram suggests that the residuals are approximately normally distributed
• Compared to the original plot, the fit is good
• Autocorrelation in residuals: correlation exists from lag 4 onward, but at
fewer lags than in the previous model, which is a better outcome.
• Box-Ljung test: the p-value is well below 0.05, so the residuals are
dependent. The p-value is slightly better than for the previous model.

Auto ARIMA:

Output:

Conclusion of the Auto ARIMA model built on (p,d,q) values (2,1,1)(0,1,1)[12]:
• MAPE observed is 3.900233
• The histogram suggests that the residuals are approximately normally distributed
• Compared to the original plot, the fit of the current model is good
• Autocorrelation in the residuals is almost the same as for the previous SARIMA model
• Box-Ljung test: the p-value is well below 0.05, so the residuals are
dependent. The p-value is still no better than for the previously built models.

Performing Box-Cox transformation:

Output:

Output:

Output:

Conclusion of the Box-Cox transformation (best model: ARIMA(0,1,1)(0,1,1)[12]):

• MAPE observed is 0.5343536, which is lower than for the previous models; note
that it is measured on the Box-Cox transformed scale, so it is not directly
comparable with the earlier MAPE values
• The histogram suggests that the residuals are approximately normally distributed
• Compared to the previous plot, the fit of this model is good.
• Autocorrelation in the residuals is close to that of the Auto ARIMA model
• Box-Ljung test: the p-value is well below 0.05, so the residuals are
dependent. The p-value is still no better than for the previous models.
• As this model is also not better than the previous models, we take a subset
of the original data and perform all the tests on it.
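The Box-Cox transformation itself can be written in a few lines. The helpers below are hypothetical and written out only for illustration (the forecast package provides BoxCox() and InvBoxCox() for the same purpose); the sample values and lambda = 0.1 are assumptions, not figures from this analysis.

```r
# Hypothetical helpers, written out for illustration
box_cox <- function(x, lambda) {
  # lambda = 0 is defined as the natural log
  if (lambda == 0) log(x) else (x^lambda - 1) / lambda
}
inv_box_cox <- function(z, lambda) {
  if (lambda == 0) exp(z) else (lambda * z + 1)^(1 / lambda)
}

x <- c(1500, 5000, 20000, 60000)   # placeholder values, not taken from the gas data
z <- box_cox(x, lambda = 0.1)

z                                           # much narrower range than x
all.equal(inv_box_cox(z, lambda = 0.1), x)  # back-transforming recovers x
```

Since the transformed values span a much narrower range, error measures computed on them are expected to be smaller than those computed on the original scale.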

Extracting Subset of Original data and performing Test:
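Taking a subset of a time series by date can be sketched with window() in base R. The placeholder series below, 476 monthly values starting in January 1956, is an assumption for illustration and stands in for the gas data.

```r
# Placeholder monthly series (length assumed for illustration)
x <- ts(seq_len(476), start = c(1956, 1), frequency = 12)

# Keep only the observations from January 1990 onward
x_sub <- window(x, start = c(1990, 1))

start(x_sub)    # first (year, period) of the subset
end(x_sub)      # unchanged end of the series
length(x_sub)   # number of months kept
```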

Output:

Output:

Output:

Output:

Output:

Finally, we took a subset of the original time series, using data from 1990
onward, because the models built on the full series suffered from dependent
residuals and inaccurate fits, and the residuals of both the ARIMA and the
Auto ARIMA models did not behave like white noise.
Hence, we repeated all the steps on the subset to build a stable model and
based the forecast on it with more confidence. ARIMA(0,0,2) proved to be the
best model at this stage, with the following supporting results:
• MAPE observed is 8.588294
• The histogram suggests that the residuals are fairly normally distributed
• Compared to all previously built plots, this plot is good
• Autocorrelation in the residuals shows significant correlation, with no
improvement over the previous Auto ARIMA model
• Box-Ljung test: the p-value is well below 0.05, so the residuals are
dependent. The p-value is still no better than for the earlier models, and the
model can be refined further by adding seasonality.

Output:

The model built on the subset with a seasonal component, SARIMA(0,1,1)(1,1,0),
gave the following results:
• MAPE observed is 4.139159
• The histogram indicates that the residuals are approximately normally distributed
• Compared to the original plot, this fit is the best so far.
• No correlation is observed in the residual autocorrelation plot, which makes
this the best fit so far.
• Box-Ljung test: the p-value is well above 0.05, so there is no evidence that
the residuals are dependent. The p-value is the best so far.

Fitting Auto ARIMA on subset model:

Output:

The model obtained by running auto ARIMA on the subset proved to be the best
model. It is identical to the previously built SARIMA(0,1,1)(1,1,0) model and
gave exactly the same results:
• MAPE observed is 4.139159
• The histogram indicates that the residuals are approximately normally distributed
• Compared to the original plot, the fit of this model is also very good.
• No correlation is observed in the residuals, which makes this the best fit so far.
• Box-Ljung test: the p-value is well above 0.05, so there is no evidence that
the residuals are dependent. The p-value is the best so far.

Make Prediction:

Output:

5.6 Report the accuracy of the model

Ans: To build the best model we finally used a subset of the original data. On
the full 40 years of historical data, which has high variance, we found
correlated residuals and inaccurate models, and this strongly affected the
accuracy of our models. Since we were forecasting only the next 12 months, the
data from the last 5 years was sufficiently stable and was enough to build the
model, predict the values, and plot the forecast.

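6. Source code
## Load the required packages
library(forecast)   # gas data, auto.arima(), forecast(), ggseasonplot()
library(tseries)    # adf.test()

## loading the data
data("gas", package = "forecast")

## Plot the data

plot(gas, main = "Plot of Australian Gas Production")

hist(gas, main = "Histogram of Australian Gas Production", col = "blue",
     border = "orange")

summary(gas)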

## Checking frequency of gas data

frequency(gas)

monthplot(gas, main = "Monthly Plot of Australian Gas Production")

## Log transformation

loggas <- log(gas)
plot(loggas, main = "Plot of log(gas)")

## Decompose the log series ("periodic" keeps the seasonal pattern fixed over time)
loggasdec <- stl(loggas, s.window = "periodic")
plot(loggasdec)
loggasdec

##Season plot

ggseasonplot(gas)
ggseasonplot(gas, polar = TRUE)

## Deseasonalize (trend + remainder, on the log scale)

deseasonloggas <- loggasdec$time.series[, 2] + loggasdec$time.series[, 3]
ts.plot(deseasonloggas, loggas, col = c("blue", "orange"),
        main = "Comparison of loggas and Deseasonalized loggas")

## Plotting actual values with exponentiation

deseasongas <- exp(loggasdec$time.series[, 2] + loggasdec$time.series[, 3])
ts.plot(deseasongas, gas, col = c("blue", "green"),
        main = "Comparison of gas and Deseasonalized gas")

## Plotting seasonality only for the first 12 months

logseason <- loggasdec$time.series[1:12, 1]
plot(logseason, type = "l")

## Exponentiate to get actual values (seasonal factors on the original scale)

gasseason <- exp(loggasdec$time.series[1:12, 1])
plot(gasseason, type = "l")

##Checking for stationarity of the data

plot(gas)
adf.test(gas)

##Stationarize the series

diff1 = diff(gas)
plot(diff1,main = "Differenced plot of Gas")
adf.test(diff1)

## Autocorrelation up to lag 50

acf(diff1, lag.max = 50, main = "Auto Correlation(q)")
pacf(diff1, lag.max = 50, main = "Partial Auto Correlation(p)")

##ARIMA

gas.arima.fit = arima(gas, c(2,1,2))


summary(gas.arima.fit)
hist(gas.arima.fit$residuals, col = "Pink")

## Testing fit with original series

ts.plot(gas,fitted(gas.arima.fit), col=c("green","red"))

##Checking fit of test auto correlation in residuals

acf(gas.arima.fit$residuals)

##Box-Ljung test
#H0: Residuals are independent
#Ha: Residuals are not independent
Box.test(gas.arima.fit$residuals,lag = 30,type = "Ljung-Box")

## Adding seasonal components if required

gas.arima.fit.s = arima(gas, c(2,1,2),
                        seasonal = list(order = c(1,1,2), period = 12))
gas.arima.fit.s
summary(gas.arima.fit.s)
hist(gas.arima.fit.s$residuals, col = "coral")
ts.plot(gas, fitted(gas.arima.fit.s), col = c("coral","blue"))
acf(gas.arima.fit.s$residuals)
Box.test(gas.arima.fit.s$residuals, lag = 30, type = "Ljung-Box")
plot(forecast(gas.arima.fit.s, h = 12))

## Alternative seasonal model with MA terms only
gas.arima.fit.s1 <- arima(gas, c(0,1,2),
                          seasonal = list(order = c(0,1,2), period = 12))
gas.arima.fit.s1
summary(gas.arima.fit.s1)
hist(gas.arima.fit.s1$residuals, col = "green")
ts.plot(gas, fitted(gas.arima.fit.s1), col = c("blue","pink"))
acf(gas.arima.fit.s1$residuals)
Box.test(gas.arima.fit.s1$residuals, lag = 30, type = "Ljung-Box")
plot(forecast(gas.arima.fit.s1, h = 12))

##Auto-Arima

fitauto = auto.arima(gas, seasonal = TRUE, trace = TRUE)


summary(fitauto)
hist(fitauto$residuals,col = "aquamarine4")
ts.plot(gas,fitted(fitauto),col=c("chartreuse4","red"))
acf(fitauto$residuals)
Box.test(fitauto$residuals,lag = 30,type = "Ljung-Box")
checkresiduals(fitauto)

plot(forecast(fitauto,h=12))
forecastfit = forecast(fitauto,h=12)
automean = forecastfit$mean
automean

##forecast fit Box Cox

gas1 = BoxCox(gas,lambda = BoxCox.lambda(gas))


summary(gas1)
tsdisplay(gas1,lag.max = 150,plot.type = c("histogram"))
fitauto1 = auto.arima(gas1, seasonal = TRUE, trace = TRUE)
summary(fitauto1)
hist(fitauto1$residuals,col = "brown")
ts.plot(gas1,fitted(fitauto1),col=c("brown","pink"))
acf(fitauto1$residuals)
Box.test(fitauto1$residuals,lag = 30,type = "Ljung-Box")
checkresiduals(fitauto1)

##Subset of the data

gassub = window(gas,start=c(1990,1))
plot(gassub)

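## Check for stationarity of the data

adf.test(gassub)

## ACF and PACF of the subset up to lag 30

acf(gassub, lag.max = 30)
pacf(gassub, lag.max = 30)
plot(gassub)

## ARIMA(p,d,q) with MA terms only, without differencing

gas.arima.fit <- arima(gassub, c(0,0,2))

## With AR & MA terms and differencing
## Note: each assignment below overwrites gas.arima.fit, so the diagnostics
## that follow apply to the last model fitted, ARIMA(2,1,2).

gas.arima.fit <- arima(gassub, c(2,1,1))
gas.arima.fit <- arima(gassub, c(2,1,2))
summary(gas.arima.fit)
hist(gas.arima.fit$residuals, col = "Yellow")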
summary(gas.arima.fit)
hist(gas.arima.fit$residuals,col = "Yellow")

##Testing the fit with original series

ts.plot(gassub,fitted(gas.arima.fit),col=c("red","blue"))
fitted(gas.arima.fit)

## Test autocorrelation in residuals to check the fit

acf(gas.arima.fit$residuals)

## Ljung-Box method used: H0: residuals are independent

Box.test(gas.arima.fit$residuals,lag = 30,type = "Ljung-Box")

## Adding seasonal components

gas.arima.fit.s = arima(gassub, c(0,1,1),
                        seasonal = list(order = c(1,1,0), period = 12))
gas.arima.fit.s
summary(gas.arima.fit.s)
hist(gas.arima.fit.s$residuals,col = "Orange")
ts.plot(gassub,fitted(gas.arima.fit.s),col=c("Orange","Green"))
acf(gas.arima.fit.s$residuals)
Box.test(gas.arima.fit.s$residuals,lag = 30,type = "Ljung-Box")

##Auto Arima

fitauto = auto.arima(gassub, seasonal = TRUE, trace = TRUE)


summary(fitauto)
hist(fitauto$residuals,col = "chocolate2")
acf(fitauto$residuals)
Box.test(fitauto$residuals,lag = 30,type = "Ljung-Box")
checkresiduals(fitauto)
ts.plot(gassub,fitted(fitauto),col=c("green","blue"))

## Forecast after ensuring model is stable and accurate, forecast for next 12 intervals

fct1 = forecast(gas.arima.fit.s, h = 12)
fct1$mean
## h belongs to forecast(), not plot()
plot(forecast(gas.arima.fit.s, h = 12))
summary(gas.arima.fit.s)