
"Time Series Analysis in R: A Beginner's Guide"

SRINIVAS. S

A time series is a sequence of observations recorded at successive points in time. It represents
data collected chronologically, with “time” referring to intervals such as hours, days, weeks,
months, or even years [1,2]. This temporal sequence allows us to analyse how a variable of
interest changes over time. In this tutorial, we will explore how to fit ARIMA and SARIMA
models to your data and walk through a step-by-step process for conducting time series
analysis.

The data used in this tutorial is from the article “Statistical methods for predicting tuberculosis
incidence based on data from Guangxi, China” [3]. It is an open-access dataset containing
monthly TB incidence in Guangxi from January 2012 to June 2019.

The first step is “Description”. When you are presented with time series data, the first step is to
plot the data and observe the components of the time series, such as trend, seasonality, and cycles.

#Import the excel data to R platform

library(readxl)
Data = read_xlsx("C:\\Users\\WELCOME\\Downloads\\TB time series data.xlsx")
head(Data)

## # A tibble: 6 × 2
## Time TB
## <dttm> <dbl>
## 1 2012-01-31 00:00:00 13.0
## 2 2012-02-29 00:00:00 17.1
## 3 2012-03-31 00:00:00 20.2
## 4 2012-04-30 00:00:00 18.2
## 5 2012-05-31 00:00:00 18.3
## 6 2012-06-30 00:00:00 17.4

We have to convert the data into a time series object for further analysis.
min_date = min(Data$Time) #Start date of the time series
min_date

## [1] "2012-01-31 UTC"

max_date = max(Data$Time) #End date of the time series


max_date

## [1] "2019-06-30 UTC"

#Converting the data into a time series object

Data.ts = ts(Data$TB, start=c(2012,01), end=c(2019, 06), frequency = 12)


class(Data.ts)

## [1] "ts"

The frequency is given as “12” because the observations (TB incidence) are recorded monthly; if
the data were recorded yearly, the frequency would be set to “1”.
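As a small sketch (the toy values below are illustrative, not from the dataset), the `frequency` argument reflects the number of observations per seasonal cycle:

```r
# frequency = number of observations per cycle (illustrative toy series)
yearly    = ts(1:10, start = 2012, frequency = 1)        # one value per year
quarterly = ts(1:12, start = c(2012, 1), frequency = 4)  # four per year
monthly   = ts(1:24, start = c(2012, 1), frequency = 12) # twelve per year
frequency(monthly)  # returns 12
```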

Now the data is saved as a time series object, and we can plot it to observe its descriptive
features.

#Creating a time plot

plot(Data.ts, xlab ="Time", ylab="TB incidence")


We can infer that there is a slight downward trend in the incidence of TB cases in Guangxi, and
there are regular patterns every year, which indicate seasonality. To investigate further, we can
perform “Decomposition of the time series”.

Decomposition of a time series is a process of visually examining a series in an exploratory
fashion: the series is partitioned into the components of a time series (trend, seasonal variation,
and random noise).

#Install the packages "tseries" and "forecast" for further analysis

library(tseries)

library(forecast)

#Decomposition of time series

Decomposition = decompose(Data.ts)
plot(Decomposition)
The above figure depicts what is referred to as classical decomposition, in which a time series is
conceived of as comprising three components: a trend-cycle, a seasonal pattern, and a random
component (here the trend and cycle are combined because the duration of the cycle is unknown).

From the figure, the trend panel shows a downward trend and the seasonal panel shows recurring
seasonal patterns (which indicate seasonality). This decomposition of a time series leads to the
second step, called “Explanation”.

Next, we are going to look at an important concept called “Stationarity”.

Stationarity is one of the primary assumptions in time series analysis. Most time series models,
such as ARIMA and SARIMA, assume that the series is stationary. Broadly speaking, a time
series is said to be stationary if there is no change in the mean (no trend), no systematic change in
the variance, and its behaviour is independent of time (no seasonality).

Let’s see how to check whether the time series is stationary or not.

The first way is to plot the ACF and PACF plots.

• ACF refers to the “Auto-Correlation Function” and PACF refers to the “Partial Auto-Correlation
Function”. The horizontal blue lines in the plots mark the 95% Confidence Interval (CI).

• In the ACF plot, the vertical line at lag 0 indicates the correlation of the present value with
itself, so its value is equal to 1. With respect to stationarity, the time series is said to be
stationary if all the vertical lines fall inside the CI; if the vertical lines fall outside the CI,
then the series is not stationary.

• ACF and PACF are also used to determine the order of the ARIMA model, as we will see later.
#Plot the ACF plot

acf(Data.ts)

#Plot the PACF plot

pacf(Data.ts)
From the ACF plot we can see that the vertical lines cross the blue horizontal lines, which
indicates high correlation between the present values and their lagged versions. Therefore, the
time series is not stationary.

The second way is to conduct the Augmented Dickey-Fuller (ADF) test or the Kwiatkowski-
Phillips-Schmidt-Shin (KPSS) test.

ADF test:
Null Hypothesis (H0): The time series has a unit root (i.e., it is non-stationary).
Alternative Hypothesis (H1): The time series does not have a unit root (i.e., it is stationary).
If the p-value is less than 0.05, the time series is stationary.

KPSS test:
Null Hypothesis (H0): The time series is stationary (either around a level or a trend).
Alternative Hypothesis (H1): The time series is not stationary.
If the p-value is less than 0.05, the time series is not stationary.

#ADF test

adf.test(Data.ts, k=15)

## data: Data.ts
## Dickey-Fuller = -2.5906, Lag order = 15, p-value = 0.333
## alternative hypothesis: stationary

#KPSS test

kpss.test(Data.ts)
## Warning in kpss.test(Data.ts): p-value smaller than printed p-value

##
## KPSS Test for Level Stationarity
##
## data: Data.ts
## KPSS Level = 1.5344, Truncation lag parameter = 3, p-value = 0.01
From the above results of both the ADF test and KPSS test we can confirm that the time series is
not stationary and it requires transformation (differencing). These tests are available in the
package “tseries” in R [4].

Differencing is the most important method for stationarizing the mean of a time series. It can
remove any trend in the series that is not of interest. There are two types of differencing: trend
differencing and seasonal differencing. As the names suggest, if the time series shows a
significant trend we apply trend differencing; if it shows seasonal patterns, we apply seasonal
differencing; and if it shows both trend and seasonality, we have to apply both kinds of
differencing [2]. First-order differencing refers to subtracting the previous observation from the
current observation, and second-order differencing refers to differencing the series a second time,
i.e., applying first-order differencing to the already differenced series, and so on.
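Reusing the `Data.ts` object created earlier, differencing can be sketched with R's built-in `diff()` function; the `ndiffs()` and `nsdiffs()` helpers from the “forecast” package suggest how many non-seasonal and seasonal differences a series may need (this snippet is illustrative and not part of the original workflow):

```r
library(forecast)

# First-order (trend) differencing: y_t - y_{t-1}
Diff1 = diff(Data.ts, differences = 1)

# Seasonal differencing at lag 12: y_t - y_{t-12}
Diff12 = diff(Data.ts, lag = 12)

# Suggested number of differences to reach stationarity
ndiffs(Data.ts)   # non-seasonal differences needed
nsdiffs(Data.ts)  # seasonal differences needed
```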

The third step in time series analysis is “Prediction”. The primary aim of any time series
analysis is to predict future values based on the observed values. For this purpose we are going
to fit an ARIMA or SARIMA model to the time series object.

An ARIMA (p, d, q) model predicts the future values of a time series as a linear combination of its
past values and past forecast errors. This method is suitable for forecasting univariate data,
whether stationary or (after differencing) non-stationary. The parameter “p” refers to the order
of the auto-regressive part, “d” to the order of differencing, and “q” to the order of the moving
average part [5].

The SARIMA (Seasonal Auto-Regressive Integrated Moving Average) model, also known as
the Seasonal ARIMA model, extends the ARIMA model to handle seasonality in time series
data. SARIMA models combine non-seasonal and seasonal components and are denoted
SARIMA (p, d, q)(P, D, Q)s. The parameters “p”, “d”, and “q” are as in the ARIMA model,
while “P” refers to the order of the auto-regressive part with respect to the seasonal period, “D”
to the order of differencing applied with respect to the seasonal period, and “Q” to the order of
the moving average part with respect to the seasonal period. Lastly, “s” refers to the length of
the seasonal cycle (e.g., 12 for monthly data with annual seasonality) [6].
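For illustration, a SARIMA model with a specific order can also be fitted manually with the `Arima()` function from the “forecast” package (the orders below are hypothetical, chosen only to show the notation):

```r
library(forecast)

# Fit a SARIMA(1,0,0)(0,1,1)[12] model (hypothetical orders for illustration)
Manual_fit = Arima(Data.ts, order = c(1, 0, 0),
                   seasonal = list(order = c(0, 1, 1), period = 12))
summary(Manual_fit)
```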
The package “forecast” in R provides a useful function called “auto.arima”, which
automatically fits an ARIMA or SARIMA model to a time series [7]. The function selects the best
ARIMA model based on information criteria such as AIC (Akaike Information Criterion) or BIC
(Bayesian Information Criterion).

#Fitting the suitable ARIMA model for the observed time series

Prediction_model = auto.arima(Data.ts, ic ="aic", trace = T)

##
## ARIMA(2,0,2)(1,1,1)[12] with drift : Inf
## ARIMA(0,0,0)(0,1,0)[12] with drift : 242.998
## ARIMA(1,0,0)(1,1,0)[12] with drift : 230.2584
## ARIMA(0,0,1)(0,1,1)[12] with drift : 221.2812
## ARIMA(0,0,0)(0,1,0)[12] : 271.8778
## ARIMA(0,0,1)(0,1,0)[12] with drift : 238.6041
## ARIMA(0,0,1)(1,1,1)[12] with drift : Inf
## ARIMA(0,0,1)(0,1,2)[12] with drift : Inf
## ARIMA(0,0,1)(1,1,0)[12] with drift : 230.2804
## ARIMA(0,0,1)(1,1,2)[12] with drift : Inf
## ARIMA(0,0,0)(0,1,1)[12] with drift : 222.7067
## ARIMA(1,0,1)(0,1,1)[12] with drift : Inf
## ARIMA(0,0,2)(0,1,1)[12] with drift : Inf
## ARIMA(1,0,0)(0,1,1)[12] with drift : 221.4776
## ARIMA(1,0,2)(0,1,1)[12] with drift : Inf
## ARIMA(0,0,1)(0,1,1)[12] : 256.612
##
## Best model: ARIMA(0,0,1)(0,1,1)[12] with drift

We have fitted ARIMA models with different combinations of parameters and found the best
model based on the Akaike Information Criterion (AIC); the lower the value, the better the
model. The ARIMA model (actually a SARIMA model) with the combination (0,0,1)(0,1,1)[12]
is the best-fit model for the given time series.
#Summary of the best fit model

summary(Prediction_model)
## Series: Data.ts
## ARIMA(0,0,1)(0,1,1)[12] with drift
##
## Coefficients:
## ma1 sma1 drift
## 0.2482 -0.8834 -0.0600
## s.e. 0.1443 0.3411 0.0043
##
## sigma^2 = 0.7619: log likelihood = -106.64
## AIC=221.28 AICc=221.83 BIC=230.71
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set -0.0327287 0.7968 0.5878878 -0.3110781 4.415815 0.5231631
## ACF1
## Training set 3.835598e-05

Before using this model for prediction, we have to check whether its residuals are stationary
(i.e., resemble white noise) using the ACF and PACF plots.

acf(ts(Prediction_model$residuals))
pacf(ts(Prediction_model$residuals))

From the above ACF and PACF plots we can infer that all the vertical lines at each lag fall
inside the horizontal boundary, which indicates that the residuals of the fitted SARIMA model
are stationary after the seasonal differencing. So now we can predict future values based on the
observed values.

#The function forecast is used to predict the future values

Forecast_future = forecast(Prediction_model,level = c(95), h=10)


Forecast_future
## Point Forecast Lo 95 Hi 95
## Jul 2019 12.327558 10.577049 14.07807
## Aug 2019 11.152155 9.348518 12.95579
## Sep 2019 10.312491 8.508854 12.11613
## Oct 2019 9.830395 8.026758 11.63403
## Nov 2019 9.309122 7.505485 11.11276
## Dec 2019 8.606364 6.803326 10.40940
## Jan 2020 9.696074 7.902776 11.48937
## Feb 2020 9.396686 7.603388 11.18998
## Mar 2020 12.071068 10.277770 13.86437
## Apr 2020 11.587178 9.793880 13.38048

"level" indicates the confidence level required (95% in this case) and "h" indicates how many
time points ahead predictions are needed (here 10 refers to the next 10 months).
From the results we can read the predicted TB incidence in Guangxi for the next 10 months.

#plot the predictions as time plot

plot(Forecast_future)

The final step is to validate the model used for forecasting.

The Box-Jenkins Q Test (also known simply as the Box-Pierce test or Ljung-Box test) is used
to assess the goodness-of-fit of a time series model by testing whether the residuals from the
model resemble white noise [8].

Essentially, it checks whether there is any significant autocorrelation left in the residuals after
fitting the model, which would suggest that the model might not fully capture the underlying
structure of the data.

Null Hypothesis (H0): The residuals are white noise (i.e., they have no autocorrelation).

Alternative Hypothesis (H1): The residuals are not white noise (i.e., they exhibit
autocorrelation).
#Ljung-Box test for the goodness of fit of the model

Box.test(Forecast_future$residuals, type="Ljung-Box")
##
## Box-Ljung test
##
## data: Forecast_future$residuals
## X-squared = 1.3687e-07, df = 1, p-value = 0.9997

From the results of the Ljung-Box test, we can infer that the p-value is greater than 0.05.
Therefore, the given model is a good fit.
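As a convenient alternative (an extra step beyond the original workflow), the “forecast” package provides `checkresiduals()`, which combines a residual time plot, the ACF of the residuals, and the Ljung-Box test in a single call:

```r
library(forecast)

# Residual diagnostics: time plot, ACF, histogram, and Ljung-Box test
checkresiduals(Prediction_model)
```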

Conclusion:

• The results suggest that TB incidence will experience a slight decrease, and its changing
trend will be similar to before.

• The prediction results can help in reallocating resources for better control and prevention of
TB in Guangxi, China.

Summary:

• Import the data to R.

• Convert the data as a time series object.

• Plot and decompose the data.

• Test for stationarity.

• Apply the prediction model (ARIMA/SARIMA).

• Check for goodness of fit of the selected predictive model.

• Forecast the future values and interpret the results obtained.


References:

1. Jebb AT, Tay L, Wang W, Huang Q. Time series analysis for psychological research:
examining and forecasting change. Front Psychol. 2015 Jun 9;6:727.

2. Jose J. Introduction to Time Series Analysis and Its Applications. 2022 Aug 1.

3. Statistical methods for predicting tuberculosis incidence based on data from Guangxi, China |
BMC Infectious Diseases | Full Text [Internet]. [cited 2024 Sep 16]. Available from:
https://bmcinfectdis.biomedcentral.com/articles/10.1186/s12879-020-05033-3

4. Trapletti A, Hornik K. tseries: Time Series Analysis and Computational Finance [Internet].
1999 [cited 2024 Sep 16]. p. 0.10-57. Available from: https://CRAN.R-project.org/package=tseries

5. Kumar M, Anand M. An Application of Time Series ARIMA Forecasting Model for
Predicting Sugarcane Production in India. Studies in Business and Economics. 2014 Apr
30;9:81–94.

6. Liu J, Yu F, Song H. Application of SARIMA model in forecasting and analyzing inpatient
cases of acute mountain sickness. BMC Public Health. 2023 Jan 9;23(1):56.

7. Hyndman R, Athanasopoulos G, Bergmeir C, Caceres G, Chhay L, Kuroptev K, et al.
forecast: Forecasting Functions for Time Series and Linear Models [Internet]. 2009 [cited
2024 Sep 16]. p. 8.23.0. Available from: https://CRAN.R-project.org/package=forecast

8. Bobbitt Z. Ljung-Box Test: Definition + Example [Internet]. Statology. 2020 [cited 2024 Sep
16]. Available from: https://www.statology.org/ljung-box-test/
