Time Series Analysis in R A Beginner's Guide
SRINIVAS. S
The data used in this tutorial is from the article “Statistical methods for predicting tuberculosis
incidence based on data from Guangxi, China” [3]. It is an open access dataset containing
monthly TB incidence in Guangxi from January 2012 to June 2019.
The first step is “Description”. When you are presented with time series data, the first step is to
plot the data and observe the components of the time series, such as trend, seasonality and cycles.
library(readxl)
Data = read_xlsx("C:\\Users\\WELCOME\\Downloads\\TB time series data.xlsx")
head(Data)
## # A tibble: 6 × 2
## Time TB
## <dttm> <dbl>
## 1 2012-01-31 00:00:00 13.0
## 2 2012-02-29 00:00:00 17.1
## 3 2012-03-31 00:00:00 20.2
## 4 2012-04-30 00:00:00 18.2
## 5 2012-05-31 00:00:00 18.3
## 6 2012-06-30 00:00:00 17.4
We have to convert the data into a time series object for further analysis.
min_date = min(Data$Time) #Start date of the time series
min_date
## [1] "2012-01-31 UTC"
Data.ts = ts(Data$TB, start = c(2012, 1), frequency = 12)
class(Data.ts)
## [1] "ts"
The frequency is set to “12” because the observations (TB incidence) are taken monthly; if the
data were taken yearly, the frequency would be fixed at “1”.
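As a self-contained illustration of the conversion, here is a minimal sketch using a plain numeric vector (the first six values match the TB data shown above; the rest are made up for the example):

```r
# Simulated monthly values standing in for the TB incidence column
values <- c(13.0, 17.1, 20.2, 18.2, 18.3, 17.4,
            16.0, 15.2, 14.8, 14.1, 13.5, 13.9)

# start = c(2012, 1) means January 2012; frequency = 12 means monthly data
x <- ts(values, start = c(2012, 1), frequency = 12)

frequency(x)  # 12
start(x)      # 2012 1
class(x)      # "ts"
```

The `start` argument is a (year, period) pair, so quarterly data starting in Q3 2012 would use `start = c(2012, 3), frequency = 4`.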
Now the data is saved as a time series object, and we can decompose and plot it to observe its
components.
library(tseries)
library(forecast)
Decomposition = decompose(Data.ts)
plot(Decomposition)
The above figure depicts what is referred to as classical decomposition, in which a time series is
conceived of as comprising three components: a trend-cycle, a seasonal pattern and a random
component (here the trend and cycle are combined because the duration of the cycle is unknown).
From the trend panel we can see that there is a downward trend, and from the seasonal panel we
can see recurring seasonal patterns (which indicate seasonality). This decomposition of a time
series leads to the second step, called “Explanation”.
Stationarity is one of the primary assumptions of time series analysis. Most time series models,
like ARIMA, SARIMA, etc., assume that the time series is stationary. Broadly speaking, a time
series is said to be stationary if there is no change in mean (no trend), no systematic change in
variance, and its behaviour is independent of time (no seasonality).
Let’s see how to check whether the time series is stationary or not.
• ACF refers to “Auto Correlation Function” and PACF refers to “Partial Auto Correlation
Function”. The horizontal blue lines in the plot mark the 95% Confidence Interval (CI).
• The vertical line at lag 0 indicates the correlation of the present value with itself, so we
can see that the correlation is equal to 1. With respect to stationarity, the time series is
said to be stationary if all the vertical lines fall inside the CI; if vertical lines fall
outside the CI, then the series is not stationary.
• The ACF and PACF are further used to find the order of the ARIMA model, as we will see later.
#Plot the ACF plot
acf(Data.ts)
pacf(Data.ts)
From the ACF plot we can see that the vertical lines cross the blue horizontal line, which
indicates high correlation between the present value and its lagged versions. Therefore, the
time series is not stationary.
The second way is to conduct the Augmented Dickey-Fuller (ADF) test or the Kwiatkowski-
Phillips-Schmidt-Shin (KPSS) test.
ADF test:
Null Hypothesis (H0): The time series has a unit root (i.e., it is non-stationary).
Alternative Hypothesis (H1): The time series does not have a unit root (i.e., it is stationary).
If the p-value is less than 0.05, then the time series is stationary.
KPSS test:
Null Hypothesis (H0): The time series is stationary (either around a level or a trend).
Alternative Hypothesis (H1): The time series is not stationary.
If the p-value is less than 0.05, then the time series is not stationary.
#ADF test
adf.test(Data.ts, k=15)
## data: Data.ts
## Dickey-Fuller = -2.5906, Lag order = 15, p-value = 0.333
## alternative hypothesis: stationary
#KPSS test
kpss.test(Data.ts)
## Warning in kpss.test(Data.ts): p-value smaller than printed p-value
##
## KPSS Test for Level Stationarity
##
## data: Data.ts
## KPSS Level = 1.5344, Truncation lag parameter = 3, p-value = 0.01
From the above results of both the ADF test and KPSS test we can confirm that the time series is
not stationary and it requires transformation (differencing). These tests are available in the
package “tseries” in R [4].
Differencing is the most important method of stationarizing the mean of a time series. It can
remove any trend in the series which is not of interest. There are two types of differencing: trend
differencing and seasonal differencing. As the names suggest, if the time series shows a
significant trend we apply trend differencing; if the time series shows seasonal patterns, we
apply seasonal differencing; and if the time series shows both trend and seasonality, we apply
both kinds of differencing [2]. First-order differencing refers to subtracting the previous
observation from the current observation, and second-order differencing refers to differencing
the already differenced series (applying first-order differencing twice), and so on.
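A minimal sketch of both kinds of differencing using base R's `diff()` (the toy values are made up; the `lag = 12` call assumes monthly data with annual seasonality):

```r
# A short toy series with an upward trend
x <- c(10, 12, 15, 19, 24, 30)

# First-order (trend) differencing: x[t] - x[t-1]
d1 <- diff(x)
d1  # 2 3 4 5 6

# Second-order differencing: first-order differencing applied twice
d2 <- diff(x, differences = 2)
d2  # 1 1 1 1

# Seasonal differencing for monthly data: x[t] - x[t-12]
# ds <- diff(x, lag = 12)  # needs more than 12 observations
```

Note that each round of differencing shortens the series by one (or by `lag`) observation.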
The third step in time series analysis is “Prediction”. The primary aim of any time series
analysis is to predict future values based on the observed values. For this purpose,
we are going to fit an ARIMA or SARIMA model to the time series object.
An ARIMA (p, d, q) model predicts the future values of a time series as a linear combination of its
past values and a series of errors. This method is suitable for forecasting when the data is
univariate, whether stationary or non-stationary. The parameter “p” in the model refers to the order of
the auto-regressive part, “d” refers to the order of differencing and “q” refers to the order of the
moving average part [5].
The SARIMA (Seasonal Auto-Regressive Integrated Moving Average) model, also known as
the Seasonal ARIMA model, extends the ARIMA model to handle seasonality in time series
data. SARIMA models combine non-seasonal and seasonal components. It is denoted as
SARIMA (p, d, q)(P, D, Q)s. The parameters “p”, “d” and “q” are the non-seasonal orders as in
ARIMA. The parameter “P” refers to the order of the auto-regressive part with respect to the
seasonal period, “D” refers to the order of differencing applied with respect to the seasonal
period and “Q” refers to the order of the moving average part with respect to the seasonal period.
Lastly, “s” refers to the length of the seasonal cycle (e.g. 12 for monthly data with annual
seasonality) [6].
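To make the notation concrete, here is a hedged sketch fitting a SARIMA(0,1,1)(0,1,1)[12] model with base R's `stats::arima()` on the built-in monthly `AirPassengers` series; the orders chosen here are purely illustrative, not tuned for the TB data:

```r
# AirPassengers: built-in monthly series with trend and annual seasonality.
# Log-transform to stabilise the growing seasonal variance.
fit <- arima(log(AirPassengers),
             order    = c(0, 1, 1),                            # (p, d, q)
             seasonal = list(order = c(0, 1, 1), period = 12)) # (P, D, Q)[s]

coef(fit)  # estimated ma1 and sma1 coefficients
AIC(fit)   # information criterion for model comparison (lower is better)
```

Here d = 1 and D = 1 mean the series is differenced once at lag 1 (trend) and once at lag 12 (season) before the MA terms are fitted.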
The package “forecast” in R provides a useful function called “auto.arima” which
automatically fits an ARIMA or SARIMA model to a time series [7]. The function selects the best
ARIMA model based on information criteria like AIC (Akaike Information Criterion) or BIC
(Bayesian Information Criterion).
#Fitting the suitable ARIMA model for the observed time series
Prediction_model = auto.arima(Data.ts, trace = TRUE)
##
## ARIMA(2,0,2)(1,1,1)[12] with drift : Inf
## ARIMA(0,0,0)(0,1,0)[12] with drift : 242.998
## ARIMA(1,0,0)(1,1,0)[12] with drift : 230.2584
## ARIMA(0,0,1)(0,1,1)[12] with drift : 221.2812
## ARIMA(0,0,0)(0,1,0)[12] : 271.8778
## ARIMA(0,0,1)(0,1,0)[12] with drift : 238.6041
## ARIMA(0,0,1)(1,1,1)[12] with drift : Inf
## ARIMA(0,0,1)(0,1,2)[12] with drift : Inf
## ARIMA(0,0,1)(1,1,0)[12] with drift : 230.2804
## ARIMA(0,0,1)(1,1,2)[12] with drift : Inf
## ARIMA(0,0,0)(0,1,1)[12] with drift : 222.7067
## ARIMA(1,0,1)(0,1,1)[12] with drift : Inf
## ARIMA(0,0,2)(0,1,1)[12] with drift : Inf
## ARIMA(1,0,0)(0,1,1)[12] with drift : 221.4776
## ARIMA(1,0,2)(0,1,1)[12] with drift : Inf
## ARIMA(0,0,1)(0,1,1)[12] : 256.612
##
## Best model: ARIMA(0,0,1)(0,1,1)[12] with drift
We have fitted ARIMA models with different combinations of parameters and found the best
model based on the Akaike Information Criterion (AIC); the lower the value, the better the
model. The ARIMA model (actually a SARIMA) with the combination (0,0,1)(0,1,1)[12] is the
best-fit model for the given time series.
#Summary of the best fit model
summary(Prediction_model)
## Series: Data.ts
## ARIMA(0,0,1)(0,1,1)[12] with drift
##
## Coefficients:
## ma1 sma1 drift
## 0.2482 -0.8834 -0.0600
## s.e. 0.1443 0.3411 0.0043
##
## sigma^2 = 0.7619: log likelihood = -106.64
## AIC=221.28 AICc=221.83 BIC=230.71
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set -0.0327287 0.7968 0.5878878 -0.3110781 4.415815 0.5231631
## ACF1
## Training set 3.835598e-05
Before using this model for prediction, we have to check whether its residuals behave like
white noise (no remaining autocorrelation) using the ACF and PACF plots.
acf(ts(Prediction_model$residuals))
pacf(ts(Prediction_model$residuals))
From the above ACF and PACF plots we can infer that the vertical lines at all lags fall
inside the horizontal boundary, which indicates that no autocorrelation remains in the
residuals: the seasonal differencing within the SARIMA model has made the series stationary.
So now we can predict future values based on the observed values.
"level" indicates the confidence interval required (95% in this case) and "h" indicates for how
many times points you need predictions (here 10 refers to next 10 months)
From the results we can infer the prediction of next 10 months for the TB incidence in Guangxi.
plot(Forecast_future)
The Box-Jenkins Q test (the Box-Pierce test, or its refined form, the Ljung-Box test) is used
to assess the goodness-of-fit of a time series model by testing whether the residuals from the
model resemble white noise [8].
Essentially, it checks whether there is any significant autocorrelation left in the residuals after
fitting the model, which would suggest that the model might not fully capture the underlying
structure of the data.
Null Hypothesis (H0): The residuals are white noise (i.e., they have no autocorrelation).
Alternative Hypothesis (H1): The residuals are not white noise (i.e., they exhibit
autocorrelation).
#Ljung-Box test for the goodness of fit of the model
Box.test(Forecast_future$residuals, type="Ljung-Box")
##
## Box-Ljung test
##
## data: Forecast_future$residuals
## X-squared = 1.3687e-07, df = 1, p-value = 0.9997
From the results of the Ljung-Box test, we can infer that the p-value is greater than 0.05.
Therefore, the given model is a good fit.
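One caveat: `Box.test()` defaults to `lag = 1`, which tests only the first residual autocorrelation. For monthly models it is common to test a full seasonal cycle of lags and to subtract the number of estimated ARMA coefficients via `fitdf`. A small sketch on simulated residuals (the `lag` and `fitdf` values here are illustrative, not taken from the tutorial):

```r
set.seed(42)
resid_sim <- rnorm(90)  # simulated residuals standing in for model residuals

# Jointly test the first 12 residual autocorrelations; fitdf is the number of
# estimated ARMA coefficients (e.g. 2 for a model with ma1 + sma1 terms)
result <- Box.test(resid_sim, lag = 12, fitdf = 2, type = "Ljung-Box")
result$p.value    # a large p-value means residuals are consistent with white noise
result$parameter  # degrees of freedom = lag - fitdf = 10
```

As before, a p-value above 0.05 means we do not reject the white-noise hypothesis.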
Conclusion:
• The results suggest that TB incidence will experience a slight decrease, and its changing
trend will be similar to before.
• The prediction results can help in reallocating resources for better control and
prevention of TB in Guangxi, China.
References:
1. Jebb AT, Tay L, Wang W, Huang Q. Time series analysis for psychological research:
examining and forecasting change. Front Psychol. 2015 Jun 9;6:727.
3. Statistical methods for predicting tuberculosis incidence based on data from Guangxi, China |
BMC Infectious Diseases | Full Text [Internet]. [cited 2024 Sep 16]. Available from:
https://bmcinfectdis.biomedcentral.com/articles/10.1186/s12879-020-05033-3
4. Trapletti A, Hornik K. tseries: Time Series Analysis and Computational Finance [Internet].
1999 [cited 2024 Sep 16]. p. 0.10-57. Available from:
https://CRAN.R-project.org/package=tseries
8. Bobbitt Z. Ljung-Box Test: Definition + Example [Internet]. Statology. 2020 [cited 2024 Sep
16]. Available from: https://www.statology.org/ljung-box-test/