Forecasting of COVID19 Per Regions Using ARIMA Models and Polynomial Functions
Article info

Article history: Received 10 June 2020; Received in revised form 30 July 2020; Accepted 2 August 2020; Available online 6 August 2020

Keywords: Covid-19 epidemic; Forecast; ARIMA model; Geographic region

Abstract

COVID-2019 is a global threat; for this reason, research around the world has focused on topics such as detecting, preventing, curing, and predicting it. Different analyses propose models to predict the evolution of this epidemic. These analyses propose models for specific geographical areas or specific countries, or create a global model. The models make it possible to predict the behavior of the virus, which could be used to prepare future response plans. This work presents an analysis of the COVID-19 spread that shows a different angle for the whole world, through 6 geographic regions (continents). We propose to create a relationship between the countries that are in the same geographical area to predict the advance of the virus. The countries in the same geographic region have variables with similar values (quantifiable and non-quantifiable), which affect the spread of the virus. We propose an algorithm to build and evaluate the ARIMA model for 145 countries, which are distributed into 6 regions. Then, we construct a model for these regions using the ARIMA parameters, the population per 1M people, the number of cases, and polynomial functions. The proposal is able to predict the COVID-19 cases with an average RMSE of 144.81. The main outcome of this paper is showing a relation between COVID-19 behavior and population in a region; these results show the opportunity to create more models to predict the COVID-19 behavior using variables such as humidity, climate, and culture, among others.

© 2020 Elsevier B.V. All rights reserved.
https://doi.org/10.1016/j.asoc.2020.106610
1568-4946/© 2020 Elsevier B.V. All rights reserved.
2 A. Hernandez-Matamoros, H. Fujita, T. Hayashi et al. / Applied Soft Computing Journal 96 (2020) 106610
models give results in terms of their predictive performance. The models make it possible to predict the behavior of the virus, which could be used to prepare future response plans. Some countries, such as Brazil and Sweden, have faced COVID-19 in the same way but with totally different consequences. The differences between these countries are geographical, demographic, economic, and cultural, and also concern public health and poverty, among others. As a result of these differences, Brazil had 694,116 cases while Sweden had 45,133 cases on June 8; that is, Brazil had more than 15 times as many cases as Sweden. For this reason, in this work we propose to create a relationship between the countries and two additional variables to predict the COVID-19 behavior. These variables are the geographic region and the total population of the country. The geographic regions are North America, South America, Africa, Oceania, Asia, and Europe. Countries in the same geographic region (continent) share variables with similar values, both quantifiable (climate, humidity, natural regions, etc.) and non-quantifiable (cultural similarities, similar gastronomy, among others).

We propose an algorithm to build and evaluate the Auto-Regressive Integrated Moving Average (ARIMA) model for 145 countries, which are distributed in 6 geographic regions. The ARIMA models use the information available until April 25, 2020. Next, the information is divided into 2 sets: the first set is used to create the ARIMA models and the second set is used
to calculate the RMSE between the real data and the predicted data. The first set uses 90% of the data and the second set uses 10% of the data. Then, the calculated parameters of the ARIMA models, the population per 1M people per country, and the number of cases per country are used to create polynomial functions, which are able to predict the ARIMA parameters. These polynomial functions generate models for the following geographic regions: North America, South America, Africa, Oceania, Asia, and Europe.

The results are evaluated using RMSE. The main contributions can be summarized as follows:

• We propose an algorithm to calculate the best ARIMA parameters per country with low RMSE.
• The algorithm to calculate the best ARIMA parameters is tested with 10% of the original data.
• Our approach analyzes 145 countries, almost 10 times more than another proposed scheme.
• The approach starts by analyzing particular cases (countries) to create a general case (geographic region).
• Our approach is able to show a relation between the prediction error and other variables. In this work, a relation between the prediction error and the population per 1M people is shown.

The paper is organized as follows. Section 2 briefly describes the databases used. Section 3 presents the proposed approach. The paper ends with the Results, a Discussion section, and the Conclusion.

2. Databases

The time series in this work are created using the data of "Our World in Data" [18], which is completely open access. They collect the data from the European Centre for Disease Prevention and Control (ECDC), the WHO, Johns Hopkins, United Nations, World Bank, Global Burden of Disease, Blavatnik School of Government, etc. They standardize the names of countries using the "Our World in Data" [18] standard entity names and discard inconsistencies detected in the original data; detailed documentation for each country is available [18]. Multiple time series for a country are collected; the complete COVID-19 dataset includes only the most complete figures for the number of people tested, confirmed cases, and deaths. The data on the coronavirus pandemic is updated daily. "Our World in Data" has 77 charts on COVID-19. Fig. 1 shows an example of one chart. The chart data contain information from 207 countries, so we can explore the COVID-19 statistics for the countries of the world. This work uses the information available until May 28. The consulted chart is "Total and daily confirmed COVID-19 cases", which is used to create the time series per country called "Total confirmed COVID-19 cases". The time series start on the day when each country presented its first case of COVID-19 and finish on May 28. This means that the length of the time series differs from country to country. For example, the time series of Canada starts on January 26 (123 days until May 28), that of Egypt on February 15 (103 days until May 28), that of China on December 31 (149 days until May 28), that of Italy on January 31 (118 days until May 28), that of Australia on January 25 (124 days until May 28), that of Brazil on February 26 (92 days until May 28), etc.

This paper proposes a model per geographic region. The countries are separated into 6 regions: North America (13 countries), South America (12 countries), Africa (43 countries), Oceania (4 countries), Asia (40 countries), and Europe (33 countries). To create the models per region, we use the "total population in the age groups" available on the website of the United Nations [19]. The populations of the age groups are added to obtain a total population. The values for each country are shown in the tables in Appendix A.

3. Proposed approach

The proposed approach consists of two stages, "Building the model" and "Evaluating the model". These stages are applied 6 times, once per region. We use the time series "Total confirmed COVID-19 cases". The first stage, "Building the model", requires the time series per country, which starts on the day when each country presented its first case of COVID-19 and finishes on April 25. The second stage, "Evaluating the model", requires the COVID-19 information available on May 28. Then the forecast between May 12 and May 28 is calculated and compared with the real values. In the following subsections, the proposed approach is explained in a general way and using an example. The example calculates the "p, D, q" values of ARIMA for Canada and builds the North America model.

3.1. Building the model

Fig. 2 shows a block diagram of this stage, which consists of the ARIMA and polynomial functions. The inputs of this stage are the time series of the countries per region and Rc, which are explained in Sections 3.1.1 (ARIMA stage) and 3.1.2 (Polynomial functions), respectively.

3.1.1. ARIMA stage

We use the time series "Total confirmed COVID-19 cases" per country. Then, we have a time series presented in the following equations:

$y = \{ y_t,\ t \in T \}$ (1)

$T = \{ T_1, T_2, T_3, \ldots, T_{1+n} \}$ (2)

In Eq. (1), $y$ means the total confirmed cases per day presented in a country. In Eq. (2), $T_1$ means the day when each country
Fig. 4. Canada time series separated into training and testing time series.
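The separation into modeling and testing series illustrated in Fig. 4, together with the selection of the (p, D, q) order that gives the lowest RMSE on the testing part, can be sketched as follows. This is only a minimal illustration under stated assumptions, not the authors' algorithm: the function names, the search ranges `max_p`, `max_d`, `max_q`, and the synthetic series are hypothetical, and `toy_forecaster` is a deliberately simple stand-in that does not fit an ARIMA model. In practice a real ARIMA fit would be passed in its place.

```python
import math
from itertools import product


def split_series(series, train_fraction=0.9):
    """Split a series into a modeling part (first 90%) and a testing part (last 10%)."""
    cut = int(len(series) * train_fraction)
    return series[:cut], series[cut:]


def rmse(actual, predicted):
    """Root mean square error between two equally long sequences."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def select_order(series, forecaster, max_p=2, max_d=2, max_q=2):
    """Try every (p, D, q) in the assumed ranges and keep the lowest testing RMSE."""
    train, test = split_series(series)
    best_order, best_score = None, None
    for order in product(range(max_p + 1), range(max_d + 1), range(max_q + 1)):
        forecast = forecaster(train, order, len(test))
        score = rmse(test, forecast)
        if best_score is None or score < best_score:
            best_order, best_score = order, score
    return best_order, best_score


def toy_forecaster(train, order, steps):
    """Hypothetical stand-in for an ARIMA fit; it only looks at D.

    D = 0 repeats the last value; D >= 1 extends the last observed difference.
    """
    _, d, _ = order
    slope = train[-1] - train[-2] if d >= 1 else 0
    return [train[-1] + slope * (k + 1) for k in range(steps)]


# Synthetic cumulative case counts (illustrative only, not real data).
series = [10 * t for t in range(1, 21)]
order, score = select_order(series, toy_forecaster)
print(order, score)
```

With a real ARIMA implementation, only the `forecaster` argument would change; the split and the RMSE-based selection stay the same.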
Table 1
Data of North America.
Columns: Country; Population per million people (ppMp); Total confirmed COVID-19 cases per 1M people (April 25, May 11); ARIMA parameters (p, D, q). RMSE average: 640.61.
the Algorithm 5. The Canada results using the model of North America are shown in Figs. 8–9. Eq. (9) is applied to calculate the RMSE between the forecast values and the real values. Fig. 8 shows a comparison between the real and forecast signals, and Fig. 9 shows the forecast for Canada with a 95% confidence interval.

4. Results

This section presents the results for each region analyzed. Table 2 shows the average RMSE per region. In the table, the RMSE is calculated between the forecast and the real values. Fig. 10 presents a comparison using the RMSE and the forecast for one
country per region, in the following order: (a) North America, (b) South America, (c) Africa, (d) Oceania, (e) Asia, and (f) Europe.

5. Discussion
Appendix A presents the results per country before the geographic models are created. These results belong to each country in the different regions. As mentioned in Section 3.1.1, the time series are separated into modeling (90% of the signal) and testing (10% of the signal). Below, we discuss each region in particular.

The North America region has 13 countries and presents an RMSE average of 640.61, the largest among the regions. This is because the United States presents the largest RMSE among the 145 countries (7749.99); this country also has the largest population in the region (329.06 ppMp). On the other hand, Belize presents the lowest RMSE. The United States has roughly 840 times the population of Belize (0.39 ppMp).

The Europe region consists of 33 countries. In this experiment, the countries with more than 45 ppMp present the largest RMSE values. Spain presents an RMSE of 1892.33 with 46.73 ppMp, Italy has 60.55 ppMp and presents an RMSE of 566.88,
Table 2
Average of RMSE results.
Region           Average RMSE between original and     Average RMSE between original and
                 forecast signal in training stage     forecast signal from May 12 to May 27
North America    640.61                                3.6051e+04
South America    104.78                                2.0828e+04
Africa           13.80                                 1.4913e+03
Oceania          6.79                                  161.2570
Asia             89.46                                 3.52964e+03
Europe           218.59                                2.88212e+04
Average (a)      144.81                                1.2723e+04
(a) The average is calculated using the 145 countries.
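As a quick consistency check on the training-stage column of Table 2, the overall average can be reproduced as a mean weighted by the number of countries per region. The sketch below is illustrative only: it assumes that each regional value is the unweighted mean of the RMSE of its countries, and it takes the country counts from Section 2 and the Discussion (13, 12, 43, 4, 40, and 33). The variable names are hypothetical.

```python
# (number of countries, average training-stage RMSE) per region, from Table 2
# and from the country counts quoted in the text.
regions = {
    "North America": (13, 640.61),
    "South America": (12, 104.78),
    "Africa": (43, 13.80),
    "Oceania": (4, 6.79),
    "Asia": (40, 89.46),
    "Europe": (33, 218.59),
}

total_countries = sum(count for count, _ in regions.values())

# Mean over all countries: each regional average weighted by its country count.
weighted_average = sum(count * value for count, value in regions.values()) / total_countries

# Plain mean of the six regional averages, shown only for contrast.
simple_average = sum(value for _, value in regions.values()) / len(regions)

print(total_countries, round(weighted_average, 2), round(simple_average, 3))
```

The weighted mean agrees with the reported 144.81, whereas the plain mean of the six regional values is about 179, which is why the footnote's remark that the average uses the 145 countries matters.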
Table 3
Comparison between [10] and this work.
Country        RMSE [10]       RMSE (this work)
Italy          1150.31         566.88
Turkey         138.35          1892.33
Spain          379.89          696.35
Average        556.183 (a)     144.81 (b)
(a) Using 3 countries.
(b) Using 145 countries.
Fig. 9. Forecasting of Canada.

United Kingdom presents an RMSE of 728.13 with 67.53 ppMp, Germany has 83.51 ppMp and presents an RMSE of 1075.02, and Russia presents an RMSE of 958.44 with 145.87 ppMp. The RMSE average of this region is 218.59.

Brazil presents an RMSE of 591; this country has the largest population in its region (211.04 ppMp). In contrast, Paraguay presents the lowest RMSE. Brazil has almost 30 times the population of Paraguay (7.05 ppMp). These countries belong to the South America region, whose RMSE average is 104.78.

The Asia region consists of 40 countries. In this region, Turkey presents the largest RMSE (696.35) with a population of 83.43 ppMp. China and India each have more than one thousand ppMp, but their RMSE values are 117.88 and 250.83, respectively. On the other hand, Yemen, with a population of less than 30 ppMp, has an RMSE close to zero.

Egypt presents an RMSE of 84.08; this country has 110.38 ppMp. In contrast, Namibia presents an RMSE close to zero. Egypt has almost 41 times the population of Namibia (2.49 ppMp). These countries belong to the Africa region, whose RMSE average is 13.8.

The Oceania region consists of 4 countries. In this region, Australia presents the largest RMSE (24.76) with a population of 25.20 ppMp. The lowest RMSE is presented by Fiji, with a population of less than 1 ppMp.

For the regions North America, South America, Oceania, and Europe, there is a relation between a larger ppMp and the prediction error. The RMSE of Africa is less than 15, even though it has countries with 200 ppMp. This could indicate a relation between the virus spread and, for example, the climate. For the Asia region the average RMSE is less than 90; this area was the first infected by COVID-19, so more data are available there to calculate the forecast.

Table 3 shows a comparison between [10] and this work before the geographic models are created. As shown in Table 3, this approach has a better RMSE when forecasting the virus in Italy; in contrast, [10] has a better RMSE when predicting the virus in Turkey and Spain. At first, it seems that their proposal is better than ours, but when the RMSE averages are compared, our proposal has a lower RMSE than theirs; in addition, we analyze 145 countries while they analyze only 3.

Fig. 10 presents the results for one country in each region. In Fig. 10(a–f), we can see an upward trend in the number of cases, with the exception of the United States, which shows a decrease in the number of cases. Recall that, from the beginning, the United States had the highest RMSE among all countries. When the geographic models are created, these models are used to predict new cases in a country. The results are shown in Table 2. The forecast is made 17 days after the models are calculated; we take this decision to have a real difference between the cases on April 25 and May 11, as shown in the tables in Appendix A. As expected, the RMSE grew because the prediction is made 17 days after the models were created and we calculate 15 days of predicted cases. In this time interval, actions taken by governments, such as quarantine control, stay-at-home campaigns, and social distancing, significantly affect the prediction. If the reader wants current predictions, the information needs to be updated and the "Building the model" stage repeated.

6. Conclusion

We can conclude that the algorithm to model and evaluate the ARIMA models is able to develop models that have low RMSE. In addition, this work shows a way to model the COVID-19 spread starting from particular cases to generate a general case. We can also conclude that this work contributes to researchers working on COVID-19 prediction. It shows there is a relation between the virus spread and the different variables present in the countries,
Fig. 10. An example of results per region. RMSE and forecast: (a) North America, (b) South America, (c) Africa, (d) Oceania, (e) Asia, and (f) Europe.
which belong to the same geographic region. Interestingly, we can find a relation between the population of a country and the RMSE of a prediction. In future extensions of the proposed work, different variables could be analyzed, for example, the date when the first coronavirus case is detected in the country, humidity, and temperature, among other variables. Other kinds of clusters
Table A.1
Data of the countries separated by geographical regions.
North America (RMSE average: 640.61)
Columns: Country; Population per million people (ppMp); Total confirmed COVID-19 cases per million people (April 25, May 11); ARIMA parameters (p, D, q).
could be applied, such as cultural behavior, religious behavior, hygiene habits, and feeding habits, among others. The approach is able to make current predictions; only the information needs to be updated.
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

This study is supported by JSPS, Japan KAKENHI (Grants-in-Aid for Scientific Research) #JP20K11955.

Appendix A

See Table A.1.

North America
$p_{NAp}(t) = p_1 t^n + p_2 t^{n-1} + p_3 t^{n-2} + \cdots + p_n t + p_{n+1};\ n = 14$
$p_{NAd}(t) = p_1 t^n + p_2 t^{n-1} + p_3 t^{n-2} + \cdots + p_n t + p_{n+1};\ n = 15$
$p_{NAq}(t) = p_1 t^n + p_2 t^{n-1} + p_3 t^{n-2} + \cdots + p_n t + p_{n+1};\ n = 47$

South America
$p_{SAp}(t) = p_1 t^n + p_2 t^{n-1} + p_3 t^{n-2} + \cdots + p_n t + p_{n+1};\ n = 23$
$p_{SAd}(t) = p_1 t^n + p_2 t^{n-1} + p_3 t^{n-2} + \cdots + p_n t + p_{n+1};\ n = 23$
$p_{SAq}(t) = p_1 t^n + p_2 t^{n-1} + p_3 t^{n-2} + \cdots + p_n t + p_{n+1};\ n = 47$

Africa
$p_{Afp}(t) = p_1 t^n + p_2 t^{n-1} + p_3 t^{n-2} + \cdots + p_n t + p_{n+1};\ n = 47$
$p_{Afd}(t) = p_1 t^n + p_2 t^{n-1} + p_3 t^{n-2} + \cdots + p_n t + p_{n+1};\ n = 47$
$p_{Afq}(t) = p_1 t^n + p_2 t^{n-1} + p_3 t^{n-2} + \cdots + p_n t + p_{n+1};\ n = 47$

Oceania
$p_{Op}(t) = p_1 t^n + p_2 t^{n-1} + p_3 t^{n-2} + \cdots + p_n t + p_{n+1};\ n = 47$
$p_{Od}(t) = p_1 t^n + p_2 t^{n-1} + p_3 t^{n-2} + \cdots + p_n t + p_{n+1};\ n = 47$
$p_{Oq}(t) = p_1 t^n + p_2 t^{n-1} + p_3 t^{n-2} + \cdots + p_n t + p_{n+1};\ n = 47$

Asia
$p_{Asp}(t) = p_1 t^n + p_2 t^{n-1} + p_3 t^{n-2} + \cdots + p_n t + p_{n+1};\ n = 14$
$p_{Asd}(t) = p_1 t^n + p_2 t^{n-1} + p_3 t^{n-2} + \cdots + p_n t + p_{n+1};\ n = 40$
$p_{Asq}(t) = p_1 t^n + p_2 t^{n-1} + p_3 t^{n-2} + \cdots + p_n t + p_{n+1};\ n = 50$

Europe
$p_{Ep}(t) = p_1 t^n + p_2 t^{n-1} + p_3 t^{n-2} + \cdots + p_n t + p_{n+1};\ n = 12$
$p_{Ed}(t) = p_1 t^n + p_2 t^{n-1} + p_3 t^{n-2} + \cdots + p_n t + p_{n+1};\ n = 50$
$p_{Eq}(t) = p_1 t^n + p_2 t^{n-1} + p_3 t^{n-2} + \cdots + p_n t + p_{n+1};\ n = 50$

References

[1] World Health Organization, Coronavirus disease (COVID-19) outbreak situation retrieved from: OnlineResource.
[2] Ali Narin, Ceren Kaya, Ziynet Pamuk, Automatic detection of coronavirus disease (COVID-19) using X-ray images and deep convolutional neural networks, 2020, arXiv, 003.10849.
[3] Chen Lin, Yuxiao Ding, Bin Xie, Zhujian Sun, Xiaogang Li, Zixian Chen, Meng Niu, Asymptomatic novel coronavirus pneumonia patient outside Wuhan: The value of CT images in the course of the disease, Clin. Imaging (ISSN: 0899-7071) 63 (2020) 7–9, http://dx.doi.org/10.1016/j.clinimag.2020.02.008.
[4] BioSpace, Quotient Sciences and CytoAgents Accelerate Potential Treatment for COVID-19 Cytokine Storm, retrieved from: OnlineResource.
[5] Domenico Benvenuto, Marta Giovanetti, Lazzaro Vassallo, Silvia Angeletti, Massimo Ciccozzi, Application of the ARIMA model on the COVID-2019 epidemic dataset, in: Data in Brief, Vol. 29, 2020, 105340, http://dx.doi.org/10.1016/j.dib.2020.105340, (ISSN 2352-3409).
[6] Fong Simon, Gloria Li, Nilanjan Dey, Ruben Gonzalez Crespo, Enrique Herrera-Viedma, Finding an accurate early forecasting model from small dataset: A case of 2019-nCoV novel coronavirus outbreak, Int. J. Interact. Multimed. Artif. Intell. 6 (2020) 132–140, http://dx.doi.org/10.9781/ijimai.2020.02.002.
[7] Gaetano Perone, An ARIMA model to forecast the spread and the final size of COVID-2019 epidemic in Italy, 2020, arXiv:2004.00382.
[8] Guorong Ding, Xinru Li, Yang Shen, Brief Analysis of the ARIMA model on the COVID-19 in Italy, medRxiv 2020.04.08.20058636. http://dx.doi.org/10.1101/2020.04.08.20058636.
[9] Hiteshi Tandon, Prabhat Ranjan, Tanmoy Chakraborty, Vandana Suhag, Coronavirus (COVID-19): ARIMA based time-series analysis to forecast near future, 2020, arXiv:2004.07859.
[10] Lutfi Bayyurt, Burcu Bayyurt, Forecasting of COVID-19 Cases and Deaths Using ARIMA Models, medRxiv 2020.04.17.20069237. http://dx.doi.org/10.1101/2020.04.17.20069237.
[11] Simon James Fong, Gloria Li, Nilanjan Dey, Rubén González Crespo, Enrique Herrera-Viedma, Composite Monte Carlo decision making under high uncertainty of novel coronavirus epidemic using hybridized deep learning and fuzzy rule induction, Appl. Soft Comput. (ISSN: 1568-4946) 93 (2020) 106282, http://dx.doi.org/10.1016/j.asoc.2020.106282.
[12] R.K. Singh, M. Rani, A.S. Bhagavathula, R. Sah, A.J. Rodriguez-Morales, H. Kalita, C. Nanda, S. Sharma, Y.D. Sharma, A.A. Rabaan, J. Rahmani, P. Kumar, Prediction of the COVID-19 pandemic for the top 15 affected countries: Advanced autoregressive integrated moving average (ARIMA) model, JMIR Public Health Surv. 6 (2) (2020) e19115, http://dx.doi.org/10.2196/19115.
[13] Xingde Duan, Xiaolei Zhang, ARIMA modelling and forecasting of irregularly patterned COVID-19 outbreaks using Japanese and South Korean data, 2020, 105779, http://dx.doi.org/10.1016/j.dib.2020.105779, (ISSN 2352-3409).
[14] Andrea L. Bertozzi, Elisa Franco, George Mohler, Martin B. Short, Daniel Sledge, The challenges of modeling and forecasting the spread of COVID-19, Proc. Natl. Acad. Sci. 117 (29) (2020) 16732–16738, http://dx.doi.org/10.1073/pnas.2006520117.
[15] Lixiang Li, Zihang Yang, Zhongkai Dang, Cui Meng, Jingze Huang, Haotian Meng, Deyu Wang, Guanhua Chen, Jiaxuan Zhang, Haipeng Peng, Yiming Shao, Propagation analysis and prediction of the COVID-19, Infect. Dis. Model. (ISSN: 2468-0427) 5 (2020) 282–292, http://dx.doi.org/10.1016/j.idm.2020.03.002.
[16] Kenji Mizumoto, Gerardo Chowell, Transmission potential of the novel coronavirus (COVID-19) onboard the diamond Princess Cruises Ship, Infect. Dis. Model. (ISSN: 2468-0427) 5 (2020) 264–270, http://dx.doi.org/10.1016/j.idm.2020.02.003.
[17] G.E.P. Box, G.M. Jenkins, Time series analysis, in: Forecasting and Control, Holden-Day, San Francisco, 1976.
[18] A. Max Roser, Hannah Ritchie, Esteban Ortiz-Ospina, Joe Hasell, Coronavirus pandemic (COVID-19), 2020, Published online at OurWorldInData.org. Retrieved from: 'https://ourworldindata.org/coronavirus' [Online Resource].
[19] United Nations, Department of Economic and Social Affairs, Population Dynamics, Retrieved from: OnlineResource.
[20] J. Fattah, L. Ezzine, Z. Aman, H. El Moussami, A. Lachhab, Forecasting of demand using ARIMA model, Int. J. Eng. Bus. Manag. 10 (2018).
[21] S.L. Ho, M. Xie, The use of ARIMA models for reliability forecasting and analysis, Comput. Ind. Eng. (ISSN: 0360-8352) 35 (1–2) (1998) 213–216, http://dx.doi.org/10.1016/S0360-8352(98)00066-7.
[22] J. Serrà, J.L. Arcos, An empirical evaluation of similarity measures for time series classification, Knowl. Based Syst. 67 (2014) 305–314.