2022 5th International Conference on Advances in Science and Technology (ICAST)

Epidemic Outbreak Prediction Using Machine Learning Model

Soham Shinde, Seema Yadav, Ashelesha Somvanshi
Information Technology, K.J Somaiya Institute of Engineering and Information Technology, Mumbai, India

978-1-6654-9263-8/22/$31.00 ©2022 IEEE | DOI: 10.1109/ICAST55766.2022.10039594

Abstract: The use of intelligent models for predicting diseases, and for building models that help doctors prevent diseases from spreading globally, is increasing day by day. When a disease spreads rapidly in a short period of time in a specific area, it is called an epidemic outbreak. An outbreak might start in a single community or spread across multiple countries. It might last anywhere from a few days to several years. Public health organizations (PHOs) take preventative measures to stop a disease from spreading, and they benefit greatly from accurate prediction of infectious disease. With the emergence of big data in the health and biomedicine sectors, precise data analysis aids early disease identification and better patient treatment. It is now increasingly viable to use massive computing power to predict and manage outbreaks. Our goal is to investigate and determine how outbreaks spread in villages and suburbs where medical care may be limited. A machine learning model is required to forecast epidemic dynamics and identify where the next outbreak is most likely to occur. Our method considers the climate, geography, and population distribution of the impacted region, because these are important features that contribute subtly to the dynamics of the epidemic. Our approach will assist health authorities in taking the necessary steps to guarantee that there are sufficient resources to fulfil demand and, if feasible, to prevent epidemics from arising.

Keywords: Epidemic, Zika Virus, Outbreak, Cases, Classification, Regression, Weather, Population Density, Prediction

I. INTRODUCTION
The globe is currently dealing with an evolving spectrum of infectious diseases, which has been influenced by organismal evolution, human demographics, global travel, environmental modifications, and closure of institutions for public health. Current infectious disease surveillance and response procedures are insufficient to meet growing needs for prevention, detection, reporting, and response. The ability to predict diseases will equip governments and healthcare practitioners with a way to respond promptly to epidemics, reducing the damage and conserving limited resources. For many infectious diseases, especially those conveyed by arthropod vectors, it is possible to anticipate the potential for epidemics in time and space using sophisticated monitoring and modelling techniques that combine environmental data. These tactics can offer useful, timely, and cost-effective tools when paired with communication technologies. This study focuses on the Zika virus outbreak, and we took multiple disease dynamics into account to enable us to make accurate predictions.

II. LITERATURE SURVEY
Effective outbreak prediction models are needed to learn more about the anticipated spread and effects of infectious disease, drawing on insights from other legislative bodies and government. Prediction models are used to suggest new strategies and to assess the effectiveness of those that have already been put in place (Remuzzi & Remuzzi, 2020) [1].

Model uncertainty is greatly increased by the complexity of community activities in different geographic regions and by changes in control techniques, in addition to the many known and unknown factors underlying transmission (Darwish, Rahhal, & Jafar, 2020) [2]. Traditional epidemiological models are therefore faced with new challenges in providing more reliable data. Many new models have emerged to address this problem, each of which adds a set of assumptions to the modelling process (Scarpino & Petri, 2019) [3].

SEIR models demonstrated improved model accuracy for the Zika and Varicella outbreaks by accounting for the lengthy incubation period that infected patients experience (Pan, Huang, & Chen, 2012) [4]. SEIR models assume a random variable that affects the incubation duration and a disease-free equilibrium, just like the SIR model [5].

Due to the complexity and magnitude of the challenge of developing health-related models, machine learning has recently gained importance for producing epidemic prediction models. Machine learning methods aim to produce models with improved generalisation capability and reliable prediction with increased lead periods (Yin, Tran, Zhou, Zheng, & Kwoh, 2018) [6].

When using TextBlob in Python to perform sentiment analysis, polarity and subjectivity are two important factors to consider. As proposed by (Alka, 2018) [7], it focuses on common tasks such as part-of-speech tagging, noun phrase extraction, text classification, and sentiment analysis. In Python, tokens supplied to TextBlob can be processed as strings for natural language processing. The sentiment analyzer returns a tuple of the form Sentiment(polarity, subjectivity), with polarity in the range [-1.0, 1.0] and subjectivity in the range [0.0, 1.0].

The outcome of data analysis is greatly improved when using an interactive approach, which also significantly enhances comparative analysis [8]. Buja, Cook, & Swayne (1996) suggested an interactive data visualization approach focused on specific analytic tasks such as comparisons. Plotly, an open-source interactive Python graphing package, is used in this study. It can plot statistical charts, financial charts, scientific charts, and other sorts of charts.

Traditional prediction methodologies make it difficult to manage time components, whereas time series forecasting techniques take time components into consideration with
other factors. Time series forecasting produces far more accurate results than traditional prediction approaches. According to the findings of Taylor's (2008) study, there is a lot of potential for using time series forecasting to anticipate future outcomes [9].

Artificial Intelligence model as a Dengue Outbreak Predictor: This model was turned into a Graphical User Interface, which was intended to help and educate the general population in areas at risk of a dengue outbreak. It employs a Bayesian network machine learning technique with 79-84 percent accuracy (Chenar & Deng, 2018) [10]. A combination of meteorological data and random forest algorithms was used to forecast global African swine fever (ASF) outbreaks [11].

The 12 features used for modelling were subjected to a logistic regression analysis, after which Receiver Operating Characteristic (ROC) curves were produced, and it was assumed that precipitation had a substantial impact on the pandemic's onset. Random forest algorithms were used in conjunction with the subset Evaluator-Best First feature selection method, using ASF outbreak data and weather data from the WorldClim database [12].

III. PROBLEM DEFINITION
COVID-19 has shown the true state of our healthcare infrastructure, planning, and preparedness to deal with a pandemic: limited resources, unprepared personnel, and shattered supply chains. If government agencies are given information about a likely outbreak ahead of time, they may plan and execute projects in a well-organized manner, maximizing the utilization of employees and resources. The objective of this project is to investigate and develop a multimodal model that can foretell the likelihood of an outbreak in a certain place.

Motivation
The sole motivation for this problem statement is the current pandemic, which arose abruptly. Coronavirus disease (COVID-19) is a viral condition caused by a recently discovered coronavirus. The COVID-19 situation could have long-term and severe impacts on countries, humanity, and cooperation between nations, as well as complicating the management of sickness and crises. There are growing signs that the world after the crisis will change and that globalisation will be called into question in a number of situations.

Scope
The scope of our project is broad and global. We will include numerous ML algorithms for predicting the outbreak of an epidemic disease in a certain region. Our methodology considers the climate, geography, and population distribution of the afflicted area, because these factors are pertinent and subtly influence the dynamics of the disease outbreak.
1. Reduce avoidable pain from sickness, and reduce the expenditure load on government and healthcare systems by giving them first-hand knowledge of outbreak hotspots and the agents that cause epidemics to spread.
2. The machine learning model must identify the next outbreak-prone locations, and the attributes that greatly aid the propagation of the outbreak, given a region where an epidemic outbreak has already occurred.

A. Project Architecture

Fig.1. Project Architecture

We begin our project by collecting data from several sources and parameters, such as the Zika virus dataset, the weather dataset, and the population density dataset of virus-affected areas. Following that, we perform exploratory data analysis to extract useful information from the dataset, and prepare the data suitably. In addition, we generate two datasets, for the classification and regression models respectively, and perform hyperparameter tuning to pick the best-performing model.

Finally, we design a user interface to display the predictions and results.

IV. IMPLEMENTATION AND RESULTS

A. Technology Stack
For all machine learning activities, we primarily employed the Python programming language. The Scikit-Learn package, along with Pandas, NumPy, and Seaborn, is used to build models. We also built a dashboard in Tableau. To make the front end of our website responsive, we used HTML, CSS, JavaScript, and Bootstrap. The Flask API is used in the backend, with deployment on Heroku.

B. Data Collection
We collected three major datasets to help us predict the outbreak of Zika virus:
• The Zika Data Repository from the Centers for Disease Control and Prevention (CDC), which makes information on the Zika virus accessible to the general public. It gave us enough information to build and test the model.
• A historical weather dataset collected from World Weather Online through its API.
• A population density dataset of the virus-struck zones where we are doing the prediction.
• Also, latitude and longitude information for these areas, for plotting purposes.
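
As a rough illustration of how the sources listed above could be combined into a single modelling table, the sketch below joins case counts with weather readings and population density by location and date. All field names and values are hypothetical, and the binary `any_cases` target is only an assumption about how a classification dataset might be derived from the counts; the paper does not describe its schema. Plain Python is used here instead of Pandas so that the sketch has no dependencies.

```python
# Hypothetical records: field names and values are illustrative only.
cases = [
    {"location": "Nicaragua", "date": "2016-06-01", "cases": 12},
    {"location": "Argentina", "date": "2016-06-01", "cases": 0},
]
weather = {
    ("Nicaragua", "2016-06-01"): {"precip_mm": 14.2, "humidity": 81.0, "max_temp_c": 31.0},
    ("Argentina", "2016-06-01"): {"precip_mm": 0.0, "humidity": 55.0, "max_temp_c": 17.0},
}
density = {"Nicaragua": 51.0, "Argentina": 16.0}  # people per sq. km (made-up values)


def build_rows(cases, weather, density):
    """Join case counts with weather and population density by location and date."""
    rows = []
    for rec in cases:
        key = (rec["location"], rec["date"])
        if key not in weather or rec["location"] not in density:
            continue  # skip records that cannot be fully joined
        row = dict(rec)
        row.update(weather[key])
        row["pop_density"] = density[rec["location"]]
        # Assumed targets: a yes/no label for classification, the raw count for regression.
        row["any_cases"] = int(rec["cases"] > 0)
        rows.append(row)
    return rows


rows = build_rows(cases, weather, density)
```

Each resulting row carries both the case count and the derived label, so the same joined table could feed either kind of model.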

C. Exploratory Data Analysis
We study the data gathered in this process to see if there are any relevant insights that can help us make sense of it. We analyse two major elements that may have a significant impact on the frequency of Zika virus cases: the population density of a region, and the weather in that region.

Effect of Population density on the number of cases
In any epidemic disease, the population plays a critical role. As a result, we took into account the population density per sq. km of the specific place where cases were discovered. We saw that the country with the fewest cases has a high population density, whereas the country with the most cases has a low population density.

TABLE I. POPULATION DENSITY OF COUNTRIES WITH MOST AND LEAST NUMBER OF CASES

Country                    No. of cases    Population Density (per sq. km)
Panama Metro Las_Garzas    9389432         1966.26
Virgin Islands (US)        19              24970.13

Fig.2. Scatter plot for number of cases vs population density

Effect of Weather on the number of cases:
We noticed that certain countries in our list are tropical, while others are subtropical. As a result, we decided to track the weather trends in these areas and assess their impact on the number of cases.

Tropical locations are those where the months of a specific season are seen in a consistent manner. The term "subtropical" refers to areas where there is no definite season period.

Countries belonging to tropical regions: Colombia, Brazil, Ecuador, Dominican Republic, El Salvador, Haiti, Guatemala, Panama, Nicaragua.

Countries belonging to subtropical regions: United States, Argentina, Mexico.

As we know, Zika virus is a mosquito-borne disease, and mosquitoes are most active in the rainy season. So, for weather analysis, we considered three main factors that help us understand whether there was any rain: precipitation, humidity, and maximum temperature of the area.

Fig.3. Weather effect on number of cases in Nicaragua

In the case of Nicaragua, which is a tropical region, we can see that precipitation and humidity have a positive correlation with the number of cases, indicating that the number of cases in this country is likely to increase with the onset of the rainy season.

Fig.4. Weather effect on number of cases in Argentina

In the case of Argentina, we can see that precipitation, humidity, and temperature have a negative correlation with the number of cases, indicating that in such regions weather has minimal effect on the number of cases.

Analysis of weather with incubation period of Zika virus:
For any case observed, the process of getting infected happens 7 to 14 days prior. So, we have to observe the trend of weather in those 7 days. The incubation period of Zika virus is generally 3-14 days. In our analysis, we have considered an incubation period of 7 days.

Case 1: When the number of cases is 0, the weather conditions 7 days prior are shown in Fig. 5.
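
The weather analysis in this section rests on two ideas: looking back over the 7 days before an observation, and checking whether a weather variable correlates with the number of cases. A minimal sketch of both follows. It assumes daily series keyed by date and uses made-up data; the helper names `prior_window_mean` and `pearson` are hypothetical, and the paper does not state how its correlations were actually computed.

```python
from datetime import date, timedelta
from math import sqrt


def prior_window_mean(series, day, days=7):
    """Mean of `series` over the `days` days before `day` (excluding `day` itself)."""
    values = [series[day - timedelta(d)] for d in range(1, days + 1)
              if day - timedelta(d) in series]
    return sum(values) / len(values) if values else None


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences with non-zero variance."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


# Made-up daily data: rainfall rises steadily and cases follow about a week later.
start = date(2016, 1, 1)
precip = {start + timedelta(i): float(i) for i in range(30)}
cases = {start + timedelta(i): 2 * max(i - 7, 0) for i in range(30)}

# Pair each day's case count with the mean rainfall of the preceding 7 days.
pairs = [(prior_window_mean(precip, d), c) for d, c in cases.items()]
pairs = [(m, c) for m, c in pairs if m is not None]
r = pearson([m for m, _ in pairs], [c for _, c in pairs])
```

A value of `r` close to +1 would correspond to the Nicaragua-style pattern described above, and a negative value to the Argentina-style pattern.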

Fig.5. Previous 7 days weather condition for case 0

We can observe that temperature, precipitation and humidity levels were quite normal, indicating ordinary days.

Case 2: When the number of cases is 214, the weather conditions 7 days prior are:

Fig.6. Previous 7 days weather condition for cases 214

Here, we can observe an uneven trend in weather conditions in the 7 days before the maximum number of cases was observed on a single day.

D. Model Building
For prediction, we built two kinds of model: a classification model to predict the probability of any cases, and a regression model to predict the probable number of cases.

For classification, the following baseline models were built and tested.

Fig.7. Performance evaluation of classification models

We can see that the AdaBoost model performed pretty well compared to the other classification models.

After hyperparameter tuning, we can observe that the CatBoost classification model gave better accuracy than the other models, at 60.40%.

Fig.8. Performance evaluation after tuning

Regression models:
For regression, the following baseline models were built and tested.

TABLE II. PERFORMANCE EVALUATION OF REGRESSION MODELS

Model                MSE          R2 score
Linear Regression    482209.82    0.00673
Lasso                482521.69    0.00624
Ridge                482442.85    0.00673
Decision Tree        902975.40    -0.859
Random Forest        669564.42    -0.37
XGBoost              512600.67    -0.005

Here, we can see that 7 models were tested for prediction, out of which the Linear Regression model gave quite standard results.
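
To make the regression metrics easier to interpret, the sketch below computes MSE and the R2 score from their usual definitions and then uses validation MSE to choose a regularisation strength by grid search. An R2 of 0 means the model does no better than always predicting the mean of the observed values, and a negative R2 means it does worse. The data, the one-feature ridge model, and the candidate `alphas` are all assumptions made for illustration; the paper does not state which search procedure or search space was used.

```python
def mean_squared_error(y_true, y_pred):
    """Average squared difference between observed and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def fit_ridge_1d(xs, ys, alpha):
    """Closed-form ridge slope for y = w * x (one feature, no intercept)."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + alpha)


def grid_search(train, valid, alphas):
    """Return (alpha, validation MSE) for the candidate with the lowest MSE."""
    best = None
    for alpha in alphas:
        w = fit_ridge_1d(train[0], train[1], alpha)
        preds = [w * x for x in valid[0]]
        score = mean_squared_error(valid[1], preds)
        if best is None or score < best[1]:
            best = (alpha, score)
    return best


# Made-up daily case counts with one large spike.
y_true = [0, 0, 5, 20, 214]
mean_baseline = [sum(y_true) / len(y_true)] * len(y_true)
poor_model = [100, 100, 100, 100, 100]

r2_baseline = r2_score(y_true, mean_baseline)  # predicting the mean scores 0
r2_poor = r2_score(y_true, poor_model)         # worse than the mean, so negative

# Made-up training and validation data for the tuning step.
train = ([1.0, 2.0, 3.0, 4.0], [2.1, 3.9, 6.2, 7.8])
valid = ([5.0, 6.0], [9.8, 11.8])
best_alpha, best_mse = grid_search(train, valid, [0.0, 0.1, 0.5, 1.0, 10.0])
```

Read this way, the negative R2 scores reported for the baseline Decision Tree, Random Forest, and XGBoost models mean that those baselines predicted case counts less accurately than a constant mean would have.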

After hyperparameter tuning of these models:

TABLE III. PERFORMANCE EVALUATION AFTER TUNING

Model            MSE          R2 score    % increase
Lasso            546999.53    0.0067      7.46
Ridge            514096.20    0.0066      1.49
Decision Tree    481223.73    0.0060      101.02
Random Forest    441754.63    0.090       124.32

It has been observed that after hyperparameter tuning, the performance of the Random Forest Regressor is better than that of the other models.

Model Selection
We have used AutoML to select the best-performing model for classification and regression in order to gain better conclusions and results.

For Classification:
AutoML method used: TPOT
Best Model: XGBoost Classifier

TABLE IV. BEST PARAMETERS AFTER HYPERPARAMETER TUNING

Parameter                  Parameter
base_score                 max_depth
booster                    min_child_weight
colsample_bylevel          missing
colsample_bynode           monotone_constraints
colsample_bytree           n_estimators
gamma                      n_jobs
gpu_id                     num_parallel_tree
importance_type            random_state
interaction_constraints    reg_alpha
learning_rate              reg_lambda, validate_parameters
max_delta_step             scale_pos_weight
                           tree_method, verbosity

Performance Evaluation: Accuracy = 60.373%

For Regression:
AutoML method used: Auto-sklearn
Best Model: Random Forest Regressor
Performance Evaluation:
Mean Squared Error = 441754.63
Coefficient of determination (R2 score) = 0.09

These two models were selected for classification and regression prediction, respectively.

E. Model Deployment
We used the Flask API to deploy the model on Heroku, and the front end was created entirely with HTML, CSS, and Bootstrap. The website's functionality is as follows: for prediction on the webpage, users need to provide a country name and a start and finish date. The model will forecast the possibility of an outbreak based on the number of cases.

We may also analyze the pattern of new cases to see if they will rise or fall.

Fig.9. Enter location and duration for prediction

Fig.10. Predicted results

Fig.11. Detailed report of next 7 days prediction

Fig.12. Line graph to observe the trend of cases

Fig.13. Map to know the countries which are observed for prediction.

Fig.14. List of locations for prediction

We have to select the location from this list and thus make the prediction for the number of cases in that location.

V. CONCLUSION
The main aim of this paper is to make people aware of any epidemic diseases that prevail in their surroundings. Our key emphasis was on studying and analyzing the data we gathered before moving on to building prediction models. With Epidemic Outbreak Prediction, not only the general public but also the government can take the steps required to limit epidemic diseases and prevent their spread. This project is very adaptable, and we can use it to anticipate various diseases. Because the UI is simple to use, individuals of all ages may access and use it.

VI. FUTURE SCOPE
We can expand this project to predict Zika virus cases all across the world. We can predict the probability of an epidemic for a few more days now that weather forecasting data is available. Given the success of our proof of concept, we may improve it by adding more criteria such as social media symptomatic data, lifestyle, demographic dynamics, and so on. Using the same method, we can investigate and predict a variety of additional epidemics. We can also seek additional data from government and healthcare institutions that is not currently available in the public domain.

REFERENCES
[1] Remuzzi A, Remuzzi G., “COVID-19 and Italy: what next?”, Lancet. 2020; 395(10231): pp. 1225-1228.
[2] Darwish, A.; Rahhal, Y.; Jafar, A., “A comparative study on predicting influenza outbreaks using different feature spaces: Application of influenza-like illness data from Early Warning Alert and Response System in Syria”. BMC Res. Notes 2020, 13, pp. 33.
[3] Scarpino, S.V., Petri, G., “On the predictability of infectious disease outbreaks”. Nat Communication, 10, 2019, pp. 898.
[4] Pan, J.-R.; Huang, Z.-Q.; Chen, K., “Evaluation of the effect of varicella outbreak control measures through a discrete time delay SEIR model”. Zhonghua Yu Fang Yi Xue Za Zhi 2012, 46, pp. 343–347.
[5] Dantas, E.; Tosin, M.; Cunha, A., Jr., “Calibration of a SEIR–SEI epidemic model to describe the Zika virus outbreak in Brazil”, Applied. Math. Computation. 2018, 338, pp. 249–259.
[6] Yin, R.; Tran, V.H.; Zhou, X.; Zheng, J.; Kwoh, C.K., “Predicting antigenic variants of H1N1 influenza virus based on epidemics and pandemics using a stacking model”. PLoS ONE, 2018, pp. 13.
[7] Das Adhikari, Nimai & Alka, Arpana & Kushwaha, Jitendra & Nayak, Ashish, “Sentiment Classifier and Analysis for Epidemic Prediction”, 10.5121/csit.2018.81004.
[8] Andreas Buja, Dianne Cook & Deborah F. Swayne, “Interactive High-Dimensional Data Visualization”, Journal of Computational and Graphical Statistics, 1996, 5:1, pp. 78-99.
[9] Taylor, James., “A Comparison of Univariate Time Series Methods for Forecasting Intraday Arrivals at a Call Center”, Management Science, 2008, 54, pp. 253-265. 10.1287/mnsc.1070.0786.
[10] Chenar, S.S.; Deng, Z, “Development of genetic programming-based models for predicting oyster norovirus outbreak risks.”, Water Res., 2018, 128, pp. 20–37.
[11] Liang, R.; Lu, Y.; Qu, X.; Su, Q.; Li, C.; Xia, S.; Liu, Y.; Zhang, Q.; Cao, X.; Chen, Q.; et al. “Prediction for global African swine fever outbreaks based on a combination of random forest algorithms and meteorological data.” Transbound. Emerg. Dis., 2020, 67, pp. 935–946.
[12] Raja, D.B.; Mallol, R.; Ting, C.Y.; Kamaludin, F.; Ahmad, R.; Ismail, S.; Jayaraj, V.J.; Sundram, B.M., “Artificial Intelligence Model as Predictor for Dengue Outbreaks”. Malays. J. Public Health Med., 2019, 19, pp. 103–108.
