Prediction of COVID-19 Using Machine Learning Techniques: Durga Mahesh Matta Meet Kumar Saraf
May 2020
The authors declare that they are the sole authors of this thesis and that they have not used
any sources other than those listed in the bibliography and identified as references. They further
declare that they have not submitted this thesis at any other institution to obtain a degree.
Contact Information:
Author(s):
Durga Mahesh Matta
E-mail: [email protected]
University advisor:
Suejb Memeti
Department of Computer Science
Background: Over the past 4-5 months, the Coronavirus has spread rapidly to all
parts of the world. Research to find a cure for the disease is ongoing, while the
exact cause of the outbreak is still unknown. As the number of people who need to be
tested for Coronavirus increases rapidly day by day, testing everyone is not feasible
because of time and cost constraints. In recent years, machine learning has become
increasingly reliable in the medical field. Using machine learning to predict COVID-19
in patients would reduce the delay in obtaining medical test results and help health
workers give patients proper medical treatment.
Objectives: The main goal of this thesis is to develop a machine learning model
that can predict whether a patient is suffering from COVID-19. To develop such a
model, a literature study and an experiment are carried out to identify a suitable
algorithm. A further objective is to assess which features impact the prediction model.
Methods: A Systematic Literature Review is performed to identify the most suitable
algorithms for the prediction model. Based on the findings of the literature study,
an experimental model is developed to predict COVID-19 and to identify the features
that impact the model.
Results: A set of algorithms suitable for prediction was identified from the
literature study: SVM (Support Vector Machines), RF (Random Forests) and ANN
(Artificial Neural Networks). A performance evaluation is conducted between the
chosen algorithms to identify the technique with the highest accuracy. Feature
importance values are generated to identify the impact of each feature on the prediction.
Conclusions: Prediction of COVID-19 using machine learning could help increase the
speed of disease identification, which could in turn reduce the mortality rate. Based
on the results obtained from the experiments, Random Forest (RF) was found to
perform better than the other algorithms.
We would like to show our sincere gratitude to Prof. Suejb Memeti for supervising
our thesis and guiding us throughout the project with quick and helpful feedback.
We would also like to thank our dear friend Akhila Dindi for providing construc-
tive comments, which helped in improving our work and would like to extend our
gratitude to all those who helped us directly and indirectly.
Contents
Abstract
Acknowledgments
1 Introduction
1.1 Aim
1.2 Objectives
1.3 Research questions
1.4 Defining the scope of the thesis
1.5 Outline
2 Background
2.1 Algorithms
3 Related Work
4 Method
4.1 Literature Review
4.2 Experiment
4.2.1 Software Environment
4.2.2 Dataset
4.2.3 Data Preprocessing
4.2.4 Implementation
4.2.5 Algorithm Configurations
4.2.6 Performance Metrics
5 Results
5.1 Literature Review Results
5.2 Experiment Results
5.2.1 Support Vector Machine (SVM) Results
5.2.2 Random Forest (RF) Results
5.2.3 Artificial Neural Networks (ANN) Results
5.2.4 Results Comparison
5.2.5 Feature Importance Results
6.2.1 Experiment Phase 1
6.2.2 Experiment Phase 2
6.3 Discussion
6.4 Validity Threats
6.4.1 Internal Validity
6.4.2 External Validity
References
Chapter 1
Introduction
Coronaviruses are a large family of viruses known to cause illnesses ranging
from the common cold to more severe diseases such as Middle East Respiratory
Syndrome (MERS) and Severe Acute Respiratory Syndrome (SARS) [6]. These two
diseases are caused by the coronaviruses MERS-CoV and SARS-CoV, respectively.
SARS was first seen in 2002 in China and MERS was first seen in 2012 in Saudi
Arabia [8]. The latest virus, first seen in Wuhan, China, is called SARS-CoV-2 and it
causes the coronavirus disease COVID-19.
A pneumonia of unknown cause detected in Wuhan, China was first reported to the
World Health Organisation (WHO) Country Office in China on 31 December 2019
[1]. Since then, the number of coronavirus cases has kept increasing, along with a
high death toll. The virus spread from one city to the whole country in just 30 days [50].
On 11 February 2020, the disease was named COVID-19 by the WHO [5].
As COVID-19 spreads from person to person, artificial intelligence based electronic
devices can play a pivotal role in preventing the spread of the virus. As
the role of healthcare epidemiologists has expanded, so has the pervasiveness of
electronic health data [13]. The increasing availability of electronic health
data presents a major opportunity in healthcare for both discoveries and practical
applications to improve healthcare [48]. This data can be used to train machine
learning algorithms and thereby improve their decision-making when predicting diseases.
As of May 16, 2020, a total of 4,425,485 cases of COVID-19 had been registered, with
302,059 deaths [3]. COVID-19 has spread across the globe, with
around 213 countries and territories affected [2]. The number of infected patients
quickly outnumbered the medical resources available in hospitals,
placing a substantial burden on health care systems [44]. Due to the limited
availability of resources at hospitals and the delay in obtaining medical test
results, it is difficult for health workers to give proper medical treatment to
the patients. As the number of people who need to be tested for coronavirus increases
rapidly day by day, it is not possible to test everyone due to time and cost factors [25].
In this thesis, we use machine learning techniques to predict coronavirus infection
in patients.
1.1 Aim
The aim of this thesis is to predict whether a person has COVID-19 or not, using
machine learning techniques. The prediction is performed using the clinical infor-
mation of the patients. The goal is to identify whether a patient can potentially be
diagnosed with COVID-19.
1.2 Objectives
The main objectives of our thesis are:
1.5 Outline
The thesis structure is divided into different chapters which are as follows:
• Chapter 1: This chapter contains the introduction to this thesis, aim and
objectives, research questions, and motivation.
• Chapter 3: This chapter contains the summary of the works similar to this
thesis.
• Chapter 6: This chapter consists of analysis and discussions about the results
and methods, the contribution of the thesis to the existing research, threats to
the validity of the thesis.
• Chapter 7: In this chapter, we present the conclusions of the thesis and discuss
possible future work.
Chapter 2
Background
considered the amount of sleep. Here, sleep and happy are both variables. Now,
the analysis is done by making predictions [11]. The popular types of regression
techniques are:
• Linear regression.
• Logistic regression.
2.1 Algorithms
During our research, we have investigated three algorithms through which we have
performed supervised classification.
Random Forests (RF)
The random sampling and ensemble strategies used in RF enable it to achieve accurate
predictions as well as better generalization [40]. A random forest consists
of a large number of decision trees. The higher the number of uncorrelated trees, the
higher the accuracy [54]. Random Forest classifiers can also help fill in some missing values.
Prediction in Random Forests (RFs) is represented in Figure 2.3.
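As a rough illustration of the ensemble idea described above, the following sketch
trains a small forest and checks that its class probabilities are the average of the
probabilities produced by its individual trees. The data here is synthetic (generated
with make_classification) and is only an assumption standing in for real patient
records; note also that scikit-learn averages tree probabilities rather than counting
hard votes.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (assumption): 200 samples, 10 features, 2 classes.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

forest = RandomForestClassifier(n_estimators=10, random_state=0)
forest.fit(X, y)

# Each fitted tree produces its own class-probability estimate.
tree_probas = np.stack([tree.predict_proba(X) for tree in forest.estimators_])

# Averaging over the trees reproduces the forest's own probability estimate.
mean_proba = tree_probas.mean(axis=0)
forest_proba = forest.predict_proba(X)
print(np.allclose(mean_proba, forest_proba))
```

Because each tree sees a different bootstrap sample and a random subset of features
at each split, the individual trees disagree on some samples, and the averaged
estimate is typically more stable than that of any single tree.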
Shuai Wang et al. identified the radiographic changes in CT images of patients
suffering from COVID-19 in China. In this research, they used deep learning
methods to extract COVID-19's graphic features from the CT scan images in order to
develop an alternative diagnostic method. They collected CT images of
confirmed COVID-19 patients along with those of patients diagnosed with pneumonia.
The results of their work provide a proof-of-principle for the use of AI for
accurate COVID-19 prediction [47]. Their research uses CT scan images, which is
different from our research, as we use clinical features and laboratory results for the
prediction.
Dawei Wang et al. described the epidemiological, demographic, clinical, laboratory,
radiological and treatment data from Zhongnan Hospital, Wuhan, China.
The data was analysed and documented so that it could be used to track the infections [46].
The authors give useful insights into the radiological and treatment data that
could be used for the prediction of COVID-19 in our model.
Halgurd S. Maghdid et al. proposed a new framework to detect coronavirus
disease using on-board smartphone sensors. The designed AI framework collects
data from various sensors to predict the grade of pneumonia as well as
the infection of the disease [26]. The proposed framework takes uploaded CT scan
images as the key method to predict COVID-19. It relies on multiple
readings from multiple sensors related to the symptoms of COVID-19.
Ali Narin et al. developed an automatic detection system as an alternative
diagnosis option for COVID-19. In this study, "three different convolutional neural
network based models (ResNet50, InceptionV3 and Inception-ResNetV2) have been
proposed for the detection of corona virus pneumonia infected patient using chest
X-ray radio graphs [32]". The authors also discuss the classification performance
of the three CNN models.
In [52], the authors proposed a model based on three indices to predict the mortality
risk. They built a prognostic prediction model based on the XGBoost machine learning
algorithm to predict the mortality risk in patients. They identified a simple clinical
route for checking and assessing the risk of death. Their research focuses on the
mortality risk, which is different from our research, where the prediction is completely
based on the clinical findings of patients suffering from COVID-19.
The authors of the article [9] presented a comparative analysis of machine learning
models to predict the outbreak of COVID-19 in various countries. Their study and
analysis demonstrate the potential of machine learning models for the prediction of
COVID-19. The article was based entirely on the outbreak of cases in various
countries. In our work, we predict the disease by using clinical information.
In the papers mentioned above, various prediction systems were developed that use CT
scan images and symptoms for the prediction of COVID-19, mortality risks, and outbreaks
in various countries. To the best of our knowledge, there is not much evidence of
prediction systems that use clinical information. This thesis uses machine learning
techniques to predict COVID-19 from the clinical information of patients suffering from
COVID-19. It also determines which features impact the prediction model.
Chapter 4
Method
The research methods used here are a literature review and an experiment. First, we
performed a systematic literature review in which we carefully analysed the literature.
Based on its results, we conducted an experiment for research question 1, through
which we identified suitable machine learning techniques for prediction. For research
question 2, we conducted an experiment to determine which features
influence the results of the prediction of COVID-19.
2. Formulating the search strings: From the keywords identified above, primary
keywords were selected to formulate the search string.
3. Locating the literature: Using the search string, the search was performed on
various digital database platforms such as Google Scholar, IEEE and
ScienceDirect.
4. Following the Inclusion and Exclusion criteria for selection: The inclusion and
exclusion criteria were applied to the collected literature, such as articles and
conference papers, to confine our research.
Inclusion Criteria
Exclusion Criteria
• Incomplete articles.
6. Summarizing the literature: The overall findings from the gathered literature
are summarized and presented for analysis.
4.2 Experiment
An experiment is conducted with the results achieved from the SLR (Systematic
Literature Review) to address RQ1, where we identify the suitable machine
learning technique for prediction of COVID-19. The experiment is then continued
to build a prediction model with the selected algorithm to answer RQ2, where
the factors that influence the prediction are identified.
Python is a high-level, general-purpose programming language that supports
multiple programming paradigms. It has a large standard library that provides tools
suited to various tasks. Python is a simple, uncluttered language with extensive
features and libraries. Several of these capabilities are used to perform the
experiment in our work. In this thesis, the following Python libraries were used [45].
• Matplotlib - It is an open source Python package used for making plots and
2D representations. It integrates with Python to give effective and interactive
plots for visualization [29].
4.2.2 Dataset
Data Collection
Data collection was an essential and protracted process. Regardless of the field of
research, accuracy of the data collection is essential to maintain cohesion. As the
clinical information of patients was not publicly available, collecting the data was a
difficult and tedious process. Various hospitals and health institutes in Sweden
and China were approached to get the most accurate data, but due to the situation
at hospitals at the time, with a heavy inflow of COVID-19 patients, we could not get
direct access to the information. An intensive search was therefore conducted on
various databases to gather open source clinical information of patients diagnosed
with COVID-19.
Dataset Used
The data set used to train the model to predict COVID-19 was gathered
from open source data shared by Yanyan Xu in the figshare repository [51]. The
data set contains information about hospitalized patients with COVID-19. It
includes demographic data, signs and symptoms, previous medical records and laboratory
values that were extracted from electronic records. To train the model with an equal
number of records from patients with negative samples, another data set from the Kaggle
repository was used [4]. The original data set contained details of the medications
prescribed by the doctors to treat the disease. As our model does not require such data,
those fields were removed. The resulting data set is combined, multi-dimensional data.
Some fields indicate whether the patient was diagnosed with a particular disease in
the past, such as renal or digestive diseases, while other fields contain precise
clinical values obtained previously. Fields with textual data were encoded as integer
values for the experimental setup. The attributes considered in the data set for the
machine learning model are presented in Table 4.1.
16   Liver disease
17   Endocrine system disease
18   Diarrhea
19   Chest CT findings - Advances, Absorption
20   White Blood Cell Count
21   Neutrophil count
22   Lymphocyte count
23   Monocyte count
24   CRP - C-reactive protein
25   PCT - Procalcitonin
Table 4.1: Features in the dataset used.
• Imputation of missing values - In our data, missing values were handled
using the SimpleImputer class from the sklearn Python package. The missing values
are replaced using the mean strategy.
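The mean-imputation step described above could look roughly like the following
sketch. The small matrix is a made-up example (an assumption, not the thesis data);
each missing entry, marked as NaN, is replaced by the mean of the observed values in
its column.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy feature matrix (hypothetical values) with missing entries marked as NaN.
X = np.array([
    [1.0, 10.0],
    [np.nan, 20.0],
    [3.0, np.nan],
])

# Replace each NaN with the mean of the observed values in the same column.
imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
X_imputed = imputer.fit_transform(X)
print(X_imputed)
```

In a train/test setting, the imputer would normally be fitted on the training records
only and then applied to the test records with transform, so that test data does not
influence the column means.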
4.2.4 Implementation
The experiment was conducted in Python IDLE, the default integrated
development and learning environment for Python. The experiment was conducted
in several phases, which are listed below:
• After data collection, the patient data is divided into record sets containing
100, 150, 200, 250, 300 and 355 records
respectively.
• The prediction accuracy of each algorithm at each record set is compared and
evaluated for selecting the suitable algorithm for this data-set.
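The record-set comparison described above can be sketched as follows. This is only an
illustration under several assumptions that are not taken from the thesis: the data is
synthetic, each record set is taken as the first n records, an 80/20 train/test split
is used, the SVM and ANN settings are placeholders, and scikit-learn's MLPClassifier
stands in for the ANN.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

# Synthetic stand-in for the 355 patient records with 25 features (assumption).
X, y = make_classification(n_samples=355, n_features=25, random_state=0)

record_set_sizes = [100, 150, 200, 250, 300, 355]


def make_models():
    # The SVM and ANN settings below are placeholder assumptions.
    return {
        "SVM": SVC(kernel="linear", random_state=0),
        "RF": RandomForestClassifier(
            n_estimators=10, criterion="entropy", random_state=0
        ),
        "ANN": MLPClassifier(hidden_layer_sizes=(16,), max_iter=1000, random_state=0),
    }


results = {}
for n in record_set_sizes:
    # Assumption: each record set consists of the first n records.
    X_n, y_n = X[:n], y[:n]
    X_train, X_test, y_train, y_test = train_test_split(
        X_n, y_n, test_size=0.2, random_state=0, stratify=y_n
    )
    for name, model in make_models().items():
        model.fit(X_train, y_train)
        results[(name, n)] = accuracy_score(y_test, model.predict(X_test))

for (name, n), acc in sorted(results.items()):
    print(f"{name:>3} with {n} records: accuracy = {acc:.4f}")
```

Creating fresh model instances for every record set keeps the runs independent, so the
accuracy at one size is not influenced by training at another size.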
• Random Forests:
RandomForestClassifier(n_estimators=10, criterion='entropy', random_state=0)
Accuracy
Accuracy is the metric used in this thesis for evaluation of the algorithms. It is the
most used performance metric to evaluate classification techniques. This measure
allows us to understand which model is best at identifying patterns in training set
to give better predictions in the unknown test data-set.
\[
\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}
\]
Where TP = True Positives, TN = True Negatives, FP = False Positives, and
FN = False Negatives.
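As a small worked example of the accuracy definition above, the following sketch
derives TP, TN, FP and FN from a confusion matrix and applies the formula. The labels
are invented for illustration (an assumption), with 1 denoting a COVID-19 positive
patient and 0 a negative one.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical labels: 1 = COVID-19 positive, 0 = negative.
y_true = [1, 1, 1, 0, 0, 0, 1, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0, 1, 0]

# For binary labels 0 and 1, ravel() yields TN, FP, FN, TP in that order.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)
print(tp, tn, fp, fn, accuracy)  # 4 4 1 1 0.8
```

The same value is returned by accuracy_score(y_true, y_pred), which is a convenient
cross-check when the confusion matrix is not needed.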
Chapter 5
Results
Title: Supervised machine learning algorithms: classification and comparison [34].
Findings: This paper determines the most efficient classification algorithm based on
a clinical data-set (diabetes). Seven supervised machine learning algorithms were
considered, and SVM (Support Vector Machines) followed by RF (Random Forests) were
found to have the highest precision and accuracy [34].

Title: Emerging artificial intelligence applications in computer engineering: real
word AI systems with applications in E-health [27].
Findings: The author states that no single supervised algorithm can outperform the
other algorithms over all data-sets. The simplest approach is to estimate the
accuracy of the algorithms and choose the suitable one. In general, however, SVM
(Support Vector Machines) and ANN (Artificial Neural Networks) tend to perform
better when dealing with multi-dimensional and continuous features [27].

Title: An empirical comparison of supervised learning algorithms [15].
Findings: Of the six algorithms compared in this paper, Calibrated Boosted Trees and
Random Forests give the best performance in all metrics. Artificial Neural Networks
reach their peak performance with large datasets [15].

Title: Performance evaluation of different machine learning techniques for
prediction of heart disease [19].
Findings: Logistic regression achieves the highest accuracy among the compared
algorithms, followed by Artificial Neural Networks. SVM (Support Vector Machines),
on the other hand, achieves the highest precision [19].
• RF (Random Forests).
Each of the above stated algorithms was trained with the collected data-set
and the results were interpreted. The performance of each algorithm was evaluated at
different training-set sizes. Each algorithm was trained with record sets
containing 100, 150, 200, 250, 300 and 355 records
respectively. This experiment was performed to determine which algorithm is the
most suitable for prediction of COVID-19. In addition, as the data is split into smaller
sets, we could also assess which algorithm performs better with the different amounts
of data available.
Number of Patient Records    Accuracy
100                          94.73%
150                          96%
200                          97.36%
250                          97.18%
300                          97.71%
355                          98.33%
The classification accuracy of Support Vector Machine (SVM) at each record set can
be clearly identified from the chart in Figure 5.1
Number of Patient Records    Accuracy
100                          93.33%
150                          96.15%
200                          96.29%
250                          98.36%
300                          98.66%
355                          99.44%
The classification accuracy of Random Forest (RF) at each record set can be
identified from the chart in Figure 5.2. The figure represents the change in accuracy
while using each record set as training data.
Number of Patient Records    Accuracy
100                          80.00%
150                          86.20%
200                          90.90%
250                          96.07%
300                          98.65%
355                          99.25%
The accuracy of Artificial Neural Networks (ANN) with each record set is represented
in Figure 5.3.
Number of          Support Vector Machine    Random Forest    Artificial Neural Networks
Patient Records    (SVM) Accuracy            (RF) Accuracy    (ANN) Accuracy
100                0.9473                    0.9333           0.8
150                0.96                      0.9615           0.862
200                0.9736                    0.9629           0.909
250                0.9718                    0.9836           0.9607
300                0.9771                    0.9866           0.9865
355                0.9833                    0.9944           0.9925
Feature Number    Feature Name                                         Feature Value
18                Chest CT findings - Advances, Absorption             0.155567
13                Fever                                                0.135102
21                Lymphocyte count                                     0.133192
4                 Respiratory system disease                           0.122220
14                Cough                                                0.094820
24                PCT - Procalcitonin                                  0.073363
22                Monocyte count                                       0.058456
20                Neutrophil count                                     0.056516
3                 Age                                                  0.043897
23                CRP - C-reactive protein                             0.041181
19                White Blood Cell Count                               0.033647
5                 Comorbidity                                          0.012231
6                 Fatigue                                              0.008432
12                Chest tightness                                      0.006842
2                 Clinical Classification                              0.004787
7                 Cardiovascular and cerebrovascular disease           0.003711
17                Diarrhea                                             0.002848
1                 Gender                                               0.002814
8                 Malignant tumor                                      0.002763
10                Digestive system disease                             0.002654
0                 Days from onset of symptoms to hospital admission    0.002272
15                Liver disease                                        0.001315
16                Endocrine system disease                             0.001202
9                 Patient Condition                                    0.000153
11                Renal disease                                        0.000016
• Support Vector Machines (SVMs) showed better results with smaller sets of
training records than the other algorithms. There was not much difference
in the accuracy of prediction when the number of records increased.
• Random Forests (RFs) were found to be the most reliable of the algorithms
for prediction of COVID-19. Though outperformed by SVMs for the
smallest number of records, RFs showed consistent growth in accuracy at all
stages. RFs had the highest classification accuracy at most record sets
(four of the six).
Chapter 6. Analysis and Discussion
Experiment Phase 2 is conducted in order to answer RQ2. The aim of this experiment
is to identify which features in the dataset influence the predictive result. The
factors that affect the prediction of COVID-19 are listed in descending order in
Table 5.6.
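One common way to obtain such a descending list is the impurity-based
feature_importances_ attribute of a fitted Random Forest. The text does not state how
the feature values were computed, so this is an assumption, although the tabulated
values sum to approximately 1, which is consistent with this kind of importance
measure. The sketch below uses synthetic data and placeholder feature names.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (assumption); the feature names are placeholders.
X, y = make_classification(
    n_samples=355, n_features=25, n_informative=6, random_state=0
)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]

forest = RandomForestClassifier(n_estimators=10, criterion="entropy", random_state=0)
forest.fit(X, y)

# Impurity-based importances: one non-negative value per feature, summing to 1.
importances = forest.feature_importances_

# Feature indices sorted from most to least important.
order = np.argsort(importances)[::-1]
for idx in order:
    print(f"{idx:>2}  {feature_names[idx]:<12} {importances[idx]:.6f}")
```

Impurity-based importances can be biased towards features with many distinct values,
so a permutation-based measure may be worth considering as a cross-check.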
6.3 Discussion
RQ2: What are the features that will influence the predictive result of
COVID-19?
The influence of all the features in the data is calculated by the experiment
conducted. The features that show a major change in the prediction are tabulated
in Table 6.1. The features that have no effect on the prediction are tabulated in
Table 6.2. When the features with no effect on the prediction were removed, there was
no difference in the accuracy of prediction.
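The comparison described above, with and without the least influential features, can
be sketched as follows. The data is synthetic, and both the 80/20 split and the choice
of dropping exactly the five least important features (mirroring the five features
listed as having no effect) are assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption).
X, y = make_classification(
    n_samples=355, n_features=25, n_informative=6, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)


def fit_and_score(train, test):
    # Train a fresh forest on the given columns and return it with its accuracy.
    model = RandomForestClassifier(
        n_estimators=10, criterion="entropy", random_state=0
    )
    model.fit(train, y_train)
    return model, accuracy_score(y_test, model.predict(test))


full_model, acc_full = fit_and_score(X_train, X_test)

# Keep every feature except the five with the lowest importance (assumption).
keep = np.sort(np.argsort(full_model.feature_importances_)[5:])
_, acc_reduced = fit_and_score(X_train[:, keep], X_test[:, keep])

print(f"all features: {acc_full:.4f}, reduced features: {acc_reduced:.4f}")
```

On synthetic data the two accuracies need not be identical; the sketch only shows how
such a check could be set up.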
Feature Name                                         Feature Value
Chest CT findings - Advances, Absorption             0.155567
Fever                                                0.135102
Lymphocyte count                                     0.133192
Respiratory system disease                           0.122220

Feature Name                                         Feature Value
Days from onset of symptoms to hospital admission    0.002272
Liver disease                                        0.001315
Endocrine system disease                             0.001202
Patient Condition                                    0.000153
Renal disease                                        0.000016
In this research, a systematic literature review was conducted to identify a
suitable algorithm for prediction of COVID-19 in patients. No conclusive evidence
was found to single out one algorithm as the most suitable technique for prediction.
Hence, a set of algorithms consisting of Support Vector Machine (SVM), Artificial
Neural Networks (ANNs) and Random Forests (RF) was chosen. The selected algorithms
were trained with the patients' clinical information. To evaluate the accuracy
of the machine learning models, each algorithm was trained with record sets containing
varying numbers of patients. The trained algorithms were assessed using the accuracy
performance metric. After analysing the results, Random Forest (RF) showed better
prediction accuracy than both Support Vector Machine (SVM) and Artificial
Neural Networks (ANNs). The trained algorithms were also assessed to find the features
that affect the prediction of COVID-19 in patients.
There is considerable scope for machine learning in healthcare. For future work,
it is recommended to work on calibrated and ensemble methods that could solve
difficult problems faster and with better outcomes than the existing algorithms. An
AI-based application could also be developed that uses various sensors and features
to identify and help diagnose diseases.
As healthcare prediction is an essential field for the future, a prediction system
could be developed that estimates the possibility of an outbreak of novel diseases
that could harm mankind by considering socio-economic and cultural factors.
References
[7] Support Vector Machine Machine learning algorithm with example and code,
January 2019. Library Catalog: www.codershood.info Section: Machine learn-
ing.
[8] Ali Al-Hazmi. Challenges presented by MERS corona virus, and SARS corona
virus to global health. Saudi journal of biological sciences, 23(4):507–511, 2016.
Publisher: Elsevier.
[9] Sina F Ardabili, Amir Mosavi, Pedram Ghamisi, Filip Ferdinand, Annamaria R
Varkonyi-Koczy, Uwe Reuter, Timon Rabczuk, and Peter M Atkinson. COVID-19
outbreak prediction with machine learning. Available at SSRN 3580188, 2020.
[10] Hiba Asri, Hajar Mousannif, Hassan Al Moatassime, and Thomas Noel. Using
machine learning algorithms for breast cancer risk prediction and diagnosis.
Procedia Computer Science, 83:1064–1069, 2016.
[11] Taiwo Oladipupo Ayodele. Types of machine learning algorithms. New advances
in machine learning, pages 19–48, 2010.
[12] Taiwo Oladipupo Ayodele. Types of machine learning algorithms. New advances
in machine learning, pages 19–48, 2010. Publisher: InTech.
[13] David W Bates, Suchi Saria, Lucila Ohno-Machado, Anand Shah, and Gabriel
Escobar. Big data in health care: using analytics to identify and manage high-
risk and high-cost patients. Health Affairs, 33(7):1123–1131, 2014.
[14] Hetal Bhavsar and Amit Ganatra. A comparative study of training algorithms
for supervised machine learning. International Journal of Soft Computing and
Engineering (IJSCE), 2(4):2231–2307, 2012.
[16] Nanshan Chen, Min Zhou, Xuan Dong, Jieming Qu, Fengyun Gong, Yang Han,
Yang Qiu, Jingli Wang, Ying Liu, Yuan Wei, et al. Epidemiological and clinical
characteristics of 99 cases of 2019 novel coronavirus pneumonia in Wuhan, China:
a descriptive study. The Lancet, 395(10223):507–513, 2020.
[17] Dursun Delen. Analysis of cancer data: a data mining approach. Expert Systems,
26(1):100–112, 2009.
[18] Manoj Durairaj and Veera Ranjani. Data mining applications in healthcare sec-
tor: a study. International journal of scientific & technology research, 2(10):29–
35, 2013.
[20] Arihito Endo, Takeo Shibata, and Hiroshi Tanaka. Comparison of seven al-
gorithms to predict breast cancer survival (< special issue> contribution to
21 century intelligent technologies and bioinformatics). International Journal
of Biomedical Soft Computing and Human Sciences: the official journal of the
Biomedical Fuzzy Systems Association, 13(2):11–16, 2008.
[21] Ya-Han Hu, Yi-Lien Lee, Ming-Feng Kang, and Pei-Ju Lee. Constructing in-
patient pressure injury prediction models using machine learning techniques.
Computers, Informatics, Nursing: CIN, 2020.
[22] Leslie Pack Kaelbling, Michael L Littman, and Andrew W Moore. Reinforce-
ment learning: A survey. Journal of artificial intelligence research, 4:237–285,
1996.
[25] Halgurd S. Maghdid, Kayhan Zrar Ghafoor, Ali Safaa Sadiq, Kevin Curran,
and Khaled Rabie. A novel AI-enabled framework to diagnose coronavirus
COVID 19 using smartphone embedded sensors: Design study. arXiv preprint
arXiv:2003.07434, 2020.
[26] Halgurd S. Maghdid, Kayhan Zrar Ghafoor, Ali Safaa Sadiq, Kevin Curran, and
Khaled Rabie. A novel AI-enabled framework to diagnose coronavirus COVID 19
using smartphone embedded sensors: Design study, 2020.
[28] Younus Ahmad Malla. A machine learning approach for early prediction of
breast cancer. International Journal of Engineering and Computer Science,
2017.
[29] Wes McKinney. Python for data analysis: Data wrangling with Pandas, NumPy,
and IPython. O’Reilly Media, Inc., 2012.
[30] Wes McKinney and PD Team. pandas: powerful python data analysis toolkit.
Pandas—Powerful Python Data Analysis Toolkit, page 1625, 2015.
[31] Sahar A Mokhtar, Alaa Elsayad, et al. Predicting the severity of breast masses
with data mining methods. arXiv preprint arXiv:1305.7057, 2013.
[32] Ali Narin, Ceren Kaya, and Ziynet Pamuk. Automatic detection of coronavirus
disease (COVID-19) using X-ray images and deep convolutional neural networks.
arXiv preprint arXiv:2003.10849, 2020.
[33] Narges Alizadeh Noohi, Marzieh Ahmadzadeh, and M Fardaer. Medical data
mining and predictive model for colon cancer survivability. International Journal
of Innovative Research in Engineering & Science, 2, 2013.
[35] Sellappan Palaniappan and Rafiah Awang. Intelligent heart disease prediction
system using data mining techniques. In 2008 IEEE/ACS international confer-
ence on computer systems and applications, pages 108–115. IEEE, 2008.
[38] Sanjay Purushotham, Chuizheng Meng, Zhengping Che, and Yan Liu. Bench-
marking deep learning models on large healthcare datasets. Journal of biomed-
ical informatics, 83:112–134, 2018.
[39] R Qahwaji and Tufan Colak. Automatic short-term solar flare prediction using
machine learning and sunspot associations. Solar Physics, 241(1):195–211, 2007.
[40] Yanjun Qi. Random forest for bioinformatics. In Ensemble machine learning,
pages 307–323. Springer, 2012.
[45] Guido Van Rossum et al. Python programming language. In USENIX annual
technical conference, volume 41, page 36, 2007.
[46] Dawei Wang, Bo Hu, Chang Hu, Fangfang Zhu, Xing Liu, Jing Zhang, Binbin
Wang, Hui Xiang, Zhenshun Cheng, Yong Xiong, et al. Clinical characteristics
of 138 hospitalized patients with 2019 novel coronavirus–infected pneumonia in
Wuhan, China. JAMA, 2020.
[47] Shuai Wang, Bo Kang, Jinlu Ma, Xianjun Zeng, Mingming Xiao, Jia Guo,
Mengjiao Cai, Jingyi Yang, Yaodong Li, Xiangfei Meng, and Bo Xu. A deep
learning algorithm using CT images to screen for Corona Virus Disease (COVID-
19). preprint, Infectious Diseases (except HIV/AIDS), February 2020.
[48] Jenna Wiens and Erica S. Shenoy. Machine learning for healthcare: on the
verge of a major shift in healthcare epidemiology. Clinical Infectious Diseases,
66(1):149–153, 2018. Publisher: Oxford University Press US.
[49] Claes Wohlin. Guidelines for snowballing in systematic literature studies and
a replication in software engineering. In Proceedings of the 18th international
conference on evaluation and assessment in software engineering, pages 1–10,
2014.
[52] Li Yan, Hai-Tao Zhang, Yang Xiao, Maolin Wang, Chuan Sun, Jing Liang,
Shusheng Li, Mingyang Zhang, Yuqi Guo, Ying Xiao, et al. Prediction of criticality
in patients with severe COVID-19 infection using three clinical features: a
machine learning-based prognostic model with clinical data in Wuhan. medRxiv,
2020.
[54] Tony Yiu. Understanding Random Forest, August 2019. Library Catalog: to-
wardsdatascience.com.
[55] Giancarlo Zaccone, Md Rezaul Karim, and Ahmed Menshawy. Deep Learning
with TensorFlow. Packt Publishing Ltd, 2017.
[56] Victor Zhou. Machine Learning for Beginners: An Introduction to Neural Net-
works, December 2019. Library Catalog: towardsdatascience.com.
[57] Xiaojin Zhu, John Lafferty, and Ronald Rosenfeld. Semi-supervised learning
with graphs. PhD Thesis, Carnegie Mellon University, language technologies
institute, school of . . . , 2005.
Faculty of Computing, Blekinge Institute of Technology, 371 79 Karlskrona, Sweden