Paper 12-Blood Diseases Detection Using Classical Machine
Abstract—Blood analysis is an essential indicator for many diseases; it contains several parameters that are signs of specific blood diseases. To predict a disease from the blood analysis, the patterns that identify the disease precisely must be recognized. Machine learning is the field responsible for building models that predict an output based on previous data. The accuracy of machine learning algorithms depends on the quality of the data collected for the learning process; this research presents a novel benchmark data set that contains 668 records. The data set was collected and verified by expert physicians from highly trusted sources. Several classical machine learning algorithms were tested and achieved promising results.

Keywords—Machine learning; classification algorithms; decision trees; KNN; k-means; blood disease

I. INTRODUCTION

Blood has many secrets that affect human life. It is the postman that circulates through the body and visits all organs [1]. Aging is reflected in the blood, and this change can be detected in the parameter values of blood analysis tests [2]. Depending on attributes such as age, gender, symptoms, and existing health conditions, the physician chooses the specific blood tests for diagnosing the disease. Many blood tests are standard and essential for everyone. For this reason blood tests are widespread, and most physicians recommend them to assess the health of the patient's body [3] [4].

Most blood tests do not require special preparation, such as fasting for 8 to 12 hours before the test or stopping certain medicines [5]. By testing the fluid, different parameters in the blood can be measured. The results help to identify health problems in their early stages or any predictable diseases [6]. Physicians cannot diagnose diseases and health problems with blood tests alone; however, they can use them as one factor to confirm a diagnosis. Other factors may include signs and symptoms, which can be combined with other vital signs for diagnosing the disease [7]. Disease diagnosis and prediction is a necessary process that depends on the quality of the data and the physician's experience. Applying modern technological tools, especially machine learning and artificial intelligence algorithms, to help physicians improve the accuracy of disease diagnosis has become one of the hot topics of research [8].

Machine learning is a data analysis technology that teaches computers to act like humans. It uses computational methods to extract information directly from data [8]. The performance of a machine learning algorithm improves with the quality of the data, which in turn enhances the disease prediction process [9].

The main objective of this research is to use machine learning techniques for detecting blood diseases from blood test values; several techniques are applied to find the most suitable algorithm, the one that maximizes the prediction accuracy [9]. The rest of this paper is organized as follows. Section II introduces background information about the techniques used. Section III presents related methods for blood disease prediction using ML classifiers. Section IV describes the data set and the blood test attributes. Section V shows the experiment results. Finally, Section VI presents the conclusion and future work of the research.

II. BACKGROUND

Machine learning is a branch of computer science that is responsible for the development of computer systems that can learn and change their reactions according to the situation [9]. The machine learning methodology depends on learning from data inputs, evaluating the model results, and trying to optimize the output [10]. It is also used in data analytics for making predictions on data. Figure 1 gives a brief overview of the machine learning activity. Machine learning consists of three main models [11]:

Supervised Learning: the computer is trained with presented inputs and their desired outputs, in order to predict the output of future inputs.

Unsupervised Learning: the computer is presented with inputs without desired outputs.

Reinforcement Learning: the computer interacts with the environment and must achieve a specific goal without prior training.

Machine learning techniques have become an essential tool for prediction and decision-making in many disciplines [12]. The availability of clinical data leads machine learning to play a critical role in medical decision making. It serves as a valuable aid in identifying a disease, improving clinical decisions, and choosing suitable medical procedures.
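As a minimal sketch of the supervised setting described in this section, the following k-nearest-neighbor classifier learns from labeled examples and predicts the class of a new input. The feature values, the two-feature layout (hemoglobin, platelets), and the function name `knn_predict` are hypothetical illustrations, not the paper's data set or its Weka configuration.

```python
import math
from collections import Counter


def knn_predict(train_x, train_y, query, k=3):
    """Predict the label of `query` by majority vote among its k nearest training points."""
    # Rank training examples by Euclidean distance to the query point.
    order = sorted(range(len(train_x)), key=lambda i: math.dist(train_x[i], query))
    # Count the labels of the k closest examples and return the most frequent one.
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]


# Hypothetical toy records: (hemoglobin in g/dL, platelets in 10^3/uL).
# These values are invented for illustration and do not come from the paper.
train_x = [(14.0, 250.0), (13.5, 255.0), (15.0, 245.0),
           (8.0, 250.0), (9.0, 252.0), (7.5, 248.0)]
train_y = ["Normal", "Normal", "Normal", "Anemia", "Anemia", "Anemia"]

prediction = knn_predict(train_x, train_y, (8.5, 250.0), k=3)
print(prediction)
```

In practice the features would usually be scaled before computing distances, since parameters measured in different units would otherwise contribute unequally.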
www.ijacsa.thesai.org
(IJACSA) International Journal of Advanced Computer Science and Applications,
Vol. 10, No. 7, 2019
sentence level [23]. They tested a variety of machine learning, rule-based, and hybrid systems. The approach also extracts bags of words, bags of phrases, and bags of concepts. The proposed model used Support Vector Machines and achieved high recall and precision, at 95% and 71% respectively. The core of this model is the high quality of the collected documents and the extraction of information from textual reports for use in disease prediction [24].

IV. BLOOD DISEASES ANALYSIS DATA SET

This research presents a new benchmark dataset that contains the blood analyses of 668 patients. Each blood analysis contains 28 parameters, which are presented in Table I.

The dataset contains four main classes, covering three blood diseases and a normal class:

Thrombocytopenia: a lack of platelets. It is usually not so dangerous but can sometimes lead to excessive bleeding [25].

Leukocytosis: an increase in white blood cells above the normal range. It may be associated with certain parasitic infections or bone tumors, as well as leukemia [26].

Anemia: a decrease in the amount of hemoglobin or red blood cells in the blood. Its symptoms are often vague and may include feeling tired, shortness of breath, or weakness [27].

Normal: all parameter values are normal, and the blood analysis shows no notable findings.

TABLE. I. BLOOD ANALYSIS PARAMETERS [25]
Each record in the proposed dataset is labeled with its related class; this classification was performed manually by expert physicians.

V. EXPERIMENTS RESULTS AND DISCUSSION

Using the Weka tool, classical machine learning algorithms were applied to the 668 records, which belong to four different classes as described in the data set section. 10-fold cross-validation was used for all of the experiments after performing the required preprocessing modules presented in Fig. 1. Cross-validation is a statistical method of evaluating and comparing learning classifiers by dividing the data into two segments: one used to learn or train a model and the other used to validate the model. The training and validation sets must cross over in successive rounds such that each data point has a chance of being validated against.

For each classifier, several metrics were measured to determine the accuracy. Furthermore, the parameter values of each classifier were changed according to the specifications of that classifier. Table II presents the evaluation metrics used in the experiments and their descriptions. Table III shows the experiment results. The accuracy of the classifiers ranges between 71.2% and 98.16%. The LogitBoost classifier has the highest accuracy, whereas the Support Vector Machine classifier has the lowest. Table IV shows the classifiers' accuracy in descending order. The overall results show the success of applying classical machine learning algorithms to blood disease prediction.

TABLE. II. EVALUATION METRICS [28]

Metric               Description
TP Rate              True positive rate
FP Rate              False positive rate
Precision            Fraction of instances predicted as a class that truly belong to it
Recall               Classifier sensitivity
F-Measure            Harmonic mean of precision and recall
MCC                  Matthews correlation coefficient, a measure of the quality of binary (two-class) classifications
ROC Area             Area under the ROC curve, which shows the performance of a classification model at all classification thresholds
PRC Area             Area under the precision/recall curve
Accuracy             Accuracy of the classifier
Mean absolute error  Average absolute difference between predicted and actual values

TABLE. III. EXPERIMENTS RESULTS

Classifier              TP Rate  FP Rate  Precision  Recall  F-Measure  MCC    ROC Area  PRC Area  Accuracy  Mean absolute error
NaiveBayes              0.816    0.059    0.862      0.816   0.835      0.753  0.933     0.857     81.60%    0.09
Bayesian network        0.929    0.04     0.936      0.929   0.93       0.898  0.984     0.967     92.86%    0.0362
MultilayerPerceptron    0.918    0.04     0.918      0.918   0.918      0.879  0.974     0.95      91.80%    0.04
LogitBoost              0.982    0.01     0.982      0.982   0.98       0.972  0.995     0.987     98.16%    0.023
Random forests          0.971    0.022    0.971      0.971   0.969      0.956  0.996     0.99      97.12%    0.042
Support Vector Machine  0.712    0.329    0.799      0.712   0.64       0.494  0.691     0.584     71.20%    0.14
K-Nearest Neighbor      0.93     0.048    0.928      0.93    0.927      0.892  0.94      0.883     92.97%    0.04
Regression analysis     0.965    0.02     0.965      0.965   0.964      0.948  0.992     0.979     96.54%    0.0447
Decision Tree           0.97     0.018    0.969      0.97    0.969      0.955  0.979     0.955     97.00%    0.018
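
Several of the metrics listed in Table II follow directly from the counts of a confusion matrix. The sketch below shows the standard formulas for the two-class case; the function name `binary_metrics` and the example counts are hypothetical and are not taken from the paper's experiments. For a multi-class problem such as the four-class one here, the counts would typically be taken for one class against the rest and the per-class values averaged.

```python
import math


def binary_metrics(tp, fp, tn, fn):
    """Compute common evaluation metrics from two-class confusion-matrix counts."""
    tp_rate = tp / (tp + fn)      # true positive rate, identical to recall
    fp_rate = fp / (fp + tn)      # false positive rate
    precision = tp / (tp + fp)    # fraction of predicted positives that are correct
    f_measure = 2 * precision * tp_rate / (precision + tp_rate)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    # Matthews correlation coefficient; defined as 0 when the denominator is 0.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "tp_rate": tp_rate,
        "fp_rate": fp_rate,
        "precision": precision,
        "recall": tp_rate,
        "f_measure": f_measure,
        "mcc": mcc,
        "accuracy": accuracy,
    }


# Hypothetical counts, invented for illustration only.
metrics = binary_metrics(tp=90, fp=10, tn=85, fn=15)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```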

TABLE. IV. CLASSIFIERS ACCURACY IN DESCENDING ORDER

Classifier              Accuracy
LogitBoost              98.16%
Random forests          97.12%
Decision Tree           97.00%
Regression analysis     96.54%
K-Nearest Neighbor      92.97%
Bayesian network        92.86%
MultilayerPerceptron    91.80%
NaiveBayes              81.60%
Support Vector Machine  71.20%
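
The 10-fold cross-validation used in the experiments can be illustrated with a short sketch that splits record indices into ten folds, so that each record is used for validation exactly once. This is a plain, unstratified split written for illustration; the function name `k_fold_indices` and the fixed seed are assumptions, and it does not reproduce Weka's own implementation.

```python
import random


def k_fold_indices(n_records, k=10, seed=0):
    """Yield (train, test) index lists for k-fold cross-validation."""
    indices = list(range(n_records))
    # Shuffle once so that folds do not depend on the original record order.
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of nearly equal size.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        # Training data is every fold except the one held out for testing.
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


# 668 is the number of records reported for the data set.
splits = list(k_fold_indices(668, k=10))
for train, test in splits:
    print(len(train), len(test))
```

A classifier would be trained on each `train` list and scored on the matching `test` list, with the ten scores averaged to give the reported figure.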
VI. CONCLUSION AND FUTURE WORK

Machine learning has become an essential technique for modeling the human process in many disciplines, especially in the medical field, because of the high availability of data. Blood analysis is one of the essential disease detectors, as it contains many parameters whose values provide evidence for the existence of a disease. The accuracy of a machine learning algorithm depends mainly on the quality of the dataset; for this reason, a high-quality dataset was collected and verified by expert physicians. This dataset was used to train the classifiers to obtain high accuracy. We tested several classifiers and achieved an accuracy of up to 98.16%, which realizes the research objective of helping physicians predict blood diseases from a general blood test.

Future work will focus on testing the proposed data set with different deep learning algorithms to compare classical and deep learning approaches in this research area. Furthermore, an online Internet of Things (IoT) application will be implemented to collect and test more blood data.

REFERENCES

[1] Lewontin, Richard C. It ain't necessarily so: The dream of the human genome and other illusions. New York Review of Books, 2001.
[2] Feldman, Eric A., Eric Feldman, and Ronald Bayer, eds. Blood feuds: AIDS, blood, and the politics of medical disaster. Oxford University Press, USA, 1999.
[3] Fekkes, Minne, et al. "Do bullied children get ill, or do ill children get bullied? A prospective cohort study on the relationship between bullying and health-related symptoms." Pediatrics 117.5 (2006): 1568-1574.
[4] ESHRE, The Rotterdam, and ASRM-Sponsored PCOS Consensus Workshop Group. "Revised 2003 consensus on diagnostic criteria and long-term health risks related to polycystic ovary syndrome." Fertility and sterility 81.1 (2004): 19-25.
[5] Schalm, Oscar William, Nemi Chand Jain, and Edward James Carroll. Veterinary hematology. No. 3rd edition. Lea & Febiger., 1975.
[6] Allison, James E., et al. "A comparison of fecal occult-blood tests for colorectal-cancer screening." New England Journal of Medicine 334.3 (1996): 155-160.
[7] Park, Sang Hyuk, et al. "Establishment of age-and gender-specific reference ranges for 36 routine and 57 cell population data items in a new automated blood cell analyzer, Sysmex XN-2000." Annals of laboratory medicine 36.3 (2016): 244-249.
[8] Cabitza, Federico, Raffaele Rasoini, and Gian Franco Gensini. "Unintended consequences of machine learning in medicine." Jama 318.6 (2017): 517-518.
[9] Darcy, Alison M., Alan K. Louie, and Laura Weiss Roberts. "Machine learning and the profession of medicine." Jama 315.6 (2016): 551-552.
[10] Jiang, Min, et al. "A study of machine-learning-based approaches to extract clinical entities and their assertions from discharge summaries." Journal of the American Medical Informatics Association 18.5 (2011): 601-606.
[11] Lison, Pierre. "An introduction to machine learning." (2015).
[12] Michalski, Ryszard S., and Yves Kodratoff. "Research in machine learning: Recent progress, classification of methods, and future directions." Machine learning. Morgan Kaufmann, 1990. 3-30.
[13] Rish, Irina. "An empirical study of the naive Bayes classifier." IJCAI 2001 workshop on empirical methods in artificial intelligence. Vol. 3. No. 22. 2001.
[14] Friedman, Nir, Dan Geiger, and Moises Goldszmidt. "Bayesian network classifiers." Machine learning 29.2-3 (1997): 131-163.
[15] Ruck, Dennis W., et al. "The multilayer perceptron as an approximation to a Bayes optimal discriminant function." IEEE Transactions on Neural Networks 1.4 (1990): 296-298.
[16] Otero, José, and Luciano Sánchez. "Induction of descriptive fuzzy classifiers with the Logitboost algorithm." Soft Computing 10.9 (2006): 825-835.
[17] Breiman, Leo. "Random forests." Machine learning 45.1 (2001): 5-32.
[18] Suykens, Johan AK, and Joos Vandewalle. "Least squares support vector machine classifiers." Neural processing letters 9.3 (1999): 293-300.
[19] Keller, James M., Michael R. Gray, and James A. Givens. "A fuzzy k-nearest neighbor algorithm." IEEE transactions on systems, man, and cybernetics 4 (1985): 580-585.
[20] Seber, George AF, and Alan J. Lee. Linear regression analysis. Vol. 329. John Wiley & Sons, 2012.
[21] Safavian, S. Rasoul, and David Landgrebe. "A survey of decision tree classifier methodology." IEEE transactions on systems, man, and cybernetics 21.3 (1991): 660-674.
[22] Gunčar, Gregor, et al. "An application of machine learning to haematological diagnosis." Scientific reports 8.1 (2018): 411.
[23] Martinez, David, et al. "Automatic detection of patients with invasive fungal disease from free-text computed tomography (CT) scans." Journal of biomedical informatics 53 (2015): 251-260.
[24] Pekelharing, J. M., et al. "Haematology reference intervals for established and novel parameters in healthy adults." Sysmex Journal International 20.1 (2010): 1-9.
[25] Warkentin, Theodore E., and John G. Kelton. "A 14-year study of heparin-induced thrombocytopenia." The American journal of medicine 101.5 (1996): 502-507.
[26] Shopsin, Baron, Richard Friedmann, and Samuel Gershon. "Lithium and leukocytosis." Clinical Pharmacology & Therapeutics 12.6 (1971): 923-928.
[27] Weiss, Guenter, and Lawrence T. Goodnough. "Anemia of chronic disease." New England Journal of Medicine 352.10 (2005): 1011-1023.
[28] Ragab, Abdul Hamid M., et al. "A comparative analysis of classification algorithms for students college enrollment approval using data mining." Proceedings of the 2014 Workshop on Interaction Design in Educational Environments. ACM, 2014.