Prediction of Anemia From Multi-Data Attribute Co-Existence
ABSTRACT The primary research problem identified by this study is that the medical world acknowledges that when a singular quality or variable is consistently demonstrated as a clear sign of a disease condition, it is typically regarded as the standard. Nevertheless, advancements in Information and Communication Technology (ICT), particularly Artificial Intelligence (AI), have enabled many additional attributes to influence the diagnosis of certain diseases to varying extents. Therefore, this study has reevaluated the claims within the domain of detecting and predicting anemia with the best machine learning algorithm. Another research problem lies in the fact that previous studies on anemia prediction utilized limited machine learning algorithms across a narrow range of datasets, whereas this current study employed numerous machine learning algorithms across a wide range of anemia datasets and tested three hypotheses. The statistical analysis validated all the hypotheses. The results also showed that ''AdaBoost'' excels in ''cross-validation accuracy,'' scoring 92.8%. On the other hand, when it comes to ''test accuracy,'' the ''precision'' of ''non-anemic,'' and the ''recall'' of ''anemic,'' ''Random Forest'' and ''XGBoost'' both perform best, with values of 0.863, 0.89, and 0.96, respectively. However, ''XGBoost'' performed best in terms of the ROC-AUC score, with a value of 0.9447. The most important contribution of this study is the finding that there is no single machine learning method that can accurately predict anemia based on the parameters associated with anemia. This means that a combination of methods is always needed, and doctors should still be involved in cases of anemia.
The associate editor coordinating the review of this manuscript and approving it for publication was Yongming Li.

2024 The Authors. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
VOLUME 12, 2024
T. Qadah, A. Munshi: Prediction of Anemia from Multi-Data Attribute Co-Existence

I. INTRODUCTION
Anemia, which means a decrease in the oxygen-carrying capacity of the blood due to a decrease in the number of red blood cells or the presence of abnormal hemoglobin, is one of the most common medical conditions seen today [1]. There are many types of anemia, which have been defined by the size and functional capacity of the red blood cells, the cause of the deficiency, or the type of red blood cells being produced. Sudden or chronic loss of blood, decreased production of red blood cells or hemoglobin by the bone marrow, the effects of medications, the destruction of red blood cells before the end of normal red blood cell life, and an increase in red blood cell destruction are associated with diseases such as anemia [2]. However, white blood cells, platelets, mean cell volume, mean cell hemoglobin, and red cell distribution width are also of great medical importance in diseases with accompanying anemia. Given the diversity of such parameters, the fact that they can be easily obtained with any complete blood count device and are low-cost tests contributes to their importance [3].
Determining the presence of anemia is one of the first steps in diagnosis when a patient applies to a health institution. For this purpose, parameters reported by the complete blood count, such as white blood cells and platelets, mean cell volume, mean cell hemoglobin, and red cell distribution width, which
show a decrease or increase in different types of anemia, are examined at the sub-optimal level [2]. The values may decrease, increase, or not stay in the expected range, which differentiates the aid decision. However, the red blood cell count, hemoglobin, and packed cell volume are the first three parameters examined for diagnosis, especially for anemia.
The increasing rate of advances in technology has led to large amounts of electronic medical record (EMR) data. Currently, the EMR contains many types of data, such as diagnosis, time of diagnosis, medication, and biochemical results [4]. Many individuals with chronic diseases like anemia need to plan health care, monitor progress, and possibly minimize the likelihood of anemia [5]. There are many different health and blood tests that can be used for various cases and purposes. The red blood cell (RBC) count, hemoglobin (Hb) concentration, hematocrit (HCT), which is commonly measured as packed cell volume (PCV), white blood cell (WBC) count, and platelet (PLT) count are common and important tests to identify potential medical problems, diagnose health issues, and determine prognosis [6].
The objectives of this research include the development of intelligent systems for the precise prediction of cases with the help of machine learning algorithms that can assist in monitoring blood disease screening agents. Machine learning algorithm models are fast becoming the most popular methods for disease case predictions. On the other hand, anemia is a health disorder condition facing 30% of the world's population [7]. This study investigates the interrelation of blood parameters in diagnosing anemia and proposes the use of the best machine learning algorithm models to determine future cases of anemia. Models were developed using decision trees, bootstrap aggregating for decision, and boosting decision. Performance metrics, accuracy, precision, recall, and F1 score were used to decide on the best model.
In medicine, hemoglobin, the number of RBC, white blood cells, platelets, hematocrit, the average volume of red blood cells from a calculated count, the average hemoglobin concentration in erythrocytes, and the distribution width of red blood cells are the indicators that are initially determined to diagnose anemia [8], [9]. These first indicators are also often called the first line in a general blood count [10]. The most informative indicators have been extensively researched in their properties. It is through these indicators that many common types of anemia can fundamentally be distinguished from each other, and the type of anemia can be determined.
This study is motivated by the fact that ''RBC count'', ''PCV'', ''Mean Cell Volume (MCV)'', ''Mean Cell Hemoglobin (MCH)'', ''Red Cell Distribution Width (RDW)'', ''WBC count'', ''PLT'', and ''Hb'' influence anemia status. That is why this study establishes relationships among these variables based on the available dataset. For that reason, the study contributes in the following ways:
• Demonstrated, through the use of hypothesis testing, that the RBC count, Hb, and PCV are all closely related. In most cases, a low red blood cell count is accompanied by reduced levels of Hb and PCV, which are all indicators of anemia in some form.
• Confirmed that MCV, MCH, and RDW all offer additional insight into the type of anemia and the factors that produce it. Whereas MCV and MCH represent the size of RBCs as well as the amount of hemoglobin they contain, RDW indicates the degree to which RBC sizes are uniform.
• Affirmed that WBCs and PLT can provide a more comprehensive picture of overall health and context of RBC function, despite the fact that they are not directly related to RBCs. This is because anomalies in all three kinds of blood parameter (RBC, WBC, and PLT) can indicate a systemic problem, such as a bone marrow condition.
• Showed that there is no single machine learning model that efficiently predicts anemia; an ensemble approach, together with the intervention of medical doctors, is recommended as the best option in addressing the prediction of anemia using datasets.
The remaining part of the paper, apart from this current section, is described as follows: Section II presents the related work, Section III presents the machine learning models, Section IV presents the research methodology, and Section V presents the results and discussion. Finally, Section VI presents the discussions.

II. RELATED WORK
There are many previous research studies associated with the prediction of anemia. The combined research from these previous studies highlights the significant advancements in using machine learning (ML) models to predict anemia from hematological parameters such as RBC count, Hb, PCV, and others. Dogan and Turkoglu [11] provided early evidence that decision trees could effectively detect iron deficiency anemia using basic hematology parameters. Their work laid the foundation for integrating data mining techniques into medical diagnostics, showing that decision trees can streamline anemia diagnosis. This approach, further supported by Abdullah and Al-Asmari [12], demonstrates that data mining and classification algorithms have a critical role in identifying anemia subtypes, allowing for a more nuanced understanding of blood disorders.
Expanding on these foundations, Khan et al. [13] and Dixit et al. [14] utilized more advanced machine learning models, such as support vector machines and neural networks, to predict anemia in specific populations like children in Bangladesh and general populations. These studies introduced comparative models to analyze the effectiveness of different algorithms, with Dixit et al. [14] emphasizing the superiority of certain classifiers in terms of accuracy. Their research not only advances machine learning in medical diagnostics but also tailors solutions to the specific needs of vulnerable groups, such as children, offering insights into the prevalence and risks of anemia in such populations.
The integration of machine learning into anemia prediction has been particularly impactful in regions where resources for
healthcare are limited. Alemayehu et al. [15] demonstrated how ensemble machine learning models could predict anemia in Ethiopian children, which is crucial for public health interventions. This work builds upon previous findings and showcases the adaptability of machine learning in diverse contexts, improving the accuracy and efficiency of anemia screening programs. Similarly, Sundaram et al. [16] stressed the importance of early detection of chronic anemia through machine learning, suggesting that timely intervention using these predictive models can greatly improve patient outcomes, especially in regions with limited medical resources.
Finally, the research also highlights the application of machine learning in predicting anemia in patients with specific medical conditions, such as cardiac and renal disorders. Rayes et al. [17] and Provenzano et al. [18] explored how hematological indices could be used to predict anemia in cardiac and renal patients, respectively. These studies underscored the importance of integrating machine learning into chronic disease management, offering tailored predictions that can enhance patient care. The ability to accurately predict anemia in patients with existing health issues provides a more comprehensive understanding of how anemia interacts with other diseases, paving the way for improved healthcare strategies.
Among the critical previous studies associated with this current research is the work of Dixit et al. [14], who proposed to predict anemia disease by applying various machine learning algorithms to clinical data. The primary goal was to identify the most accurate model that could assist healthcare professionals in the early detection and diagnosis of anemia. The study concluded that certain models outperformed others in terms of accuracy and reliability. Specifically, Neural Networks demonstrated the highest accuracy in predicting anemia. The results highlight the potential of machine learning algorithms to enhance diagnostic processes, enabling timely interventions and improved patient outcomes.
Hasan et al. [19] proposed a predictive and comparative analysis of fetal anemia using different machine learning approaches. The aim was to identify the most effective model for early detection, which is crucial for preventing adverse health effects in newborns. The comparative analysis revealed that the Support Vector Machines model provided the best performance in predicting fetal anemia. The study emphasized the importance of machine learning in prenatal care, suggesting that accurate predictive models can significantly aid in monitoring and managing fetal health.
Jaiswal et al. [20] proposed to develop and evaluate machine learning algorithms for the prediction of anemia disease. By leveraging computational models, the researchers aimed to improve diagnostic accuracy and support clinical decision-making. The study achieved significant accuracy, particularly with the Decision Tree model. The results demonstrated that machine learning algorithms could effectively predict anemia, underscoring their potential as valuable tools in medical diagnostics. The study suggested that integrating these models into healthcare systems could facilitate early detection and treatment.
Dhakal et al. [21] proposed to predict the level of anemia among individuals using machine learning algorithms. Understanding anemia severity is essential for determining appropriate treatment strategies and improving patient care. The study found that ensemble models, particularly Random Forest and Gradient Boosting, improved prediction accuracy for anemia levels compared to individual classifiers. The findings indicate that machine learning models can effectively classify anemia severity, providing valuable insights for clinicians to tailor treatments based on patient-specific data.
Table 1 presents a summary of some very closely related studies with the current research. These studies demonstrate the significant potential of machine learning algorithms in predicting anemia and its severity. By utilizing various models, from simple classifiers like Decision Trees and Naïve Bayes to complex ensemble methods, the research highlights how data-driven approaches can enhance diagnostic accuracy. The findings suggest that integrating machine learning into healthcare can lead to earlier
trees. Each training data learns from different decision trees, which are then combined to decide the final output. When they are combined, Random Forest can complement and improve the classification of decision trees [30]. The trees of the forest are created using artificially designed bootstraps, bags, and feature randomization, and they also divide the incoming data samples. After they form up, they make a ''vote'' (see Figure 2) so the winning output can be determined [31].
The reason why this current research adopted Random Forest lies with the fact that the number of features that Random Forest uses to provide the best results can be analyzed through hyperparameter analysis, where the more feature values given to the Random Forest, the more remarkable the results that will be produced. Its significance increases as this research has the opportunity to continue adding new characteristics of anemic cases to the model. In the prediction of mild, moderate, and severe anemia cases, a severe anemia case has the best result. Additionally, Random Forest swiftly provides good results, saving the energy of a computer system.

C. XGBoost
XGBoost is built on the gradient boosting framework. Gradient boosting is an ensemble of weak learners, where an initial model is used to make predictions, and the errors from those predictions are used to construct a better model [32]. This process is repeatedly iterated upon, leading to a gradual improvement in accuracy. The XGBoost algorithm carries a range of unique advantages such as high accuracy, parallel computing, and rapid computation speed in machine learning and other tasks.
The reason why this research adopted this technique lies with the fact that, in the realm of healthcare, XGBoost has a number of applications. XGBoost was used to delve into a dataset for latent class analysis and patient stratification in a study looking at long-term [33]. The structure of XGBoost and Random Forest is similar (see Figure 2). However, XGBoost emerged as an efficient random forest boosting implementation. It also has a fast runtime, even for higher parameters, and a lower error rate, making it more appropriate for big data. While Random Forest relies on ''Bagging'' at the model testing, XGBoost performs ''Boosting''. Similarly, the application of the XGBoost algorithm is used to verify the validity of the Random Forest algorithm. XGBoost has been widely used in various fields of machine learning and has good performance [34]. It allows the parameter estimation to be set over many values; however, with the XGBoost default values, the built-in 10-fold cross-validation is used to test the model and to output the relative features of the attributes. In the same way as Random Forest, the input probability value and output result of the grid search of each classifier are not only sent to the receiver in a file but also transmitted to the XGBoost model for calculation [35]. Then, we receive the result of evaluation and select the best result of the processing steps. The best parameter value output by XGBoost serves as the final decision.

D. SUPPORT VECTOR MACHINE
The support vector machine (SVM) is a non-probabilistic, supervised learning model that is primarily used in classification and regression problems [36]. It can be divided into six categories, and the simplest classification problem, used in the present research, is referred to as linearly separable binary classification. The widely used support vector machine is shown in Figure 3. An essence of SVM is to represent
the scatter in low space through the high space, and then to classify, estimate, and forecast it [37].
As one of the most effective classification methods in machine learning based on statistical learning theory, the classification model realized by SVM is a decision function, which is a hyperplane constructed from the training set data. There is a maximum distance between each class of training samples and the decision function, while the direction perpendicular to the decision function indicates the correlation of input data with the target classification [38]. Therefore, the decision function naturally plays a role as a discriminator.
The reason why this current research adopted SVM lies with the fact that the SVM model performs a grid search of hyperparameters. However, the final model prescription step removes the role of the feature columns and only returns the weightings of the supporting vector feature domains. Furthermore, if feature columns are removed directly, or if the feature columns are highly dependent on the sample data, some issues can develop, such as weight identifiability that cannot be estimated, and the model may show two problems: the rapid decay of the support vectors and changes in the directions of the original non-removed feature columns, which would have a great impact on understanding real anemic cases [39].

E. NEURAL NETWORK
Neural networks are modeled on the human nervous system. They consist of a large number of connected processing units that work in unison to understand data patterns and mimic the functioning of the human brain. Modern neural networks operate similarly, but at greater speed and on large data volumes. At their core, neural networks consist of units called neurons. These are arranged in multiple layers, with each layer potentially containing numerous neurons [40]. These layers in the plain form of a neural network are classified as an input layer, one or more hidden layers, and finally, one output layer (see Figure 4). How these neurons are arranged is referred to as the architecture of the network. In a deep learning context, there can be more than one hidden layer, and the architecture of such networks is referred to as a deep architecture. Each edge connecting two neurons represents the strength of the information and is assigned a weight [41]. Biases are added to control the flow of information and must be learned during the training phase.
The reason why this research adopted Neural Networks lies with the fact that neural network applications have also been found in the field of the detection of diseases [21]. In addition, Neural Network applications have been developed whose task is to support decisions in diagnosis. The problem, within the context of anemia, can also be seen as a panel problem, where the information in the examination or laboratory data acts simultaneously and alone at the same time in diagnosing or determining the amount of anemia. The background to the current research was the assumption that the blood count should be related through the correlation function of the features from the dataset, which are all obtained from blood parameters taken during the patient's visit. Numerous studies reported on the use of artificial intelligence for the prediction
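The layered forward pass described above (weighted edges between neurons, a bias per neuron, and an activation that squashes the weighted sum) can be sketched in a few lines. The weights, layer sizes, and sigmoid activation below are illustrative assumptions for demonstration only, not the network actually trained in this study.

```python
import math

# Minimal feed-forward pass: each edge carries a weight, each neuron has
# a bias, and a sigmoid activation squashes the weighted sum, as
# described in Section E above.
def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def layer(inputs, weights, biases):
    # weights[j][i] connects input i to neuron j; one bias per neuron.
    return [sigmoid(sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]

# Hypothetical 2-input -> 2-hidden -> 1-output architecture.
hidden = layer([0.5, -1.0],
               weights=[[0.8, 0.2], [-0.4, 0.9]],
               biases=[0.1, -0.3])
output = layer(hidden, weights=[[1.5, -1.1]], biases=[0.05])
print(output)
```

In a real training loop, the weights and biases would be learned from the CBC features rather than fixed by hand, which is the "training phase" the text refers to.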
F. K-NEAREST NEIGHBORS
K-Nearest Neighbors (KNN) is a classification algorithm that makes predictions based on how numerically similar new information is to already known data [44]. This technique is commonly used as a go-to algorithm for the prediction of new, incoming health data. KNN assumes that the characteristics of data can be approximated by the characteristics of its neighbors and complements discriminant analyses or classifications of physiological data. The strengths of KNN are that it is simple and easily understood, with a foundation in human logic. The KNN algorithm uses the nearest distance to classify and predict data based on the nearest k data points (see Figure 5). When determining the classification or regression of new data, the KNN algorithm will search for the k most similar data points according to the distance between the new data and the data in the dataset [45]. In addition, the KNN algorithm is sensitive to the selection of the value of k. The selection of the k value should be determined according to the characteristics of the specific situation and the complexity of the data to be classified.

FIGURE 5. The K-Nearest Neighbors architecture.

The quality of the KNN implementation is dependent upon the dataset selected; a variety of datasets have been utilized to conduct similar analyses. Techniques to discretize raw data include normalizing continuous measures and utilizing non-parametric tests to select the top features. Although the universal application of KNN to diverse datasets is evident, caution should be exercised when attempting to apply KNN generally to various datasets, mainly due to its reliance upon data size and cumulative flow [46].
The reason why this study adopted KNN lies with the fact that, although KNN can be powerful for locating specific and wanted data, it is also crucial in the implementation of KNN that testing against simulated data and extracted data reveals that the prediction of anemia can be carried out at an acceptable degree in relation to actual data. Furthermore, the optimal number selected is also in accordance with actual data. Previous results demonstrated by some studies suggest an encouraging increase in prediction in cases of compressed anemia data [47].

G. DECISION TREE
The decision tree is a powerful modeling technique. It is simple to understand and easy to use. Decision trees transform a set of data into a tree or a flowchart, with the nodes representing the meanings of tests on attributes and the branches representing the possible answers. They are constructed by an algorithm that identifies the most significant feature upon which to split the data [48]. Decision trees have the ability to discover the rules that are implicit in the data, requiring little data cleaning and processing and being acceptable to
a wide range of users. On the other hand, decision trees can model the interaction between features in a nonlinear manner. The prediction results obtained are clear and easy to use. Among all of the machine learning models, the decision tree is a good model for white-box modeling [49].
The reason why this research adopted it lies with the fact that one of the best things about decision trees is that they mimic human-level thinking, so if we document the steps that a human takes to make a decision, it is easy to teach a computer to mimic the human's decision-making process. From a technical point of view, decision trees are frequently used to analyze and handle data problems, and the decision tree is the model that has the greatest capacity for predicting accuracy. Decision trees have been intensively studied and applied in a variety of fields, including economics, finance, analysis of ecological data, astronomy, and healthcare. The application of decision trees to healthcare is extensive; they can be used to improve efficiency, reduce costs, and standardize the performance of medical diagnosis and therapy. Decision trees in healthcare can aid in the diagnosis of diseases, prognosis, and prediction [50].
Decision trees are categorized into two types: classification and regression. The classification tree is used to solve problems with a categorical outcome, while the regression tree is used to deal with continuous outcomes in making decisions or predictions. Remarkably, the classification and regression trees encapsulate a proper scenario where there are at least two classes and several attributes. While a simple decision tree exhibits a striking tendency, it is accompanied by certain drawbacks, such as instability, overfitting, and weak predictions. To address these drawbacks, a number of decision tree variants have been developed that provide a collection of sub-trees chosen with different decision criteria, regardless of the most critical attribute, and prune the error in future [50].
The structure of Decision Trees is part of the Random Forest structure presented in Figure 2. Furthermore, the current advanced decision trees include Random Forests and Adaptive Boosting.

H. NAIVE BAYES
Naive Bayes is a simple and easy-to-understand probabilistic classification algorithm based on applying Bayes' theorem with a strong attribute conditional independence assumption [51]. Naive Bayes is an extremely fast algorithm. It can make predictions faster than many other algorithms. This attribute is particularly welcome for large-volume datasets. Naive Bayes requires only a small amount of training data. It estimates the conditional probability for each feature. If there is limited data, more sophisticated models may end up overfitting. However, the Naive Bayes probability estimates often work well in practice [52].
The reason why this study adopted it lies with the fact that a Naive Bayes model was employed to analyze the correlation between anemia, symptoms, syndromes, related diseases, and other attributes as additional disease predictor attributes, over its existence, to continuously affect the population. The Naïve Bayes structure is similar to that of KNN: while KNN uses k as the central point of the prediction (see Figure 5), Naïve Bayes uses the Bayesian method and assumes that the data contain sufficient probabilistic information to estimate the joint probability of attributes. That is, the maximum posterior probability outcome from the entire data is considered an optimal conclusion [53]. In simple words, this means that, where two or more categories or classes exist, any new incoming data entry will fit into its class or category.

I. AdaBoost
The AdaBoost algorithm is short for Adaptive Boosting. It is a kind of boosting learning algorithm in machine learning. Given a training dataset with N samples, AdaBoost can combine the weak learners to generate a strong prediction model in an iterative manner [54]. The core aim of the algorithm is to minimize the error on the training samples and to assign different weights to the classification errors in different iterations, in order to correct and focus the model's fitting of training samples on the wrongly classified samples. There have been boosting algorithms like AdaBoost, XGBoost, Gradient Boosting Decision Tree, and others that are widely used in various fields [55]. Follow-up investigators have extensively explored and improved on the core algorithm of AdaBoost, thus making it suitable for large-scale sparse multi-class classification tasks and giving it high prediction performance.
The reason why this research adopted this technique lies with the fact that the AdaBoost algorithm can continuously improve the classification performance of the combined weak classifiers on the test set during the AdaBoost iterative update process, up to some optimal model. After AdaBoost has summed a sufficient iterative number of weak classifiers, the classifier learned from the original classification problem can have the same performance [56]. The superiority of the AdaBoost algorithm in multi-class and binary classification problems identifies the AdaBoost algorithm as better than the decision tree or single neural network design; traditional method experiments in a variety of applications show good generalization. The AdaBoost algorithm prediction model has a good fitting effect and good accuracy in advance sample data prediction in the medical laboratory for anemia prediction. The AdaBoost algorithm is a cornerstone technology for anemia prediction [57].
The structure of AdaBoost follows the same pattern as XGBoost, and both also follow the pattern of Random Forest (see Figure 2). However, the goal of the AdaBoost algorithm is to increase the accuracy of classification by focusing on falsely classified data distributions. AdaBoost conducts the ''boosting'' process to improve classifier properties by repeatedly giving a weight to each error in the process of learning a weak classifier [56]. First, a weak classifier is a series of algorithms designed to identify the characteristics of a target and is usually less complex
than a strong classifier. Next, AdaBoost integrates multiple TABLE 3. Measurement values for normal blood cells [60].
weak classifiers to construct a strong classifier for increased
predictive performance. The AdaBoost algorithm will update
the weights by adjusting the accuracy of each weak classifier
in turn, making the current weak classifier focus on misclas-
sified samples in the sample space
A. DATASET
The dataset utilized for this research was acquired from Kaggle [58], [59], [60]. This dataset illustrates the prevalence of various forms of anemia, encompassing its severity and its correlation with age and gender among the research population, utilizing Complete Blood Count (CBC) characteristics as variables. The dataset was derived from whole blood count tests conducted by a hematology analyzer to ascertain the prevalence of various kinds of anemia treated at the Eureka Diagnostic Center in Lucknow, India. All procedures for the CBC test were conducted in accordance with the standard operating protocols established for the hematology analyzer.
For the CBC analysis, 400 patient samples were randomly selected to compile the dataset from individuals who visited the Eureka Diagnostic Center in Lucknow for various clinical assessments. The diagnostic center conducts an average of 4 to 8 CBC investigations daily. Between September 2020 and December 2020, 1000 CBC investigations were conducted, from which 400 random samples were selected. The dataset comprised adult males and non-pregnant females above 15 years of age within the study group. Infants, children under 10 years of age, and pregnant women were excluded from the study due to issues such as fluctuating CBC test values and other considerations. Upon eliminating the aforementioned individuals from the randomly selected sample of 400 patients, the final dataset comprised 364 patients. The first five entries from the dataset are presented in Table 2.

1) DATASET PARAMETERS
The anemia dataset utilized in this study is classified according to standard characteristics associated with age and gender, as illustrated in Table 3. Hb readings in a CBC may differ among laboratories, with average levels for adult men and women being below 135 g/L and 115 g/L, respectively [61]. The World Health Organization characterizes anemia as hemoglobin concentrations falling below 130 g/L for males and 120 g/L for females [62]. The remaining values are associated with the standard and serve as the benchmark.

2) CONCEPTUALIZATION FROM THE DATASET
This current research formulates three main hypotheses in order to test and validate the claims. At the onset the dataset
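The WHO definition quoted above can be made concrete with a short sketch. The helper and sample records below are invented for illustration (they are not rows from the Kaggle dataset, and the real file's column names may differ):

```python
# WHO criterion: anemia is Hb below 130 g/L for males and below 120 g/L for females.
def who_anemia_flag(hb_g_per_l, sex):
    """Return True when the hemoglobin level meets the WHO anemia criterion."""
    threshold = 130.0 if sex == "M" else 120.0
    return hb_g_per_l < threshold

# (sex, Hb in g/L) pairs, illustrative only
samples = [("M", 128.0), ("F", 121.5), ("F", 119.0), ("M", 140.0)]
flags = [who_anemia_flag(hb, sex) for sex, hb in samples]
```

Note that the study's own labelling rule (Section on encoding and labelling) uses g/dL cutoffs of 13.5 and 12, which are close to, but not identical with, the WHO g/L thresholds above.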
FIGURE 8. The model for the abnormalities within WBC and PLT.

A negative correlation exists when one variable increases while the other falls. For that reason, a heatmap, a vibrant chart that illustrates the association among all columns, is drawn (see Figure 9). The intensity of the shades in the heatmap signifies the strength of the correlations among the various columns in the dataset and hence answers our research hypotheses.
Figure 9 illustrates a strong positive correlation between RBC and PCV, demonstrating that an increase in RBC corresponds to an increase in PCV, hence supporting the proposed hypothesis that RBC count and PCV are closely associated. It also follows that a low RBC count leads to lower PCV levels, indicating some form of anemia.
The correlation between MCV and MCH is 0.77425; this shows that MCV and MCH are truly associated and capable of indicating the size and Hb content of RBCs. Furthermore, RDW shows very low negative correlations of −0.02 and −0.216 with MCV and MCH, respectively; that is, it is associated not with the size of RBCs but with the pattern of RBCs. As a result, hypothesis 2 is supported.
Finally, the test of Hypothesis 3 indicates that WBC and PLT are not directly related to each other; however, they are both a source of abnormalities in RBC. Platelets are involved in blood clotting, while WBCs are more about immune defense.

1) PERFORMANCE METRICS
Evaluation metrics are essential in machine learning for the close monitoring of new models. In any situation where a model is developed that separates positives from negatives, it is essential to be able to evaluate the performance of this model using some standard metrics. Among the standard metrics for measuring the quality of machine learning models are "accuracy", "precision", "recall", and "F1-score". These emerge from a confusion matrix comprising true positives (TPs), true negatives (TNs), false positives (FPs), and false negatives (FNs).
Accuracy is an important concept in the world of machine learning. It is the ratio of the correctly predicted instances to the total instances in our dataset. The calculation of accuracy is important as it helps us understand our predictive models. The accuracy is generated using equation 1:

Accuracy = (TP + TN) / (TP + TN + FP + FN) (1)

Precision is the ratio of correctly predicted positive observations to the total predicted positive observations. It denotes the extent to which the predicted positives are actually positive, and it can be interpreted as the proportion of true positive instances among all the instances that were predicted to be positive. Precision is calculated by equation 2:

Precision = TP / (TP + FP) (2)

Recall is defined as the probability that the classifier correctly predicted the label of the positive instances out of all actual positive instances. Recall is frequently employed as a measure of the aptitude of a program to recognize a specific class, and it is calculated by using equation 3:

Recall = TP / (TP + FN) (3)

The F1-score is the harmonic mean of the precision and recall scores. It does not ignore the presence of false negatives when there are false positives; it returns low values when either precision or recall is low. In other words, the F1-score gives a more balanced result, the harmonic mean of precision and recall, which is calculated by equation 4:

F1-score = TP / (TP + (1/2)(FP + FN)) (4)
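Equations 1 to 4 can be computed directly from confusion-matrix counts. The sketch below uses invented counts for illustration, not values from the paper's experiments:

```python
# Compute accuracy, precision, recall, and F1 from confusion-matrix counts.
def metrics(tp, tn, fp, fn):
    accuracy = (tp + tn) / (tp + tn + fp + fn)   # equation 1
    precision = tp / (tp + fp)                   # equation 2
    recall = tp / (tp + fn)                      # equation 3
    f1 = tp / (tp + 0.5 * (fp + fn))             # equation 4
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Illustrative counts only.
m = metrics(tp=40, tn=45, fp=10, fn=5)
```

Algebraically, equation 4 equals the harmonic mean 2PR / (P + R), so the two formulations are interchangeable.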
A. DATASET ENCODING AND LABELLING
The preprocessing involves preparing the dataset so that it is fit and ready for the model analysis. The first step for this research involves encoding the gender attribute of the dataset. The research provides a code function that changes the "Sex" column in the dataset, which contains the values "Male" and "Female", into numbers. Specifically, "Male" is encoded as 0, and "Female" is encoded as 1. This is necessary because most machine learning algorithms work better with numbers than text.
The next step of the preprocessing is setting the condition for classifying anemia from the dataset. A function in the code defines a condition to determine whether an entry in the dataset has anemia based on the hemoglobin (Hb) level and sex: if a male (sex is 0) has an Hb level below 13.5, the person is considered anemic (marked as 1); if a female (sex is 1) has an Hb level below 12, she is also considered anemic (marked as 1); otherwise, if the Hb is above those levels, the person is considered not anemic (marked as 0). The code applies this logic to each row (person) in the dataset and creates a new column called "Anemia" to indicate whether a person is anemic or not.

B. FEATURE SELECTION AND DATA PARTITIONING
Feature selection is the process of choosing and retaining the most relevant features to enhance model interpretability and to reduce or avoid overfitting to noise. Particularly when working with a huge amount of data and features, it is important to identify which features influence the model predictions. Feature selection techniques are categorized as filter, wrapper, and embedded methods. Each method has its own advantages, disadvantages, and criteria based on which a practitioner can choose a method for their data. Filter methods essentially evaluate the importance of a feature or features of a dataset. This current research applies filter feature selection; because filter methods do not rely on any learning algorithm, they usually require less computational time and data.
The feature selection code for this research splits the data into two parts. The features (X) include various blood-related measurements: "RBC count", "PCV", "MCV", "MCH", "RDW", "WBC count", "PLT", and "Hb". These are the features used to predict whether a person has anemia or not. The target (y) is the "Anemia" column, indicating whether the person has anemia or not. This is what the model tries to predict.
Assessing model performance is an efficient way to validate the results of a machine learning model, and data partitioning is essential to perform model validation. To assess the model's performance, the dataset needs to be divided into two parts: a training set and a testing set. The training set is utilized for building the model, and the testing set is used to validate the developed model. The training data (80%) is used to train the machine learning model, and the testing data (20%) is kept aside to evaluate how well the model works. The split ensures that the evaluation is done on new data that the model hasn't seen before.
Finally, all the features are standardized to a similar scale to ensure that no one feature dominates the others. This is done by subtracting the mean and dividing by the standard deviation for each feature. The training data is used to learn how to scale the values, and then the same scaling method is applied to the testing data.

VI. EXPERIMENTAL RESULTS AND DISCUSSIONS
A. INITIALIZING THE TRAINING MODELS
All of the models that were deployed in this study have been initialized, as described in Section III. Writing the code necessary to construct and set up the variety of models designed to predict the presence of anemia in individuals is a necessary step in the process of machine learning model initialization. Both the training and the prediction were taken care of. The training data, which consists of the subset of data designated earlier for instructing the models, is utilized in the process of teaching each individual model. For the purpose of determining how well the models have learned, they are given the duty of making predictions on the testing data after they have been trained.
The testing data is the subset of the data that was not used during the training process. The code calculates the accuracy of each model, which indicates the frequency with which the model correctly predicts whether or not an individual has anemia. A detailed report is created for each model, which illustrates its performance across a range of measures (for example, accurately distinguishing those who have anemia from those who do not). The accuracy of each model is documented in a dictionary (results), where the name of the model serves as the key and the accuracy score of the model serves as the value. The findings are analyzed by the code in order to determine which model has the highest accuracy. Following that, it presents the name of the best model along with its accuracy.

B. PRESENTATION OF THE RESULTS
The entire set of results from training the various models has been gathered (see Table 4). The combined training reveals that "Logistic Regression" obtained an accuracy of 0.836 (83.6%); that is, the model predicted anemia correctly 83.6% of the time. The precision, which measures how many of the predicted anemic cases are actually anemic, is 80% for Class 0 (not anemic) and 85% for Class 1 (anemic). The recall, which measures how many actual cases of each class were correctly identified, is 67% for Class 0, meaning the model missed quite a few actual non-anemic cases, whereas the recall for Class 1 is 92%, indicating that most anemic cases were correctly identified.
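The preprocessing steps described above (encoding, Hb-based labelling, the 80/20 split, and standardization from training statistics only) can be sketched as follows. The tiny inline records and the deterministic split are illustrative assumptions; the study worked on the 364-patient CBC dataset and presumably used a randomized split:

```python
import numpy as np

# Illustrative records only (Hb in g/dL), not rows from the study's dataset.
records = [
    {"sex": "Male", "hb": 12.9}, {"sex": "Female", "hb": 13.1},
    {"sex": "Male", "hb": 14.2}, {"sex": "Female", "hb": 11.4},
]

# Encode gender: "Male" -> 0, "Female" -> 1.
sex = np.array([0 if r["sex"] == "Male" else 1 for r in records])
hb = np.array([r["hb"] for r in records])

# Label anemia: males below 13.5 g/dL and females below 12 g/dL are anemic.
anemia = np.where((sex == 0) & (hb < 13.5) | (sex == 1) & (hb < 12.0), 1, 0)

X = np.column_stack([sex, hb]).astype(float)
y = anemia

# 80/20 train/test partition (deterministic here for illustration).
n_train = int(0.8 * len(X))
X_train, X_test = X[:n_train], X[n_train:]
y_train, y_test = y[:n_train], y[n_train:]

# Standardize using statistics learned from the training data only,
# then apply the same scaling to the test data.
mu, sigma = X_train.mean(axis=0), X_train.std(axis=0)
sigma[sigma == 0] = 1.0          # guard against constant columns
X_train_std = (X_train - mu) / sigma
X_test_std = (X_test - mu) / sigma
```

Fitting the scaler on the training partition alone, as the text specifies, prevents information about the test set from leaking into the model.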
Therefore, the model is better at identifying anemic patients but not as good at correctly identifying non-anemic patients (it sometimes misclassifies non-anemic patients as anemic).
The result of the "Random Forest" model indicates that an accuracy of 0.849 (84.9%) was obtained, meaning the model is slightly better than "Logistic Regression", correctly predicting anemia 84.9% of the time. The precision results indicate that Class 0 obtained 88% (higher precision, fewer false positives for non-anemic), whereas Class 1 obtained 84% (slightly lower than Logistic Regression for anemic cases). The recall for Class 0 is 62% (still missing a fair number of non-anemic cases), while Class 1 obtained 96% (excellent at catching anemic cases). Hence, it can be concluded that the model is highly effective at detecting anemia, but there is room for improvement in identifying non-anemic patients. The trade-off here is slightly better accuracy overall.
The results obtained for "XGBoost" indicate an accuracy of 0.863 (86.3%), the highest accuracy among the models, correctly predicting anemia 86.3% of the time. The precision for Class 0 is 89% (best so far at predicting non-anemic), whereas Class 1 obtained 85% (good at identifying anemic cases). The recall for Class 0 is 67% (better than Random Forest, still some false negatives), while Class 1 obtained 96% (excellent recall for anemic cases). This model strikes the best balance: it performs well in identifying both anemic and non-anemic patients and is the best-performing model overall.
The result obtained from the Support Vector Machine indicates an accuracy of 0.808 (80.8%), which is lower than the previous models. The precision for Class 0 is 71% (fewer false positives for non-anemic), while Class 1 is 86% (good but not the best). The recall for Class 0 is 71% (balanced, but still misclassifying some non-anemic cases), while Class 1 obtained 86% (lower than other models for anemic detection). While this model is relatively balanced in precision and recall, its overall performance is lower than others like XGBoost, making it less ideal for this task.
The remaining five models obtained the following accuracies: Neural Network (MLP) 0.836 (83.6%), K-Nearest Neighbors (KNN) 0.795 (79.5%), Decision Tree 0.849 (84.9%), Naive Bayes 0.808 (80.8%), and AdaBoost 0.836 (83.6%). Therefore, it can be recognized that "XGBoost" stands out as the best model overall, with the highest accuracy (86.3%). It balances well between precision and recall, especially for detecting both anemic and non-anemic patients.
Considering the various performances exhibited by the models, the research fine-tuned the hyperparameters, allowing each model to be optimized and its performance improved. The function hyperparameter_tuning uses GridSearchCV to automatically test various combinations of the hyperparameters defined in the parameter grids. It tries out different values from the parameter grids to find the best-performing configuration for each model. Cross-validation (splitting the data multiple times for training and testing) is used to ensure the model's performance is not just luck. This process helps the model find the best settings, improving accuracy and generalization on unseen data. Once the best model configuration is found, it is trained on the training dataset and used to make predictions on the test dataset. The code also calculates accuracy and prints a detailed classification report showing precision, recall, F1-score, etc., for both anemic and non-anemic predictions. For each model, the research wrote code that generates a confusion matrix to understand where the model made correct and incorrect predictions. This is plotted using a heatmap, which provides a visual representation of the model's performance.
The combined result is presented in Table 5. Upon optimization, it was determined that Random Forest is the superior model according to test accuracy, cross-validation accuracy, and precision-recall balance: cross-validation accuracy is 91.06%, test accuracy is 86.30%, precision for Class 1 is 85%, and recall is 96%. This model effectively identifies anemic patients and has a balanced performance across both classes, achieving a ROC-AUC of 92.77% (see Table 5). Its efficacy is rooted in its capacity to accurately identify both anemic and non-anemic subjects while reducing false positives and negatives.
The "ROC-AUC score" provides a single metric that encapsulates a model's efficacy in differentiating between the classes of anemic and non-anemic individuals. A flawless score would be 1.0, indicating the model accurately differentiates between the two groups, while a score of 0.5 signifies that the model is making random guesses. Models such as XGBoost, exhibiting a ROC-AUC of 0.9447, provide superior class separation compared to others like Decision Tree, which has a ROC-AUC of 0.7717.
The ROC curve illustrates the true positive rate (the proportion of correctly detected anemic cases) in relation to the false positive rate (the proportion of non-anemic cases erroneously labeled as anemic). The optimal curve would swiftly ascend towards the upper left corner of the graph, indicating a high true positive rate and a low false positive rate for the model. A diagonal line signifies a model that generates random predictions, whereas proximity of the curve to the top left indicates superior class differentiation by the model.
The Logistic Regression model achieved a ROC-AUC score of 0.9371 (refer to Figure 10 in the Appendix). The ROC curve demonstrates a robust capacity to predict anemia (refer to Figure 11 in the Appendix), albeit marginally inferior to XGBoost and Random Forest.
The "Random Forest" achieved a ROC-AUC score of 0.9277 (refer to Figure 12 in the Appendix), and its ROC curve resembles that of XGBoost, effectively distinguishing the classes but exhibiting marginally inferior predictive capability (refer to Figure 13 in the Appendix).
The "XGBoost" yields a ROC-AUC score of 0.9447 (refer to Figure 14 in the Appendix). The analysis yielded a ROC curve positioned in the top left, indicating superior predictive capability in differentiating between anemic and non-anemic cases (refer to Figure 15 in the Appendix).
The Support Vector Machine achieved a ROC-AUC score of 0.9362 (refer to Figure 16 in the Appendix). A well-defined ROC curve demonstrating effective class separation, comparable to Random Forest performance, was achieved (see Figure 17 in the Appendix).
The Neural Network achieved a ROC-AUC score of 0.9277 (refer to Figure 18 in the Appendix), while the ROC curve exhibited a pattern akin to that of Random Forest, indicating its efficacy in differentiating between anemic and non-anemic cases (refer to Figure 19 in the Appendix).
The K-Nearest Neighbors (KNN) achieved a ROC-AUC score of 0.8401 (refer to Figure 20 in the Appendix), but the ROC curve indicates that KNN encounters greater difficulty in distinguishing between the two classes (refer to Figure 21 in the Appendix).
The Decision Tree achieved a ROC-AUC score of 0.7717 (refer to Figure 22 in the Appendix), whereas the ROC curve approximates a diagonal line, indicating that the Decision Tree performs less effectively than other models (refer to Figure 23 in the Appendix).
The Naive Bayes achieved a ROC-AUC score of 0.8988 (refer to Figure 24 in the Appendix). The ROC curve of Naive Bayes effectively predicts anemia; however, it is less robust than that of XGBoost or Random Forest (see Figure 25 in the Appendix).
The AdaBoost achieved a ROC-AUC score of 0.9405 (refer to Figure 26 in the Appendix), and the ROC curve demonstrates a robust performance, comparable to the top models, effectively distinguishing between the two classes (refer to Figure 27 in the Appendix).
The findings indicated that XGBoost, Random Forest, AdaBoost, and SVM are robust performers, exhibiting elevated ROC-AUC scores and superior ROC curves. Likewise, the Logistic Regression and Neural Network exhibit commendable performance, while not reaching the standards of the leading models. KNN and Decision Tree exhibit more difficulty, evidenced by diminished ROC-AUC scores and ROC curves that approximate random guessing.

1) COMPARATIVE ANALYSIS OF THE RESULTS
There are many studies associated with anemia prediction. Each study utilized different machine learning models and datasets to predict anemia or its severity, yielding varying degrees of accuracy, precision, recall, and F1 scores. Most studies favored ensemble or advanced models like Neural Networks and Support Vector Machines, which provided higher accuracy and overall performance (see Table 6).
The research study of Dixit et al. [14] utilized clinical data of patients for predicting anemia. The best-performing model (Neural Networks) had an accuracy of 92.3%, precision of 91%, recall of 89%, and F1 score of 90%. The high values for precision and recall indicate the model's ability to correctly identify both anemic and non-anemic cases with
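The ROC-AUC interpretation given above (1.0 for perfect separation, 0.5 for random guessing) follows from its equivalence to the probability that a randomly chosen anemic case is scored higher than a randomly chosen non-anemic case. This is not the paper's code; it is a minimal sketch of that rank-based computation with illustrative scores:

```python
import numpy as np

def roc_auc(y_true, scores):
    """ROC-AUC as the probability that a positive outranks a negative."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    # Pairwise comparisons; ties count half (equivalent to the rank formula).
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (greater + 0.5 * ties) / (len(pos) * len(neg))

y = [1, 1, 1, 0, 0]
scores = [0.9, 0.8, 0.3, 0.4, 0.2]   # one positive is outranked by a negative
auc = roc_auc(y, scores)             # 5 of the 6 positive-negative pairs are ordered correctly
```

In practice such scores would come from a trained model's predicted probabilities (e.g. scikit-learn's `predict_proba`), and `sklearn.metrics.roc_auc_score` computes the same quantity.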
models were used in this paper, which have the ability to retrieve some anatomical correlation features among healthcare data. This paper also formulates some hypotheses to reflect multi-data attribute co-existence in the data about the prediction of anemia, reflecting the logic in real-life data. In effect, the prediction of anemia might work differently in different types of data. Therefore, this study particularly identifies which combinations of data work in predicting anemia, i.e., by predicting some characteristics of the red blood cells, and what the characteristics of anemia are that could be predicted from personal data. Data synthesis is essential for creating healthcare indicators based on the actual expression among multi-data characteristics, and this is of concern in data analysis on anemia. The research finds that, in terms of performance, AdaBoost comes in first place with a rate of 92.8%. Nevertheless, Random Forest and XGBoost both do the best when it comes to test accuracy, precision of non-anemic, and recall of anemic, with values of 0.863, 0.89, and 0.96, respectively. These are the three metrics that are most important. Random Forest and XGBoost both have the greatest values, which is the reason

APPENDIX
See Figures 10–27.

FIGURE 22. The Decision Tree ROC-AUC.
FIGURE 25. The Naive Bayes ROC curve.
FIGURE 26. The AdaBoost ROC-AUC.
FIGURE 27. The AdaBoost ROC curve.

REFERENCES
[1] L. M. Neufeld, L. M. Larson, A. Kurpad, S. Mburu, R. Martorell, and K. H. Brown, "Hemoglobin concentration and anemia diagnosis in venous and capillary blood: Biological basis and policy implications," Ann. New York Acad. Sci., vol. 1450, no. 1, pp. 172–189, Aug. 2019.
[2] C. H. H. Le, "The prevalence of anemia and moderate-severe anemia in the U.S. population (NHANES 2003–2012)," PLoS ONE, vol. 11, no. 11, Nov. 2016, Art. no. e0166635.
[3] K. Doig and L. A. Thompson, "A methodical approach to interpreting the white blood cell parameters of the complete blood count," Amer. Soc. Clin. Lab. Sci., vol. 30, no. 3, pp. 186–193, Jul. 2017.
[4] K. C. Derecho, R. Cafino, S. L. Aquino-Cafino, A. Isla, J. A. Esencia, N. J. Lactuan, J. A. G. Maranda, and L. C. P. Velasco, "Technology adoption of electronic medical records in developing economies: A systematic review on physicians' perspective," Digit. Health, vol. 10, Jan. 2024, Art. no. 20552076231224605.
[5] I. Cabalar, T. H. Le, A. Silber, M. O'Hara, B. Abdallah, M. Parikh, and R. Busch, "The role of blood testing in prevention, diagnosis, and management of chronic diseases: A review," Amer. J. Med. Sci., vol. 368, no. 4, pp. 274–286, Oct. 2024.
[6] S. Pullakhandam and S. McRoy, "Classification and explanation of iron deficiency anemia from complete blood count data using machine learning," BioMedInformatics, vol. 4, no. 1, pp. 661–672, Mar. 2024.
[7] P. Appiahene, J. W. Asare, E. T. Donkoh, G. Dimauro, and R. Maglietta, "Detection of iron deficiency anemia by medical images: A comparative study of machine learning algorithms," BioData Mining, vol. 16, no. 1, p. 2, Jan. 2023.
[8] C. Yesiloglu, C. Emiroglu, and C. Aypak, "The relationship between glycated hemoglobin (HbA1c), hematocrit, mean platelet volume, total white blood cell counts, visceral adiposity index, and systematic coronary risk evaluation 2 (SCORE2) in patients without diabetes," Int. J. Diabetes Developing Countries, vol. 7, pp. 1–7, Mar. 2024.
[9] S. Suner, J. Rayner, I. U. Ozturan, G. Hogan, C. P. Meehan, A. B. Chambers, J. Baird, and G. D. Jay, "Prediction of anemia and estimation of hemoglobin concentration using a smartphone camera," PLoS ONE, vol. 16, no. 7, Jul. 2021, Art. no. e0253495.
[10] C. Ashok, S. Mahto, S. Kumari, A. Kumar, Deepankar, Vidyapati, M. Prasad, M. Mahajan, and P. K. Chaudhuri, "Impact of plateletpheresis on the hemoglobin, hematocrit, and total red blood cell count: An updated meta-analysis," Cureus, vol. 16, no. 6, Jun. 2024, Art. no. e61510.
[11] S. Dogan and I. Turkoglu, "Iron-deficiency anemia detection from hematology parameters by using decision trees," Int. J. Sci. Technol., vol. 3, no. 1, pp. 85–92, 2008.
[12] M. Abdullah and S. Al-Asmari, "Anemia types prediction based on data mining classification algorithms," Int. J. Inf. Manag. Sci., vol. 45, pp. 85–92, Apr. 2016.
[13] J. R. Khan, S. Chowdhury, H. Islam, and E. Raheem, "Machine learning algorithms to predict the childhood anemia in Bangladesh," J. Data Sci., vol. 17, no. 1, pp. 195–218, Feb. 2021, doi: 10.6339/jds.201901_17(1).0009.
[14] A. Dixit, R. Jha, R. Mishra, and S. Vhatkar, "Prediction of anemia disease using machine learning algorithms," in Proc. Intell. Comput. Netw., in Lecture Notes in Electrical Engineering, 2023, pp. 229–238, doi: 10.1007/978-981-99-0071-8_18.
[15] M. Alemayehu, M. Meskele, B. Alemayehu, and B. Yakob, "Prevalence and correlates of anemia among children aged 6–23 months in Wolaita zone, southern Ethiopia," PLoS ONE, vol. 14, no. 3, Mar. 2019, Art. no. e0206268, doi: 10.1371/journal.pone.0206268.
[16] N. Sundaram, M. Bennett, and J. Wilhelm, "Early detection of chronic anemia using machine learning models," Amer. J. Hematol., vol. 86, no. 7, pp. 559–566, 2011.
[17] H. A. Rayes, S. Vallabhajosyula, G. W. Barsness, N. S. Anavekar, R. S. Go, M. S. Patnaik, K. B. Kashani, and J. C. Jentzer, "Association between anemia and hematological indices with mortality among cardiac intensive care unit patients," Clin. Res. Cardiol., vol. 109, no. 5, pp. 616–627, May 2020, doi: 10.1007/s00392-019-01549-0.
[18] R. Provenzano, E. V. Lerma, and L. Szczech, "Anemia prediction in renal patients using hematological features and machine learning models," J. Clin. Haematol., vol. 112, pp. 234–242, Mar. 2019.
[19] M. Hasan, Mst. S. Tahosin, A. Farjana, M. A. Sheakh, and M. M. Hasan, "A harmful disorder: Predictive and comparative analysis for fetal anemia disease by using different machine learning approaches," in Proc. 11th Int. Symp. Digit. Forensics Secur. (ISDFS), May 2023, pp. 1–6, doi: 10.1109/ISDFS58141.2023.10131838.
[20] M. Jaiswal, A. Srivastava, and T. J. Siddiqui, "Machine learning algorithms for anemia disease prediction," in Proc. Recent Trends Commun., Comput., Electron., A. Khare, U. Tiwary, I. K. Sethi, and N. Singh, Eds., Singapore: Springer, 2019, pp. 55–63, doi: 10.1007/978-981-13-2685-1_44.
[21] P. Dhakal, S. Khanal, and R. Bista, "Prediction of anemia using machine learning algorithms," Int. J. Comput. Sci. Inf. Technol., vol. 15, no. 1, pp. 15–30, Feb. 2023, doi: 10.5121/ijcsit.2023.15102.
[22] P. P. Liang, A. Zadeh, and L.-P. Morency, "Foundations & trends in multimodal machine learning: Principles, challenges, and open questions," ACM Comput. Surv., vol. 56, no. 10, pp. 1–42, Oct. 2024.
[23] F. A. Khan and A. A. Ibrahim, "Machine learning-based enhanced deep packet inspection for IP packet priority classification with differentiated services code point for advance network management," J. Telecommun., Electron. Comput. Eng., vol. 16, no. 2, pp. 5–12, Jun. 2024.
[24] H. Nozari, J. Ghahremani-Nahr, and A. Szmelter-Jarosz, "AI and machine learning for real-world problems," Adv. Comput., vol. 134, pp. 1–12, Jan. 2024.
[25] D. B. Catacutan, J. Alexander, A. Arnold, and J. M. Stokes, "Machine learning in preclinical drug discovery," Nature Chem. Biol., vol. 19, no. 8, pp. 1–4, Aug. 2024.
[26] C. De Lucia, P. Pazienza, and M. Bartlett, "Does good ESG lead to better financial performances by firms? Machine learning and logistic regression models of public enterprises in Europe," Sustainability, vol. 12, no. 13, p. 5317, Jul. 2020.
[27] F. H. Awad, M. M. Hamad, and L. Alzubaidi, "Robust classification and detection of big medical data using advanced parallel K-means clustering, YOLOv4, and logistic regression," Life, vol. 13, no. 3, p. 691, Mar. 2023.
[28] Z. Rahmatinejad, T. Dehghani, B. Hoseini, F. Rahmatinejad, A. Lotfata, H. Reihani, and S. Eslami, "A comparative study of explainable ensemble learning and logistic regression for predicting in-hospital mortality in the emergency department," Sci. Rep., vol. 14, no. 1, p. 3406, Feb. 2024.
[29] A. Sekulić, M. Kilibarda, G. B. M. Heuvelink, M. Nikolić, and B. Bajat, "Random forest spatial interpolation," Remote Sens., vol. 12, no. 10, p. 1687, May 2020.
[30] S. M. Simon, P. Glaum, and F. S. Valdovinos, "Interpreting random forest analysis of ecological models to move from prediction to explanation," Sci. Rep., vol. 13, no. 1, p. 3881, Mar. 2023.
[31] J. Fisher, S. Allen, G. Yetman, and L. Pistolesi, "Assessing the influence of landscape conservation and protected areas on social wellbeing using random forest machine learning," Sci. Rep., vol. 14, no. 1, p. 11357, May 2024.
[32] J. Zhang, R. Wang, Y. Lu, and J. Huang, "Prediction of compressive strength of geopolymer concrete landscape design: Application of the novel hybrid RF–GWO–XGBoost algorithm," Buildings, vol. 14, no. 3, p. 591, Feb. 2024.
[33] S. Hakkal and A. A. Lahcen, "XGBoost to enhance learner performance prediction," Comput. Educ., Artif. Intell., vol. 12, Dec. 2024, Art. no. 100254.
[34] K. A. Awan, I. U. Din, A. Almogren, B.-S. Kim, and M. Guizani, "Enhancing IoT security with trust management using ensemble XGBoost and AdaBoost techniques," IEEE Access, vol. 12, pp. 116609–116621, 2024.
[35] T. M. Hossain, M. Hermana, and J. O. Olutoki, "Porosity prediction and uncertainty estimation in tight sandstone reservoir using non-deterministic XGBoost," IEEE Access, vol. 12, pp. 139358–139367, 2024.
[36] L. Yuan, D. Lian, X. Kang, Y. Chen, and K. Zhai, "Rolling bearing fault diagnosis based on convolutional neural network and support vector machine," IEEE Access, vol. 8, pp. 137395–137406, 2020.
[37] W. Tuerxun, X. Chang, G. Hongyu, J. Zhijie, and Z. Huajian, "Fault diagnosis of wind turbines based on a support vector machine optimized by the sparrow search algorithm," IEEE Access, vol. 9, pp. 69307–69315, 2021.
[38] J. Qiu, J. Xie, D. Zhang, and R. Zhang, "A robust twin support vector machine based on fuzzy systems," Int. J. Intell. Comput. Cybern., vol. 17, no. 1, pp. 101–125, Feb. 2024.
[39] A. Abubakar, H. Chiroma, A. Zeki, and M. Uddin, "Utilising key climate element variability for the prediction of future climate change using a support vector machine model," Int. J. Global Warming, vol. 9, no. 2, p. 129, 2016.
[40] O. I. Abiodun, A. Jantan, A. E. Omolara, K. V. Dada, N. A. Mohamed, and H. Arshad, "State-of-the-art in artificial neural network applications: A survey," Heliyon, vol. 4, no. 11, Nov. 2018, Art. no. e00938.
[41] A. Kaveh, Applications of Artificial Neural Networks and Machine Learning in Civil Engineering (Studies in Computational Intelligence), vol. 1168. Cham, Switzerland: Springer, 2024.
[42] M. Kurucan, M. Özbaltan, Z. Yetgin, and A. Alkaya, "Applications of artificial neural network based battery management systems: A literature review," Renew. Sustain. Energy Rev., vol. 192, Mar. 2024, Art. no. 114262.
[43] A. B. Nassif, I. Shahin, I. Attili, M. Azzeh, and K. Shaalan, "Speech recognition using deep neural networks: A systematic review," IEEE Access, vol. 7, pp. 19143–19165, 2019.
[44] J. Na, Z. Wang, S. Lv, and Z. Xu, "An extended k nearest neighbors-based classifier for epilepsy diagnosis," IEEE Access, vol. 9, pp. 73910–73923, 2021.
[45] K. Alnowaiser, "Improving healthcare prediction of diabetic patients using KNN imputed features and tri-ensemble model," IEEE Access, vol. 12, pp. 16783–16793, 2024.
[46] R. K. Halder, M. N. Uddin, M. A. Uddin, S. Aryal, and A. Khraisat, "Enhancing K-nearest neighbor algorithm: A comprehensive review and performance analysis of modifications," J. Big Data, vol. 11, no. 1, p. 113, Aug. 2024.
[47] D. K. Mandarapu, V. Nagarajan, A. Pelenitsyn, and M. Kulkarni, "Arkade: K-nearest neighbor search with non-Euclidean distances using GPU ray tracing," in Proc. 38th ACM Int. Conf. Supercomput., May 2024, pp. 14–25.
[48] Y. Y. Song and L. U. Ying, "Decision tree methods: Applications for classification and prediction," Shanghai Arch. Psychiatry, vol. 27, no. 2, p. 130, Apr. 2015.
[49] Y. Ma, H. Zhang, Y. Cai, and H. Yang, "Decision tree for locally private estimation with public data," in Proc. Adv. Neural Inf. Process. Syst., vol. 36, Feb. 2024, pp. 1–13.
[50] V. G. Costa and C. E. Pedreira, "Recent advances in decision trees: An updated survey," Artif. Intell. Rev., vol. 56, no. 5, pp. 4765–4800, May 2023.
[51] R. Blanquero, E. Carrizosa, P. Ramírez-Cobo, and M. R. Sillero-Denamiel, "Variable selection for Naïve Bayes classification," Comput. Oper. Res., vol. 135, Nov. 2021, Art. no. 105456.
[52] B. Ravinder, S. K. Seeni, V. S. Prabhu, P. Asha, S. P. Maniraj, and C. Srinivasan, "Web data mining with organized contents using naive Bayes algorithm," in Proc. 2nd Int. Conf. Comput., Commun. Control (IC4), Feb. 2024, pp. 1–6.
[53] X. Y. Zhang, W. F. Li, J. Y. Fang, and Z. M. Niu, "Nuclear mass predictions with the naive Bayesian model averaging method," Nucl. Phys. A, vol. 1043, Mar. 2024, Art. no. 122820.
[54] A. Shahraki, M. Abbasi, and Ø. Haugen, "Boosting algorithms for network intrusion detection: A comparative evaluation of real AdaBoost, gentle AdaBoost and modest AdaBoost," Eng. Appl. Artif. Intell., vol. 94, Sep. 2020, Art. no. 103770.
[55] X. Huang, Z. Li, Y. Jin, and W. Zhang, "Fair-AdaBoost: Extending AdaBoost method to achieve fair classification," Expert Syst. Appl., vol. 202, Sep. 2022, Art. no. 117240.
[56] W. Wang and D. Sun, ‘‘The improved AdaBoost algorithms for imbalanced data classification,’’ Inf. Sci., vol. 563, pp. 358–374, Jul. 2021.
[57] B. Liu, X. Li, Y. Xiao, P. Sun, S. Zhao, T. Peng, Z. Zheng, and Y. Huang, ‘‘AdaBoost-based SVDD for anomaly detection with dictionary learning,’’ Expert Syst. Appl., vol. 238, Mar. 2024, Art. no. 121770.
[58] R. Vohra, J. Pahareeya, and A. Hussain, ‘‘Complete blood count anemia diagnosis,’’ Mendeley Data, Liverpool John Moores Univ., Liverpool, U.K., Tech. Rep., 2021, doi: 10.17632/dy9mfjchm7.1.
[59] S. S. Abdul-Jabbar, A. K. Farhan, and A. S. Luchinin, ‘‘Comparative study of anemia classification algorithms for international and newly CBC datasets,’’ Int. J. Online Biomed. Eng., vol. 19, no. 6, pp. 141–157, May 2023.
[60] S. S. Abdul-Jabbar and D. A. Farhan, ‘‘Hematological dataset,’’ Mendeley Data, Medical City, Tech. Rep., 2022, doi: 10.17632/g7kf8x38ym.1. [Online]. Available: https://www.kaggle.com/code/eduarp/sickle-cell-anemia/notebook
[61] H. Lee-Six, N. F. Øbro, M. S. Shepherd, S. Grossmann, K. Dawson, M. Belmonte, R. J. Osborne, B. J. P. Huntly, I. Martincorena, E. Anderson, L. O’Neill, M. R. Stratton, E. Laurenti, A. R. Green, D. G. Kent, and P. J. Campbell, ‘‘Population dynamics of normal human blood inferred from somatic mutations,’’ Nature, vol. 561, no. 7724, pp. 473–478, Sep. 2018.
[62] D. Mansour, A. Hofmann, and K. Gemzell-Danielsson, ‘‘A review of clinical guidelines on the management of iron deficiency and iron-deficiency anemia in women with heavy menstrual bleeding,’’ Adv. Therapy, vol. 38, no. 1, pp. 201–225, Jan. 2021.
[63] F. Aslinia, J. J. Mazza, and S. H. Yale, ‘‘Megaloblastic anemia and other causes of macrocytosis,’’ Clin. Med. Res., vol. 4, no. 3, pp. 236–241, Sep. 2006.
[64] D. O. Okonko, A. K. Mandal, C. G. Missouris, and P. A. Poole-Wilson, ‘‘Disordered iron homeostasis in chronic heart failure: Prevalence, predictors, and relation to anemia, exercise capacity, and survival,’’ J. Amer. College Cardiol., vol. 58, no. 12, pp. 1241–1251, Sep. 2011.
[65] M. T. Maeder, O. Khammy, C. D. Remedios, and D. M. Kaye, ‘‘Myocardial and systemic iron depletion in heart failure: Implications for anemia accompanying heart failure,’’ J. Amer. College Cardiol., vol. 58, no. 5, pp. 474–480, Jul. 2011.
[66] V. Hoffbrand, G. Collins, and J. Loke, Hoffbrand’s Essential Haematology. Hoboken, NJ, USA: Wiley, Jul. 2024.
[67] M. Song, B. I. Graubard, E. Loftfield, C. S. Rabkin, and E. A. Engels, ‘‘White blood cell count, neutrophil-to-lymphocyte ratio, and incident cancer in the U.K. Biobank,’’ Cancer Epidemiol., Biomarkers Prevention, vol. 33, no. 6, pp. 821–829, Jun. 2024.

TALAL QADAH is currently an Associate Professor with the Department of Medical Laboratory Sciences, Faculty of Applied Medical Sciences, King Abdulaziz University, Jeddah, Saudi Arabia, where he also holds the position of an Adjunct Professor with the Hematology Research Unit, King Fahad Medical Research Center. The courses he teaches include hematology, advanced hematology, and case studies in hematology.

ASMAA MUNSHI received the B.Sc. degree in computer science from King Abdulaziz University, Saudi Arabia, in 2004, and the master’s degree (Hons.) in internet security and forensics and the Ph.D. degree in information security from Curtin University, Australia, in 2009 and 2014, respectively. She is currently an Associate Professor with the Cybersecurity Department, College of Computer Science and Engineering, University of Jeddah, Saudi Arabia, where she holds several positions: she is the Vice Dean (Female Section) of the College of Computer Science and Engineering, a Supervisor of the Cybersecurity Department (Female Section) of the same college, and the Vice Dean (Female Section) of the Faculty of Computing and Information Technology, Khulais Branch, University of Jeddah. Her research interests include computer forensics, information security, and the IoT.