Prediction of Anemia From Multi-Data Attribute Co-Existence
ABSTRACT The primary research problem identified by this study is that the medical world acknowledges that when a singular quality or variable is consistently demonstrated as a clear sign of a disease condition, it is typically regarded as the standard. Nevertheless, advancements in Information and Communication Technology (ICT), particularly Artificial Intelligence (AI), have enabled many additional attributes to influence the diagnosis of certain diseases to varying extents. Therefore, this study has reevaluated the claims within the domain of detecting and predicting anemia with the best machine learning algorithm. Another research problem lies in the fact that previous studies on anemia prediction utilized limited machine learning algorithms across a narrow range of datasets, whereas this current study employed numerous machine learning algorithms across a wide range of anemia datasets and tested three hypotheses. The statistical analysis validated all the hypotheses. The results also showed that ''AdaBoost'' excels in ''cross-validation accuracy,'' scoring 92.8%. On the other hand, when it comes to ''test accuracy,'' the ''precision'' of ''non-anemic,'' and the ''recall'' of ''anemic,'' ''Random Forest'' and ''XGBoost'' both perform best, with values of 0.863, 0.89, and 0.96, respectively. However, ''XGBoost'' performed best in terms of the ROC-AUC score, with a value of 0.9447. The most important contribution of this study is the finding that there is no single machine learning method that can accurately predict anemia based on the parameters associated with anemia. This means that a combination of methods is always needed, and doctors should still be involved in cases of anemia.
The associate editor coordinating the review of this manuscript and approving it for publication was Yongming Li.

2024 The Authors. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
VOLUME 12, 2024
T. Qadah, A. Munshi: Prediction of Anemia from Multi-Data Attribute Co-Existence

I. INTRODUCTION
Anemia, which means a decrease in the oxygen-carrying capacity of the blood due to a decrease in the number of red blood cells or the presence of abnormal hemoglobin, is one of the most common medical conditions seen today [1]. There are many types of anemia, which have been defined by the size and functional capacity of the red blood cells, the cause of the deficiency, or the type of red blood cells being produced. Sudden or chronic loss of blood, decreased production of red blood cells or hemoglobin by the bone marrow, the effects of medications, the destruction of red blood cells before the end of normal red blood cell life, and an increase in red blood cell destruction are associated with diseases such as anemia [2]. However, white blood cells, platelets, mean cell volume, mean cell hemoglobin, and red cell distribution width are also of great medical importance in diseases with accompanying anemia. Given the diversity of such parameters, the fact that they can be easily obtained with any complete blood count device and are low-cost tests contributes to their importance [3].
Determining the presence of anemia is one of the first steps in diagnosis when a patient applies to a health institution. For this purpose, parameters reported by the complete blood count, such as white blood cells and platelets, mean cell volume, mean cell hemoglobin, and red cell distribution width, which
show a decrease or increase in different types of anemia, are examined at the sub-optimal level [2]. The values may decrease, increase, or not stay in the expected range, which differentiates the aid decision. However, the red blood cell count, hemoglobin, and packed cell volume are the first three parameters examined for diagnosis, especially for anemia.
The increasing rate of advances in technology has led to large amounts of electronic medical record (EMR) data. Currently, the EMR contains many types of data, such as diagnosis, time of diagnosis, medication, and biochemical results [4]. Many individuals with chronic diseases like anemia need to plan health care, monitor progress, and possibly minimize the likelihood of anemia [5]. There are many different health and blood tests that can be used for various cases and purposes. The red blood cell (RBC) count, hemoglobin (Hb) concentration, hematocrit (HCT), which is commonly measured as packed cell volume (PCV), white blood cell (WBC) count, and platelet (PLT) count are common and important tests to identify potential medical problems, diagnose health issues, and determine prognosis [6].
The objectives of this research include the development of intelligent systems for the precise prediction of cases with the help of machine learning algorithms that can assist in monitoring blood disease screening agents. Machine learning algorithm models are fast becoming the most popular methods for disease case predictions. On the other hand, anemia is a health disorder condition facing 30% of the world's population [7]. This study investigates the interrelation of blood parameters in diagnosing anemia and proposes the use of the best machine learning algorithm models to determine future cases of anemia. Models were developed using decision trees, bootstrap aggregating for decision, and boosting decision. Performance metrics, accuracy, precision, recall, and F1 score were used to decide on the best model.
In medicine, hemoglobin, the number of RBC, white blood cells, platelets, hematocrit, the average volume of red blood cells from a calculated count, the average hemoglobin concentration in erythrocytes, and the distribution width of red blood cells are the indicators that are initially determined to diagnose anemia [8], [9]. These first indicators are also often called the first line in a general blood count [10]. The most informative indicators have been extensively researched in their properties. It is through these indicators that many common types of anemia can fundamentally be distinguished from each other, and the type of anemia can be determined.
This study is motivated by the fact that ''RBC count'', ''PCV'', ''Mean Cell Volume (MCV)'', ''Mean Cell Hemoglobin (MCH)'', ''Red Cell Distribution Width (RDW)'', ''WBC count'', ''PLT'', and ''Hb'' influence anemia status. That is why this study establishes relationships among these variables based on the available dataset. For that reason, the study contributes in the following ways:
• Demonstrated, through the use of hypothesis testing, that the RBC count, Hb, and PCV are all closely related. In most cases, a low red blood cell count is accompanied by reduced levels of Hb and PCV, which are all indicators of anemia in some form.
• Confirmed that MCV, MCH, and RDW all offer additional insight into the type of anemia and the factors that produce it. Whereas MCV and MCH represent the size of RBCs as well as the amount of hemoglobin they contain, RDW indicates the degree to which RBC sizes are uniform.
• Affirmed that WBCs and PLT can provide a more comprehensive picture of overall health and context of RBC function, despite the fact that they are not directly related to RBCs. This is because anomalies in all three kinds of blood parameter (RBC, WBC, and PLT) can indicate a systemic problem, such as a bone marrow condition.
• Showed that there is no single machine learning model that efficiently predicts anemia; an ensemble approach, together with the intervention of medical doctors, is recommended as the best option in addressing the prediction of anemia using datasets.
The remaining part of the paper, apart from this current section, is described as follows: Section II presents the related work, Section III presents the machine learning models, Section IV presents the research methodology, and Section V presents the results and discussion. Finally, Section VI presents the discussions.

II. RELATED WORK
There are many previous research studies associated with the prediction of anemia. The combined research from these previous studies highlights the significant advancements in using machine learning (ML) models to predict anemia from hematological parameters such as RBC count, Hb, PCV, and others. Dogan and Turkoglu [11] provided early evidence that decision trees could effectively detect iron deficiency anemia using basic hematology parameters. Their work laid the foundation for integrating data mining techniques into medical diagnostics, showing that decision trees can streamline anemia diagnosis. This approach, further supported by Abdullah and Al-Asmari [12], demonstrates that data mining and classification algorithms have a critical role in identifying anemia subtypes, allowing for a more nuanced understanding of blood disorders.
Expanding on these foundations, Khan et al. [13] and Dixit et al. [14] utilized more advanced machine learning models, such as support vector machines and neural networks, to predict anemia in specific populations like children in Bangladesh and general populations. These studies introduced comparative models to analyze the effectiveness of different algorithms, with Dixit et al. [14] emphasizing the superiority of certain classifiers in terms of accuracy. Their research not only advances machine learning in medical diagnostics but also tailors solutions to the specific needs of vulnerable groups, such as children, offering insights into the prevalence and risks of anemia in such populations.
The integration of machine learning into anemia prediction has been particularly impactful in regions where resources for
healthcare are limited. Alemayehu et al. [15] demonstrated how ensemble machine learning models could predict anemia in Ethiopian children, which is crucial for public health interventions. This work builds upon previous findings and showcases the adaptability of machine learning in diverse contexts, improving the accuracy and efficiency of anemia screening programs. Similarly, Sundaram et al. [16] stressed the importance of early detection of chronic anemia through machine learning, suggesting that timely intervention using these predictive models can greatly improve patient outcomes, especially in regions with limited medical resources.
Finally, the research also highlights the application of machine learning in predicting anemia in patients with specific medical conditions, such as cardiac and renal disorders. Rayes et al. [17] and Provenzano et al. [18] explored how hematological indices could be used to predict anemia in cardiac and renal patients, respectively. These studies underscored the importance of integrating machine learning into chronic disease management, offering tailored predictions that can enhance patient care. The ability to accurately predict anemia in patients with existing health issues provides a more comprehensive understanding of how anemia interacts with other diseases, paving the way for improved healthcare strategies.
Among the critical previous studies associated with this current research is the work of Dixit et al. [14], who proposed to predict anemia disease by applying various machine learning algorithms to clinical data. The primary goal was to identify the most accurate model that could assist healthcare professionals in the early detection and diagnosis of anemia. The study concluded that certain models outperformed others in terms of accuracy and reliability. Specifically, Neural Networks demonstrated the highest accuracy in predicting anemia. The results highlight the potential of machine learning algorithms to enhance diagnostic processes, enabling timely interventions and improved patient outcomes.
Hasan et al. [19] proposed a predictive and comparative analysis of fetal anemia using different machine learning approaches. The aim was to identify the most effective model for early detection, which is crucial for preventing adverse health effects in newborns. The comparative analysis revealed that the Support Vector Machines model provided the best performance in predicting fetal anemia. The study emphasized the importance of machine learning in prenatal care, suggesting that accurate predictive models can significantly aid in monitoring and managing fetal health.
Jaiswal et al. [20] proposed to develop and evaluate machine learning algorithms for the prediction of anemia disease. By leveraging computational models, the researchers aimed to improve diagnostic accuracy and support clinical decision-making. The study achieved significant accuracy, particularly with the Decision Tree model. The results demonstrated that machine learning algorithms could effectively predict anemia, underscoring their potential as valuable tools in medical diagnostics. The study suggested that integrating these models into healthcare systems could facilitate early detection and treatment.
Dhakal et al. [21] proposed to predict the level of anemia among individuals using machine learning algorithms. Understanding anemia severity is essential for determining appropriate treatment strategies and improving patient care. The study found that ensemble models, particularly Random Forest and Gradient Boosting, improved prediction accuracy for anemia levels compared to individual classifiers. The findings indicate that machine learning models can effectively classify anemia severity, providing valuable insights for clinicians to tailor treatments based on patient-specific data.
Table 1 presents a summary of some very closely related studies with the current research. These studies demonstrate the significant potential of machine learning algorithms in predicting anemia and its severity. By utilizing various models, from simple classifiers like Decision Trees and Naïve Bayes to complex ensemble methods, the research highlights how data-driven approaches can enhance diagnostic accuracy. The findings suggest that integrating machine learning into healthcare can lead to earlier
trees. Each training data learns from different decision trees, which are then combined to decide the final output. When they are combined, Random Forest can complement and improve the classification of decision trees [30]. The trees of the forest are created using artificially designed bootstraps, bags, and feature randomization, and they also divide the incoming data samples. After they form up, they make a ''vote'' (see Figure 2) so the winning output can be determined [31].
The reason why this current research adopted Random Forest lies with the fact that the number of features that Random Forest uses to provide the best results can be analyzed through hyperparameter analysis, where the more feature values given to the Random Forest, the more remarkable the results that will be produced. Its significance increases as this research has the opportunity to continue adding new characteristics of anemic cases to the model. In the prediction of mild, moderate, and severe anemia cases, a severe anemia case has the best result. Additionally, Random Forest swiftly provides good results, saving the energy of a computer system.

C. XGBoost
XGBoost is built on the gradient boosting framework. Gradient boosting is an ensemble of weak learners, where an initial model is used to make predictions, and the errors from those predictions are used to construct a better model [32]. This process is repeatedly iterated upon, leading to a gradual improvement in accuracy. The XGBoost algorithm carries a range of unique advantages such as high accuracy, parallel computing, and rapid computation speed in machine learning and other tasks.
The reason why this research adopted this technique lies with the fact that, in the realm of healthcare, XGBoost has a number of applications. XGBoost was used to delve into a dataset for latent class analysis and patient stratification in a study looking at long-term [33]. The structure of XGBoost and Random Forest is similar (see Figure 2). However, XGBoost emerged as an efficient random forest boosting implementation. It also has a fast runtime, even for higher parameters, and a lower error rate, making it more appropriate for big data. While Random Forest relies on ''Bagging'' at the model testing, XGBoost performs ''Boosting''. Similarly, the application of the XGBoost algorithm is used to verify the validity of the Random Forest algorithm. XGBoost has been widely used in various fields of machine learning and has good performance [34]. It allows the parameter estimation to be set over many values; however, with the XGBoost default values, the built-in 10-fold cross-validation is used to test the model and to output the relative features of the attributes. In the same way as Random Forest, the input probability value and output result of the grid search of each classifier are not only sent to the receiver in a file but also transmitted to the XGBoost model for calculation [35]. Then, we receive the result of evaluation and select the best result of the processing steps. The best parameter value output by XGBoost serves as the final decision.

D. SUPPORT VECTOR MACHINE
The support vector machine (SVM) is a non-probabilistic, supervised learning model that is primarily used in classification and regression problems [36]. It can be divided into six categories, and the simplest classification problem, used in the present research, is referred to as linearly separable binary classification. The widely used support vector machine is shown in Figure 3. An essence of SVM is to represent
the scatter in low space through the high space, and then to classify, estimate, and forecast it [37].
As one of the most effective classification methods in machine learning based on statistical learning theory, the classification model realized by SVM is a decision function, which is a hyperplane constructed from the training set data. There is a maximum distance between each class of training samples and the decision function, while the direction perpendicular to the decision function indicates the correlation of input data with the target classification [38]. Therefore, the decision function naturally plays a role as a discriminator.
The reason why this current research adopted SVM lies with the fact that the SVM model performs a grid search of hyperparameters. However, the final model prescription step removes the role of the feature columns and only returns the weightings of the supporting vector feature domains. Furthermore, if feature columns are removed directly, or if the feature columns are highly dependent on the sample data, some issues can develop, such as weight identifiability that cannot be estimated, and the model may show two problems: the rapid decay of the support vectors and changes in the directions of the original non-removed feature columns, which would have a great impact on understanding real anemic cases [39].

E. NEURAL NETWORK
Neural networks are modeled on the human nervous system. They consist of a large number of connected processing units that work in unison to understand data patterns and mimic the functioning of the human brain. Modern neural networks operate similarly, but at greater speed and on large data volumes. At their core, neural networks consist of units called neurons. These are arranged in multiple layers, with each layer potentially containing numerous neurons [40]. These layers in the plain form of a neural network are classified as an input layer, one or more hidden layers, and finally, one output layer (see Figure 4). How these neurons are arranged is referred to as the architecture of the network. In a deep learning context, there can be more than one hidden layer, and the architecture of such networks is referred to as a deep architecture. Each edge connecting two neurons represents the strength of the information and is assigned a weight [41]. Biases are added to control the flow of information and must be learned during the training phase.
The reason why this research adopted Neural Networks lies with the fact that neural network applications have also been found in the field of the detection of diseases [21]. In addition, Neural Network applications have been developed whose task is to support decisions in diagnosis. The problem, within the context of anemia, can also be seen as a panel problem, where the information in the examination or laboratory data acts simultaneously and alone at the same time in diagnosing or determining the amount of anemia. The background to the current research was the assumption that the blood count should be related through the correlation function of the features from the dataset, which are all obtained from blood parameters taken during the patient's visit. Numerous studies reported on the use of artificial intelligence for the prediction
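The layered forward pass described above (weighted edges between neurons, a bias per neuron, and an activation that squashes the weighted sum) can be sketched in a few lines. The weights, layer sizes, and sigmoid activation below are illustrative assumptions for demonstration only, not the network actually trained in this study.

```python
import math

# Minimal feed-forward pass: each edge carries a weight, each neuron has
# a bias, and a sigmoid activation squashes the weighted sum, as
# described in Section E above.
def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def layer(inputs, weights, biases):
    # weights[j][i] connects input i to neuron j; one bias per neuron.
    return [sigmoid(sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]

# Hypothetical 2-input -> 2-hidden -> 1-output architecture.
hidden = layer([0.5, -1.0],
               weights=[[0.8, 0.2], [-0.4, 0.9]],
               biases=[0.1, -0.3])
output = layer(hidden, weights=[[1.5, -1.1]], biases=[0.05])
print(output)
```

In a real training loop, the weights and biases would be learned from the CBC features rather than fixed by hand, which is the "training phase" the text refers to.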
F. K-NEAREST NEIGHBORS
K-Nearest Neighbors (KNN) is a classification algorithm that makes predictions based on how numerically similar new information is to already known data [44]. This technique is commonly used as a go-to algorithm for the prediction of new, incoming health data. KNN assumes that the characteristics of data can be approximated by the characteristics of its neighbors and complements discriminant analyses or classifications of physiological data. The strengths of KNN are that it is simple and easily understood, with a foundation in human logic. The KNN algorithm uses the nearest distance to classify and predict data based on the nearest k data points (see Figure 5). When determining the classification or regression of new data, the KNN algorithm will search for the k most similar data points according to the distance between the new data and the data in the dataset [45]. In addition, the KNN algorithm is sensitive to the selection of the value of k. The selection of the k value should be determined according to the characteristics of the specific situation and the complexity of the data to be classified.

FIGURE 5. The K-Nearest Neighbors architecture.

The quality of the KNN implementation is dependent upon the dataset selected; a variety of datasets have been utilized to conduct similar analyses. Techniques to discretize raw data include normalizing continuous measures and utilizing non-parametric tests to select the top features. Although the universal application of KNN to diverse datasets is evident, caution should be exercised when attempting to apply KNN generally to various datasets, mainly due to its reliance upon data size and cumulative flow [46].
The reason why this study adopted KNN lies with the fact that, although KNN can be powerful for locating specific and wanted data, it is also crucial in the implementation of KNN that testing against simulated data and extracted data reveals that the prediction of anemia can be carried out at an acceptable degree in relation to actual data. Furthermore, the optimal number selected is also in accordance with actual data. Previous results demonstrated by some studies suggest an encouraging increase in prediction in cases of compressed anemia data [47].

G. DECISION TREE
The decision tree is a powerful modeling technique. It is simple to understand and easy to use. Decision trees transform a set of data into a tree or a flowchart, with the nodes representing the meanings of tests on attributes and the branches representing the possible answers. They are constructed by an algorithm that identifies the most significant feature upon which to split the data [48]. Decision trees have the ability to discover the rules that are implicit in the data, requiring little data cleaning and processing and being acceptable to
a wide range of users. On the other hand, decision trees can model the interaction between features in a nonlinear manner. The prediction results obtained are clear and easy to use. Among all of the machine learning models, the decision tree is a good model for white-box modeling [49].
The reason why this research adopted it lies with the fact that one of the best things about decision trees is that they mimic human-level thinking, so if we document the steps that a human takes to make a decision, it is easy to teach a computer to mimic the human's decision-making process. From a technical point of view, decision trees are frequently used to analyze and handle data problems, and the decision tree is the model that has the greatest capacity for predicting accuracy. Decision trees have been intensively studied and applied in a variety of fields, including economics, finance, analysis of ecological data, astronomy, and healthcare. The application of decision trees to healthcare is extensive; they can be used to improve efficiency, reduce costs, and standardize the performance of medical diagnosis and therapy. Decision trees in healthcare can aid in the diagnosis of diseases, prognosis, and prediction [50].
Decision trees are categorized into two types: classification and regression. The classification tree is used to solve problems with a categorical outcome, while the regression tree is used to deal with continuous outcomes in making decisions or predictions. Remarkably, the classification and regression trees encapsulate a proper scenario where there are at least two classes and several attributes. While a simple decision tree exhibits a striking tendency, it is accompanied by certain drawbacks, such as instability, overfitting, and weak predictions. To address these drawbacks, a number of decision tree variants have been developed that provide a collection of sub-trees chosen with different decision criteria, regardless of the most critical attribute, and prune the error in future [50].
The structure of Decision Trees is part of the Random Forest structure presented in Figure 2. Furthermore, the current advanced decision trees include Random Forests and Adaptive Boosting.

H. NAIVE BAYES
Naive Bayes is a simple and easy-to-understand probabilistic classification algorithm based on applying Bayes' theorem with a strong attribute conditional independence assumption [51]. Naive Bayes is an extremely fast algorithm. It can make predictions faster than many other algorithms. This attribute is particularly welcome for large-volume datasets. Naive Bayes requires only a small amount of training data. It estimates the conditional probability for each feature. If there is limited data, more sophisticated models may end up overfitting. However, the Naive Bayes probability estimates often work well in practice [52].
The reason why this study adopted it lies with the fact that a Naive Bayes model was employed to analyze the correlation between anemia, symptoms, syndromes, related diseases, and other attributes as additional disease predictor attributes, over its existence, to continuously affect the population. The Naïve Bayes structure is similar to that of KNN: while KNN uses k as the central point of the prediction (see Figure 5), Naïve Bayes uses the Bayesian method and assumes that the data contain sufficient probabilistic information to estimate the joint probability of attributes. That is, the maximum posterior probability outcome from the entire data is considered an optimal conclusion [53]. In simple words, this means that, where two or more categories or classes exist, any new incoming data entry will fit into its class or category.

I. AdaBoost
The AdaBoost algorithm is short for Adaptive Boosting. It is a kind of boosting learning algorithm in machine learning. Given a training dataset with N samples, AdaBoost can combine the weak learners to generate a strong prediction model in an iterative manner [54]. The core aim of the algorithm is to minimize the error on the training samples and to assign different weights to the classification errors in different iterations, in order to correct and focus the model's fitting of training samples on the wrongly classified samples. There have been boosting algorithms like AdaBoost, XGBoost, Gradient Boosting Decision Tree, and others that are widely used in various fields [55]. Follow-up investigators have extensively explored and improved on the core algorithm of AdaBoost, thus making it suitable for large-scale sparse multi-class classification tasks and giving it high prediction performance.
The reason why this research adopted this technique lies with the fact that the AdaBoost algorithm can continuously improve the classification performance of the combined weak classifiers on the test set during the AdaBoost iterative update process, up to some optimal model. After AdaBoost has summed a sufficient iterative number of weak classifiers, the classifier learned from the original classification problem can have the same performance [56]. The superiority of the AdaBoost algorithm in multi-class and binary classification problems identifies the AdaBoost algorithm as better than the decision tree or single neural network design; traditional method experiments in a variety of applications show good generalization. The AdaBoost algorithm prediction model has a good fitting effect and good accuracy in advance sample data prediction in the medical laboratory for anemia prediction. The AdaBoost algorithm is a cornerstone technology for anemia prediction [57].
The structure of AdaBoost follows the same pattern as XGBoost, and both also follow the pattern of Random Forest (see Figure 2). However, the goal of the AdaBoost algorithm is to increase the accuracy of classification by focusing on falsely classified data distributions. AdaBoost conducts the ''boosting'' process to improve classifier properties by repeatedly giving a weight to each error in the process of learning a weak classifier [56]. First, a weak classifier is a series of algorithms designed to identify the characteristics of a target and is usually less complex
than a strong classifier. Next, AdaBoost integrates multiple TABLE 3. Measurement values for normal blood cells [60].
weak classifiers to construct a strong classifier for increased
predictive performance. The AdaBoost algorithm will update
the weights by adjusting the accuracy of each weak classifier
in turn, making the current weak classifier focus on misclas-
sified samples in the sample space
A. DATASET
The dataset utilized for this research was acquired from Kaggle [58], [59], [60]. This dataset illustrates the prevalence of various forms of anemia, encompassing its severity and its correlation with age and gender among the research population, utilizing Complete Blood Count (CBC) characteristics as variables. The dataset was derived from whole blood count tests conducted by a hematology analyzer to ascertain the prevalence of various kinds of anemia treated at the Eureka Diagnostic Center in Lucknow, India. All procedures for the CBC test were conducted in accordance with the standard operating protocols established for the hematology analyzer.
For the CBC analysis, 400 patient samples were randomly selected to compile the dataset from individuals who visited the Eureka Diagnostic Center in Lucknow for various clinical assessments. The diagnostic center conducts an average of 4 to 8 CBC investigations daily. Between September 2020 and December 2020, 1000 CBC investigations were conducted, from which 400 random samples were selected. The dataset comprised adult males and non-pregnant females above 15 years of age within the study group. Infants, children under 10 years of age, and pregnant women were excluded from the study due to issues such as fluctuating CBC test values and other considerations. Upon eliminating the aforementioned individuals from the randomly selected sample of 400 patients, the final dataset comprised 364 patients. The first five entries from the dataset are presented in Table 2.

1) DATASET PARAMETERS
The anemia dataset utilized in this study is classified according to standard characteristics associated with age and gender, as illustrated in Table 3. Hb readings in a CBC may differ among laboratories, with average levels for adult men and women being below 135 g/L and 115 g/L, respectively [61]. The World Health Organization characterizes anemia as hemoglobin concentrations falling below 130 g/L for males and 120 g/L for females [62]. The remaining values are associated with the standard and serve as the benchmark.

2) CONCEPTUALIZATION FROM THE DATASET
This current research formulates three main hypotheses in order to test and validate the claims. At the onset the dataset
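The WHO definition quoted above can be made concrete with a short sketch. The helper and sample records below are invented for illustration (they are not rows from the Kaggle dataset, and the real file's column names may differ):

```python
# WHO criterion: anemia is Hb below 130 g/L for males and below 120 g/L for females.
def who_anemia_flag(hb_g_per_l, sex):
    """Return True when the hemoglobin level meets the WHO anemia criterion."""
    threshold = 130.0 if sex == "M" else 120.0
    return hb_g_per_l < threshold

# (sex, Hb in g/L) pairs, illustrative only
samples = [("M", 128.0), ("F", 121.5), ("F", 119.0), ("M", 140.0)]
flags = [who_anemia_flag(hb, sex) for sex, hb in samples]
```

Note that the study's own labelling rule (Section on encoding and labelling) uses g/dL cutoffs of 13.5 and 12, which are close to, but not identical with, the WHO g/L thresholds above.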
FIGURE 8. The model for the abnormalities within WBC and PLT.

A negative correlation exists when one variable increases while the other falls. For that reason, a heatmap, a vibrant chart that illustrates the association among all columns, is drawn (see Figure 9). The intensity of the shades in the heatmap signifies the strength of the correlations among the various columns in the dataset and hence answers our research hypotheses.
Figure 9 illustrates a strong positive correlation between RBC and PCV, demonstrating that an increase in RBC corresponds to an increase in PCV, hence supporting the proposed hypothesis that RBC count and PCV are closely associated. It also follows that a low RBC count leads to lower PCV levels, indicating some form of anemia.
The correlation between MCV and MCH is 0.77425; this shows that MCV and MCH are truly associated and capable of indicating the size and Hb content of RBCs. Furthermore, RDW shows very low negative correlations of −0.02 and −0.216 with MCV and MCH, respectively; that is, it is associated not with the size of RBCs but with the pattern of RBCs. As a result, hypothesis 2 is supported.
Finally, the test of Hypothesis 3 indicates that WBC and PLT are not directly related to each other; however, they are both a source of abnormalities in RBC. Platelets are involved in blood clotting, while WBCs are more about immune defense.

1) PERFORMANCE METRICS
Evaluation metrics are essential in machine learning for the close monitoring of new models. In any situation where a model is developed that separates positives from negatives, it is essential to be able to evaluate the performance of this model using some standard metrics. Among the standard metrics for measuring the quality of machine learning models are "accuracy", "precision", "recall", and "F1-score". These emerge from a confusion matrix comprising true positives (TPs), true negatives (TNs), false positives (FPs), and false negatives (FNs).
Accuracy is an important concept in the world of machine learning. It is the ratio of the correctly predicted instances to the total instances in our dataset. The calculation of accuracy is important as it helps us understand our predictive models. The accuracy is generated using equation 1:

Accuracy = (TP + TN) / (TP + TN + FP + FN) (1)

Precision is the ratio of correctly predicted positive observations to the total predicted positive observations. It denotes the extent to which the predicted positives are actually positive, and it can be interpreted as the proportion of true positive instances among all the instances that were predicted to be positive. Precision is calculated by equation 2:

Precision = TP / (TP + FP) (2)

Recall is defined as the probability that the classifier correctly predicted the label of the positive instances out of all actual positive instances. Recall is frequently employed as a measure of the aptitude of a program to recognize a specific class, and it is calculated by using equation 3:

Recall = TP / (TP + FN) (3)

The F1-score is the harmonic mean of the precision and recall scores. It does not ignore the presence of false negatives when there are false positives; it returns low values when either precision or recall is low. In other words, the F1-score gives a more balanced result, the harmonic mean of precision and recall, which is calculated by equation 4:

F1-score = TP / (TP + (1/2)(FP + FN)) (4)
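Equations 1 to 4 can be computed directly from confusion-matrix counts. The sketch below uses invented counts for illustration, not values from the paper's experiments:

```python
# Compute accuracy, precision, recall, and F1 from confusion-matrix counts.
def metrics(tp, tn, fp, fn):
    accuracy = (tp + tn) / (tp + tn + fp + fn)   # equation 1
    precision = tp / (tp + fp)                   # equation 2
    recall = tp / (tp + fn)                      # equation 3
    f1 = tp / (tp + 0.5 * (fp + fn))             # equation 4
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Illustrative counts only.
m = metrics(tp=40, tn=45, fp=10, fn=5)
```

Algebraically, equation 4 equals the harmonic mean 2PR / (P + R), so the two formulations are interchangeable.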
A. DATASET ENCODING AND LABELLING
The preprocessing involves preparing the dataset so that it is fit and ready for the model analysis. The first step for this research involves encoding the gender attribute of the dataset. The research provides a code function that changes the "Sex" column in the dataset, which contains the values "Male" and "Female", into numbers. Specifically, "Male" is encoded as 0, and "Female" is encoded as 1. This is necessary because most machine learning algorithms work better with numbers than text.
The next step of the preprocessing is setting the condition for classifying anemia from the dataset. A function in the code defines a condition to determine whether an entry in the dataset has anemia based on the hemoglobin (Hb) level and sex: if a male (sex is 0) has an Hb level below 13.5, the person is considered anemic (marked as 1); if a female (sex is 1) has an Hb level below 12, she is also considered anemic (marked as 1); otherwise, if the Hb is above those levels, the person is considered not anemic (marked as 0). The code applies this logic to each row (person) in the dataset and creates a new column called "Anemia" to indicate whether a person is anemic or not.

B. FEATURE SELECTION AND DATA PARTITIONING
Feature selection is the process of choosing and retaining the most relevant features to enhance model interpretability and to reduce or avoid overfitting to noise. Particularly when working with a huge amount of data and features, it is important to identify which features influence the model predictions. Feature selection techniques are categorized as filter, wrapper, and embedded methods. Each method has its own advantages, disadvantages, and criteria based on which a practitioner can choose a method for their data. Filter methods essentially evaluate the importance of a feature or features of a dataset. This current research applies filter feature selection; because filter methods do not rely on any learning algorithm, they usually require less computational time and data.
The feature selection code for this research splits the data into two parts. The features (X) include various blood-related measurements: "RBC count", "PCV", "MCV", "MCH", "RDW", "WBC count", "PLT", and "Hb". These are the features used to predict whether a person has anemia or not. The target (y) is the "Anemia" column, indicating whether the person has anemia or not. This is what the model tries to predict.
Assessing model performance is an efficient way to validate the results of a machine learning model, and data partitioning is essential to perform model validation. To assess the model's performance, the dataset needs to be divided into two parts: a training set and a testing set. The training set is utilized for building the model, and the testing set is used to validate the developed model. The training data (80%) is used to train the machine learning model, and the testing data (20%) is kept aside to evaluate how well the model works. The split ensures that the evaluation is done on new data that the model hasn't seen before.
Finally, all the features are standardized to a similar scale to ensure that no one feature dominates the others. This is done by subtracting the mean and dividing by the standard deviation for each feature. The training data is used to learn how to scale the values, and then the same scaling method is applied to the testing data.

VI. EXPERIMENTAL RESULTS AND DISCUSSIONS
A. INITIALIZING THE TRAINING MODELS
All of the models that were deployed in this study have been initialized, as described in Section III. Writing the code necessary to construct and set up the variety of models designed to predict the presence of anemia in individuals is a necessary step in the process of machine learning model initialization. Both the training and the prediction were taken care of. The training data, which consists of the subset of data designated earlier for instructing the models, is utilized in the process of teaching each individual model. For the purpose of determining how well the models have learned, they are given the duty of making predictions on the testing data after they have been trained.
The testing data is the subset of the data that was not used during the training process. The code calculates the accuracy of each model, which indicates the frequency with which the model correctly predicts whether or not an individual has anemia. A detailed report is created for each model, which illustrates its performance across a range of measures (for example, accurately distinguishing those who have anemia from those who do not). The accuracy of each model is documented in a dictionary (results), where the name of the model serves as the key and the accuracy score of the model serves as the value. The findings are analyzed by the code in order to determine which model has the highest accuracy. Following that, it presents the name of the best model along with its accuracy.

B. PRESENTATION OF THE RESULTS
The entire set of results from training the various models has been gathered (see Table 4). The combined training reveals that "Logistic Regression" obtained an accuracy of 0.836 (83.6%); that is, the model predicted anemia correctly 83.6% of the time. The precision, which measures how many of the predicted anemic cases are actually anemic, is 80% for Class 0 (not anemic) and 85% for Class 1 (anemic). The recall, which measures how many actual cases of each class were correctly identified, is 67% for Class 0, meaning the model missed quite a few actual non-anemic cases, whereas the recall for Class 1 is 92%, indicating that most anemic cases were correctly identified.
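The preprocessing steps described above (encoding, Hb-based labelling, the 80/20 split, and standardization from training statistics only) can be sketched as follows. The tiny inline records and the deterministic split are illustrative assumptions; the study worked on the 364-patient CBC dataset and presumably used a randomized split:

```python
import numpy as np

# Illustrative records only (Hb in g/dL), not rows from the study's dataset.
records = [
    {"sex": "Male", "hb": 12.9}, {"sex": "Female", "hb": 13.1},
    {"sex": "Male", "hb": 14.2}, {"sex": "Female", "hb": 11.4},
]

# Encode gender: "Male" -> 0, "Female" -> 1.
sex = np.array([0 if r["sex"] == "Male" else 1 for r in records])
hb = np.array([r["hb"] for r in records])

# Label anemia: males below 13.5 g/dL and females below 12 g/dL are anemic.
anemia = np.where((sex == 0) & (hb < 13.5) | (sex == 1) & (hb < 12.0), 1, 0)

X = np.column_stack([sex, hb]).astype(float)
y = anemia

# 80/20 train/test partition (deterministic here for illustration).
n_train = int(0.8 * len(X))
X_train, X_test = X[:n_train], X[n_train:]
y_train, y_test = y[:n_train], y[n_train:]

# Standardize using statistics learned from the training data only,
# then apply the same scaling to the test data.
mu, sigma = X_train.mean(axis=0), X_train.std(axis=0)
sigma[sigma == 0] = 1.0          # guard against constant columns
X_train_std = (X_train - mu) / sigma
X_test_std = (X_test - mu) / sigma
```

Fitting the scaler on the training partition alone, as the text specifies, prevents information about the test set from leaking into the model.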
Therefore, the model is better at identifying anemic patients but not as good at correctly identifying non-anemic patients (it sometimes misclassifies non-anemic patients as anemic).
The result of the "Random Forest" model indicates that an accuracy of 0.849 (84.9%) was obtained, meaning the model is slightly better than "Logistic Regression", correctly predicting anemia 84.9% of the time. The precision results indicate that Class 0 obtained 88% (higher precision, fewer false positives for non-anemic), whereas Class 1 obtained 84% (slightly lower than Logistic Regression for anemic cases). The recall for Class 0 is 62% (still missing a fair number of non-anemic cases), while Class 1 obtained 96% (excellent at catching anemic cases). Hence, it can be concluded that the model is highly effective at detecting anemia, but there is room for improvement in identifying non-anemic patients. The trade-off here is slightly better accuracy overall.
The results obtained for "XGBoost" indicate an accuracy of 0.863 (86.3%), the highest accuracy among the models, correctly predicting anemia 86.3% of the time. The precision for Class 0 is 89% (best so far at predicting non-anemic), whereas Class 1 obtained 85% (good at identifying anemic cases). The recall for Class 0 is 67% (better than Random Forest, still some false negatives), while Class 1 obtained 96% (excellent recall for anemic cases). This model strikes the best balance: it performs well in identifying both anemic and non-anemic patients and is the best-performing model overall.
The result obtained from the Support Vector Machine indicates an accuracy of 0.808 (80.8%), which is lower than the previous models. The precision for Class 0 is 71% (fewer false positives for non-anemic), while Class 1 is 86% (good but not the best). The recall for Class 0 is 71% (balanced, but still misclassifying some non-anemic cases), while Class 1 obtained 86% (lower than other models for anemic detection). While this model is relatively balanced in precision and recall, its overall performance is lower than others like XGBoost, making it less ideal for this task.
The remaining five models obtained the following accuracies: Neural Network (MLP) 0.836 (83.6%), K-Nearest Neighbors (KNN) 0.795 (79.5%), Decision Tree 0.849 (84.9%), Naive Bayes 0.808 (80.8%), and AdaBoost 0.836 (83.6%). Therefore, it can be recognized that "XGBoost" stands out as the best model overall, with the highest accuracy (86.3%). It balances well between precision and recall, especially for detecting both anemic and non-anemic patients.
Considering the various performances exhibited by the models, the research fine-tuned the hyperparameters, allowing each model to be optimized and its performance improved. The function hyperparameter_tuning uses GridSearchCV to automatically test various combinations of the hyperparameters defined in the parameter grids. It tries out different values from the parameter grids to find the best-performing configuration for each model. Cross-validation (splitting the data multiple times for training and testing) is used to ensure the model's performance is not just luck. This process helps the model find the best settings, improving accuracy and generalization on unseen data. Once the best model configuration is found, it is trained on the training dataset and used to make predictions on the test dataset. The code also calculates accuracy and prints a detailed classification report showing precision, recall, F1-score, etc., for both anemic and non-anemic predictions. For each model, the research wrote code that generates a confusion matrix to understand where the model made correct and incorrect predictions. This is plotted using a heatmap, which provides a visual representation of the model's performance.
The combined result is presented in Table 5. Upon optimization, it was determined that Random Forest is the superior model according to test accuracy, cross-validation accuracy, and precision-recall balance: cross-validation accuracy is 91.06%, test accuracy is 86.30%, precision for Class 1 is 85%, and recall is 96%. This model effectively identifies anemic patients and has a balanced performance across both classes, achieving a ROC-AUC of 92.77% (see Table 5). Its efficacy is rooted in its capacity to accurately identify both anemic and non-anemic subjects while reducing false positives and negatives.
The "ROC-AUC score" provides a single metric that encapsulates a model's efficacy in differentiating between the classes of anemic and non-anemic individuals. A flawless score would be 1.0, indicating the model accurately differentiates between the two groups, while a score of 0.5 signifies that the model is making random guesses. Models such as XGBoost, exhibiting a ROC-AUC of 0.9447, provide superior class separation compared to others like Decision Tree, which has a ROC-AUC of 0.7717.
The ROC curve illustrates the true positive rate (the proportion of correctly detected anemic cases) in relation to the false positive rate (the proportion of non-anemic cases erroneously labeled as anemic). The optimal curve would swiftly ascend towards the upper left corner of the graph, indicating a high true positive rate and a low false positive rate for the model. A diagonal line signifies a model that generates random predictions, whereas proximity of the curve to the top left indicates superior class differentiation by the model.
The Logistic Regression model achieved a ROC-AUC score of 0.9371 (refer to Figure 10 in the Appendix). The ROC curve demonstrates a robust capacity to predict anemia (refer to Figure 11 in the Appendix), albeit marginally inferior to XGBoost and Random Forest.
The "Random Forest" achieved a ROC-AUC score of 0.9277 (refer to Figure 12 in the Appendix), and its ROC curve resembles that of XGBoost, effectively distinguishing the classes but exhibiting marginally inferior predictive capability (refer to Figure 13 in the Appendix).
The "XGBoost" yields a ROC-AUC score of 0.9447 (refer to Figure 14 in the Appendix). The analysis yielded a ROC curve positioned in the top left, indicating superior predictive capability in differentiating between anemic and non-anemic cases (refer to Figure 15 in the Appendix).
The Support Vector Machine achieved a ROC-AUC score of 0.9362 (refer to Figure 16 in the Appendix). A well-defined ROC curve demonstrating effective class separation, comparable to Random Forest performance, was achieved (see Figure 17 in the Appendix).
The Neural Network achieved a ROC-AUC score of 0.9277 (refer to Figure 18 in the Appendix), while the ROC curve exhibited a pattern akin to that of Random Forest, indicating its efficacy in differentiating between anemic and non-anemic cases (refer to Figure 19 in the Appendix).
The K-Nearest Neighbors (KNN) achieved a ROC-AUC score of 0.8401 (refer to Figure 20 in the Appendix), but the ROC curve indicates that KNN encounters greater difficulty in distinguishing between the two classes (refer to Figure 21 in the Appendix).
The Decision Tree achieved a ROC-AUC score of 0.7717 (refer to Figure 22 in the Appendix), whereas the ROC curve approximates a diagonal line, indicating that the Decision Tree performs less effectively than other models (refer to Figure 23 in the Appendix).
The Naive Bayes achieved a ROC-AUC score of 0.8988 (refer to Figure 24 in the Appendix). The ROC curve of Naive Bayes effectively predicts anemia; however, it is less robust than that of XGBoost or Random Forest (see Figure 25 in the Appendix).
The AdaBoost achieved a ROC-AUC score of 0.9405 (refer to Figure 26 in the Appendix), and the ROC curve demonstrates a robust performance, comparable to the top models, effectively distinguishing between the two classes (refer to Figure 27 in the Appendix).
The findings indicated that XGBoost, Random Forest, AdaBoost, and SVM are robust performers, exhibiting elevated ROC-AUC scores and superior ROC curves. Likewise, the Logistic Regression and Neural Network exhibit commendable performance, while not reaching the standards of the leading models. KNN and Decision Tree exhibit more difficulty, evidenced by diminished ROC-AUC scores and ROC curves that approximate random guessing.

1) COMPARATIVE ANALYSIS OF THE RESULTS
There are many studies associated with anemia prediction. Each study utilized different machine learning models and datasets to predict anemia or its severity, yielding varying degrees of accuracy, precision, recall, and F1 scores. Most studies favored ensemble or advanced models like Neural Networks and Support Vector Machines, which provided higher accuracy and overall performance (see Table 6).
The research study of Dixit et al. [14] utilized clinical data of patients for predicting anemia. The best-performing model (Neural Networks) had an accuracy of 92.3%, precision of 91%, recall of 89%, and F1 score of 90%. The high values for precision and recall indicate the model's ability to correctly identify both anemic and non-anemic cases with
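The ROC-AUC interpretation given above (1.0 for perfect separation, 0.5 for random guessing) follows from its equivalence to the probability that a randomly chosen anemic case is scored higher than a randomly chosen non-anemic case. This is not the paper's code; it is a minimal sketch of that rank-based computation with illustrative scores:

```python
import numpy as np

def roc_auc(y_true, scores):
    """ROC-AUC as the probability that a positive outranks a negative."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    # Pairwise comparisons; ties count half (equivalent to the rank formula).
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (greater + 0.5 * ties) / (len(pos) * len(neg))

y = [1, 1, 1, 0, 0]
scores = [0.9, 0.8, 0.3, 0.4, 0.2]   # one positive is outranked by a negative
auc = roc_auc(y, scores)             # 5 of the 6 positive-negative pairs are ordered correctly
```

In practice such scores would come from a trained model's predicted probabilities (e.g. scikit-learn's `predict_proba`), and `sklearn.metrics.roc_auc_score` computes the same quantity.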
models were used in this paper, which have the ability to retrieve some anatomical correlation features among healthcare data. This paper also formulates some hypotheses to reflect multi-data attribute co-existence in the data about the prediction of anemia, reflecting the logic in real-life data. In effect, the prediction of anemia might work differently in different types of data. Therefore, this study particularly identifies which combinations of data work in predicting anemia, i.e., by predicting some characteristics of the red blood cells, and what the characteristics of anemia are that could be predicted from personal data. Data synthesis is essential for creating healthcare indicators based on the actual expression among multi-data characteristics, and this is of concern in data analysis on anemia. The research finds that, in terms of performance, AdaBoost comes in first place with a rate of 92.8%. Nevertheless, Random Forest and XGBoost both do the best when it comes to test accuracy, precision of non-anemic, and recall of anemic, with values of 0.863, 0.89, and 0.96, respectively. These are the three metrics that are most important. Random Forest and XGBoost both have the greatest values, which is the reason

APPENDIX
See Figures 10–27.

FIGURE 22. The Decision Tree ROC-AUC.
FIGURE 25. The Naive Bayes ROC curve.
FIGURE 26. The AdaBoost ROC-AUC.
FIGURE 27. The AdaBoost ROC curve.

REFERENCES
[1] L. M. Neufeld, L. M. Larson, A. Kurpad, S. Mburu, R. Martorell, and K. H. Brown, "Hemoglobin concentration and anemia diagnosis in venous and capillary blood: Biological basis and policy implications," Ann. New York Acad. Sci., vol. 1450, no. 1, pp. 172–189, Aug. 2019.
[2] C. H. H. Le, "The prevalence of anemia and moderate-severe anemia in the U.S. population (NHANES 2003–2012)," PLoS ONE, vol. 11, no. 11, Nov. 2016, Art. no. e0166635.
[3] K. Doig and L. A. Thompson, "A methodical approach to interpreting the white blood cell parameters of the complete blood count," Amer. Soc. Clin. Lab. Sci., vol. 30, no. 3, pp. 186–193, Jul. 2017.
[4] K. C. Derecho, R. Cafino, S. L. Aquino-Cafino, A. Isla, J. A. Esencia, N. J. Lactuan, J. A. G. Maranda, and L. C. P. Velasco, "Technology adoption of electronic medical records in developing economies: A systematic review on physicians' perspective," Digit. Health, vol. 10, Jan. 2024, Art. no. 20552076231224605.
[5] I. Cabalar, T. H. Le, A. Silber, M. O'Hara, B. Abdallah, M. Parikh, and R. Busch, "The role of blood testing in prevention, diagnosis, and management of chronic diseases: A review," Amer. J. Med. Sci., vol. 368, no. 4, pp. 274–286, Oct. 2024.
[6] S. Pullakhandam and S. McRoy, "Classification and explanation of iron deficiency anemia from complete blood count data using machine learning," BioMedInformatics, vol. 4, no. 1, pp. 661–672, Mar. 2024.
[7] P. Appiahene, J. W. Asare, E. T. Donkoh, G. Dimauro, and R. Maglietta, "Detection of iron deficiency anemia by medical images: A comparative study of machine learning algorithms," BioData Mining, vol. 16, no. 1, p. 2, Jan. 2023.
[8] C. Yesiloglu, C. Emiroglu, and C. Aypak, "The relationship between glycated hemoglobin (HbA1c), hematocrit, mean platelet volume, total white blood cell counts, visceral adiposity index, and systematic coronary risk evaluation 2 (SCORE2) in patients without diabetes," Int. J. Diabetes Developing Countries, vol. 7, pp. 1–7, Mar. 2024.
[9] S. Suner, J. Rayner, I. U. Ozturan, G. Hogan, C. P. Meehan, A. B. Chambers, J. Baird, and G. D. Jay, "Prediction of anemia and estimation of hemoglobin concentration using a smartphone camera," PLoS ONE, vol. 16, no. 7, Jul. 2021, Art. no. e0253495.
[10] C. Ashok, S. Mahto, S. Kumari, A. Kumar, Deepankar, Vidyapati, M. Prasad, M. Mahajan, and P. K. Chaudhuri, "Impact of plateletpheresis on the hemoglobin, hematocrit, and total red blood cell count: An updated meta-analysis," Cureus, vol. 16, no. 6, Jun. 2024, Art. no. e61510.
[11] S. Dogan and I. Turkoglu, "Iron-deficiency anemia detection from hematology parameters by using decision trees," Int. J. Sci. Technol., vol. 3, no. 1, pp. 85–92, 2008.
[12] M. Abdullah and S. Al-Asmari, "Anemia types prediction based on data mining classification algorithms," Int. J. Inf. Manag. Sci., vol. 45, pp. 85–92, Apr. 2016.
[13] J. R. Khan, S. Chowdhury, H. Islam, and E. Raheem, "Machine learning algorithms to predict the childhood anemia in Bangladesh," J. Data Sci., vol. 17, no. 1, pp. 195–218, Feb. 2021, doi: 10.6339/jds.201901_17(1).0009.
[14] A. Dixit, R. Jha, R. Mishra, and S. Vhatkar, "Prediction of anemia disease using machine learning algorithms," in Proc. Intell. Comput. Netw., in Lecture Notes in Electrical Engineering, 2023, pp. 229–238, doi: 10.1007/978-981-99-0071-8_18.
[15] M. Alemayehu, M. Meskele, B. Alemayehu, and B. Yakob, "Prevalence and correlates of anemia among children aged 6–23 months in Wolaita zone, southern Ethiopia," PLoS ONE, vol. 14, no. 3, Mar. 2019, Art. no. e0206268, doi: 10.1371/journal.pone.0206268.
[16] N. Sundaram, M. Bennett, and J. Wilhelm, "Early detection of chronic anemia using machine learning models," Amer. J. Hematol., vol. 86, no. 7, pp. 559–566, 2011.
[17] H. A. Rayes, S. Vallabhajosyula, G. W. Barsness, N. S. Anavekar, R. S. Go, M. S. Patnaik, K. B. Kashani, and J. C. Jentzer, "Association between anemia and hematological indices with mortality among cardiac intensive care unit patients," Clin. Res. Cardiol., vol. 109, no. 5, pp. 616–627, May 2020, doi: 10.1007/s00392-019-01549-0.
[18] R. Provenzano, E. V. Lerma, and L. Szczech, "Anemia prediction in renal patients using hematological features and machine learning models," J. Clin. Haematol., vol. 112, pp. 234–242, Mar. 2019.
[19] M. Hasan, Mst. S. Tahosin, A. Farjana, M. A. Sheakh, and M. M. Hasan, "A harmful disorder: Predictive and comparative analysis for fetal anemia disease by using different machine learning approaches," in Proc. 11th Int. Symp. Digit. Forensics Secur. (ISDFS), May 2023, pp. 1–6, doi: 10.1109/ISDFS58141.2023.10131838.
[20] M. Jaiswal, A. Srivastava, and T. J. Siddiqui, "Machine learning algorithms for anemia disease prediction," in Proc. Recent Trends Commun., Comput., Electron., A. Khare, U. Tiwary, I. K. Sethi, and N. Singh, Eds., Singapore: Springer, 2019, pp. 55–63, doi: 10.1007/978-981-13-2685-1_44.
[21] P. Dhakal, S. Khanal, and R. Bista, "Prediction of anemia using machine learning algorithms," Int. J. Comput. Sci. Inf. Technol., vol. 15, no. 1, pp. 15–30, Feb. 2023, doi: 10.5121/ijcsit.2023.15102.
[22] P. P. Liang, A. Zadeh, and L.-P. Morency, "Foundations & trends in multimodal machine learning: Principles, challenges, and open questions," ACM Comput. Surv., vol. 56, no. 10, pp. 1–42, Oct. 2024.
[23] F. A. Khan and A. A. Ibrahim, "Machine learning-based enhanced deep packet inspection for IP packet priority classification with differentiated services code point for advance network management," J. Telecommun., Electron. Comput. Eng., vol. 16, no. 2, pp. 5–12, Jun. 2024.
[24] H. Nozari, J. Ghahremani-Nahr, and A. Szmelter-Jarosz, "AI and machine learning for real-world problems," Adv. Comput., vol. 134, pp. 1–12, Jan. 2024.
[25] D. B. Catacutan, J. Alexander, A. Arnold, and J. M. Stokes, "Machine learning in preclinical drug discovery," Nature Chem. Biol., vol. 19, no. 8, pp. 1–4, Aug. 2024.
[26] C. De Lucia, P. Pazienza, and M. Bartlett, "Does good ESG lead to better financial performances by firms? Machine learning and logistic regression models of public enterprises in Europe," Sustainability, vol. 12, no. 13, p. 5317, Jul. 2020.
[27] F. H. Awad, M. M. Hamad, and L. Alzubaidi, "Robust classification and detection of big medical data using advanced parallel K-means clustering, YOLOv4, and logistic regression," Life, vol. 13, no. 3, p. 691, Mar. 2023.
[28] Z. Rahmatinejad, T. Dehghani, B. Hoseini, F. Rahmatinejad, A. Lotfata, H. Reihani, and S. Eslami, "A comparative study of explainable ensemble learning and logistic regression for predicting in-hospital mortality in the emergency department," Sci. Rep., vol. 14, no. 1, p. 3406, Feb. 2024.
[29] A. Sekulić, M. Kilibarda, G. B. M. Heuvelink, M. Nikolić, and B. Bajat, "Random forest spatial interpolation," Remote Sens., vol. 12, no. 10, p. 1687, May 2020.
[30] S. M. Simon, P. Glaum, and F. S. Valdovinos, "Interpreting random forest analysis of ecological models to move from prediction to explanation," Sci. Rep., vol. 13, no. 1, p. 3881, Mar. 2023.
[31] J. Fisher, S. Allen, G. Yetman, and L. Pistolesi, "Assessing the influence of landscape conservation and protected areas on social wellbeing using random forest machine learning," Sci. Rep., vol. 14, no. 1, p. 11357, May 2024.
[32] J. Zhang, R. Wang, Y. Lu, and J. Huang, "Prediction of compressive strength of geopolymer concrete landscape design: Application of the novel hybrid RF–GWO–XGBoost algorithm," Buildings, vol. 14, no. 3, p. 591, Feb. 2024.
[33] S. Hakkal and A. A. Lahcen, "XGBoost to enhance learner performance prediction," Comput. Educ., Artif. Intell., vol. 12, Dec. 2024, Art. no. 100254.
[34] K. A. Awan, I. U. Din, A. Almogren, B.-S. Kim, and M. Guizani, "Enhancing IoT security with trust management using ensemble XGBoost and AdaBoost techniques," IEEE Access, vol. 12, pp. 116609–116621, 2024.
[35] T. M. Hossain, M. Hermana, and J. O. Olutoki, "Porosity prediction and uncertainty estimation in tight sandstone reservoir using non-deterministic XGBoost," IEEE Access, vol. 12, pp. 139358–139367, 2024.
[36] L. Yuan, D. Lian, X. Kang, Y. Chen, and K. Zhai, "Rolling bearing fault diagnosis based on convolutional neural network and support vector machine," IEEE Access, vol. 8, pp. 137395–137406, 2020.
[37] W. Tuerxun, X. Chang, G. Hongyu, J. Zhijie, and Z. Huajian, "Fault diagnosis of wind turbines based on a support vector machine optimized by the sparrow search algorithm," IEEE Access, vol. 9, pp. 69307–69315, 2021.
[38] J. Qiu, J. Xie, D. Zhang, and R. Zhang, "A robust twin support vector machine based on fuzzy systems," Int. J. Intell. Comput. Cybern., vol. 17, no. 1, pp. 101–125, Feb. 2024.
[39] A. Abubakar, H. Chiroma, A. Zeki, and M. Uddin, "Utilising key climate element variability for the prediction of future climate change using a support vector machine model," Int. J. Global Warming, vol. 9, no. 2, p. 129, 2016.
[40] O. I. Abiodun, A. Jantan, A. E. Omolara, K. V. Dada, N. A. Mohamed, and H. Arshad, "State-of-the-art in artificial neural network applications: A survey," Heliyon, vol. 4, no. 11, Nov. 2018, Art. no. e00938.
[41] A. Kaveh, Applications of Artificial Neural Networks and Machine Learning in Civil Engineering (Studies in Computational Intelligence), vol. 1168. Cham, Switzerland: Springer, 2024.
[42] M. Kurucan, M. Özbaltan, Z. Yetgin, and A. Alkaya, "Applications of artificial neural network based battery management systems: A literature review," Renew. Sustain. Energy Rev., vol. 192, Mar. 2024, Art. no. 114262.
[43] A. B. Nassif, I. Shahin, I. Attili, M. Azzeh, and K. Shaalan, "Speech recognition using deep neural networks: A systematic review," IEEE Access, vol. 7, pp. 19143–19165, 2019.
[44] J. Na, Z. Wang, S. Lv, and Z. Xu, "An extended k nearest neighbors-based classifier for epilepsy diagnosis," IEEE Access, vol. 9, pp. 73910–73923, 2021.
[45] K. Alnowaiser, "Improving healthcare prediction of diabetic patients using KNN imputed features and tri-ensemble model," IEEE Access, vol. 12, pp. 16783–16793, 2024.
[46] R. K. Halder, M. N. Uddin, M. A. Uddin, S. Aryal, and A. Khraisat, "Enhancing K-nearest neighbor algorithm: A comprehensive review and performance analysis of modifications," J. Big Data, vol. 11, no. 1, p. 113, Aug. 2024.
[47] D. K. Mandarapu, V. Nagarajan, A. Pelenitsyn, and M. Kulkarni, "Arkade: K-nearest neighbor search with non-Euclidean distances using GPU ray tracing," in Proc. 38th ACM Int. Conf. Supercomput., May 2024, pp. 14–25.
[48] Y. Y. Song and L. U. Ying, "Decision tree methods: Applications for classification and prediction," Shanghai Arch. Psychiatry, vol. 27, no. 2, p. 130, Apr. 2015.
[49] Y. Ma, H. Zhang, Y. Cai, and H. Yang, "Decision tree for locally private estimation with public data," in Proc. Adv. Neural Inf. Process. Syst., vol. 36, Feb. 2024, pp. 1–13.
[50] V. G. Costa and C. E. Pedreira, "Recent advances in decision trees: An updated survey," Artif. Intell. Rev., vol. 56, no. 5, pp. 4765–4800, May 2023.
[51] R. Blanquero, E. Carrizosa, P. Ramírez-Cobo, and M. R. Sillero-Denamiel, "Variable selection for Naïve Bayes classification," Comput. Oper. Res., vol. 135, Nov. 2021, Art. no. 105456.
[52] B. Ravinder, S. K. Seeni, V. S. Prabhu, P. Asha, S. P. Maniraj, and C. Srinivasan, "Web data mining with organized contents using naive Bayes algorithm," in Proc. 2nd Int. Conf. Comput., Commun. Control (IC4), Feb. 2024, pp. 1–6.
[53] X. Y. Zhang, W. F. Li, J. Y. Fang, and Z. M. Niu, "Nuclear mass predictions with the naive Bayesian model averaging method," Nucl. Phys. A, vol. 1043, Mar. 2024, Art. no. 122820.
[54] A. Shahraki, M. Abbasi, and Ø. Haugen, "Boosting algorithms for network intrusion detection: A comparative evaluation of real AdaBoost, gentle AdaBoost and modest AdaBoost," Eng. Appl. Artif. Intell., vol. 94, Sep. 2020, Art. no. 103770.
[55] X. Huang, Z. Li, Y. Jin, and W. Zhang, "Fair-AdaBoost: Extending AdaBoost method to achieve fair classification," Expert Syst. Appl., vol. 202, Sep. 2022, Art. no. 117240.
[56] W. Wang and D. Sun, ‘‘The improved AdaBoost algorithms for imbalanced data classification,’’ Inf. Sci., vol. 563, pp. 358–374, Jul. 2021.
[57] B. Liu, X. Li, Y. Xiao, P. Sun, S. Zhao, T. Peng, Z. Zheng, and Y. Huang, ‘‘AdaBoost-based SVDD for anomaly detection with dictionary learning,’’ Expert Syst. Appl., vol. 238, Mar. 2024, Art. no. 121770.
[58] R. Vohra, J. Pahareeya, and A. Hussain, ‘‘Complete blood count anemia diagnosis,’’ Mendeley Data, Liverpool John Moores Univ., Liverpool, U.K., Tech. Rep., 2021, doi: 10.17632/dy9mfjchm7.1.
[59] S. S. Abdul-Jabbar, A. K. Farhan, and A. S. Luchinin, ‘‘Comparative study of anemia classification algorithms for international and newly CBC datasets,’’ Int. J. Online Biomed. Eng., vol. 19, no. 6, pp. 141–157, May 2023.
[60] S. S. Abdul-Jabbar and D. A. Farhan, ‘‘Hematological dataset,’’ Mendeley Data, Medical City, Tech. Rep., 2022, doi: 10.17632/g7kf8x38ym.1. [Online]. Available: https://www.kaggle.com/code/eduarp/sickle-cell-anemia/notebook
[61] H. Lee-Six, N. F. Øbro, M. S. Shepherd, S. Grossmann, K. Dawson, M. Belmonte, R. J. Osborne, B. J. P. Huntly, I. Martincorena, E. Anderson, L. O’Neill, M. R. Stratton, E. Laurenti, A. R. Green, D. G. Kent, and P. J. Campbell, ‘‘Population dynamics of normal human blood inferred from somatic mutations,’’ Nature, vol. 561, no. 7724, pp. 473–478, Sep. 2018.
[62] D. Mansour, A. Hofmann, and K. Gemzell-Danielsson, ‘‘A review of clinical guidelines on the management of iron deficiency and iron-deficiency anemia in women with heavy menstrual bleeding,’’ Adv. Therapy, vol. 38, no. 1, pp. 201–225, Jan. 2021.
[63] F. Aslinia, J. J. Mazza, and S. H. Yale, ‘‘Megaloblastic anemia and other causes of macrocytosis,’’ Clin. Med. Res., vol. 4, no. 3, pp. 236–241, Sep. 2006.
[64] D. O. Okonko, A. K. Mandal, C. G. Missouris, and P. A. Poole-Wilson, ‘‘Disordered iron homeostasis in chronic heart failure: Prevalence, predictors, and relation to anemia, exercise capacity, and survival,’’ J. Amer. College Cardiol., vol. 58, no. 12, pp. 1241–1251, Sep. 2011.
[65] M. T. Maeder, O. Khammy, C. D. Remedios, and D. M. Kaye, ‘‘Myocardial and systemic iron depletion in heart failure: Implications for anemia accompanying heart failure,’’ J. Amer. College Cardiol., vol. 58, no. 5, pp. 474–480, Jul. 2011.
[66] V. Hoffbrand, G. Collins, and J. Loke, Hoffbrand’s Essential Haematology. Hoboken, NJ, USA: Wiley, Jul. 2024.
[67] M. Song, B. I. Graubard, E. Loftfield, C. S. Rabkin, and E. A. Engels, ‘‘White blood cell count, neutrophil-to-lymphocyte ratio, and incident cancer in the U.K. Biobank,’’ Cancer Epidemiol., Biomarkers Prevention, vol. 33, no. 6, pp. 821–829, Jun. 2024.

TALAL QADAH is currently an Associate Professor with the Department of Medical Laboratory Sciences, Faculty of Applied Medical Sciences, King Abdulaziz University, Jeddah, Saudi Arabia, where he also holds the position of an Adjunct Professor with the Hematology Research Unit, King Fahad Medical Research Center. The courses he teaches include hematology, advanced hematology, and case studies in hematology.

ASMAA MUNSHI received the B.Sc. degree in computer science from King Abdulaziz University, Saudi Arabia, in 2004, and the master’s degree (Hons.) in internet security and forensics and the Ph.D. degree in information security from Curtin University, Australia, in 2009 and 2014, respectively. She is currently an Associate Professor with the Cybersecurity Department, College of Computer Science and Engineering, University of Jeddah, Saudi Arabia, where she holds several positions: she is the Vice Dean (Female Section) of the College of Computer Science and Engineering, a Supervisor of the Cybersecurity Department (Female Section) of the same college, and the Vice Dean (Female Section) of the Faculty of Computing and Information Technology, Khulais Branch, University of Jeddah. Her research interests include computer forensics, information security, and the IoT.