Batch 09

Presented By
HARINI S 312420104051

• Anemia results from decreased red blood cell count or structural damage, often due to factors like increased
destruction, blood loss, defective cells, or reduced production.

• Early detection and treatment are crucial to prevent irreversible organ damage.

• Various underlying conditions increase the risk of anemia, including diabetes, kidney disease, cancer,
HIV/AIDS, inflammatory bowel disease, and cardiovascular disease.

• Anemia can stem from diverse causes such as iron deficiency, sickle cell disease, thalassemia, aplastic
anemia, and vitamin deficiency.

• Each type of anemia varies in severity and duration, with causes ranging from mild to severe and transient to

• Anaemia affects 33% of the global population, with 42% of children under six and 40% of pregnant
women affected, primarily due to iron deficiency.

• The high prevalence of anaemia worldwide, primarily caused by iron deficiency, underscores the urgent
need for effective detection and intervention strategies.

• Early diagnosis of anaemia is crucial for preventing adverse health outcomes and improving the quality
of life for affected individuals.

• To solve this problem, detecting the clinical signs of anaemia is needed. This can be done through two
main steps: training and testing.

• The performance of machine learning techniques depends greatly on the quality of the training data.
• Traditional diagnostic methods may be limited in their accuracy and efficiency.

• Therefore, leveraging advanced technologies like artificial intelligence (AI) and machine learning (ML)
holds immense potential for enhancing anemia detection and management.

• The primary objective of this study is to develop and evaluate machine learning-based algorithms for the
detection of anemia using clinical hematological values.

• Specifically, our study will focus on comparing the performance of different ML algorithms, including
Decision Trees, Random Forest, Support Vector Machines, Naïve Bayes, and k-Nearest Neighbors (k-NN).

• Through a comprehensive comparative analysis, we seek to identify the most effective algorithm for
anemia detection based on clinical data.

• Ultimately, our goal is to contribute to the development of robust and scalable solutions for combating
anemia and improving public health outcomes globally.

• Non-invasive techniques, such as machine learning algorithms, are among the methods now used to diagnose or
detect clinical diseases, and anaemia detection is no exception.

• In this study, machine learning algorithms were used to detect iron-deficiency anemia with the application of

• Decision Tree
• Random Forest
• Logistic Regression
• Support Vector Machine
• Naïve Bayes

• Given these, numerous studies have been conducted by researchers to come out with robust and non-invasive
approaches to detect or diagnose anaemia such as the use of medical images and machine learning algorithms
which detect anaemia and make predictions for the future.
S.No: 1
Year: 2021
Author: J. Ali, A. Ahmad, Loay E. George, Chen Soong, S. Aziz
Journal: International Research Journal of Engineering Science, Technology and Innovation
Title: Machine Learning for Anemia Detection: A Review
Description: This review provides an overview of recent studies on anemia detection using machine learning techniques. It discusses various algorithms, data sources, and evaluation methods used in these studies.
Advantages: Summarizes the state-of-the-art in anemia detection with machine learning. Helps researchers identify gaps in the existing literature. Provides insights into data sources and preprocessing techniques.
Disadvantages: May not provide in-depth technical details of individual studies. The review's quality depends on the comprehensiveness of the studies included.

S.No: 2
Year: 2020
Author: Akinori Mitani, Yun Liu, Abigail E. Huang, Gregory S. Corrado, Lily Peng, ...
Journal: International Research Journal of Microbiology
Title: Anaemia detection using Deep learning Techniques
Description: This paper explores the application of deep learning techniques, particularly convolutional neural networks (CNNs) and recurrent neural networks (RNNs), for anemia detection from retinal images. Deep learning methods can automatically learn features from images, reducing the need for handcrafted features.
Advantages: Achieves high accuracy in anemia detection from retinal images.
Disadvantages: The model's interpretability may be limited. Requires a large amount of labeled data for training deep learning models.
S.No: 3
Year: 2017
Author: Prakriti Dhakal
Journal: International Journal of Computer Science & Technology (IJCSIT)
Title: Automated diagnosis of anemia using machine learning algorithms
Description: This study proposes a machine learning-based approach for automated diagnosis of anemia using clinical signs and symptoms. Various machine learning algorithms are compared and evaluated on their performance in classifying anemia based on a dataset of patient data.
Advantages: Provides a systematic approach to automated diagnosis, potentially reducing the need for manual evaluation. Offers insights into the effectiveness of different machine learning algorithms in diagnosing anemia.
Disadvantages: Reliance on available patient data for training may limit generalizability. Challenges may arise in interpreting the decision-making process of complex machine learning models.

S.No: 4
Year: 2019
Author: G. Dimauro, A. Guarini, D. Caivano, Girardi, C. Pasciolla, A. Iacobazzi
Journal: IEEE Access
Title: Machine learning-based diagnostic aid for the detection of anemia in video-palpebral conjunctiva images
Description: This research focuses on developing a diagnostic aid for anemia detection using machine learning techniques applied to images of the palpebral conjunctiva. Features such as color and texture are extracted from images, and various classifiers are trained to differentiate between anemic and non-anemic individuals.
Advantages: Non-invasive and scalable approach leveraging image analysis techniques. Can potentially assist healthcare professionals in quick and accurate diagnosis.
Disadvantages: Dependence on image quality and variability in lighting conditions may affect performance. Integration into clinical workflows and regulatory approval may pose challenges.
S.No: 5
Year: 2020
Author: Anusorn Charleonnan, Thipwan Fufaung, Tippawan Niyomwong, Wandee Chokchueypattanakit, Sathir ...
Journal: 2016 Management and Innovation Technology International Conference (MITicon)
Title: Predicting Anemia Diagnosis in Patients with Chronic Kidney Disease Using Machine Learning Models
Description: This study explores the use of machine learning models to predict anemia diagnosis in patients with chronic kidney disease (CKD) based on clinical and laboratory parameters. Different algorithms are compared for their predictive performance, and features important for prediction are identified.
Advantages: Offers potential for early detection and intervention in patients at risk of anemia complications. Provides insights into factors contributing to anemia in CKD patients.
Disadvantages: Limited to a specific patient population (CKD patients), which may affect generalizability. Data availability and quality may vary across healthcare settings, impacting model performance.

S.No: 6
Year: 2021
Author: Nandha Juniaroesita Peksi, Bambang, Mangaras Yanu Florestiyanto
Journal: Telematika and Research Technology
Title: Machine Learning Algorithms for Anemia Detection using Fingernail Color Images
Description: This research investigates the use of machine learning algorithms to detect anemia based on fingernail color images. Color features extracted from images are utilized to train classifiers for anemia detection, with performance evaluated on a dataset of individuals with and without anemia.
Advantages: Offers a non-invasive and potentially low-cost approach to anemia screening. Fingernail color may serve as an easily accessible biomarker for anemia.
Disadvantages: Limited to detecting anemia based on specific visual cues, so it may not capture all cases. Challenges may arise in standardizing image acquisition and processing across different settings.
S.No: 7
Year: 2018
Author: Yih-Chung Tham, Ching Yu Cheng, Tien Yin ...
Journal: Nature Biomedical Engineering
Title: Anemia detection in retinal images using deep learning
Description: This study investigates the use of deep learning techniques for detecting signs of anemia in retinal images. Convolutional neural networks (CNNs) are trained on a dataset of retinal images, and their performance in identifying characteristic features associated with anemia is evaluated.
Advantages: Deep learning models can automatically learn relevant features from retinal images, potentially improving accuracy. Retinal imaging offers a non-invasive approach to anemia screening.
Disadvantages: Availability of annotated retinal image datasets may be limited. Interpretability of deep learning models in the medical domain can be challenging.

S.No: 8
Year: 2019
Author: C. Bellinger, A. Amid, Japkowicz, H. Viktor
Journal: International Conference on Machine Learning and ...
Title: Classification of anemia severity using machine learning techniques
Description: This research focuses on developing machine learning models to classify the severity of anemia based on clinical and laboratory parameters. Various classification algorithms are employed, and their performance in distinguishing between mild, moderate, and severe cases of anemia is evaluated.
Advantages: Provides a quantitative approach to assessing anemia severity, aiding in treatment decision-making. Offers potential for personalized management of anemia based on severity classification.
Disadvantages: Limited to classifying anemia severity within defined categories, so it may not capture nuances in patient presentation. Generalizability of severity classification models across different populations may vary.
S.No: 9
Year: 2020
Author: Maileth Rivero-Palacio, W. Alfonso-Morales, Eduardo Caicedo-Bravo
Journal: 2021 IEEE Colombian Conference on Applications of Computational Intelligence (ColCACI)
Title: Detecting Anemia in Rural Healthcare Settings Using Machine Learning and Smartphone Technology
Description: This study explores the feasibility of using machine learning algorithms and smartphone technology for detecting anemia in resource-limited rural healthcare settings. A mobile application is developed for capturing relevant clinical data, which is then used to train models for anemia detection.
Advantages: Addresses the challenge of limited access to healthcare resources in rural areas by leveraging smartphone technology. Offers a portable and cost-effective solution for anemia screening.
Disadvantages: Reliability of smartphone-based measurements may be affected by factors such as device quality and user variability. Integration into existing healthcare workflows and infrastructure may require additional considerations.

S.No: 10
Year: 2021
Author: Manohar Kuse, Tanuj Sharma, Sudhir Gupta
Journal: ICPR Contests
Title: Anemia detection in hematoxylin and eosin stained histopathology images using machine learning
Description: This research investigates the use of machine learning techniques for detecting anemia in histopathology images stained with hematoxylin and eosin (H&E). Image analysis methods are applied to extract relevant features indicative of anemia, and classifiers are trained for automated detection.
Advantages: Provides a novel approach to anemia detection using histopathology images routinely collected in clinical practice. Offers potential for integrating anemia screening into pathological assessments.
Disadvantages: Availability of annotated histopathology image datasets for anemia detection may be limited. Interpretation of histopathology images requires specialized expertise, which may impact scalability.

• Clinicians are exposed to a risk of infection because they are the ones who draw blood from patients.

• The Electrophoresis method cannot be performed if the current or voltage supply is not

• Test results vary from lab to lab because labs use different techniques, which leaves patients unsure
which result to believe.

• Sahli’s method of screening hemoglobin uses acid hematin as a suspension, not a true solution.

• This method cannot measure all forms of hemoglobin, and the chance of visual error is high.

• The proposed system aims to detect anaemia using machine learning algorithms by
analyzing attributes extracted from haematological data, including gender,
haemoglobin, mean corpuscular haemoglobin (MCH), mean corpuscular haemoglobin
concentration (MCHC), and mean corpuscular volume (MCV).

• Initially, the system conducts exploratory data analysis and applies statistical tests such
as t-test, odds ratio, and chi-square test to understand the dataset's characteristics and
associations. Feature selection techniques such as correlation analysis, SelectKBest,
and Extra Trees Classifier are employed to identify relevant attributes.

• Additionally, the system addresses class imbalance through methods like random
under-sampling, random oversampling, SMOTE, and ADASYN, while handling data
leakage by removing pertinent features. Multiple machine learning algorithms
including Decision Tree, Random Forest, Logistic Regression, K-Nearest Neighbors,
Support Vector Machine, and Gaussian Naive Bayes are trained and evaluated.

The following are the modules implemented for Detecting Clinical Signs of Anaemia Using Machine Learning:

• Exploratory Data Analysis

• Statistical tests: t-test, odds ratio, and chi-square test for association

• Feature Selection

• Class Imbalance and Data Leakage Handling

• Modelling Module (Decision Tree, Random Forest, Logistic Regression, K-NN, Support Vector Machine)

• Report Generation Module

• Conclusion and Recommendations

Exploratory Data Analysis :

• Initially, the data was examined using methods like df.shape and df.head() to understand its dimensions, column names, data types, and the
presence of any missing values.

• The dataset contains 1421 records with 6 columns, all of which are non-null, suggesting no missing values. Summary statistics such as mean, standard
deviation, minimum, maximum, and quartile values were computed using df.describe() to understand the central tendency, dispersion, and shape of the
dataset distribution.

• No missing values were found, indicating a clean dataset. Additionally, the data types of the columns were checked using df.dtypes, and columns were
renamed for visualization purposes. It was observed that the dataset is imbalanced, with non-anaemic cases comprising 56.37% and anaemic cases
comprising 43.63% of the dataset. The distribution of anaemia across genders was explored using count plots and bar plots. It was observed that females
had a higher anaemia rate (56%) than males (31%), with the female population being 4.2% larger than the male population.

• Various features such as Haemoglobin, MCH, MCHC, and MCV were visualized using histograms, violin plots, and distribution plots to understand their
distributions and relationships with anaemia. Statistical measures such as skewness and kurtosis were computed to understand the distribution shapes of
features like Haemoglobin. Furthermore, tables summarizing key metrics such as highest, lowest, and average values of Haemoglobin, MCH, MCHC, and
MCV were created for comprehensive analysis.
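The inspection steps described above can be sketched roughly as follows. The study's data is not reproduced here, so the frame below is synthetic; the column names, the 0/1 gender coding, and the toy labelling rule are assumptions made only so the snippet runs on its own.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the study's dataset (illustrative values only).
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "Gender": rng.integers(0, 2, size=n),        # assumed coding: 0 = male, 1 = female
    "Haemoglobin": rng.normal(13.0, 2.0, size=n),
    "MCH": rng.normal(23.0, 4.0, size=n),
    "MCHC": rng.normal(30.0, 1.5, size=n),
    "MCV": rng.normal(85.0, 9.0, size=n),
})
# Toy labelling rule, assumed for illustration: anaemic if haemoglobin < 12.5
df["Result"] = (df["Haemoglobin"] < 12.5).astype(int)

print(df.shape)            # (rows, columns)
print(df.head())           # first five records
print(df.dtypes)           # dtypes is an attribute, not a method
print(df.isnull().sum())   # missing values per column
print(df.describe())       # mean, std, min, quartiles, max

# Class balance in percent, anaemia rate per gender, and shape statistics
class_share = df["Result"].value_counts(normalize=True) * 100
rate_by_gender = df.groupby("Gender")["Result"].mean()
skewness = df["Haemoglobin"].skew()
kurtosis = df["Haemoglobin"].kurt()
```

On the real data, class_share would correspond to the 56.37% / 43.63% split and rate_by_gender to the per-gender anaemia rates quoted above.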

1. T-test for haemoglobin levels by gender :

The t-test indicates that there is no significant difference in the mean haemoglobin levels between males and females.
Although there is a slight difference favouring females, it is not large enough to reject the null hypothesis. The logarithmic
transformation of the haemoglobin data helps to address skewness and supports the validity of the t-test assumptions.
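A two-sample t-test on log-transformed haemoglobin might look like the sketch below. The two groups are synthetic, and the choice of Welch's variant (equal_var=False) and the 0.05 threshold are assumptions, since the slides do not state which settings were used.

```python
import numpy as np
from scipy import stats

# Synthetic haemoglobin values for two groups (illustrative only).
rng = np.random.default_rng(1)
hb_male = rng.normal(13.5, 2.0, size=300).clip(min=5.0)
hb_female = rng.normal(13.4, 2.0, size=300).clip(min=5.0)

# Log transform with a small constant, mirroring the preprocessing described in the slides
log_male = np.log(hb_male + 0.01)
log_female = np.log(hb_female + 0.01)

# Welch's t-test (does not assume equal variances); this variant is an assumption
t_stat, p_value = stats.ttest_ind(log_male, log_female, equal_var=False)
significant = p_value < 0.05   # assumed significance level
print(t_stat, p_value, significant)
```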

2. Odds Ratio for Anaemia by Gender :

The odds ratio of 2.86 suggests that females have 2.86 times the odds of being anaemic compared to males. This indicates a
significant association between gender and the risk of anaemia, with females being at a higher risk compared to males.
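As a rough cross-check, the odds ratio can be recomputed from the rounded anaemia rates quoted in the exploratory analysis (56% for females, 31% for males). The helper name odds_ratio is hypothetical. Because the inputs are rounded, the result is only expected to land near the reported 2.86, not to match it exactly.

```python
def odds_ratio(p_group: float, p_reference: float) -> float:
    """Odds ratio from two event probabilities, each strictly between 0 and 1."""
    odds_group = p_group / (1.0 - p_group)
    odds_reference = p_reference / (1.0 - p_reference)
    return odds_group / odds_reference

# Rounded rates from the slides: 56% of females and 31% of males are anaemic.
or_gender = odds_ratio(0.56, 0.31)
print(round(or_gender, 2))
```

With these rounded inputs the value comes out at about 2.83, which is consistent with the reported 2.86 once rounding of the rates is taken into account.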

3. Chi-Square Test for Gender and Anaemia Status :

The chi-square test reveals a significant association between gender and anaemia status. With a chi-square statistic of 90.06 and a p-
value less than 0.001, there is strong evidence to reject the null hypothesis of independence. This implies that gender and
anaemia status are not independent, indicating a relationship between being female and having anaemia.
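A chi-square test of independence on a gender-by-anaemia contingency table can be sketched as below. The counts are hypothetical (100 people per gender, using the rounded 56% and 31% rates), so the statistic will not equal the 90.06 reported for the full dataset.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical 2x2 table: rows = (female, male), columns = (anaemic, non-anaemic)
observed = np.array([[56, 44],
                     [31, 69]])

# For 2x2 tables SciPy applies Yates' continuity correction by default
chi2, p_value, dof, expected = chi2_contingency(observed)
reject_independence = p_value < 0.05   # assumed significance level
print(chi2, p_value, dof)
```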

Correlation Analysis :
The correlation matrix revealed the strength and direction of association between features and the target variable. For instance,
a positive correlation indicates that an increase in one variable corresponds to an increase in the other, while a negative
correlation suggests the opposite. In this study, haemoglobin exhibited a strong negative correlation (Pearson correlation
coefficient of -0.8) with anaemia status, indicating that lower haemoglobin levels are associated with a higher likelihood of
anaemia.
SelectKBest :
To corroborate the findings from correlation analysis and further refine feature selection, the study employed the SelectKBest
method, a statistical technique for univariate feature selection. SelectKBest evaluates the significance of each feature
individually by applying a scoring function, in this case, the chi-squared test statistic. This test measures the dependence
between each feature and the target variable, assessing whether the occurrences of a specific feature and a specific class are
independent based on their frequency distribution.
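The two feature selection steps above (correlation with the target, then SelectKBest with the chi-squared score) could be sketched as follows. The data is synthetic, the labelling rule is a toy assumption, and k=2 is an arbitrary choice for the example.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, chi2

# Synthetic features (illustrative only); clipping keeps every value positive.
rng = np.random.default_rng(2)
n = 300
X = pd.DataFrame({
    "Gender": rng.integers(0, 2, size=n),
    "Haemoglobin": rng.normal(13.0, 2.0, size=n).clip(5.0, 20.0),
    "MCH": rng.normal(23.0, 4.0, size=n).clip(10.0, 40.0),
    "MCHC": rng.normal(30.0, 1.5, size=n).clip(20.0, 40.0),
    "MCV": rng.normal(85.0, 9.0, size=n).clip(50.0, 120.0),
})
y = (X["Haemoglobin"] < 12.5).astype(int)   # toy labelling rule (assumption)

# Pearson correlation of each feature with the target
corr_with_target = X.corrwith(y)

# chi2 requires non-negative feature values, which holds here by construction
selector = SelectKBest(score_func=chi2, k=2).fit(X, y)
scores = pd.Series(selector.scores_, index=X.columns).sort_values(ascending=False)
selected = list(X.columns[selector.get_support()])
print(corr_with_target)
print(scores)
print(selected)
```

Note that the chi-squared score in scikit-learn assumes non-negative inputs, so it should be applied before standardization, which produces negative values.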

Log Scaling: Logarithmic transformation was applied to the 'Haemoglobin' feature to address its left-skewed distribution and
mitigate the impact of extreme values. A small constant of 0.01 was added before taking the logarithm to avoid errors with
zero or negative values. The log transformation compresses the range of the data and down-weights extreme values.

Standardization: Standardization shifts each feature's values so that their mean becomes zero and scales them to have a standard
deviation of one. This was achieved using the StandardScaler from the sklearn.preprocessing module. Standardization ensures
that all features have a comparable scale and prevents any single feature from dominating the objective function. However,
since tree-based algorithms like decision trees, random forests, and boosted trees are invariant to feature scaling,
standardization may not significantly impact their performance.

Normalization: Normalization, also known as Min-Max scaling, rescales the feature values to a range between 0 and 1. This
was accomplished using the MinMaxScaler from the sklearn.preprocessing module. Normalization is particularly useful when
the distribution of feature values is not normal and helps in maintaining the relative differences in the range of values. Similar
to standardization, normalization may not be necessary for tree-based algorithms.
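The three scaling variants described above can be compared side by side in a short sketch. The haemoglobin values are synthetic; only the 0.01 offset, StandardScaler, and MinMaxScaler come from the slides.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Synthetic haemoglobin column, shaped (n_samples, 1) as scikit-learn scalers expect
rng = np.random.default_rng(3)
hb = rng.normal(13.0, 2.0, size=(500, 1)).clip(5.0, 20.0)

hb_log = np.log(hb + 0.01)                    # log scaling with the small constant
hb_std = StandardScaler().fit_transform(hb)   # zero mean, unit standard deviation
hb_norm = MinMaxScaler().fit_transform(hb)    # rescaled to the range [0, 1]

print(hb_std.mean(), hb_std.std())
print(hb_norm.min(), hb_norm.max())
```

In a full workflow the scalers would normally be fitted on the training split only and then applied to the test split, so that test statistics do not influence preprocessing.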

Feature Engineering and Visualization: After scaling the features, feature engineering techniques were applied to visualize
the distribution of feature values with respect to the target variable 'Result' using box plots. Four different versions of the
'Haemoglobin' feature were plotted against the 'Result' variable: original, log-scaled, standardized, and normalized. The box
plots allowed for a comparative analysis of how each scaling method affected the distribution of feature values across
different outcomes of the target variable.

Splitting Data into Training and Testing Samples: Finally, the preprocessed data was split into training and testing datasets
to facilitate model training and evaluation. The train_test_split function from the sklearn.model_selection module was utilized
to randomly partition the data into training (70%) and testing (30%) sets. The 'X_train' and 'X_test' datasets
contain the predictor variables, while the 'y_train' and 'y_test' datasets contain the corresponding target variable values.
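A minimal sketch of the 70/30 split follows, using synthetic arrays. The random_state and stratify arguments are assumptions added for reproducibility and to preserve the class ratio; the slides only state the split proportions.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic predictors and binary target (illustrative only)
rng = np.random.default_rng(4)
X = rng.normal(size=(1000, 5))
y = rng.integers(0, 2, size=1000)

# 70% training, 30% testing; random_state and stratify are assumed settings
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42, stratify=y
)
print(X_train.shape, X_test.shape)
```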

Random Oversampling: This technique involves duplicating examples from the minority class, balancing the class
distribution. However, it may lead to overfitting for some models due to the replication of minority class instances.

SMOTE (Synthetic Minority OverSampling Technique): SMOTE generates synthetic data points for the minority class,
addressing class imbalance without replicating existing instances.

ADASYN (Adaptive Synthetic Sampling Method for Imbalanced Data): Similar to SMOTE but focuses on generating more
samples for difficult-to-learn instances.

Implementation and Evaluation: Each sampling technique is implemented using
appropriate libraries such as imblearn.over_sampling and imblearn.under_sampling. The impact of each technique on model
performance is evaluated using metrics like accuracy, precision, recall, F1 score, and AUC-ROC curve.
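To make the difference between duplication and synthesis concrete, the sketch below implements random oversampling and a simplified SMOTE-style interpolation with NumPy and scikit-learn. It is not the imblearn implementation used in the study: the data, the neighbour count k=5, and all variable names are assumptions, and ADASYN's adaptive weighting is not shown.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Synthetic imbalanced data: 90 majority points, 30 minority points
rng = np.random.default_rng(5)
X_major = rng.normal(0.0, 1.0, size=(90, 2))
X_minor = rng.normal(2.0, 1.0, size=(30, 2))
n_needed = len(X_major) - len(X_minor)   # samples required to balance the classes

# Random oversampling: redraw existing minority rows with replacement
dup_idx = rng.integers(0, len(X_minor), size=n_needed)
X_minor_ros = np.vstack([X_minor, X_minor[dup_idx]])

# SMOTE-style synthesis: interpolate between a minority point and one of its
# k nearest minority neighbours (column 0 of `neigh` is the point itself)
k = 5
nn = NearestNeighbors(n_neighbors=k + 1).fit(X_minor)
_, neigh = nn.kneighbors(X_minor)
base = rng.integers(0, len(X_minor), size=n_needed)
partner = neigh[base, rng.integers(1, k + 1, size=n_needed)]
gap = rng.random((n_needed, 1))          # interpolation factor in [0, 1)
synthetic = X_minor[base] + gap * (X_minor[partner] - X_minor[base])
X_minor_smote = np.vstack([X_minor, synthetic])

print(X_minor_ros.shape, X_minor_smote.shape)
```

Resampling is usually applied to the training split only, so that the test set keeps the original class distribution.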

Data Leakage: Data leakage occurs when information from outside the training dataset influences model training, leading to
inflated performance estimates. It can occur when test data is inadvertently included in the training process or when
external data is used to create the model.
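One common safeguard against the test-data form of leakage described above is to keep preprocessing inside a pipeline, so that each cross-validation fold fits the scaler on its own training portion only. This is a general illustration on synthetic data, not the feature-removal approach the slides describe for this study.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data where the first feature carries the signal (illustrative only)
rng = np.random.default_rng(6)
X = rng.normal(size=(300, 4))
y = (X[:, 0] + 0.5 * rng.normal(size=300) > 0).astype(int)

# The scaler is part of the pipeline, so it never sees a fold's held-out rows
model = make_pipeline(StandardScaler(), LogisticRegression())
scores = cross_val_score(model, X, y, cv=5, scoring="f1")
print(scores.mean())
```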

Logistic Regression (LR)

LR imbalance: The model trained on the original imbalanced dataset achieved an accuracy of 94.1% and an F1 score of 0.931
on the test set.
LR Undersampling: Similar performance to LR imbalance.
LR Oversampling: Achieved an accuracy of 93.9% and an F1 score of 0.930 on the test set.
LR SMOTE: Achieved the same performance as LR Oversampling.
LR ADASYN: Achieved an accuracy of 94.4% and an F1 score of 0.936 on the test set.

Decision Tree (DT)

Performance Across Sampling Techniques: DT models achieved perfect performance (accuracy, precision, recall, F1
score, and AUC) across all sampling techniques. This suggests potential overfitting or unrealistic results.

Random Forest (RF)

Performance Across Sampling Techniques: RF models also achieved perfect performance across all sampling techniques,
indicating potential overfitting.

K-Nearest Neighbors (KNN)

Performance Across Sampling Techniques: KNN models achieved high performance across all sampling techniques,
with F1 scores ranging from 0.965 to 0.975.

Support Vector Machines (SVM)

Performance Across Sampling Techniques: SVM models achieved high performance across all sampling techniques, with F1
scores ranging from 0.967 to 0.978.

Gaussian Naive Bayes (NB)

Performance Across Sampling Techniques: NB models achieved good performance across all sampling techniques, with
F1 scores ranging from 0.939 to 0.957.
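The comparison across classifiers can be sketched as a simple loop that fits each model and records accuracy and F1 on a held-out test set. The data is synthetic and the models use mostly default settings, so the numbers it prints are unrelated to the results reported above.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data (illustrative only)
rng = np.random.default_rng(7)
X = rng.normal(size=(600, 4))
y = (X[:, 0] + 0.3 * rng.normal(size=600) > 0).astype(int)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=0, stratify=y
)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "K-Nearest Neighbors": KNeighborsClassifier(),
    "Support Vector Machine": SVC(),
    "Gaussian Naive Bayes": GaussianNB(),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    results[name] = {
        "accuracy": accuracy_score(y_test, pred),
        "f1": f1_score(y_test, pred),
    }

for name, metrics in results.items():
    print(f"{name}: accuracy={metrics['accuracy']:.3f}, f1={metrics['f1']:.3f}")
```

Running the same loop once per resampled training set (imbalanced, under-sampled, over-sampled, SMOTE, ADASYN) would give a table comparable to the per-technique results listed above.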

Hyperparameter tuning is a critical aspect of machine learning model development aimed at optimizing the performance of the
model by selecting the best combination of hyperparameters. In this module, we utilized the GridSearchCV technique to
systematically explore a grid of hyperparameters and identify the set that yields the highest performance for each classifier.

The methodology employed is as follows:

GridSearchCV:
This method exhaustively searches through a specified parameter grid to find the combination of hyperparameters that
maximizes a specified scoring metric. We employed a 5-fold cross-validation strategy during the search process to ensure
robust evaluation of each parameter combination.

Classifiers Considered:
We evaluated several classifiers including Decision Tree, Random Forest, Support Vector Machine (SVM), Gaussian Naive
Bayes, Logistic Regression, and K-Nearest Neighbors (KNN).

Parameter Grids:
For each classifier, we defined a parameter grid containing potential hyperparameters to tune.
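A grid search with 5-fold cross-validation for a single classifier might look like the sketch below. The parameter grid is hypothetical, since the slides do not list the grids that were actually searched, and the data is synthetic.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with a simple threshold rule (illustrative only)
rng = np.random.default_rng(8)
X = rng.normal(size=(300, 4))
y = (X[:, 0] > 0).astype(int)

# Hypothetical grid; the grids used in the study are not given in the slides
param_grid = {
    "max_depth": [2, 4, None],
    "min_samples_split": [2, 10],
}

search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid,
    cv=5,                 # 5-fold cross-validation, as stated in the slides
    scoring="accuracy",   # assumed scoring metric
)
search.fit(X, y)
best_params = search.best_params_
best_score = search.best_score_
print(best_params, best_score)
```

The same pattern repeats for the other classifiers, each with its own grid; best_score_ is the mean cross-validated score of the best parameter combination.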

The bar chart illustrates the comparative performance of different machine learning models based on their
accuracy scores obtained through grid search. Random Forest and Decision Tree models exhibit the highest
accuracy, both achieving a perfect score of 100%. Following closely is the Support Vector Machine model with
an accuracy of 99.4%. K-Nearest Neighbors and Logistic Regression also demonstrate strong performance with
accuracy scores of 98.8% and 93.5%, respectively. Naive Bayes, while slightly lower, still shows respectable
accuracy at 91.4%.


