Machine Learning approaches for Predicting

Anemia Risk in Women from India

DISSERTATION

Submitted in partial fulfilment of the requirements

for the award of the degree of

Master of Technology

in

Computer Science

by

Akshay Kumar

22/10/MT/034

Under the supervision of

Supervisor: Prof. Manju Khari

SCHOOL OF COMPUTER & SYSTEMS SCIENCES

JAWAHARLAL NEHRU UNIVERSITY

NEW DELHI, INDIA

JUNE 2024
CERTIFICATE

This is to certify that the dissertation entitled “Machine Learning approaches for Predicting
Anemia Risk in Women from India”, being submitted by Akshay Kumar (Enrollment
Number 22/10/MT/034) in fulfilment of the requirement for the award of the degree of Master of
Technology in Computer Science to the School of Computer & Systems Sciences (SC&SS),
Jawaharlal Nehru University, New Delhi, is a record of the candidate's own work carried
out under the guidance and supervision of Prof. Manju Khari.

The matter presented in this thesis has not been submitted for the award of any other degree
elsewhere.

Prof. Zahid Raza
(Dean)
School of Computer and Systems Sciences
Jawaharlal Nehru University
New Delhi - 110067

Prof. Manju Khari
(Supervisor)
School of Computer and Systems Sciences
Jawaharlal Nehru University
New Delhi - 110067


DECLARATION

I hereby declare that the M. Tech Dissertation titled “Machine Learning approaches for
Predicting Anemia Risk in Women from India”, being submitted to the School of
Computer and Systems Sciences, Jawaharlal Nehru University, New Delhi, in partial
fulfilment of the requirements for the award of the degree of Master of Technology in
Computer Science, is an authentic record of work carried out by me under the guidance and
supervision of Prof. Manju Khari.

I also declare that the research work is original and has not been submitted by me, in part or
in full, to any other University or Institution for the award of any degree or diploma.

Akshay Kumar

(22/10/MT/034)

School of Computer and Systems Sciences

Jawaharlal Nehru University

New Delhi-110067, India


ACKNOWLEDGEMENT

I would like to express my sincere gratitude to my supervisor, Prof. Manju Khari, who has the
attitude and the substance of a genius, for her expertise and continuous support. Without
her guidance and persistent help, this thesis would not have been possible. It has been my
utmost privilege to work with her. My appreciation also extends to my colleagues for their
insightful comments and valuable feedback. Their timely suggestions, offered with kindness,
enthusiasm and dynamism, have enabled me to complete my thesis. I would also like to thank
my committee members for their time and cooperation in the completion of this thesis. My
sincerest appreciation and gratitude go to my parents and my brother for their unfailing love
and support, for encouraging me every single day to be a better person, and for giving me
wings to fly.

Akshay Kumar

ABSTRACT
Childhood anemia remains a major health concern in India despite existing
efforts. This study leverages machine learning to develop a model for predicting
anemia risk. Data from a large national survey will be used, encompassing
factors like demographics, diet, and health indicators. Algorithms like Random
Forest will identify key risk factors specific to different regions within India.
This will lead to a geographically-specific predictive model that considers both
national and regional trends. The anticipated outcome is a model that accurately
predicts anemia risk in Indian children. This has the potential to revolutionize
childhood anemia management by enabling early identification of at-risk
children and allowing for timely interventions. Additionally, targeted public
health initiatives and geographically specific education campaigns can be
developed based on the identified risk factors. Furthermore, exploring
environmental factors like access to clean water and sanitation can create a
more comprehensive approach for tackling childhood anemia in India. This
study has the potential to significantly improve public health outcomes for
children across the nation.

Table of Contents

CERTIFICATE
DECLARATION

ACKNOWLEDGEMENT

ABSTRACT

LIST OF FIGURES

LIST OF TABLES

1. INTRODUCTION

1.1 Problem Statement

1.2 Research Objectives

1.3 Research Motivation

1.4 Thesis Outline

1.5 Chapter Summary

2. LITERATURE REVIEW

2.1 Introduction

2.2 Literature Review Analysis

2.3 Chapter Summary

3. RESEARCH METHODOLOGY

3.1 Dataset Description

3.2 Machine Learning (ML) algorithms

3.3 Machine Learning Techniques

3.3.1 Support Vector Machine

3.3.2 K Nearest Neighbours

3.3.3 Decision Tree and Random Forest

3.3.4 Gaussian Naïve Bayes

3.4 Chapter Summary

4. PROPOSED WORK

4.1 Introduction

4.2 Chapter Summary

5. RESULT AND DISCUSSION

6. CONCLUSION AND FUTURE SCOPE

REFERENCES

List of Abbreviations

SVM : Support Vector Machines


KNN : K Nearest Neighbours

DT : Decision Tree

RF : Random Forest

DHP : Default Hyperparameter Optimization

HPT : Hyper Parameter Tuning

PCA : Principal Component Analysis

List of Figures

Figure 1. SVM Working Principle

Figure 2. KNN Mechanism

Figure 3. KNN with different k sizes

Figure 4. Random Forest & Decision Tree

Figure 5. Workflow of Proposed Work

Figure 6. Performance Analysis

List of Tables

Table 1. Literature Review

Table 2. Performance Analysis

Chapter 1

Introduction
Anaemia is a common medical condition in which the haemoglobin concentration or the number
of red blood cells in the body is substantially lower than the normal range.
Haemoglobin is composed of a simple protein, known as globin, and a non-protein,
iron-containing part known as haem. For cells to respire, haemoglobin
must first bind with oxygen in the lungs to form oxyhaemoglobin, which is then carried by
the bloodstream to tissues and organs. Haemoglobin also carries carbon dioxide, a waste
product of metabolism, from the tissues back to the lungs for
expiration. These functions play a pivotal role in regulating
the physiological processes of the human body. Haemoglobin also
helps maintain the pH of blood by acting as a buffer, preventing drastic changes in
blood acidity or alkalinity. Moreover, it helps regulate blood flow
by releasing nitric oxide, which dilates blood
vessels and increases blood flow to tissues when needed. Additionally, haemoglobin can bind
and transport other molecules, such as nitric oxide, carbon monoxide (which it binds with much
higher affinity than oxygen, and which can therefore be toxic), and certain drugs [1].

These functions are essential for preserving homeostasis and ensuring that the body's
numerous physiological systems operate as intended. If a person has abnormal red blood
cells or insufficient haemoglobin, the oxygen-carrying capacity of the
blood decreases. Fatigue, weakness, dizziness, and breathing difficulties are signs
of anemia that should be looked for early in the diagnostic process. Age, sex, altitude of
residence, smoking habits, and pregnancy all affect the
haemoglobin concentration needed to meet physiological needs. Anaemia
can have many causes, including nutrient deficiencies arising from poor diet or
impaired absorption of nutrients, infections (e.g.
tuberculosis, parasitic infections, malaria, HIV), inflammation, chronic
diseases, gynaecological and obstetric conditions, and inherited red blood cell disorders.
Iron deficiency is the most prominent nutritional cause of anemia; however, deficiencies in
folate, vitamins B12 and A, and other essential nutrients should also be taken into account [2].

Because it mostly affects children, pregnant and postpartum women, adolescent
girls, and menstruating women, anemia is a critical public health concern that should
be addressed by both the general public and the government. Most cases of anemia
occur in low- and lower-middle-income nations.
Those most at risk are people who live in remote areas, belong to lower-income
households, and have never attended a formal educational institution. It is estimated that
anemia currently affects 37% of pregnant women, 40% of children between the ages of
6 and 59 months, and 30% of women between the ages of 15 and 49 globally. In absolute
terms, anemia affects about half a billion women aged 15 to 49 and 269 million
children aged 6 to 59 months worldwide. In 2019, anemia affected 30 percent of non-pregnant
women and 37 percent of pregnant women between the ages of 15 and 49 [3].

Although numerous factors can lead to anaemia, iron
deficiency is the most prominent cause of nutritional anemia worldwide. Several
factors can give rise to iron deficiency anaemia (IDA), including
inadequate dietary intake of iron, high physiological
requirements during pregnancy and early childhood, rapid growth spurts (during
adolescence and puberty), and parasitic infections such as hookworm and schistosomiasis,
which can cause chronic iron deficiency. Family size, income, educational
attainment, vitamin A deficiency, urban or rural residence, gravidity, lack of iron-folic acid
supplementation, excessive menstrual bleeding, and a history of abortion are further
factors that can contribute to the gradual onset of anemia. Apart from IDA, the main
causes of anemia are infections such as HIV and malaria, chronic inflammation,
and protein-energy deficiency, according to epidemiological studies [4].

Anaemia has detrimental consequences for a person's health, for society, and for the economy.
The WHO classifies anaemia according to standardised haemoglobin criteria.
Anemia in pregnant women is defined by a haemoglobin level below 110 g/L. Specifically,
anemia is classified as severe when haemoglobin levels fall below 70 g/L, moderate when
levels range between 70 and 99 g/L, and mild when levels are between 100 and 109 g/L.
Even mild to moderate anemia can negatively impact emotional well-being, causing fatigue
and stress, which reduce productivity and overall work efficiency. Maternal mortality and
morbidity in developing nations are significantly influenced by the incidence of severe
anemia. Serious health effects of chronic anemia can include a higher likelihood of infections
and haemorrhage, while severe anaemia can lead to heart failure
and death. Governments and non-governmental organisations across the world have
therefore launched various initiatives to reduce the burden of anaemia.
These efforts include short-term
measures such as supplementation and long-term strategies such as food-based approaches, food
fortification, dietary diversification, and nutritional education. In India, the "National Iron Plus
Initiative" was launched in 2013 by the Ministry of Health and Family Welfare with the goal
of combating anemia, which can affect people of all age groups. Under this initiative,
each pregnant woman receives one iron and folic acid tablet daily for six months, starting
after the first trimester and covering the antenatal and postnatal periods. Additionally,
pregnant women are widely encouraged to be screened for anemia [5].
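
The WHO haemoglobin cut-offs for pregnant women quoted above can be written as a small
decision rule. The sketch below is only an illustration of those cut-offs and is not part of the
model developed in this thesis; the function name and the "non-anemic" label are assumptions
introduced here, and a reading of exactly 100 g/L is treated as mild so that the categories do
not overlap.

```python
def classify_anemia_in_pregnancy(hb_g_per_l: float) -> str:
    """Map a haemoglobin reading (g/L) to the severity categories quoted in the text.

    Hypothetical helper: cut-offs follow the values stated for pregnant women
    (severe < 70, moderate 70 to below 100, mild 100 to below 110).
    """
    if hb_g_per_l < 0:
        raise ValueError("haemoglobin concentration cannot be negative")
    if hb_g_per_l < 70:
        return "severe"
    if hb_g_per_l < 100:
        return "moderate"
    if hb_g_per_l < 110:
        return "mild"
    return "non-anemic"


if __name__ == "__main__":
    # Illustrative readings only, not data from the thesis.
    for reading in (65, 85, 104, 118):
        print(reading, classify_anemia_in_pregnancy(reading))
```

A rule of this kind could also be used to derive a target label from a measured haemoglobin
value when preparing survey data, provided the appropriate cut-offs for each population group
are used.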

1.1 Problem Statement

Anemia has serious effects on human health and is a global public health concern. It is one
of the most common conditions in the world, especially affecting women and young people.
Using cutting-edge technologies to address this problem could help lower its
prevalence. Anemia has long been a public health concern, affecting an estimated 2.5
billion people worldwide. Iron deficiency anemia (IDA) is diagnosed when a person's
hemoglobin (Hb) level is below normal for their sex, age, and physiological state.
Anemia during pregnancy can adversely affect the health of both the mother and
the fetus. It increases the risk of intrauterine growth restriction, low birth weight, and preterm
delivery, all of which are linked to higher risks of perinatal death. The WHO states that the
mother's safety throughout pregnancy depends on eliminating anemia. Because anemia is
highly prevalent, any negative effects on expectant mothers and their unborn babies have a
substantial impact on public health overall [6].

The occurrence of anaemia varies widely across nations and cultures due to differences in
socioeconomic conditions, lifestyles, healthcare-seeking behaviours, and obstetric and
gynaecological circumstances. Anemia symptoms can be subtle and difficult to recognize
clinically, especially if the condition is not severe. These symptoms include pallor of the
skin and conjunctiva, fatigue, exhaustion, and loss of appetite. Factors like
skin thickness and pigmentation further complicate clinical identification. In poorer nations,
the prevalence of anemia among pregnant women is significantly higher than in developed
countries due to economic, societal, and health factors. By 2025, the WHO aims to achieve a
50 percent reduction in anemia among women of reproductive age. An estimated 115,000 maternal
deaths occur globally each year as a result of iron deficiency anemia (WHO, 2023).
Anemia affects over 58 percent of pregnant women in India and is the primary cause of
20–40% of maternal deaths [7].

To address this pervasive issue, it is vital to leverage the power of advanced technological
solutions. These solutions can include developing more accurate and accessible diagnostic
tools, improving nutritional interventions, and enhancing public health campaigns to raise
awareness and promote preventive measures. By integrating these approaches, it is possible
to make significant strides in reducing the global burden of anemia, particularly among the
most vulnerable populations[8].
1.2 Research Objectives

Objective 1: Develop a computational model. The aim is to enhance the accuracy and
accessibility of anemia diagnostics by developing a computational model that predicts anemia
with high accuracy and interprets the underlying factors and blood characteristics.

Objective 2: Analyse the disproportionate prevalence of anemia among Indian women. The aim
is to investigate and address the high prevalence of anemia among Indian women, contributing
to a broader understanding of anemia and improving prevention, management, and
treatment strategies.

Objective 3: Enhance education and awareness about anemia. The aim is to integrate educational
initiatives within communities, particularly in low- and middle-income countries, to empower
individuals to recognize early symptoms of anemia and seek medical advice promptly,
thereby improving early detection and overall health outcomes.

1.3 Research Motivation

Anemia, particularly iron-deficiency anemia, can significantly affect cognitive function. Iron
is crucial for the development and function of the brain. A lack of sufficient iron leads to
decreased oxygen transport to the brain and impairs neurotransmitter synthesis and function.
This can result in reduced attention span and concentration; affected individuals often have
difficulty maintaining focus on tasks, which can adversely affect academic and professional
performance. Moreover, chronic anemia in children can lead to long-term deficits in
cognitive and motor development, affecting educational attainment and social interactions [9].

Anemia imposes additional stress on the cardiovascular system, as the heart works harder to
pump blood and deliver sufficient oxygen to tissues. In severe cases, chronic anemia can
cause or exacerbate heart failure due to the persistently
high workload on the heart. Additionally, reduced oxygen levels can lead to chest pain
(angina) during physical exertion, as the heart muscle receives inadequate oxygen. Anemic
hypoxia (reduced oxygen supply to tissues) can also affect the electrical activity of the
heart, leading to arrhythmias (abnormal heart rhythms) [10].

Anemia during childhood can have long-lasting impacts on growth and development.
Growth can be stunted because iron and other nutrients vital for growth are deficient in
anemic children, leading to shorter stature and delayed physical development. Anemia can
delay the onset of puberty, affecting hormonal balance and secondary sexual characteristics.
Moreover, chronic anemia can impair the immune
response, making children more susceptible to infections and illnesses, which further hinders
growth and development. It can also cause
irritability, fatigue, and apathy, contributing to behavioural issues that can disrupt social
interactions and learning environments [11].

Pregnancy significantly increases the body's demand for iron and other nutrients, making
anemia particularly dangerous for both the mother and the foetus. Anemia increases the risk
of preterm labour and premature birth, which can lead to complications in newborns such as
respiratory distress syndrome. Low-birth-weight babies of anemic mothers are more likely to have
developmental problems as well as higher rates of neonatal morbidity and mortality.
Moreover, anemia can contribute to the development of
preeclampsia, a condition characterised by high blood pressure and damage to organ systems,
which can be life-threatening for both mother and baby. The physical stress of anemia can
contribute to postpartum depression, affecting the mother’s mental health and her ability to
care for the newborn. Because oxygen transport is reduced, nutrient and oxygen delivery to
the foetus can be inadequate, causing growth restriction and developmental
delays [12].

The ability of blood to function normally is impaired whenever a person develops a
blood disorder of any kind. These disorders can lower the quantity of nutrients, proteins,
platelets, or cells in the blood, which can impair physiological
processes throughout the body. Empirical studies have consistently shown that, in anemia
patients, blood flow to vital organs such as the brain, heart, liver, and kidneys increases
while blood flow to less vital parts of the body decreases. To establish the diagnosis of
anemia, the hematocrit (the ratio of red blood cell volume to total volume in a blood sample)
or the blood's hemoglobin concentration is usually measured. A patient is considered anemic
if their hematocrit or hemoglobin level is more than two standard deviations below the mean
of the normal reference range. Typically, Hb assessments utilizing capillary/haemoglobin
electrophoresis, DNA analysis, or high-performance liquid chromatography are used to identify
beta thalassemia trait (BTT) and haemoglobin E (HbE).
Because DNA analysis requires specialized equipment and is costly and time-consuming, it
cannot be applied in standard lab settings [13].
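
The two-standard-deviation criterion mentioned above amounts to a simple comparison against a
cut-off derived from a reference population. The following sketch shows that arithmetic; the
function name is hypothetical, and the reference mean and standard deviation used in the
example are placeholder numbers chosen for illustration, not clinical reference values.

```python
def is_below_reference(value: float, ref_mean: float, ref_sd: float,
                       n_sd: float = 2.0) -> bool:
    """Return True if `value` lies more than `n_sd` standard deviations
    below the reference mean (hypothetical helper for illustration)."""
    if ref_sd <= 0:
        raise ValueError("ref_sd must be positive")
    cutoff = ref_mean - n_sd * ref_sd  # e.g. 130 - 2 * 10 = 110
    return value < cutoff


if __name__ == "__main__":
    # Placeholder reference distribution (assumed, not clinical): mean 130 g/L, SD 10 g/L.
    print(is_below_reference(105.0, ref_mean=130.0, ref_sd=10.0))
    print(is_below_reference(112.0, ref_mean=130.0, ref_sd=10.0))
```

In practice the reference mean and standard deviation depend on age, sex, and physiological
state, so they would have to be taken from an appropriate reference population.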

1.4 Thesis Outline

The thesis is divided into six chapters, each covering a part of the research work and the
techniques used in it.

Chapter 1: Introduction

The first chapter is divided into five sections, starting with an introduction to the background
of anemia. The second section identifies the problem statement, the third and
fourth sections present the objectives and motivation of the thesis, and the last one is the
thesis outline.

Chapter 2: Literature Review

Gives a brief overview of earlier work on anaemia
prediction with the help of machine learning algorithms.

Chapter 3: Research Methodology

Gives an outline of the research background and describes all the algorithms in detail.

Chapter 4: Proposed Work

This chapter presents the dataset and the research methodology followed in this
thesis. It discusses the ML techniques in detail, together with the exploratory data analysis
(EDA) process and how features are selected for model training.

Chapter 5: Results & Discussion

This chapter presents the results, discusses them in detail, and describes the
evaluation metrics used for comparison.

Chapter 6: Conclusion and Future Scope

This is the final chapter of the research work; it gives the conclusion and future scope of the
research.
1.5 Chapter Summary

This chapter introduced anemia, a common condition marked by low hemoglobin
or red blood cell levels. Hemoglobin, made of globin protein and iron-containing heme, binds
oxygen in the lungs and transports it to tissues, aiding cellular respiration. It also removes
metabolic waste, maintains blood pH, regulates blood flow by releasing nitric oxide, and can
bind other molecules like carbon monoxide. These functions of hemoglobin are
essential for various physiological processes, highlighting the importance of addressing
anemia for overall health.
Chapter 2

Literature Review

The study begins with a comprehensive literature review, elucidating the
multifaceted nature of anemia in the Indian population. Anemia in
children remains a significant public health concern in India, with consequences for
growth, development, and cognitive function. Machine learning (ML) algorithms
offer promising avenues for developing non-invasive and potentially cost-effective
methods for predicting anemia risk in this population. Studies have explored the
application of various ML algorithms, including Logistic Regression, Random Forest
[4], and Support Vector Machines (SVM), to identify key risk factors and improve
anemia detection in children from India. These studies highlight the potential of ML
in this domain, but also acknowledge the need for further research on factors specific
to the Indian context, such as socioeconomic status, dietary patterns, and regional
variations in disease prevalence.

Table 1 : Literature Review Analysis

Author Title Contribution Future Scope/ Limitation

Author: A. Jiran Meitei, Akanksha Saini, Bibhuti Bhusan Mohapatra & Kh. Jitenkumar Singh [14]
Title: “Predicting child anaemia in the North-Eastern states of India: a machine learning
approach”
Contribution: The study identifies factors that contribute to anaemia, including the mother's
anaemia status, the child's age, social status, and the mother's education. This approach has
the potential to help reduce the adverse effects of anaemia, such as psychomotor retardation
and mortality.
Future Scope/Limitation: There is a potential bias in the data and a lack of information on
some critical variables. Future research directions include using different machine-learning
techniques and datasets and exploring medical image processing data for anaemia prediction.

Author: Dimas Chaerul, Ekty Saputra, Khamron Sunat, Tri Ratnan Singh [15]
Title: “A New Artificial Intelligence Approach Using Extreme Learning Machine as the
Potentially Effective Model to Predict and Analyze the Diagnosis of Anemia”
Contribution: The paper proposed an automated prediction model using historical data and the
extreme learning machine (ELM) algorithm to distinguish between different types of anaemia,
aiming to expedite the diagnosis process for healthcare providers. By differentiating between
beta thalassemia trait (BTT), iron deficiency anaemia (IDA), haemoglobin E (HbE), and
combination anaemias, the research offers a more precise and efficient diagnostic approach.
The focus also extends to developing a model that can streamline the identification of various
types of anaemia, enhancing healthcare efficiency and accuracy in diagnosis.
Future Scope/Limitation: Future research could involve validating the proposed automated
prediction model in clinical settings to assess its accuracy and reliability in diagnosing
different types of anaemia. The study's data is from a specific location and may limit the
model's generalizability to other populations or healthcare settings, warranting further
validation and adaptation. The scalability and implementation of the automated prediction
model in real-world healthcare environments, especially in resource-constrained settings,
could pose challenges that need to be addressed in future studies.

Author: Lei Wang, Mengjie Li, Sarah-Eve Dill, Yiwei Hu, Scott Rozelle [16]
Title: “Dynamic Anemia Status from Infancy to Preschool Age: Evidence from Rural China”
Contribution: Several valuable insights were provided into the dynamic nature of anaemia
prevalence in rural Chinese children over time, shedding light on the changing patterns of
this nutritional deficiency during early childhood. It also highlights that 51% of children
were anaemic in infancy, 24% in toddlerhood, and 19% at preschool age, with 67% experiencing
anaemia at some point during the study.
Future Scope/Limitation: Longitudinal studies could explore the impact of dietary habits,
socioeconomic factors, and healthcare access on anaemia prevalence in this population.
Intervention studies could be designed to test the effectiveness of targeted strategies to
reduce anaemia rates among rural Chinese children, potentially informing public health
policies and programs.

Author: Mahadi Hasan; Mst. Sazia Tahosin; Afia Farjana; Md. Alif Sheakh; Md Maruf Hasan [17]
Title: “A Harmful Disorder: Predictive and Comparative Analysis, for fetal anaemia Disease by
using different Machine Learning approaches”
Contribution: The research paper focuses on utilising machine learning models to predict
anaemia, comparing the performance of five models: KNN, Logistic Regression, SVM, Gaussian
Naive Bayes, and Light Gradient Boosting Machines. In order to improve prediction accuracy,
these models are integrated using a voting classifier approach, highlighting the significance
of precise disease prediction in the medical profession for efficient prevention and treatment.
Future Scope/Limitation: Further researchers could explore implementing these machine
learning models in real-world healthcare settings to assess their practical utility and
effectiveness in predicting anaemia and guiding treatment decisions. The study's limitations
may include the need for further validation of the predictive models on more extensive and
diverse datasets to ensure their generalizability and reliability in different populations and
settings.

Author: Soumyadipta Acharya; Dhivya Swaminathan; Sreetama Das; Krity Kansara; Sushovan
Chakraborty; Dinesh Kumar R; Tony Francis; Kiran R Aatre [18]
Title: “Non-Invasive Estimation of haemoglobin Using a Multi-Model Stacking Regressor”
Contribution: The research paper contributes a new ML-based technique for non-invasive
estimation of total haemoglobin (Hb) using photoplethysmograms (PPGs) acquired from a custom
finger sensor. It demonstrates the feasibility of this method for maternal anemia detection,
showing a statistically significant correlation coefficient of 0.81 with low Root Mean Square
Error (RMSE).
Future Scope/Limitation: Further research could focus on expanding the study to a more
diverse population to validate the method's effectiveness across different demographics.
Enhancing the machine learning model with additional features or algorithms could improve the
accuracy and robustness of the Hb estimation.

Author: Hetal Bhavsar, Amit Ganatra [19]
Title: “Comparative Study of Training Algorithms for Supervised Machine Learning”
Contribution: A comprehensive comparison of six different classification algorithms, including
decision trees, Bayesian networks, neural networks, k-nearest neighbours, and support vector
machines.

Author: Pooja Tukaram Dalvi; Nagaraj Vernekar [20]
Title: “Anemia detection using ensemble learning techniques and statistical models”
Contribution: The study showcases the importance of automated disease diagnosis systems in
improving accuracy, efficiency, and cost-effectiveness in medical decision-making, emphasizing
the role of computers in aiding healthcare professionals. It harnesses the power of ensemble
learning methods in classifying Red Blood Cells (RBCs) for anemia detection, highlighting the
superiority of ensemble classifiers over individual ones. It throws light on the application
of machine learning techniques, such as Stacking, Bagging, Voting, Adaboost, and Bayesian
Boosting, in medical decision-making processes, particularly in the field of anemia detection.
Future Scope/Limitation: The study's limitation lies in the use of a specific set of
classifiers and ensemble methods. The research could benefit from expanding the dataset size
and diversity to ensure robustness and generalizability, and from the integration of more
advanced machine learning algorithms or deep learning techniques.

Author: Jahidur Rahman Khan, Srizan Chowdhury, Humayera Islam, Enayetur Raheem [21]
Title: “Machine learning algorithm to predict childhood anemia in Bangladesh”
Contribution: The study highlights the potential of machine learning techniques in predicting
disease status using demographic and health survey data, which can aid in health care planning
and policy-making. It also demonstrates the effectiveness of the random forest (RF) algorithm
in achieving the best classification accuracy of 68.53% for predicting childhood anemia,
providing valuable insights for policymakers and healthcare providers.
Future Scope/Limitation: The research can serve as a foundation for developing a
knowledge-based system to predict childhood anemia incidence in Bangladesh, complementing
existing healthcare practices. The cross-sectional nature of the BDHS 2011 data limited the
inclusion of certain attributes like recent diarrhea and fever status, potentially impacting
the predictive models' accuracy.

Author: Manish Jaiswal, Anima Srivastava, and Tanveer J. Siddiqui [22]
Title: “Machine Learning Algorithms for Anemia Disease Prediction”
Contribution: It contributes to the understanding of anemia risk management. Furthermore, the
paper critically reviews the research on iron deficiency and its impact on work capacity. It
provides insights into the burden of anemia in low-income and middle-income countries,
highlighting a significant health risk. The paper discusses the association between maternal
anemia and small-for-gestational-age outcomes, emphasizing the importance of addressing
moderate to severe maternal anemia. It offers an overview of decision trees and their
application in medicine, showcasing their potential utility in medical decision-making
processes.
Future Scope/Limitation: Future studies may investigate the causal relationship between iron
deficiency and reduced work capacity in more depth, potentially uncovering additional factors
influencing this association. The applicability of decision trees in medical decision-making
may be subject to limitations based on the complexity of the medical conditions being analyzed
and the availability of data.

Author: Betül Çil, Hakan Ayyıldız, Taner Tuncer [23]
Title: “Discrimination of β-thalassemia and iron deficiency anemia through extreme learning
machine and regularized extreme learning machine based decision support system”
Contribution: This research proposes a decision support system to distinguish between
β-thalassemia and iron deficiency anemia, which could improve the accuracy of diagnosis and
reduce the need for more advanced testing. The system was found to be accurate with an
accuracy rate of 95.59%.
Future Scope/Limitation: The model needs to be validated on a larger dataset. It is also
important to investigate how well the model generalizes to different populations.
Additionally, the long-term effects of using this model in clinical practice need to be
studied.

Authors: Meherwar Fatima, Maruf Pasha [24]
Title: “Survey of Machine Learning Algorithms for Disease Diagnostic”
Contribution: The significance of machine learning algorithms in disease diagnostics, namely
in computer-aided diagnosis (CAD) in medical imaging, is the main topic of this study paper.
It highlights how crucial machine learning and pattern recognition are to raising the
precision of illness detection and diagnosis in the realm of biomedical research.
Limitations / future scope: Integration of different Machine Learning algorithms can be
considered to enhance disease diagnostic accuracy even more effectively. Investigating the
application of these algorithms in real-time clinical settings to evaluate their practical
utility and efficiency. The paper does not talk about problems while implementing Machine
Learning algorithms in real-world medical settings and lacks a detailed discussion on the
ethical considerations and potential biases associated with using AI in disease diagnosis.

Authors: Mohammed Sami MOHAMMED; Arshed A. AHMAD; Murat SARI [25]
Title: “Analysis of Anemia Using Data Mining Techniques with Risk Factors Specification”
Contribution: Based on 539 data sets with 10 features, the study report sought to predict
anemia using four techniques: Bayesian Network (BN), Naive Bayes (NB), Logistic Regression
(LR), and Multilayer Perceptron (MLP). Logistic Regression (LR) outperformed the other
techniques in predicting anemia. It demonstrates the application of attribute evaluators like
information gain to show the system's high performance with minimal characteristics,
enhancing the predictive accuracy of anemia detection. It addresses the critical challenge in
healthcare of early detection of disorders leading to complex health issues, emphasising the
importance of timely diagnosis and intervention.
Limitations / future scope: The paper acknowledges the limitations of the techniques used for
datasets with varying attribute values, suggesting potential challenges in generalizing the
results to diverse datasets. It hints at the potential for using Naive Bayes (NB) with
Artificial Neural Network (ANN) datasets to address unbalanced data issues, opening avenues
for further research in this area.

Authors: Sneha Grampurohit; Chetan Sagarnal [26]
Title: “Disease Prediction using Machine Learning Algorithms”
Contribution: It presents a comparative analysis of the results obtained from the different
machine learning algorithms used, providing insights into their effectiveness. A sample
dataset of 4920 patients' records diagnosed with 41 diseases was analyzed, with 95 optimized
independent variables (symptoms) closely related to diseases selected for the study.
Limitations / future scope: The performance of the machine learning algorithms may vary
depending on the specific diseases or symptoms under consideration, indicating the need for
further optimization and customization for different healthcare scenarios.

Authors: Archana Singh; Rakesh Kumar [27]
Title: “Heart Disease Prediction Using Machine Learning Algorithms”
Contribution: The research paper emphasizes the critical role of the heart in living
organisms and underscores the necessity for precise diagnosis and prediction of heart-related
diseases to prevent adverse outcomes. It makes a contribution by assessing, using the UCI
repository dataset, the predictive power of several machine learning methods for cardiac
disease, such as k-nearest neighbor, decision tree, linear regression, and support vector
machine.
Limitations / future scope: There is potential for future studies to focus on incorporating
real-time data and continuous monitoring techniques to improve early detection and prediction
of heart-related conditions, thereby enabling timely interventions and personalized
healthcare. One possible barrier could pertain to the interpretability of the machine
learning models. This pertains to improving transparency and trust between patients and
healthcare providers, which is crucial for the models to be widely accepted and utilized in
clinical settings.

Authors: Arnab K. Dey, Nabamallika Dehingia, Nandita Bhan, Edwin Elizabeth Thomas, Lotus
McDougal, Sarah Averbach, Julian McAuley, Abhishek Singh, Anita Raj [28]
Title: “Using machine learning to understand determinants of IUD use in India: Analyses of
the National Family Health Surveys (NFHS-4)”
Contribution: Identified significant determinants of IUD use in India, emphasizing the
importance of shared family planning goals, access to services, desire for no more children,
wealth, education, and maternal and child health services. Highlighted the crucial role of
male engagement in family planning decisions and the need for targeted awareness efforts,
especially for marginalized populations with limited access to care. Lasso and ridge logistic
regression models were employed to assess significant determinants of IUD use among married
women in India. Neural network approaches were utilized to analyze the data and identify key
predictors of IUD uptake in the study population.
Limitations / future scope: The study relied on cross-sectional data, limiting the ability to
establish causality between variables. The research focused on married women, excluding
unmarried or divorced individuals who may also benefit from IUD use. The study did not delve
into regional variations in IUD uptake within India, which could provide valuable insights
for targeted interventions.

Authors: El-Sayed M. El-kenawy, Marwa M. Eid, Abdelhameed Ibrahim [29]
Title: “Anemia Estimation for COVID-19 Patients Using A Machine Learning Model”
Contribution: The paper introduces a Machine Learning model for estimating blood levels,
specifically focusing on haemoglobin (Hgb) levels, using hematological criteria. This model
aids in accurate blood evaluation activities, providing essential information for medical
professionals. It explores the application of various classification and regression
approaches, utilizing Scikit-Learn to analyze hematological data, particularly in the context
of COVID-19 patients. The study emphasizes the importance of employing multiple classifiers
to enhance the accuracy of medical diagnoses based on hematological information. Random
Forest, Support Vector Machine, and Artificial Neural Networks are used to approximate
haemoglobin values using hematological criteria.
Limitations / future scope: A limitation of the study is the reliance on hematological data
alone for estimating haemoglobin levels. Future research could consider incorporating
additional clinical parameters or data sources to further enhance the accuracy and robustness
of the model in predicting blood levels for COVID-19 patients. Future research could also
focus on optimizing the proposed model by utilizing an optimization algorithm to determine
the best weights for improved accuracy. This would enhance the model's performance and
reliability in estimating blood levels accurately.

Authors: Nelly Estefanie Garduno-Rapp, MD, MSHI; Yee Seng Ng, MD; Jenny L Weon, MD, PhD;
Sameh N Saleh, MD, MBMI; Christoph U Lehmann, MD; Chenlu Tian, MD; Andrew Quinn, MD [30]
Title: “Early identification of patients at risk for iron-deficiency anemia using deep
learning techniques.”
Contribution: Nelly Estefanie Garduno-Rapp et al.'s study work employs deep learning
algorithms to identify patients at risk for iron-deficiency anemia (IDA) early on. Three
neural networks (long short-term memory cells, gated recurrent units, and artificial neural
networks) were developed to forecast the risk of IDA three to six months ahead of the
conventional diagnosis. The study attained encouraging outcomes, with the gated recurrent
unit model outperforming the other models over 200 epochs with an accuracy of 0.83, an AUC of
0.89, a sensitivity of 0.75, and a specificity of 0.85. It showed that deep learning may be
used to detect IDA early in the outpatient context, giving clinicians a long lead time to
intervene.
Limitations / future scope: Implementing the developed deep learning models in clinical
practice to assist healthcare providers in identifying patients at risk for IDA earlier.
Further refining the models by incorporating additional relevant features or data sources to
enhance prediction accuracy. The models' performance was evaluated based on historical data;
further validation in prospective studies is necessary.

Authors: Serhat KILICARSLAN, Mete CELIK, Şafak SAHIN [31]
Title: “Hybrid models based on genetic algorithm and deep learning algorithms for nutritional
Anemia disease classification”
Contribution: In order to classify anemic datasets, the research study presents hybrid GA-CNN
and GA-SAE models. Genetic algorithms are used to optimize the hyperparameters of the CNN and
SAE deep learning algorithms. The suggested GA-CNN model outperforms alternative methods in
the 98.50% success rate of nutritional anemia classes predicted using the real anemia
dataset. In particular, the study focuses on nutritional anemia, which includes iron
deficiency anemia, B12 deficiency anemia, folate deficiency anemia, and people without
anemia. It also discusses the use of deep learning algorithms in disease prediction.
Limitations / future scope: Investigating the scalability and generalizability of the
proposed models to larger and more diverse datasets could enhance their practical utility in
clinical settings. The study does not extensively discuss the computational complexity or
training time required for the proposed models, which could be crucial for real-time
applications.

2.2 Chapter Summary

This chapter provides a comprehensive review of the existing literature on anemia, focusing on its
prevalence, causes, and health implications. It examines the physiological role of hemoglobin, the
biochemical mechanisms underlying red blood cell production, and the factors contributing to anemia,
including nutritional deficiencies, genetic disorders, and chronic diseases. The chapter also explores
the diagnostic criteria for anemia, various treatment approaches, and the socio-economic impact of the
condition. Additionally, it highlights recent advancements in research, identifying gaps and future
directions for study. This review underscores the importance of a multidisciplinary approach to
effectively manage and mitigate anemia's widespread health effects.

Chapter 3

Research Methodology

The methodology section serves as a guide to the research process. We begin by establishing
the core problem addressed by this work. A comprehensive literature review is conducted to
identify existing knowledge and current research gaps. To bridge these gaps, the proposed
work is then introduced. This section details the specific algorithms chosen for
anemia risk prediction. Subsequently, the optimisation strategies
implemented to refine the proposed work are explained. Finally, the methodology section
culminates with a discussion of the expected results and their analysis. A flowchart is
included below to illustrate the research progression. The section also describes the
data used for the analysis and the process through which it was collected.

3.1 Dataset Description

The data for this analysis comes from the National Family Health Survey (NFHS-5),
conducted between 2019 and 2021. As the fifth edition in the NFHS series, NFHS-5 offers
comprehensive information on the population, health, and nutritional status across all Indian
states and union territories. The survey was primarily funded by the Government of India,
with additional technical support and funding from USAID's Demographic and Health
Surveys Program and ICF, USA. The Indian Council of Medical Research (ICMR) and the
National AIDS Research Institute (NARI) in Pune also supported some of the Clinical,
Anthropometric, and Biochemical (CAB) tests. NFHS-5 examined health and nutritional
issues across all Indian states and union territories, providing district-level estimates for
numerous key variables, similar to NFHS-4. New and significant bioinformatic data
introduced by NFHS-5 include methods and reasons for abortion, preschool education,
menstrual hygiene, expanded age ranges for measuring diabetes and hypertension for
individuals aged 15 and above, frequency of alcohol and tobacco use, micronutrient
components for children, expanded child immunization domains, death registration, and a
new component for non-communicable diseases (NCDs). These additions allowed for a more
comprehensive comparison of data over time. The NFHS-5 sample was designed to provide
estimates of several survey indicators at the national, state/union territory (UT), and district
levels. [14]

The survey covered a wide range of criteria during the design and creation of its indicators,
encompassing 707 districts, 8 union territories, and 28 states. A uniform sample design,
representative at the national, state/UT, and district levels, was employed in each survey round.
Each district was divided into rural and urban sections. However, a variety of indicators
related to sexual behavior, HIV/AIDS attitudes and behaviors, women's work status, husbands'
background and awareness, and domestic violence are available only at the state/UT and
national levels. Each rural stratum was further classified based on village population and
the proportion of individuals belonging to the SC/ST (scheduled castes and scheduled tribes).
Within each rural sampling stratum, a sample of villages was selected to serve as Primary
Sampling Units (PSUs), categorized based on the literacy rate of women aged six and older
before PSU selection.[15]

Using computer-assisted personal interviewing (CAPI), eligible women aged 15 to 49
completed the Woman's Questionnaire, providing information on a wide range of topics. Four
survey schedules/questionnaires (Household, Woman, Man, and Biomarker) were produced
and distributed in eighteen regional languages. The Household Questionnaire gathered
information on land ownership, mosquito net use, household deaths in the three years prior to
the survey, socioeconomic characteristics, health insurance coverage, disabilities, hygiene,
access to clean water and sanitation, and all household members and guests who spent the
night before the interview. [16]

The Biomarker schedule measured blood pressure, weight, hip and waist circumference,
children's weight, children's height, haemoglobin levels, and random blood glucose levels for
men and women over the age of 15. Along with measuring children's height and haemoglobin
levels, men and women were asked to prick their finger and provide a few extra drops of
blood for laboratory testing to check for vitamin D3, malaria parasites, and HbA1c. The
Woman's Questionnaire aimed to gather comprehensive data on women's health and well-
being. It targeted women aged 15-49 and addressed a wide range of topics. Demographic
information like caste, age, religion, and media exposure was collected alongside
reproductive history details such as pregnancies, births, and terminations. Additionally, blood
tests for anemia were administered to all eligible women. The questionnaire further explored
health concerns including tobacco and alcohol use, tuberculosis awareness, and current
illnesses like cancer, diabetes, and heart disease. Notably, a specific module within the study
(State module subsample) delved into decision-making within households and its potential
connection to anemia. [17]

3.2 Machine Learning (ML) algorithms

Machine learning is a technique that equips computers with the ability to learn and improve
from experience, much like humans do. Imagine a digital gardener nurturing a plant. Instead
of providing water and sunlight, this gardener feeds the plant—representing the computer—
with data and algorithms. This helps the plant comprehend and uncover hidden patterns,
make decisions, and advance over time, all without explicit instructions at each step. The
crux of machine learning can be stated as the art of taking raw data and transforming it into
valuable insights and predictions. This process enables machines to adapt and thrive in their
environment autonomously. The computers learn from the data they are given, identifying
patterns and making decisions based on this learning. Over time, they become more adept at
these tasks, requiring less and less guidance.

To put it simply, machine learning is a procedural approach through which systems gain and
comprehend information from various observations. They enrich and expand their
capabilities, bringing forth new knowledge without relying solely on pre-programmed
instructions. This allows them to evolve and perform increasingly complex tasks, much like a
digital gardener helping a plant to grow and flourish. [18]

3.3 Machine Learning Techniques

3.3.1 Support Vector Machines: Finding the Best Divide

Imagine having a collection of data points, each belonging to one of two distinct categories.
For instance, these points could represent emails classified as spam or not spam. An SVM
aims to create a clear division, like a straight line in a two-dimensional space, that separates
these categories with the greatest possible margin. This margin is the distance between the
line and the closest data points from each category, called support vectors. In simpler terms,
the SVM algorithm searches for the best dividing line or plane (called a hyperplane in higher
dimensions) that maximises the gap between the two classes of data. The data points that
define this margin are crucial for the SVM's operation, hence the name "support vectors."
Real-world data isn't always perfectly separable by a straight line. [19]
Figure 1 : SVM working principle

The figure above represents an SVM in action. For classification problems, Support Vector
Machines (SVMs) use a geometric approach. By maximising the margin between the
hyperplane and the nearest data points (support vectors), they create a hyperplane that serves
as a decision boundary. This margin optimization promotes robust classification,
especially in high-dimensional spaces. Additionally, SVMs use kernel functions to transform
data that is not linearly separable into higher dimensions where linear separation is
possible. For instance, imagine classifying images of cats and dogs, where a simple line might
not suffice. The SVM can project the data points into a higher-dimensional space where a clear
separation might exist. Even though we cannot visualise this higher-dimensional space, the SVM
works effectively within it to find an optimal separation.

While commonly used for classification tasks, SVMs can also be adapted for regression
problems, where the goal is to predict a continuous value rather than a class label. SVMs
handle data with many features well because they remain effective in high-dimensional
spaces. Additionally, SVMs can deliver good results even with limited data,
making them suitable for scenarios where collecting large amounts of data is challenging.
Two considerations should be kept in mind to obtain the best results. Choosing the
right kernel function is crucial for optimal SVM performance and depends on the specific
data characteristics. Also, training SVMs can be computationally expensive,
especially for large datasets, so the size of the training set should be kept manageable. [20]
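The idea of projecting data into a higher-dimensional space can be illustrated with a small toy example. The sketch below is only illustrative: the four XOR-style points, the hand-picked feature map and the hyperplane weights are assumptions made for this example and are not taken from the study. It shows that points which no straight line can separate in two dimensions become linearly separable once a third feature is added.

```python
# Toy illustration of the kernel idea: XOR-style points cannot be split by a
# straight line in 2-D, but become linearly separable after adding a feature.

points = [(-1, -1), (-1, 1), (1, -1), (1, 1)]
labels = [-1, 1, 1, -1]  # XOR pattern: opposite corners share a class


def feature_map(x1, x2):
    """Hypothetical explicit map into 3-D: append the product x1 * x2."""
    return (x1, x2, x1 * x2)


def linear_decision(w, b, x):
    """Sign of w.x + b, the decision rule of a linear separator."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if score >= 0 else -1


# In the mapped space, the hyperplane with normal (0, 0, -1) separates the classes.
w, b = (0.0, 0.0, -1.0), 0.0
predictions = [linear_decision(w, b, feature_map(*p)) for p in points]
print(predictions == labels)
```

A kernel function achieves the same effect without computing the mapped coordinates explicitly; the explicit map is used here only because it is easier to inspect.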

3.3.2 K-Nearest Neighbors: Voting with Your Data Neighborhood

K-Nearest Neighbors (KNN) is a basic and popular machine learning technique that can be used
for both regression and classification problems. Its fundamental premise
is that data items with comparable attributes typically fall into the same class. KNN leverages
this concept to make predictions on new data points by analysing the labels of their closest
neighbors within the training data.[21]

Figure 2 : KNN Mechanism


The diagram above gives a sense of how KNN works. For classification
problems, the K-Nearest Neighbors (KNN) algorithm uses a proximity-based
approach. Finding the k most similar data points (neighbours) in the training data set
is the first step in classifying a new data point. A majority vote among these neighbours then
selects the class label for the new data point. The performance of the KNN model is greatly
affected by the value of k. Selecting a high k value may miss important local patterns in the
data, while selecting a low k value may result in overfitting. [22]

KNN operates in four distinct phases: training, distance calculation, identification of
the nearest neighbours, and classification/regression. In the first phase, training, the
algorithm simply stores the entire training dataset; no explicit model is constructed.
When a new data point is
introduced during the distance calculation phase, KNN uses a selected distance metric to
determine the distance between this point and every other point in the training set. The
identifying nearest neighbours phase involves finding the k closest data points (k being a
user-defined parameter) to the new point. Finally, in the classification/regression phase, the
algorithm makes predictions based on these neighbors. For classification, the most frequent
class label among the k nearest neighbors is assigned to the new point. In regression
problems, the average value of the target variable from the k nearest neighbours is used for
prediction. KNN offers several advantages. It's incredibly easy to understand and implement,
making it a good choice for beginners in machine learning. Additionally, KNN is non-
parametric, meaning it doesn't make any assumptions about the underlying data distribution.
This can be beneficial for complex datasets where other algorithms might struggle. However,
KNN also has limitations. Since it stores the entire training dataset, it can be memory-
intensive for large datasets. Additionally, KNN's performance is highly dependent on the
chosen distance metric and the value of k. Choosing a poor k value can lead to overfitting or
underfitting, which can significantly impact the algorithm's accuracy. [23]
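The four phases described above can be sketched in a few lines of plain Python. This is a minimal illustration rather than the implementation used in the study: the function names, the Euclidean distance metric and the small training set are all assumptions introduced for the example.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Straight-line distance between two feature vectors."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


def knn_predict(train_X, train_y, query, k=3, task="classification"):
    # Phase 1 (training): nothing is fitted; train_X and train_y are simply kept.
    # Phase 2 (distance calculation): distance from the query to every stored point.
    distances = [(euclidean(x, query), y) for x, y in zip(train_X, train_y)]
    # Phase 3 (nearest neighbours): keep the k smallest distances.
    neighbours = sorted(distances, key=lambda pair: pair[0])[:k]
    targets = [y for _, y in neighbours]
    # Phase 4 (prediction): majority vote for classes, mean for numeric targets.
    if task == "classification":
        return Counter(targets).most_common(1)[0][0]
    return sum(targets) / len(targets)


# Hypothetical two-feature training data (values are made up for illustration).
train_X = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (4.0, 4.2), (4.1, 3.9), (3.8, 4.0)]
class_labels = ["low", "low", "low", "high", "high", "high"]
numeric_targets = [10.0, 11.0, 12.0, 20.0, 21.0, 22.0]

predicted_class = knn_predict(train_X, class_labels, (1.1, 1.0), k=3)
predicted_value = knn_predict(train_X, numeric_targets, (1.1, 1.0), k=3, task="regression")
print(predicted_class, predicted_value)
```

Because every prediction scans the whole training set, the memory and run-time cost noted above grows directly with the number of stored points.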

Figure 3 : KNN with different k size

The figure above illustrates the impact of varying the k parameter
in K-Nearest Neighbors (KNN) regression. The x-axis represents the feature space, while the
three y-axes correspond to the target variable for the training data (leftmost), KNN model
predictions (center), and test data (rightmost). Each plotted point likely signifies a sample
with its feature value and corresponding target value. The horizontal dashed lines presumably
represent different values of k, the number of neighbors considered in the KNN analysis. By
visually analyzing the proximity of the model predictions (center) to the test data (rightmost)
across varying k values, we can glean insights into the model's generalizability and potential
for overfitting. Overall, KNN is a versatile and effective machine learning algorithm,
particularly for smaller datasets. Its simplicity and ease of use make it a valuable tool for
various classification and regression tasks. However, it's important to be aware of its
limitations and carefully consider factors like distance metrics and the value of k to ensure
optimal performance. [24]

3.3.3 Decision Tree and Random Forest


A decision tree is a flexible and widely applied machine learning algorithm used for both
classification and regression tasks. As a supervised learning method, it predicts the value of a
target variable by learning straightforward decision rules derived from the data features. The
model is organized as a tree, with each internal node representing a test on an attribute, each
branch indicating the result of the test, and each leaf node signifying a class label or a
continuous value. This hierarchical structure makes decision trees very intuitive and easy to
understand, as they reflect human decision-making processes. [25]

Building a decision tree involves selecting the optimal feature to split on at each node
according to a splitting criterion. Standard criteria include information gain, Gini impurity,
and variance reduction. Gini impurity measures the probability of incorrectly labeling an
element if it were randomly labeled according to the label distribution in the subset.
Information gain, derived from entropy, measures the reduction in uncertainty about the target
variable after partitioning the data based on an attribute. [26]
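The two splitting criteria can be made concrete with a short calculation. The sketch below is illustrative only: the function names and the ten hypothetical labels are assumptions for this example, not data from NFHS-5. It computes the Gini impurity and entropy of a node and the information gain of one candidate split.

```python
import math
from collections import Counter


def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def entropy(labels):
    """Shannon entropy (in bits) of the class distribution."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(parent, children):
    """Entropy of the parent minus the size-weighted entropy of the children."""
    n = len(parent)
    weighted = sum(len(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted


# Hypothetical node with ten samples and one candidate split into two children.
parent = ["anemic"] * 5 + ["normal"] * 5
left = ["anemic"] * 4 + ["normal"] * 1
right = ["anemic"] * 1 + ["normal"] * 4

print(round(gini(parent), 3))
print(round(entropy(parent), 3))
print(round(information_gain(parent, [left, right]), 3))
```

A tree-building procedure would evaluate such a score for every candidate split and keep the one with the lowest impurity or the highest gain.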

The main benefit of decision trees is their capability to handle both numerical and categorical
data with minimal preprocessing; for example, they do not require normalization or scaling.
They can capture nonlinear relationships between input features and the outcome variable, making
them suitable for a variety of data patterns. However, decision trees are prone to
overfitting, especially when they grow too deep and become extremely complex.
Overfitting happens when the model also learns the noise in the
training data, resulting in weak generalisation to new, unseen data. To counter this, methods like
pruning, setting a maximum depth, or requiring a minimum number of samples per leaf can
be used. Pruning reduces the size of the decision tree by removing parts that do not improve
its predictive power. There are two types of pruning: pre-pruning and post-pruning.
Pre-pruning stops the tree growth early by imposing constraints like limiting the maximum depth
or requiring a minimum number of samples at a node. Post-pruning involves growing the tree
to its full depth and then discarding nodes that contribute little to no predictive capability
based on a validation set or cross-validation. [27]
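The effect of pre-pruning can be seen on a very small tree builder. The following is a simplified sketch under stated assumptions: it handles a single numeric feature, uses Gini impurity as the criterion, and the names build_tree, max_depth and min_samples_leaf are chosen for this example. Post-pruning is not shown.

```python
from collections import Counter


def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def build_tree(xs, ys, depth=0, max_depth=2, min_samples_leaf=1):
    """Grow a tree on one numeric feature; returns a leaf label or a split node."""
    majority = Counter(ys).most_common(1)[0][0]
    # Pre-pruning: stop when the node is pure or the depth limit is reached.
    if len(set(ys)) == 1 or depth >= max_depth:
        return majority
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [y for x, y in zip(xs, ys) if x <= threshold]
        right = [y for x, y in zip(xs, ys) if x > threshold]
        # Pre-pruning: reject splits that leave too few samples in a child.
        if len(left) < min_samples_leaf or len(right) < min_samples_leaf:
            continue
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(ys)
        if best is None or score < best[0]:
            best = (score, threshold)
    if best is None:
        return majority
    t = best[1]
    left_x = [x for x in xs if x <= t]
    left_y = [y for x, y in zip(xs, ys) if x <= t]
    right_x = [x for x in xs if x > t]
    right_y = [y for x, y in zip(xs, ys) if x > t]
    return {
        "threshold": t,
        "left": build_tree(left_x, left_y, depth + 1, max_depth, min_samples_leaf),
        "right": build_tree(right_x, right_y, depth + 1, max_depth, min_samples_leaf),
    }


def depth_of(tree):
    """Number of split levels below this node (0 for a leaf)."""
    if not isinstance(tree, dict):
        return 0
    return 1 + max(depth_of(tree["left"]), depth_of(tree["right"]))


# Hypothetical noisy data: labels do not change cleanly with the feature value.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = ["a", "a", "b", "a", "b", "b", "a", "b"]

unpruned = build_tree(xs, ys, max_depth=10)
depth_limited = build_tree(xs, ys, max_depth=1)
leaf_limited = build_tree(xs, ys, max_depth=10, min_samples_leaf=3)
print(depth_of(unpruned), depth_of(depth_limited), depth_of(leaf_limited))
```

On this noisy sample the unconstrained tree keeps splitting to isolate individual points, whereas either constraint stops growth after a single split.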

Figure 4 : Random Forest & Decision Tree

Despite their interpretability and simplicity, decision trees have some limitations. They can
be unstable, meaning that small changes in the data can result in very different trees. This
instability can be addressed by ensemble methods such as random forests. Random forests
create a 'forest' of multiple decision trees, each trained on a random subset of the data and
features, and aggregate their predictions to improve accuracy and robustness. By averaging
the results of many trees, random forests reduce the variance of the model, making it more
resistant to overfitting and more capable of handling the complexities of real-world data. In
summary, decision trees are a fundamental yet powerful tool in machine learning, offering
clear advantages in interpretability and flexibility. However, their susceptibility to overfitting
and instability requires careful tuning and the possible use of ensemble techniques like
random forests to enhance their performance. Understanding these nuances allows
practitioners to effectively leverage decision trees and random forests in a variety of
predictive modelling tasks.[28]
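The combination of bootstrap sampling, random feature choice and vote aggregation can be sketched as follows. This is a deliberately reduced illustration, not a full random forest: each "tree" is a one-split stump, each stump sees a single randomly chosen feature, and the data, seed and function names are assumptions made for the example.

```python
import random
from collections import Counter


def fit_stump(values, labels):
    """One-split tree on a single feature: (threshold, left_label, right_label)."""
    majority = Counter(labels).most_common(1)[0][0]
    best, best_errors = (None, majority, majority), len(labels) + 1
    for t in sorted(set(values))[:-1]:
        left = [y for v, y in zip(values, labels) if v <= t]
        right = [y for v, y in zip(values, labels) if v > t]
        l_lab = Counter(left).most_common(1)[0][0]
        r_lab = Counter(right).most_common(1)[0][0]
        errors = sum(y != l_lab for y in left) + sum(y != r_lab for y in right)
        if errors < best_errors:
            best, best_errors = (t, l_lab, r_lab), errors
    return best


def fit_forest(X, y, n_trees=25, seed=0):
    rng = random.Random(seed)
    n, n_features = len(X), len(X[0])
    forest = []
    for _ in range(n_trees):
        rows = [rng.randrange(n) for _ in range(n)]  # bootstrap sample of rows
        feature = rng.randrange(n_features)          # random feature for this tree
        stump = fit_stump([X[i][feature] for i in rows], [y[i] for i in rows])
        forest.append((feature, stump))
    return forest


def predict_forest(forest, row):
    """Aggregate the individual predictions by majority vote."""
    votes = []
    for feature, (threshold, left_label, right_label) in forest:
        if threshold is None or row[feature] <= threshold:
            votes.append(left_label)
        else:
            votes.append(right_label)
    return Counter(votes).most_common(1)[0][0]


# Hypothetical two-feature data with two well-separated groups.
X = [(1, 1.0), (2, 1.5), (3, 2.0), (4, 2.5), (10, 8.0), (11, 9.0), (12, 8.5), (13, 9.5)]
y = ["low", "low", "low", "low", "high", "high", "high", "high"]

forest = fit_forest(X, y)
print(predict_forest(forest, (1, 1.0)), predict_forest(forest, (13, 9.5)))
```

Individual stumps differ because each sees a different resample of the rows, which is the source of the variance reduction obtained by voting.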

3.3.4 Gaussian Naive Bayes


Gaussian Naive Bayes (GNB) is a probabilistic classifier based on Bayes' theorem that is
designed for continuous data. It assumes that the values of each feature within a class
follow a Gaussian (normal) distribution, so it is well suited to normally distributed
features. Because the underlying data frequently follows an approximately normal
distribution, GNB is used in medical diagnostics, text classification, and
financial prediction, among other areas. Bayes' theorem, which offers a framework for
updating a hypothesis's probability estimate when new data is gathered, is the fundamental
component of the GNB algorithm. In the context of GNB, it is used to
calculate the posterior probability of a class given a set of features. The initial step in
implementing GNB involves estimating the prior probabilities for every class using the
relative frequencies of the classes in the training data. The algorithm then calculates the
mean and variance of each feature within each class. The Gaussian probability density function
is then used to determine the likelihood of a given feature value for each class based on
these parameters. Under the naive assumption that features are conditionally independent given
the class, these likelihoods are multiplied together and combined with the prior probability
to obtain the posterior probability for each class. The predicted class is the one with the
highest posterior probability.[28]
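The steps of the classifier (priors, per-class means and variances, Gaussian likelihoods, highest posterior) can be written out directly. The sketch below is a minimal illustration: the function names, the small variance constant and the six hypothetical records of haemoglobin level and age are assumptions for the example and do not come from the study's data. Log-probabilities are summed instead of multiplying probabilities, which is equivalent for choosing the best class.

```python
import math
from collections import defaultdict


def fit_gnb(X, y):
    """Estimate the prior and per-feature (mean, variance) for every class."""
    by_class = defaultdict(list)
    for row, label in zip(X, y):
        by_class[label].append(row)
    model = {}
    for label, rows in by_class.items():
        prior = len(rows) / len(X)
        stats = []
        for column in zip(*rows):
            mean = sum(column) / len(column)
            var = sum((v - mean) ** 2 for v in column) / len(column)
            stats.append((mean, var + 1e-9))  # small constant avoids zero variance
        model[label] = (prior, stats)
    return model


def predict_gnb(model, row):
    """Return the class with the highest log-posterior for one sample."""
    best_label, best_score = None, -math.inf
    for label, (prior, stats) in model.items():
        score = math.log(prior)
        for value, (mean, var) in zip(row, stats):
            # Log of the Gaussian probability density function.
            score += -0.5 * math.log(2 * math.pi * var) - (value - mean) ** 2 / (2 * var)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical records: (haemoglobin in g/dL, age in years).
X = [(9.5, 25), (10.1, 31), (10.8, 40), (13.2, 22), (13.9, 35), (12.8, 29)]
y = ["anemic", "anemic", "anemic", "normal", "normal", "normal"]

model = fit_gnb(X, y)
print(predict_gnb(model, (10.0, 30)), predict_gnb(model, (13.5, 30)))
```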

One of the prominent advantages of the Gaussian Naive Bayes algorithm is its computational
efficiency. The training phase is notably swift, requiring only the estimation of means and
variances for the features, making GNB particularly suitable for large datasets. Additionally,
the model's simplicity and interpretability enhance its appeal, allowing practitioners to easily
understand the influence of individual features on the classification outcome. Nevertheless,
the effectiveness of GNB may be limited by its assumptions of a Gaussian
distribution and feature independence. The classifier's performance may suffer if these
assumptions are not met, particularly if there is strong correlation between the
features or a notable deviation from normality in the data distribution. Preliminary data
analysis and transformation (e.g., feature scaling, power transformations) might help reduce
these limitations by better aligning the data with the algorithm's assumptions. [29]
3.4 Chapter Summary
This chapter outlines the research methodology employed in the study. It details the research
design, sampling methods, and data collection techniques used to investigate anemia. The
chapter explains the selection of participants, including inclusion and exclusion criteria, and
describes the tools and instruments utilized for data gathering, such as surveys,
questionnaires, and clinical tests. It also covers the procedures for data analysis, including
statistical methods and software used. Ethical considerations, including informed consent and
confidentiality measures, are discussed to ensure the study's integrity. The chapter concludes
with a justification for the chosen methodology, emphasizing its suitability for achieving the
research objectives.

Chapter 4

Proposed Work
This study utilises data from the National Family Health Survey 5 (NFHS-5) to explore and
analyse various health-related metrics. The proposed methodology encompasses several
critical steps, each designed to ensure robust and reliable results.

The first step involves importing the NFHS-5 dataset into a Jupyter notebook environment.
Initial exploratory data analysis (EDA) is conducted to understand the distribution and
characteristics of the data. This includes visualising data distributions, identifying patterns,
and summarising key statistics to gain a comprehensive overview of the dataset.

Subsequently, data preprocessing is performed to prepare the dataset for analysis. This step
includes handling missing values by imputing them with the mean values of the respective
features. Moreover, data standardization is carried out to ensure that all
features contribute equally to the analysis. This step is crucial for improving
the performance of machine learning models.

The preprocessed data is then split into two subsets: training and testing
datasets. The training dataset is used to train the machine learning models, whereas
the testing dataset is reserved for evaluating the performance of these models.
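The preprocessing and splitting steps described above can be sketched for a single feature as follows. This is an illustrative outline only: the helper names, the 25% test fraction, the random seed and the haemoglobin values are assumptions made for the example, and the order (impute, standardize, then split) simply follows the description in this chapter.

```python
import random
import statistics


def impute_mean(column):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in column if v is not None]
    mean = statistics.fmean(observed)
    return [mean if v is None else v for v in column]


def standardize(column):
    """Rescale to zero mean and unit (population) standard deviation."""
    mean = statistics.fmean(column)
    std = statistics.pstdev(column)
    return [(v - mean) / std for v in column]


def train_test_split(rows, test_fraction=0.25, seed=42):
    """Shuffle the row indices and hold out a fraction for testing."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    n_test = int(round(len(rows) * test_fraction))
    test = [rows[i] for i in indices[:n_test]]
    train = [rows[i] for i in indices[n_test:]]
    return train, test


# Hypothetical feature column with two missing values.
haemoglobin = [11.2, None, 9.8, 13.1, 12.4, None, 10.5, 12.0]
filled = impute_mean(haemoglobin)
scaled = standardize(filled)
train, test = train_test_split(scaled)
print(len(train), len(test))
```

A common variation is to compute the imputation mean and the scaling statistics on the training portion only, so that no information from the test set influences the preprocessing.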

Five different machine learning models are applied to the training dataset. Initially, these
models are run with default hyperparameters, and their performance is recorded. The models
used in this study include Decision Tree, Random Forest, Support Vector Machine,
Gaussian Naive Bayes, and K-Nearest Neighbors (KNN).
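
As a sketch of this baseline stage, the five classifiers can be fitted with their scikit-learn defaults as shown below. The synthetic dataset from `make_classification` is an assumption standing in for the preprocessed survey data, so the printed scores carry no meaning for the study itself.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification problem used only for illustration.
X, y = make_classification(n_samples=400, n_features=10, n_informative=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# The five model families, each left at its default hyperparameters.
models = {
    "DT": DecisionTreeClassifier(random_state=0),
    "RF": RandomForestClassifier(random_state=0),
    "SVM": SVC(),
    "GNB": GaussianNB(),
    "KNN": KNeighborsClassifier(),
}

# Fit each model on the training split and record its test accuracy.
default_scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    default_scores[name] = model.score(X_test, y_test)

for name, score in default_scores.items():
    print(f"{name}: {score:.3f}")
```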

To improve the effectiveness of the models, hyperparameter tuning is conducted using the
grid search method. This method involves systematically searching through a predefined set
of hyperparameters to identify the combination that yields the best performance. The
results of the tuned models are compared to those obtained with the default settings.
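
Grid search can be sketched with scikit-learn's `GridSearchCV`, shown here for KNN only. The parameter grid, the 5-fold cross-validation, and the synthetic data are hypothetical choices for illustration; the grids actually searched in the study are not listed here.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data standing in for the preprocessed survey features.
X, y = make_classification(n_samples=300, n_features=8, n_informative=4, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=1
)

# Hypothetical grid: 4 neighbourhood sizes x 2 weighting schemes = 8 candidates.
param_grid = {
    "n_neighbors": [3, 5, 7, 9],
    "weights": ["uniform", "distance"],
}

# Every candidate is scored by 5-fold cross-validation on the training split only.
search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5, scoring="accuracy")
search.fit(X_train, y_train)

print(search.best_params_)            # best combination found in the grid
print(search.best_score_)             # its mean cross-validated accuracy
print(search.score(X_test, y_test))   # accuracy of the refitted best model on the test split
```

Because the candidates are compared using cross-validation inside the training split, the test split remains untouched until the final score is computed.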

The performance of each model is meticulously tracked and recorded throughout the process.
Key performance metrics such as accuracy, precision, recall, F1-score, and ROC-AUC are
calculated to evaluate the effectiveness of the models.
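
The metrics listed above can be computed with `sklearn.metrics`, as in the following small worked example. The labels and scores are made up for illustration, and the 0.5 decision threshold is an assumption.

```python
from sklearn.metrics import (
    accuracy_score,
    f1_score,
    precision_score,
    recall_score,
    roc_auc_score,
)

# Illustrative ground truth and predicted probabilities for the positive class.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_score = [0.9, 0.4, 0.8, 0.2, 0.1, 0.6, 0.3, 0.7]

# Hard predictions obtained by thresholding the scores at 0.5.
y_pred = [int(s >= 0.5) for s in y_score]

metrics = {
    "accuracy": accuracy_score(y_true, y_pred),
    "precision": precision_score(y_true, y_pred),
    "recall": recall_score(y_true, y_pred),
    "f1": f1_score(y_true, y_pred),
    # ROC-AUC is computed from the continuous scores, not the hard predictions.
    "roc_auc": roc_auc_score(y_true, y_score),
}

for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

In this example there are 3 true positives, 1 false positive, 1 false negative and 3 true negatives, so accuracy, precision, recall and F1-score all equal 0.75, while ROC-AUC is 15/16 = 0.9375 because 15 of the 16 positive-negative pairs are ranked correctly.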

To further refine the analysis, dimensionality reduction techniques are applied to the dataset.
This step aims to reduce the number of features while preserving the most significant
information. The reduced dataset is then subjected to the same five machine learning models,
and their performance is evaluated and compared to the results obtained with the original
dataset.

Finally, a comprehensive performance analysis is conducted to compare the results of
all models before and after dimensionality reduction. The best-performing model is selected
based on its overall performance across the various metrics. This model is considered the
most suitable for the given dataset and research objectives.
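
The dimensionality reduction step described above can be illustrated with Principal Component Analysis in scikit-learn. The 95% explained-variance threshold and the synthetic, deliberately redundant features are assumptions for this sketch; the number of components retained in the study is not stated here.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Six observed features generated from two latent factors plus a little noise,
# so most of the variance lives in a low-dimensional subspace.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 6))
X = latent @ mixing + 0.05 * rng.normal(size=(200, 6))

# PCA is scale-sensitive, so the features are standardized first.
X_scaled = StandardScaler().fit_transform(X)

# A float n_components keeps the smallest number of components whose
# cumulative explained variance exceeds that fraction (here 95%).
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X_scaled)

print(X_reduced.shape)
print(pca.explained_variance_ratio_.cumsum())
```

The reduced matrix can then be passed to the same classifiers in place of the original features.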

Figure 5 : Workflow of the proposed work

The proposed workflow for applying machine learning algorithms to predict the risk of
anemia in Indian children is depicted in the figure above. We used NFHS-5 survey data.
After importing the data, we performed exploratory data analysis to understand the data
distributions and to find any missing values or outliers. Pre-processing of the data included
outlier removal, normalization, and mean imputation of missing values. The data was then
split into training and testing sets. We used five machine learning models: Support Vector
Machine (SVM), K-Nearest Neighbors (KNN), Gaussian Naive Bayes (GNB), Decision Tree (DT),
and Random Forest (RF). Initially, each model was trained with default parameters.
Subsequently, we performed hyperparameter tuning to optimize each model's performance
further. To reduce data dimensionality, Principal Component Analysis (PCA) was employed
before re-training all five models. Finally, a comparative performance analysis was
conducted using metrics including recall, F1-score, precision, and accuracy.
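
The workflow summarized above can be sketched end to end as a single scikit-learn `Pipeline`. This is an illustrative arrangement, not the exact implementation used in the study: the synthetic data, the 5% missing rate, the choice of 6 principal components and the use of Random Forest as the final estimator are all assumptions, and outlier removal is omitted for brevity.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data with roughly 5% of the entries knocked out to mimic missing values.
X, y = make_classification(n_samples=500, n_features=12, n_informative=6, random_state=7)
rng = np.random.default_rng(7)
X[rng.random(X.shape) < 0.05] = np.nan

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=7
)

# Imputation, scaling and PCA are fitted on the training split only and then
# applied unchanged to the test split when predicting.
pipeline = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),
    ("scale", StandardScaler()),
    ("pca", PCA(n_components=6)),
    ("clf", RandomForestClassifier(random_state=7)),
])
pipeline.fit(X_train, y_train)

test_accuracy = accuracy_score(y_test, pipeline.predict(X_test))
print(f"test accuracy: {test_accuracy:.3f}")
```

Wrapping the steps in one pipeline is a design choice that keeps statistics such as column means and principal axes from being estimated on test rows, which could otherwise inflate the reported scores.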

4.1 Chapter Summary

This chapter presents the proposed work for addressing anemia, outlining the objectives,
hypotheses, and research plan. It details the specific interventions and strategies to be
implemented, such as nutritional programs, awareness campaigns, and clinical trials for new
treatments. The chapter describes the target population and the criteria for participant
selection. It also includes a timeline for the project phases, from initial planning to execution
and evaluation. Expected outcomes and potential impacts are discussed, emphasizing how the
proposed work aims to advance understanding, improve management, and reduce the
prevalence of anemia. The chapter concludes with a discussion of anticipated challenges and
mitigation strategies.
Chapter 5

Results and Discussion

The results indicate that all five machine learning models achieved relatively high accuracy
in predicting anemia risk in Indian children using the NFHS-5 data. However, there were
variations in performance across the models and evaluation stages. Support Vector Machine
(SVM) and Decision Tree (DT) emerged as the frontrunners, achieving a perfect accuracy of
1.00 with hyperparameter tuning. This suggests that these models were able to learn the
underlying patterns in the data exceptionally well and make accurate predictions on the
unseen testing set. Random Forest (RF) also exhibited strong performance, closely following
SVM and DT with an accuracy of 0.98 after hyperparameter tuning. This indicates that the
ensemble learning approach of RF was highly effective in this task. K-Nearest Neighbors
(KNN) achieved significant improvement with hyperparameter tuning, reaching an accuracy
of 0.93. This highlights the importance of hyperparameter optimization for enhancing model
generalizability. Gaussian Naive Bayes (GNB) yielded the lowest accuracy across all stages,
with a maximum accuracy of 0.14 after hyperparameter tuning. This suggests that the
conditional independence assumption that Naive Bayes makes about the features may not hold
for this dataset.

Model   Precision        Recall           F1 Score         Accuracy         PCA
        DHP     HPT      DHP     HPT      DHP     HPT      DHP     HPT      (Accuracy)
SVM     0.99    1        0.99    1        0.99    1        0.98    1        0.99
GNB     0.419   0.45     0.327   0.35     0.139   0.17     0.14    0.14     0.99
KNN     0.074   0.982    0.822   0.95     0.13    0.96     0.97    0.93     0.99
DT      1       0.94     1       0.94     1       0.94     1       0.98     0.99
RF      0.991   1        0.99    1        0.991   1        0.98    1        0.99

Table 2 :

The results in Table 2 indicate that all five machine learning models achieved high accuracy
on the training data (> 98%). However, there is greater variation in performance on the
hold-out test data. Decision Tree (DT) achieved the highest overall accuracy (98%) and
F1-score (0.94) on the test data. Random Forest (RF) had the second highest accuracy (98%)
and F1-score (0.99) but slightly lower precision (0.991) and recall (1.0) compared to DT.
Support Vector Machine (SVM) also achieved high accuracy (98%) and F1-score (0.99) but with
slightly lower values than both DT and RF. Gaussian Naive Bayes (GNB) had the lowest
accuracy (0.14) and F1-score (0.17) on the test data, indicating poor performance in
correctly classifying anemia risk. K-Nearest Neighbors (KNN) had an intermediate accuracy
(0.93) and F1-score (0.96) on the test data. Interestingly, PCA appears to have had minimal
impact on model performance, with accuracy values on the test data nearly identical to those
before PCA dimensionality reduction.
Figure 6 : Performance analysis

5.1 Chapter Summary

This chapter presents the findings of the study on anemia, followed by an in-depth discussion
of the results. The data collected from surveys, clinical tests, and other methods are analyzed
and displayed through tables, graphs, and charts. Key outcomes, such as the prevalence of
anemia, its correlation with various demographic factors, and the effectiveness of proposed
interventions, are highlighted. The discussion interprets these findings, comparing them with
existing literature and theoretical frameworks. It also addresses any unexpected results,
potential limitations of the study, and their implications. Finally, the chapter emphasizes the
study's contributions to the field and suggests areas for future research.
Chapter 6

Conclusion & Future Scope

This study conducted a comparative analysis of various machine learning models for
predicting anemia risk in Indian children using data from the National Family Health Survey
5 (NFHS-5). The investigation revealed promising results, with Support Vector Machine
(SVM) and Decision Tree (DT) achieving a perfect accuracy of 1.00 after hyperparameter
tuning. Random Forest (RF) also demonstrated strong performance with an accuracy of 0.98,
highlighting the effectiveness of ensemble learning. The findings suggest that machine
learning holds significant potential for developing robust and accurate tools to predict anemia
risk in this population. Dimensionality reduction techniques showed limited impact on model
performance in this specific case. However, incorporating additional data sources, exploring
advanced feature engineering, and integrating the model with healthcare systems present
exciting avenues for future research. Further exploration of Explainable AI (XAI) techniques
and the development of models focused on specific anemia types can provide valuable
insights for targeted interventions. Ultimately, this research paves the way for utilizing
machine learning to enhance the early detection and management of anemia in Indian
children, leading to improved health outcomes.

Several directions for future work follow from these findings. Additional data sources,
such as medical records, dietary habits, or environmental factors, could be incorporated to
potentially improve the accuracy and comprehensiveness of the risk prediction models.
Advanced feature engineering could be used to create new features derived from existing
data, and feature selection techniques could identify the most informative features for
model training. A user-friendly interface could integrate the best-performing model into
existing healthcare systems, allowing for quick and efficient anemia risk assessment during
child checkups. More advanced deep learning architectures, such as convolutional neural
networks (CNNs) or recurrent neural networks (RNNs), could be investigated to potentially
capture more complex relationships within the data. To assess generalizability to diverse
populations, the best-performing model could be tested on data from different geographical
regions within India or from other countries. XAI techniques could be implemented to
understand the rationale behind the model's predictions, providing valuable insights into
the factors that contribute most to anemia risk in the specific context of the data.
Finally, a mobile application that incorporates the model for anemia risk prediction could
be developed. This could empower parents and caregivers to assess their children's risk at
home, potentially leading to earlier diagnosis and treatment.
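
As one possible starting point for the explainability direction mentioned above, permutation importance from `sklearn.inspection` estimates how much a fitted model's score drops when each feature is shuffled. The sketch below uses synthetic data and hypothetical feature names (`feature_0` to `feature_5`); it is not an analysis of the study's model.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data in which only some of the six features carry signal.
X, y = make_classification(
    n_samples=400, n_features=6, n_informative=3, n_redundant=0, random_state=3
)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]  # hypothetical names

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=3
)
model = RandomForestClassifier(random_state=3).fit(X_train, y_train)

# Shuffle each feature 10 times on the held-out split and measure the score drop.
result = permutation_importance(model, X_test, y_test, n_repeats=10, random_state=3)

# Rank features from most to least important by mean score drop.
ranking = sorted(
    zip(feature_names, result.importances_mean),
    key=lambda item: item[1],
    reverse=True,
)
for name, importance in ranking:
    print(f"{name}: {importance:.3f}")
```

Permutation importance is model-agnostic, so the same call would work for any of the five classifiers considered in this study.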

6.1 Chapter Summary

This chapter summarizes the key findings of the study on anemia, highlighting the significant
insights and their implications for public health. It reaffirms the importance of addressing
anemia through targeted interventions and emphasizes the role of comprehensive health
strategies. The chapter also discusses the limitations of the study, providing a critical
evaluation of the research methodology and outcomes. Looking forward, it outlines potential
areas for future research, suggesting further investigation into genetic factors, innovative
treatments, and large-scale public health initiatives. The chapter concludes by stressing the
need for continued efforts to mitigate anemia's impact on global health.
References

[1] Belali, T. M. (2022). Iron deficiency anaemia: prevalence and associated factors
among residents of northern Asir Region, Saudi Arabia. Scientific Reports, 12(1).
https://doi.org/10.1038/s41598-022-23969-1

[2] Cho, H., Lee, S., & Baek, Y. (2021b). Anemia diagnostic system based on
impedance measurement of red blood cells. Sensors, 21(23), 8043.
https://doi.org/10.3390/s21238043

[3]Anaemia in women and children. (n.d.).

https://www.who.int/data/gho/data/themes/topics/anaemia_in_women_and_children

[4] Dey, S., Goswami, S., & Dey, T. (2014). Identifying predictors of childhood anaemia
in North-East India. Journal of Health, Population and Nutrition, 31(4).
https://doi.org/10.3329/jhpn.v31i4.20001

[5] Thakur, H., Chand, R., & Narayan, R. Burden Of Anemia And Its Socio-Economic
Determinates Among Pregnant Women In Himachal Pradesh, India: A Cross-Sectional
Study.

[6] Mantadakis, E., Chatzimichael, E., & Zikidou, P. (2020). IRON DEFICIENCY
ANEMIA IN CHILDREN RESIDING IN HIGH AND LOW-INCOME COUNTRIES:
RISK FACTORS, PREVENTION, DIAGNOSIS AND THERAPY. Mediterranean
Journal of Hematology and Infectious Diseases, 12(1), e2020041.
https://doi.org/10.4084/mjhid.2020.041

[7] Balarajan, Y., Ramakrishnan, U., Özaltin, E., Shankar, A. H., & Subramanian, S.
(2011). Anaemia in low-income and middle-income countries. Lancet, 378(9809), 2123–
2135. https://doi.org/10.1016/s0140-6736(10)62304-5

[8] Neogi, S. B., Sharma, J., Pandey, S., Zaidi, N., Bhattacharya, M., Kar, R., Kar, S. S.,
Purohit, A., Bandyopadhyay, S., & Saxena, R. (2020). Diagnostic accuracy of point-of-
care devices for detection of anemia in community settings in India. BMC Health
Services Research, 20(1). https://doi.org/10.1186/s12913-020-05329-9

[9] Pivina, L., Semenova, Y., Doşa, M.dence from Rural China. International Journal of
Environmental Research and Public Health/International Journal of Environmental
Research and Public Health, 16(15), 2761. https://doi.org/10.3390/ijerph16152761

[12] Rahman, M. A., Khan, M. N., & Rahman, M. M. (2020). Maternal anaemia and risk
of adverse obstetric and neonatal outcomes in South Asian countries: A systematic review
and meta-analysis. Public Health in Practice, 1, 100021.
https://doi.org/10.1016/j.puhip.2020.100021

[13] Song, J., Dong, H., Xu, F., Wang, Y., Li, W., Jue, Z., Wei, L., Yue, Y., & Zhu, C.
(2021). The association of severe anemia, red blood cell transfusion and necrotizing
enterocolitis in neonates. PloS One, 16(7), e0254810.
https://doi.org/10.1371/journal.pone.0254810

[14] Meitei, A. J., Saini, A., Mohapatra, B. B., & Singh, K. J. (2022). Predicting child
anaemia in the North-Eastern states of India: a machine learning approach. International
Journal of System Assurance Engineering and Management, 13(6), 2949-2962.

[15] Saputra, D. C. E., Sunat, K., & Ratnaningsih, T. (2023, February). A new artificial
intelligence approach using extreme learning machine as the potentially effective model to
predict and analyze the diagnosis of anemia. In Healthcare (Vol. 11, No. 5, p. 697). MDPI.

[16] Wang, L., Li, M., Dill, S. E., Hu, Y., & Rozelle, S. (2019). Dynamic anemia status from
infancy to preschool-age: evidence from rural China. International journal of environmental
research and public health, 16(15), 2761.

[17] Hasan, M., Tahosin, M. S., Farjana, A., Sheakh, M. A., & Hasan, M. M. (2023, May). A
harmful disorder: Predictive and comparative analysis for fetal Anemia disease by using
different machine learning approaches. In 2023 11th International Symposium on Digital
Forensics and Security (ISDFS) (pp. 1-6). IEEE.

[18] Acharya, S., Swaminathan, D., Das, S., Kansara, K., Chakraborty, S., Kumar, D., ... &
Aatre, K. R. (2019). Non-invasive estimation of hemoglobin using a multi-model stacking
regressor. IEEE journal of biomedical and health informatics, 24(6), 1717-1726.

[19] Bhavsar, H., & Ganatra, A. (2012). A comparative study of training algorithms for
supervised machine learning. International Journal of Soft Computing and Engineering
(IJSCE), 2(4), 2231-2307.

[20] Dalvi, P. T., & Vernekar, N. (2016, May). Anemia detection using ensemble learning
techniques and statistical models. In 2016 IEEE International Conference on Recent Trends
in Electronics, Information & Communication Technology (RTEICT) (pp. 1747-1751).
IEEE.

[21] Khan, J. R., Chowdhury, S., Islam, H., & Raheem, E. (2019). Machine learning
algorithms to predict the childhood anemia in Bangladesh. Journal of Data Science, 17(1),
195-218.

[22] Jaiswal, M., Srivastava, A., & Siddiqui, T. J. (2019). Machine learning algorithms for
anemia disease prediction. In Recent trends in communication, computing, and electronics:
Select proceedings of IC3E 2018 (pp. 463-469). Springer Singapore.

[23] Çil, B., Ayyıldız, H., & Tuncer, T. (2020). Discrimination of β-thalassemia and iron
deficiency anemia through extreme learning machine and regularized extreme learning
machine based decision support system. Medical hypotheses, 138, 109611.

[24] Fatima, M., & Pasha, M. (2017). Survey of machine learning algorithms for disease
diagnostic. Journal of Intelligent Learning Systems and Applications, 9(01), 1-16.

[25] Mohammed, M. S., Ahmad, A. A., & Murat, S. A. R. I. (2020, June). Analysis of anemia
using data mining techniques with risk factors specification. In 2020 International Conference
for Emerging Technology (INCET) (pp. 1-5). IEEE.

[26] Vardhan, H., Sd, S., Sriram, K., & Kakarla, Y. (2024, January). Disease Prediction
Based on Symptoms Using Ensemble and Hybrid Machine Learning Models. In 2024 14th
International Conference on Cloud Computing, Data Science & Engineering (Confluence)
(pp. 799-804). IEEE.
[27] Shah, D., Patel, S., & Bharti, S. K. (2020). Heart disease prediction using machine
learning techniques. SN Computer Science, 1(6), 345.

[28] Dey, A. K., Dehingia, N., Bhan, N., Thomas, E. E., McDougal, L., Averbach, S., ... &
Raj, A. (2022). Using machine learning to understand determinants of IUD use in India:
Analyses of the National Family Health Surveys (NFHS-4). SSM-Population Health, 19,
101234.

[29] El-Kenawy, E. S. M., Eid, M. M., & Ibrahim, A. (2021). Anemia estimation for covid-
19 patients using a machine learning model. Journal of Computer Science and Information
Systems, 17(11), 2535-1451.

[30] Garduno-Rapp, N. E., Ng, Y. S., Weon, J. L., Saleh, S. N., Lehmann, C. U., Tian, C., &
Quinn, A. (2024). Early identification of patients at risk for iron-deficiency anemia using
deep learning techniques. American Journal of Clinical Pathology, aqae031.

[31] Kilicarslan, S., Celik, M., & Sahin, Ş. (2021). Hybrid models based on genetic
algorithm and deep learning algorithms for nutritional Anemia disease classification.
Biomedical Signal Processing and Control, 63, 102231.
