Thesis Fin
DISSERTATION
Master of Technology
in
Computer Science
by
Akshay Kumar
22/10/MT/034
JUNE 2024
CERTIFICATE
This is to certify that the dissertation entitled “Machine Learning approaches for Predicting
Anemia Risk in Woman from India”, being submitted by Akshay Kumar (Enrollment
Number 22/10/MT/034) in fulfillment of the requirement for the award of the degree of Master of
Technology in Computer Science to the School of Computer & Systems Sciences (SC&SS),
Jawaharlal Nehru University, New Delhi, is a record of the candidate's own work carried
out under the guidance and supervision of Dr. Manju Khari.
The matter presented in this thesis has not been submitted for the award of any other degree
elsewhere.
(Dean) (Supervisor)
School of Computer and Systems Sciences School of Computer and Systems Sciences
DECLARATION
I hereby declare that the M. Tech Dissertation titled “Machine Learning Algorithms for
Predicting Anemia Risk in Children from India” being submitted to the School of
Computer and Systems Sciences, Jawaharlal Nehru University, New Delhi, in partial
fulfilment of the requirements for the award of the degree of Master of Technology in
Computer Science, is an authentic record of work carried out by me under the guidance and
supervision of Prof. Manju Khari.
I also mention that the research work is original and has not been submitted by me, in part or
full, to any other University or Institution for the award of any degree or diploma.
Akshay Kumar
(22/10/MT/034)
ACKNOWLEDGEMENT
I would like to express my sincere gratitude to my supervisor, Prof. Manju Khari, for her
expertise, her continuous support, and her exceptional attitude and insight. Without
her guidance and persistent help, this thesis would not have been possible. It has been my
utmost privilege to work with her. My appreciation also extends to my colleagues for their
insightful comments and valuable feedback. Their timely suggestions with kindness,
enthusiasm and dynamism have enabled me to complete my thesis. I would also like to thank
my committee members for their time and cooperation in the completion of this thesis. My
sincerest appreciation and gratitude go to my parents and my brother for their unfailing love
and support, for encouraging me every single day to be a better person, and for giving me
wings to fly.
Akshay Kumar
ABSTRACT
Childhood anemia remains a major health concern in India despite existing
efforts. This study leverages machine learning to develop a model for predicting
anemia risk. Data from a large national survey will be used, encompassing
factors like demographics, diet, and health indicators. Algorithms like Random
Forest will identify key risk factors specific to different regions within India.
This will lead to a geographically-specific predictive model that considers both
national and regional trends. The anticipated outcome is a model that accurately
predicts anemia risk in Indian children. This has the potential to revolutionize
childhood anemia management by enabling early identification of at-risk
children and allowing for timely interventions. Additionally, targeted public
health initiatives and geographically specific education campaigns can be
developed based on the identified risk factors. Furthermore, exploring
environmental factors like access to clean water and sanitation can create a
more comprehensive approach for tackling childhood anemia in India. This
study has the potential to significantly improve public health outcomes for
children across the nation.
Table of Contents
CERTIFICATE
DECLARATION
ACKNOWLEDGEMENT
ABSTRACT
LIST OF FIGURES
LIST OF TABLES
1. INTRODUCTION
2. LITERATURE REVIEW
2.1 Introduction
3. RESEARCH METHODOLOGY
4. PROPOSED WORK
4.1 Introduction
4.2 Chapter Summary
REFERENCES
List of Abbreviations
DT : Decision Tree
RF : Random Forest
List of Figures
List of Tables
Chapter 1
Introduction
Anaemia is a common medical condition in which the haemoglobin concentration or the number of red
blood cells in the body is substantially lower than the normal range.
Haemoglobin is composed of a simple protein, known as globin, and a non-protein,
iron-containing part known as haem. In order for cells to respire, haemoglobin
must first bind with oxygen in the lungs to create oxyhaemoglobin, which is then carried by
the bloodstream to tissues and organs. Haemoglobin also helps remove the waste products of
metabolism by carrying carbon dioxide from the tissues back to the lungs for
expiration. These functions play a pivotal role in regulating
the physiological processes of the human body. Haemoglobin also
helps maintain the pH of blood by acting as a buffer, preventing drastic changes in
blood acidity or alkalinity. Moreover, it helps regulate blood flow by
releasing nitric oxide, which dilates blood
vessels and increases blood flow to tissues when needed. Additionally, haemoglobin can bind
with and transport other molecules, such as nitric oxide, carbon monoxide (which it binds with much
higher affinity than oxygen, and which can therefore be toxic), and certain drugs [1].
These functions are essential for preserving homeostasis and ensuring that the body's
numerous physiological systems operate as intended. If a person has abnormal red blood
cells or insufficient haemoglobin, the oxygen-carrying capacity of the
blood decreases. Fatigue, weakness, dizziness, and breathing difficulties are signs
of anemia that should be looked for early in the diagnostic process. Sex, age, residence
elevation, smoking habits, and pregnancy all affect the
haemoglobin concentration needed to meet physiological needs.
Anaemia can have many causes, including poor diet or
impaired absorption of nutrients (leading to nutrient deficiency), infections (e.g.
tuberculosis, parasitic infections, malaria, and HIV), inflammation, chronic
diseases, gynaecological and obstetric problems, and inherited red blood cell disorders. One of the
most prominent nutritional causes of anemia is iron deficiency; however, deficiencies in
folate, vitamins B12 and A, and other essential nutrients should also be taken into account [2].
Because it mostly affects children, pregnant and postpartum women, adolescent
girls, and menstruating women, anemia is a critical public health concern that should
be addressed by both the general public and the government. A worrying fact is that
most cases of anemia occur in low- and lower-middle-income nations.
Those most at risk are people who live in remote areas, in lower-income
households, and who have never attended a formal educational institution. It is estimated that
anemia currently affects 37% of expectant mothers, 40% of all children between the ages of
6 and 59 months, and 30% of women globally between the ages of 15 and 49. Based on current
estimates, this corresponds to half a billion women aged 15 to 49 and 269 million
children aged 6 to 59 months worldwide. In 2019, anemia afflicted 30 percent of non-pregnant women
and 37 percent of pregnant women between the ages of 15 and 49 [3].
Although many factors can lead to anaemia, iron
deficiency is the most prominent cause of nutritional anemia across the world. Several
factors can give rise to iron deficiency anaemia (IDA), including
inadequate iron intake, poor dietary quality, high physiological
requirements during pregnancy and early childhood, rapid growth spurts (during
adolescence and puberty), and parasitic infections that cause chronic iron deficiency
(such as hookworm infection and schistosomiasis). A person's family size, income, educational
attainment, vitamin A deficiency, urban or rural location, gravidity, lack of iron-folic acid
supplementation, excessive menstrual bleeding, and history of abortion are some
environmental and personal factors that can also
be major contributors to the gradual onset of anemia. Among the
causes of anemia other than IDA, infections such as HIV and malaria, chronic inflammation,
and protein-energy deficiency are the main culprits, according to epidemiological
studies [4].
Anaemia has detrimental consequences for a person’s health, for society, and for the economy. The WHO
classifies anaemia according to standardised criteria.
Anemia in pregnant women is defined by a haemoglobin level below 110 g/L. Specifically,
anemia is classified as severe when hemoglobin levels fall below 70 g/L, moderate when
levels range between 70 and 99 g/L, and mild when levels are between 100 and 109 g/L.
Even mild to moderate anemia can negatively impact emotional well-being, causing fatigue
and stress, which reduce productivity and overall work efficiency. Maternal mortality and
morbidity in developing nations are significantly influenced by the incidence of severe
anemia. Serious health effects of chronic anemia can include a higher likelihood of infections
and hemorrhage, while severe anaemia can lead to heart failure
and death. The stakeholders involved are therefore pursuing various strategies
to reduce the burden of anaemia, and governments and non-governmental organisations
across the world have launched initiatives to
treat it. These efforts include short-term
measures such as supplementation and long-term strategies such as food-based methods, food
fortification, dietary diversification, and nutritional education. The "National Iron Plus
Initiative" was launched in 2013 by the Ministry of Health and Family Welfare with the goal
of combating anemia across all age groups. During the antenatal
and postnatal periods, starting after the first trimester, each pregnant woman receives one iron and
folic acid tablet daily for six months. Additionally, pregnant women are widely
encouraged to get tested for anemia [5].
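The severity cut-offs for pregnant women quoted above can be written as a small lookup. The sketch below is illustrative only; the function name is hypothetical and is not part of the thesis pipeline. It takes a haemoglobin concentration in g/L and returns the category implied by the stated thresholds.

```python
def classify_anemia_in_pregnancy(hb_g_per_l: float) -> str:
    """Map a haemoglobin value (g/L) to the severity category in the text.

    Cut-offs as quoted above for pregnant women:
    severe below 70, moderate from 70 up to (not including) 100,
    mild from 100 up to (not including) 110, otherwise not anemic.
    """
    if hb_g_per_l < 0:
        raise ValueError("haemoglobin concentration cannot be negative")
    if hb_g_per_l < 70:
        return "severe"
    if hb_g_per_l < 100:
        return "moderate"
    if hb_g_per_l < 110:
        return "mild"
    return "not anemic"


# Example readings (made-up values, for illustration only).
for hb in (65, 85, 105, 120):
    print(hb, classify_anemia_in_pregnancy(hb))
```

Different cut-offs apply to other groups (for example children or non-pregnant women), so the thresholds would need to be replaced before using this for any other population.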
Anemia has serious effects on human health and is a global public health concern. It is one
of the most common conditions in the world, especially affecting women and young people.
Using cutting-edge technologies to address this problem can significantly lower its
prevalence. Anemia has long been a public health concern, affecting an estimated 2.5
billion people worldwide. Iron deficiency anemia (IDA) is diagnosed when a person's
hemoglobin (Hb) level is below normal for their gender, age, and physiological state.
Anemia during pregnancy can have an adverse effect on the health of the mother as well as
the fetus. It increases the risk of intrauterine growth restriction, low birth weight, and early
delivery, all of which are linked to higher risks of perinatal death. The WHO states that the
mother's safety throughout pregnancy depends on eliminating anemia. Because anemia is
highly prevalent, any negative effects
on expectant mothers and their unborn babies have a substantial impact on public health overall [6].
To address this pervasive issue, it is vital to leverage the power of advanced technological
solutions. These solutions can include developing more accurate and accessible diagnostic
tools, improving nutritional interventions, and enhancing public health campaigns to raise
awareness and promote preventive measures. By integrating these approaches, it is possible
to make significant strides in reducing the global burden of anemia, particularly among the
most vulnerable populations[8].
1.2 Research Objectives
Anemia, particularly iron-deficiency anemia, can significantly affect cognitive function. Iron
is crucial for the development and function of the brain. Lack of sufficient iron leads to
decreased oxygen transport to the brain, impairing neurotransmitter synthesis and function.
This can reduce attention span and concentration, and affected individuals often have difficulty
maintaining focus on tasks, which can adversely affect academic and professional
performance. Moreover, chronic anemia in children can lead to long-term deficits in
cognitive and motor development, affecting educational attainment and social interactions [9].
Anemia imposes additional stress on the cardiovascular system as the heart works harder to
pump blood to deliver sufficient oxygen to tissues. In severe cases this can lead to heart
failure, as chronic severe anemia can cause or exacerbate heart failure due to the persistent
high workload on the heart. Additionally, reduced oxygen levels can lead to chest pain
(angina) during physical exertion as the heart muscle receives inadequate oxygen. Anemic
hypoxia (reduced oxygen supply to tissues) can also affect the electrical activity of the
heart and cause arrhythmias (abnormal heart rhythms) [10].
Anemia during childhood can have long-lasting impacts on growth and development. The
growth of a child can be stunted as iron and other nutrients vital for growth are deficient in
anemic children, leading to shorter stature and delayed physical development. Anemia can
delay the onset of puberty, affecting hormonal balance and secondary sexual characteristics.
Moreover, it can also weaken the immune system as chronic anemia can impair the immune
response, making children more susceptible to infections and illnesses, which further hinders
growth and development. It can also cause
irritability, fatigue, and apathy, contributing to behavioural issues that can disrupt social
interactions and learning environments [11].
Pregnancy significantly increases the body's demand for iron and other nutrients, making
anemia particularly dangerous for both the mother and the foetus. Maternal anemia increases
the risk of preterm labour and premature birth, and newborns may suffer
complications such as respiratory distress syndrome. Low-birth-weight babies of anemic
mothers are more likely to have
developmental problems as well as higher rates of neonatal morbidity and mortality.
Moreover, anemia can contribute to the development of
preeclampsia, a condition characterised by high blood pressure and damage to organ systems,
which can be life-threatening for both mother and baby. The physical stress of anemia can
contribute to postpartum depression, affecting the mother’s mental health and her ability to
care for the newborn. Reduced oxygen transport can also lead to inadequate
nutrient and oxygen delivery to the foetus, causing growth restrictions and developmental
delays [12].
The ability of blood to function normally is impaired whenever a person develops a
blood disorder of any kind. These disorders can lower the quantity of nutrients, proteins,
platelets, or cells in the blood, which can impair physiological
processes throughout the body. Empirical studies have consistently shown that in anemia patients
blood flow to vital organs such as the brain, heart, liver, and kidneys increases while
blood flow to less vital parts of the body decreases. To establish the diagnosis of anemia, the
hematocrit (the proportion of red blood cells in the total volume of a blood sample) or the blood's
hemoglobin concentration is usually measured. A patient is considered anemic if their
hematocrit or hemoglobin level is more than two standard deviations below the normal
mean. Typically, Hb assessments utilizing capillary/haemoglobin electrophoresis, DNA
analysis, or high-performance liquid chromatography are used to identify β-thalassemia
trait (BTT) and haemoglobin E (HbE).
Because DNA analysis requires specialized equipment and is costly and time-consuming, it
cannot be applied in standard lab settings [13].
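The two-standard-deviation rule mentioned above amounts to a one-line comparison. The following sketch assumes a reference mean and standard deviation are available for the relevant population; the numbers used here are placeholders chosen for illustration, not clinical reference values, and the function name is hypothetical.

```python
def is_below_two_sd(value: float, ref_mean: float, ref_sd: float) -> bool:
    """Return True if `value` lies more than two standard deviations
    below the reference mean."""
    if ref_sd <= 0:
        raise ValueError("reference standard deviation must be positive")
    return value < ref_mean - 2 * ref_sd


# Placeholder reference distribution for haemoglobin in g/L (assumed values).
REF_MEAN_HB = 140.0
REF_SD_HB = 10.0

# With these placeholders the cut-off is 140 - 2 * 10 = 120 g/L.
print(is_below_two_sd(115.0, REF_MEAN_HB, REF_SD_HB))  # below the cut-off
print(is_below_two_sd(125.0, REF_MEAN_HB, REF_SD_HB))  # above the cut-off
```

The same function applies to hematocrit if the reference mean and standard deviation are given in the corresponding units.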
The thesis is divided into six chapters, each covering a part of the
research work.
Chapter 1: Introduction
The first chapter starts with an introduction to the background
of anemia. It then identifies the problem statement, presents the
objectives and motivation of the thesis, and closes with the
thesis outline.
Chapter 2 gives a brief overview of earlier work on anaemia
prediction with the help of machine learning algorithms.
Chapter 3 outlines the research background and describes all the algorithms in detail.
Chapter 4 presents the dataset and the research methodology followed in this
thesis. It discusses the ML techniques together with a detailed description of the
EDA process and how features are selected for model training.
Chapter 5 explains the results, with a detailed discussion of the results and the
evaluation metrics used for comparison.
Chapter 6 is the final chapter of the research work and gives the conclusion and future scope of the
research.
1.5 Chapter Summary
This chapter discussed anemia, a common condition marked by low hemoglobin
or red blood cell levels. Hemoglobin, made of globin protein and iron-containing heme, binds
oxygen in the lungs and transports it to tissues, aiding cellular respiration. It also removes
metabolic waste, maintains blood pH, regulates blood flow by releasing nitric oxide, and can
bind other molecules like carbon monoxide. The critical functions of hemoglobin are
essential for various physiological processes, highlighting the importance of addressing
anemia for overall health.
Chapter 2
Literature Review
The study begins with a comprehensive and detailed literature review, elucidating the
multifaceted nature of anemia in the context of the Indian population. Anemia in
children remains a significant public health concern in India, with consequences for
growth, development, and cognitive function. Machine learning (ML) algorithms
offer promising avenues for developing non-invasive and potentially cost-effective
methods for predicting anemia risk in this population. Studies have explored the
application of various ML algorithms, including Logistic Regression, Random Forest
[4], and Support Vector Machines (SVM), to identify key risk factors and improve
anemia detection in children from India. These studies highlight the potential of ML
in this domain, but also acknowledge the need for further research on factors specific
to the Indian context, such as socioeconomic status, dietary patterns, and regional
variations in disease prevalence.
Authors: Lei Wang, Mengjie Li, Sarah-Eve Dill, Yiwei Hu, Scott Rozelle [16]
Title: “Dynamic Anemia Status from Infancy to Preschool Age: Evidence from Rural China”
Contribution: Several valuable insights were provided into the dynamic nature of anaemia
prevalence in rural Chinese children over time, shedding light on the changing patterns of
this nutritional deficiency during early childhood. It also highlights that 51% of children
were anaemic in infancy, 24% in toddlerhood, and 19% at preschool age, with 67%
experiencing anaemia at some point during the study.
Future scope: Longitudinal studies could explore the impact of dietary habits,
socioeconomic factors, and healthcare access on anaemia prevalence in this population.
Intervention studies could be designed to test the effectiveness of targeted strategies to
reduce anaemia rates among rural Chinese children, potentially informing public health
policies and programs.

Authors: Soumyadipta Acharya; Dhivya Swaminathan; Sreetama Das; Krity Kansara;
Sushovan Chakraborty; Dinesh Kumar R; Tony Francis; Kiran R Aatre [18]
Title: “Non-Invasive Estimation of haemoglobin Using a Multi-Model Stacking Regressor”
Contribution: The research paper contributes a new ML-based technique for non-invasive
estimation of total haemoglobin (Hb) using photoplethysmograms (PPGs) acquired from a
custom finger sensor. It demonstrates the feasibility of this method for maternal anemia
detection, showing a statistically significant correlation coefficient of 0.81 with low
Root Mean Square Error (RMSE).
Future scope: Further research could focus on expanding the study to a more diverse
population to validate the method's effectiveness across different demographics. Enhancing
the machine learning model with additional features or algorithms could improve the
accuracy and robustness of the Hb estimation.

Authors: Hetal Bhavsar, Amit Ganatra [19]
Title: “Comparative Study of Training Algorithms for Supervised Machine Learning”
Contribution: A comprehensive comparison of six different classification algorithms,
including decision trees, Bayesian networks, neural networks, k-nearest neighbours, and
support vector machines.

Authors: El-Sayed M. El-kenawy, Marwa M. Eid, Abdelhameed Ibrahim [29]
Title: “Anemia Estimation for COVID-19 Patients Using A Machine Learning Model”
Contribution: The paper introduces a Machine Learning model for estimating blood levels,
specifically focusing on haemoglobin (Hgb) levels, using hematological criteria. This model
aids in accurate blood evaluation activities, providing essential information for medical
professionals. It explores the application of various classification and regression
approaches, utilizing Scikit-Learn to analyze hematological data, particularly in the
context of COVID-19 patients. The study emphasizes the importance of employing multiple
classifiers to enhance the accuracy of medical diagnoses based on hematological
information. Random Forest, Support Vector Machine, and Artificial Neural Networks are
used to approximate haemoglobin values using hematological criteria.
Future scope: A limitation of the study is the reliance on hematological data alone for
estimating haemoglobin levels. Future research could consider incorporating additional
clinical parameters or data sources to further enhance the accuracy and robustness of the
model in predicting blood levels for COVID-19 patients. Future research could also focus on
optimizing the proposed model by utilizing an optimization algorithm to determine the best
weights for improved accuracy. This would enhance the model's performance and reliability
in estimating blood levels accurately.

This chapter provides a comprehensive review of the existing literature on anemia, focusing on its
prevalence, causes, and health implications. It examines the physiological role of hemoglobin, the
biochemical mechanisms underlying red blood cell production, and the factors contributing to anemia,
including nutritional deficiencies, genetic disorders, and chronic diseases. The chapter also explores
the diagnostic criteria for anemia, various treatment approaches, and the socio-economic impact of the
condition. Additionally, it highlights recent advancements in research, identifying gaps and future
directions for study. This review underscores the importance of a multidisciplinary approach to
effectively manage and mitigate anemia's widespread health effects.
Chapter 3
Research Methodology
The methodology section serves as a guide to the research process. We begin by establishing
the core problem addressed by this work. A comprehensive literature review is conducted to
identify existing knowledge and current research gaps. To bridge these gaps, the proposed
work is then introduced. This section details the specific algorithms and factors chosen for
anemia risk prediction. Subsequently, the optimisation strategies
implemented to refine the proposed work are explained. Finally, the methodology section
culminates with a discussion of the expected results and their analysis. A flowchart is
included below to illustrate the research progression. The section also describes the
data used for the analysis and the process through which it was collected.
The data for this analysis comes from the National Family Health Survey (NFHS-5),
conducted between 2019 and 2021. As the fifth edition in the NFHS series, NFHS-5 offers
comprehensive information on the population, health, and nutritional status across all Indian
states and union territories. The survey was primarily funded by the Government of India,
with additional technical support and funding from USAID's Demographic and Health
Surveys Program and ICF, USA. The Indian Council of Medical Research (ICMR) and the
National AIDS Research Institute (NARI) in Pune also supported some of the Clinical,
Anthropometric, and Biochemical (CAB) tests. NFHS-5 examined health and nutritional
issues across all Indian states and union territories, providing district-level estimates for
numerous key variables, similar to NFHS-4. New and significant bioinformatic data
introduced by NFHS-5 include methods and reasons for abortion, preschool education,
menstrual hygiene, expanded age ranges for measuring diabetes and hypertension for
individuals aged 15 and above, frequency of alcohol and tobacco use, micronutrient
components for children, expanded child immunization domains, death registration, and a
new component for non-communicable diseases (NCDs). These additions allowed for a more
comprehensive comparison of data over time. The NFHS-5 sample was designed to provide
estimates of several survey indicators at the national, state/union territory (UT), and district
levels. [14]
The survey covered a wide range of indicators,
encompassing 707 districts, 8 union territories, and 28 states. A uniform sample design,
representative at the national, state/UT, and district levels, was employed in each survey round.
Each district was divided into rural and urban strata. However, indicators related to
sexual behavior, HIV/AIDS
attitudes and behaviors, women's work status, husbands' background and awareness, and
domestic violence are available only at the state/UT and national levels. Each rural
stratum was further classified based on village population and
the proportion of individuals belonging to the SC/ST (scheduled castes and scheduled tribes).
Within each rural sampling stratum, a sample of villages was selected to serve as Primary
Sampling Units (PSUs), categorized based on the literacy rate of women aged six and older
before PSU selection.[15]
The Biomarker schedule measured blood pressure, weight, hip and waist circumference,
children's weight, children's height, haemoglobin levels, and random blood glucose levels for
men and women over the age of 15. Along with measuring children's height and haemoglobin
levels, men and women were asked to prick their finger and provide a few extra drops of
blood for laboratory testing to check for vitamin D3, malaria parasites, and HbA1c. The
Woman's Questionnaire aimed to gather comprehensive data on women's health and well-
being. It targeted women aged 15-49 and addressed a wide range of topics. Demographic
information like caste, age, religion, and media exposure was collected alongside
reproductive history details such as pregnancies, births, and terminations. Additionally, blood
tests for anemia were administered to all eligible women. The questionnaire further explored
health concerns including tobacco and alcohol use, tuberculosis awareness, and current
illnesses like cancer, diabetes, and heart disease. Notably, a specific module within the study
(State module subsample) delved into decision-making within households and its potential
connection to anemia. [17]
Machine learning is a technique that equips computers with the ability to learn and improve
from experience, much like humans do. Imagine a digital gardener nurturing a plant. Instead
of providing water and sunlight, this gardener feeds the plant—representing the computer—
with data and algorithms. This helps the plant comprehend and uncover hidden patterns,
make decisions, and advance over time, all without explicit instructions at each step. The
crux of machine learning can be stated as the art of taking raw data and transforming it into
valuable insights and predictions. This process enables machines to adapt and thrive in their
environment autonomously. The computers learn from the data they are given, identifying
patterns and making decisions based on this learning. Over time, they become more adept at
these tasks, requiring less and less guidance.
To put it simply, machine learning is a procedural approach through which systems gain and
comprehend information from various observations. They enrich and expand their
capabilities, bringing forth new knowledge without relying solely on pre-programmed
instructions. This allows them to evolve and perform increasingly complex tasks, much like a
digital gardener helping a plant to grow and flourish. [18]
Imagine having a collection of data points, each belonging to one of two distinct categories.
For instance, these points could represent emails classified as spam or not spam. An SVM
aims to create a clear division, like a straight line in a two-dimensional space, that separates
these categories with the greatest possible margin. This margin is the distance between the
line and the closest data points from each category, called support vectors. In simpler terms,
the SVM algorithm searches for the best dividing line or plane (called a hyperplane in higher
dimensions) that maximises the gap between the two classes of data. The data points that
define this margin are crucial for the SVM's operation, hence the name "support vectors."
Real-world data isn't always perfectly separable by a straight line. [19]
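To make the notion of margin concrete, the sketch below computes the margin of two candidate separating lines on a tiny made-up 2-D data set and shows which one an SVM would prefer. It does not train an SVM; it only evaluates the quantity that an SVM maximises. The data points, the candidate lines, and the function names are all assumptions chosen for illustration.

```python
import math


def signed_distance(point, w, b):
    """Signed distance from `point` to the hyperplane w . x + b = 0."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return (sum(wi * xi for wi, xi in zip(w, point)) + b) / norm


def margin(points, labels, w, b):
    """Smallest distance from any correctly signed point to the hyperplane.

    Labels are +1 or -1. A negative result means at least one point lies
    on the wrong side of the hyperplane.
    """
    return min(y * signed_distance(p, w, b) for p, y in zip(points, labels))


# Toy data: two points per class (illustrative values only).
points = [(1.0, 1.0), (2.0, 0.0), (4.0, 4.0), (5.0, 3.0)]
labels = [-1, -1, 1, 1]

# Two candidate separating lines, each given as (w, b).
line_a = ((1.0, 1.0), -5.0)  # x + y = 5
line_b = ((1.0, 0.0), -3.0)  # x = 3

margin_a = margin(points, labels, *line_a)
margin_b = margin(points, labels, *line_b)
print("margin of x + y = 5:", round(margin_a, 3))
print("margin of x = 3:", round(margin_b, 3))
print("preferred line:", "x + y = 5" if margin_a > margin_b else "x = 3")
```

Both lines separate the two classes, but the first leaves a wider gap to the nearest points, which is the property the SVM objective rewards.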
Figure 1 : SVM working principle
The figure above represents SVM in action. For classification problems, Support Vector
Machines (SVMs) use a geometric approach. By maximising the margin between the
hyperplane and the nearest data points (support vectors), they create a hyperplane that serves
as a decision boundary. This margin optimization promotes robust classification,
especially in high-dimensional spaces. Additionally, SVMs use kernel functions to transform
data that is not linearly separable into higher dimensions where linear separation is
possible. For instance, imagine classifying images of cats and dogs. A simple line might not
suffice. To handle this, SVMs project the data points into
a higher-dimensional space where a clear separation might exist. This projection is achieved
using mathematical functions called kernel functions. Even though we can't visualise this
higher-dimensional space, the SVM works effectively within it to find an optimal separation.
While commonly used for classification tasks, SVMs can also be adapted for regression
problems, where the goal is to predict a continuous value rather than a class label. SVMs
handle data with many features well, as they are effective in high-dimensional
spaces. Additionally, SVMs can deliver good results even with limited data,
making them suitable for scenarios where collecting large amounts of data is
difficult. They can also be adapted to various tasks through kernel functions, making them a
powerful tool for diverse machine learning applications. Some considerations should be
kept in mind to get the best results from the algorithm. Choosing the
right kernel function is crucial for optimal SVM performance and depends on the specific
data characteristics. Also, training SVMs can be computationally expensive,
especially for large datasets, so the size of the training set should be kept manageable. [20]
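The idea of projecting data into a higher-dimensional space can be illustrated without any library. In the sketch below, one-dimensional points that no single threshold can separate become separable after the mapping x -> (x, x^2), and a simple polynomial-style kernel is shown to equal the inner product in that mapped space. The data, the mapping, the threshold of 2.5, and the function names are assumptions chosen for illustration and are not taken from the thesis experiments.

```python
def feature_map(x):
    """Explicit mapping of a 1-D point into 2-D: x -> (x, x^2)."""
    return (x, x * x)


def kernel(x, z):
    """Kernel equal to the inner product of feature_map(x) and feature_map(z)."""
    return x * z + (x * z) ** 2


def separable_by_threshold(xs, ys):
    """True if a single threshold on the line splits the two classes."""
    pos = [x for x, y in zip(xs, ys) if y == 1]
    neg = [x for x, y in zip(xs, ys) if y == -1]
    return max(pos) < min(neg) or max(neg) < min(pos)


def predict(x, threshold=2.5):
    """Linear rule in the mapped space: classify by the x^2 coordinate."""
    _, x_squared = feature_map(x)
    return 1 if x_squared > threshold else -1


# Class +1 sits on both sides of class -1, so no threshold works in 1-D.
xs = [-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0]
ys = [1, 1, -1, -1, -1, 1, 1]

print("separable in 1-D:", separable_by_threshold(xs, ys))
print("separable after mapping:", all(predict(x) == y for x, y in zip(xs, ys)))

# The kernel gives the mapped inner product without building the mapped vectors.
a, b = feature_map(2.0), feature_map(3.0)
print(kernel(2.0, 3.0) == a[0] * b[0] + a[1] * b[1])
```

In practice a kernel SVM never builds the mapped vectors; it only evaluates the kernel on pairs of training points, which is what keeps high-dimensional mappings affordable.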
K-Nearest Neighbors (KNN) is a basic and popular machine learning technique that can be used
for both regression and classification problems. Its fundamental tenet is that data items
with comparable attributes typically fall into the same class. KNN leverages this concept to
make predictions on new data points by analysing the labels of their closest neighbors within
the training data. [21]
KNN operates in four phases: training, distance calculation, identification of the nearest
neighbours, and classification or regression. In the first phase, training, the algorithm
simply stores the entire training dataset; no explicit model is constructed. When a new data
point is introduced, the distance calculation phase uses a selected distance metric to
determine the distance between this point and every point in the training set. The third
phase finds the k closest data points to the new point, where k is a user-defined parameter.
Finally, in the classification or regression phase, the algorithm makes a prediction based on
these neighbours. For classification, the most frequent class label among the k nearest
neighbours is assigned to the new point. For regression, the average value of the target
variable over the k nearest neighbours is used as the prediction. KNN offers several
advantages. It is easy to understand and implement, making it a good choice for beginners in
machine learning. Additionally, KNN is non-parametric, meaning it makes no assumptions about
the underlying data distribution. This can be beneficial for complex datasets where other
algorithms might struggle. However, KNN also has limitations. Since it stores the entire
training dataset, it can be memory-intensive for large datasets. Additionally, KNN's
performance is highly dependent on the chosen distance metric and the value of k. A k that is
too small can lead to overfitting, while a k that is too large can lead to underfitting,
either of which can significantly reduce the algorithm's accuracy. [23]
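The four phases described above can be sketched in a few lines. This is a minimal illustration that assumes Euclidean distance and small in-memory lists; the function names and the toy data are hypothetical and do not come from the thesis code.

```python
import math
from collections import Counter


def k_nearest(train_X, train_y, query, k):
    """Return the targets of the k training points closest to `query` (Euclidean)."""
    distances = [(math.dist(x, query), y) for x, y in zip(train_X, train_y)]
    distances.sort(key=lambda pair: pair[0])
    return [y for _, y in distances[:k]]


def knn_classify(train_X, train_y, query, k=3):
    """Classification: majority vote among the k nearest neighbours."""
    return Counter(k_nearest(train_X, train_y, query, k)).most_common(1)[0][0]


def knn_regress(train_X, train_y, query, k=3):
    """Regression: mean target value of the k nearest neighbours."""
    neighbours = k_nearest(train_X, train_y, query, k)
    return sum(neighbours) / len(neighbours)


# "Training" is just keeping the data around (toy values, assumed for illustration).
points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0), (6.0, 6.0), (7.0, 7.5), (6.5, 6.0)]
labels = ["low", "low", "low", "high", "high", "high"]
targets = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]

predicted_label = knn_classify(points, labels, (1.8, 1.5), k=3)
predicted_value = knn_regress(points, targets, (1.8, 1.5), k=3)
```

Because every prediction scans the whole training set, the memory and run-time cost noted in the text is visible directly in `k_nearest`.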
The figure above illustrates how the fitted curve changes in the KNN algorithm for different
values of k. The diagram depicts the impact of varying the k parameter in K-Nearest Neighbors
(KNN) regression. The x-axis represents the feature space, while the three y-axes correspond
to the target variable for the training data (leftmost), the KNN model predictions (center),
and the test data (rightmost). Each plotted point likely signifies a sample with its feature
value and corresponding target value. The horizontal dashed lines presumably represent
different values of k, the number of neighbors considered in the KNN analysis. By comparing
how close the model predictions (center) are to the test data (rightmost) across varying k
values, we can gain insight into the model's generalizability and its potential for
overfitting. Overall, KNN is a versatile and effective machine learning algorithm,
particularly for smaller datasets. Its simplicity and ease of use make it a valuable tool for
various classification and regression tasks. However, it is important to be aware of its
limitations and to choose the distance metric and the value of k carefully to ensure good
performance. [24]
Building a decision tree involves selecting the best feature to split on at each node
according to a chosen criterion. Standard criteria include information gain, Gini impurity,
and variance reduction. Gini impurity measures the probability of incorrectly labeling an
element if it were randomly labeled according to the label distribution in the subset.
Information gain, derived from entropy, measures the reduction in uncertainty about the
target variable after partitioning the data based on an attribute. [26]
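The two classification criteria described above can be written down directly. The following is a minimal sketch for a binary split; the function names and the toy labels are assumptions for illustration only.

```python
import math
from collections import Counter


def gini(labels):
    """Gini impurity: 1 - sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def entropy(labels):
    """Shannon entropy of the label distribution, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(parent, left, right):
    """Entropy of the parent minus the size-weighted entropy of the two children."""
    n = len(parent)
    weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - weighted


parent = [1, 1, 0, 0]                                      # evenly mixed node
perfect_gain = information_gain(parent, [1, 1], [0, 0])    # split isolates each class
useless_gain = information_gain(parent, [1, 0], [1, 0])    # children as mixed as parent
```

A tree-building procedure would evaluate such a score for every candidate split and keep the one with the highest gain (or the lowest weighted impurity).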
The main benefit of decision trees is their capability to handle both numerical and
categorical data with minimal preprocessing; steps such as normalization or scaling are
generally not required. They can capture nonlinear relationships between the input features
and the outcome variable, making them suitable for a variety of data patterns. However,
decision trees are prone to overfitting, especially when they grow too deep and become
overly complex. Overfitting happens when the model also fits the noise in the training data,
resulting in weak generalisation to new, unseen data. To counter this, methods such as
pruning, setting a maximum depth, or requiring a minimum number of samples per leaf can be
used. Pruning reduces the size of the decision tree by removing parts that do not improve
its predictive power. There are two types of pruning: pre-pruning and post-pruning.
Pre-pruning stops the tree growth early by imposing constraints such as limiting the maximum
depth or requiring a minimum number of samples at a node. Post-pruning grows the tree to its
full depth and then discards nodes that contribute little or no predictive capability, as
judged on a validation set or by cross-validation. [27]
Despite their interpretability and simplicity, decision trees have some limitations. They can
be unstable, meaning that small changes in the data can result in very different trees. This
instability can be addressed by ensemble methods such as random forests. Random forests
create a 'forest' of multiple decision trees, each trained on a random subset of the data and
features, and aggregate their predictions to improve accuracy and robustness. By averaging
the results of many trees, random forests reduce the variance of the model, making it more
resistant to overfitting and more capable of handling the complexities of real-world data. In
summary, decision trees are a fundamental yet powerful tool in machine learning, offering
clear advantages in interpretability and flexibility. However, their susceptibility to overfitting
and instability requires careful tuning and the possible use of ensemble techniques like
random forests to enhance their performance. Understanding these nuances allows
practitioners to effectively leverage decision trees and random forests in a variety of
predictive modelling tasks. [28]
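The aggregation idea behind random forests can be sketched as follows. This is a deliberately simplified illustration and not a full random forest: each "tree" is a one-split stump fitted on a bootstrap sample and a single randomly chosen feature, and all names and data are hypothetical.

```python
import random
from collections import Counter


def fit_stump(X, y, feature):
    """Pick the threshold on one feature that misclassifies the fewest points."""
    best = None
    for threshold in sorted({row[feature] for row in X}):
        left = [label for row, label in zip(X, y) if row[feature] <= threshold]
        right = [label for row, label in zip(X, y) if row[feature] > threshold]
        left_label = Counter(left).most_common(1)[0][0]
        right_label = Counter(right).most_common(1)[0][0] if right else left_label
        errors = sum(v != left_label for v in left) + sum(v != right_label for v in right)
        if best is None or errors < best[0]:
            best = (errors, feature, threshold, left_label, right_label)
    return best[1:]


def fit_forest(X, y, n_trees=25, seed=0):
    """Train each stump on a bootstrap sample and one randomly chosen feature."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(X)) for _ in X]      # sample rows with replacement
        feature = rng.randrange(len(X[0]))            # random feature for this stump
        forest.append(fit_stump([X[i] for i in idx], [y[i] for i in idx], feature))
    return forest


def predict_forest(forest, row):
    """Aggregate the stumps' predictions by majority vote."""
    votes = [left if row[f] <= t else right for f, t, left, right in forest]
    return Counter(votes).most_common(1)[0][0]


X = [(1, 1), (2, 1), (1, 2), (2, 2), (8, 8), (9, 8), (8, 9), (9, 9)]
y = [0, 0, 0, 0, 1, 1, 1, 1]
forest = fit_forest(X, y)
low_prediction = predict_forest(forest, (0.5, 0.5))
high_prediction = predict_forest(forest, (9.5, 9.5))
```

An individual stump can be misled by an unlucky bootstrap sample, but the majority vote over many stumps is much less sensitive to any single sample, which is the variance-reduction effect described in the text.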
One of the prominent advantages of the Gaussian Naive Bayes algorithm is its computational
efficiency. The training phase is notably swift, requiring only the estimation of means and
variances for the features, making GNB particularly suitable for large datasets. Additionally,
the model's simplicity and interpretability enhance its appeal, allowing practitioners to easily
understand the influence of individual features on the classification outcome. Nevertheless,
the effectiveness of GNB may be restricted by its assumptions of Gaussian-distributed
features and feature independence. The classifier's performance may suffer if these
assumptions are not met, particularly if there is substantial correlation between the
features or a notable deviation from normality in the data distribution. Preliminary data
analysis and transformation (e.g., feature scaling, power transformations) might help reduce
these restrictions by better aligning the data with the algorithm's assumptions. [29]
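The claim that training only requires estimating means and variances can be made concrete with a minimal sketch. The function names, the small variance floor `eps`, and the toy data are assumptions for illustration; this is not the implementation used in the thesis experiments.

```python
import math
from collections import defaultdict


def fit_gnb(X, y, eps=1e-9):
    """Estimate, per class, the prior and each feature's mean and variance."""
    by_class = defaultdict(list)
    for row, label in zip(X, y):
        by_class[label].append(row)
    model = {}
    for label, rows in by_class.items():
        n = len(rows)
        columns = list(zip(*rows))
        means = [sum(col) / n for col in columns]
        # eps keeps the variance strictly positive (hypothetical safeguard).
        variances = [
            sum((v - m) ** 2 for v in col) / n + eps for col, m in zip(columns, means)
        ]
        model[label] = (n / len(X), means, variances)
    return model


def predict_gnb(model, row):
    """Return the class with the highest log prior plus summed Gaussian log-likelihoods."""
    def log_posterior(prior, means, variances):
        log_likelihood = sum(
            -0.5 * math.log(2 * math.pi * var) - (x - m) ** 2 / (2 * var)
            for x, m, var in zip(row, means, variances)
        )
        return math.log(prior) + log_likelihood

    return max(model, key=lambda label: log_posterior(*model[label]))


X = [(1.0, 2.0), (1.2, 1.8), (0.8, 2.2), (5.0, 6.0), (5.2, 5.8), (4.8, 6.2)]
y = [0, 0, 0, 1, 1, 1]
model = fit_gnb(X, y)
near_class_0 = predict_gnb(model, (1.1, 2.1))
near_class_1 = predict_gnb(model, (5.1, 5.9))
```

Summing the per-feature log-likelihoods is exactly where the independence assumption enters: correlated features are counted as if each provided separate evidence.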
3.4 Chapter Summary
This chapter outlines the research methodology employed in the study. It details the research
design, sampling methods, and data collection techniques used to investigate anemia. The
chapter explains the selection of participants, including inclusion and exclusion criteria, and
describes the tools and instruments utilized for data gathering, such as surveys,
questionnaires, and clinical tests. It also covers the procedures for data analysis, including
statistical methods and software used. Ethical considerations, including informed consent and
confidentiality measures, are discussed to ensure the study's integrity. The chapter concludes
with a justification for the chosen methodology, emphasizing its suitability for achieving the
research objectives.
Chapter 4
Proposed Work
This study utilises data from the National Family Health Survey 5 (NFHS-5) to explore and
analyse various health-related metrics. The proposed methodology encompasses several
critical steps, each designed to ensure robust and reliable results.
The first step involves importing the NFHS-5 dataset into a Jupyter notebook environment.
Initial exploratory data analysis (EDA) is conducted to understand the distribution and
characteristics of the data. This includes visualising data distributions, identifying patterns,
and summarising key statistics to gain a comprehensive overview of the dataset.
Subsequently, data preprocessing is performed to prepare the dataset for analysis. This step
includes handling missing values by imputing them with the mean values of the respective
features. Moreover, data standardization is carried out to ensure that all features
contribute equally to the analysis. This step is pivotal for improving the performance of
the machine learning models.
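The two preprocessing operations named above can be sketched for a single numeric feature. The helper names and the sample values are assumptions for illustration; missing entries are represented here as `None`.

```python
import math


def impute_mean(column):
    """Replace missing entries (None) with the mean of the observed values."""
    observed = [v for v in column if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in column]


def standardize(column):
    """Rescale a column to zero mean and unit (population) standard deviation."""
    mean = sum(column) / len(column)
    std = math.sqrt(sum((v - mean) ** 2 for v in column) / len(column))
    if std == 0:
        return [0.0 for _ in column]      # constant column: nothing to scale
    return [(v - mean) / std for v in column]


raw = [10.0, None, 14.0]                  # toy feature with one missing value
filled = impute_mean(raw)                 # the gap becomes the observed mean, 12.0
scaled = standardize(filled)
```

After standardization each feature is expressed in units of its own standard deviation, which matters for distance-based and margin-based models such as KNN and SVM.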
The preprocessed data is then split into two subsets: training and testing datasets. The
training dataset is used to train the machine learning models, whereas the testing dataset
is reserved for evaluating the performance of these models.
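A reproducible split of this kind can be sketched as below. The 80/20 ratio and the seed are assumptions made for the example, since the text does not state the proportions used, and the function name is hypothetical.

```python
import random


def train_test_split(rows, test_fraction=0.2, seed=42):
    """Shuffle row indices reproducibly and split them into train and test parts."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    n_test = round(len(rows) * test_fraction)
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    return [rows[i] for i in train_idx], [rows[i] for i in test_idx]


rows = list(range(10))                    # stand-in for ten survey records
train_rows, test_rows = train_test_split(rows)
```

Fixing the seed makes the split repeatable, so that models trained with default and tuned hyperparameters are compared on the same held-out records.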
Five different machine learning models are applied to the training dataset. Initially, these
models are run with default hyperparameters, and their performance is recorded. The models
used in this study include Decision Trees, Random Forest, Support Vector Machine,
Gaussian Naive Bayes, and K-Nearest Neighbors (KNN).
To improve the effectiveness of the models, hyperparameter tuning is conducted using the
grid search strategy. This strategy involves systematically searching through a predefined
set of hyperparameters to identify the combination that yields the best performance. The
results of the tuned models are compared to those obtained with the default settings.
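The exhaustive search described above amounts to scoring every combination in a predefined grid. The sketch below is generic; `toy_score` is a hypothetical stand-in for "train a model with these settings and return its validation score", and the grid values are assumptions rather than the settings used in this study.

```python
from itertools import product


def grid_search(param_grid, score_fn):
    """Evaluate every combination in `param_grid` and return the best one."""
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(**params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def toy_score(k, weight):
    """Hypothetical validation score that peaks at k=5 with distance weighting."""
    return -((k - 5) ** 2) + (0.1 if weight == "distance" else 0.0)


best_params, best_score = grid_search(
    {"k": [1, 3, 5, 7], "weight": ["uniform", "distance"]}, toy_score
)
```

The number of evaluations is the product of the grid sizes (eight here), which is why the grid has to be kept small for models that are expensive to train.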
The performance of each model is meticulously tracked and recorded throughout the process.
Key performance metrics such as accuracy, precision, recall, F1-score, and ROC-AUC are
calculated to evaluate the effectiveness of the models.
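For a binary outcome, the threshold-based metrics listed above follow directly from the confusion-matrix counts. The sketch below covers accuracy, precision, recall, and F1-score; ROC-AUC is omitted because it requires predicted scores rather than hard labels. The function name and the toy labels are assumptions for illustration.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall and F1 from true and predicted labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Toy labels: 3 positive cases, 5 negative cases, one miss and one false alarm.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]
metrics = binary_metrics(y_true, y_pred)
```

In this toy case the accuracy (0.75) is higher than the precision and recall (both 2/3), which illustrates why accuracy alone can be an optimistic summary when the classes are imbalanced.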
To further refine the analysis, dimensionality reduction techniques are applied to the dataset.
This step aims to reduce the number of features while preserving the most significant
information. The reduced dataset is then subjected to the same five machine learning models,
and their performance is evaluated and compared to the results obtained with the original
dataset.
Finally, a comprehensive performance analysis is conducted to compare the results of
all models before and after dimensionality reduction. The best-performing model is selected
based on its overall performance across the various metrics. This model is considered the
most suitable for the given dataset and research objectives.
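The dimensionality reduction step can be illustrated with Principal Component Analysis, the technique named later in this chapter. The sketch below finds only the first principal component, using power iteration on the sample covariance matrix; the function name, the starting vector, and the toy data are assumptions, and a numerical library would normally be used in practice.

```python
import math


def first_principal_component(X, iterations=200):
    """Return the leading principal direction and the data projected onto it."""
    n, d = len(X), len(X[0])
    means = [sum(col) / n for col in zip(*X)]
    centered = [[v - m for v, m in zip(row, means)] for row in X]
    # Sample covariance matrix of the centered data.
    cov = [
        [sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
        for i in range(d)
    ]
    # Power iteration: repeated multiplication by cov converges to the leading
    # eigenvector, provided the start vector is not orthogonal to it.
    v = [1.0 / (i + 1) for i in range(d)]
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    scores = [sum(a * b for a, b in zip(row, v)) for row in centered]
    return v, scores


# Toy data lying exactly on the line y = 2x, so one component captures everything.
X = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
direction, scores = first_principal_component(X)
```

Here two correlated features are replaced by a single score per sample without losing information; with real survey data some variance is discarded, and the number of retained components becomes a modelling choice.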
The figure above depicts the proposed workflow for investigating the application of machine
learning algorithms to predict the risk of anemia in Indian children. We used NFHS-5 survey
data. After importing the data, we performed exploratory data analysis to understand the data
distributions and to find any missing values or outliers. Pre-processing of the data included
outlier removal, normalization, and imputation of missing values by mean substitution. The
data was then split into training and testing sets. We used five machine learning models:
Support Vector Machine (SVM), K-Nearest Neighbors (KNN), Gaussian Naive Bayes (GNB), Decision
Tree (DT), and Random Forest (RF). Initially, each model was trained with default parameters.
Subsequently, we performed hyperparameter tuning to optimize each model's performance
further. To reduce data dimensionality, Principal Component Analysis (PCA) was employed
before re-training all five models. Finally, a comparative performance analysis was
conducted using metrics including recall, F1-score, precision, and accuracy.
This chapter presents the proposed work for addressing anemia, outlining the objectives,
hypotheses, and research plan. It details the specific interventions and strategies to be
implemented, such as nutritional programs, awareness campaigns, and clinical trials for new
treatments. The chapter describes the target population and the criteria for participant
selection. It also includes a timeline for the project phases, from initial planning to execution
and evaluation. Expected outcomes and potential impacts are discussed, emphasizing how the
proposed work aims to advance understanding, improve management, and reduce the
prevalence of anemia. The chapter concludes with a discussion of anticipated challenges and
mitigation strategies.
Chapter 5
The results indicate that all five machine learning models achieved relatively high accuracy
in predicting anemia risk in Indian children using the NFHS-5 data. However, there were
variations in performance across the models and evaluation stages. Support Vector Machine
(SVM) and Decision Tree (DT) emerged as the frontrunners, achieving a perfect accuracy of
1.00 with hyperparameter tuning. This suggests that these models were able to learn the
underlying patterns in the data exceptionally well and make accurate predictions on the
unseen testing set. Random Forest (RF) also exhibited strong performance, closely following
SVM and DT with an accuracy of 0.98 after hyperparameter tuning. This indicates that the
ensemble learning approach of RF was highly effective in this task. K-Nearest Neighbors
(KNN) achieved significant improvement with hyperparameter tuning, reaching an accuracy
of 0.93. This highlights the importance of hyperparameter optimization for enhancing model
generalizability. Gaussian Naive Bayes (GNB) yielded the lowest accuracy across all stages,
with a maximum accuracy of 0.14 after hyperparameter tuning. This suggests that the
assumption of independence between features may not hold.
Table 2 :
The results in Table 1 indicate that all five machine learning models achieved high accuracy
on the training data (> 98%). However, there is greater variation in performance on the hold-
out test data. Decision Tree (DT) achieved the highest overall accuracy (98%) and F1-score
(0.94) on the test data. Random Forest (RF) had the second highest accuracy (98%) and F1-
score (0.99) but slightly lower precision (0.991) and recall (1.0) compared to DT. Support
Vector Machine (SVM) also achieved high accuracy (98%) and F1-score (0.99) but with
slightly lower values than both DT and RF. Gaussian Naive Bayes (GNB) had the lowest
accuracy (14%) and F1-score (0.17) on the test data, indicating poor performance in
correctly classifying anemia risk. K-Nearest Neighbors (KNN) had an intermediate accuracy
(93%) and F1-score (0.96) on the test data. Interestingly, PCA appears to have had minimal
impact on model performance, with accuracy values on the test data nearly identical to those
before PCA dimensionality reduction.
Figure 6 : Performance analysis
This chapter presents the findings of the study on anemia, followed by an in-depth discussion
of the results. The data collected from surveys, clinical tests, and other methods are analyzed
and displayed through tables, graphs, and charts. Key outcomes, such as the prevalence of
anemia, its correlation with various demographic factors, and the effectiveness of proposed
interventions, are highlighted. The discussion interprets these findings, comparing them with
existing literature and theoretical frameworks. It also addresses any unexpected results,
potential limitations of the study, and their implications. Finally, the chapter emphasizes the
study's contributions to the field and suggests areas for future research.
Chapter 6
This study conducted a comparative analysis of various machine learning models for
predicting anemia risk in Indian children using data from the National Family Health Survey
5 (NFHS-5). The investigation revealed promising results, with Support Vector Machine
(SVM) and Decision Tree (DT) achieving a perfect accuracy of 1.00 after hyperparameter
tuning. Random Forest (RF) also demonstrated strong performance with an accuracy of 0.98,
highlighting the effectiveness of ensemble learning. The findings suggest that machine
learning holds significant potential for developing robust and accurate tools to predict anemia
risk in this population. Dimensionality reduction techniques showed limited impact on model
performance in this specific case. However, incorporating additional data sources, exploring
advanced feature engineering, and integrating the model with healthcare systems present
exciting avenues for future research. Further exploration of Explainable AI (XAI) techniques
and the development of models focused on specific anemia types can provide valuable
insights for targeted interventions. Ultimately, this research paves the way for utilizing
machine learning to enhance the early detection and management of anemia in Indian
children, leading to improved health outcomes. Several directions for future work follow
from this study. Additional data sources, such as medical records, dietary habits, or
environmental factors, could be incorporated to improve the accuracy and comprehensiveness
of the risk prediction models. Advanced feature engineering could be used to create new
features derived from existing data, and feature selection techniques could identify the
most informative elements for model training. A user-friendly interface could integrate the
best-performing model into existing healthcare systems, allowing quick and efficient anemia
risk assessment during child checkups. More advanced deep learning architectures, such as
convolutional neural networks (CNNs) or recurrent neural networks (RNNs), could be
investigated to capture more complex relationships within the data. To assess
generalizability to diverse populations, the best-performing model could be validated
externally on data from different geographical regions within India or from other countries.
XAI techniques could be implemented to understand the rationale behind the model's
predictions, providing insight into the factors that contribute most to anemia risk in the
specific context of the data. Finally, a mobile application incorporating the model could be
developed, empowering parents and caregivers to assess their children's risk at home and
potentially leading to earlier diagnosis and treatment.
This chapter summarizes the key findings of the study on anemia, highlighting the significant
insights and their implications for public health. It reaffirms the importance of addressing
anemia through targeted interventions and emphasizes the role of comprehensive health
strategies. The chapter also discusses the limitations of the study, providing a critical
evaluation of the research methodology and outcomes. Looking forward, it outlines potential
areas for future research, suggesting further investigation into genetic factors, innovative
treatments, and large-scale public health initiatives. The chapter concludes by stressing the
need for continued efforts to mitigate anemia's impact on global health.
References
[1] Belali, T. M. (2022). Iron deficiency anaemia: prevalence and associated factors
among residents of northern Asir Region, Saudi Arabia. Scientific Reports, 12(1).
https://doi.org/10.1038/s41598-022-23969-1
[2] Cho, H., Lee, S., & Baek, Y. (2021b). Anemia diagnostic system based on
impedance measurement of red blood cells. Sensors, 21(23), 8043.
https://doi.org/10.3390/s21238043
https://www.who.int/data/gho/data/themes/topics/anaemia_in_women_and_children
[4] Dey, S., Goswami, S., & Dey, T. (2014). Identifying predictors of childhood anaemia
in North-East India. Journal of Health, Population and Nutrition, 31(4).
https://doi.org/10.3329/jhpn.v31i4.20001
[5] Thakur, H., Chand, R., & Narayan, R. Burden Of Anemia And Its Socio-Economic
Determinates Among Pregnant Women In Himachal Pradesh, India: A Cross-Sectional
Study.
[6] Mantadakis, E., Chatzimichael, E., & Zikidou, P. (2020). IRON DEFICIENCY
ANEMIA IN CHILDREN RESIDING IN HIGH AND LOW-INCOME COUNTRIES:
RISK FACTORS, PREVENTION, DIAGNOSIS AND THERAPY. Mediterranean
Journal of Hematology and Infectious Diseases, 12(1), e2020041.
https://doi.org/10.4084/mjhid.2020.041
[7] Balarajan, Y., Ramakrishnan, U., Özaltin, E., Shankar, A. H., & Subramanian, S.
(2011). Anaemia in low-income and middle-income countries. Lancet, 378(9809), 2123–
2135. https://doi.org/10.1016/s0140-6736(10)62304-5
[8] Neogi, S. B., Sharma, J., Pandey, S., Zaidi, N., Bhattacharya, M., Kar, R., Kar, S. S.,
Purohit, A., Bandyopadhyay, S., & Saxena, R. (2020). Diagnostic accuracy of point-of-
care devices for detection of anemia in community settings in India. BMC Health
Services Research, 20(1). https://doi.org/10.1186/s12913-020-05329-9
[9] Pivina, L., Semenova, Y., Doşa, M.dence from Rural China. International Journal of
Environmental Research and Public Health/International Journal of Environmental
Research and Public Health, 16(15), 2761. https://doi.org/10.3390/ijerph16152761
[12] Rahman, M. A., Khan, M. N., & Rahman, M. M. (2020). Maternal anaemia and risk
of adverse obstetric and neonatal outcomes in South Asian countries: A systematic review
and meta-analysis. Public Health in Practice, 1, 100021.
https://doi.org/10.1016/j.puhip.2020.100021
[13] Song, J., Dong, H., Xu, F., Wang, Y., Li, W., Jue, Z., Wei, L., Yue, Y., & Zhu, C.
(2021). The association of severe anemia, red blood cell transfusion and necrotizing
enterocolitis in neonates. PloS One, 16(7), e0254810.
https://doi.org/10.1371/journal.pone.0254810
[14] Meitei, A. J., Saini, A., Mohapatra, B. B., & Singh, K. J. (2022). Predicting child
anaemia in the North-Eastern states of India: a machine learning approach. International
Journal of System Assurance Engineering and Management, 13(6), 2949-2962.
[15] Saputra, D. C. E., Sunat, K., & Ratnaningsih, T. (2023, February). A new artificial
intelligence approach using extreme learning machine as the potentially effective model to
predict and analyze the diagnosis of anemia. In Healthcare (Vol. 11, No. 5, p. 697). MDPI.
[16] Wang, L., Li, M., Dill, S. E., Hu, Y., & Rozelle, S. (2019). Dynamic anemia status from
infancy to preschool-age: evidence from rural China. International journal of environmental
research and public health, 16(15), 2761.
[17] Hasan, M., Tahosin, M. S., Farjana, A., Sheakh, M. A., & Hasan, M. M. (2023, May). A
harmful disorder: Predictive and comparative analysis for fetal Anemia disease by using
different machine learning approaches. In 2023 11th International Symposium on Digital
Forensics and Security (ISDFS) (pp. 1-6). IEEE.
[18] Acharya, S., Swaminathan, D., Das, S., Kansara, K., Chakraborty, S., Kumar, D., ... &
Aatre, K. R. (2019). Non-invasive estimation of hemoglobin using a multi-model stacking
regressor. IEEE journal of biomedical and health informatics, 24(6), 1717-1726.
[19] Bhavsar, H., & Ganatra, A. (2012). A comparative study of training algorithms for
supervised machine learning. International Journal of Soft Computing and Engineering
(IJSCE), 2(4), 2231-2307.
[20] Dalvi, P. T., & Vernekar, N. (2016, May). Anemia detection using ensemble learning
techniques and statistical models. In 2016 IEEE International Conference on Recent Trends
in Electronics, Information & Communication Technology (RTEICT) (pp. 1747-1751).
IEEE.
[21] Khan, J. R., Chowdhury, S., Islam, H., & Raheem, E. (2019). Machine learning
algorithms to predict the childhood anemia in Bangladesh. Journal of Data Science, 17(1),
195-218.
[22] Jaiswal, M., Srivastava, A., & Siddiqui, T. J. (2019). Machine learning algorithms for
anemia disease prediction. In Recent trends in communication, computing, and electronics:
Select proceedings of IC3E 2018 (pp. 463-469). Springer Singapore.
[23] Çil, B., Ayyıldız, H., & Tuncer, T. (2020). Discrimination of β-thalassemia and iron
deficiency anemia through extreme learning machine and regularized extreme learning
machine based decision support system. Medical hypotheses, 138, 109611.
[24] Fatima, M., & Pasha, M. (2017). Survey of machine learning algorithms for disease
diagnostic. Journal of Intelligent Learning Systems and Applications, 9(01), 1-16.
[25] Mohammed, M. S., Ahmad, A. A., & Murat, S. A. R. I. (2020, June). Analysis of anemia
using data mining techniques with risk factors specification. In 2020 International Conference
for Emerging Technology (INCET) (pp. 1-5). IEEE.
[26] Vardhan, H., Sd, S., Sriram, K., & Kakarla, Y. (2024, January). Disease Prediction
Based on Symptoms Using Ensemble and Hybrid Machine Learning Models. In 2024 14th
International Conference on Cloud Computing, Data Science & Engineering (Confluence)
(pp. 799-804). IEEE.
[27] Shah, D., Patel, S., & Bharti, S. K. (2020). Heart disease prediction using machine
learning techniques. SN Computer Science, 1(6), 345.
[28] Dey, A. K., Dehingia, N., Bhan, N., Thomas, E. E., McDougal, L., Averbach, S., ... &
Raj, A. (2022). Using machine learning to understand determinants of IUD use in India:
Analyses of the National Family Health Surveys (NFHS-4). SSM-Population Health, 19,
101234.
[29] El-Kenawy, E. S. M., Eid, M. M., & Ibrahim, A. (2021). Anemia estimation for covid-
19 patients using a machine learning model. Journal of Computer Science and Information
Systems, 17(11), 2535-1451.
[30] Garduno-Rapp, N. E., Ng, Y. S., Weon, J. L., Saleh, S. N., Lehmann, C. U., Tian, C., &
Quinn, A. (2024). Early identification of patients at risk for iron-deficiency anemia using
deep learning techniques. American Journal of Clinical Pathology, aqae031.
[31] Kilicarslan, S., Celik, M., & Sahin, Ş. (2021). Hybrid models based on genetic
algorithm and deep learning algorithms for nutritional Anemia disease classification.
Biomedical Signal Processing and Control, 63, 102231.