ML Final Report
machine learning
1st Vivek Jain, 2021218, IIIT Delhi, New Delhi, India, [email protected]
2nd Manan Garg, 2021163, IIIT Delhi, New Delhi, India, [email protected]
3rd Tushar, 2020478, IIIT Delhi, New Delhi, India, [email protected]
I. ABSTRACT
Malaria is a disease caused by Plasmodium parasites and transmitted through the bite of an infected female Anopheles mosquito. More than 400,000 people die of malaria every year. The standard diagnostic technique involves preparing a blood smear and examining the stained smear under a microscope to detect the parasite genus Plasmodium, a process that relies heavily on the expertise of trained professionals. In this paper, we compare different feature extraction and classification methods on red blood cell smear images to develop an accurate and reliable automated system for detecting malaria. The dataset used in this study was provided by the National Institutes of Health. Accuracy, recall, precision, F1 score, and area under the curve (AUC) were used to compare the models and select the best-performing architecture.
II. MOTIVATION
Malaria remains a global health concern with limited access to accurate diagnosis, especially in resource-limited areas. The disease claims the lives of more than 400,000 people annually. This project aims to use machine learning to automate the detection of malaria parasites in red blood cell (RBC) images. By leveraging machine learning, we can enhance accuracy, efficiency, scalability, and cost-effectiveness while continuously improving the diagnostic process. This project addresses the critical need for faster and more accessible malaria diagnosis.
The idea emerged from recognizing the challenges of manual malaria diagnosis and the potential of machine learning. Witnessing the impact of limited healthcare resources on accurate malaria detection sparked the concept. Advances in machine learning inspired the vision of an automated system for efficient, reliable, and accessible malaria diagnosis from RBC images.
III. INTRODUCTION
Malaria, a fatal mosquito-borne disease, is caused by Plasmodium protozoan parasites that infect red blood cells and are transmitted through the bites of infected female Anopheles mosquitoes. The symptoms of an infected person are similar to those of the flu and may also include high fever, fatigue, shivers, septicemia, pneumonia, gastritis, enteritis, nausea, vomiting, migraines, and, in extreme cases, convulsions and coma, which can result in death. Malaria is not contagious between people, but it can be transmitted from mother to fetus, and patients can become infected through blood transfusions or shared instruments such as syringes. Malaria is most prevalent in regions with mild, humid climates that are close to natural water resources and the territory of Anopheles mosquitoes.
Blood films are commonly used for microscopic examination of blood cells to diagnose malaria. Every year, hundreds of millions of blood films are examined for malaria, which requires a trained microscopist to manually count parasites and infected red blood cells. Accurate parasite counts are crucial not only for malaria diagnosis, but also for testing for drug resistance, measuring drug efficacy, and classifying disease severity.
Misdiagnosis is costly in both directions. A false-negative result leads to the unnecessary administration of antibiotics, a second consultation, lost work days, and, in some cases, the progression of malaria to a severe form. A false-positive result leads to the unnecessary use of anti-malaria medications and exposes the patient to their potential adverse effects, such as nausea, abdominal pain, diarrhea, and sometimes severe complications. Consequently, the F1 Score will serve as the primary metric for evaluating our model. The F1 Score is the harmonic mean of precision and recall, so it takes both false positives and false negatives into account. F1 is typically more informative than accuracy in such situations, although it is less intuitive.
We are attempting to automate the diagnostic system using various machine learning techniques that will assist human specialists in producing the most accurate diagnosis possible. We performed extensive preprocessing on the NIH malaria dataset and employed a few baseline classification models, including Decision Tree, Logistic Regression, and Support Vector Machine. In addition, a number of preprocessing techniques, including various filters, global feature extraction, and local feature extraction, were applied to the dataset in order to optimize performance on a held-out test set. To compare the baseline models, we used a variety of evaluation metrics, including accuracy, recall, precision, F1 Score, and area under the curve (AUC).
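As a minimal sketch of how the F1 Score combines precision and recall, the following Python snippet computes all three from true and predicted labels. The function name `precision_recall_f1` and the toy label lists are illustrative assumptions and are not taken from our experiments; the labels follow the dataset convention of 0 for uninfected and 1 for parasitized.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Return (precision, recall, F1) for the given positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)  # true positives
    fp = sum(1 for t, p in pairs if t != positive and p == positive)  # false positives
    fn = sum(1 for t, p in pairs if t == positive and p != positive)  # false negatives

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall.
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical labels: 1 = parasitized, 0 = uninfected.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0]
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
print(precision, recall, f1)
```

Because one missed infection (a false negative) and one false alarm (a false positive) both lower the score, F1 penalizes the two kinds of misdiagnosis discussed above, which plain accuracy does not separate.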
IV. LITERATURE SURVEY
Malaria is a significant global health concern that affects millions of people every year, primarily in developing nations. Early detection and treatment are among the most effective methods to manage malaria. Rapid diagnostic tests and microscopic examination of blood smears are commonly employed for malaria diagnosis, but they are time-consuming and prone to errors due to subjective interpretation by human experts. Machine learning (ML) techniques offer a promising way to improve the accuracy and speed of malaria diagnosis by automatically analyzing rapid diagnostic tests (RDTs) and microscopic images.
In recent years, researchers have applied various ML algorithms to malaria detection from microscopic images. G.B. Saiprasath et al. (2019) compared the performance of five ML algorithms, namely Decision Tree, Random Forest, Naïve Bayes, K-Nearest Neighbors, and Support Vector Machine, for malaria detection using microscopic images. The study found that the Random Forest algorithm achieved the highest accuracy, 96.5%. Mahdieh Poostchi et al. (2018) proposed an automated system for malaria detection using deep learning algorithms. The authors employed Convolutional Neural Networks (CNNs) to classify malaria-infected cells from microscopic images. The system achieved high accuracy, with a sensitivity of 96.6%. Vijayalakshmi and Rajesh Kanna (2019) also presented a deep learning method for malaria detection using microscopic images. The authors used a pre-trained CNN model, VGG-16, to classify infected and uninfected cells. This approach achieved an accuracy of 95.23%.
In a review article, Gaurav Kumar and Pradeep Kumar Bhatia (2014) discussed different feature extraction techniques employed in image processing systems. The review emphasized that feature extraction is a crucial step in image processing systems because it reduces the image's dimensionality and enhances the ML algorithm's performance. Nagarajan Deivanayagampillai et al. (2017) proposed a feature extraction technique for melanoma detection in dermoscopic images. The authors used a combination of global and local features for classification, achieving an accuracy of 92%.
V. DATASET
The dataset, the NIH Malaria Dataset, was obtained from the National Institutes of Health (NIH). It consists of 27,558 labelled images of stained red blood cells segmented from blood smears: 13,779 images without infection and 13,779 images with parasites. The segmented red blood cell regions are three-channel (RGB) images that vary in size from 110 to 150 pixels. Positive samples contained Plasmodium, while negative samples lacked Plasmodium but may have contained other substances, such as staining impurities. The images were then resampled to 64 x 64 pixels with a channel depth of 3. Additionally, we computed the greyscale channel. We then created a list titled 'labels' that marks uninfected cells as 0 and parasitized cells as 1. After shuffling, the dataset is divided into a training set and a testing set in a 75:25 ratio.
Fig. 1. Histogram depicting number of images from each class
VI. METHODOLOGY
We used several filters, including the edge detection algorithms Canny, Sobel, and Scharr, as well as Contrast Limited Adaptive Histogram Equalization (CLAHE). We also used local feature extraction techniques such as SIFT (Scale-Invariant Feature Transform) and KAZE, in addition to global feature extraction, which derives features from shape, texture, and color histogram. We then applied several fundamental machine learning models to the processed data: Decision Tree, Logistic Regression, Support Vector Machine (SVM), Random Forest, KNN, AdaBoost, and Naive Bayes. Finally, we applied ensembling to the top-performing models using bagging and boosting techniques, such as Random Forest and AdaBoost, with grid search for hyperparameter optimization.
Preprocessing:
The unstructured pixel data in image patches does not directly aid the classification process. Instead, we employ a representation that is unaffected by translation, rotation, and constant intensity offsets. The primary concern of the Plasmodium detection problem is the geometry of the objects within the input segments. We need a representation that is translation-, intensity-, rotation-, and size-invariant, so we rescaled the images, which were collected in various sizes, to a common size.
Three edge detection filters, namely Canny, Sobel, and Scharr, have been used. We preprocessed the dataset with edge detection because the parasitized cell images contain purple-stained regions; edge detection enhances these regions in the image of the infected cell, which aids feature extraction.
Histogram equalisation is utilized for image segmentation. However, plain histogram equalization can amplify image noise considerably because it operates on the global contrast of the image rather than on its local contrast. Therefore, applying global equalization to our images may not yield optimal results. CLAHE (Contrast Limited Adaptive Histogram Equalization) was thus employed. It performs histogram equalization on small regions, or tiles, and limits the contrast amplification within each tile.
[...] SVM model achieving the highest accuracy of 87.2 percent. This can be attributed to the low contrast of the image, the large number of features, and important information not being given due weight because of the lack of feature extraction.
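To make the edge-detection step of the preprocessing more concrete, the sketch below computes Sobel and Scharr gradient magnitudes on a small greyscale patch using only the Python standard library. It is an illustration under stated assumptions, not the code used in this work: the helper names (`correlate3x3`, `gradient_magnitude`) and the 6 x 6 toy patch are hypothetical, border pixels are simply left at zero, and Canny and CLAHE are not reproduced here. In practice an image library such as OpenCV would normally be used for these filters.

```python
# Standard 3x3 horizontal-derivative kernels; the vertical ones are their transposes.
SOBEL_X = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]
SCHARR_X = [[-3, 0, 3], [-10, 0, 10], [-3, 0, 3]]


def transpose(kernel):
    return [list(row) for row in zip(*kernel)]


def correlate3x3(image, kernel):
    """Apply a 3x3 kernel to interior pixels of a 2D list; borders stay 0."""
    h, w = len(image), len(image[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            out[y][x] = float(sum(
                kernel[j][i] * image[y + j - 1][x + i - 1]
                for j in range(3) for i in range(3)
            ))
    return out


def gradient_magnitude(image, kernel_x):
    """Combine horizontal and vertical responses into an edge-strength map."""
    gx = correlate3x3(image, kernel_x)
    gy = correlate3x3(image, transpose(kernel_x))
    return [[(a * a + b * b) ** 0.5 for a, b in zip(rx, ry)]
            for rx, ry in zip(gx, gy)]


# Hypothetical 6x6 greyscale patch: bright background (200) with a
# dark 2x2 blob (50) standing in for a stained region.
patch = [[200] * 6 for _ in range(6)]
for y in (2, 3):
    for x in (2, 3):
        patch[y][x] = 50

sobel_edges = gradient_magnitude(patch, SOBEL_X)
scharr_edges = gradient_magnitude(patch, SCHARR_X)
print(round(sobel_edges[1][1], 2), round(scharr_edges[1][1], 2))
```

Both filters respond only where intensity changes, so a uniform region produces zero output while the boundary of the dark blob is highlighted. Scharr uses larger kernel weights than Sobel, so its raw responses are larger for the same edge.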