ML Final Report

This document discusses a study on automating malaria detection using machine learning techniques applied to red blood cell images. The research compares various feature extraction and classification methods to enhance the accuracy and efficiency of malaria diagnosis, addressing the challenges of manual detection. The study utilizes the NIH malaria dataset and evaluates multiple machine learning models, ultimately achieving significant improvements in detection accuracy.


Malaria detection through RBC images using machine learning

1st Vivek Jain (2021218), IIIT Delhi, New Delhi, India
2nd Manan Garg (2021163), IIIT Delhi, New Delhi, India
3rd Tushar (2020478), IIIT Delhi, New Delhi, India

I. ABSTRACT
Malaria is a disease caused by a Plasmodium parasite and transmitted by the bite of an infected female Anopheles mosquito. More than 400,000 people die of malaria each year. The standard detection technique involves preparing a blood smear and examining the stained smear under a microscope for parasites of the genus Plasmodium, a process that relies heavily on the expertise of trained professionals. In this paper, we compare different feature extraction and classification methods on red blood cell smear images to develop an accurate and reliable automated system for detecting malaria. The dataset used in this study was provided by the National Institutes of Health. Accuracy, recall, precision, F1 score, and area under the curve (AUC) were used to compare the models and select the best-performing architecture.

II. MOTIVATION
Malaria remains a global health concern with limited access to accurate diagnosis, especially in resource-limited areas. The disease claims the lives of more than 400,000 people annually. This project aims to use machine learning to automate the detection of malaria parasites in red blood cell (RBC) images. By leveraging machine learning, we can enhance accuracy, efficiency, scalability, and cost-effectiveness while continuously improving the diagnostic process. This project addresses the critical need for faster and more accessible malaria diagnosis.
The idea emerged from recognizing the challenges of manual malaria diagnosis and the potential of machine learning. Witnessing the impact of limited healthcare resources on accurate malaria detection sparked the concept. The advancement of machine learning inspired the vision of an automated system for efficient, reliable, and accessible malaria diagnosis through RBC images.

III. INTRODUCTION
Malaria, a fatal mosquito-borne disease, is caused by Plasmodium protozoan parasites that infect red blood cells and are transmitted through the bites of infected female Anopheles mosquitoes. The symptoms of an infected person are similar to those of the flu and may also include high fever, fatigue, shivers, septicemia, pneumonia, gastritis, enteritis, nausea, vomiting, migraines, and, in extreme cases, convulsions and coma, which can result in death. Malaria does not spread directly from person to person, but it can be transmitted from mother to fetus, and patients can become infected through blood transfusions or the sharing of instruments such as syringes. Malaria is most prevalent in regions with mild, humid climates that are close to natural water resources and the territory of Anopheles mosquitoes.
Blood films are commonly used for microscopic examination of blood cells to diagnose malaria. Every year, hundreds of millions of blood films are examined for malaria, which requires a trained microscopist to manually count parasites and infected red blood cells. Accurate parasite counts are crucial not only for malaria diagnosis but also for testing for drug resistance, measuring drug efficacy, and classifying disease severity. A false-negative result leads to the unnecessary administration of antibiotics, a second consultation, lost work days, and, in some cases, the progression of malaria to a severe form. A false-positive result leads to the unnecessary use of anti-malaria medications and their potential adverse effects, such as nausea, abdominal pain, diarrhea, and sometimes severe complications. Consequently, the F1 score serves as the primary metric for evaluating our models. The F1 score is the harmonic mean of precision and recall, so it takes both false positives and false negatives into account. F1 is typically more useful than accuracy in such situations, even though it is less intuitive.
We are attempting to automate the diagnostic system using various machine learning techniques that will assist human specialists in producing the most accurate diagnosis possible. We performed extensive preprocessing on the NIH malaria dataset and employed several baseline classification models, including Decision Tree, Logistic Regression, and Support Vector Machine. In addition, a number of preprocessing techniques, including various filters, global feature extraction, and local feature extraction, were applied to the dataset in order to optimize performance on an unbiased test set. To compare the baseline models, we used a variety of evaluation metrics, including accuracy, recall, precision, F1 score, and area under the curve (AUC).
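To make the relationship between these metrics concrete, the following minimal Python sketch computes accuracy, precision, recall, and F1 from confusion-matrix counts. The function name `classification_metrics` and the example counts are hypothetical and chosen only for illustration; they are not results from this study.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall and F1 from confusion-matrix counts.

    tp/fp/fn/tn are the numbers of true positives, false positives,
    false negatives and true negatives. Ratios with an empty denominator
    are reported as 0.0 instead of raising an error.
    """
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall.
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Hypothetical counts for a parasitized-vs-uninfected classifier.
metrics = classification_metrics(tp=90, fp=10, fn=30, tn=70)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With these illustrative counts, accuracy is 0.80 while F1 is about 0.82. F1 can equivalently be written as 2TP / (2TP + FP + FN), which shows that both kinds of error lower the score while true negatives do not affect it.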
IV. LITERATURE SURVEY
Malaria is a significant global health concern that affects millions of people every year, primarily in developing nations. Early detection and treatment are among the most effective ways to manage malaria. Rapid diagnostic tests (RDTs) and microscopic examination of blood smears are commonly employed for malaria diagnosis, but they are time-consuming and prone to errors due to subjective interpretation by human experts. Machine learning (ML) techniques offer a promising way to improve the accuracy and speed of malaria diagnosis by automatically analyzing RDTs and microscopic images.
In recent years, researchers have applied various ML algorithms to malaria detection using microscopic images. G.B. Saiprasath et al. (2019) compared the performance of five ML algorithms (Decision Tree, Random Forest, Naïve Bayes, K-Nearest Neighbors, and Support Vector Machine) for malaria detection using microscopic images. The study found that the Random Forest algorithm achieved the highest accuracy, 96.5 percent. Mahdieh Poostchi et al. (2018) proposed an automated system for malaria detection using deep learning algorithms. The authors employed Convolutional Neural Networks (CNNs) to classify malaria-infected cells from microscopic images. The system achieved high accuracy, with a sensitivity of 96.6 percent. Vijayalakshmi and Rajesh Kanna (2019) also presented a deep learning method for malaria detection using microscopic images. The authors used a pre-trained CNN model, VGG-16, to classify infected and uninfected cells. This approach achieved an accuracy of 95.23 percent.
In a review article, Gaurav Kumar and Pradeep Kumar Bhatia (2014) discussed different feature extraction techniques employed in image processing systems. The review emphasized that feature extraction is a crucial step in image processing systems because it reduces the image's dimensionality and enhances the ML algorithm's performance. Nagarajan Deivanayagampillai et al. (2017) proposed a feature extraction technique for melanoma detection in dermoscopic images. The authors used a combination of global and local features for classification, achieving an accuracy of 92 percent.

V. DATASET
The NIH Malaria Dataset was obtained from the National Institutes of Health (NIH). It consists of 27,558 labelled images of stained red blood cells: 13,779 uninfected and 13,779 parasitized. The segmented red blood cell regions are three-channel (RGB) images that vary in size from 110 to 150 pixels. Positive samples contained Plasmodium, while negative samples lacked Plasmodium but may have contained other objects, such as staining impurities. The images were later resampled to 64 x 64 with a channel depth of 3. Additionally, we computed the greyscale channel. We then created a list titled 'labels' that marks uninfected cells as 0 and parasitized cells as 1. After shuffling, the dataset is divided into a training set and a testing set in a 75:25 ratio.

Fig. 1. Histogram depicting number of images from each class

VI. METHODOLOGY
We used several filters, including the edge detection algorithms Canny, Sobel, and Scharr, as well as Contrast Limited Adaptive Histogram Equalization (CLAHE). We also used local feature extraction techniques such as SIFT (Scale-Invariant Feature Transform) and KAZE, in addition to global feature extraction, which detects features based on shape, texture, and color histogram. We then applied several fundamental machine learning models to the processed data: Decision Tree, Logistic Regression, Support Vector Machine (SVM), Random Forest, KNN, AdaBoost, and Naive Bayes. Finally, we applied ensembling to the top-performing models using bagging and boosting techniques, such as Random Forest and AdaBoost, together with grid search for hyperparameter optimization.

Preprocessing:
The raw pixel data in image patches does not directly aid classification. Instead, we employ a representation that is unaffected by translation, rotation, and constant intensity offsets. The primary concern in the Plasmodium detection problem is the geometry of the objects within the input segments. Because we need a representation that is invariant to translation, intensity, rotation, and size, we rescaled the images, which had been collected in various sizes. Three edge detection filters were used: Canny, Sobel, and Scharr. We chose edge detection because the parasitized cell images contain purple-stained regions; edge detection enhances the image of the infected cell, which aids feature extraction. Histogram equalization is utilized for image segmentation. However, simple histogram equalization generates a great deal of image noise because it considers only the image's global contrast rather than its local contrast. Applying global equalization to our images may therefore not yield optimal results, so CLAHE was employed instead. It limits contrast amplification and performs histogram equalization on small regions, or tiles, with high precision.

Global Feature Extraction:
On our filtered and unfiltered images, we then applied global feature extraction, which extracts features based on shape, texture, and color histogram. Since we needed moments that are invariant to translation, scale, and rotation, we used OpenCV's HuMoments to compute the Hu moments of the structures within the input image. Haralick texture features are derived from the co-occurrence matrix, which records how often pairs of image intensities occur together at pixels in a particular position relative to one another. A color histogram depicts the distribution of colors within an image: for a digital image, it gives the number of pixels whose colors fall in each of a fixed list of color ranges that span the image's color space, the set of all possible colors.

Local Feature Extraction:
We applied two local feature extraction techniques, SIFT and KAZE, to both filtered and unfiltered images. SIFT detects feature points by searching for local maxima of the Difference-of-Gaussians (DoG) operator, an approximation of the Laplacian-of-Gaussians (LoG), at varying image scales (Tareen and Saleem, 2018). The description method generates a total of 128 bin values by extracting a 16x16 neighborhood around each detected feature and further segmenting the region into subblocks. SIFT is insensitive to image rotation, scale, and limited affine variations. In addition to being invariant to rotation, scale, and limited affine variations, KAZE features are more distinctive at differing scales, at the expense of a moderate increase in computation time. KAZE exploits a non-linear scale space built with non-linear diffusion filtering, which makes the blurring locally adaptive to feature points, thereby reducing noise while preserving the boundaries of regions in the subject images. For the KAZE detector, the scale-normalized determinant of the Hessian matrix is computed at multiple scale levels. Using a moving window, the maxima of the detector response are extracted as feature points. The feature descriptor achieves rotation invariance by finding the dominant orientation in a circular area surrounding each detected feature.

VII. BASELINE SYSTEMS
We applied several fundamental machine learning models to our filtered and unfiltered raw images with minimal preprocessing: Decision Tree, Logistic Regression, Support Vector Machine (SVM), Random Forest, KNN, and AdaBoost. The results of our models on raw data were poor, with the SVM model achieving the highest accuracy of 87.2 percent. This can be attributed to the low contrast of the images, the large number of features, and the fact that, without feature extraction, significant information is not given its due weight.

Fig. 2. Model performance

On applying models to global features extracted from raw data, we obtained greater accuracy than with previous models, with SVM achieving a maximum of 87.1 percent. When we applied the classifiers to the images processed with the Canny filter, SVM achieved 86.3 percent accuracy. The Canny filter helps because the infected material becomes readily identifiable. However, it performs less well than the other edge detection techniques because it first reduces noise with a Gaussian filter and then applies edge detection to the smoothed image, which diminishes the quality of the features. Applying our baseline models to the CLAHE images, SVM yields an accuracy of 85.2 percent. As discussed under Preprocessing, CLAHE was used here because global histogram equalization may not be optimal for our images.
With Sobel- and Scharr-filtered images, we achieve 89.4 percent and 90.9 percent accuracy respectively for SVM, and 89.4 percent and 91.3 percent respectively for Logistic Regression. These edge detection filters do not apply a noise-reduction step and provide a higher-quality detection of the edges of parasitized patches. The accuracy of decision trees is diminished because they are grown greedily, deep, and unpruned. Consequently, each tree has a high variance (a tendency to overfit) but a low bias. Decision trees are also sensitive to minor data perturbations: a small change can result in a drastically different tree. As a result, we did not achieve favorable results with Decision Tree and Naive Bayes when implementing our baseline models, so they were not adequate baselines. The accuracy of the KNN and AdaBoost models was also not as good as expected, while Random Forest provided a decent accuracy of 88.4 percent with the Sobel filter.
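The Sobel filtering referred to above can be illustrated with a small, dependency-free Python sketch. It is not the implementation used in this study, which is not specified here; the helper names `correlate3x3` and `sobel_magnitude` and the toy 5x5 patch are assumptions made for illustration. The sketch applies the standard 3x3 Sobel kernels directly, with no prior smoothing, and combines the horizontal and vertical responses into a gradient magnitude.

```python
import math

# Standard 3x3 Sobel kernels for horizontal (x) and vertical (y) gradients.
SOBEL_X = [[-1, 0, 1],
           [-2, 0, 2],
           [-1, 0, 1]]
SOBEL_Y = [[-1, -2, -1],
           [0, 0, 0],
           [1, 2, 1]]


def correlate3x3(image, kernel, row, col):
    """Weighted sum of the 3x3 neighbourhood centred on (row, col)."""
    total = 0
    for dr in (-1, 0, 1):
        for dc in (-1, 0, 1):
            total += kernel[dr + 1][dc + 1] * image[row + dr][col + dc]
    return total


def sobel_magnitude(image):
    """Gradient magnitude of a grayscale image given as a list of rows.

    Only interior pixels are computed; border pixels are left at 0.0.
    """
    height, width = len(image), len(image[0])
    output = [[0.0] * width for _ in range(height)]
    for row in range(1, height - 1):
        for col in range(1, width - 1):
            gx = correlate3x3(image, SOBEL_X, row, col)
            gy = correlate3x3(image, SOBEL_Y, row, col)
            output[row][col] = math.hypot(gx, gy)
    return output


# Toy 5x5 patch (hypothetical data): dark left side, bright right side.
patch = [[0, 0, 0, 255, 255] for _ in range(5)]
edges = sobel_magnitude(patch)
for line in edges:
    print([round(value) for value in line])
```

On this patch the response is zero in the flat dark region and large along the vertical boundary between dark and bright pixels, which is the behaviour an edge filter is expected to show around a stained region.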
Fig. 3. Model performance on various filters with global features

VIII. AFTER BASELINE MODELS
After observing the outcomes of the baseline models, we decided to attempt further improvement. We therefore extracted local features from our filtered and unfiltered images using SIFT and KAZE. SIFT produced the most accurate results when applied to unprocessed images with an SVM classifier, achieving 90.26 percent accuracy.

Fig. 4. Model performance with local features

When we applied both local feature extraction techniques to the filtered images, the results were unimpressive. KAZE once again performed worse than SIFT, but even SIFT performed worse than it had on unprocessed data. We found that KAZE performed poorly because it was unable to extract many features from the data.

Fig. 5. Filtered Images with Sift

Fig. 6. Filtered Images with Kaze

Next, we performed grid search on our top-performing configurations: images filtered by Canny, CLAHE, Sobel, and Scharr, global feature extraction with an SVM classifier, and local feature extraction from raw images with an SVM classifier. For the grid search, we used polynomial and RBF kernels. With the Sobel filter and global feature extraction, we achieved a maximum accuracy of 91.91 percent.

Fig. 7. Grid Search on SVM

Then, we decided to try additional ensembling techniques, namely bagging and boosting, to determine whether we could improve the accuracy of the same models. For bagging we used the Random Forest classifier, which fits multiple decision tree classifiers on various subsamples of the dataset and averages them to improve predictive accuracy and control overfitting. With bagging, the accuracy on SIFT data remained the same, whereas the accuracy on Scharr and Sobel data decreased.
Then, we performed boosting with both Logistic Regression and Decision Tree as base classifiers using AdaBoost. The AdaBoost classifier first fits a classifier on the original dataset. It then fits additional copies of the classifier on the same dataset, adjusting the weights of incorrectly classified instances so that subsequent classifiers focus on the difficult cases. Boosting was ineffective: neither base classifier produced better results than grid search on Sobel.

Fig. 8. Bagging with Random Forest

Fig. 9. Adaboost with Logistic Regression

IX. FINAL MODELS
After applying traditional models such as SVM, Random Forest, AdaBoost, Decision Tree, KNN, and Naive Bayes, we attained an accuracy of 91.9 percent, which is quite good, but it requires extensive preprocessing and hyperparameter
tuning. Instead of relying on heavily preprocessed input, we decided to use deep learning models to create a robust system that exploits the properties of images effectively.

Fig. 10. Adaboost with Decision Tree

A CNN was chosen because it is designed for image classification tasks, uses the convolution operation, and performs well with high-dimensional images. Each convolution filter can learn meaningful information, such as detecting an edge or a specific shape. Pooling reduces dimensionality even further. After these layers, the network has fully connected layers that behave like an ANN and perform the classification.
To implement the CNN, we first had to arrange our dataset in a particular manner. We resized each image to 32x32 with three channels, normalized the data, and divided it 80:20 for training and testing, respectively. Finally, we created DataLoaders with a batch size of 256. We chose Visual Geometry Group (VGG) as our CNN architecture because it makes greater use of the "deep" in deep learning than its predecessors. It replaced large kernels with multiple small kernels stacked on top of one another. This yields the same receptive field with fewer parameters, which makes it possible to go deeper.
We used a minimal version of the well-known VGG architecture, reducing the number of neurons in the fully connected layers in order to reduce training overhead. Several convolutional layers, pooling, and batch normalization are used to extract image features. The features are then classified using a few fully connected layers. ReLU activation is used to introduce non-linearity and to mitigate the exploding/vanishing gradient problem. Dropout is used to reduce overfitting and help the model generalize. We implemented VGG11.

Fig. 11. Training Performance of CNN

Fig. 12. Testing Performance of CNN

X. RESULT AND ANALYSIS
The study explored various machine learning models for automated malaria detection using red blood cell smears. The results highlighted the importance of preprocessing steps and feature extraction techniques in enhancing model performance. Random Forests, when applied to preprocessed data, showed potential for malaria detection, but more sophisticated preprocessing may be needed for optimal performance. Logistic Regression demonstrated effectiveness, especially when used with appropriate preprocessing techniques. However, Support Vector Machine (SVM) emerged as the most promising model among the baselines, achieving good accuracy. It showcased the potential of SVM in malaria detection, particularly when combined with effective preprocessing and feature extraction methods.
With various feature extraction techniques and hyperparameter optimization, we observed in our project how traditional machine learning models can yield high accuracy. When we applied the Sobel filter with global feature extraction to the SVM model, we obtained the highest F1 score among our conventional models, 0.918. With an F1 score of 0.941 on testing with the VGG11 CNN architecture, we saw that a CNN architecture can achieve even better results.

XI. FUTURE SCOPE
While we have achieved strong results in detecting malaria using deep learning techniques, several directions for future work could be considered.
Comparison with other models: While VGG models have shown good performance, it would be interesting to compare their results with other state-of-the-art image classification models, such as ResNet, Inception, or DenseNet.
Transfer learning: Transfer learning is a technique in which pre-trained models are used as a starting point for new models trained on different datasets. It would be interesting to investigate the performance of transfer learning in the context of malaria detection and compare it with the performance of models trained from scratch.

XII. CONCLUSION
On the cell images, both traditional and deep learning models perform well for malaria detection. However, conventional models require a significant amount of feature extraction, processing, and engineering to produce accurate results. In contrast, deep learning models perform remarkably well. There is always a trade-off between accuracy and computational cost: deep networks are computationally intensive. Nonetheless, traditional models, when coupled with sound feature engineering, can produce excellent results, comparable if not superior to those of deep learning models.

REFERENCES
1. Aimon Rahman, Hasib Zunair, M Sohel Rahman, Jesia Quader Yuki, Sabyasachi Biswas, Md Ashraful Alam, Nabila Binte Alam, and M. R. C. Mahdy. 2019. Improving malaria parasite detection from red blood cell using deep convolutional neural networks.
2. G.B. Saiprasath, N. Babu, J. ArunPriyan, R. Vinayakumar, V. Sowmya, and Dr Soman K. P. 2019. Performance comparison of machine learning algorithms for malaria detection using microscopic images. IJRAR19RP014, International Journal of Research and Analytical Reviews, 6(1).
3. Gaurav Kumar and Pradeep Kumar Bhatia. 2014. A detailed review of feature extraction in image processing systems. In 2014 Fourth International Conference on Advanced Computing & Communication Technologies, pages 5–12.
4. Mahdieh Poostchi, Kamolrat Silamut, Richard J. Maude, Stefan Jaeger, and George Thoma. 2018. Image analysis and machine learning for detecting malaria. Translational Research, 194:36–55. In-Depth Review: Diagnostic Medical Imaging.
5. Nagarajan Deivanayagampillai, Suruliandi A, and Kavitha Jc. 2017. Melanoma detection in dermoscopic images using global and local feature extraction. International Journal of Multimedia and Ubiquitous Engineering, 12:19–27.
6. Vijayalakshmi and Rajesh Kanna. 2019. Deep learning approach to detect malaria from microscopic images. Multimedia Tools and Applications, 79(21-22):15297–15317.
7. Karen Simonyan and Andrew Zisserman. 2015. Very deep convolutional networks for large-scale image recognition.
8. Kirti Motwani, Abhishek Kanojiya, Cynara Gomes, and Abhishek Yadav. 2021. Malaria detection using image processing and machine learning. International Journal of Engineering Research & Technology (IJERT).
