Final Major Project
Submitted By:
Sandip Mandal (13071022023)
Sovon Chanda (13071022051)
Amogh Pal (13071022034)
Biswarup Mukherjee (13071022024)
Tanmoy Chatterjee (13071022046)
Techno Main, Salt Lake
FACULTY OF MCA DEPARTMENT
CERTIFICATE OF RECOMMENDATION
_______________________
Prof. Shauvik Paul
Dept. of MCA
Techno Main Salt Lake, Kolkata: 700091
Date:
BONAFIDE CERTIFICATE
Certified that this report for the Data Science project titled
1.
2.
3.
4.
5.
[Under Maulana Abul Kalam Azad University of Technology (MAKAUT)]
CERTIFICATE OF APPROVAL
ACKNOWLEDGEMENT
We take this occasion to thank God Almighty for blessing us with his
grace and taking our endeavour to a successful culmination. We extend
our sincere and heartfelt thanks to our esteemed guide, Prof. Shauvik
Paul, for providing us with the right guidance and advice at the crucial
junctures and for showing us the right way. We also take this
opportunity to express a deep sense of gratitude to our teachers and
staff for their kind-hearted support, guidance and utmost endeavour to
groom and develop our academic skills.
Finally, we would like to express our sincere thanks to all our friends
and others who helped us, directly or indirectly, in shaping this
concept.
SANDIP MANDAL
BISWARUP MUKHERJEE
SOVON CHANDA
TANMOY CHATTERJEE
AMOGH PAL
TEAM STRUCTURE
TABLE OF CONTENTS
1. ABSTRACT
2. INTRODUCTION
3. PROJECT OVERVIEW
4. LITERATURE REVIEW
5. METHODOLOGY
FOR PREDICTION
10. REFERENCE
1. ABSTRACT
Agriculture is one of the few remaining sectors that has yet to receive proper
attention from the machine learning community. The importance of datasets in
the machine learning discipline cannot be overemphasized. The lack of standard,
publicly available datasets related to agriculture prevents practitioners from
harnessing the full benefit of these powerful computational predictive tools
and techniques. To improve this scenario, we develop, to the best of our
knowledge, the first-ever standard, ready-to-use, and publicly available dataset of
mango leaves. The images are collected from four mango orchards of
Bangladesh, one of the top mango-growing countries of the world. The dataset
contains 4000 images of about 1800 distinct leaves covering seven diseases. We
also report accuracy metrics, namely precision, recall and F1 score of three deep
learning models. Although the dataset is developed using mango leaves of
Bangladesh only, since we deal with diseases that are common across many
countries, this dataset is likely to be applicable to identify mango diseases in other
countries as well, thereby boosting mango yield. This dataset is expected to draw
wide attention from machine learning researchers and practitioners in the field of
automated agriculture.
The South Indian mango industry is confronting severe threats due to various leaf
diseases, which significantly impact the yield and quality of the crop. The
management and prevention of these diseases depend mainly on their early
identification and accurate classification. The central objective of this research is
to propose and examine the application of Deep Convolutional Neural Networks
(CNNs) as a potential solution for the precise detection and categorization of
diseases impacting the leaves of South Indian mango trees. Our study collected a
rich dataset of leaf images representing different disease classes, including
Anthracnose, Powdery Mildew, and Leaf Blight. To maintain image quality and
consistency, pre-processing techniques were employed. We then used a
customized deep CNN architecture to analyze the accuracy of South Indian mango
leaf disease detection and classification. This proposed CNN model was trained
and evaluated using our collected dataset. The customized deep CNN model
demonstrated high performance in experiments, achieving an impressive 93.34%
classification accuracy. This result outperformed traditional CNN algorithms,
indicating the potential of customized deep CNN as a dependable tool for disease
diagnosis. Our proposed model showed superior accuracy and computational
efficiency compared to other basic CNN models. Our research
underscores the practical benefits of customized deep CNNs for automated leaf
disease detection and classification in South Indian mango trees. These findings
support deep CNNs as a valuable tool for real-time interventions and improving
crop management practices, thereby mitigating the issues currently facing the
South Indian mango industry.
2. INTRODUCTION
Mango, sometimes called the “King of Fruits,” is a precious fruit crop grown
in many nations. It is widely consumed and valued for its economic and
nutritional significance. India is a major producer; approximately 40% of
the world’s mangoes come from India, making it the leading country.
However, mango crops face significant challenges due to pests and
diseases, which result in substantial yield losses estimated at around 30%–
40%. Mango leaves, in particular, are susceptible to various diseases that
significantly impact mango production. Mango cultivation is a vital
agricultural activity in South India, contributing significantly to the
region’s economy. However, the growth and productivity of mango trees
are often hampered by various leaf diseases. These diseases can lead to
significant crop losses and reduced fruit quality if not detected and managed
promptly. Manual inspection and diagnosis of leaf diseases are time-
consuming, labor-intensive, and prone to errors, necessitating the
development of automated and accurate disease detection systems. Thus, the
need for autonomous, precise, rapid, and cost-effective plant disease
identification technology is growing. Image processing and machine
learning can be used to classify leaf diseases. Deep learning, a branch of
machine learning, has garnered attention and found practical applications. It
uses deep neural networks to provide a helpful tool for diagnosing and
categorizing plant diseases. Deep learning, especially deep Convolutional
Neural Networks (CNNs), has benefited image analysis tasks like disease
diagnosis and plant categorization. CNNs can automatically learn the
discriminative features required for complex pattern recognition tasks from
the raw input images.
Significance of Early Mango Disease Detection
Mangoes are a vital agricultural crop, but their yield and quality can be
significantly impacted by various diseases. Early detection of these diseases is
crucial for several reasons:
Reduced Crop Loss: Early detection allows farmers to take timely action, such
as applying fungicides or removing infected plants, minimizing the spread of the
disease and saving a significant portion of the crop.
Improved Fruit Quality: Diseases can affect the appearance, taste, and
nutritional value of mangoes. Early detection helps ensure that only healthy fruits
reach the market, enhancing consumer experience and potentially fetching higher
prices.
Lower Treatment Costs: When diseases are detected early, the interventions
required are usually less intensive and expensive compared to situations where
the disease has progressed significantly.
Convolutional layers: These layers apply filters (also called kernels) that slide
across the image, extracting features like edges and shapes. The filters learn
weights and biases during the training process, allowing them to identify specific
patterns within the image.
Pooling layers: These layers downsample the output of the convolutional layers,
reducing the dimensionality of the data while preserving important information.
This helps to control computational cost and prevent overfitting.
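The two layer types described above can be illustrated with a minimal NumPy sketch. It is not the code used in this project: the 6x6 toy image, the hand-written edge kernel and the helper names conv2d_valid and max_pool_2x2 are assumptions made for illustration. In a real CNN the kernel weights are learned during training rather than written by hand.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` (no padding, stride 1) and sum the products.

    Like deep learning libraries, this does not flip the kernel
    (strictly speaking it is a cross-correlation).
    """
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def max_pool_2x2(feature_map):
    """Keep the largest value in each non-overlapping 2x2 block."""
    h, w = feature_map.shape
    h, w = h - h % 2, w - w % 2  # drop an odd trailing row/column
    blocks = feature_map[:h, :w].reshape(h // 2, 2, w // 2, 2)
    return blocks.max(axis=(1, 3))

# Toy 6x6 "image": dark left half, bright right half.
image = np.zeros((6, 6))
image[:, 3:] = 1.0

# Hand-written vertical-edge kernel (illustrative, not learned).
kernel = np.array([[-1.0, 0.0, 1.0],
                   [-1.0, 0.0, 1.0],
                   [-1.0, 0.0, 1.0]])

features = conv2d_valid(image, kernel)  # shape (4, 4), strong response at the edge
pooled = max_pool_2x2(features)         # shape (2, 2), half the resolution
```

The convolution responds only where the dark and bright halves meet, and pooling halves each spatial dimension while keeping that response.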
CNNs have revolutionized the field of computer vision due to their ability to
achieve high accuracy in various image recognition tasks. They are a cornerstone
of many applications, including:
Self-driving cars: CNNs are used to identify objects like vehicles, pedestrians,
and traffic signs, enabling autonomous vehicles to navigate their surroundings.
Facial recognition: CNNs are adept at recognizing faces in images and videos,
with applications in security and social media.
3. PROJECT OVERVIEW
This research project aims to develop a system for classifying diseased and
healthy mango leaves using Convolutional Neural Networks (CNNs). Early and
accurate detection of mango diseases is crucial for minimizing crop loss,
improving fruit quality, and promoting sustainable farming practices. As
discussed in the Introduction (Significance of Early Mango Disease Detection),
early intervention can significantly benefit farmers and the environment.
Training: The customized CNN models will be trained on the prepared mango
leaf image dataset. During training, the models will learn to identify patterns and
features within the images that differentiate healthy from diseased leaves.
Optimization: Based on the evaluation results, we will identify the most effective
CNN model for mango leaf disease classification. We may also explore
techniques for further optimizing the chosen model's performance.
4. LITERATURE REVIEW
Building upon the foundation set in the literature review (Section 1.2), section 2.1
can delve deeper into specific research related to CNN-based plant disease
detection. Here's a possible structure:
The pre-trained models are customized for the task of classifying mango leaf
diseases. This typically involves adding new top layers (fully connected
layers) to the pre-trained model. These new layers are trained on the mango
leaf disease image dataset to learn the features that differentiate healthy
from diseased leaves and distinguish between different disease categories.
Building on the foundation laid in section 2.1, which explores CNN architectures
for plant disease detection in general, section 2.2 can delve specifically into
existing research on applying CNNs to mango leaf disease classification. Here's
a possible structure:
Focus on Recent CNN Applications in Mango Leaf Disease Detection:
The disease categories included in the datasets used for training and testing the
models.
Any pre-processing techniques applied to the mango leaf images (e.g., resizing,
normalization).
If you find multiple relevant studies, consider including a table summarizing their
key findings. This table can provide a quick comparison of aspects like:
CNN Architecture: Specify the type of CNN model used (pre-trained or custom)
and its variant (e.g., EfficientNet-B0, VGG16).
Disease Categories: List the specific mango leaf diseases the model was trained
to classify.
Analysis of Performance Variations:
Discuss the variations in accuracy and other performance metrics reported across
different studies in section 2.2. Consider factors that might contribute to these
variations:
Disease Category Scope: Some studies might focus on a limited set of common
diseases, while others aim for a broader range. Classifying a wider variety of
diseases can be inherently more challenging.
Dataset Size and Quality: The size and quality of the image datasets used for
training can significantly impact model performance. Larger and more diverse
datasets often lead to better generalization.
Identify limitations or areas for improvement in the reviewed research. This could
include:
5. METHODOLOGY
Data plays a tremendously important role in machine learning, to the extent that
practitioners widely believe that the quality and quantity of the data, and
not the mathematical model, play the pivotal role in the performance of modern
machine learning systems. That is why researchers must follow standard
practices from the beginning to the end of the dataset preparation procedure. In
this section, firstly, we describe the steps we take in our dataset preparation
task. Secondly, we analyze the visual characteristics of leaf images pertaining
to different diseases. Thirdly, we list the key challenges we faced during the
dataset development.
This section outlines the crucial steps involved in preparing your dataset for
training and evaluating the CNN models for mango leaf disease classification.
Here's a breakdown of the key elements you can include:
Data Source:
Specify how you will acquire the mango leaf images for your dataset. There are
two main options:
Field Collection: You can collect images directly from mango trees in orchards
or fields. This allows you to capture a variety of disease conditions and
environmental variations that might be present in real-world scenarios. However,
field collection requires more time, resources, and potentially controlled settings
to ensure consistent image quality.
Publicly Available Datasets: If field collection is not feasible, you can explore
publicly available datasets containing images of healthy and diseased mango
leaves. These datasets are often curated by research institutions or organizations
and can be a good starting point. Be sure to check the licensing terms associated
with any publicly available datasets you plan to use.
Define the specific mango leaf diseases your CNN models will be trained to
classify. This decision will influence the data collection process and the overall
effectiveness of your system. Here are some factors to consider:
Prevalence: Focus on diseases that are economically important or widespread in
your target region.
If collecting images in the field, describe the process of capturing the images.
This might involve details like:
The process of collecting leaves with different disease severities (if applicable).
Following image capture, you'll need to annotate each image with the
corresponding disease class (healthy or specific disease type). This annotation
process can be done manually by experts or through crowdsourcing platforms.
Data Preprocessing:
Regardless of the data source, some preprocessing steps are essential to ensure
the quality and consistency of your dataset for training the CNN models. Here are
some common techniques you might employ:
Resizing: Resize all images to a standard dimension. This ensures that the CNN
model receives inputs of uniform size.
Normalization: Normalize the pixel values in the images to a specific range (e.g.,
0-1 or -1 to 1). This helps the CNN model converge faster during training.
Data Splitting:
Once the data is pre-processed, split it into three sets: training, validation, and
testing sets. The training set (largest portion) is used to train the model. The
validation set (smaller portion) is used to monitor the model's performance during
training and prevent overfitting. The testing set (another smaller portion) is used
for final evaluation of the model's generalizability on unseen data. Common splits
are 80% for training, 10% for validation, and 10% for testing.
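A minimal sketch of the 80/10/10 split described above, using only the Python standard library. The file names, the fixed seed and the helper name split_dataset are assumptions for illustration. A plain random split is shown; a split stratified by disease class may be preferable when classes are imbalanced.

```python
import random

def split_dataset(items, train_frac=0.8, val_frac=0.1, seed=42):
    """Shuffle `items` and cut them into train / validation / test lists."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # fixed seed makes the split repeatable
    n_train = int(round(len(shuffled) * train_frac))
    n_val = int(round(len(shuffled) * val_frac))
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]      # whatever remains
    return train, val, test

# Hypothetical file names standing in for the real image paths.
image_paths = [f"leaf_{i:04d}.jpg" for i in range(1000)]
train, val, test = split_dataset(image_paths)
```

Splitting once, before any training, keeps the test images unseen until the final evaluation.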
This subsection delves deeper into the specific details of collecting mango leaf
images directly from orchards or fields, as outlined in section 3.1. Here's a
breakdown of the key aspects you can cover:
Describe the type of camera or imaging device you will use to capture the mango
leaf images. Consider factors like:
Explain the process for capturing consistent and high-quality images of mango
leaves. Here are some details to consider:
Data Collection Strategy:
Outline your strategy for collecting a diverse and representative dataset of mango
leaf images. Here are some considerations:
Variety of Mango Trees: If possible, collect images from different mango tree
varieties to account for potential variations in leaf morphology that might affect
image recognition.
Geographic Locations (Optional): Consider collecting images from multiple
geographic locations if the prevalence of certain diseases varies regionally.
Explain the process for labeling and annotating the collected images. This
involves assigning a specific class label (e.g., healthy, disease A, disease B) to
each image based on the presence or absence of disease and its type. Here are
some options:
Manual Labeling: If expertise is limited, you might manually label the images
yourself, but ensure proper training to identify the targeted diseases accurately.
Following data collection (Section 3.1), this subsection focuses on the various
data preprocessing techniques you'll employ to prepare your mango leaf image
dataset for training the CNN models. Here's a breakdown of the key elements you
can include:
Describe the essential preprocessing techniques you will apply to all images in
the dataset, regardless of their source (field collection or publicly available
datasets). These techniques ensure consistency and improve the training process
for the CNN models.
Resizing: Resize all images to a standard dimension (e.g., 224x224 pixels). This
ensures the CNN model receives inputs of uniform size, simplifying
computations.
Normalization: Normalize the pixel values in the images to a specific range (e.g.,
0-1 or -1 to 1). Normalization helps the CNN model converge faster during
training by scaling the data to a common range.
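The resizing and normalization steps can be sketched in NumPy as follows. This is an illustration under stated assumptions, not the project's pipeline: the nearest-neighbour resize stands in for whatever library resize is actually used, the input is random data in place of a leaf photograph, and the helper names are hypothetical.

```python
import numpy as np

def resize_nearest(image, size=(224, 224)):
    """Nearest-neighbour resize of an (H, W, C) array to `size` = (height, width)."""
    h, w = image.shape[:2]
    rows = np.arange(size[0]) * h // size[0]  # source row for each output row
    cols = np.arange(size[1]) * w // size[1]  # source column for each output column
    return image[rows[:, None], cols]

def normalize(image, mode="0-1"):
    """Scale uint8 pixel values (0..255) to [0, 1] or [-1, 1]."""
    scaled = image.astype(np.float32) / 255.0
    if mode == "0-1":
        return scaled
    if mode == "-1-1":
        return scaled * 2.0 - 1.0
    raise ValueError(f"unknown mode: {mode}")

# Random pixels standing in for a 240x320 RGB photograph of a leaf.
rng = np.random.default_rng(0)
photo = rng.integers(0, 256, size=(240, 320, 3), dtype=np.uint8)

resized = resize_nearest(photo)             # (224, 224, 3)
unit_range = normalize(resized, "0-1")      # values in [0, 1]
signed_range = normalize(resized, "-1-1")   # values in [-1, 1]
```

Which range to use depends on the pre-trained model: each architecture expects the input scaling it was originally trained with.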
Additional Preprocessing Techniques (Optional):
Color Space Conversion: If the model performs better with specific color
information (e.g., grayscale for texture analysis), you might convert the images
from RGB to grayscale.
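As a small illustration of the optional RGB-to-grayscale conversion, the sketch below uses the common ITU-R BT.601 luma weights. The choice of weights and the helper name rgb_to_grayscale are assumptions; the report does not state which conversion, if any, was applied.

```python
import numpy as np

def rgb_to_grayscale(image):
    """Collapse an (H, W, 3) RGB array to (H, W) with BT.601 luma weights."""
    weights = np.array([0.299, 0.587, 0.114], dtype=np.float32)
    return image.astype(np.float32) @ weights

# Two pixels: pure red and pure white.
pixels = np.array([[[255, 0, 0], [255, 255, 255]]], dtype=np.uint8)
gray = rgb_to_grayscale(pixels)  # shape (1, 2)
```

Note that ImageNet pre-trained models expect three input channels, so a grayscale image would need to be repeated across channels before being passed to them.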
Explain how you will ensure the quality of your preprocessed data. Here are some
considerations:
Data Validation Tools: Utilize data validation tools to check for missing values,
corrupted images, or incorrect labels after preprocessing.
In this section, we'll discuss the process of selecting and customizing pre-trained
Convolutional Neural Network (CNN) models for classifying diseased mango
leaves in our image dataset.
Rationale for Choosing EfficientNet, VGG16, and ResNet
ResNet: ResNet architectures address the vanishing gradient problem that can
hinder training in deep neural networks. By introducing skip connections that
allow gradients to flow directly through the network, ResNets can effectively
train deeper models and potentially achieve better performance on complex
image classification tasks like disease detection.
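The idea of a skip connection can be shown with a minimal NumPy sketch. It is a simplification: real ResNet blocks use convolutions and batch normalization, whereas this version uses two plain matrix multiplications, and the names residual_block, w1 and w2 are hypothetical.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def residual_block(x, w1, w2):
    """Two-layer transform F(x) plus the identity shortcut: relu(F(x) + x)."""
    fx = relu(x @ w1) @ w2   # the learned residual F(x)
    return relu(fx + x)      # the skip connection adds the input back

rng = np.random.default_rng(0)
x = rng.random((4, 8))       # batch of 4 non-negative feature vectors
w_zero = np.zeros((8, 8))    # weights for which F(x) is exactly zero

out = residual_block(x, w_zero, w_zero)
```

With zero weights the block simply passes its input through, which is why stacking many such blocks does not block the signal (or its gradient) the way a plain deep stack can.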
By evaluating these three diverse architectures, we can gain insights into the
trade-offs between model complexity, accuracy, and computational efficiency for
our specific mango leaf disease classification problem.
Comparison of EfficientNet, VGG16, and ResNet for
Mango Leaf Disease Classification
EfficientNetB7 architecture
Layer (type): This column lists the type of each layer in the network, such as
"Conv2D" (convolutional layer), "MaxPooling2D" (pooling layer), "Flatten"
(flattens the data), "Dense" (fully-connected layer).
Output Shape: This column shows the output shape of each layer, typically
represented as a tuple (batch, height, width, channels), with the batch size
shown as None.
VGG16 architecture
Layer (type): This column lists the layer type in the network (e.g., Conv2D,
MaxPooling2D, Flatten, Dense).
Output Shape: This column shows the output shape of each layer, typically
represented as a tuple (batch, height, width, channels), with the batch size
shown as None.
Total Params: This value, usually shown at the bottom, represents the total
number of parameters (trainable and non-trainable) in the entire VGG16 model.
ResNet50 architecture
Layer (type): This column specifies the layer type (e.g., Conv2D,
MaxPooling2D, Flatten, Dense).
Output Shape: This column shows the output shape of each layer, typically
represented as a tuple (batch, height, width, channels), where the last entry
is the number of filters and the batch size is shown as None.
Param #: This column indicates the number of parameters in each layer,
including non-trainable ones such as batch normalization statistics.
Total Params: This value, usually displayed at the bottom, represents the total
number of parameters (trainable and non-trainable) in the entire ResNet50 model.
Customization of Pre-trained Models with New Top Layers
These new top layers will typically consist of a few fully-connected layers with
appropriate activation functions like ReLU or softmax. The final layer will have
the number of neurons corresponding to the number of disease classes we aim to
identify (e.g., two neurons for healthy and diseased).
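A hedged sketch of how new top layers might be attached to a pre-trained base in Keras. It is not the project's actual model code. The layer sizes, the use of GlobalAveragePooling2D, the optimizer and the loss are assumptions; VGG16 is used only as an example of the three architectures, and weights=None is passed so that the sketch builds without downloading anything (transfer learning would use weights="imagenet").

```python
import tensorflow as tf

NUM_CLASSES = 2            # healthy vs. diseased, as in the example in the text
INPUT_SHAPE = (224, 224, 3)

# weights=None keeps the sketch offline; pass weights="imagenet" to load
# the pre-trained ImageNet weights that transfer learning relies on.
base = tf.keras.applications.VGG16(
    include_top=False, weights=None, input_shape=INPUT_SHAPE
)
base.trainable = False     # freeze the convolutional base

inputs = tf.keras.Input(shape=INPUT_SHAPE)
x = base(inputs, training=False)
x = tf.keras.layers.GlobalAveragePooling2D()(x)
x = tf.keras.layers.Dense(128, activation="relu")(x)        # new hidden layer
outputs = tf.keras.layers.Dense(NUM_CLASSES, activation="softmax")(x)
model = tf.keras.Model(inputs, outputs)

model.compile(
    optimizer="adam",
    loss="sparse_categorical_crossentropy",  # assumes integer class labels
    metrics=["accuracy"],
)
```

Only the two new Dense layers are trainable here; the same pattern applies to the EfficientNet and ResNet bases by swapping the constructor.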
Here's a table summarizing the key steps involved in customizing the pre-trained
models:
Comparison of Model Architectures
Here, you can optionally include a table comparing the key architectural features
of EfficientNet, VGG16, and ResNet. This table might include details like:
Training Process
This section will delve into the specifics of how you trained your CNN models
for mango leaf disease classification. Here's a breakdown of what you can
include:
Training Configuration
Learning Rate: The rate at which the model updates its weights based on
errors.
Optimizer: The algorithm used to update the weights (e.g., Adam, SGD).
Batch Size: The number of images processed by the model before weight
updates.
Epochs: The number of times the entire dataset is passed through the model
during training.
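To make the relationship between batch size, epochs and weight updates concrete, here is a small sketch. All values in it are illustrative assumptions; the report does not state the settings that were actually used.

```python
import math

# Illustrative values only, not the project's actual configuration.
config = {
    "learning_rate": 1e-3,
    "optimizer": "adam",
    "batch_size": 32,
    "epochs": 20,
}

def weight_updates(num_train_images, batch_size, epochs):
    """Return (steps per epoch, total optimizer steps): one step per batch."""
    steps_per_epoch = math.ceil(num_train_images / batch_size)
    return steps_per_epoch, steps_per_epoch * epochs

# 3200 training images is a hypothetical figure (80% of a 4000-image dataset).
steps_per_epoch, total_steps = weight_updates(
    3200, config["batch_size"], config["epochs"]
)
```

A smaller batch size gives more, noisier updates per epoch; a larger one gives fewer, smoother updates at a higher memory cost.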
Describe the validation strategy used to prevent overfitting. This could involve techniques like:
Splitting the dataset: Dividing the data into training, validation, and (ideally)
test sets. The model is trained on the training set, evaluated on the validation set,
and final performance is assessed on the unseen test set.
By outlining these details, you provide transparency into the training process and
allow readers to understand how you optimized model performance.
Evaluation Metrics
Accuracy: Accuracy is the most basic metric, representing the overall percentage
of correctly classified images. It's calculated by dividing the number of images
classified correctly (True Positives and True Negatives) by the total number of
images in the dataset.
Choosing the Right Metric:
The most suitable metric depends on the specific problem and its priorities. In our
case, where accurately identifying diseased leaves is crucial, high Recall might
be more important than high Accuracy. A model with high Recall would
minimize the number of missed diseased leaves (False Negatives).
We'll report the Accuracy, Precision, Recall, and F1-score for each model in the
Results and Discussion section (Section 4). Analyzing these metrics will allow us
to compare the performance of EfficientNet, VGG16, and ResNet in classifying
diseased mango leaves.
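The four metrics named in this section can be computed from the counts of true/false positives and negatives using their standard definitions. The sketch below treats one class (here "diseased") as positive; the counts are made up for illustration and are not results from this project.

```python
def classification_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall and F1 for one class treated as 'positive'."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0  # of predicted positives, how many are right
    recall = tp / (tp + fn) if (tp + fn) else 0.0     # of actual positives, how many are found
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)           # harmonic mean of the two
    return accuracy, precision, recall, f1

# Made-up counts: 'diseased' is the positive class.
accuracy, precision, recall, f1 = classification_metrics(tp=80, fp=10, fn=20, tn=90)
```

In this example accuracy is 0.85 while recall is only 0.80: twenty diseased leaves are missed, which is the kind of error the text argues matters most.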
Model Training –
EfficientNetB7
Model Training –
VGG16
Model Training –
ResNet50
Classification Performance of Each Model
In this section, we'll present the classification performance achieved by each of
the pre-trained CNN models (EfficientNet, VGG16, and ResNet) on the mango
leaf disease classification task. Here, we'll showcase the results using a
combination of tables and graphs to effectively communicate the model
performance.
Tables:
We can create separate tables for each evaluation metric (Accuracy, Precision,
Recall, F1-score). Each table should include columns for:
The most suitable visualization approach depends on the specific information you
want to emphasize. If you want to focus on the overall performance comparison
across all metrics, a bar chart might be a good choice. Line charts can be useful
for detailed analysis of how each model's performance evolves during training (if
you trained from scratch or fine-tuned the pre-trained models). By presenting the
results in both tables and graphs, you can provide a clear and comprehensive
understanding of how each CNN model performed in classifying diseased mango
leaves.
Analysis of EfficientNet's Superior Performance
In the previous section (4.1), we presented the classification performance of
EfficientNet, VGG16, and ResNet on the mango leaf disease classification task.
We observed that EfficientNet achieved the highest accuracy and potentially
outperformed the other models on other metrics like Precision, Recall, or F1-
score (depending on the reported results).
In this section, we'll delve deeper and analyze the potential reasons behind
EfficientNet's superior performance. Here are some key factors to consider:
1. EfficientNet Architecture:
2. Training Configuration:
The hyperparameter settings used during training, such as learning rate and
optimizer choice, could have played a role. EfficientNet training might have
benefited from a more optimal hyperparameter configuration that facilitated
better convergence and learning of disease-discriminative features from the
mango leaf dataset.
3. Dataset Characteristics:
The specific characteristics of the mango leaf image dataset, such as the number
of images, disease classes present, and image quality, could have influenced the
results. EfficientNet might be inherently more suited for the dataset's properties
compared to VGG16 and ResNet.
Further Investigation:
Visualizing the learned filters or feature maps at different layers of each model.
This can provide insights into what kind of features each model is capturing from
the images.
2. Training Optimization:
configuration compared to VGG16 and ResNet. This could have led to faster
convergence, better feature learning, and ultimately, higher classification
accuracy for mango leaf diseases.
While the specific characteristics of the mango leaf dataset are unknown, here are
some considerations:
6. Loading an Image and Preparing it for Prediction
using a Saved TensorFlow Model
Purpose:
This code snippet aims to load an image from a specified path and prepare it for
prediction using a pre-trained TensorFlow model.
Code Explanation:
import tensorflow as tf, import numpy as np, from tensorflow import keras,
from tensorflow.keras.preprocessing import image, from
tensorflow.keras.applications.imagenet_utils import decode_predictions:
Imports necessary TensorFlow and NumPy modules for image processing and
model loading.
image_path = '/content/drive/MyDrive/gal.png': Specifies the path to the image
file.
img = tf.keras.preprocessing.image.load_img(image_path, target_size=(224,
224)): Loads the image from the specified path, resizing it to the target size of
224x224 pixels.
input_arr = tf.keras.preprocessing.image.img_to_array(img): Converts the
image to a NumPy array.
input_arr = np.array([input_arr]): Wraps the single image in a batch of one,
changing its shape from (224, 224, 3) to (1, 224, 224, 3), which is the input
shape expected by the model.
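Assembled into one function, the steps explained above might look like the following sketch. It uses the same Keras calls named in the explanation; the function name prepare_image and the model file name in the commented usage are hypothetical and not taken from the project.

```python
import numpy as np
import tensorflow as tf

def prepare_image(image_path, target_size=(224, 224)):
    """Load an image file and return a batch of one, shaped (1, H, W, 3)."""
    # Load from disk and resize to the size the model expects.
    img = tf.keras.preprocessing.image.load_img(image_path, target_size=target_size)
    # Convert the PIL image to a float array of shape (H, W, 3).
    input_arr = tf.keras.preprocessing.image.img_to_array(img)
    # Add the leading batch dimension.
    return np.array([input_arr])

# Usage, with the path given in the text and a hypothetical saved model file:
# batch = prepare_image('/content/drive/MyDrive/gal.png')
# model = tf.keras.models.load_model('mango_model.keras')
# predictions = model.predict(batch)
```

Any pixel scaling applied during training (for example division by 255) would need to be repeated here before calling predict.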
Usage:
This code snippet is useful for preparing images to be fed into a pre-trained
TensorFlow model for prediction, such as image classification or object
detection models.
Dependencies:
This code relies on the following dependencies:
TensorFlow (tensorflow) for deep learning functionality.
NumPy (numpy) for numerical computations.
keras.preprocessing.image for image preprocessing utilities.
keras.applications.imagenet_utils for decoding model predictions.
7. Displaying Disease Prediction from Image
Purpose:
This code segment is intended to visualize the prediction of a disease based on
an input image using a machine learning model.
Code Explanation:
Dependencies:
This code relies on the following dependencies:
matplotlib.pyplot for plotting images.
img - Input image data.
class_names - List or dictionary mapping class indices to disease names.
result_index - Index of the predicted class returned by the machine learning
model.
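A minimal sketch of how result_index and class_names fit together. The class list and the probabilities are made up for illustration, and the helper name predicted_disease is hypothetical; in practice class_names must follow the label order used during training.

```python
import numpy as np

# Hypothetical class list; the order must match the labels used in training.
class_names = ["Anthracnose", "Healthy", "Leaf Blight", "Powdery Mildew"]

def predicted_disease(predictions, class_names):
    """Return (result_index, class name, probability) for a batch of one."""
    probabilities = np.asarray(predictions)[0]
    result_index = int(np.argmax(probabilities))  # most probable class
    return result_index, class_names[result_index], float(probabilities[result_index])

# Made-up model output for one image (one probability per class).
predictions = np.array([[0.05, 0.10, 0.15, 0.70]])
result_index, name, prob = predicted_disease(predictions, class_names)
```

The image can then be displayed with matplotlib, for example plt.imshow(img) followed by plt.title(name), so that the predicted disease appears as the plot title.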
8. Summary of Key Findings
EfficientNet's architecture with compound scaling:
9. Recommendations for Future Work
Building upon the findings of this research, here are some recommendations for
future work to enhance the mango leaf disease classification system:
Explore the use of class activation maps (CAMs): Utilize class activation maps
(CAMs) to visualize the regions of interest in the mango leaf images that the
models focus on for disease classification. This can provide valuable insights into
how the models are making decisions and identify which image features are most
critical for accurate disease detection.
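As a rough illustration of the idea, a class activation map in its original formulation is a weighted sum of the final convolutional feature maps, using the weights that connect each globally pooled channel to the chosen class. The sketch below assumes a model with global average pooling directly before the classification layer and uses synthetic arrays rather than real activations; the helper name and the min-max scaling for display are assumptions.

```python
import numpy as np

def class_activation_map(feature_maps, class_weights):
    """Weighted sum of the last convolutional feature maps for one class.

    feature_maps:  (H, W, K) activations of the final convolutional layer
    class_weights: (K,) weights linking each pooled channel to the class score
    """
    cam = feature_maps @ class_weights  # (H, W)
    cam = cam - cam.min()               # shift so the minimum is 0
    if cam.max() > 0:
        cam = cam / cam.max()           # scale to [0, 1] for display
    return cam

# Synthetic example: channel 0 fires at one location and drives the class.
feature_maps = np.zeros((4, 4, 2))
feature_maps[1, 2, 0] = 5.0
class_weights = np.array([1.0, 0.0])

cam = class_activation_map(feature_maps, class_weights)
```

Upsampled to the input size and overlaid on the leaf image, such a map highlights the regions that contributed most to the predicted class.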
10. REFERENCE
1. https://www.kaggle.com/datasets/warcoder/mango-leaf-disease-dataset
2. Issues · animesh1012/machineLearning · GitHub