Thesis Paraphrased
Smart agriculture automates agricultural tasks in order to increase production and benefits while reducing time and costs. Technological advances in precision agriculture play an essential role here, enabling automated sensing and processing systems. However, there are still many challenges and problems to be solved.
In particular, agriculture receives significant attention in India owing to the rapid population explosion and increased food scarcity. To overcome this scarcity, accurate, computer-aided assessment of ripeness and detection of disease will help improve food quality in agriculture. Tomatoes are fruits of high nutritional value: rich in fibre and packed with vitamin C, vitamin K1, vitamin B9, and minerals. The global tomato processing market reached 43.4 million tonnes in 2021. It is important to determine the maturity level of crops before harvesting to optimise yield. However, manual inspection of ripe tomatoes requires huge labour resources and is time-consuming. The labour force required for fruit harvesting has grown over the years owing to rising demand.
Recently, some studies have attempted to evaluate the feasibility of smart agriculture involving machine learning for harvest ripeness detection. However, these works typically used small datasets, simple images without background or leaves, or explored a limited set of machine learning models. Hence, this thesis aims to address tomato ripeness detection using the deep learning model YOLOv5. Further, this research presents a novel approach leveraging deep learning models for disease detection in tomatoes using an ensemble of pre-trained models: VGG16, ResNet50, DenseNet121, InceptionV3, and Xception.
The proposed work uses three different datasets for fruit, ripeness, and disease detection respectively. The first dataset consists of 2034 colour images related to fruit detection with a resolution of 416 x 416. The second dataset contains 400 images of different resolutions, which are resized to 224 x 224 during processing. The third dataset comprises 480 colour images of varied sizes and resolutions. The created dataset is benchmarked under various environmental conditions such as occlusion, lighting, shade, and overlap.
Initially, an efficient framework is proposed that combines computer vision techniques with deep learning architectures for the detection of tomato fruit from images or video. Later, a second method based on VGG16 is proposed to assess the ripeness level of tomato fruits. Finally, an ensemble of pre-trained models is proposed for disease detection. The results were evaluated and compared using metrics such as precision, recall, F1-score, accuracy, and average precision. These results demonstrate the efficacy of the approach in accurately detecting tomato fruits, classifying fruit ripeness levels, and detecting disease symptoms with high precision. The system detects tomatoes from images or videos and distinguishes between ripe and unripe fruits while simultaneously flagging disease-afflicted ones; the proposed methods achieved 88.1% mAP for fruit detection, 97.71% accuracy in ripeness detection, and 98.54% accuracy in disease detection.
Furthermore, the research explores the integration of the deep learning model into agricultural processes, such as automated harvesting systems and quality control pipelines. By providing rapid and non-invasive assessments of fruit quality, the approach offers the potential to optimize resource allocation, reduce post-harvest losses, and ultimately contribute to more sustainable and efficient agricultural production.
CHAPTER-1
INTRODUCTION
In India, agriculture contributes around 17% of total GDP and employs more than 58% of the population. Several environmental variables,
such as rain, temperature, soil, and pathogens, have an impact on agricultural crop quality and
yield in any country. Horticulture is a branch of agriculture that deals with the cultivation of
garden crops such as fruits, vegetables, flowers, and ornamental plants. Fruits are seed-
bearing structures developed from the ovaries of flowering plants that are an excellent source
of vitamins and minerals, and they help to avoid Vitamin C and Vitamin A deficiencies. The
diversity of India's climate ensures the availability of a wide range of fresh fruits.
During the process of plant harvesting, human experts go through a tedious process of
checking and removing mature plants, making sure they aren’t affected by any disease and
are suitable for human consumption. However, the traditional visual method of identifying
the name of the disease that a certain plant is suffering from takes a long time and is
expensive, especially if the farmhouse is large and there are a lot of plants [1]. A fruit's
quality is one of the most critical factors in determining its maturity. Furthermore, with the apparent daily increase in the world's population, it is only sensible that such tasks be automated. Smart-farming systems now include sensors that monitor conditions over a complete field of plants, flying drones that dispense water and nutrients to
agricultural fields, and autonomous robots that can manoeuvre around a field and collect ripe
fruits and vegetables. The applications of data analytics, intelligent sensing, robots, and other technologies offer numerous benefits to farmers, including: 1) autonomously handling irrigation, fertilisation, and
treatments, which saves money and effort and is done more efficiently, 2) identifying regions
affected by weeds or diseases and isolating them quickly, saving the rest of the field from
harm, 3) localising regions with healthy and productive soils to enable efficient crop
distribution; 4) calculating packing and production costs based on yield estimation for proper
logistics planning; and 5) monitoring crop ripeness to determine exact harvesting schedules.
Having numerous forms of information on the land, soil, and crops immediately supports farmers' decision-making.

Over the years, technology has redefined farming, and as illustrated in Figure 1.1, technical
advancements have had a variety of effects on the agriculture sector. By 2050, the UN estimates that an additional two billion people will need to be fed, which will require a 60% increase in food production. This massive demand, however, cannot be met via conventional techniques, forcing farmers and agro firms to develop fresh strategies for raising output and cutting waste. As a result, Artificial Intelligence (AI) is progressively becoming a part of the technical development of the agriculture sector. The challenge is to raise global food production by roughly 60% by 2050 in order to feed an additional two billion people. AI-powered solutions will not only help farmers improve efficiency, but will also increase crop quantity and quality while ensuring a speedier time to market.
Figure 1.1: Artificial Intelligence in Agriculture
Tomatoes are one of the most widely produced vegetables in the world and a significant
source of income for farmers. According to the Food and Agriculture Organisation Corporate
Statistical Database (FAOSTAT) 2020 statistical report, global tomato production was
186.821 million tonnes [2]. They are herbaceous, spreading plants with woody stems that
grow to be 1-3 metres tall. Tomatoes are indigenous to Peru and Mexico and are the world's
second most important crop after potatoes. There are over 1000 tomato types cultivated in
India throughout the year; availability, however, peaks during particular seasons.
One of the most widely grown crops because of their excellent nutritional content, tomato
fruits normally range from 50 to 70 mm in diameter and weigh from 70 to 150
grammes. Furthermore, it has a long history of use in Indian cuisine, making it one of the
most versatile fruits. Salads, ketchup, purees, sauces, and other processed meals are just a few
of the dishes that include tomatoes. Lycopene, a potent antioxidant that fights cancer, is widely present in tomatoes. Carotene, the fruit's second antioxidant, which also gives it its distinctive red colour and fights cancer, is present as well.
Tomatoes are typically grown in dry states during the winter or just before the summer. In
April and May, there is typically no tomato production, which drives up the price. All dry
regions, including Gujarat, Tamil Nadu, Andhra Pradesh, Karnataka, Maharashtra, Madhya
Pradesh, and Uttar Pradesh, have high tomato production rates. Some states, like Kerala and
Himachal Pradesh, frequently experience freezing conditions, which have a negative impact
on agricultural output. Since the year 2000, Andhra Pradesh and Madhya Pradesh have
consistently produced the most tomatoes in the nation. Tamil Nadu produces 7% of the
tomatoes produced in India as a whole. Tomatoes are grown in a number of Tamil Nadu
districts, including Dindugal, Salem, Tirupur, Krishnagiri, and Dharmapuri. The greatest time
to plant tomatoes in Tamil Nadu is around Aadi Pattam, though they can be grown
throughout the year. Additionally, Udumalaipet, Gudimangalam, and Palladam are all notable tomato-growing areas.

According to the Food and Agriculture Organisation (FAO), crop diseases cause losses of 20
to 40% of total production [3]. Various tomato plant diseases can have an impact on the
amount and quality of the product, reducing productivity. Two categories can be used to
classify diseases [4]. The first class of diseases is linked to contagious microorganisms such as bacteria, fungi, and viruses. When conditions are favourable, these diseases can quickly spread from plant to plant in the field. The second category of diseases is brought on by non-infectious factors such as nutrient deficiencies and unfavourable environmental conditions.
Tomatoes are a widely produced vegetable crop all over the world. The scientific
name for tomato is Solanum lycopersicum. It is a relatively short-duration crop with a high
yield that is economically appealing, with an increase in area under cultivation day by day.
After China, India is the world's second-largest producer of tomatoes. Other major
competitors in the tomato market are the United States of America, the European Union, and
Turkey. These top five tomato producers account for almost 70% of global production.
Tomato processing is one of the most diverse agricultural sectors on a global scale.
This is due to the fact that tomatoes are nutritious and good for human health. Because of this,
tomatoes are consumed all over the world. Tomato production is substantially higher than that
of other crops farmed around the world. It is six times that of rice and three times that of
potatoes [5].
The common fruit tomato is very nutrient-dense, with high levels of fibre, vitamin C,
vitamin K1, vitamin B9, and minerals. Tomatoes typically come in a variety of maturities and
shades, including red, yellow, and green. Green denotes unripe fruit, whereas red and yellow
denote ripe fruit [6]. In 2021, the total market for tomatoes used for processing was 43.4
million tonnes. Before harvesting, it is crucial to assess the crops' maturity stage to maximise
output. It takes a long time and a lot of labour resources to manually inspect ripe tomatoes,
though. Because of rising demand, the labour force for tomato harvesting has grown over the
years. The global tomato processing industry is anticipated to reach 54.5 million tonnes by
2027 [7]. As a result, tomato cultivation is critical in rural and suburban areas of developing countries.

It is easily affected by many diseases, and this condition severely affects the quality
and yield of tomatoes and causes substantial financial losses. The tomato plant has a lifespan
of around 120 days. The flowering or fruiting stage occurs at approximately 45-50 days of the
life of the tomato plant. The life cycle of the tomato plant is shown in Figure 1.2.
Figure 1.2: Tomato Life Cycle
The tomato life cycle begins with the planting of seeds, followed by germination,
growth into a sprout, seedling, and finally development into a plant. After the plant reaches
maturity, the flowering stage will begin, then the fruiting stage. The seeds of the ripe fruits
are also employed in the following life cycle. The tomato plant has started the process of fruit
production when yellow blossoms appear on it. After the tomato plant's blooms have opened,
the amount of time needed for the fruit to ripen depends on the variety and various
environmental factors. When a tomato plant is established and in the flowering or fruiting
period, the disease tends to strike more frequently. Time management and disease prediction
will help farmers produce more effectively and prevent output loss.
In horticulture, the cost of picking accounts for the largest portion of the total cost, exceeding that of inputs such as water and chemical fertiliser. The speed, cost, and safety of picking all have a direct impact on the
final yield and quality of fruit production. As a result, more harvesting robots are being
employed in the fruit sector to cut picking costs and increase fruit quality [8]. Detection,
picking, localisation, categorisation, selection, and grading are some of the duties that harvesting robots must perform. Among these tasks, object detection is the most important [9]. It is natural for human beings to recognise a variety of fruit objects in their environment. However, it is difficult for robots to solve this challenge. The initial objective for robots is to learn about their surroundings. In general, the image contains the most crucial information.
Images of the outside world can be obtained by a variety of means [10]. Several approaches have been used to acquire images over the years, employing digital cameras of various types, such as universal cameras, depth cameras, and near-infrared cameras. In recent years, the speed and resolution of the universal cameras in our cell phones have improved significantly. Despite the fact that the majority of agricultural robots, in particular fruit harvesting systems, use computer vision to identify fruit targets, accurate fruit recognition is still a research challenge: it is difficult to build a system that is both accurate and able to recognise fruit fast, especially when there are overlapping fruits or significant leaf occlusions.
Fruit ripening is a highly coordinated, genetically programmed process that occurs during the final stage of fruit development and involves a series of physiological and biochemical changes that result in the development of an edible, ripe fruit with desirable quality parameters to attract agents that disperse seeds [12,13,14]. Colour (loss of green colour and development of yellow, orange, red, and other colours depending on species and cultivar), firmness (softening by cell-wall-degrading activities), taste (increase in sugars and decrease in organic acids), and flavour (production of volatile compounds giving the characteristic aroma) are the main changes connected with ripening.
Maturity determines the ripening and storage conditions of a specific vegetable [15].
So, the objective of this work is to identify the maturity status of tomatoes as immature and
mature utilising computer vision techniques, specifically deep learning. Computer vision is a
new approach in food and agriculture that can help solve practical difficulties such as
automatic sorting, grading, categorisation, and recognition. Such procedures have surpassed traditional manual inspection in speed and consistency. One such method is deep learning, which is a branch of machine learning. Deep learning
holds a prominent position in the field of research due to its capability to automatically
extract features from a variety of data [16]. The use of deep learning is rapidly gaining
popularity in the field of image processing for image recognition and categorization. The
convolutional neural network, which is based on the artificial neural network, is the
foundation of deep learning designs for image classification. Several architectures have
become more and more popular in recent years. Among them are AlexNet, LeNet, VGG,
Inception, ResNet, and so on. These architectures are pre-trained on massive data sets and offer improved performance over the ones that were previously used. The use of these pre-trained architectures for tackling the intended goal is known as deep transfer learning [17]. To determine tomato
maturity status through images, deep transfer learning is applied in this study.
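To make the deep transfer learning idea concrete, the short Python sketch below (a minimal illustration assuming TensorFlow/Keras; the random arrays stand in for a real labelled tomato dataset and are not the data used in this thesis) freezes an ImageNet-pre-trained VGG16 base and trains only a small classification head for three maturity classes:

import numpy as np
import tensorflow as tf
from tensorflow.keras.applications import VGG16

# Convolutional base pre-trained on ImageNet, with the original classifier removed.
base = VGG16(weights="imagenet", include_top=False, input_shape=(224, 224, 3))
base.trainable = False  # freeze the pre-trained layers (transfer learning)

# Small trainable head for three hypothetical classes:
# immature / partially mature / mature.
model = tf.keras.Sequential([
    base,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(3, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# Random arrays stand in for real tomato images and maturity labels.
x = np.random.rand(8, 224, 224, 3).astype("float32")
y = np.random.randint(0, 3, size=8)
model.fit(x, y, epochs=1)

Fine-tuning, as used later in this work, would additionally unfreeze some of the top convolutional layers and continue training with a small learning rate.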
Tomato ripeness can be determined by its surface characteristics [18]. As a result, the
degree of tomato ripeness is assessed by analysing the surface properties of tomatoes with deep transfer learning techniques. The contribution of this work is the use of transfer learning for assessing tomato maturity from these surface characteristics.
When immature or unripe fruits are harvested, the quality of these fruits is poor. They are
usually incapable of ripening. Immature fruits are prone to internal deterioration and
decomposition. Similarly, if fruits are harvested late, the chances of acquiring rotten fruits are
very significant. As a result, incorrect harvest timing will result in significant postharvest
loss. It is critical to evaluate the maturity status of the fruit in order to avoid qualitative and
quantitative losses of preharvest and postharvest fruits. One of the most significant
difficulties confronting the tomato agriculture sector is the loss of tomato quality. Farmers
typically use their personal experience to determine the type of disease and maturity level of
tomatoes. The level of maturity of the tomato crop is determined through manual inspection.
This, in turn, leads to an inconsistent reliance on manual labour. Smart harvesting has been proposed to overcome these limitations.
The current challenges of fruit detection and recognition based on DL for automatic
harvesting are the scarcity of high-quality fruit datasets, detection of small target fruits,
detection in occluded and dense scenarios, detection of multiple scales and multiple species
of fruits, and lightweight fruit detection models. The detection and recognition performance
is heavily influenced by the quality and scale of fruit datasets, appropriate improvement
methodologies, and underlying model architectures. For example, fruit processing can
standardise data by cleaning and modifying it. Fruit data augmentation can successfully
expand data and improve data diversity, lowering reliance on specific characteristics and
improving model robustness. Fruit feature fusion is beneficial in reducing the problem of fruit
feature disappearance and improving the detection effect of small target fruits and multi-scale
fruits. The original fruit detection framework is excellent for acquiring more fruit information.
The development of novel technologies for early detection, identification, and
mapping of fruit diseases would lower the cost, damage, and time required to monitor and
control the diseases. Early diagnosis of fruit diseases boosts productivity by allowing for timely preventive rather than curative treatments, since unhealthy plants may show disease signs only when it is too late for such treatments to work. Furthermore, early disease detection eliminates the need to use excessive amounts of pesticides and chemicals to
manage them, ensuring that the risks of contaminating ground water and accumulating toxic
residues in agricultural products due to excessive pesticide and chemical use can be avoided
[20,21].
The primary method for identifying and classifying agricultural plant diseases is
through farmer observation with their naked eyes, followed by chemical tests. Farmers in
developing countries may not be able to keep an eye on every plant, every day, due to the size
of the farming land. Non-native diseases go unnoticed by farmers. It could take a lot of time
and money to consult experts on this. Furthermore, the unnecessary use of pesticides may be
hazardous and noxious to natural resources such as water, soil, air, and the food chain. Hence, plant diseases must be detected in a timely and accurate manner utilising current technological knowledge. In any other case, incorrectly diagnosing a
plant disease will result in a significant loss of time, hard cash, labour, yield, and product
quality and value. Although manual disease recognition is effective, ambient conditions may change and lead the prediction in the wrong direction. By
utilising technological advancement, we may utilise image processing to detect tomato fruit maturity and diseases.
1.7 Objective of the research
The core objective of this research is to develop an efficient framework for ripeness and
disease detection of tomato fruits. It also intends to provide a full pipeline for tomato detection, ripeness classification, and disease detection based on deep learning techniques. Further, to achieve the objective, the following sub-objectives are followed: Firstly, a comprehensive review of existing fruit ripeness and disease detection approaches is conducted.
Secondly, the tomato fruit dataset is collected and pre-processed for training and testing. Thirdly, a deep learning model is developed to classify the tomatoes based on ripeness using a transfer-learning and fine-tuning strategy. Next, a tomato fruit detection model, based on YOLOv5 and an attention mechanism, is proposed for the detection and classification of tomato fruits.
Finally, the framework includes an ensemble model for disease detection based on pre-
trained models. The ultimate goal of this research is to provide a reliable and efficient
framework for tomato fruit ripeness and disease detection while overcoming the
challenge of significant fruit overlap in tomato fruit detection. This solution will leverage advanced CNN models to improve state-of-the-art maturity and disease detection of tomatoes.
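As an illustration of the YOLOv5-based detection objective (a generic sketch using the public PyTorch Hub model, not the CAM-YOLO model proposed in this thesis; the weight file name best.pt is hypothetical), detection on a single frame looks like:

import numpy as np
import torch

# Generic pre-trained YOLOv5s via PyTorch Hub; custom-trained weights would be
# loaded with torch.hub.load("ultralytics/yolov5", "custom", path="best.pt").
model = torch.hub.load("ultralytics/yolov5", "yolov5s")

frame = np.zeros((640, 640, 3), dtype=np.uint8)  # stand-in for a camera frame
results = model(frame)
results.print()                                  # per-class counts and speed
for *xyxy, conf, cls in results.xyxy[0].tolist():
    print(f"class={int(cls)} conf={conf:.2f} box={xyxy}")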
The scope of the research comprises a proposal for an efficient framework for tomato
ripeness and disease detection that takes advantage of object detection and reuses learnt
models. Initially, the framework is implemented by proposing the CAM-YOLO model, which
is based on YOLOv5 and an attention mechanism. Later on, it incorporates the development of a deep CNN model based on VGG16 as well as transfer learning for maturity detection. Finally, the scope of the research includes an ensemble model for disease detection.

Chapter 1 provides a brief introduction to the background and significance of the detection
of tomato fruit and diseases in agriculture. It outlines the challenges faced in this area, and
how the proposed research can contribute to addressing these challenges. Moreover, this
chapter provides a clear statement of the research questions, objectives, and hypotheses to be investigated. Chapter 2 discusses the related work on fruit detection and classification, and disease detection. This discussion provides a foundation for the proposed research and helps to identify the gaps in the existing literature that our research aims to address. Chapter 3 describes the deep
learning model for tomato ripeness classification based on the VGG16 model. Chapter 4
depicts the detection model based on the YOLOv5 model for tomato fruit detection and classification. Chapter 5 illustrates the pre-trained deep learning models and ensemble models for accurate
tomato fruit disease detection. Finally, Chapter 6 concludes the work by summarizing the key
findings and contributions of the research and discussing the limitations and future directions
of our proposed approach. It also reflects on the broader impact of the research on the field of smart agriculture.
CHAPTER 2
LITERATURE REVIEW
2.1 INTRODUCTION
Accurate and rapid tomato fruit maturity and disease detection is critical for increasing its
long-term production for agriculture. In the conventional technique, human experts in the
field of agriculture have been relied upon to determine the maturity of fruit and anomalies
in tomato plants caused by pests, diseases, climatic conditions, and nutritional deficiencies.
Automatic tomato ripeness and disease identification is initially solved through conventional
image processing and machine learning approaches, which result in lower accuracy. In order to overcome this limitation, deep learning approaches have been adopted. The literature investigated provides an overall review of research work carried out in the field of
fruit ripeness and disease identification using image processing, machine learning, and deep
learning approaches. Furthermore, the chapter summarises the related works and the gaps identified in the literature.

Zhang & McCarthy (2012) [22] proposed magnetic resonance imaging (MRI) to determine
tomato maturity using structural and colour parameters. The statistical properties were
determined using a region of interest (ROI) corresponding with the tomato's pericarp. To
determine the ripeness, the partial least square discriminant analysis (PLS-DA) was applied
to a total of 48 image features detected by five MR scans of 144 tomatoes. MRI is a non-
destructive imaging technique that makes use of the magnetic properties of nuclei as well as
their interaction with radio frequency (RF) and the magnetic field. Cross-validation was
utilized by splitting the dataset into two. One subset was used for training data, while another
was used for testing data. The RMSE of cross-validation was 0.302 in the red stage. Using
colour characteristics, the PLS-DA achieved 90% accuracy; sensitivity and specificity were also assessed.
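For readers unfamiliar with PLS-DA, the technique can be reproduced with scikit-learn by regressing one-hot encoded class labels with partial least squares; the sketch below uses random stand-in data rather than the MRI features of the cited study:

import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import train_test_split

# Hypothetical feature matrix (48 image features) and three ripeness stages.
rng = np.random.default_rng(0)
X = rng.normal(size=(144, 48))
y = rng.integers(0, 3, size=144)

Y = np.eye(3)[y]  # one-hot encode the classes: PLS regression becomes PLS-DA
X_tr, X_te, Y_tr, Y_te, y_tr, y_te = train_test_split(X, Y, y, random_state=0)

pls = PLSRegression(n_components=5).fit(X_tr, Y_tr)
pred = pls.predict(X_te).argmax(axis=1)  # class = largest predicted score
print("accuracy:", (pred == y_te).mean())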
El Bendary et al. (2015)[23] developed a fruit grading method that detects tomato maturity
using multi-level classification. Principal component analysis (PCA) and linear discriminant
analysis (LDA) were also utilised with the SVM classifier to improve performance. This
system is utilised to conduct five different categorization levels, including green and breaker,
turning, pink, mild red, and red. The data set used included 250 JPEG photos with
dimensions of 3664 × 2748 pixels. In terms of ripeness classification, the LDA classification
method was 84% accurate. A one-against-all (OAA) multi-level SVM with a linear kernel yielded further improvement.
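A multi-level colour-based grading pipeline of this kind can be sketched as a scikit-learn pipeline chaining PCA, LDA, and an SVM (a generic illustration on random stand-in features, not El Bendary et al.'s exact configuration):

import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

# Hypothetical colour features for five ripeness classes.
rng = np.random.default_rng(1)
X = rng.normal(size=(250, 64))
y = rng.integers(0, 5, size=250)

# PCA reduces dimensionality, LDA maximises class separability,
# and a linear SVM performs the final multi-level classification.
clf = make_pipeline(PCA(n_components=20),
                    LinearDiscriminantAnalysis(n_components=4),
                    SVC(kernel="linear", decision_function_shape="ovr"))
print("CV accuracy:", cross_val_score(clf, X, y, cv=5).mean())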
Using artificial neural networking, Rafiq et al. (2015)[24] quantified the quality features of
agricultural commodities based on their colour and size. Three feed-forward neural network
models (NN) were created: one for converting RGB to L*, a*, and b* values (NN1), one for
determining the stage of tomato ripeness, and one for connecting tomato projection area/size
to weight. Results showed that NN1 could convert RGB to L*, a*, and b* values with a 99%
accuracy rate. However, NN2 classified tomatoes in three stages of ripening with an accuracy
of 96%, with 30 hidden neurons, and a 100% classification was performed when a threshold
value of 0.7 was used. The best results showed excellent abilities at 30 hidden units with R2
of 0.9980 and mean squared error (MSE) of 0.00021. Additionally, NN3 could link the projection area with fruit weight.

Another study [25] classified maturation stages using machine learning, which included training various algorithms such as Decision Tree, Logistic Regression, Gradient Boosting, Random Forest, Support Vector Machine, K-NN, and XGBoost. This system collects images, extracts features, and
trains classifiers on 80% of the entire data. The remaining 20% of total data is used for
testing. It is evident from the results that the classifier's performance is influenced by the
quantity and type of features that are taken from the data set. Accuracy Score, Learning
Curve, and Confusion Matrix are used to represent the results. It has been noted that Random
Forest, out of the seven classifiers, performs well with an accuracy of 92.49%, likely owing to its strong data-handling abilities. Because it cannot be trained on a big data set, the Support Vector Machine performed comparatively poorly.
The backpropagation neural network (BPNN) classification algorithm and the feature colour
value were used by Wan et al. (2018) [26] to propose a method for identifying the three
maturity levels (green, orange, and red) of fresh market tomatoes (Roma and Pear types). To
gather the tomato images in the lab, a maturity detecting device based on computer vision
technology was created. The tomato targets were obtained based on the image processing
technology after the tomato images were processed. The area for extracting the colour
features was then determined to be the largest inscribed circle of the tomato's surface. Five
concentric circles (sub-domains) were used to partition the colour feature extraction area. The
feature colour values were taken from the average hue values of each sub-region and used to
represent the samples' maturity level. Then, in order to determine the maturity of the tomato
samples, the five feature colour values were imported to the BPNN as input values.
According to an analysis of the data, this method has an average accuracy of 99.31% for
identifying the three tomato sample maturity levels, with a standard deviation of 1.2%.
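The feature colour values used by Wan et al. can be approximated in a few lines of OpenCV-based Python (a simplified sketch on a synthetic image; the original work extracts the rings from the largest inscribed circle of a real tomato image):

import cv2
import numpy as np

# Synthetic stand-in for a cropped tomato photograph.
rng = np.random.default_rng(0)
img = rng.integers(0, 256, size=(200, 200, 3), dtype=np.uint8)

hsv = cv2.cvtColor(img, cv2.COLOR_BGR2HSV)
h, w = hsv.shape[:2]
cy, cx = h // 2, w // 2
radius = min(h, w) // 2  # crude stand-in for the inscribed circle

# Average hue inside five concentric rings, mimicking the five sub-domains.
yy, xx = np.mgrid[0:h, 0:w]
dist = np.sqrt((yy - cy) ** 2 + (xx - cx) ** 2)
features = []
for i in range(5):
    ring = (dist >= i * radius / 5) & (dist < (i + 1) * radius / 5)
    features.append(float(hsv[..., 0][ring].mean()))
print("feature colour values:", features)  # these would feed the BPNN as inputs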
Garcia et al. (2019) [27] recommended a machine learning method for automatically
determining tomato maturity using the Support Vector Machine (SVM) classifier and the
CIELab colour space. The dataset utilised for modelling and validation experiments in a 5-
fold cross-validation technique was made up of 900 images gathered from a farm and several
image search engines. Experiment findings indicated that the proposed method was
successful in ripeness categorisation with 83.39% accuracy when dividing tomatoes into six maturity classes.
Wu et al. (2019) [28] provided an improved technique for combining various characteristics,
feature analysis and selection, a weighted relevance vector machine (RVM) classifier, and a
bi-layer classification strategy to create a unique automated system for recognising ripening
tomatoes. The algorithm employs a two-layer approach to operation. Using the knowledge of
the colour difference, the first-layer classification technique seeks to locate tomato-containing
regions in the images. The second classification method is based on a classifier that has been
trained using information from multiple feature types. In the suggested technique, the processed images are divided into 9 × 9 pixel blocks, which simplifies calculations and increases the
effectiveness of recognition. These blocks, rather than individual pixels, are regarded as the
fundamental units in the classification task. Five textural properties (entropy, energy,
correlation, inertial moment, and local smoothing) and six color-related features (Red (R),
Green (G), Blue (B), Hue (H), Saturation (S), and Intensity (I) components, respectively)
were recovered from pixel blocks. The iterative RELIEF (I-RELIEF) method was used to
examine relevant characteristics and their weights. A weighted RVM classifier was used to
categorise the image blocks based on the relevant attributes that were chosen. The final
tomato recognition results were calculated by merging the results of the block classification
with the bi-layer classification technique. On 120 images, the algorithm achieved a detection
accuracy of 94.90%.
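The textural properties named above are typically computed from a grey-level co-occurrence matrix (GLCM); a brief scikit-image sketch for a single 9 × 9 block (simplified relative to the weighted-RVM pipeline of the study) is:

import numpy as np
from skimage.feature import graycomatrix, graycoprops

# Hypothetical 8-bit grey-level pixel block.
rng = np.random.default_rng(2)
block = rng.integers(0, 256, size=(9, 9), dtype=np.uint8)

# Co-occurrence matrix for horizontal neighbours at distance 1.
glcm = graycomatrix(block, distances=[1], angles=[0], levels=256,
                    symmetric=True, normed=True)
for prop in ("energy", "correlation", "contrast", "homogeneity"):
    print(prop, graycoprops(glcm, prop)[0, 0])

# Entropy is not provided by graycoprops, so compute it directly.
p = glcm[:, :, 0, 0]
print("entropy", -np.sum(p[p > 0] * np.log2(p[p > 0])))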
Azarmdel et al. (2020) [29] used ANN and SVM classifiers to assess a fruit grading system.
This technique was used to categorise ripeness into three levels: unripe, ripe, and overripe. In
order to create the dataset, 577 mulberries were used. The correlation-based feature selection
subset (CFS) and consistency subset (CONS) were used to extract the various characteristics,
such as texture, colour, and geometrical aspects. The classification of the fruits made use of
both ANN and SVM ideas. This approach used ANN and CFS with 20 neurons in the hidden
layer, and it was 99.13% accurate in grading. Additionally, by utilising the same ANN and
CFS, 100% sensitivity and specificity were reached. But the accuracy of the ANN with
CONS and the hidden layer of 15 neurons was only 98.26%. However, the sensitivity and
specificity offered by the ANN-CFS and ANN-CONS were the same. Additionally, this
system was tested using SVM-CFS and SVM-CONS with several kernel functions, including
the radial basis function (RBF), linear, and polynomial. The accuracy of 99.12% was
achieved by the SVM-CFS with RBF, and 98.25% by the SVM-CONS with RBF. With
SVM-CFS and RBF, the MSE during training was 0.009, while with SVM-CONS and RBF it was slightly higher.
Raghavendra et al. (2020) [30] developed a mango grading system and determined ripeness
using multiple classifiers such as L*a*b colour space, RGB, HSV, Gabor colour feature, and
other texture characteristics. The texture features utilised included the local binary pattern (LBP), and the mangoes were graded into four classes: under-ripe, perfectly ripe, over-ripe, and black-spot. The dataset comprised 230 mangoes
taken from agricultural plantations in Mysore, Karnataka. The 60:40 ratio was employed as a
training and testing subgroup. For the purposes of the experiment, many categorization
schemes were employed. Among the several classifiers, the SVM classifier had the greatest
accuracy of 99.09%. Other traditional classifiers, such as Naive Bayes (NB), K-NN, linear discriminant analysis (LDA), probabilistic neural network (PNN), and threshold classifiers, achieved comparatively lower accuracies.
In a research on classifying the maturity of fruits, Hermana et al. (2021) [31] employed 9000
training images of the fruits apple, orange, mango, and tomatoes. The data were trained using
VGG16 models with a transfer learning approach utilising 200 epochs. To minimise over-
fitting, data augmentation techniques were utilised to generate additional data. The same
MLP is applied to the top layer of models with a variety of parameters, and data is translated
from RGB to L * a * b in order to serve as a colour descriptor for the fruit. With a dropout of
0.5, four frameworks with various approaches were utilised, and the average accuracy rate for
all four was 92%, demonstrating the most impressive performance of all.
Deep learning techniques were used by Rivero et al. (2022) [32] to grade the banana layers
into distinct categories in a non-intrusive manner. Fruits were sorted using a tier-based
system, with grades given for maturity, quality and size. The quality was divided into three
categories: export, midrange, and rejections. Maturity factors included green, yellowish,
yellow, and over-ripe, whilst size characteristics included small, medium, and big for the fruit
grading system. The VGG16 architecture with the transfer learning approach was utilised for
grading systems on a self-created banana image dataset, and it obtained 98% accuracy.
Deep transfer learning methods were employed by Huynh, Danh Phuoc, et al. (2021) [33] to
explore the categorization of tomatoes. They used the transfer learning approach on three
already-trained CNNs models—VGG16, VGG19, and ResNet101—to lessen the need for a
big dataset and the computing expense of the deep learning model. 1374 tomato-related
images were gathered from Fruits-360 and sorted into three classes: green, yellow, and red.
According to the experimental findings, the VGG19 model was able to assess the degree of ripeness most accurately.
Tomato production has significantly expanded recently, and the market is highly competitive.
Through tomato maturity grading, the market price is established. Visual examination of the
colour, texture, size, shape, and flaws of the tomatoes is typically used to determine their
ripeness. The expense of external quality control and labour is significant, though, and human judgement is prone to error. Cutting-edge CNNs with transfer-learning classifiers, including ResNet152V2 and AlexNet, were therefore compared for classifying tomatoes depending on their maturity levels. From a NY/T 940-2006 tomato image dataset, 233 colour
images with a resolution of 640 x 480 pixels were tested. Colour richness was calculated by
separating mature and rotten tomatoes. The image dataset is separated into two classes based
on tomato colour as the major characteristic: mature and rotten. There were 233 training and
40 testing image datasets. The comparison showed that ResNet152V2 produced the maximum classification precision, with the best training accuracy of 100% and the best testing accuracy among the compared models.
Rajat et al. (2022)[35] assessed the freshness of banana fruit in order to extend the cropping
period and avoid harvesting either under-matured or over-matured bananas. The study used several pre-trained CNN models, including VGG16 and DenseNet121. The dataset used comprises 300 images that were increased to a total of 2369 images by using various augmentation techniques. When compared to the other models, VGG16 obtained the greatest accuracy of 98.73% in categorising bananas into the respective maturity classes.
Begun et al. (2022)[36] developed a deep transfer learning algorithm for tomato ripeness
identification. Using this approach, tomatoes were automatically categorised into three
maturity classes: immature, partially mature, and mature. Several transfer learning models,
including VGG16, VGG19, InceptionV3, ResNet101, and ResNet152, have been used to
solve the specified objective of classification. During training, the current architecture's
layers are frozen, and an extra classifier is introduced to train the prepared dataset. The
models were tested repeatedly with varied epoch numbers and batch sizes. With an accuracy
of 97.37%, the VGG19 performed best at epoch 50 and batch size 32.
Fruit picking by humans is a time-consuming, tedious, and expensive task. Fruit harvesting automation has become very popular over the past ten years for this reason. The capacity of a tomato harvester robot to recognise and locate ripe tomatoes on a plant is one of the key obstacles in designing such a machine, since tomato fruits do not ripen at the same time. A
novel segmentation method was created by Arefi (2011) [37] utilising a machine vision system
to guide a robot arm in picking a ripe tomato. A vision system was utilised to collect images
from tomato plants that were then adapted to the lighting conditions of the greenhouse. Under
greenhouse lighting settings, 110 colour images of tomatoes were captured. The method
created runs in two steps: (1) eliminating the background in RGB colour space and then extracting the ripe tomato using a mix of RGB, HSI, and YIQ spaces, and (2) localising the ripe tomato using image morphological characteristics. The suggested method was 96.36%
accurate overall.
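A toy version of such colour-space-based ripe-tomato segmentation can be written with OpenCV (assumed HSV thresholds for red, applied to a synthetic image, not Arefi's exact RGB/HSI/YIQ rule):

import cv2
import numpy as np

# Synthetic stand-in for a greenhouse photograph.
rng = np.random.default_rng(3)
img = rng.integers(0, 256, size=(240, 320, 3), dtype=np.uint8)
hsv = cv2.cvtColor(img, cv2.COLOR_BGR2HSV)

# Red wraps around hue 0 on OpenCV's 0-179 hue scale, so combine two ranges.
mask = cv2.bitwise_or(cv2.inRange(hsv, (0, 80, 60), (10, 255, 255)),
                      cv2.inRange(hsv, (170, 80, 60), (179, 255, 255)))

# Morphological opening removes small noise; the largest blob is the candidate fruit.
kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (7, 7))
mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN, kernel)
contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
if contours:
    x, y, w, h = cv2.boundingRect(max(contours, key=cv2.contourArea))
    print("candidate ripe tomato at", (x, y, w, h))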
Yamamoto et al.(2014) [38] presented an image processing approach for detecting tomato
fruits in various growth phases, including ripe, immature, and young fruits, for tomato
detection and segmentation. To begin, pixel-based segmentation is used to divide the pixels
into several groups such as fruits, leaves, stems, and backdrops. The misclassifications
created by the first stage are then removed using blob-based segmentation. Finally, they
employed K-means clustering to identify each fruit in a cluster. In summary, the results of
fruit detection showed that the developed approach obtained accurate detection performance, despite the fact that recognition of immature fruits is extremely challenging owing to their colour similarity to the surrounding leaves.
Xiong et al. (2014) [39] employed the K-means clustering technique to segment citrus fruit.
The fruit location was determined as a result of image segmentation. However, the
segmentation impact of this approach is not optimum when the environment is complex.
Traditional machine vision algorithms are frequently unstable, and the localisation of fruit is often inaccurate in complex scenes. The key to robust fruit recognition is reducing the influence of two main disturbances: illumination and
overlapping. Zhao et al. (2016) [40] presented a robust tomato detection system based on
various feature images and image fusion to recognise the tomato in the tree canopy using a
low-cost camera. First, using the L*a*b* colour space and the luminance, in-phase,
quadrature-phase (YIQ) colour space, two unique feature images, the a*-component image
and the I-component image, were recovered. Second, wavelet transformation was used to
merge the feature information of the two source images by fusing the two feature images at
the pixel level. Third, an adaptive threshold technique was employed to determine the
appropriate threshold for segmenting the target tomato from the background. The final
segmentation result was subjected to morphological processing to remove some noise. In the
detection tests, 93% of the target tomatoes were identified out of 200 total samples.
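The a*-component extraction and adaptive thresholding used by Zhao et al. can be approximated with a few OpenCV calls (a simplified sketch on a synthetic image that replaces the wavelet-based fusion step with Otsu thresholding of the a* channel alone):

import cv2
import numpy as np

rng = np.random.default_rng(4)
img = rng.integers(0, 256, size=(240, 320, 3), dtype=np.uint8)  # stand-in photo

lab = cv2.cvtColor(img, cv2.COLOR_BGR2LAB)
a_channel = lab[:, :, 1]  # a*: green (low values) to red (high values)

# Otsu's method picks the threshold automatically; red tomatoes have high a*.
_, binary = cv2.threshold(a_channel, 0, 255,
                          cv2.THRESH_BINARY + cv2.THRESH_OTSU)

# Morphological processing removes residual noise, as in the original pipeline.
kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (5, 5))
clean = cv2.morphologyEx(binary, cv2.MORPH_OPEN, kernel)
print("foreground pixels:", int((clean > 0).sum()))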
Sa et al. (2016) [41] used deep convolutional neural networks to develop the fruit detecting
system. The system is based on the Faster-RCNN model and employs two modalities: colour
(RGB) and near-infrared (NIR). For merging multi-modal (RGB and NIR) information, early
and late fusion approaches are investigated. This results in a unique multi-modal Faster R-
CNN model that produces state-of-the-art outcomes compared to past work using the F1-
score, which takes into consideration both accuracy and recall performances, increasing from
0.807 to 0.838 for sweet pepper recognition. In addition to better accuracy, this approach is
substantially faster to deploy for new fruits since it requires bounding box annotation rather
than pixel-level annotation. The model was retrained to recognise seven fruits, with the entire process requiring four hours to annotate and train the new model for each fruit.
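For reference, a present-day Faster R-CNN detector can be exercised in a few lines with torchvision (a generic COCO-pretrained sketch on a random tensor, not the multi-modal RGB+NIR model of the cited work):

import torch
import torchvision

# COCO-pretrained Faster R-CNN; fruit detection would fine-tune the box head.
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
model.eval()

img = torch.rand(3, 480, 640)  # random stand-in for an RGB fruit image in [0, 1]
with torch.no_grad():
    (pred,) = model([img])     # one prediction dict per input image

for box, score, label in zip(pred["boxes"], pred["scores"], pred["labels"]):
    if score > 0.5:
        print(int(label), float(score), box.tolist())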
Based on the Faster R-CNN approach, Bargoti et al. (2017) [42] created a fruit detecting
model in orchards. To further understand the practical deployment of such a system, ablation
studies were done on three orchard fruit types: apples, mangoes, and almonds. A comparison
of detection performance against the number of training images revealed the quantity of
training data necessary to achieve convergence. Transfer learning analysis revealed that
moving weights between orchards did not result in substantial performance increases over a
network started straight from the highly generalised ImageNet features. Data augmentation
approaches such as flipping and scaling were found to increase performance with varied
numbers of training images, resulting in similar performance with less than half the amount
of training images. In their report, they achieved an F1-score of more than 90%. The majority
of the missed fruits occurred when the fruits appeared in dense clusters.
Xiong et al. (2018) [43] employed the Faster R-CNN to recognise green citrus in the natural
environment. The mAP of the training model on the test set was 85.49% under a configuration of learning rate 0.01, batch size 128, and momentum 0.9. Fu et al. (2018)
[44] introduced a deep learning recognition approach based on LeNet CNN for the detection
of multi-cluster kiwi fruit in the field. The identification rates for occluded fruits, overlapping
fruits, neighbouring fruits, and independent fruits were 78.97%, 83.11%, 91.01%, and
94.78%, respectively.
Koirala et al. (2019) [45] employed the Mango-YOLO model, which is built on YOLOv3 and
YOLOv2(tiny), to recognise mango fruit in real time, achieving strong detection performance on a day-time mango image dataset.

For the purpose of identifying and locating ripe kiwifruit, Fu et al. (2019) [46] used image
processing methods. The method was computationally expensive and less reliable since it relied on hand-crafted features. In general, it is challenging to attain high precision using classical detection methods, and they are extremely difficult to create and generalise, particularly when features are lacking or the sample size is small. It is also difficult to use these approaches in real-world circumstances.
Using computer vision and deep neural networks, Huang et al. (2019)[47] suggested a high-
throughput method for fruit recognition, localisation, and measurement from video streams. A
flexible technique that may be used for several sorts of fruits is suggested here, in contrast to
prior works that were designed for a particular type of fruit. This method makes use of a
vision system to scan through plants row by row in a greenhouse. Fruit recognition and
localisation on video frames are performed using a real-time object detection technique that
utilises the deep neural network-based detector YOLOv2 with an 84.98% success rate. The
video feed is used to track various fruits using an individual fruit tracking algorithm. The
online tracking method combines feature matching, optical flow, and projective
transformation, which are all enhanced by occlusion handling strategies including using
threshold indices and denoising. The offline tracking algorithm, on the other hand, employs a
voting approach to eliminate false alarms generated by the object detector. Finally,
phenotyping data like as fruit counts, ripening stage, fruit size, and 2D geographic
distribution maps were collected. The suggested framework demonstrated its efficacy in the greenhouse setting.
Image recognition techniques frequently misinterpret tomatoes in neighbouring spots as a
single tomato. Hu et al. (2019) [48] proposed a single ripe tomato identification approach that
combines intuitionistic fuzzy set (IFS) with Faster R-CNN image detection. In comparison to
previous approaches, the suggested method offers various advantages. To begin, ripe
tomatoes are marked in a large number of images in various configurations (e.g., separated,
adjacent, overlapping, and shaded) to train the Faster R-CNN detector. The trained Faster R-
CNN classifier is then used to identify suitable ripe tomato areas in images. The results
demonstrated that the trained Faster R-CNN classifier is capable of accurately and speedily
localising potential ripe tomato regions. The proposed tomato region's RGB colour system
was then converted to an HSV colour scheme. To acquire the candidate tomato body from
single tomatoes recognised by Faster R-CNN, different tomato samples were manually
segmented, and the Gaussian density function was established to eliminate the background.
The unnecessary subpixels are removed from the tomato binary map using morphological
processing, and related tomatoes are separated to eliminate the excess contour discovered
through edge detection. The IFS edge detection approach is then used to retrieve the edge,
and a contour detection method is then applied to connect the edge breakpoints and eliminate
unnecessary edge points. This method is finally recommended as a way to further identify
tomatoes. Despite the complexity of the greenhouse tomato images used—which contain
tomatoes that are close together, overlap, and are obscured—the approach nevertheless
managed to reach an AP of almost 80%. Using the suggested approach, the tomato width and
height had RMSE results of 2.996 pixels and 3.306 pixels, respectively. The horizontal and
vertical centre position shifts' respective mean relative error percent (MRE%) values were also reported to be low.
Liu et al. (2020)[49] applied the YOLO-tomato detector to identify tomatoes. This technique
is based on YOLOv3 and uses two approaches: feature extraction via dense architecture and
replacement of the R-Bbox with the C-Bbox. However, the approach did not use contextual information around the fruit regions.
In Indonesia, human labour is still used for manual fruit inspection and sorting procedures.
The manual procedure consumes a significant amount of time and energy and is prone to
errors. The automation process should be able to replace manual procedures with automated
ones with the use of computer vision in order to cut costs and improve accuracy and
efficiency. An approach employing the YOLOv4 algorithm was proposed by Widyawati et al. (2021) [50]. The algorithm is used to determine the maturity of bananas automatically. The training procedure uses 369 banana images separated into two groups, and the testing phase uses real-time videos. According to the results, the best average accuracy rate is 87.6%, and the video processing speed is 5 FPS (frames per second).
A rapid optimised Foveabox detection model (Fast-FDM) is presented by Jai et al. (2022) [51] in order to accomplish fast recognition and localisation of green apples and match the real-time operating needs of the vision system of harvesting robots. Fast-FDM detects green apples in an anchor-free manner: a bidirectional feature pyramid network (BiFPN) is used as the feature extraction network to quickly and easily fuse multi-
scale features before feeding the fused features to the fovea head prediction network for
classification and bounding box prediction. The backbone network used is the
EfficientNetV2-S, which has quick training and a small size. Additionally, the direct
selection of positive and negative samples using the adaptive training sample selection
(ATSS) technique results in greater recall and more precise green apple recognition for green
fruits of various sizes. The proposed Fast-FDM achieves an improved trade-off between detection speed and accuracy, reaching a mean average precision (mAP) of 62.3% for green apple detection with fewer parameters and floating-point operations.
Zhang et al. (2022) [52] combined the GhostNet feature extraction network with YOLOv4 to reconstruct the neck and YOLO head structure. They used the improved YOLOv4 model for
apple fruit recognition. In order to improve the capacity to extract features for medium and
small targets, the Coordinate Attention module is introduced to the feature pyramid network
(FPN) framework. On the constructed apple data set, the mAP of Improved YOLOv4 was
enhanced to 95.72% in comparison with YOLOv4; however, the network size was substantial.
The most common techniques for precise, quick, and reliable fruit identification and
recognition are DL-based approaches for fruit detection and recognition. These techniques
represent a significant development trend as well, and environmental conditions have relatively little impact on them. With regard to fruit image recognition, Xiao et al. (2023) [53] give an
overview and review of DL, particularly in the areas of detection and classification. The
following categories may be used to categorise the current fruit detection and identification
techniques based on DL: techniques based on YOLO, SSD, AlexNet, VGGNet, ResNet,
Faster R-CNN, FCN, SegNet, and Mask R-CNN. These approaches can also be divided into
two categories: single-stage fruit detection and recognition methods based on regression
(YOLO, SSD) and two-stage fruit detection and recognition methods based on candidate
areas (AlexNet, VGGNet, ResNet, Faster R-CNN, FCN, SegNet, and Mask R-CNN). Single-stage fruit detection and recognition methods aim for quicker speeds and lower weights while maintaining fruit detection accuracy, whereas two-stage algorithms enhance fruit detection accuracy while retaining detection speed and model size
benefits. Current development trends include improving fruit identification performance while reducing model size and complexity.

The tomato disease classification system was created by James et al. (2016) [54].
Preprocessing, segmentation, feature extraction, and disease classification are the four
primary steps of the system. Each image is subjected to a contrast enhancement approach
before being subjected to the k-means clustering algorithm, which is used to segment the
disease area of the tomato image. The RGB and HSV histogram values, statistical colour moments, and colour co-occurrence matrix are created as feature vectors for each image.
Anthracnose, Bacterial canker, Bacterial spot, Bacterial speck, Early blight, and Late blight
are among the tomato diseases anticipated. Data is gathered from the local market, and a
database of 600 images is constructed, including 100 images for each condition. Various
segmentation, colour, statistical colour, colour co-occurrence matrix, and shape features are retrieved, and lastly ensemble learning is utilised to accurately predict tomato diseases. To construct models and evaluate tomato diseases, three ensemble techniques are utilised: AdaBoost, LogitBoost, and TotalBoost. A multi-class classification ensemble approach is compared with several boosting and bagging ensemble methods. When compared to the other two approaches, the AdaBoost learning method achieved the highest prediction accuracy.
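A minimal boosting baseline of the kind compared above can be written with scikit-learn (hypothetical feature vectors; LogitBoost and TotalBoost have no direct scikit-learn equivalents, so only AdaBoost is shown):

import numpy as np
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

# Hypothetical colour/shape feature vectors for six disease classes.
rng = np.random.default_rng(5)
X = rng.normal(size=(600, 40))
y = rng.integers(0, 6, size=600)

# AdaBoost over decision stumps ("estimator" requires scikit-learn >= 1.2).
clf = AdaBoostClassifier(estimator=DecisionTreeClassifier(max_depth=1),
                         n_estimators=200, random_state=0)
print("CV accuracy:", cross_val_score(clf, X, y, cv=5).mean())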
Sabrol and Kumar (2016) [55] used a decision tree algorithm to do an experimental
investigation on the classification of healthy and diseased tomato leaves. The results revealed
a 78% classification accuracy in detecting infections caused by bacterial canker, fungal late blight, and other conditions.
Biswas Sandika et al.(2016) [56] created the Random Forest algorithm for the detection and
classification of several grape leaf diseases, such as Anthracnose and Powdery Mildew, from images captured in a field environment with a random backdrop. Based on performance, the Random Forest method
was compared to other machine learning algorithms. Random Forest achieves the highest classification accuracy with grey-level co-occurrence matrix (GLCM) characteristics in terms of background separation and disease classification. However, it takes more time to train
the data and is not ideal for approaches with sparse features.
Shijie et al. (2017)[57] developed a transfer learning strategy based on VGG16 for the
detection and classification of tomato diseases and pests. They also tested with VGG16 as a
feature extractor and SVM as a classifier. The average accuracy of the training set is 100%,
while that of the test set is 88% when utilising the VGG16 + SVM technique. With an average accuracy of 89%, the transfer learning strategy employing the fine-tuned VGG16 outperformed the feature-extractor variant.
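The feature-extractor variant of this strategy amounts to taking pooled VGG16 activations and training an SVM on them; the sketch below uses random stand-in arrays rather than Shijie et al.'s data:

import numpy as np
from tensorflow.keras.applications import VGG16
from tensorflow.keras.applications.vgg16 import preprocess_input
from sklearn.svm import SVC

# Frozen VGG16 base used purely as a feature extractor (global-average pooled).
base = VGG16(weights="imagenet", include_top=False, pooling="avg",
             input_shape=(224, 224, 3))

# Hypothetical arrays of leaf images and disease/pest labels.
rng = np.random.default_rng(6)
images = rng.uniform(0, 255, size=(20, 224, 224, 3)).astype("float32")
labels = rng.integers(0, 10, size=20)

features = base.predict(preprocess_input(images), verbose=0)  # shape (20, 512)
svm = SVC(kernel="rbf").fit(features, labels)                 # SVM on deep features
print("training accuracy:", svm.score(features, labels))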
Another approach was proposed for detecting diseases in citrus trees, consisting of two stages: detection of lesion spots and classification of diseases. From the
input image of the citrus fruit, the optimised weighted segmentation algorithm extracts the
lesion location. The texture, colour, and geometric elements are then merged in a codebook.
Furthermore, the best features are determined by combining entropy, PCA score, and related selection criteria. For disease classification, the selected feature set (FS) is input into a Multi-Class Support Vector Machine (M-SVM).
Guo et al. (2019)[59] present an enhanced convolutional neural network for detecting tomato
leaf diseases. The experimental dataset came from the publicly accessible PlantVillage
dataset. TomatoNet, a new convolutional neural network based on the original ResNet18
model, was built by adding a squeeze-and-excite module and changing the classifier
structure. The results illustrate that the TomatoNet network's average recognition accuracy is
99.63%, which is 0.53% greater than the ResNet18 network; TomatoNet likewise outperformed the AlexNet network.
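A squeeze-and-excitation module of the kind added to ResNet18 here can be sketched in Keras (a generic SE block, not TomatoNet's exact configuration):

import tensorflow as tf

def se_block(x, ratio=16):
    # Squeeze-and-excitation: re-weight channels using globally pooled statistics.
    channels = x.shape[-1]
    s = tf.keras.layers.GlobalAveragePooling2D()(x)                    # squeeze
    s = tf.keras.layers.Dense(channels // ratio, activation="relu")(s)
    s = tf.keras.layers.Dense(channels, activation="sigmoid")(s)       # excitation
    s = tf.keras.layers.Reshape((1, 1, channels))(s)
    return tf.keras.layers.Multiply()([x, s])                          # re-weighting

# Toy usage: insert the block after a convolutional stage.
inputs = tf.keras.Input(shape=(56, 56, 64))
outputs = se_block(inputs)
tf.keras.Model(inputs, outputs).summary()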
Tomatoes are regarded as one of the most popular vegetable crops in the Philippines. Manual
sorting is the most extensively used sorting approach, however it is highly dependent on
human interpretation and hence prone to mistake. Luna et al. (2019) [60] suggested a strategy
for sorting tomato fruit based on the presence of defects. Based on a single tomato fruit
image, the study covered the construction of an image dataset for a deep learning technique for the detection of flaws. The OpenCV library and Python programming were used to create the
models. Using the improved image capturing box, 1200 tomato images classed as no defect
and with defect are collected. These images are used to train, validate, and test three deep
learning models: VGG16, InceptionV3, and ResNet50. From this, 240 images are utilised as
testing images to evaluate the performance of the trained models separately using accuracy
and F1-score as performance indicators. Experiment results indicate that the VGG16 model performed best, while the InceptionV3 model scored 56.38-59.24-58.33 and the ResNet50 90.58-58.46-64.58 on the same indicators. Based
on the dataset collected, comparative study demonstrated that VGG16 is the best deep
learning model to utilise in the identification of the existence of fault in tomato fruit.
Wang et al. (2019)[61] presented tomato disease detection architectures based on deep
convolutional neural networks and an object detection model. These approaches employ two
distinct models, Faster R-CNN and Mask R-CNN, where Faster R-CNN is used to identify
the kinds of tomato diseases and Mask R-CNN is used to detect and segment the locations
and shapes of diseased regions. Four distinct neural networks were chosen and merged with
two object detection models. Experiments with data acquired from the Internet reveal that the
suggested approaches are particularly accurate and efficient in recognising tomato disease
kinds and segmenting shapes of diseased regions. The results of the experiments show that
ResNet-101 has the highest detection rate but takes the longest time to train and detect, whereas a lighter backbone achieves the lowest detection time but is less accurate. In general, different models can be selected based on the actual needs. This study's dataset also offers a range of
images.
Gehlot et al. (2020)[62] used several Convolution Neural Network designs to classify
diseases in tomato leaves. The disease in tomato leaves can have an impact on both the
quality and quantity of produce. Early disease identification, classification, and detection are
essential to solve this challenge. Convolutional Neural Networks are becoming more popular
in object recognition, detection, and classification due to their ease of use and automated feature extraction. Five models, including VGG16, ResNet-101, and DenseNet121, were employed. The models were able to divide
tomato leaf disease into ten distinct groups. All the models worked well; however, DenseNet-121 had the best accuracy and was the smallest of the models, while ResNet-101 and VGG16 also achieved competitive results.

Ashraf et al. (2020) [63] created an innovative approach based on DCNN to categorise
different medical images, such as the body's organs. The final three layers of the GoogLeNet
transfer learning algorithm were swapped out for a fully connected layer, a soft-max layer,
and a classification output layer. When categorising different medical images, the outcome of
the fine-tuned technique showed the greatest results. Some studies employed datasets with
fewer images, which would have had an impact on training effectiveness and raised the
misclassification ratio.
Zeng et al. (2020) [64] used several CNN models, including InceptionV3 and VGG13, to detect the severity stages of citrus disease. The InceptionV3 model has the maximum accuracy of 74.38% in 60
epochs. Augmentation was also employed to improve the learning and performance of the
model. Deep convolution Generative Adversarial Networks (GANs) were used in this
process. With augmented data, the InceptionV3 model obtained 20% greater accuracy. The
data utilised in this study was gathered from non-profit organisation websites such as PlantVillage.

Another study [65] combined feature selection with the SVM algorithm for cucumber recognition. The cucumber image is transformed to colour space in
order to get 15 colour elements, and weighted data of relevant features is analysed using I-
RELIEF. Similarly, the Otsu approach is used to segment the G element and MSER is used to
obtain the mask image to remove a piece of the background region. The top of the three
elements of the weights are provided as input to the DL technique to extract and fuse features
in order to maximise the variances between cucumber and leaves and improve the classifier
accuracy of SVM. Finally, the cucumber identification is achieved by combining the SVM classifier with the fused deep features.
Using original and feature image datasets, Xiao et al. (2020) [66] suggested the CNN method
with the ResNet50 model for identifying strawberry diseases, including crown leaf blight,
leaf blight, fruit leaf blight, grey mould, and powdery mildew. The training accuracy reached its peak at 20 epochs, with 98.06% accuracy for the original dataset and 99.60% accuracy for the feature image dataset.
Through the Internet of Things, Ronghua et al. (2020) [67] collected environmental
parameters such as soil temperature and humidity, pH value, air temperature and humidity in
real-time. They then combined image feature fusion, environmental information, and expert
knowledge, and adopted the method of multi-structure parameter ensemble learning for
disease identification to ensure the accuracy of identification under the condition of shorter
identification time. The sample recognition rate ranged from 79.4 to 93.6% when 50 samples of four different cucumber diseases were tested, including powdery mildew and fusarium wilt.
In China, tomato is grown extensively as a fruit or vegetable. There are several types of
disease and pests that affect tomatoes throughout their whole life cycle, making the
identification and diagnosis of these diseases crucial. For the purpose of detecting plant
diseases, many Deep Learning architectures have been put into use. Transfer learning was
used by Hong, Huiqun et al. (2020) [68] to decrease the amount of training data needed, the
amount of time needed, and the cost of the computations. In addition, 9 varieties of disease
leaves, including healthy tomato leaves, are classified. To extract the features, five deep convolutional neural networks were compared, with the training parameters adjusted appropriately before testing. The parameters and average accuracy varied amongst the five convolutional neural networks.
Gayatri et al. (2020) [69] developed a transfer learning system based on a pre-trained deep
CNN-based framework for disease categorization in tomato plants. The dataset for this study
was compiled from internet sources and consists of 859 images classified into 10 classes.
This is the first research of its sort to include: (i) a dataset containing 10 tomato pest classes;
and (ii) a comprehensive comparison of the performance of 15 pre-trained deep CNN models
on tomato pest classification. The experimental findings demonstrate that the DenseNet169 model achieved the best classification performance.
The diseases that affect plants such as citrus are huge threats to food security, thus early
detection is critical. Convenient disease identification can let the person respond quickly and
eliminate some restricted activities. Plant leaf images may be used to do this recognition
without the assistance of a person. There are several approaches used in machine learning
(ML) models for classification and detection, but the combination of rising improvements in
computer vision looks to have the deep learning (DL) field study to accomplish a tremendous
potential in terms of improving accuracy. Luaibi, Ahmed R. (2021) [70] used two types of convolutional neural networks, AlexNet and ResNet models, with and without data augmentation, a technique for modifying the existing data. This approach increases the amount of training images in DL
without requiring the addition of new images; it is excellent for small datasets. A collection
of 200 images of diseased and healthy citrus leaves is made. The trained models with data
augmentation produce the greatest results, with ResNet and AlexNet achieving 95.83% and
97.92%, respectively.
Zhang et al. (2021)[71] present a tomato disease detection technique based on Multi-
ResNet34 multi-modal fusion learning based on residual learning for the problem of limited
identification rate of a single RGB image of a tomato disease. This research proposes transfer
learning, which is based on the ResNet34 backbone network, to speed up training, decrease
data dependencies, and prevent overfitting owing to a small quantity of sample data; it also
combines multi-source data (tomato disease image data and environmental characteristics).
The feature-level multi-modal data fusion approach is used to keep the important information
of the data that is used to identify the feature, so that the different modal data can
complement, support, and correct each other, resulting in a more accurate identification
impact. To start with, Mask RCNN was used to extract partial images of leaves from
complicated background tomato disease images in order to lessen the impact of background
areas on disease diagnosis. The formed image environment data set was then fed into the
multi-modal fusion model to get disease type identification results. The proposed multi-
modal fusion model Multi-ResNet34 has a classification accuracy of 98.9% for six tomato diseases: bacterial spot, late blight, leaf mould, yellow aspergillosis, grey mould, and early blight.
2.5 Summary
In recent years, the agricultural sector has faced numerous challenges, among them the detection of tomato fruit in images and videos and the classification of fruit based on maturity at harvest. In addition, the early identification of tomato fruit disease reduces costs by avoiding the unnecessary application of pesticide to the plants. This review details an up-to-date analysis of current research in tomato detection, maturity-based classification, and disease identification using Artificial Intelligence techniques. The key objective of the work is to analyse the different deep learning techniques broadly used to detect and classify tomatoes based on ripeness and diseases. The literature discussed in this chapter shows that Machine Learning and Deep Learning are used in agriculture for the detection of tomatoes and their classification by ripeness and disease. More specifically, Deep Learning algorithms such as CNNs show significant results for classifying tomato ripeness and for detecting tomatoes in images and videos. From the literature it is evident that deep learning methods outperform conventional approaches such as image processing, classical machine learning, and shallow neural networks in ripeness and disease detection. Further, to improve fruit detection performance, the YOLO method, which plays a vital role in object detection, can be utilised with certain modifications.
CHAPTER 3
In this chapter, a Deep Learning model for Tomato Ripeness classification is proposed
based on VGG16 using transfer learning and fine-tuning strategy. The proposed method
focuses on the classification of tomatoes into two classes namely ripe and unripe based on the
maturity. The preprocessing stage converts the collected RGB images of different sizes into
224 x 224-pixel size. The pre-trained VGG16 model is used for feature extraction and the last
layer is replaced with Multi-Layer Perceptron (MLP) for the classification of tomatoes. The
performance of the proposed method is evaluated against the collected dataset and compared with a baseline CNN model.
3.1 Introduction
Fruit maturity refers to the completion of development, which can only occur while
the fruit is still connected to the tree and is characterised by a halt in cell growth and the
accumulation of dry matter. The maturity of fruits and vegetables during harvest has a
considerable impact on the quality of all fruits and vegetables along the postharvest value
chain [72]. Fruit ripeness is significant in agriculture because it affects fruit quality [73], yet manual inspection is subjective and labour intensive and can lead to variations in determining fruit ripeness [74-75]. An
efficient and effective automated model that can identify and classify the fruits based on their
maturity degree in a short amount of time is desperately needed. The emergence of computer
vision technology has the potential to address these challenges because fruit ripeness
classification can be done automatically, making it relatively fast, consistent, and cost-
effective [76]. The amount of fruit produced and distributed is determined by the fruit maturation process, which takes a short time; therefore, the classification of the fruit's maturity must also be fast.
Automation can be used to determine the level of fruit ripeness based on an image of the fruit. Each fruit has a unique indicator, such as size, shape, and colour, that shows its level of ripeness, and the reliability of this indicator depends on the quality of the image used [77]. A number of methods for measuring fruit ripeness based on such indicators have been proposed.
The primary topic addressed in this research is assisting farmers or growers for
protection of fruits from diseases and harvesting them at appropriate timing. Farmers
normally search for matured tomatoes to sell, and the amount of ripeness is usually measured
by colour. Farmers will save substantial human labour if tomatoes can be classified as ripe or unripe automatically. Such classification can be used before harvest (in the field), after harvest (in storage), or both. Furthermore, automated quantification of tomatoes in the field will assist in more accurately assessing the expected yield.
3.2 Background
With the development of big data technologies and high-efficiency computers, deep
learning technology has opened up new possibilities for crop management and crop monitoring. A CNN model consisting of a convolution layer, a pooling layer, and a fully connected layer was suggested by the authors in [78]
to identify bananas. Fruits including apples, strawberries, oranges, mangoes, and bananas
were examined in order to extract features using CNN. Fruits were classified using
algorithms like K-Nearest Neighbour (KNN) and Random Forest (RF). The deep-feature RF combination outperformed the rest, reaching 96.98% accuracy, when the DL-based RF and CNN combinations were compared.
A computer-vision-based application was designed using CNN and tested for the
classification of the ripening stages of the mulberry fruits [79]. The CNN classification model
was fine-tuned using transfer learning to improve accuracy and reduce training costs.
Different CNN models such as AlexNet, ResNet18, ResNet50, DenseNet, and InceptionV3
were used for testing. AlexNet and ResNet18 achieved the highest accuracies, with the best model reaching 98.32%.
A huge quantity of data is required to train a deep learning model to solve problems in
a specific domain. It is typically difficult to collect large datasets necessary to train such
models [80], and it is expensive to hire professionals to label such vast data sets [81]. Models trained on a single job, fortunately, may be reused to handle various issues in the same domain. This is the idea behind transfer learning, an approach that allows reusing a model that has already been pre-trained on a large data set. According to studies, transfer learning is an efficient method for transferring large amounts of visual information acquired through training on such large image datasets to new, smaller tasks.
As demonstrated in Figure 3.1, a model that has been trained for one task is modified (repurposed) for another similar task using this approach. The transfer of information from one task to another is simply an optimisation strategy that leads to higher performance on the target task.
Deep transfer learning is based on pre-trained neural models, which constitute the
foundation of transfer learning in the context of deep learning. Deep learning systems are
multi-layered architectures that learn unique features at each layer. Higher-level features are learned deeper into the network. To obtain the final output, these layers are eventually joined to the
fully connected layer. This opens up the possibility of employing popular pre-trained models such as VGG, without their final layer, as fixed feature extractors for various tasks.
The primary idea here is to use the pre-trained model's weighted layers to extract
features while avoiding updating the model's weights during training with new data for the
current job. The pre-trained models are trained on a large and general enough dataset, and will therefore serve as a generic model of the visual world.
3.3.1 Fine-tuning
Fine-tuning adjusts selected layers of the pre-trained network along with its hyperparameters. The initial layers' task is to capture generic features, whereas the latter layers are more focused on the specific task at hand. It therefore seems reasonable to fine-tune the higher-order feature representations in the base model to make them more relevant to the task at hand.
Along with the training of the classifier, one typical approach to improve the model's performance is to "fine-tune" or re-train the weights of the top layers of the pre-trained model. This forces the model to adjust the weights based on generic feature mappings learnt from the source task. Fine-tuning will allow the model to use previous knowledge in the target domain as well as re-learn representations specific to the new task.
Furthermore, rather than fine-tuning the whole model as shown in Figure 3.2, one can fine-tune a limited number of top layers. The first few layers learn basic and generic features that are applicable to practically all sorts of data. Progressing up the hierarchy, the features become more particular to the dataset on which the model was trained. Instead of replacing the general learning, fine-tuning seeks to modify these specialised features to operate with the new dataset.
The six steps involved in the transfer learning process are shown in Figure 3.3.
a. Obtain the Pre-trained Model
A model trained on a large benchmark dataset is selected, and its learned network weights (W1, W2, … , Wn) are transferred to the base network. It is possible to obtain the weights from publicly released checkpoints.
b. Create a Base Model
The network structure can be changed based on the bottom layers, which can include altering layers, adding layers, and removing layers from networks, among other things. It is possible to adapt the structure to the new task.
c. Freeze Layers
Freezing the starting layers from the pre-trained model is essential to avoid the additional cost of relearning the generic features they already capture.
d. Add New Trainable Layers
Only the feature extraction layers are being used as knowledge from the base model. To predict the model's specialised tasks, we need to add new layers on top of the initial ones.
e. Train the New Layers
The pre-trained model's final output will most likely differ from the output we want for our model. In this case, the model is trained with a new output layer in place.
f. Fine-tune the Model
Fine-tuning is one approach for enhancing performance. It requires unfreezing a portion of the base model and retraining the entire model on the full dataset at a very low learning rate. The slow learning rate will improve the model's performance on the new dataset while limiting overfitting.
Over the last decade, several CNN designs have been proposed, since better model architecture can help improve application performance. CNN architectures have experienced numerous innovations, including the reformulation of processing units, regularisation, parameter optimisation, and the creation of additional blocks. Among these, the use of network depth has been one of the most influential directions.
VGG16 is an acronym for the VGG model, often known as VGGNet. It is a 16-layer
convolution neural network (CNN) model. This model was proposed by K. Simonyan and A.
Zisserman of Oxford University and published in the paper Very Deep Convolutional
Networks for Large-Scale Image Recognition [84]. The model achieves 92.7% top-5 test
accuracy on ImageNet, a dataset of over 14 million images divided into 1000 classes. Instead of large receptive fields, VGG16 relies on stacks of small 3 x 3 convolution filters. AlexNet's kernel size is 11 for the first convolutional layer and 5 for the second layer.
The benefit of having several smaller layers rather than a single big layer is that the
convolution layers are accompanied by more non-linear activation layers, which enhances the
decision functions and helps the network to converge faster. In order to reduce the network's
inclination to overfit during training, VGG uses smaller convolutional filters. A 3 x 3 filter is the smallest size that can still capture information from the left, right, up, and down directions, which makes VGG one of the simplest models able to interpret the spatial components of an image.
VGG16 is a 16-layer deep neural network and a relatively extensive network with approximately 138 million parameters.
Figure 3.4 VGG16 architecture
The network comprises 21 layers in total, but only 16 of them are weight layers, i.e. the learnable-parameter layers seen in Figure 3.5.
It supports input tensors with three RGB channels of size 224 x 224. The most distinctive characteristic of VGG16 is that, instead of a large number of hyper-parameters, it consistently uses 3 x 3 convolution filters with stride 1 and the same padding, together with 2 x 2 maxpool layers with stride 2. The convolution and max pool layers are uniformly organised throughout the architecture. Conv-1 has 64 filters, Conv-2 has 128 filters, Conv-3 has 256 filters, and Conv-4 and Conv-5 have 512 filters each. Following the stack of convolutional layers, three Fully Connected (FC) layers are added: the first two have 4096 channels apiece, while the third performs the 1000-way ILSVRC (ImageNet Large Scale Visual Recognition Challenge) classification and hence has 1000 channels (one for each class). The final layer is the soft-max layer.
Figure 3.5 Layered VGG16 structure
The ripeness classification is carried out in the way illustrated in Figure 3.6. The data collected in this study will be used to assess the maturity of tomatoes at various stages. To get the best results, the transfer learning technique leverages the pre-trained VGG16 with a fine-tuning strategy and replaces the classification layer with a Multi-Layer Perceptron.
Figure 3.6: Workflow of the proposed method (Data Acquisition → Data Pre-processing → Data Augmentation → Image Classification)
3.5.1 Data Acquisition
The image data for tomato ripeness classification was gathered from the PlantVillage dataset.
The dataset contains 400 images of tomatoes divided into two classes: ripe and unripe.
The amount of data used for classification is insufficient to train a Deep Learning model.
When data is augmented, the dataset size increases many times, which helps in training the
deep network model. To generalise the data, data augmentation is used throughout the
training phase. Data augmentation [85] is a procedure that creates extra training data to
minimise overfitting. The aim is to allow the model to be able to adapt to real-world
challenges without being presented with the same image repeatedly. The ImageDataGenerator
function in Keras is used to do this process, which includes rotation, width and height
shifting, zooming, and horizontal flipping. Table 3.1 shows the augmentation approaches and parameter values used.

Table 3.1 Augmentation parameters
Parameter          Value
Rotation           90
Brightness range   [0.1, 0.7]
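As a concrete illustration, the following sketch assembles the augmentation pipeline described above with Keras' ImageDataGenerator. The rotation and brightness values follow Table 3.1; the shift and zoom magnitudes and the dataset path are illustrative assumptions.

```python
# Hedged sketch of the augmentation pipeline; rotation_range and
# brightness_range follow Table 3.1, the other magnitudes are assumed.
from tensorflow.keras.preprocessing.image import ImageDataGenerator

train_datagen = ImageDataGenerator(
    rescale=1.0 / 255,             # scale pixel values to [0, 1]
    rotation_range=90,             # Table 3.1
    brightness_range=[0.1, 0.7],   # Table 3.1
    width_shift_range=0.2,         # assumed magnitude
    height_shift_range=0.2,        # assumed magnitude
    zoom_range=0.2,                # assumed magnitude
    horizontal_flip=True,
)

train_generator = train_datagen.flow_from_directory(
    "dataset/train",               # hypothetical folder with ripe/ and unripe/ subfolders
    target_size=(224, 224),        # VGG16 input size
    batch_size=16,
    class_mode="binary",           # two classes: ripe / unripe
)
```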
The images in the dataset are of various sizes and must be resized in order to fit into the deep learning CNN. Because the image's input size must be the same as the model's input size, the images in the dataset are resized to 224 × 224 and normalised for rapid processing by scaling the pixel values to the range [0, 1].
Training a deep CNN model such as VGG16 from scratch necessitates a large amount of data since it has millions of trainable parameters, and a small dataset would be inadequate to achieve decent generalisation of the model. Therefore, the model is reused with its pre-trained weights through the Transfer Learning (TL) approach. TL is a useful Machine Learning (ML) approach in which a pre-trained CNN model is reused, taking its weights as the initialization of a new CNN model for a different purpose.
TL is used in two stages: Deep feature extraction is the initial stage in extracting the
key features from the dataset and applying them to the trained model. The weight of the pre-
trained model is used to identify which features may be used for challenges. The second
phase is finetuning, which involves freezing the model's basic layers and retraining the last
layers using a new tiny dataset. The weights of the final layers are changed using the back-
propagation technique after training with a new dataset. Furthermore, the number of classes in the output layer matches the number of classes in the target dataset. The ImageNet dataset, which was used to train the VGG16 model, has 14 million images separated into 1,000 different classes. Figure 3.7 depicts the proposed DeepCNN model, to which the learned weights and parameters from the underlying VGG16 model are transferred.
Figure 3.7 Proposed DeepCNN model based on VGG16
As illustrated in Figure 3.8, the VGG16 model has 16 layers, with 13 convolutional
layers followed by max pooling and three fully connected layers utilised for classification.
The DeepCNN model is introduced, which is built on the VGG16 model and uses TL. The
top layer is removed and replaced with a new top layer that consists of two separate classes of
tomato. The proposed model's pre-trained convolutional blocks are frozen (non-trainable) to prevent the weights from being modified. The final two pre-trained layers are then un-frozen (made trainable) and trained on the new dataset to fine-tune their specialised features for the target task.
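The construction just described can be sketched as follows; the VGG16 base and the frozen/unfrozen split follow the text, while the MLP hidden width and dropout rate are assumptions, and a single sigmoid unit stands in for the two-class output.

```python
# Hedged sketch of the DeepCNN model: pre-trained VGG16 base, frozen
# convolutional blocks, last two layers un-frozen, and an MLP head.
from tensorflow.keras.applications import VGG16
from tensorflow.keras import layers, models

base = VGG16(weights="imagenet", include_top=False, input_shape=(224, 224, 3))
base.trainable = False              # freeze the pre-trained blocks
for layer in base.layers[-2:]:      # un-freeze the final two layers
    layer.trainable = True          # these are fine-tuned on the tomato data

model = models.Sequential([
    base,
    layers.Flatten(),
    layers.Dense(256, activation="relu"),   # assumed MLP width
    layers.Dropout(0.5),                    # assumed dropout rate
    layers.Dense(1, activation="sigmoid"),  # ripe vs. unripe decision
])
```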
As seen in Figure 3.9, the DeepCNN model is divided into two parts: feature extraction and classification.
3.6.1 Feature Extraction
Convolution layers and pooling layers are utilised in the feature extraction process, which is
used to extract the features from the provided images. Rectified Linear Unit (ReLU)
activation function, which is present in each convolution layer, is used to activate the
neurons.
Convolution layer
The Convolution layer, which serves as the feature extractor from input images, is the most
crucial part of CNN. Each convolution layer has a series of filters that move over the input
image to identify the key details. By computing the dot product between the pixels of the input image and the filter, the feature maps are created. The resultant size of the feature map is given by equation 3.1:

(w − fw + 1) × (h − fh + 1)    (3.1)

where w, h are the width and height of the input image and fw, fh are the width and height of the filter.
ReLU
The ReLU function is widely chosen in neural networks because it converges quickly and reduces overfitting problems. The ReLU sets all of the negative values on the feature map to zero, as given in equation 3.2:

f(x) = max(0, x)    (3.2)
Padding
Padding is used to avoid the feature map's reduction in size, which occurs during dimensionality reduction. In order to maintain the input image's original size, padding is the act of adding layers of zeros around the input. Equation 3.3 gives the size of the final feature map after padding with p layers of zeros:

(w − fw + 2p + 1) × (h − fh + 2p + 1)    (3.3)
Max-Pooling layer
The spatial size of the convolved features is reduced and the overfitting issue is lessened by the usage of the max-pooling layer. In order to produce the pooled feature map, the max operation selects the largest value in each patch of the feature map, as shown in equation 3.4:

MaxPooling(X)i,j,k = max over (m, n) of X(i·sx + m, j·sy + n, k)    (3.4)

where X is the input, (i, j) are the indices of the output, k is the channel, (sx, sy) are the strides, and (m, n) index the positions inside the pooling window.
3.6.2 Classification
The Max Pooling layer and dense layer make up the classification portion of the proposed
model. The maturity prediction is carried out using the dense layer's softmax function.
Softmax classifier
The softmax function is a type of activation function that is used in the multi-class
classification problem and appears at the output layer. It computes a vector of k real numbers
and normalises the input values to obtain a vector of probabilistic values ranging from 0 to 1.
The softmax function returns a probability for each class, and the class with the maximum probability is picked as the prediction, as shown in equation 3.5:

Softmax(xi) = e^(xi) / Σj e^(xj)    (3.5)

where xi represents the values from the neurons of the output layer.
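A short numerical illustration of equation 3.5, with made-up output-layer values for the two classes:

```python
import numpy as np

x = np.array([2.0, 0.5])                # hypothetical logits (ripe, unripe)
probs = np.exp(x) / np.sum(np.exp(x))   # equation 3.5
print(probs)                            # [0.8176 0.1824] -> "ripe" is picked
```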
3.8 Optimizers
Optimizers are algorithms or techniques that change the parameters of a neural network, such as its weights and learning rate, to reduce losses and improve accuracy [86]. SGD (Stochastic Gradient Descent) is the optimizer adopted in this work.
SGD Optimizer
Stochastic Gradient Descent (SGD) is a variant of the gradient descent approach that calculates the error and updates the model for each instance in the training dataset rather than the full training dataset at once. For each training sample (xi, yi), it computes the gradient ∇θ J(θ) and updates the parameter θ to reduce the objective function J(θ) using equation 3.6:

θ = θ − η ∇θ J(θ; xi, yi)    (3.6)

where η is the learning rate, a hyperparameter determining the step size at each iteration.
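Putting the pieces together, training with SGD might look like the following sketch; `model` and `train_generator` refer to the earlier sketches, and the learning rate and momentum values are assumptions.

```python
from tensorflow.keras.optimizers import SGD

# Keras applies the update of equation 3.6 internally, per mini-batch.
model.compile(
    optimizer=SGD(learning_rate=0.001, momentum=0.9),  # assumed values
    loss="binary_crossentropy",                        # loss used in this work
    metrics=["accuracy"],
)
history = model.fit(train_generator, epochs=25)        # 25 epochs, as in the text
```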
Confusion Matrix
The confusion matrix is a tabular summary used for evaluating classification performance [87]. The matrix compares the true values to those predicted by the model. Figure 3.10 is a heatmap visualisation of the confusion matrix created with Python's sklearn module. It consists of two rows and two columns that indicate the two classes (Class1: Ripe and Class2: Unripe) with correct and erroneous predictions. The entries of the matrix are defined as follows:
Figure 3.10 Confusion Matrix
True Negatives (TN): the count of the outcomes which are originally Class2 and are truly
predicted as Class2.
False Positives (FP): the number of images that are originally Class2 but are predicted as Class1.
False Negatives (FN): the count of Class1 images which are falsely predicted as Class2.
True Positives (TP): the count of Class1 images which are truly predicted as Class1.
Classification Report
After creating the confusion matrix, the performance metrics (accuracy, recall, precision, and
F1-score) of the models may be accessed using the classification report. The classification
report may be imported into Python from the sklearn package by using sklearn.metrics
import classification_report. The values of the performance measures are calculated using the following equations.
Accuracy is the measure of all correctly classified images and is represented as the ratio of correctly classified images to the total number of images in the test dataset, as shown in equation 3.7:

Accuracy (A) = (TP + TN) / (TP + TN + FP + FN)    (3.7)
Precision is the proportion of correctly predicted positive images out of all images predicted as positive. For instance, it can be defined as the ratio of images correctly classified as ripe to the total number of images classified as ripe, as shown in equation 3.8:

Precision (P) = TP / (TP + FP)    (3.8)
Recall is calculated by dividing the correctly classified images of a class by the total number of images that actually belong to that class, as shown in equation 3.9:

Recall (R) = TP / (TP + FN)    (3.9)
F1-score is the harmonic mean of precision and recall, with a minimum value of 0 and a maximum value of 1; for imbalanced classes it is more informative than the accuracy metric. The value of the F1-score is given by equation 3.10:

F1-Score (F) = (2 × Precision × Recall) / (Precision + Recall)    (3.10)
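All four metrics, together with the confusion matrix, can be obtained with scikit-learn as described earlier; the label arrays below are made up (1 = ripe/Class1, 0 = unripe/Class2).

```python
from sklearn.metrics import confusion_matrix, classification_report

y_true = [1, 0, 1, 1, 0, 1, 0, 0]   # hypothetical ground-truth labels
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]   # hypothetical model predictions

print(confusion_matrix(y_true, y_pred))
print(classification_report(y_true, y_pred, target_names=["unripe", "ripe"]))
```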
ROC Curve
The Receiver Operating Characteristic (ROC) curve is a graphical summary of classifier performance across thresholds. A threshold value is first established in order to classify the sample data, resulting in a set of TP, FP, TN, and FN values [88]. Different sets of these values are produced for different threshold settings. Lowering the threshold value will result in more true positives being correctly identified, but it will also result in more false positives and fewer true negatives. Conversely, if the threshold is set to a greater value, false positives decrease but false negatives increase. If a confusion matrix were constructed for each threshold value, this would result in a large number of confusion matrices that are cumbersome to compare. The ROC curve offers a more straightforward method for evaluation by providing all of the previously discussed information in the form of a graph, as illustrated in Figure 3.11. The Y-axis, displayed in the form of sensitivity, depicts the true positive rate. The true positive rate is
the proportion of data that has been accurately categorised. The X-axis shows the false
positive rate. This is the proportion of data that has been mistakenly labelled as false
positives. Each of these rates may be computed at different threshold values and presented as
a ROC graph. One ROC curve displays all of the confusion matrices that would be generated
for different threshold values, allowing us to examine them all in one place and choose the
threshold value that delivers the most accurate prediction. ROC curves can also be used to
compare different neural network models(NNM). The area under the curve (AUC) is
calculated for this purpose. A model is considered better if its AUC value is greater.
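As a brief sketch, the ROC curve and its AUC can be derived with scikit-learn from ground-truth labels and predicted ripe-class probabilities; the values below are made up.

```python
from sklearn.metrics import roc_curve, auc

y_true  = [0, 0, 1, 1, 1, 0, 1]                  # hypothetical labels
y_score = [0.1, 0.4, 0.35, 0.8, 0.9, 0.2, 0.7]   # hypothetical probabilities

fpr, tpr, thresholds = roc_curve(y_true, y_score)  # one point per threshold
print(f"AUC = {auc(fpr, tpr):.2f}")
```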
Figure 3.11: General ROC curve
The experiment is carried out on VGG16 using transfer learning and the tomato fruit
dataset to evaluate model performance. To extract the features, the traditional VGG
architecture with learned weights from the ImageNet dataset is used. Its output features are
supplied into the newly added fully connected layer, which is used to train the tomato dataset.
For the model's performance evaluation, parameters such as the number of epochs, batch size,
learning rate, and optimizers to attain maximum accuracy are taken into account. The model
is trained by splitting the pre-processed and augmented dataset into 60% training, 20%
validation, and 20% testing portions. Furthermore, the proposed model is trained for 25 epochs with the hyperparameters listed in Table 3.2.
Table 3.2 Hyperparameters
Parameter    Value
Batch Size   16
Optimizer    SGD
For the calculation of loss, the binary cross-entropy method has been applied. The formula for binary cross-entropy is given in equation 3.11:

BinaryCrossEntropy = −(1/N) Σ (i = 1 to N) [ yi · log(p(yi)) + (1 − yi) · log(1 − p(yi)) ]    (3.11)

where yi is the label (1 for the ripe class and 0 for the unripe class) and p(yi) is the predicted probability of the ripe class.
The training results of DeepCNN, a modified VGG16 model with transfer learning,
are reported in this section. Figure 3.12 shows the training and validation accuracy for simple
CNN for ripeness classification. The classification's average accuracy was found to be
72.21%. The ROC curve, shown in Figure 3.13, is a true-positive rate against false-positive
rate graph that displays the performance of all classes at various classification thresholds; the
ROC area for the basic CNN binary classification is 0.83 for two classes (ripe and unripe).
The confusion matrix for basic CNN on binary classification is shown in Figure 3.14. Figure
3.15 depicts the training and validation accuracy graph of the DeepCNN model based on
VGG16 without fine-tuning technique, with an average accuracy of around 91%. Figures
3.16 and 3.17 depict the ROC curve and confusion matrix for DeepCNN without fine-tuning.
Figure 3.13 ROC curve of Basic CNN model
Figure 3.14 Confusion Matrix of Basic CNN model
Figure 3.15 Accuracy Graph of DeepCNN model without Fine-tuning
Figure 3.16 ROC of DeepCNN without Fine-tuning
The training and validation accuracy of the DeepCNN model with fine-tuning is shown in Figure 3.18. Figure 3.19 indicates the ROC curve, which is a true-positive rate versus false-
positive rate graph which shows the performance of all the classes at certain classification
thresholds, the ROC area of the proposed model for the binary classification is 0.99 for all
classes (ripe and unripe). Figure 3.20 shows confusion matrix for VGG16 with transfer
learning.
Figure 3.19 : ROC of DeepCNN with Fine-Tuning
3.10.2 Result analysis
Table 3.3 shows the accuracy, precision, recall and F1-score for the Basic CNN model and the DeepCNN model with and without the fine-tuning strategy. The DeepCNN model is trained with and without the fine-tuning strategy using the SGD optimiser.

Table 3.3 Performance comparison of the models
Model                          Accuracy (%)   Precision (%)   Recall (%)   F1-score (%)
DeepCNN with Fine-tuning       97.71          98              98           98
DeepCNN without Fine-tuning    90.21          91              90           90
Basic CNN                      71.88          74              72           71
The accuracy is maximum for the proposed DeepCNN model with the fine-tuning strategy (VGG16 with transfer learning), i.e. 97.71%. The DeepCNN model without fine-tuning acquired 90.21% accuracy, and the basic CNN has the minimum accuracy of around 71.88%. Similarly, precision is evaluated for the three models: it is maximum for the proposed DeepCNN model at around 98%, VGG16 without fine-tuning acquired a precision of 91%, and the minimum precision of 74% belongs to the basic convolutional neural network model. The next parameter evaluated to assess the performance is recall. The maximum recall is 98% for the DeepCNN model, the minimum is 72% for the basic CNN, and VGG16 without fine-tuning has a recall of 90%. The F1-score is also evaluated for binary classification: the basic CNN model has the minimum score of around 71%, VGG16 without fine-tuning has an F1-score of 90%, and DeepCNN has the maximum F1-score of 98%. Figure 3.21 shows the comparison of the three models for ripeness classification. For all the parameters, the DeepCNN has the maximum performance and the Basic CNN model the minimum: maximum accuracy is 97.71%, maximum precision is 98%, maximum recall is 98%, and maximum F1-score is 98%.
Summary
A deep learning CNN model based on VGG16 is proposed for detecting and classifying the ripeness of tomatoes. The proposed model performs well and attains better accuracy in detecting ripeness and classifying tomato fruit by utilising the fine-tuning technique. Adding the data augmentation process to the dataset and applying the fine-tuning strategy produced a more robust model, and the usage of dropout and data augmentation reduced overfitting.
CHAPTER 4
YOLO MODEL
4.1 Introduction
One of the most important computer vision tasks is object detection, which looks for
instances of visual things (such as people, animals, cars, weeds, fruits, or buildings) in digital
images like pictures or video frames. Traditional image processing techniques and modern deep
learning networks can be used to detect objects. Object detection is a computer vision task that not only detects but also estimates the location of objects in a digitized image, and a good detector must recognise things while dealing with changes in the shape, orientation, and colour contents of the objects. Although "detection" may refer to the capacity to indicate the existence and identity of an object, it may also refer to the ability to locate a partially concealed object. Object detection can be interpreted in a variety of ways,
including drawing a bounding box around the object or labelling every pixel in an image that
includes the object (a process known as segmentation) [89]. A rectangular bounding box is
used to define the position of the identified object, which helps the people in locating the
object faster than unprocessed images. It is an identified portion of an image that may be
understood as a single unit in image processing [90]. Typically, an image will contain one or more objects whose visibility is important; the objects represented in a single image can range from a single unit to an almost unbounded number. Thus, given an image or video stream, an object detection model should be able to determine which of a known set of objects is present and provide information about their positions within the image.
Object detection is commonly used in computer vision applications such as image annotation, video object co-segmentation, activity recognition, face recognition, face detection, and fruit counting. In agriculture, the positions of objects must be determined in order to harvest fruits and vegetables, manoeuvre in the field, spray selectively,
and so on. According to the task, an object's 3D location must be computed, obstructions in
the path must be recognised, and object properties such as ripeness and size must be
determined. The presence and classification of diseases is also needed in several applications
[91,92,93]. Despite many years of research in agriculture focused object identification, there
are still several issues that hinder agricultural applications from being implemented [94]. The
complicated plant structure and varying product shape and size make it difficult to provide a robust general solution.
For many years, hand-crafted features like colour, shape, texture, or their combination
were the primary focus of object detection research [96,97]. Despite the acquisition devices,
the range of target colours, and the various lighting conditions having a significant impact on
colour, colour has been one of the main features employed in detection. The specific features
of the agricultural environment, according to several research, were the greatest obstacles to
precise identification and localisation of the target object [98]. Because of this, several research groups have attempted different approaches for the detection task throughout the years. For example, they have created adaptive thresholding algorithms to deal with varying illumination conditions, and artificial neural network (ANN) methods have been used in a number of studies to eliminate the background.
ANNs were first created for learning to classify colour features fed to the network in order to remove background pixels from target pixels, which resulted in the same difficulties as hand-crafted features. Since then, deep neural networks, particularly for detection and segmentation, have greatly improved. These networks may be fed raw data, such as the pixels
of a fruit in an image, and automatically learn features from it, eliminating the need for hand-
crafted features and the issues they bring [104]. With enough data, accurate representations of
the target object may be trained, and a robust system can be built that can handle the
uncertainty and high variability that are inherent to real-world vision challenges. In tasks like
disease detection and classification [105], fruit recognition and location [106], and others,
these networks have shown high-performance outcomes in the agriculture domain as well.
Deep Learning Models are currently being used in Object Detection techniques due to
their accurate Image Recognition capacity. These models use features extracted from input
videos and images to identify the objects included inside them. Image Processing, Video
Analysis [107], Speech Recognition, Biomedical Image Analysis, Biometric Recognition, Iris Recognition, Forecasting, and Renewable Energy Generation Scheduling are some of the applications of
these models. These models make use of the Convolution Neural Network (CNN) [108] as their foundation. As illustrated in Figure 4.1, detection methods are divided into two categories based on their theoretical foundation. The first are machine learning-based (ML-based) approaches [109], while the second are deep learning-based approaches. ML inspires us in a variety of ways, particularly for pattern recognition from huge amounts of data.
The pipeline of the ML-based technique for object detection is depicted in Figure 4.2.
After obtaining an image, the sliding window method will be applied to this image [114],
yielding a large number of candidate bounding boxes. This procedure is known as region
selection[115], and it means that the region of interest to detect has been placed in these
boxes. This method will generate a myriad of candidate boxes for one image, and those boxes could be of different sizes and shapes. After this step, feature extraction will be implemented for each candidate bounding box, and the feature inside each box will be extracted from the image using specific methods. Consequently, a classifier will be used to classify each box based on its features, and the class of each box is given in the last step.
In summary, the ML-based pipeline is divided into three sections: region selection using the sliding window method, feature extraction, and classification for each bounding box, with the results provided at the end. Even though it works, the approach has drawbacks. After region selection, the object bounding boxes might be generated using the sliding window technique; however, the computation involved is large, which limits its portability, and its complexity is high. ML-based algorithms might be used to address the object detection problem, but this complexity restricts their usage and development. With the advancement of computer vision (CV) and deep learning (DL), other methods for object detection have been developed after DPM. These methods are based on deep neural networks, particularly the convolutional neural network (CNN).
Deep Learning-based Object detection has two major methodologies, one-stage and two-
stage as shown in Figure 4.3. The one-stage approach combines classification and localisation
in a single step, whereas the two-stage approach requires two separate steps. While one-stage
detectors are often faster, they may compromise some accuracy as compared to two-stage detectors, which are known for their higher accuracy. As a result, there is a trade-off between speed and accuracy when choosing between the two approaches.
R-CNN
In 2014, R-CNN [116], a DL-based approach, was created to address the problem of
selecting a large number of regions. The visual information of one object, such as texture,
shape, colour, and edge, may be comparable and significant to another. As a result, the region
proposal approach rather than the sliding window method can help us in obtaining the
candidate box.
Instead of using the HOG feature, R-CNN's feature extraction process uses a
convolutional neural network (CNN). Additionally, AlexNet is used to extract each box's
features. This model could automatically extract and learn features from boxes by utilising
AlexNet. For each box, AlexNet will eventually provide a feature map. The DL-based
technique addresses the drawback of the ML-based method in feature extraction based on the
defined features.
R-CNN continues to use SVM as its classifier for classification. Each box will be classified
using SVM based on its feature map, which was received through AlexNet. R-CNN will
produce 2,000 candidate boxes from an image using the region proposal approach as a result
of these improvements. The collected features will then be used to construct 2,000 feature
maps using AlexNet. Meanwhile, the SVM classifier will be used to classify each box, and a
regressor will be used to regress the bounding box. R-CNN's speed is quite slow due to its
sophisticated computation.
Fast R-CNN
Fast R-CNN was proposed by Girshick in 2015 [117] in order to address the drawbacks of R-CNN. Fast R-CNN mimics R-CNN; however, instead of running a CNN on every region proposal, it extracts features with a single network. There is just one feature map
produced, and it directly extracts features from the input image using VGG. On the basis of
the input image, it will also carry out region selection to retrieve region proposal. This VGG
feature map's output might be partially chosen in the next phase depending on a region
proposal. Then, a ROI pooling layer will wrap and reshape those chosen feature maps into a
fixed size. The wrapped feature maps will then be transferred into a fully connected layer
where they will be ready for classification using Softmax and bounding box regression using
a linear regressor.
As a result, instead of processing numerous region proposals separately, Fast R-CNN performs feature extraction once on the whole image. This saves a significant amount of time and computation.
Faster R-CNN
Faster R-CNN outperforms Fast R-CNN by using one CNN for the image instead of
several CNN for region proposal[118]. The initial step in Faster R-CNN is the same as it is in
Fast R-CNN; both use CNN to produce a feature map. Faster R-CNN, as opposed to Fast R-
CNN, uses a neural net called RPN (Region Proposal Network) to replace selective search.
This RPN net will use CNN to construct a large number of anchors with varying sizes and
ratios. Thus, this RPN will determine if this anchor is in the foreground or background, while
simultaneously regressing its bounding box. Finally, in order to save time, the RPN will pass only the best candidate regions forward. Following the RPN net, Faster R-CNN will acquire high-quality proposals with minimal
background. This will improve the speed and accuracy of the model.
SSD
The Single Shot MultiBox Detector (SSD) [119] is a single-stage object detection
model that eliminates the need for separate region proposal and classification processes,
hence streamlining the object detection process. Due to this simplicity, detection performance is faster, making SSD well suited to real-time applications. The SSD design is made up of a base network,
which is usually a pre-trained CNN, and a sequence of convolutional layers of variable sizes.
These layers are intended to identify objects of various sizes and aspect ratios. For each
feature map cell, SSD uses default bounding boxes, also known as anchor boxes. The model
predicts both the class scores and the box offsets relative to the anchor boxes during training.
SSD has been successfully used to a wide range of fruit detection applications. Wang et al.
(2022)[120] propose a lightweight SSD detection technique for recognising Lingwu long
jujubes in natural settings. The new SSD approach delivers great detection accuracy without the need for pre-trained weights while also reducing complexity to enable mobile platform deployment, and the technique enhances the accuracy of object detection. In addition, the SSD model has been
customised and optimised for certain fruit detection applications. Researchers studied the use
of data augmentation techniques such as random cropping, flipping, and colour distortion to improve the robustness of fruit detection.
YOLO
Faster R-CNN makes use of the RPN network to create ROI. Then it does
classification and regression with high-quality anchors. As a result, this method assists Faster R-CNN in achieving more accuracy at a slower rate. Therefore, instead of this two-stage approach, a new one-stage method, YOLO, was developed to make the network faster and simpler.
YOLO predicts the object's class and location without using an anchor or RPN. As a
result, its speed is extremely rapid, but its precision has suffered. YOLO is the first real-time detector and the first one-stage object detection system. It lacks a prior box, treats detection as a regression problem, and hence has quite a basic structure; YOLOv1 employs a GoogLeNet-like model as its backbone. These strategies greatly improve detection speed; it does, however, trade away some localisation accuracy.
YOLOv2, commonly known as YOLO9000, is the second version of YOLO that can
classify 9000 classes. It employs DarkNet-19 as its backbone net and uses a prior box to
detect the offset rather than the size and location directly [122]. Additionally, it employs multi-scale and multi-step training. Although the implementation of such methods improves speed, the accuracy still falls short of two-stage detectors.
As the third version of YOLO, YOLOv3 incorporates properties from contemporary detection frameworks, such as the residual network and feature fusion. A network called DarkNet-53
was proposed. The residual network allowed the network to be created and developed in great
depth. Meanwhile, the residual network considerably reduces the vanishing-gradient problem to accelerate convergence. It combines deep and shallow feature maps
using upsampling and concatenation. As a result, it will build three different size feature maps
to be utilised for detection. YOLOv3 provides a superior performance for small object
detection using variable size feature maps, despite its slow speed [123].
YOLOv4 is the fourth version of the YOLO object detection algorithm, released in 2020. A new backbone architecture called CSPNet (Cross Stage Partial Network), a variation of the ResNet
architecture intended exclusively for object detection tasks, is the key enhancement in YOLOv4
over YOLOv3. It adds "k-means clustering" as a novel approach for creating the anchor boxes,
which groups the ground truth bounding boxes into clusters and then uses the centroids of the
clusters as the anchor boxes. This allows the anchor boxes to be more precisely matched with the size and shape of the detected objects.
In 2020, YOLOv5 was released, which used a more complicated architecture known as
EfficientDet, which is based on the EfficientNet network architecture. The use of a more
complicated architecture in YOLOv5 allows for greater accuracy and generalisation to a broader range of object categories.
YOLOv5 employs a novel approach for producing anchor boxes known as "dynamic
anchor boxes." The ground truth bounding boxes are clustered using a clustering method, and the
centroids of the clusters are used as anchor boxes. This allows the anchor boxes to be more
precisely matched with the size and shape of the detected objects. The idea of "spatial pyramid
pooling" (SPP) is also introduced in YOLOv5, which is a form of pooling layer used to reduce the
spatial resolution of feature maps. SPP is used to increase small object detection performance by aggregating features at multiple scales.
YOLO is the first real-time detector and strikes a balance in scenarios where speed is required; it can handle the fruit detection problem well, as the original design operates without anchors. Based on this, YOLOv5 is chosen as the base model in this research to deal with fruit detection.
"You only look once" (YOLO) is a deep learning strategy that is used in one-stage detectors
that use CNN to detect objects. The design of YOLO is fast and accurate, making it suitable
for real-time detection of objects applications [125]. To accomplish fast network processing,
the YOLO method employs a grid-like architecture similar to that of CNN. As illustrated in
Figure 4.4, the approach includes splitting input images into grid cells of size S x S and
assigning each cell to predict bounding boxes and object classification. In order to create a
prediction, the operation makes a single forward pass through the network, which involves
processing input data via a neural network in a single path, from input to output. Thus, "You-
only-look-once" got its name. Four parameters (x, y, width, and height) are used to define
each bounding box, together with a confidence score that represents the probability that an
object will be found inside the box. The class probabilities are also predicted by the detection
network. The number of classes that YOLO can identify is determined by the number of output nodes in the final layer of the network. Each node in the output layer represents a distinct class, and the output of each node is the likelihood that the object in the input image belongs to that class.
Figure 4.4 :YOLO Model
YOLOv5's adaptive anchor box calculation helps in training new datasets. The adaptive
anchor boxes feature uses the training dataset to construct acceptable bounding box anchors,
allowing the detector to learn different sized objects more successfully. Furthermore,
YOLOv5's loss function is based on GIoU (Generalised Intersection over Union), which
helps in training by preserving the error distance even when predictions and ground truth do
not intersect.
YOLOv5 provides multiple pretrained model checkpoints that have been trained on the COCO (Common Objects in Context) dataset as premade deployment options. The depth of these structures varies, and each has two resolution options, 640x640 and 1280x1280. A deeper backbone
may extract more information from input images and improve accuracy, but it also makes the
network operate slower computationally. These pretrained weights can also be utilised as a
foundation for fine-tuning, and YOLOv5 includes simple interfaces for doing so.
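As one possible illustration of this workflow, a fine-tuned checkpoint can be loaded through PyTorch Hub and used for inference; the weights file and image path below are hypothetical.

```python
import torch

# Load custom YOLOv5 weights via PyTorch Hub and run a single forward pass.
model = torch.hub.load("ultralytics/yolov5", "custom", path="best.pt")
results = model("tomato_field.jpg")   # hypothetical test image
results.print()                       # classes, confidences, box coordinates
```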
The YOLOv5 model, as illustrated in Figure 4.5, is made up of three parts: the backbone, the
neck, and the head. The backbone collects characteristics from the input image at various
granularities. The neck then aggregates the features retrieved by the backbone and sends them
to the next layer for prediction. Finally, the head predicts the class labels and constructs the
bounding boxes.
Input
The YOLOv5 input uses mosaic data augmentation, which stitches four images together at
random using clipping and scaling for improved efficiency in small object detection.
YOLOv5 makes use of Mosaic data augmentation for training and fine-tuning; Figure 4.6 depicts an example of mosaic data augmentation. YOLOv5 uses non-max suppression (NMS) to retain distinct prediction bounding boxes close to the ground truth bounding box.
Because the images are not consistent in size, the adaptive zooming approach is used to zoom
them to an acceptable uniform size before being sent into the network for detection, hence
eliminating issues such as a conflict between the feature map and the fully connected layer.
Backbone
To accomplish slice operations, the backbone of YOLOv5 uses the Focus layer for down-sampling. The original RGB input image is sent into the Focus layer, which executes a slice operation followed by a convolution with 32 kernels to produce a 32-channel feature map. CSP1_X and CSP2_X are the two types of Cross Stage Partial (CSP) structures in YOLOv5. The first CSP structure, CSP1_X, is used in the backbone to achieve rich gradient information while lowering computing costs. Spatial Pyramid Pooling (SPP) is used in the backbone to generate fixed-size feature maps while enlarging the receptive field.
Neck
The YOLOv5 neck is largely used to build feature pyramids, which improves the model's
detection of objects of varying sizes and enables recognition of the same object at different
sizes and scales. To aggregate the features, YOLOv5 employs the CSP2_X structure, as well
as the Feature Pyramid Network (FPN) [126] and Path Aggregation Network (PAN) [127].
Head
The YOLOv5 head is made up of non-max suppression and a loss function. The loss function
is divided into three parts: bounding-box loss, confidence loss, and classification loss. The
bounding box loss is computed using the Generalised IoU (GIoU) [128]. YOLOv5 uses weighted NMS in the post-processing stage of target object detection to filter overlapping boxes and retain the best predictions.
In machine learning, several distinct metrics are used to compare the outcomes of various
approaches and algorithms. These metrics are frequently based on four fundamental values: true positives, true negatives, false positives, and false negatives. True positives in object
detection are detections that accurately localise and classify a ground truth object. True
negatives, on the other hand, are portions of an image that do not include any objects or
detections. False positives are detections that are incorrectly localised or classified, whereas
false negatives are ground truth objects that are not detected.
The predicted bounding box is matched against the ground truth bounding box, and the degree of overlap between the two is computed to assess the accuracy of object detection. This measure is known as the Intersection over Union (IoU) or the Jaccard index, and it ranges from 0 to 1. An IoU of 1 implies a complete match between the two boxes, whereas an IoU of 0 shows no overlap, as shown in equation 4.1. In the case of multiple object classes, IoU is calculated for each class separately.
Figure 4.7 : Intersection over Union
IoU = Area of Intersection / Area of Union    (4.1)
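For axis-aligned boxes given as (x1, y1, x2, y2), equation 4.1 can be computed in a few lines; the sample boxes below are made up.

```python
def iou(box_a, box_b):
    # corners of the intersection rectangle
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

print(iou((0, 0, 10, 10), (5, 5, 15, 15)))  # 25 / 175 ≈ 0.143
```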
A chosen IoU threshold decides whether a detection counts as a true positive or a false positive; changing the IoU threshold might have an influence on the system's overall
performance. If the predicted class label matches the ground truth class label and the
predicted bounding box IoU ratio with the ground truth box exceeds a set threshold, the
object bounding box prediction is considered a true positive. All bounding boxes that do not
meet these requirements are considered false positives. False negatives are ground truth
bounding boxes that are not matched by any predictions. True negatives are not included in
the IoU calculation since they are rarely an interesting result, and metrics that rely on them
are rarely used. Figure 4.8 depicts the influence of the IoU computation on true positives,
false positives, and false negatives with example detection and ground truth bounding boxes.
The green bounding boxes in the figure represent ground truth, whereas the red bounding
boxes represent detections. The figure has two images: one on the left and one on the right.
With a 0.5 IoU threshold, the detection of the left image would be considered a true positive
and the detection of the right image as a false positive. Furthermore, the right image's ground
truth object would be regarded as a false negative because the detection for it was a false
positive.
Figure 4.8 : IOU Calculation
Precision
Precision is a metric that indicates what proportion of a model's positive predictions are correct, as given in equation 4.2:

Precision (P) = TP / (TP + FP)    (4.2)
Where TP is the number of true positives and FP is the number of false positives.
Precision indicates how reliably the model is making positive predictions, but it is insufficient on its own to draw conclusions about the model's actual accuracy. Since false negatives are not taken into consideration while calculating precision, the model may miss nearly every object in the input images and yet
receive a precision score of 1.0. However, precision may be a highly helpful metric for
assessing model performance in situations when the main objective is minimising false
positives.
Recall
Recall is a metric that indicates the percentage of ground truth objects that a model has successfully detected, as given in equation 4.3:

Recall (R) = TP / (TP + FN)    (4.3)
Where FN is the number of false negatives. In object detection recall can be used to
easily check a model’s ability to find all the interesting objects from an input image. But as
with precision, recall alone does not give a complete measure of model’s accuracy.
The model can achieve a high recall score just by predicting a large number of bounding boxes covering the whole image.
Precision-Recall Curve
The precision-recall curve is a graph that connects two metrics: precision on the y-axis and recall on the x-axis. The plot may be used to compute a new metric, AUC (area under curve), which indicates the area under the precision-recall curve [129]. This metric combines precision and recall and largely overcomes the individual limitations of its components. The AUC score takes into account TP, FP, and FN, which are important quantities for object detection. As a result, it may be utilised as a full metric for the detection accuracy of a model. However, because the AUC score does not reflect the precision-recall ratio, the precision-recall curve itself remains equally essential in the accuracy analysis of a model. The AUC does not take into account which IoU threshold is used to compute TP, FP, and FN.
Figure 4.9 depicts an example precision-recall curve with AUC values produced by the
YOLOv5 framework.
Figure 4.9: Precision-Recall curve of YOLOv5 Model
The mAP (mean average precision) measure is commonly used in object detection to assess
accuracy and regression performance. It is defined as the mean of the AP (average precision)
scores for each class. The AUC score of a particular class in a precision-recall curve
calculated via equation 4.4 is represented by AP. The greater the average precision, the better the detection performance.

AveragePrecision (AP) = ∫ (from 0 to 1) p(r) dr    (4.4)
In object detection research, mAP is often presented with a decimal number appended to it, in a format like [email protected]. The @0.5 notation represents the IoU threshold value used to calculate TP, FP and FN. The threshold value can also be given as a range, [email protected]:0.95, which represents the average of the mAP scores with threshold values from the given range.
Mean average precision is standard as an overall evaluation metric for measuring the accuracy of an object detection system across all classes, taking the average of the APs across all object classes as in equation 4.5:

mAP = (1/n) Σᵢ₌₁ⁿ AP(i)    (4.5)

where AP(i) is the average precision of class i and n is the number of classes.
The area under the curve (AUC) scores in Figure 4.9 reflect the [email protected] scores for the respective classes, whereas the overall score represents [email protected]. mAP provides an informative performance indicator for measuring object detection accuracy via the IoU threshold value. Although the measure may not be appropriate for certain tasks, since it does not reveal the exact relationship between precision and recall, a higher mAP result at the same IoU threshold is typically preferable. In certain studies, AP is used to refer to the same metric as precision, while mAP is used to denote the precision metric's average over all predicted classes.
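As a worked illustration of equations 4.2 to 4.5, the sketch below approximates AP by numerically integrating precision over recall; the four operating points are hypothetical, and padding the curve to recall 0 and 1 is a simplifying assumption rather than the exact protocol used here.

    import numpy as np

    def average_precision(precision, recall):
        # AP = integral of p(r) dr (equation 4.4), via trapezoidal integration.
        order = np.argsort(recall)
        r = np.concatenate(([0.0], np.asarray(recall)[order], [1.0]))
        p = np.concatenate(([1.0], np.asarray(precision)[order], [0.0]))
        return np.trapz(p, r)

    # Hypothetical precision/recall pairs at four confidence thresholds.
    ap = average_precision([0.95, 0.90, 0.85, 0.70], [0.30, 0.55, 0.70, 0.86])
    # mAP (equation 4.5) would simply average such AP values over all classes.
    print(round(float(ap), 3))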
4.6 Methodology
The images used for this research are from the Laboro Tomato dataset [130], a tomato dataset made up of tomatoes photographed at various stages of ripening and created for tasks such as segmentation and object detection. The dataset consists of 2034 images with a 416 x 416 resolution, captured under various environmental conditions such as occlusion, lighting, shade, overlap, and others. Figure 4.10 displays some sample images.
Figure 4.10: (a) Single tomato (b) Overlapping tomatoes (c) Occlusion by branch (d).
The YOLO object detection paradigm requires labelled data, which provides the name and
location of the image's ground truth bounding boxes. The labelled dataset may be used to
train and test machine learning models, which will assist computers in automatically
identifying and locating target objects. As demonstrated in Figure 4.11, labelImg is a
graphical annotation tool that is used to label 2034 images by placing bounding boxes around
each tomato in the image and naming each ground truth bounding box with a class label (ripe
or unripe). Using the LWYS (Label What You See) method, all visible tomatoes, ripe and
unripe, are marked with a bounding box in each image. Notably, the bounding boxes for highly occluded tomatoes are drawn by inferring the full fruit shape from the visible part using human judgement.
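For reference, each image's labels are stored one box per line in the normalised YOLO text format; the file name and coordinate values below are hypothetical.

    # <class_id> <x_center> <y_center> <width> <height>, all normalised to [0, 1]
    label_lines = [
        "0 0.512 0.430 0.180 0.210",   # ripe tomato
        "1 0.250 0.675 0.095 0.110",   # unripe tomato, partially occluded
    ]
    with open("tomato_0001.txt", "w") as f:   # matches image tomato_0001.jpg
        f.write("\n".join(label_lines))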
Following the labelling process, the dataset is divided into three unique subsets, namely the
training set, validation set, and test set, each of which contains 1628, 203, and 203 original
images, respectively. The use of a separate test set was crucial for measuring the trained model's performance on unseen data, while the validation set supports hyperparameter adjustment and allows the best model to be chosen throughout the training phase. By dividing the dataset into subsets, it is ensured that the trained model does not overfit the training data and generalises to new images.
In the discipline of deep learning, having a big and diverse training dataset is critical for
constructing and testing strong and accurate models. Models must be able to identify and find
objects of varied sizes, shapes, and orientations in a variety of environments for difficult tasks
such as object detection. Large training datasets expose models to a wide range of
object instances, backgrounds, and lighting situations, allowing them to learn powerful
features and generalise effectively to new data. Training on vast and diverse datasets is
critical for increasing the accuracy and generalisation capabilities of deep learning models.
The goal of data augmentation approaches is to artificially enhance the size and variety of the
training dataset, which improves the performance and generalisation of deep learning models.
The primary objective is to provide variations in the training data that simulate real-world
scenarios that the model is likely to experience upon deployment. Geometric transformations,
for example, make it possible to replicate diverse views, orientations, and positions of objects in images, which can help the model acquire more resilient and invariant features. Colour space transformations, on the other hand, help in simulating changes in lighting conditions and atmospheric effects that can impact the visual appearance of objects in images. Through the introduction of such variations, the model may learn to recognise and adjust to these conditions. In addition, image filtering techniques help in smoothing out noise and reducing image artifacts, enhancing image clarity and quality. This can assist the model in learning more relevant and robust features.
As a result, horizontal flipping is used on the 1628 original images in the training set to
expand the dataset and improve the model's robustness in detecting tomato fruit at different angles. Figure 4.12 depicts the original images as well as their flipped variants following the data augmentation strategy. The number of images in the training set thus rose to 3256.
Figure 4.12 : Augmented Images (a) Original Image (b) Flipped Image
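A minimal sketch of this augmentation step is given below, assuming a hypothetical train/images and train/labels directory layout; each image is mirrored with OpenCV and the x-centres in its YOLO label file are flipped accordingly.

    import cv2, glob

    for img_path in glob.glob("train/images/*.jpg"):
        cv2.imwrite(img_path.replace(".jpg", "_flip.jpg"),
                    cv2.flip(cv2.imread(img_path), 1))   # 1 = horizontal flip
        lbl_path = img_path.replace("images", "labels").replace(".jpg", ".txt")
        with open(lbl_path) as f:
            rows = [line.split() for line in f if line.strip()]
        # Mirroring only changes the x-centre: x' = 1 - x.
        flipped = ["%s %.6f %s %s %s" % (c, 1.0 - float(x), y, w, h)
                   for c, x, y, w, h in rows]
        with open(lbl_path.replace(".txt", "_flip.txt"), "w") as f:
            f.write("\n".join(flipped))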
This section goes into detail about the proposed enhanced YOLOv5 architecture, as seen in
Figure 4.13, including attention mechanism and distance intersection over union. Because the
Convolutional Block Attention Module (CBAM) was designed to improve critical features,
the suggested CAM-YOLO inserts it into the network structure after each feature fusion operation.
Attention mechanism
The attention mechanism extracts a small amount of meaningful information from a huge quantity of information and concentrates on it, ignoring the rest of the unimportant information. In this work it is realised through the Convolutional Block Attention Module (CBAM), which is divided into two sub-modules: the spatial and channel modules. Figure 4.14 displays the CBAM structure. The channel attention module tries to capture ''what'' is relevant in the provided images, whereas the spatial attention module focuses on ''where'' the informative regions lie.
As illustrated in Figure 4.15, the channel attention module first aggregates the spatial data of
the given input feature map in two dimensions (height and width, respectively) using global
max pooling and global average pooling. The multilayer perceptron then processes the output
with the help of a shared fully connected layer. Finally, using the sigmoid activation operation σ, the final channel attention feature is built and fed as input to the spatial attention module, as in equation 4.6:

M_CA(IF) = σ(MLP(AvgPool(IF)) + MLP(MaxPool(IF)))    (4.6)
Figure 4.15 : Channel Attention Module
The spatial attention module is presented after the channel attention module. As shown in
Figure 4.16, the spatial attention module conducts max and mean pooling in the channel
dimension. First, max-pooling and average pooling are performed on the convolution
module's input, and then the max-pooled and average-pooled tensors are concatenated. The
visual cues in images are then activated using a convolution operation with a kernel of size (7
x 7) and a sigmoid activation function. The spatial attention map M_SA(IF) is computed using equation 4.7:

M_SA(IF) = σ(f⁷ˣ⁷([AvgPool(IF); MaxPool(IF)]))    (4.7)
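A compact PyTorch sketch of the two modules, following equations 4.6 and 4.7, is shown below; the feature-map size and the reduction ratio of 16 are assumptions rather than values taken from the thesis configuration.

    import torch
    import torch.nn as nn

    class ChannelAttention(nn.Module):
        # Equation 4.6: shared MLP over globally avg- and max-pooled features.
        def __init__(self, channels, reduction=16):
            super().__init__()
            self.mlp = nn.Sequential(
                nn.Conv2d(channels, channels // reduction, 1, bias=False),
                nn.ReLU(inplace=True),
                nn.Conv2d(channels // reduction, channels, 1, bias=False))

        def forward(self, x):
            avg = self.mlp(torch.mean(x, dim=(2, 3), keepdim=True))
            mx = self.mlp(torch.amax(x, dim=(2, 3), keepdim=True))
            return torch.sigmoid(avg + mx) * x

    class SpatialAttention(nn.Module):
        # Equation 4.7: 7x7 convolution over channel-wise mean and max maps.
        def __init__(self):
            super().__init__()
            self.conv = nn.Conv2d(2, 1, kernel_size=7, padding=3, bias=False)

        def forward(self, x):
            avg = torch.mean(x, dim=1, keepdim=True)
            mx = torch.amax(x, dim=1, keepdim=True)
            return torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1))) * x

    class CBAM(nn.Module):
        # Channel attention followed by spatial attention, as described above.
        def __init__(self, channels):
            super().__init__()
            self.ca, self.sa = ChannelAttention(channels), SpatialAttention()

        def forward(self, x):
            return self.sa(self.ca(x))

    print(CBAM(256)(torch.randn(1, 256, 52, 52)).shape)   # (1, 256, 52, 52)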
The proposed algorithm retains the YOLOv5 backbone network's original network structure
while extracting the features from its three feature layers and sending them to the head
network. It then incorporates CBAM in the head to strengthen the network as it transmits
from the shallow layer to the deep layer by enhancing the attention of the given feature map.
The network can learn the target's feature information more efficiently, capture the target's
recognition features more accurately in the same test image, and achieve a better recognition
effect without increasing the training cost by learning meaningful feature maps, particularly for small and occluded targets.

The goal of non-max suppression in object detection algorithms is to select the most optimal
bounding box for the object while suppressing all other boxes. Traditionally, non-max
suppression computes the IoU between the detection box with the best score and the other
boxes and deletes the boxes with IoU greater than the specified threshold. Because it is based
on the basic IoU, NMS sometimes mistakenly discards occluded objects. Therefore, to improve the missed detection scenario, Distance IoU (DIoU) is integrated with NMS. When two boxes with a substantial IoU actually cover two distinct objects, neither box should be eliminated. As stated in equation 4.8, DIoU considers the distance
between the prediction's centre point and that of the real bounding box, as well as the overlap rate, when deciding which boxes to suppress:

DIoU = IoU − ρ²(b, bᵍᵗ) / c²    (4.8)

where b and bᵍᵗ denote the centre points of the predicted box B and the ground-truth box Bᵍᵗ, ρ(·) denotes the Euclidean distance between them, c is the diagonal length of the smallest box enclosing both boxes, and IoU is defined by equation 4.9.
IoU = |B ∩ Bᵍᵗ| / |B ∪ Bᵍᵗ|    (4.9)

With DIoU-NMS, the confidence score of each remaining box is updated according to equation 4.10:

sᵢ = { sᵢ,  if DIoU(M, Bᵢ) < ε
     { 0,   if DIoU(M, Bᵢ) ≥ ε    (4.10)
where sᵢ denotes the confidence score, M denotes the box with the highest confidence score, and Bᵢ ranges over the competing boxes of the current class. In comparison to IoU, DIoU takes the centre points of the two boxes into account, which helps resolve the occlusion issue produced by closely spaced objects: boxes that belong to distinct but overlapping tomatoes are less likely to be wrongly suppressed.
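The sketch below illustrates DIoU-NMS following equations 4.8 to 4.10, with a hypothetical threshold of 0.5; it is a reading aid under these assumptions, not the exact implementation used in the detector.

    import torch

    def diou(box, boxes):
        # IoU term (equation 4.9).
        x1 = torch.maximum(box[0], boxes[:, 0]); y1 = torch.maximum(box[1], boxes[:, 1])
        x2 = torch.minimum(box[2], boxes[:, 2]); y2 = torch.minimum(box[3], boxes[:, 3])
        inter = (x2 - x1).clamp(min=0) * (y2 - y1).clamp(min=0)
        union = ((box[2] - box[0]) * (box[3] - box[1])
                 + (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1]) - inter)
        iou = inter / union
        # Squared centre distance over the squared diagonal of the enclosing box (eq. 4.8).
        dist = ((box[0] + box[2]) / 2 - (boxes[:, 0] + boxes[:, 2]) / 2) ** 2 \
             + ((box[1] + box[3]) / 2 - (boxes[:, 1] + boxes[:, 3]) / 2) ** 2
        cw = torch.maximum(box[2], boxes[:, 2]) - torch.minimum(box[0], boxes[:, 0])
        ch = torch.maximum(box[3], boxes[:, 3]) - torch.minimum(box[1], boxes[:, 1])
        return iou - dist / (cw ** 2 + ch ** 2)

    def diou_nms(boxes, scores, eps=0.5):
        # Equation 4.10: keep the top-scoring box, drop boxes too close to it.
        order, keep = scores.argsort(descending=True), []
        while order.numel() > 0:
            i = order[0]
            keep.append(int(i))
            if order.numel() == 1:
                break
            rest = order[1:]
            order = rest[diou(boxes[i], boxes[rest]) < eps]
        return keep

    boxes = torch.tensor([[10., 10., 60., 60.], [12., 12., 62., 62.],
                          [200., 200., 250., 250.]])
    print(diou_nms(boxes, torch.tensor([0.9, 0.8, 0.7])))   # [0, 2]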
An improved YOLOv5 network model for tomato fruit detection is developed on 3256
tomato images using the pre-trained YOLOv5s model. Table 4.1 describes the experimental
environment for tomato fruit detection module training. The experiment is carried out using
the Google Colaboratory platform with Python 3.10.12, PyTorch 2.0.1, CUDA version 11.8,
and a Tesla T4 GPU with a memory capacity of 16GB. Google Colaboratory offers a free
cloud computing environment with GPU capability, allowing for fast deep learning model
training. PyTorch is a popular deep learning framework that allows for the efficient implementation of neural network models. CUDA version 11.8 is used to accelerate the
training and inference processes by using the capability of GPU computation. The Tesla T4
GPU is well-suited for deep learning applications due to its excellent performance.
Table 4.1 Experimental Environment Configuration

Parameter   Value
Platform    Google Colaboratory
Python      3.10.12
PyTorch     2.0.1
CUDA        11.8
GPU         Tesla T4 (15101MB)
Table 4.2 displays the hyperparameter values of the improved YOLOv5 algorithm used in the
tomato fruit detection module. The table contains specific information on several
hyperparameter variables that were adjusted throughout the training process, such as learning
rate, batch size, and momentum. It also includes the parameters for the number of iterations,
epochs, and steps to reduce the learning rate. The model was trained for 100 epochs using a
batch size of 16 and an image size of 416. A learning rate of 0.01, momentum of 0.937, and
weight decay of 0.0005 are applied to the SGD optimizer. Warmup epochs, warmup momentum, and warmup bias learning rate were set to 3.0, 0.8, and 0.1, respectively. With a maximum detection
number of 1000, the IoU threshold for non-maximum suppression was set to 0.7 and the
confidence threshold to 0.001. Box, class, and objectness scale hyperparameters were also
significant and were set at 7.5, 0.5, and 1.0, respectively. In addition, data augmentation with
various parameters such as hue, saturation, and brightness is applied during training, and
mosaic was used to combine multiple images. Hyperparameters are critical to the algorithm's
performance, and fine-tuning them can result in higher detection accuracy. As a result, the
updated YOLOv5 algorithm can achieve high detection accuracy for tomato fruit in various complex environments.

Table 4.2 Hyperparameters of the improved YOLOv5 algorithm

Parameter       Value
No. of epochs   100
Batch size      16
Optimiser       SGD
Momentum        0.937
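For reproducibility, a training run matching Tables 4.1 and 4.2 can be launched through the Ultralytics YOLOv5 train.py script roughly as sketched below; the dataset YAML name and the repository path are hypothetical.

    import subprocess

    subprocess.run([
        "python", "train.py",
        "--img", "416", "--batch", "16", "--epochs", "100",
        "--data", "tomato.yaml",       # hypothetical dataset configuration file
        "--weights", "yolov5s.pt",     # pre-trained YOLOv5s checkpoint
    ], cwd="yolov5", check=True)       # assumes the YOLOv5 repository is cloned here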
Figure 4.17 illustrates the ground truth of the labelled tomato fruit results on the validation
dataset, which have been carefully curated to serve as a baseline for measuring the detection
model's performance. As shown, tomato fruits vary significantly in size, shape, and orientation.
Meanwhile, Figure 4.18 shows the detection results of the proposed model on the same
tomato fruit validation dataset. The proposed model performed well in detecting most of the
tomatoes in different scenarios, including those of diverse sizes and overlapping with one
another, as indicated by the high degree of overlap between the predicted bounding boxes and
the ground truth annotations. It also displayed exceptional generalisation ability while
identifying tomatoes of various sizes and orientations. A comparison of Figures 4.17 and 4.18
demonstrates the effectiveness of the proposed model in recognising tomato fruits in real-
world scenarios.
Following the completion of 100 training epochs, the performance of the proposed detection
model is evaluated by plotting its precision-recall (PR) curve, as shown in Figure 4.19. The
PR curve displays the trade-off between precision and recall at different confidence thresholds. Recall represents the proportion of true positive samples correctly detected by the model, whereas precision represents the ratio of true positive samples among the detected results. The area under the PR curve (AUC) summarises the model's performance, which was 0.868 in this case. It is observed that at low recall the model's precision is relatively high, owing to the high proportion of true positives among the detected results. However, as recall grew, precision steadily decreased due to an increase in the number of false positives produced by the model. Furthermore, the model's performance is assessed using the balance point on the PR curve, which represents the point at which recall and precision are equal. The model had a high balance point, suggesting that it achieved a good balance between recall and precision and accurately detected positive samples. In conclusion, the proposed detection module performed well on the PR curve after training, accurately detecting tomatoes in the validation images.
Figure 4.20 showcases three loss functions and four evaluation metrics. The loss
function is composed of bounding box regression loss, objectness loss, and classification loss.
The bounding box loss measures the model's accuracy in locating the object's center and
coverage of predicted bounding boxes. It comprises position offset and scale change, where
position offset refers to the deviation between predicted and ground truth bounding boxes,
and scale change indicates the scale ratio between predicted and ground truth bounding
boxes. Objectness loss measures the likelihood of an object existing in a proposed region of
interest. Each predicted bounding box has an objectness score, which indicates whether or not
an object exists. The objectness loss is based on the binary cross entropy loss function, with a
predicted bounding box containing a true target having an objectness score near to one and
vice versa. Because the data only contains one category, the classification loss stays zero.
Four evaluation metrics are shown in Figure 4.20: Precision (P), Recall (R), mAP_0.5, and mAP_0.5:0.95. Precision assesses what fraction of the predicted bounding boxes are correct, whereas recall measures what fraction of the ground truth boxes are recovered. mAP_0.5 is the average precision at an IoU threshold of 0.5, whereas mAP_0.5:0.95 is the average of the mAP values at IoU thresholds ranging from 0.5 to 0.95 with a step size of 0.05. After around 50 epochs of training, the validation metrics levelled off, indicating that the model had converged.
Figure 4.20: Evaluation Metrics of the Proposed Model
Table 4.3 compares the performance of multiple YOLO models on the collected dataset, including their training epochs, image sizes, precision, recall, [email protected], and [email protected]:0.95. Among all the models, the proposed model achieved the highest precision, 87.3%.

Table 4.3 Performance comparison of the YOLO models

Algorithm            Epochs   Image Size   [email protected] (%)   [email protected]:0.95 (%)   Precision (%)   Recall (%)
YOLOv5               100      416 x 416    85.9          39.1               84.9            80.1
YOLOv5+ (backbone)
Figure 4.21 shows a comparison of the precision, recall, and [email protected] of the proposed CAM-YOLO model against other state-of-the-art detection models. The results show that the improved detection model outperforms the others, reaching a mean average precision at an intersection-over-union threshold of 0.5 of 88.1%, an improvement over the baseline model. The results are also displayed as a graph in Figure 4.22.
[Bar chart of [email protected], precision, and recall (in percent) for the three compared models.]
Figure 4.22: Performance metrics of the three models
To evaluate the model's performance, several images from the dataset were chosen for testing, and the detection performance of the CAM-YOLO and standard YOLOv5 models was compared under different conditions. The identification of overlapped tomatoes is demonstrated in Figure 4.23, where the tomatoes occluded or overlapped by a branch or by other tomatoes are not detected by YOLOv5 in Figure 4.23b, while the overlapped tomato is detected by CAM-YOLO in Figure 4.23c. Figure 4.24 shows the identification of small tomatoes, where a small tomato missed by the YOLOv5 model is detected by CAM-YOLO.
Figure 4.23: Results of identification of Overlapped Tomatoes
4.10 Summary:
This chapter proposed CAM-YOLO, a tomato detection and classification algorithm based on the YOLOv5 model. Firstly, the CBAM attention module is added to the network, which allows it to extract meaningful information quickly. The algorithm then makes use of Distance Intersection over Union (DIoU) with Non-Max Suppression (NMS) to decrease the rate of missed detections for overlapping tomatoes. The [email protected] of the proposed algorithm is 88.1%, an improvement over the baseline YOLOv5 model. The proposed CAM-YOLO is also efficient in addressing low inference accuracy and missed targets, and it surpasses the traditional YOLOv5 when it comes to detecting tiny and dense targets, resulting in better overall detection performance.
Chapter 5
In this chapter, a deep learning based ensemble model for the detection and classification of tomato fruit diseases is presented. Firstly, the dataset was trained using pre-trained models, namely VGG16, ResNet50, DenseNet121, InceptionV3 and Xception, applying a transfer learning and fine-tuning strategy. The proposed ensemble is compared with State-of-The-Art (SOTA) models to examine its performance on the collected dataset. An average accuracy of 98.54% is obtained by the average ensemble model on the tomato dataset.
5.1 Introduction
According to the Food and Agriculture Organisation (FAO), diseases and pests destroy up to
40% of the world's fruit crops [131]. Plant disease may severely impact the quality and
quantity of plant production. Plant diseases are extremely common, which is why disease
detection in plants is so crucial in farming. Diseases cause both direct and indirect financial
losses for farmers. Plant disease classification is imperative since it helps in establishing an
appropriate management step towards limiting the disease's further spread after the disease is
accurately recognised and treated appropriately. The advancement of machine learning and deep learning approaches in plant disease identification has been remarkable, representing a major opportunity for agriculture. The tomato itself is widely consumed and regarded as a protective supplementary food. Because of its short lifespan and high productivity, this crop is critical from an economic aspect, and as a result, the area under cultivation is expanding every day. Tomato is a fundamental component in many preserved dishes, including ketchup,
sauce, chutney, soup, paste, and puree. It is sensitive to a number of diseases, and this
condition has a substantial negative influence on tomato quality and yield, resulting in large
financial losses.
The majority of tomato diseases cause colour changes or various spots on the soft
tissue or flesh on the exterior of the fruit. Both biotic and abiotic substances have the
potential to be harmful to tomato fruit. Insects, bacteria, fungi, and viruses are examples of living (biotic) agents [132]. Non-living abiotic factors include a variety of atmospheric conditions such as rapid temperature changes, too much moisture, nutrient deficiency, and acidic soil.
5.2 Background
The use of an image processing approach for disease detection [134] has proven effective, since detection methods work better when image processing is applied. In a few specific circumstances and crops, the current trend of
classifying plant diseases using different Machine Learning (ML) algorithms has produced
positive results [135]. On the basis of computer image processing, rapid and precise approaches for identifying plant diseases can be created. Results from deep learning (DL) techniques now rival those from shallow learning algorithms, and plant diseases may be identified automatically using DL-based disease detection. The reason is that such deep learning models are expressive and well suited to learning from computer-vision related datasets. Many deep learning algorithms exist and are utilised for a variety of purposes, including optimising plant disease detection and exploiting the benefits of hyperspectral images at different scales. Convolutional Neural Networks (CNN) are a good choice for processing plant leaf and fruit images in order to detect
disease in tomato leaves. This model is made up of three convolution layers, three maximum
pooling layers, and two completely connected layers. According to experimental results, the
proposed model outperforms the pre-trained models VGG16, InceptionV3, and MobileNet.
The recommended model's average classification accuracy for the 9 disease and 1 healthy
classes is 91.2%, ranging from 76% to 100% depending on the class. However, because the model's testing accuracy is low, it is necessary to improve the same model's accuracy using the PlantVillage dataset. Using transfer learning approaches, the model's
performance was evaluated based on classification accuracy, sensitivity, specificity, and F1-
score [138].
CNN may be hosted on the internet for simple access and immediately used by
farmers to identify diseases damaging their crops. Farmers can salvage the fruits if they spot
the disease early on. Even if the disease has spread to a portion of the fruit crop, it would be
good to identify it so that the disease may be stopped from spreading further.
Image-based methods can also improve disease classification performance. Rapid and exact detection of the disease's severity will
assist in lowering yield losses. The research here focuses on the classification of tomato fruit
disease. Six classes are examined here, including a healthy class and five tomato fruit
diseases. Anthracnose (AN), Bacterial Canker (BC), Bacterial Speck (BS), Blossom End Rot
(BER), Ghostspot (GS), and Healthy (HT) are the tomato fruit diseases analysed here.
5.2.1 Common Diseases of Tomatoes
Anthracnose
Anthracnose is typically a problem on mature (or overripe) fruit, although it can also
infect leaves, stems, and roots. Fruit that is not yet ripe may become infected as well;
however, symptoms may not develop until the fruit begins to mature. Disease development is
aided by moist circumstances. The pathogens can be disseminated by splashing water from rain or overhead irrigation.
Symptoms: Figure 5.1 depicts the development of small, slightly depressed, circular spots on
mature fruit. These lesions may grow in size, become sunken, and merge together. Small
black patches (microsclerotia) eventually develop in the tan cores of lesions. When moist
circumstances exist, salmon-colored spores can be seen in masses on the surfaces of lesions.
Bacterial canker
The pathogen is seedborne and may persist on surfaces and production supplies (stakes, trays) as well as in
infected plant debris and weed hosts. The disease can also be transported from plant to plant on contaminated tools and workers' hands.
Symptoms: Tomatoes infected with the bacterial canker pathogen can exhibit a wide range of
symptoms. Wilting is the most visible symptom. Wilting normally begins in the bottom half
of the plant and progresses upward; however, wilting can occur at the point of pathogen
beginning when plants are injured. Fruit may have little (1/4 inch) creamy white dots with tan
or brown centres (called a bird's-eye spot). The fruit surface may appear netted or marbled as the disease progresses.
Bacterial speck
Bacterial speck is a dangerous tomato disease that may be difficult to treat when
disease incidence is high and climatic circumstances are favourable. Disease growth is helped
by high humidity and cool temperatures. The infection is transmitted via seed and can also
be spread through splashing water, infected tools and equipment, and personnel. If conditions
are favourable, the disease can also live from season to season in agricultural waste.
Symptoms: On leaflets, circular, dark brown to black lesions occur; a yellow halo may form
around lesions over time. Fruit can also develop dark lesions; these fruit lesions appear as small, slightly raised specks.

Blossom end rot

Blossom end rot is an environmental (not fungal) problem that is most typically caused by uneven watering or a calcium deficit. This common garden "disease" is frequently mistaken for an infectious one.

Symptoms: Blossom end rot symptoms appear on both green and ripe fruits and can be recognised by water-soaked regions on the blossom end that gradually enlarge and mature into dark, sunken, leathery lesions, as depicted in Figure 5.4.
Figure 5.4 Blossom End Rot
Ghostspot
Ghost spot is a prevalent disease that may spread quickly on tomatoes grown in
confined structures. The pathogen prefers high humidity and cool temperatures, and it needs
free moisture to germinate its spores. Spores can be disseminated by wind and air movement.
Symptoms: Figure 5.5 shows fruit with faint, pale halos (3 to 8 mm in diameter). They are
white on immature fruit and yellow on matured fruit. A small necrotic fleck may occur
alongside the halo. Spots seldom develop further, but a shift to favourable conditions can cause ghost spots to progress into grey mould fruit rot.
Figure 5.5 Ghostspot
Buckeye rot

Buckeye rot is a fruit rot caused by soilborne Phytophthora species. Development of the disease is helped by high humidity and warm temperatures. The disease is especially severe
in damp soils. Water that has been infected or that has been splashed can transmit the
infection.
Symptoms: On diseased fruit, a dark, greasy lesion appears. Lesions increase with time and
may cover a considerable area of the fruit. Concentric rings are often observed within the
lesion depicted in Figure 5.6. When the lesion is exposed to moisture, a white, cottony fungal
growth may form on the surface. Infected fruits are frequently those in direct contact with the soil.
Figure 5.6 Buckeye Rot
Deep learning facilitates feature extraction from the input image [139]. It is extremely
accurate and fast in its ability to resolve challenging problems. By modifying the layers and
how they are combined in the model, the accuracy of the model may be increased. Due to
their promising results, deep learning networks have been widely used in numerous
areas [140,141]. The detection and classification of tomato fruit disease for the six classes considered here is performed using the following pre-trained models.
VGG16
The VGG16 model [142], also referred to as the VGGNet, has a depth of 16 layers.
The VGG16 convolutional neural network model was presented by K. Simonyan and A.
Zisserman from the University of Oxford in the publication "Very Deep Convolutional Networks for Large-Scale Image Recognition". The model, trained on the ImageNet dataset of over 14 million images divided into 1000 classes, achieved a top-5 test accuracy of about 92.7%. It replaces huge kernel-sized filters with stacks of 3x3 kernel-sized filters, outperforming AlexNet by a substantial margin. It was trained using Nvidia Titan Black GPUs over the course of many weeks. Figure 5.7 depicts the 3x3 filters used in the 16-layer-deep VGG16 network.
ResNet50
A ResNet [143], also known as a residual neural network, is a sort of deep learning
model in which the weight layers learn residual functions with reference to the layer's inputs.
A residual network is one that contains skip connections for identity mappings and is added to
the layer outputs. One of the challenges addressed by ResNets is the famous vanishing
gradient problem. When the network is too deep, the gradients required to update the weights shrink towards zero after a certain number of chain rule operations. As a result, learning stops, since the weights' values are never updated. However, the skip connections allow gradients in ResNets to pass from later layers back to the initial filters.
Figure 5.8 depicts ResNet50, a ResNet model version of 48 Convolution layers, 1
MaxPool layer, and 1 Average Pool layer. The model also includes approximately 23 million
trainable parameters, indicating a deep architecture that improves image classification. The
ResNet model accepts images that are 224 x 224 in size. Creating a model from scratch requires collecting a large amount of data and training it from random weights, so using a pretrained model is a highly effective alternative. Other pretrained deep models, such as VGG19, GoogleNet, or AlexNet, are also available; however, ResNet50 is known for excellent generalisation performance with lower error rates on recognition tasks, and is therefore a useful backbone.
InceptionV3
The Inception network [144], also known as GoogleNet, was designed to overcome
the limitations of existing networks and to improve accuracy and speed. An inception
network is a deep neural network with an architectural layout made up of repeating elements
known as inception modules. Prior to its development, most common CNNs merely layered
convolution layers deeper and further in with the purpose of enhancing efficiency, but this
resulted in an overfitting issue. Figure 5.9 depicts the basic idea behind the inception module,
which is to conduct several operations with varying filter sizes such as 1x1, 3x3 and 5x5 in parallel to
avoid any trade-offs. The inception model is made up of many inception modules.
Figure 5.9 Inception module
InceptionV3's architecture is seen in Figure 5.10. It features 42 layers and a lower error rate than its predecessors.
DenseNet
DenseNet [145] concatenates (·) the output of the previous layer with the output of the subsequent layer, whereas ResNet employs an additive technique (+) that combines the previous layer (identity) with the subsequent layer.
vanished gradient in high-level neural networks. Simply said, because of the longer path
between the input layer and the output layer, the information vanishes before reaching its
target. The DenseNet is organised into DenseBlocks, each with its own set of filters but
sharing the same dimensions. The Transition Layer employs batch normalisation via
downsampling with 1x1 convolution and 2x2 pooling layers. DenseNet is available in several depths, including DenseNet121, DenseNet169 and DenseNet201.
Xception
Xception [146], which means "extreme inception," takes the essential principles of Inception
to their logical conclusion. Inception's original input was compressed using 1x1 convolutions,
and each depth space was produced using a separate set of filters from a distinct input space.
Xception just reverses this stage. Instead, it applies the filters to each depth map
independently before using 1x1 convolution to compress the input space overall. As
illustrated in Figure 5.12, its architecture is a linear stack of 36 residually linked depth-
separable convolution layers that serve as the network's feature extraction foundation.
Figure 5.12 Xception Architecture
Deep neural networks are nonlinear methods capable of learning complex relationships between variables and approximating almost any mapping function given sufficient resources. However, the trained models depend heavily on the exact training data used as well as on the random initial weights. As a result, the final model delivers different predictions each time the same model configuration is trained on the same dataset. This can be problematic when training a final model to make predictions on new data. The high variance of the approach can be decreased by training several different models for the problem and combining their predictions, as illustrated in Figure 5.13.
Figure 5.13 Ensemble Method
In ensemble learning, multiple base models are developed to predict a result, either via the use of a number of modelling approaches or through the use of a variety of training data sets. The ensemble model then aggregates the predicted value of each base model, yielding a single final prediction for the unseen data. There are several types of ensemble methods, each with its own set of advantages and disadvantages; averaging, mixing, and voting are all common ensemble approaches [148].
The detection and classification of Tomato fruit disease performed in this research is
explained here. The study focuses on the classification of diseases in tomato fruit. This task
discusses tomato fruit classification utilising the newly created average ensemble CNN
model, as well as VGG16, ResNet50, DenseNet121, InceptionV3 and Xception via transfer learning, for disease detection and classification. The collected dataset is augmented, and the images are resized to the required size of 224 x 224. The dataset is further divided into training and test
datasets. The proposed ensemble model along with VGG16, ResNet50, DenseNet121,
InceptionV3 and Xception are trained on a training dataset for disease classification. The
trained model is validated against test data to predict new data classes from the data collected.
The input image requirement for the pre-trained models is 224 x 224. Consequently, the input images are scaled to this size after augmentation so that they match the models' expected input format. The flowchart for the proposed models for tomato
disease classification and prediction is shown in Figure 5.15. For all the models, transfer
learning is also done for the six classes of tomato fruit diseases. The top layer of the pre-
trained models is replaced with a fully connected layer, the SoftMax classifier layer, to apply
transfer learning.
Figure 5.15 Disease Detection and Classification flowchart
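A minimal Keras sketch of this head replacement, using VGG16 as an example backbone, is shown below; the dense-layer width of 256 is an assumed value rather than one reported in the thesis.

    from tensorflow.keras.applications import VGG16
    from tensorflow.keras.layers import Dense, GlobalAveragePooling2D
    from tensorflow.keras.models import Model

    base = VGG16(weights="imagenet", include_top=False, input_shape=(224, 224, 3))
    base.trainable = False                       # freeze the backbone for transfer learning
    x = GlobalAveragePooling2D()(base.output)
    x = Dense(256, activation="relu")(x)         # assumed width for the dense layer
    outputs = Dense(6, activation="softmax")(x)  # six disease/healthy classes
    model = Model(base.input, outputs)
    model.compile(optimizer="adam", loss="categorical_crossentropy",
                  metrics=["accuracy"])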
5.4 Dataset
The images used for tomato disease detection and classification were obtained from
plant communities [149], forestry images [150], and the internet. This study focuses on
Tomato fruit images from six different classes of healthy and diseased tomatoes, including
Anthracnose, Bacterial Canker, Bacterial Speck, Blossom End Rot, and Ghostspot. Figure 5.16 shows sample images of the six classes.
The dataset used for classification is too small on its own to train a deep learning model: it contains 480 images across the six classes, one healthy and five diseased. When data is augmented, the dataset size is extended many times over, which helps in training the deep network model. In this study, four augmentation approaches are utilised, with rotation of 45° as well as flipping and shifting, as illustrated in Table 5.1. The dataset has grown to 2400 images after
being augmented with the said combination. Figure 5.17 shows a sample of images following
augmentation.
Figure 5.17 Augmented images: (a), (b), (c) original tomato images
The images in the dataset must be resized to fit the deep learning model that will be
used. The required input image size is 224 x 224. The input size of the images must be
satisfied for the network to fit the model. The images used in this work were RGB with coefficient values between 0 and 255, which are difficult for the model to process directly.
Hence, a 1/255 scaling factor was applied to all the images in the dataset, normalising all the
values from 0 to 1.
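The augmentation, resizing, and rescaling pipeline described above can be sketched with Keras' ImageDataGenerator as follows; the directory layout and the shift ranges are assumptions.

    from tensorflow.keras.preprocessing.image import ImageDataGenerator

    datagen = ImageDataGenerator(
        rescale=1.0 / 255,        # normalise pixel values from [0, 255] to [0, 1]
        rotation_range=45,        # rotations up to 45 degrees
        horizontal_flip=True,
        vertical_flip=True,
        width_shift_range=0.1,    # assumed shift magnitudes
        height_shift_range=0.1)
    train_gen = datagen.flow_from_directory(
        "tomato_disease/train",   # hypothetical path with one sub-folder per class
        target_size=(224, 224),
        batch_size=16,
        class_mode="categorical")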
5.7. Methodology
Five SOTA CNN architectures, VGG16, ResNet50, DenseNet121, Xception and InceptionV3, were chosen to train the dataset in this study. The training dataset is utilised to train these five models separately using transfer learning. Originally, all
pre-trained models included a classification layer with 1000 nodes as the final layer, but
because only six classes are examined in this study, the last layer was replaced with the new
head. The new head contains three layers: a global average pooling layer, a dense layer, and a softmax classification layer with six nodes. In the present study, average ensemble learning with the same weight assigned to each individual model is proposed. The final softmax outputs from all models are averaged using equation 5.1:
Output(P) = (Σ pᵢ) / N    (5.1)

where N denotes the number of models and pᵢ denotes the probability output of model i.
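The averaging in equation 5.1 amounts to the following; the three probability vectors are hypothetical softmax outputs for a single test image, not measured values.

    import numpy as np

    def average_ensemble(prob_list):
        # Equation 5.1: mean of the base models' softmax outputs.
        return np.mean(np.stack(prob_list, axis=0), axis=0)

    # Class order: AN, BC, BS, BER, GS, HT (hypothetical outputs).
    p_densenet  = np.array([0.05, 0.02, 0.03, 0.80, 0.05, 0.05])
    p_inception = np.array([0.10, 0.05, 0.05, 0.70, 0.05, 0.05])
    p_xception  = np.array([0.04, 0.03, 0.03, 0.85, 0.03, 0.02])
    final = average_ensemble([p_densenet, p_inception, p_xception])
    print(final.argmax())   # 3 -> Blossom End Rot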
This study explores five distinct SOTA architectures for training and ensembles selected combinations of them, as shown in Figure 5.18.
Figure 5.18 Proposed Average Ensemble Model (EDL-TDD)
Pre-trained models including VGG16, ResNet50, DenseNet121, InceptionV3 and Xception are used to identify the various types of tomato fruit diseases.
Each model is fine-tuned using the images of disease and healthy tomato fruits from the
dataset (Mi, Ni), where M is the total number of images, each measuring 224 x 224, and N is
the labels for the images, N = {n | n ∈ {Anthracnose, Bacterial Canker, Bacterial Speck,
Blossom End Rot, Ghostspot, Healthy}}. In order to increase the accuracy of deep learning
models by lowering the empirical loss, the training set is divided into mini-batches of 16 images.
Base ensembles including Ensemble(VGG16+DenseNet121+InceptionV3), Ensemble(VGG16+DenseNet121+Xception) and Ensemble(DenseNet121+InceptionV3+Xception) have been experimented with. These ensembles did better than the individual models because the multi-model ensemble methods average the prediction values of all their member models and make the final prediction based on the average score. The foundation of the proposed model is the integration of DenseNet121, InceptionV3 and Xception; its overall architecture is shown in Figure 5.19.
Figure 5.19 Proposed System Architecture
The experiment is run on the tomato dataset using a variety of pretrained CNN models, and the ensemble's performance is compared against them. With a total of 1920 images for building the models and another 480 images for testing, the performance is assessed using a dataset made up of the six tomato classes, split into 60% training, 20% validation, and 20% testing. The six tomato classes and the number of training, validation, and testing samples are listed in Table 5.2.
The set of parameters that control the model's learning process is known as
hyperparameters. Variables such as optimizers, the number of layers and epochs, activation
functions, learning rate, and so on. The learning rate and momentum were fixed through experimentation after a number of attempts. Table 5.3 lists the hyperparameter values used.
Table 5.3 Hyperparameters

Hyperparameter   Value
No. of epochs    25
Batch size       16
To begin, the experiment is carried out by automating the disease detection of tomato fruits
using five pre-trained models such as VGG16, ResNet50, DenseNet121, InceptionV3, and
Xception. During the experiment, the performance of each model is assessed using the precision, recall, F1-score and accuracy metrics. With 16 layers, the pretrained model VGG16 attained an accuracy of 85%. ResNet50, at 50 layers deep, obtained an accuracy of 83%, while DenseNet121 achieved an accuracy of 93.96% with 51,387,398 trainable and 7,037,504 non-trainable parameters in 121 layers. With 52,435,974 trainable parameters, InceptionV3 reached an accuracy of 90.42%, and the 71-layer-deep Xception model obtained a 90% accuracy with 2,104,326 trainable parameters and 20,861,480 non-trainable parameters. Table 5.4 displays the precision, recall, F1-score, and accuracy of each model.

Table 5.4 Performance of the pre-trained models

S.No   Model         Precision (%)   Recall (%)   F1-score (%)   Accuracy (%)
1      VGG16         87              85           85             85.00
2      ResNet50      82              79           79             83.00
3      DenseNet121   94              94           93             93.96
4      InceptionV3   91              90           90             90.42
5      Xception      85              83           83             90.00
The performance of the five pre-trained models is depicted in Figure 5.20. The ResNet50 model performed the lowest, with 79% recall and F1-score and 83% classification accuracy. The VGG16, InceptionV3, and Xception models perform well on nearly all
assessment criteria, with evaluation measures ranging from 85% to 90%. The DenseNet121
model exceeds all other classification models, obtaining 94% accuracy, F1 Score, precision,
and recall.
[Figure 5.20: bar chart of the evaluation metrics (extent in %) for VGG16, ResNet50, DenseNet121, InceptionV3 and Xception.]
The overall classification accuracy of the five pre-trained models during training is shown in Figure 5.21. From the figure, it is observed that the model accuracy improved gradually as the number of training epochs increased and finally stabilised, which indicates that the models were well trained. The losses of the models during training are shown in Figure 5.22.
Figure 5.21 Training Accuracy of pre-trained models
To increase the accuracy of the results, all of these models are merged and trained
using average ensemble learning. As a result, ensemble deep learning tomato disease
detection (EDL-TDD) has been proposed and evaluated on datasets through combining
chosen pre-trained models. To evaluate the performance of EDL TDD, the pre-trained models
from table 5.4 were used to construct ensemble architecture. Initially, VGG16 and ResNet50
are merged with InceptionV3, and the performance of the proposed EDL TDD is assessed on
a dataset, yielding an accuracy of 91.24%. Later, VGG16 and DenseNet121 are merged with InceptionV3 and with Xception, yielding accuracies of 96.2% and 95.16%, respectively. Finally, when DenseNet121, InceptionV3 and Xception are combined, the proposed model achieves an accuracy of 98.54%. The results are shown in Table 5.5 and depicted as a bar graph in Figure 5.23.
Table 5.5 Accuracy of the ensemble models

Ensemble                             Accuracy (%)
VGG16+ResNet50+InceptionV3           91.24
VGG16+DenseNet121+InceptionV3        96.2
ResNet50+DenseNet121+Xception        92.9
VGG16+DenseNet121+Xception           95.16
DenseNet121+InceptionV3+Xception     98.54
[Figure 5.23: bar chart of the ensemble accuracies (%): 91.24, 92.9, 95.16, 96.2 and 98.54.]
Figure 5.24 shows the precision, recall and F1-score values obtained for each disease class with the proposed ensemble model.
[Figure 5.24: bar chart of precision, recall and F1-score (extent in %) per tomato disease class: Anthracnose, Bacterial Canker, Bacterial Speck, Blossom End Rot, Ghostspot, Healthy.]
Figure 5.25 depicts the confusion matrix of the proposed EDL-TDD model utilising DenseNet121, InceptionV3 and Xception, and it shows that the proposed model successfully identified the majority of the tomato fruit disease types in each sample image. The confusion matrix summarises the classifications and misclassifications made by the model: the diagonal elements show the correct classifications, and the off-diagonal elements show the misclassifications.
Figure 5.25 Confusion Matrix of Proposed Ensemble Model
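The per-class precision, recall, and F1-scores and the confusion matrix can be produced with scikit-learn as sketched below; the short y_true/y_pred vectors are placeholders standing in for the real test-set predictions.

    from sklearn.metrics import classification_report, confusion_matrix

    class_names = ["Anthracnose", "Bacterial Canker", "Bacterial Speck",
                   "Blossom End Rot", "Ghostspot", "Healthy"]
    y_true = [0, 1, 2, 3, 4, 5, 3, 3]   # placeholder ground-truth class indices
    y_pred = [0, 1, 2, 3, 4, 5, 3, 2]   # placeholder ensemble argmax predictions
    print(confusion_matrix(y_true, y_pred))
    print(classification_report(y_true, y_pred, target_names=class_names))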
Summary
The tomato is a significant agricultural crop that is produced in large quantities, but it is
highly susceptible to diseases, which lead to huge yield loss. Plant diseases have an impact on
the development of the corresponding crops, so it is crucial to detect them early. The yield
and quality of crops are significantly impacted by several diseases, fungi, and insects. Here
pre-trained models such as VGG16, ResNet50, InceptionV3, DenseNet121 and Xception are
used for the classification of tomato diseases. Despite the good performance of all the models, the proposed average ensembling deep learning model produced the best results using the DenseNet121, InceptionV3 and Xception pretrained models. Transfer learning and data augmentation further contributed to this classification performance.
References
1. Gavhale, M. and Gawande, U. (2014) ‘An Overview of the Research on Plant Leaves
2. Food and Agriculture Organization of the United Nations. “FAOSTAT” Crops and
Climate Change Fans Spread of Pests and Threatens Plants and Crops—New FAO
Study. https://www.fao.org/news/story/en/item/1402920/icode/ .
4. Gobalakrishnan, N.; Pradeep, K.; Raman, C.J.; Ali, L.J.; Gopinath, M.P. A Systematic
Review on Image Processing and Machine Learning Techniques for Detecting Plant
and Signal Processing (ICCSP), Chennai, India, 28–30 July 2020; pp. 0465–0468.
5. Background, "The global tomato processing industry," Tomato Neiss, June 2020.
https://www.healthline.com/nutrition/foods/tomatoes#vitamins-and-minerals.
https://www.imarcgroup.com/global-tomato-processing-market.
8. Hayashi, M., Ueda, Y., & Suzuki, H. (1988). Development of agricultural robot. In
9. Moltó, E., Pla, F., & Juste, F. (1992). Vision systems for the location of citrus fruit in a
10. Xin, H., & Shao, B. (2005). Real-time behavior-based assessment and control of
12. Brady, C.J. (1987). Fruit ripening. Annual Review of Plant Physiology, 38, 155-178.
13. Giovannoni, J. (2001). Molecular biology of fruit maturation and ripening. Annual
14. Lelièvre, J.M., Latché, A., Jones, B., Bouzayen, M., Pech, J.C. (1997). Ethylene and
15. K. Prasad, S. Jacob, M.W. Siddiqui, Fruit maturity, harvesting, and quality standards,
16. S. Dargan, M. Kumar, M.R. Ayyagari, G. Kumar, A survey of deep learning and its
applications: a new paradigm to machine learning, Arch. Comput. Meth. Eng. 27 (4)
(2020) 1071–1092 .
17. T. Wang, J. Huan, M. Zhu, Instance-based deep transfer learning, in: 2019 IEEE
367–375 .
18. I.N.C.E. Ahmet, M.Y. Çevik, K.K. Vursavu ş , Effects of maturity stages on textural
mechanical properties of tomato, Int. J. Agric. Biol. Eng. 9 (6) (2016) 200–206 .
19. Fahrentrapp, J., Ria, F., Geilhausen, M., & Panassiti, B. (2019). Detection of Gray
20. Dubey, S. R., & Jalal, A. S. (2014). Adapted Approach for Fruit Disease Identification
21. Mohanty, S. P., Hughes, D. P., & Salathe, M. (2016). Using Deep Learning for
doi:10.3389/fpls.2016.01419
22. Zhang, L & McCarthy, MJ 2012, 'Measurement and evaluation of tomato maturity
using magnetic resonance imaging', Postharvest Biology and Technology, vol. 67, pp.
37-43.
23. El-Bendary, Nashwa, et al. "Using machine learning techniques for evaluating tomato
24. Rafiq, Aasima, Hilal A. Makroo, and Manuj K. Hazarika. "Artificial Neural Network‐
25. Kaur, Kamalpreet, and O. P. Gupta. "A machine learning approach to determine
26. Wan, Peng et al. “A methodology for fresh tomato maturity detection using computer
Approach for Tomato Ripening Stage Identification Using Pixel-Based Color Image
Environment, and Management ( HNICEM ), Laoag, Philippines, 2019, pp. 1-6, doi:
10.1109/HNICEM48295.2019.9072892.
28. Wu, J.; Zhang, B.; Zhou, J.; Xiong, Y.; Gu, B.; Yang, X. Automatic Recognition of
Classification Strategy for Harvesting Robots. Sensors 2019, 19, 612.
https://doi.org/10.3390/s19030612
ripeness level using artificial neural networks (ANNs) and support vector machine
30. Raghavendra, A, Guru, D, Rao, MK & Sumithra, R 2020, 'Hierarchical approach for
252.
31. Hermana, A.N.; Rosmala, D.; Husada, M.G. Transfer Learning for Classification of
Fruit Ripeness Using VGG16. In Proceedings of the ICCMB 2021: 2021 The 4th
32. Rivero Mesa, A.; Chiang, J. Non-invasive Grading System for Banana Tiers using
RGB Imaging and Deep Learning. In Proceedings of the ICCAI 2021: 2021 7th
33. Huynh, D. P., Van Vo, M., Van Dang, N., & Truong, T. Q. (2021, March). Classifying
Conference Series: Materials Science and Engineering (Vol. 1109, No. 1, p. 012058).
IOP Publishing.
34. Sakunrasrisuay, Chinapat & Musikawan, Pakarat & Nguyen, Anh-Nhat & Kongsorot,
Yanika & Aimtongkham, Phet & So-In, Chakchai. (2021). Tomato Maturity
10.1109/ICSEC53205.2021.9684584.
35. R. Mishra, S. Goyal, T. Choudhury and T. Sarkar, "Banana ripeness classification
36. Begum, Ninja, and Manuj Kumar Hazarika. "Maturity detection of tomatoes using
37. Arefi, A., Motlagh, A. M., Mollazade, K., & Teimourlou, R. F. (2011). Recognition
and localization of ripen tomato based on machine vision. Australian Journal of Crop
38. K. Yamamoto, W. Guo, Y. Yoshioka, and S. Ninomiya, "On Plant Detection of Intact
Tomato Fruits Using Image Analysis and Machine Learning Methods," Sensors, vol.
39. J. T. Xiong, X. J. Zou, H. X. Peng, W. Chen, and G. Lin, “Realtime identification and
40. Zhao, Y.; Gong, L.; Huang, Y.; Liu, C. Robust Tomato Recognition for Robotic
Harvesting Using Feature Images Fusion. Sensors 2016, 16, 173. https://doi.
41. I. Sa, I.; Ge, Z.; Dayoub, F.; Upcroft, B.; Perez, T.; McCool, C. DeepFruits: A Fruit
Detection System Using Deep Neural Networks. Sensors 2016, 16, 1222.
https://doi.org/10.3390/s16081222
42. S. Bargoti and J. Underwood, "Deep fruit detection in orchards," 2017 IEEE
43. Xiong Juntao, Liu Zhen, Tang Linyue, Lin Rui, Bu Yibin, Peng Hongxing, Study on
44. Fu Longsheng, Feng Yali, Elkamil Tola, Liu Zhihao, Li Rui, Cui Yongjie. Image
45. Koirala, A., Walsh, K. B., Wang, Z., & McCarthy, C. (2019). Deep learning for real-
09642-0
46. Fu, L.; Tola, E.; Al-Mallahi, A.; Li, R.; Cui, Y. A Novel Image Processing Algorithm
47. Huang, Yi Hsuan, and Ta Te Lin. "High-throughput image analysis framework for
fruit detection, localization and measurement from video streams." 2019 ASABE
Engineers, 2019.
48. C. Hu, X. Liu, Z. Pan and P. Li, "Automatic Detection of Single Ripe Tomato on
Plant Combining Faster R-CNN and Intuitionistic Fuzzy Set," in IEEE Access, vol. 7,
49. Liu, G., Nouaze, J. C., Touko Mbouembe, P. L., & Kim, J. H. (2020). YOLO-tomato:
A robust algorithm for tomato detection based on YOLOv3. Sensors (Basel), 20(7),
2145. doi:10.339020072145
50. Widyawati, W, and Febriani, R (2021). Real-time detection of fruit ripeness using the
210. https://doi.org/10.36055/tjst.v17i2.12254
51. Jia, Weikuan, et al. "A fast and efficient green apple object detection model based on
52. Zhang, C.; Kang, F.; Wang, Y. An Improved Apple Object Detection Method Based on
53. Xiao, F.; Wang, H.; Xu, Y.; Zhang, R. Fruit Detection and Recognition Based on Deep
Learning for Automatic Harvesting: An Overview and Review. Agronomy 2023, 13,
1625. https://doi.org/10.3390/agronomy13061625
54. James, Ginne M., and S. C. Punitha. "Tomato Disease Classification using Ensemble
55. Sabrol, H & Satish, K 2016, ‘Tomato plant disease classification in digital images
56. Biswas Sandika, SaunshiAvil, Sarangi Sanat and PappulaSrinivasu, “Random forest
57. J. Shijie, J. Peiyi, H. Siping and s. Haibo, "Automatic detection of tomato diseases
and pests based on leaf images," 2017 Chinese Automation Congress (CAC), Jinan,
58. Sharif, Muhammad & Khan, Muhammad & Iqbal, Zahid & Azam, Faisal & Lali,
10.1016/j.compag.2018.04.023.
59. Guo, X. Q., T. J. Fan, and Xin Shu. "Tomato leaf diseases recognition based on
improved multi-scale AlexNet." Trans. Chin. Soc. Agricult. Eng 35.13 (2019): 162-
169.
Image Dataset for Deep Transfer Learning-based Defect Detection," 2019 IEEE
61. Qimei Wang, Feng Qi, Minghe Sun, Jianhua Qu, Jie Xue, "Identification of Tomato
Disease Types and Detection of Infected Areas Based on Deep Convolutional Neural
https://doi.org/10.1155/2019/9142753
62. M. Gehlot and M. L. Saini, "Analysis of Different CNN Architectures for Tomato
Advances and Innovations in Engineering (ICRAIE), Jaipur, India, 2020, pp. 1-6, doi:
10.1109/ICRAIE51050.2020.9358279.
63. Ashraf, R.; Habib, M.A.; Akram, M.U.; Latif, M.A.; Malik, M.S.; Awais, M.; Dar,
S.H.; Mahmood, T.; Yasir, M.; Abbas, Z. Deep Convolution Neural Network for Big
64. Zeng, Q.; Ma, X.; Cheng, B.; Zhou, E.; Pang, W. GANs-Based Data Augmentation
for Citrus Disease Severity Detection Using Deep Learning. IEEE Access 2020, 8,
65. Shihan Mao, Yuhua Li, You Ma, Baohua Zhang, Jun Zhou, and Kai Wang. 2020.
environment using deep learning and multi-feature fusion. Comput. Electron. Agric.
66. Xiao JR, Chung PC, Wu HY, Phan QH, Yeh JA, Hou MT. Detection of Strawberry
67. Gao, R.; Li, Q.; Sun, X. Intelligent Diagnosis of Greenhouse Cucumber Diseases
68. H. Hong, J. Lin and F. Huang, "Tomato Disease Detection and Classification by Deep
Internet of Things Engineering (ICBAIE), Fuzhou, China, 2020, pp. 25-29, doi:
10.1109/ICBAIE49996.2020.00012.
69. Gayatri Pattnaik, Vimal K. Shrivastava & K. Parvathi (2020) Transfer Learning-Based
70. Luaibi, A.R., Salman, T.M., & Miry, A.H. (2021). Detection of citrus leaf diseases
71. Zhang, N.;Wu, H.; Zhu, H.;Deng, Y.; Han, X. Tomato Disease Classification and
72. Fruit Maturity—An Overview|ScienceDirect Topics. Available
online: https://www.sciencedirect.com/topics/agricultural-and-biological-
Machine Learning Techniques and Different Color Spaces,” IEEE Access, vol. 7, pp.
74. M. P. Arakeri, “Computer Vision Based Fruit Grading System for Quality Evaluation
of Tomato in Agriculture industry,” ProcediaComput. Sci., vol. 79, pp. 426-433, 2016,
75. A. Wajid, N. K. Singh, P. Junjun, and M. A. Mughal, “Recognition of ripe, unripe and
(Amsterdam)., vol. 251, no. January, pp. 247- 251, 2019, doi:
10.1016/j.scienta.2019.03.033.
77. Vibhute, Anup, dan Bodhe, S.K. (2013) : Outdoor Illumination Estimation of Color
78. Vijayalakshmi, M.; Peter, V.J. CNN based approach for identifying banana species
from fruits. Int. J. Inf. Technol. 2021, 13, 27–32. [Google Scholar] [CrossRef]
79. Ashtiani, S.-H.M.; Javanmardi, S.; Jahanbanifard, M.; Martynenko, A.; Verbeek, F.J.
80. Brownlee, Jason. "A gentle introduction to transfer learning for deep
81. Pan, Sinno & Yang, Qiang. (2010). A Survey on Transfer Learning. Knowledge and
82. Mohanty, Sharada P., David P. Hughes, and Marcel Salathé. "Using deep learning for
83. Reyes, Angie K., Juan C. Caicedo, and Jorge E. Camargo. "Fine-tuning Deep
Convolutional Networks for Plant Recognition." CLEF (Working Notes) 1391 (2015):
467-475.
84. Simonyan, Karen, and Andrew Zisserman. "Very deep convolutional networks for
85. Shorten, C., Khoshgoftaar, T.M. A survey on Image Data Augmentation for Deep
87. Johnson, J.M., Khoshgoftaar, T.M. Survey on deep learning with class imbalance. J
88. Kanika, J. Singla and Nikita, "Comparing ROC Curve based Thresholding Methods in
(ICCCIS), Greater Noida, India, 2021, pp. 9-12, doi:
10.1109/ICCCIS51004.2021.9397167.
https://towardsdatascience.com/object-detection-simplified-e07aa3830954, [Assessed:
02/10/20]
91. Jonathan Long, Evan Shelhamer, and Trevor Dar-rell. “Fully convolutional networks
92. Jennifer Mack, Christian Lenz, Johannes Teutrine, and Volker Steinhage. “High-
precision 3D detection and reconstruction of grapes from laser range data forefficient
93. Zhenglin Wang, Brijesh Verma, Kerry B Walsh, PhulSubedi, and Anand Koirala.
94. A Gongal, S Amatya, Manoj Karkee, Q Zhang, and K Lewis. “Sensors and systems
for fruit detection and localization: A review”. In: Computers and Electronics in
95. Edan, and Ohad Ben-Shahar. “Computer vision for fruit harvesting robots–state of the
art and challenges ahead”. In: International Journal of Computational Vision and
96. Pedro F Felzenszwalb, Ross B Girshick, David McAllester, and Deva Ramanan.
transactions on pattern analysis and machine intelligence 32.9 (2010), pp. 1627–1645
(cit. on p. 23).
network model for a mechanism of visual pattern recognition”. In: Competition and
98. C Wouter Bac, Eldert J Henten, Jochen Hemming, and Yael Edan. “Harvesting
Robots for High-value Crops: State-of-the-art Review and Challenges Ahead”. In:
99. Ahmad Ostovar, Ola Ringdahl, and Thomas Hell-strom. “Adaptive Image
Thresholding of Yellow Peppers for a Harvesting Robot”. In: Robotics 7.1 (2018), p.
11.
thermal images for fruit detection”. In: Biosystems Engineering 103.1 (2009), pp. 12–
22.
101. Efi Vitzrabin and Yael Edan. “Adaptive thresholding with fusion using a
RGBD sensor for red sweet-pepper detection”. In: Biosystems Engineering 146
102. Yongting Tao and Jun Zhou. “Automatic apple recognition based on the fusion of color and 3D feature for robotic fruit picking”. In: Computers and Electronics in
103. Murali Regunathan and Won Suk Lee. “Citrus fruit identification and size determination using machine vision and ultrasonic sensors”. In: 2005 ASAE Annual
104. Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. “Deep learning”. In: Nature
105. Sharada P Mohanty, David P Hughes, and Marcel Salathé. “Using deep learning for
106. Inkyu Sa, Zongyuan Ge, Feras Dayoub, Ben Upcroft, Tristan Perez, and Chris McCool. “Deepfruits: A fruit detection system using deep neural networks”. In:
Neural Networks”, ECCV 2014, Part I, LNCS 8689, pp. 818-833, 2014.
109. Sebe, N., Cohen, I., Garg, A., & Huang, T. S. (2005). Machine Learning in Computer
University Press.
111. Nixon, M., & Aguado, A. (2019). Feature Extraction and Image Processing for
112. Gould, S. (2012). DARWIN: A framework for machine learning and computer vision research and development. The Journal of Machine Learning Research, 13(1), 3533-3537.
113. Papageorgiou, C., & Poggio, T. (2000). A trainable system for object detection.
114. Rezatofighi, H., Tsoi, N., Gwak, J., Sadeghian, A., Reid, I., & Savarese, S. (2019).
Generalized intersection over union: A metric and a loss for bounding box regression.
115. Padilla, R., Netto, S. L., & Da Silva, E. A. (2020, July). A survey on performance
116. Girshick, Ross, et al. "Rich feature hierarchies for accurate object detection and
118. Ren, Shaoqing, et al. "Faster r-cnn: Towards real-time object detection with region
119. Liu, W., Anguelov, D., Erhan, D., Szegedy, C., Reed, S., Fu, C. Y., & Berg, A. C.
120. Wang, Y., Xing, Z., Ma, L., Qu, A., & Xue, J. (2022). Object detection algorithm for lingwu long jujubes based on the improved SSD. Agriculture, 12(9), 1456.
121. Redmon, J., Divvala, S., Girshick, R., & Farhadi, A. (2016). You Only Look Once:
122. Redmon, J., & Farhadi, A. (2017). YOLO9000: Better, faster, stronger. In IEEE
123. Redmon, J., & Farhadi, A. (2018). YOLOv3: An incremental improvement. arXiv preprint arXiv:1804.02767.
arXiv:2004.10934 (2020).
126. Lin, Tsung-Yi, et al. "Feature pyramid networks for object detection." Proceedings of
127. Liu, Shu, et al. "Path aggregation network for instance segmentation." Proceedings of
128. Rezatofighi, Hamid, et al. "Generalized intersection over union: A metric and a loss
129. R. Padilla, S. L. Netto and E. A. B. da Silva, "A Survey on Performance Metrics for
and Image Processing (IWSSIP), Niteroi, Brazil, 2020, pp. 237-242, doi:
10.1109/IWSSIP48289.2020.9145130.
https://github.com/laboroai/LaboroTomato.git .
2021).
132. Shah, D., Paul, P., De Wolf, E., and Madden, L. (2019). Predicting plant disease
major insect pests in tomato, lycopersicon esculentum mill. under middle gujarat
134. V. Pooja, R. Das, and V. Kanchana, "Identification of plant leaf diseases using image
and Rural Development (TIAR), Chennai, India, 2017, pp. 130-133, doi: 10.1109/TIAR.2017.8273700
135. Rangarajan, A.K., Purushothaman, R. and Ramesh, A., 2018. Tomato crop disease detection and plant protection: a technical perspective. J Plant Dis Prot 125(1):5–20.
137. Agarwal, Mohit, et al. "ToLeD: Tomato leaf disease detection using convolution
leaf disease detection in crops using images for agricultural applications,'' Agronomy,
139. LeCun, Y., Bengio, Y., and Hinton, G. (2015). Deep learning. Nature, 521(7553):436–444
140. Ding, J., Chen, B., Liu, H., and Huang, M. (2016). Convolutional neural network with data augmentation for SAR target recognition. IEEE Geoscience and Remote Sensing Letters, 13(3):364–368.
143. He, K., Zhang, X., Ren, S., and Sun, J. (2016). Deep residual learning for image
Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 2016, pp. 2818-2826, doi: 10.1109/CVPR.2016.308.
145. G. Huang, Z. Liu, L. Van Der Maaten and K. Q. Weinberger, "Densely Connected
10.1109/CVPR.2017.243.
(CVPR): 1800-1807
148. K. A. Nguyen, W. Chen, B.-S. Lin, and U. Seeboonruang, "Comparison of ensemble
machine learning methods for soil erosion pin measurements", ISPRS Int. J. Geo-Inf.,
149. https://www.plantvillage.psu.edu/topics/tomato/infos
150. https://www.forestryimages.org