
Abstract

The agricultural industry is demanding technological solutions focused on automating agricultural tasks in order to increase production and profit while reducing time and costs. Technological advances in precision agriculture play an essential role and enable the implementation of new techniques based on computer vision and image processing systems. However, there are still many challenges and problems to be solved.

In particular, agriculture receives significant attention in India due to rapid population growth and increasing food scarcity. To address this scarcity, accurate, computer-aided assessment of ripeness and detection of disease will help improve food quality in agriculture. Tomatoes are highly nutritious fruits, rich in fibre and packed with vitamin C, vitamin K1, vitamin B9, and minerals. The global tomato processing market reached 43.4 million tonnes in 2021. It is important to determine the maturity level of crops before harvesting to optimize yield. However, manual inspection of ripe tomatoes requires a huge amount of labour and is time-consuming, and the labour required for fruit harvesting has increased over the years with growing demand.

Recently, some studies have evaluated the feasibility of smart agriculture involving machine learning for harvest ripeness detection. However, these works typically used small datasets, simple images without background clutter or leaves, or explored only a limited set of machine learning models. Hence, this thesis aims at tomato ripeness detection using the deep learning model YOLOv5. Further, this research presents a novel approach that leverages deep learning for disease detection in tomatoes using an ensemble of pre-trained models such as VGG16, ResNet50, DenseNet121, InceptionV3, and Xception.

The proposed work uses three different datasets for fruit, ripeness, and disease detection, respectively. The first dataset consists of 2034 colour images related to fruit detection with a resolution of 416 × 416. The second dataset contains 400 images of different resolutions, which are resized to 224 × 224 during processing. The third dataset contains 480 colour images of varied sizes and resolutions. The created dataset is benchmarked at the following link: https://doi.org/10.5281/zenodo.7955736. Images in the dataset are collected under various environmental conditions such as occlusion, lighting, shade, overlap, and others.

Initially, an efficient framework is proposed that combines computer vision techniques with deep learning architectures for detection of tomato fruits from images or videos. Later, a second method using VGG16 is proposed to assess the ripeness level of tomato fruits. Finally, an ensemble model is presented for disease detection.

The results were evaluated and compared using metrics such as precision, recall, F1-score, accuracy, and average precision. These results demonstrate the efficacy of the approach in accurately detecting tomato fruits, classifying fruit ripeness levels, and detecting disease symptoms with high precision. The system's ability to detect tomatoes from images or videos and distinguish between ripe and unripe fruit, while simultaneously flagging disease-afflicted samples, showcases its potential for real-world applications in agriculture.

The overall performance of the proposed approach is better than that of existing methods: it achieves 88.1% mAP for fruit detection, 97.71% accuracy for ripeness detection, and 98.54% accuracy for disease detection.

Furthermore, the research explores the integration of the deep learning model into

agricultural processes, such as automated harvesting systems and quality control pipelines.

By providing rapid and non-invasive assessments of fruit quality, the approach offers the

potential to optimize resource allocation, reduce post-harvest losses, and ultimately contribute

to sustainable agriculture practices.

CHAPTER 1

INTRODUCTION

Agriculture contributes significantly to the Indian economy, accounting for around 17% of total GDP and employing more than 58% of the workforce. Several environmental variables,

such as rain, temperature, soil, and pathogens, have an impact on agricultural crop quality and

yield in any country. Horticulture is a branch of agriculture that deals with the cultivation of

garden crops such as fruits, vegetables, flowers, and ornamental plants. Fruits are seed-

bearing structures developed from the ovaries of flowering plants that are an excellent source

of vitamins and minerals, and they help to avoid Vitamin C and Vitamin A deficiencies. The

diversity of India's climate ensures the availability of a wide range of fresh fruits.

During the process of plant harvesting, human experts go through a tedious process of

checking and removing mature plants, making sure they aren’t affected by any disease and

are suitable for human consumption. However, the traditional visual method of identifying

the name of the disease that a certain plant is suffering from takes a long time and is

expensive, especially if the farm is large and there are a lot of plants [1]. A fruit's maturity is one of the most critical factors in determining its quality. Furthermore, with the

apparent increase in population in the world on a daily basis, it is only sensible that this

procedure be automated in order to meet the growing demands of the people.

Precision agriculture entails deploying a wireless sensor network to monitor soil

conditions over a complete field of plants, flying drones that dispense water and nutrients to

agricultural fields, and autonomous robots that can manoeuvre around a field and collect ripe

fruits and vegetables. The applications of data analytics, intelligent sensing, robots, and other

current agricultural technologies are limitless. Precision agriculture provides numerous

benefits to farmers, including: 1) autonomously handling irrigation, fertilisation, and treatments, which saves money and effort and is done more efficiently; 2) identifying regions affected by weeds or diseases and isolating them quickly, saving the rest of the field from harm; 3) localizing regions with healthy and productive soils to enable efficient crop distribution; 4) calculating packing and production costs based on yield estimation for proper logistics planning; and 5) monitoring crop ripeness to determine exact harvesting schedules.

Having numerous forms of information on the land, soil, and crops immediately supports the

farmer in making educated decisions and positively influences profitability.

1.1 Artificial Intelligence in Agriculture

Over the years, technology has redefined farming, and as illustrated in Figure 1.1, technical

advancements have had a variety of effects on the agriculture sector. By 2050, the UN

estimates that an additional two billion people will need to be fed, which will require a

60% increase in food production. The massive demand, however, cannot be supplied via

conventional techniques. This is forcing farmers and agro firms to develop fresh strategies for

raising output and cutting waste. As a result, Artificial Intelligence (AI) is progressively

becoming a part of the technical development of the agriculture sector. The challenge is to

improve global food production by 50% by 2050 in order to feed an additional two billion

people. AI-powered solutions will not only help farmers improve efficiencies, but they will

also increase crop quantity and quality while ensuring a speedier time to market.

Figure 1.1: Artificial Intelligence in Agriculture

1.2 Tomato Production in India

Tomatoes are one of the most widely produced vegetables in the world and a significant

source of income for farmers. According to the Food and Agriculture Organisation Corporate

Statistical Database (FAOSTAT) 2020 statistical report, global tomato production was

186.821 million tonnes [2]. They are herbaceous, spreading plants with woody stems that

grow to be 1-3 metres tall. Tomatoes are indigenous to Peru and Mexico and are the world's

second most important crop after potatoes. There are over 1000 tomato types cultivated in

India throughout the year. In India, however, there is a high tomato season at the beginning

and end of the year.

One of the most widely grown crops because of its excellent nutritional content, the tomato normally ranges from 50 to 70 mm in diameter and weighs from 70 to 150 grammes. Furthermore, it has a long history of use in Indian cuisine, making it one of the

most versatile fruits. Salads, ketchup, purees, sauces, and other processed meals are just a few

of the dishes that include tomatoes. Lycopene, a potent antioxidant that fights cancer, is widely present in tomatoes. Carotene, the fruit's second antioxidant, which also gives it its distinctive red colour and fights cancer, is also present in the fruit.

Tomatoes are typically grown in dry states during the winter or just before the summer. In

April and May, there is typically no tomato production, which drives up the price. All dry

regions, including Gujarat, Tamil Nadu, Andhra Pradesh, Karnataka, Maharashtra, Madhya

Pradesh, and Uttar Pradesh, have high tomato production rates. Some states, like Kerala and

Himachal Pradesh, frequently experience freezing conditions, which have a negative impact

on agricultural output. Since the year 2000, Andhra Pradesh and Madhya Pradesh have

consistently produced the most tomatoes in the nation. Tamil Nadu produces 7% of the

tomatoes produced in India as a whole. Tomatoes are grown in a number of Tamil Nadu

districts, including Dindigul, Salem, Tirupur, Krishnagiri, and Dharmapuri. The best time

to plant tomatoes in Tamil Nadu is around Aadi Pattam, though they can be grown

throughout the year. Additionally, Udumalaipet, Gudimangalam, and Palladam all have

extensive tomato cultivation.

According to the Food and Agriculture Organisation (FAO), crop diseases cause losses of 20

to 40% of total production [3]. Various tomato plant diseases can have an impact on the

amount and quality of the product, reducing productivity. Two categories can be used to

classify diseases [4]. The first class of diseases is linked to contagious microorganisms like

bacteria, fungi, and viruses. When conditions are favourable, these diseases can quickly

spread from plant to plant in the field. The second category of diseases is brought on by non-

infectious chemical or physical factors, such as harmful environmental conditions, dietary or

physiological issues, and herbicide damage.

1.3 Tomato Life Cycle

Tomatoes are a widely produced vegetable crop all over the world. The scientific name for the tomato is Solanum lycopersicum. It is a relatively short-duration, high-yield crop that is economically appealing, with the area under cultivation increasing day by day.

After China, India is the world's second-largest producer of tomatoes. Other major

competitors in the tomato market are the United States of America, the European Union, and

Turkey. These top five tomato producers account for almost 70% of global production.

Tomato processing is one of the most diverse agricultural sectors on a global scale.

This is due to the fact that tomatoes are nutritious and good for human health. Because of this,

tomatoes are consumed all over the world. Tomato production is substantially higher than that

of other crops farmed around the world. It is six times that of rice and three times that of

potatoes [5].

The common fruit tomato is very nutrient-dense, with high levels of fibre, vitamin C,

vitamin K1, vitamin B9, and minerals. Tomatoes typically come in a variety of maturities and

shades, including red, yellow, and green. Green denotes unripe fruit, whereas red and yellow

denote ripe fruit [6]. In 2021, the total market for tomatoes used for processing was 43.4

million tonnes. Before harvesting, it is crucial to assess the crops' maturity stage to maximise

output. It takes a long time and a lot of labour resources to manually inspect ripe tomatoes,

though. Because of rising demand, the labour force for tomato harvesting has grown over the

years. The global tomato processing industry is anticipated to reach 54.5 million tonnes by

2027 [7]. As a result, tomato cultivation is critical in rural and suburban areas of developing

countries since it can strengthen the local economy.

The tomato crop is easily affected by many diseases, which severely reduce the quality and yield of tomatoes and cause substantial financial losses. The tomato plant has a lifespan

of around 120 days. The flowering or fruiting stage occurs at approximately 45-50 days of the

life of the tomato plant. The life cycle of the tomato plant is shown in Figure 1.2.

Figure 1.2: Tomato Life Cycle

The tomato life cycle begins with the planting of seeds, followed by germination,

growth into a sprout, seedling, and finally development into a plant. After the plant reaches

maturity, the flowering stage will begin, then the fruiting stage. The seeds of the ripe fruits

are also employed in the following life cycle. The tomato plant has started the process of fruit

production when yellow blossoms appear on it. After the tomato plant's blooms have opened,

the amount of time needed for the fruit to ripen depends on the variety and various

environmental factors. When a tomato plant is established and in the flowering or fruiting

period, the disease tends to strike more frequently. Time management and disease prediction

will help farmers produce more effectively and prevent output loss.

1.4 Need for Fruit Recognition

In horticulture, the cost of picking accounts for the largest portion of the total cost.

Agriculture development necessitates a significant amount of electric power, fuel, irrigation,

and chemical fertiliser. The speed, cost, and safety of picking all have a direct impact on the

final yield and quality of fruit production. As a result, more harvesting robots are being

employed in the fruit sector to cut picking costs and increase fruit quality [8]. Detection,

picking, localisation, categorization, selection, and grading are some of the duties that

harvesting robots must perform. Among these tasks, object detection is the most important [9].

It is natural for human beings to recognise a variety of fruit objects in their environment; for robots, however, this is a difficult challenge. The initial objective for a robot is to learn about its surroundings, and in general the image contains the most crucial information. Images of the outside world can be obtained using a variety of means [10]. Several ways have

been used to obtain an image over the years. These approaches use digital cameras to capture

images of various types, such as universal cameras, depth cameras, and near-infrared

cameras. In recent years, the speed and resolution of a universal camera in our cell phones

have greatly improved [11].

Despite the fact that the majority of agricultural robots, in particular fruit harvesting systems,

use computer vision to identify fruit targets, accurate fruit recognition is still a research

problem. It is difficult to develop a visual system that functions as intelligently as a person

and can recognise fruit fast, especially when there are overlapping fruits or significant leaf

occlusions.

1.5 Need for Maturity detection of fruits

Fruit ripening is a highly coordinated, genetically programmed process that takes place in the

later stages of maturation. It involves a number of physiological, biochemical, and sensory

changes that result in the development of an edible, ripe fruit with desirable quality

parameters to draw agents that disperse seeds [12,13,14]. Taste (increase in sugars and decrease in organic acids), firmness (softening by cell-wall-degrading activities), colour (loss of green colour and development of yellow, orange, red, and other colour characteristics depending on species and cultivar), and flavour (production of volatile compounds giving the characteristic aroma) are the main changes connected with ripening.

Maturity determines the ripening and storage conditions of a specific vegetable [15].

So, the objective of this work is to identify the maturity status of tomatoes as immature and

mature utilising computer vision techniques, specifically deep learning. Computer vision is a

new approach in food and agriculture that can help solve practical difficulties such as

automatic sorting, grading, categorization, and recognition. Such procedures have surpassed manual ones by being non-destructive, time-saving, economical, rigorous, and accurate.

One such method is deep learning, which is a branch of machine learning. Deep learning

holds a prominent position in the field of research due to its capability to automatically

extract features from a variety of data [16]. The use of deep learning is rapidly gaining

popularity in the field of image processing for image recognition and categorization. The

convolutional neural network, which is based on the artificial neural network, is the

foundation of deep learning designs for image classification. Several architectures have

become more and more popular in recent years. Among them are AlexNet, LeNet, VGG,

Inception, ResNet, and so on. These are pre-trained architectures on massive data sets to

address particular issues. Each architecture represents a development or improvement over those that were previously used. The use of these pre-trained architectures for

tackling the intended goal is known as deep transfer learning [17]. To determine tomato

maturity status through images, deep transfer learning is applied in this study.

Tomato ripeness can be determined by its surface characteristics [18]. As a result, the

amount of tomato ripeness is assessed by analysing the surface properties of tomatoes with

deep transfer learning techniques. The contribution of this work is the use of transfer learning

algorithms to classify tomatoes based on their ripeness (maturity).
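To make the transfer-learning idea concrete, the following is a minimal illustrative sketch (in Keras) of how a pre-trained VGG16 backbone can be frozen and topped with a new classifier head; the number of ripeness classes, the head layers, and the data pipeline are placeholder assumptions rather than the exact configuration used in this thesis.

    # Minimal transfer-learning sketch: frozen VGG16 backbone + new head.
    # Class count, head layers, and datasets are illustrative assumptions.
    import tensorflow as tf
    from tensorflow.keras import layers, models
    from tensorflow.keras.applications import VGG16

    NUM_CLASSES = 3  # assumed ripeness labels, e.g. unripe / turning / ripe

    base = VGG16(weights="imagenet", include_top=False,
                 input_shape=(224, 224, 3))
    base.trainable = False  # freeze the pre-trained convolutional features

    model = models.Sequential([
        base,
        layers.GlobalAveragePooling2D(),
        layers.Dense(256, activation="relu"),
        layers.Dropout(0.5),
        layers.Dense(NUM_CLASSES, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    # model.fit(train_ds, validation_data=val_ds, epochs=20)  # datasets assumed

Fine-tuning would later unfreeze the top convolutional blocks and continue training with a small learning rate.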

When immature or unripe fruits are harvested, the quality of these fruits is poor. They are

usually incapable of ripening. Immature fruits are prone to internal deterioration and

decomposition. Similarly, if fruits are harvested late, the chances of acquiring rotten fruits are

very significant. As a result, incorrect harvest timing will result in significant postharvest

loss. It is critical to evaluate the maturity status of the fruit in order to avoid qualitative and

quantitative losses of preharvest and postharvest fruits. One of the most significant

difficulties confronting the tomato agriculture sector is the loss of tomato quality. Farmers

typically use their personal experience to determine the type of disease and maturity level of

tomatoes. The level of maturity of the tomato crop is determined through manual inspection.

This, in turn, leads to inconsistent results and a heavy reliance on manual labour. Smart harvesting has been

propagated in recent years as a solution to these challenges.

The current challenges of fruit detection and recognition based on DL for automatic

harvesting are the scarcity of high-quality fruit datasets, detection of small target fruits,

detection in occluded and dense scenarios, detection of multiple scales and multiple species

of fruits, and lightweight fruit detection models. The detection and recognition performance

is heavily influenced by the quality and scale of fruit datasets, appropriate improvement

methodologies, and underlying model architectures. For example, fruit image preprocessing can standardise data by cleaning and modifying it. Fruit data augmentation can successfully

expand data and improve data diversity, lowering reliance on specific characteristics and

improving model robustness. Fruit feature fusion is beneficial in reducing the problem of fruit

feature disappearance and improving the detection effect of small target fruits and multi-scale

fruits. The original fruit detection framework can also acquire more fruit information by incorporating other learning tasks when building a multi-task learning model.
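As a simple illustration of the augmentation step mentioned above, the following Keras sketch applies random flips, rotations, zooms, and contrast changes; the specific transforms and their ranges are assumptions chosen for demonstration, not thesis settings.

    # Illustrative image-augmentation pipeline (Keras preprocessing layers).
    # Transform choices and parameter ranges are assumptions.
    import tensorflow as tf
    from tensorflow.keras import layers

    augment = tf.keras.Sequential([
        layers.RandomFlip("horizontal"),   # mirror fruit clusters
        layers.RandomRotation(0.1),        # small viewpoint changes
        layers.RandomZoom(0.2),            # scale variation
        layers.RandomContrast(0.2),        # lighting variation
    ])

    # images: a float32 batch of shape (N, 224, 224, 3) in [0, 1]
    # augmented = augment(images, training=True)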

1.6 Need for early detection of diseases in fruits

The development of novel technologies for early detection, identification, and

mapping of fruit diseases would lower the cost, damage, and time required to monitor and

control the diseases. Early diagnosis of fruit diseases boosts productivity by allowing for the

early implementation of disease-prevention measures. This is more effective than applying

curative treatments since unhealthy plants may show disease signs when it is too late for such

treatments to be beneficial [19]. In addition to increasing productivity, the technologies will

enable the implementation of effective disease management practices. Furthermore, early

disease detection eliminates the need to use excessive amounts of pesticides and chemicals to

manage them, ensuring that the risks of contaminating ground water and accumulating toxic

residues in agricultural products due to excessive pesticide and chemical use can be avoided

[20,21].

The primary method for identifying and classifying agricultural plant diseases is

through farmers' observation with the naked eye, followed by chemical tests. Farmers in

developing countries may not be able to keep an eye on every plant, every day, due to the size

of the farming land. Non-native diseases go unnoticed by farmers. It could take a lot of time

and money to consult experts on this. Furthermore, the unnecessary use of pesticides may be

hazardous and noxious to natural resources such as water, soil, air, and the food chain; avoiding such overuse is expected to reduce pesticide contamination of food commodities.

As diseases are unavoidable, identifying them reliably with current technological knowledge is crucial. Otherwise, incorrectly diagnosing a plant disease will result in a significant loss of time, money, labour, yield, and product quality and value. Although manual disease recognition can be effective, ambient conditions may change and push the prediction in the wrong direction. By utilising technological advancement, we may use image processing to detect tomato fruit disease by analysing its visual symptoms.

1.7 Objective of the research

The core objective of this research is to develop an efficient framework for ripeness and

disease detection of tomato fruits. It also intends to provide a full pipeline for tomato

recognition, classification, and detection of diseases utilizing cutting-edge deep learning

techniques. Further, to achieve this objective, the following sub-objectives are pursued:

 Firstly, a comprehensive literature review of existing fruit detection, classification, and

disease detection methods in agriculture with a focus on deep learning-based

approaches is conducted.

 Secondly, the tomato fruit dataset is collected and pre-processed for training and

evaluating the proposed methods.

 Thirdly, a framework is implemented by proposing a VGG-based CNN model to classify tomatoes by ripeness using a transfer learning and fine-tuning strategy. Next, a tomato fruit detection model based on YOLOv5 and an attention mechanism is proposed for the detection and classification of tomato fruits from images and videos.

 Finally, the framework includes an ensemble model for disease detection based on pre-trained models (a minimal sketch of this ensemble idea is given after this list). The ultimate goal of this research is to provide a reliable and efficient

framework for tomato fruit ripeness and disease detection while overcoming the

challenge of significant fruit overlap in tomato fruit detection. This solution will

significantly improve the accuracy and efficiency of crop management in agriculture.
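As referenced in the final sub-objective, the sketch below illustrates the general soft-voting ensemble idea: class probabilities from several fine-tuned networks are averaged before taking the argmax. The model variables are placeholders for trained Keras-style classifiers; this is a sketch of the general technique, not the thesis's exact ensembling scheme.

    # Soft-voting ensemble sketch: average class probabilities across
    # several classifiers and take the argmax. Models are placeholders.
    import numpy as np

    def ensemble_predict(models, images):
        # Average per-class probabilities over models, return label indices.
        probs = np.mean([m.predict(images) for m in models], axis=0)
        return np.argmax(probs, axis=1)

    # labels = ensemble_predict(
    #     [vgg16_model, resnet50_model, densenet121_model], test_images)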

1.8 Scope of the Research

The research is focused on the development of an efficient framework that exploits advanced CNN models to improve on state-of-the-art maturity and disease detection of tomatoes.

The scope of the research comprises a proposal for an efficient framework for tomato

ripeness and disease detection that takes advantage of object detection and reuses learnt

models. Initially, the framework is implemented by proposing the CAM-YOLO model, which is based on YOLOv5 and an attention mechanism. Later on, it incorporates the development of a deep CNN model based on VGG16 as well as transfer learning for

maturity detection. Finally, the scope of the research includes an ensemble model for disease

detection based on pre-trained models.

1.9 Structure of This Thesis

Chapter 1 provides a brief introduction to the background and significance of the detection

of tomato fruit and diseases in agriculture. It outlines the challenges faced in this area, and

how the proposed research can contribute to addressing these challenges. Moreover, this

chapter provides a clear statement of the research questions, objectives, and hypotheses to

guide the development of the methodology.

Chapter 2 discusses the related work on fruit detection and classification, and disease

detection in agriculture, highlighting the strengths and limitations of existing approaches.

This discussion provides a foundation for the proposed research and helps to identify the gaps

in the existing literature that our research aims to address. Chapter 3 describes the deep

learning model for tomato ripeness classification based on the VGG16 model. Chapter 4

depicts the detection model based on the YOLOv5 model for tomato fruit detection and

classification from images and videos.

Chapter 5 illustrates the pre-trained deep learning models and ensemble models for accurate

tomato fruit disease detection. Finally, Chapter 6 concludes the work by summarizing the key

findings and contributions of the research and discussing the limitations and future directions

of our proposed approach. It also reflects on the broader impact of the research on the field of

agriculture and deep learning with recommendations for future research.

CHAPTER 2

LITERATURE REVIEW

2.1 INTRODUCTION

Accurate and rapid detection of tomato fruit maturity and disease is critical for increasing long-term production in agriculture. In the conventional approach, human experts in the field of agriculture have been relied upon to determine the maturity of fruit and to find anomalies in tomato plants caused by pests, diseases, climatic conditions, and nutritional deficiencies. Automatic tomato ripeness and disease identification was initially attempted through conventional image processing and machine learning approaches, which yield lower accuracy. In order to produce greater prediction accuracy, deep learning-based classification was introduced. The literature investigated provides an overall review of research work carried out in the field of fruit ripeness and disease identification using image processing, machine learning, and deep learning approaches. Furthermore, the chapter summarises the related works and the gap analysis that the proposed research work addresses.

2.2 Maturity-based fruit classification using deep learning models

Zhang & McCarthy (2012) [22] proposed magnetic resonance imaging (MRI) to determine

tomato maturity using structural and colour parameters. The statistical properties were

determined using a region of interest (ROI) corresponding with the tomato's pericarp. To

determine the ripeness, the partial least square discriminant analysis (PLS-DA) was applied

to a total of 48 image features detected by five MR scans of 144 tomatoes. MRI is a non-

destructive imaging technique that makes use of the magnetic properties of nuclei as well as

their interaction with radio frequency (RF) and the magnetic field. Cross-validation was performed by splitting the dataset into two subsets, one used for training and the other for testing. The RMSE of cross-validation was 0.302 in the red stage. Using

colour characteristics, the PLS-DA achieved a 90% accuracy. In addition, the sensitivity and

specificity of the PLS-DA module were 0.907 in the red stage.

El Bendary et al. (2015)[23] developed a fruit grading method that detects tomato maturity

using multi-level classification. Principal component analysis (PCA) and linear discriminant

analysis (LDA) were also utilised with the SVM classifier to improve performance. This

system is utilised to conduct five different categorization levels, including green and breaker,

turning, pink, mild red, and red. The data set used included 250 JPEG photos with

dimensions of 3664 × 2748 pixels. In terms of ripeness classification, the LDA classification

method was 84% accurate. One-against-all (OAA) multi-level SVM with linear kernel

function produced 84.80% accuracy, whereas one-against-one produced 90.80% accuracy.

Using artificial neural networks, Rafiq et al. (2015) [24] quantified the quality features of

agricultural commodities based on their colour and size. Three feed-forward neural network

models (NN) were created: one for converting RGB to L*, a*, and b* values (NN1), one for

determining the stage of tomato ripeness (NN2), and one for connecting tomato projection area/size

to weight (NN3). Results showed that NN1 could convert RGB to L*, a*, and b* values with a 99% accuracy rate. NN2 classified tomatoes into three stages of ripening with an accuracy of 96% using 30 hidden neurons, and 100% classification was achieved when a threshold value of 0.7 was used. The best results showed excellent performance at 30 hidden units, with an R2 of 0.9980 and a mean squared error (MSE) of 0.00021. Additionally, NN3 could link area with weight with 99% accuracy by utilising three hidden neurons.
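For reference, the RGB to L*a*b* mapping that NN1 learns also has a closed-form conversion; the short sketch below computes it with scikit-image (the sample pixel value is arbitrary).

    # Closed-form RGB -> CIE L*a*b* conversion with scikit-image,
    # the mapping that NN1 above approximates with a neural network.
    import numpy as np
    from skimage import color

    rgb = np.array([[[200, 40, 30]]], dtype=float) / 255.0  # a reddish pixel
    lab = color.rgb2lab(rgb)  # L* in [0, 100]; a*, b* roughly in [-128, 127]
    print(lab)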

Kaur et al. (2017) [25] proposed a tomato classification system for evaluating tomato

maturation stages using Machine Learning, which includes training of various algorithms

such as Decision Tree, Logistic Regression, Gradient Boosting, Random Forest, Support

Vector Machine, K-NN, and XG Boost. This system collects images, extracts features, and

trains classifiers on 80% of the entire data. The remaining 20% of total data is used for

testing. It is evident from the results that the classifier's performance is influenced by the

quantity and type of features that are taken from the data set. Accuracy Score, Learning

Curve, and Confusion Matrix are used to represent the results. It has been noted that Random

Forest, out of the seven classifiers, performs best with an accuracy of 92.49%, likely because of its strong data handling abilities. The Support Vector Machine demonstrated the lowest accuracy, as it does not scale well to a big data set.

The backpropagation neural network (BPNN) classification algorithm and the feature colour

value were used by Wan et al. (2018) [26] to propose a method for identifying the three

maturity levels (green, orange, and red) of fresh market tomatoes (Roma and Pear types). To

gather the tomato images in the lab, a maturity detecting device based on computer vision

technology was created. The tomato targets were obtained based on the image processing

technology after the tomato images were processed. The area for extracting the colour

features was then determined to be the largest inscribed circle of the tomato's surface. Five

concentric circles (sub-domains) were used to partition the colour feature extraction area. The

feature colour values were taken from the average hue values of each sub-region and used to

represent the samples' maturity level. Then, in order to determine the maturity of the tomato

samples, the five feature colour values were imported to the BPNN as input values.

According to an analysis of the data, this method has an average accuracy of 99.31% for

identifying the three tomato sample maturity levels, with a standard deviation of 1.2%.
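In the same spirit as the feature colour values described above, the sketch below extracts a mean-hue feature from a circular region of a tomato image with OpenCV; the file name and circle geometry are placeholders, and the cited method's five concentric sub-regions are reduced here to a single circle.

    # Sketch: mean hue inside an inscribed circle as a maturity feature.
    # Path and circle geometry are assumptions for illustration only.
    import cv2
    import numpy as np

    img = cv2.imread("tomato.jpg")                   # BGR image, path assumed
    hsv = cv2.cvtColor(img, cv2.COLOR_BGR2HSV)

    mask = np.zeros(img.shape[:2], dtype=np.uint8)
    h, w = mask.shape
    cv2.circle(mask, (w // 2, h // 2), min(h, w) // 3, 255, -1)

    mean_hue = cv2.mean(hsv[:, :, 0], mask=mask)[0]  # OpenCV hue range 0-179
    print(f"mean hue in region: {mean_hue:.1f}")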

Garcia et al. (2019) [27] recommended a machine learning method for automatically

determining tomato maturity using the Support Vector Machine (SVM) classifier and the

CIELab colour space. The dataset utilised for modelling and validation experiments in a 5-

fold cross-validation technique was made up of 900 images gathered from a farm and several

image search engines. Experimental findings indicated that the proposed method was successful in ripeness classification, achieving 83.39% accuracy across six classes representing tomato ripening stages.
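A minimal sketch of such a colour-feature SVM is shown below with scikit-learn; X and y stand in for per-image CIELab colour statistics and ripeness-stage labels, and the kernel settings are assumptions rather than the cited study's configuration.

    # SVM ripeness classifier over colour features, evaluated with 5-fold
    # cross-validation as in the cited study. X and y are placeholders.
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import SVC
    from sklearn.model_selection import cross_val_score

    # X: (n_samples, n_features) colour statistics; y: stage labels
    clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0))
    # scores = cross_val_score(clf, X, y, cv=5)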

Wu et al. (2019) [28] provided an improved technique for combining various characteristics,

feature analysis and selection, a weighted relevance vector machine (RVM) classifier, and a

bi-layer classification strategy to create a unique automated system for recognising ripening

tomatoes. The algorithm employs a two-layer approach to operation. Using the knowledge of

the colour difference, the first-layer classification technique seeks to locate tomato-containing

regions in the images. The second classification method is based on a classifier that has been

trained using information from many media. The processed images are separated into 9 × 9

pixel blocks in the suggested technique, which makes calculations simpler and increases the

effectiveness of recognition. These blocks, rather than individual pixels, are regarded as the

fundamental units in the classification task. Five textural properties (entropy, energy,

correlation, inertial moment, and local smoothing) and six color-related features (Red (R),

Green (G), Blue (B), Hue (H), Saturation (S), and Intensity (I) components, respectively)

were recovered from pixel blocks. The iterative RELIEF (I-RELIEF) method was used to

examine relevant characteristics and their weights. A weighted RVM classifier was used to

categorise the image blocks based on the relevant attributes that were chosen. The final

tomato recognition results were calculated by merging the results of the block classification

with the bi-layer classification technique. On 120 images, the algorithm achieved a detection

accuracy of 94.90%.

Azarmdel et al. (2020) [29] used ANN and SVM classifiers to assess a fruit grading system.

This technique was used to categorise ripeness into three levels: unripe, ripe, and overripe. In

order to create the dataset, 577 mulberries were used. The correlation-based feature selection

subset (CFS) and consistency subset (CONS) were used to extract the various characteristics,

such as texture, colour, and geometrical aspects. The classification of the fruits made use of

both ANN and SVM ideas. This approach used ANN and CFS with 20 neurons in the hidden

layer, and it was 99.13% accurate in grading. Additionally, by utilising the same ANN and

CFS, 100% sensitivity and specificity were reached. But the accuracy of the ANN with

CONS and the hidden layer of 15 neurons was only 98.26%. However, the sensitivity and

specificity offered by the ANN-CFS and ANN-CONS were the same. Additionally, this

system was tested using SVM-CFS and SVM-CONS with several kernel functions, including

the radial basis function (RBF), linear, and polynomial. The accuracy of 99.12% was

achieved by the SVM-CFS with RBF, and 98.25% by the SVM-CONS with RBF. With

SVM-CFS and RBF, the MSE during training was 0.009, while with SVM-CONS and a

polynomial kernel function, it was 0.017.

Raghavendra et al. (2020) [30] developed a mango grading system and determined ripeness

using multiple classifiers such as L*a*b colour space, RGB, HSV, Gabor colour feature, and

other texture characteristics. The texture features utilised were local binary pattern (LBP) and gray-level co-occurrence matrix (GLCM). These were used to conduct classifications such as

under-ripe, perfectly ripe, over-ripe, and black-spot. The dataset comprised 230 mangoes

taken from agricultural plantations in Mysore, Karnataka. The 60:40 ratio was employed as a

training and testing split. For the purposes of the experiment, many classification schemes were employed. Among the several classifiers, the SVM classifier had the greatest accuracy of 99.09%. Other traditional classifiers, such as Naive Bayes (NB), K-NN, linear discriminant analysis (LDA), probabilistic neural network (PNN), and threshold classifiers, obtained 95.81%, 96.20%, 97.95%, 98.44%, and 99.77%, respectively.

In a study on classifying the maturity of fruits, Hermana et al. (2021) [31] employed 9000

training images of the fruits apple, orange, mango, and tomatoes. The data were trained using

VGG16 models with a transfer learning approach utilising 200 epochs. To minimise over-

fitting, data augmentation techniques were utilised to generate additional data. The same

MLP is applied to the top layer of models with a variety of parameters, and data is translated

from RGB to L*a*b* in order to serve as a colour descriptor for the fruit. With a dropout of

0.5, four frameworks with various approaches were utilised, and the average accuracy rate for

all four was 92%, demonstrating the most impressive performance of all.

Deep learning techniques were used by Rivero et al. (2022) [32] to grade the banana layers

into distinct categories in a non-intrusive manner. Fruits were sorted using a tier-based

system, with grades given for maturity, quality and size. The quality was divided into three

categories: export, midrange, and rejections. Maturity factors included green, yellowish,

yellow, and over-ripe, whilst size characteristics included small, medium, and large for the fruit

grading system. The VGG16 architecture with the transfer learning approach was utilised for

grading systems on an own-created banana image dataset, and it obtained 98% accuracy for

training and 92% accuracy for validation.

Deep transfer learning methods were employed by Huynh, Danh Phuoc, et al. (2021) [33] to

explore the categorization of tomatoes. They used the transfer learning approach on three

already-trained CNN models (VGG16, VGG19, and ResNet101) to lessen the need for a

big dataset and the computing expense of the deep learning model. 1374 tomato-related

images were gathered from Fruits-360 and sorted into three classes: green, yellow, and red.

According to the experimental findings, the VGG19 model was able to assess the degree of

cherry tomato ripening with a high accuracy of 94.14%.

Tomato production has significantly expanded recently, and the market is highly competitive.

Through tomato maturity grading, the market price is established. Visual examination of the

colour, texture, size, shape, and flaws of the tomatoes is typically used to determine their

ripeness. The expense of external quality control and labour is significant, though, and human

sorting may not be of the highest quality. InceptionResNetV2, ResNet152V2, MobileNetV2,

and AlexNet were some of the cutting-edge CNNs with Transfer Learning classifiers that

Sakunrasrisuay et al. (2021) [34] used to evaluate the effectiveness of tomato classification

depending on their maturity levels. From a NY/T 940-2006 tomato image dataset, 233 colour

images with a resolution of 640 x 480 pixels were tested. Colour richness was calculated by

separating mature and rotten tomatoes. The image dataset is separated into two classes based

on tomato colour as the major characteristic: mature and rotten. There were 233 training and

40 testing images. The comparison results showed that ResNet152V2 produced the highest classification performance, with the best training accuracy of 100% and the best testing accuracy of 99.46%.

Rajat et al. (2022) [35] assessed the freshness of banana fruit in order to extend the cropping

period and avoid harvesting of either under-matured or over-matured bananas. The study used

transfer learning to apply five pre-trained models: VGG16, VGG19, InceptionV3, Xception, and DenseNet121. The dataset used comprises 300 images that have been increased to a

total of 2369 images by using various augmentation techniques. When compared to other

models, the VGG16 obtained the greatest accuracy of 98.73% in categorising bananas into

three categories: unripe, ripe, and overripe.

Begun et al. (2022) [36] developed a deep transfer learning algorithm for tomato ripeness

identification. Using this approach, tomatoes were automatically categorised into three

maturity classes: immature, partially mature, and mature. Several transfer learning models,

including VGG16, VGG19, InceptionV3, ResNet101, and ResNet152, have been used to

solve the specified objective of classification. During training, the current architecture's

layers are frozen, and an extra classifier is introduced to train the prepared dataset. The

models were tested repeatedly with varied epoch numbers and batch sizes. With an accuracy

of 97.37%, the VGG19 performed best at epoch 50 and batch size 32.

2.3 Fruit detection and classification based on Object Detection Techniques

Fruit picking by humans is a time-consuming, tedious, and expensive task. Fruit harvesting automation has therefore become very popular over the past ten years. The capacity of a tomato harvester robot to recognise and locate ripe tomatoes on a plant is one of the key obstacles in designing such a machine, since tomato fruits do not ripen at the same time. A novel segmentation method was created by Arefi et al. (2011) [37] utilising a machine vision system

to guide a robot arm in picking a ripe tomato. A vision system was utilised to collect images

from tomato plants that were then adapted to the lighting conditions of the greenhouse. Under

greenhouse lighting settings, 110 colour images of tomatoes were captured. The method

created runs in two steps: (1) eliminating the background in RGB colour space and then

extracting the ripe tomato using a mix of RGB, HSI, and YIQ colour spaces, and (2) localising the ripe tomato using image morphological characteristics. The suggested method was 96.36% accurate overall.
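A rough modern equivalent of such colour-space segmentation is sketched below: thresholding red hues in HSV and cleaning the mask with morphology. The threshold values and path are illustrative assumptions, not those of the cited work, which combined RGB, HSI, and YIQ.

    # Colour-threshold segmentation sketch for ripe (red) tomatoes.
    # Hue/saturation/value bounds are assumptions for demonstration.
    import cv2

    img = cv2.imread("greenhouse_tomato.jpg")       # path assumed
    hsv = cv2.cvtColor(img, cv2.COLOR_BGR2HSV)

    # red wraps around the hue axis, so two ranges are combined
    lower = cv2.inRange(hsv, (0, 80, 60), (10, 255, 255))
    upper = cv2.inRange(hsv, (170, 80, 60), (179, 255, 255))
    mask = cv2.bitwise_or(lower, upper)

    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (7, 7))
    mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN, kernel)   # drop specks
    mask = cv2.morphologyEx(mask, cv2.MORPH_CLOSE, kernel)  # fill holes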

Yamamoto et al. (2014) [38] presented an image processing approach for detecting tomato

fruits in various growth phases, including ripe, immature, and young fruits, for tomato

detection and segmentation. To begin, pixel-based segmentation is used to divide the pixels

into several groups such as fruits, leaves, stems, and backdrops. The misclassifications

created by the first stage are then removed using blob-based segmentation. Finally, they

employed K-means clustering to identify each fruit in a cluster. In summary, the results of

fruit detection showed that the developed approach obtained accurate detection performance,

despite the fact that recognition of immature fruits is extremely challenging owing to their

small size and resemblance to stems.

Xiong et al. (2014) [39] employed the K-means clustering technique to segment citrus fruit.

The fruit location was determined as a result of image segmentation. However, the segmentation performance of this approach is not optimal when the environment is complex. Traditional machine vision algorithms are frequently unstable, and the localisation of fruit targets in complex scenes is frequently erroneous.
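For illustration, pixel-level K-means colour clustering of the kind used in such segmentation can be written in a few lines of OpenCV; the number of clusters and the image path are assumptions.

    # K-means colour clustering sketch for fruit segmentation.
    # K and the input path are illustrative assumptions.
    import cv2
    import numpy as np

    img = cv2.imread("citrus.jpg")                   # path assumed
    pixels = img.reshape(-1, 3).astype(np.float32)

    K = 3  # e.g. fruit / foliage / background
    criteria = (cv2.TERM_CRITERIA_EPS + cv2.TERM_CRITERIA_MAX_ITER, 20, 1.0)
    _, labels, centers = cv2.kmeans(pixels, K, None, criteria, 5,
                                    cv2.KMEANS_PP_CENTERS)

    segmented = centers[labels.flatten()].astype(np.uint8).reshape(img.shape)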

An autonomous harvesting robot faces difficulties in automatically identifying ripe fruits in a

complex agricultural setting because of diverse background disturbances. The bottleneck to

robust fruit recognition is reducing influence from two main disturbances: illumination and

overlapping. Zhao et al. (2016) [40] presented a robust tomato detection system based on

various feature images and image fusion to recognise the tomato in the tree canopy using a

low-cost camera. First, using the L*a*b* colour space and the luminance, in-phase,

quadrature-phase (YIQ) colour space, two unique feature images, the a*-component image

and the I-component image, were recovered. Second, wavelet transformation was used to

merge the feature information of the two source images by fusing the two feature images at

the pixel level. Third, an adaptive threshold technique was employed to determine the

appropriate threshold for segmenting the target tomato from the background. The final

segmentation result was subjected to morphological processing to remove some noise. In the

detection tests, 93% of the target tomatoes were identified out of 200 total samples.

Sa et al. (2016) [41] used deep convolutional neural networks to develop the fruit detecting

system. The system is based on the Faster-RCNN model and employs two modalities: colour

(RGB) and near-infrared (NIR). For merging multi-modal (RGB and NIR) information, early

and late fusion approaches are investigated. This results in a unique multi-modal Faster R-CNN model that produces state-of-the-art outcomes compared to past work on the F1-score, which takes into consideration both precision and recall, increasing from 0.807 to 0.838 for sweet pepper recognition. In addition to better accuracy, this approach is

substantially faster to deploy for new fruits since it requires bounding box annotation rather

than pixel-level annotation. The model is retrained to recognise seven fruits, with the entire process of annotating and training the new model requiring four hours per fruit.
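For orientation, the sketch below runs a stock COCO-pretrained Faster R-CNN from torchvision on a single image; it is plain RGB inference, not the multi-modal RGB+NIR variant of the cited work, and the image path is a placeholder.

    # Single-image Faster R-CNN inference sketch (torchvision, COCO weights).
    # This is generic RGB inference, not the cited multi-modal model.
    import torch
    import torchvision
    from torchvision.transforms.functional import to_tensor
    from PIL import Image

    model = torchvision.models.detection.fasterrcnn_resnet50_fpn(pretrained=True)
    model.eval()

    img = to_tensor(Image.open("orchard.jpg").convert("RGB"))  # path assumed
    with torch.no_grad():
        out = model([img])[0]   # dict with 'boxes', 'labels', 'scores'
    keep = out["scores"] > 0.5  # confidence threshold (assumed)
    print(out["boxes"][keep])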

Based on the Faster R-CNN approach, Bargoti et al. (2017) [42] created a fruit detecting

model in orchards. To further understand the practical deployment of such a system, ablation

studies were done on three orchard fruit types: apples, mangoes, and almonds. A comparison

of detection performance against the number of training images revealed the quantity of

training data necessary to achieve convergence. Transfer learning analysis revealed that

moving weights between orchards did not result in substantial performance increases over a

network started straight from the highly generalised ImageNet features. Data augmentation

approaches such as flipping and scaling were found to increase performance with varied

numbers of training images, resulting in similar performance with less than half the amount

of training images. In their report, they achieved an F1-score of more than 90%. The majority

of the lost fruits came from the case when the fruits appeared in dense clusters.

Xiong et al. (2018) [43] employed the Faster R-CNN to recognise green citrus in the natural

environment. The mAP of the trained model on the test set was 85.49% under a configuration with a learning rate of 0.01, a batch size of 128, and a momentum of 0.9. Fu et al. (2018)

[44] introduced a deep learning recognition approach based on LeNet CNN for the detection

of multi-cluster kiwi fruit in the field. The identification rates for occluded fruits, overlapping

fruits, neighbouring fruits, and independent fruits were 78.97%, 83.11%, 91.01%, and

94.78%, respectively.

Koirala et al. (2019) [45] employed the Mango-YOLO model, which is built on YOLOv3 and

YOLOv2 (tiny), to recognise mango fruit in real time. On a day-time mango image dataset, it

achieved an F1-score of 0.89.

For the purpose of identifying and locating ripe kiwifruit, Fu et al. (2019) [46] used image

processing methods. The method was computationally expensive and less reliable since it

required carrying out several color-channel and color-space-based operations. In conclusion,

it is challenging to attain high precision using classical detection methods, and they are difficult to develop and generalise, particularly when there is a lack of features or a

small sample size. It is also difficult to use these approaches in real-world circumstances

because of the low resilience of detection.

Using computer vision and deep neural networks, Huang et al. (2019) [47] suggested a high-

throughput method for fruit recognition, localisation, and measurement from video streams. A

flexible technique that may be used for several sorts of fruits is suggested here, in contrast to

prior works that were designed for a particular type of fruit. This method makes use of a

vision system to scan through plants row by row in a greenhouse. Fruit recognition and

localisation on video frames are performed using a real-time object detection technique that

utilises the deep neural network-based detector YOLOv2 with an 84.98% success rate. The

video feed is used to track various fruits using an individual fruit tracking algorithm. The

online tracking method combines feature matching, optical flow, and projective

transformation, which are all enhanced by occlusion handling strategies including using

threshold indices and denoising. The offline tracking algorithm, on the other hand, employs a

voting approach to eliminate false alarms generated by the object detector. Finally,

phenotyping data such as fruit counts, ripening stage, fruit size, and 2D geographic

distribution maps were collected. The suggested framework has demonstrated its efficacy in

collecting sufficient phenotyping information that is useful for production management, as

well as its prospective application in robotic operations.

Image recognition techniques frequently misinterpret tomatoes in neighbouring spots as a

single tomato. Hu et al. (2019) [48] proposed a single ripe tomato identification approach that

combines intuitionistic fuzzy set (IFS) with Faster R-CNN image detection. In comparison to

previous approaches, the suggested method offers various advantages. To begin, ripe

tomatoes are marked in a large number of images in various configurations (e.g., separated,

adjacent, overlapping, and shaded) to train the Faster R-CNN detector. The trained Faster R-

CNN classifier is then used to identify suitable ripe tomato areas in images. The results

demonstrated that the trained Faster R-CNN classifier is capable of accurately and speedily

localising potential ripe tomato regions. The proposed tomato region's RGB colour system

was then converted to an HSV colour scheme. To acquire the candidate tomato body from

single tomatoes recognised by Faster R-CNN, different tomato samples were manually

segmented, and the Gaussian density function was established to eliminate the background.

The unnecessary subpixels are removed from the tomato binary map using morphological

processing, and related tomatoes are separated to eliminate the excess contour discovered

through edge detection. The IFS edge detection approach is then used to retrieve the edge,

and a contour detection method is then applied to connect the edge breakpoints and eliminate

unnecessary edge points. This method is finally recommended as a way to further identify

tomatoes. Despite the complexity of the greenhouse tomato images used—which contain

tomatoes that are close together, overlap, and are obscured—the approach nevertheless

managed to reach an AP of almost 80%. Using the suggested approach, the tomato width and

height had RMSE results of 2.996 pixels and 3.306 pixels, respectively. The horizontal and

vertical centre position shifts' respective mean relative error percent (MRE%) values are

0.261% and 1.179%.

Liu et al. (2020) [49] applied the YOLO-tomato detector to identify tomatoes. This technique is based on YOLOv3 and uses two approaches: feature extraction via a dense architecture, and replacement of the rectangular bounding box (R-Bbox) with a circular bounding box (C-Bbox). However, the approach did not use contextual information to detect occluded tomatoes.
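Since the present thesis builds on YOLOv5, a hedged inference sketch via PyTorch Hub is given below; the hub entry point follows the public ultralytics/yolov5 repository, and the image path is a placeholder.

    # YOLOv5 inference sketch through PyTorch Hub (ultralytics/yolov5).
    # The image path is an assumption for illustration.
    import torch

    model = torch.hub.load("ultralytics/yolov5", "yolov5s", pretrained=True)
    results = model("tomato_row.jpg")   # accepts paths, arrays, PIL images
    results.print()                     # per-class detection summary
    boxes = results.xyxy[0]             # (x1, y1, x2, y2, conf, class) rows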

In Indonesia, human labour is still used for manual fruit inspection and sorting procedures.

The manual procedure consumes a significant amount of time and energy and is prone to

errors. The automation process should be able to replace manual procedures with automated

ones with the use of computer vision in order to cut costs and improve accuracy and

efficiency. The approach employing the YOLOv4 algorithm was proposed by Widyawati et

al. (2021) [50]. The algorithm is used to determine the maturity of bananas automatically. The

algorithm training procedure uses 369 banana images separated into two groups, and the

testing phase uses acquired real-time videos. According to the results, the best average

accuracy rate is 87.6%, and the video processing performance is 5 FPS (frames per second)

while employing a single-GPU architecture.

A fast optimised FoveaBox detection model (Fast-FDM) is presented by Jai et al. (2022)

[51] in order to accomplish fast recognition and localisation of green apples and match the

real-time operating needs of the vision system of harvesting robots. Fast-FDM detects green

apples reliably and efficiently in harvesting conditions by utilising an optimised variant of

anchor-free Foveabox. To be more precise, the weighted bi-directional feature pyramid

network (BiFPN) is used as the feature extraction network to quickly and easily fuse multi-

scale features before feeding the fused features to the fovea head prediction network for

classification and bounding box prediction. The backbone network used is the

EfficientNetV2-S, which has quick training and a small size. Additionally, the direct

selection of positive and negative samples using the adaptive training sample selection

(ATSS) technique results in greater recall and more precise green apple recognition for green

fruits of various sizes. The proposed Fast-FDM achieves improved trade-offs between

accuracy and detection efficiency, according to experimental results, achieving a mean

average precision (mAP) of 62.3% for green apple detection with fewer parameters and floating-point operations (FLOPs).

Zhang et al. (2022) [52] combined the GhostNet feature extraction network with the

coordinate attention module in YOLOv4 and incorporated depthwise convolution to

reconstruct the neck and YOLO head structure. They used the improved YOLOV4 model for

apple fruit recognition. In order to improve the capacity to extract features for medium and

small targets, the Coordinate Attention module is introduced to the feature pyramid network

(FPN) framework. On the constructed apple data set, the mAP of Improved YOLOv4 was

enhanced to 95.72% in comparison to YOLOv4; however, the network size was substantial.

The most common techniques for precise, quick, and reliable fruit identification and

recognition are DL-based approaches for fruit detection and recognition. These techniques

represent a significant development trend as well, and the environment has a relatively small impact on them. With regard to fruit image recognition, Xiao et al. (2023) [53] focus on giving an overview and review of DL, particularly in the areas of detection and classification. The

following categories may be used to categorise the current fruit detection and identification

techniques based on DL: techniques based on YOLO, SSD, AlexNet, VGGNet, ResNet,

Faster R-CNN, FCN, SegNet, and Mask R-CNN. These approaches can also be divided into

two categories: single-stage fruit detection and recognition methods based on regression

(YOLO, SSD) and two-stage fruit detection and recognition methods based on candidate

areas (AlexNet, VGGNet, ResNet, Faster R-CNN, FCN, SegNet, and Mask R-CNN). Fruit

detection and recognition systems with two stages aim for quicker speeds and lower weights

while maintaining fruit detection accuracy. Single-stage fruit detection and recognition

algorithms enhance fruit detection accuracy while retaining detection speed and model size

benefits. Current development trends include improving fruit identification performance and striking a balance between precision and speed.

2.4 Plant Disease Detection using Deep Learning Techniques

The tomato disease classification system was created by James et al. (2016) [54].

Preprocessing, segmentation, feature extraction, and disease classification are the four

primary steps of the system. Each image is subjected to a contrast enhancement approach

before being subjected to the k-means clustering algorithm, which is used to segment the

disease area of the tomato image. The RGB, HSV histogram values, statistical colour

moments, and colour cooccurrence matrix are created as feature vectors for each image.

Anthracnose, Bacterial canker, Bacterial spot, Bacterial speck, Early blight, and Late blight

are among the tomato diseases anticipated. Data is gathered from the local market, and a

database of 600 images is constructed, including 100 images for each condition. Contrast enhancement is applied in preprocessing, k-means is used for tomato disease segmentation, colour, statistical colour, colour co-occurrence matrix, and shape features are extracted, and lastly ensemble learning is utilised to predict tomato diseases. To construct models and evaluate tomato diseases, three ensemble techniques, AdaBoost, LogitBoost, and TotalBoost, are utilised, and the multi-class classification ensemble approach is compared to several boosting and bagging ensemble methods. When compared to the other two approaches, the AdaBoost learning method has a

tomato disease classification accuracy of 92.33% with a minimal error of 0.0233.

Sabrol and Kumar (2016) [55] used a decision tree algorithm to do an experimental

investigation on the classification of healthy and diseased tomato leaves. The results revealed

a 78% classification accuracy in detecting infections caused by bacterial canker, fungal late

blight, leaf curl diseases, and bacterial leaf spot.

Biswas Sandika et al. (2016) [56] applied the Random Forest algorithm for the detection and

classification of several grape leaf diseases such as Anthracnose, Powdery Mildew, and

Downy Mildew, utilising samples collected from images captured in an uncontrolled

environment with a random backdrop. Based on performance, the Random Forest method

was compared to other machine learning algorithms. Random Forest achieves the highest

classification accuracy with Grey-Level Co-occurrence Matrix (GLCM) features in terms of background separation and disease classification. However, it takes more time to train and is not well suited to sparse features.

Shijie et al. (2017)[57] developed a transfer learning strategy based on VGG16 for the

detection and classification of tomato diseases and pests. They also tested with VGG16 as a

feature extractor and SVM as a classifier. The average accuracy of the training set is 100%,

while that of the test set is 88% when utilising the VGG16 +SVM technique. With an average

accuracy of 89%, the transfer learning strategy employing VGG16 outperformed the

VGG16+ SVM approach.

Sharif (2018) [58] suggested a hybrid approach for identifying and classifying

diseases in citrus trees. The described approach consists of two stages: detection of lesion

location on citrus fruits/leaves and categorization of citrus diseases. On the pre-processed

input image of the citrus fruit, the optimised weighted segmentation algorithm extracts the

lesion location. The texture, colour, and geometric elements are then merged in a codebook.

Furthermore, the best features are determined by combining entropy, PCA score, and

skewness-based covariance vectors in a hybrid feature selection (FS) approach. For citrus disease classification, the selected features are input into a Multi-Class Support Vector Machine (M-SVM).

Guo et al. (2019)[59] present an enhanced convolutional neural network for detecting tomato

leaf diseases. The experimental dataset came from the publicly accessible PlantVillage

dataset. TomatoNet, a new convolutional neural network based on the original ResNet18

model, was built by adding a squeeze-and-excitation module and changing the classifier

structure. The results illustrate that the TomatoNet network's average recognition accuracy is

99.63%, which is 0.53% greater than the ResNet18 network. Furthermore, once the AlexNet

network was upgraded, recognition accuracy increased from 88.97% to 98.35%.

Tomatoes are regarded as one of the most popular vegetable crops in the Philippines. Manual

sorting is the most extensively used sorting approach; however, it is highly dependent on human interpretation and hence prone to error. Luna et al. (2019) [60] suggested a strategy

for sorting tomato fruit based on the presence of defects. Based on a single tomato fruit

image, the study constructed an image dataset for deep learning-based defect detection. The OpenCV library and Python programming were used to create the

models. Using the improved image-capturing box, 1,200 tomato images classified as no-defect and with-defect are collected. These images are used to train, validate, and test three deep

learning models: VGG16, InceptionV3, and ResNet50. From this, 240 images are utilised as

testing images to evaluate the performance of the trained models separately using accuracy

and F1-score as performance indicators. Experimental results indicate that the VGG16 model achieves training/validation/testing accuracies of 95.75%/95.92%/98.75%, whereas the InceptionV3 model achieves 56.38%/59.24%/58.33% and ResNet50 achieves 90.58%/58.46%/64.58%. Based on the collected dataset, the comparative study demonstrated that VGG16 is the best deep learning model for identifying defects in tomato fruit.

Wang et al. (2019)[61] presented tomato disease detection architectures based on deep

convolutional neural networks and an object detection model. These approaches employ two

distinct models, Faster R-CNN and Mask R-CNN, where Faster R-CNN is used to identify

the kinds of tomato diseases and Mask R-CNN is used to detect and segment the locations

and shapes of diseased regions. Four distinct neural networks were chosen and merged with

two object detection models. Experiments with data acquired from the Internet reveal that the

suggested approaches are particularly accurate and efficient in recognising tomato disease

kinds and segmenting the shapes of diseased regions. The experimental results show that ResNet-101 has the highest detection rate but takes the longest time to train and detect, whereas the lighter backbone networks have the lowest detection time but are less accurate. In general, a model can be selected based on the actual needs. This study's dataset also offers a range of complex backgrounds, allowing architectures to improve their ability to recognise complex images.

Gehlot et al. (2020)[62] used several Convolution Neural Network designs to classify

diseases in tomato leaves. The disease in tomato leaves can have an impact on both the

quality and quantity of produce. Early disease identification, classification, and detection are

essential to solve this challenge. Convolutional Neural Networks are becoming more popular

in object recognition, detection, and classification due to their ease of use and automated

feature extraction as compared to previous approaches. AlexNet, VGG16, GoogleNet,

ResNet-101, and DenseNet121 were the five models employed. The model was able to divide

tomato leaf disease into ten distinct groups. All the models worked well; however, DenseNet-121 had the best accuracy and was the smallest of the models. ResNet-101 and VGG16 also performed on par with DenseNet-121, but because ResNet-101 is larger, it is unsuitable for smaller devices.

Ashraf et al. (2020) [63] created an innovative approach based on DCNN to categorise

different medical images, such as the body's organs. The final three layers of the GoogLeNet

transfer learning algorithm were swapped out for a fully connected layer, a soft-max layer,

and a classification output layer. The fine-tuned technique produced the best results when categorising different medical images. Some studies employed datasets with fewer images, which would have affected training effectiveness and raised the misclassification ratio.

Zeng et al. (2020) [64] used several CNN models to detect the severity stages of citrus

greening diseases, including AlexNet, DenseNet169, InceptionV3, ResNet34, SqueezeNet

1_1, and VGG13. The InceptionV3 model has the maximum accuracy of 74.38% in 60

epochs. Augmentation was also employed to improve the learning and performance of the

model. Deep convolution Generative Adversarial Networks (GANs) were used in this

process. With augmented data, the InceptionV3 model obtained 20% greater accuracy. The

data utilised in this study was gathered from non-profit organisation websites such as PlantVillage and CrowdAI.

Mao et al. (2020)[65] introduced a new Multi-Path Convolutional Neural Network

(MPCNN)-based cucumber area recognition approach. It combines the colour element

selection with the SVM algorithm. The cucumber image is transformed to colour space in

order to get 15 colour elements, and weighted data of relevant features is analysed using I-

RELIEF. Similarly, the Otsu approach is used to segment the G element and MSER is used to

obtain the mask image to remove part of the background region. The three elements with the highest weights are provided as input to the DL technique to extract and fuse features

in order to maximise the variances between cucumber and leaves and improve the classifier

accuracy of SVM. Finally, the cucumber identification is achieved by combining the SVM

classifier and the mask image.

Using original and feature image datasets, Xiao et al. (2020) [66] suggested the CNN method

with the ResNet50 model for identifying strawberry diseases, including crown leaf blight,

leaf blight, fruit leaf blight, grey mould, and powdery mildew. The training accuracy reached

its peak at 20 epochs, with 98.06% accuracy for the original dataset and 99.60% accuracy for

the feature dataset, respectively.

Through the Internet of Things, Ronghua et al. (2020) [67] collected environmental

parameters such as soil temperature and humidity, pH value, air temperature and humidity in

real-time. They then combined image feature fusion, environmental information, and expert

knowledge, and adopted the method of multi-structure parameter ensemble learning for

disease identification to ensure the accuracy of identification under the condition of shorter

identification time. The sample recognition rate ranged from 79.4 to 93.6% when 50 samples

of four different cucumber diseases were tested, including powdery mildew, fusarium wilt,

keratoderma, and sclerotinia sclerotiorum.

In China, tomato is grown extensively as a fruit or vegetable. There are several types of

disease and pests that affect tomatoes throughout their whole life cycle, making the

identification and diagnosis of these diseases crucial. For the purpose of detecting plant

diseases, many Deep Learning architectures have been put into use. Transfer learning was

used by Hong, Huiqun et al. (2020) [68] to decrease the amount of training data needed, the

amount of time needed, and the cost of the computations. In addition, 9 varieties of disease

leaves, including healthy tomato leaves, are classified. To extract the features, five deep

network architectures were used: ResNet50, Xception, MobileNet, ShuffleNet, and

DenseNet121_Xception. In the experiment, network architectures with varying learning rates were compared, the training parameters were tuned, and the networks were tested. The number of parameters and the average accuracy varied amongst the five convolutional neural networks. DenseNet121_Xception achieves the best recognition accuracy of 97.10% but has the most parameters, whereas ShuffleNet's recognition accuracy is 83.68% with the fewest parameters.

Gayatri et al. (2020) [69] developed a transfer learning system based on a pre-trained deep

CNN-based framework for disease categorization in tomato plants. The dataset for this study

was compiled from internet sources and consists of 859 images classified into 10 classes.

This is the first research of its sort to include: (i) a dataset containing 10 tomato pest classes;

and (ii) a comprehensive comparison of the performance of 15 pre-trained deep CNN models

on tomato pest classification. The experimental findings demonstrate that the DenseNet169

model has the maximum classification accuracy of 88.83%.

Diseases that affect plants such as citrus are major threats to food security, so early detection is critical. Timely disease identification lets growers respond quickly and avoid unnecessary interventions. Plant leaf images may be used to perform this recognition without human assistance. Several approaches are used in machine learning (ML) models for classification and detection, but combined with ongoing improvements in computer vision, deep learning (DL) shows tremendous potential for improving accuracy. Luaibi et al. (2021) [70] used two types of convolutional neural networks, the AlexNet and ResNet models, with and without data

augmentation. Data augmentation is the technique of adding additional data points by

modifying the existing data. This approach increases the amount of training images in DL

without requiring the addition of new images; it is excellent for small datasets. A collection

of 200 images of diseased and healthy citrus leaves is made. The trained models with data

augmentation produce the best results, with ResNet and AlexNet achieving 95.83% and 97.92%, respectively.

Zhang et al. (2021) [71] present a tomato disease detection technique based on Multi-ResNet34 multi-modal fusion learning, built on residual learning, to address the limited identification rate obtained from a single RGB image of a tomato disease. This research proposes transfer learning based on the ResNet34 backbone network to speed up training, decrease

data dependencies, and prevent overfitting owing to a small quantity of sample data; it also

combines multi-source data (tomato disease image data and environmental characteristics).

The feature-level multi-modal data fusion approach is used to keep the important information

of the data that is used to identify the feature, so that the different modal data can

complement, support, and correct each other, resulting in a more accurate identification

impact. First, Mask R-CNN was used to extract partial images of leaves from

complicated background tomato disease images in order to lessen the impact of background

areas on disease diagnosis. The combined image-environment data set was then fed into the

multi-modal fusion model to get disease type identification results. The proposed multi-

modal fusion model Multi-ResNet34 has a classification accuracy of 98.9% for six tomato

diseases: bacterial spot, late blight, leaf mould, yellow aspergillosis, grey mould, and early

blight, which is greater than that of the single-modal model.

2.5 Summary

In recent years, the agricultural sector has faced numerous challenges, including the detection of tomato fruit in images and videos and the classification of fruit by maturity at harvest. In addition, early identification of tomato fruit disease reduces costs by avoiding the unnecessary application of pesticides. This review presents an up-to-date analysis of current research on tomato detection, maturity-based classification, and disease identification using Artificial Intelligence techniques. The key objective of the work is to analyse the deep learning techniques broadly used to detect and classify tomatoes based on ripeness and disease. The literature discussed in this chapter shows that Machine Learning and Deep Learning are used in agriculture for the detection of tomatoes and their classification by ripeness and disease. More specifically, Deep Learning algorithms such as CNNs show significant results for classifying tomatoes by ripeness as well as disease, and single-stage object detectors performed well in detecting tomatoes in images and videos. From the literature, it is evident that deep learning methods outperform conventional methods such as image processing, machine learning, and shallow neural networks in ripeness and disease detection. Further, to improve fruit detection performance, the YOLO method, which plays a vital role in object detection, can be utilised with certain modifications.

CHAPTER 3

TOMATO RIPENESS DETECTION USING DEEP LEARNING MODEL

In this chapter, a Deep Learning model for Tomato Ripeness classification is proposed

based on VGG16 using transfer learning and fine-tuning strategy. The proposed method

focuses on the classification of tomatoes into two classes namely ripe and unripe based on the

maturity. The preprocessing stage converts the collected RGB images of different sizes into

224 x 224-pixel size. The pre-trained VGG16 model is used for feature extraction and the last

layer is replaced with Multi-Layer Perceptron (MLP) for the classification of tomatoes. The

performance of the proposed method is evaluated against the collected dataset and

experimental results are compared.

3.1 Introduction

Fruit maturity refers to the completion of development, which can only occur while

the fruit is still connected to the tree and is characterised by a halt in cell growth and the

accumulation of dry matter. The maturity of fruits and vegetables during harvest has a

considerable impact on the quality of all fruits and vegetables along the postharvest value

chain [72]. Fruit ripeness is significant in agriculture because it affects fruit quality [73], and

it is determined manually. The manual technique has several limitations. It is time-consuming

and labour-intensive and can lead to variations in determining fruit ripeness [74-75]. An

efficient and effective automated model that can identify and classify the fruits based on their

maturity degree in a short amount of time is desperately needed. The emergence of computer

vision technology has the potential to address these challenges because fruit ripeness

classification can be done automatically, making it relatively fast, consistent, and cost-

effective [76]. The amount of fruit produced and distributed is determined by the fruit maturation process, which takes only a short time; therefore, classifying the fruit's maturity level is critical.

Automation can be used to determine the level of fruit ripeness based on an image of

the fruit. Each fruit has a unique set of indicators, such as size, shape, and colour, that shows the degree of fruit development. Colour is important in digital image processing because it reflects the quality of the image used [77]. A number of methods for measuring fruit ripeness

have been developed.

The primary topic addressed in this research is assisting farmers or growers in protecting fruits from disease and harvesting them at the appropriate time. Farmers normally search for mature tomatoes to sell, and the degree of ripeness is usually judged by colour. Farmers will save substantial human labour if tomatoes can be classified as ripe or unripe (green) and this information is generated automatically. This method of classification can be used before harvest (in the field), after harvest (in storage), or both. Furthermore, automated quantification of tomatoes in the field will assist in assessing economic value more accurately than traditional methods.

3.2 Background

With the development of big data technologies and high-efficiency computers, deep

learning technology has opened up new possibilities for crop management and crop

harvesting in agricultural operation environments. A five-layer CNN model, comprising convolution, pooling, and fully connected layers, was suggested by the authors in [78] to identify bananas. Fruits including apples, strawberries, oranges, mangoes, and bananas

were examined in order to extract features using CNN. Fruits were classified using

algorithms like K-Nearest Neighbour (KNN) and Random Forest (RF). The deep-feature-plus-RF combination outperformed the rest with an accuracy of 96.98% when these DL-based RF and CNN algorithms were compared to existing systems.

A computer-vision-based application was designed using CNN and tested for the

classification of the ripening stages of the mulberry fruits [79]. The CNN classification model

was fine-tuned using transfer learning to improve accuracy and reduce training costs.

Different CNN models such as AlexNet, ResNet18, ResNet50, DenseNet, and InceptionV3

were used for testing. AlexNet and ResNet18 achieved the highest accuracies of 98.32% and 98.65%, respectively, for white and black mulberry ripeness classification.

3.3. Transfer Learning

A huge quantity of data is required to train a deep learning model to solve problems in

a specific domain. It is typically difficult to collect large datasets necessary to train such

models [80], and it is expensive to hire professionals to label such vast data sets [81]. Models

trained on one task, fortunately, may be reused to handle related problems in the same or a different domain. Transfer learning (TL) is a method that enables the reuse of a model that has already been pre-trained on a large data set.

According to studies, transfer learning is an efficient method for transferring large amounts of

visual information that have been acquired through training on such large image datasets to

new image datasets [82].

As demonstrated in Figure 3.1, a model that has been trained for one task is modified

(repurposed) for another similar task using this approach. The transfer of information from

one task to another is simply an optimisation strategy that leads to higher performance in the

second task's modelling.

Figure 3.1: Traditional vs Transfer Learning

The deep transfer learning is based on pre-trained neural models, which constitute the

foundation of transfer learning in the context of deep learning. Deep learning systems are

multi-layered architectures that learn distinct features at each layer. Generic, low-level features are captured in the initial layers and give way to increasingly fine-grained, task-specific characteristics deeper in the network. To obtain the final output, these layers are eventually joined to a fully connected layer. This opens up the possibility of employing popular pre-trained

models without their final layer as a fixed feature extractor for various tasks, such as VGG

Model, Inception Model, and ResNet Model.

The primary idea here is to use the pre-trained model's weighted layers to extract

features while avoiding updating the model's weights during training with new data for the

current job. The pre-trained models are trained on a big and general enough dataset and will

successfully function as a generic model of the visual environment.

3.3.1 Fine-tuning

Deep neural networks are multi-layered architectures with several configurable

hyperparameters. The initial layers' task is to capture generic features, whereas the latter layers are

more focused on the specific task at hand. It seems reasonable to fine-tune the higher-order feature

representations in the basic model to make them more relevant to the task at hand. Some model

layers can be re-trained while others remain frozen in training.

Along with the training of the classifier, one typical approach to improve the model's

performance is to "fine-tune" or re-train the weights of the top layers of the pre-trained model. This

forces the model to adjust the weights based on generic feature mappings learnt from the source

task. Fine-tuning will allow the model to use previous knowledge in the target domain as well as re-

learn some aspects [83].

Furthermore, rather than fine-tuning the whole model as shown in Figure 3.2, it is common to fine-tune only a limited number of top layers. The first few layers learn basic, generic features that are applicable to practically all sorts of data. Higher up the hierarchy, the features become

more particular to the dataset on which the model was trained. Instead of replacing the general

learning, fine-tuning seeks to modify these specialised features to operate with the new dataset.

Figure 3.2 : Fine-Tuning Approach

3.3.2 Transfer Learning Process:

The six steps involved in transfer learning process are shown in Figure 3.3.

Figure 3.3: Transfer Learning Process (obtain the pre-trained model → create a base model → freeze layers → add new layers → train the new layers → improve the model via fine-tuning)

a. Identify Transfer Learning Base Model


Using the pre-trained CNN model, determine the transfer learning base network and assign the network weights (W1, W2, …, Wn) to it. The weights of the bottom layers of a well-trained CNN can be obtained via the Keras API.

b. Create a New Neural Network

The network structure can be changed based on the bottom layers, which can include altering

layers, adding layers, and removing layers from networks, among other things. It is possible

to generate a new network structure in this way.

c. Freeze Layers

Freezing the starting layers from the pre-trained model is essential to avoid the additional

work of making the model learn the basic features.

d. Add new trainable layers

Only the feature extraction layers are being used as knowledge from the base model. To

predict the model's specialised tasks, we need to add new layers on top of the initial ones.

e. Train the New Layers

The pre-trained model’s final output will most likely differ from the output we want for our

model. In this case, the model is trained with a new output layer in place.

f. Fine-tune the new model

Fine-tuning is one approach for enhancing performance. It requires unfreezing a portion of the base

model and retraining the entire model on the full dataset at a very low learning rate. The slow

learning rate will improve the model's performance on the new dataset while limiting overfitting.
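As a concrete illustration of these six steps, the following is a minimal Keras sketch (assuming TensorFlow 2.x; ResNet50, the 128-unit head, the number of classes, and the number of unfrozen layers are illustrative assumptions, not the configuration used later in this thesis):

```python
import tensorflow as tf
from tensorflow.keras import layers, models

# (a) Obtain the pre-trained base model with its ImageNet weights.
base = tf.keras.applications.ResNet50(weights="imagenet", include_top=False,
                                      input_shape=(224, 224, 3))

# (b, c) Create the new network on top of the base and freeze the base layers.
base.trainable = False

# (d) Add new trainable layers for the target task (2 classes, illustrative).
model = models.Sequential([
    base,
    layers.GlobalAveragePooling2D(),
    layers.Dense(128, activation="relu"),
    layers.Dense(2, activation="softmax"),
])

# (e) Train only the newly added layers.
model.compile(optimizer="sgd", loss="categorical_crossentropy",
              metrics=["accuracy"])
# model.fit(train_ds, validation_data=val_ds, epochs=10)

# (f) Fine-tune: unfreeze the top of the base and retrain at a very low
#     learning rate so the generic features are adapted, not overwritten.
base.trainable = True
for layer in base.layers[:-10]:   # keep all but the top few layers frozen
    layer.trainable = False
model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=1e-4),
              loss="categorical_crossentropy", metrics=["accuracy"])
# model.fit(train_ds, validation_data=val_ds, epochs=10)
```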

3.4 Base Trained Models

Over the last decade, several CNN designs have been proposed. Better model

architecture can help improve application performance. CNN architectures have undergone significant revisions from 1989 to the present. Examples of structural modifications

include reformulation, regularisation, and parameter optimisation. To put this in context, the

significant improvement in CNN performance was primarily due to the reorganisation of

processing units and the creation of additional blocks. The use of network depth has been the

focus of the most innovative developments in CNN architectures.

3.4.1 VGG16 Model

VGG16 is a variant of the VGG model, also known as VGGNet. It is a 16-layer convolutional neural network (CNN) model proposed by K. Simonyan and A.

Zisserman of Oxford University and published in the paper Very Deep Convolutional

Networks for Large-Scale Image Recognition [84]. The model achieves 92.7% top-5 test

accuracy on ImageNet, a dataset of over 14 million images divided into 1000 classes.

VGG16 surpasses AlexNet by replacing huge filters with sequences of smaller 3 x 3

filters. AlexNet's kernel size is 11 for the first convolutional layer and 5 for the second layer.

The benefit of having several smaller layers rather than a single big layer is that the

convolution layers are accompanied by more non-linear activation layers, which enhances the

decision functions and helps the network to converge faster. In order to reduce the network's

inclination to overfit during training, VGG uses smaller convolutional filters. A 3 x 3 filter is the smallest size that can still capture information from the left, right, up, and down directions; as a result, VGG is the simplest model that can interpret the spatial components of an image, and its uniform 3 x 3 convolutions keep the network's operation simple.

VGG16 is a 16-layer deep neural network and a relatively extensive network with a

total of 138 million parameters shown in Figure 3.4.

Figure 3.4 VGG16 architecture

VGG16 has 21 layers in total—13 convolutional, 5 Max Pooling, 3 Dense—but

only 16 of them are weight layers, or the learnable parameters layers seen in Figure 3.5.

It supports input tensors with three RGB channels of size 224 x 224. The most

notable aspect of VGG16 is that it concentrated on having convolution layers of a 3x3

filter with stride 1 and always used the same padding and maxpool layer of a 2x2 filter

with stride 2, instead of a large number of varying hyper-parameters. The convolution and max-pool layers are organised uniformly throughout the architecture. Conv-1 has 64 filters, Conv-2 has 128 filters, Conv-3 has 256 filters, and Conv

4 and Conv 5 have 512 filters. Following a stack of convolutional layers, three Fully

Connected (FC) layers are added: the first two have 4096 channels apiece, while the

third performs 1000-way ILSVRC ( ImageNet Large Scale Visual Recognition

Challenge ) classification and hence has 1000 channels (one for each class). The soft-

max layer is the final layer.

Figure 3.5 Layered VGG16 structure

3.5 Proposed Method:

The ripeness classification is carried out in the way illustrated in Figure 3.6. The data

collected in this study will be used to assess the maturity of tomatoes at various stages. To get

the best results, the transfer learning technique leverages the pre-trained VGG16 with a fine-tuning strategy and replaces the classification layer with a Multi-Layer Perceptron during the training stage.

Figure 3.6: Steps involved in Ripeness Classification (data acquisition → data pre-processing → data augmentation → image classification)

3.5.1 Data Acquisition

The image data for tomato ripeness classification was gathered from the PlantVillage dataset.

The dataset contains 400 images of tomatoes divided into two classes: ripe and unripe.

3.5.2 Data pre-processing and Augmentation

The amount of data used for classification is insufficient to train a Deep Learning model.

When data is augmented, the dataset size increases many times, which helps in training the

deep network model. To generalise the data, data augmentation is used throughout the

training phase. Data augmentation [85] is a procedure that creates extra training data to

minimise overfitting. The aim is to allow the model to be able to adapt to real-world

challenges without being presented with the same image repeatedly. The ImageDataGenerator

function in Keras is used to do this process, which includes rotation, width and height

shifting, zooming, and horizontal flipping. Table 3.1 shows the augmentation approaches

used in this research on the gathered dataset.

Table 3.1: Augmentation Techniques

Parameter Value

Rotation 90

Brightness-range [0.1,0.7]

Horizontal Flip True

Vertical Flip True

Horizontal Shift 0.2

Vertical Shift 0.2

The images in the dataset are of various sizes and must be resized in order to fit into the deep

learning CNN. Because the image's input size must be the same as the model input size, the

images in the dataset are resized to 224 × 224 and normalised for rapid processing by scaling

the pixel values within the range of [0,1].
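A minimal sketch of this augmentation and normalisation pipeline using Keras' ImageDataGenerator is given below; the augmentation values follow Table 3.1, while the directory path and class subfolder layout are illustrative assumptions:

```python
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Augmentation parameters from Table 3.1, plus rescaling of pixels to [0, 1].
datagen = ImageDataGenerator(
    rotation_range=90,
    brightness_range=[0.1, 0.7],
    horizontal_flip=True,
    vertical_flip=True,
    width_shift_range=0.2,     # horizontal shift
    height_shift_range=0.2,    # vertical shift
    rescale=1.0 / 255,         # normalise pixel values to the range [0, 1]
)

# Images of arbitrary size are resized to 224 x 224 to match the VGG16 input.
train_gen = datagen.flow_from_directory(
    "dataset/train",           # illustrative path with ripe/ and unripe/ subfolders
    target_size=(224, 224),
    batch_size=16,
    class_mode="categorical",  # two classes: ripe and unripe
)
```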

3.6 Implementation of VGG16 using Transfer Learning:

Training a deep CNN model from scratch, such as VGG16, necessitates a large

amount of data since it has millions of trainable parameters, and a small dataset would be

inadequate to achieve decent generalisation of the model. Therefore, this model is reused with its pre-trained weights through the Transfer Learning (TL) approach. TL is a useful Machine Learning (ML) technique in which a pre-trained CNN model is reused, taking its weights as the initialisation of a new CNN model for a different purpose.

TL is applied in two stages. Deep feature extraction is the initial stage, in which the key features are extracted from the dataset using the trained model; the weights of the pre-trained model determine which features can be reused for new problems. The second stage is fine-tuning, which involves freezing the model's basic layers and retraining the last layers on a new small dataset. The weights of the final layers are updated using the back-propagation technique after training with the new dataset. Furthermore, the number of classes

in the output layer matches the number of classes in the target dataset. The ImageNet dataset, which was used to train the VGG16 model, has 14 million images separated into 1,000 different classes. Figure 3.7 depicts the proposed DeepCNN model, to which the

learned weights and parameters from the underlying VGG16 model are transferred.

Figure 3.7: Proposed VGG16 Architecture

As illustrated in Figure 3.8, the VGG16 model has 16 layers, with 13 convolutional

layers followed by max pooling and three fully connected layers utilised for classification.

The DeepCNN model is introduced, which is built on the VGG16 model and uses TL. The

top layer is removed and replaced with a new top layer that outputs the two tomato classes. The proposed model's pre-trained convolutional blocks are frozen (non-trainable) to prevent their weights from being modified, while the final two pre-trained layers are un-frozen (trainable) and trained on the new dataset to fine-tune the specialised features, which is known as the fine-tuning strategy (a sketch of this configuration is given after Figure 3.8).

Figure 3.8: Proposed VGG16 layered structure
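A minimal Keras sketch of the freezing and selective un-freezing just described is shown below (assuming TensorFlow 2.x; the width of the MLP head and the dropout rate are illustrative assumptions, since the text specifies only that the top is replaced with an MLP ending in a two-class output):

```python
import tensorflow as tf
from tensorflow.keras import layers, models

# Load VGG16 pre-trained on ImageNet, dropping its original classifier top.
vgg = tf.keras.applications.VGG16(weights="imagenet", include_top=False,
                                  input_shape=(224, 224, 3))

# Freeze all pre-trained layers except the final two, which remain trainable
# so that the specialised features can be fine-tuned on the tomato dataset.
vgg.trainable = True
for layer in vgg.layers[:-2]:
    layer.trainable = False

# Replace the top with an MLP head ending in a two-class softmax (ripe/unripe).
model = models.Sequential([
    vgg,
    layers.Flatten(),
    layers.Dense(256, activation="relu"),  # illustrative MLP width
    layers.Dropout(0.5),                   # dropout, as noted in the chapter summary
    layers.Dense(2, activation="softmax"),
])
model.summary()
```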

As seen in Figure 3.9, the DeepCNN model is divided into two parts: feature extraction and

classification. The following is a thorough description of the internal components of feature

extraction and classification:

Figure 3.9: DeepCNN Architecture Model

3.6.1 Feature Extraction

Convolution layers and pooling layers are utilised in the feature extraction process, which is

used to extract the features from the provided images. Rectified Linear Unit (ReLU)

activation function, which is present in each convolution layer, is used to activate the

neurons.

Convolution layer

The Convolution layer, which serves as the feature extractor from input images, is the most

crucial part of CNN. Each convolution layer has a series of filters that move over the input

image to identify the key details. By computing the dot product between the pixels from the

input image and the filter, the feature maps are created. The resultant size of the feature map

is obtained using equation 3.1:

FM = [(w − f_w)/s + 1] × [(h − f_h)/s + 1] × k (3.1)

where w and h are the width and height of the input image, f_w and f_h are the width and height of the filter, s is the stride, and k denotes the number of filters.
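For example, a 224 × 224 input convolved with k = 64 filters of size 3 × 3 at stride s = 1 and no padding yields a feature map of size [(224 − 3)/1 + 1] × [(224 − 3)/1 + 1] × 64 = 222 × 222 × 64.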

ReLU Activation Function

ReLU function is widely chosen in neural networks because it converges quickly and

reduces overfitting problems. The ReLU sets all of the negative values on the feature map to

zero using equation 3.2.

f(z) = max(0, z) (3.2)

where z is the feature map value.

Padding

The padding is used to avoid the feature map's reduction in size, which occurs during

dimensionality reduction. In order to maintain the input image's original size, padding is the

act of adding layers of zeros. Equation 3.3 is used to generate the final feature map after

padding.

FM = [(w + 2p − f_w)/s + 1] × [(h + 2p − f_h)/s + 1] × k (3.3)

where p denotes the padding size.

Max-Pooling layer

The spatial size of the convolved features is reduced and the overfitting issue is

lessened by the usage of the max-pooling layer. In order to produce the pooled feature map,

the max operation determines the largest value in each patch of the feature map as shown in

equation 3.4.

MaxPooling(X)_{i,j,k} = max_{m,n} X_{i·s_x + m, j·s_y + n, k} (3.4)

where X is the input, (i, j) are the indices of the output, k is the channel index, and s_x and s_y are the stride values.
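For instance, applying 2 × 2 max pooling with stride s_x = s_y = 2 to the patch [[1, 3], [2, 4]] of a feature map yields the single value 4; applied over a 224 × 224 feature map, such pooling halves the spatial size to 112 × 112 while leaving the number of channels unchanged.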

3.6.2 Classification

The Max Pooling layer and dense layer make up the classification portion of the proposed

model. The maturity prediction is carried out using the dense layer's softmax function.

Softmax classifier

The softmax function is a type of activation function that is used in the multi-class

classification problem and appears at the output layer. It takes a vector of k real numbers and normalises the input values into a vector of probabilities ranging from 0 to 1. The softmax function returns a probability for each class, and the class with the highest probability is picked, as depicted by equation 3.5.

Softmax(x_i) = e^{x_i} / Σ_j e^{x_j} (3.5)

where x_i represents the values from the neurons of the output layer.
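As a worked example, for output-layer values x = (2.0, 1.0), equation 3.5 gives e^{2.0}/(e^{2.0} + e^{1.0}) ≈ 7.389/10.107 ≈ 0.731 for the first class and 0.269 for the second, so the first class is picked.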

3.8 Optimizers

Optimizers are algorithms or techniques that change the parameters of a neural network, such

as its weights and learning rate, to reduce losses and improve accuracy [86]. SGD (Stochastic Gradient Descent) is the optimizer used in this research.

SGD Optimizer

Stochastic Gradient Descent (SGD) is a variant of the gradient descent approach that

calculates the error and updates the model for each occurrence in the training dataset rather

than the full training dataset at once. For each training sample (x_i, y_i), it computes the gradient ∇_θ J(θ) and updates the parameter θ to reduce the objective function J(θ), as given in equation 3.6.

θ = θ − η ∇_θ J(θ; x_i, y_i) (3.6)

where:

 θ represents the parameters (or weights) of the model.

 η is the learning rate, a hyperparameter determining the step size at each iteration

while moving toward a minimum of a loss function.

 ∇_θ J(θ; x_i, y_i) is the gradient (the partial derivatives) of the loss function J with respect to θ, evaluated at the training sample (x_i, y_i).

3.9 Model Performance

Confusion Matrix

The confusion matrix offers a comprehensive picture of a classification model's

performance[87]. The matrix compares the true values to those predicted by the model.

Figure 3.10 is a heatmap visualisation of the confusion matrix created with Python's sklearn

module. It consists of two rows and two columns that indicate the two classes (Class1 and Class2) with correct and erroneous predictions. The measurement matrix of the classifier is defined by the four properties listed below.

Figure 3.10 Confusion Matrix

True Negatives (TN): the count of the outcomes which are originally Class2 and are truly

predicted as Class2.

False Positives (FP): the number of images that are originally Class2 but are predicted

falsely as Class1. This error is named as a type 1 error.

False Negatives (FN): the count of Class1 images, which are falsely predicted as Class2, also

known as a type 2 error.

True Positives (TP): the count of Class1 images which are truly predicted as Class1.

Classification Report

After creating the confusion matrix, the performance metrics (accuracy, recall, precision, and

F1-score) of the models may be accessed using the classification report. The classification

report may be imported in Python using 'from sklearn.metrics import classification_report'. The values of the performance measures are calculated using

TN, FP, FN, and TP.
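A short sketch of how these quantities might be obtained with sklearn is given below; the labels and scores are illustrative placeholders, not results from this study:

```python
from sklearn.metrics import (confusion_matrix, classification_report,
                             roc_curve, auc)

# Illustrative labels: 1 = Class1 (ripe), 0 = Class2 (unripe).
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 0, 1, 1]

# Confusion matrix: rows are true classes, columns are predicted classes.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tn, fp, fn, tp)

# Accuracy, precision, recall, and F1-score in one classification report.
print(classification_report(y_true, y_pred, target_names=["unripe", "ripe"]))

# ROC curve and AUC (discussed below) from predicted probabilities.
y_score = [0.9, 0.2, 0.8, 0.4, 0.1, 0.3, 0.7, 0.6]
fpr, tpr, thresholds = roc_curve(y_true, y_score)
print("AUC =", auc(fpr, tpr))
```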

Accuracy is the measure of all correctly classified images and is represented as the ratio of

correctly classified images to the total number of images in the test dataset, as shown in

equation 3.7

Accuracy (A) = (TN + TP) / (TP + TN + FP + FN) (3.7)

Precision is the correctly predicted positive images out of all positive images. For instance,

it can be defined as the ratio of correctly classified images as ripe to the total number of

images predicted as ripe, as shown in equation 3.8.

Precision (P) = TP / (TP + FP) (3.8)

Recall is calculated by dividing the correctly classified images (of a class) by the total

number of images belonging to that class as shown in equation 3.9.

Recall (R) = TP / (TP + FN) (3.9)

F1-score is the harmonic mean of precision and recall with a minimum value of 0 and a

maximum value of 1. It provides a better measure of incorrectly classified images than

the accuracy metric. The value of the F1-score is measured by equation 3.10.

F1 Score (F) = (2 × Precision × Recall) / (Precision + Recall) (3.10)

ROC Curve

The Receiver Operating Characteristic curve is known as the ROC curve. It is a graphical

representation of a binary classifier's performance at various classification thresholds. A

threshold value is first established in order to classify the sample data, and it would result in a

set of TP, FP, TN, and FN values [88]. Different sets of these values are produced for the various threshold settings. Lowering the threshold will result in more true positives being correctly identified, but it will also result in more false positives and fewer true negatives; setting the threshold higher skews the values in the opposite direction. If a confusion matrix is constructed for each

threshold value, this would result in a large number of confusion matrices that would be

difficult to analyse and compare. It is difficult to determine the proper threshold.

Instead of many confusion matrices, Receiver Operating Characteristic (ROC) graphs provide a more straightforward method of evaluation by presenting all of the previously discussed information in the form of a graph, as illustrated in Figure 3.11. The Y-axis, displayed as sensitivity, depicts the true positive rate, i.e. the proportion of positive data that has been accurately categorised. The X-axis shows the false

positive rate. This is the proportion of data that has been mistakenly labelled as false

positives. Each of these rates may be computed at different threshold values and presented as

a ROC graph. One ROC curve displays all of the confusion matrices that would be generated

for different threshold values, allowing us to examine them all in one place and choose the

threshold value that delivers the most accurate prediction. ROC curves can also be used to

compare different neural network models (NNMs). The area under the curve (AUC) is

calculated for this purpose. A model is considered better if its AUC value is greater.

Figure 3.11: General ROC curve

3.10 Results and Discussion

The experiment is carried out on VGG16 using transfer learning and the tomato fruit

dataset to evaluate model performance. To extract the features, the traditional VGG

architecture with learned weights from the ImageNet dataset is used. Its output features are

supplied to the newly added fully connected layer, which is trained on the tomato dataset.

For the model's performance evaluation, parameters such as the number of epochs, batch size,

learning rate, and optimizers to attain maximum accuracy are taken into account. The model

is trained by splitting the pre-processed and augmented dataset into 60% training, 20%

validation, and 20% testing portions. Furthermore, the proposed model is trained using the hyperparameters listed in Table 3.2.

Table 3.2 Hyperparameters

Parameter Value

No. of Epochs 100

Batch Size 16

Learning Rate 0.001

Optimizer SGD

Loss Function Binary Cross entropy

For the calculation of loss, the binary cross-entropy method is applied, as represented by equation 3.11.

BinaryCrossEntropy = −(1/N) Σ_{i=1}^{N} [ y_i log(p(y_i)) + (1 − y_i) log(1 − p(y_i)) ] (3.11)

where y_i is the label (1 for the ripe class and 0 for the unripe class) and p(y_i) is the predicted probability of sample i being ripe, over all N samples.
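Expressed in Keras, the training configuration of Table 3.2 and equation 3.11 might look roughly as follows; model, train_gen, and val_gen refer to the model and data generators sketched earlier, and the batch size of 16 is assumed to be set in the generators:

```python
from tensorflow.keras.optimizers import SGD

# Hyperparameters from Table 3.2.
model.compile(
    optimizer=SGD(learning_rate=0.001),
    loss="binary_crossentropy",   # computed as in equation 3.11
    metrics=["accuracy"],
)

history = model.fit(
    train_gen,                    # 60% training split
    validation_data=val_gen,      # 20% validation split
    epochs=100,
)
```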

3.10.1 Training on Transfer Learning Models

The training results of DeepCNN, a modified VGG16 model with transfer learning,

are reported in this section. Figure 3.12 shows the training and validation accuracy for simple

CNN for ripeness classification. The classification's average accuracy was found to be

72.21%. The ROC curve, shown in Figure 3.13, is a true-positive rate against false-positive

rate graph that displays the performance of all classes at various classification thresholds; the

ROC area for the basic CNN binary classification is 0.83 for two classes (ripe and unripe).

The confusion matrix for basic CNN on binary classification is shown in Figure 3.14. Figure

3.15 depicts the training and validation accuracy graph of the DeepCNN model based on

VGG16 without fine-tuning technique, with an average accuracy of around 91%. Figures

3.16 and 3.17 depict the ROC curve and confusion matrix for DeepCNN without fine-tuning.

Figure 3.12 Accuracy Graph of Basic CNN Model

Figure 3.13 ROC curve of Basic CNN model

Figure 3.14 Confusion Matrix of Basic CNN model

Figure 3.15 Accuracy Graph of DeepCNN model without Fine-tuning

Figure 3.16 ROC of DeepCNN without Fine-tuning

Figure 3.17 Confusion Matrix of DeepCNN without Fine-tuning

The training and validation accuracy of the DeepCNN model with fine-tuning is shown in Figure 3.18. Figure 3.19 shows the ROC curve, a true-positive rate versus false-positive rate graph that displays the performance of all classes at various classification thresholds; the ROC area of the proposed model for the binary classification is 0.99 for both classes (ripe and unripe). Figure 3.20 shows the confusion matrix for VGG16 with transfer learning.

Figure 3.18 Accuracy Graph of DeepCNN with Fine-tuning

Figure 3.19 : ROC of DeepCNN with Fine-Tuning

Figure 3.20 Confusion Matrix of DeepCNN with Fine-tuning

3.10.2 Result analysis

Table 3.3 shows the accuracy, precision, recall, and F1-score for the Basic CNN model and the DeepCNN model with and without the fine-tuning strategy. The DeepCNN model is trained with and without the fine-tuning strategy using the SGD optimiser.

Table 3.3 Performance Evaluation of three models

Model                          Accuracy (%)   Precision (%)   Recall (%)   F1-score (%)
DeepCNN with Fine-tuning       97.71          98              98           98
DeepCNN without Fine-tuning    90.21          91              90           90
Basic CNN                      71.88          74              72           71

The accuracy is highest for the proposed DeepCNN model with the fine-tuning strategy (VGG16 with transfer learning) at 97.71%. DeepCNN without fine-tuning achieved 90.21% accuracy, and the basic CNN has the lowest accuracy at around 71.88%. Similarly, precision is evaluated for the three models: precision is highest for the proposed DeepCNN model at around 98%, DeepCNN without fine-tuning achieved 91%, and the basic convolutional neural network model has the lowest precision at 74%. The next parameter evaluated is recall: the maximum recall is 98% for the DeepCNN model, DeepCNN without fine-tuning has a recall of 90%, and the basic CNN has the minimum at 72%. The F1-score is also evaluated for the binary classification: the basic CNN model has the minimum score at around 71%, DeepCNN without fine-tuning has an F1-score of 90%, and DeepCNN has the maximum F1-score of 98%. Figure 3.21 shows the comparison of the three models for ripeness classification. For all parameters, DeepCNN has the best performance and the basic CNN has the lowest: the maximum accuracy is 97.71%, the maximum precision is 98%, the maximum recall is 98%, and the maximum F1-score is 98%.

Figure 3.21 Performance Evaluation of three models

Summary

A deep learning CNN model based on VGG16 was proposed for detecting and classifying the ripeness of tomatoes. The proposed model performs well and achieves better accuracy in detecting ripeness and classifying tomato fruit by utilising the fine-tuning technique. Adding the data augmentation process to the dataset and applying the fine-tuning strategy produced a more robust model, and the use of dropout and data augmentation to reduce overfitting ensured the robustness of the proposed model.

CHAPTER 4

TOMATO DETECTION AND CLASSIFICATION USING ATTENTION-BASED

YOLO MODEL

4.1 Introduction

One of the most important computer vision tasks is object detection, which looks for

instances of visual objects (such as people, animals, cars, weeds, fruits, or buildings) in digital

images like pictures or video frames. Traditional image processing techniques and modern deep

learning networks can be used to detect objects. Object detection is a computer vision

implementation that not only detects but also estimates the location of objects in a digitized

scene such as an image or video. It is concerned with identifying instances of real-world

things while dealing with changes in the shape, orientation, and colour contents of the

objects. Although "detection" may refer to the capacity of intelligence to indicate the

existence and identification of an object, it may also refer to the power of intelligence to

locate a hidden concealed object. Object detection can be interpreted in a variety of ways,

including drawing a bounding box around the object or labelling every pixel in an image that

includes the object (a process known as segmentation) [89]. A rectangular bounding box is

used to define the position of the identified object, which helps the people in locating the

object faster than unprocessed images. It is an identified portion of an image that may be

understood as a single unit in image processing [90]. Typically, an image will contain one or more objects whose visibility is important; the number of objects represented in a single image can range from one to an arbitrarily large number. Thus, given an image or video stream, an object detection model should be able to determine whether any of a known set of objects is present and provide information about their locations within the image.

Object detection is commonly used in computer vision applications such as image annotation,

video object co-segmentation, activity recognition, face recognition, face detection, fruit

detection, and counting.

4.2 Object Detection in Agriculture:

Object detection, categorization, and localisation in images is a crucial step in the

development of agricultural applications. The location of objects in an image must be

determined in order to harvest fruits and vegetables, manoeuvre in the field, spray selectively,

and so on. Depending on the task, an object's 3D location must be computed, obstructions in

the path must be recognised, and object properties such as ripeness and size must be

determined. The presence and classification of diseases is also needed in several applications

[91,92,93]. Despite many years of research on agriculture-focused object detection, there are still several issues that hinder agricultural applications from being implemented [94]. The extremely varied and unstructured outdoor environment, changing lighting conditions, complicated plant structure, and varying product shape and size make it difficult to provide a universal solution to object detection in the agricultural environment [95].

For many years, hand-crafted features like colour, shape, texture, or their combination

were the primary focus of object detection research [96,97]. Despite the acquisition devices,

the range of target colours, and the various lighting conditions having a significant impact on

colour, colour has been one of the main features employed in detection. According to several studies, the specific characteristics of the agricultural environment were the greatest obstacles to precise identification and localisation of the target object [98]. Because of this, several studies have pursued different approaches to the detection task over the years, for example creating adaptive thresholding algorithms to deal with dynamic variations in illumination [99] and combining multiple sensors [100,101]. Artificial
neural network (ANN) methods have been used in a number of studies to eliminate the

necessity to define and choose the features.

ANNs were first applied to learn to classify colour features fed to the network in order to separate background pixels from target pixels, which suffered from the same difficulties caused by environmental variability [102,103]. Recent advances in computer vision algorithms have been associated with deep Convolutional Neural Networks (CNNs).

Performances in various classic vision tasks, such as classification, detection, and

segmentation, have greatly improved. These networks may be fed raw data, such as the pixels of a fruit in an image, and automatically learn features from it, eliminating the need for hand-

crafted features and the issues they bring [104]. With enough data, accurate representations of

the target object may be trained, and a robust system can be built that can handle the

uncertainty and high variability that are inherent to real-world vision challenges. In tasks like

disease detection and classification [105], fruit recognition and location [106], and others,

these networks have shown high-performance outcomes in the agriculture domain as well.

4.3 Object Detection Techniques

Deep Learning Models are currently being used in Object Detection techniques due to

their accurate Image Recognition capacity. These models use features extracted from input

videos and images to identify the objects included inside them. Image Processing, Video

Analysis [107], Speech Recognition, Biomedical Image Analysis, Biometric Recognition, Iris

Recognition, National Security, Cyber Security, Natural Language Processing, Weather

Forecasting, and Renewable Energy Generation Scheduling are some of the applications of

these models. These models make use of the Convolution Neural Network (CNN) [108]

framework, which is made up of many layers of artificial neurons.

A significant number of different methods have been developed and implemented. As

illustrated in Figure 4.1, these methods are divided into two categories based on their

theoretical foundation. The first are machine learning-based (ML-based) approaches [109], and the second are deep learning-based (DL-based) methods [110].

Figure 4.1:Object Detection Models

4.3.1 Machine Learning-based methods

ML inspires us in a variety of ways, particularly for pattern recognition from huge amounts of

high-dimensional images [111], which also contributes to the advancement of computer

vision[112]. Figure 4.2 depicts the usual progression of ML-based methods[113].

Figure 4.2: Machine Learning Method for Object Detection

The pipeline of the ML-based technique for object detection is depicted in Figure 4.2.

After obtaining an image, the sliding window method is applied to it [114], yielding a large number of candidate bounding boxes. This procedure is known as region selection [115], and it means that the region of interest to be detected is contained in these boxes. This method generates a myriad of candidate boxes for one image, and those boxes can be of different sizes and shapes. After this step, feature extraction is performed for each candidate bounding box: the feature inside each box is extracted from the image using specific methods. A classifier is then used to classify each box based on its features, and the class of each box is output in the last step.

Object detection is a simple procedure in ML-based approaches. The procedure is

divided into three sections: region selection using the sliding window method, feature

extraction, and classification for each bounding box, with the results provided. Even if it

could solve object detection problems this solution has problems.

The sliding window technique can generate candidate object bounding boxes, but its heavy computational cost limits its portability, and its complexity and redundancy limit its speed and efficiency.
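As a rough illustration of this pipeline, the sketch below pairs a sliding window with HOG features and a classifier; the window size, stride, grayscale input, and the already-trained scikit-learn-style classifier are all assumptions for illustration, not part of any specific method above.

```python
# A minimal sketch of the classical ML-based detection pipeline:
# sliding-window region selection, HOG feature extraction, classification.
from skimage.feature import hog

def sliding_window(image, window=(64, 64), stride=16):
    """Yield (x, y, patch) for every candidate box in a grayscale image."""
    h, w = image.shape[:2]
    for y in range(0, h - window[1] + 1, stride):
        for x in range(0, w - window[0] + 1, stride):
            yield x, y, image[y:y + window[1], x:x + window[0]]

def detect(image, classifier, window=(64, 64), stride=16):
    """Classify each candidate box; return boxes predicted as objects."""
    detections = []
    for x, y, patch in sliding_window(image, window, stride):
        features = hog(patch)                  # hand-crafted features
        if classifier.predict([features])[0] == 1:   # 1 = object (assumed)
            detections.append((x, y, window[0], window[1]))
    return detections
```

Even this small sketch makes the cost visible: every window position is featurised and classified independently, which is exactly the redundancy discussed above.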

4.3.2 Deep Learning-Based Methods

ML-based algorithms can address the object detection problem, but their complexity restricts their usage and development. With the advancement of computer vision (CV) and deep learning (DL), new methods for object detection have been developed, succeeding classical detectors such as the Deformable Part Model (DPM). These methods are based on deep neural networks, particularly the convolutional

neural network (CNN).

Deep Learning-based Object detection has two major methodologies, one-stage and two-

stage as shown in Figure 4.3. The one-stage approach combines classification and localisation

in a single step, whereas the two-stage approach requires two separate steps. While one-stage

detectors are often faster, they may compromise some accuracy as compared to two-stage

detectors, which are known for their higher accuracy. As a result, there is a trade-off between

detection speed and accuracy in object detection.

Figure 4.3: Types of Deep Learning Object Detection Systems

R-CNN

In 2014, R-CNN [116], a DL-based approach, was created to address the problem of

selecting a large number of regions. The visual information of one object, such as texture,

shape, colour, and edge, may be comparable and significant to another. As a result, the region

proposal approach, rather than the sliding window method, can be used to obtain the candidate boxes.

Instead of hand-crafted HOG features, R-CNN's feature extraction process uses a convolutional neural network (CNN): AlexNet is used to extract each box's features automatically, ultimately providing a feature map for each box. The DL-based technique thus addresses the ML-based method's reliance on manually defined features.

R-CNN continues to use SVM as its classifier for classification. Each box will be classified

using SVM based on its feature map, which was received through AlexNet. R-CNN will

produce 2,000 candidate boxes from an image using the region proposal approach as a result

of these improvements. The collected features will then be used to construct 2,000 feature

maps using AlexNet. Meanwhile, the SVM classifier will be used to classify each box, and a

regressor will be used to regress the bounding box. R-CNN's speed is quite slow due to its

sophisticated computation.

Fast R-CNN

Fast R-CNN was proposed by Girshick in 2015[117] in order to address the drawback

of R-CNN. Fast R-CNN follows R-CNN, but rather than running a CNN on every region proposal, it extracts features from the input image directly using VGG, producing just one feature map. Region selection is still carried out on the input image to retrieve region proposals, and the part of the VGG feature map corresponding to each proposal is selected in the next phase. An RoI pooling layer then warps and reshapes the selected feature maps into a fixed size. The warped feature maps are passed to fully connected layers for classification using Softmax and bounding box regression using a linear regressor.

As a result, instead of numerous region proposals, Fast R-CNN just does feature

extraction once on the whole image. This might save a significant amount of time and

memory for computing and storing the CNN features.

Faster R-CNN

Faster R-CNN outperforms Fast R-CNN by generating region proposals with the same CNN used for the image, rather than a separate proposal algorithm [118]. The initial step in Faster R-CNN is the same as it is in

Fast R-CNN; both use CNN to produce a feature map. Faster R-CNN, as opposed to Fast R-

CNN, uses a neural net called RPN (Region Proposal Network) to replace selective search.

This RPN net will use CNN to construct a large number of anchors with varying sizes and

ratios. Thus, this RPN will determine if this anchor is in the foreground or background, while

simultaneously regressing its bounding box. Finally, in order to save time, the RPN will

discard a large amount of background information.

Following the RPN net, Faster R-CNN will acquire a high-quality proposal with minimal

background. This will improve the speed and accuracy of the model.

Single Shot MultiBox Detector (SSD)

The Single Shot MultiBox Detector (SSD) [119] is a single-stage object detection

model that eliminates the need for separate region proposal and classification processes,

hence streamlining the object detection process. Due to this simplicity, detection performance

is faster, making SSD well suited to real-time applications. The SSD design is made up of a base network,

which is usually a pre-trained CNN, and a sequence of convolutional layers of variable sizes.

These layers are intended to identify objects of various sizes and aspect ratios. For each

feature map cell, SSD uses default bounding boxes, also known as anchor boxes. The model

predicts both the class scores and the box offsets relative to the anchor boxes during training.

Non-Maximum Suppression (NMS) is applied to the combined set of projected boxes to

generate the final predictions.

SSD has been successfully applied to a wide range of fruit detection applications. Wang et al. (2022) [120] propose a lightweight SSD detection technique for recognising Lingwu long jujubes in natural settings. Their SSD approach delivers high detection accuracy without the need for pre-trained weights while also reducing complexity to enable mobile platform deployment, and the combination of a coordinate attention module and a global attention technique enhances detection accuracy. In addition, the SSD model has been

customised and optimised for certain fruit detection applications. Researchers studied the use

of data augmentation techniques such as random cropping, flipping, and colour distortion to

improve the model's performance and robustness.

YOLO

Faster R-CNN makes use of the RPN to create regions of interest and then performs classification and regression on high-quality anchors. This gives Faster R-CNN higher accuracy, but at a slower rate. Therefore, instead of this two-

stage approach, a new one-stage method YOLO was developed to make the network

incredibly fast [121].

YOLO predicts an object's class and location without using anchors or an RPN. As a result, its speed is extremely high, though at some cost in precision. YOLO is the first real-time detector and the first one-stage object detection system. YOLOv1 uses no prior boxes; it treats detection as a regression problem, giving it a very simple structure, and employs a GoogLeNet-like model as its backbone. These strategies greatly improve detection speed but limit its efficacy on some small objects.

YOLOv2, commonly known as YOLO9000, is the second version of YOLO that can

classify 9000 classes. It employs DarkNet-19 as its backbone and uses prior boxes, predicting offsets to them rather than raw box sizes and locations [122]. Additionally, it employs multi-scale

and multi-step training. Although the implementation of such methods improves speed, it

does not solve the poor performance in small objects.

As the third version of YOLO, YOLOv3 incorporates techniques from contemporary detection frameworks, such as residual networks and feature fusion, and proposes a backbone called DarkNet-53. The residual connections allow the network to be built to great depth while considerably reducing vanishing gradients and accelerating convergence. YOLOv3 combines deep and shallow feature maps using upsampling and concatenation, building feature maps at three different sizes for detection. These variable-size feature maps give YOLOv3 superior performance on small object detection, despite its slower speed [123].

YOLOv4 is the fourth version of the YOLO object detection algorithm, which was

announced in 2020 as an enhancement over YOLOv3 [124]. The key enhancement in YOLOv4 over YOLOv3 is the introduction of a new CNN architecture called CSPNet (Cross Stage Partial Network), a variation of the ResNet architecture intended for object detection tasks. It also uses k-means clustering to create the anchor boxes,

which groups the ground truth bounding boxes into clusters and then uses the centroids of the

clusters as the anchor boxes. This allows the anchor boxes to be more precisely matched with the

size and shape of the detected objects.

In 2020, YOLOv5 was released. It is built on a CSP-based backbone (CSPDarknet) with a PANet-style neck, described in detail in Section 4.4, and this architecture allows for greater accuracy and generalisation to a broader variety of object categories.

YOLOv5 employs a novel approach for producing anchor boxes known as "dynamic

anchor boxes." The ground truth bounding boxes are clustered using a clustering method, and the

centroids of the clusters are used as anchor boxes. This allows the anchor boxes to be more

precisely matched with the size and shape of the detected objects. YOLOv5 also employs "spatial pyramid pooling" (SPP), a pooling layer that pools feature maps at multiple scales. SPP is used to increase small object detection performance by

allowing the model to observe the objects at multiple scales.

YOLO strikes a good balance between speed and accuracy in scenarios where real-time performance is required, making it well suited to the fruit detection problem. Based on this, YOLOv5 is chosen as the base model in this research to deal with fruit

detection.

4.4 Architecture of YOLOv5

"You only look once" (YOLO) is a deep learning strategy that is used in one-stage detectors

that use CNN to detect objects. The design of YOLO is fast and accurate, making it suitable

for real-time object detection applications [125]. To accomplish fast network processing, the YOLO method employs a grid-based design. As illustrated in

Figure 4.4, the approach includes splitting input images into grid cells of size S x S and

assigning each cell to predict bounding boxes and object classification. In order to create a

prediction, the operation makes a single forward pass through the network, which involves

processing input data via a neural network in a single path, from input to output. Thus, "You-

only-look-once" got its name. Four parameters (x, y, width, and height) are used to define

each bounding box, together with a confidence score that represents the probability that an

object will be found inside the box. The class probabilities are also predicted by the detection

network. The number of classes that YOLO can identify is determined by the number of

output nodes in the final layer of the network. Each node in the output layer represents a

distinct class, and the output of each node is the likelihood that the object in the input image

belongs to that class.

Figure 4.4: YOLO Model
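To make this grid-prediction scheme concrete, the following sketch decodes a YOLO-style output tensor of shape S x S x (B*5 + C); the grid size, box count, class count, and score threshold here are illustrative assumptions, not YOLOv5's exact head layout.

```python
# Illustrative decoding of a YOLO-style grid output into boxes and classes.
import numpy as np

S, B, C = 13, 2, 2                       # grid size, boxes/cell, classes
pred = np.random.rand(S, S, B * 5 + C)   # stand-in for a network output

boxes = []
for i in range(S):
    for j in range(S):
        cell = pred[i, j]
        class_probs = cell[B * 5:]       # class probabilities of this cell
        for b in range(B):
            x, y, w, h, conf = cell[b * 5:(b + 1) * 5]
            cx, cy = (j + x) / S, (i + y) / S   # cell-relative -> image-relative
            score = conf * class_probs.max()
            if score > 0.5:              # confidence threshold (assumed)
                boxes.append((cx, cy, w, h, score, int(class_probs.argmax())))
```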

YOLOv5's adaptive anchor box calculation helps in training new datasets. The adaptive

anchor boxes feature uses the training dataset to construct acceptable bounding box anchors,

allowing the detector to learn different sized objects more successfully. Furthermore,

YOLOv5's loss function is based on GIoU (Generalised Intersection over Union), which

helps in training by preserving the error distance even when predictions and ground truth do

not intersect.

YOLOv5 provides multiple pretrained model checkpoints that have been trained with the

COCO (Common Objects in Context) dataset as premade deployment options. The depth of

the backbone network is determined by 5 distinct pretrained structures, n, s, m, l, x, and each

of these structures has two resolution options, 640x640 and 1280x1280. A deeper backbone

may extract more information from input images and improve accuracy, but it also makes the

network operate slower computationally. These pretrained weights can also be utilised as a

foundation for fine-tuning, and YOLOv5 includes simple interfaces for doing so.
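As a brief example, one of these pretrained checkpoints can be loaded through PyTorch Hub, as documented in the ultralytics/yolov5 repository; the image path here is a placeholder.

```python
# Minimal sketch: load a COCO-pretrained YOLOv5s checkpoint and run inference.
import torch

model = torch.hub.load('ultralytics/yolov5', 'yolov5s', pretrained=True)
results = model('tomato_field.jpg')   # single forward pass on one image
results.print()                       # summary of detected classes and counts
detections = results.xyxy[0]          # tensor rows: [x1, y1, x2, y2, conf, class]
```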

The YOLOv5 model, as illustrated in Figure 4.5, is made up of three parts: the backbone, the

neck, and the head. The backbone extracts features from the input image at various

granularities. The neck then aggregates the features retrieved by the backbone and sends them

to the next layer for prediction. Finally, the head predicts the class labels and constructs the

bounding boxes.

Figure 4.5 : YOLOv5 Architecture

Input

The YOLOv5 input stage uses mosaic data augmentation, which stitches four randomly clipped and scaled images together, improving performance on small object detection; it is applied during both training and fine-tuning. Figure 4.6 depicts mosaic data augmentation. YOLOv5 uses non-max suppression (NMS) to keep a single prediction bounding box close to each ground-truth bounding box.

Because the images are not consistent in size, the adaptive zooming approach is used to zoom

them to an acceptable uniform size before being sent into the network for detection, hence

eliminating issues such as a conflict between the feature map and the fully connected layer.

Figure 4.6 : Mosaic Data Augmentation Sample Images

Backbone

To accomplish slice operations, the backbone of YOLOv5 uses the Focus layer for down-

sampling. The original RGB input image is sent into the Focus layer, which executes a slice

operation to build a 12-dimensional feature map before performing a convolution operation

with 32 kernels to produce a 32-dimensional feature map. CSP1_X and CSP2_X are the two

types of Cross Stage Partial (CSP) structures in YOLOv5. The first CSP structure, CSP1_X, is

used in the backbone to achieve rich gradient information while lowering computing costs.

Spatial Pyramid Pooling (SPP) is used in backbone to generate fixed-size feature maps while

preserving image detection accuracy.

Neck

The YOLOv5 neck is largely used to build feature pyramids, which improves the model's

detection of objects of varying sizes and enables recognition of the same object at different

sizes and scales. To aggregate the features, YOLOv5 employs the CSP2_X structure, as well

as the Feature Pyramid Network (FPN) [126] and Path Aggregation Network (PAN) [127].

Head

The YOLOv5 head is made up of non-max suppression and a loss function. The loss function

is divided into three parts: bounding-box loss, confidence loss, and classification loss. The

bounding box loss is computed using the Generalised IoU (GIoU) [128]. YOLOv5 uses

weighted NMS in the post-processing of target object detection to filter multiple target

bounding boxes and delete duplicate boxes.

4.5 Performance Metrics

In machine learning, several distinct metrics are used to compare the outcomes of various

approaches and algorithms. These metrics are frequently based on four fundamental values:

true positives, true negatives, false positives, and false negatives. True positives in object

detection are detections that accurately localise and classify a ground truth object. True

negatives, on the other hand, are portions of an image that do not include any objects or

detections. False positives are detections that are incorrectly localised or classified, whereas

false negatives are ground truth objects that are not detected.

Intersection over union (IoU)

The predicted bounding box is matched against the ground truth bounding box, and the degree of overlap between the two is computed and used to assess the accuracy of object detection. This

is known as the Intersection over Union (IoU) or the Jaccard index, and it ranges from 0 to 1.

An IoU of 1 implies a complete match between the two boxes, whereas an IoU of 0 shows no

overlap, as shown in equation 4.1. In the case of multiple object classes, IoU is calculated for

each class separately. Figure 4.7 shows a graphic depiction of IoU.

Figure 4.7 : Intersection over Union

IoU = (Area of Overlap) / (Area of Union)    (4.1)

Typically, an IoU threshold value defines whether a prediction is a true positive or

false positive; changing the IoU threshold might have an influence on the system's overall

performance. If the predicted class label matches the ground truth class label and the

predicted bounding box IoU ratio with the ground truth box exceeds a set threshold, the

object bounding box prediction is considered a true positive. All bounding boxes that do not

meet these requirements are considered false positives. False negatives are ground truth

bounding boxes that are not matched by any predictions. True negatives are not included in

the IoU calculation since they are rarely an interesting result, and metrics that rely on them

are rarely used. Figure 4.8 depicts the influence of the IoU computation on true positives,

false positives, and false negatives with example detection and ground truth bounding boxes.

The green bounding boxes in the figure represent ground truth, whereas the red bounding

boxes represent detections. The figure has two images: one on the left and one on the right.

With a 0.5 IoU threshold, the detection of the left image would be considered a true positive

and the detection of the right image as a false positive. Furthermore, the right image's ground

truth object would be regarded as a false negative because the detection for it was a false

positive.

Figure 4.8 : IOU Calculation
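A minimal sketch of equation 4.1 for two axis-aligned boxes in (x1, y1, x2, y2) format is shown below; the example coordinates are arbitrary.

```python
# IoU (equation 4.1) for two axis-aligned boxes given as (x1, y1, x2, y2).
def iou(box_a, box_b):
    # intersection rectangle
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

print(iou((0, 0, 10, 10), (5, 5, 15, 15)))  # 25 / 175 ≈ 0.143
```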

Precision

Precision is a metric that indicates what proportion of a model's positive predictions are

correct. It may be determined using the formula in equation 4.2.

Precision (P) = TP / (TP + FP)    (4.2)

Where TP is the number of true positives and FP is the number of false positives.

In object detection, the precision score is effective in measuring how accurately a

model is making predictions, but it is insufficient on its own to draw conclusions about the

model's actual accuracy. Since false negatives are not taken into consideration while

calculating precision, the model may miss nearly every object in the input images and yet

receive a precision score of 1.0. However, precision may be a highly helpful metric for

assessing model performance in situations when the main objective is minimising false

positives.

Recall

Recall is a metric that indicates the percentage of ground truth that a model has

correctly predicted. It is calculated by using equation 4.3.

Recall (R) = TP / (TP + FN)    (4.3)

Where FN is the number of false negatives. In object detection recall can be used to

easily check a model’s ability to find all the interesting objects from an input image. But as

with precision, recall alone does not give a complete measure of model’s accuracy.

The model can achieve a high recall score just by predicting bounding boxes

everywhere in an image, which as an output does not convey much information.

Precision-Recall Curve

The precision-recall curve is a graph that connects two metrics: precision on the y-axis and recall on the x-axis. The plot may be used to compute a new metric, AUC (area under curve), which indicates the area under the precision-recall curve [129]. This new metric combines precision and recall and largely overcomes the individual limitations of its components. The AUC score takes into account TP, FP, and FN, which are important quantities for object detection. As a result, it may be utilised as a fuller metric for the detection accuracy of a model. However, because the AUC score does not reflect the precision-recall ratio, the precision-recall curve itself remains essential in the accuracy analysis of a model. Note also that the AUC does not indicate which IoU threshold was used to compute TP, FP, and FN.

Figure 4.9 depicts an example precision-recall curve with AUC values produced by the

YOLOv5 framework.

Figure 4.9: Precision -Recall curve of YOLOv5 Model

Mean Average Precision

The mAP (mean average precision) measure is commonly used in object detection to assess

accuracy and regression performance. It is defined as the mean of the AP (average precision)

scores for each class. The AUC score of a particular class in a precision-recall curve

calculated via equation 4.4 is represented by AP. The greater the average precision, the better

the performance in detecting objects of a specific class.

Average Precision (AP) = ∫₀¹ p(r) dr    (4.4)

In object detection research, mAP is often presented with a decimal number added to it, in a format like mAP@0.5. The @0.5 notation represents the IoU threshold value used to calculate TP, FP and FN. The threshold value can also be given as a range, mAP@0.5:0.95, which represents the average of mAP scores with threshold values from the given range.

Mean average precision is standard as an overall evaluation metric to measure the accuracy of

an object detection system across all classes by taking the average of the APs across all object

classes; see equation 4.5.

mAP = (1/n) Σᵢ₌₁ⁿ AP(i)    (4.5)

AP(i) is the average precision for class i, and n is the number of classes. The mean average precision thus summarises each class's precision-recall curve in a single number, averaged over all classes.
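For illustration, AP and mAP can be approximated numerically from sampled points of each class's precision-recall curve; the recall and precision values below are made-up examples, not results from this research.

```python
# Numerical approximation of equations 4.4 and 4.5 with the trapezoidal rule.
import numpy as np

def average_precision(recall, precision):
    """AP = area under the precision-recall curve (equation 4.4)."""
    order = np.argsort(recall)
    return np.trapz(np.asarray(precision)[order], np.asarray(recall)[order])

pr_curves = {   # per-class (recall, precision) samples, assumed for the demo
    'ripe':   ([0.0, 0.5, 1.0], [1.0, 0.9, 0.6]),
    'unripe': ([0.0, 0.5, 1.0], [1.0, 0.8, 0.5]),
}
aps = [average_precision(r, p) for r, p in pr_curves.values()]
print('mAP =', np.mean(aps))   # equation 4.5: mean of per-class APs
```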

The per-class area under the curve (AUC) scores in Figure 4.9 reflect AP@0.5 for their respective classes, whereas the overall score represents mAP@0.5. mAP provides an informative performance indicator for measuring object detection accuracy via the IoU

threshold value. Although the measure may not be appropriate for certain tasks since it does

not reveal the exact relationship between accuracy and recall, a higher mAP result with the

same IoU threshold is typically preferable. In certain studies, AP is used to refer to the same

metric as precision, while mAP is used to define the precision metric's average over all

predicted classes.

4.6 Methodology

4.6.1 Data Collection

The images used for this research are from the Laboro Tomato dataset [130], which is a

tomato dataset made up of tomatoes taken at various stages of ripening and created for tasks

such as segmentation and object detection. The dataset consists of 2034 images with a 416 x

416 input resolution. It is made up of images collected under various environmental

conditions such as occlusion, lighting, shade, overlap, and others. Figure 4.10 displays some

of the images acquired under different conditions.

Figure 4.10 : (a) Single Tomato (b) Overlapping of Tomatoes (c) Occlusion by branch (d).

Under Shading Conditions (e) In Sunlight Conditions

4.6.2 Data Labelling and Splitting

The YOLO object detection paradigm requires labelled data, which provides the name and

location of the image's ground truth bounding boxes. The labelled dataset may be used to

train and test machine learning models, which will assist computers in automatically

identifying and locating target objects. As demonstrated in Figure 4.11, the labelImg is a

graphical annotation tool that is used to label 2034 images by placing bounding boxes around

each tomato in the image and naming each ground truth bounding box with a class label (ripe

or unripe). Using the LWYS (Label What You See) method, all visible tomatoes, ripe and

unripe, are marked with a bounding box in each image. Notably, the bounding boxes for highly occluded tomatoes are created by inferring the fruit's full extent from its visible part, using human judgement.

Figure 4.11: Annotation of Sample Tomato Image

Following the labelling process, the dataset is divided into three distinct subsets, namely the training set, validation set, and test set, which contain 1628, 203, and 203 original images respectively (an approximately 80/10/10 split). The use of a separate test set is crucial in measuring the trained model's generalisation performance. Furthermore, the validation set assisted in hyperparameter tuning and in choosing the best model throughout the training phase. By dividing the

dataset into subsets, it is assured that the trained model did not overfit to the training data and

could generalise effectively to previously unseen data.
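A minimal sketch of this split, assuming the images sit in a single directory and using a fixed shuffle seed for reproducibility; the directory path is illustrative.

```python
# 80/10/10 train/validation/test split of the 2034 labelled images.
import random
from pathlib import Path

images = sorted(Path('laboro_tomato/images').glob('*.jpg'))  # 2034 files
random.Random(42).shuffle(images)       # fixed seed for reproducibility

train = images[:1628]                   # 1628 training images
val = images[1628:1831]                 # 203 validation images
test = images[1831:]                    # 203 test images
```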

4.6.3 Data Augmentation

In the discipline of deep learning, having a big and diverse training dataset is critical for

constructing and testing strong and accurate models. Models must be able to identify and find

objects of varied sizes, shapes, and orientations in a variety of environments for difficult tasks

such as object detection. Large training datasets may assist models encounter a wide range of

object instances, backgrounds, and lighting situations, allowing them to learn powerful

features and generalise effectively to new data. Training on vast and diverse datasets is

critical for increasing the accuracy and generalisation capabilities of deep learning models.

The goal of data augmentation approaches is to artificially enhance the size and variety of the

training dataset, which improves the performance and generalisation of deep learning models.

The primary objective is to provide variations in the training data that simulate real-world

scenarios that the model is likely to experience upon deployment. Geometric transformations,

for example, allow replicating diverse views, orientations, and positions of objects in images, which can help the model acquire more resilient and invariant features. Colour space

transformations, on the other hand, help in simulating changes in lighting conditions and

atmospheric effects that can impact the visual appearance of objects in images. Through the

introduction of such variances, the model may learn to recognise and adjust to these

differences during training, resulting in improved performance on unseen data. Image

filtering techniques helps in the smoothing out of noise and the reduction of image artifacts,

enhancing image clarity and quality This can assist the model in learning more relevant and

discriminative features, resulting in improved classification and detection accuracy.

As a result, horizontal flipping is used on the 1628 original images in the training set to

expand the dataset and improve the model's robustness and performance on tomato fruit at

different angles. Figure 4.12 depicts the original images as well as their flipped variants

following the data augmentation strategy. The number of images in the training set rose to

3256 as a result of the above method.
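A minimal sketch of this flip applied to an image together with the matching update to its YOLO-format label file (class, x-centre, y-centre, width, height in normalised coordinates); the file paths are placeholders.

```python
# Horizontal flip of one image plus the corresponding YOLO label update:
# only the x-centre of each box changes, mirrored about the image midline.
import cv2

def flip_example(image_path, label_path):
    image = cv2.imread(image_path)
    flipped = cv2.flip(image, 1)            # 1 = horizontal flip
    new_labels = []
    with open(label_path) as f:
        for line in f:
            cls, x, y, w, h = line.split()
            x = 1.0 - float(x)              # mirror the box centre's x only
            new_labels.append(f"{cls} {x:.6f} {y} {w} {h}")
    return flipped, new_labels
```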

Figure 4.12 : Augmented Images (a) Original Image (b) Flipped Image

4.7 Proposed YOLOv5 Model:

This section goes into detail about the proposed enhanced YOLOv5 architecture, as seen in

Figure 4.13, including the attention mechanism and distance intersection over union. Because the Convolutional Block Attention Module (CBAM) was designed to enhance critical features, the proposed CAM-YOLO inserts it into the network structure after each feature fusion, i.e., after "add" and "concat" operations and before the detection head.

Figure 4.13: Modified YOLOv5 Architecture

Attention mechanism

The attention mechanism extracts a small amount of meaningful information from a large quantity of data and concentrates on it, ignoring the rest of the unimportant

information. The proposed algorithm CAM-YOLO employs a Convolutional Block Attention

Module (CBAM), which is divided into two sub-modules: spatial and channel modules.

Figure 4.14 displays the CBAM structure. The channel attention module tries to capture

''what'' is relevant in the provided images, whereas the spatial attention module focuses on

''where,'' or which region of an image is significant (spatial).

Figure 4.14: Convolutional Block Attention Module Structure

As illustrated in Figure 4.15, the channel attention module first aggregates the spatial data of

the given input feature map in two dimensions (height and width, respectively) using global

max pooling and global average pooling. The multilayer perceptron then processes the output

with the help of a shared fully connected layer. Finally, using the sigmoid activation

operation, the final channel attention feature is built and fed as input to the spatial attention

module. Equation 4.6 is used to compute channel attention.

M_CA(IF) = σ(MLP(AvgPool(IF)) + MLP(MaxPool(IF)))    (4.6)

Figure 4.15 : Channel Attention Module

The spatial attention module is presented after the channel attention module. As shown in

Figure 4.16, the spatial attention module conducts max and mean pooling in the channel

dimension. First, max-pooling and average pooling are performed on the convolution

module's input, and then the max-pooled and average-pooled tensors are concatenated. The

visual cues in images are then activated using a convolution operation with a kernel of size (7

x 7) and a sigmoid activation function. The spatial attention map MSA(IF) is computed using

equation 4.7.

M_SA(IF) = σ(f^(7×7)([MaxPool(IF); AvgPool(IF)]))    (4.7)

Figure 4.16: Spatial Attention Module
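A compact PyTorch sketch of the CBAM block defined by equations 4.6 and 4.7, applying channel attention followed by spatial attention; the reduction ratio of 16 is a conventional assumption rather than a value taken from this thesis.

```python
# CBAM sketch: channel attention (eq. 4.6) then spatial attention (eq. 4.7).
import torch
import torch.nn as nn

class CBAM(nn.Module):
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.mlp = nn.Sequential(           # shared MLP for both poolings
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )
        self.conv = nn.Conv2d(2, 1, kernel_size=7, padding=3)

    def forward(self, x):
        b, c, _, _ = x.shape
        # channel attention: global average and max pooling, shared MLP, sigmoid
        avg = self.mlp(x.mean(dim=(2, 3)))
        mx = self.mlp(x.amax(dim=(2, 3)))
        x = x * torch.sigmoid(avg + mx).view(b, c, 1, 1)
        # spatial attention: channel-wise mean/max maps, 7x7 conv, sigmoid
        s = torch.cat([x.mean(dim=1, keepdim=True),
                       x.amax(dim=1, keepdim=True)], dim=1)
        return x * torch.sigmoid(self.conv(s))
```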

The proposed algorithm retains the YOLOv5 backbone network's original network structure

while extracting the features from its three feature layers and sending them to the head

network. It then incorporates CBAM in the head to strengthen the network as it transmits

from the shallow layer to the deep layer by enhancing the attention of the given feature map.

The network can learn the target's feature information more efficiently, capture the target's

recognition features more accurately in the same test image, and achieve a better recognition

effect without increasing the training cost by learning meaningful feature maps, particularly

for small target features.

Improved non-max suppression.

The goal of non-max suppression in object detection algorithms is to select the most optimal

bounding box for the object while suppressing all other boxes. Traditionally, non-max

suppression computes the IoU between the detection box with the best score and the other

boxes and deletes the boxes with IoU greater than the specified threshold. Because it is based

on the basic IoU, the NMS sometimes mistakenly discards occluded elements. Furthermore,

to improve the missed detection scenario, Distance IoU (DIoU) is integrated with NMS.

When two boxes have a substantial IoU but their centre points are far apart, they likely cover two different objects, and so neither box should be eliminated. As stated in equation 4.8, DIoU considers the distance

between the prediction's centre point and the real bounding box, as well as the overlap rate, to

make the regression more consistent.

DIoU = IoU − ρ²(b, b_gt) / c²    (4.8)

where b and b_gt denote the centre points of the predicted box B and the ground-truth box B_gt, ρ(·) denotes the Euclidean distance, and c is the diagonal length of the smallest box enclosing both boxes. IoU is defined in equation 4.9.

IoU = |B ∩ B_gt| / |B ∪ B_gt|    (4.9)

The DIoU definition can be formulated as shown in equation 4.10

s_i = { s_i,  if DIoU(M, B_i) < ε ;  0,  if DIoU(M, B_i) ≥ ε }    (4.10)

where s_i denotes the confidence score, M denotes the box with the highest confidence score, and B_i ranges over the remaining boxes of the current class. In comparison to plain IoU, DIoU considers the centre points of the two boxes, which helps resolve the occlusion problem caused by objects in close proximity. NMS evaluated with DIoU is therefore more realistic and considerably more effective.
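The following sketch combines equations 4.8-4.10 into a DIoU-based NMS routine, reusing the iou() helper from Section 4.5; the box format and the threshold value are illustrative assumptions.

```python
# DIoU (eq. 4.8) and DIoU-NMS (eq. 4.10) for boxes given as (x1, y1, x2, y2).
def diou(box_a, box_b):
    # squared distance between the two box centres (rho^2)
    ax, ay = (box_a[0] + box_a[2]) / 2, (box_a[1] + box_a[3]) / 2
    bx, by = (box_b[0] + box_b[2]) / 2, (box_b[1] + box_b[3]) / 2
    rho2 = (ax - bx) ** 2 + (ay - by) ** 2
    # squared diagonal of the smallest enclosing box (c^2)
    cw = max(box_a[2], box_b[2]) - min(box_a[0], box_b[0])
    ch = max(box_a[3], box_b[3]) - min(box_a[1], box_b[1])
    c2 = cw ** 2 + ch ** 2
    return iou(box_a, box_b) - rho2 / c2

def diou_nms(boxes, scores, eps=0.5):
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        m = order.pop(0)                # current highest-scoring box M
        keep.append(m)
        # suppress only boxes whose DIoU with M reaches the threshold (4.10)
        order = [i for i in order if diou(boxes[m], boxes[i]) < eps]
    return keep
```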

4.8 Performance of Proposed model:

An improved YOLOv5 network model for tomato fruit detection is developed on 3256

tomato images using the pre-trained YOLOv5s model. Table 4.1 describes the experimental

environment for tomato fruit detection module training. The experiment is carried out using

the Google Colaboratory platform with Python 3.10.12, PyTorch 2.0.1, CUDA version 11.8,

and a Tesla T4 GPU with a memory capacity of 16GB. Google Colaboratory offers a free

cloud computing environment with GPU capability, allowing for fast deep learning model

training. PyTorch is a popular deep learning framework that allows for the efficient

implementation of neural network models. CUDA version 11.8 is used to accelerate the

training and inference processes by using the capability of GPU computation. The Tesla T4

GPU is well-suited for deep learning applications due to its excellent performance.

Table 4.1 Experimental Environment Configuration

Platform                Python     PyTorch    CUDA    GPU
Google Colaboratory     3.10.12    2.0.1      11.8    Tesla T4 (15101 MB)

Table 4.2 displays the hyperparameter values of the improved YOLOv5 algorithm used in the

tomato fruit detection module. The table contains specific information on several

hyperparameter variables that were adjusted throughout the training process, such as learning

rate, batch size, and momentum. It also includes the parameters for the number of iterations,

epochs, and steps to reduce the learning rate. The model was trained for 100 epochs using a

batch size of 16 and an image size of 416. A learning rate of 0.01, momentum of 0.937, and

weight decay of 0.0005 are applied to the SGD optimizer. Warmup epochs, momentum, and

bias learning rate were all set to 3.0,0.8, and 0.1, respectively. With a maximum detection

number of 1000, the IoU threshold for non-maximum suppression was set to 0.7 and the

confidence threshold to 0.001. Box, class, and objectness scale hyperparameters were also

significant and were set at 7.5, 0.5, and 1.0, respectively. In addition, data augmentation with

various parameters such as hue, saturation, and brightness is applied during training, and

mosaic was used to combine multiple images. Hyperparameters are critical to the algorithm's

performance, and fine-tuning them can result in higher detection accuracy. As a result, the

updated YOLOv5 algorithm can achieve high detection accuracy for tomato fruit in various

complex scenarios by using these hyperparameters.

Table 4.2 Hyperparameter Values for Training

Parameter         Value
No. of epochs     100
Image Size        416 x 416
Batch Size        16
Optimiser         SGD
Learning Rate     0.01
Momentum          0.937
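For reference, a training run matching these settings could be launched from inside a cloned ultralytics/yolov5 repository as sketched below; the dataset configuration file tomato.yaml (listing the train/validation paths and the two classes) is an assumed name, not a file from the repository.

```python
# Hedged sketch: launching YOLOv5 training with the Table 4.2 settings.
# Equivalent CLI: python train.py --img 416 --batch 16 --epochs 100 \
#                                 --data tomato.yaml --weights yolov5s.pt
import train  # yolov5/train.py, importable from the repository root

train.run(data='tomato.yaml', imgsz=416, batch_size=16,
          epochs=100, weights='yolov5s.pt')
```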

Figure 4.17 illustrates the ground truth of the labelled tomato fruit results on the validation

dataset, which have been carefully curated to serve as a baseline for measuring the detection

model's performance. As shown, tomato fruits vary significantly in size, shape, orientation,

and environmental context, making detection a difficult process.

Figure 4.17 Labelled Results on the Validation set

Nonetheless, Figure 4.18 shows the detection results of the proposed model on the same

tomato fruit validation dataset. The proposed model performed well in detecting most of the

tomatoes in different scenarios, including those of diverse sizes and overlapping with one

another, as indicated by the high degree of overlap between the predicted bounding boxes and

the ground truth annotations. It also displayed exceptional generalisation ability while

identifying tomatoes of various sizes and orientations. A comparison of Figures 4.17 and 4.18

demonstrates the effectiveness of the proposed model in recognising tomato fruits in real-

world scenarios.

Figure 4.18 Prediction Results of Proposed Model on the validation set

Following the completion of 100 training epochs, the performance of the proposed detection

model is evaluated by plotting its precision-recall (PR) curve, as shown in Figure 4.19. The

PR curve displays the trade-off between precision and recall at different confidence

thresholds. Recall represents the proportion of true positive samples correctly detected by the
model, whereas precision represents the ratio of true positive samples among the detected

results. The PR curve's area under the curve (AUC) represents the model's performance,

which was 0.868 in this case. At low recall the model's precision is relatively high, owing to the high proportion of true positives among the detections. However, as recall grew, precision steadily decreased due to an increase in the number of false positives detected by the model. Furthermore, the model's performance is assessed using the balance point on the PR curve, the point at which recall and precision are equal. The model had a high balance point, suggesting that it achieved a good balance between recall and precision and accurately detected positive samples. In conclusion,

the proposed detection module performed well on the PR curve after training, accurately

detecting positive samples.

Figure 4.19: P-R Curve of Proposed Model

Figure 4.20 showcases three loss functions and four evaluation metrics. The loss

function is composed of bounding box regression loss, objectness loss, and classification loss.

The bounding box loss measures the model's accuracy in locating the object's center and

coverage of predicted bounding boxes. It comprises position offset and scale change, where

position offset refers to the deviation between predicted and ground truth bounding boxes,

and scale change indicates the scale ratio between predicted and ground truth bounding

boxes. Objectness loss measures the likelihood of an object existing in a proposed region of

interest. Each predicted bounding box has an objectness score, which indicates whether or not

an object exists. The objectness loss is based on the binary cross entropy loss function, with a

predicted bounding box containing a true target having an objectness score near to one and

vice versa. Because the data only contains one category, the classification loss stays zero.

Four evaluation metrics are shown in Figure 4.20: Precision (P), Recall (R), mAP_0.5, and

mAP_0.5:0.95. Precision assesses the accuracy of the predicted bounding boxes, whereas Recall measures the proportion of ground-truth boxes correctly predicted. The average precision at an IoU threshold of 0.5 is

mAP_0.5, whereas the average mAP at multiple IoU thresholds ranging from 0.5 to 0.95 with

a step size of 0.05 is mAP_0.5:0.95. After around 50 epochs of training, the validation data

shows a rapid drop in box and objectness losses.

Figure 4.20: Evaluation Metrics of the Proposed Model

Table 4.3 compares the performance of the YOLO models on the collected dataset, including their training epochs, image sizes, precision, recall, mAP@0.5, and mAP@[0.5:0.95]. Among all the models, the proposed model achieved the highest precision of 87.3%, recall of 86.9%, mAP@0.5 of 88.1%, and mAP@[0.5:0.95] of 44.7%.

Table 4.3: Results of Three Models

Algorithm                            Epochs    Image Size    mAP@0.5 (%)    mAP@[0.5:0.95] (%)    Precision (%)    Recall (%)
YOLOv5                               100       416 x 416     85.9           39.1                  84.9             80.1
YOLOv5 + BottleneckCSP (backbone)    100       416 x 416     86.9           41.0                  85.4             81.3
CAM-YOLO                             100       416 x 416     88.1           44.7                  87.3             86.9

Figure 4.21 compares the precision, recall, and mAP@0.5 of the proposed CAM-YOLO model with the other detection models. The results show that the improved detection model outperforms the others, reaching a mean average precision at an intersection-over-union threshold of 0.5 of 88.1%, an improvement over the prior model. The results are displayed as a graph in Figure 4.22.

[Bar chart "Performance Results": mAP@0.5 %, Precision %, and Recall % (y-axis in percent) for YOLOv5, YOLOv5 (BottleneckCSP), and CAM-YOLO]

Figure 4.21: Performance metrics of the three models

Figure 4.22: Performance metrics of the three models

4.9 Evaluation of detection results

To evaluate the model's performance, many images from the dataset were chosen for testing,

and the detection performance of the CAM-YOLO and standard YOLOv5 models under

various circumstances is displayed in the figures below. The identification of overlapping

tomatoes is demonstrated in Figure 4.23: tomatoes occluded or overlapped by branches or other tomatoes are not detected by YOLOv5 in Figure 4.23b, while the overlapped

tomato is detected by CAM-YOLO in Figure 4.23c. Figure 4.24 shows the identification of

small tomatoes, where the small tomato is not detected by the YOLOv5 model in Figure

4.24b but is detected by the CAM-YOLO model in Figure 4.24c.

Figure 4.23: Results of identification of Overlapped Tomatoes

(a) Original Image (b) YOLOv5s (c) Improved YOLOv5 model

Figure 4.24: Results of identification of small Tomatoes

(a) Original Image (b) YOLOv5s (c) Improved YOLOv5 model

4.10 Summary:

This chapter proposes CAM-YOLO, a tomato detection and classification algorithm based on the YOLOv5 model. First, the CBAM attention module is added after each feature fusion, allowing the network to extract salient information quickly. The algorithm then uses Distance Intersection over Union (DIoU) with non-max suppression (NMS) to decrease the rate of missed detections of overlapping tomatoes. The mAP@0.5 of the proposed algorithm is 88.1%, an improvement over the YOLOv5 model. The proposed CAM-YOLO is also effective in addressing the low accuracy and the missed target detections caused by overlap and occlusion. To summarise, the CAM-YOLO model

surpasses the traditional YOLOv5 when it comes to detecting tiny and dense targets, resulting

in improved accuracy, detection, and identification.

Chapter 5

Detection and Classification of Tomato Fruit Diseases

using Ensemble Method

In this chapter, a deep learning-based ensemble model for the detection and classification of tomato fruit diseases is presented. First, pre-trained models including VGG16, ResNet50, DenseNet121, InceptionV3 and Xception were trained on the dataset using a transfer learning and fine-tuning strategy. The proposed ensemble is compared with State-of-The-Art (SOTA) models to examine its performance on the collected dataset. An average accuracy of 98.54% is obtained by the average ensemble model on the tomato dataset.

5.1 Introduction

According to the Food and Agriculture Organisation (FAO), diseases and pests destroy up to

40% of the world's fruit crops [131]. Plant disease may severely impact the quality and

quantity of plant production. Plant diseases occur naturally and frequently, which is why disease detection in plants is so crucial in farming. Diseases cause both direct and indirect financial

losses for farmers. Plant disease classification is imperative since it helps in establishing an

appropriate management step towards limiting the disease's further spread after the disease is

accurately recognised and treated appropriately. The advancement of machine learning and

deep learning approaches in plant disease identification has been remarkable, representing a

huge accomplishment in research.

The tomato is recognised as a significant commercial and nutritious vegetable crop. It is

a protective supplementary food. Because of its short growing duration and high yield, this crop is critical from an economic aspect, and as a result, the area under cultivation is expanding

every day. Tomato is a fundamental component in many preserved dishes, including ketchup,

sauce, chutney, soup, paste, and puree. It is sensitive to a number of diseases, and this

condition has a substantial negative influence on tomato quality and yield, resulting in large

financial losses.

The majority of tomato diseases cause colour changes or various spots on the soft

tissue or flesh on the exterior of the fruit. Both biotic and abiotic agents can damage tomato fruit. Insects, bacteria, fungi, and viruses are examples of living (biotic) agents [132]. Non-living abiotic factors include a variety of atmospheric conditions

such as rapid temperature changes, too much moisture, nutrient deficiency, acidic soil, and

high humidity levels [133].

5.2 Background

The use of an image processing approach for disease detection [134] has proven effective: detection works better when image processing is applied. In specific circumstances and crops, the current trend of classifying plant diseases using different Machine Learning (ML) algorithms has produced positive results [135]. On the basis of computer image processing, rapid and precise approaches for identifying plant diseases can be created. Results from deep learning (DL)

techniques are beginning to rival those from shallow learning algorithms. Plant diseases may

be faster and more precisely identified using deep learning networks.

The introduction of deep learning is an excellent technique for improving plant

disease detection. Such deep learning models are well suited to learning from computer-vision-related datasets. Many deep learning algorithms exist and are

utilised for a variety of purposes. Deep learning models are utilised to optimise plant disease

detection and highlight the benefits of hyperspectral images on different scales for plant

protection and disease detection [136]. However, it is understood that Convolutional Neural Networks (CNN) are a good choice for processing plant leaf and fruit images in order to detect and further classify disease.

The authors of [137] used a convolution neural network-based approach to detect

disease in tomato leaves. This model is made up of three convolution layers, three max-pooling layers, and two fully connected layers. According to experimental results, the

proposed model outperforms the pre-trained models VGG16, InceptionV3, and MobileNet.

The recommended model's average classification accuracy for the 9 disease and 1 healthy

classes is 91.2%, ranging from 76% to 100% depending on the class. However, the model

might be improved by utilising a bigger number of images with various cropping.

Furthermore, because the testing accuracy is low, it is necessary to improve the same model

on the same dataset. According to a recent research, DenseNet121 outperforms cutting-edge

pre-trained models such as ResNet50, VGG16, and InceptionV4 in terms of classification

accuracy using the PlantVillage dataset. Using transfer learning approaches, the model's

performance was evaluated based on classification accuracy, sensitivity, specificity, and F1-

score [138].

CNN models may be hosted on the internet for simple access and used directly by

farmers to identify diseases damaging their crops. Farmers can salvage the fruits if they spot

the disease early on. Even if the disease has spread to a portion of the fruit crop, it would be

good to identify it so that the disease may be stopped from spreading further.

Deep learning CNN approaches have shown considerable improvements in plant

disease classification performance. Rapid and exact detection of the disease's severity will

assist in lowering yield losses. The research here focuses on the classification of tomato fruit

disease. Six classes are examined here, including a healthy class and five tomato fruit

diseases. Anthracnose (AN), Bacterial Canker (BC), Bacterial Speck (BS), Blossom End Rot

(BER), Ghostspot (GS), and Healthy (HT) are the tomato fruit diseases analysed here.

5.2.1 Common Diseases of Tomatoes

Anthracnose

Anthracnose is typically a problem on mature (or overripe) fruit, although it can also

infect leaves, stems, and roots. Fruit that is not yet ripe may become infected as well;

however, symptoms may not develop until the fruit begins to mature. Disease development is

aided by moist circumstances. The pathogens can be disseminated by splashing water from

rain or overhead irrigation.

Symptoms: Figure 5.1 depicts the development of small, slightly depressed, circular spots on

mature fruit. These lesions may grow in size, become sunken, and merge together. Small

black patches (microsclerotia) eventually develop in the tan cores of lesions. When moist

circumstances exist, salmon-colored spores can be seen in masses on the surfaces of lesions.

Figure 5.1 Anthracnose

Bacterial canker

Bacterial canker can be a serious disease in greenhouse-grown tomatoes. The disease

is seedborne and may persist on surfaces and production supplies (stakes, trays) as well as in

infected plant debris and weed hosts. The disease can also be transported from plant to plant

on workers' hands, as well as through water splashing and pruning.

Symptoms: Tomatoes infected with the bacterial canker pathogen can exhibit a wide range of

symptoms. Wilting is the most visible symptom. Wilting normally begins in the bottom half

of the plant and progresses upward; however, wilting can occur at the point of pathogen

beginning when plants are injured. Fruit may have little (1/4 inch) creamy white dots with tan

or brown centres (called a bird eye mark). Fruit surface may appear netted or marbled as

shown in Figure 5.2

Figure 5.2 Bacterial Canker

Bacterial speck

Bacterial speck is a dangerous tomato disease that may be difficult to treat when disease incidence is high and climatic circumstances are favourable. Disease growth is helped by high humidity and cool temperatures. The pathogen is seedborne and can also

be spread through splashing water, infected tools and equipment, and personnel. If conditions

are favourable, the disease can also live from season to season in agricultural waste.

Symptoms: On leaflets, circular, dark brown to black lesions occur; a yellow halo may form

around lesions over time. Fruit can also get dark lesions. These fruit lesions may be

surrounded by a dark green halo shown in Figure 5.3.

Figure 5.3 Bacterial Speck

Blossom end rot

Blossom end rot is an environmental (not fungal) problem that is most typically

caused by uneven watering or a calcium deficit. This common garden "disease" is frequently

caused by excessive fertiliser, high salt levels, or dryness.

Symptoms: Blossom end rot symptoms appear on both green and ripe fruits and can be

recognised by water-soaked regions on the bottom end that gradually enlarge and mature into

sunken, brown, leathery spots as illustrated in Figure 5.4.

Figure 5.4 Blossom End Rot

Ghostspot

Ghost spot is a prevalent disease that may spread quickly on tomatoes grown in

confined structures. The pathogen prefers high humidity and cool temperatures, and it needs

free moisture to germinate its spores. Spores can be disseminated by wind and air movement.

Symptoms: Figure 5.5 shows fruit with faint, pale halos (3 to 8 mm in diameter). They are

white on immature fruit and yellow on matured fruit. A small necrotic fleck may occur

alongside the halo. Spots seldom go further, but a shift in favourable conditions causes Ghost

Spot to progress to fruit rot.

Figure 5.5 Ghostspot

Buckeye rot

Buckeye rot is a widespread tomato disease in the southeast. The development of

disease is helped by high humidity and warm temperatures. The disease is especially severe

in damp soils. Water that has been infected or that has been splashed can transmit the

infection.

Symptoms: On diseased fruit, a dark, greasy lesion appears. Lesions increase with time and

may cover a considerable area of the fruit. Concentric rings are often observed within the

lesion depicted in Figure 5.6. When the lesion is exposed to moisture, a white, cottony fungal

growth may form on the surface. Infected fruits are frequently in direct touch with the earth

or are low to the ground.

Figure 5.6 Buckeye Rot

5.2.2 Deep Learning Models

Deep learning facilitates feature extraction from the input image [139]. It is extremely

accurate and fast in its ability to resolve challenging problems. By modifying the layers and

how they are combined in the model, the accuracy of the model may be increased. Due to

their promising results, deep learning networks have been widely used in numerous

areas[140,141]. The classification of tomato fruit disease detection and classification for six

classes is done in the proposed research utilising VGG16, ResNet50, DenseNet121,

InceptionV3 and Xception pre-trained models.

VGG16

The VGG16 model [142], also referred to as the VGGNet, has a depth of 16 layers.

The VGG16 convolutional neural network model was presented by K. Simonyan and A.

Zisserman from the University of Oxford in the publication "Very Deep Convolutional

Networks for Large-Scale Image Recognition". On ImageNet, a dataset of 14 million images divided into 1000 classes, the model achieved a top-5 test accuracy of about 92.7%. It replaces large kernel-sized filters with stacks of 3x3 filters, outperforming AlexNet by a substantial margin, and was trained on Nvidia Titan Black GPUs over several weeks. Figure 5.7 depicts the 3x3 filters with a stride of 1 used in the 16-layer-deep VGG16 network. It accepts images of size 224 x 224.

Figure 5.7 VGG16 Architecture

ResNet50

A ResNet [143], also known as a residual neural network, is a sort of deep learning

model in which the weight layers learn residual functions with reference to the layer's inputs.

A residual network is one that contains skip connections for identity mappings and is added to

the layer outputs. One of the challenges addressed by ResNets is the famous vanishing

gradient. This is because, when the network is too deep, the gradients that are required to

calculate the loss function simply go to zero after a certain number of chain rule operations. As

a result, learning is not happening since the weights' values are never updated. However, the

skip connections from later layers to initial filters can allow gradients in ResNets to pass

through them directly.

Figure 5.8 depicts ResNet50, a ResNet model version of 48 Convolution layers, 1

MaxPool layer, and 1 Average Pool layer. The model also includes approximately 23 million

trainable parameters, indicating a deep architecture that improves image classification. The

ResNet model accepts images that are 224 x 224 in size. If you need to create a model from

scratch, you will need to collect a large amount of data and train it yourself. Using a pretrained

model is a highly effective method. There are other pretrained deep models to use, such as

VGG19, GoogleNet, or AlexNet, however the ResNet50 is famous for excellent generalisation

performance with lower error rates on recognition tasks and is therefore a useful tool to know.

Figure 5.8 ResNet50 Model Architecture

InceptionV3

The Inception network [144], also known as GoogleNet, was designed to overcome

the limitations of existing networks and to improve accuracy and speed. An inception

network is a deep neural network with an architectural layout made up of repeating elements

known as inception modules. Prior to its development, most common CNNs merely layered

convolution layers deeper and further in with the purpose of enhancing efficiency, but this

resulted in an overfitting issue. Figure 5.9 depicts the basic idea behind the inception module,

which is to conduct many operations with varying filter sizes (1x1, 3x3, 5x5) in parallel to

avoid any trade-offs. The inception model is made up of many inception modules.

Figure 5.9 Inception module

InceptionV3's architecture is seen in Figure 5.10. It features 42 layers and a lower error rate

than its prior version.

Figure 5.10 InceptionV3 Architecture

DenseNet

DenseNet [145] resembles ResNet with a few important differences. DenseNet

concatenates the output of the previous layer with that of the following layer, whereas

ResNet employs an additive technique (+) that combines the previous layer (identity) with the

future layer. It was developed particularly to overcome the decreasing accuracy caused by the

vanished gradient in high-level neural networks. Simply said, because of the longer path

between the input layer and the output layer, the information vanishes before reaching its

target. The DenseNet is organised into DenseBlocks, each with its own set of filters but

sharing the same dimensions. The transition layer applies batch normalisation and downsampling with 1x1 convolution and 2x2 pooling layers. DenseNet is available in

four variants depending on the number of layers: DenseNet121, DenseNet169, DenseNet201,

and DenseNet264. Figure 5.11 represents DenseNet121's architecture.

Figure 5.11 DenseNet121 Architecture
Xception

Xception [146], meaning "extreme inception", takes the essential principles of Inception to their logical conclusion. In Inception, the original input was compressed using 1x1 convolutions, and each depth space was then produced by a separate set of filters operating on that compressed input space. Xception reverses this stage: it applies the spatial filters to each depth map independently and only then uses a 1x1 convolution to mix information across the channels. As illustrated in Figure 5.12, its architecture is a linear stack of 36 depthwise separable convolution layers with residual connections, which serve as the network's feature extraction foundation. The network's image input size is 299 x 299.

Figure 5.12 Xception Architecture
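
A minimal sketch of such a block, assuming a TensorFlow/Keras implementation (filter counts are illustrative; Keras' SeparableConv2D performs the depthwise convolution followed by the 1x1 pointwise convolution):

```python
# A minimal sketch of an Xception-style block (assumed tf.keras API):
# depthwise separable convolutions with a residual (skip) connection.
from tensorflow.keras import layers

def xception_block(x, filters):
    # 1x1 strided convolution so the shortcut matches the block output shape
    shortcut = layers.Conv2D(filters, 1, strides=2, padding="same")(x)
    y = layers.SeparableConv2D(filters, 3, padding="same", activation="relu")(x)
    y = layers.SeparableConv2D(filters, 3, padding="same")(y)
    y = layers.MaxPooling2D(3, strides=2, padding="same")(y)
    return layers.Add()([shortcut, y])
```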
5.2.3 Ensemble Method

Deep learning models are incredibly versatile, capable of learning complicated relationships between variables and approximating any mapping function given sufficient resources. However, the models depend heavily on the exact training data used to train them, as well as on the random initial weights. As a result, the final model delivers different predictions each time the same model configuration is trained on the same dataset. This variability is undesirable when training a final model to make predictions on new data. The variance of the approach can be decreased by training several different models for the problem and combining their predictions. This method of combining multiple models is known as ensemble learning and is illustrated in Figure 5.13.

Figure 5.13 Ensemble Method

Ensemble modelling [147] is a process in which multiple independent models are developed to predict a result, either through the use of a number of modelling approaches or through the use of a variety of training data sets. The ensemble model then aggregates the predicted value of each base model, yielding a single final prediction for the unseen data. There are several types of ensemble methods, each with its own advantages and disadvantages: bagging (bootstrap aggregation), boosting, stacking (stacked generalisation), averaging, blending, and voting are all common ensemble approaches [148].

5.3 Proposed Approach

The detection and classification of tomato fruit disease performed in this research is explained here. The study focuses on the classification of diseases in tomato fruit, utilising the newly created average ensemble CNN model as well as VGG16, ResNet50, DenseNet121, InceptionV3 and Xception via transfer learning. Figure 5.14 depicts the workflow for disease detection and classification. The collected dataset is augmented, and the images are resized to the required size of 224 x 224. The dataset is then divided into training and test datasets. The proposed ensemble model, along with VGG16, ResNet50, DenseNet121, InceptionV3 and Xception, is trained on the training dataset for disease classification. The trained model is validated against the test data to predict the classes of new data.

Figure 5.14 Disease Detection and Classification Workflow

The input image size required by the pre-trained models is 224 x 224. Consequently, after augmentation the input images are resized to this size so that they match the models' expected input format. The flowchart for the proposed models for tomato disease classification and prediction is shown in Figure 5.15. For all the models, transfer learning is applied to the six classes of tomato fruit diseases: the top layer of each pre-trained model is replaced with a fully connected layer followed by a SoftMax classifier layer.

Figure 5.15 Disease Detection and Classification flowchart

5.4 Dataset

The images used for tomato disease detection and classification were obtained from plant communities [149], forestry images [150], and the internet. This study focuses on tomato fruit images from six classes of healthy and diseased tomatoes: Anthracnose, Bacterial Canker, Bacterial Speck, Blossom End Rot, Ghostspot, and Healthy. Figure 5.16 depicts a sample of each tomato class.

Figure 5.16 Sample Images

5.5 Data pre-processing and data augmentation

The dataset used for classification is too small to train a deep learning model such as VGG16, ResNet50, DenseNet121, InceptionV3, or Xception: the total number of images is 480 across the six classes, one healthy and five diseased. Data augmentation extends the dataset size many times over, which helps in training the deep network models. In this study, four augmentation approaches are utilised: rotation by 45°, horizontal and vertical flipping, and shifting, as illustrated in Table 5.1. After augmentation with this combination, the dataset has grown to 2400 images. Figure 5.17 shows a sample of images following augmentation.

Table 5.1 Total number of images before and after augmentation

Tomato Class        Original Images   After Rotation   After Flipping   After Shifting   Total Images
Anthracnose         80                80               160              80               400
Bacterial Speck     80                80               160              80               400
Bacterial Canker    80                80               160              80               400
Blossom End Rot     80                80               160              80               400
Ghostspot           80                80               160              80               400
Healthy             80                80               160              80               400

Figure 5.17 Augmented images: (a), (b), (c) original tomato images; (d) rotated image; (e) horizontally flipped; (f) vertically flipped

The images in the dataset must be resized to fit the deep learning model being used: the required input image size is 224 x 224, and the network cannot be fitted unless this input size is satisfied. The images used in this work were RGB, with pixel values between 0 and 255; values this large make the model harder to train. Hence, a 1/255 scaling factor was applied to all the images in the dataset, normalising all the values to the range 0 to 1.
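
A minimal sketch of this augmentation and rescaling pipeline, assuming Keras' ImageDataGenerator (the thesis does not name the library, and the shift fractions and directory layout are illustrative assumptions):

```python
# A minimal sketch of the augmentation and rescaling pipeline (assumed
# tf.keras API; shift fractions and paths are illustrative assumptions).
from tensorflow.keras.preprocessing.image import ImageDataGenerator

datagen = ImageDataGenerator(
    rescale=1.0 / 255,      # normalise pixel values from [0, 255] to [0, 1]
    rotation_range=45,      # random rotations up to 45 degrees
    width_shift_range=0.1,  # horizontal shifting
    height_shift_range=0.1, # vertical shifting
    horizontal_flip=True,
    vertical_flip=True,
)

# Hypothetical directory layout: one sub-folder per tomato class.
train_gen = datagen.flow_from_directory(
    "dataset/train",
    target_size=(224, 224),
    batch_size=16,
    class_mode="categorical",
)
```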
5.7 Methodology

Pre-trained CNN models, namely VGG16, ResNet50, DenseNet121, Xception, and InceptionV3, were chosen from among the SOTA CNN models to train the dataset in this study. The training dataset is used to train the CNN models VGG16, ResNet50, InceptionV3, DenseNet121, and Xception separately using transfer learning. Originally, all the pre-trained models ended in a classification layer with 1000 nodes; because only six classes are examined in this study, this last layer was replaced with a new head. The new head contains three layers: a global average pooling layer, a dense layer, and a softmax layer with six output nodes.
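
A minimal sketch of this head replacement, assuming a TensorFlow/Keras implementation and shown for VGG16 (the other backbones are handled identically; the dense layer's width is an illustrative assumption):

```python
# A minimal sketch of the transfer-learning head replacement (assumed
# tf.keras API). The 1000-class ImageNet top is dropped and replaced by
# global average pooling, a dense layer, and a six-way softmax.
from tensorflow.keras import layers, models
from tensorflow.keras.applications import VGG16

base = VGG16(weights="imagenet", include_top=False, input_shape=(224, 224, 3))
base.trainable = False  # keep the pre-trained convolutional features frozen

x = layers.GlobalAveragePooling2D()(base.output)
x = layers.Dense(256, activation="relu")(x)          # illustrative width
outputs = layers.Dense(6, activation="softmax")(x)   # six tomato classes
model = models.Model(base.input, outputs)
```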
An ensemble of deep neural networks generally outperforms a single model. In the present study, an average ensemble is proposed in which the same weight is assigned to each individual model. The final softmax outputs from all the models are averaged using equation 5.1:

$P = \frac{1}{N} \sum_{i=1}^{N} p_i$    (5.1)

where $N$ denotes the number of models and $p_i$ denotes the probability predicted by model $i$. This study explores five distinct SOTA architectures for training and ensembles every conceivable combination of them. The proposed average ensemble of the pre-trained models is depicted in Figure 5.18.

Figure 5.18 Proposed Average Ensemble Model
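
A minimal sketch of equation 5.1 in code, assuming NumPy and the Keras models sketched above (`trained_models` is a hypothetical list of the fine-tuned networks):

```python
# A minimal sketch of the average ensemble in equation (5.1): the softmax
# outputs of the trained models are averaged with equal weights.
import numpy as np

def ensemble_predict(trained_models, images):
    probs = np.stack([m.predict(images) for m in trained_models])  # (N, batch, 6)
    avg = probs.mean(axis=0)   # P = (1/N) * sum(p_i), equation 5.1
    return avg.argmax(axis=1)  # predicted class index per image
```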
5.7.1 Proposed framework Ensemble Deep Learning-Tomato Disease Detection (EDL-TDD)

Transfer learning-based pre-trained models including VGG16, ResNet50, DenseNet121, InceptionV3 and Xception are used to identify the various types of tomato fruit diseases. Each model is fine-tuned using the images of diseased and healthy tomato fruits from the dataset (Mi, Ni), where Mi are the images, each measuring 224 x 224, and Ni the corresponding labels, with Ni ∈ {Anthracnose, Bacterial Canker, Bacterial Speck, Blossom End Rot, Ghostspot, Healthy}. To increase the accuracy of the deep learning models by lowering the empirical loss, the training set is divided into mini-batches of 16. Several ensemble combinations, such as Ensemble(VGG16+ResNet50+InceptionV3), Ensemble(VGG16+DenseNet121+InceptionV3), Ensemble(VGG16+DenseNet121+Xception) and Ensemble(DenseNet121+InceptionV3+Xception), have been experimented with. These ensembles did better than the individual models because the multi-model ensemble averages the prediction values of all its member models and makes the final prediction from the average score; among them, Ensemble(DenseNet121+InceptionV3+Xception) obtained better results than the other combinations. The foundation of this model is the integration of DenseNet121, InceptionV3 and Xception. The proposed architecture is shown in Figure 5.19.

Figure 5.19 Proposed System Architecture

5.8 Results and Discussion

The experiment is run on the tomato dataset using a variety of pretrained CNN models and the proposed ensemble deep learning tomato disease detection (EDL-TDD) framework to assess performance. With a total of 1920 images for building the models and another 480 images for testing, performance is assessed on a dataset of six tomato classes split into 60% training, 20% validation, and 20% testing. The six tomato classes and the number of training, validation, and testing samples are listed in Table 5.2.

Table 5.2 Number of Samples

Tomato Class        No. of Training Samples   No. of Validation Samples   No. of Testing Samples   Total
Healthy             240                       80                          80                       400
Anthracnose         240                       80                          80                       400
Bacterial Canker    240                       80                          80                       400
Bacterial Speck     240                       80                          80                       400
Blossom End Rot     240                       80                          80                       400
Ghostspot           240                       80                          80                       400

The set of parameters that control the model's learning process is known as the hyperparameters. These include variables such as the optimizer, the number of layers and epochs, the activation functions, and the learning rate. The learning rate and momentum were fixed through experimentation after a number of attempts. Table 5.3 lists the hyperparameters employed in the pretrained models.

Table 5.3 Hyper Parameters

Hyperparameter   Value
No. of Epochs    25
Image Size       224 x 224
Batch Size       16
Learning rate    0.01
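
A minimal sketch of compiling and training a model with these hyperparameters, assuming a TensorFlow/Keras implementation (SGD and the momentum value are assumptions, since the thesis mentions momentum but does not name the optimizer; `model`, `train_gen` and `val_gen` refer to the hypothetical objects sketched earlier):

```python
# A minimal sketch of training with the hyperparameters in Table 5.3
# (assumed tf.keras API; optimizer choice and momentum value are assumed).
from tensorflow.keras.optimizers import SGD

model.compile(
    optimizer=SGD(learning_rate=0.01, momentum=0.9),  # momentum value assumed
    loss="categorical_crossentropy",
    metrics=["accuracy"],
)
history = model.fit(train_gen, validation_data=val_gen, epochs=25)
```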
To begin, the experiment is carried out by automating the disease detection of tomato fruits using the five pre-trained models VGG16, ResNet50, DenseNet121, InceptionV3, and Xception. During the experiment, the performance of each model is assessed using the metrics accuracy, precision, recall, and F1-score.

With 25,701,386 trainable and 14,714,688 non-trainable parameters and a depth of 16 layers, the pretrained model VGG16 attained an accuracy of 85%. ResNet50 obtained an accuracy of 83% with 102,771,722 trainable and 23,587,712 non-trainable parameters at 50 layers deep, while DenseNet121 achieved an accuracy of 93.96% with 51,387,398 trainable and 7,037,504 non-trainable parameters in 121 layers. With 52,435,974 trainable and 21,802,784 non-trainable parameters, InceptionV3 achieved an accuracy of 90.42%. The 71-layer-deep Xception model obtained 90% accuracy with 2,104,326 trainable and 20,861,480 non-trainable parameters. Table 5.4 displays the precision, recall, F1-score, and accuracy of the five SOTA models considered.

Table 5.4 Performance Values of SOTA models

S.No   Model         Precision (%)   Recall (%)   F1 Score (%)   Accuracy (%)
1      VGG16         87              85           85             85.00
2      ResNet50      82              79           79             83.00
3      DenseNet121   94              94           93             93.96
4      InceptionV3   91              90           90             90.42
5      Xception      85              83           83             90.00
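
The per-class metrics above can be computed, for example, with scikit-learn; a minimal sketch, assuming `y_true` and `y_pred` are hypothetical arrays holding the test labels and a model's predicted classes:

```python
# A minimal sketch of computing precision, recall, F1-score and accuracy
# (assumed scikit-learn API; y_true and y_pred are hypothetical arrays).
from sklearn.metrics import classification_report

print(classification_report(y_true, y_pred, digits=2))
```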
The performance of the five pre-trained models is depicted in Figure 5.20. The ResNet50 model performed the lowest, with 79% for both recall and F1-score and a classification accuracy of 83%. The VGG16, InceptionV3, and Xception models perform well on nearly all assessment criteria, with evaluation measures ranging from 85% to 90%. The DenseNet121 model exceeds all other classification models, obtaining approximately 94% accuracy, F1-score, precision, and recall.

Figure 5.20 Performance Values of Pre-trained models (grouped bar chart: precision, recall, F1-score and accuracy, in %, for each of the five models)

The overall classification accuracy of the five pre-trained models during training is shown in Figure 5.21. From the figure, it is observed that model accuracy improved gradually as the number of training epochs increased and finally stabilised, which indicates that the models were well trained. The loss of the models during training is shown in Figure 5.22.

Figure 5.21 Training Accuracy of pre-trained models

Figure 5.22 Training Loss of pre-trained models

To increase the accuracy of the results, these models are combined and trained using average ensemble learning. The resulting ensemble deep learning tomato disease detection (EDL-TDD) framework has been proposed and evaluated on the dataset by combining chosen pre-trained models. To evaluate the performance of EDL-TDD, the pre-trained models from Table 5.4 were used to construct the ensemble architectures. Initially, VGG16 and ResNet50 are merged with InceptionV3, and the performance of the proposed EDL-TDD is assessed on the dataset, yielding an accuracy of 91.24%. Later, VGG16 and DenseNet121 are merged with Xception, achieving an accuracy of 95.16%. When InceptionV3, DenseNet121, and Xception are combined, the proposed model achieves its best accuracy of 98.54%. The results are shown in Table 5.5 and depicted as a bar graph in Figure 5.23.

Table 5.5 Accuracy of Ensemble models

Method                              Accuracy (%)
VGG16+ResNet50+InceptionV3          91.24
VGG16+DenseNet121+InceptionV3       96.20
ResNet50+DenseNet121+Xception       92.90
VGG16+DenseNet121+Xception          95.16
DenseNet121+InceptionV3+Xception    98.54

Figure 5.23 Accuracy of Ensemble models (bar chart of the ensemble accuracies listed in Table 5.5)

Figure 5.24 shows the precision, recall and F1-score values obtained for each disease in the tomato dataset using the proposed average ensemble model (EDL-TDD).

Figure 5.24 Performance Values of tomato disease using Ensemble model (bar chart: precision, recall and F1-score, in %, for each tomato disease class)

Figure 5.25 depicts the confusion matrix of the proposed EDL-TDD model utilising DenseNet121, InceptionV3 and Xception; it shows that the proposed model successfully identified the majority of the tomato fruit disease types in each sample image. The confusion matrix conveys both the classification and the misclassification behaviour of the model: the diagonal elements show the correct classifications, and the off-diagonal elements show the misclassifications.

Figure 5.25 Confusion Matrix of Proposed Ensemble Model
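
A minimal sketch of producing such a confusion matrix, assuming scikit-learn and Matplotlib (`test_gen`, `trained_models` and `ensemble_predict` refer to the hypothetical objects sketched earlier; the test generator must be created with shuffle=False so labels align with predictions):

```python
# A minimal sketch of the confusion matrix in Figure 5.25 (assumed
# scikit-learn/matplotlib APIs; inputs are hypothetical objects from the
# earlier sketches).
import matplotlib.pyplot as plt
import numpy as np
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

y_true = test_gen.classes                     # requires shuffle=False
probs = np.mean([m.predict(test_gen) for m in trained_models], axis=0)
y_pred = probs.argmax(axis=1)

disp = ConfusionMatrixDisplay(
    confusion_matrix(y_true, y_pred),
    display_labels=list(test_gen.class_indices),
)
disp.plot()
plt.show()
```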
Summary

The tomato is a significant agricultural crop that is produced in large quantities, but it is highly susceptible to diseases, which lead to huge yield losses. Plant diseases affect the development of the corresponding crops, so it is crucial to detect them early. The yield and quality of crops are significantly impacted by several diseases, fungi, and insects. In this work, the pre-trained models VGG16, ResNet50, InceptionV3, DenseNet121 and Xception are used for the classification of tomato diseases. Despite the good performance of all the models, the proposed average ensemble deep learning approach produced the best results, using the DenseNet121, InceptionV3 and Xception pretrained models. Transfer learning and data augmentation techniques are used to fine-tune the parameters in the network.

References

1. Gavhale, M. and Gawande, U. (2014) ‘An Overview of the Research on Plant Leaves

Disease detection using Image Processing Techniques’, IOSR Journal of Computer

Engineering, 16, pp. 10–16. doi: 10.9790/0661-16151016.

2. Food and Agriculture Organization of the United Nations. “FAOSTAT” Crops and

Livestock Products. https://www.fao.org/faostat/en/#data/QCL

3. Food and Agriculture Organization of the United Nations. FAO—News Article:

Climate Change Fans Spread of Pests and Threatens Plants and Crops—New FAO

Study. https://www.fao.org/news/story/en/item/1402920/icode/ .

4. Gobalakrishnan, N.; Pradeep, K.; Raman, C.J.; Ali, L.J.; Gopinath, M.P. A Systematic

Review on Image Processing and Machine Learning Techniques for Detecting Plant

Diseases. In Proceedings of the 2020 International Conference on Communication

and Signal Processing (ICCSP), Chennai, India, 28–30 July 2020; pp. 0465–0468.

5. Background, "The global tomato processing industry," Tomato Neiss, June 2020.

[Online]. Available: http://www.tomatonews.com/en/background_47.html. .

6. Bjarnadottir, "Healthline," Healthline, 25 March 2019. [Online]. Available:

https://www.healthline.com/nutrition/foods/tomatoes#vitamins-and-minerals.

7. "Global Tomato Processing Market," Imarc, 11 July 2020. [Online]. Available:

https://www.imarcgroup.com/global-tomato-processing-market.

8. Hayashi, M., Ueda, Y., & Suzuki, H. (1988). Development of agricultural robot. In

Sixth Conference on Robotics (pp. 579-580).

9. Moltó, E., Pla, F., & Juste, F. (1992). Vision systems for the location of citrus fruit in a

tree canopy. Journal of Agricultural Engineering Research, 52 , 101-110.

10. Xin, H., & Shao, B. (2005). Real-time behavior-based assessment and control of

swine thermal comfort. In Livestock Environment VII, 18-20 (pp. 694).

11. Lowe, D. G. (2004). Distinctive image features from scale-invariant keypoints.

International Journal of Computer Vision, 60(2), 91-110.

12. Brady, C.J. (1987). Fruit ripening. Annual Review of Plant Physiology, 38, 155-178.

13. Giovannoni, J. (2001). Molecular biology of fruit maturation and ripening. Annual

Review of Plant Physiology and Plant Molecular Biology, 52, 725-749.

14. Lelièvre, J.M., Latché, A., Jones, B., Bouzayen, M., Pech, J.C. (1997). Ethylene and

fruit ripening. Physiologia Plantarum, 101, 727-739.

15. K. Prasad, S. Jacob, M.W. Siddiqui, Fruit maturity, harvesting, and quality standards,

in: Preharvest Modulation of Postharvest Fruit and Vegetable Quality, Academic

Press, 2018, pp. 41–69 .

16. S. Dargan, M. Kumar, M.R. Ayyagari, G. Kumar, A survey of deep learning and its

applications: a new paradigm to machine learning, Arch. Comput. Meth. Eng. 27 (4)

(2020) 1071–1092 .

17. T. Wang, J. Huan, M. Zhu, Instance-based deep transfer learning, in: 2019 IEEE

Winter Conference on Applications of Computer Vision (WACV), IEEE, 2019, pp.

367–375 .

18. I.N.C.E. Ahmet, M.Y. Çevik, K.K. Vursavu ş , Effects of maturity stages on textural

mechanical properties of tomato, Int. J. Agric. Biol. Eng. 9 (6) (2016) 200–206 .

19. Fahrentrapp, J., Ria, F., Geilhausen, M., & Panassiti, B. (2019). Detection of Gray

Mold Leaf Infections Prior to Visual Symptom Appearance Using a Five-Band

Multispectral Sensor. Frontiers in plant science, 10, 628.

20. Dubey, S. R., & Jalal, A. S. (2014). Adapted Approach for Fruit Disease Identification

Using Images. arXiv preprint arXiv:1405.4930

21. Mohanty, S. P., Hughes, D. P., & Salathe, M. (2016). Using Deep Learning for

ImageBased Plant Disease Detection. Front Plant Sci, 7, 1419.

doi:10.3389/fpls.2016.01419

22. Zhang, L & McCarthy, MJ 2012, 'Measurement and evaluation of tomato maturity

using magnetic resonance imaging', Postharvest Biology and Technology, vol. 67, pp.

37-43.

23. El-Bendary, Nashwa, et al. "Using machine learning techniques for evaluating tomato

ripeness." Expert Systems with Applications 42.4 (2015): 1892-1905.

24. Rafiq, Aasima, Hilal A. Makroo, and Manuj K. Hazarika. "Artificial Neural Network‐

Based Image Analysis for Evaluation of Quality Attributes of Agricultural

Produce." Journal of Food Processing and Preservation 40.5 (2016): 1010-1019.

25. Kaur, Kamalpreet, and O. P. Gupta. "A machine learning approach to determine

maturity stages of tomatoes." Oriental journal of computer science and

technology 10.3 (2017): 683-690.

26. Wan, Peng et al. “A methodology for fresh tomato maturity detection using computer

vision.” Comput. Electron. Agric. 146 (2018): 43-50.

27. M. B. Garcia, S. Ambat and R. T. Adao, "Tomayto, Tomahto: A Machine Learning

Approach for Tomato Ripening Stage Identification Using Pixel-Based Color Image

Classification," 2019 IEEE 11th International Conference on Humanoid,

Nanotechnology, Information Technology, Communication and Control,

Environment, and Management ( HNICEM ), Laoag, Philippines, 2019, pp. 1-6, doi:

10.1109/HNICEM48295.2019.9072892.

28. Wu, J.; Zhang, B.; Zhou, J.; Xiong, Y.; Gu, B.; Yang, X. Automatic Recognition of

Ripening Tomatoes by Combining Multi-Feature Fusion with a Bi-Layer

Classification Strategy for Harvesting Robots. Sensors 2019, 19, 612.

https://doi.org/10.3390/s19030612

29. Azarmdel, H, Jahanbakhshi, A, Mohtasebi, SS & Muñoz, AR 2020, 'Evaluation of

image processing technique as an expert system in mulberry fruit grading based on

ripeness level using artificial neural networks (ANNs) and support vector machine

(SVM)', Postharvest Biology and Technology, vol. 166, P. 111201.

30. Raghavendra, A, Guru, D, Rao, MK & Sumithra, R 2020, 'Hierarchical approach for

ripeness grading of mangoes', Artificial Intelligence in Agriculture, vol. 4, pp. 243-

252.

31. Hermana, A.N.; Rosmala, D.; Husada, M.G. Transfer Learning for Classification of

Fruit Ripeness Using VGG16. In Proceedings of the ICCMB 2021: 2021 The 4th

International Conference on Computers in Management and Business, Singapore, 30

January–1 February 2021; pp. 139–146. [Google Scholar] [CrossRef]

32. Rivero Mesa, A.; Chiang, J. Non-invasive Grading System for Banana Tiers using

RGB Imaging and Deep Learning. In Proceedings of the ICCAI 2021: 2021 7th

International Conference on Computing and Artificial Intelligence, Tianjin, China,

23–26 April 2021; pp. 113–118.

33. Huynh, D. P., Van Vo, M., Van Dang, N., & Truong, T. Q. (2021, March). Classifying

maturity of cherry tomatoes using Deep Transfer Learning techniques. In IOP

Conference Series: Materials Science and Engineering (Vol. 1109, No. 1, p. 012058).

IOP Publishing.

34. Sakunrasrisuay, Chinapat & Musikawan, Pakarat & Nguyen, Anh-Nhat & Kongsorot,

Yanika & Aimtongkham, Phet & So-In, Chakchai. (2021). Tomato Maturity

Classification: A Transfer Learning Approach. 411-416.

10.1109/ICSEC53205.2021.9684584.

35. R. Mishra, S. Goyal, T. Choudhury and T. Sarkar, "Banana ripeness classification

using transfer learning techniques," 2022 International Conference on

India, 2022, pp. 1-6, doi: 10.1109/IC3SIS54991.2022.9885244.

36. Begum, Ninja, and Manuj Kumar Hazarika. "Maturity detection of tomatoes using

transfer learning." Measurement: Food 7 (2022): 100038.

37. Arefi, A., Motlagh, A. M., Mollazade, K., & Teimourlou, R. F. (2011). Recognition

and localization of ripen tomato based on machine vision. Australian Journal of Crop

Science, 5(10), 1144–1149.

38. K. Yamamoto, W. Guo, Y. Yoshioka, and S. Ninomiya, "On Plant Detection of Intact

Tomato Fruits Using Image Analysis and Machine Learning Methods," Sensors, vol.

14, p. 12191, 2014

39. J. T. Xiong, X. J. Zou, H. X. Peng, W. Chen, and G. Lin, “Realtime identification and

picking point localization of disturbance citrus picking,” Transactions of the CSAM,

vol. 45, no. 8, pp. 38–43, 2014.

40. Zhao, Y.; Gong, L.; Huang, Y.; Liu, C. Robust Tomato Recognition for Robotic

Harvesting Using Feature Images Fusion. Sensors 2016, 16, 173. https://doi.

41. I. Sa, I.; Ge, Z.; Dayoub, F.; Upcroft, B.; Perez, T.; McCool, C. DeepFruits: A Fruit

Detection System Using Deep Neural Networks. Sensors 2016, 16, 1222.

https://doi.org/10.3390/s16081222

42. S. Bargoti and J. Underwood, "Deep fruit detection in orchards," 2017 IEEE

International Conference on Robotics and Automation (ICRA), Singapore, 2017, pp.

3626-3633, doi: 10.1109/ICRA.2017.7989417.

43. Xiong Juntao, Liu Zhen, Tang Linyue, Lin Rui, Bu Yibin, Peng Hongxing, Study on

the visual detection technology of green citrus in natural environment, Transactions of

the Chinese Society of Agricultural Machinery, 2018, 49(04): 45-52.

44. Fu Longsheng, Feng Yali, Elkamil Tola, Liu Zhihao, Li Rui, Cui Yongjie. Image

recognition method of multi-layered kiwifruit in field based on convolutional neural

network, Society of Agricultural Engineering,2018,34(02):205-211

45. Koirala, A., Walsh, K. B., Wang, Z., & McCarthy, C. (2019). Deep learning for real-

time fruit detection and orchard fruit load estimation: Benchmarking of

‘MangoYOLO’. Precision Agriculture, 20(6),1107–1135. doi:10.100711119-019-

09642-0

46. Fu, L.; Tola, E.; Al-Mallahi, A.; Li, R.; Cui, Y. A Novel Image Processing Algorithm

to Separate Linearly Clustered Kiwifruits. Biosyst. Eng. 2019, 183, 184–195.

47. Huang, Yi Hsuan, and Ta Te Lin. "High-throughput image analysis framework for

fruit detection, localization and measurement from video streams." 2019 ASABE

Annual International Meeting. American Society of Agricultural and Biological

Engineers, 2019.

48. C. Hu, X. Liu, Z. Pan and P. Li, "Automatic Detection of Single Ripe Tomato on

Plant Combining Faster R-CNN and Intuitionistic Fuzzy Set," in IEEE Access, vol. 7,

pp. 154683-154696, 2019, doi: 10.1109/ACCESS.2019.2949343.

49. Liu, G., Nouaze, J. C., Touko Mbouembe, P. L., & Kim, J. H. (2020). YOLO-tomato:

A robust algorithm for tomato detection based on YOLOv3. Sensors (Basel), 20(7),

2145. doi:10.339020072145

50. Widyawati, W, and Febriani, R (2021). Real-time detection of fruit ripeness using the

YOLOv4 algorithm. Teknika: Jurnal Sains dan Teknologi. 17, 205-

210. https://doi.org/10.36055/tjst.v17i2.12254

51. Jia, Weikuan, et al. "A fast and efficient green apple object detection model based on

Foveabox." Journal of King Saud University-Computer and Information

Sciences 34.8 (2022): 5156-5169.

52. Zhang, C.; Kang, F.; , Y. An Improved Apple Object Detection Method Based on

Lightweight YOLOv4 in Complex Backgrounds. Remote Sens. 2022, 14, 4150.

53. Xiao, F.; Wang, H.; Xu, Y.; Zhang, R. Fruit Detection and Recognition Based on Deep

Learning for Automatic Harvesting: An Overview and Review. Agronomy 2023, 13,

1625. https://doi.org/10.3390/agronomy13061625

54. James, Ginne M., and S. C. Punitha. "Tomato Disease Classification using Ensemble

Learning Approach." International Journal of Research in Engineering and

Technology 5.10 (2016): 104-108.

55. Sabrol, H & Satish, K 2016, ‘Tomato plant disease classification in digital images

using classification tree’, In 2016 International Conference on Communication and

Signal Processing (ICCSP), pp. 1242-1246.

56. Biswas Sandika, SaunshiAvil, Sarangi Sanat and PappulaSrinivasu, “Random forest

based classification of diseases in grapes from images captured in uncontrolled

environments,” In proceedings of IEEE 13th International Conference on Signal

Processing (ICSP), pp.1775- 1780, 2016.

57. J. Shijie, J. Peiyi, H. Siping and s. Haibo, "Automatic detection of tomato diseases

and pests based on leaf images," 2017 Chinese Automation Congress (CAC), Jinan,

China, 2017, pp. 2537-2510, doi: 10.1109/CAC.2017.8243388.

58. Sharif, Muhammad & Khan, Muhammad & Iqbal, Zahid & Azam, Faisal & Lali,

Muhammad Ikram & Javed, Muhammad. (2018). Detection and classification of

citrus diseases in agriculture based on optimized weighted segmentation and feature

selection. Computers and Electronics in Agriculture. 150.

10.1016/j.compag.2018.04.023.

59. Guo, X. Q., T. J. Fan, and Xin Shu. "Tomato leaf diseases recognition based on

improved multi-scale AlexNet." Trans. Chin. Soc. Agricult. Eng 35.13 (2019): 162-

169.

60. R. G. de Luna, E. P. Dadios, A. A. Bandala and R. R. P. Vicerra, "Tomato Fruit

Image Dataset for Deep Transfer Learning-based Defect Detection," 2019 IEEE

International Conference on Cybernetics and Intelligent Systems (CIS) and IEEE

Conference on Robotics, Automation and Mechatronics (RAM), Bangkok, Thailand,

2019, pp. 356-361, doi: 10.1109/CIS-RAM47153.2019.9095778.

61. Qimei Wang, Feng Qi, Minghe Sun, Jianhua Qu, Jie Xue, "Identification of Tomato

Disease Types and Detection of Infected Areas Based on Deep Convolutional Neural

Networks and Object Detection Techniques", Computational Intelligence and

Neuroscience, vol. 2019, Article ID 9142753, 15 pages, 2019.

https://doi.org/10.1155/2019/9142753

62. M. Gehlot and M. L. Saini, "Analysis of Different CNN Architectures for Tomato

Leaf Disease Classification," 2020 5th IEEE International Conference on Recent

Advances and Innovations in Engineering (ICRAIE), Jaipur, India, 2020, pp. 1-6, doi:

10.1109/ICRAIE51050.2020.9358279.

63. Ashraf, R.; Habib, M.A.; Akram, M.U.; Latif, M.A.; Malik, M.S.; Awais, M.; Dar,

S.H.; Mahmood, T.; Yasir, M.; Abbas, Z. Deep Convolution Neural Network for Big

Data Medical Image Classification. IEEE Access 2020, 8, 105659–105670

64. Zeng, Q.; Ma, X.; Cheng, B.; Zhou, E.; Pang, W. GANs-Based Data Augmentation

for Citrus Disease Severity Detection Using Deep Learning. IEEE Access 2020, 8,

172882–172891. [Google Scholar] [CrossRef]

65. Shihan Mao, Yuhua Li, You Ma, Baohua Zhang, Jun Zhou, and Kai Wang. 2020.

Automatic cucumber recognition algorithm for harvesting robots in the natural

environment using deep learning and multi-feature fusion. Comput. Electron. Agric.

170, C (Mar 2020). https://doi.org/10.1016/j.compag.2020.105254

66. Xiao JR, Chung PC, Wu HY, Phan QH, Yeh JA, Hou MT. Detection of Strawberry

Diseases Using a Convolutional Neural Network. Plants (Basel). 2020 Dec

25;10(1):31. doi: 10.3390/plants10010031. PMID: 33375537; PMCID: PMC7823414.

67. Gao, R.; Li, Q.; Sun, X. Intelligent Diagnosis of Greenhouse Cucumber Diseases

Based on Multi-Structure Parameter Ensemble Learning. Trans. Chin. Soc. Agric.

Eng. (Trans. CSAE) 2020, 36, 158–165,

68. H. Hong, J. Lin and F. Huang, "Tomato Disease Detection and Classification by Deep

Learning," 2020 International Conference on Big Data, Artificial Intelligence and

Internet of Things Engineering (ICBAIE), Fuzhou, China, 2020, pp. 25-29, doi:

10.1109/ICBAIE49996.2020.00012.

69. Gayatri Pattnaik, Vimal K. Shrivastava & K. Parvathi (2020) Transfer Learning-Based

Framework for Classification of Pest in Tomato Plants, Applied Artificial

Intelligence, 34:13, 981-993, DOI: 10.1080/08839514.2020.1792034

70. Luaibi, A.R., Salman, T.M., & Miry, A.H. (2021). Detection of citrus leaf diseases

using a deep learning technique. International Journal of Electrical and Computer

Engineering, 11, 1719-1727.

71. Zhang, N.;Wu, H.; Zhu, H.;Deng, Y.; Han, X. Tomato Disease Classification and

Identification Method Based on Multimodal Fusion Deep Learning. Agriculture 2022,

12, 2014. https://doi.org/10.3390/ agriculture12122014

72. Fruit Maturity—An Overview|ScienceDirect Topics. Available

online: https://www.sciencedirect.com/topics/agricultural-and-biological-

sciences/fruit-maturity (accessed on 2 November 2022).

73. W. Castro, J. Oblitas, M. De-La-Torre, C. Cotrina, K. Bazán, and H. Avila-George,

“Classification of Cape Gooseberry Fruit According to its Level of Ripeness Using

Machine Learning Techniques and Different Color Spaces,” IEEE Access, vol. 7, pp.

27389-27400, 2019, doi: 10.11 09/ACCESS .2019.2898223.

74. M. P. Arakeri, “Computer Vision Based Fruit Grading System for Quality Evaluation

of Tomato in Agriculture industry,” ProcediaComput. Sci., vol. 79, pp. 426-433, 2016,

doi: 10.1016/j.procs.2016.- 03.055.

75. A. Wajid, N. K. Singh, P. Junjun, and M. A. Mughal, “Recognition of ripe, unripe and

scaled condition of orange citrus based on decision tree classification,” in 2018

International Conference on Computing, Mathematics and Engineering Technologies:

Invent, Innovate and Integrate for Socioeconomic Development, iCoMET 2018 -

Proceedings, 2018, vol. 2018-Janua, pp. 1-4, doi: 10.1109/ICOMET.- 2018.8346354.

76. M. Khojastehnazhand, V. Mohammadi, and S. Minaei, “Maturity detection and

volume estimation of apricot using image processing technique,” Sci. Hortic.

(Amsterdam)., vol. 251, no. January, pp. 247- 251, 2019, doi:

10.1016/j.scienta.2019.03.033.

77. Vibhute, Anup, dan Bodhe, S.K. (2013) : Outdoor Illumination Estimation of Color

Images. IEEE, Communication and Signal Processing hal 331-334

78. Vijayalakshmi, M.; Peter, V.J. CNN based approach for identifying banana species

from fruits. Int. J. Inf. Technol. 2021, 13, 27–32. [Google Scholar] [CrossRef]

79. Ashtiani, S.-H.M.; Javanmardi, S.; Jahanbanifard, M.; Martynenko, A.; Verbeek, F.J.

Detection of Mulberry Ripeness Stages Using Deep Learning Models. IEEE

Access 2021, 9, 100380–100394. [Google Scholar] [CrossRef]

80. Brownlee, Jason. "A gentle introduction to transfer learning for deep

learning." Machine Learning Mastery 20 (2017).

81. Pan, Sinno & Yang, Qiang. (2010). A Survey on Transfer Learning. Knowledge and

Data Engineering, IEEE Transactions on. 22. 1345 - 1359. 10.1109/TKDE.2009.191.

82. Mohanty, Sharada P., David P. Hughes, and Marcel Salathé. "Using deep learning for

image-based plant disease detection." Frontiers in plant science 7 (2016): 1419.

83. Reyes, Angie K., Juan C. Caicedo, and Jorge E. Camargo. "Fine-tuning Deep

Convolutional Networks for Plant Recognition." CLEF (Working Notes) 1391 (2015):

467-475.

84. Simonyan, Karen, and Andrew Zisserman. "Very deep convolutional networks for

large-scale image recognition." arXiv preprint arXiv:1409.1556 (2014).

85. Shorten, C., Khoshgoftaar, T.M. A survey on Image Data Augmentation for Deep

Learning. J Big Data 6, 60 (2019). https://doi.org/10.1186/s40537-019-0197-0

86. Nwankpa, Chigozie Enyinna. "Advances in optimisation algorithms and techniques

for deep learning." Advances in Science, Technology and Engineering Systems

Journal 5.5 (2020): 563-577.

87. Johnson, J.M., Khoshgoftaar, T.M. Survey on deep learning with class imbalance. J

Big Data 6, 27 (2019). https://doi.org/10.1186/s40537-019-0192-5

88. Kanika, J. Singla and Nikita, "Comparing ROC Curve based Thresholding Methods in

Online Transactions Fraud Detection System using Deep Learning," 2021

International Conference on Computing, Communication, and Intelligent Systems

(ICCCIS), Greater Noida, India, 2021, pp. 9-12, doi:

10.1109/ICCCIS51004.2021.9397167.

89. P. Ganesh, Object Detection: Simplified [Online], Available at:

https://towardsdatascience.com/object-detection-simplified-e07aa3830954, [Assessed:

02/10/20]

90. Object (image processing) [Online], Available at:

https://en.wikipedia.org/wiki/Object_(image_processing) , [Assessed: 02/10/20]

91. Jonathan Long, Evan Shelhamer, and Trevor Dar-rell. “Fully convolutional networks

for semantic seg-mentation”. In: Proceedings of the IEEE Conference on Computer

Vision and Pattern Recognition. 2015,pp. 3431–3440.

92. Jennifer Mack, Christian Lenz, Johannes Teutrine, and Volker Steinhage. “High-

precision 3D detection and reconstruction of grapes from laser range data forefficient

phenotyping based on supervised learning”. In: Computers and Electronics in

Agriculture 135(2017), pp. 300–311

93. Zhenglin Wang, Brijesh Verma, Kerry B Walsh, PhulSubedi, and Anand Koirala.

“Automated mango flowering assessment via refinement segmentation”. In:Image and

Vision Computing New Zealand (IVCNZ),2016 International Conference on. IEEE.

2016, pp. 1–6.

94. A Gongal, S Amatya, Manoj Karkee, Q Zhang, and K Lewis. “Sensors and systems

for fruit detection and localization: A review”. In: Computers and Electronics in

Agriculture 116 (2015), pp. 8–19.

95. Edan, and Ohad Ben-Shahar. “Computer vision for fruit harvesting robots–state of the

art and challenges ahead”. In: International Journal of Computational Vision and

Robotics 3.1-2 (2012), pp. 4–34.

96. Pedro F Felzenszwalb, Ross B Girshick, David McAllester, and Deva Ramanan.

“Object detection with discriminatively trained part-based models”. In: IEEE

transactions on pattern analysis and machine intelligence 32.9 (2010), pp. 1627–1645

(cit. on p. 23).

97. Kunihiko Fukushima and Sei Miyake. “Neocognitron: A self-organizing neural

network model for a mechanism of visual pattern recognition”. In: Competition and

cooperation in neural nets. Springer, 1982, pp. 267–285 (cit. on p. 16).

98. C Wouter Bac, Eldert J Henten, Jochen Hemming, and Yael Edan. “Harvesting

Robots for High-value Crops: State-of-the-art Review and Challenges Ahead”. In:

Journal of Field Robotics 31.6 (2014), pp. 888–911.4

99. Ahmad Ostovar, Ola Ringdahl, and Thomas Hell-strom. “Adaptive Image

Thresholding of Yellow Peppers for a Harvesting Robot”. In: Robotics 7.1 (2018), p.

11.

100. DM Bulanon, TF Burks, and V Alchanatis. “Image fusion of visible and

thermal images for fruit detection”. In: Biosystems Engineering 103.1 (2009), pp. 12–

22.

101. Efi Vitzrabin and Yael Edan. “Adaptive thresholding with fusion using a

RGBD sensor for red sweet-pepper detection”. In: Biosystems Engineering 146

(2016), pp. 45–56.

102.Yongting Tao and Jun Zhou. “Automatic apple recognition based on the fusion of

color and 3D feature for robotic fruit picking”. In: Computers and Electronics in

Agriculture 142 (2017), pp. 388–396.

103.Murali Regunathan and Won Suk Lee. “Citrus fruit identification and size

determination using machine vision and ultrasonic sensors”. In: 2005 ASAE Annual

Meeting. American Society of Agricultural and Biological Engineers. 2005, p. 1

104.Yann LeCun, Yoshua Bengio, and Geoffrey Hinton.“Deep learning”. In: Nature

521.7553 (2015), pp. 436–444.

105.Sharada P Mohanty, David P Hughes, and Marcel Salath´e. “Using deep learning for

image-based plant disease detection”. In: Frontiers in plant science 7 (2016).

106.Inkyu Sa, Zongyuan Ge, Feras Dayoub, Ben Upcroft, Tristan Perez, and Chris

McCool. “Deepfruits: A fruit detection system using deep neural networks”. In:

Sensors 16.8 (2016), p. 1222.

107.Gutchess, M. Trajkovics, E. Cohen-Solal, D. Lyons and A. K. Jain, “A background

model initialization algorithm for video surveillance”, Proceedings of the Eighth

IEEE International Conference on Computer Vision. ICCV, Vancouver, BC, Canada,

Vol. 1, pp. 733-740, 2001.

108. Matthew D Zeiler, Rob Fergus, “Visualizing and Understanding Convolutional

Neural Networks”, ECCV 2014, Part I, LNCS 8689, pp. 818-833, 2014.

109.Sebe, N., Cohen, I., Garg, A., & Huang, T. S. (2005). Machine Learning in Computer

Vision (Vol. 29). Springer Science & Business Media.

110.Prince, S. J. (2012). Computer Vision: Models, Learning, and Inference. Cambridge

University Press.

111.Nixon, M., & Aguado, A. (2019). Feature Extraction and Image Processing for

Computer Vision. Academic Press.

112.Gould, S. (2012). DARWIN: A framework for machine learning and computer vision

research and development. The Journal of Machine Learning Research, 13(1), 3533-

3537.

113.Papageorgiou, C., & Poggio, T. (2000). A trainable system for object detection.

International Journal of Computer Vision, 38(1), 15-33.

114. Rezatofighi, H., Tsoi, N., Gwak, J., Sadeghian, A., Reid, I., & Savarese, S. (2019).

Generalized intersection over union: A metric and a loss for bounding box regression.

In Proceedings of the IEEE/CVF conference on computer vision and pattern

recognition (pp. 658-666).

115.Padilla, R., Netto, S. L., & Da Silva, E. A. (2020, July). A survey on performance

metrics for object-detection algorithms. In 2020 international conference on systems,

signals and image processing (IWSSIP) (pp. 237- 242). IEEE.

116.Girshick, Ross, et al. "Rich feature hierarchies for accurate object detection and

semantic segmentation." Proceedings of the IEEE conference on computer vision and

pattern recognition. 2014.

117.Girshick, Ross. "Fast r-cnn." Proceedings of the IEEE international conference on

computer vision. 2015.

118.Ren, Shaoqing, et al. "Faster r-cnn: Towards real-time object detection with region

proposal networks." Advances in neural information processing systems 28 (2015).

119.Liu, W., Anguelov, D., Erhan, D., Szegedy, C., Reed, S., Fu, C. Y., & Berg, A. C.

(2016). SSD: Single shot multibox detector. ECCV Conference.

120.Wang, Y., Xing, Z., Ma, L., Qu, A., & Xue, J. (2022). Object detection algorithm for

lingwu long jujubes based on the improved SSD. Agriculture, 12(9), 1456.

121.Redmon, J., Divvala, S., Girshick, R., & Farhadi, A. (2016). You Only Look Once:

Unified, real-time object detection. In IEEE Conference on Computer Vision and

Pattern Recognition (pp. 779-788).

122.Redmon, J., & Farhadi, A. (2017). YOLO9000: Better, faster, stronger. In IEEE

Conference on Computer Vision and Pattern Recognition (pp. 7263-7271).

123.Redmon, J., & Farhadi, A. (2018). YOLOv3: An incremental improvement. arXiv

preprint arXiv:1804.02767.

124.Bochkovskiy, Alexey, Chien-Yao Wang, and Hong-Yuan Mark Liao. "Yolov4:

Optimal speed and accuracy of object detection." arXiv preprint

arXiv:2004.10934 (2020).

125.M. Karthi, V. Muthulakshmi, R. Priscilla, P. Praveen and K. Vanisri, "Evolution of

YOLO-V5 Algorithm for Object Detection: Automated Detection of Library Books

and Performace validation of Dataset," 2021 International Conference on Innovative

Computing, Intelligent Communication and Smart Electrical Systems (ICSES),

Chennai, India, 2021, pp. 1-6, doi: 10.1109/ICSES52305.2021.9633834.

126.Lin, Tsung-Yi, et al. "Feature pyramid networks for object detection." Proceedings of

the IEEE conference on computer vision and pattern recognition. 2017.

127.Liu, Shu, et al. "Path aggregation network for instance segmentation." Proceedings of

the IEEE conference on computer vision and pattern recognition. 2018.

128.Rezatofighi, Hamid, et al. "Generalized intersection over union: A metric and a loss

for bounding box regression." Proceedings of the IEEE/CVF conference on computer

vision and pattern recognition. 2019.

129.R. Padilla, S. L. Netto and E. A. B. da Silva, "A Survey on Performance Metrics for

Object-Detection Algorithms," 2020 International Conference on Systems, Signals

and Image Processing (IWSSIP), Niteroi, Brazil, 2020, pp. 237-242, doi:

10.1109/IWSSIP48289.2020.9145130.

130.LaboroAI. 2020. Laboro Tomato. Available at

https://github.com/laboroai/LaboroTomato.git .

131.https://planthealthaction.org/news/plant-health-facts (accessed on 18 December

2021).
132.Shah, D., Paul, P., De Wolf, E., and Madden, L. (2019). Predicting plant disease

epidemics from functionally represented weather series. Philosophical Transactions of

the Royal Society B: Biological Sciences, 374(1775):20180273

133.Deb, S. and Bharpoda, T. (2017). Impact of meteorological factors on population of

major insect pests in tomato, lycopersicon esculentum mill. under middle gujarat

condition. Journal of Agrometeorology, 19(3):251–254

134.V. Pooja, R. Das, and V. Kanchana, "Identification of plant leaf diseases using image

processing techniques," 2017 IEEE Technological Innovations in ICT for Agriculture

and Rural Development (TIAR), Chennai, India, 2017, pp. 130-133, doi:

10.1109/TIAR.2017.8273700

135.Rangarajan, A.K., Purushothaman, R. and Ramesh, A., 2018. Tomato crop disease

classification using pre-trained deep learning algorithm. Procedia computer

science, 133, pp.1040-1047

136.Thomas S, Kuska MT, Bohnenkamp D, Brugger A, Alisaac E, Wahabzada M,

Behmann J, Mahlein AK (2018) Benefits of hyperspectral imaging for plant disease

detection and plant protection: a technical perspective. J Plant Dis Prot 125(1):5–20.

https:// doi.org/ 10. 1007/ s41348- 017- 0124-6

137.Agarwal, Mohit, et al. "ToLeD: Tomato leaf disease detection using convolution

neural network." Procedia Computer Science 167 (2020): 293-301.

138.J. Eunice, D. E. Popescu, M. K. Chowdary, and J. Hemanth, ‘‘Deep learning-based

leaf disease detection in crops using images for agricultural applications,’’ Agronomy,

vol. 12, no. 10, p. 2395, Oct. 2022

139.LeCun, Y., Bengio, Y., and Hinton, G. (2015). Deep learning. nature, 521(7553):436–

444

140.Ding, J., Chen, B., Liu, H., and Huang, M. (2016). Convolutional neural network with

data augmentation for sar target recognition. IEEE Geoscience and remote sensing

letters, 13(3):364–368.

141.Volpi, M. and Tuia, D. (2016). Dense semantic labeling of subdecimeter resolution

images with convolutional neural networks. IEEE Transactions on Geoscience and

Remote Sensing, 55(2):881–893.

142.Simonyan K, Zisserman A (2014) Very deep convolutional networks for large-scale

image recognition. arXiv preprint. arXiv: 14091556

143.He, K., Zhang, X., Ren, S., and Sun, J. (2016). Deep residual learning for image

recognition. In Proceedings of the IEEE conference on computer vision and pattern

recognition, pages 770-778.

144.C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens and Z. Wojna, "Rethinking the

Inception Architecture for Computer Vision," 2016 IEEE Conference on Computer

Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 2016, pp. 2818-2826,

doi: 10.1109/CVPR.2016.308.

145.G. Huang, Z. Liu, L. Van Der Maaten and K. Q. Weinberger, "Densely Connected

Convolutional Networks," 2017 IEEE Conference on Computer Vision and Pattern

Recognition (CVPR), Honolulu, HI, USA, 2017, pp. 2261-2269, doi:

10.1109/CVPR.2017.243.

146.François Chollet. (2017) "Xception: Deep learning with depthwise separable

convolutions." 2017 IEEE Conference on Computer Vision and Pattern Recognition

(CVPR): 1800-1807

147.Prasad PS, Senthilrajan A (2020) Technical paper an ensemble deep learning

technique for plant identification 8:133–135

148. K. A. Nguyen, W. Chen, B.-S. Lin, and U. Seeboonruang, "Comparison of ensemble

machine learning methods for soil erosion pin measurements", ISPRS Int. J. Geo-Inf.,

vol. 10, no. 1, pp. 42, Jan. 2021

149.https://www.plantvillage.psu.edu/topics/tomato/infos

150. https://www.forestryimages.org
