A Customized ViT For Detection of Java Plum Leaf Disease
Faruk Ahmed
4IR Research cell
Daffodil International University, Dhaka, Bangladesh
[email protected]
Bo Song
Lecturer (Industrial Automation)
School of Engineering
University of Southern Queensland
[email protected]
Yan Li
Professor (Computing)
School of Mathematics, Physics and Computing
Toowoomba Campus
University of Southern Queensland
[email protected]
Abstract
Vision Transformer (ViT) has recently attracted significant attention for its performance in
image classification. However, few studies have applied it to detecting and classifying plant leaf
diseases; most existing research on diseased plant leaf detection has focused on non-transformer
convolutional neural networks (CNNs). Moreover, the studies that applied ViT experimented only
narrowly with hyperparameters such as image size, patch size, learning rate, attention heads,
epochs, and batch size, even though these hyperparameters contribute significantly to model
performance. Recognising this gap, this study applied ViT with optimised hyperparameters to
Java Plum leaf disease detection. Java Plum leaf diseases significantly threaten agricultural
productivity by negatively impacting yield and quality, and timely detection and diagnosis are
essential for successful crop management. The primary dataset, collected in Bangladesh, includes
six classes: ‘Bacterial Spot’, ‘Brown Blight’, ‘Powdery Mildew’, ‘Sooty Mold’, ‘healthy’, and
‘dry’. This experiment contributes to a thorough understanding of Java Plum leaf diseases. The
model achieved an accuracy of 97.51% following rigorous testing and refinement, demonstrating
the potential of deep-learning tools in agriculture. This research offers a foundational model to
ensure crop quality through precise detection and helps position Bangladesh strongly in the
global Java Plum market.
Keywords: Vision Transformer, ViT, Deep Learning, Java Plum leaf disease detection,
Accuracy, Validation, Confusion Matrix, Train, Primary Dataset.
1 Introduction
Vision Transformer (ViT) is a neural network architecture with deep learning technology that
has gained significant attention in image detection and classification. Among the image
classification tasks, ViT has gained a prominent position in plant leaf disease detection through
its remarkable ability to model long-range dependencies using the self-attention mechanism
(Alzahrani et al., 2023; Fu et al., 2023; Zhan et al., 2023; Zhang et al., 2023). While traditional
CNNs depend on convolution-based architecture, ViT depends on the transformer-based
architecture. Transformer-based architecture collects information from image data patterns,
making image processing more efficient (Alzahrani et al., 2023). ViT can automatically process
image features and utilise these features for image classification and detection using pattern
recognition (Fu et al., 2023). The transformer's acyclic network structure supports parallel
computing through its encoder and decoder. Moreover, the self-attention mechanism reduces
training time and enhances performance (Zhan et al., 2023; Zhang et al., 2023). Thus, ViT has
emerged as a promising methodology for disease
detection on Java Plum leaves, as it enhances detection accuracy, demonstrating notable efficacy
(Yu et al., 2023). In the context of Java Plum production, accurate and timely disease
identification is important (Alzahrani et al., 2023).
Java Plum is a delicious and nutrition-rich fruit, mainly grown in Southeast Asia. It is also
known as jambolana, or black plum, and contains significant amounts of iron, calcium, and
vitamins A and C (Bhowmik et al., 2023). Although this fruit has some significant health
benefits, the production of this crop faces various challenges due to its extensive leaf diseases.
Some of the primary diseases of Java Plum leaves are bacterial spot, brown blight, powdery
mildew, and sooty mold. These diseases threaten the food supply considerably and hamper the
overall production of Java Plum, which could lead to an economic loss for the Java Plum
growing countries. This threat can be reduced by detecting these diseases accurately (De Silva et
al., 2023). However, manual detection of plant leaf diseases is time-consuming and often
leads to mistakes (Ahad et al., 2023). Thus, effective and accurate identification of Java Plum
leaf diseases is imperative for reducing fruit production loss. However, despite the advancements
in precision agriculture, many existing technologies classify plant diseases with minimal
effectiveness. This is because of the complex nature of Java Plum diseases in real-world
scenarios, where conventional detection techniques often struggle to produce rapid and accurate
results. To address the issue, this study aims to utilise a ViT-based model to improve the
accuracy of Java Plum leaf disease detection from primary data. The main goal of the model is to
detect the leaf diseases accurately, which can minimise the production loss of Java Plums.
Though Vision Transformers (ViT) have received significant attention in plant leaf disease
detection, ViT remains unexplored for Java Plum disease detection; this study demonstrates its
promising performance in that setting. In addition, several
significant gaps have been identified in the recent studies. Firstly, the absence of well-defined
models poses a significant challenge in accurately distinguishing between various leaf diseases in
real-world scenarios. Secondly, most crop disease detection studies are conducted using
secondary datasets. However, leaf disease varies from country to country due to differing
environments. Therefore, it is necessary to create a comprehensive and well-sorted dataset of
Java Plum leaf disease for training and testing the efficiency of the ViT model. The third
problem is related to the accuracy of leaf disease detection, where the classification of modalities
is a concern, because lower detection accuracy and high false-positive rates narrow the
applicability and acceptability of ViT in precision agriculture. The fourth problem is that it is still
unknown which patch size and image size provide the best accuracy when implementing ViT.
To fill these gaps, this study seeks to answer two research questions:
RQ1: How can Java Plum leaf disease detection accuracy be improved using ViT?
RQ2: Which patch size and image size combination can provide better accuracy in the Java Plum
dataset?
This study aims to solve these problems and investigates the efficiency of ViT in Java Plum leaf
disease detection. It is expected that the outcomes of this research will help farmers in timely
Java Plum leaf disease detection by improving the precision of disease detection. Additionally, it
aims to develop an automated and efficient tool for detecting and classifying Java Plum leaf
diseases.
The main contributions of this study are as follows:
- This study optimises the ViT architecture by customising the patch size, hyperparameters, and
encoder layer for Java Plum leaf disease detection.
- This research collects a primary dataset to consider the scalability and practicality of deploying
the ViT model in real-world Java Plum fields.
- This study introduces a ViT-based model that effectively classifies leaf diseases and compares
the experimental results with state-of-the-art studies (Devi et al., 2023; Mehta et al., 2023),
achieving at least 1% higher accuracy than the best of those models.
2 Related works
Several groups of researchers have used different techniques for plant leaf disease detection, and
numerous researchers have employed diverse Vision Transformer techniques (see Table 1).
The initial group of researchers employed the conventional Vision Transformer (ViT) model for
detecting plant leaf diseases. For instance, Salamai et al. (2023) and De Silva et al. (2023) utilised
the traditional ViT model in their studies. Salamai et al. (2023) chose this model because of its
ability to learn visual representations of leaf diseases across spatial and channel dimensions.
De Silva et al. (2023) opted for this model because it effectively identifies plant leaf diseases
under natural environmental conditions. In this line of research, De Silva et al. (2023)
achieved a training accuracy of 93.71% and a test accuracy of 90.02%. However, it is
noteworthy that both studies were conducted on publicly available datasets, and conducting
experiments on primary datasets related to the actual environment could enhance the precision of
these studies.
The second group of researchers employed various hybrid Vision Transformer (ViT) models to
detect diseases in diverse plant leaves. For instance, Tabbakh et al. (2023) applied a Transfer
Learning and Vision Transformer model (TLMViT) to a wheat dataset. Zhang et al. (2023)
implemented the Shuffle-convolution-based lightweight Vision Transformer (SLViT) model on
sugarcane leaves and utilised an improved Vision Transformer (ViT) to identify agricultural
pests. Additionally, Sun et al. (2023) applied SE-ViT to diagnose diseases in
sugarcane leaves. Zhan et al. (2023) developed the IterationViT model for diagnosing tea
diseases. Thai et al. (2023) used the Tiny-LeViT model for efficient leaf disease detection, and
Zhang et al. (2023) applied the hybrid IEM-ViT model for the quick and accurate recognition of
tea diseases, whereas Parez et al. (2023) used GreenViT. Zeng et al. (2023) used the Squeeze-
and-Excitation Vision Transformer (SEViT) to identify large-scale and fine-grained diseases,
and Li et al. (2023) implemented the Plant-based MobileViT (PMVT) model for real-time plant
disease detection. Some of these studies reported notable accuracies: Zhou et al. (2023),
Tabbakh et al. (2023), Sun et al. (2023), Zhan et al. (2023), Thai et al. (2023), and Zhang et al.
(2023) achieved accuracies of 92.00%, 98.81%, 97.26%, 98%, and 97.25%, respectively.
However, it is worth noting that while Zhou et al. (2023), Zhan et al. (2023), Zhang et al. (2023),
and De Silva et al. (2023) conducted their studies on primary datasets, the other researchers
implemented their models on publicly available datasets, introducing some limitations to their
studies.
The third group of researchers employed various models to compare accuracy and performance
on the same or different datasets. Rethik et al. (2023) utilised ViT1, ViT2, and pre-trained
ViT_b16 models to assess accuracy and performance for classifying plant leaf diseases. The
experiment shows that the pre-trained ViT_b16 model performs better than other models and
presents accuracy values of 85.87%, 89.16%, and 94.16% for ViT1, ViT2, and pre-trained
ViT_b16 models. Hossain et al. (2023) compared the accuracy of DenseNet169, ResNet50V2,
and ViT models for early detection and recognition of tomato leaf diseases, concluding that
DenseNet169 demonstrated the highest efficacy with 99.88% training and 99.00% test accuracy,
while the ResNet50V2 and ViT models reached 95.60% training and 98.00% test accuracy.
Alzahrani et al. (2023) evaluated the accuracy and performance of EANet, MaxViT, CCT, and
PVT models for tomato leaf disease detection, with PVT emerging as the superior model; the
accuracy rates were 89%, 97%, 91%, and 93% for EANet, MaxViT, CCT, and PVT,
respectively. Öğrekçi et al. (2023) implemented DenseNet121, ViT, and combined ViT+CNN
models to compare accuracy and performance in identifying sugarcane leaf diseases, with ViT
proving the most effective; the models achieved 92.87%, 93.34%, and 87.37%, respectively.
Notably, these researchers used public datasets, which may limit how much their studies add to
knowledge enrichment compared to studies based on first-hand datasets.
The fourth group of researchers employed multiple models to construct ensemble models, aiming
for improved accuracy and performance. Kumar et al. (2023) utilised EfficientNet, SEResNeXt,
ViT, DeIT, and MobileNetV3 models to develop an ensemble model for detecting cassava leaf
diseases and achieved an accuracy of 90.75% with their ensemble model. Ganguly et al. (2023)
introduced an ensemble model incorporating CNN, ResNeXt, and InceptionV3 to detect plant
leaf diseases. Chang et al. (2024) constructed an ensemble model comprising ViT, PVT, and
Swin to enhance identification quality in plant leaf disease detection. Among these studies, only
Kumar et al. (2023) conducted their experiments on publicly available datasets.
The last group of researchers employed diverse models to enhance accuracy and performance in
disease detection. Diana Andrushia et al. (2023) utilised convolutional capsule networks for
identifying grape leaf diseases and achieved 99.12% accuracy in their study. Kumar et al. (2023)
introduced paddy leaf disease detection using a multi-scale feature fusion-based RDTNet and
attained 99.55% accuracy, a 99.54% f1-score, and 99.53% precision. Hu et al. (2023) proposed
an Adaptive Fourier Neural Operators (AFNO)-based Transformer architecture named FOTCA,
focusing on extracting global features in advance, and reported a 99.8% accuracy for the FOTCA
model. Arshad et al. (2023) presented a novel hybrid deep learning model, PLDPNet, designed to
forecast potato leaf diseases automatically, and achieved an overall accuracy of 98.66% with
their proposed model. Zhang et al. (2023) introduced a unique segmentation model for grape leaf
diseases in natural scene photos, termed the Locally Reversible Transformer (LRT). Thai et al.
(2023) developed a transformer-based leaf disease detection model known as FormerLeaf. Devi
et al. (2023) created an EfficientNetV2 model for pest identification and plant disease
categorisation. Because all of these studies used publicly available datasets, their contribution to
knowledge enrichment may be limited to some extent.
Table 1: Research Matrix

Reference | Model | Accuracy | Key contribution
Zhang et al. (2023) | SLViT | 1.87% accuracy gain over MobileNetV3 | Presented a hybrid model that was initially trained on a publicly available dataset.
Zeng et al. (2023) | Squeeze-and-Excitation Vision Transformer (SEViT) | Test 88.34% | Improved the classification accuracy by 5.15% compared to the baseline model.
Li et al. (2023) | Plant-based MobileViT (PMVT) | 93.6%, 85.4%, and 93.1% | Created a plant disease diagnostic application utilising the PMVT model to detect plant diseases in various situations.
Rethik et al. (2023) | ViT1, ViT2, and pre-trained ViT_b16 | 85.87%, 89.16%, and 94.16% | Substituted CNN with a novel Vision Transformer technique to classify plant leaf diseases.
Alzahrani et al. (2023) | EANet, MaxViT, CCT, and PVT | 89%, 97%, 91%, and 93% | Analysed the effect of four different transformer-based models.
3.1 Dataset
The dataset used in this study was collected from two different areas in Bangladesh: "Titas” and
“Barura”, located in the “Cumilla” district. This dataset is designed to develop an accurate Java
Plum leaf disease detection system using the ViT model. Different real-world Java Plum farm
images were captured to build the Java Plum leaf disease dataset containing different classes.
The dataset contains six distinct classes, each representing a specific category of Java Plum leaf
(see Table 2). Four classes correspond to leaf diseases (‘Bacterial Spot’, ‘Brown Blight’,
‘Powdery Mildew’, and ‘Sooty Mold’), and the remaining two represent ‘healthy’ and ‘dry’
Java Plum leaves.
Table 2: Images of 6 classes from the dataset
The entire dataset was randomly partitioned into training, validation, and testing sets. This
division enabled training the Vision Transformer model, validating it, and evaluating its
performance objectively. The dataset serves as the main supporting structure for this experiment
and enables the construction and evaluation of a cutting-edge deep learning system for Java Plum
leaf disease identification.
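As an illustration of this partitioning step, the following is a minimal sketch using scikit-learn's train_test_split. The directory layout, the 70/15/15 split ratio, and the stratification are assumptions for illustration; the paper reports only that the split was random.

```python
import os
from sklearn.model_selection import train_test_split

# Hypothetical layout: java_plum_dataset/<class_name>/<image>.jpg for the six classes.
DATA_DIR = "java_plum_dataset"

paths, labels = [], []
for cls in sorted(os.listdir(DATA_DIR)):
    for fname in os.listdir(os.path.join(DATA_DIR, cls)):
        paths.append(os.path.join(DATA_DIR, cls, fname))
        labels.append(cls)

# Carve out the test set first, then split the remainder into train/validation.
# The ratios and stratification are assumed, not taken from the paper.
train_p, test_p, train_y, test_y = train_test_split(
    paths, labels, test_size=0.15, stratify=labels, random_state=42)
train_p, val_p, train_y, val_y = train_test_split(
    train_p, train_y, test_size=0.15 / 0.85, stratify=train_y, random_state=42)

print(len(train_p), len(val_p), len(test_p))
```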
The proposed model consists of a few prominent customisations. Firstly, the patch size can be
adjusted dynamically according to the input image size; each patch is then linearly projected to a
fixed projection dimension of 64. Secondly, positional encoding is added to the projected patch
embeddings to encode spatial information; the size of the positional encoding is computed from
the input image size. Thirdly, a customisable encoder layer is built in which the number of
attention heads can be 2, 4, 6, and so on. The transformer encoder is built with one layer,
containing a self-attention mechanism and a feed-forward network. After passing through the
transformer encoder layer, global average pooling aggregates information across all patches.
Lastly, the model is trained with tuned hyperparameters: a learning rate of 0.001, a weight decay
of 0.0001, and a variable batch size. With this configuration, ViT shows exceptional performance
in image classification and analysis. Figure 1 shows the model structure of the proposed ViT for
Java Plum leaf diseases.
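To make these customisations concrete, the following is a minimal Keras sketch of such an architecture: patch extraction with linear projection to a 64-dimensional embedding, a learnable positional encoding sized from the image/patch geometry, a single encoder layer with a configurable number of attention heads, and global average pooling, compiled with the stated learning rate of 0.001 and weight decay of 0.0001. The optimiser choice (AdamW), the feed-forward width, and the layer-normalisation placement are assumptions not specified in the text.

```python
from tensorflow import keras
from tensorflow.keras import layers

class AddPositionEmbedding(layers.Layer):
    """Learnable positional encoding added to the projected patch embeddings."""
    def build(self, input_shape):
        _, num_patches, dim = input_shape
        self.pos_emb = self.add_weight(
            name="pos_emb", shape=(1, num_patches, dim), initializer="zeros")

    def call(self, x):
        return x + self.pos_emb

def build_vit(image_size=64, patch_size=12, num_classes=6,
              projection_dim=64, num_heads=6):
    num_patches = (image_size // patch_size) ** 2
    inputs = keras.Input(shape=(image_size, image_size, 3))

    # Patch extraction and linear projection to the fixed 64-d embedding:
    # a strided convolution splits the image into non-overlapping patches
    # and applies a shared linear projection in one step.
    x = layers.Conv2D(projection_dim, patch_size, strides=patch_size)(inputs)
    x = layers.Reshape((num_patches, projection_dim))(x)
    x = AddPositionEmbedding()(x)

    # One transformer encoder layer: self-attention plus feed-forward network.
    attn = layers.MultiHeadAttention(num_heads=num_heads, key_dim=projection_dim)(x, x)
    x = layers.LayerNormalization()(x + attn)
    ffn = layers.Dense(projection_dim * 2, activation="gelu")(x)
    ffn = layers.Dense(projection_dim)(ffn)
    x = layers.LayerNormalization()(x + ffn)

    # Global average pooling aggregates information across all patches.
    x = layers.GlobalAveragePooling1D()(x)
    outputs = layers.Dense(num_classes, activation="softmax")(x)
    return keras.Model(inputs, outputs)

model = build_vit(image_size=64, patch_size=12, num_heads=6)
model.compile(
    optimizer=keras.optimizers.AdamW(learning_rate=0.001, weight_decay=0.0001),
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy"])
```

Because the patch count is derived from the image and patch sizes, the same builder covers every image/patch combination evaluated later.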
4 Experimental Result
Different measures are utilized to assess the performance of machine learning classification
models, providing insights into the effectiveness of vision transformer-based models within a
specific application. The following metrics for performance evaluation are considered:
4.1 Accuracy
Accuracy is a measure of how well a detection model performs; in simpler terms, it is the
proportion of predictions the model makes correctly. It is computed as the ratio of correctly
classified images to the overall number of images in the dataset:

\[ \text{Accuracy} = \frac{TP + TN}{TP + FP + TN + FN} \quad (1) \]
4.2 Precision
Precision is an essential metric for evaluating classification models. It is calculated as the ratio
of true positives to all predicted positives. Generally, it answers the question: "Of all the values
predicted as positive, how many are true positives?" This metric is particularly important when
false positives are costly:

\[ \text{Precision} = \frac{TP}{TP + FP} \quad (2) \]
4.3 Recall
Recall is also an essential metric for evaluating classification models. It is calculated as the ratio
of true positives to the sum of true positives and false negatives. Generally, it answers the
question: "Of all the actual positive instances, how many were correctly identified?" This metric
is particularly important where false negatives are more costly than false positives:

\[ \text{Recall} = \frac{TP}{TP + FN} \quad (3) \]
4.4 F1-Score
The F1-score combines precision and recall to assess detection accuracy. It is computed as the
harmonic mean of precision and recall, and it reaches its highest value when precision and recall
are equal. The formula for the F1-score is as follows:

\[ \text{F1-score} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \quad (4) \]
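In practice, all four metrics can be computed directly with scikit-learn, as in the brief sketch below. The labels are toy values for illustration only, and macro averaging is shown because the per-class tables that follow report macro averages.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Toy labels for illustration only; 0-5 index the six leaf classes.
y_true = [0, 1, 2, 2, 3, 4, 5, 1]
y_pred = [0, 1, 2, 3, 3, 4, 5, 1]

print("Accuracy :", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred, average="macro"))
print("Recall   :", recall_score(y_true, y_pred, average="macro"))
print("F1-score :", f1_score(y_true, y_pred, average="macro"))
```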
Table 3 illustrates the evaluations of eight distinct image and patch size combinations at epoch
250, head 6, and a learning rate of 0.001. The combinations of image size and patch
configurations are: (Image [64×64], patch 12), (Image [56×56], patch 7), (Image [48×48], patch
6), (Image [32×32], patch 8), (Image [32×32], patch 6), (Image [28×28], patch 7), (Image
[28×28], patch 4), and (Image [21×21], patch 7). Table 3 presents the accuracy outcomes for
these configurations, highlighting the highest accuracy of 97.51% in 245 seconds for (Image
[64×64], patch 12) and the lowest accuracy of 93.36% in 240 seconds for (Image [21×21], patch
7).
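A sweep of this kind could be scripted as in the following sketch, which reuses the build_vit function from the earlier architecture sketch; fit_and_score is a hypothetical placeholder for the 250-epoch training run and the held-out test evaluation, not code from the paper.

```python
# Sweep over the image/patch combinations evaluated in Table 3.
# fit_and_score stands in for the full 250-epoch training run
# (learning rate 0.001, six attention heads) plus test evaluation.
configs = [(64, 12), (56, 7), (48, 6), (32, 8),
           (32, 6), (28, 7), (28, 4), (21, 7)]

results = {}
for image_size, patch_size in configs:
    model = build_vit(image_size=image_size, patch_size=patch_size, num_heads=6)
    results[(image_size, patch_size)] = fit_and_score(model, image_size)

best = max(results, key=results.get)
print(f"Best configuration {best} with test accuracy {results[best]:.2%}")
```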
Table 3: Evaluating eight image sizes and patch configurations for detection performance.
Image size | Patch size | Valid accuracy [%] | Test accuracy [%] | Training time [s] | Accuracy [%]
64 | 12 | 97 | 98 | 245 | 97.51
56 | 7 | 97 | 97 | 236 | 97.27
48 | 6 | 98 | 97 | 241 | 97.1
32 | 8 | 92 | 95 | 239 | 94.67
32 | 6 | 97 | 96 | 242 | 96.00
28 | 7 | 96 | 96 | 234 | 96.21
28 | 4 | 97 | 95 | 231 | 94.70
21 | 7 | 95 | 93 | 240 | 93.36
Table 4 and Table 5 illustrate the precision, recall, f1-score, and support for validation and
testing for the image size and patch configurations: (Image [64×64], patch 12) and (Image
[56×56], patch 7). Between these, (Image [64×64], patch 12) has shown the best accuracy of
97.51%, and (Image [56×56], patch 7) has given 97.27% accuracy.
Table 4: Performance matrices for image size [64×64] and patch 12

Validation:
Class | Precision | Recall | F1-score | Support
Bacterial Spot | 0.96 | 0.96 | 0.96 | 67
Brown Blight | 0.97 | 0.94 | 0.95 | 64
Dry | 0.99 | 1.00 | 0.99 | 71
Healthy | 0.92 | 0.93 | 0.92 | 58
Powdery Mildew | 0.98 | 0.98 | 0.98 | 61
Sooty Mold | 1.00 | 1.00 | 1.00 | 69
Accuracy | | | 0.97 | 390
Macro avg | 0.97 | 0.97 | 0.97 | 390
Weighted avg | 0.97 | 0.97 | 0.97 | 390

Test:
Class | Precision | Recall | F1-score | Support
Bacterial Spot | 0.93 | 0.97 | 0.95 | 41
Brown Blight | 0.98 | 1.00 | 0.99 | 40
Dry | 1.00 | 1.00 | 1.00 | 41
Healthy | 0.97 | 0.93 | 0.95 | 40
Powdery Mildew | 1.00 | 0.97 | 0.99 | 40
Sooty Mold | 0.97 | 0.97 | 0.97 | 40
Accuracy | | | 0.98 | 241
Macro avg | 0.98 | 0.98 | 0.98 | 241
Weighted avg | 0.98 | 0.98 | 0.98 | 241
Table 5: Performance matrices for image size [56×56] and patch 7

Validation:
Class | Precision | Recall | F1-score | Support
Bacterial Spot | 0.97 | 0.91 | 0.94 | 67
Brown Blight | 0.98 | 0.98 | 0.98 | 64
Dry | 1.00 | 1.00 | 1.00 | 71
Healthy | 0.93 | 0.95 | 0.94 | 58
Powdery Mildew | 0.95 | 1.00 | 0.98 | 61
Sooty Mold | 1.00 | 1.00 | 1.00 | 69
Accuracy | | | 0.97 | 390
Macro avg | 0.97 | 0.97 | 0.97 | 390
Weighted avg | 0.97 | 0.97 | 0.97 | 390

Test:
Class | Precision | Recall | F1-score | Support
Bacterial Spot | 0.93 | 0.95 | 0.94 | 40
Brown Blight | 0.95 | 1.00 | 0.98 | 40
Dry | 1.00 | 1.00 | 1.00 | 41
Healthy | 0.97 | 0.90 | 0.94 | 40
Powdery Mildew | 0.98 | 1.00 | 0.99 | 40
Sooty Mold | 1.00 | 0.97 | 0.99 | 40
Accuracy | | | 0.97 | 241
Macro avg | 0.97 | 0.97 | 0.97 | 241
Weighted avg | 0.97 | 0.97 | 0.97 | 241
Figures 2 and 3 illustrate the confusion matrix for validation and testing for the image size and
patch configurations: (Image [64×64], patch 12) and (Image [56×56], patch 7). In the confusion
matrix, there are six classes. The matrices show that the number of true positives increases when
the image size is larger, and that the number of false positives also depends on the image size.
Overall, the performance of this experiment was best when the image size was [64×64] and the
patch size was 12.
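For reference, a confusion matrix like those in Figures 2 and 3 can be produced with scikit-learn, as sketched below; the labels shown are toy values standing in for the test-set ground truth and the model's predictions.

```python
import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay, confusion_matrix

CLASSES = ["Bacterial Spot", "Brown Blight", "Dry",
           "Healthy", "Powdery Mildew", "Sooty Mold"]

# Toy labels for illustration; in the experiment these would come from
# running the trained model on the held-out test set.
y_true = [0, 1, 2, 3, 4, 5, 0, 3]
y_pred = [0, 1, 2, 3, 4, 5, 1, 3]

cm = confusion_matrix(y_true, y_pred, labels=range(len(CLASSES)))
ConfusionMatrixDisplay(cm, display_labels=CLASSES).plot(xticks_rotation=45)
plt.tight_layout()
plt.show()
```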
It is essential to measure the training loss when designing a model. This metric evaluates the
performance of fitting training data and compares its predicted output with target values. The
goal is to decrease the loss, indicating that the model accurately represents the input-output
relationship. It is important to note that the validation loss metric evaluates the model's
performance on new data it has not seen before. Figures 4 and 5 show the training and validation
loss over 250 epochs with six attention heads for the (Image [64×64], patch 12) and (Image
[56×56], patch 7) combinations. The validation curve for (Image [56×56], patch 7) is much
noisier than that for (Image [64×64], patch 12). However, the accuracy is higher for the (Image
[64×64], patch 12) combination, at 97.51%.
Figure 4: Accuracy diagram for image size [64x64] and patch 12
5 Discussion
This research built a customised model based on a Vision Transformer to diagnose Java Plum
leaf disease. The number of attention heads was varied to achieve the best accuracy, and the
model performed best with six attention heads. Various combinations of image and patch sizes
were examined to determine the model's effectiveness; their performance is shown in Table 3.
The model obtained the best performance for the image/patch size combination of 64/12, while
28/4 took the least time to train. Though the 64/12 combination took more time to train, it
acquired 97.51% testing accuracy, the highest among all combinations, and this result answers
RQ2. A comparison of testing accuracy in the experiment is shown in Figure 6.
Figure 6: Performance comparison of testing accuracy across the image/patch size combinations.
6 Conclusion
In this study, an optimized ViT model and its hyperparameters were investigated to detect and
classify Java Plum leaf disease. The study experimented on the primary dataset using image/patch
size combinations of 64/12, 56/7, 48/6, 32/8, 32/6, 28/7, 28/4, and 21/7. For an image size of 64
and a patch size of 12, the optimized ViT obtained the highest accuracy of 97.5% on the
validation and test sets, although this combination required more training time than all other
image and patch sizes. Setting the number of attention heads according to the number of classes
provided the highest classification accuracy on the image dataset. The number of epochs must be
chosen with the model's overfitting and underfitting behaviour in mind: too few epochs may
cause underfitting, where the model has not learned enough from the data, while too many epochs
can cause overfitting, where training accuracy is high but the model cannot precisely detect
diseases in new data.
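One common way to manage this trade-off is an early-stopping callback that halts training when validation loss stops improving, sketched below under the assumption of a Keras training loop; the paper itself trains for a fixed 250 epochs, so this safeguard, and the train_ds/val_ds datasets, are illustrative assumptions rather than the authors' setup.

```python
from tensorflow import keras

# Early stopping guards against the overfitting described above by halting
# training once validation loss stops improving.
early_stop = keras.callbacks.EarlyStopping(
    monitor="val_loss",          # watch generalisation, not training fit
    patience=15,                 # assumed number of non-improving epochs tolerated
    restore_best_weights=True)   # roll back to the best-performing epoch

# train_ds and val_ds are hypothetical tf.data datasets of leaf images.
history = model.fit(train_ds, validation_data=val_ds,
                    epochs=250, callbacks=[early_stop])
```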
7 Limitation
This study has some fundamental limitations related to the research outcomes. It is crucial to
emphasize that the Java Plum Leaf Disease Dataset was gathered during September, October,
and November of 2023 in Cumilla (Bangladesh). Additionally, it is essential to recognize that the
success of deep learning approaches in disease classification depends on suitable datasets.
Obtaining a diverse and comprehensive dataset with accurately labeled instances of different
diseases presents a significant challenge. Therefore, the proposed ViT model may perform
differently in field conditions. Even though ViT shows promising performance in detecting Java
Plum leaf disease, it has several limitations. One limitation concerns its reliance on the
self-attention mechanism, whose effectiveness depends on several factors such as dataset size,
model architecture, batch size, patch size, and the number of attention heads. This model's
lack of practical implementation is another limitation of this research work. Although this model
demonstrates impressive recognition accuracy, it is crucial to understand that no model is free of
errors, and it is also impossible to eliminate the chance of misclassification or false positives.
This study does not go into depth to identify and classify other plant diseases, and it only focuses
on some limited diseases that can affect Java Plum leaves. This study aims to assist future
researchers in more accurate result interpretation by acknowledging these limitations.
Declaration of competing interest
The authors of this research work declare that they do not have any known competing interests or
personal interests that could appear to influence the work reported in this paper.
Data availability
This dataset was carefully collected from different areas of Bangladesh to make the primary
repository of Java Plum leaf information. It has been deposited on Mendeley, a respected
platform for academic cooperation and data dissemination. Those involved in research and
academia can leverage this dataset to progress their studies, perform analyses, and contribute to
the expanding knowledge base concerning Java Plum leaves. The intent behind sharing this data
on Mendeley is to promote collaboration, openness, and continued scientific inquiry, enabling
the research community to gain valuable insights from this extensive compilation of Java Plum
leaf disease data.
[Bhowmik, Auvick Chandra; Ahad, Taimur (2024), “Java Plum Leaf Disease Dataset”,
Mendeley Data, V3, doi: 10.17632/43d75vptz4.3]
Funding statement
The research received no financial support, and none of the researchers received funding for the
work.
References
Alzahrani, M. S., & Alsaade, F. W. (2023). Transform and Deep Learning Algorithms for the
Early Detection and Recognition of Tomato Leaf Disease. Agronomy, 13(5), 1184.
doi.org/10.3390/agronomy13051184
Fu, X., Ma, Q., Yang, F., Zhang, C., Zhao, X., Chang, F., & Han, L. (2023). Crop pest image
recognition based on the improved ViT method. Information Processing in Agriculture.
doi.org/10.1016/j.inpa.2023.02.007
Zhan, B., Li, M., Luo, W., Li, P., Li, X., & Zhang, H. (2023). Study on the Tea Pest
Classification Model Using a Convolutional and Embedded Iterative Region of Interest Encoding
Transformer. Biology, 12(7), 1017. doi.org/10.3390/biology12071017
Zhang, J., Guo, H., Guo, J., & Zhang, J. (2023). An Information Entropy Masked Vision
Transformer (IEM-ViT) Model for Recognition of Tea Diseases. Agronomy, 13(4), 1156.
doi.org/10.3390/agronomy13041156
Bhowmik, A. C., Ahad, D. M. T., & Emon, Y. R. (2023). Machine Learning-Based Soybean Leaf
Disease Detection: A Comprehensive Review. arXiv preprint arXiv:2311.15741.
De Silva, M., & Brown, D. (2023, August). Plant Disease Detection using Vision Transformers
on Multispectral Natural Environment Images. In 2023 International Conference on Artificial
Intelligence, Big Data, Computing and Data Communication Systems (icABCD) (pp. 1-6). IEEE. DOI:
10.1109/icABCD59051.2023.10220517
Ahad, M. T., Li, Y., Song, B., & Bhuiyan, T. (2023). Comparison of CNN-based deep learning
architectures for rice disease classification. Artificial Intelligence in Agriculture, 9, 22-35.
doi.org/10.1016/j.aiia.2023.07.001
Zhou, C., Zhong, Y., Zhou, S., Song, J., & Xiang, W. (2023). Rice leaf disease identification by
residual-distilled transformer. Engineering Applications of Artificial Intelligence, 121, 106020.
doi.org/10.1016/j.engappai.2023.106020
Tabbakh, A., & Barpanda, S. S. (2023). A Deep Features extraction model based on the Transfer
learning model and vision transformer" TLMViT" for Plant Disease Classification. IEEE Access. DOI:
10.1109/ACCESS.2023.3273317
Sun, C., Zhou, X., Zhang, M., & Qin, A. (2023). SEVisionTransformer: Hybrid Network for
Diagnosing Sugarcane Leaf Diseases Based on Attention Mechanism. Sensors, 23(20), 8529.
doi.org/10.3390/s23208529
Thai, H. T., Le, K. H., & Nguyen, N. L. T. (2023). Towards sustainable agriculture: A
lightweight hybrid model and cloud-based collection of datasets for efficient leaf disease detection.
Future Generation Computer Systems. doi.org/10.1016/j.future.2023.06.016
De Silva, M., & Brown, D. (2023). Plant Disease Detection Using Multispectral Imaging with
Hybrid Vision Transformers. doi: 10.3390/s23208531
Parez, S., Dilshad, N., Alghamdi, N. S., Alanazi, T. M., & Lee, J. W. (2023). Visual Intelligence
in Precision Agriculture: Exploring Plant Disease Detection via Efficient Vision Transformers. Sensors,
23(15), 6949. doi.org/10.3390/s23156949
Yu, S., Xie, L., & Huang, Q. (2023). Inception convolutional vision transformers for plant disease
identification. Internet of Things, 21, 100650. doi.org/10.1016/j.iot.2022.100650
Zeng, Q., Niu, L., Wang, S., & Ni, W. (2023). SEViT: a large-scale and fine-grained plant disease
classification model based on transformer and attention convolution. Multimedia Systems, 29(3), 1001-
1010. doi.org/10.1007/s00530-022-01034-1
Li, G., Wang, Y., Zhao, Q., Yuan, P., & Chang, B. (2023). PMVT: a lightweight vision
transformer for plant disease identification on mobile devices. Frontiers in Plant Science, 14, 1256773.
doi.org/10.3389/fpls.2023.1256773
Rethik, K., & Singh, D. (2023, May). Attention-Based Mapping for Plants Leaf to Classify
Diseases using Vision Transformer. In 2023 4th International Conference for Emerging Technology
(INCET) (pp. 1-5). IEEE. DOI: 10.1109/INCET57972.2023.10170081
Hossain, S., Tanzim Reza, M., Chakrabarty, A., & Jung, Y. J. (2023). Aggregating Different
Scales of Attention on Feature Variants for Tomato Leaf Disease Diagnosis from Image Data: A
Transformer Driven Study. Sensors, 23(7), 3751. doi.org/10.3390/s23073751
Mustofa, S., Munna, M. M. H., Emon, Y. R., Rabbany, G., & Ahad, M. T. (2023). A
comprehensive review on Plant Leaf Disease detection using Deep learning. arXiv preprint
arXiv:2308.14087. doi.org/10.48550/arXiv.2308.14087
Öğrekçi, S., Ünal, Y., & Dudak, M. N. (2023). A comparative study of vision transformers and
convolutional neural networks: sugarcane leaf diseases identification. European Food Research and
Technology, 249(7), 1833-1843. doi.org/10.1007/s00217-023-04258-1
Kumar, H., Velu, S., Lokesh, A., Suman, K., & Chebrolu, S. (2023, February). Cassava Leaf
Disease Detection Using Ensembling of EfficientNet, SEResNeXt, ViT, DeIT, and MobileNetV3 Models.
In Proceedings of the International Conference on Paradigms of Computing, Communication and Data
Sciences: PCCDS 2022 (pp. 183-193). Singapore: Springer Nature Singapore.
doi.org/10.1007/978-981-19-8742-7_15
Ganguly, A., Tiwari, B., Reddy, G. P. K., & Chauhan, M. (2023). Ensemble Learning for Plant
Leaf Disease Detection: A Novel Approach for Improved Classification Accuracy.
doi.org/10.21203/rs.3.rs-3257323/v1
Chang, B., Wang, Y., Zhao, X., Li, G., & Yuan, P. (2024). A general-purpose edge-feature
guidance module to enhance vision transformers for plant disease identification. Expert Systems with
Applications, 237, 121638. doi.org/10.1016/j.eswa.2023.121638
Diana Andrushia, A., Mary Neebha, T., Trephena Patricia, A., Umadevi, S., Anand, N., &
Varshney, A. (2023). Image-based disease classification in grape leaves using convolutional capsule
network. Soft Computing, 27(3), 1457-1470. doi.org/10.1007/s00500-022-07446-5
Kumar, A., Yadav, D. P., Kumar, D., Pant, M., & Pant, G. (2023). Multi-scale feature
fusion-based lightweight dual stream transformer for detection of paddy leaf disease. Environmental
Monitoring and Assessment, 195(9), 1020. doi.org/10.1007/s10661-023-11628-5
Hu, B., Jiang, W., Zeng, J., Cheng, C., & He, L. (2023). FOTCA: hybrid transformer-CNN
architecture using AFNO for accurate plant leaf disease image recognition. Frontiers in Plant Science, 14.
doi: 10.3389/fpls.2023.1231903
Arshad, F., Mateen, M., Hayat, S., Wardah, M., Al-Huda, Z., Gu, Y. H., & Al-antari, M. A.
(2023). PLDPNet: End-to-end hybrid deep learning framework for potato leaf disease prediction.
Alexandria Engineering Journal, 78, 406-418. doi.org/10.1016/j.aej.2023.07.076
Zhang, X., Li, F., Jin, H., & Mu, W. (2023). Local Reversible Transformer for semantic
segmentation of grape leaf diseases. Applied Soft Computing, 143, 110392.
doi.org/10.1016/j.asoc.2023.110392
Thai, H. T., Le, K. H., & Nguyen, N. L. T. (2023). FormerLeaf: An efficient vision transformer
for Cassava Leaf Disease detection. Computers and Electronics in Agriculture, 204, 107518.
doi.org/10.1016/j.compag.2022.107518
Devi, R. S., Kumar, V. R., & Sivakumar, P. (2023). EfficientNetV2 Model for Plant Disease
Classification and Pest Recognition. Computer Systems Science & Engineering, 45(2). DOI:
10.32604/csse.2023.032231
Mehta, S., Kukreja, V., & Vats, S. (2023, July). Innovative Approaches to Java Plum Leaf
Disease Identification: Federated Learning meets Convolutional Neural Networks. In 2023 14th
International Conference on Computing Communication and Networking Technologies (ICCCNT) (pp. 1-
6). IEEE. DOI: 10.1109/ICCCNT56998.2023.10307120
Mamun, S. B., Ahad, M. T., Morshed, M. M., Hossain, N., & Emon, Y. R. Scratch Vision
Transformer Model for Diagnosis of Grape Leaf Disease. In International Conference on Trends in
Computational and Cognitive …