DOI: 10.1002/ppj2.20079

Arti Singh, Department of Agronomy, Iowa State University, Ames, IA, USA.
Email: [email protected]

Assigned to Associate Editor Weizhen Liu.

Funding information
NSF, Grant/Award Number: 1952045; CPS Frontier, Grant/Award Number: 1954556; Agricultural Research Service; National Institute of Food and Agriculture, Grant/Award Numbers: 2019-67021-29938, 2021-67021-35329, 2022-67013-37120; U.S. Department of Agriculture, Grant/Award Number: IOW04714

Abstract
Insect pests cause significant damage to food production, so early detection and efficient mitigation strategies are crucial. There is a continual shift toward machine learning (ML)-based approaches for automating agricultural pest detection. Although supervised learning has achieved remarkable progress in this regard, it is impeded by the need for significant expert involvement in labeling the data used for model training. This makes real-world applications tedious and oftentimes infeasible. Recently, self-supervised learning (SSL) approaches have provided a viable alternative to training ML models with minimal annotations. Here, we present an SSL approach to classify 22 insect pests. The framework was assessed on raw and segmented field-captured images using three different SSL methods, Nearest Neighbor Contrastive Learning of Visual Representations (NNCLR), Bootstrap Your Own Latent, and Barlow Twins. SSL pre-training was done on ResNet-18 and ResNet-50 models using all three SSL methods on the original RGB images and foreground segmented images. The performance of SSL pre-training methods was evaluated using linear probing of SSL representations and end-to-end fine-tuning approaches. The SSL-pre-trained convolutional neural network models were able to perform annotation-efficient classification. NNCLR was the best performing SSL method for both linear and full model fine-tuning. With just 5% annotated images, transfer learning with ImageNet initialization obtained 74% accuracy, whereas NNCLR achieved an improved classification accuracy of 79% for end-to-end fine-tuning. Models created using SSL pre-training consistently performed better, especially under very low annotation, and were robust to object class imbalances. These approaches help overcome annotation bottlenecks and are resource efficient.
of the day. Due to the presence of different insect species in varying numbers, we have an imbalanced dataset across insect species classes, which was desirable for the objective of this research project. Additional variation was created due to the imaging of insects in various crops, leaf, or stem in the background, different zooms while taking images, and variation in types of insect species present. It was noticed that the insects appeared at the top of the canopy mostly during the early morning or evening hours, when the temperature and environmental conditions were mild. This characteristic in insect sightings was also reported by Tetila et al. (2020). However, some insects like the Japanese beetles could be found in clusters throughout the day. We did not experience any challenge in collecting sufficient images for 21 of 22 insect species. However, fall armyworm (FAW) (Spodoptera frugiperda) was difficult to image because it was rarely sighted compared to the other insects. Hence, the larvae were first reared and grown in the lab and then imaged with varying background conditions to get a sizable number of FAW images.
To incorporate variability in the dataset, the images were also taken from varying camera angles with an intent to serve as a natural augmentation technique in training the models. Thus, the mentioned insect-pest dataset includes both between- and within-species variability in terms of type, size, shape, and overall visual features. All these phenotyping efforts led to the creation of “IA insect-pest dataset 22,” that is, IA-IP22, which comprises 14,665 images across 22 insects (Figure 1), and the number of insects per class varies from 95 to 1653 (Figure 2). As a few insects were extremely tiny to photograph, a very close-range 5× zoomed mode was primarily used; however, the zoom level differed based on the insect type and their location on the plant canopy. In the following section, the challenges faced with and the methodology for handling such data are demonstrated.

2.2 Challenges in classifier training

Using this dataset is challenging from the ML perspective due to the following reasons:

i. Several classes had large intra-class variability in size, shape, color, patterns, and texture.
ii. Insects from different classes looked very similar, that is, very small inter-class variability.
iii. The dataset was highly class imbalanced.
iv. There was a large background compared to the insect or the foreground.
v. Due to varying illumination conditions in a day, shadow effects were also found.
vi. Many images consisted of multiple instances of the same insect, in cases where insects were found clustered on a leaf or flower, creating an impression of overlapping objects.
vii. Insects from different classes were found together in the same image.

These variabilities (Figure 3) not only make the classification task challenging but also make the dataset unique, because it opens up opportunities for solving complex real-world computer vision problems (Singh et al., 2021c).

2.3 Description of the SSL methods

SSL methods differ based on the augmentation approaches and the loss function definitions, which control the selection of the constraints and the way an optimal solution is achieved. The three SSL methods leveraged in this study are briefly described below.

2.4 BYOL

BYOL is a distillation-based SSL method that does not rely on negative samples, unlike contrastive methods. Instead, it works with two networks of the same architecture, the online and the target network. The online network is tasked with learning the representations for an augmented view of an image, then predicting the representations of the target network trained on another augmentation of the same image. While the online network is updated according to the prediction errors, the target network weights are simultaneously updated with the moving averages of the online network weights. Thus, BYOL enables self-supervision by learning interactively from two encoder networks (Grill et al., 2020).

2.5 Barlow Twins

Barlow Twins also leverages two identical networks to learn image features, like BYOL. However, in the Barlow Twins method, embeddings from both the networks trained on different augmentations of the same images are cross-correlated. The model is optimized by making the cross-correlation matrix close to identity, such that the learned embeddings are distortion-agnostic while providing maximized information. The objective function thus tries to minimize redundancy between the representations learned from the networks and works on a simpler concept than BYOL (Zbontar et al., 2021).

2.6 NNCLR

NNCLR exploits a contrastive learning approach, finding positives from other samples closest in the latent space
FIGURE 1 An illustration of some of the insect pest images collected from Iowa State University research fields in Iowa, USA. These represent the variety, type, and quality of the collected images.
FIGURE 2 Plot representing the count of insects per class, arranged in descending order (top to bottom).
6 of 20 KAR ET AL.
FIGURE 3 (A) Single (left) and multiple (right) instances of the same insect, milkweed bug; (B) two examples of similar-looking insects from different classes: (B-i) black soldier beetle (left) and sap beetle (right) and (B-ii) southern corn rootworm (left) and bean leaf beetle (right); (C) two examples of camouflaging background effect with an instance of a northern corn rootworm in each; (D) intra-class variability in the same insect class, bean leaf beetle; (E) multiple insect classes in the same image: (E-i) a lady beetle, one soldier beetle and two northern corn rootworms and (E-ii) a northern and a western corn rootworm; (F) visual similarity between western corn rootworm (left) and striped cucumber beetle (right); (G) instances of both noisy background and multiple insects in the same image: (G-i) northern and western corn rootworm and (G-ii) northern corn rootworm and milkweed bug; (H) background and illumination effects on the foreground, an instance of bean leaf beetle in each.
rather than using augmentations of the same image. This increases semantic variability compared to the latter. The networks thus learn beyond a single discriminative instance, providing better invariance to different viewpoints, deformations, and even intra-class variations. This not only makes the method less reliant on complex data augmentations but also helps with significant improvement in performance in downstream tasks.

2.7 Workflow

The classification framework comprises three major steps: data pre-processing and extraction of segmented images, deriving latent representations through different self-supervised procedures, and finally, classification. Considering the complexity of the images, SSL performance on raw and segmented images was compared via linear evaluation of the representations learned in both cases. Subsequently, supervised fine-tuning was performed to compare supervised versus self-supervised results. In this process, two backbone architectures, ResNet-18 and -50 (RN 18/50), were examined for different sampling strategies (random, random-augmented, diverse, and diverse-augmented), with label fractions of the sample varying from 0.1%, 0.3%, 0.5%, to 100%. All the experiments were repeated three times, and the average results from each method were used to compare SSL and supervised learning (SL) performances. Thus, this paper primarily aims to examine the minimum amount of training data needed to obtain at least 80% classification accuracy, and how efficiently SSL helps in handling class imbalance. The detailed methodology is illustrated in Figure 4.
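The three SSL objectives outlined in Sections 2.4 to 2.6 can be made concrete with a small NumPy sketch. It is illustrative only and is not the solo-learn implementation used in this study; the function names, the momentum `tau`, and the off-diagonal weight are assumptions chosen for the example.

```python
import numpy as np


def ema_update(target, online, tau=0.99):
    """BYOL-style target update: moving average of the online weights.

    `target` and `online` are dicts of arrays with the same keys
    (a hypothetical stand-in for real network parameters).
    """
    return {name: tau * target[name] + (1.0 - tau) * online[name]
            for name in target}


def barlow_twins_loss(z_a, z_b, off_diag_weight=5e-3, eps=1e-9):
    """Push the cross-correlation matrix of two views toward the identity.

    z_a, z_b: (batch, dim) embeddings of two augmentations of the same images.
    off_diag_weight is an illustrative value, not the one tuned in the paper.
    """
    n = z_a.shape[0]
    # Standardize every embedding dimension over the batch.
    z_a = (z_a - z_a.mean(axis=0)) / (z_a.std(axis=0) + eps)
    z_b = (z_b - z_b.mean(axis=0)) / (z_b.std(axis=0) + eps)
    c = z_a.T @ z_b / n                      # (dim, dim) cross-correlation
    diag = np.diag(c)
    on_diag = np.sum((diag - 1.0) ** 2)      # invariance term
    off_diag = np.sum(c ** 2) - np.sum(diag ** 2)  # redundancy term
    return float(on_diag + off_diag_weight * off_diag)


def nearest_neighbor_positives(z, support):
    """NNCLR-style positives: closest support embedding by cosine similarity."""
    z_n = z / np.linalg.norm(z, axis=1, keepdims=True)
    s_n = support / np.linalg.norm(support, axis=1, keepdims=True)
    idx = np.argmax(z_n @ s_n.T, axis=1)
    return support[idx]
```

In this sketch, two identical views give a much smaller Barlow Twins loss than two unrelated batches, which is the behavior the redundancy-reduction objective is meant to reward.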
FIGURE 4 Detailed methodology flowchart, representing the two input sets, raw/segmented, the backbone architectures ResNet-18 and 50
(RN 18/50), sampling strategies—random, random-augmented, diverse, and diverse-augmented, and the labeled fractions of sample varying from
0.1%, 0.3%, 0.5%, to 100%.
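The design summarized in Figure 4 can be enumerated programmatically. The sketch below assumes a full factorial grid over input type, backbone, SSL method, sampling strategy, and label fraction, which the paper does not state explicitly; the identifiers are hypothetical, while the label percentages and the training-set size of 10,725 images are the values reported in Section 2.8.

```python
from itertools import product

TRAIN_SIZE = 10_725  # training images reported in Section 2.8
INPUTS = ["raw", "segmented"]
BACKBONES = ["resnet18", "resnet50"]
SSL_METHODS = ["nnclr", "byol", "barlow_twins"]
SAMPLING = ["random", "diverse", "random-augmented", "diverse-augmented"]
LABEL_PERCENTS = [0.1, 0.3, 0.5, 0.7, 1, 3, 5, 7, 10, 30, 50, 70, 100]


def labeled_count(percent, train_size=TRAIN_SIZE):
    """Number of labeled images for a label fraction given in percent.

    Assumption: no over-sampling, so this applies to the non-augmented sets.
    """
    return round(train_size * percent / 100)


def experiment_grid():
    """All combinations under the assumed full factorial design."""
    return [
        {"input": i, "backbone": b, "ssl": m, "sampling": s, "label_percent": p}
        for i, b, m, s, p in product(
            INPUTS, BACKBONES, SSL_METHODS, SAMPLING, LABEL_PERCENTS
        )
    ]
```

For example, `labeled_count(3)` gives 322 images, matching the figure quoted for the 3% fraction in the results.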
2.8 Data pre-processing (raw and segmented) and pre-training setup

The dataset was first cleaned up by removing the duplicate and empty images. Then it was partitioned in an approximately 70:15:15 ratio yielding 10,725 training images, 2081 validation, and 1859 test images. All the images were resized to 224 × 224 dimensions for processing efficiency, and then, training samples were labeled in increasing proportions (in percent), that is, [0.1, 0.3, 0.5, 0.7, 1, 3, 5, 7, 10, 30, 50, 70, 100]. This approach would eventually help identify the amount of training data ideally required for reasonable SSL performance. In this study, four different sampling strategies were adopted: (a) random, (b) diverse (by selecting diverse samples from the latent space of the encoder output; Bortolato et al., 2022), (c) random-augmented, and (d) diverse-augmented. Although the two former training sample sets (a, b) included imbalanced classes, the latter two (c, d) were augmented via over-sampling to ensure balanced classes. Again, this strategy was adopted to test the impact of an imbalanced dataset on SSL results. Thus, there were four training sets that differed in the sampling strategy. The entire training set was replicated, and each image was segmented, to create the segmented training samples, such that both the raw and segmented training data contained the same images. The classification framework was then applied to both in parallel for subsequent analysis of any difference in performance. This study hypothesizes a possible improvement in the performance of downstream tasks if much of the noisy background is first removed via segmentation before executing the pre-training methods. The visual difference between the raw and the segmented images is shown (Figure 5) in BGR format, the default format used by the OpenCV library employed for the image pre-processing operations in this study. In the segmented images, much of the background is removed; however, essential visual patterns in the foreground are retained, for example, the bean leaf beetle, and the aphid images are desirably segmented despite very high similarity between the foreground and background.
For segmentation, the local entropy-based (Hržić et al., 2019) masking approach was leveraged to segment an image based on the level of complexity contained in a given neighborhood, defined by the structuring element, disk radius. The entropy filter first detects subtle variations in the local gray level distribution in the defined neighborhood and captures the inherent properties of the transition regions. Image binarization was then performed using a threshold of 0.8 to obtain the mask. On applying the resultant mask to the grayscale image, only those portions of the image that exceeded the threshold were retained. The resultant entropy-masked image was then converted back to the color image format, which now represented the foreground, which was segmented from the background. In this process, for each insect class, the foreground-object texture was selectively segregated using entropy, by varying the disk radius from 5 to 20. Satisfactory segmentation results were empirically achieved for a disk radius of 20 for southern corn rootworm and flea beetle; for stink bug, northern corn rootworm, and flea beetle, it was 15,
FIGURE 5 Examples of raw (top row) and corresponding segmented (bottom row) images are shown (in BGR format) for specific insect
classes, northern corn rootworm, flea beetle, corn earworm larvae, bean leaf beetle, and aphids.
and for the remaining insect classes, 5. Similarly, the threshold for masking was also empirically chosen to be 0.8. This segmentation method was adopted because it takes image texture into account rather than color variations and is simpler and remarkably faster than other reported methods like the Simple Linear Iterative Clustering superpixel segmentation (Stutz, 2015). Thus, once the datasets were prepared, pre-training was performed for 800 epochs by employing the SSL methods described above. Two backbone architectures were compared during pre-training, ResNet-18 and ResNet-50, initialized with ImageNet weights (Krizhevsky et al., 2017). The hyperparameters were fine-tuned for each of the methods (Table 1), and the model checkpoint with the lowest training and validation loss was saved for the downstream task. For training optimization, the stochastic gradient descent optimizer was used for each of the experiments, and the models were trained using ReLU activations in the convolutional and dense layers.

2.9 Linear probing versus end-to-end fine-tuning

We used two different types of evaluation for the SSL methods as shown in Figure 6. To evaluate the transfer of representations, a popular evaluation protocol is to freeze the backbone model and train a linear classifier on the final layer representation (Kolesnikov et al., 2019) as shown in Figure 6a. This method is used to understand the effectiveness of SSL representations for downstream classification. Here, we froze the ResNet backbone model and used the representation from the final layer of the model to train a linear classifier. A linear classifier with 512 input nodes was used for the ResNet-18 model, and a linear classifier with 2048 input nodes was used for the ResNet-50 model. We used different label fractions of training sets (0.1%, 0.3%, 0.5%, 0.7%, 1%, 3%, 5%, 7%, 10%, 30%, 50%, 70%, and 100%) for the classifier. All the linear probing experiments were repeated three times. We also evaluated the SSL model initializations as shown in Figure 6b. For this, we fine-tuned the model end-to-end using supervised learning. We used different label fractions of training sets (0.1%, 0.3%, 0.5%, 0.7%, 1%, 3%, 5%, 7%, and 10%) for fine-tuning the classifier. Unlike the linear probing evaluation, here we focus on assessing performance when there is a limited budget for labeling (set to 10% of the dataset). All the fine-tuning experiments were repeated three times, and the average results from each method were used to compare SSL and SL performances.

2.10 Performance metrics

We calculate the multi-class classification accuracy from the confusion matrix: true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN). TP and TN are the samples that were correctly classified by the model and are shown on the main diagonal of the confusion matrix. FP and FN are the samples that were incorrectly classified by the model. From these values, the classification accuracy, precision, recall, and F1-score are calculated as follows:

\[ \mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{1} \]

\[ \mathrm{Precision} = \frac{TP}{TP + FP} \tag{2} \]
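Only Equations (1) and (2) appear in this section as extracted. As a hedged illustration, the sketch below computes all four metrics named above from a multi-class confusion matrix, using the standard definitions Recall = TP / (TP + FN) and F1 = 2 · Precision · Recall / (Precision + Recall). It assumes rows are true classes and columns are predicted classes, and it reports overall accuracy as the fraction of samples on the main diagonal, which is one common multi-class reading and may differ from a per-class application of Equation (1). The function name is hypothetical.

```python
import numpy as np


def classification_metrics(confusion):
    """Overall accuracy plus per-class precision, recall, and F1.

    confusion[i, j] = number of samples of true class i predicted as class j
    (this orientation is an assumption of the sketch).
    """
    cm = np.asarray(confusion, dtype=float)
    tp = np.diag(cm)
    fp = cm.sum(axis=0) - tp   # predicted as the class, but belonging elsewhere
    fn = cm.sum(axis=1) - tp   # belonging to the class, but predicted elsewhere

    def safe_div(num, den):
        # Return 0 where the denominator is 0 (class never predicted or absent).
        return np.divide(num, den, out=np.zeros_like(num), where=den > 0)

    precision = safe_div(tp, tp + fp)
    recall = safe_div(tp, tp + fn)
    f1 = safe_div(2 * precision * recall, precision + recall)
    accuracy = tp.sum() / cm.sum()
    return accuracy, precision, recall, f1
```

Averaging the per-class arrays gives macro-averaged scores, which weight rare and frequent classes equally and are therefore informative for an imbalanced dataset such as this one.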
TABLE 1 The values of hyperparameters tuned during pre-training of each self-supervised learning (SSL) model.
Note: The list is as per the hyperparameters provided in the solo-learn library (da Costa et al., 2022).
Abbreviations: BYOL, Bootstrap Your Own Latent; NNCLR, Nearest Neighbor Contrastive Learning of Visual Representations; sgd, stochastic gradient descent.
FIGURE 6 Illustration of (A) linear classification and (B) end-to-end fine-tuning methods, which were used to compare the accuracy of
self-supervised learning (SSL) methods. In (A), only weights of the last fully connected layer are fine-tuned, and in (B), all model weights are
fine-tuned in the end-to-end evaluation.
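The linear-probing setting in panel (A) can be sketched as softmax regression trained on fixed features. This is a minimal NumPy illustration, not the training code used in the study: `features` stands in for the frozen backbone outputs (512-dimensional for ResNet-18, 2048-dimensional for ResNet-50), and the function names, learning rate, and epoch count are assumptions. End-to-end fine-tuning, panel (B), would additionally update the backbone weights and is not shown.

```python
import numpy as np


def softmax(logits):
    """Row-wise softmax with the usual max-subtraction for stability."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)


def train_linear_probe(features, labels, num_classes, lr=0.1, epochs=200):
    """Fit only a linear layer on frozen features with full-batch gradient descent."""
    n, dim = features.shape
    weights = np.zeros((dim, num_classes))
    bias = np.zeros(num_classes)
    one_hot = np.eye(num_classes)[labels]
    for _ in range(epochs):
        probs = softmax(features @ weights + bias)
        grad = (probs - one_hot) / n          # gradient of mean cross-entropy
        weights -= lr * features.T @ grad     # the features themselves never change
        bias -= lr * grad.sum(axis=0)
    return weights, bias


def predict(features, weights, bias):
    """Predicted class index for each feature vector."""
    return np.argmax(features @ weights + bias, axis=1)
```

Because the features are never updated, the probe's accuracy reflects only how linearly separable the pre-trained representation already is.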
FIGURE 7 The mean (across all four sampling strategies and three repetitions) self-supervised learning (SSL) performance with both
ResNet-18 and 50 (RN18/RN50) backbones is plotted for raw and segmented datasets.
difference was reduced, still leading to an average of 3% increment with 100% training samples. This shows that entropy-based image segmentation combined with the NNCLR-SSL method could be a highly annotation-efficient solution with greater than 70% accuracy, even with a very low sample size of 3% (i.e., 3% of 10,725 ≈ 322 images in this case).
Currently, there are varied SSL implementations for solving fine-grained image classification problems, for example, semantic learning from the discriminative feature-representations of image parts (Yang et al., 2022; Yu et al., 2022), part-level contrastive learning (Wang et al., 2022), and attentively identifying fine-grained images by interaction (Zhuang et al., 2020). However, this study shows the ability of local entropy-mask segmentation in enhancing SSL performance to classify insect pests from complex images, as segmentation helps retain mostly the foreground portions that accentuate the learning of more meaningful representations during the pretext task, compared to the raw images. In the latter case, some of the latent representations could belong to the image background, which is intuitively not very helpful in generalizing the downstream task. Utilizing image segmentation for aiding supervised classification performance has been found to be beneficial in previous studies (Liu, HaoChen, et al., 2021; Mahbod et al., 2020). Additionally, it may be noted that such improvement in model effectiveness was achieved from “local” entropy-mask-based segmentation that may still be influenced by external factors like illumination and occlusion. Hence, as a future research direction, the “locally adaptive” entropy-based thresholding (Zhang et al., 2022), which is a rather computationally expensive approach, can be tested to determine the change in performance.
Regarding the backbone architecture, ResNet-50-based experiments yielded 5%–9% higher accuracy than the ResNet-18-based experiments when the sample size was 100%. However, when the training size was just 1% or less, ResNet-18-based experiments seemed to achieve an average of ∼3% higher accuracy than the ResNet-50 ones. Such an effect was prominent in the BYOL and Barlow Twins methods. However, in the case of NNCLR, ResNet-50 proved beneficial across all the sample sizes with a 4% increase in accuracy on average, both on raw and segmented datasets. This suggests that when the training size is extremely low, simpler architectures are better for information maximization or distillation-based SSL methods. However, based on the overall result from the comparison between the backbone architectures, the sampling strategies were examined for the ResNet-50-based models (Figure 8).
There was no improvement noticed with the augmented dataset containing balanced classes, on any of the three SSL methods. It was observed that classification accuracy instead dropped with diverse-augmented samples, particularly if the proportion of labeled samples in the training set was less than 10%. This could have potentially resulted from over-sampling that led to overfitting for specific classes, where the model tried to learn all the data points including noise and inaccurate values present in the dataset, thereby reducing model accuracy (Santos et al., 2018). There were very minor to no differences noticed in the performance between the random and diverse sampling strategies. In addition, in the case of random sampling, results from both the imbalanced (raw) and balanced (raw-augmented) datasets were nearly identical with no noticeable difference. Thus, these findings confirmed that SSL methods are robust to class imbalance, as also suggested in Liu, Zhang et al. (2021), and these methods can achieve better performance with segmented images. Therefore, the subsequent results demonstrate the performance difference between linear probing and fine-tuning, based only on the randomly sampled segmented images, and do not include the diverse and augmented cases.

3.2 Fine-tuning evaluation

Figure 9 shows the performance of end-to-end fine-tuning results of ResNet-18 and ResNet-50 models. All the fine-tuning experiments were repeated three times, and the mean classification accuracy across the three repetitions is shown in Figure 9a,b. NNCLR was the best performing SSL method. For 5% of the labeled samples, the NNCLR method obtained a mean classification accuracy of ∼79% for the ResNet-50 model and an accuracy of ∼74% for the ResNet-18 model. All the SSL pre-training methods outperformed the supervised baseline for end-to-end fine-tuning evaluation. The SSL pre-training methods were more annotation efficient than ImageNet initialization for training fractions less than 5%. The performance of ImageNet initialization was on par with SSL methods for training fractions greater than 5%. These results were as expected because evidence suggests that the benefit of SSL models increases with the availability of larger amounts of unlabeled data for pre-training. Among the SSL methods, Barlow Twins had the lowest performance. For 10% training data, the ResNet-50 model obtained a mean classification accuracy of 86% and was ∼4% better than the ResNet-18 model.
Figures 10 and 11 show the confusion matrices of the ResNet-50 model with ImageNet and NNCLR initializations, respectively. The model was trained with 7% of labeled data, and the input images were pre-processed with entropy-based segmentation. For confounding classes, like bean leaf beetle and ladybird beetle, the NNCLR model performed better than ImageNet initialization. The NNCLR initialization obtained an accuracy of 96% for bean leaf beetle, whereas the ImageNet model obtained an accuracy of 78%. Similarly, for the confounding classes like FAW and corn earworm larvae, the NNCLR model obtained accuracies of 97% and 90%, respectively, whereas the ImageNet model obtained accuracies of 89% and 92%, respectively.
FIGURE 8 Comparison of the impact of different sampling strategies on each of the self-supervised learning (SSL) methods. For brevity, the
results are plotted for sample sizes of 1%, 5%, 10%, 50%, and 100%, which potentially capture the overall pattern of improvement in classification
accuracy as the sample size increases.
FIGURE 9 End-to-end fine-tuning evaluation of (A) ResNet-18 and (B) ResNet-50 models using segmented images. The “Supervised” curve
corresponds to training from random initialization. The models were fine-tuned for different label percentage fractions (0.1%, 0.3%, 0.5%, 0.7%, 1%,
3%, 5%, 7%, and 10%).
FIGURE 10 Confusion matrix for ImageNet-initialized ResNet-50. The model was trained with 7% of labeled images. The input images were
pre-processed with entropy-based segmentation for removing the background. The 22 classes are “Aphids”: 0, “Bean leaf beetle”: 1, “Corn earworm
larvae”: 2, “Fall armyworm”: 3, “Flea beetle”: 4, “Green lacewing”: 5, “Green leaf hopper”: 6, “Japanese beetle”: 7, “Ladybird beetle”: 8, “Maize
calligrapher”: 9, “Milkweed bug”: 10, “Northern corn rootworm beetle”: 11, “Sap beetle”: 12, “Silver spotted caterpillars”: 13, “Soldier beetle”: 14,
“Southern corn rootworm beetle”: 15, “Soybean nodule fly”: 16, “Stink bug”: 17, “Striped cucumber beetle”: 18, “Tarnished plant bug”: 19,
“Western corn rootworm beetle”: 20, “White fly”: 21.
FIGURE 11 Confusion matrix for Nearest Neighbor Contrastive Learning of Visual Representations (NNCLR) initialized ResNet-50 model
trained on segmented images. The model was trained with 7% of labeled images. The input images were pre-processed with entropy-based
segmentation for removing the background. The 22 classes are “Aphids”: 0, “Bean leaf beetle”: 1, “Corn earworm larvae”: 2, “Fall armyworm”: 3,
“Flea beetle”: 4, “Green lace wing”: 5, “Green leaf hopper”: 6, “Japanese beetle”: 7, “Ladybird beetle”: 8, “Maize calligrapher”: 9, “Milkweed bug”:
10, “Northern corn rootworm beetle”: 11, “Sap beetle”: 12, “Silver spotted caterpillars”: 13, “Soldier beetle”: 14, “Southern corn rootworm beetle”:
15, “Soybean nodule fly”: 16, “Stink bug”: 17, “Striped cucumber beetle”: 18, “Tarnished plant bug”: 19, “Western corn rootworm beetle”: 20,
“White fly”: 21.
TABLE 2 Precision obtained for each of the 22 classes at 5%, 7%, and 10% proportions of training data, from the ImageNet and Nearest Neighbor Contrastive Learning of Visual Representations (NNCLR) models.

                                   5% labeled          7% labeled          10% labeled
Class: index                       ImageNet  NNCLR     ImageNet  NNCLR     ImageNet  NNCLR
Aphids: 0                          0.926     0.955     0.958     0.987     0.990     0.999
Bean leaf beetle: 1                0.546     0.503     0.696     0.653     0.728     0.685
Corn earworm larvae: 2             0.850     0.859     0.925     0.934     0.957     0.966
Fall armyworm: 3                   0.820     0.992     0.825     0.997     0.857     0.999
Flea beetle: 4                     0.833     0.764     0.888     0.819     0.920     0.871
Green lace wing: 5                 0.541     0.884     0.556     0.899     0.676     0.991
Green leaf hopper: 6               0.884     0.763     0.909     0.863     0.941     0.955
Japanese beetle: 7                 0.668     0.812     0.743     0.887     0.863     0.919
Ladybird beetle: 8                 0.775     0.905     0.820     0.950     0.852     0.982
Maize calligrapher: 9              0.485     0.757     0.520     0.792     0.640     0.824
Milkweed bug: 10                   0.976     1.000     0.976     1.000     0.978     1.000
Northern corn rootworm beetle: 11  0.417     0.401     0.467     0.451     0.787     0.483
Sap beetle: 12                     0.959     0.945     0.974     0.960     0.976     0.992
Silver spotted caterpillars: 13    0.967     0.956     0.972     0.981     0.974     0.983
Soldier beetle: 14                 0.639     0.699     0.789     0.849     0.841     0.881
Southern corn rootworm beetle: 15  0.753     0.831     0.828     0.906     0.860     0.938
Soybean nodule fly: 16             0.430     0.472     0.505     0.547     0.625     0.579
Stink bug: 17                      0.776     0.643     0.871     0.738     0.903     0.770
Striped cucumber beetle: 18        0.880     0.860     0.935     0.915     0.967     0.947
Tarnished plant bug: 19            0.747     0.788     0.812     0.853     0.932     0.885
Western corn rootworm beetle: 20   0.724     0.659     0.789     0.724     0.821     0.756
White fly: 21                      0.644     0.872     0.699     0.947     0.731     0.979
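As a worked check on how the per-class values in Table 2 relate to an overall comparison, the sketch below macro-averages the two 5% columns. It assumes that an overall precision gain is meant as the difference between unweighted class means, expressed in percentage points; that reading is an assumption, and the variable names are hypothetical. The values are copied from the 5% columns of Table 2 in class-index order.

```python
# Per-class precision at 5% labeled data, class indices 0 to 21 (Table 2).
imagenet_5 = [0.926, 0.546, 0.850, 0.820, 0.833, 0.541, 0.884, 0.668,
              0.775, 0.485, 0.976, 0.417, 0.959, 0.967, 0.639, 0.753,
              0.430, 0.776, 0.880, 0.747, 0.724, 0.644]
nnclr_5 = [0.955, 0.503, 0.859, 0.992, 0.764, 0.884, 0.763, 0.812,
           0.905, 0.757, 1.000, 0.401, 0.945, 0.956, 0.699, 0.831,
           0.472, 0.643, 0.860, 0.788, 0.659, 0.872]


def macro_average(values):
    """Unweighted mean over classes, so rare classes count as much as common ones."""
    return sum(values) / len(values)


# Difference between the two macro-averaged precisions, as a fraction.
gap = macro_average(nnclr_5) - macro_average(imagenet_5)
print(f"Macro-averaged precision gap at 5% labels: {100 * gap:.2f} percentage points")
```

Under this assumption the gap comes out at roughly 4.9 percentage points in favor of NNCLR.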
The precision, recall, and F1-scores are presented in Tables 2–4. Overall, the NNCLR model yielded 4.9%, 5.43%, and 2.56% better precision than the ImageNet model with 5%, 7%, and 10% labeling of the training samples, respectively. Similarly, the NNCLR model’s recall was higher by 2.07%, 4.12%, and 2.0%, whereas the F1-score improved by 2.46%, 4.07%, and 0.52% for 5%, 7%, and 10% labeled fractions of the training set. Nevertheless, it was interesting to note that some classes, for example, bean leaf beetle, northern corn rootworm beetle, and stink bug, could be classified with better precision by the ImageNet model, while the corresponding recall scores from the NNCLR model were higher. This implied that the NNCLR model produced fewer FN, that is, it was better at identifying both positive and negative samples of the classes with high intra-class variability like the bean leaf beetle, and the northern corn rootworm beetle that is tan to pale green in color and easily camouflages with the background in the field. Considering all the three sampling scenarios, the NNCLR-based recall for the northern corn rootworm was 12% higher than that of ImageNet. In contrast, western corn rootworm beetle was the only class for which the ImageNet classifier performed better in all the three metrics, with a mean increase of ∼6% (precision), 2% (recall), and 5% (F1-score) across the three scenarios with 5%, 7%, and 10% labeled data. However, for the minority classes like the green lacewing and the maize calligrapher, NNCLR performed remarkably better. In the case of green lacewing, precision and recall were higher by 33.3% and 22.3%, whereas for maize calligrapher, the respective scores were up by 24.3% and 15%. Another notable example demonstrating the efficiency of the SSL-pre-trained model in correctly classifying a confounding class is that of the southern corn rootworm beetle (with ∼8% higher precision, recall, and F1-score), which looks very similar to a bean leaf beetle (Figure 3b-ii).
These classification results show that the NNCLR model that was trained on smaller in-domain unlabeled data was able to obtain good accuracy for challenging classes with few labels compared to the ImageNet model that was pre-trained on
TABLE 3 Recall obtained for each of the 22 classes at 5%, 7%, and 10% proportions of training data, from the ImageNet and Nearest Neighbor Contrastive Learning of Visual Representations (NNCLR) models.

                                   5% labeled          7% labeled          10% labeled
Class: index                       ImageNet  NNCLR     ImageNet  NNCLR     ImageNet  NNCLR
Aphids: 0                          0.811     0.922     0.836     0.947     0.891     0.972
Bean leaf beetle: 1                0.781     0.959     0.806     0.984     0.861     0.999
Corn earworm larvae: 2             0.894     0.973     0.919     0.988     0.944     0.993
Fall armyworm: 3                   0.915     0.903     0.920     0.958     0.945     0.983
Flea beetle: 4                     0.699     0.519     0.849     0.669     0.944     0.819
Green lace wing: 5                 0.230     0.412     0.480     0.662     0.505     0.812
Green leaf hopper: 6               0.857     0.929     0.902     0.974     0.927     0.989
Japanese beetle: 7                 0.821     0.660     0.846     0.910     0.941     0.965
Ladybird beetle: 8                 0.730     0.742     0.805     0.817     0.955     0.842
Maize calligrapher: 9              0.189     0.379     0.264     0.454     0.414     0.479
Milkweed bug: 10                   0.683     0.683     0.883     0.883     0.938     0.908
Northern corn rootworm beetle: 11  0.691     0.823     0.791     0.923     0.846     0.948
Sap beetle: 12                     0.750     0.450     0.845     0.545     0.900     0.570
Silver spotted caterpillars: 13    0.858     0.929     0.923     0.974     0.948     0.999
Soldier beetle: 14                 0.858     0.964     0.898     0.994     0.923     0.999
Southern corn rootworm beetle: 15  0.772     0.771     0.797     0.921     0.822     0.946
Soybean nodule fly: 16             0.653     0.578     0.718     0.633     0.868     0.658
Stink bug: 17                      0.675     0.741     0.805     0.834     0.900     0.859
Striped cucumber beetle: 18        0.940     0.980     0.953     0.993     0.978     0.998
Tarnished plant bug: 19            0.382     0.379     0.455     0.452     0.805     0.547
Western corn rootworm beetle: 20   0.837     0.824     0.902     0.889     0.957     0.914
White fly: 21                      0.778     0.741     0.793     0.891     0.888     0.946
large, labeled data from out-of-domain. This showed that SSL could solve fine-grained inter- and intra-class classification problems, because the bean leaf beetle class contained high intra-class variability, whereas the confounding classes had fine-grained inter-class variability. As the proportion of labeled samples increased from 5% to 10%, the recall, or the ability of the SSL method to correctly identify the bean leaf beetle images, increased from 95.9% to 99.9%, compared to a recall of 86.1% by the ImageNet model when trained with just 10% labeled samples. Similar patterns in the results were also observed in the case of confounding classes like green lace wing and the green leaf hopper, also identified as one of the minority classes in the dataset. Aphids is another class with high fine-grained variability, which could be classified with 92% accuracy with 7% training using SSL, whereas the ImageNet method’s accuracy was 11% lower.
Such robustness of SSL to dataset imbalance could be attributed to its ability to learn richer features that are transferable across layers to help classify the rare classes and downstream tasks (Liu, Zhang, et al., 2021; Yang & Xu, 2020). More specifically, SSL is not actuated by any labels, unlike the SL approach. Hence, SSL is not limited to learning only the label-relevant features that help predict the frequent classes, but rather a diverse set of generalizable representations, including both label-relevant and irrelevant features from unlabeled data. Learning during the pretext task also contributes to the representation-invariance property of an SSL model (Tendle & Hasan, 2021), such that it captures the ingrained characteristics of the input distribution that are generalizable or transferable to downstream tasks. Therefore, SSL methods can generalize to rare classes better than SL approaches. SSL’s robustness to class imbalance is thoroughly demonstrated by Liu, Zhang et al. (2021), and the generalizability of self-supervised representations is discussed by Tendle and Hasan (2021).
Overall, the SSL methods provide an exciting opportunity and application in the plant science domain. At the same time, there are several open questions that require future
TABLE 4 F1-score obtained for each of the 22 classes at 5%, 7%, and 10% proportions of training data, from the ImageNet and Nearest Neighbor Contrastive Learning of Visual Representations (NNCLR) models.
Training data proportion: 5% 5% 7% 7% 10% 10%
Class: index ImageNet NNCLR ImageNet NNCLR ImageNet NNCLR
Aphids: 0 0.864 0.938 0.893 0.967 0.938 0.985
Bean leaf beetle: 1 0.643 0.660 0.747 0.785 0.789 0.813
Corn earworm larvae: 2 0.871 0.913 0.922 0.960 0.950 0.979
Fall armyworm: 3 0.865 0.945 0.870 0.977 0.899 0.991
Flea beetle: 4 0.760 0.618 0.868 0.736 0.932 0.844
Green lace wing: 5 0.323 0.562 0.515 0.762 0.578 0.892
Green leaf hopper: 6 0.870 0.838 0.905 0.915 0.934 0.972
Japanese beetle: 7 0.737 0.728 0.791 0.898 0.900 0.941
Ladybird beetle: 8 0.752 0.815 0.812 0.878 0.901 0.906
Maize calligrapher: 9 0.272 0.505 0.351 0.577 0.503 0.606
Milkweed bug: 10 0.804 0.811 0.927 0.938 0.958 0.952
Northern corn rootworm beetle: 11 0.520 0.539 0.587 0.606 0.816 0.640
Sap beetle: 12 0.842 0.610 0.905 0.695 0.936 0.724
Silver spotted caterpillars: 13 0.910 0.942 0.947 0.977 0.961 0.991
Soldier beetle: 14 0.733 0.810 0.840 0.916 0.880 0.936
Southern corn rootworm beetle: 15 0.763 0.800 0.813 0.913 0.841 0.942
Soybean nodule fly: 16 0.518 0.520 0.593 0.587 0.726 0.616
Stink bug: 17 0.722 0.689 0.837 0.783 0.902 0.812
Striped cucumber beetle: 18 0.909 0.916 0.944 0.953 0.973 0.972
Tarnished plant bug: 19 0.505 0.512 0.583 0.591 0.864 0.676
Western corn rootworm beetle: 20 0.777 0.732 0.842 0.798 0.884 0.827
White fly: 21 0.705 0.801 0.743 0.918 0.802 0.962
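Table 4 reports F1-scores, the harmonic mean of precision and recall. Read together with the recall values in Table 3, this allows the precision implied for a class to be backed out, assuming both tables were computed from the same test predictions (an assumption; it is not stated in this section). The Python sketch below is a worked illustration of that arithmetic. The helper names `f1_score` and `implied_precision` are hypothetical, and the only numbers used are the bean leaf beetle values for NNCLR at 10% labeled data (recall 0.999, F1-score 0.813).

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def implied_precision(f1: float, recall: float) -> float:
    """Solve F1 = 2PR / (P + R) for P, giving P = F1 * R / (2R - F1)."""
    denominator = 2 * recall - f1
    if denominator <= 0:
        raise ValueError("No valid precision exists for this F1/recall pair.")
    return f1 * recall / denominator


# Bean leaf beetle, NNCLR, 10% labeled data: recall from Table 3, F1 from Table 4.
bean_leaf_beetle_precision = implied_precision(f1=0.813, recall=0.999)
print(round(bean_leaf_beetle_precision, 3))  # 0.685
```

Under that assumption, a recall near 1.0 paired with an F1-score of 0.813 corresponds to a precision of roughly 0.69, which would indicate that some images of other classes are being assigned to the bean leaf beetle class.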
research. The SSL-based insect-pest identification should investigate (a) designing pretext classes specifically for insect-pest classification, (b) using class-specific loss functions, (c) pre-training with both out-of-domain and in-domain data, and (d) developing a mobile application for farmers and breeders.

4 CONCLUSIONS

This paper presents an IA insect-pest dataset that generates exciting opportunities for researchers and practitioners to utilize the dataset in ML model development. This dataset includes (a) several classes with large intra-class variability in size, shape, color, patterns, and texture; (b) insects from different classes that look similar; (c) high class imbalance; (d) large background noise compared to the insect or the foreground; (e) varying illumination conditions and shadows; (f) overlapping objects in the image; and (g) multiple insect-pest species in the same image frame. Using this insect-pest dataset, we thoroughly investigated different SSL methods, with and without prior image segmentation, to circumvent data annotation challenges that plague plant scientists as biological systems are inherently very complex. We found that SSL-pre-trained models were annotation efficient for insect-pest classification. For learning with few labels, the model initializations and latent representations from NNCLR were better than those from the ImageNet model. Pre-training with segmented input images provided better performance than pre-training with the original images. All the SSL methods performed better than the supervised baseline for both linear probing and end-to-end evaluation. The SSL-pre-trained models were robust to class imbalances and were able to differentiate confounding insect classes. These results indicate the usefulness of SSL methods, especially with segmented images, for addressing data labeling/annotation challenges, saving time, cost, physical resources, and computation, and integrating phone-based imaging with an ML pipeline that can work across geographies to help identify and eventually control insect pests in the field. SSL models from our paper will be efficient in solving a variety of plant phenomics problems, which includes the early detection of insect pests,
framework for plant stress phenotyping. Proceedings of the National Academy of Sciences of the United States of America, 115(18), 4613–4618.
Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P. H., Buchatskaya, E., Doersch, C., Pires, B. A., Guo, Z. D., Azar, M. G., Piot, B., Kavukcuoglu, K., Munos, R., & Valko, M. (2020). Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems, 33, 21271–21284.
Gullino, M., Albajes, R., Al-Jboory, I., Angelotti, F., Chakraborty, S., Garrett, K., Hurley, B., Juroszek, P., Makkouk, K., Pan, X., & Stephenson, T. (2021). Scientific review of the impact of climate change on plant pests. FAO on behalf of the IPPC Secretariat. https://
Hao, G.-F., Zhao, W., & Song, B.-A. (2020). Big data platform: An emerging opportunity for precision pesticides. Journal of Agricultural and Food Chemistry, 68(41), 11317–11319. acs.jafc.0c05584
Hržić, F., Štajduhar, I., Tschauner, S., Sorantin, E., & Lerga, J. (2019). Local-entropy based approach for X-ray image segmentation and fracture detection. Entropy, 21(4), 338. e21040338
Jubery, T. Z., Carley, C. N., Singh, A., Sarkar, S., Ganapathysubramanian, B., & Singh, A. K. (2021). Using machine learning to develop a fully automated Soybean Nodule Acquisition Pipeline (SNAP). Plant Phenomics, 2021, 9834746.
Kahn, G., Abbeel, P., & Levine, S. (2021). BADGR: An autonomous self-supervised learning-based navigation system. IEEE Robotics and Automation Letters, 6(2), 1312–1319. 2021.3057023
Kolesnikov, A., Zhai, X., & Beyer, L. (2019). Revisiting self-supervised visual representation learning. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 1920–1929). IEEE.
Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2017). ImageNet classification with deep convolutional neural networks. Communications of the ACM, 60(6), 84–90. 386
Kulkarni, O. (2018). Crop disease detection using deep learning. 2018 Fourth International Conference on Computing Communication Control and Automation (ICCUBEA) (pp. 1–4). IEEE. 1109/ICCUBEA.2018.8697390
Li, W., Chen, P., Wang, B., & Xie, C. (2019). Automatic localization and count of agricultural crop pests based on an improved deep learning pipeline. Scientific Reports, 9(1), 7024. s41598-019-43171-0
Li, W., Zheng, T., Yang, Z., Li, M., Sun, C., & Yang, X. (2021). Classification and detection of insects from field images using deep learning for smart pest management: A systematic review. Ecological Informatics, 66, 101460.
Liebhold, A., & Bentz, B. (2011). Insect disturbance and climate change. USDA Forest Service, Climate Change Resource Center. www.fs.
Liu, H., HaoChen, J. Z., Gaidon, A., & Ma, T. (2021). Self-supervised learning is more robust to dataset imbalance. arXiv preprint.
Liu, J., & Wang, X. (2021). Plant diseases and pests detection based on deep learning: A review. Plant Methods, 17, 22.
Liu, Y., Zhang, Z., Liu, X., Wang, L., & Xia, X. (2021). Efficient image segmentation based on deep learning for mineral image classification. Advanced Powder Technology, 32(10), 3885–3903.
Mahbod, A., Tschandl, P., Langs, G., Ecker, R., & Ellinger, I. (2020). The effects of skin lesion segmentation on the performance of dermatoscopic image classification. Computer Methods and Programs in Biomedicine, 197, 105725.
Margapuri, V., & Neilsen, M. (2021). Classification of seeds using domain randomization on self-supervised learning frameworks. 2021 IEEE Symposium Series on Computational Intelligence (SSCI) (pp. 01–08). IEEE.
Masood, A., Al-Jumaily, A., & Anam, K. (2015). Self-supervised learning model for skin cancer diagnosis. 2015 7th International IEEE/EMBS Conference on Neural Engineering (NER) (pp. 1012–1015). IEEE.
Misra, I., & van der Maaten, L. (2020). Self-supervised learning of pretext-invariant representations. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 6707–6717). IEEE.
Mohanty, S. P., Hughes, D. P., & Salathé, M. (2016). Using deep learning for image-based plant disease detection. Frontiers in Plant Science, 7, 1419.
Nagasubramanian, K., Jubery, T., Fotouhi Ardakani, F., Mirnezami, S. V., Singh, A. K., Singh, A., Sarkar, S., & Ganapathysubramanian, B. (2021). How useful is active learning for image-based plant phenotyping? The Plant Phenome Journal, 4(1), e20020. 1002/ppj2.20020
Nagasubramanian, K., Singh, A. K., Singh, A., Sarkar, S., & Ganapathysubramanian, B. (2022). Plant phenotyping with limited annotation: Doing more with less. The Plant Phenome Journal, 5, e20051.
Nanni, L., Manfè, A., Maguolo, G., Lumini, A., & Brahnam, S. (2022). High performing ensemble of convolutional neural networks for insect pest image detection. Ecological Informatics, 67, 101515. org/10.1016/j.ecoinf.2021.101515
Osorio, K., Puerto, A., Pedraza, C., Jamaica, D., & Rodríguez, L. (2020). A deep learning approach for weed detection in lettuce crops using multispectral images. AgriEngineering, 2(3), 471–488. org/10.3390/agriengineering2030032
Rairdin, A., Fotouhi, F., Zhang, J., Mueller, D. S., Ganapathysubramanian, B., Singh, A. K., Dutta, S., Sarkar, S., & Singh, A. (2022). Deep learning-based phenotyping for genome wide association studies of sudden death syndrome in soybean. Frontiers in Plant Science, 13, 966244.
Rangarajan, A. K., Purushothaman, R., & Ramesh, A. (2018). Tomato crop disease classification using pre-trained deep learning algorithm. Procedia Computer Science, 133, 1040–1047. 10.1016/j.procs.2018.07.070
Razfar, N., True, J., Bassiouny, R., Venkatesh, V., & Kashef, R. (2022). Weed detection in soybean crops using custom lightweight deep learning models. Journal of Agriculture and Food Research, 8, 100308.
Riera, L. G., Carroll, M. E., Zhang, Z., Shook, J. M., Ghosal, S., Gao, T., Singh, A., Bhattacharya, S., Ganapathysubramanian, B., Singh, A. K., & Sarkar, S. (2021). Deep multiview image fusion for soybean yield estimation in breeding applications. Plant Phenomics, 2021, 9846470.
Santos, M. S., Soares, J. P., Abreu, P. H., Araujo, H., & Santos, J. (2018). Cross-validation for imbalanced datasets: Avoiding overoptimistic and overfitting approaches [research frontier]. IEEE Computational Intelligence Magazine, 13(4), 59–76.
Shook, J., Gangopadhyay, T., Wu, L., Ganapathysubramanian, B., Sarkar, S., & Singh, A. K. (2021). Crop yield prediction integrating genotype and weather variables using deep learning. PLoS One, 16(6), e0252402.
Shurrab, S., & Duwairi, R. (2022). Self-supervised learning methods and applications in medical imaging analysis: A survey. PeerJ Computer Science, 8, e1045.
Singh, A., Ganapathysubramanian, B., Singh, A. K., & Sarkar, S. (2016). Machine learning for high-throughput stress phenotyping in plants. Trends in Plant Science, 21(2), 110–124. tplants.2015.10.015
Singh, A., Jones, S., Ganapathysubramanian, B., Sarkar, S., Mueller, D., Sandhu, K., & Nagasubramanian, K. (2021a). Challenges and opportunities in machine-augmented plant stress phenotyping. Trends in Plant Science, 26(1), 53–69. 07.010
Singh, A. K., Ganapathysubramanian, B., Sarkar, S., & Singh, A. (2018). Deep learning for plant stress phenotyping: Trends and future perspectives. Trends in Plant Science, 23(10), 883–898. 10.1016/j.tplants.2018.07.004
Singh, A. K., Singh, A., Sarkar, S., Ganapathysubramanian, B., Schapaugh, W., Miguez, F. E., Carley, C. N., Carroll, M. E., Chiozza, M. V., Chiteri, K. O., Falk, K. G., Jones, S. E., Jubery, T. Z., Mirnezami, S. V., Nagasubramanian, K., Parmley, K. A., Rairdin, A. M., Shook, J. M., van der Laan, L., . . . Zhang, J. (2021b). High-throughput phenotyping in soybean. In J. Zhou & H. T. Nguyen (Eds.), High-throughput crop phenotyping. Concepts and strategies in plant sciences (1st ed., pp. 129–163). Springer. 1007/978-3-030-73734-4_7
Singh, D. P., Singh, A. K., & Singh, A. (2021c). Plant breeding and cultivar development (1st ed.). Elsevier. 0-01730-2
Skendžić, S., Zovko, M., Živković, I. P., Lešić, V., & Lemić, D. (2021). The impact of climate change on agricultural insect pests. Insects, 12(5), 440.
Stutz, D. (2015). Superpixel segmentation: An evaluation. In J. Gall, P. Gehler, & B. Leibe (Eds.), Pattern recognition. DAGM 2015. Lecture notes in computer science: Vol. 9358 (pp. 555–562). Springer. https://
Tendle, A., & Hasan, M. R. (2021). A study of the generalizability of self-supervised representations. Machine Learning with Applications, 6, 100124.
Tetila, E. C., Machado, B. B., Astolfi, G., de Souza Belete, N. A., Amorim, W. P., Roel, A. R., & Pistori, H. (2020). Detection and classification of soybean pests using deep learning with UAV images. Computers and Electronics in Agriculture, 179, 105836. org/10.1016/j.compag.2020.105836
Thenmozhi, K., & Srinivasulu Reddy, U. (2019). Crop pest classification based on deep convolutional neural network and transfer learning. Computers and Electronics in Agriculture, 164, 104906. org/10.1016/j.compag.2019.104906
Venugoban, K., & Ramanan, A. (2014). Image classification of paddy field insect pests using gradient-based features. International Journal of Machine Learning and Computing, 4, 1–5. IJMLC.2014.V4.376
Waheed, H., Zafar, N., Akram, W., Manzoor, A., Gani, A., & Islam, S. U. (2022). Deep learning based disease, pest pattern and nutritional deficiency detection system for “Zingiberaceae” crop. Agriculture, 12(6),
Wang, C., Fu, H., & Ma, H. (2022). PaCL: Part-level contrastive learning for fine-grained few-shot image classification. Proceedings of the 30th ACM International Conference on Multimedia (pp. 6416–6424). Association for Computing Machinery.
Wang, M., Xu, S., & Zhou, H. (2020). Self-supervised learning for low frequency extension of seismic data. SEG Technical Program Expanded Abstracts 2020 (pp. 1501–1505). SEG. 1190/segam2020-3427086.1
Wu, X., Zhan, C., Lai, Y.-K., Cheng, M.-M., & Yang, J. (2019). IP102: A large-scale benchmark dataset for insect pest recognition. 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (pp. 8779–8788). IEEE. 00899
Xia, D., Chen, P., Wang, B., Zhang, J., & Xie, C. (2018). Insect detection and classification based on an improved convolutional neural network. Sensors, 18, 4169.
Xie, C., Zhang, J., Li, R., Li, J., Hong, P., Xia, J., & Chen, P. (2015). Automatic classification for field crop insects via multiple-task sparse representation and multiple-kernel learning. Computers and Electronics in Agriculture, 119, 123–132. 2015.10.015
Yang, X., Wang, Y., Chen, K., Xu, Y., & Tian, Y. (2022). Fine-grained object classification via self-supervised pose alignment. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 7399–7408). IEEE.
Yang, Y., & Xu, Z. (2020). Rethinking the value of labels for improving class-imbalanced learning. Advances in Neural Information Processing Systems, 33, 19290–19301.
Yi, J., Krusenbaum, L., Unger, P., Hüging, H., Seidel, S. J., Schaaf, G., & Gall, J. (2020). Deep learning for non-invasive diagnosis of nutrient deficiencies in sugar beet using RGB images. Sensors, 20(20), 5893.
Yu, X., Zhao, Y., & Gao, Y. (2022). SPARE: Self-supervised part erasing for ultra-fine-grained visual categorization. Pattern Recognition, 128, 108691.
Zbontar, J., Jing, L., Misra, I., LeCun, Y., & Deny, S. (2021). Barlow Twins: Self-supervised learning via redundancy reduction. International Conference on Machine Learning (pp. 12310–12320). PMLR.
Zhang, M., Cheng, S., Cao, X., Chen, H., & Xu, X. (2022). Entropy-based locally adaptive thresholding for image segmentation. https://
Zhong, Y., Gao, J., Lei, Q., & Zhou, Y. (2018). A vision-based counting and recognition system for flying insects in intelligent agriculture. Sensors, 18(5), 1489. 489
Zhuang, P., Wang, Y., & Qiao, Y. (2020). Learning attentive pairwise interaction for fine-grained classification. Proceedings of the AAAI Conference on Artificial Intelligence, 34(07), 13130–13137.

How to cite this article: Kar, S., Nagasubramanian, K., Elango, D., Carroll, M. E., Abel, C. A., Nair, A., Mueller, D. S., O’Neal, M. E., Singh, A. K., Sarkar, S., Ganapathysubramanian, B., & Singh, A. (2023). Self-supervised learning improves classification of agriculturally important insect pests in plants. The Plant Phenome Journal, 6, e20079.