1. Introduction
Object detection, in a narrow sense, refers to the computer vision task of localizing and classifying target objects in images through bounding boxes. Models capable of completing such tasks, typically convolutional neural network (CNN)-based frameworks, are called object detectors. Although the concept of object detection is also captured in closely related computer vision tasks, such as semantic segmentation, instance segmentation, and panoptic segmentation, where precise geometric masks are used to highlight target objects, object segmentors are not to be confused with object detectors [1]. Object detection has wide implications across various disciplines [2,3], including precision agriculture [4], a growing branch of agriculture that aims to improve agricultural management efficiency through technology [5]. Current mainstream object detectors rely on supervised learning and require large, diverse datasets to achieve robust performance, which translates into human image annotation needs with significant labor and time costs. In a multi-class dataset, it is common for certain object classes to contain substantially fewer instances than the rest due to data collection difficulties. For example, an apple flower bud at the tip growth stage in early spring can grow into five to eight flowers at the bloom growth stage in late spring [6,7], implying that a single tip bounding box annotation would correspond to multiple bloom bounding box annotations. This problem is known as the class imbalance issue in deep learning, which causes weak exposure to certain classes for object detectors and hence leads to poor model performance for those classes [8] (Figure 1).
Generative adversarial networks (GANs), proposed by Goodfellow et al. [9], are generative models designed to create synthetic data that mimic real data. GANs comprise generators and discriminators that compete in a minimax two-player game. Generators are trained on real datasets to capture the real data distribution and learn to map input variables, such as random multidimensional noise vectors or source domain images, to synthetic target domain samples. Discriminators, on the other hand, estimate the probability of given samples being real during model training. Generators' objective functions aim to minimize the likelihood of discriminators assigning high and low real-data probabilities to real and synthetic data, respectively, while discriminators' objective functions aim to maximize the same likelihood. In practice, as model training proceeds, generators become better at generating synthetic samples indistinguishable from real samples, and discriminators become better at distinguishing synthetic samples from real samples. With sufficient data, GAN training, in theory, should eventually converge and reach a Nash equilibrium, meaning that neither the generator nor the discriminator can improve further against its counterpart [10]. Similar to object detection, GAN research in recent years has gained significant traction across many disciplines, and common GAN applications include image synthesis, image super-resolution, image-to-image translation, etc. [11,12].
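Formally, this two-player game corresponds to the value function of Goodfellow et al. [9], where $G$ denotes the generator, $D$ the discriminator, $p_{\text{data}}$ the real data distribution, and $p_z$ the input noise distribution:

$$
\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{\text{data}}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))].
$$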
Given GANs' ability to synthesize data based on learned data distributions [9], utilizing synthetic data from GANs to improve weakly trained object detectors naturally becomes an intriguing research topic and a potentially valid solution for insufficient model training data. The underlying method would be to first develop one or multiple GANs targeted at the challenging classes of an object detector using its limited training dataset, then generate sufficient synthetic data for those classes, and finally retrain the object detector by incorporating the synthetic data into the original training dataset (a minimal sketch follows this paragraph). Fundamentally, this approach is an alternative form of traditional data augmentation techniques [13], and shares similarities with simulation-based image generation [14], as it involves creating entirely new data from scratch rather than modifying or transforming existing data. It is worth noting that fine-tuning object detectors with synthetic data can be an equally valid approach to improving their performance. However, besides shorter training durations, this strategy does not offer advantages over retraining in terms of final model performance on real test datasets, as fine-tuned models, unlike retrained models, are not exposed to real training data during their second phase of development. Despite the theoretical potential of this idea, the current literature has reported that synthetic training data can reduce artificial intelligence (AI) model performance and even lead to model collapse. For example, Bohacek and Farid [15] observed that the popular image generation model Stable Diffusion was highly vulnerable to data poisoning with synthetic training data, and yielded severely distorted and less diverse images even when the retraining datasets contained low quantities of self-generated images. Shumailov et al. [16] discovered that generative AI models, such as large language models (LLMs), variational autoencoders (VAEs), and Gaussian mixture models (GMMs), all experienced performance degradation when their training datasets were polluted by synthetic data from preceding models.
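To make this workflow concrete, the following minimal Python sketch outlines the three steps. All function names, the dataset interface, and the per-class image budget are hypothetical placeholders for illustration, not the implementation used in this study:

```python
# Hypothetical sketch of GAN-based class rebalancing for an object detector.
# train_gan, train_detector, and the dataset interface are placeholders.

def rebalance_and_retrain(real_dataset, detector, minority_classes,
                          images_per_class=500):
    synthetic_images = []
    for cls in minority_classes:
        # Step 1: develop a GAN on the limited real images of the weak class.
        gan = train_gan(real_dataset.images_of_class(cls))
        # Step 2: generate sufficient synthetic images for that class.
        synthetic_images += [gan.generate() for _ in range(images_per_class)]
    # Step 3: retrain (not fine-tune) the detector on real plus synthetic data,
    # so the model is still exposed to all real training images.
    augmented_dataset = real_dataset.extend(synthetic_images)
    return train_detector(detector, augmented_dataset)
```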
Additionally, in the context of GANs and object detection, further methodological concerns exist. First, GAN development is generally conducted on massive datasets and is arguably more data-demanding than object detector development. For example, the famous StyleGAN was trained on the FFHQ dataset with 70,000 images [17], while the subsequent StyleGAN2 was trained on the LSUN Car dataset with 893,000 images [18]. Brock et al. [19] trained BigGAN and BigGAN-deep using a subset of the JFT-300M dataset containing 292 million images. When trained on insufficient data, GANs are prone to the mode collapse issue [20], where generators produce highly similar outputs instead of diverse samples that reflect the full training data distribution. Consequently, the value of a homogeneous synthetic dataset for improving object detectors would be limited. Second, GAN training can be rather computationally expensive. For example, StyleGAN was trained for one week on an NVIDIA DGX-1 with eight Tesla V100 GPUs, while the StyleGAN2 project consumed approximately 51 GPU years based on a single NVIDIA V100 GPU. The high computation requirement of GAN development could pose a significant barrier to research with limited resource access. Third, object detection often deals with multi-class datasets, while regular unconditional GANs trained on such datasets cannot synthesize data of target classes on demand. Specialized GANs, such as conditional GANs [21], would be necessary for precise class data generation. Fourth, small objects, characterized by their small sizes relative to overall images, are common yet challenging targets in object detection. In contrast, GANs are typically trained on images dominated by a single, centrally located target object. It is questionable whether GANs can effectively synthesize images with randomly located small target objects. Lastly, even though GANs can generate an unlimited number of images, corresponding image annotations are still needed for the synthesized data to be useful to object detectors. Traditional costly and inefficient manual annotation is clearly not the ideal solution in such a scenario, as it can easily become a bottleneck that significantly limits the dataset preparation process. Alternative automated methods, such as pseudo-labeling (sketched below), can address the annotation speed constraint; however, the generated image annotations may be of suboptimal quality.
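As an illustration of the pseudo-labeling alternative, the sketch below runs a trained detector over GAN-generated images and writes YOLO-format annotation files. It assumes the Ultralytics API (the family YOLO11n belongs to); the weights path, image folder, and confidence threshold are hypothetical:

```python
# Hedged sketch: pseudo-labeling synthetic images with a trained detector.
# Paths and the 0.50 confidence threshold are illustrative assumptions.
from pathlib import Path
from ultralytics import YOLO

model = YOLO("runs/detect/train/weights/best.pt")  # hypothetical weights path
synthetic_dir = Path("synthetic_images")           # hypothetical GAN output folder

for result in model.predict(source=str(synthetic_dir), conf=0.50, stream=True):
    lines = []
    for box in result.boxes:
        cls_id = int(box.cls.item())
        cx, cy, w, h = box.xywhn[0].tolist()  # normalized YOLO-format box
        lines.append(f"{cls_id} {cx:.6f} {cy:.6f} {w:.6f} {h:.6f}")
    # Each detection becomes a pseudo-annotation; images with no detections
    # receive empty label files and can be treated as negative samples.
    Path(result.path).with_suffix(".txt").write_text("\n".join(lines))
```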
In light of the aforementioned background, the current study investigated the feasibility of mitigating class imbalance in object detection utilizing GAN-generated data, as the first attempt under an agricultural context in the current literature. The lightweight unconditional GAN FastGAN, the state-of-the-art object detector YOLO11n, and the multi-growth-stage apple flower bud dataset AriAplBud were chosen for the study. Such an experiment design was implemented not only to avoid extreme GAN development time and resource requirements, but also to verify the viability of the seemingly paradoxical concept: utilizing less-capable lightweight GANs developed on few images to improve advanced object detectors that struggle with the same small training dataset. The objectives of the study included: (1) developing a state-of-the-art apple flower bud detector; (2) developing GAN models for apple flower bud image synthesis at individual growth stages; (3) quantifying GAN's capacity to successfully synthesize images containing object-detector-detectable apple flower buds; and (4) evaluating the usefulness of GAN-based synthetic data, by quantity, in improving object detector performance for weakly trained classes. This article follows a standard research article structure, sequentially presenting a comprehensive literature review, a detailed method description, experimental results and discussions, and study conclusions in the subsequent sections.
2. Literature Review
A moderate number of studies in the current literature utilize GANs for object detection purposes. Based on their nature, these studies can be broadly divided into three categories: implementing GANs with object detectors, retraining object detectors with synthetic data from GANs, and manipulating real data with GANs for object detectors. The current study falls into the second category and addresses the knowledge gap by focusing on lightweight GANs in an agricultural context.
Unique characteristics exist for the three study categories, as they utilize GANs in distinct manners. Incorporating GAN modules into object detection frameworks creates a complete, unified solution. However, a deep and up-to-date understanding of neural networks, significant network architecture design and modification, and iterative experimentation might be required to develop such models. Retraining with additional GAN-generated data updates model weights and hence should be able to fundamentally improve object detectors for future applications. However, the efficacy of model retraining depends on the quantity and quality of the synthetic data, which in turn rely on the specific GANs used as well as their development. Transforming real data through GANs into a target domain that object detectors are more familiar with allows for model performance improvements without detector redevelopment. Yet, to maintain the same level of model performance, the utilization of GANs becomes a constant necessity. It is difficult to conclude which category exploits GANs for object detection most effectively, as the studies often developed and evaluated their detectors based on different datasets, metrics, and benchmarking counterparts for different applications, and are therefore not always directly comparable.
2.1. Implementing GAN with Object Detector
Li et al. [22] proposed Perceptual GAN to address small object detection problems. Perceptual GAN consisted of a deep residual generator that utilized fine-grained details from low-level convolutional layers to generate super-resolved high-level convolutional features, and a discriminator with an adversarial branch for object probability estimation and a perception branch for object classification and bounding box regression. The study demonstrated that Perceptual GAN outperformed Fast R-CNN and Faster R-CNN for the Tsinghua-Tencent 100K traffic sign dataset.
Wang et al. [23] proposed CMTGAN, based on deep CNNs, dedicated to small object detection. CMTGAN included a generator with a centered mask for image super-resolution, and a discriminator for two-stage object detection that first proposed regions of interest (ROIs) and then predicted object categories and regressed bounding boxes on the ROIs. The study experiments suggested that CMTGAN outperformed YOLOv4 and Faster R-CNN combined with bilinear, bicubic, SPSR, and ESRGAN upsampling methods based on the PASCAL VOC dataset.
Sun et al. [24] proposed Ganster R-CNN, consisting of RFPN and IGAN modules, for occluded object detection. RFPN was created by combining RPN and FPN to integrate semantic information from high-level feature maps and location information from low-level feature maps, and extracted samples from real images. IGAN was composed of a GAN generator and a Faster R-CNN detector. The GAN generator created synthetic occluded samples from noise variables and real sample high- and low-resolution feature maps, while the Faster R-CNN detector learned whether the samples were real or synthetic. Based on the PASCAL VOC2007, PASCAL VOC2012, and MS COCO 2017 datasets, Ganster R-CNN outperformed Faster R-CNN, SSD513, and R-FCN.
Dewi et al. [25] proposed DC YOLO-GAN by incorporating a one-class YOLO architecture into a GAN discriminator to recognize similar-looking musical instruments, including bassoon, cello, clarinet, erhu, flute, French horn, guitar, harp, recorder, saxophone, trumpet, and violin. Their study, based on the PPMI dataset, showed that DC YOLO-GAN outperformed YOLOv2 marginally for certain instrument classes.
Jaw et al. [26] proposed RodNet, which incorporated a GAN and an object detector for nighttime object detection. The RodNet GAN treated nighttime images as the source domain and daytime images as the target domain to achieve feature transformation and project low-luminance features into visible and clean features. The RodNet object detector shared features from the GAN generator as inputs and made subsequent object predictions. RodNets incorporating YOLOv3-416 and YOLOv7-tiny were tested on the BDD100K, KITTI, and CityScape nighttime datasets. The results showed that RodNet-YOLOv3 outperformed SSD-512, RetinaNet, and YOLOv3, and RodNet-YOLOv7 outperformed YOLOv7, in both daytime and nighttime domains.
Ni et al. [27] proposed NaGAN for off-nadir object detection of multi-view remote sensing imagery. NaGAN consisted of a generator with feature generation and label alignment modules to generate nadir-like representations from off-nadir objects, and a discriminator with adversarial and detecting heads. Based on the SpaceNet satellite dataset, NaGAN consistently outperformed Faster R-CNN, Cascade R-CNN, CornerNet, FoveaBox, RetinaNet, HTC, Libra R-CNN, NAS-FPN, and CentripetalNet for all sensor viewing angles.
Bai et al. [28] proposed SOD-MTGAN to be incorporated with any object detector to improve its small object detection. The baseline object detector was used to first crop out image ROIs, which were then fed into the SOD-MTGAN generator to construct corresponding high-resolution samples, and the SOD-MTGAN discriminator finally classified object categories and regressed bounding boxes. Their experiments showed that SOD-MTGAN was able to improve the performances of both Faster R-CNN and FPN based on the MS COCO minival dataset.
Chen et al. [29] trained DRBox with a small set of human-labeled airplane images and pseudo-labeled the remaining large dataset. They further trained DCGAN to classify the human-labeled, pseudo-labeled, and generated images to filter out false detections by DRBox and prevent model overfitting.
Jiang and Ying [30] added a GAN before DSSD as a foreground-background separation translation model and performed data augmentation, including color channel change, noise addition, and contrast enhancement, only on image foregrounds. The proposed model marginally outperformed DSSD based on the PASCAL VOC2007 and PASCAL VOC2012 datasets.
Zhai et al. [31] proposed GAN-FRCNN to address the low-resolution and undersampling problems in CSGI object detection. GAN-FRCNN utilized the TVAL3 algorithm to reconstruct images at different resolutions and sampling rates from real images. A Faster R-CNN pretrained on real images was used to obtain the object classification loss and bounding box regression loss of the reconstructed images, and the high-loss images were selected as the training dataset. Based on the MS COCO 2017 dataset, GAN-FRCNN achieved substantial performance improvements for many object classes.
2.2. Retraining Object Detector with Synthetic Data from GAN
Bosquet et al. [32] proposed DS-GAN to increase the number of small objects in video datasets. DS-GAN had an encoder-decoder generator and a residual block discriminator, and was able to create downsampled low-resolution small objects from high-resolution objects. The authors incorporated DS-GAN into a small object data augmentation pipeline by using Mask R-CNN to extract small foreground target objects, then inpainting and blending the objects into images at plausible locations with correct orientations and scales. The synthetic data generated by the pipeline improved STDnet, FPN, and CenterNet for the UAVDT car dataset.
Posilovic et al. [33] proposed DetectionGAN, based on Pix2pixHD, consisting of a U-net generator, two PatchGAN discriminators, and a pretrained object detector. DetectionGAN was used as a conditional GAN to translate binary masks of steel block defects into realistic ultrasonic images, and the synthetic data were further utilized to retrain YOLOv3 along with real data. Model performance improvements were successfully achieved in the study.
Lee et al. [34] proposed RDAGAN for data augmentation purposes, which comprised an object generation network based on InfoGAN that generated target objects, and an image translation network, with an encoder-decoder generator and global and local discriminators, that inserted the generated object batches within bounding box masks of clean images and translated the overall images into the target domain. Based on the FiSmo fire and Google Landmarks v2 datasets, YOLOv5 trained with both real and augmented images showed a performance improvement.
Dai et al. [35] proposed CPGAN for thermal infrared data augmentation based on RGB image translation. CPGAN was composed of a cascade pyramid generator and a multi-scale discriminator. The cascade pyramid generator consisted of three branches with similar network structures, including low-, medium-, and high-resolution generators, for high-resolution image generation. The multi-scale discriminator consisted of three discriminators with the same structure but executed on images of low, medium, and high resolutions. The study demonstrated that synthetic thermal infrared images, when added to training datasets, were able to help improve the performances of Faster R-CNN, R-FCN, YOLOv2, YOLOv3, YOLOv4, and SSD.
Liu et al. [36] proposed DetectorGAN based on CycleGAN, which incorporated a ResNet generator, two global and one local PatchGAN discriminators, and a detector that took both real and synthetic labeled images as input and outputted bounding boxes. The study suggested that RetinaNet had performance improvements when trained with both real and synthetic images based on the NIH Chest X-ray nodule dataset and the Cityscapes pedestrian dataset.
Zhu et al. [37] proposed MCGAN to augment data for object detection in optical remote sensing images. The MCGAN architecture contained a DCGAN generator, three discriminators, and a classifier. It was trained on cropped target objects rather than whole images, and the synthetic objects were Poisson mosaicked into real images to increase data diversity. A pretrained Faster R-CNN was utilized to filter out misidentified or unidentified objects to maintain the data distribution. Based on the NWPU VHR-10 and DOTA geospatial datasets with seven classes, Faster R-CNN trained on datasets with different levels of added synthetic objects was able to perform better for most classes.
Kim et al. [38] trained a CNN-based GAN using baggage X-ray images from the GDXray dataset, and retrained Faster R-CNN with additional GAN-generated synthetic data, which achieved superior performances in detecting handguns, shuriken, and razors.
Lin et al. [39] proposed SYN-MTGAN to synthesize traffic sign images, consisting of an encoder-decoder generator inspired by CycleGAN, and a multi-task discriminator that distinguished real from synthetic images and predicted target object classes. Adversarial, cycle consistency, identity, and classification losses were combined into a weighted sum as the overall loss function. Based on a customized traffic sign dataset, Faster R-CNN trained on the synthetic dataset reached considerably higher accuracies for certain traffic sign categories than the one trained on real scene images.
Maeda et al. [40] trained PG-GAN using cropped pothole images and Poisson blended the synthetic images into undamaged road images. Based on their Road Damage Dataset 2019, the retrained SSD MobileNet performance first improved but then degraded as the size of the synthetic data in the training datasets increased to match that of the real data.
2.3. Manipulating Real Data with GAN for Object Detector
Courtrai et al. [41] developed SR-CWGAN-Yolo as an image super-resolution network by incorporating SR-GAN, CycleGAN, and YOLOv3. They observed that Faster R-CNN, EfficientDet, RetinaNet-50, and YOLOv3 pretrained on the ISPRS Potsdam car dataset achieved much higher object detection accuracies on the upsampled dataset generated by SR-CWGAN-Yolo than on the ones generated using bicubic interpolation and EDSR methods.
Nath and Behzadan [42] proposed a deep CNN-based GAN for image super-resolution and missing pixel information generation in low-resolution images. Based on two in-house datasets, namely the Pictor-v2 dataset with common construction objects including buildings, equipment, and workers, and the Pictor-v3 dataset with workers, hats, and vests, YOLOv3 trained on high-resolution images consistently performed better on super-resolved images than on their corresponding low-resolution images at various resolution levels.
Li et al. [43] tackled the issue of deep learning model generalizability when a model is trained on source datasets but needs to be applied to target datasets. They applied CycleGAN and AgGAN as image translation models to transform target domain images into source-domain-stylized images under three scenarios: datasets captured at two time points, over two locations, and by two sensors, respectively. Faster R-CNN trained on source images was subsequently applied to the transformed target images. The study results showed that the methodology failed to boost Faster R-CNN performance.
5. Conclusions
Based on YOLO11n, FastGAN, and AriAplBud, the following conclusions were drawn. The study demonstrated the feasibility of utilizing GANs to selectively improve agricultural object detector class performance and mitigate the class imbalance issue in object detection, even when the GAN was lightweight, developed on a very small dataset, and unable to converge during training. From the synthetic image quality perspective, training divergence did not necessarily indicate complete training failure, such as mode collapse for GAN, especially when periodic model validation was appropriately executed. Despite the small size of the agricultural instances, which were randomly distributed in images, the GAN was able to capture their characteristics and successfully generate synthetic images with object-detector-detectable instances. The positive sample rate of synthetic data (formalized below) generally correlated with object detector performance; that is, higher detector performance for a class generally implies a higher synthetic positive sample rate for that class at non-extreme confidence thresholds. The average number of object-detector-detectable instances per image in synthetic data, however, tended to be lower than that in real data, especially at high confidence thresholds. Synthetic positive samples from the GAN, when employed for object detector retraining, were able to help improve object detector performance on target classes considerably. However, the optimal synthetic instance quantity for model retraining remained unclear, and a negative influence of the synthetic samples on non-target classes was also observed. Further studies are needed to investigate how the quantity and quality of synthetic instances impact object detector performance improvement or degradation in target and non-target classes.
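For clarity, the positive sample rate referenced above can be formalized as follows; the notation is introduced here for illustration, with a synthetic image counting as a positive sample at confidence threshold $\tau$ if the detector finds at least one target class instance in it:

$$
\mathrm{PSR}(\tau) = \frac{\left|\{\, i \in \{1, \dots, N\} : n_i(\tau) \ge 1 \,\}\right|}{N},
$$

where $N$ is the total number of synthetic images generated for a class and $n_i(\tau)$ is the number of detector-detectable instances in synthetic image $i$ at threshold $\tau$. The related per-image statistic discussed above is the mean $\frac{1}{N}\sum_{i=1}^{N} n_i(\tau)$.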