Copy Paste++ v1
Abstract: We present an extended version of the Simple Copy Paste augmentation [1]
that applies mask-wise large scale jittering (LSJ). This improves the robustness of the
model during training by helping it identify and learn objects in a size- and scenario-agnostic
manner for instance segmentation. We further introduce a stratified training technique
that enables us to train our models with smaller batch sizes. Our model achieves
a score of 22.9 AP on the LVIS Challenge 2021 - Instance Segmentation task [2]
within just 8 epochs, and is yet to converge.
1. Introduction
Instance segmentation is one of the prominent tasks in computer vision where the goal is to localize and classify
instances in an image. The LVIS dataset is one such instance segmentation dataset that has a large number of
categories. The number of images in some categories is much smaller than in others. This long-tailed nature of
the LVIS dataset poses a major challenge to model training. Some existing works, such as Balanced Group Softmax [3],
Seesaw Loss [4], Balanced Mosaic [5], and Equalization Loss [6] have shown that re-weighting the loss for
tail classes and enhancing images using augmentations are effective ways to achieve better results. One such
effective and efficient augmentation technique is Simple Copy Paste. Building on this, we propose the Copy
Paste++ Augmentation.
2. Previous Work
2.1. Seesaw Loss
Seesaw Loss is derived from cross-entropy loss. It accumulates the number of training samples for each category
during every training iteration and uses two complementary factors, a mitigation factor and a compensation factor,
to re-balance the gradients of positive and negative samples based on the accumulated counts. The mitigation
factor reduces the punishment of tail categories, while the compensation factor increases the penalty on incorrectly
classified instances to avoid false positives for rarely occurring categories.
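The interplay of the two factors can be sketched as a per-class-pair re-weighting term. The formulation below follows the paper's mitigation and compensation factors, but the function name, argument layout, and defaults are our own illustration:

```python
def seesaw_factor(n_i, n_j, sigma_i, sigma_j, p=0.8, q=2.0):
    """Re-weighting factor S_ij applied to the negative gradient that a
    sample of ground-truth class i puts on class j.

    n_i, n_j         -- sample counts accumulated for classes i and j
    sigma_i, sigma_j -- predicted probabilities for classes i and j
    p, q             -- hyper-parameters (the paper defaults to p=0.8, q=2)
    """
    # Mitigation: when j is rarer than i, shrink the punishment that
    # head-class samples push onto the tail class j.
    mitigation = (n_j / n_i) ** p if n_j < n_i else 1.0
    # Compensation: when j is wrongly scored above the true class i,
    # scale the penalty back up to suppress false positives.
    compensation = (sigma_j / sigma_i) ** q if sigma_j > sigma_i else 1.0
    return mitigation * compensation
```

With a head class i (1000 samples) and a tail class j (10 samples), the mitigation factor scales the negative gradient on j down by (10/1000)^0.8, unless j is being scored above the true class, in which case the compensation factor pushes it back up.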
2.2. SyncBN
Synchronized Batch Normalization (SyncBN) [7] is a type of batch normalization used for multi-GPU training.
Standard batch normalization only normalizes the data within each device (GPU). SyncBN normalizes the input
within the whole mini-batch.
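The difference can be illustrated with a toy numpy sketch (our own illustration, not a framework's implementation): standard BN normalizes each device's shard with its local statistics, while SyncBN first aggregates mean and variance over the whole mini-batch.

```python
import numpy as np

def bn_per_device(shards, eps=1e-5):
    # Standard BN: every GPU normalizes its shard with its own
    # per-shard mean and variance.
    return [(x - x.mean(0)) / np.sqrt(x.var(0) + eps) for x in shards]

def sync_bn(shards, eps=1e-5):
    # SyncBN: mean and variance are computed over the whole mini-batch
    # (all shards concatenated), then shared by every device.
    full = np.concatenate(shards, axis=0)
    mu, var = full.mean(0), full.var(0)
    return [(x - mu) / np.sqrt(var + eps) for x in shards]
```

In PyTorch, `torch.nn.SyncBatchNorm.convert_sync_batchnorm` performs this swap on an existing model; the synchronization itself only takes effect under distributed training.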
3. Experiments
For our baseline, we use a Mask RCNN [8] architecture with ResNet-101 [9] as our backbone. We train it with
multi-scale training, repeat factor sampling, SyncBN, and Seesaw Loss using the 2x schedule provided by MMDetection [10].
We then upgrade the architecture to Cascade Mask RCNN [11], which gives us an increase of 1.9 AP.
We fine-tune this model using Copy Paste++, reaching a current score of 22.8 AP after 22 epochs. We
train our model on 8 Nvidia Tesla T4 GPUs using Stochastic Gradient Descent with a learning rate of 10e-5,
a momentum of 0.9, and a weight decay of 10e-4. We use a batch size of 16 images. The maximum number of
detections per image was set to 1000 in all our inferences. All scores are reported on the LVIS V1 validation set.
The Cascade Mask RCNN models were trained for 24 epochs, and the models with the Copy Paste++ augmentation
were trained for 8 epochs.*
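The training hyper-parameters above can be expressed as an MMDetection-style config fragment. The field names follow MMDetection conventions, and the 2-images-per-GPU split is our assumption for how a batch size of 16 is reached on 8 GPUs:

```python
# Optimizer settings from Section 3, written as an MMDetection-style
# config fragment. Values mirror the text; the per-GPU split is assumed.
optimizer = dict(type='SGD', lr=10e-5, momentum=0.9, weight_decay=10e-4)
data = dict(samples_per_gpu=2)  # assumed: 8 GPUs x 2 images = batch size 16
```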
Experiment List

Experiments                                    Mask AP   Mask APr   Mask APc   Mask APf
Baseline                                       28.2      21.0       27.8       31.8
Baseline + Cascade Mask RCNN                   30.1      21.1       30.3       33.9
Baseline + Cascade Mask RCNN + Copy Paste++*   22.9      14.8       22.7       26.7
4. Our Contribution
4.1. Copy Paste ++
Our approach proposes an alternative to the Simple Copy Paste algorithm. Instead of performing large scale
jittering on all the masks in the source image with the same random resize ratio, we randomly choose a set of
masks from the source image and perform large scale jittering independently on each of them, each with a different
random resize ratio. We then apply a smart pasting technique that modifies the annotations in both the destination
image and the source image. This ensures that completely occluded ground truths lose their annotations and that
partially occluded ground truths receive updated annotations, even when masks are jittered with different random
scales.
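The two steps, mask-wise LSJ and smart pasting, can be sketched as follows. This is a minimal illustration under our own assumptions (function names, the 0.5 selection probability, and the occlusion-keep threshold are illustrative, not taken from the paper), and the rescaling here anchors each jittered mask at the image origin, whereas a full augmentation would also translate it randomly:

```python
import numpy as np

def rescale_mask(mask, ratio, H, W):
    """Nearest-neighbour rescale of a boolean mask onto an HxW canvas."""
    yy = (np.arange(H) / ratio).astype(int)
    xx = (np.arange(W) / ratio).astype(int)
    valid = (yy < mask.shape[0])[:, None] & (xx < mask.shape[1])[None, :]
    out = mask[np.minimum(yy, mask.shape[0] - 1)][:, np.minimum(xx, mask.shape[1] - 1)]
    return out & valid

def copy_paste_pp(src_masks, dst_masks, H, W, scale_range=(0.5, 2.0),
                  keep_thresh=0.01, rng=None):
    rng = rng or np.random.default_rng()
    pasted = np.zeros((H, W), dtype=bool)
    jittered = []
    # 1) Randomly pick a subset of source masks, then jitter each one
    #    independently with its own resize ratio (mask-wise LSJ).
    for m in src_masks:
        if rng.random() < 0.5:
            continue
        j = rescale_mask(m, rng.uniform(*scale_range), H, W)
        jittered.append(j)
        pasted |= j
    # 2) Smart paste: clip every destination mask by the pasted region;
    #    drop ground truths that end up (almost) completely occluded,
    #    keep the visible remainder of partially occluded ones.
    updated = []
    for m in dst_masks:
        vis = m & ~pasted
        if m.any() and vis.sum() / m.sum() > keep_thresh:
            updated.append(vis)
    return jittered, updated
```

A destination mask that is fully covered by a pasted mask is dropped from the annotation list, while a partially covered one is replaced by its visible remainder, which is exactly the annotation update the smart pasting step performs.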
This helps us achieve two things. First, we enhance the size-agnostic behavior of the model by increasing the
variance in the sizes of individual masks, since each is scaled independently with a different resize ratio. This
also avoids unwanted correlations between the feature representations corresponding to the sizes of different
objects in the same image.
Second, with the help of our smart pasting technique, we are able to generate new images with spatial overlaps
between objects from the source image that did not originally overlap. This enhancement would not have been
possible in the original Copy Paste method.
5. Future Work
Since we were limited by computational constraints, we were unable to train our model to completion using our
stratified fine-tuning methodology. However, we believe that our model can achieve even better results if trained
to convergence with larger ranges of re-scaling ratios.
6. Conclusion
We propose an augmentation technique, Copy Paste++, which is an extension of Simple Copy Paste. We also
propose a stratified fine-tuning method for researchers with limited compute, so that they are able to leverage our
augmentation technique, even with a batch size as small as 16.
References
1. Golnaz Ghiasi, Yin Cui, Aravind Srinivas, Rui Qian, Tsung-Yi Lin, Ekin D. Cubuk, Quoc V. Le, and Barret Zoph.
Simple copy-paste is a strong data augmentation method for instance segmentation, 2021.
2. Agrim Gupta, Piotr Dollár, and Ross Girshick. Lvis: A dataset for large vocabulary instance segmentation, 2019.
3. Yu Li, Tao Wang, Bingyi Kang, Sheng Tang, Chunfeng Wang, Jintao Li, and Jiashi Feng. Overcoming classifier
imbalance for long-tail object detection with balanced group softmax, 2020.
4. Jiaqi Wang, Wenwei Zhang, Yuhang Zang, Yuhang Cao, Jiangmiao Pang, Tao Gong, Kai Chen, Ziwei Liu, Chen Change
Loy, and Dahua Lin. Seesaw loss for long-tailed instance segmentation, 2021.
5. Lei Chen, Qiang Zhou, Wei Li, Zhibin Wang, and Hao Li. Balanced mosaic and double classifier for large vocabulary
instance segmentation, 2020.
6. Jingru Tan, Changbao Wang, Buyu Li, Quanquan Li, Wanli Ouyang, Changqing Yin, and Junjie Yan. Equalization loss
for long-tailed object recognition, 2020.
7. Hang Zhang, Kristin Dana, Jianping Shi, Zhongyue Zhang, Xiaogang Wang, Ambrish Tyagi, and Amit Agrawal. Con-
text encoding for semantic segmentation, 2018.
8. Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask r-cnn, 2018.
9. Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition, 2015.
10. Kai Chen, Jiaqi Wang, Jiangmiao Pang, Yuhang Cao, Yu Xiong, Xiaoxiao Li, Shuyang Sun, Wansen Feng, Ziwei Liu,
Jiarui Xu, Zheng Zhang, Dazhi Cheng, Chenchen Zhu, Tianheng Cheng, Qijie Zhao, Buyu Li, Xin Lu, Rui Zhu, Yue
Wu, Jifeng Dai, Jingdong Wang, Jianping Shi, Wanli Ouyang, Chen Change Loy, and Dahua Lin. MMDetection: Open
mmlab detection toolbox and benchmark. arXiv preprint arXiv:1906.07155, 2019.
11. Zhaowei Cai and Nuno Vasconcelos. Cascade r-cnn: High quality object detection and instance segmentation, 2019.