Bag of Freebies For Training Object Detection Neural Networks

Bag of Freebies for Training Object Detection Neural Networks
Zhi Zhang, Tong He, Hang Zhang, Zhongyue Zhang, Junyuan Xie, Mu Li
Amazon Web Services
{zhiz, htong, hzaws, zhongyue, junyuanx, mli}@amazon.com
arXiv:1902.04103v3 [cs.CV] 12 Apr 2019
Abstract
Training heuristics greatly improve various image clas-

sification model accuracies [8]. Object detection models,
however, have more complex neural network structures and
optimization targets. The training strategies and pipelines
dramatically vary among different models. In this works,
we explore training tweaks that apply to various models
including Faster R-CNN and YOLOv3. These tweaks do
not change the model architectures, therefore, the inference
costs remain the same. Our empirical results demonstrate
that, however, these freebies can improve up to 5% absolute
precision compared to state-of-the-art baselines.
Figure 1. The Bag of Freebies improves object detector perfor-

1. Introduction mances. There is no extra inference cost since models are not
changed.
Object detection is no doubt one of the most fundamen-
tal applications in computer vision drawing attentions of re-
searchers from various fields. Latest state-of-the-art detec- crementally stacking them to train single and multiple stage
tors, including single (SSD [12] and YOLO [16]) or multi- object detection networks.
ple stage RCNN-like [3] neural networks and many varia-
Our major contributions can be summarized as follows:
tions or extended work [17, 20], are based on image clas-
sification backbone networks, e.g., VGG [21], ResNet [7],
1. Systematically evaluating the various training heuris-
Inception [22] and MobileNet series [9, 19]. Despite the
tics applied to different object detection pipelines,
rapid development and great success of the modern object
providing valuable practice guidelines for future re-
detectors, different work usually employs different data pre-
searches. To our best knowledge, this is the first work
possessing and training pipeline, which makes it hard for
for surveying training heuristics for object detection.
different object detection method to benefit from each other
or relevant advancement in other area. 2. A visually coherent image mixup method designed for
In this work, we focus on exploring effective and general training object detection networks. Empirical results
approaches that can boost the performance of all popular show that it is effective in improving model general-
object detection networks without introducing extra com- ization capabilities.
putational cost during inference. We first explore the mixup
3. Extending the research depth on object detection data
technique on object detection. Unlike [24], we recognize
augmentation domain that strengthen the model gener-
the special property of multiple object detection task which
alization capability and help reduce over-fitting prob-
favors spatial preserving transforms. Therefore we pro-
lems.
posed a visually coherent image mixup methods designed
for object detection tasks. Second, we explore detailed 4. We achieve up to 5% absolute precision improvement
training pipelines including learning rate scheduling, label (15 to 20% better than baseline) without modifying
smoothing and synchronized BatchNorm [25, 14]. Third, the network architectures. These model improvements
we investigate the effectiveness of our training tweaks by in- bring no extra inference cost.
1
The rest of this paper is organized as follows. First, we RoIs are either propagated by neural networks or determin-
briefly introduce previous works in Section 2 on improving istic algorithms (e.g. Selective Search [23]). This major
image classification and the potential to transfer to object difference caused significant divergence in data processing
detection models. Second, the proposed tweaks are detailed and network optimization. For example, due to the lack of
in Section 3. Third, the experimental results will be bench- spatial variation in single stage pipelines, spatial data aug-
marked in Section 4. Finally, Section 5 will conclude this mentation is crucial to the performance as proven in Single-
work. Shot MultiBox Object Detector (SSD) [12]. Due to lack of
All related code are open-sourced and pre-trained exploration, many training details are exclusive to one se-
weights for the models are available in GluonCV toolkit [1]. ries. In this work, we systematically explore the mutually
beneficial tweaks and tricks that may help to boost the per-
2. Related Work formance for both pipelines.
In this section, we briefly discuss related work regarding
3. Bag of Freebies
bag of tricks for image classification and heuristic object
detection in common. In this section, we propose a visual coherent image
mixup method for object detection. We will introduce data
2.1. Scattering tricks from Image Classification
processing and training schedule designed to improve per-
Image classification serves as the foundation of major formance of object detection models.
computer vision tasks. Classification models are less com-
putation intensive comparing with popular object detec-
3.1. Visually Coherent Image Mixup for Object De-
tion and semantic segmentation models, therefore attractive
tection
enormous researchers to prototyping ideas. In this section, Mixup has been proved successful in alleviating adver-
we briefly describe previous works that open the shed for sarial perturbations in classification networks [24]. The key
this area. Learning rate warmup heuristic [6] is introduced idea of mixup in image classification task is to regularize the
to overcome the negative effect of extremely large mini- neural network to favor simple linear behavior by mixing up
batch size. Interestingly, even though mini-batch size used pixels as interpolations between pairs of training images. At
in typical object detection training is nowhere close to the the same time, one-hot image labels are mixed up using the
scale in image classification(e.g. 10k or 30k [6]), a large same ratio. An example of mixup in image classification is
amount of anchor size(up to 30k) is effectively contribut- illustrated in Fig. 2.
ing to batch size implicitly. A gradual warmup heuristic is The distribution of blending ratio in mixup algorithm
crucial to YOLOv3 [16] as in our experiments. There is proposed by Zhang et al. [24] is drawn from a beta distribu-
a line of approaches trying to address the vulnerability of tion B(0.2, 0.2). The majority of mixups are barely noises
deep neural network. Label smoothing was introduced in with such beta distributions. Rosenfeld et al. [18] conduct
[22], which modifies the hard ground truth labeling in cross a series of interesting experiments named as “Elephant in
entropy loss. Zhang et al. [24] proposed mixup to allevi- the room”, where an elephant patch is randomly placed on
ate adversarial perturbation. Cosine annealing strategy for a natural image, then this adversarial image is used to chal-
learning rate decay is proposed in [13] in response to tradi- lenge existing object detection models. The results indicate
tional step policy. He et al. achieved significant improve- that existing object detection models are prune to such at-
ments on training accuracy by exploring bag of tricks [8]. tack and show weakness to detect such transplanted objects.
In this work, we dive deeper into the heuristic techniques Inspired by the heuristic experiments by Rosenfeld et al.
introduced by image classification in the context of object [18], we focus on the natural co-occurrence object presen-
detection. tations which play significant roles in object detection. By
applying more complex spatial transforms, we introduce oc-
2.2. Deep Object Detection Pipelines
clusions, spatial signal perturbations that are common in
Most state-of-the-art deep neural network based object natural image presentations.
detection models are derived from multiple stages and sin- In our empirical experiments, continue increasing blend-
gle stage pipelines, starting from R-CNN [4] and YOLO ing ratio used in the mixup process, the objects in resulting
[15], respectively. In single stage pipelines, predictions are frames are more vibrant and coherent to the natural pre-
generated by a single convolutional network and therefore sentations, similar to the transition frames commonly ob-
preserve the spatial alignments (except that YOLO used served when we are watching low FPS movies or surveil-
Fully Connected layers at the end). However, in multiple lance videos. The visual comparisons of image classifica-
stage pipelines, e.g. Fast R-CNN [3] and Faster-RCNN tion and such high ratio mixup are illustrated in Fig. 2 and
[17], final predictions are generated from features which are Fig. 3, respectively. In particular, we use geometry pre-
sampled and pooled in a specific region of interests (RoIs). served alignment for image mixup to avoid distort images
2
* 0.9 + * 0.1 =
bittern 0 bittern 1 bittern 0.1
... 0 ... 0 ... 0
otter 0 otter 0 otter 0
... 0 ... 0 ... 0
analog_clock 1 analog_clock 0 analog_clock 0.9
Figure 2. Mixup visualization of image classification with typical mixup ratio at 0.1 : 0.9. Two images are mixed uniformly across all
pixels, and image labels are weighted summation of original one-hot label vector.
0. Sheep
1. Sheep
0. Sheep
1. Sheep
2. Stop Sign
0.Stop Sign
Figure 3. Geometry preserved alignment of mixed images for object detection. Image pixels are mixed up, object labels are merged as a
new array.
Model mAP @ 0.5 ing the YOLOv3 network on Pascal VOC dataset. Table. 1
baseline 81.5 shows the actual improvements by adopting detection mix-
0.5:0.5 evenly 83.05 ups with ratios sampled by different beta distributions. Beta
B(1.0, 1.0), weighted loss 83.48 distribution with α and β both equal to 1.5 is marginally bet-
B(1.5, 1.5), weighted loss 83.54 ter than 1.0 (equivalent to uniform distribution) and better
than fixed even mixup. We recognize that for object detec-
Table 1. Effect of various mixup approaches, validated with tion where mutual object occlusion is common, networks
YOLOv3 [16] on Pascal VOC 2007 test set. Weighted loss in- are encouraged to observe unusual crowded patches, either
dicates the overall loss is the summation of multiple objects with presented naturally or created by adversarial techniques.
ratio 0 to 1 according to image blending ratio they belong to in the
To validate the effectiveness of visually coherent mixup,
original training images.
we followed the same experiments of ”Elephant in the
room” [18] by sliding an elephant image patch through an
indoor room image. We trained two YOLOv3 models on
at the initial steps. We also choose a beta distribution with COCO 2017 dataset with identical settings except for that
α and β are both at least 1, which is more visually coherent, model mix is using our mixup approach. We depict some
instead of following the same practice in image classifica- surprising discoveries in Fig. 5. As we can observe in Fig. 5,
tion, as depicted in Figure 4. vanilla model trained without our mix approach is struggles
To verify mixup designed for object detection, we ex- to detect ”elephant in the room” due to heavy occlusion and
perimentally tested empirical mixup ratio distributions us- lack of context because it’s rare to capture an elephant in a
3
January 2019
1 Introduction
8 3.2. Classification Head Label Smoothing
↵ = 0.2, = 0.2
↵ = 1.0, = 1.0 For each object, detection networks often compute a
↵ = 1.5, = 1.5 probability distribution over all classes with softmax func-
6 tion:
ezi
pi = P zj , (1)
4 je
where zi s are the unnormalized logits directly from the

last linear layer for classification prediction. For object
2 detection during training, we only modify the classifica-
tion loss by comparing the output distribution p against the
ground truth distribution q with cross-entropy
0 L=−
X
qi log pi . (2)
0 0.2 0.4 0.6 0.8 1
i
Figure 4. Comparison of different random weighted mixup sam-
pling distributions. Red curve B(0.2, 0.2) indicate the typical q is often a one-hot distribution, where the correct class
mixup ratio used in image classification. Blue curve is the special has probability one while all other classes have zero. Soft-
case B(1, 1), equivalent to uniform distribution. Orange curve max function, however, can only approach this distribution
represents our choice B(1.5, 1.5) for object detection after pre- when zi zj , ∀j 6= i but never reach it. This encourages
liminary experiments. the model to be too confident in its predictions and is prone
to over-fitting.
Label smoothing was proposed by Szegedy et al. [22]
Model recall of elephant disappeared furniture as a form of regularization. We smooth the ground truth
baseline 42.95 8.24 % distribution with
+mixup 94.12 1.27%
(
1−ε if i = y,
qi = (3)
Table 2. Statistics of detection results affected by elephant in the ε/(K − 1) otherwise,
room. ”Recall of elephant” is the recall of sliding elephant in all
generated frames, indicating how robust the model handles objects where K is the total number of classes and ε is a small
in unseen context. Disappeared furniture percentage is calculated constant. This technique reduces the model’s confidence,
by dividing sum of disappeared furniture count by overall furniture measured by the difference between the largest and small-
objects in all adversarial frames. est logits.
In the case of sigmoid outputs of range 0 to 1.0 as in
YOLOv3 [16], label smoothing is even simpler by correct-
1 ing the upper and lower limit of the range of targets as in
kitchen. Actually, there is no such training image after ex- Eq. 3.
amine the common training datasets. In comparison, mod-
els trained with our mix approach is more robust thanks to 3.3. Data Preprocessing
randomly generated visually deceptive training images. In In image classification domain, usually neural networks
addition, we also notice that mix model is more humble, less are extremely tolerant to image geometrical transformation.
confident and generates lower scores for objects on average. It is actually encouraged to randomly perturb the spatial
However, this behavior does not affect evaluation results as characteristics, e.g. randomly flip, rotate and crop images
shown in experimental results. We evaluated the model per- in order to improve generalization accuracy and avoid over-
formance against fake video with elephant sliding through, fitting. However, for object detection image preprocessing,
and the results are listed in Table. 2. It is obvious that model we need to carry additional cautious since detection net-
trained with visually coherent mixup is more robust (94.12 works are more sensitive to such transformations.
vs. 42.95) to detect elephant in indoor scene even though it We experimentally review the following data augmenta-
is very rare in natural images. And mixup model can pre- tion methods:
serve crowded furniture objects under heavy occlusion of
alien elephant image patch. We recognize that mixup model • Random geometry transformation. Including random
receives more challenges during training therefore is signif- cropping (with constraints), random expansion, ran-
icantly better than vanilla model in handling unprecedented dom horizontal flip and random resize (with random
scenes and very crowded object groups. interpolation).
4
(orig) (orig-1) (orig-2)
(mix) (mix-1) (mix-2)

Figure 5. Elephant in the room example. Model trained with geometry preserved mixup (bottom) is more robust against alien objects
compared to baseline (top).
• Random color jittering including brightness, hue, sat- throughout the training process. For example, the step
uration, and contrast. schedule is the most widely used learning rate schedule.
With step schedule, the learning rate is multiplied by a con-
In terms of types of detection networks, there are two
stant number below 1 after reaching pre-defined epochs or
pipelines for generating final predictions. First is single
iterations. For instance, the default step learning rate sched-
stage detector network, where final outputs are generated
ule for Faster-RCNN [17] is to reduce learning rate by ratio
from every single cell in the feature map, for example
0.1 at 60k iterations. Similarly, YOLOv3 [16] uses same
SSD[12] and YOLO[16] networks which generate detec-
ratio 0.1 to reduce learning rate at 40k and 45k iterations.
tion results proportional to spatial shape of an input image.
Step schedule has sharp learning rate transition which may
The second is multi-stage proposal and sampling based ap-
cause the optimizer to re-stabilize the learning momentum
proaches, following Fast-RCNN[17], where a certain num-
in the next few iterations. In contrast, a smoother cosine
ber of candidates are sampled from a large pool of generated
learning rate adjustment was proposed by Loshchilov et
ROIs, then the detection results are produced by repeatedly
al. [13]. Cosine schedule scales the learning rate accord-
cropping the corresponding regions on feature maps, and
ing to the value of cosine function on 0 to pi. It starts with
the number of predictions is proportional to number of sam-
slowly reducing large learning rate, then reduces the learn-
ples.
ing rate quickly halfway, and finally ends up with tiny slope
Since sampling-based approaches conduct enormous
reducing small learning rate until it reaches 0. In our imple-
cropping operations on feature maps, it substitutes the op-
mentation, we follow He et al. [8] but the numbers of iter-
eration of randomly cropping input images, therefore these
ations are adjusted according to object detection networks
networks do not require extensive geometric augmentations
and datasets.
applied during the training stage. This is the major differ-
ence between one-stage and so called multi-stage object de- Warmup learning rate is another common strategy to
tection data pipelines. In our Faster-RCNN training, we do avoid gradient explosion during the initial training itera-
not use random cropping techniques during data augmentations. Warmup learning rate schedule is critical to several
tion. object detection algorithms, e.g., YOLOv3, which has a
dominant gradient from negative examples in the very be-
3.4. Training Schedule Revamping
ginning iterations where sigmoid classification score is ini-
During training, the learning rate usually starts with tialized around 0.5 and biased towards 0 for the majority
a relatively big number and gradually becomes smaller predictions.
5
1e 3 3.6. Random shapes training for single-stage object
1.00 step detection networks
0.75 cosine
Learning Rate
0.50 Natural training images come in various shapes. To

0.25 fit memory limitation and allow simpler batching, many
0.00
single-stage object detection networks are trained with fixed
0 25 50 75 100
Epoch 125 150 175 200 shapes [12, 15]. To reduce risk of overfitting and to im-
(a) Learning Rate Schedule prove generalization of network predictions, we follow the
approach of random shapes training as in Redmon et al.
0.8 [16]. More specifically, a mini-batch of N training im-
Validation mAP
0.6
ages is resized to N × 3 × H × W , where H and W are
multipliers of network stride. For example, we use H =
0.4 step W ∈ {320, 352, 384, 416, 448, 480, 512, 544, 576, 608} for
cosine
YOLOv3 training given the stride of feature map is 32.
0 25 50 75 100
Epoch 125 150 175 200
(b) Validation mAP
4. Experiments
Figure 6. Visualization of learning rate scheduling with warmup
enabled for YOLOv3 training on Pascal VOC. (a): cosine and In order to compare proposed tweaks for object de-
step schedules for batch size 64. (b): Validation mAP compari- tection, we pick up one popular object detection frame-
son curves using step and cosine learning schedule. work from single and multiple stage pipelines, respectively.
YOLOv3 [16] is famous for its efficiency and good accu-
racy. Faster-RCNN [17] is one of the most adopted de-
Training with cosine schedule and proper warmup lead tection framework and foundation of many others variants.
to better validation accuracy, as depicted in Fig. 6, valida- Therefore in this paper, we use YOLOv3 and Faster-RCNN
tion mAP achieved by applying cosine learning rate decay as representatives to conduct experiments. Note that in or-
outperforms step learning rate schedule at all times in train- der to remove side effects of test time tricks, we always re-
ing. Due to the higher frequency of learning rate adjust- port single scale, single model results with standard Non-
ment, it also suffers less from plateau phenomenon of step maximum Suppression implementation. We do not use ex-
decay that validation performance will be stuck for a while ternal training image or labels in our experiments.
until learning rate is reduced.
4.1. Incremental trick evaluation on Pascal VOC
3.5. Synchronized Batch Normalization
Pascal VOC is the most common dataset for benchmark-
In recent years, the need of massive computation re- ing object detection models [3, 12, 15], we use Pascal VOC
sources forces training environments to equip multiple de- 2007 trainval and 2012 trainval for training and 2007 test
vices (usually GPUs) to accelerate training. Despite han- set for validation. The results are reported in mean aver-
dling different hyper-parameters in response to larger batch age precision defined in Pascal VOC development kit [2].
sizes during training, Batch Normalization [10] is draw- For YOLOv3 models, we consistently validate mean aver-
ing the attention of multi-device users due to the imple- age precision (mAP) at 416 × 416 resolution. If random
mentation details. Although the typical implementation of shape training is enabled, YOLOv3 models will be fed with
Batch Normalization working on multiple devices (GPUs) random resolutions from 320×320 to 608×608 with 32×32
is fast (with no communication overhead), it inevitably re- increments, otherwise they are always trained with fixed
duces the size of batch size and causing slightly different 416 × 416 input data. Faster RCNN models take arbitrary
statistics during computation, which potentially degraded input resolutions. In order to regulate training memory con-
performance. This is not a significant issue in some stan- sumption, the shorter sides of input images are resized to
dard vision tasks such as ImageNet classification (as the 600 pixels while ensuring the longer side in smaller than
batch size per device is usually large enough to obtain good 1000 pixels. Training and validation of Faster-RCNN mod-
statistics). However, it hurts the performance in some tasks els follow the same preprocessing steps, except that train-
with a small batch size (e.g., 1 per GPU). Recently, Peng ing images have chances of 0.5 to flip horizontally as ad-
et al. [14] has proved the importance of synchronized batch ditional data augmentation. The incremental evaluations of
normalization in object detection. In this work, we review YOLOv3 and Faster-RCNN with our bags of freebies (BoF)
the importance of Synchronized Batch Normalization with are detailed in Table. 3 and Table. 4, respectively.
YOLOv3 [16] to evaluate the impacts of relatively smaller For YOLOv3, we first notice that data augmentation con-
batch-size on each GPU as training image shape is signifi- tributed nearly 16% to the baseline mAP, suggesting that
cantly larger than image classification tasks. single-stage object detection networks rely heavily on as-
6
Figure 7. COCO 80 category AP analysis with YOLOv3 [16]. Red lines indicate performance gain using BoF, while blue lines indicate
performance drop.
Figure 8. COCO 80 category AP analysis with Faster-RCNN resnet50 [17]. Red lines indicate performance gain using BoF, while blue
lines indicate performance drop.
Incremental Tricks mAP ∆ Cumu ∆ ing, cosine learning rate schedule, Sigmoid label smoothing
- data augmentation 64.26 -15.99 -15.99 and detection mixup continuously improves validation per-
baseline 80.25 0 0 formance, up to 3.43%, achieving 83.68% single model sin-
+ synchronize BN 80.81 +0.56 +0.56 gle scale mAP.
+ random training shapes 81.23 +0.42 +0.98 For Faster-RCNN, one obvious difference compared
+ cosine lr schedule 81.69 +0.46 +1.44 with YOLOv3 results is that disabling data augmentation
+ class label smoothing 82.14 +0.45 +1.89 only introduced a minimal 0.16% mAP loss. This phenom-
+ mixup 83.68 +1.54 +3.43 ena is indicating that sampling based proposals can effec-
tively replace random cropping which is heavily used in
Table 3. Incremental trick validation results of YOLOv3, evaluated single stage object detection training pipelines. Second, in-
at 416 × 416 on Pascal VOC 2007 test set. cremental mAPs show strong confidence that the proposed
tricks can effectively improve model performance, with a
Incremental Tricks mAP ∆ Cumu ∆ significant 3.55% gain.
- data augmentation 77.61 -0.16 -0.16 It is challenging to achieve mAP higher than 80% with
baseline 77.77 0 0 out external training data on Pascal VOC [17, 12, 20]. How-
+ cosine lr schedule 79.59 +1.82 +1.82 ever, we managed to achieve up to 3.5% mAP gain on both
+ class label smoothing 80.23 +0.64 +2.46 YOLOv3 and Faster-RCNN models, reaching as high as
+ mixup 81.32 +0.89 +3.55 83.68% single model single scale evaluation results.
Table 4. Incremental trick validation results of Faster-RCNN, eval- 4.2. Bag of Freebies on MS COCO.
uated at 600 × 1000 on Pascal VOC 2007 test set.
To further evaluate effectiveness of bag of freebies on
larger dataset, we benchmark on MS COCO [11] in order
sistance of data augmentation to create unseen patches. In to validate the generalization of our bags of tricks in this
terms of the training tricks we mentioned in the previous work. COCO 2017 is 10 times larger than Pascal VOC and
section, stacking Synchronized BatchNorm, Random Train- contains much more tiny objects compared with PASCAL
7
0.5:0.95 0.5:0.95
Model Orig mAPbbox Our BoF mAPbbox Absolute delta
Faster-RCNN R50 [5] 36.5 37.6 +1.1
Faster-RCNN R101 [5] 39.4 41.1 +1.7
YOLOv3 @320 [16] 28.2 33.6 +5.4
YOLOv3 @416 [16] 31.0 36.0 +5.0
YOLOv3 @608 [16] 33.0 37.0 +4.0
Table 5. Overview of improvements achieved by applying bag of freebies(BoF), evaluated on MS COCO [11] 2017 val set. Note that
YOLOv3 models can be evaluated at different input resolutions with same weights, our BoF improves evaluation results more significantly
at lower resolution levels.
-Mixup YOLO3 +Mixup YOLO3 4.3. Impact of mixup on different phases of training
-Mixup darknet53 35.0 35.3 detection network
+Mixup darknet53 36.4 37.0 Mixup can be applied in two phases of object detection
networks: 1) pre-training classification network backbone
Table 6. Combined analysis of impacts of mixup methodology for
pre-trained image classification and detection network. with traditional mixup [8, 24]; 2) training detection net-
works using proposed visually coherent image mixup for
object detection. Since we do not freeze weights pre-trained
-Mixup FRCNN +Mixup FRCNN on ImageNet, both training phase can affect final detec-
-Mixup R101 39.9 40.1 tion models. We compare the results using Darknet 53-
+Mixup R101 40.1 41.1 layer based YOLO3 [16] implementation and ResNet101
[7] based Faster-RCNN [17]. Final validation results are
Table 7. Combined analysis of impacts of mixup methodology for
listed in Table. 6 and Table. 7, respectively. While the re-
pre-trained image classification and detection network.
sults prove the consistent improvements by adopting mixup
to either training phases, interestingly it is also notable that
applying mixup in both phases can produce more significant
VOC. We use similar training and validation settings as Pas-
gains. For example, employing either pre-training mixup or
cal VOC, except that Faster-RCNN models are resized to
detection mixup has nearly 0.2% absolute mAP improve-
800 × 1300 pixels in response to smaller objects. The re-
ment over baseline. By combining both mixup techniques,
sults are shown in Table. 5.
we achieve 1.2% performance boost. We expect by apply-
In summary, our proposed bags of freebies boost Faster- ing mixup in both training phases, shallow layers of net-
RCNN models by 1.1% and 1.7% absolute mean AP over works are receiving statistically similar inputs, resulting in
existing state-of-the-art implementations [5] with ResNet less perturbations for low level filters.
50 and 101 base models, respectively. Following evalua-
tion resolution reports in [16], we list YOLOv3 evalution 5. Conclusion
results using 320, 416, 608 resolutions to compare perfor-
mance at different scales. While at 608 × 608 our model In this paper, we propose a bag of training enhancements
outperforms baseline [16] by 4.0% absolute mAP, at lower significantly improved model performances while introduc-
resolutions, this gap is more significantly 5.4% absolute ing zero overhead to the inference environment. Our empir-
mAP, almost 20% better than baseline. Note that all these ical experiments of YOLOv3 [16] and Faster-RCNN [17]
results are obtained by generating better weights in a fully on Pascal VOC and COCO datasets show that the bag of
compatible inference model, i.e., all these achievements are tricks are consistently improving object detection models.
free lunch during inference. We also notice that by adopt- By stacking all these tweaks, we observe no signs of degra-
ing bag of freebies during training, we successfully uplift dation of any level and suggest a wider adoption to future
YOLOv3 performance to the same level as state-of-the-art object detection training pipelines. These freebies are all
Faster-RCNN [5] (37.0 vs 36.5) while preserves faster in- training time modifications, and therefore only affect model
ference speed as part of single stage model benefits. weights without increasing inference time or change of net-
Mean AP is the average over 80 categories, which may work structures. All existing and future work will be in-
not reflect the per category performance. We plot per cat- cluded as part of open source GluonCV repository [1].
egory AP changes of YOLOv3 and Faster-RCNN models
before and after our BoF in Fig. 7 and Fig. 8 respectively. References
Except rare cases, we can see the majority of categories ben- [1] DMLC. Gluoncv toolkit. https://github.com/
efit from bag of freebies training tricks. dmlc/gluon-cv, 2018.
8
[2] M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and [18] A. Rosenfeld, R. Zemel, and J. K. Tsotsos. The elephant in
A. Zisserman. The pascal visual object classes (voc) chal- the room. arXiv preprint arXiv:1808.03305, 2018.
lenge. International journal of computer vision, 88(2):303– [19] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L.-C.
338, 2010. Chen. Mobilenetv2: Inverted residuals and linear bottle-
[3] R. Girshick. Fast r-cnn. In Proceedings of the IEEE inter- necks. In Proceedings of the IEEE Conference on Computer
national conference on computer vision, pages 1440–1448, Vision and Pattern Recognition, pages 4510–4520, 2018.
2015. [20] Z. Shen, Z. Liu, J. Li, Y.-G. Jiang, Y. Chen, and X. Xue.
[4] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich fea- Dsod: Learning deeply supervised object detectors from
ture hierarchies for accurate object detection and semantic scratch. 2017 IEEE International Conference on Computer
segmentation. In Proceedings of the IEEE conference on Vision (ICCV), Oct 2017.
computer vision and pattern recognition, pages 580–587, [21] K. Simonyan and A. Zisserman. Very deep convolutional
2014. networks for large-scale image recognition. arXiv preprint
[5] R. Girshick, I. Radosavovic, G. Gkioxari, P. Dollár, arXiv:1409.1556, 2014.
and K. He. Detectron. https://github.com/ [22] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna.
facebookresearch/detectron, 2018. Rethinking the inception architecture for computer vision. In
[6] P. Goyal, P. Dollár, R. Girshick, P. Noordhuis, Proceedings of the IEEE conference on computer vision and
L. Wesolowski, A. Kyrola, A. Tulloch, Y. Jia, and K. He. pattern recognition, pages 2818–2826, 2016.
Accurate, large minibatch sgd: training imagenet in 1 hour. [23] J. R. Uijlings, K. E. Van De Sande, T. Gevers, and A. W.
arXiv preprint arXiv:1706.02677, 2017. Smeulders. Selective search for object recognition. Interna-
[7] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learn- tional journal of computer vision, 104(2):154–171, 2013.
ing for image recognition. In Proceedings of the IEEE con-
[24] H. Zhang, M. Cisse, Y. N. Dauphin, and D. Lopez-Paz.
ference on computer vision and pattern recognition, pages
mixup: Beyond empirical risk minimization. arXiv preprint
770–778, 2016.
arXiv:1710.09412, 2017.
[8] T. He, Z. Zhang, H. Zhang, Z. Zhang, J. Xie, and M. Li. Bag
[25] H. Zhang, K. Dana, J. Shi, Z. Zhang, X. Wang, A. Tyagi, and
of tricks for image classification with convolutional neural
A. Agrawal. Context encoding for semantic segmentation.
networks. arXiv preprint arXiv:1812.01187, 2018.
In The IEEE Conference on Computer Vision and Pattern
[9] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang,
Recognition (CVPR), June 2018.
T. Weyand, M. Andreetto, and H. Adam. Mobilenets: Effi-
cient convolutional neural networks for mobile vision appli-
cations. arXiv preprint arXiv:1704.04861, 2017.
[10] S. Ioffe and C. Szegedy. Batch normalization: Accelerating
deep network training by reducing internal covariate shift.
arXiv preprint arXiv:1502.03167, 2015.
[11] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ra-
manan, P. Dollár, and C. L. Zitnick. Microsoft coco: Com-
mon objects in context. In European conference on computer
vision, pages 740–755. Springer, 2014.
[12] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-
Y. Fu, and A. C. Berg. Ssd: Single shot multibox detector.
In European conference on computer vision, pages 21–37.
Springer, 2016.
[13] I. Loshchilov and F. Hutter. Sgdr: Stochastic gradient de-
scent with warm restarts. arXiv preprint arXiv:1608.03983,
2016.
[14] C. Peng, T. Xiao, Z. Li, Y. Jiang, X. Zhang, K. Jia, G. Yu, and
J. Sun. Megdet: A large mini-batch object detector. In Pro-
ceedings of the IEEE Conference on Computer Vision and
Pattern Recognition, pages 6181–6189, 2018.
[15] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi. You
only look once: Unified, real-time object detection. In Pro-
ceedings of the IEEE conference on computer vision and pat-
tern recognition, pages 779–788, 2016.
[16] J. Redmon and A. Farhadi. Yolov3: An incremental improve-
ment. arXiv preprint arXiv:1804.02767, 2018.
[17] S. Ren, K. He, R. Girshick, and J. Sun. Faster r-cnn: Towards
real-time object detection with region proposal networks. In
Advances in neural information processing systems, pages
91–99, 2015.

Bag of Freebies For Training Object Detection Neural Networks

Uploaded by

Document Informationclick to expand document information

Copyright:

Available Formats

Bag of Freebies For Training Object Detection Neural Networks

Uploaded by

Document Information

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Bag of Freebies For Training Object Detection Neural Networks

Uploaded by

Copyright:

Available Formats

Bag of Freebies for Training Object Detection Neural Networks

Training heuristics greatly improve various image clas-

Figure 1. The Bag of Freebies improves object detector perfor-

where zi s are the unnormalized logits directly from the

(mix) (mix-1) (mix-2)

0.50 Natural training images come in various shapes. To

You might also like