Object Detection Using Domain Randomization and Generative Adversarial Refinement of Synthetic Images
… world.

3. Refinement of synthetic images by adversarial training

An alternative way we consider to reduce the reality gap is to use the GAN framework to refine the synthetic images so that they look more realistic. Here we selected the Cyclic-GAN [6] architecture, since it only requires two sets of unpaired examples, one for each domain: the synthetic and the real one. The original synthetic images, of size 1024x768, were too large for training our Cyclic-GAN model, so instead of resizing the images we opted for training on random crops of size 256x256. This way we train at the original pixel density and exploit the fact that our generators are fully convolutional networks, so that during the inference phase we can still feed in the original full-size image.
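To make the mechanism concrete, the following is a minimal sketch in PyTorch (with a toy stand-in for the actual generator architecture, which is not reproduced here): a fully convolutional generator trained on 256x256 crops accepts the full 1024x768 frame at inference time, since none of its layers assumes a fixed input size.

    import torch
    import torch.nn as nn

    # Toy stand-in for the refinement generator: purely convolutional,
    # so no layer assumes a fixed input resolution.
    G = nn.Sequential(
        nn.Conv2d(3, 64, kernel_size=7, padding=3), nn.ReLU(inplace=True),
        nn.Conv2d(64, 64, kernel_size=3, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(64, 3, kernel_size=7, padding=3), nn.Tanh(),
    )

    def random_crop(img, size=256):
        # Random square crop at the native pixel density (no resizing).
        _, h, w = img.shape
        top = torch.randint(0, h - size + 1, ()).item()
        left = torch.randint(0, w - size + 1, ()).item()
        return img[:, top:top + size, left:left + size]

    synthetic = torch.rand(3, 768, 1024)        # one full-size synthetic frame
    crop = random_crop(synthetic).unsqueeze(0)  # training input: 1x3x256x256
    full = synthetic.unsqueeze(0)               # inference input: 1x3x768x1024
    assert G(crop).shape[-2:] == (256, 256)     # trained on crops...
    assert G(full).shape[-2:] == (768, 1024)    # ...but handles full frames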
Figure 1: Left: example of a synthetic image. Right: corresponding synthetic image after translation to the real domain. The USB socket has gained a more realistic reflection, and the switch has gained a realistic surface texture and color.
We noticed that after training, one particular target object lost its color and turned gray, while the remaining objects were refined in a realistic manner without losing their original color. We think this was mainly due to the architecture of the discriminators: the final layer of the discriminator model consisted of a spatial grid of discriminator neurons whose receptive field with respect to the input image was too small to capture that object. To solve this, we added more convolutional layers to the discriminator models, which effectively increased the receptive field size. Furthermore, instead of substituting one grid of discriminators for another, we preferred to keep both: one with a small receptive field, intended to discriminate details of the objects, and another with a large receptive field that can understand the objects as a whole (Fig. 3 in Appendix). The final loss was computed as the mean over all individual discriminator units of both layers. This small modification enabled us to maintain the color of all the objects. The Cyclic-GAN model was trained using 10K synthetic images and 256 real images. Fig. 1 shows an example of the resulting image with our model that translates from the synthetic domain to the real domain; see Fig. 4 in Appendix for more examples.
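The sketch below illustrates this two-grid design in PyTorch; the layer counts and channel widths are our assumptions for illustration, not the exact discriminator used. A shared trunk feeds a fine grid with a small receptive field and, after extra convolutions, a coarse grid with a large receptive field; the loss averages over every unit of both grids.

    import torch
    import torch.nn as nn

    class TwoGridDiscriminator(nn.Module):
        # PatchGAN-style discriminator with two output grids: a fine grid
        # whose cells have a small receptive field (object details) and a
        # coarse grid, after extra convolutions, with a large receptive
        # field (objects as a whole). Layer sizes here are illustrative.
        def __init__(self):
            super().__init__()
            self.trunk = nn.Sequential(
                nn.Conv2d(3, 64, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
                nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
                nn.Conv2d(128, 256, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            )
            self.fine_head = nn.Conv2d(256, 1, 4, padding=1)
            self.deep = nn.Sequential(
                nn.Conv2d(256, 512, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
                nn.Conv2d(512, 512, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            )
            self.coarse_head = nn.Conv2d(512, 1, 4, padding=1)

        def forward(self, x):
            feats = self.trunk(x)
            return self.fine_head(feats), self.coarse_head(self.deep(feats))

    def d_loss_real(disc, real_batch):
        # Least-squares GAN loss on real images, averaged over every
        # discriminator unit of both grids.
        fine, coarse = disc(real_batch)
        return 0.5 * (((fine - 1) ** 2).mean() + ((coarse - 1) ** 2).mean())

    D = TwoGridDiscriminator()
    loss = d_loss_real(D, torch.rand(1, 3, 256, 256))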
4. Experiments

In this section we compare different combinations of training data and their impact on the mAP of object detection with a Mask-RCNN model [5]. As a test dataset we used 100 real images.

The different types of datasets used for training were:

- S_fix: synthetic images with fixed object colors, no texture, and a white background.
- S_fix→real: images from S_fix translated to the real domain.
- S_rand-tex: synthetic images with objects and background of randomized colors, but without texture.
- S_rand+tex: synthetic images with objects and background of randomized colors and with texture.

See Fig. 2 in the appendix for a general overview of the training architecture and Fig. 5 for examples of the different types of images employed; a toy sketch of the appearance randomization follows below.
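A toy sketch of the per-object appearance randomization behind S_rand-tex and S_rand+tex (the function and texture file names are hypothetical placeholders; the real pipeline renders the 3D CAD models, see Fig. 2):

    import random

    def randomize_appearance(rng, textures=None):
        # Draw a random RGB color, plus (for S_rand+tex) a random texture.
        # The texture file names used below are hypothetical placeholders.
        color = tuple(rng.random() for _ in range(3))
        texture = rng.choice(textures) if textures is not None else None
        return color, texture

    rng = random.Random(42)
    color_only, _ = randomize_appearance(rng)                          # S_rand-tex
    color, tex = randomize_appearance(rng, ["wood.png", "metal.png"])  # S_rand+tex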
The target objects to be detected consisted of 12 tiny electronic parts for which accurate 3D CAD models were available (Fig. 6). In all experiments we used 10K training samples, the same number of training iterations, and the same hyperparameters.

The object detection performance for the different combinations of datasets used in the experiments is presented in Table 1. Training purely on one type of data resulted in a mAP below 0.9 in every case except S_rand+tex. Overall, the best detection results were obtained when the refined synthetic image set (S_fix→real) was combined with the high-variation randomized data (S_rand+tex). The results indicate that neither domain randomization nor GAN-based refinement is enough on its own to reach sufficient performance. In combination, they reduce the reality gap effectively, resulting in a significant boost in performance (see the real-time object detection video at https://youtu.be/Q-WeXSSnZ0U). Refer to Fig. 7 for the training curves associated with the different experiments, and to Fig. 8 for some detection result images.

    Training data                          mAP (0.5 IoU)
    -------------------------------------  -------------
    100% S_fix                             0.812
    100% S_fix→real                        0.874
    100% S_rand-tex                        0.867
    100% S_rand+tex                        0.911
    20% S_fix and 80% S_rand+tex           0.914
    20% S_fix→real and 80% S_rand+tex      0.955
    50% S_fix→real and 50% S_rand+tex      0.950

Table 1: Performance of the Mask-RCNN network for the different training datasets.
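The mixed datasets in Table 1 amount to sampling a fixed fraction of each source. A minimal sketch of how such a mixture could be assembled (the helper and the path lists are hypothetical, not from the paper):

    import random

    def mix_datasets(refined, randomized, frac_refined, total=10_000, seed=0):
        # Draw frac_refined * total samples from the refined set and fill
        # the rest from the randomized set, then shuffle.
        rng = random.Random(seed)
        n_refined = int(total * frac_refined)
        mixed = (rng.sample(refined, n_refined)
                 + rng.sample(randomized, total - n_refined))
        rng.shuffle(mixed)
        return mixed

    # Best mixture in Table 1 (mAP 0.955), assuming lists of image paths:
    # train_set = mix_datasets(s_fix2real_paths, s_randtex_paths, 0.20)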
References
[1] Josh Tobin, Rachel Fong, Alex Ray, Jonas Schneider, Wojciech Zaremba, and Pieter Abbeel. Domain Randomization for Transferring Deep Neural Networks from Simulation to the Real World. In Proceedings of the 30th IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Vancouver, Canada, October 2017.
[2] Fereshteh Sadeghi and Sergey Levine. CAD2RL: Real Single-Image Flight without a Single Real Image. In Robotics: Science and Systems (RSS), 2017.
[3] Ashish Shrivastava, Tomas Pfister, Oncel Tuzel, Joshua Susskind, Wenda Wang, and Russell Webb. Learning from Simulated and Unsupervised Images through Adversarial Training. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, July 2017.
[4] Yaroslav Ganin et al. Domain-Adversarial Training of Neural Networks. The Journal of Machine Learning Research, 2016.
[5] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross B. Girshick. Mask R-CNN. In IEEE International Conference on Computer Vision (ICCV), 2017.
[6] Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A. Efros. Unpaired Image-to-Image Translation using Cycle-Consistent Adversarial Networks. In IEEE International Conference on Computer Vision (ICCV), 2017.
Appendix
In Fig. 2 we provide a schematic overview of the object detection training data generation pipeline.

Figure 3: Discriminator network with two grid layers of discriminator cells, one with a small receptive field and the other with a larger receptive field.
Figure 4: Left column: images from S_fix. Right column: corresponding refined images (S_fix→real).
Figure 5: (a) Example of an S_fix image. (b) Example of an S_rand-tex image. (c) Example of an S_rand+tex image. (d) Example of a real image used to train the Cyclic-GAN.
Figure 6: The 12 target electronic parts, including (a) tactile switch, (b) pin header, (c) 3-way cable mount screw terminal, … (j) buzzer, (k) USB type A socket, (l) USB type C socket.
Figure 7: Mask-RCNN training loss. The model was trained by fine-tuning a Mask-RCNN model pre-trained on the COCO dataset: first only the Mask-RCNN heads were trained (with the region proposal network and the backbone frozen) for 10 epochs with a learning rate of 0.002, and then the whole network was trained for another 5 epochs with a learning rate of 0.0002. We used an SGD optimizer with a momentum of 0.9. The configurations that achieved the best performance, "20% S_fix→real and 80% S_rand+tex" and "50% S_fix→real and 50% S_rand+tex", had the worst loss values during training. We think this is because these datasets were more difficult, which in the end prepared the model better for the real test dataset, which is also difficult.
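A sketch of this two-stage fine-tuning schedule, written against torchvision's Mask-RCNN since the paper does not specify its implementation (`loader` is a hypothetical DataLoader yielding (images, targets) in torchvision's detection format):

    import torch
    import torchvision

    # Mask-RCNN pre-trained on COCO, as in the caption above.
    model = torchvision.models.detection.maskrcnn_resnet50_fpn(weights="DEFAULT")

    def run_stage(model, loader, lr, epochs):
        params = [p for p in model.parameters() if p.requires_grad]
        opt = torch.optim.SGD(params, lr=lr, momentum=0.9)
        model.train()
        for _ in range(epochs):
            for images, targets in loader:
                loss = sum(model(images, targets).values())  # sum the loss dict
                opt.zero_grad()
                loss.backward()
                opt.step()

    # Stage 1: heads only -- freeze the backbone and the region proposal
    # network, then train for 10 epochs at lr = 0.002.
    for module in (model.backbone, model.rpn):
        for p in module.parameters():
            p.requires_grad = False
    # run_stage(model, loader, lr=0.002, epochs=10)

    # Stage 2: unfreeze everything and fine-tune the whole network for
    # another 5 epochs at lr = 0.0002.
    for p in model.parameters():
        p.requires_grad = True
    # run_stage(model, loader, lr=0.0002, epochs=5)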
Figure 8: Example of detection results for 20% S_fix→real and 80% S_rand+tex.