Object Detection Using Domain Randomization and Generative Adversarial Refinement of Synthetic Images


Fernando Camaro Nogues, Andrew Huie, Sakyasingha Dasgupta

Ascent Robotics, Inc., Japan
{fernando, andrew, sakya}@ascent.ai

Abstract

In this work, we present an application of domain randomization and generative adversarial networks (GAN) to train a near real-time object detector for industrial electric parts, entirely in a simulated environment. Large-scale availability of labelled real-world data is typically rare and difficult to obtain in many industrial settings. As such, here only a few hundred unlabelled real images are used to train a Cyclic-GAN network, in combination with various degrees of domain randomization procedures. We demonstrate that this enables robust translation of synthetic images to the real-world domain. We show that a combination of the original synthetic (simulation) and GAN-translated images, when used for training a Mask-RCNN object detection network, achieves greater than 0.95 mean average precision in detecting and classifying a collection of industrial electric parts. We evaluate the performance across different combinations of training data.

1. Introduction

Successful applications of deep learning require a large amount of manually annotated data, which can be prohibitive for most applications, even when they start from a model pre-trained in another domain and only require a fine-tuning phase in the target domain.

An effective way to eliminate the cost of expensive annotation is to train the model within a simulated environment where the annotations can also be generated automatically. However, the problem with this approach is that the generated samples (in our case, images) may not follow the same distribution as the real domain, resulting in what is known as the reality gap. Several approaches exist that try to reduce this gap. One such method is domain randomization ([1], [2]), in which several rendering parameters of the scene are randomized, such as the colors of objects, textures, and lights, effectively enabling the model to see a very wide distribution during training that contains the real distribution as one variation within it.

Another approach that directly tries to minimize this reality gap is to refine the synthetic images so that they look more realistic. One possible way to build such a refiner is by using a generative adversarial training framework [3]. An alternative and more indirect approach to reduce the negative effect of the reality gap is again to use the GAN framework, but in this case directly on the features of some of the last layers of the network being trained for the specific target task [4].

In this work we present an experimental use case of an object detector in a real industrial application setting, trained with different combinations of synthetic images and refined synthetic images (synthetic images refined to look more realistic). We evaluate our method across various combinations of training data.

2. Synthetic Image Generation with Domain Randomization

The architecture used to produce the synthetic images for our experiments is composed of two main parts. First, the physics simulation engine Bullet (https://pybullet.org/wordpress/) is used to place the objects in a physically consistent configuration after letting them fall from a random position. Second, the ray-tracing rendering library POV-Ray (http://www.povray.org/) is used to render an image based on this configuration. In POV-Ray we introduce domain randomization by randomizing several parameters, namely the number of lights and their colors, the color and texture of each part of the target objects and of the scene floor plane, as well as the camera position. The camera position is drawn from a uniform distribution over a rectangular prism that sits 10 cm above the floor plane, with a square base of side 20 cm and a height of 10 cm. Although the location of the camera was uniform, the camera always pointed at the global coordinate origin with no roll angle. This variation of the camera position was intended to achieve robustness against different positions of the camera in the real world.
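The camera and lighting randomization described above can be sketched as follows. This is a minimal illustration in Python: the paper only specifies the camera volume, so the metre-based units, the base being centred on the origin, and the light-parameter ranges are assumptions.

```python
import random

def sample_camera_pose():
    # Sample the camera position uniformly inside the randomization volume
    # from Section 2: a prism with a 20 cm x 20 cm square base and 10 cm of
    # height, sitting 10 cm above the floor plane. Units are metres, and
    # centring the base on the origin is an assumption for illustration.
    position = (
        random.uniform(-0.10, 0.10),  # x: across the 20 cm base
        random.uniform(-0.10, 0.10),  # y: across the 20 cm base
        random.uniform(0.10, 0.20),   # z: 10 cm to 20 cm above the floor
    )
    look_at = (0.0, 0.0, 0.0)  # always pointing at the global origin, no roll
    return position, look_at

def sample_lights(max_lights=3):
    # Randomize the number of lights and their colors. The count limit and
    # position ranges are hypothetical; the paper does not specify them.
    return [
        {
            "position": (random.uniform(-1.0, 1.0),
                         random.uniform(-1.0, 1.0),
                         random.uniform(0.5, 1.5)),
            "rgb": (random.random(), random.random(), random.random()),
        }
        for _ in range(random.randint(1, max_lights))
    ]
```

Each sampled pose and light set would then be written into a POV-Ray scene description before rendering.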
3. Refinement of Synthetic Images by Adversarial Training

An alternative way we consider to reduce the reality gap is to use the GAN framework to refine the synthetic images to look more realistic. Here, we selected the Cyclic-GAN architecture [6], since it only requires two sets of unpaired examples, one for each domain: the synthetic and the real one. The original synthetic images, of size 1024x768, were too large for training our Cyclic-GAN model; as such, instead of resizing the images, we opted for training on random crops of size 256x256. This way we can train at the original pixel density and exploit the fact that our generators are fully convolutional networks, so that during the inference phase we can still input the original full-size image.

Figure 1: Left: example of synthetic image. Right: corresponding synthetic image after translation to the real domain. The USB socket has gained a more realistic reflection, and the switch has gained a realistic surface texture and color.

We noticed that after training, one particular target object lost its color and turned gray, while the remaining objects were refined in a realistic manner without losing their original color. We think that this was mainly due to the particular architecture of the discriminators: the final layer of the discriminator model consisted of a spatial grid of discriminator neurons whose receptive field with respect to the input image was too small to capture that object. To solve this, we added more convolutional layers to the discriminator models, which effectively increased the receptive field size. Furthermore, instead of substituting one grid of discriminators for another, we preferred to maintain both: one with a small receptive field intended to discriminate details of the objects, and another with a large receptive field that can understand the objects as a whole (Fig. 3 in the Appendix). The final loss was computed as the mean over all individual discriminator units of both layers. This small modification enabled us to maintain the color of all the objects. The Cyclic-GAN model was trained using 10K synthetic images and 256 real images. Fig. 1 shows an example of the resulting image from our model that translates from the synthetic domain to the real domain; see Fig. 4 in the Appendix for more examples.

4. Experiments

In this section we compare different combinations of training data and their impact on the mAP for object detection with a Mask-RCNN model [5]. As a test dataset we used 100 real images.

The different types of datasets used for training were: S_fix: synthetic images with fixed object colors, no texture, and a white background. S_fix→real: images translated from S_fix to the real domain. S_rand-tex: synthetic images with objects and background with randomized colors but without texture. S_rand+tex: synthetic images with objects and background with randomized colors and texture. See Fig. 2 in the Appendix for a general overview of the training architecture and Fig. 5 for examples of the different types of images employed.

The target objects to be detected consisted of 12 tiny electronic parts for which accurate 3D CAD models were available (Fig. 6). In all the experiments we used 10K training samples, the same number of training iterations, and the same hyperparameters.

The object detection performance for the different combinations of datasets used in the experiments is presented in Table 1. Using a training set made purely of one type of data resulted in a mAP below 0.9 in most cases, with the exception of S_rand+tex. Overall, the best detection results were obtained when the refined synthetic image set (S_fix→real) was combined with high-variation randomized data (S_rand+tex). The results indicate that neither domain randomization nor GAN-based refinement is enough on its own to reach sufficient performance. In combination, they reduce the reality gap effectively, resulting in a significant boost in performance (see the real-time object detection video at https://youtu.be/Q-WeXSSnZ0U). Refer to Fig. 7 for the training curves associated with the different experiments, and to Fig. 8 for some detection result images.

Training data                          mAP (0.5 IoU)
100% S_fix                             0.812
100% S_fix→real                        0.874
100% S_rand-tex                        0.867
100% S_rand+tex                        0.911
20% S_fix and 80% S_rand+tex           0.914
20% S_fix→real and 80% S_rand+tex      0.955
50% S_fix→real and 50% S_rand+tex      0.950

Table 1: Performance of the Mask-RCNN network for the different training datasets.
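The dataset combinations in Table 1 can be reproduced with a simple mixing helper. This is a minimal sketch: the function, its name, and sampling with replacement are illustrative assumptions; the paper only specifies the mixing percentages and the 10K total sample count.

```python
import random

def mix_datasets(pool_a, pool_b, fraction_a, n_samples, seed=0):
    # Compose a training set from two image pools, e.g. 20% S_fix->real and
    # 80% S_rand+tex, the best combination in Table 1. Pools are lists of
    # image paths; sampling with replacement is an assumption.
    rng = random.Random(seed)
    n_a = round(n_samples * fraction_a)
    picks = [rng.choice(pool_a) for _ in range(n_a)]
    picks += [rng.choice(pool_b) for _ in range(n_samples - n_a)]
    rng.shuffle(picks)
    return picks

# e.g. the best-performing mix: 20% refined, 80% randomized, 10K samples
# train_set = mix_datasets(refined_paths, randomized_paths, 0.20, 10_000)
```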

References

[1] Josh Tobin, Rachel Fong, Alex Ray, Jonas Schneider, Wojciech Zaremba, Pieter Abbeel. Domain Randomization for Transferring Deep Neural Networks from Simulation to the Real World. Proceedings of the 30th IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Vancouver, Canada, October 2017.
[2] Fereshteh Sadeghi and Sergey Levine. CAD2RL: Real Single-Image Flight without a Single Real Image. Robotics: Science and Systems (RSS), 2017.
[3] Ashish Shrivastava, Tomas Pfister, Oncel Tuzel, Joshua Susskind, Wenda Wang and Russell Webb. Learning from Simulated and Unsupervised Images through Adversarial Training. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, July 21-26, 2017.
[4] Yaroslav Ganin et al. Domain-Adversarial Training of Neural Networks. The Journal of Machine Learning Research, 2016.
[5] Kaiming He, Georgia Gkioxari, Piotr Dollár and Ross B. Girshick. Mask R-CNN. 2017 IEEE International Conference on Computer Vision (ICCV), 2017.
[6] Jun-Yan Zhu, Taesung Park, Phillip Isola and Alexei A. Efros. Unpaired Image-to-Image Translation using Cycle-Consistent Adversarial Networks. 2017 IEEE International Conference on Computer Vision (ICCV), 2017.

Appendix

In Fig. 2 we provide a schematic overview of the object detection training data generation pipeline.

Figure 2: General architecture for training the object detector.

Figure 3: Discriminator network with two grid layers of discriminator cells, one with a small receptive field and the other with a bigger receptive field.

Figure 4: Left column: images from S_fix. Right column: corresponding refined images (S_fix→real).

(a) Example of S_fix image (b) Example of S_rand-tex image

(c) Example of S_rand+tex image (d) Example of a real image used to train the Cyclic-GAN

Figure 5: Examples of different types of images employed in the experiments.

(a) tactile switch (b) pin header (c) 3-way cable mount screw terminal

(d) DC power jack (e) DIP switch (f) slide switch

(g) LED (h) IC socket (i) trimmer

(j) buzzer (k) USB type A socket (l) USB type C socket

Figure 6: Electronic parts used in the experiments.

Figure 7: Mask-RCNN training loss. The model was trained by fine-tuning a Mask-RCNN model pre-trained on the COCO dataset: first training only the Mask-RCNN heads (without training the region proposal network or the backbone) for 10 epochs with a learning rate of 0.002, and then the whole network for another 5 epochs with a learning rate of 0.0002. We used an SGD optimizer with a momentum of 0.9. The configurations that achieved the best performance, "20% S_fix→real and 80% S_rand+tex" and "50% S_fix→real and 50% S_rand+tex", are the ones that had the worst loss values during training. We think this is because these datasets were more difficult, which in the end prepared the model better for the equally difficult real test dataset.
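The two-stage schedule in this caption can be written down as a small configuration sketch (the layer-group labels "heads" and "all" are illustrative names for the trainable subsets, not an actual Mask-RCNN API):

```python
# Two-stage fine-tuning schedule from the Figure 7 caption.
FINETUNE_SCHEDULE = [
    # Stage 1: train only the heads, keeping the RPN and backbone frozen.
    {"layers": "heads", "epochs": 10, "learning_rate": 0.002},
    # Stage 2: unfreeze and train the whole network at a lower rate.
    {"layers": "all", "epochs": 5, "learning_rate": 0.0002},
]
OPTIMIZER = {"type": "SGD", "momentum": 0.9}
```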

Figure 8: Example of detection results for 20% S_fix→real and 80% S_rand+tex.

