DualGAN: Unsupervised Dual Learning for Image-to-Image Translation
Figure 1: Network architecture and data flow chart of DualGAN for image-to-image translation.
nator DA that discriminates between GA's fake outputs and real members of domain V. Analogously, the dual GAN learns the generator GB and a discriminator DB. The overall architecture and data flow are illustrated in Fig. 1.

As shown in Fig. 1, image u ∈ U is translated to domain V using GA. How well the translation GA(u, z) fits in V is evaluated by DA, where z is random noise, and so is the z′ that appears below. GA(u, z) is then translated back to domain U using GB, which outputs GB(GA(u, z), z′) as the reconstructed version of u. Similarly, v ∈ V is translated to U as GB(v, z′) and then reconstructed as GA(GB(v, z′), z). The discriminator DA is trained with v as positive samples and GA(u, z) as negative examples, whereas DB takes u as positive and GB(v, z′) as negative. Generators GA and GB are optimized to emulate "fake" outputs to blind the corresponding discriminators DA and DB, as well as to minimize the two reconstruction losses ‖GA(GB(v, z′), z) − v‖ and ‖GB(GA(u, z), z′) − u‖.
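In code form, the two cycles above can be summarized by the short sketch below; the generator callables, the noise arguments, and the l1 helper are illustrative stand-ins rather than the authors' implementation.

import numpy as np

def l1(a, b):
    """Mean absolute (L1) difference between two image arrays."""
    return float(np.mean(np.abs(a - b)))

def dual_cycles(u, v, G_A, G_B, z, z_prime):
    """Run both translation/reconstruction cycles sketched in Fig. 1."""
    fake_v = G_A(u, z)               # u translated into domain V
    recon_u = G_B(fake_v, z_prime)   # and translated back to domain U
    fake_u = G_B(v, z_prime)         # v translated into domain U
    recon_v = G_A(fake_u, z)         # and translated back to domain V
    return fake_v, fake_u, l1(recon_u, u), l1(recon_v, v)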
3.1. Objective

As in the traditional GAN, the objective of the discriminators is to discriminate the generated fake samples from the real ones. Nevertheless, here we use the loss format advocated by Wasserstein GAN (WGAN) [1] rather than the sigmoid cross-entropy loss used in the original GAN [3]. The former has been shown to perform better in terms of generator convergence and sample quality, as well as in improving the stability of the optimization [1]. The corresponding loss functions used in DA and DB are defined as:

l^A_d(u, v) = DA(GA(u, z)) − DA(v),   (1)

l^B_d(u, v) = DB(GB(v, z′)) − DB(u),   (2)

where u ∈ U and v ∈ V.

The same loss function is used for both generators GA and GB as they share the same objective. Previous works on conditional image synthesis found it beneficial to replace the L2 distance with L1, since the former often leads to blurriness [6, 23]. Hence, we adopt the L1 distance to measure the recovery error, which is added to the GAN objective to force the translated samples to obey the domain distribution:

l^g(u, v) = λU ‖u − GB(GA(u, z), z′)‖ + λV ‖v − GA(GB(v, z′), z)‖ − DA(GB(v, z′)) − DB(GA(u, z)),   (3)

where u ∈ U, v ∈ V, and λU, λV are two constant parameters. Depending on the application, λU and λV are typically set to a value within [100.0, 1,000.0]. If U contains natural images and V does not (e.g., aerial photos→maps), we find it more effective to use a smaller λU than λV.
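To make Eqs. (1)-(3) concrete, the following NumPy sketch evaluates them on mini-batches of images. The callables G_A, G_B, D_A, D_B, the batch averaging, and the default λ values (picked from the [100, 1,000] range mentioned above) are illustrative stand-ins rather than the paper's code; the adversarial terms are transcribed directly from Eq. (3).

import numpy as np

def critic_loss_A(D_A, G_A, u, v, z):
    """Eq. (1): l^A_d(u, v) = D_A(G_A(u, z)) - D_A(v), averaged over the batch."""
    return float(np.mean(D_A(G_A(u, z))) - np.mean(D_A(v)))

def critic_loss_B(D_B, G_B, u, v, z_prime):
    """Eq. (2): l^B_d(u, v) = D_B(G_B(v, z')) - D_B(u), averaged over the batch."""
    return float(np.mean(D_B(G_B(v, z_prime))) - np.mean(D_B(u)))

def generator_loss(G_A, G_B, D_A, D_B, u, v, z, z_prime,
                   lambda_u=500.0, lambda_v=500.0):
    """Eq. (3): weighted L1 recovery errors plus the two adversarial terms."""
    fake_v = G_A(u, z)                    # translation U -> V
    fake_u = G_B(v, z_prime)              # translation V -> U
    recon_u = G_B(fake_v, z_prime)        # round trip U -> V -> U
    recon_v = G_A(fake_u, z)              # round trip V -> U -> V
    recovery = (lambda_u * np.mean(np.abs(u - recon_u))
                + lambda_v * np.mean(np.abs(v - recon_v)))
    adversarial = -np.mean(D_A(fake_u)) - np.mean(D_B(fake_v))
    return float(recovery + adversarial)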
3.2. Network configuration

DualGAN is constructed with identical network architecture for GA and GB. The generator is configured with equal numbers of downsampling (pooling) and upsampling layers. In addition, we configure the generator with skip connections between mirrored downsampling and upsampling layers as in [16, 4], making it a U-shaped net. Such a design enables low-level information to be shared between input and output, which is beneficial since many image translation problems implicitly assume alignment between image structures in the input and output (e.g., object shapes, textures, clutter, etc.). Without the skip layers, information from all levels has to pass through the bottleneck, typically causing significant loss of high-frequency information. Furthermore, similar to [4], we did not explicitly provide the noise vectors z, z′. Instead, they are provided only in the form of dropout and applied to several layers of our generators at both training and test phases.

For discriminators, we employ the Markovian PatchGAN architecture as explored in [8], which assumes independence between pixels distanced beyond a specific patch size and models images only at the patch level rather than over the full image. Such a configuration is effective in capturing local high-frequency features such as texture and style, but less so in modeling global distributions. It fulfills our needs well, since the recovery loss encourages preservation of global and low-frequency information and the discriminators are designated to capture local high-frequency information. The effectiveness of this configuration has been verified on various translation tasks [23]. Similar to [23], we run this discriminator convolutionally across the image, averaging all responses to provide the ultimate output. An extra advantage of such a scheme is that it requires fewer parameters, runs faster, and has no constraints over the size of the input image. The patch size at which the discriminator operates is fixed at 70 × 70, and the image resolutions were mostly 256 × 256, the same as in pix2pix [4].
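As a concrete picture of this configuration, the sketch below shows a small U-shaped generator with mirrored down/up blocks and skip connections, plus a 70 × 70 Markovian patch critic, in PyTorch. It is only an illustration: the depth (three levels here, versus the deeper generators needed for 256 × 256 images), the channel widths, the normalization layers, and all class names are our own assumptions rather than the authors' exact settings; dropout is kept active at test time so that it can play the role of the noise z, z′, as described above.

import torch
import torch.nn as nn
import torch.nn.functional as F

class DownBlock(nn.Module):
    """One downsampling step: strided conv + normalization + LeakyReLU."""
    def __init__(self, c_in, c_out):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(c_in, c_out, 4, stride=2, padding=1),
            nn.InstanceNorm2d(c_out),
            nn.LeakyReLU(0.2),
        )

    def forward(self, x):
        return self.net(x)

class UpBlock(nn.Module):
    """Mirrored upsampling step with a skip connection to the encoder."""
    def __init__(self, c_in, c_out, noise_dropout=False):
        super().__init__()
        self.noise_dropout = noise_dropout
        self.net = nn.Sequential(
            nn.ConvTranspose2d(c_in, c_out, 4, stride=2, padding=1),
            nn.InstanceNorm2d(c_out),
            nn.ReLU(),
        )

    def forward(self, x, skip):
        y = self.net(x)
        if self.noise_dropout:
            # dropout stands in for the noise z / z'; kept active at test time too
            y = F.dropout(y, p=0.5, training=True)
        # concatenate with the mirrored downsampling feature map (skip link)
        return torch.cat([y, skip], dim=1)

class UNetGenerator(nn.Module):
    """U-shaped generator: equal numbers of down/up blocks plus skips."""
    def __init__(self, channels=3, base=64):
        super().__init__()
        self.d1 = DownBlock(channels, base)
        self.d2 = DownBlock(base, base * 2)
        self.d3 = DownBlock(base * 2, base * 4)
        self.u1 = UpBlock(base * 4, base * 2, noise_dropout=True)
        self.u2 = UpBlock(base * 4, base, noise_dropout=True)
        self.out = nn.Sequential(
            nn.ConvTranspose2d(base * 2, channels, 4, stride=2, padding=1),
            nn.Tanh(),
        )

    def forward(self, x):
        s1 = self.d1(x)        # 1/2 resolution
        s2 = self.d2(s1)       # 1/4
        s3 = self.d3(s2)       # 1/8 (bottleneck of this small sketch)
        y = self.u1(s3, s2)
        y = self.u2(y, s1)
        return self.out(y)

class PatchCritic(nn.Module):
    """Markovian PatchGAN critic: scores 70x70 patches, then averages."""
    def __init__(self, channels=3, base=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(channels, base, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(base, base * 2, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(base * 2, base * 4, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(base * 4, base * 8, 4, stride=1, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(base * 8, 1, 4, stride=1, padding=1),  # one score per patch
        )

    def forward(self, x):
        return self.net(x).mean(dim=(1, 2, 3))  # average the patch responses

Under this sketch, GA and GB would each be a UNetGenerator, and DA, DB would each be a PatchCritic.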
Figure 2: Results of day→night translation (from left to right: input, GT, DualGAN, GAN, cGAN [4]). cGAN [4] is trained with labeled data, whereas DualGAN and GAN are trained in an unsupervised manner. DualGAN successfully emulates the night scenes while preserving textures in the inputs, e.g., see the differences over the cloud regions between our results and the ground truth (GT). In comparison, the results of cGAN and GAN contain much less detail.

3.3. Training procedure

To optimize the DualGAN networks, we follow the training procedure proposed in WGAN [1]; see Alg. 1. We train the discriminators for ncritic steps, then take one step on the generators. We employ mini-batch stochastic gradient descent and apply the RMSProp solver, as momentum-based methods such as Adam would occasionally cause instability [1], and RMSProp is known to perform well even on highly non-stationary problems [19, 1]. We typically set the number of critic iterations per generator iteration, ncritic, to 2-4 and the batch size to 1-4, without noticeable differences in effectiveness in the experiments. The clipping parameter c is normally set in [0.01, 0.1], varying by application.

Algorithm 1 DualGAN training procedure
Require: Image set U, image set V, GAN A with generator parameters θA and discriminator parameters ωA, GAN B with generator parameters θB and discriminator parameters ωB, clipping parameter c, batch size m, and ncritic
1: Randomly initialize ωi, θi, i ∈ {A, B}
2: repeat
3:   for t = 1, . . . , ncritic do
4:     sample images {u^(k)}_{k=1}^{m} ⊆ U, {v^(k)}_{k=1}^{m} ⊆ V
5:     update ωA to minimize (1/m) Σ_{k=1}^{m} l^A_d(u^(k), v^(k))
6:     update ωB to minimize (1/m) Σ_{k=1}^{m} l^B_d(u^(k), v^(k))
7:     clip(ωA, −c, c), clip(ωB, −c, c)
8:   end for
9:   sample images {u^(k)}_{k=1}^{m} ⊆ U, {v^(k)}_{k=1}^{m} ⊆ V
10:  update θA, θB to minimize (1/m) Σ_{k=1}^{m} l^g(u^(k), v^(k))
11: until convergence

Training for traditional GANs needs to carefully balance between the generator and the discriminator, since, as the discriminator improves, the sigmoid cross-entropy loss is locally saturated and may lead to vanishing gradients. Unlike in traditional GANs, the Wasserstein loss is differentiable almost everywhere, resulting in a better discriminator. At each iteration, the generators are not trained until the discriminators have been trained for ncritic steps. Such a procedure enables the discriminators to provide more reliable gradient information [1].
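Algorithm 1 translates almost line for line into a WGAN-style training loop. The sketch below is one possible PyTorch rendering under our own assumptions: the module and sampler names, the learning rate, and the specific values of n_critic and c are illustrative, and noise enters implicitly through generator dropout, so the generators take only an image.

import itertools
import torch

def train_dualgan(G_A, G_B, D_A, D_B, sample_U, sample_V,
                  n_critic=3, clip_c=0.05, lr=5e-5,
                  lambda_u=500.0, lambda_v=500.0, n_iters=100000):
    # RMSProp, as in the text; momentum-based solvers are avoided.
    opt_d = torch.optim.RMSprop(
        itertools.chain(D_A.parameters(), D_B.parameters()), lr=lr)
    opt_g = torch.optim.RMSprop(
        itertools.chain(G_A.parameters(), G_B.parameters()), lr=lr)

    for _ in range(n_iters):
        # Critic phase: lines 3-8 of Alg. 1.
        for _ in range(n_critic):
            u, v = sample_U(), sample_V()
            fake_v = G_A(u).detach()   # translations fixed while training critics
            fake_u = G_B(v).detach()
            loss_d = (D_A(fake_v).mean() - D_A(v).mean()      # Eq. (1)
                      + D_B(fake_u).mean() - D_B(u).mean())   # Eq. (2)
            opt_d.zero_grad()
            loss_d.backward()
            opt_d.step()
            for p in itertools.chain(D_A.parameters(), D_B.parameters()):
                p.data.clamp_(-clip_c, clip_c)                # clipping, line 7

        # Generator phase: lines 9-10 of Alg. 1.
        u, v = sample_U(), sample_V()
        fake_v, fake_u = G_A(u), G_B(v)
        recon_u, recon_v = G_B(fake_v), G_A(fake_u)
        loss_g = (lambda_u * (u - recon_u).abs().mean()
                  + lambda_v * (v - recon_v).abs().mean()
                  - D_A(fake_u).mean() - D_B(fake_v).mean())  # Eq. (3)
        opt_g.zero_grad()
        loss_g.backward()
        opt_g.step()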
4. Experimental results and evaluation

To assess the capability of DualGAN in general-purpose image-to-image translation, we conduct experiments on a variety of tasks, including photo-sketch conversion, label-image translation, and artistic stylization.

To compare DualGAN with GAN and cGAN [4], four labeled datasets are used: PHOTO-SKETCH [22, 25], DAY-NIGHT [5], LABEL-FACADES [20], and AERIAL-MAPS, which was directly captured from Google Maps [4]. These datasets consist of corresponding images between two domains; they serve as ground truth (GT) and can also be used for supervised learning. However, none of these datasets could guarantee accurate feature alignment at the pixel level. For example, the sketches in the PHOTO-SKETCH dataset were drawn by artists and do not accurately align with the corresponding photos, moving objects and cloud pattern changes often show up in the DAY-NIGHT dataset, and the labels in the LABEL-FACADES dataset are not always
precise. This highlights, in part, the difficulty in obtaining high-quality matching image pairs.

Figure 3: Results of label→facade translation (from left to right: input, GT, DualGAN, GAN, cGAN [4]). DualGAN faithfully preserves the structures in the label images, even though some labels do not match well with the corresponding photos in finer details. In contrast, results from GAN and cGAN contain many artifacts. Over regions with label-photo misalignment, cGAN often yields blurry output (e.g., the roof in the second row and the entrance in the third row).

DualGAN enables us to utilize abundant unlabeled image sources from the Web. Two unlabeled and unpaired datasets are also tested in our experiments. The MATERIAL dataset includes images of objects made of different materials, e.g., stone, metal, plastic, fabric, and wood. These images were manually selected from Flickr and cover a variety of illumination conditions, compositions, colors, textures, and material sub-types [17]. This dataset was initially used for material recognition, but is applied here for material transfer. The OIL-CHINESE painting dataset includes artistic paintings of two disparate styles: oil and Chinese. All images were crawled from search engines and vary in quality, format, and size. We reformat, crop, and resize the images for training and evaluation. In both of these datasets, no correspondence is available between images from different domains.

5. Qualitative evaluation

Using the four labeled datasets, we first compare DualGAN with GAN and cGAN [4] on the following translation tasks: day→night (Figure 2), labels↔facade (Figures 3 and 10), face photo↔sketch (Figures 4 and 5), and map↔aerial photo (Figures 8 and 9). In all these tasks, cGAN was trained with labeled (i.e., paired) data, where we ran the model and code provided in [4] and chose the optimal loss function for each task: L1 loss for facade→label and L1 + cGAN loss for the other tasks (see [4] for more details). In contrast, DualGAN and GAN were trained in an unsupervised way, i.e., we decoupled the image pairs and then reshuffled the data. The results of GAN were generated using our approach by setting λU = λV = 0.0 in eq. (3); note that this GAN differs from the original GAN model [3] in that it employs a conditional generator.

All three models were trained on the same training datasets and tested on novel data that does not overlap with the training data. All training was carried out on a single GeForce GTX Titan X GPU. At test time, all models ran in well under a second on this GPU.

Compared to GAN, in almost all cases, DualGAN produces results that are less blurry, contain fewer artifacts, better preserve the content structures in the inputs, and better capture features (e.g., texture, color, and/or style) of the target domain. We attribute these improvements to the reconstruction loss, which forces the inputs to be reconstructable from the outputs through the dual generator and strengthens the feedback signal that encodes the targeted distribution.

In many cases, DualGAN also compares favorably to the supervised cGAN in terms of the sharpness of the outputs and their faithfulness to the input images; see Figures 2, 3, 4, 5, and 8. This is encouraging since the supervision in cGAN does utilize additional image and pixel correspondences. On the other hand, when translating between photos and semantic-based labels, such as map↔aerial and label↔facades, it is often impossible to infer the correspondences between pixel colors and labels based on the targeted distribution alone. As a result, DualGAN may map pixels to wrong labels (see Figures 9 and 10) or labels to wrong colors/textures (see Figures 3 and 8).

Figures 6 and 7 show image translation results obtained using the two unlabeled datasets, including oil↔Chinese, plastic→metal, metal→stone, leather→fabric, as well as wood↔plastic. The results demonstrate that visually convincing images can be generated by DualGAN when no corresponding images can be found in the target domains. As well, the DualGAN results generally contain fewer artifacts than those from GAN.

5.1. Quantitative evaluation

To quantitatively evaluate DualGAN, we set up two user studies through Amazon Mechanical Turk (AMT). The "material perceptual" test evaluates the material transfer results, in which we mix the outputs from all material transfer tasks and let the Turkers choose the best match based on which material they believe the objects in the image are made of. For a total of 176 output images, each was evaluated by ten Turkers. An output image is rated as a success if at least three Turkers selected the target material type. Suc-
Figure 7: Experimental results for various material transfer tasks. From top to bottom: plastic→metal, metal→stone, leather→fabric, and plastic↔wood. Each triplet of panels shows the input, the DualGAN output, and the GAN output.
6. Conclusion

We propose DualGAN, a novel unsupervised dual learning framework for general-purpose image-to-image translation. The unsupervised characteristic of DualGAN enables many real-world applications, as demonstrated in this work, as well as in the concurrent work CycleGAN [26].
Figure 8: Map→aerial photo translation (from left to right: input, GT, DualGAN, GAN, cGAN [4]). Without image correspondences for training, DualGAN may map the orange-colored interstate highways to building roofs with bright colors. Nevertheless, the DualGAN results are sharper than those from GAN and cGAN.

Figure 10: Facades→label translation (from left to right: input, GT, DualGAN, GAN, cGAN [4]). While cGAN correctly labels various building components such as windows, doors, and balconies, the overall label images are not as detailed and structured as DualGAN's outputs.
References

[1] M. Arjovsky, S. Chintala, and L. Bottou. Wasserstein GAN. arXiv preprint arXiv:1701.07875, 2017.
[2] Y. Aytar, L. Castrejon, C. Vondrick, H. Pirsiavash, and A. Torralba. Cross-modal scene networks. CoRR, abs/1610.09003, 2016.
[3] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672–2680, 2014.
[4] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros. Image-to-image translation with conditional adversarial networks. arXiv preprint arXiv:1611.07004, 2016.
[5] P.-Y. Laffont, Z. Ren, X. Tao, C. Qian, and J. Hays. Transient attributes for high-level understanding and editing of outdoor scenes. ACM Transactions on Graphics (TOG), 33(4):149, 2014.
[6] A. B. L. Larsen, S. K. Sønderby, H. Larochelle, and O. Winther. Autoencoding beyond pixels using a learned similarity metric. arXiv preprint arXiv:1512.09300, 2015.
[7] C. Ledig, L. Theis, F. Huszár, J. Caballero, A. Cunningham, A. Acosta, A. Aitken, A. Tejani, J. Totz, Z. Wang, et al. Photo-realistic single image super-resolution using a generative adversarial network. arXiv preprint arXiv:1609.04802, 2016.
[8] C. Li and M. Wand. Precomputed real-time texture synthesis with Markovian generative adversarial networks. In European Conference on Computer Vision (ECCV), pages 702–716. Springer, 2016.
[9] M. Liu, T. Breuel, and J. Kautz. Unsupervised image-to-image translation networks. CoRR, abs/1703.00848, 2017.
[10] M.-Y. Liu and O. Tuzel. Coupled generative adversarial networks. In Advances in Neural Information Processing Systems, pages 469–477, 2016.
[11] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In Computer Vision and Pattern Recognition (CVPR), pages 3431–3440, 2015.
[12] M. Mathieu, C. Couprie, and Y. LeCun. Deep multi-scale video prediction beyond mean square error. arXiv preprint arXiv:1511.05440, 2015.
[13] M. Mirza and S. Osindero. Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784, 2014.
[14] G. Perarnau, J. van de Weijer, B. Raducanu, and J. M. Álvarez. Invertible conditional GANs for image editing. arXiv preprint arXiv:1611.06355, 2016.
[15] S. Reed, Z. Akata, X. Yan, L. Logeswaran, B. Schiele, and H. Lee. Generative adversarial text to image synthesis. In Proceedings of the 33rd International Conference on Machine Learning, volume 3, 2016.
[16] O. Ronneberger, P. Fischer, and T. Brox. U-Net: Convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 234–241. Springer, 2015.
[17] L. Sharan, R. Rosenholtz, and E. Adelson. Material perception: What can you see in a brief glance? Journal of Vision, 9(8):784–784, 2009.
[18] Y. Taigman, A. Polyak, and L. Wolf. Unsupervised cross-domain image generation. arXiv preprint arXiv:1611.02200, 2016.
[19] T. Tieleman and G. Hinton. Lecture 6.5 - RMSProp: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 4(2), 2012.
[20] R. Tyleček and R. Šára. Spatial pattern templates for recognition of objects with regular structure. In German Conference on Pattern Recognition, pages 364–374. Springer, 2013.
[21] X. Wang and A. Gupta. Generative image modeling using style and structure adversarial networks. In European Conference on Computer Vision (ECCV), pages 318–335. Springer, 2016.
[22] X. Wang and X. Tang. Face photo-sketch synthesis and recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 31(11):1955–1967, 2009.
[23] Y. Xia, D. He, T. Qin, L. Wang, N. Yu, T.-Y. Liu, and W.-Y. Ma. Dual learning for machine translation. arXiv preprint arXiv:1611.00179, 2016.
[24] X. Yan, J. Yang, K. Sohn, and H. Lee. Attribute2Image: Conditional image generation from visual attributes. In European Conference on Computer Vision (ECCV), pages 776–791. Springer, 2016.
[25] W. Zhang, X. Wang, and X. Tang. Coupled information-theoretic encoding for face photo-sketch recognition. In Computer Vision and Pattern Recognition (CVPR), pages 513–520. IEEE, 2011.
[26] J. Zhu, T. Park, P. Isola, and A. A. Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. In International Conference on Computer Vision (ICCV), to appear, 2017.