
2017 IEEE International Conference on Computer Vision

DualGAN: Unsupervised Dual Learning for Image-to-Image Translation

Zili Yi (1,2), Hao Zhang (2), Ping Tan (2), and Minglun Gong (1)

(1) Memorial University of Newfoundland, Canada
(2) Simon Fraser University, Canada

Abstract

Conditional Generative Adversarial Networks (GANs) for cross-domain image-to-image translation have made much progress recently [7, 8, 21, 12, 4, 18]. Depending on the task complexity, thousands to millions of labeled image pairs are needed to train a conditional GAN. However, human labeling is expensive, even impractical, and large quantities of data may not always be available. Inspired by dual learning from natural language translation [23], we develop a novel dual-GAN mechanism, which enables image translators to be trained from two sets of unlabeled images from two domains. In our architecture, the primal GAN learns to translate images from domain U to those in domain V, while the dual GAN learns to invert the task. The closed loop made by the primal and dual tasks allows images from either domain to be translated and then reconstructed. Hence a loss function that accounts for the reconstruction error of images can be used to train the translators. Experiments on multiple image translation tasks with unlabeled data show considerable performance gain of DualGAN over a single GAN. For some tasks, DualGAN can even achieve comparable or slightly better results than a conditional GAN trained on fully labeled data.

1. Introduction

Many image processing and computer vision tasks, e.g., image segmentation, stylization, and abstraction, can be posed as image-to-image translation problems [4], which convert one visual representation of an object or scene into another. Conventionally, these tasks have been tackled separately due to their intrinsic disparities [7, 8, 21, 12, 4, 18]. It is not until the past two years that general-purpose and end-to-end deep learning frameworks, most notably those utilizing fully convolutional networks (FCNs) [11] and conditional generative adversarial nets (cGANs) [4], have been developed to enable a unified treatment of these tasks.

Up to date, these general-purpose methods have all been supervised and trained with a large number of labeled and matching image pairs. In practice, however, acquiring such training data can be time-consuming (e.g., with pixelwise or patchwise labeling) and even unrealistic. For example, while there are plenty of photos or sketches available, photo-sketch image pairs depicting the same people under the same pose are scarce. In other image translation settings, e.g., converting daylight scenes to night scenes, even though labeled and matching image pairs can be obtained with stationary cameras, moving objects in the scene often cause varying degrees of content discrepancies.

In this paper, we aim to develop an unsupervised learning framework for general-purpose image-to-image translation, which only relies on unlabeled image data, such as two sets of photos and sketches for the photo-to-sketch conversion task. The obvious technical challenge is how to train a translator without any data characterizing correct translations. Our approach is inspired by dual learning from natural language processing [23]. Dual learning trains two "opposite" language translators (e.g., English-to-French and French-to-English) simultaneously by minimizing the reconstruction loss resulting from a nested application of the two translators. The two translators represent a primal-dual pair, and the nested application forms a closed loop, allowing the application of reinforcement learning. Specifically, the reconstruction loss measured over monolingual data (either English or French) would generate informative feedback to train a bilingual translation model.

Our work develops a dual learning framework for image-to-image translation for the first time and differs from the original NLP dual learning method of Xia et al. [23] in two main aspects. First, the NLP method relied on pre-trained (English and French) language models to indicate how confident the translator outputs are natural sentences in their respective target languages. With general-purpose processing in mind, and the realization that such pre-trained models are difficult to obtain for many image translation tasks, our work develops GAN discriminators [3] that are trained adversarially with the translators to capture domain distributions. Hence, we call our learning architecture DualGAN.

Furthermore, we employ FCNs as translators, which naturally accommodate the 2D structure of images, rather than sequence-to-sequence translation models such as LSTMs or Gated Recurrent Units (GRUs).

Taking two sets of unlabeled images as input, each characterizing an image domain, DualGAN simultaneously learns two reliable image translators from one domain to the other and hence can operate on a wide variety of image-to-image translation tasks. The effectiveness of DualGAN is validated through comparison with both GAN (with an image-conditional generator and the original discriminator) and conditional GAN [4]. The comparison results demonstrate that, for some applications, DualGAN can outperform supervised methods trained on labeled data.

2. Related work

Since the seminal work by Goodfellow et al. [3] in 2014, a series of GAN-family methods have been proposed for a wide variety of problems. The original GAN can learn a generator to capture the distribution of real data by introducing an adversarial discriminator that evolves to discriminate between the real data and the fake [3]. Soon after, various conditional GANs (cGANs) were proposed to condition the image generation on class labels [13], attributes [14, 24], texts [15], and images [7, 8, 21, 12, 4, 18].

Most image-conditional models were developed for specific applications such as super-resolution [7], texture synthesis [8], style transfer from normal maps to images [21], and video prediction [12], whereas few others were aiming for general-purpose processing [4, 18]. The general-purpose solution for image-to-image translation proposed by Isola et al. [4] requires a significant number of labeled image pairs. The unsupervised mechanism for cross-domain image conversion presented by Taigman et al. [18] can train an image-conditional generator without paired images, but relies on a sophisticated pre-trained function that maps images from either domain to an intermediate representation, which requires labeled data in other formats.

Dual learning was first proposed by Xia et al. [23] to reduce the requirement on labeled data in training English-to-French and French-to-English translators. The French-to-English translation is the dual task to English-to-French translation, and they can be trained side-by-side. The key idea of dual learning is to set up a dual-learning game involving two agents, each of whom only understands one language and can evaluate how likely the translated sentences are natural sentences in the targeted language and to what extent the reconstructed sentences are consistent with the originals. Such a mechanism is played alternately on both sides, allowing translators to be trained from monolingual data only.

Despite the lack of parallel bilingual data, two types of feedback signals can be generated: the membership score, which evaluates the likelihood of the translated texts belonging to the targeted language, and the reconstruction error, which measures the disparity between the reconstructed sentences and the originals. Both signals are assessed with the assistance of application-specific domain knowledge, i.e., the pre-trained English and French language models.

In our work, we aim for a general-purpose solution for image-to-image conversion and hence do not utilize any domain-specific knowledge or pre-trained domain representations. Instead, we use a domain-adaptive GAN discriminator to evaluate the membership score of translated samples, whereas the reconstruction error is measured as the mean absolute difference between the reconstructed and original images within each image domain.

In CycleGAN, a concurrent work by Zhu et al. [26], the same idea for unpaired image-to-image translation is proposed, where the primal-dual relation in DualGAN is referred to as a cyclic mapping and their cycle consistency loss is essentially the same as our reconstruction loss. Superiority of CycleGAN has been demonstrated on several tasks where paired training data hardly exist, e.g., in object transfiguration and painting style and season transfer.

Recent work by Liu and Tuzel [10], which we refer to as coupled GAN or CoGAN, also trains two GANs together to solve image translation problems without paired training data. Unlike DualGAN or CycleGAN, the two GANs in CoGAN are not linked to enforce cycle consistency. Instead, CoGAN learns a joint distribution over images from two domains. By sharing weight parameters corresponding to high-level semantics in both the generative and discriminative networks, CoGAN can enforce the two GANs to interpret these image semantics in the same way. However, the weight-sharing assumption in CoGAN and similar approaches, e.g., [2, 9], does not lead to effective general-purpose solutions, as its applicability is task-dependent, leading to unnatural image translation results, as shown in comparative studies by CycleGAN [26].

DualGAN and CycleGAN both aim for general-purpose image-to-image translation without requiring a joint representation to bridge the two image domains. In addition, DualGAN trains both primal and dual GANs at the same time, allowing a reconstruction error term to be used to generate informative feedback signals.

3. Method

Given two sets of unlabeled and unpaired images sampled from domains U and V, respectively, the primal task of DualGAN is to learn a generator G_A : U → V that maps an image u ∈ U to an image v ∈ V, while the dual task is to train an inverse generator G_B : V → U. To realize this, we employ two GANs, the primal GAN and the dual GAN.
Figure 1: Network architecture and data flow chart of DualGAN for image-to-image translation.

The primal GAN learns the generator G_A and a discriminator D_A that discriminates between G_A's fake outputs and real members of domain V. Analogously, the dual GAN learns the generator G_B and a discriminator D_B. The overall architecture and data flow are illustrated in Fig. 1.

As shown in Fig. 1, an image u ∈ U is translated to domain V using G_A. How well the translation G_A(u, z) fits in V is evaluated by D_A, where z is random noise, and so is the z' that appears below. G_A(u, z) is then translated back to domain U using G_B, which outputs G_B(G_A(u, z), z') as the reconstructed version of u. Similarly, v ∈ V is translated to U as G_B(v, z') and then reconstructed as G_A(G_B(v, z'), z). The discriminator D_A is trained with v as positive samples and G_A(u, z) as negative examples, whereas D_B takes u as positive and G_B(v, z') as negative. Generators G_A and G_B are optimized to emulate "fake" outputs to blind the corresponding discriminators D_A and D_B, as well as to minimize the two reconstruction losses ||G_A(G_B(v, z'), z) - v|| and ||G_B(G_A(u, z), z') - u||.
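To make the data flow concrete, the short sketch below traces one closed loop in each direction and computes the two reconstruction errors. It is an illustrative, PyTorch-style fragment rather than the authors' code: G_A and G_B are assumed to be callables mapping image batches to image batches, and the noise z, z' is assumed to enter only implicitly through dropout inside the generators, as described later in Sec. 3.2.

import torch

def closed_loop(u, v, G_A, G_B):
    # Primal direction: U -> V -> U
    v_fake = G_A(u)        # corresponds to G_A(u, z)
    u_rec = G_B(v_fake)    # corresponds to G_B(G_A(u, z), z')

    # Dual direction: V -> U -> V
    u_fake = G_B(v)        # corresponds to G_B(v, z')
    v_rec = G_A(u_fake)    # corresponds to G_A(G_B(v, z'), z)

    # Reconstruction errors, measured as mean absolute difference (L1)
    rec_u = torch.mean(torch.abs(u_rec - u))
    rec_v = torch.mean(torch.abs(v_rec - v))
    return v_fake, u_fake, rec_u, rec_v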
3.1. Objective

As in the traditional GAN, the objective of the discriminators is to discriminate the generated fake samples from the real ones. Nevertheless, here we use the loss format advocated by Wasserstein GAN (WGAN) [1] rather than the sigmoid cross-entropy loss used in the original GAN [3]. It is proven that the former performs better in terms of generator convergence and sample quality, as well as in improving the stability of the optimization [1]. The corresponding loss functions used in D_A and D_B are defined as:

    l_d^A(u, v) = D_A(G_A(u, z)) - D_A(v),                                    (1)
    l_d^B(u, v) = D_B(G_B(v, z')) - D_B(u),                                   (2)

where u ∈ U and v ∈ V.

The same loss function is used for both generators G_A and G_B as they share the same objective. Previous works on conditional image synthesis found it beneficial to replace the L2 distance with L1, since the former often leads to blurriness [6, 23]. Hence, we adopt the L1 distance to measure the recovery error, which is added to the GAN objective to force the translated samples to obey the domain distribution:

    l^g(u, v) = λ_U ||u - G_B(G_A(u, z), z')|| + λ_V ||v - G_A(G_B(v, z'), z)||
                - D_A(G_B(v, z')) - D_B(G_A(u, z)),                           (3)

where u ∈ U, v ∈ V, and λ_U, λ_V are two constant parameters. Depending on the application, λ_U and λ_V are typically set to a value within [100.0, 1000.0]. If U contains natural images and V does not (e.g., aerial photo→maps), we find it more effective to use smaller λ_U than λ_V.
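The losses above translate directly into a few lines of code. The following PyTorch-style sketch is illustrative only; the helper names (critic_losses, generator_loss) and the default λ values are ours, and D_A, D_B are assumed to return unbounded WGAN critic scores.

import torch

def critic_losses(u, v, G_A, G_B, D_A, D_B):
    # Eq. (1) and (2); generator outputs are detached so that a critic
    # update does not propagate gradients into the generators.
    l_dA = D_A(G_A(u).detach()).mean() - D_A(v).mean()
    l_dB = D_B(G_B(v).detach()).mean() - D_B(u).mean()
    return l_dA, l_dB

def generator_loss(u, v, G_A, G_B, D_A, D_B, lambda_u=500.0, lambda_v=500.0):
    # Eq. (3): weighted L1 reconstruction errors plus the two adversarial terms.
    # The default lambdas are arbitrary values inside the paper's [100, 1000] range.
    v_fake, u_fake = G_A(u), G_B(v)
    rec_u = torch.mean(torch.abs(u - G_B(v_fake)))   # ||u - G_B(G_A(u, z), z')||
    rec_v = torch.mean(torch.abs(v - G_A(u_fake)))   # ||v - G_A(G_B(v, z'), z)||
    adv = -D_A(u_fake).mean() - D_B(v_fake).mean()
    return lambda_u * rec_u + lambda_v * rec_v + adv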
3.2. Network configuration

DualGAN is constructed with identical network architectures for G_A and G_B. The generator is configured with an equal number of downsampling (pooling) and upsampling layers. In addition, we configure the generator with skip connections between mirrored downsampling and upsampling layers as in [16, 4], making it a U-shaped net. Such a design enables low-level information to be shared between input and output, which is beneficial since many image translation problems implicitly assume alignment between image structures in the input and output (e.g., object shapes, textures, clutter, etc.). Without the skip layers, information from all levels has to pass through the bottleneck, typically causing significant loss of high-frequency information. Furthermore, similar to [4], we did not explicitly provide the noise vectors z, z'. Instead, they are provided only in the form of dropout and applied to several layers of our generators at both the training and test phases.

For the discriminators, we employ the Markovian PatchGAN architecture as explored in [8], which assumes independence between pixels distanced beyond a specific patch size and models images only at the patch level rather than over the full image. Such a configuration is effective in capturing local high-frequency features such as texture and style, but less so in modeling global distributions. It fulfills our needs well, since the recovery loss encourages preservation of global and low-frequency information while the discriminators are designated to capture local high-frequency information. The effectiveness of this configuration has been verified on various translation tasks [23]. Similar to [23], we run this discriminator convolutionally across the image, averaging all responses to provide the ultimate output. An extra advantage of such a scheme is that it requires fewer parameters, runs faster, and has no constraints on the size of the input image. The patch size at which the discriminator operates is fixed at 70 × 70, and the image resolutions were mostly 256 × 256, the same as pix2pix [4].
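As a concrete illustration of this configuration, the sketch below builds a small U-shaped generator with skip connections and dropout (carrying the implicit noise), and a Markovian patch critic whose scores are averaged over the image. The paper specifies only the overall design (mirrored down/upsampling with skips, dropout as noise, a 70 × 70 patch discriminator, WGAN-style scores); the framework, layer counts, channel widths, and kernel sizes here are our assumptions for illustration.

import torch
import torch.nn as nn

class TinyUNetGenerator(nn.Module):
    # Mirrored downsampling/upsampling with skip connections; dropout plays the
    # role of the noise z. Note: the paper keeps dropout active at test time,
    # which in PyTorch means leaving these Dropout modules in training mode.
    def __init__(self, channels=3, width=64):
        super().__init__()
        self.down1 = nn.Sequential(nn.Conv2d(channels, width, 4, 2, 1), nn.LeakyReLU(0.2))
        self.down2 = nn.Sequential(nn.Conv2d(width, width * 2, 4, 2, 1), nn.LeakyReLU(0.2))
        self.down3 = nn.Sequential(nn.Conv2d(width * 2, width * 4, 4, 2, 1), nn.LeakyReLU(0.2))
        self.up3 = nn.Sequential(nn.ConvTranspose2d(width * 4, width * 2, 4, 2, 1),
                                 nn.ReLU(), nn.Dropout(0.5))
        self.up2 = nn.Sequential(nn.ConvTranspose2d(width * 4, width, 4, 2, 1),
                                 nn.ReLU(), nn.Dropout(0.5))
        self.up1 = nn.Sequential(nn.ConvTranspose2d(width * 2, channels, 4, 2, 1), nn.Tanh())

    def forward(self, x):
        d1 = self.down1(x)
        d2 = self.down2(d1)
        d3 = self.down3(d2)
        u3 = self.up3(d3)
        u2 = self.up2(torch.cat([u3, d2], dim=1))    # skip connection
        return self.up1(torch.cat([u2, d1], dim=1))  # skip connection

class PatchCritic(nn.Module):
    # Markovian patch discriminator: a fully convolutional stack producing a grid
    # of scores (one per local patch), averaged into a single critic value.
    # With these five 4x4 convolutions each score sees a 70x70 input patch.
    def __init__(self, channels=3, width=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(channels, width, 4, 2, 1), nn.LeakyReLU(0.2),
            nn.Conv2d(width, width * 2, 4, 2, 1), nn.LeakyReLU(0.2),
            nn.Conv2d(width * 2, width * 4, 4, 2, 1), nn.LeakyReLU(0.2),
            nn.Conv2d(width * 4, width * 8, 4, 1, 1), nn.LeakyReLU(0.2),
            nn.Conv2d(width * 8, 1, 4, 1, 1),
        )

    def forward(self, x):
        return self.net(x).mean(dim=(1, 2, 3))  # average all patch responses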
Figure 2: Results of day→night translation (columns: input, GT, DualGAN, GAN, cGAN [4]). cGAN [4] is trained with labeled data, whereas DualGAN and GAN are trained in an unsupervised manner. DualGAN successfully emulates the night scenes while preserving textures in the inputs, e.g., see the differences over the cloud regions between our results and the ground truth (GT). In comparison, the results of cGAN and GAN contain much less detail.

3.3. Training procedure

To optimize the DualGAN networks, we follow the training procedure proposed in WGAN [1]; see Alg. 1. We train the discriminators for n_critic steps, then one step on the generators. We employ mini-batch stochastic gradient descent and apply the RMSProp solver, as momentum-based methods such as Adam would occasionally cause instability [1], and RMSProp is known to perform well even on highly non-stationary problems [19, 1]. We typically set the number of critic iterations per generator iteration, n_critic, to 2-4 and assign the batch size to 1-4, without noticeable differences in effectiveness in the experiments. The clipping parameter c is normally set in [0.01, 0.1], varying by application.
Algorithm 1 DualGAN training procedure
Require: Image set U, image set V, GAN A with generator parameters θ_A and discriminator parameters ω_A, GAN B with generator parameters θ_B and discriminator parameters ω_B, clipping parameter c, batch size m, and n_critic
 1: Randomly initialize ω_i, θ_i, i ∈ {A, B}
 2: repeat
 3:   for t = 1, ..., n_critic do
 4:     sample images {u^(k)}_{k=1}^m ⊆ U, {v^(k)}_{k=1}^m ⊆ V
 5:     update ω_A to minimize (1/m) Σ_{k=1}^m l_d^A(u^(k), v^(k))
 6:     update ω_B to minimize (1/m) Σ_{k=1}^m l_d^B(u^(k), v^(k))
 7:     clip(ω_A, -c, c), clip(ω_B, -c, c)
 8:   end for
 9:   sample images {u^(k)}_{k=1}^m ⊆ U, {v^(k)}_{k=1}^m ⊆ V
10:   update θ_A, θ_B to minimize (1/m) Σ_{k=1}^m l^g(u^(k), v^(k))
11: until convergence

Training for traditional GANs needs to carefully balance between the generator and the discriminator, since, as the discriminator improves, the sigmoid cross-entropy loss is locally saturated and may lead to vanishing gradients. Unlike in traditional GANs, the Wasserstein loss is differentiable almost everywhere, resulting in a better discriminator. At each iteration, the generators are not trained until the discriminators have been trained for n_critic steps. Such a procedure enables the discriminators to provide more reliable gradient information [1].
11: until convergence datasets could guarantee accurate feature alignment at the
pixel level. For example, the sketches in SKETCH-PHOTO
dataset were drawn by artists and do not accurately align
Training for traditional GANs needs to carefully balance with the corresponding photos, moving objects and cloud
between the generator and the discriminator, since, as the pattern changes often show up in the DAY-NIGHT dataset,
discriminator improves, the sigmoid cross-entropy loss is and the labels in LABEL-FACADES dataset are not always

Figure 3: Results of label→facade translation (columns: input, GT, DualGAN, GAN, cGAN [4]). DualGAN faithfully preserves the structures in the label images, even though some labels do not match well with the corresponding photos in finer details. In contrast, results from GAN and cGAN contain many artifacts. Over regions with label-photo misalignment, cGAN often yields blurry output (e.g., the roof in the second row and the entrance in the third row).

DualGAN enables us to utilize abundant unlabeled image sources from the Web. Two unlabeled and unpaired datasets are also tested in our experiments. The MATERIAL dataset includes images of objects made of different materials, e.g., stone, metal, plastic, fabric, and wood. These images were manually selected from Flickr and cover a variety of illumination conditions, compositions, colors, textures, and material sub-types [17]. This dataset was initially used for material recognition, but is applied here for material transfer. The OIL-CHINESE painting dataset includes artistic paintings of two disparate styles: oil and Chinese. All images were crawled from search engines and they vary in quality, format, and size. We reformat, crop, and resize the images for training and evaluation. In both of these datasets, no correspondence is available between images from different domains.

5. Qualitative evaluation

Using the four labeled datasets, we first compare DualGAN with GAN and cGAN [4] on the following translation tasks: day→night (Figure 2), labels↔facades (Figures 3 and 10), face photo↔sketch (Figures 4 and 5), and map↔aerial photo (Figures 8 and 9). In all these tasks, cGAN was trained with labeled (i.e., paired) data, where we ran the model and code provided in [4] and chose the optimal loss function for each task: L1 loss for facade→label and L1 + cGAN loss for the other tasks (see [4] for more details). In contrast, DualGAN and GAN were trained in an unsupervised way, i.e., we decouple the image pairs and then reshuffle the data. The results of GAN were generated using our approach by setting λ_U = λ_V = 0.0 in eq. (3), noting that this GAN is different from the original GAN model [3] as it employs a conditional generator.

All three models were trained on the same training datasets and tested on novel data that does not overlap with the training data. All training was carried out on a single GeForce GTX Titan X GPU. At test time, all models ran in well under a second on this GPU.

Compared to GAN, in almost all cases, DualGAN produces results that are less blurry, contain fewer artifacts, and better preserve content structures in the inputs and capture features (e.g., texture, color, and/or style) of the target domain. We attribute the improvements to the reconstruction loss, which forces the inputs to be reconstructable from the outputs through the dual generator and strengthens the feedback signals that encode the targeted distribution.

In many cases, DualGAN also compares favorably to the supervised cGAN in terms of the sharpness of the outputs and faithfulness to the input images; see Figures 2, 3, 4, 5, and 8. This is encouraging since the supervision in cGAN does utilize additional image and pixel correspondences. On the other hand, when translating between photos and semantics-based labels, such as map↔aerial and label↔facades, it is often impossible to infer the correspondences between pixel colors and labels based on the targeted distribution alone. As a result, DualGAN may map pixels to wrong labels (see Figures 9 and 10) or labels to wrong colors/textures (see Figures 3 and 8).

Figures 6 and 7 show image translation results obtained using the two unlabeled datasets, including oil↔Chinese, plastic→metal, metal→stone, leather→fabric, as well as wood↔plastic. The results demonstrate that visually convincing images can be generated by DualGAN when no corresponding images can be found in the target domains. As well, the DualGAN results generally contain fewer artifacts than those from GAN.

5.1. Quantitative evaluation

To quantitatively evaluate DualGAN, we set up two user studies through Amazon Mechanical Turk (AMT). The "material perceptual" test evaluates the material transfer results, in which we mix the outputs from all material transfer tasks and let the Turkers choose the best match based on which material they believe the objects in the image are made of. For a total of 176 output images, each was evaluated by ten Turkers. An output image is rated as a success if at least three Turkers selected the target material type.
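For clarity, this success criterion can be written down directly; the snippet below is a small illustrative helper, with names and data layout of our own choosing rather than the study's actual evaluation code.

def material_success_rate(votes_per_image, target_material, min_votes=3):
    # votes_per_image: one list per output image, holding the material chosen
    # by each of the ten Turkers who saw that image.
    successes = sum(
        1 for votes in votes_per_image
        if sum(v == target_material for v in votes) >= min_votes
    )
    return successes / len(votes_per_image)

# Example: 2 of 3 images reach the >= 3-vote threshold -> success rate 2/3.
rate = material_success_rate(
    [["metal"] * 4 + ["stone"] * 6,
     ["metal"] * 2 + ["wood"] * 8,
     ["metal"] * 5 + ["fabric"] * 5],
    target_material="metal",
)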

2872
Figure 4: Photo→sketch translation for faces (columns: input, GT, DualGAN, GAN, cGAN [4]). Results of DualGAN are generally sharper than those from cGAN, even though the former was trained using unpaired data, whereas the latter makes use of image correspondence.

Figure 5: Results for sketch→photo translation of faces (columns: input, GT, DualGAN, GAN, cGAN [4]). More artifacts and blurriness show up in the results generated by GAN and cGAN than in those of DualGAN.

Figure 6: Experimental results for translating Chinese paintings to oil paintings, without GT available (columns: input, DualGAN, GAN). The background grids shown in the GAN results imply that the outputs of GAN are not as stable as those of DualGAN.

Success rates of various material transfer results using different approaches are summarized in Table 1, showing that DualGAN outperforms GAN by a large margin.

In addition, we run the AMT "realness score" evaluation for sketch→photo, label map→facades, map→aerial photo, and day→night translations. To eliminate potential bias, for each of the four evaluations, we randomly shuffle real photos and outputs from all three approaches before showing them to Turkers. Each image is shown to 20 Turkers, who were asked to score the image based on to what extent the synthesized photo looks real. The "realness" score ranges from 0 (totally missing), 1 (bad), 2 (acceptable), 3 (good), to 4 (compelling). The average scores of the different approaches on the various tasks are then computed and shown in Table 2. The AMT study results show that DualGAN outperforms GAN on all tasks and outperforms cGAN on two tasks as well. This indicates that cGAN has little tolerance for misalignment and inconsistency between image pairs, but the additional pixel-level correspondence does help cGAN correctly map labels to colors and textures.

Finally, we compute the segmentation accuracies for the facades→label and aerial→map tasks, as reported in Tables 3 and 4. The comparison shows that DualGAN is outperformed by cGAN, which is expected as it is difficult to infer proper labeling without image correspondence information from the training data.
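The paper does not spell out the formulas behind these three segmentation metrics. The sketch below uses the standard definitions (per-pixel accuracy, mean per-class accuracy, and mean class intersection-over-union) as an illustrative assumption, applied to integer label maps, e.g., after mapping the translated RGB outputs to label classes.

import numpy as np

def segmentation_scores(pred, gt, num_classes):
    # pred, gt: integer label maps of the same shape.
    pred, gt = pred.ravel(), gt.ravel()
    per_pixel_acc = np.mean(pred == gt)

    per_class_acc, per_class_iou = [], []
    for c in range(num_classes):
        gt_c, pred_c = (gt == c), (pred == c)
        if gt_c.sum() == 0:
            continue  # skip classes absent from the ground truth
        per_class_acc.append((pred_c & gt_c).sum() / gt_c.sum())
        per_class_iou.append((pred_c & gt_c).sum() / (pred_c | gt_c).sum())

    return per_pixel_acc, np.mean(per_class_acc), np.mean(per_class_iou)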

Figure 7: Experimental results for various material transfer tasks (each group shows, left to right: input, DualGAN output, GAN output). From top to bottom: plastic→metal, metal→stone, leather→fabric, and plastic↔wood.

Table 1: Success rates of various material transfer tasks based on the AMT "material perceptual" test. There are 11 images in each set of transfer results, with noticeable improvements of DualGAN over GAN.

    Task             DualGAN   GAN
    plastic→wood     2/11      0/11
    wood→plastic     1/11      0/11
    metal→stone      2/11      0/11
    stone→metal      2/11      0/11
    leather→fabric   3/11      2/11
    fabric→leather   2/11      1/11
    plastic→metal    7/11      3/11
    metal→plastic    1/11      0/11

Table 2: Average AMT "realness" scores of outputs from various tasks. The results show that DualGAN outperforms GAN in all tasks. It also outperforms cGAN for the sketch→photo and day→night tasks, but still lags behind for the label→facades and map→aerial tasks. In the latter two tasks, the additional image correspondence in the training data helps cGAN map labels to the proper colors/textures.

    Task             DualGAN   cGAN [4]   GAN    GT
    sketch→photo     1.87      1.69       1.04   3.56
    day→night        2.42      1.89       0.13   3.05
    label→facades    1.89      2.59       1.43   3.33
    map→aerial       2.52      2.92       1.88   3.21

6. Conclusion

We propose DualGAN, a novel unsupervised dual learning framework for general-purpose image-to-image translation. The unsupervised characteristic of DualGAN enables many real-world applications, as demonstrated in this work, as well as in the concurrent work CycleGAN [26].
Experimental results suggest that the DualGAN mechanism can significantly improve the outputs of GAN for various image-to-image translation tasks. With unlabeled data only, DualGAN can generate comparable or even better outputs than conditional GAN [4], which is trained with labeled data providing image and pixel-level correspondences.

On the other hand, our method is outperformed by conditional GAN or cGAN [4] for certain tasks which involve semantics-based labels. This is due to the lack of pixel and label correspondence information, which cannot be inferred from the target distribution alone. In the future, we intend to investigate whether this limitation can be lifted with the use of a small number of labeled data as a warm start.

Acknowledgment. We thank all the anonymous reviewers for their valuable comments and suggestions. The first author is a PhD student from the Memorial University of Newfoundland and has been visiting SFU since 2016. This work was supported in part by grants from the Natural Sciences and Engineering Research Council (NSERC) of Canada (No. 611370, 2017-06086).

Figure 8: Map→aerial photo translation (columns: input, GT, DualGAN, GAN, cGAN [4]). Without image correspondences for training, DualGAN may map the orange-colored interstate highways to building roofs with bright colors. Nevertheless, the DualGAN results are sharper than those from GAN and cGAN.

Figure 9: Results for aerial photo→map translation (columns: input, GT, DualGAN, GAN, cGAN [4]). DualGAN performs better than GAN, but not as well as cGAN. With additional pixel correspondence information, cGAN performs well in terms of labeling local roads, but still cannot detect interstate highways.

Figure 10: Facades→label translation (columns: input, GT, DualGAN, GAN, cGAN [4]). While cGAN correctly labels various building components such as windows, doors, and balconies, the overall label images are not as detailed and structured as DualGAN's outputs.

Table 3: Segmentation accuracy for the facades→label task. DualGAN outperforms GAN, but is not as accurate as cGAN. Without the image correspondence available to cGAN, even if DualGAN segments a region properly, it may not assign the region a correct label.

               Per-pixel acc.   Per-class acc.   Class IOU
    DualGAN    0.27             0.13             0.06
    cGAN [4]   0.54             0.33             0.19
    GAN        0.22             0.10             0.05

Table 4: Segmentation accuracy for the aerial→map task, for which DualGAN performs less than satisfactorily.

               Per-pixel acc.   Per-class acc.   Class IOU
    DualGAN    0.42             0.22             0.09
    cGAN [4]   0.70             0.46             0.26
    GAN        0.41             0.23             0.09
References

[1] M. Arjovsky, S. Chintala, and L. Bottou. Wasserstein GAN. arXiv preprint arXiv:1701.07875, 2017.
[2] Y. Aytar, L. Castrejon, C. Vondrick, H. Pirsiavash, and A. Torralba. Cross-modal scene networks. CoRR, abs/1610.09003, 2016.
[3] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672-2680, 2014.
[4] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros. Image-to-image translation with conditional adversarial networks. arXiv preprint arXiv:1611.07004, 2016.
[5] P.-Y. Laffont, Z. Ren, X. Tao, C. Qian, and J. Hays. Transient attributes for high-level understanding and editing of outdoor scenes. ACM Transactions on Graphics (TOG), 33(4):149, 2014.
[6] A. B. L. Larsen, S. K. Sønderby, H. Larochelle, and O. Winther. Autoencoding beyond pixels using a learned similarity metric. arXiv preprint arXiv:1512.09300, 2015.
[7] C. Ledig, L. Theis, F. Huszár, J. Caballero, A. Cunningham, A. Acosta, A. Aitken, A. Tejani, J. Totz, Z. Wang, et al. Photo-realistic single image super-resolution using a generative adversarial network. arXiv preprint arXiv:1609.04802, 2016.
[8] C. Li and M. Wand. Precomputed real-time texture synthesis with Markovian generative adversarial networks. In European Conference on Computer Vision (ECCV), pages 702-716. Springer, 2016.
[9] M. Liu, T. Breuel, and J. Kautz. Unsupervised image-to-image translation networks. CoRR, abs/1703.00848, 2017.
[10] M.-Y. Liu and O. Tuzel. Coupled generative adversarial networks. In Advances in Neural Information Processing Systems, pages 469-477, 2016.
[11] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In Computer Vision and Pattern Recognition (CVPR), pages 3431-3440, 2015.
[12] M. Mathieu, C. Couprie, and Y. LeCun. Deep multi-scale video prediction beyond mean square error. arXiv preprint arXiv:1511.05440, 2015.
[13] M. Mirza and S. Osindero. Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784, 2014.
[14] G. Perarnau, J. van de Weijer, B. Raducanu, and J. M. Álvarez. Invertible conditional GANs for image editing. arXiv preprint arXiv:1611.06355, 2016.
[15] S. Reed, Z. Akata, X. Yan, L. Logeswaran, B. Schiele, and H. Lee. Generative adversarial text to image synthesis. In Proceedings of the 33rd International Conference on Machine Learning, volume 3, 2016.
[16] O. Ronneberger, P. Fischer, and T. Brox. U-Net: Convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 234-241. Springer, 2015.
[17] L. Sharan, R. Rosenholtz, and E. Adelson. Material perception: What can you see in a brief glance? Journal of Vision, 9(8):784-784, 2009.
[18] Y. Taigman, A. Polyak, and L. Wolf. Unsupervised cross-domain image generation. arXiv preprint arXiv:1611.02200, 2016.
[19] T. Tieleman and G. Hinton. Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 4(2), 2012.
[20] R. Tyleček and R. Šára. Spatial pattern templates for recognition of objects with regular structure. In German Conference on Pattern Recognition, pages 364-374. Springer, 2013.
[21] X. Wang and A. Gupta. Generative image modeling using style and structure adversarial networks. In European Conference on Computer Vision (ECCV), pages 318-335. Springer, 2016.
[22] X. Wang and X. Tang. Face photo-sketch synthesis and recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 31(11):1955-1967, 2009.
[23] Y. Xia, D. He, T. Qin, L. Wang, N. Yu, T.-Y. Liu, and W.-Y. Ma. Dual learning for machine translation. arXiv preprint arXiv:1611.00179, 2016.
[24] X. Yan, J. Yang, K. Sohn, and H. Lee. Attribute2Image: Conditional image generation from visual attributes. In European Conference on Computer Vision (ECCV), pages 776-791. Springer, 2016.
[25] W. Zhang, X. Wang, and X. Tang. Coupled information-theoretic encoding for face photo-sketch recognition. In Computer Vision and Pattern Recognition (CVPR), pages 513-520. IEEE, 2011.
[26] J. Zhu, T. Park, P. Isola, and A. A. Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. In International Conference on Computer Vision (ICCV), to appear, 2017.