Image-to-Image Translation With Conditional Adversarial Networks
Figure 1: Many problems in image processing, graphics, and vision involve translating an input image into a corresponding output image.
These problems are often treated with application-specific algorithms, even though the setting is always the same: map pixels to pixels.
Conditional adversarial nets are a general-purpose solution that appears to work well on a wide variety of these problems. Here we show
results of the method on several. In each case we use the same architecture and objective, and simply train on different data.
Abstract

We investigate conditional adversarial networks as a general-purpose solution to image-to-image translation problems. These networks not only learn the mapping from input image to output image, but also learn a loss function to train this mapping. This makes it possible to apply the same generic approach to problems that traditionally would require very different loss formulations. We demonstrate that this approach is effective at synthesizing photos from label maps, reconstructing objects from edge maps, and colorizing images, among other tasks. Indeed, since the release of the pix2pix software associated with this paper, a large number of internet users (many of them artists) have posted their own experiments with our system, further demonstrating its wide applicability and ease of adoption without the need for parameter tweaking. As a community, we no longer hand-engineer our mapping functions, and this work suggests we can achieve reasonable results without hand-engineering our loss functions either.

1. Introduction

Many problems in image processing, computer graphics, and computer vision can be posed as "translating" an input image into a corresponding output image. Just as a concept may be expressed in either English or French, a scene may be rendered as an RGB image, a gradient field, an edge map, a semantic label map, etc. In analogy to automatic language translation, we define automatic image-to-image translation as the task of translating one possible representation of a scene into another, given sufficient training data (see Figure 1). Traditionally, each of these tasks has been tackled with separate, special-purpose machinery (e.g., [15, 24, 19, 8, 10, 52, 32, 38, 17, 57, 61]), despite the fact that the setting is always the same: predict pixels from pixels. Our goal in this paper is to develop a common framework for all these problems.

The community has already taken significant steps in this direction, with convolutional neural nets (CNNs) becoming the common workhorse behind a wide variety of image prediction problems. CNNs learn to minimize a loss function – an objective that scores the quality of results – and although the learning process is automatic, a lot of manual effort still goes into designing effective losses. In other words, we still have to tell the CNN what we wish it to minimize. But, just like King Midas, we must be careful what we wish for! If we take a naive approach and ask the CNN to minimize the Euclidean distance between predicted and ground-truth pixels, it will tend to produce blurry results [42, 61]. This is because Euclidean distance is minimized by averaging all plausible outputs, which causes blurring. Coming up with loss functions that force the CNN to do what we really want – e.g., output sharp, realistic images – is an open problem and generally requires expert knowledge.

Figure 2: Training a conditional GAN to map edges→photo. The discriminator, D, learns to classify between fake (synthesized by the generator) and real {edge, photo} tuples. The generator, G, learns to fool the discriminator. Unlike an unconditional GAN, both the generator and discriminator observe the input edge map.

It would be highly desirable if we could instead specify only a high-level goal, like "make the output indistinguishable from reality", and then automatically learn a loss function appropriate for satisfying this goal. Fortunately, this is exactly what is done by the recently proposed Generative Adversarial Networks (GANs) [23, 12, 43, 51, 62]. GANs learn a loss that tries to classify whether the output image is real or fake, while simultaneously training a generative model to minimize this loss. Blurry images will not be tolerated since they look obviously fake. Because GANs learn a loss that adapts to the data, they can be applied to a multitude of tasks that traditionally would require very different kinds of loss functions.

In this paper, we explore GANs in the conditional setting. Just as GANs learn a generative model of data, conditional GANs (cGANs) learn a conditional generative model [23]. This makes cGANs suitable for image-to-image translation tasks, where we condition on an input image and generate a corresponding output image.

GANs have been vigorously studied in the last two years and many of the techniques we explore in this paper have been previously proposed. Nonetheless, earlier papers have focused on specific applications, and it has remained unclear how effective image-conditional GANs can be as a general-purpose solution for image-to-image translation. Our primary contribution is to demonstrate that on a wide variety of problems, conditional GANs produce reasonable results. Our second contribution is to present a simple framework sufficient to achieve good results, and to analyze the effects of several important architectural choices. Code is available at https://github.com/phillipi/pix2pix.

2. Related work

Structured losses for image modeling Image-to-image translation problems are often formulated as per-pixel classification or regression (e.g., [38, 57, 27, 34, 61]). These formulations treat the output space as "unstructured" in the sense that each output pixel is considered conditionally independent from all others given the input image. Conditional GANs instead learn a structured loss. Structured losses penalize the joint configuration of the output. A large body of literature has considered losses of this kind, with methods including conditional random fields [9], the SSIM metric [55], feature matching [14], nonparametric losses [36], the convolutional pseudo-prior [56], and losses based on matching covariance statistics [29]. The conditional GAN is different in that the loss is learned, and can, in theory, penalize any possible structure that differs between output and target.

Conditional GANs We are not the first to apply GANs in the conditional setting. Prior and concurrent works have conditioned GANs on discrete labels [40, 22, 12], text [45], and, indeed, images. The image-conditional models have tackled image prediction from a normal map [54], future frame prediction [39], product photo generation [58], and image generation from sparse annotations [30, 47] (c.f. [46] for an autoregressive approach to the same problem). Several other papers have also used GANs for image-to-image mappings, but only applied the GAN unconditionally, relying on other terms (such as L2 regression) to force the output to be conditioned on the input. These papers have achieved impressive results on inpainting [42], future state prediction [63], image manipulation guided by user constraints [64], style transfer [37], and super-resolution [35]. Each of these methods was tailored for a specific application. Our framework differs in that nothing is application-specific. This makes our setup considerably simpler than most others.

Our method also differs from prior works in several architectural choices for the generator and discriminator. Unlike past work, for our generator we use a "U-Net"-based architecture [49], and for our discriminator we use a convolutional "PatchGAN" classifier, which only penalizes structure at the scale of image patches. A similar PatchGAN architecture was previously proposed in [37], for the purpose of capturing local style statistics. Here we show that this approach is effective on a wider range of problems, and we investigate the effect of changing the patch size.

3. Method

GANs are generative models that learn a mapping from a random noise vector z to an output image y, G : z → y [23]. In contrast, conditional GANs learn a mapping from an observed image x and random noise vector z to y, G : {x, z} → y.
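The generator G is trained against a discriminator D that, unlike in the unconditional setting, also observes the input x (see Figure 2). In this notation, the conditional GAN objective and the combined objective with an L1 reconstruction term (the objective referred to as Eqn. 4 in Section 4.2, with λ weighting the L1 term) take the standard form:

L_cGAN(G, D) = E_{x,y}[log D(x, y)] + E_{x,z}[log(1 − D(x, G(x, z)))],
L_L1(G) = E_{x,y,z}[ ||y − G(x, z)||_1 ],
G* = arg min_G max_D L_cGAN(G, D) + λ L_L1(G).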
Figure 4: Different losses induce different quality of results. Each column shows results trained under a different loss. Please see
https://phillipi.github.io/pix2pix/ for additional examples.
First, we run "real vs fake" perceptual studies on Amazon Mechanical Turk (AMT); we test map generation, aerial photo generation, and image colorization using this approach. Second, we measure whether or not our synthesized cityscapes are realistic enough that an off-the-shelf recognition system can recognize the objects in them. This metric is similar to the "inception score" from [51], the object detection evaluation in [54], and the "semantic interpretability" measures in [61] and [41].

AMT perceptual studies For our AMT experiments, we followed the protocol from [61]: Turkers were presented with a series of trials that pitted a "real" image against a "fake" image generated by our algorithm. On each trial, each image appeared for 1 second, after which the images disappeared and Turkers were given unlimited time to respond as to which was fake. The first 10 images of each session were practice, and Turkers were given feedback on them. No feedback was provided on the 40 trials of the main experiment. Each session tested just one algorithm at a time, and Turkers were not allowed to complete more than one session. ∼50 Turkers evaluated each algorithm. Unlike [61], we did not include vigilance trials. For our colorization experiments, the real and fake images were generated from the same grayscale input. For map↔aerial photo, the real and fake images were not generated from the same input, in order to make the task more difficult and avoid floor-level results. For map↔aerial photo, we trained on 256 × 256 resolution images, but exploited fully-convolutional translation (described above) to test on 512 × 512 images, which were then downsampled and presented to Turkers at 256 × 256 resolution. For colorization, we trained and tested on 256 × 256 resolution images and presented the results to Turkers at this same resolution.

"FCN-score" While quantitative evaluation of generative models is known to be challenging, recent works [51, 54, 61, 41] have tried using pre-trained semantic classifiers to measure the discriminability of the generated stimuli as a pseudo-metric. The intuition is that if the generated images are realistic, classifiers trained on real images will be able to classify the synthesized images correctly as well. To this end, we adopt the popular FCN-8s [38] architecture for semantic segmentation, and train it on the Cityscapes dataset. We then score synthesized photos by the classification accuracy against the labels these photos were synthesized from.

4.2. Analysis of the objective function

Which components of the objective in Eqn. 4 are important? We run ablation studies to isolate the effect of the L1 term and the GAN term, and to compare using a discriminator conditioned on the input (cGAN) against using an unconditional discriminator (labeled as GAN). In this case, the loss does not penalize mismatch between the input and output; it only cares that the output look realistic.
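To make the ablation concrete, below is a minimal PyTorch sketch of one training step under the full L1+cGAN objective. The modules G and D are hypothetical stand-ins (any convolutional generator and discriminator), the discriminator is conditioned by channel-wise concatenation of input and output, and BCEWithLogitsLoss assumes D returns logits; this is an illustration of the objective, not the released pix2pix code.

import torch
import torch.nn as nn

def train_step(G, D, opt_G, opt_D, x, y, lam=100.0):
    # x: input images, y: target images, lam: weight on the L1 term.
    bce = nn.BCEWithLogitsLoss()
    l1 = nn.L1Loss()

    # Discriminator step: classify real {x, y} vs. fake {x, G(x)} pairs.
    fake = G(x)
    d_real = D(torch.cat([x, y], dim=1))
    d_fake = D(torch.cat([x, fake.detach()], dim=1))
    loss_D = 0.5 * (bce(d_real, torch.ones_like(d_real)) +
                    bce(d_fake, torch.zeros_like(d_fake)))
    opt_D.zero_grad(); loss_D.backward(); opt_D.step()

    # Generator step: fool D while staying close to the target in L1.
    d_fake = D(torch.cat([x, fake], dim=1))
    loss_G = bce(d_fake, torch.ones_like(d_fake)) + lam * l1(fake, y)
    opt_G.zero_grad(); loss_G.backward(); opt_G.step()
    return loss_D.item(), loss_G.item()

In this sketch, dropping the lam * l1(...) term gives the cGAN-only variant, and passing D only the output (no concatenation with x) gives the unconditional GAN variant compared in the ablation.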
Figure 6: Patch size variations. Uncertainty in the output manifests itself differently for different loss functions. Uncertain regions become blurry and desaturated under L1. The 1×1 PixelGAN encourages greater color diversity but has no effect on spatial statistics. The 16×16 PatchGAN creates locally sharp results, but also leads to tiling artifacts beyond the scale it can observe. The 70×70 PatchGAN forces outputs that are sharp, even if incorrect, in both the spatial and spectral (colorfulness) dimensions. The full 286×286 ImageGAN produces results that are visually similar to the 70×70 PatchGAN, but somewhat lower quality according to our FCN-score metric (Table 3). Please see https://phillipi.github.io/pix2pix/ for additional examples.

Figure 7: Color distribution matching property of the cGAN, tested on Cityscapes (c.f. Figure 1 of the original GAN paper [23]). Note that the histogram intersection scores are dominated by differences in the high probability region, which are imperceptible in the plots, which show log probability and therefore emphasize differences in the low probability regions.

Histogram intersection against ground truth:

Loss        L      a      b
L1          0.81   0.69   0.70
cGAN        0.87   0.74   0.84
L1+cGAN     0.86   0.84   0.82
PixelGAN    0.83   0.68   0.78

4.4. From PixelGANs to PatchGANs to ImageGANs

We test the effect of varying the patch size N of our discriminator receptive fields, from a 1 × 1 "PixelGAN" to a full 286 × 286 "ImageGAN"¹. Figure 6 shows qualitative results of this analysis and Table 3 quantifies the effects using the FCN-score. Note that elsewhere in this paper, unless specified, all experiments use 70 × 70 PatchGANs, and for this section all experiments use an L1+cGAN loss.

¹ We achieve this variation in patch size by adjusting the depth of the GAN discriminator. Details of this process, and the discriminator architectures, are provided in the supplemental materials online.

The PixelGAN has no effect on spatial sharpness, but does increase the colorfulness of the results (quantified in Figure 7). For example, the bus in Figure 6 is painted gray when the net is trained with an L1 loss, but becomes red with the PixelGAN loss. Color histogram matching is a common problem in image processing [48], and PixelGANs may be a promising lightweight solution.

Using a 16 × 16 PatchGAN is sufficient to promote sharp outputs, and achieves good FCN-scores, but also leads to tiling artifacts. The 70 × 70 PatchGAN alleviates these artifacts and achieves slightly better scores. Scaling beyond this, to the full 286 × 286 ImageGAN, does not appear to improve the visual quality of the results, and in fact gets a considerably lower FCN-score (Table 3). This may be because the ImageGAN has many more parameters and greater depth than the 70 × 70 PatchGAN, and may be harder to train.

Fully-convolutional translation An advantage of the PatchGAN is that a fixed-size patch discriminator can be applied to arbitrarily large images. We may also apply the generator convolutionally, on larger images than those on which it was trained. We test this on the map↔aerial photo task. After training a generator on 256 × 256 images, we test it on 512 × 512 images. The results in Figure 8 demonstrate the effectiveness of this approach.

Table 4: AMT "real vs fake" test on maps↔aerial photos.

Loss        Photo → Map                Map → Photo
            % Turkers labeled real     % Turkers labeled real
L1          2.8% ± 1.0%                0.8% ± 0.3%
L1+cGAN     6.1% ± 1.3%                18.9% ± 2.5%

Table 5: AMT "real vs fake" test on colorization.

Method                      % Turkers labeled real
L2 regression from [61]     16.3% ± 2.4%
Zhang et al. 2016 [61]      27.8% ± 2.7%
Ours                        22.5% ± 1.6%

4.5. Perceptual validation

We validate the perceptual realism of our results on the tasks of map↔aerial photograph and grayscale→color. Results of our AMT experiment for map↔photo are given in Table 4. The aerial photos generated by our method fooled participants on 18.9% of trials, significantly above the L1 baseline, which produces blurry results and nearly never fooled participants. In contrast, in the photo→map direction our method only fooled participants on 6.1% of trials.

Figure 9: Colorization results of conditional GANs versus the L2 regression from [61] and the full method (classification with rebalancing) from [61]. The cGANs can produce compelling colorizations (first two rows), but have a common failure mode of producing a grayscale or desaturated result (last row).

4.6. Semantic segmentation

Conditional GANs appear to be effective on problems where the output is highly detailed or photographic, as is common in graphics tasks. What about vision problems, like semantic segmentation, where the output is instead less complex than the input? To begin to test this, we train a cGAN (with/without L1 loss) on cityscape photo→labels. Figure 10 shows qualitative results, and quantitative classification accuracies are reported in Table 6. Interestingly, cGANs, trained without the L1 loss, are able to solve this problem at a reasonable degree of accuracy. To our knowledge, this is the first demonstration of GANs successfully generating "labels", which are nearly discrete, rather than "images", with their continuous-valued variation. Although cGANs achieve some success, they are far from the best available method for solving this problem: simply using L1 regression gets better scores than using a cGAN, as shown in Table 6. We argue that for vision problems, the goal (i.e., predicting output close to ground truth) may be less ambiguous than graphics tasks, and reconstruction losses like L1 are mostly sufficient.

Figure 10: Applying a conditional GAN to semantic segmentation. The cGAN produces sharp images that look at a glance like the ground truth, but in fact include many small, hallucinated objects.

5. Conclusion

The results in this paper suggest that conditional adversarial networks are a promising approach for many image-to-image translation tasks, especially those involving highly structured graphical outputs. These networks learn a loss adapted to the task and data at hand, which makes them applicable in a wide variety of settings.
Figure 8: Example results on Google Maps at 512 × 512 resolution, in both directions (map → aerial photo and aerial photo → map). The model was trained on images at 256 × 256 resolution and run convolutionally on the larger images at test time. Contrast adjusted for clarity.
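This works because the generator contains only convolutional layers (no fixed-size fully-connected layers), so it accepts inputs of any compatible spatial size. A toy illustration of the idea (the two-layer stand-in below is not the paper's U-Net):

import torch
import torch.nn as nn

toy_G = nn.Sequential(              # stand-in for a fully-convolutional generator
    nn.Conv2d(3, 64, kernel_size=4, stride=2, padding=1),
    nn.ReLU(),
    nn.ConvTranspose2d(64, 3, kernel_size=4, stride=2, padding=1),
    nn.Tanh(),
)

with torch.no_grad():
    print(toy_G(torch.randn(1, 3, 256, 256)).shape)  # torch.Size([1, 3, 256, 256])
    print(toy_G(torch.randn(1, 3, 512, 512)).shape)  # torch.Size([1, 3, 512, 512])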
Figure 11: Example applications developed by the online community based on our pix2pix codebase: #edges2cats [3] by Christopher Hesse, Background removal [6] by Kaihu Chen, Palette generation [5] by Jack Qiao, Sketch → Portrait [7] by Mario Klingemann, Sketch → Pokemon [1] by Bertrand Gondouin, "Do As I Do" pose transfer [2] by Brannon Dorsey, and #fotogenerator by Bosman et al. [4]. Input sketches in two of the examples were drawn by Ivy Tsai and Yann LeCun.
(Note: the map images are resized from the original maps using bilinear interpolation and saved as JPEG images, with some compression artifacts.)
Figure 12: Example results of our method on Cityscapes labels→photo, compared to ground truth.
Figure 13: Example results of our method on facades labels→photo, compared to ground truth.
Figure 14: Example results of our method on day→night, compared to ground truth.
Figure 15: Example results of our method on automatically detected edges→handbags, compared to ground truth.
Figure 16: Example results of our method on automatically detected edges→shoes, compared to ground truth.
Figure 17: Additional results of the edges→photo models applied to human-drawn sketches from [18]. Note that the models were trained
on automatically detected edges, but generalize to human drawings.
Figure 18: Example results on photo inpainting, compared to [42], on the Paris StreetView dataset [13]. This experiment demonstrates that
the U-net architecture can be effective even when the predicted pixels are not geometrically aligned with the information in the input – the
information used to fill in the central hole has to be found in the periphery of these photos.
Figure 19: Example results on translating thermal images to RGB photos, on the dataset from [26].
Figure 20: Example failure cases. Each pair of images shows input on the left and output on the right. These examples are selected as some
of the worst results on our tasks. Common failures include artifacts in regions where the input image is sparse, and difficulty in handling
unusual inputs. Please see https://phillipi.github.io/pix2pix/ for more comprehensive results.
References

[1] Bertrand Gondouin. https://twitter.com/bgondouin/status/818571935529377792. Accessed 2017-04-21.
[2] Brannon Dorsey. https://twitter.com/brannondorsey/status/806283494041223168. Accessed 2017-04-21.
[3] Christopher Hesse. https://affinelayer.com/pixsrv/. Accessed 2017-04-21.
[4] Gerda Bosman, Tom Kenter, Rolf Jagerman, and Daan Gosman. https://dekennisvannu.nl/site/artikel/Help-ons-kunstmatige-intelligentie-testen/9163. Accessed 2017-08-31.
[5] Jack Qiao. http://colormind.io/blog/. Accessed 2017-04-21.
[6] Kaihu Chen. http://www.terraai.org/imageops/index.html. Accessed 2017-04-21.
[7] Mario Klingemann. https://twitter.com/quasimondo/status/826065030944870400. Accessed 2017-04-21.
[8] A. Buades, B. Coll, and J.-M. Morel. A non-local algorithm for image denoising. In CVPR, volume 2, pages 60–65. IEEE, 2005.
[9] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille. Semantic image segmentation with deep convolutional nets and fully connected CRFs. In ICLR, 2015.
[10] T. Chen, M.-M. Cheng, P. Tan, A. Shamir, and S.-M. Hu. Sketch2Photo: internet image montage. ACM Transactions on Graphics (TOG), 28(5):124, 2009.
[11] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele. The Cityscapes dataset for semantic urban scene understanding. In CVPR, 2016.
[12] E. L. Denton, S. Chintala, R. Fergus, et al. Deep generative image models using a Laplacian pyramid of adversarial networks. In NIPS, pages 1486–1494, 2015.
[13] C. Doersch, S. Singh, A. Gupta, J. Sivic, and A. Efros. What makes Paris look like Paris? ACM Transactions on Graphics, 31(4), 2012.
[14] A. Dosovitskiy and T. Brox. Generating images with perceptual similarity metrics based on deep networks. arXiv preprint arXiv:1602.02644, 2016.
[15] A. A. Efros and W. T. Freeman. Image quilting for texture synthesis and transfer. In SIGGRAPH, pages 341–346. ACM, 2001.
[16] A. A. Efros and T. K. Leung. Texture synthesis by non-parametric sampling. In ICCV, volume 2, pages 1033–1038. IEEE, 1999.
[17] D. Eigen and R. Fergus. Predicting depth, surface normals and semantic labels with a common multi-scale convolutional architecture. In Proceedings of the IEEE International Conference on Computer Vision, pages 2650–2658, 2015.
[18] M. Eitz, J. Hays, and M. Alexa. How do humans sketch objects? SIGGRAPH, 31(4):44–1, 2012.
[19] R. Fergus, B. Singh, A. Hertzmann, S. T. Roweis, and W. T. Freeman. Removing camera shake from a single photograph. ACM Transactions on Graphics (TOG), 25(3):787–794, 2006.
[20] L. A. Gatys, A. S. Ecker, and M. Bethge. Texture synthesis and the controlled generation of natural stimuli using convolutional neural networks. arXiv preprint arXiv:1505.07376, 2015.
[21] L. A. Gatys, A. S. Ecker, and M. Bethge. Image style transfer using convolutional neural networks. CVPR, 2016.
[22] J. Gauthier. Conditional generative adversarial nets for convolutional face generation. Class Project for Stanford CS231N: Convolutional Neural Networks for Visual Recognition, Winter semester, 2014(5):2, 2014.
[23] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In NIPS, 2014.
[24] A. Hertzmann, C. E. Jacobs, N. Oliver, B. Curless, and D. H. Salesin. Image analogies. In SIGGRAPH, pages 327–340. ACM, 2001.
[25] G. E. Hinton and R. R. Salakhutdinov. Reducing the dimensionality of data with neural networks. Science, 313(5786):504–507, 2006.
[26] S. Hwang, J. Park, N. Kim, Y. Choi, and I. So Kweon. Multispectral pedestrian detection: Benchmark dataset and baseline. In CVPR, pages 1037–1045, 2015.
[27] S. Iizuka, E. Simo-Serra, and H. Ishikawa. Let there be Color!: joint end-to-end learning of global and local image priors for automatic image colorization with simultaneous classification. ACM Transactions on Graphics (TOG), 35(4), 2016.
[28] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. 2015.
[29] J. Johnson, A. Alahi, and L. Fei-Fei. Perceptual losses for real-time style transfer and super-resolution. 2016.
[30] L. Karacan, Z. Akata, A. Erdem, and E. Erdem. Learning to generate images of outdoor scenes from attributes and semantic layouts. arXiv preprint arXiv:1612.00215, 2016.
[31] D. Kingma and J. Ba. Adam: A method for stochastic optimization. ICLR, 2015.
[32] P.-Y. Laffont, Z. Ren, X. Tao, C. Qian, and J. Hays. Transient attributes for high-level understanding and editing of outdoor scenes. ACM Transactions on Graphics (TOG), 33(4):149, 2014.
[33] A. B. L. Larsen, S. K. Sønderby, and O. Winther. Autoencoding beyond pixels using a learned similarity metric. arXiv preprint arXiv:1512.09300, 2015.
[34] G. Larsson, M. Maire, and G. Shakhnarovich. Learning representations for automatic colorization. ECCV, 2016.
[35] C. Ledig, L. Theis, F. Huszár, J. Caballero, A. Cunningham, A. Acosta, A. Aitken, A. Tejani, J. Totz, Z. Wang, et al. Photo-realistic single image super-resolution using a generative adversarial network. arXiv preprint arXiv:1609.04802, 2016.
[36] C. Li and M. Wand. Combining Markov random fields and convolutional neural networks for image synthesis. CVPR, 2016.
[37] C. Li and M. Wand. Precomputed real-time texture synthesis with Markovian generative adversarial networks. ECCV, 2016.
[38] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In CVPR, pages 3431–3440, 2015.
[39] M. Mathieu, C. Couprie, and Y. LeCun. Deep multi-scale video prediction beyond mean square error. ICLR, 2016.
[40] M. Mirza and S. Osindero. Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784, 2014.
[41] A. Owens, P. Isola, J. McDermott, A. Torralba, E. H. Adelson, and W. T. Freeman. Visually indicated sounds. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2405–2413, 2016.
[42] D. Pathak, P. Krahenbuhl, J. Donahue, T. Darrell, and A. A. Efros. Context encoders: Feature learning by inpainting. CVPR, 2016.
[43] A. Radford, L. Metz, and S. Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434, 2015.
[44] R. Tyleček and R. Šára. Spatial pattern templates for recognition of objects with regular structure. In Proc. GCPR, Saarbrucken, Germany, 2013.
[45] S. Reed, Z. Akata, X. Yan, L. Logeswaran, B. Schiele, and H. Lee. Generative adversarial text to image synthesis. arXiv preprint arXiv:1605.05396, 2016.
[46] S. Reed, A. van den Oord, N. Kalchbrenner, V. Bapst, M. Botvinick, and N. de Freitas. Generating interpretable images with controllable structure. Technical report, 2016.
[47] S. E. Reed, Z. Akata, S. Mohan, S. Tenka, B. Schiele, and H. Lee. Learning what and where to draw. In Advances in Neural Information Processing Systems, pages 217–225, 2016.
[48] E. Reinhard, M. Ashikhmin, B. Gooch, and P. Shirley. Color transfer between images. IEEE Computer Graphics and Applications, 21:34–41, 2001.
[49] O. Ronneberger, P. Fischer, and T. Brox. U-Net: Convolutional networks for biomedical image segmentation. In MICCAI, pages 234–241. Springer, 2015.
[50] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al. ImageNet large scale visual recognition challenge. IJCV, 115(3):211–252, 2015.
[51] T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen. Improved techniques for training GANs. arXiv preprint arXiv:1606.03498, 2016.
[52] Y. Shih, S. Paris, F. Durand, and W. T. Freeman. Data-driven hallucination of different times of day from a single outdoor photo. ACM Transactions on Graphics (TOG), 32(6):200, 2013.
[53] D. Ulyanov, A. Vedaldi, and V. Lempitsky. Instance normalization: The missing ingredient for fast stylization. arXiv preprint arXiv:1607.08022, 2016.
[54] X. Wang and A. Gupta. Generative image modeling using style and structure adversarial networks. ECCV, 2016.
[55] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli. Image quality assessment: from error visibility to structural similarity. IEEE Transactions on Image Processing, 13(4):600–612, 2004.
[56] S. Xie, X. Huang, and Z. Tu. Top-down learning for structured labeling with convolutional pseudoprior. 2015.
[57] S. Xie and Z. Tu. Holistically-nested edge detection. In ICCV, 2015.
[58] D. Yoo, N. Kim, S. Park, A. S. Paek, and I. S. Kweon. Pixel-level domain transfer. ECCV, 2016.
[59] A. Yu and K. Grauman. Fine-grained visual comparisons with local learning. In CVPR, 2014.
[60] A. Yu and K. Grauman. Fine-grained visual comparisons with local learning. In CVPR, pages 192–199, 2014.
[61] R. Zhang, P. Isola, and A. A. Efros. Colorful image colorization. ECCV, 2016.
[62] J. Zhao, M. Mathieu, and Y. LeCun. Energy-based generative adversarial network. arXiv preprint arXiv:1609.03126, 2016.
[63] Y. Zhou and T. L. Berg. Learning temporal transformations from time-lapse videos. In ECCV, 2016.
[64] J.-Y. Zhu, P. Krähenbühl, E. Shechtman, and A. A. Efros. Generative visual manipulation on the natural image manifold. In ECCV, 2016.
6. Appendix

6.1. Network architectures

We adapt our network architectures from those in [43]. Code for the models is available at https://github.com/phillipi/pix2pix.

Let Ck denote a Convolution-BatchNorm-ReLU layer with k filters. CDk denotes a Convolution-BatchNorm-Dropout-ReLU layer with a dropout rate of 50%. All convolutions are 4 × 4 spatial filters applied with stride 2. Convolutions in the encoder, and in the discriminator, downsample by a factor of 2, whereas in the decoder they upsample by a factor of 2.

6.1.1 Generator architectures

The encoder-decoder architecture consists of:
encoder:
C64-C128-C256-C512-C512-C512-C512-C512
decoder:
CD512-CD512-CD512-C512-C256-C128-C64

After the last layer in the decoder, a convolution is applied to map to the number of output channels (3 in general, except in colorization, where it is 2), followed by a Tanh function. As an exception to the above notation, BatchNorm is not applied to the first C64 layer in the encoder. All ReLUs in the encoder are leaky, with slope 0.2, while ReLUs in the decoder are not leaky.

The U-Net architecture is identical except with skip connections between each layer i in the encoder and layer n − i in the decoder, where n is the total number of layers. The skip connections concatenate activations from layer i to layer n − i. This changes the number of channels in the decoder:
U-Net decoder:
CD512-CD1024-CD1024-C1024-C1024-C512-C256-C128

6.1.2 Discriminator architectures

The 70 × 70 discriminator architecture is:
C64-C128-C256-C512

After the last layer, a convolution is applied to map to a 1-dimensional output, followed by a Sigmoid function. As an exception to the above notation, BatchNorm is not applied to the first C64 layer. All ReLUs are leaky, with slope 0.2.

All other discriminators follow the same basic architecture, with depth varied to modify the receptive field size:
1 × 1 discriminator: C64-C128 (note, in this special case, all convolutions are 1 × 1 spatial filters)
16 × 16 discriminator: C64-C128
286 × 286 discriminator: C64-C128-C256-C512-C512-C512

6.2. Training details

Random jitter was applied by resizing the 256 × 256 input images to 286 × 286 and then randomly cropping back to size 256 × 256.

All networks were trained from scratch. Weights were initialized from a Gaussian distribution with mean 0 and standard deviation 0.02.

Cityscapes labels→photo 2975 training images from the Cityscapes training set [11], trained for 200 epochs, with random jitter and mirroring. We used the Cityscapes val set for testing. To compare the U-Net against an encoder-decoder, we used a batch size of 10, whereas for the objective function experiments we used batch size 1. We find that batch size 1 produces better results for the U-Net, but is inappropriate for the encoder-decoder. This is because we apply batchnorm on all layers of our network, and for batch size 1 this zeros the activations on the bottleneck layer. The U-Net is able to skip over the bottleneck, but the encoder-decoder cannot, and so the encoder-decoder requires a batch size greater than 1. Note, an alternative strategy is to remove batchnorm from the bottleneck layer. See errata for more details.

Architectural labels→photo 400 training images from [44], trained for 200 epochs, batch size 1, with random jitter and mirroring. Data was split into train and test randomly.

Maps↔aerial photograph 1096 training images scraped from Google Maps, trained for 200 epochs, batch size 1, with random jitter and mirroring. Images were sampled from in and around New York City. Data was then split into train and test about the median latitude of the sampling region (with a buffer region added to ensure that no training pixel appeared in the test set).

BW→color 1.2 million training images (ImageNet training set [50]), trained for ∼6 epochs, batch size 4, with only mirroring, no random jitter. Tested on a subset of the ImageNet val set, following the protocol of [61] and [34].

Edges→shoes 50k training images from the UT Zappos50K dataset [60], trained for 15 epochs, batch size 4. Data was split into train and test randomly.

Edges→handbags 137K Amazon handbag images from [64], trained for 15 epochs, batch size 4. Data was split into train and test randomly.

Day→night 17823 training images extracted from 91 webcams, from [32], trained for 17 epochs, batch size 4, with random jitter and mirroring. We use 91 webcams as training, and 10 webcams for test.

Thermal→color photos 36609 training images from set 00–05 of [26], trained for 10 epochs, batch size 4. Images from set 06–11 are used for testing.

Photo with missing pixels→inpainted photo 14900 training images from [13], trained for 25 epochs, batch size 4, and tested on 100 held out images following the split of [42].
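As a concrete illustration of the Ck notation in Section 6.1.2, below is a minimal PyTorch sketch of the 70 × 70 conditional discriminator (C64-C128-C256-C512 on a concatenated 3+3-channel input/output pair), with the Gaussian(0, 0.02) weight initialization of Section 6.2. One detail not spelled out in the text above: in the public pix2pix implementation the last C512 block and the output convolution use stride 1, which is what yields the 70 × 70 receptive field. This is an illustrative sketch, not the reference code.

import torch
import torch.nn as nn

def C(in_ch, out_ch, stride=2, norm=True):
    # Ck block: 4x4 convolution, optional BatchNorm, LeakyReLU(0.2).
    layers = [nn.Conv2d(in_ch, out_ch, kernel_size=4, stride=stride, padding=1)]
    if norm:
        layers.append(nn.BatchNorm2d(out_ch))
    layers.append(nn.LeakyReLU(0.2, inplace=True))
    return layers

patch_discriminator = nn.Sequential(
    *C(6, 64, norm=False),    # no BatchNorm on the first C64 layer
    *C(64, 128),
    *C(128, 256),
    *C(256, 512, stride=1),   # stride 1 in the public implementation
    nn.Conv2d(512, 1, kernel_size=4, stride=1, padding=1),  # map to 1-dim output
    nn.Sigmoid(),
)

def init_weights(m):
    # Gaussian init with mean 0 and std 0.02, as described in Section 6.2.
    if isinstance(m, (nn.Conv2d, nn.ConvTranspose2d)):
        nn.init.normal_(m.weight, 0.0, 0.02)

patch_discriminator.apply(init_weights)
x = torch.randn(1, 6, 256, 256)                 # concatenated {input, output} pair
print(patch_discriminator(x).shape)             # torch.Size([1, 1, 30, 30]): one score per 70x70 patch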
6.3. Errata
For all experiments reported in this paper with batch
size 1, the activations of the bottleneck layer are zeroed by
the batchnorm operation, effectively making the innermost
layer skipped. This can be fixed by removing batchnorm
from this layer, as has been done in the public code. We ob-
serve little difference with this change and therefore leave
the experiments as is in the paper.
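A quick sketch of the effect described above (illustrative, not from the released code): with batch size 1 and a 1 × 1 spatial bottleneck, BatchNorm normalizes each channel against its own single value, so the normalized activation is identically zero regardless of the input.

import torch
import torch.nn as nn

bn = nn.BatchNorm2d(512)         # affine params default to weight=1, bias=0
bn.train()                       # training mode: normalize with batch statistics
x = torch.randn(1, 512, 1, 1)    # batch size 1, 1x1 bottleneck activations
print(bn(x).abs().max())         # ~0: the bottleneck output carries no information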
6.4. Change log
arXiv v2 Reran generator architecture comparisons
(Section 4.3) with batch size equal to 10 rather than
1, so that bottleneck layer is not zeroed (see Errata).
Reran FCN-scores with minor details cleaned up (results saved losslessly as PNGs, removed unnecessary downsampling). FCN-scores computed using scripts at https://github.com/phillipi/pix2pix/tree/master/scripts/eval_cityscapes, commit d7e7b8b. Updated several figures and text. Added additional results on thermal→color photos and inpainting, as well as community contributions.