Esser et al.: A Variational U-Net for Conditional Appearance and Shape Generation (CVPR 2018)
…ducing the image. Therefore, they can easily add facial hair or glasses to a face, as this amounts to recoloring image areas. Contrast this with a person moving their arm, which would be represented as coloring the arm at the old position with background color and turning the background at the new position into an arm. What we are lacking is a generative model that can move and deform objects, not only blend their color.

Therefore, we seek to model both appearance and shape, and their interplay, when generating images. For general applicability, we want to be able to learn from mere still-image datasets, with no need for a series of images of the same object instance showing different articulations. We propose a conditional U-Net [30] architecture for mapping from shape to the target image and condition on a latent representation of a variational autoencoder for appearance. To disentangle shape and appearance, we make use of easily available information related to shape, such as edges or automatic estimates of body joint locations. Our approach then enables conditional image generation and transfer: to synthesize different geometrical layouts or change the appearance of an object, either shape or appearance can be retained from a query image, whereas the other component can be freely altered or even imputed from other images. Moreover, the model also allows sampling from the appearance distribution without altering the shape.
2. Related work

In the context of deep learning, three different approaches to image generation can be identified: Generative Adversarial Networks [10], Autoregressive (AR) models [39], and Variational Autoencoders (VAE) [16].

Our method provides control over both appearance and shape. In contrast, many previous methods can control the generative process only with respect to appearance. [15, 26, 38] utilize class labels, [42] attributes, and [44, 52] textual descriptions to control the appearance.

Control over shape has been obtained mainly in the image-to-image translation framework. [12] uses a discriminator to obtain realistic outputs, but their method is limited to the synthesis of a single, uncontrollable appearance. To obtain a larger variety of appearances, [18] first generates a segmentation mask of fashion articles and then synthesizes an image. This leads to larger variations in appearance but does not allow changing the pose of a given appearance.

[7] also uses segmentation masks to produce images, in the context of street scenes. They do not rely on adversarial training but directly learn a multimodal distribution for each segmentation label. The number of appearances that can be produced is given by the number of combinations of modes, resulting in very coarse modeling of appearance. In contrast, our method makes no assumption that the data can be well represented by a limited number of modes, does not require segmentation masks, and includes an inference mechanism for appearance.

[28] utilizes the GAN framework and [29] the autoregressive framework to provide control over shape and appearance. However, the appearance is specified by very coarse text descriptions. Furthermore, both methods have problems producing the desired shape consistently.

In contrast to our generative approach, [4, 3] have pursued unsupervised learning of human posture similarity for retrieval in still images and [25, 5] in videos. Rendering images of persons in different poses has been considered by [46] for a fixed, discrete set of target poses, and by [24] for general poses. In the latter, the authors use a two-stage model. The first stage implements pixelwise regression to a target image from a conditional image and the pose of the target image. Thus the method is fully supervised and requires labeled examples of the same appearance in different poses. As the result of the first stage is in most cases too blurry, they use a second stage that employs adversarial training to produce more realistic images. Our method is never directly trained on the transfer task and therefore does not require such specific datasets. Instead, we carefully model the separation between shape and appearance and, as a result, obtain an explicit representation of the appearance which can be combined with new poses.

3. Approach

Let x be an image of an object from a dataset X. We want to understand how images are influenced by two essential characteristics of the objects they depict: their shape y and appearance z. Although the precise semantics of y can vary, we assume it characterizes geometrical information, particularly location, shape, and pose. z then represents the intrinsic appearance characteristics.

If y and z capture all variations of interest, the variance of a probabilistic model for images conditioned on those two variables is only due to noise. Hence, the maximum a posteriori estimate arg max_x p(x|y, z) serves as an image generator controlled by y and z. How can we model this generator?

3.1. Variational Autoencoder based on latent shape and appearance

If y and z are both latent variables, a popular way of learning the generator p(x|y, z) is to use a VAE. To learn p(x|y, z) we need to maximize the log-likelihood of observed data x and marginalize out the latent variables y and z. To avoid the intractable integral, one introduces an approximate posterior q(y, z|x) to obtain the evidence lower bound (ELBO) from Jensen's inequality,
\log p(x) = \log \int p(x, y, z)\, dz\, dy
          = \log \int q(y, z \mid x)\, \frac{p(x, y, z)}{q(y, z \mid x)}\, dz\, dy
          \ge \mathbb{E}_q \log \frac{p(x \mid y, z)\, p(y, z)}{q(y, z \mid x)}.    (1)
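To make the bound concrete, the following is a minimal sketch of a single-sample Monte Carlo estimate of this ELBO, assuming a diagonal Gaussian approximate posterior; it is not the paper's implementation, and `encoder`, `decoder_loglik`, and `prior_logpdf` are hypothetical placeholders used only for illustration.

```python
import math
import torch

def elbo_estimate(x, encoder, decoder_loglik, prior_logpdf):
    """Single-sample Monte Carlo estimate of the ELBO in Eq. (1).

    encoder(x)     -> (mu, logvar) of a diagonal Gaussian q(y, z | x)
    decoder_loglik -> callable returning log p(x | y, z)   (hypothetical)
    prior_logpdf   -> callable returning log p(y, z), e.g. a standard normal
    """
    mu, logvar = encoder(x)
    std = torch.exp(0.5 * logvar)
    latent = mu + std * torch.randn_like(std)      # reparametrization trick [16]
    # log-density of the sample under q(y, z | x)
    log_q = (-0.5 * ((latent - mu) / std) ** 2
             - 0.5 * logvar - 0.5 * math.log(2 * math.pi)).sum(dim=-1)
    # ELBO = E_q[ log p(x | y, z) + log p(y, z) - log q(y, z | x) ]
    return decoder_loglik(x, latent) + prior_logpdf(latent) - log_q
```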
As one can see, Eq. 1 contains the prior p(y, z), which is assumed to be a standard normal distribution in the VAE framework. With this joint prior we cannot guarantee that the two variables y and z are separated in the latent space. Thus, our overall goal of separately altering shape and appearance cannot be met. A standard normal prior can model z, but it is not suited to describe the spatial information contained in y, which is localized and easily gets lost in the bottleneck. Therefore, we need additional information to disentangle y and z when learning the generator p(x|y, z).

3.2. Conditional Variational Autoencoder with appearance

In the previous section we have shown that a standard VAE with two latent variables is not suitable for learning disentangled representations of y and z. Instead, we assume that we have an estimator function e for the variable y, i.e., ŷ = e(x). For example, e could provide information on shape by extracting edges or by automatically estimating body joint locations [6, 41]. Following up on Eq. 1, the task is now to infer the latent variable z from the image and the estimate ŷ = e(x) by maximizing their conditional log-likelihood,

\log p(x \mid \hat{y}) = \log \int p(x, z \mid \hat{y})\, dz
                       \ge \mathbb{E}_q \log \frac{p(x, z \mid \hat{y})}{q(z \mid x, \hat{y})}
                       = \mathbb{E}_q \log \frac{p(x \mid \hat{y}, z)\, p(z \mid \hat{y})}{q(z \mid x, \hat{y})}.    (2)

Compared to Eq. 1, the ELBO in Eq. 2 now depends on the (conditional) prior p(z|ŷ). This distribution can be estimated from the training data and captures potential interrelations between shape and appearance. For instance, a person jumping is less likely to wear a dinner jacket than a T-shirt.

Following [31], we model p(x|ŷ, z) as a parametric Laplace distribution and q(z|x, ŷ) as a parametric Gaussian distribution. The parameters of these distributions are estimated by two neural networks, Gθ and Fφ, respectively. Using the reparametrization trick [16], these networks can be trained end-to-end using standard gradient descent. The loss function for training follows directly from Eq. 2 and has the form

L(x, \theta, \phi) = -\mathrm{KL}\big(q_\phi(z \mid x, \hat{y}) \,\|\, p_\theta(z \mid \hat{y})\big)
                     + \mathbb{E}_{q_\phi(z \mid x, \hat{y})}\big[\log p_\theta(x \mid \hat{y}, z)\big],    (3)

where KL denotes the Kullback-Leibler divergence. The next section derives the network architecture we use for modeling Gθ and Fφ.

Figure 2: Our conditional U-Net combined with a variational autoencoder. x: query image, ŷ: shape estimate, z: appearance.

3.3. Generator

Let us first establish a network Gθ which estimates the parameters of the distribution p(x|ŷ, z). We assume further, as is common practice [16], that the distribution p(x|ŷ, z) has constant standard deviation and that Gθ(ŷ, z) is deterministic. As a consequence, the network Gθ(ŷ, z) can be considered an image generator network, and we can replace the second term in Eq. 3 with the reconstruction loss ‖x − Gθ(ŷ, z)‖1:

L(x, \theta, \phi) = -\mathrm{KL}\big(q_\phi(z \mid x, \hat{y}) \,\|\, p_\theta(z \mid \hat{y})\big)
                     + \|x - G_\theta(\hat{y}, z)\|_1.    (4)

It is well known that pixelwise statistics of images, such as the L1-norm here, do not model the perceptual quality of images well [17]. Instead, we adopt the perceptual loss from [7] and formulate the final loss function as

L(x, \theta, \phi) = -\mathrm{KL}\big(q_\phi(z \mid x, \hat{y}) \,\|\, p_\theta(z \mid \hat{y})\big)
                     + \sum_k \lambda_k \,\|\Phi_k(x) - \Phi_k(G_\theta(\hat{y}, z))\|_1,    (5)

where Φ is a network for measuring perceptual similarity (in our case VGG19 [37]) and the λk are hyper-parameters that control the contribution of the different layers of Φ to the total loss.
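As a concrete illustration of Eqs. 3–5, the sketch below computes the KL term between the Gaussian posterior predicted from (x, ŷ) and the Gaussian prior predicted from ŷ, plus a VGG-based perceptual term, written here as a quantity to be minimized. It is a simplified reading of the loss, not the released implementation: `F_phi`, `prior_net`, and `generator` are stand-in modules, and the choice of VGG layers and weights λk is an assumption.

```python
import torch
from torchvision.models import vgg19

# Hypothetical modules: F_phi predicts q(z|x, y_hat), prior_net predicts p(z|y_hat),
# generator implements G_theta(y_hat, z). Only the loss structure of Eqs. 3-5 is shown.

_vgg_features = vgg19(weights="IMAGENET1K_V1").features.eval()

def vgg_layers(img, layer_ids=(3, 8, 17, 26)):         # layer choice is an assumption
    feats, h = [], img
    for i, layer in enumerate(_vgg_features):
        h = layer(h)
        if i in layer_ids:
            feats.append(h)
    return feats

def kl_diag_gaussians(mu_q, logvar_q, mu_p, logvar_p):
    # KL( N(mu_q, var_q) || N(mu_p, var_p) ) for diagonal Gaussians
    return 0.5 * (logvar_p - logvar_q
                  + (logvar_q.exp() + (mu_q - mu_p) ** 2) / logvar_p.exp()
                  - 1.0).sum(dim=-1)

def vunet_loss(x, y_hat, F_phi, prior_net, generator, lambdas=None):
    mu_q, logvar_q = F_phi(x, y_hat)                   # q(z | x, y_hat)
    mu_p, logvar_p = prior_net(y_hat)                  # p(z | y_hat), learned prior
    z = mu_q + torch.randn_like(mu_q) * (0.5 * logvar_q).exp()   # reparametrization
    x_rec = generator(y_hat, z)                        # G_theta(y_hat, z)
    feats_x, feats_r = vgg_layers(x), vgg_layers(x_rec)
    if lambdas is None:
        lambdas = [1.0 / f.numel() for f in feats_x]   # reciprocal of #elements (cf. Sec. 4)
    perceptual = sum(l * (a - b).abs().sum() for l, a, b in zip(lambdas, feats_x, feats_r))
    return kl_diag_gaussians(mu_q, logvar_q, mu_p, logvar_p).mean() + perceptual
```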
Figure 3: Generating images with only the edge image as input (the GT image, shown on the left, is held back). We compare our approach to pix2pix [12] on the shoes [43] and handbags [49] datasets. Columns: GT, pix2pix [12], ours (reconstruction), ours (random samples). On the right: sampling from our latent appearance distribution.
If we forget for a moment about z, the task of the network Gθ(ŷ) is to generate an image x̄ given the estimate ŷ of the shape information of an image x. Here it is crucial that we preserve the spatial information given by ŷ in the output image x̄. Therefore, we represent ŷ in the form of an image of the same size as x. Depending on the estimator e: e(x) = ŷ, this is easy to achieve. For example, estimated joints of a human body can be used to draw a stickman for this person. Given such an image representation of ŷ, we require that each keypoint of ŷ is used to estimate x̄. A U-Net architecture [30] would be the most appropriate choice in this case, as its skip-connections help to propagate the information directly from input to output. In our case, however, the generator Gθ(ŷ, z) should learn about images by also conditioning on z.

The appearance z is sampled from the Gaussian distribution q(z|x, ŷ), whose parameters are estimated by the encoder network Fφ. Its optimization requires balancing two terms. It has to encode enough information about x into z such that p(x|ŷ, z) can describe the data well, as measured by the reconstruction loss in (4). At the same time, we penalize a deviation from the prior p(z|ŷ) by minimizing the Kullback-Leibler divergence between q(z|x, ŷ) and p(z|ŷ). The design of the generator Gθ as a U-Net already guarantees the preservation of spatial information in the output image. Therefore, any additional information about the shape encoded in z, which is not already contained in the prior, incurs a cost without providing new information on the likelihood p(x|ŷ, z). Thus, an optimal encoder Fφ must be invariant to shape. In this case it suffices to include z at the bottleneck of the generator Gθ.

More formally, let our U-Net-like generator Gθ(ŷ) consist of two parts: an encoder Eθ and a decoder Dθ (see Fig. 2). We concatenate the inferred appearance representation z with the bottleneck representation of Gθ, γ = [Eθ(ŷ), z], and let the decoder Dθ(γ) generate an image from it. Concatenating the shape and appearance features keeps the gradients for training the respective encoders Fφ and Eθ well separated, while the decoder Dθ can learn to combine those representations for an optimal synthesis. Together, Eθ and Dθ build a U-Net-like network, which guarantees optimal transfer of spatial information from input to output images. On the other hand, Fφ, when put together with Dθ, forms a VAE that allows appearance inference. The prior p(z|ŷ) is estimated by Eθ just before z is concatenated into its representation. We train all three networks jointly by optimizing the objective in Eq. 5.
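The following is a minimal structural sketch of this design, not the released architecture: an encoder–decoder with skip connections in which the appearance code z is broadcast and concatenated at the bottleneck. Channel sizes, depth, and the toy building blocks are illustrative assumptions only.

```python
import torch
import torch.nn as nn

class ToyVUNetGenerator(nn.Module):
    """Illustrative G_theta: a U-Net over y_hat with z concatenated at the bottleneck."""

    def __init__(self, in_ch=3, base=32, z_dim=64, depth=3):
        super().__init__()
        self.enc = nn.ModuleList()
        ch = in_ch
        for i in range(depth):                       # E_theta: downsampling path
            out = base * 2 ** i
            self.enc.append(nn.Sequential(
                nn.Conv2d(ch, out, 3, stride=2, padding=1), nn.ReLU()))
            ch = out
        self.fuse = nn.Conv2d(ch + z_dim, ch, 1)     # merge gamma = [E_theta(y_hat), z]
        self.dec = nn.ModuleList()
        for i in reversed(range(depth)):             # D_theta: upsampling path
            out = base * 2 ** max(i - 1, 0) if i > 0 else base
            # input channels double because of the U-Net skip connection
            self.dec.append(nn.Sequential(
                nn.ConvTranspose2d(2 * ch, out, 4, stride=2, padding=1), nn.ReLU()))
            ch = out
        self.out = nn.Conv2d(ch, 3, 3, padding=1)

    def forward(self, y_hat, z):
        skips, h = [], y_hat
        for block in self.enc:
            h = block(h)
            skips.append(h)
        # broadcast z spatially and concatenate it at the bottleneck
        z_map = z[:, :, None, None].expand(-1, -1, h.shape[2], h.shape[3])
        h = self.fuse(torch.cat([h, z_map], dim=1))
        for block in self.dec:
            h = block(torch.cat([h, skips.pop()], dim=1))
        return self.out(h)
```

At training time z would come from Fφ; at test time it can be drawn from the learned prior p(z|ŷ), matching the inference/sampling split described above.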
4. Experiments

We now demonstrate the advantages of the proposed method by showing results of image generation on various datasets with different shape estimates ŷ. In addition to visual comparisons with other methods, all results are supported by numerical experiments. Code and additional experiments can be found at https://compvis.github.io/vunet.

Datasets To compare with other methods, we evaluate on: shoes [43], handbags [49], Market-1501 [47], DeepFashion [21, 23], and COCO [20]. As baselines for our subsequent comparisons we use the state-of-the-art pix2pix model [12] and PG2 [24]. To the best of our knowledge, PG2 is the only other approach able to transfer one person to the pose of another. We show that we improve upon this method and do not require specific datasets for training. As for pix2pix, it is the most general image-to-image translation model and can work with different shape estimates. Where applicable we compare directly to the quantitative and qualitative results provided by the authors of the mentioned papers. As [12] does not perform experiments on Market-1501, DeepFashion, and COCO, we train their model on these datasets using their published code [50].
method | Market-1501 IS (mean / std) | Market-1501 SSIM (mean / std) | DeepFashion IS (mean / std) | DeepFashion SSIM (mean / std)
real data | 3.678 / 0.274 | 1.000 / 0.000 | 3.415 / 0.399 | 1.000 / 0.000
PG2 G1-poseMaskedLoss | 3.326 / − | 0.340 / − | 2.668 / − | 0.779 / −
PG2 G1+D | 3.490 / − | 0.283 / − | 3.091 / − | 0.761 / −
PG2 G1+G2+D | 3.460 / − | 0.253 / − | 3.090 / − | 0.762 / −
pix2pix | 2.289 / 0.0489 | 0.166 / 0.060 | 2.640 / 0.2171 | 0.646 / 0.067
ours | 3.214 / 0.119 | 0.353 / 0.097 | 3.087 / 0.2394 | 0.786 / 0.068

Table 1: Inception Scores (IS) and Structural Similarities (SSIM) of reconstructed test images on the Market-1501 and DeepFashion datasets. Our method outperforms both pix2pix [12] and PG2 [24] in terms of SSIM. In terms of IS, the proposed method performs better than pix2pix and obtains results comparable to PG2.
Figure 4: Generating images based only on the stickman as input (the GT image is held back). We compare our approach with pix2pix [12] on the DeepFashion and Market-1501 datasets. On the right: sampling from our latent appearance distribution.
Shape estimate In the following experiments we work with two kinds of shape estimates: edge images and, in the case of humans, automatically regressed body joint positions. We use the edge images extracted with the HED algorithm [41] by the authors of [12]. Following [24], we apply the current state-of-the-art real-time multi-person pose estimator [6] for body joint regression.
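Since the shape estimate has to be fed to Gθ as an image of the same size as x (Sec. 3.3), a simple way to realize this for body joints is to rasterize a stickman. The sketch below is a hypothetical rasterizer for illustration only: the joint indices, limb connectivity, and colors are assumptions, not the paper's exact encoding.

```python
import numpy as np
from PIL import Image, ImageDraw

# Hypothetical limb connectivity over 2D joints (x, y); indices are illustrative only.
LIMBS = [(0, 1), (1, 2), (2, 3), (1, 4), (4, 5), (1, 6), (6, 7), (7, 8), (6, 9), (9, 10)]

def draw_stickman(joints, size=(256, 256), width=4):
    """Rasterize estimated joints into a stickman image of the same size as x.

    joints: array of shape (num_joints, 2) with pixel coordinates, NaN if missing.
    """
    canvas = Image.new("RGB", size, (0, 0, 0))
    draw = ImageDraw.Draw(canvas)
    for a, b in LIMBS:
        if a < len(joints) and b < len(joints):
            pa, pb = joints[a], joints[b]
            if not (np.any(np.isnan(pa)) or np.any(np.isnan(pb))):
                draw.line([tuple(pa), tuple(pb)], fill=(255, 255, 255), width=width)
    return np.asarray(canvas)
```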
Network architecture The generator Gθ is implemented as a U-Net architecture with 2n residual blocks [11]: n blocks in the encoder part Eθ and n symmetric blocks in the decoder part Dθ. Additional skip-connections link each block in Eθ to the corresponding block in Dθ and guarantee direct information flow from input to output. Empirically, we set the parameter n = 7, which worked well for all considered datasets. Each residual block follows the architecture proposed in [11] without batch normalization. We use strided convolution with stride 2 after each residual block to downsample the input until a bottleneck layer. In the decoder Dθ we utilize subpixel convolution [36] to perform the up-sampling between two consecutive residual blocks. All convolutional layers consist of 3 × 3 filters. The encoder Fφ follows the same architecture as the encoder Eθ.
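The following sketch shows what such a building block could look like: a batchnorm-free residual block of 3×3 convolutions, a stride-2 convolution for downsampling, and subpixel (pixel-shuffle) upsampling on the decoder side. Exact channel widths, activations, and normalization details are assumptions; the paper only fixes the ingredients listed above.

```python
import torch.nn as nn

class ResidualBlock(nn.Module):
    """3x3 residual block without batch normalization (cf. [11])."""
    def __init__(self, ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.ReLU(), nn.Conv2d(ch, ch, 3, padding=1),
            nn.ReLU(), nn.Conv2d(ch, ch, 3, padding=1))

    def forward(self, x):
        return x + self.body(x)

def down_stage(ch_in, ch_out):
    # residual block followed by a stride-2 convolution for downsampling
    return nn.Sequential(ResidualBlock(ch_in),
                         nn.Conv2d(ch_in, ch_out, 3, stride=2, padding=1))

def up_stage(ch_in, ch_out):
    # subpixel (pixel-shuffle) upsampling [36] between two residual blocks
    return nn.Sequential(ResidualBlock(ch_in),
                         nn.Conv2d(ch_in, 4 * ch_out, 3, padding=1),
                         nn.PixelShuffle(2))
```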
We train our model separately for each dataset using the Adam [14] optimizer with parameters β1 = 0.5 and β2 = 0.9 for 100K iterations. The initial learning rate is set to 0.001 and linearly decreases to 0 during training. We utilize weight normalization and data-dependent initialization of weights as described in [35]. Each λk is set to the reciprocal of the total number of elements in layer k.

In-plane normalization In some difficult cases, e.g. for datasets with high shape variability, it is difficult to perform appearance transfer from one object to another when there are no part correspondences between them. This problem is especially pronounced when generating human beings. To cope with it, we propose an additional in-plane normalization that utilizes the information provided by the shape estimate ŷ. In our case ŷ is given by the positions of body joints, which we use to crop out areas around body limbs. This results in 8 image crops that we stack together and feed to the encoder Fφ instead of x. If some limbs are missing (e.g. due to occlusions) we use a black image instead of the corresponding crop.
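A rough sketch of this normalization step follows, under the assumption that each limb crop is an axis-aligned box around a pair of joints resized to a fixed resolution; the actual cropping geometry (oriented boxes, margins, crop size) and the limb pairing are guesses, not the paper's specification.

```python
import numpy as np

# Hypothetical limb definition: 8 joint pairs -> 8 crops, matching the count in the text.
LIMB_PAIRS = [(1, 2), (2, 3), (4, 5), (5, 6), (7, 8), (8, 9), (10, 11), (11, 12)]

def limb_crops(img, joints, crop_hw=(64, 64), margin=12):
    """Stack 8 limb crops as input to F_phi; missing limbs become black crops."""
    H, W, _ = img.shape
    crops = []
    for a, b in LIMB_PAIRS:
        pa, pb = joints[a], joints[b]
        if np.any(np.isnan(pa)) or np.any(np.isnan(pb)):
            crops.append(np.zeros((*crop_hw, 3), dtype=img.dtype))   # occluded limb
            continue
        x0 = int(max(min(pa[0], pb[0]) - margin, 0)); x1 = int(min(max(pa[0], pb[0]) + margin, W))
        y0 = int(max(min(pa[1], pb[1]) - margin, 0)); y1 = int(min(max(pa[1], pb[1]) + margin, H))
        patch = img[y0:y1, x0:x1]
        if patch.size == 0:
            crops.append(np.zeros((*crop_hw, 3), dtype=img.dtype))
            continue
        # nearest-neighbour resize to a fixed crop size (illustrative only)
        ys = np.linspace(0, patch.shape[0] - 1, crop_hw[0]).astype(int)
        xs = np.linspace(0, patch.shape[1] - 1, crop_hw[1]).astype(int)
        crops.append(patch[ys][:, xs])
    return np.concatenate(crops, axis=-1)              # stacked along the channel axis
```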
Let us now investigate the proposed model for conditional image generation based on three tasks: 1) reconstruction of an image x given its shape estimate ŷ and original appearance z; 2) conditional image generation based on a given shape estimate ŷ; 3) conditional image generation from arbitrary combinations of ŷ and z.

4.1. Image reconstruction

Given a query image x and its shape estimate ŷ, we can use the network Fφ to infer the appearance of the image x. Namely, we take the mean of the distribution q(z|x, ŷ) predicted by Fφ from the single image x as its original appearance z. Using these z and ŷ, we can ask our generator Gθ to reconstruct x from its two components.

We show examples of images reconstructed by our method in Figs. 3 and 4. Additionally, we follow the experiment in [24] and calculate Structural Similarities (SSIM) [40] and Inception Scores (IS) [34] for the reconstructions of the test images of the Market-1501 and DeepFashion datasets (see Table 1). Our method outperforms both pix2pix [12] and PG2 [24] in terms of SSIM score. Note that SSIM compares the reconstructions directly against the original images. As our method differs from both by generating images conditioned on shape and appearance, this underlines the benefit of this conditional representation for image generation. In contrast to SSIM, the Inception Score is measured on the set of reconstructed images independently of the original images. In terms of IS we achieve results comparable to [24] and improve on [12].
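In code, the reconstruction experiment amounts to using the posterior mean as the appearance code; a hedged sketch with the same hypothetical module interfaces as above:

```python
import torch

@torch.no_grad()
def reconstruct(x, y_hat, F_phi, generator):
    """Reconstruct x from (y_hat, z), where z is the mean of q(z | x, y_hat)."""
    mu_q, _ = F_phi(x, y_hat)          # take the posterior mean as the 'original appearance'
    return generator(y_hat, mu_q)      # G_theta(y_hat, z)
```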
4.2. Appearance sampling

An important advantage of our model compared to [12] and [24] is its ability to generate multiple new images conditioned only on the estimate of an object's shape ŷ. This is achieved by randomly sampling z from the learned prior p(z|ŷ) instead of inferring it directly from an image x. Thus, appearance can be explored while keeping the shape fixed.
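Sampling new appearances for a fixed shape then only requires the learned prior; again a sketch with hypothetical module names:

```python
import torch

@torch.no_grad()
def sample_appearances(y_hat, prior_net, generator, num_samples=4):
    """Generate several images for one shape estimate by sampling z ~ p(z | y_hat)."""
    mu_p, logvar_p = prior_net(y_hat)
    outputs = []
    for _ in range(num_samples):
        z = mu_p + torch.randn_like(mu_p) * (0.5 * logvar_p).exp()
        outputs.append(generator(y_hat, z))
    return torch.stack(outputs, dim=1)   # (batch, num_samples, C, H, W)
```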
Edges-to-images We compare our method to pix2pix by generating images from edge images of shoes and handbags. The results can be seen in Fig. 3. As noted by the authors of [12], the outputs of pix2pix show only marginal diversity at test time, thus looking almost identical. To save space, we therefore present only one of them. In contrast, our model generates high-quality images with large diversity. We also observe that our model generalizes better to sketchy drawings made by humans [9] (see Fig. 5). Due to their higher abstraction level, sketches are quite different from the edges extracted from real images in the previous experiment. On this challenging task our model shows higher coherence to the input edge image as well as fewer artifacts, such as at the carrying strap of the backpack.

Stickman-to-person Here we evaluate our model on the task of learning plausible appearances for rendering human beings. Given a ŷ, we sample z and infer x. We compare our results with the ones achieved by pix2pix on the Market-1501 and DeepFashion datasets (see Fig. 4). Due to the marginal diversity in the output of pix2pix, we again only show one sample per row. We observe that our model has learned a significantly more natural latent representation of the distribution of appearance. It also preserves the spatial layout of the human figure better. We verify this observation by re-estimating joint positions from the test images generated by each method on all three datasets. For this we apply the same algorithm we used to estimate the positions of body joints initially, namely [6] with parameters kept fixed. We report the mean L2-error of the positions of detected joints in Table 2. Our approach shows a significantly lower re-localization error, thus demonstrating that body pose has been favorably retained.
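The re-localization error can be computed roughly as follows, assuming a hypothetical `estimate_joints` function wrapping the pose estimator of [6]; how undetected joints are handled is an assumption on our part.

```python
import numpy as np

def joint_relocalization_error(generated_images, target_joints, estimate_joints):
    """Mean L2 distance (in pixels) between the target joints and the joints
    re-detected in the generated images.

    estimate_joints(img) -> (num_joints, 2) array with NaNs for undetected joints
    (hypothetical interface).
    """
    errors = []
    for img, joints_gt in zip(generated_images, target_joints):
        joints_pred = estimate_joints(img)
        valid = ~np.isnan(joints_pred).any(axis=1) & ~np.isnan(joints_gt).any(axis=1)
        if valid.any():
            errors.append(np.linalg.norm(joints_pred[valid] - joints_gt[valid], axis=1).mean())
    return float(np.mean(errors))
```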
4.3. Independent transfer of shape and appearance

We show the performance of our method for conditional image transfer in Fig. 7. Our disentangled representation of shape and appearance can transfer a single appearance over different shapes and vice versa. The model has learned a disentangled representation of both characteristics, so that one can be freely altered without affecting the other. This ability is further demonstrated in Fig. 6, which shows a synthesis across a full 360° turn.

dataset | our | pix2pix | PG2
COCO | 23.23 | 59.26 | −
DeepFashion | 7.34 | 15.53 | 19.04
Market-1501 | 54.60 | 59.59 | 59.95

Table 2: Automatic body joint detection is applied to images of humans synthesized by our method, pix2pix, and PG2. The L2 error of the joint locations is reported, indicating how well the shape is preserved. The error is measured in pixels at a resolution of 256 × 256.
The only other work we can compare with in this experiment is PG2 [24]. In contrast to our method, PG2 was trained fully supervised on the DeepFashion and Market-1501 datasets with pairs of images that share appearance (person id) but contain different shapes (in this case poses) of the same person. Despite the fact that we never train our model explicitly on pairs of images, we demonstrate both qualitatively and quantitatively that our method improves upon [24]. A direct visual comparison is shown in Fig. 8. We further design a new metric to evaluate and compare against PG2 on appearance and shape transfer. Since code for [24] is not available, our comparison is limited to the generated images provided by [24]. The idea behind our metric is to measure how well the appearance z of a reference image x is preserved when synthesizing it with a new shape estimate ŷ. For that we first fine-tune an ImageNet [33] pretrained VGG16 [37] on Market-1501 on the challenging task of person re-identification. At test time this network achieves a mean average precision (mAP) of 35.62% and a rank-1 accuracy of 63.00% on the task of single-query retrieval. These results are comparable to those reported in [48]. Due to the nature of Market-1501, which contains images of the same persons from multiple viewpoints, the features learned by the network should be pose invariant and mostly sensitive to appearance. Therefore, we use the difference between two features extracted by this network as a measure of appearance similarity.

For all results on the DeepFashion and Market-1501 datasets reported in [24], we use our method to generate exactly the same images. Further, we build groups of images sharing the same appearance and retain those groups that contain more than one element. As a result we obtain three groups of images (see Table 3), which we analyze independently. We denote these groups by Ii, i ∈ {1, 2, 3}.
Figure 7: Stability of appearance transfer on DeepFashion. Each row is synthesized using appearance information from the leftmost image and each column is synthesized from the pose in the first row. Notice that the inferred appearance remains constant across a wide variety of viewpoints.

For each image j in the group Ii we find its 10 nearest neighbors n^i_{j1}, n^i_{j2}, …, n^i_{j10} in the training set using the embedding of the fine-tuned VGG16. We search for the nearest neighbors in the training dataset, as the person IDs and poses were taken from the test dataset. We calculate the mean over each nearest-neighbor set and use this mean mj as the unique representation of the generated image j. For the images j in a group Ii we calculate the maximal pairwise distance between the mj as well as the length of the standard deviation vector. The results over all three image groups I1, I2, I3 are summarized in Table 3. One can see that our method yields more compact feature representations mj of the images in each group. From these results we conclude that our generated images are more consistent in their appearance than the results of PG2.

dataset | Our: ‖std‖ | Our: max pairwise dist | PG2: ‖std‖ | PG2: max pairwise dist
Market-1501 | 55.95 | 125.99 | 67.39 | 155.16
DeepFashion | 59.24 | 135.83 | 69.57 | 149.66
DeepFashion | 56.24 | 121.47 | 59.73 | 127.53

Table 3: Given an image, its appearance is transferred to different target poses. For these synthesized images, the unwanted deviation in appearance is measured using a pairwise perceptual VGG16 loss (lower is better).
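A sketch of this consistency measure, assuming a hypothetical `embed` function for the fine-tuned VGG16 re-identification features and a precomputed matrix of training-set embeddings; the distance and aggregation follow the description above.

```python
import numpy as np

def group_consistency(generated_group, train_embeddings, embed, k=10):
    """For one group I_i of generated images sharing an appearance, compute the
    length of the std vector and the maximal pairwise distance of the m_j.

    embed(img)       -> 1D feature of the fine-tuned VGG16 (hypothetical interface)
    train_embeddings -> (N, D) array of features of the training images
    """
    reps = []
    for img in generated_group:
        f = embed(img)
        dists = np.linalg.norm(train_embeddings - f, axis=1)
        nn_idx = np.argsort(dists)[:k]                        # 10 nearest training neighbors
        reps.append(train_embeddings[nn_idx].mean(axis=0))    # m_j
    reps = np.stack(reps)
    std_norm = float(np.linalg.norm(reps.std(axis=0)))
    pairwise = np.linalg.norm(reps[:, None, :] - reps[None, :, :], axis=-1)
    return std_norm, float(pairwise.max())
```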
ing the same appearance in different poses, we can uti-
lize additional large scale datasets. Results on COCO
The only other work we can compare with in this exper- are shown in Fig. 1. Besides still images, we are
iment is PG2 from [24]. In contrast to our method PG2 was able to synthesize videos. Examples can be found at
trained fully supervised on DeepFashion and Market-1501 https://compvis.github.io/vunet, demonstrating the transfer
datasets with pairs of images that share appearance (person of appearances from COCO to poses obtained from a video
id) but contain different shapes (in this case pose) of the dataset [45].
Figure 8: Comparing image transfer against PG2 [24]. Left: results on Market-1501; right: results on DeepFashion. Columns: conditional image, target image, Stage II of [24], ours. Appearance is inferred from the conditional image, the pose from the target image. Note that our method does not require labels about person identity.
References

[1] M. Arjovsky, S. Chintala, and L. Bottou. Wasserstein GAN. arXiv preprint arXiv:1701.07875, 2017.
[2] J. Bao, D. Chen, F. Wen, H. Li, and G. Hua. CVAE-GAN: Fine-grained image generation through asymmetric training. In Proceedings of the International Conference on Computer Vision (ICCV), 2017.
[3] M. Bautista, A. Sanakoyeu, and B. Ommer. Deep unsupervised similarity learning using partially ordered sets. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
[4] M. Bautista, A. Sanakoyeu, E. Sutter, and B. Ommer. CliqueCNN: Deep unsupervised exemplar learning. In Advances in Neural Information Processing Systems (NIPS), Barcelona, 2016.
[5] B. Brattoli, U. Büchler, A. S. Wahl, M. E. Schwab, and B. Ommer. LSTM self-supervision for detailed behavior analysis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017. (BB and UB contributed equally.)
[6] Z. Cao, T. Simon, S.-E. Wei, and Y. Sheikh. Realtime multi-person 2D pose estimation using part affinity fields. In CVPR, 2017.
[7] Q. Chen and V. Koltun. Photographic image synthesis with cascaded refinement networks. In Proceedings of the International Conference on Computer Vision (ICCV), 2017.
[8] X. Chen, Y. Duan, R. Houthooft, J. Schulman, I. Sutskever, and P. Abbeel. InfoGAN: Interpretable representation learning by information maximizing generative adversarial nets. arXiv preprint arXiv:1606.03657, 2016.
[9] M. Eitz, J. Hays, and M. Alexa. How do humans sketch objects? ACM Trans. Graph. (Proc. SIGGRAPH), 31(4):44:1–44:10, 2012.
[10] I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. C. Courville, and Y. Bengio. Generative adversarial nets. In Neural Information Processing Systems (NIPS), pages 2672–2680, 2014.
[11] K. He, X. Zhang, S. Ren, and J. Sun. Identity mappings in deep residual networks. In Computer Vision – ECCV 2016, 14th European Conference, Amsterdam, The Netherlands, Part IV, pages 630–645, 2016.
[12] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros. Image-to-image translation with conditional adversarial networks. arXiv preprint arXiv:1611.07004, 2016.
[13] T. Karras, T. Aila, S. Laine, and J. Lehtinen. Progressive growing of GANs for improved quality, stability, and variation. arXiv preprint arXiv:1710.10196, 2017.
[14] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. CoRR, abs/1412.6980, 2014.
[15] D. P. Kingma, S. Mohamed, D. Jimenez Rezende, and M. Welling. Semi-supervised learning with deep generative models. In Advances in Neural Information Processing Systems 27, pages 3581–3589. Curran Associates, Inc., 2014.
[16] D. P. Kingma and M. Welling. Auto-encoding variational Bayes. CoRR, abs/1312.6114, 2013.
[17] A. B. L. Larsen, S. K. Sønderby, and O. Winther. Autoencoding beyond pixels using a learned similarity metric. arXiv preprint arXiv:1512.09300, 2015.
[18] C. Lassner, G. Pons-Moll, and P. V. Gehler. A generative model for people in clothing. In Proceedings of the IEEE International Conference on Computer Vision, 2017.
[19] C. Ledig, L. Theis, F. Huszar, J. Caballero, A. P. Aitken, A. Tejani, J. Totz, Z. Wang, and W. Shi. Photo-realistic single image super-resolution using a generative adversarial network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.
[20] T. Lin, M. Maire, S. J. Belongie, L. D. Bourdev, R. B. Girshick, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft COCO: Common objects in context. arXiv preprint arXiv:1405.0312, 2014.
[21] Z. Liu, P. Luo, S. Qiu, X. Wang, and X. Tang. DeepFashion: Powering robust clothes recognition and retrieval with rich annotations. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
[22] Z. Liu, P. Luo, X. Wang, and X. Tang. Deep learning face attributes in the wild. In Proceedings of the International Conference on Computer Vision (ICCV), 2015.
[23] Z. Liu, S. Yan, P. Luo, X. Wang, and X. Tang. Fashion landmark detection in the wild. In European Conference on Computer Vision (ECCV), 2016.
[24] L. Ma, X. Jia, Q. Sun, B. Schiele, T. Tuytelaars, and L. Van Gool. Pose guided person image generation. In Advances in Neural Information Processing Systems (NIPS), pages 3846–3854, 2017.
[25] T. Milbich, M. Bautista, E. Sutter, and B. Ommer. Unsupervised video understanding by reconciliation of posture similarities. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2017.
[26] A. Odena, C. Olah, and J. Shlens. Conditional image synthesis with auxiliary classifier GANs. arXiv preprint arXiv:1610.09585, 2017.
[27] A. Radford, L. Metz, and S. Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. In International Conference on Learning Representations (ICLR), 2016.
[28] S. E. Reed, Z. Akata, S. Mohan, S. Tenka, B. Schiele, and H. Lee. Learning what and where to draw. In Advances in Neural Information Processing Systems 29, pages 217–225. Curran Associates, Inc., 2016.
[29] S. E. Reed, A. van den Oord, N. Kalchbrenner, S. Gómez, Z. Wang, D. Belov, and N. de Freitas. Parallel multiscale autoregressive density estimation. In Proceedings of the 34th International Conference on Machine Learning, 2017.
[30] O. Ronneberger, P. Fischer, and T. Brox. U-Net: Convolutional networks for biomedical image segmentation, pages 234–241. Springer International Publishing, Cham, 2015.
[31] M. Rosca, B. Lakshminarayanan, D. Warde-Farley, and S. Mohamed. Variational approaches for auto-encoding generative adversarial networks. CoRR, abs/1706.04987, 2017.
[32] J. C. Rubio, A. Eigenstetter, and B. Ommer. Generative regularization with latent topics for discriminative object recognition. Pattern Recognition, 48(12):3871–3880, 2015.
[33] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei. ImageNet large scale visual recognition challenge. International Journal of Computer Vision (IJCV), 115(3):211–252, 2015.
[34] T. Salimans, I. J. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen. Improved techniques for training GANs. In NIPS, 2016.
[35] T. Salimans and D. P. Kingma. Weight normalization: A simple reparameterization to accelerate training of deep neural networks. In Advances in Neural Information Processing Systems 29, pages 901–909. Curran Associates, Inc., 2016.
[36] W. Shi, J. Caballero, F. Huszar, J. Totz, A. P. Aitken, R. Bishop, D. Rueckert, and Z. Wang. Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1874–1883, 2016.
[37] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. CoRR, abs/1409.1556, 2014.
[38] K. Sohn, H. Lee, and X. Yan. Learning structured output representation using deep conditional generative models. In Neural Information Processing Systems (NIPS), pages 3483–3491, 2015.
[39] A. van den Oord, N. Kalchbrenner, O. Vinyals, L. Espeholt, A. Graves, and K. Kavukcuoglu. Conditional image generation with PixelCNN decoders. In Neural Information Processing Systems (NIPS), pages 4790–4798, 2016.
[40] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli. Image quality assessment: From error visibility to structural similarity. IEEE Transactions on Image Processing, 13(4):600–612, Apr. 2004.
[41] S. Xie and Z. Tu. Holistically-nested edge detection. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2015.
[42] X. Yan, J. Yang, K. Sohn, and H. Lee. Attribute2Image: Conditional image generation from visual attributes. In Proceedings of the European Conference on Computer Vision, 2016.
[43] A. Yu and K. Grauman. Fine-grained visual comparisons with local learning. In Conference on Computer Vision and Pattern Recognition (CVPR), 2014.
[44] H. Zhang, T. Xu, H. Li, S. Zhang, X. Wang, X. Huang, and D. Metaxas. StackGAN: Text to photo-realistic image synthesis with stacked generative adversarial networks. In Proceedings of the International Conference on Computer Vision (ICCV), 2017.
[45] W. Zhang, M. Zhu, and K. G. Derpanis. From actemes to action: A strongly-supervised representation for detailed action understanding. In Proceedings of the IEEE International Conference on Computer Vision, pages 2248–2255, 2013.
[46] B. Zhao, X. Wu, Z. Cheng, H. Liu, and J. Feng. Multi-view image generation from a single-view. arXiv preprint arXiv:1704.04886, 2017.
[47] L. Zheng, L. Shen, L. Tian, S. Wang, J. Wang, and Q. Tian. Scalable person re-identification: A benchmark. In IEEE International Conference on Computer Vision, 2015.
[48] Z. Zheng, L. Zheng, and Y. Yang. A discriminatively learned CNN embedding for person re-identification. arXiv preprint arXiv:1611.05666, 2016.
[49] J.-Y. Zhu, P. Krähenbühl, E. Shechtman, and A. A. Efros. Generative visual manipulation on the natural image manifold. In Proceedings of the European Conference on Computer Vision (ECCV), 2016.
[50] J.-Y. Zhu and T. Park. Image-to-image translation with conditional adversarial nets (published code).
[51] J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. arXiv preprint arXiv:1703.10593, 2017.
[52] S. Zhu, S. Fidler, R. Urtasun, D. Lin, and C. C. Loy. Be your own Prada: Fashion synthesis with structural coherence. In Proceedings of the IEEE International Conference on Computer Vision, 2017.