0% found this document useful (0 votes)

28 views23 pages

Image 2 Style Gan

1) The document proposes an efficient algorithm to embed images into the latent space of StyleGAN, a pre-trained GAN model. This allows semantic editing of existing photographs by applying operations in the latent space. 2) Testing the algorithm on various images provides insights into what types of images can be embedded, how they are embedded, and if the embedding is meaningful. Basic operations like interpolation and transfer suggest the embedding captures semantics. 3) The algorithm efficiently maps images into StyleGAN's extended latent space, enabling applications like morphing, style transfer, and expression transfer of face images.

Uploaded by

Piotr Pej

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

28 views23 pages

Image 2 Style Gan

Uploaded by

Piotr Pej

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 23

Image2StyleGAN: How to Embed Images Into the StyleGAN Latent Space?

Rameen Abdal Yipeng Qin Peter Wonka

KAUST KAUST KAUST
[email protected] [email protected] [email protected]

Abstract
arXiv:1904.03189v2 [cs.CV] 3 Sep 2019

is not only able to embed human face images, but also suc-
cessfully embeds non-face images from different classes.
We propose an efficient algorithm to embed a given im- Therefore, we continue our investigation by analyzing the
age into the latent space of StyleGAN. This embedding en- quality of the embedding to see if the embedding is semanti-
ables semantic image editing operations that can be applied cally meaningful. To this end, we propose to use three basic
to existing photographs. Taking the StyleGAN trained on operations on vectors in the latent space: linear interpola-
the FFHQ dataset as an example, we show results for image tion, crossover, and adding a vector and a scaled difference
morphing, style transfer, and expression transfer. Studying vector. These operations correspond to three semantic im-
the results of the embedding algorithm provides valuable age processing applications: morphing, style transfer, and
insights into the structure of the StyleGAN latent space. We expression transfer. As a result, we gain more insight into
propose a set of experiments to test what class of images can the structure of the latent space and can solve the mystery
be embedded, how they are embedded, what latent space is why even instances of non-face images such as cars can be
suitable for embedding, and if the embedding is semanti- embedded.
cally meaningful. Our contributions include:

• An efficient embedding algorithm which can map a

1. Introduction given image into the extended latent space W + of a
pre-trained StyleGAN.
Generative Adverserial Networks (GANs) are very suc-
cessfully applied in various computer vision applications, • We study multiple questions providing insight into the
e.g. texture synthesis [20, 37, 31], video generation [35, 34], structure of the StyleGAN latent space, e.g.: What type
image-to-image translation [11, 40, 1, 27] and object detec- of images can be embedded? What type of faces can
tion [21]. be embedded? What latent space can be used for the
In the few past years, the quality of images synthesized embedding?
by GANs has increased rapidly. Compared to the seminal
DCGAN framework [28] in 2015, the current state-of-the- • We propose to use three basic operations on vectors
art GANs [14, 3, 15, 40, 41] can synthesize at a much higher to study the quality of the embedding. As a result,
resolution and produce significantly more realistic images. we can better understand the latent space and how dif-
Among them, StyleGAN [15] makes use of an intermediate ferent classes of images are embedded. As a byprod-
W latent space that holds the promise of enabling some con- uct, we obtain excellent results on multiple face image
trolled image modifications. We believe that image modifi- editing applications including morphing, style transfer,
cations are a lot more exciting when it becomes possible to and expression transfer.
modify a given image rather than a randomly GAN gener-
ated one. This leads to the natural question if it is possible 2. Related Work
to embed a given photograph into the GAN latent space.
To tackle this question, we build an embedding algo- High-quality GANs Starting from the groundbreaking
rithm that can map a given image I in the latent space of work by Goodfellow et al. [8] in 2014, the entire computer
StyleGAN pre-trained on the FFHQ dataset. One of our vision community has witnessed the fast-paced improve-
important insights is that the generalization ability of the ments on GANs in the past years. For image generation
pre-trained StyleGAN is significantly enhanced when using tasks, DCGAN [28] is the first milestone that lays down
an extended latent space W + (See Sec. 3.3). As a conse- the foundation of GAN architectures as fully-convolutional
quence, somewhat surprisingly, our embedding algorithm neural networks. Since then, various efforts have been made

1
Figure 1: Top row: input images. Bottom row: results of embedding the images into the StyleGAN latent space.

to improve the performance of GANs from different as- ii) select a random initial latent code and optimize it using
pects, e.g. the loss function [23, 2], the regularization or gradient descent [39, 4]. Between them, the first approach
normalization [9, 25], and the architecture [9]. However, provides a fast solution of image embedding by performing
due to the limitation of computational power and the short- a forward pass through the encoder neural network. How-
age of high-quality training data, these works are only tested ever, it usually has problems generalizing beyond the train-
with low resolution and poor quality datasets collected for ing dataset. In this paper, we decided to build on the second
classification / recognition tasks. Addressing this issue, approach as the more general and stable solution. As a con-
Karras et al. collected the first high-quality human face currently developed work, the Github repository stylegan-
dataset CelebA-HQ and proposed a progressive strategy to encoder [26] also demonstrated that the optimization-based
train GANs for high resolution image generation tasks [14]. approach leads to embeddings of very high visual quality.
Their ProGAN is the first GAN that can generate realistic
human faces at a high resolution of 1024 × 1024. How-
Perceptual Loss and Style Transfer Traditionally, the
ever, the generation of high-quality images from complex
low-level similarity between two images is measured in the
datasets (e.g. ImageNet) remains a challenge. To this end,
pixel space with L1/L2 loss functions. While in the past
Brock et al. proposed BigGAN and argued that the training
years, inspired by the success of complex image classifica-
of GANs benefit dramatically from large batch sizes [3].
tion [18, 22], Gatys et al. [7, 6] observed that the learned
Their BigGAN can generate realistic samples and smooth
filters of the VGG image classification model [22] are ex-
interpolations spanning different classes. Recently, Karras
cellent general-purpose feature extractors and proposed to
et al. collected a more diverse and higher quality human
use the covariance statistics of the extracted features to mea-
face dataset FFHQ and proposed a new generator archi-
sure the high-level similarity between images perceptually,
tecture inspired by the idea of neural style transfer [10],
which is then formalized as the perceptual loss [12, 5].
which further improves the performance of GANs on hu-
To demonstrate the power of their method, they showed
man face generation tasks [15]. However, the lack of con-
promising results on style transfer [6].
trol over image modification ascribed to the interpretability
Specifically, they argued that different layers of the VGG
of neural networks, is still an open problem. In this paper,
neural network extract the image features at different scales
we tackle the interpretability problem by embedding user-
and can be separated into content and style.
specified images back to the GAN latent space, which leads
to a variety of potential applications. To accelerate the initial algorithm, Johnson et al. [12]
proposed to train a neural network to solve the optimiza-
tion problem of [6], which can transfer the style of a given
Latent Space Embedding In general, there are two exist- image to any other image in real-time. The only limitation
ing approaches to embed instances from the image space to of their method is that they need to train separate neural
the latent space: i) learn an encoder that maps a given image networks for different style images. Finally, this issue is re-
to the latent space (e.g. the Variational Auto-Encoder [16]); solved by Huang and Belongie [10] with adaptive instance
normalization. As a result, they can transfer arbitrary style Transformation L(×105 ) kw∗ − w̄k
in real-time. Translation (Right 140 pixels) 0.782 48.56
Translation (Left 160 pixels) 0.406 44.12
3. What images can be embedded into the Zoom out (2X) 0.225 38.04
StyleGAN latent space? Zoom in (2X) 0.718 40.55
90◦ Rotation 0.622 47.21
We set out to study the question if it is even possible 180◦ Rotation 0.599 42.93
to embed images into the StyleGAN latent space. This
question is not trivial, because our initial embedding ex- Table 1: Embedding results of the transformed images. L
periments with faces and with other GANs resulted in faces is the loss (Eq.1) after optimization. kw∗ − w̄k is the dis-
that were no longer recognizable as the same person. Due to tance between the latent codes w∗ and w̄ (Section 5.1) of
the improved variability of the FFHQ dataset and the supe- the average face [15].
rior quality of the StyleGAN architecture, there is a renewed
hope that embedding existing images in the latent space is
possible. to affine transformations (translation, resizing and rotation).
Among them, the translation seems to have the worst perfor-
3.1. Embedding Results for Various Image Classes mance as it can fail to produce a valid face embedding. For
To test our method, we collect a small-scale dataset of 25 resizing and rotation, the results are valid faces. However,
diverse images spanning 5 categories (i.e. faces, cats, dogs, they are blurry and lose many details, which are still worse
cars, and paintings). Details of the dataset are shown in than the normal embedding. From these observations, we
the supplementary material. We use the code provided by argue that the generalization ability of GANs is sensitive to
StyleGAN [15] to preprocess the face images. This prepro- affine transformation, which implies that the learned rep-
cess includes registration to a canonical face position. resentations are still scale and position dependent to some
To better understand the structure and attributes of the extent.
latent space, it is beneficial to study the embedding of a
larger variety of image classes. We choose faces of cats, Embedding Defective Images As Figure 3 shows, the
dogs, and paintings as they share the overall structure with performance of StyleGAN embedding is quite robust to de-
human faces, but are depicted in a very different style. Cars fects in images. It can be observed that the embeddings of
are selected as they have no structural similarity to faces. different facial features are independent of each other. For
Figure 1 shows the embedding results consist of one ex- example, removing the nose does not have an obvious influ-
ample for each image class in the collected test dataset. It ence on the embedding of the eyes and the mouth. On the
can be observed that the embedded Obama face is of very one hand, this phenomenon is good for general image edit-
high perceptual quality and faithfully reproduces the in- ing applications. On the other hand, it shows that the latent
put. However, it is noted that the embedded face is slightly space does not force the embedded image to be a complete
smoothed and minor details are absent. face, i.e. it does not inpaint the missing information.
Going beyond faces, interestingly, we find that although
the StyleGAN generator is trained on a human face dataset, 3.3. Which Latent Space to Choose?
the embedding algorithm is capable to go far beyond hu- There are multiple latent spaces in StyleGAN [15] that
man faces. As Figure 1 shows, although slightly worse could be used for an embedding. Two obvious candidates
than those of human faces, we can obtain reasonable and are the initial latent space Z and the intermediate latent
relatively high-quality embeddings of cats, dogs and even space W . The 512-dimensional vectors w ∈ W are ob-
paintings and cars. This reveals the effective embedding ca- tained from the 512-dimensional vectors z ∈ Z by passing
pability of the algorithm and the generality of the learned them through a fully connected neural network. An impor-
filters of the generator. tant insight of our work is that it is not easily possible to
Another interesting question is how the quality of the embed into W or Z directly. Therefore, we propose to em-
pre-trained latent space affects the embedding. To conduct bed into an extended latent space W + . W + is a concate-
these tests we also used StyleGANs trained on cars, cats, ... nation of 18 different 512-dimensional w vectors, one for
The quality of these results is significantly lower, as shown each layer of the StyleGAN architecture that can receive
in supplementary materials. input via AdaIn. As shown in Figure 5 (c)(d), embedding
into W directly does not give reasonable results. Another
3.2. How Robust is the Embedding of Face Images?
interesting question is how important the learned network
Affine Transformation As Figure 2 and Table 1 show, weights are for the result. We answer this question in Fig-
the performance of StyleGAN embedding is very sensitive ure 5 (b)(e) by showing an embedding into a network that is
(a) (b) (c) (d) (e) (f) (g)

Figure 2: Top row: the input images. Bottom row: the embedded results. (a) Standard embedding results. (b) Translation
140 pixels to the right. (c) Translation 160 pixels to the left. (d) Zoom out by 2X. (e) Zoom in by 2X. (f) 90◦ rotation. (g)
180◦ rotation.

Figure 3: Stress test results on defective image embedding.

Top row: the input images. Bottom row: the embedded
results.

simply initialized with random weights.

4. How Meaningful is the Embedding?

We propose three tests to evaluate if an embedding is
semantically meaningful. Each of these tests can be con-
ducted by simple latent code manipulations of vectors wi
and these tests correspond to semantic image editing appli-
cations in computer vision and computer graphics: morph-
ing, expression transfer, and style transfer. We consider a Figure 4: Morphing between two embedded images (the
test successful if the resulting manipulation results in high left-most and right-most ones).
quality images.
4.1. Morphing
Interestingly, it can be observed that there are contours of
Image morphing is a longstanding research topic in com- human faces in the intermediate images of the inter-class
puter graphics and computer vision, e.g. [36, 29, 30, 32, 38, morphing, which shows that the latent space structure of
17]). Given two embedded images with their respective la- this StyleGAN is dedicated to human faces. We therefore
tent vectors w1 and w2 , morphing is computed by a linear conjecture that non-face images are actually embedded the
interpolation, w = λw1 + (1 − λ)w2 , λ ∈ (0, 1), and sub- following way. The initial layers create a face like structure
sequent image generation using the new code w. As Figure but the later layers paint over this structure so that it is no
4 shows, our method generates high-quality morphing be- longer recognizable. While an extensive study of morphing
tween face images (row 1,2,3) but fails on non-face images itself is beyond the scope of this paper, we believe that the
in both in-class (row 4) and inter-class (row 5) morphing. face morphing results are excellent and might be superior
(a) (b) (c) (d) (e) (f) (g)

Figure 5: (a) Original images. Embedding results into the original space W : (b) using random weights in the network layers;
(c) with w̄ initialization; (d) with random initialization. Embedding results into the W + space: (e) using random weights in
the network layers; (f) with w̄ initialization; (g) with random initialization.

to the current state of the art. We leave this investigation to of the embedded content image for the first 9 layers (cor-
future work. responding to spatial resolution 42 − 642 ) and override the
latent codes with the ones of the style image for the last
9 layers (corresponding to spatial resolution 642 − 10242 ).
Our method is able to transfer the low level features (e.g.
colors and textures) but fails to faithfully maintain the con-
tent structure of non-face images (second column Figure 8),
especially the painting. This phenomenon reveals that the
generalization and expressing power of StyleGAN is more
likely to reside in the style layers corresponding to higher
spatial resolutions.

4.3. Expression Transfer and Face Reenactment

Figure 6: First column: style image; Second column: em- Given three input vectors w1 , w2 , w3 , expression trans-
bedded stylized image using style loss from conv4 2 layer fer is computed as w = w1 + λ(w3 − w2 ), where w1 is the
of VGG-16; Third to Sixth column: style transfer by re- latent code of the target image, w2 corresponds to a neu-
placing latent code of last 9 layers of base image with the tral expression of the source image, and w3 corresponds to
embedded style image. a more distinct expression. For example, w3 could corre-
spond to a smiling face and w2 to an expressionless face of
the same person. To eliminate the noise (e.g. background
4.2. Style Transfer noise), we heuristically set a lower bound threshold on the
L2 − norm of the channels of difference latent code, be-
Given two latent codes w1 and w2 , style transfer is com- low which, the channel is replaced by a zero vector. For
puted by a crossover operation [15]. We show the style the above experiment, the selected value of the threshold is
transfer results between an embedded stylized image and 1. We normalize the resultant vectors to control the inten-
other face images (Figure 6) and between embedded images sity of an expression in a particular direction. Such code is
from different classes (Figure 8). relatively independent of the source faces and can be used
More specifically in Figure 8, we retain the latent codes to transfer expressions (Figure 7). We believe that these
Figure 7: Results on expression transfer. The first row shows the reference images from IMPA-FACES3D [24] dataset. In the
following rows, the middle image in each of the examples is the embedded image, whose expression is gradually transferred
to the reference expression (on the right) and the opposite direction (on the left) respectively. More results are included in the
supplementary material.

5. Embedding Algorithm
Our method follows a straightforward optimization
framework [4] to embed a given image onto the manifold
of the pre-trained generator. Starting from a suitable ini-
tialization w, we search for an optimized vector w∗ that
minimizes the loss function that measures the similarity be-
tween the given image and the image generated from w∗ .
Algorithm 1 shows the pseudo-code of our method. An in-
teresting aspect of this work is that not all design choices
lead to good results and that experimenting with the design
choices provides further insights into the embedding.

Algorithm 1: Latent Space Embedding for GANs

Input: An image I ∈ Rn×m×3 to embed; a
pre-trained generator G(·).
Output: The embedded latent code w∗ and the
Figure 8: Style transfer between the embedded style image embedded image G(w∗ ) optimzed via F 0 .
(first column) and the embedded content images (first row). 1 Initialize latent code w∗ = w;
2 while not converged do
3 L ← Lpercept (G(w∗ ), I) + N λ
kG(w∗ ) − Ik22 ;
∗ ∗ 0
4 w ← w − ηF (∇w∗ L );
expression transfer results are also of very high quality. Ad- 5 end
ditional results are available in supplementary materials and
the accompanying video.
Data class w Init. L(×105 ) kw∗ − w̄k
w = w̄ 0.309 30.67
Face
Random 0.351 35.60
w = w̄ 0.752 70.86
Cat
Random 0.740 70.97
w = w̄ 0.922 74.78
Dog
Random 0.845 75.14
w = w̄ 3.530 103.61
Painting
Random 3.451 105.29
w = w̄ 1.390 82.53
Car
Random 1.269 82.60

Table 2: Algorithmic choice justification on the latent code

initialization. w Init. is the initialization method for the
latent code w. L is the mean of the loss (Eq.1) after opti-
mization. kw∗ − w̄k is the distance between the latent codes
w∗ and w̄ of the average face [15].

Figure 9: Algorithmic choice justification on the loss func-

5.1. Initialization tion. Each row shows the results of an image from the five
We investigate two design choices for the initialization. different classes in our test dataset respectively. From left
The first choice is random initialization. In this case, each to right, each column shows: (1) the original image; (2)
variable is sampled independently from a uniform distribu- pixel-wise MSE loss only; (3) perceptual loss on VGG-16
tion U[−1, 1]. The second choice is motivated by the obser- conv3 2 layer only; (4) pixel-wise MSE loss and VGG-16
vation that the distance to the mean latent vector w̄ can be conv3 2; (5) perceptual loss (Eq.2) only; (6) our loss func-
used to identify low quality faces [15]. Therefore, we pro- tion (Eq.1). More results are included in the supplementary
pose to use w̄ as initialization and expect the optimization material.
to converge to a vector w∗ that is closer to w̄.
To evaluate these two design choices, we compared the
loss values and the distance kw∗ − w̄k between the opti- λmse = 1 is empirically obtained for good performance.
mized latent code w∗ and w̄ after optimization. As Table 2 For the perceptual loss term Lpercept (·) in Eq.1, we use:
shows, initializing w = w̄ for face image embeddings not 4
only makes the optimized w∗ closer to w̄, but also achieves
X λj
Lpercept (I1 , I2 ) = kFj (I1 ) − Fj (I2 )k22 (2)
a much lower loss value. However, for images in other j=1
Nj
classes (e.g. dog), random initialization proves to be the
better option. Intuitively, the phenomenon suggests that the where I1 , I2 ∈ Rn×n×3 are the input images, Fj is the fea-
distribution has only one cluster of faces, the other instances ture output of VGG-16 layers conv1 1, conv1 2, conv3 2
(e.g. dogs, cats) are scattered points surrounding the cluster and conv4 2 respectively, Nj is the number of scalars in the
without obvious patterns. Qualitative results are shown in jth layer output, λj = 1 for all js are empirically obtained
Figure 5 (f)(g). for good performance.
Our choice of the perceptual loss together with the pixel-
5.2. Loss Function wise MSE loss comes from the fact that the pixel-wise MSE
loss alone cannot find a high quality embedding. The per-
To measure the similarity between the input image and
ceptual loss therefore acts as some sort of regularizer to
the embedded image during optimization, we employ a loss
guide the optimization into the right region of the latent
function that is a weighted combination of the VGG-16 per-
space.
ceptual loss [12] and the pixel-wise MSE loss:
We perform an ablation study to justify our choice of loss
λmse function in Eq.1. As Figure 9 shows, using the pixel-wise
w∗ = min Lpercept (G(w), I) + kG(w) − Ik22 (1) MSE loss term alone (column 2) embeds the general colors
w N
well but fails to catch the features of non-face images. In ad-
where I ∈ Rn×n×3 is the input image, G(·) is the pre- dition, it has a smoothing effect that does not preserve the
trained generator, N is the number of scalars in the image details even for the human faces. Interestingly, due to the
(i.e. N = n × n × 3), w is the latent code to optimize, pixel-wise MSE loss working in the pixel space and ignor-
Figure 11: Stress test results on iterative embedding. The
left most column shows the original images and the subse-
quent columns are the results of iterative embedding.

Iterative Embedding We tested the robustness of the pro-

posed method on iterative embedding, i.e. we iteratively
take the embedding results as new input images and do the
embedding again. This process is repeated seven times. As
Figure 11 shows, although it is guaranteed that the input
Figure 10: Loss values vs. the number of optimization steps. image exists in the model distribution after the first em-
bedding, the performance of the proposed method slowly
degenerates (more details are lost) with the number of iter-
ing the differences in feature space, its embedding results ative embedding. The reason for this observation may be
on non-face images (e.g. the car and the painting) have a that the employed optimization approach suffers from slow
tendency towards the average face of the pre-trained Style- convergence around local optimum. For the embeddings
GAN [15]. This problem is addressed by the perceptual other than human faces, the stochastic initial latent codes
losses (column 3, 5) that measures image similarity in the may also be a factor for the degradation. In summary, these
feature space. Since our embedding task requires the em- observations show that our embedding approach can reach
bedded image to be close to the input at all scales, we found reasonably “good” embeddings on the model distribution
that matching the features at multiple layers of the VGG-16 easily, although “perfect” embeddings are hard to reach.
network (column 5) works better than using only a single
layer (column 3). This further motivates us to combine the 6. Conclusion
pixel-wise MSE loss with the perceptual loss (column 4, 6)
from that the pixel-wise MSE loss can be viewed as the low- We proposed an efficient algorithm to embed a given
est level perceptual loss at pixel scale. Column 6 of Figure 9 image into the latent space of StyleGAN. This algorithm
shows the embedding results of our final choice (pixel-wise enables semantic image editing operations, such as image
MSE + multi-layer perceptual loss), which achieves the best morphing, style transfer, and expression transfer. We also
performance among different algorithmic choices. used the algorithm to study multiple aspects of the Style-
GAN latent space. We proposed experiments to analyze
what type of images can be embedded, how they are em-
5.3. Other Parameters bedded, and how meaningful the embedding is. Important
conclusions of our work are that embedding works best into
We use the Adam optimizer with a learning rate of 0.01, the extended latent space W + and that any type of image
β1 = 0.9, β2 = 0.999, and = 1e−8 in all our exper- can be embedded. However, only the embedding of faces is
iments. We use 5000 gradient descent steps for the opti- semantically meaningful.
mization, taking less than 7 minutes per image on a 32GB Our framework still has several limitations. First, we in-
Nvidia TITAN V100 GPU. herit image artifacts present in pre-trained StyleGAN that
To justify our choice of 5000 optimization steps, we in- we illustrate in supplementary materials. Second, the opti-
vestigated the change in the loss function as a function of mization takes several minutes and an embedding algorithm
the number of iterations. As Figure 10 shows, the loss value that can work in under a second would be more appealing
of the human face image drops the quickest and converges for interactive editing.
at around 1000 optimization steps; those of the cat, the dog In future work, we hope to extend our framework to pro-
and the car images converge slower at around 3000 opti- cess videos in addition to static images. Further, we would
mization steps; while the painting curve is the slowest and like to explore embeddings into GANs trained on three-
converges around 5000 optimization steps. We choose to dimensional data, such as point clouds or meshes.
optimize the loss function for 5000 steps in all our experi- Acknowledgement This work was supported by the
ments. KAUST Office of Sponsored Research (OSR) under Award
No. OSR-CRG2017-3426. [16] Diederik P Kingma and Max Welling. Auto-encoding varia-
tional bayes. arXiv preprint arXiv:1312.6114, 2013. 2
References [17] Pavel Korshunov and Touradj Ebrahimi. Using face morph-
ing to protect privacy. In 2013 10th IEEE International Con-
[1] Yazeed Alharbi, Neil Smith, and Peter Wonka. Latent filter ference on Advanced Video and Signal Based Surveillance,
scaling for multimodal unsupervised image-to-image trans- pages 208–213. IEEE, 2013. 4
lation. arXiv preprint arXiv:1812.09877, 2018. 1
[18] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton.
[2] Martin Arjovsky, Soumith Chintala, and Léon Bottou. Imagenet classification with deep convolutional neural net-
Wasserstein generative adversarial networks. In Proceedings works. In Advances in Neural Information Processing Sys-
of the 34th International Conference on Machine Learning, tems 25. 2012. 2
volume 70, pages 214–223, 2017. 2
[19] Samuli Laine. Feature-based metrics for exploring the latent
[3] Andrew Brock, Jeff Donahue, and Karen Simonyan. Large
space of generative models, 2018. 11
scale GAN training for high fidelity natural image synthe-
[20] Chuan Li and Michael Wand. Precomputed real-time texture
sis. In International Conference on Learning Representa-
synthesis with markovian generative adversarial networks.
tions, 2019. 1, 2
In Computer Vision - ECCV 2016 - 14th European Confer-
[4] A. Creswell and A. A. Bharath. Inverting the generator of a
ence, Amsterdam, The Netherlands, October 11-14, 2016,
generative adversarial network. IEEE Transactions on Neu-
Proceedings, Part III, 2016. 1
ral Networks and Learning Systems, 2018. 2, 6
[21] Jianan Li, Xiaodan Liang, Yunchao Wei, Tingfa Xu, Jiashi
[5] Alexey Dosovitskiy and Thomas Brox. Generating images
Feng, and Shuicheng Yan. Perceptual generative adversarial
with perceptual similarity metrics based on deep networks.
networks for small object detection. In The IEEE Conference
In Advances in neural information processing systems, pages
on Computer Vision and Pattern Recognition (CVPR), July
658–666, 2016. 2
2017. 1
[6] L. A. Gatys, A. S. Ecker, and M. Bethge. A neural algorithm
of artistic style. arXiv, Aug 2015. 2 [22] S. Liu and W. Deng. Very deep convolutional neural network
based image classification using small training sample size.
[7] Leon A. Gatys, Alexander S. Ecker, and Matthias Bethge.
In 2015 3rd IAPR Asian Conference on Pattern Recognition
Texture synthesis using convolutional neural networks. In
(ACPR), Nov 2015. 2
Proceedings of the 28th International Conference on Neural
Information Processing Systems - Volume 1, NIPS’15, 2015. [23] Xudong Mao, Qing Li, Haoran Xie, Raymond Y.K. Lau,
2 Zhen Wang, and Stephen Paul Smolley. Least squares gener-
ative adversarial networks. In The IEEE International Con-
[8] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing
ference on Computer Vision (ICCV), Oct 2017. 2
Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and
Yoshua Bengio. Generative adversarial nets. In Advances in [24] Jesús P Mena-Chalco, Luiz Velho, and RM Cesar Junior.
neural information processing systems, 2014. 1 3d human face reconstruction using principal components
[9] Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent spaces. In Proceedings of WTD SIBGRAPI Conference on
Dumoulin, and Aaron C Courville. Improved training of Graphics, Patterns and Images, 2011. 6, 23
wasserstein gans. In Advances in Neural Information Pro- [25] Takeru Miyato, Toshiki Kataoka, Masanori Koyama, and
cessing Systems, pages 5767–5777, 2017. 2 Yuichi Yoshida. Spectral normalization for generative ad-
[10] Xun Huang and Serge Belongie. Arbitrary style transfer in versarial networks. In International Conference on Learning
real-time with adaptive instance normalization. In ICCV, Representations, 2018. 2
2017. 2 [26] Dmitry Nikitko. stylegan-encoder. https://github.
[11] Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A com/Puzer/stylegan-encoder, 2019. 2
Efros. Image-to-image translation with conditional adver- [27] Taesung Park, Ming-Yu Liu, Ting-Chun Wang, and Jun-Yan
sarial networks. CVPR, 2017. 1 Zhu. Semantic image synthesis with spatially-adaptive nor-
[12] Justin Johnson, Alexandre Alahi, and Li Fei-Fei. Perceptual malization. In Proceedings of the IEEE Conference on Com-
losses for real-time style transfer and super-resolution. In puter Vision and Pattern Recognition, 2019. 1
European conference on computer vision, 2016. 2, 7 [28] Alec Radford, Luke Metz, and Soumith Chintala. Un-
[13] Justin Johnson, Alexandre Alahi, and Li Fei-Fei. Perceptual supervised representation learning with deep convolu-
losses for real-time style transfer and super-resolution. In tional generative adversarial networks. arXiv preprint
European conference on computer vision, pages 694–711. arXiv:1511.06434, 2015. 1
Springer, 2016. 11 [29] Ulrich Scherhag, Christian Rathgeb, Johannes Merkle,
[14] Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen. Ralph Breithaupt, and Christoph Busch. Face recognition
Progressive growing of GANs for improved quality, stabil- systems under morphing attacks: A survey. IEEE Access, 7,
ity, and variation. In International Conference on Learning 2019. 4
Representations, 2018. 1, 2 [30] Clemens Seibold, Wojciech Samek, Anna Hilsmann, and Pe-
[15] Tero Karras, Samuli Laine, and Timo Aila. A style-based ter Eisert. Detection of face morphing attacks by deep learn-
generator architecture for generative adversarial networks. ing. In International Workshop on Digital Watermarking,
arXiv preprint arXiv:1812.04948, 2018. 1, 2, 3, 5, 7, 8, 11 pages 107–120. Springer, 2017. 4
[31] Ron Slossberg, Gil Shamai, and Ron Kimmel. High quality
facial surface and texture synthesis via generative adversar-
ial networks. In European Conference on Computer Vision,
pages 498–513. Springer, 2018. 1
[32] Mark Steyvers. Morphing techniques for manipulating face
images. Behavior Research Methods, Instruments, & Com-
puters, 31(2):359–369, 1999. 4
[33] Timo Aila Tero Karras, Samuli Laine. Stylegan - official
tensorflow implementation. https://github.com/
NVlabs/stylegan, 2018. 11
[34] Sergey Tulyakov, Ming-Yu Liu, Xiaodong Yang, and Jan
Kautz. Mocogan: Decomposing motion and content for
video generation. In The IEEE Conference on Computer Vi-
sion and Pattern Recognition (CVPR), June 2018. 1
[35] Carl Vondrick, Hamed Pirsiavash, and Antonio Torralba.
Generating videos with scene dynamics. In Advances in Neu-
ral Information Processing Systems 29. 2016. 1
[36] George Wolberg. Image morphing: a survey. The Visual
Computer, 14(8), 1998. 4
[37] Wenqi Xian, Patsorn Sangkloy, Varun Agrawal, Amit Raj,
Jingwan Lu, Chen Fang, Fisher Yu, and James Hays. Tex-
turegan: Controlling deep image synthesis with texture
patches. In The IEEE Conference on Computer Vision and
Pattern Recognition (CVPR), June 2018. 1
[38] Fei Yang, Eli Shechtman, Jue Wang, Lubomir Bourdev, and
Dimitris Metaxas. Face morphing using 3d-aware appear-
ance optimization. In Proceedings of Graphics Interface
2012, pages 93–99. Canadian Information Processing Soci-
ety, 2012. 4
[39] Jun-Yan Zhu, Philipp Krhenbhl, Eli Shechtman, and
Alexei A. Efros. Generative visual manipulation on the nat-
ural image manifold. Lecture Notes in Computer Science,
page 597613, 2016. 2
[40] Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A
Efros. Unpaired image-to-image translation using cycle-
consistent adversarial networkss. In Computer Vision
(ICCV), 2017 IEEE International Conference on, 2017. 1
[41] Jun-Yan Zhu, Richard Zhang, Deepak Pathak, Trevor Dar-
rell, Alexei A Efros, Oliver Wang, and Eli Shechtman. To-
ward multimodal image-to-image translation. In I. Guyon,
U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vish-
wanathan, and R. Garnett, editors, Advances in Neural Infor-
mation Processing Systems 30, pages 465–476. Curran As-
sociates, Inc., 2017. 1
Defect L(×105 ) kw∗ − w̄k
non-defective 0.204 29.19
Eyes 0.271 34.90
Nose 0.311 39.20
Mouth 0.301 37.04
Eyes and Mouth 0.233 39.62
Eyes, Nose and Mouth 0.285 37.59

Table 3: Quantitative results on defective image embedding

(Figure 3 in the main paper). L is the loss after optimization.
kw∗ − w̄k is the distance between the latent codes w∗ and
w̄ of the average face.

dataset (officially released [15, 33]) inherently creates cir-

cular artifacts in the generated images, which are also ob-
servable in our embedding results (Figure 21). These arti-
facts are thus independent of our embedding algorithm and
may be resolved by employing better pretrained models in
the future.

Figure 12: First column: original image (1024 × 1024). Limitation of the ImageNet-based Perceptual loss All
Second column: embedded image with the perceptual loss existing perceptual losses utilize the classifiers trained on
applied to resized images of 256 × 256 resolution. Third the ImageNet dataset (e.g. VGG-16, VGG-19), which are
column: embedded image with the perceptual loss applied restricted to the resolution of 224×224. While in our paper,
to the images at the original 1024 × 1024 resolution. we aim to embed images of high resolution (1024 × 1024)
that are much larger than that of ImageNet images. Such
inconsistency in the resolution may disable the learned im-
7. Additional Materials on Embedding age filters as they are scale-dependent. To this end, we fol-
low the common practice [13, 19] and use a simple resizing
Dataset In order to test our embedding algorithm, we col-
trick to compute the perceptual loss on resized images of
lect a small dataset of 25 images in five different categories:
256 × 256 resolution. As Figure 12 shows, the embedding
human faces, cats, dogs, cars and paintings (Figure 17).
results with the resizing trick outperform the ones at the
original resolution. However, small details are lost during
Additional Embedding Results To further support our the resizing, which can slightly smoothen the embedding re-
findings about the initial latent code in the main paper, we sults. We expect to get better results with future perceptual
show more results in Figure 13. It can be observed that: losses that work on higher resolutions.
for face images, initializing the optimization with the mean
face latent code works better; while for non-face images, us- StyleGANs trained on Other Datasets To support our
ing the latent codes randomly sampled from a multivariate insights on the learned distribution, we further tested our
uniform distribution is a better option. embedding algorithm on the StyleGANs trained on three
more datasets: the LSUN-Car (512 × 384), LSUN-Cat
Quantitative Results on Defective Image Embedding (256 × 256) and LSUN-Bedroom (256 × 256) datasets. The
Table 3 shows the corresponding quantitative results on de- embedding results are shown in Figure 18. It can be ob-
fective image embedding (Figure 3 in the main paper). The served that the quality of the embedding is poor compared
results show that compared to non-defective faces, the em- to that of the StyleGAN trained on the FFHQ dataset. The
bedded images of defective faces are farther from the mean linear interpolation (image morphing) results of LSUN-Cat,
face. This reaffirms that the valid faces form a cluster LSUN-Car, and LSUN-Bedroom StyleGANs are shown in
around the mean face. Figure 19 (a), (b) and (c) respectively. Interestingly, we
observed that linear interpolation fails on the LSUN-Cat
and LSUN-Car StyleGANs. Recall that the FFHQ human
Inherent Circular Artifacts of StyleGAN Interestingly, face dataset is of very high quality in terms of scale, align-
we observed that the StyleGAN model trained on the FFHQ ment, color, poses etc., we believe that the low quality of
Figure 13: Additional Embedding Results into W + space. Left column: the original images. Middle column: the embedded
images with random latent code initialization. Right column: the embedded images with w̄ latent code initialization.
(a) (b) (c) (d) (e) (f) (g)

Figure 14: Additional results on the justification of latent space choice.(a) Original images. Embedding results into the
original space W : (b) using random weights in the network layers; (c) with w̄ initialization; (d) with random initialization.
Embedding results into the W + space: (e) using random weights in the network layers; (f) with w̄ initialization; (g) with
random initialization.

the LSUN datasets is the source of such failure. In other Justification of Loss Function Choice Figure 22 vali-
words, the quality of the data distribution is one of the key dates the algorithmic choice of the loss function used in the
components to learn a meaningful model distribution. main paper. It can be observed that (i) matching the image
features at multiple layers of the VGG-16 network works
Additional Results on the Justification of Latent Space better than at a single layer; (ii) the combination of pixel-
Choice Figure 14 shows additional results (cat, dog, car) wise MSE loss and perceptual loss works the best.
on the justification of our choice of latent space W + . Sim-
ilar to the main paper, we can observe that: (i) embedding Influence of Noise Channels Figure 16 shows that
into W directly does not give reasonable results; (ii) the restarting the embedding with a different noise leads to sim-
learned network weights is important to good embeddings. ilar results. In addition, we observed significantly worse
quality when resampling the noise during the embedding
(at each update step). To this end, we kept the noise chan-
Clustering or Scattering? To support our insight that
nel constant during the embedding for all our experiments.
only face images form a cluster in the latent space, we com-
pute the L2 distances between the embeddings of all pairs 8. Additional Results on Applications
of test images (Figure 20). It can be observed that the dis-
tances between the faces are relatively smaller than those of Figure 15 shows additional results of the image morph-
other classes, which justifies that they are close to each other ing. Figure 23 shows the complete table of the style transfer
in the W + space and form a cluster. For images in other results between different classes. The results support our in-
classes, especially the paintings, the pairwise distances are sight that the multi-class embedding works by using an un-
much higher. This implies that they are scattered in the la- derlying human face structure (encoded in the first couple of
tent space. layers) and painting powerful styles onto it (encoded in the
Figure 15: Additional morphing results between two embedded images (the left-most and right-most ones).
Figure 16: Image embedding using different constant
noises.

latter layers). Figure 25 shows additional results on the ex-

pression transfer. We also include an accompanying video
in the supplementary material to show it works with noisy
images taken by a commodity camera in a typical office en-
vironment. The random walk results (of two classes ‘hu-
man faces’ and ‘cars’) from the embedded image towards
the mean face image are also shown in videos.
Figure 17: The collected 25 images of our dataset. First row: human faces. Second row: cats. Third row: dogs. Fourth row:
cars. Fifth row: paintings.
(a)
(a)

(b)
(b)

(c)

Figure 18: Embedding results of StyleGANs trained on

(a) LSUN-Car, (b) LSUN-Cat and (c) LSUN-Bedroom
datasets. For each subfigure, the first row shows the em-
(c)
bedding results of the images in 5 different classes in our
dataset. The second row shows the embedding results of Figure 19: Results on linear interpolations (image morph-
the images of the corresponding class in our dataset (“cars” ing) in the latent spaces of StyleGANs trained on (a) LSUN-
in (a) and “cats” in (b)). Note that (c) has only one row Cat (b) LSUN-Car (c) LSUN-Bedroom datasets.
because we did not collect bedroom images in our dataset.
Figure 20: Heat map of the inter- and intra-class L2 distances between embedded images.
Figure 21: Inherent circular artifacts of StyleGAN. First row: circular artifacts in the embeded images. Second and third
rows: randomly generated images. Left column: images with circular artifacts. Right column: highlighted artifacts by
zooming in their local neighbourhood.
Figure 22: Additional results of the algorithmic choice justification on the loss function. Each row shows the results of an
image from the five different classes in our test dataset respectively. From left to right, each column shows: (1) the original
image; (2) pixel-wise MSE loss only; (3) perceptual loss on VGG-16 conv3 2 layer only; (4) pixel-wise MSE loss and
VGG-16 conv3 2; (5) perceptual loss only; (6) our loss function .
Figure 23: Complete table of the style transfer results. Left-most column: the embedded style image. First row: the embedded
content images.
(a)
(b)

Figure 25: Additional results on expression transfer. In each subfigure, the first row shows the reference images from
IMPA-FACES3D [24] dataset; in the following rows, the middle image in each of the examples is the embedded image,
whose expression is gradually transferred to the reference expression (on the right) and the opposite direction (on the left)
respectively.