
UVCGAN: UNet Vision Transformer cycle-consistent GAN for unpaired

image-to-image translation

Dmitrii Torbunov, Yi Huang, Haiwang Yu, Jin Huang,


Shinjae Yoo, Meifeng Lin, Brett Viren, Yihui Ren
Brookhaven National Laboratory, Upton, NY, USA
dtorbunov,yhuang2,hyu,jhuang,sjyoo,mlin,bviren,[email protected]
arXiv:2203.02557v3 [cs.CV] 18 Oct 2022

Abstract

Unpaired image-to-image translation has broad applications in art, design, and scientific simulations. One early breakthrough was CycleGAN, which emphasizes one-to-one mappings between two unpaired image domains via generative adversarial networks (GAN) coupled with the cycle-consistency constraint, while more recent works promote one-to-many mappings to boost the diversity of the translated images. Motivated by scientific simulation and the need for one-to-one mappings, this work revisits the classic CycleGAN framework and boosts its performance to outperform more contemporary models without relaxing the cycle-consistency constraint. To achieve this, we equip the generator with a Vision Transformer (ViT) and employ the necessary training and regularization techniques. Compared to previous best-performing models, our model performs better and retains a strong correlation between the original and translated images. An accompanying ablation study shows that both the gradient penalty and self-supervised pre-training are crucial to the improvement. To promote reproducibility and open science, the source code, hyperparameter configurations, and pre-trained models are available at https://github.com/LS4GAN/uvcgan.

1. Introduction

Deep generative models such as generative adversarial networks (GAN) [30, 11, 41], variational autoencoders (VAE) [44, 45], normalizing flows (NF) [26, 43], and diffusion models (DM) [37, 70, 55] represent a class of statistical models used to create realistic and diverse data instances that mimic ones from a target data domain. Along with applications in image processing, audio analysis, and text generation, their success and expressiveness have attracted researchers in the natural sciences, including cosmology [54], high-energy physics [25, 2], materials design [29], and drug design [22, 9]. Most existing work treats deep generative models as drop-in replacements for existing simulation software. Modern simulation frameworks can generate data with high fidelity, yet the data are imperfect. Widespread systematic inconsistencies between the generated and actual data significantly limit the applicability of simulation results. We would like to take advantage of the expressiveness of deep generative models to bridge this simulation-versus-reality gap. We frame the task as an unpaired image-to-image translation problem, where simulation results are defined as one domain and experimental data as the other. Unpaired is a necessary constraint because gathering simulation and experiment data with an exact pixel-to-pixel mapping is difficult (often impossible). Apart from improving the quality of the simulation results, a successful generative model can be run in the inverse direction to translate real-world data into the simulation domain. This inverse task can be viewed as a denoising step, helpful toward correctly inferring the underlying parameters from experimental observations [20]. Achieving realistic scientific simulations requires both well-defined scientific datasets and purposefully designed machine learning models. This work focuses on the latter by developing novel models for unpaired image-to-image translation.
The CycleGAN [82] model is the first of its kind to translate images between two domains without paired instances. It uses two GANs, one for each translation direction. CycleGAN introduces a cycle-consistency loss, where an image should look like itself after a cycle of translations to the other domain and back. Such cycle-consistency is of utmost importance for scientific applications, as the science cannot be altered during translation. Namely, there should be a one-to-one mapping between a simulation result and its experimental counterpart. However, to promote more diverse image generation, many recent works [80, 56, 61, 79] relaxed the cycle-consistency constraint. Following the same objective of revisiting and modifying canonical neural architectures [8], we demonstrate that by equipping CycleGAN with a Vision Transformer (ViT) [28] to boost non-local pattern learning and employing advanced training techniques, such as gradient penalty and self-supervised pre-training, the resulting model, named UVCGAN, can outperform competing models on several benchmark datasets.

Contributions. In this work, we: 1) incorporated a ViT into the CycleGAN generator and employed advanced training techniques, 2) demonstrated its superb image translation performance versus other, heavier models, 3) showed via an ablation study that the architecture change alone is insufficient to compete with other methods and that pre-training and the gradient penalty are needed, and 4) identified mismatched evaluation results in past literature and standardized the evaluation procedure to ensure a fair comparison and promote the reusability of our benchmarking results.

2. Related work

Deep Generative Models. Deep generative models create realistic data points (images, molecules, audio samples, etc.) that are similar to those presented in a dataset. Unlike decision-making models that contract the representation dimension and distill high-level information, generative models enlarge the representation dimension and extrapolate information. There are several types of deep generative models. A VAE [44, 45, 48, 64] reduces data points into a probabilistic latent space and reconstructs them from samples of the latent distributions. NFs [26, 43, 14, 31] make use of the change-of-variables formula and transform samples from a normal distribution to the data distribution via a sequence of invertible and differentiable transformations. DMs [37, 70, 55, 66, 69, 76] are parameterized Markov chains trained to transform noise into data (forward process) via successive steps. Meanwhile, GANs [30] formulate the learning process as a minimax game, where the generator tries to fool the discriminator by creating realistic data points, and the discriminator attempts to distinguish the generated samples from the real ones. GANs are among the most expressive and flexible models and can generate high-resolution, diverse, style-specific images [11, 41].

GAN Training Techniques. The original GAN suffered from many problems, such as mode collapse and training divergence [52]. Since then, much work has been done to improve training stability and model diversity. ProGAN [40] introduces two stabilization methods: progressive training and learning rate equalization. Progressive training of the generator starts from low-resolution images and moves up to high-resolution ones. The learning rate equalization scheme seeks to ensure that all parts of the model are trained at the same rate. Wasserstein GAN [34] suggests that the destructive competition between the generator and discriminator can be prevented by using a better loss function, i.e., the Wasserstein loss function. Its key ingredient is a gradient penalty term that prevents the magnitude of the discriminator gradients from growing too large. However, the Wasserstein loss function was later reexamined. Notably, the assessment revealed that the gradient penalty term, not the Wasserstein loss function, was responsible for stabilizing the training [71]. In addition, StyleGAN v2 [41] relies on a zero-centered gradient penalty term to achieve state-of-the-art results on a high-resolution image generation task. These findings motivated this work to explore applying gradient penalty terms to improve GAN training stability.

Transformer Architecture for Computer Vision. The convolutional neural network (CNN) architecture is a popular choice for computer vision tasks. In the natural language processing (NLP) field, the attention mechanism and transformer-style architectures have surpassed previous models, such as hidden Markov models and recurrent neural networks, on open benchmark tasks. Compared to CNNs, transformers can more efficiently capture non-local patterns, which are common in nature. Applications of transformers in computer vision debuted in [28], while other recent work has shown that a CNN-transformer hybrid can achieve better performance [77, 35].

Self-supervised Pre-training. Self-supervised pre-training primes a network's initial weights by training the network on artificial tasks derived from the original data without supervision. This is especially important for training models with a large number of parameters on a small labeled dataset, as they tend to overfit. There are many innovative ways to create these artificial self-supervision tasks. Examples in computer vision include image inpainting [62], solving jigsaw puzzles [58], predicting image rotations [46], multitask learning [27], contrastive learning [15, 16], and teacher-student latent bootstrapping [33, 13]. Common pre-training methods in NLP include the auto-regressive [63] and mask-filling [24] tasks. In the mask-filling task, some parts of a sentence are masked, and the network is tasked with predicting the missing parts from their context. Once a model is pre-trained, it can be fine-tuned for multiple downstream tasks using much smaller labeled datasets.
We hypothesize that GAN training can also benefit from self-supervised pre-training. In particular, GAN training is known to suffer from the "mode collapse" problem [52]: the generator fails to reproduce the target distribution of images faithfully. Instead, only a small set of images is generated repeatedly despite diverse input samples. Observations have noted that the mode collapse problem occurs just a few epochs after beginning the GAN training [40]. This suggests that better-initialized model weights could help. Indeed, transfer learning of GANs, a form of pre-training, has been an effective way to improve GAN performance on small training datasets [75, 57, 78, 74, 32]. However, scientific data, such as those in cosmology and high-energy physics, are only remotely similar to natural images. Therefore, we have chosen only to pre-train generators on a self-supervised inpainting task, which has been successful in both NLP and computer vision. Moreover, it is well suited for image-to-image translation models, where the model's output shape is the same as its input shape.

GAN Models for Unpaired Image-to-image Translation. Many frameworks [38, 47, 56, 80] have been developed for unpaired image-to-image translation. While most commonly use GANs for translation, they differ in how consistency is maintained. U-GAT-IT [42] follows CycleGAN closely but relies on more sophisticated generator and discriminator networks for better performance. Other models relax the cycle-consistency constraint. For example, ACL-GAN [80] relaxes the per-image consistency constraint by introducing the so-called "adversarial-consistency loss" that imposes cycle-consistency at a distribution level between a neighborhood of the input and the translations. Meanwhile, Council-GAN [56] abandons the idea of explicit consistency enforcement and instead relies on a generator ensemble with the assumption that, when multiple generators arrive at an agreement, the commonly agreed upon portion is what should be kept consistent. While relaxed or implicit consistency constraints boost translation diversity and achieve better evaluation scores, such models inevitably introduce randomness into the feature space and output. Hence, they are unsuitable for applications where a one-to-one mapping is required. Compared to the original CycleGAN, all these models contain more parameters, requiring more computational resources and longer training times. Concurrently, Zheng et al. [81] also proposed to utilize a ViT for image translation by replacing the ResNet blocks with hybrid blocks of self-attention and convolution.

3. Method

3.1. CycleGAN-like Models

Figure 1. CycleGAN framework: two generator-discriminator pairs (G_{A→B}, G_{B→A}, D_A, D_B) coupled through generator, discriminator, cycle-consistency, and identity losses.

CycleGAN-like models [82, 42] interlace two generator-discriminator pairs for unpaired image-to-image translation (Figure 1). Denote the two image domains by A and B. A CycleGAN-like model uses generator G_{A→B} to translate images from A to B, and generator G_{B→A} from B to A. Discriminator D_A is used to distinguish between images in A and those translated from B (denoted as A_f in Figure 1), and discriminator D_B between B and B_f.

The discriminators are updated by backpropagating the loss corresponding to failure in distinguishing real and translated images (called generative adversarial loss or GAN loss):

    \mathcal{L}_{\text{disc},A} = \mathbb{E}_{x \sim B}\, \ell_{\text{GAN}}\big( \mathcal{D}_A(G_{B \to A}(x)),\, 0 \big) + \mathbb{E}_{x \sim A}\, \ell_{\text{GAN}}\big( \mathcal{D}_A(x),\, 1 \big),    (1)
    \mathcal{L}_{\text{disc},B} = \mathbb{E}_{x \sim A}\, \ell_{\text{GAN}}\big( \mathcal{D}_B(G_{A \to B}(x)),\, 0 \big) + \mathbb{E}_{x \sim B}\, \ell_{\text{GAN}}\big( \mathcal{D}_B(x),\, 1 \big).    (2)

Here, ℓ_GAN can be any classification loss function (L2, cross-entropy, Wasserstein [5], etc.), while 0 and 1 are the class labels for translated (fake) and real images, respectively. The generators are updated by backpropagating the loss from three sources: GAN loss, cycle-consistency loss, and identity-consistency loss. Using G_{A→B} as an example:

    \mathcal{L}_{\text{GAN},A} = \mathbb{E}_{x \sim A}\, \ell_{\text{GAN}}\big( \mathcal{D}_B(G_{A \to B}(x)),\, 1 \big),    (3)
    \mathcal{L}_{\text{cyc},A} = \mathbb{E}_{x \sim A}\, \ell_{\text{reg}}\big( G_{B \to A}(G_{A \to B}(x)),\, x \big),    (4)
    \mathcal{L}_{\text{idt},A} = \mathbb{E}_{x \sim A}\, \ell_{\text{reg}}\big( G_{B \to A}(x),\, x \big).    (5)

And,

    \mathcal{L}_{\text{gen},A \to B} = \mathcal{L}_{\text{GAN},A} + \lambda_{\text{cyc}}\, \mathcal{L}_{\text{cyc},A} + \lambda_{\text{idt}}\, \mathcal{L}_{\text{idt},A},    (6)
    \mathcal{L}_{\text{gen},B \to A} = \mathcal{L}_{\text{GAN},B} + \lambda_{\text{cyc}}\, \mathcal{L}_{\text{cyc},B} + \lambda_{\text{idt}}\, \mathcal{L}_{\text{idt},B}.    (7)

Here, ℓ_reg can be any regression loss function (L1 or L2, etc.), and λ_cyc and λ_idt are combination coefficients.

To improve the original CycleGAN model's performance, we implement three major changes. First, we modify the generator to have a hybrid architecture based on a UNet with a ViT bottleneck (Section 3.2). Second, to regularize the CycleGAN discriminator, we augment the vanilla CycleGAN discriminator loss with a gradient penalty term (Section 3.3). Finally, instead of training from randomly initialized network weights, we pre-train the generators in a self-supervised fashion on an image inpainting task to obtain a better starting state (Section 3.4).
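To make the loss structure concrete, the following PyTorch sketch assembles the terms of Eqs. (1)-(7). It is a minimal illustration rather than the released implementation: the module names (gen_ab, gen_ba, disc_a, disc_b), the use of a mean-squared error for ℓ_GAN (the LSGAN choice of Section 3.3), and an L1 ℓ_reg are assumptions made for the example.

import torch
import torch.nn.functional as F

def lsgan_loss(pred, target_value):
    # L2 classification loss (LSGAN); target_value is 0 for fake and 1 for real images.
    return F.mse_loss(pred, torch.full_like(pred, target_value))

def discriminator_losses(gen_ab, gen_ba, disc_a, disc_b, real_a, real_b):
    # Eq. (1)-(2): each discriminator scores real images as 1 and translations as 0.
    fake_a = gen_ba(real_b).detach()   # B -> A translation, no generator gradients
    fake_b = gen_ab(real_a).detach()   # A -> B translation
    loss_disc_a = lsgan_loss(disc_a(fake_a), 0.0) + lsgan_loss(disc_a(real_a), 1.0)
    loss_disc_b = lsgan_loss(disc_b(fake_b), 0.0) + lsgan_loss(disc_b(real_b), 1.0)
    return loss_disc_a, loss_disc_b

def generator_losses(gen_ab, gen_ba, disc_a, disc_b, real_a, real_b,
                     lambda_cyc=10.0, lambda_idt=5.0):
    # Eq. (3)-(7): adversarial + cycle-consistency + identity terms
    # (lambda_idt = lambda_cyc / 2, following Section 4.2).
    fake_b = gen_ab(real_a)
    fake_a = gen_ba(real_b)
    # adversarial terms: fool the discriminator of the target domain (label 1)
    loss_gan_a = lsgan_loss(disc_b(fake_b), 1.0)
    loss_gan_b = lsgan_loss(disc_a(fake_a), 1.0)
    # cycle-consistency: A -> B -> A and B -> A -> B must return the input
    loss_cyc_a = F.l1_loss(gen_ba(fake_b), real_a)
    loss_cyc_b = F.l1_loss(gen_ab(fake_a), real_b)
    # identity terms: a generator applied to images of its own output domain changes nothing
    loss_idt_a = F.l1_loss(gen_ba(real_a), real_a)
    loss_idt_b = F.l1_loss(gen_ab(real_b), real_b)
    loss_gen_ab = loss_gan_a + lambda_cyc * loss_cyc_a + lambda_idt * loss_idt_a
    loss_gen_ba = loss_gan_b + lambda_cyc * loss_cyc_b + lambda_idt * loss_idt_b
    return loss_gen_ab, loss_gen_ba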
3.2. UNet-ViT Generator

Figure 2. Schematic diagrams of UVCGAN. A. UNet-ViT generator; B. pixel-wise ViT; C. basic block; D. positional embedding (PE); E. feed-forward network (FFN).

A UNet-ViT generator consists of a UNet [67] with a pixel-wise Vision Transformer (ViT) [28] at the bottleneck (Figure 2A). UNet's encoding path extracts features from the input via four layers of convolution and downsampling. The features extracted at each layer are also passed to the corresponding layers of the decoding path via skip connections, whereas the bottom-most features are passed to the ViT. We hypothesize that the skip connections are effective in passing high-frequency features to the decoder, and the ViT provides an effective means to learn pairwise relationships of low-frequency features.

On the encoding path of the UNet, the pre-processing layer turns an image into a tensor with dimensions (w0, h0, f0). A pre-processed tensor has its width and height halved at each down-sampling block, while the feature dimension is doubled at the last three down-sampling blocks. The output from the encoding path, with dimensions (w, h, f) = (w0/16, h0/16, 8f0), forms the input to the pixel-wise ViT bottleneck.

A pixel-wise ViT (Figure 2B) is composed primarily of a stack of Transformer encoder blocks [24]. To construct an input to the stack, the ViT first flattens an encoded image along the spatial dimensions to form a sequence of tokens. The token sequence has length w × h, and each token in the sequence is a vector of length f. It then concatenates each token with its two-dimensional Fourier positional embedding [4] of dimension f_p (Figure 2D) and linearly maps the result to have dimension f_v. To improve the Transformer convergence, we adopt the rezero regularization scheme [6] and introduce a trainable scaling parameter α that modulates the magnitudes of the nontrivial branches of the residual blocks. The output from the Transformer stack is linearly projected back to have dimension f and unflattened to have width w and height h. In this study, we use 12 Transformer encoder blocks and set f, f_p, and f_v to 384, with f_h = 4 f_v for the feed-forward network in each block (Figure 2E).
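A simplified, self-contained sketch of the pixel-wise ViT bottleneck described above is given below. It follows the text (Fourier positional embedding, rezero-scaled residual branches, f = f_p = f_v = 384, f_h = 4 f_v, 12 blocks) but is not the released code; in particular, the number of attention heads, the exact positional frequencies, and the class and function names are illustrative assumptions.

import math
import torch
import torch.nn as nn

class ReZeroEncoderBlock(nn.Module):
    """Transformer encoder block whose residual branches are scaled by a single
    trainable parameter alpha (rezero), initialized to zero."""
    def __init__(self, dim=384, heads=6, ffn_dim=4 * 384):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn  = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.ffn   = nn.Sequential(nn.Linear(dim, ffn_dim), nn.GELU(),
                                   nn.Linear(ffn_dim, dim))
        self.alpha = nn.Parameter(torch.zeros(1))   # rezero scaling of residual branches

    def forward(self, x):
        h = self.norm1(x)
        a, _ = self.attn(h, h, h)
        x = x + self.alpha * a
        x = x + self.alpha * self.ffn(self.norm2(x))
        return x

def fourier_position_embedding(h, w, dim):
    """2D sine/cosine embedding of the (row, col) coordinates, `dim` features per token."""
    ys, xs = torch.meshgrid(torch.linspace(0, 1, h), torch.linspace(0, 1, w), indexing='ij')
    coords = torch.stack([ys.flatten(), xs.flatten()], dim=-1)               # (h*w, 2)
    freqs  = torch.arange(1, dim // 4 + 1, dtype=torch.float32) * math.pi    # (dim/4,)
    angles = coords.unsqueeze(-1) * freqs                                    # (h*w, 2, dim/4)
    emb    = torch.cat([angles.sin(), angles.cos()], dim=-1)                 # (h*w, 2, dim/2)
    return emb.flatten(1)                                                    # (h*w, dim)

class PixelwiseViT(nn.Module):
    def __init__(self, f=384, fp=384, fv=384, depth=12):
        super().__init__()
        self.embed   = nn.Linear(f + fp, fv)   # token concatenated with PE -> f_v
        self.blocks  = nn.Sequential(*[ReZeroEncoderBlock(fv) for _ in range(depth)])
        self.unembed = nn.Linear(fv, f)        # project back to the UNet feature size
        self.fp = fp

    def forward(self, x):                      # x: (batch, f, h, w) from the UNet encoder
        b, f, h, w = x.shape
        tokens = x.flatten(2).transpose(1, 2)  # flatten spatial dims -> (batch, h*w, f)
        pos = fourier_position_embedding(h, w, self.fp).to(x).expand(b, -1, -1)
        tokens = self.embed(torch.cat([tokens, pos], dim=-1))
        tokens = self.blocks(tokens)
        out = self.unembed(tokens).transpose(1, 2).reshape(b, f, h, w)
        return out                             # unflattened back to (batch, f, h, w)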

and height halved at each down-sampling block, while the GP form introduced in [40] with the following formula for
feature dimension doubled at the last three down-sampling loss of DA :
blocks. The output from the encoding path with dimen-
sion (w, h, f ) = (w0 /16, h0 /16, 8f0 ) forms the input to
\label {eq:dist_loss_gp} \mathcal {L}^{\text {GP}}_{\text {disc}, A} = \mathcal {L}_{\text {disc}, A} + \lambda _\text {GP} \mathbb {E} \left [\frac { \left ( \| \nabla _x \mathcal {D}_A(x) \|_2 - \gamma \right )^2}{\gamma ^2} \right ], (8)
the pixel-wise ViT bottleneck.
A pixel-wise ViT (Figure 2B) is composed primarily of
a stack of Transformer encoder blocks [24]. To construct an where Ldisc,A is defined as in Eq. (1), and LGP
disc,B follows
input to the stack, the ViT first flattens an encoded image the same form. In our experiments, this γ-centered GP reg-
along the spatial dimensions to form a sequence of tokens. ularization provides more stable training and less sensitive
The token sequence has length w × h, and each token in the to the hyperparameter choices. To see the effect of GP on
sequence is a vector of length f . It then concatenates each model performance, refer to the ablation study detailed in
token with its two-dimensional Fourier positional embed- Section 5.3 and Appendix Section 1.
ding [4] of dimension fp (Figure 2D) and linearly maps the
result to have dimension fv . To improve the Transformer 3.4. Self-Supervised Pre-training by Inpainting
convergence, we adopt the rezero regularization [6] scheme Pre-training is an effective way to prime large networks
and introduce a trainable scaling parameter α that modu- for downstream tasks [24, 7] that often can bring significant
lates the magnitudes of the nontrivial branches of the resid- improvement over random initialization. In this work, we
ual blocks. The output from the Transformer stack is lin- pre-train the UVCGAN generators on an image inpainting
early projected back to have dimension f and unflattened task. More precisely, we tile images with non-overlapping
to have width w and h. In this study, we use 12 Transform patches of size 32 × 32 and mask 40% of the patches by
encoder blocks and set f, fp , fv = 384, and fh = 4fv for setting their pixel values to zero. The generator is trained to
the feed-forward network in each block (Figure 2E). predict the original unmasked image using pixel-wise L1
loss. We consider two modes of pre-training: 1) on the
3.3. Discriminator Loss with Gradient Penalty (GP) same dataset where the subsequent image translation is to
In this study, we use the least squares GAN (LSGAN) be performed and 2) on the ImageNet [23] dataset. In Sec-
loss function [50] (i.e., ℓGAN is an L2 error) in Eq. (1)- tion 5.3, we conduct an ablation study on these two pre-
(7) and supplement the discriminator loss with a GP term. training modes together with no pre-training.
GP [34] originally was introduced to be used with Wasser-
stein GAN (WGAN) loss to ensure the 1-Lipschitz con- 4. Experiments
straint [5]. However, in our experiments, WGAN + GP
4.1. Benchmarking Datasets
yielded overall worse results, which echoes the findings
in [51, 52] We have also considered zero-centered GP [52]. To test UVCGAN’s performance, we have completed
In our case, zero-centered GP turned out to be very sensi- an extensive literature survey for benchmark datasets.
tive to the values of hyperparameters, and did not improve The most popular among them are datasets derived
the training stability. Therefore, we settle on a more generic from CelebA [49] and Flickr-Faces [59], as well as

4
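The γ-centered penalty of Eq. (8) amounts to a few lines of PyTorch. The sketch below is illustrative rather than the exact released implementation; the function name and the choice to evaluate the gradient on the real input batch are assumptions.

import torch

def gamma_centered_gradient_penalty(disc, x, gamma=100.0):
    """Penalize deviations of ||grad_x D(x)||_2 from gamma, normalized by gamma^2 (Eq. 8)."""
    x = x.clone().requires_grad_(True)
    scores = disc(x)
    grads, = torch.autograd.grad(outputs=scores.sum(), inputs=x, create_graph=True)
    grad_norm = grads.flatten(start_dim=1).norm(2, dim=1)
    return ((grad_norm - gamma) ** 2 / gamma ** 2).mean()

# Usage inside the discriminator update (lambda_GP = 0.1, gamma = 100 per Section 4.2):
# loss_disc_a = loss_disc_a + 0.1 * gamma_centered_gradient_penalty(disc_a, batch_a)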
3.4. Self-Supervised Pre-training by Inpainting

Pre-training is an effective way to prime large networks for downstream tasks [24, 7] and often brings significant improvement over random initialization. In this work, we pre-train the UVCGAN generators on an image inpainting task. More precisely, we tile images with non-overlapping patches of size 32 × 32 and mask 40% of the patches by setting their pixel values to zero. The generator is trained to predict the original unmasked image using a pixel-wise L1 loss. We consider two modes of pre-training: 1) on the same dataset where the subsequent image translation is to be performed and 2) on the ImageNet [23] dataset. In Section 5.3, we conduct an ablation study on these two pre-training modes together with no pre-training.

4. Experiments

4.1. Benchmarking Datasets

To test UVCGAN's performance, we have completed an extensive literature survey for benchmark datasets. The most popular among them are datasets derived from CelebA [49] and Flickr-Faces [59], as well as the SYNTHIA/GTA-to-Cityscapes [68, 18, 65], photo-to-painting [82], Selfie2Anime [42], and animal face [17] datasets. We prioritize our effort on the Selfie2Anime dataset and two others derived from the CelebA dataset: gender swap (denoted as GenderSwap) and adding and removing eyeglasses (denoted as Eyeglasses), which have been used in recent papers.

Selfie2Anime [42] is a small dataset with 3.4K images in each domain. Both the GenderSwap and Eyeglasses tasks are derived from CelebA [49] based on the gender and eyeglasses attributes, respectively. GenderSwap contains about 68K male and 95K female images for training, while Eyeglasses includes 11K images with glasses and 152K without. For a fair comparison, we do not use CelebA's validation dataset for training. Instead, we combine it with the test dataset, following the convention of [56, 80]. Selfie2Anime contains images of size 256 × 256 that can be used directly. The CelebA datasets contain images of size 178 × 218, which we resize and crop to size 256 × 256 for UVCGAN training.

4.2. UVCGAN Training Procedures

Pre-training. The UVCGAN generators are pre-trained with self-supervised image inpainting. To construct impaired images, we tile images of size 256 × 256 into non-overlapping 32 × 32 pixel patches and randomly mask 40% of the patches by zeroing their pixel values. We use the Adam optimizer, a cosine-annealing learning-rate scheduler, and several standard data augmentations, such as small-angle random rotation, random cropping, random flipping, and color jittering. During pre-training, we do not distinguish the image domains, which means both generators in the ensuing translation training have the same initialization. In this work, we pre-train one generator on ImageNet, another on CelebA, and one on the Selfie2Anime dataset.

Image Translation Training. For all three benchmarking tasks, we train UVCGAN models for one million iterations with a batch size of one. We use the Adam optimizer with the learning rate kept constant at 0.0001 during the first half of the training and then linearly annealed to zero during the second half. We apply three data augmentations: resizing, random cropping, and random horizontal flipping. Before randomly cropping images to 256 × 256, we enlarge them from 256 × 256 to 286 × 286 for Selfie2Anime and from 178 × 218 to 256 × 313 for CelebA.

Hyperparameter search. The UVCGAN loss functions depend on four hyperparameters: λ_cyc, λ_GP, λ_idt, and γ, Eqs. (6)-(8). If the identity loss (λ_idt) is used, it is always set to λ_cyc/2, as suggested in [82]. To find the best-performing configuration, we run a small-scale hyperparameter optimization on a grid. Our experiments show that the best performance for all three benchmarking tasks is achieved with LSGAN + GP with (λ_GP = 0.1, γ = 100) and with generators pre-trained on the image translation dataset itself. The optimal λ_cyc differs slightly for CelebA and Selfie2Anime, at 5 and 10, respectively. An ablation study on hyperparameter tuning can be found in Section 5.3. More training details can also be found in the open-source repository [73].
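The patch masking used to build the inpainting inputs (non-overlapping 32 × 32 patches, 40% masked) can be written compactly as below. This is a hedged sketch of the procedure described above, not the released data pipeline; the helper name and sampling routine are assumptions, and the trailing comment notes one way the constant-then-linear learning-rate decay of this section could be expressed.

import torch

def mask_random_patches(images, patch=32, frac=0.4):
    """Zero out a random 40% of the non-overlapping 32x32 patches of each image."""
    b, c, h, w = images.shape                      # expects h and w divisible by `patch`
    gh, gw = h // patch, w // patch
    n_patches = gh * gw
    n_masked  = int(round(frac * n_patches))
    masked = images.clone()
    for i in range(b):
        idx = torch.randperm(n_patches)[:n_masked]
        for r, col in zip((idx // gw).tolist(), (idx % gw).tolist()):
            masked[i, :, r*patch:(r+1)*patch, col*patch:(col+1)*patch] = 0.0
    return masked

# Pre-training step: the generator reconstructs the unmasked image with an L1 loss.
# loss = torch.nn.functional.l1_loss(generator(mask_random_patches(batch)), batch)
#
# The translation-training schedule (constant for the first half of the iterations,
# then linear decay to zero) could be expressed, for example, as:
# torch.optim.lr_scheduler.LambdaLR(opt, lambda it: min(1.0, 2.0 * (1.0 - it / total_iters)))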
4.3. Other Model Training Details

To fairly represent other models' performance, we strive to reproduce the trained models following three principles. First, if a pre-trained model for a dataset exists, we use it directly. Second, in the absence of a pre-trained model, we train the model from scratch using a configuration file (if provided), following a description in the original paper, or using a hyperparameter configuration for a similar task. Third, we keep the source code "as is" unless it is absolutely necessary to make changes. In addition, we have conducted a small-scale hyperparameter tuning on models lacking hyperparameters for certain translation directions (Appendix B). Post-processing and evaluation choices also affect the reported performance (Section 5.2).

ACL-GAN [1] provides a configuration file for the GenderSwap dataset. For the configuration files for Eyeglasses and Selfie2Anime, we copy the settings for GenderSwap except for the four key parameters λ_acl, λ_mask, δ_min, and δ_max, which we modify according to the paper [80, p. 8, Training Details]. Because ACL-GAN does not train two generators jointly, we train a model for each direction for all datasets. Council-GAN [19] provides models for all datasets but only in one direction (selfie to anime, male to female, removing glasses). The pre-trained models output images of size 256 for GenderSwap and Selfie2Anime and 128 for Eyeglasses. For a complete comparison, we train models for the missing directions using the same hyperparameters as the existing ones, with the exception of Eyeglasses, for which we train a model for adding glasses at image size 256. CycleGAN [21] models are trained from scratch with the default settings (resnet_9blocks generators and LSGAN losses, batch size 1, etc.). Because the original CycleGAN uses square images, we add a pre-processing step for CelebA that scales up the shorter edge to 256 while maintaining the aspect ratio, followed by a 256 × 256 random cropping. U-GAT-IT [72] provides the pre-trained model for Selfie2Anime, which is used directly. For the two CelebA datasets, models are trained using the default hyperparameters.

Table 1 depicts the training time (in hours) for the various models on the CelebA datasets using an NVIDIA RTX A6000 GPU. The times correspond to training the models with a batch size of 1 for one million iterations. U-GAT-IT's long training time is due to the large number of loss function terms that must be computed, as well as the large size of the generators and discriminators. For Council-GAN, the time stems from training an ensemble of generators, each with its own discriminator, in addition to the domain discriminators. More details are available in the open-source repository [3].

Table 1. Training time. CycleGAN, U-GAT-IT, and UVCGAN train two generators jointly. ACL-GAN and Council-GAN's generators are trained separately for each direction. The time shown is for training both directions.

Algorithm      Time (hrs)   Jointly Trained   # Param.
ACL-GAN        ~86                            ~55M
Council-GAN    ~600                           ~116M
CycleGAN       ~40          ✓                 ~28M
U-GAT-IT       ~140         ✓                 ~671M
UVCGAN         ~60          ✓                 ~68M

5. Results

5.1. Evaluation Metrics

Fréchet Inception Distance (FID) [36] and Kernel Inception Distance (KID) [10] are the two most accepted metrics used for evaluating image-to-image translation performance. A lower score means the translated images are more similar to those in the target domain. As shown in Table 2, our model offers better performance on most image-to-image translation tasks compared to existing models. As a CycleGAN-like model, ours produces translated images that correlate strongly with the input images in attributes such as hair color and face orientation (Figure 3), which is crucial for augmenting scientific simulations. In contrast, we observed that the translations produced by ACL-GAN and Council-GAN tend to be overly liberal with features that are not essential to accomplishing the translation (such as background color or hair color and length). We also note that although U-GAT-IT achieves lower scores in the anime-to-selfie task and produces translations that better resemble human faces, they are less correlated with the input and sometimes miss key features of the input (such as a headdress or glasses) that we want to preserve. More samples of larger sizes are provided in the Supplementary material.

Table 2. FID and KID scores. Lower is better.

                 Selfie to Anime           Anime to Selfie
                 FID    KID (x100)         FID    KID (x100)
ACL-GAN          99.3   3.22 ± 0.26        128.6  3.49 ± 0.33
Council-GAN      91.9   2.74 ± 0.26        126.0  2.57 ± 0.32
CycleGAN         92.1   2.72 ± 0.29        127.5  2.52 ± 0.34
U-GAT-IT         95.8   2.74 ± 0.31        108.8  1.48 ± 0.34
UVCGAN           79.0   1.35 ± 0.20        122.8  2.33 ± 0.38

                 Male to Female            Female to Male
                 FID    KID (x100)         FID    KID (x100)
ACL-GAN          9.4    0.58 ± 0.06        19.1   1.38 ± 0.09
Council-GAN      10.4   0.74 ± 0.08        24.1   1.79 ± 0.10
CycleGAN         15.2   1.29 ± 0.11        22.2   1.74 ± 0.11
U-GAT-IT         24.1   2.20 ± 0.12        15.5   0.94 ± 0.07
UVCGAN           9.6    0.68 ± 0.07        13.9   0.91 ± 0.08

                 Remove Glasses            Add Glasses
                 FID    KID (x100)         FID    KID (x100)
ACL-GAN          16.7   0.70 ± 0.06        20.1   1.35 ± 0.14
Council-GAN      37.2   3.67 ± 0.22        19.5   1.33 ± 0.13
CycleGAN         24.2   1.87 ± 0.17        19.8   1.36 ± 0.12
U-GAT-IT         23.3   1.69 ± 0.14        19.0   1.08 ± 0.10
UVCGAN           14.4   0.68 ± 0.10        13.6   0.60 ± 0.08

5.2. Model Evaluation and Reproducibility

KID and FID for image-to-image translation are difficult to reproduce. For example, in [56, 80, 42], most FID and KID scores for the same task-model settings differ. We hypothesize that this is due to: 1) different sizes of the test data, as FID decreases with more data samples [10]; 2) differences in post-processing before testing; 3) different formulations of the metrics (e.g., KID in U-GAT-IT [42]); and 4) different FID and KID implementations. Therefore, we standardize the evaluations as follows: 1) we use the full test datasets for FID and KID (for the KID subset size, we use 50 for Selfie2Anime and 1000 for the two CelebA datasets); 2) we resize non-square CelebA images and take a central crop of size 256 × 256 to maintain the correct aspect ratio; and 3) we delegate all KID and FID calculations to the torch-fidelity package [60].

ACL-GAN follows a non-deterministic type of cycle consistency and can generate a variable number of translated images for an input. However, because a larger sample size improves the FID score [10], we generate one translated image per input for a fair comparison. To produce the test result, ACL-GAN resizes images from CelebA to have width 256 and outputs them without cropping. For FID and KID evaluation, we take the center 256 × 256 crop from the test output. Council-GAN resizes the images to have a width of 256, except for removing glasses, where it is 128 due to the pre-trained model provided. In order to follow the principle of using a pre-trained model if available and to maintain consistency in evaluating on images of size 256, we resize 128 to 256 during testing, which may be responsible for the large FID score. The reverse direction, adding glasses, is trained from scratch using an image size of 256. Its performance is similar to that of the other models. CycleGAN takes a random square crop for both training and testing. However, for a fair comparison, we modify the source code so that the test outputs are center crops. Since the original U-GAT-IT cannot handle non-square images, we modified the code to scale the shorter edge to 256 for the CelebA datasets.
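Under the standardized protocol above, the scores can be computed with the torch-fidelity package [60] roughly as in the sketch below. The directory paths are placeholders, and the keyword arguments reflect our reading of the package's interface rather than the exact evaluation script used here.

import torch_fidelity

# FID and KID between translated outputs and the target-domain test set.
# KID subset size: 50 for Selfie2Anime, 1000 for the CelebA-derived tasks (Section 5.2).
metrics = torch_fidelity.calculate_metrics(
    input1='path/to/translated_images',   # placeholder directory of 256x256 crops
    input2='path/to/target_domain_test',  # placeholder directory of real test images
    fid=True,
    kid=True,
    kid_subset_size=50,
)
print(metrics['frechet_inception_distance'], metrics['kernel_inception_distance_mean'])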
Figure 3. Samples of unpaired image-to-image translation for the selfie-to-anime, anime-to-selfie, male-to-female, female-to-male, remove-glasses, and add-glasses tasks. Each group shows the input followed by the ACL-GAN, Council-GAN, CycleGAN, U-GAT-IT, and UVCGAN outputs.

5.3. Ablation Studies

Table 3 summarizes the male-to-female and selfie-to-anime translation performance with respect to pre-training, GP, and identity loss. First, GP combined with identity loss improves performance. Second, without GP, the identity loss produces mixed results. Finally, pre-training on the same dataset improves performance, especially in conjunction with the GP and identity loss. Appendix A contains the complete ablation study results for all datasets.

Table 3. Ablation studies. The Pre-train Dataset column indicates which dataset the generator is pre-trained on (None for no pre-training; Same indicates CelebA for male-to-female and Selfie2Anime for selfie-to-anime).

Pre-train                   Male to Female            Selfie to Anime
Dataset     GP    idt       FID    KID (x100)         FID    KID (x100)
Same        ✓     ✓         9.6    0.68 ± 0.07        79.0   1.35 ± 0.20
ImageNet    ✓     ✓         11.0   0.85 ± 0.08        81.3   1.66 ± 0.21
None        ✓     ✓         11.0   0.85 ± 0.09        80.9   1.78 ± 0.20
Same        ✓               11.1   0.86 ± 0.08        83.9   1.88 ± 0.35
ImageNet    ✓               11.0   0.85 ± 0.08        84.3   1.77 ± 0.21
None        ✓               13.4   1.11 ± 0.09        115.4  6.85 ± 0.59
Same              ✓         14.2   1.22 ± 0.10        81.5   1.68 ± 0.22
ImageNet          ✓         14.5   1.23 ± 0.10        86.8   2.21 ± 0.25
None              ✓         14.4   1.26 ± 0.10        81.6   1.75 ± 0.25
Same                        12.7   1.06 ± 0.09        79.0   1.32 ± 0.19
ImageNet                    13.4   1.14 ± 0.10        91.2   2.63 ± 0.23
None                        18.3   1.63 ± 0.11        81.2   1.76 ± 0.21

We speculate that the GP is required to obtain the best performance with pre-trained networks because those networks provide a good starting point for the image translation task. However, at the beginning of fine-tuning, the discriminator is initialized with random values and provides a meaningless signal to the generator. This random signal may drive the generator away from the good starting point and undermine the benefits of pre-training.
5.4. Interpretation of Attention

Figure 4. Attention heatmaps generated from the attention weights of the 12 Transformer encoder blocks in the pixel-wise ViT, shown alongside the input and translated images. The heatmaps demonstrate the amount of attention different locations of an image receive.

Because the UVCGAN generator uses the transformer bottleneck, it is instructive to visualize its attention matrices to see if they help with generator interpretability. We plot (Figure 4) the attention weights produced by the multi-head self-attention (MSA) unit in each of the 12 Transformer encoder blocks in the bottleneck of the UVCGAN generators (Figure 2B). The (i, j)-entry of the attention matrix indicates how much attention token i is paying to token j, while the sum of row i is one. When multi-head attention is used, each head produces an attention matrix. For simplicity, we average the attention weights over all heads and target tokens for each block in the Transformer encoder stack. Given an input image of size 256 × 256, this provides an attention vector of dimension w × h (16 × 16 = 256). The j-th entry of such a vector indicates how much attention token j receives on average. Because the tokens represent overlapping patches in the original image, we generate a heatmap as follows: reshape a feature vector to a square of size 16 × 16, upscale it 16-fold to match the dimensions of the input image, then apply a Gaussian filter with σ = 16. By overlaying the attention heatmap on the input images, we note that each block is paying attention to a specific facial part, with the eye and mouth areas receiving the most attention. This echoes the findings of behavioral science experiments on statistical eye fixation (e.g., [12]), where the regions of interest also tend to be around the eyes and mouth, which may indicate that the model's attention is focused on the most informative and relevant regions.
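The heatmap construction described above reduces to a few tensor operations. The sketch below assumes access to a per-block attention tensor of shape (heads, tokens, tokens) and substitutes torchvision's Gaussian blur for the Gaussian filter; both the input format and the helper name are assumptions made for illustration.

import torch
import torch.nn.functional as F
from torchvision.transforms.functional import gaussian_blur

def attention_heatmap(attn, grid=16, upscale=16, sigma=16.0):
    """attn: (heads, tokens, tokens) attention weights of one encoder block.
    Returns a (grid*upscale, grid*upscale) heatmap of the attention each token receives."""
    received = attn.mean(dim=0).mean(dim=0)    # average over heads, then over the query dimension
    heat = received.reshape(1, 1, grid, grid)  # 256 tokens -> 16 x 16 spatial layout
    heat = F.interpolate(heat, scale_factor=upscale, mode='bilinear', align_corners=False)
    kernel = int(6 * sigma) | 1                # odd kernel size covering roughly 3 sigma per side
    heat = gaussian_blur(heat, kernel_size=[kernel, kernel], sigma=[sigma, sigma])
    return heat[0, 0]                          # overlay this map on the 256 x 256 input image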
6. Conclusion

This work introduces UVCGAN to promote cycle-consistent, content-preserving image translation and to effectively handle long-range spatial dependencies that remain a common problem in scientific domain research. Combined with self-supervised pre-training and GP regularization, UVCGAN outperforms competing methods on a diverse set of image translation benchmarks. The ablation study suggests that GP and the cycle-consistency loss work well with UVCGAN. Additional inspection of the attention weights indicates that our model focuses on the relevant regions of the source images. To further demonstrate the effectiveness of our model in handling long-distance patterns beyond benchmark datasets, more open scientific datasets are needed.

Potential Negative Social Impact. All data used in this work are publicly available. The environmental impact of training our model is greater than that of the original CycleGAN, yet considerably less than that of other advanced models. Although the motivation of our image-to-image translation work is to bridge the gap between scientific simulation and experiment, the authors are aware of its potential use for generating fake content [53]. Thankfully, there are countermeasures and detection tools [39] developed to counter such misuse. To contribute to such mitigation efforts, we have made our code and pre-trained models available.

Acknowledgement. The LDRD Program at Brookhaven National Laboratory, sponsored by DOE's Office of Science under Contract DE-SC0012704, supported this work.
References

[1] GitHub: ACL-GAN. https://github.com/hyperplane-lab/acl-gan.
[2] Yasir Alanazi, Nobuo Sato, Tianbo Liu, Wally Melnitchouk, Pawel Ambrozewicz, Florian Hauenstein, Michelle P. Kuchera, Evan Pritchard, Michael Robertson, Ryan Strauss, Luisa Velasco, and Yaohang Li. Simulation of electron-proton scattering events by a feature-augmented and transformed generative adversarial network (FAT-GAN). In Zhi-Hua Zhou, editor, Proceedings of the Thirtieth International Joint Conference on Artificial Intelligence, IJCAI-21, pages 2126–2132. International Joint Conferences on Artificial Intelligence Organization.
[3] GitHub: UVCGAN benchmarking algorithms. https://github.com/ls4gan/benchmarking.
[4] Ivan Anokhin, Kirill Demochkin, Taras Khakhulin, Gleb Sterkin, Victor Lempitsky, and Denis Korzhenkov. Image generators with conditionally-independent pixel synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14278–14287, 2021.
[5] Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein generative adversarial networks. In International Conference on Machine Learning, pages 214–223. PMLR, 2017.
[6] Thomas Bachlechner, Bodhisattwa Prasad Majumder, Henry Mao, Gary Cottrell, and Julian McAuley. ReZero is all you need: Fast convergence at large depth. In Uncertainty in Artificial Intelligence, pages 1352–1361. PMLR, 2021.
[7] Hangbo Bao, Li Dong, and Furu Wei. BEiT: BERT pre-training of image transformers. arXiv preprint arXiv:2106.08254, 2021.
[8] Irwan Bello, William Fedus, Xianzhi Du, Ekin Dogus Cubuk, Aravind Srinivas, Tsung-Yi Lin, Jonathon Shlens, and Barret Zoph. Revisiting ResNets: Improved training and scaling strategies. Advances in Neural Information Processing Systems, 34:22614–22627, 2021.
[9] Camille Bilodeau, Wengong Jin, Tommi Jaakkola, Regina Barzilay, and Klavs F. Jensen. Generative models for molecular discovery: Recent advances and challenges. n/a:e1608. https://onlinelibrary.wiley.com/doi/pdf/10.1002/wcms.1608.
[10] Mikołaj Bińkowski, Danica J. Sutherland, Michael Arbel, and Arthur Gretton. Demystifying MMD GANs. arXiv preprint arXiv:1801.01401, 2018.
[11] Andrew Brock, Jeff Donahue, and Karen Simonyan. Large scale GAN training for high fidelity natural image synthesis.
[12] Roberto Caldara and Sébastien Miellet. iMap: a novel method for statistical fixation mapping of eye movement data. Behavior Research Methods, 43(3):864–878, 2011.
[13] Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerging properties in self-supervised vision transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9650–9660, 2021.
[14] Ricky T. Q. Chen, Jens Behrmann, David Duvenaud, and Jörn-Henrik Jacobsen. Residual flows for invertible generative modeling. In Proceedings of the 33rd International Conference on Neural Information Processing Systems. Curran Associates Inc.
[15] Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework for contrastive learning of visual representations. In International Conference on Machine Learning, pages 1597–1607. PMLR.
[16] Xinlei Chen and Kaiming He. Exploring simple siamese representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15750–15758.
[17] Yunjey Choi, Youngjung Uh, Jaejun Yoo, and Jung-Woo Ha. StarGAN v2: Diverse image synthesis for multiple domains. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2020.
[18] Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo Rehfeld, Markus Enzweiler, Rodrigo Benenson, Uwe Franke, Stefan Roth, and Bernt Schiele. The Cityscapes dataset for semantic urban scene understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3213–3223, 2016.
[19] GitHub: Council-GAN. https://github.com/onr/council-gan.
[20] Kyle Cranmer, Johann Brehmer, and Gilles Louppe. The frontier of simulation-based inference. Proceedings of the National Academy of Sciences, 117(48):30055–30062, 2020.
[21] GitHub: CycleGAN. https://github.com/junyanz/pytorch-cyclegan-and-pix2pix.
[22] Payel Das, Tom Sercu, Kahini Wadhawan, Inkit Padhi, Sebastian Gehrmann, Flaviu Cipcigan, Vijil Chenthamarakshan, Hendrik Strobelt, Cicero dos Santos, Pin-Yu Chen, Yi Yan Yang, Jeremy P. K. Tan, James Hedrick, Jason Crain, and Aleksandra Mojsilovic. Accelerated antimicrobial discovery via deep generative models and molecular dynamics simulations. 5(6):613–623.
[23] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255, 2009.
[24] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding, 2019.
[25] Riccardo Di Sipio, Michele Faucci Giannelli, Sana Ketabchi Haghighat, and Serena Palazzo. DijetGAN: a generative-adversarial network approach for the simulation of QCD dijet events at the LHC. 2019(8):110.
[26] Laurent Dinh, Jascha Sohl-Dickstein, and Samy Bengio. Density estimation using real NVP.
[27] C. Doersch and A. Zisserman. Multi-task self-supervised visual learning. In 2017 IEEE International Conference on Computer Vision (ICCV), pages 2070–2079. IEEE Computer Society.
[28] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.
[29] Addis S. Fuhr and Bobby G. Sumpter. Deep generative models for materials discovery and machine learning-accelerated innovation. 9.
[30] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Z. Ghahramani, M. Welling, C. Cortes, N. Lawrence, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems, volume 27. Curran Associates, Inc., 2014.
[31] Will Grathwohl, Ricky T. Q. Chen, Jesse Bettencourt, Ilya Sutskever, and David Duvenaud. FFJORD: Free-form continuous dynamics for scalable reversible generative models.
[32] Timofey Grigoryev, Andrey Voynov, and Artem Babenko. When, why, and which pretrained GANs are useful? In International Conference on Learning Representations.
[33] Jean-Bastien Grill, Florian Strub, Florent Altché, Corentin Tallec, Pierre Richemond, Elena Buchatskaya, Carl Doersch, Bernardo Avila Pires, Zhaohan Guo, Mohammad Gheshlaghi Azar, et al. Bootstrap your own latent: a new approach to self-supervised learning. Advances in Neural Information Processing Systems, 33:21271–21284, 2020.
[34] Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, and Aaron Courville. Improved training of Wasserstein GANs, 2017.
[35] Jianyuan Guo, Kai Han, Han Wu, Chang Xu, Yehui Tang, Chunjing Xu, and Yunhe Wang. CMT: Convolutional neural networks meet vision transformers. arXiv preprint arXiv:2107.06263, 2021.
[36] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017.
[37] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models.
[38] Xun Huang, Ming-Yu Liu, Serge Belongie, and Jan Kautz. Multimodal unsupervised image-to-image translation. In Proceedings of the European Conference on Computer Vision (ECCV), pages 172–189, 2018.
[39] Felix Juefei-Xu, Run Wang, Yihao Huang, Qing Guo, Lei Ma, and Yang Liu. Countering malicious deepfakes: Survey, battleground, and horizon. International Journal of Computer Vision, pages 1–57, 2022.
[40] Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen. Progressive growing of GANs for improved quality, stability, and variation.
[41] Tero Karras, Samuli Laine, Miika Aittala, Janne Hellsten, Jaakko Lehtinen, and Timo Aila. Analyzing and improving the image quality of StyleGAN. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8110–8119, 2020.
[42] Junho Kim, Minjae Kim, Hyeonwoo Kang, and Kwang Hee Lee. U-GAT-IT: Unsupervised generative attentional networks with adaptive layer-instance normalization for image-to-image translation. In International Conference on Learning Representations, 2019.
[43] Durk P. Kingma and Prafulla Dhariwal. Glow: Generative flow with invertible 1x1 convolutions. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 31. Curran Associates, Inc.
[44] Diederik P. Kingma and Max Welling. Auto-encoding variational Bayes.
[45] Diederik P. Kingma and Max Welling. An introduction to variational autoencoders. 12(4):307–392.
[46] Nikos Komodakis and Spyros Gidaris. Unsupervised representation learning by predicting image rotations. In International Conference on Learning Representations (ICLR).
[47] Hsin-Ying Lee, Hung-Yu Tseng, Jia-Bin Huang, Maneesh Singh, and Ming-Hsuan Yang. Diverse image-to-image translation via disentangled representations. In Proceedings of the European Conference on Computer Vision (ECCV), pages 35–51, 2018.
[48] Qi Liu, Miltiadis Allamanis, Marc Brockschmidt, and Alexander L. Gaunt. Constrained graph variational autoencoders for molecule design. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, NIPS'18, pages 7806–7815. Curran Associates Inc.
[49] Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Deep learning face attributes in the wild. In Proceedings of the International Conference on Computer Vision (ICCV), December 2015.
[50] Xudong Mao, Qing Li, Haoran Xie, Raymond Y.K. Lau, Zhen Wang, and Stephen Paul Smolley. Least squares generative adversarial networks. In Proceedings of the IEEE International Conference on Computer Vision, pages 2794–2802, 2017.
[51] Xudong Mao, Qing Li, Haoran Xie, Raymond Y.K. Lau, Zhen Wang, and Stephen Paul Smolley. On the effectiveness of least squares generative adversarial networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 41(12):2947–2960, 2019.
[52] Lars Mescheder, Andreas Geiger, and Sebastian Nowozin. Which training methods for GANs do actually converge? In International Conference on Machine Learning, pages 3481–3490. PMLR, 2018.
[53] Yisroel Mirsky and Wenke Lee. The creation and detection of deepfakes: A survey. ACM Computing Surveys (CSUR), 54(1):1–41, 2021.
[54] Mustafa Mustafa, Deborah Bard, Wahid Bhimji, Zarija Lukić, Rami Al-Rfou, and Jan M. Kratochvil. CosmoGAN: creating high-fidelity weak lensing convergence maps using generative adversarial networks. 6(1):1.
[55] Alex Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishkin, Bob McGrew, Ilya Sutskever, and Mark Chen. GLIDE: Towards photorealistic image generation and editing with text-guided diffusion models.
[56] Ori Nizan and Ayellet Tal. Breaking the cycle—colleagues are all you need. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 7857–7866. IEEE, 2020.
[57] Atsuhiro Noguchi and Tatsuya Harada. Image generation from small datasets via batch statistics adaptation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 2750–2758.
[58] Mehdi Noroozi and Paolo Favaro. Unsupervised learning of visual representations by solving jigsaw puzzles. In European Conference on Computer Vision, pages 69–84. Springer.
[59] NVlabs/ffhq-dataset. https://github.com/nvlabs/ffhq-dataset.
[60] Anton Obukhov, Maximilian Seitzer, Po-Wei Wu, Semen Zhydenko, Jonathan Kyl, and Elvis Yu-Jing Lin. High-fidelity performance metrics for generative models in PyTorch, 2020. Version 0.3.0, DOI: 10.5281/zenodo.4957738.
[61] Taesung Park, Alexei A. Efros, Richard Zhang, and Jun-Yan Zhu. Contrastive learning for unpaired image-to-image translation. In European Conference on Computer Vision, pages 319–345. Springer.
[62] Deepak Pathak, Philipp Krahenbuhl, Jeff Donahue, Trevor Darrell, and Alexei A. Efros. Context encoders: Feature learning by inpainting. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2536–2544.
[63] Alec Radford, Karthik Narasimhan, Tim Salimans, Ilya Sutskever, et al. Improving language understanding by generative pre-training. 2018.
[64] Ali Razavi, Aaron Van den Oord, and Oriol Vinyals. Generating diverse high-fidelity images with VQ-VAE-2. 32.
[65] Stephan R. Richter, Vibhav Vineet, Stefan Roth, and Vladlen Koltun. Playing for data: Ground truth from computer games. In European Conference on Computer Vision, pages 102–118. Springer, 2016.
[66] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models.
[67] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-Net: Convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 234–241. Springer, 2015.
[68] German Ros, Laura Sellart, Joanna Materzynska, David Vazquez, and Antonio M. Lopez. The SYNTHIA dataset: A large collection of synthetic images for semantic segmentation of urban scenes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3234–3243, 2016.
[69] Tim Salimans and Jonathan Ho. Progressive distillation for fast sampling of diffusion models.
[70] Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models.
[71] Jan Stanczuk, Christian Etmann, Lisa Maria Kreusser, and Carola-Bibiane Schönlieb. Wasserstein GANs work because they fail (to approximate the Wasserstein distance). arXiv preprint arXiv:2103.01678, 2021.
[72] GitHub: U-GAT-IT. https://github.com/znxlwm/ugatit-pytorch.
[73] GitHub: UVCGAN. https://github.com/ls4gan/uvcgan.
[74] Yaxing Wang, Abel Gonzalez-Garcia, David Berga, Luis Herranz, Fahad Shahbaz Khan, and Joost van de Weijer. MineGAN: effective knowledge transfer from GANs to target domains with few images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9332–9341.
[75] Yaxing Wang, Chenshen Wu, Luis Herranz, Joost van de Weijer, Abel Gonzalez-Garcia, and Bogdan Raducanu. Transferring GANs: generating images from limited data. In Proceedings of the European Conference on Computer Vision (ECCV), pages 218–234.
[76] Daniel Watson, William Chan, Jonathan Ho, and Mohammad Norouzi. Learning fast samplers for diffusion models by differentiating through sample quality.
[77] Tete Xiao, Piotr Dollar, Mannat Singh, Eric Mintun, Trevor Darrell, and Ross Girshick. Early convolutions help transformers see better. Advances in Neural Information Processing Systems, 34, 2021.
[78] Miaoyun Zhao, Yulai Cong, and Lawrence Carin. On leveraging pretrained GANs for limited-data generation.
[79] Yang Zhao and Changyou Chen. Unpaired image-to-image translation via latent energy transport. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16418–16427, 2021.
[80] Yihao Zhao, Ruihai Wu, and Hao Dong. Unpaired image-to-image translation using adversarial consistency loss. In European Conference on Computer Vision, pages 800–815. Springer, 2020.
[81] Wanfeng Zheng, Qiang Li, Guoxin Zhang, Pengfei Wan, and Zhongyuan Wang. ITTR: Unpaired image-to-image translation with transformers. arXiv preprint arXiv:2203.16015, 2022.
[82] Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A. Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. In 2017 IEEE International Conference on Computer Vision (ICCV), pages 2242–2251, 2017.
Appendix A. Extended UVCGAN Ablation Studies

This appendix shows the impact of the UVCGAN generator, gradient penalty (GP), and self-supervised generator pre-training (PT) on UVCGAN's performance. Table A1 summarizes these findings. For each data set, the bottom half of the table shows the UVCGAN performance with some of its components disabled. For example, "UVCGAN no GP" shows the UVCGAN performance without the gradient-penalty term (but using a hybrid UNet-ViT generator and self-supervised pre-training). The table supports a few observations: (1) the addition of a hybrid UNet-ViT generator alone typically produces a large improvement over CycleGAN, even in the absence of the self-supervised pre-training and the GP term; (2) the self-supervised generator pre-training without the GP term does not seem to improve the image-to-image translation performance and sometimes makes it worse; (3) the self-supervised pre-training only helps when it is used in conjunction with the GP.

Table A1. FID and KID scores. Lower is better. PT stands for the self-supervised generator pre-training, and GP means usage of the gradient penalty.

                         Selfie to Anime            Anime to Selfie
                         FID     KID (×100)         FID     KID (×100)
ACL-GAN                  99.3    3.22 ± 0.26        128.6   3.49 ± 0.33
Council-GAN              91.9    2.74 ± 0.26        126.0   2.57 ± 0.32
CycleGAN                 92.1    2.74 ± 0.31        127.5   2.52 ± 0.34
U-GAT-IT                 95.8    2.74 ± 0.31        108.8   1.48 ± 0.34
UVCGAN                   79.0    1.35 ± 0.20        122.8   2.33 ± 0.38
UVCGAN no GP             81.4    1.68 ± 0.22        133.3   2.90 ± 0.49
UVCGAN no PT             80.9    1.78 ± 0.20        134.0   2.98 ± 0.49
UVCGAN no PT and GP      81.6    1.75 ± 0.25        140.6   3.53 ± 0.59

                         Male to Female             Female to Male
                         FID     KID (×100)         FID     KID (×100)
ACL-GAN                  9.4     0.58 ± 0.06        19.1    1.38 ± 0.09
Council-GAN              10.4    0.74 ± 0.08        24.1    1.79 ± 0.10
CycleGAN                 15.2    1.29 ± 0.11        22.2    1.74 ± 0.11
U-GAT-IT                 24.1    2.20 ± 0.12        15.5    0.94 ± 0.07
UVCGAN                   9.6     0.68 ± 0.07        13.9    0.91 ± 0.08
UVCGAN no GP             14.1    1.22 ± 0.10        20.4    1.61 ± 0.11
UVCGAN no PT             11.0    0.85 ± 0.09        14.7    0.98 ± 0.08
UVCGAN no PT and GP      14.4    1.26 ± 0.10        19.9    1.55 ± 0.11

                         Remove Glasses             Add Glasses
                         FID     KID (×100)         FID     KID (×100)
ACL-GAN                  16.7    0.70 ± 0.06        20.1    1.35 ± 0.14
Council-GAN              37.2    3.67 ± 0.22        19.5    1.33 ± 0.13
CycleGAN                 24.2    1.87 ± 0.17        19.8    1.36 ± 0.12
U-GAT-IT                 23.3    1.69 ± 0.14        19.0    1.08 ± 0.10
UVCGAN                   14.4    0.68 ± 0.10        13.6    0.60 ± 0.08
UVCGAN no GP             19.2    1.28 ± 0.15        18.7    1.14 ± 0.12
UVCGAN no PT             15.8    0.84 ± 0.12        14.3    0.70 ± 0.10
UVCGAN no PT and GP      19.7    1.32 ± 0.15        16.1    0.89 ± 0.11
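For reference, the "no GP" rows above toggle a gradient-penalty regularizer on the discriminator. The snippet below is a minimal, hedged sketch of a generic zero-centered penalty of that kind; the exact penalty variant, its weight, and the points at which it is applied in UVCGAN may differ, so the names and constants here are illustrative only.

    # Hedged sketch: a generic gradient penalty for a GAN discriminator.
    # The variant and weight are illustrative; they are not taken from UVCGAN's code.
    import torch

    def gradient_penalty(discriminator, images, weight=10.0):
        """Penalize the squared norm of the discriminator gradient at `images`."""
        images = images.clone().requires_grad_(True)
        scores = discriminator(images).sum()
        grads, = torch.autograd.grad(scores, images, create_graph=True)
        penalty = grads.flatten(start_dim=1).pow(2).sum(dim=1).mean()
        return weight * penalty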
Appendix B. Hyperparameter Tuning for Other Algorithms

This section summarizes the hyperparameter tuning results for three benchmarking algorithms: ACL-GAN, CycleGAN, and U-GAT-IT. We omitted tuning for Council-GAN because it takes too long to run (300 hours per translation).

Because none of the benchmarking algorithms uses any stabilization technique (such as an exponential moving average (EMA) of the network weights [41]) beyond shrinking the learning rate, we suspect the fluctuation may be at least partially due to instability of the GAN training.

We only provide hyperparameter tuning results for a data set or task if an algorithm did not work on it. We skip hyperparameter tuning if either a pre-trained model or a hyperparameter setup was provided by the authors. In Tables A2-A4, the best results are marked in bold font, and the default hyperparameters are highlighted in gray.
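As an illustration of the EMA stabilization mentioned above, the snippet below keeps an exponential moving average of a model's weights in PyTorch. The decay value and the update schedule are assumptions for illustration; none of the benchmarked implementations necessarily uses this exact form.

    # Hedged sketch: exponential moving average (EMA) of model weights.
    # The decay constant and update schedule are illustrative assumptions.
    import copy
    import torch
    import torch.nn as nn

    @torch.no_grad()
    def update_ema(ema_model, model, decay=0.999):
        """Blend the live model's parameters into the frozen EMA copy."""
        for ema_p, p in zip(ema_model.parameters(), model.parameters()):
            ema_p.mul_(decay).add_(p, alpha=1.0 - decay)

    generator = nn.Linear(8, 8)                      # stand-in for a real generator
    ema_generator = copy.deepcopy(generator).eval()  # EMA copy used for evaluation
    update_ema(ema_generator, generator)             # call after each optimizer step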
ACL-GAN worked on all three data sets studied and detailed in this paper, but for only one translation direction each: selfie-to-anime, male-to-female, and remove glasses. For the translations in the opposite directions, we tune three parameters concerning the focus loss: the focus-loss weight, focus upper, and focus lower. The results are summarized in Table A2.

Table A2. ACL-GAN hyperparameter tuning results. We tune three hyperparameters related to the focus loss: weight of the focus loss, focus upper, and focus lower.

task              weight   upper   lower   FID     KID (×100)
anime-to-selfie   0        −       −       128.6   3.49 ± 0.33
                  .025     .5      .3      205.3   11.0 ± 1.01
                  .025     .1      .05     250.3   18.6 ± 1.19
female-to-male    0        −       −       46.0    3.39 ± 0.13
                  .025     .5      .3      19.1    1.38 ± 0.09
                  .05      .5      .3      36.3    2.91 ± 0.13
add glasses       0        −       −       29.0    1.77 ± 0.12
                  .025     .1      .05     26.6    2.26 ± 0.17
                  .05      .1      .05     20.1    1.35 ± 0.14
CycleGAN did not work on any of the three data sets. We search a grid over two hyperparameters: the type of generator (Gen.) and the weight (Wt.) of the cycle-consistency loss. We also try two GAN modes, lsgan and wgangp; however, because CycleGAN did not implement the gradient penalty properly, wgangp did not work. The results are summarized in Table A3.

Table A3. CycleGAN hyperparameter tuning results.

Gen.     Wt.    selfie-to-anime            anime-to-selfie
                FID     KID (×100)         FID     KID (×100)
ResNet   5      92.1    2.72 ± 0.29        127.5   2.52 ± 0.34
ResNet   10     93.4    2.96 ± 0.27        129.4   2.91 ± 0.39
UNet     5      121.9   6.21 ± 0.32        134.3   2.96 ± 0.30
UNet     10     286.0   27.0 ± 0.87        135.8   3.32 ± 0.32

                male-to-female             female-to-male
                FID     KID (×100)         FID     KID (×100)
ResNet   5      21.9    2.00 ± 0.12        33.6    2.82 ± 0.14
ResNet   10     15.2    1.29 ± 0.11        22.2    1.74 ± 0.11
UNet     5      45.5    4.55 ± 0.17        50.8    4.86 ± 0.16
UNet     10     47.4    4.82 ± 0.19        47.5    4.57 ± 0.17

                remove glasses             add glasses
                FID     KID (×100)         FID     KID (×100)
ResNet   5      27.7    2.08 ± 0.16        26.0    1.77 ± 0.11
ResNet   10     24.2    1.87 ± 0.17        19.8    1.36 ± 0.12
UNet     5      32.2    2.52 ± 0.19        37.3    2.90 ± 0.14
UNet     10     32.2    2.52 ± 0.19        44.9    3.63 ± 0.20
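Since Table A3 varies the weight (Wt.) of the cycle-consistency loss, a minimal sketch of that loss is given below for reference. The L1 form follows the standard CycleGAN formulation; the function and variable names, and the way the weight is folded in, are illustrative.

    # Hedged sketch: the L1 cycle-consistency term whose weight ("Wt." in Table A3)
    # is being tuned.  Names (gen_ab, gen_ba, real_a, real_b) are illustrative.
    import torch.nn.functional as F

    def cycle_consistency_loss(gen_ab, gen_ba, real_a, real_b, weight=10.0):
        """weight * ( ||G_ba(G_ab(a)) - a||_1 + ||G_ab(G_ba(b)) - b||_1 )."""
        cycled_a = gen_ba(gen_ab(real_a))
        cycled_b = gen_ab(gen_ba(real_b))
        return weight * (F.l1_loss(cycled_a, real_a) + F.l1_loss(cycled_b, real_b))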
In addition to hyperparameter tuning for U-GAT-IT, we also correct the aspect ratio problem of U-GAT-IT in this revised version, as the original U-GAT-IT implementation cannot handle images whose height and width differ. We implement the rescaling in the preprocessing stage, so a CelebA image of width 178 and height 218 is resized to have width 256 and height 313. As we did for CycleGAN and UVCGAN, we take a random 256 × 256 crop from a training image and a central 256 × 256 crop from a test image.

U-GAT-IT studied the selfie-to-anime data set. For the two CelebA data sets, we try three values of the cycle-consistency loss weight (5, 10, and 20) and summarize the results in Table A4.

Table A4. U-GAT-IT hyperparameter tuning results.

weight   male-to-female             female-to-male
         FID     KID (×100)         FID     KID (×100)
5        39.2    3.86 ± 0.15        45.1    4.04 ± 0.16
10       24.1    2.20 ± 0.12        15.5    0.94 ± 0.07
20       32.1    3.09 ± 0.16        47.5    4.42 ± 0.17

         remove glasses             add glasses
         FID     KID (×100)         FID     KID (×100)
5        34.9    2.63 ± 0.15        50.0    5.08 ± 0.26
10       23.3    1.69 ± 0.14        19.0    1.08 ± 0.10
20       36.1    3.13 ± 0.19        36.1    2.67 ± 0.13
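The CelebA preprocessing described above (resize 178 × 218 to 256 × 313, then a random 256 × 256 crop for training and a central 256 × 256 crop for testing) can be written compactly with torchvision transforms. The snippet is a sketch of that pipeline; the interpolation mode and any normalization are assumptions rather than details taken from the released configurations.

    # Hedged sketch of the resize-and-crop preprocessing described above.
    from torchvision import transforms

    train_transform = transforms.Compose([
        transforms.Resize((313, 256)),   # (height, width): 218x178 -> 313x256
        transforms.RandomCrop(256),      # random 256x256 crop for training images
        transforms.ToTensor(),
    ])

    test_transform = transforms.Compose([
        transforms.Resize((313, 256)),
        transforms.CenterCrop(256),      # central 256x256 crop for test images
        transforms.ToTensor(),
    ])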
Appendix C. More detail about the UNet-ViT Generator

A UNet-ViT generator consists of a UNet [67] with a pixel-wise Vision Transformer (ViT) [28] at the bottleneck (Figure A1). The UNet's encoding path extracts features from the input via four layers of convolution and downsampling. The features extracted at each layer are also passed to the corresponding layers of the decoding path via skip connections, whereas the bottom-most features are passed to the pixel-wise ViT (Figure A2).

On the UNet's encoding path, the pre-processing layer turns an image into a tensor of dimension (w0, h0, f0). Each layer of the encoding path consists of a basic block and a downsampling block. The basic block is composed primarily of two convolutions, while the downsampling block has one convolution with stride 2. A pre-processed tensor has its width and height halved at each downsampling block, while the feature dimension doubles at the last three downsampling blocks. Hence, the output of the encoding path has dimension (w, h, f) = (w0/16, h0/16, 8 f0), and it forms the input to the pixel-wise ViT bottleneck. Each layer of the UNet decoding path consists of an upsampling block followed by a basic block. A basic block on the decoding path differs from one on the encoding path in that its input is the concatenation of the output of the upsampling layer with the tensor from the corresponding skip connection of the encoding path. The decoding path's output goes through a post-processing layer of 1 × 1 convolution with a sigmoid activation to produce an image.
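To make the data flow above concrete, here is a schematic PyTorch sketch of the UNet wiring with a pluggable bottleneck. It is a simplification under stated assumptions: single convolutions stand in for the full basic and downsampling blocks, normalization layers are omitted, the per-level widths follow the labels in Figure A1, and the bottleneck is a placeholder, so the sketch illustrates the skip-connection topology rather than the released implementation.

    # Schematic sketch of the UNet encoder/decoder wiring described above.
    # Block internals are heavily simplified; only the skip-connection topology
    # and the tensor shapes are meant to match the description.
    import torch
    import torch.nn as nn

    class UNetViTSketch(nn.Module):
        """UNet skeleton with a pluggable bottleneck (e.g. a pixel-wise ViT)."""

        def __init__(self, bottleneck, in_channels=3, out_channels=3, f0=48):
            super().__init__()
            feats = [f0, 2 * f0, 4 * f0, 8 * f0]   # per-level widths, as in Figure A1
            downs = feats[1:] + [feats[-1]]        # widths after each downsampling
            self.pre = nn.Conv2d(in_channels, feats[0], kernel_size=3, padding=1)
            self.enc = nn.ModuleList(
                nn.Conv2d(f_in, f_out, kernel_size=2, stride=2)
                for f_in, f_out in zip(feats, downs))
            self.bottleneck = bottleneck           # sees (N, 8*f0, w0/16, h0/16)
            self.up = nn.Upsample(scale_factor=2)
            self.dec = nn.ModuleList(
                nn.Conv2d(f_below + f_skip, f_skip, kernel_size=3, padding=1)
                for f_below, f_skip in zip(reversed(downs), reversed(feats)))
            self.post = nn.Sequential(
                nn.Conv2d(feats[0], out_channels, kernel_size=1), nn.Sigmoid())

        def forward(self, x):
            x = self.pre(x)
            skips = []
            for down in self.enc:                  # encoding path
                skips.append(x)                    # saved for the skip connection
                x = down(x)
            x = self.bottleneck(x)                 # pixel-wise ViT in the real model
            for dec, skip in zip(self.dec, reversed(skips)):
                x = dec(torch.cat([self.up(x), skip], dim=1))
            return self.post(x)

    # gen = UNetViTSketch(bottleneck=nn.Identity())
    # out = gen(torch.rand(1, 3, 256, 256))        # -> shape (1, 3, 256, 256)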
A pixel-wise ViT is composed primarily of a stack of Transformer encoder blocks [24]. To construct the input to the stack, the ViT first flattens an encoding along the spatial dimensions to form a sequence of transformer tokens. The token sequence has length w × h, and each token in the sequence is a vector of length f. It then concatenates each token with its two-dimensional Fourier positional embedding [4] of dimension fp and linearly maps the result to dimension fv. To improve the Transformer convergence, we adopt the rezero regularization [6] scheme and introduce a trainable scaling parameter α that modulates the magnitudes of the nontrivial branches of the residual blocks. The Transformer stack output is linearly projected back to dimension f and unflattened to have width w and height h. In this study, we use raw or cropped input images with w0 = h0 = 256 and set f0 = 48. Hence, we have w = h = 16 and f = 384. We use 12 Transformer encoder blocks in the ViT and set fp = fv = f and fh = 4 fv for the feed-forward network in each Transformer encoder block.
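The rezero-style scaling described above can be sketched as a single Transformer encoder block in which each residual branch is multiplied by a trainable α initialized at zero. This is a simplified illustration: the normalization placement, the number of attention heads, whether α is shared between branches, and the Fourier positional-embedding construction are assumptions, not details of the released code.

    # Hedged sketch of a rezero-style Transformer encoder block.
    import torch
    import torch.nn as nn

    class ReZeroEncoderBlock(nn.Module):
        def __init__(self, fv=384, fh=4 * 384, num_heads=6):
            super().__init__()
            self.attn = nn.MultiheadAttention(fv, num_heads, batch_first=True)
            self.ffn = nn.Sequential(nn.Linear(fv, fh), nn.GELU(), nn.Linear(fh, fv))
            self.norm1 = nn.LayerNorm(fv)
            self.norm2 = nn.LayerNorm(fv)
            self.alpha = nn.Parameter(torch.zeros(1))   # rezero scale, starts at 0

        def forward(self, tokens):                      # tokens: (N, w*h, fv)
            h = self.norm1(tokens)
            attn_out, _ = self.attn(h, h, h)
            tokens = tokens + self.alpha * attn_out     # scaled attention branch
            tokens = tokens + self.alpha * self.ffn(self.norm2(tokens))
            return tokens

    # stack = nn.Sequential(*[ReZeroEncoderBlock() for _ in range(12)])
    # out = stack(torch.rand(1, 16 * 16, 384))          # 256 tokens of width 384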

Figure A1. UNet ViT Generator with Full Details. (Block diagram: pre-processing to a 256 × 256 × 48 tensor; encoder levels of 256 × 256 × 48, 128 × 128 × 96, 64 × 64 × 192, and 32 × 32 × 384 with skip connections into the decoder; a 16 × 16 × 384 pixel-wise ViT bottleneck; and a 1 × 1 convolution with sigmoid post-processing.)

Figure A2. Vision Transformer with Full Details. (Block diagram: tokens concatenated with a sine positional embedding of dimension fp, linearly mapped to dimension fv, passed through 12 Transformer encoder blocks whose multi-head self-attention and feed-forward branches are scaled by the rezero parameter α, and linearly projected back to dimension f.)
Appendix D. Additional Sample Translations

We show a few more translations in Figures A3 to A5. Each figure compares the input images with the outputs of ACL-GAN, Council-GAN, CycleGAN, U-GAT-IT, and UVCGAN in both translation directions.

Figure A3. Additional Sample Translations: Selfie2Anime (selfie to anime and anime to selfie).

Figure A4. Additional Sample Translations: GenderSwap (male to female and female to male).

Figure A5. Additional Sample Translations: Eyeglasses (remove glasses and add glasses).
