

Cycle-Consistent Inverse GAN for Text-to-Image Synthesis


Hao Wang1,2, Guosheng Lin1, Steven C. H. Hoi3, Chunyan Miao1,2∗
1 School of Computer Science and Engineering, Nanyang Technological University (NTU), Singapore
2 Joint NTU-UBC Research Centre of Excellence in Active Living for the Elderly, NTU, Singapore
3 Singapore Management University, Singapore

{hao005,gslin,ascymiao}@ntu.edu.sg, [email protected]

∗ Corresponding author

ABSTRACT

This paper investigates an open research task of text-to-image synthesis: automatically generating or manipulating images from text descriptions. Prevailing methods mainly take the textual descriptions as the conditional input for GAN generation, and they need to train different models for the text-guided image generation and manipulation tasks. In this paper, we propose a novel unified framework, Cycle-consistent Inverse GAN (CI-GAN), for both the text-to-image generation and text-guided image manipulation tasks. Specifically, we first train a GAN model without text input, aiming to generate images with high diversity and quality. Then we learn a GAN inversion model to convert the images back to the GAN latent space and obtain the inverted latent code for each image, where we introduce cycle-consistency training to learn more robust and consistent inverted latent codes. We further uncover the semantics of the latent space of the trained GAN model by learning a similarity model between text representations and the latent codes. In the text-guided optimization module, we can then generate images with the desired semantic attributes through optimization on the inverted latent codes. Extensive experiments on the Recipe1M and CUB datasets validate the efficacy of our proposed framework.

CCS CONCEPTS
• Computing methodologies → Computer vision.

KEYWORDS
GAN, Text-to-image synthesis, Cycle consistency

ACM Reference Format:
Hao Wang, Guosheng Lin, Steven C. H. Hoi, and Chunyan Miao. 2021. Cycle-Consistent Inverse GAN for Text-to-Image Synthesis. In Proceedings of the 29th ACM International Conference on Multimedia (MM '21), October 20–24, 2021, Virtual Event, China. ACM, New York, NY, USA, 9 pages. https://doi.org/10.1145/3474085.3475226

[Figure 1 omitted: (a) the conventional text-conditioned generation architecture; (b) our proposed framework, which aligns text representations with a disentangled latent space to support edits such as "change feather colour" and "change belly colour".]

Figure 1: The comparison of the conventional architecture and our proposed framework for text-to-image generation. Existing works mainly take the text feature-conditioned GAN structure, where the limited combinations of text and images affect the diversity of generation. In contrast, we adopt a decoupled learning scheme: we first train a GAN model without text, then we discover the semantics of the latent space W of the trained GAN. We allow the text representations to be matched with the latent codes, such that we can control the semantic attributes of the synthesised images by changing the latent codes.
1 INTRODUCTION

Text-to-image synthesis [6, 20, 21, 26, 35–37, 42] aims to generate images whose semantic contents correspond to the input text descriptions, typically based on Generative Adversarial Network (GAN) approaches. It has various potential applications, such as visual content design and art generation. However, text-to-image synthesis is a challenging cross-modal task, as we need to interpret the semantic attributes hidden in the text and produce images with high diversity and good quality.

Prevailing works [6, 15, 38, 42] on text-to-image generation mainly build their frameworks on StackGAN [37], which generates high-resolution images progressively. Specifically, the StackGAN model stacks multiple generators and discriminators: it first generates low-resolution images with rough shapes and colour attributes and then refines them into high-resolution ones. To improve the semantic correspondence between the textual descriptions and the generated images, Xu et al. propose AttnGAN [35], which discovers the attribute alignment between image and text by pretraining an attentional similarity model.


However, the paired text-image training of the GAN limits the diversity of the model representation, since we only have limited combinations of text and images, and the generated images are regularized by the corresponding real images and text pairs. Moreover, it is hard for such frameworks to change only one attribute while preserving the other text-irrelevant attributes in the generated images; hence an extra module needs to be trained for the text-based image manipulation task [16].
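To make the conventional, text-conditioned scheme concrete, below is a minimal PyTorch sketch of a StackGAN-style stacked generator; the class names, channel widths and resolutions are our own illustrative choices, not the implementation of the cited works.

    import torch
    import torch.nn as nn

    class StageGenerator(nn.Module):
        # One refinement stage: double the resolution of the hidden
        # features and emit an RGB image at the new resolution.
        def __init__(self, in_ch, out_ch):
            super().__init__()
            self.block = nn.Sequential(
                nn.Upsample(scale_factor=2, mode="nearest"),
                nn.Conv2d(in_ch, out_ch, 3, padding=1),
                nn.BatchNorm2d(out_ch),
                nn.ReLU(inplace=True),
            )
            self.to_rgb = nn.Conv2d(out_ch, 3, 3, padding=1)

        def forward(self, h):
            h = self.block(h)
            return h, torch.tanh(self.to_rgb(h))

    class StackedTextToImageG(nn.Module):
        # Conventional scheme: noise and a text embedding jointly drive
        # stacked generators, producing a rough low-resolution image first
        # and then progressively refined higher-resolution ones.
        def __init__(self, z_dim=100, t_dim=128, base_ch=64):
            super().__init__()
            self.base_ch = base_ch
            self.fc = nn.Linear(z_dim + t_dim, base_ch * 4 * 4)
            self.stages = nn.ModuleList(
                [StageGenerator(base_ch, base_ch) for _ in range(3)]  # 4->8->16->32
            )

        def forward(self, z, text_emb):
            h = self.fc(torch.cat([z, text_emb], dim=1)).view(-1, self.base_ch, 4, 4)
            images = []
            for stage in self.stages:
                h, img = stage(h)
                images.append(img)
            return images  # one image per resolution, coarse to fine

    imgs = StackedTextToImageG()(torch.randn(2, 100), torch.randn(2, 128))
    print([tuple(i.shape) for i in imgs])  # [(2, 3, 8, 8), (2, 3, 16, 16), (2, 3, 32, 32)]

Because the text embedding enters every generated sample, the diversity of this scheme is bounded by the available text-image pairs, which is the limitation discussed above.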
The advent of style-based generator architectures, such as StyleGAN [13, 14], has greatly improved the realism, quality and diversity of generated images. Specifically, StyleGAN maps the input noise to an intermediate latent space W, which has been validated to yield more disentangled semantic representations. To uncover the relationships between the latent codes in the space W and the synthesised images, we need to be aware of the distribution of the space W and find the corresponding latent codes of the images. To this end, many research works adopt the GAN inversion technique [1, 27, 40] to invert images back to the space W and obtain the inverted latent codes.
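For reference, here is a minimal sketch of a StyleGAN-style mapping network that produces the codes in W discussed above; the eight-layer MLP and 512-dimensional latents follow the StyleGAN papers, but this is a simplified stand-in rather than the official implementation.

    import torch
    import torch.nn as nn

    class MappingNetwork(nn.Module):
        # StyleGAN-style mapping network: an MLP that maps input noise z
        # to an intermediate latent code w in the more disentangled space W.
        def __init__(self, z_dim=512, w_dim=512, n_layers=8):
            super().__init__()
            layers, dim = [], z_dim
            for _ in range(n_layers):
                layers += [nn.Linear(dim, w_dim), nn.LeakyReLU(0.2)]
                dim = w_dim
            self.net = nn.Sequential(*layers)

        def forward(self, z):
            # StyleGAN normalizes z before the MLP.
            z = z / (z.pow(2).mean(dim=1, keepdim=True) + 1e-8).sqrt()
            return self.net(z)

    print(MappingNetwork()(torch.randn(4, 512)).shape)  # torch.Size([4, 512])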
In this paper, we propose a novel framework, Cycle-consistent Inverse GAN (CI-GAN), which incorporates the GAN inversion methodology into the text-to-image synthesis task. Technically, we first train a GAN inversion encoder to map images to the latent space W of a trained StyleGAN, such that we can obtain the inverted latent codes for the real images of the given datasets. To make the original and inverted latent codes identical and follow the same distribution, we apply a cycle-consistency loss during the GAN inversion training, as obtaining inverted latent codes similar to the original ones is critical for our subsequent generation procedure.
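Below is a minimal sketch of one training step under this cycle-consistency idea, written with assumed interfaces: E is the inversion encoder being trained, G a frozen pretrained StyleGAN generator, and mapping its frozen mapping network; the plain MSE losses and the weight lam are illustrative, not the paper's exact objective.

    import torch
    import torch.nn.functional as F

    def cycle_consistent_inversion_step(E, G, mapping, real_imgs, z_dim=512, lam=1.0):
        # Image cycle: a real image should be reconstructed from its
        # inverted code, x -> E(x) -> G(E(x)) ~ x.
        w_inv = E(real_imgs)
        loss_img = F.mse_loss(G(w_inv), real_imgs)

        # Latent cycle: a sampled code should be recovered after generating
        # and re-inverting, w -> G(w) -> E(G(w)) ~ w, which pushes the
        # inverted codes toward the distribution of the original space W.
        z = torch.randn(real_imgs.size(0), z_dim, device=real_imgs.device)
        w = mapping(z)
        loss_lat = F.mse_loss(E(G(w)), w)

        return loss_img + lam * loss_lat

In practice only the encoder E would receive gradients from this loss; G and its mapping network stay frozen so that the latent space being inverted into does not drift.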
We assume that the StyleGAN-learned space W is disentangled with regard to the semantic attributes of the target image dataset. For example, in the bottom row of Figure 1, when we want to change the belly colour of the bird image, the remaining semantic attributes, such as the bird's shape, pose and feather colour, stay the same; only the belly colour changes from orange to black. The disentanglement of the space W allows us to generate images with various attributes through optimization on the latent codes. To generate images from textual descriptions, we learn a similarity model between text representations and the inverted latent codes, such that the latent codes can be optimized to have the desired semantic attributes. We feed the optimized latent codes into the trained StyleGAN generator to realize the text-to-image generation task. Apart from text-to-image generation, our proposed CI-GAN can also be used for the text-based image manipulation task by applying an extra perceptual loss between the original images and the images reconstructed from the optimized latent codes.
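Here is a minimal sketch of such text-guided optimization under assumed interfaces: sim_model scores a text embedding against a latent code, G is the frozen StyleGAN generator, and perc_loss is an optional perceptual loss; all names, step counts and weights are illustrative.

    import torch

    def optimize_latent(w_init, text_emb, sim_model, G,
                        x_orig=None, perc_loss=None, steps=200, lr=0.05, beta=1.0):
        w = w_init.clone().detach().requires_grad_(True)
        opt = torch.optim.Adam([w], lr=lr)
        for _ in range(steps):
            # Pull w toward the semantics of the text description.
            loss = -sim_model(text_emb, w)
            if x_orig is not None and perc_loss is not None:
                # Manipulation mode: stay perceptually close to the original
                # image so that text-irrelevant attributes are preserved.
                loss = loss + beta * perc_loss(G(w), x_orig)
            opt.zero_grad()
            loss.backward()
            opt.step()
        return w.detach()  # feed into the frozen generator: image = G(w)

Starting from an inverted code of a real image and supplying x_orig turns generation into manipulation, which is how the same module serves both tasks.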
Our contributions can be summarized as follows:

• We propose a novel GAN approach combining GAN inversion and cycle-consistency training for text-to-image synthesis. The unified framework can be used for both the text-to-image generation and the text-based image manipulation tasks.
• We use an improved GAN inversion method with cycle-consistency training to invert real images to the GAN latent space and obtain the latent codes of the images.
• We uncover the semantics of the latent codes, based on which we can generate high-quality images corresponding to the textual descriptions.

We evaluate our proposed framework CI-GAN on two public datasets, i.e. the Recipe1M and CUB datasets. We conduct extensive experiments to analyse the efficacy of CI-GAN, and we present quantitative and qualitative results of our proposed methods along with visualizations of the generated images.

2 RELATED WORK

2.1 Text-Based Image Generation

In this section, we review two categories of text-based image generation, i.e. text-to-image generation and text-based image manipulation. Generating images from text is a challenging task, as we need to correlate cross-modal information [5, 30, 32]. To control the correspondence between the text and the generated images, some prevailing text-to-image generation works [6, 15, 42] pretrain a Deep Attentional Multimodal Similarity Model (DAMSM) [35], which is used as supervision to regularize the semantics of the generated images. Specifically, Cheng et al. [6] propose a refinement module that returns more complete caption sets, which can provide more semantic information for image generation. Zhu et al. [42] use a memory writing gate to refine the initial image and generate a high-quality one. Wang et al. [31] and Zhu et al. [38] aim to generate food images from cooking recipes.
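For intuition, here is a heavily simplified, global-feature sketch of such similarity-based supervision; the actual DAMSM [35] also uses word-level attention, so this batch-wise matching loss is only a stand-in.

    import torch
    import torch.nn.functional as F

    def matching_loss(img_feats, txt_feats, tau=0.1):
        # Batch-wise image-text matching: matched pairs sit on the diagonal
        # of the pairwise cosine-similarity matrix and are scored against
        # all mismatched pairs in the batch.
        img = F.normalize(img_feats, dim=1)
        txt = F.normalize(txt_feats, dim=1)
        logits = img @ txt.t() / tau
        targets = torch.arange(img.size(0))
        # Symmetric cross-entropy over image-to-text and text-to-image matching.
        return F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)

    print(matching_loss(torch.randn(8, 256), torch.randn(8, 256)).item())

Used as an extra generator loss, such a term penalizes generated images whose features do not match their paired descriptions.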
For the text-based image manipulation task, the model is required to change only certain parts or attributes while preserving the other text-irrelevant attributes of the input images. Li et al. [16] propose a module that combines the text and the generated images to jointly correlate the details, such that mismatched attributes can be rectified. Dong et al. [8] use an encoder-decoder architecture that takes the original images as well as the textual descriptions as input and outputs the manipulated images, supervised by a discriminator.

However, the existing text-to-image generation works suffer from the limited diversity of the generated images, since they use paired text and images for GAN training. Moreover, the aforementioned architectures adopt multi-stage refinement [35, 37] to improve the resolution of the generated images, so it is cumbersome to generate images at higher resolutions. In contrast, our proposed method uses the StyleGAN2 [14] model as the generator backbone and does not use paired text input when training the GAN, which guarantees the quality and diversity of the generated images.

2.2 GAN Inversion

Due to the lack of inference capabilities in GANs, manipulation in the latent space can only be applied to generated images, rather than to any given real image. GAN inversion is a popular way to manipulate real images [3, 19, 41]. The purpose of GAN inversion is to invert a given image to the latent space of a pretrained GAN model and obtain the inverted latent code, such that the image can be faithfully reconstructed by the generator from the inverted latent code. As a new technique connecting images and the GAN latent space, GAN inversion enables the pretrained GAN model to
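As a concrete illustration of this objective, here is a minimal sketch of optimization-based GAN inversion, assuming a frozen generator G that maps latent codes to images; encoder-based inversion, as used in our framework, can replace or initialize this per-image optimization.

    import torch
    import torch.nn.functional as F

    def invert_image(G, x, w_dim=512, steps=500, lr=0.1):
        # Find the latent code whose reconstruction by the frozen generator
        # G best matches the given real image x (pixel loss only, for
        # brevity; perceptual terms are common in practice).
        w = torch.zeros(x.size(0), w_dim, requires_grad=True)  # origin init for simplicity
        opt = torch.optim.Adam([w], lr=lr)
        for _ in range(steps):
            loss = F.mse_loss(G(w), x)
            opt.zero_grad()
            loss.backward()
            opt.step()
        return w.detach()  # code from which G faithfully reconstructs x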
