Cycle-Consistent Inverse GAN for Text-to-Image Synthesis
{hao005,gslin,ascymiao}@ntu.edu.sg,[email protected]
ABSTRACT
This paper investigates an open research task of text-to-image synthesis for automatically generating or manipulating images from text descriptions. Prevailing methods mainly take the textual descriptions as the conditional input for the GAN generation, and need to train different models for the text-guided image generation and manipulation tasks. In this paper, we propose a novel unified framework of Cycle-consistent Inverse GAN (CI-GAN) for both text-to-image generation and text-guided image manipulation tasks. Specifically, we first train a GAN model without text input, aiming to generate images with high diversity and quality. Then we learn a GAN inversion model to convert the images back to the GAN latent space and obtain the inverted latent codes for each image, where we introduce the cycle-consistency training to learn more robust and consistent inverted latent codes. We further uncover the semantics of the latent space, so that the latent codes can be optimized towards the attributes described in the text.

[Figure 1: (a) Conventional text-to-image generation architecture with conditional text input; (b) CI-GAN with a disentangled GAN latent space, e.g. changing the feather colour or the belly colour of a bird.]
However, the paired text-image training of the GAN model limits the diversity of the model representation, since we only have limited combinations of text and images, and the generated images are regularized by the corresponding real images and text pairs. Moreover, it is hard to use the aforementioned framework to change only one attribute while preserving other text-irrelevant attributes in the generated images; hence an extra module has to be trained for the text-based image manipulation task [16].
The advent of style-based generator architectures, such as StyleGAN [13, 14], has greatly improved the realism, quality and diversity of generated images. Specifically, StyleGAN proposed to map the input noise to another latent space W, which has been validated to yield more disentangled semantic representations. To uncover the relationships between the latent codes in the space W and the synthesised images, we need to be aware of the distribution of the space W and find the corresponding latent codes of the images. To this end, many research works adopt the GAN inversion technique [1, 27, 40] to invert the images back to the space W and obtain the inverted latent codes.
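To make the mapping concrete, the following is a minimal PyTorch sketch of a StyleGAN-style mapping network; the depth, width and normalization shown here are illustrative defaults, not the configuration used in this paper.

import torch
import torch.nn as nn

class MappingNetwork(nn.Module):
    # Sketch of a StyleGAN-style mapping network: an MLP that maps
    # Gaussian noise z to the intermediate latent space W, which tends
    # to be more disentangled than z. Depth and width are illustrative.
    def __init__(self, z_dim=512, w_dim=512, num_layers=8):
        super().__init__()
        layers, dim = [], z_dim
        for _ in range(num_layers):
            layers += [nn.Linear(dim, w_dim), nn.LeakyReLU(0.2)]
            dim = w_dim
        self.net = nn.Sequential(*layers)

    def forward(self, z):
        # Normalize the input noise as StyleGAN does, then map to w.
        z = z / (z.pow(2).mean(dim=1, keepdim=True) + 1e-8).sqrt()
        return self.net(z)

w = MappingNetwork()(torch.randn(4, 512))  # four latent codes in W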
In this paper, we propose a novel framework of Cycle-consistent Inverse GAN (CI-GAN), which incorporates the GAN inversion methodology into the text-to-image synthesis task. Technically, we first train a GAN inversion encoder to map images into the latent space W of a trained StyleGAN, such that we can obtain the inverted latent codes of the real images in the given datasets. To make the original and inverted latent codes identical and follow the same distribution, we apply a cycle consistency loss during the GAN inversion training, as obtaining inverted latent codes close to the original ones is critical for our subsequent generation procedure.
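As a rough illustration of this training objective, the sketch below pairs an image-reconstruction loss with a latent cycle loss. The encoder E, generator G, loss choices and weights are placeholders for exposition, not the exact CI-GAN formulation.

import torch
import torch.nn.functional as F

def cycle_inversion_step(E, G, real_images, lambda_rec=1.0, lambda_cyc=1.0):
    # E: inversion encoder being trained (images -> latent codes in W).
    # G: frozen, pretrained StyleGAN generator (W -> images).
    # Image cycle: a real image should be reproduced from its inverted code.
    w_inv = E(real_images)
    rec_loss = F.l1_loss(G(w_inv), real_images)
    # Latent cycle: a latent code should survive a generate-then-invert
    # round trip. (In practice w would be drawn through the mapping
    # network so that it lies in W; plain Gaussian noise is a stand-in.)
    w = torch.randn(real_images.size(0), 512, device=real_images.device)
    cyc_loss = F.mse_loss(E(G(w)), w)
    return lambda_rec * rec_loss + lambda_cyc * cyc_loss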
We assume the StyleGAN learned space W is disentangled with regard to the semantic attributes of the target image dataset. For example, in the bottom row of Figure 1, when we want to change the belly colour of the bird image, the remaining semantic attributes, such as the bird shape, pose and feather colour, stay the same; only the belly colour changes from orange to black. The disentanglement of the space W allows us to generate images with various attributes by optimizing the latent codes. To generate images from textual descriptions, we learn a similarity model between text representations and the inverted latent codes, such that the latent codes can be optimized to carry the desired semantic attributes. We feed the optimized latent codes into the trained StyleGAN generator and thereby realize the text-to-image generation task. Apart from the text-to-image generation task, our proposed CI-GAN can also be used for the text-based image manipulation task by applying an extra perceptual loss between the original images and the images reconstructed from the optimized latent codes.
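The following sketch illustrates this test-time latent optimization; sim_model, percept and the hyperparameters are hypothetical names standing in for the similarity model, the perceptual loss and the paper's actual settings.

import torch

def optimize_latent(w_init, text_emb, sim_model, G,
                    x_src=None, percept=None,
                    steps=200, lr=0.01, lambda_p=1.0):
    # Optimize a latent code w at test time.
    # sim_model scores how well a latent code matches a text embedding;
    # for manipulation, percept keeps G(w) close to the source image x_src.
    w = w_init.clone().requires_grad_(True)
    opt = torch.optim.Adam([w], lr=lr)
    for _ in range(steps):
        loss = -sim_model(text_emb, w).mean()           # raise text-latent similarity
        if x_src is not None and percept is not None:   # manipulation variant
            loss = loss + lambda_p * percept(G(w), x_src).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return w.detach()  # the final image is G(w)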
Our contributions can be summarized as:
• We propose a novel GAN approach combining GAN inversion and cycle consistency training for text-to-image synthesis. The unified framework can be used for both the text-to-image generation and text-based image manipulation tasks.
• We use the improved GAN inversion method with cycle consistency training to invert real images to the GAN latent space and obtain the latent codes of the images.
• We uncover the semantics of the latent codes, based on which we can generate high-quality images corresponding to the textual descriptions.
We evaluate our proposed framework CI-GAN on two public datasets in the wild, i.e. the Recipe1M and CUB datasets. We conduct extensive experiments to analyse the efficacy of CI-GAN, and present quantitative and qualitative results of our proposed methods together with visualizations of the generated images.

2 RELATED WORK
2.1 Text-Based Image Generation
In this section, we review two categories of text-based image generation, i.e. text-to-image generation and text-based image manipulation. Generating images from text is a challenging task, as we need to correlate the cross-modal information [5, 30, 32]. To control the correspondence between the text and the generated images, some prevailing text-to-image generation works [6, 15, 42] pretrain a Deep Attentional Multimodal Similarity Model (DAMSM) [35], which is used as a supervision signal to regularize the semantics of the generated images. Specifically, Cheng et al. [6] propose a refinement module that returns more complete caption sets, which provide more semantic information for image generation. Zhu et al. [42] use a memory writing gate to refine the initial image and generate a high-quality one. Wang et al. [31] and Zhu et al. [38] aim to generate food images from cooking recipes.
The text-based image manipulation task requires the model to change only certain parts or attributes of the input images while preserving other text-irrelevant attributes. Li et al. [16] propose a module that combines the text and the generated images to jointly correlate the details, such that mismatched attributes can be rectified. Dong et al. [8] use an encoder-decoder architecture that takes the original images as well as the textual descriptions as input and outputs the manipulated images, supervised by a discriminator.
However, the existing text-to-image generation works suffer from the limited diversity of the generated images, since they use paired text and images for GAN training. Moreover, the aforementioned architectures adopt multi-stage refinement [35, 37] to improve the resolution of the generated images, which makes it cumbersome to generate images at higher resolutions. In contrast, our proposed method uses the StyleGAN2 [14] model as the generator backbone and does not use paired text input when training the GAN, which guarantees the quality and diversity of the generated images.

2.2 GAN Inversion
Due to the lack of inference capabilities in GANs, manipulation in the latent space can only be applied to generated images, rather than arbitrary real images. GAN inversion is a popular way to manipulate real images [3, 19, 41]. The purpose of GAN inversion is to invert a given image to the latent space of a pretrained GAN model and obtain the inverted latent code, such that the image can be faithfully reconstructed by the generator from the inverted latent code. As a technique that connects images with the GAN latent space, GAN inversion enables the pretrained GAN model to be applied to real images.
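For reference, a common optimization-based formulation of GAN inversion searches for the latent code that best reconstructs the target image. The sketch below shows this baseline with a plain pixel loss (practical systems typically add perceptual terms); it is a generic illustration, not the specific inversion method proposed in this paper.

import torch

def invert(G, x, w_dim=512, steps=500, lr=0.05):
    # Search for the latent code whose reconstruction G(w) matches the
    # real image x under a pixel loss; the generator G stays frozen.
    w = torch.randn(x.size(0), w_dim, requires_grad=True)
    opt = torch.optim.Adam([w], lr=lr)
    for _ in range(steps):
        loss = (G(w) - x).pow(2).mean()  # pixel reconstruction error
        opt.zero_grad()
        loss.backward()
        opt.step()
    return w.detach()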