
Text2Face: Text-Based Face Generation With Geometry and Appearance Control

Zhaoyang Zhang, Junliang Chen, Hongbo Fu, Senior Member, IEEE, Jianjun Zhao, Shu-Yu Chen, and Lin Gao, Member, IEEE

Abstract—Recent years have witnessed the emergence of various techniques proposed for text-based human face generation and manipulation. Such methods, targeting bridging the semantic gap between text and visual contents, provide users with a deft hand to turn ideas into visuals via a text interface and enable more diversified multimedia applications. However, due to the flexibility of linguistic expressiveness, the mapping from sentences to desired facial images is clearly many-to-many, causing ambiguities during text-to-face generation. To alleviate these ambiguities, we introduce a local-to-global framework with two graph neural networks (one for geometry and the other for appearance) embedded to model the inter-dependency among facial parts. This is based upon our key observation that the geometry and appearance attributes among different facial components are not mutually independent, i.e., the combinations of part-level facial features are not arbitrary and thus do not conform to a uniform distribution. By learning from the dataset distribution and enabling recommendations given partial descriptions of human faces, these networks are highly suitable for our text-to-face task. Our method is capable of generating high-quality attribute-conditioned facial images from text. Extensive experiments have confirmed the superiority and usability of our method over the prior art.

Index Terms—Image generation, face editing, sketching interface, text-based user interaction.

Manuscript received 1 March 2023; revised 8 November 2023; accepted 16 December 2023. Date of publication 2 January 2024; date of current version 31 July 2024. This work was supported in part by the National Natural Science Foundation of China under Grants 62061136007 and 62102403, in part by the Beijing Municipal Natural Science Foundation for Distinguished Young Scholars under Grant JQ21013, and in part by the Beijing Municipal Science and Technology Commission under Grant Z231100005923031. Recommended for acceptance by X. Tong. (Corresponding author: Lin Gao.)

Zhaoyang Zhang is with the Department of Computer Science, Yale University, New Haven, CT 06520 USA (e-mail: [email protected]).

Shu-Yu Chen and Lin Gao are with the Beijing Key Laboratory of Mobile Computing and Pervasive Device, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China, and also with the University of Chinese Academy of Sciences, Beijing 100190, China (e-mail: [email protected]; [email protected]).

Junliang Chen and Jianjun Zhao are with the Department of Film and TV Technology, Beijing Film Academy, Beijing 100088, China (e-mail: [email protected]; [email protected]).

Hongbo Fu is with the School of Creative Media, City University of Hong Kong, Kowloon, Hong Kong (e-mail: [email protected]).

Digital Object Identifier 10.1109/TVCG.2023.3349050

I. INTRODUCTION

What does a certain character in a novel look like visually? This is a common question raised by readers when they are immersed in the content of a novel and wonder about more details behind the text. Can we reconstruct the faces in the novel simply from textual descriptions [1]? Although it sounded impossible in the past, this depiction-to-visualization procedure has the potential to become a reality now, enabled by the fascinating progress of human face generation and manipulation methods as well as natural language processing techniques. However, textual descriptions are sometimes incomplete for describing every detail of desired faces, i.e., one sentence may not cover a detailed description of every facial part at once. This is due to the fact that users may not want to specify every detail of the face at the very beginning. For example, a user may start with describing the desired face as "An oval face with blonde hair", then add more facial descriptions such as "smiling mouth" and "big nose", etc. Thus, how to generate satisfying human faces from partial descriptions becomes a challenging problem. In this work, we aim to visualize the text-to-face depiction process by building an interface for converting textual descriptions into realistic human faces, where we introduce a recommendation mechanism with graph neural networks (GNNs) for proposing coherent faces given partial descriptions. Also, when detailed descriptions are provided, our model is able to generate facial images corresponding to the given texts, and our proposed GNNs will optimize over the generated facial parts and achieve higher fidelity and consistency.

Efforts have been devoted to text-based image generation in previous years, but only recently have such methods begun to be applied to facial images. Thanks to the visual-linguistic joint representation ability of CLIP [2], a series of works (e.g., [3], [4]) have emerged in this domain. By attempting to bridge the semantic gap between the visual-linguistic joint latent space of CLIP and the latent space of the state-of-the-art face generation model, StyleGAN [5], such methods are capable of generating and editing face images with specific attributes that are semantically consistent with the given text prompts (e.g., glasses, hairstyle, emotions, and expressions), and have achieved impressive results. A concurrent study from [6] also provides a powerful tool for interactive editing of face images using text as hints. They model the mapping from the textual editing instructions to the editing directions in the StyleGAN latent space as a semantic field.

Different from previous works, our work sheds light on a text-guided face generation process rather than using texts to guide the editing process of human faces, and we explicitly model the geometry and appearance features in the pipeline in a disentangled way, rather than as an entangled representation such as a StyleGAN latent feature, bringing more flexibility for part-level control. Moreover, we enable more attributes to be controlled via text, while previous methods only generate poor editing results


on these attributes, as illustrated in our experiments. To this end, we propose a multi-stage framework comprising four parts, namely the Text Parsing Module, Feature Extraction Module, Graph Recommendation Module, and Global Generation Module. The Text Parsing Module maps sentence inputs into attribute-value pairs, thus providing a simple yet accurate way of finding key textual hints. The Feature Extraction Module is responsible for disentangling each facial component's geometry and appearance features, followed by a Graph Recommendation Module, which learns the inference relationship among facial components. Finally, the geometry and appearance features optimized by the Graph Recommendation Module are transformed into photo-realistic images by the Global Generation Module.

We summarize our main contributions as follows:
- We enable detailed part-level attribute-conditioned face generation from textual descriptions, which supports more controllable attributes than previous methods.
- We incorporate graph neural networks (GNNs) into the generation process of face images, which enables geometry and appearance recommendation upon given conditions from the text.

II. RELATED WORKS

A. Neural Face Generation and Editing

The prosperity of deep neural networks has demonstrated their capability in the human face generation and editing literature. To generate face images with high fidelity, Karras et al. [5] propose StyleGAN and a series of its variants [7], [8]. These models are capable of generating high-resolution photo-realistic faces by randomly sampling from a latent distribution p_Z(z). They are robust to noisy inputs, thus inducing an abundance of follow-up works (e.g., [9], [10], [11]), which explore the properties of the intermediate latent space W to implement conditional face generation and editing. While StyleGAN-based methods benefit from the unprecedented generation ability of StyleGAN and generate photo-realistic human faces, non-StyleGAN-based methods are also capable in this domain. For example, Chen et al. [12] propose a structural framework to disentangle the geometry features from the appearance features, using the sketch as an intermediary. Lee et al. [13] adopt semantic masks as an intermediary for flexible face manipulation while preserving identity and fidelity.

Although these methods are promising in generating and/or manipulating human face images, they do not explicitly take into account the inherent coherence among the appearances and geometric features of facial components, thus being incapable of understanding high-level semantics and structures of human faces, let alone recommending and generating geometrically coherent and appearance-consistent human faces. In contrast, our work explicitly models the relationship among facial part geometry and appearance (respectively) using graphs and achieves easier control over the geometry and appearance features.

B. Text-Based Multimodal Generation

Text enjoys wide applications in human-computer interaction, with recent advances in vision and graphics having integrated text as an interface for image generation and manipulation. Previously, text-based image generation methods [14], [15], [16], [17] focused on generating simple-structured images like birds, using the CUB200 dataset [18], and flowers, using the Oxford-Flower-102 dataset [15], etc. These methods generally lack a thorough analysis of the target data distribution (in their cases, birds and flowers, etc.; in our case, human faces) and are therefore unable to improve the quality of the generated images. Based on large pretrained models, DALL-E/DALL-E 2 [19], [20] are able to generate complex and semantically abundant images from pure text inputs, achieving phenomenal effects on text-based image generation. Another track of works on text-based audio generation has also attracted attention within the community [21], [22], [23]. For example, MusicLM [21] models the music generation process in a hierarchical manner, which has proved efficient in prior art. Our work also models the generation of facial images in a local-to-global manner.

Recent progress in text-guided graphics and vision is largely facilitated by CLIP's strong visual-linguistic representation ability. CLIPasso [24] utilizes a CLIP image encoder to measure the semantic and geometric similarity between input real images and abstracted sketches, benefiting from the rich semantics within the CLIP text-image joint latent space. CLIPstyler [25] incorporates CLIP for image style transfer, where the desired style is specified via text inputs. Sangkloy et al. [26] design an image retrieval system using both text and sketch as a query. With the help of this system, users can conduct fine-grained retrieval, which could not be achieved using either of the two modalities alone. The 3D content creation field also benefits from CLIP, with Text2Mesh [27] being a representative work. The proposed method predicts per-vertex colors and positional offsets from the input template mesh and uses a differentiable renderer to propagate the CLIP 2D semantic supervision to 3D. Specifically, within the face generation and manipulation community, Patashnik et al. [3] introduce three CLIP-based approaches under this direction, all targeting the manipulation of inverted StyleGAN images. Xia et al. [28], in contrast, map multi-modal inputs, including text, into the fixed W space of StyleGAN, forcing the embeddings of multiple modalities to be as close as possible to the inverted w ∈ W of their corresponding real face image. Jiang et al. [6] model the mapping between text features and StyleGAN latent editing directions using an MLP, by which they attempt to solicit the most salient editing direction corresponding to the textual hints. AnyFace [29] first defines the problem of free-style text-to-face synthesis and proposes a two-stream architecture that utilizes CLIP and StyleGAN to achieve open-world human face generation and manipulation. Different from their method, which depends on StyleGAN to generate facial images as a whole, our method relies on partial generators to generate facial parts separately, together with two GNNs to ensure the fidelity and consistency of facial images. Very recently, diffusion models [30], [31], [32], [33], [34], [35] have elevated text-to-image generation to a new level, where text-to-face generation becomes a natural subtask. These approaches are adept at editing global attributes such as age, beard, smiling emotion, etc., instead of editing part-level geometry and appearance features as we do.

The above-mentioned text-based facial image generation methods, while having achieved impressive results in

manipulating human faces, often rely too much on the representation ability of large pretrained models such as CLIP and StyleGAN. Thus, they compromise detailed semantic control over each component of human faces, i.e., some attributes in the StyleGAN latent space are highly entangled (as mentioned in [28]). Our work, built upon a local-to-global framework, is able to translate semantic descriptions into part-level visuals with geometry and appearance compatibility, thus supporting disentangled control for each part while also considering the overall coherence. Also note that most of the controllable facial attributes enabled by our method do not overlap with those enabled by previous works, and editing/generating these attributes using previous methods yields less satisfying results, as shown in Section IV.

Fig. 1. We present a novel pipeline for text-driven face generation, supporting intuitive control over the part-level geometry and appearance of generated facial images using text as the only input (right-top, manually simplified from a novel paragraph on the left). Our pipeline inherently supports both end-to-end text-to-face generation (right-middle) and sequential generation (right-bottom), as illustrated here.

III. METHODOLOGY

Given an input sentence s describing a human face, we aim to generate a photo-realistic facial image I^final with details in accordance with the description in s (examples shown in Fig. 1). To eliminate potential abuse of our work, the input adjectives used to describe the face are restricted within a range (see more discussions in the supplementary materials). Due to the diversity of linguistic descriptions, the mapping from sentences to faces is clearly many-to-many, bringing about more ambiguities when s contains fewer detailed specifications for each facial part. Therefore, we suggest a recommendation mechanism to infer the features of facial parts that are not specified in s from specified ones, aiming at a seamless combination of part features during global generation. Note that the input sentence s could also be several separate sentences, as long as they describe the same face together.

This requires us to learn the inter-dependency and intrinsic compatibility among facial parts, from both geometry and appearance perspectives. This requirement in turn leads us to design our whole pipeline in a local-to-global manner. Specifically, during training and inference, we divide a facial image into five parts, namely P := (leye, reye, nose, mouth, bg), where bg stands for background. See Fig. 2 for more details. Network details are included in the supplementary materials.

Fig. 2. Overview of our pipeline: Our pipeline follows a local-to-global manner. The Text Parsing Module parses one or multiple sentences s describing the same face into a set of keywords, which are used for conditionally sampling features for face generation from a property pool. The features in the property pool are extracted in advance using the Feature Extraction Module, which is trained to disentangle geometry from appearance for each facial component. The Graph Recommendation Module contains two graphs, the Appearance Graph and the Geometry Graph. They learn the coherence among facial components from the appearance and geometry perspectives, respectively, and thus can propose recommendations for unspecified facial parts in s. Finally, the Global Generation Module fuses the part-level feature maps into a generated face image I^final. During inference, the input sentence s is parsed into keywords indexing into the property pool to get the corresponding part features. The part features are optimized by the Appearance Graph and Geometry Graph, after which the optimized features are sent into the part-level decoders ({Dec^r}) in the Feature Extraction Module to get the feature maps. The feature maps are fused at fixed positions and translated into a real image I^real by the Global Generation Module.

A. Pipeline

1) Text Parsing Module: By assumption, the input sentences contain certain patterns suitable for extracting attribute-value pairs directly using a regular parser [36]. As previously mentioned, the parser is used to acquire semantic descriptions for each facial part, including geometry and appearance descriptions. Specifically, given the input sentence(s) s, the parser P will produce a set of attributes P(s) that are used to index into the database for finding the corresponding geometry and appearance features for the subsequent generation process. In our implementation, we parse the sentence s using the off-the-shelf spaCy [37] library by analyzing the dependency tree and part of speech of the words.
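As a concrete illustration of this parsing step, the sketch below extracts (facial part, adjective) pairs from a description using spaCy's dependency tree. It is a minimal sketch rather than the authors' parser; the facial-part vocabulary and the mapping to the part set P are our own assumptions.

```python
# Minimal sketch of attribute-value extraction with spaCy (not the authors' parser).
# The facial-part vocabulary below is an illustrative assumption.
import spacy

nlp = spacy.load("en_core_web_sm")
FACE_PARTS = {"face", "hair", "eye", "eyes", "nose", "mouth", "eyebrow", "eyebrows"}

def parse_attributes(sentence: str):
    """Return a list of (part, attribute) pairs, e.g. ('face', 'oval')."""
    doc = nlp(sentence)
    pairs = []
    for token in doc:
        # Adjectival modifiers attached to a facial-part noun: "oval face", "big nose".
        if token.dep_ == "amod" and token.head.lemma_.lower() in FACE_PARTS:
            pairs.append((token.head.lemma_.lower(), token.lemma_.lower()))
        # Predicative adjectives: "the mouth is wide".
        if token.pos_ == "ADJ" and token.dep_ == "acomp":
            subj = [c for c in token.head.children if c.dep_ == "nsubj"]
            if subj and subj[0].lemma_.lower() in FACE_PARTS:
                pairs.append((subj[0].lemma_.lower(), token.lemma_.lower()))
    return pairs

print(parse_attributes("An oval face with blonde hair and a big nose."))
# e.g. [('face', 'oval'), ('hair', 'blonde'), ('nose', 'big')]
```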
2) Feature Extraction Module: This module serves for local geometry and appearance disentanglement. It takes as input real images of facial components I_p^r (r standing for real) belonging to a whole image I, and outputs their corresponding geometry features f_p^geo and appearance features f_p^app, where p ∈ P is short for part. We omit the subscript p in the rest of this section when there is no ambiguity. We propose our Feature Extraction Module for explicitly disentangling the geometry and appearance features of facial images, using sketches as an intermediary [12]. For each facial part, we first train an auto-encoder consisting of E^s, R^s (s standing for sketch) over the sketch domain using an L1 reconstruction loss as supervision, after which we get the part-level sketch feature defined as f^s := E^s(I^s) ∈ R^512. Serving as the geometry features, these part-level sketch features are further utilized to guide the disentanglement of the geometry and appearance features of the real image I^r. Such disentanglement is done by another auto-encoder E^r, R^r. This auto-encoder learns to extract geometry and appearance features from I^r simultaneously, enabling us to formulate f^app and f^geo as two

vectors, rather than the feature maps used in [12]. Using vectors rather than feature maps is a necessary formulation since the graph networks in the Graph Recommendation Module could not take feature maps as input. The geometry feature of I^r is defined as the latent vector f^geo ∈ R^512 acquired by the fully connected layer after the last encoding block, and the appearance feature of I^r is defined as a linear combination of the IN parameters of the encoding blocks. Formally, f^app = Σ_i w_i (μ_i ⊕ σ_i), where ⊕ represents vector concatenation, μ_i and σ_i are the mean and standard deviation of the i-th layer's feature map, and w_i are learnable weights. To achieve disentanglement, we force f^geo to be aligned with f^s, which is encoded by the pretrained sketch encoder E^s.
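To make the appearance definition concrete, the following sketch computes a part-level appearance vector as a learnable weighted sum of concatenated per-channel means and standard deviations of the encoder's intermediate feature maps. It is a sketch under our own assumptions (PyTorch, equal channel widths across blocks, and a projection to R^512 are illustrative), not the authors' implementation.

```python
# Sketch: appearance feature as a weighted sum of instance-normalization statistics.
# Assumes each encoder block outputs a (B, C, H, W) feature map with the same C;
# the block count, channel width, and 512-d projection are assumptions.
import torch
import torch.nn as nn

class AppearanceHead(nn.Module):
    def __init__(self, num_blocks: int, channels: int, out_dim: int = 512):
        super().__init__()
        self.weights = nn.Parameter(torch.ones(num_blocks))   # learnable w_i
        self.proj = nn.Linear(2 * channels, out_dim)           # map (mu ⊕ sigma) to R^512

    def forward(self, feature_maps):
        # feature_maps: list of (B, C, H, W) tensors from the encoder blocks
        acc = 0.0
        for i, fm in enumerate(feature_maps):
            mu = fm.mean(dim=(2, 3))                            # per-channel mean, (B, C)
            sigma = fm.std(dim=(2, 3))                          # per-channel std, (B, C)
            acc = acc + self.weights[i] * torch.cat([mu, sigma], dim=1)  # w_i * (mu ⊕ sigma)
        return self.proj(acc)                                   # f_app in R^512

# Usage: f_app = AppearanceHead(num_blocks=4, channels=256)(list_of_block_outputs)
```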
3) Graph Recommendation Module: With the disentangled geometry and appearance features, we propose two graph neural networks, one for recommending compatible geometry features for unspecified parts (Geometry Graph), and the other for unifying the appearance of the generated face image at the part level (Appearance Graph). Please refer to Section III-B for the inference procedure.

Geometry Graph: Our key observation here is that the geometry features of different facial parts should share an intrinsic coherence, i.e., not all combinations of facial geometry form compatible faces [38]. For example, the eyes of the same face should be largely symmetric, while the size of the mouth and the shape of the jaw both influence the contour of the whole face, etc. We formulate the recommendation problem as a conditional sampling and prediction of unspecified facial parts and model the inter-dependency of geometry features among different facial parts as a 5-node (one node represents one facial part) bipartite graph G^geo := (V^geo, E^geo) during each step of inference, where V^geo contains the geometry features of the 5 nodes and E^geo comprises the edges from every node of specified/predicted parts to every node of unspecified/unpredicted ones. Formally, let P_s denote the text-specified/predicted subset of P; we have

$V^{geo} := \{ f_p^{geo} \mid p \in P \}$   (1)

$E^{geo} := \{ e_{x \to y}^{geo} : f_x^{geo} \to f_y^{geo} \mid x \in P_s,\ y \in P \setminus P_s \},$   (2)

where each edge e_{x→y}^geo in E^geo is implemented as an MLP. We denote the output of the Geometry Graph as {f^geo}.

Appearance Graph: With this appearance graph, we aim to achieve controllable style fusing for appearance features from different source images. We observe that the appearance of one facial part may largely tell what other parts look like. That is, for example, if we know that the eyes of a face have a light/dark skin color, we will have enough confidence to reason that the whole face has a light/dark color. This inter-dependency of appearance features among different parts is modeled using a 5-node complete graph G^app := (V^app, E^app), formally,

$V^{app} := \{ f_p^{app} \mid p \in P \}$   (3)

$E^{app} := \{ e_{x \to y}^{app} : f_x^{app} \to f_y^{app} \mid x, y \in P,\ x \neq y \}.$   (4)

We model every edge e_{x→y}^app ∈ E^app as a unified EdgeConv [39] function, which is shared across different edges to update the node features during every propagation. The outputs of the Appearance Graph are denoted as {f^app}.

4) Global Generation Module: We base our Global Generation Module on the commonly adopted image-to-image translation model pix2pixHD [40], which takes as input the optimized appearance features {f^app} and the part-level geometry features {f^geo}, and outputs the final synthesized image I^final. {f^app} and {f^geo} are first sent through the {R^r} mentioned in


Section III-A2, after which we spatially combine the feature maps of the second-last layer of {R^r} as indicated in Fig. 2. The combined feature map is then fused into a photo-realistic image I^final using R^global, which consists of a sequence of ResBlocks [41].

Fig. 3. Illustration of the graph recommendation for the Geometry Graph. We iteratively perform attribute-conditioned manifold projection to generate compatible geometry features for the whole face.

B. Graph Recommendation Mechanism

We formalize the inference logic of the Graph Recommendation Module in this subsection.

1) Geometry Graph (Fig. 3): The inference procedure of the Geometry Graph follows a step-by-step manner, where we start by deciding the geometry feature for bg. If f_bg^geo is specified in the input sentence s, we conditionally sample a geometry feature from our property pool using the specified attributes as the condition. If f_bg^geo cannot be directly inferred from the input sentence s, i.e., no key in P(s) is relevant to the face contour, we randomly sample a geometry feature for f_bg^geo from our property pool. Then f_bg^geo is used to predict compatible geometry features for all the other parts. Generally, the predicted feature for an unspecified part is forwarded as follows,

$\hat{f}_p^{geo} = \frac{1}{|P_s|} \sum_{x \in P_s} e_{x \to p}^{geo}(f_x^{geo}), \quad p \in P \setminus P_s,$   (5)

where P_s is the specified/predicted subset of P as mentioned in Section III-A3. When deciding the next part geometry feature, for example f_nose^geo, we already have a prediction from f_bg^geo, which we denote as f̂_nose^geo. Therefore, if nose is not specified, we directly use f̂_nose^geo as f_nose^geo. Otherwise, we sample from all the geometry features in our database which satisfy the specified attributes for nose, and apply manifold projection to f̂_nose^geo over the sampled subset of the database. We call this process attribute-conditioned manifold projection, abbreviated as A. Formally, the prediction logic for f_nose^geo can be formulated as follows,

$f_{nose}^{geo} = \begin{cases} \hat{f}_{nose}^{geo}, & \text{if nose is not specified} \\ A(\hat{f}_{nose}^{geo}), & \text{if nose is specified.} \end{cases}$   (6)

After the two iterations above, f_nose^geo and f_bg^geo have been decided; they are then fixed and used to predict the remaining undecided part geometry features in the same way as for f_nose^geo. Iterations terminate when all the part-level geometry features have been decided. We denote the output of the Geometry Graph as {f^geo}.
. Otherwise, we could sample
from all the geometry features in our database which satisfy C. Training Stages
the specified attributes for nose, and apply manifold projection
to fˆnose
geo
over the sampled subset of the database. We call this The training process of the entire pipeline contains three
process attribute-conditioned manifold projection, abbreviated stages. We introduce them respectively in this subsection. The
as A. Formally, the prediction logic for fnose
geo
can be formulated training process is independent of any attribute label.
as follows, Stage I. Training the Feature Extraction Module: As described
 geo in Section III-A2, (f geo , f app ) = E r (I r ). During training, we
fˆnose
 ,  if nose is not specified force the decoder Rr to reconstruct the original image, i.e.,
fnose
geo
= (6)
A fˆnose
geo
, if nose is specified. forcing I recon = Rr (f geo , f app ) to be as close to I r as possible.
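A compact sketch of this message-passing round is given below. The shared EdgeConv-style edge function follows the spirit of [39] (an MLP over the receiving feature and the neighbor's offset), but its exact form, the padding layout, and the module names are our assumptions rather than the paper's implementation.

```python
# Sketch of one Appearance Graph message-passing round (Eq. (7)); details are assumed.
import torch
import torch.nn as nn

NUM_PARTS, DIM = 5, 512

class SharedAppearanceEdge(nn.Module):
    """EdgeConv-flavored edge function shared by all ordered pairs (x -> p)."""
    def __init__(self, dim=NUM_PARTS * DIM):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, f_x_hat, f_p_hat):
        # Concatenate the receiver feature with the neighbor offset, as in EdgeConv.
        return self.mlp(torch.cat([f_p_hat, f_x_hat - f_p_hat], dim=-1))

def message_passing(f_hat, edge, rounds=5):
    """f_hat: (NUM_PARTS, NUM_PARTS*DIM) zero-padded extended appearance features."""
    for _ in range(rounds):
        new = torch.zeros_like(f_hat)
        for p in range(NUM_PARTS):
            msgs = [edge(f_hat[x], f_hat[p]) for x in range(NUM_PARTS) if x != p]
            new[p] = torch.stack(msgs).mean(dim=0)   # Eq. (7): average over the other parts
        f_hat = new
    # Eq. (9): read each part's 512-d slice back out of its extended vector.
    return torch.stack([f_hat[p, p * DIM:(p + 1) * DIM] for p in range(NUM_PARTS)])

# Usage: f_app = message_passing(f_hat_init, SharedAppearanceEdge())
```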
Therefore we have the first loss term Lrecon defined as follows,
geo
After the two iterations above, fnose
geo
and fbg have been decided,
which will be fixed and used to predict the rest undecided part recon = I − I
Llocal r recon
1 . (10)
geometry features like what has been done for predicting fnose
geo
. To eliminate the interdependence of geometry and appearance
Iterations terminate until all the part-level geometry features features, we align the geometry feature space of real images
have been decided. We denote the output of Geometry Graph as ({f geo }) with that of sketches ({f s }), where f s is extracted via
{f geo }. the pre-trained E s . Thus, the second loss term Lalign comes as


follows,

$\mathcal{L}_{align} = \| f^{geo} - f^{s} \|_2.$   (11)

Further, we utilize a third loss term, an adversarial loss L_adv, in a similar way to [42], by employing a discriminator D^r,

$\mathcal{L}_{adv}^{E,R} = \mathbb{E}\left[ (D^r(I^{recon}) - 1)^2 \right],$   (12)

$\mathcal{L}_{adv}^{D} = \mathbb{E}\left[ (D^r(I^{r}) - 1)^2 \right] + \mathbb{E}\left[ (D^r(I^{recon}))^2 \right].$   (13)

In summary, the training objective for Stage I is formulated as a minimax game as follows,

$\min_{E,R} \; \mathcal{L}_{recon}^{local} + \lambda_{align} \mathcal{L}_{align} + \lambda_{adv} \mathcal{L}_{adv}^{E,R},$   (14)

$\min_{D} \; \mathcal{L}_{adv}^{D}.$   (15)

In our implementation, we set λ_align = 0.01 and λ_adv = 0.005.
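For clarity, the Stage I objective (Eqs. (10)-(15)) can be written out as in the sketch below, using the least-squares GAN formulation of [42]. The encoder, decoder, discriminator, and optimizer handles are placeholders; only the loss structure and the weights λ_align = 0.01, λ_adv = 0.005 come from the text.

```python
# Sketch of one Stage I training step (Eqs. (10)-(15)); E_r, R_r, D_r, E_s are placeholders.
import torch

lambda_align, lambda_adv = 0.01, 0.005

def stage1_step(I_r, I_s, E_r, R_r, D_r, E_s, opt_ER, opt_D):
    f_geo, f_app = E_r(I_r)
    I_recon = R_r(f_geo, f_app)

    # --- encoder/decoder update (Eq. (14)) ---
    loss_recon = (I_r - I_recon).abs().mean()                                  # Eq. (10), L1
    loss_align = (f_geo - E_s(I_s).detach()).pow(2).sum(dim=1).sqrt().mean()   # Eq. (11), L2
    loss_adv_g = ((D_r(I_recon) - 1) ** 2).mean()                              # Eq. (12), LSGAN term
    loss_ER = loss_recon + lambda_align * loss_align + lambda_adv * loss_adv_g
    opt_ER.zero_grad(); loss_ER.backward(); opt_ER.step()

    # --- discriminator update (Eqs. (13), (15)) ---
    loss_D = ((D_r(I_r) - 1) ** 2).mean() + (D_r(I_recon.detach()) ** 2).mean()
    opt_D.zero_grad(); loss_D.backward(); opt_D.step()
```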
Stage II. Training the Geometry Graph: The Geometry Graph models the geometric coherence among facial parts. This is enabled by learning a set of MLP-based mappings between the latent spaces of every pair of facial components. For each pair of facial components x, y ∈ V^geo, x ≠ y, we force the MLP e_{x→y} to map f_x^geo to f_y^geo. Therefore, the loss is simply defined as an L2 loss between the predicted geometry feature f̃_y^geo := e_{x→y}(f_x^geo) and the ground-truth f_y^geo,

$\min_{e_{x \to y}} \; \| \tilde{f}_y^{geo} - f_y^{geo} \|_2.$   (16)
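The Stage II objective trains one small MLP per ordered pair of parts; a minimal sketch of this loop is shown below, with the MLP architecture and the data pipeline as our own assumptions.

```python
# Sketch of Stage II: fit an MLP edge e_{x->y} per ordered part pair with an L2 loss (Eq. (16)).
import itertools
import torch
import torch.nn as nn

PARTS = ["leye", "reye", "nose", "mouth", "bg"]

def make_edge(dim=512):
    return nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

edges = {(x, y): make_edge() for x, y in itertools.permutations(PARTS, 2)}

def stage2_step(batch, optimizers):
    """batch: dict part -> (B, 512) geometry features extracted from the same faces."""
    for (x, y), edge in edges.items():
        pred = edge(batch[x])
        loss = torch.norm(pred - batch[y], dim=1).mean()   # L2 distance per sample
        optimizers[(x, y)].zero_grad()
        loss.backward()
        optimizers[(x, y)].step()
```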
Stage III. Joint Training of the Global Generation Module and the Appearance Graph: The Appearance Graph learns the style inter-dependency among facial components, with which we want to achieve appearance reasoning when observing the partial appearance of a face, and appearance fusing when combining facial components from different sources. Therefore, we train our Appearance Graph together with the Global Generation Module using the reconstruction loss as the main supervision. Given the original geometry features {f^geo} and partial appearance features {Dropout(f^app, p)}, where Dropout represents the Dropout function operating on every part-level appearance feature and p is the Dropout probability (p = 0.1 in our implementation), we first compute the optimized appearance features {f^app} by calling G^app. Then {f^geo} and {f^app} are used to compute the local feature maps for each part, which are further combined into F ∈ R^{32×512×512}. Finally, we have I^final = R^global(F). The first loss is an L1 reconstruction loss,

$\mathcal{L}_{recon}^{global} = \| I^{final} - I \|_1.$   (17)

We further employ a VGG loss [45] and a Lab loss [46] to constrain the visual accuracy of the generated images. Therefore, the training objective for this stage is as follows,

$\min_{G^{app}, R^{global}} \; \mathcal{L}_{recon}^{global} + \lambda_{vgg} \mathcal{L}_{vgg} + \lambda_{Lab} \mathcal{L}_{Lab}.$   (18)

We set λ_vgg = 0.2 and λ_Lab = 0.001 in our experiments.

IV. EXPERIMENTS

A. Data Preparation

Using text as the interface of interaction requires us to prepare a dataset with high-quality facial images and accurate semantic annotations for each of them. For facial images, we generate a dataset with a capacity of 40 K images using the attribute-conditioned sampling method provided by StyleFlow [10], where we impose constraints on the yaw and pitch during the generation process but randomize other attributes. This eliminates the impact of extreme poses and background since we desire the facial images to be as frontal as possible, enabling us to capture and model the geometry patterns of faces more accurately. Also, we use the StyleGAN generators provided by [47], which are trained on various other datasets. For training the Feature Extraction Module, we prepare the corresponding sketches for these images used during feature disentanglement following the method in [12]. For facial attribute annotations, we use APIs from Face++ [48], Microsoft Azure [49], and Alibaba [50], since the detection of some desired attributes is only provided by one of these APIs. Unless otherwise specified, we set the resolution of generated images to 512 × 512 in all our experiments.

When generating the training dataset (as well as our database), we only generate frontal faces to eliminate the negative impacts of occlusion and pose, i.e., we reasonably use the a priori knowledge of face layout and pose. Here, we briefly explain why we only use frontal and occlusion-free faces:
- Non-frontal faces bring about difficulties for the graph recommendation module to infer the accurate geometry/appearance correlation. For example, an apparent geometry relation within the human face is the symmetry of the two eyes. If a face has a big yaw, the symmetry would not exist in the image space because this 3D symmetry is not preserved when projected to 2D.
- Non-frontal faces and occlusions would make it difficult for the detection API to make accurate judgments. Intuitively, for example, if the face has a big yaw/pitch, an arched eyebrow may look like a straight eyebrow, leading to misjudgment by the API.

B. Results and Evaluations

We conduct extensive experiments to demonstrate the effectiveness and usability of our system. We evaluate our method from four aspects: attribute accuracy of the generated images (Section IV-B1), comparison with the state-of-the-art text-based image generation techniques on human faces (Section IV-B2), ablation study (Section IV-B3), and perceptual study (Section IV-B6). We also present more generated results in Fig. 6.

1) Attribute Accuracy of the Generated Faces: To test the accuracy of text-image correspondence of the generated images (i.e., do the attributes in the generated images match the descriptions?), for each attribute, we generate a batch of 100 images by specifying only one attribute in the input sentence. Then, these generated images are sent to the facial attribute detection


APIs [48], [49], [50] for re-detection. We calculate the accuracy for each attribute, as shown in Table I.

TABLE I
TEXT-IMAGE CORRESPONDENCE ACCURACY

2) Comparison With the State of the Art: Existing text-based works that are relevant to our work can be categorized into two tracks: text-based image generation [14], [16], [51] and text-guided face manipulation [3], [28]. Since our work can be adapted to support face manipulation, we make comparisons for the two tasks.

Fig. 5. Editing comparisons with state-of-the-art methods. We perform single-attribute editing for each example. In all three examples, TediGAN [28] fails to produce changes corresponding to the text-specified facial attributes. StyleCLIP [3] succeeds in turning a closed mouth into an opened one, while it fails in the other two cases. We speculate from an empirical perspective that the success of editing an opened mouth and the failure of editing eyebrow/nose shape may both be ascribed to the entangled nature of the StyleGAN latent space, as prior works [6], [10], [43], [44] have already managed to change the mouth openness via StyleGAN latent manipulation but none (to our knowledge) have succeeded in editing eyebrows/nose in the same way. Overall, our method yields the most satisfying results in terms of both reconstruction quality and editing effectiveness.

Fig. 6. More end-to-end generated results. Given the input sentences, our method can generate diverse faces conditioned on the prompts in the text.

For the generation task, we compare our method with AttnGAN [14] and DM-GAN [51] by retraining their models using the official implementations but with our dataset, and setting the same sentence as the input to all three works. Since the original implementations of AttnGAN and DM-GAN set the maximum resolution to 256 × 256, we directly use their results under this resolution for comparison with our results, which have a resolution of 512 × 512. Specifically, for each image in our generated dataset, we randomly generate 10 sentences describing each face using the detected facial attributes, and retrain their models under the original resolution with our generated sentences. We deem this a fair comparison since generating images at a higher resolution is often considered to be more difficult. Please note that their models are not specifically designed for text-to-face generation but rather for a more general text-to-image generation task. In contrast, our model is specifically designed for generating human faces. Although we explicitly take the prior of human face layout into account in our model architecture, we argue that this comparison is better than nothing


since there do not exist relevant works under the same settings as ours: text-to-face generation with disentangled feature control. As shown in Fig. 9, our method's generation results are visually high-quality and semantically accurate. In contrast, results from previous text-to-image generation methods could not reach such a high resolution while also being deficient in satisfying the conditions in the input texts.

Fig. 7. Ablation study of the Appearance Graph. Editing the appearance of the source image (Geometry) using part-level reference images (Appearance). The Paste column shows the pasted appearance reference over the source image. As shown in the rightmost two columns, the edited results with the Appearance Graph are much more color-consistent than the rightmost column, where the results are generated without incorporating the Appearance Graph.

Fig. 8. Results of partial appearance morphing with the Appearance Graph. The third column shows results generated from the geometry features of the source image (shown in the leftmost column), the bg appearance feature from the source image, and the other partial appearance features from the target image in the rightmost column. It keeps the background appearance nearly intact while shifting the facial color toward that of the target image as much as possible. The fourth column is the opposite, where the bg appearance feature comes from the target image while the rest of the appearance features are inherited from the source image. The corresponding effect is the maintenance of the facial appearance and the swapping of the background appearance. Note that the background appearance here includes the hair color feature.

Fig. 9. Generation comparisons with state-of-the-art methods. Given the same input sentence (leftmost in each example), our result is significantly better than the other two methods in terms of both image quality and attribute accuracy.

For the manipulation tasks, we adapt our pipeline as follows to support manipulation given an input image I: we encode I to get {f^geo} and {f^app} using E^r, and then substitute the features of the specified editing attributes and perform graph recommendation upon the modified features. Here we compare our method with the two existing open-world-text-based editing methods [28], [52] for editing functionality and only compare the results of editing a single attribute, because it is intuitive to perform multi-attribute editing by serializing single-attribute edits. We use the standard optimization-based method in [28] and the Global Direction method in [3] for comparison. As shown in Fig. 5, the editing results from TediGAN [28] often fail to convey the semantics indicated in the input texts. Obvious artifacts exist in those results as well. StyleCLIP [3] synthesizes more meaningful editing results, as shown in the middle example in Fig. 5. However, in the left and right examples, it fails to generate apparent editing effects, i.e., the eyebrows in the left example are not "arch" and the nose in the right example is not "thin". Our method, on the other hand, generates semantically consistent and visually meaningful results for all examples shown in Fig. 5.
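The manipulation adaptation described at the beginning of this comparison can be sketched as below. The encoder, attribute-to-feature lookup, graph modules, and decoder are hypothetical handles standing in for the components defined in Section III; this illustrates the data flow rather than the released implementation.

```python
# Sketch of text-based editing on a real image (hypothetical component handles).
def edit_image(I, edits, E_r, property_pool, geometry_graph, appearance_graph, R_global):
    """edits: dict part -> parsed attribute keywords to change, e.g. {'nose': ['thin']}."""
    f_geo, f_app = E_r(I)                          # per-part geometry/appearance features
    for part, attrs in edits.items():
        # Replace the edited part's features with database features matching the new attributes.
        f_geo[part], f_app[part] = property_pool.sample(part, attrs)
    # Re-run graph recommendation so the untouched parts stay compatible with the edit.
    f_geo = geometry_graph(f_geo, specified=set(edits))
    f_app = appearance_graph(f_app)
    return R_global(f_geo, f_app)                  # fused, photo-realistic edited image
```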
3) Ablation Study: The Graph Recommendation Module is an essential part of our framework to ensure the quality and realism of the generated results. To demonstrate its validity for geometry or appearance recommendation, we conduct an ablation


study with/without the graph. Since the Appearance Graph and Geometry Graph operate separately, we perform the ablation study in two ways. First, we randomly edit one part of the face and observe the generated images with/without the Geometry Graph. Specifically, as illustrated in Fig. 10, we fix f_bg^geo and keep changing f_eye1^geo and f_eye2^geo. In this way, the Geometry Graph is expected to predict f_nose^geo and f_mouth^geo to form a compatible face. Second, we verify the effectiveness of our Appearance Graph by swapping the appearance features of several facial parts between two faces. We replace the appearance features of the source person with those of the target person. With the Appearance Graph, such a swapping operation is expected not to produce any sharp boundaries on the faces, as shown in Fig. 7. Without the Appearance Graph, the swapping operation produces images with inconsistent color.

Fig. 10. Ablation study of the Geometry Graph. We randomly sample facial geometry features to generate face images. The upper row shows the results generated from geometry features without being optimized by the Geometry Graph, and the lower row shows the results generated using the Geometry Graph. Obviously, there exist artifacts on the borders of different facial parts in the generated faces when the geometry features are not optimized by the Geometry Graph. On the other hand, when optimized by the Geometry Graph, the geometry features of different facial parts are more consistent, producing more realistic results.

4) Geometry and Appearance Morphing: The encoder network E^r of our framework can extract the geometry and the appearance respectively from a real image. The representations of those two features are both 1 × 512 latent vectors. Our method can therefore interpolate in each feature domain. As shown in Fig. 11, the upper-left and the lower-right images are the given images. Interpolation along the vertical axis changes the appearance, while interpolation along the horizontal axis changes the geometry. The intermediate images between the two corners are the interpolation results, where the geometry and appearance features change smoothly.
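Since both features are plain 512-d vectors, the morphing grid in Fig. 11 amounts to linear interpolation in each domain followed by decoding; a small sketch is given below, with the encoder/decoder handles assumed rather than taken from the paper.

```python
# Sketch of the geometry/appearance interpolation grid (hypothetical encode/decode handles).
import torch

def morph_grid(I_a, I_b, encode, decode, steps=5):
    """Return a steps x steps grid of images; rows interpolate appearance, columns geometry."""
    geo_a, app_a = encode(I_a)
    geo_b, app_b = encode(I_b)
    grid = []
    for i in range(steps):                       # appearance axis
        t_app = i / (steps - 1)
        row = []
        for j in range(steps):                   # geometry axis
            t_geo = j / (steps - 1)
            geo = torch.lerp(geo_a, geo_b, t_geo)
            app = torch.lerp(app_a, app_b, t_app)
            row.append(decode(geo, app))
        grid.append(row)
    return grid
```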
Fig. 11. Interpolation via geometry and appearance axes. The appearance gradually changes along each column, while the geometry changes along each row.

5) Partial Appearance Morphing With the Appearance Graph: In Fig. 8, we demonstrate the potential capability of the Appearance Graph in the Graph Recommendation Module, where by substituting some of the appearance features of the input image with those of the target image, we can get face images with blended appearances. Specifically, if we swap the part-level appearance features of {nose, mouth, leye, reye}, the Appearance Graph propagates the appearance features it infers from these nodes to the whole face while keeping the appearance of bg intact. On the other hand, by swapping the appearance feature of only bg, we get the skin color maintained while the hair color changes.

6) Perceptual Study: To evaluate the faithfulness of the synthesized results with respect to the given sentence, we prepared two user studies and asked human users to judge the effectiveness of our method in comparison with the existing solutions from two aspects: text-image correspondence and quality of the generated images. The box plots of these two user study statistics are illustrated in Fig. 12.

The evaluation was done via an online questionnaire to evaluate the correspondence between the given sentence and the synthesized results according to user perception. For each sample, we show a sentence depicting a portrait and 5 images synthesized by the compared methods, including AttnGAN [14], DM-GAN [51], TediGAN [28], StyleCLIP [3], and ours. To avoid bias and ensure score fairness, we place those images randomly. Each participant is asked to evaluate 10 examples according to 2 criteria: the text-image correspondence and the quality of the generated images, each of them on a 10-point Likert scale (1 = strongly negative to 10 = strongly positive). We invite 20 participants to participate in this study and get


20 (participants) × 10 (questions) = 200 subjective evaluations for each method. The statistics are shown in the supplementary materials. We perform one-way ANOVA tests on the 5 methods mentioned above and find significant effects for text-image correspondence (F(4,45) = 229.48, p < 0.001) and for image quality (F(4,45) = 360.23, p < 0.001). The results from paired t-tests further confirm the effectiveness of our method over the other methods, as illustrated in Tables II and III.
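The reported statistics can be reproduced from the rating matrices with standard SciPy routines, as in the sketch below; the array shapes and the aggregation of ratings per method are our assumptions about how the 200 evaluations per method were organized.

```python
# Sketch: one-way ANOVA across the five methods and paired t-tests against ours (SciPy).
import numpy as np
from scipy.stats import f_oneway, ttest_rel

# ratings[method] holds one aggregated score per question (assumed organization of the data).
methods = ["AttnGAN", "DM-GAN", "TediGAN", "StyleCLIP", "Ours"]
ratings = {m: np.random.uniform(1, 10, size=10) for m in methods}   # placeholder data

F, p = f_oneway(*[ratings[m] for m in methods])
print(f"one-way ANOVA: F = {F:.2f}, p = {p:.4f}")

for m in methods[:-1]:
    t, p = ttest_rel(ratings["Ours"], ratings[m])                   # paired t-test vs. each baseline
    print(f"Ours vs. {m}: t = {t:.2f}, p = {p:.4f}")
```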
TABLE II
T-TEST RESULTS FOR TEXT-IMAGE CORRESPONDENCE PERCEPTUAL STUDY

TABLE III
T-TEST RESULTS FOR IMAGE QUALITY PERCEPTUAL STUDY

Fig. 12. Two box plots illustrating the statistics from the perceptual studies. As discussed in Section IV-B6, the face images generated by our model are rated the highest in the user study of text-image correspondence. Such a result may benefit from the part-level control of each facial attribute. In terms of the quality of the generated images, our model generates high-fidelity images whose quality can be compared with that of StyleGAN [5]. The reason why TediGAN [28]'s image quality is rated lower than StyleCLIP [3] may be attributed to the edited latent codes, which lead to visual artifacts in the synthesized images.

Fig. 13. Less successful cases. The boundary between the ears and the hair is not clear in these images. Moreover, the curly hairstyles in the middle image and the right image are not perfectly handled. We believe that these two artifacts are due to the limited capability of the Global Generation Module in understanding such detailed spatial information from the input feature maps, as discussed in Section IV-C.

C. Failure Modes

Since our dataset only contains frontal faces, the failure modes of our pipeline mainly center around the non-frontal issue and the occlusion issue. Another failure mode is the issue of generating/reconstructing complicated hairstyles such as wavy hair, plaited hair, bangs, etc. We initially tested our model on the CelebA [13] and FFHQ [5] datasets but achieved under-expected results, and we believe that this can be attributed to the three failure modes mentioned above. More failure examples are illustrated in Fig. 13. As shown in Fig. 13, the left image contains unclear boundaries between the ears and the hair; the middle image and the right image both show artifacts around the ear which make it look like the person is wearing earrings.

V. LIMITATIONS

The motivation for our work originates from an entertainment and interaction setting. Therefore, directly using our model for applications such as criminal investigation is improper and should involve more dedicated considerations beforehand. In other words, one of our model's limitations, from the application perspective, is that the accuracy and experimental settings restrict it from being used to facilitate applications requiring extra accuracy.

Another limitation of our work, from the technical perspective, is that our model does not perform well on complicated hairstyles such as wavy hair, plaited hair, bangs, etc. Thus it could not generate faces with such hairstyles. We refer the readers to [46], [53] for how to manipulate complex hairstyles. More details about failure modes are appended in the supplementary materials.

Last but not least, the generation results of our model rely a lot on the dataset/database. The frontal faces used in our work require extensive work to generate and check their validity.


Limited by the diversity encoded in the StyleGAN generator, our database inherits such bias. The bias can be reduced as we continue enlarging our dataset. We will release the code and provide an online system when the dataset is diverse enough.

VI. CONCLUSION

In this article, we have presented a local-to-global framework for generating realistic facial images from pure textual inputs, enabling linguistic control over the geometry and appearance features of every facial part. We demonstrated the effectiveness of our method by comparing it with state-of-the-art text-based editing and text-to-image models as well as by conducting a convincing user study under a real-world scenario. However, our current pipeline may not apply to complex sentences. Generation from sentences with fuzzier descriptions is left for future work.

REFERENCES

[1] SmartClick, 2021. [Online]. Available: https://smartclick.ai/articles/how-artificial-intelligence-is-used-in-the-film-industry/
[2] A. Radford et al., "Learning transferable visual models from natural language supervision," in Proc. 38th Int. Conf. Mach. Learn., 2021, pp. 8748–8763.
[3] O. Patashnik, Z. Wu, E. Shechtman, D. Cohen-Or, and D. Lischinski, "StyleCLIP: Text-driven manipulation of StyleGAN imagery," in Proc. IEEE/CVF Int. Conf. Comput. Vis., 2021, pp. 2065–2074.
[4] W. Xia, Y. Yang, J.-H. Xue, and B. Wu, "Towards open-world text-guided face image generation and manipulation," 2021, arXiv:2104.08910.
[5] T. Karras, S. Laine, and T. Aila, "A style-based generator architecture for generative adversarial networks," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2019, pp. 4396–4405.
[6] Y. Jiang, Z. Huang, X. Pan, C. C. Loy, and Z. Liu, "Talk-to-edit: Fine-grained facial editing via dialog," in Proc. IEEE/CVF Int. Conf. Comput. Vis., 2021, pp. 13799–13808.
[7] T. Karras, S. Laine, M. Aittala, J. Hellsten, J. Lehtinen, and T. Aila, "Analyzing and improving the image quality of StyleGAN," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2020, pp. 8107–8116.
[8] T. Karras, M. Aittala, J. Hellsten, S. Laine, J. Lehtinen, and T. Aila, "Training generative adversarial networks with limited data," in Proc. Int. Conf. Neural Inf. Process. Syst., 2020, pp. 12104–12114.
[9] Y. Nitzan, A. Bermano, Y. Li, and D. Cohen-Or, "Face identity disentanglement via latent space mapping," ACM Trans. Graph., vol. 39, pp. 1–14, 2020.
[10] R. Abdal, P. Zhu, N. J. Mitra, and P. Wonka, "StyleFlow: Attribute-conditioned exploration of StyleGAN-generated images using conditional continuous normalizing flows," 2020, arXiv:2008.02401. [Online]. Available: https://arxiv.org/abs/2008.02401
[11] A. Tewari et al., "PIE: Portrait image embedding for semantic control," ACM Trans. Graph., vol. 39, 2020, Art. no. 223.
[12] S.-Y. Chen et al., "DeepFaceEditing: Deep face generation and editing with disentangled geometry and appearance control," ACM Trans. Graph., vol. 40, 2021, Art. no. 90.
[13] C.-H. Lee, Z. Liu, L. Wu, and P. Luo, "MaskGAN: Towards diverse and interactive facial image manipulation," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2020, pp. 5548–5557.
[14] X. Tao et al., "AttnGAN: Fine-grained text to image generation with attentional generative adversarial networks," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2018, pp. 1316–1324.
[15] M.-E. Nilsback and A. Zisserman, "Automated flower classification over a large number of classes," in Proc. Indian Conf. Comput. Vis. Graph. Image Process., 2008, pp. 722–729.
[16] T. Qiao, J. Zhang, D. Xu, and D. Tao, "MirrorGAN: Learning text-to-image generation by redescription," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2019, pp. 1505–1514.
[17] J. Cheng, F. Wu, Y. Tian, L. Wang, and D. Tao, "RiFeGAN: Rich feature generation for text-to-image synthesis from prior knowledge," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2020, pp. 10911–10920.
[18] C. Wah, S. Branson, P. Welinder, P. Perona, and S. Belongie, "The Caltech-UCSD birds-200-2011 dataset," California Institute of Technology, Pasadena, CA, Tech. Rep. CNS-TR-2011-001, 2011.
[19] A. Ramesh et al., "Zero-shot text-to-image generation," in Proc. 38th Int. Conf. Mach. Learn., 2021, pp. 8821–8831.
[20] A. Ramesh et al., "Hierarchical text-conditional image generation with CLIP latents," 2022, arXiv:2204.06125.
[21] A. Agostinelli et al., "MusicLM: Generating music from text," 2023, arXiv:2301.11325.
[22] D. Yang et al., "Diffsound: Discrete diffusion model for text-to-sound generation," IEEE/ACM Trans. Audio Speech Lang. Process., vol. 31, pp. 1720–1733, 2023.
[23] F. Kreuk et al., "AudioGen: Textually guided audio generation," 2023, arXiv:2209.15352.
[24] Y. Vinker et al., "CLIPasso: Semantically-aware object sketching," ACM Trans. Graph., vol. 41, 2022, Art. no. 86.
[25] G. Kwon and J. C. Ye, "CLIPstyler: Image style transfer with a single text condition," 2021, arXiv:2112.00374.
[26] P. Sangkloy, W. Jitkrittum, D. Yang, and J. Hays, "A sketch is worth a thousand words: Image retrieval with text and sketch," in Proc. Eur. Conf. Comput. Vis., 2022, pp. 251–267.
[27] O. Michel, R. Bar-On, R. Liu, S. Benaim, and R. Hanocka, "Text2Mesh: Text-driven neural stylization for meshes," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2022, pp. 13492–13502.
[28] W. Xia, Y. Yang, J.-H. Xue, and B. Wu, "TediGAN: Text-guided diverse face image generation and manipulation," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2021, pp. 2256–2265.
[29] J. Sun, Q. Deng, Q. Li, M. Sun, M. Ren, and Z. Sun, "AnyFace: Free-style text-to-face synthesis and manipulation," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2022, pp. 18666–18675.
[30] Z. Huang, K. C. Chan, Y. Jiang, and Z. Liu, "Collaborative diffusion for multi-modal face generation and editing," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2023, pp. 6080–6090.
[31] N. Ruiz, Y. Li, V. Jampani, Y. Pritch, M. Rubinstein, and K. Aberman, "DreamBooth: Fine tuning text-to-image diffusion models for subject-driven generation," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2023, pp. 22500–22510.
[32] G. Kim, T. Kwon, and J. C. Ye, "DiffusionCLIP: Text-guided diffusion models for robust image manipulation," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2022, pp. 2426–2435.
[33] Y. Ma, H. Yang, W. Wang, J. Fu, and J. Liu, "Unified multi-modal latent diffusion for joint subject and text conditional image generation," 2023, arXiv:2303.09319.
[34] R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer, "High-resolution image synthesis with latent diffusion models," 2021, arXiv:2112.10752.
[35] J. N. M. Pinkney and C. Li, "Clip2latent: Text driven sampling of a pre-trained StyleGAN using denoising diffusion and CLIP," 2022, arXiv:2210.02347.
[36] H. Wu et al., "Unified visual-semantic embeddings: Bridging vision and language with structured meaning representations," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2019, pp. 6609–6618.
[37] M. Honnibal, I. Montani, S. Van Landeghem, and A. Boyd, "spaCy: Industrial-strength natural language processing in Python," 2020, doi: 10.5281/zenodo.1212303.
[38] B. Zhu, C. Lin, Q. Wang, R. Liao, and C. Qian, "Fast and accurate: Structure coherence component for face alignment," 2020, arXiv:2006.11697.
[39] Y. Wang, Y. Sun, Z. Liu, S. E. Sarma, M. M. Bronstein, and J. M. Solomon, "Dynamic graph CNN for learning on point clouds," ACM Trans. Graph., vol. 38, 2019, Art. no. 146.
[40] T.-C. Wang, M.-Y. Liu, J.-Y. Zhu, A. Tao, J. Kautz, and B. Catanzaro, "High-resolution image synthesis and semantic manipulation with conditional GANs," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2018, pp. 8798–8807.
[41] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2016, pp. 770–778.
[42] X. Mao, Q. Li, H. Xie, R. Y. K. Lau, Z. Wang, and S. P. Smolley, "Least squares generative adversarial networks," in Proc. IEEE Int. Conf. Comput. Vis., 2017, pp. 2813–2821.
[43] E. Härkönen, A. Hertzmann, J. Lehtinen, and S. Paris, "GANSpace: Discovering interpretable GAN controls," in Proc. Int. Conf. Neural Inf. Process. Syst., 2020, pp. 9841–9850.
[44] E. Richardson et al., "Encoding in style: A StyleGAN encoder for image-to-image translation," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2021, pp. 2287–2296.


[45] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," in Proc. 3rd Int. Conf. Learn. Representations, 2015, pp. 1–14.
[46] Z. Tan et al., "MichiGAN: Multi-input-conditioned hair image generation for portrait editing," ACM Trans. Graph., vol. 39, no. 4, pp. 1–13, 2020.
[47] seeprettyface, 2020. [Online]. Available: seeprettyface.com
[48] Face++, 2020. [Online]. Available: https://www.faceplusplus.com.cn/face-detection/
[49] Microsoft, 2020. [Online]. Available: https://docs.microsoft.com/en-in/azure/cognitive-services/face/
[50] Alibaba, 2020. [Online]. Available: https://help.aliyun.com/document_detail/130846.html
[51] M. Zhu, P. Pan, W. Chen, and Y. Yang, "DM-GAN: Dynamic memory generative adversarial networks for text-to-image synthesis," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2019, pp. 5802–5810. [Online]. Available: http://openaccess.thecvf.com/content_CVPR_2019/html/Zhu_DM-GAN_Dynamic_Memory_Generative_Adversarial_Networks_for_Text-To-Image_Synthesis_CVPR_2019_paper.html
[52] A. Tewari et al., "StyleRig: Rigging StyleGAN for 3D control over portrait images," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2020, pp. 6141–6150.
[53] C. Xiao, D. Yu, X. Han, Y. Zheng, and H. Fu, "SketchHairSalon: Deep sketch-based hair image synthesis," ACM Trans. Graph., vol. 40, 2021, Art. no. 216.

Zhaoyang Zhang received the BEng degree (summa cum laude) in computer science and technology from the University of Chinese Academy of Sciences (UCAS), in June 2022. He is currently working toward the PhD degree with the Yale Computer Graphics Group. His research interests lie in 2D and 3D digital content creation.

Junliang Chen is currently working toward the master's degree with the Department of Film and TV Technology, Beijing Film Academy. His research interests include digital film technology.

Hongbo Fu (Senior Member, IEEE) received the BS degree in information sciences from Peking University, China, in 2002, and the PhD degree in computer science from the Hong Kong University of Science and Technology, in 2007. He is a full professor with the School of Creative Media, City University of Hong Kong. His primary research interests fall in the fields of computer graphics and human-computer interaction. He has served as an associate editor of The Visual Computer, Computers & Graphics, and Computer Graphics Forum.

Jianjun Zhao received the PhD degree in computer science from the Institute of Computing Technology, Chinese Academy of Sciences. He is an associate professor with the Department of Film and TV Technology, Beijing Film Academy. His research focuses on film virtual production and physics-based character animation.

Shu-Yu Chen received the PhD degree in computer science and technology from the University of Chinese Academy of Sciences. She is currently working as an assistant professor with the Institute of Computing Technology, Chinese Academy of Sciences. Her research interests include computer graphics.

Lin Gao (Member, IEEE) received the PhD degree in computer science from Tsinghua University. He is currently an associate professor with the Institute of Computing Technology, Chinese Academy of Sciences. He has been awarded the Newton Advanced Fellowship from the Royal Society and the AG Young Researcher Award. His research interests include computer graphics and geometric processing.
