
SPECIAL SECTION ON DEEP LEARNING: SECURITY AND FORENSICS RESEARCH ADVANCES AND CHALLENGES

Received October 24, 2019, accepted November 17, 2019, date of publication November 28, 2019, date of current version December 23, 2019.
Digital Object Identifier 10.1109/ACCESS.2019.2956574

Generating Users' Desired Face Image Using the Conditional Generative Adversarial Network and Relevance Feedback

CAIE XU¹, YING TANG¹, MASAHIRO TOYOURA¹ (Member, IEEE), JIAYI XU², AND XIAOYANG MAO¹,² (Member, IEEE)

¹Faculty of Engineering, University of Yamanashi, Kofu 400-0015, Japan
²School of Computer Science and Technology, Hangzhou Dianzi University, Hangzhou 310013, China

Corresponding author: Xiaoyang Mao ([email protected])

This work was supported in part by the JAPAN JSPS KAKENHI under Grant 17H00737 and Grant 17H00738, and in part by the Natural Science Foundation of Zhejiang Province, China, under Grant LGF18F020015.

ABSTRACT In this study, we propose a novel method for generating an image of the target face by using the
generative adversarial network (GAN) and relevance feedback. Combining GAN with relevance feedback
compensates for the lack of user intervention in GAN and the low image quality in traditional methods. The
feature points of face images, namely, the landmarks, are used as the conditional information for GAN to
control the detailed features of the generated face image. An optimum-path forest classifier is applied to
categorize the relevance of training images based on the user’s feedback so that the user can quickly retrieve
the training images that are most similar to the target face. The retrieved training images are then used to
create a new landmark for synthesizing the target face image. The experimental results showed that users
can generate images of their desired faces, including the faces in their memory, with a reasonable number
of iterations, therefore demonstrating the potential of applying the proposed method for forensic purposes,
such as creating the face images of criminals based on the memories of witnesses or victims.

INDEX TERMS Face image generation, relevance feedback, optimum-path forest, conditional generative
adversarial network, landmark.

The associate editor coordinating the review of this manuscript and approving it for publication was Zhen Qin.

I. INTRODUCTION

Recently, with the rapid development of deep learning technology, various research areas and applications, such as computer vision, robotics, big data analysis, and pilotless automobiles, have achieved major advancements. The field of face image generation and synthesis is no exception, as it has also undergone significant developments. In particular, the emergence of the generative adversarial network (GAN), a type of neural network architecture for generative models first proposed by Goodfellow et al. in 2014 [1], brought about a major breakthrough in the field of face image generation. GAN consists of two networks: the generator, which creates data that are as realistic as possible, and the discriminator, which attempts to distinguish fake samples from real ones. The two networks compete with each other during the training process, resulting in a generator that can produce realistic data.

Since the very first GAN model [1] demonstrated its ability to generate face images, various improved models have been developed. The face images generated with [1] are fair random draws rather than cherry-picked results, but they are poor-quality grayscale images. To gain some control over the generated results, Mirza and Osindero proposed the conditional generative adversarial network (CGAN) in the same year, which allows inputting a condition to the model in addition to the noise [2]. This model set a solid foundation for the emergence of various variants of GAN. In 2015, Gauthier proposed the use of CGAN for convolutional face generation [3], which added to CGAN the capability of generating face images with specific attributes, such as race, age, and emotion, by varying the conditional information. Antipov et al. proposed AgeGAN [4] for automatically simulating face aging based on CGAN; AgeGAN particularly emphasizes the preservation of the original person's identity in the aged version of his/her face image.


FIGURE 1. Results from our proposed method. The images in the first row are the references, and those in the second row are the corresponding created
face images.

The Two-Pathway Generative Adversarial Network (TP-GAN) [5] was proposed by Huang et al. in 2017 for realistic face synthesis. It is mainly used for reconstructing face images from a partial view corresponding to different poses. The most recent variant, StyleGAN [6], proposed by Karras et al., led to an automatically learned, unsupervised separation of high-level attributes, and it can synthesize high-quality face images with varying high-level attributes, such as different hairstyles and expressions. However, it considers stochastic variation and does not have the ability to control such attributes. Di et al. focused on high-level face-related analysis tasks and proposed the gender-preserving GAN (GP-GAN) [7], which can synthesize corresponding face images from landmarks; these feature points represent the geometric information of the overall shape and the individual parts of the face. Although controlling geometric features, such as the pose, the shape of the face, and the individual facial parts, is possible with GP-GAN, it requires the landmarks as input. The application of GP-GAN is therefore limited without providing users a method to create the landmarks of their desired face.

Bontrager et al. [8] proposed an approach based on Wasserstein GAN [9] and interactive evolutionary computation [10] to produce an image resembling a given target. The user is asked to evaluate a set of images resulting from GAN, and a genetic algorithm is used to modify the latent vector based on the user's evaluation. This is the first work that demonstrated the potential of using the evolutionary algorithm to generate face images similar to target faces. However, their evaluation experiment reported that the average score of the results was only 2.2 out of 5. Furthermore, the method cannot provide control over detailed facial features.

To the best of our knowledge, none of the existing GAN models provides users with easy control over detailed facial features, such as the shapes and positions of the individual parts of the face. The ability to control detailed facial features is required in many applications. One important application is in assisting the police to create the face images of suspects based on the memories of witnesses or victims. Other potential applications include visually presenting an image of the user's imagined face. This study proposes a novel method combining GP-GAN and relevance feedback for interactive face image generation. An optimum-path forest (OPF) classifier is used to define the desired facial features represented as landmarks. The classifier is iteratively updated based on the relevance feedback of users. The landmarks of the desired face are then used as the input to GP-GAN to generate realistic face images with the desired features. In this way, the proposed method can take full advantage of the high image quality of state-of-the-art GAN technology while compensating for its lack of user intervention. Fig. 1 shows some results generated with the proposed system. The user provides feedback to the system based on the similarity between the created image and the target one. After several iterations, a face image resembling the target image is created. Our experiment showed that the proposed method can generate a result similar to the target face in the user's memory, demonstrating its potential for forensic purposes, such as assisting the police to create the face image of a suspect based on the feedback of victims or witnesses.

In the remainder of this paper, we first review related works in Section II and then present our proposed method in Section III. Section IV discusses the results and describes the experiment for evaluation. Section V and Section VI conclude the paper by discussing the limitations of the study and presenting future research directions.

II. RELATED WORKS
A. THE GENERATIVE ADVERSARIAL NETWORK FOR SYNTHESIZING FACE IMAGES
While a large number of works have focused on facial recognition and identification, very few studies were conducted on face image synthesis until a few years ago. Significant developments in machine learning technologies led to major advancements in the field; in particular, GAN technology has made face image generation a current research hotspot. The original GAN [1] aims to estimate generative models via an adversarial process, which consists of a generator network G that captures the data distribution and a discriminator network D that estimates the probability of a generated sample coming from the real data rather than from G, where G and D are trained simultaneously. The input to GAN is random noise, and G is trained in a way that maximizes the probability of D making a mistake.
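As a minimal illustration of this adversarial objective (a generic sketch, not the specific architecture of any model discussed in this paper; `G` and `D` are assumed to be a generator taking a noise vector and a discriminator outputting a probability in [0, 1]):

```python
import torch
import torch.nn.functional as F

def gan_step(G, D, real, opt_g, opt_d, z_dim=100):
    """One adversarial update: D learns to separate real from fake; G learns to fool D."""
    z = torch.randn(real.size(0), z_dim, device=real.device)
    fake = G(z)
    ones = torch.ones(real.size(0), 1, device=real.device)
    zeros = torch.zeros(real.size(0), 1, device=real.device)

    # Discriminator step: push D(real) toward 1 and D(fake) toward 0.
    d_loss = F.binary_cross_entropy(D(real), ones) + \
             F.binary_cross_entropy(D(fake.detach()), zeros)
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # Generator step: maximize the probability that D mistakes fakes for real.
    g_loss = F.binary_cross_entropy(D(fake), ones)
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
```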


To compensate for the disadvantages of GAN, such as the lack of control over the generated results and the low image quality, CGAN was proposed in [2], allowing the generation of images with certain attributes. In practice, the conditions can be any information related to the target face image, such as the facial pose or a facial attribute. On this basis, AgeGAN [4] and the CGAN tailored for convolutional face image generation [3] were proposed. The former focuses on generating face images of different ages, with a particular emphasis on preserving a person's identity in the aged version of his/her face. The latter can generate face images with specific attributes by varying the conditional information; it exploits facial expression attributes and other related features, such as race and age, as conditional data. However, these attributes are tagged to the data when preparing the training dataset, so there is no way to reflect the user's intention at execution time. Another GAN model that contributes to face image synthesis is TP-GAN [5], which focuses on synthesizing and reconstructing realistic frontal-view face images from partial-view face images. Its results perform well in face recognition. While the face images generated with GAN are more realistic, one major challenge is how to control the generated results, such as the pose, the face shape, and the hairstyle. StyleGAN [6] is a novel model that addresses this issue. StyleGAN features a multi-resolution structure; by modifying the input of each level separately, it succeeds in controlling the output from coarse features (pose, face shape) to fine details (hair color). However, all these existing GANs, including StyleGAN, cannot provide users any control at runtime. Bontrager et al. proposed an approach [8] based on Wasserstein GAN [9] and the evolutionary algorithm [10] to produce user-desired images. The user is asked to evaluate a set of images resulting from GAN, and a genetic algorithm is applied to modify the noise input based on the user's evaluation. Their paper showed some examples of using the method for generating face images resembling target faces. However, according to their user evaluation experiment, the average score of the generated images was only 2.2 out of 5.

GP-GAN [7] attempts to perform high-level face-related analysis tasks based on landmarks [11], which are the feature points characterizing the geometric features of faces. In the training of GP-GAN, both the face image and the corresponding landmarks are inputted to the generator G. The trained generator model can produce a face image that is similar to the image from which the landmarks were extracted. In our study, we used landmarks as the features for training the classifier based on the user's feedback and utilized GP-GAN for synthesizing realistic face images from the landmarks.

B. USER CONTROL IN FACE IMAGE SYNTHESIS
A very typical application requiring user intervention in face image synthesis is assisting the police in investigations. Electronic facial identification (E-FIT) [12] is a face image synthesis system that can produce the facial composites of wanted criminals based on eyewitness descriptions. The core concept in E-FIT is the technique of montage synthesis, which requires the user to look through a dataset of face components (eyebrows, eyes, noses, mouths, etc.) in order to search for each part separately based on resemblance and to composite a face image using the selected parts. However, finding the ideal parts is time consuming, and it is difficult to ensure that the synthesized face image is the desired one, even if each component is satisfactory.

Wu and Dai [13] proposed a method to synthesize face images from sketches. Wang and Tang developed a method [14] enabling bidirectional photo-sketch mapping, which can synthesize a face sketch from a photo and, conversely, a photo from a face sketch. The results of these methods depend on the quality of the sketches; average users, however, cannot draw sketches well.

The evolutionary algorithm was explored by Gibson et al.; it uses both local and global models, allowing a witness to evolve plausible, photo-realistic face images in an intuitive way [10]. However, the algorithm suffers from slow convergence, requiring dozens or even more than a hundred iterations.

Xu et al. proposed an approach to synthesize images based on relevance feedback in order to overcome the slow convergence and to consider the whole face image together with the user's feedback [15]. Through a dialogic approach based on a relevance feedback strategy, the user only needs to look at several candidate face images and assess whether each image is similar to the target one. The selected candidates are interpolated to create the user's desired image. Unfortunately, the results are sometimes blurred and lacking in details, and the convergence of the relevance feedback can be slow, as it uses the principal components of grayscale images to build the feature space and synthesize the face image. Moreover, only interpolating from the training face images limits the variations of the face images that can be created. The proposed method uses a relevance feedback framework similar to [15], but by combining landmark features with the newest GAN instead of using principal components, it can produce high-quality images with a smaller number of user iterations. A new method of creating a completely new face image that is even better than the best training samples is also developed.

III. PROPOSED METHOD
A. OVERVIEW
As depicted in Fig. 2, the proposed method consists of two parts: (1) a relevance feedback framework for users to generate the landmarks of new candidate faces by evaluating the sample face images and (2) the face image generator using GP-GAN with the new landmarks resulting from the relevance feedback process. In the offline training phase, the algorithm provided by Dlib [16] is applied to train the network for extracting the landmark features; these extracted landmarks are then used for training GP-GAN.

The relevance feedback framework consists of three parts: constructing the feature space, training the OPF classifier, and exploring the candidate feature vectors.
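For concreteness, landmark features of the kind used throughout this paper can be extracted with Dlib's standard 68-point shape predictor; the sketch below uses the publicly available pretrained model as a stand-in for the predictor trained in this work (the model path is an assumption):

```python
import dlib
import numpy as np

detector = dlib.get_frontal_face_detector()
# Assumes the standard 68-point model file is available locally.
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

def extract_landmarks(image):
    """Return the 68 (x, y) landmark points of the first detected face, or None."""
    faces = detector(image, 1)  # upsample once to help with small faces
    if not faces:
        return None
    shape = predictor(image, faces[0])
    return np.array([(shape.part(i).x, shape.part(i).y) for i in range(68)])
```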


FIGURE 2. Overview of the proposed method.

We construct the feature space based on the extracted landmark features. The OPF classifier is then trained on the feature space based on the relevance feedback, and the user explores the candidate feature vectors by using the trained OPF classifier. The network of GP-GAN is trained with the landmarks extracted from the training dataset; during the running stage, the trained generator produces face images from the new landmarks obtained by interpolating the candidate landmarks, and the final result is created from the interpolated landmarks of the candidate faces.

The above relevance feedback framework relies on two core techniques: training an image classifier based on the user's feedback and exploring the candidates in the landmark feature space. The two following subsections describe the details of these two techniques, respectively.

B. RELEVANCE FEEDBACK FRAMEWORK
As depicted in Fig. 3, the relevance feedback framework is realized in the following five steps:

Step 1: The system randomly selects 10 face images from the training dataset and shows these images to the user as the initial set. The user chooses the most similar image from the initial set, and the system then shows the 10 images that are most similar to the selected one in order to initialize the relevance feedback process.
Step 2: The user labels each shown image as similar (◦) or not similar (×).
Step 3: Based on the user's feedback, the system updates the classifier, explores some candidate landmarks in the feature space, generates face images from them using GP-GAN, and then presents the generated results to the user.
Step 4: If the user feels that the generated results contain images similar to the desired face to some extent, the user selects up to k most similar faces from these images, and the selected images are added to the candidate set. If the user instead chooses to continue the iteration, the system shows the 10 images near the border of the classifier and goes back to step 2. Otherwise, the system proceeds to step 5.
Step 5: The user specifies the degrees of similarity to the images in the candidate list. The landmarks of these images are interpolated using the degrees of similarity as the weights to produce the landmark that is sent to GP-GAN for producing the resulting face image (a small sketch of this interpolation follows below).
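As a minimal sketch of the interpolation in step 5 (assuming each candidate's landmarks are stored as a NumPy array of (x, y) points; the function name and array shapes are our own illustration):

```python
import numpy as np

def interpolate_landmarks(landmark_sets, similarity_scores):
    """Blend the candidate landmark sets, weighted by user-given similarity."""
    weights = np.asarray(similarity_scores, dtype=float)
    weights /= weights.sum()                       # normalize the weights to sum to 1
    stacked = np.stack(landmark_sets)              # shape: (k, 68, 2)
    return np.tensordot(weights, stacked, axes=1)  # weighted-average landmark set
```

The blended landmark set is then rendered as a landmark image and passed to the trained GP-GAN generator.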
1) TRAINING THE OPF CLASSIFIER
We reflect the user's intention by first training a face image classifier based on the user's feedback. The OPF is used for training the classifier because it is known to handle large datasets effectively and efficiently compared with other representative classification algorithms, such as the support vector machine and the k-nearest neighbors algorithm [17].

The OPF represents each class of images by optimum-path trees rooted at given representative samples, called prototypes [18], [19]. It starts as a complete graph whose nodes represent the feature vectors of all training samples in the dataset (Fig. 4c). All pairs of nodes are linked by arcs that are weighted by the distances (referred to as costs hereafter) between the feature vectors of the corresponding nodes. At each iteration of the relevance feedback, after the user labels the 10 samples as relevant or irrelevant (step 2 of the aforementioned framework), a minimum spanning tree is constructed for the labeled samples (Fig. 4d), and the adjacent pairs of relevant and irrelevant samples are chosen as the relevant and irrelevant prototypes, respectively (Fig. 4e). Then, the graph is repartitioned by a competition process among the prototypes, which offer optimum paths (the paths with the lowest cost) to the remaining nodes of the graph and classify all nodes as relevant or irrelevant depending on whether they are connected to a relevant or an irrelevant prototype (Fig. 4f). To effectively update the classifier based on the user's feedback, the subsequent 10 samples to be labelled by the user are chosen from the nodes near the border of the classifier. The border nodes can be selected by investigating the costs of the paths from all non-prototype nodes to both the relevant and the irrelevant prototypes. We define the border nodes as the non-prototype nodes that belong to the relevant class and have the largest ratio of the cost from these nodes to the relevant prototypes over the cost from these nodes to the irrelevant prototypes; we define the best node as the positive sample located farthest from the irrelevant prototypes and closest to the relevant prototypes, i.e., the one with the smallest such ratio.

In the current implementation, the cost between any two adjacent nodes is assigned using the L2 norm distance. We assume that there are s non-prototype samples U_k (k = 1, 2, 3, ..., s) that belong to the relevant class, and n relevant and m irrelevant prototypes, denoted as p_i (i = 1, 2, 3, ..., n) and q_j (j = 1, 2, 3, ..., m), respectively. We denote the cost of the path from a non-prototype sample U_k to the relevant prototype p_i as C_{U_k→p_i}, and the cost of the path from U_k to the irrelevant prototype q_j as C_{U_k→q_j}.


FIGURE 3. Relevance feedback framework.

The ratio of C_{U_k→p_i} to C_{U_k→q_j}, denoted as Relevance_{U_k→(p_i,q_j)}, can be computed as follows:

$$\mathrm{Relevance}_{U_k \to (p_i, q_j)} = \frac{\mathrm{Avg}\left(C_{U_k \to p_i}\right)}{\mathrm{Avg}\left(C_{U_k \to q_j}\right)}.$$

The 10 border nodes with the largest values of Relevance_{U_k→(p_i,q_j)} are chosen for the user to label. The best node is the one with the smallest value of Relevance_{U_k→(p_i,q_j)}.

FIGURE 4. Supervised training of OPF.
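The following minimal NumPy sketch illustrates this selection rule (the array names and shapes are our own assumptions; `cost_to_relevant` and `cost_to_irrelevant` are meant to hold the optimum-path costs C_{U_k→p_i} and C_{U_k→q_j} for the s relevant-class non-prototype samples):

```python
import numpy as np

def select_border_and_best(cost_to_relevant, cost_to_irrelevant, n_border=10):
    """Rank relevant-class non-prototype samples by the relevance ratio.

    cost_to_relevant:   (s, n) optimum-path costs to the n relevant prototypes.
    cost_to_irrelevant: (s, m) optimum-path costs to the m irrelevant prototypes.
    """
    # Relevance = Avg(cost to relevant prototypes) / Avg(cost to irrelevant ones)
    relevance = cost_to_relevant.mean(axis=1) / cost_to_irrelevant.mean(axis=1)
    order = np.argsort(relevance)
    border_idx = order[-n_border:]  # largest ratios: ambiguous samples near the border
    best_idx = order[0]             # smallest ratio: the best (most relevant) node
    return border_idx, best_idx
```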
2) CREATING THE CANDIDATE LANDMARKS
In order to create the candidate landmarks of the desired face, we try to move the best node along a certain direction vector $\vec{v}$ so that it becomes closer to the relevant prototypes and farther from the irrelevant ones. There are three core issues here: the direction in which the best node should be moved, the step size of the movement, and the valid range of the distance to be moved.

a: THE DIRECTION OF MOVEMENT
To obtain the direction vector $\vec{v}$ of the movement, we compute two composited vectors: $\vec{v}_r$, which is the summation of the vectors from the best node to all relevant prototypes, and $\vec{v}_{ir}$, which is the summation of the vectors from all irrelevant prototypes to the best node. For example, given the best node depicted by the red circle in Fig. 5, together with two relevant and two irrelevant prototypes, we first calculate the composite vector $\vec{v}_r$ by summing the vectors from the best node to the two relevant prototypes. Second, we calculate another composite vector $\vec{v}_{ir}$ by summing the vectors from the two irrelevant prototypes to the best node. Finally, the direction vector $\vec{v}$ of the movement is obtained as the composite of $\vec{v}_r$ and $\vec{v}_{ir}$.


FIGURE 5. Direction of movement.

Assume that there are n relevant prototypes and m irrelevant prototypes, denoted as $r_i$ ($i = 1, 2, 3, \ldots, n$) and $ir_i$ ($i = 1, 2, 3, \ldots, m$), respectively. With the best node denoted as $b$, the direction vector $\vec{v}$ can be computed as follows:

$$\vec{v} = \vec{v}_r + \vec{v}_{ir}, \qquad \vec{v}_r = \sum_{i=1}^{n} \overrightarrow{b\,r_i}, \qquad \vec{v}_{ir} = \sum_{i=1}^{m} \overrightarrow{ir_i\,b}.$$
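A minimal sketch of this computation (assuming the landmark feature vectors are NumPy arrays; the function name is ours):

```python
import numpy as np

def movement_direction(best, relevant_prototypes, irrelevant_prototypes):
    """Compose the direction vector v = v_r + v_ir for the best node b."""
    b = np.asarray(best)
    # v_r: sum of the vectors from the best node b to every relevant prototype r_i
    v_r = np.sum(np.asarray(relevant_prototypes) - b, axis=0)
    # v_ir: sum of the vectors from every irrelevant prototype ir_i to the best node b
    v_ir = np.sum(b - np.asarray(irrelevant_prototypes), axis=0)
    return v_r + v_ir
```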

b: THE STEP SIZE OF MOVEMENT
The step size of the movement is a critical parameter in our study, as it controls how much the result changes each time. Choosing an appropriate distance of movement is difficult, as a value that is too small may make the change insufficiently obvious, whereas a value that is too large may result in skipping the optimal result. The sensitivity of the step size should depend on the extent of the dataset in the feature space. Therefore, we first compute the diagonal line of the bounding box of the training dataset in the feature space. With the length of the diagonal line of the bounding box denoted as l, the step size of the movement is given as α × l. Here, α is a parameter controlled by the user.

c: THE DISTANCE OF MOVEMENT
In theory, any point in the feature space represents the landmark of a face, and we can obtain a face image by inputting the point to GP-GAN. However, empirically, we found that a point located outside the bounding box of the training dataset has a high probability of not defining a face (as depicted in Fig. 6). Therefore, when moving along $\vec{v}$, we also validate whether the new position is still within the bounding box. If it exceeds the bounding box, we stop the exploration.

FIGURE 6. Unsafe area for creating new landmarks.
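Putting the three issues together, a minimal sketch of one exploration move (the helper name and the normalization of the direction vector are our assumptions):

```python
import numpy as np

def explore_candidate(best, direction, train_features, alpha=0.05):
    """Move the best node along `direction` by alpha * l within the bounding box.

    `alpha` is the user-controlled step parameter; `l` is the diagonal length of
    the bounding box of the training dataset in the landmark feature space.
    """
    lo, hi = train_features.min(axis=0), train_features.max(axis=0)
    l = np.linalg.norm(hi - lo)                  # diagonal of the bounding box
    step = alpha * l * direction / np.linalg.norm(direction)
    candidate = np.asarray(best) + step
    if np.any(candidate < lo) or np.any(candidate > hi):
        return None                              # left the safe area: stop exploring
    return candidate
```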
3) GENERATIVE MODEL FOR SYNTHESIZING FACE IMAGES (GP-GAN)
The purpose of GP-GAN [7] is to synthesize faces from their respective landmarks. As shown in Fig. 7, similar to the traditional GAN network, GP-GAN consists of two components: a generator G and a discriminator D, in which G is designed based on the U-Net [20] and DenseNet [21] architectures to leverage the advantages of these architectures, and D is devised under the patch-based principle. Given a landmark, G does its best to generate the corresponding face image, whereas D tries its best to distinguish between real data and the generated images. Unlike other GANs, the network of GP-GAN adds a perceptual sub-network (based on the VGG-16 architecture) and a gender-preserving one in addition to the discriminator. The model is learned with an adversarial loss, a perceptual loss, and a gender-preserving loss by minimizing the following objective function:

$$L = L_A + \lambda_P L_P + \lambda_C L_C + \lambda_1 L_1.$$

Here, $L_A$, $L_P$, and $L_C$ represent the adversarial loss, the perceptual loss, and the gender-preserving loss, respectively. The adversarial loss $L_A$ is based on the discriminator sub-network D; the perceptual loss $L_P$ guides the generator using the L1 distance between the high-level features extracted from the VGG-16 [22] network; the gender-preserving loss $L_C$ measures the difference between the gender features of the produced image and the real image; and $L_1$ defines the error between the target and the generated image. The corresponding weights of the losses are $\lambda_P$, $\lambda_C$, and $\lambda_1$.

FIGURE 7. Structure of GP-GAN.
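A hedged sketch of how such a composite generator objective can be assembled (the callables `D`, `vgg_features`, and `gender_net` stand in for the discriminator, the VGG-16 perceptual sub-network, and the gender-preserving sub-network described above; they and the default weights are our assumptions, not the authors' exact implementation):

```python
import torch
import torch.nn.functional as F

def generator_loss(fake, real, D, vgg_features, gender_net,
                   lam_p=1.0, lam_c=1.0, lam_1=1.0):
    """L = L_A + lambda_P * L_P + lambda_C * L_C + lambda_1 * L_1."""
    d_out = D(fake)
    # L_A: adversarial loss from the patch-based discriminator.
    l_a = F.binary_cross_entropy(d_out, torch.ones_like(d_out))
    # L_P: L1 distance between high-level VGG-16 features.
    l_p = F.l1_loss(vgg_features(fake), vgg_features(real))
    # L_C: difference between the gender features of the produced and real images.
    l_c = F.l1_loss(gender_net(fake), gender_net(real))
    # L_1: pixel-wise error between the target and the generated image.
    l_1 = F.l1_loss(fake, real)
    return l_a + lam_p * l_p + lam_c * l_c + lam_1 * l_1
```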


IV. EXPERIMENT AND EVALUATION

A. DATASETS AND IMPLEMENTATION DETAILS
For the generative model of GP-GAN, we learned its parameters based on the whole Labeled Faces in the Wild (LFW) dataset [23], [24], which contains 5,749 identities and 13,233 face images. There are now four different sets of LFW images: the original and three different types of aligned images. The aligned sets include the funneled images (ICCV 2007), LFW-a, and the deep funneled images (NIPS 2012). The model of GP-GAN is trained on the official deep-funneled aligned images [23], [24], and it uses the official training, validation, and testing View 1 in this experiment. The details of the training can be found in [7]. However, to obtain high-quality results, we adjusted some parameters for achieving the best learning rate and number of epochs.

In the current implementation, the model of GP-GAN is trained on a single GTX 1070 GPU for more than 40 hours (800 epochs). Landmark images are represented as black solid dots on a white background. During the training stage, both a landmark image and its corresponding real data are inputted into the network of GP-GAN. We found that the appearance of the black dots can largely affect the performance of the trained GP-GAN model. Probably because of aliasing, the resulting face image from the generator is not ideal when the landmarks are presented as a binary image. To address this issue, we use a grayscale image, making the center of each dot the darkest and then changing it to white gradually toward the edge in order to achieve a good anti-aliasing effect. These kinds of grayscale landmark images are used for both training and testing.
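A minimal sketch of such an anti-aliased landmark rendering (our own illustration; the Gaussian falloff and the image size are assumptions, not the authors' exact recipe):

```python
import numpy as np

def render_landmark_image(points, size=256, sigma=1.5):
    """Render landmark points as soft dark dots on a white background."""
    ys, xs = np.mgrid[0:size, 0:size]
    img = np.ones((size, size), dtype=np.float32)   # white background
    for px, py in points:
        # Gaussian dot: darkest at the center, fading to white toward the edge.
        dot = np.exp(-((xs - px) ** 2 + (ys - py) ** 2) / (2.0 * sigma ** 2))
        img = np.minimum(img, 1.0 - dot)
    return (img * 255).astype(np.uint8)
```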
The LFW dataset consists of face images with different head poses and different expressions. As the geometric difference between two different poses is much larger than the geometric difference between the facial parts of two different persons, the pose feature is much more dominant than the shape features of the individual parts (e.g., the size and shape of the eyes and nose). Therefore, if we include faces with different head poses when training the OPF with relevance feedback, the detailed features of the individual parts tend to be neglected. Similarly, there is a large geometric difference between an open mouth and a closed one. Based on these observations, when training the OPF classifier, we only chose a total of 1,000 frontal face images from the LFW and divided these into sub-datasets by gender and open/closed mouth. For each subset, 80% of the face images are used for training, and the remaining 20% are used for testing. During the runtime relevance feedback process, the initially chosen 10 images consist of faces of different genders and open/closed mouths. When the user selects the most similar image, the subset of training samples containing the selected face is automatically loaded for constructing the OPF.

In the current implementation, we set k = 3 for step 4 of the relevance feedback framework in Section III. Therefore, the user is allowed to select up to the three most similar faces from the generated face images for addition into the candidate set at each iteration.

FIGURE 8. Created face images based on the reference image. First row: reference images. Second row: the corresponding created face images.

B. EXPERIMENTS
Three types of experiments are conducted to validate the effectiveness of the proposed method. The first experiment invited participants to create face images, and the second experiment had another group of participants evaluate the generated results. The third experiment aimed to compare the proposed method with the existing one using a similar relevance feedback framework [15]. In all the experiments, the participants were asked to ignore the hairstyles in the face images.

1) EXPERIMENT FOR CREATING FACE IMAGES
This experiment includes three tasks: to generate face images based on a reference image, to generate face images according to a briefly presented reference image, and to synthesize imagined face images. The second task is particularly designed by assuming forensic applications, such as assisting the police to create face images of criminals. The participants scored their generated results according to the image's similarity to the reference image on a five-point scale (1: no resemblance; 2: very weak resemblance; 3: neither weak nor strong resemblance; 4: somewhat strong resemblance; 5: strong resemblance). The number of iterations taken before obtaining a satisfactory image was recorded to evaluate the performance of the proposed method. Ten participants (8 males and 2 females in their 20s–25s) joined all three face image-creating tasks.

a: TASK 1, CREATING FACE IMAGES BASED ON THE REFERENCE IMAGES
In this experiment, we presented a reference face image to each participant during the entire relevance feedback process and asked him/her to create a face image similar to the reference image. A total of 20 images that were randomly excluded from the dataset used for training GP-GAN and the OPF were used in the experiment. The participants were asked to perform the task with all 20 images, and they were also required to score their created results. Some results are shown in Fig. 8. These will be evaluated in the second experiment.


b: TASK 2, CREATING FACE IMAGES BASED ON THE BRIEFLY PRESENTED REFERENCE IMAGES
This task is performed with the aim of validating whether the proposed method enables the synthesis of an image in the user's memory. We presented a reference face image to each participant for 3–4 seconds and had the participant create a face image resembling the reference image and then score the generated image. Twenty face images that were randomly excluded from the training set and are different from those used in task 1 were utilized as the reference images. Fig. 9 shows that the resulting images can capture the overall features of the reference face images, as well as some shape information of the individual parts (e.g., the size and shape of the eyes, nose, and mouth).

FIGURE 9. Created face images based on the briefly presented reference images. First row: the reference images. Second row: the corresponding created face images.

c: TASK 3, CREATING THE IMAGINED FACE IMAGES
Unlike the two previous tasks, in this case, we asked a participant to create an image that he/she imagined without any reference image and then to score the result. With this third task, as no one knows the face imagined by the participant, the scores given by the participant are the only measure for evaluation. Fig. 10 shows some results from this task.

FIGURE 10. Created face images that were imagined.

Fig. 11 presents the average scores; the horizontal axis represents the tasks, and the vertical axis represents the average score of each task. The figure shows that the average scores for all tasks are very similar and are all close to 4, which is much better than the scores of the existing method combining GAN with the genetic algorithm [8]. As our method uses a relevance feedback framework, the convergence speed is also an important measure for evaluating the performance of the system. Fig. 12 shows the average number of iterations for all tasks. It depicts that task 1, in which the reference images are presented during the whole process, took more iterations than the two other tasks. We observed that this is mainly because the participants can check all the feature information better and more carefully during the entire process. The iteration number of task 3 is larger than that of task 2; this is likely because the mental image of the face may be influenced by the results during the relevance feedback, and it takes some time for the user to reach a conclusion. Even so, the user can arrive at satisfactory results within an average of five iterations. This is much faster than that obtained with existing evolutionary algorithm-based approaches [8], [15].

FIGURE 11. Average final scores.

FIGURE 12. Average number of iterations.

Fig. 13 illustrates the number of iterations of each participant in the different tasks. The horizontal axis represents the participant, and the vertical axis represents the number of iterations. It took less than six iterations for 11 participants to arrive at the final results. In task 1, however, one participant took more than 12 iterations before she reached satisfactory results. The interview after the experiment revealed that although the participants obtained satisfactory results at the fourth or fifth iterations, they still attempted to perform further iterations because of their curiosity about what happens next if more iterations are conducted.

2) EXPERIMENT FOR EVALUATING THE GENERATED FACE IMAGES
In this experiment, we asked another group to evaluate the face images generated in experiment 1 by finding the corresponding reference image of the generated image from a set of candidate images. Ten participants (7 males and 3 females, in their 20s–25s) who were different from those who joined experiment 1 were invited to perform the present experiment.


FIGURE 13. Number of iterations required for achieving satisfactory results.

As shown in Fig. 14, for each face image generated in experiment 1, three face images are presented to the users. Among the three images, one is the reference image used for creating the result, and the other two are selected from the training dataset as the faces closest to the reference image in the landmark feature space (based on the L2 distance). The display positions of the three images are randomly shuffled to eliminate bias caused by the layout. The participant was asked to find the reference image from among the three images. Ten images randomly selected from the results of tasks 1 and 2 of experiment 1 were used. The average success rates of all 10 participants are shown in Fig. 15, with the horizontal axis representing the 10 generated images.

FIGURE 14. An example of experiment 2.

Fig. 15 shows that among the 10 generated images, three have a 100% matching success rate, and the lowest success rate is 60%. We also observed that the matching success rate of females is lower than that of males. This result may be related to the hairstyle. Although during the entire procedure of the experiment all subjects were asked to ignore the hairstyle, they might have still been involuntarily affected. The subjects commented that the eyes were the most important feature that drew their attention.

FIGURE 15. Matching success rate.

3) THE EXPERIMENT FOR COMPARING WITH THE EXISTING METHOD
This experiment compares our method with Xu's technique that used principal component analysis (PCA) and OPF-based relevance feedback [15] in terms of the number of iterations, image quality, and similarity to the target face. The same dataset is used to ensure a fair and objective comparison. As Xu's method can only generate grayscale images, we first converted all images in the training dataset used for our OPF into grayscale and then resized them to the same size as the images used in [15]. Next, facial features were extracted at the pixel level, and the PCA algorithm [25] was applied to reduce the facial features to 80 dimensions, which were then used for training the OPF classifier, as described in [15].
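A sketch of this baseline feature pipeline (using scikit-learn's PCA as one possible implementation; the function name is ours):

```python
import numpy as np
from sklearn.decomposition import PCA

def pca_features(gray_images, n_components=80):
    """Flatten equally sized grayscale images and reduce them to 80 PCA dimensions."""
    X = np.stack([img.ravel() for img in gray_images]).astype(np.float32)
    pca = PCA(n_components=n_components)
    return pca.fit_transform(X), pca  # features used to train the baseline OPF classifier
```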
To compare the iterations and image quality, the participants who joined experiment 1 were invited to perform the same three tasks as in experiment 1 with Xu's method. Fig. 16 shows some results of the two methods; the first row presents the reference images, whereas the second and third rows present the images generated using our method and the existing method, respectively. Fig. 17 shows the comparison results on the average final scores of the three tasks, in which the horizontal axis represents the task types and the vertical axis represents the average scores. The average score of the proposed method is higher than that of the existing method for all three tasks.

FIGURE 16. Results using different methods. First row: the reference images. Second row: the generated face images using our method. Third row: the generated face images using the existing approach.


FIGURE 17. Comparison of the average final scores.

FIGURE 18. Comparison of the average number of iterations.

Fig. 18 shows the average number of iterations. The existing method took more iterations than ours in all tasks. The proposed method outperforms the existing method particularly for task 1, in which the participants can always compare the results with the reference image during the entire process.

For the similarity comparison, we invited a new group of participants who did not join any face image generation or matching tests to evaluate the face images created by the two methods; similar to experiment 1, a five-point scale was used based on the images' similarity to the reference images. Twenty subjects were randomly divided into two groups. The first group, consisting of 10 participants (10 males in their 20s–25s), evaluated the results of the existing method, whereas the second group, consisting of 10 participants (10 males in their 20s–25s), evaluated the results of our proposed method. We did not ask the participants to directly compare the results of the two methods; instead, we used non-overlapping groups of participants to evaluate the results of the two methods separately, because we wanted to focus on the evaluation of similarity and avoid any adverse evaluations of the existing method caused by its low image quality. As the existing method cannot synthesize color images, we converted the face images created by our method into grayscale and then presented these images to the participants to eliminate the effect of color. We also asked the participants to ignore the blur artifacts. Fig. 19 presents the result of the similarity comparison for all the reference images. The results generated by our method are evaluated to be more similar to the reference images.

FIGURE 19. The similarity comparison.

V. DISCUSSION
As shown in Fig. 1, Fig. 9, and Fig. 10, most of the created images resemble the geometric features of the reference images well, but they fail to capture the details of the texture features. For example, in Fig. 1, the wrinkles on the faces of Laura Bush and George W. Bush were not reproduced in the resulting images. This is because the GP-GAN model is controlled with a landmark that encodes the geometric information of the facial parts only. To address this issue, we are developing a new GAN model that can take both geometry and texture information as the input. The participants of the experiments reported that when they focus on some particular part, quickly converging to an image with that part resembling the reference image is possible. However, that particular part may become less similar to the reference image again after trying to improve the other parts. It is important to allow users to control each part independently and to integrate the best results of all facial parts. The overall results are still blurred, although they are much better than those of Xu's method that used the PCA features. The image quality may be further improved by carefully tuning the training parameters of GP-GAN. Nevertheless, our contribution is the approach of combining GAN with an effective relevance feedback framework; substituting GP-GAN with any state-of-the-art GAN model for better image quality is not difficult. In the current implementation, we treat the bounding box of all the training images in the landmark feature space as the safe area for exploring new landmarks, but this is an approximate approach. A more accurate scope needs to be defined. Currently, the face images generated from the newly created landmarks are not included in the dataset for training the OPF. By adding the created face images to the training dataset at every step of the iteration, we can expect to expand the range of face images that can be created.

VI. CONCLUSION
This study proposed a novel method to gain user control over detailed face features when generating face images by combining the newest GAN model with a relevance feedback framework based on the OPF algorithm.


The experiment results demonstrated that the proposed method can be used to generate not only a face image resembling the target face but also a face image in the user's memory or imagination. Our proposed method makes up for both the lack of user intervention in GAN and the low image quality in traditional methods.

REFERENCES
[1] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, ''Generative adversarial nets,'' in Proc. Adv. Neural Inf. Process. Syst., 2014, pp. 2672–2680.
[2] M. Mirza and S. Osindero, ''Conditional generative adversarial nets,'' 2014, arXiv:1411.1784. [Online]. Available: https://arxiv.org/abs/1411.1784
[3] J. Gauthier, ''Conditional generative adversarial nets for convolutional face generation,'' Class Project, Stanford CS231N, 2014, no. 5, p. 2.
[4] G. Antipov, M. Baccouche, and J.-L. Dugelay, ''Face aging with conditional generative adversarial networks,'' in Proc. IEEE Int. Conf. Image Process. (ICIP), Sep. 2017, pp. 2089–2093.
[5] R. Huang, S. Zhang, T. Li, and R. He, ''Beyond face rotation: Global and local perception GAN for photorealistic and identity preserving frontal view synthesis,'' in Proc. IEEE Int. Conf. Comput. Vis., Oct. 2017, pp. 2439–2448.
[6] T. Karras, S. Laine, and T. Aila, ''A style-based generator architecture for generative adversarial networks,'' in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2019, pp. 4401–4410.
[7] X. Di, V. A. Sindagi, and V. M. Patel, ''GP-GAN: Gender preserving GAN for synthesizing faces from landmarks,'' in Proc. 24th Int. Conf. Pattern Recognit. (ICPR), Aug. 2018, pp. 1079–1084.
[8] P. Bontrager, W. Lin, J. Togelius, and S. Risi, ''Deep interactive evolution,'' in Proc. Int. Conf. Comput. Intell. Music, Sound, Art Design. Springer, 2018, pp. 267–282.
[9] M. Arjovsky, S. Chintala, and L. Bottou, ''Wasserstein generative adversarial networks,'' in Proc. Int. Conf. Mach. Learn., 2017, pp. 214–223.
[10] S. J. Gibson, C. J. Solomon, and A. P. Bejarano, ''Synthesis of photographic quality facial composites using evolutionary algorithms,'' in Proc. BMVC, 2003, pp. 1–10.
[11] M. Valstar, B. Martinez, X. Binefa, and M. Pantic, ''Facial point detection using boosted regression and graph models,'' in Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit., Jun. 2010, pp. 2729–2736.
[12] O. Gunnarsson and S. Maddock, ''Sketching faces,'' in Proc. 5th Eurograph. Conf. Sketch-Based Interfaces Modeling, 2008, pp. 105–112.
[13] D. Wu and Q. Dai, ''Sketch realizing: Lifelike portrait synthesis from sketch,'' in Proc. Comput. Graph. Int. Conf., 2009, pp. 13–20.
[14] X. Wang and X. Tang, ''Face photo-sketch synthesis and recognition,'' IEEE Trans. Pattern Anal. Mach. Intell., vol. 31, no. 11, pp. 1955–1967, Nov. 2009.
[15] C. Xu, S. Fushimi, M. Toyoura, J. Xu, and X. Mao, ''Synthesising images of imagined faces based on relevance feedback,'' Tech. Rep.
[16] D. E. King, ''Dlib-ml: A machine learning toolkit,'' J. Mach. Learn. Res., vol. 10, pp. 1755–1758, Jan. 2009.
[17] A. T. Da Silva, A. X. Falcão, and L. P. Magalhães, ''Active learning paradigms for CBIR systems based on optimum-path forest classification,'' Pattern Recognit., vol. 44, no. 12, pp. 2971–2978, 2011.
[18] J. P. Papa and A. X. Falcão, ''Optimum-path forest: A novel and powerful framework for supervised graph-based pattern recognition techniques,'' Inst. Comput., Univ. Campinas, Campinas, Brazil, Tech. Rep., 2010, pp. 41–48.
[19] J. P. Papa, A. X. Falcão, and C. T. N. Suzuki, ''Supervised pattern classification based on optimum-path forest,'' Int. J. Imag. Syst. Technol., vol. 19, no. 2, pp. 120–131, 2009.
[20] O. Ronneberger, P. Fischer, and T. Brox, ''U-Net: Convolutional networks for biomedical image segmentation,'' in Proc. Int. Conf. Med. Image Comput. Comput.-Assist. Intervent. Springer, 2015, pp. 234–241.
[21] G. Huang, Z. Liu, L. van der Maaten, and K. Q. Weinberger, ''Densely connected convolutional networks,'' in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jul. 2017, pp. 4700–4708.
[22] K. Simonyan and A. Zisserman, ''Very deep convolutional networks for large-scale image recognition,'' 2014, arXiv:1409.1556. [Online]. Available: https://arxiv.org/abs/1409.1556
[23] G. Huang, M. Mattar, H. Lee, and E. G. Learned-Miller, ''Learning to align from scratch,'' in Proc. Adv. Neural Inf. Process. Syst., 2012, pp. 764–772.
[24] G. B. Huang, M. Mattar, T. Berg, and E. Learned-Miller, ''Labeled faces in the wild: A database for studying face recognition in unconstrained environments,'' in Proc. Dans Workshop Faces Real-Life Images, Detection, Alignment, Recognit., 2008, pp. 1–14.
[25] A. J. Calder, A. M. Burton, P. Miller, A. W. Young, and S. Akamatsu, ''A principal component analysis of facial expressions,'' Vis. Res., vol. 41, no. 9, pp. 1179–1208, 2001.

CAIE XU received the B.Sc. degree in computer science and technology from the Hebei University of Technology and the M.Sc. degree in software engineering from Zhejiang University. She is currently pursuing the Ph.D. degree in computer science with the University of Yamanashi. Her research interests include computer vision, machine learning, deep learning, and data mining.

YING TANG received the B.Sc. degree from East China Normal University, China. He is currently pursuing the master's degree with the Interdisciplinary Graduate School, University of Yamanashi, Japan. His research interests include augmented reality, computer and human vision, and deep learning.

MASAHIRO TOYOURA received the B.Sc. degree in engineering and the M.Sc. and Ph.D. degrees in informatics from Kyoto University, in 2003, 2005, and 2008, respectively. He is currently an Assistant Professor with the Interdisciplinary Graduate School, University of Yamanashi, Japan. His research interests are augmented reality and computer and human vision. He is a member of ACM and the IEEE Computer Society.

JIAYI XU received the B.Sc., M.Sc., and Ph.D. degrees in computer science from Zhejiang University. She is currently an Assistant Professor with the School of Computer Science and Technology, Hangzhou Dianzi University, China. Her research interests include texture design, image enhancement, and face recognition and synthesis.

XIAOYANG MAO received the B.S. degree in computer science from Fudan University and the M.S. and Ph.D. degrees in computer science from The University of Tokyo. She is currently a Professor with the Department of Computer Science and Engineering, University of Yamanashi, Japan, and an Adjunct Professor with the College of Computer Science, Hangzhou Dianzi University, China. Her current research interests include image processing, visual perception, non-photorealistic rendering, and their applications to e-health. She received the Computer Graphics International Career Achievement Award, in 2018.
