Generating Users' Desired Face Image Using the Conditional Generative Adversarial Network and Relevance Feedback
Received October 24, 2019, accepted November 17, 2019, date of publication November 28, 2019,
date of current version December 23, 2019.
Digital Object Identifier 10.1109/ACCESS.2019.2956574
ABSTRACT In this study, we propose a novel method for generating an image of the target face by using the
generative adversarial network (GAN) and relevance feedback. Combining GAN with relevance feedback
compensates for the lack of user intervention in GAN and the low image quality in traditional methods. The
feature points of face images, namely, the landmarks, are used as the conditional information for GAN to
control the detailed features of the generated face image. An optimum-path forest classifier is applied to
categorize the relevance of training images based on the user’s feedback so that the user can quickly retrieve
the training images that are most similar to the target face. The retrieved training images are then used to
create a new landmark for synthesizing the target face image. The experimental results showed that users
can generate images of their desired faces, including the faces in their memory, with a reasonable number
of iterations, thus demonstrating the potential of the proposed method for forensic purposes,
such as creating the face images of criminals based on the memories of witnesses or victims.
INDEX TERMS Face image generation, relevance feedback, optimum-path forest, conditional generative
adversarial network, landmark.
This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see http://creativecommons.org/licenses/by/4.0/
FIGURE 1. Results from our proposed method. The images in the first row are the references, and those in the second row are the corresponding created
face images.
face image. The Two-Pathway Generative Adversarial Network (TP-GAN) [5] was proposed by Huang et al. in 2017 for realistic face synthesis. It is mainly used for reconstructing face images from a partial view corresponding to different poses. The most recent variant, StyleGAN [6], proposed by Karras et al., led to an automatically learned, unsupervised separation of high-level attributes, and it can synthesize high-quality face images with varying high-level attributes, such as different hairstyles and expressions. However, it considers stochastic variation and does not have the ability to control such attributes. Di et al. focused on high-level face-related analysis tasks and proposed the gender-preserving GAN (GP-GAN) [7], which can synthesize face images from landmarks; these feature points represent the geometric information of the overall shape and the individual parts of the face. Although controlling the geometric features, such as the pose, the shape of the face, and the individual facial parts, is possible with GP-GAN, it requires the landmarks as input. The application of GP-GAN is therefore limited without providing users a method to create the landmarks of their desired face.

Bontrager et al. [8] proposed an approach based on Wasserstein GAN [9] and interactive evolutionary computation [10] to produce an image resembling a given target. The user is asked to evaluate a set of images resulting from GAN, and a genetic algorithm is used to modify the latent vector based on the user's evaluation. This is the first work that demonstrated the potential of using an evolutionary algorithm to generate face images similar to target faces. However, their evaluation experiment reported that the average score of the results was only 2.2 out of 5. Furthermore, the method cannot provide control over detailed facial features.

To the best of our knowledge, none of the existing GAN models can provide users with easy control over detailed facial features, such as the shapes and positions of individual parts of the face. The ability to control detailed facial features is required in many applications. One important application is in assisting the police to create the face images of suspects based on the memories of witnesses or victims. Other potential applications include visually presenting an image of the user's imagined face. This study proposes a novel method combining GP-GAN and relevance feedback for interactive face image generation. An optimum-path forest (OPF) classifier is used to define the desired facial features represented as landmarks. The classifier is iteratively updated based on the relevance feedback of users. The landmarks of the desired face are then used as the input to GP-GAN to generate realistic face images with the desired features. In this way, the proposed method can take full advantage of the high image quality of state-of-the-art GAN technology while compensating for its lack of user intervention. Fig. 1 shows some results generated with the proposed system. The user provides feedback to the system based on the similarity between the created image and the target one. After several iterations, a face image resembling the target image is created. Our experiment showed that the proposed method can generate a result similar to the target face in the user's memory, and it demonstrated its potential for forensic purposes, such as assisting the police to create the face image of a suspect based on the feedback of victims or witnesses.

In the remainder of this paper, we first review related works in Section II and then present our proposed method in Section III. Section IV describes the experiment for evaluation and discusses the results. Sections V and VI conclude the paper by discussing the limitations of the study and presenting future research directions.

II. RELATED WORKS
A. THE GENERATIVE ADVERSARIAL NETWORK FOR SYNTHESIZING FACE IMAGES
While a large number of works have focused on facial recognition and identification, very few studies had been conducted on face image synthesis until a few years ago. Significant developments in machine learning technologies led to major advancements in the field, such as GAN technology, which has made face image generation a current research hotspot. The original GAN [1] aims to estimate generative models via an adversarial process consisting of a generator network G that captures the data distribution and a discriminator network D that estimates the probability of a sample coming from the real data rather than from G; G and D are trained simultaneously. The input to GAN is random noise, and G is trained in a way that maximizes the probability of D making a mistake.
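The adversarial objective described above can be illustrated numerically. The following is a minimal sketch of the standard GAN losses, not the authors' implementation; the toy discriminator outputs are fabricated values used only to show how the two losses trade off.

```python
import numpy as np

rng = np.random.default_rng(0)

def discriminator_loss(d_real, d_fake):
    """Standard GAN discriminator loss: -E[log D(x)] - E[log(1 - D(G(z)))]."""
    return -np.mean(np.log(d_real)) - np.mean(np.log(1.0 - d_fake))

def generator_loss(d_fake):
    """Non-saturating generator loss: -E[log D(G(z))]."""
    return -np.mean(np.log(d_fake))

# Toy discriminator outputs (probabilities that a sample is real).
d_real = rng.uniform(0.7, 0.99, size=64)   # D is fairly confident on real images
d_fake = rng.uniform(0.01, 0.3, size=64)   # D mostly rejects generated images

print(discriminator_loss(d_real, d_fake))  # low: D is currently winning
print(generator_loss(d_fake))              # high: G still has room to improve
```

When training converges, D outputs probabilities near 0.5 everywhere, and both losses approach their equilibrium values.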
To compensate for the disadvantages of GAN, such as the lack of control over the generated results and the low image quality, CGAN was proposed in [2], allowing the generation of images with certain attributes. In practice, the conditions can be any information related to the target face image, such as the facial pose or a facial attribute. On this basis, AgeGAN [4] and the CGAN tailored for convolutional face image generation [3] were proposed. The former focuses on generating face images of different ages, with a particular emphasis on preserving a person's identity in the aged version of his/her face. CGAN can generate face images with specific attributes by varying the conditional information. It exploits facial expression attributes and other related features, such as race and age, as conditional data. However, these attributes are tagged to the data when preparing the training dataset, so there is no way to reflect the user's intention at execution time. Another GAN model that contributes to face image synthesis is TP-GAN [5], which focuses on synthesizing and reconstructing realistic frontal-view face images from partial-view face images; its results perform well in face recognition. While the face images generated with GAN are more realistic, one major challenge is how to control the generated results, such as the pose, the face shape, and the hairstyle. StyleGAN [6] is a novel model that addresses this issue. StyleGAN features a multi-resolution structure; by modifying the input of each level separately, it succeeds in controlling the output from coarse features (pose, face shape) to fine details (hair color). However, all these existing GANs, including StyleGAN, cannot provide users any control at runtime. Bontrager et al. proposed an approach [8] based on Wasserstein GAN [9] and the evolutionary algorithm [10] to produce user-desired images. The user is asked to evaluate a set of images resulting from GAN, and a genetic algorithm is applied to modify the noise input based on the user's evaluation. Their paper showed some examples of using the method for generating face images resembling target faces. However, in their user evaluation experiment, the average score of the generated images was only 2.2 out of 5.

GP-GAN [7] attempts to perform high-level face-related analysis tasks based on landmarks [11], which are the feature points characterizing the geometric features of faces. In the training of GP-GAN, both the face image and the corresponding landmarks are inputted to generator G. The trained generator model can produce a face image that is similar to the image from which the landmarks were extracted. In our study, we used landmarks as the features for training the classifier based on the user's feedback, and utilized GP-GAN for synthesizing realistic face images from the landmarks.

B. USER CONTROL IN FACE IMAGE SYNTHESIS
A very typical application requiring user intervention in face image synthesis is assisting the police in investigations. Electronic facial identification (E-FIT) [12] is a face image synthesis system that can produce the facial composites of wanted criminals based on eyewitness descriptions. The core concept in E-FIT is the technique of montage synthesis, which requires the user to look through a dataset of face components (eyebrows, eyes, noses, mouths, etc.) in order to search for each part separately based on resemblance and then composite a face image using the selected parts. However, finding the ideal parts is time consuming, and it is difficult to ensure that the synthesized face image is the desired one, even if each component is satisfactory.

Wu and Dai [13] proposed a method to synthesize face images from sketches. Wang and Tang developed a method [14] enabling bidirectional photo-sketch mapping, which can synthesize a face sketch from a photo and, conversely, a photo from a face sketch. The results of these methods depend on the quality of the sketches; most people, however, cannot draw sketches well.

The evolutionary algorithm was explored by Gibson et al.; it uses both local and global models, allowing a witness to evolve plausible, photo-realistic face images in an intuitive way [10]. However, the approach suffers from slow convergence, often requiring dozens or even more than a hundred iterations.

Xu et al. proposed an approach to synthesize images based on relevance feedback in order to overcome the slow convergence while considering the whole face image and the user's feedback [15]. Through a dialogic approach based on a relevance feedback strategy, the user only needs to look at several candidate face images and assess whether each image is similar to the target one. The selected candidates are interpolated to create the user's desired image. Unfortunately, the results are sometimes blurred and lacking in details, and the convergence of relevance feedback can be slow, as it uses the principal components of grayscale images to build the feature space and synthesize the face image. Moreover, interpolating only from training face images limits the variations of the face images that can be created. The proposed method uses a relevance feedback framework similar to [15], but by combining landmark features and the newest GAN instead of using principal components, it can produce high-quality images with a smaller number of user iterations. A new method of creating a completely new face image that is even better than the best training samples is also developed.

III. PROPOSED METHOD
A. OVERVIEW
As depicted in Fig. 2, the proposed method consists of two parts: (1) a relevance feedback framework for users to generate the landmarks of new candidate faces by evaluating the sample face images and (2) the face image generator using GP-GAN with the new landmarks resulting from the relevance feedback process. In the offline training phase, the algorithm proposed by Dlib [16] is applied to train the network for extracting the landmark features; these extracted landmarks are then used for training the GP-GAN.

The relevance feedback framework consists of three parts: constructing the feature space, training the OPF classifier, and exploring the candidate feature vectors.
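The overall feedback loop can be sketched as follows. This is a toy illustration, not the paper's code: the OPF classifier is replaced by a trivial distance-based stand-in for the user's relevance judgments, the landmark space is shrunk to 10 dimensions, and the function names (`user_feedback`, `explore_candidates`) are hypothetical. The bounding box of the training vectors is used as the exploration "safe area," mirroring the implementation choice discussed later in the paper.

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy landmark feature space: each face is a vector (68 (x, y) points flattened
# in the real system; 10-D here). `target` stands for the face in the user's mind.
training = rng.normal(size=(50, 10))
target = rng.normal(size=10)

# Safe area for exploration: the bounding box of the training landmarks.
box_lo, box_hi = training.min(axis=0), training.max(axis=0)

def user_feedback(candidates, k=3):
    """Stand-in for the human judge: mark the k candidates closest to the target
    as relevant. In the real system this label comes from the user, and an OPF
    classifier is trained on it."""
    order = np.argsort(np.linalg.norm(candidates - target, axis=1))
    labels = np.zeros(len(candidates), dtype=bool)
    labels[order[:k]] = True
    return labels

def explore_candidates(candidates, labels, n_new=8, spread=0.3):
    """Create new candidate landmark vectors around the mean of the relevant
    ones, clipped to the safe area (training bounding box)."""
    center = candidates[labels].mean(axis=0)
    new = center + spread * rng.normal(size=(n_new, candidates.shape[1]))
    return np.clip(new, box_lo, box_hi)

# Relevance feedback iterations: show samples, collect feedback, refine.
candidates = training[rng.choice(len(training), 8, replace=False)]
for _ in range(5):
    labels = user_feedback(candidates)
    candidates = explore_candidates(candidates, labels)

best = candidates[user_feedback(candidates, k=1)][0]
# `best` is the landmark vector that would be fed to GP-GAN for synthesis.
print(best.shape)  # (10,)
```

In the actual method, the refined landmark vector is passed to the trained GP-GAN generator at each round, and the user judges the rendered face images rather than raw feature vectors.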
\vec{v} = \vec{v}_i + \vec{v}_{ir},

\vec{v}_i = \sum_{i=1}^{n} b_i \vec{r}_i,

\vec{v}_{ir} = \sum_{i=1}^{m} b_{ir_i} \vec{r}_{ir_i}.
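The candidate-vector equations above combine a weighted sum over the relevant feature vectors with a weighted sum over the irrelevant ones. A toy numpy reading of this update is shown below; the uniform positive weights for relevant samples and the negative weights for irrelevant samples are assumptions for illustration (the paper's actual weighting is not recoverable from this excerpt), and all variable names are ours.

```python
import numpy as np

rng = np.random.default_rng(7)

# n relevant and m irrelevant landmark vectors chosen via the user's feedback
# (toy data: 68 (x, y) landmark points flattened to 136-D vectors).
r = rng.normal(size=(4, 136))    # relevant vectors r_i
r_ir = rng.normal(size=(2, 136)) # irrelevant vectors r_ir_i

# Weights b_i and b_ir_i: uniform positive for relevant, negative for
# irrelevant (pushing the candidate away from rejected faces; an assumption).
b = np.full(len(r), 1.0 / len(r))
b_ir = np.full(len(r_ir), -0.5 / len(r_ir))

v_i = (b[:, None] * r).sum(axis=0)         # v_i  = sum_i b_i  * r_i
v_ir = (b_ir[:, None] * r_ir).sum(axis=0)  # v_ir = sum_i b_ir_i * r_ir_i
v = v_i + v_ir                             # new candidate feature vector

print(v.shape)  # (136,)
```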
FIGURE 15. Matching success rate.

FIGURE 16. Results using different methods. First row: the reference images. Second row: the generated face images using our method. Third row: the generated face images using the existing approach.

FIGURE 18. Comparison of the average number of iterations.

As shown in Fig. 14, for each face image generated with

Fig. 18 shows the average number of iterations. The existing method took more iterations than ours in all tasks. The proposed method outperforms the existing method particularly for task 1, in which the participants can always compare the results with the reference image during the entire process.

For the similarity comparison, we invited a new group of participants who did not join any face image generation or matching tests to evaluate the face images created by the two methods; similar to experiment 1, a five-point scale was used based on the images' similarity to the reference images. Twenty subjects were randomly divided into two groups. The first group, consisting of 10 participants (10 males aged 20–25), evaluated the results with the existing method, whereas the second group, also consisting of 10 participants (10 males aged 20–25), evaluated the results with our proposed method. We did not ask the participants to directly compare the results of the two methods; instead, we used non-overlapping groups of participants to evaluate the results of the two methods separately because we wanted to focus on the evaluation of similarity and avoid any adverse evaluation of the existing method caused by its low image quality. As the existing method could not synthesize color images, we converted the face images produced by our method into grayscale and then presented these images to the participants to eliminate the effect of color. We also asked the participants to ignore the blur artifacts. Fig. 19 presents the result of the similarity comparison for all the reference images. The results generated by our method were evaluated to be more similar to the reference images.

V. DISCUSSION
As shown in Fig. 1, Fig. 9, and Fig. 10, most of the created images can closely resemble the geometric features of the reference images, but they fail to capture the details of the texture features. For example, in Fig. 1, the wrinkles on the faces of Laura Bush and George W. Bush were not reproduced in the resulting image. This is because the GP-GAN model is controlled with a landmark that consists of the geometry information of the facial parts only. To address this issue, we are developing a new GAN model that can take both geometry and texture information as the input. The participants of the experiments reported that when they focus on some particular part, quickly converging to an image with that part resembling the reference image is possible. However, that particular part may become less similar to the reference image again after trying to improve the other parts. It is important to allow users to control each part independently and integrate the best results of all facial parts. The overall results are still blurred, although they are much better than those of Xu's method that used the PCA feature. The image quality may be further improved by carefully tuning the training parameters of GP-GAN. Nevertheless, our contribution is the approach of combining GAN with an effective relevance feedback framework; substituting GP-GAN with any state-of-the-art GAN model for better image quality is not difficult. In the current implementation, we treat the bounding box of all the training images in the landmark feature space as the safe area for exploring new landmarks, but this is an approximate approach; a more accurate scope needs to be defined. Currently, the face images generated from the newly created landmarks are not included in the dataset for training the OPF. By adding the created face images to the training dataset at every step of the iteration, we can expect to expand the range of face images that can be created.

VI. CONCLUSION
This study proposed a novel method to gain user control over detailed face features when generating face images with the use of the newest GAN model combined with the relevance feedback framework based on the OPF algorithm.
The experiment results demonstrated that the proposed method can be used to generate not only a face image resembling the target face but also a face image in the user's memory or imagination. Our proposed method makes up for both the lack of user intervention in GAN and the low image quality in traditional methods.

REFERENCES
[1] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, "Generative adversarial nets," in Proc. Adv. Neural Inf. Process. Syst., 2014, pp. 2672–2680.
[2] M. Mirza and S. Osindero, "Conditional generative adversarial nets," 2014, arXiv:1411.1784. [Online]. Available: https://arxiv.org/abs/1411.1784
[3] J. Gauthier, Conditional Generative Adversarial Nets for Convolutional Face Generation, Class Project, Stanford CS231N, 2014, no. 5, p. 2.
[4] G. Antipov, M. Baccouche, and J.-L. Dugelay, "Face aging with conditional generative adversarial networks," in Proc. IEEE Int. Conf. Image Process. (ICIP), Sep. 2017, pp. 2089–2093.
[5] R. Huang, S. Zhang, T. Li, and R. He, "Beyond face rotation: Global and local perception GAN for photorealistic and identity preserving frontal view synthesis," in Proc. IEEE Int. Conf. Comput. Vis., Oct. 2017, pp. 2439–2448.
[6] T. Karras, S. Laine, and T. Aila, "A style-based generator architecture for generative adversarial networks," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2019, pp. 4401–4410.
[7] X. Di, V. A. Sindagi, and V. M. Patel, "GP-GAN: Gender preserving GAN for synthesizing faces from landmarks," in Proc. 24th Int. Conf. Pattern Recognit. (ICPR), Aug. 2018, pp. 1079–1084.
[8] P. Bontrager, W. Lin, J. Togelius, and S. Risi, "Deep interactive evolution," in Proc. Int. Conf. Comput. Intell. Music, Sound, Art Design, Springer, 2018, pp. 267–282.
[9] M. Arjovsky, S. Chintala, and L. Bottou, "Wasserstein generative adversarial networks," in Proc. Int. Conf. Mach. Learn., 2017, pp. 214–223.
[10] S. J. Gibson, C. J. Solomon, and A. P. Bejarano, "Synthesis of photographic quality facial composites using evolutionary algorithms," in Proc. BMVC, 2003, pp. 1–10.
[11] M. Valstar, B. Martinez, X. Binefa, and M. Pantic, "Facial point detection using boosted regression and graph models," in Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit., Jun. 2010, pp. 2729–2736.
[12] O. Gunnarsson and S. Maddock, "Sketching faces," in Proc. 5th Eurograph. Conf. Sketch-Based Interfaces Modeling, 2008, pp. 105–112.
[13] D. Wu and Q. Dai, "Sketch realizing: Lifelike portrait synthesis from sketch," in Proc. Comput. Graph. Int. Conf., 2009, pp. 13–20.
[14] X. Wang and X. Tang, "Face photo-sketch synthesis and recognition," IEEE Trans. Pattern Anal. Mach. Intell., vol. 31, no. 11, pp. 1955–1967, Nov. 2009.
[15] C. Xu, S. Fushimi, M. Toyoura, J. Xu, and X. Mao, "Synthesising images of imagined faces based on relevance feedback," Tech. Rep.
[16] D. E. King, "Dlib-ml: A machine learning toolkit," J. Mach. Learn. Res., vol. 10, pp. 1755–1758, Jan. 2009.
[17] A. T. Da Silva, A. X. Falcão, and L. P. Magalhães, "Active learning paradigms for CBIR systems based on optimum-path forest classification," Pattern Recognit., vol. 44, no. 12, pp. 2971–2978, 2011.
[18] J. P. Papa and A. X. Falcão, "Optimum-path forest: A novel and powerful framework for supervised graph-based pattern recognition techniques," Inst. Comput., Univ. Campinas, Campinas, Brazil, Tech. Rep., 2010, pp. 41–48.
[19] J. P. Papa, A. X. Falcão, and C. T. N. Suzuki, "Supervised pattern classification based on optimum-path forest," Int. J. Imag. Syst. Technol., vol. 19, no. 2, pp. 120–131, 2009.
[20] O. Ronneberger, P. Fischer, and T. Brox, "U-Net: Convolutional networks for biomedical image segmentation," in Proc. Int. Conf. Med. Image Comput. Comput.-Assist. Intervent., Springer, 2015, pp. 234–241.
[21] G. Huang, Z. Liu, L. van der Maaten, and K. Q. Weinberger, "Densely connected convolutional networks," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jul. 2017, pp. 4700–4708.
[22] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," 2014, arXiv:1409.1556. [Online]. Available: https://arxiv.org/abs/1409.1556
[23] G. Huang, M. Mattar, H. Lee, and E. G. Learned-Miller, "Learning to align from scratch," in Proc. Adv. Neural Inf. Process. Syst., 2012, pp. 764–772.
[24] G. B. Huang, M. Mattar, T. Berg, and E. Learned-Miller, "Labeled faces in the wild: A database for studying face recognition in unconstrained environments," in Proc. Workshop Faces Real-Life Images, Detection, Alignment, Recognit., 2008, pp. 1–14.
[25] A. J. Calder, A. M. Burton, P. Miller, A. W. Young, and S. Akamatsu, "A principal component analysis of facial expressions," Vis. Res., vol. 41, no. 9, pp. 1179–1208, 2001.

CAIE XU received the B.Sc. degree in computer science and technology from the Hebei University of Technology and the M.Sc. degree in software engineering from Zhejiang University. She is currently pursuing the Ph.D. degree in computer science with the University of Yamanashi. Her research interests include computer vision, machine learning, deep learning, and data mining.

YING TANG received the B.Sc. degree from East China Normal University, China. He is currently pursuing the master's degree with the Interdisciplinary Graduate School, University of Yamanashi, Japan. His research interests include augmented reality, computer and human vision, and deep learning.

MASAHIRO TOYOURA received the B.Sc. degree in engineering and the M.Sc. and Ph.D. degrees in informatics from Kyoto University, in 2003, 2005, and 2008, respectively. He is currently an Assistant Professor with the Interdisciplinary Graduate School, University of Yamanashi, Japan. His research interests are augmented reality and computer and human vision. He is a member of ACM and the IEEE Computer Society.

JIAYI XU received the B.Sc., M.Sc., and Ph.D. degrees in computer science from Zhejiang University. She is currently an Assistant Professor with the School of Computer Science and Technology, Hangzhou Dianzi University, China. Her research interests include texture design, image enhancement, and face recognition and synthesis.

XIAOYANG MAO received the B.S. degree in computer science from Fudan University and the M.S. and Ph.D. degrees in computer science from The University of Tokyo. She is currently a Professor with the Department of Computer Science and Engineering, University of Yamanashi, Japan, and an Adjunct Professor with the College of Computer Science, Hangzhou Dianzi University, China. Her current research interests include image processing, visual perception, non-photorealistic rendering, and their applications to e-health. She received the Computer Graphics International Career Achievement Award in 2018.