Multi View Frontal Face Image Generation: A Survey
Concurrency and Computation: Practice and Experience, December 2020
Xin Ning1,2,3, Fangzhe Nan2,3, Shaohui Xu2,3, Lina Yu1, Liping Zhang1,2,3
Abstract
To understand the development of frontal face generation models and grasp the current research hotspots and trends, existing methods based on 3D models, deep learning, and hybrid models are summarized, the commonly used face generation datasets are introduced, and the performance of existing models is compared through experiments. The purpose of this paper is to fundamentally understand the advantages of existing frontal face generation methods, sort out the key issues of such generation, and look toward future development trends.

Correspondence
Lina Yu and Liping Zhang, Institute of Semiconductors, Chinese Academy of Sciences, Beijing 100083, China.
Email: [email protected] (L. Y.) and [email protected] (L. Z.)

Funding information
The National Natural Science Foundation of China, Grant/Award Number: 61901436

KEYWORDS
3D model, deep learning, face frontalization, hybrid model
1 INTRODUCTION
Multi-view frontal face image generation refers to the generation of frontal face images from non-frontal images from one or more perspectives,1 for
use in face recognition,2 video surveillance,3 and authentication.4 In recent years, face recognition technology has been widely developed, including
face attendance,5 face reconnaissance,6 and face payment7 systems. With the upsurge of machine and deep learning technologies, the application
and accuracy of face recognition have reached higher levels.8 However, the texture information of non-frontal faces is usually less discriminative than that of frontal faces, the features common to the frontal and lateral views are extremely limited, and pose changes can cause
substantial facial deformations.9 Many face recognition algorithms are still challenged when attempting to recognize faces in large-pose non-frontal
face images, and pose problems have also become the main factor limiting the effect of face recognition in unrestricted environments. Therefore,
the recovery of a frontal face image from a large-pose non-frontal face image is an extremely valuable research topic.10
At present, the problem of frontal face image generation has been studied extensively worldwide. For example, Shao et al11 designed an end-to-end pose normalization network with different target-adaptive weights based on 3D models and a generative adversarial network. The model effectively
restores a frontal face with high-quality texture and high identity preservation. Wei et al12 proposed a new flow-based feature warping model
(FFWM), which can synthesize photo-realistic and illumination preserving frontal images under inconsistent illumination. The frontal generation
of human faces based on a support vector machine has also been realized.13,14 Xia et al15 used a single unmodified three-dimensional surface to
approximate the shape of all input surfaces, and obtained the frontal view of a face through two specific steps of hard frontalization and soft sym-
metry. Sagonas et al16 devised a model based on nuclear-norm (a convex surrogate of the rank function) matrix minimization that can jointly recover
the frontalized version of a face as well as the facial landmarks. In addition, Wang et al17 combined an image rotation formula to realize the plane
rotation correction of a face image. Existing methods can be roughly divided into traditional methods, 3D-based methods, and deep learning-based
methods. Traditional methods attempt to handle pose changes by learning pose-invariant features or by predicting face images in novel target poses without using 3D information; such methods are simple and fast to compute, but the generation effect is not ideal: local distortion occurs, and they can only frontalize faces at small angles (small poses). A 3D-based method fits the face image to a three-dimensional model, that is, a 3D morphable model (3DMM)18 or one of its extensions,19 and then uses symmetry or other operations to fill in the information missing from the side-face image
owing to self-occlusion. When correcting large-angle faces, obvious artificial symmetry marks occur, but the overall effect is better than that of traditional methods, and large-angle faces can at least be preliminarily corrected. Some researchers have used deep learning methods including
encoder models, convolutional neural networks, and generative adversarial networks to generate a frontal face, and have achieved relatively good
experimental results. Based on this, the present study focuses on the main progress and some representative research results of face frontalization
based on 3D and deep learning models achieved in recent years, and summarizes the difficulties and hot spots in the research on face frontalization
as well as the possible development trend through an experimental comparison and analysis.
Owing to the natural robustness of 3D data to changes in viewing angle, 3D-based methods can ideally solve the problem of face generation. Among
them, the 3DMM is a 3D deformable face model proposed by Blanz et al.18 As an average model used to describe a face shape, the 3DMM is a commonly
used approach to realize 3D face reconstruction and recognition. The idea of this algorithm is to use a face database to construct an average face
deformation model. After a new face image is given, the face image is matched and combined with the model. The corresponding parameters of
the model are modified, and the model is deformed until the difference between the model and the face image is minimized. The texture is then
optimized and adjusted to complete the face modeling. Therefore, the algorithm has two main steps. The first step is to construct an average face
model from all faces in the face database, and the second step is to complete the matching of the deformed model and the photograph. For step 1,
the face is first divided into two vectors: shape and texture. The shape vector, S, contains the X, Y, and Z coordinate information. The definition is shown in (1), where n represents the number of vertices of the model:

$$S = (X_1, Y_1, Z_1, X_2, Y_2, Z_2, \ldots, X_n, Y_n, Z_n)^T \quad (1)$$
The texture information, T, contains the color value information of R, G, and B, as defined in (2):
$$T = (R_1, G_1, B_1, R_2, G_2, B_2, \ldots, R_n, G_n, B_n)^T \quad (2)$$
Then, a three-dimensional deformed face model is established from m face models, each of which contains corresponding Si and Ti vectors.
The formula for the new 3D deformed face model is defined as follows:

$$S_{\mathrm{newModel}} = \sum_{i=1}^{m} a_i S_i \quad (3)$$

$$T_{\mathrm{newModel}} = \sum_{i=1}^{m} b_i T_i \quad (4)$$

where $\sum_{i=1}^{m} a_i = \sum_{i=1}^{m} b_i = 1$, m is the number of face samples collected, and a and b are the parameter coefficients.
Finally, formulas (3) and (4) are linearly combined into a new face model, as shown in formula (5):

$$\left\{ S_{\mathrm{newModel}}(\vec{a}),\; T_{\mathrm{newModel}}(\vec{b}) \right\} \quad (5)$$
In step two, based on the deformation model, for a given face photograph, the model is first registered with the facial photograph, and the parameters of the model are then adjusted to minimize the difference between the real face and the face in the photograph; that is, the Euclidean distance EI between the model image Imodel(x, y) and the input image Iinput(x, y) is minimized:
$$E_I = \sum_{x,y} \left\| I_{\mathrm{input}}(x, y) - I_{\mathrm{model}}(x, y) \right\|^2 \quad (6)$$
The idea of generating a frontal face image based on a 3D model is to build a 3D face model, match the model parameters of the test face image, and
then obtain the complete 3D face data.
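As a concrete illustration of this two-step procedure, the following minimal NumPy sketch builds a face from the linear model of Equations (3)-(5) and scores a fit with the error of Equation (6); the toy data, dimensions, and function names are illustrative assumptions rather than the original implementation.

import numpy as np

# m sample faces with n vertices each; a real 3DMM uses scanned faces (and usually a PCA basis).
m, n = 50, 1000
S_bank = np.random.rand(m, 3 * n)   # rows: (X1, Y1, Z1, ..., Xn, Yn, Zn)
T_bank = np.random.rand(m, 3 * n)   # rows: (R1, G1, B1, ..., Rn, Gn, Bn)

def new_face(a, b):
    """Linear combination of the sample faces, Equations (3)-(5)."""
    a = a / a.sum()                  # enforce sum(a_i) = 1
    b = b / b.sum()                  # enforce sum(b_i) = 1
    return a @ S_bank, b @ T_bank

def fitting_error(rendered, observed):
    """Pixel-wise squared difference between the model render and the input, Equation (6)."""
    return np.sum((observed - rendered) ** 2)

# Fitting iteratively adjusts the coefficients (and pose/illumination) so that
# fitting_error between the rendered model image and the photograph is minimized.
S_new, T_new = new_face(np.ones(m), np.ones(m))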
Some existing methods directly generate frontal face images based on the 3DMM. For example, Qianqing et al20 proposed the BFM-3DMM
model. First, an improved AAM (active appearance model) was used for face alignment, then the BFM-3DMM model was used for a preliminary correction, and finally, the SFS (shape-from-shading) algorithm was used for face recalibration. Experiments show that this
algorithm can not only generate frontal images of Europeans but also achieve the frontal generation of Asian faces.
Jeon et al21 proposed a method for generating frontal face images from non-frontal faces in a single image based on the 3DMM model. The
frontal-view 3DMM of this method is generated by rotating the non-frontal-view 3DMM. The visibility of the 3DMM surface is measured
based on the ratio of the visible area corresponding to the front view and the non-front view of the 3DMM. The visible area of the front view 3DMM
is drawn by a segmented affine distortion of the face image, and the invisible area is drawn using the symmetry characteristics of the face.
Asthana et al22 proposed a novel 3D pose normalization method that maps non-frontal face images to an aligned 3D face model and
obtains frontal face images by adjusting the pose of this 3D model. This method automatically fits the 3D face model firmly to the 2D input
image without any manual intervention. In addition, the method can handle a continuous range of poses, and thus it is not limited to a set of discrete predetermined pose angles; it can be successfully applied to a standard face recognition pipeline and produces excellent results.
Hassner et al23 provided a simple and effective method for generating frontal faces. The method first searches for the 2D-3D correspon-
dence between the query photo and the surface of the 3D face model, and then uses a robust facial feature detection method to find the
same landmark in the two images, matching the query points with the points on the front view of the rendered model. Finally, by using the
geometric figures of the 3D model, the queried facial features are projected back to the reference coordinate system to generate a frontal
face image.
Some methods consider the problems of the 3DMM model itself, and a face frontal generation model has been proposed based on the 3DMM
model. For example, Fang et al24 proposed a frontal face image synthesis method based on a pose estimation. This method establishes an average 3D
face model for pose estimation to avoid complicated iterative calculations of the 3DMM method. The compressed sensing theory is used to screen
prototype samples to improve the accuracy of the deformation model. The original texture and the reconstructed texture are combined to construct
a comprehensive texture to preserve the detailed information of the face image.
Zhu et al25 considered the traditional 3DMM to have certain problems such as a slow operation speed, and proposed a high-fidelity pose and
expression normalization (HPEN) method based on a 3DMM. By estimating the depth information of the entire image, the three-dimensional trans-
formation of the posture and expression can be easily corrected to preserve as much identity information as possible. To ensure a smooth transition
from the normalized face area to the background and face area, HPEN also estimates the depth of the outer area of the face and background. This
method can automatically generate natural face images with frontal poses and neutral expressions.
In addition to the above methods, other approaches26 including the HCRC27 model have been developed.
Deep learning, as a new field of machine learning research, interprets data by imitating the mechanism of the human brain, such that the machine
can automatically learn good features without a manual selection process. Compared with the 3D model, the model based on deep learning reduces
the computational complexity, improves the generation rate, and solves the image quality problem caused by the artificial symmetry to a certain
extent. According to the different network models used, the frontal face generation model based on deep learning can be further subdivided into
models based on auto-encoders, convolutional neural networks, and generative adversarial networks.
An auto-encoder (AE) is a type of neural network that uses a back-propagation algorithm to make the output value equal to the input value. It
first compresses the input into the representation of the latent space, and then reconstructs the output through this representation.28 In short,
an AE can be regarded as a three-layer neural network structure: one input layer, one hidden layer, and one output layer. The network structure is shown in Figure 1. The network can be regarded as having two parts: an encoder that maps the input to a latent representation and a decoder that generates the reconstruction.
The auto-encoder needs to add some constraints during the training process such that it can only perform an approximate replication and can only replicate inputs similar to the training data. These constraints force the model to consider which parts of the input data need to be copied first, and thus it can often learn useful features of the data. In recent years, the connection between auto-encoders and latent variable model theory has brought auto-encoders to the forefront of generative modeling.
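As a minimal sketch of this three-layer structure (input layer, hidden layer, output layer), the following PyTorch snippet runs one reconstruction step; the image size, hidden width, and hyperparameters are illustrative assumptions.

import torch
import torch.nn as nn

class AutoEncoder(nn.Module):
    def __init__(self, in_dim=96 * 96, hidden_dim=256):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(in_dim, hidden_dim), nn.ReLU())
        self.decoder = nn.Sequential(nn.Linear(hidden_dim, in_dim), nn.Sigmoid())

    def forward(self, x):
        h = self.encoder(x)      # compress the input into the latent representation
        return self.decoder(h)   # reconstruct the input from that representation

model = AutoEncoder()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
x = torch.rand(8, 96 * 96)       # a batch of flattened face images (toy data)

# Back-propagation drives the output to approximate the input (reconstruction).
optimizer.zero_grad()
loss = nn.functional.mse_loss(model(x), x)
loss.backward()
optimizer.step()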
[FIGURE 2 Network structure of SPAE: stacked encoders f1-f3 and decoders g1-g3 map input faces at larger pose ranges (±45°, ±30°, ±15°) progressively toward the frontal pose (0°), and the top hidden layer provides a pose-robust feature.]
The model based on a 3D reconstruction relies on an existing 3D model, and thus the generated frontal image has artifacts. In the multi-view face
frontal image generation model based on the auto-encoder, the encoder is responsible for extracting the image features. The decoder is responsible
for synthesizing the frontal image of the face. It does not rely on prior knowledge and can better synthesize the frontal image of the face. Therefore,
an auto-encoder has been successfully applied to the generation of a frontal face.
For example, Kan et al29 proposed a stacked progressive auto-encoder (SPAE) model to mitigate a problem wherein facial appearance changes
caused by posture differences are greater than those caused by identity differences. SPAE can convert a non-frontal face image into a frontal face
image. The network structure of SPAE is shown in Figure 2 (the numbers on the left of the figure represent the face pose angles). Specifically, each
shallow progressive auto-encoder of the stacked network is designed to map facial images in larger poses to facial images in smaller poses, while
keeping those images unchanged in smaller poses. Then, stacking multiple shallow auto-encoders can gradually convert non-frontal face images into
frontal face images, which means that the pose changes are gradually reduced to zero. Therefore, the output of the top hidden layer of the stacked network contains extremely small pose variation and can be used as a pose-robust feature for face recognition.
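A rough sketch of this progressive stacking, reusing the auto-encoder dimensions assumed above (the pose schedule and layer sizes are illustrative, not the published SPAE configuration):

import torch.nn as nn

def shallow_pose_ae(dim=96 * 96, hidden=256):
    # One shallow auto-encoder that maps faces at a larger pose range
    # to faces of the same identity at the next smaller pose range.
    return nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(),
                         nn.Linear(hidden, dim), nn.Sigmoid())

# ±45° -> ±30° -> ±15° -> frontal: each stage is trained so that its output
# matches the same person rendered at the smaller pose, so stacking the stages
# gradually reduces the pose variation toward zero.
spae = nn.Sequential(shallow_pose_ae(), shallow_pose_ae(), shallow_pose_ae())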
In 2017, building on SPAE, Ning et al30 proposed a multi-pose face reconstruction and recognition method based on multi-task learning (MtL), namely the multi-task learning stacked auto-encoder (MtLSAE). The MtLSAE method uses a stacked auto-encoder to restore the frontal face pose layer by layer and, at the same time, introduces a non-negatively constrained sparse auto-encoder based on partial feature expression to retain the local feature information of the input data, thereby improving the reconstruction quality.
In 2019, Xu et al31 proposed a multi-pose face image frontalization method based on coding and decoding networks, which is called a multi-task
convolutional encoder-decoder network (MCEDN). This method introduces a frontal basic feature network to synthesize the basic features of the frontal
face, and on this basis, combines the local features of the multi-pose face extracted by the encoding network to compensate for the details, and
[FIGURE 3 Network structure of MTDN: a multi-scale transformative up-sampling network (encoder, spatial transformer layers STN1/STN2, residual blocks with skip connections, and up-sampling to the HR output) followed by a discriminative network.]
finally synthesizes a clearer frontal face image. A multi-task learning mechanism is used to establish an end-to-end model that unifies the three modules of local feature extraction, frontal basic feature analysis, and frontal image synthesis, and the training speed of the entire model is increased by sharing parameters.
In 2020, Yu et al32 proposed a multiscale transformative discriminative network (MTDN) that can simultaneously realize frontal image generation and super-resolution. As shown in Figure 3, an MTDN consists of two parts: a multi-scale transformative up-sampling network that combines an auto-encoder, spatial transformer network layers, up-sampling layers, and residual block layers; and a discriminative network composed of convolutional layers with max pooling and fully connected layers. The multi-scale transformative up-sampling network receives and super-resolves LR (low-resolution) face images at different resolutions, whereas the discriminative network forces the super-resolved faces to be lifelike. An MTDN can effectively align and upsample low-resolution images in different large poses, and the upsampled images are similar to their corresponding high-resolution images.
Inspired by the very large scale and complex interconnected structure of the human brain, neural networks have undergone an important transformation from shallow networks into deep networks. The convolutional neural network (CNN)33 is one of the most important deep models. Because of its effective feature extraction and generalization abilities compared with traditional methods such as Fourier-based approaches,34 it is used in computer vision, image detection, optical character recognition, and other fields, and it has significant advantages over shallow models in face recognition,35 image classification,36 paragraph image captioning,37 traffic sign recognition,38 and image super-resolution reconstruction.39
A CNN consists of an input layer, a convolutional layer, an activation function, a pooling layer, and a fully connected layer. Its structure is shown
in Figure 4. The convolutional layer follows the principle of weight sharing for feature extraction, and usually uses multiple convolutional layers
to obtain deeper feature maps. The activation function introduces a nonlinear mapping between the layer input and output; in addition, the pooling layer is often placed after the convolutional layer and compresses the input feature map to extract the main features and prevent overfitting. The function of a fully connected layer is to connect all features and send the output value to the classifier for classification.
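The following PyTorch sketch strings these layers together for a small face classifier; the input resolution, channel widths, and number of classes are illustrative assumptions.

import torch
import torch.nn as nn

cnn = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),  # convolutional layer (shared weights)
    nn.ReLU(),                                   # activation function
    nn.MaxPool2d(2),                             # pooling layer compresses the feature map
    nn.Conv2d(16, 32, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Flatten(),
    nn.Linear(32 * 24 * 24, 10),                 # fully connected layer feeding the classifier
)

logits = cnn(torch.rand(4, 3, 96, 96))           # a batch of 96 x 96 RGB face crops (toy data)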
By combining the three characteristics of local receptive fields, shared weights, and spatial or temporal pooled down-sampling, a CNN can make full use of the local features of the face image itself, and its strong robustness and fault tolerance ensure a certain degree of invariance to spatial translation, distortion, and scaling.40,41 In addition, the local connection, weight sharing, and pooling operations of the CNN can effectively reduce the complexity of the network model and the number of training parameters, making it easier to train and optimize.
[FIGURE 5 Training pipeline of A3F-CNN: the appearance flow between non-frontal and frontal faces is learned with SIFT-FLOW guidance during training.]
Nourabadi et al42 proposed a CNN-based method to solve the posture change problem using a single face image in 2013. First, the pose of each
image is estimated using the pose classifier model. Then, in addition to using the 2D image information, the estimated pose code is also used to
reconstruct the depth map. Finally, the estimated depth map and pose code are used to provide a frontal face image for identity recognition. This
method rotates the image to the front through the deep reconstruction bidirectional model and obtains the best identification accuracy.
Zhu et al43 proposed a new deep learning framework based on a CNN in 2014 to restore the frontal aspect of a face image. Unlike existing frontal
face generation methods that evaluate or use 3D information in a controlled 2D environment, this framework can directly learn the transformation
from a face image with complex variants into its standard view. During the training phase, to avoid the expensive process of manually training tags
from standard view images, a new metric was designed that can automatically select or synthesize standard view images for each identity.
Yim et al44 proposed a multi-task learning framework in 2015. By training a CNN network, a face and a binary code representing the target
pose are used as inputs. The face ID is maintained, and any pose and illumination face is rotated to the target pose, and the target pose can be
manually controlled.
In 2017, Aaron et al45 trained a CNN on appropriate datasets including 2D images and 3D facial models or scans to solve the limitation of
establishing a dense correspondence in a large facial pose, expression, and inconsistent illumination in facial reconstruction. The CNN of this method
only needs to process a single 2D facial image and does not need to accurately align or establish dense correspondence between images. It can be
used to generate any facial pose and expression and can be used to bypass the construction and fitting of a 3D deformable model to reconstruct the
entire 3D facial geometry, including the invisible part of the face.
In 2018, Zhang et al46 proposed a face frontal convolutional neural network (A3F-CNN) based on the appearance flow. The network structure
is shown in Figure 5. A3F-CNN learns to establish a dense correspondence between the non-frontal and frontal regions. Once the correspondence
is established, the front side can be synthesized by explicitly “moving” pixels from non-front side pixels. In this way, the synthesized front face can
retain a fine facial texture. To improve the convergence of the training, a flow-guided learning strategy (SIFT-FLOW in Figure 5) is proposed. In addition, it applies a generative adversarial network loss to obtain a more realistic face and introduces a face mirroring method to deal
with the self-occlusion problem. The results show that an A3F-CNN can synthesize more realistic human faces in both controlled and uncontrolled
lighting environments.
In 2019, Guo et al47 proposed a novel CNN-based framework to achieve real-time detailed face reverse rendering. Specifically, the framework
uses two CNNs for each frame, namely CoarseNet and FineNet. The first CNN fully estimates the coarse-scale geometry, albedo, lighting, and pose
parameters, and the second reconstructs the fine-scale geometry encoded at the pixel level. With the help of well-structured large-scale training
data, the framework can recover the detailed geometry, albedo, lighting, pose, and projection parameters in real time. Based on a CNN, Liu et al48 proposed a joint face alignment and 3D face reconstruction method to simultaneously solve these two problems for 2D face images with arbitrary poses and expressions.
Goodfellow et al49 first proposed a generative adversarial network (GAN) in 2014. GAN adopts the idea of a two-person zero-sum game in game the-
ory, composed of a generator and a discriminator. The generator captures the potential distribution of the real data samples and generates new data
samples. The discriminator is a two-classifier that discriminates whether the input is real or generated data. Both the generator and discriminator
can be implemented using a deep neural network model,50 the structure of which is shown in Figure 6.
In recent years, experts and scholars have made improvements on the basis of the original GAN in response to the problems of the original GAN
framework itself and the problems in practical applications. For example, the input latent variable z of the original GAN is unstructured, and it is not known which attribute is controlled by each dimension of the latent code. CGAN51 (conditional GAN) therefore uses a supervised learning method that takes the random noise z and a category label c as the input of the generator, while the discriminator takes generated/real samples together with the category labels as input to learn the association between labels and pictures. For the problem wherein GANs need paired training samples during the training process, CycleGAN52 (unpaired image-to-image translation using cycle-consistent adversarial networks) uses two mirror-symmetric GANs to form a ring network; a general mapping between data domain A and data domain B is learned, that is, the transformation between the styles of the two domains rather than a one-to-one mapping between individual samples, so the two input images can be any two images, which is the purpose of unpaired translation.
Compared with other deep learning models, the complexity of a GAN is linearly related to the dimensionality of the generated data: generating a larger image does not incur the exponentially increasing computational burden of traditional models, only a linear increase in the size of the neural network. Second, a GAN makes few prior assumptions, and a higher-quality frontal face image can be generated without any explicit parametric distribution assumption on the face image.53 Therefore, compared with other networks, a GAN is more suitable for frontal face generation.
[FIGURE 6 Basic GAN structure: the generator G maps random noise z to G(z), and the discriminator D judges whether its input is real data x or generated data, outputting real/fake.]
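A minimal sketch of this two-player game (one discriminator step and one generator step); the multilayer-perceptron sizes, learning rates, and toy data are illustrative assumptions.

import torch
import torch.nn as nn

G = nn.Sequential(nn.Linear(100, 256), nn.ReLU(), nn.Linear(256, 96 * 96), nn.Tanh())
D = nn.Sequential(nn.Linear(96 * 96, 256), nn.LeakyReLU(0.2), nn.Linear(256, 1))
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCEWithLogitsLoss()

real = torch.rand(8, 96 * 96)   # a batch of real face images (toy data)
z = torch.randn(8, 100)         # random noise fed to the generator

# Discriminator step: label real data as real and generated data as fake.
d_loss = bce(D(real), torch.ones(8, 1)) + bce(D(G(z).detach()), torch.zeros(8, 1))
opt_d.zero_grad(); d_loss.backward(); opt_d.step()

# Generator step: try to make the discriminator label generated data as real.
g_loss = bce(D(G(z)), torch.ones(8, 1))
opt_g.zero_grad(); g_loss.backward(); opt_g.step()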
Huang et al54 proposed a two-pathway generative adversarial network (TP_GAN) that simultaneously perceives the global structure and local details, which is used for frontal face generation. The network structure of TP_GAN is shown in Figure 7. To constrain the ill-posed synthesis problem, the TP_GAN network further introduces an adversarial loss, a symmetry loss, and an identity preservation loss during the training process. The proposed TP-GAN not only obtains convincing perceptual results but also achieves better results in large-pose face recognition.
The PWGAN proposed by Ma55 is a generative adversarial network based on pose weighting. In this network, a pre-trained pose authenti-
cation module is added to learn the face pose information. It makes full use of this information to make the network more aware of the face features
and obtain a better generation effect.
Cao et al56 observed that there is no clear definition of the relationship between the authenticity and identity preservation of the frontal face images generated by existing models. Therefore, based on a GAN, a three-dimensional-assisted dual generative adversarial network (AD_GAN)
is proposed to accurately rotate the yaw angle of the input face image to any specified angle. The model can change the pose of the input face image
and keep the image realistic.
Traditionally, pose-invariant face recognition is performed either by frontalizing the non-frontal face image or by learning a pose-invariant representation from the non-frontal face image. Luan et al57 believe that it is preferable to perform these two tasks jointly so that they can benefit from each other. Based on this, they proposed a novel disentangled representation learning generative adversarial network (DR-GAN). As shown in Figure 8, DR-GAN is based on the CGAN network and adds some new features, including an encoder-decoder structure
generator, a pose code, pose and identity classification in the discriminator, and an integrated multi-image fusion scheme. DR_GAN
can learn a face feature that has nothing to do with posture, and separate the pose information of the face from the feature, allowing the feature
extraction used for face recognition to be applied toward the recognition of various angles of the face.
Chen et al58 proposed a method for face recognition in video surveillance scenes based on the use of a conditional generative adversarial net-
work (cGAN). This method can input multiple faces with varying poses from the video. Experimental results show that this method can generate
suitable frontal faces from a dataset of 43,276 facial images of 19 people collected from real video surveillance scenes and increase the facial
recognition capabilities by approximately 20%.
Zhao et al59 proposed a pose-invariant model (PIM) for face recognition in the wild based on a GAN. PIM is a novel and unified deep neural network that includes a face frontalization subnet (FFN) and a discriminative learning subnet (DLN), which are learned together in an end-to-end manner and promote each other. The FFN is a carefully designed dual-path GAN (i.e., it simultaneously perceives the global structure and local details) that combines unsupervised cross-domain adversarial training and a meta-learning strategy with a dynamic discriminator for high-fidelity and identity-preserving frontal view synthesis.
Models based on generative adversarial networks for frontal face generation include FIGAN,60 PIGAN,61 PPN-GAN,62 CAPG-GAN,63 DVN64
and FNM.65 These works improve and optimize the GAN network to address the difficulty and instability of adversarial training and the problem that identity information cannot be well maintained, making the models suitable for face frontalization from all angles. Table 1 compares and analyzes the advantages and disadvantages of the deep learning models and their generation effects.
TABLE 1 Comparison of frontal face generation models based on deep learning
| Category | Representative models | Advantages | Disadvantages | Generation effect |
| Based on AE | SPAE,29 MCEDN,31 MTDN32 | Non-linear transformation modeling; simple model with few training parameters | Picture details are severely missing; artifacts appear in large-pose pictures; the recognition rate of the generated pictures is poor | Pass |
| Based on CNN | Zhu et al,43 Aaron et al,45 A3F-CNN46 | Strong robustness with a certain fault tolerance | Tends to be blurred in large poses; training relies on a large amount of labeled sample data | Good |
| Based on GAN | TP_GAN,54 DR_GAN,57 FIGAN,60 PIGAN,61 CAPG-GAN63 | Few prior assumptions are required; keeps the identity information of the image | Affected by the amount and distribution of data; the GAN itself is not easy to train | Excellent |

A single network model often extracts a single category of features and cannot cover all of the feature information. By fusing multiple models, the respective characteristics of the different models can be exploited when generating frontal images. At the same time, the correlations and restrictions between the different models are weighed and, considering the optimization goals of the different models, a better generation effect can be obtained. Existing face frontal generation models based on hybrid models are mostly combinations of 3D models and deep learning models, such as 3D model + encoder, 3D model + convolutional network, and 3D model + GAN.66 Table 2 shows the advantages and disadvantages of these models and their generation effects.

TABLE 2 Comparison of frontal face generation models based on hybrid models
| Category | Representative models | Advantages | Disadvantages | Generation effect |
| 3D + CNN | PAM,69 Ding et al70 | Efficient recognition accuracy and good robustness | Subject to constraints such as picture size and lighting factors | Excellent |
| 3D + GAN | FF_GAN,71 Zhou et al,72 Tewari et al73 | Frontal generation from any angle, while keeping the identity information and details intact | GAN artifacts may appear in the result; some inherent shortcomings of GAN remain | Excellent |
Wu et al67 proposed a method based on an auto-encoder and a 3D neural rendering model74 to learn 3D deformable object information directly from single-view images without external supervision. As the network structure in Figure 9 shows, the encoder internally decouples the input
image into the depth, albedo, view, and illuminance. The potential symmetry is then used by explicitly modeling the illumination, and the model is
expanded to infer the potential asymmetry in the object. Finally, the 3D neural network rendering model uses key point supervision to perform a 3D
reconstruction of the real face to generate various angles of the face image.
Gao et al68 designed an encoder–decoder architecture that allows end-to-end semi-supervised adversarial training to extract the untangled
semantic representation of a single image, in which the encoder network decomposes the input 2D facial image into unrelated semantic repre-
sentations: identity, expression, posture, and lighting codes. The two decoder networks regress the 3D face shape and albedo from the extracted
representation such that the graphics system can render the face image back to match the input image. This method uses a parametric lighting model
and a differentiable renderer to render the input face image under different identities, expressions, poses, and lighting conditions. This method
can retain the identity representation of the face image and replace the pose, illumination, and expression representation in another face image to
generate a new realistic face image with the same identity but the pose, illumination, and expression of another face.
The pose-aware model (PAM) proposed by Masi et al69 uses several pose-specific deep CNNs to process face images. 3D rendering is used to
synthesize multiple facial poses from input images to train these models and provide additional robustness to change the poses during testing. The
network structure of the PAM is shown in Figure 10. This method does not rely only on a single frontal CNN model but also uses half- and full-profile
models for face recognition in the wild, particularly when the target dataset contains full-profile images. Unlike many existing contemporary meth-
ods, which apply frontalization to counteract pose changes, this method suggests considering the estimated pose of the face adaptively during the
rendering.
To improve the accuracy of the 3D face recognition algorithm, Ding et al70 proposed an effective pose fusion algorithm that can frontalize the
face and combine multiple inputs. Its structure is shown in Figure 11. The algorithm uses a CNN-based deep feature extractor to extract 2D
features from a normalized canonical color image. The expression-invariant geodesic distance between facial landmarks calculated on a 3D facial
grid is used as a 3D geometric feature. Finally, these 2D and 3D functions are connected to train the SVM (support vector machines) model for face
recognition. For the face conversion part, the algorithm uses a 3D rotation matrix to model these changes and reverses the rotation to make the
face frontal. At the same time, to estimate the rotation parameters, the nose area of the 3D face is compared with the nose template of the standard
frontal face, and then fused to obtain a frontal model with a complete set of facial landmarks. Through experiments and comparisons with other state-of-the-art methods, the authors show that their approach achieves the highest facial recognition rate and is robust to pose and expression changes.
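The "reverse the rotation" step can be illustrated with a small NumPy sketch: once the yaw angle has been estimated (e.g., from the nose-template comparison), applying the inverse rotation matrix to the 3D face vertices brings the face to a frontal pose. The point cloud and angle below are toy values, not the authors' data.

import numpy as np

def yaw_rotation(theta_rad):
    # Rotation about the vertical (Y) axis by theta_rad radians.
    c, s = np.cos(theta_rad), np.sin(theta_rad)
    return np.array([[ c, 0.0, s],
                     [0.0, 1.0, 0.0],
                     [-s, 0.0, c]])

points_3d = np.random.rand(1000, 3)       # observed 3D face vertices (toy data)
estimated_yaw = np.deg2rad(30.0)          # pose estimated from the nose-template match
frontalized = points_3d @ yaw_rotation(-estimated_yaw).T   # undo the yaw rotation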
Fangmin75 combined local datasets with public datasets, improved the existing 3DMM fitting method, and then used a convolutional neural network (CNN) to improve the reconstruction effect. Experimental results and analysis show that this method costs much less time than traditional 3D face modeling methods, performs better than existing deep learning-based methods for different races on photographs at arbitrary angles, and is more robust.
Based on the 3DMM and GAN networks, Xi et al71 proposed a novel face frontalization generative adversarial network conditioned on a deep 3D morphable model (FF-GAN), which combines elements from a deep 3DMM and a face recognition CNN. Integrating the 3DMM into the GAN structure provides shape and appearance priors, quick convergence with less training data, and end-to-end training, and thus a single input image can be used to achieve high-quality, identity-preserving frontalization.
Zhao et al76 proposed a 3D-assisted deep pose invariant model (3D-PIM) for pose-invariant face recognition. The 3D-PIM contains simulators
and optimization programs that learn in a conjugate manner. The simulator is assisted by a 3DMM, which can locate landmark points, estimate
3DMM coefficients, and generate synthetic faces with standardized poses, which are then fed to the refiner for realistic refinement. The refiner uses
a global local GAN network, which uses real unlabeled data to improve the global structure and the authenticity of the local details of the simulator
output while retaining the identity information.
Cao et al77 proposed a high-fidelity pose-invariant model (HF-PIM) to obtain high-resolution, realistic frontal results that preserve identity. HF-PIM combines the advantages of 3D- and GAN-based methods and frontalizes profile images through a novel facial
texture fusion warping program.
Zhou et al72 proposed a novel unsupervised framework to solve the limitations of the scale and scope of the data source, eliminate a multi-view
supervision, and solve the domain generalization problem caused by multi-view supervision. The operation process is shown in Figure 12.
3D face modeling is applied through repeated rotations and rendering operations to build a self-supervision, and an ordinary CycleGAN is used
to generate the final image. The framework does not depend on the multi-view images of the same person, and can generate high-quality images
from other perspectives. The framework is also suitable for various unrestricted scenes and can apply 3D rotation and face rendering to any angle
without a loss of detail.
StyleGAN generates realistic portrait images of faces with eyes, teeth, hair, and context (neck, shoulders, and background), but lacks a rig-like control over semantic facial parameters (such as the facial pose, expression, and scene lighting) that can be interpreted in 3D. The 3DMM
provides control over the semantic parameters but lacks photo realism when rendering, and only models the interior of the face, without modeling
the other parts of the portrait image (the hair, inside the mouth, or background). Tewari et al73 proposed the StyleRig model to provide control similar
to a facial rig for a pre-trained and fixed StyleGAN through a 3DMM. Because the parameters of the 3DMM can also be controlled independently,
StyleRig is allowed to perform explicit semantic control on the images generated using StyleGAN. A new assembly network is trained between the
semantic parameters of the 3DMM and the input of StyleGAN. Users can interact with the facial grid by interactively changing their pose, expression,
and scene lighting parameters.
This section provides a detailed introduction to the commonly used datasets for face frontal generation in recent years and provides a basis for a
subsequent experimental analysis.
FERET78 was created by the FERET project. This image set contains a large number of face images, and each image has only one face. In this
dataset, photographs of the same person have different expressions, lighting, posture, and changes in age. Containing more than 10,000 face images
with multiple poses and illuminations, it is one of the most widely used face databases in the field of face recognition. However, the images in the
dataset are mostly of Westerners, and the changes in the facial images of each person are relatively simple.
Multi_PIE79 was established by Carnegie Mellon University (CMU). “PIE” is the abbreviation of pose, illumination, and expression. The CMU
Multi-PIE face database was developed based on the CMU-PIE face database and contains more than 75,000 facial images with multiple poses,
lighting, and expressions of 337 volunteers. Images with changes in posture and illumination are also collected under strictly controlled conditions
and have gradually become an important test set in the field of face recognition.
The face images provided by LFW80 are all derived from natural scenes in life, including more than 13,000 facial images of 5749 subjects. These
images apply different postures, expressions, lighting, and occlusion methods.
CelebA81 is the abbreviation of CelebFaces Attribute, which is a dataset of celebrity face attributes. It contains 202,599 facial images with
10,177 celebrity identities. Each image is marked with features, including a face bounding box with labels, five facial feature point coordinates, and
40 attribute markers. CelebA is openly provided by the Chinese University of Hong Kong and is widely used in face-related computer vision training
tasks.
CFP82 consists of 500 subjects, each with 10 frontal and four profile images.
CAS-PEAL83 is a database of 99,450 faces of 1040 volunteers completed by the Institute of Computing Technology of the Chinese Academy of
Sciences in 2003. The database covers changes in features such as gestures, expressions, decorations, lighting, background, distance, and time.
IJB-A84 includes 5396 images and 20,412 video frames for 500 subjects, which is a challenge for uncontrolled posture changes. Unlike previous
databases, IJB-A defines face template matching, where each template contains a different number of images. It consists of 10 folders, each of which
is a different partition of the entire collection.
This section describes an experimental comparative analysis conducted on different datasets for a variety of face frontal generation models.
The training datasets used in the experiment include Multi_PIE, CelebA, and CAS-PEAL, and the test datasets include CFP, LFW, and IJB-A. The
experimental environment is shown in Table 3. To verify the effectiveness of the model, in addition to visually comparing the frontal images,
the facial frontal generation model also uses evaluation indicators to quantitatively analyze the performance of the model. This section describes
the selection of the input side view angle range, and the recognition accuracy rate85 (ACC and AUC) and face verification indices86 (verification and
identification) are used as objective evaluation indexes. The experiments compared models based on a 3DMM (HPEN25), an AE (SPAE29), a CNN (A3F-CNN46), and a GAN (TP_GAN54 and DR_GAN57), as well as hybrid models (e.g., FF_GAN71 and Zhou et al's method72).
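For reference, the two verification metrics reported below can be computed as in the following sketch; the labels and similarity scores are toy values, whereas a real evaluation would use the scores produced by each frontalization and recognition pipeline.

import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

labels = np.array([1, 1, 0, 0, 1, 0])                 # 1: same identity, 0: different identity
scores = np.array([0.9, 0.7, 0.4, 0.2, 0.6, 0.55])    # similarity of the compared face features

acc = accuracy_score(labels, (scores > 0.5).astype(int))  # accuracy at a fixed threshold
auc = roc_auc_score(labels, scores)                       # threshold-free ranking quality
print(f"ACC = {acc:.3f}, AUC = {auc:.3f}")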
[FIGURE 13 Frontal faces generated on the LFW dataset by HPEN, TP_GAN, DR_GAN, FF_GAN, and the method of Zhou et al.]
[FIGURE 14 Frontal faces generated on the Multi_PIE dataset by TP_GAN and the method of Zhou et al.]
Figure 13 shows a visualization of frontal faces generated by some of the models based on the LFW dataset. Although HPEN tries to fill in
the missing parts with symmetric priors, it produces large artifacts when larger poses appear. Because TP_GAN is a reconstruction-based method
trained on a constrained dataset, the test results show that it is only effective in the same field as the training dataset, and the effect is poor on the
LFW dataset; in addition, FF_GAN does not completely rotate the face to the front, which leads to a significant loss of details. Frontalization is applied more effectively by Zhou et al's method.72
To facilitate the comparison of the effectiveness of the different methods in different poses, Figure 14 shows the faces restored by TP-GAN and by the model of Zhou et al72 on the Multi_PIE dataset. It can be seen from the figure that when the side-face angle is between 0 and 45 degrees, both TP_GAN and the model of Zhou et al72 achieve frontal face generation extremely well, although TP_GAN ignores the influence of lighting, and at −45 to 0 degrees TP_GAN incurs a loss of identity. The frontal faces generated by the hybrid model are better for any pose.
Table 4 shows the face recognition performance (ACC and AUC) of the different models on the LFW database, together with the pose range each model can handle and its generation speed. As shown in Table 4, the generation methods based on model fusion (FF_GAN and the approach of Zhou et al72) achieve the highest average face verification accuracy, which shows that they can effectively preserve the texture information related to identity. The neural network- and encoder-based methods and the 3D-based generation method achieve a similar average facial recognition accuracy, whereas the 3D-based method has a slower generation speed and works better on small-pose images.
Table 5 shows the verification and identification performance of the different models on the IJB-A database. The IJB-A dataset involves large face poses, poor lighting conditions, and low resolution, and the TP-GAN training data come from different datasets, which leads to a degradation of both the face synthesis and the face recognition performance of TP-GAN.
TABLE 5 Verification and identification performance of the different models on the IJB-A database
| Model | Model classification | Verification (%) @FAR=0.01 | Verification (%) @FAR=0.001 | Identification (%) Rank-1 | Identification (%) Rank-5 |
| SPAE29 | Based on AE | 30.8 ± 1.1 | 21.4 ± 1.1 | 47.16 ± 2.3 | 56.32 ± 1.4 |
| A3F_CNN46 | Based on CNN | 80.4 ± 3.3 | 60.0 ± 8.6 | 92.2 ± 2.3 | 97.4 ± 0.9 |
| TP_GAN54 | Based on GAN | 31.5 ± 1.8 | 9.2 ± 1.1 | 48.6 ± 5.0 | 59.3 ± 5.6 |
| DR_GAN57 | Based on GAN | 77.4 ± 2.7 | 53.9 ± 4.3 | 85.5 ± 1.5 | 94.7 ± 1.1 |
| PAM69 | Hybrid model-based | 64.0 ± 2.7 | 86.5 ± 1.2 | 67.9 ± 2.0 | 83.7 ± 0.4 |
| FF_GAN71 | Hybrid model-based | 85.2 ± 1.0 | 66.3 ± 1.3 | 90.2 ± 0.6 | 95.4 ± 0.5 |
| Zhou et al72 | Hybrid model-based | 97.3 ± 0.6 | 95.63 ± 0.7 | 98.44 ± 0.96 | 99.1 ± 0.63 |
Based on the above experimental analysis, we can see that the method based on 3D reconstruction relies on a large amount of accurate, finely scanned 3D face data to frontalize the target face, which requires many calculations and is time-consuming. The end-to-end CNN-based method does not require a large amount of 3D scanned face data; it extracts the corresponding parameter vectors from the input face image and, through a series of network constraints, can restore face images at various angles, but its control over the deformation is weak and depends entirely on the supervision information in the network. The GAN-based model makes full use of the advantages of the generative adversarial network to realize the frontal generation of large-pose faces; however, the generated images are prone to a loss of identity information. The method based on the hybrid model performs better in both quantitative and qualitative analyses.
In this paper, the problem of frontal face generation was analyzed; methods based on 3D models, deep learning, and hybrid models were introduced in detail; and the experimental results of the different models were compared and analyzed.
The method based on a 3DMM can realize the ideal synthesis of the face shape and texture because it uses dense 3D data. However, face
rectification based on 3D models generally uses a 3DMM to fit the face and then applies symmetry or other operations to fill in the information lost to self-occlusion in the side-face image, which leaves obvious traces of artificial symmetry when correcting large-angle faces. Moreover, the 3DMM requires a large number of scanned face models to create an average face model, and thus it has the
disadvantage of a large number of calculations and a slow generation speed.
The face generation method based on deep learning uses the powerful fitting ability of deep learning to synthesize a face from a virtual perspective and generate a frontal face image. For example, a model based on an auto-encoder must learn the pose correction through decoding; some local features are discarded, so although frontal face images can be synthesized, the details are severely lacking. The CNN-based face frontal gen-
eration model uses the nonlinear learning ability of a neural network to solve the face pose problem, but it requires a large-scale database collected
in a restricted environment, for learning. The GAN-based model applies the powerful image generation capabilities of a generative adversarial net-
work for face generation, improving the quality of the generated face and the performance of face recognition; however, the training process relies
on high-quality multi-view training data. Therefore, the generation results are limited by the amount and distribution of data.
The hybrid model-based method makes full use of the advantages of 3D models, convolutional network models, encoder models, and generative
adversarial network models. By combining two or more models, the advantages of different models are complemented. It solves the problem of
data source scope limitation in a single model, the loss of generated image details, and artifacts in frontal images of human faces. Because the GAN
network can reconstruct the image details, the combination of the 3D model and GAN network has been widely used in recent years.
1. Optimization of existing algorithms—Although the latest algorithms have improved the image generation and recognition rate of frontal faces
compared with previous algorithms, there are still some problems to be solved, for example, the model depends on the performance of the GPU,
and the quality of the images generated by some models still needs to be improved. Solving some of these problems will be the main focus of
future research.
2. Treatment of extreme environments—The existing frontal face generation model has good robustness in most scenes and natural environments;
however, for extreme environments (such as noisy texture, shadows, and extreme illumination), the quality of the front face generated is low.
Therefore, for more complex images, we can use multiple normal views or different 3D models based on existing algorithms, to build a more
stable and efficient frontal face generation model.
3. Combination with face attribute editing—Face attributes include expression, posture, gender, age, and other factors. Although a single posture
correction can improve the accuracy of the recognition, it ignores the influence of other attributes of face recognition. The editing of multiple
attributes and the decoupling of the correlation between each attribute should be considered such that the generated face image is more in line
with the real image, and the recognition accuracy is further improved.
4. Actual application scenario requirements—At present, research on the generation of front faces has been recognized in the field of face recog-
nition, but most of the research is based on theory and has rarely been applied to actual scenarios. Therefore, determining ways to combine
existing models with specific actual scenarios, such as video surveillance, criminal investigation identification, and other requirements, will be
an extremely valuable research direction.
5. Utilization of unique features of the human face—Faces have unique characteristics, such as an extremely complicated topology. If we can fully
investigate and improve the unique features of human faces, we can integrate the face topology into face generation and face recognition. This
can further improve the quality of the generated face and improve the accuracy of the face recognition.
6. Integration of multiple models—An image generated by a single solution to the problem of face frontal conversion has certain limitations. Relevant
models can be combined from the fields of image restoration, super-resolution reconstruction, expression recognition,87 and image denoising such that the generated frontal images can be further processed and their quality and clarity improved.
7. Perfection of theoretical knowledge—Although deep learning can automatically learn to express abstract and deep features without the need for
experts to select and extract features manually, the application of deep learning models to frontal face generation still lacks sufficient
theoretical support and reasonable interpretability. Therefore, it is necessary to further improve the deep learning theory to provide guidance
for improving the model structure, accelerating the model training, and improving the detection effects.
In short, in future research, based on frontal face studies, the focus needs to be on information fusion and real scene applications, among other
factors. Therefore, researchers need to propose more innovative and practical models and methods.
ACKNOWLEDGMENT
This work is supported by the National Natural Science Foundation of China (Grant No. 61901436).
REFERENCES
1. Lai Y, Lai S. Emotion-preserving representation learning via generative adversarial network for multi-view facial expression recognition. Paper presented
at: 2018 13th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2018); 2018; Xi’an:263–270. doi:10.1109/FG.2018.00046.
2. Ding C, Tao D. A comprehensive survey on pose-invariant face recognition. ACM Trans Intell Syst Technol. 2016;7(3):37.
3. Jamil NA, Sheikh UU, Mokji MM, et al. Improved face recognition on video surveillance images using pose correction. Paper presented at: TENCON
2017–2017 IEEE Region 10 Conference; 2017: IEEE.
4. Kumar A. Contactless finger knuckle authentication under severe pose deformations. Paper presented at: 2020 8th International Workshop on
Biometrics and Forensics (IWBF); 2020.
5. Grover V, Chhabra N. Attendance monitoring system through face recognition. Paper presented at: 2019 6th International Conference on Computing
for Sustainable Global Development (INDIACom); 2020: IEEE.
6. Lin CH, Wang ZH, Jong GJ. A de-identification face recognition using extracted thermal features based on deep learning. IEEE Sens J. 2020;99:1-1.
7. Buddhavarapu S. Face-to-face payments with augmented reality; 2018.
8. Zafaruddin GM, Fadewar HS. Face recognition: a holistic approach review. Paper presented at: 2014 International Conference on Contemporary
Computing and Informatics (IC3I); 2014; Mysore, India:175-178. doi:10.1109/IC3I.2014.7019610.
9. Wenchao M. Research and implementation of face correction based on generative adversarial networks[D]. School Comput Sci Eng. 2019.
10. Ma J, Zhou F. Multi-poses face frontalization based on pose weighted GAN. Paper presented at: 2019 IEEE 3rd Information Technology, Networking,
Electronic and Automation Control Conference (ITNEC); 2019; Chengdu, China:1271–1276. doi:10.1109/ITNEC.2019.8729088.
11. Shao X, Zhou X, Li Z, Shi Y. Multi-view face recognition via well-advised pose normalization network. IEEE Access. 2020;8:66400-66410. https://doi.org/10.1109/ACCESS.2020.2983459.
12. Wei Y, Liu M, Wang H, Zhu R, Hu G, Zuo W. Learning flow-based feature warping for face frontalization with illumination inconsistent supervision. 2020:558-574.
13. Wang Y, Yu H, Dong J, et al. Cascade support vector regression-based facial expression-aware face frontalization. Paper presented at: IEEE International
Conference on Image Processing; 2018: IEEE.
14. Kopaczka M, Breuer L, Schock J, Merhof D. A modular system for detection, tracking and analysis of human faces in thermal infrared recordings. Sensors.
2019;19(19):4135.
15. Xia L, Zhang K, Su J. Face frontalization with adaptive soft symmetry. Paper presented at: 2017 IEEE international conference on robotics and biomimetics
(ROBIO); 2017: IEEE.
16. Sagonas C, Panagakis Y, Zafeiriou S, Pantic M. Robust statistical frontalization of human and animal faces. Int J Comput Vis. 2017;122:270-291. https://doi.org/10.1007/s11263-016-0920-7.
17. Kejun W, GuoFeng Z, Guoxia F. An approach to fast eye location and face plane rotation correction. J Comput Aided Des Comput Graph.
2013;25(6):865-872,879.
18. Blanz V, Vetter T. A morphable model for the synthesis of 3D faces. Paper presented at: SIGGRAPH; 1999:187-194.
19. Zhu X, Lei Z, Yan J, Yi D, Li SZ. High-fidelity pose and expression normalization for face recognition in the wild. Paper presented at: 2015 IEEE Conference
on Computer Vision and Pattern Recognition (CVPR); 2015; Boston, MA:787–796. doi: 10.1109/CVPR.2015.7298679.
20. Qianqing W, Jinglei Z. Face pose and expression correction based on 3D morphable model. Comput Sci. 2019;46(6):263-269.
21. Jeon SH, Yoon HS, Kim JH. Frontal face reconstruction with symmetric constraints. Paper presented at: 2016 13th International Conference on
Ubiquitous Robots and Ambient Intelligence (URAI); 2016: IEEE.
22. Asthana A, Marks TK, Jones MJ, et al. Fully automatic pose-invariant face recognition via 3D pose normalization. Paper presented at: Proceedings of the
International Conference on Computer Vision; 2011; Barcelona:937–944.
23. Hassner T, Harel S, Paz E, et al. Effective face frontalization in unconstrained images. Paper presented at: Proceedings of the IEEE Conference on
Computer Vision and Pattern Recognition; 2015; Boston, MA:4295–4304.
24. Fang SY, Zhou DK, Cao YP, et al. Frontal face image synthesis based on pose estimation. Comput Eng. 2015;41:240-244.
25. Zhu XY, Lei Z, Yan JJ, et al. High-fidelity pose and expression normalization for face recognition in the wild. Paper presented at: Proceedings of the IEEE
Conference on Computer Vision and Pattern Recognition; 2015; Boston, MA:787–796.
26. Song X, Feng Z, Hu G, Kittler J, Wu X. Dictionary integration using 3D Morphable face models for pose-invariant collaborative-representation-based
classification. IEEE Trans Inform Foren Secur. 2018;13(11):2734-2745. https://doi.org/10.1109/TIFS.2018.2833052.
27. Shi L, Song X, Zhang T, et al. Histogram-based CRC for 3D-aided pose-invariant face recognition. Sensors. 2019;19(4):759-883.
28. Xu Q, Wu Z, Yang Y, Zhang L. The difference learning of hidden layer between autoencoder and variational autoencoder. Paper presented at: 2017 29th
Chinese Control and Decision Conference (CCDC); 2017; Chongqing:4801–4804. doi:10.1109/CCDC.2017.7979344.
29. Kan M, Shan S, Chang H, Chen X. Stacked progressive auto-encoders (SPAE) for face recognition across poses. Paper presented at: 2014 IEEE Conference
on Computer Vision and Pattern Recognition; 2014; Columbus, OH:1883–1890. doi:10.1109/CVPR.2014.243.
30. Ning O, Yutao MA, Leping L. Multi-pose face reconstruction and recognition based on multi-task learning. J Comput Appl. 2017;37(3):896-900.
31. Haiyue X, Yao N, Peng X, Chen H, Wang H. Multi-pose face frontalization method based on encoder-decoder network. Scientia Sin. 2019;49(4):450-463.
32. Yu X, Porikli F, Fernando B, Hartley R. Hallucinating unaligned face images by multiscale transformative discriminative networks. Int J Comput Vis.
2020;128(8):500-526.
33. LeCun Y, Bottou L, Bengio Y, Haffner P. Gradient-based learning applied to document recognition. Proc IEEE. 1998;86(11):2278-2324.
34. Tao R, Zhao X, Li W, Li H, Du Q. Hyperspectral anomaly detection by fractional Fourier entropy. IEEE J Sel Topics Appl Earth Observ Remote Sens.
2019;12(12):4920-4929.
35. Zhou Y, Ni H, Ren F, Kang X. Face and gender recognition system based on convolutional neural networks. Paper presented at: 2019 IEEE International
Conference on Mechatronics and Automation (ICMA); 2019; Tianjin, China:1091–1095. doi:10.1109/ICMA.2019.8816192.
36. Zhao X, Tao R, Li W, Li H, Du Q, Liao W, Philips W. Joint classification of hyperspectral and LiDAR data using hierarchical random walk and deep
CNN architecture. IEEE Trans Geosci Remote Sens. 2020; in press:1-16.
37. Li R, Liang H, Shi Y, Feng F, Wang X. Dual-CNN: a convolutional language decoder for paragraph image captioning. Neurocomputing. 2020;396:92-101.
38. Zhang J, Xie Z, Sun J, Zou X, Wang J. A cascaded R-CNN with multiscale attention and imbalanced samples for traffic sign detection. IEEE Access.
2020;8:29742-29754. https://doi.org/10.1109/ACCESS.2020.2972338.
39. Nan F, Zeng Q, Xing Y, Qian Y. Single image super-resolution reconstruction based on the ResNeXt network. Multimed Tools Appl.
2020;79:34459-34470. https://doi.org/10.1007/s11042-020-09053-8.
40. Qin H, Gong R, Liu X, Bai X, Song J, Sebe N. Binary neural networks: a survey. Pattern Recogn. 2020;105:107281.
41. Scott GJ, Marcum RA, Davis CH, et al. Fusion of deep convolutional neural networks for land cover classification of high-resolution imagery. IEEE Geosci
Remote Sens Lett. 2017;99:1-5.
42. Nourabadi NS, Dizaji KG, Seyyedsalehi SA. Face pose normalization for identity recognition using 3D information by means of neural networks. Paper
presented at: Conference on Information and Knowledge Technology (IKT); 2013:432–437.
43. Zhu Z, Luo P, Wang X, et al. Recover canonical-view faces in the wild with deep neural networks; 2014. arXiv preprint.
44. Yim J, Jung H, Yoo BI, et al. Rotating your face using multi-task deep neural network. Paper presented at: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; 2015:676–684.
45. Jackson AS, Bulat A, Argyriou V, Tzimiropoulos G. Large pose 3D face reconstruction from a single image via direct volumetric CNN regression. Paper
presented at: 2017 IEEE International Conference on Computer Vision (ICCV); 2017; Venice, Italy:1031-1039. doi:10.1109/ICCV.2017.117.
46. Zhang Z, Chen X, Wang B, et al. Face frontalization using an appearance-flow-based convolutional neural network. IEEE Trans Image Process.
2019;28(5):2187-2199.
47. Guo Y, Cai J, Jiang B, Zheng J. CNN-based real-time dense face reconstruction with inverse-rendered photo-realistic face images. IEEE Trans Pattern Anal
Mach Intell. 2019;41(6):1294-1307.
48. Liu F, Zhao Q, Liu X, Zeng D. Joint face alignment and 3D face reconstruction with application to face recognition. IEEE Trans Pattern Anal Mach Intell.
2020;42(3):664-678. https://doi.org/10.1109/TPAMI.2018.2885995.
49. Goodfellow I, Pouget-Abadie J, Mirza M, Xu B, Warde-Farley D, Ozair S, Courville A, Bengio Y. Generative adversarial nets. Paper presented at:
Proceedings of the 27th International Conference on Neural Information Processing Systems (NIPS); 2014:2672–2680.
50. Li R, Zhang X, Chen G, Mao Y, Wang X, et al. Multi-negative samples with generative adversarial networks for image retrieval. Neurocomputing.
2020;394:146-157.
51. Mirza M, Osindero S. Conditional generative adversarial nets; 2014. arXiv preprint arXiv:1411.1784.
52. Zhu J, Park T, Isola P, Efros AA. Unpaired image-to-image translation using cycle-consistent adversarial networks. Paper presented at: 2017 IEEE
International Conference on Computer Vision (ICCV); 2017; Venice, Italy:2242–2251. doi:10.1109/ICCV.2017.244.
53. Lin YL, Dai XY, Li L, Wang X, Wang FY. The new frontier of AI research: generative adversarial networks. Acta Automat Sin.
2018;44(5):775-792.
54. Huang R, Zhang S, Li T, He R. Beyond face rotation: global and local perception GAN for photorealistic and identity preserving frontal view synthesis.
Paper presented at: 2017 IEEE International Conference on Computer Vision (ICCV); 2017; Venice, Italy:2458–2467. doi: 10.1109/ICCV.2017.267.
55. Zhang S, et al. Pose-weighted GAN for photorealistic face frontalization. Paper presented at: 2019 IEEE International Conference on Image Processing
(ICIP); 2019; Taipei, Taiwan:2384–2388. doi:10.1109/ICIP.2019.8803362.
56. Cao J, Hu Y, Yu B, He R, Sun Z. 3D aided duet GANs for multi-view face image synthesis. IEEE Trans Inform Foren Secur. 2019;14(8):2028-2042.
57. Tran L, Yin X, Liu X. Disentangled representation learning GAN for pose-invariant face recognition. Paper presented at: 2017 IEEE Conference on
Computer Vision and Pattern Recognition (CVPR); 2017; Honolulu, HI:1283–1292. doi:10.1109/CVPR.2017.141.
58. Chen Z, He Q, Pang W, Li Y. Frontal face generation from multiple pose-variant faces with CGAN in real-world surveillance scene. Paper
presented at: 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP); 2018; Calgary, AB:1308–1312.
doi:10.1109/ICASSP.2018.8462648.
59. Zhao J, Xing J, Xiong L, et al. Recognizing profile faces by imagining frontal view. Int J Comput Vis. 2020;128:460-478. https://doi.org/10.1007/s11263-019-01252-7.
60. Rong X, Zhang X, Lin Y. Feature-improving generative adversarial network for face frontalization. IEEE Access. 2020;8:68842-68851.
61. Alqahtani H. PI-GAN: learning pose independent representations for multiple profile face synthesis; 2019.
62. Liu L, Zhang L, Chen J. Progressive pose normalization generative adversarial network for frontal face synthesis and face recognition under large pose.
Paper presented at: 2019 IEEE International Conference on Image Processing (ICIP); 2019; Taipei, Taiwan:4434–4438. doi:10.1109/ICIP.2019.8803452.
63. Hu Y, Wu X, Yu B, He R, Sun Z. Pose-guided photorealistic face rotation. Paper presented at: 2018 IEEE/CVF Conference on Computer Vision and Pattern
Recognition; 2018; Salt Lake City, UT:8398–8406. doi:10.1109/CVPR.2018.00876.
64. Hsu G, Tang C. Dual-view normalization for face recognition. IEEE Access. 2020;8:147765-147775. https://doi.org/10.1109/ACCESS.2020.3014877.
65. Qian Y, Deng W, Hu J. Unsupervised face normalization with extreme pose and expression in the wild. Paper presented at: 2019 IEEE/CVF Conference
on Computer Vision and Pattern Recognition (CVPR); 2019; Long Beach, CA:9843–9850. doi:10.1109/CVPR.2019.01008.
66. Ruiz-Garcia A, Palade V, Elshaw M, Awad M. Generative adversarial stacked autoencoders for facial pose normalization and emotion recognition. Paper
presented at: 2020 International Joint Conference on Neural Networks (IJCNN); 2020; Glasgow, UK:1–8. doi:10.1109/IJCNN48605.2020.9207170.
67. Wu S, Rupprecht C, Vedaldi A. Unsupervised learning of probably symmetric deformable 3D objects from images in the wild; 2019.
68. Gao Z, Zhang J, Guo Y, Ma C, Zhai G, Yang X. Semi-supervised 3D face representation learning from unconstrained photo collections. Paper
presented at: 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW); 2020; Seattle,WA:1426-1435.
doi:10.1109/CVPRW50498.2020.00182.
69. Masi I, Chang FJ, Choi J, et al. Learning pose-aware models for pose-invariant face recognition in the wild. IEEE Trans Pattern Anal Mach Intell.
2019;41(2):379-393.
70. Ding Y, Li N, Young SS, Ye J. Efficient 3D face recognition in uncontrolled environment. In: Bebis G et al., eds. Advances in Visual Computing. ISVC 2019.
Lecture Notes in Computer Science. Vol 11844. Cham, Switzerland: Springer; 2019. https://doi.org/10.1007/978-3-030-33720-9_33.
71. Yin X, Yu X, Sohn K, et al. Towards large-pose face frontalization in the wild. Paper presented at: IEEE International Conference on Computer Vision (ICCV); 2017.
72. Zhou H, Liu J, Liu Z, Liu Y, Wang X. Rotate-and-render: unsupervised photorealistic face rotation from single-view images. Paper presented at: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR); 2020.
73. Tewari A, et al. StyleRig: rigging StyleGAN for 3D control over portrait images. Paper presented at: 2020 IEEE/CVF Conference on Computer Vision and
Pattern Recognition (CVPR); 2020; Seattle, WA:6141–6150. doi:10.1109/CVPR42600.2020.00618.
74. Kato H, Ushiku Y, Harada T. Neural 3D mesh renderer; 2017.
75. Fangmin L, Ke C, Xinhua L. 3D face reconstruction based on convolutional neural network. Paper presented at: 2017 10th International Conference on
Intelligent Computation Technology and Automation (ICICTA); 2017; Changsha:71–74. doi:10.1109/ICICTA.2017.23.
76. Zhao J, Xiong L, Cheng Y, et al. 3D-aided deep pose-invariant face recognition. Paper presented at: International Joint Conference on Artificial
Intelligence (IJCAI 2018); 2018.
77. Cao J, Hu Y, Zhang H, et al. Towards high fidelity face frontalization in the wild. Int J Comput Vis. 2019;128:1485–1504.
78. Phillips PJ, Wechsler H, Huang J, Rauss PJ. The FERET database and evaluation procedure for face-recognition algorithms. Image Vis Comput.
1998;16(5):295-306.
79. Gross R, Matthews I, Cohn J, Kanade T, Baker S. Multi-PIE. Image Vis Comput. 2010;28(5):807-813.
80. Huang GB, Ramesh M, Berg T, et al. Labeled faces in the wild: a database for studying face recognition in unconstrained environments. Technical Report
07-49. University of Massachusetts, Amherst; 2007.
81. Liu Z, Luo P, Wang X, Tang X. Deep learning face attributes in the wild. Paper presented at: Proceedings of the IEEE International Conference on Computer Vision (ICCV); 2015.
82. Sengupta S, Chen J-C, Castillo C, Patel VM, Chellappa R, Jacobs DW. Frontal to profile face verification in the wild. Paper presented at: WACV; 2016.
83. Gao W, Cao B, Shan SG, et al. The CAS-PEAL large-scale Chinese face database and baseline evaluations. IEEE Trans Syst Man Cybern A. 2008;38:149-161.
84. Klare BF, Klein B, Taborsky E, Blanton A, Cheney J, Allen K, et al. Pushing the frontiers of unconstrained face detection and recognition: IARPA Janus
Benchmark A. Paper presented at: IEEE Conference on Computer Vision and Pattern Recognition; 2015.
85. He R, Cao J, Song L, Sun Z, Tan T. Adversarial cross-spectral face completion for NIR-VIS face recognition. IEEE Trans Pattern Anal Mach Intell.
2020;42(5):1025–1037. https://doi.org/10.1109/TPAMI.2019.2961900.
86. Bae HB, Jeon T, Lee Y, Jang S, Lee S. Non-visual to visual translation for cross-domain face recognition. IEEE Access. 2020;8:50452-50464. https://doi.
org/10.1109/ACCESS.2020.2980047.
87. Engin D, Ecabert C, Ekenel HK, Thiran J. Face frontalization for cross-pose facial expression recognition. Paper presented at: 2018 26th European Signal
Processing Conference (EUSIPCO); 2018; Rome:1795–1799. doi:10.23919/EUSIPCO.2018.8553087.
How to cite this article: Ning X, Nan F, Xu S, Yu L, Zhang L. Multi-view frontal face image generation: A survey. Concurrency Computat
Pract Exper. 2020;e6147. https://doi.org/10.1002/cpe.6147