Deep Face Recognition: A Survey
Neurocomputing
journal homepage: www.elsevier.com/locate/neucom

Article history: Received 10 May 2020; Revised 1 August 2020; Accepted 25 October 2020; Available online 10 November 2020. Communicated by Zidong Wang.

Keywords: Deep face recognition; Deep learning; Face processing; Face recognition database; Loss function; Deep network architecture

Abstract: Deep learning applies multiple processing layers to learn representations of data with multiple levels of feature extraction. This emerging technique has reshaped the research landscape of face recognition (FR) since 2014, launched by the breakthroughs of DeepFace and DeepID. Since then, deep learning, characterized by a hierarchical architecture that stitches together pixels into an invariant face representation, has dramatically improved the state-of-the-art performance and fostered successful real-world applications. In this survey, we provide a comprehensive review of the recent developments in deep FR, covering broad topics on algorithm designs, databases, protocols, and application scenes. First, we summarize the different network architectures and loss functions proposed in the rapid evolution of deep FR methods. Second, the related face processing methods are categorized into two classes: "one-to-many augmentation" and "many-to-one normalization". Then, we summarize and compare the commonly used databases for both model training and evaluation. Third, we review miscellaneous scenes in deep FR, such as cross-factor, heterogeneous, multiple-media and industrial scenes. Finally, the technical challenges and several promising directions are highlighted.

© 2020 Elsevier B.V. All rights reserved.
1. Introduction

Face recognition (FR) has been the prominent biometric technique for identity authentication and has been widely applied in many areas, such as the military, finance, public security and daily life. FR has been a long-standing research topic in the CVPR community. In the early 1990s, the study of FR became popular following the introduction of the historical Eigenface approach [1]. The milestones of feature-based FR over the past years are presented in Fig. 1, in which the times of four major technical streams are highlighted. The holistic approaches derive a low-dimensional representation through certain distribution assumptions, such as linear subspace [2–4], manifold [5–7], and sparse representation [8–11]. This idea dominated the FR community in the 1990s and 2000s. However, a well-known problem is that these theoretically plausible holistic methods fail to address uncontrolled facial changes that deviate from their prior assumptions. In the early 2000s, this problem gave rise to local-feature-based FR. Gabor [12] and LBP [13], as well as their multilevel and high-dimensional extensions [14–16], achieved robust performance through some invariant properties of local filtering. Unfortunately, handcrafted features suffered from a lack of distinctiveness and compactness. In the early 2010s, learning-based local descriptors were introduced to the FR community [17–19], in which local filters are learned for better distinctiveness and the encoding codebook is learned for better compactness. However, these shallow representations still have an inevitable limitation on robustness against complex nonlinear facial appearance variations.

In general, traditional methods attempted to recognize human faces through one- or two-layer representations, such as filtering responses, histograms of feature codes, or distributions of dictionary atoms. The research community worked intensively to improve the preprocessing, local descriptors, and feature transformation separately, but these approaches improved FR accuracy only slowly. Worse, most methods aimed to address only one aspect of unconstrained facial variation, such as lighting, pose, expression, or disguise, and no integrated technique addressed these unconstrained challenges as a whole. As a result, despite more than a decade of continuous effort, "shallow" methods only improved the accuracy on the LFW benchmark to about 95% [15], which indicates that they are insufficient to extract identity features that remain stable under real-world changes. Due to this technical insufficiency, facial recognition systems were often reported to perform unstably or to fail with countless false alarms in real-world applications.

But all that changed in 2012, when AlexNet won the ImageNet competition by a large margin using a technique called deep learning [22]. Deep learning methods, such as convolutional neural networks, use a cascade of multiple layers of processing units for
Fig. 1. Milestones of face representation for recognition. The holistic approaches dominated the face recognition community in the 1990s. In the early 2000s, handcrafted local descriptors became popular, and the local feature learning approaches were introduced in the late 2000s. In 2014, DeepFace [20] and DeepID [21] achieved a breakthrough in state-of-the-art (SOTA) performance, and research focus has shifted to deep-learning-based approaches. As the representation pipeline became deeper and deeper, the LFW (Labeled Faces in the Wild) performance steadily improved from around 60% to above 90%, while deep learning boosted the performance to 99.80% in just three years.
feature extraction and transformation. They learn multiple levels of representations that correspond to different levels of abstraction, and these levels form a hierarchy of concepts showing strong invariance to changes in face pose, lighting, and expression, as shown in Fig. 2. The first layer of the deep neural network is somewhat similar to the Gabor features found by human scientists through years of experience. The second layer learns more complex texture features. The features of the third layer are more complex, and some simple structures, such as a high-bridged nose and big eyes, begin to appear. In the fourth layer, the network output is sufficient to explain certain facial attributes and can respond specifically to clear abstract concepts such as a smile, a roar, and even a blue eye. In conclusion, in deep convolutional neural networks (CNNs), the lower layers automatically learn features similar to the Gabor and SIFT features engineered over years or even decades (such as the initial layers in Fig. 2), and the higher layers learn higher-level abstractions. Finally, the combination of these higher-level abstractions represents facial identity with unprecedented stability.

In 2014, DeepFace [20] achieved the SOTA accuracy on the famous LFW benchmark [23], approaching human performance under unconstrained conditions for the first time (DeepFace: 97.35% vs. human: 97.53%) by training a 9-layer model on 4 million facial images. Inspired by this work, research focus shifted to deep-learning-based approaches, and the accuracy was dramatically boosted to above 99.80% in just three years. Deep learning has reshaped the research landscape of FR in almost all aspects, including algorithm designs, training/test datasets, application scenarios and even the evaluation protocols. Therefore, it is of great significance to review the breakthroughs and the rapid development of recent years. There have been several surveys on FR [24–28] and its subdomains, but they mostly summarized and compared a diverse set of techniques related to a specific FR scene, such as illumination-invariant FR [29], 3D FR [28], and pose-invariant FR [30,31]. Unfortunately, due to their earlier publication dates, none of them covered the deep learning methodology that is most successful nowadays. This survey focuses only on the recognition problem; one can refer to Ranjan et al. [32] for a brief review of the full deep FR pipeline with detection and alignment, or to Jin et al. [33] for a survey of face alignment. Specifically, the major contributions of this survey are as follows:

• A systematic review of the evolution of network architectures and loss functions for deep FR is provided. The various loss functions are categorized into Euclidean-distance-based loss, angular/cosine-margin-based loss, and softmax loss and its variations. Both the mainstream network architectures, such as DeepFace [20], the DeepID series [34,35,21,36], VGGFace [37], FaceNet [38], and VGGFace2 [39], and other architectures designed for FR are covered.

• We categorize the new deep-learning-based face processing methods, such as those used to handle recognition difficulties under pose changes, into two classes, "one-to-many augmentation" and "many-to-one normalization", and discuss how the emerging generative adversarial network (GAN) [40] facilitates deep FR.

• We present a comparison and analysis of publicly available databases that are of vital importance for both model training and testing. Major FR benchmarks, such as LFW [23], IJB-A/B/C [41–43], MegaFace [44], and MS-Celeb-1M [45], are reviewed and compared in terms of training methodology, evaluation tasks and metrics, and recognition scenes, which provides a useful reference for training and testing deep FR.

• Besides the general-purpose tasks defined by the major databases, we summarize a dozen scenario-specific databases and solutions that are still challenging for deep learning, such as anti-attack, cross-pose FR, and cross-age FR. By reviewing specially designed methods for these unsolved problems, we attempt to reveal the important issues for future research on deep FR, such as adversarial samples, algorithm/data biases, and model interpretability.

The remainder of this survey is structured as follows. In Section 2, we introduce some background concepts and terminologies and briefly describe each component of FR. In Section 3, different network architectures and loss functions are presented. Then, we summarize the face processing algorithms and the datasets. In Section 5, we briefly introduce several methods of deep FR used for different scenes. Finally, the conclusion of this paper and a discussion of future work are presented in Section 6.
Fig. 2. The hierarchical architecture that stitches together pixels into an invariant face representation. A deep model consists of multiple layers of simulated neurons that convolve and pool the input, during which the receptive-field size of the simulated neurons is continually enlarged to integrate low-level primary elements into multifarious facial attributes, finally feeding the data forward to one or more fully connected layers at the top of the network. The output is a compressed feature vector that represents the face. Such deep representation is widely considered the SOTA technique for face recognition.
2. Overview

2.1. Components of face recognition

As mentioned in [32], an FR system needs three modules, as shown in Fig. 3. First, a face detector is used to localize faces in images or videos. Second, with a facial landmark detector, the faces are aligned to normalized canonical coordinates. Third, the FR module is implemented with these aligned face images. We focus only on the FR module throughout the remainder of this paper.

Before a face image is fed to the FR module, face anti-spoofing, which recognizes whether the face is live or spoofed, is applied to avoid different types of attacks. Then, recognition can be performed. As shown in Fig. 3(c), an FR module consists of face processing, deep feature extraction and face matching, and it can be described as follows:

$$\mathcal{M}\left[ F(P_i(I_i)),\ F(P_j(I_j)) \right] \tag{1}$$

where $I_i$ and $I_j$ are two face images. $P$ stands for face processing, which handles intra-personal variations before training and testing, such as poses, illuminations, expressions and occlusions. $F$ denotes feature extraction, which encodes the identity information; the feature extractor is learned by loss functions during training and is utilized to extract the features of faces during testing. $\mathcal{M}$ denotes a face matching algorithm used to compute similarity scores between features to determine the specific identities of faces. Different from object classification, the testing identities are usually disjoint from the training data in FR, so the learned classifier cannot be used to recognize the testing faces; the face matching algorithm is therefore an essential part of FR.

2.1.1. Face processing

Although deep-learning-based approaches have been widely used, Mehdipour et al. [46] showed that various conditions, such as poses, illuminations, expressions and occlusions, still affect the performance of deep FR. Accordingly, face processing is introduced to address this problem. The face processing methods are categorized as "one-to-many augmentation" and "many-to-one normalization", as shown in Table 1.

"One-to-many augmentation": these methods generate many patches or images with pose variability from a single image to enable deep networks to learn pose-invariant representations.
Fig. 3. Deep FR system with face detector and alignment. First, a face detector is used to localize faces. Second, the faces are aligned to normalized canonical coordinates. Third, the FR module is implemented. In the FR module, face anti-spoofing recognizes whether the face is live or spoofed; face processing is used to handle variations before training and testing, e.g. poses and ages; different architectures and loss functions are used to extract discriminative deep features during training; and face matching methods are used to classify the features after the deep features of the testing data are extracted.
Table 1. Different data preprocessing approaches.

Table 2. Different network architectures of FR.
• backbone network: mainstream architectures (AlexNet [80,81,38], VGGNet [37,47,82], GoogleNet [83,38], ResNet [84,82], SENet [39]); light-weight architectures [85,86,61,87]; adaptive architectures [88–90]; joint alignment-recognition architectures [91–94]
• assembled networks: multipose [95–98], multipatch [58–60,99,34,21,35], multitask [100]

"Many-to-one normalization": these methods recover the canonical view of face images from one or many images of a nonfrontal view; FR can then be performed as if it were under controlled conditions.

Note that in this paper we mainly focus on deep face processing methods designed for pose variations, since pose is widely regarded as a major challenge in automatic FR applications and the other variations can be handled by similar methods.

2.1.2. Deep feature extraction

Network architecture. The architectures can be categorized as backbone and assembled networks, as shown in Table 2. Inspired by the extraordinary success on the ImageNet [74] challenge, the typical CNN architectures, e.g. AlexNet, VGGNet, GoogleNet, ResNet and SENet [22,75–78], are introduced and widely used as the baseline models in FR (directly or slightly modified). In addition to the mainstream architectures, some assembled networks, e.g. multi-task networks and multi-input networks, are utilized in FR. Hu et al. [79] show that accumulating the results of assembled networks provides a performance increase compared with an individual network.

Loss function. The softmax loss is commonly used as the supervision signal in object recognition, and it encourages the separability of features. However, the softmax loss is not sufficiently effective for FR, because intra-class variations can be larger than inter-class differences and more discriminative features are required to distinguish different people. Many works therefore focus on creating novel loss functions that make features not only separable but also discriminative, as shown in Table 3.

2.1.3. Face matching by deep features

FR can be categorized into face verification and face identification. In either scenario, a set of known subjects is initially enrolled in the system (the gallery), and during testing, a new subject (the probe) is presented. After the deep networks are trained on massive data with the supervision of an appropriate loss function, each test image is passed through the networks to obtain a deep feature representation. Using the cosine distance or L2 distance, face verification computes the one-to-one similarity between the gallery and probe to determine whether the two images are of the same subject, whereas face identification computes the one-to-many similarity to determine the specific identity of a probe face. In addition to these, other methods are introduced to postprocess the deep features such that face matching is performed efficiently and accurately.
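To make this concrete, the following is a minimal Python/NumPy sketch of how a matcher compares deep features for verification and closed-set identification; the features are assumed to be already extracted as in Eq. (1), and the threshold is an arbitrary placeholder rather than a value from any cited system.

```python
import numpy as np

def cosine_similarity(x1: np.ndarray, x2: np.ndarray) -> float:
    """Cosine similarity between two deep feature vectors."""
    return float(np.dot(x1, x2) / (np.linalg.norm(x1) * np.linalg.norm(x2)))

def verify(probe_feat: np.ndarray, gallery_feat: np.ndarray,
           threshold: float = 0.5) -> bool:
    """Face verification: one-to-one comparison against a threshold."""
    return cosine_similarity(probe_feat, gallery_feat) >= threshold

def identify(probe_feat: np.ndarray, gallery: dict) -> str:
    """Closed-set identification: one-to-many nearest-neighbor search
    over {identity: feature} pairs enrolled in the gallery."""
    return max(gallery, key=lambda name: cosine_similarity(probe_feat, gallery[name]))
```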
Fig. 4. FR studies began with the general scenario and then gradually moved closer to realistic applications, driving different solutions for specific scenarios, such as cross-pose FR, cross-age FR, and video FR. In specific scenarios, targeted training and testing databases are constructed, and the face processing, architectures and loss functions are modified according to the special requirements.
Table 4. The accuracy of different methods evaluated on the LFW dataset.
Fig. 5. The development of loss functions. The introduction of DeepFace [20] and DeepID [34] in 2014 marked the beginning of deep FR. After that, Euclidean-distance-based losses, such as contrastive loss, triplet loss and center loss, played an important role in loss function design. In 2016 and 2017, L-Softmax [104] and A-Softmax [84] further promoted the development of large-margin feature learning. In 2017, feature and weight normalization also began to show excellent performance, which led to the study of variations of softmax. Red, green, blue and yellow rectangles represent deep methods using softmax, Euclidean-distance-based loss, angular/cosine-margin-based loss and variations of softmax, respectively.
DeepID2+ [35] increased the dimension of the hidden representations and added supervision to early convolutional layers, and DeepID3 [36] further introduced VGGNet and GoogleNet into this line of work. However, the main problem with the contrastive loss is that its margin parameters are often difficult to choose.

In contrast to the contrastive loss, which considers the absolute distances of the matching pairs and non-matching pairs, the triplet loss considers the relative difference between these distances. Along with FaceNet [38], proposed by Google, the triplet loss [38,37,81,80,58,60] was introduced into FR. It requires face triplets: it minimizes the distance between an anchor and a positive sample of the same identity and maximizes the distance between the anchor and a negative sample of a different identity. FaceNet enforces

$$\left\| f(x_i^a) - f(x_i^p) \right\|_2^2 + \alpha < \left\| f(x_i^a) - f(x_i^n) \right\|_2^2 \tag{2}$$

using hard triplet face samples, where $x_i^a$, $x_i^p$ and $x_i^n$ are the anchor, positive and negative samples, respectively, $\alpha$ is a margin, and $f(\cdot)$ represents a nonlinear transformation embedding an image into a feature space. Inspired by FaceNet [38], TPE [81] and TSE [80] learned a linear projection $W$ to construct the triplet loss. Other methods optimize deep models using both the triplet loss and the softmax loss [59,58,60,121]: they first train networks with softmax and then fine-tune them with the triplet loss.
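For illustration, a minimal PyTorch sketch of this triplet loss follows; it assumes the embeddings are already L2-normalized, that hard triplets have been mined elsewhere, and that the margin value is a typical placeholder rather than a prescription.

```python
import torch
import torch.nn.functional as F

def triplet_loss(anchor: torch.Tensor, positive: torch.Tensor,
                 negative: torch.Tensor, margin: float = 0.2) -> torch.Tensor:
    """FaceNet-style triplet loss: pull the positive within `margin`
    of the anchor relative to the negative, in squared L2 distance."""
    d_pos = (anchor - positive).pow(2).sum(dim=1)  # ||f(x_a) - f(x_p)||^2
    d_neg = (anchor - negative).pow(2).sum(dim=1)  # ||f(x_a) - f(x_n)||^2
    return F.relu(d_pos - d_neg + margin).mean()   # hinge on the violation
```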
However, the contrastive loss and triplet loss occasionally encounter training instability due to the selection of effective training samples, so some papers began to explore simpler alternatives. Center loss [101] and its variants [82,116,102] are good choices for reducing intra-class variance. The center loss [101] learns a center for each class and penalizes the distances between the deep features and their corresponding class centers:

$$L_C = \frac{1}{2} \sum_{i=1}^{m} \left\| x_i - c_{y_i} \right\|_2^2 \tag{3}$$

where $x_i$ denotes the i-th deep feature, belonging to the $y_i$-th class, and $c_{y_i}$ denotes the $y_i$-th class center of the deep features. To handle long-tailed data, a range loss [82], a variant of the center loss, is used to minimize the harmonic mean of the k greatest ranges within one class and maximize the shortest inter-class distance within one batch. Wu et al. [102] proposed a center-invariant loss that penalizes the differences between the class centers. Deng et al. [116] selected the farthest intra-class samples and the nearest inter-class samples to compute a margin loss. However, the center loss and its variants suffer from massive GPU memory consumption on the classification layer and prefer balanced and sufficient training data for each identity.
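A minimal PyTorch sketch of the center loss of Eq. (3) is given below. For simplicity, the class centers are ordinary learnable parameters left to the optimizer, whereas the original formulation updates them with a dedicated rule; the loss is averaged over the batch.

```python
import torch
import torch.nn as nn

class CenterLoss(nn.Module):
    """Penalize the squared distance between each deep feature and the
    center of its class, as in Eq. (3)."""
    def __init__(self, num_classes: int, feat_dim: int):
        super().__init__()
        # One learnable center per class.
        self.centers = nn.Parameter(torch.randn(num_classes, feat_dim))

    def forward(self, features: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
        centers = self.centers[labels]  # c_{y_i} for each sample in the batch
        return 0.5 * (features - centers).pow(2).sum(dim=1).mean()
```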
3.1.2. Angular/cosine-margin-based loss

In 2017, researchers gained a deeper understanding of loss functions in deep FR and argued that samples should be separated more strictly to avoid misclassifying difficult samples. Angular/cosine-margin-based losses [104,84,105,106,108] were proposed to make the learned features separable with a larger angular/cosine distance. The decision boundary of the softmax loss is $(W_1 - W_2)x + b_1 - b_2 = 0$, where $x$ is the feature vector and $W_i$ and $b_i$ are the weights and bias of the softmax loss, respectively. Liu et al. [104] reformulated the original softmax loss into a large-margin softmax (L-Softmax) loss. They constrain $b_1 = b_2 = 0$, so the decision boundaries for class 1 and class 2 become $\|x\|\left(\|W_1\|\cos(m\theta_1) - \|W_2\|\cos(\theta_2)\right) = 0$ and $\|x\|\left(\|W_1\|\cos(\theta_1) - \|W_2\|\cos(m\theta_2)\right) = 0$, respectively, where $m$ is a positive integer introducing an angular margin and $\theta_i$ is the angle between $W_i$ and $x$. Due to the non-monotonicity of the cosine function, a piece-wise function is applied in L-Softmax to guarantee monotonicity. The loss function is defined as follows:

$$L_i = -\log\left( \frac{ e^{\|W_{y_i}\|\|x_i\|\varphi(\theta_{y_i})} }{ e^{\|W_{y_i}\|\|x_i\|\varphi(\theta_{y_i})} + \sum_{j \neq y_i} e^{\|W_j\|\|x_i\|\cos(\theta_j)} } \right) \tag{4}$$

where

$$\varphi(\theta) = (-1)^k \cos(m\theta) - 2k, \quad \theta \in \left[ \frac{k\pi}{m},\ \frac{(k+1)\pi}{m} \right] \tag{5}$$

Considering that L-Softmax is difficult to converge, it is usually combined with the softmax loss to facilitate and ensure convergence; the target logit then becomes $f_{y_i} = \frac{\lambda\|W_{y_i}\|\|x_i\|\cos(\theta_{y_i}) + \|W_{y_i}\|\|x_i\|\varphi(\theta_{y_i})}{1+\lambda}$, where $\lambda$ is a dynamic hyper-parameter. Based on L-Softmax, the A-Softmax loss [84] further normalizes the weight $W$ by its L2 norm ($\|W\| = 1$) such that the normalized vector lies on a hypersphere; the discriminative face features can then be learned on a hypersphere manifold with an angular margin (Fig. 6). Liu et al. [108] introduced a deep hyperspherical convolution network (SphereNet) that adopts hyperspherical convolution as its basic convolution operator and is supervised by an angular-margin-based loss. To overcome the optimization difficulty of L-Softmax and A-Softmax, which incorporate the angular margin in a multiplicative manner, ArcFace [106], CosFace [105] and the AMS loss [107] introduced an additive angular/cosine margin, $\cos(\theta + m)$ and $\cos\theta - m$, respectively. They are extremely easy to implement without tricky hyper-parameters such as $\lambda$, and they are clearer and able to converge without softmax supervision. The decision boundaries under the binary classification case are given in Table 5. Building on the large-margin idea, FairLoss [122] and AdaptiveFace [123] further proposed to adjust the margins of different classes adaptively to address the problem of unbalanced data.
Compared to Euclidean-distance-based losses, angular/cosine-margin-based losses explicitly add discriminative constraints on a hypersphere manifold, which intrinsically matches the prior that human faces lie on a manifold. However, Wang et al. [124] showed that an angular/cosine-margin-based loss, while it achieves better results on a clean dataset, is vulnerable to noise and becomes worse than the center loss and softmax in the high-noise region, as shown in Fig. 7.
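For illustration, a hedged PyTorch sketch of the additive margins discussed in this subsection follows; the features and class weights are L2-normalized so that the logits are cosines, and the scale and margin values are commonly used choices rather than prescriptions.

```python
import torch
import torch.nn.functional as F

def additive_margin_loss(features, weights, labels, s=64.0, m=0.5, angular=True):
    """Additive-margin softmax sketch (ArcFace-style cos(theta + m) or
    CosFace/AMS-style cos(theta) - m): the margin is applied only to each
    sample's target class, then all logits are rescaled by s."""
    cos = F.normalize(features) @ F.normalize(weights).t()  # cos(theta)
    one_hot = F.one_hot(labels, num_classes=weights.size(0)).bool()
    if angular:  # ArcFace-style additive angular margin
        theta = torch.acos(cos.clamp(-1 + 1e-7, 1 - 1e-7))
        cos_m = torch.cos(theta + m)
    else:        # CosFace/AMS-style additive cosine margin
        cos_m = cos - m
    logits = s * torch.where(one_hot, cos_m, cos)
    return F.cross_entropy(logits, labels)
```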
3.1.3. Softmax loss and its variations

In 2017, in addition to reformulating the softmax loss into an angular/cosine-margin-based loss as mentioned above, some works tried to normalize the features and weights in the loss function to improve model performance, which can be written as follows:

$$\hat{W} = \frac{W}{\|W\|}, \qquad \hat{x} = \alpha \frac{x}{\|x\|} \tag{6}$$

where $\alpha$ is a scaling parameter, $x$ is the learned feature vector, and $W$ is the weight of the last fully connected layer. Scaling $x$ to a fixed radius $\alpha$ is important, as Wang et al. [110] proved that normalizing both the features and the weights to 1 makes the softmax loss become trapped at a very high value on the training set. After normalization, the loss function, e.g. softmax, can be computed on the normalized features and weights.

Some papers [84,108] first normalized only the weights and then added an angular/cosine margin into the loss function to make the learned features discriminative. In contrast, some works, such as [109,111], adopted feature normalization only, to overcome the softmax bias toward the sample distribution. Based on the observation of [125] that the L2-norm of features learned with the softmax loss is informative of the quality of the face, L2-softmax [109] enforced all features to have the same L2-norm through feature normalization, such that similar attention is given to good-quality frontal faces and blurry faces with extreme poses. Rather than scaling $x$ to the parameter $\alpha$, Hasnat et al. [111] normalized the features as $\hat{x} = \frac{x - \mu}{\sqrt{\sigma^2}}$, where $\mu$ and $\sigma^2$ are the mean and variance. Ring loss [117] encouraged the norm of samples to approach a learned parameter value $R$ rather than explicitly enforcing it through a hard normalization operation. Moreover, normalizing both the features and the weights [110,112,115,105,106] has become a common strategy. Wang et al. [110] explained the necessity of this normalization operation from both analytic and geometric perspectives. After normalizing the features and weights, the CoCo loss [112] optimized the cosine distance among data features, and Hasnat et al. [115] used the von Mises-Fisher (vMF) mixture model as the theoretical basis to develop a novel vMF mixture loss and its corresponding vMF deep features.
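A minimal sketch of the normalization in Eq. (6), with a hypothetical scale value, might look as follows:

```python
import torch
import torch.nn.functional as F

def normalized_softmax_logits(x: torch.Tensor, W: torch.Tensor,
                              alpha: float = 30.0) -> torch.Tensor:
    """Eq. (6): L2-normalize the last-layer weights (one row per class) and
    rescale the L2-normalized feature to a fixed radius alpha; the returned
    logits are then fed to the usual softmax cross-entropy loss."""
    W_hat = F.normalize(W, dim=1)           # W / ||W||
    x_hat = alpha * F.normalize(x, dim=1)   # alpha * x / ||x||
    return x_hat @ W_hat.t()
```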
3.2. Evolution of network architecture

3.2.1. Backbone network

Mainstream architectures. The commonly used network architectures of deep FR have always followed those of deep object classification, evolving rapidly from AlexNet to SENet. We present the most influential architectures of deep object classification and deep face recognition in chronological order (dated by publication) in Fig. 8.

In 2012, AlexNet [22] was reported to achieve the SOTA recognition accuracy in the ImageNet large-scale visual recognition competition (ILSVRC) 2012, exceeding the previous best results by a large margin. AlexNet consists of five convolutional layers and three fully connected layers, and it also integrates various techniques, such as the rectified linear unit (ReLU), dropout, and data augmentation. ReLU was widely regarded as the most essential component for making deep learning possible. Then, in 2014, VGGNet [75] proposed a standard network architecture that used very small 3×3 convolutional filters throughout and doubled the number of feature maps after each 2×2 pooling, increasing the depth of the network to 16 to 19 weight layers.
Fig. 7. Rank-1 identification results on the MegaFace benchmark with 1M distractors: (a) introducing label flips to the training data; (b) introducing outliers to the training data [124].
Fig. 8. The top row presents the typical network architectures in object classification, and the bottom row describes the well-known FR algorithms that use these architectures. Rectangles of the same color represent algorithms using the same architecture. It is easy to see that the architectures of deep FR have always followed those of deep object classification and evolved from AlexNet to SENet rapidly.
Fig. 9. The architectures of AlexNet [22], VGGNet [75], GoogleNet [76], ResNet [77] and SENet [78].
by a novel online triplet mining method and achieved a good performance of 99.63%. In the same year, VGGface [37] designed a procedure to collect a large-scale dataset from the Internet. It trained VGGNet on this dataset and then fine-tuned the networks via a triplet loss function similar to FaceNet; VGGface obtains an accuracy of 98.95%. In 2017, SphereFace [84] used a 64-layer ResNet architecture and proposed the angular softmax (A-Softmax) loss to learn discriminative face features with an angular margin, boosting the accuracy to 99.42% on LFW. At the end of 2017, a new large-scale face dataset, namely VGGface2 [39], was introduced, which contains large variations in pose, age, illumination, ethnicity and profession. Cao et al. first trained a SENet with the MS-Celeb-1M dataset [45] and then fine-tuned the model with VGGface2 [39], achieving SOTA performance on IJB-A [41] and IJB-B [42].

Light-weight networks. Using deeper neural networks with hundreds of layers and millions of parameters to achieve higher accuracy comes at a cost: powerful GPUs with large memory are needed, which makes applications on many mobile and embedded devices impractical. To address this problem, light-weight networks are proposed. Light CNN [85,86] proposed a max-feature-map (MFM) activation function that introduces the maxout concept of the fully connected layer into the CNN; the MFM obtains a compact representation and reduces the computational cost. Sun et al. [61] proposed to iteratively sparsify deep networks from previously learned denser models based on a weight selection criterion. MobiFace [87] adopted fast downsampling and bottleneck residual blocks with expansion layers and achieved a high performance of 99.7% on the LFW database. Although some other light-weight CNNs, such as SqueezeNet, MobileNet, ShuffleNet and Xception [126–129], are still not widely used in FR, they deserve more attention.

Adaptive-architecture networks. Considering that designing architectures manually by human experts is a time-consuming and error-prone process, there is growing interest in
adaptive-architecture networks, which can find well-performing architectures, e.g. the type of operation each layer executes (pooling, convolution, etc.) and the hyper-parameters associated with the operation (the number of filters, kernel size and strides for a convolutional layer, etc.), according to the specific requirements of the training and testing data. Currently, neural architecture search (NAS) [130] is one of the promising methodologies; it has outperformed manually designed architectures on tasks such as image classification [131] and semantic segmentation [132]. Zhu et al. [88] integrated NAS technology into face recognition, using a reinforcement learning [133] algorithm (policy gradient) to guide the controller network to train the optimal child architecture. Besides NAS, there are other explorations into learning optimal architectures adaptively. For example, the conditional convolutional neural network (c-CNN) [89] dynamically activates sets of kernels according to the modalities of the samples, and Han et al. [90] proposed a novel contrastive convolution, consisting of a trunk CNN and a kernel generator, which benefits from its dynamic generation of contrastive kernels based on the pair of faces being compared.

Joint alignment-recognition networks. Recently, end-to-end systems [91–94] were proposed to jointly train FR together with several other modules (face detection, alignment, and so forth). Compared to the existing methods, in which each module is generally optimized separately according to a different objective, such an end-to-end system optimizes each module according to the recognition objective, leading to more adequate and robust inputs for the recognition model. For example, inspired by the spatial transformer [134], Hayat et al. [91] proposed a CNN-based data-driven approach that learns to simultaneously register and represent faces (Fig. 10), while Wu et al. [92] designed a novel recursive spatial transformer (ReST) module for CNNs that allows face alignment and recognition to be jointly optimized.

3.2.2. Assembled networks

Multi-input networks. In "one-to-many augmentation", multiple images with variety are generated from one image in order to augment the training data. Taking these multiple images as input, multiple networks are assembled together to extract and combine the features of the different types of inputs, which can outperform an individual network. In [58–60,99,34,21,35], assembled networks are built after different face patches are cropped, and the different types of patches are then fed into different sub-networks for representation extraction; by combining the results of the sub-networks, the performance can be improved. Other papers [96,95,98] used assembled networks to recognize images with different poses. For example, Masi et al. [96] adjusted the pose to frontal (0°), half-profile (40°) and full-profile (75°) views and then addressed pose variation with assembled pose networks. A multi-view deep network (MvDN) [95] consists of view-specific subnetworks and common subnetworks; the former removes view-specific variations, and the latter obtains common representations.

Multi-task networks. FR is intertwined with various factors, such as pose, illumination, and age. To handle this, multi-task learning is introduced to transfer knowledge from other relevant tasks and to disentangle nuisance factors. In multi-task networks, identity classification is the main task, and the side tasks are pose, illumination, and expression estimations, among others. The lower layers are shared among all the tasks, and the higher layers are disentangled into different sub-networks to generate the task-specific outputs. In [100], the task-specific sub-networks are branched out to learn face detection, face alignment, pose estimation, gender recognition, smile detection, age estimation and FR. Yin et al. [97] proposed to automatically assign dynamic loss weights to each side task. Peng et al. [135] used feature reconstruction metric learning to disentangle a CNN into sub-networks for jointly learning the identity and non-identity features, as shown in Fig. 11.

3.3. Face matching by deep features

During testing, the cosine distance and L2 distance are generally employed to measure the similarity between two deep features $x_1$ and $x_2$; then, threshold comparison and the nearest neighbor (NN) classifier are used to make decisions for verification and identification. In addition to these common methods, there are some other explorations.

3.3.1. Face verification

Metric learning, which aims to find a new metric that makes two classes more separable, can also be used for face matching based on extracted deep features. The joint Bayesian (JB) model [136] is a well-known metric learning method [35,21,36,34,120], and Hu et al. [79] proved that it can improve the performance greatly. In the JB model, a face feature $x$ is modeled as $x = \mu + \varepsilon$, where $\mu$ and $\varepsilon$ are the identity and intra-personal variations, respectively. The similarity score $r(x_1, x_2)$ can be represented as follows:

$$r(x_1, x_2) = \log \frac{P(x_1, x_2 \mid H_I)}{P(x_1, x_2 \mid H_E)} \tag{7}$$

where $P(x_1, x_2 \mid H_I)$ is the probability that the two faces belong to the same identity and $P(x_1, x_2 \mid H_E)$ is the probability that they belong to different identities.
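As an illustration, the widely reported closed form of the JB score is $r(x_1, x_2) = x_1^{\top}Ax_1 + x_2^{\top}Ax_2 - 2x_1^{\top}Gx_2$; the sketch below assumes the matrices A and G have already been estimated offline (e.g. by EM) from the covariances of the identity and intra-personal variations, which is not shown here.

```python
import numpy as np

def joint_bayesian_score(x1: np.ndarray, x2: np.ndarray,
                         A: np.ndarray, G: np.ndarray) -> float:
    """Log-likelihood ratio r(x1, x2) of Eq. (7), in its closed form.
    A and G are precomputed from the covariances of the identity (mu)
    and intra-personal (epsilon) variations."""
    return float(x1 @ A @ x1 + x2 @ A @ x2 - 2.0 * (x1 @ G @ x2))
```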
3.3.2. Face identification

After the cosine distances are computed, Cheng et al. [137] proposed a heuristic voting strategy at the similarity-score level to combine the results of multiple CNN models, which won first place in Challenge 2 of MS-Celeb-1M 2017. Yang et al. [138] extracted local adaptive convolution features from local regions of the face image and used the extended SRC for FR with a single sample per person. Guo et al. [139] combined deep features and an SVM classifier to perform recognition. Wang et al. [62] first used product quantization (PQ) [140] to directly retrieve the top-k most similar faces and then re-ranked these faces by combining similarities from the deep features and the COTS matcher [141]. In addition, softmax can also be used in face matching when the identities of the training set and test set overlap. For example, in Challenge 2 of MS-Celeb-1M, Ding et al. [142] trained a 21,000-class softmax classifier to directly recognize faces of one-shot classes and normal classes after augmenting features with a conditional GAN, and Guo et al. [143] trained a softmax classifier combined with an underrepresented-classes promotion (UP) loss term to enhance the performance on one-shot classes.

When the distributions of the training data and testing data are the same, the face matching methods mentioned above are effective. However, there is often a distribution change or domain shift between the two data domains that can degrade performance on the test data. Transfer learning [144,145] has recently been introduced into deep FR to address this problem of domain shift. It learns transferable features using a labeled source domain (training data) and an unlabeled target domain (testing data), such that the domain discrepancy is reduced and models trained on the source domain also perform well on the target domain. Sometimes this technology is applied to face matching; for example, Crosswhite et al. [121] and Xiong et al. [146] adopted template adaptation to the set of media in a template by combining CNN features with template-specific linear SVMs. But most of the time, it is not enough to apply transfer learning only at the face matching stage; transfer learning should be embedded in the deep models to learn more transferable representations. Kan et al. [147] proposed a bi-shifting autoencoder network (BAE) for domain adaptation across view angle, ethnicity, and imaging sensor, while Luo et al. [148] utilized the multi-kernel maximum mean discrepancy (MMD) to reduce domain discrepancies. Sohn et al. [149] used adversarial learning [150] to transfer knowledge from still-image FR to video FR. Moreover, fine-tuning the CNN parameters from a pre-learned model using a target training dataset is a particular type of transfer learning and is commonly employed by numerous methods [151,152,103].
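As a sketch of this last, simplest form of transfer, one might freeze a pre-learned trunk and retrain only a new classification head on the target data; the backbone choice and the number of target identities below are hypothetical.

```python
import torch.nn as nn
from torchvision.models import resnet50

# Start from weights pre-trained on a large source dataset.
model = resnet50(pretrained=True)      # stand-in for any pre-learned model
for param in model.parameters():
    param.requires_grad = False        # freeze the transferred layers
# Replace the head with a classifier over (hypothetically) 1000 target IDs;
# only this layer is trained on the target dataset.
model.fc = nn.Linear(model.fc.in_features, 1000)
```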
4. Face processing for training and recognition

We present the development of face processing methods in chronological order in Fig. 12. As the figure shows, most papers attempted to perform face processing with autoencoder models in 2014 and 2015, while 3D models played an important role in 2016. GAN [40] has drawn substantial attention from the deep learning and computer vision community since it was first proposed by Goodfellow et al.; it can be used in many fields and was introduced into face processing in 2017. GAN can perform both "one-to-many augmentation" and "many-to-one normalization", and it broke the limitation that face synthesis had to be done in a supervised way. Although GAN has not yet been widely used in face processing for training and recognition, it has great latent capacity for preprocessing; for example, the Dual-Agent GAN (DA-GAN) [56] won first place on the verification and identification tracks of the NIST IJB-A 2017 FR competition.

4.1. One-to-many augmentation

Collecting a large database is extremely expensive and time consuming. The methods of "one-to-many augmentation" can mitigate the challenges of data collection, and they can be used to augment not only the training data but also the gallery of the test data. We categorize them into four classes: data augmentation, 3D model, autoencoder model and GAN model.

Data augmentation. Common data augmentation methods consist of photometric transformations [75,22] and geometric transformations, such as oversampling (multiple patches obtained by cropping at different scales) [22], mirroring [153], and rotating [154] the images. Recently, data augmentation has been widely used in deep FR algorithms [58–60,35,21,36,61,62]. For example, Sun et al. [21] cropped 400 face patches varying in position, scale, and color channel and mirrored the images. Liu et al. [58] generated seven overlapping image patches centered at different landmarks on the face region and trained them with seven CNNs of the same structure.
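A hypothetical torchvision composition combining the photometric and geometric transformations mentioned above might look as follows; the crop size and jitter strengths are placeholders, not the settings of the cited works.

```python
import torchvision.transforms as T

augment = T.Compose([
    T.RandomResizedCrop(112, scale=(0.8, 1.0)),   # oversampling by random crops
    T.RandomHorizontalFlip(p=0.5),                # mirroring
    T.ColorJitter(brightness=0.2, contrast=0.2),  # photometric transformation
    T.RandomRotation(degrees=10),                 # rotating
    T.ToTensor(),
])
```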
3D model. 3D face reconstruction is another way to enrich the diversity of training data. These methods utilize 3D structure information to model the transformation between poses: they first use 3D face data to obtain morphable displacement fields and then apply them to obtain 2D face data at different pose angles. There is a large number of papers in this domain, but we focus only on 3D face reconstruction using deep methods or used for deep FR. In [47], Masi et al. generated face images with new intra-class facial appearance variations, including pose, shape and expression, and then trained a 19-layer VGGNet with both the real and augmented data. Masi et al. [48] used generic 3D faces and rendered fixed views to reduce much of the computational effort. Richardson et al. [49] employed an iterative 3D CNN, using a secondary input channel to represent the previous network's output as an image for reconstructing a 3D face, as shown in Fig. 13. Dou et al. [51] used a multi-task CNN to divide 3D face reconstruction into neutral 3D reconstruction and expressive 3D reconstruction. Tran et al. [53] directly regressed 3D morphable face model (3DMM)
Fig. 12. The development of deep face processing methods. Red, green, orange and blue rectangles represent CNN models, autoencoder models, 3D models and GAN models, respectively.
[155] parameters from an input photo with a very deep CNN architecture. An et al. [156] synthesized face images with various poses and expressions using the 3DMM method and then reduced the gap between the synthesized data and real data with the help of MMD.
Autoencoder model. Rather than reconstructing a 3D model from a 2D image and projecting it back into 2D images of different poses, autoencoder models can generate 2D target images directly. Taking a face image and a pose code encoding a target pose as input, an encoder first learns a pose-invariant face representation, and a decoder then generates a face image with the same identity viewed at the target pose by using the pose-invariant representation and the pose code. For example, given the target pose codes, the multi-view perceptron (MVP) [55] trained some deterministic hidden neurons to learn pose-invariant face representations and simultaneously trained some random hidden neurons to capture pose features; a decoder then generated the target images by combining the pose-invariant representations with the pose features. As shown in Fig. 14, Yim et al. [157] and Qian et al. [158] introduced an auxiliary CNN to generate better images viewed at the target poses: first, an autoencoder generates the desired pose image, then the auxiliary CNN reconstructs the original input image back from the generated target image, which guarantees that the generated image is identity-preserving. In [65], two groups of units are embedded between the encoder and decoder; the identity units remain unchanged, and the rotation of images is achieved by acting on the pose units at each time step.
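A minimal sketch of this encoder-decoder pattern is given below; the target pose code is simply concatenated with the identity feature before decoding, and all dimensions are illustrative rather than taken from any cited model.

```python
import torch
import torch.nn as nn

class PoseConditionedDecoder(nn.Module):
    """Decode a pose-invariant identity feature plus a target pose code
    into an image of the same identity rendered at the new pose."""
    def __init__(self, feat_dim: int = 256, pose_dim: int = 8,
                 out_dim: int = 64 * 64 * 3):
        super().__init__()
        self.decode = nn.Sequential(
            nn.Linear(feat_dim + pose_dim, 1024), nn.ReLU(),
            nn.Linear(1024, out_dim), nn.Tanh(),
        )

    def forward(self, identity_feat: torch.Tensor,
                pose_code: torch.Tensor) -> torch.Tensor:
        z = torch.cat([identity_feat, pose_code], dim=1)
        return self.decode(z).view(-1, 3, 64, 64)
```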
GAN model. In GAN models, a generator aims to fool a discriminator by generating images that resemble the real images, while the discriminator aims to discriminate the generated samples from the real ones. Through this minimax game between generator and discriminator, a GAN can successfully generate photo-realistic images with different poses. After using a 3D model to generate profile face images, DA-GAN [56] refined the images with a GAN that combines prior knowledge of the data distribution and knowledge of faces (pose and identity perception losses). CVAE-GAN [159] combined a variational autoencoder with a GAN for augmenting data and took advantage of both statistic and pairwise feature matching to make the training process converge faster and more stably. In addition to synthesizing diverse faces from noise, some papers also explored disentangling identity and variation and synthesizing new faces by exchanging the identity and variation of different people. In CG-GAN [160], a generator directly resolves each representation of an input image into a variation code and an identity code and regroups these codes for cross-generation, while a discriminator ensures the realism of the generated images. Bao et al. [161] extracted the identity representation of one input image and the attribute representation of any other input face image, and then synthesized new faces by recombining these representations. This work shows superior performance in generating realistic and identity-preserving face images, even for identities outside the training dataset. Unlike previous methods that treat the classifier as a spectator, FaceID-GAN [162] proposed a three-player GAN in which the classifier cooperates with the discriminator to compete with the generator from two different aspects, i.e. facial identity and image quality, respectively.

4.2. Many-to-one normalization

In contrast to "one-to-many augmentation", the methods of "many-to-one normalization" produce frontal faces and reduce the appearance variability of the test data so that faces can be aligned and compared easily. They can be categorized as autoencoder models, CNN models and GAN models.

Autoencoder model. Autoencoders can also be applied to "many-to-one normalization". Different from the autoencoder model in "one-to-many augmentation", which generates the
Fig. 14. Autoencoder model of "one-to-many augmentation" proposed by [157]. The first part extracts features from an input image; the second and third parts generate a target image with the same identity viewed at the target pose. The fourth part is an auxiliary task that reconstructs the original input image back from the generated image to guarantee that the generated image is identity-preserving.
desired pose images with the help of pose codes, the autoencoder model here learns a pose-invariant face representation with an encoder and directly normalizes faces with a decoder, without pose codes. Zhu et al. [66,67] selected canonical-view images according to the face images' symmetry and sharpness and then adopted an autoencoder to recover the frontal-view images by minimizing the reconstruction loss. The proposed stacked progressive autoencoders (SPAE) [63] progressively map the nonfrontal face to the frontal face through a stack of several autoencoders. Each shallow autoencoder of SPAE is designed to convert input face images at large poses to a virtual view at a smaller pose, so the pose variations are narrowed down gradually, layer by layer, along the pose manifold. Zhang et al. [64] built a sparse many-to-one encoder to enhance the discriminativeness of the pose-free feature by using multiple random faces as the target values for multiple encoders.
CNN model. CNN models usually directly learn the 2D mappings between non-frontal face images and frontal images and utilize these mappings to normalize images in pixel space. The pixels in the normalized images are either directly the pixels of, or combinations of the pixels of, the non-frontal images. In LDF-Net [68], a displacement field network learns the shifting relationship between two pixels, and a translation layer transforms the input non-frontal face image into a frontal one with this displacement field. In GridFace [69], shown in Fig. 15, the rectification network first normalizes the images by warping pixels from the original image to the canonical one according to a computed homography matrix; the normalized output is then regularized by an implicit canonical-view face prior; finally, with the normalized faces as input, the recognition network learns discriminative face representations via metric learning.
GAN model. Huang et al. [70] proposed a two-pathway generative adversarial network (TP-GAN) that contains four landmark-located patch networks and a global encoder-decoder network. By combining an adversarial loss, a symmetry loss and an identity-preserving loss, TP-GAN generates a frontal view and simultaneously preserves global structures and local details, as shown in Fig. 16. In the disentangled representation learning generative adversarial network (DR-GAN) [71], the generator serves as a face rotator, in which an encoder produces an identity representation and a decoder synthesizes a face at the specified pose using this representation and a pose code; the discriminator is trained not only to distinguish real vs. synthetic images but also to predict the identity and pose of a face. Yin et al. [73] incorporated a 3DMM into the GAN structure to provide shape and appearance priors that guide the generator toward frontalization.

5. Face databases and evaluation protocols

In the past three decades, many face databases have been constructed, with a clear tendency from small-scale to large-scale, from single-source to diverse-source, and from lab-controlled to real-world unconstrained conditions, as shown in Fig. 17. As the performance on some simple databases becomes saturated, e.g. LFW [23], more and more complex databases have been continually developed to facilitate FR research. It can be said without exaggeration that the development of face databases has largely led the direction of FR research. In this section, we review the major training and testing academic databases for deep FR.

5.1. Large-scale training data sets

The prerequisite of effective deep FR is a sufficiently large training dataset. Zhou et al. [59] suggested that large amounts of data with deep learning improve the performance of FR. The results of the MegaFace Challenge also revealed that premier deep FR methods
Fig. 15. (a) System overview and (b) local homography transformation of GridFace [69]. The rectification network normalizes the images by warping pixels from the original
image to the canonical one according to the computed homography matrix.
Fig. 16. General framework of TP-GAN [70]. The generator contains two pathways, each processing global or local transformations. The discriminator distinguishes between synthesized frontal views and ground-truth frontal views.
Fig. 17. The evolution of FR datasets. Before 2007, early work in FR focused on controlled, small-scale datasets. In 2007, the LFW [23] dataset was introduced, marking the beginning of FR under unconstrained conditions. Since then, more testing databases designed for different tasks and scenes have been constructed. In 2014, CASIA-WebFace [120] provided the first widely used public training dataset, and large-scale training datasets became a hot topic. Red rectangles represent training datasets, and rectangles of other colors represent different testing datasets.
were typically trained on more than 0.5M images of more than 20K people. The early works on deep FR were usually trained on private datasets: Facebook's DeepFace [20] model was trained on 4M images of 4K people; Google's FaceNet [38] was trained on 200M images of 3M people; and the DeepID series of models [34,35,21,36] were trained on 0.2M images of 10K people. Although they reported ground-breaking performance at this stage, researchers could not accurately reproduce or compare these models without public training datasets.

To address this issue, CASIA-WebFace [120] provided the first widely used public training dataset for deep model training, which consists of 0.5M images of 10K celebrities collected from the web. Given its moderate size and easy usage, it has become a great resource for fair comparisons among academic deep models. However, its relatively small data and identity size may not be sufficient to reflect the power of many advanced deep learning methods. Currently, more databases provide publicly available large-scale training data (Table 6), especially three databases with over 1M images, namely MS-Celeb-1M [45], VGGFace2 [39], and MegaFace [44,164]. We summarize some interesting findings about these training sets, as shown in Fig. 18.

Depth vs. breadth. These large training sets are expanded in depth or in breadth. VGGFace2 provides a large-scale training dataset of depth, which has a limited number of subjects but many images for each subject. The depth of a dataset forces the trained model to address a wide range of intra-class variations, such as lighting, age, and pose. In contrast, MS-Celeb-1M and MegaFace (Challenge 2) offer large-scale training datasets of breadth, which contain many subjects but limited images for each subject. The breadth of a dataset ensures that the trained model covers the sufficiently variable appearance of various people. Cao et al. [39] conducted a systematic study on model training using VGGFace2 and MS-Celeb-1M and found an optimal model by first training on MS-Celeb-1M (breadth) and then fine-tuning on VGGFace2 (depth).

Long tail distribution. The long tail distribution is utilized differently among the datasets. For example, in Challenge 2 of MS-Celeb-1M, the novel set specifically uses the tailed data to study low-shot learning; the central part of the long tail distribution is used by Challenge 1 of MS-Celeb-1M, with the number of images limited to approximately 100 per celebrity; VGGFace and VGGFace2 use only the head part to construct deep databases; and MegaFace utilizes the whole distribution to contain as many images as possible, with a minimum of 3 images and a maximum of 2469 images per person, as shown in Fig. 18.
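As a practical aside, the breadth and depth of a candidate training set, and hence how much of its long tail to keep, can be profiled with a few lines; the helper below is a hypothetical sketch.

```python
from collections import Counter

def depth_breadth_stats(labels):
    """Profile a training set's breadth (number of identities) and depth
    (images per identity) from a list of per-image identity labels."""
    per_id = Counter(labels)
    counts = sorted(per_id.values(), reverse=True)
    return {
        "breadth": len(per_id),                      # number of subjects
        "max_depth": counts[0],                      # head of the distribution
        "min_depth": counts[-1],                     # tail of the distribution
        "median_depth": counts[len(counts) // 2],
    }
```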
Table 6. The commonly used FR datasets for training.
Fig. 18. The distributions of three new large-scale databases suitable for training deep models. They are larger in scale than the widely used CASIA-WebFace database. The vertical axis displays the number of images per person, and the horizontal axis shows the person IDs.
and 2.8M aligned images of 100K Asian celebrities. Moreover, Zhan et al. [167] shifted the focus from cleaning the datasets to leveraging more unlabeled data. By automatically assigning pseudo-labels to unlabeled data with the help of relational graphs, they obtained competitive or even better results than the fully supervised counterpart.
Data bias. Large-scale training datasets, such as CASIA-WebFace [120], VGGFace2 [39] and MS-Celeb-1M [45], are typically constructed by scraping websites like Google Images and consist of celebrities on formal occasions: smiling, with make-up, young, and beautiful. They are largely different from databases captured in daily life (e.g. MegaFace). The biases can be attributed to many exogenous factors in data collection, such as cameras, lighting, preferences for certain types of backgrounds, or annotator tendencies. Dataset biases adversely affect cross-dataset generalization; that is, the performance of a model trained on one dataset drops significantly when it is applied to another one. One persuasive piece of evidence is presented by P.J. Phillips' study [168], which conducted a cross-benchmark assessment of the VGGFace model [37] for face recognition: the VGGFace model achieves 98.95% on LFW [23] and 97.30% on YTF [169], but obtains only 26%, 52% and 85% on the Ugly, Bad and Good partitions of the GBU database [170].
Demographic bias (e.g., race/ethnicity, gender, age) in datasets include those who are not in the gallery, this is open-set
is a universal but urgent issue to be solved in data bias field. In identification.
existing training and testing datasets, the male, White, and Face verification is relevant to access control systems, re-
middle-aged cohorts always appear more frequently, as shown in identification, and application independent evaluations of FR algo-
Table 7, which inevitably causes deep learning models to replicate rithms. It is classically measured using the receiver operating char-
and even amplify these biases resulting in significantly different acteristic (ROC) and estimated mean accuracy (Acc). At a given
accuracies when deep models are applied to different demographic threshold (the independent variable), ROC analysis measures the
groups. Some researches [145,171,172] showed that the female, true accept rate (TAR), which is the fraction of genuine compar-
Black, and younger cohorts are usually more difficult to recognize isons that correctly exceed the threshold, and the false accept rate
in FR systems trained with commonly-used datasets. For example, (FAR), which is the fraction of impostor comparisons that incor-
Wang et al. [173] proposed a Racial Faces in-the-Wild (RFW) data- rectly exceed the threshold. And Acc is a simplified metric intro-
base and proved that existing commercial APIs and the SOTA algo- duced by LFW [23], which represents the percentage of correct
rithms indeed work unequally for different races and the classifications. With the development of deep FR, more accurate
maximum difference in error rate between the best and worst recognitions are required. Customers concern more about the
groups is 12%, as shown in Table 8. Hupont et al. [171] showed that TAR when FAR is kept in a very low rate in most security certifica-
SphereFace has a TAR of 0.87 for White males which drops to 0.28
tion scenario. PaSC [179] reports TAR at a FAR of 102 ; IJB-A [41]
for Asian females, at a FAR of 1e 4. Such bias can result in
mistreatment of certain demographic groups, by either exposing evaluates TAR at a FAR of 103 ; Megaface [44,164] focuses on
them to a higher risk of fraud, or by making access to services more TAR@106 FAR; especially, in MS-celeb-1 M challenge 3 [163],
difficult. Therefore, addressing data bias and enhancing fairness of TAR@109 FAR is reported.
FR systems in real life are urgent and necessary tasks. Collecting Close-set face identification is relevant to user driven searches
balanced data to train a fair model or designing some debiasing (e.g., forensic identification), rank-N and cumulative match charac-
algorithms are effective way. teristic (CMC) is commonly used metrics in this scenario. Rank-N is
based on what percentage of probe searches return the probe’s gal-
5.2. Training protocols lery mate within the top k rank-ordered results. The CMC curve
reports the percentage of probes identified within a given rank
In terms of training protocol, FR can be categorized into subject- (the independent variable). IJB-A/B/C [41–43] concern on the
dependent and subject-independent settings, as illustrated in rank-1 and rank-5 recognition rate. The MegaFace challenge
Fig. 20. [44,164] systematically evaluates rank-1 recognition rate function
Subject-dependent protocol. For subject-dependent protocol, of increasing number of gallery distractors (going from 10 to 1 Mil-
all testing identities are predefined in training set, it is natural to lion), the results of the SOTA evaluated on MegaFace challenge are
classify testing face images to the given identities. Therefore, listed in Table 9. Rather than rank-N and CMC, MS-Celeb-1 M [45]
subject-dependent FR can be well addressed as a classification further applies a precision-coverage curve to measure identifica-
problem, where features are expected to be separable. The protocol tion performance under a variable threshold t. The probe is
is mostly adopted by the early-stage (before 2010) FR studies on rejected when its confidence score is lower than t. The algorithms
FERET [177], AR [178], and is suitable only for some small-scale are compared in term of what fraction of passed probes, i.e. cover-
applications. The Challenge 2 of MS-Celeb-1 M is the only large- age, with a high recognition precision, e.g. 95% or 99%, the results
scale database using subject-dependent training protocol. of the SOTA evaluated on MS-Celeb-1 M challenge are listed in
Subject-independent protocol. For subject-independent pro- Table 10.
tocol, the testing identities are usually disjoint from the training Open-set face identification is relevant to high throughput
set, which makes FR more challenging yet close to practice. face search systems (e.g., de-duplication, watch list identification),
Because it is impossible to classify faces to known identities in where the recognition system should reject unknown/unseen sub-
training set, generalized representation is essential. Due to the fact jects (probes who do not present in gallery) at test time. At present,
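For reference, the following minimal sketch (ours, with hypothetical function and variable names; it is not evaluation code released by any of the benchmarks above) computes the verification and closed-set metrics just described from raw similarity scores: TAR at a fixed FAR, LFW-style accuracy, rank-1 identification, and MS-Celeb-1M-style coverage at a precision target.

```python
import numpy as np

def tar_at_far(genuine, impostor, far=1e-3):
    """TAR@FAR: pick the threshold at which the given fraction of impostor
    scores pass, then report the fraction of genuine scores above it."""
    thr = np.quantile(impostor, 1.0 - far)        # `far` fraction of impostors exceed thr
    return float(np.mean(genuine >= thr)), float(thr)

def lfw_accuracy(genuine, impostor):
    """Acc: best percentage of correct accept/reject decisions over thresholds."""
    scores = np.concatenate([genuine, impostor])
    labels = np.concatenate([np.ones(len(genuine)), np.zeros(len(impostor))])
    best = 0.0
    for thr in np.unique(scores):
        best = max(best, float(np.mean((scores >= thr) == labels)))
    return best

def rank1(similarity, gallery_ids, probe_ids):
    """Closed-set rank-1: fraction of probes whose top gallery match is a mate.
    `similarity` is a (num_probes, num_gallery) score matrix."""
    top = np.asarray(gallery_ids)[similarity.argmax(axis=1)]
    return float(np.mean(top == np.asarray(probe_ids)))

def coverage_at_precision(confidence, correct, precision=0.95):
    """MS-Celeb-1M-style metric: the largest fraction of probes that can pass a
    confidence threshold t while the precision of passed probes stays >= target."""
    confidence = np.asarray(confidence, dtype=float)
    order = np.argsort(-confidence)                # lower the threshold step by step
    correct = np.asarray(correct, dtype=float)[order]
    cum_precision = np.cumsum(correct) / np.arange(1, len(correct) + 1)
    passing = np.nonzero(cum_precision >= precision)[0]
    return float((passing[-1] + 1) / len(correct)) if len(passing) else 0.0
```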
Table 7. Statistical demographic information of commonly-used training and testing databases [173,171].

Table 8. Racial bias in existing commercial recognition APIs and face recognition algorithms. Face verification accuracies (%) on the RFW database are given [173].

Fig. 20. The comparison of different training protocols and evaluation tasks in FR. In terms of training protocol, FR can be classified into subject-dependent or subject-independent settings according to whether testing identities appear in the training set. In terms of testing tasks, FR can be classified into face verification, closed-set face identification, and open-set face identification.

Table 9. Performance of state of the arts on the Megaface dataset.
Open-set face identification is relevant to high-throughput face search systems (e.g., de-duplication, watch-list identification), where the recognition system should reject unknown/unseen subjects (probes who do not appear in the gallery) at test time. At present, there are very few databases covering the task of open-set FR. The IJB-A/B/C [41–43] benchmarks introduce a decision error tradeoff (DET) curve to characterize the false negative identification rate (FNIR) as a function of the false positive identification rate (FPIR). FPIR measures what fraction of comparisons between probe templates and non-mate gallery templates result in a match score exceeding a threshold T. At the same time, FNIR measures what fraction of probe searches fail to match a mated gallery template above a score of T. The algorithms are compared in terms of the FNIR at a low FPIR, e.g. 1% or 10%; the results of the SOTA evaluated on the IJB-A dataset are listed in Table 11.
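The open-set protocol can be sketched in the same spirit (again our illustration rather than benchmark code, assuming each probe has already been scored against the whole gallery):

```python
import numpy as np

def fnir_at_fpir(mate_scores, nonmate_scores, fpir=0.01):
    """One point on the open-set DET curve described above. `nonmate_scores`
    holds, for each probe without a gallery mate, its highest score against the
    gallery; `mate_scores` holds each mated probe's score against its true
    gallery template."""
    # Threshold T such that the desired fraction of non-mate searches exceed it.
    T = np.quantile(nonmate_scores, 1.0 - fpir)
    fnir = np.mean(mate_scores < T)        # mated searches that fail above T
    return float(fnir), float(T)
```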
Table 10. Performance of state of the arts on the MS-Celeb-1M dataset.

Table 11. Face identification and verification evaluation of state of the arts on the IJB-A dataset.
5.4. Evaluation scenes and data

Publicly available training databases are mostly collected from photos of celebrities due to privacy issues, so they are far from the images captured in daily life with diverse scenes. In order to study different specific scenarios, more difficult and realistic datasets are constructed accordingly, as shown in Table 12. According to their characteristics, we divide these scenes into four categories: cross-factor FR, heterogenous FR, multiple (or single) media FR, and FR in industry (Fig. 21).

Cross-factor FR. Due to the complex nonlinear facial appearance, some variations are caused by people themselves, such as cross-pose, cross-age, make-up, and disguise. For example, CALFW [188], MORPH [189], CACD [191] and FG-NET [194] are commonly used datasets with different age ranges; CFP [182] only focuses on frontal and profile faces; CPLFW [181] is extended from LFW and contains different poses. Disguised faces in the wild (DFW) [214] evaluates face recognition across disguise.

Heterogenous FR. This refers to the problem of matching faces across different visual domains. The domain gap is mainly caused by sensory devices and camera settings, e.g. visible light vs. near-infrared and photo vs. sketch. For example, CUFSF [201] and CUFS [199] are commonly used photo-sketch datasets, and the CUFSF dataset is harder due to lighting variation and shape exaggeration.

Multiple (or single) media FR. Ideally, in FR, many images of each subject are provided in the training dataset and image-to-image recognition is performed at testing. But the situation is different in reality. Sometimes, the number of images per person in the training set can be very small, as in MS-Celeb-1M Challenge 2 [45]; this challenge is often called low-shot or few-shot FR. Moreover, each subject face in the test set may be enrolled with a set of images and videos, and set-to-set recognition should be performed, as in IJB-A [41] and PaSC [179].

FR in industry. Although deep FR has achieved beyond-human performance on some standard benchmarks, other factors should be given more attention besides accuracy when deep FR is adopted in industry, e.g. anti-attack (CASIA-FASD [210]) and 3D FR (Bosphorus [203], BU-3DFE [205] and FRGCv2 [206]). Compared to publicly available 2D face databases, 3D scans are hard to acquire, and the number of scans and subjects in public 3D face databases is still limited, which hinders the development of 3D deep FR.

6. Diverse recognition scenes of deep learning

Despite the high accuracy on the LFW [23] and Megaface [44,164] benchmarks, the performance of FR models still hardly meets the requirements of real-world applications. A conjecture in industry is that the results of generic deep models can be improved simply by collecting big datasets of the target scene. However, this holds only to a certain degree. Growing concerns about privacy may make the collection and human annotation of face data illegal in the future. Therefore, significant efforts have been devoted to designing excellent algorithms that address the specific problems with limited data in these realistic scenes. In this section, we present several special algorithms of FR.

6.1. Cross-factor face recognition

6.1.1. Cross-pose face recognition

As [182] shows, many existing algorithms suffer a decrease of over 10% from frontal-frontal to frontal-profile verification, so cross-pose FR is still an extremely challenging scene. In addition to the aforementioned methods, including "one-to-many augmentation", "many-to-one normalization" and assembled networks (Sections 4 and 3.2.2), there are some other algorithms designed for cross-pose FR.
Table 12. The commonly used FR datasets for testing.

| Dataset | Year | #Photos | #Subjects | #Photos per subject | Metrics | Typical methods & accuracy | Key features (Section) |
|---|---|---|---|---|---|---|---|
| LFW [23] | 2007 | 13 K | 5 K | 1/2.3/530 | 1:1: Acc, TAR vs. FAR (ROC); 1:N: Rank-N, DIR vs. FAR (CMC) | 99.78% Acc [109]; 99.63% Acc [38] | annotation with several attributes |
| MS-Celeb-1M Challenge 1 [45] | 2016 | 2 K | 1 K | 2 | Coverage@P=0.95 | random set: 87.50%@P=0.95; hard set: 79.10%@P=0.95 [174] | large-scale |
| MS-Celeb-1M Challenge 2 [45] | 2016 | 100 K (base set), 20 K (novel set) | 20 K (base set), 1 K (novel set) | 5/-/20 | Coverage@P=0.99 | 99.01%@P=0.99 [137] | low-shot learning (6.3.1) |
| MS-Celeb-1M Challenge 3 [163] | 2018 | 274 K (ELFW), 1.58 M (DELFW) | 5.7 K (ELFW), 1 M (DELFW) | – | 1:1: TPR@FPR=1e-9; 1:N: TPR@FPR=1e-3 | 1:1: 46.15% [106]; 1:N: 43.88% [106] | trillion pairs; large distractors |
| MegaFace [44,164] | 2016 | 1 M | 690,572 | 1.4 | 1:1: TPR vs. FPR (ROC); 1:N: Rank-N (CMC) | 1:1: 86.47%@10^-6 FPR [38]; 1:N: 70.50% Rank-1 [38] | large-scale; 1 million distractors |
| IJB-A [41] | 2015 | 25,809 | 500 | 51.6 | 1:1: TAR vs. FAR (ROC); 1:N: Rank-N, TPIR vs. FPIR (CMC, DET) | 1:1: 92.10%@10^-3 FAR [39]; 1:N: 98.20% Rank-1 [39] | cross-pose; template-based (6.1.1 and 6.3.2) |
| IJB-B [42] | 2017 | 11,754 images, 7,011 videos | 1,845 | 41.6 | 1:1: TAR vs. FAR (ROC); 1:N: Rank-N, TPIR vs. FPIR (CMC, DET) | 1:1: 52.12%@10^-6 FAR [180]; 1:N: 90.20% Rank-1 [39] | cross-pose; template-based (6.1.1 and 6.3.2) |
| IJB-C [43] | 2018 | 31.3 K images, 11,779 videos | 3,531 | 42.1 | 1:1: TAR vs. FAR (ROC); 1:N: Rank-N, TPIR vs. FPIR (CMC, DET) | 1:1: 90.53%@10^-6 FAR [180]; 1:N: 74.5% Rank-1 [71] | cross-pose; template-based (6.1.1 and 6.3.2) |
| RFW [173] | 2018 | 40,607 | 11,429 | 3.6 | 1:1: Acc, TAR vs. FAR (ROC) | Caucasian: 92.15% Acc; Indian: 88.00% Acc; Asian: 83.98% Acc; African: 84.93% Acc [84] | evaluating race bias |
| DemogPairs [171] | 2019 | 10.8 K | 800 | 18 | 1:1: TAR vs. FAR (ROC) | White male: 88%; White female: 87%@10^-4 FAR; Black male: 55%; Black female: 65%@10^-4 FAR [84] | evaluating race and gender bias |
| CPLFW [181] | 2017 | 11,652 | 3,968 | 2/2.9/3 | 1:1: Acc, TAR vs. FAR (ROC) | 77.90% Acc [37] | cross-pose (6.1.1) |
| CFP [182] | 2016 | 7,000 | 500 | 14 | 1:1: Acc, EER, AUC, TAR vs. FAR (ROC) | Frontal-Frontal: 98.67% Acc [135]; Frontal-Profile: 94.39% Acc [97] | frontal-profile (6.1.1) |
| SLLFW [183] | 2017 | 13 K | 5 K | 2.3 | 1:1: Acc, TAR vs. FAR (ROC) | 85.78% Acc [37]; 78.78% Acc [20] | fine-grained |
| UMDFaces [184] | 2016 | 367,920 | 8,501 | 43.3 | 1:1: Acc, TPR vs. FPR (ROC) | 69.30%@10^-2 FAR [22] | annotation with bounding boxes, 21 keypoints, gender and 3D pose |
| YTF [169] | 2011 | 3,425 | 1,595 | 48/181.3/6,070 | 1:1: Acc | 97.30% Acc [37]; 96.52% Acc [185] | video (6.3.3) |
| PaSC [179] | 2013 | 2,802 | 265 | – | 1:1: VR vs. FAR (ROC) | 95.67%@10^-2 FAR [185] | video (6.3.3) |
| YTC [186] | 2008 | 1,910 | 47 | – | 1:N: Rank-N (CMC) | 97.82% Rank-1 [185]; 97.32% Rank-1 [187] | video (6.3.3) |
| CALFW [188] | 2017 | 12,174 | 4,025 | 2/3/4 | 1:1: Acc, TAR vs. FAR (ROC) | 86.50% Acc [37]; 82.52% Acc [114] | cross-age; 12 to 81 years old (6.1.2) |
| MORPH [189] | 2006 | 55,134 | 13,618 | 4.1 | 1:N: Rank-N (CMC) | 94.4% Rank-1 [190] | cross-age; 16 to 77 years old (6.1.2) |
| CACD [191] | 2014 | 163,446 | 2,000 | 81.7 | 1:1 (CACD-VS): Acc, TAR vs. FAR (ROC); 1:N: MAP | 1:1 (CACD-VS): 98.50% Acc [192]; 1:N: 69.96% MAP (2004–2006) [193] | cross-age; 14 to 62 years old (6.1.2) |
| FG-NET [194] | 2010 | 1,002 | 82 | 12.2 | 1:N: Rank-N (CMC) | 88.1% Rank-1 [192] | cross-age; 0 to 69 years old (6.1.2) |
| CASIA NIR-VIS v2.0 [195] | 2013 | 17,580 | 725 | 24.2 | 1:1: Acc, VR vs. FAR (ROC) | 98.62% Acc, 98.32%@10^-3 FAR [196] | NIR-VIS; with eyeglasses, pose and expression variation (6.2.1) |
| CASIA-HFB [197] | 2009 | 5,097 | 202 | 25.5 | 1:1: Acc, VR vs. FAR (ROC) | 97.58% Acc, 85.00%@10^-3 FAR [198] | NIR-VIS; with eyeglasses and expression variation (6.2.1) |
| CUFS [199] | 2009 | 1,212 | 606 | 2 | 1:N: Rank-N (CMC) | 100% Rank-1 [200] | sketch-photo (6.2.3) |
| CUFSF [201] | 2011 | 2,388 | 1,194 | 2 | 1:N: Rank-N (CMC) | 51.00% Rank-1 [202] | sketch-photo; lighting variation; shape exaggeration (6.2.3) |
| Bosphorus [203] | 2008 | 4,652 | 105 | 31/44.3/54 | 1:1: TAR vs. FAR (ROC); 1:N: Rank-N (CMC) | 1:N: 99.20% Rank-1 [204] | 3D; 34 expressions, 4 occlusions and different poses (6.4.1) |
| BU-3DFE [205] | 2006 | 2,500 | 100 | 25 | 1:1: TAR vs. FAR (ROC); 1:N: Rank-N (CMC) | 1:N: 95.00% Rank-1 [204] | 3D; different expressions (6.4.1) |
| FRGCv2 [206] | 2005 | 4,007 | 466 | 1/8.6/22 | 1:1: TAR vs. FAR (ROC); 1:N: Rank-N (CMC) | 1:N: 94.80% Rank-1 [204] | 3D; different expressions (6.4.1) |
| Guo et al. [207] | 2014 | 1,002 | 501 | 2 | 1:1: Acc, TAR vs. FAR (ROC) | 94.8% Rank-1, 65.9%@10^-3 FAR [208] | make-up; female (6.1.3) |
| FAM [209] | 2013 | 1,038 | 519 | 2 | 1:1: Acc, TAR vs. FAR (ROC) | 88.1% Rank-1, 52.6%@10^-3 FAR [208] | make-up; female and male (6.1.3) |
| CASIA-FASD [210] | 2012 | 600 | 50 | 12 | EER, HTER | 2.67% EER, 2.27% HTER [211] | anti-spoofing (6.4.4) |
| Replay-Attack [212] | 2012 | 1,300 | 50 | – | EER, HTER | 0.79% EER, 0.72% HTER [211] | anti-spoofing (6.4.4) |
| WebCaricature [213] | 2017 | 12,016 | 252 | – | 1:1: TAR vs. FAR (ROC); 1:N: Rank-N (CMC) | 1:1: 34.94%@10^-1 FAR [213]; 1:N: 55.41% Rank-1 [213] | caricature (6.2.3) |
Fig. 21. The different scenes of FR. We divide FR scenes into four categories: cross-factor FR, heterogenous FR, multiple (or single) media FR and FR in industry. There are
many testing datasets and special FR methods designed for each scene.
Considering the extra burden of the above methods, Cao et al. [215] attempted to perform frontalization in the deep feature space rather than the image space. A deep residual equivariant mapping (DREAM) block dynamically adds residuals to an input representation to transform a profile face to a frontal image. Chen et al. [216] proposed to combine feature extraction with multi-view subspace learning to simultaneously make features more pose-robust and discriminative. The Pose Invariant Model (PIM) [217] jointly performed face frontalization and learned pose-invariant representations end-to-end to allow them to mutually boost each other, and further introduced unsupervised cross-domain adversarial training and a learning-to-learn strategy to provide high-fidelity frontal reference face images.

6.1.2. Cross-age face recognition

Cross-age FR is extremely challenging due to the changes in facial appearance caused by the aging process over time. One direct approach is to synthesize an image of the desired target age such that recognition can be performed within the same age group. A generative probabilistic model was used by [218] to model the facial aging process at each short-term stage. The identity-preserved conditional generative adversarial networks (IPCGANs) framework [219] utilized a conditional GAN to generate a face in which an identity-preserved module preserved the identity information and an age classifier forced the generated face to have the target age. Antipov et al. [220] proposed to age faces by GAN, but the synthetic faces cannot be directly used for face verification due to their imperfect preservation of identities. They then used a local manifold adaptation (LMA) approach [221] to solve the problem of [220]. In [222], high-level age-specific features conveyed by the synthesized face are estimated by a pyramidal adversarial discriminator at multiple scales to generate more lifelike facial details. An alternative way to address the cross-age problem is to decompose the aging and identity components separately and extract age-invariant representations. Wen et al. [192] developed a latent identity analysis (LIA) layer to separate these two components, as shown in Fig. 22. In [193], age-invariant features were obtained by subtracting age-specific factors from the representations with the help of the age estimation task. In [124], face features are decomposed in a spherical coordinate system, in which the identity-related components are represented by the angular coordinates and the age-related information is encoded in the radial coordinate. Additionally, there are other methods designed for cross-age FR. For example, Bianco et al. [223] and El et al. [224] fine-tuned CNNs to transfer knowledge across ages. Wang et al. [225] proposed a siamese deep network to perform multi-task learning of FR and age estimation. Li et al. [226] integrated feature extraction and metric learning via a deep CNN.
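The decomposition in [124] can be pictured with a small sketch (ours; it only illustrates the idea of splitting an embedding into an identity-related direction and an age-related magnitude, not the authors' actual model):

```python
import numpy as np

def split_identity_age(feature, eps=1e-8):
    """Illustrative spherical split of an embedding: the radial part (norm)
    plays the role of the age-related component, and the angular part (unit
    direction) plays the role of the identity-related component."""
    radial = np.linalg.norm(feature)          # age-related magnitude
    angular = feature / (radial + eps)        # identity-related unit direction
    return angular, radial

def identity_similarity(f1, f2):
    """Age-invariant comparison then uses only the angular parts."""
    a1, _ = split_identity_age(f1)
    a2, _ = split_identity_age(f2)
    return float(np.dot(a1, a2))              # cosine similarity of directions
```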
6.1.3. Makeup face recognition

Makeup is widely used by the public today, but it also brings challenges for FR due to significant facial appearance changes. Research on matching makeup and nonmakeup face images is receiving increasing attention. Li et al. [208] generated nonmakeup images from makeup ones using a bi-level adversarial network (BLAN) and then used the synthesized nonmakeup images for verification, as shown in Fig. 23. Sun et al. [227] pretrained a triplet network on videos and fine-tuned it on a small makeup dataset. Specially, facial disguise [214,228,229] is a challenging research topic in makeup face recognition. By using disguise accessories such as wigs, beards, hats, mustaches, and heavy makeup, disguise introduces two variations: (i) when a person wants to obfuscate his/her own identity, and (ii) when another individual impersonates someone else's identity. Obfuscation increases intra-class variations whereas impersonation reduces the inter-class dissimilarity, thereby affecting the face recognition/verification task. To address this issue, a variety of methods have been proposed. Zhang et al. [230] first trained two DCNNs for generic face recognition and then used principal component analysis (PCA) to find the transformation matrix for disguised face recognition adaptation. Kohli et al. [231] fine-tuned models using disguised faces. Smirnov et al. [232] proposed a hard example mining method that benefits from class-wise (Doppelganger Mining [233]) and example-wise mining to learn useful deep embeddings for disguised face recognition. Suri et al. [234] learned the representations of images in terms of colors, shapes, and textures (COST) using an unsupervised dictionary learning method, and utilized the combination of COST features and CNN features to perform recognition.

6.2. Heterogenous face recognition

6.2.1. NIR-VIS face recognition

Due to the excellent performance of near-infrared spectrum (NIR) images under low-light scenarios, NIR images are widely applied in surveillance systems. Because most enrolled databases consist of visible light (VIS) spectrum images, how to recognize a NIR face from a gallery of VIS images has been a hot topic. Saxena et al. [235] and Liu et al. [236] transferred VIS deep networks to the NIR domain by fine-tuning. Lezama et al. [237] used a VIS CNN to recognize NIR faces by transforming NIR images to VIS faces through cross-spectral hallucination and restoring a low-rank structure for features through low-rank embedding. Reale et al. [198] trained a VISNet (for visible images) and a NIRNet (for near-infrared images), and coupled their output features by creating a siamese network. He et al. [238,239] divided the high layer of the network into a NIR layer, a VIS layer and a NIR-VIS shared layer; a modality-invariant feature can then be learned by the NIR-VIS shared layer. Song et al. [240] embedded cross-spectral face hallucination and discriminative feature learning into an end-to-end adversarial network. In [196], low-rank relevance and cross-modal ranking were used to alleviate the semantic gap.

6.2.2. Low-resolution face recognition

Although deep networks are robust to low resolution to a great extent, a few studies have still focused on promoting the performance of low-resolution FR. For example, Zangeneh et al. [241] proposed a CNN with a two-branch architecture (a super-resolution network and a feature extraction network) to map high- and low-resolution face images into a common space where the intra-person distance is smaller than the inter-person distance. Shen et al. [242] exploited face semantic information and local structural constraints to better restore the shape and details of face images. In addition, they optimized the network with perceptual and adversarial losses to produce photo-realistic results.
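As an illustration of the two-branch idea in [241], the following PyTorch sketch (ours; the class name, layer sizes and scale factor are invented for illustration, not the architecture of the cited paper) super-resolves low-resolution probes before embedding both resolutions into one normalized space:

```python
import torch.nn as nn
import torch.nn.functional as F

class CommonSpaceFR(nn.Module):
    """Minimal sketch of a two-branch low-resolution FR model: a
    super-resolution branch upsamples low-resolution inputs, and a shared
    feature branch embeds both resolutions into a common unit hypersphere."""
    def __init__(self, emb_dim=128):
        super().__init__()
        self.super_res = nn.Sequential(            # LR image -> HR-sized image
            nn.Upsample(scale_factor=4, mode='bilinear', align_corners=False),
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 3, 3, padding=1),
        )
        self.embed = nn.Sequential(                # shared feature extractor
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, emb_dim),
        )

    def forward(self, img, low_res=False):
        if low_res:
            img = self.super_res(img)              # hallucinate detail first
        return F.normalize(self.embed(img), dim=1) # common normalized space
```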
6.2.3. Photo-sketch face recognition

Photo-sketch FR may help law enforcement to quickly identify suspects. The commonly used methods can be categorized into two classes. One is to utilize transfer learning to directly match photos to sketches: deep networks are first trained using a large face database of photos and are then fine-tuned using a small sketch database [243,244]. The other is to use image-to-image translation, where the photo can be transformed into a sketch or the sketch into a photo; then, FR can be performed within one domain. Zhang et al. [200] developed a fully convolutional network with a generative loss and a discriminative regularizer to transform photos to sketches. Zhang et al. [245] utilized a branched fully convolutional neural network (BFCN) to generate a structure-preserved sketch and a texture-preserved sketch, and then fused them via a probabilistic method. Recently, GANs have achieved impressive results in image generation. Yi et al. [246], Kim et al. [247] and Zhu et al. [248] used two generators, G_A and G_B, to generate sketches from photos and photos from sketches, respectively (Fig. 24). Based on [248], Wang et al. [202] proposed a multi-adversarial network to avoid artifacts by leveraging the implicit presence of feature maps of different resolutions in the generator subnetwork. Similar to photo-sketch FR, photo-caricature FR is one kind of heterogenous FR scene which is challenging and important for understanding face perception. Huo et al. [213] built a large dataset of caricatures and photos, and provided several evaluation protocols and their baseline performances for comparison.

6.3. Multiple (or single) media face recognition

Fig. 25. The architecture of a single sample per person domain adaptation network (SSPP-DAN) [249].

the features' complementary information generated by different CNNs. Liu et al. [256] introduced actor-critic reinforcement learning for set-based FR. They cast the inner-set dependency modeling as a Markov decision process in the latent space, and trained a dependency-aware attention control agent to perform attention control for each image at each step.

6.3.3. Video face recognition

There are two key issues in video FR: one is to integrate the information across different frames to build a representation of the video face, and the other is to handle video frames with severe blur, pose variations, and occlusions. For frame aggregation, Yang et al. [83] proposed a neural aggregation network (NAN) in which the aggregation module, consisting of two attention blocks driven by a memory, produces a 128-dimensional vector representation (Fig. 26). Rao et al. [187] aggregated raw video frames directly by combining the ideas of metric learning and adversarial learning. For dealing with bad frames, Rao et al. [185] discarded bad frames by treating this operation as a Markov decision process and trained the attention model through a deep reinforcement learning framework. Ding et al. [257] artificially blurred clear images for training to learn blur-robust face representations. Parchami et al. [258] used a CNN to reconstruct a lower-quality video into a high-quality face.

Fig. 26. The FR framework of NAN [83].
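The frame-aggregation principle behind NAN can be sketched with a single attention block (our simplification; NAN itself cascades two blocks driven by a learned memory, and the variable names here are ours):

```python
import numpy as np

def aggregate_frames(frame_features, query):
    """Attention-pool per-frame embeddings into one video descriptor, in the
    spirit of NAN's attention blocks. frame_features: (num_frames, d) array;
    query: (d,) vector standing in for the learned memory."""
    scores = frame_features @ query                  # significance of each frame
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                         # softmax over frames
    return weights @ frame_features                  # weighted average, shape (d,)
```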
6.4. Face recognition in industry

for a minibatch of k face images and then constructs an unbiased estimate of the full gradient by relying on all k^2 - k pairs from the minibatch. As mentioned in Section 3.2.1, light-weight deep networks [126–129] perform excellently in the fundamental tasks of image classification and deserve further attention in FR tasks. Moreover, some well-known compressed networks, such as pruning [264–266], BinaryNets [267–270], and mimic networks [271,272], also have potential to be introduced into FR.

6.4.4. Face anti-attack

With the success of FR techniques, various types of attacks, such as face spoofing and adversarial perturbations, are becoming large threats. Face spoofing involves presenting a fake face to the biometric sensor using a printed photograph, a worn mask, or even an image displayed on another electronic device. To defend against this type of attack, several methods have been proposed [211,273–279]. Atoum et al. [211] proposed a novel two-stream CNN in which the local features discriminate the spoof patches independently of the spatial face areas, and holistic depth maps ensure that the input live sample has a face-like depth. Yang et al. [273] trained a CNN using both a single frame and multiple frames with five scales as input, and using the live/spoof label as the output. Taking the sequence of video frames as input, Xu et al. [274] applied LSTM units on top of a CNN to obtain end-to-end features for recognizing spoofing faces, which leveraged the local and dense properties of the convolution operation and learned the temporal structure using LSTM units. Li et al. [275] and Patel et al. [276] fine-tuned their networks from a pretrained model using training sets of real and fake images. Jourabloo et al. [277] proposed to inversely decompose a spoof face into the live face and the spoof noise pattern. Adversarial perturbation is the other type of attack; it can be defined as the addition of a minimal vector r such that, with this vector added to the input image x, i.e. (x + r), the deep learning model misclassifies the input while people do not. Recently, more and more work has begun to focus on solving this perturbation problem in FR. Goswami et al. [280] proposed to detect adversarial samples by characterizing abnormal filter response behavior in the hidden layers and to increase the network's robustness by removing the most problematic filters. Goel et al. [281] provided an open source implementation of adversarial detection and mitigation algorithms. Despite the progress of anti-attack algorithms, attack methods are updated as well, reminding us of the need to further increase the security and robustness of FR systems; for example, Mai et al. [282] proposed a neighborly de-convolutional neural network (NbNet) to reconstruct a fake face using stolen deep templates.
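The (x + r) formulation above can be made concrete with the classic fast gradient sign method (FGSM); the sketch below is a generic PyTorch illustration of that method (our code; `model` stands for any differentiable classifier), not an attack taken from the cited works:

```python
import torch
import torch.nn.functional as F

def fgsm_perturbation(model, x, label, eps=0.01):
    """Craft a small additive perturbation r such that model(x + r) is pushed
    toward misclassification while x + r stays visually close to x."""
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), label)
    loss.backward()
    r = eps * x.grad.sign()          # small step that maximally increases the loss
    return (x + r).detach(), r.detach()
```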
6.4.5. Debiasing face recognition

As described in Section 5.1, existing datasets are highly biased in terms of the distribution of demographic cohorts, which may dramatically impact the fairness of deep models. To address this issue, some works seek to introduce fairness into face recognition and mitigate demographic bias, e.g. unbalanced training [283], attribute removal [284–286] and domain adaptation [173,287,147]. 1) Unbalanced-training methods mitigate the bias via model regularization, taking the fairness goal into consideration in the overall model objective function. For example, RL-RBN [283] formulated the process of finding the optimal margins for non-Caucasians as a Markov decision process and employed deep Q-learning to learn policies based on a large-margin loss. 2) Attribute removal methods confound or remove demographic information of faces to learn attribute-invariant representations. For example, Alvi et al. [284] applied a confusion loss to make a classifier fail to distinguish the attributes of examples so that multiple spurious variations are removed from the feature representation. SensitiveNets [288] proposed to introduce sensitive information into the triplet loss; it minimizes the sensitive information while maintaining distances between positive and negative embeddings. 3) Domain adaptation methods investigate the data bias problem from a domain adaptation point of view and attempt to design domain-invariant feature representations to mitigate bias across domains. IMAN [173] simultaneously aligned the global distribution to decrease the race gap at the domain level, and learned discriminative target representations at the cluster level. Kan [147] directly converted Caucasian data to the non-Caucasian domain in the image space with the help of sparse reconstruction coefficients learnt in a common subspace.

7. Technical challenges

In this paper, we provide a comprehensive survey of deep FR from both the data and algorithm aspects. For algorithms, mainstream and special network architectures are presented. Meanwhile, we categorize loss functions into Euclidean-distance-based loss, angular/cosine-margin-based loss and variable softmax loss. For data, we summarize some commonly used datasets. Moreover, the methods of face processing are introduced and categorized as "one-to-many augmentation" and "many-to-one normalization". Finally, the special scenes of deep FR, including video FR, 3D FR and cross-age FR, are briefly introduced.

Taking advantage of big annotated data and revolutionary deep learning techniques, deep FR has dramatically improved the SOTA performance and fostered successful real-world applications. With the practical and commercial use of this technology, many ideal assumptions of academic research have been broken, and more real-world issues are emerging. To the best of our knowledge, the major technical challenges include the following aspects.

Security issues. Presentation attacks [289], adversarial attacks [280,281,290], template attacks [291] and digital manipulation attacks [292,293] are developing to threaten the security of deep face recognition systems. 1) Presentation attacks with 3D silicone masks, which exhibit skin-like appearance and facial motion, challenge current anti-spoofing methods [294]. 2) Although adversarial perturbation detection and mitigation methods have recently been proposed [280,281], the root cause of adversarial vulnerability is unclear, and thus new types of adversarial attacks continue to appear [295,296]. 3) A stolen deep feature template can be used to recover its facial appearance, and how to generate a cancelable template without loss of accuracy is another important issue. 4) Digital manipulation attacks, made feasible by GANs, can generate entirely or partially modified photorealistic faces by expression swap, identity swap, attribute manipulation and entire face synthesis, which remains a main challenge for the security of deep FR.

Privacy-preserving face recognition. With the leakage of biological data, privacy concerns are rising nowadays. Facial images can predict not only demographic information such as gender, age, or race, but even genetic information [297]. Recently, pioneering works such as Semi-Adversarial Networks [298,299,285] have explored generating recognizable biometric templates that can hide some of the private information presented in the facial images. Further research on the principles of visual cryptography, signal mixing and image perturbation to protect users' privacy on stored face templates is essential for addressing public concern about privacy.

Understanding deep face recognition. Deep face recognition systems are now believed to surpass human performance in most scenarios [300]. There are also some interesting attempts to apply deep models to assist human operators in face verification [183,300]. Despite this progress, many fundamental questions are still open, such as: what is the "identity capacity" of a deep representation [301]? Why are deep neural networks, rather than humans, easily fooled by adversarial samples? While bigger and bigger training datasets by themselves cannot solve this problem, a deeper understanding of these questions may help us to build robust applications in the real world. Recently, a new benchmark called TALFW has been proposed to explore this issue [93].

Remaining challenges defined by non-saturated benchmark datasets. Three current major datasets, namely, MegaFace [44,164], MS-Celeb-1M [45] and IJB-A/B/C [41–43], correspond to large-scale FR with a very large number of candidates, low/one-shot FR, and large pose-variance FR, which will be the focus of research in the future. Although the SOTA algorithms can be over 99.9 percent accurate on the LFW [23] and Megaface [44,164] databases, fundamental challenges such as matching faces across ages [188], poses [181], sensors, or styles still remain. For both datasets and algorithms, it is necessary to measure and address the racial/gender/age biases of deep FR in future research.

Ubiquitous face recognition across applications and scenes. Deep face recognition has been successfully applied in many user-cooperative applications, but ubiquitous recognition applications everywhere are still an ambitious goal. In practice, it is difficult to collect and label sufficient samples for innumerable scenes in the real world. One promising solution is to first learn a general model and then transfer it to an application-specific scene. While deep domain adaptation [145] has recently been applied to reduce algorithm bias on different scenes [148] and different races [173], a general solution for transferring face recognition is largely open.

Pursuit of extreme accuracy and efficiency. Many killer applications, such as watch-list surveillance or financial identity verification, require high matching accuracy at a very low false alarm rate, e.g. 10^-9. This is still a big challenge even with deep learning on massive training data. Meanwhile, deploying deep face recognition on mobile devices pursues the minimum size of feature representation and a compressed deep network. It is of great significance for both industry and academia to explore this extreme face-recognition performance beyond human imagination. It is also exciting to constantly push the performance limits of an algorithm after it has already surpassed humans.

Fusion issues. Face recognition by itself is far from sufficient to solve all biometric and forensic tasks, such as distinguishing identical twins and matching faces before and after surgery [302]. A reliable solution is to consolidate multiple sources of biometric evidence [303]. These sources of information may correspond to different biometric traits (e.g., face + hand [304]), sensors (e.g., 2D + 3D face cameras), feature extraction and matching techniques, or instances (e.g., a face sequence of
various poses). It is beneficial for face biometric and forensic applications to perform information fusion at the data level, feature level, score level, rank level, and decision level [305].

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

This work was partially supported by National Key R&D Program of China (2019YFB1406504) and BUPT Excellent Ph.D. Students Foundation CX2020207.

References

[1] M. Turk, A. Pentland, Eigenfaces for recognition, J. Cognitive Neurosci. 3 (1) (1991) 71–86.
[2] P.N. Belhumeur, J.P. Hespanha, D.J. Kriegman, Eigenfaces vs. fisherfaces: Recognition using class specific linear projection, IEEE Trans. Pattern Anal. Mach. Intell. 19 (7) (1997) 711–720.
[3] B. Moghaddam, W. Wahid, A. Pentland, Beyond eigenfaces: probabilistic matching for face recognition, in: Proc. Third IEEE Int. Conf. on Automatic Face and Gesture Recognition, 1998, pp. 30–35.
[4] W. Deng, J. Hu, J. Lu, J. Guo, Transform-invariant PCA: A unified approach to fully automatic face alignment, representation, and recognition, IEEE Trans. Pattern Anal. Mach. Intell. 36 (6) (2014) 1275–1284.
[5] X. He, S. Yan, Y. Hu, P. Niyogi, H.-J. Zhang, Face recognition using laplacianfaces, IEEE Trans. Pattern Anal. Mach. Intell. 27 (3) (2005) 328–340.
[6] S. Yan, D. Xu, B. Zhang, H.-J. Zhang, Graph embedding: A general framework for dimensionality reduction, in: IEEE Computer Society Conference on Computer Vision and Pattern Recognition, vol. 2, 2005, pp. 830–837.
[7] W. Deng, J. Hu, J. Guo, H. Zhang, C. Zhang, Comments on "globally maximizing, locally minimizing: Unsupervised discriminant projection with applications to face and palm biometrics", IEEE Trans. Pattern Anal. Mach. Intell. 30 (8) (2008) 1503–1504.
[8] J. Wright, A. Yang, A. Ganesh, S. Sastry, Y. Ma, Robust face recognition via sparse representation, IEEE Trans. Pattern Anal. Mach. Intell. 31 (2) (2009) 210–227.
[9] L. Zhang, M. Yang, X. Feng, Sparse representation or collaborative representation: Which helps face recognition?, in: 2011 International Conference on Computer Vision, IEEE, 2011, pp. 471–478.
[10] W. Deng, J. Hu, J. Guo, Extended SRC: Undersampled face recognition via intraclass variant dictionary, IEEE Trans. Pattern Anal. Mach. Intell. 34 (9) (2012) 1864–1870.
[11] W. Deng, J. Hu, J. Guo, Face recognition via collaborative representation: Its discriminant nature and superposed representation, IEEE Trans. Pattern Anal. Mach. Intell. (2018).
[12] C. Liu, H. Wechsler, Gabor feature based classification using the enhanced fisher linear discriminant model for face recognition, IEEE Trans. Image Process. 11 (4) (2002) 467–476.
[13] T. Ahonen, A. Hadid, M. Pietikainen, Face description with local binary patterns: Application to face recognition, IEEE Trans. Pattern Anal. Mach. Intell. 28 (12) (2006) 2037–2041.
[14] W. Zhang, S. Shan, W. Gao, X. Chen, H. Zhang, Local gabor binary pattern histogram sequence (LGBPHS): A novel non-statistical model for face representation and recognition, in: ICCV, vol. 1, IEEE, 2005, pp. 786–791.
[15] D. Chen, X. Cao, F. Wen, J. Sun, Blessing of dimensionality: High-dimensional feature and its efficient compression for face verification, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2013, pp. 3025–3032.
[16] W. Deng, J. Hu, J. Guo, Compressive binary patterns: Designing a robust binary face descriptor with random-field eigenfilters, IEEE Trans. Pattern Anal. Mach. Intell. (2018).
[17] Z. Cao, Q. Yin, X. Tang, J. Sun, Face recognition with learning-based descriptor, in: 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, IEEE, 2010, pp. 2707–2714.
[18] Z. Lei, M. Pietikainen, S.Z. Li, Learning discriminant face descriptor, IEEE Trans. Pattern Anal. Mach. Intell. 36 (2) (2014) 289–302.
[19] T.-H. Chan, K. Jia, S. Gao, J. Lu, Z. Zeng, Y. Ma, PCANet: A simple deep learning baseline for image classification?, IEEE Trans. Image Process. 24 (12) (2015) 5017–5032.
[20] Y. Taigman, M. Yang, M. Ranzato, L. Wolf, Deepface: Closing the gap to human-level performance in face verification, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 1701–1708.
[21] Y. Sun, Y. Chen, X. Wang, X. Tang, Deep learning face representation by joint identification-verification, in: Advances in Neural Information Processing Systems, 2014, pp. 1988–1996.
[22] A. Krizhevsky, I. Sutskever, G.E. Hinton, Imagenet classification with deep convolutional neural networks, in: Advances in Neural Information Processing Systems, 2012, pp. 1097–1105.
[23] G.B. Huang, M. Ramesh, T. Berg, E. Learned-Miller, Labeled faces in the wild: A database for studying face recognition in unconstrained environments, Technical Report 07-49, University of Massachusetts, Amherst, 2007.
[24] W. Zhao, R. Chellappa, P.J. Phillips, A. Rosenfeld, Face recognition: A literature survey, ACM Computing Surveys (CSUR) 35 (4) (2003) 399–458.
[25] K.W. Bowyer, K. Chang, P. Flynn, A survey of approaches and challenges in 3d and multi-modal 3d+2d face recognition, Computer Vision and Image Understanding 101 (1) (2006) 1–15.
[26] A.F. Abate, M. Nappi, D. Riccio, G. Sabatino, 2d and 3d face recognition: A survey, Pattern Recogn. Lett. 28 (14) (2007) 1885–1906.
[27] R. Jafri, H.R. Arabnia, A survey of face recognition techniques, J. Inform. Process. Syst. 5 (2) (2009) 41–68.
[28] A. Scheenstra, A. Ruifrok, R.C. Veltkamp, A survey of 3d face recognition methods, in: International Conference on Audio- and Video-based Biometric Person Authentication, Springer, 2005, pp. 891–899.
[29] X. Zou, J. Kittler, K. Messer, Illumination invariant face recognition: A survey, in: 2007 First IEEE International Conference on Biometrics: Theory, Applications, and Systems, IEEE, 2007, pp. 1–8.
[30] X. Zhang, Y. Gao, Face recognition across pose: A review, Pattern Recogn. 42 (11) (2009) 2876–2896.
[31] C. Ding, D. Tao, A comprehensive survey on pose-invariant face recognition, ACM Trans. Intelligent Systems Technol. 7 (3) (2015) 37.
[32] R. Ranjan, S. Sankaranarayanan, A. Bansal, N. Bodla, J.C. Chen, V.M. Patel, C.D. Castillo, R. Chellappa, Deep learning for understanding faces: Machines may be just as good, or better, than humans, IEEE Signal Process. Mag. 35 (1) (2018) 66–83.
[33] X. Jin, X. Tan, Face alignment in-the-wild: A survey, Comput. Vis. Image Underst. 162 (2017) 1–22.
[34] Y. Sun, X. Wang, X. Tang, Deep learning face representation from predicting 10,000 classes, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 1891–1898.
[35] Y. Sun, X. Wang, X. Tang, Deeply learned face representations are sparse, selective, and robust, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 2892–2900.
[36] Y. Sun, D. Liang, X. Wang, X. Tang, Deepid3: Face recognition with very deep neural networks, arXiv preprint arXiv:1502.00873, 2015.
[37] O.M. Parkhi, A. Vedaldi, A. Zisserman, et al., Deep face recognition, in: BMVC, vol. 1, no. 3, 2015, p. 6.
[38] F. Schroff, D. Kalenichenko, J. Philbin, Facenet: A unified embedding for face recognition and clustering, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 815–823.
[39] Q. Cao, L. Shen, W. Xie, O.M. Parkhi, A. Zisserman, Vggface2: A dataset for recognising faces across pose and age, in: 2018 13th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2018), IEEE, 2018, pp. 67–74.
[40] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, Y. Bengio, Generative adversarial nets, in: Advances in Neural Information Processing Systems, 2014, pp. 2672–2680.
[41] B.F. Klare, B. Klein, E. Taborsky, A. Blanton, J. Cheney, K. Allen, P. Grother, A. Mah, A.K. Jain, Pushing the frontiers of unconstrained face detection and recognition: Iarpa janus benchmark a, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 1931–1939.
[42] C. Whitelam, E. Taborsky, A. Blanton, B. Maze, J. Adams, T. Miller, N. Kalka, A.K. Jain, J.A. Duncan, K. Allen, et al., Iarpa janus benchmark-b face dataset, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2017, pp. 90–98.
[43] B. Maze, J. Adams, J.A. Duncan, N. Kalka, T. Miller, C. Otto, A.K. Jain, W.T. Niggel, J. Anderson, J. Cheney, et al., Iarpa janus benchmark-c: Face dataset and protocol, in: 2018 International Conference on Biometrics (ICB), IEEE, 2018, pp. 158–165.
[44] I. Kemelmacher-Shlizerman, S.M. Seitz, D. Miller, E. Brossard, The megaface benchmark: 1 million faces for recognition at scale, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 4873–4882.
[45] Y. Guo, L. Zhang, Y. Hu, X. He, J. Gao, Ms-celeb-1m: A dataset and benchmark for large-scale face recognition, in: European Conference on Computer Vision, Springer, 2016, pp. 87–102.
[46] M. Mehdipour Ghazi, H. Kemal Ekenel, A comprehensive analysis of deep learning based representation for face recognition, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2016, pp. 34–41.
[47] I. Masi, A.T. Tran, T. Hassner, J.T. Leksut, G. Medioni, Do we really need to collect millions of faces for effective face recognition?, in: ECCV, Springer, 2016, pp. 579–596.
[48] I. Masi, T. Hassner, A.T. Tran, G. Medioni, Rapid synthesis of massive face sets for improved face recognition, in: 2017 12th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2017), IEEE, 2017, pp. 604–611.
[49] E. Richardson, M. Sela, R. Kimmel, 3d face reconstruction by learning from synthetic data, in: 2016 Fourth International Conference on 3D Vision (3DV), IEEE, 2016, pp. 460–469.
[50] E. Richardson, M. Sela, R. Or-El, R. Kimmel, Learning detailed face reconstruction from a single image, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 1259–1268.
[51] P. Dou, S.K. Shah, I.A. Kakadiaris, End-to-end 3d face reconstruction with deep neural networks, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 5908–5917.
[52] Y. Guo, J. Zhang, J. Cai, B. Jiang, J. Zheng, 3dfacenet: Real-time dense face reconstruction via synthesizing photo-realistic face images, arXiv preprint arXiv:1708.00980, 2017.
[53] A. Tuan Tran, T. Hassner, I. Masi, G. Medioni, Regressing robust and discriminative 3d morphable models with a very deep neural network, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 5163–5172.
[54] A. Tewari, M. Zollhofer, H. Kim, P. Garrido, F. Bernard, P. Perez, C. Theobalt, Mofa: Model-based deep convolutional face autoencoder for unsupervised monocular reconstruction, in: Proceedings of the IEEE International Conference on Computer Vision Workshops, 2017, pp. 1274–1283.
[55] Z. Zhu, P. Luo, X. Wang, X. Tang, Multi-view perceptron: a deep model for learning face identity and view representations, in: Advances in Neural Information Processing Systems, 2014, pp. 217–225.
[56] J. Zhao, L. Xiong, P.K. Jayashree, J. Li, F. Zhao, Z. Wang, P.S. Pranata, P.S. Shen, S. Yan, J. Feng, Dual-agent gans for photorealistic and identity preserving profile face synthesis, in: Advances in Neural Information Processing Systems, 2017, pp. 66–76.
[57] A. Shrivastava, T. Pfister, O. Tuzel, J. Susskind, W. Wang, R. Webb, Learning from simulated and unsupervised images through adversarial training, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 2107–2116.
[58] J. Liu, Y. Deng, T. Bai, Z. Wei, C. Huang, Targeting ultimate accuracy: Face recognition via deep embedding, arXiv preprint arXiv:1506.07310, 2015.
[59] E. Zhou, Z. Cao, Q. Yin, Naive-deep face recognition: Touching the limit of lfw benchmark or not?, arXiv preprint arXiv:1501.04690, 2015.
[60] C. Ding, D. Tao, Robust face recognition via multimodal deep face representation, IEEE Trans. Multimedia 17 (11) (2015) 2049–2058.
[61] Y. Sun, X. Wang, X. Tang, Sparsifying neural network connections for face recognition, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 4856–4864.
[62] D. Wang, C. Otto, A.K. Jain, Face search at scale, IEEE Trans. Pattern Anal. Mach. Intell. 39 (6) (2016) 1122–1136.
[63] M. Kan, S. Shan, H. Chang, X. Chen, Stacked progressive auto-encoders (spae) for face recognition across poses, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 1883–1890.
[64] Y. Zhang, M. Shao, E.K. Wong, Y. Fu, Random faces guided sparse many-to-one encoder for pose-invariant face recognition, in: Proceedings of the IEEE International Conference on Computer Vision, 2013, pp. 2416–2423.
[65] J. Yang, S.E. Reed, M.-H. Yang, H. Lee, Weakly-supervised disentangling with recurrent transformations for 3d view synthesis, in: Advances in Neural Information Processing Systems, 2015, pp. 1099–1107.
[66] Z. Zhu, P. Luo, X. Wang, X. Tang, Deep learning identity-preserving face space, in: Proceedings of the IEEE International Conference on Computer Vision, 2013, pp. 113–120.
[67] Z. Zhu, P. Luo, X. Wang, X. Tang, Recover canonical-view faces in the wild with deep neural networks, arXiv preprint arXiv:1404.3543, 2014.
[68] L. Hu, M. Kan, S. Shan, X. Song, X. Chen, Ldf-net: Learning a displacement field network for face recognition across pose, in: 2017 12th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2017), IEEE, 2017, pp. 9–16.
[69] E. Zhou, Z. Cao, J. Sun, Gridface: Face rectification via learning local homography transformations, in: The European Conference on Computer Vision (ECCV), September 2018.
[70] R. Huang, S. Zhang, T. Li, R. He, Beyond face rotation: Global and local perception gan for photorealistic and identity preserving frontal view synthesis, in: Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 2439–2448.
[71] L. Tran, X. Yin, X. Liu, Disentangled representation learning gan for pose-invariant face recognition, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 1415–1424.
[72] J. Deng, S. Cheng, N. Xue, Y. Zhou, S. Zafeiriou, Uv-gan: Adversarial facial uv map completion for pose-invariant face recognition, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 7093–7102.
[73] X. Yin, X. Yu, K. Sohn, X. Liu, M. Chandraker, Towards large-pose face frontalization in the wild, in: Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 3990–3999.
[74] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al., Imagenet large scale visual recognition challenge, Int. J. Comput. Vision 115 (3) (2015) 211–252.
[75] K. Simonyan, A. Zisserman, Very deep convolutional networks for large-scale image recognition, arXiv preprint arXiv:1409.1556, 2014.
[76] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, A. Rabinovich, Going deeper with convolutions, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 1–9.
[77] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
[78] J. Hu, L. Shen, G. Sun, Squeeze-and-excitation networks, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 7132–7141.
[79] G. Hu, Y. Yang, D. Yi, J. Kittler, W. Christmas, S.Z. Li, T. Hospedales, When face recognition meets with deep learning: an evaluation of convolutional neural networks for face recognition, in: ICCV Workshops, 2015, pp. 142–150.
[80] S. Sankaranarayanan, A. Alavi, R. Chellappa, Triplet similarity embedding for face verification, arXiv preprint arXiv:1602.03418, 2016.
[81] S. Sankaranarayanan, A. Alavi, C.D. Castillo, R. Chellappa, Triplet probabilistic embedding for face verification and clustering, in: 2016 IEEE 8th International Conference on Biometrics Theory, Applications and Systems (BTAS), IEEE, 2016, pp. 1–8.
[82] X. Zhang, Z. Fang, Y. Wen, Z. Li, Y. Qiao, Range loss for deep face recognition with long-tailed training data, in: Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 5409–5418.
[83] J. Yang, P. Ren, D. Zhang, D. Chen, F. Wen, H. Li, G. Hua, Neural aggregation network for video face recognition, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 4362–4371.
[84] W. Liu, Y. Wen, Z. Yu, M. Li, B. Raj, L. Song, Sphereface: Deep hypersphere embedding for face recognition, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 212–220.
[85] X. Wu, R. He, Z. Sun, T. Tan, A light cnn for deep face representation with noisy labels, IEEE Trans. Inf. Forensics Secur. 13 (11) (2018) 2884–2896.
[86] X. Wu, R. He, Z. Sun, A lightened cnn for deep face representation, in: CVPR, vol. 4, 2015.
[87] C.N. Duong, K.G. Quach, N. Le, N. Nguyen, K. Luu, Mobiface: A lightweight deep learning face recognition on mobile devices, arXiv preprint arXiv:1811.11080, 2018.
[88] N. Zhu, Z. Yu, C. Kou, A new deep neural architecture search pipeline for face recognition, IEEE Access 8 (2020) 91303–91310.
[89] C. Xiong, X. Zhao, D. Tang, K. Jayashree, S. Yan, T.-K. Kim, Conditional convolutional neural network for modality-aware face recognition, in: Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 3667–3675.
[90] C. Han, S. Shan, M. Kan, S. Wu, X. Chen, Face recognition with contrastive convolution, in: The European Conference on Computer Vision (ECCV), September 2018.
[91] M. Hayat, S.H. Khan, N. Werghi, R. Goecke, Joint registration and representation learning for unconstrained face identification, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 2767–2776.
[92] W. Wu, M. Kan, X. Liu, Y. Yang, S. Shan, X. Chen, Recursive spatial transformer (rest) for alignment-free face recognition, in: Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 3772–3780.
[93] Y. Zhong, J. Chen, B. Huang, Toward end-to-end face recognition through alignment learning, IEEE Signal Process. Lett. 24 (8) (2017) 1213–1217.
[94] J.-C. Chen, R. Ranjan, A. Kumar, C.-H. Chen, V.M. Patel, R. Chellappa, An end-to-end system for unconstrained face verification with deep convolutional neural networks, in: Proceedings of the IEEE International Conference on Computer Vision Workshops, 2015, pp. 118–126.
[95] M. Kan, S. Shan, X. Chen, Multi-view deep network for cross-view classification, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 4847–4855.
[96] I. Masi, S. Rawls, G. Medioni, P. Natarajan, Pose-aware face recognition in the wild, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 4838–4846.
[97] X. Yin, X. Liu, Multi-task convolutional neural network for pose-invariant face recognition, IEEE Trans. Image Process. 27 (2) (2017) 964–975.
[98] W. Wang, Z. Cui, H. Chang, S. Shan, X. Chen, Deeply coupled auto-encoder networks for cross-view classification, arXiv preprint arXiv:1402.2031, 2014.
[99] Y. Sun, X. Wang, X. Tang, Hybrid deep learning for face verification, in: Proceedings of the IEEE International Conference on Computer Vision, 2013, pp. 1489–1496.
[100] R. Ranjan, S. Sankaranarayanan, C.D. Castillo, R. Chellappa, An all-in-one convolutional neural network for face analysis, in: 2017 12th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2017), IEEE, 2017, pp. 17–24.
[101] Y. Wen, K. Zhang, Z. Li, Y. Qiao, A discriminative feature learning approach for deep face recognition, in: European Conference on Computer Vision, Springer, 2016, pp. 499–515.
[102] Y. Wu, H. Liu, J. Li, Y. Fu, Deep face recognition with center invariant loss, in: Proceedings of the Thematic Workshops of ACM Multimedia 2017, ACM, 2017, pp. 408–414.
[103] J.-C. Chen, V.M. Patel, R. Chellappa, Unconstrained face verification using deep cnn features, in: 2016 IEEE Winter Conference on Applications of Computer Vision (WACV), IEEE, 2016, pp. 1–9.
[104] W. Liu, Y. Wen, Z. Yu, M. Yang, Large-margin softmax loss for convolutional neural networks, in: ICML, vol. 2, no. 3, 2016, p. 7.
[105] F. Wang, J. Cheng, W. Liu, H. Liu, Additive margin softmax for face verification, IEEE Signal Process. Lett. 25 (7) (2018) 926–930.
240
M. Wang and W. Deng Neurocomputing 429 (2021) 215–244
[106] J. Deng, J. Guo, N. Xue, S. Zafeiriou, Arcface: Additive angular margin loss for [134] M. Jaderberg, K. Simonyan, A. Zisserman et al., ”Spatial transformer
deep face recognition, in: Proceedings of the IEEE Conference on Computer networks,” in Advances in neural information processing systems, 2015, pp.
Vision and Pattern Recognition, 2019, pp. 4690–4699. 2017–2025.
[107] H. Wang, Y. Wang, Z. Zhou, X. Ji, D. Gong, J. Zhou, Z. Li, W. Liu, Cosface: Large [135] X. Peng, X. Yu, K. Sohn, D.N. Metaxas, M. Chandraker, Reconstruction-based
margin cosine loss for deep face recognition, in: Proceedings of the IEEE disentanglement for pose-invariant face recognition, in: Proceedings of the
Conference on Computer Vision and Pattern Recognition, 2018, pp. 5265– IEEE international conference on computer vision, 2017, pp. 1623–1632.
5274. [136] D. Chen, X. Cao, L. Wang, F. Wen, and J. Sun, ”Bayesian face revisited: A joint
[108] W. Liu, Y.-M. Zhang, X. Li, Z. Yu, B. Dai, T. Zhao, and L. Song, ”Deep formulation,” in European conference on computer vision. Springer, 2012, pp.
hyperspherical learning,” in Advances in neural information processing 566–579.
systems, 2017, pp. 3950–3960. [137] Y. Cheng, J. Zhao, Z. Wang, Y. Xu, K. Jayashree, S. Shen, J. Feng, Know you at
[109] R. Ranjan, C.D. Castillo, and R. Chellappa, ”L2-constrained softmax loss for one glance: A compact vector representation for low-shot learning, in:
discriminative face verification,” arXiv preprint arXiv:1703.09507, 2017. Proceedings of the IEEE International Conference on Computer Vision
[110] F. Wang, X. Xiang, J. Cheng, A.L. Yuille, Normface: L2 hypersphere embedding Workshops, 2017, pp. 1924–1932.
for face verification, in: Proceedings of the 25th ACM international [138] M. Yang, X. Wang, G. Zeng, L. Shen, Joint and collaborative representation
conference on Multimedia, 2017, pp. 1041–1049. with local adaptive convolution feature for face recognition with single
[111] A. Hasnat, J. Bohné, J. Milgram, S. Gentric, L. Chen, Deepvisage: Making face sample per person, Pattern Recogn. 66 (C) (2016) 117–128.
recognition simple yet with powerful generalization skills, in: Proceedings of [139] S. Guo, S. Chen, Y. Li, Face recognition based on convolutional neural network
the IEEE International Conference on Computer Vision Workshops, 2017, pp. and support vector machine, in: IEEE International Conference on
1682–1691. Information and Automation, 2017, pp. 1787–1792.
[112] Y. Liu, H. Li, and X. Wang, ”Rethinking feature discrimination and [140] H. Jegou, M. Douze, C. Schmid, Product quantization for nearest neighbor
polymerization for large-scale recognition,” arXiv preprint search, IEEE Transactions on Pattern Analysis & Machine Intelligence 33 (1)
arXiv:1710.00870, 2017. (2011) 117.
[113] X. Qi and L. Zhang, ”Face recognition via centralized coordinate learning,” [141] P.J. Grother and L.N. Mei, ”Face recognition vendor test (frvt) performance of
arXiv preprint arXiv:1801.05678, 2018. face identification algorithms nist ir 8009,” NIST Interagency/Internal Report
[114] B. Chen, W. Deng, J. Du, Noisy softmax: Improving the generalization ability (NISTIR) - 8009, 2014.
of dcnn via postponing the early softmax saturation, in: Proceedings of the [142] Z. Ding, Y. Guo, L. Zhang, and Y. Fu, ”One-shot face recognition via generative
IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. learning,” in 2018 13th IEEE International Conference on Automatic Face &
5372–5381. Gesture Recognition (FG 2018). IEEE, 2018, pp. 1–7.
[115] M. Hasnat, J. Bohné, J. Milgram, S. Gentric, L. Chen et al., ”von mises-fisher [143] Y. Guo and L. Zhang, ”One-shot face recognition by promoting
mixture model-based deep learning: Application to face verification,” arXiv underrepresented classes,” arXiv preprint arXiv:1707.05574, 2017.
preprint arXiv:1706.04264, 2017. [144] S.J. Pan, Q. Yang, A survey on transfer learning, IEEE Trans. Knowl. Data Eng.
[116] J. Deng, Y. Zhou, S. Zafeiriou, Marginal loss for deep face recognition, in: 22 (10) (2010) 1345–1359.
Proceedings of the IEEE Conference on Computer Vision and Pattern [145] M. Wang, W. Deng, Deep visual domain adaptation: A survey,
Recognition Workshops, 2017, pp. 60–68. Neurocomputing 312 (2018) 135–153.
[117] Y. Zheng, D.K. Pal, M. Savvides, Ring loss: Convex feature normalization for [146] L. Xiong, J. Karlekar, J. Zhao, J. Feng, S. Pranata, and S. Shen, ”A good practice
face recognition, in: Proceedings of the IEEE conference on computer vision towards top performance of face recognition: Transferred deep feature
and pattern recognition, 2018, pp. 5089–5097. fusion,” arXiv preprint arXiv:1704.00438, 2017.
[118] E.P. Xing, M.I. Jordan, S.J. Russell, and A.Y. Ng, ”Distance metric learning with [147] M. Kan, S. Shan, X. Chen, Bi-shifting auto-encoder for unsupervised domain
application to clustering with side-information,” in Advances in neural adaptation, in: Proceedings of the IEEE international conference on computer
information processing systems, 2003, pp. 521–528. vision, 2015, pp. 3846–3854.
[119] K.Q. Weinberger and L.K. Saul, ”Distance metric learning for large margin [148] Z. Luo, J. Hu, W. Deng, and H. Shen, ”Deep unsupervised domain adaptation
nearest neighbor classification,” Journal of Machine Learning Research, vol. for face recognition,” in 2018 13th IEEE International Conference on
10, no. Feb, pp. 207–244, 2009. Automatic Face & Gesture Recognition (FG 2018). IEEE, 2018, pp. 453–457.
[120] D. Yi, Z. Lei, S. Liao, and S.Z. Li, ”Learning face representation from scratch,” [149] K. Sohn, S. Liu, G. Zhong, X. Yu, M.-H. Yang, M. Chandraker, Unsupervised
arXiv preprint arXiv:1411.7923, 2014. domain adaptation for face recognition in unlabeled videos, in: Proceedings
[121] N. Crosswhite, J. Byrne, C. Stauffer, O. Parkhi, Q. Cao, A. Zisserman, Template of the IEEE International Conference on Computer Vision, 2017, pp. 3210–
adaptation for face verification and identification, Elsevier 79 (2018) 35–48. 3218.
[122] B. Liu, W. Deng, Y. Zhong, M. Wang, J. Hu, X. Tao, Y. Huang, Fair loss: Margin- [150] E. Tzeng, J. Hoffman, K. Saenko, T. Darrell, Adversarial discriminative domain
aware reinforcement learning for deep face recognition, in: Proceedings of adaptation, in: Proceedings of the IEEE conference on computer vision and
the IEEE/CVF International Conference on Computer Vision (ICCV), October pattern recognition, 2017, pp. 7167–7176.
2019. [151] W. AbdAlmageed, Y. Wu, S. Rawls, S. Harel, T. Hassner, I. Masi, J. Choi, J.
[123] H. Liu, X. Zhu, Z. Lei, and S.Z. Li, ”Adaptiveface: Adaptive margin and sampling Lekust, J. Kim, P. Natarajan et al., ”Face recognition using deep multi-pose
for face recognition,” in Proceedings of the IEEE/CVF Conference on Computer representations,” in 2016 IEEE Winter Conference on Applications of
Vision and Pattern Recognition (CVPR), June 2019. Computer Vision (WACV). IEEE, 2016, pp. 1–9.
[124] F. Wang, L. Chen, C. Li, S. Huang, Y. Chen, C. Qian, C. Change Loy, The devil of [152] D. Wang, C. Otto, A.K. Jain, Face search at scale, IEEE Trans. Pattern Anal.
face recognition is in the noise, The European Conference on Computer Vision Mach. Intell. 39 (6) (2017) 1122–1136.
(ECCV) (September 2018). [153] H. Yang and I. Patras, ”Mirror, mirror on the wall, tell me, is the error small?”
[125] C.J. Parde, C. Castillo, M.Q. Hill, Y.I. Colon, S. Sankaranarayanan, J.-C. Chen, and in Proceedings of the IEEE Conference on Computer Vision and Pattern
A.J. O’Toole, ”Deep convolutional neural network features and the original Recognition, 2015, pp. 4685–4693.
image,” arXiv preprint arXiv:1611.01751, 2016. [154] S. Xie, Z. Tu, Holistically-nested edge detection, in: Proceedings of the IEEE
[126] F.N. Iandola, S. Han, M.W. Moskewicz, K. Ashraf, W.J. Dally, and K. Keutzer, international conference on computer vision, 2015, pp. 1395–1403.
”Squeezenet: Alexnet-level accuracy with 50x fewer parameters and 0.5 mb [155] V. Blanz, T. Vetter, Face recognition based on fitting a 3d morphable model,
model size,” arXiv preprint arXiv:1602.07360, 2016. IEEE Trans. Pattern Analysis Machine Intell. 25 (9) (2003) 1063–1074.
[127] A.G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. [156] Z. An, W. Deng, T. Yuan, and J. Hu, ”Deep transfer network with 3d morphable
Andreetto, H. Adam, Mobilenets: Efficient convolutional neural networks for models for face recognition,” in 2018 13th IEEE International Conference on
mobile vision applications, 2017, arXiv preprint arXiv:1704.04861. Automatic Face & Gesture Recognition (FG 2018). IEEE, 2018, pp. 416–422.
[128] F. Chollet, Xception: Deep learning with depthwise separable convolutions, [157] J. Yim, H. Jung, B. Yoo, C. Choi, D. Park, J. Kim, Rotating your face using multi-
in: Proceedings of the IEEE conference on computer vision and pattern task deep neural network, in: Proceedings of the IEEE Conference on
recognition, 2017, pp. 1251–1258. Computer Vision and Pattern Recognition, 2015, pp. 676–684.
[129] X. Zhang, X. Zhou, M. Lin, J. Sun, Shufflenet: An extremely efficient [158] Y. Qian, W. Deng, and J. Hu, ”Task specific networks for identity and face
convolutional neural network for mobile devices, in: Proceedings of the variation,” in 2018 13th IEEE International Conference on Automatic Face &
IEEE conference on computer vision and pattern recognition, 2018, pp. 6848– Gesture Recognition (FG 2018). IEEE, 2018, pp. 271–277.
6856. [159] J. Bao, D. Chen, F. Wen, H. Li, G. Hua, Cvae-gan: fine-grained image generation
[130] B. Zoph and Q.V. Le, ”Neural architecture search with reinforcement through asymmetric training, in: Proceedings of the IEEE international
learning,” arXiv preprint arXiv:1611.01578, 2016. conference on computer vision, 2017, pp. 2745–2754.
[131] E. Real, A. Aggarwal, Y. Huang, and Q.V. Le, ”Aging evolution for image [160] W. Chai, W. Deng, and H. Shen, ”Cross-generating gan for facial identity
classifier architecture search,” in AAAI Conference on Artificial Intelligence, preserving,” in 2018 13th IEEE International Conference on Automatic Face &
2019. Gesture Recognition (FG 2018). IEEE, 2018, pp. 130–134.
[132] L.-C. Chen, M. Collins, Y. Zhu, G. Papandreou, B. Zoph, F. Schroff, H. Adam, and [161] J. Bao, D. Chen, F. Wen, H. Li, G. Hua, Towards open-set identity preserving
J. Shlens, ”Searching for efficient multi-scale architectures for dense image face synthesis, in: Proceedings of the IEEE conference on computer vision and
prediction,” in Advances in neural information processing systems, 2018, pp. pattern recognition, 2018, pp. 6713–6722.
8699–8710. [162] Y. Shen, P. Luo, J. Yan, X. Wang, X. Tang, Faceid-gan: Learning a symmetry
[133] V. Mnih, K. Kavukcuoglu, D. Silver, A.A. Rusu, J. Veness, M.G. Bellemare, A. three-player gan for identity-preserving face synthesis, in: Proceedings of the
Graves, M. Riedmiller, A.K. Fidjeland, G. Ostrovski et al., ‘‘Human-level control IEEE conference on computer vision and pattern recognition, 2018, pp. 821–
through deep reinforcement learning,” nature, vol. 518, no. 7540, pp. 529– 830.
533, 2015. [163] ”Ms-celeb-1m challenge 3,” http://trillionpairs.deepglint.com.
241
M. Wang and W. Deng Neurocomputing 429 (2021) 215–244
[164] A. Nech, I. Kemelmacher-Shlizerman, Level playing field for million scale face [191] B.-C. Chen, C.-S. Chen, and W.H. Hsu, ”Cross-age reference coding for age-
recognition, in: Proceedings of the IEEE Conference on Computer Vision and invariant face recognition and retrieval,” in European conference on
Pattern Recognition, 2017, pp. 7044–7053. computer vision. Springer, 2014, pp. 768–783.
[165] Y. Zhang, W. Deng, M. Wang, J. Hu, X. Li, D. Zhao, D. Wen, Global-local gcn: [192] Y. Wen, Z. Li, Y. Qiao, Latent factor guided convolutional neural networks for
Large-scale label noise cleansing for face recognition, in: Proceedings of the age-invariant face recognition, in: Proceedings of the IEEE conference on
IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. computer vision and pattern recognition, 2016, pp. 4893–4901.
7731–7740. [193] T. Zheng, W. Deng, J. Hu, Age estimation guided convolutional neural network
[166] A. Bansal, C. Castillo, R. Ranjan, R. Chellappa, The do’s and don’ts for cnn- for age-invariant face recognition, in: Proceedings of the IEEE Conference on
based face verification, in: Proceedings of the IEEE International Conference Computer Vision and Pattern Recognition Workshops, 2017, pp. 1–9.
on Computer Vision Workshops, 2017, pp. 2545–2554. [194] ”Fg-net aging database,” http://www.fgnet.rsunit.com.
[167] X. Zhan, Z. Liu, J. Yan, D. Lin, C. Change Loy, Consensus-driven propagation in [195] S. Li, D. Yi, Z. Lei, S. Liao, The casia nir-vis 2.0 face database, in: Proceedings of
massive unlabeled data for face recognition, The European Conference on the IEEE conference on computer vision and pattern recognition workshops,
Computer Vision (ECCV) (September 2018). 2013, pp. 348–353.
[168] P.J. Phillips, ”A cross benchmark assessment of a deep convolutional neural [196] X. Wu, L. Song, R. He, and T. Tan, ”Coupled deep learning for heterogeneous
network for face recognition,” in 2017 12th IEEE International Conference on face recognition,” in Thirty-Second AAAI Conference on Artificial Intelligence,
Automatic Face & Gesture Recognition (FG 2017). IEEE, 2017, pp. 705–710. 2018.
[169] L. Wolf, T. Hassner, and I. Maoz, ‘‘Face recognition in unconstrained videos [197] S.Z. Li, Z. Lei, and M. Ao, ”The hfb face database for heterogeneous face
with matched background similarity,” in CVPR. IEEE, 2011, pp. 529–534. biometrics research,” in 2009 IEEE Computer Society Conference on
[170] P.J. Phillips, J.R. Beveridge, B.A. Draper, G. Givens, A.J. O’Toole, D. Bolme, J. Computer Vision and Pattern Recognition Workshops. IEEE, 2009, pp. 1–8.
Dunlop, Y.M. Lui, H. Sahibzada, S. Weimer, The good, the bad, and the ugly [198] C. Reale, N.M. Nasrabadi, H. Kwon, and R. Chellappa, ‘‘Seeing the forest from
face challenge problem, Image Vis. Comput. 30 (3) (2012) 177–185. the trees: A holistic approach to near-infrared heterogeneous face
[171] I. Hupont and C. Fernández, ”Demogpairs: Quantifying the impact of recognition,” in Proceedings of the IEEE Conference on Computer Vision
demographic imbalance in deep face recognition,” in 2019 14th IEEE and Pattern Recognition Workshops, 2016, pp. 54–62.
International Conference on Automatic Face & Gesture Recognition (FG [199] X. Wang, X. Tang, Face photo-sketch synthesis and recognition, IEEE Trans.
2019). IEEE, 2019, pp. 1–7. Pattern Anal. Mach. Intell. 31 (11) (2009) 1955–1967.
[172] I. Serna, A. Morales, J. Fierrez, M. Cebrian, N. Obradovich, and I. Rahwan, [200] L. Zhang, L. Lin, X. Wu, S. Ding, L. Zhang, End-to-end photo-sketch generation
”Sensitiveloss: Improving accuracy and fairness of face representations with via fully convolutional representation learning, in: Proceedings of the 5th
discrimination-aware deep learning,” arXiv preprint arXiv:2004.11246, 2020. ACM on International Conference on Multimedia Retrieval ACM, 2015, pp.
[173] M. Wang, W. Deng, J. Hu, X. Tao, Y. Huang, Racial faces in the wild: Reducing 627–634.
racial bias by information maximization adaptation network, in: Proceedings [201] W. Zhang, X. Wang, and X. Tang, ‘‘Coupled information-theoretic encoding for
of the IEEE International Conference on Computer Vision, 2019, pp. 692–702. face photo-sketch recognition,” in CVPR 2011. IEEE, 2011, pp. 513–520.
[174] Y. Xu, Y. Cheng, J. Zhao, Z. Wang, L. Xiong, K. Jayashree, H. Tamura, T. Kagaya, [202] L. Wang, V. Sindagi, and V. Patel, ‘‘High-quality facial photo-sketch synthesis
S. Shen, S. Pranata et al., ‘‘High performance large scale face recognition with using multi-adversarial networks,” in 2018 13th IEEE international
multi-cognition softmax and feature retrieval,” in Proceedings of the IEEE conference on automatic face & gesture recognition (FG 2018). IEEE, 2018,
International Conference on Computer Vision Workshops, 2017, pp. 1898– pp. 83–90.
1906. [203] A. Savran, N. Alyüz, H. Dibeklioğlu, O. Çeliktutan, B. Gökberk, B. Sankur, and L.
[175] C. Wang, X. Zhang, X. Lan, How to train triplet networks with 100k Akarun, ”Bosphorus database for 3d face analysis,” in European Workshop on
identities?, in: Proceedings of the IEEE International Conference on Biometrics and Identity Management. Springer, 2008, pp. 47–56.
Computer Vision Workshops, 2017, pp 1907–1915. [204] D. Kim, M. Hernandez, J. Choi, and G. Medioni, ‘‘Deep 3d face identification,”
[176] Y. Wu, H. Liu, Y. Fu, Low-shot face recognition with hybrid classifiers, in: in 2017 IEEE international joint conference on biometrics (IJCB). IEEE, 2017,
Proceedings of the IEEE International Conference on Computer Vision pp. 133–142.
Workshops, 2017, pp. 1933–1939. [205] L. Yin, X. Wei, Y. Sun, J. Wang, M.J. Rosato, ”A 3d facial expression database for
[177] P.J. Phillips, H. Wechsler, J. Huang, P.J. Rauss, The feret database and facial behavior research,” in 7th international conference on automatic face
evaluation procedure for face-recognition algorithms, Image & Vision and gesture recognition (FGR06), IEEE (2006) 211–216.
Computing J 16 (5) (1998) 295–306. [206] P.J. Phillips, P.J. Flynn, T. Scruggs, K.W. Bowyer, J. Chang, K. Hoffman, J.
[178] A.M. Martinez, ”The ar face database,” CVC Technical Report 24, 1998. Marques, J. Min, and W. Worek, ‘‘Overview of the face recognition grand
[179] J.R. Beveridge, P.J. Phillips, D.S. Bolme, B.A. Draper, G.H. Givens, Y.M. Lui, M.N. challenge,” in 2005 IEEE computer society conference on computer vision and
Teli, H. Zhang, W.T. Scruggs, K.W. Bowyer et al., ”The challenge of face pattern recognition (CVPR’05), vol. 1. IEEE, 2005, pp. 947–954.
recognition from digital point-and-shoot cameras,” in 2013 IEEE Sixth [207] G. Guo, L. Wen, S. Yan, Face authentication with makeup changes, IEEE Trans.
International Conference on Biometrics: Theory, Applications and Systems Circuits Syst. Video Technol. 24 (5) (2014) 814–825.
(BTAS). IEEE, 2013, pp. 1–8. [208] Y. Li, L. Song, X. Wu, R. He, and T. Tan, ”Anti-makeup: Learning a bi-level
[180] Y. Kim, W. Park, M.-C. Roh, and J. Shin, ”Groupface: Learning latent groups adversarial network for makeup-invariant face verification,” in Thirty-Second
and constructing group-based representations for face recognition,” in IEEE/ AAAI Conference on Artificial Intelligence, 2018.
CVF Conference on Computer Vision and Pattern Recognition (CVPR), June [209] J. Hu, Y. Ge, J. Lu, and X. Feng, ”Makeup-robust face verification,” in 2013 IEEE
2020. International Conference on Acoustics, Speech and Signal Processing. IEEE,
[181] T. Zheng, W. Deng, ”Cross-pose lfw: A database for studying cross-pose face 2013, pp. 2342–2346.
recognition in unconstrained environments,” Beijing University of Posts and [210] Z. Zhang, J. Yan, S. Liu, Z. Lei, D. Yi, and S.Z. Li, ‘‘A face antispoofing database
Telecommunications, Tech. Rep. 18–01 (February 2018). with diverse attacks,” in 2012 5th IAPR international conference on
[182] S. Sengupta, J.-C. Chen, C. Castillo, V.M. Patel, R. Chellappa, and D.W. Jacobs, Biometrics (ICB). IEEE, 2012, pp. 26–31.
”Frontal to profile face verification in the wild,” in 2016 IEEE Winter [211] Y. Atoum, Y. Liu, A. Jourabloo, and X. Liu, ”Face anti-spoofing using patch and
Conference on Applications of Computer Vision (WACV). IEEE, 2016, pp. 1–9. depth-based cnns,” in 2017 IEEE International Joint Conference on Biometrics
[183] W. Deng, J. Hu, N. Zhang, B. Chen, J. Guo, Fine-grained face verification: Fglfw (IJCB). IEEE, 2017, pp. 319–328.
database, baselines, and human-dcmn partnership, Pattern Recogn. 66 (2017) [212] I. Chingovska, A. Anjos, and S. Marcel, ‘‘On the effectiveness of local binary
63–73. patterns in face anti-spoofing,” in 2012 BIOSIG-proceedings of the
[184] A. Bansal, A. Nanduri, C.D. Castillo, R. Ranjan, and R. Chellappa, ”Umdfaces: An international conference of biometrics special interest group (BIOSIG). IEEE,
annotated face dataset for training deep networks,” in 2017 IEEE 2012, pp. 1–7.
International Joint Conference on Biometrics (IJCB). IEEE, 2017, pp. 464–473. [213] J. Huo, W. Li, Y. Shi, Y. Gao, and H. Yin, ”Webcaricature: a benchmark for
[185] Y. Rao, J. Lu, J. Zhou, Attention-aware deep reinforcement learning for video caricature face recognition,” arXiv preprint arXiv:1703.03230, 2017.
face recognition, in: Proceedings of the IEEE international conference on [214] V. Kushwaha, M. Singh, R. Singh, M. Vatsa, N. Ratha, and R. Chellappa,
computer vision, 2017, pp. 3931–3940. ”Disguised faces in the wild,” in IEEE Conference on Computer Vision and
[186] M. Kim, S. Kumar, V. Pavlovic, and H. Rowley, ”Face tracking and recognition Pattern Recognition Workshops, vol. 8, 2018.
with visual constraints in real-world videos,” in 2008 IEEE Conference on [215] K. Cao, Y. Rong, C. Li, X. Tang, C. Change Loy, Pose-robust face recognition via
Computer Vision and Pattern Recognition. IEEE, 2008, pp. 1–8. deep residual equivariant mapping, in: Proceedings of the IEEE Conference on
[187] Y. Rao, J. Lin, J. Lu, J. Zhou, Learning discriminative aggregation network for Computer Vision and Pattern Recognition, 2018, pp. 5187–5196.
video-based face recognition, in: Proceedings of the IEEE international [216] G. Chen, Y. Shao, C. Tang, Z. Jin, and J. Zhang, ”Deep transformation learning
conference on computer vision, 2017, pp. 3781–3790. for face recognition in the unconstrained scene,” Machine Vision and
[188] T. Zheng, W. Deng, and J. Hu, ”Cross-age lfw: A database for studying cross- Applications, pp. 1–11, 2018.
age face recognition in unconstrained environments,” arXiv preprint [217] J. Zhao, Y. Cheng, Y. Xu, L. Xiong, J. Li, F. Zhao, K. Jayashree, S. Pranata, S. Shen,
arXiv:1708.08197, 2017. J. Xing et al., ‘‘Towards pose invariant face recognition in the wild,” in
[189] K. Ricanek and T. Tesafaye, ”Morph: A longitudinal image database of normal Proceedings of the IEEE conference on computer vision and pattern
adult age-progression,” in 7th International Conference on Automatic Face recognition, 2018, pp. 2207–2216.
and Gesture Recognition (FGR06). IEEE, 2006, pp. 341–345. [218] C. Nhan Duong, K. Gia Quach, K. Luu, N. Le, M. Savvides, Temporal non-
[190] L. Lin, G. Wang, W. Zuo, X. Feng, L. Zhang, Cross-domain visual matching via volume preserving approach to facial age-progression and age-invariant face
generalized similarity measure and feature learning, IEEE Trans. Pattern Anal. recognition, in: Proceedings of the IEEE International Conference on
Machine Intell. 39 (6) (2016) 1089–1102. Computer Vision, 2017, pp. 3735–3743.
242
M. Wang and W. Deng Neurocomputing 429 (2021) 215–244
[219] Z. Wang, X. Tang, W. Luo, S. Gao, Face aging with identity-preserved [247] T. Kim, M. Cha, H. Kim, J. Lee, and J. Kim, ”Learning to discover cross-domain
conditional generative adversarial networks, in: Proceedings of the IEEE relations with generative adversarial networks,” arXiv preprint
conference on computer vision and pattern recognition, 2018, pp. 7939– arXiv:1703.05192, 2017.
7947. [248] J.-Y. Zhu, T. Park, P. Isola, A.A. Efros, Unpaired image-to-image translation
[220] G. Antipov, M. Baccouche, and J.-L. Dugelay, ‘‘Face aging with conditional using cycle-consistent adversarial networks, in: Proceedings of the IEEE
generative adversarial networks,” in 2017 IEEE international conference on international conference on computer vision, 2017, pp. 2223–2232.
image processing (ICIP). IEEE, 2017, pp. 2089–2093. [249] S. Hong, W. Im, J. Ryu, and H.S. Yang, ”Sspp-dan: Deep domain adaptation
[221] G. Antipov, M. Baccouche, and J.-L. Dugelay, ”Boosting cross-age face network for face recognition with single sample per person,” in 2017 IEEE
verification via generative age normalization,” in 2017 IEEE International International Conference on Image Processing (ICIP). IEEE, 2017, pp. 825–
Joint Conference on Biometrics (IJCB). IEEE, 2017, pp. 191–199. 829.
[222] H. Yang, D. Huang, Y. Wang, A.K. Jain, Learning face age progression: A [250] J. Choe, S. Park, K. Kim, J. Hyun Park, D. Kim, H. Shim, Face generation for low-
pyramid architecture of gans, in: Proceedings of the IEEE conference on shot learning using generative adversarial networks, in: Proceedings of the
computer vision and pattern recognition, 2018, pp. 31–39. IEEE International Conference on Computer Vision Workshops, 2017, pp.
[223] S. Bianco, Large age-gap face verification by feature injection in deep 1940–1948.
networks, Pattern Recogn. Lett. 90 (2017) 36–42. [251] X. Yin, X. Yu, K. Sohn, X. Liu, M. Chandraker, Feature transfer learning for face
[224] H. El Khiyari, H. Wechsler, Age invariant face recognition using convolutional recognition with under-represented data, in: Proceedings of the IEEE
neural networks and set distances, J. Inform. Security 8 (03) (2017) 174. Conference on Computer Vision and Pattern Recognition, 2019, pp. 5704–
[225] X. Wang, Y. Zhou, D. Kong, J. Currey, D. Li, and J. Zhou, ”Unleash the black 5713.
magic in age: a multi-task deep neural network approach for cross-age face [252] J. Lu, G. Wang, W. Deng, P. Moulin, J. Zhou, Multi-manifold deep metric
verification,” in 2017 12th IEEE International Conference on Automatic Face learning for image set classification, in: Proceedings of the IEEE conference on
& Gesture Recognition (FG 2017). IEEE, 2017, pp. 596–603. computer vision and pattern recognition, 2015, pp. 1137–1145.
[226] Y. Li, G. Wang, L. Nie, Q. Wang, W. Tan, Distance metric optimization driven [253] J. Zhao, J. Han, L. Shao, Unconstrained face recognition using a set-to-set
convolutional neural network for age invariant face recognition, Pattern distance measure on deep learned features, IEEE Trans. Circuits Syst. Video
Recogn. 75 (2018) 51–62. Technol. (2017).
[227] Y. Sun, L. Ren, Z. Wei, B. Liu, Y. Zhai, S. Liu, A weakly supervised method for [254] N. Bodla, J. Zheng, H. Xu, J.-C. Chen, C. Castillo, and R. Chellappa, ‘‘Deep
makeup-invariant face verification, Pattern Recogn. 66 (2017) 153–159. heterogeneous feature fusion for template-based face recognition,” in 2017
[228] M. Singh, M. Chawla, R. Singh, M. Vatsa, R. Chellappa, Disguised faces in the IEEE winter conference on applications of computer vision (WACV). IEEE,
wild 2019, in: Proceedings of the IEEE International Conference on Computer 2017, pp. 586–595.
Vision Workshops, 2019. [255] M. Hayat, M. Bennamoun, S. An, Learning non-linear reconstruction models
[229] M. Singh, R. Singh, M. Vatsa, N.K. Ratha, R. Chellappa, Recognizing disguised for image set classification, in: Proceedings of the IEEE Conference on
faces in the wild, IEEE Trans. Biometrics, Behavior, Identity Sci. 1 (2) (2019) Computer Vision and Pattern Recognition, 2014, pp. 1907–1914.
97–108. [256] X. Liu, B. Vijaya Kumar, C. Yang, Q. Tang, J. You, Dependency-aware attention
[230] K. Zhang, Y.-L. Chang, W. Hsu, Deep disguised faces recognition, in: control for unconstrained face recognition with image sets, The European
Proceedings of the IEEE Conference on Computer Vision and Pattern Conference on Computer Vision (ECCV) (September 2018).
Recognition Workshops, 2018, pp. 32–36. [257] C. Ding, D. Tao, Trunk-branch ensemble convolutional neural networks for
[231] N. Kohli, D. Yadav, A. Noore, Face verification with disguise variations via video-based face recognition, IEEE Trans. Pattern Analysis Machine Intell.
deep disguise recognizer, in: Proceedings of the IEEE Conference on (2017).
Computer Vision and Pattern Recognition Workshops, 2018, pp. 17–24. [258] M. Parchami, S. Bashbaghi, E. Granger, and S. Sayed, ”Using deep
[232] E. Smirnov, A. Melnikov, A. Oleinik, E. Ivanova, I. Kalinovskiy, E. Luckyanets, autoencoders to learn robust domain-invariant representations for still-to-
Hard example mining with auxiliary embeddings, in: Proceedings of the IEEE video face recognition,” in 2017 14th IEEE International Conference on
Conference on Computer Vision and Pattern Recognition Workshops, 2018, Advanced Video and Signal Based Surveillance (AVSS). IEEE, 2017, pp. 1–6.
pp. 37–46. [259] S. Zulqarnain Gilani, A. Mian, Learning from millions of 3d scans for large-
[233] E. Smirnov, A. Melnikov, S. Novoselov, E. Luckyanets, G. Lavrentyeva, scale 3d face recognition, in: Proceedings of the IEEE Conference on
Doppelganger mining for face representation learning, in: Proceedings of Computer Vision and Pattern Recognition, 2018, pp. 1896–1905.
the IEEE International Conference on Computer Vision Workshops, 2017, pp. [260] J. Zhang, Z. Hou, Z. Wu, Y. Chen, and W. Li, ”Research of 3d face recognition
1916–1923. algorithm based on deep learning stacked denoising autoencoder theory,” in
[234] S. Suri, A. Sankaran, M. Vatsa, and R. Singh, ”On matching faces with 2016 8th IEEE International Conference on Communication Software and
alterations due to plastic surgery and disguise,” in 2018 IEEE 9th Networks (ICCSN). IEEE, 2016, pp. 663–667.
International Conference on Biometrics Theory, Applications and Systems [261] L. He, H. Li, Q. Zhang, Z. Sun, and Z. He, ”Multiscale representation for partial
(BTAS). IEEE, 2018, pp. 1–7. face recognition under near infrared illumination,” in IEEE International
[235] S. Saxena and J. Verbeek, ”Heterogeneous face recognition with cnns,” in Conference on Biometrics Theory, Applications and Systems, 2016, pp. 1–7.
European conference on computer vision. Springer, 2016, pp. 483–491. [262] L. He, H. Li, Q. Zhang, and Z. Sun, ”Dynamic feature learning for partial face
[236] X. Liu, L. Song, X. Wu, and T. Tan, ”Transferring deep representation for nir-vis recognition,” in The IEEE Conference on Computer Vision and Pattern
heterogeneous face recognition,” in 2016 International Conference on Recognition (CVPR), June 2018.
Biometrics (ICB). IEEE, 2016, pp. 1–8. [263] O. Tadmor, Y. Wexler, T. Rosenwein, S. Shalev-Shwartz, and A. Shashua,
[237] J. Lezama, Q. Qiu, G. Sapiro, Not afraid of the dark: Nir-vis face recognition via ”Learning a metric embedding for face recognition using the multibatch
cross-spectral hallucination and low-rank embedding, in: Proceedings of the method,” arXiv preprint arXiv:1605.07270, 2016.
IEEE conference on computer vision and pattern recognition, 2017, pp. 6628– [264] S. Han, H. Mao, and W.J. Dally, ”Deep compression: Compressing deep neural
6637. networks with pruning, trained quantization and huffman coding,” arXiv
[238] R. He, X. Wu, Z. Sun, and T. Tan, ”Learning invariant deep representation for preprint arXiv:1510.00149, 2015.
nir-vis face recognition,” in Thirty-First AAAI Conference on Artificial [265] S. Han, J. Pool, J. Tran, and W. Dally, ”Learning both weights and connections
Intelligence, 2017. for efficient neural network,” in Advances in neural information processing
[239] R. He, X. Wu, Z. Sun, T. Tan, Wasserstein cnn: Learning invariant features for systems, 2015, pp. 1135–1143.
nir-vis face recognition, IEEE Trans. Pattern Analysis Machine Intell. 41 (7) [266] Z. Liu, J. Li, Z. Shen, G. Huang, S. Yan, and C. Zhang, ”Learning efficient
(2018) 1761–1773. convolutional networks through network slimming,” in Computer Vision
[240] L. Song, M. Zhang, X. Wu, R. He, ”Adversarial discriminative heterogeneous (ICCV), 2017 IEEE International Conference on. IEEE, 2017, pp. 2755–2763.
face recognition,” in Thirty-Second AAAI Conference on Artificial Intelligence, [267] M. Courbariaux, I. Hubara, D. Soudry, R. El-Yaniv, and Y. Bengio, ”Binarized
2018. neural networks: Training deep neural networks with weights and
[241] E. Zangeneh, M. Rahmati, Y. Mohsenzadeh, Low resolution face recognition activations constrained to+ 1 or-1,” arXiv preprint arXiv:1602.02830, 2016.
using a two-branch deep convolutional neural network architecture, Expert [268] I. Hubara, M. Courbariaux, D. Soudry, R. El-Yaniv, and Y. Bengio, ”Binarized
Syst. Appl. 139 (2020) 112854. neural networks,” in Advances in neural information processing systems,
[242] Z. Shen, W.-S. Lai, T. Xu, J. Kautz, M.-H. Yang, Deep semantic face deblurring, 2016, pp. 4107–4115.
in: Proceedings of the IEEE Conference on Computer Vision and Pattern [269] M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi, ”Xnor-net: Imagenet
Recognition, 2018, pp. 8260–8269. classification using binary convolutional neural networks,” in European
[243] P. Mittal, M. Vatsa, and R. Singh, ”Composite sketch recognition via deep Conference on Computer Vision. Springer, 2016, pp. 525–542.
network-a transfer learning approach,” in 2015 International Conference on [270] M. Courbariaux, Y. Bengio, and J.-P. David, ”Binaryconnect: Training deep
Biometrics (ICB). IEEE, 2015, pp. 251–256. neural networks with binary weights during propagations,” in Advances in
[244] C. Galea, R.A. Farrugia, Forensic face photo-sketch recognition using a deep neural information processing systems, 2015, pp. 3123–3131.
learning-based architecture, IEEE Signal Process. Lett. 24 (11) (2017) 1586– [271] Q. Li, S. Jin, and J. Yan, ”Mimicking very efficient network for object
1590. detection,” in 2017 IEEE Conference on Computer Vision and Pattern
[245] D. Zhang, L. Lin, T. Chen, X. Wu, W. Tan, E. Izquierdo, Content-adaptive sketch Recognition (CVPR). IEEE, 2017, pp. 7341–7349.
portrait generation by decompositional representation learning, IEEE Trans. [272] Y. Wei, X. Pan, H. Qin, W. Ouyang, J. Yan, Quantization mimic: Towards very
Image Process. 26 (1) (2017) 328–339. tiny cnn for object detection, in: Proceedings of the European Conference on
[246] Z. Yi, H. Zhang, P. Tan, M. Gong, Dualgan: Unsupervised dual learning for Computer Vision (ECCV), 2018, pp. 267–283.
image-to-image translation, in: Proceedings of the IEEE international [273] J. Yang, Z. Lei, and S.Z. Li, ”Learn convolutional neural network for face anti-
conference on computer vision, 2017, pp. 2849–2857. spoofing,” arXiv preprint arXiv:1408.5601, 2014.
243
M. Wang and W. Deng Neurocomputing 429 (2021) 215–244
[274] Z. Xu, S. Li, and W. Deng, ”Learning temporal features using lstm-cnn the 2016 ACM SIGSAC Conference on Computer and Communications
architecture for face anti-spoofing,” in 2015 3rd IAPR Asian Conference on Security ACM, 2016, pp. 1528–1540.
Pattern Recognition (ACPR). IEEE, 2015, pp. 141–145. [297] Y. Gurovich, Y. Hanani, and e. a. Bar, Omri, ‘‘Identifying facial phenotypes of
[275] L. Li, X. Feng, Z. Boulkenafet, Z. Xia, M. Li, and A. Hadid, ”An original face anti- genetic disorders using deep learning,” Nature Medicine, vol. 25, pp. 60 – 64,
spoofing approach using partial convolutional neural network,” in 2016 Sixth 2019.
International Conference on Image Processing Theory, Tools and Applications [298] V. Mirjalili and A. Ross, ”Soft biometric privacy: Retaining biometric utility of
(IPTA). IEEE, 2016, pp. 1–6. face images while perturbing gender,” in Biometrics (IJCB), 2017 IEEE
[276] K. Patel, H. Han, and A.K. Jain, ”Cross-database face antispoofing with robust International Joint Conference on. IEEE, 2017, pp. 564–573.
feature representation,” in Chinese Conference on Biometric Recognition. [299] V. Mirjalili, S. Raschka, A. Namboodiri, and A. Ross, ”Semi-adversarial
Springer, 2016, pp. 611–619. networks: Convolutional autoencoders for imparting privacy to face
[277] A. Jourabloo, Y. Liu, and X. Liu, ”Face de-spoofing: Anti-spoofing via noise images,” in 2018 International Conference on Biometrics (ICB). IEEE, 2018,
modeling,” in The European Conference on Computer Vision (ECCV), pp. 82–89.
September 2018. [300] P.J. Phillips, A.N. Yates, Y. Hu, C.A. Hahn, E. Noyes, K. Jackson, J.G. Cavazos, G.
[278] R. Shao, X. Lan, P.C. Yuen, Joint discriminative learning of deep dynamic Jeckeln, R. Ranjan, S. Sankaranarayanan et al., ”Face recognition accuracy of
textures for 3d mask face anti-spoofing, IEEE Trans. Inf. Forensics Secur. 14 forensic examiners, superrecognizers, and face recognition algorithms,”
(4) (2019) 923–938. Proceedings of the National Academy of Sciences, p. 201721355, 2018.
[279] R. Shao, X. Lan, and P.C. Yuen, ”Deep convolutional dynamic texture learning [301] S. Gong, V.N. Boddeti, and A.K. Jain, ”On the capacity of face representation,”
with adaptive channel-discriminability for 3d mask face anti-spoofing,” in arXiv preprint arXiv:1709.10433, 2017.
Biometrics (IJCB), 2017 IEEE International Joint Conference on. IEEE, 2017, pp. [302] R. Singh, M. Vatsa, H.S. Bhatt, S. Bharadwaj, A. Noore, S.S. Nooreyezdan,
748–755. Plastic surgery: A new dimension to face recognition, IEEE Trans. Inf.
[280] G. Goswami, N. Ratha, A. Agarwal, R. Singh, and M. Vatsa, ”Unravelling Forensics Secur. 5 (3) (2010) 441–448.
robustness of deep learning based face recognition against adversarial [303] A. Ross and A.K. Jain, ‘‘Multimodal biometrics: An overview,” in Signal
attacks,” arXiv preprint arXiv:1803.00401, 2018. Processing Conference, 2004 12th European. IEEE, 2004, pp. 1221–1224.
[281] A. Goel, A. Singh, A. Agarwal, M. Vatsa, and R. Singh, ”Smartbox: [304] A.A. Ross and R. Govindarajan, ‘‘Feature level fusion of hand and face
Benchmarking adversarial detection and mitigation algorithms for face biometrics,” in Biometric Technology for Human Identification II, vol. 5779.
recognition,” IEEE BTAS, 2018. International Society for Optics and Photonics, 2005, pp. 196–205.
[282] G. Mai, K. Cao, P.C. Yuen, A.K. Jain, On the reconstruction of face images from [305] A. Ross, A. Jain, Information fusion in biometrics, Pattern Recogn. Lett. 24 (13)
deep face templates, IEEE Trans. Pattern Analysis Machine Intell. 41 (5) (2003) 2115–2125.
(2018) 1188–1202.
[283] M. Wang, W. Deng, Mitigating bias in face recognition using skewness-aware
reinforcement learning, in: Proceedings of the IEEE/CVF Conference on
Mei Wang received the B.E. degree in information and communication engineering from the Dalian University of Technology (DUT), Dalian, China, in 2013, and the M.E. degree in communication engineering from the Beijing University of Posts and Telecommunications (BUPT), Beijing, China, in 2016. Since September 2018, she has been a Ph.D. student in the School of Information and Communication Engineering, BUPT. Her research interests include pattern recognition and computer vision, with a particular emphasis on deep face recognition and transfer learning.

Weihong Deng received the B.E. degree in information engineering and the Ph.D. degree in signal and information processing from the Beijing University of Posts and Telecommunications (BUPT), Beijing, China, in 2004 and 2009, respectively. From October 2007 to December 2008, he was a postgraduate exchange student in the School of Information Technologies, University of Sydney, Australia. He is currently a professor in the School of Information and Telecommunications Engineering, BUPT. His research interests include statistical pattern recognition and computer vision, with a particular emphasis on face recognition. He has published over 100 technical papers in international journals and conferences, such as IEEE TPAMI and CVPR. He serves as an associate editor for IEEE Access, as a guest editor for the Image and Vision Computing Journal, and as a reviewer for dozens of international journals, such as IEEE TPAMI/TIP/TIFS/TNNLS/TMM/TSMC, IJCV, and PR/PRL. His dissertation, "Highly accurate face recognition algorithms," was awarded the Outstanding Doctoral Dissertation Award by the Beijing Municipal Commission of Education in 2011. He has been supported by the Program for New Century Excellent Talents of the Ministry of Education of China (2013) and the Beijing Nova Program (2016).