Von Mises-Fisher Mixture Model-Based Deep Learning: Application To Face Verification

Abstract—A number of pattern recognition tasks, e.g., face verification, can be boiled down to classification or clustering of unit length directional feature vectors whose distance can simply be computed by their angle. In this paper, we propose the von Mises-Fisher (vMF) mixture model as the theoretical foundation for an effective deep learning of such directional features and derive a novel vMF Mixture Loss and its corresponding vMF deep features. The proposed vMF feature learning achieves the characteristics of discriminative learning, i.e., compacting the instances of the same class while increasing the distance between instances of different classes. Moreover, it subsumes a number of popular loss functions as well as an effective method used in deep learning, namely normalization. We conduct extensive experiments on face verification using 4 different challenging face datasets, i.e., LFW, YouTube Faces, CACD and IJB-A. The results show the effectiveness and excellent generalization ability of the proposed approach, which achieves state-of-the-art results on the LFW, YouTube Faces and CACD datasets and competitive results on the IJB-A dataset.

Index Terms—Deep Learning, Face Recognition, Mixture Model, von Mises-Fisher distribution.

1 INTRODUCTION

A number of pattern recognition tasks, e.g., face recognition (FR), can be boiled down to supervised classification or unsupervised clustering of unit length feature vectors whose distance can simply be computed by their angle, i.e., the cosine distance. In deep learning based FR, numerous methods find it useful to unit-normalize the final feature vectors, e.g., [14], [38], [48], [65], [66], [68], [73]. Besides, the widely used and simple softmax loss has been extended with additional or reinforced supervising signals, e.g., the center loss [68] and the large-margin softmax loss [39], to further enable discriminative learning, i.e., compacting intra-class instances while repulsing inter-class instances, and thereby increase the final recognition accuracy. However, the great success of these methods and practices remains unclear from a theoretical viewpoint, which motivates us to study the deep feature representation from a theoretical perspective.

Statistical Mixture Models (MM) are a common tool to perform probabilistic clustering and are widely used in data mining and machine learning [44]. MM play a key role in model-based clustering [6], which assumes a generative model, i.e., each observation is a sample from a finite mixture of probability distributions. In this paper, we adopt the theoretical concept of MM to model the deep feature representation task and establish the relationship between the model parameters and the deep features.

Unit length normalized feature vectors are directional features which only keep the orientations of the data features as discriminative information while ignoring their magnitude. In this case, a simple angle measurement, e.g., the cosine distance, can be used as the dissimilarity measure between two data points and provides a very intuitive geometric interpretation of similarity [17]. In this paper, we propose to model the (deep) features delivered by deep neural nets, e.g., CNN-based neural networks, as a mixture of von Mises-Fisher distributions, also called a vMF Mixture Model (vMFMM). The von Mises-Fisher (vMF) is a fundamental probability distribution which has been successfully used in numerous unsupervised classification tasks [3], [17], [21]. By combining this vMFMM with deep neural networks, we derive a novel loss function, namely the vMF mixture loss (vMFML), which enables discriminative learning. Figure 1(a) (from right to left) provides an illustration of the proposed model. Figure 1(b) shows1 the discriminative nature of the proposed model, i.e., the learned features of each class are compacted whereas the inter-class features are repulsed.

To demonstrate the effectiveness of the proposed method, we carried out extensive experiments on the face recognition (FR) task, on which recent deep learning-based methods [36], [45], [54], [55], [58] have surpassed human-level performance. We used 4 different challenging face datasets, namely LFW [25] for single image based face verification, IJB-A [30] for face template matching, YouTube Faces (YTF) [69] for video face matching, and CACD [7] for cross-age face matching. Using only one deep CNN model trained on the MS-Celeb dataset [19], the proposed method achieves 99.63% accuracy on LFW, 85% TAR@FAR=0.001 on IJB-A [30], 96.46% accuracy on YTF [69] and 99.2% accuracy on CACD [7]. These results indicate that our method generalizes very well across different datasets, as it achieves state-of-the-art results on the LFW, YouTube Faces and CACD datasets and competitive results on the IJB-A dataset.

Laboratoire LIRIS, École centrale de Lyon, 69134 Ecully, France.
Safran Identity & Security, 92130 Issy-les-Moulineaux, France.
e-mail: [email protected], [email protected], [email protected], [email protected], [email protected]

1. We visualize the features of the MNIST digits in the 3D space (similar illustration with a 2D plot in [68]). The CNN is composed of 6 convolution, 2 pool and 1 FC (with 2 neurons for 2D visualization) layers. We optimize it using the softmax loss and the proposed vMFML.
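As a concrete illustration of this setting, verifying a face pair reduces to unit-normalizing the two deep feature vectors and thresholding their cosine similarity. A minimal sketch (numpy assumed; the feature extractor and the threshold value are placeholders, not part of the paper's pipeline):

```python
import numpy as np

def verify_pair(feat_a, feat_b, threshold=0.5):
    """Decide 'same identity' for two deep feature vectors by
    unit-normalizing them and thresholding the cosine similarity."""
    a = feat_a / np.linalg.norm(feat_a)
    b = feat_b / np.linalg.norm(feat_b)
    return float(a @ b) >= threshold  # a @ b = cos(angle between a and b)
```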
with the concentration (κ) parameter and requires no supplementary loss function; see Section 3.3 for details. [38], [39] proposed the large-margin softmax loss by incorporating an intuitive margin on the classification boundary, which can be explained by our method under certain conditions; see Section 3.3 for details.

2.2 Face Recognition

Face recognition (FR) is a widely studied problem in which remarkable results have been achieved by the recent deep CNN based methods on several standard FR benchmark datasets, such as the Labeled Faces in the Wild (LFW) [25], YouTube Faces (YTF) [69] and the IARPA Janus Benchmark A (IJB-A) [30]. In Table 1, we study2 these methods and decompose their key aspects as: (a) CNN model design; (b) objective/loss functions; (c) fine-tuning and additional learning methods; (d) multi-modal input and number of CNNs and (e) use of the training database.

Recent deep CNN based FR methods tend to adopt (directly or with slight modifications) the famous CNN architectures [18] proposed for the ImageNet [50] challenge. The CNN Info column of Table 1 provides the details of the CNN models used by different FR methods. The AlexNet model is used by [1], [40], [42], [49], [52], [53], [54], the VGGNet model by [1], [13], [14], [42], [43], [45], [57], the GoogleNet model by [54], [71] and the ResNet model by [20], [38], [48], [66], [68], [76]. Besides the famous CNN models, [73] proposed a simpler CNN model which is used by [9], [14], [65]. CNNs of lower depth have been used by [55], [56], [58], [59], [68], where the model complexity is increased with locally connected convolutional layers. Parallel CNNs have been employed by [36], [78] to simultaneously learn features from different facial regions. We adopt the ResNet [24] based deeper CNN model.

FR methods are trained with different loss functions, see the Loss function column in Table 1. Most FR methods learn the facial feature representation model by training the CNN for identity classification. For this purpose, the softmax loss [18] is used to optimize the classification objective, which requires facial images with associated identity labels. Several variants of the softmax loss [38], [39], [48], [66] have recently been proposed to enhance its discriminability. Besides, the contrastive loss [11], [18] is used by [55], [56], [57], [58], [71] for face verification, which requires similar/dissimilar face image pairs and similarity labels. Moreover, the triplet loss [54] is used by [13], [36], [45], [54] for face verification and requires face triplets (i.e., anchor, positive, negative). Recently, [66] proposed different formulations of the contrastive and triplet losses such that they can be trained with only the identity labels. Our proposed vMFML simply learns the features via identity classification and requires only the class labels. Moreover, the theoretical foundation of vMFML provides an interesting interpretation of and relationships with the very recently proposed softmax-based loss functions, such as [38], [39], [48], [66]; see Sect. 3.3 for details.

CNN training with multiple loss functions has been adopted by several methods, where the losses are optimized jointly [49], [55], [56], [57], [68], [76] or sequentially (optimizing the softmax loss followed by the other loss) [13], [36], [45], [58], [71]. However, multiple loss optimization not only requires additional effort for training data preparation but also complicates the CNN training procedure [20]. Our FR method optimizes a single loss and does not need extra effort.

After training the CNN model, a second learning method is often incorporated by different FR methods, see the Additional Learning column of Table 1. CNN fine-tuning is a particular form of transfer learning commonly employed by several methods [9], [52], [65]. It consists of updating the trained CNN parameters using target-specific training data. In order to adapt the CNN features to particular FR tasks, several methods apply an additional transformation on the CNN features based on a metric/distance learning strategy. The Joint Bayesian method [8] is a popular metric learning method used by [9], [14], [55], [56], [57], [65], [73]. Specific embedding learning methods [52], [53] have been proposed to learn from face triplets. Template adaptation [13] is another strategy, which incorporates an SVM classifier and extends the FR tasks to videos and templates [30]. A different approach learns an aggregation module [71] to compute the similarity scores between two sets of video frames. The principal component analysis (PCA) technique is commonly used [1], [42], [43] to learn a dataset-specific projection matrix. The above methods often need to prepare their training data from the target-specific datasets. Moreover, they [52], [53] may need to carefully prepare the training data, e.g., triplets. Our FR method does not incorporate any such additional learning strategies.

FR methods often accumulate features from a set of independently trained CNN models to construct rich facial descriptors, see the # of CNNs column of Table 1. These CNN models are trained with different types of inputs: (a) image crops based on different facial regions (eyes, nose, lips, etc.) [14], [55], [56], [57], [65]; (b) different image modalities, such as 2D, 3D, frontalized and synthesized faces at different poses [1], [14], [42], [58] and (c) different training databases with varying numbers of images [36], [59]. Our FR method does not apply these approaches and trains a single CNN model.

A large facial image database is significantly important to achieve high accuracy on FR [20], [54], [78]. The Dataset Info column of Table 1 provides the information on the training datasets used by different FR methods. Currently, several FR datasets [19], [73] are publicly available. Among them, CASIA-WebFace [73] has been widely used by the recent methods [1], [9], [14], [40], [42], [43], [49], [52], [53], [65], [68], [70], [73] to train a CNN with a medium sized database. Besides, the recently released MSCELEB [19] dataset, which provides the largest collection of facial images and identities, has become the standard choice for large scale training. We exploit it for developing our FR method.

2. We only consider the deep CNN based methods. For the others, we refer readers to the survey [32] for LFW and to [30] for IJB-A.

3 METHODOLOGY

In this section, first we present the statistical feature representation model and then discuss the facial feature representation learning method w.r.t. the model. Finally, we present the complete face recognition pipeline.
TABLE 1: Overview of the state-of-the-art FR methods. In the second column (CNN Info), the shorthand notations mean: C: convolutional layer, FC: fully connected layer, LC: locally connected layer and L: loss layer. In the third column (Loss Function), the shorthand notations mean: TL: triplet loss, SL: softmax loss, CL: contrastive loss, CCL: C-contrastive loss, CeL: center loss, ASL: angular softmax loss and RL: range loss. In the fourth column (Additional Learning), the shorthand notations mean: JB: joint Bayesian, TSE: triplet similarity embedding, TPE: triplet probability embedding, TA: template adaptation, Agg: aggregation module and X: no metric learning. We list the methods in increasing order of the number of convolution and FC layers in the CNN model.

FR system | CNN Info | Loss function | Additional learning | # of CNNs | Dataset Info
DeepID2+ [56] | 4-C, 1-FC, 2-L | SL, CL | JB | 25 | 0.29M, 12K
Deepface [58] | 2-C, 3-LC, 1-FC, 1-L | SL, CL | X | 3 | 4.4M, 4K
Webscale [59] | 2-C, 3-LC, 2-FC, 1-L | SL | X | 4 | 4.5M, 55K
Center Loss [68] | 3-C, 3-LC, 1-FC, 2-L | SL, CeL | X | 1 | 0.7M, 17.2K
FV-TSE [52] | 6-C, 2-FC, 1-L | SL | TSE | 1 | 0.49M, 10.5K
FV-TPE [53] | 7-C, 2-FC, 1-L | SL | TPE | 1 | 0.49M, 10.5K
VIPLFaceNet [40] | 7-C, 2-FC, 1-L | SL | X | 1 | 0.49M, 10.57K
All-In-One [49] | 7-C, 2-FC, 8-L | SL | TPE | 1 | 0.49M, 10.5K
CASIA-Webface [73] | 10-C, 1-L | SL | JB | 1 | 0.49M, 10.5K
FSS [65] | 10-C, 1-L | SL | JB | 9 | 0.49M, 10.57K
Unconstrained FV [9] | 10-C, 1-FC, 1-L | SL | JB | 1 | 0.49M, 10.5K
Sparse ConvNet [57] | 10-C, 1-FC, 1-L | SL, CL | JB | 25 | 0.29M, 12K
FaceNet [54] | 11-C, 3-FC, 1-L | TL | X | 1 | 200M, 8M
DeepID3 [55] | 8-C, 4-FC, 2-LC, 2-L | SL, CL | JB | 25 | 0.3M, 13.1K
MM-DFR [14] | 10-C, 2-FC, 1-L and 12-C, 2-FC, 1-L | SL | JB | 8 | 0.49M, 10.5K
VGG Face [45] | 13-C, 3-FC, 2-L | SL, TL | X | 1 | 2.6M, 2.6K
FV-TA [13] | 13-C, 3-FC, 2-L | SL, TL | TA | 1 | 2.6M, 2.6K
MFM-CNN [70] | 14-C, 2-FC, 1-L | SL | X | 1 | 0.49M, 10.5K
Face-Aug-Pose-Syn [43] | 16-C, 3-FC, 1-L | SL | PCA | 1 | 0.49M, 10.57K
Deep Multipose [1] | 16-C, 3-FC, 1-L and 5-C, 2-FC, 1-L | SL | PCA | 12 | 0.4M, 10.5K
Pose aware FR [42] | 16-C, 3-FC, 1-L | SL | PCA | 5 | 0.49M, 10.5K
Range Loss [76] | 27-C, 1-FC, 2-L | SL, RL | X | 1 | 1.5M, 100K
DeepVisage [20] | 27-C, 1-FC, 1-L | SL | X | 1 | 4.48M, 62K
NormFace [66] | 27-C, 1-FC, 2-L | SL, CCL | X | 1 | 0.49M, 10.5K
Megvii [78] | 4 × 10, 1-L | SL | X | 1 | 5M, 0.2M
Baidu [36] | 4 × 9-C, 2-L | SL, TL | X | 10 | 1.2M, 1.8K
NAN [71] | 57-C, 5-FC, 3-L | SL, CL | Agg | 1 | 3M, 50K
SphereFace [38] | 64-C, 1-FC, 1-L | ASL | X | 1 | 0.49M, 10.5K
L2-Softmax [48] | 100-C, 1-FC, 1-L | L2-SL | X | 1 | 0.49M, 10.5K
3.1 Model and Method

3.1.1 Statistical Feature Representation (SFR) Model

We propose the SFR model based on the generative model-based approach [44], where the facial features are issued from a finite statistical mixture of probability distributions. Then, these features are transformed into the 2D image space using a transformer. See Appendix A for the experimental proof of concept of the proposed SFR model with MNIST [33] digit experiments.

Figure 1(a) (from right to left) provides an illustration of the SFR model, which considers a mixture of von Mises-Fisher (vMF) distributions [41] to model the facial features from different identities/classes. The vMF distribution [41] is parameterized with the mean direction µ (shown as solid lines) and the concentration κ (indicating the spread of the feature points around the solid line). For the i-th facial image features x_i, called vMF features, we define the SFR model with M classes as:

$$\mathrm{SFR}(x_i \mid \Theta_M) = \sum_{j=1}^{M} \pi_j\, V_d(x_i \mid \mu_j, \kappa_j) \qquad (1)$$

where π_j, µ_j and κ_j denote respectively the mixing proportion, mean direction and concentration value of the j-th class, Θ_M is the set of model parameters and V_d(.) is the density function of the vMF distribution (see Section 3.2 for details).

The SFR model makes an equal privilege assumption for the classes, i.e., each class j has an equal appearance probability π and is distributed with the same concentration value κ. This assumption is important for discriminative learning, to make sure that the supervised classifier is not biased toward any particular class regardless of the number of samples and the amount of variation present in the training data for each class. On the other hand, µ_j plays a significant role in preserving each identity in its respective sub-space. Therefore, the generative SFR model can be used for discriminative learning tasks, which can be viewed by reversing the directions in Figure 1(a), i.e., the information then flows from left to right. Next, we discuss the discriminative feature learning task w.r.t. the SFR model in detail.

3.1.2 vMF Feature Learning (vMF-FL) Method

Figure 2 illustrates the workflow of the vMF-FL method, where the features are learned using an object identity classifier. See Appendix A for an extended view of its relationship with the SFR model. The vMF-FL method consists of two sub-tasks: (1) mapping input 2D object images to vMF features using the CNN model, which we use as the inverse-transformer w.r.t. the SFR model and (2) classifying features…

Fig. 2: Workflow of the vMF-FL method. Top: block diagram; bottom: view with an example.
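To make the generative (right-to-left) reading of the SFR model concrete, the sketch below draws unit-norm features from a small vMF mixture using Wood's (1994) rejection sampler. It is an illustrative stand-alone snippet under our own assumptions (numpy; toy class means), not part of the proposed training pipeline:

```python
import numpy as np

def sample_vmf(mu, kappa, n, rng):
    """Draw n samples from vMF(mu, kappa) on S^{d-1} via Wood's (1994)
    rejection sampler: sample the component w along mu, then a uniform
    tangent direction, and combine them."""
    d = mu.size
    mu = mu / np.linalg.norm(mu)
    b = (-2 * kappa + np.sqrt(4 * kappa**2 + (d - 1) ** 2)) / (d - 1)
    x0 = (1 - b) / (1 + b)
    c = kappa * x0 + (d - 1) * np.log(1 - x0**2)
    w = []
    while len(w) < n:  # rejection step for w = mu^T x
        z = rng.beta((d - 1) / 2, (d - 1) / 2)
        t = (1 - (1 + b) * z) / (1 - (1 - b) * z)
        if kappa * t + (d - 1) * np.log(1 - x0 * t) - c >= np.log(rng.uniform()):
            w.append(t)
    w = np.array(w)
    v = rng.standard_normal((n, d))
    v -= (v @ mu)[:, None] * mu                    # project out the mu part
    v /= np.linalg.norm(v, axis=1, keepdims=True)  # uniform tangent direction
    return w[:, None] * mu + np.sqrt(1 - w**2)[:, None] * v

# Generative SFR view (Eq. 1): pick a class j, then draw x ~ vMF(mu_j, kappa).
rng = np.random.default_rng(0)
mus = np.linalg.qr(rng.standard_normal((3, 3)))[0]  # 3 toy class means on S^2
X = np.vstack([sample_vmf(m, kappa=50.0, n=100, rng=rng) for m in mus])
```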
$$f^{C,l}_{x,y,k} = {w^{l}_{k}}^{T} f^{Op,l-1}_{x,y} + b^{l}_{k} \qquad (2)$$

where w^l_k and b^l_k are the shared weights and bias, C denotes convolution and Op (for l > 1) denotes various operations, such as convolution, sub-sampling or activation. For l = 1, Op represents the input image. Sub-sampling or pooling performs a simple local operation, e.g., computing the maximum in a local neighborhood followed by reducing the spatial resolution:

$$f^{P,l}_{x,y,k} = \max_{(m,n)\in N_{x,y}} f^{Op,l-1}_{m,n,k} \qquad (3)$$

where N_{x,y} is the local neighborhood and P denotes pooling. In order to ensure the non-linearity of the network, the feature maps are passed through a non-linear activation function, e.g., the Rectified Linear Unit (ReLU) [18], [23]: f^l_{x,y,k} = max(f^{l-1}_{x,y,k}, 0).

At the basic level, a CNN is constructed by stacking a series of convolution, activation and pooling layers [33]. Often a fully connected (FC) layer is placed at the end, which connects all neurons of the previous layer to all neurons of the next layer.

CNN models are trained by optimizing a loss function. The softmax loss, which is widely used for classification, has the following form:

$$L_{Softmax} = -\sum_{i=1}^{N} \log \frac{\exp(w_{y_i}^T f_i + b_{y_i})}{\sum_{l=1}^{M} \exp(w_l^T f_i + b_l)} \qquad (4)$$

where f_i and y_i are the features and true class label of the i-th image, w_j and b_j denote the weights and bias of the j-th class, and N and M denote the number of training samples and the number of classes.

3.2 SFR model and von Mises-Fisher Mixture Loss (vMFML)

Our proposed SFR model assumes that the facial features are unit vectors distributed according to a mixture of vMFs. By combining the SFR model and the CNN, the vMF-FL method provides a novel loss function, called the von Mises-Fisher Mixture Loss (vMFML). Below we provide its formulation.

3.2.1 vMF Mixture Model (vMFMM)

For a d dimensional random unit vector x = [x_1, ..., x_d]^T ∈ S^{d-1} ⊂ R^d (i.e., ||x||_2 = 1), the density function of the vMF distribution is defined as [41]:

$$V_d(x \mid \mu, \kappa) = C_d(\kappa)\, \exp(\kappa \mu^T x) \qquad (5)$$

where µ denotes the mean (with ||µ||_2 = 1) and κ denotes the concentration parameter (with κ ≥ 0). $C_d(\kappa) = \frac{\kappa^{d/2-1}}{(2\pi)^{d/2} I_{d/2-1}(\kappa)}$ is the normalization constant, where I_ρ(.) is the modified Bessel function of the first kind. The shape of the vMF distribution depends on the value of the concentration parameter κ. For a high value of κ, i.e., highly concentrated features, the distribution has a mode at the mean direction µ. On the contrary, for low values of κ the distribution tends to uniform, i.e., the samples appear to be uniformly distributed on the sphere. The top row of Figure 3 illustrates examples of 3D samples on the S^2 sphere, distributed according to the vMF distribution with different values of the concentration κ.

Let X = {x_i}_{i=1,...,N} be a set of samples, where N is the total number of samples. For the i-th sample x_i, the vMFMM with M classes is defined3 as [3]: $g_v(x_i \mid \Theta_M) = \sum_{j=1}^{M} \pi_j V_d(x_i \mid \mu_j, \kappa_j)$, where Θ_M = {(π_1, µ_1, κ_1), ..., (π_M, µ_M, κ_M)} is the set of parameters and π_j is the mixing proportion of the j-th class.

3. Note that this is equivalent to the definition of the SFR model in Eq. 1.
The bottom row of Figure 3 shows samples from vMFMMs with different κ values.

The vMFMM has been used for unsupervised classification [3] to cluster data on the unit hyper-sphere, where the model is estimated with the Expectation Maximization (EM) method. The objective of the EM method is to estimate the model parameters such that the negative log-likelihood of the vMFMM, i.e., −log(g_v(X|Θ_M)), is minimized. The EM method estimates the posterior probability in the E-step as [3]:

$$p_{ij} = \frac{\pi_j C_d(\kappa_j) \exp(\kappa_j \mu_j^T x_i)}{\sum_{l=1}^{M} \pi_l C_d(\kappa_l) \exp(\kappa_l \mu_l^T x_i)} \qquad (6)$$

and the model parameters in the M-step as [3]:

$$\pi_j = \frac{1}{N}\sum_{i=1}^{N} p_{ij}, \quad \hat{\mu}_j = \sum_{i=1}^{N} p_{ij}\, x_i, \quad \bar{r} = \frac{\|\hat{\mu}_j\|}{N \pi_j}, \quad \mu_j = \frac{\hat{\mu}_j}{\|\hat{\mu}_j\|}, \quad \kappa_j = \frac{\bar{r} d - \bar{r}^3}{1 - \bar{r}^2} \qquad (7)$$
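A compact sketch of one EM iteration for the vMFMM (Eqs. 6 and 7), vectorized over the M classes; numpy/scipy are assumed, and performing the E-step in the log domain is our own implementation choice for numerical stability:

```python
import numpy as np
from scipy.special import ive

def vmfmm_em_step(X, pi, mu, kappa):
    """One EM iteration for a vMF mixture. X: (N, d) unit rows;
    pi: (M,), mu: (M, d) unit rows, kappa: (M,)."""
    N, d = X.shape
    nu = d / 2.0 - 1.0
    log_cd = nu * np.log(kappa) - (d / 2.0) * np.log(2.0 * np.pi) \
             - (np.log(ive(nu, kappa)) + kappa)
    # E-step (Eq. 6): posterior p_ij, computed in the log domain
    log_p = np.log(pi) + log_cd + X @ (kappa * mu.T)
    log_p -= log_p.max(axis=1, keepdims=True)
    p = np.exp(log_p)
    p /= p.sum(axis=1, keepdims=True)
    # M-step (Eq. 7)
    pi = p.sum(axis=0) / N
    mu_hat = p.T @ X                                   # unnormalized means
    r = np.linalg.norm(mu_hat, axis=1) / (N * pi)
    mu = mu_hat / np.linalg.norm(mu_hat, axis=1, keepdims=True)
    kappa = (r * d - r**3) / (1.0 - r**2)
    return pi, mu, kappa, p
```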
3.2.2 von Mises-Fisher Mixture Loss (vMFML) and optimization

Our vMF-FL method aims to learn discriminative facial features by minimizing the classification loss. Within this (supervised classification) context, we set our objective to minimize the cross entropy guided by the vMFMM. Therefore, we rewrite the posterior probability based on the equal privilege assumption of the SFR model as:

$$p_{ij} = \frac{\exp(\kappa \mu_j^T x_i)}{\sum_{l=1}^{M} \exp(\kappa \mu_l^T x_i)} \qquad (8)$$

Now we can exploit this posterior/conditional probability to minimize the cross entropy and define the loss function, called vMFML, as:

$$L_{vMFML} = -\sum_{i=1}^{N}\sum_{j=1}^{M} y_{ij}\log(p_{ij}) = -\sum_{i=1}^{N}\log\frac{\exp(\kappa\mu_{y_i}^T x_i)}{\sum_{l=1}^{M}\exp(\kappa\mu_l^T x_i)} = -\sum_{i=1}^{N}\log\frac{e^{z_{iy_i}}}{\sum_{l=1}^{M} e^{z_{il}}}, \qquad \big[z_{ij} = \kappa\mu_j^T x_i\big] \qquad (9)$$

where y_ij is the true class probability and we set y_ij = 1 for the true class j = y_i (and 0 otherwise), as we only know the true class labels. Now, by comparing vMFML with the softmax loss (Eq. 4), we observe the following differences: (a) vMFML uses unit normalized features: x = f/||f||; (b) the mean parameter is related to the softmax weight as: µ = w/||w||; (c) it has no bias term and (d) it has an additional parameter κ.

Now, we observe that the proposed vMF-FL method modifies the CNN training by replacing the softmax loss with the vMFML. Therefore, to learn the parameters we can follow the standard CNN model learning procedure, i.e., iteratively learn through forward and backward propagation [33]. This requires us to compute the gradients of vMFML w.r.t. the parameters. By following the chain rule, we can compute the gradients (we consider a single sample and drop the subscript i for brevity) as:

$$\frac{\partial L}{\partial p_j} = -\frac{y_j}{p_j}; \quad \frac{\partial p_l}{\partial z_j} = \begin{cases} p_j(1-p_j) & l = j \\ -p_j\, p_l & l \neq j \end{cases}; \quad \frac{\partial L}{\partial z_j} = p_j - y_j \qquad (10)$$

$$\frac{\partial z_j}{\partial \kappa} = \mu_j^T x; \quad \frac{\partial z_j}{\partial \mu_{jd}} = \kappa\, x_d; \quad \frac{\partial z_j}{\partial x_d} = \kappa\, \mu_{jd} \qquad (11)$$

$$\frac{\partial x_d}{\partial f_d} = \frac{\|f\|^2 - f_d^2}{\|f\|^3} = \frac{1 - x_d^2}{\|f\|}; \quad \frac{\partial x_r}{\partial f_d} = \frac{-f_d f_r}{\|f\|^3} = \frac{-x_d x_r}{\|f\|} \qquad (12)$$

$$\frac{\partial \mu_d}{\partial w_d} = \frac{\|w\|^2 - w_d^2}{\|w\|^3} = \frac{1 - \mu_d^2}{\|w\|}; \quad \frac{\partial \mu_r}{\partial w_d} = \frac{-w_d w_r}{\|w\|^3} = \frac{-\mu_d \mu_r}{\|w\|} \qquad (13)$$

$$\frac{\partial L}{\partial \kappa} = \sum_{j=1}^{M} (p_j - y_j)\, \mu_j^T x; \quad \frac{\partial L}{\partial \mu_{jd}} = (p_j - y_j)\, \kappa\, x_d \qquad (14)$$

$$\frac{\partial L}{\partial x_d} = \sum_{j=1}^{M} (p_j - y_j)\, \kappa\, \mu_{jd}; \quad \frac{\partial L}{\partial f_d} = \frac{1}{\|f\|}\left(\frac{\partial L}{\partial x_d} - \sum_{r} \frac{\partial L}{\partial x_r}\, x_d x_r\right) \qquad (15)$$

3.3 Interpretation and discussion

The proposed SFR model represents each identity/class (i.e., face) with the mean (µ) and concentration (κ) parameters of the vMF distribution, which (unlike a weight and a bias) express a direct relationship with the respective identity: µ provides an expected representation (e.g., mean facial features) of the identity and κ (independently computed) indicates the variation within the samples of the identity. See Appendix A for the illustration from the MNIST digits based experiment, where the images generated from µ_j effectively show the ability of vMF-FL to learn a representative image of each class.

In terms of discriminative feature learning [39], [68], we can interpret the effectiveness of vMFML by analyzing the shape of the vMF distributions and vMFMMs in Figure 3 based on the κ value. While for a high κ value the features are closely located around the mean direction µ, for low κ values they spread randomly and can be located far from µ. We observe that κ also plays an important role in separating the vMFMM samples of different classes. A higher κ value will enforce the features to be more concentrated around µ, which minimizes the intra-class variations (reducing the angular distances between samples and the mean) and maximizes the inter-class distances (see Figures 3 and 1(b)). Therefore, unlike [68] (which jointly optimizes two losses: softmax and center loss), we4 can learn discriminative features by optimizing a single loss function and save M × D parameters, where D is the feature dimension.

The formulation of vMFML naturally provides an interpretation of and relationships with several concurrently proposed loss functions [38], [39], [48], [66]. In Eq. 9, by using a higher κ value for the true class compared to the rest, i.e., κ_{y_i} > κ_{j≠y_i}, vMFML can formulate the large-margin softmax loss [39] and the A(angular)-softmax loss [38] under certain conditions5. The L2-softmax loss [48] is similar to vMFML…

4. See Appendix B for further details.
5. See Appendix B for further details.
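In a modern autograd framework the loss takes only a few lines, and backpropagation reproduces the hand-derived gradients of Eqs. 10-15; a minimal sketch (PyTorch assumed; the κ initialization value is illustrative):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VMFMixtureLoss(nn.Module):
    """A minimal sketch of vMFML (Eq. 9): cross-entropy over logits
    z_ij = kappa * mu_j^T x_i with ||x_i|| = ||mu_j|| = 1. The class means
    are stored as unconstrained weights w_j and normalized on the fly
    (mu_j = w_j / ||w_j||); kappa is a single shared scalar."""
    def __init__(self, feat_dim, num_classes, kappa_init=16.0):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(num_classes, feat_dim))
        self.kappa = nn.Parameter(torch.tensor(kappa_init))

    def forward(self, features, labels):
        x = F.normalize(features, dim=1)       # x = f / ||f||
        mu = F.normalize(self.weight, dim=1)   # mu = w / ||w||
        logits = self.kappa * (x @ mu.t())     # z = kappa * mu^T x, no bias
        return F.cross_entropy(logits, labels)
```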
9. We take the list of 5.05M faces provided by [70] and keep the non-overlapping (with the test sets) identities which have at least 30 images after successful landmark detection.

10. Note that we do not compare with the contrastive loss [11], as the joint softmax+center loss (JSCL) [68] has been shown to be more efficient.
TABLE 2: Comparison of different loss functions evaluated w.r.t. different CNN architectures and training datasets. The LFW [25] verification accuracy is used as the measure of performance. Loss functions: vMF Mixture Loss (vMFML), Softmax loss, joint softmax+center loss (JSCL) [68] and softmax followed by triplet loss (STL) [54].

Database | CNN | Softmax | JSCL | STL | vMFML
Casia | CasiaNet | 97.50 | 97.60 | 97.93 | 98.00
Casia | Res-27 | 97.40 | 98.87 | 98.13 | 99.18
MSCeleb-1M | Res-27 | 98.50 | 99.28 | 98.83 | 99.63
MSCeleb-1M | CasiaNet | 98.21 | 98.33 | 98.42 | 98.65

TABLE 3: Comparison of the state-of-the-art methods evaluated on the LFW benchmark [25].

FR method | # of CNNs | Dataset Info | Acc %
vMF-FL (proposed) | 1 | 4.51M, 61.24K | 99.63
Baidu [36] | 10 | 1.2M, 1.8K | 99.77
Baidu [36] | 1 | 1.2M, 1.8K | 99.13
L2-Softmax [48] | 1 | 3.7M, 58.2K | 99.78
FaceNet [54] | 1 | 200M, 8M | 99.63
DeepVisage [20] | 1 | 4.48M, 62K | 99.62
RangeLoss [76] | 1 | 1.5M, 100K | 99.52
Sparse ConvNet [57] | 25 | 0.29M, 12K | 99.55
DeepID3 [55] | 25 | 0.29M, 12K | 99.53
Megvii [78] | 4 | 5M, 0.2M | 99.50
LF-CNNs [67] | 25 | 0.7M, 17.2K | 99.50
DeepID2+ [56] | 25 | 0.29M, 12K | 99.47
SphereFace [38] | 1 | 0.49M, 10.57K | 99.42
Center Loss [68] | 1 | 0.7M, 17.2K | 99.28
NormFace [66] | 1 | 0.49M, 10.5K | 99.19
MM-DFR [14] | 8 | 0.49M, 10.57K | 99.02
VGG Face [45] | 1 | 2.6M, 2.6K | 98.95
MFM-CNN [70] | 1 | 5.1M, 79K | 98.80
VIPLFaceNet [40] | 1 | 0.49M, 10.57K | 98.60
Webscale [59] | 4 | 4.5M, 55K | 98.37
AAL [72] | 1 | 0.49M, 10.57K | 98.30
FSS [65] | 9 | 0.49M, 10.57K | 98.20
Face-Aug-Pose-Syn [43] | 1 | 2.4M, 10.57K | 98.06
CASIA-Webface [73] | 1 | 0.49M, 10.57K | 97.73
Unconstrained FV [9] | 1 | 0.49M, 10.5K | 97.45
Deepface [58] | 3 | 4.4M, 4K | 97.35

…ison. It consists of 13,233 images from 5,749 identities. Fig. 6(a) illustrates some face images from this dataset. The FR task requires verifying 6000 image pairs, which are equally divided into genuine and impostor pairs and comprise in total 7.7K images from 4,281 identities. The LFW evaluation requires face verification on 10 pre-specified folds and reports the average accuracy. It has different evaluation protocols; following the recent trend, we adopt the unrestricted-labeled-outside-data protocol. Table 3 presents the results of our method in comparison with other state-of-the-art methods.

As can be seen in Table 3, our method achieves very competitive accuracy (99.63%) and ranks among the top performers, despite the fact that: (a) L2-Softmax [48] used11 a 100-layer CNN model (3.7 times deeper/larger compared to ours) to achieve 99.78%; (b) Baidu [36] combines 10 CNNs to obtain 99.77%, whereas we use a single CNN and (c) FaceNet [54] used 200M images of 8M identities to obtain 99.63%, whereas we train our CNN with only 4.51M images and 61.24K identities. It is interesting to note that our LFW result (99.63%) is obtained by training the CNN on the cleaned MSCeleb dataset from which the LFW-overlapping identities were removed. After removing the incorrectly labeled pairs within LFW (see the Errata on the LFW webpage), our proposed vMF-FL further achieves 99.68% accuracy. In light of the results from [36], [48], [55], [56], [57], [67], [78], the performance of the proposed method could be further improved by combining features from multiple CNN models or by using deeper (e.g., 100 layers) CNNs. However, Table 3 suggests saturation of the LFW results, as most of the methods already surpass human performance (97.53%) and differ very little in performance. Besides, it raises the debate of whether we should judge a method w.r.t. the real world FR scenario [35] based on the matching of only 6K pairs.

To overcome this limitation, we follow the BLUFR LFW protocol [35], which exploits all LFW images, verifies 47M pairs per trial and provides the measure of the true accept rate (TAR) at a low false accept rate (FAR). In Table 4, we provide our result for the verification rate (VR) at FAR=0.1% and compare it with the other methods which reported results under this protocol. As can be seen in Table 4, our method displays the best performance in comparison with the results of the state-of-the-art FR methods published so far.

Besides evaluating with different performance measures, we further challenge our proposed FR method on LFW from the age-invariant perspective and benchmark it on the recently released variant of LFW, called the Cross-Age LFW (CALFW) dataset [77]. CALFW is constructed by re-organizing (via crowdsourcing efforts) the LFW verification pairs with apparent age gaps (as large as possible) to form the positive pairs and then selecting negative pairs among individuals of the same race and gender (see Fig. 6(b) for some illustrations). Similar to LFW, the CALFW evaluation consists of verifying 6000 pairs (equally divided into genuine and impostor pairs) of images in 10 pre-specified folds and reporting the average accuracy. Table 5 summarizes the experimental results in comparison with the state of the art. As can be seen from Table 5, our method provides the best results compared to the other available results from different methods.

TABLE 4: Comparison of the state-of-the-art methods based on the BLUFR LFW protocol [35].

Method | VR@FAR=0.1%
vMF-FL (proposed) | 99.10
DeepVisage [20] | 98.65
NormFace [66] | 95.83
Center Loss [68] | 92.97
FSS [65] | 89.80
CASIA [73] | 80.26

TABLE 5: Evaluation on the CALFW [77] dataset. Results of the competing methods are obtained from the paper [77].

Method | Acc %
vMF-FL (proposed) | 94.20
VGG-Face | 86.50
Noisy Softmax | 82.52

The results in Tables 3, 4 and 5 confirm the remarkable performance of vMF-FL on the LFW dataset. Next, we explore the FR task on videos and evaluate our method on the YouTube Faces [69] dataset.

11. They achieved 99.60% when using the same CNN model that we used.
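The TAR@FAR operating points reported above can be computed directly from the raw pair scores; a minimal sketch (numpy assumed; genuine and impostor stand for hypothetical arrays of cosine similarities of same-identity and different-identity pairs):

```python
import numpy as np

def tar_at_far(genuine, impostor, far=1e-3):
    """True accept rate at a fixed false accept rate: set the decision
    threshold at the (1 - far) quantile of the impostor scores, then
    measure how many genuine pairs pass it."""
    thr = np.quantile(impostor, 1.0 - far)
    return float(np.mean(genuine >= thr)), float(thr)
```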
TABLE 6: Comparison of the state-of-the-art methods evaluated on the YouTube Faces [69] dataset. Ad.Tr. denotes that additional training is used.

TABLE 7: Comparison of the state-of-the-art methods evaluated on the CACD [7] dataset. The VGG [45] result is obtained from [70].

TABLE 8: Comparison of the state-of-the-art methods evaluated on the IJB-A benchmark [30]. '-' indicates that the information for the entry is unavailable. Methods which incorporate external training (ExTr) or CNN fine-tuning (FT) with IJB-A training data are separated with a horizontal line. The VGG-Face result was provided by [52]. T@F denotes the True Accept Rate at a fixed False Accept Rate (TAR@FAR).

TABLE 9: Analysis of the influence of training databases with different sizes and numbers of classes (columns: minimum samples per identity; size, classes; Acc %; T@F at 0.01). T@F denotes the True Accept Rate at a fixed False Accept Rate (TAR@FAR).
4.3.4 Feature normalization

Feature normalization [20], [28], [66] plays a significant role in the performance of deep CNN models. As discussed in Section 3.3, vMFML naturally integrates feature normalization due to the SFR model (Sect. 3.1.1) and the vMF distribution (Sect. 3.2.1). The features of our FR method are unit-normalized vectors, i.e., ||x|| = 1. We observed that, when learned with the proposed vMFML, these normalized features provide significantly better results than the un-normalized features learned with the Softmax loss; see Table 2 for the performance comparison of Softmax vs. vMFML.

The promising results achieved by the unit-normalized features (learned with vMFML) naturally raise the question: can the unit-normalized features improve the accuracy with any loss function? To answer this question, we train the CNN with the unit-normalized features ||x|| = 1 and optimize the softmax loss under different settings in order to compare them w.r.t. vMFML. To explain the settings, we consider the terms within the exponential of the numerators, i.e., w^T x + b_{y_i} (Softmax) vs. κµ^T x (vMFML). Our observations are:

• w^T x + b_{y_i}: provides very poor results, because the CNN training fails to converge; it gets stuck at arbitrary local minima even with a range of different learning rates (from 0.1 to 0.0001) and with or without applying L2 regularization. Compared to vMFML, the expectation from this setting is that the learned w values are similar to the values observed from κµ^T, such that ||w|| = κ. That means, this setting verifies whether the learned weights can absorb κ within them. However, as the observed results suggest, it fails to do so.

• w^T (xκ) + b_{y_i}: the CNN successfully trains and provides results close to those of vMFML. For this modified Softmax, κ acts as a scalar multiplier, whereas for vMFML κ signifies the shared concentration parameter value. Interestingly, this modified Softmax is the same as the L2-Softmax loss recently proposed by [48], which is motivated by the fact that the L2-norm of the features (learned with the Softmax loss) provides interesting information about the face image quality and attributes. Intuitively, the idea is to give the same attention to all face images regardless of their quality. Moreover, they interpret the multiplier (here κ) as a constraint forcing the features to lie on a hypersphere of a fixed radius. Note that, if the bias is ignored, then the difference between vMFML and L2-Softmax reduces to w vs. µ. Besides, while the softmax loss imposes no constraints13 on the weights w, vMFML applies a natural constraint on µ, i.e., ||µ|| = 1.

Therefore, the key finding from the above observations is that the unit-normalized features alone are not sufficient, and the concentration parameter κ of vMFML plays a significant role in efficiently learning the CNN models.

13. Except L2 regularization, which is an explicit setting in many ML problems.
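The three settings above can be written as logit functions over unit-normalized features; a sketch (PyTorch assumed; w, b and kappa stand for the learned per-class weights, biases and the shared concentration):

```python
import torch
import torch.nn.functional as F

def logits_softmax(w, b, f):
    """Plain softmax logits on unit features: w^T x + b (the setting that
    fails to converge in the experiments above)."""
    x = F.normalize(f, dim=1)
    return x @ w.t() + b

def logits_scaled(w, b, f, kappa):
    """Modified softmax w^T (x * kappa) + b: kappa as a scalar multiplier,
    equivalent in form to the L2-Softmax loss [48]."""
    x = F.normalize(f, dim=1)
    return kappa * (x @ w.t()) + b

def logits_vmfml(w, f, kappa):
    """vMFML logits kappa * mu^T x: no bias, and the natural constraint
    ||mu|| = 1 is enforced by normalizing the weights."""
    x = F.normalize(f, dim=1)
    mu = F.normalize(w, dim=1)
    return kappa * (x @ mu.t())
```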
4.3.5 Limitations of the proposed method

Finally, we investigate the limitations of the proposed method on the different datasets by observing the face image pairs for which the verification results are incorrect. Table 10 provides the information about the number and type of incorrect cases, which indicates a higher ratio of false rejections compared to false acceptances. Figure 6 provides illustrations of the top (selected based on the distance from the threshold) examples of failure cases on the different datasets. From an in-depth analysis of the erroneous results, our observations are as follows:

• In the LFW [25] failure cases, occlusion, variation of illumination and poor image resolution played an important role. From the erroneous CALFW [77] pairs, we observe that poor image resolution appears as a common property. Besides, the falsely rejected pairs suffer from high age differences.

• Failure cases on the YTF [69] dataset can be characterized by variations of illumination and poor image resolution. Besides, high pose variation plays an important role.

• Most of the CACD incorrect results occurred due to falsely rejecting similar face image pairs, which indicates that our method encounters difficulties in recognizing the same person when the age difference is large. Besides, we observe that variations of illumination commonly appear in the incorrect results.

• The large number of errors on the IJB-A dataset can be characterized by high pose (mostly profile images, yaw angle larger than 60 degrees) and poor image resolution. Unfortunately, both of these issues cause the failure of face and landmark detection, which forced us to leave a large number of images without applying any pre-processing. Note that our training dataset does not contain any image for which the pre-processor failed to detect the face and landmarks. Therefore, our method may perform poorly in case of a failure of the pre-processor.

From the above observations, we can particularly focus on several issues to handle in future work, such as: (a) image resolution; (b) extreme pose variation; (c) lighting normalization and (d) occlusions.

Poor image resolution is a common issue among the failure cases on the different datasets. This observation is in line with the recent research by Best-Rowden and Jain [4], which proposed a method to automatically predict face image quality. They found that the results on the IJB-A [30] dataset can be improved by removing low-quality faces before computing the similarity scores. Therefore, we can apply this approach in certain FR cases where multiple images are available for each identity. However, this approach will not work when only a single image is available per identity or when all available images have poor quality. In order to address this, we can incorporate a face image super-resolution based technique [26] as a part of our pre-processor. The idea is to apply super-resolution to those images which have a very low image quality score, computed by techniques such as [4].

Large facial pose causes degradation of the FR performance [27], [43]. It has been addressed by recent research [1], [42], [43], which proposed to overcome this problem by frontalization [58] and by creating a larger training dataset with synthesized facial images over a range of facial poses [1], [43]. However, the performance with the above approaches…
APPENDIX B
RELATIONSHIP WITH DIFFERENT LOSS FUNCTIONS AND NORMALIZATION METHOD

Fig. 7: (a) Illustration of the vMF-AE method with MNIST digits, where the vMF-FL (encoder/inverse-transformer) learns discriminative 3D features and a vMFMM for the 10 digit classes. In the 3D plot, each dot represents a feature and each line represents the mean (µ_j) of the respective class j. The images around the 3D sphere illustrate the images generated from the respective µ_j. (b) Illustration of the 2D images generated by the transformer/decoder of the SFR model, where col-1 represents the original image, col-2 shows the image generated from the encoded features of the original image, col-3 shows the image generated from the vMF mean (µ_j) of each class j, and col-4 to col-11 show 8 images per digit class j generated from the vMF samples (drawn using the mean (µ_j) and concentration (κ_j) of each class).

Center Loss [68] aims to enhance feature discrimination by minimizing the intra-class mean-squared distances. It has the following form [68]:

$$L_{Center} = \frac{1}{2}\sum_{i=1}^{N} \|f_i - c_{y_i}\|^2 \qquad (16)$$

where f_i and y_i are the features and ground-truth class label of the i-th sample, and c_{y_i} is the center of class y_i. By comparing Eq. 16 and Eq. 9, we see that vMFML incorporates the cosine/angular similarity with the term µ_{y_i}^T x_i, where µ_{y_i} = c_{y_i}/||c_{y_i}|| and x = f/||f||, and uses it to compute the posterior probability followed by computing the loss. Note that, the higher the cosine similarity, the higher the probability and hence the lower the loss. Put differently, the loss in Eq. 9 is minimized when the cosine similarity between the sample x_i and the mean µ_{y_i} of its true class y_i is maximized. This indicates that vMFML minimizes the intra-class distance by incorporating the distance computation task within its formulation.

The center loss is used as a supplementary loss to the softmax loss, and the CNN learning task is achieved by joint (softmax+center) optimization. On the other hand, the above comparison of vMFML with both the softmax loss and the center loss reveals that vMFML can take advantage of both with a single loss function and save M × D parameters, where D is the feature dimension and M is the number of classes in the training dataset. These additional parameters are used by the center loss to externally learn and save the centers.

Large-Margin Softmax Loss (LMSL) [39] is defined as:

$$L_{LMSL} = -\sum_{i=1}^{N} \log \frac{e^{\|w_{y_i}\|\|f_i\|\psi(\theta_{y_i})}}{e^{\|w_{y_i}\|\|f_i\|\psi(\theta_{y_i})} + \sum_{j\neq y_i} e^{\|w_j\|\|f_i\|\cos(\theta_j)}} \qquad (17)$$

where ψ(θ_{y_i}) = (−1)^h cos(mθ_{y_i}) − 2h, with an integer h ∈ [0, m − 1] and θ_{y_i} ∈ [hπ/m, (h+1)π/m]. m denotes the margin, f_i is the i-th image features, y_i is the true class label, w_j is the weight corresponding to the j-th class and θ_j is the angle between w_j and f_i.

vMFML is similar to LMSL when ||w_{y_i}|| = ||µ|| = 1, ∀ y_i, and ||f_i|| = ||x_i|| = 1. Under this condition, LMSL requires mθ_{y_i} < θ_j (j ≠ y_i), i.e., the angle between the sample and its true class is smaller than the angles to the rest of the classes, subject to a margin multiplier m. With vMFML, this can be achieved by using a higher κ value (or multiplying κ by a scalar multiplier m) for the true class compared to the rest, i.e., κ_{y_i} > κ_{j≠y_i}. By considering the multiplier as equivalent to the margin m, we can rewrite Eq. 9 as:

$$L_{vMFML} = -\sum_{i=1}^{N} \log \frac{e^{m\kappa\mu_{y_i}^T x_i}}{e^{m\kappa\mu_{y_i}^T x_i} + \sum_{j\neq y_i} e^{\kappa\mu_j^T x_i}} \qquad (18)$$

Note that we do not use Eq. 18 in this work, because it does not strictly follow the statistical feature representation model proposed in this paper. The A(angular)-softmax loss [38] is a recent extension of the LMSL, which replaces the notion of margin with an angular margin, considers ||w_{y_i}|| = 1 and provides a loss formulation equivalent to Eq. 17. Therefore, Eq. 18 provides the relationship of vMFML with the A-softmax loss [38] in the same way as with the LMSL.
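Although Eq. 18 is not used in this work, the margin variant amounts to a one-line change to the vMFML logits; a sketch under the same PyTorch assumptions as before:

```python
import torch
import torch.nn.functional as F

def vmfml_margin_logits(w, f, kappa, m, labels):
    """Sketch of Eq. 18: multiply the true-class logit by the margin m,
    i.e. use kappa_{y_i} = m * kappa > kappa_{j != y_i}; the result can be
    passed to F.cross_entropy as in plain vMFML."""
    x = F.normalize(f, dim=1)                  # x = f / ||f||
    mu = F.normalize(w, dim=1)                 # mu = w / ||w||
    logits = kappa * (x @ mu.t())              # z_ij = kappa * mu_j^T x_i
    one_hot = F.one_hot(labels, num_classes=w.size(0)).to(logits.dtype)
    return logits * (1.0 + (m - 1.0) * one_hot)
```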
Weight Normalization [51] proposed a reparameterization of the standard weight vector w as:

$$w = g \frac{v}{\|v\|} \qquad (19)$$

where v is a vector, g is a scalar and ||v|| is the norm of v. This is related to vMFML by considering the weight vector w as:

$$w = \kappa \frac{\mu}{\|\mu\|} = \kappa\mu; \quad [\|\mu\| = 1] \qquad (20)$$

The comparison of Eq. 19 and Eq. 20 reveals that vMFML incorporates the properties of weight normalization, subject to normalizing the activations of the CNN layer, which are considered as the features.

REFERENCES

[1] W. AbdAlmageed, Y. Wu, S. Rawls, S. Harel, T. Hassner, I. Masi, J. Choi, J. Lekust, J. Kim, and P. Natarajan. Face recognition using deep multi-pose representations. In IEEE WACV, pages 1–9, 2016.
[2] G. Antipov, M. Baccouche, and J.-L. Dugelay. Face aging with conditional generative adversarial networks. arXiv preprint arXiv:1702.01983, 2017.
[3] A. Banerjee, I. S. Dhillon, J. Ghosh, and S. Sra. Clustering on the unit hypersphere using von Mises-Fisher distributions. Journal of Machine Learning Research, 6(Sep):1345–1382, 2005.
[4] L. Best-Rowden and A. K. Jain. Automatic face image quality prediction. arXiv preprint arXiv:1706.09887, 2017.
[5] A. Bhalerao and C.-F. Westin. Hyperspherical von Mises-Fisher mixture (HvMF) modelling of high angular resolution diffusion MRI. In Proc. of MICCAI, pages 236–243. Springer, 2007.
[6] C. Biernacki, G. Celeux, and G. Govaert. Assessing a mixture model for clustering with the integrated completed likelihood. IEEE TPAMI, 22(7):719–725, 2000.
[7] B.-C. Chen, C.-S. Chen, and W. H. Hsu. Face recognition and retrieval using cross-age reference coding with cross-age celebrity dataset. IEEE Trans. on Multimedia, 17(6):804–815, 2015.
[8] D. Chen, X. Cao, L. Wang, F. Wen, and J. Sun. Bayesian face revisited: A joint formulation. In ECCV, pages 566–579, 2012.
[9] J.-C. Chen, V. M. Patel, and R. Chellappa. Unconstrained face verification using deep CNN features. In 2016 IEEE WACV, pages 1–9, 2016.
[10] J.-C. Chen, R. Ranjan, A. Kumar, C.-H. Chen, V. Patel, and R. Chellappa. An end-to-end system for unconstrained face verification with deep convolutional neural networks. In Proc. of IEEE ICCV-W, pages 118–126, 2015.
[11] S. Chopra, R. Hadsell, and Y. LeCun. Learning a similarity metric discriminatively, with application to face verification. In Proc. of IEEE CVPR, pages 539–546, 2005.
[12] L. Chunjie, Y. Qiang, et al. Cosine Normalization: Using cosine similarity instead of dot product in neural networks. arXiv preprint arXiv:1702.05870, 2017.
[13] N. Crosswhite, J. Byrne, O. M. Parkhi, C. Stauffer, Q. Cao, and A. Zisserman. Template adaptation for face verification and identification. arXiv:1603.03958, 2016.
[14] C. Ding and D. Tao. Robust face recognition via multimodal deep face representation. IEEE Trans. on Multimedia, 17(11):2049–2058, 2015.
[15] B. Fernando, E. Fromont, D. Muselet, and M. Sebban. Supervised learning of Gaussian mixture models for visual vocabulary generation. Pattern Recognition, 45(2):897–907, 2012.
[16] J. Glover, G. Bradski, and R. B. Rusu. Monte Carlo pose estimation with quaternion kernels and the Bingham distribution. Robotics: Science and Systems VII, page 97, 2012.
[17] S. Gopal and Y. Yang. Von Mises-Fisher clustering models. In Proc. of ICML, pages 154–162, 2014.
[18] J. Gu, Z. Wang, J. Kuen, L. Ma, A. Shahroudy, B. Shuai, T. Liu, X. Wang, and G. Wang. Recent advances in convolutional neural networks. arXiv:1512.07108, 2015.
[19] Y. Guo, L. Zhang, Y. Hu, X. He, and J. Gao. MS-Celeb-1M: A dataset and benchmark for large-scale face recognition. CoRR, abs/1607.08221, 2016.
[20] A. Hasnat, J. Bohne, J. Milgram, S. Gentric, and L. Chen. DeepVisage: Making face recognition simple yet with powerful generalization skills. In Proc. of IEEE ICCV-W AMFG, Oct 2017.
[21] M. A. Hasnat, O. Alata, and A. Trémeau. Joint Color-Spatial-Directional clustering and Region Merging (JCSD-RM) for unsupervised RGB-D image segmentation. IEEE TPAMI, 38(11):2255–2268, 2016.
[22] M. A. Hasnat, O. Alata, and A. Trémeau. Model-based hierarchical clustering with Bregman divergences and Fishers mixture model: application to depth image analysis. Statistics and Computing, 26(4):861–880, 2016.
[23] K. He, X. Zhang, S. Ren, and J. Sun. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In Proc. of IEEE CVPR, pages 1026–1034, 2015.
[24] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proc. of IEEE CVPR, 2016.
[25] G. B. Huang, M. Ramesh, T. Berg, and E. Learned-Miller. Labeled faces in the wild: A database for studying face recognition in unconstrained environments. Technical Report 07-49, University of Massachusetts, Amherst, 2007.
[26] H. Huang, R. He, Z. Sun, and T. Tan. Wavelet-SRNet: A wavelet-based CNN for multi-scale face super resolution. In Proc. of IEEE CVPR, pages 1689–1697, 2017.
[27] R. Huang, S. Zhang, T. Li, and R. He. Beyond face rotation: Global and local perception GAN for photorealistic and identity preserving frontal view synthesis. arXiv preprint arXiv:1704.04086, 2017.
[28] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proc. of ICML, pages 448–456, 2015.
[29] D. P. Kingma and M. Welling. Auto-encoding variational Bayes. In Proc. of ICLR, 2014.
[30] B. F. Klare, B. Klein, E. Taborsky, A. Blanton, J. Cheney, K. Allen, P. Grother, A. Mah, M. Burge, and A. K. Jain. Pushing the frontiers of unconstrained face detection and recognition: IARPA Janus Benchmark A. In Proc. of IEEE CVPR, pages 1931–1939, 2015.
[31] A. B. L. Larsen, S. K. Sønderby, H. Larochelle, and O. Winther. Autoencoding beyond pixels using a learned similarity metric. arXiv preprint arXiv:1512.09300, 2015.
[32] E. Learned-Miller, G. B. Huang, A. RoyChowdhury, H. Li, and G. Hua. Labeled faces in the wild: A survey. In Advances in face detection and facial image analysis, pages 189–248. Springer, 2016.
[33] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proc. of the IEEE, 86(11):2278–2324, 1998.
[34] C. Li, S. Lin, K. Zhou, and K. Ikeuchi. Specular highlight removal in facial images. In Proc. of IEEE CVPR, pages 3107–3116, 2017.
[35] S. Liao, Z. Lei, D. Yi, and S. Z. Li. A benchmark study of large-scale unconstrained face recognition. In Proc. of IEEE IJCB, pages 1–8, 2014.
[36] J. Liu, Y. Deng, and C. Huang. Targeting ultimate accuracy: Face recognition via deep embedding. arXiv:1506.07310, 2015.
[37] M. Liu, B. C. Vemuri, S.-I. Amari, and F. Nielsen. Shape retrieval using hierarchical total Bregman soft clustering. IEEE TPAMI, 34(12):2407–2419, 2012.
[38] W. Liu, Y. Wen, Z. Yu, M. Li, B. Raj, and L. Song. SphereFace: Deep hypersphere embedding for face recognition. In Proc. of IEEE CVPR, 2017.
[39] W. Liu, Y. Wen, Z. Yu, and M. Yang. Large-Margin Softmax Loss for convolutional neural networks. In Proc. of ICML, pages 507–516, 2016.
[40] X. Liu, M. Kan, W. Wu, S. Shan, and X. Chen. VIPLFaceNet: An open source deep face recognition SDK. arXiv:1609.03892, 2016.
[41] K. V. Mardia and P. E. Jupp. Directional statistics, volume 494. Wiley, 2009.
[42] I. Masi, S. Rawls, G. Medioni, and P. Natarajan. Pose-aware face recognition in the wild. In Proc. of IEEE CVPR, pages 4838–4846, 2016.
[43] I. Masi, A. Tran, T. Hassner, J. T. Leksut, and G. Medioni. Do we really need to collect millions of faces for effective face recognition? In ECCV, 2016.
[44] K. P. Murphy. Machine learning: a probabilistic perspective. The MIT Press, 2012.
[45] O. M. Parkhi, A. Vedaldi, and A. Zisserman. Deep face recognition. Proc. of BMVC, 1(3):6, 2015.
[46] A. B. Patel, M. T. Nguyen, and R. Baraniuk. A probabilistic framework for deep learning. In Proc. of NIPS, pages 2558–2566, 2016.
[47] A. Prati, S. Calderara, and R. Cucchiara. Using circular statistics for trajectory shape analysis. In Proc. of IEEE CVPR, pages 1–8, 2008.
[48] R. Ranjan, C. D. Castillo, and R. Chellappa. L2-constrained softmax loss for discriminative face verification. arXiv preprint arXiv:1703.09507, 2017.
[49] R. Ranjan, S. Sankaranarayanan, C. D. Castillo, and R. Chellappa. An all-in-one convolutional neural network for face analysis. arXiv:1611.00851, 2016.
[50] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. IJCV, 115(3):211–252, 2015.
[51] T. Salimans and D. P. Kingma. Weight Normalization: A simple reparameterization to accelerate training of deep neural networks. In Proc. of NIPS, 2016.
[52] S. Sankaranarayanan, A. Alavi, C. Castillo, and R. Chellappa. Triplet probabilistic embedding for face verification and clustering. arXiv:1604.05417, 2016.
[53] S. Sankaranarayanan, A. Alavi, and R. Chellappa. Triplet similarity embedding for face verification. arXiv:1602.03418, 2016.
[54] F. Schroff, D. Kalenichenko, and J. Philbin. FaceNet: A unified embedding for face recognition and clustering. In Proc. of IEEE CVPR, 2015.
[55] Y. Sun, D. Liang, X. Wang, and X. Tang. DeepID3: Face recognition with very deep neural networks. arXiv:1502.00873, 2015.
[56] Y. Sun, X. Wang, and X. Tang. Deeply learned face representations are sparse, selective, and robust. In Proc. of IEEE CVPR, pages 2892–2900, 2015.
[57] Y. Sun, X. Wang, and X. Tang. Sparsifying neural network connections for face recognition. In Proc. of IEEE CVPR, 2016.
[58] Y. Taigman, M. Yang, M. Ranzato, and L. Wolf. DeepFace: Closing the gap to human-level performance in face verification. In Proc. of IEEE CVPR, pages 1701–1708, 2014.
[59] Y. Taigman, M. Yang, M. Ranzato, and L. Wolf. Web-scale training for face identification. In Proc. of IEEE CVPR, pages 2746–2754, 2015.
[60] Z. Tüske, M. A. Tahir, R. Schlüter, and H. Ney. Integrating Gaussian mixtures into deep neural networks: softmax layer with hidden variables. In Proc. of ICASSP, pages 4285–4289. IEEE, 2015.
[61] P. Upchurch, J. Gardner, K. Bala, R. Pless, N. Snavely, and K. Weinberger. Deep feature interpolation for image content changes. arXiv preprint arXiv:1611.05507, 2016.
[62] A. Van den Oord and B. Schrauwen. Factoring variations in natural images with deep Gaussian mixture models. In Proc. of NIPS, pages 3518–3526, 2014.
[63] E. Variani, E. McDermott, and G. Heigold. A Gaussian mixture model layer jointly optimized with discriminative features within a deep neural network architecture. In Proc. of ICASSP, pages 4270–4274. IEEE, 2015.
[64] D. H. T. Vu and R. Haeb-Umbach. Blind speech separation employing directional statistics in an expectation maximization framework. In Proc. of ICASSP. IEEE, 2010.
[65] D. Wang, C. Otto, and A. K. Jain. Face search at scale. IEEE TPAMI, 2016.
[66] F. Wang, X. Xiang, J. Cheng, and A. L. Yuille. NormFace: L2 hypersphere embedding for face verification. In Proc. of the 25th ACM International Conference on Multimedia. ACM, 2017.
[67] Y. Wen, Z. Li, and Y. Qiao. Latent factor guided convolutional neural networks for age-invariant face recognition. In Proc. of IEEE CVPR, pages 4893–4901, 2016.
[68] Y. Wen, K. Zhang, Z. Li, and Y. Qiao. A discriminative feature learning approach for deep face recognition. In Proc. of ECCV, pages 499–515. Springer, 2016.
[69] L. Wolf, T. Hassner, and I. Maoz. Face recognition in unconstrained videos with matched background similarity. In Proc. of IEEE CVPR, pages 529–534, 2011.
[70] X. Wu, R. He, Z. Sun, and T. Tan. A light CNN for deep face representation with noisy labels. arXiv:1511.02683, 2015.
[71] J. Yang, P. Ren, D. Chen, F. Wen, H. Li, and G. Hua. Neural aggregation network for video face recognition. arXiv:1603.05474, 2016.
[72] H. Ye, W. Shao, H. Wang, J. Ma, L. Wang, Y. Zheng, and X. Xue. Face recognition via active annotation and learning. In Proc. of ACM Multimedia, pages 1058–1062. ACM, 2016.
[73] D. Yi, Z. Lei, S. Liao, and S. Z. Li. Learning face representation from scratch. arXiv:1411.7923, 2014.
[74] K. Zhang, Z. Zhang, Z. Li, and Y. Qiao. Joint face detection and alignment using multitask cascaded convolutional networks. IEEE Signal Processing Letters, 23(10):1499–1503, Oct 2016.
[75] W. Zhang, X. Zhao, J.-M. Morvan, and L. Chen. Improving shadow suppression for illumination robust face recognition. arXiv preprint arXiv:1710.05073, 2017.
[76] X. Zhang, Z. Fang, Y. Wen, Z. Li, and Y. Qiao. Range loss for deep face recognition with long-tail. arXiv:1611.08976, 2016.
[77] T. Zheng, W. Deng, and J. Hu. Cross-Age LFW: A database for studying cross-age face recognition in unconstrained environments. arXiv preprint arXiv:1708.08197, 2017.
[78] E. Zhou, Z. Cao, and Q. Yin. Naive-deep face recognition: Touching the limit of LFW benchmark or not? arXiv:1501.04690, 2015.