von Mises-Fisher Mixture Model-based Deep Learning: Application to Face Verification

Md. Abul Hasnat, Julien Bohné, Jonathan Milgram, Stéphane Gentric and Liming Chen

Laboratoire LIRIS, École centrale de Lyon, 69134 Ecully, France.
Safran Identity & Security, 92130 Issy-les-Moulineaux, France.
e-mail: [email protected], [email protected], [email protected], [email protected], [email protected]

arXiv:1706.04264v2 [cs.CV] 31 Dec 2017

Abstract—A number of pattern recognition tasks, e.g., face verification, can be boiled down to classification or clustering of unit-length directional feature vectors whose distance can be simply computed by their angle. In this paper, we propose the von Mises-Fisher (vMF) mixture model as the theoretical foundation for an effective deep learning of such directional features and derive a novel vMF Mixture Loss and its corresponding vMF deep features. The proposed vMF feature learning achieves the characteristics of discriminative learning, i.e., compacting the instances of the same class while increasing the distance between instances from different classes. Moreover, it subsumes a number of popular loss functions as well as an effective method in deep learning, namely normalization. We conduct extensive experiments on face verification using 4 different challenging face datasets, i.e., LFW, YouTube Faces, CACD and IJB-A. Results show the effectiveness and excellent generalization ability of the proposed approach, as it achieves state-of-the-art results on the LFW, YouTube Faces and CACD datasets and competitive results on the IJB-A dataset.

Index Terms—Deep Learning, Face Recognition, Mixture Model, von Mises-Fisher distribution.

1 INTRODUCTION

A number of pattern recognition tasks, e.g., face recognition (FR), can be boiled down to supervised classification or unsupervised clustering of unit-length feature vectors whose distance can be simply computed by their angle, i.e., cosine distance. In deep learning based FR, numerous methods find it useful to unit-normalize the final feature vectors, e.g., [14], [38], [48], [65], [66], [68], [73]. Besides, the widely used and simple softmax loss has been extended with additional or reinforced supervising signals, e.g., the center loss [68] and the large-margin softmax loss [39], to further enable discriminative learning, i.e., compacting intra-class instances while repulsing inter-class instances, and thereby increase the final recognition accuracy. However, the great success of these methods and practices remains unclear from a theoretical viewpoint, which motivates us to study the deep feature representation from a theoretical perspective.

Statistical Mixture Models (MM) are a common tool for probabilistic clustering and are widely used in data mining and machine learning [44]. MM play a key role in model-based clustering [6], which assumes a generative model, i.e., each observation is a sample from a finite mixture of probability distributions. In this paper, we adopt the theoretical concept of MM to model the deep feature representation task and make explicit the relationship between the model parameters and the deep features.

Unit-length normalized feature vectors are directional features which only keep the orientations of the data as discriminative information while ignoring their magnitude. In this case, a simple angular measure, e.g., the cosine distance, can be used as the dissimilarity between two data points and provides a very intuitive geometric interpretation of similarity [17]. In this paper, we propose to model the (deep) features delivered by deep neural nets, e.g., CNN-based neural networks, as a mixture of von Mises-Fisher distributions, also called the vMF Mixture Model (vMFMM). The von Mises-Fisher (vMF) is a fundamental probability distribution which has been successfully used in numerous unsupervised classification tasks [3], [17], [21]. By combining this vMFMM with deep neural networks, we derive a novel loss function, namely the vMF mixture loss (vMFML), which enables discriminative learning. Figure 1(a) (from right to left) provides an illustration of the proposed model. Figure 1(b) shows the discriminative nature of the proposed model, i.e., the learned features of each class are compacted whereas inter-class features are repulsed.

Footnote 1: We visualize the features of the MNIST digits in the 3D space (a similar illustration with a 2D plot appears in [68]). The CNN is composed of 6 convolution, 2 pool and 1 FC (with 3 neurons for the 3D visualization) layers. We optimize it using the softmax loss and the proposed vMFML.

To demonstrate the effectiveness of the proposed method, we carried out extensive experiments on the face recognition (FR) task, on which recent deep learning-based methods [36], [45], [54], [55], [58] have surpassed human-level performance. We used 4 different challenging face datasets, namely LFW [25] for single-image based face verification, IJB-A [30] for face template matching, YouTube Faces [69] (YTF) for video face matching, and CACD [7] for cross-age face matching. Using only one deep CNN model trained on the MS-Celeb dataset [19], the proposed method achieves 99.63% accuracy on LFW, 85% TAR@FAR=0.001 on IJB-A [30], 96.46% accuracy on YTF [69] and 99.2% accuracy on CACD [7]. These results indicate that our method generalizes very well across different datasets, as it achieves state-of-the-art results on the LFW, YouTube Faces and CACD datasets and competitive results on the IJB-A dataset.
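To make the directional-feature view of the Introduction concrete, the following minimal sketch (ours, not part of the paper; NumPy assumed) unit-normalizes two deep feature vectors and compares them by cosine similarity, the angular measure referred to above.

import numpy as np

def unit_normalize(f, eps=1e-12):
    # project a raw deep feature onto the unit hyper-sphere
    return f / (np.linalg.norm(f) + eps)

def cosine_similarity(f1, f2):
    # for unit-length vectors the dot product equals the cosine of their angle
    return float(np.dot(unit_normalize(f1), unit_normalize(f2)))

# toy usage with random 512-D "deep features"
f1, f2 = np.random.randn(512), np.random.randn(512)
score = cosine_similarity(f1, f2)   # in [-1, 1]; larger means more similar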

Fig. 1: (a) Illustration of the proposed model with a 5-class vMFMM, where facial identity represents the classes, and (b) illustration of the 3D features learned from the MNIST digits [33]; left: softmax loss, right: proposed (vMFML). Features from different classes are shown with different markers and colors on the sphere. See Appendix A for an extended experiment and further illustrations.

The contributions of the proposed method can be summarized as follows:

• we propose a novel feature representation model (Sect. 3.1.1) from a theoretical perspective. It combines the statistical mixture model [44] with a directional distribution [41]. It provides a novel view to understand and model the desired pattern classification task and can therefore help to develop efficient methods that achieve better results.
• we propose a directional feature representation learning method, called vMF-FL (Sect. 3.1.2), which combines the theoretical model with the CNN model. vMF-FL provides a novel loss function, called vMFML (Sect. 3.2), whose formulation w.r.t. backpropagation [33] shows that it can be easily integrated with any CNN model. Moreover, vMFML is able to explain (Sect. 3.3) different loss functions [38], [39], [48], [66], [68] as well as normalization methods [12], [28], [51]. vMFML not only interprets the relation between the parameters and the features, but also improves the CNN learning task w.r.t. efficiency (faster convergence) and performance (better accuracy, Sect. 4.2.1). Therefore, it can be used in a variety of classification tasks under the assumption of directional features.
• we verify (Sect. 4.2) the effectiveness of the proposed vMFML on the task of face verification through comprehensive experiments on four face benchmarks depicting various challenges, e.g., pose, lighting and age, and demonstrate its generalization ability across datasets.
• we perform (Sect. 4.3) additional experiments and provide in-depth analysis of various factors, e.g., sensitivity of parameters, training data size and failure cases, to further gain insight into the proposed method.

In the remaining part, we study the related work in Section 2, describe our method in Section 3, present experimental results with analysis in Section 4 and finally draw conclusions in Section 5.

2 RELATED WORK

2.1 Mixture models and Loss Functions

Mixture models (MM) [44] have been widely used in numerous machine learning problems, such as classification [15], clustering [6], image analysis [21], text analysis [3] and shape retrieval [37]. However, their potential remains relatively under-explored with neural network (NN) based learning methods. Several recent works [46], [60], [62], [63] explored the concept of MM with NN from different perspectives. [62] used the Gaussian MM (GMM) to model a deep NN as a mixture of transformers. [46] aimed to capture the variations in the nuisance variables (e.g., object poses) and used the NN as a rendering function to propose the deep rendering MM. Both of these methods use MM within the NN, whereas we consider the NN as a single transformer. [60], [63] used GMM with NN and applied it to speech analysis. [63] performed discriminative feature learning via a proposed GMM layer. [60] used the concept of a log-linear model with GMM and modified the softmax loss accordingly. While our method is closest to [60], [63], there are several differences: (a) we use directional (unit-normalized) features; (b) we use the vMF [41] distribution, which is more appropriate for directional features [22]; (c) our feature representation model is based on a generative model-based concept; and (d) we exploit the CNN model and explore a practical computer vision application.

MM with directional distributions [41] have been used in a variety of domains to analyze images [22], speech [64], text [3], [17], gene expressions [3], shapes [47], pose [16], diffusion MRI [5], etc. However, they remain unexplored for learning discriminative features. In this paper, we aim to fill this gap by modeling the feature learning task with the vMF distribution [3], [41] and combining it with a CNN model. To the best of our knowledge, this is the first reported attempt to use a directional distribution with a CNN model.

Loss functions are an essential part of CNN training. The CNN model parameters are learned by optimizing certain loss functions which are defined based on the given task (e.g., classification, regression) and the available supervisory information (e.g., class labels, price). While the softmax loss [18] is the most commonly used, recent research [39], [68] indicates that it cannot guarantee discriminative features. [68] proposed the center loss as a supplementary loss to minimize intra-class variations and hence improve feature discrimination. Our method performs the same task with the concentration (κ) parameter and requires no supplementary loss function, see Section 3.3 for details.

[38], [39] proposed the large-margin softmax loss by incorporating an intuitive margin on the classification boundary, which can be explained by our method under certain conditions, see Section 3.3 for details.

2.2 Face Recognition

Face recognition (FR) is a widely studied problem in which remarkable results have been achieved by recent deep CNN based methods on several standard FR benchmark datasets, such as the Labeled Faces in the Wild (LFW) [25], YouTube Faces (YTF) [69] and IARPA Janus Benchmark A (IJB-A) [30]. In Table 1, we study these methods (Footnote 2: we only consider the deep CNN based methods; for the others, we refer readers to the survey [32] for LFW and to [30] for IJB-A) and decompose their key aspects as: (a) CNN model design; (b) objective/loss functions; (c) fine-tuning and additional learning methods; (d) multi-modal input and number of CNNs; and (e) use of the training database.

Recent deep CNN based FR methods tend to adopt (directly or with slight modifications) the famous CNN architectures [18] proposed for the ImageNet [50] challenge. The CNN Info column of Table 1 provides the details of the CNN models used by different FR methods. The AlexNet model is used by [1], [40], [42], [49], [52], [53], [54], the VGGNet model by [1], [13], [14], [42], [43], [45], [57], the GoogleNet model by [54], [71] and the ResNet model by [20], [38], [48], [66], [68], [76]. Besides the famous CNN models, [73] proposed a simpler CNN model which is used by [9], [14], [65]. CNNs with lower depth have been used by [55], [56], [58], [59], [68], where the model complexity is increased with locally connected convolutional layers. Parallel CNNs have been employed by [36], [78] to simultaneously learn features from different facial regions. We adopt a ResNet [24] based deeper CNN model.

FR methods are trained with different loss functions, see the Loss function column in Table 1. Most FR methods learn the facial feature representation model by training the CNN for identity classification. For this purpose, the softmax loss [18] is used to optimize the classification objective, which requires facial images with associated identity labels. Several variants of the softmax loss [38], [39], [48], [66] have recently been proposed to enhance its discriminability. Besides, the contrastive loss [11], [18] is used by [55], [56], [57], [58], [71] for face verification, which requires similar/dissimilar face image pairs and similarity labels. Moreover, the triplet loss [54] is used by [13], [36], [45], [54] for face verification and requires face triplets (i.e., anchor, positive, negative). Recently, [66] proposed alternative formulations of the contrastive and triplet losses such that they can be trained with only the identity labels. Our proposed vMFML simply learns the features via identity classification and requires only the class labels. Moreover, the theoretical foundation of vMFML provides an interesting interpretation of, and relationship with, the very recently proposed softmax based loss functions, such as [38], [39], [48], [66], see Sect. 3.3 for details.

CNN training with multiple loss functions has been adopted by several methods, where the losses are optimized jointly [49], [55], [56], [57], [68], [76] or sequentially (softmax loss followed by the other loss) [13], [36], [45], [58], [71]. However, multiple loss optimization not only requires additional effort for training data preparation but also complicates the CNN training procedure [20]. Our FR method optimizes a single loss and does not need such extra effort.

After training the CNN model, a second learning method is often incorporated by different FR methods, see the Additional Learning column of Table 1. CNN fine-tuning is a particular form of transfer learning commonly employed by several methods [9], [52], [65]. It consists of updating the trained CNN parameters using target-specific training data. In order to adapt the CNN features to particular FR tasks, several methods apply an additional transformation on the CNN features based on a metric/distance learning strategy. The Joint Bayesian method [8] is a popular metric learning method and is used by [9], [14], [55], [56], [57], [65], [73]. Specific embedding learning methods [52], [53] have been proposed to learn from face triplets. Template adaptation [13] is another strategy, which incorporates an SVM classifier and extends the FR task to videos and templates [30]. A different approach learns an aggregation module [71] to compute the similarity scores between two sets of video frames. The principal component analysis (PCA) technique is commonly used [1], [42], [43] to learn a dataset-specific projection matrix. The above methods often need to prepare their training data from the target-specific datasets. Moreover, they [52], [53] may need to carefully prepare the training data, e.g., triplets. Our FR method does not incorporate any such additional learning strategy.

FR methods often accumulate features from a set of independently trained CNN models to construct rich facial descriptors, see the # of CNNs column of Table 1. These CNN models are trained with different types of inputs: (a) image crops based on different facial regions (eyes, nose, lips, etc.) [14], [55], [56], [57], [65]; (b) different image modalities, such as 2D, 3D, frontalized and synthesized faces at different poses [1], [14], [42], [58]; and (c) different training databases with varying numbers of images [36], [59]. Our FR method does not apply these approaches and trains a single CNN model.

A large facial image database is significantly important to achieve high accuracy on FR [20], [54], [78]. The Dataset Info column of Table 1 provides the information on the training datasets used by different FR methods. Currently, several FR datasets [19], [73] are publicly available. Among them, CASIA-WebFace [73] has been widely used by recent methods [1], [9], [14], [40], [42], [43], [49], [52], [53], [65], [68], [70], [73] to train CNNs with a medium sized database. Besides, the recently released MSCELEB [19] dataset, which provides the largest collection of facial images and identities, has become the standard choice for large scale training. We exploit it for developing our FR method.

3 METHODOLOGY

In this section, we first present the statistical feature representation model and then discuss the facial feature representation learning method w.r.t. the model. Finally, we present the complete face recognition pipeline.

TABLE 1: Overview of the state-of-the-art FR methods. In the second column (CNN Info), the shorthand notations mean: C: convolutional layer, FC: fully connected layer, LC: locally connected layer and L: loss layer. In the third column (Loss Function): TL: triplet loss, SL: softmax loss, CL: contrastive loss, CCL: C-contrastive loss, CeL: center loss, ASL: angular softmax loss and RL: range loss. In the fourth column (Additional Learning): JB: joint Bayesian, TSE: triplet similarity embedding, TPE: triplet probability embedding, TA: template adaptation and X: no metric learning. We list the methods in decreasing order of the number of convolution and FC layers in the CNN model.

FR system | CNN Info | Loss function | Additional learning | # of CNNs | Dataset Info
DeepID2+ [56] | 4-C, 1-FC, 2-L | SL, CL | JB | 25 | 0.29M, 12K
Deepface [58] | 2-C, 3-LC, 1-FC, 1-L | SL, CL | X | 3 | 4.4M, 4K
Webscale [59] | 2-C, 3-LC, 2-FC, 1-L | SL | X | 4 | 4.5M, 55K
Center Loss [68] | 3-C, 3-LC, 1-FC, 2-L | SL, CeL | X | 1 | 0.7M, 17.2K
FV-TSE [52] | 6-C, 2-FC, 1-L | SL | TSE | 1 | 0.49M, 10.5K
FV-TPE [53] | 7-C, 2-FC, 1-L | SL | TPE | 1 | 0.49M, 10.5K
VIPLFaceNet [40] | 7-C, 2-FC, 1-L | SL | X | 1 | 0.49M, 10.57K
All-In-One [49] | 7-C, 2-FC, 8-L | SL | TPE | 1 | 0.49M, 10.5K
CASIA-Webface [73] | 10-C, 1-L | SL | JB | 1 | 0.49M, 10.5K
FSS [65] | 10-C, 1-L | SL | JB | 9 | 0.49M, 10.57K
Unconstrained FV [9] | 10-C, 1-FC, 1-L | SL | JB | 1 | 0.49M, 10.5K
Sparse ConvNet [57] | 10-C, 1-FC, 1-L | SL, CL | JB | 25 | 0.29M, 12K
FaceNet [54] | 11-C, 3-FC, 1-L | TL | X | 1 | 200M, 8M
DeepID3 [55] | 8-C, 4-FC, 2-LC, 2-L | SL, CL | JB | 25 | 0.3M, 13.1K
MM-DFR [14] | 10-C, 2-FC, 1-L and 12-C, 2-FC, 1-L | SL | JB | 8 | 0.49M, 10.5K
VGG Face [45] | 13-C, 3-FC, 2-L | SL, TL | X | 1 | 2.6M, 2.6K
FV-TA [13] | 13-C, 3-FC, 2-L | SL, TL | TA | 1 | 2.6M, 2.6K
MFM-CNN [70] | 14-C, 2-FC, 1-L | SL | X | 1 | 0.49M, 10.5K
Face-Aug-Pose-Syn [43] | 16-C, 3-FC, 1-L | SL | PCA | 1 | 0.49M, 10.57K
Deep Multipose [1] | 16-C, 3-FC, 1-L and 5-C, 2-FC, 1-L | SL | PCA | 12 | 0.4M, 10.5K
Pose aware FR [42] | 16-C, 3-FC, 1-L | SL | PCA | 5 | 0.49M, 10.5K
Range Loss [76] | 27-C, 1-FC, 2-L | SL, RL | X | 1 | 1.5M, 100K
DeepVisage [20] | 27-C, 1-FC, 1-L | SL | X | 1 | 4.48M, 62K
NormFace [66] | 27-C, 1-FC, 2-L | SL, CCL | X | 1 | 0.49M, 10.5K
Megvii [78] | 4 x 10-C, 1-L | SL | X | 1 | 5M, 0.2M
Baidu [36] | 4 x 9-C, 2-L | SL, TL | X | 10 | 1.2M, 1.8K
NAN [71] | 57-C, 5-FC, 3-L | SL, CL | Agg | 1 | 3M, 50K
SphereFace [38] | 64-C, 1-FC, 1-L | ASL | X | 1 | 0.49M, 10.5K
L2-Softmax [48] | 100-C, 1-FC, 1-L | L2-SL | X | 1 | 0.49M, 10.5K

3.1 Model and Method

3.1.1 Statistical Feature Representation (SFR) Model

We propose the SFR model based on the generative model-based approach [44], where the facial features are issued from a finite statistical mixture of probability distributions. These features are then transformed into the 2D image space using a transformer. See Appendix A for an experimental proof of concept of the proposed SFR model with MNIST [33] digits experiments.

Figure 1(a) (from right to left) provides an illustration of the SFR model, which considers a mixture of von Mises-Fisher (vMF) distributions [41] to model the facial features from different identities/classes. The vMF distribution [41] is parameterized by the mean direction µ (shown as solid lines) and the concentration κ (which indicates the spread of the feature points around the solid line). For the features x_i of the i-th facial image, called vMF features, we define the SFR model with M classes as:

SFR(x_i | Θ_M) = Σ_{j=1}^{M} π_j V_d(x_i | µ_j, κ_j)    (1)

where π_j, µ_j and κ_j denote respectively the mixing proportion, mean direction and concentration value of the j-th class, Θ_M is the set of model parameters and V_d(.) is the density function of the vMF distribution (see Section 3.2 for details).

The SFR model makes an equal privilege assumption for the classes, i.e., each class j has equal appearance probability π and is distributed with the same concentration value κ. This assumption is important for discriminative learning: it makes sure that the supervised classifier is not biased toward any particular class regardless of the number of samples and the amount of variation present in the training data for each class. On the other hand, µ_j plays the significant role of preserving each identity in its respective sub-space. Therefore, the generative SFR model can be used for discriminative learning tasks, which can be viewed by reversing the directions in Figure 1(a), i.e., information flows from left to right. Next we discuss the discriminative feature learning task w.r.t. the SFR model in detail.
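To make the generative (right-to-left) view of the SFR model concrete, the following sketch, which is ours and not part of the paper, draws unit vectors from a single vMF component using Wood's (1994) rejection sampler; drawing from each class mean µ_j with a shared κ and pooling the samples reproduces scatter plots in the spirit of Figure 3. NumPy is assumed.

import numpy as np

def sample_vmf(mu, kappa, n):
    # draw n unit vectors from vMF(mu, kappa) via Wood's (1994) rejection sampler
    mu = mu / np.linalg.norm(mu)
    d = mu.shape[0]
    b = (-2.0 * kappa + np.sqrt(4.0 * kappa ** 2 + (d - 1) ** 2)) / (d - 1)
    x0 = (1.0 - b) / (1.0 + b)
    c = kappa * x0 + (d - 1) * np.log(1.0 - x0 ** 2)
    out = []
    while len(out) < n:
        z = np.random.beta(0.5 * (d - 1), 0.5 * (d - 1))
        w = (1.0 - (1.0 + b) * z) / (1.0 - (1.0 - b) * z)   # component along mu
        if kappa * w + (d - 1) * np.log(1.0 - x0 * w) - c >= np.log(np.random.uniform()):
            v = np.random.randn(d)                           # random tangent direction
            v -= v.dot(mu) * mu
            v /= np.linalg.norm(v)
            out.append(w * mu + np.sqrt(1.0 - w ** 2) * v)
    return np.array(out)

# e.g., 3-D samples around a class mean, tightly clustered for large kappa
samples = sample_vmf(np.array([0.0, 0.0, 1.0]), kappa=50.0, n=500)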

3.1.2 vMF Feature Learning (vMF-FL) Method

Figure 2 illustrates the workflow of the vMF-FL method, where the features are learned using an object identity classifier. See Appendix A for an extended view of its relationship with the SFR model. The vMF-FL method consists of two sub-tasks: (1) mapping the input 2D object images to vMF features using the CNN model, which we use as the inverse-transformer w.r.t. the SFR model, and (2) classifying the features into the respective classes based on the discriminative view of the SFR model. It formulates an optimization problem by integrating the SFR and CNN models and learns the parameters by minimizing the classification loss. In general, CNN models use the softmax function and minimize the cross entropy. Therefore, our integration replaces the softmax function according to Eq. 1.

Fig. 2: Workflow of the vMF-FL method. Top: block diagram; bottom: view with an example.

3.1.3 Convolutional Neural Network (CNN) model

The basic ideas of a CNN [33] consist of: (a) local receptive fields via convolution and (b) spatial sub-sampling via pooling. At layer l, the convolution of the input f^{Op,l-1}_{x,y} to obtain the k-th output map f^{C,l}_{x,y,k} is:

f^{C,l}_{x,y,k} = w_k^{l T} f^{Op,l-1}_{x,y} + b_k^l    (2)

where w_k^l and b_k^l are the shared weights and bias, C denotes convolution and Op (for l > 1) denotes various operations, such as convolution, sub-sampling or activation. For l = 1, Op represents the input image. Sub-sampling or pooling performs a simple local operation, e.g., computing the maximum in a local neighborhood followed by reducing the spatial resolution:

f^{P,l}_{x,y,k} = max_{(m,n) ∈ N_{x,y}} f^{Op,l-1}_{m,n,k}    (3)

where N_{x,y} is the local neighborhood and P denotes pooling. In order to ensure the non-linearity of the network, the feature maps are passed through a non-linear activation function, e.g., the Rectified Linear Unit (ReLU) [18], [23]: f^l_{x,y,k} = max(f^{l-1}_{x,y,k}, 0).

At the basic level, a CNN is constructed by stacking a series of convolution, activation and pooling layers [33]. Often a fully connected (FC) layer is placed at the end, which connects all neurons of the previous layer to all neurons of the next layer.

CNN models are trained by optimizing a loss function. The softmax loss, which is widely used for classification, has the following form:

L_Softmax = - Σ_{i=1}^{N} log [ exp(w_{y_i}^T f_i + b_{y_i}) / Σ_{l=1}^{M} exp(w_l^T f_i + b_l) ]    (4)

where f_i and y_i are the features and the true class label of the i-th image, w_j and b_j denote the weights and bias of the j-th class, and N and M denote the number of training samples and the number of classes.

3.2 SFR model and von Mises-Fisher Mixture Loss (vMFML)

Our proposed SFR model assumes that the facial features are unit vectors distributed according to a mixture of vMFs. By combining the SFR and CNN models, the vMF-FL method provides a novel loss function, called the von Mises-Fisher Mixture Loss (vMFML). Below we provide its formulation.

3.2.1 vMF Mixture Model (vMFMM)

For a d dimensional random unit vector x = [x_1, ..., x_d]^T ∈ S^{d-1} ⊂ R^d (i.e., ||x||_2 = 1), the density function of the vMF distribution is defined as [41]:

V_d(x | µ, κ) = C_d(κ) exp(κ µ^T x)    (5)

where µ denotes the mean direction (with ||µ||_2 = 1) and κ denotes the concentration parameter (with κ ≥ 0). C_d(κ) = κ^{d/2-1} / ((2π)^{d/2} I_{d/2-1}(κ)) is the normalization constant, where I_ρ(.) is the modified Bessel function of the first kind. The shape of the vMF distribution depends on the value of the concentration parameter κ. For a high value of κ, i.e., highly concentrated features, the distribution has a mode at the mean direction µ. On the contrary, for low values of κ the distribution is uniform, i.e., the samples appear to be uniformly distributed on the sphere. The top row of Figure 3 illustrates examples of 3D samples on the S^2 sphere, distributed according to the vMF distribution with different values of the concentration κ.

Fig. 3: 3D directional samples from the vMF distribution (above arrow) and from a vMFMM of 3 classes (below arrow). Samples are shown on the S^2 sphere for different values of κ.

Let X = {x_i}_{i=1,...,N} be a set of samples, where N is the total number of samples. For the i-th sample x_i, the vMFMM with M classes is defined (equivalently to the SFR model in Eq. 1) as [3]: g_v(x_i | Θ_M) = Σ_{j=1}^{M} π_j V_d(x_i | µ_j, κ_j), where Θ_M = {(π_1, µ_1, κ_1), ..., (π_M, µ_M, κ_M)} is the set of parameters and π_j is the mixing proportion of the j-th class.

The bottom row of Figure 3 shows samples from vMFMMs with different κ values.

The vMFMM has been used for unsupervised classification [3] to cluster data on the unit hyper-sphere, where the model is estimated by the Expectation Maximization (EM) method. The objective of the EM method is to estimate the model parameters such that the negative log-likelihood of the vMFMM, i.e., -log(g_v(X | Θ_M)), is minimized. The EM method estimates the posterior probabilities in the E-step as [3]:

p_{ij} = π_j C_d(κ_j) exp(κ_j µ_j^T x_i) / Σ_{l=1}^{M} π_l C_d(κ_l) exp(κ_l µ_l^T x_i)    (6)

and the model parameters in the M-step as [3]:

π_j = (1/N) Σ_{i=1}^{N} p_{ij},   µ̂_j = Σ_{i=1}^{N} p_{ij} x_i / Σ_{i=1}^{N} p_{ij},
r̄ = ||µ̂_j||,   µ_j = µ̂_j / ||µ̂_j||   and   κ_j = (r̄ d - r̄^3) / (1 - r̄^2)    (7)

3.2.2 von Mises-Fisher Mixture Loss (vMFML) and optimization

Our vMF-FL method aims to learn discriminative facial features by minimizing the classification loss. Within this (supervised classification) context, we set our objective as minimizing the cross entropy guided by the vMFMM. Therefore, we rewrite the posterior probability based on the equal privilege assumption of the SFR model as:

p_{ij} = exp(κ µ_j^T x_i) / Σ_{l=1}^{M} exp(κ µ_l^T x_i)    (8)

We can now exploit this posterior/conditional probability to minimize the cross entropy and define the loss function, called vMFML, as:

L_vMFML = - Σ_{i=1}^{N} Σ_{j=1}^{M} y_{ij} log(p_{ij})
        = - Σ_{i=1}^{N} log [ exp(κ µ_{y_i}^T x_i) / Σ_{l=1}^{M} exp(κ µ_l^T x_i) ]
        = - Σ_{i=1}^{N} log [ e^{z_{i y_i}} / Σ_{l=1}^{M} e^{z_{il}} ],   with z_{ij} = κ µ_j^T x_i    (9)

where y_{ij} is the true class probability; we set y_{ij} = 1 only for the labeled class j = y_i, since we only know the true class labels. Now, by comparing vMFML with the softmax loss (Eq. 4), we observe the following differences: (a) vMFML uses unit-normalized features, x = f / ||f||; (b) the mean parameter is related to the softmax weight as µ = w / ||w||; (c) it has no bias; and (d) it has an additional parameter κ.

We thus observe that the proposed vMF-FL method modifies the CNN training by replacing the softmax loss with the vMFML. Therefore, to learn the parameters we can follow the standard CNN learning procedure, i.e., iterative learning through forward and backward propagation [33]. This requires computing the gradients of vMFML w.r.t. the parameters. Following the chain rule, we compute the gradients (considering a single sample and dropping the subscript i for brevity) as:

∂L/∂p_j = - y_j / p_j ;   ∂p_l/∂z_j = p_j (1 - p_j) if l = j, and -p_j p_l if l ≠ j ;   ∂L/∂z_j = p_j - y_j    (10)

∂z_j/∂κ = µ_j^T x ;   ∂z_j/∂µ_{jd} = κ x_d ;   ∂z_j/∂x_d = κ µ_{jd}    (11)

∂x_d/∂f_d = (||f||^2 - f_d^2) / ||f||^3 = (1 - x_d^2) / ||f|| ;   ∂x_r/∂f_d = - f_d f_r / ||f||^3 = - x_d x_r / ||f||    (12)

∂µ_d/∂w_d = (||w||^2 - w_d^2) / ||w||^3 = (1 - µ_d^2) / ||w|| ;   ∂µ_r/∂w_d = - w_d w_r / ||w||^3 = - µ_d µ_r / ||w||    (13)

∂L/∂κ = Σ_{j=1}^{M} (p_j - y_j) µ_j^T x ;   ∂L/∂µ_{jd} = (p_j - y_j) κ x_d    (14)

∂L/∂x_d = Σ_{j=1}^{M} (p_j - y_j) κ µ_{jd} ;   ∂L/∂f_d = (1/||f||) ( ∂L/∂x_d - x_d Σ_r x_r ∂L/∂x_r )    (15)

3.3 Interpretation and discussion

The proposed SFR model represents each identity/class (i.e., face) with the mean (µ) and concentration (κ) parameters of the vMF distribution, which (unlike a weight and a bias) express a direct relationship with the respective identity: µ provides an expected representation (e.g., the mean facial features) of the identity, and κ (computed independently) indicates the variation among the samples of the identity. See Appendix A for an illustration from the MNIST digits based experiment, where the images generated from µ_j effectively show the ability of vMF-FL to learn a representative image of the corresponding class.

In terms of discriminative feature learning [39], [68], we can interpret the effectiveness of vMFML by analyzing the shape of the vMF distributions and vMFMMs in Figure 3 as a function of the κ value. While for a high κ value the features are closely located around the mean direction µ, for low κ values they spread widely and can be located far from µ. We observe that κ also plays an important role in separating the vMFMM samples of different classes. A higher κ value enforces the features to be more concentrated around µ, which minimizes intra-class variations (reducing the angular distances between samples and their mean) and maximizes inter-class distances (see Figures 3 and 1(b)). Therefore, unlike [68] (which jointly optimizes two losses: softmax and center loss), we can learn discriminative features by optimizing a single loss function and save M × D parameters, where D is the feature dimension (Footnote 4: see Appendix B for further details).

The formulation of vMFML naturally provides an interpretation of, and relationships with, several concurrently proposed loss functions [38], [39], [48], [66]. In Eq. 9, by using a higher κ value for the true class than for the other classes, i.e., κ_{y_i} > κ_{j≠y_i}, vMFML can formulate the large-margin softmax loss [39] and the A(angular)-softmax loss [38] under certain conditions (Footnote 5: see Appendix B for further details). The L2-softmax loss [48] is similar to vMFML when unit-norm weights ||w|| = 1 and zero biases b_{y_i} = 0 are applied within its formulation. Indeed, this relationship is significant as it provides additional justification (why κµ is important) for the vMFML proposal from the perspective of face image quality, see Sect. 4.3 for further details. The reformulated softmax loss proposed in NormFace [66] optimizes the cosine similarity and adds a scaling parameter which is found empirically by analyzing the bound of the softmax loss. Interestingly, this reformulation (Eq. 6 of [66]) is exactly the same as Eq. 9, and hence vMFML has an interesting interpretation from a different viewpoint. The above discussion clearly shows that, while vMFML is derived from a theoretical model, it can also be explained based on the intuitions and empirical justifications presented in the concurrently proposed loss functions [38], [48], [66] for FR.

Normalization [12], [28], [51] has become an increasingly popular technique within CNN models. Our method (with normalization in the final layer) takes advantage of different normalization techniques due to the natural form of its features (||x|| = 1) and parameters (||µ|| = 1). In particular, the term κµ is equivalent to the reparameterization proposed by weight normalization [51] (Footnote 6: see Appendix B for further details) and µ^T x is equivalent to cosine normalization [12]. Both [51] and [12] establish their relationship with batch normalization [28] under certain conditions, which is equally applicable to our case.

3.4 Face Verification with the vMF-FL method

The proposed vMF-FL method learns a discriminative facial feature representation from a set of 2D facial images. Therefore, we use it to extract facial features and verify pairs of face images [25], templates [30] and videos [69].

3.4.1 CNN model

In general, any CNN model can be used with the proposed vMF-FL method. In this work, we follow the recent trend [18], [24] and use a deeper CNN. To this aim, we choose the publicly available ResNet [24] based CNN model provided by the authors of [68] (Footnote 7: note that the CNN proposed in [68] is different from the publicly provided CNN by the same authors; therefore, in order to avoid confusion, we do not cite our CNN model directly as [68]). It consists of 27 convolution (Conv), 4 pooling (Pool) and 1 fully connected (FC) layers. Figure 4 illustrates the CNN model, called Res-27. Each convolution uses a 3 × 3 kernel and is followed by a PReLU activation function. The CNN progresses from lower to higher depth by decreasing the spatial resolution with 2 × 2 max Pool layers while gradually increasing the number of feature maps from 32 to 512. The 512 dimensional output of the FC layer is then unit-normalized, which we consider as the desired directional feature representation of the input 2D image. Finally, we apply the proposed vMFML to optimize the CNN during training. Overall, the CNN comprises 36.1M parameters for feature representation and (512 × M) + 1 parameters for the vMFML, where M is the total number of identities/classes in the training database. Note that vMFML requires only one additional scalar parameter (κ) compared to the general softmax loss.

Fig. 4: (a) Illustration of the CNN model with vMFML. CoPr indicates a Convolution followed by the PReLU activation. ResBl is a residual block [24] which computes output = input + CoPr(CoPr(input)). # Replication indicates how many times the same block is sequentially replicated. # Filts denotes the number of feature maps. (b) Illustration of the residual block ResBl.

3.4.2 Face verification

Our face verification strategy follows the steps below (a sketch of the resulting scoring procedure is given after this list):

1) face image normalization: we apply the following steps: (a) use the MTCNN [74] detector to detect the faces and landmarks (Footnote 8: in case of multiple faces, we take the face closest to the image center; if the landmark detector fails, we keep the face image by cropping it based on the detected bounding box); (b) normalize the face image by applying a 2D similarity transformation, whose parameters are computed from the locations of the detected landmarks on the image and pre-set coordinates in a 112×96 image frame; (c) convert the color image to grayscale; and (d) normalize the pixel intensity values to the range -1 to 1. Note that this normalization method is also applied during the training data preparation.

2) feature extraction: use the CNN (trained with vMF-FL) to extract features from the original and the horizontally flipped version of the image and take the element-wise maximum. For a template [30] or a video [69], obtain the features of an identity by taking the element-wise average of the features from all of its images/frames.

3) verification score computation: compute the cosine similarity as the score and compare it to a threshold.
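The following sketch (ours; NumPy assumed) condenses steps 2 and 3 above: flip-max fusion per image, element-wise averaging across a template or video, and a cosine-similarity score thresholded for the verification decision. The function extract_features stands in for the trained Res-27 forward pass and is assumed here; the threshold value is purely illustrative.

import numpy as np

def image_descriptor(img, extract_features):
    # step 2: element-wise max of the features of the image and its horizontal flip
    f = extract_features(img)
    f_flip = extract_features(np.fliplr(img))
    return np.maximum(f, f_flip)

def identity_descriptor(images, extract_features):
    # templates / videos: element-wise average over all images or frames
    return np.mean([image_descriptor(im, extract_features) for im in images], axis=0)

def verify(desc_a, desc_b, threshold=0.5):
    # step 3: cosine similarity compared to a threshold (threshold value illustrative)
    a = desc_a / np.linalg.norm(desc_a)
    b = desc_b / np.linalg.norm(desc_b)
    score = float(np.dot(a, b))
    return score, score >= threshold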

4 EXPERIMENTS, RESULTS AND DISCUSSION

We train the CNN model, use it to extract features and evaluate different face verification scenarios, i.e., single-image based [7], [25] and multi-image or video based [30], [69]. In order to verify the effectiveness of the proposed approach, we experiment on several datasets, namely LFW [25], IJB-A [30], YTF [69] and CACD [7], which impose various challenges on FR by collecting images from different sources and ensuring sufficient variations w.r.t. pose, illumination, occlusion, expression, resolution, age, geographic regions, etc. Figure 6 provides examples of pairs of images/videos/templates from the different FR datasets.

The LFW [25] dataset was particularly designed to study the problem of face verification with unconstrained facial images taken in everyday settings and collected from the web. It is considered one of the most challenging and popular FR benchmarks. YTF [69] is a dataset of unconstrained videos designed to evaluate FR methods for matching faces in pairs of videos. Low image resolution and high illumination and pose variations in the YTF videos add significant challenges for video based FR methods. IJB-A [30] is a recently proposed dataset which raises the FR difficulty by collecting faces with high variations in pose, illumination, expression, resolution and occlusion. Unlike LFW, IJB-A includes faces which cannot be detected by standard face detectors and hence adds a significant amount of challenge. The CACD [7] dataset is specifically designed to study the age invariant FR problem. It ensures large variations of the ages in order to add a further challenge for FR methods.

4.1 CNN Training

We collect the training images from a cleaned version of the MS-Celeb-1M [19] database (Footnote 9: we take the list of 5.05M faces provided by [70] and keep the identities that do not overlap with the test sets and have at least 30 images after successful landmark detection), which consists of 4.61M images of 61.24K identities. In order to pre-process the training data, we normalize (Sect. 3.4.2) the facial images of the dataset.

We train our CNN model using only the identity label of each image. We use 95% of the images (4.3M images) for training and 5% (259K images) for monitoring and evaluating the loss and accuracy. We train our CNN using stochastic gradient descent with the momentum set to 0.9. Moreover, we apply L2 regularization with the weight decay set to 5e-4. We begin the CNN training with a learning rate of 0.1 for 2 epochs. Then we decrease it after each epoch by a factor of 10, and we stop the training after 5 epochs. We use 120 images in each mini-batch. During training, we apply data augmentation by horizontally flipping the images. Note that, during evaluation on a particular dataset, we do not apply any additional CNN training, fine-tuning or dimension reduction.
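As a concrete reading of the training recipe in Sect. 4.1, here is a minimal sketch (ours; PyTorch assumed) of the optimizer and schedule: SGD with momentum 0.9, weight decay 5e-4, mini-batches of 120, a learning rate of 0.1 for the first 2 epochs and a tenfold decrease after each subsequent epoch, stopping after 5 epochs. The milestone placement is our interpretation of "decrease it after each epoch by a factor 10".

import torch

def make_optimizer_and_schedule(parameters):
    # hyper-parameters as stated in Sect. 4.1
    optimizer = torch.optim.SGD(parameters, lr=0.1, momentum=0.9, weight_decay=5e-4)
    # epochs 0-1 keep lr = 0.1; epochs 2, 3 and 4 divide it by 10 each time
    scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[2, 3, 4], gamma=0.1)
    return optimizer, scheduler

# training loop skeleton (batch size 120, horizontal-flip augmentation assumed in the data loader)
# for epoch in range(5):
#     for images, labels in loader:
#         loss = criterion(model(images), labels)
#         optimizer.zero_grad(); loss.backward(); optimizer.step()
#     scheduler.step()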

4.2 Results and Evaluation

First we evaluate the proposed vMFML by comparing it to state-of-the-art loss functions. Next, we evaluate our vMF-FL based FR method on the most common and challenging FR benchmark datasets and compare it to the state-of-the-art FR methods.

4.2.1 Comparison of the Loss Functions

To gain insight into the effectiveness of the proposed loss function, we compare the proposed vMFML with several state-of-the-art loss functions using two CNN architectures of different depth along with two training datasets of different size. Table 2 presents the results, where we use the LFW [25] face verification accuracy as the performance measure. We consider three commonly used loss functions for FR (Footnote 10: note that we do not compare with the contrastive loss [11], as the joint softmax+center loss (JSCL) [68] has been shown to be more efficient): the Softmax loss, the joint softmax+center loss (JSCL) [68] and softmax followed by the triplet loss (STL) [54]. Note that we use the same CNN model and the best known training settings (learning rate, regularization, etc.) for each individual loss function during the experiments.

TABLE 2: Comparison of different loss functions evaluated w.r.t. different CNN architectures and training datasets. The LFW [25] verification accuracy is used as the performance measure. Loss functions: vMF Mixture Loss (vMFML), Softmax loss, joint softmax+center loss (JSCL) [68] and softmax followed by triplet loss (STL) [54].

Database | CNN | Softmax | JSCL | STL | vMFML
Casia | CasiaNet | 97.50 | 97.60 | 97.93 | 98.00
Casia | Res-27 | 97.40 | 98.87 | 98.13 | 99.18
MSCeleb-1M | Res-27 | 98.50 | 99.28 | 98.83 | 99.63
MSCeleb-1M | CasiaNet | 98.21 | 98.33 | 98.42 | 98.65

First, we use the CASIA dataset [73] and train the shallower CNN proposed in [73], called CasiaNet. We observe that: vMFML (98%) > STL (97.93%) > JSCL (97.6%) > softmax (97.5%). Next, we train a deeper CNN, called Res-27 (Section 3.4.1), with the same dataset and observe that: vMFML (99.18%) > JSCL (98.87%) > STL (98.13%) > softmax (97.4%). Therefore, vMFML achieves a better result (from 98% to 99.18%) with a deeper CNN. Fig. 5 illustrates the corresponding receiver operating characteristic (ROC) curves for this comparison. Next, we train Res-27 with MS-Celeb-1M [19] and observe that: vMFML (99.63%) > JSCL (99.28%) > STL (98.83%) > Softmax (98.50%). That means vMFML further improves its accuracy (from 99.18% to 99.63%) when trained with a larger dataset. We also analyze the influence of a larger dataset by training CasiaNet with MS-Celeb-1M [19] and observe that: vMFML (98.65%) > STL (98.41%) > JSCL (98.33%) > Softmax (98.21%). This observation further verifies the importance of a larger dataset for achieving better results with deep CNNs, including the proposed vMFML.

Fig. 5: Illustration of the ROC curves for different loss functions: vMF Mixture Loss (vMFML), Softmax loss, joint softmax+center loss (JSCL) [68] and softmax followed by triplet loss (STL) [54]. Results were obtained using the same CNN settings, trained with the MS-Celeb-1M [19] dataset.

The above analyses indicate that: 1) the proposed vMFML outperforms the state-of-the-art loss functions commonly used for FR; 2) Res-27 is an appropriate CNN model; and 3) MS-Celeb-1M [19] is the best choice of training dataset for developing our FR method (see Sect. 3.4 and 4.1). In the next sections, we compare our FR method with the state of the art for a variety of face verification tasks using different benchmarks. Besides, we conduct additional experiments and analysis to identify several common issues related to CNN training and discuss them in Sect. 4.3.

4.2.2 Labeled Faces in the Wild (LFW)

The LFW dataset [25] is the de facto benchmark for evaluating unconstrained FR methods based on single image comparison. It consists of 13,233 images of 5,749 identities. Fig. 6(a) illustrates some face images from this dataset. The FR task requires verifying 6000 image pairs, which are equally divided into genuine and impostor pairs and comprise in total 7.7K images of 4,281 identities. The LFW evaluation requires face verification in 10 pre-specified folds and reports the average accuracy. It offers different evaluation protocols; following the recent trend, we adopt the unrestricted-labeled-outside-data protocol. Table 3 presents the results of our method in comparison with other state-of-the-art methods.

TABLE 3: Comparison of the state-of-the-art methods evaluated on the LFW benchmark [25].

FR method | # of CNNs | Dataset Info | Acc %
vMF-FL (proposed) | 1 | 4.51M, 61.24K | 99.63
Baidu [36] | 10 | 1.2M, 1.8K | 99.77
Baidu [36] | 1 | 1.2M, 1.8K | 99.13
L2-Softmax [48] | 1 | 3.7M, 58.2K | 99.78
FaceNet [54] | 1 | 200M, 8M | 99.63
DeepVisage [20] | 1 | 4.48M, 62K | 99.62
RangeLoss [76] | 1 | 1.5M, 100K | 99.52
Sparse ConvNet [57] | 25 | 0.29M, 12K | 99.55
DeepID3 [55] | 25 | 0.29M, 12K | 99.53
Megvii [78] | 4 | 5M, 0.2M | 99.50
LF-CNNs [67] | 25 | 0.7M, 17.2K | 99.50
DeepID2+ [56] | 25 | 0.29M, 12K | 99.47
SphereFace [38] | 1 | 0.49M, 10.57K | 99.42
Center Loss [68] | 1 | 0.7M, 17.2K | 99.28
NormFace [66] | 1 | 0.49M, 10.5K | 99.19
MM-DFR [14] | 8 | 0.49M, 10.57K | 99.02
VGG Face [45] | 1 | 2.6M, 2.6K | 98.95
MFM-CNN [70] | 1 | 5.1M, 79K | 98.80
VIPLFaceNet [40] | 1 | 0.49M, 10.57K | 98.60
Webscale [59] | 4 | 4.5M, 55K | 98.37
AAL [72] | 1 | 0.49M, 10.57K | 98.30
FSS [65] | 9 | 0.49M, 10.57K | 98.20
Face-Aug-Pose-Syn [43] | 1 | 2.4M, 10.57K | 98.06
CASIA-Webface [73] | 1 | 0.49M, 10.57K | 97.73
Unconstrained FV [9] | 1 | 0.49M, 10.5K | 97.45
Deepface [58] | 3 | 4.4M, 4K | 97.35

As can be seen in Table 3, our method achieves a very competitive accuracy (99.63%) and ranks among the top performers, despite the fact that: (a) L2-Softmax [48] used a 100 layer CNN model (3.7 times deeper/larger than ours) to achieve 99.78% (Footnote 11: they achieved 99.60% when using the same CNN model that we used); (b) Baidu [36] combines 10 CNNs to obtain 99.77%, whereas we use a single CNN; and (c) FaceNet [54] used 200M images of 8M identities to obtain 99.63%, whereas we train our CNN with only 4.51M images and 61.24K identities. It is interesting to note that our LFW result (99.63%) is obtained by training the CNN on the cleaned MSCeleb dataset from which the LFW-overlapping identities were removed. When the incorrectly labeled pairs within LFW are removed (see the Errata on the LFW webpage), our proposed vMF-FL further displays 99.68% accuracy. In light of the results from [36], [48], [55], [56], [57], [67], [78], the performance of the proposed method could be further improved by combining features from multiple CNN models or by using deeper (e.g., 100 layer) CNNs. However, Table 3 suggests saturation of the LFW results, as most of the methods already surpass human performance (97.53%) and differ very little from each other. Besides, it raises the debate of whether we should judge a method w.r.t. the real world FR scenario [35] based on the matching of only 6K pairs. To overcome this limitation, we follow the BLUFR LFW protocol [35], which exploits all LFW images, verifies 47M pairs per trial and reports the true accept rate (TAR) at a low false accept rate (FAR). In Table 4, we provide our result for the verification rate (VR) at FAR=0.1% and compare it with the other methods which have reported results under this protocol. As can be seen in Table 4, our method displays the best performance among the FR methods with published results so far.

TABLE 4: Comparison of the state-of-the-art methods based on the BLUFR LFW protocol [35].

Method | VR@FAR=0.1%
vMF-FL (proposed) | 99.10
DeepVisage [20] | 98.65
NormFace [66] | 95.83
Center Loss [68] | 92.97
FSS [65] | 89.80
CASIA [73] | 80.26

Besides evaluating with different performance measures, we further challenge our proposed FR method on LFW from the age invariant perspective and benchmark it on the recently released variant of LFW, called the Cross-Age LFW (CALFW) dataset [77]. CALFW is constructed by re-organizing (via crowdsourcing) the LFW verification pairs with apparent age gaps (as large as possible) to form the positive pairs and then selecting negative pairs from individuals of the same race and gender (see Fig. 6(b) for some illustrations). Similar to LFW, the CALFW evaluation consists of verifying 6000 pairs of images (equally divided into genuine and impostor pairs) in 10 pre-specified folds and reporting the average accuracy. Table 5 synthesizes the experimental results in comparison with the state of the art. As can be seen from Table 5, our method provides the best result compared to the other available results from different methods.

TABLE 5: Evaluation on the CALFW [77] dataset. Results of the competing methods are obtained from the paper [77].

Method | vMF-FL (proposed) | VGG-Face | Noisy Softmax
Acc % | 94.20 | 86.50 | 82.52
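The LFW and CALFW numbers above are averages over 10 pre-specified folds of 600 pairs each. For clarity, here is an illustrative sketch (ours; NumPy assumed, and only an approximation of the official protocol tooling): for each fold the decision threshold is chosen on the remaining nine folds and the accuracy is measured on the held-out fold.

import numpy as np

def fold_accuracy(scores, labels, threshold):
    return np.mean((scores >= threshold) == labels)

def ten_fold_accuracy(scores, labels, fold_ids):
    # scores: cosine similarities; labels: 1 genuine / 0 impostor; fold_ids: 0..9 per pair
    accs = []
    for k in range(10):
        train = fold_ids != k
        test = fold_ids == k
        # pick the threshold that maximizes accuracy on the nine training folds
        candidates = np.unique(scores[train])
        best_t = max(candidates, key=lambda t: fold_accuracy(scores[train], labels[train], t))
        accs.append(fold_accuracy(scores[test], labels[test], best_t))
    return float(np.mean(accs)), float(np.std(accs))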

The results in Tables 3, 4 and 5 confirm the remarkable performance of vMF-FL on the LFW dataset. Next, we explore the FR task on videos and evaluate our method on the YouTube Faces [69] dataset.

TABLE 6: Comparison of the state-of-the-art methods evaluated on YouTube Faces [69]. Ad.Tr. denotes whether additional training is used. Methods below the separator follow the unrestricted protocol.

FR method | Ad.Tr. | Accuracy (%)
vMF-FL (proposed) | N | 96.46
DeepVisage [20] | N | 96.34
L2-Softmax [48] | N | 96.08
VGG Face [45] | N | 91.60
Sparse ConvNet [57] | N | 93.50
FaceNet [54] | N | 95.18
SphereFace [38] | N | 95.00
Center Loss [68] | N | 94.90
NormFace [66] | Y | 94.72
RangeLoss [76] | N | 93.7
DeepID2+ [56] | N | 93.20
MFM-CNN [70] | N | 93.40
CASIA-Webface [73] | Y | 92.24
Deepface [58] | Y | 91.40
--- unrestricted protocol ---
VGG Face [45] | Y | 97.30
NAN [71] | Y | 95.72

4.2.3 YouTube Faces (YTF)

The YTF database [69] (see Fig. 6(c) for some illustrations) is the de facto benchmark for evaluating FR methods on unconstrained videos. It consists of 3,425 videos of 1,595 identities. Evaluation on YTF requires matching 5000 video pairs in a 10-fold experiment and reporting the average accuracy. Each fold consists of 500 video pairs and ensures the subject-mutually-exclusive property. We follow the restricted protocol of YTF, which restricts access to only the similarity information. Table 6 provides the result of our method and compares it with the state-of-the-art methods.

The results in Table 6 show that our method provides the best accuracy (96.46%) on this dataset for the restricted protocol. In Table 6, we also report the results (separated by a horizontal line) obtained under the unrestricted protocol, i.e., with access to the similarity and identity information of the test data (Footnote 12: access to the identity information helps to create a large number of similar and dissimilar image pairs and hence can be further used to train or fine-tune the CNN with an additional loss function, e.g., the contrastive or the triplet loss [45]). The comparison indicates that our method is very competitive even with the results obtained under the unrestricted protocol. Moreover, by comparing the VGG Face [45] results under both protocols, we observe that its accuracy increases significantly (from 91.6% restricted to 97.3% unrestricted) when the CNN is fine-tuned on the YTF dataset. Therefore, this comparison indicates that our result (96.46%) could be further enhanced if we trained or fine-tuned our CNN using the YTF data.

4.2.4 Cross-Age Celebrity Dataset (CACD)

The CACD dataset [7] aims to add further challenges to the FR task by explicitly focusing on the age invariant scenario (see Fig. 6(d) for some illustrations). Therefore, it ensures large variations of the ages within the collected database of faces in the wild. CACD is constructed by collecting 163,446 images of 2,000 identities, where the ages range from 16 to 62. FR with the CACD dataset requires evaluating the similarity of 4,000 pairs of images, which are equally split into a ten-fold experimental setting. The average of these ten fold accuracies is used as the evaluation measure. In Table 7, we provide the results of vMF-FL and compare it with the state-of-the-art methods.

TABLE 7: Comparison of the state-of-the-art methods evaluated on the CACD [7] dataset. The VGG [45] result is obtained from [70].

FR method | Accuracy (%)
vMF-FL (proposed) | 99.20
DeepVisage [20] | 99.13
LF-CNNs [67] | 98.50
MFM-CNN [70] | 97.95
VGG Face [45] | 96.00
CARC [7] | 87.60
Human, Avg. | 85.70
Human, Voting [7] | 94.20

The results in Table 7 show that our method provides the best accuracy. Moreover, it is better than LF-CNN [67], which is a recent method specialized for age invariant face recognition.

4.2.5 IARPA Janus Benchmark A (IJB-A)

The recently proposed IJB-A database [30] was developed with the aim of adding more challenges to the FR task by collecting facial images with wide variations in pose, illumination, expression, resolution and occlusion (see Fig. 6(e) for some illustrations). IJB-A is constructed by collecting 5,712 images and 2,085 videos of 500 identities, with an average of 11.4 images and 4.2 videos per identity. We follow the compare protocol of IJB-A because it measures the face verification accuracy. This protocol requires comparisons among pairs of templates, where each template consists of a set of images and video frames. The evaluation protocol of IJB-A computes the true accept rate (TAR) at different fixed false accept rates (FAR), e.g., 0.01 and 0.001. The evaluation requires computing the metrics over ten random-split experimental settings. Note that we do not use the training data of each split and only evaluate on the test data using our once-trained FR method.

In Table 8, we present the results of vMF-FL and compare it with the state-of-the-art methods. We separate the results (with a horizontal line) to distinguish two categories of methods: (1) methods that do not use the IJB-A training data and evaluate the test sets with a pre-trained CNN; our method belongs to this category; and (2) methods that use the IJB-A training data and apply additional training, such as CNN fine-tuning or metric learning.

From the results in Table 8, we observe that, while L2-softmax [48] provides the best result, our method provides competitive results among the others. We manually investigated our performance on IJB-A and observed that a large number of images were incorrectly pre-processed because of failures of the face and landmark detection performed by the pre-processor. This is due to the fact that a large part of this dataset consists of faces with very high pose variations and occlusions, see Fig. 6(e) for a few examples. Therefore, we conjecture that our performance could be improved considerably by applying a better pre-processor for this challenging database.

Note that there are numerous methods, such as TA [13], NAN [71] and TPE [52], which take the CNN features and incorporate an additional learning method to improve the results. Therefore, the features from vMF-FL could be used with them [13], [52], [71] to further improve the results.
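Since IJB-A reports TAR at fixed FAR values, the following short sketch (ours; NumPy assumed, illustrative rather than the official evaluation code) shows how such an operating point can be computed from genuine and impostor similarity scores.

import numpy as np

def tar_at_far(genuine_scores, impostor_scores, far=0.001):
    # threshold chosen so that the impostor (false) accept rate is approximately `far`
    impostor_scores = np.sort(np.asarray(impostor_scores))
    idx = int(np.ceil((1.0 - far) * len(impostor_scores))) - 1
    threshold = impostor_scores[idx]
    tar = np.mean(np.asarray(genuine_scores) > threshold)
    return float(tar), float(threshold)

# e.g., tar, thr = tar_at_far(genuine, impostor, far=0.001)  # TAR@FAR=0.001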

TABLE 8: Comparison of the state-of-the-art methods evaluated on the IJB-A benchmark [30]. '-' indicates the information for the entry is unavailable. Methods which incorporate external training (ExTr) or CNN fine-tuning (FT) with IJB-A training data are separated by a horizontal line. The VGG-Face result was provided by [52]. T@F denotes the True Accept Rate at a fixed False Accept Rate (TAR@FAR).

FR method | ExTr | FT | T@F 0.01 | T@F 0.001
vMF-FL (proposed) | N | N | 0.897 | 0.850
L2-softmax [48] | N | N | 0.968 | 0.938
DeepVisage [20] | N | N | 0.887 | 0.824
VGG Face [45] | N | N | 0.805 | 0.604
Face-Aug-Pose-Syn [43] | N | N | 0.886 | 0.725
Deep Multipose [1] | N | N | 0.787 | -
Pose aware FR [42] | N | N | 0.826 | 0.652
TPE [53] | N | N | 0.871 | 0.766
All-In-One [49] | N | N | 0.893 | 0.787
--- | --- | --- | --- | ---
L2-softmax [48] | Y | N | 0.970 | 0.943
All-In-One [49] + TPE | Y | N | 0.922 | 0.823
Sparse ConvNet [57] | Y | N | 0.726 | 0.460
FSS [65] | N | Y | 0.729 | 0.510
TPE [53] | Y | N | 0.900 | 0.813
Unconstrained FV [9] | Y | Y | 0.838 | -
TSE [52] | Y | Y | 0.790 | 0.590
NAN [71] | Y | N | 0.941 | 0.881
TA [13] | Y | N | 0.939 | 0.836
End-To-End [10] | N | Y | 0.787 | -

TABLE 9: Analysis of the influence of training databases with different sizes and numbers of classes. T@F denotes the True Accept Rate at a fixed False Accept Rate (TAR@FAR).

Training data (min samp/id) | Size, Classes | Acc % | T@F 0.01
MSCeleb-1M [19], 10 | 4.68M, 62.1K | 99.57 | 0.9963
MSCeleb-1M [19], 30 | 4.61M, 61.2K | 99.63 | 0.9971
MSCeleb-1M [19], 50 | 3.91M, 46.7K | 99.55 | 0.9960
MSCeleb-1M [19], 70 | 3.11M, 32.1K | 99.53 | 0.9956
MSCeleb-1M [19], 100 | 1.53M, 12.5K | 99.35 | 0.9928
CASIA [73] | 0.43M, 10.6K | 99.18 | 0.9902
Pose-CASIA [43] | 1.26M, 10.6K | 99.21 | 0.9907
VGG Faces [45] | 1.6M, 2.6K | 98.45 | 0.9787

4.2.6 Summary of the experimental results
Results of vMF-FL on different datasets prove that, besides achieving significant results, it generalizes very well and overcomes several difficulties which make unconstrained FR a challenging task.

4.3 Additional Analysis and Discussion
In this section, we perform an in-depth analysis to provide further insight into the proposed method. We first study the sensitivity of the proposed vMFML parameters (Sect. 4.3.1). Then, we conduct further experiments and analysis to discuss the influence of several CNN-training-related aspects, such as: (a) training dataset size (Sect. 4.3.2); (b) activation functions (Sect. 4.3.3) and (c) normalization (Sect. 4.3.4). Finally, we also analyze the limitations of the proposed method (Sect. 4.3.5). In order to evaluate, we observe the accuracy and TAR@FAR=0.01 on the LFW benchmark.

4.3.1 Sensitivity analysis of the vMFML parameters
The proposed vMFML features two parameters, namely the concentration κ and the mean µ, whose sensitivity is analyzed here. In general, we can initialize κ with a small value (e.g., 1) and learn it via backpropagation. However, we observe that the trained κ may provide sub-optimal results, especially when it is trained with a large number of CNN parameters and updated with the same learning rate. To overcome this, we set κ to an approximated value κ = √(d/2) and set its learning rate by multiplying the global learning rate by a small value (0.001); a short code sketch of this κ handling is given after Sect. 4.3.3. An alternative choice is to set κ to a fixed value for the entire training period, where the value is determined empirically. Our best results are achieved with κ = 16. On the other hand, the parameter µ does not exhibit any particular sensitivity and is learned in the same way as the other CNN parameters.

4.3.2 Impact of training dataset size
Training dataset size matters and plays an important role in learning good feature representations with deep CNN models [54], [78]. To investigate its effect on our FR method, we first train the Res-27 CNN model (Sect. 3.4.1) with vMFML (Sect. 3.2.2) using the well-known training datasets MSCeleb-1M [19], CASIA-Webface [73] and VGG Faces [45]. For the MSCeleb-1M [19] dataset, we created different subsets by selecting a certain minimum number of images per identity. Experiments with these datasets and subsets help us to understand the learning capacity of the proposed FR method and identify the training dataset requirements to achieve better performance. Table 9 presents the results, which show that the proposed method learns facial representations reasonably well from a wide range of differently sized datasets. Additional observations are as follows:
• Performance does not improve significantly with the increase of the number of images and identities. We achieved only a 0.28% improvement by enlarging the number of images 3 times and the number of identities 5 times (compare the results of MSCeleb with 100 min-samp/id against MSCeleb with 10 min-samp/id). Perhaps it is more important to ensure a larger number of samples per identity (e.g., 100).
• Synthesized images do not help significantly. By comparing CASIA [73] with pose-augmented CASIA (a ≈3 times larger dataset) [43], we observe only a 0.03% accuracy improvement. Perhaps the synthesized image quality affects the results.
• The number of identities is as important as the number of images. We observe that the VGG Face [45] dataset provides lower accuracy than the CASIA [73] dataset despite having ≈4 times more images (but ≈4 times fewer identities).
Therefore, the key finding from this experiment is that, while we can achieve reasonably good and comparable results with relatively smaller datasets (CASIA and MSCeleb with 100 min-samp/id), we can achieve better results by training the CNN with a larger dataset which provides a sufficient number of images per identity.

4.3.3 Influence of activation function
In order to observe the influence of activation functions, we also used the ReLU activation instead of PReLU and observe that it decreases the accuracy by approximately 0.4%.
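As a concrete illustration of the κ handling described in Sect. 4.3.1, the following PyTorch-style sketch shows a vMFML-like output layer with unit-normalized features, unit-normalized class means µ, and a shared concentration κ initialized to √(d/2) and updated with a learning rate scaled by 0.001. This is our illustrative code, not the authors' implementation: the class and variable names, the 512-dimensional feature size, and the reading of the approximation as √(d/2) are assumptions, and the cross-entropy over these logits stands in for the loss of Eq. 9 up to constant terms.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class VMFMLHead(nn.Module):
    """Sketch of a vMF mixture loss head: logits = kappa * mu^T x,
    with ||mu_j|| = 1 and ||x|| = 1 (hypothetical names)."""
    def __init__(self, feat_dim, num_classes):
        super().__init__()
        self.mu = nn.Parameter(torch.randn(num_classes, feat_dim))
        # kappa initialized to the approximation sqrt(d / 2) from Sect. 4.3.1
        self.kappa = nn.Parameter(torch.tensor(math.sqrt(feat_dim / 2.0)))

    def forward(self, features, labels):
        x = F.normalize(features, dim=1)        # unit-normalized features
        mu = F.normalize(self.mu, dim=1)        # unit-normalized class means
        logits = self.kappa * F.linear(x, mu)   # kappa * mu^T x
        return F.cross_entropy(logits, labels)

# Example: kappa receives a learning rate 0.001 times smaller than the rest.
head = VMFMLHead(feat_dim=512, num_classes=10000)
base_lr = 0.1
optimizer = torch.optim.SGD(
    [{"params": [head.mu]},
     {"params": [head.kappa], "lr": base_lr * 0.001}],
    lr=base_lr, momentum=0.9)
```

With a 512-dimensional feature (an assumption here), √(d/2) = 16, which is consistent with the empirically best fixed value reported above.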
4.3.4 Feature normalization
Feature normalization [20], [28], [66] plays a significant role in the performance of deep CNN models. As discussed in Section 3.3, vMFML naturally integrates feature normalization due to the SFR model (Sect. 3.1.1) and the vMF distribution (Sect. 3.2.1). The features of our FR method are unit-normalized vectors, i.e., ||x|| = 1. We observed that, when learned with the proposed vMFML, these normalized features provide significantly better results than the un-normalized features learned with the Softmax loss; see Table 2 for the performance comparison of Softmax vs vMFML.
The promising results achieved by the unit-normalized features (learned with vMFML) naturally raise the question: can unit-normalized features improve the accuracy with any loss function? To answer this question, we train the CNN with unit-normalized features ||x|| = 1 and optimize the softmax loss under different settings in order to compare them w.r.t. vMFML. To explain the settings, we examine the terms within the exponential of the numerators, i.e., w^T x + b_{y_i} (Softmax) versus κµ^T x (vMFML). Our observations are:
• w^T x + b_{y_i}: provides very poor results, because the CNN training fails to converge; it gets stuck at arbitrary local minima even with a range of different learning rates (from 0.1 to 0.0001) and with or without applying L2 regularization. Compared to vMFML, the expectation from this setting is that the learned w values become similar to the values observed from κµ^T, such that ||w|| = κ; that is, this setting verifies whether the learned weights can absorb κ within them. However, as the observed results suggest, it fails to do so.
• w^T (κx) + b_{y_i}: the CNN successfully trains and provides results close to those of vMFML. For this modified Softmax, κ acts as a scalar multiplier, whereas for vMFML κ signifies the shared concentration parameter value. Interestingly, this modified Softmax is the same as the L2-Softmax loss recently proposed by [48], which is motivated by the fact that the L2-norm of the features (learned with the Softmax loss) provides interesting information about the face image quality and attributes. Intuitively, the idea is to give the same attention to all face images regardless of their quality. Moreover, they interpret the multiplier (here κ) as a constraint on the features to lie on a hypersphere of a fixed radius. Note that, if the bias is ignored, the difference between vMFML and L2-Softmax reduces to w vs µ. Besides, while the softmax loss imposes no constraints on the weights w (except L2 regularization, which is an explicit setting in many ML problems), vMFML applies a natural constraint on µ, i.e., ||µ|| = 1.
Therefore, the key finding from the above observations is that unit-normalized features alone are not sufficient, and the concentration parameter κ of vMFML plays a significant role in efficiently learning the CNN models; the three logit settings are written out side by side in the short sketch below.
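To make the three settings above concrete, the following minimal NumPy sketch writes out the logits that each variant feeds into the softmax cross-entropy. It is illustrative only; the variable names, dimensions and the value κ = 16 are assumptions, not the authors' code.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d, M = 4, 512, 10                                 # batch size, feature dim, classes
f = rng.normal(size=(N, d))                          # raw CNN features
x = f / np.linalg.norm(f, axis=1, keepdims=True)     # unit-normalized features
W = rng.normal(size=(M, d))                          # softmax weight vectors w_j
b = np.zeros(M)                                      # softmax biases
mu = W / np.linalg.norm(W, axis=1, keepdims=True)    # unit-normalized means mu_j
kappa = 16.0                                         # shared concentration

logits_softmax   = x @ W.T + b                       # w^T x + b_{y_i}
logits_l2softmax = (kappa * x) @ W.T + b             # w^T (kappa x) + b_{y_i}
logits_vmfml     = kappa * (x @ mu.T)                # kappa mu^T x, ||mu_j|| = 1
```

The only structural difference between the last two lines is that vMFML constrains the class vectors to unit norm (µ_j = w_j / ||w_j||), whereas L2-Softmax leaves ||w_j|| free; this is exactly the w vs µ distinction discussed above.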
4.3.5 Limitations of the proposed method
Finally, we investigate the limitations of the proposed method on different datasets by observing the face image pairs for which the verification results are incorrect. Table 10 provides information about the number and type of incorrect cases, which indicates a higher ratio of false rejections compared to false acceptances. Figure 6 provides illustrations of the top examples of failure cases on different datasets (selected based on the distance from the threshold). From an in-depth analysis of the erroneous results, our observations are as follows:
• On the LFW [25] failure cases, occlusion, variation of illumination and poor image resolution played an important role. From the erroneous CALFW [77] pairs, we observe that poor image resolution appears as a common property. Besides, the falsely rejected pairs suffer from a high age difference.
• Failure cases on the YTF [69] dataset can be characterized by variations of illumination and poor image resolution. Besides, high pose variation plays an important role.
• Most of the CACD incorrect results occurred due to falsely rejecting similar face image pairs, which indicates that our method encounters difficulties in recognizing the same person when the age difference is large. Besides, we observe that variations of illumination commonly appear in the incorrect results.
• The large number of errors on the IJB-A dataset can be characterized by high pose (mostly profile images, yaw angle of more than 60 degrees) and poor image resolution. Unfortunately, both of these causes lead to the failure of face and landmark detection, which forced us to leave a large number of images without applying any pre-processing. Note that our training dataset does not contain any image for which the pre-processor failed to detect the face and landmarks. Therefore, our method may perform poorly in case of pre-processor failure.
From the above observations, we can particularly focus on several issues to handle in the future, such as: (a) image resolution; (b) extreme pose variation; (c) lighting normalization and (d) occlusions.
Poor image resolution is a common issue among the failure cases on different datasets. This observation is similar to the recent research by Best-Rowden and Jain [4], which proposed a method to automatically predict face image quality. They found that the results on the IJB-A [30] dataset can be improved by removing low-quality faces before computing the similarity scores. Therefore, we can apply this approach for certain FR cases where multiple images are available for each identity. However, this approach will not work when only a single image is available per identity or when all available images have poor quality. In order to address this, we can incorporate a face image super-resolution technique [26] as part of our pre-processor. The idea is to apply super-resolution to those images which have a very low image quality score computed by techniques such as [4].
Large facial pose causes degradation of FR performance [27], [43]. It has been addressed in recent research [1], [42], [43], which proposed to overcome this problem by frontalization [58] and by creating a larger training dataset with synthesized facial images covering a range of facial poses [1], [43]. However, the performance of the above approaches
depends on the quality of the frontalization or synthesization. Recently, [27] proposed a method to generate photo-realistic facial images. We can adopt this proposal within our pre-processor to overcome the large-pose-related issues.
In order to deal with the occlusion-related problems, we can adopt the deep-feature-interpolation-based approach [61] to recover missing face attributes. Besides, to deal with eyeglass-related occlusions, we can enlarge the training images per identity by synthesizing faces with eyeglasses using a generative method, such as [31].
In order to deal with the lighting-related issues, we can adopt recent techniques, such as [34] for the specular reflection problem and [75] for shadow-related problems. A large age difference in a same-person image pair reduces the performance of our method. One possible way to overcome this problem is by augmenting the training set with more images with age variation (age-based facial image synthesis with generative methods [2], [61]). However, we must ensure that the synthesized images remain photo-realistic.

TABLE 10: Analysis of the number of incorrect results made by the proposed method on different FR datasets.

Dataset | # of Errors | # of False Accept | # of False Reject
LFW [25] | 22 | 7 | 15
CALFW [77] | 368 | 111 | 257
YTF [69] | 197 | 13 | 184
CACD [7] | 29 | 3 | 26
IJB-A [30] | 855 | 133 | 722

Fig. 6: Illustration of the errors made by our method on different datasets: (a) LFW [25], threshold distance 0.665; (b) CALFW [77], threshold distance 0.66; (c) YTF [69], threshold distance 0.72; (d) CACD [7], threshold distance 0.61; (e) IJB-A [30], threshold distance 0.49. The top row of each sub-figure provides examples of the false rejected pairs whereas the bottom row provides examples of the false accepted pairs.
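For completeness, the error breakdown of Table 10 can be reproduced from the pair distances and the per-dataset decision thresholds listed in the Fig. 6 caption. The sketch below is illustrative (the variable names and the use of a cosine-distance threshold are assumptions), not the authors' analysis script.

```python
import numpy as np

def error_breakdown(dist, same, threshold):
    """Count false accepts and false rejects for one verification run.

    dist      : (N,) distances between the two images of each pair
    same      : (N,) boolean, True when the pair actually shows the same identity
    threshold : pairs with dist < threshold are predicted 'same'
    """
    predicted_same = dist < threshold
    false_accept = np.sum(predicted_same & ~same)   # different persons accepted
    false_reject = np.sum(~predicted_same & same)   # same person rejected
    return int(false_accept + false_reject), int(false_accept), int(false_reject)

# Example with the LFW threshold distance reported in Fig. 6(a):
# errors, fa, fr = error_breakdown(dist, same, threshold=0.665)
```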
5 CONCLUSION
We proposed a novel directional deep-feature learning method for FR by exploiting the concept of model-based clustering and combining it with deep CNN models. First, we used the vMF mixture model as the theoretical basis to propose a statistical feature representation (SFR) model. Next, we developed an effective directional feature learning method called vMF-FL, which formulates a novel loss function called vMFML. It has several interesting properties: (a) it learns discriminative features; (b) it subsumes different loss functions and normalization techniques; and (c) it interprets the relationships among parameters and object features. Extensive experiments on face verification confirm the efficiency and generalizability of vMF-FL, as it achieved very competitive and state-of-the-art results on the benchmark FR datasets. We foresee several future perspectives: (a) use the learned model to synthesize identity-preserving faces and enhance the training dataset; (b) explore the SFR model with generative adversarial networks; and (c) apply it to other vision tasks (e.g., scene analysis), other domains (e.g., NLP, speech analysis) and other tasks (e.g., clustering). Moreover, by dropping the equal-privilege assumption one can further analyze the variations within a class/cluster, which can be interesting for unsupervised problems.

APPENDIX A
EXPERIMENTAL JUSTIFICATION OF THE SFR MODEL
In this experiment, we experimentally justify the proposed SFR model (Sect. 3.1.1) and vMF-FL method (Sect. 3.1.2) by
exploiting the concept of deep convolutional autoencoders [29]. To conduct this experiment with the MNIST digits [33], we construct a deep autoencoder, called the vMF auto-encoder (vMF-AE). The encoder/inverse-transformer of vMF-FL consists of a 7-layer (6 convolution layers, 1 fully connected layer) deep architecture and the decoder/transformer of the SFR model consists of a 4-layer (1 fully connected layer, 3 de-convolution layers) deep architecture. Fig. 7(a) illustrates the vMF-AE method and Fig. 7(b) shows the generated samples from the SFR-MNIST model (columns 4-11).
vMF-AE combines the proposed SFR model with the vMF-FL method and learns in two steps. First, it uses the vMF-FL method/block to simultaneously learn a discriminative feature (vMF feature) representation and the vMF Mixture Model (vMFMM) by optimizing the vMF mixture loss (Sect. 3.2.2). Next, it learns the decoder/transformer of the SFR model to generate sample images from the features (learned by the vMF-FL method) by optimizing a pixel-wise binary cross-entropy loss. When both learning tasks are done, we can generate 2D images of digits by sampling features from the learned vMFMM. While column 3 of Fig. 7(b) shows the images generated from the different µj (learned vMFMM parameters), columns 4-11 of Fig. 7(b) show several 2D images generated from the sampled vMFMM features. We believe that these illustrations demonstrate the originality of the proposed SFR model and hence provide additional justification for the proposed vMF-FL method.

Fig. 7: (a) Illustration of the vMF-AE method with MNIST digits, where the vMF-FL (encoder/inverse-transformer) learns discriminative 3D features and a vMFMM for the 10 digit classes. In the 3D plot, each dot represents a feature and each line represents the mean (µj) of the respective class j. The images around the 3D sphere illustrate the images generated from the respective µj. (b) Illustration of the 2D images generated by the transformer/decoder of the SFR model, where column 1 represents the original image, column 2 shows the image generated from the encoded features of the original image, column 3 shows the image generated from the vMF mean (µj) of each class j, and columns 4-11 show 8 images from each digit class j generated from the vMF samples (sampled using the mean (µj) and concentration (κj) of each class).
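The generation step above requires drawing unit vectors from each learned vMF component (µj, κj). As a reference for the reader, the following is a minimal NumPy sketch of one standard way to do this, Wood's rejection sampler; the paper does not specify which sampler was used, so this choice is an assumption made for illustration.

```python
import numpy as np

def sample_vmf(mu, kappa, n, rng=np.random.default_rng()):
    """Draw n unit vectors from a vMF distribution on S^{d-1} with unit-norm
    mean direction mu and concentration kappa (Wood-style rejection sampling)."""
    d = mu.shape[0]
    # 1) Sample the component w along mu by rejection sampling.
    b = (-2.0 * kappa + np.sqrt(4.0 * kappa ** 2 + (d - 1) ** 2)) / (d - 1)
    x0 = (1.0 - b) / (1.0 + b)
    c = kappa * x0 + (d - 1) * np.log(1.0 - x0 ** 2)
    w = np.empty(n)
    k = 0
    while k < n:
        z = rng.beta(0.5 * (d - 1), 0.5 * (d - 1))
        t = (1.0 - (1.0 + b) * z) / (1.0 - (1.0 - b) * z)
        if kappa * t + (d - 1) * np.log(1.0 - x0 * t) - c >= np.log(rng.uniform()):
            w[k] = t
            k += 1
    # 2) Sample directions uniformly in the tangent space orthogonal to mu.
    v = rng.normal(size=(n, d))
    v -= np.outer(v @ mu, mu)                       # remove the mu component
    v /= np.linalg.norm(v, axis=1, keepdims=True)
    # 3) Combine: x = w * mu + sqrt(1 - w^2) * v (unit norm by construction).
    return w[:, None] * mu[None, :] + np.sqrt(1.0 - w[:, None] ** 2) * v

# To sample from the learned mixture: pick a class j, draw features from
# (mu_j, kappa_j), then feed them to the decoder/transformer to generate images.
```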
APPENDIX B
RELATIONSHIP WITH DIFFERENT LOSS FUNCTIONS AND NORMALIZATION METHOD

Center Loss [68] aims to enhance feature discrimination by minimizing the intra-class mean-squared distances. It has the following form [68]:

L_{Center} = \frac{1}{2} \sum_{i=1}^{N} \| f_i - c_{y_i} \|^2    (16)

where f_i and y_i are the features and ground-truth class label of the i-th sample, and c_{y_i} is the center of class y_i. By comparing Eq. 16 and Eq. 9, we see that vMFML incorporates the cosine/angular similarity with the term µ_{y_i}^T x_i, where µ_{y_i} = c_{y_i}/||c_{y_i}|| and x = f/||f||, and uses it to compute the posterior probability followed by computing the loss. Note that the higher the cosine similarity, the higher the probability and hence the lower the loss. Put differently, the loss in Eq. 9 is minimized when the cosine similarity between the sample x_i and the mean µ_{y_i} of its true class y_i is maximized. This indicates that vMFML minimizes the intra-class distance by incorporating the distance computation task within its formulation.
The center loss is used as a supplementary loss to the softmax loss, and the CNN learning task is achieved by joint (softmax+center) optimization. On the other hand, the above comparison of vMFML with both the softmax loss and the center loss reveals that vMFML can take the advantage of both with a single loss function and save M × D parameters, where D is the feature dimension and M is the number of classes in the training dataset. These additional parameters are used by the center loss to externally learn and save the centers.
Large-Margin Softmax Loss (LMSL) [39] is defined as:

L_{LMSL} = - \sum_{i=1}^{N} \log \frac{e^{\|w_{y_i}\| \|f_i\| \psi(\theta_{y_i})}}{e^{\|w_{y_i}\| \|f_i\| \psi(\theta_{y_i})} + \sum_{j \neq y_i} e^{\|w_j\| \|f_i\| \cos(\theta_j)}}    (17)

where ψ(θ_{y_i}) = (−1)^h cos(mθ) − 2h, with integer h ∈ [0, m − 1] and θ ∈ [hπ/m, (h+1)π/m].
Here m denotes the margin, f_i is the i-th image's features, y_i is the true class label, w_j is the weight corresponding to the j-th class and θ_j is the angle between w_j and f_i.
vMFML has similarity to LMSL when ||w_{y_i}|| = ||µ|| = 1, ∀ y_i, and ||f_i|| = ||x_i|| = 1. In this condition, LMSL requires mθ_{y_i} < θ_j (j ≠ y_i), i.e., the angle between the sample and its true class is smaller than that to the rest of the classes, subject to a margin multiplier m. With vMFML, this can be achieved by using a higher κ value (or by multiplying κ with a scalar multiplier m) for the true class compared to the rest, i.e., κ_{y_i} > κ_{j≠y_i}. By considering the multiplier as equivalent to the margin m, we can rewrite Eq. 9 as:

L_{vMFML} = - \sum_{i=1}^{N} \log \frac{e^{m \kappa \mu_{y_i}^T x_i}}{e^{m \kappa \mu_{y_i}^T x_i} + \sum_{j \neq y_i} e^{\kappa \mu_j^T x_i}}    (18)

Note that we do not use Eq. 18 in this work, because it does not strictly follow the statistical feature representation model proposed in this paper. The A (angular)-softmax loss [38] is a recent extension of the LMSL, which replaces the notion of margin with an angular margin, considers ||w_{y_i}|| = 1 and provides a loss formulation equivalent to Eq. 17. Therefore, Eq. 18 provides the relationship between vMFML and A-softmax [38] in a similar way as for LMSL.
Weight Normalization [51] proposed a reparameterization of the standard weight vector w as:

w = g \frac{v}{\|v\|}    (19)

where v is a vector, g is a scalar and ||v|| is the norm of v. This is related to vMFML by considering the weight vector w as:

w = \kappa \frac{\mu}{\|\mu\|} = \kappa \mu; \quad [\|\mu\| = 1]    (20)

The comparison of Eq. 19 and Eq. 20 reveals that vMFML incorporates the properties of weight normalization, subject to normalizing the activations of the CNN layer which are considered as the features.

REFERENCES
[1] W. AbdAlmageed, Y. Wu, S. Rawls, S. Harel, T. Hassner, I. Masi, J. Choi, J. Lekust, J. Kim, and P. Natarajan. Face recognition using deep multi-pose representations. In IEEE WACV, pages 1–9, 2016.
[2] G. Antipov, M. Baccouche, and J.-L. Dugelay. Face aging with conditional generative adversarial networks. arXiv preprint arXiv:1702.01983, 2017.
[3] A. Banerjee, I. S. Dhillon, J. Ghosh, and S. Sra. Clustering on the unit hypersphere using von Mises-Fisher distributions. Journal of Machine Learning Research, 6(Sep):1345–1382, 2005.
[4] L. Best-Rowden and A. K. Jain. Automatic face image quality prediction. arXiv preprint arXiv:1706.09887, 2017.
[5] A. Bhalerao and C.-F. Westin. Hyperspherical von Mises-Fisher mixture (HvMF) modelling of high angular resolution diffusion MRI. In Proc. of MICCAI, pages 236–243. Springer, 2007.
[6] C. Biernacki, G. Celeux, and G. Govaert. Assessing a mixture model for clustering with the integrated completed likelihood. IEEE TPAMI, 22(7):719–725, 2000.
[7] B.-C. Chen, C.-S. Chen, and W. H. Hsu. Face recognition and retrieval using cross-age reference coding with cross-age celebrity dataset. IEEE Trans. on Multimedia, 17(6):804–815, 2015.
[8] D. Chen, X. Cao, L. Wang, F. Wen, and J. Sun. Bayesian face revisited: A joint formulation. In ECCV, pages 566–579, 2012.
[9] J.-C. Chen, V. M. Patel, and R. Chellappa. Unconstrained face verification using deep CNN features. In IEEE WACV, pages 1–9, 2016.
[10] J.-C. Chen, R. Ranjan, A. Kumar, C.-H. Chen, V. Patel, and R. Chellappa. An end-to-end system for unconstrained face verification with deep convolutional neural networks. In Proc. of IEEE ICCV-W, pages 118–126, 2015.
[11] S. Chopra, R. Hadsell, and Y. LeCun. Learning a similarity metric discriminatively, with application to face verification. In Proc. of IEEE CVPR, pages 539–546, 2005.
[12] L. Chunjie, Y. Qiang, et al. Cosine Normalization: Using cosine similarity instead of dot product in neural networks. arXiv preprint arXiv:1702.05870, 2017.
[13] N. Crosswhite, J. Byrne, O. M. Parkhi, C. Stauffer, Q. Cao, and A. Zisserman. Template adaptation for face verification and identification. arXiv:1603.03958, 2016.
[14] C. Ding and D. Tao. Robust face recognition via multimodal deep face representation. IEEE Trans. on Multimedia, 17(11):2049–2058, 2015.
[15] B. Fernando, E. Fromont, D. Muselet, and M. Sebban. Supervised learning of Gaussian mixture models for visual vocabulary generation. Pattern Recognition, 45(2):897–907, 2012.
[16] J. Glover, G. Bradski, and R. B. Rusu. Monte Carlo pose estimation with quaternion kernels and the Bingham distribution. Robotics: Science and Systems VII, page 97, 2012.
[17] S. Gopal and Y. Yang. Von Mises-Fisher clustering models. In Proc. of ICML, pages 154–162, 2014.
[18] J. Gu, Z. Wang, J. Kuen, L. Ma, A. Shahroudy, B. Shuai, T. Liu, X. Wang, and G. Wang. Recent advances in convolutional neural networks. arXiv:1512.07108, 2015.
[19] Y. Guo, L. Zhang, Y. Hu, X. He, and J. Gao. MS-Celeb-1M: A dataset and benchmark for large-scale face recognition. CoRR, abs/1607.08221, 2016.
[20] A. Hasnat, J. Bohne, J. Milgram, S. Gentric, and L. Chen. DeepVisage: Making face recognition simple yet with powerful generalization skills. In Proc. of IEEE ICCV-W AMFG, Oct 2017.
[21] M. A. Hasnat, O. Alata, and A. Trémeau. Joint Color-Spatial-Directional clustering and Region Merging (JCSD-RM) for unsupervised RGB-D image segmentation. IEEE TPAMI, 38(11):2255–2268, 2016.
[22] M. A. Hasnat, O. Alata, and A. Trémeau. Model-based hierarchical clustering with Bregman divergences and Fishers mixture model: application to depth image analysis. Statistics and Computing, 26(4):861–880, 2016.
[23] K. He, X. Zhang, S. Ren, and J. Sun. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In Proc. of IEEE CVPR, pages 1026–1034, 2015.
[24] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proc. of IEEE CVPR, 2016.
[25] G. B. Huang, M. Ramesh, T. Berg, and E. Learned-Miller. Labeled faces in the wild: A database for studying face recognition in unconstrained environments. Technical Report 07-49, University of Massachusetts, Amherst, 2007.
[26] H. Huang, R. He, Z. Sun, and T. Tan. Wavelet-SRNet: A wavelet-based CNN for multi-scale face super resolution. In Proc. of IEEE CVPR, pages 1689–1697, 2017.
[27] R. Huang, S. Zhang, T. Li, and R. He. Beyond face rotation: Global and local perception GAN for photorealistic and identity preserving frontal view synthesis. arXiv preprint arXiv:1704.04086, 2017.
[28] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proc. of ICML, pages 448–456, 2015.
[29] D. P. Kingma and M. Welling. Auto-encoding variational Bayes. In Proc. of ICLR, 2014.
[30] B. F. Klare, B. Klein, E. Taborsky, A. Blanton, J. Cheney, K. Allen, P. Grother, A. Mah, M. Burge, and A. K. Jain. Pushing the frontiers of unconstrained face detection and recognition: IARPA Janus Benchmark A. In Proc. of IEEE CVPR, pages 1931–1939, 2015.
[31] A. B. L. Larsen, S. K. Sønderby, H. Larochelle, and O. Winther. Autoencoding beyond pixels using a learned similarity metric. arXiv preprint arXiv:1512.09300, 2015.
[32] E. Learned-Miller, G. B. Huang, A. RoyChowdhury, H. Li, and G. Hua. Labeled faces in the wild: A survey. In Advances in Face Detection and Facial Image Analysis, pages 189–248. Springer, 2016.
[33] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proc. of the IEEE, 86(11):2278–2324, 1998.
[34] C. Li, S. Lin, K. Zhou, and K. Ikeuchi. Specular highlight removal in facial images. In Proc. of IEEE CVPR, pages 3107–3116, 2017.
[35] S. Liao, Z. Lei, D. Yi, and S. Z. Li. A benchmark study of large-scale unconstrained face recognition. In Proc. of IEEE IJCB, pages 1–8, 2014.
[36] J. Liu, Y. Deng, and C. Huang. Targeting ultimate accuracy: Face recognition via deep embedding. arXiv:1506.07310, 2015.
[37] M. Liu, B. C. Vemuri, S.-I. Amari, and F. Nielsen. Shape retrieval using hierarchical total Bregman soft clustering. IEEE TPAMI, 34(12):2407–2419, 2012.
[38] W. Liu, Y. Wen, Z. Yu, M. Li, B. Raj, and L. Song. SphereFace: Deep hypersphere embedding for face recognition. In Proc. of IEEE CVPR, 2017.
[39] W. Liu, Y. Wen, Z. Yu, and M. Yang. Large-Margin Softmax Loss for convolutional neural networks. In Proc. of ICML, pages 507–516, 2016.
[40] X. Liu, M. Kan, W. Wu, S. Shan, and X. Chen. VIPLFaceNet: An open source deep face recognition SDK. arXiv:1609.03892, 2016.
[41] K. V. Mardia and P. E. Jupp. Directional statistics, volume 494. Wiley, 2009.
[42] I. Masi, S. Rawls, G. Medioni, and P. Natarajan. Pose-aware face recognition in the wild. In Proc. of IEEE CVPR, pages 4838–4846, 2016.
[43] I. Masi, A. Tran, T. Hassner, J. T. Leksut, and G. Medioni. Do we really need to collect millions of faces for effective face recognition? In ECCV, 2016.
[44] K. P. Murphy. Machine learning: a probabilistic perspective. The MIT Press, 2012.
[45] O. M. Parkhi, A. Vedaldi, and A. Zisserman. Deep face recognition. Proc. of BMVC, 1(3):6, 2015.
[46] A. B. Patel, M. T. Nguyen, and R. Baraniuk. A probabilistic framework for deep learning. In Proc. of NIPS, pages 2558–2566, 2016.
[47] A. Prati, S. Calderara, and R. Cucchiara. Using circular statistics for trajectory shape analysis. In Proc. of IEEE CVPR, pages 1–8, 2008.
[48] R. Ranjan, C. D. Castillo, and R. Chellappa. L2-constrained softmax loss for discriminative face verification. arXiv preprint arXiv:1703.09507, 2017.
[49] R. Ranjan, S. Sankaranarayanan, C. D. Castillo, and R. Chellappa. An all-in-one convolutional neural network for face analysis. arXiv:1611.00851, 2016.
[50] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. IJCV, 115(3):211–252, 2015.
[51] T. Salimans and D. P. Kingma. Weight Normalization: A simple reparameterization to accelerate training of deep neural networks. In Proc. of NIPS, 2016.
[52] S. Sankaranarayanan, A. Alavi, C. Castillo, and R. Chellappa. Triplet probabilistic embedding for face verification and clustering. arXiv:1604.05417, 2016.
[53] S. Sankaranarayanan, A. Alavi, and R. Chellappa. Triplet similarity embedding for face verification. arXiv:1602.03418, 2016.
[54] F. Schroff, D. Kalenichenko, and J. Philbin. FaceNet: A unified embedding for face recognition and clustering. In Proc. of IEEE CVPR, 2015.
[55] Y. Sun, D. Liang, X. Wang, and X. Tang. DeepID3: Face recognition with very deep neural networks. arXiv:1502.00873, 2015.
[56] Y. Sun, X. Wang, and X. Tang. Deeply learned face representations are sparse, selective, and robust. In Proc. of IEEE CVPR, pages 2892–2900, 2015.
[57] Y. Sun, X. Wang, and X. Tang. Sparsifying neural network connections for face recognition. In Proc. of IEEE CVPR, 2016.
[58] Y. Taigman, M. Yang, M. Ranzato, and L. Wolf. DeepFace: Closing the gap to human-level performance in face verification. In Proc. of IEEE CVPR, pages 1701–1708, 2014.
[59] Y. Taigman, M. Yang, M. Ranzato, and L. Wolf. Web-scale training for face identification. In Proc. of IEEE CVPR, pages 2746–2754, 2015.
[60] Z. Tüske, M. A. Tahir, R. Schlüter, and H. Ney. Integrating Gaussian mixtures into deep neural networks: softmax layer with hidden variables. In Proc. of ICASSP, pages 4285–4289, 2015.
[61] P. Upchurch, J. Gardner, K. Bala, R. Pless, N. Snavely, and K. Weinberger. Deep feature interpolation for image content changes. arXiv preprint arXiv:1611.05507, 2016.
[62] A. Van den Oord and B. Schrauwen. Factoring variations in natural images with deep Gaussian mixture models. In Proc. of NIPS, pages 3518–3526, 2014.
[63] E. Variani, E. McDermott, and G. Heigold. A Gaussian mixture model layer jointly optimized with discriminative features within a deep neural network architecture. In Proc. of ICASSP, pages 4270–4274, 2015.
[64] D. H. T. Vu and R. Haeb-Umbach. Blind speech separation employing directional statistics in an expectation maximization framework. In Proc. of ICASSP, 2010.
[65] D. Wang, C. Otto, and A. K. Jain. Face search at scale. IEEE TPAMI, 2016.
[66] F. Wang, X. Xiang, J. Cheng, and A. L. Yuille. NormFace: L2 hypersphere embedding for face verification. In Proc. of the 25th ACM International Conference on Multimedia, 2017.
[67] Y. Wen, Z. Li, and Y. Qiao. Latent factor guided convolutional neural networks for age-invariant face recognition. In Proc. of IEEE CVPR, pages 4893–4901, 2016.
[68] Y. Wen, K. Zhang, Z. Li, and Y. Qiao. A discriminative feature learning approach for deep face recognition. In Proc. of ECCV, pages 499–515. Springer, 2016.
[69] L. Wolf, T. Hassner, and I. Maoz. Face recognition in unconstrained videos with matched background similarity. In Proc. of IEEE CVPR, pages 529–534, 2011.
[70] X. Wu, R. He, Z. Sun, and T. Tan. A light CNN for deep face representation with noisy labels. arXiv:1511.02683, 2015.
[71] J. Yang, P. Ren, D. Chen, F. Wen, H. Li, and G. Hua. Neural aggregation network for video face recognition. arXiv:1603.05474, 2016.
[72] H. Ye, W. Shao, H. Wang, J. Ma, L. Wang, Y. Zheng, and X. Xue. Face recognition via active annotation and learning. In Proc. of ACM Multimedia, pages 1058–1062, 2016.
[73] D. Yi, Z. Lei, S. Liao, and S. Z. Li. Learning face representation from scratch. arXiv:1411.7923, 2014.
[74] K. Zhang, Z. Zhang, Z. Li, and Y. Qiao. Joint face detection and alignment using multitask cascaded convolutional networks. IEEE Signal Processing Letters, 23(10):1499–1503, Oct 2016.
[75] W. Zhang, X. Zhao, J.-M. Morvan, and L. Chen. Improving shadow suppression for illumination robust face recognition. arXiv preprint arXiv:1710.05073, 2017.
[76] X. Zhang, Z. Fang, Y. Wen, Z. Li, and Y. Qiao. Range loss for deep face recognition with long-tail. arXiv:1611.08976, 2016.
[77] T. Zheng, W. Deng, and J. Hu. Cross-Age LFW: A database for studying cross-age face recognition in unconstrained environments. arXiv preprint arXiv:1708.08197, 2017.
[78] E. Zhou, Z. Cao, and Q. Yin. Naive-deep face recognition: Touching the limit of LFW benchmark or not? arXiv:1501.04690, 2015.