
CosFace: Large Margin Cosine Loss for Deep Face Recognition

Hao Wang, Yitong Wang, Zheng Zhou, Xing Ji, Dihong Gong, Jingchao Zhou,
Zhifeng Li∗, and Wei Liu∗
Tencent AI Lab
{hawelwang,yitongwang,encorezhou,denisji,sagazhou,michaelzfli}@tencent.com
[email protected] [email protected]

arXiv:1801.09414v2 [cs.CV] 3 Apr 2018

∗ Corresponding authors

Abstract

Face recognition has made extraordinary progress owing to the advancement of deep convolutional neural networks (CNNs). The central task of face recognition, including face verification and identification, involves face feature discrimination. However, the traditional softmax loss of deep CNNs usually lacks the power of discrimination. To address this problem, several loss functions such as center loss, large margin softmax loss, and angular softmax loss have recently been proposed. All these improved losses share the same idea: maximizing inter-class variance and minimizing intra-class variance. In this paper, we propose a novel loss function, namely large margin cosine loss (LMCL), to realize this idea from a different perspective. More specifically, we reformulate the softmax loss as a cosine loss by L2 normalizing both features and weight vectors to remove radial variations, based on which a cosine margin term is introduced to further maximize the decision margin in the angular space. As a result, minimum intra-class variance and maximum inter-class variance are achieved by virtue of normalization and cosine decision margin maximization. We refer to our model trained with LMCL as CosFace. Extensive experimental evaluations are conducted on the most popular public-domain face recognition datasets such as the MegaFace Challenge, YouTube Faces (YTF), and Labeled Faces in the Wild (LFW). We achieve state-of-the-art performance on these benchmarks, which confirms the effectiveness of our proposed approach.

Figure 1. An overview of the proposed CosFace framework. In the training phase, discriminative face features are learned with a large margin between different classes. In the testing phase, the testing data is fed into CosFace to extract face features, which are later used to compute the cosine similarity score to perform face verification and identification.

1. Introduction

Recent progress on the development of deep convolutional neural networks (CNNs) [15, 18, 12, 9, 44] has significantly advanced the state-of-the-art performance on a wide variety of computer vision tasks, which makes the deep CNN a dominant machine learning approach for computer vision. Face recognition, as one of the most common computer vision tasks, has been extensively studied for decades [37, 45, 22, 19, 20, 40, 2]. Early studies build shallow models with low-level face features, while modern face recognition techniques are greatly advanced by deep CNNs. Face recognition usually includes two sub-tasks: face verification and face identification. Both of these tasks involve three stages: face detection, feature extraction, and classification. A deep CNN is able to extract clean high-level features, making it possible to achieve superior performance with a relatively simple classification architecture: usually, a multilayer perceptron network followed by
a softmax loss [35, 32]. However, recent studies [42, 24, 23] found that the traditional softmax loss is insufficient to acquire the discriminating power for classification.

To encourage better discriminating performance, many research studies have been carried out [42, 5, 7, 10, 39, 23]. All these studies share the same idea for maximum discrimination capability: maximizing inter-class variance and minimizing intra-class variance. For example, [42, 5, 7, 10, 39] propose to adopt multi-loss learning in order to increase the feature discriminating power. While these methods improve classification performance over the traditional softmax loss, they usually come with some extra limitations. For [42], it only explicitly minimizes the intra-class variance while ignoring the inter-class variances, which may result in suboptimal solutions. [5, 7, 10, 39] require thoroughly scheming the mining of pair or triplet samples, which is an extremely time-consuming procedure. Very recently, [23] proposed to address this problem from a different perspective. More specifically, [23] (A-Softmax) projects the original Euclidean space of features to an angular space, and introduces an angular margin for larger inter-class variance.

Compared to the Euclidean margin suggested by [42, 5, 10], the angular margin is preferred because the cosine of the angle has intrinsic consistency with softmax. The formulation of cosine matches the similarity measurement that is frequently applied to face recognition. From this perspective, it is more reasonable to directly introduce a cosine margin between different classes to improve the cosine-related discriminative information.

In this paper, we reformulate the softmax loss as a cosine loss by L2 normalizing both features and weight vectors to remove radial variations, based on which a cosine margin term m is introduced to further maximize the decision margin in the angular space. Specifically, we propose a novel algorithm, dubbed Large Margin Cosine Loss (LMCL), which takes the normalized features as input to learn highly discriminative features by maximizing the inter-class cosine margin. Formally, we define a hyper-parameter m such that the decision boundary is given by cos(θ1) − m = cos(θ2), where θ_i is the angle between the feature and the weight vector of class i.

For comparison, the decision boundary of A-Softmax is defined over the angular space by cos(mθ1) = cos(θ2), which has a difficulty in optimization due to the non-monotonicity of the cosine function. To overcome such a difficulty, one has to employ an extra trick with an ad-hoc piecewise function for A-Softmax. More importantly, the decision margin of A-Softmax depends on θ, which leads to different margins for different classes. As a result, in the decision space, some inter-class features have a larger margin while others have a smaller margin, which reduces the discriminating power. Unlike A-Softmax, our approach defines the decision margin in the cosine space, thus avoiding the aforementioned shortcomings.

Based on the LMCL, we build a sophisticated deep model called CosFace, as shown in Figure 1. In the training phase, LMCL guides the ConvNet to learn features with a large cosine margin. In the testing phase, the face features are extracted from the ConvNet to perform either face verification or face identification. We summarize the contributions of this work as follows:

(1) We embrace the idea of maximizing inter-class variance and minimizing intra-class variance and propose a novel loss function, called LMCL, to learn highly discriminative deep features for face recognition.

(2) We provide reasonable theoretical analysis based on the hyperspherical feature distribution encouraged by LMCL.

(3) The proposed approach advances the state-of-the-art performance over most of the benchmarks on popular face databases including LFW [13], YTF [43], and MegaFace [17, 25].

2. Related Work

Deep Face Recognition. Recently, face recognition has achieved significant progress thanks to the great success of deep CNN models [18, 15, 34, 9]. In DeepFace [35] and DeepID [32], face recognition is treated as a multi-class classification problem and deep CNN models are first introduced to learn features on large multi-identity datasets. DeepID2 [30] employs identification and verification signals to achieve better feature embedding. Recent works DeepID2+ [33] and DeepID3 [31] further explore advanced network structures to boost recognition performance. FaceNet [29] uses triplet loss to learn a Euclidean space embedding, and a deep CNN is then trained on nearly 200 million face images, leading to state-of-the-art performance. Other approaches [41, 11] also prove the effectiveness of deep CNNs on face recognition.

Loss Functions. The loss function plays an important role in deep feature learning. Contrastive loss [5, 7] and triplet loss [10, 39] are usually used to increase the Euclidean margin for better feature embedding. Wen et al. [42] proposed a center loss to learn centers for the deep features of each identity and used the centers to reduce intra-class variance. Liu et al. [24] proposed a large margin softmax (L-Softmax) by adding angular constraints to each identity to improve feature discrimination. Angular softmax (A-Softmax) [23] improves L-Softmax by normalizing the weights, which achieves better performance on a series of open-set face recognition benchmarks [13, 43, 17]. Other loss functions [47, 6, 4, 3] based on contrastive loss or center loss also demonstrate performance gains in enhancing discrimination.

Normalization Approaches. Normalization has been studied in recent deep face recognition studies. [38] normalizes the weights, replacing the inner product with cosine
similarity within the softmax loss. [28] applies the L2 constraint on features to embed faces in the normalized space. Note that normalization on feature vectors or weight vectors achieves much lower intra-class angular variability by concentrating more on the angle during training. Hence the angles between identities can be well optimized. The von Mises-Fisher (vMF) based methods [48, 8] and A-Softmax [23] also adopt normalization in feature learning.

Figure 2. The comparison of decision margins of different loss functions (Softmax, NSL, A-Softmax, and LMCL) in binary-class scenarios. Dashed lines represent decision boundaries, and gray areas are decision margins.
3. Proposed Approach
In this section, we first introduce the proposed LMCL in detail (Sec. 3.1). Then a comparison with other loss functions is given to show the superiority of the LMCL (Sec. 3.2). The feature normalization technique adopted by the LMCL is further described to clarify its effectiveness (Sec. 3.3). Lastly, we present a theoretical analysis of the proposed LMCL (Sec. 3.4).

3.1. Large Margin Cosine Loss

We start by rethinking the softmax loss from a cosine perspective. The softmax loss separates features from different classes by maximizing the posterior probability of the ground-truth class. Given an input feature vector x_i with its corresponding label y_i, the softmax loss can be formulated as:

$$L_s = -\frac{1}{N}\sum_{i=1}^{N}\log p_i = -\frac{1}{N}\sum_{i=1}^{N}\log\frac{e^{f_{y_i}}}{\sum_{j=1}^{C} e^{f_j}}, \tag{1}$$

where p_i denotes the posterior probability of x_i being correctly classified, N is the number of training samples, and C is the number of classes. f_j is usually the activation of a fully-connected layer with weight vector W_j and bias B_j. We fix the bias B_j = 0 for simplicity, and as a result f_j is given by:

$$f_j = W_j^T x = \|W_j\|\,\|x\|\cos\theta_j, \tag{2}$$

where θ_j is the angle between W_j and x. This formula suggests that both the norm and the angle of the vectors contribute to the posterior probability.

To develop effective feature learning, the norm of W should be necessarily invariable. To this end, we fix ||W_j|| = 1 by L2 normalization. In the testing stage, the face recognition score of a testing face pair is usually calculated according to the cosine similarity between the two feature vectors. This suggests that the norm of the feature vector x does not contribute to the scoring function. Thus, in the training stage, we fix ||x|| = s. Consequently, the posterior probability merely relies on the cosine of the angle. The modified loss can be formulated as

$$L_{ns} = -\frac{1}{N}\sum_{i}\log\frac{e^{s\cos(\theta_{y_i,i})}}{\sum_{j} e^{s\cos(\theta_{j,i})}}. \tag{3}$$

Because we remove variations in radial directions by fixing ||x|| = s, the resulting model learns features that are separable in the angular space. We refer to this loss as the Normalized version of Softmax Loss (NSL) in this paper.

However, features learned by the NSL are not sufficiently discriminative because the NSL only emphasizes correct classification. To address this issue, we introduce the cosine margin to the classification boundary, which is naturally incorporated into the cosine formulation of Softmax.

Considering a binary-class scenario for example, let θ_i denote the angle between the learned feature vector and the weight vector of class C_i (i = 1, 2). The NSL forces cos(θ1) > cos(θ2) for C1, and similarly for C2, so that features from different classes are correctly classified. To develop a large margin classifier, we further require cos(θ1) − m > cos(θ2) and cos(θ2) − m > cos(θ1), where m ≥ 0 is a fixed parameter introduced to control the magnitude of the cosine margin. Since cos(θ_i) − m is lower than cos(θ_i), the constraint is more stringent for classification. The above analysis generalizes well to the multi-class scenario. Therefore, the altered loss reinforces the discrimination of learned features by encouraging an extra margin in the cosine space.

Formally, we define the Large Margin Cosine Loss (LMCL) as:

$$L_{lmc} = -\frac{1}{N}\sum_{i}\log\frac{e^{s(\cos(\theta_{y_i,i})-m)}}{e^{s(\cos(\theta_{y_i,i})-m)} + \sum_{j\neq y_i} e^{s\cos(\theta_{j,i})}}, \tag{4}$$

subject to

$$W = \frac{W^*}{\|W^*\|},\qquad x = \frac{x^*}{\|x^*\|},\qquad \cos(\theta_{j,i}) = W_j^T x_i, \tag{5}$$

where N is the number of training samples, x_i is the i-th feature vector corresponding to the ground-truth class y_i, W_j is the weight vector of the j-th class, and θ_{j,i} is the angle between W_j and x_i.
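To make the formulation concrete, here is a minimal NumPy sketch of Equations (3)-(5). It is an illustrative reimplementation, not the authors' Caffe code; the function name lmcl_loss and the toy shapes are our own. Features and class weights are L2-normalized (Eq. 5), the margin m is subtracted from the target-class cosine, and a scaled softmax cross-entropy is averaged over the batch (Eq. 4); setting m = 0 recovers the NSL of Eq. (3).

```python
import numpy as np

def lmcl_loss(features, weights, labels, s=64.0, m=0.35):
    """Large Margin Cosine Loss (Eq. 4-5), illustrative NumPy sketch.

    features: (N, D) raw feature vectors x*
    weights:  (C, D) raw class weight vectors W*
    labels:   (N,) ground-truth class indices y_i
    """
    # Eq. 5: L2-normalize both feature vectors and class weight vectors.
    x = features / np.linalg.norm(features, axis=1, keepdims=True)
    W = weights / np.linalg.norm(weights, axis=1, keepdims=True)
    cos = x @ W.T                                  # (N, C): cos(theta_{j,i})
    logits = s * cos
    # Subtract the cosine margin m from the target-class logit only.
    logits[np.arange(len(labels)), labels] -= s * m
    # Numerically stable log-softmax, then mean negative log-likelihood (Eq. 4).
    logits -= logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_prob[np.arange(len(labels)), labels].mean()

# Toy usage: 4 samples with 8-D features, 3 classes. m=0 gives the NSL (Eq. 3).
rng = np.random.default_rng(0)
print(lmcl_loss(rng.normal(size=(4, 8)), rng.normal(size=(3, 8)),
                np.array([0, 1, 2, 0])))
```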
3.2. Comparison on Different Loss Functions

In this subsection, we compare the decision margin of our method (LMCL) to those of Softmax, NSL, and A-Softmax, as illustrated in Figure 2. For simplicity of analysis, we consider the binary-class scenario with classes C1 and C2. Let W1 and W2 denote the weight vectors of C1 and C2, respectively.

Softmax loss defines a decision boundary by:

$$\|W_1\|\cos(\theta_1) = \|W_2\|\cos(\theta_2).$$

Thus, its boundary depends on both the magnitudes of the weight vectors and the cosines of the angles, which results in an overlapping decision area (margin < 0) in the cosine space. This is illustrated in the first subplot of Figure 2. As noted before, in the testing stage it is a common strategy to only consider the cosine similarity between testing feature vectors of faces. Consequently, the trained classifier with the Softmax loss is unable to perfectly classify testing samples in the cosine space.

NSL normalizes the weight vectors W1 and W2 such that they have constant magnitude 1, which results in a decision boundary given by:

$$\cos(\theta_1) = \cos(\theta_2).$$

The decision boundary of NSL is illustrated in the second subplot of Figure 2. We can see that by removing radial variations, the NSL is able to perfectly classify testing samples in the cosine space, with margin = 0. However, it is not quite robust to noise because there is no decision margin: any small perturbation around the decision boundary can change the decision.

A-Softmax improves the softmax loss by introducing an extra margin, such that its decision boundary is given by:

$$C_1:\ \cos(m\theta_1) \geq \cos(\theta_2),\qquad C_2:\ \cos(m\theta_2) \geq \cos(\theta_1).$$

Thus, for C1 it requires θ1 ≤ θ2/m, and similarly for C2. The third subplot of Figure 2 depicts this decision area, where the gray area denotes the decision margin. However, the margin of A-Softmax is not consistent over all θ values: the margin becomes smaller as θ reduces, and vanishes completely when θ = 0. This results in two potential issues. First, for difficult classes C1 and C2 which are visually similar and thus have a smaller angle between W1 and W2, the margin is consequently smaller. Second, technically speaking, one has to employ an extra trick with an ad-hoc piecewise function to overcome the non-monotonicity difficulty of the cosine function.

LMCL (our proposal) defines a decision margin in the cosine space rather than the angle space (like A-Softmax) by:

$$C_1:\ \cos(\theta_1) \geq \cos(\theta_2) + m,\qquad C_2:\ \cos(\theta_2) \geq \cos(\theta_1) + m.$$

Therefore, cos(θ1) is maximized while cos(θ2) is minimized for C1 (and similarly for C2) to perform the large-margin classification. The last subplot in Figure 2 illustrates the decision boundary of LMCL in the cosine space, where we can see a clear margin (√2·m) in the produced distribution of the cosine of the angle. This suggests that the LMCL is more robust than the NSL, because a small perturbation around the decision boundary (dashed line) is less likely to lead to an incorrect decision. The cosine margin is applied consistently to all samples, regardless of the angles of their weight vectors.

3.3. Normalization on Features

In the proposed LMCL, a normalization scheme is involved on purpose to derive the formulation of the cosine loss and remove variations in radial directions. Unlike [23], which only normalizes the weight vectors, our approach simultaneously normalizes both weight vectors and feature vectors. As a result, the feature vectors distribute on a hypersphere, where the scaling parameter s controls the magnitude of the radius. In this subsection, we discuss why feature normalization is necessary and how feature normalization encourages better feature learning in the proposed LMCL approach.

The necessity of feature normalization is presented in two respects. First, the original softmax loss without feature normalization implicitly learns both the Euclidean norm (L2-norm) of the feature vectors and the cosine value of the angle. The L2-norm is adaptively learned for minimizing the overall loss, resulting in a relatively weak cosine constraint. In particular, the adaptive L2-norm of easy samples becomes much larger than that of hard samples to remedy the inferior performance of the cosine metric. In contrast, our approach requires the entire set of feature vectors to have the same L2-norm, such that the learning only depends on cosine values to develop the discriminative power. Feature vectors from the same classes are clustered together and those from different classes are pulled apart on the surface of the hypersphere. Additionally, we consider the situation when the model initially starts to minimize the LMCL. Given a feature vector x, let cos(θ_i) and cos(θ_j) denote the cosine scores of the two classes, respectively. Without normalization on features, the LMCL forces ||x||(cos(θ_i) − m) > ||x||cos(θ_j). Note that cos(θ_i) and cos(θ_j) can be initially comparable with each other. Thus, as long as cos(θ_i) − m is smaller than cos(θ_j), ||x|| is required to decrease to minimize the loss, which degenerates the optimization. Therefore, feature normalization is critical under the supervision of LMCL, especially when the networks are trained from scratch. Likewise, it is more favorable to fix the scaling parameter s instead of adaptively learning it.
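The shrinking-norm argument above is easy to check numerically. The following sketch (ours, not from the paper) evaluates a two-class softmax loss whose logits are scaled by an unconstrained feature norm; when the margin constraint is violated early in training, the loss decreases as the norm shrinks, which is exactly the degeneracy that fixing ||x|| = s prevents:

```python
import numpy as np

def two_class_loss(norm_x, cos_i=0.60, cos_j=0.55, m=0.35):
    """Target-class softmax loss when the feature norm is NOT fixed.

    Logits are ||x||*(cos(theta_i) - m) for the ground-truth class and
    ||x||*cos(theta_j) for the other class. Here cos_i - m < cos_j,
    i.e., the margin constraint is violated.
    """
    z = np.array([norm_x * (cos_i - m), norm_x * cos_j])
    return -z[0] + np.log(np.exp(z).sum())     # -log softmax of the target

for norm_x in (8.0, 4.0, 1.0):
    print(norm_x, two_class_loss(norm_x))
# The loss shrinks with ||x||: the optimizer can reduce the loss by
# shrinking feature norms instead of improving the angles.
```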
Furthermore, the scaling parameter s should be set to a properly large value to yield better-performing features with lower training loss. For NSL, the loss continuously goes down with higher s, while too small an s leads to insufficient convergence or even no convergence. For LMCL, we also need an adequately large s to ensure a sufficient hyperspace for feature learning with an expected large margin.

In the following, we show that the parameter s should have a lower bound to obtain the expected classification performance. Given the normalized learned feature vector x and unit weight vector W, we denote the total number of classes as C. Suppose that the learned feature vectors separately lie on the surface of the hypersphere and center around the corresponding weight vector. Let P_W denote the expected minimum posterior probability of the class center (i.e., W). The lower bound of s is given by¹:

$$s \geq \frac{C-1}{C}\log\frac{(C-1)P_W}{1-P_W}. \tag{6}$$

Based on this bound, we can infer that s should be enlarged consistently if we expect an optimal P_W for classification with a certain number of classes. Besides, by keeping P_W fixed, the desired s should be larger to deal with more classes, since the growing number of classes increases the difficulty of classification in the relatively compact space. A hypersphere with large radius s is therefore required for embedding features with small intra-class distance and large inter-class distance.

3.4. Theoretical Analysis for LMCL

The preceding subsections essentially discuss the LMCL from the classification point of view. In terms of learning discriminative features on the hypersphere, the cosine margin serves as a momentous part in strengthening the discriminating power of features. A detailed analysis of the quantitatively feasible choice of the cosine margin (i.e., the bound of the hyper-parameter m) is necessary. The optimal choice of m potentially leads to more promising learning of highly discriminative face features. In the following, we delve into the decision boundary and angular margin in the feature space to derive the theoretical bound for the hyper-parameter m.

First, considering the binary-class case with classes C1 and C2 as before, suppose that the normalized feature vector x is given. Let W_i denote the normalized weight vector, and θ_i denote the angle between x and W_i. For NSL, the decision boundary is defined as cos θ1 − cos θ2 = 0, which is equivalent to the angular bisector of W1 and W2, as shown in the left of Figure 3. This means that the model supervised by NSL partitions the underlying feature space into two close regions, where the features near the boundary are extremely ambiguous (i.e., belonging to either class is acceptable). In contrast, LMCL drives the decision boundary formulated by cos θ1 − cos θ2 = m for C1, in which θ1 should be much smaller than θ2 (similarly for C2). Consequently, the inter-class variance is enlarged while the intra-class variance shrinks.

Figure 3. A geometrical interpretation of LMCL from the feature perspective. Different color areas represent the feature spaces of distinct classes. LMCL has a relatively compact feature region compared with NSL.

Back to Figure 3, one can observe that the maximum angular margin is subject to the angle between W1 and W2. Accordingly, the cosine margin should have a limited variable scope when W1 and W2 are given. Specifically, suppose a scenario in which all the feature vectors belonging to class i exactly overlap with the corresponding weight vector W_i of class i. In other words, every feature vector is identical to the weight vector of class i, and apparently the feature space is in an extreme situation, where all the feature vectors lie at their class center. In that case, the margin of the decision boundaries has been maximized (i.e., the strict upper bound of the cosine margin).

To extend this to the general case, we suppose that all the features are well-separated and we have a total number of C classes. The theoretical variable scope of m is supposed to be: 0 ≤ m ≤ 1 − max(W_i^T W_j), where i, j ≤ C, i ≠ j. The softmax loss tries to maximize the angle between any two weight vectors from two different classes in order to perform perfect classification. Hence, it is clear that the optimal solution for the softmax loss should uniformly distribute the weight vectors on a unit hypersphere. Based on this assumption, the variable scope of the introduced cosine margin m can be inferred as follows²:

$$\begin{aligned}
&0 \leq m \leq 1 - \cos\tfrac{2\pi}{C}, &&(K = 2)\\
&0 \leq m \leq \tfrac{C}{C-1}, &&(C \leq K + 1)\\
&0 \leq m \ll \tfrac{C}{C-1}, &&(C > K + 1)
\end{aligned}\tag{7}$$

where C is the number of training classes and K is the dimension of the learned features. The inequalities indicate that as the number of classes increases, the upper bound of the cosine margin between classes decreases correspondingly. In particular, if the number of classes is much larger than the feature dimension, the upper bound of the cosine margin will get even smaller.

¹ Proof is attached in the supplemental material.
² Proof is attached in the supplemental material.
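As a quick numerical companion to Equations (6) and (7), the sketch below (ours; the helper names are not from the paper) evaluates the lower bound of s and the K = 2 upper bound of m:

```python
import numpy as np

def s_lower_bound(C, P_W):
    """Lower bound of the scaling parameter s (Eq. 6)."""
    return (C - 1) / C * np.log((C - 1) * P_W / (1 - P_W))

def m_upper_bound_2d(C):
    """Upper bound of the cosine margin m for K = 2 (Eq. 7)."""
    return 1 - np.cos(2 * np.pi / C)

# For C = 10,575 classes (CASIA-WebFace) and P_W = 0.9 the bound is ~11.5,
# so the empirical setting s = 64 used later in the paper satisfies it.
print(s_lower_bound(10575, 0.9))
# For 8 classes in 2-D the margin bound is 1 - cos(pi/4) ~= 0.29,
# matching the toy experiment of Figure 4.
print(m_upper_bound_2d(8))
```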
Figure 4. A toy experiment of different loss functions on 8 identities with 2D features. The first row maps the 2D features onto the Euclidean
space, while the second row projects the 2D features onto the angular space. The gap becomes evident as the margin term m increases.

A reasonable choice of a larger m ∈ [0, C/(C−1)) should effectively boost the learning of highly discriminative features. Nevertheless, the parameter m usually cannot reach the theoretical upper bound in practice due to the vanishing of the feature space, that is, all the feature vectors being centered together according to the weight vector of the corresponding class. In fact, the model fails to converge when m is too large, because the cosine constraint (i.e., cos θ1 − m > cos θ2 or cos θ2 − m > cos θ1 for two classes) becomes stricter and is hard to satisfy. Besides, the cosine constraint with an overlarge m forces the training process to be more sensitive to noisy data. An ever-increasing m starts to degrade the overall performance at some point because of the failure to converge.

We perform a toy experiment to better visualize the features and validate our approach. We select face images from 8 distinct identities containing enough samples to clearly show the feature points on the plot. Several models are trained using the original softmax loss and the proposed LMCL with different settings of m. We extract 2-D features of the face images for simplicity. As discussed above, m should be no larger than 1 − cos(π/4) (about 0.29), so we set up three choices of m for comparison: m = 0, m = 0.1, and m = 0.2. As shown in Figure 4, the first row and second row present the feature distributions in the Euclidean space and the angular space, respectively. We can observe that the original softmax loss produces ambiguity in decision boundaries, while the proposed LMCL performs much better. As m increases, the angular margin between different classes is amplified.

4. Experiments

4.1. Implementation Details

Preprocessing. Firstly, the face area and landmarks are detected by MTCNN [16] for the entire set of training and testing images. Then, the 5 facial points (two eyes, nose, and two mouth corners) are adopted to perform a similarity transformation. After that we obtain the cropped faces, which are then resized to 112 × 96. Following [42, 23], each pixel (in [0, 255]) in the RGB images is normalized by subtracting 127.5 and then dividing by 128.

Training. For a direct and fair comparison to the existing results that use small training datasets (less than 0.5M images and 20K subjects) [17], we train our models on a small training dataset, which is the publicly available CASIA-WebFace [46] dataset containing 0.49M face images from 10,575 subjects. We also use a large training dataset to evaluate the performance of our approach for benchmark comparison with the state-of-the-art results (using large training datasets) on the benchmark face datasets. The large training dataset that we use in this study is composed of several public datasets and a private face dataset, containing about 5M images from more than 90K identities. The training faces are horizontally flipped for data augmentation. In our experiments we remove face images belonging to identities that appear in the testing datasets. For a fair comparison, the CNN architecture used in our work is similar to [23], which has 64 convolutional layers and is based on residual units [9]. The scaling parameter s in Equation (4) is set to 64 empirically. We use Caffe [14] to implement the modifications of the loss layer and run the models. The CNN models are trained with the SGD algorithm, with a batch size of 64 on 8 GPUs. The weight decay is set to 0.0005. For the case of training on the small dataset, the learning rate is initially 0.1 and divided by 10 at the 16K, 24K, and 28K iterations, and we finish the training process at 30K iterations. The training on the large dataset terminates at 240K iterations, with the initial learning rate of 0.05 dropped at the 80K, 140K, and 200K iterations.

Testing. At the testing stage, the features of the original image and the flipped image are concatenated together to compose the final face representation. The cosine distance of features is computed as the similarity score. Finally, face verification and identification are conducted by thresholding and ranking the scores. We test our models on several popular public face datasets, including LFW [13], YTF [43], and MegaFace [17, 25].
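The testing pipeline above amounts to a few lines of array code. Below is a minimal sketch (ours); extract_features stands in for a forward pass of the trained ConvNet and is an assumed helper, not part of the paper:

```python
import numpy as np

def face_representation(image, extract_features):
    """Concatenate features of the original and horizontally flipped image."""
    feat = extract_features(image)
    feat_flip = extract_features(image[:, ::-1])   # flip width of an (H, W, C) array
    return np.concatenate([feat, feat_flip])

def cosine_score(rep_a, rep_b):
    """Cosine similarity: thresholded for verification, ranked for identification."""
    return rep_a @ rep_b / (np.linalg.norm(rep_a) * np.linalg.norm(rep_b))
```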
4.2. Exploratory Experiments

Effect of m. The margin parameter m plays a key role in LMCL. In this part we conduct an experiment to investigate the effect of m. Varying m from 0 to 0.45 (if m is larger than 0.45, the model fails to converge), we use the small training data (CASIA-WebFace [46]) to train our CosFace model and evaluate its performance on the LFW [13] and YTF [43] datasets, as illustrated in Figure 5. We can see that the model without the margin (in this case m = 0) leads to the worst performance. As m increases, the accuracies improve consistently on both datasets and saturate at m = 0.35. This demonstrates the effectiveness of the margin m. By increasing the margin m, the discriminative power of the learned features can be significantly improved. In this study, m is fixed to 0.35 in the subsequent experiments.

Figure 5. Accuracy (%) of CosFace with different margin parameters m on LFW [13] and YTF [43].

Effect of Feature Normalization. To investigate the effect of the feature normalization scheme in our approach, we train our CosFace models on CASIA-WebFace with and without the feature normalization scheme, fixing m to 0.35, and compare their performance on LFW [13], YTF [43], and MegaFace Challenge 1 (MF1) [17]. Note that the model trained without normalization is initialized by the softmax loss and then supervised by the proposed LMCL. The comparative results are reported in Table 1. It is very clear that the model using the feature normalization scheme consistently outperforms the model without it across the three datasets. As discussed above, feature normalization removes radial variance, and the learned features can be more discriminative in the angular space. This experiment verifies this point.

| Normalization | LFW | YTF | MF1 Rank 1 | MF1 Veri. |
|---|---|---|---|---|
| No | 99.10 | 93.1 | 75.10 | 88.65 |
| Yes | 99.33 | 96.1 | 77.11 | 89.88 |

Table 1. Comparison of our models with and without feature normalization on MegaFace Challenge 1 (MF1). "Rank 1" refers to rank-1 face identification accuracy and "Veri." refers to face verification TAR (True Accepted Rate) under 10^-6 FAR (False Accepted Rate).

4.3. Comparison with State-of-the-Art Loss Functions

In this part, we compare the performance of the proposed LMCL with state-of-the-art loss functions. Following the experimental setting in [23], we train a model with the guidance of the proposed LMCL on CASIA-WebFace [46] using the same 64-layer CNN architecture described in [23]. The experimental comparisons on LFW, YTF, and MF1 are reported in Table 2. For a fair comparison, we strictly follow the model structure (a 64-layer ResNet-like CNN) and the detailed experimental settings of SphereFace [23]. As can be seen in Table 2, LMCL consistently achieves competitive results compared to the other losses across the three datasets. In particular, our method not only surpasses the performance of A-Softmax with feature normalization (named A-Softmax-NormFea in Table 2), but also significantly outperforms the other loss functions on YTF and MF1, which demonstrates the effectiveness of LMCL.

| Method | LFW | YTF | MF1 Rank1 | MF1 Veri. |
|---|---|---|---|---|
| Softmax Loss [23] | 97.88 | 93.1 | 54.85 | 65.92 |
| Softmax+Contrastive [30] | 98.78 | 93.5 | 65.21 | 78.86 |
| Triplet Loss [29] | 98.70 | 93.4 | 64.79 | 78.32 |
| L-Softmax Loss [24] | 99.10 | 94.0 | 67.12 | 80.42 |
| Softmax+Center Loss [42] | 99.05 | 94.4 | 65.49 | 80.14 |
| A-Softmax [23] | 99.42 | 95.0 | 72.72 | 85.56 |
| A-Softmax-NormFea | 99.32 | 95.4 | 75.42 | 88.82 |
| LMCL | 99.33 | 96.1 | 77.11 | 89.88 |

Table 2. Comparison of the proposed LMCL with state-of-the-art loss functions in the face recognition community. All the methods in this table use the same training data and the same 64-layer CNN architecture.

4.4. Overall Benchmark Comparison

4.4.1 Evaluation on LFW and YTF

LFW [13] is a standard face verification testing dataset in unconstrained conditions. It includes 13,233 face images from 5,749 identities collected from the web. We evaluate our model strictly following the standard protocol of unrestricted with labeled outside data [13], and report the result on the 6,000 testing image pairs. YTF [43] contains 3,425 videos of 1,595 different people. The average length of a video clip is 181.3 frames. All the video sequences were downloaded from YouTube. We follow the unrestricted with labeled outside data protocol and report the result on 5,000 video pairs.
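Under this verification protocol, a pair is accepted when the cosine score of its two representations exceeds a threshold selected on held-out folds. A minimal sketch of that final step (ours, not from the paper):

```python
import numpy as np

def verification_accuracy(scores, same_identity, threshold):
    """Fraction of pairs classified correctly at a given cosine threshold."""
    return np.mean((scores > threshold) == same_identity)

def best_threshold(scores, same_identity, grid=np.linspace(-1.0, 1.0, 401)):
    """Pick the threshold maximizing accuracy (chosen on training folds in practice)."""
    accs = [verification_accuracy(scores, same_identity, t) for t in grid]
    return grid[int(np.argmax(accs))]
```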
| Method | Training Data | #Models | LFW | YTF |
|---|---|---|---|---|
| Deep Face [35] | 4M | 3 | 97.35 | 91.4 |
| FaceNet [29] | 200M | 1 | 99.63 | 95.1 |
| DeepFR [27] | 2.6M | 1 | 98.95 | 97.3 |
| DeepID2+ [33] | 300K | 25 | 99.47 | 93.2 |
| Center Face [42] | 0.7M | 1 | 99.28 | 94.9 |
| Baidu [21] | 1.3M | 1 | 99.13 | - |
| SphereFace [23] | 0.49M | 1 | 99.42 | 95.0 |
| CosFace | 5M | 1 | 99.73 | 97.6 |

Table 3. Face verification (%) on the LFW and YTF datasets. "#Models" indicates the number of models that have been used in the method for evaluation.

As shown in Table 3, the proposed CosFace achieves state-of-the-art results of 99.73% on LFW and 97.6% on YTF. FaceNet achieves the runner-up performance on LFW with its large-scale image dataset, which has approximately 200 million face images. In terms of YTF, our model reaches the first place over all other methods.

4.4.2 Evaluation on MegaFace

MegaFace [17, 25] is a very challenging testing benchmark recently released for large-scale face identification and verification, which contains a gallery set and a probe set. The gallery set in MegaFace is composed of more than 1 million face images. The probe set has two existing databases: Facescrub [26] and FGNET [1]. In this study, we use the Facescrub dataset (containing 106,863 face images of 530 celebrities) as the probe set to evaluate the performance of our approach on both MegaFace Challenge 1 and Challenge 2.

MegaFace Challenge 1 (MF1). In the MegaFace Challenge 1 [17], the gallery set incorporates more than 1 million images from 690K individuals collected from Flickr photos [36]. Table 4 summarizes the results of our models trained on the two protocols of MegaFace, where the training dataset is regarded as small if it has less than 0.5 million images, and large otherwise. The CosFace approach shows its superiority for both the identification and verification tasks on both protocols.

| Method | Protocol | MF1 Rank1 | MF1 Veri. |
|---|---|---|---|
| SIAT MMLAB [42] | Small | 65.23 | 76.72 |
| DeepSense - Small | Small | 70.98 | 82.85 |
| SphereFace - Small [23] | Small | 75.76 | 90.04 |
| Beijing FaceAll V2 | Small | 76.66 | 77.60 |
| GRCCV | Small | 77.67 | 74.88 |
| FUDAN-CS SDS [41] | Small | 77.98 | 79.19 |
| CosFace (Single-patch) | Small | 77.11 | 89.88 |
| CosFace (3-patch ensemble) | Small | 79.54 | 92.22 |
| Beijing FaceAll Norm 1600 | Large | 64.80 | 67.11 |
| Google - FaceNet v8 [29] | Large | 70.49 | 86.47 |
| NTechLAB - facenx large | Large | 73.30 | 85.08 |
| SIATMMLAB TencentVision | Large | 74.20 | 87.27 |
| DeepSense V2 | Large | 81.29 | 95.99 |
| YouTu Lab | Large | 83.29 | 91.34 |
| Vocord - deepVo V3 | Large | 91.76 | 94.96 |
| CosFace (Single-patch) | Large | 82.72 | 96.65 |
| CosFace (3-patch ensemble) | Large | 84.26 | 97.96 |

Table 4. Face identification and verification evaluation on MF1. "Rank 1" refers to rank-1 face identification accuracy and "Veri." refers to face verification TAR under 10^-6 FAR.

MegaFace Challenge 2 (MF2). In MegaFace Challenge 2 [25], all the algorithms need to use the training data provided by MegaFace. The training data for MegaFace Challenge 2 contains 4.7 million faces and 672K identities, which corresponds to the large protocol. The gallery set has 1 million images that are different from the Challenge 1 gallery set. Not surprisingly, our method wins the first place in Challenge 2 in Table 5, setting a new state-of-the-art with a large margin (1.39% on rank-1 identification accuracy and 5.46% on verification performance).

| Method | Protocol | MF2 Rank1 | MF2 Veri. |
|---|---|---|---|
| 3DiVi | Large | 57.04 | 66.45 |
| Team 2009 | Large | 58.93 | 71.12 |
| NEC | Large | 62.12 | 66.84 |
| GRCCV | Large | 75.77 | 74.84 |
| SphereFace | Large | 71.17 | 84.22 |
| CosFace (Single-patch) | Large | 74.11 | 86.77 |
| CosFace (3-patch ensemble) | Large | 77.06 | 90.30 |

Table 5. Face identification and verification evaluation on MF2. "Rank 1" refers to rank-1 face identification accuracy and "Veri." refers to face verification TAR under 10^-6 FAR.

5. Conclusion

In this paper, we proposed an innovative approach named LMCL to guide deep CNNs to learn highly discriminative face features. We provided a well-formed geometrical and theoretical interpretation to verify the effectiveness of the proposed LMCL. Our approach consistently achieves state-of-the-art results on several face benchmarks. We hope that our substantial explorations on learning discriminative features via LMCL will benefit the face recognition community.
References

[1] FG-NET Aging Database. http://www.fgnet.rsunit.com/.
[2] P. Belhumeur, J. P. Hespanha, and D. Kriegman. Eigenfaces vs. fisherfaces: Recognition using class specific linear projection. IEEE Trans. Pattern Analysis and Machine Intelligence, 19(7):711-720, July 1997.
[3] J. Cai, Z. Meng, A. S. Khan, Z. Li, and Y. Tong. Island loss for learning discriminative features in facial expression recognition. arXiv preprint arXiv:1710.03144, 2017.
[4] W. Chen, X. Chen, J. Zhang, and K. Huang. Beyond triplet loss: a deep quadruplet network for person re-identification. arXiv preprint arXiv:1704.01719, 2017.
[5] S. Chopra, R. Hadsell, and Y. LeCun. Learning a similarity metric discriminatively, with application to face verification. In Conference on Computer Vision and Pattern Recognition (CVPR), 2005.
[6] J. Deng, Y. Zhou, and S. Zafeiriou. Marginal loss for deep face recognition. In Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2017.
[7] R. Hadsell, S. Chopra, and Y. LeCun. Dimensionality reduction by learning an invariant mapping. In Conference on Computer Vision and Pattern Recognition (CVPR), 2006.
[8] M. A. Hasnat, J. Bohne, J. Milgram, S. Gentric, and L. Chen. von Mises-Fisher mixture model-based deep learning: Application to face verification. arXiv preprint arXiv:1706.04264, 2017.
[9] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
[10] E. Hoffer and N. Ailon. Deep metric learning using triplet network. In International Workshop on Similarity-Based Pattern Recognition, 2015.
[11] G. Hu, Y. Yang, D. Yi, J. Kittler, W. Christmas, S. Z. Li, and T. Hospedales. When face recognition meets with deep learning: an evaluation of convolutional neural networks for face recognition. In International Conference on Computer Vision Workshops (ICCVW), 2015.
[12] J. Hu, L. Shen, and G. Sun. Squeeze-and-excitation networks. arXiv preprint arXiv:1709.01507, 2017.
[13] G. B. Huang, M. Ramesh, T. Berg, and E. Learned-Miller. Labeled faces in the wild: A database for studying face recognition in unconstrained environments. Technical Report 07-49, University of Massachusetts, Amherst, 2007.
[14] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. In Proceedings of the ACM on Multimedia Conference (ACM MM), 2014.
[15] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In International Conference on Learning Representations (ICLR), 2015.
[16] K. Zhang, Z. Zhang, Z. Li, and Y. Qiao. Joint face detection and alignment using multi-task cascaded convolutional networks. IEEE Signal Processing Letters, 23(10):1499-1503, 2016.
[17] I. Kemelmacher-Shlizerman, S. M. Seitz, D. Miller, and E. Brossard. The MegaFace benchmark: 1 million faces for recognition at scale. In Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
[18] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems (NIPS), 2012.
[19] Z. Li, D. Lin, and X. Tang. Nonparametric discriminant analysis for face recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 31:755-761, 2009.
[20] Z. Li, W. Liu, D. Lin, and X. Tang. Nonparametric subspace analysis for face recognition. In Conference on Computer Vision and Pattern Recognition (CVPR), 2005.
[21] J. Liu, Y. Deng, T. Bai, Z. Wei, and C. Huang. Targeting ultimate accuracy: Face recognition via deep embedding. arXiv preprint arXiv:1506.07310, 2015.
[22] W. Liu, Z. Li, and X. Tang. Spatio-temporal embedding for statistical face recognition from video. In European Conference on Computer Vision (ECCV), 2006.
[23] W. Liu, Y. Wen, Z. Yu, M. Li, B. Raj, and L. Song. SphereFace: Deep hypersphere embedding for face recognition. In Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
[24] W. Liu, Y. Wen, Z. Yu, and M. Yang. Large-margin softmax loss for convolutional neural networks. In International Conference on Machine Learning (ICML), 2016.
[25] A. Nech and I. Kemelmacher-Shlizerman. Level playing field for million scale face recognition. In Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
[26] H.-W. Ng and S. Winkler. A data-driven approach to cleaning large face datasets. In IEEE International Conference on Image Processing (ICIP), pages 343-347, 2014.
[27] O. M. Parkhi, A. Vedaldi, A. Zisserman, et al. Deep face recognition. In BMVC, volume 1, page 6, 2015.
[28] R. Ranjan, C. D. Castillo, and R. Chellappa. L2-constrained softmax loss for discriminative face verification. arXiv preprint arXiv:1703.09507, 2017.
[29] F. Schroff, D. Kalenichenko, and J. Philbin. FaceNet: A unified embedding for face recognition and clustering. In Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
[30] Y. Sun, Y. Chen, X. Wang, and X. Tang. Deep learning face representation by joint identification-verification. In Advances in Neural Information Processing Systems (NIPS), 2014.
[31] Y. Sun, D. Liang, X. Wang, and X. Tang. DeepID3: Face recognition with very deep neural networks. arXiv preprint arXiv:1502.00873, 2015.
[32] Y. Sun, X. Wang, and X. Tang. Deep learning face representation from predicting 10,000 classes. In Conference on Computer Vision and Pattern Recognition (CVPR), 2014.
[33] Y. Sun, X. Wang, and X. Tang. Deeply learned face representations are sparse, selective, and robust. In Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
[34] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
[35] Y. Taigman, M. Yang, M. Ranzato, and L. Wolf. DeepFace: Closing the gap to human-level performance in face verification. In Conference on Computer Vision and Pattern Recognition (CVPR), 2014.
[36] B. Thomee, D. A. Shamma, G. Friedland, B. Elizalde, K. Ni, D. Poland, D. Borth, and L.-J. Li. YFCC100M: The new data in multimedia research. Communications of the ACM, 2016.
[37] M. A. Turk and A. P. Pentland. Face recognition using eigenfaces. In Conference on Computer Vision and Pattern Recognition (CVPR), 1991.
[38] F. Wang, X. Xiang, J. Cheng, and A. L. Yuille. NormFace: L2 hypersphere embedding for face verification. In Proceedings of the 2017 ACM on Multimedia Conference (ACM MM), 2017.
[39] J. Wang, Y. Song, T. Leung, C. Rosenberg, J. Wang, J. Philbin, B. Chen, and Y. Wu. Learning fine-grained image similarity with deep ranking. In Conference on Computer Vision and Pattern Recognition (CVPR), 2014.
[40] X. Wang and X. Tang. A unified framework for subspace face recognition. IEEE Trans. Pattern Analysis and Machine Intelligence, 26(9):1222-1228, Sept. 2004.
[41] Z. Wang, K. He, Y. Fu, R. Feng, Y.-G. Jiang, and X. Xue. Multi-task deep neural network for joint face recognition and facial attribute prediction. In Proceedings of the 2017 ACM on International Conference on Multimedia Retrieval (ICMR), 2017.
[42] Y. Wen, K. Zhang, Z. Li, and Y. Qiao. A discriminative feature learning approach for deep face recognition. In European Conference on Computer Vision (ECCV), pages 499-515, 2016.
[43] L. Wolf, T. Hassner, and I. Maoz. Face recognition in unconstrained videos with matched background similarity. In Conference on Computer Vision and Pattern Recognition (CVPR), 2011.
[44] S. Xie, R. Girshick, P. Dollár, Z. Tu, and K. He. Aggregated residual transformations for deep neural networks. arXiv preprint arXiv:1611.05431, 2016.
[45] Y. Xiong, W. Liu, D. Zhao, and X. Tang. Face recognition via archetype hull ranking. In IEEE International Conference on Computer Vision (ICCV), 2013.
[46] D. Yi, Z. Lei, S. Liao, and S. Z. Li. Learning face representation from scratch. arXiv preprint arXiv:1411.7923, 2014.
[47] X. Zhang, Z. Fang, Y. Wen, Z. Li, and Y. Qiao. Range loss for deep face recognition with long-tail. In International Conference on Computer Vision (ICCV), 2017.
[48] X. Zhe, S. Chen, and H. Yan. Directional statistics-based deep metric learning for image classification and retrieval. arXiv preprint arXiv:1802.09662, 2018.
A. Supplementary Material

This supplementary document provides mathematical details for the derivation of the lower bound of the scaling parameter s (Equation 6 in the main paper), and the variable scope of the cosine margin m (Equation 7 in the main paper).

Proposition of the Scaling Parameter s

Given the normalized learned features x and unit weight vectors W, we denote the total number of classes as C, where C > 1. Suppose that the learned features separately lie on the surface of a hypersphere and center around the corresponding weight vector. Let P_W denote the expected minimum posterior probability of the class center (i.e., W). The lower bound of s is formulated as follows:

$$s \geq \frac{C-1}{C}\ln\frac{(C-1)P_W}{1-P_W}.$$

Proof: Let W_i denote the i-th unit weight vector. ∀i, we have:

$$\frac{e^s}{e^s + \sum_{j,\,j\neq i} e^{s\,W_i^T W_j}} \geq P_W, \tag{8}$$

$$1 + e^{-s}\sum_{j,\,j\neq i} e^{s\,W_i^T W_j} \leq \frac{1}{P_W}, \tag{9}$$

$$\sum_{i=1}^{C}\Big(1 + e^{-s}\sum_{j,\,j\neq i} e^{s\,W_i^T W_j}\Big) \leq \frac{C}{P_W}, \tag{10}$$

$$1 + \frac{e^{-s}}{C}\sum_{i,j,\,i\neq j} e^{s\,W_i^T W_j} \leq \frac{1}{P_W}. \tag{11}$$

Because f(x) = e^{s·x} is a convex function, according to Jensen's inequality, we obtain:

$$\frac{1}{C(C-1)}\sum_{i,j,\,i\neq j} e^{s\,W_i^T W_j} \geq e^{\frac{s}{C(C-1)}\sum_{i,j,\,i\neq j} W_i^T W_j}. \tag{12}$$

Besides, it is known that

$$\sum_{i,j,\,i\neq j} W_i^T W_j = \Big(\sum_i W_i\Big)^2 - \sum_i W_i^2 \geq -C. \tag{13}$$

Thus, we have:

$$1 + (C-1)\,e^{-\frac{sC}{C-1}} \leq \frac{1}{P_W}. \tag{14}$$

Further simplification yields:

$$s \geq \frac{C-1}{C}\ln\frac{(C-1)P_W}{1-P_W}. \tag{15}$$

The equality holds if and only if every W_i^T W_j (i ≠ j) is equal and Σ_i W_i = 0. Because at most K + 1 unit vectors are able to satisfy this condition in the K-dimensional hyperspace, the equality holds only when C ≤ K + 1, where K is the dimension of the learned features.

Proposition of the Cosine Margin m

Suppose that the weight vectors are uniformly distributed on a unit hypersphere. The variable scope of the introduced cosine margin m is formulated as follows:

$$\begin{aligned}
&0 \leq m \leq 1 - \cos\tfrac{2\pi}{C}, &&(K = 2)\\
&0 \leq m \leq \tfrac{C}{C-1}, &&(K > 2,\ C \leq K + 1)\\
&0 \leq m \ll \tfrac{C}{C-1}, &&(K > 2,\ C > K + 1)
\end{aligned}$$

where C is the total number of training classes and K is the dimension of the learned features.

Proof: For K = 2, the weight vectors uniformly spread on a unit circle. Hence, max(W_i^T W_j) = cos(2π/C). It follows that 0 ≤ m ≤ 1 − max(W_i^T W_j) = 1 − cos(2π/C).

For K > 2, the inequality below holds:

$$C(C-1)\max(W_i^T W_j) \geq \sum_{i,j,\,i\neq j} W_i^T W_j = \Big(\sum_i W_i\Big)^2 - \sum_i W_i^2 \geq -C. \tag{16}$$

Therefore, max(W_i^T W_j) ≥ −1/(C−1), and we have 0 ≤ m ≤ 1 − max(W_i^T W_j) ≤ C/(C−1).

Similarly, the equality holds if and only if every W_i^T W_j (i ≠ j) is equal and Σ_i W_i = 0. As discussed above, this is satisfied only if C ≤ K + 1. On this condition, the distance between the vertices of two arbitrary W should be the same. In other words, they form a regular simplex, such as an equilateral triangle if C = 3, or a regular tetrahedron if C = 4.

For the case of C > K + 1, the equality cannot be satisfied; in fact, a strict upper bound cannot be formulated, hence we obtain 0 ≤ m ≪ C/(C−1). Because the number of classes can be much larger than the feature dimension, the equality cannot hold in practice.
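The equality case above (a regular simplex when C ≤ K + 1) can be verified numerically. The following sketch (ours) builds C unit vectors forming a regular simplex in a (C − 1)-dimensional subspace and checks that every pairwise inner product equals −1/(C − 1), so the margin bound 1 − max(W_i^T W_j) attains C/(C − 1):

```python
import numpy as np

C = 5                                    # number of classes; here K = C - 1
# Rows of (I - 1/C) are the basis vectors e_i minus their centroid;
# normalizing them yields a regular simplex of C unit vectors.
V = np.eye(C) - 1.0 / C
V /= np.linalg.norm(V, axis=1, keepdims=True)
G = V @ V.T                              # Gram matrix of the weight vectors

off_diag = G[~np.eye(C, dtype=bool)]
print(np.allclose(off_diag, -1.0 / (C - 1)))    # True: W_i^T W_j = -1/(C-1)
print(1 - off_diag.max(), C / (C - 1))          # both equal C/(C-1) = 1.25
```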