CosFace: Large Margin Cosine Loss for Deep Face Recognition
Hao Wang, Yitong Wang, Zheng Zhou, Xing Ji, Dihong Gong, Jingchao Zhou,
Zhifeng Li∗, and Wei Liu∗
Tencent AI Lab
{hawelwang,yitongwang,encorezhou,denisji,sagazhou,michaelzfli}@tencent.com
∗ Corresponding authors

Abstract

Face recognition has made extraordinary progress owing to the advancement of deep convolutional neural networks (CNNs). However, the traditional softmax loss of deep CNNs usually lacks the power of discrimination. To address this problem, recently several loss functions such as center loss, large margin softmax loss, and angular softmax loss have been proposed. All these improved losses share the same idea: maximizing inter-class variance and minimizing intra-class variance. In this paper, we propose a novel loss function, namely large margin cosine loss (LMCL), to realize this idea from a different perspective. More specifically, we reformulate the softmax loss as a cosine loss by L2 normalizing both features and weight vectors to remove radial variations, based on which a cosine margin term is introduced to further maximize the decision margin in the angular space. As a result, minimum intra-class variance and maximum inter-class variance are achieved by virtue of normalization and cosine decision margin maximization. We refer to our model trained with LMCL as CosFace. Extensive experimental evaluations are conducted on the most popular public-domain face recognition datasets such as MegaFace Challenge, YouTube Faces (YTF) and Labeled Faces in the Wild (LFW). We achieve state-of-the-art performance on these benchmarks, which confirms the effectiveness of our proposed approach.

Figure 1. An overview of the proposed CosFace framework. In the training phase, discriminative face features are learned with a large margin between different classes. In the testing phase, the testing data is fed into CosFace to extract face features, which are later used to compute the cosine similarity score to perform face verification and identification.

1. Introduction

Recent progress on the development of deep convolutional neural networks (CNNs) [15, 18, 12, 9, 44] has significantly advanced the state-of-the-art performance on a wide variety of computer vision tasks, which makes deep CNNs a dominant machine learning approach for computer vision. Face recognition, as one of the most common computer vision tasks, has been extensively studied for decades [37, 45, 22, 19, 20, 40, 2]. Early studies build shallow models with low-level face features, while modern face recognition techniques have been greatly advanced by deep CNNs. Face recognition usually includes two sub-tasks: face verification and face identification. Both of these tasks involve three stages: face detection, feature extraction, and classification. A deep CNN is able to extract clean high-level features, making it possible to achieve superior performance with a relatively simple classification architecture: usually, a multilayer perceptron network followed by a softmax loss [35, 32]. However, recent studies [42, 24, 23] found that the traditional softmax loss is insufficient to acquire the discriminating power for classification.
To encourage better discriminating performance, many research studies have been carried out [42, 5, 7, 10, 39, 23]. All these studies share the same idea for maximum discrimination capability: maximizing inter-class variance and minimizing intra-class variance. For example, [42, 5, 7, 10, 39] propose to adopt multi-loss learning in order to increase the feature discriminating power. While these methods improve classification performance over the traditional softmax loss, they usually come with some extra limitations. [42] only explicitly minimizes the intra-class variance while ignoring the inter-class variance, which may result in suboptimal solutions. [5, 7, 10, 39] require carefully designed mining of pair or triplet samples, which is an extremely time-consuming procedure. Very recently, [23] proposed to address this problem from a different perspective. More specifically, [23] (A-Softmax) projects the original Euclidean space of features to an angular space, and introduces an angular margin for larger inter-class variance.

Compared to the Euclidean margin suggested by [42, 5, 10], the angular margin is preferred because the cosine of the angle has intrinsic consistency with softmax. The formulation of cosine matches the similarity measurement that is frequently applied to face recognition. From this perspective, it is more reasonable to directly introduce a cosine margin between different classes to improve the cosine-related discriminative information.

In this paper, we reformulate the softmax loss as a cosine loss by L2 normalizing both features and weight vectors to remove radial variations, based on which a cosine margin term m is introduced to further maximize the decision margin in the angular space. Specifically, we propose a novel algorithm, dubbed Large Margin Cosine Loss (LMCL), which takes the normalized features as input to learn highly discriminative features by maximizing the inter-class cosine margin. Formally, we define a hyper-parameter m such that the decision boundary is given by cos(θ1) − m = cos(θ2), where θi is the angle between the feature and the weight vector of class i.

For comparison, the decision boundary of A-Softmax is defined over the angular space by cos(mθ1) = cos(θ2), which has a difficulty in optimization due to the nonmonotonicity of the cosine function. To overcome such a difficulty, one has to employ an extra trick with an ad-hoc piecewise function for A-Softmax. More importantly, the decision margin of A-Softmax depends on θ, which leads to different margins for different classes. As a result, in the decision space, some inter-class features have a larger margin while others have a smaller margin, which reduces the discriminating power. Unlike A-Softmax, our approach defines the decision margin in the cosine space, thus avoiding the aforementioned shortcomings.

Based on the LMCL, we build a sophisticated deep model called CosFace, as shown in Figure 1. In the training phase, LMCL guides the ConvNet to learn features with a large cosine margin. In the testing phase, the face features are extracted from the ConvNet to perform either face verification or face identification. We summarize the contributions of this work as follows:

(1) We embrace the idea of maximizing inter-class variance and minimizing intra-class variance and propose a novel loss function, called LMCL, to learn highly discriminative deep features for face recognition.

(2) We provide reasonable theoretical analysis based on the hyperspherical feature distribution encouraged by LMCL.

(3) The proposed approach advances the state-of-the-art performance over most of the benchmarks on popular face databases including LFW [13], YTF [43] and MegaFace [17, 25].

2. Related Work

Deep Face Recognition. Recently, face recognition has achieved significant progress thanks to the great success of deep CNN models [18, 15, 34, 9]. In DeepFace [35] and DeepID [32], face recognition is treated as a multi-class classification problem and deep CNN models are first introduced to learn features on large multi-identity datasets. DeepID2 [30] employs identification and verification signals to achieve better feature embedding. Recent works DeepID2+ [33] and DeepID3 [31] further explore advanced network structures to boost recognition performance. FaceNet [29] uses a triplet loss to learn a Euclidean space embedding, and a deep CNN is then trained on nearly 200 million face images, leading to state-of-the-art performance. Other approaches [41, 11] also prove the effectiveness of deep CNNs on face recognition.

Loss Functions. The loss function plays an important role in deep feature learning. Contrastive loss [5, 7] and triplet loss [10, 39] are usually used to increase the Euclidean margin for better feature embedding. Wen et al. [42] proposed a center loss to learn centers for the deep features of each identity and used the centers to reduce intra-class variance. Liu et al. [24] proposed a large margin softmax (L-Softmax) by adding angular constraints to each identity to improve feature discrimination. Angular softmax (A-Softmax) [23] improves L-Softmax by normalizing the weights, which achieves better performance on a series of open-set face recognition benchmarks [13, 43, 17]. Other loss functions [47, 6, 4, 3] based on contrastive loss or center loss also demonstrate good performance in enhancing discrimination.

Normalization Approaches. Normalization has been studied in recent deep face recognition studies. [38] normalizes the weights, which replaces the inner product with cosine similarity within the softmax loss. [28] applies an L2 constraint on features to embed faces in the normalized space. Note that normalization of feature vectors or weight vectors achieves much lower intra-class angular variability by concentrating more on the angle during training. Hence the angles between identities can be well optimized. The von Mises-Fisher (vMF) based methods [48, 8] and A-Softmax [23] also adopt normalization in feature learning.

Figure 2. The comparison of decision margins for different loss functions in the binary-class scenario (from left to right: Softmax, NSL, A-Softmax, and LMCL). The dashed line represents the decision boundary, and the gray areas are decision margins.
3. Proposed Approach

In this section, we firstly introduce the proposed LMCL in detail (Sec. 3.1). A comparison with other loss functions is then given to show the superiority of LMCL (Sec. 3.2). The feature normalization technique adopted by LMCL is further described to clarify its effectiveness (Sec. 3.3). Lastly, we present a theoretical analysis for the proposed LMCL (Sec. 3.4).

3.1. Large Margin Cosine Loss

We start by rethinking the softmax loss from a cosine perspective. The softmax loss separates features from different classes by maximizing the posterior probability of the ground-truth class. Given an input feature vector x_i with its corresponding label y_i, the softmax loss can be formulated as:

L_s = -\frac{1}{N}\sum_{i=1}^{N} \log p_i = -\frac{1}{N}\sum_{i=1}^{N} \log \frac{e^{f_{y_i}}}{\sum_{j=1}^{C} e^{f_j}},  (1)

where p_i denotes the posterior probability of x_i being correctly classified, N is the number of training samples, and C is the number of classes. f_j is usually the activation of a fully-connected layer with weight vector W_j and bias B_j. We fix the bias B_j = 0 for simplicity, and as a result f_j is given by:

f_j = W_j^T x = \|W_j\| \|x\| \cos\theta_j,  (2)

where θ_j is the angle between W_j and x. This formula suggests that both the norm and the angle of the vectors contribute to the posterior probability.

To develop effective feature learning, the norm of W should necessarily be invariable. To this end, we fix \|W_j\| = 1 by L2 normalization. In the testing stage, the face recognition score of a testing face pair is usually calculated according to the cosine similarity between the two feature vectors. This suggests that the norm of the feature vector x does not contribute to the scoring function. Thus, in the training stage, we fix \|x\| = s. Consequently, the posterior probability merely relies on the cosine of the angle. The modified loss can be formulated as

L_{ns} = -\frac{1}{N}\sum_{i} \log \frac{e^{s \cos(\theta_{y_i,i})}}{\sum_{j} e^{s \cos(\theta_{j,i})}}.  (3)

Because we remove variations in radial directions by fixing \|x\| = s, the resulting model learns features that are separable in the angular space. We refer to this loss as the Normalized version of Softmax Loss (NSL) in this paper.

However, features learned by the NSL are not sufficiently discriminative because the NSL only emphasizes correct classification. To address this issue, we introduce the cosine margin to the classification boundary, which is naturally incorporated into the cosine formulation of Softmax.

Considering a binary-class scenario for example, let θ_i denote the angle between the learned feature vector and the weight vector of class C_i (i = 1, 2). The NSL forces cos(θ_1) > cos(θ_2) for C_1, and similarly for C_2, so that features from different classes are correctly classified. To develop a large margin classifier, we further require cos(θ_1) − m > cos(θ_2) and cos(θ_2) − m > cos(θ_1), where m ≥ 0 is a fixed parameter introduced to control the magnitude of the cosine margin. Since cos(θ_i) − m is lower than cos(θ_i), the constraint is more stringent for classification. The above analysis generalizes well to the multi-class scenario. Therefore, the altered loss reinforces the discrimination of learned features by encouraging an extra margin in the cosine space.

Formally, we define the Large Margin Cosine Loss (LMCL) as:

L_{lmc} = -\frac{1}{N}\sum_{i} \log \frac{e^{s(\cos(\theta_{y_i,i}) - m)}}{e^{s(\cos(\theta_{y_i,i}) - m)} + \sum_{j \neq y_i} e^{s \cos(\theta_{j,i})}},  (4)

subject to

W = \frac{W^*}{\|W^*\|}, \quad x = \frac{x^*}{\|x^*\|}, \quad \cos(\theta_{j,i}) = W_j^T x_i,  (5)

where N is the number of training samples, x_i is the i-th feature vector corresponding to the ground-truth class y_i, W_j is the weight vector of the j-th class, and θ_{j,i} is the angle between W_j and x_i.
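To make the formulation concrete, here is a minimal NumPy sketch of a forward pass of LMCL (Eq. 4) under the constraints of Eq. 5. This is not the paper's Caffe implementation; the function name, the batch interface, and the default values of s and m are illustrative assumptions (Sec. 4.1 sets s = 64 for the experiments).

```python
import numpy as np

def lmcl_loss(features, weights, labels, s=64.0, m=0.35):
    """Forward pass of the Large Margin Cosine Loss (Eq. 4).

    features: (N, D) raw feature vectors x*
    weights:  (C, D) raw class weight vectors W*
    labels:   (N,)   ground-truth class indices y_i
    s, m:     scaling factor and cosine margin (illustrative values)
    """
    # Eq. 5: L2-normalize both features and weights so that the logits
    # become cosines, cos(theta_{j,i}) = W_j^T x_i.
    x = features / np.linalg.norm(features, axis=1, keepdims=True)
    W = weights / np.linalg.norm(weights, axis=1, keepdims=True)
    cos_theta = x @ W.T                                  # (N, C)

    # Subtract the margin m from the ground-truth class cosine only,
    # then scale all logits by s.
    rows = np.arange(len(labels))
    logits = s * cos_theta
    logits[rows, labels] = s * (cos_theta[rows, labels] - m)

    # Cross-entropy over the margin-adjusted logits (log-sum-exp stabilized).
    logits -= logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_probs[rows, labels].mean()
```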
3.2. Comparison on Different Loss Functions

In this subsection, we compare the decision margin of our method (LMCL) to those of Softmax, NSL, and A-Softmax, as illustrated in Figure 2. For simplicity of analysis, we consider the binary-class scenario with classes C1 and C2. Let W1 and W2 denote the weight vectors for C1 and C2, respectively.

Softmax loss defines a decision boundary by:

\|W_1\| \cos(\theta_1) = \|W_2\| \cos(\theta_2).

Thus, its boundary depends on both the magnitudes of the weight vectors and the cosines of the angles, which results in an overlapping decision area (margin < 0) in the cosine space. This is illustrated in the first subplot of Figure 2. As noted before, in the testing stage it is a common strategy to only consider the cosine similarity between testing feature vectors of faces. Consequently, the classifier trained with the Softmax loss is unable to perfectly classify testing samples in the cosine space.

NSL normalizes the weight vectors W1 and W2 such that they have constant magnitude 1, which results in a decision boundary given by:

\cos(\theta_1) = \cos(\theta_2).

The decision boundary of NSL is illustrated in the second subplot of Figure 2. We can see that by removing radial variations, the NSL is able to perfectly classify testing samples in the cosine space, with margin = 0. However, it is not quite robust to noise because there is no decision margin: any small perturbation around the decision boundary can change the decision.

A-Softmax improves the softmax loss by introducing an extra margin, such that its decision boundary is given by:

C_1: \cos(m\theta_1) \geq \cos(\theta_2),
C_2: \cos(m\theta_2) \geq \cos(\theta_1).

Thus, for C1 it requires θ1 ≤ θ2/m, and similarly for C2. The third subplot of Figure 2 depicts this decision area, where the gray area denotes the decision margin. However, the margin of A-Softmax is not consistent over all θ values: the margin becomes smaller as θ reduces, and vanishes completely when θ = 0. This results in two potential issues. First, for difficult classes C1 and C2 which are visually similar and thus have a smaller angle between W1 and W2, the margin is consequently smaller. Second, technically speaking one has to employ an extra trick with an ad-hoc piecewise function to overcome the nonmonotonicity difficulty of the cosine function.

LMCL (our proposed) defines a decision margin in the cosine space rather than the angle space (like A-Softmax) by:

C_1: \cos(\theta_1) \geq \cos(\theta_2) + m,
C_2: \cos(\theta_2) \geq \cos(\theta_1) + m.

Therefore, cos(θ1) is maximized while cos(θ2) is minimized for C1 (and similarly for C2) to perform large-margin classification. The last subplot in Figure 2 illustrates the decision boundary of LMCL in the cosine space, where we can see a clear margin (√2 m) in the produced distribution of the cosine of the angle. This suggests that LMCL is more robust than the NSL, because a small perturbation around the decision boundary (dashed line) is less likely to lead to an incorrect decision. The cosine margin is applied consistently to all samples, regardless of the angles of their weight vectors.
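As a concrete companion to the four decision boundaries above, the following sketch evaluates each rule for a single sample. The function name and the particular margin values (m = 0.35, an angular margin of 4) are illustrative assumptions, not settings prescribed by the paper.

```python
import numpy as np

def class1_decision(theta1, theta2, w1_norm=1.0, w2_norm=1.0, m=0.35, m_ang=4):
    """Decision rules of Sec. 3.2 for assigning a sample to class C1.

    theta1, theta2 : angles (radians) between the feature and W1, W2
    m              : LMCL cosine margin (illustrative value)
    m_ang          : A-Softmax integer angular margin (illustrative value;
                     the rule below is valid for theta1 in [0, pi/m_ang])
    """
    return {
        # Softmax: boundary depends on the weight-vector magnitudes as well
        "Softmax":   w1_norm * np.cos(theta1) > w2_norm * np.cos(theta2),
        # NSL: pure cosine comparison, zero margin
        "NSL":       np.cos(theta1) > np.cos(theta2),
        # A-Softmax: angular margin that shrinks as theta1 decreases
        "A-Softmax": np.cos(m_ang * theta1) > np.cos(theta2),
        # LMCL: constant cosine margin m for every sample
        "LMCL":      np.cos(theta1) - m > np.cos(theta2),
    }

# A sample barely on the C1 side of the NSL boundary passes NSL
# but fails LMCL, which demands the extra cosine margin m.
print(class1_decision(theta1=np.radians(44), theta2=np.radians(46)))
```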
3.3. Normalization on Features

In the proposed LMCL, a normalization scheme is involved on purpose to derive the formulation of the cosine loss and remove variations in radial directions. Unlike [23], which only normalizes the weight vectors, our approach simultaneously normalizes both weight vectors and feature vectors. As a result, the feature vectors distribute on a hypersphere, where the scaling parameter s controls the magnitude of the radius. In this subsection, we discuss why feature normalization is necessary and how it encourages better feature learning in the proposed LMCL approach.

The necessity of feature normalization is presented in two respects. First, the original softmax loss without feature normalization implicitly learns both the Euclidean norm (L2-norm) of feature vectors and the cosine value of the angle. The L2-norm is adaptively learned for minimizing the overall loss, resulting in a relatively weak cosine constraint. In particular, the adaptive L2-norm of easy samples becomes much larger than that of hard samples to remedy the inferior performance of the cosine metric. On the contrary, our approach requires the entire set of feature vectors to have the same L2-norm, so that the learning depends only on cosine values to develop the discriminative power. Feature vectors from the same classes are clustered together and those from different classes are pulled apart on the surface of the hypersphere. Additionally, we consider the situation when the model initially starts to minimize the LMCL. Given a feature vector x, let cos(θi) and cos(θj) denote the cosine scores of the two classes, respectively. Without normalization on features, the LMCL forces ‖x‖(cos(θi) − m) > ‖x‖ cos(θj). Note that cos(θi) and cos(θj) can be initially comparable with each other. Thus, as long as (cos(θi) − m) is smaller than cos(θj), ‖x‖ is required to decrease to minimize the loss, which degenerates the optimization. Therefore, feature normalization is critical under the supervision of LMCL, especially when the networks are trained from scratch. Likewise, it is more favorable to fix the scaling parameter s instead of learning it adaptively.

Furthermore, the scaling parameter s should be set to a properly large value to yield better-performing features with lower training loss. For NSL, the loss continuously goes down with higher s, while too small an s leads to insufficient convergence or even no convergence. For LMCL, we also need an adequately large s to ensure a sufficient hyperspace for feature learning with an expected large margin. In the following, we show that the parameter s should have a lower bound to obtain the expected classification performance. Given the normalized learned feature vector x and unit weight vector W, we denote the total number of classes as C. Suppose that the learned feature vectors separately lie on the surface of the hypersphere and center around the corresponding weight vectors. Let P_W denote the expected minimum posterior probability of the class center (i.e., W); the lower bound of s is given by¹:

s \geq \frac{C-1}{C} \log \frac{(C-1) P_W}{1 - P_W}.  (6)

Based on this bound, we can infer that s should be enlarged consistently if we expect an optimal P_W for classification with a certain number of classes. Besides, keeping P_W fixed, the desired s should be larger to deal with more classes, since the growing number of classes increases the difficulty of classification in the relatively compact space. A hypersphere with a large radius s is therefore required for embedding features with small intra-class distance and large inter-class distance.
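A quick way to get a feel for Eq. (6) is to evaluate the bound numerically. The snippet below is a small illustration; the chosen values of P_W are arbitrary, and 10,575 simply matches the CASIA-WebFace subject count mentioned in Sec. 4.1.

```python
import numpy as np

def s_lower_bound(C, P_W):
    """Lower bound on the scaling parameter s from Eq. (6),
    for C classes and an expected minimum class-center posterior P_W."""
    return (C - 1) / C * np.log((C - 1) * P_W / (1 - P_W))

# The required s grows with both the number of classes and the desired posterior.
for C in (10, 100, 10575):
    print(C, round(s_lower_bound(C, P_W=0.9), 2))
```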
Figure 3. A geometrical interpretation of LMCL from the feature perspective. Different color areas represent the feature space of distinct classes. LMCL has a relatively compact feature region compared with NSL.

3.4. Theoretical Analysis for LMCL

The preceding subsections essentially discuss the LMCL from the classification point of view. In terms of learning discriminative features on the hypersphere, the cosine margin serves as a momentous part to strengthen the discriminating power of features. A detailed analysis of the quantitatively feasible choice of the cosine margin (i.e., the bound of the hyper-parameter m) is necessary. The optimal choice of m potentially leads to more promising learning of highly discriminative face features. In the following, we delve into the decision boundary and the angular margin in the feature space to derive the theoretical bound for the hyper-parameter m.

First, considering the binary-class case with classes C1 and C2 as before, suppose that the normalized feature vector x is given. Let Wi denote the normalized weight vector, and θi denote the angle between x and Wi. For NSL, the decision boundary is defined by cos θ1 − cos θ2 = 0, which is equivalent to the angular bisector of W1 and W2, as shown in the left of Figure 3. This indicates that the model supervised by NSL partitions the underlying feature space into two close regions, where the features near the boundary are extremely ambiguous (i.e., belonging to either class is acceptable). In contrast, LMCL drives the decision boundary formulated by cos θ1 − cos θ2 = m for C1, in which θ1 should be much smaller than θ2 (similarly for C2). Consequently, the inter-class variance is enlarged while the intra-class variance shrinks.

Back to Figure 3, one can observe that the maximum angular margin is subject to the angle between W1 and W2. Accordingly, the cosine margin should have a limited variable scope when W1 and W2 are given. Specifically, suppose a scenario in which all the feature vectors belonging to class i exactly overlap with the corresponding weight vector Wi of class i. In other words, every feature vector is identical to the weight vector of class i, and apparently the feature space is in an extreme situation, where all the feature vectors lie at their class centers. In that case, the margin of the decision boundaries has been maximized (i.e., the strict upper bound of the cosine margin).

To extend to the general case, we suppose that all the features are well-separated and we have a total number of C classes. The theoretical variable scope of m is supposed to be: 0 ≤ m ≤ (1 − max(W_i^T W_j)), where i, j ≤ C, i ≠ j. The softmax loss tries to maximize the angle between any two weight vectors from two different classes in order to perform perfect classification. Hence, it is clear that the optimal solution for the softmax loss should uniformly distribute the weight vectors on a unit hypersphere. Based on this assumption, the variable scope of the introduced cosine margin m can be inferred as follows²:

0 \leq m \leq 1 - \cos\frac{2\pi}{C},  (K = 2)
0 \leq m \leq \frac{C}{C-1},  (C \leq K + 1)   (7)
0 \leq m < \frac{C}{C-1},  (C > K + 1)

where C is the number of training classes and K is the dimension of the learned features. The inequalities indicate that as the number of classes increases, the upper bound of the cosine margin between classes decreases correspondingly. In particular, if the number of classes is much larger than the feature dimension, the upper bound of the cosine margin will get even smaller.

¹ Proof is attached in the supplemental material.
² Proof is attached in the supplemental material.
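The bound in Eq. (7) is easy to evaluate. The helper below is an illustrative sketch (its name and interface are not from the paper); for C = 8 classes with 2-D features it reproduces the value 1 − cos(π/4) ≈ 0.29 used in the toy experiment that follows.

```python
import numpy as np

def cosine_margin_upper_bound(C, K):
    """Upper bound of the cosine margin m from Eq. (7),
    for C training classes and feature dimension K."""
    if K == 2:
        return 1 - np.cos(2 * np.pi / C)
    # For C <= K + 1 the bound C/(C-1) is attainable; for C > K + 1 it is strict.
    return C / (C - 1)

# 8 identities with 2-D features, as in the toy experiment of Sec. 3.4:
print(round(cosine_margin_upper_bound(C=8, K=2), 3))   # 1 - cos(pi/4) ≈ 0.293
```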
Figure 4. A toy experiment of different loss functions on 8 identities with 2D features. The first row maps the 2D features onto the Euclidean
space, while the second row projects the 2D features onto the angular space. The gap becomes evident as the margin term m increases.
A reasonable choice of a larger m ∈ [0, C/(C−1)) should effectively boost the learning of highly discriminative features. Nevertheless, the parameter m usually cannot reach the theoretical upper bound in practice due to the vanishing of the feature space. That is, all the feature vectors are centered together according to the weight vector of the corresponding class. In fact, the model fails to converge when m is too large, because the cosine constraint (i.e., cos θ1 − m > cos θ2 or cos θ2 − m > cos θ1 for two classes) becomes stricter and is hard to satisfy. Besides, the cosine constraint with an overly large m forces the training process to be more sensitive to noisy data. An ever-increasing m starts to degrade the overall performance at some point because of failing to converge.

We perform a toy experiment for better visualization of the features and to validate our approach. We select face images from 8 distinct identities containing enough samples to clearly show the feature points on the plot. Several models are trained using the original softmax loss and the proposed LMCL with different settings of m. We extract 2-D features of the face images for simplicity. As discussed above, m should be no larger than 1 − cos(π/4) (about 0.29), so we set up three choices of m for comparison: m = 0, m = 0.1, and m = 0.2. As shown in Figure 4, the first row and the second row present the feature distributions in the Euclidean space and the angular space, respectively. We can observe that the original softmax loss produces ambiguity in the decision boundaries, while the proposed LMCL performs much better. As m increases, the angular margin between different classes is amplified.

4. Experiments

4.1. Implementation Details

Preprocessing. Firstly, the face area and landmarks are detected by MTCNN [16] for the entire set of training and testing images. Then, the 5 facial points (two eyes, nose and two mouth corners) are adopted to perform a similarity transformation. After that we obtain the cropped faces, which are then resized to 112 × 96. Following [42, 23], each pixel (in [0, 255]) in the RGB images is normalized by subtracting 127.5 and then dividing by 128.

Training. For a direct and fair comparison to the existing results that use small training datasets (less than 0.5M images and 20K subjects) [17], we train our models on a small training dataset, the publicly available CASIA-WebFace [46] dataset containing 0.49M face images from 10,575 subjects. We also use a large training dataset to evaluate the performance of our approach for benchmark comparison with the state-of-the-art results (using large training data) on the benchmark face datasets. The large training dataset that we use in this study is composed of several public datasets and a private face dataset, containing about 5M images from more than 90K identities. The training faces are horizontally flipped for data augmentation. In our experiments we remove face images belonging to identities that appear in the testing datasets.

For a fair comparison, the CNN architecture used in our work is similar to [23], which has 64 convolutional layers and is based on residual units [9]. The scaling parameter s in Equation (4) is set to 64 empirically. We use Caffe [14] to implement the modifications of the loss layer and run the models.
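As a minimal illustration of the pixel-level preprocessing described in Sec. 4.1 (the function names are ours, and the MTCNN-based alignment producing the 112 × 96 crop is assumed to have been done already):

```python
import numpy as np

def normalize_face(img_rgb):
    """Map an aligned 112x96 RGB crop with values in [0, 255] to roughly
    [-1, 1] by subtracting 127.5 and dividing by 128 (Sec. 4.1)."""
    return (img_rgb.astype(np.float32) - 127.5) / 128.0

def augment(img_rgb):
    """Horizontal flip used for data augmentation (Sec. 4.1)."""
    return img_rgb[:, ::-1, :]
```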
Normalization   LFW     YTF     MF1 Rank 1   MF1 Veri.
No              99.10   93.1    75.10        88.65
Yes             99.33   96.1    77.11        89.88
1 + \frac{e^{-s}}{C} \sum_{i,j,\, i \neq j} e^{s (W_i^T W_j)} \leq \frac{1}{P_W}.  (11)

Because f(x) = e^{s \cdot x} is a convex function, according to Jensen's inequality, we obtain:

\frac{1}{C(C-1)} \sum_{i,j,\, i \neq j} e^{s (W_i^T W_j)} \geq e^{\frac{s}{C(C-1)} \sum_{i,j,\, i \neq j} W_i^T W_j}.  (12)

Besides, it is known that

\sum_{i,j,\, i \neq j} W_i^T W_j = \Big(\sum_i W_i\Big)^2 - \sum_i \|W_i\|^2 \geq -C.  (13)

Thus, we have:

1 + (C-1)\, e^{-\frac{sC}{C-1}} \leq \frac{1}{P_W}.  (14)

Further simplification yields:

s \geq \frac{C-1}{C} \ln \frac{(C-1) P_W}{1 - P_W}.  (15)

Since the sum in (13) is no smaller than −C and consists of C(C−1) terms, the maximum term is no smaller than the average; therefore max(W_i^T W_j) ≥ −1/(C−1), and we have 0 ≤ m ≤ (1 − max(W_i^T W_j)) ≤ C/(C−1).

Similarly, the equality holds if and only if every W_i^T W_j is equal (i ≠ j) and Σ_i W_i = 0. As discussed above, this is satisfied only if C ≤ K + 1. On this condition, the distance between the vertexes of two arbitrary W should be the same. In other words, they form a regular simplex, such as an equilateral triangle if C = 3, or a regular tetrahedron if C = 4.

For the case of C > K + 1, the equality cannot be satisfied; in fact, the strict upper bound cannot be formulated. Hence, we obtain 0 ≤ m < C/(C−1). Because the number of classes can be much larger than the feature dimension, the equality cannot hold in practice.
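As a sanity check on the chain of inequalities above, the following sketch (with arbitrary illustrative values of C, K, and s) draws random unit weight vectors and confirms that the left-hand side of (11) always dominates the bound used in (14):

```python
import numpy as np

rng = np.random.default_rng(0)

# Arbitrary illustrative configuration: C classes, K-dimensional weights, scale s.
C, K, s = 16, 8, 30.0
W = rng.normal(size=(C, K))
W /= np.linalg.norm(W, axis=1, keepdims=True)

dots = W @ W.T
off_diag = dots[~np.eye(C, dtype=bool)]          # all W_i^T W_j with i != j

# Left-hand side of (11) versus the lower bound 1 + (C-1) * exp(-sC/(C-1)) from (14).
lhs_11 = 1 + np.exp(-s) / C * np.exp(s * off_diag).sum()
bound_14 = 1 + (C - 1) * np.exp(-s * C / (C - 1))
print(lhs_11 >= bound_14)                        # True for any configuration of W
```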