An Efficient Training Approach For Very Large Scale Face Recognition
Kai Wang1,2*  Shuo Wang2*  Panpan Zhang1  Zhipeng Zhou2  Zheng Zhu3  Xiaobo Wang4  Xiaojiang Peng5  Baigui Sun2  Hao Li2  Yang You1†
1 National University of Singapore  2 Alibaba Group  3 Tsinghua University  4 Institute of Automation, Chinese Academy of Sciences  5 Shenzhen Technology University
Code: https://github.com/tiandunx/FFC
[Figure 1: (a) Comparison of backbone and FC time cost (ms) versus face identities (million). (b) The memory occupancy of the FC layer at the training phase (GB) versus face identities (million), with V100, P100 and 2080Ti capacities marked.]
Figure 1: Visualization of training time and GPU memory occupancy. Figure 1a shows the forward-time comparison of the backbone (ResNet50) and the FC layer. Given an image, the time cost of the FC layer increases sharply with the growing number of face identities, while the backbone time stays unchanged. Figure 1b illustrates the GPU memory occupancy as a function of the number of face identities. Even a 32 GB V100 GPU can only store the FC parameters for an output size of about 6 million identities (the feature dimension in face recognition is usually 512). It is therefore necessary to design a method that reduces the training time and hardware cost of the FC layer.
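As a back-of-the-envelope check of Figure 1b (our estimate, assuming fp32 storage; the paper does not state this breakdown):

    6 × 10^6 identities × 512 dims × 4 B ≈ 12.3 GB  (FC weights alone)
    weights + gradient buffer              ≈ 24.6 GB

which already approaches the 32 GB of a V100 before the backbone, activations and optimizer state are counted.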
To tackle the aforementioned issues, we propose an efficient training approach for ultra-large-scale face datasets, termed Faster Face Classification (F²C). In F²C, we first introduce twin backbones named Gallery Net (G-Net) and Probe Net (P-Net) to generate identity centers and extract face features, respectively. G-Net has the same structure as P-Net and inherits the parameters from P-Net in a moving-average manner. Considering that the most time-consuming part of ultra-large-scale training lies in the FC layer, we propose the Dynamic Class Pool (DCP) to store the features from G-Net and calculate the logits with positive samples (whose identities appear in DCP) in each mini-batch. DCP can be regarded as a substitute for the FC layer, and its size is much smaller than FC, which is why F²C can largely reduce the time and resource cost compared to the FC layer. For negative samples (whose identities do not appear in DCP), we minimize the cosine similarities between the negative samples and DCP. To improve the update efficiency and speed of the DCP parameters, we design a dual data loader consisting of identity-based and instance-based loaders, which loads images from the given dataset by instance and by identity to generate training batches. Finally, we conduct extensive experiments on several face benchmarks to show that F²C achieves comparable results and a higher training speed than the normal FC-based method. F²C also obtains superior performance to previous methods in terms of recognition accuracy and hardware cost. Our contributions can be summarized as follows.

1) We propose an efficient training approach, F²C, for ultra-large-scale face recognition training, which aims to reduce the training time and hardware costs while keeping performance comparable to state-of-the-art FC-based methods.

2) We design DCP to store and update the identities' features dynamically as an alternative to the FC layer. The size of DCP is much smaller than FC and independent of the total number of face identities, so the training time and hardware costs can be decreased substantially.

3) We design a dual data loader, including identity-based and instance-based loaders, to improve the update efficiency of the DCP parameters.

2. Related Work

Face Recognition. Face recognition has witnessed dramatic progress due to large-scale datasets, advanced architectures and loss functions. Large-scale datasets play the most crucial role in promoting the performance of face recognition [5]. These datasets can be divided into three intervals according to the number of face identities: 1-10K, 11-100K, and >100K. VGGFace [21], VGGFace2 [3], UMDFaces [2], CelebFaces [28], and CASIA-WebFace [45] belong to the first interval. The face identities of IMDB-Face [30] and MS1MV2 [5] are between 11K and 100K. Glint360k [1] and WebFace260M [49] have about 0.36M and 4M identities, respectively. Many previous works [1, 46, 49] illustrate that training on datasets with more face identities can
achieve better performance than on smaller ones. Therefore, using WebFace260M as the training dataset obtains state-of-the-art performance on IJBC [19] and a top-3 ranking in the NIST-FRVT challenge. Based on these datasets, a variety of CNN architectures for improving performance have been proposed, such as VGGNet [26], GoogleNet [29], ResNet [10], AttentionNet [32] and MobileFaceNet [4]. For the loss function, contrastive loss [28, 44] and triplet loss [26] might be good candidates, but they suffer from high computational cost and slow convergence. To this end, researchers attempt to explore new metric-learning loss functions to boost face recognition performance. Several margin-based softmax losses [5, 17, 31, 40, 41] have been exploited and obtained state-of-the-art results. To sum up, current methods and large-scale datasets have achieved excellent performance in face recognition, but the training time and hardware costs are still the bottleneck at the training phase, especially for training on million-scale or even larger face identity datasets.

Acceleration for the Large-Scale FC Layer. As illustrated in Figure 1a, the time cost concentrates on the FC layer rather than the convolutional layers when the number of face identities reaches 10M. Researchers have attempted to accelerate large-scale FC training since 2001. An intuitive idea is to design an approximate function to reduce the computational cost: Hierarchical Softmax (HSM) [7] reformulates the multi-class classifier into a hierarchy of binary classifiers, so the training cost is reduced because a given sample only has to traverse a path from the root to its corresponding class. However, all the class centers are stored in RAM, and the retrieval time cannot be ignored as the number of face identities increases. Zhang et al. [46] proposed a method that recognizes a small number of "active classes" in each mini-batch, constructing dynamic class hierarchies on the fly. However, recognizing the "active classes" is also time-consuming when the number of face identities is too large. Some companies, such as Google and Microsoft, divide all the categories evenly across multiple GPUs, but the communication cost between servers cannot be ignored. To tackle this problem, Partial FC [1] trains a large-scale dataset on a single GPU server using 10% of the identities, sampled randomly at each iteration. However, it is still limited by the memory of the GPUs in a single machine. As shown in Figure 1b, Partial FC can only work when the number of face identities is not ultra-large (<10M); otherwise the GPUs will still run out of memory. There are also several pairwise-based methods [12] that utilize face pairs to train large-scale datasets, but their time complexity is O(N^k), where k represents the size of the pair. The latest related work, VFC [16], builds virtual FC parameters to reduce the computation cost, but its performance is much lower than that of the normal FC. Different from previous works, our F²C can largely reduce the FC training cost and achieve performance comparable to normal FC-based methods.

3. Faster Face Classification

In this section, we first give an overview of F²C for a brief understanding of our method. Then we present our motivation and the key modules for ultra-large-scale dataset training. After that, we show the theoretical/empirical analysis of these modules. Finally, we demonstrate the training details for better reproduction.

3.1. Overview of F²C

The problem we tackle is to accelerate the training speed and reduce the hardware costs on ultra-large-scale face datasets (face identities > 10M) without obvious degradation of performance. To this end, we propose the F²C framework for ultra-large-scale face dataset training. As shown in Figure 2, given an ultra-large-scale face dataset, we utilize the instance-based loader to generate an instance batch, as a data loader usually does. Meanwhile, the identity-based loader randomly selects two images from the same identity to form the paired identity batch. Subsequently, we mix up the images from the instance and paired identity batches as shown in Figure 2 and feed them into G-Net and P-Net. Inspired by MoCo [9], G-Net has the same structure as P-Net and inherits parameters from P-Net in a moving-average manner. G-Net and P-Net are used to generate identities' centers and to extract face features for face recognition, respectively. Then, we introduce DCP as a substitute for the FC layer. DCP is randomly initialized and updated by the features from G-Net at each iteration. The update strategy of DCP follows one rule: use the current features to replace the most outdated features in DCP. For positive samples, we use the common cross-entropy loss. For negative samples, we minimize the cosine similarities between the negative samples and DCP. The whole F²C is optimized by the cross-entropy loss and the cosine similarities simultaneously.

3.2. Motivation

Before digging into F²C, we provide some motivation by rethinking the loss function that cooperates with the FC layer. For convenience, we consider the Softmax loss as follows:

$L = -\frac{1}{N}\sum_{i=1}^{N}\log\frac{e^{W_{y_i}^{T}x_i}}{\sum_{j=1}^{n_{ID}}e^{W_j^{T}x_i}}$    (1)

where N is the batch size and $n_{ID}$ stands for the total number of face identities. At each iteration of the training process, the update of the classifier $\{W_j\}_{j=1}^{n_{ID}}$ is performed as follows:

$\frac{\partial L}{\partial W_k} = -\frac{1}{N}\sum_{i=1}^{N}\Big(\delta_{ky_i} - \frac{e^{W_k^{T}x_i}}{\sum_{j=1}^{n_{ID}}e^{W_j^{T}x_i}}\Big)x_i$    (2)
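To make the cost structure of equations (1) and (2) concrete, the following PyTorch sketch (ours, with illustrative sizes; not from the paper) evaluates the loss and checks the classifier gradient. Both the logits and the gradient touch all $n_{ID}$ rows of W, so time and memory grow linearly with the number of identities:

    import torch
    import torch.nn.functional as F

    D, N, n_id = 512, 8, 100_000           # feature dim, batch size, identities
    x = torch.randn(N, D)                  # backbone features x_i
    labels = torch.randint(0, n_id, (N,))  # y_i
    W = torch.randn(n_id, D, requires_grad=True)  # classifier {W_j}

    logits = x @ W.t()                     # W_j^T x_i for every j: O(N * n_id * D)
    loss = F.cross_entropy(logits, labels) # equation (1)
    loss.backward()                        # equation (2): touches all n_id rows of W

    # Manual check of equation (2): dL/dW_k = -1/N sum_i (delta_{k,y_i} - p_ik) x_i
    p = F.softmax(logits, dim=1)
    delta = F.one_hot(labels, n_id).float()
    manual = -(delta - p).t() @ x / N
    assert torch.allclose(manual, W.grad, atol=1e-4)

With n_id = 10M, this weight matrix alone takes about 20 GB in fp32, which is exactly the wall shown in Figure 1b.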
Figure 2: The pipeline of F²C. We use the instance and identity data loaders to generate mixed batches (I ∪ III, II ∪ IV), which are then fed into P-Net and G-Net, respectively (cf. equation 6). The features from G-Net update DCP in the LRU manner, and the features from P-Net are used to compute the loss together with DCP.
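As described in Section 3.1, G-Net inherits P-Net's parameters in a moving-average manner. A minimal sketch of such a MoCo-style momentum update (our illustration; m = 0.999 is MoCo's default and is assumed here, since this excerpt does not state the value used):

    import torch

    @torch.no_grad()
    def update_gnet(g_net: torch.nn.Module, p_net: torch.nn.Module, m: float = 0.999):
        # G-Net trails P-Net: theta_g <- m * theta_g + (1 - m) * theta_p
        for pg, pp in zip(g_net.parameters(), p_net.parameters()):
            pg.mul_(m).add_(pp, alpha=1 - m)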
Here the shape of DCP mentioned in the main paper is C × K × D: C is the magnitude of DCP, K is the capacity of each placeholder in DCP, and D represents the feature dimension. The total number of images in the given dataset can be denoted as $\bar{k}n_{ID}$, where $\bar{k}$ is the average number of images per identity, so one epoch contains $\frac{\bar{k}n_{ID}}{M}$ iterations. We evaluate the update speed by estimating the minimum number of epochs needed for a given face identity's center to be updated.

• If we only use the instance-based loader, the update speed of the identities' centers lies in $[\frac{Mk_{min}}{\bar{k}n_{ID}}, \frac{Mk_{max}}{\bar{k}n_{ID}}]$ updates per iteration, where $k_{min}$ and $k_{max}$ denote the fewest and most images of any single identity. So using only the instance-based loader may lead to the following problems. 1. If the number of images per identity is severely imbalanced, the update speed of the centers of identities with few images is too slow. 2. If we sample M images that belong to M different identities, the DCP may have no positive samples for this iteration; in this case the cross entropy, which is crucial for classification, cannot be calculated.

• If we only use the identity-based loader, we obtain the average fastest update speed ($\frac{M}{n_{ID}}$) for each identity. However, the identity-based loader re-samples identities with few images too many times, so it needs about $\frac{k_{max}}{k_{min}}$ times more iterations than the instance-based loader to sample all images from the dataset. Further, the sampling probability for each instance of identities with rich intra-class images is too low, so the identity-based loader cannot sample plenty of intra-class images during the training phase.

• Using the dual data loader inherits the benefits of both the instance-based and identity-based loaders (a sketch follows this list). First, the dual data loader provides appropriate ratios between positive and negative images, which is very important for DCP. Second, the dual data loader keeps a high update efficiency (speed) of the identities' centers and provides various intra-class images.
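A simplified sketch of the two sampling strategies and their default 1:1 combination (our illustration; `id2imgs` is a hypothetical identity-to-image-paths index, and a real loader would return tensors rather than paths):

    import random
    from collections import defaultdict

    id2imgs = defaultdict(list)  # hypothetical index: identity -> image paths

    def instance_batch(all_imgs, n):
        # Instance-based: sample images uniformly, so identities with many
        # images get their centers updated more often.
        return random.sample(all_imgs, n)

    def identity_batch(identities, n_ids):
        # Identity-based: sample identities uniformly and take two images
        # each (K = 2), so rare identities are still updated regularly.
        batch = []
        for pid in random.sample(identities, n_ids):
            batch += random.choices(id2imgs[pid], k=2)
        return batch

    def dual_batch(all_imgs, identities, batch_size=512):
        # Default 1:1 ratio: half instance-based, half identity-based.
        half = batch_size // 2
        return instance_batch(all_imgs, half) + identity_batch(identities, half // 2)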
Feature Extraction. We take I∪III and II∪IV as input to the Probe and Gallery Nets, respectively, to extract the face features and generate the identities' centers. The process can be formulated as follows:

$P_\theta(I \cup III) = F_p^{DCP} \oplus F_p^{\neg DCP}, \quad G_\phi(II \cup IV) = F_g$    (6)

where the probe and gallery nets are abbreviated as $P_\theta$ and $G_\phi$, with parameters denoted as θ and φ, respectively. The symbol ¬ is used to split the features whose identities belong to DCP (subsection 3.4) from those that do not. $F_g$ represents the features extracted by the Gallery Net. For each batch, we denote the number of identities in DCP as I and the number of identities not in DCP as M − I.

3.4. Dynamic Class Pool

In this subsection, we introduce the details of the Dynamic Class Pool (DCP). Inspired by the sliding window [14] in object detection, we utilize a dynamic identity group that slides over the whole set of face identities across iterations. We call this sliding identity group the DCP, which can be regarded as a substitute for the FC layer. First, we define a tensor T of size C × K × D, initialized with a Gaussian distribution, where C is the capacity, i.e., the number of face identities the DCP can hold, and K represents the number of features that belong to the same identity (we set the default as K = 2). We store $F_g$ in DCP and update the most outdated features of DCP using $F_g$ in each iteration. The updating rule is similar to the least-recently-used (LRU)¹ policy and can be formulated as

$T[1:C-M,:,:] = T[M+1:C,:,:] \in \mathbb{R}^{(C-M)\times K\times D}, \quad T[C-M+1:C,:,:] = F_g \in \mathbb{R}^{M\times K\times D}$    (7)

For the current batch, with the update of the DCP, we obtain a pseudo feature center for each identity in DCP, including the identities contained in II ∪ IV. As claimed in equation 6, the features from P-Net can be divided into two types with respect to DCP: one is $F_p^{DCP}$, the other is $F_p^{\neg DCP}$. For $F_p^{DCP}$, we can calculate its logits by the following equation:

$P = \frac{1}{K}\sum_{i=1}^{K}\langle F_p^{DCP}, T[:,i,:]\rangle \in \mathbb{R}^{I\times C}$    (8)

where ⟨·, ·⟩ denotes the inner product and P represents the logits of $F_p^{DCP}$. Therefore, we can formulate the cross-entropy loss as follows:

$L_{ce} = -\frac{1}{I}\sum_{i=1}^{I}\log\frac{e^{W_{y_i}^{T}P_i}}{\sum_{j=1}^{C}e^{W_j^{T}P_i}}$,    (9)

where $W_j$ is the j-th classifier and $y_i$ is the identity of $P_i$. For the features $F_p^{\neg DCP}$ whose IDs are not in DCP, we add a constraint to minimize the cosine similarity between $F_p^{\neg DCP}$ and T, which can be formulated as

$L_{cos} = \frac{1}{M-I}\sum_{i=1}^{M-I}\varphi(F_{p,i}^{\neg DCP}, \bar{T})$,    (10)

where φ is the operation of calculating the cosine similarity and $\bar{T}$ denotes averaging along the K axis of DCP. The total loss is $L_{total} = L_{ce} + L_{cos}$.

¹ https://www.interviewcake.com/concept/java/lru-cache
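The DCP bookkeeping of equations (7), (8) and (10) can be summarized in a few lines of PyTorch (a sketch under our reading of the equations, with illustrative sizes; equation (9)'s extra classifier W is folded into a plain cross entropy over pool positions here for brevity, and the released FFC code remains authoritative):

    import torch
    import torch.nn.functional as F

    C, K, D, M, I = 1000, 2, 512, 64, 48   # illustrative sizes

    T = torch.randn(C, K, D)               # DCP, Gaussian-initialized

    def update_dcp(T, f_g):
        # Equation (7): drop the M most outdated identities, append F_g.
        return torch.cat([T[f_g.shape[0]:], f_g], dim=0)

    def dcp_logits(f_p, T):
        # Equation (8): mean inner product over the K features per identity.
        return torch.einsum('id,ckd->ic', f_p, T) / T.shape[1]

    f_g = torch.randn(M, K, D)             # G-Net features for the identity batch
    T = update_dcp(T, f_g)

    f_p_in = torch.randn(I, D)             # P-Net features whose IDs are in DCP
    pos = torch.randint(0, C, (I,))        # their positions inside the pool
    loss_ce = F.cross_entropy(dcp_logits(f_p_in, T), pos)   # cf. equation (9)

    f_p_out = torch.randn(M - I, D)        # P-Net features not in DCP
    T_bar = T.mean(dim=1)                  # average over K -> (C, D)
    loss_cos = F.cosine_similarity(        # equation (10)
        f_p_out.unsqueeze(1), T_bar.unsqueeze(0), dim=2).mean()
    loss_total = loss_ce + loss_cos

Note that every tensor here has a shape independent of $n_{ID}$; only C, K, D and the batch size matter, which is the source of the savings.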
3.5. Empirical Analysis

As shown in equations 4 and 9, the cross-entropy loss we utilize for DCP is formally similar to the loss for FC. With a special setting of the vector V in equation 3, we can represent $L_{ce}$ in the form of equation 4. For further verification of the effect of this mechanism on training with DCP, we provide some empirical analysis.

Algorithm 1: Update Mechanism of DCP
Input: DCP $T \in \mathbb{R}^{C\times K\times D}$ initialized with a Gaussian distribution; index of the identity batch $t$; batch size $M$.
for $1 \le t \le \frac{n_{ID}}{M}$ do
    utilize G-Net to extract features from the $t$-th batch as the pseudo feature centers, denoted as $F_g$;
    if $1 \le t \le \frac{C}{M}$: store $F_g$ sequentially in the unoccupied positions of DCP;
    else: update DCP as shown in equation 7;
end

As mentioned in subsection 3.3 and equation 7, the identities in DCP are updated by an LRU mechanism, as shown in Algorithm 1. As the identity-based loader goes through the dataset in terms of identities, partial components ($\frac{M}{2}$ of them) of the vector V can be determined by shuffling the whole set of face identities and taking the corresponding $t$-th part of it, where $1 \le t \le \frac{n_{ID}}{M}$. When we use the identity-based loader, then by the setting of V and the property of the LRU rule, each classifier/pseudo feature center can be updated at least $[\frac{C}{M}]$ times. This means that every classifier has a similar chance to be optimized in our setting. DCP may have the following benefits: 1) the size of DCP is independent of the number of face identities and can be far smaller than FC, so the computational cost is greatly reduced; 2) the hardware cost, especially the storage occupancy, of DCP is also smaller than FC, and the communication cost can be reduced dramatically. These benefits are the reason we call our method Faster Face Classification.

3.6. Experimental Details

We train our F²C on a single server with 8 Tesla V100 32G GPUs. We utilize ResNet100, ResNet50 and MobileFaceNet as our backbones to evaluate the efficiency of F²C. The learning rate is initialized as 0.1 with the SGD optimizer and divided by 10 at epochs 10, 14 and 17; training terminates at 20 epochs. The length (number of IDs) of DCP defaults to 10% of the total face identities. The batch size is 512, i.e., 256 images from the identity-based loader and 256 images from the instance-based loader.
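The schedule above maps directly onto a standard PyTorch setup (a sketch; the model is a placeholder, and the momentum and weight-decay values are our assumptions since they are not stated here):

    import torch

    model = torch.nn.Linear(512, 512)   # placeholder for the real backbone
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1,
                                momentum=0.9, weight_decay=5e-4)  # assumed values
    scheduler = torch.optim.lr_scheduler.MultiStepLR(
        optimizer, milestones=[10, 14, 17], gamma=0.1)  # divide lr by 10

    for epoch in range(20):             # training terminates at 20 epochs
        # ... one epoch over the mixed dual-loader batches ...
        scheduler.step()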
4. Experiments

In this section, we first briefly review several benchmark datasets in the face recognition area. Then we conduct ablation studies to evaluate the effectiveness of each module and the settings of the hyper-parameters in F²C. Finally, we compare F²C to related state-of-the-art methods.

4.1. Datasets

We utilize MobileFaceNet, ResNet50 and ResNet100 to train F²C on MS1MV2, Glint360k and Webface42M, respectively (Webface42M is the cleaned version of the original Webface260M; it has 2M IDs and about 42M images). We mainly show the performance of F²C on the following 9 academic datasets: LFW [11], SLFW [11], CFP [25], CALFW [48], CPLFW [47], AGEDB [20], YTF [42], IJBC [19], and MegaFace [13]. LFW is collected from the Internet and contains 13,233 images of 5,749 IDs. SLFW is similar to LFW but smaller in scale. CFP collects celebrities' images including frontal and profile views. CALFW is a cross-age version of LFW. CPLFW is similar to CALFW but contains more pose-variant images. AGEDB contains images annotated with accurate-to-the-year, noise-free labels. YTF includes 3,425 videos from YouTube with 1,595 IDs. IJBC is updated from IJBB and includes 21,294 images of 3,531 subjects. MegaFace evaluates face recognition performance at the million scale of distractors and includes a large gallery set and a probe set. In this work, we use Facescrub as the probe set of MegaFace.

4.2. Performance Comparisons between FC and F²C

We choose 3 different backbones and evaluate the performance of FC and F²C on 9 academic benchmarks, using MS1MV2, Glint360k and Webface42M as training datasets. As shown in Table 1, F²C achieves performance comparable to FC. We also provide the average performance across these datasets in the last column, where F²C is lower than FC by less than 1%. Note that the size of DCP is only 10% of the total face identities.

4.3. Ablation Studies

We conduct ablation studies of the hyper-parameters and settings of F²C. Here we demonstrate the experiments on MS1MV2 using MobileFaceNet and ResNet50.

Single Loader or Dual Loaders? As mentioned in the methodology section, dual loaders can improve the update efficiency of DCP. To evaluate the influence of the loaders in F²C, we use different combinations of the identity-based and instance-based loaders and show the results in Table 2. The Small Datasets represent LFW, SLFW, CFP, CALFW,
Table 1: Evaluation results (%) on 9 face recognition benchmarks. All models are trained from scratch on MS1MV2, Glint360k and Webface42M. The TPR@FAR=1e-4 metric is used for IJBC; MegaFace is TPR@FAR=1e-6.

Method        LFW    SLFW   CFP    CALFW  CPLFW  AGEDB  YTF    IJBC   MegaFace  Avg.
Training on MS1MV2
FC-Mobile     99.04  98.80  96.94  94.37  88.37  96.73  97.04  92.29  90.69     94.92
F²C-Mobile    98.93  98.57  97.16  94.53  87.80  96.47  97.24  91.06  89.30     94.56
FC-R50        99.78  99.55  98.80  95.76  92.01  98.13  98.03  95.74  97.82     97.29
F²C-R50       99.50  99.45  98.46  95.58  90.58  97.83  98.16  94.91  96.74     96.80
Training on Glint360k
FC-R50        99.83  99.71  99.07  95.71  93.48  98.25  97.92  96.48  98.64     97.67
F²C-R50       99.71  99.53  98.30  95.23  91.60  97.88  97.76  94.75  96.73     96.83
Training on Webface42M
FC-R100       99.83  99.81  99.38  96.11  94.90  98.58  98.51  97.68  98.57     98.15
F²C-R100      99.83  98.80  99.33  95.92  94.85  98.33  98.23  97.31  98.53     97.90
CPLFW, AGEDB and YTF in this subsection. We show the average accuracy on the Small Datasets. Unless otherwise specified, the TPR@FAR=1e-4 metric is used for IJBC and TPR@FAR=1e-6 for MegaFace by default. Training with the instance-based loader or the identity-based loader alone obtains comparable results on the small datasets. The instance-based loader outperforms the identity-based loader on IJBC and MegaFace by a large margin. This could be explained by the fact that using only the identity loader cannot ensure that all images are sampled. Using dual data loaders clearly improves the performance over each single loader, which is consistent with our analysis. Note that, to make a fair comparison, the results are obtained with the same number of samples fed to the model, not with the same number of epochs.

Table 2: Evaluation of single or dual data loaders. ID.L, Ins.L and Dua.L represent the identity loader, the instance loader and the dual loaders, respectively.

Backbone   Method  Small Datasets  IJBC   MegaFace
Mobile     ID.L    94.20           82.30  79.19
Mobile     Ins.L   94.24           89.30  86.40
Mobile     Dua.L   95.29           91.06  89.30
ResNet50   ID.L    96.70           91.75  93.65
ResNet50   Ins.L   96.08           92.06  92.74
ResNet50   Dua.L   97.07           94.91  96.74

Single Net or Dual Nets? MoCo treats the two augmented views of the same image as positive samples and achieved impressive performance in unsupervised learning. Therefore, pictures with the same ID can naturally be regarded as positive samples, so it is intuitive to use twin backbones in the same way as MoCo to generate the identities' centers and extract the face features, respectively. However, since we intend to reduce the training cost further, we compare the performance of a single net to dual nets in Table 3. The dual nets perform better than the single net on all the datasets, which illustrates that using only a single net may fall into the trivial solution, as explained in Semi-Siamese Training [6].

Table 3: Evaluation of single net or dual nets.

Backbone   Method  Small Datasets  IJBC   MegaFace
Mobile     Single  93.90           88.07  82.69
Mobile     Dual    95.29           91.06  89.30
ResNet50   Single  95.55           92.26  92.98
ResNet50   Dual    97.07           94.91  96.74

Exploring the Influence of K in DCP. K represents the number of features that belong to the same identity. We evaluate K = 1 and K = 2 in Table 4. As the features in DCP represent the category centers, an intuitive expectation is that a larger K provides a more reliable center estimation. The experimental results also support this intuition. However, we must make a trade-off between performance and storage: a larger K means better performance at the cost of GPU memory and communication among servers. Therefore, we set K = 2 in DCP by default.

Table 4: Evaluation of the number of K.

Backbone   K  Small Datasets  IJBC   MegaFace
Mobile     1  95.19           90.75  88.31
Mobile     2  95.29           91.06  89.30
ResNet50   1  96.58           94.38  96.49
ResNet50   2  97.07           94.91  96.74

Ratios within dual data loader. We set the ratio of the sizes between the instance-based and identity-based loaders to 1:1 by default. To further explore the influence of the ratios within the dual data loader, we show the experiments in Table 5. We utilize ResNet50 as the backbone to train on the MS1MV2 dataset. We find that the default ratio within the dual data loader achieves the highest results on most datasets, especially on IJBC and MegaFace.
Table 5: Evaluation of the ratios within the dual data loader. ResNet50 is used here.

Ins.L  ID.L  Small Datasets  IJBC   MegaFace
0      1     96.77           91.75  93.65
1      0     96.23           92.06  92.74
1      1     97.08           94.91  96.74
2      1     96.29           94.21  96.43
1      2     95.40           90.80  90.56

[Figure: Memory on per GPU (GB) and throughput (images/sec.) versus face identities (million), comparing Data Parallel, Model Parallel, Model Parallel+FP16, PartialFC+FP16 and FFC+FP16; an "Out of Memory" region is marked.]

5. Conclusion

In this paper, we propose an efficient training approach, F²C, for ultra-large-scale face recognition training. The main innovations are the Dynamic Class Pool (DCP), which stores and updates the face identities' features as a substitute for the FC layer, and the dual loaders that help DCP update efficiently. The results of comprehensive experiments and analysis show that our approach reduces the hardware cost and training time while obtaining comparable performance to state-of-the-art FC-based methods.

Broader impacts. The proposed method is validated on face training datasets; owing to their wide variety, the scheme could be expanded to other datasets and situations. However, [...] ethics or human rights performed by any of the authors.
Table 6: Comparisons to state-of-the-art methods. To make a fair comparison, Partial-FC, VFC, DCQ and F²C use only 1% of the identities of MS1M for training. MegaFace refers to rank-1 identification; IJBC is TPR@FAR=1e-4. The lower-boundary results are excerpted from the VFC paper; the upper-boundary results are reproduced by us.
[5] Jiankang Deng, Jia Guo, Niannan Xue, and Stefanos Zafeiriou. Arcface: Additive angular margin loss for deep face recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4690–4699, 2019. 2, 3
[6] Hang Du, Hailin Shi, Yuchi Liu, Jun Wang, Zhen Lei, Dan Zeng, and Tao Mei. Semi-siamese training for shallow face learning. In European Conference on Computer Vision, pages 36–53. Springer, 2020. 1, 7
[7] Joshua Goodman. Classes for fast maximum entropy training. In 2001 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings (Cat. No. 01CH37221), volume 1, pages 561–564. IEEE, 2001. 3
[8] Yandong Guo, Lei Zhang, Yuxiao Hu, Xiaodong He, and Jianfeng Gao. Ms-celeb-1m: A dataset and benchmark for large-scale face recognition. In European Conference on Computer Vision, pages 87–102. Springer, 2016. 1
[9] Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. Momentum contrast for unsupervised visual representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9729–9738, 2020. 3
[10] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016. 3
[11] Gary B Huang, Marwan Mattar, Tamara Berg, and Eric Learned-Miller. Labeled faces in the wild: A database for studying face recognition in unconstrained environments. In Workshop on Faces in 'Real-Life' Images: Detection, Alignment, and Recognition, 2008. 1, 6
[12] Bong-Nam Kang, Yonghyun Kim, and Daijin Kim. Pairwise relational networks for face recognition. In Proceedings of the European Conference on Computer Vision (ECCV), pages 628–645, 2018. 3
[13] Ira Kemelmacher-Shlizerman, Steven M Seitz, Daniel Miller, and Evan Brossard. The megaface benchmark: 1 million faces for recognition at scale. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4873–4882, 2016. 6
[14] Christoph H Lampert, Matthew B Blaschko, and Thomas Hofmann. Beyond sliding windows: Object localization by efficient subwindow search. In 2008 IEEE Conference on Computer Vision and Pattern Recognition, pages 1–8. IEEE, 2008. 5
[15] Bi Li, Teng Xi, Gang Zhang, Haocheng Feng, Junyu Han, Jingtuo Liu, Errui Ding, and Wenyu Liu. Dynamic class queue for large scale face recognition in the wild, 2021. 9
[16] Pengyu Li, Biao Wang, and Lei Zhang. Virtual fully-connected layer: Training a large-scale face recognition dataset with limited computational resources. 1, 3, 8, 9
[17] Weiyang Liu, Yandong Wen, Zhiding Yu, Ming Li, Bhiksha Raj, and Le Song. Sphereface: Deep hypersphere embedding for face recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 212–220, 2017. 3
[18] Yu Liu, Guanglu Song, Jing Shao, Xiao Jin, and Xiaogang Wang. Transductive centroid projection for semi-supervised large-scale recognition. In Proceedings of the European Conference on Computer Vision (ECCV), pages 70–86, 2018. 9
[19] Brianna Maze, Jocelyn Adams, James A Duncan, Nathan Kalka, Tim Miller, Charles Otto, Anil K Jain, W Tyler Niggel, Janet Anderson, Jordan Cheney, et al. Iarpa janus benchmark-c: Face dataset and protocol. In 2018 International Conference on Biometrics (ICB), pages 158–165. IEEE, 2018. 3, 6
[20] Stylianos Moschoglou, Athanasios Papaioannou, Christos Sagonas, Jiankang Deng, Irene Kotsia, and Stefanos Zafeiriou. Agedb: The first manually collected, in-the-wild age database. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 51–59, 2017. 6
[21] Omkar M. Parkhi, Andrea Vedaldi, and Andrew Zisserman. Deep face recognition. In British Machine Vision Conference, 2015. 2
[22] Xiaojiang Peng, Kai Wang, Zhaoyang Zeng, Qing Li, Jianfei Yang, and Yu Qiao. Suppressing mislabeled data via grouping and self-attention. In European Conference on Computer Vision, pages 786–802. Springer, 2020. 1
[23] Xiangyu Peng, Kai Wang, Zheng Zhu, and Yang You. Crafting better contrastive views for siamese representation learning. arXiv preprint arXiv:2202.03278, 2022. 1
[24] Florian Schroff, Dmitry Kalenichenko, and James Philbin. Facenet: A unified embedding for face recognition and clustering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 815–823, 2015. 1
[25] S. Sengupta, J. Chen, C. Castillo, V. M. Patel, R. Chellappa, and D. W. Jacobs. Frontal to profile face verification in the wild. In 2016 IEEE Winter Conference on Applications of Computer Vision (WACV), pages 1–9, 2016. 6
[26] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014. 3
[27] Kihyuk Sohn. Improved deep metric learning with multi-class n-pair loss objective. In Proceedings of the 30th International Conference on Neural Information Processing Systems, pages 1857–1865, 2016. 9
[28] Yi Sun, Xiaogang Wang, and Xiaoou Tang. Deeply learned face representations are sparse, selective, and robust. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2892–2900, 2015. 2, 3
[29] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1–9, 2015. 3
[30] Fei Wang, Liren Chen, Cheng Li, Shiyao Huang, Yanjie Chen, Chen Qian, and Chen Change Loy. The devil of face recognition is in the noise. In Proceedings of the European Conference on Computer Vision (ECCV), pages 765–780, 2018. 2
[31] Feng Wang, Jian Cheng, Weiyang Liu, and Haijun Liu. Additive margin softmax for face verification. IEEE Signal Processing Letters, 25(7):926–930, 2018. 3
[32] Fei Wang, Mengqing Jiang, Chen Qian, Shuo Yang, Cheng Li, Honggang Zhang, Xiaogang Wang, and Xiaoou Tang. Residual attention network for image classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3156–3164, 2017. 3
[33] Kai Wang, Xiaojiang Peng, Jianfei Yang, Shijian Lu, and Yu Qiao. Suppressing uncertainties for large-scale facial expression recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6897–6906, 2020. 1
[34] Kai Wang, Xiaojiang Peng, Jianfei Yang, Debin Meng, and Yu Qiao. Region attention networks for pose and occlusion robust facial expression recognition. IEEE Transactions on Image Processing, 29:4057–4069, 2020. 1
[35] Kai Wang, Shuo Wang, Jianfei Yang, Xiaobo Wang, Baigui Sun, Hao Li, and Yang You. Mask aware network for masked face recognition in the wild. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 1456–1461, 2021. 1
[36] Kai Wang, Bo Zhao, Xiangyu Peng, Yang You, et al. Cafe: Learning to condense dataset by aligning features. In CVPR, 2022. 1
[37] Xun Wang, Xintong Han, Weilin Huang, Dengke Dong, and Matthew R Scott. Multi-similarity loss with general pair weighting for deep metric learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5022–5030, 2019. 9
[38] Xiaobo Wang, Shuo Wang, Cheng Chi, Shifeng Zhang, and Tao Mei. Loss function search for face recognition. In International Conference on Machine Learning, pages 10029–10038. PMLR, 2020. 1
[39] Xiaobo Wang, Shuo Wang, Jun Wang, Hailin Shi, and Tao Mei. Co-mining: Deep face recognition with noisy labels. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9358–9367, 2019. 1
[40] Xiaobo Wang, Shuo Wang, Shifeng Zhang, Tianyu Fu, Hailin Shi, and Tao Mei. Support vector guided softmax loss for face recognition. arXiv preprint arXiv:1812.11317, 2018. 3
[41] Xiaobo Wang, Shifeng Zhang, Zhen Lei, Si Liu, Xiaojie Guo, and Stan Z Li. Ensemble soft-margin softmax loss for image classification. arXiv preprint arXiv:1805.03922, 2018. 3
[42] Lior Wolf, Tal Hassner, and Itay Maoz. Face recognition in unconstrained videos with matched background similarity. In CVPR 2011, pages 529–534. IEEE, 2011. 6
[43] Shuo Yang, Ping Luo, Chen-Change Loy, and Xiaoou Tang. Wider face: A face detection benchmark. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5525–5533, 2016. 1
[44] Yang Yang, Shengcai Liao, Zhen Lei, and Stan Z Li. Large scale similarity learning using similar pairs for person verification. In Thirtieth AAAI Conference on Artificial Intelligence, 2016. 3
[45] Dong Yi, Zhen Lei, Shengcai Liao, and Stan Z Li. Learning face representation from scratch. arXiv preprint arXiv:1411.7923, 2014. 1, 2
[46] Xingcheng Zhang, Lei Yang, Junjie Yan, and Dahua Lin. Accelerated training for massive classification via dynamic class selection. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 32, 2018. 1, 2, 3
[47] Tianyue Zheng and Weihong Deng. Cross-pose lfw: A database for studying cross-pose face recognition in unconstrained environments. Beijing University of Posts and Telecommunications, Tech. Rep, 5, 2018. 6
[48] Tianyue Zheng, Weihong Deng, and Jiani Hu. Cross-age lfw: A database for studying cross-age face recognition in unconstrained environments. arXiv preprint arXiv:1708.08197, 2017. 6
[49] Zheng Zhu, Guan Huang, Jiankang Deng, Yun Ye, Junjie Huang, Xinze Chen, Jiagang Zhu, Tian Yang, Jiwen Lu, Dalong Du, et al. Webface260m: A benchmark unveiling the power of million-scale deep face recognition. arXiv preprint arXiv:2103.04098, 2021. 1, 2