Face Recognition Based On MTCNN and FaceNet
Rongrong Jin, Hao Li, Jing Pan, Wenxi Ma, and Jingyu Lin
maximum suppression (NMS) to merge highly overlapped candidates.
Stage 2: All candidates are fed to another CNN, called Refine Network (R-Net), which further rejects a large number of false candidates, performs calibration with bounding box regression, and conducts NMS.
Stage 3: This stage is similar to the second stage, but here we aim to identify face regions with more supervision. In particular, the network outputs the positions of five facial landmarks.

3.1.2. CNN Architectures
We use 3×3 filters rather than 5×5 filters to reduce computation while increasing depth for better performance. With these improvements, compared to the previous architecture in (Li et al. 2015), we obtain better performance with less runtime. The CNN architectures are shown in Figure 2. We apply PReLU (He et al. 2015) as the nonlinear activation function after the convolution and fully connected layers (except the output layers).

3.1.3. Training
We leverage three tasks to train our CNN detectors: face/non-face classification, bounding box regression, and facial landmark localization.
1) Face classification: The learning objective is formulated as a two-class classification problem. For each sample x_i, we use the cross-entropy loss:

L_i^det = -(y_i^det log(p_i) + (1 - y_i^det) log(1 - p_i))    (1)

2) Bounding box regression: The learning objective is formulated as a regression problem, and we employ the Euclidean loss for each sample x_i:

L_i^box = || ŷ_i^box - y_i^box ||_2^2    (2)

where ŷ_i^box is the regression target obtained from the network and y_i^box is the ground-truth coordinate.
3) Facial landmark localization: Similar to the bounding box regression task, facial landmark detection is formulated as a regression problem and we minimize the Euclidean loss:

L_i^landmark = || ŷ_i^landmark - y_i^landmark ||_2^2    (3)

where ŷ_i^landmark is the facial landmark's coordinates obtained from the network and y_i^landmark is the ground-truth coordinate for the i-th sample.
4) Multi-source training: Since we employ different tasks in each CNN, there are different types of training images in the learning process, such as face, non-face, and partially aligned face. In this case, some of the loss functions (i.e., Eq. (1)-(3)) are not used. The overall learning target can be formulated as:

min Σ_{i=1}^{N} Σ_{j ∈ U} α_j β_i^j L_i^j    (4)

where U = {det, box, landmark}, N is the number of training samples, α_j denotes the task importance, and β_i^j ∈ {0, 1} is the sample-type indicator.
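As a concrete illustration, the combined objective of Eq. (4) can be sketched in NumPy. This is a minimal sketch of ours, not the paper's implementation: the function names, the dict-based sample layout, and the example task weights α are all assumptions.

```python
import numpy as np

def cross_entropy(p, y):
    """Face/non-face loss of Eq. (1) for one sample."""
    eps = 1e-12  # guard against log(0)
    return -(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))

def euclidean(pred, target):
    """Squared L2 loss of Eqs. (2)-(3) for one sample."""
    return np.sum((pred - target) ** 2)

def multi_task_loss(samples, alpha):
    """Eq. (4): sum_i sum_j alpha_j * beta_i^j * L_i^j.

    Each sample is a dict holding, per task j in {det, box, landmark},
    an indicator beta (1 if the task applies to that sample type, else 0)
    and the prediction/target pair for the task. A non-face crop, for
    example, contributes only the det term; a partially aligned face
    contributes det and box but not landmark.
    """
    total = 0.0
    for s in samples:
        if s["beta"]["det"]:
            total += alpha["det"] * cross_entropy(s["p"], s["y_det"])
        if s["beta"]["box"]:
            total += alpha["box"] * euclidean(s["box_pred"], s["box_gt"])
        if s["beta"]["landmark"]:
            total += alpha["landmark"] * euclidean(s["lm_pred"], s["lm_gt"])
    return total
```

The indicator β_i^j is what lets heterogeneous training images (face, non-face, partially aligned face) share one mini-batch: tasks that do not apply to a sample simply contribute zero.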
| Group  | CNN    | 300× Forward Propagation | Validation Accuracy |
|--------|--------|--------------------------|---------------------|
| Group1 | 12-Net | 0.043s                   | 93.10%              |
| Group1 | P-Net  | 0.040s                   | 93.70%              |
| Group2 | 24-Net | 0.738s                   | 93.80%              |
| Group2 | R-Net  | 0.466s                   | 94.50%              |
| Group3 | 48-Net | 3.601s                   | 92.10%              |
| Group3 | O-Net  | 1.411s                   | 93.50%              |