When Face Recognition Meets With Deep Learning: An Evaluation of Convolutional Neural Networks For Face Recognition
Guosheng Hu∗♣ , Yongxin Yang∗♦ , Dong Yi♠ , Josef Kittler♣ , William Christmas♣ , Stan Z. Li♠ , Timothy Hospedales♦
Centre for Vision, Speech and Signal Processing, University of Surrey, UK♣
Electronic Engineering and Computer Science, Queen Mary University of London, UK♦
Center for Biometrics and Security Research & National Laboratory of Pattern Recognition, Chinese Academy of Sciences, China♠
{g.hu,j.kittler,w.christmas}@surrey.ac.uk,{yongxin.yang,t.hospedales}@qmul.ac.uk, {szli,dyi}@cbsr.ia.ac.cn
3. Methodology

LFW is the de facto benchmark database for FRUE. Most existing CNNs [25, 21, 19, 22] train their networks on private databases and test the trained models on LFW. In comparison, we train our CNNs only on LFW data to make our work easily reproducible. As a consequence, we cannot directly use the reported CNN architectures [25, 21, 19, 22, 28], since our training data is much less extensive. We introduce three architectures adapted to our training set in subsection 3.1. To further improve the discrimination of CNN-learned features, a metric learning method is usually applied; one such method, the Joint Bayesian model [6], is detailed in subsection 3.2.

3.1. CNN Architectures

How to design a 'good' CNN architecture remains an open problem. Generally, the architecture depends on the size of the training data: less data should drive a smaller network (fewer layers and filters) to avoid overfitting. In this study, the size of our training data is much smaller than that used by the state-of-the-art methods [25, 21, 19, 22, 28]; therefore, smaller architectures are designed.

We propose three CNN architectures adapted to the size of the training data in LFW. The architectures come in three sizes: small (CNN-S), medium (CNN-M) and large (CNN-L). CNN-S and CNN-M have 3 convolutional layers and 2 fully connected layers, with CNN-M having more filters than CNN-S; CNN-L has 4 convolutional layers. The activation function we use is the Rectified Linear Unit (ReLU) [14]. In our experiments, dropout [10] does not improve the performance of our CNNs, so it is not applied to our networks. Following [21, 28], a softmax function is used in the last layer to predict a single class out of K (the number of subjects, in the context of face recognition) mutually exclusive classes. During training, the learning rate is set to 0.001 for all three networks, and the batch size is fixed to 100. Table 2 details the three architectures.

Table 2. Our CNN Architectures

        CNN-S           CNN-M           CNN-L
conv1   12 × 5 × 5      16 × 5 × 5      16 × 3 × 3
        st. 1, pad 0    st. 1, pad 0    st. 1, pad 1
        x2 pool         x2 pool         -
conv2   24 × 4 × 4      32 × 4 × 4      16 × 3 × 3
        st. 1, pad 0    st. 1, pad 0    st. 1, pad 1
        x2 pool         x2 pool         x2 pool
conv3   32 × 3 × 3      48 × 3 × 3      32 × 3 × 3
        st. 2, pad 0    st. 2, pad 0    st. 1, pad 1
        x2 pool         x2 pool         x3 pool, st. 2
conv4   -               -               48 × 3 × 3
                                        st. 1, pad 1
                                        x2 pool
fully connected
        160             160             160
        4000, softmax   4000, softmax   4000, softmax

Each convolutional layer is described by three sub-rows: the first gives the number of filters and the filter size as 'num × size × size'; the second specifies the convolutional stride ('st.') and padding ('pad'); and the third specifies the max-pooling downsampling factor. For the fully connected layers we give their dimensionality: 160 for the feature length and 4000 for the number of classes/subjects. Note that each set of 9 training splits of LFW contains a different number of subjects, but always around 4000.
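For concreteness, the following is a minimal sketch of the CNN-M column of Table 2 in PyTorch. It is a hypothetical reimplementation (the paper trains with MatConvNet [27]); the layer shapes follow the table, while the 58 × 58 RGB input, the batch size of 100 and the roughly 4000-way softmax follow the text. Placing a ReLU after each convolution is our assumption.

```python
# Hypothetical PyTorch sketch of CNN-M (Table 2); not the authors' code.
import torch
import torch.nn as nn

cnn_m = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=5, stride=1, padding=0),   # conv1: 16 filters, 5x5
    nn.ReLU(),
    nn.MaxPool2d(2),                                        # x2 pool: 54 -> 27
    nn.Conv2d(16, 32, kernel_size=4, stride=1, padding=0),  # conv2: 32 filters, 4x4
    nn.ReLU(),
    nn.MaxPool2d(2),                                        # x2 pool: 24 -> 12
    nn.Conv2d(32, 48, kernel_size=3, stride=2, padding=0),  # conv3: 48 filters, 3x3, st. 2
    nn.ReLU(),
    nn.MaxPool2d(2),                                        # x2 pool: 5 -> 2
    nn.Flatten(),
    nn.Linear(48 * 2 * 2, 160),   # 160-d face feature
    nn.Linear(160, 4000),         # ~4000 subjects; softmax applied inside the loss
)

x = torch.randn(100, 3, 58, 58)  # one batch of 100 images, as in the text
loss = nn.CrossEntropyLoss()(cnn_m(x), torch.randint(0, 4000, (100,)))
# Trained with learning rate 0.001 per the text.
```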
3.2. Metric Learning

Metric Learning (MeL), which aims to find a new metric that makes two classes more separable, is often used for face verification. MeL is independent of the feature extraction process, and any feature (hand-crafted or learned) can be fed into a MeL method. The Joint Bayesian (JB) model [6] is a well-known MeL method, and it is the one most widely applied to features learned by CNNs [21, 19, 28].

JB casts the face verification task as a Bayesian decision problem. Let H_I and H_E represent the intra-personal (matched) and extra-personal (unmatched) hypotheses, respectively. Based on the MAP (Maximum a Posteriori) rule, the decision is made by:

    r(x_1, x_2) = \log \frac{P(x_1, x_2 \mid H_I)}{P(x_1, x_2 \mid H_E)}    (1)

where x_1 and x_2 are the features of one face pair. It is assumed that P(x_1, x_2 | H_I) and P(x_1, x_2 | H_E) have Gaussian distributions N(0, S_I) and N(0, S_E), respectively.

Before discussing how S_I and S_E are computed, we first explain the distribution of a face feature. A face x is modelled as the sum of two independent Gaussian variables (identity μ and intra-personal variation ε):

    x = \mu + \varepsilon    (2)

μ and ε follow the Gaussian distributions N(0, S_μ) and N(0, S_ε), respectively; S_μ and S_ε are two unknown covariance matrices, regarded as the face prior. For the case of two faces, the joint distribution of {x_1, x_2} is also assumed to be Gaussian with zero mean. Based on Eq. (2), the covariance of two faces is:

    \operatorname{cov}(x_1, x_2) = \operatorname{cov}(\mu_1, \mu_2) + \operatorname{cov}(\varepsilon_1, \varepsilon_2)    (3)

Then S_I and S_E can be derived as:

    S_I = \begin{bmatrix} S_\mu + S_\varepsilon & S_\mu \\ S_\mu & S_\mu + S_\varepsilon \end{bmatrix}    (4)

and

    S_E = \begin{bmatrix} S_\mu + S_\varepsilon & 0 \\ 0 & S_\mu + S_\varepsilon \end{bmatrix}    (5)

Clearly, r(x_1, x_2) in Eq. (1) depends only on S_μ and S_ε, which are learned from data using an EM algorithm [6].
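As a companion to Eqs. (1), (4) and (5), here is a small NumPy sketch of the verification score, evaluating the two zero-mean Gaussian hypotheses directly. It assumes S_mu and S_eps have already been learned (in the paper, by the EM algorithm of [6]); the function name is ours.

```python
# Sketch of the JB score r(x1, x2) of Eq. (1). S_mu and S_eps are the
# learned covariance priors of Eq. (2); both must be PSD d x d matrices.
import numpy as np
from scipy.stats import multivariate_normal

def jb_score(x1, x2, S_mu, S_eps):
    """Return log P([x1; x2] | H_I) - log P([x1; x2] | H_E)."""
    d = x1.shape[0]
    x = np.concatenate([x1, x2])
    zero = np.zeros((d, d))
    S_I = np.block([[S_mu + S_eps, S_mu],            # Eq. (4): shared identity
                    [S_mu,         S_mu + S_eps]])
    S_E = np.block([[S_mu + S_eps, zero],            # Eq. (5): independent identities
                    [zero,         S_mu + S_eps]])
    mean = np.zeros(2 * d)
    return (multivariate_normal.logpdf(x, mean, S_I)
            - multivariate_normal.logpdf(x, mean, S_E))
```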
4. Evaluation

LFW contains 5,749 subjects and 13,233 images, and the training and test sets are defined in [12]. For evaluation, LFW is divided into 10 predefined splits for 10-fold cross validation. Each time, nine of them are used for model training and the remaining one (600 image pairs) for testing. LFW defines three standard protocols (unsupervised, restricted and unrestricted) for evaluating face recognition performance. The 'unrestricted' protocol is applied here because our system uses the information of both subject identities and matched/unmatched labels. The face recognition rate is reported as the mean classification accuracy and the standard error of the mean.
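For reference, the reported statistic is computed from the 10 per-fold accuracies as below; this is a sketch, and the accuracy values are placeholders, not results from the paper.

```python
# Mean classification accuracy and standard error of the mean (SEM)
# over the 10 LFW folds. The fold accuracies below are placeholders.
import numpy as np

fold_accs = np.array([0.79, 0.78, 0.80, 0.78, 0.79,
                      0.79, 0.78, 0.80, 0.79, 0.78])
mean_acc = fold_accs.mean()
sem = fold_accs.std(ddof=1) / np.sqrt(len(fold_accs))  # SEM = s / sqrt(n)
print(f"{mean_acc:.4f} +/- {sem:.4f}")
```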
The images we use are aligned by deep funneling [11]. Each image is cropped to 58 × 58 based on the coordinates of the two eye centres; some sample crops are visualised in Fig. 1. It is commonly believed that data augmentation can boost the generalisation capacity of a neural network; therefore, each image is horizontally flipped. The mean of the images is subtracted before network training. The open source implementation MatConvNet [27] is used to train our CNNs.

Figure 1. Cropped sample images in LFW (matched and unmatched pairs).
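A minimal sketch of this preprocessing, assuming `images` is an (N, 58, 58, 3) float array of aligned, cropped faces; the helper name is ours.

```python
# Horizontal-flip augmentation and mean-image subtraction, as described
# above. `images`: (N, 58, 58, 3) float array of aligned 58x58 crops.
import numpy as np

def preprocess(images):
    flipped = images[:, :, ::-1, :]                # flip along the width axis
    augmented = np.concatenate([images, flipped])  # doubles the training set
    mean_image = augmented.mean(axis=0)            # per-pixel mean image
    return augmented - mean_image                  # zero-centred network inputs
```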
In this section, the different components of our CNN-based face recognition system are evaluated and analysed.

Architectures  Choosing a 'good' architecture is crucial for CNN training. A network that is too large or too small relative to the training data leads to overfitting or underfitting, in which case the network may not converge at all during training. In this comparison, the RGB colour images are fed into the CNNs and feature distance is measured by the cosine distance. The performances of the three architectures are compared in Table 3. CNN-M achieves the best face recognition performance, indicating that CNN-M generalises best among the three architectures when using only LFW data. From this point on, all other evaluations are conducted with CNN-M, and its face recognition rate of 0.7882 is taken as the baseline against which all the remaining investigations are compared.

Table 3. Comparisons of Our Three Architectures

Model   Accuracy
CNN-S   0.7828 ± 0.0046
CNN-M   0.7882 ± 0.0037
CNN-L   0.7807 ± 0.0035
Feature Distance  The existing research offers little discussion of the distance measure used to compare CNN-learned features; in particular, it is interesting to know whether the euclidean or the cosine distance works better for them. Table 4 compares the two.

Table 4. Distance Comparison

Distance    Accuracy
euclidean   0.6898 ± 0.0092
cosine      0.7882 ± 0.0037
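For clarity, the two measures compared in Table 4 are reproduced below (hypothetical helper names), applied to a pair of 160-d CNN feature vectors.

```python
# The two distances compared in Table 4.
import numpy as np

def euclidean_distance(x1, x2):
    return np.linalg.norm(x1 - x2)

def cosine_distance(x1, x2):
    # 1 - cosine similarity; invariant to the overall feature magnitude
    return 1.0 - np.dot(x1, x2) / (np.linalg.norm(x1) * np.linalg.norm(x2))
```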
Network Fusion  A face can be represented by features extracted from crops of different facial regions and scales, and the size of these images can be different, as shown in Table 1. In [21], 120 networks are trained separately for this kind of fusion. However, it is not very clear how much this fusion improves face recognition performance. To clarify this issue, we implement network fusion ourselves.

We extract d × d crops from the four corners and the centre of each image and then upsample them to the original image size, 58 × 58. The crops have 6 different scales: d = floor(58 × {0.3, 0.4, 0.5, 0.6, 0.7, 0.8}), where floor takes the integer part. We therefore obtain 30 local patches of size 58 × 58 from one original image; Figure 3 shows these 30 crops, and a sketch of their generation follows the figure caption.

Figure 3. Sample crops in LFW. Rows correspond to the 5 regions (4 corners and centre); columns correspond to the 6 scales.
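Below is a sketch of that crop generation (hypothetical helper). Nearest-neighbour upsampling is our assumption; the paper does not state which interpolation is used.

```python
# 30 crops per image: 5 regions (4 corners + centre) x 6 scales, each
# upsampled back to 58 x 58. `img`: (58, 58, 3) array.
import numpy as np

def make_crops(img, scales=(0.3, 0.4, 0.5, 0.6, 0.7, 0.8)):
    size = img.shape[0]                       # 58
    crops = []
    for s in scales:
        d = int(np.floor(size * s))           # d = floor(58 * s)
        offsets = [(0, 0), (0, size - d), (size - d, 0),
                   (size - d, size - d), ((size - d) // 2, (size - d) // 2)]
        for r, c in offsets:
            patch = img[r:r + d, c:c + d]
            idx = (np.arange(size) * d) // size    # nearest-neighbour upsampling
            crops.append(patch[idx][:, idx])       # back to 58 x 58
    return crops                              # 6 scales x 5 regions = 30 patches
```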
generated in addition to those provided by LFW to train the
work fusion, we separately train 30 different networks us-
model, while we do not generate new pairs.
ing these crops. Then one face image can be represented by
concatenating the features learned from different networks. Table 7. Comparison with state-of-the-art methods on LFW under
Table 6 compares the performance of single network and ‘unrestricted, label-free outside data’
network fusion. Note that we choose 16 best networks of methods accuracy
30 ones for the fusion. It is clear that network fusion works LBP multishot [24] 0.8517 ± 0.0061
much better than a single network. Specifically, the fusion LDML-MkNN [8] 0.8750 ± 0.0040
of 16 best networks improves the face recognition accuracy LBP+PLDA [15] 0.8733 ± 0.0055
of single network by 4.51%. Clearly, the face representa- high-dim LBP [7] 0.9318 ± 0.0107
tion of network fusion is actually the fusion of features of Fisher vector faces [17] 0.9303 ± 0.0105
different facial componets and scales. Similar ideas have ConvNet+RBM [20] 0.9175 ± 0.0048
widely been used to improve the facial representation ca- Network fusion +JB 0.8763 ± 0.0064
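A sketch of the fuse-reduce-score pipeline just described, reusing the `jb_score` function from the subsection 3.2 sketch. The feature arrays and the identity-matrix priors are placeholders (in the paper, S_μ and S_ε are learned by EM).

```python
# Concatenate 16 networks' 160-d features (2560-d), reduce to 320-d with
# PCA, then score a face pair with JB. Placeholder data throughout.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
per_net_feats = [rng.standard_normal((400, 160)) for _ in range(16)]  # 16 nets
fused = np.concatenate(per_net_feats, axis=1)    # (400, 2560) fused features

pca = PCA(n_components=320).fit(fused)           # fit on the training folds
reduced = pca.transform(fused)                   # (400, 320) inputs to JB

S_mu, S_eps = np.eye(320), np.eye(320)           # placeholders; learned by EM [6]
score = jb_score(reduced[0], reduced[1], S_mu, S_eps)  # see the Sec. 3.2 sketch
```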
Table 7 compares our method with non-commercial state-of-the-art methods. The performance of our method is slightly better than [24, 8, 15] but worse than [7, 17, 20]. However, the feature dimensionality of [7, 17] is much higher than ours, and in [20] a large number of new pairs are generated, in addition to those provided by LFW, to train the model, whereas we do not generate new pairs.

Table 7. Comparison with state-of-the-art methods on LFW under 'unrestricted, label-free outside data'

methods                       accuracy
LBP multishot [24]            0.8517 ± 0.0061
LDML-MkNN [8]                 0.8750 ± 0.0040
LBP+PLDA [15]                 0.8733 ± 0.0055
high-dim LBP [7]              0.9318 ± 0.0107
Fisher vector faces [17]      0.9303 ± 0.0105
ConvNet+RBM [20]              0.9175 ± 0.0048
Network fusion + JB (ours)    0.8763 ± 0.0064

5. Conclusions

Recently, convolutional neural networks have attracted a lot of attention in the field of face recognition. In this work, we present a rigorous empirical evaluation of CNN-based face recognition systems. Specifically, we quantitatively evaluate, on common ground, the impact of different CNN architectures and implementation choices on face recognition performance.
We have shown that network fusion can significantly improve face recognition performance, because the different networks capture information from different regions and scales and together form a powerful face representation. In addition, metric learning such as the Joint Bayesian method improves face recognition greatly.

Since network fusion and metric learning are the two most important factors affecting CNN performance, they will be the subject of future investigation.

Acknowledgements  This work is supported by the European Union's Horizon 2020 research and innovation programme under grant agreement No 640891, the EPSRC/dstl project 'Signal processing in a networked battlespace' under contract EP/K014307/1, the EPSRC Programme Grant 'S3A: Future Spatial Audio for Immersive Listener Experiences at Home' under contract EP/L000539, and the European Union project BEAT. We also gratefully acknowledge the support of NVIDIA Corporation with the donation of the GPUs used for this research.
References

[1] T. Ahonen, A. Hadid, and M. Pietikäinen. Face recognition with local binary patterns. In Computer Vision – ECCV 2004, pages 469–481. Springer, 2004.
[2] T. Ahonen, E. Rahtu, V. Ojansivu, and J. Heikkilä. Recognition of blurred faces using local phase quantization. In Pattern Recognition, 2008. ICPR 2008. 19th International Conference on, pages 1–4. IEEE, 2008.
[3] C. H. Chan, M. A. Tahir, J. Kittler, and M. Pietikäinen. Multiscale local phase quantization for robust component-based face recognition using kernel fusion of multiple descriptors. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 35(5):1164–1177, 2013.
[4] C. H. Chan, M. A. Tahir, J. Kittler, and M. Pietikäinen. Multiscale local phase quantization for robust component-based face recognition using kernel fusion of multiple descriptors. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 35(5):1164–1177, 2013.
[5] K. Chatfield, K. Simonyan, A. Vedaldi, and A. Zisserman. Return of the devil in the details: Delving deep into convolutional nets. arXiv preprint arXiv:1405.3531, 2014.
[6] D. Chen, X. Cao, L. Wang, F. Wen, and J. Sun. Bayesian face revisited: A joint formulation. In Computer Vision – ECCV 2012, pages 566–579. Springer, 2012.
[7] D. Chen, X. Cao, F. Wen, and J. Sun. Blessing of dimensionality: High-dimensional feature and its efficient compression for face verification. In Computer Vision and Pattern Recognition (CVPR), 2013 IEEE Conference on, pages 3025–3032. IEEE, 2013.
[8] M. Guillaumin, J. Verbeek, and C. Schmid. Is that you? Metric learning approaches for face identification. In Computer Vision, 2009 IEEE 12th International Conference on, pages 498–505. IEEE, 2009.
[9] K. He, X. Zhang, S. Ren, and J. Sun. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. arXiv preprint arXiv:1502.01852, 2015.
[10] G. E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, and R. R. Salakhutdinov. Improving neural networks by preventing co-adaptation of feature detectors. arXiv preprint arXiv:1207.0580, 2012.
[11] G. Huang, M. Mattar, H. Lee, and E. G. Learned-Miller. Learning to align from scratch. In Advances in Neural Information Processing Systems, pages 764–772, 2012.
[12] G. B. Huang, M. Ramesh, T. Berg, and E. Learned-Miller. Labeled faces in the wild: A database for studying face recognition in unconstrained environments. Technical Report 07-49, University of Massachusetts, Amherst, October 2007.
[13] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.
[14] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105, 2012.
[15] P. Li, Y. Fu, U. Mohammed, J. H. Elder, and S. J. Prince. Probabilistic models for inference about identity. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 34(1):144–157, 2012.
[16] S. Liao, X. Zhu, Z. Lei, L. Zhang, and S. Z. Li. Learning multi-scale block local binary patterns for face recognition. In Advances in Biometrics, pages 828–837. Springer, 2007.
[17] K. Simonyan, O. M. Parkhi, A. Vedaldi, and A. Zisserman. Fisher vector faces in the wild. In British Machine Vision Conference, 2013.
[18] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
[19] Y. Sun, Y. Chen, X. Wang, and X. Tang. Deep learning face representation by joint identification-verification. In Advances in Neural Information Processing Systems, pages 1988–1996, 2014.
[20] Y. Sun, X. Wang, and X. Tang. Hybrid deep learning for face verification. In Computer Vision (ICCV), 2013 IEEE International Conference on, pages 1489–1496. IEEE, 2013.
[21] Y. Sun, X. Wang, and X. Tang. Deep learning face representation from predicting 10,000 classes. In Computer Vision and Pattern Recognition (CVPR), 2014 IEEE Conference on, pages 1891–1898. IEEE, 2014.
[22] Y. Sun, X. Wang, and X. Tang. Deeply learned face representations are sparse, selective, and robust. arXiv preprint arXiv:1412.1265, 2014.
[23] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. arXiv preprint arXiv:1409.4842, 2014.
[24] Y. Taigman, L. Wolf, T. Hassner, et al. Multiple one-shots for utilizing class label information. In BMVC, pages 1–12, 2009.
[25] Y. Taigman, M. Yang, M. Ranzato, and L. Wolf. DeepFace: Closing the gap to human-level performance in face verification. In Computer Vision and Pattern Recognition (CVPR), 2014 IEEE Conference on, pages 1701–1708. IEEE, 2014.
[26] M. A. Turk and A. P. Pentland. Face recognition using eigenfaces. In Computer Vision and Pattern Recognition, 1991. Proceedings CVPR '91, IEEE Computer Society Conference on, pages 586–591. IEEE, 1991.
[27] A. Vedaldi and K. Lenc. MatConvNet – convolutional neural networks for MATLAB. CoRR, abs/1412.4564, 2014.
[28] D. Yi, Z. Lei, S. Liao, and S. Z. Li. Learning face representation from scratch. arXiv preprint arXiv:1411.7923, 2014.
[29] E. Zhou, Z. Cao, and Q. Yin. Naive-deep face recognition: Touching the limit of LFW benchmark or not? arXiv preprint arXiv:1501.04690, 2015.