
When Face Recognition Meets with Deep Learning: an Evaluation of Convolutional Neural Networks for Face Recognition

Guosheng Hu∗♣, Yongxin Yang∗♦, Dong Yi♠, Josef Kittler♣, William Christmas♣, Stan Z. Li♠, Timothy Hospedales♦

♣ Centre for Vision, Speech and Signal Processing, University of Surrey, UK
♦ Electronic Engineering and Computer Science, Queen Mary University of London, UK
♠ Center for Biometrics and Security Research & National Laboratory of Pattern Recognition, Chinese Academy of Sciences, China

{g.hu,j.kittler,w.christmas}@surrey.ac.uk, {yongxin.yang,t.hospedales}@qmul.ac.uk, {szli,dyi}@cbsr.ia.ac.cn

∗ These authors contributed equally to this work.

arXiv:1504.02351v1 [cs.CV] 9 Apr 2015

Abstract

Deep learning, in particular the Convolutional Neural Network (CNN), has recently achieved promising results in face recognition. However, it remains an open question why CNNs work well and how to design a ‘good’ architecture. The existing works tend to report CNN architectures that work well for face recognition rather than investigate the reason. In this work, we conduct an extensive evaluation of CNN-based face recognition systems (CNN-FRS) on a common ground to make our work easily reproducible. Specifically, we use the public database LFW (Labeled Faces in the Wild) to train CNNs, unlike most existing CNNs trained on private databases. We propose three CNN architectures which are the first reported architectures trained using LFW data. This paper quantitatively compares the architectures of CNNs and evaluates the effect of different implementation choices. We identify several useful properties of CNN-FRS. For instance, the dimensionality of the learned features can be significantly reduced without adverse effect on face recognition accuracy. In addition, a traditional metric learning method exploiting CNN-learned features is evaluated. Experiments show that two crucial factors for good CNN-FRS performance are the fusion of multiple CNNs and metric learning. To make our work reproducible, source code and models will be made publicly available.

1. Introduction

The conventional face recognition pipeline consists of four stages: face detection, face alignment, feature extraction (or face representation) and classification. Perhaps the single most important stage is feature extraction. In constrained environments, hand-crafted features such as Local Binary Patterns (LBP) [1] and Local Phase Quantisation (LPQ) [2, 3] have achieved respectable face recognition performance. However, the performance of these features degrades dramatically in unconstrained environments, where face images cover complex and large intra-personal variations such as pose, illumination, expression and occlusion. It remains an open problem to find an ideal facial feature which is robust for face recognition in unconstrained environments (FRUE). In the last three years, the convolutional neural network (CNN), rebranded as ‘deep learning’, has achieved very impressive results on FRUE. Unlike traditional hand-crafted features, CNN-learned features are more robust to complex intra-personal variations. More notably, the top three face recognition rates reported on the FRUE benchmark database LFW (Labeled Faces in the Wild) [12] have been achieved by CNN methods [29, 22, 19]. The success of the latest CNNs on FRUE and the more general object recognition task [14, 9, 13] stems from the following facts: (1) much larger labeled training sets are available; (2) GPU implementations greatly reduce the time needed to train a large CNN; and (3) CNNs greatly improve the generalisation capacity of the model by introducing effective regularisation strategies, such as dropout [10].

Despite the promising performance achieved by CNNs, it remains unclear how to design a ‘good’ CNN architecture for a specific classification task, due to the lack of theoretical guidance. However, some insights into CNN design can be gained by experimental comparisons of different CNN architectures. The work [5] made such comparisons and a comprehensive analysis for the task of object recognition. However, face recognition is very different from object recognition. Specifically, faces are aligned via a 2D similarity transformation or 3D pose correction to a fixed reference position in images before feature extraction, while object recognition usually does not conduct such alignment, and therefore objects appear in arbitrary positions. As a result, the CNN architectures used for face recognition [21, 19, 22, 29, 25] are rather different from those for object recognition [14, 18, 23, 9].
For the task of face recognition, it is important to make a systematic evaluation of the effect of different CNN design and implementation choices. In addition, the published CNNs [21, 29, 25, 28] are trained on different face databases, most of which are not publicly available. The difference in training sets might result in unfair comparisons of CNN architectures. To avoid this unfairness, the comparison of different CNNs should be conducted on a common ground.

To clarify the contributions of the different components of CNN-based face recognition systems, in this paper a systematic evaluation is conducted. To make our work reproducible, all the networks evaluated are trained on the publicly available LFW database. Specifically, our contributions are as follows:

• Different CNN architectures, varying in the number of filters and layers, are compared. In addition, we evaluate the impact of the multiple-network fusion introduced by [21].

• Various implementation choices, such as data augmentation, pixel value type (colour or grey) and similarity measure, are evaluated.

• We quantitatively analyse how downstream metric learning methods such as joint Bayesian [6] can boost the effectiveness of the CNN-learned features.

• Finally, source code for our CNN architectures and trained networks will be made publicly available (the training data is already public). This provides an extremely competitive baseline for face recognition to the community. To our knowledge, we are the first to publish fully reproducible CNNs for face recognition.

2. Related Work

CNN methods have drawn considerable attention in the field of face recognition in recent years. In particular, CNNs have achieved impressive results on FRUE. In this section, we briefly review these CNNs.

The researchers in the Facebook AI group trained an 8-layer CNN named DeepFace [25]. The first three layers are conventional convolution-pooling-convolution layers. The subsequent three layers are locally connected, followed by 2 fully connected layers. Pooling layers make learned features robust to local transformations but result in missing local texture details. Pooling layers are important for object recognition since the objects in images are not well aligned. However, face images are well aligned before training a CNN. It is claimed in [25] that one pooling layer is a good balance between local transformation robustness and preserving texture details. DeepFace is trained on the largest face database to date, which contains four million facial images of 4,000 subjects. Another contribution of [25] is the 3D face alignment. Traditionally, face images are aligned using a 2D similarity transformation before they are fed into CNNs. However, this 2D alignment cannot handle out-of-plane rotations. To overcome this limitation, [25] proposes a 3D alignment method using an affine camera model.

In [21], a CNN-based face representation, referred to as the Deep hidden IDentity feature (DeepID), is proposed. Unlike DeepFace, whose features are learned by one single big CNN, DeepID is learned by training a collection of small CNNs (network fusion). The input of one single CNN is crops/patches of facial images, and the features learned by all CNNs are concatenated to form a powerful feature. Both RGB and grey crops extracted around facial points are used to train the DeepID. The length of DeepID is 2 (RGB and grey images) × 60 (crops) × 160 (feature length of one network) = 19,200. One small network consists of 4 convolutional layers, 3 max pooling layers and 2 fully connected layers, as shown in Table 1. DeepID uses identification information only to supervise the CNN training. In comparison, DeepID2 [19], an extension of DeepID, uses both identification and verification information to train a CNN, aiming to maximise the inter-class difference but minimise the intra-class variations. To further improve the performance of DeepID and DeepID2, DeepID2+ [22] is proposed. DeepID2+ adds the supervision information to all the convolutional layers rather than only the topmost layers like DeepID and DeepID2. In addition, DeepID2+ increases the number of filters in each layer and uses a much bigger training set than DeepID and DeepID2. In [22], it is also discovered that DeepID2+ has three interesting properties: being sparse, selective and robust.

The work [28] proposes another face recognition pipeline, referred to as WebFace, which also learns the face representation using a CNN. WebFace collects a database which contains around 10,000 subjects and 500,000 images and makes this database publicly available. Motivated by the very deep architectures of [18, 23], WebFace trains a much deeper CNN than those [21, 19, 22, 25] used for face recognition, as shown in Table 1. Specifically, WebFace trains a 17-layer CNN which includes 10 convolutional layers, 5 pooling layers and 2 fully connected layers, detailed in Table 1. Note that the use of very small convolutional filters (3×3), which avoids too much loss of texture information along a very deep architecture, is crucial to learning a powerful feature. In addition, WebFace stacks two 3×3 convolutional layers (without pooling in between), which is as effective as a 5×5 convolutional layer but has fewer parameters (a worked count is given at the end of this section).

Table 1 compares three typical CNNs (DeepFace [25], DeepID [21], WebFace [28]). It is clear that their architectures and implementation choices are rather different, which motivates our work. In this study, we make systematic evaluations to clarify the contributions of the different components on a common ground.
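As an aside on the parameter count: for a layer with C input and C output channels, two stacked 3×3 convolutions use 2 × (3 × 3 × C × C) = 18C² weights, whereas a single 5×5 convolution covering the same 5×5 receptive field uses 25C² weights, roughly 28% more (bias terms omitted).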
Table 1. Comparisons of 3 Published CNNs

DeepFace [25]
  Input image¹: 152×152×3
  Architecture²,³: C1:32×11×11, M2, C3:16×9×9, L4:16×9×9, L5:16×7×7, L6:16×5×5, F7, F8
  No. of parameters: 120M+
  Patch fusion: No
  Feature length: 4096
  Training set: 4M+ images, 4K+ subjects

DeepID [21]
  Input image¹: 39×31×{3,1} and 31×31×{3,1}
  Architecture²,³: C1:20×4×4, M2, C3:40×3×3, M4, C5:60×3×3, M6, C7:80×2×2, F8, F9
  No. of parameters: 101M+
  Patch fusion: Yes
  Feature length: 19200
  Training set: 202K+ images, 10K+ subjects

WebFace [28]
  Input image¹: 100×100×1
  Architecture²,³: C1:32×3×3, C2:64×3×3, M3, C4:64×3×3, C5:128×3×3, M6, C7:96×3×3, C8:192×3×3, M9, C10:128×3×3, C11:256×3×3, M12, C13:160×3×3, C14:320×3×3, A15, F16, F17
  No. of parameters: 5M+
  Patch fusion: No
  Feature length: 320
  Training set: 986K+ images, 10K subjects

¹ The input image is represented as width×height×channels; 1 and 3 channels mean grey and RGB images respectively.
² The capital letters C, M, L, A, F represent convolutional, max pooling, locally connected, average pooling and fully connected layers respectively. These capital letters are followed by the indices of the CNN layers.
³ The number of filters and the filter size are denoted as ‘num × size × size’.

3. Methodology

LFW is the de facto benchmark database for FRUE. Most existing CNNs [25, 21, 19, 22] train their networks on private databases and test the trained models on LFW. In comparison, we train our CNNs only using LFW data to make our work easily reproducible. Consequently, we cannot directly use the reported CNN architectures [25, 21, 19, 22, 28], since our training data is much less extensive. We introduce three architectures adapted to our training set in subsection 3.1. To further improve the discriminative power of CNN-learned features, a metric learning method is usually used. One metric learning method, the Joint Bayesian model [6], is detailed in subsection 3.2.

3.1. CNN Architectures

How to design a ‘good’ CNN architecture remains an open problem. Generally, the architecture depends on the size of the training data. Less data should drive a smaller network (fewer layers and filters) to avoid overfitting. In this study, the size of our training data is much smaller than that used by the state-of-the-art methods [25, 21, 19, 22, 28]; therefore, smaller architectures are designed.

We propose three CNN architectures adapted to the size of the training data in LFW. These architectures are of three different sizes: small (CNN-S), medium (CNN-M), and large (CNN-L). CNN-S and CNN-M have 3 convolutional layers and two fully connected layers, while CNN-M has more filters than CNN-S. Compared with CNN-S and CNN-M, CNN-L has 4 convolutional layers. The activation function we use is the Rectified Linear Unit (ReLU) [14]. In our experiments, dropout [10] does not improve the performance of our CNNs; therefore, it is not applied to our networks. Following [21, 28], the softmax function is used in the last layer for predicting a single class out of K (the number of subjects in the context of face recognition) mutually exclusive classes. During training, the learning rate is set to 0.001 for the three networks, and the batch size is fixed to 100. Table 2 details these three architectures.

Table 2. Our CNN Architectures

             CNN-S           CNN-M           CNN-L
  conv1      12×5×5          16×5×5          16×3×3
             st. 1, pad 0    st. 1, pad 0    st. 1, pad 1
             x2 pool         x2 pool         -
  conv2      24×4×4          32×4×4          16×3×3
             st. 1, pad 0    st. 1, pad 0    st. 1, pad 1
             x2 pool         x2 pool         x2 pool
  conv3      32×3×3          48×3×3          32×3×3
             st. 2, pad 0    st. 2, pad 0    st. 1, pad 1
             x2 pool         x2 pool         x3 pool, st. 2
  conv4      -               -               48×3×3
                                             st. 1, pad 1
                                             x2 pool
  fully      160             160             160
  connected  4000, softmax   4000, softmax   4000, softmax

Each convolutional layer is detailed in 3 sub-rows: the 1st indicates the number of filters and the filter size as ‘num × size × size’; the 2nd specifies the convolutional stride (‘st.’) and padding (‘pad’); and the 3rd specifies the max-pooling downsampling factor. For the fully connected layers, we specify their dimensionality: 160 for the feature length and 4000 for the number of classes/subjects. Note that each of the 9 training splits of LFW has a different number of subjects, but all are around 4000.
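For concreteness, the following is a minimal PyTorch sketch of CNN-M as specified in Table 2 (the paper itself uses MatConvNet). The placement of ReLU after each convolution and the floor-mode 2×2 max pooling are assumptions where the table leaves details open; layer shapes follow the table for a 58×58 RGB input.

    import torch
    import torch.nn as nn

    # A sketch of CNN-M from Table 2 (assumed details: ReLU after each conv,
    # 2x2 max pooling with stride 2). Input: 3x58x58 RGB crops.
    cnn_m = nn.Sequential(
        nn.Conv2d(3, 16, kernel_size=5, stride=1, padding=0),   # conv1 -> 16x54x54
        nn.ReLU(),
        nn.MaxPool2d(2),                                        # -> 16x27x27
        nn.Conv2d(16, 32, kernel_size=4, stride=1, padding=0),  # conv2 -> 32x24x24
        nn.ReLU(),
        nn.MaxPool2d(2),                                        # -> 32x12x12
        nn.Conv2d(32, 48, kernel_size=3, stride=2, padding=0),  # conv3 -> 48x5x5
        nn.ReLU(),
        nn.MaxPool2d(2),                                        # -> 48x2x2
        nn.Flatten(),
        nn.Linear(48 * 2 * 2, 160),                             # 160-D face feature
        nn.Linear(160, 4000),                                   # logits over ~4000 subjects
    )

    logits = cnn_m(torch.randn(100, 3, 58, 58))  # batch size 100, as in the paper
    print(logits.shape)                          # torch.Size([100, 4000])

At training time the softmax is evaluated inside the cross-entropy loss; at test time the 160-dimensional penultimate output is taken as the face representation.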
3.2. Metric Learning

Metric Learning (MeL), which aims to find a new metric that makes two classes more separable, is often used for face verification. MeL is independent of the feature extraction process, and any feature (hand-crafted or learning-based) can be fed into a MeL method. The Joint Bayesian (JB) model [6] is a well-known MeL method and is the one most widely applied to features learned by CNNs [21, 19, 28].

JB models the face verification task as a Bayesian decision problem. Let H_I and H_E represent the intra-personal (matched) and extra-personal (unmatched) hypotheses, respectively. Based on the MAP (Maximum a Posteriori) rule, the decision is made by:

    r(x_1, x_2) = \log \frac{P(x_1, x_2 | H_I)}{P(x_1, x_2 | H_E)}    (1)

where x_1 and x_2 are the features of one face pair. It is assumed that P(x_1, x_2 | H_I) and P(x_1, x_2 | H_E) have Gaussian distributions N(0, S_I) and N(0, S_E), respectively.

Before discussing the way of computing S_I and S_E, we first explain the distribution of a face feature. A face x is modelled by the sum of two independent Gaussian variables (identity μ and intra-personal variations ε):

    x = \mu + \varepsilon    (2)

μ and ε follow the Gaussian distributions N(0, S_μ) and N(0, S_ε), respectively. S_μ and S_ε are two unknown covariance matrices and they are regarded as the face prior. For the case of two faces, the joint distribution of {x_1, x_2} is also assumed to be Gaussian with zero mean. Based on Eq. (2), the covariance of two faces is:

    \mathrm{cov}(x_1, x_2) = \mathrm{cov}(\mu_1, \mu_2) + \mathrm{cov}(\varepsilon_1, \varepsilon_2)    (3)

Then S_I and S_E can be derived as:

    S_I = \begin{pmatrix} S_\mu + S_\varepsilon & S_\mu \\ S_\mu & S_\mu + S_\varepsilon \end{pmatrix}    (4)

and

    S_E = \begin{pmatrix} S_\mu + S_\varepsilon & 0 \\ 0 & S_\mu + S_\varepsilon \end{pmatrix}    (5)

Clearly, r(x_1, x_2) in Eq. (1) only depends on S_μ and S_ε, which are learned from data using an EM algorithm [6].

4. Evaluation

LFW contains 5,749 subjects and 13,233 images, and the training and test sets are defined in [12]. For evaluation, LFW is divided into 10 predefined splits for 10-fold cross validation. Each time, nine of them are used for model training and the remaining one (600 image pairs) for testing. LFW defines three standard protocols (unsupervised, restricted and unrestricted) to evaluate face recognition performance. The ‘unrestricted’ protocol is applied here because the information of both subject identities and matched/unmatched labels is used in our system. The face recognition rate is evaluated by the mean classification accuracy and the standard error of the mean.

The images we use are aligned by deep funneling [11]. Each image is cropped to 58×58 based on the coordinates of the two eye centers. Some sample crops are visualised in Fig. 1. It is commonly believed that data augmentation can boost the generalisation capacity of a neural network; therefore, each image is horizontally flipped. The mean of the images is subtracted before network training. The open source implementation MatConvNet [27] is used to train our CNNs. In this section, the different components of our CNN-based face recognition system are evaluated and analysed.

[Figure 1. Cropped sample images in LFW: unmatched and matched pairs.]

Architectures

Choosing a ‘good’ architecture is crucial for CNN training. Overlarge or extremely small networks relative to the training data can lead to overfitting or underfitting, in which case the network does not converge at all during training. In this comparison, the RGB colour images are fed into the CNNs and feature distance is measured by cosine distance. The performances of the three architectures are compared in Table 3. CNN-M achieves the best face recognition performance, indicating that CNN-M generalises best among these three architectures using only LFW data. From this point, all the other evaluations are conducted using CNN-M. The face recognition rate 0.7882 of CNN-M is considered as the baseline, and all the remaining investigations will be compared with it.

Table 3. Comparisons of Our Three Architectures

Model   Accuracy
CNN-S   0.7828±0.0046
CNN-M   0.7882±0.0037
CNN-L   0.7807±0.0035
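The decision rule in Eqs. (1), (4) and (5) can be evaluated directly once S_μ and S_ε are known. Below is a minimal NumPy/SciPy sketch of the verification score; the EM estimation of S_μ and S_ε [6] is not shown, and the random matrices stand in for learned values.

    import numpy as np
    from scipy.stats import multivariate_normal

    def jb_score(x1, x2, S_mu, S_eps):
        """Joint Bayesian log-likelihood ratio r(x1, x2) of Eq. (1)."""
        d = len(x1)
        S_I = np.block([[S_mu + S_eps, S_mu],
                        [S_mu, S_mu + S_eps]])              # Eq. (4)
        S_E = np.block([[S_mu + S_eps, np.zeros((d, d))],
                        [np.zeros((d, d)), S_mu + S_eps]])  # Eq. (5)
        z = np.concatenate([x1, x2])
        zero = np.zeros(2 * d)
        return (multivariate_normal.logpdf(z, zero, S_I)
                - multivariate_normal.logpdf(z, zero, S_E))

    # Stand-ins for the EM-learned covariances (random SPD matrices).
    rng = np.random.default_rng(0)
    A, B = rng.standard_normal((2, 160, 160))
    S_mu = A @ A.T + 160 * np.eye(160)
    S_eps = B @ B.T + 160 * np.eye(160)
    x1, x2 = rng.standard_normal((2, 160))
    print(jb_score(x1, x2, S_mu, S_eps))  # above a threshold => matched

In practice r(x_1, x_2) is thresholded to decide matched vs unmatched, and the closed form in [6] avoids building the 2d × 2d covariances explicitly; the direct form above is simply the most literal rendering of Eqs. (1)–(5).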
Feature Distance

The existing research offers little discussion of the distance measure used for CNN-learned features. In particular, it is interesting to know what the best distance measure for face recognition is. Table 4 compares the impact of six distance measures on face recognition accuracy. Cosine and correlation achieve the best recognition rates; however, the standard deviation of cosine is smaller than that of correlation. Therefore, cosine distance is the best among these distances.

Table 4. Distance Comparison

Distance      Accuracy
euclidean     0.6898±0.0092
city block    0.6892±0.0088
chebychev     0.6692±0.0088
cosine        0.7882±0.0037
correlation   0.7882±0.0040
spearman      0.7878±0.0031
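As a sketch of how these six measures can be computed (SciPy naming assumed; note that SciPy's cosine and correlation return 1 − similarity, so smaller means closer, and a Spearman distance can be formed the same way):

    import numpy as np
    from scipy.spatial import distance
    from scipy.stats import spearmanr

    f1, f2 = np.random.randn(160), np.random.randn(160)  # stand-in 160-D CNN features
    rho, _ = spearmanr(f1, f2)
    scores = {
        "euclidean": distance.euclidean(f1, f2),
        "city block": distance.cityblock(f1, f2),
        "chebychev": distance.chebyshev(f1, f2),
        "cosine": distance.cosine(f1, f2),            # 1 - cosine similarity
        "correlation": distance.correlation(f1, f2),  # 1 - Pearson correlation
        "spearman": 1 - rho,                          # 1 - Spearman rank correlation
    }
    for name, s in scores.items():
        print(f"{name:12s} {s:.4f}")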

Grey vs Colour

In [28] and [25], CNNs are trained using grey-level and RGB colour images, respectively. In comparison, both grey and colour images are used in [21]. We quantitatively compare the impact of these two image types on face recognition. Their comparative evaluation yields face recognition accuracies using grey and colour images of 0.7830±0.0077 and 0.7882±0.0118, respectively. The performances using grey and colour images are very close to each other. Although colour images contain more information, they do not deliver a significant improvement.

Data Augmentation

Flip, mirroring images horizontally to produce two samples from each, is a commonly used data augmentation technique for face recognition. Both original and mirrored images are used for training in all our evaluations. However, little discussion in the existing work analyses the impact of image flipping during testing. Naturally, the test images can also be mirrored. A pair of test images can produce 2 new mirrored ones, and these 4 images can generate 4 pairs instead of one original pair. To combine these 4 images/pairs, two fusion strategies (feature and score fusion) are implemented in this work, sketched after Table 5. For feature fusion, the learned features of a test image and its mirrored version are concatenated into one feature, which is then used for score computation. For score fusion, the 4 scores generated from the 4 pairs are averaged into one score. Table 5 compares the three scenarios: no flip during the test, feature fusion, and score fusion. As shown in Table 5, mirroring images does improve the face recognition performance. In addition, feature fusion works slightly better than score fusion; however, the improvements are not statistically significant.

Table 5. Comparison of Data Augmentation during Test

                     Accuracy
no flip on test set  0.7882 ± 0.0037
feature fusion       0.7895 ± 0.0036
score fusion         0.7893 ± 0.0035
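A minimal sketch of the two test-time fusion strategies, assuming a `net` callable that maps an image array to its 160-D feature and cosine similarity as the score (both per the paper's setup; the function names are illustrative):

    import numpy as np

    def cosine_score(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

    def feature_fusion_score(net, img1, img2):
        # Concatenate each image's feature with its mirrored image's feature.
        f1 = np.concatenate([net(img1), net(np.fliplr(img1))])
        f2 = np.concatenate([net(img2), net(np.fliplr(img2))])
        return cosine_score(f1, f2)

    def score_fusion_score(net, img1, img2):
        # Average the 4 pair scores formed from the originals and the mirrors.
        v1 = [net(img1), net(np.fliplr(img1))]
        v2 = [net(img2), net(np.fliplr(img2))]
        return np.mean([cosine_score(a, b) for a in v1 for b in v2])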
Learned Feature Analysis

It is interesting to investigate the properties of CNN-learned face representations. First, we discuss feature normalisation, which standardises the range of features and is generally performed during the data preprocessing step. For example, to implement eigenface [26], the features (pixel values) are usually normalised via Eq. (6) before training a PCA space:

    \hat{x} = \frac{x - \mu_x}{\sigma_x}    (6)

where x and x̂ are the original and normalised feature vectors, respectively, and μ_x and σ_x are the mean and standard deviation of x. Motivated by this, our CNN features are normalised by Eq. (6) before computing the cosine distance. The accuracies with and without normalisation are 0.7927±0.0126 and 0.7882±0.0118, respectively. Thus normalisation is effective in improving the recognition rate.

Second, we perform dimensionality reduction on the learned 160D features using PCA. As shown in Figure 2, only 16 dimensions of the PCA feature space can achieve face recognition rates comparable to those of the original space. This is a very interesting property of CNN-learned features, because low dimensionality can significantly reduce storage space and computation, which is crucial for large-scale applications or mobile devices such as smartphones.

[Figure 2. The impact of feature dimensionality in PCA space on face recognition rate (x-axis: dimensionality, 2 to 160; y-axis: face recognition accuracy).]
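A sketch of both steps on a matrix of learned features, in plain NumPy: Eq. (6) applied per feature vector, then a PCA projection estimated from training features (the 16-dimension choice follows Figure 2; the arrays are stand-ins for real CNN outputs).

    import numpy as np

    def normalise(X):
        """Eq. (6), applied to each row (one 160-D feature per row)."""
        mu = X.mean(axis=1, keepdims=True)
        sigma = X.std(axis=1, keepdims=True)
        return (X - mu) / sigma

    def fit_pca(X_train, k=16):
        centre = X_train.mean(axis=0)
        _, _, Vt = np.linalg.svd(X_train - centre, full_matrices=False)
        return centre, Vt[:k].T             # projection onto the top-k components

    X_train = np.random.randn(1000, 160)    # stand-in CNN features
    centre, W = fit_pca(normalise(X_train), k=16)
    x = normalise(np.random.randn(1, 160))
    x_16d = (x - centre) @ W                # 16-D feature for cosine scoring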
[Figure 3. Sample crops in LFW. Rows correspond to 5 regions from the 4 corners and the center; columns correspond to 6 scales.]

[Figure 4. Face recognition accuracies with and without JB (‘network fusion’ vs ‘network fusion + JB’) on each of the 10 splits in view 2 of LFW; x-axis: the split index, y-axis: face recognition accuracy (0.7 to 1.0).]

Network Fusion

The work DeepID [21] and its variants [19, 22] apply the fusion of multiple networks. Specifically, the images of different facial regions and scales are separately fed into networks that have the same architecture. The features learned from the different networks are concatenated into a powerful face representation, which implicitly captures the spatial information of facial parts.

The size of these images can differ, as shown in Table 1. In [21], 120 networks are trained separately for this fusion. However, it is not very clear how much this fusion improves the face recognition performance. To clarify this issue, we implement the network fusion.

We extract d × d crops from the four corners and the center and then upsample them to the original image size 58 × 58. The crops have 6 different scales: d = floor(58 × s) for s ∈ {0.3, 0.4, 0.5, 0.6, 0.7, 0.8}, where floor is the operator that takes the integer part. Therefore we obtain 30 local patches of size 58 × 58 from one original image. Figure 3 shows these 30 crops; a sketch of the extraction follows below. To evaluate the performance of network fusion, we separately train 30 different networks using these crops. Then one face image can be represented by concatenating the features learned from the different networks.
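A sketch of the 30-crop extraction with Pillow, assuming the corner crops are taken flush to the image corners and the fifth region is centred (the paper does not spell out the exact offsets):

    import numpy as np
    from PIL import Image

    def thirty_crops(img):
        """img: a 58x58 PIL image; returns 6 scales x 5 regions = 30 crops."""
        assert img.size == (58, 58)
        crops = []
        for s in (0.3, 0.4, 0.5, 0.6, 0.7, 0.8):
            d = int(np.floor(58 * s))          # d = floor(58 * s)
            c = (58 - d) // 2                  # top-left of the centred crop
            corners = [(0, 0), (58 - d, 0), (0, 58 - d), (58 - d, 58 - d), (c, c)]
            for x, y in corners:
                patch = img.crop((x, y, x + d, y + d))
                crops.append(patch.resize((58, 58), Image.BILINEAR))  # upsample back
        return crops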
Table 6 compares the performance of a single network and of network fusion. Note that we choose the 16 best of the 30 networks for the fusion. It is clear that network fusion works much better than a single network. Specifically, the fusion of the 16 best networks improves the face recognition accuracy of the single network by 4.51%. Clearly, the face representation of network fusion is the fusion of features of different facial components and scales. Similar ideas have been widely used to improve the representation capacity of hand-crafted features, such as multi-scale local binary patterns [16], multi-scale local phase quantisation [4] and high-dimensional local features [7].

Table 6. Comparison of Network Fusion

                Accuracy
single network  0.7882 ± 0.0037
network fusion  0.8333 ± 0.0042

Metric Learning

For metric learning, the features of the fusion of the best 16 networks are used. The feature dimensionality (2560 = 160×16) is reduced to 320 via PCA before the features are fed into JB; a sketch of this reduction follows Table 7. Figure 4 compares the face recognition accuracies with and without JB on each split of the LFW database. JB consistently and significantly improves the face recognition rates, showing the importance of metric learning.

Table 7 compares our method with non-commercial state-of-the-art methods. The performance of our method is slightly better than [24, 8, 15] but worse than [7, 17, 20]. However, the feature dimensionality of [7, 17] is much higher than ours. In [20], a large number of new pairs are generated in addition to those provided by LFW to train the model, while we do not generate new pairs.

Table 7. Comparison with state-of-the-art methods on LFW under ‘unrestricted, label-free outside data’

methods                      accuracy
LBP multishot [24]           0.8517 ± 0.0061
LDML-MkNN [8]                0.8750 ± 0.0040
LBP+PLDA [15]                0.8733 ± 0.0055
high-dim LBP [7]             0.9318 ± 0.0107
Fisher vector faces [17]     0.9303 ± 0.0105
ConvNet+RBM [20]             0.9175 ± 0.0048
Network fusion + JB (ours)   0.8763 ± 0.0064
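The reduction step can be sketched in a few lines, reusing the SVD-based PCA from the Learned Feature Analysis sketch; shapes follow the text (16 networks × 160-D = 2560-D, reduced to 320-D before the EM stage of JB, which is not shown, and the array is a stand-in for real features):

    import numpy as np

    # Stand-in: per-image features from the 16 best networks, concatenated.
    F = np.random.randn(5000, 16 * 160)   # (N, 2560)

    centre = F.mean(axis=0)
    _, _, Vt = np.linalg.svd(F - centre, full_matrices=False)
    W = Vt[:320].T                        # top-320 principal directions
    F_320 = (F - centre) @ W              # (N, 320), input to JB training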
5. Conclusions

Recently, convolutional neural networks have attracted a lot of attention in the field of face recognition. In this work, we present a rigorous empirical evaluation of CNN-based face recognition systems. Specifically, we quantitatively evaluate the impact of different architectures and implementation choices of CNNs on face recognition performance on a common ground. We have shown that network fusion can significantly improve the face recognition performance, because different networks capture the information from different regions and scales to form a powerful face representation. In addition, metric learning such as the Joint Bayesian method can improve face recognition greatly. Since network fusion and metric learning are the two most important factors affecting CNN performance, they will be the subject of future investigation.

Acknowledgements

This work is supported by the European Union's Horizon 2020 research and innovation program under grant agreement No 640891, the EPSRC/dstl project ‘Signal processing in a networked battlespace’ under contract EP/K014307/1, the EPSRC Programme Grant ‘S3A: Future Spatial Audio for Immersive Listener Experiences at Home’ under contract EP/L000539, and the European Union project BEAT. We also gratefully acknowledge the support of NVIDIA Corporation for the donation of the GPUs used for this research.

References

[1] T. Ahonen, A. Hadid, and M. Pietikäinen. Face recognition with local binary patterns. In Computer Vision – ECCV 2004, pages 469–481. Springer, 2004.
[2] T. Ahonen, E. Rahtu, V. Ojansivu, and J. Heikkilä. Recognition of blurred faces using local phase quantization. In Pattern Recognition, 2008. ICPR 2008. 19th International Conference on, pages 1–4. IEEE, 2008.
[3] C. H. Chan, M. A. Tahir, J. Kittler, and M. Pietikäinen. Multiscale local phase quantization for robust component-based face recognition using kernel fusion of multiple descriptors. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 35(5):1164–1177, 2013.
[4] C. H. Chan, M. A. Tahir, J. Kittler, and M. Pietikäinen. Multiscale local phase quantization for robust component-based face recognition using kernel fusion of multiple descriptors. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 35(5):1164–1177, 2013.
[5] K. Chatfield, K. Simonyan, A. Vedaldi, and A. Zisserman. Return of the devil in the details: Delving deep into convolutional nets. arXiv preprint arXiv:1405.3531, 2014.
[6] D. Chen, X. Cao, L. Wang, F. Wen, and J. Sun. Bayesian face revisited: A joint formulation. In Computer Vision – ECCV 2012, pages 566–579. Springer, 2012.
[7] D. Chen, X. Cao, F. Wen, and J. Sun. Blessing of dimensionality: High-dimensional feature and its efficient compression for face verification. In Computer Vision and Pattern Recognition (CVPR), 2013 IEEE Conference on, pages 3025–3032. IEEE, 2013.
[8] M. Guillaumin, J. Verbeek, and C. Schmid. Is that you? Metric learning approaches for face identification. In Computer Vision, 2009 IEEE 12th International Conference on, pages 498–505. IEEE, 2009.
[9] K. He, X. Zhang, S. Ren, and J. Sun. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. arXiv preprint arXiv:1502.01852, 2015.
[10] G. E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, and R. R. Salakhutdinov. Improving neural networks by preventing co-adaptation of feature detectors. arXiv preprint arXiv:1207.0580, 2012.
[11] G. Huang, M. Mattar, H. Lee, and E. G. Learned-Miller. Learning to align from scratch. In Advances in Neural Information Processing Systems, pages 764–772, 2012.
[12] G. B. Huang, M. Ramesh, T. Berg, and E. Learned-Miller. Labeled faces in the wild: A database for studying face recognition in unconstrained environments. Technical Report 07-49, University of Massachusetts, Amherst, October 2007.
[13] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.
[14] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105, 2012.
[15] P. Li, Y. Fu, U. Mohammed, J. H. Elder, and S. J. Prince. Probabilistic models for inference about identity. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 34(1):144–157, 2012.
[16] S. Liao, X. Zhu, Z. Lei, L. Zhang, and S. Z. Li. Learning multi-scale block local binary patterns for face recognition. In Advances in Biometrics, pages 828–837. Springer, 2007.
[17] K. Simonyan, O. M. Parkhi, A. Vedaldi, and A. Zisserman. Fisher vector faces in the wild. In British Machine Vision Conference, 2013.
[18] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
[19] Y. Sun, Y. Chen, X. Wang, and X. Tang. Deep learning face representation by joint identification-verification. In Advances in Neural Information Processing Systems, pages 1988–1996, 2014.
[20] Y. Sun, X. Wang, and X. Tang. Hybrid deep learning for face verification. In Computer Vision (ICCV), 2013 IEEE International Conference on, pages 1489–1496. IEEE, 2013.
[21] Y. Sun, X. Wang, and X. Tang. Deep learning face representation from predicting 10,000 classes. In Computer Vision and Pattern Recognition (CVPR), 2014 IEEE Conference on, pages 1891–1898. IEEE, 2014.
[22] Y. Sun, X. Wang, and X. Tang. Deeply learned face representations are sparse, selective, and robust. arXiv preprint arXiv:1412.1265, 2014.
[23] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. arXiv preprint arXiv:1409.4842, 2014.
[24] Y. Taigman, L. Wolf, T. Hassner, et al. Multiple one-shots for utilizing class label information. In BMVC, pages 1–12, 2009.
[25] Y. Taigman, M. Yang, M. Ranzato, and L. Wolf. DeepFace: Closing the gap to human-level performance in face verification. In Computer Vision and Pattern Recognition (CVPR), 2014 IEEE Conference on, pages 1701–1708. IEEE, 2014.
[26] M. A. Turk and A. P. Pentland. Face recognition using eigenfaces. In Computer Vision and Pattern Recognition, 1991. Proceedings CVPR ’91., IEEE Computer Society Conference on, pages 586–591. IEEE, 1991.
[27] A. Vedaldi and K. Lenc. MatConvNet – convolutional neural networks for MATLAB. CoRR, abs/1412.4564, 2014.
[28] D. Yi, Z. Lei, S. Liao, and S. Z. Li. Learning face representation from scratch. arXiv preprint arXiv:1411.7923, 2014.
[29] E. Zhou, Z. Cao, and Q. Yin. Naive-deep face recognition: Touching the limit of LFW benchmark or not? arXiv preprint arXiv:1501.04690, 2015.
