Deep Learning Face Attributes in the Wild
Predicting face attributes in the wild is challenging due to complex face variations. We propose a novel deep learning framework for attribute prediction in the wild. It cascades two CNNs, LNet and ANet, which are fine-tuned jointly with attribute tags, but pre-trained differently. LNet is pre-trained by massive general object categories for face localization, while ANet is pre-trained by massive face identities for attribute prediction.
Figure 2. The proposed pipeline of attribute prediction: (a) LNeto, (b) LNets, (c) ANet, (d) extracting features to predict attributes. (Best viewed in color)
... accuracy of face localization. Both LNeto and LNets have network structures similar to AlexNet [13], whose hyper-parameters are specified in Fig.2 (a) and (b) respectively. The fifth convolutional layer (C5) of LNeto indicates head-shoulders while C5 of LNets indicates faces, as shown by the highly responding regions in their averaged response maps. Moreover, the input x_o of LNeto is an m × n image, while the input x_s of LNets is the head-shoulder region, which is localized by LNeto and resized to 227 × 227.

As illustrated in Fig.2 (c), ANet is learned to predict attributes y given the input face region x_f, which is detected by LNets and properly resized. Specifically, multi-view versions [13] of x_f are utilized to train ANet. Furthermore, ANet contains four convolutional layers, where the filters of C1 and C2 are globally shared and the filters of C3 and C4 are locally shared. The effectiveness of local filters has been demonstrated in many face related tasks [26, 28]. To handle complex face variations, ANet is pre-trained by distinguishing massive face identities, which facilitates the learning of discriminative features.

Fig.2 (d) outlines the procedure of attribute recognition. ANet extracts a set of feature vectors (FCs) by cropping overlapping patches on x_f. An efficient feed-forward algorithm is developed to reduce redundant computation in the feature extraction stage. SVMs [8] are trained to predict attribute values given each FC. The final prediction is obtained by averaging all these values, to cope with small misalignments of face localization.

2.1. Face Localization

The cascade of LNeto and LNets accurately localizes face regions by being trained on image-level attribute tags.

Pre-training LNet Both LNeto and LNets are pre-trained with 1,000 general object categories from the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) 2012 [6], containing 1.2 million training images and 50 thousand validation images. All the data is employed for pre-training except one third of the validation data, which is used for choosing hyper-parameters [13]. We augment the data by cropping ten patches from each image, including one patch at the center and four at the corners, together with their horizontal flips. We adopt softmax for object classification, which is optimized by stochastic gradient descent (SGD) with back-propagation (BP) [16]. As shown in Fig.3 (a.2), the averaged response map in C5 of LNeto already indicates the locations of objects, including human faces, after pre-training.

Fine-tuning LNet Both LNeto and LNets are fine-tuned with attribute tags. Additional output layers are added to the LNets individually for fine-tuning and then removed for evaluation. LNeto adopts the full image x_o as input, while LNets uses the highly responding region x_s in the averaged response map in C5 of LNeto as input, which roughly corresponds to the head-shoulder region. The cross-entropy loss is used for attribute classification, i.e. $L = \sum_{i} \big[ y_i \log p(y_i|x) + (1 - y_i) \log\big(1 - p(y_i|x)\big) \big]$, where $p(y_i = 1|x) = \frac{1}{1+\exp(-f(x))}$ is the probability of the i-th attribute given image x. As shown in Fig.3 (a.3), the response maps after fine-tuning become much cleaner and smoother, indicating that the filters learned with attribute tags can detect face patterns with complex variations. To appreciate the effectiveness of pre-training, we also include in Fig.3 (a.4) the averaged response map in C5 of LNet when it is trained from scratch with attribute tags but without pre-training; it cannot separate face regions from the background and other body parts well.

Thresholding and Proposing Windows We show that the responses of C5 in LNet are discriminative enough to separate faces and background by simply searching for a threshold, such that a window with a response larger than this threshold corresponds to a face and otherwise to background.
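For concreteness, the cross-entropy loss used for fine-tuning above can be sketched as follows. This is our illustration, not the authors' implementation; it assumes the network produces one raw score per attribute and negates the log-likelihood so that the quantity is minimized.

```python
import numpy as np

def attribute_cross_entropy(scores, tags, eps=1e-12):
    """Sigmoid cross-entropy summed over attributes.

    scores: raw network outputs f(x), one entry per attribute
    tags:   binary attribute labels y_i in {0, 1}, same shape
    """
    p = 1.0 / (1.0 + np.exp(-scores))      # p(y_i = 1 | x)
    p = np.clip(p, eps, 1.0 - eps)         # numerical safety
    return -np.sum(tags * np.log(p) + (1.0 - tags) * np.log(1.0 - p))

# toy usage: 5 attributes, e.g. [Smiling, Male, Eyeglasses, Young, Bangs]
scores = np.array([2.1, -0.7, 0.3, 1.5, -2.0])
tags = np.array([1, 0, 1, 1, 0], dtype=float)
print(attribute_cross_entropy(scores, tags))
```

In practice this loss is averaged over a mini-batch and optimized with SGD and back-propagation, as described above.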
[Figure 3: averaged response maps of LNet, panels (a.1)-(a.4); (b) distributions of C5 responses on face images versus background images (percentage of images as a function of the threshold).]
[Figure: (a) single detector, (b) multi-view detector, (c) face localization by attributes (views and attribute configurations).]
Figure 5. Detailed pipeline of efficient feature extraction in ANet: (a) global convolution, (b) local convolution, (c) feature extraction with interweaved operation, (d) interweaved operation.
... we have $L = \sum_{i=1,\,y_i=y_j}^{|D|} \lVert \mathrm{FC}_i - \mathrm{FC}_j \rVert_2^2$, where FC_i and FC_j denote the feature vectors of the i-th and j-th face images respectively, and y_i = y_j indicates that the identities of these samples are the same. In summary, ANet is pre-trained by combining the softmax loss and the similarity loss.

Efficient Feature Extractions At test time, ANet is evaluated on multiple patches of the face region as shown in Fig.2 (d), leading to redundant convolutional computations because of the large overlaps between these patches. When all the filters are globally shared, the computational cost can be reduced by applying [11], which convolves the filters over the input image and then obtains a feature vector for each patch by pooling over the last convolutional layer. Given a simple example with one convolutional layer as shown in Fig.5 (a), the feature vector FC for each patch (e.g. the rectangle in red) can be extracted by pooling in the corresponding region of the response map h^(1), without evaluating convolutions in the input image patch by patch. Therefore, the convolutions are shared by every patch.

However, this scheme is not applicable when we have more than two convolutional layers whose filters are locally shared. An example is illustrated in Fig.5 (b), where each patch is equally divided into 3 × 3 = 9 cells and we learn different filters for different cells. To reduce computations in the first convolutional layer, each local filter can be applied on the entire image, resulting in a response map with nine channels, i.e. h_i^(1) with i = 1...9. The final response map h^(1) is obtained by cropping and padding the regions (i.e. the rectangles in black) in these nine channels. As a result, each feature vector FC can be pooled from h^(1) without convolving the input image patch by patch. Nevertheless, since h^(1) corresponds to a patch of the input image, the succeeding local convolutions have to be handled patch by patch, leading to redundant computations.

To this end, we propose an interweaved operation, which is a fast feed-forward method for CNNs with locally shared filters. Suppose we have four local filters in the next locally convolutional layer and each filter is applied on 2 × 2 cells of h^(1) as shown in (b). These cells are the receptive fields of the filters, namely {1, 2, 4, 5}, {2, 3, 5, 6}, {4, 5, 7, 8}, and {5, 6, 8, 9}. Instead of directly applying the local filters on h^(1), the interweaved operation generates an interweaved map I_i^(1) for each filter, where i = 1...4. Each local filter is then applied on its corresponding interweaved map. Since the interweaved map covers the entire image, each local filter is turned into a global filter such that its computation can be shared across different patches.

Specifically, each interweaved map, e.g. I_1^(1), is obtained by padding the cells of the corresponding channels in an interweaved manner, e.g. h_i^(1) for i in {1, 2, 4, 5}, as shown in Fig.5 (d). All of the interweaved maps are illustrated in Fig.5 (c). After that, each of the four local filters is applied on its corresponding interweaved map, leading to four response maps h_i^(2), where i = 1...4. As a result, the feature vector FC is pooled and concatenated from the receptive fields of the filters, which are the rectangles in black as shown in (c).

Intuitively, instead of padding cells according to the receptive fields of all the local filters (e.g. h^(1) in (b)), which has to be done in a patch-by-patch way, the interweaved operation pads the cells with respect to the receptive field of each local filter over the entire image. This enables extracting multiple feature vectors with only one pass of feed-forward evaluation, and the operation can be repeated when more locally convolutional layers are added. The proposed feature extraction scheme achieves a 6× speedup empirically when compared with patch-by-patch scanning. It is applicable to CNNs with local filters and compatible with all existing CNN operations.
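To make the computation sharing concrete, the following NumPy sketch (ours, not the authors' code) demonstrates the first step discussed above: each locally-shared filter is convolved once over the entire image, and the per-patch, per-cell responses are then recovered by cropping, matching patch-by-patch evaluation exactly. The interweaved operation extends this idea to subsequent locally convolutional layers.

```python
import numpy as np

def corr2d_valid(image, kernel):
    """Plain 'valid' cross-correlation of a 2-D image with a 2-D kernel."""
    H, W = image.shape
    k = kernel.shape[0]
    out = np.empty((H - k + 1, W - k + 1))
    for r in range(out.shape[0]):
        for s in range(out.shape[1]):
            out[r, s] = np.sum(image[r:r + k, s:s + k] * kernel)
    return out

rng = np.random.default_rng(0)
k, cell, grid = 3, 8, 3                          # filter size, cell size, 3x3 cells per patch
patch = cell * grid                              # 24x24 patches
image = rng.normal(size=(40, 40))
filters = rng.normal(size=(grid * grid, k, k))   # one locally-shared filter per cell
offsets = [(0, 0), (0, 8), (8, 4), (16, 16)]     # top-left corners of overlapping patches

# (a) naive: convolve every cell of every patch separately
def naive(patch_img):
    maps = []
    for idx, (ci, cj) in enumerate((i, j) for i in range(grid) for j in range(grid)):
        cell_img = patch_img[ci * cell:(ci + 1) * cell, cj * cell:(cj + 1) * cell]
        maps.append(corr2d_valid(cell_img, filters[idx]))
    return maps

# (b) shared: run each local filter over the WHOLE image once, then crop per patch/cell
full_maps = [corr2d_valid(image, f) for f in filters]   # computed once, reused by all patches

for (pr, pc) in offsets:
    ref = naive(image[pr:pr + patch, pc:pc + patch])
    for idx, (ci, cj) in enumerate((i, j) for i in range(grid) for j in range(grid)):
        top, left = pr + ci * cell, pc + cj * cell
        crop = full_maps[idx][top:top + cell - k + 1, left:left + cell - k + 1]
        assert np.allclose(crop, ref[idx])
print("shared convolution matches patch-by-patch evaluation")
```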
Figure 6. Averaged response maps of LNet, including (a) CelebA, (b) MobileFaces, (c) some failure cases. (Best viewed in color)
Figure 7. ROC curves on (a) CelebA and (b) MobileFaces. (c) Recall rates w.r.t. overlap ratio (FPPI = 0.1). (d) Recall rates w.r.t. number of attributes (FPPI = 0.1).

Figure 9. Attribute-specific regions discovery.

3. Experiments

Large-scale Data Collection We construct two face attribute datasets, namely CelebA and LFWA, by labeling images selected from two challenging face datasets, CelebFaces [27] and LFW [12]. CelebA contains ten thousand identities, each of which has twenty images; there are two hundred thousand images in total. LFWA has 13,233 images of 5,749 identities. Each image in CelebA and LFWA is annotated with forty face attributes and five key points by a professional labeling company. CelebA and LFWA have over eight million and five hundred thousand attribute labels, respectively.

CelebA is partitioned into three parts. Images of the first eight thousand identities (with 160 thousand images) are used to pre-train and fine-tune ANet and LNet, the images of another one thousand identities (with twenty thousand images) are employed to train the SVMs, and the images of the remaining one thousand identities (with twenty thousand images) are used for testing. LFWA is partitioned into half for training and half for testing. Specifically, 6,263 images are adopted to train the SVMs and the remaining images are used for testing. When being evaluated on LFWA, LNet and ANet are trained on CelebA.

Methods for Comparisons The proposed method is compared with three competitive approaches, i.e. FaceTracer [14], PANDA-w [32], and PANDA-l [32]. FaceTracer extracts HOG and color histograms in several important functional face regions and then trains SVM for attribute classification; we extract these functional regions referring to the ground truth landmark points. PANDA-w and PANDA-l are based on PANDA [32], which was proposed recently for human attribute recognition by ensembling multiple CNNs, each of which extracts features from a well-aligned human part. These features are concatenated to train SVM for attribute recognition. It is straightforward to adapt this method to face attributes, since face parts can be well aligned by landmark points. Here, we consider two settings: PANDA-w obtains the face parts by applying state-of-the-art face detection [17] and alignment [26] on wild images, while PANDA-l obtains the face parts by using ground truth landmark points. For fair comparison, all the above methods are trained with the same data as ours.

3.1. Effectiveness of the Framework

This section demonstrates the effectiveness of the framework. All experiments in this section are done on CelebA.

• LNet

Performance Comparison We compare LNet with four state-of-the-art face detectors, including DPM [21], ACF Multi-view [30], SURF Cascade [17], and Face++ [1]. We evaluate them using ROC curves at IoU ≥ 0.5, where IoU denotes Intersection over Union. As plotted in Fig.7(a), when FPPI = 0.01, the true positive rates of Face++ and LNet are 85% and 93%; when FPPI = 0.1, our method outperforms the other three methods by 11, 9 and 22 percent respectively. We also investigate how these methods perform with respect to the overlap ratio (IoU), following [34, 21].
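For reference, the overlap criterion used in this evaluation (a detection counts as correct when its IoU with the ground truth box is at least 0.5) can be computed as in the small sketch below; boxes are assumed to be given as (x1, y1, x2, y2) corners, an illustrative convention rather than the paper's evaluation code.

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    return inter / (area_a + area_b - inter)

# a predicted face box versus a ground-truth box
print(iou((10, 10, 110, 110), (30, 20, 120, 130)))  # ~0.57, counted as correct at the 0.5 threshold
```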
[Figure 8 panels: (a.1) Gender, (a.2) Hair Color, (a.3) Age, (a.4) Race, (a.5) Face Shape, (a.6) Eye Shape, each shown as mean faces from high to low response; (b.1)-(b.3) test images with activated neurons such as Bangs, Brown Hair, Pale Skin, Narrow Eyes and High Cheekbones.]
Figure 8. Visualization of neurons in ANet (a) after pre-training (b) after fine-tuning (Best viewed in color)
Figure 10. (a) Layer-wise comparison of ANet after pre-training. (b) Analysis of the best performing neurons of ANet after fine-tuning; the x-axis is the percentage of best performing neurons used. Best performing neurons are different for different attributes, and the reported accuracies are averaged over attributes, each of which selects its own subset of best performing neurons.

Fig.7(c) shows that LNet generally provides more accurate face localization, leading to good performance in the subsequent attribute prediction.

Further Analysis LNet significantly outperforms LNet (without pre-training) by 74 percent when the overlap ratio equals 0.5, which validates the effectiveness of pre-training, as shown in Fig.7(c). We then explore the influence of the number of attributes on localization: Fig.7(d) illustrates that rich attribute information facilitates face localization. To examine the generalization ability of LNet, we collect another 3,876 face images for testing, namely MobileFaces, which comes from a different source and has a different distribution from CelebA (MobileFaces was collected by normal users with mobile phones, while CelebA and LFWA collected face images of celebrities taken by professional photographers). Several examples of MobileFaces are shown in Fig.6(b) and the corresponding ROC curves are plotted in Fig.7(b). We observe that LNet consistently performs better and still gains a 7 percent improvement (FPPI = 0.1) compared with the other face detectors. Despite some failure cases due to extreme poses and large occlusions, LNet accurately localizes faces in the wild, as demonstrated in Fig.6. More results of LNet under different circumstances (lighting, pose, occlusion, image resolution, background clutter, etc.) are shown in Fig.14.

Attribute-specific Regions Discovery Different attributes capture information from different regions of the face. We show that LNet automatically learns to discover these regions. Given an attribute, by converting the fully connected layers of LNet into fully convolutional layers following [18], we can locate the important region for this attribute. Fig.9 shows some examples. The important regions of some attributes are locally distributed, such as 'Bags Under Eyes', 'Straight Hair' and 'Wearing Necklace', while others are globally distributed, such as 'Young', 'Male' and 'Attractive'.

• ANet

Pre-training Discovers Semantic Concepts We show that pre-training of ANet can implicitly discover semantic concepts related to face identity. Given a hidden neuron at the FC layer of ANet as shown in Fig.2(c), we partition the face images into three groups, containing the face images with high, medium, and low responses at this neuron. The face images of each group are then averaged to obtain the mean face. We visualize these mean faces for several neurons in Fig.8(a). Interestingly, these mean faces change smoothly from high response to low response, following a high-level concept, and humans can easily assign each neuron a semantic concept it measures (i.e. the text in yellow). For example, the neurons in (a.1) and (a.4) correspond to 'gender' and 'race', respectively.
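The mean-face visualization described above is simple to reproduce; a rough sketch follows (array names such as `faces` and `act` are illustrative assumptions, not from the authors' code).

```python
import numpy as np

def mean_faces_by_response(faces, act):
    """Split images into high / medium / low response groups for one neuron
    and return the average image of each group.

    faces: float array of shape (N, H, W); act: shape (N,) neuron activations.
    """
    order = np.argsort(act)[::-1]              # most activated images first
    n = len(order)
    groups = {
        "high": order[: n // 3],
        "medium": order[n // 3: 2 * n // 3],
        "low": order[2 * n // 3:],
    }
    return {name: faces[idx].mean(axis=0) for name, idx in groups.items()}

# toy usage with random data standing in for aligned face crops
rng = np.random.default_rng(0)
faces = rng.random((300, 64, 64))
act = rng.normal(size=300)
means = mean_faces_by_response(faces, act)
print({k: v.shape for k, v in means.items()})
```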
Table 1. Performance comparison of attribute prediction (accuracy, %). Note that FaceTracer and PANDA-l obtain the face parts by using ground truth landmark points. Each block lists its attribute columns in order, followed by one row of accuracies per method on CelebA and on LFWA; the final column of the second block is the average over all forty attributes.

Attributes (columns 1-21): 5 Shadow, Arch. Eyebrows, Attractive, Bags Un. Eyes, Bald, Bangs, Big Lips, Big Nose, Black Hair, Blond Hair, Blurry, Brown Hair, Bushy Eyebrows, Chubby, Double Chin, Eyeglasses, Goatee, Gray Hair, Heavy Makeup, H. Cheekbones, Male

CelebA
FaceTracer [14]    85 76 78 76 89 88 64 74 70 80 81 60 80 86 88 98 93 90 85 84 91
PANDA-w [32]       82 73 77 71 92 89 61 70 74 81 77 69 76 82 85 94 86 88 84 80 93
PANDA-l [32]       88 78 81 79 96 92 67 75 85 93 86 77 86 86 88 98 93 94 90 86 97
[17]+ANet          86 75 79 77 92 94 63 74 77 86 83 74 80 86 90 96 92 93 87 85 95
LNets+ANet(w/o)    88 74 77 73 95 92 66 75 84 91 80 78 85 86 88 96 92 93 85 84 94
LNets+ANet         91 79 81 79 98 95 68 78 88 95 84 80 90 91 92 99 95 97 90 87 98

LFWA
FaceTracer [14]    70 67 71 65 77 72 68 73 76 88 73 62 67 67 70 90 69 78 88 77 84
PANDA-w [32]       64 63 70 63 82 79 64 71 78 87 70 65 63 65 64 84 65 77 86 75 86
PANDA-l [32]       84 79 81 80 84 84 73 79 87 94 74 74 79 69 75 89 75 81 93 86 92
[17]+ANet          78 66 75 72 86 84 70 73 82 90 75 71 69 68 70 88 68 82 89 79 91
LNets+ANet(w/o)    81 78 80 79 83 84 72 76 86 94 70 73 79 70 74 92 75 81 91 83 91
LNets+ANet         84 82 83 83 88 88 75 81 90 97 74 77 82 73 78 95 78 84 95 88 94

Attributes (columns 22-41): Mouth S. O., Mustache, Narrow Eyes, No Beard, Oval Face, Pale Skin, Pointy Nose, Reced. Hairline, Rosy Cheeks, Sideburns, Smiling, Straight Hair, Wavy Hair, Wear. Earrings, Wear. Hat, Wear. Lipstick, Wear. Necklace, Wear. Necktie, Young, Average

CelebA
FaceTracer [14]    87 91 82 90 64 83 68 76 84 94 89 63 73 73 89 89 68 86 80 81
PANDA-w [32]       82 83 79 87 62 84 65 82 81 90 89 67 76 72 91 88 67 88 77 79
PANDA-l [32]       93 93 84 93 65 91 71 85 87 93 92 69 77 78 96 93 67 91 84 85
[17]+ANet          85 87 83 91 65 89 67 84 85 94 92 70 79 77 93 91 70 90 81 83
LNets+ANet(w/o)    86 91 77 92 63 87 70 85 87 91 88 69 75 78 96 90 68 86 83 83
LNets+ANet         92 95 81 95 66 91 72 89 90 96 92 73 80 82 99 93 71 93 87 87

LFWA
FaceTracer [14]    77 83 73 69 66 70 74 63 70 71 78 67 62 88 75 87 81 71 80 74
PANDA-w [32]       74 77 68 63 64 64 68 61 64 68 77 68 63 85 78 83 79 70 76 71
PANDA-l [32]       78 87 73 75 72 84 76 84 73 76 89 73 75 92 82 93 86 79 82 81
[17]+ANet          76 79 74 69 66 68 72 70 71 72 82 72 65 87 82 86 81 72 79 76
LNets+ANet(w/o)    78 87 77 75 71 81 76 81 72 72 88 71 73 90 84 92 83 76 82 79
LNets+ANet         82 92 81 79 74 84 80 85 78 77 91 76 76 94 88 95 88 79 86 84
It reveals that the high-level hidden neurons of ANet can implicitly learn to discover semantic concepts, even though they are only optimized for face recognition using identity information and attribute labels are not used in pre-training. We also observe that most of these concepts are intrinsic to face identity, such as the shape of facial components, gender, and race.

To better explain this phenomenon, we compare the accuracy of attribute prediction using features at different layers of ANet right after pre-training, namely FC, C4, and C3. The forty attributes are roughly separated into two groups: identity-related attributes, such as gender and race, and identity-non-related attributes, e.g. attributes of expressions, wearing hat and sunglasses. We select some representative attributes for each group and plot the results in Fig.10(a), which shows that FC outperforms C4 and C3 on identity-related attributes, but is relatively weaker on identity-non-related attributes. This is because the top layer FC learns identity features, which are insensitive to intra-personal face variations.

Fine-tuning Expands Semantic Concepts Fig.8 shows that after fine-tuning, ANet can expand these concepts to more attribute types. Fig.8(b) visualizes the neurons in the FC layer, ranked by their responses in descending order with respect to several test images. Humans can assign semantic meaning to each of these neurons, and a large number of new concepts can be observed. Remarkably, these neurons express diverse high-level meanings and cooperate to explain the test images. The activations of all the neurons are visualized in Fig.8(b), and they are sparse. In some sense, the attributes present in each test image are explained by a sparse linear combination of these concepts. For instance, the first image is described as "a lady with bangs, brown hair, pale skin, narrow eyes and high cheekbones", which well matches human perception.

To validate this, we explore how the number of neurons influences attribute prediction accuracies. Best performing neurons for each attribute are identified by sorting the corresponding SVM weights. Fig.10(b) illustrates that only 10% of ANet's best performing neurons are needed to achieve 90% of the original performance of a particular attribute (the best performing neurons are different for different attributes). In contrast, HOG+PCA does not have this sparse nature and needs more than 95% of the features. Besides, the best single performing neuron of ANet outperforms that of HOG+PCA by 25 percent in average prediction accuracy.
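A possible reading of the neuron-selection analysis above, sketched with synthetic data (our illustration: we rank neurons by the magnitude of the corresponding linear-SVM weights, which is one way to interpret "sorting SVM weights"):

```python
import numpy as np

def accuracy_with_top_neurons(features, labels, svm_w, svm_b, keep_frac):
    """Zero out all but the highest-|weight| neurons for one attribute's linear SVM
    and report the resulting classification accuracy.

    features: (N, D) FC activations; labels: (N,) in {0, 1};
    svm_w: (D,) linear SVM weights; svm_b: bias; keep_frac: fraction of neurons kept.
    """
    ranked = np.argsort(np.abs(svm_w))[::-1]             # best-performing neurons first
    keep = ranked[: max(1, int(keep_frac * len(ranked)))]
    mask = np.zeros_like(svm_w)
    mask[keep] = 1.0
    scores = features @ (svm_w * mask) + svm_b
    return np.mean((scores > 0).astype(int) == labels)

# toy usage with synthetic data standing in for FC features and a trained SVM
rng = np.random.default_rng(0)
features = rng.normal(size=(1000, 256))
svm_w = rng.normal(size=256) * (rng.random(256) < 0.1)   # only a few informative neurons
svm_b = 0.0
labels = (features @ svm_w + svm_b > 0).astype(int)
for frac in (0.1, 0.5, 1.0):
    print(frac, accuracy_with_top_neurons(features, labels, svm_w, svm_b, frac))
```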
Table 2. Performance comparison on 19 extended attributes (measured by the average of true positive rates and true negative rates, %). The extended attributes are A. Eye., Asian, B. Eye., B. Nose, Bald, Black, Black H., Blond H., Eye., Gender, M. Aged, Mustache, No Beard, No Eye., R. Hair., R. Jaw, Senior, White, and Youth; the last value of each row is the average.

FaceTracer [14]    91 87 86 75 66 54 70 66 68 72 84 86 83 76 72 66 65 81 51 73
POOF [2]           92 90 81 90 71 60 80 67 75 67 87 90 86 72 74 71 68 77 55 76
LNets+ANet         94 85 83 87 80 77 81 86 89 84 85 84 86 83 82 75 79 78 81 83
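The metric reported in Table 2, the average of the true positive rate and the true negative rate, can be computed per attribute as in this small sketch (our illustration, not the original evaluation script):

```python
import numpy as np

def balanced_rate(pred, label):
    """Average of true positive rate and true negative rate for one attribute."""
    pred, label = np.asarray(pred), np.asarray(label)
    tpr = np.mean(pred[label == 1] == 1)   # recall on positive samples
    tnr = np.mean(pred[label == 0] == 0)   # recall on negative samples
    return 0.5 * (tpr + tnr)

print(balanced_rate([1, 1, 0, 0, 1, 0], [1, 0, 0, 1, 1, 0]))  # TPR 2/3, TNR 2/3 -> 0.667
```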
Figure 12. Performance comparison of FaceTracer [14], PANDA-w [32], PANDA-l [32] and LNets+ANet on LFWA+.

Figure 13. Performance with different training dataset sizes.

Automatic Attributes Grouping Here we show that the weight matrix at the FC layer of ANet can implicitly capture relations between attributes. Each column vector of the weight matrix can be viewed as a decision hyperplane that partitions the negative and positive samples of an attribute. By simply applying k-means to these vectors, the resulting clusters show clear grouping patterns that can be interpreted semantically. As shown in Fig.11, Group #1, Group #2 and Group #4 demonstrate co-occurrence relationships between attributes, e.g. 'Attractive' and 'Heavy Makeup' have high correlation. Attributes in Group #3 share similar color descriptors, while attributes in Group #6 correspond to certain texture and appearance traits.

3.2. Attribute Prediction

Performance Comparison The attribute prediction performance is reported in Table 1. On CelebA, the prediction accuracies of FaceTracer [14], PANDA-w [32], PANDA-l [32], and our LNets+ANet are 81, 79, 85, and 87 percent respectively, while the corresponding accuracies on LFWA are 74, 71, 81, and 84 percent. Our method outperforms PANDA-w by nearly 10 percent. Remarkably, even when PANDA-l is equipped with ground truth bounding boxes and landmark positions, our method still achieves a 3 percent gain. The strength of our method is illustrated not only on global attributes, e.g. "Chubby" and "Young", but also on fine-grained facial traits, e.g. "Mustache" and "Pointy Nose". We also report performance on 19 extended attributes and compare our results with [14] and [2], using the same evaluation protocol as [2]. In Table 2, LNets+ANet outperforms them by 10 and 7 percent respectively. We also experiment with providing ANet with the face region localized by LNets but without pre-training, denoted as LNets+ANet(w/o); the average accuracies drop by 4 and 5 percent on CelebA and LFWA, which indicates that pre-training with massive facial identities helps discover semantic concepts.

Performance on LFWA+ To further examine whether the proposed approach can be generalized to unseen attributes, we manually label 30 more attributes for the testing images of LFWA and denote this extended dataset as LFWA+. To test on these 30 attributes, we directly transfer the weights learned by the deep models to extract features, and only re-train the SVMs using one third of the images. LNets+ANet leads to 8, 10, and 3 percent average gains over the other three approaches (FaceTracer, PANDA-w, and PANDA-l). This demonstrates that our method learns discriminative face representations and has good generalization ability.

Size of Training Dataset We compare the attribute prediction accuracy of the proposed method with that of PANDA-l for different sizes of training datasets. Only the training data of ANet is changed in our method for fair comparison. Fig.13 demonstrates that LNets+ANet still performs well when the dataset size is small, while the performance of PANDA-l drops significantly.

Time Complexity For a 300 × 300 image, LNets takes 35 ms to localize the face region and ANet takes 14 ms to extract features on a GPU. In contrast, naïve patch-by-patch scanning needs nearly 80 ms to extract features. Our framework therefore has large potential in real-world applications.

4. Conclusion

This paper has proposed a novel deep learning framework for face attribute prediction in the wild. With carefully designed pre-training strategies, our method is robust to background clutter and face variations. We devise a new fast feed-forward algorithm for locally shared filters that avoids redundant computation, enabling images of arbitrary size to be evaluated in real time without normalization. We have also revealed multiple important facts about learning face representations, which shed light on new directions for face localization and representation learning.
References

[1] Face++. http://www.faceplusplus.com/.
[2] T. Berg and P. N. Belhumeur. POOF: Part-based one-vs.-one features for fine-grained categorization, face verification, and attribute estimation. In CVPR, pages 955–962, 2013.
[3] A. Bergamo, L. Bazzani, D. Anguelov, and L. Torresani. Self-taught object localization with deep networks. arXiv preprint arXiv:1409.3964, 2014.
[4] L. Bourdev, S. Maji, and J. Malik. Describing people: A poselet-based approach to attribute classification. In ICCV, pages 1543–1550, 2011.
[5] J. Chung, D. Lee, Y. Seo, and C. D. Yoo. Deep attribute networks. In NIPS Workshop on Deep Learning and Unsupervised Feature Learning, volume 3, 2012.
[6] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In CVPR, pages 248–255, 2009.
[7] J. Donahue, Y. Jia, O. Vinyals, J. Hoffman, N. Zhang, E. Tzeng, and T. Darrell. DeCAF: A deep convolutional activation feature for generic visual recognition. arXiv preprint arXiv:1310.1531, 2013.
[8] R.-E. Fan, K.-W. Chang, C.-J. Hsieh, X.-R. Wang, and C.-J. Lin. LIBLINEAR: A library for large linear classification. JMLR, 9:1871–1874, 2008.
[9] A. Farhadi, I. Endres, D. Hoiem, and D. Forsyth. Describing objects by their attributes. In CVPR, pages 1778–1785, 2009.
[10] R. Hadsell, S. Chopra, and Y. LeCun. Dimensionality reduction by learning an invariant mapping. In CVPR, volume 2, pages 1735–1742, 2006.
[11] K. He, X. Zhang, S. Ren, and J. Sun. Spatial pyramid pooling in deep convolutional networks for visual recognition. In ECCV, pages 346–361, 2014.
[12] G. B. Huang, M. Ramesh, T. Berg, and E. Learned-Miller. Labeled faces in the wild: A database for studying face recognition in unconstrained environments. Technical Report 07-49, University of Massachusetts, Amherst, October 2007.
[13] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, pages 1097–1105, 2012.
[14] N. Kumar, P. Belhumeur, and S. Nayar. FaceTracer: A search engine for large collections of images with faces. In ECCV, pages 340–353, 2008.
[15] N. Kumar, A. C. Berg, P. N. Belhumeur, and S. K. Nayar. Attribute and simile classifiers for face verification. In ICCV, pages 365–372, 2009.
[16] Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel. Handwritten digit recognition with a back-propagation network. In NIPS, 1990.
[17] J. Li and Y. Zhang. Learning SURF cascade for fast and accurate object detection. In CVPR, pages 3468–3475, 2013.
[18] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In CVPR, 2015.
[19] P. Luo, X. Wang, and X. Tang. A deep sum-product architecture for robust facial attributes analysis. In ICCV, pages 2864–2871, 2013.
[20] O. K. Manyam, N. Kumar, P. Belhumeur, and D. Kriegman. Two faces are better than one: Face recognition in group photographs. In IJCB, pages 1–8, 2011.
[21] M. Mathias, R. Benenson, M. Pedersoli, and L. Van Gool. Face detection without bells and whistles. In ECCV, pages 720–735, 2014.
[22] M. Oquab, L. Bottou, I. Laptev, and J. Sivic. Is object localization for free? Weakly-supervised learning with convolutional neural networks. In CVPR, pages 685–694, 2015.
[23] A. S. Razavian, H. Azizpour, J. Sullivan, and S. Carlsson. CNN features off-the-shelf: An astounding baseline for recognition. arXiv preprint arXiv:1403.6382, 2014.
[24] A. Rodriguez and A. Laio. Clustering by fast search and find of density peaks. Science, 344(6191):1492–1496, 2014.
[25] F. Song, X. Tan, and S. Chen. Exploiting relationship between attributes for improved face verification. CVIU, 122:143–154, 2014.
[26] Y. Sun, X. Wang, and X. Tang. Deep convolutional network cascade for facial point detection. In CVPR, pages 3476–3483, 2013.
[27] Y. Sun, X. Wang, and X. Tang. Deep learning face representation by joint identification-verification. In NIPS, 2014.
[28] Y. Taigman, M. Yang, M. Ranzato, and L. Wolf. DeepFace: Closing the gap to human-level performance in face verification. In CVPR, pages 1701–1708, 2014.
[29] J. Xiao, J. Hays, K. A. Ehinger, A. Oliva, and A. Torralba. SUN database: Large-scale scene recognition from abbey to zoo. In CVPR, pages 3485–3492, 2010.
[30] B. Yang, J. Yan, Z. Lei, and S. Z. Li. Aggregate channel features for multi-view face detection. In IJCB, pages 1–8, 2014.
[31] N. Zhang, J. Donahue, R. Girshick, and T. Darrell. Part-based R-CNNs for fine-grained category detection. In ECCV, pages 834–849, 2014.
[32] N. Zhang, M. Paluri, M. Ranzato, T. Darrell, and L. Bourdev. PANDA: Pose aligned networks for deep attribute modeling. In CVPR, 2014.
[33] B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba. Object detectors emerge in deep scene CNNs. In ICLR, 2015.
[34] C. L. Zitnick and P. Dollár. Edge boxes: Locating object proposals from edges. In ECCV, pages 391–405, 2014.
Figure 14. More results of LNet averaged response maps. (Best viewed in color)