Exploiting Facial Landmarks for Emotion Recognition in the Wild

Matthew Day
Department of Electronics, University of York, UK
[email protected]
Categories and Subject Descriptors
I.2.10 [Artificial Intelligence]: Vision and Scene Understanding – modeling and recovery of physical attributes, shape, texture.

General Terms
Algorithms, Performance, Design, Experimentation.

Keywords
Machine learning; Emotion recognition; Facial landmarks; BIF; SVM; Gradient boosting.
1. INTRODUCTION
Accurate machine analysis of human facial expression is important in an increasing number of applications across a growing number of fields. Human computer interaction (HCI) is one obvious example. Others include medical monitoring, psychological condition analysis, and the acquisition of commercially valuable feedback.¹

¹ This paper was originally accepted to the ACM International Conference on Multimodal Interaction (ICMI 2015), Seattle, USA, Nov 2015. It has been made available through arXiv.org because the author was unable to present.

Whilst the problem was traditionally addressed in highly constrained scenarios, an increasing focus on 'in the wild' (i.e. unconstrained) data has emerged in recent years. In this respect, the EmotiW2015 challenge [5] aims to advance the state of the art. The challenge is divided into two sub-challenges: (1) audio-video based and (2) static image based. Audio, in particular, has been shown to contain discriminative information [4], and motion intuitively provides valuable cues to a human observer. However, effectively exploiting this information is undoubtedly challenging. This fact is demonstrated by the two baseline systems [5], which achieve a negligible accuracy difference across the sub-challenges. It is therefore a useful pursuit to consider static images alone, because accuracy improvements here can be built on in more complex systems that analyze video. Consequently, this work focusses on the image-based sub-challenge.

The relationship between muscle movements and emotion has been well studied and is defined by the emotional facial action coding system (EMFACS) [9]. For example, happiness is represented by the combination of 'cheek raiser' with 'lip corner puller'. Sadness is demonstrated by 'inner brow raiser' plus 'brow lowerer' together with 'lip corner depressor'. However, in the static image sub-challenge, it is not possible to detect movements, so how well can expression be predicted from a single image? In contrast to the EmotiW2015 challenge, most prior art has reported results on well-lit, well-aligned faces. For a 5-class problem, with fairly accurate registration, [1] demonstrated classification accuracies of around 60%.

Section 2 of this paper discusses face detection and landmark location. Sections 3 and 4 describe the features and modelling approaches used in our experiments. Section 5 provides the main experimental results and section 6 discusses what we can take from these, as well as offering some miscellaneous considerations.

2. FACE REGISTRATION
The first stage in any system of facial analysis involves locating and aligning the face. Methods which holistically combine locating a face with locating facial landmark points have clear appeal. In particular, deformable parts models (DPM), introduced in [8], have become one of the most popular approaches in the research community.

On the other hand, face detection and landmark location have both been extensively studied separately. Many excellent solutions have been proposed to both problems, and our own experience suggests that tackling the tasks separately may have advantages in some scenarios. In particular, a recent method [12] for facial landmark location has excellent performance on unconstrained images and is therefore well suited to the EmotiW2015 challenge. The method uses a sequence of gradient-boosted regression models, where each stage refines the position estimate according to the results of many point-intensity comparisons. We use the implementation of [12] provided by the dlib library [13]. The model provided for use with this library was trained on data from the iBUG 300-W dataset and positions 68 points on frontal faces, similar to the baseline.
In [15], face detection based on rigid templates, similar to the classic method of [18], achieves comparable accuracy to a detector based on DPM [8], but the former has a substantial speed advantage. We choose the rigid template detector included in the dlib library [13] as a result of informal performance comparisons. This method uses histogram of oriented gradients (HoG) features [3] combined with a linear classifier. It has a very low false positive rate, although it fails to find a face in almost 12% of the challenge images. In these cases, we roughly position a bounding square manually to allow subsequent processing.²

² After our main experiments were complete, we found that a combination of open source detectors could reduce the miss rate to 3%, but the landmark estimator is unlikely to perform well on the remaining difficult faces, regardless of initialization.
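For concreteness, the following sketch shows how this detector and the 68-point landmark model can be combined through dlib's Python bindings. The model filename is the one dlib distributes for its iBUG 300-W predictor, and the fallback rectangle stands in for our manually positioned bounding squares; this is a minimal illustration rather than our exact pipeline.

```python
import dlib

# HoG + linear classifier face detector included in dlib [13]
detector = dlib.get_frontal_face_detector()
# Landmark model of [12], trained on iBUG 300-W; this is the standard
# 68-point model file distributed with dlib
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

def locate_landmarks(image, fallback_rect=None):
    """Return 68 (x, y) landmark points, or None if no face is found
    and no manually positioned fallback rectangle is supplied."""
    faces = detector(image, 1)  # upsample once to catch smaller faces
    if len(faces) > 0:
        rect = faces[0]
    elif fallback_rect is not None:   # e.g. dlib.rectangle(l, t, r, b)
        rect = fallback_rect
    else:
        return None
    shape = predictor(image, rect)
    return [(p.x, p.y) for p in shape.parts()]
```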
[Figure 1: Example landmarks from baseline (left) and proposed system (right)]

Figure 1 shows a representative comparison of the landmark points output by our system and those of the baseline system. To try to quantify the advantage here, we inspect the automatically located points for each image in the training set and give each one a subjective score based on how well they fit the face. In Figure 1, for example, we would consider the baseline points as 'close' and our points as 'very close'. The results from this exercise are shown in Table 1. It is clear from this that we have a better starting point for using shape information to estimate emotion – experiments in later sections quantify this further.

Table 1. Accuracy of baseline points versus proposed system

          Excluded  Fail  Poor  Close  Very Close
Baseline        67    55   173    663           0
Proposed         0     2    37    215         704
3. FEATURES
Our final system, i.e. our challenge entry, uses only one very simple type of shape feature. However, we performed experiments with various features, which we describe in the following paragraphs.

3.1 Shape
Intuitively, given accurate facial landmark locations, we can infer a lot about facial expression – arguably more than is possible from texture alone. We consider two simple types of shape feature, both derived from the automatically located facial landmark locations. For both types, we first normalize the size of the face.

3.1.1 Distances between Points
We consider the distances between all distinct pairs of landmark points. With 68 landmarks, this gives 2278 unique pairs. Many of these pairs will contain no useful shape information, and we could add heuristics to reduce this number considerably, although this is not necessary for the models we subsequently learn.
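As a minimal sketch, assuming the 68 landmarks arrive as a size-normalized 68x2 array, the full set of pairwise distances can be computed as follows:

```python
import numpy as np
from scipy.spatial.distance import pdist

def point_distance_features(landmarks):
    """landmarks: (68, 2) array of size-normalized (x, y) points.
    Returns the distances between all distinct pairs of points:
    68 * 67 / 2 = 2278 values (Euclidean by default)."""
    return pdist(np.asarray(landmarks, dtype=float))
```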
3.1.2 Axis Distances from Average
We speculate that the point-distances alone may not capture all shape information. We therefore test a second type of feature that considers the displacement from the average landmark location, where the average is taken over all faces in the training set. After up-righting the face, for each point we take the x- and y-distances from the average location as feature values. This results in a vector of length 136.
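A corresponding sketch, assuming size normalization and up-righting have already been applied and that the mean shape has been estimated from the training set:

```python
import numpy as np

def axis_distance_features(landmarks, mean_shape):
    """landmarks, mean_shape: (68, 2) arrays, size-normalized and up-righted.
    Returns the x- and y-displacements of each point from the average
    landmark location as a vector of length 2 * 68 = 136."""
    diff = np.asarray(landmarks, dtype=float) - np.asarray(mean_shape, dtype=float)
    return diff.ravel()

# The mean shape is estimated once, from the training faces only, e.g.
# mean_shape = np.mean(np.stack(training_landmarks), axis=0)
```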
3.2 Texture
By including texture to complement the shape information, we hope to improve classification accuracy in our experiments. We note that the baseline system [5] is based entirely on texture features, and many previous successful approaches have also used texture, e.g. [1].

3.2.1 Biologically Inspired Features (BIF)
BIF [10] are perhaps best known for the success they have achieved in facial age estimation. However, they have also demonstrated excellent performance in other face processing problems [16] and have been applied to the classification of facial expressions [14]. As a rich texture descriptor based on a model of the human visual system, BIF would appear to be a good candidate for this application.

Evaluating BIF involves applying a bank of Gabor filters with different orientations and scales at each location in the face image. The responses of these filters are pooled over similar locations and scales via non-linear operators: maximum (MAX) or standard deviation (STDDEV). In practice, the aligned face image is partitioned into overlapping rectangular regions for pooling, and the pooling operation introduces some tolerance to misalignment. Our implementation closely follows the description in [10]. We extract 60x60 face regions, aligned according to the automatically located landmarks, and use both MAX and STDDEV pooling, with 8 orientations and 8 bands. Our implementation then has 8640 feature values.
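The sketch below illustrates the general BIF recipe with OpenCV's Gabor kernels: the two scales within each band are combined by a MAX, and the result is pooled over overlapping regions with MAX and STDDEV operators. The filter sizes, sigma/wavelength settings and pooling grid here are illustrative assumptions, not the exact parameters of [10], so the output length differs from our 8640 values.

```python
import numpy as np
import cv2

def bif_features(face, n_orient=8, n_bands=8):
    """face: 60x60 aligned grayscale face region, float32.
    Gabor filtering with MAX over paired scales, then MAX and STDDEV
    pooling over overlapping regions (parameters are illustrative)."""
    feats = []
    for band in range(n_bands):
        size = 5 + 4 * band               # two neighbouring filter sizes per band
        for i in range(n_orient):
            theta = np.pi * i / n_orient
            responses = []
            for s in (size, size + 2):    # the two scales within the band
                kern = cv2.getGaborKernel((s, s), 0.5 * s, theta, 0.8 * s, 0.3)
                responses.append(cv2.filter2D(face, cv2.CV_32F, kern))
            c1 = np.maximum(*responses)   # MAX across the two scales
            cell = 6 + 2 * band           # pooling regions grow with the band
            step = cell // 2              # overlapping: stride of half a cell
            for y in range(0, c1.shape[0] - cell + 1, step):
                for x in range(0, c1.shape[1] - cell + 1, step):
                    patch = c1[y:y + cell, x:x + cell]
                    feats.append(patch.max())   # MAX pooling
                    feats.append(patch.std())   # STDDEV pooling
    return np.array(feats, dtype=np.float32)
```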
3.2.2 Point Texture Features
We speculate that we may gain more from texture features that are more directly tied to the locations of the landmarks. We therefore also consider a second type, where the feature values are simply Gabor filter responses at different scales and orientations, centered on each landmark location. We refer to these as 'point-texture features'. We evaluate filters at 8 scales and 12 orientations, giving a total of 6528 feature values for the 68 landmark points.
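A sketch of the point-texture features in the same style; the Gabor parameters are again assumptions, but the layout (8 scales x 12 orientations sampled at 68 points = 6528 values) matches the description above. Landmarks are assumed to be in the coordinates of the aligned face image.

```python
import numpy as np
import cv2

def point_texture_features(face, landmarks, n_scales=8, n_orient=12):
    """face: aligned grayscale face image, float32; landmarks: 68 (x, y)
    points in the coordinates of that image. Samples each Gabor response
    at every landmark: 8 scales * 12 orientations * 68 points = 6528."""
    feats = []
    for s in range(n_scales):
        size = 5 + 2 * s                  # illustrative filter sizes
        for i in range(n_orient):
            theta = np.pi * i / n_orient
            kern = cv2.getGaborKernel((size, size), 0.5 * size, theta,
                                      0.8 * size, 0.3)
            response = cv2.filter2D(face, cv2.CV_32F, kern)
            for (x, y) in landmarks:
                feats.append(response[int(round(y)), int(round(x))])
    return np.array(feats, dtype=np.float32)
```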
4. MODELLING
To construct predictive models from our features, we use two standard approaches from the machine learning literature: support vector machines (SVM) [11] and gradient boosting (GB) [11]. For the SVM classifiers, we use the implementation provided by libsvm [2] with an RBF kernel. We optimize the C and gamma parameters on the validation data via a grid search, as advocated by the authors of [2]. For the GB classifiers, we use our own implementation. We find that trees with two splits and a shrinkage factor of 0.1 generally work well on this problem, so we fix these parameters and optimize only the number of trees on the validation data.
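The sketch below reproduces this model-fitting protocol with scikit-learn stand-ins: SVC wraps libsvm, and GradientBoostingClassifier takes the place of our own GB implementation (two splits per tree, i.e. three leaves, and shrinkage 0.1). The C and gamma ranges follow those suggested by the libsvm authors, but the exact grids and candidate tree counts here are illustrative.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.ensemble import GradientBoostingClassifier

def fit_svm(X_train, y_train, X_val, y_val):
    """Grid search over C and gamma, scored on the validation set."""
    best, best_acc = None, -1.0
    for C in 2.0 ** np.arange(-5, 16, 2):
        for gamma in 2.0 ** np.arange(-15, 4, 2):
            clf = SVC(C=C, gamma=gamma, kernel="rbf").fit(X_train, y_train)
            acc = clf.score(X_val, y_val)
            if acc > best_acc:
                best, best_acc = clf, acc
    return best

def fit_gb(X_train, y_train, X_val, y_val):
    """Fix tree size (three leaves) and shrinkage (0.1); tune only the
    number of trees on the validation set."""
    best, best_acc = None, -1.0
    for n in (100, 200, 400, 800):
        clf = GradientBoostingClassifier(n_estimators=n, learning_rate=0.1,
                                         max_leaf_nodes=3).fit(X_train, y_train)
        acc = clf.score(X_val, y_val)
        if acc > best_acc:
            best, best_acc = clf, acc
    return best
```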
5. RESULTS

Table 2: Main performance figures

Model     Train   Validate  Test
GB        60.1%   40.8%     44.4%
SVM       52.1%   37.4%     46.8%
Baseline  -       36.0%     39.1%

For the texture features of 3.2, the result was more surprising. We expected these to add some useful information, but this appeared not to be the case, despite their quantity far exceeding that of the distance features.

As a consequence, we conclude that the simple point-distance features already contain the most information relevant to the task. We use the SVM model from Table 2 as our challenge entry.

Table 4: Overall accuracy using baseline points and points from proposed system

Landmarks  Model  Train   Validate  Test
Proposed   GB     65.8%   40.4%     38.4%
Proposed   SVM    50.0%   41.2%     40.6%
Baseline   GB     53.9%   32.3%     31.2%
Baseline   SVM    47.9%   34.1%     27.7%
A confusion matrix for the SVM classifier on the test data is shown in Table 3.

[Table 3: Test data confusion matrix for challenge entry; classes Angry, Fear, Happy, Neutral, Sad, Surprise, with the estimate in the columns]

6. DISCUSSION
Considering the results of Table 3, performance on each class of emotion exhibits the same pattern seen in previous EmotiW