Exploiting Facial Landmarks for Emotion Recognition in the Wild
Matthew Day
Department of Electronics, University of York, UK
[email protected]

ABSTRACT
In this paper, we describe an entry to the third Emotion Recognition in the Wild Challenge, EmotiW2015. We detail the associated experiments and show that, through more accurately locating the facial landmarks, and considering only the distances between them, we can achieve a surprising level of performance. The resulting system is not only more accurate than the challenge baseline, but also much simpler.

Categories and Subject Descriptors
I.2.10 [Artificial Intelligence]: Vision and Scene Understanding – modeling and recovery of physical attributes, shape, texture.

General Terms
Algorithms, Performance, Design, Experimentation.

Keywords
Machine learning; Emotion recognition; Facial landmarks; BIF; SVM; Gradient boosting.

¹ This paper was originally accepted to the ACM International Conference on Multimodal Interaction (ICMI 2015), Seattle, USA, Nov 2015. It has been made available through arXiv.org because the author was unable to present.

1. INTRODUCTION
Accurate machine analysis of human facial expression is important in an increasing number of applications across a growing number of fields. Human computer interaction (HCI) is one obvious example. Others include medical monitoring, psychological condition analysis, or as a means of acquiring commercially valuable feedback.

Whilst the problem was traditionally addressed in highly constrained scenarios, an increasing focus on 'in the wild' (i.e. unconstrained) data has emerged in recent years. In this respect, the EmotiW2015 challenge [5] aims to advance the state of the art. The challenge is divided into two sub-challenges: (1) audio-video based and (2) static image based. Audio, in particular, has been shown to contain discriminative information [4], and motion intuitively provides valuable cues to a human observer. However, effectively exploiting this information is undoubtedly challenging. This fact is demonstrated by the two baseline systems [5], which achieve a negligible accuracy difference across the sub-challenges. It is then clearly a useful pursuit to consider only static images, because accuracy improvements here can be built on in more complex systems that analyze video. Consequently, this work focusses on the image-based sub-challenge.

All images in this sub-challenge contain an expressive face, with the goal being to assign an emotion label from the set {neutral, happy, sad, angry, surprised, fearful, disgusted}. These labels originate from the work of Ekman [7], who noted that facial expressions are primarily generated by the contraction or relaxation of facial muscles. This causes a change in the location of points on the face surface (i.e. facial landmarks). Whilst other cues may exist, such as coloring of the skin or the presence of sweat or tears, shape changes remain the most significant indicator.

The relationship between muscle movements and emotion has been well studied and is defined by the emotional facial action coding system [9] (EMFACS). For example, happiness is represented by the combination of 'cheek raiser' with 'lip corner puller'. Sadness is demonstrated by 'inner brow raiser' plus 'brow lowerer' together with 'lip corner depressor'. However, in the static image sub-challenge, it is not possible to detect movements, so how well can expression be predicted from a single image? In contrast to the EmotiW2015 challenge, most prior art has reported results on well-lit, well-aligned faces. For a 5-class problem, with fairly accurate registration, [1] demonstrated classification accuracies of around 60%.

Section 2 of this paper discusses face detection and landmark location. Sections 3 and 4 describe the features and modelling approaches used in our experiments. Section 5 provides the main experimental results and section 6 discusses what we can take from these, as well as offering some miscellaneous considerations.

2. FACE REGISTRATION
The first stage in any system of facial analysis involves locating and aligning the face. Methods which holistically combine locating a face with locating facial landmark points have clear appeal. In particular, deformable parts models (DPM), introduced in [8], have become one of the most popular approaches in the research community.

On the other hand, face detection and landmark location have both been extensively studied separately. Many excellent solutions have been proposed to both problems, and our own experience suggests that tackling the tasks separately may have advantages in some scenarios. In particular, a recent method [12] for facial landmark location has excellent performance on unconstrained images and is therefore well suited to the EmotiW2015 challenge. The method uses a sequence of gradient boosted regression models, where each stage refines the position estimate according to the results of many point-intensity comparisons. We use the implementation of [12] provided by the dlib library [13]. The model provided for use with this library was trained using data from the iBUG 300-W dataset and positions 68 points on frontal faces, similar to the baseline.
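
As a rough illustration of this registration stage, the sketch below pairs dlib's HoG-based frontal face detector (described in the next paragraph) with the regression-tree landmark estimator of [12], assuming dlib's Python bindings and the publicly distributed 68-point model trained on iBUG 300-W; the file name, function name and fallback behaviour are our own illustrative choices.

    import dlib
    import numpy as np

    # The 68-point model is distributed separately from the dlib library
    # itself; the filename below is illustrative.
    predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")
    detector = dlib.get_frontal_face_detector()  # HoG features + linear classifier

    def locate_landmarks(image):
        """Return a (68, 2) array of (x, y) landmark positions, or None."""
        faces = detector(image, 1)  # upsample once to help find smaller faces
        if len(faces) == 0:
            return None  # caller falls back to a manually placed bounding square
        shape = predictor(image, faces[0])
        return np.array([(p.x, p.y) for p in shape.parts()], dtype=np.float64)

    # points = locate_landmarks(dlib.load_rgb_image("challenge_image.png"))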
In [15], face detection based on rigid templates, similar to the classic method of [18], achieves comparable accuracy to a detector based on DPM [8], but the former has a substantial speed advantage. We choose the rigid template detector included in the dlib library [13] as a result of informal performance comparisons. This method uses histogram of oriented gradients [3] (HoG) features combined with a linear classifier. It has a very low false positive rate, although it fails to find a face in almost 12% of the challenge images. In these cases, we roughly position a bounding square manually to allow subsequent processing.²

² After our main experiments were complete, we found a combination of open source detectors could reduce the miss rate to 3%, and the landmark estimator is unlikely to perform well on the remaining difficult faces, regardless of initialization.

Figure 1: Example landmarks from baseline (left) and proposed system (right)

Figure 1 shows a representative comparison of the landmark points output by our system and those of the baseline system. To try to quantify the advantage here, we inspect the automatically located points for each image in the training set and give each one a subjective score based on how well they fit the face. In Figure 1 for example, we would consider the baseline points as 'close' and our points as 'very close'. The results from this exercise are shown in Table 1. It is clear from this that we have a better starting point for using shape information to estimate emotion – experiments in later sections quantify this further.

Table 1. Accuracy of baseline points versus proposed system

            Excluded   Fail   Poor   Close   Very Close
Baseline          67     55    173     663            0
Proposed           0      2     37     215          704

3. FEATURES
Our final system, i.e. challenge entry, uses only one very simple type of shape feature. However, we performed experiments with various features, which we describe in the following paragraphs.

3.1 Shape
Intuitively, given accurate facial landmark locations, we can infer a lot about facial expression – arguably more than is possible from only texture. We consider two simple types of shape feature, both derived from automatically located facial landmark locations. For both types, we first normalize the size of the face.

3.1.1 Distances between Points
We consider the distances between all distinct pairs of landmark points. We have 68 landmarks, giving 2278 unique pairs. Many of these pairs will contain no useful shape information and we could add heuristics to reduce this number considerably, although this is not necessary for the models we subsequently learn.

3.1.2 Axis Distances from Average
We speculate that the point-distances may not capture all shape information alone. We therefore test a second type of feature that considers the displacement from the average landmark location, where the average is taken from all faces in the training set. After up-righting the face, for each point, we take the x- and y-distances from the average location as feature values. This results in a vector of length 136.
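
Both shape feature types are simple to compute once the landmarks are available. The following minimal sketch (numpy; the function names are ours, and size normalization and up-righting are assumed to have been applied already) produces the 2278 pairwise distances and the 136 axis displacements described above.

    import numpy as np
    from itertools import combinations

    def point_distance_features(pts):
        """Distances between all distinct landmark pairs: 68*67/2 = 2278 values."""
        return np.array([np.linalg.norm(pts[i] - pts[j])
                         for i, j in combinations(range(len(pts)), 2)])

    def axis_distance_features(pts, mean_pts):
        """Signed x- and y-displacements from the average landmark locations,
        giving 2*68 = 136 values. `mean_pts` is the per-landmark mean over the
        (size-normalized, up-righted) training faces."""
        return (pts - mean_pts).ravel()

    # mean_pts = np.stack(training_landmarks).mean(axis=0)   # shape (68, 2)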

3.2 Texture
By including texture to complement the shape information, we hope to improve classification accuracy in our experiments. We note that the baseline system [5] is based entirely on texture features and many previous successful approaches have also used texture, e.g. [1].

3.2.1 Biologically Inspired Features (BIF)
BIF [10] are perhaps most well-known for the success they have achieved in facial age estimation. However, they have also demonstrated excellent performance in other face processing problems [16] and have been applied to the classification of facial expressions [14]. As a rich texture descriptor, based on a model of the human visual system, BIF would appear to represent a good candidate for this application.

Evaluation of BIF involves applying a bank of Gabor filters with different orientations and scales to each location in the face image. The responses of these filters are pooled over similar locations and scales via non-linear operators, maximum (MAX) or standard-deviation (STDDEV). In practice, the aligned face image is partitioned into overlapping rectangular regions for pooling, and the pooling operation introduces some tolerance to misalignment. Our implementation closely follows the description in [10]. We extract 60x60 face regions, aligned according to the automatically located landmarks. We use both MAX and STDDEV pooling, with 8 orientations and 8 bands. Our implementation then has 8640 feature values.
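
The pipeline can be sketched roughly as follows, using OpenCV Gabor kernels; the kernel sizes, the pooling grid and the use of non-overlapping regions are simplifying assumptions for illustration and are not tuned to reproduce the 8640-dimensional configuration described above.

    import cv2
    import numpy as np

    def gabor_bank(n_orientations=8, ksizes=(7, 9, 11, 13, 15, 17, 19, 21)):
        """A bank of Gabor kernels over orientations and scales (illustrative values)."""
        bank = []
        for ksize in ksizes:
            for k in range(n_orientations):
                theta = k * np.pi / n_orientations
                bank.append(cv2.getGaborKernel((ksize, ksize), sigma=ksize / 3.0,
                                               theta=theta, lambd=ksize / 2.0,
                                               gamma=0.5))
        return bank

    def bif_like_features(face_gray, bank, grid=6, pooling="max"):
        """Filter the aligned face and pool each response map over a coarse
        grid of regions with MAX or STDDEV."""
        img = face_gray.astype(np.float32)
        h, w = img.shape
        feats = []
        for kern in bank:
            resp = np.abs(cv2.filter2D(img, cv2.CV_32F, kern))
            for gy in range(grid):
                for gx in range(grid):
                    cell = resp[gy * h // grid:(gy + 1) * h // grid,
                                gx * w // grid:(gx + 1) * w // grid]
                    feats.append(cell.max() if pooling == "max" else cell.std())
        return np.array(feats)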
3.2.2 Point Texture Features
We speculate that we may gain more information from texture features that are more directly tied to the location of landmarks. We therefore also consider a second type, where the feature values are simply Gabor filter responses at different sizes and orientations, centered on each landmark location. We refer to these as 'point-texture features'. We evaluate filters at 8 scales and 12 orientations, giving a total of 6528 feature values for the 68 landmark points.
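
A corresponding sketch for these point-texture features, again with OpenCV Gabor kernels and an illustrative scale progression of our own choosing (8 scales x 12 orientations x 68 points = 6528 values):

    import cv2
    import numpy as np

    def point_texture_features(face_gray, pts, n_scales=8, n_orientations=12):
        """Gabor filter responses sampled at each landmark location:
        8 scales * 12 orientations * 68 points = 6528 feature values."""
        img = face_gray.astype(np.float32)
        xs = np.clip(pts[:, 0].astype(int), 0, img.shape[1] - 1)
        ys = np.clip(pts[:, 1].astype(int), 0, img.shape[0] - 1)
        feats = []
        for s in range(n_scales):
            ksize = 5 + 2 * s                      # illustrative scale progression
            for k in range(n_orientations):
                theta = k * np.pi / n_orientations
                kern = cv2.getGaborKernel((ksize, ksize), sigma=ksize / 3.0,
                                          theta=theta, lambd=ksize / 2.0, gamma=0.5)
                resp = cv2.filter2D(img, cv2.CV_32F, kern)
                feats.extend(resp[ys, xs])
        return np.array(feats)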

4. MODELLING
To construct predictive models using our features, we use two standard approaches from the machine learning literature: support vector machines (SVM) [11] and gradient boosting (GB) [11].
For the SVM classifiers, we use the implementation provided by libsvm [2] with an RBF kernel. We optimize the C and gamma parameters on the validation data via a grid search, as advocated by the authors of [2]. For the GB classifiers, we use our own implementation. We find that trees with two splits and a shrinkage factor of 0.1 generally work well on this problem, so we fix these parameters and optimize only the number of trees on the validation data.
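
This selection procedure can be sketched as follows, here substituting scikit-learn (whose SVC wraps libsvm) for the original tooling and a stock gradient-boosting implementation for our own; the grid values, the candidate tree counts and the max_leaf_nodes=3 encoding of 'two splits' are illustrative assumptions.

    import numpy as np
    from sklearn.svm import SVC
    from sklearn.ensemble import GradientBoostingClassifier
    from sklearn.metrics import accuracy_score

    def fit_svm(X_tr, y_tr, X_val, y_val):
        """RBF-kernel SVM; C and gamma chosen by grid search on validation data."""
        best, best_acc = None, -1.0
        for C in 2.0 ** np.arange(-3, 11, 2):
            for gamma in 2.0 ** np.arange(-13, 3, 2):
                clf = SVC(kernel="rbf", C=C, gamma=gamma).fit(X_tr, y_tr)
                acc = accuracy_score(y_val, clf.predict(X_val))
                if acc > best_acc:
                    best, best_acc = clf, acc
        return best

    def fit_gb(X_tr, y_tr, X_val, y_val, tree_counts=(100, 200, 400, 800)):
        """Boosted trees with shrinkage 0.1 and small (two-split) trees; only
        the number of trees is tuned on the validation data."""
        best, best_acc = None, -1.0
        for n in tree_counts:
            clf = GradientBoostingClassifier(n_estimators=n, learning_rate=0.1,
                                             max_leaf_nodes=3).fit(X_tr, y_tr)
            acc = accuracy_score(y_val, clf.predict(X_val))
            if acc > best_acc:
                best, best_acc = clf, acc
        return best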
5. EXPERIMENTS
In all of the following experiments, only the challenge training data are used to construct the model, with parameters optimized on the validation set. At one point, we experimented with combining the training and validation data and learning using this larger set. However, this did not result in an improvement in accuracy on the test data, so we did not pursue this approach or include the result.

5.1 Simple Shape-based Classifiers
We start using only the point-distance features described in 3.1.1. We learn SVM and GB classifiers, which give the performance figures shown in Table 2.

Table 2: Main performance figures

Model      Train    Validate   Test
GB         60.1%    40.8%      44.4%
SVM        52.1%    37.4%      46.8%
Baseline   -        36.0%      39.1%

A confusion matrix for the SVM classifier on the test data is shown in Table 3.

Table 3: Test data confusion matrix for challenge entry
(rows: truth; columns: estimate)

           Angry   Disgust   Fear   Happy   Neutral   Sad   Surprise
Angry         29         1      4       5        10     7         13
Disgust        3         0      0       6         4     4          0
Fear          13         0      1       3        13     6          5
Happy          5         0      0      67         7    16          0
Neutral        3         0      1       3        40     9          2
Sad            9         0      5       5        12    16          8
Surprise       8         0      2       0         6     0         21

5.2 Classifiers using Other Features
Taking each of the other three types of feature described in section 3 in turn, we add to the point-distance features. Surprisingly, in each case, we did not observe any improvement on validation data over using the point-distance features alone. For the features of 3.1.2, the accuracy on validation data actually dropped slightly. This could be a result of using a slightly different procedure for size normalization with these features. However, there is also a concern that the average point locations were not useful, due to the large variations in pose.

For the texture features of 3.2, the result was more surprising. We expected these to add some useful information, but this appeared not to be the case, despite their quantity far exceeding that of the distance features.

As a consequence, we conclude that the simple point-distance features already contain the most information relevant to the task. We use the SVM model from Table 2 as our challenge entry.

5.3 Improvement over Baseline
To quantify the advantage that our more accurate landmark locations bring over the baseline, we learn directly comparable models using both sets of points. For the baseline system, landmark points are not available for all images, because the face detector fails in some cases. For a fair comparison, we therefore use exactly the same subset of images across train/validation/test sets in both trials. Where no points exist for a test image, we assign a 'Neutral' label.

Table 4 shows the results of this comparison. From these we can conclude that the landmarks used in the proposed system provide a very clear advantage over those of the baseline system.

Table 4: Overall accuracy using baseline points and points from proposed system

Landmarks   Model   Train    Validate   Test
Proposed    GB      65.8%    40.4%      38.4%
Proposed    SVM     50.0%    41.2%      40.6%
Baseline    GB      53.9%    32.3%      31.2%
Baseline    SVM     47.9%    34.1%      27.7%

6. DISCUSSION
Considering the results of Table 3, performance on each class of emotion exhibits the same pattern seen in previous EmotiW challenges. Specifically, performance is promising for faces with neutral (SVM: 69%, GB: 64%), happy (71%, 62%), angry (42%, 46%), and surprise (57%, 43%) expressions. On the other hand, sad (29%, 25%) and fearful (2%, 17%) expressions are more difficult to distinguish. The subtleties of disgust (0%, 0%) might be impossible to detect using such simple features taken from static images. Indeed, this task is not only difficult for machines, but without contextual information it is also difficult for humans to distinguish disgust from other more prevalent emotions. Our overall accuracy is more than three times better than random guessing, representing a small improvement over the accuracy achieved in [1] on more constrained static images. The final system we propose achieves 47% accuracy on the test data, whilst the baseline achieves 39% accuracy.
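
As a quick check on these figures, the overall test accuracy of the challenge entry can be recomputed from the confusion matrix in Table 3 (the sum of the diagonal divided by the total number of test images) and compared with the seven-class chance level:

    import numpy as np

    # Rows are true labels, columns are estimates, both ordered:
    # angry, disgust, fear, happy, neutral, sad, surprise (Table 3).
    confusion = np.array([[29, 1, 4,  5, 10,  7, 13],
                          [ 3, 0, 0,  6,  4,  4,  0],
                          [13, 0, 1,  3, 13,  6,  5],
                          [ 5, 0, 0, 67,  7, 16,  0],
                          [ 3, 0, 1,  3, 40,  9,  2],
                          [ 9, 0, 5,  5, 12, 16,  8],
                          [ 8, 0, 2,  0,  6,  0, 21]])

    accuracy = np.trace(confusion) / confusion.sum()   # 174 / 372, roughly 0.47
    print(accuracy, 1.0 / 7)                           # about 0.47 versus 0.14 chance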
Comparing our SVM and GB classifiers, the former lead to slightly better results in most cases, whereas the latter are significantly simpler and faster to evaluate. However, model complexity differences become insignificant if texture features must be evaluated, as this dominates the time required to evaluate either type. The key advantage of our proposed system is that the distance features are trivial to evaluate in comparison to commonly used features such as BIF [10], LBP [17] or HoG [3].

The GB model allows the influence of its features to be examined and Figure 2 is a result of this analysis. The distance from the eyes to the corners of the mouth clearly has the greatest influence. This seems reasonable considering the degree to which a mouth is upturned or downturned is one of the clearest indicators of emotional state.
Figure 2 also includes distances indicative of eye and mouth openings, which are also intuitively discriminative.

Figure 2: From left to right, then top to bottom, the most influential distances in our gradient boosted model

Before concluding, we must note an observation that potentially affects the baseline accuracy. Almost all of the challenge images have an incorrect aspect ratio that results in elongated faces. We manually correct this prior to performing our experiments. If we instead use the images as provided, the face detector finds only around 60% of faces. Given that we are particularly interested in modelling shape here, it is important to work with consistent aspect ratios.

As a final comment, although the landmarks found by our system are more accurate than those in the baseline, there is still much scope for improvement. Given 100% accurate landmark locations, an interesting line of further work might be to tailor the modelling approach to the problem in an attempt to see just how far static shape alone can be used in estimating facial expression.

7. ACKNOWLEDGMENTS
Our thanks to Professor John A. Robinson for his support in this work.

8. REFERENCES
[1] Chew, S. W. et al. 2011. Person-independent facial expression detection using constrained local models. In Automatic Face & Gesture Recognition and Workshops (March 2011, Santa Barbara, California). FG'11, IEEE. 915-920.
[2] Chang, C.-C. and Lin, C.-J. 2011. LIBSVM: a library for support vector machines. ACM Trans. on Intelligent Systems and Technology. 2, 3 (2011). Software at http://www.csie.ntu.edu.tw/~cjlin/libsvm
[3] Dalal, N. and Triggs, B. 2005. Histograms of oriented gradients for human detection. In Computer Vision and Pattern Recognition (June 2005, San Diego, California). CVPR'05, IEEE. 886-893.
[4] Day, M. 2013. Emotion recognition with boosted tree classifiers. In International Conference on Multimodal Interaction (December 2013, Sydney, Australia). ICMI'13, ACM. 531-534.
[5] Dhall, A., Murthy, R., Goecke, R., Joshi, J. and Gedeon, T. 2015. Video and image based emotion recognition challenges in the wild: EmotiW 2015. In International Conference on Multimodal Interaction (November 2015, Seattle, Washington). ICMI'15, ACM.
[6] Dhall, A., Goecke, R., Lucey, S. and Gedeon, T. 2012. Collecting large, richly annotated facial-expression databases from movies. IEEE MultiMedia. 19 (2012). 34-41.
[7] Ekman, P. and Friesen, W. V. 1971. Constants across cultures in the face and emotion. Journal of Personality and Social Psychology. 17, 2 (1971). 124.
[8] Felzenszwalb, P. et al. 2010. Object detection with discriminatively trained part-based models. IEEE Trans. on Pattern Analysis and Machine Intelligence. 32, 9 (2010). 1627-1645.
[9] Friesen, W. and Ekman, P. 1983. EMFACS-7: Emotional Facial Action Coding System. Unpublished manual, University of California, California.
[10] Guo, G. et al. 2009. Human age estimation using bio-inspired features. In Computer Vision and Pattern Recognition (June 2009, Miami Beach, Florida). CVPR'09, IEEE. 112-119.
[11] Hastie, T., Tibshirani, R. and Friedman, J. 2009. The Elements of Statistical Learning: Data Mining, Inference and Prediction. New York: Springer.
[12] Kazemi, V. and Sullivan, J. 2014. One millisecond face alignment with an ensemble of regression trees. In Computer Vision and Pattern Recognition (June 2014, Columbus, Ohio). CVPR'14, IEEE. 1867-1874.
[13] King, D. E. 2009. Dlib-ml: A machine learning toolkit. Journal of Machine Learning Research. 10 (2009). 1755-1758.
[14] Lihua, G. 2011. Smile expression classification using the improved BIF feature. In International Conference on Image and Graphics (August 2011, Hefei, China). IEEE. 783-788.
[15] Mathias, M. et al. 2014. Face detection without bells and whistles. In Computer Vision – ECCV 2014. Springer International Publishing. 720-735.
[16] Meyers, E. and Wolf, L. 2008. Using biologically inspired features for face processing. International Journal of Computer Vision. 76, 1 (2008). 93-104.
[17] Ojala, T., Pietikäinen, M. and Mäenpää, T. 2002. Multiresolution gray-scale and rotation invariant texture classification with local binary patterns. IEEE Trans. on Pattern Analysis and Machine Intelligence. 24, 7 (2002). 971-987.
[18] Viola, P. and Jones, M. J. 2004. Robust real-time face detection. International Journal of Computer Vision. 57, 2 (2004). 137-154.
