

Deep Facial Expression Recognition: A Survey


Shan Li and Weihong Deng∗ , Member, IEEE

Abstract—With the transition of facial expression recognition (FER) from laboratory-controlled to challenging in-the-wild conditions
and the recent success of deep learning techniques in various fields, deep neural networks have increasingly been leveraged to learn
discriminative representations for automatic FER. Recent deep FER systems generally focus on two important issues: overfitting
caused by a lack of sufficient training data and expression-unrelated variations, such as illumination, head pose and identity bias. In this
paper, we provide a comprehensive survey on deep FER, including datasets and algorithms that provide insights into these intrinsic
problems. First, we introduce the available datasets that are widely used in the literature and provide accepted data selection and
evaluation principles for these datasets. We then describe the standard pipeline of a deep FER system with the related background
knowledge and suggestions of applicable implementations for each stage. For the state of the art in deep FER, we review existing
novel deep neural networks and related training strategies that are designed for FER based on both static images and dynamic image
sequences, and discuss their advantages and limitations. Competitive performances on widely used benchmarks are also summarized
in this section. We then extend our survey to additional related issues and application scenarios. Finally, we review the remaining
challenges and corresponding opportunities in this field as well as future directions for the design of robust deep FER systems.

Index Terms—Facial Expression Recognition, Facial expression datasets, Affect, Deep Learning, Survey.

1 INTRODUCTION

FACIAL expression is one of the most powerful, natural and universal signals for human beings to convey their emotional states and intentions [1], [2]. Numerous studies have been conducted on automatic facial expression analysis because of its practical importance in sociable robotics, medical treatment, driver fatigue surveillance, and many other human-computer interaction systems. In the field of computer vision and machine learning, various facial expression recognition (FER) systems have been explored to encode expression information from facial representations. As early as the twentieth century, Ekman and Friesen [3] defined six basic emotions based on a cross-culture study [4], which indicated that humans perceive certain basic emotions in the same way regardless of culture. These prototypical facial expressions are anger, disgust, fear, happiness, sadness, and surprise. Contempt was subsequently added as one of the basic emotions [5]. Recently, advanced research on neuroscience and psychology argued that the model of six basic emotions is culture-specific and not universal [6].

Although the affect model based on basic emotions is limited in its ability to represent the complexity and subtlety of our daily affective displays [7], [8], [9], and other emotion description models, such as the Facial Action Coding System (FACS) [10] and the continuous model using affect dimensions [11], are considered to represent a wider range of emotions, the categorical model that describes emotions in terms of discrete basic emotions is still the most popular perspective for FER, due to its pioneering investigations along with the direct and intuitive definition of facial expressions. In this survey, we limit our discussion to FER based on the categorical model.

FER systems can be divided into two main categories according to the feature representations: static image FER and dynamic sequence FER. In static-based methods [12], [13], [14], the feature representation is encoded with only spatial information from the current single image, whereas dynamic-based methods [15], [16], [17] consider the temporal relation among contiguous frames in the input facial expression sequence. Based on these two vision-based methods, other modalities, such as audio and physiological channels, have also been used in multimodal systems [18] to assist the recognition of expression.

The majority of the traditional methods have used handcrafted features or shallow learning (e.g., local binary patterns (LBP) [12], LBP on three orthogonal planes (LBP-TOP) [15], non-negative matrix factorization (NMF) [19] and sparse learning [20]) for FER. However, since 2013, emotion recognition competitions such as FER2013 [21] and Emotion Recognition in the Wild (EmotiW) [22], [23], [24] have collected relatively sufficient training data from challenging real-world scenarios, which implicitly promote the transition of FER from lab-controlled to in-the-wild settings. Meanwhile, owing to dramatically increased chip processing abilities (e.g., GPU units) and well-designed network architectures, studies in various fields have begun to transfer to deep learning methods, which have achieved state-of-the-art recognition accuracy and exceeded previous results by a large margin (e.g., [25], [26], [27], [28]). Likewise, given more effective training data of facial expression, deep learning techniques have increasingly been implemented to handle the challenging factors for emotion recognition in the wild. Figure 1 illustrates this evolution of FER in terms of algorithms and datasets.

Exhaustive surveys on automatic expression analysis have been published in recent years [7], [8], [29], [30]. These surveys have established a set of standard algorithmic pipelines for FER. However, they focus on traditional methods, and deep learning has rarely been reviewed. Very recently, FER based on deep learning has been surveyed in [31], which is a brief review without introductions on FER datasets and technical details on deep FER. Therefore, in this paper, we conduct a systematic study of deep learning for FER tasks based on both static images and videos (image sequences). We aim to give a newcomer to this field an overview of the systematic framework and prime skills for deep FER.

• The authors are with the Pattern Recognition and Intelligent System Laboratory, School of Information and Communication Engineering, Beijing University of Posts and Telecommunications, Beijing, 100876, China. E-mail: {ls1995, whdeng}@bupt.edu.cn.

Fig. 1. The evolution of facial expression recognition in terms of datasets and methods. (The timeline runs from LBP-TOP [15] and LBP [12] with CK+ and MMI around 2007–2009, through NMF [19] and sparse learning [20] around 2011, to deep methods after 2013: Tang's CNN [130] (winner of FER2013), Kahou et al. [57] (CNN, DBN, DAE; winner of EmotiW 2013), Fan et al. [108] (CNN-LSTM, C3D; winner of EmotiW 2016), and recent designs such as HoloNet, PPDN, IACNN, FaceNet2ExpNet, LP loss, tuplet cluster loss and island loss, alongside the FER2013, EmotiW, EmotioNet, RAF-DB and AffectNet datasets.)

Despite the powerful feature learning ability of deep learning, problems remain when applied to FER. First, deep neural networks require a large amount of training data to avoid overfitting. However, the existing facial expression databases are not sufficient to train the well-known neural networks with deep architectures that achieved the most promising results in object recognition tasks. Additionally, high inter-subject variations exist due to different personal attributes, such as age, gender, ethnic background and level of expressiveness [32]. In addition to subject identity bias, variations in pose, illumination and occlusion are common in unconstrained facial expression scenarios. These factors are non-linearly coupled with facial expressions and therefore strengthen the requirement of deep networks to address the large intra-class variability and to learn effective expression-specific representations.

In this paper, we introduce recent advances in research on solving the above problems for deep FER. We examine the state-of-the-art results that have not been reviewed in previous survey papers. The rest of this paper is organized as follows. Frequently used expression databases are introduced in Section 2. Section 3 identifies three main steps required in a deep FER system and describes the related background. Section 4 provides a detailed review of novel neural network architectures and special network training tricks designed for FER based on static images and dynamic image sequences. We then cover additional related issues and other practical scenarios in Section 5. Section 6 discusses some of the challenges and opportunities in this field and identifies potential future directions.

2 FACIAL EXPRESSION DATABASES

Having sufficient labeled training data that include as many variations of the populations and environments as possible is important for the design of a deep expression recognition system. In this section, we discuss the publicly available databases that contain basic expressions and that are widely used in our reviewed papers for deep learning algorithm evaluation. We also introduce newly released databases that contain a large amount of affective images collected from the real world to benefit the training of deep neural networks. Table 1 provides an overview of these datasets, including the main reference, number of subjects, number of image or video samples, collection environment, expression distribution and additional information.

CK+ [33]: The Extended Cohn-Kanade (CK+) database is the most extensively used laboratory-controlled database for evaluating FER systems. CK+ contains 593 video sequences from 123 subjects. The sequences vary in duration from 10 to 60 frames and show a shift from a neutral facial expression to the peak expression. Among these videos, 327 sequences from 118 subjects are labeled with seven basic expression labels (anger, contempt, disgust, fear, happiness, sadness, and surprise) based on the Facial Action Coding System (FACS). Because CK+ does not provide specified training, validation and test sets, the algorithms evaluated on this database are not uniform. For static-based methods, the most common data selection method is to extract the last one to three frames with peak formation and the first frame (neutral face) of each sequence. Then, the subjects are divided into n groups for person-independent n-fold cross-validation experiments, where commonly selected values of n are 5, 8 and 10.

MMI [34], [35]: The MMI database is laboratory-controlled and includes 326 sequences from 32 subjects. A total of 213 sequences are labeled with six basic expressions (without "contempt"), and 205 sequences are captured in frontal view. In contrast to CK+, sequences in MMI are onset-apex-offset labeled, i.e., the sequence begins with a neutral expression and reaches the peak near the middle before returning to the neutral expression. Furthermore, MMI has more challenging conditions, i.e., there are large inter-personal variations because subjects perform the same expression non-uniformly and many of them wear accessories (e.g., glasses, mustache). For experiments, the most common method is to choose the first frame (neutral face) and the three peak frames in each frontal sequence to conduct person-independent 10-fold cross-validation.

JAFFE [36]: The Japanese Female Facial Expression (JAFFE) database is a laboratory-controlled image database that contains 213 samples of posed expressions from 10 Japanese females. Each person has 3 to 4 images with each of six basic facial expressions (anger, disgust, fear, happiness, sadness, and surprise) and one image with a neutral expression. The database is challenging because it contains few examples per subject/expression. Typically, all the images are used for the leave-one-subject-out experiment.

TFD [37]: The Toronto Face Database (TFD) is an amalgamation of several facial expression datasets. TFD contains 112,234 images, 4,178 of which are annotated with one of seven expression labels: anger, disgust, fear, happiness, sadness, surprise and neutral. The faces have already been detected and normalized to a size of 48x48 such that all the subjects' eyes are the same distance apart and have the same vertical coordinates. Five official folds are provided in TFD; each fold contains a training, validation, and test set consisting of 70%, 10%, and 20% of the images, respectively.

FER2013 [21]: The FER2013 database was introduced during the ICML 2013 Challenges in Representation Learning. FER2013 is a large-scale and unconstrained database collected automatically by the Google image search API. All images have been registered and resized to 48x48 pixels after rejecting wrongfully labeled frames and adjusting the cropped region. FER2013 contains 28,709 training images, 3,589 validation images and 3,589 test images with seven expression labels (anger, disgust, fear, happiness, sadness, surprise and neutral).
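As a concrete illustration of the subject-independent protocol described above for CK+ and similar lab-controlled datasets (keep the first neutral frame plus the last peak frames of each labeled sequence, then split by subject for n-fold cross-validation), the following Python sketch shows one possible way to build such folds. The per-sequence dictionary layout and helper function are assumptions for illustration, not part of any official release.

```python
# Hypothetical sketch of the CK+-style selection/evaluation protocol described above:
# keep the first (neutral) frame and last three (peak) frames of every labeled sequence,
# then split by SUBJECT so no person appears in both training and test folds.
import random
from collections import defaultdict

def build_folds(sequences, n_folds=10, seed=0):
    """sequences: list of dicts like
    {"subject": "S005", "label": "anger", "frames": [img0, ..., imgN]} (assumed layout)."""
    samples_by_subject = defaultdict(list)
    for seq in sequences:
        frames = seq["frames"]
        selected = [frames[0]] + frames[-3:]           # neutral face + three peak frames
        labels = ["neutral"] + [seq["label"]] * 3
        samples_by_subject[seq["subject"]].extend(zip(selected, labels))

    subjects = sorted(samples_by_subject)
    random.Random(seed).shuffle(subjects)
    folds = [subjects[i::n_folds] for i in range(n_folds)]  # disjoint subject groups

    for k in range(n_folds):
        test_subjects = set(folds[k])
        train = [s for subj in subjects if subj not in test_subjects
                 for s in samples_by_subject[subj]]
        test = [s for subj in test_subjects for s in samples_by_subject[subj]]
        yield train, test
```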

TABLE 1
An overview of the facial expression datasets. P = posed; S = spontaneous; Condit. = Collection condition; Elicit. = Elicitation method.

Database | Samples | Subjects | Condit. | Elicit. | Expression distribution | Access
CK+ [33] | 593 image sequences | 123 | Lab | P & S | 6 basic expressions plus contempt and neutral | http://www.consortium.ri.cmu.edu/ckagree/
MMI [34], [35] | 740 images and 2,900 videos | 25 | Lab | P | 6 basic expressions plus neutral | https://mmifacedb.eu/
JAFFE [36] | 213 images | 10 | Lab | P | 6 basic expressions plus neutral | http://www.kasrl.org/jaffe.html
TFD [37] | 112,234 images | N/A | Lab | P | 6 basic expressions plus neutral | [email protected]
FER-2013 [21] | 35,887 images | N/A | Web | P & S | 6 basic expressions plus neutral | https://www.kaggle.com/c/challenges-in-representation-learning-facial-expression-recognition-challenge
AFEW 7.0 [24] | 1,809 videos | N/A | Movie | P & S | 6 basic expressions plus neutral | https://sites.google.com/site/emotiwchallenge/
SFEW 2.0 [22] | 1,766 images | N/A | Movie | P & S | 6 basic expressions plus neutral | https://cs.anu.edu.au/few/emotiw2015.html
Multi-PIE [38] | 755,370 images | 337 | Lab | P | Smile, surprised, squint, disgust, scream and neutral | http://www.flintbox.com/public/project/4742/
BU-3DFE [39] | 2,500 images | 100 | Lab | P | 6 basic expressions plus neutral | http://www.cs.binghamton.edu/~lijun/Research/3DFE/3DFE_Analysis.html
Oulu-CASIA [40] | 2,880 image sequences | 80 | Lab | P | 6 basic expressions | http://www.cse.oulu.fi/CMV/Downloads/Oulu-CASIA
RaFD [41] | 1,608 images | 67 | Lab | P | 6 basic expressions plus contempt and neutral | http://www.socsci.ru.nl:8180/RaFD2/RaFD
KDEF [42] | 4,900 images | 70 | Lab | P | 6 basic expressions plus neutral | http://www.emotionlab.se/kdef/
EmotioNet [43] | 1,000,000 images | N/A | Web | P & S | 23 basic expressions or compound expressions | http://cbcsl.ece.ohio-state.edu/dbform_emotionet.html
RAF-DB [44], [45] | 29,672 images | N/A | Web | P & S | 6 basic expressions plus neutral and 12 compound expressions | http://www.whdeng.cn/RAF/model1.html
AffectNet [46] | 450,000 images (labeled) | N/A | Web | P & S | 6 basic expressions plus neutral | http://mohammadmahoor.com/databases-codes/
ExpW [47] | 91,793 images | N/A | Web | P & S | 6 basic expressions plus neutral | http://mmlab.ie.cuhk.edu.hk/projects/socialrelation/index.html

AFEW [48]: The Acted Facial Expressions in the Wild (AFEW) database was first established and introduced in [49] and has served as an evaluation platform for the annual Emotion Recognition in the Wild Challenge (EmotiW) since 2013. AFEW contains video clips collected from different movies with spontaneous expressions, various head poses, occlusions and illuminations. AFEW is a temporal and multimodal database that provides vastly different environmental conditions in both audio and video. Samples are labeled with seven expressions: anger, disgust, fear, happiness, sadness, surprise and neutral. The annotation of expressions has been continuously updated, and reality TV show data have been continuously added. The AFEW 7.0 used in EmotiW 2017 [24] is divided into three data partitions in a manner that is independent in terms of subject and movie/TV source: Train (773 samples), Val (383 samples) and Test (653 samples), which ensures that data in the three sets belong to mutually exclusive movies and actors.

SFEW [50]: The Static Facial Expressions in the Wild (SFEW) database was created by selecting static frames from the AFEW database by computing key frames based on facial point clustering. The most commonly used version, SFEW 2.0, was the benchmarking data for the SReco sub-challenge in EmotiW 2015 [22]. SFEW 2.0 has been divided into three sets: Train (958 samples), Val (436 samples) and Test (372 samples). Each of the images is assigned to one of seven expression categories, i.e., anger, disgust, fear, neutral, happiness, sadness, and surprise. The expression labels of the training and validation sets are publicly available, whereas those of the testing set are held back by the challenge organizer.

Multi-PIE [38]: The CMU Multi-PIE database contains 755,370 images from 337 subjects under 15 viewpoints and 19 illumination conditions in up to four recording sessions. Each facial image is labeled with one of six expressions: disgust, neutral, scream, smile, squint and surprise. This dataset is typically used for multiview facial expression analysis.

BU-3DFE [39]: The Binghamton University 3D Facial Expression (BU-3DFE) database contains 606 facial expression sequences captured from 100 people. For each subject, six universal facial expressions (anger, disgust, fear, happiness, sadness and surprise) are elicited in various manners with multiple intensities. Similar to Multi-PIE, this dataset is typically used for multiview 3D facial expression analysis.

Oulu-CASIA [40]: The Oulu-CASIA database includes 2,880 image sequences collected from 80 subjects labeled with six basic emotion labels: anger, disgust, fear, happiness, sadness, and surprise. Each of the videos is captured with one of two imaging systems, i.e., near-infrared (NIR) or visible light (VIS), under three different illumination conditions. Similar to CK+, the first frame is neutral and the last frame has the peak expression. Typically, only the last three peak frames and the first frame (neutral face) from the 480 videos collected by the VIS system under normal indoor illumination are employed for 10-fold cross-validation experiments.

RaFD [41]: The Radboud Faces Database (RaFD) is laboratory-controlled and has a total of 1,608 images from 67 subjects with three different gaze directions, i.e., front, left and right. Each sample is labeled with one of eight expressions: anger, contempt, disgust, fear, happiness, sadness, surprise and neutral.

KDEF [42]: The laboratory-controlled Karolinska Directed Emotional Faces (KDEF) database was originally developed for use in psychological and medical research. KDEF consists of images from 70 actors captured from five different angles and labeled with the six basic facial expressions plus neutral.

In addition to these commonly used datasets for basic emotion recognition, several well-established and large-scale publicly available facial expression databases collected from the Internet, which are suitable for training deep neural networks, have emerged in the last two years.

EmotioNet [43]: EmotioNet is a large-scale database with one million facial expression images collected from the Internet. A total of 950,000 images were annotated by the automatic action unit (AU) detection model in [43], and the remaining 25,000 images were manually annotated with 11 AUs. The second track of the EmotioNet Challenge [51] provides six basic expressions and ten compound expressions [52], and 2,478 images with expression labels are available.

RAF-DB [44], [45]: The Real-world Affective Face Database (RAF-DB) is a real-world database that contains 29,672 highly diverse facial images downloaded from the Internet. With manually crowd-sourced annotation and reliable estimation, seven basic and eleven compound emotion labels are provided for the samples. Specifically, 15,339 images from the basic emotion set are divided into two groups (12,271 training samples and 3,068 testing samples) for evaluation.

AffectNet [46]: AffectNet contains more than one million images from the Internet that were obtained by querying different search engines using emotion-related tags. It is by far the largest database that provides facial expressions in two different emotion models (categorical model and dimensional model), of which 450,000 images have manually annotated labels for eight basic expressions.

ExpW [47]: The Expression in-the-Wild Database (ExpW) contains 91,793 faces downloaded using Google image search. Each of the face images was manually annotated as one of the seven basic expression categories. Non-face images were removed in the annotation process.

3 DEEP FACIAL EXPRESSION RECOGNITION

In this section, we describe the three main steps that are common in automatic deep FER, i.e., pre-processing, deep feature learning and deep feature classification. We briefly summarize the widely used algorithms for each step and recommend the existing state-of-the-art best-practice implementations according to the referenced papers.

3.1 Pre-processing

Variations that are irrelevant to facial expressions, such as different backgrounds, illuminations and head poses, are fairly common in unconstrained scenarios. Therefore, before training the deep neural network to learn meaningful features, pre-processing is required to align and normalize the visual semantic information conveyed by the face.

3.1.1 Face alignment

Face alignment is a traditional pre-processing step in many face-related recognition tasks. We list some well-known approaches and publicly available implementations that are widely used in deep FER.

Given a series of training data, the first step is to detect the face and then to remove background and non-face areas. The Viola-Jones (V&J) face detector [72] is a classic and widely employed implementation for face detection, which is robust and computationally simple for detecting near-frontal faces. Although face detection is the only indispensable procedure to enable feature learning, further face alignment using the coordinates of localized landmarks can substantially enhance the FER performance [14]. This step is crucial because it can reduce the variation in face scale and in-plane rotation. Table 2 compares facial landmark detection algorithms widely used in deep FER in terms of efficiency and performance. The Active Appearance Model (AAM) [53] is a classic generative model that optimizes the required parameters from holistic facial appearance and global shape patterns. Among discriminative models, the mixtures of trees (MoT) structured models [56] and the discriminative response map fitting (DRMF) [59] use part-based approaches that represent the face via the local appearance information around each landmark. Furthermore, a number of discriminative models directly use a cascade of regression functions to map the image appearance to landmark locations and have shown better results, e.g., the supervised descent method (SDM) [62] implemented in IntraFace [73], face alignment at 3000 fps [64], and incremental face alignment [65]. Recently, deep networks have been widely exploited for face alignment. Cascaded CNN [67] is an early work that predicts landmarks in a cascaded way. Based on this, the Tasks-Constrained Deep Convolutional Network (TCDCN) [74] and the Multi-task CNN (MTCNN) [69] further leverage multi-task learning to improve the performance. In general, cascaded regression has become the most popular and state-of-the-art approach for face alignment owing to its high speed and accuracy.

In contrast to using only one detector for face alignment, some methods have proposed combining multiple detectors for better landmark estimation when processing faces in challenging unconstrained environments. Yu et al. [75] concatenated three different facial landmark detectors to complement each other. Kim et al. [76] considered different inputs (the original image and a histogram-equalized image) and different face detection models (V&J [72] and MoT [56]), and the landmark set with the highest confidence provided by IntraFace [73] was selected.

TABLE 2
Summary of different types of face alignment detectors that are widely used in deep FER models.

Type | Detector | # points | Real-time | Speed | Performance | Used in
Holistic | AAM [53] | 68 | ✗ | fair | poor generalization | [54], [55]
Part-based | MoT [56] | 39/68 | ✗ | slow | good | [57], [58]
Part-based | DRMF [59] | 66 | ✗ | fast | good | [60], [61]
Cascaded regression | SDM [62] | 49 | ✓ | fast/very fast | good/very good | [16], [63]
Cascaded regression | 3000 fps [64] | 68 | ✓ | fast/very fast | good/very good | [55]
Cascaded regression | Incremental [65] | 49 | ✓ | fast/very fast | good/very good | [66]
Deep learning | Cascaded CNN [67] | 5 | ✓ | fast | good/very good | [68]
Deep learning | MTCNN [69] | 5 | ✓ | fast | good/very good | [70], [71]
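To make the alignment step concrete, the sketch below performs a common landmark-based similarity alignment: the face is rotated and scaled so that the two eye centers land on fixed positions in the output crop. It is only an illustrative OpenCV example, not the procedure of any specific detector in Table 2; the eye coordinates are assumed to come from one of those landmark detectors, and the canonical eye positions are arbitrary choices.

```python
# Minimal sketch: align a face from two eye centers (landmarks assumed given).
import cv2
import numpy as np

def align_face(image, left_eye, right_eye, out_size=224, eye_y=0.35, eye_dist=0.45):
    """Rotate/scale `image` so the eyes sit at fixed positions in the output crop."""
    (lx, ly), (rx, ry) = left_eye, right_eye
    angle = np.degrees(np.arctan2(ry - ly, rx - lx))        # in-plane rotation of the eye line
    scale = (eye_dist * out_size) / np.hypot(rx - lx, ry - ly)
    eyes_center = ((lx + rx) / 2.0, (ly + ry) / 2.0)

    M = cv2.getRotationMatrix2D(eyes_center, angle, scale)
    # shift the eye midpoint to its canonical location in the output image
    M[0, 2] += out_size * 0.5 - eyes_center[0]
    M[1, 2] += out_size * eye_y - eyes_center[1]
    return cv2.warpAffine(image, M, (out_size, out_size), flags=cv2.INTER_LINEAR)
```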

Fig. 2. The general pipeline of deep facial expression recognition systems. (The diagram shows input images and sequences passing through pre-processing — data augmentation, face alignment and normalization — then through deep networks such as a CNN, DBN, DAE, RNN or GAN, and finally being classified into one of the emotion labels: anger, contempt, disgust, fear, happiness, neutral, sadness, surprise.)

3.1.2 Data augmentation

Deep neural networks require sufficient training data to ensure generalizability to a given recognition task. However, most publicly available databases for FER do not have a sufficient quantity of images for training. Therefore, data augmentation is a vital step for deep FER. Data augmentation techniques can be divided into two groups: on-the-fly data augmentation and offline data augmentation.

Usually, on-the-fly data augmentation is embedded in deep learning toolkits to alleviate overfitting. During the training step, the input samples are randomly cropped from the four corners and center of the image and then flipped horizontally, which can result in a dataset that is ten times larger than the original training data. Two common prediction modes are adopted during testing: only the center patch of the face is used for prediction (e.g., [61], [77]), or the prediction value is averaged over all ten crops (e.g., [76], [78]).

Besides the elementary on-the-fly data augmentation, various offline data augmentation operations have been designed to further expand the data in both size and diversity. The most frequently used operations include random perturbations and transforms, e.g., rotation, shifting, skew, scaling, noise, contrast and color jittering. For example, common noise models such as salt & pepper and speckle noise [79] and Gaussian noise [80], [81] are employed to enlarge the data size. For contrast transformation, the saturation and value (S and V components of the HSV color space) of each pixel are changed [70] for data augmentation. Combinations of multiple operations can generate more unseen training samples and make the network more robust to deviated and rotated faces. In [82], the authors applied five image appearance filters (disk, average, Gaussian, unsharp and motion filters) and six affine transform matrices that were formalized by adding slight geometric transformations to the identity matrix. In [75], a more comprehensive affine transform matrix was proposed to randomly generate images that varied in terms of rotation, skew and scale. Furthermore, deep learning based technology can be applied for data augmentation. For example, a synthetic data generation system with a 3D convolutional neural network (CNN) was created in [83] to confidently create faces with different levels of saturation in expression, and the generative adversarial network (GAN) [84] can also be applied to augment data by generating diverse appearances varying in poses and expressions (see Section 4.1.7).
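A minimal sketch of the on-the-fly scheme described above — random crop plus horizontal flip during training, ten-crop averaging at test time — written with torchvision transforms. The 48x48/44x44 sizes are illustrative assumptions, not values prescribed by the surveyed methods.

```python
# Sketch of on-the-fly augmentation for FER (assumed 48x48 inputs, 44x44 crops).
import torch
from torchvision import transforms

train_tf = transforms.Compose([
    transforms.RandomCrop(44),            # random corner/center crops
    transforms.RandomHorizontalFlip(),    # mirror the face
    transforms.ToTensor(),
])

# Ten-crop testing: four corners + center, each also flipped; average the scores.
test_tf = transforms.Compose([
    transforms.TenCrop(44),
    transforms.Lambda(lambda crops: torch.stack([transforms.ToTensor()(c) for c in crops])),
])

def predict_ten_crop(model, pil_image):
    crops = test_tf(pil_image)            # (10, C, 44, 44)
    with torch.no_grad():
        scores = model(crops)             # (10, num_classes)
    return scores.mean(dim=0)             # average the prediction over the ten crops
```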
3.1.3 Face normalization

Variations in illumination and head pose can introduce large changes in images and hence impair the FER performance. Therefore, we introduce two typical face normalization methods to ameliorate these variations: illumination normalization and pose normalization (frontalization).

Illumination normalization: Illumination and contrast can vary in different images even from the same person with the same expression, especially in unconstrained environments, which can result in large intra-class variances. In [60], several frequently used illumination normalization algorithms, namely, isotropic diffusion (IS)-based normalization, discrete cosine transform (DCT)-based normalization [85] and difference of Gaussians (DoG), were evaluated for illumination normalization. And [86] employed homomorphic filtering based normalization, which has been reported to yield the most consistent results among all other techniques, to remove illumination effects. Furthermore, related studies have shown that histogram equalization combined with illumination normalization results in better face recognition performance than that achieved using illumination normalization on its own. Many studies in the deep FER literature (e.g., [75], [79], [87], [88]) have employed histogram equalization to increase the global contrast of images for pre-processing. This method is effective when the brightness of the background and foreground are similar. However, directly applying histogram equalization may overemphasize local contrast. To solve this problem, [89] proposed a weighted summation approach to combine histogram equalization and linear mapping. And in [79], the authors compared three different methods: global contrast normalization (GCN), local normalization, and histogram equalization. GCN and histogram equalization were reported to achieve the best accuracy for the training and testing steps, respectively.
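As a simple illustration of the histogram-based pre-processing discussed above, the snippet below applies plain histogram equalization and, as a milder alternative that limits the over-emphasis of local contrast, CLAHE. This is a generic OpenCV example and an assumption of ours, not the exact procedure of any cited work (which instead combine equalization with linear mapping or other normalizations).

```python
# Illustrative illumination normalization for a grayscale face crop (8-bit).
import cv2

def equalize(gray_face):
    """Global histogram equalization (may over-emphasize local contrast)."""
    return cv2.equalizeHist(gray_face)

def clahe(gray_face, clip_limit=2.0, tile_grid_size=(8, 8)):
    """Contrast-limited adaptive histogram equalization as a gentler alternative."""
    op = cv2.createCLAHE(clipLimit=clip_limit, tileGridSize=tile_grid_size)
    return op.apply(gray_face)
```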

Pose normalization: Considerable pose variation is another common and intractable problem in unconstrained settings. Some studies have employed pose normalization techniques to yield frontal facial views for FER (e.g., [90], [91]), among which the most popular was proposed by Hassner et al. [92]. Specifically, after localizing facial landmarks, a 3D texture reference model generic to all faces is generated to efficiently estimate visible facial components. Then, the initial frontalized face is synthesized by back-projecting each input face image to the reference coordinate system. Alternatively, Sagonas et al. [93] proposed an effective statistical model to simultaneously localize landmarks and convert facial poses using only frontal faces. Very recently, a series of GAN-based deep models were proposed for frontal view synthesis (e.g., FF-GAN [94], TP-GAN [95] and DR-GAN [96]) and report promising performances.

3.2 Deep networks for feature learning

Deep learning has recently become a hot research topic and has achieved state-of-the-art performance in a variety of applications [97]. Deep learning attempts to capture high-level abstractions through hierarchical architectures of multiple nonlinear transformations and representations. In this section, we briefly introduce some deep learning techniques that have been applied for FER. The traditional architectures of these deep neural networks are shown in Fig. 2.

3.2.1 Convolutional neural network (CNN)

The CNN has been extensively used in diverse computer vision applications, including FER. At the beginning of the 21st century, several studies in the FER literature [98], [99] found that the CNN is robust to face location changes and scale variations and behaves better than the multilayer perceptron (MLP) in the case of previously unseen face pose variations. [100] employed the CNN to address the problems of subject independence as well as translation, rotation, and scale invariance in the recognition of facial expressions.

A CNN consists of three types of heterogeneous layers: convolutional layers, pooling layers, and fully connected layers. The convolutional layer has a set of learnable filters to convolve through the whole input image and produce various specific types of activation feature maps. The convolution operation is associated with three main benefits: local connectivity, which learns correlations among neighboring pixels; weight sharing in the same feature map, which greatly reduces the number of parameters to be learned; and shift-invariance to the location of the object. The pooling layer follows the convolutional layer and is used to reduce the spatial size of the feature maps and the computational cost of the network. Average pooling and max pooling are the two most commonly used nonlinear down-sampling strategies for translation invariance. The fully connected layer is usually included at the end of the network to ensure that all neurons in the layer are fully connected to activations in the previous layer and to enable the 2D feature maps to be converted into 1D feature maps for further feature representation and classification.

We list the configurations and characteristics of some well-known CNN models that have been applied for FER in Table 3. Besides these networks, several well-known derived frameworks also exist. In [101], [102], region-based CNN (R-CNN) [103] was used to learn features for FER. In [104], Faster R-CNN [105] was used to identify facial expressions by generating high-quality region proposals. Moreover, Ji et al. proposed 3D CNN [106] to capture motion information encoded in multiple adjacent frames for action recognition via 3D convolutions. Tran et al. [107] proposed the well-designed C3D, which exploits 3D convolutions on large-scale supervised training datasets to learn spatio-temporal features. Many related studies (e.g., [108], [109]) have employed this network for FER involving image sequences.

TABLE 3
Comparison of CNN models and their achievements. DA = Data augmentation; BN = Batch normalization.

 | AlexNet [25] | VGGNet [26] | GoogleNet [27] | ResNet [28]
Year | 2012 | 2014 | 2014 | 2015
# of layers† | 5+3 | 13/16 + 3 | 21+1 | 151+1
Kernel size‡ | 11, 5, 3 | 3 | 7, 1, 3, 5 | 7, 1, 3, 5
DA | ✓ | ✓ | ✓ | ✓
Dropout | ✓ | ✓ | ✓ | ✓
Inception | ✗ | ✗ | ✓ | ✗
BN | ✗ | ✗ | ✗ | ✓
Used in | [110] | [78], [111] | [17], [78] | [91], [112]
† number of convolutional layers + fully connected layers. ‡ size of the convolution kernels.

3.2.2 Deep belief network (DBN)

The DBN proposed by Hinton et al. [113] is a graphical model that learns to extract a deep hierarchical representation of the training data. The traditional DBN is built with a stack of restricted Boltzmann machines (RBMs) [114], which are two-layer generative stochastic models composed of a visible-unit layer and a hidden-unit layer. These two layers in an RBM must form a bipartite graph without lateral connections. In a DBN, the units in higher layers are trained to learn the conditional dependencies among the units in the adjacent lower layers, except the top two layers, which have undirected connections. The training of a DBN contains two phases: pre-training and fine-tuning [115]. First, an efficient layer-by-layer greedy learning strategy [116] is used to initialize the deep network in an unsupervised manner, which can prevent poor local optima to some extent without requiring a large amount of labeled data. During this procedure, contrastive divergence [117] is used to train the RBMs in the DBN to estimate the approximate gradient of the log-likelihood. Then, the parameters of the network and the desired output are fine-tuned with simple gradient descent under supervision.

3.2.3 Deep autoencoder (DAE)

The DAE was first introduced in [118] to learn efficient codings for dimensionality reduction. In contrast to the previously mentioned networks, which are trained to predict target values, the DAE is optimized to reconstruct its inputs by minimizing the reconstruction error. Variations of the DAE exist, such as the denoising autoencoder [119], which recovers the original undistorted input from partially corrupted data; the deep sparse autoencoder network (DSAE) [120], which enforces sparsity on the learned feature representation; the contractive autoencoder (CAE1) [121], which adds an activity-dependent regularization to induce locally invariant features; the convolutional autoencoder (CAE2) [122], which uses convolutional (and optionally pooling) layers for the hidden layers in the network; and the variational autoencoder (VAE) [123], which is a directed graphical model with certain types of latent variables used to design complex generative models of data.
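To make the CNN building blocks of Section 3.2.1 concrete, here is a deliberately small PyTorch sketch of a convolution/pooling/fully-connected classifier over expression classes. The layer sizes, the 48x48 grayscale input and the 7-class output are illustrative assumptions, not an architecture taken from the surveyed papers.

```python
# Minimal CNN in the spirit of Section 3.2.1: conv -> pool -> conv -> pool -> FC.
import torch
import torch.nn as nn

class TinyFERNet(nn.Module):
    def __init__(self, num_classes=7):               # e.g., 6 basic expressions + neutral
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1),   # learnable filters (weight sharing)
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2),                              # spatial down-sampling
            nn.Conv2d(32, 64, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2),
        )
        self.classifier = nn.Linear(64 * 12 * 12, num_classes)   # assumes 48x48 gray input

    def forward(self, x):
        x = self.features(x)
        x = torch.flatten(x, 1)          # 2D feature maps -> 1D vector
        return self.classifier(x)        # logits; softmax/cross-entropy applied in the loss

logits = TinyFERNet()(torch.randn(8, 1, 48, 48))   # toy batch of 48x48 grayscale faces
```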

3.2.4 Recurrent neural network (RNN)

The RNN is a connectionist model that captures temporal information and is more suitable for sequential data prediction with arbitrary lengths. In addition to training the deep neural network in a single feed-forward manner, RNNs include recurrent edges that span adjacent time steps and share the same parameters across all steps. The classic back-propagation through time (BPTT) [124] is used to train the RNN. Long short-term memory (LSTM), introduced by Hochreiter & Schmidhuber [125], is a special form of the traditional RNN that is used to address the gradient vanishing and exploding problems that are common in training RNNs. The cell state in LSTM is regulated and controlled by three gates: an input gate that allows or blocks alteration of the cell state by the input signal, an output gate that enables or prevents the cell state from affecting other neurons, and a forget gate that modulates the cell's self-recurrent connection to accumulate or forget its previous state. By combining these three gates, LSTM can model long-term dependencies in a sequence and has been widely employed for video-based expression recognition tasks.

3.2.5 Generative adversarial network (GAN)

The GAN was first introduced by Goodfellow et al. [84] in 2014. It trains models through a minimax two-player game between a generator G(z), which generates synthesized input data by mapping latents z to data space with z ~ p(z), and a discriminator D(x), which assigns a probability y = Dis(x) ∈ [0, 1] that x is an actual training sample, to tell real from fake input data. The generator and the discriminator are trained alternately and can both improve themselves by minimizing/maximizing the binary cross entropy L_GAN = log(D(x)) + log(1 − D(G(z))) with respect to D / G, with x being a training sample and z ~ p(z). Extensions of the GAN exist, such as the cGAN [126], which adds conditional information to control the output of the generator; the DCGAN [127], which adopts deconvolutional and convolutional neural networks to implement G and D, respectively; the VAE/GAN [128], which uses learned feature representations in the GAN discriminator as the basis for the VAE reconstruction objective; and the InfoGAN [129], which can learn disentangled representations in a completely unsupervised manner.

3.3 Facial expression classification

After learning the deep features, the final step of FER is to classify the given face into one of the basic emotion categories.

Unlike the traditional methods, where the feature extraction step and the feature classification step are independent, deep networks can perform FER in an end-to-end way. Specifically, a loss layer is added to the end of the network to regulate the back-propagation error; then, the prediction probability of each sample can be directly output by the network. In a CNN, softmax loss is the most commonly used function; it minimizes the cross-entropy between the estimated class probabilities and the ground-truth distribution. Alternatively, [130] demonstrated the benefit of using a linear support vector machine (SVM) for end-to-end training, which minimizes a margin-based loss instead of the cross-entropy. Likewise, [131] investigated the adaptation of deep neural forests (NFs) [132], which replaced the softmax loss layer with NFs and achieved competitive results for FER.

Besides the end-to-end learning way, another alternative is to employ the deep neural network (particularly a CNN) as a feature extraction tool and then apply additional independent classifiers, such as a support vector machine or random forest, to the extracted representations [133], [134]. Furthermore, [135], [136] showed that covariance descriptors computed on DCNN features and classification with Gaussian kernels on the Symmetric Positive Definite (SPD) manifold are more efficient than the standard classification with the softmax layer.

4 THE STATE OF THE ART

In this section, we review the existing novel deep neural networks designed for FER and the related training strategies proposed to address expression-specific problems. We divide the works presented in the literature into two main groups depending on the type of data: deep FER networks for static images and deep FER networks for dynamic image sequences. We then provide an overview of the current deep FER systems with respect to the network architecture and performance. Because some of the evaluated datasets do not provide explicit data groups for training, validation and testing, and the relevant studies may conduct experiments under different experimental conditions with different data, we summarize the expression recognition performance along with information about the data selection and grouping methods.

4.1 Deep FER networks for static images

A large volume of the existing studies conducted expression recognition tasks based on static images without considering temporal information, due to the convenience of data processing and the availability of the relevant training and test material. We first introduce specific pre-training and fine-tuning skills for FER, then review the novel deep neural networks in this field. For each of the most frequently evaluated datasets, Table 4 shows the current state-of-the-art methods in the field that are explicitly conducted in a person-independent protocol (subjects in the training and testing sets are separated).

4.1.1 Pre-training and fine-tuning

As mentioned before, direct training of deep networks on relatively small facial expression datasets is prone to overfitting. To mitigate this problem, many studies used additional task-oriented data to pre-train their self-built networks from scratch or fine-tuned well-known pre-trained models (e.g., AlexNet [25], VGG [26], VGG-face [148] and GoogleNet [27]). Kahou et al. [57], [149] indicated that the use of additional data can help to obtain models with high capacity without overfitting, thereby enhancing the FER performance.

To select appropriate auxiliary data, large-scale face recognition (FR) datasets (e.g., CASIA WebFace [150], Celebrity Face in the Wild (CFW) [151], the FaceScrub dataset [152]) or relatively large FER datasets (FER2013 [21] and TFD [37]) are suitable. Kaya et al. [153] suggested that VGG-Face, which was trained for FR, outperforms ImageNet models, which were developed for object recognition. Another interesting result observed by Knyazev et al. [154] is that pre-training on larger FR data positively affects the emotion recognition accuracy, and further fine-tuning with additional FER datasets can help improve the performance.
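A minimal sketch of the fine-tuning recipe discussed in Section 4.1.1: start from a network pre-trained on a larger face or object dataset, replace the classifier head for the expression classes, and optionally freeze the early convolutional layers. The use of torchvision's ResNet-18 with ImageNet weights here is only an illustrative stand-in for the pre-trained models named above, not a choice made by any particular surveyed method.

```python
# Hedged sketch: fine-tune a pre-trained backbone for 7-class FER.
import torch.nn as nn
from torchvision import models

def build_finetune_model(num_classes=7, freeze_backbone=True):
    model = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)
    if freeze_backbone:
        for p in model.parameters():      # keep the pre-trained conv features fixed
            p.requires_grad = False
    model.fc = nn.Linear(model.fc.in_features, num_classes)   # new expression head
    return model

# Typical usage: optimize only the parameters that still require gradients
# (the new head, plus any layers left unfrozen) on the target FER dataset,
# possibly in two stages as described above (FER2013 first, then the target set).
```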

TABLE 4
Performance summary of representative methods for static-based deep facial expression recognition on the most widely evaluated datasets. Network size = depth & number of parameters; Pre-processing = Face Detection & Data Augmentation & Face Normalization; IN = Illumination Normalization; NE = Network Ensemble; CN = Cascaded Network; MN = Multitask Network; LOSO = leave-one-subject-out.

Datasets | Method | Network type | Network size | Pre-processing | Data selection | Data group | Additional classifier | Performance1 (%)

Ouellet 14 [110] CNN (AlexNet) - - V&J - - SVM 7 classes†: (94.4)


the last frame LOSO
Li et al. 15 [86] RBM 4 - V&J - IN 7 6 classes: 96.8
Liu et al. 14 [13] DBN CN 6 2m 3 - - 8 folds AdaBoost 6 classes: 96.7
Liu et al. 13 [137] CNN, RBM CN 5 - V&J - - 10 folds SVM 8 classes: 92.05 (87.67)
Liu et al. 15 [138] CNN, RBM CN 5 - V&J - - the last three frames 10 folds SVM 7 classes‡: 93.70
Khorrami and the first frame
CK+ [139] zero-bias CNN 4 7m 3 3 - 10 folds 7 6 classes: 95.7; 8 classes: 95.1
et al. 15
Ding et al. 17
CNN fine-tune 8 11m IntraFace 3 - 10 folds 7 6 classes: (98.6); 8 classes: (96.8)
[111]
Zeng et al. 18 the last four frames 7 classes†: 95.79 (93.78)
DAE (DSAE) 3 - AAM - - LOSO 7
[54] and the first frame 8 classes: 89.84 (86.82)
Cai et al. 17 [140] CNN loss layer 6 - DRMF 3 IN 10 folds 7 7 classes†: 94.39 (90.66)
Meng et al. 17
CNN MN 6 - DRMF 3 - 8 folds 7 7 classes†: 95.37 (95.51)
[61]
Liu et al. 17 [77] CNN loss layer 11 - IntraFace 3 IN the last three frames 8 folds 7 7 classes†: 97.1 (96.1)
Yang et al. 18
GAN (cGAN) - - MoT 3 - 10 folds 7 7 classes†: 97.30 (96.57)
[141]
Zhang et al. 18
CNN MN - - 3 3 - 10 folds 7 6 classes: 98.9
[47]

Liu et al. 14 [13] DBN CN 6 2m 3 - - AdaBoost 7 classes‡: 91.8


LOSO
JAFFE
Hamester 213 images
[142] CNN, CAE NE 3 - - - IN 7 7 classes‡: (95.8)
et al. 15

Liu et al. 13 [137] CNN, RBM CN 5 - V&J - - the middle three frames 10 folds SVM 7 classes‡: 74.76 (71.73)
and the first frame
Liu et al. 15 [138] CNN, RBM CN 5 - V&J - - 10 folds SVM 7 classes‡: 75.85
MMI Mollahosseini
images from each
et al. 16 CNN (Inception) 11 7.3m IntraFace 3 - 5 folds 7 6 classes: 77.9
sequence
[14]
Liu et al. 17 [77] CNN loss layer 11 - IntraFace 3 IN 10 folds 7 6 classes: 78.53 (73.50)
Li et al. 17 [44] CNN loss layer 8 5.8m IntraFace 3 - 5 folds SVM 6 classes: 78.46
the middle three frames
Yang et al. 18
GAN (cGAN) - - MoT 3 - 10 folds 7 6 classes: 73.23 (72.67)
[141]
Reed et al. 14 4,178 emotion labeled
RBM MN - - - - - SVM Test: 85.43
[143] 3,874 identity labeled
Devries et al. 14 Validation: 87.80
TFD CNN MN 4 12.0m MoT 3 IN 5 official 7
[58] Test: 85.13 (48.29)
folds
Khorrami
[139] zero-bias CNN 4 7m 3 3 - 7 Test: 88.6
et al. 15 4,178 labeled images
Ding et al. 17
CNN fine-tune 8 11m IntraFace 3 - 7 Test: 88.9 (87.7)
[111]

Tang 13 [130] CNN loss layer 4 12.0m - 3 IN 7 Test: 71.2


Devries et al. 14
FER CNN MN 4 12.0m MoT 3 IN 7 Validation+Test: 67.21
[58]
2013
Zhang et al. 15 Training Set: 28,709
CNN MN 6 21.3m SDM - - 7 Test: 75.10
[144] Validation Set: 3,589
Guo et al. 16 Test Set: 3,589
CNN loss layer 10 2.6m SDM 3 - k-NN Test: 71.33
[145]
Kim et al. 16
CNN NE 5 2.4m IntraFace 3 IN 7 Test: 73.73
[146]
pramerdorfer
1.8/1.2/5.3
et al. 16 CNN NE 10/16/33 - 3 IN 7 Test:75.2
(m)
[147]
VGG-S/VGG-M/ 891 training, 431 validation, Validation: 51.75
levi et al. 15 [78] CNN NE MoT 3 - 7
GoogleNet and 372 test Test: 54.56
921 training, ? validation, Validation: 48.5 (39.63)
Ng et al. 15 [63] CNN fine-tune AlexNet IntraFace 3 - 7
and 372 test Test: 55.6 (42.69)
Li et al. 17 [44] CNN loss layer 8 5.8m IntraFace 3 - 921 training, 427 validation SVM Validation: 51.05
Ding et al. 17
CNN fine-tune 8 11m IntraFace 3 - 891 training, 425 validation 7 Validation: 55.15 (46.6)
[111]
SFEW
Liu et al. 17 [77] CNN loss layer 11 - IntraFace 3 IN 7 Validation: 54.19 (47.97)
2.0
Validation: 52.52 (43.41)
Cai et al. 17 [140] CNN loss layer 6 - DRMF 3 IN 7
Test: 59.41 (48.29)
Meng et al. 17 958training, Validation: 50.98 (42.57)
CNN MN 6 - DRMF 3 - 7
[61] 436 validation, Test: 54.30 (44.77)
and 372 test Validation: 53.9
Kim et al. 15 [76] CNN NE 5 - multiple 3 IN 7
Test: 61.6
Validation: 55.96 (47.31)
Yu et al. 15 [75] CNN NE 8 6.2m multiple 3 IN 7
Test: 61.29 (51.27)
1 The value in parentheses is the mean accuracy, which is calculated with the confusion matrix given by the authors.
† 7 classes: Anger, Contempt, Disgust, Fear, Happiness, Sadness, and Surprise.
‡ 7 classes: Anger, Disgust, Fear, Happiness, Neutral, Sadness, and Surprise.

Instead of directly using the pre-trained or fine-tuned models to extract features on the target dataset, a multistage fine-tuning strategy [63] (see "Submission 3" in Fig. 3) can achieve better performance: after the first-stage fine-tuning using FER2013 on pre-trained models, a second-stage fine-tuning based on the training part of the target dataset (EmotiW) is employed to refine the models to adapt to a more specific dataset (i.e., the target dataset).

Although pre-training and fine-tuning on external FR data can indirectly avoid the problem of small training data, the networks are trained separately from FER, and face-dominated information remains in the learned features, which may weaken the network's ability to represent expressions. To eliminate this effect, a two-stage training algorithm, FaceNet2ExpNet [111], was proposed (see Fig. 4). The fine-tuned face net serves as a good initialization for the expression net and is used to guide the learning of the convolutional layers only. The fully connected layers are then trained from scratch with expression information to regularize the training of the target FER net.

Fig. 3. Flowchart of the different fine-tuning combinations used in [63]. Here, "FER28" and "FER32" indicate different parts of the FER2013 dataset. "EmotiW" is the target dataset. The proposed two-stage fine-tuning strategy (Submission 3) exhibited the best performance.

Fig. 4. Two-stage training flowchart in [111]. In stage (a), the deeper face net is frozen and provides the feature-level regularization that pushes the convolutional features of the expression net to be close to the face net by using the proposed distribution function. Then, in stage (b), to further improve the discriminativeness of the learned features, randomly initialized fully connected layers are added and jointly trained with the whole expression net using the expression label information.

4.1.2 Diverse network input

Traditional practices commonly use the whole aligned face in RGB as the input of the network to learn features for FER. However, these raw data lack important information, such as homogeneous or regular textures and invariance in terms of image scaling, rotation, occlusion and illumination, which may represent confounding factors for FER. Some methods have employed diverse handcrafted features and their extensions as the network input to alleviate this problem.

Low-level representations encode features from small regions in the given RGB image, then cluster and pool these features with local histograms, which are robust to illumination variations and small registration errors. A novel mapped LBP feature [78] (see Fig. 5) was proposed for illumination-invariant FER. Scale-invariant feature transform (SIFT) [155] features, which are robust against image scaling and rotation, are employed [156] for multi-view FER tasks. Combining different descriptors in outline, texture, angle, and color as the input data can also help enhance the deep network performance [54], [157].

Fig. 5. Image intensities (left) and LBP codes (middle). [78] proposed mapping these values to a 3D metric space (right) as the input of CNNs.

Part-based representations extract features according to the target task, removing noncritical parts from the whole image and exploiting key parts that are sensitive to the task. [158] indicated that three regions of interest (ROI), i.e., eyebrows, eyes and mouth, are strongly related to facial expression changes, and cropped these regions as the input of DSAE. Other works proposed to automatically learn the key parts for facial expression. For example, [159] employed a deep multi-layer network [160] to detect a saliency map that puts intensities on parts demanding visual attention, and [161] applied the neighbor-center difference vector (NCDV) [162] to obtain features with more intrinsic information.

4.1.3 Auxiliary blocks & layers

Based on the foundation architecture of the CNN, several studies have proposed the addition of well-designed auxiliary blocks or layers to enhance the expression-related representation capability of learned features.

A novel CNN architecture, HoloNet [90], was designed for FER, where CReLU [163] was combined with the powerful residual structure [28] to increase the network depth without efficiency reduction, and an inception-residual block [164], [165] was uniquely designed for FER to learn multi-scale features to capture variations in expressions. Another CNN model, Supervised Scoring Ensemble (SSE) [91], was introduced to enhance
TABLE 5
Three primary ensemble methods on the decision level.

used in
definition
(example)
determine the class with the most
Majority [76], [146],
votes using the predicted label
Voting [173]
yielded from each individual
(a) Three different supervised blocks in [91]. SS Block for shallow-layer
supervision, IS Block for intermediate-layer supervision, and DS Block for determine the class with the
deep-layer supervision. highest mean score using the
Simple [76], [146],
posterior class probabilities
Average [173]
yielded from each individual
with the same weight
determine the class with the
highest weighted mean score
Weighted [57], [78],
using the posterior class
Average [147], [153]
(b) Island loss layer in [140]. The island loss calculated at the feature probabilities yielded from each
extraction layer and the softmax loss calculated at the decision layer are individual with different weights
combined to supervise the CNN training.

formalized to pull the locally neighboring features of the same


class together so that the intra-class local clusters of each class are
compact. Besides, based on the triplet loss [169], which requires
one positive example to be closer to the anchor than one negative
example with a fixed gap, two variations were proposed to replace
(c) (N+M)-tuple clusters loss layer in [77]. During training, the identity-
aware hard-negative mining and online positive mining schemes are used to or assist the supervision of the softmax loss: (1) exponential
decrease the inter-identity variation in the same expression class. triplet-based loss [145] was formalized to give difficult samples
more weight when updating the network, and (2) (N+M)-tuples
Fig. 6. Representative functional layers or blocks that are specifically cluster loss [77] was formalized to alleviate the difficulty of anchor
designed for deep facial expression recognition. selection and threshold validation in the triplet loss for identity-
invariant FER (see Fig. 6(c) for details). Besides, a feature loss
[170] was proposed to provide complementary information for the
4.1.4 Network ensemble
Previous research suggested that assemblies of multiple networks can outperform an individual network [171]. Two key factors should be considered when implementing network ensembles: (1) sufficient diversity of the networks to ensure complementarity, and (2) an appropriate ensemble method that can effectively aggregate the committee networks.
In terms of the first factor, different kinds of training data and various network parameters or architectures are considered to generate diverse committees. Several pre-processing methods [146], such as deformation and normalization, and the methods described in Section 4.1.2 can generate different data to train diverse networks. Changing the filter sizes, the number of neurons and the number of layers of the networks, and applying multiple random seeds for weight initialization can also enhance the diversity of the networks [76], [172]. Besides, different network architectures can be used to enhance the diversity; for example, a CNN trained in a supervised way and a convolutional autoencoder (CAE) trained in an unsupervised way were combined for network ensemble [142].
For the second factor, each member of the committee networks can be assembled at two different levels: the feature level and the decision level. For feature-level ensembles, the most commonly adopted strategy is to concatenate features learned from different networks [88], [174]. For example, [88] concatenated features learned from different networks to obtain a single feature vector that describes the input image (see Fig. 7(a)).
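As a concrete illustration of the two ensemble levels, the snippet below sketches feature-level concatenation and decision-level fusion with NumPy. The member networks, their output dimensions and the example weights are hypothetical placeholders, not the configurations used in [57], [76] or [88].

import numpy as np

def feature_level_fusion(feature_list):
    # Concatenate L2-normalized features from several networks into one vector.
    normed = [f / (np.linalg.norm(f) + 1e-8) for f in feature_list]
    return np.concatenate(normed)

def decision_level_fusion(prob_list, weights=None):
    # prob_list: list of (K,) posterior probability vectors, one per network.
    probs = np.stack(prob_list)                  # (N, K)
    if weights is None:                          # simple average
        fused = probs.mean(axis=0)
    else:                                        # weighted average
        w = np.asarray(weights, dtype=float)
        fused = (w[:, None] * probs).sum(axis=0) / w.sum()
    return int(fused.argmax())

def majority_voting(prob_list):
    votes = [int(np.argmax(p)) for p in prob_list]
    return max(set(votes), key=votes.count)

# Example with three hypothetical member networks and 7 expression classes:
# label = decision_level_fusion([p1, p2, p3], weights=[0.5, 0.3, 0.2])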
Fig. 7. Representative network ensemble systems at the feature level and decision level. (a) Feature-level ensemble in [88]: three different features (fc5 of VGG13 + fc7 of VGG16 + pool of ResNet) are normalized and concatenated to create a single feature vector (FV) that describes the input frame. (b) Decision-level ensemble in [76]: a 3-level hierarchical committee architecture with hybrid decision-level fusions was proposed to obtain sufficient decision diversity.

Fig. 8. Representative multitask network for FER. In the proposed MSCNN [68], a pair of images is sent into the MSCNN during training. The expression recognition task with cross-entropy loss, which learns features with large between-expression variation, and the face verification task with contrastive loss, which reduces the variation in within-expression features, are combined to train the MSCNN.

Fig. 9. Representative cascaded network for FER. The proposed AU-aware deep network (AUDN) [137] is composed of three sequential modules: in the first module, a 2-layer CNN is trained to generate an over-complete representation encoding all expression-specific appearance variations over all possible locations; in the second module, an AU-aware receptive field layer is designed to search subsets of the over-complete representation; in the last module, a multilayer RBM is exploited to learn hierarchical features.
For decision-level ensembles, three widely used rules are applied: majority voting, simple average and weighted average. A summary of these three methods is provided in Table 5. Because the weighted average rule considers the importance and confidence of each individual, many weighted average methods have been proposed to find an optimal set of weights for the network ensemble. [57] proposed a random search method to weight the model predictions for each emotion type. [75] used the log-likelihood loss and the hinge loss to adaptively assign different weights to each network. [76] proposed an exponentially weighted average based on the validation accuracy to emphasize qualified individuals (see Fig. 7(b)). [172] used a CNN to learn the weights for each individual model.

4.1.5 Multitask networks
Many existing networks for FER focus on a single task and learn features that are sensitive to expressions without considering interactions among other latent factors. However, in the real world, FER is intertwined with various factors, such as head pose, illumination, and subject identity (facial morphology). To solve this problem, multitask learning is introduced to transfer knowledge from other relevant tasks and to disentangle nuisance factors.
Reed et al. [143] constructed a higher-order Boltzmann machine (disBM) to learn manifold coordinates for the relevant factors of expressions and proposed training strategies for disentangling, so that the expression-related hidden units are invariant to face morphology. Other works [58], [175] suggested that simultaneously conducting FER with other tasks, such as facial landmark localization and facial AU [176] detection, can jointly improve FER performance.
Besides, several works [61], [68] employed multitask learning for identity-invariant FER. In [61], an identity-aware CNN (IACNN) with two identical sub-CNNs was proposed: one stream used an expression-sensitive contrastive loss to learn expression-discriminative features, and the other stream used an identity-sensitive contrastive loss to learn identity-related features for identity-invariant FER. In [68], a multisignal CNN (MSCNN), which was trained under the supervision of both FER and face verification tasks, was proposed to force the model to focus on expression information (see Fig. 8). Furthermore, an all-in-one CNN model [177] was proposed to simultaneously solve a diverse set of face analysis tasks, including smile detection. The network was first initialized using weights pre-trained for face recognition, and task-specific sub-networks were then branched out from different layers with domain-based regularization by training on multiple datasets. Specifically, as smile detection is a subject-independent task that relies more on local information available from the lower layers, the authors proposed to fuse the lower convolutional layers to form a generic representation for smile detection. Conventional supervised multitask learning requires training samples labeled for all tasks. To relax this requirement, [47] proposed a novel attribute propagation method which can leverage the inherent correspondences between facial expression and other heterogeneous attributes despite the disparate distributions of different datasets.
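The following is a minimal sketch of how an expression-classification loss can be combined with a verification-style contrastive loss over image pairs, in the spirit of the multitask training described above. The shared backbone, margin and loss weighting are assumptions for illustration rather than the exact settings of [61] or [68].

import torch
import torch.nn.functional as F

def multitask_pair_loss(emb1, emb2, logits1, labels1, same_identity,
                        margin=1.0, alpha=0.5):
    """emb1/emb2: (B, D) embeddings of an image pair from a shared backbone.
    logits1: (B, K) expression logits for the first image of each pair.
    labels1: (B,) expression labels; same_identity: (B,) float tensor,
    1.0 if the pair shows the same person and 0.0 otherwise."""
    # Task 1: expression recognition (cross-entropy on expression labels).
    ce = F.cross_entropy(logits1, labels1)

    # Task 2: face verification (contrastive loss on the pair of embeddings),
    # which keeps same-person features close and pushes different persons
    # apart by at least `margin`.
    dist = F.pairwise_distance(emb1, emb2)
    contrastive = (same_identity * dist.pow(2) +
                   (1.0 - same_identity) * F.relu(margin - dist).pow(2)).mean()

    return ce + alpha * contrastive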
4.1.6 Cascaded networks
In a cascaded network, various modules for different tasks are combined sequentially to construct a deeper network, where the outputs of the former modules are utilized by the latter modules. Related studies have proposed combinations of different structures to learn a hierarchy of features through which factors of variation that are unrelated to expressions can be gradually filtered out.
Most commonly, different networks or learning methods are combined sequentially and individually, and each of them contributes differently and hierarchically. In [178], DBNs were trained to first detect faces and expression-related areas; these parsed face components were then classified by a stacked autoencoder. In [179], a multiscale contractive convolutional network (CCNET) was proposed to obtain local-translation-invariant (LTI) representations, and a contractive autoencoder was then designed to hierarchically separate out the emotion-related factors from subject identity and pose. In [137], [138], over-complete representations were first learned using a CNN architecture, and a multilayer RBM was then exploited to learn higher-level features for FER (see Fig. 9). Instead of simply concatenating different networks, Liu et al. [13] presented a boosted DBN (BDBN) that iteratively performs feature representation, feature selection and classifier construction in a unified loopy framework. Compared with concatenation without feedback, this loopy framework propagates the classification error backward to initiate the feature selection process alternately until convergence, so the discriminative ability for FER can be substantially improved during this iteration.

4.1.7 Generative adversarial networks (GANs)
Recently, GAN-based methods have been successfully used in image synthesis to generate impressively realistic faces, numbers, and a variety of other image types, which are beneficial to training data augmentation and the corresponding recognition tasks. Several works have proposed novel GAN-based models for pose-invariant FER and identity-invariant FER.
For pose-invariant FER, Lai et al. [180] proposed a GAN-based face frontalization framework, where the generator frontalizes input face images while preserving the identity and expression characteristics and the discriminator distinguishes the real images from the generated frontal face images. Zhang et al. [181] proposed a GAN-based model that can generate images with different expressions under arbitrary poses for multi-view FER. For identity-invariant FER, Yang et al. [182] proposed an Identity-Adaptive Generation (IA-gen) model with two parts: the upper part generates images of the same subject with different expressions using cGANs, and the lower part conducts FER in each single-identity sub-space without involving other individuals, so that identity variations can be well alleviated. Chen et al. [183] proposed a Privacy-Preserving Representation-Learning Variational GAN (PPRL-VGAN) that combines VAE and GAN to learn an identity-invariant representation that is explicitly disentangled from the identity information and generative for expression-preserving face image synthesis. Yang et al. [141] proposed a De-expression Residue Learning (DeRL) procedure to explore the expressive information that is filtered out during the de-expression process but still embedded in the generator; the model then extracts this information from the generator directly to mitigate the influence of subject variations and improve FER performance.

4.1.8 Discussion
The existing well-constructed deep FER systems focus on two key issues: the lack of plentiful diverse training data and expression-unrelated variations, such as illumination, head pose and identity. Table 6 shows the relative advantages and disadvantages of these different types of methods with respect to the two open issues (data size requirement and expression-unrelated variations) and other concerns (computational efficiency, performance and difficulty of network training).
Pre-training and fine-tuning have become mainstream in deep FER to solve the problem of insufficient training data and overfitting. A practical technique that has proved particularly useful is pre-training and fine-tuning the network in multiple stages using auxiliary data, moving from large-scale object or face recognition datasets to small-scale FER datasets, i.e., from large to small and from general to specific. However, compared with the end-to-end training framework, representational structures that are unrelated to expressions still remain in the off-the-shelf pre-trained model, such as the large domain gap with the object recognition net [153] and the subject identification distraction in the face net [111]. Thus, the extracted features are usually vulnerable to identity variations and the performance is degraded. Noticeably, with the advent of large-scale in-the-wild FER datasets (e.g., AffectNet and RAF-DB), end-to-end training of deep networks with moderate size can also achieve competitive performance [45], [167].
In addition to directly using the raw image data to train the deep network, diverse pre-designed features are recommended to strengthen the network's robustness to common distractions (e.g., illumination, head pose and occlusion) and to force the network to focus more on facial areas with expressive information. Moreover, the use of multiple heterogeneous input data can indirectly enlarge the data size. However, the problem of identity bias is commonly ignored in these methods. Moreover, generating diverse data incurs additional time consumption, and combining the multiple data sources can lead to high dimensionality, which may reduce the computational efficiency of the network.
Training a deep and wide network with a large number of hidden layers and flexible filters is an effective way to learn deep high-level features that are discriminative for the target task. However, this process is vulnerable to the size of the training data and can underperform if insufficient training data is available to learn the new parameters. Integrating multiple relatively small networks in parallel or in series is a natural research direction to overcome this problem. Network ensembles integrate diverse networks at the feature or decision level to combine their advantages and are usually applied in emotion competitions to help boost performance. However, designing different kinds of networks that compensate each other obviously enlarges the computational cost and the storage requirement; moreover, the weight of each sub-network is usually learned according to the performance on the original training data, leading to overfitting on newly unseen testing data. Multitask networks jointly train multiple networks with consideration of the interactions between the target FER task and other secondary tasks, such as facial landmark localization, facial AU recognition and face verification, so that expression-unrelated factors, including identity bias, can be well disentangled. The downside of this method is that it requires labeled data for all tasks, and training becomes increasingly cumbersome as more tasks are involved. Alternatively, cascaded networks sequentially train multiple networks in a hierarchical approach, in which case the discriminative ability of the learned features is continuously strengthened. In general, this method can alleviate the overfitting problem while progressively disentangling factors that are irrelevant to facial expression. A deficiency worth considering is that the sub-networks in most existing cascaded systems are trained individually without feedback; an end-to-end training strategy is preferable to enhance the training effectiveness and the performance [13].

TABLE 6
Comparison of different types of methods for static images in terms of data size requirement, variations* (head pose, illumination, occlusion and other environment factors), identity bias, computational efficiency, accuracy, and difficulty of network training.

Network type            data     variations*   identity bias   efficiency   accuracy   difficulty
Pre-train & Fine-tune   low      fair          vulnerable      high         fair       easy
Diverse input           low      good          vulnerable      low          fair       easy
Auxiliary layers        varies   good          varies          varies       good       varies
Network ensemble        low      good          fair            low          good       medium
Multitask network       high     varies        good            fair         varies     hard
Cascaded network        fair     good          fair            fair         fair       medium
GAN                     fair     good          good            fair         good       hard
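A hedged sketch of the multi-stage pre-training and fine-tuning strategy discussed above is given below using torchvision. The choice of backbone, the number of expression classes and the frozen/trainable split are illustrative assumptions (ImageNet weights stand in for the face-recognition pre-training mentioned in the text), not a prescription from any particular work.

import torch
import torch.nn as nn
from torchvision import models

NUM_EXPRESSIONS = 7

# Stage 1: start from a backbone pre-trained on a large generic dataset.
backbone = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)

# Stage 2: freeze the convolutional layers and replace the classifier head,
# which is trained from scratch on the (smaller) FER dataset.
for p in backbone.parameters():
    p.requires_grad = False
backbone.fc = nn.Linear(backbone.fc.in_features, NUM_EXPRESSIONS)
optimizer = torch.optim.SGD(backbone.fc.parameters(), lr=1e-2, momentum=0.9)

# Stage 3 (optional): unfreeze the top convolutional block and fine-tune it
# with a smaller learning rate once the new head has converged.
for p in backbone.layer4.parameters():
    p.requires_grad = True
optimizer.add_param_group({"params": backbone.layer4.parameters(), "lr": 1e-4})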


Ideally, deep networks, especially CNNs, have good capabilities for dealing with head-pose variations, yet most current FER networks do not address head-pose variations explicitly and are not tested in naturalistic scenarios. Generative adversarial networks (GANs) can be exploited to solve this issue by frontalizing face images while preserving expression characteristics [180] or by synthesizing arbitrary poses to help train a pose-invariant network [181]. Another advantage of GANs is that identity variations can be explicitly disentangled by generating the corresponding neutral face image [141] or by synthesizing different expressions while preserving the identity information for identity-invariant FER [182]. Moreover, GANs can help augment the training data in both size and diversity. The main drawback of GANs is training instability and the trade-off between visual quality and image diversity.

Fig. 10. Frame aggregation in [57]. (a) Frame averaging: for sequences with more than 10 frames, the probability vectors of 10 independent groups of frames taken uniformly along time are averaged. (b) Frame expansion: for sequences with fewer than 10 frames, frames are repeated uniformly to obtain 10 frames in total.
quality and image diversity.
fixed-length feature vector for each sequence [57], [191]: frame
averaging and frame expansion (see Fig. 10 for details). An
4.2 Deep FER networks for dynamic image sequences alternative approach which dose not require a fixed number of
Although most of the previous models focus on static images, frames is applying statistical coding. The average, max, average
facial expression recognition can benefit from the temporal cor- of square, average of maximum suppression vectors and so on can
relations of consecutive frames in a sequence. We first intro- be used to summarize the per-frame probabilities in each sequence.
duce the existing frame aggregation techniques that strategically For feature-level frame aggregation, the learned features of
combine deep features learned from static-based FER networks. frames in the sequence are aggregate. Many statistical-based
Then, considering that in a videostream people usually display encoding modules can be applied in this scheme. A simple and
the same expression with different intensities, we further review effective way is to concatenate the mean, variance, minimum,
methods that use images in different expression intensity states for and maximum of the features over all frames [88]. Alternatively,
intensity-invariant FER. Finally, we introduce deep FER networks matrix-based models such as eigenvector, covariance matrix and
that consider spatio-temporal motion patterns in video frames and multi-dimensional Gaussian distribution can also be employed
learned features derived from the temporal structure. For each of for aggregation [186], [192]. Besides, multi-instance learning has
the most frequently evaluated datasets, Table 7 shows the current been explored for video-level representation [193], where the
state-of-the-art methods conducted in the person-independent pro- cluster centers are computed from auxiliary image data and then
tocol. bag-of-words representation is obtained for each bag of video
frames.
4.2.1 Frame aggregation
Because the frames in a given video clip may vary in expres- 4.2.2 Expression Intensity network
sion intensity, directly measuring per-frame error does not yield Most methods (introduced in Section 4.1) focus on recognizing
satisfactory performance. Various methods have been proposed the peak high-intensity expression and ignore the subtle lower-
to aggregate the network output for frames in each sequence intensity expressions. In this section, we introduced expression
to improve the performance. We divide these methods into two intensity-invariant networks that take training samples with differ-
groups: decision-level frame aggregation and feature-level frame ent intensities as input to exploit the intrinsic correlations among
aggregation. expressions from a sequence that vary in intensity.
For decision-level frame aggregation, n-class probability vec- In expression intensity-invariant network, image frames with
tors of each frame in a sequence are integrated. The most con- intensity labels are used for training. During test, data that vary
venient way is to directly concatenate the output of these frames. in expression intensity are used to verify the intensity-invariant
However, the number of frames in each sequence may be different. ability of the network. Zhao et al. [17] proposed a peak-piloted

TABLE 7
Performances of representative methods for dynamic-based deep facial expression recognition on the most widely evaluated datasets. Network
size = depth & number of parameters; Pre-processing = Face Detection & Data Augmentation & Face Normalization; IN = Illumination
Normalization; F A = Frame Aggregation; EIN = Expression Intensity-invariant Network; F LT = Facial Landmark Trajectory; CN = Cascaded
Network; N E = Network Ensemble; S = Spatial Network; T = Temporal Network; LOSO = leave-one-subject-out.

Network Network Training data Selection Testing data selection Data


Datasets Methods Pre-processing Performance1(%)
type size in each sequence in each sequence group

Zhao et al. 16 [17] EIN 22 6.8m 3 - - from the 7th to the last2 the last frame 10 folds 6 classes: 99.3
2
Yu et al. 17 [70] EIN 42 - 3 - from the 7th to the last the peak expression
MTCNN 10 folds 6 classes: 99.6
kim et al. 17 [184] EIN 14 - 3 3 - all frames 10 folds 7 classes: 97.93
S: emotional
Sun et al. 17 [185] NE 3 * GoogLeNetv2 3 - - 10 folds 6 classes: 97.28
CK+ T: neutral+emotional
Jung et al. 15 [16] F LT 2 177.6k IntraFace 3 - fixed number of frames 10 folds 7 classes: 92.35
the same as
Jung et al. 15 [16] C3D 4 - IntraFace 3 - fixed number of frames 10 folds 7 classes: 91.44
the training data
Jung et al. 15 [16] NE F LT /C3D IntraFace 3 - fixed number of frames 10 folds 7 classes: 97.25 (95.22)
kuo et al. 18 [89] FA 6 2.7m IntraFace 3 IN fixed length 9 10 folds 7 classes: 98.47
SDM/ S: the last frame
Zhang et al. 17 [68] NE 7/5 2k/1.6m 3 - 10 folds 7 classes: 98.50 (97.78)
Cascaded CNN T: all frames

Kim et al. 17 [66] EIN , CN 7 1.5m Incremental 3 - 5 intensities frames LOSO 6 classes: 78.61 (78.00)
kim et al. 17 [184] EIN 14 - 3 3 - all frames 10 folds 6 classes: 81.53
Hasani et al. 17 [112] F LT , CN 22 - 3000 fps - - ten frames 5 folds 6 classes: 77.50 (74.50)
the same as
Hasani et al. 17 [55] CN 29 - AAM - - static frames 5 folds 6 classes: 78.68
MMI the training data
SDM/ S: the middle frame
Zhang et al. 17 [68] NE 7/5 2k/1.6m 3 - 10 folds 6 classes: 81.18 (79.30)
Cascaded CNN T: all frames
S: emotional
Sun et al. 17 [185] NE 3 * GoogLeNetv2 3 - - 10 folds 6 classes: 91.46
T: neutral+emotional
2
Zhao et al. 16 [17] EIN 22 6.8m 3 - - from the 7th to the last the last frame 10 folds 6 classes: 84.59
Yu et al. 17 [70] EIN 42 - MTCNN 3 -from the 7th to the last2 the peak expression 10 folds 6 classes: 86.23
Oulu- Jung et al. 15 [16] F LT 2 177.6k IntraFace 3 -fixed number of frames 10 folds 6 classes: 74.17
CASIA Jung et al. 15 [16] C3D 4 - IntraFace 3 -fixed number of frames 10 folds 6 classes: 74.38
Jung et al. 15 [16] NE F LT /C3D IntraFace 3 -fixed number of frames the same as 10 folds 6 classes: 81.46 (81.49)
SDM/ S: the last frame the training data
Zhang et al. 17 [68] NE 7/5 2k/1.6m 3 - 10 folds 6 classes: 86.25 (86.25)
Cascaded CNN T: all frames
kuo et al. 18 [89] NE 6 2.7m IntraFace 3 IN fixed length 9 10 folds 6 classes: 91.67

Ding et al. 16 [186] FA AlexNet 3 - - Training: 773; Validation: 373; Test: 593 Validation: 44.47
Yan et al. 16 [187] CN VGG16-LSTM 3 3 - 40 frames 3 folds 7 classes: 44.46
AFEW*
Yan et al. 16 [187] F LT 4 - [188] - - 30 frames 3 folds 7 classes: 37.37
6.0
Fan et al. 16 [108] CN VGG16-LSTM 3 - - 16 features for LSTM Validation: 45.43 (38.96)
Fan et al. [108] C3D 10 - 3 - - several windows of 16 consecutive frames Validation: 39.69 (38.55)
Yan et al. 16 [187] fusion / Training: 773; Validation: 383; Test: 593 Test: 56.66 (40.81)
Fan et al. 16 [108] fusion / Training: 774; Validation: 383; Test: 593 Test: 59.02 (44.94)
Ouyang et al. 17 [189] CN VGG-LSTM MTCNN 3 - 16 frames Validation: 47.4
Ouyang et al. 17 [189] C3D 10 - MTCNN 3 - 16 frames Validation: 35.2
AFEW*
Vielzeuf et al. [190] CN C3D-LSTM 3 3 - detected face frames Validation: 43.2
7.0
Vielzeuf et al. [190] CN VGG16-LSTM 3 3 - several windows of 16 consecutive frames Validation: 48.6
Vielzeuf et al. [190] fusion / Training: 773; Validation: 383; Test: 653 Test: 58.81 (43.23)
1
The value in parentheses is the mean accuracy calculated from the confusion matrix given by authors.
2
A pair of images (peak and non-peak expression) is chosen for training each time.
*
We have included the result of a single spatio-temporal network and also the best result after fusion with both video and audio modalities.

7 Classes in CK+: Anger, Contempt, Disgust, Fear, Happiness, Sadness, and Surprise.

7 Classes in AFEW: Anger, Disgust, Fear, Happiness, Neutral, Sadness, and Surprise.

deep network (PPDN) that takes a pair of peak and non-peak and encoding intermediate intensity, respectively.
images of the same expression and from the same subject as
input and utilizes the L2-norm loss to minimize the distance
between both images. During back propagation, a peak gradient Considering that images with different expression intensities
suppression (PGS) was proposed to drive the learned feature of for an individual identity is not always available in the wild,
the non-peak expression towards that of peak expression while several works proposed to automatically acquire the intensity label
avoiding the inverse. Thus, the network discriminant ability on or to generate new images with targeted intensity. For example, in
lower-intensity expressions can be improved. Based on PPDN, [194] the peak and neutral frames was automatically picked out
Yu et al. [70] proposed a deeper cascaded peak-piloted network from the sequence with two stages : a clustering stage to divide all
(DCPN) that used a deeper and larger architecture to enhance the frames into the peak-like group and the neutral-like group using
discriminative ability of the learned features and employed an in- K-means algorithm, and a classification stage to detect peak and
tegration training method called cascade fine-tuning to avoid over- neutral frames using a semi-supervised SVM. And in [184], a
fitting. In [66], more intensity states were utilized (onset, onset to deep generative-contrastive model was presented with two steps:
apex transition, apex, apex to offset transition and offset) and five a generator to generate the reference (less-expressive) face for
loss functions were adopted to regulate the network training by each sample via convolutional encoder-decoder and a contrastive
minimizing expression classification error, intra-class expression network to jointly filter out information that is irrelevant with
variation, intensity classification error and intra-intensity variation, expressions through a contrastive metric loss and a supervised
reconstruction loss.
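A minimal sketch of the peak-piloted idea is shown below: the features of a non-peak frame are pulled toward those of the peak frame of the same subject and expression, while both frames are classified with cross-entropy. The gradient blocking on the peak branch loosely imitates the peak gradient suppression of [17]; all weights and names are illustrative assumptions.

import torch
import torch.nn.functional as F

def peak_piloted_loss(feat_peak, feat_nonpeak, logits_peak, logits_nonpeak,
                      labels, beta=1.0):
    """feat_*: (B, D) features of peak / non-peak frames of the same subject
    and expression; logits_*: (B, K) class scores; labels: (B,) labels."""
    # Classification losses on both frames.
    ce = F.cross_entropy(logits_peak, labels) + F.cross_entropy(logits_nonpeak, labels)

    # L2 term that drives the non-peak feature toward the peak feature.
    # detach() stops gradients from flowing into the peak branch, so the
    # stronger peak representation is not dragged toward the weaker one.
    pilot = (feat_nonpeak - feat_peak.detach()).pow(2).sum(dim=1).mean()

    return ce + beta * pilot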

Fig. 12. The proposed 3DCNN-DAP [199]. The input n-frame sequence
is convolved with 3D filters; then, 13 ∗ c ∗ k part filters corresponding to
13 manually defined facial parts are used to convolve k feature maps for
the facial action part detection maps of c expression classes.

Fig. 11. The proposed PPDN in [17]. During training, PPDN is trained by sequence and weighted based on their prediction scores. Instead
jointly optimizing the L2-norm loss and the cross-entropy losses of two of directly using C3D for classification, [109] employed C3D for
expression images. During testing, the PPDN takes one still image as
input for probability prediction. spatio-temporal feature extraction and then cascaded with DBN
for prediction. In [201], C3D was also used as a feature extractor,
followed by a NetVLAD layer [202] to aggregate the temporal
4.2.3 Deep spatio-temporal FER network information of the motion features by learning cluster centers.

Although the frame aggregation can integrate frames in the Facial landmark trajectory: Related psychological studies
video sequence, the crucial temporal dependency is not explicitly have shown that expressions are invoked by dynamic motions
exploited. By contrast, the spatio-temporal FER network takes of certain facial parts (e.g., eyes, nose and mouth) that contain
a range of frames in a temporal window as a single input the most descriptive information for representing expressions.
without prior knowledge of the expression intensity and utilizes To obtain more accurate facial actions for FER, facial landmark
both textural and temporal information to encode more subtle trajectory models have been proposed to capture the dynamic
expressions. variations of facial components from consecutive frames.
To extract landmark trajectory representation, the most direct
RNN and C3D: RNN can robustly derive information from way is to concatenate coordinates of facial landmark points
sequences by exploiting the fact that feature vectors for successive from frames over time with normalization to generate a one-
data are connected semantically and are therefore interdependent. dimensional trajectory signal for each sequence [16] or to form
The improved version, LSTM, is flexible to handle varying-length an image-like map as the input of CNN [187]. Besides, relative
sequential data with lower computation cost. Derived from RNN, distance variation of each landmark in consecutive frames can
an RNN that is composed of ReLUs and initialized with the also be used to capture the temporal information [203]. Further,
identity matrix (IRNN) [195] was used to provide a simpler part-based model that divides facial landmarks into several parts
mechanism for addressing the vanishing and exploding gradient according to facial physical structure and then separately feeds
problems [87]. And bidirectional RNNs (BRNNs) [196] were them into the networks hierarchically is proved to be efficient for
employed to learn the temporal relations in both the original and both local low-level and global high-level feature encoding [68]
reversed directions [68], [187]. Recently, a Nested LSTM was (see “PHRNN” in Fig. 13) . Instead of separately extracting the
proposed in [71] with two sub-LSTMs. Namely, T-LSTM models trajectory features and then input them into the networks, Hasani
the temporal dynamics of the learned features, and C-LSTM et al. [112] incorporated the trajectory features by replacing the
integrates the outputs of all T-LSTMs together so as to encode shortcut in the residual unit of the original 3D Inception-ResNet
the multi-level features encoded in the intermediate layers of the with element-wise multiplication of facial landmarks and the input
network. tensor of the residual unit. Thus, the landmark based network can
Compared with RNN, CNN is more suitable for computer be trained end-to-end.
vision applications; hence, its derivative C3D [107], which uses
3D convolutional kernels with shared weights along the time Cascaded networks: By combining the powerful perceptual
axis instead of the traditional 2D kernels, has been widely used vision representations learned from CNNs with the strength of
for dynamic-based FER (e.g., [83], [108], [189], [197], [198]) LSTM for variable-length inputs and outputs, Donahue et al.
to capture the spatio-temporal features. Based on C3D, many [204] proposed a both spatially and temporally deep model which
derived structures have been designed for FER. In [199], 3D CNN cascades the outputs of CNNs with LSTMs for various vision
was incorporated with the DPM-inspired [200] deformable facial tasks involving time-varying inputs and outputs. Similar to this
action constraints to simultaneously encode dynamic motion hybrid network, many cascaded networks have been proposed for
and discriminative part-based representations (see Fig. 12 for FER (e.g., [66], [108], [190], [205]).
details). In [16], a deep temporal appearance network (DTAN) Instead of CNN, [206] employed a convolutional sparse
was proposed that employed 3D filters without weight sharing autoencoder for sparse and shift-invariant features; then, an
along the time axis; hence, each filter can vary in importance LSTM classifier was trained for temporal evolution. [189]
over time. Likewise, a weighted C3D was proposed [190], where employed a more flexible network called ResNet-LSTM, which
several windows of consecutive frames were extracted from each allows nodes in lower CNN layers to directly contact with

trajectory”) and the integrated network (see Fig. 14 for details),


which outperformed the weighed sum strategy.
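To make the cascaded spatio-temporal pipeline concrete, the following is a small PyTorch sketch that feeds per-frame CNN features into an LSTM and classifies the final hidden state. The backbone, hidden size and pooling choice are assumptions for illustration and do not reproduce any specific model discussed above.

import torch
import torch.nn as nn
from torchvision import models

class CnnLstmClassifier(nn.Module):
    def __init__(self, num_classes=7, hidden=256):
        super().__init__()
        cnn = models.resnet18(weights=None)
        # Drop the final fc layer; output after global pooling is (B, 512, 1, 1).
        self.backbone = nn.Sequential(*list(cnn.children())[:-1])
        self.lstm = nn.LSTM(input_size=512, hidden_size=hidden, batch_first=True)
        self.head = nn.Linear(hidden, num_classes)

    def forward(self, clips):                      # clips: (B, T, 3, H, W)
        b, t = clips.shape[:2]
        frames = clips.flatten(0, 1)               # (B*T, 3, H, W)
        feats = self.backbone(frames).flatten(1)   # (B*T, 512)
        feats = feats.view(b, t, -1)               # (B, T, 512)
        _, (h_n, _) = self.lstm(feats)
        return self.head(h_n[-1])                  # (B, num_classes)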

4.2.4 Discussion
In the real world, people display facial expressions in a dynamic
process, e.g., from subtle to obvious, and it has become a trend to
conduct FER on sequence/video data. Table 8 summarizes relative
merits of different types of methods on dynamic data in regards
to the capability of representing spatial and temporal information,
the requirement on training data size and frame length (variable or
Fig. 13. The spatio-temporal network proposed in [68]. The tempo- fixed), the computational efficiency and the performance.
ral network PHRNN for landmark trajectory and the spatial network Frame aggregation is employed to combine the learned feature
MSCNN for identity-invariant features are trained separately. Then, the
predicted probabilities from the two networks are fused together for or prediction probability of each frame for a sequence-level
spatio-temporal FER. result. The output of each frame can be simply concatenated
(fixed-length frames is required in each sequence) or statistically
aggregated to obtain video-level representation (variable-length
frames processible). This method is computationally simple and
can achieve moderate performance if the temporal variations of
the target dataset is not complicated.
According to the fact that the expression intensity in a video
sequence varies over time, the expression intensity-invariant net-
work considers images with non-peak expressions and further
exploits the dynamic correlations between peak and non-peak
expressions to improve performance. Commonly, image frames
with specific intensity states are needed for intensity-invariant
Fig. 14. The joint fine-tuning method for DTAGN proposed in [16]. To FER.
integrate DTGA and DTAN, we freeze the weight values in the gray Despite the advantages of these methods, frame aggregation
boxes and retrain the top layer in the green boxes. The logit values
of the green boxes are used by Softmax3 to supervise the integrated handles frames without consideration of temporal information
network. During training, we combine three softmax loss functions, and and subtle appearance changes, and expression intensity-invariant
for prediction, we use only Softmax3. networks require prior knowledge of expression intensity which
is unavailable in real-world scenarios. By contrast, Deep spatio-
temporal networks are designed to encode temporal dependencies
LSTMs to capture spatio-temporal information. In addition in consecutive frames and have been shown to benefit from
to concatenating LSTM with the fully connected layer of learning spatial features in conjunction with temporal features.
CNN, a hypercolumn-based system [207] extracted the last RNN and its variations (e.g., LSTM, IRNN and BRNN) and C3D
convolutional layer features as the input of the LSTM for longer are foundational networks for learning spatio-temporal features.
range dependencies without losing global coherence. Instead However, the performance of these networks is barely satisfac-
of LSTM, the conditional random fields (CRFs) model [208] tory. RNN is incapable of capturing the powerful convolutional
that are effective in recognizing human activities was employed features. And 3D filers in C3D are applied over very short
in [55] to distinguish the temporal relations of the input sequences. video clips ignoring long-range dynamics. Also, training such
a huge network is computationally a problem, especially for
Network ensemble: A two-stream CNN for action recognition in dynamic FER where video data is insufficient. Alternatively,
videos, which trained one stream of the CNN on the multi-frame facial landmark trajectory methods extract shape features based
dense optical flow for temporal information and the other stream on the physical structures of facial morphological variations to
of the CNN on still images for appearance features and then fused capture dynamic facial component activities, and then apply deep
the outputs of two streams, was introduced by Simonyan et al. networks for classification. This method is computationally simple
[209]. Inspired by this architecture, several network ensemble and can get rid of the issue on illumination variations. However,
models have been proposed for FER. it is sensitive to registration errors and requires accurate facial
Sun et al. [185] proposed a multi-channel network that ex- landmark detection, which is difficult to access in unconstrained
tracted the spatial information from emotion-expressing faces and conditions. Consequently, this method performs less well and is
temporal information (optical flow) from the changes between more suitable to complement appearance representations. Network
emotioanl and neutral faces, and investigated three feature fusion ensemble is utilized to train multiple networks for both spatial and
strategies: score average fusion, SVM-based fusion and neural- temporal information and then to fuse the network outputs in the
network-based fusion. Zhang et al. [68] fused the temporal net- final stage. Optic flow and facial landmark trajectory can be used
work PHRNN (discussed in “Landmark trajectory”) and the as temporal representations to collaborate spatial representations.
spatial network MSCNN (discussed in section 4.1.5) to extract One of the drawbacks of this framework is the pre-computing
the partial-whole, geometry-appearance, and static-dynamic in- and storage consumption on optical flow or landmark trajectory
formation for FER (see Fig. 13). Instead of fusing the network vectors. And most related researches randomly selected fixed-
outputs with different weights, Jung et al. [16] proposed a joint length video frames as input, leading to the loss of useful temporal
fine-tuning method that jointly trained the DTAN (discussed in information. Cascaded networks were proposed to first extract
the “RNN and C3D” ), the DTGN (discussed in the “Landmark discriminative representations for facial expression images and

TABLE 8
Comparison of different types of methods for dynamic image sequences in terms of data size requirement, representability of spatial and temporal
information, requirement on frame length, performance, and computational efficiency. F LT = Facial Landmark Trajectory; CN = Cascaded
Network; N E = Network Ensemble.

Network type data spatial temporal frame length accuracy efficiency


Frame aggregation low good no depends fair high
Expression intensity fair good low fixed fair varies
RNN low low good variable low fair
Spatio- C3D high good fair fixed low fair
temporal FLT fair fair fair fixed low high
network CN high good good variable good fair
NE low good good fixed good low

then input these features to sequential networks to reinforce the [215] proposed a multi-channel pose-aware CNN (MPCNN) that
temporal information encoding. However, this model introduces contains three cascaded parts (multi-channel feature extraction,
additional parameters to capture sequence information, and the jointly multi-scale feature fusion and the pose-aware recognition)
feature learning network (e.g., CNN) and the temporal information to predict expression labels by minimizing the conditional joint
encoding network (e.g., LSTM) in current works are not trained loss of pose and expression recognition. Besides, the technology
jointly, which may lead to suboptimal parameter settings. And of generative adversarial network (GAN) has been employed in
training in an end-to-end fashion is still a long road. [180], [181] to generate facial images with different expressions
Compared with deep networks on static data, Table 4 and Table under arbitrary poses for multi-view FER.
7 demonstrate the powerful capability and popularity trend of deep
spatio-temporal networks. For instance, comparison results on 5.2 FER on infrared data
widely evaluated benchmarks (e.g., CK+ and MMI) illustrate that
Although RBG or gray data are the current standard in deep FER,
training networks based on sequence data and analyzing temporal
these data are vulnerable to ambient lighting conditions. While,
dependency between frames can further improve the performance.
infrared images that record the skin temporal distribution produced
Also, in the EmotiW challenge 2015, only one system employed
by emotions are not sensitive to illumination variations, which may
deep spatio-networks for FER, whereas 5 of 7 reviewed systems
be a promising alternative for investigation of facial expression.
in the EmotiW challenge 2017 relied on such networks.
For example, He et al. [216] employed a DBM model that consists
of a Gaussian-binary RBM and a binary RBM for FER. The model
5 A DDITIONAL R ELATED I SSUES was trained by layerwise pre-training and joint training and was
In addition to the most popular basic expression classification then fine-tuned on long-wavelength thermal infrared images to
task reviewed above, we further introduce a few related issues learn thermal features. Wu et al. [217] proposed a three-stream
that depend on deep neural networks and prototypical expression- 3D CNN to fuse local and global spatio-temporal features on
related knowledge. illumination-invariant near-infrared images for FER.

5.1 Occlusion and non-frontal head pose 5.3 FER on 3D static and dynamic data
Occlusion and non-frontal head pose, which may change the Despite significant advances have achieved in 2D FER, it fails
visual appearance of the original facial expression, are two major to solve the two main problems: illumination changes and pose
obstacles for automatic FER, especially in real-world scenarios. variations [29]. 3D FER that uses 3D face shape models with
For facial occlusion, Ranzato et al. [210], [211] proposed a depth information can capture subtle facial deformations, which
deep generative model that used mPoT [212] as the first layer are naturally robust to pose and lighting variations.
of DBNs to model pixel-level representations and then trained Depth images and videos record the intensity of facial pixels
DBNs to fit an appropriate distribution to its inputs. Thus, the based on distance from a depth camera, which contain critical
occluded pixels in images could be filled in by reconstructing information of facial geometric relations. For example, [218] used
the top layer representation using the sequence of conditional kinect depth sensor to obtain gradient direction information and
distributions. Cheng et al. [213] employed multilayer RBMs then employed CNN on unregistered facial depth images for FER.
with a pre-training and fine-tuning process on Gabor features [219], [220] extracted a series of salient features from depth
to compress features from the occluded facial parts. Xu et al. videos and combined them with deep networks (i.e., CNN and
[214] concatenated high-level learned features transferred from DBN) for FER. To emphasize the dynamic deformation patterns
two CNNs with the same structure but pre-trained on different of facial expression motions, Li et al. [221] explore the 4D FER
data: the original MSRA-CFW database and the MSRA-CFW (3D FER using dynamic data) using a dynamic geometrical image
database with additive occluded samples. network. Furthermore, Chang et al. [222] proposed to estimate 3D
For multi-view FER, Zhang et al. [156] introduced a projection expression coefficients from image intensities using CNN without
layer into the CNN that learned discriminative facial features requiring facial landmark detection. Thus, the model is highly
by weighting different facial landmark points within 2D SIFT robust to extreme appearance variations, including out-of-plane
feature matrices without requiring facial pose estimation. Liu et al. head rotations, scale changes, and occlusions.

Recently, more and more works trend to combine 2D and 3D 6 C HALLENGES AND O PPORTUNITIES
data to further improve the performance. Oyedotun et al. [223] 6.1 Facial expression datasets
employed CNN to jointly learn facial expression features from
As the FER literature shifts its main focus to the challeng-
both RGB and depth map latent modalities. And Li et al. [224]
ing in-the-wild environmental conditions, many researchers have
proposed a deep fusion CNN (DF-CNN) to explore multi-modal
committed to employing deep learning technologies to handle
2D+3D FER. Specifically, six types of 2D facial attribute maps
difficulties, such as illumination variation, occlusions, non-frontal
(i.e., geometry, texture, curvature, normal components x, y, and z)
head poses, identity bias and the recognition of low-intensity
were first extracted from the textured 3D face scans and were
expressions. Given that FER is a data-driven task and that training
then jointly fed into the feature extraction and feature fusion
a sufficiently deep network to capture subtle expression-related
subnets to learn the optimal combination weights of 2D and 3D
deformations requires a large amount of training data, the major
facial representations. To improve this work, [225] proposed to
challenge that deep FER systems face is the lack of training data
extract deep features from different facial parts extracted from the
in terms of both quantity and quality.
texture and depth images, and then fused these features together to
Because people of different age ranges, cultures and genders
interconnect them with feedback. Wei et al. [226] further explored
display and interpret facial expression in different ways, an ideal
the data bias problem in 2D+3D FER using unsupervised domain
facial expression dataset is expected to include abundant sample
adaption technique.
images with precise face attribute labels, not just expression but
5.4 Facial expression synthesis other attributes such as age, gender and ethnicity, which would
facilitate related research on cross-age range, cross-gender and
Realistic facial expression synthesis, which can generate various
cross-cultural FER using deep learning techniques, such as mul-
facial expressions for interactive interfaces, is a hot topic. Susskind
titask deep networks and transfer learning. In addition, although
et al. [227] demonstrated that DBN has the capacity to capture
occlusion and multipose problems have received relatively wide
the large range of variation in expressive appearance and can be
interest in the field of deep face recognition, the occlusion-
trained on large but sparsely labeled datasets. In light of this work,
robust and pose-invariant issues have receive less attention in deep
[210], [211], [228] employed DBN with unsupervised learning
FER. One of the main reasons is the lack of a large-scale facial
to construct facial expression synthesis systems. Kaneko et al.
expression dataset with occlusion type and head-pose annotations.
[149] proposed a multitask deep network with state recognition
On the other hand, accurately annotating a large volume of
and key-point localization to adaptively generate visual feedback
image data with the large variation and complexity of natural
to improve facial expression recognition. With the recent success
scenarios is an obvious impediment to the construction of expres-
of the deep generative models, such as variational autoencoder
sion datasets. A reasonable approach is to employ crowd-sourcing
(VAE), adversarial autoencoder (AAE), and generative adversarial
models [44], [46], [249] under the guidance of expert annotators.
network (GAN), a series of facial expression synthesis systems
Additionally, a fully automatic labeling tool [43] refined by experts
have been developed based on these models (e.g., [229], [230],
is alternative to provide approximate but efficient annotations. In
[231], [232] and [233]). Facial expression synthesis can also be
both cases, a subsequent reliable estimation or labeling learning
applied to data augmentation without manually collecting and
process is necessary to filter out noisy annotations. In particular,
labeling huge datasets. Masi et al. [234] employed CNN to
few comparatively large-scale datasets that consider real-world
synthesize new face images by increasing face-specific appearance
scenarios and contain a wide range of facial expressions have
variation, such as expressions within the 3D textured face model.
recently become publicly available, i.e., EmotioNet [43], RAF-
5.5 Visualization techniques DB [44], [45] and AffectNet [46], and we anticipate that with
advances in technology and the wide spread of the Internet, more
In addition to utilizing CNN for FER, several works (e.g., [139],
complementary facial expression datasets will be constructed to
[235], [236]) employed visualization techniques [237] on the
promote the development of deep FER.
learned CNN features to qualitatively analyze how the CNN
contributes to the appearance-based learning process of FER and
to qualitatively decipher which portions of the face yield the 6.2 Incorporating other affective models
most discriminative information. The deconvolutional results all Another major issue that requires consideration is that while FER
indicated that the activations of some particular filters on the within the categorical model is widely acknowledged and re-
learned features have strong correlations with the face regions that searched, the definition of the prototypical expressions covers only
correspond to facial AUs. a small portion of specific categories and cannot capture the full
repertoire of expressive behaviors for realistic interactions. Two
5.6 Other special issues additional models were developed to describe a larger range of
Several novel issues have been approached on the basis of the emotional landscape: the FACS model [10], [176], where various
prototypical expression categories: dominant and complementary facial muscle AUs are combined to describe the visible appearance
emotion recognition challenge [238] and the Real versus Fake changes of facial expressions, and the dimensional model [11],
expressed emotions challenge [239]. Furthermore, deep learning [250], where two continuous-valued variables, namely, valence
techniques have been thoroughly applied by the participants of and arousal, are proposed to continuously encode small changes in
these two challenges (e.g., [240], [241], [242]). Additional related the intensity of emotions. Another novel definition, i.e., compound
real-world applications, such as the Real-time FER App for expression, was proposed by Du et al. [52], who argued that some
smartphones [243], [244], Eyemotion (FER using eye-tracking facial expressions are actually combinations of more than one
cameras) [245], privacy-preserving mobile analytics [246], Unfelt basic emotion. These works improve the characterization of facial
emotions [247] and Depression recognition [248], have also been expressions and, to some extent, can complement the categorical
developed. model. For instance, as discussed above, the visualization results

of CNNs have demonstrated a certain congruity between the [4] P. Ekman, “Strong evidence for universals in facial expressions: a reply
learned representations and the facial areas defined by AUs. Thus, to russell’s mistaken critique,” Psychological bulletin, vol. 115, no. 2,
pp. 268–287, 1994.
we can design filters of the deep neural networks to distribute
[5] D. Matsumoto, “More evidence for the universality of a contempt
different weights according to the importance degree of different expression,” Motivation and Emotion, vol. 16, no. 4, pp. 363–368, 1992.
facial muscle action parts. [6] R. E. Jack, O. G. Garrod, H. Yu, R. Caldara, and P. G. Schyns, “Facial
expressions of emotion are not culturally universal,” Proceedings of the
National Academy of Sciences, vol. 109, no. 19, pp. 7241–7244, 2012.
6.3 Dataset bias and imbalanced distribution [7] Z. Zeng, M. Pantic, G. I. Roisman, and T. S. Huang, “A survey of affect
recognition methods: Audio, visual, and spontaneous expressions,”
Data bias and inconsistent annotations are very common among IEEE transactions on pattern analysis and machine intelligence, vol. 31,
different facial expression datasets due to different collecting no. 1, pp. 39–58, 2009.
conditions and the subjectiveness of annotating. Researchers com- [8] E. Sariyanidi, H. Gunes, and A. Cavallaro, “Automatic analysis of facial
monly evaluate their algorithms within a specific dataset and can affect: A survey of registration, representation, and recognition,” IEEE
transactions on pattern analysis and machine intelligence, vol. 37, no. 6,
achieve satisfactory performance. However, early cross-database pp. 1113–1133, 2015.
experiments have indicated that discrepancies between databases [9] B. Martinez and M. F. Valstar, “Advances, challenges, and opportuni-
exist due to the different collection environments and construction ties in automatic facial expression recognition,” in Advances in Face
indicators [12]; hence, algorithms evaluated via intra-database Detection and Facial Image Analysis. Springer, 2016, pp. 63–100.
protocols lack generalizability on unseen test data, and the per- [10] P. Ekman, “Facial action coding system (facs),” A human face, 2002.
[11] H. Gunes and B. Schuller, “Categorical and dimensional affect analysis
formance in cross-dataset settings is greatly deteriorated. Deep
in continuous input: Current trends and future directions,” Image and
domain adaption and knowledge distillation are alternatives to Vision Computing, vol. 31, no. 2, pp. 120–136, 2013.
address this bias [226], [251]. Furthermore, because of the in- [12] C. Shan, S. Gong, and P. W. McOwan, “Facial expression recognition
consistent expression annotations, FER performance cannot keep based on local binary patterns: A comprehensive study,” Image and
improving when enlarging the training data by directly merging Vision Computing, vol. 27, no. 6, pp. 803–816, 2009.
[13] P. Liu, S. Han, Z. Meng, and Y. Tong, “Facial expression recognition via
multiple datasets [167]. a boosted deep belief network,” in Proceedings of the IEEE Conference
Another common problem in facial expression datasets is class imbalance, which is a result of the practicalities of data acquisition: eliciting and annotating a smile is easy, whereas capturing information for disgust, anger and other less common expressions can be very challenging. As shown in Table 4 and Table 7, performance assessed in terms of mean accuracy, which assigns equal weight to every class, decreases compared with the overall accuracy criterion, and this decline is especially evident on real-world datasets (e.g., SFEW 2.0 and AFEW). One solution is to balance the class distribution during the pre-processing stage using data augmentation and synthesis. An alternative is to employ a cost-sensitive loss layer for the deep network during training.
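Both remedies can be sketched in a few lines of PyTorch, assuming a seven-class label set with illustrative per-class counts; neither snippet reproduces a particular published FER pipeline.

```python
import torch
import torch.nn as nn
from torch.utils.data import WeightedRandomSampler

# Hypothetical per-class image counts for a 7-class FER training set.
class_counts = torch.tensor([3995., 436., 4097., 7215., 4830., 3171., 4965.])

# (1) Cost-sensitive loss: weight each class inversely to its frequency so that
#     rare expressions (e.g., disgust or fear) contribute as much as frequent ones.
class_weights = class_counts.sum() / (len(class_counts) * class_counts)
criterion = nn.CrossEntropyLoss(weight=class_weights)

# (2) Data-level re-balancing: oversample minority classes so that mini-batches
#     are approximately class-balanced during pre-processing/sampling.
labels = torch.randint(0, 7, (1000,))                  # placeholder training labels
sample_weights = class_weights[labels]
sampler = WeightedRandomSampler(sample_weights, num_samples=len(labels), replacement=True)
# A DataLoader(train_set, batch_size=64, sampler=sampler) would then draw
# roughly balanced batches without modifying the loss.
```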
6.4 Multimodal affect recognition

Last but not least, human expressive behaviors in realistic applications involve encoding from different perspectives, and the facial expression is only one modality. Although pure expression recognition based on visible face images can achieve promising results, incorporating other modalities into a high-level framework can provide complementary information and further enhance the robustness. For example, participants in the EmotiW challenges and the Audio Video Emotion Challenges (AVEC) [252], [253] considered the audio modality to be the second most important element and employed various fusion techniques for multimodal affect recognition. Additionally, the fusion of other modalities, such as infrared images, depth information from 3D face models and physiological data, is becoming a promising research direction due to their large complementarity with facial expressions.
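As a simple example of such fusion, the sketch below performs score-level (late) fusion of a visual branch and an audio branch by averaging their class posteriors, one of the most common schemes among challenge participants; the branch networks, feature dimensions and fusion weight are illustrative assumptions rather than a specific published system.

```python
import torch
import torch.nn as nn


class LateFusion(nn.Module):
    """Score-level fusion of a visual branch and an audio branch (hypothetical branches)."""

    def __init__(self, visual_net: nn.Module, audio_net: nn.Module, alpha: float = 0.7):
        super().__init__()
        self.visual_net = visual_net   # maps face features/frames to per-class logits
        self.audio_net = audio_net     # maps audio features to per-class logits
        self.alpha = alpha             # relative weight of the visual stream

    def forward(self, face, audio):
        p_visual = torch.softmax(self.visual_net(face), dim=-1)
        p_audio = torch.softmax(self.audio_net(audio), dim=-1)
        # Weighted average of class posteriors; alpha can be tuned on a validation
        # set or replaced by a small learned fusion layer for adaptive weighting.
        return self.alpha * p_visual + (1.0 - self.alpha) * p_audio


# Example with stand-in linear branches over precomputed 512-D visual and 128-D audio features.
fusion = LateFusion(nn.Linear(512, 7), nn.Linear(128, 7))
scores = fusion(torch.randn(4, 512), torch.randn(4, 128))   # (batch, 7) fused posteriors
```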
REFERENCES
[1] C. Darwin and P. Prodger, The expression of the emotions in man and animals. Oxford University Press, USA, 1998.
[2] Y.-I. Tian, T. Kanade, and J. F. Cohn, “Recognizing action units for facial expression analysis,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 23, no. 2, pp. 97–115, 2001.
[3] P. Ekman and W. V. Friesen, “Constants across cultures in the face and emotion,” Journal of Personality and Social Psychology, vol. 17, no. 2, pp. 124–129, 1971.
[7] Z. Zeng, M. Pantic, G. I. Roisman, and T. S. Huang, “A survey of affect recognition methods: Audio, visual, and spontaneous expressions,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 31, no. 1, pp. 39–58, 2009.
[8] E. Sariyanidi, H. Gunes, and A. Cavallaro, “Automatic analysis of facial affect: A survey of registration, representation, and recognition,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 37, no. 6, pp. 1113–1133, 2015.
[9] B. Martinez and M. F. Valstar, “Advances, challenges, and opportunities in automatic facial expression recognition,” in Advances in Face Detection and Facial Image Analysis. Springer, 2016, pp. 63–100.
[10] P. Ekman, “Facial action coding system (facs),” A Human Face, 2002.
[11] H. Gunes and B. Schuller, “Categorical and dimensional affect analysis in continuous input: Current trends and future directions,” Image and Vision Computing, vol. 31, no. 2, pp. 120–136, 2013.
[12] C. Shan, S. Gong, and P. W. McOwan, “Facial expression recognition based on local binary patterns: A comprehensive study,” Image and Vision Computing, vol. 27, no. 6, pp. 803–816, 2009.
[13] P. Liu, S. Han, Z. Meng, and Y. Tong, “Facial expression recognition via a boosted deep belief network,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 1805–1812.
[14] A. Mollahosseini, D. Chan, and M. H. Mahoor, “Going deeper in facial expression recognition using deep neural networks,” in Applications of Computer Vision (WACV), 2016 IEEE Winter Conference on. IEEE, 2016, pp. 1–10.
[15] G. Zhao and M. Pietikainen, “Dynamic texture recognition using local binary patterns with an application to facial expressions,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 29, no. 6, pp. 915–928, 2007.
[16] H. Jung, S. Lee, J. Yim, S. Park, and J. Kim, “Joint fine-tuning in deep neural networks for facial expression recognition,” in Computer Vision (ICCV), 2015 IEEE International Conference on. IEEE, 2015, pp. 2983–2991.
[17] X. Zhao, X. Liang, L. Liu, T. Li, Y. Han, N. Vasconcelos, and S. Yan, “Peak-piloted deep network for facial expression recognition,” in European Conference on Computer Vision. Springer, 2016, pp. 425–442.
[18] C. A. Corneanu, M. O. Simón, J. F. Cohn, and S. E. Guerrero, “Survey on rgb, 3d, thermal, and multimodal approaches for facial expression recognition: History, trends, and affect-related applications,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 38, no. 8, pp. 1548–1568, 2016.
[19] R. Zhi, M. Flierl, Q. Ruan, and W. B. Kleijn, “Graph-preserving sparse nonnegative matrix factorization with application to facial expression recognition,” IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), vol. 41, no. 1, pp. 38–52, 2011.
[20] L. Zhong, Q. Liu, P. Yang, B. Liu, J. Huang, and D. N. Metaxas, “Learning active facial patches for expression analysis,” in Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on. IEEE, 2012, pp. 2562–2569.
[21] I. J. Goodfellow, D. Erhan, P. L. Carrier, A. Courville, M. Mirza, B. Hamner, W. Cukierski, Y. Tang, D. Thaler, D.-H. Lee et al., “Challenges in representation learning: A report on three machine learning contests,” in International Conference on Neural Information Processing. Springer, 2013, pp. 117–124.
[22] A. Dhall, O. Ramana Murthy, R. Goecke, J. Joshi, and T. Gedeon, “Video and image based emotion recognition challenges in the wild: Emotiw 2015,” in Proceedings of the 2015 ACM on International Conference on Multimodal Interaction. ACM, 2015, pp. 423–426.
[23] A. Dhall, R. Goecke, J. Joshi, J. Hoey, and T. Gedeon, “Emotiw 2016: Video and group-level emotion recognition challenges,” in Proceedings of the 18th ACM International Conference on Multimodal Interaction. ACM, 2016, pp. 427–432.
[24] A. Dhall, R. Goecke, S. Ghosh, J. Joshi, J. Hoey, and T. Gedeon, “From individual to group-level emotion recognition: Emotiw 5.0,” in Proceedings of the 19th ACM International Conference on Multimodal Interaction. ACM, 2017, pp. 524–528.
[25] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classifica- [48] A. Dhall, R. Goecke, S. Lucey, T. Gedeon et al., “Collecting large, richly
tion with deep convolutional neural networks,” in Advances in neural annotated facial-expression databases from movies,” IEEE multimedia,
information processing systems, 2012, pp. 1097–1105. vol. 19, no. 3, pp. 34–41, 2012.
[26] K. Simonyan and A. Zisserman, “Very deep convolutional networks for [49] A. Dhall, R. Goecke, S. Lucey, and T. Gedeon, “Acted facial expres-
large-scale image recognition,” arXiv preprint arXiv:1409.1556, 2014. sions in the wild database,” Australian National University, Canberra,
[27] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, Australia, Technical Report TR-CS-11, vol. 2, p. 1, 2011.
V. Vanhoucke, and A. Rabinovich, “Going deeper with convolutions,” [50] ——, “Static facial expression analysis in tough conditions: Data,
in Proceedings of the IEEE conference on computer vision and pattern evaluation protocol and benchmark,” in Computer Vision Workshops
recognition, 2015, pp. 1–9. (ICCV Workshops), 2011 IEEE International Conference on. IEEE,
[28] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image 2011, pp. 2106–2112.
recognition,” in Proceedings of the IEEE conference on computer vision [51] C. F. Benitez-Quiroz, R. Srinivasan, Q. Feng, Y. Wang, and A. M.
and pattern recognition, 2016, pp. 770–778. Martinez, “Emotionet challenge: Recognition of facial expressions of
[29] M. Pantic and L. J. M. Rothkrantz, “Automatic analysis of facial emotion in the wild,” arXiv preprint arXiv:1703.01210, 2017.
expressions: The state of the art,” IEEE Transactions on pattern analysis [52] S. Du, Y. Tao, and A. M. Martinez, “Compound facial expressions of
and machine intelligence, vol. 22, no. 12, pp. 1424–1445, 2000. emotion,” Proceedings of the National Academy of Sciences, vol. 111,
[30] B. Fasel and J. Luettin, “Automatic facial expression analysis: a survey,” no. 15, pp. E1454–E1462, 2014.
Pattern recognition, vol. 36, no. 1, pp. 259–275, 2003. [53] T. F. Cootes, G. J. Edwards, and C. J. Taylor, “Active appearance mod-
[31] T. Zhang, “Facial expression recognition based on deep learning: A els,” IEEE Transactions on Pattern Analysis & Machine Intelligence,
survey,” in International Conference on Intelligent and Interactive no. 6, pp. 681–685, 2001.
Systems and Applications. Springer, 2017, pp. 345–352. [54] N. Zeng, H. Zhang, B. Song, W. Liu, Y. Li, and A. M. Dobaie,
[32] M. F. Valstar, M. Mehu, B. Jiang, M. Pantic, and K. Scherer, “Meta- “Facial expression recognition via learning deep sparse autoencoders,”
analysis of the first facial expression recognition challenge,” IEEE Neurocomputing, vol. 273, pp. 643–649, 2018.
Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), [55] B. Hasani and M. H. Mahoor, “Spatio-temporal facial expression recog-
vol. 42, no. 4, pp. 966–979, 2012. nition using convolutional neural networks and conditional random
[33] P. Lucey, J. F. Cohn, T. Kanade, J. Saragih, Z. Ambadar, and fields,” in Automatic Face & Gesture Recognition (FG 2017), 2017
I. Matthews, “The extended cohn-kanade dataset (ck+): A complete 12th IEEE International Conference on. IEEE, 2017, pp. 790–795.
dataset for action unit and emotion-specified expression,” in Computer [56] X. Zhu and D. Ramanan, “Face detection, pose estimation, and land-
Vision and Pattern Recognition Workshops (CVPRW), 2010 IEEE Com- mark localization in the wild,” in Computer Vision and Pattern Recogni-
puter Society Conference on. IEEE, 2010, pp. 94–101. tion (CVPR), 2012 IEEE Conference on. IEEE, 2012, pp. 2879–2886.
[34] M. Pantic, M. Valstar, R. Rademaker, and L. Maat, “Web-based database [57] S. E. Kahou, C. Pal, X. Bouthillier, P. Froumenty, Ç. Gülçehre,
for facial expression analysis,” in Multimedia and Expo, 2005. ICME R. Memisevic, P. Vincent, A. Courville, Y. Bengio, R. C. Ferrari
2005. IEEE International Conference on. IEEE, 2005, pp. 5–pp. et al., “Combining modality specific deep neural networks for emotion
[35] M. Valstar and M. Pantic, “Induced disgust, happiness and surprise: an recognition in video,” in Proceedings of the 15th ACM on International
addition to the mmi facial expression database,” in Proc. 3rd Intern. conference on multimodal interaction. ACM, 2013, pp. 543–550.
Workshop on EMOTION (satellite of LREC): Corpora for Research on [58] T. Devries, K. Biswaranjan, and G. W. Taylor, “Multi-task learning of
Emotion and Affect, 2010, p. 65. facial landmarks and expression,” in Computer and Robot Vision (CRV),
[36] M. Lyons, S. Akamatsu, M. Kamachi, and J. Gyoba, “Coding facial 2014 Canadian Conference on. IEEE, 2014, pp. 98–103.
expressions with gabor wavelets,” in Automatic Face and Gesture [59] A. Asthana, S. Zafeiriou, S. Cheng, and M. Pantic, “Robust discrimina-
Recognition, 1998. Proceedings. Third IEEE International Conference tive response map fitting with constrained local models,” in Computer
on. IEEE, 1998, pp. 200–205. Vision and Pattern Recognition (CVPR), 2013 IEEE Conference on.
[37] J. M. Susskind, A. K. Anderson, and G. E. Hinton, “The toronto face IEEE, 2013, pp. 3444–3451.
database,” Department of Computer Science, University of Toronto, [60] M. Shin, M. Kim, and D.-S. Kwon, “Baseline cnn structure analysis
Toronto, ON, Canada, Tech. Rep, vol. 3, 2010. for facial expression recognition,” in Robot and Human Interactive
[38] R. Gross, I. Matthews, J. Cohn, T. Kanade, and S. Baker, “Multi-pie,” Communication (RO-MAN), 2016 25th IEEE International Symposium
Image and Vision Computing, vol. 28, no. 5, pp. 807–813, 2010. on. IEEE, 2016, pp. 724–729.
[39] L. Yin, X. Wei, Y. Sun, J. Wang, and M. J. Rosato, “A 3d facial [61] Z. Meng, P. Liu, J. Cai, S. Han, and Y. Tong, “Identity-aware convo-
expression database for facial behavior research,” in Automatic face lutional neural network for facial expression recognition,” in Automatic
and gesture recognition, 2006. FGR 2006. 7th international conference Face & Gesture Recognition (FG 2017), 2017 12th IEEE International
on. IEEE, 2006, pp. 211–216. Conference on. IEEE, 2017, pp. 558–565.
[40] G. Zhao, X. Huang, M. Taini, S. Z. Li, and M. PietikäInen, “Facial [62] X. Xiong and F. De la Torre, “Supervised descent method and its appli-
expression recognition from near-infrared videos,” Image and Vision cations to face alignment,” in Computer Vision and Pattern Recognition
Computing, vol. 29, no. 9, pp. 607–619, 2011. (CVPR), 2013 IEEE Conference on. IEEE, 2013, pp. 532–539.
[41] O. Langner, R. Dotsch, G. Bijlstra, D. H. Wigboldus, S. T. Hawk, and [63] H.-W. Ng, V. D. Nguyen, V. Vonikakis, and S. Winkler, “Deep learning
A. van Knippenberg, “Presentation and validation of the radboud faces for emotion recognition on small datasets using transfer learning,” in
database,” Cognition and Emotion, vol. 24, no. 8, pp. 1377–1388, 2010. Proceedings of the 2015 ACM on international conference on multi-
[42] D. Lundqvist, A. Flykt, and A. Öhman, “The karolinska directed modal interaction. ACM, 2015, pp. 443–449.
emotional faces (kdef),” CD ROM from Department of Clinical Neu- [64] S. Ren, X. Cao, Y. Wei, and J. Sun, “Face alignment at 3000 fps
roscience, Psychology section, Karolinska Institutet, no. 1998, 1998. via regressing local binary features,” in Proceedings of the IEEE
[43] C. F. Benitez-Quiroz, R. Srinivasan, and A. M. Martinez, “Emotionet: Conference on Computer Vision and Pattern Recognition, 2014, pp.
An accurate, real-time algorithm for the automatic annotation of a 1685–1692.
million facial expressions in the wild,” in Proceedings of IEEE Interna- [65] A. Asthana, S. Zafeiriou, S. Cheng, and M. Pantic, “Incremental face
tional Conference on Computer Vision & Pattern Recognition (CVPR), alignment in the wild,” in Proceedings of the IEEE conference on
Las Vegas, NV, USA, 2016. computer vision and pattern recognition, 2014, pp. 1859–1866.
[44] S. Li, W. Deng, and J. Du, “Reliable crowdsourcing and deep locality- [66] D. H. Kim, W. Baddar, J. Jang, and Y. M. Ro, “Multi-objective based
preserving learning for expression recognition in the wild,” in 2017 spatio-temporal feature representation learning robust to expression in-
IEEE Conference on Computer Vision and Pattern Recognition (CVPR). tensity variations for facial expression recognition,” IEEE Transactions
IEEE, 2017, pp. 2584–2593. on Affective Computing, 2017.
[45] S. Li and W. Deng, “Reliable crowdsourcing and deep locality- [67] Y. Sun, X. Wang, and X. Tang, “Deep convolutional network cascade
preserving learning for unconstrained facial expression recognition,” for facial point detection,” in Computer Vision and Pattern Recognition
IEEE Transactions on Image Processing, 2018. (CVPR), 2013 IEEE Conference on. IEEE, 2013, pp. 3476–3483.
[46] A. Mollahosseini, B. Hasani, and M. H. Mahoor, “Affectnet: A database [68] K. Zhang, Y. Huang, Y. Du, and L. Wang, “Facial expression recognition
for facial expression, valence, and arousal computing in the wild,” IEEE based on deep evolutional spatial-temporal networks,” IEEE Transac-
Transactions on Affective Computing, vol. PP, no. 99, pp. 1–1, 2017. tions on Image Processing, vol. 26, no. 9, pp. 4193–4203, 2017.
[47] Z. Zhang, P. Luo, C. L. Chen, and X. Tang, “From facial expression [69] K. Zhang, Z. Zhang, Z. Li, and Y. Qiao, “Joint face detection and
recognition to interpersonal relation prediction,” International Journal alignment using multitask cascaded convolutional networks,” IEEE
of Computer Vision, vol. 126, no. 5, pp. 1–20, 2018. Signal Processing Letters, vol. 23, no. 10, pp. 1499–1503, 2016.
[70] Z. Yu, Q. Liu, and G. Liu, “Deeper cascaded peak-piloted network for International Conference on Multimodal Interaction. ACM, 2016, pp.
weak expression recognition,” The Visual Computer, pp. 1–9, 2017. 472–478.
[71] Z. Yu, G. Liu, Q. Liu, and J. Deng, “Spatio-temporal convolutional [91] P. Hu, D. Cai, S. Wang, A. Yao, and Y. Chen, “Learning supervised
features with nested lstm for facial expression recognition,” Neurocom- scoring ensemble for emotion recognition in the wild,” in Proceedings
puting, vol. 317, pp. 50–57, 2018. of the 19th ACM International Conference on Multimodal Interaction.
[72] P. Viola and M. Jones, “Rapid object detection using a boosted cascade ACM, 2017, pp. 553–560.
of simple features,” in Computer Vision and Pattern Recognition, [92] T. Hassner, S. Harel, E. Paz, and R. Enbar, “Effective face frontalization
2001. CVPR 2001. Proceedings of the 2001 IEEE Computer Society in unconstrained images,” in Proceedings of the IEEE Conference on
Conference on, vol. 1. IEEE, 2001, pp. I–I. Computer Vision and Pattern Recognition, 2015, pp. 4295–4304.
[73] F. De la Torre, W.-S. Chu, X. Xiong, F. Vicente, X. Ding, and J. F. Cohn, [93] C. Sagonas, Y. Panagakis, S. Zafeiriou, and M. Pantic, “Robust sta-
“Intraface,” in IEEE International Conference on Automatic Face and tistical face frontalization,” in Proceedings of the IEEE International
Gesture Recognition (FG), 2015. Conference on Computer Vision, 2015, pp. 3871–3879.
[74] Z. Zhang, P. Luo, C. C. Loy, and X. Tang, “Facial landmark detection by [94] X. Yin, X. Yu, K. Sohn, X. Liu, and M. Chandraker, “Towards large-
deep multi-task learning,” in European Conference on Computer Vision. pose face frontalization in the wild,” in Proceedings of the IEEE
Springer, 2014, pp. 94–108. Conference on Computer Vision and Pattern Recognition, 2017, pp.
[75] Z. Yu and C. Zhang, “Image based static facial expression recognition 3990–3999.
with multiple deep network learning,” in Proceedings of the 2015 ACM [95] R. Huang, S. Zhang, T. Li, and R. He, “Beyond face rotation: Global and
on International Conference on Multimodal Interaction. ACM, 2015, local perception gan for photorealistic and identity preserving frontal
pp. 435–442. view synthesis,” in Proceedings of the IEEE Conference on Computer
[76] B.-K. Kim, H. Lee, J. Roh, and S.-Y. Lee, “Hierarchical committee Vision and Pattern Recognition, 2017, pp. 2439–2448.
of deep cnns with exponentially-weighted decision fusion for static [96] L. Tran, X. Yin, and X. Liu, “Disentangled representation learning
facial expression recognition,” in Proceedings of the 2015 ACM on gan for pose-invariant face recognition,” in Proceedings of the IEEE
International Conference on Multimodal Interaction. ACM, 2015, Conference on Computer Vision and Pattern Recognition, 2017, pp.
pp. 427–434. 1415–1424.
[77] X. Liu, B. Kumar, J. You, and P. Jia, “Adaptive deep metric learning [97] L. Deng, D. Yu et al., “Deep learning: methods and applications,”
for identity-aware facial expression recognition,” in Proc. IEEE Conf. Foundations and Trends R in Signal Processing, vol. 7, no. 3–4, pp.
Comput. Vis. Pattern Recognit. Workshops (CVPRW), 2017, pp. 522– 197–387, 2014.
531. [98] B. Fasel, “Robust face analysis using convolutional neural networks,” in
[78] G. Levi and T. Hassner, “Emotion recognition in the wild via convolu- Pattern Recognition, 2002. Proceedings. 16th International Conference
tional neural networks and mapped binary patterns,” in Proceedings of on, vol. 2. IEEE, 2002, pp. 40–43.
the 2015 ACM on international conference on multimodal interaction. [99] ——, “Head-pose invariant facial expression recognition using convo-
ACM, 2015, pp. 503–510. lutional neural networks,” in Proceedings of the 4th IEEE International
[79] D. A. Pitaloka, A. Wulandari, T. Basaruddin, and D. Y. Liliana, Conference on Multimodal Interfaces. IEEE Computer Society, 2002,
“Enhancing cnn with preprocessing stage in automatic emotion recog- p. 529.
nition,” Procedia Computer Science, vol. 116, pp. 523–529, 2017. [100] M. Matsugu, K. Mori, Y. Mitari, and Y. Kaneda, “Subject independent
[80] A. T. Lopes, E. de Aguiar, A. F. De Souza, and T. Oliveira-Santos, facial expression recognition with robust face detection using a convolu-
“Facial expression recognition with convolutional neural networks: cop- tional neural network,” Neural Networks, vol. 16, no. 5-6, pp. 555–559,
ing with few data and the training sample order,” Pattern Recognition, 2003.
vol. 61, pp. 610–628, 2017. [101] B. Sun, L. Li, G. Zhou, X. Wu, J. He, L. Yu, D. Li, and Q. Wei,
[81] M. V. Zavarez, R. F. Berriel, and T. Oliveira-Santos, “Cross-database “Combining multimodal features within a fusion network for emotion
facial expression recognition based on fine-tuned deep convolutional recognition in the wild,” in Proceedings of the 2015 ACM on Interna-
network,” in Graphics, Patterns and Images (SIBGRAPI), 2017 30th tional Conference on Multimodal Interaction. ACM, 2015, pp. 497–
SIBGRAPI Conference on. IEEE, 2017, pp. 405–412. 502.
[82] W. Li, M. Li, Z. Su, and Z. Zhu, “A deep-learning approach to [102] B. Sun, L. Li, G. Zhou, and J. He, “Facial expression recognition in
facial expression recognition with candid images,” in Machine Vision the wild based on multimodal texture features,” Journal of Electronic
Applications (MVA), 2015 14th IAPR International Conference on. Imaging, vol. 25, no. 6, p. 061407, 2016.
IEEE, 2015, pp. 279–282. [103] R. Girshick, J. Donahue, T. Darrell, and J. Malik, “Rich feature
[83] I. Abbasnejad, S. Sridharan, D. Nguyen, S. Denman, C. Fookes, and hierarchies for accurate object detection and semantic segmentation,”
S. Lucey, “Using synthetic data to improve facial expression analysis in Proceedings of the IEEE conference on computer vision and pattern
with 3d convolutional networks,” in Proceedings of the IEEE Confer- recognition, 2014, pp. 580–587.
ence on Computer Vision and Pattern Recognition, 2017, pp. 1609– [104] J. Li, D. Zhang, J. Zhang, J. Zhang, T. Li, Y. Xia, Q. Yan, and L. Xun,
1618. “Facial expression recognition with faster r-cnn,” Procedia Computer
[84] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, Science, vol. 107, pp. 135–140, 2017.
S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial nets,” in [105] S. Ren, K. He, R. Girshick, and J. Sun, “Faster r-cnn: Towards real-time
Advances in neural information processing systems, 2014, pp. 2672– object detection with region proposal networks,” in Advances in neural
2680. information processing systems, 2015, pp. 91–99.
[85] W. Chen, M. J. Er, and S. Wu, “Illumination compensation and nor- [106] S. Ji, W. Xu, M. Yang, and K. Yu, “3d convolutional neural networks
malization for robust face recognition using discrete cosine transform for human action recognition,” IEEE transactions on pattern analysis
in logarithm domain,” IEEE Transactions on Systems, Man, and Cyber- and machine intelligence, vol. 35, no. 1, pp. 221–231, 2013.
netics, Part B (Cybernetics), vol. 36, no. 2, pp. 458–466, 2006. [107] D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri, “Learning
[86] J. Li and E. Y. Lam, “Facial expression recognition using deep neural spatiotemporal features with 3d convolutional networks,” in Computer
networks,” in Imaging Systems and Techniques (IST), 2015 IEEE Vision (ICCV), 2015 IEEE International Conference on. IEEE, 2015,
International Conference on. IEEE, 2015, pp. 1–6. pp. 4489–4497.
[87] S. Ebrahimi Kahou, V. Michalski, K. Konda, R. Memisevic, and C. Pal, [108] Y. Fan, X. Lu, D. Li, and Y. Liu, “Video-based emotion recognition
“Recurrent neural networks for emotion recognition in video,” in Pro- using cnn-rnn and c3d hybrid networks,” in Proceedings of the 18th
ceedings of the 2015 ACM on International Conference on Multimodal ACM International Conference on Multimodal Interaction. ACM,
Interaction. ACM, 2015, pp. 467–474. 2016, pp. 445–450.
[88] S. A. Bargal, E. Barsoum, C. C. Ferrer, and C. Zhang, “Emotion [109] D. Nguyen, K. Nguyen, S. Sridharan, A. Ghasemi, D. Dean, and
recognition in the wild from videos using images,” in Proceedings of the C. Fookes, “Deep spatio-temporal features for multimodal emotion
18th ACM International Conference on Multimodal Interaction. ACM, recognition,” in Applications of Computer Vision (WACV), 2017 IEEE
2016, pp. 433–436. Winter Conference on. IEEE, 2017, pp. 1215–1223.
[89] C.-M. Kuo, S.-H. Lai, and M. Sarkis, “A compact deep learning model [110] S. Ouellet, “Real-time emotion recognition for gaming using deep
for robust facial expression recognition,” in Proceedings of the IEEE convolutional network features,” arXiv preprint arXiv:1408.3750, 2014.
Conference on Computer Vision and Pattern Recognition Workshops, [111] H. Ding, S. K. Zhou, and R. Chellappa, “Facenet2expnet: Regularizing
2018, pp. 2121–2129. a deep face recognition net for expression recognition,” in Automatic
[90] A. Yao, D. Cai, P. Hu, S. Wang, L. Sha, and Y. Chen, “Holonet: towards Face & Gesture Recognition (FG 2017), 2017 12th IEEE International
robust emotion recognition in the wild,” in Proceedings of the 18th ACM Conference on. IEEE, 2017, pp. 118–126.
[112] B. Hasani and M. H. Mahoor, “Facial expression recognition using [137] M. Liu, S. Li, S. Shan, and X. Chen, “Au-aware deep networks for facial
enhanced deep 3d convolutional neural networks,” in Computer Vision expression recognition,” in Automatic Face and Gesture Recognition
and Pattern Recognition Workshops (CVPRW), 2017 IEEE Conference (FG), 2013 10th IEEE International Conference and Workshops on.
on. IEEE, 2017, pp. 2278–2288. IEEE, 2013, pp. 1–6.
[113] G. E. Hinton, S. Osindero, and Y.-W. Teh, “A fast learning algorithm for [138] ——, “Au-inspired deep networks for facial expression feature learn-
deep belief nets,” Neural computation, vol. 18, no. 7, pp. 1527–1554, ing,” Neurocomputing, vol. 159, pp. 126–136, 2015.
2006. [139] P. Khorrami, T. Paine, and T. Huang, “Do deep neural networks learn
[114] G. E. Hinton and T. J. Sejnowski, “Learning and releaming in boltz- facial action units when doing expression recognition?” arXiv preprint
mann machines,” Parallel distributed processing: Explorations in the arXiv:1510.02969v3, 2015.
microstructure of cognition, vol. 1, no. 282-317, p. 2, 1986. [140] J. Cai, Z. Meng, A. S. Khan, Z. Li, J. OReilly, and Y. Tong, “Island loss
[115] G. E. Hinton, “A practical guide to training restricted boltzmann for learning discriminative features in facial expression recognition,” in
machines,” in Neural networks: Tricks of the trade. Springer, 2012, Automatic Face & Gesture Recognition (FG 2018), 2018 13th IEEE
pp. 599–619. International Conference on. IEEE, 2018, pp. 302–309.
[116] Y. Bengio, P. Lamblin, D. Popovici, and H. Larochelle, “Greedy layer- [141] H. Yang, U. Ciftci, and L. Yin, “Facial expression recognition by de-
wise training of deep networks,” in Advances in neural information expression residue learning,” in Proceedings of the IEEE Conference on
processing systems, 2007, pp. 153–160. Computer Vision and Pattern Recognition, 2018, pp. 2168–2177.
[117] G. E. Hinton, “Training products of experts by minimizing contrastive [142] D. Hamester, P. Barros, and S. Wermter, “Face expression recognition
divergence,” Neural computation, vol. 14, no. 8, pp. 1771–1800, 2002. with a 2-channel convolutional neural network,” in Neural Networks
[118] G. E. Hinton and R. R. Salakhutdinov, “Reducing the dimensionality of (IJCNN), 2015 International Joint Conference on. IEEE, 2015, pp.
data with neural networks,” science, vol. 313, no. 5786, pp. 504–507, 1–8.
2006. [143] S. Reed, K. Sohn, Y. Zhang, and H. Lee, “Learning to disentangle fac-
[119] P. Vincent, H. Larochelle, I. Lajoie, Y. Bengio, and P.-A. Manzagol, tors of variation with manifold interaction,” in International Conference
“Stacked denoising autoencoders: Learning useful representations in a on Machine Learning, 2014, pp. 1431–1439.
deep network with a local denoising criterion,” Journal of Machine [144] Z. Zhang, P. Luo, C.-C. Loy, and X. Tang, “Learning social relation
Learning Research, vol. 11, no. Dec, pp. 3371–3408, 2010. traits from face images,” in Proceedings of the IEEE International
[120] Q. V. Le, “Building high-level features using large scale unsupervised Conference on Computer Vision, 2015, pp. 3631–3639.
learning,” in Acoustics, Speech and Signal Processing (ICASSP), 2013 [145] Y. Guo, D. Tao, J. Yu, H. Xiong, Y. Li, and D. Tao, “Deep neural
IEEE International Conference on. IEEE, 2013, pp. 8595–8598. networks with relativity learning for facial expression recognition,” in
[121] S. Rifai, P. Vincent, X. Muller, X. Glorot, and Y. Bengio, “Con- Multimedia & Expo Workshops (ICMEW), 2016 IEEE International
tractive auto-encoders: Explicit invariance during feature extraction,” Conference on. IEEE, 2016, pp. 1–6.
in Proceedings of the 28th International Conference on International [146] B.-K. Kim, S.-Y. Dong, J. Roh, G. Kim, and S.-Y. Lee, “Fusing aligned
Conference on Machine Learning. Omnipress, 2011, pp. 833–840. and non-aligned face information for automatic affect recognition in
[122] J. Masci, U. Meier, D. Cireşan, and J. Schmidhuber, “Stacked convolu- the wild: A deep learning approach,” in Proceedings of the IEEE
tional auto-encoders for hierarchical feature extraction,” in International Conference on Computer Vision and Pattern Recognition Workshops,
Conference on Artificial Neural Networks. Springer, 2011, pp. 52–59. 2016, pp. 48–57.
[123] D. P. Kingma and M. Welling, “Auto-encoding variational bayes,” arXiv
[147] C. Pramerdorfer and M. Kampel, “Facial expression recognition us-
preprint arXiv:1312.6114, 2013.
ing convolutional neural networks: State of the art,” arXiv preprint
[124] P. J. Werbos, “Backpropagation through time: what it does and how to arXiv:1612.02903, 2016.
do it,” Proceedings of the IEEE, vol. 78, no. 10, pp. 1550–1560, 1990.
[148] O. M. Parkhi, A. Vedaldi, A. Zisserman et al., “Deep face recognition.”
[125] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural
in BMVC, vol. 1, no. 3, 2015, p. 6.
computation, vol. 9, no. 8, pp. 1735–1780, 1997.
[149] T. Kaneko, K. Hiramatsu, and K. Kashino, “Adaptive visual feedback
[126] M. Mirza and S. Osindero, “Conditional generative adversarial nets,”
generation for facial expression improvement with multi-task deep
arXiv preprint arXiv:1411.1784, 2014.
neural networks,” in Proceedings of the 2016 ACM on Multimedia
[127] A. Radford, L. Metz, and S. Chintala, “Unsupervised representation
Conference. ACM, 2016, pp. 327–331.
learning with deep convolutional generative adversarial networks,”
arXiv preprint arXiv:1511.06434, 2015. [150] D. Yi, Z. Lei, S. Liao, and S. Z. Li, “Learning face representation from
scratch,” arXiv preprint arXiv:1411.7923, 2014.
[128] A. B. L. Larsen, S. K. Sønderby, H. Larochelle, and O. Winther,
“Autoencoding beyond pixels using a learned similarity metric,” arXiv [151] X. Zhang, L. Zhang, X.-J. Wang, and H.-Y. Shum, “Finding celebrities
preprint arXiv:1512.09300, 2015. in billions of web images,” IEEE Transactions on Multimedia, vol. 14,
[129] X. Chen, Y. Duan, R. Houthooft, J. Schulman, I. Sutskever, and no. 4, pp. 995–1007, 2012.
P. Abbeel, “Infogan: Interpretable representation learning by informa- [152] H.-W. Ng and S. Winkler, “A data-driven approach to cleaning large
tion maximizing generative adversarial nets,” in Advances in neural face datasets,” in Image Processing (ICIP), 2014 IEEE International
information processing systems, 2016, pp. 2172–2180. Conference on. IEEE, 2014, pp. 343–347.
[130] Y. Tang, “Deep learning using linear support vector machines,” arXiv [153] H. Kaya, F. Gürpınar, and A. A. Salah, “Video-based emotion recogni-
preprint arXiv:1306.0239, 2013. tion in the wild using deep transfer learning and score fusion,” Image
[131] A. Dapogny and K. Bailly, “Investigating deep neural forests for facial and Vision Computing, vol. 65, pp. 66–75, 2017.
expression recognition,” in Automatic Face & Gesture Recognition (FG [154] B. Knyazev, R. Shvetsov, N. Efremova, and A. Kuharenko, “Convolu-
2018), 2018 13th IEEE International Conference on. IEEE, 2018, pp. tional neural networks pretrained on large face recognition datasets for
629–633. emotion classification from video,” arXiv preprint arXiv:1711.04598,
[132] P. Kontschieder, M. Fiterau, A. Criminisi, and S. Rota Bulo, “Deep 2017.
neural decision forests,” in Proceedings of the IEEE international [155] D. G. Lowe, “Object recognition from local scale-invariant features,” in
conference on computer vision, 2015, pp. 1467–1475. Computer vision, 1999. The proceedings of the seventh IEEE interna-
[133] J. Donahue, Y. Jia, O. Vinyals, J. Hoffman, N. Zhang, E. Tzeng, and tional conference on, vol. 2. Ieee, 1999, pp. 1150–1157.
T. Darrell, “Decaf: A deep convolutional activation feature for generic [156] T. Zhang, W. Zheng, Z. Cui, Y. Zong, J. Yan, and K. Yan, “A deep neural
visual recognition,” in International conference on machine learning, network-driven feature learning method for multi-view facial expression
2014, pp. 647–655. recognition,” IEEE Transactions on Multimedia, vol. 18, no. 12, pp.
[134] A. S. Razavian, H. Azizpour, J. Sullivan, and S. Carlsson, “Cnn features 2528–2536, 2016.
off-the-shelf: an astounding baseline for recognition,” in Computer [157] Z. Luo, J. Chen, T. Takiguchi, and Y. Ariki, “Facial expression recogni-
Vision and Pattern Recognition Workshops (CVPRW), 2014 IEEE Con- tion with deep age,” in Multimedia & Expo Workshops (ICMEW), 2017
ference on. IEEE, 2014, pp. 512–519. IEEE International Conference on. IEEE, 2017, pp. 657–662.
[135] N. Otberdout, A. Kacem, M. Daoudi, L. Ballihi, and S. Berretti, “Deep [158] L. Chen, M. Zhou, W. Su, M. Wu, J. She, and K. Hirota, “Softmax
covariance descriptors for facial expression recognition,” in BMVC, regression based deep sparse autoencoder network for facial emotion
2018. recognition in human-robot interaction,” Information Sciences, vol. 428,
[136] D. Acharya, Z. Huang, D. Pani Paudel, and L. Van Gool, “Covariance pp. 49–61, 2018.
pooling for facial expression recognition,” in Proceedings of the IEEE [159] V. Mavani, S. Raman, and K. P. Miyapuram, “Facial expression
Conference on Computer Vision and Pattern Recognition Workshops, recognition using visual saliency and deep learning,” arXiv preprint
2018, pp. 367–374. arXiv:1708.08016, 2017.
[160] M. Cornia, L. Baraldi, G. Serra, and R. Cucchiara, “A deep multi-level tive adversarial networks,” in Automatic Face & Gesture Recognition
network for saliency prediction,” in Pattern Recognition (ICPR), 2016 (FG 2018), 2018 13th IEEE International Conference on. IEEE, 2018,
23rd International Conference on. IEEE, 2016, pp. 3488–3493. pp. 294–301.
[161] B.-F. Wu and C.-H. Lin, “Adaptive feature mapping for customizing [183] J. Chen, J. Konrad, and P. Ishwar, “Vgan-based image representation
deep learning based facial expression recognition model,” IEEE Access, learning for privacy-preserving facial expression recognition,” in Pro-
2018. ceedings of the IEEE Conference on Computer Vision and Pattern
[162] J. Lu, V. E. Liong, and J. Zhou, “Cost-sensitive local binary feature Recognition Workshops, 2018, pp. 1570–1579.
learning for facial age estimation,” IEEE Transactions on Image Pro- [184] Y. Kim, B. Yoo, Y. Kwak, C. Choi, and J. Kim, “Deep generative-
cessing, vol. 24, no. 12, pp. 5356–5368, 2015. contrastive networks for facial expression recognition,” arXiv preprint
[163] W. Shang, K. Sohn, D. Almeida, and H. Lee, “Understanding and arXiv:1703.07140, 2017.
improving convolutional neural networks via concatenated rectified [185] N. Sun, Q. Li, R. Huan, J. Liu, and G. Han, “Deep spatial-temporal
linear units,” in International Conference on Machine Learning, 2016, feature fusion for facial expression recognition in static images,” Pattern
pp. 2217–2225. Recognition Letters, 2017.
[164] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, “Re- [186] W. Ding, M. Xu, D. Huang, W. Lin, M. Dong, X. Yu, and H. Li,
thinking the inception architecture for computer vision,” in Proceedings “Audio and face video emotion recognition in the wild using deep
of the IEEE Conference on Computer Vision and Pattern Recognition, neural networks and small datasets,” in Proceedings of the 18th ACM
2016, pp. 2818–2826. International Conference on Multimodal Interaction. ACM, 2016, pp.
[165] C. Szegedy, S. Ioffe, V. Vanhoucke, and A. A. Alemi, “Inception-v4, 506–513.
inception-resnet and the impact of residual connections on learning.” in [187] J. Yan, W. Zheng, Z. Cui, C. Tang, T. Zhang, Y. Zong, and N. Sun,
AAAI, vol. 4, 2017, p. 12. “Multi-clue fusion for emotion recognition in the wild,” in Proceedings
[166] S. Zhao, H. Cai, H. Liu, J. Zhang, and S. Chen, “Feature selection of the 18th ACM International Conference on Multimodal Interaction.
mechanism in cnns for facial expression recognition,” in BMVC, 2018. ACM, 2016, pp. 458–463.
[167] J. Zeng, S. Shan, and X. Chen, “Facial expression recognition with [188] Z. Cui, S. Xiao, Z. Niu, S. Yan, and W. Zheng, “Recurrent shape regres-
inconsistently annotated datasets,” in Proceedings of the European sion,” IEEE Transactions on Pattern Analysis and Machine Intelligence,
Conference on Computer Vision (ECCV), 2018, pp. 222–237. 2018.
[168] Y. Wen, K. Zhang, Z. Li, and Y. Qiao, “A discriminative feature [189] X. Ouyang, S. Kawaai, E. G. H. Goh, S. Shen, W. Ding, H. Ming, and
learning approach for deep face recognition,” in European Conference D.-Y. Huang, “Audio-visual emotion recognition using deep transfer
on Computer Vision. Springer, 2016, pp. 499–515. learning and multiple temporal models,” in Proceedings of the 19th
[169] F. Schroff, D. Kalenichenko, and J. Philbin, “Facenet: A unified embed- ACM International Conference on Multimodal Interaction. ACM,
ding for face recognition and clustering,” in Proceedings of the IEEE 2017, pp. 577–582.
conference on computer vision and pattern recognition, 2015, pp. 815– [190] V. Vielzeuf, S. Pateux, and F. Jurie, “Temporal multimodal fusion for
823. video emotion classification in the wild,” in Proceedings of the 19th
[170] G. Zeng, J. Zhou, X. Jia, W. Xie, and L. Shen, “Hand-crafted feature ACM International Conference on Multimodal Interaction. ACM,
guided deep learning for facial expression recognition,” in Automatic 2017, pp. 569–576.
Face & Gesture Recognition (FG 2018), 2018 13th IEEE International
[191] S. E. Kahou, X. Bouthillier, P. Lamblin, C. Gulcehre, V. Michal-
Conference on. IEEE, 2018, pp. 423–430.
ski, K. Konda, S. Jean, P. Froumenty, Y. Dauphin, N. Boulanger-
[171] D. Ciregan, U. Meier, and J. Schmidhuber, “Multi-column deep neural Lewandowski et al., “Emonets: Multimodal deep learning approaches
networks for image classification,” in Computer vision and pattern for emotion recognition in video,” Journal on Multimodal User Inter-
recognition (CVPR), 2012 IEEE conference on. IEEE, 2012, pp. 3642– faces, vol. 10, no. 2, pp. 99–111, 2016.
3649.
[192] M. Liu, R. Wang, S. Li, S. Shan, Z. Huang, and X. Chen, “Combining
[172] G. Pons and D. Masip, “Supervised committee of convolutional neural
multiple kernel methods on riemannian manifold for emotion recogni-
networks in automated facial expression analysis,” IEEE Transactions
tion in the wild,” in Proceedings of the 16th International Conference
on Affective Computing, 2017.
on Multimodal Interaction. ACM, 2014, pp. 494–501.
[173] B.-K. Kim, J. Roh, S.-Y. Dong, and S.-Y. Lee, “Hierarchical committee
[193] B. Xu, Y. Fu, Y.-G. Jiang, B. Li, and L. Sigal, “Video emotion
of deep convolutional neural networks for robust facial expression
recognition with transferred deep feature encodings,” in Proceedings
recognition,” Journal on Multimodal User Interfaces, vol. 10, no. 2,
of the 2016 ACM on International Conference on Multimedia Retrieval.
pp. 173–189, 2016.
ACM, 2016, pp. 15–22.
[174] K. Liu, M. Zhang, and Z. Pan, “Facial expression recognition with cnn
ensemble,” in Cyberworlds (CW), 2016 International Conference on. [194] J. Chen, R. Xu, and L. Liu, “Deep peak-neutral difference feature for
IEEE, 2016, pp. 163–166. facial expression recognition,” Multimedia Tools and Applications, pp.
1–17, 2018.
[175] G. Pons and D. Masip, “Multi-task, multi-label and multi-domain
learning with residual convolutional networks for emotion recognition,” [195] Q. V. Le, N. Jaitly, and G. E. Hinton, “A simple way to ini-
arXiv preprint arXiv:1802.06664, 2018. tialize recurrent networks of rectified linear units,” arXiv preprint
[176] P. Ekman and E. L. Rosenberg, What the face reveals: Basic and applied arXiv:1504.00941, 2015.
studies of spontaneous expression using the Facial Action Coding [196] M. Schuster and K. K. Paliwal, “Bidirectional recurrent neural net-
System (FACS). Oxford University Press, USA, 1997. works,” IEEE Transactions on Signal Processing, vol. 45, no. 11, pp.
[177] R. Ranjan, S. Sankaranarayanan, C. D. Castillo, and R. Chellappa, “An 2673–2681, 1997.
all-in-one convolutional neural network for face analysis,” in Automatic [197] P. Barros and S. Wermter, “Developing crossmodal expression recogni-
Face & Gesture Recognition (FG 2017), 2017 12th IEEE International tion based on a deep neural model,” Adaptive behavior, vol. 24, no. 5,
Conference on. IEEE, 2017, pp. 17–24. pp. 373–396, 2016.
[178] Y. Lv, Z. Feng, and C. Xu, “Facial expression recognition via deep [198] J. Zhao, X. Mao, and J. Zhang, “Learning deep facial expression
learning,” in Smart Computing (SMARTCOMP), 2014 International features from image and optical flow sequences using 3d cnn,” The
Conference on. IEEE, 2014, pp. 303–308. Visual Computer, pp. 1–15, 2018.
[179] S. Rifai, Y. Bengio, A. Courville, P. Vincent, and M. Mirza, “Disentan- [199] M. Liu, S. Li, S. Shan, R. Wang, and X. Chen, “Deeply learning
gling factors of variation for facial expression recognition,” in European deformable facial action parts model for dynamic expression analysis,”
Conference on Computer Vision. Springer, 2012, pp. 808–822. in Asian conference on computer vision. Springer, 2014, pp. 143–157.
[180] Y.-H. Lai and S.-H. Lai, “Emotion-preserving representation learning [200] P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan,
via generative adversarial network for multi-view facial expression “Object detection with discriminatively trained part-based models,”
recognition,” in Automatic Face & Gesture Recognition (FG 2018), IEEE transactions on pattern analysis and machine intelligence, vol. 32,
2018 13th IEEE International Conference on. IEEE, 2018, pp. 263– no. 9, pp. 1627–1645, 2010.
270. [201] S. Pini, O. B. Ahmed, M. Cornia, L. Baraldi, R. Cucchiara, and B. Huet,
[181] F. Zhang, T. Zhang, Q. Mao, and C. Xu, “Joint pose and expression “Modeling multimodal cues in a deep learning-based framework for
modeling for facial expression recognition,” in Proceedings of the IEEE emotion recognition in the wild,” in Proceedings of the 19th ACM
Conference on Computer Vision and Pattern Recognition, 2018, pp. International Conference on Multimodal Interaction. ACM, 2017,
3359–3368. pp. 536–543.
[182] H. Yang, Z. Zhang, and L. Yin, “Identity-adaptive facial expression [202] R. Arandjelovic, P. Gronat, A. Torii, T. Pajdla, and J. Sivic, “Netvlad:
recognition through expression regeneration using conditional genera- Cnn architecture for weakly supervised place recognition,” in Pro-
ceedings of the IEEE Conference on Computer Vision and Pattern of rgb-depth map latent representations,” in 2017 IEEE International
Recognition, 2016, pp. 5297–5307. Conference on Computer Vision Workshop (ICCVW), 2017.
[203] D. H. Kim, M. K. Lee, D. Y. Choi, and B. C. Song, “Multi-modal [224] H. Li, J. Sun, Z. Xu, and L. Chen, “Multimodal 2d+ 3d facial expression
emotion recognition using semi-supervised learning and multiple neural recognition with deep fusion convolutional neural network,” IEEE
networks in the wild,” in Proceedings of the 19th ACM International Transactions on Multimedia, vol. 19, no. 12, pp. 2816–2831, 2017.
Conference on Multimodal Interaction. ACM, 2017, pp. 529–535. [225] A. Jan, H. Ding, H. Meng, L. Chen, and H. Li, “Accurate facial parts
[204] J. Donahue, L. Anne Hendricks, S. Guadarrama, M. Rohrbach, S. Venu- localization and deep learning for 3d facial expression recognition,” in
gopalan, K. Saenko, and T. Darrell, “Long-term recurrent convolutional Automatic Face & Gesture Recognition (FG 2018), 2018 13th IEEE
networks for visual recognition and description,” in Proceedings of the International Conference on. IEEE, 2018, pp. 466–472.
IEEE conference on computer vision and pattern recognition, 2015, pp. [226] X. Wei, H. Li, J. Sun, and L. Chen, “Unsupervised domain adaptation
2625–2634. with regularized optimal transport for multimodal 2d+ 3d facial expres-
[205] D. K. Jain, Z. Zhang, and K. Huang, “Multi angle optimal pattern-based sion recognition,” in Automatic Face & Gesture Recognition (FG 2018),
deep learning for automatic facial expression recognition,” Pattern 2018 13th IEEE International Conference on. IEEE, 2018, pp. 31–37.
Recognition Letters, 2017. [227] J. M. Susskind, G. E. Hinton, J. R. Movellan, and A. K. Anderson,
[206] M. Baccouche, F. Mamalet, C. Wolf, C. Garcia, and A. Baskurt, “Spatio- “Generating facial expressions with deep belief nets,” in Affective
temporal convolutional sparse auto-encoder for sequence classification.” Computing. InTech, 2008.
in BMVC, 2012, pp. 1–12. [228] M. Sabzevari, S. Toosizadeh, S. R. Quchani, and V. Abrishami, “A
[207] S. Kankanamge, C. Fookes, and S. Sridharan, “Facial analysis in the fast and accurate facial expression synthesis system for color face
wild with lstm networks,” in Image Processing (ICIP), 2017 IEEE images using face graph and deep belief network,” in Electronics and
International Conference on. IEEE, 2017, pp. 1052–1056. Information Engineering (ICEIE), 2010 International Conference On,
[208] J. D. Lafferty, A. Mccallum, and F. C. N. Pereira, “Conditional random vol. 2. IEEE, 2010, pp. V2–354.
fields: Probabilistic models for segmenting and labeling sequence data,” [229] R. Yeh, Z. Liu, D. B. Goldman, and A. Agarwala, “Semantic
Proceedings of Icml, vol. 3, no. 2, pp. 282–289, 2001. facial expression editing using autoencoded flow,” arXiv preprint
[209] K. Simonyan and A. Zisserman, “Two-stream convolutional networks arXiv:1611.09961, 2016.
for action recognition in videos,” in Advances in neural information [230] Y. Zhou and B. E. Shi, “Photorealistic facial expression synthesis by
processing systems, 2014, pp. 568–576. the conditional difference adversarial autoencoder,” in Affective Com-
[210] J. Susskind, V. Mnih, G. Hinton et al., “On deep generative models puting and Intelligent Interaction (ACII), 2017 Seventh International
with applications to recognition,” in Computer Vision and Pattern Conference on. IEEE, 2017, pp. 370–376.
Recognition (CVPR), 2011 IEEE Conference on. IEEE, 2011, pp. [231] L. Song, Z. Lu, R. He, Z. Sun, and T. Tan, “Geometry guided adversarial
2857–2864. facial expression synthesis,” arXiv preprint arXiv:1712.03474, 2017.
[211] V. Mnih, J. M. Susskind, G. E. Hinton et al., “Modeling natural images [232] H. Ding, K. Sricharan, and R. Chellappa, “Exprgan: Facial expres-
using gated mrfs,” IEEE transactions on pattern analysis and machine sion editing with controllable expression intensity,” in AAAI, 2018, p.
intelligence, vol. 35, no. 9, pp. 2206–2222, 2013. 67816788.
[233] F. Qiao, N. Yao, Z. Jiao, Z. Li, H. Chen, and H. Wang, “Geometry-
[212] V. Mnih, G. E. Hinton et al., “Generating more realistic images using
contrastive generative adversarial network for facial expression synthe-
gated mrf’s,” in Advances in Neural Information Processing Systems,
sis,” arXiv preprint arXiv:1802.01822, 2018.
2010, pp. 2002–2010.
[234] I. Masi, A. T. Tran, T. Hassner, J. T. Leksut, and G. Medioni, “Do we
[213] Y. Cheng, B. Jiang, and K. Jia, “A deep structure for facial expression
really need to collect millions of faces for effective face recognition?”
recognition under partial occlusion,” in Intelligent Information Hiding
in European Conference on Computer Vision. Springer, 2016, pp.
and Multimedia Signal Processing (IIH-MSP), 2014 Tenth International
579–596.
Conference on. IEEE, 2014, pp. 211–214.
[235] N. Mousavi, H. Siqueira, P. Barros, B. Fernandes, and S. Wermter,
[214] M. Xu, W. Cheng, Q. Zhao, L. Ma, and F. Xu, “Facial expression recog-
“Understanding how deep neural networks learn face expressions,” in
nition based on transfer learning from deep convolutional networks,” in
Neural Networks (IJCNN), 2016 International Joint Conference on.
Natural Computation (ICNC), 2015 11th International Conference on.
IEEE, 2016, pp. 227–234.
IEEE, 2015, pp. 702–708.
[236] R. Breuer and R. Kimmel, “A deep learning perspective on the origin
[215] Y. Liu, J. Zeng, S. Shan, and Z. Zheng, “Multi-channel pose-aware con- of facial expressions,” arXiv preprint arXiv:1705.01842, 2017.
volution neural networks for multi-view facial expression recognition,”
[237] M. D. Zeiler and R. Fergus, “Visualizing and understanding convolu-
in Automatic Face & Gesture Recognition (FG 2018), 2018 13th IEEE
tional networks,” in European conference on computer vision. Springer,
International Conference on. IEEE, 2018, pp. 458–465.
2014, pp. 818–833.
[216] S. He, S. Wang, W. Lan, H. Fu, and Q. Ji, “Facial expression recognition [238] I. Lüsi, J. C. J. Junior, J. Gorbova, X. Baró, S. Escalera, H. Demirel,
using deep boltzmann machine from thermal infrared images,” in J. Allik, C. Ozcinar, and G. Anbarjafari, “Joint challenge on dominant
Affective Computing and Intelligent Interaction (ACII), 2013 Humaine and complementary emotion recognition using micro emotion features
Association Conference on. IEEE, 2013, pp. 239–244. and head-pose estimation: Databases,” in Automatic Face & Gesture
[217] Z. Wu, T. Chen, Y. Chen, Z. Zhang, and G. Liu, “Nirexpnet: Three- Recognition (FG 2017), 2017 12th IEEE International Conference on.
stream 3d convolutional neural network for near infrared facial expres- IEEE, 2017, pp. 809–813.
sion recognition,” Applied Sciences, vol. 7, no. 11, p. 1184, 2017. [239] J. Wan, S. Escalera, X. Baro, H. J. Escalante, I. Guyon, M. Madadi,
[218] E. P. Ijjina and C. K. Mohan, “Facial expression recognition using kinect J. Allik, J. Gorbova, and G. Anbarjafari, “Results and analysis of
depth sensor and convolutional neural networks,” in Machine Learning chalearn lap multi-modal isolated and continuous gesture recognition,
and Applications (ICMLA), 2014 13th International Conference on. and real versus fake expressed emotions challenges,” in ChaLearn LaP,
IEEE, 2014, pp. 392–396. Action, Gesture, and Emotion Recognition Workshop and Competitions:
[219] M. Z. Uddin, M. M. Hassan, A. Almogren, M. Zuair, G. Fortino, and Large Scale Multimodal Gesture Recognition and Real versus Fake
J. Torresen, “A facial expression recognition system using robust face expressed emotions, ICCV, vol. 4, no. 6, 2017.
features from depth videos and deep learning,” Computers & Electrical [240] Y.-G. Kim and X.-P. Huynh, “Discrimination between genuine versus
Engineering, vol. 63, pp. 114–125, 2017. fake emotion using long-short term memory with parametric bias and
[220] M. Z. Uddin, W. Khaksar, and J. Torresen, “Facial expression recog- facial landmarks,” in Computer Vision Workshop (ICCVW), 2017 IEEE
nition using salient features and convolutional neural network,” IEEE International Conference on. IEEE, 2017, pp. 3065–3072.
Access, vol. 5, pp. 26 146–26 161, 2017. [241] L. Li, T. Baltrusaitis, B. Sun, and L.-P. Morency, “Combining sequential
[221] W. Li, D. Huang, H. Li, and Y. Wang, “Automatic 4d facial expression geometry and texture features for distinguishing genuine and deceptive
recognition using dynamic geometrical image network,” in Automatic emotions,” in Proceedings of the IEEE Conference on Computer Vision
Face & Gesture Recognition (FG 2018), 2018 13th IEEE International and Pattern Recognition, 2017, pp. 3147–3153.
Conference on. IEEE, 2018, pp. 24–30. [242] J. Guo, S. Zhou, J. Wu, J. Wan, X. Zhu, Z. Lei, and S. Z. Li, “Multi-
[222] F.-J. Chang, A. T. Tran, T. Hassner, I. Masi, R. Nevatia, and G. Medioni, modality network with visual and geometrical information for micro
“Expnet: Landmark-free, deep, 3d facial expressions,” in Automatic emotion recognition,” in Automatic Face & Gesture Recognition (FG
Face & Gesture Recognition (FG 2018), 2018 13th IEEE International 2017), 2017 12th IEEE International Conference on. IEEE, 2017, pp.
Conference on. IEEE, 2018, pp. 122–129. 814–819.
[223] O. K. Oyedotun, G. Demisse, A. E. R. Shabayek, D. Aouada, and [243] I. Song, H.-J. Kim, and P. B. Jeon, “Deep learning for real-time
B. Ottersten, “Facial expression recognition via joint deep learning robust facial expression recognition on a smartphone,” in Consumer
Electronics (ICCE), 2014 IEEE International Conference on. IEEE,
2014, pp. 564–567.
[244] S. Bazrafkan, T. Nedelcu, P. Filipczuk, and P. Corcoran, “Deep learning
for facial expression recognition: A step closer to a smartphone that
knows your moods,” in Consumer Electronics (ICCE), 2017 IEEE
International Conference on. IEEE, 2017, pp. 217–220.
[245] S. Hickson, N. Dufour, A. Sud, V. Kwatra, and I. Essa, “Eyemotion:
Classifying facial expressions in vr using eye-tracking cameras,” arXiv
preprint arXiv:1707.07204, 2017.
[246] S. A. Ossia, A. S. Shamsabadi, A. Taheri, H. R. Rabiee, N. Lane, and
H. Haddadi, “A hybrid deep learning architecture for privacy-preserving
mobile analytics,” arXiv preprint arXiv:1703.02952, 2017.
[247] K. Kulkarni, C. A. Corneanu, I. Ofodile, S. Escalera, X. Baro,
S. Hyniewska, J. Allik, and G. Anbarjafari, “Automatic recognition of
facial displays of unfelt emotions,” arXiv preprint arXiv:1707.04061,
2017.
[248] X. Zhou, K. Jin, Y. Shang, and G. Guo, “Visually interpretable repre-
sentation learning for depression recognition from facial images,” IEEE
Transactions on Affective Computing, pp. 1–1, 2018.
[249] E. Barsoum, C. Zhang, C. C. Ferrer, and Z. Zhang, “Training deep
networks for facial expression recognition with crowd-sourced label
distribution,” in Proceedings of the 18th ACM International Conference
on Multimodal Interaction. ACM, 2016, pp. 279–283.
[250] J. A. Russell, “A circumplex model of affect.” Journal of personality
and social psychology, vol. 39, no. 6, p. 1161, 1980.
[251] S. Li and W. Deng, “Deep emotion transfer network for cross-database
facial expression recognition,” in Pattern Recognition (ICPR), 2018
26th International Conference. IEEE, 2018, pp. 3092–3099.
[252] M. Valstar, J. Gratch, B. Schuller, F. Ringeval, D. Lalanne, M. Tor-
res Torres, S. Scherer, G. Stratou, R. Cowie, and M. Pantic, “Avec 2016:
Depression, mood, and emotion recognition workshop and challenge,”
in Proceedings of the 6th International Workshop on Audio/Visual
Emotion Challenge. ACM, 2016, pp. 3–10.
[253] F. Ringeval, B. Schuller, M. Valstar, J. Gratch, R. Cowie, S. Scherer,
S. Mozgai, N. Cummins, M. Schmitt, and M. Pantic, “Avec 2017:
Real-life depression, and affect recognition workshop and challenge,”
in Proceedings of the 7th Annual Workshop on Audio/Visual Emotion
Challenge. ACM, 2017, pp. 3–9.