Deep Facial Expression Recognition: A Survey: Shan Li and Weihong Deng, Member, IEEE
Deep Facial Expression Recognition: A Survey: Shan Li and Weihong Deng, Member, IEEE
Abstract—With the transition of facial expression recognition (FER) from laboratory-controlled to challenging in-the-wild conditions
and the recent success of deep learning techniques in various fields, deep neural networks have increasingly been leveraged to learn
discriminative representations for automatic FER. Recent deep FER systems generally focus on two important issues: overfitting
caused by a lack of sufficient training data and expression-unrelated variations, such as illumination, head pose and identity bias. In this
paper, we provide a comprehensive survey on deep FER, including datasets and algorithms that provide insights into these intrinsic
problems. First, we introduce the available datasets that are widely used in the literature and provide accepted data selection and
evaluation principles for these datasets. We then describe the standard pipeline of a deep FER system with the related background
knowledge and suggestions of applicable implementations for each stage. For the state of the art in deep FER, we review existing
novel deep neural networks and related training strategies that are designed for FER based on both static images and dynamic image
arXiv:1804.08348v2 [cs.CV] 22 Oct 2018
sequences, and discuss their advantages and limitations. Competitive performances on widely used benchmarks are also summarized
in this section. We then extend our survey to additional related issues and application scenarios. Finally, we review the remaining
challenges and corresponding opportunities in this field as well as future directions for the design of robust deep FER systems.
Index Terms—Facial Expressions Recognition, Facial expression datasets, Affect, Deep Learning, Survey.
1 I NTRODUCTION
MMI ing FER systems. CK+ contains 593 video sequences from 123
Zhi et al. [19] (NMF) 2011
subjects. The sequences vary in duration from 10 to 60 frames
Zhong et al. [20] (Sparse learning) and show a shift from a neutral facial expression to the peak
Tang (CNN) [130] (winner of FER2013) 2013 expression. Among these videos, 327 sequences from 118 subjects
FER2013
Kahou et al. [57] (CNN, DBN, DAE) are labeled with seven basic expression labels (anger, contempt,
EmotiW
---->
TABLE 1
An overview of the facial expression datasets. P = posed; S = spontaneous; Condit. = Collection condition; Elicit. = Elicitation method.
tion Recognition In The Wild Challenge (EmotiW) since 2013. 755,370 images from 337 subjects under 15 viewpoints and 19
AFEW contains video clips collected from different movies with illumination conditions in up to four recording session. Each facial
spontaneous expressions, various head poses, occlusions and il- image is labeled with one of six expressions: disgust, neutral,
luminations. AFEW is a temporal and multimodal database that scream, smile, squint and surprise. This dataset is typically used
provides with vastly different environmental conditions in both for multiview facial expression analysis.
audio and video. Samples are labeled with seven expressions: BU-3DFE [39]: The Binghamton University 3D Facial Ex-
anger, disgust, fear, happiness, sadness, surprise and neutral. The pression (BU-3DFE) database contains 606 facial expression se-
annotation of expressions have been continuously updated, and quences captured from 100 people. For each subject, six universal
reality TV show data have been continuously added. The AFEW facial expressions (anger, disgust, fear, happiness, sadness and
7.0 in EmotiW 2017 [24] is divided into three data partitions in surprise) are elicited by various manners with multiple intensities.
an independent manner in terms of subject and movie/TV source: Similar to Multi-PIE, this dataset is typically used for multiview
Train (773 samples), Val (383 samples) and Test (653 samples), 3D facial expression analysis.
which ensures data in the three sets belong to mutually exclusive
Oulu-CASIA [40]: The Oulu-CASIA database includes 2,880
movies and actors.
image sequences collected from 80 subjects labeled with six
SFEW [50]: The Static Facial Expressions in the Wild basic emotion labels: anger, disgust, fear, happiness, sadness, and
(SFEW) was created by selecting static frames from the AFEW surprise. Each of the videos is captured with one of two imaging
database by computing key frames based on facial point clustering. systems, i.e., near-infrared (NIR) or visible light (VIS), under
The most commonly used version, SFEW 2.0, was the bench- three different illumination conditions. Similar to CK+, the first
marking data for the SReco sub-challenge in EmotiW 2015 [22]. frame is neutral and the last frame has the peak expression.
SFEW 2.0 has been divided into three sets: Train (958 samples), Typically, only the last three peak frames and the first frame
Val (436 samples) and Test (372 samples). Each of the images is (neutral face) from the 480 videos collected by the VIS System
assigned to one of seven expression categories, i.e., anger, disgust, under normal indoor illumination are employed for 10-fold cross-
fear, neutral, happiness, sadness, and surprise. The expression validation experiments.
labels of the training and validation sets are publicly available,
RaFD [41]: The Radboud Faces Database (RaFD) is
whereas those of the testing set are held back by the challenge
laboratory-controlled and has a total of 1,608 images from 67
organizer.
subjects with three different gaze directions, i.e., front, left and
Multi-PIE [38]: The CMU Multi-PIE database contains right. Each sample is labeled with one of eight expressions: anger,
4
Emotion Training
Labels
Input Images
C1 Layer
P1 Layer
C2 Layer P2 Layer CNN
Full
Convolutions Subsampling Convolutions Subsampling Connected Trained
Face hidden variables
ℎ3 𝑜
Model
𝒉
𝑶𝒕−𝟏 𝑶𝒕 𝑶𝒕+𝟏
Bipartite
Structure
RBM
𝑾 ℎ2 𝑉 𝑉 𝑉
𝑉 𝑊
Data Augmentation label
𝑠 Unfold
𝒔𝒕−𝟏 𝒔𝒕 𝒔𝒕+𝟏
ℎ1 𝑊 𝑊 𝑊 𝑊 Anger
𝑈 𝑈 𝑈 𝑈 Contempt
Normalization Image
𝑽
visible variables
DBN 𝑣 𝑥 RNN 𝒙𝒕−𝟏 𝒙𝒕 𝒙𝒕+𝟏
Alignment Disgust
Fear
As close as possible
Noise Happiness
Images & Generated Neutral
Output Layer
Input Layer
bottle
sample
Layer
Layer
Sadness
Layer
Layer
…
Layer
Layer
… Data
Sequences 𝑊1 𝑊2
𝑊2𝑇 𝑊1𝑇
sample
Discriminator Generator
Surprise
Illumination Pose Data sample ?
𝑥 Encoder Decoder 𝑥
Code
DAE Yes / No
GAN
Trained
Input Pre-processing Model Testing Deep Networks Output
of images for training. Therefore, data augmentation is a vital 3.1.3 Face normalization
step for deep FER. Data augmentation techniques can be divided
into two groups: on-the-fly data augmentation and offline data Variations in illumination and head poses can introduce large
augmentation. changes in images and hence impair the FER performance.
Therefore, we introduce two typical face normalization methods
Usually, the on-the-fly data augmentation is embedded in deep
to ameliorate these variations: illumination normalization and
learning toolkits to alleviate overfitting. During the training step,
pose normalization (frontalization).
the input samples are randomly cropped from the four corners and
center of the image and then flipped horizontally, which can result
Illumination normalization: Illumination and contrast can
in a dataset that is ten times larger than the original training data.
vary in different images even from the same person with the same
Two common prediction modes are adopted during testing: only
expression, especially in unconstrained environments, which can
the center patch of the face is used for prediction (e.g., [61], [77])
result in large intra-class variances. In [60], several frequently
or the prediction value is averaged over all ten crops (e.g., [76],
used illumination normalization algorithms, namely, isotropic
[78]).
diffusion (IS)-based normalization, discrete cosine transform
Besides the elementary on-the-fly data augmentation, various (DCT)-based normalization [85] and difference of Gaussian
offline data augmentation operations have been designed to further (DoG), were evaluated for illumination normalization. And [86]
expand data on both size and diversity. The most frequently used employed homomorphic filtering based normalization, which has
operations include random perturbations and transforms, e.g., rota- been reported to yield the most consistent results among all other
tion, shifting, skew, scaling, noise, contrast and color jittering. For techniques, to remove illumination normalization. Furthermore,
example, common noise models, salt & pepper and speckle noise related studies have shown that histogram equalization combined
[79] and Gaussian noise [80], [81] are employed to enlarge the data with illumination normalization results in better face recognition
size. And for contrast transformation, saturation and value (S and performance than that achieved using illumination normalization
V components of the HSV color space) of each pixel are changed on it own. And many studies in the literature of deep FER (e.g.,
[70] for data augmentation. Combinations of multiple operations [75], [79], [87], [88]) have employed histogram equalization to
can generate more unseen training samples and make the network increase the global contrast of images for pre-processing. This
more robust to deviated and rotated faces. In [82], the authors method is effective when the brightness of the background and
applied five image appearance filters (disk, average, Gaussian, foreground are similar. However, directly applying histogram
unsharp and motion filters) and six affine transform matrices that equalization may overemphasize local contrast. To solve this
were formalized by adding slight geometric transformations to the problem, [89] proposed a weighted summation approach to
identity matrix. In [75], a more comprehensive affine transform combine histogram equalization and linear mapping. And in
matrix was proposed to randomly generate images that varied in [79], the authors compared three different methods: global
terms of rotation, skew and scale. Furthermore, deep learning contrast normalization (GCN), local normalization, and histogram
based technology can be applied for data augmentation. For equalization. GCN and histogram equalization were reported
example, a synthetic data generation system with 3D convolutional to achieve the best accuracy for the training and testing steps,
neural network (CNN) was created in [83] to confidentially create respectively.
faces with different levels of saturation in expression. And the
generative adversarial network (GAN) [84] can also be applied to Pose normalization: Considerable pose variation is another
augment data by generating diverse appearances varying in poses common and intractable problem in unconstrained settings. Some
and expressions. (see Section 4.1.7). studies have employed pose normalization techniques to yield
6
frontal facial views for FER (e.g., [90], [91]), among which the TABLE 3
most popular was proposed by Hassner et al. [92]. Specifically, Comparison of CNN models and their achievements. DA = Data
augmentation; BN = Batch normalization.
after localizing facial landmarks, a 3D texture reference model
generic to all faces is generated to efficiently estimate visible
facial components. Then, the initial frontalized face is synthesized AlexNet VGGNet GoogleNet ResNet
by back-projecting each input face image to the reference [25] [26] [27] [28]
coordinate system. Alternatively, Sagonas et al. [93] proposed an Year 2012 2014 2014 2015
effective statistical model to simultaneously localize landmarks # of layers† 5+3 13/16 + 3 21+1 151+1
and convert facial poses using only frontal faces. Very recently, a Kernel size? 11, 5, 3 3 7, 1, 3, 5 7, 1, 3, 5
series of GAN-based deep models were proposed for frontal view DA 3 3 3 3
synthesis (e.g., FF-GAN [94], TP-GAN [95]) and DR-GAN [96]) Dropout 3 3 3 3
and report promising performances. Inception 7 7 3 7
BN 7 7 7 3
3.2 Deep networks for feature learning Used in [110] [78], [111] [17], [78] [91], [112]
Deep learning has recently become a hot research topic and has †
number of convolutional layers + fully connected layers
achieved state-of-the-art performance for a variety of applications ?
size of the convolution kernel
[97]. Deep learning attempts to capture high-level abstractions
through hierarchical architectures of multiple nonlinear transfor-
mations and representations. In this section, we briefly introduce proposed the well-designed C3D, which exploits 3D convolutions
some deep learning techniques that have been applied for FER. on large-scale supervised training datasets to learn spatio-temporal
The traditional architectures of these deep neural networks are features. Many related studies (e.g., [108], [109]) have employed
shown in Fig. 2. this network for FER involving image sequences.
3.2.1 Convolutional neural network (CNN) 3.2.2 Deep belief network (DBN)
CNN has been extensively used in diverse computer vision ap- DBN proposed by Hinton et al. [113] is a graphical model that
plications, including FER. At the beginning of the 21st century, learns to extract a deep hierarchical representation of the training
several studies in the FER literature [98], [99] found that the data. The traditional DBN is built with a stack of restricted Boltz-
CNN is robust to face location changes and scale variations and mann machines (RBMs) [114], which are two-layer generative
behaves better than the multilayer perceptron (MLP) in the case stochastic models composed of a visible-unit layer and a hidden-
of previously unseen face pose variations. [100] employed the unit layer. These two layers in an RBM must form a bipartite
CNN to address the problems of subject independence as well graph without lateral connections. In a DBN, the units in higher
as translation, rotation, and scale invariance in the recognition of layers are trained to learn the conditional dependencies among the
facial expressions. units in the adjacent lower layers, except the top two layers, which
A CNN consists of three types of heterogeneous layers: have undirected connections. The training of a DBN contains two
convolutional layers, pooling layers, and fully connected layers. phases: pre-training and fine-tuning [115]. First, an efficient layer-
The convolutional layer has a set of learnable filters to convolve by-layer greedy learning strategy [116] is used to initialize the
through the whole input image and produce various specific types deep network in an unsupervised manner, which can prevent poor
of activation feature maps. The convolution operation is associ- local optimal results to some extent without the requirement of
ated with three main benefits: local connectivity, which learns a large amount of labeled data. During this procedure, contrastive
correlations among neighboring pixels; weight sharing in the same divergence [117] is used to train RBMs in the DBN to estimate the
feature map, which greatly reduces the number of the parameters approximation gradient of the log-likelihood. Then, the parameters
to be learned; and shift-invariance to the location of the object. The of the network and the desired output are fine-tuned with a simple
pooling layer follows the convolutional layer and is used to reduce gradient descent under supervision.
the spatial size of the feature maps and the computational cost of
the network. Average pooling and max pooling are the two most 3.2.3 Deep autoencoder (DAE)
commonly used nonlinear down-sampling strategies for translation DAE was first introduced in [118] to learn efficient codings for
invariance. The fully connected layer is usually included at the dimensionality reduction. In contrast to the previously mentioned
end of the network to ensure that all neurons in the layer are fully networks, which are trained to predict target values, the DAE
connected to activations in the previous layer and to enable the is optimized to reconstruct its inputs by minimizing the recon-
2D feature maps to be converted into 1D feature maps for further struction error. Variations of the DAE exist, such as the denoising
feature representation and classification. autoencoder [119], which recovers the original undistorted input
We list the configurations and characteristics of some well- from partially corrupted data; the sparse autoencoder network
known CNN models that have been applied for FER in Table 3. (DSAE) [120], which enforces sparsity on the learned feature
Besides these networks, several well-known derived frameworks representation; the contractive autoencoder (CAE1 ) [121], which
also exist. In [101], [102], region-based CNN (R-CNN) [103] adds an activity dependent regularization to induce locally invari-
was used to learn features for FER. In [104], Faster R-CNN ant features; the convolutional autoencoder (CAE2 ) [122], which
[105] was used to identify facial expressions by generating high- uses convolutional (and optionally pooling) layers for the hidden
quality region proposals. Moreover, Ji et al. proposed 3D CNN layers in the network; and the variational auto-encoder (VAE)
[106] to capture motion information encoded in multiple adjacent [123], which is a directed graphical model with certain types of
frames for action recognition via 3D convolutions. Tran et al. [107] latent variables to design complex generative models of data.
7
3.2.4 Recurrent neural network (RNN) such as support vector machine or random forest, to the extracted
RNN is a connectionist model that captures temporal information representations [133], [134]. Furthermore, [135], [136] showed
and is more suitable for sequential data prediction with arbitrary that the covariance descriptors computed on DCNN features
lengths. In addition to training the deep neural network in a single and classification with Gaussian kernels on Symmetric Positive
feed-forward manner, RNNs include recurrent edges that span Definite (SPD) manifold are more efficient than the standard
adjacent time steps and share the same parameters across all steps. classification with the softmax layer.
The classic back propagation through time (BPTT) [124] is used
to train the RNN. Long-short term memory (LSTM), introduced
by Hochreiter & Schmidhuber [125], is a special form of the 4 T HE STATE OF THE ART
traditional RNN that is used to address the gradient vanishing In this section, we review the existing novel deep neural networks
and exploding problems that are common in training RNNs. The designed for FER and the related training strategies proposed
cell state in LSTM is regulated and controlled by three gates: to address expression-specific problems. We divide the works
an input gate that allows or blocks alteration of the cell state by presented in the literature into two main groups depending on
the input signal, an output gate that enables or prevents the cell the type of data: deep FER networks for static images and deep
state to affect other neurons, and a forget gate that modulates the FER networks for dynamic image sequences. We then provide
cell’s self-recurrent connection to accumulate or forget its previous an overview of the current deep FER systems with respect to
state. By combining these three gates, LSTM can model long-term the network architecture and performance. Because some of the
dependencies in a sequence and has been widely employed for evaluated datasets do not provide explicit data groups for training,
video-based expression recognition tasks. validation and testing, and the relevant studies may conduct
experiments under different experimental conditions with different
3.2.5 Generative Adversarial Network (GAN) data, we summarize the expression recognition performance along
GAN was first introduced by Goodfellow et al [84] in 2014, which with information about the data selection and grouping methods.
trains models through a minimax two-player game between a
generator G(z) that generates synthesized input data by mapping
latents z to data space with z ∼ p(z) and a discriminator D(x) 4.1 Deep FER networks for static images
that assigns probability y = Dis(x) ∈ [0, 1] that x is an actual A large volume of the existing studies conducted expression recog-
training sample to tell apart real from fake input data. The gen- nition tasks based on static images without considering temporal
erator and the discriminator are trained alternatively and can both information due to the convenience of data processing and the
improve themselves by minimizing/maximizing the binary cross availability of the relevant training and test material. We first
entropy LGAN = log(D(x)) + log(1 − D(G(z))) with respect to introduce specific pre-training and fine-tuning skills for FER, then
D / G with x being a training sample and z ∼ p(z). Extensions review the novel deep neural networks in this field. For each of
of GAN exist, such as the cGAN [126] that adds a conditional the most frequently evaluated datasets, Table 4 shows the current
information to control the output of the generator, the DCGAN state-of-the-art methods in the field that are explicitly conducted in
[127] that adopts deconvolutional and convolutional neural net- a person-independent protocol (subjects in the training and testing
works to implement G and D respectively, the VAE/GAN [128] sets are separated).
that uses learned feature representations in the GAN discriminator
as basis for the VAE reconstruction objective, and the InfoGAN 4.1.1 Pre-training and fine-tuning
[129] that can learn disentangled representations in a completely
As mentioned before, direct training of deep networks on rela-
unsupervised manner.
tively small facial expression datasets is prone to overfitting. To
mitigate this problem, many studies used additional task-oriented
3.3 Facial expression classification data to pre-train their self-built networks from scratch or fine-tuned
After learning the deep features, the final step of FER is to classify on well-known pre-trained models (e.g., AlexNet [25], VGG [26],
the given face into one of the basic emotion categories. VGG-face [148] and GoogleNet [27]). Kahou et al. [57], [149]
Unlike the traditional methods, where the feature extraction indicated that the use of additional data can help to obtain models
step and the feature classification step are independent, deep with high capacity without overfitting, thereby enhancing the FER
networks can perform FER in an end-to-end way. Specifically, performance.
a loss layer is added to the end of the network to regulate the To select appropriate auxiliary data, large-scale face recogni-
back-propagation error; then, the prediction probability of each tion (FR) datasets (e.g., CASIA WebFace [150], Celebrity Face
sample can be directly output by the network. In CNN, softmax in the Wild (CFW) [151], FaceScrub dataset [152]) or relatively
loss is the most common used function that minimizes the cross- large FER datasets (FER2013 [21] and TFD [37]) are suitable.
entropy between the estimated class probabilities and the ground- Kaya et al. [153] suggested that VGG-Face which was trained
truth distribution. Alternatively, [130] demonstrated the benefit of for FR overwhelmed ImageNet which was developed for objected
using a linear support vector machine (SVM) for the end-to-end recognition. Another interesting result observed by Knyazev et
training which minimizes a margin-based loss instead of the cross- al. [154] is that pre-training on larger FR data positively affects
entropy. Likewise, [131] investigated the adaptation of deep neural the emotion recognition accuracy, and further fine-tuning with
forests (NFs) [132] which replaced the softmax loss layer with additional FER datasets can help improve the performance.
NFs and achieved competitive results for FER. Instead of directly using the pre-trained or fine-tuned models
Besides the end-to-end learning way, another alternative is to to extract features on the target dataset, a multistage fine-tuning
employ the deep neural network (particularly a CNN) as a feature strategy [63] (see “Submission 3” in Fig. 3) can achieve better
extraction tool and then apply additional independent classifiers, performance: after the first-stage fine-tuning using FER2013 on
8
TABLE 4
Performance summary of representative methods for static-based deep facial expression recognition on the most widely evaluated datasets.
Network size = depth & number of parameters; Pre-processing = Face Detection & Data Augmentation & Face Normalization; IN = Illumination
Normalization; N E = Network Ensemble; CN = Cascaded Network; MN = Multitask Network; LOSO = leave-one-subject-out.
Liu et al. 13 [137] CNN, RBM CN 5 - V&J - - the middle three frames 10 folds SVM 7 classes‡: 74.76 (71.73)
and the first frame
Liu et al. 15 [138] CNN, RBM CN 5 - V&J - - 10 folds SVM 7 classes‡: 75.85
MMI Mollahosseini
images from each
et al. 16 CNN (Inception) 11 7.3m IntraFace 3 - 5 folds 7 6 classes: 77.9
sequence
[14]
Liu et al. 17 [77] CNN loss layer 11 - IntraFace 3 IN 10 folds 7 6 classes: 78.53 (73.50)
Li et al. 17 [44] CNN loss layer 8 5.8m IntraFace 3 - 5 folds SVM 6 classes: 78.46
the middle three frames
Yang et al. 18
GAN (cGAN) - - MoT 3 - 10 folds 7 6 classes: 73.23 (72.67)
[141]
Reed et al. 14 4,178 emotion labeled
RBM MN - - - - - SVM Test: 85.43
[143] 3,874 identity labeled
Devries et al. 14 Validation: 87.80
TFD CNN MN 4 12.0m MoT 3 IN 5 official 7
[58] Test: 85.13 (48.29)
folds
Khorrami
[139] zero-bias CNN 4 7m 3 3 - 7 Test: 88.6
et al. 15 4,178 labeled images
Ding et al. 17
CNN fine-tune 8 11m IntraFace 3 - 7 Test: 88.9 (87.7)
[111]
Fig. 5. Image intensities (left) and LBP codes (middle). [78] proposed
mapping these values to a 3D metric space (right) as the input of CNNs.
Fig. 3. Flowchart of the different fine-tuning combinations used in [63]. 4.1.2 Diverse network input
Here, “FER28” and “FER32” indicate different parts of the FER2013
datasets. “EmotiW” is the target dataset. The proposed two-stage fine-
Traditional practices commonly use the whole aligned face of
tuning strategy (Submission 3) exhibited the best performance. RGB images as the input of the network to learn features for
FER. However, these raw data lack important information, such as
homogeneous or regular textures and invariance in terms of image
scaling, rotation, occlusion and illumination, which may represent
confounding factors for FER. Some methods have employed
diverse handcrafted features and their extensions as the network
input to alleviate this problem.
Low-level representations encode features from small regions
in the given RGB image, then cluster and pool these features
with local histograms, which are robust to illumination variations
and small registration errors. A novel mapped LBP feature [78]
(see Fig. 5) was proposed for illumination-invariant FER. Scale-
invariant feature transform (SIFT) [155]) features that are robust
against image scaling and rotation are employed [156] for multi-
view FER tasks. Combining different descriptors in outline, tex-
ture, angle, and color as the input data can also help enhance the
deep network performance [54], [157].
Part-based representations extract features according to the
target task, which remove noncritical parts from the whole image
and exploit key parts that are sensitive to the task. [158] indicated
that three regions of interest (ROI), i.e., eyebrows, eyes and mouth,
are strongly related to facial expression changes, and cropped
these regions as the input of DSAE. Other researches proposed
Fig. 4. Two-stage training flowchart in [111]. In stage (a), the deeper face
net is frozen and provides the feature-level regularization that pushes
to automatically learn the key parts for facial expression. For ex-
the convolutional features of the expression net to be close to the face ample, [159] employed a deep multi-layer network [160] to detect
net by using the proposed distribution function. Then, in stage (b), to the saliency map which put intensities on parts demanding visual
further improve the discriminativeness of the learned features, randomly attention. And [161] applied the neighbor-center difference vector
initialized fully connected layers are added and jointly trained with the
whole expression net using the expression label information. (NCDV) [162] to obtain features with more intrinsic information.
TABLE 5
Three primary ensemble methods on the decision level.
used in
definition
(example)
determine the class with the most
Majority [76], [146],
votes using the predicted label
Voting [173]
yielded from each individual
(a) Three different supervised blocks in [91]. SS Block for shallow-layer
supervision, IS Block for intermediate-layer supervision, and DS Block for determine the class with the
deep-layer supervision. highest mean score using the
Simple [76], [146],
posterior class probabilities
Average [173]
yielded from each individual
with the same weight
determine the class with the
highest weighted mean score
Weighted [57], [78],
using the posterior class
Average [147], [153]
(b) Island loss layer in [140]. The island loss calculated at the feature probabilities yielded from each
extraction layer and the softmax loss calculated at the decision layer are individual with different weights
combined to supervise the CNN training.
Most commonly, different networks or learning methods are focuses (computation efficiency, performance and difficulty of
combined sequentially and individually, and each of them con- network training).
tributes differently and hierarchically. In [178], DBNs were trained Pre-training and fine-tuning have become mainstream in
to first detect faces and to detect expression-related areas. Then, deep FER to solve the problem of insufficient training data and
these parsed face components were classified by a stacked autoen- overfitting. A practical technique that proved to be particularly
coder. In [179], a multiscale contractive convolutional network useful is pre-training and fine-tuning the network in multiple
(CCNET) was proposed to obtain local-translation-invariant (LTI) stages using auxiliary data from large-scale objection or face
representations. Then, contractive autoencoder was designed to hi- recognition datasets to small-scale FER datasets, i.e., from large to
erarchically separate out the emotion-related factors from subject small and from general to specific. However, when compared with
identity and pose. In [137], [138], over-complete representations the end-to-end training framework, the representational structure
were first learned using CNN architecture, then a multilayer RBM that are unrelated to expressions are still remained in the off-the-
was exploited to learn higher-level features for FER (see Fig. shelf pre-trained model, such as the large domain gap with the
9). Instead of simply concatenating different networks, Liu et al. objection net [153] and the subject identification distraction in the
[13] presented a boosted DBN (BDBN) that iteratively performed face net [111]. Thus the extracted features are usually vulnerable
feature representation, feature selection and classifier construction to identity variations and the performance would be degraded.
in a unified loopy state. Compared with the concatenation without Noticeably, with the advent of large-scale in-the-wild FER datasets
feedback, this loopy framework propagates backward the classi- (e.g., AffectNet and RAF-DB), the end-to-end training using
fication error to initiate the feature selection process alternately deep networks with moderate size can also achieve competitive
until convergence. Thus, the discriminative ability for FER can be performances [45], [167].
substantially improved during this iteration. In addition to directly using the raw image data to train the
deep network, diverse pre-designed features are recommended to
4.1.7 Generative adversarial networks (GANs) strengthen the network’s robustness to common distractions (e.g.,
illumination, head pose and occlusion) and to force the network
Recently, GAN-based methods have been successfully used in
to focus more on facial areas with expressive information. More-
image synthesis to generate impressively realistic faces, numbers,
over, the use of multiple heterogeneous input data can indirectly
and a variety of other image types, which are beneficial to train-
enlarge the data size. However, the problem of identity bias is
ing data augmentation and the corresponding recognition tasks.
commonly ignored in this methods. Moreover, generating diverse
Several works have proposed novel GAN-based models for pose-
data accounts for additional time consuming and combining these
invariant FER and identity-invariant FER.
multiple data can lead to high dimension which may influence the
For pose-invariant FER, Lai et al. [180] proposed a GAN-
computational efficiency of the network.
based face frontalization framework, where the generator frontal-
Training a deep and wide network with a large number of
izes input face images while preserving the identity and expression
hidden layers and flexible filters is an effective way to learn
characteristics and the discriminator distinguishes the real images
deep high-level features that are discriminative for the target task.
from the generated frontal face images. And Zhang et al. [181]
However, this process is vulnerable to the size of training data and
proposed a GAN-based model that can generate images with
can underperform if insufficient training data is available to learn
different expressions under arbitrary poses for multi-view FER.
the new parameters. Integrating multiple relatively small networks
For identity-invariant FER, Yang et al. [182] proposed an Identity-
in parallel or in series is a natural research direction to overcome
Adaptive Generation (IA-gen) model with two parts. The upper
this problem. Network ensemble integrates diverse networks at
part generates images of the same subject with different expres-
the feature or decision level to combine their advantages, which
sions using cGANs, respectively. Then, the lower part conducts
is usually applied in emotion competitions to help boost the
FER for each single identity sub-space without involving other
performance. However, designing different kinds of networks
individuals, thus identity variations can be well alleviated. Chen et
to compensate each other obviously enlarge the computational
al. [183] proposed a Privacy-Preserving Representation-Learning
cost and the storage requirement. Moreover, the weight of each
Variational GAN (PPRL-VGAN) that combines VAE and GAN
sub-network is usually learned according to the performance on
to learn an identity-invariant representation that is explicitly
original training data, leading to overfitting on newly unseen
disentangled from the identity information and generative for
testing data. Multitask networks jointly train multiple networks
expression-preserving face image synthesis. Yang et al. [141]
with consideration of interactions between the target FER task and
proposed a De-expression Residue Learning (DeRL) procedure to
other secondary tasks, such as facial landmark localization, facial
explore the expressive information, which is filtered out during the
AU recognition and face verification, thus the expression-unrelated
de-expression process but still embedded in the generator. Then
factors including identity bias can be well disentangled. The
the model extracted this information from the generator directly to
downside of this method is that it requires labeled data from all
mitigate the influence of subject variations and improve the FER
tasks and the training becomes increasingly cumbersome as more
performance.
tasks are involved. Alternatively, cascaded networks sequentially
train multiple networks in a hierarchical approach, in which case
4.1.8 Discussion the discriminative ability of the learned features are continuously
The existing well-constructed deep FER systems focus on two key strengthened. In general, this method can alleviate the overfitting
issues: the lack of plentiful diverse training data and expression- problem, and in the meanwhile, progressively disentangling fac-
unrelated variations, such as illumination, head pose and identity. tors that are irrelevant to facial expression. A deficiency worths
Table 6 shows relative advantages and disadvantages of these considering is that the sub-networks in most existing cascaded
different types of methods with respect to two open issues (data systems are training individually without feedback, and the end-
size requirement and expression-unrelated variations) and other to-end training strategy is preferable to enhance the training
13
TABLE 6
Comparison of different types of methods for static images in terms of data size requirement, variations* (head pose, illumination, occlusion and
other environment factors), identity bias, computational efficiency, accuracy, and difficulty on network training.
TABLE 7
Performances of representative methods for dynamic-based deep facial expression recognition on the most widely evaluated datasets. Network
size = depth & number of parameters; Pre-processing = Face Detection & Data Augmentation & Face Normalization; IN = Illumination
Normalization; F A = Frame Aggregation; EIN = Expression Intensity-invariant Network; F LT = Facial Landmark Trajectory; CN = Cascaded
Network; N E = Network Ensemble; S = Spatial Network; T = Temporal Network; LOSO = leave-one-subject-out.
Zhao et al. 16 [17] EIN 22 6.8m 3 - - from the 7th to the last2 the last frame 10 folds 6 classes: 99.3
2
Yu et al. 17 [70] EIN 42 - 3 - from the 7th to the last the peak expression
MTCNN 10 folds 6 classes: 99.6
kim et al. 17 [184] EIN 14 - 3 3 - all frames 10 folds 7 classes: 97.93
S: emotional
Sun et al. 17 [185] NE 3 * GoogLeNetv2 3 - - 10 folds 6 classes: 97.28
CK+ T: neutral+emotional
Jung et al. 15 [16] F LT 2 177.6k IntraFace 3 - fixed number of frames 10 folds 7 classes: 92.35
the same as
Jung et al. 15 [16] C3D 4 - IntraFace 3 - fixed number of frames 10 folds 7 classes: 91.44
the training data
Jung et al. 15 [16] NE F LT /C3D IntraFace 3 - fixed number of frames 10 folds 7 classes: 97.25 (95.22)
kuo et al. 18 [89] FA 6 2.7m IntraFace 3 IN fixed length 9 10 folds 7 classes: 98.47
SDM/ S: the last frame
Zhang et al. 17 [68] NE 7/5 2k/1.6m 3 - 10 folds 7 classes: 98.50 (97.78)
Cascaded CNN T: all frames
Kim et al. 17 [66] EIN , CN 7 1.5m Incremental 3 - 5 intensities frames LOSO 6 classes: 78.61 (78.00)
kim et al. 17 [184] EIN 14 - 3 3 - all frames 10 folds 6 classes: 81.53
Hasani et al. 17 [112] F LT , CN 22 - 3000 fps - - ten frames 5 folds 6 classes: 77.50 (74.50)
the same as
Hasani et al. 17 [55] CN 29 - AAM - - static frames 5 folds 6 classes: 78.68
MMI the training data
SDM/ S: the middle frame
Zhang et al. 17 [68] NE 7/5 2k/1.6m 3 - 10 folds 6 classes: 81.18 (79.30)
Cascaded CNN T: all frames
S: emotional
Sun et al. 17 [185] NE 3 * GoogLeNetv2 3 - - 10 folds 6 classes: 91.46
T: neutral+emotional
2
Zhao et al. 16 [17] EIN 22 6.8m 3 - - from the 7th to the last the last frame 10 folds 6 classes: 84.59
Yu et al. 17 [70] EIN 42 - MTCNN 3 -from the 7th to the last2 the peak expression 10 folds 6 classes: 86.23
Oulu- Jung et al. 15 [16] F LT 2 177.6k IntraFace 3 -fixed number of frames 10 folds 6 classes: 74.17
CASIA Jung et al. 15 [16] C3D 4 - IntraFace 3 -fixed number of frames 10 folds 6 classes: 74.38
Jung et al. 15 [16] NE F LT /C3D IntraFace 3 -fixed number of frames the same as 10 folds 6 classes: 81.46 (81.49)
SDM/ S: the last frame the training data
Zhang et al. 17 [68] NE 7/5 2k/1.6m 3 - 10 folds 6 classes: 86.25 (86.25)
Cascaded CNN T: all frames
kuo et al. 18 [89] NE 6 2.7m IntraFace 3 IN fixed length 9 10 folds 6 classes: 91.67
Ding et al. 16 [186] FA AlexNet 3 - - Training: 773; Validation: 373; Test: 593 Validation: 44.47
Yan et al. 16 [187] CN VGG16-LSTM 3 3 - 40 frames 3 folds 7 classes: 44.46
AFEW*
Yan et al. 16 [187] F LT 4 - [188] - - 30 frames 3 folds 7 classes: 37.37
6.0
Fan et al. 16 [108] CN VGG16-LSTM 3 - - 16 features for LSTM Validation: 45.43 (38.96)
Fan et al. [108] C3D 10 - 3 - - several windows of 16 consecutive frames Validation: 39.69 (38.55)
Yan et al. 16 [187] fusion / Training: 773; Validation: 383; Test: 593 Test: 56.66 (40.81)
Fan et al. 16 [108] fusion / Training: 774; Validation: 383; Test: 593 Test: 59.02 (44.94)
Ouyang et al. 17 [189] CN VGG-LSTM MTCNN 3 - 16 frames Validation: 47.4
Ouyang et al. 17 [189] C3D 10 - MTCNN 3 - 16 frames Validation: 35.2
AFEW*
Vielzeuf et al. [190] CN C3D-LSTM 3 3 - detected face frames Validation: 43.2
7.0
Vielzeuf et al. [190] CN VGG16-LSTM 3 3 - several windows of 16 consecutive frames Validation: 48.6
Vielzeuf et al. [190] fusion / Training: 773; Validation: 383; Test: 653 Test: 58.81 (43.23)
1
The value in parentheses is the mean accuracy calculated from the confusion matrix given by authors.
2
A pair of images (peak and non-peak expression) is chosen for training each time.
*
We have included the result of a single spatio-temporal network and also the best result after fusion with both video and audio modalities.
†
7 Classes in CK+: Anger, Contempt, Disgust, Fear, Happiness, Sadness, and Surprise.
‡
7 Classes in AFEW: Anger, Disgust, Fear, Happiness, Neutral, Sadness, and Surprise.
deep network (PPDN) that takes a pair of peak and non-peak and encoding intermediate intensity, respectively.
images of the same expression and from the same subject as
input and utilizes the L2-norm loss to minimize the distance
between both images. During back propagation, a peak gradient Considering that images with different expression intensities
suppression (PGS) was proposed to drive the learned feature of for an individual identity is not always available in the wild,
the non-peak expression towards that of peak expression while several works proposed to automatically acquire the intensity label
avoiding the inverse. Thus, the network discriminant ability on or to generate new images with targeted intensity. For example, in
lower-intensity expressions can be improved. Based on PPDN, [194] the peak and neutral frames was automatically picked out
Yu et al. [70] proposed a deeper cascaded peak-piloted network from the sequence with two stages : a clustering stage to divide all
(DCPN) that used a deeper and larger architecture to enhance the frames into the peak-like group and the neutral-like group using
discriminative ability of the learned features and employed an in- K-means algorithm, and a classification stage to detect peak and
tegration training method called cascade fine-tuning to avoid over- neutral frames using a semi-supervised SVM. And in [184], a
fitting. In [66], more intensity states were utilized (onset, onset to deep generative-contrastive model was presented with two steps:
apex transition, apex, apex to offset transition and offset) and five a generator to generate the reference (less-expressive) face for
loss functions were adopted to regulate the network training by each sample via convolutional encoder-decoder and a contrastive
minimizing expression classification error, intra-class expression network to jointly filter out information that is irrelevant with
variation, intensity classification error and intra-intensity variation, expressions through a contrastive metric loss and a supervised
reconstruction loss.
15
Fig. 12. The proposed 3DCNN-DAP [199]. The input n-frame sequence
is convolved with 3D filters; then, 13 ∗ c ∗ k part filters corresponding to
13 manually defined facial parts are used to convolve k feature maps for
the facial action part detection maps of c expression classes.
Fig. 11. The proposed PPDN in [17]. During training, PPDN is trained by sequence and weighted based on their prediction scores. Instead
jointly optimizing the L2-norm loss and the cross-entropy losses of two of directly using C3D for classification, [109] employed C3D for
expression images. During testing, the PPDN takes one still image as
input for probability prediction. spatio-temporal feature extraction and then cascaded with DBN
for prediction. In [201], C3D was also used as a feature extractor,
followed by a NetVLAD layer [202] to aggregate the temporal
4.2.3 Deep spatio-temporal FER network information of the motion features by learning cluster centers.
Although the frame aggregation can integrate frames in the Facial landmark trajectory: Related psychological studies
video sequence, the crucial temporal dependency is not explicitly have shown that expressions are invoked by dynamic motions
exploited. By contrast, the spatio-temporal FER network takes of certain facial parts (e.g., eyes, nose and mouth) that contain
a range of frames in a temporal window as a single input the most descriptive information for representing expressions.
without prior knowledge of the expression intensity and utilizes To obtain more accurate facial actions for FER, facial landmark
both textural and temporal information to encode more subtle trajectory models have been proposed to capture the dynamic
expressions. variations of facial components from consecutive frames.
To extract landmark trajectory representation, the most direct
RNN and C3D: RNN can robustly derive information from way is to concatenate coordinates of facial landmark points
sequences by exploiting the fact that feature vectors for successive from frames over time with normalization to generate a one-
data are connected semantically and are therefore interdependent. dimensional trajectory signal for each sequence [16] or to form
The improved version, LSTM, is flexible to handle varying-length an image-like map as the input of CNN [187]. Besides, relative
sequential data with lower computation cost. Derived from RNN, distance variation of each landmark in consecutive frames can
an RNN that is composed of ReLUs and initialized with the also be used to capture the temporal information [203]. Further,
identity matrix (IRNN) [195] was used to provide a simpler part-based model that divides facial landmarks into several parts
mechanism for addressing the vanishing and exploding gradient according to facial physical structure and then separately feeds
problems [87]. And bidirectional RNNs (BRNNs) [196] were them into the networks hierarchically is proved to be efficient for
employed to learn the temporal relations in both the original and both local low-level and global high-level feature encoding [68]
reversed directions [68], [187]. Recently, a Nested LSTM was (see “PHRNN” in Fig. 13) . Instead of separately extracting the
proposed in [71] with two sub-LSTMs. Namely, T-LSTM models trajectory features and then input them into the networks, Hasani
the temporal dynamics of the learned features, and C-LSTM et al. [112] incorporated the trajectory features by replacing the
integrates the outputs of all T-LSTMs together so as to encode shortcut in the residual unit of the original 3D Inception-ResNet
the multi-level features encoded in the intermediate layers of the with element-wise multiplication of facial landmarks and the input
network. tensor of the residual unit. Thus, the landmark based network can
Compared with RNN, CNN is more suitable for computer be trained end-to-end.
vision applications; hence, its derivative C3D [107], which uses
3D convolutional kernels with shared weights along the time Cascaded networks: By combining the powerful perceptual
axis instead of the traditional 2D kernels, has been widely used vision representations learned from CNNs with the strength of
for dynamic-based FER (e.g., [83], [108], [189], [197], [198]) LSTM for variable-length inputs and outputs, Donahue et al.
to capture the spatio-temporal features. Based on C3D, many [204] proposed a both spatially and temporally deep model which
derived structures have been designed for FER. In [199], 3D CNN cascades the outputs of CNNs with LSTMs for various vision
was incorporated with the DPM-inspired [200] deformable facial tasks involving time-varying inputs and outputs. Similar to this
action constraints to simultaneously encode dynamic motion hybrid network, many cascaded networks have been proposed for
and discriminative part-based representations (see Fig. 12 for FER (e.g., [66], [108], [190], [205]).
details). In [16], a deep temporal appearance network (DTAN) Instead of CNN, [206] employed a convolutional sparse
was proposed that employed 3D filters without weight sharing autoencoder for sparse and shift-invariant features; then, an
along the time axis; hence, each filter can vary in importance LSTM classifier was trained for temporal evolution. [189]
over time. Likewise, a weighted C3D was proposed [190], where employed a more flexible network called ResNet-LSTM, which
several windows of consecutive frames were extracted from each allows nodes in lower CNN layers to directly contact with
16
4.2.4 Discussion
In the real world, people display facial expressions in a dynamic
process, e.g., from subtle to obvious, and it has become a trend to
conduct FER on sequence/video data. Table 8 summarizes relative
merits of different types of methods on dynamic data in regards
to the capability of representing spatial and temporal information,
the requirement on training data size and frame length (variable or
Fig. 13. The spatio-temporal network proposed in [68]. The tempo- fixed), the computational efficiency and the performance.
ral network PHRNN for landmark trajectory and the spatial network Frame aggregation is employed to combine the learned feature
MSCNN for identity-invariant features are trained separately. Then, the
predicted probabilities from the two networks are fused together for or prediction probability of each frame for a sequence-level
spatio-temporal FER. result. The output of each frame can be simply concatenated
(fixed-length frames is required in each sequence) or statistically
aggregated to obtain video-level representation (variable-length
frames processible). This method is computationally simple and
can achieve moderate performance if the temporal variations of
the target dataset is not complicated.
According to the fact that the expression intensity in a video
sequence varies over time, the expression intensity-invariant net-
work considers images with non-peak expressions and further
exploits the dynamic correlations between peak and non-peak
expressions to improve performance. Commonly, image frames
with specific intensity states are needed for intensity-invariant
Fig. 14. The joint fine-tuning method for DTAGN proposed in [16]. To FER.
integrate DTGA and DTAN, we freeze the weight values in the gray Despite the advantages of these methods, frame aggregation
boxes and retrain the top layer in the green boxes. The logit values
of the green boxes are used by Softmax3 to supervise the integrated handles frames without consideration of temporal information
network. During training, we combine three softmax loss functions, and and subtle appearance changes, and expression intensity-invariant
for prediction, we use only Softmax3. networks require prior knowledge of expression intensity which
is unavailable in real-world scenarios. By contrast, Deep spatio-
temporal networks are designed to encode temporal dependencies
LSTMs to capture spatio-temporal information. In addition in consecutive frames and have been shown to benefit from
to concatenating LSTM with the fully connected layer of learning spatial features in conjunction with temporal features.
CNN, a hypercolumn-based system [207] extracted the last RNN and its variations (e.g., LSTM, IRNN and BRNN) and C3D
convolutional layer features as the input of the LSTM for longer are foundational networks for learning spatio-temporal features.
range dependencies without losing global coherence. Instead However, the performance of these networks is barely satisfac-
of LSTM, the conditional random fields (CRFs) model [208] tory. RNN is incapable of capturing the powerful convolutional
that are effective in recognizing human activities was employed features. And 3D filers in C3D are applied over very short
in [55] to distinguish the temporal relations of the input sequences. video clips ignoring long-range dynamics. Also, training such
a huge network is computationally a problem, especially for
Network ensemble: A two-stream CNN for action recognition in dynamic FER where video data is insufficient. Alternatively,
videos, which trained one stream of the CNN on the multi-frame facial landmark trajectory methods extract shape features based
dense optical flow for temporal information and the other stream on the physical structures of facial morphological variations to
of the CNN on still images for appearance features and then fused capture dynamic facial component activities, and then apply deep
the outputs of two streams, was introduced by Simonyan et al. networks for classification. This method is computationally simple
[209]. Inspired by this architecture, several network ensemble and can get rid of the issue on illumination variations. However,
models have been proposed for FER. it is sensitive to registration errors and requires accurate facial
Sun et al. [185] proposed a multi-channel network that ex- landmark detection, which is difficult to access in unconstrained
tracted the spatial information from emotion-expressing faces and conditions. Consequently, this method performs less well and is
temporal information (optical flow) from the changes between more suitable to complement appearance representations. Network
emotioanl and neutral faces, and investigated three feature fusion ensemble is utilized to train multiple networks for both spatial and
strategies: score average fusion, SVM-based fusion and neural- temporal information and then to fuse the network outputs in the
network-based fusion. Zhang et al. [68] fused the temporal net- final stage. Optic flow and facial landmark trajectory can be used
work PHRNN (discussed in “Landmark trajectory”) and the as temporal representations to collaborate spatial representations.
spatial network MSCNN (discussed in section 4.1.5) to extract One of the drawbacks of this framework is the pre-computing
the partial-whole, geometry-appearance, and static-dynamic in- and storage consumption on optical flow or landmark trajectory
formation for FER (see Fig. 13). Instead of fusing the network vectors. And most related researches randomly selected fixed-
outputs with different weights, Jung et al. [16] proposed a joint length video frames as input, leading to the loss of useful temporal
fine-tuning method that jointly trained the DTAN (discussed in information. Cascaded networks were proposed to first extract
the “RNN and C3D” ), the DTGN (discussed in the “Landmark discriminative representations for facial expression images and
17
TABLE 8
Comparison of different types of methods for dynamic image sequences in terms of data size requirement, representability of spatial and temporal
information, requirement on frame length, performance, and computational efficiency. F LT = Facial Landmark Trajectory; CN = Cascaded
Network; N E = Network Ensemble.
then input these features to sequential networks to reinforce the [215] proposed a multi-channel pose-aware CNN (MPCNN) that
temporal information encoding. However, this model introduces contains three cascaded parts (multi-channel feature extraction,
additional parameters to capture sequence information, and the jointly multi-scale feature fusion and the pose-aware recognition)
feature learning network (e.g., CNN) and the temporal information to predict expression labels by minimizing the conditional joint
encoding network (e.g., LSTM) in current works are not trained loss of pose and expression recognition. Besides, the technology
jointly, which may lead to suboptimal parameter settings. And of generative adversarial network (GAN) has been employed in
training in an end-to-end fashion is still a long road. [180], [181] to generate facial images with different expressions
Compared with deep networks on static data, Table 4 and Table 7 demonstrate the powerful capability and growing popularity of deep spatio-temporal networks. For instance, comparison results on widely evaluated benchmarks (e.g., CK+ and MMI) illustrate that training networks on sequence data and analyzing the temporal dependency between frames can further improve performance. Also, in the EmotiW 2015 challenge only one system employed deep spatio-temporal networks for FER, whereas 5 of the 7 systems reviewed for the EmotiW 2017 challenge relied on such networks.
5 ADDITIONAL RELATED ISSUES
In addition to the most popular basic expression classification task reviewed above, we further introduce a few related issues that depend on deep neural networks and prototypical expression-related knowledge.
5.1 Occlusion and non-frontal head pose
Occlusion and non-frontal head pose, which may change the visual appearance of the original facial expression, are two major obstacles for automatic FER, especially in real-world scenarios.
For facial occlusion, Ranzato et al. [210], [211] proposed a deep generative model that used mPoT [212] as the first layer of DBNs to model pixel-level representations and then trained DBNs to fit an appropriate distribution to their inputs. Thus, the occluded pixels in images could be filled in by reconstructing the top-layer representation using the sequence of conditional distributions. Cheng et al. [213] employed multilayer RBMs with a pre-training and fine-tuning process on Gabor features to compress features from the occluded facial parts. Xu et al. [214] concatenated high-level learned features transferred from two CNNs with the same structure but pre-trained on different data: the original MSRA-CFW database and the MSRA-CFW database with additive occluded samples.
For multi-view FER, Zhang et al. [156] introduced a projection layer into the CNN that learned discriminative facial features by weighting different facial landmark points within 2D SIFT feature matrices, without requiring facial pose estimation. Liu et al. [215] proposed a multi-channel pose-aware CNN (MPCNN) that contains three cascaded parts (multi-channel feature extraction, jointly multi-scale feature fusion and pose-aware recognition) to predict expression labels by minimizing the conditional joint loss of pose and expression recognition. Besides, generative adversarial networks (GANs) have been employed in [180], [181] to generate facial images with different expressions under arbitrary poses for multi-view FER.
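To make the idea of joint pose-and-expression supervision concrete, the following minimal sketch (PyTorch; the backbone, head sizes and loss weight are illustrative assumptions and do not reproduce the MPCNN of [215]) shares a feature extractor between an expression head and a pose head and minimizes a weighted sum of the two cross-entropy losses.

# Minimal PyTorch sketch of joint pose-and-expression supervision in the spirit of
# pose-aware FER; the backbone, head sizes and loss weight are illustrative assumptions,
# not the architecture used in [215].
import torch
import torch.nn as nn

class PoseAwareFER(nn.Module):
    def __init__(self, num_expressions=6, num_poses=5, feat_dim=256):
        super().__init__()
        # Shared convolutional feature extractor (stand-in for the multi-channel part).
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(1),
        )
        self.fc = nn.Linear(64, feat_dim)
        self.expr_head = nn.Linear(feat_dim, num_expressions)  # expression logits
        self.pose_head = nn.Linear(feat_dim, num_poses)        # head-pose logits

    def forward(self, x):
        f = torch.relu(self.fc(self.features(x).flatten(1)))
        return self.expr_head(f), self.pose_head(f)

def joint_loss(expr_logits, pose_logits, expr_labels, pose_labels, pose_weight=0.5):
    """Joint objective: expression loss plus a weighted pose loss."""
    ce = nn.CrossEntropyLoss()
    return ce(expr_logits, expr_labels) + pose_weight * ce(pose_logits, pose_labels)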
5.2 FER on infrared data
Although RGB or gray-scale data are the current standard in deep FER, these data are vulnerable to ambient lighting conditions. In contrast, infrared images, which record the skin temperature distribution produced by emotions, are not sensitive to illumination variations and may be a promising alternative for investigating facial expressions. For example, He et al. [216] employed a DBM model that consists of a Gaussian-binary RBM and a binary RBM for FER. The model was trained by layerwise pre-training and joint training and was then fine-tuned on long-wavelength thermal infrared images to learn thermal features. Wu et al. [217] proposed a three-stream 3D CNN to fuse local and global spatio-temporal features on illumination-invariant near-infrared images for FER.

5.3 FER on 3D static and dynamic data
Despite the significant advances achieved in 2D FER, it fails to resolve two main problems: illumination changes and pose variations [29]. 3D FER, which uses 3D face shape models with depth information, can capture subtle facial deformations that are naturally robust to pose and lighting variations.
Depth images and videos record the intensity of facial pixels based on their distance from a depth camera and therefore contain critical information about facial geometric relations. For example, [218] used a Kinect depth sensor to obtain gradient direction information and then employed a CNN on unregistered facial depth images for FER. [219], [220] extracted a series of salient features from depth videos and combined them with deep networks (i.e., CNN and DBN) for FER. To emphasize the dynamic deformation patterns of facial expression motions, Li et al. [221] explored 4D FER (3D FER using dynamic data) using a dynamic geometrical image network. Furthermore, Chang et al. [222] proposed to estimate 3D expression coefficients from image intensities using a CNN without requiring facial landmark detection; thus, the model is highly robust to extreme appearance variations, including out-of-plane head rotations, scale changes, and occlusions.
Recently, an increasing number of works tend to combine 2D and 3D data to further improve performance. Oyedotun et al. [223] employed a CNN to jointly learn facial expression features from both RGB and depth-map latent modalities, and Li et al. [224] proposed a deep fusion CNN (DF-CNN) to explore multi-modal 2D+3D FER. Specifically, six types of 2D facial attribute maps (i.e., geometry, texture, curvature, and the normal components x, y and z) were first extracted from the textured 3D face scans and were then jointly fed into the feature extraction and feature fusion subnets to learn the optimal combination weights of the 2D and 3D facial representations. To improve this work, [225] proposed to extract deep features from different facial parts cropped from the texture and depth images and then fuse these features to interconnect them with feedback. Wei et al. [226] further explored the data bias problem in 2D+3D FER using an unsupervised domain adaptation technique.
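The following minimal sketch illustrates this kind of multi-channel 2D+3D fusion in the spirit of DF-CNN (PyTorch; the branch depth, feature size and number of attribute maps are our own assumptions): each attribute map is processed by its own small branch, and a learned softmax weighting combines the branch features before classification.

# Minimal PyTorch sketch of multi-channel 2D+3D feature fusion in the spirit of DF-CNN:
# each facial attribute map (texture, depth/geometry, curvature, normal x/y/z) passes
# through its own small CNN branch, and a learned softmax weighting combines the branch
# features before classification. Branch depth, feature size and map count are assumptions.
import torch
import torch.nn as nn

class AttributeMapFusionFER(nn.Module):
    def __init__(self, num_maps=6, feat_dim=128, num_classes=6):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
                nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(1),
                nn.Flatten(), nn.Linear(32, feat_dim), nn.ReLU(),
            ) for _ in range(num_maps)
        ])
        # One learnable scalar per attribute map; softmax turns them into fusion weights.
        self.fusion_logits = nn.Parameter(torch.zeros(num_maps))
        self.classifier = nn.Linear(feat_dim, num_classes)

    def forward(self, maps):
        # maps: tensor of shape (batch, num_maps, H, W), one channel per attribute map.
        feats = torch.stack(
            [branch(maps[:, i:i + 1]) for i, branch in enumerate(self.branches)], dim=1
        )  # (batch, num_maps, feat_dim)
        weights = torch.softmax(self.fusion_logits, dim=0).view(1, -1, 1)
        fused = (weights * feats).sum(dim=1)  # weighted combination of 2D/3D representations
        return self.classifier(fused)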
5.4 Facial expression synthesis
Realistic facial expression synthesis, which can generate various facial expressions for interactive interfaces, is a hot topic. Susskind et al. [227] demonstrated that a DBN has the capacity to capture the large range of variation in expressive appearance and can be trained on large but sparsely labeled datasets. In light of this work, [210], [211], [228] employed DBNs with unsupervised learning to construct facial expression synthesis systems. Kaneko et al. [149] proposed a multitask deep network with state recognition and key-point localization to adaptively generate visual feedback that improves facial expression recognition. With the recent success of deep generative models, such as the variational autoencoder (VAE), the adversarial autoencoder (AAE) and the generative adversarial network (GAN), a series of facial expression synthesis systems have been developed based on these models (e.g., [229], [230], [231], [232] and [233]). Facial expression synthesis can also be applied to data augmentation without manually collecting and labeling huge datasets. Masi et al. [234] employed a CNN to synthesize new face images by increasing face-specific appearance variation, such as expressions, within a 3D textured face model.
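As a rough illustration of label-conditioned synthesis of the kind built on GANs above, the sketch below (PyTorch; the layer sizes and 64x64 output are illustrative assumptions and do not reproduce any of the systems in [229]-[233]) decodes a noise vector concatenated with a one-hot expression code into a face image; in practice such a generator is trained adversarially against a discriminator.

# Minimal PyTorch sketch of a label-conditioned generator of the kind used in GAN-based
# expression synthesis: a noise vector concatenated with a one-hot expression code is
# decoded into a face image. Layer sizes and the 64x64 output are illustrative assumptions
# and do not reproduce any of the cited systems [229]-[233].
import torch
import torch.nn as nn

class ExpressionGenerator(nn.Module):
    def __init__(self, noise_dim=100, num_expressions=7, img_channels=3):
        super().__init__()
        self.net = nn.Sequential(
            nn.ConvTranspose2d(noise_dim + num_expressions, 256, 4), nn.ReLU(),      # 4x4
            nn.ConvTranspose2d(256, 128, 4, stride=2, padding=1), nn.ReLU(),         # 8x8
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(),          # 16x16
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),           # 32x32
            nn.ConvTranspose2d(32, img_channels, 4, stride=2, padding=1), nn.Tanh(), # 64x64
        )

    def forward(self, noise, expression_onehot):
        # Condition the generator by concatenating the expression code with the noise.
        z = torch.cat([noise, expression_onehot], dim=1).unsqueeze(-1).unsqueeze(-1)
        return self.net(z)

# Usage: synthesize a batch of faces for a hypothetical class index 3.
# g = ExpressionGenerator()
# fake_faces = g(torch.randn(8, 100), torch.eye(7)[torch.full((8,), 3)])  # (8, 3, 64, 64)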
5.5 Visualization techniques
In addition to utilizing CNNs for FER, several works (e.g., [139], [235], [236]) employed visualization techniques [237] on the learned CNN features to qualitatively analyze how the CNN contributes to the appearance-based learning process of FER and to decipher which portions of the face yield the most discriminative information. The deconvolutional results all indicated that the activations of particular filters on the learned features correlate strongly with the face regions that correspond to facial AUs.
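A simple stand-in for such visualizations is a gradient-based saliency map, sketched below (PyTorch; this is not the deconvolution procedure of [237]): the gradient of the predicted expression score with respect to the input image highlights the facial regions that drive the decision.

# Minimal sketch of a gradient-based saliency map for a trained FER CNN: the gradient of
# the predicted expression score with respect to the input highlights the facial regions
# that drive the decision. This is a simple stand-in for the deconvolution-style
# visualizations cited above; `model` is any differentiable classifier.
import torch

def expression_saliency(model, face, target_class=None):
    """face: (1, C, H, W) tensor; returns an (H, W) saliency map."""
    model.eval()
    face = face.clone().requires_grad_(True)
    logits = model(face)
    cls = logits.argmax(dim=1) if target_class is None else torch.tensor([target_class])
    score = logits[0, cls].sum()
    score.backward()
    # Max absolute gradient over the color channels gives a per-pixel importance score.
    return face.grad.detach().abs().max(dim=1).values[0]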
5.6 Other special issues
Several novel issues have been approached on the basis of the prototypical expression categories: the dominant and complementary emotion recognition challenge [238] and the Real versus Fake expressed emotions challenge [239]. Furthermore, deep learning techniques have been thoroughly applied by the participants of these two challenges (e.g., [240], [241], [242]). Additional related real-world applications, such as the Real-time FER App for smartphones [243], [244], Eyemotion (FER using eye-tracking cameras) [245], privacy-preserving mobile analytics [246], Unfelt emotions [247] and Depression recognition [248], have also been developed.

6 CHALLENGES AND OPPORTUNITIES
6.1 Facial expression datasets
As the FER literature shifts its main focus to the challenging in-the-wild environmental conditions, many researchers have committed to employing deep learning technologies to handle difficulties such as illumination variation, occlusions, non-frontal head poses, identity bias and the recognition of low-intensity expressions. Given that FER is a data-driven task and that training a sufficiently deep network to capture subtle expression-related deformations requires a large amount of training data, the major challenge that deep FER systems face is the lack of training data in terms of both quantity and quality.
Because people of different age ranges, cultures and genders display and interpret facial expressions in different ways, an ideal facial expression dataset is expected to include abundant sample images with precise face attribute labels, not just expression but also other attributes such as age, gender and ethnicity, which would facilitate related research on cross-age-range, cross-gender and cross-cultural FER using deep learning techniques such as multitask deep networks and transfer learning. In addition, although occlusion and multi-pose problems have received relatively wide interest in the field of deep face recognition, the occlusion-robust and pose-invariant issues have received less attention in deep FER. One of the main reasons is the lack of a large-scale facial expression dataset with occlusion-type and head-pose annotations.
On the other hand, accurately annotating a large volume of image data with the large variation and complexity of natural scenarios is an obvious impediment to the construction of expression datasets. A reasonable approach is to employ crowd-sourcing models [44], [46], [249] under the guidance of expert annotators. Additionally, a fully automatic labeling tool [43] refined by experts is an alternative that can provide approximate but efficient annotations. In both cases, a subsequent reliable estimation or label-learning process is necessary to filter out noisy annotations. In particular, a few comparatively large-scale datasets that consider real-world scenarios and contain a wide range of facial expressions have recently become publicly available, i.e., EmotioNet [43], RAF-DB [44], [45] and AffectNet [46], and we anticipate that, with advances in technology and the widespread reach of the Internet, more complementary facial expression datasets will be constructed to promote the development of deep FER.
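The reliable-estimation step can be as simple as agreement filtering over crowdsourced votes, as in the following sketch (Python; the agreement threshold and data layout are illustrative assumptions, not the procedure used for EmotioNet, RAF-DB or AffectNet).

# Minimal sketch of reliable-estimation filtering for crowdsourced expression labels:
# keep an image only when enough annotators agree on the majority label. The agreement
# threshold and data layout are illustrative assumptions, not the procedure used by
# EmotioNet, RAF-DB or AffectNet.
from collections import Counter

def aggregate_crowdsourced_labels(annotations, min_agreement=0.6):
    """annotations: dict mapping image_id -> list of labels from different annotators.
    Returns dict of image_id -> majority label for images that pass the agreement check."""
    reliable = {}
    for image_id, labels in annotations.items():
        label, count = Counter(labels).most_common(1)[0]
        if count / len(labels) >= min_agreement:   # filter out ambiguous / noisy items
            reliable[image_id] = label
    return reliable

# Usage with hypothetical votes from five annotators:
# votes = {"img_001": ["happiness"] * 4 + ["surprise"],
#          "img_002": ["anger", "disgust", "fear", "anger", "sadness"]}
# aggregate_crowdsourced_labels(votes)  # keeps img_001 only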
6.2 Incorporating other affective models
Another major issue that requires consideration is that, while FER within the categorical model is widely acknowledged and researched, the definition of the prototypical expressions covers only a small portion of specific categories and cannot capture the full repertoire of expressive behaviors for realistic interactions. Two additional models were developed to describe a larger range of the emotional landscape: the FACS model [10], [176], where various facial muscle AUs are combined to describe the visible appearance changes of facial expressions, and the dimensional model [11], [250], where two continuous-valued variables, namely valence and arousal, continuously encode small changes in the intensity of emotions. Another novel definition, i.e., compound expression, was proposed by Du et al. [52], who argued that some facial expressions are actually combinations of more than one basic emotion. These works improve the characterization of facial expressions and, to some extent, can complement the categorical model.
For instance, as discussed above, the visualization results of CNNs have demonstrated a certain congruity between the learned representations and the facial areas defined by AUs. Thus, we can design the filters of deep neural networks to distribute different weights according to the importance of different facial muscle action parts.
6.3 Dataset bias and imbalanced distribution
Data bias and inconsistent annotations are very common among different facial expression datasets due to different collection conditions and the subjectiveness of annotation. Researchers commonly evaluate their algorithms within a specific dataset and can achieve satisfactory performance. However, early cross-database experiments have indicated that discrepancies between databases exist due to the different collection environments and construction indicators [12]; hence, algorithms evaluated via intra-database protocols lack generalizability to unseen test data, and performance in cross-dataset settings deteriorates greatly. Deep domain adaptation and knowledge distillation are alternatives for addressing this bias [226], [251]. Furthermore, because of the inconsistent expression annotations, FER performance cannot keep improving when the training data are enlarged by directly merging multiple datasets [167].
Another common problem in facial expression recognition is class imbalance, which is a result of the practicalities of data acquisition: eliciting and annotating a smile is easy, whereas capturing information for disgust, anger and other less common expressions can be very challenging. As shown in Table 4 and Table 7, the performance assessed in terms of mean accuracy, which assigns equal weights to all classes, decreases when compared with the accuracy criterion, and this decline is especially evident on real-world datasets (e.g., SFEW 2.0 and AFEW). One solution is to balance the class distribution during the pre-processing stage using data augmentation and synthesis. Another alternative is to develop a cost-sensitive loss layer for deep networks during training.
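A cost-sensitive loss layer can be as simple as a class-weighted cross-entropy, sketched below (PyTorch; the inverse-frequency weighting and the example class counts are illustrative assumptions).

# Minimal sketch of a cost-sensitive loss for imbalanced expression classes: cross-entropy
# weighted by inverse class frequency, so rare classes (e.g., disgust, fear) contribute
# more to the gradient than frequent ones (e.g., happiness). The class counts below are
# placeholders for statistics computed on the actual training set.
import torch
import torch.nn as nn

def make_cost_sensitive_loss(class_counts):
    """class_counts: list/tensor with the number of training samples per expression class."""
    counts = torch.as_tensor(class_counts, dtype=torch.float32)
    weights = counts.sum() / (len(counts) * counts)   # inverse-frequency weighting
    return nn.CrossEntropyLoss(weight=weights)

# Usage with hypothetical counts for seven expression classes:
# criterion = make_cost_sensitive_loss([5000, 400, 350, 9000, 1200, 800, 3000])
# loss = criterion(logits, labels)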
S. Yan, “Peak-piloted deep network for facial expression recognition,”
in European conference on computer vision. Springer, 2016, pp. 425–
442.
6.4 Multimodal affect recognition
[18] C. A. Corneanu, M. O. Simón, J. F. Cohn, and S. E. Guerrero,
Last but not the least, human expressive behaviors in realistic “Survey on rgb, 3d, thermal, and multimodal approaches for facial
applications involve encoding from different perspectives, and the expression recognition: History, trends, and affect-related applications,”
IEEE transactions on pattern analysis and machine intelligence, vol. 38,
facial expression is only one modality. Although pure expression no. 8, pp. 1548–1568, 2016.
recognition based on visible face images can achieve promis- [19] R. Zhi, M. Flierl, Q. Ruan, and W. B. Kleijn, “Graph-preserving sparse
ing results, incorporating with other models into a high-level nonnegative matrix factorization with application to facial expression
framework can provide complementary information and further recognition,” IEEE Transactions on Systems, Man, and Cybernetics,
Part B (Cybernetics), vol. 41, no. 1, pp. 38–52, 2011.
enhance the robustness. For example, participants in the EmotiW
[20] L. Zhong, Q. Liu, P. Yang, B. Liu, J. Huang, and D. N. Metaxas,
challenges and Audio Video Emotion Challenges (AVEC) [252], “Learning active facial patches for expression analysis,” in Computer
[253] considered the audio model to be the second most important Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on.
element and employed various fusion techniques for multimodal IEEE, 2012, pp. 2562–2569.
affect recognition. Additionally, the fusion of other modalities, [21] I. J. Goodfellow, D. Erhan, P. L. Carrier, A. Courville, M. Mirza,
B. Hamner, W. Cukierski, Y. Tang, D. Thaler, D.-H. Lee et al.,
such as infrared images, depth information from 3D face models “Challenges in representation learning: A report on three machine
and physiological data, is becoming a promising research direction learning contests,” in International Conference on Neural Information
due to the large complementarity for facial expressions. Processing. Springer, 2013, pp. 117–124.
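The sketch below illustrates one common form of such fusion, feature-level concatenation of audio and visual descriptors before a joint classifier (PyTorch; the feature dimensions and two-modality setup are illustrative assumptions rather than any specific EmotiW or AVEC system).

# Minimal PyTorch sketch of feature-level audio-visual fusion: per-modality feature vectors
# (e.g., from a face CNN and an audio network) are concatenated and classified jointly.
# Feature sizes and the two-modality setup are illustrative assumptions, not a specific
# EmotiW/AVEC system.
import torch
import torch.nn as nn

class AudioVisualFusion(nn.Module):
    def __init__(self, video_dim=512, audio_dim=128, num_classes=7):
        super().__init__()
        self.fusion = nn.Sequential(
            nn.Linear(video_dim + audio_dim, 256), nn.ReLU(), nn.Dropout(0.5),
            nn.Linear(256, num_classes),
        )

    def forward(self, video_feat, audio_feat):
        # Concatenate modality features before the joint classifier.
        return self.fusion(torch.cat([video_feat, audio_feat], dim=1))

# Usage with pre-extracted features for a batch of 8 clips:
# model = AudioVisualFusion()
# logits = model(torch.randn(8, 512), torch.randn(8, 128))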
REFERENCES
[1] C. Darwin and P. Prodger, The expression of the emotions in man and animals. Oxford University Press, USA, 1998.
[2] Y.-I. Tian, T. Kanade, and J. F. Cohn, "Recognizing action units for facial expression analysis," IEEE Transactions on pattern analysis and machine intelligence, vol. 23, no. 2, pp. 97-115, 2001.
[3] P. Ekman and W. V. Friesen, "Constants across cultures in the face and emotion," Journal of personality and social psychology, vol. 17, no. 2, pp. 124-129, 1971.
[4] P. Ekman, "Strong evidence for universals in facial expressions: a reply to Russell's mistaken critique," Psychological bulletin, vol. 115, no. 2, pp. 268-287, 1994.
[5] D. Matsumoto, "More evidence for the universality of a contempt expression," Motivation and Emotion, vol. 16, no. 4, pp. 363-368, 1992.
[6] R. E. Jack, O. G. Garrod, H. Yu, R. Caldara, and P. G. Schyns, "Facial expressions of emotion are not culturally universal," Proceedings of the National Academy of Sciences, vol. 109, no. 19, pp. 7241-7244, 2012.
[7] Z. Zeng, M. Pantic, G. I. Roisman, and T. S. Huang, "A survey of affect recognition methods: Audio, visual, and spontaneous expressions," IEEE transactions on pattern analysis and machine intelligence, vol. 31, no. 1, pp. 39-58, 2009.
[8] E. Sariyanidi, H. Gunes, and A. Cavallaro, "Automatic analysis of facial affect: A survey of registration, representation, and recognition," IEEE transactions on pattern analysis and machine intelligence, vol. 37, no. 6, pp. 1113-1133, 2015.
[9] B. Martinez and M. F. Valstar, "Advances, challenges, and opportunities in automatic facial expression recognition," in Advances in Face Detection and Facial Image Analysis. Springer, 2016, pp. 63-100.
[10] P. Ekman, "Facial action coding system (FACS)," A human face, 2002.
[11] H. Gunes and B. Schuller, "Categorical and dimensional affect analysis in continuous input: Current trends and future directions," Image and Vision Computing, vol. 31, no. 2, pp. 120-136, 2013.
[12] C. Shan, S. Gong, and P. W. McOwan, "Facial expression recognition based on local binary patterns: A comprehensive study," Image and Vision Computing, vol. 27, no. 6, pp. 803-816, 2009.
[13] P. Liu, S. Han, Z. Meng, and Y. Tong, "Facial expression recognition via a boosted deep belief network," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 1805-1812.
[14] A. Mollahosseini, D. Chan, and M. H. Mahoor, "Going deeper in facial expression recognition using deep neural networks," in Applications of Computer Vision (WACV), 2016 IEEE Winter Conference on. IEEE, 2016, pp. 1-10.
[15] G. Zhao and M. Pietikainen, "Dynamic texture recognition using local binary patterns with an application to facial expressions," IEEE transactions on pattern analysis and machine intelligence, vol. 29, no. 6, pp. 915-928, 2007.
[16] H. Jung, S. Lee, J. Yim, S. Park, and J. Kim, "Joint fine-tuning in deep neural networks for facial expression recognition," in Computer Vision (ICCV), 2015 IEEE International Conference on. IEEE, 2015, pp. 2983-2991.
[17] X. Zhao, X. Liang, L. Liu, T. Li, Y. Han, N. Vasconcelos, and S. Yan, "Peak-piloted deep network for facial expression recognition," in European conference on computer vision. Springer, 2016, pp. 425-442.
[18] C. A. Corneanu, M. O. Simón, J. F. Cohn, and S. E. Guerrero, "Survey on rgb, 3d, thermal, and multimodal approaches for facial expression recognition: History, trends, and affect-related applications," IEEE transactions on pattern analysis and machine intelligence, vol. 38, no. 8, pp. 1548-1568, 2016.
[19] R. Zhi, M. Flierl, Q. Ruan, and W. B. Kleijn, "Graph-preserving sparse nonnegative matrix factorization with application to facial expression recognition," IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), vol. 41, no. 1, pp. 38-52, 2011.
[20] L. Zhong, Q. Liu, P. Yang, B. Liu, J. Huang, and D. N. Metaxas, "Learning active facial patches for expression analysis," in Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on. IEEE, 2012, pp. 2562-2569.
[21] I. J. Goodfellow, D. Erhan, P. L. Carrier, A. Courville, M. Mirza, B. Hamner, W. Cukierski, Y. Tang, D. Thaler, D.-H. Lee et al., "Challenges in representation learning: A report on three machine learning contests," in International Conference on Neural Information Processing. Springer, 2013, pp. 117-124.
[22] A. Dhall, O. Ramana Murthy, R. Goecke, J. Joshi, and T. Gedeon, "Video and image based emotion recognition challenges in the wild: EmotiW 2015," in Proceedings of the 2015 ACM on International Conference on Multimodal Interaction. ACM, 2015, pp. 423-426.
[23] A. Dhall, R. Goecke, J. Joshi, J. Hoey, and T. Gedeon, "EmotiW 2016: Video and group-level emotion recognition challenges," in Proceedings of the 18th ACM International Conference on Multimodal Interaction. ACM, 2016, pp. 427-432.
[24] A. Dhall, R. Goecke, S. Ghosh, J. Joshi, J. Hoey, and T. Gedeon, "From individual to group-level emotion recognition: EmotiW 5.0," in Proceedings of the 19th ACM International Conference on Multimodal Interaction. ACM, 2017, pp. 524-528.
[25] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classifica- [48] A. Dhall, R. Goecke, S. Lucey, T. Gedeon et al., “Collecting large, richly
tion with deep convolutional neural networks,” in Advances in neural annotated facial-expression databases from movies,” IEEE multimedia,
information processing systems, 2012, pp. 1097–1105. vol. 19, no. 3, pp. 34–41, 2012.
[26] K. Simonyan and A. Zisserman, “Very deep convolutional networks for [49] A. Dhall, R. Goecke, S. Lucey, and T. Gedeon, “Acted facial expres-
large-scale image recognition,” arXiv preprint arXiv:1409.1556, 2014. sions in the wild database,” Australian National University, Canberra,
[27] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, Australia, Technical Report TR-CS-11, vol. 2, p. 1, 2011.
V. Vanhoucke, and A. Rabinovich, “Going deeper with convolutions,” [50] ——, “Static facial expression analysis in tough conditions: Data,
in Proceedings of the IEEE conference on computer vision and pattern evaluation protocol and benchmark,” in Computer Vision Workshops
recognition, 2015, pp. 1–9. (ICCV Workshops), 2011 IEEE International Conference on. IEEE,
[28] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image 2011, pp. 2106–2112.
recognition,” in Proceedings of the IEEE conference on computer vision [51] C. F. Benitez-Quiroz, R. Srinivasan, Q. Feng, Y. Wang, and A. M.
and pattern recognition, 2016, pp. 770–778. Martinez, “Emotionet challenge: Recognition of facial expressions of
[29] M. Pantic and L. J. M. Rothkrantz, “Automatic analysis of facial emotion in the wild,” arXiv preprint arXiv:1703.01210, 2017.
expressions: The state of the art,” IEEE Transactions on pattern analysis [52] S. Du, Y. Tao, and A. M. Martinez, “Compound facial expressions of
and machine intelligence, vol. 22, no. 12, pp. 1424–1445, 2000. emotion,” Proceedings of the National Academy of Sciences, vol. 111,
[30] B. Fasel and J. Luettin, “Automatic facial expression analysis: a survey,” no. 15, pp. E1454–E1462, 2014.
Pattern recognition, vol. 36, no. 1, pp. 259–275, 2003. [53] T. F. Cootes, G. J. Edwards, and C. J. Taylor, “Active appearance mod-
[31] T. Zhang, “Facial expression recognition based on deep learning: A els,” IEEE Transactions on Pattern Analysis & Machine Intelligence,
survey,” in International Conference on Intelligent and Interactive no. 6, pp. 681–685, 2001.
Systems and Applications. Springer, 2017, pp. 345–352. [54] N. Zeng, H. Zhang, B. Song, W. Liu, Y. Li, and A. M. Dobaie,
[32] M. F. Valstar, M. Mehu, B. Jiang, M. Pantic, and K. Scherer, “Meta- “Facial expression recognition via learning deep sparse autoencoders,”
analysis of the first facial expression recognition challenge,” IEEE Neurocomputing, vol. 273, pp. 643–649, 2018.
Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), [55] B. Hasani and M. H. Mahoor, “Spatio-temporal facial expression recog-
vol. 42, no. 4, pp. 966–979, 2012. nition using convolutional neural networks and conditional random
[33] P. Lucey, J. F. Cohn, T. Kanade, J. Saragih, Z. Ambadar, and fields,” in Automatic Face & Gesture Recognition (FG 2017), 2017
I. Matthews, “The extended cohn-kanade dataset (ck+): A complete 12th IEEE International Conference on. IEEE, 2017, pp. 790–795.
dataset for action unit and emotion-specified expression,” in Computer [56] X. Zhu and D. Ramanan, “Face detection, pose estimation, and land-
Vision and Pattern Recognition Workshops (CVPRW), 2010 IEEE Com- mark localization in the wild,” in Computer Vision and Pattern Recogni-
puter Society Conference on. IEEE, 2010, pp. 94–101. tion (CVPR), 2012 IEEE Conference on. IEEE, 2012, pp. 2879–2886.
[34] M. Pantic, M. Valstar, R. Rademaker, and L. Maat, “Web-based database [57] S. E. Kahou, C. Pal, X. Bouthillier, P. Froumenty, Ç. Gülçehre,
for facial expression analysis,” in Multimedia and Expo, 2005. ICME R. Memisevic, P. Vincent, A. Courville, Y. Bengio, R. C. Ferrari
2005. IEEE International Conference on. IEEE, 2005, pp. 5–pp. et al., “Combining modality specific deep neural networks for emotion
[35] M. Valstar and M. Pantic, “Induced disgust, happiness and surprise: an recognition in video,” in Proceedings of the 15th ACM on International
addition to the mmi facial expression database,” in Proc. 3rd Intern. conference on multimodal interaction. ACM, 2013, pp. 543–550.
Workshop on EMOTION (satellite of LREC): Corpora for Research on [58] T. Devries, K. Biswaranjan, and G. W. Taylor, “Multi-task learning of
Emotion and Affect, 2010, p. 65. facial landmarks and expression,” in Computer and Robot Vision (CRV),
[36] M. Lyons, S. Akamatsu, M. Kamachi, and J. Gyoba, “Coding facial 2014 Canadian Conference on. IEEE, 2014, pp. 98–103.
expressions with gabor wavelets,” in Automatic Face and Gesture [59] A. Asthana, S. Zafeiriou, S. Cheng, and M. Pantic, “Robust discrimina-
Recognition, 1998. Proceedings. Third IEEE International Conference tive response map fitting with constrained local models,” in Computer
on. IEEE, 1998, pp. 200–205. Vision and Pattern Recognition (CVPR), 2013 IEEE Conference on.
[37] J. M. Susskind, A. K. Anderson, and G. E. Hinton, “The toronto face IEEE, 2013, pp. 3444–3451.
database,” Department of Computer Science, University of Toronto, [60] M. Shin, M. Kim, and D.-S. Kwon, “Baseline cnn structure analysis
Toronto, ON, Canada, Tech. Rep, vol. 3, 2010. for facial expression recognition,” in Robot and Human Interactive
[38] R. Gross, I. Matthews, J. Cohn, T. Kanade, and S. Baker, “Multi-pie,” Communication (RO-MAN), 2016 25th IEEE International Symposium
Image and Vision Computing, vol. 28, no. 5, pp. 807–813, 2010. on. IEEE, 2016, pp. 724–729.
[39] L. Yin, X. Wei, Y. Sun, J. Wang, and M. J. Rosato, “A 3d facial [61] Z. Meng, P. Liu, J. Cai, S. Han, and Y. Tong, “Identity-aware convo-
expression database for facial behavior research,” in Automatic face lutional neural network for facial expression recognition,” in Automatic
and gesture recognition, 2006. FGR 2006. 7th international conference Face & Gesture Recognition (FG 2017), 2017 12th IEEE International
on. IEEE, 2006, pp. 211–216. Conference on. IEEE, 2017, pp. 558–565.
[40] G. Zhao, X. Huang, M. Taini, S. Z. Li, and M. PietikäInen, “Facial [62] X. Xiong and F. De la Torre, “Supervised descent method and its appli-
expression recognition from near-infrared videos,” Image and Vision cations to face alignment,” in Computer Vision and Pattern Recognition
Computing, vol. 29, no. 9, pp. 607–619, 2011. (CVPR), 2013 IEEE Conference on. IEEE, 2013, pp. 532–539.
[41] O. Langner, R. Dotsch, G. Bijlstra, D. H. Wigboldus, S. T. Hawk, and [63] H.-W. Ng, V. D. Nguyen, V. Vonikakis, and S. Winkler, “Deep learning
A. van Knippenberg, “Presentation and validation of the radboud faces for emotion recognition on small datasets using transfer learning,” in
database,” Cognition and Emotion, vol. 24, no. 8, pp. 1377–1388, 2010. Proceedings of the 2015 ACM on international conference on multi-
[42] D. Lundqvist, A. Flykt, and A. Öhman, “The karolinska directed modal interaction. ACM, 2015, pp. 443–449.
emotional faces (kdef),” CD ROM from Department of Clinical Neu- [64] S. Ren, X. Cao, Y. Wei, and J. Sun, “Face alignment at 3000 fps
roscience, Psychology section, Karolinska Institutet, no. 1998, 1998. via regressing local binary features,” in Proceedings of the IEEE
[43] C. F. Benitez-Quiroz, R. Srinivasan, and A. M. Martinez, “Emotionet: Conference on Computer Vision and Pattern Recognition, 2014, pp.
An accurate, real-time algorithm for the automatic annotation of a 1685–1692.
million facial expressions in the wild,” in Proceedings of IEEE Interna- [65] A. Asthana, S. Zafeiriou, S. Cheng, and M. Pantic, “Incremental face
tional Conference on Computer Vision & Pattern Recognition (CVPR), alignment in the wild,” in Proceedings of the IEEE conference on
Las Vegas, NV, USA, 2016. computer vision and pattern recognition, 2014, pp. 1859–1866.
[44] S. Li, W. Deng, and J. Du, “Reliable crowdsourcing and deep locality- [66] D. H. Kim, W. Baddar, J. Jang, and Y. M. Ro, “Multi-objective based
preserving learning for expression recognition in the wild,” in 2017 spatio-temporal feature representation learning robust to expression in-
IEEE Conference on Computer Vision and Pattern Recognition (CVPR). tensity variations for facial expression recognition,” IEEE Transactions
IEEE, 2017, pp. 2584–2593. on Affective Computing, 2017.
[45] S. Li and W. Deng, “Reliable crowdsourcing and deep locality- [67] Y. Sun, X. Wang, and X. Tang, “Deep convolutional network cascade
preserving learning for unconstrained facial expression recognition,” for facial point detection,” in Computer Vision and Pattern Recognition
IEEE Transactions on Image Processing, 2018. (CVPR), 2013 IEEE Conference on. IEEE, 2013, pp. 3476–3483.
[46] A. Mollahosseini, B. Hasani, and M. H. Mahoor, “Affectnet: A database [68] K. Zhang, Y. Huang, Y. Du, and L. Wang, “Facial expression recognition
for facial expression, valence, and arousal computing in the wild,” IEEE based on deep evolutional spatial-temporal networks,” IEEE Transac-
Transactions on Affective Computing, vol. PP, no. 99, pp. 1–1, 2017. tions on Image Processing, vol. 26, no. 9, pp. 4193–4203, 2017.
[47] Z. Zhang, P. Luo, C. L. Chen, and X. Tang, “From facial expression [69] K. Zhang, Z. Zhang, Z. Li, and Y. Qiao, “Joint face detection and
recognition to interpersonal relation prediction,” International Journal alignment using multitask cascaded convolutional networks,” IEEE
of Computer Vision, vol. 126, no. 5, pp. 1–20, 2018. Signal Processing Letters, vol. 23, no. 10, pp. 1499–1503, 2016.
[70] Z. Yu, Q. Liu, and G. Liu, “Deeper cascaded peak-piloted network for International Conference on Multimodal Interaction. ACM, 2016, pp.
weak expression recognition,” The Visual Computer, pp. 1–9, 2017. 472–478.
[71] Z. Yu, G. Liu, Q. Liu, and J. Deng, “Spatio-temporal convolutional [91] P. Hu, D. Cai, S. Wang, A. Yao, and Y. Chen, “Learning supervised
features with nested lstm for facial expression recognition,” Neurocom- scoring ensemble for emotion recognition in the wild,” in Proceedings
puting, vol. 317, pp. 50–57, 2018. of the 19th ACM International Conference on Multimodal Interaction.
[72] P. Viola and M. Jones, “Rapid object detection using a boosted cascade ACM, 2017, pp. 553–560.
of simple features,” in Computer Vision and Pattern Recognition, [92] T. Hassner, S. Harel, E. Paz, and R. Enbar, “Effective face frontalization
2001. CVPR 2001. Proceedings of the 2001 IEEE Computer Society in unconstrained images,” in Proceedings of the IEEE Conference on
Conference on, vol. 1. IEEE, 2001, pp. I–I. Computer Vision and Pattern Recognition, 2015, pp. 4295–4304.
[73] F. De la Torre, W.-S. Chu, X. Xiong, F. Vicente, X. Ding, and J. F. Cohn, [93] C. Sagonas, Y. Panagakis, S. Zafeiriou, and M. Pantic, “Robust sta-
“Intraface,” in IEEE International Conference on Automatic Face and tistical face frontalization,” in Proceedings of the IEEE International
Gesture Recognition (FG), 2015. Conference on Computer Vision, 2015, pp. 3871–3879.
[74] Z. Zhang, P. Luo, C. C. Loy, and X. Tang, “Facial landmark detection by [94] X. Yin, X. Yu, K. Sohn, X. Liu, and M. Chandraker, “Towards large-
deep multi-task learning,” in European Conference on Computer Vision. pose face frontalization in the wild,” in Proceedings of the IEEE
Springer, 2014, pp. 94–108. Conference on Computer Vision and Pattern Recognition, 2017, pp.
[75] Z. Yu and C. Zhang, “Image based static facial expression recognition 3990–3999.
with multiple deep network learning,” in Proceedings of the 2015 ACM [95] R. Huang, S. Zhang, T. Li, and R. He, “Beyond face rotation: Global and
on International Conference on Multimodal Interaction. ACM, 2015, local perception gan for photorealistic and identity preserving frontal
pp. 435–442. view synthesis,” in Proceedings of the IEEE Conference on Computer
[76] B.-K. Kim, H. Lee, J. Roh, and S.-Y. Lee, “Hierarchical committee Vision and Pattern Recognition, 2017, pp. 2439–2448.
of deep cnns with exponentially-weighted decision fusion for static [96] L. Tran, X. Yin, and X. Liu, “Disentangled representation learning
facial expression recognition,” in Proceedings of the 2015 ACM on gan for pose-invariant face recognition,” in Proceedings of the IEEE
International Conference on Multimodal Interaction. ACM, 2015, Conference on Computer Vision and Pattern Recognition, 2017, pp.
pp. 427–434. 1415–1424.
[77] X. Liu, B. Kumar, J. You, and P. Jia, “Adaptive deep metric learning [97] L. Deng, D. Yu et al., “Deep learning: methods and applications,”
for identity-aware facial expression recognition,” in Proc. IEEE Conf. Foundations and Trends
R in Signal Processing, vol. 7, no. 3–4, pp.
Comput. Vis. Pattern Recognit. Workshops (CVPRW), 2017, pp. 522– 197–387, 2014.
531. [98] B. Fasel, “Robust face analysis using convolutional neural networks,” in
[78] G. Levi and T. Hassner, “Emotion recognition in the wild via convolu- Pattern Recognition, 2002. Proceedings. 16th International Conference
tional neural networks and mapped binary patterns,” in Proceedings of on, vol. 2. IEEE, 2002, pp. 40–43.
the 2015 ACM on international conference on multimodal interaction. [99] ——, “Head-pose invariant facial expression recognition using convo-
ACM, 2015, pp. 503–510. lutional neural networks,” in Proceedings of the 4th IEEE International
[79] D. A. Pitaloka, A. Wulandari, T. Basaruddin, and D. Y. Liliana, Conference on Multimodal Interfaces. IEEE Computer Society, 2002,
“Enhancing cnn with preprocessing stage in automatic emotion recog- p. 529.
nition,” Procedia Computer Science, vol. 116, pp. 523–529, 2017. [100] M. Matsugu, K. Mori, Y. Mitari, and Y. Kaneda, “Subject independent
[80] A. T. Lopes, E. de Aguiar, A. F. De Souza, and T. Oliveira-Santos, facial expression recognition with robust face detection using a convolu-
“Facial expression recognition with convolutional neural networks: cop- tional neural network,” Neural Networks, vol. 16, no. 5-6, pp. 555–559,
ing with few data and the training sample order,” Pattern Recognition, 2003.
vol. 61, pp. 610–628, 2017. [101] B. Sun, L. Li, G. Zhou, X. Wu, J. He, L. Yu, D. Li, and Q. Wei,
[81] M. V. Zavarez, R. F. Berriel, and T. Oliveira-Santos, “Cross-database “Combining multimodal features within a fusion network for emotion
facial expression recognition based on fine-tuned deep convolutional recognition in the wild,” in Proceedings of the 2015 ACM on Interna-
network,” in Graphics, Patterns and Images (SIBGRAPI), 2017 30th tional Conference on Multimodal Interaction. ACM, 2015, pp. 497–
SIBGRAPI Conference on. IEEE, 2017, pp. 405–412. 502.
[82] W. Li, M. Li, Z. Su, and Z. Zhu, “A deep-learning approach to [102] B. Sun, L. Li, G. Zhou, and J. He, “Facial expression recognition in
facial expression recognition with candid images,” in Machine Vision the wild based on multimodal texture features,” Journal of Electronic
Applications (MVA), 2015 14th IAPR International Conference on. Imaging, vol. 25, no. 6, p. 061407, 2016.
IEEE, 2015, pp. 279–282. [103] R. Girshick, J. Donahue, T. Darrell, and J. Malik, “Rich feature
[83] I. Abbasnejad, S. Sridharan, D. Nguyen, S. Denman, C. Fookes, and hierarchies for accurate object detection and semantic segmentation,”
S. Lucey, “Using synthetic data to improve facial expression analysis in Proceedings of the IEEE conference on computer vision and pattern
with 3d convolutional networks,” in Proceedings of the IEEE Confer- recognition, 2014, pp. 580–587.
ence on Computer Vision and Pattern Recognition, 2017, pp. 1609– [104] J. Li, D. Zhang, J. Zhang, J. Zhang, T. Li, Y. Xia, Q. Yan, and L. Xun,
1618. “Facial expression recognition with faster r-cnn,” Procedia Computer
[84] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, Science, vol. 107, pp. 135–140, 2017.
S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial nets,” in [105] S. Ren, K. He, R. Girshick, and J. Sun, “Faster r-cnn: Towards real-time
Advances in neural information processing systems, 2014, pp. 2672– object detection with region proposal networks,” in Advances in neural
2680. information processing systems, 2015, pp. 91–99.
[85] W. Chen, M. J. Er, and S. Wu, “Illumination compensation and nor- [106] S. Ji, W. Xu, M. Yang, and K. Yu, “3d convolutional neural networks
malization for robust face recognition using discrete cosine transform for human action recognition,” IEEE transactions on pattern analysis
in logarithm domain,” IEEE Transactions on Systems, Man, and Cyber- and machine intelligence, vol. 35, no. 1, pp. 221–231, 2013.
netics, Part B (Cybernetics), vol. 36, no. 2, pp. 458–466, 2006. [107] D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri, “Learning
[86] J. Li and E. Y. Lam, “Facial expression recognition using deep neural spatiotemporal features with 3d convolutional networks,” in Computer
networks,” in Imaging Systems and Techniques (IST), 2015 IEEE Vision (ICCV), 2015 IEEE International Conference on. IEEE, 2015,
International Conference on. IEEE, 2015, pp. 1–6. pp. 4489–4497.
[87] S. Ebrahimi Kahou, V. Michalski, K. Konda, R. Memisevic, and C. Pal, [108] Y. Fan, X. Lu, D. Li, and Y. Liu, “Video-based emotion recognition
“Recurrent neural networks for emotion recognition in video,” in Pro- using cnn-rnn and c3d hybrid networks,” in Proceedings of the 18th
ceedings of the 2015 ACM on International Conference on Multimodal ACM International Conference on Multimodal Interaction. ACM,
Interaction. ACM, 2015, pp. 467–474. 2016, pp. 445–450.
[88] S. A. Bargal, E. Barsoum, C. C. Ferrer, and C. Zhang, “Emotion [109] D. Nguyen, K. Nguyen, S. Sridharan, A. Ghasemi, D. Dean, and
recognition in the wild from videos using images,” in Proceedings of the C. Fookes, “Deep spatio-temporal features for multimodal emotion
18th ACM International Conference on Multimodal Interaction. ACM, recognition,” in Applications of Computer Vision (WACV), 2017 IEEE
2016, pp. 433–436. Winter Conference on. IEEE, 2017, pp. 1215–1223.
[89] C.-M. Kuo, S.-H. Lai, and M. Sarkis, “A compact deep learning model [110] S. Ouellet, “Real-time emotion recognition for gaming using deep
for robust facial expression recognition,” in Proceedings of the IEEE convolutional network features,” arXiv preprint arXiv:1408.3750, 2014.
Conference on Computer Vision and Pattern Recognition Workshops, [111] H. Ding, S. K. Zhou, and R. Chellappa, “Facenet2expnet: Regularizing
2018, pp. 2121–2129. a deep face recognition net for expression recognition,” in Automatic
[90] A. Yao, D. Cai, P. Hu, S. Wang, L. Sha, and Y. Chen, “Holonet: towards Face & Gesture Recognition (FG 2017), 2017 12th IEEE International
robust emotion recognition in the wild,” in Proceedings of the 18th ACM Conference on. IEEE, 2017, pp. 118–126.
[112] B. Hasani and M. H. Mahoor, “Facial expression recognition using [137] M. Liu, S. Li, S. Shan, and X. Chen, “Au-aware deep networks for facial
enhanced deep 3d convolutional neural networks,” in Computer Vision expression recognition,” in Automatic Face and Gesture Recognition
and Pattern Recognition Workshops (CVPRW), 2017 IEEE Conference (FG), 2013 10th IEEE International Conference and Workshops on.
on. IEEE, 2017, pp. 2278–2288. IEEE, 2013, pp. 1–6.
[113] G. E. Hinton, S. Osindero, and Y.-W. Teh, “A fast learning algorithm for [138] ——, “Au-inspired deep networks for facial expression feature learn-
deep belief nets,” Neural computation, vol. 18, no. 7, pp. 1527–1554, ing,” Neurocomputing, vol. 159, pp. 126–136, 2015.
2006. [139] P. Khorrami, T. Paine, and T. Huang, “Do deep neural networks learn
[114] G. E. Hinton and T. J. Sejnowski, “Learning and releaming in boltz- facial action units when doing expression recognition?” arXiv preprint
mann machines,” Parallel distributed processing: Explorations in the arXiv:1510.02969v3, 2015.
microstructure of cognition, vol. 1, no. 282-317, p. 2, 1986. [140] J. Cai, Z. Meng, A. S. Khan, Z. Li, J. OReilly, and Y. Tong, “Island loss
[115] G. E. Hinton, “A practical guide to training restricted boltzmann for learning discriminative features in facial expression recognition,” in
machines,” in Neural networks: Tricks of the trade. Springer, 2012, Automatic Face & Gesture Recognition (FG 2018), 2018 13th IEEE
pp. 599–619. International Conference on. IEEE, 2018, pp. 302–309.
[116] Y. Bengio, P. Lamblin, D. Popovici, and H. Larochelle, “Greedy layer- [141] H. Yang, U. Ciftci, and L. Yin, “Facial expression recognition by de-
wise training of deep networks,” in Advances in neural information expression residue learning,” in Proceedings of the IEEE Conference on
processing systems, 2007, pp. 153–160. Computer Vision and Pattern Recognition, 2018, pp. 2168–2177.
[117] G. E. Hinton, “Training products of experts by minimizing contrastive [142] D. Hamester, P. Barros, and S. Wermter, “Face expression recognition
divergence,” Neural computation, vol. 14, no. 8, pp. 1771–1800, 2002. with a 2-channel convolutional neural network,” in Neural Networks
[118] G. E. Hinton and R. R. Salakhutdinov, “Reducing the dimensionality of (IJCNN), 2015 International Joint Conference on. IEEE, 2015, pp.
data with neural networks,” science, vol. 313, no. 5786, pp. 504–507, 1–8.
2006. [143] S. Reed, K. Sohn, Y. Zhang, and H. Lee, “Learning to disentangle fac-
[119] P. Vincent, H. Larochelle, I. Lajoie, Y. Bengio, and P.-A. Manzagol, tors of variation with manifold interaction,” in International Conference
“Stacked denoising autoencoders: Learning useful representations in a on Machine Learning, 2014, pp. 1431–1439.
deep network with a local denoising criterion,” Journal of Machine [144] Z. Zhang, P. Luo, C.-C. Loy, and X. Tang, “Learning social relation
Learning Research, vol. 11, no. Dec, pp. 3371–3408, 2010. traits from face images,” in Proceedings of the IEEE International
[120] Q. V. Le, “Building high-level features using large scale unsupervised Conference on Computer Vision, 2015, pp. 3631–3639.
learning,” in Acoustics, Speech and Signal Processing (ICASSP), 2013 [145] Y. Guo, D. Tao, J. Yu, H. Xiong, Y. Li, and D. Tao, “Deep neural
IEEE International Conference on. IEEE, 2013, pp. 8595–8598. networks with relativity learning for facial expression recognition,” in
[121] S. Rifai, P. Vincent, X. Muller, X. Glorot, and Y. Bengio, “Con- Multimedia & Expo Workshops (ICMEW), 2016 IEEE International
tractive auto-encoders: Explicit invariance during feature extraction,” Conference on. IEEE, 2016, pp. 1–6.
in Proceedings of the 28th International Conference on International [146] B.-K. Kim, S.-Y. Dong, J. Roh, G. Kim, and S.-Y. Lee, “Fusing aligned
Conference on Machine Learning. Omnipress, 2011, pp. 833–840. and non-aligned face information for automatic affect recognition in
[122] J. Masci, U. Meier, D. Cireşan, and J. Schmidhuber, “Stacked convolu- the wild: A deep learning approach,” in Proceedings of the IEEE
tional auto-encoders for hierarchical feature extraction,” in International Conference on Computer Vision and Pattern Recognition Workshops,
Conference on Artificial Neural Networks. Springer, 2011, pp. 52–59. 2016, pp. 48–57.
[123] D. P. Kingma and M. Welling, “Auto-encoding variational bayes,” arXiv
[147] C. Pramerdorfer and M. Kampel, “Facial expression recognition us-
preprint arXiv:1312.6114, 2013.
ing convolutional neural networks: State of the art,” arXiv preprint
[124] P. J. Werbos, “Backpropagation through time: what it does and how to arXiv:1612.02903, 2016.
do it,” Proceedings of the IEEE, vol. 78, no. 10, pp. 1550–1560, 1990.
[148] O. M. Parkhi, A. Vedaldi, A. Zisserman et al., “Deep face recognition.”
[125] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural
in BMVC, vol. 1, no. 3, 2015, p. 6.
computation, vol. 9, no. 8, pp. 1735–1780, 1997.
[149] T. Kaneko, K. Hiramatsu, and K. Kashino, “Adaptive visual feedback
[126] M. Mirza and S. Osindero, “Conditional generative adversarial nets,”
generation for facial expression improvement with multi-task deep
arXiv preprint arXiv:1411.1784, 2014.
neural networks,” in Proceedings of the 2016 ACM on Multimedia
[127] A. Radford, L. Metz, and S. Chintala, “Unsupervised representation
Conference. ACM, 2016, pp. 327–331.
learning with deep convolutional generative adversarial networks,”
arXiv preprint arXiv:1511.06434, 2015. [150] D. Yi, Z. Lei, S. Liao, and S. Z. Li, “Learning face representation from
scratch,” arXiv preprint arXiv:1411.7923, 2014.
[128] A. B. L. Larsen, S. K. Sønderby, H. Larochelle, and O. Winther,
“Autoencoding beyond pixels using a learned similarity metric,” arXiv [151] X. Zhang, L. Zhang, X.-J. Wang, and H.-Y. Shum, “Finding celebrities
preprint arXiv:1512.09300, 2015. in billions of web images,” IEEE Transactions on Multimedia, vol. 14,
[129] X. Chen, Y. Duan, R. Houthooft, J. Schulman, I. Sutskever, and no. 4, pp. 995–1007, 2012.
P. Abbeel, “Infogan: Interpretable representation learning by informa- [152] H.-W. Ng and S. Winkler, “A data-driven approach to cleaning large
tion maximizing generative adversarial nets,” in Advances in neural face datasets,” in Image Processing (ICIP), 2014 IEEE International
information processing systems, 2016, pp. 2172–2180. Conference on. IEEE, 2014, pp. 343–347.
[130] Y. Tang, “Deep learning using linear support vector machines,” arXiv [153] H. Kaya, F. Gürpınar, and A. A. Salah, “Video-based emotion recogni-
preprint arXiv:1306.0239, 2013. tion in the wild using deep transfer learning and score fusion,” Image
[131] A. Dapogny and K. Bailly, “Investigating deep neural forests for facial and Vision Computing, vol. 65, pp. 66–75, 2017.
expression recognition,” in Automatic Face & Gesture Recognition (FG [154] B. Knyazev, R. Shvetsov, N. Efremova, and A. Kuharenko, “Convolu-
2018), 2018 13th IEEE International Conference on. IEEE, 2018, pp. tional neural networks pretrained on large face recognition datasets for
629–633. emotion classification from video,” arXiv preprint arXiv:1711.04598,
[132] P. Kontschieder, M. Fiterau, A. Criminisi, and S. Rota Bulo, “Deep 2017.
neural decision forests,” in Proceedings of the IEEE international [155] D. G. Lowe, “Object recognition from local scale-invariant features,” in
conference on computer vision, 2015, pp. 1467–1475. Computer vision, 1999. The proceedings of the seventh IEEE interna-
[133] J. Donahue, Y. Jia, O. Vinyals, J. Hoffman, N. Zhang, E. Tzeng, and tional conference on, vol. 2. Ieee, 1999, pp. 1150–1157.
T. Darrell, “Decaf: A deep convolutional activation feature for generic [156] T. Zhang, W. Zheng, Z. Cui, Y. Zong, J. Yan, and K. Yan, “A deep neural
visual recognition,” in International conference on machine learning, network-driven feature learning method for multi-view facial expression
2014, pp. 647–655. recognition,” IEEE Transactions on Multimedia, vol. 18, no. 12, pp.
[134] A. S. Razavian, H. Azizpour, J. Sullivan, and S. Carlsson, “Cnn features 2528–2536, 2016.
off-the-shelf: an astounding baseline for recognition,” in Computer [157] Z. Luo, J. Chen, T. Takiguchi, and Y. Ariki, “Facial expression recogni-
Vision and Pattern Recognition Workshops (CVPRW), 2014 IEEE Con- tion with deep age,” in Multimedia & Expo Workshops (ICMEW), 2017
ference on. IEEE, 2014, pp. 512–519. IEEE International Conference on. IEEE, 2017, pp. 657–662.
[135] N. Otberdout, A. Kacem, M. Daoudi, L. Ballihi, and S. Berretti, “Deep [158] L. Chen, M. Zhou, W. Su, M. Wu, J. She, and K. Hirota, “Softmax
covariance descriptors for facial expression recognition,” in BMVC, regression based deep sparse autoencoder network for facial emotion
2018. recognition in human-robot interaction,” Information Sciences, vol. 428,
[136] D. Acharya, Z. Huang, D. Pani Paudel, and L. Van Gool, “Covariance pp. 49–61, 2018.
pooling for facial expression recognition,” in Proceedings of the IEEE [159] V. Mavani, S. Raman, and K. P. Miyapuram, “Facial expression
Conference on Computer Vision and Pattern Recognition Workshops, recognition using visual saliency and deep learning,” arXiv preprint
2018, pp. 367–374. arXiv:1708.08016, 2017.
[160] M. Cornia, L. Baraldi, G. Serra, and R. Cucchiara, “A deep multi-level tive adversarial networks,” in Automatic Face & Gesture Recognition
network for saliency prediction,” in Pattern Recognition (ICPR), 2016 (FG 2018), 2018 13th IEEE International Conference on. IEEE, 2018,
23rd International Conference on. IEEE, 2016, pp. 3488–3493. pp. 294–301.
[161] B.-F. Wu and C.-H. Lin, “Adaptive feature mapping for customizing [183] J. Chen, J. Konrad, and P. Ishwar, “Vgan-based image representation
deep learning based facial expression recognition model,” IEEE Access, learning for privacy-preserving facial expression recognition,” in Pro-
2018. ceedings of the IEEE Conference on Computer Vision and Pattern
[162] J. Lu, V. E. Liong, and J. Zhou, “Cost-sensitive local binary feature Recognition Workshops, 2018, pp. 1570–1579.
learning for facial age estimation,” IEEE Transactions on Image Pro- [184] Y. Kim, B. Yoo, Y. Kwak, C. Choi, and J. Kim, “Deep generative-
cessing, vol. 24, no. 12, pp. 5356–5368, 2015. contrastive networks for facial expression recognition,” arXiv preprint
[163] W. Shang, K. Sohn, D. Almeida, and H. Lee, “Understanding and arXiv:1703.07140, 2017.
improving convolutional neural networks via concatenated rectified [185] N. Sun, Q. Li, R. Huan, J. Liu, and G. Han, “Deep spatial-temporal
linear units,” in International Conference on Machine Learning, 2016, feature fusion for facial expression recognition in static images,” Pattern
pp. 2217–2225. Recognition Letters, 2017.
[164] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, “Re- [186] W. Ding, M. Xu, D. Huang, W. Lin, M. Dong, X. Yu, and H. Li,
thinking the inception architecture for computer vision,” in Proceedings “Audio and face video emotion recognition in the wild using deep
of the IEEE Conference on Computer Vision and Pattern Recognition, neural networks and small datasets,” in Proceedings of the 18th ACM
2016, pp. 2818–2826. International Conference on Multimodal Interaction. ACM, 2016, pp.
[165] C. Szegedy, S. Ioffe, V. Vanhoucke, and A. A. Alemi, “Inception-v4, 506–513.
inception-resnet and the impact of residual connections on learning.” in [187] J. Yan, W. Zheng, Z. Cui, C. Tang, T. Zhang, Y. Zong, and N. Sun,
AAAI, vol. 4, 2017, p. 12. “Multi-clue fusion for emotion recognition in the wild,” in Proceedings
[166] S. Zhao, H. Cai, H. Liu, J. Zhang, and S. Chen, “Feature selection of the 18th ACM International Conference on Multimodal Interaction.
mechanism in cnns for facial expression recognition,” in BMVC, 2018. ACM, 2016, pp. 458–463.
[167] J. Zeng, S. Shan, and X. Chen, “Facial expression recognition with [188] Z. Cui, S. Xiao, Z. Niu, S. Yan, and W. Zheng, “Recurrent shape regres-
inconsistently annotated datasets,” in Proceedings of the European sion,” IEEE Transactions on Pattern Analysis and Machine Intelligence,
Conference on Computer Vision (ECCV), 2018, pp. 222–237. 2018.
[168] Y. Wen, K. Zhang, Z. Li, and Y. Qiao, “A discriminative feature [189] X. Ouyang, S. Kawaai, E. G. H. Goh, S. Shen, W. Ding, H. Ming, and
learning approach for deep face recognition,” in European Conference D.-Y. Huang, “Audio-visual emotion recognition using deep transfer
on Computer Vision. Springer, 2016, pp. 499–515. learning and multiple temporal models,” in Proceedings of the 19th
[169] F. Schroff, D. Kalenichenko, and J. Philbin, “Facenet: A unified embed- ACM International Conference on Multimodal Interaction. ACM,
ding for face recognition and clustering,” in Proceedings of the IEEE 2017, pp. 577–582.
conference on computer vision and pattern recognition, 2015, pp. 815– [190] V. Vielzeuf, S. Pateux, and F. Jurie, “Temporal multimodal fusion for
823. video emotion classification in the wild,” in Proceedings of the 19th
[170] G. Zeng, J. Zhou, X. Jia, W. Xie, and L. Shen, “Hand-crafted feature ACM International Conference on Multimodal Interaction. ACM,
guided deep learning for facial expression recognition,” in Automatic 2017, pp. 569–576.
Face & Gesture Recognition (FG 2018), 2018 13th IEEE International
[191] S. E. Kahou, X. Bouthillier, P. Lamblin, C. Gulcehre, V. Michal-
Conference on. IEEE, 2018, pp. 423–430.
ski, K. Konda, S. Jean, P. Froumenty, Y. Dauphin, N. Boulanger-
[171] D. Ciregan, U. Meier, and J. Schmidhuber, “Multi-column deep neural Lewandowski et al., “Emonets: Multimodal deep learning approaches
networks for image classification,” in Computer vision and pattern for emotion recognition in video,” Journal on Multimodal User Inter-
recognition (CVPR), 2012 IEEE conference on. IEEE, 2012, pp. 3642– faces, vol. 10, no. 2, pp. 99–111, 2016.
3649.
[192] M. Liu, R. Wang, S. Li, S. Shan, Z. Huang, and X. Chen, “Combining
[172] G. Pons and D. Masip, “Supervised committee of convolutional neural
multiple kernel methods on riemannian manifold for emotion recogni-
networks in automated facial expression analysis,” IEEE Transactions
tion in the wild,” in Proceedings of the 16th International Conference
on Affective Computing, 2017.
on Multimodal Interaction. ACM, 2014, pp. 494–501.
[173] B.-K. Kim, J. Roh, S.-Y. Dong, and S.-Y. Lee, “Hierarchical committee
[193] B. Xu, Y. Fu, Y.-G. Jiang, B. Li, and L. Sigal, “Video emotion
of deep convolutional neural networks for robust facial expression
recognition with transferred deep feature encodings,” in Proceedings
recognition,” Journal on Multimodal User Interfaces, vol. 10, no. 2,
of the 2016 ACM on International Conference on Multimedia Retrieval.
pp. 173–189, 2016.
ACM, 2016, pp. 15–22.
[174] K. Liu, M. Zhang, and Z. Pan, “Facial expression recognition with cnn
ensemble,” in Cyberworlds (CW), 2016 International Conference on. [194] J. Chen, R. Xu, and L. Liu, “Deep peak-neutral difference feature for
IEEE, 2016, pp. 163–166. facial expression recognition,” Multimedia Tools and Applications, pp.
1–17, 2018.
[175] G. Pons and D. Masip, “Multi-task, multi-label and multi-domain
learning with residual convolutional networks for emotion recognition,” [195] Q. V. Le, N. Jaitly, and G. E. Hinton, “A simple way to ini-
arXiv preprint arXiv:1802.06664, 2018. tialize recurrent networks of rectified linear units,” arXiv preprint
[176] P. Ekman and E. L. Rosenberg, What the face reveals: Basic and applied arXiv:1504.00941, 2015.
studies of spontaneous expression using the Facial Action Coding [196] M. Schuster and K. K. Paliwal, “Bidirectional recurrent neural net-
System (FACS). Oxford University Press, USA, 1997. works,” IEEE Transactions on Signal Processing, vol. 45, no. 11, pp.
[177] R. Ranjan, S. Sankaranarayanan, C. D. Castillo, and R. Chellappa, “An 2673–2681, 1997.
all-in-one convolutional neural network for face analysis,” in Automatic [197] P. Barros and S. Wermter, “Developing crossmodal expression recogni-
Face & Gesture Recognition (FG 2017), 2017 12th IEEE International tion based on a deep neural model,” Adaptive behavior, vol. 24, no. 5,
Conference on. IEEE, 2017, pp. 17–24. pp. 373–396, 2016.
[178] Y. Lv, Z. Feng, and C. Xu, “Facial expression recognition via deep [198] J. Zhao, X. Mao, and J. Zhang, “Learning deep facial expression
learning,” in Smart Computing (SMARTCOMP), 2014 International features from image and optical flow sequences using 3d cnn,” The
Conference on. IEEE, 2014, pp. 303–308. Visual Computer, pp. 1–15, 2018.
[179] S. Rifai, Y. Bengio, A. Courville, P. Vincent, and M. Mirza, “Disentan- [199] M. Liu, S. Li, S. Shan, R. Wang, and X. Chen, “Deeply learning
gling factors of variation for facial expression recognition,” in European deformable facial action parts model for dynamic expression analysis,”
Conference on Computer Vision. Springer, 2012, pp. 808–822. in Asian conference on computer vision. Springer, 2014, pp. 143–157.
[180] Y.-H. Lai and S.-H. Lai, “Emotion-preserving representation learning [200] P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan,
via generative adversarial network for multi-view facial expression “Object detection with discriminatively trained part-based models,”
recognition,” in Automatic Face & Gesture Recognition (FG 2018), IEEE transactions on pattern analysis and machine intelligence, vol. 32,
2018 13th IEEE International Conference on. IEEE, 2018, pp. 263– no. 9, pp. 1627–1645, 2010.
270. [201] S. Pini, O. B. Ahmed, M. Cornia, L. Baraldi, R. Cucchiara, and B. Huet,
[181] F. Zhang, T. Zhang, Q. Mao, and C. Xu, “Joint pose and expression “Modeling multimodal cues in a deep learning-based framework for
modeling for facial expression recognition,” in Proceedings of the IEEE emotion recognition in the wild,” in Proceedings of the 19th ACM
Conference on Computer Vision and Pattern Recognition, 2018, pp. International Conference on Multimodal Interaction. ACM, 2017,
3359–3368. pp. 536–543.
[182] H. Yang, Z. Zhang, and L. Yin, “Identity-adaptive facial expression [202] R. Arandjelovic, P. Gronat, A. Torii, T. Pajdla, and J. Sivic, “Netvlad:
recognition through expression regeneration using conditional genera- Cnn architecture for weakly supervised place recognition,” in Pro-
ceedings of the IEEE Conference on Computer Vision and Pattern of rgb-depth map latent representations,” in 2017 IEEE International
Recognition, 2016, pp. 5297–5307. Conference on Computer Vision Workshop (ICCVW), 2017.
[203] D. H. Kim, M. K. Lee, D. Y. Choi, and B. C. Song, “Multi-modal [224] H. Li, J. Sun, Z. Xu, and L. Chen, “Multimodal 2d+ 3d facial expression
emotion recognition using semi-supervised learning and multiple neural recognition with deep fusion convolutional neural network,” IEEE
networks in the wild,” in Proceedings of the 19th ACM International Transactions on Multimedia, vol. 19, no. 12, pp. 2816–2831, 2017.
Conference on Multimodal Interaction. ACM, 2017, pp. 529–535. [225] A. Jan, H. Ding, H. Meng, L. Chen, and H. Li, “Accurate facial parts
[204] J. Donahue, L. Anne Hendricks, S. Guadarrama, M. Rohrbach, S. Venu- localization and deep learning for 3d facial expression recognition,” in
gopalan, K. Saenko, and T. Darrell, “Long-term recurrent convolutional Automatic Face & Gesture Recognition (FG 2018), 2018 13th IEEE
networks for visual recognition and description,” in Proceedings of the International Conference on. IEEE, 2018, pp. 466–472.
IEEE conference on computer vision and pattern recognition, 2015, pp. [226] X. Wei, H. Li, J. Sun, and L. Chen, “Unsupervised domain adaptation
2625–2634. with regularized optimal transport for multimodal 2d+ 3d facial expres-
[205] D. K. Jain, Z. Zhang, and K. Huang, “Multi angle optimal pattern-based sion recognition,” in Automatic Face & Gesture Recognition (FG 2018),
deep learning for automatic facial expression recognition,” Pattern 2018 13th IEEE International Conference on. IEEE, 2018, pp. 31–37.
Recognition Letters, 2017. [227] J. M. Susskind, G. E. Hinton, J. R. Movellan, and A. K. Anderson,
[206] M. Baccouche, F. Mamalet, C. Wolf, C. Garcia, and A. Baskurt, “Spatio- “Generating facial expressions with deep belief nets,” in Affective
temporal convolutional sparse auto-encoder for sequence classification.” Computing. InTech, 2008.
in BMVC, 2012, pp. 1–12. [228] M. Sabzevari, S. Toosizadeh, S. R. Quchani, and V. Abrishami, “A
[207] S. Kankanamge, C. Fookes, and S. Sridharan, “Facial analysis in the fast and accurate facial expression synthesis system for color face
wild with lstm networks,” in Image Processing (ICIP), 2017 IEEE images using face graph and deep belief network,” in Electronics and
International Conference on. IEEE, 2017, pp. 1052–1056. Information Engineering (ICEIE), 2010 International Conference On,
[208] J. D. Lafferty, A. Mccallum, and F. C. N. Pereira, “Conditional random vol. 2. IEEE, 2010, pp. V2–354.
fields: Probabilistic models for segmenting and labeling sequence data,” [229] R. Yeh, Z. Liu, D. B. Goldman, and A. Agarwala, “Semantic
Proceedings of Icml, vol. 3, no. 2, pp. 282–289, 2001. facial expression editing using autoencoded flow,” arXiv preprint
[209] K. Simonyan and A. Zisserman, “Two-stream convolutional networks arXiv:1611.09961, 2016.
for action recognition in videos,” in Advances in neural information [230] Y. Zhou and B. E. Shi, “Photorealistic facial expression synthesis by
processing systems, 2014, pp. 568–576. the conditional difference adversarial autoencoder,” in Affective Com-
[210] J. Susskind, V. Mnih, G. Hinton et al., “On deep generative models puting and Intelligent Interaction (ACII), 2017 Seventh International
with applications to recognition,” in Computer Vision and Pattern Conference on. IEEE, 2017, pp. 370–376.
Recognition (CVPR), 2011 IEEE Conference on. IEEE, 2011, pp. [231] L. Song, Z. Lu, R. He, Z. Sun, and T. Tan, “Geometry guided adversarial
2857–2864. facial expression synthesis,” arXiv preprint arXiv:1712.03474, 2017.
[211] V. Mnih, J. M. Susskind, G. E. Hinton et al., “Modeling natural images [232] H. Ding, K. Sricharan, and R. Chellappa, “Exprgan: Facial expres-
using gated mrfs,” IEEE transactions on pattern analysis and machine sion editing with controllable expression intensity,” in AAAI, 2018, p.
intelligence, vol. 35, no. 9, pp. 2206–2222, 2013. 67816788.
[233] F. Qiao, N. Yao, Z. Jiao, Z. Li, H. Chen, and H. Wang, “Geometry-
[212] V. Mnih, G. E. Hinton et al., “Generating more realistic images using
contrastive generative adversarial network for facial expression synthe-
gated mrf’s,” in Advances in Neural Information Processing Systems,
sis,” arXiv preprint arXiv:1802.01822, 2018.
2010, pp. 2002–2010.
[234] I. Masi, A. T. Tran, T. Hassner, J. T. Leksut, and G. Medioni, “Do we
[213] Y. Cheng, B. Jiang, and K. Jia, “A deep structure for facial expression
really need to collect millions of faces for effective face recognition?”
recognition under partial occlusion,” in Intelligent Information Hiding
in European Conference on Computer Vision. Springer, 2016, pp.
and Multimedia Signal Processing (IIH-MSP), 2014 Tenth International
579–596.
Conference on. IEEE, 2014, pp. 211–214.
[235] N. Mousavi, H. Siqueira, P. Barros, B. Fernandes, and S. Wermter,
[214] M. Xu, W. Cheng, Q. Zhao, L. Ma, and F. Xu, “Facial expression recog-
“Understanding how deep neural networks learn face expressions,” in
nition based on transfer learning from deep convolutional networks,” in
Neural Networks (IJCNN), 2016 International Joint Conference on.
Natural Computation (ICNC), 2015 11th International Conference on.
IEEE, 2016, pp. 227–234.
IEEE, 2015, pp. 702–708.
[236] R. Breuer and R. Kimmel, “A deep learning perspective on the origin
[215] Y. Liu, J. Zeng, S. Shan, and Z. Zheng, “Multi-channel pose-aware con- of facial expressions,” arXiv preprint arXiv:1705.01842, 2017.
volution neural networks for multi-view facial expression recognition,”
[237] M. D. Zeiler and R. Fergus, “Visualizing and understanding convolu-
in Automatic Face & Gesture Recognition (FG 2018), 2018 13th IEEE
tional networks,” in European conference on computer vision. Springer,
International Conference on. IEEE, 2018, pp. 458–465.
2014, pp. 818–833.
[216] S. He, S. Wang, W. Lan, H. Fu, and Q. Ji, “Facial expression recognition [238] I. Lüsi, J. C. J. Junior, J. Gorbova, X. Baró, S. Escalera, H. Demirel,
using deep boltzmann machine from thermal infrared images,” in J. Allik, C. Ozcinar, and G. Anbarjafari, “Joint challenge on dominant
Affective Computing and Intelligent Interaction (ACII), 2013 Humaine and complementary emotion recognition using micro emotion features
Association Conference on. IEEE, 2013, pp. 239–244. and head-pose estimation: Databases,” in Automatic Face & Gesture
[217] Z. Wu, T. Chen, Y. Chen, Z. Zhang, and G. Liu, “Nirexpnet: Three- Recognition (FG 2017), 2017 12th IEEE International Conference on.
stream 3d convolutional neural network for near infrared facial expres- IEEE, 2017, pp. 809–813.
sion recognition,” Applied Sciences, vol. 7, no. 11, p. 1184, 2017. [239] J. Wan, S. Escalera, X. Baro, H. J. Escalante, I. Guyon, M. Madadi,
[218] E. P. Ijjina and C. K. Mohan, “Facial expression recognition using kinect J. Allik, J. Gorbova, and G. Anbarjafari, “Results and analysis of
depth sensor and convolutional neural networks,” in Machine Learning chalearn lap multi-modal isolated and continuous gesture recognition,
and Applications (ICMLA), 2014 13th International Conference on. and real versus fake expressed emotions challenges,” in ChaLearn LaP,
IEEE, 2014, pp. 392–396. Action, Gesture, and Emotion Recognition Workshop and Competitions:
[219] M. Z. Uddin, M. M. Hassan, A. Almogren, M. Zuair, G. Fortino, and Large Scale Multimodal Gesture Recognition and Real versus Fake
J. Torresen, “A facial expression recognition system using robust face expressed emotions, ICCV, vol. 4, no. 6, 2017.
features from depth videos and deep learning,” Computers & Electrical [240] Y.-G. Kim and X.-P. Huynh, “Discrimination between genuine versus
Engineering, vol. 63, pp. 114–125, 2017. fake emotion using long-short term memory with parametric bias and
[220] M. Z. Uddin, W. Khaksar, and J. Torresen, “Facial expression recog- facial landmarks,” in Computer Vision Workshop (ICCVW), 2017 IEEE
nition using salient features and convolutional neural network,” IEEE International Conference on. IEEE, 2017, pp. 3065–3072.
Access, vol. 5, pp. 26 146–26 161, 2017. [241] L. Li, T. Baltrusaitis, B. Sun, and L.-P. Morency, “Combining sequential
[221] W. Li, D. Huang, H. Li, and Y. Wang, “Automatic 4d facial expression geometry and texture features for distinguishing genuine and deceptive
recognition using dynamic geometrical image network,” in Automatic emotions,” in Proceedings of the IEEE Conference on Computer Vision
Face & Gesture Recognition (FG 2018), 2018 13th IEEE International and Pattern Recognition, 2017, pp. 3147–3153.
Conference on. IEEE, 2018, pp. 24–30. [242] J. Guo, S. Zhou, J. Wu, J. Wan, X. Zhu, Z. Lei, and S. Z. Li, “Multi-
[222] F.-J. Chang, A. T. Tran, T. Hassner, I. Masi, R. Nevatia, and G. Medioni, modality network with visual and geometrical information for micro
“Expnet: Landmark-free, deep, 3d facial expressions,” in Automatic emotion recognition,” in Automatic Face & Gesture Recognition (FG
Face & Gesture Recognition (FG 2018), 2018 13th IEEE International 2017), 2017 12th IEEE International Conference on. IEEE, 2017, pp.
Conference on. IEEE, 2018, pp. 122–129. 814–819.
[223] O. K. Oyedotun, G. Demisse, A. E. R. Shabayek, D. Aouada, and [243] I. Song, H.-J. Kim, and P. B. Jeon, “Deep learning for real-time
B. Ottersten, “Facial expression recognition via joint deep learning robust facial expression recognition on a smartphone,” in Consumer