Real-Time Convolutional Neural Networks For Emotion and Gender Classification
I. INTRODUCTION
The success of service robotics decisively depends on a smooth robot-to-user interaction. Thus, a robot should be able to extract information just from the face of its user, e.g. identify the emotional state or deduce gender. Interpreting correctly any of these elements using machine learning (ML) techniques has proven to be complicated due to the high variability of the samples within each task [4]. This leads to models with millions of parameters trained with thousands of samples [3]. Furthermore, the human accuracy for classifying an image of a face into one of 7 different emotions is 65% ± 5% [4]. One can observe the difficulty of this task by trying to manually classify the FER-2013 dataset images in Figure 1 into the following classes {“angry”, “disgust”, “fear”, “happy”, “sad”, “surprise”, “neutral”}.

Fig. 2: Samples of the IMDB dataset [9].
In spite of these difficulties, robot platforms oriented to attend and solve household tasks require facial expression systems that are robust and computationally efficient. Moreover, the state-of-the-art methods in image-related tasks such as image classification [1] and object detection are all based on Convolutional Neural Networks (CNNs). These tasks require CNN architectures with millions of parameters; therefore, their deployment in robot platforms and real-time systems becomes unfeasible. In this paper we propose and implement a general CNN building framework for designing real-time CNNs. The implementations have been validated in a real-time facial expression system that provides face detection and gender classification, and that achieves human-level performance when classifying emotions. This system has been deployed in a Care-O-bot 3 robot, and has been extended for general robot platforms and the RoboCup@Home competition challenges.
Furthermore, CNNs are used as black-boxes and often their
learned features remain hidden, making it complicated to
establish a balance between their classification accuracy and
unnecessary parameters. Therefore, we implemented a real-
time visualization of the guided back-propagation method
proposed by Springenberg [11] in order to validate the features
learned by the CNN.
II. RELATED WORK
Commonly used CNNs for feature extraction include a
set of fully connected layers at the end. Fully connected
layers tend to contain most of the parameters in a CNN.
Specifically, VGG16 [10] contains approximately 90% of all
its parameters in its last fully connected layers. Recent
architectures such as Inception V3 [12] reduced the number
of parameters in their last layers by including a Global
Average Pooling operation. Global Average Pooling reduces
each feature map into a scalar value by taking the average over
all elements in the feature map. The average operation forces
the network to extract global features from the input image.
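As a minimal illustration of this operation (a NumPy sketch with an illustrative height × width × channels layout, independent of any particular framework):

import numpy as np

# Illustrative last-layer output: height x width x channels, e.g. 6 x 6 x 7,
# with one feature map per class as in the models of Section III.
feature_maps = np.random.rand(6, 6, 7)

# Global Average Pooling: average over the spatial dimensions only,
# leaving a single scalar per feature map (shape (7,)).
pooled = feature_maps.mean(axis=(0, 1))

# A softmax over these scalars then yields the class probabilities.
probabilities = np.exp(pooled) / np.sum(np.exp(pooled))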
Modern CNN architectures such as Xception [1] leverage
the combination of two of the most successful experimental
assumptions in CNNs: the use of residual modules [6] and
depth-wise separable convolutions [2]. Depth-wise separable
convolutions further reduce the number of parameters by
separating the processes of feature extraction and combination
within a convolutional layer.
Furthermore, the state-of-the-art model for the FER-2013 dataset is based on a CNN trained with a squared hinge loss [13]. This model achieved an accuracy of 71% [4] using approximately 5 million parameters. In this architecture 98% of all parameters are located in the last fully connected layers. The second-best methods presented in [4] achieved an accuracy of 66% using an ensemble of CNNs.
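For reference, the squared hinge loss mentioned above penalizes the squared margin violation; a minimal binary sketch (the exact multi-class formulation used in [13] may differ):

import numpy as np

def squared_hinge(y_true, y_pred):
    # y_true contains labels in {-1, +1}; y_pred contains raw scores.
    return np.mean(np.maximum(0.0, 1.0 - y_true * y_pred) ** 2)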
III. MODEL

We propose two models which we evaluated in accordance with their test accuracy and number of parameters. Both models were designed with the idea of creating the best accuracy over number of parameters ratio. Reducing the number of parameters helps us overcome two important problems. First, the use of small CNNs alleviates slow performance in hardware-constrained systems such as robot platforms. And second, the reduction of parameters provides a better generalization under an Occam's razor framework. Our first model relies on the idea of completely eliminating the fully connected layers. The second architecture combines the deletion of the fully connected layers with the inclusion of depth-wise separable convolutions and residual modules. Both architectures were trained with the ADAM optimizer [8].

Following the previous architecture schemas, our initial architecture used Global Average Pooling to completely remove any fully connected layers. This was achieved by having in the last convolutional layer the same number of feature maps as number of classes, and applying a softmax activation function to each reduced feature map. Our initial proposed architecture is a standard fully-convolutional neural network composed of 9 convolution layers, ReLUs [5], Batch Normalization [7] and Global Average Pooling. This model contains approximately 600,000 parameters. It was trained on the IMDB gender dataset, which contains 460,723 RGB images where each image belongs to the class “woman” or “man”, and it achieved an accuracy of 96% in this dataset. We also validated this model on the FER-2013 dataset. This dataset contains 35,887 grayscale images where each image belongs to one of the following classes {“angry”, “disgust”, “fear”, “happy”, “sad”, “surprise”, “neutral”}. Our initial model achieved an accuracy of 66% in this dataset. We will refer to this model as “sequential fully-CNN”.
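A rough sketch of such a sequential fully-CNN, assuming the Keras API (the filter counts below are illustrative and not necessarily the exact published configuration):

from tensorflow.keras import layers, Model

def sequential_fully_cnn(input_shape=(48, 48, 1), num_classes=7):
    inputs = layers.Input(shape=input_shape)
    x = inputs
    # Eight convolution -> batch normalization -> ReLU blocks.
    for filters in (8, 8, 16, 16, 32, 32, 64, 64):
        x = layers.Conv2D(filters, (3, 3), padding='same')(x)
        x = layers.BatchNormalization()(x)
        x = layers.Activation('relu')(x)
    # The ninth convolution outputs one feature map per class.
    x = layers.Conv2D(num_classes, (3, 3), padding='same')(x)
    # Global Average Pooling plus softmax replace any fully connected layers.
    x = layers.GlobalAveragePooling2D()(x)
    outputs = layers.Activation('softmax')(x)
    return Model(inputs, outputs)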
Our second model is inspired by the Xception [1] architecture. This architecture combines the use of residual modules [6] and depth-wise separable convolutions [2]. Residual modules modify the desired mapping between two subsequent layers, so that the learned features become the difference of the original feature map and the desired features. Consequently, the desired features H(x) are modified in order to solve an easier learning problem F(x) such that:

H(x) = F(x) + x    (1)
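Equation 1 translates directly into a residual block; a minimal sketch assuming the Keras API (the 1 × 1 convolution on the shortcut is only one common way to match channel dimensions):

from tensorflow.keras import layers

def residual_module(x, filters):
    # Shortcut branch: a 1 x 1 convolution matches the channel dimension.
    shortcut = layers.Conv2D(filters, (1, 1), padding='same')(x)
    # Residual branch learns F(x) only.
    fx = layers.Conv2D(filters, (3, 3), padding='same')(x)
    fx = layers.BatchNormalization()(fx)
    fx = layers.Activation('relu')(fx)
    fx = layers.Conv2D(filters, (3, 3), padding='same')(fx)
    fx = layers.BatchNormalization()(fx)
    # Output approximates H(x) = F(x) + x, as in Equation 1.
    return layers.Activation('relu')(layers.Add()([fx, shortcut]))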
Since our initial proposed architecture deleted the last fully connected layer, we reduced further the amount of parameters by eliminating them now from the convolutional layers. This was done through the use of depth-wise separable convolutions. Depth-wise separable convolutions are composed of two different layers: depth-wise convolutions and point-wise convolutions. The main purpose of these layers is to separate the spatial cross-correlations from the channel cross-correlations [1]. They do this by first applying a D × D filter on each of the M input channels and then applying N 1 × 1 × M convolution filters to combine the M input channels into N output channels. Applying 1 × 1 × M convolutions combines each value in the feature map without considering their spatial relation within the channel. Depth-wise separable convolutions reduce the computation with respect to standard convolutions by a factor of 1/N + 1/D² [2]. A visualization of the difference between a standard convolution layer and a depth-wise separable convolution can be observed in Figure 4.

Fig. 4: [2] Difference between (a) standard convolutions and (b) depth-wise separable convolutions.
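For instance, with D = 3, M = 64 input channels and N = 128 output channels, a standard convolution uses D · D · M · N = 73,728 weights while the depth-wise separable version uses D · D · M + M · N = 8,768, roughly the 1/N + 1/D² ≈ 0.12 fraction stated above. A minimal sketch assuming the Keras API:

from tensorflow.keras import layers, Model, Input

inputs = Input(shape=(48, 48, 64))  # illustrative feature map with M = 64 channels

# Standard convolution: one D x D x M filter per output channel (N = 128).
standard = layers.Conv2D(128, (3, 3), padding='same', use_bias=False)(inputs)

# Depth-wise separable convolution: a D x D spatial filter per input channel,
# followed by N point-wise (1 x 1 x M) filters that recombine the channels.
separable = layers.SeparableConv2D(128, (3, 3), padding='same', use_bias=False)(inputs)

print(Model(inputs, standard).count_params())   # 73728
print(Model(inputs, separable).count_params())  # 8768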
Our final architecture is a fully-convolutional neural network that contains 4 residual depth-wise separable convolutions where each convolution is followed by a batch normalization operation and a ReLU activation function. The last layer applies a global average pooling and a soft-max activation function to produce a prediction. This architecture has approximately 60,000 parameters, which corresponds to a reduction of 10× when compared to our initial naive implementation, and 80× when compared to the original CNN. Figure 3 displays our complete final architecture, which we refer to as mini-Xception.

Fig. 3: Our proposed model for real-time classification.

This architecture obtains an accuracy of 95% in the gender classification task, which corresponds to a reduction of one percent with respect to our initial implementation. Furthermore, we tested this architecture on the FER-2013 dataset and we obtained the same accuracy of 66% for the emotion classification task. Our final architecture weights can be stored in an 855-kilobyte file. By reducing our architectures' computational cost we are now able to join both models and use them consecutively on the same image without incurring any serious time penalty. Our complete pipeline, including the OpenCV face detection module, the gender classification and the emotion classification, takes 0.22 ± 0.0003 ms on an i5-4210M CPU. This corresponds to a speedup of 1.5× when compared to the original architecture of Tang.
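A compact sketch of the mini-Xception design described above, assuming the Keras API (input size, filter counts and the exact block layout are illustrative rather than the precise published configuration):

from tensorflow.keras import layers, Model

def mini_xception(input_shape=(64, 64, 1), num_classes=7):
    inputs = layers.Input(shape=input_shape)
    x = layers.Conv2D(8, (3, 3), padding='same')(inputs)
    x = layers.BatchNormalization()(x)
    x = layers.Activation('relu')(x)

    # Four residual depth-wise separable convolution blocks.
    for filters in (16, 32, 64, 128):
        shortcut = layers.Conv2D(filters, (1, 1), strides=(2, 2), padding='same')(x)
        x = layers.SeparableConv2D(filters, (3, 3), padding='same')(x)
        x = layers.BatchNormalization()(x)
        x = layers.Activation('relu')(x)
        x = layers.SeparableConv2D(filters, (3, 3), padding='same')(x)
        x = layers.BatchNormalization()(x)
        x = layers.MaxPooling2D((3, 3), strides=(2, 2), padding='same')(x)
        x = layers.Add()([x, shortcut])

    # Last convolution, Global Average Pooling and softmax prediction.
    x = layers.Conv2D(num_classes, (3, 3), padding='same')(x)
    x = layers.GlobalAveragePooling2D()(x)
    outputs = layers.Activation('softmax')(x)
    return Model(inputs, outputs)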
We also added to our implementation a real-time guided back-propagation visualization to observe which pixels in the image activate an element of a higher-level feature map. Given a CNN with only ReLUs as activation functions for the intermediate layers, guided back-propagation takes the derivative of every element (x, y) of the input image I with respect to an element (i, j) of the feature map f^L in layer L. The reconstructed image R filters out all the negative gradients; consequently, the remaining gradients are chosen such that they only increase the value of the chosen element of the feature map. Following [11], a fully ReLU CNN reconstructed image in layer l is given by:

R^l_{i,j} = (R^{l+1}_{i,j} > 0) * R^{l+1}_{i,j}    (2)
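A minimal NumPy sketch of the masking in Equation 2 for a single layer (the function name and arguments are ours; a complete implementation applies this rule at every ReLU during the backward pass):

import numpy as np

def guided_backprop_step(grad_upper, activation_lower):
    # Equation 2: keep only the positive gradients arriving from layer l+1 ...
    guided = (grad_upper > 0) * grad_upper
    # ... and, following [11], also zero the positions where the forward
    # ReLU activation of layer l was inactive.
    return guided * (activation_lower > 0)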
IV. RESULTS

Results of the real-time emotion classification task on unseen faces can be observed in Figure 5. Our complete real-time pipeline, including face detection, emotion classification and gender classification, has been fully integrated in our Care-O-bot 3 robot.

Fig. 5: Results of the real-time emotion classification provided in our public repository.

An example of our complete pipeline can be seen in Figure 6, in which we provide emotion and gender classification. In Figure 7 we provide the confusion matrix results of our emotion classification mini-Xception model. We can observe several common misclassifications such as predicting “sad” instead of “fear” and predicting “angry” instead of “disgust”.

Fig. 7: Normalized confusion matrix of our mini-Xception network.

A comparison of the learned features between several emotions and both of our proposed models can be observed in Figure 8. The white areas in Figure 8b correspond to the pixel values that activate a selected neuron in our last convolution layer. The selected neuron was always chosen in accordance with the highest activation. We can observe that the CNN learned to get activated by considering features such as the frown, the teeth, the eyebrows and the widening of one's eyes, and that each feature remains constant within the same class. These results reassure us that the CNN learned to interpret understandable human-like features that provide generalizable elements. These interpretable results have helped us understand several common misclassifications, such as persons with glasses being classified as “angry”. This happens since the label “angry” is highly activated when it believes a person is frowning, and frowning features get confused with darker glass frames. Moreover, we can also observe that the features learned in