Classifying Emotions and Engagement in Online Learning Based On A Single Facial Expression Recognition Neural Network

IEEE TRANSACTIONS ON AFFECTIVE COMPUTING, VOL. 13, NO. 4, OCTOBER-DECEMBER 2022
Abstract—In this article, the behaviour of students in an e-learning environment is analyzed. A novel pipeline based on facial video processing is proposed. At first, face detection, tracking and clustering techniques are applied to extract the sequences of faces of each student. Next, a single efficient neural network is used to extract emotional features in each frame. This network is pre-trained on face identification and fine-tuned for facial expression recognition on static images from AffectNet using a specially developed robust optimization technique. It is shown that the resulting facial features can be used for fast simultaneous prediction of students' engagement levels (from disengaged to highly engaged), individual emotions (happy, sad, etc.) and group-level affect (positive, neutral or negative). This model can be used for real-time video processing even on the mobile device of each student, without the need to send their facial video to a remote server or the teacher's PC. In addition, the possibility of preparing a summary of a lesson is demonstrated by saving short clips of the different emotions and engagement of all students. The experimental study on the datasets from the EmotiW (Emotion Recognition in the Wild) challenges shows that the proposed network significantly outperforms existing single models.
Index Terms—Online learning, e-learning, video-based facial expression recognition, engagement prediction, group-level emotion recognition,
mobile devices
Thus, the aim of this article is the development of a fast and accurate technique for classifying emotions and engagement that can be implemented in online learning software on laptops without powerful GPUs (graphics processing units) and/or on mobile devices of students and teachers. The main contribution consists of the following:

- Lightweight FER (facial expression recognition) models based on the EfficientNet and MobileNet architectures for emotional feature extraction from facial images. It is proposed to borrow the idea of robust data mining [17] to modify the softmax loss function for the training of this model to predict emotions on static images.
- An efficient neural network model for simultaneous engagement detection and recognition of individual- and group-level emotions in facial videos. The lightweight CNN from the previous item extracts unified emotional features of each frame on the learner's device. The features of several frames are aggregated into a video descriptor using statistical (STAT) functions (mean, standard deviation, etc.) [18]. The resulting models let us reach state-of-the-art results in several emotion recognition and engagement detection tasks.
- A novel technological framework for real-time video-based classification of emotions and engagement in online learning using only the facial modality. The engagement and individual emotions of each student are predicted on the device of that student. The obtained emotional feature vectors may be sent to the teacher's device to classify the emotions of the whole group of students. If the faces of some students in the online video conferencing tool (Zoom, MS Teams, Google Meet, etc.) are turned on, it is proposed to additionally cluster these faces and summarize their emotions and engagement during the whole lesson into short video clips [19]. This helps the lecturers to understand their own weaknesses and to address them [5]. The sources of training and testing code using the Tensorflow 2 and Pytorch frameworks, together with a demo Android application and several models, are made publicly available.¹

1. https://github.com/HSE-asavchenko/face-emotion-recognition

The remaining part of this article is structured as follows. Section 2 contains a brief survey of related articles. The details of the proposed framework are given in Section 3. Section 4 provides experimental results of our models on the EngageWild [8], AFEW (Acted Facial Expression In The Wild) [20] and VGAF (Video-level Group AFfect) [10] datasets from the EmotiW challenges. Finally, concluding comments and future work are discussed in Section 5.

2 LITERATURE SURVEY

2.1 Video-Based Emotion Recognition
Recognition of students' emotions may have a great impact on the quality of many e-learning systems. The authors of the review [9] claimed that multi-modal emotion recognition based on a fusion of facial expressions, body gestures and user's messages provides better efficiency than single-modal approaches. Similar features have been used in [21] for offline learning and videos of classroom environments. It is known that facial emotions, which are a form of non-verbal communication, can be used to estimate the learning affect of a student and enhance current e-learning platforms [22]. Hence, in this article, it was decided to deal only with an analysis of the facial modality.

The FER models are typically pre-trained on single images from a rather large dataset, such as AffectNet [23]. Excellent results have recently been obtained by using supervised learning (SL) and self-supervised learning (SSL) [24] of EfficientNets [25], visual transformers and attentional selective fusion [26], relation-aware transformers (TransFER) [27] and lightweight models with careful pre-training on face recognition datasets [28]. The highly accurate EmotionGCN exploits emotional dependencies between facial expressions and valence-arousal by training graph convolutional networks in a multi-task learning framework [29].

The progress in video-based FER is mainly measured on various versions of the AFEW dataset from the EmotiW 2013-2019 challenges [20]. One of the best single models is obtained via noisy student training using body language [30], while the older method with STAT aggregation of features extracted by three CNNs (VGG13, VGG16 and ResNet) [18] is still one of the best ensemble-based techniques. The best validation accuracy is achieved by attention cross-modal feature fusion mechanisms that highlight important emotion features by exploring feature concatenation and factorized bi-linear pooling (FBP) [31]. However, the latter model has slightly lower accuracy on the testing set when compared to bi-modality fusion [32] of audio and video features extracted by four different CNNs.

Predicted emotions can be used not only for understanding the behaviour of each learner but also for visual summarization of classroom videos [19] or classification of group-level emotions in videos. The latter task has been actively studied since the appearance of the VGAF dataset [10]. Rather high accuracy is achieved by activity recognition and K-injection networks [33], [34]. The winner of the EmotiW 2020 Audio-video group emotion recognition sub-challenge developed an ensemble of hybrid networks for audio, facial emotion, video, environmental object statistics and fighting detector streams [14].

2.2 Automatic Engagement Detection in E-Learning Systems
Parental involvement, interaction, and students' engagement are the key factors that may influence online learning effects [1]. Though most e-learning techniques are focused on improving the learners' interaction, algorithms of behavioural analysis and engagement detection have recently been studied in educational data mining [35]. Researchers do not have a consistent understanding of the definition of learning engagement and regard it as a multidimensional concept [36]. In this article, a special type of the student's persistent effort to accomplish the learning task [8] is considered, namely, emotional engagement. It focuses on the extent of positive and negative reactions, the feeling of interest towards a particular theme and enjoying learning about it [36].
A survey [6] considered the dependencies of existing methods on learners' participation and classified them into automatic, semi-automatic (engagement tracing) and manual categories. The most popular one is still the latter category: it includes self-reports, observational checklists and rating scales, and typically requires a great deal of time and effort from observers [36]. As a result, the recent research focus has shifted to automatic engagement detection that infers the social cues of engagement/disengagement from facial expressions, body movements and gaze patterns [8]. Particular attention is paid to FER-based methods due to the simplicity of their usage [6]. Indeed, the FER and engagement prediction tasks are strongly correlated [5]. For example, a lecturer uses students' facial expressions as a valuable source of feedback. Moreover, the emotions of the lecturers keep the students motivated and interested during the lectures [37].

One of the first techniques that applied machine learning and FER to predict students' engagement was proposed in [38]. Their experiments with support vector machines (SVM) with Gabor features and regression for expression outputs from the Computer Expression Recognition Toolbox proved that automated engagement detectors perform with accuracy comparable to humans. Traditional computer vision for FER was used in [36], where adaptive weighted histograms of eight-bit gray codes calculated by Local Gray Code Patterns (LGCP) were classified by an SVM. The authors of the latter article introduced two datasets for learning engagement detection based on facial and mouse movement data, but they are not publicly available. Nowadays, the progress of deep learning has caused the widespread use of CNNs. For example, the Mean Engagement Score was proposed in [4] by analyzing the results of facial landmark detection, emotion recognition and the weights from a special survey. Non-contact engagement prediction in unconstrained environments is applied not only in e-learning but also in other interactive tasks, such as gaming [39]. The framework of learning engagement assessment [2] timely acquires the emotional changes of the learners using a special CNN trained with domain adaptation, which is suitable for the MOOC scenario.

The rapid growth of studies in engagement prediction began with the introduction of the EngageWild dataset [8] in the EmotiW 2018-2020 challenges. This dataset contains facial videos with corresponding engagement labels of the users while they are watching educational videos such as the ones in MOOCs. The gaze, head pose and action unit intensity features from the OpenFace library [40] were concatenated into the Gaze-AU-Pose (GAP) descriptor [41]. Its classification using GRU (gated recurrent unit) networks leads to an MSE (mean square error) on the validation set that is 0.03 lower when compared to the baseline solution for the OpenFace features [8]. The usage of the dilated Temporal Convolutional Network (TCN) classifier [42] for similar OpenFace features led to a slightly lower MSE of 0.0655. The best results on the testing set in the 2018 challenge were obtained with an additional LBP-TOP facial descriptor and C3D action features [43]. The authors of the latter approach improved it for the EmotiW 2019 challenge by using classical bootstrap aggregation and designing a rank loss as a regularization which enforces a distance margin between the features of distant category pairs and adjacent category pairs [44]. An anti-overfitting strategy with training on overlapped segments of the input videos was proposed in [45]. The solution of the winners [46] used facial behaviour features extracted by OpenFace and a ResNet-50 model pre-trained on face identification on the large VGGFace2 dataset [47]. These results were improved in the 2020 challenge by using an attention-based GRU and multi-rate video processing [13].

3 MATERIALS AND METHODS

3.1 Proposed Approach
Most of the techniques mentioned in the previous section used complex ensemble models and various sets of features to boost their performance. Unfortunately, every single model for one feature set reported in these articles cannot compete with the final solutions. Thus, in this article, a novel technological framework (Fig. 1) is proposed to analyse the behaviour of students at online lectures.

Here each student may launch an application on his or her device to provide the results of behaviour analysis without the need to share the facial video. As a result, a high level of data privacy may be achieved because the video of a face is not required to be sent to a remote server or the teacher's PC. In this case, the largest facial region is located in each t-th video frame in the unit "Face detection 1" by using any fast technique, such as MTCNN (multi-task CNN). Next, the emotional features x(t) of an extracted face are obtained in the unit "Emotional feature extraction 2" [48] using the proposed lightweight CNN trained to classify emotions on static images. The details about the training of this neural network are provided in the next subsections. Finally, the features of several sequential frames with a duration of 5-10 seconds are aggregated using STAT functions to classify the engagement level and individual emotions in the units "Engagement prediction 3" and "Emotion recognition 4", respectively. The predicted engagement level (from disengaged to highly engaged) and individual emotions (happy, angry, sad, neutral, etc.) for each time frame at the output of units 3 and 4, together with the emotional features at the output of unit 2, are sent to the teacher's PC. As the models in the first four units are very efficient, the inference can be launched even in a low-resource environment, such as a mobile device of the learners [16], [28].
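A minimal sketch of the student-side units 1-4 is given below. The use of the MTCNN detector from the facenet-pytorch package, a timm-style CNN whose penultimate layer serves as the emotional feature extractor, and the scikit-learn-style classifiers engagement_clf and emotion_clf are illustrative assumptions rather than the exact implementation released with the article.

```python
# Sketch of units 1-4 on the learner's device (assumptions: facenet-pytorch MTCNN as the face
# detector; "emotion_model" is a CNN returning the emotional feature vector x(t) for a face crop;
# engagement_clf / emotion_clf are classifiers trained beforehand on STAT descriptors).
import numpy as np
import torch
from facenet_pytorch import MTCNN

detector = MTCNN(select_largest=True, post_process=False)  # unit "Face detection 1"

def frame_features(frame_rgb, emotion_model, device="cpu"):
    """Return the emotional feature vector x(t) of the largest face in a frame, or None."""
    face = detector(frame_rgb)                       # cropped face tensor or None
    if face is None:
        return None
    with torch.no_grad():                            # unit "Emotional feature extraction 2"
        x = emotion_model(face.unsqueeze(0).to(device) / 255.0)
    return x.squeeze(0).cpu().numpy()

def stat_descriptor(frame_feature_list):
    """Aggregate frame-wise features of a 5-10 s window with STAT functions [18]."""
    f = np.stack(frame_feature_list)
    return np.concatenate([f.mean(axis=0), f.std(axis=0), f.min(axis=0), f.max(axis=0)])

def analyse_window(frames, emotion_model, engagement_clf, emotion_clf):
    """Units 3 and 4: predict engagement level and individual emotion for one time window."""
    feats = [x for x in (frame_features(f, emotion_model) for f in frames) if x is not None]
    if not feats:
        return None
    d = stat_descriptor(feats)[np.newaxis, :]
    return engagement_clf.predict(d)[0], emotion_clf.predict(d)[0]
```

Only the low-dimensional feature vectors and the predicted labels leave such a routine, which is exactly the privacy argument made above: the raw facial video never has to be transmitted.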
The most difficult processing is implemented on the teacher's device in units 5-13. These steps may run in offline mode after obtaining the recording of the whole lecture from the online video conferencing tool. This video is fed into "Face detection 5", which works similarly to the first unit on the student's device but can return several (K ≥ 1) faces of learners who agreed to transfer their videos. It is still possible that several extracted facial images have too low a resolution for accurate emotion recognition [49]. In this article, the simplest solution is implemented, so that faces with a size lower than a predefined threshold (64×64 pixels) are ignored. Next, the emotional feature vector x_k(t) of every k-th face is obtained in "Emotional feature extraction 6", which may use either the same CNN as in unit 2 or a more complex architecture if the processing is implemented on a rather powerful PC.
The identity features x_k^(I)(t) are extracted in unit 7 by using an appropriate face recognition CNN [28], [47], [50]. The latter features are used to track and group the facial regions of the same students in the unit "Face tracking & clustering 8". The emotional features x_k(t) of the same track at the output of unit 8 are combined to solve the down-stream tasks in units 9, 10 and 11. The unit "Group affect prediction 9" first aggregates the emotional features of the faces from the same frame into single frame features of the whole group of students. Next, all frame features during 5-10 seconds of a video are combined into a single descriptor which can be fed into an appropriate classifier. The units "Engagement prediction 10" and "Emotion recognition 11" work identically to units 3 and 4, but repeat the processing for every k-th face and each group of frames.
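The teacher-side grouping and group-level aggregation (units 7-9) can be sketched as follows. Agglomerative clustering of the identity embeddings is used here as a stand-in for "Face tracking & clustering 8", and the mean-based frame aggregation in unit 9 is an assumption of this example, since only the general two-step scheme is described above.

```python
# Sketch of units 8-9 on the teacher's device (assumption: identity embeddings come from a
# face-recognition CNN; clustering replaces explicit tracking; thresholds are illustrative).
import numpy as np
from sklearn.cluster import AgglomerativeClustering  # scikit-learn >= 1.2 (uses "metric=")

def cluster_students(identity_embeddings, distance_threshold=0.7):
    """Unit 8: group facial regions of the same student by their identity features x_k^(I)(t)."""
    clusterer = AgglomerativeClustering(n_clusters=None, metric="cosine", linkage="average",
                                        distance_threshold=distance_threshold)
    return clusterer.fit_predict(np.stack(identity_embeddings))  # one cluster label per face

def group_frame_features(face_features_in_frame):
    """Unit 9, step 1: aggregate emotional features of all faces detected in one frame."""
    return np.mean(np.stack(face_features_in_frame), axis=0)

def group_video_descriptor(frame_group_features):
    """Unit 9, step 2: combine frame-level group features of a 5-10 s window."""
    f = np.stack(frame_group_features)
    return np.concatenate([f.mean(axis=0), f.std(axis=0)])  # input of the group affect classifier
```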
Finally, the emotions and engagement of individual students can be summarized into short video clips and visualized in the unit "Summarization 12". For example, it is possible to take the time points where a strong emotion is predicted. Typical results for several real conferences or lessons are presented in Fig. 2. Another opportunity is the grouping of different emotions based on Russell's 2D space of affect [51], which can give the teacher an initial impression about how concentrated, affected and inspired the students are during the e-lesson. In addition, a short GIF with the different emotions and engagement during a lesson can be sent to a learner or his or her relatives to increase parental involvement [1]. Moreover, these clips, together with the charts of the predicted students' emotions, group affect and engagement over time, are stored in the unit "Lecture analysis 13", which can help online educators to detect their learners' engagement status precisely [6] and to better organize their materials. It is also possible to highlight the time points with either high or very low engagement to find the weird or difficult parts of the lecture. Such data let one track the efficiency of lessons and increase the conversion of online courses.
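As an illustration of the "Summarization 12" unit, the sketch below keeps the time points where a strongly expressed emotion is predicted and groups them by the sign of valence in Russell's affect space [51]. The probability threshold and the valence signs assigned to each emotion label are assumptions made only for this example.

```python
# Illustrative summarization: select time points with confident predictions and bucket them by
# valence sign (positive / neutral / negative); the sign table below is an assumption.
VALENCE_SIGN = {"happiness": +1, "surprise": 0, "neutral": 0,
                "anger": -1, "contempt": -1, "disgust": -1, "fear": -1, "sadness": -1}

def summarize(track, prob_threshold=0.8):
    """track: list of (timestamp, emotion_label, probability) tuples per analysed time frame."""
    summary = {"positive": [], "neutral": [], "negative": []}
    for t, emotion, prob in track:
        if prob < prob_threshold:
            continue
        sign = VALENCE_SIGN.get(emotion, 0)
        key = "positive" if sign > 0 else ("negative" if sign < 0 else "neutral")
        summary[key].append(t)
    return summary  # time points from which short clips or GIFs can be cut
```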
3.2 Robust Optimization of the FER Network
In this subsection, let us describe the procedure to train the FER network that extracts robust emotional features from either static photos or video frames. At first, a lightweight CNN is trained for face identification on a very large dataset [47]. Next, this network is fine-tuned on any emotional dataset with static facial photos [23]. As existing emotional datasets are typically highly imbalanced, the weighted
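The sketch below illustrates this two-stage training, assuming a timm-style backbone with a .classifier head that has already been pre-trained for face identification; a plain class-weighted cross-entropy is used for the imbalanced emotional dataset, and the robust modification of the softmax loss (Eqs. (3)-(5), Algorithm 1) is not reproduced here.

```python
# Rough sketch of the fine-tuning stage (assumptions: timm-style backbone with an nn.Linear
# .classifier; class-weighted cross-entropy stands in for the weighted/robust loss).
import torch
import torch.nn as nn

def fine_tune_fer(backbone, num_emotions, class_counts, train_loader, epochs=3, lr=1e-3):
    # Replace the face-identification head with an emotion classifier.
    backbone.classifier = nn.Linear(backbone.classifier.in_features, num_emotions)
    weights = torch.tensor([sum(class_counts) / c for c in class_counts], dtype=torch.float32)
    criterion = nn.CrossEntropyLoss(weight=weights / weights.sum())
    optimizer = torch.optim.Adam(backbone.parameters(), lr=lr)  # Adam baseline, cf. Table 4
    backbone.train()
    for _ in range(epochs):
        for images, labels in train_loader:
            optimizer.zero_grad()
            loss = criterion(backbone(images), labels)
            loss.backward()
            optimizer.step()
    return backbone
```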
TABLE 1
Accuracy for the AffectNet Validation Set (accuracy, %)
Model | 8 classes | 7 classes
Baseline (AlexNet) [23] | 58.0 | -
Deep Attentive Center Loss [53] | - | 65.20
Distilled student [54] | 61.60 | 65.40
EfficientNet-B2 (SL + SSL in-painting-pl) [24] | 61.32 | -
EfficientNet-B0 (SL + SSL in-painting-pl) [24] | 61.72 | -
DAN [55] | 62.09 | 65.69
TransFER [27] | - | 66.23
EmotionGCN [29] | - | 66.46
Our MobileNet-v1 | 60.20 | 64.71
Our EfficientNet-B0 | 61.32 | 65.74
Our EfficientNet-B2 | 63.03 | 66.34

TABLE 2
Class-Level Accuracy of Emotion Recognition on Static Photos, AffectNet Dataset (accuracy, %)
Emotion | MobileNet-v1 | EfficientNet-B0 | EfficientNet-B2
Anger | 62.8 | 61.4 | 54.2
Contempt | 48.0 | 60.4 | 66.0
Disgust | 51.8 | 50.0 | 65.4
Fear | 66.8 | 66.2 | 63.8
Happiness | 81.8 | 78.0 | 74.6
Neutral | 58.6 | 53.4 | 54.6
Sadness | 61.8 | 59.4 | 65.4
Surprise | 56.0 | 61.8 | 60.2

TABLE 3
Ablation Study of the Proposed Models, AffectNet Dataset (accuracy, %)
Model | Pre-train set | 8 classes | 7 classes
MobileNet-v1 | ImageNet | 56.88 | 60.4
EfficientNet-B0 | ImageNet | 57.55 | 60.8
EfficientNet-B2 | ImageNet | 60.28 | 64.3
SENet-50 | VGGFace2 | 58.70 | 62.31
Our MobileNet-v1, 8 classes | VGGFace2 | 60.25 | 63.21
Our MobileNet-v1, 7 classes | VGGFace2 | - | 64.71
Our EfficientNet-B0, 8 classes | VGGFace2 | 61.32 | 64.57
Our EfficientNet-B0, 7 classes | VGGFace2 | - | 65.74
Our EfficientNet-B2, 8 classes | VGGFace2 | 63.03 | 66.29
Our EfficientNet-B2, 7 classes | VGGFace2 | - | 66.34

TABLE 4
Ablation Study of Optimizers, AffectNet Dataset (accuracy, %)
CNN | Adam (8 classes) | Robust (3)-(5) (8 classes) | Adam (7 classes) | Robust (3)-(5) (7 classes)
MobileNet-v1 | 59.87 | 60.25 | 64.54 | 64.71
EfficientNet-B0 | 60.94 | 61.32 | 65.46 | 65.74
EfficientNet-B2 | 62.11 | 63.03 | 65.89 | 66.34

training process was straightforward. The accuracy of MobileNet and EfficientNet-B0 is lower, but still comparable with the best-known results reported for this dataset. It is important to emphasize that, though the average accuracy of the deepest EfficientNet-B2 model is greater, the class accuracy for every type of emotion is sometimes lower when compared to EfficientNet-B0 and MobileNet (Table 2), so that all our models may be useful in different down-stream tasks.

The detailed ablation study of experiments for AffectNet is presented in Tables 3 and 4. In the latter table, the greatest 8-class and 7-class accuracy for each row (model) is marked in bold. Here two datasets for pre-training were examined, namely, (1) the conventional ImageNet; and (2) VGGFace2 [47] to learn facial embeddings suitable for face recognition. The official models pre-trained on ImageNet were taken from Tensorflow 2 and PyTorch Image Models (timm) for the former approach. The latter technique was implemented as described in Section 3.3. As one can notice, such pre-training leads to much better FER accuracy, even though facial identity features should not depend on the emotional state [28]. Moreover, Table 4 demonstrates that the robust optimization (Algorithm 1) makes it possible to increase the accuracy. It is especially noticeable for the best EfficientNet-B2 model, which established a new state-of-the-art result for the complete validation set with 8 classes.
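For reference, the ImageNet-pretrained baselines of Table 3 can be instantiated as shown below; this is only an illustration, and the exact training hyper-parameters are not listed in this section. timm provides the EfficientNet backbones, while tf.keras.applications provides MobileNet (v1).

```python
# Instantiating the ImageNet-pretrained baselines of Table 3 (illustration only).
import timm
import tensorflow as tf

effnet_b0 = timm.create_model("efficientnet_b0", pretrained=True, num_classes=8)
effnet_b2 = timm.create_model("efficientnet_b2", pretrained=True, num_classes=8)

mobilenet = tf.keras.applications.MobileNet(weights="imagenet", include_top=False, pooling="avg")
emotion_head = tf.keras.layers.Dense(8, activation="softmax")(mobilenet.output)
mobilenet_fer = tf.keras.Model(mobilenet.input, emotion_head)
```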
4.2 Engagement Prediction
In this subsection, the results on EngageWild [8] are reported. This dataset contains 147 training and 48 validation videos with an average duration of 5 minutes. Each video is associated with one of 4 engagement levels (0, 0.33, 0.66 and 1), representing engagement mapped to disengaged, barely engaged, engaged and highly engaged.

The frame images were extracted from the video using the FFmpeg tool, and the facial regions were found in each frame using the MTCNN detector. If no faces were detected, the frame was ignored. Next, the developed emotional models were used to extract the features of the largest facial region. The final descriptor of the whole video was computed as the standard deviation of the frame-wise facial features, similarly to the baseline [20]. We tried to use other STAT features (mean, max, min) but did not observe improvements in the MSE (mean squared error) measured on the validation set. The obtained video descriptor was fed into ridge regression from the MORD package, because the initial engagement prediction task may be formulated as an ordinal regression. The results of the best attempts compared to the results of the participants of the EmotiW challenge on the official training and validation set are shown in Table 5.
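A sketch of the engagement regressor described in this paragraph is given below: each video is represented by the standard deviation of its frame-wise emotional features and fed into ordinal ridge regression from the MORD package. The choice of the OrdinalRidge class and the mapping of the labels 0, 0.33, 0.66 and 1 to integer ranks are assumptions of this example.

```python
# Sketch of the EngageWild regressor: std-of-features video descriptor + MORD ordinal ridge.
import numpy as np
from mord import OrdinalRidge

def video_descriptor(frame_features):
    """Standard deviation of frame-wise emotional features of one video."""
    return np.stack(frame_features).std(axis=0)

def train_engagement_regressor(train_videos, train_levels, alpha=1.0):
    X = np.stack([video_descriptor(v) for v in train_videos])
    y = np.round(np.asarray(train_levels) * 3).astype(int)   # 0, 0.33, 0.66, 1 -> ranks 0..3
    return OrdinalRidge(alpha=alpha).fit(X, y)

def predict_engagement(model, frame_features):
    rank = model.predict(video_descriptor(frame_features)[np.newaxis, :])[0]
    return rank / 3.0                                          # back to the 0..1 label scale
```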
It is important to emphasize that the best results are typically achieved by ensemble models that use several different audio and video features. Hence, the best results of single models are presented here to fairly compare the methods that use only one CNN. Nevertheless, the MSE of 0.0563 for the EfficientNet-B0 features is the best one when compared to any existing method. The confusion matrix of the best ordinal regression is shown in Fig. 5. Its MSE is lower than that of the best single model [13] by up to 0.01 (a 15% relative improvement).

However, this point should be clarified. The participants of the Engagement in the Wild challenge verified that achieving better results on the validation set does not lead to excellent quality on the testing set. For example, the winner of the first challenge (EmotiW 2018) has high
TABLE 5
MSE for the EngageWild Validation Set

TABLE 6
Ablation Study of the Proposed Models, EngageWild Dataset
TABLE 7
Accuracy for the AFEW Validation Set

TABLE 8
Accuracy for the VGAF Validation Set

Fig. 6. Confusion matrix for video-based individual emotion recognition, EfficientNet-B0 features.
Fig. 7. Confusion matrix for video-based group-level emotion recognition, EfficientNet-B2 features.
[14] C. Liu, W. Jiang, M. Wang, and T. Tang, "Group level audio-video emotion recognition using hybrid networks," in Proc. Int. Conf. Multimodal Interact., 2020, pp. 807–812.
[15] T. Liu, J. Wang, B. Yang, and X. Wang, "Facial expression recognition method with multi-label distribution learning for non-verbal behavior understanding in the classroom," Infrared Phys. Technol., vol. 112, 2021, Art. no. 103594.
[16] A. V. Savchenko, K. V. Demochkin, and I. S. Grechikhin, "Preference prediction based on a photo gallery analysis with scene recognition and object detection," Pattern Recognit., vol. 121, 2022, Art. no. 108248.
[17] P. Xanthopoulos, P. M. Pardalos, and T. B. Trafalis, Robust Data Mining. Berlin, Germany: Springer, 2012.
[18] S. A. Bargal, E. Barsoum, C. C. Ferrer, and C. Zhang, "Emotion recognition in the wild from videos using images," in Proc. Int. Conf. Multimodal Interact., 2016, pp. 433–436.
[19] H. Zeng et al., "EmotionCues: Emotion-oriented visual summarization of classroom videos," IEEE Trans. Vis. Comput. Graphics, vol. 27, no. 7, pp. 3168–3181, Jul. 2021.
[20] A. Dhall, "EmotiW 2019: Automatic emotion, engagement and cohesion prediction tasks," in Proc. Int. Conf. Multimodal Interact., 2019, pp. 546–550.
[21] T. Ashwin and R. M. R. Guddeti, "Affective database for e-learning and classroom environments using Indian students' faces, hand gestures and body postures," Future Gener. Comput. Syst., vol. 108, pp. 334–348, 2020.
[22] B. E. Zakka and H. Vadapalli, "Estimating student learning affect using facial emotions," in Proc. IEEE 2nd Int. Multidisciplinary Inf. Technol. Eng. Conf., 2020, pp. 1–6.
[23] A. Mollahosseini, B. Hasani, and M. H. Mahoor, "AffectNet: A database for facial expression, valence, and arousal computing in the wild," IEEE Trans. Affective Comput., vol. 10, no. 1, pp. 18–31, Jan.–Mar. 2019.
[24] M. Pourmirzaei, G. A. Montazer, and F. Esmaili, "Using self-supervised auxiliary tasks to improve fine-grained facial representation," 2021, arXiv:2105.06421.
[25] M. Tan and Q. Le, "EfficientNet: Rethinking model scaling for convolutional neural networks," in Proc. Int. Conf. Mach. Learn., 2019, pp. 6105–6114.
[26] F. Ma, B. Sun, and S. Li, "Facial expression recognition with visual transformers and attentional selective fusion," IEEE Trans. Affective Comput., to be published, doi: 10.1109/TAFFC.2021.3122146.
[27] F. Xue, Q. Wang, and G. Guo, "TransFER: Learning relation-aware facial expression representations with transformers," in Proc. IEEE/CVF Int. Conf. Comput. Vis., 2021, pp. 3601–3610.
[28] A. V. Savchenko, "Facial expression and attributes recognition based on multi-task learning of lightweight neural networks," in Proc. IEEE 19th Int. Symp. Intell. Syst. Inform., 2021, pp. 119–124.
[29] P. Antoniadis, P. P. Filntisis, and P. Maragos, "Exploiting emotional dependencies with graph convolutional networks for facial expression recognition," in Proc. 16th IEEE Int. Conf. Autom. Face Gesture Recognit., 2021, pp. 1–8.
[30] V. Kumar, S. Rao, and L. Yu, "Noisy student training using body language dataset improves facial expression recognition," in Proc. Eur. Conf. Comput. Vis., 2020, pp. 756–773.
[31] H. Zhou et al., "Exploring emotion features and fusion strategies for audio-video emotion recognition," in Proc. Int. Conf. Multimodal Interact., 2019, pp. 562–566.
[32] S. Li et al., "Bi-modality fusion for emotion recognition in the wild," in Proc. Int. Conf. Multimodal Interact., 2019, pp. 589–594.
[33] J. R. Pinto et al., "Audiovisual classification of group emotion valence using activity recognition networks," in Proc. IEEE 4th Int. Conf. Image Process. Appl. Syst., 2020, pp. 114–119.
[34] Y. Wang, J. Wu, P. Heracleous, S. Wada, R. Kimura, and S. Kurihara, "Implicit knowledge injectable cross attention audiovisual model for group emotion recognition," in Proc. Int. Conf. Multimodal Interact., 2020, pp. 827–834.
[35] I. P. Ratnapala, R. G. Ragel, and S. Deegalla, "Students behavioural analysis in an online learning environment using data mining," in Proc. IEEE 7th Int. Conf. Inf. Autom. Sustainability, 2014, pp. 1–7.
[36] Z. Zhang, Z. Li, H. Liu, T. Cao, and S. Liu, "Data-driven online learning engagement detection via facial expression and mouse behavior recognition technology," J. Educ. Comput. Res., vol. 58, no. 1, pp. 63–86, 2020.
[37] T. Dragon, I. Arroyo, B. P. Woolf, W. Burleson, R. E. Kaliouby, and H. Eydgahi, "Viewing student affect and learning through classroom observation and physical sensors," in Proc. Int. Conf. Intell. Tutoring Syst., 2008, pp. 29–39.
[38] J. Whitehill, Z. Serpell, Y.-C. Lin, A. Foster, and J. R. Movellan, "The faces of engagement: Automatic recognition of student engagement from facial expressions," IEEE Trans. Affective Comput., vol. 5, no. 1, pp. 86–98, Jan.–Mar. 2014.
[39] X. Chen, L. Niu, A. Veeraraghavan, and A. Sabharwal, "FaceEngage: Robust estimation of gameplay engagement from user-contributed (YouTube) videos," IEEE Trans. Affective Comput., vol. 13, no. 2, pp. 651–665, Apr.–Jun. 2022.
[40] T. Baltrusaitis, A. Zadeh, Y. C. Lim, and L.-P. Morency, "OpenFace 2.0: Facial behavior analysis toolkit," in Proc. IEEE 13th Int. Conf. Autom. Face Gesture Recognit., 2018, pp. 59–66.
[41] X. Niu et al., "Automatic engagement prediction with GAP feature," in Proc. Int. Conf. Multimodal Interact., 2018, pp. 599–603.
[42] C. Thomas, N. Nair, and D. B. Jayagopi, "Predicting engagement intensity in the wild using temporal convolutional network," in Proc. Int. Conf. Multimodal Interact., 2018, pp. 604–610.
[43] J. Yang, K. Wang, X. Peng, and Y. Qiao, "Deep recurrent multi-instance learning with spatio-temporal features for engagement intensity prediction," in Proc. Int. Conf. Multimodal Interact., 2018, pp. 594–598.
[44] K. Wang, J. Yang, D. Guo, K. Zhang, X. Peng, and Y. Qiao, "Bootstrap model ensemble and rank loss for engagement intensity regression," in Proc. Int. Conf. Multimodal Interact., 2019, pp. 551–556.
[45] J. Wu, Z. Zhou, Y. Wang, Y. Li, X. Xu, and Y. Uchida, "Multi-feature and multi-instance learning with anti-overfitting strategy for engagement intensity prediction," in Proc. Int. Conf. Multimodal Interact., 2019, pp. 582–588.
[46] V. T. Huynh, S.-H. Kim, G.-S. Lee, and H.-J. Yang, "Engagement intensity prediction with facial behavior features," in Proc. Int. Conf. Multimodal Interact., 2019, pp. 567–571.
[47] Q. Cao, L. Shen, W. Xie, O. M. Parkhi, and A. Zisserman, "VGGFace2: A dataset for recognising faces across pose and age," in Proc. IEEE 13th Int. Conf. Autom. Face Gesture Recognit., 2018, pp. 67–74.
[48] A. V. Savchenko, "Video-based frame-level facial analysis of affective behavior on mobile devices using EfficientNets," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. Workshops, 2022, pp. 2359–2366.
[49] Y. Yan, Z. Zhang, S. Chen, and H. Wang, "Low-resolution facial expression recognition: A filter learning perspective," Signal Process., vol. 169, 2020, Art. no. 107370.
[50] A. V. Savchenko, "Efficient facial representations for age, gender and identity recognition in organizing photo albums using multi-output ConvNet," PeerJ Comput. Sci., vol. 5, 2019, Art. no. e197.
[51] J. A. Russell, L. M. Ward, and G. Pratt, "Affective quality attributed to environments: A factor analytic study," Environ. Behav., vol. 13, no. 3, pp. 259–288, 1981.
[52] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," 2014, arXiv:1412.6980.
[53] A. H. Farzaneh and X. Qi, "Facial expression recognition in the wild via deep attentive center loss," in Proc. IEEE/CVF Winter Conf. Appl. Comput. Vis., 2021, pp. 2402–2411.
[54] L. Schoneveld, A. Othmani, and H. Abdelkawy, "Leveraging recent advances in deep learning for audio-visual emotion recognition," Pattern Recognit. Lett., vol. 146, pp. 1–7, 2021.
[55] Z. Wen, W. Lin, T. Wang, and G. Xu, "Distract your attention: Multi-head cross attention network for facial expression recognition," 2021, arXiv:2109.07270.
[56] D. Meng, X. Peng, K. Wang, and Y. Qiao, "Frame attention networks for facial expression recognition in videos," in Proc. IEEE Int. Conf. Image Process., 2019, pp. 3866–3870.
[57] P. Demochkina and A. V. Savchenko, "MobileEmotiFace: Efficient facial image representations in video-based emotion recognition on mobile devices," in Proc. Pattern Recognit. Int. Workshops Challenge, 2021, pp. 266–274.
[58] M. Sun et al., "Multi-modal fusion using spatio-temporal and static features for group emotion recognition," in Proc. Int. Conf. Multimodal Interact., 2020, pp. 835–840.
[59] V. Skaramagkas et al., "A machine learning approach to predict emotional arousal and valence from gaze extracted features," in Proc. IEEE 21st Int. Conf. Bioinf. Bioeng., 2021, pp. 1–5.
Andrey V. Savchenko received the BS degree in applied mathematics and informatics from Nizhny Novgorod State Technical University, Nizhny Novgorod, Russia, in 2006, the PhD degree in mathematical modelling and computer science from the State University Higher School of Economics, Moscow, Russia, in 2010, and the DrSc degree in system analysis and information processing from Nizhny Novgorod State Technical University, in 2016. Since 2008, he has been with the HSE University, Nizhny Novgorod, where he is currently a full professor with the Department of Information Systems and Technologies. He is also a leading research fellow with the Laboratory of Algorithms and Technologies for Network Analysis, HSE University. He has authored or co-authored one monograph and more than 50 articles. His current research interests include statistical pattern recognition, image classification, and biometrics.

Lyudmila V. Savchenko received the specialist degree in applied mathematics and informatics from Nizhny Novgorod State Technical University, Nizhny Novgorod, Russia, in 2008, and the PhD degree in system analysis and information processing from Voronezh State Technical University, in 2017. Since 2018, she has been with the HSE University, Nizhny Novgorod, where she is currently an associate professor with the Department of Information Systems and Technologies. She is also a senior research fellow with the Laboratory of Algorithms and Technologies for Network Analysis, HSE University. Her current research interests include speech processing and e-learning systems.

Ilya Makarov received the PhD degree in computer science from the University of Ljubljana, Ljubljana, Slovenia. From 2011 to 2022, he was a full-time lecturer with the HSE University School of Data Analysis and Artificial Intelligence. He is a senior research fellow with AIRI and HSE University – Nizhniy Novgorod, and a researcher with the Samsung-PDMI Joint AI Center, St. Petersburg Department of Steklov Institute of Mathematics, Russian Academy of Sciences, St. Petersburg, Russia. His educational career in data science covers the positions of program director of the BigData Academy MADE from VK, senior lecturer with the Moscow Institute of Physics and Technology, and machine learning engineer and head of the Data Science Tech Master program in NLP at the National University of Science and Technology MISIS.

For more information on this or any other computing topic, please visit our Digital Library at www.computer.org/csdl.