
Received 30 May 2024, accepted 30 June 2024, date of publication 3 July 2024, date of current version 3 September 2024.

Digital Object Identifier 10.1109/ACCESS.2024.3422383

A Multimodal Driver Anger Recognition Method Based on Context-Awareness

TONGQIANG DING1, KEXIN ZHANG1, SHUAI GAO2, XINNING MIAO3, AND JIANFENG XI1
1 Transportation College, Jilin University, Changchun 130022, China
2 Jilin Communications Polytechnic, Changchun 130022, China
3 Beijing Jingwei Hirain Technologies Company Inc., Beijing 100029, China

Corresponding author: Shuai Gao ([email protected])


This work was supported in part by the National Key Research and Development Program Project ‘‘Major Accident Risk Prevention and
Emergency Avoidance Technology for Road Transport Vehicles’’ under Grant 2023YFC3009600 and in part by the Graduate Innovation
Fund of Jilin University.

ABSTRACT In today’s society, the harm of driving anger to traffic safety is increasingly prominent. With
the development of human-computer interaction and intelligent transportation systems, the application of
biometric technology in driver emotion recognition has attracted widespread attention. This study proposes
a context-aware multi-modal driver anger emotion recognition method (CA-MDER) to address the main
issues encountered in multi-modal emotion recognition tasks. These include individual differences among
drivers, variability in emotional expression across different driving scenarios, and the inability to capture
driving behavior information that represents vehicle-to-vehicle interaction. The method employs Attention
Mechanism-Depthwise Separable Convolutional Neural Networks (AM-DSCNN), an improved Support
Vector Machine (SVM), and Random Forest (RF) models to perform multi-modal anger emotion recognition
using facial, vocal, and driving state information. It also uses Context-Aware Reinforcement Learning
(CA-RL) based adaptive weight distribution for multi-modal decision-level fusion. The results show that the
proposed method performs well in emotion classification metrics, with an accuracy and F1 score of 91.68%
and 90.37%, respectively, demonstrating robust multi-modal emotion recognition performance and powerful
emotion recognition capabilities.

INDEX TERMS Context-awareness, driving state emotion recognition, emotional expression heterogeneity,
multimodal emotion recognition, machine learning.

The associate editor coordinating the review of this manuscript and approving it for publication was Shaohua Wan.

I. INTRODUCTION
In recent years, with the evolution of human-computer interaction and intelligent transportation systems, the application of biometric technology in the field of driver emotion recognition has garnered widespread attention. It assesses the emotional state of drivers by analyzing data such as facial expressions, speech information, physiological signals, and driving behavior, demonstrating significant application value in improving road safety and driving experience.

Emotions are experiences and subjective perceptions that humans generate in response to the external environment and events, along with corresponding behavioral reactions and changes. Different emotions can impact personal or social behavior in various ways. For drivers, emotional fluctuations directly affect the reception and judgment of road information during driving, thereby distracting the driver and affecting operational accuracy. Anger is one of the most common emotions experienced by drivers while driving. In China, angry driving is quite common, with data showing that about 68% of motor vehicle drivers have experienced ''road rage.'' Parkinson [1] found that people are more likely to get angry while driving than not driving. Additionally, factors like time pressure and traffic congestion during driving increase the likelihood of anger while driving [2]. Angry driving is a serious public issue, with studies proving that anger during driving can have a range of adverse effects [3]. Cognitively, anger can negatively impact attention, perception, and information processing, affecting the driver's control over the vehicle.
Behaviorally, driving under the influence of anger can lead to more aggressive behavior [4].

In addressing the issues caused by angry driving, it is essential to start from the perspective of the individual driver and study the recognition of angry driving to facilitate early emotional soothing and safety warnings. This can prevent many traffic problems caused by angry driving, provide more precise protection for drivers and vehicles, and offer feasible safety assurances for the driving community. Therefore, this paper discusses a multi-modal method for recognizing driver anger emotions, with the remaining sections organized as follows. Chapter 2 introduces the current state of research on both unimodal and multimodal driver anger emotion recognition and elaborates on the problems this paper aims to solve. Chapter 3 focuses on the main innovations of the multi-modal anger emotion recognition method used in this paper. Chapter 4 introduces the dataset used in this study and related anger emotion recognition experiments. In terms of datasets, this paper emphasizes a multimodal dataset of facial expressions, voice, and vehicle driving states collected by the authors. Regarding recognition experiments, the paper not only validates the proposed method through experiments but also includes comparative and ablation experiments. Finally, Chapter 5 summarizes the experimental results and provides an outlook on future work.

II. RELATED WORK
In recent years, with the rapid development of sensor technology and machine learning, significant progress has been made in the recognition of driver anger emotions, both in unimodal methods, such as facial expression, speech information, physiological information, and behavioral information, and in multimodal methods involving the fusion of various types of information. Recognition methods based on facial expressions and speech information have become mainstream in emotion recognition due to their ease of data acquisition and high accuracy rates. In the field of driver anger emotion recognition, methods based on vehicle driving state information demonstrate unique data advantages. However, recognition methods based on physiological information, due to the challenges of non-contact data collection, often significantly impact the driver during the data collection process. Therefore, this paper will next elaborate in detail on the methods of anger emotion recognition based on facial information, speech information, and vehicle driving state information.

A. ANGER EMOTION RECOGNITION BASED ON FACIAL INFORMATION
Facial expressions are the most natural and direct way for humans to express emotions. Since facial information can be collected non-invasively, thus minimizing interference with the driver, facial emotion recognition has become the primary method for recognizing driver anger. The two most important steps in facial information-based anger emotion recognition are first to extract features from the facial information and then to determine whether the emotion is anger or not, using certain rules or classification models.

In terms of facial feature extraction methods, Ekman and Friesen [5] proposed the Facial Action Coding System (FACS) facial behavior coding method in 1978, which can be used to judge facial expressions by identifying specific Action Unit (AU) feature vectors. Barmana and Dutta [6] proposed a fast SIC active appearance model that can extract facial features by forming a mesh of 14 facial feature points. In addition, there are the Active Shape Model (ASM), Local Binary Patterns (LBP) [7], and the Histogram of Oriented Gradients (HOG) [8]. The purpose of the ASM algorithm is to generate points that fit the contours of the object. LBP, on the other hand, constructs a binary number by comparing each cell's pixels with its eight neighboring pixels through a function. Afterwards, the algorithm constructs a histogram of these numbers for each cell and connects the histograms. HOG produces a histogram based on the gradients calculated after segmenting the image into cells. However, in real-life applications, all of the above methods have some drawbacks and shortcomings. The FACS system is more complex, and manually performing the FACS coding is a very time-consuming process that is affected by the subjective judgment of the coder; SIC and LBP models have difficulty accurately capturing the subtle dynamic changes in facial expressions; ASM's effectiveness relies heavily on the accuracy of the initial shape and is not flexible enough to deal with non-standard facial expressions; HOG also has difficulty accurately capturing subtle changes, and its computation may be relatively complex and time-consuming. In addition to the above methods, we observe that more feature extraction is performed by convolutional neural networks. Compared to the above methods, Convolutional Neural Networks (CNNs) can automatically learn and extract features from training data and realize hierarchical feature learning, which not only can extract higher-level features and subtle features that may be neglected by traditional methods, but also can improve the adaptability and flexibility of the model.

In terms of classification algorithms, there are two main categories: classical methods and neural network methods. Classical methods include support vector machines (SVMs), dynamic Bayesian networks, extreme learning machines [9], sparse learning machines [10], support vector regression [11], sparse representation classification [10], random forests [12], random trees [12], multi-graph embeddings [13], and single-modified Viola-Jones [14]. The above classical methods usually have deficiencies in terms of computational complexity, parameter tuning, and model generalizability. DBN and random forests, for example, face the challenge of computational resources when dealing with large datasets, and the optimization of parameters such as the type of kernel function and regularization parameter in SVM, the number of nodes in the hidden layer in ELM, and the number of trees in random forests usually requires a lot of experiments and expertise, among other deficiencies.

The main methods designed through neural networks include methods such as multilayer perceptron (MLP) and CNN. Although MLP is a powerful classification tool, it still has some limitations in terms of deep structure design, overfitting control, parameter tuning, and dependence on data. For example, MLP is unable to effectively utilize the spatial or temporal structure in the input data, and MLP is susceptible to overfitting when the training data is limited or noisy. CNN, on the other hand, can effectively capture the spatial hierarchical structure in the input data through its convolutional layers, and CNN usually has fewer parameters due to weight sharing and pooling operations, which reduces the computational complexity and the risk of overfitting and improves the generalization ability of the model. A. T. Lopes [15] designed a multilayered CNN model that automatically extracts features and recognizes the emotion of anger from facial images; Liu et al. [16] used a three-dimensional CNN (3D-CNN) to capture spatio-temporal features in facial expression videos, not only focusing on the spatial features of the face but also taking into account the change of expression over time, which improves the accuracy of emotion recognition; Ng et al. [17] proposed a CNN model that simultaneously learns facial emotion recognition and other face-related tasks (e.g., gender recognition, age estimation), showing that sharing representations between related tasks can improve emotion recognition performance. CNNs show great potential in facial emotion recognition; however, traditional CNNs usually contain a large number of parameters, leading to high computational complexity and significant storage requirements, which are not conducive to real-time applications. Moreover, when image data are processed, the network usually does not identify which parts are more important for the final prediction, which may lead to the model being disturbed by non-feature information.

B. ANGER EMOTION RECOGNITION BASED ON SPEECH INFORMATION
Language, as a unique means of human communication, can directly reflect human emotions through its vocal characteristics. In 1983, Bezooijen and others explored the correlation between vocal features and different emotions, suggesting that statistical parameters of speech features could be used for emotion classification. However, in the application field of driver anger emotion recognition, speech data often faces challenges such as poor data usability and missing data. Therefore, speech emotion recognition is usually used as a supplementary method to enhance the overall accuracy of driver anger detection. Speech emotion recognition also involves two critical steps: feature extraction and emotion classification.

In the aspect of feature extraction, speech features can be categorized into prosodic features, timbral features, and spectral features. Each category encompasses specific related characteristics, as shown in Table 1.

TABLE 1. Speech features.

Prosodic features primarily focus on the rhythm, intensity, speech rate, and pitch of speech. For instance, John et al. [18] conducted experiments using prosodic features for emotion recognition, proving their effectiveness in speech emotion detection. Timbral features reflect the quality of speech signals, measuring the intelligibility, purity, and distinctiveness of speech. Research by Nussbaum et al. [19] has found that fundamental frequency, a timbral characteristic, plays a significant role in emotion recognition. Spectral features represent the characteristics of the signal in the frequency domain, where emotional fluctuations cause variations in the spectral distribution of speech. Lalitha et al. [20] have suggested the role of cepstral coefficients in spectral features for emotion classification in enhancing human-computer interaction performance. Furthermore, some scholars have combined the above-mentioned speech characteristics for emotion recognition. For example, Zhou et al. [21] proposed a method using a fusion of MFCC and prosodic features to identify speech emotions. Their experiments showed that this fusion increased the accuracy rate by nearly 20% compared to using a single feature, with the accuracy of using only MFCC being 62.3% higher than using only prosodic features.

In the field of emotion classification, models mainly include SVM [22], [23], Artificial Neural Network (ANN) [20], Hidden Markov Model (HMM) [24], CNN [25], Decision Tree [26], Long Short-Term Memory (LSTM) [27], and Recurrent Neural Network (RNN) [28] methods. SVM is suitable for clear classification problems in high-dimensional spaces but only performs well on small to medium-sized datasets. CNN and LSTM demonstrate excellent performance but are computationally expensive. ANN has strong learning capabilities but is prone to overfitting and sensitive to parameter selection. HMM excels in processing time-series data but is limited in handling nonlinear features. RNN is apt for sequential data but struggles with long sequences. Decision Trees are easy to understand but prone to overfitting.

Many methods in the field of speech emotion recognition have achieved excellent recognition effects, but there are still some shortcomings in application scenarios like recognizing the anger emotion of drivers while driving. Firstly, to improve the comprehensive recognition accuracy of drivers' anger, the speech emotion recognition module should enhance recognition efficiency to achieve real-time overall recognition. However, neural network models like RNN, which have high recognition accuracy, lack real-time performance, while models like SVM, which have high recognition efficiency, need improved accuracy.


Secondly, in the scenario of recognizing drivers' anger while driving, issues such as missing speech information and poor information usability exist. Finally, the selection of features in speech emotion recognition often focuses on one or a few features, lacking comprehensiveness in feature selection.

C. ANGER EMOTION RECOGNITION BASED ON VEHICLE DRIVING STATE INFORMATION
Vehicle driving state information directly reflects the driver's behavior during driving. Studies have proven that this information can also reveal the driver's anger. For example, Lei Hu [29] designed a scale to measure drivers' anger expressions and conducted a survey using this scale. The results indicated that when driving in an angry state, there was an increase in lane-changing frequency, and operations such as accelerating, braking, and steering became more frequent and intense. Zhong et al. [30] collected data on drivers' behavior when they were angry, showing that driving speed, honking frequency, and incidences of speeding increased, while the behavior of slowing down at crosswalks decreased. Techer et al. [31] studied the impact of drivers' anger on attention processes and driving performance. The results suggested that anger affects driving behavior and attention, leading to significant fluctuations in driving behavior and decreased attention. Precht et al. [32] researched how drivers' anger impacts their driving behavior, analyzing data on drivers who showed anger towards driving errors, violations, and aggressive expressions, and comparing it with data from drivers who did not exhibit anger. The results indicated that anger led to more frequent aggressive driving behaviors, but did not increase the frequency of driving errors.

Based on these studies, some scholars have used vehicle driving state information for recognizing driver anger. For instance, Shafaei et al. [33] integrated the vehicle's yaw angle and acceleration into an emotion recognition system, creating two modules: a sudden car operation counter based on steering wheel rotation and an aggressive driving predictor based on acceleration changes. Combined with a facial emotion recognition module, the final result showed a 94% accuracy rate in predicting drivers' emotions. Wang [34] utilized vehicle motion information such as speed and steering wheel angle, along with electrocardiogram signals, for multimodal anger emotion recognition, achieving an accuracy rate of 84.75%. Yu [35] collected vehicle motion information like speed, acceleration, and steering wheel turning amplitude through a driving simulator, combining it with facial data for multimodal anger emotion recognition. Wang [36] considered more comprehensive vehicle driving state information through a driving simulator, including steering wheel angle, longitudinal and lateral speed, pitch angle, yaw angle, and engine speed, achieving an anger recognition accuracy rate of 65.8%.

Currently, research using driving state information for anger emotion recognition is not widespread and has some limitations. Firstly, most studies do not consider the impact of driver heterogeneity. Anomalies in vehicle driving state, such as significant changes in acceleration, may not necessarily indicate anger but could also be due to an aggressive driving style. Secondly, the anger reflected in the interaction between vehicles, such as vehicle following distance and frequency of overtaking, has not been captured.

D. MULTIMODAL ANGER EMOTION RECOGNITION BASED ON MULTIPLE INFORMATION FUSION
Due to the fact that unimodal methods can only provide one type of emotional information and, in some situations, expressions of certain modalities may be suppressed [37], they have significant limitations in anger emotion recognition. Multimodal approaches, by considering a more comprehensive range of emotional expression channels, demonstrate better recognition performance. However, current multimodal research in the driver domain is relatively scarce, and there is no unified dataset for scholars to use. Therefore, both the selected modalities and the data used are diverse. For instance, Zhou et al. [38] proposed a multimodal fusion framework based on CNN+Bi-LSTM+HAM, which combines the driver's voice, facial images, and video sequences for emotion recognition, achieving an anger recognition accuracy of 85.0%. Ni et al. [39] collected physiological response signals, nasal tip temperature signals, and vehicle behavior signals from drivers in a simulated driving situation, conducting combination experiments with different data and methods using Random Forest (RF), K-Nearest Neighbor (KNN), and XGBoost. The results showed that using the RF model for multimodal recognition of the three types of data was the most effective, with an anger recognition accuracy of 92.4%. Du et al. [40] proposed a Convolutional Bidirectional Long Short-Term Memory Neural Network (CBLNN), predicting the driver's emotions based on geometric features extracted from facial skin information and heart rate extracted from changes in RGB components, with an anger recognition accuracy of 90.5%.

In fact, multimodal fusion mainly falls into two categories: feature fusion and decision-level fusion. Decision-level fusion, due to its simplicity and tolerance to different modality recognition, often exhibits better performance. For example, Wang [36] conducted multimodal anger emotion recognition based on EEG signals, physiological signals, and driving behavior information. The experiments showed that decision-level fusion performed better than feature-level fusion, with an accuracy of 76.3%. Wang [34] conducted multimodal fusion recognition of anger emotions based on ECG signals and driving behavior signals. The results showed that the SVM-DS model, which employed decision-level fusion, performed best, with an accuracy of 84.75%. In previous studies, the weights of each modality in decision-level fusion were often fixed. However, in actual scenarios, the weights of modalities should vary. Research by Li et al. [41] found that drivers' emotional expressions are influenced by driving tasks, affecting emotion recognition.


Tang et al. [37] discovered that the distribution of multimodal physiological responses varies across different emotional scenarios. Specifically, the degree of expression in facial emotion, vocal emotion, and vehicle driving state emotion of drivers may differ under conditions such as congested versus smooth traffic flow, or when driving alone versus with passengers.

In summary, previous research work has made significant progress, but there are still some challenges and limitations. First, the recognition of driver emotions is a complex task requiring the consideration of multimodal data. However, issues such as individual differences among drivers and variability in emotional expression across different driving scenarios can impact the accuracy of emotion recognition. Second, while facial and vocal emotion recognition are the mainstream methods for emotion recognition, there is still room for improvement in their accuracy. Third, within our research scope, no current recognition methods consider using traffic flow information, which represents vehicle interaction, for anger emotion recognition, thus failing to fully capture the drivers' emotional expressions. Lastly, most scholars base their model training and validation on simulated scenarios, which still differ from real-world driving.

Based on this, this paper proposes a context-aware multimodal driver emotion recognition method (CA-MDER) aimed at overcoming these issues and effectively recognizing drivers' anger emotions. The contributions of this paper are as follows:
(1) A facial emotion recognition method based on AM-DSCNN is proposed, which introduces an attention mechanism module and a depthwise separable convolution module to enhance the model's ability to capture important facial emotional features and improve computational efficiency;
(2) A hybrid kernel function combining Dynamic Time Warping (DTW) with RBF is proposed to improve the SVM model, making the improved model more effective in handling the temporal elasticity in voice data and improving recognition accuracy;
(3) In the vehicle driving state emotion recognition module, features capturing vehicle interactions are introduced, and a driver driving style recognition module is proposed to mitigate the impact of driving heterogeneity on anger emotion recognition;
(4) A context-aware, multi-modal decision-level fusion method based on CA-RL is proposed, which uses context awareness to achieve optimal weight distribution in adaptive scenarios;
(5) This paper uses data collected from real driving scenarios and public datasets for model training and validation, which, compared to simulated data, results in a more realistic and effective model.

III. METHOD
For the recognition of driver anger emotions, a recognition framework that combines context-awareness and multimodal fusion is proposed, as shown in Figure 1. It is divided into the following parts:
(1) A facial anger emotion recognition module based on AM-DSCNN;
(2) A vocal anger emotion recognition module based on an improved SVM;
(3) A vehicle driving state emotion recognition module that considers vehicle driving state information and driving style;
(4) A multimodal decision-making module with adaptive weight distribution based on context awareness.

FIGURE 1. CA-MDER modeling framework.

A. FACIAL ANGER EMOTION RECOGNITION BASED ON AM-DSCNN
In response to the numerous shortcomings of traditional CNNs, this paper proposes a facial emotion recognition method based on Attention Mechanism-Depthwise Separable Convolutional Neural Networks (AM-DSCNN), specifically for recognizing the angry emotional states of drivers. The proposed AM-DSCNN is composed of three parts: the backbone network, the depthwise separable convolution module, and the attention mechanism module; the specific model structure is shown in Figure 2.

FIGURE 2. AM-DSCNN model structure diagram.


1) BACKBONE NETWORK
The backbone network is the foundation of the AM-DSCNN model, consisting of convolutional layers, pooling layers, fully connected layers, and dropout layers. Its purpose is to capture image features from low-level to high-level and ultimately provide emotion classification.

a: CONVOLUTIONAL LAYER
The convolutional layer is the core component of a CNN, used to extract features from the input image. Convolution operations slide convolution kernels over the input image to perform dot product operations, resulting in feature maps. For example, in the first convolutional layer shown in Figure 2, the input image size in the model is 416 × 416 × 1. Then, the convolutional layer Conv1 with 32 convolution kernels of size 3 × 3 is used for convolution operations, and the output size after convolution is 416 × 416 × 32.

b: POOLING LAYER
The pooling layer is used to reduce the size of the feature maps, decrease computational complexity, and retain important features. For example, in the first pooling layer shown in Figure 2, the pooling layer uses a 2 × 2 window to perform max pooling on the feature map of size 416 × 416 × 32. The output size after pooling is 208 × 208 × 32.

c: FULLY CONNECTED LAYER
The fully connected layer connects all nodes from the previous layer to all nodes in the current layer, functioning as a standard neural network layer. As shown in Figure 2, this model includes fully connected layers in the attention mechanism module and at the final emotion classification. The former is used to generate attention weights for the image after global average pooling, thereby rescaling the feature maps; the latter is used to integrate features extracted from all previous layers for the final classification of angry emotions.

d: DROPOUT LAYER
The dropout layer is a regularization technique used to prevent neural networks from overfitting. By randomly ''dropping out'' a portion of neurons during training, the network is forced to train on different sub-networks, thus enhancing the model's generalization ability.

2) DEPTHWISE SEPARABLE CONVOLUTION MODULE
When applying facial emotion recognition in a driver scenario, it is crucial to consider not only accuracy but also efficiency. For this reason, this paper introduces depthwise separable convolution into the model, a highly efficient convolution operation comprised of two steps: depthwise convolution and pointwise convolution.

a: DEPTHWISE CONVOLUTION
This involves applying a single filter to each input channel independently for convolution. Taking the first layer of depthwise separable convolution DSConv1 in Figure 2 as an example, a 3 × 3 × 1 convolution kernel is used to perform independent convolution on each channel of the input feature map with a size of 26 × 26 × 256. After the individual convolution operations, the resulting feature map has a size of 26 × 26 × 256. This process can be described as follows:

I′_c(x, y) = I_c(x, y) ∗ K_c    (1)

Here, x and y are the spatial positions in the driver's facial feature map, I′_c is the output of the cth channel, I_c is the input of the cth channel, and K_c is the convolution kernel of the cth channel.

b: POINTWISE CONVOLUTION
This uses a 1 × 1 convolution kernel to perform convolution across channels, combining the outputs of the depthwise convolution. Similarly, for the first layer of depthwise separable convolution DSConv1 in Figure 2, after channel-wise convolution, a 1 × 1 × 256 convolution kernel is used to perform convolution, converting 256 channels into 512 channels. The final output feature map has a size of 26 × 26 × 512. This process can be described as follows:

I″_c(x, y) = Σ_c I′_c(x, y) ∗ K′_c    (2)

Here, K′_c is the cth channel of the 1 × 1 convolution kernel.
Compared to traditional convolution operations, depthwise convolution significantly reduces the number of multiplicative operations since each kernel only convolves on its respective single channel. Although pointwise convolution involves all channels, the computational load is still much lower than traditional convolution due to the use of 1 × 1 kernels. Furthermore, depthwise separable convolution requires far fewer parameters than classical convolution. For the same size of feature maps and convolution kernels, depthwise separable convolution can significantly reduce the model's parameter count, thereby lowering the risk of overfitting and enhancing the model's applicability in resource-constrained environments. Simultaneously, by decomposing standard convolution into these two steps, depthwise separable convolution still effectively captures the spatial patterns and texture information within each channel of the facial feature map and the feature combinations across channels, ensuring efficient feature extraction.

3) ATTENTION MECHANISM MODULE
Traditional CNNs struggle to identify which features are important for predictions, leading to the model being distracted by non-feature information. To address this issue, this paper introduces an attention mechanism module to help the model focus more on features crucial for emotion recognition. The attention mechanism module consists of a lightweight attention network, capable of adaptively adjusting the channel weights of the feature maps, directing the model's focus to features more beneficial for emotion recognition.
Specifically, this module is implemented through the following steps:


(1) Global Average Pooling: This paper applies global average pooling to the feature maps of each channel, reducing them to a scalar, essentially compressing to the channel dimension to obtain global context information, enabling each channel feature to have a global perspective. For instance, in the first layer of the attention mechanism shown in Figure 2, global average pooling is performed on the input feature map with a size of 104 × 104 × 64 as per the following formula. After pooling, the output feature map size is 1 × 1 × 64.

C_global = (1 / (H × W)) Σ_{i=1}^{H} Σ_{j=1}^{W} I(i, j)    (3)

Here, H and W are the height and width of the facial feature map.
(2) Multi-Layer Perceptron (MLP) Learning: The paper uses a multi-layer perceptron with one hidden layer to learn the non-linear dependencies between channels. This can be represented as:

C_weight = δ(W_2 δ(W_1 C_global + b_1) + b_2)    (4)

Here, W_1 and W_2 denote the weights in the MLP, b_1 and b_2 denote the biases in the MLP, and δ represents the ReLU activation function.
(3) Scaling Transformation and Feature Map Re-Calibration: The sigmoid activation function is applied to obtain the channel attention weights, and the original feature map is re-calibrated accordingly:

Z = σ(C_weight) ⊙ I    (5)

Here, σ is the sigmoid activation function, and Z is the feature map after attention re-calibration.
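These three steps amount to a squeeze-and-excitation style channel attention block. The snippet below is a minimal PyTorch sketch of Eqs. (3)-(5) for the 64-channel stage mentioned above; the hidden-layer size of the MLP (the reduction ratio) is an assumption, since the paper does not state it, and the code is not the authors' implementation.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Channel attention: global average pooling (Eq. 3), a one-hidden-layer
    MLP (Eq. 4), and sigmoid re-calibration of the feature map (Eq. 5)."""
    def __init__(self, channels: int, reduction: int = 8):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)      # C_global, shape (B, C, 1, 1)
        self.mlp = nn.Sequential(                 # delta(W2 delta(W1 C_global + b1) + b2)
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        c_global = self.pool(x).view(b, c)             # Eq. (3)
        c_weight = torch.sigmoid(self.mlp(c_global))   # sigma(C_weight)
        return x * c_weight.view(b, c, 1, 1)           # Eq. (5): Z = sigma(C_weight) ⊙ I

# The first attention stage in the text operates on a 104x104x64 feature map.
attn = ChannelAttention(64)
z = attn(torch.randn(1, 64, 104, 104))
print(z.shape)  # torch.Size([1, 64, 104, 104])
```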


B. SPEECH ANGER EMOTION RECOGNITION BASED ON IMPROVED SVM
1) EXTRACTION OF EMOTIONAL FEATURES FROM SPEECH INFORMATION
Effective extraction of speech features is crucial for identifying drivers' angry emotional states based on speech information. To comprehensively extract emotional characteristics, this paper selects six key features: fundamental frequency, short-term energy, speech rate, formants, Mel-Frequency Cepstral Coefficients (MFCC), and sample entropy. The features extracted from preprocessed speech signals are frame-level and only represent local emotional characteristics of the speech signal. To extract global features, it is necessary to calculate the global statistical properties of multi-frame speech signals to obtain utterance-level features. The statistical measures derived from sample data can reflect the quantitative characteristics of the sample population. For different speech features, their respective parameter statistics are calculated for subsequent emotion recognition.

a: FUNDAMENTAL FREQUENCY FEATURES
The fundamental frequency, related to the vocal cord status, varies under different emotions, thus making it a good representation for speech emotion. The formula for calculating the fundamental frequency using the autocorrelation method is:

F_0 = argmax(R(k))    (6)

R(k) = Σ_{n=1}^{N−k−1} x(n) · x(n + k)    (7)

Here, F_0 is the fundamental frequency, R(k) is the autocorrelation function, x(n) is the signal value at time n, N is the frame length, and k is the delay amount.

b: SHORT-TERM ENERGY FEATURES
Literature [42] confirms the effectiveness of short-term energy E in speech emotion recognition, which is calculated by summing the squares of the sample points within the frame:

E = Σ_{n=0}^{N−1} x(n)²    (8)
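For illustration, a minimal NumPy sketch of these two frame-level features is given below: the autocorrelation-based fundamental frequency of Eqs. (6)-(7) and the short-term energy of Eq. (8). The 16 kHz sampling rate follows the MFCC description later in this section, while the F0 search band is an illustrative assumption.

```python
import numpy as np

def fundamental_frequency(frame: np.ndarray, fs: int = 16000,
                          f_min: float = 60.0, f_max: float = 400.0) -> float:
    """Estimate F0 of one frame via the autocorrelation method (Eqs. 6-7)."""
    n = len(frame)
    # R(k) = sum_n x(n) x(n+k), with k restricted to a plausible pitch range.
    k_min, k_max = int(fs / f_max), int(fs / f_min)
    r = np.array([np.dot(frame[:n - k], frame[k:]) for k in range(k_min, k_max)])
    k_best = np.argmax(r) + k_min
    return fs / k_best

def short_term_energy(frame: np.ndarray) -> float:
    """Short-term energy of one frame, E = sum_n x(n)^2 (Eq. 8)."""
    return float(np.sum(frame ** 2))

# Example on a synthetic 200 Hz tone framed at 16 kHz.
t = np.arange(400) / 16000.0
frame = np.sin(2 * np.pi * 200 * t)
print(round(fundamental_frequency(frame)), round(short_term_energy(frame), 2))
```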
c: SPEECH RATE
Speech rate refers to the number of speech units spoken per unit of time and is one of the important features of speech signals. Speech rate not only reflects the speaker's linguistic style and habits, but is also closely related to his or her emotional state. The literature [43], [44] suggests that high-arousal emotions such as anger and excitement are usually accompanied by a faster speech rate. Specifically, anger tends to cause speakers to speak faster because of the increased physiological arousal level in anger, which leads to rapid breathing and a faster speech tempo.

d: FORMANT FEATURES
The quality of speech, affected by vocal tract deformation under different emotional states, exhibits distinct feature changes. Therefore, the peak values and positions of formants in speech signals vary under different emotional states. This paper uses Linear Predictive Coding (LPC) to extract the central frequencies of the first three formants, followed by peak detection estimation on the LPC spectrum.

e: MFCC
The Mel-Frequency Cepstral Coefficient (MFCC) combines the auditory perceptual properties of the human ear with the generation mechanism of speech signals, and can be converted between frequencies (in Hz) and Mel frequencies. The conversion formula is as follows, where f represents the frequency of the speech signal at 16,000 Hz.

Mel(f) = 1125 ln(1 + f / 700)    (9)

It is shown that the data of the speech signal is mainly concentrated in the low-frequency region after the transform, so it is sufficient to extract the first 12 MFCC coefficients as features. After the frame-splitting operation, the global sentiment features need to be extracted. The static features of MFCC are logarithmic energy, while the dynamic features can be obtained by calculating the derivatives of the static features.


Therefore, in this paper, instead of directly extracting the mean value of the logarithmic energy, the first-order derivative and second-order derivative of the mean value of the logarithmic energy of the MFCC are calculated. The formula for the first-order derivatives is as follows, and the second-order derivatives are calculated similarly to the first-order derivatives, where s is the range used to compute the difference and is taken as s = 2.

ΔC_n(m) = (δ / δm) C_n(m) ≈ (1 / T_S) Σ_{t=−s}^{s} t · C_{n+t}(m)    (10)

T_S = Σ_{t=−s}^{s} t²    (11)

By combining the dynamic and static features of MFCC, the performance of the speech emotion recognition model can be effectively improved.

f: SAMPLE ENTROPY
Sample entropy (SampEn) is a statistical measure for quantifying the complexity of time series, proposed by Richman and others in 2000. It is particularly effective in analyzing nonlinear dynamic systems, such as speech signals, because it quantifies the complexity of time series from a probabilistic perspective. Compared to approximate entropy, sample entropy does not include self-comparison in its calculations, reducing data bias. Even with limited data, sample entropy can effectively estimate probabilities, hence providing higher detection accuracy [45]. Sample entropy is defined as the natural logarithm of the conditional probability that data vectors of dimension m remain similar when the dimension increases to m + 1. Specifically, sample entropy SampEn(m, r, N) is defined as:

SampEn(m, r, N) = −ln(B^{m+1}(r) / B^{m}(r))    (12)

B^{m}(r) = (1 / (N − m + 1)) Σ_{i=1}^{N−m+1} B_i^{m}(r)    (13)

Here, m is the embedding dimension, r is the similarity threshold for judging whether two sequences are similar, and B^{m+1}(r) and B^{m}(r) are the normalized counts of similar sequence pairs at different dimensions. B_i^{m}(r) represents the similarity of a specific speech data vector with other vectors in the m-dimensional space. The calculation process can be described as: for a given embedding dimension m and similarity threshold r, calculate the similarity of a specific speech data vector X(i) with all other vectors X(j) in the m-dimensional space and count the number of similar vector pairs.
In the application of speech emotion recognition, sample entropy can reveal the complexity and dynamic changes of speech signals under different emotional states. For example, angry or excited speech may exhibit higher sample entropy, indicating more complex and variable signals, while calm or sad speech may have lower sample entropy, reflecting more stable and consistent characteristics. Therefore, incorporating sample entropy as a feature into the emotion recognition model can help us more accurately capture and differentiate speech features under different emotional states, thereby enhancing the accuracy and efficiency of emotion recognition.
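To make the definition concrete, here is a compact NumPy sketch of SampEn(m, r, N) following Eqs. (12)-(13), counting similar template pairs with the Chebyshev distance and excluding self-matches as described above. The defaults m = 2 and r = 0.2 times the standard deviation are common choices in the sample entropy literature, not values taken from this paper.

```python
import numpy as np

def sample_entropy(x, m: int = 2, r=None) -> float:
    """SampEn(m, r, N) = -ln(B^{m+1}(r) / B^m(r)) (Eqs. 12-13)."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    if r is None:
        r = 0.2 * np.std(x)

    def count_similar(dim: int) -> int:
        # Build all length-`dim` template vectors X(i).
        templates = np.array([x[i:i + dim] for i in range(n - dim + 1)])
        count = 0
        for i in range(len(templates)):
            # Chebyshev distance to every other template, self-match excluded.
            dist = np.max(np.abs(templates - templates[i]), axis=1)
            count += int(np.sum(dist <= r)) - 1
        return count

    b_m, b_m1 = count_similar(m), count_similar(m + 1)
    return -np.log(b_m1 / b_m) if b_m > 0 and b_m1 > 0 else np.inf

# A noisy (more complex) series yields a higher SampEn than a regular one.
rng = np.random.default_rng(0)
regular = np.sin(np.linspace(0, 20 * np.pi, 500))
noisy = rng.normal(size=500)
print(sample_entropy(regular), sample_entropy(noisy))
```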
2) SPEECH EMOTION RECOGNITION MODEL BASED ON IMPROVED SVM
SVM has been proven in the literature [46] and [47] to have better recognition performance in speech emotion recognition (especially for angry emotions), which, together with its significant advantages in computational efficiency and memory usage, makes it suitable for use as a complementary recognizer alongside facial emotion recognition in multimodal emotion recognition tasks. In this paper, based on the traditional SVM model and considering the time-series characteristics of speech data, we improve the SVM kernel function and propose a hybrid kernel function combining Dynamic Time Warping (DTW) and the Radial Basis Function (RBF). The DTW algorithm can effectively deal with temporal elasticity in time-series data because it is able to bend the time axis in order to find the best correspondence between two time series, which means that even if the speech samples differ in time, it can find the similarity between them, thus improving the accuracy of the model for sentiment classification.
The improved hybrid kernel function K(x, y) is defined as:

K(x, y) = α · K_DTW(x_{i,t}, x_{j,t′}) + (1 − α) · K_RBF(x_i, x_j)
        = α · exp(−γ_DTW · DTW(x_{i,t}, x_{j,t′})) + (1 − α) · exp(−γ_RBF · ‖x_i − x_j‖²)    (14)

where K_DTW is the DTW-based kernel, used for time-series features, with i and j representing the feature and label data, respectively, and x_{i,t}, x_{j,t′} are the values of the time series at a certain point in time (such as fundamental frequency, short-term energy, speech rate, and formant features); K_RBF is the Radial Basis Function (RBF) kernel, used for statistical features, where x_i and x_j are two data points in the feature space (such as MFCC and sample entropy features); α is a parameter to adjust the weight of the two, and γ_DTW and γ_RBF are the respective scaling parameters of the kernels.
The computation formula for DTW can be expressed as:

DTW(A, B) = min √( Σ_{t=1}^{T} (x_{i,t} − x_{j,t′})² )    (15)
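The sketch below renders the hybrid kernel of Eq. (14) in NumPy. The DTW distance is computed with the standard dynamic-programming recursion rather than the compact form of Eq. (15), and the parameter values (α, γ_DTW, γ_RBF) are placeholders; a Gram matrix built from such a kernel could be passed to an SVM that accepts precomputed kernels (for example, scikit-learn's SVC with kernel='precomputed'), which is an assumption about tooling rather than the authors' setup.

```python
import numpy as np

def dtw_distance(a: np.ndarray, b: np.ndarray) -> float:
    """Classic DTW between two 1-D sequences (squared-difference local cost)."""
    n, m = len(a), len(b)
    d = np.full((n + 1, m + 1), np.inf)
    d[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = (a[i - 1] - b[j - 1]) ** 2
            d[i, j] = cost + min(d[i - 1, j], d[i, j - 1], d[i - 1, j - 1])
    return float(np.sqrt(d[n, m]))

def hybrid_kernel(seq_i, seq_j, stat_i, stat_j,
                  alpha=0.5, gamma_dtw=0.1, gamma_rbf=0.1) -> float:
    """K = alpha*exp(-gamma_dtw*DTW) + (1-alpha)*exp(-gamma_rbf*||.||^2), Eq. (14)."""
    k_dtw = np.exp(-gamma_dtw * dtw_distance(np.asarray(seq_i), np.asarray(seq_j)))
    diff = np.asarray(stat_i) - np.asarray(stat_j)
    k_rbf = np.exp(-gamma_rbf * float(diff @ diff))
    return alpha * k_dtw + (1.0 - alpha) * k_rbf

# seq_*: frame-level series (e.g., F0 or energy); stat_*: utterance-level
# statistics (e.g., MFCC means, sample entropy).
print(hybrid_kernel([1.0, 2.0, 3.0, 2.5], [1.1, 2.1, 2.9],
                    [0.3, 0.7], [0.25, 0.72]))
```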
C. DRIVING STATE EMOTION RECOGNITION BASED ON RF
1) EXTRACTION OF EMOTIONAL FEATURES FROM VEHICLE DRIVING STATE INFORMATION
In the recognition of anger emotion based on vehicle driving state, this paper not only selects vehicle acceleration and yaw rate, which represent the vehicle's motion parameters, but also for the first time proposes the use of three behavioral features representing vehicle-to-vehicle interaction: following distance, lane-changing frequency, and overtaking frequency.


These features are used to comprehensively extract the driver's driving state emotions for more accurate recognition of the driver's anger emotion.

a: VEHICLE ACCELERATION
The degree of speed change per unit time is known as acceleration, and its magnitude or rate of change directly affects the urgency of speed variation. Acceleration is a key indicator of the extent of vehicle speed change and reflects the driver's longitudinal control ability over the vehicle. Literature [36] indicates that as the intensity of the driver's anger increases, the fluctuation amplitude of vehicle acceleration also increases, leading to reduced smoothness in vehicle motion, i.e., the driver's longitudinal control ability over the vehicle decreases. Therefore, this paper extracts the mean value of vehicle acceleration a_av as one of the indicators for emotion recognition.

b: YAW RATE
When the driver is in a normal driving state, the range of changes in the vehicle's yaw rate is small, and the frequency of change is high. Under the state of anger, the range of acceleration change is larger, and the adjustment frequency is lower. Literature [36] shows that as the driver's anger intensity increases, the fluctuation amplitude of the vehicle's yaw rate also increases, leading to reduced lateral stability of the vehicle, i.e., the driver's lateral control ability over the vehicle decreases. Hence, this paper extracts the mean value of the vehicle's yaw rate ψ̇_av as one of the indicators for emotion recognition.

c: FOLLOWING DISTANCE
Following distance refers to the distance between vehicles traveling on the road. Generally, a reduced following distance allows vehicles to pass through intersections and other delay-prone areas more quickly, but as the following distance decreases, safety risks increase. Literature [48] indicates that the following distance tends to decrease under the driver's anger emotion. Therefore, this paper extracts the mean following distance g_av as one of the indicators for emotion recognition.

d: LANE-CHANGING FREQUENCY
Lane changing refers to a vehicle moving from its current lane to an adjacent lane on the road, often to avoid obstructions, pursue other vehicles, or turn at intersections. Lane changing can help the vehicle bypass obstacles or slow-moving vehicles in its current lane, maintaining a smooth speed. If one lane is faster than another, changing lanes can help the vehicle increase its speed and reach its destination more quickly. However, frequent lane changes can disrupt traffic flow and increase the risk of traffic accidents. Literature [29] indicates that the frequency of lane changes increases under the driver's anger emotion. Therefore, this paper extracts the frequency of lane changes clfr within a unit time window as one of the indicators for emotion recognition.

e: OVERTAKING FREQUENCY
Overtaking refers to a vehicle passing slower-moving vehicles ahead on the road, usually by moving into an adjacent lane. When a vehicle is moving fast, overtaking can help it quickly bypass the slower vehicles ahead, reducing traffic congestion. Reasonable overtaking can make traffic flow smoother and allow vehicles to travel at appropriate speeds, avoiding the formation of a slow-moving convoy. However, unsafe overtaking can lead to traffic accidents, especially in conditions of poor visibility, complex road conditions, or inappropriate timing for overtaking. On busy roads, frequent overtaking can disrupt traffic flow. Literature [49] shows that the overtaking frequency increases under the driver's anger emotion. Therefore, this paper extracts the frequency of overtaking otkfr within a unit time window as one of the indicators for emotion recognition.
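As an illustration of how these five indicators can be computed, the sketch below derives them from a simple per-frame vehicle log over one time window. The column layout (timestamps, acceleration, yaw rate, gap to the lead vehicle, lane index, overtake flag) and the per-second normalization of the two frequencies are hypothetical choices, not the authors' data schema.

```python
import numpy as np

def window_features(t, accel, yaw_rate, gap, lane, overtake_flag):
    """Mean acceleration a_av, mean yaw rate, mean following gap g_av,
    lane-change frequency clfr, and overtaking frequency otkfr for one window."""
    duration = t[-1] - t[0]
    lane_changes = int(np.sum(np.diff(lane) != 0))
    overtakes = int(np.sum(overtake_flag))
    return {
        "a_av": float(np.mean(accel)),
        "yaw_av": float(np.mean(yaw_rate)),
        "g_av": float(np.mean(gap)),
        "clfr": lane_changes / duration,   # lane changes per second of window
        "otkfr": overtakes / duration,     # overtakes per second of window
    }

# Toy 10 s window sampled at 1 Hz.
t = np.arange(10.0)
features = window_features(
    t,
    accel=np.array([0.2, 0.5, -0.3, 1.2, 0.1, -0.4, 0.6, 0.0, 0.3, -0.1]),
    yaw_rate=np.array([0.01, 0.02, -0.01, 0.05, 0.0, 0.03, -0.02, 0.01, 0.0, 0.02]),
    gap=np.array([25, 24, 22, 20, 18, 17, 18, 19, 20, 21]),
    lane=np.array([1, 1, 2, 2, 2, 1, 1, 1, 1, 1]),
    overtake_flag=np.array([0, 0, 1, 0, 0, 0, 0, 0, 0, 0]),
)
print(features)
```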
2) STATISTICAL ANALYSIS OF DRIVING STYLE BASED ON HISTORICAL DRIVING DATA
Due to individual differences in driving style (habits) or driving experience, driving behaviors may vary among different drivers. Therefore, in addition to the emotional (anger) state affecting the driving behavior characteristics of the driver, the individual differences between drivers can also have a certain impact on these characteristics, thereby affecting the accuracy of the anger recognition model for driving. However, the anger characteristics of the same subject have strong stability over different periods. Hence, this paper adopts a driving style analysis method based on historical driving data to reduce the impact of individual differences among drivers on the recognition of their anger state, thereby enhancing the robustness and accuracy of the recognition model.
For the feature subset obtained earlier, M = {a_av, ψ̇_av, g_av, clfr, otkfr}, calculate the mean value of each feature under the normal state for each subject. This mean value is taken as the reference value. Let the mean value of the feature parameters of the ith subject under normal driving state be R_i, that is, the reference value. The equation is as follows:

R_i = (1 / N) Σ_{i=1}^{N} M_i    (16)

Finally, the feature parameter values of all subjects under normal state are inputted into the training sample library as the statistical values of each subject's driving style.

3) DRIVING STATE EMOTION RECOGNITION MODEL BASED ON RANDOM FOREST
The random forest model builds multiple decision trees and derives the final classification result by majority voting among these trees.


This makes the random forest model generally robust against outliers and noise and reduces the risk of overfitting. Moreover, random forests adapt better to unbalanced datasets. Based on these advantages, this paper considers the random forest model to be more suitable for the task of vehicle driving state emotion recognition. The implementation steps of the model are as follows:
(1) Feature Centralization
To better capture the emotional differences of drivers and reduce the impact of individual differences on anger state recognition, feature data must be centralized before applying the random forest model. This is done by obtaining the average feature value for each individual under normal conditions from the sample library, and then subtracting it from the actual feature value to obtain the centralized feature value C_i. The process can be represented as:

C_i = M_i − R_i    (17)

where M_i is the actual value of the feature of the ith driver, and R_i is its average feature value.
(2) Bootstrap Sampling
Randomly select samples from the original dataset to create a new dataset. Let the size of the original dataset be N. Then N samples are selected from the original dataset with replacement to form a new dataset D_i:

D_i = {(x*_1, y*_1), (x*_2, y*_2), · · · , (x*_N, y*_N)}    (18)

where D_i represents the training data for the ith tree, and (x*_i, y*_i) is the sample randomly selected from the original dataset.
(3) Decision Tree Construction with Feature Subsets
In terms of using driving state information for emotion recognition, the literature [39] has demonstrated that random forest (RF) models have good performance. Each decision tree is trained on its corresponding bootstrap sampling dataset D_i. However, in the splitting process of each node, not all features are considered, but a random feature subset is selected. Assuming the original number of features is M, m features are randomly selected at each node split.
(4) Decision Tree Ensemble
Using the above method, T decision trees are constructed in the random forest, each independently trained based on different bootstrap sampling datasets.
(5) Prediction
The prediction of the random forest is based on the predictions of all its decision trees. Specifically, for classification tasks, each tree provides a classification prediction, and the final prediction of the random forest is the mode of these classifications:

ŷ = mode(ŷ_1, ŷ_2, · · · , ŷ_T)    (19)

Here, ŷ_i represents the prediction of the ith tree, and mode represents taking the mode of these classification results, i.e., the most frequently occurring classification prediction.
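As a sketch of steps (1)-(5), the snippet below centralizes each driver's window features against that driver's normal-state reference values (Eq. (17)) and then trains a scikit-learn random forest, which internally performs the bootstrap sampling, per-node feature subsetting, and majority voting described above. The array shapes, random data, and hyperparameters are illustrative assumptions, not the authors' configuration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def centralize(features: np.ndarray, driver_ids: np.ndarray, reference: dict) -> np.ndarray:
    """C_i = M_i - R_i (Eq. 17): subtract each driver's normal-state mean."""
    return np.vstack([f - reference[d] for f, d in zip(features, driver_ids)])

# Toy data: 200 windows x 5 features (a_av, yaw_av, g_av, clfr, otkfr).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
drivers = rng.integers(0, 8, size=200)   # 8 drivers
y = rng.integers(0, 2, size=200)         # 0 = neutral, 1 = anger

# Reference value R_i: per-driver mean of that driver's neutral-state windows.
reference = {d: X[(drivers == d) & (y == 0)].mean(axis=0) for d in range(8)}
X_centered = centralize(X, drivers, reference)

# T trees with bootstrap sampling and a random sqrt(M)-sized feature subset per split.
rf = RandomForestClassifier(n_estimators=100, max_features="sqrt",
                            bootstrap=True, random_state=0)
rf.fit(X_centered, y)
print(rf.predict(X_centered[:5]))
```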
D. ADAPTIVE MULTIMODAL ANGER EMOTION RECOGNITION BASED ON CA-RL
1) ADAPTIVE WEIGHT ALLOCATION BASED ON CA-RL
In multimodal emotion recognition, how to assign appropriate weights to each modality is the key issue. In order to make the system more adaptable to different driving environments and individual differences, we propose a method based on reinforcement learning to perceive the scenarios, find the optimal strategy in different scenarios by interacting with the environment, and dynamically assign weights to each modality, so as to improve the model's adaptability and generalization ability.

a: MODEL DEFINITION
State space s_t: the emotion recognition scene, characterized by the degree of traffic congestion C_tr and the state of the number of people inside the vehicle H_pe. This includes the driver's emotions, vehicle state, traffic conditions, etc. Their representation is as follows:

C_tr = N / L    (20)

H_pe = 1, if ΔE_t > e; 0, otherwise    (21)

Here, N is the number of vehicles at a given moment within the observed road section, L is the length of the observed section, ΔE_t is the short-term energy difference in voice between adjacent windows, and e is a threshold. The number of people inside the vehicle is detected by voice energy fluctuation. When a single person speaks, the short-term energy usually shows a more consistent pattern, whereas in multi-person conversations, energy fluctuations may be more pronounced and frequent. This is because different speakers usually have different voice intensities and rhythms, causing more peaks and valleys in energy levels during multi-person conversations.
Action space a_t: the weight parameters of the three modalities in the decision-level fusion of the anger recognition models, represented as [w_1, w_2, w_3].
Reward function r_t: given based on the consistency of the emotion recognition result with the actual emotion, defined as:

r_t = 1, if the prediction matches the actual emotion; 0, otherwise    (22)

b: REINFORCEMENT LEARNING ALGORITHM
This paper chooses the Deep Q-Network (DQN) to implement dynamic weight adjustment. DQN attempts to estimate an action-value function Q(s, a) representing the expected return of choosing action a in state s. The core update formula of DQN is:

Q(s_t, a_t) = r_t + γ max_{a′} Q(s_{t+1}, a′)    (23)


Here, γ is the discount factor for future rewards, and max_{a′} Q(s_{t+1}, a′) is the maximum estimate of the expected return over all possible actions a′ in the future state s_{t+1}.
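A heavily simplified sketch of this weight-allocation loop is given below: the context state (C_tr, H_pe) is fed to a small Q-network, the action is a weight triple chosen from a coarse grid, and one update step moves Q(s_t, a_t) toward the target of Eq. (23). The discretization of the action space, the network size, and the omission of a replay buffer and target network are all simplifying assumptions; the paper does not specify these details.

```python
import itertools
import random
import torch
import torch.nn as nn

# Discretize the action space: candidate weight triples [w1, w2, w3] summing to 1.
# (A coarse grid is an assumption; the paper does not state the discretization.)
ACTIONS = [(i / 10, j / 10, 1 - i / 10 - j / 10)
           for i, j in itertools.product(range(11), range(11)) if i + j <= 10]

class QNet(nn.Module):
    """Maps the context state s_t = (C_tr, H_pe) to one Q-value per weight triple."""
    def __init__(self, n_actions: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(2, 32), nn.ReLU(),
                                 nn.Linear(32, n_actions))
    def forward(self, s):
        return self.net(s)

q_net = QNet(len(ACTIONS))
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)
gamma, epsilon = 0.9, 0.1

def select_weights(c_tr: float, h_pe: int) -> int:
    """Epsilon-greedy choice of the fusion-weight action for the current context."""
    if random.random() < epsilon:
        return random.randrange(len(ACTIONS))
    with torch.no_grad():
        q = q_net(torch.tensor([[c_tr, float(h_pe)]]))
    return int(q.argmax())

def update(s, a, r, s_next):
    """One update step toward r_t + gamma * max_a' Q(s_{t+1}, a') (Eq. 23)."""
    q = q_net(torch.tensor([s]))[0, a]
    with torch.no_grad():
        target = r + gamma * q_net(torch.tensor([s_next])).max()
    loss = (q - target) ** 2
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# One illustrative interaction: congested road (C_tr = 0.08 veh/m), single occupant.
state = [0.08, 0.0]
action = select_weights(*state)
reward = 1.0  # the fused prediction matched the labelled emotion (Eq. 22)
update(state, action, reward, state)
print("chosen weights:", ACTIONS[action])
```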

2) DECISION-LEVEL FUSION BASED ON ADAPTIVE WEIGHTS
The emotion recognition results obtained by each modality for an emotion recognition sample x are {p_le(x), l = 1, 2, 3; e = 1, 2}, where l is the modality category and e is the emotion category. The anger emotion recognition results of each modality are fused at the decision level, and the final probability after fusion is:

p′_e(x) = R(p_le(x)) / Σ_{e=1}^{2} R(p_le(x))   (24)

Here, p′_e(x) is the final probability of the sample being recognized as emotion category e, and R is the fusion criterion, i.e., linear weighted fusion using the adaptive weights [w1, w2, w3] obtained previously. Therefore, the final expression for determining the emotion classification is:

f(x) = arg max_e p′_e(x)   (25)
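A minimal sketch of Eqs. (24)-(25) is shown below, assuming each modality outputs a two-class probability vector; the probability values and weights are made up for illustration and do not come from the paper.

```python
# Minimal sketch (not the authors' code) of adaptive-weight decision-level fusion:
# per-modality class probabilities are combined with context-dependent weights
# [w1, w2, w3], normalized over the emotion classes, then argmax'ed.
import numpy as np

def fuse(probs_per_modality: np.ndarray, weights: np.ndarray) -> int:
    """probs_per_modality: shape (3, 2) = (modality l, emotion e); weights: shape (3,)."""
    fused = weights @ probs_per_modality          # R(p_le(x)): linear weighted fusion
    fused = fused / fused.sum()                   # Eq. (24) normalization over e = 1, 2
    return int(fused.argmax())                    # Eq. (25)

# Hypothetical outputs: rows = face, voice, driving state; cols = [neutral, anger]
p = np.array([[0.30, 0.70],
              [0.55, 0.45],
              [0.40, 0.60]])
w = np.array([0.5, 0.2, 0.3])                     # weights chosen by the CA-RL module
print("predicted class:", fuse(p, w))             # 1 -> anger for this example
```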

IV. DATA AND EXPERIMENT

A. DATA
The method proposed in this paper performs multimodal anger emotion recognition using AM-DSCNN, improved SVM, and RF models for facial, voice, and driving state information, respectively. Additionally, it employs adaptive weight allocation based on CA-RL for multimodal decision-level fusion. For this recognition method, two types of datasets were selected. On the one hand, to enhance the generalizability of the facial and voice recognition models, large public datasets such as AffectNet and SEWA were used. On the other hand, to utilize the specific driving state features and the adaptive weight allocation method based on CA-RL, a multimodal dataset, Multimodal Data, was constructed by the team for training and validation of the model.

1) PUBLIC DATASETS
a: AffectNet
The AffectNet database [50], a large-scale dataset for facial emotion recognition and analysis, was released in 2017. AffectNet contains over one million facial images collected from the internet by querying three major search engines with 1250 emotion-related keywords in six different languages. It covers a variety of races, ages, and cultural backgrounds. Different emotions were filtered from the dataset to obtain data that meet the needs of this work, with the composition of emotion labels detailed in Table 2.

TABLE 2. Overview of the AffectNet dataset.

b: SEWA
The SEWA speech database [51] was released in 2019. It recorded 6 groups of volunteers (each group with 30 people) from six different cultural backgrounds: UK, Germany, Hungary, Greece, Serbia, and China. The gender and age distribution of each group of volunteers was wide-ranging. The resulting database includes 199 experimental records, comprising 1525 minutes of audiovisual data recording the reactions of 398 individuals to advertisements, and over 550 minutes of computer-mediated face-to-face interactions between subjects. Different emotions were filtered from the dataset to obtain data that meet the needs of this work, with the composition of emotion labels detailed in Table 3.

TABLE 3. Overview of the SEWA dataset.

2) MULTIMODAL DATASETS
Due to the current lack of vehicle driving state data under real vehicle operating scenarios, this paper designed a real vehicle experiment to collect multimodal data of drivers under different scenarios. The real vehicle operation location was set in the Jingyue District of Changchun City, Jilin Province, China. The operational scenarios were controlled by two intersecting factors: traffic flow density (with three scenarios: severely congested, generally congested, and uncongested) and the number of people inside the vehicle (single or multiple). The experiment recruited 8 drivers, 4 males and 4 females, with male ages ranging from 22-47 and female ages from 23-49.

a: DATA COLLECTION EQUIPMENT
To collect multimodal emotion information, this study used cameras and recorders to capture the drivers' facial and speech information; drones were used to collect driving state information, especially the micro-traffic flow information representing vehicle interactions. The information collection equipment is detailed in Table 4.

b: EMOTION INDUCTION
This paper uses the emotion induction method from the literature [36], employing emotion induction materials relevant to the cultural background to induce anger and neutral emotions; the related materials are listed in Table 5.


TABLE 4. Data collection equipment.

Drivers watched the related video materials before real vehicle operation, followed by 20 minutes of vehicle operation and data collection, which together constituted one data collection cycle.

TABLE 5. Emotion induction materials.

c: DATA PREVIEW
The collected multimodal data were filtered, including the exclusion of driving state video data in which the main vehicle was not captured and the removal of face video data in which faces were obscured. This resulted in the Multimodal Data dataset, which includes 600 samples. Each sample consists of simultaneously recorded facial data, voice data, and vehicle driving state data, with the composition of emotion labels detailed in Table 6.

TABLE 6. Overview of the Multimodal Data dataset.

B. EXPERIMENTAL SETUP
The experiments in this paper were conducted using Python 3.10.9 and the deep learning framework TensorFlow 2.14.0. The experimental environment was the Windows 10 operating system, and the hardware consisted of an Intel Core i7-13620H processor and a GeForce RTX 4060 graphics card. Parameter setting is particularly important in the model training process; Table 7 lists the parameter settings for our experiments.

TABLE 7. Experimental parameter settings.

C. EVALUATION METRICS
This study uses Accuracy and the F1 score as the evaluation criteria for the model.

a: ACCURACY
Accuracy is the most intuitive performance metric, representing the proportion of correctly predicted instances to the total number of predictions. It reflects the credibility of the model's predictions. Its calculation formula can be expressed as:

ACC = (TP + TN) / (TP + TN + FP + FN)   (26)

b: F1 SCORE
The F1 score is the harmonic mean of Precision and Recall. Its calculation formula is:

F1 = (2 × precision × recall) / (precision + recall)   (27)
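Both metrics can be computed directly from the confusion-matrix counts. The snippet below is a small self-contained sketch with made-up labels, not the evaluation code used in the paper.

```python
# Minimal sketch (not the paper's code): computing Eq. (26) accuracy and
# Eq. (27) F1 from confusion-matrix counts for the binary anger/neutral task.
import numpy as np

def accuracy_f1(y_true: np.ndarray, y_pred: np.ndarray):
    tp = int(np.sum((y_true == 1) & (y_pred == 1)))
    tn = int(np.sum((y_true == 0) & (y_pred == 0)))
    fp = int(np.sum((y_true == 0) & (y_pred == 1)))
    fn = int(np.sum((y_true == 1) & (y_pred == 0)))
    acc = (tp + tn) / (tp + tn + fp + fn)                 # Eq. (26)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)   # Eq. (27)
          if precision + recall else 0.0)
    return acc, f1

y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 1, 0, 0, 1, 1, 0])
print(accuracy_f1(y_true, y_pred))                        # (0.75, 0.75)
```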

D. SINGLE MODALITY AND CONTEXT-AWARE MULTIMODAL FUSION ANGER EMOTION RECOGNITION EXPERIMENT

1) FACIAL ANGER EMOTION RECOGNITION EXPERIMENT
First, the dataset images were subjected to preprocessing operations such as normalization, grayscale conversion, and image enhancement; the effect after preprocessing is shown in Figure 3 (b). Then, the AM-DSCNN facial anger emotion recognition model proposed in this paper was trained according to the parameters set in Table 7. After training, Grad-CAM was used to visualize the attention mechanism in the model structure as a heatmap: gradients were calculated on the last convolutional layer and rendered as a heatmap overlaid on the original image, as shown in Figure 3 (c).
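A typical Grad-CAM computation of the kind described above can be sketched as follows; the stand-in model and the layer name "last_conv" are placeholders for the trained AM-DSCNN, whose exact architecture is not reproduced here.

```python
# Minimal Grad-CAM sketch (assumptions, not the authors' code).
import numpy as np
import tensorflow as tf

def grad_cam(model: tf.keras.Model, image: np.ndarray, conv_layer: str = "last_conv"):
    """image: (H, W, C) preprocessed input; returns a heatmap in [0, 1] at feature-map size."""
    grad_model = tf.keras.Model(model.inputs,
                                [model.get_layer(conv_layer).output, model.output])
    with tf.GradientTape() as tape:
        conv_out, preds = grad_model(image[None, ...])
        class_idx = int(tf.argmax(preds[0]))
        class_score = preds[:, class_idx]                  # score of the predicted class
    grads = tape.gradient(class_score, conv_out)           # d(score)/d(feature maps)
    weights = tf.reduce_mean(grads, axis=(1, 2))           # global-average-pooled gradients
    cam = tf.reduce_sum(weights[:, None, None, :] * conv_out, axis=-1)[0]
    cam = tf.nn.relu(cam)                                  # keep positive evidence only
    return (cam / (tf.reduce_max(cam) + 1e-8)).numpy()     # upsample + overlay for display

# Tiny stand-in model so the sketch runs end-to-end; the real AM-DSCNN differs.
model = tf.keras.Sequential([
    tf.keras.layers.Input((48, 48, 1)),
    tf.keras.layers.Conv2D(8, 3, activation="relu"),
    tf.keras.layers.Conv2D(8, 3, activation="relu", name="last_conv"),
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(2, activation="softmax"),
])
heatmap = grad_cam(model, np.random.rand(48, 48, 1).astype("float32"))
print(heatmap.shape)
```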


FIGURE 3. Experimental correlogram for facial anger recognition.

Validation tests were conducted on the AffectNet validation set and the Multimodal Data validation set, with results for the various indicators shown in Table 8.

TABLE 8. Performance results of the AM-DSCNN model on the dataset.

2) VOICE ANGER EMOTION RECOGNITION EXPERIMENT
The voice signals collected by the equipment are often affected by the environment, the vehicle, and people, leading to many silent segments and strong, unstable noise. Therefore, to obtain a more uniform and smooth signal and to extract more complete voice feature parameters, preprocessing of the voice signals is necessary. This includes effective endpoint detection, pre-emphasis, framing, and windowing.

Endpoint detection refers to detecting the start and end points of a speech signal, which is essentially a two-class problem of distinguishing the speech segments from the silent segments of a sample. Endpoint detection can reduce the influence of environmental noise, reduce the amount of computation and the computation time, and improve the real-time performance of the system. In this paper, we use the double-threshold method based on short-time energy and short-time zero-crossing rate to perform endpoint detection of the speech signal; the idea of the double-threshold method is shown in Figure 4. First, the short-time energy is used for the first level of discrimination: the short-time energy E_k of each frame is compared with two thresholds T1 and T2. Frames with energy higher than T2 are defined as the voiced segment, and the true start and end points of the voice should lie outside this CD segment; therefore, searching leftward from point C and rightward from point D for the points B and E where E_k crosses T1, the BE segment gives the start and end points of the voice segment from the first level of determination. Then the short-time zero-crossing rate is used for the second level of judgement: the short-time zero-crossing rate Z_k of each frame is compared with a threshold T3, and searching leftward from point B and rightward from point E for the points A and F where Z_k falls below T3, the AF segment gives the start and end points of the speech segment from the second level of determination, i.e., the effective endpoints of the speech.

FIGURE 4. Effective voice endpoint detection diagram.
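The double-threshold logic described above can be sketched as follows; the frame length, hop size, and threshold ratios are illustrative assumptions rather than the values used in the paper.

```python
# Minimal sketch (illustrative thresholds, not the paper's implementation) of
# double-threshold endpoint detection with short-time energy and zero-crossing rate.
import numpy as np

def frame_signal(x, frame_len=400, hop=160):
    n = 1 + max(0, (len(x) - frame_len) // hop)
    return np.stack([x[i * hop: i * hop + frame_len] for i in range(n)])

def endpoint_detect(x, t1_ratio=0.1, t2_ratio=0.3, t3=0.1):
    frames = frame_signal(np.asarray(x, dtype=float))
    energy = (frames ** 2).sum(axis=1)                                  # short-time energy E_k
    zcr = (np.abs(np.diff(np.sign(frames), axis=1)) > 0).mean(axis=1)   # zero-crossing rate Z_k
    t1, t2 = t1_ratio * energy.max(), t2_ratio * energy.max()
    high = np.where(energy > t2)[0]
    if high.size == 0:
        return None
    c, d = high[0], high[-1]                                 # CD segment (clearly voiced)
    b = c
    while b > 0 and energy[b - 1] > t1:                      # expand left to the T1 crossing -> B
        b -= 1
    e = d
    while e < len(energy) - 1 and energy[e + 1] > t1:        # expand right to the T1 crossing -> E
        e += 1
    a = b
    while a > 0 and zcr[a - 1] > t3:                         # expand left while ZCR stays above T3 -> A
        a -= 1
    f = e
    while f < len(energy) - 1 and zcr[f + 1] > t3:           # expand right while ZCR stays above T3 -> F
        f += 1
    return a, f                                              # effective endpoints (frame indices)
```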
After effective endpoint detection, the voice data were preprocessed with pre-emphasis, framing, and windowing; the effect before and after preprocessing is shown in Figure 5.

FIGURE 5. Voice data preprocessing effect diagram.
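The three preprocessing steps can be sketched in a few lines; the sampling rate, pre-emphasis coefficient, and frame/hop lengths below are common default values, not necessarily those used in this study.

```python
# Minimal sketch (typical parameter values, not the paper's exact settings) of the
# pre-emphasis, framing, and windowing steps applied after endpoint detection.
import numpy as np

def preprocess(x, fs=16000, alpha=0.97, frame_ms=25, hop_ms=10):
    x = np.asarray(x, dtype=float)
    x = np.append(x[0], x[1:] - alpha * x[:-1])      # pre-emphasis: y[n] = x[n] - a*x[n-1]
    frame_len, hop = int(fs * frame_ms / 1000), int(fs * hop_ms / 1000)
    n = 1 + max(0, (len(x) - frame_len) // hop)
    frames = np.stack([x[i * hop: i * hop + frame_len] for i in range(n)])
    return frames * np.hamming(frame_len)            # windowing (Hamming) per frame

frames = preprocess(np.random.randn(16000))          # 1 s of dummy audio at 16 kHz
print(frames.shape)                                  # (n_frames, frame_len)
```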
In addition, the speech collected in some scenarios contains two or more speakers and mixed voices, so part of the voice data also required directional extraction of the driver's speech information. In this paper, the open-source falcon package is used to perform voiceprint segmentation and clustering (the result after voiceprint segmentation and clustering is shown in Figure 6), and the driver's voice data are then selected, thereby completing the directional extraction of the driver's voice.

FIGURE 6. Speaker diarization clustering diagram.


When speech features are extracted and computed, different combinations of parameters may affect the computed features differently; here we take short-time energy as an example to show the parameter sensitivity analysis in the experiment. In this paper, the parameter sensitivity experiment is carried out on the main parameters of short-time energy, namely the window (frame) length and the overlap rate, as shown in Figure 7.

FIGURE 7. Voice data preprocessing effect diagram.

In Figure 7, F stands for the frame size and O stands for the overlap size. From the results for the average energy and its variance, changing the overlap size has only a small effect on both the average energy and the variance, which means that the short-time energy features are not sensitive to the choice of overlap size. On the other hand, increasing the frame size not only increases the average energy but also enlarges the energy variance, which means that the short-time energy is more sensitive to the choice of frame size.
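The sensitivity comparison described above can be reproduced qualitatively with a short script that sweeps the frame size F and overlap size O over a dummy signal; the parameter values are arbitrary examples.

```python
# Minimal sketch (not the paper's code): how average short-time energy and its
# variance react to different frame sizes F and overlap sizes O, as discussed above.
import numpy as np

def short_time_energy(x, frame_len, overlap):
    hop = frame_len - overlap
    n = 1 + max(0, (len(x) - frame_len) // hop)
    frames = np.stack([x[i * hop: i * hop + frame_len] for i in range(n)])
    return (frames ** 2).sum(axis=1)

x = np.random.default_rng(0).normal(size=16000)            # dummy 1 s signal at 16 kHz
for frame_len in (200, 400, 800):                          # F: frame size in samples
    for overlap in (0, frame_len // 4, frame_len // 2):    # O: overlap in samples
        e = short_time_energy(x, frame_len, overlap)
        print(f"F={frame_len:4d} O={overlap:4d} mean={e.mean():8.1f} var={e.var():10.1f}")
```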
Based on the preprocessing of the speech data and the parameter sensitivity analysis, the improved SVM voice anger emotion recognition model proposed in this paper was trained and validated. Tests were conducted on the SEWA training and validation sets and on the Multimodal Data validation set, with the performance indicators shown in Table 9.

TABLE 9. Validation results of the improved SVM model.

3) DRIVING STATE ANGER EMOTION RECOGNITION EXPERIMENT
In this study, drones flying at a height of 250 m were used to collect vehicle driving state information under the different scenarios. Alternatively, vehicle driving state information can also be obtained through on-board sensors and devices or roadside data acquisition facilities. The Yolov5x algorithm was used for vehicle recognition and detection. First, the main vehicle was framed, followed by tracking and recognition using the detection algorithm. Then, combining the video frame rate with the real-time flight speed from the drone's log file, the driving features were calculated, resulting in the feature set M = {a_av, ψ̇_av, g_av, cl_fr, otk_fr}. The program operation is shown in Figures 8 and 9.

FIGURE 8. Main vehicle framing schematic diagram.

FIGURE 9. Tracking identification and calculation of driving status information.

After calculating the various feature values of the vehicle's driving state, the Random Forest model was trained and validated on the Multimodal Data dataset. The final experimental results for each indicator are shown in Table 10.

TABLE 10. Performance results of the random forest model on the dataset.

E. CONTEXT-AWARE MULTIMODAL FUSION ANGER EMOTION RECOGNITION COMPARATIVE EXPERIMENT
This study determined the real-time driving scenario through traffic density detection and voice energy fluctuation detection, thereby defining the state space. The conceptual diagram of this perception is shown in Figure 10. For traffic density detection, the Yolov5x algorithm was used to detect the continuous lane lines closest to the main vehicle; after successful detection, the vehicles within the lane lines were identified to obtain the traffic density information. For voice energy fluctuation detection, the analysis was based on the short-time energy difference between adjacent windows in the voice features.
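One possible way to turn the two context cues into a discrete state for the CA-RL module is sketched below; the vehicle-count thresholds and the fluctuation threshold are hypothetical and are not taken from the paper.

```python
# Minimal sketch (hypothetical thresholds, not the paper's implementation) of how
# the two context cues could be mapped to a discrete state for the CA-RL module.
import numpy as np

def traffic_density_level(vehicle_count: int) -> str:
    """Vehicles detected between the lane lines around the main vehicle."""
    if vehicle_count >= 12:
        return "severely_congested"
    if vehicle_count >= 5:
        return "generally_congested"
    return "uncongested"

def conversation_mode(short_time_energy: np.ndarray, fluct_threshold: float = 0.5) -> str:
    """Large energy differences between adjacent windows suggest multiple speakers."""
    diffs = np.abs(np.diff(short_time_energy))
    fluctuation = diffs.mean() / (short_time_energy.mean() + 1e-8)
    return "multi_person" if fluctuation > fluct_threshold else "single_person"

def build_state(vehicle_count: int, short_time_energy: np.ndarray) -> tuple:
    return (traffic_density_level(vehicle_count), conversation_mode(short_time_energy))

print(build_state(8, np.array([1.0, 4.0, 0.5, 3.5, 0.8])))   # ('generally_congested', 'multi_person')
```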


FIGURE 10. Traffic density perception schematic diagram.

Next, reinforcement learning was used on the Multimodal Data dataset to train and validate the action space (i.e., the candidate weight schemes) under the different scenarios. The performance indicators of the CA-MDER model proposed in this paper on the multimodal driver emotion recognition task are shown in Table 11.

TABLE 11. Performance results of the CA-MDER model on the dataset.

Due to the diversity of modalities and data types used in multimodal emotion recognition, there is currently no unified data platform covering all modalities. Therefore, this study selected the following representative multimodal anger emotion recognition methods for a brief comparison:

CNN+Bi-LSTM+HAM [38]: This method introduces HAM on the basis of the CNN + Bi-LSTM network framework. The mechanism can consider features of different levels and types and adaptively adjust the attention mechanism to selectively focus on key facial features. This multimodal framework combines the driver's voice, facial images, and video sequence data for emotion recognition, achieving an anger emotion recognition accuracy of 85%.

CBLNN [40]: This method uses facial geometric features obtained by a Convolutional Neural Network (CNN) as intermediate variables for Bidirectional Long Short-Term Memory (Bi-LSTM) heart rate analysis. Subsequently, the output of the Bi-LSTM is used as input to the CNN module to extract heart rate features. Finally, Multimodal Factorized Bilinear Pooling (MFB) is used to fuse the extracted information for emotion recognition. Using facial and heart rate data, this dual-modal method achieves an anger emotion recognition accuracy of 90.5%.

AM-LSTM [36]: This study combines LSTM with an attention mechanism for anger emotion recognition through multimodal decision-level fusion of EEG signals, physiological signals, and driving behavior information, achieving an accuracy of 76.3%.

SVM-DS [34]: This study performs multimodal anger emotion fusion recognition based on ECG signals and driving behavior signals and shows that the SVM-DS model with multimodal decision-level fusion performs best, with an accuracy of 84.8%.

Table 12 lists the results of the proposed method and the other advanced models on different datasets in terms of the various evaluation metrics. The proposed model (CA-MDER) outperforms all other models in both the ACC and F1 metrics. In summary, compared to classic methods, the CA-MDER model proposed in this study demonstrates good classification and generalization capabilities.

TABLE 12. Performance comparison of various multimodal recognition methods.

F. ABLATION EXPERIMENT
To verify the effectiveness of the context-aware module and multimodal fusion, an ablation experiment was designed for the proposed model, including the removal of context awareness and the fusion of each single modality's data. M represents the multimodal task, F represents the facial single modality, V represents the voice single modality, S represents the driving state single modality, and Awareness represents the context-awareness module. The ablation experiment results for each indicator are shown in Table 13.

TABLE 13. Results of modality ablation experiment: Multimodal dataset.

V. RESULTS AND DISCUSSION
To enhance the accuracy of driver anger emotion recognition, this study proposed a context-aware multimodal driver anger emotion recognition method (CA-MDER), integrating facial, voice, and vehicle driving state information. First, anger emotion recognition was conducted for each single modality. To improve recognition accuracy, we first proposed a facial emotion recognition method based on an Attention Mechanism Depthwise Separable Convolutional Neural Network (AM-DSCNN). This method focuses on the key facial features determining emotions using an attention module and then improves the model's computational efficiency by introducing depthwise separable convolution modules.


Next, the SVM used for speech emotion recognition was improved by considering the temporal characteristics of voice data and proposing a hybrid kernel function combining Dynamic Time Warping (DTW) and RBF, which effectively handles the temporal elasticity of time-series data and improves the optimal correspondence between time series. Finally, for vehicle driving state emotion recognition, features capturing vehicle interactions were introduced, and a driver driving style recognition module was proposed to mitigate the impact of driving heterogeneity on anger emotion recognition. After completing emotion recognition for each modality, this study used the optimal weight allocation under adaptive scenarios, obtained through context awareness, for multimodal decision-level fusion, ultimately outputting the emotion classification results.

The recognition results (Table 11) show that the CA-MDER model proposed in this paper achieves an accuracy of 91.68% and an F1 score of 90.37%. Compared with other advanced multimodal anger emotion recognition models (Table 12), CA-MDER reaches a high level in both accuracy and F1, surpassing the existing multimodal recognition models. Additionally, the ablation experiment revealed some notable points. Although the recognition accuracies of the individual modalities vary significantly, the overall recognition rate still increases after multimodal fusion, which aligns with current mainstream research and may corroborate the view that drivers can hide their emotions in certain modalities under some scenarios. However, in some cases, even using more modalities for fusion may not yield ideal results. For example, in the ablation experiment, directly fusing the facial, voice, and driving state modalities without context-aware adaptive weight allocation improved the recognition rate by only about 1% (relative to the highest single-modality accuracy, that of the facial modality; the same below). In contrast, using context-aware adaptive weight allocation to fuse only the facial and driving state modalities improves recognition accuracy by about 3%, and fusing the facial, voice, and driving state modalities improves it by about 5%. This validates the effectiveness of the context-aware adaptive weight allocation method proposed in this study.

Looking forward, with the support of the National Key R&D Program project "Major Accident Risk Prevention and Emergency Avoidance Technology for Road Transport Vehicles" (2023YFC3009600) and the Graduate Innovation Fund of Jilin University, this study will explore applications in accident risk prevention, emergency avoidance, and other aspects related to road transport vehicles. Due to limitations in the research duration and measurement methods, this paper has some limitations, and we plan to make improvements in the following areas: a. In this paper, the attention mechanism is only used to focus on facial features; in the future, we will introduce the attention mechanism into the other modalities' data to explore possibilities for improving overall recognition accuracy. b. The type and amount of data from real vehicle scenarios are relatively small; in the future, multimodal data could be collected more easily, e.g., through on-board sensors or vehicle networking. c. The scenarios used to train the context-aware adaptive weight allocation method are somewhat limited; in the future, we will consider enriching the types of scenarios in aspects such as weather and light intensity.

REFERENCES
[1] B. Parkinson, "Anger on and off the road," Brit. J. Psychol., vol. 92, no. 3, pp. 507–526, Aug. 2001, doi: 10.1348/000712601162310.
[2] J. L. Deffenbacher, E. R. Oetting, and R. S. Lynch, "Development of a driving anger scale," Psychol. Rep., vol. 74, no. 1, pp. 83–91, Feb. 1994, doi: 10.2466/pr0.1994.74.1.83.
[3] E. R. Dahlen, R. C. Martin, K. Ragan, and M. M. Kuhlman, "Driving anger, sensation seeking, impulsiveness, and boredom proneness in the prediction of unsafe driving," Accident Anal. Prevention, vol. 37, no. 2, pp. 341–348, Mar. 2005, doi: 10.1016/j.aap.2004.10.006.
[4] J. Lu, X. Xie, and R. Zhang, "Focusing on appraisals: How and why anger and fear influence driving risk perception," J. Saf. Res., vol. 45, pp. 65–73, Jun. 2013, doi: 10.1016/j.jsr.2013.01.009.
[5] P. Ekman and W. V. Friesen, "Facial action coding system," Environ. Psychol. Nonverbal Behav., Jan. 1978.
[6] A. Barman and P. Dutta, "Facial expression recognition using distance and texture signature relevant features," Appl. Soft Comput., vol. 77, pp. 88–105, Apr. 2019, doi: 10.1016/j.asoc.2019.01.011.
[7] T. Ojala, M. Pietikäinen, and D. Harwood, "A comparative study of texture measures with classification based on featured distributions," Pattern Recognit., vol. 29, no. 1, pp. 51–59, Jan. 1996.
[8] N. Dalal and B. Triggs, "Histograms of oriented gradients for human detection," in Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit. (CVPR), 2005, pp. 886–893, doi: 10.1109/CVPR.2005.177.
[9] H. Ali, M. Hariharan, S. Yaacob, and A. H. Adom, "Facial emotion recognition using empirical mode decomposition," Expert Syst. Appl., vol. 42, no. 3, pp. 1261–1277, Feb. 2015.
[10] G. Zhao, X. Huang, M. Taini, S. Z. Li, and M. Pietikäinen, "Facial expression recognition from near-infrared videos," Image Vis. Comput., vol. 29, no. 9, pp. 607–619, Aug. 2011.
[11] L. Zhang, K. Mistry, M. Jiang, S. C. Neoh, and M. A. Hossain, "Adaptive facial point detection and emotion recognition for a humanoid robot," Comput. Vis. Image Understand., vol. 140, pp. 93–114, Nov. 2015.
[12] Z. Zhang, L. Cui, X. Liu, and T. Zhu, "Emotion detection using Kinect 3D facial points," in Proc. IEEE/WIC/ACM Int. Conf. Web Intell. (WI), Oct. 2016, pp. 407–410. Accessed: Nov. 9, 2023. [Online]. Available: https://ieeexplore.ieee.org/abstract/document/7817080/
[13] R. Jiang, A. T. S. Ho, I. Cheheb, N. Al-Maadeed, S. Al-Maadeed, and A. Bouridan, "Emotion recognition from scrambled facial images via many graph embedding," Pattern Recognit., vol. 67, pp. 245–251, Jul. 2017.
[14] K. Candra Kirana, S. Wibawanto, and H. Wahyu Herwanto, "Facial emotion recognition based on Viola-Jones algorithm in the learning environment," in Proc. Int. Seminar Appl. Technol. Inf. Commun., Sep. 2018, pp. 406–410. Accessed: Nov. 9, 2023. [Online]. Available: https://ieeexplore.ieee.org/abstract/document/8549735/
[15] A. T. Lopes, E. de Aguiar, A. F. De Souza, and T. Oliveira-Santos, "Facial expression recognition with convolutional neural networks: Coping with few data and the training sample order," Pattern Recognit., vol. 61, pp. 610–628, Jan. 2017.
[16] M. Liu, S. Li, S. Shan, and X. Chen, "AU-aware deep networks for facial expression recognition," in Proc. 10th IEEE Int. Conf. Workshops Autom. Face Gesture Recognit. (FG), Apr. 2013, pp. 1–6, doi: 10.1109/FG.2013.6553734.
[17] H.-W. Ng, V. D. Nguyen, V. Vonikakis, and S. Winkler, "Deep learning for emotion recognition on small datasets using transfer learning," in Proc. ACM Int. Conf. Multimodal Interact., Nov. 2015, pp. 443–449, doi: 10.1145/2818346.2830593.
[18] R. S. John, S. B. Alex, M. S. Sinith, and L. Mary, "Significance of prosodic features for automatic emotion recognition," AIP Conf. Proc., vol. 2222, no. 1, Apr. 2020, Art. no. 030003, doi: 10.1063/5.0004235.
[19] C. Nussbaum, A. Schirmer, and S. R. Schweinberger, "Contributions of fundamental frequency and timbre to vocal emotion perception and their electrophysiological correlates," Social Cognit. Affect. Neurosci., vol. 17, no. 12, pp. 1145–1154, Dec. 2022.
[20] S. Lalitha, D. Geyasruti, R. Narayanan, and S. M, "Emotion detection using MFCC and cepstrum features," Proc. Comput. Sci., vol. 70, pp. 29–35, Jan. 2015.


[21] Y. Zhou, J. Li, Y. Sun, J. Zhang, Y. Yan, and M. Akagi, "A hybrid speech emotion recognition system based on spectral and prosodic features," IEICE Trans. Inf. Syst., vol. E93-D, no. 10, pp. 2813–2821, 2010.
[22] H. Aouani and Y. B. Ayed, "Speech emotion recognition with deep learning," Proc. Comput. Sci., vol. 176, pp. 251–260, Jan. 2020, doi: 10.1016/j.procs.2020.08.027.
[23] T. M. Rajisha, A. P. Sunija, and K. S. Riyas, "Performance analysis of Malayalam language speech emotion recognition system using ANN/SVM," Proc. Technol., vol. 24, pp. 1097–1104, Jan. 2016.
[24] E. M. Albornoz, D. H. Milone, and H. L. Rufiner, "Spoken emotion recognition using hierarchical classifiers," Comput. Speech Lang., vol. 25, no. 3, pp. 556–570, Jul. 2011.
[25] Q. Mao, M. Dong, Z. Huang, and Y. Zhan, "Learning salient features for speech emotion recognition using convolutional neural networks," IEEE Trans. Multimedia, vol. 16, no. 8, pp. 2203–2213, Dec. 2014.
[26] L. Sun, S. Fu, and F. Wang, "Decision tree SVM model with Fisher feature selection for speech emotion recognition," EURASIP J. Audio, Speech, Music Process., vol. 2019, no. 1, p. 2, Dec. 2019, doi: 10.1186/s13636-018-0145-5.
[27] S. Kwon, "CLSTM: Deep feature-based speech emotion recognition using the hierarchical ConvLSTM network," Mathematics, vol. 8, no. 12, p. 2133, Nov. 2020.
[28] H. M. M. Hasan and Md. A. Islam, "Emotion recognition from Bengali speech using RNN modulation-based categorization," in Proc. 3rd Int. Conf. Smart Syst. Inventive Technol. (ICSSIT), Aug. 2020, pp. 1131–1136. Accessed: Dec. 19, 2023. [Online]. Available: https://ieeexplore.ieee.org/abstract/document/9214196/
[29] H. Lei, "The characteristics of angry driving behaviors and its effects on traffic safety," M.S. thesis, Wuhan Univ. Technol., 2011.
[30] M. Zhong, H. Hong, and Z. Yuan, "Experiment research on influence of angry emotion for driving behaviors," J. Chongqing Univ. Technol. Natural Sci., vol. 25, no. 10, pp. 6–11, 2011.
[31] F. Techer, C. Jallais, Y. Corson, F. Moreau, D. Ndiaye, B. Piechnick, and A. Fort, "Attention and driving performance modulations due to anger state: Contribution of electroencephalographic data," Neurosci. Lett., vol. 636, pp. 134–139, Jan. 2017.
[32] L. Precht, A. Keinath, and J. F. Krems, "Effects of driving anger on driver behavior—Results from naturalistic driving data," Transp. Res. F, Traffic Psychol. Behav., vol. 45, pp. 75–92, Feb. 2017.
[33] S. Shafaei, T. Hacizade, and A. Knoll, "Integration of driver behavior into emotion recognition systems: A preliminary study on steering wheel and vehicle acceleration," in Proc. Asian Conf. Comput. Vis., G. Carneiro and S. You, Eds., Cham, Switzerland: Springer, 2019, pp. 386–401, doi: 10.1007/978-3-030-21074-8_32.
[34] F. Wang, "Research on driver anger emotion recognition method based on multimodal fusion," M.S. thesis, Jilin Univ., 2023.
[35] X. Yu, "Driver's anger recognition method based on human-vehicle environment information fusion," M.S. thesis, Shandong Univ. Technol., 2021.
[36] P. Wang, "Study on multimodal recognition method of driver's anger emotion and mechanism of driving risk under anger emotion," M.S. thesis, Chongqing Univ., 2020.
[37] J. Tang, Z. Ma, K. Gan, J. Zhang, and Z. Yin, "Hierarchical multimodal-fusion of physiological signals for emotion recognition with scenario adaption and contrastive alignment," Inf. Fusion, vol. 103, Mar. 2024, Art. no. 102129, doi: 10.1016/j.inffus.2023.102129.
[38] D. Zhou, Y. Cheng, L. Wen, H. Luo, and Y. Liu, "Drivers' comprehensive emotion recognition based on HAM," Sensors, vol. 23, no. 19, p. 8293, Oct. 2023, doi: 10.3390/s23198293.
[39] J. Ni, W. Xie, Y. Liu, J. Zhang, Y. Wan, and H. Ge, "Driver emotion recognition involving multimodal signals: Electrophysiological response, nasal-tip temperature, and vehicle behavior," J. Transp. Eng., A, Syst., vol. 150, no. 1, Jan. 2024, Art. no. 04023125, doi: 10.1061/jtepbs.teeng-7802.
[40] G. Du, Z. Wang, B. Gao, S. Mumtaz, K. M. Abualnaja, and C. Du, "A convolution bidirectional long short-term memory neural network for driver emotion recognition," IEEE Trans. Intell. Transp. Syst., vol. 22, no. 7, pp. 4570–4578, Jul. 2021, doi: 10.1109/TITS.2020.3007357.
[41] W. Li, Y. Cui, Y. Ma, X. Chen, G. Li, G. Zeng, G. Guo, and D. Cao, "A spontaneous driver emotion facial expression (DEFE) dataset for intelligent vehicles: Emotions triggered by video-audio clips in driving scenarios," IEEE Trans. Affect. Comput., vol. 14, no. 1, pp. 747–760, Jan. 2023, doi: 10.1109/TAFFC.2021.3063387.
[42] A. J. Kayal and J. Nirmal, "Multilingual vocal emotion recognition and classification using back propagation neural network," in Proc. Advancement Sci. Technol., 2nd Int. Conf. Commun. Syst. (ICCS), Rajasthan, India, 2016, Art. no. 020054, doi: 10.1063/1.4942736.
[43] K. Scherer, "Vocal communication of emotion: A review of research paradigms," Speech Commun., vol. 40, nos. 1–2, pp. 227–256, Apr. 2003, doi: 10.1016/s0167-6393(02)00084-5.
[44] M. Abdelwahab and C. Busso, "Evaluation of syllable rate estimation in expressive speech and its contribution to emotion recognition," in Proc. IEEE Spoken Lang. Technol. Workshop (SLT), South Lake Tahoe, NV, USA, Dec. 2014, pp. 472–477, doi: 10.1109/SLT.2014.7078620.
[45] R. Alcaraz and J. J. Rieta, "A novel application of sample entropy to the electrocardiogram of atrial fibrillation," Nonlinear Anal., Real World Appl., vol. 11, no. 2, pp. 1026–1035, Apr. 2010.
[46] L. Sun, B. Zou, S. Fu, J. Chen, and F. Wang, "Speech emotion recognition based on DNN-decision tree SVM model," Speech Commun., vol. 115, pp. 29–37, Dec. 2019, doi: 10.1016/j.specom.2019.10.004.
[47] S. Kanwal, S. Asghar, and H. Ali, "Feature selection enhancement and feature space visualization for speech-based emotion recognition," PeerJ Comput. Sci., vol. 8, p. e1091, Nov. 2022, doi: 10.7717/peerj-cs.1091.
[48] T. Zimasa, S. Jamson, and B. Henson, "The influence of driver's mood on car following and glance behaviour: Using cognitive load as an intervention," Transp. Res. F, Traffic Psychol. Behav., vol. 66, pp. 87–100, Oct. 2019, doi: 10.1016/j.trf.2019.08.019.
[49] L. Shamoa-Nir, "Road rage and aggressive driving behaviors: The role of state-trait anxiety and coping strategies," Transp. Res. Interdiscipl. Perspect., vol. 18, Mar. 2023, Art. no. 100780, doi: 10.1016/j.trip.2023.100780.
[50] A. Mollahosseini, B. Hasani, and M. H. Mahoor, "AffectNet: A database for facial expression, valence, and arousal computing in the wild," IEEE Trans. Affect. Comput., vol. 10, no. 1, pp. 18–31, Jan. 2019.
[51] J. Kossaifi, R. Walecki, Y. Panagakis, J. Shen, M. Schmitt, F. Ringeval, J. Han, V. Pandit, A. Toisoul, B. Schuller, K. Star, E. Hajiyev, and M. Pantic, "SEWA DB: A rich database for audio-visual emotion and sentiment research in the wild," IEEE Trans. Pattern Anal. Mach. Intell., vol. 43, no. 3, pp. 1022–1040, Mar. 2021.

TONGQIANG DING received the M.S. and Ph.D. degrees from the School of Transportation, Jilin University, China, in 2001 and 2005, respectively. He worked at the University of Minnesota, USA, as a Visiting Scholar, in 2014. He is currently an Associate Professor with the School of Transportation, Jilin University, where he is also the Head of the Department of Traffic Engineering. His research interests include the traffic safety of traditional road traffic systems and of emerging intelligent transport systems represented by intelligent vehicles and intelligent networks, involving fundamental theories, methods, technologies, and practical applications of traffic safety.

KEXIN ZHANG received the B.S. degree from Shandong University of Science and Technology, in 2022. He is currently pursuing the M.S. degree with the School of Transportation, Jilin University. His research interests include the analysis and recognition of drivers' psycho-behavioural characteristics and in-vehicle intelligent systems.


SHUAI GAO received the master's degree from the School of Transportation, Jilin University, in 2013. She became a full-time Faculty Member with the Department of Urban Railway Operation Management, School of Railway and Transportation, Jilin Jiaotong Vocational and Technical College, in 2014, and the Director of the Department of Operation, in 2022. Her research interests include intelligent transport systems, on-board intelligent systems, and traffic microsimulation.

XINNING MIAO received the B.S. degree from Shandong University of Technology, in 2020, and the M.S. degree from the School of Transportation, Jilin University, in 2023. She is currently working with Beijing Jingwei Hirain Technologies Company Inc. Her research interests include the psychological and behavioural characteristics of drivers.

JIANFENG XI received the M.S. and Ph.D. degrees from the School of Transportation, Jilin University, China, in 2003 and 2007, respectively. He worked at the University of Minnesota, USA, as a Postdoctoral Fellow, in 2010. He became a Professor with the School of Transportation, Jilin University, in 2016. His research interests include driving safety and management, and emergency management and simulation systems.
