A Multimodal Driver Anger Recognition Method Based On Context-Awareness
ABSTRACT In today’s society, the harm of driving anger to traffic safety is increasingly prominent. With
the development of human-computer interaction and intelligent transportation systems, the application of
biometric technology in driver emotion recognition has attracted widespread attention. This study proposes
a context-aware multi-modal driver anger emotion recognition method (CA-MDER) to address the main
issues encountered in multi-modal emotion recognition tasks. These include individual differences among
drivers, variability in emotional expression across different driving scenarios, and the inability to capture
driving behavior information that represents vehicle-to-vehicle interaction. The method employs Attention
Mechanism-Depthwise Separable Convolutional Neural Networks (AM-DSCNN), an improved Support
Vector Machines (SVM), and Random Forest (RF) models to perform multi-modal anger emotion recog-
nition using facial, vocal, and driving state information. It also uses Context-Aware Reinforcement Learning
(CA-RL) based adaptive weight distribution for multi-modal decision-level fusion. The results show that the proposed method performs well on emotion classification metrics, achieving an accuracy of 91.68% and an F1 score of 90.37% and demonstrating robust multi-modal anger emotion recognition performance.
INDEX TERMS Context-awareness, driving state emotion recognition, emotional expression heterogeneity,
multimodal emotion recognition, machine learning.
under the influence of anger can lead to more aggressive behavior [4].

In addressing the issues caused by angry driving, it is essential to start from the perspective of the individual driver and study the recognition of angry driving to facilitate early emotional soothing and safety warnings. This can prevent many traffic problems caused by angry driving, provide more precise protection for drivers and vehicles, and offer feasible safety assurances for the driving community. Therefore, this paper discusses a multi-modal method for recognizing driver anger emotions, with the remaining sections organized as follows. Chapter 2 introduces the current state of research on both unimodal and multimodal driver anger emotion recognition and elaborates on the problems this paper aims to solve. Chapter 3 focuses on the main innovations of the multi-modal anger emotion recognition method used in this paper. Chapter 4 introduces the dataset used in this study and related anger emotion recognition experiments. In terms of datasets, this paper emphasizes a multimodal dataset of facial expressions, voice, and vehicle driving states collected by the authors. Regarding recognition experiments, the paper not only validates the proposed method through experiments but also includes comparative and ablation experiments. Finally, Chapter 5 summarizes the experimental results and provides an outlook on future work.

II. RELATED WORK
In recent years, with the rapid development of sensor technology and machine learning, significant progress has been made in the recognition of driver anger emotions, both in unimodal methods, such as facial expression, speech information, physiological information, and behavioral information, and in multimodal methods involving the fusion of various types of information. Recognition methods based on facial expressions and speech information have become mainstream in emotion recognition due to their ease of data acquisition and high accuracy rates. In the field of driver anger emotion recognition, methods based on vehicle driving state information demonstrate unique data advantages. However, recognition methods based on physiological information, due to the challenges of non-contact data collection, often significantly impact the driver during the data collection process. Therefore, this paper will next elaborate in detail on the methods of anger emotion recognition based on facial information, speech information, and vehicle driving state information.

A. ANGER EMOTION RECOGNITION BASED ON FACIAL INFORMATION
Facial expressions are the most natural and direct way for humans to express emotions. Since facial information can be collected non-invasively, thus minimizing interference with the driver, facial emotion recognition has become the primary method for recognizing driver anger. The two most important steps in facial information-based anger emotion recognition are first to extract features from the facial information and then to determine whether the emotion is anger or not, using certain rules or classification models.

In terms of facial feature extraction methods, Ekman and Friesen [5] proposed the Facial Action Coding System (FACS) facial behavior coding method in 1978, which can be used to judge facial expressions by identifying specific Action Unit (AU) feature vectors. Barman and Dutta [6] proposed a fast SIC active appearance model that can extract facial features by forming a mesh of 14 facial feature points. In addition, there are the Active Shape Model (ASM), Local Binary Patterns (LBP) [7], and the Histogram of Oriented Gradients (HOG) [8]. The purpose of the ASM algorithm is to generate points that fit the contours of the object. LBP, on the other hand, constructs a binary number by comparing each cell's pixels with its eight neighboring pixels through a function; afterwards, the algorithm constructs a histogram of these numbers for each cell and concatenates the histograms. HOG produces a histogram based on the gradients calculated after segmenting the image into cells. However, in real-life applications, all of the above methods have some drawbacks and shortcomings. The FACS system is relatively complex, and manually performing the FACS coding is a very time-consuming process that is affected by the subjective judgment of the coder; the SIC and LBP models have difficulty accurately capturing subtle dynamic changes in facial expressions; ASM's effectiveness relies heavily on the accuracy of the initial shape and is not flexible enough to deal with non-standard facial expressions; HOG also struggles to capture subtle changes accurately, and its computation may be relatively complex and time-consuming. In addition to the above methods, we observe that more and more feature extraction is performed by convolutional neural networks. Compared to the above methods, Convolutional Neural Networks (CNNs) can automatically learn and extract features from training data and realize hierarchical feature learning, which not only can extract higher-level features and subtle features that may be neglected by traditional methods, but also can improve the adaptability and flexibility of the model.

In terms of classification algorithms, there are two main categories: classical methods and neural network methods. Classical methods include support vector machines (SVMs), dynamic Bayesian networks, extreme learning machines [9], sparse learning machines [10], support vector regression [11], sparse representation classification [10], random forests [12], random trees [12], multi-graph embeddings [13], and the single-modified Viola-Jones detector [14]. The above classical methods usually have deficiencies in terms of computational complexity, parameter tuning, and model generalizability. DBNs and random forests, for example, face the challenge of computational resources when dealing with large datasets, and the optimization of parameters such as the type of kernel function and regularization parameter in SVM, the number of nodes in the hidden layer in ELM, and the number of trees in random forests usually requires a lot of experiments and expertise, among other deficiencies. The main methods designed through neural networks include methods such as
multilayer perceptron (MLP) and CNN. Although MLP is a powerful classification tool, it still has some limitations in terms of deep structure design, overfitting control, parameter tuning, and dependence on data. For example, MLP is unable to effectively utilize the spatial or temporal structure in the input data, and MLP is susceptible to overfitting when the training data is limited or noisy. CNN, on the other hand, can effectively capture the spatial hierarchical structure in the input data through its convolutional layers, and CNN usually has fewer parameters due to weight sharing and pooling operations, which reduces the computational complexity and the risk of overfitting and improves the generalization ability of the model. A. T. Lopes [15] designed a multilayered CNN model that automatically extracts features and recognizes the emotion of anger from facial images; Liu et al. [16] used a three-dimensional CNN (3D-CNN) to capture spatio-temporal features in facial expression videos, not only focusing on the spatial features of the face but also taking into account the change of expression over time, which improves the accuracy of emotion recognition; Ng et al. [17] proposed a CNN model that simultaneously learns facial emotion recognition and other face-related tasks (e.g., gender recognition, age estimation), showing that sharing representations between related tasks can potentially improve emotion recognition performance. CNNs show great potential in facial emotion recognition; however, traditional CNNs usually contain a large number of parameters, leading to high computational complexity and significant storage requirements, which are not conducive to real-time applications. In addition, when processing image data, traditional CNNs usually do not assess which parts of the image are more important for the final prediction, which may lead the model to be disturbed by non-feature information.

B. ANGER EMOTION RECOGNITION BASED ON SPEECH INFORMATION
Language, as a unique means of human communication, can directly reflect human emotions through its vocal characteristics. In 1983, Bezooijen and others explored the correlation between vocal features and different emotions, suggesting that statistical parameters of speech features could be used for emotion classification. However, in the application field of driver anger emotion recognition, speech data often faces challenges such as poor data usability and missing data. Therefore, speech emotion recognition is usually used as a supplementary method to enhance the overall accuracy of driver anger detection. Speech emotion recognition also involves two critical steps: feature extraction and emotion classification.

In the aspect of feature extraction, speech features can be categorized into prosodic features, timbral features, and spectral features. Each category encompasses specific related characteristics, as shown in Table 1.

TABLE 1. Speech features.

Prosodic features primarily focus on the rhythm, intensity, speech rate, and pitch of speech. For instance, John et al. [18] conducted experiments using prosodic features for emotion recognition, proving their effectiveness in speech emotion detection. Timbral features reflect the quality of speech signals, measuring the intelligibility, purity, and distinctiveness of speech. Research by Nussbaum et al. [19] has found that fundamental frequency, a timbral characteristic, plays a significant role in emotion recognition. Spectral features represent the characteristics of the signal in the frequency domain, where emotional fluctuations cause variations in the spectral distribution of speech. Lalitha et al. [20] have suggested the role of cepstral coefficients in spectral features for emotion classification in enhancing human-computer interaction performance. Furthermore, some scholars have combined the above-mentioned speech characteristics for emotion recognition. For example, Zhou et al. [21] proposed a method using a fusion of MFCC and prosodic features to identify speech emotions. Their experiments showed that this fusion increased the accuracy rate by nearly 20% compared to using a single feature, with the accuracy of using only MFCC being 62.3% higher than using only prosodic features.

In the field of emotion classification, models mainly include SVM [22], [23], Artificial Neural Network (ANN) [20], Hidden Markov Model (HMM) [24], CNN [25], Decision Tree [26], Long Short-Term Memory (LSTM) [27], and Recurrent Neural Network (RNN) [28] methods. SVM is suitable for clear classification problems in high-dimensional spaces but only performs well on small to medium-sized datasets. CNN and LSTM demonstrate excellent performance but are computationally expensive. ANN has strong learning capabilities but is prone to overfitting and sensitive to parameter selection. HMM excels in processing time-series data but is limited in handling nonlinear features. RNN is apt for sequential data but struggles with long sequences. Decision Trees are easy to understand but prone to overfitting.

Many methods in the field of speech emotion recognition have achieved excellent recognition effects, but there are still some shortcomings in application scenarios like recognizing the anger emotion of drivers while driving. Firstly, to improve the comprehensive recognition accuracy of drivers' anger, the speech emotion recognition module should enhance recognition efficiency to achieve real-time overall recognition. However, neural network models like RNN, which have high recognition accuracy, lack real-time performance, while models like SVM, which have high recognition efficiency, need improved accuracy. Secondly, in the scenario of recognizing drivers' anger while driving, issues such as missing speech information and poor information usability exist. Finally, the selection of features in speech emotion recognition often focuses on one or a few features, lacking comprehensiveness in feature selection.

driver heterogeneity. Anomalies in vehicle driving state, such as significant changes in acceleration, may not necessarily indicate anger but could also be due to an aggressive driving style. Secondly, the anger reflected in the interaction between vehicles, such as vehicle following distance and frequency of overtaking, has not been captured.
discovered that the distribution of multimodal physiological responses varies across different emotional scenarios. Specifically, the degree of expression in facial emotion, vocal emotion, and vehicle driving state emotion of drivers may differ under conditions such as congested versus smooth traffic flow, or when driving alone versus with passengers.

In summary, previous research work has made significant progress, but there are still some challenges and limitations. First, the recognition of driver emotions is a complex task requiring the consideration of multimodal data. However, issues such as individual differences among drivers and variability in emotional expression across different driving scenarios can impact the accuracy of emotion recognition. Second, while facial and vocal emotion recognition are the mainstream methods for emotion recognition, there is still room for improvement in their accuracy. Third, within our research scope, no current recognition methods consider using traffic flow information, which represents vehicle interaction, for anger emotion recognition, thus failing to fully capture the drivers' emotional expressions. Lastly, most scholars base their model training and validation on simulated scenarios, which still differ from real-world driving.

Based on this, this paper proposes a context-aware multimodal driver emotion recognition method (CA-MDER) aimed at overcoming these issues and effectively recognizing drivers' anger emotions. The contributions of this paper are as follows:
(1) A facial emotion recognition method based on AM-DSCNN is proposed, which introduces an attention mechanism module and a depthwise separable convolution module to enhance the model's ability to capture important facial emotional features and improve computational efficiency;
(2) A hybrid kernel function combining Dynamic Time Warping (DTW) with RBF is proposed to improve the SVM model, making the improved model more effective in handling the temporal elasticity in voice data and improving recognition accuracy;
(3) In the vehicle driving state emotion recognition module, features capturing vehicle interactions are introduced, and a driver driving style recognition module is proposed to mitigate the impact of driving heterogeneity on anger emotion recognition;
(4) A context-aware, multi-modal decision-level fusion method based on CA-RL is proposed, which uses context awareness to achieve optimal weight distribution in adaptive scenarios;
(5) This paper uses data collected from real driving scenarios and public datasets for model training and validation, which, compared to simulated data, results in a more realistic and effective model.

III. METHOD
For the recognition of driver anger emotions, a recognition framework that combines context-awareness and multimodal fusion is proposed, as shown in Figure 1. It is divided into the following parts:
(1) A facial anger emotion recognition module based on AM-DSCNN;
(2) A vocal anger emotion recognition module based on an improved SVM;
(3) A vehicle driving state emotion recognition module that considers vehicle driving state information and driving style;
(4) A multimodal decision-making module with adaptive weight distribution based on context awareness.

FIGURE 1. CA-MDER modeling framework.

A. FACIAL ANGER EMOTION RECOGNITION BASED ON AM-DSCNN
In response to the numerous shortcomings of traditional CNNs, this paper proposes a facial emotion recognition method based on Attention Mechanism-Depthwise Separable Convolutional Neural Networks (AM-DSCNN), specifically for recognizing the angry emotional states of drivers. The proposed AM-DSCNN is composed of three parts: the backbone network, the depthwise separable convolution module, and the attention mechanism module; the specific model structure is shown in Figure 2.

FIGURE 2. AM-DSCNN model structure diagram.

1) BACKBONE NETWORK
The backbone network is the foundation of the AM-DSCNN model, consisting of convolutional layers, pooling layers, fully connected layers, and dropout layers. Its purpose is
to capture image features from low-level to high-level and ultimately provide emotion classification.

a: CONVOLUTIONAL LAYER
The convolutional layer is the core component of a CNN, used to extract features from the input image. Convolution operations slide convolution kernels over the input image to perform dot product operations, resulting in feature maps. For example, in the first convolutional layer shown in Figure 2, the input image size in the model is 416 × 416 × 1. Then, the convolutional layer Conv1 with 32 convolution kernels of size 3 × 3 is used for convolution operations, and the output size after convolution is 416 × 416 × 32.

b: POOLING LAYER
The pooling layer is used to reduce the size of the feature maps, decrease computational complexity, and retain important features. For example, in the first pooling layer shown in Figure 2, the pooling layer uses a 2 × 2 window to perform max pooling on the feature map of size 416 × 416 × 32. The output size after pooling is 208 × 208 × 32.

c: FULLY CONNECTED LAYER
The fully connected layer connects all nodes from the previous layer to all nodes in the current layer, functioning as a standard neural network layer. As shown in Figure 2, this model includes fully connected layers in the attention mechanism module and at the final emotion classification. The former is used to generate attention weights for the image after global average pooling, thereby rescaling the feature maps; the latter is used to integrate features extracted from all previous layers for the final classification of angry emotions.

d: DROPOUT LAYER
The Dropout layer is a regularization technique used to prevent neural networks from overfitting. By randomly ''dropping out'' a portion of neurons during training, the network is forced to train on different sub-networks, thus enhancing the model's generalization ability.

26 × 26 × 256. This process can be described as follows:

I'_c(x, y) = I_c(x, y) * K_c    (1)

Here, x, y are the spatial positions in the driver's facial feature map, I'_c is the output of the cth channel, I_c is the input of the cth channel, and K_c is the convolution kernel of the cth channel.

b: POINTWISE CONVOLUTION
This uses a 1 × 1 convolution kernel to perform convolution across channels, combining the outputs of the depthwise convolution. Similarly, for the first layer of depthwise separable convolution DSConv1 in Figure 2, after channel-wise convolution, a 1 × 1 × 256 convolution kernel is used to perform convolution, converting 256 channels into 512 channels. The final output feature map has a size of 26 × 26 × 512. This process can be described as follows:

I''_c(x, y) = Σ_c I'_c(x, y) * K'_c    (2)

Here, K'_c is the cth channel of the 1 × 1 convolution kernel. Compared to traditional convolution operations, depthwise convolution significantly reduces the number of multiplicative operations since each kernel only convolves on its respective single channel. Although pointwise convolution involves all channels, the computational load is still much lower than traditional convolution due to the use of 1 × 1 kernels. Furthermore, depthwise separable convolution requires far fewer parameters than classical convolution. For the same size of feature maps and convolution kernels, depthwise separable convolution can significantly reduce the model's parameter count, thereby lowering the risk of overfitting and enhancing the model's applicability in resource-constrained environments. Simultaneously, by decomposing standard convolution into these two steps, depthwise separable convolution still effectively captures the spatial patterns and texture information within each channel of the facial feature map and the feature combinations across channels, ensuring efficient feature extraction.
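As an illustration of the depthwise separable convolution described above, the following PyTorch-style sketch chains a per-channel 3 × 3 depthwise convolution (cf. Eq. (1)) with a 1 × 1 pointwise convolution that expands 256 channels to 512 (cf. Eq. (2)), matching the 26 × 26 feature-map size quoted for DSConv1. The batch normalization, activation, and layer names are illustrative assumptions and are not taken from the authors' released implementation.

```python
import torch
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    """Depthwise separable convolution: per-channel spatial filtering
    followed by a 1x1 pointwise convolution that mixes channels."""
    def __init__(self, in_ch, out_ch, kernel_size=3, padding=1):
        super().__init__()
        # Depthwise step: groups=in_ch gives one kernel per input channel (Eq. (1)).
        self.depthwise = nn.Conv2d(in_ch, in_ch, kernel_size,
                                   padding=padding, groups=in_ch, bias=False)
        # Pointwise step: 1x1 convolution recombines the channels (Eq. (2)).
        self.pointwise = nn.Conv2d(in_ch, out_ch, kernel_size=1, bias=False)
        # BatchNorm and ReLU are common practice, not stated in the paper.
        self.bn = nn.BatchNorm2d(out_ch)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.act(self.bn(self.pointwise(self.depthwise(x))))

# Shape check against the sizes quoted in the text (assumed batch size 1):
x = torch.randn(1, 256, 26, 26)           # input feature map 26 x 26 x 256
y = DepthwiseSeparableConv(256, 512)(x)   # output 26 x 26 x 512
print(y.shape)                            # torch.Size([1, 512, 26, 26])
```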
each channel feature to have a global perspective. For instance, in the first layer of the attention mechanism shown in Figure 2, global average pooling is performed on the input feature map with a size of 104 × 104 × 64 as per the following formula. After pooling, the output feature map size is 1 × 1 × 64.

C_global = (1 / (H × W)) Σ_{i=1}^{H} Σ_{j=1}^{W} I(i, j)    (3)

Here, H and W are the height and width of the facial feature map.

(2) Multi-Layer Perceptron (MLP) Learning: The paper uses a multi-layer perceptron with one hidden layer to learn the non-linear dependencies between channels. This can be represented as:

R(k) = Σ_{n=1}^{N−k−1} x(n) · x(n + k)    (7)

Here, F_0 is the fundamental frequency, R(k) is the autocorrelation function, x(n) is the signal value at time n, N is the frame length, and k is the delay amount.

b: SHORT-TERM ENERGY FEATURES
Literature [42] confirms the effectiveness of short-term energy E in speech emotion recognition, which is calculated by summing the squares of the sample points within the frame:

E = Σ_{n=0}^{N−1} x(n)²    (8)
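As a minimal sketch of the two frame-level measures above, the following NumPy code estimates the fundamental frequency from the autocorrelation of Eq. (7) and the short-term energy of Eq. (8). The frame length, sampling rate, and pitch search range are illustrative assumptions rather than the paper's settings.

```python
import numpy as np

def short_term_energy(frame: np.ndarray) -> float:
    """Short-term energy E = sum of squared samples in the frame (Eq. (8))."""
    return float(np.sum(frame ** 2))

def f0_autocorrelation(frame: np.ndarray, fs: int,
                       f0_min: float = 70.0, f0_max: float = 400.0) -> float:
    """Estimate F0 from the autocorrelation R(k) of Eq. (7):
    the lag k* that maximizes R(k) inside a plausible pitch range
    gives F0 = fs / k*."""
    n = len(frame)
    # Full autocorrelation; keep non-negative lags only.
    r = np.correlate(frame, frame, mode="full")[n - 1:]
    k_min, k_max = int(fs / f0_max), min(int(fs / f0_min), n - 1)
    k_star = k_min + int(np.argmax(r[k_min:k_max]))
    return fs / k_star

# Usage on one 32 ms frame of a synthetic 200 Hz tone sampled at 16 kHz.
fs = 16000
t = np.arange(int(0.032 * fs)) / fs
frame = np.sin(2 * np.pi * 200 * t)
print(short_term_energy(frame), f0_autocorrelation(frame, fs))
```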
derivative and second-order derivative of the mean value of the logarithmic energy of the MFCC are calculated. The formula for the first-order derivatives is as follows, and the second-order derivatives are calculated similarly to the first-order derivatives, where s is the range over which the difference is taken and is set to s = 2.

ΔC_n(m) = δC_n(m)/δm ≈ (1 / T_S) Σ_{t=−s}^{s} t · C_{n+t}(m)    (10)

T_S = Σ_{t=−s}^{s} t²    (11)

By combining the dynamic and static features of MFCC, the performance of the speech emotion recognition model can be effectively improved.

f: SAMPLE ENTROPY
Sample entropy (SampEn) is a statistical measure for quantifying the complexity of time series, proposed by Richman and others in 2000. It is particularly effective in analyzing nonlinear dynamic systems, such as speech signals, because it quantifies the complexity of time series from a probabilistic perspective. Compared to approximate entropy, sample entropy does not include self-comparison in its calculations, reducing data bias. Even with limited data, sample entropy can effectively estimate probabilities, hence providing higher detection accuracy [45]. Sample entropy is defined as the natural logarithm of the conditional probability that data vectors of dimension m remain similar when the dimension increases to m + 1. Specifically, sample entropy SampEn(m, r, N) is defined as:

SampEn(m, r, N) = −ln( B^{m+1}(r) / B^m(r) )    (12)

B^m(r) = (1 / (N − m + 1)) Σ_{i=1}^{N−m+1} B_i^m(r)    (13)

Here, m is the embedding dimension, r is the similarity threshold for judging whether two sequences are similar, and B^{m+1}(r) and B^m(r) are the normalized counts of similar sequence pairs at different dimensions. B_i^m(r) represents the similarity of a specific speech data vector with other vectors in the m-dimensional space. The calculation process can be described as: for a given embedding dimension m and similarity threshold r, calculate the similarity of a specific speech data vector X(i) with all other vectors X(j) in the m-dimensional space and count the number of similar vector pairs.

In the application of speech emotion recognition, sample entropy can reveal the complexity and dynamic changes of speech signals under different emotional states. For example, angry or excited speech may exhibit higher sample entropy, indicating more complex and variable signals, while calm or sad speech may have lower sample entropy, reflecting more stable and consistent characteristics. Therefore, incorporating sample entropy as a feature into the emotion recognition model can help us more accurately capture and differentiate speech features under different emotional states, thereby enhancing the accuracy and efficiency of emotion recognition.

2) SPEECH EMOTION RECOGNITION MODEL BASED ON IMPROVED SVM
SVM has been proven in the literature [46], [47] to have better recognition performance in speech emotion recognition (especially for angry emotions), which, together with its significant advantages in computational efficiency and memory usage, makes it suitable as a complementary recognizer alongside facial emotion recognition in multimodal emotion recognition tasks. In this paper, based on the traditional SVM model and considering the time-series characteristics of speech data, we improve the SVM kernel function and propose a hybrid kernel function combining Dynamic Time Warping (DTW) and the Radial Basis Function (RBF). The DTW algorithm can effectively deal with temporal elasticity in time-series data because it is able to bend the time axis in order to find the best correspondence between two time series, which means that even if the speech samples differ in time, it can find the similarity between them, thus improving the accuracy of the model for emotion classification.

The improved hybrid kernel function K(x, y) is defined as:

K(x, y) = α · K_DTW(x_{i,t}, x_{j,t'}) + (1 − α) · K_RBF(x_i, x_j)
        = α · exp(−γ_DTW · DTW(x_{i,t}, x_{j,t'})) + (1 − α) · exp(−γ_RBF · ‖x_i − x_j‖²)    (14)

where K_DTW is the DTW-based kernel, used for time-series features, with i and j representing the feature and label data, respectively, and x_{i,t}, x_{j,t'} are the values of the time series at a certain point in time (such as fundamental frequency, short-term energy, speech rate, and formant features); K_RBF is the Radial Basis Function (RBF) kernel, used for statistical features, where x_i, x_j are two data points in the feature space (such as MFCC and sample entropy features); α is a parameter to adjust the weight of the two, and γ_DTW and γ_RBF are the respective scaling parameters of the kernels.

The computation formula for DTW can be expressed as:

DTW(A, B) = min √( Σ_{t=1}^{T} (x_{i,t} − x_{j,t'})² )    (15)

C. DRIVING STATE EMOTION RECOGNITION BASED ON RF
1) EXTRACTION OF EMOTIONAL FEATURES FROM VEHICLE DRIVING STATE INFORMATION
In the recognition of anger emotion based on vehicle driving state, this paper not only selects vehicle acceleration and yaw rate, which represent the vehicle's motion parameters, but also, for the first time, proposes the use of three behavioral features representing vehicle-to-vehicle interaction: following distance, lane-changing frequency, and overtaking frequency.
These features are used to comprehensively extract the driver's driving state emotions for more accurate recognition of the driver's anger emotion.

a: VEHICLE ACCELERATION
The degree of speed change per unit time is known as acceleration, and its magnitude or rate of change directly affects the urgency of speed variation. Acceleration is a key indicator of the extent of vehicle speed change and reflects the driver's longitudinal control ability over the vehicle. Literature [36] indicates that as the intensity of the driver's anger increases, the fluctuation amplitude of vehicle acceleration also increases, leading to reduced smoothness in vehicle motion, i.e., the driver's longitudinal control ability over the vehicle decreases. Therefore, this paper extracts the mean value of vehicle acceleration a_av as one of the indicators for emotion recognition.

b: YAW RATE
When the driver is in a normal driving state, the range of changes in the vehicle's yaw rate is small, and the frequency of change is high. Under the state of anger, the range of acceleration change is larger, and the adjustment frequency is lower. Literature [36] shows that as the driver's anger intensity increases, the fluctuation amplitude of the vehicle's yaw rate also increases, leading to reduced lateral stability of the vehicle, i.e., the driver's lateral control ability over the vehicle decreases. Hence, this paper extracts the mean value of the vehicle's yaw rate ψ̇_av as one of the indicators for emotion recognition.

c: FOLLOWING DISTANCE
Following distance refers to the distance between vehicles traveling on the road. Generally, a reduced following distance allows vehicles to pass through intersections and other delay-prone areas more quickly, but as the following distance decreases, safety risks increase. Literature [48] indicates that the following distance tends to decrease under the driver's anger emotion. Therefore, this paper extracts the mean following distance g_av as one of the indicators for emotion recognition.

d: LANE-CHANGING FREQUENCY
Lane changing refers to a vehicle moving from its current lane to an adjacent lane on the road, often to avoid obstructions, pursue other vehicles, or turn at intersections. Lane changing can help the vehicle bypass obstacles or slow-moving vehicles in its current lane, maintaining a smooth speed. If one lane is faster than another, changing lanes can help the vehicle increase its speed and reach its destination more quickly. However, frequent lane changes can disrupt traffic flow and increase the risk of traffic accidents. Literature [29] indicates that the frequency of lane changes increases under the driver's anger emotion. Therefore, this paper extracts the frequency of lane changes cl_fr within a unit time window as one of the indicators for emotion recognition.

e: OVERTAKING FREQUENCY
Overtaking refers to a vehicle passing slower-moving vehicles ahead on the road, usually by moving into an adjacent lane. When a vehicle is moving fast, overtaking can help it quickly bypass the slower vehicles ahead, reducing traffic congestion. Reasonable overtaking can make traffic flow smoother and allow vehicles to travel at appropriate speeds, avoiding the formation of a slow-moving convoy. However, unsafe overtaking can lead to traffic accidents, especially in conditions of poor visibility, complex road conditions, or inappropriate timing for overtaking. On busy roads, frequent overtaking can disrupt traffic flow. Literature [49] shows that the overtaking frequency increases under the driver's anger emotion. Therefore, this paper extracts the frequency of overtaking otk_fr within a unit time window as one of the indicators for emotion recognition.

2) STATISTICAL ANALYSIS OF DRIVING STYLE BASED ON HISTORICAL DRIVING DATA
Due to individual differences in driving style (habits) or driving experience, driving behaviors may vary among different drivers. Therefore, in addition to the emotional (anger) state affecting the driving behavior characteristics of the driver, the individual differences between drivers can also have a certain impact on these characteristics, thereby affecting the accuracy of the anger recognition model for driving. However, the anger characteristics of the same subject have strong stability over different periods. Hence, this paper adopts a driving style analysis method based on historical driving data to reduce the impact of individual differences among drivers on the recognition of their anger state, thereby enhancing the robustness and accuracy of the recognition model.

For the feature subset obtained earlier, M = {a_av, ψ̇_av, g_av, cl_fr, otk_fr}, calculate the mean value of each feature under the normal state for each subject. This mean value is taken as the reference value. Let the mean value of the feature parameters of the ith subject under normal driving state be R_i, that is, the reference value. The equation is as follows:

R_i = (1 / N) Σ_{i=1}^{N} M_i    (16)

Finally, the feature parameter values of all subjects under normal state are inputted into the training sample library as the statistical values of each subject's driving style.

3) DRIVING STATE EMOTION RECOGNITION MODEL BASED ON RANDOM FOREST
The random forest model builds multiple decision trees and derives the final classification result by majority voting among these trees. This makes the random forest model
generally robust against outliers and noise and reduces the risk of overfitting. Moreover, random forests adapt better to unbalanced datasets. Based on these advantages, this paper considers the random forest model to be more suitable for the task of vehicle driving state emotion recognition. The implementation steps of the model are as follows:

(1) Feature Centralization
To better capture the emotional differences of drivers and reduce the impact of individual differences on anger state recognition, feature data must be centralized before applying the random forest model. This is done by obtaining the average feature value for each individual under normal conditions from the sample library and then subtracting it from the actual feature value to obtain the centralized feature value C_i. The process can be represented as:

C_i = M_i − R_i    (17)

where M_i is the actual value of the feature of the ith driver, and R_i is its average feature value.

(2) Bootstrap Sampling
Randomly select samples from the original dataset to create a new dataset. Let the size of the original dataset be N. Then N samples are selected from the original dataset with replacement to form a new dataset D_i:

D_i = {(x*_1, y*_1), (x*_2, y*_2), · · · , (x*_N, y*_N)}    (18)

where D_i represents the training data for the ith tree, and (x*_i, y*_i) is a sample randomly selected from the original dataset.

(3) Decision Tree Construction with Feature Subsets
In terms of using driving state information for emotion recognition, the literature [39] has demonstrated that random forest (RF) models have good performance. Each decision tree is trained on its corresponding bootstrap sampling dataset D_i. However, in the splitting process of each node, not all features are considered; instead, a random feature subset is selected. Assuming the original number of features is M, m features are randomly selected at each node split.

(4) Decision Tree Ensemble
Using the above method, T decision trees are constructed in the random forest, each independently trained on a different bootstrap sampling dataset.

(5) Prediction
The prediction of the random forest is based on the predictions of all its decision trees. Specifically, for classification tasks, each tree provides a classification prediction, and the final prediction of the random forest is the mode of these classifications:

ŷ = mode(ŷ_1, ŷ_2, · · · , ŷ_T)    (19)

Here, ŷ_i represents the prediction of the ith tree, and mode represents taking the mode of these classification results, i.e., the most frequently occurring classification prediction.

D. ADAPTIVE MULTIMODAL ANGER EMOTION RECOGNITION BASED ON CA-RL
1) ADAPTIVE WEIGHT ALLOCATION BASED ON CA-RL
In multimodal emotion recognition, how to assign appropriate weights to each modality is the key issue. In order to make the system more adaptable to different driving environments and individual differences, we propose a method based on reinforcement learning to perceive the scenarios, find the optimal strategy in different scenarios by interacting with the environment, and dynamically assign weights to each modality, so as to improve the model's adaptability and generalization ability.

a: MODEL DEFINITION
State space s_t: the emotion recognition scene, characterized by the degree of traffic congestion C_tr and the state of the number of people inside the vehicle H_pe. This includes the driver's emotions, vehicle state, traffic conditions, etc. Their representation is as follows:

C_tr = N / L    (20)

H_pe = 1 if ΔE_t > e, and H_pe = 0 otherwise    (21)

Here, N is the number of vehicles at a given moment within the observed road section, L is the length of the observed section, ΔE_t is the short-term energy difference in voice between adjacent windows, and e is a threshold. The number of people inside the vehicle is detected by voice energy fluctuation. When a single person speaks, the short-term energy usually shows a more consistent pattern, whereas in multi-person conversations, energy fluctuations may be more pronounced and frequent. This is because different speakers usually have different voice intensities and rhythms, causing more peaks and valleys in energy levels during multi-person conversations.

Action space a_t: the weight parameters of the three modalities in the decision-level fusion of the anger recognition models, represented as [w_1, w_2, w_3].

Reward function r_t: given based on the consistency of the emotion recognition result with the actual emotion, defined as:

r_t = 1 if the prediction matches the actual emotion, and r_t = 0 otherwise    (22)

b: REINFORCEMENT LEARNING ALGORITHM
This paper chooses the Deep Q Network (DQN) to implement dynamic weight adjustment. DQN attempts to estimate an action-value function Q(s, a) representing the expected return of choosing action a in state s. The core update formula of DQN is:

Q(s_t, a_t) = r_t + γ max_{a'} Q(s_{t+1}, a')    (23)

Here, γ is the discount factor for future rewards, and max_{a'} Q(s_{t+1}, a') is the maximum estimate of the expected return for all possible actions a' in the future state s_{t+1}.

TABLE 2. Overview of the AffectNet dataset.
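To make the CA-RL weight-allocation loop concrete, the following sketch builds the context state from Eqs. (20)-(21) and performs a one-step value update in the spirit of Eq. (23). For brevity it uses a tabular Q-function over a coarsely discretized state and action space; the paper's DQN would replace the table with a neural network, and the thresholds, learning rate, and weight grid shown here are illustrative assumptions.

```python
import itertools
import random
from collections import defaultdict

# Discretized action space: candidate weight triples [w1, w2, w3] summing to 1
# (facial, voice, and driving-state modalities).
ACTIONS = [(w1 / 10, w2 / 10, w3 / 10)
           for w1, w2, w3 in itertools.product(range(11), repeat=3)
           if w1 + w2 + w3 == 10]

def context_state(n_vehicles, section_len_m, energy_delta, e_thresh=0.1):
    """State s_t from traffic density C_tr = N / L (Eq. (20)) and the occupancy
    flag H_pe (Eq. (21)), coarsely discretized for this tabular sketch."""
    c_tr = n_vehicles / section_len_m
    h_pe = 1 if energy_delta > e_thresh else 0
    congested = 1 if c_tr > 0.05 else 0   # illustrative density threshold
    return (congested, h_pe)

Q = defaultdict(float)              # Q[(state, action)] -> estimated return
alpha, gamma, eps = 0.1, 0.9, 0.2   # learning rate, discount, exploration rate

def select_weights(state):
    """Epsilon-greedy choice of modality weights for the current context."""
    if random.random() < eps:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q[(state, a)])

def update(state, action, reward, next_state):
    """Move Q(s_t, a_t) toward r_t + gamma * max_a' Q(s_{t+1}, a') (cf. Eq. (23))."""
    target = reward + gamma * max(Q[(next_state, a)] for a in ACTIONS)
    Q[(state, action)] += alpha * (target - Q[(state, action)])

# One interaction step: reward is 1 if the fused prediction is correct (Eq. (22)).
s = context_state(n_vehicles=12, section_len_m=500, energy_delta=0.03)
w = select_weights(s)
r = 1  # assume the fused prediction matched the labeled emotion in this step
update(s, w, r, next_state=s)
```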
TABLE 4. Data collection equipment.

c: DATA PREVIEW
The collected multimodal data were filtered, including the exclusion of driving state video data where the main vehicle was not captured and the removal of face video data where faces were obscured. This resulted in the Multimodal Data dataset, which includes 600 samples. Each sample consists of simultaneously recorded facial data, voice data, and vehicle driving state data, with the composition of emotion labels detailed in Table 6.

TABLE 6. Overview of the multimodal data dataset.

important in the model training process. Table 7 lists the parameter settings for our experiments.

C. EVALUATION METRICS
This study uses Accuracy and F1 score as the evaluation criteria for the model.

a: ACCURACY
Accuracy is the most intuitive performance metric, representing the proportion of correctly predicted instances to the total number of predictions. It reflects the credibility of the model's predictions. Its calculation formula can be expressed as:

ACC = (TP + TN) / (TP + TN + FP + FN)    (26)

b: F1 SCORE
The F1 score is the harmonic mean of Precision and Recall. Its calculation formula is:

F1 = (2 × precision × recall) / (precision + recall)    (27)
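For reference, Eqs. (26)-(27) correspond directly to the standard scikit-learn metric functions, as in the following minimal sketch with purely illustrative binary anger labels.

```python
from sklearn.metrics import accuracy_score, f1_score

# 1 = anger, 0 = non-anger; illustrative labels only.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

acc = accuracy_score(y_true, y_pred)  # (TP + TN) / (TP + TN + FP + FN), Eq. (26)
f1 = f1_score(y_true, y_pred)         # 2 * precision * recall / (precision + recall), Eq. (27)
print(f"ACC = {acc:.4f}, F1 = {f1:.4f}")
```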
FIGURE 10. Traffic density perception schematic diagram.

TABLE 11. Performance results of the CA-MDER model on the dataset.

Due to the diversity of modalities and data types used in multimodal emotion recognition, there is currently no unified data platform covering all modalities. Therefore, this study selected the following representative multimodal anger emotion recognition methods for a brief comparison:

CNN+Bi-LSTM+HAM [38]: This method introduces HAM on the basis of the CNN + Bi-LSTM network framework. The mechanism can consider features of different levels and types and adaptively adjust the attention mechanism to selectively focus on key facial features. This multimodal framework combines driver's voice, facial images, and video sequence data for emotion recognition, achieving an anger emotion recognition accuracy of 85%.

CBLNN [40]: This method uses facial geometric features obtained by a Convolutional Neural Network (CNN) as intermediate variables for Bidirectional Long Short-Term Memory (Bi-LSTM) heart rate analysis. Subsequently, the output of the Bi-LSTM is used as input for the CNN module to extract listening rate features. Finally, Multimodal Factorized Bilinear Pooling (MFB) is used to fuse the extracted information for emotion recognition. Using facial and heart rate data, this dual-modal method achieves an anger emotion recognition accuracy of 90.5%.

AM-LSTM [36]: This study combines LSTM with an attention mechanism for anger emotion recognition through multimodal decision-level fusion of EEG signals, physiological signals, and driving behavior information, achieving an accuracy of 76.3%.

SVM-DS [34]: This study based on ECG signals and driving behavior signals for anger emotion multimodal fusion

TABLE 12. Performance comparison of various multimodal recognition methods.

F. ABLATION EXPERIMENT
To verify the effectiveness of the context-aware module and multimodal fusion, an ablation experiment was designed for the proposed model, including the removal of context awareness and fusion of each single modality data. M represents the multimodal task, F represents the facial single modality, V represents the voice single modality, S represents the driving state single modality, and Awareness represents the context-awareness module. The ablation experiment results for each indicator are shown in Table 13.

TABLE 13. Results of modality ablation experiment: Multimodal dataset.

V. RESULTS AND DISCUSSION
To enhance the accuracy of driver anger emotion recognition, this study proposed a context-aware multimodal driver anger emotion recognition method (CA-MDER), integrating facial, voice, and vehicle driving state information. First, anger emotion recognition was conducted for each single modality. To improve recognition accuracy, we initially proposed a facial emotion recognition method based on Attention Mechanism Deep Separable Convolutional Neural Network (AM-DSCNN). This method focuses on key facial features determining emotions using an attention module and then enhances model computational efficiency by introducing deep separable convolution modules. Next, the SVM used for speech emotion recognition was improved by considering the temporal characteristics of voice data and proposing a hybrid
kernel function combining Dynamic Time Warping (DTW) and RBF, effectively handling time elasticity in time series data and improving the optimal correspondence between time series. Finally, for vehicle driving state emotion recognition, features capturing vehicle interactions were introduced, and a driver's driving style recognition module was proposed to mitigate the impact of driving heterogeneity on anger emotion recognition. After completing emotion recognition for each modality, this study used the optimal weight allocation under adaptive scenarios obtained through context awareness for multimodal decision-level fusion, ultimately outputting emotion classification results.

The recognition results show (Table 11) that the CA-MDER model proposed in this paper achieves an accuracy of 91.68% and an F1 score of 90.37%. Compared to other advanced multimodal anger emotion recognition models (Table 12), CA-MDER achieves high levels in both accuracy and F1, surpassing existing multimodal recognition models. Additionally, the ablation experiment revealed some notable points. Although the recognition accuracies of individual modalities vary significantly, the overall recognition rate still increases after multimodal fusion, aligning with current mainstream research and possibly corroborating the view that drivers may hide their emotions in certain modalities under some scenarios. However, in some cases, even using more modalities for fusion may not yield ideal results. For example, in the ablation experiment, directly fusing facial, voice, and driving state modalities without using context-aware adaptive weight allocation only slightly improved recognition rate by about 1% (compared to the highest accuracy of facial single modality, same below). In contrast, using context-aware adaptive weight allocation for fusion of facial and driving state modalities alone can improve recognition accuracy by about 3%. If fusing facial, voice, and driving state modalities, the improvement in recognition accuracy can reach about 5%. This validates the effectiveness of the context-aware adaptive weight allocation method proposed in this study.

Looking forward, under the support of the National Key R&D Program Project "Major Accident Risk Prevention and Emergency Avoidance Technology for Road Transport Vehicles" (2023YFC3009600) and the Graduate Innovation Fund of Jilin University, this study will explore applications in accident risk prevention, emergency avoidance, and other aspects related to road transport vehicles. Due to limitations in research duration and measurement methods, this paper has some limitations, and we plan to make improvements in the following areas: a. In this paper, the attention mechanism is only used to focus on facial features. In the future, we will introduce the attention mechanism into other modalities' data to explore possibilities for improving overall recognition accuracy; b. The type and amount of data in real vehicle scenarios is relatively small, and multimodal data can be collected more easily in the future, e.g., through on-board sensors or vehicle networking; c. The scenarios used to train the context-aware adaptive weight allocation method in this paper are somewhat limited. In the future, we will consider enriching the types of scenarios from aspects such as weather and light intensity.

REFERENCES
[1] B. Parkinson, "Anger on and off the road," Brit. J. Psychol., vol. 92, no. 3, pp. 507-526, Aug. 2001, doi: 10.1348/000712601162310.
[2] J. L. Deffenbacher, E. R. Oetting, and R. S. Lynch, "Development of a driving anger scale," Psychol. Rep., vol. 74, no. 1, pp. 83-91, Feb. 1994, doi: 10.2466/pr0.1994.74.1.83.
[3] E. R. Dahlen, R. C. Martin, K. Ragan, and M. M. Kuhlman, "Driving anger, sensation seeking, impulsiveness, and boredom proneness in the prediction of unsafe driving," Accident Anal. Prevention, vol. 37, no. 2, pp. 341-348, Mar. 2005, doi: 10.1016/j.aap.2004.10.006.
[4] J. Lu, X. Xie, and R. Zhang, "Focusing on appraisals: How and why anger and fear influence driving risk perception," J. Saf. Res., vol. 45, pp. 65-73, Jun. 2013, doi: 10.1016/j.jsr.2013.01.009.
[5] P. Ekman and W. V. Friesen, "Facial action coding system," Environ. Psychol. Nonverbal Behav., Jan. 1978.
[6] A. Barman and P. Dutta, "Facial expression recognition using distance and texture signature relevant features," Appl. Soft Comput., vol. 77, pp. 88-105, Apr. 2019, doi: 10.1016/j.asoc.2019.01.011.
[7] T. Ojala, M. Pietikäinen, and D. Harwood, "A comparative study of texture measures with classification based on featured distributions," Pattern Recognit., vol. 29, no. 1, pp. 51-59, Jan. 1996.
[8] N. Dalal and B. Triggs, "Histograms of oriented gradients for human detection," in Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit. (CVPR), 2005, pp. 886-893, doi: 10.1109/CVPR.2005.177.
[9] H. Ali, M. Hariharan, S. Yaacob, and A. H. Adom, "Facial emotion recognition using empirical mode decomposition," Expert Syst. Appl., vol. 42, no. 3, pp. 1261-1277, Feb. 2015.
[10] G. Zhao, X. Huang, M. Taini, S. Z. Li, and M. Pietikäinen, "Facial expression recognition from near-infrared videos," Image Vis. Comput., vol. 29, no. 9, pp. 607-619, Aug. 2011.
[11] L. Zhang, K. Mistry, M. Jiang, S. C. Neoh, and M. A. Hossain, "Adaptive facial point detection and emotion recognition for a humanoid robot," Comput. Vis. Image Understand., vol. 140, pp. 93-114, Nov. 2015.
[12] Z. Zhang, L. Cui, X. Liu, and T. Zhu, "Emotion detection using Kinect 3D facial points," in Proc. IEEE/WIC/ACM Int. Conf. Web Intell. (WI), Oct. 2016, pp. 407-410. Accessed: Nov. 9, 2023. [Online]. Available: https://ieeexplore.ieee.org/abstract/document/7817080/
[13] R. Jiang, A. T. S. Ho, I. Cheheb, N. Al-Maadeed, S. Al-Maadeed, and A. Bouridan, "Emotion recognition from scrambled facial images via many graph embedding," Pattern Recognit., vol. 67, pp. 245-251, Jul. 2017.
[14] K. Candra Kirana, S. Wibawanto, and H. Wahyu Herwanto, "Facial emotion recognition based on viola-jones algorithm in the learning environment," in Proc. Int. Seminar Appl. Technol. Inf. Commun., Sep. 2018, pp. 406-410. Accessed: Nov. 9, 2023. [Online]. Available: https://ieeexplore.ieee.org/abstract/document/8549735/
[15] A. T. Lopes, E. de Aguiar, A. F. De Souza, and T. Oliveira-Santos, "Facial expression recognition with convolutional neural networks: Coping with few data and the training sample order," Pattern Recognit., vol. 61, pp. 610-628, Jan. 2017.
[16] M. Liu, S. Li, S. Shan, and X. Chen, "AU-aware deep networks for facial expression recognition," in Proc. 10th IEEE Int. Conf. Workshops Autom. Face Gesture Recognit. (FG), Apr. 2013, pp. 1-6, doi: 10.1109/FG.2013.6553734.
[17] H.-W. Ng, V. D. Nguyen, V. Vonikakis, and S. Winkler, "Deep learning for emotion recognition on small datasets using transfer learning," in Proc. ACM Int. Conf. Multimodal Interact., Nov. 2015, pp. 443-449, doi: 10.1145/2818346.2830593.
[18] R. S. John, S. B. Alex, M. S. Sinith, and L. Mary, "Significance of prosodic features for automatic emotion recognition," AIP Conf. Proc., vol. 2222, no. 1, Apr. 2020, Art. no. 030003, doi: 10.1063/5.0004235.
[19] C. Nussbaum, A. Schirmer, and S. R. Schweinberger, "Contributions of fundamental frequency and timbre to vocal emotion perception and their electrophysiological correlates," Social Cognit. Affect. Neurosci., vol. 17, no. 12, pp. 1145-1154, Dec. 2022.
[20] S. Lalitha, D. Geyasruti, R. Narayanan, and S. M, "Emotion detection using MFCC and cepstrum features," Proc. Comput. Sci., vol. 70, pp. 29-35, Jan. 2015.
[21] Y. Zhou, J. Li, Y. Sun, J. Zhang, Y. Yan, and M. Akagi, "A hybrid speech emotion recognition system based on spectral and prosodic features," IEICE Trans. Inf. Syst., vol. E93-D, no. 10, pp. 2813-2821, 2010.
[22] H. Aouani and Y. B. Ayed, "Speech emotion recognition with deep learning," Proc. Comput. Sci., vol. 176, pp. 251-260, Jan. 2020, doi: 10.1016/j.procs.2020.08.027.
[23] T. M. Rajisha, A. P. Sunija, and K. S. Riyas, "Performance analysis of Malayalam language speech emotion recognition system using ANN/SVM," Proc. Technol., vol. 24, pp. 1097-1104, Jan. 2016.
[24] E. M. Albornoz, D. H. Milone, and H. L. Rufiner, "Spoken emotion recognition using hierarchical classifiers," Comput. Speech Lang., vol. 25, no. 3, pp. 556-570, Jul. 2011.
[25] Q. Mao, M. Dong, Z. Huang, and Y. Zhan, "Learning salient features for speech emotion recognition using convolutional neural networks," IEEE Trans. Multimedia, vol. 16, no. 8, pp. 2203-2213, Dec. 2014.
[26] L. Sun, S. Fu, and F. Wang, "Decision tree SVM model with Fisher feature selection for speech emotion recognition," EURASIP J. Audio, Speech, Music Process., vol. 2019, no. 1, p. 2, Dec. 2019, doi: 10.1186/s13636-018-0145-5.
[27] S. Kwon, "CLSTM: Deep feature-based speech emotion recognition using the hierarchical ConvLSTM network," Mathematics, vol. 8, no. 12, p. 2133, Nov. 2020.
[28] H. M. M. Hasan and Md. A. Islam, "Emotion recognition from Bengali speech using RNN modulation-based categorization," in Proc. 3rd Int. Conf. Smart Syst. Inventive Technol. (ICSSIT), Aug. 2020, pp. 1131-1136. Accessed: Dec. 19, 2023. [Online]. Available: https://ieeexplore.ieee.org/abstract/document/9214196/
[29] H. Lei, "The characteristics of angry driving behaviors and its effects on traffic safety," M.S. thesis, Wuhan Univ. Technol., 2011.
[30] M. Zhong, H. Hong, and Z. Yuan, "Experiment research on influence of angry emotion for driving behaviors," J. Chongqing Univ. Technol. Natural Sci., vol. 25, no. 10, pp. 6-11, 2011.
[31] F. Techer, C. Jallais, Y. Corson, F. Moreau, D. Ndiaye, B. Piechnick, and A. Fort, "Attention and driving performance modulations due to anger state: Contribution of electroencephalographic data," Neurosci. Lett., vol. 636, pp. 134-139, Jan. 2017.
[32] L. Precht, A. Keinath, and J. F. Krems, "Effects of driving anger on driver behavior—Results from naturalistic driving data," Transp. Res. F, Traffic Psychol. Behav., vol. 45, pp. 75-92, Feb. 2017.
[33] S. Shafaei, T. Hacizade, and A. Knoll, "Integration of driver behavior into emotion recognition systems: A preliminary study on steering wheel and vehicle acceleration," in Proc. Asian Conf. Comput. Vis., G. Carneiro and S. You, Eds., Cham, Switzerland: Springer, 2019, pp. 386-401, doi: 10.1007/978-3-030-21074-8_32.
[34] F. Wang, "Research on driver anger emotion recognition method based on multimodal fusion," M.S. thesis, Jilin Univ., 2023.
[35] X. Yu, "Driver's anger recognition method based on human-vehicle environment information fusion," M.S. thesis, Shandong Univ. Technol., 2021.
[36] P. Wang, "Study on multimodal recognition method of driver's anger emotion and mechanism of driving risk under anger emotion," M.S. thesis, Chongqing Univ., 2020.
[37] J. Tang, Z. Ma, K. Gan, J. Zhang, and Z. Yin, "Hierarchical multimodal-fusion of physiological signals for emotion recognition with scenario adaption and contrastive alignment," Inf. Fusion, vol. 103, Mar. 2024, Art. no. 102129, doi: 10.1016/j.inffus.2023.102129.
[38] D. Zhou, Y. Cheng, L. Wen, H. Luo, and Y. Liu, "Drivers' comprehensive emotion recognition based on HAM," Sensors, vol. 23, no. 19, p. 8293, Oct. 2023, doi: 10.3390/s23198293.
[39] J. Ni, W. Xie, Y. Liu, J. Zhang, Y. Wan, and H. Ge, "Driver emotion recognition involving multimodal signals: Electrophysiological response, nasal-tip temperature, and vehicle behavior," J. Transp. Eng., A, Syst., vol. 150, no. 1, Jan. 2024, Art. no. 04023125, doi: 10.1061/jtepbs.teeng-7802.
[40] G. Du, Z. Wang, B. Gao, S. Mumtaz, K. M. Abualnaja, and C. Du, "A convolution bidirectional long short-term memory neural network for driver emotion recognition," IEEE Trans. Intell. Transp. Syst., vol. 22, no. 7, pp. 4570-4578, Jul. 2021, doi: 10.1109/TITS.2020.3007357.
[41] W. Li, Y. Cui, Y. Ma, X. Chen, G. Li, G. Zeng, G. Guo, and D. Cao, "A spontaneous driver emotion facial expression (DEFE) dataset for intelligent vehicles: Emotions triggered by video-audio clips in driving scenarios," IEEE Trans. Affect. Comput., vol. 14, no. 1, pp. 747-760, Jan. 2023, doi: 10.1109/TAFFC.2021.3063387.
[42] A. J. Kayal and J. Nirmal, "Multilingual vocal emotion recognition and classification using back propagation neural network," in Proc. Advancement Sci. Technol., 2nd Int. Conf. Commun. Syst. (ICCS), Rajasthan, India, 2016, Art. no. 020054, doi: 10.1063/1.4942736.
[43] K. Scherer, "Vocal communication of emotion: A review of research paradigms," Speech Commun., vol. 40, nos. 1-2, pp. 227-256, Apr. 2003, doi: 10.1016/s0167-6393(02)00084-5.
[44] M. Abdelwahab and C. Busso, "Evaluation of syllable rate estimation in expressive speech and its contribution to emotion recognition," in Proc. IEEE Spoken Lang. Technol. Workshop (SLT), South Lake Tahoe, NV, USA, Dec. 2014, pp. 472-477, doi: 10.1109/SLT.2014.7078620.
[45] R. Alcaraz and J. J. Rieta, "A novel application of sample entropy to the electrocardiogram of atrial fibrillation," Nonlinear Anal., Real World Appl., vol. 11, no. 2, pp. 1026-1035, Apr. 2010.
[46] L. Sun, B. Zou, S. Fu, J. Chen, and F. Wang, "Speech emotion recognition based on DNN-decision tree SVM model," Speech Commun., vol. 115, pp. 29-37, Dec. 2019, doi: 10.1016/j.specom.2019.10.004.
[47] S. Kanwal, S. Asghar, and H. Ali, "Feature selection enhancement and feature space visualization for speech-based emotion recognition," PeerJ Comput. Sci., vol. 8, p. e1091, Nov. 2022, doi: 10.7717/peerj-cs.1091.
[48] T. Zimasa, S. Jamson, and B. Henson, "The influence of driver's mood on car following and glance behaviour: Using cognitive load as an intervention," Transp. Res. F, Traffic Psychol. Behav., vol. 66, pp. 87-100, Oct. 2019, doi: 10.1016/j.trf.2019.08.019.
[49] L. Shamoa-Nir, "Road rage and aggressive driving behaviors: The role of state-trait anxiety and coping strategies," Transp. Res. Interdiscipl. Perspect., vol. 18, Mar. 2023, Art. no. 100780, doi: 10.1016/j.trip.2023.100780.
[50] A. Mollahosseini, B. Hasani, and M. H. Mahoor, "AffectNet: A database for facial expression, valence, and arousal computing in the wild," IEEE Trans. Affect. Comput., vol. 10, no. 1, pp. 18-31, Jan. 2019.
[51] J. Kossaifi, R. Walecki, Y. Panagakis, J. Shen, M. Schmitt, F. Ringeval, J. Han, V. Pandit, A. Toisoul, B. Schuller, K. Star, E. Hajiyev, and M. Pantic, "SEWA DB: A rich database for audio-visual emotion and sentiment research in the wild," IEEE Trans. Pattern Anal. Mach. Intell., vol. 43, no. 3, pp. 1022-1040, Mar. 2021.

TONGQIANG DING received the M.S. and Ph.D. degrees from the School of Transportation, Jilin University, China, in 2001 and 2005, respectively. He worked at the University of Minnesota, USA, as a Visiting Scholar, in 2014. He is currently an Associate Professor with the School of Transportation, Jilin University, where he is also the Head of the Department of Traffic Engineering, School of Transportation. His research interests include traffic safety of traditional road traffic systems and emerging intelligent transport systems represented by intelligent vehicles and intelligent networks, involving fundamental theories, methods, technologies, and practical applications of traffic safety.

KEXIN ZHANG received the B.S. degree from Shandong University of Science and Technology, in 2022. He is currently pursuing the M.S. degree with the School of Transportation, Jilin University. His research interests include the analysis and recognition of drivers' psycho-behavioural characteristics and in-vehicle intelligent systems.

SHUAI GAO received the master's degree from the School of Transportation, Jilin University, in 2013. She became a full-time Faculty Member with the Department of Urban Railway Operation Management, School of Railway and Transportation, Jilin Jiaotong Vocational and Technical College, in 2014, and the Director of the Department of Operation, in 2022. Her research interests include intelligent transport systems, on-board intelligent systems, and traffic microsimulation.

JIANFENG XI received the M.S. and Ph.D. degrees from the School of Transportation, Jilin University, China, in 2003 and 2007, respectively. He worked at the University of Minnesota, USA, as a Postdoctoral Fellow, in 2010. He became a Professor with the School of Transportation, Jilin University, in 2016. His research interests include driving safety and management and emergency management and simulation systems.