Multi-Modal Human Behaviour Graph Representation Learning For Automatic Depression Assessment
Abstract— Automatic depression assessment (ADA) often relies on crucial cues embedded in human verbal and non-verbal behaviours, which exist in the video, audio, and text modalities. Although these modalities are typically presented in time-series form,
current research offers limited exploration of the complex intra-modal temporal dynamics inherent to each modality, failing to extract depression-related cues from a global view. While many methodologies attempt to exploit the multifaceted information encoded across modalities via decision-level or feature-level fusion techniques, they often fall short in effectively representing pairwise inter-modal relationships, which are the key to exploiting the distinct complementary relationship between each modality pair. This paper presents a novel graph-based multimodal fusion approach that can conveniently model intra-modal and inter-modal dynamics using a graph representation. It adopts undirected edges to link not only temporally continuous, pre-extracted features of each modality, but also temporally aligned features across each pair of modalities. This ensures the seamless propagation of global information across the temporal dimension and helps capture pairwise inter-modal dynamics. We conduct experiments on the E-DAIC dataset to demonstrate our approach's effectiveness, achieving an RMSE of 4.80 and a CCC of 0.563, which rival the top-performing method. We also experiment on the AFAR-BSFP dataset to show the generality of our approach. Our code will be made publicly available.

I. INTRODUCTION

Depression is a highly prevalent mental health disorder that exerts a detrimental influence on an individual's feelings and behaviours [36]. Traditional diagnostic methods, primarily reliant on professional interviews, are time-consuming, subjective, and strain limited mental health resources [27], [45]. Recognizing these challenges, recent research has shifted its focus to applying deep learning to automatic depression assessment (ADA). The majority of these ADA studies focus on analyzing video [57], [78], [19], [28], [32], [25], [58], [1], audio [75], [48], [50], and textual data [17], [41], [12] expressed by the target subject, as these modalities can not only be easily recorded but also contain rich depression-related cues.

Since existing ADA approaches frequently attempt to make predictions based on time-series signals (e.g., video, audio and text), a key challenge is how to properly utilize intra-modal temporal dynamics to extract depression-related cues from each modality. However, most of these approaches fail to consider, or deliberately overlook, the temporal relationships within each modality. Some of them [31], [33], [64], [32] eliminate the temporal properties of the raw input by extracting hand-crafted features combined with statistical methods (average, sum, frequency, etc.). Others [59], [78], [18] segment the input into many small chunks, make predictions for each chunk, and then average them to obtain the final prediction. Although several approaches leverage Temporal Convolutional Networks (TCNs) [22], [72] and Recurrent Neural Networks (RNNs) [48], [23], [2] to encode temporal dynamics within each modality, they are limited by one-way induction and long-dependency issues, respectively.

To obtain enhanced depression assessment predictions, researchers have also investigated how to make predictions from multiple modalities. Consequently, it is important to explore the relationships between different modalities (i.e., modelling inter-modal dynamics) in order to optimally combine them for ADA. To achieve this, feature-level fusion methods that concatenate features from different modalities into a single high-dimensional vector [50], [51] have been frequently employed. However, these approaches mistakenly assume that modalities are conditionally independent [46], missing out on capturing important pairwise inter-modal relationships. Alternatively, decision-level fusion strategies mainly combine the predictions of separate uni-modal models, which overlooks the dynamic relationships between modalities [5].

To address the issues discussed above, this paper proposes a novel graph-based multi-modal fusion strategy for the ADA task, which aims to effectively model both intra-modal and inter-modal dynamics. Our approach constructs a multi-modal graph with nodes representing chunk-level depression-relevant features from various modalities. This graph structure includes undirected edges between temporally adjacent nodes within the same modality, allowing each node to consider information from both its past and future states during reasoning. We also establish inter-modal edges between temporally aligned nodes from different modalities, enabling the graph to explicitly capture complex inter-modal relationships. These inter-modal edges also act as shortcuts for information sharing, overcoming the limitations of RNN-style models in efficiently transmitting information across temporally non-adjacent nodes. In summary, the main novelties and contributions of this paper are summarised as follows:
• Our study pioneers the use of graphs to explore and represent the intra-modal and inter-modal dynamics of time-series data in ADA, by modelling the temporal dynamics within the same modality and across different modalities using graph representations.
• With the proposed graph structure, we address the efficiency and long-dependency limitations typically associated with RNN-style models.
• Our methodology explicitly models the pairwise relationship between different modalities within the graph framework, which enables our approach to achieve performance comparable with the SOTA on the E-DAIC dataset.

II. RELATED WORKS

A. Automatic depression assessment approaches

In the domain of ADA, the use of the video, audio, and text modalities has been extensively researched. To capture visual cues, Song et al. [59] employ a histogram-based approach to quantify the average occurrences of human facial primitives, using an MLP for training. In another study [57], they apply a Fourier transform to facial features, achieving a fixed-length spectral representation conducive to their training process. He et al. [26] develop a comprehensive framework that combines local-attention-based and global-attention-based Convolutional Neural Networks to capture facial features at different scales. Pampouchidou et al. [44] focus on dynamic facial expressions, utilizing algorithms including local binary patterns on motion history images. Xie et al. [70] design an end-to-end framework tailored for variable-length videos, integrating a 3D CNN for exploring local temporal patterns and a redundancy-aware self-attention (RAS) scheme for aggregating global features. Melo et al. [19] adopt a two-stream CNN network, separately processing appearance and temporal features, followed by a score fusion method to integrate the predictions from both streams. Wang et al. [69] select key frames from videos, combine adjacent frames within a certain window, and process them through separate LSTM networks, with a global max-pooling layer aggregating the outputs. Finally, Niu et al. [43] segment videos into fixed-length clips, subsequently analyzing them using a spatio-temporal attention (STA) network.

For the audio modality, Ye et al. [72] extract deep features using DeepSpectrum, subsequently integrating these features into a customized Temporal Convolutional Network (TCN), with the final layer employing relational attention classification for output activation. Zhang et al. [76] develop a self-supervised convolutional encoder-decoder network dedicated to extracting features from spectrogram images of audio clips; these features are then processed through a 4-layer, 128-dimensional LSTM network. In a differing approach, Vazquez et al. [66] utilize spectrograms directly as input to their 1D Convolutional Neural Network (CNN), which draws inspiration from DepAudioNet [40]. Likewise, Lin et al. [38] also perform spectrogram extraction from audio data, subsequently channeling these into a 1D-CNN for analysis. Finally, Toto et al. [65] segment raw audio into multiple overlapping sub-clips for feature extraction, train these features using Support Vector Machines (SVMs), and then employ mean pooling to aggregate the outputs for the final prediction.

Regarding the text modality, Chiong et al. [17] extract hand-crafted features using bag-of-words (BoW) and n-gram techniques from a Twitter dataset [55], subsequently utilizing these features in different machine learning classifiers for depression detection. Ray et al. [48] employ the pre-trained Universal Sentence Encoder [14] to derive sentence-level embeddings, which are then padded and processed through a 2-layer Bi-LSTM network. Lin et al. [38] exploit the capabilities of the pre-trained language model ELMo to encode textual data, followed by training with a Bi-LSTM network enhanced with an attention layer. In a similar vein, Shen et al. [56] also utilize ELMo for feature extraction, with the training process leveraging a Bi-LSTM network. Amanat et al. [3] adopt a one-hot encoding technique to quantify the frequency of key depressive words in a pre-cleaned dataset, feeding these features into an LSTM-RNN model for depression assessment. Additionally, Ye et al. [72] employ the Continuous Bag of Words (CBOW) method for text feature extraction, followed by training with a customized transformer model.

For enhanced performance and the utilization of complementary information across different modalities, a variety of multimodal fusion methods have been employed in ADA. Rohanian et al. [51] employ a feature-level fusion strategy that concatenates feature vectors from the video, audio and text modalities at an early stage before feeding the integrated features into a word-level LSTM. In contrast, three other studies [22], [56], [38] implement feature-level fusion at a later stage, after each modality has been processed by its corresponding encoder; those encoders are typically CNN- and LSTM-based networks, and their last layers are concatenated horizontally before being fed into a dense network for prediction. Ringeval et al. [49] utilize a straightforward decision-level fusion technique, averaging the regression outputs of video-based and text-based models to obtain the final depression severity prediction. Conversely, Ye et al. [72] incorporate a learning process into their decision-level fusion approach, combining the predictions of individual modality models with a fully-connected network. Niu et al. [42] implement a novel method involving the concatenation of features extracted from the audio and text modalities of each question-and-answer pair. These concatenated features are treated as vertices in a graph, with edges established between adjacent nodes within a specified context window; this graph is then trained using a customized Graph Attention Network (GAT). While many studies on multimodal fusion for ADA focus predominantly on employing advanced training networks, there remains a lack of research centered on representation learning. Specifically, there is a gap in exploring methodologies for combining features from different modalities more effectively than simple concatenation. Addressing this gap could lead to better performance for ADA systems.

B. Graph-related techniques in multimodal fusion

Recent research has increasingly focused on the utilization of graph-related techniques for multimodal fusion. One study [74] focuses on multimodal neural machine translation, which involves translating sentences from a source to a target language within the context provided by an image.
To establish cross-modal relationships, this study introduces edges between the embeddings of nouns in the sentences and the corresponding object embeddings in the images. In the domain of emotion recognition, Hu et al. [29] combine the Memory Fusion Network (MFN) with a Graph Convolutional Network (GCN) to achieve fusion of multimodal data. Li et al. [37] adopt a graph-centric approach, constructing individual graphs for each modality pair among three distinct modalities; each graph is trained using a Graph Attention Network (GAT), and the outputs are subsequently concatenated and processed through a dense network for the final analysis. Targeting action recognition, Duhme et al. [21] employ a unique fusion strategy for data collected from various sensors: they utilize a GCN to integrate sensor data in both the channel and spatial dimensions, demonstrating the versatility of graph-related techniques in handling complex multimodal datasets.
III. METHODOLOGY

This section introduces a two-module framework for multi-modal depression behaviour graph representation learning. As shown in Figure 1, this framework comprises a Depression-Related Feature Extraction Module and a Graph-Based Multimodal Fusion Module, which are explained separately below. In short, the framework first extracts time-series features using the relevant tools, and these features are then segmented into multiple chunks. The chunks are processed through the respective behaviour encoders to extract deep features. Subsequently, the extracted features are represented as nodes in a graph, with edges formed between them based on predefined rules. This constructed graph is then fed into a GNN that predicts the depression severity.

A. Depression-Related Feature Extraction Module

This module is designed to learn depression-related behaviour features from the raw input data for the subsequent fusion process, as illustrated in the first part of Figure 1. Initially, through the corresponding preprocessing, toolkits and pre-trained models, the raw data of each modality is transformed into a multi-channel time series represented as F_* ∈ R^{t×d}, where * denotes the type of modality, t the time dimension and d the feature dimension. The resulting time series are fed into the appropriate behaviour encoders to extract deep features. The procedure is detailed for each modality below.
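As a concrete illustration of the chunking step, the following minimal sketch splits a modality time series F_* ∈ R^{t×d} into fixed-length chunks. The function name is ours, and discarding the incomplete trailing segment is an assumption, since the paper does not state how remainders are handled.

```python
import numpy as np

def split_into_chunks(feats: np.ndarray, chunk_len: int) -> np.ndarray:
    """Split a (t, d) modality time series into (n_chunks, chunk_len, d) chunks,
    discarding the incomplete trailing segment."""
    n_chunks = feats.shape[0] // chunk_len
    return feats[: n_chunks * chunk_len].reshape(n_chunks, chunk_len, feats.shape[1])
```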
1) Video modality: To summarize complex facial behaviours (facial image sequences) as a set of compact and semantically meaningful representations, we directly employ the widely-used OpenFace 2.0 toolkit [8] to obtain 27 FAU intensities, 6 head poses and 8 gaze directions from each frame. We exclude all frames where the toolkit fails to capture a human face or reports a confidence value below 0.85, then apply min-max normalization to the facial representation time series. We then split the time series into several standardized chunks along the temporal dimension, where each chunk is treated as an independent and discrete sample. The behaviour encoder for extracting depression-related deep features from the video modality draws inspiration from [71] and comprises two main components: the Multi-Scale Behaviour Feature Extraction Module (MB) and the Noise Separation (NS) module. The MB module discerns depression-relevant behaviour primitives across varied scales (from small to large: 4, 8, 16, and 31) by adjusting the filter sizes of its 1D convolutional layers, while the NS module eliminates non-depression-related noise from the features extracted by the MB module.
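A minimal sketch of the frame filtering and normalization just described is given below. It assumes the standard per-frame CSV produced by OpenFace 2.0 (with "success", "confidence", "AU*_r", "pose_*" and "gaze_*" columns); the exact column selection is an assumption rather than the authors' released preprocessing.

```python
import pandas as pd

def load_facial_series(csv_path: str, min_conf: float = 0.85) -> pd.DataFrame:
    """Load an OpenFace 2.0 per-frame CSV, drop unreliable frames, and
    min-max normalise the remaining behaviour-primitive columns."""
    df = pd.read_csv(csv_path)
    df.columns = df.columns.str.strip()  # OpenFace pads some column names with spaces
    df = df[(df["success"] == 1) & (df["confidence"] >= min_conf)]
    cols = [c for c in df.columns if c.startswith(("AU", "pose_", "gaze_"))]
    feats = df[cols]
    return (feats - feats.min()) / (feats.max() - feats.min() + 1e-8)
```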
2) Audio modality: For the audio modality, we use the Hugging Face pre-trained speaker diarization model to separate the interviewee's voice from the raw audio clip for further processing [9]. Each processed audio clip is then filtered with a 4th-order band-pass filter that retains sound between 85 and 3400 Hz (the typical human voice frequency range) [6], [13], [63]. The resulting audio clip is fed into the DeepSpectrum toolkit [4] to obtain 4096-d time-series deep features, which are produced by applying the pre-trained VGG16 model to the audio spectrogram, an image created from 128 mel-frequency bands with a window size of 4 seconds and a hop size of 1 second. Similarly to the video modality, the resulting time series is segmented into chunks for further feature extraction.
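The band-pass step can be reproduced with a standard filter design; the sketch below assumes a Butterworth response applied forward and backward (zero phase), which the paper does not specify beyond the filter order and pass band.

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt

def bandpass_voice(wave: np.ndarray, sr: int,
                   low: float = 85.0, high: float = 3400.0) -> np.ndarray:
    """Apply a 4th-order band-pass filter keeping the 85-3400 Hz voice band."""
    sos = butter(4, [low, high], btype="bandpass", fs=sr, output="sos")
    return sosfiltfilt(sos, wave)
```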
The feature extraction backbone for the audio modality is a CNN-RNN architecture, as it is powerful in managing time-series data. The network commences with multiple CNN blocks, each containing a 1D convolutional layer followed by a batch normalization layer, a Rectified Linear Unit (ReLU) activation and a max-pooling layer. Following the CNN blocks are a 32-d LSTM layer and a fully connected layer with 32 neurons, from which the deep features are obtained.
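A sketch of such a CNN-RNN chunk encoder is shown below. The number of CNN blocks and the intermediate channel width are not specified in the paper, so the values used here (three blocks of 128 channels) are placeholders; only the Conv1d/BatchNorm/ReLU/max-pooling block structure, the 32-d LSTM and the 32-unit fully connected layer follow the description above.

```python
import torch
import torch.nn as nn

class AudioChunkEncoder(nn.Module):
    """CNN-RNN encoder for one chunk of DeepSpectrum features."""

    def __init__(self, in_dim: int = 4096, n_blocks: int = 3, width: int = 128):
        super().__init__()
        blocks, channels = [], in_dim
        for _ in range(n_blocks):
            blocks += [nn.Conv1d(channels, width, kernel_size=3, padding=1),
                       nn.BatchNorm1d(width), nn.ReLU(), nn.MaxPool1d(2)]
            channels = width
        self.cnn = nn.Sequential(*blocks)
        self.lstm = nn.LSTM(width, 32, batch_first=True)
        self.fc = nn.Linear(32, 32)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, chunk_len, in_dim) time-ordered DeepSpectrum features
        z = self.cnn(x.transpose(1, 2)).transpose(1, 2)  # (batch, reduced_len, width)
        _, (h, _) = self.lstm(z)                         # final hidden state of the 32-d LSTM
        return self.fc(h[-1])                            # (batch, 32) chunk-level embedding
```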
3) Text modality: We use the Hugging Face DistilBERT model [53] to extract sentence-level features from the transcript of each interview, resulting in a 768-d representation per sentence. The behaviour encoder of the text modality empirically shares a similar CNN-RNN architecture, with the RNN part being a 32-d GRU layer.
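The sentence-level text features can be obtained as sketched below. The checkpoint name and the mean-pooling of token states are assumptions; the paper only states that DistilBERT yields a 768-d representation per sentence.

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
encoder = AutoModel.from_pretrained("distilbert-base-uncased")

@torch.no_grad()
def sentence_embeddings(sentences: list[str]) -> torch.Tensor:
    """Return one 768-d vector per transcript sentence (mean over non-padding tokens)."""
    batch = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
    hidden = encoder(**batch).last_hidden_state        # (n_sentences, seq_len, 768)
    mask = batch["attention_mask"].unsqueeze(-1)       # (n_sentences, seq_len, 1)
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)
```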
Motivation for using chunk-level features: In our study, the time-series video and audio signals are divided into multiple chunks based on the fact that indications of depression are discernible over short-term intervals regardless of the conversational content [35], [39]. Similar feature processing strategies have also been adopted in [20], [54]; they are particularly advantageous for processing video and audio time series of variable length, and segmenting these into fixed-length chunks both facilitates data augmentation and simplifies the training process. By contrast, for the text modality we employ a different strategy: the text-based behaviour encoder combines all sentence embeddings of the target speech transcript, which generates only one chunk-level representation. This approach stems from the premise that textual data might reveal depressive indicators primarily within specific conversational contexts. For instance, mundane discussions, such as those about an individual's residence, are less likely to provide insights into depressive states when analyzed in isolation.
Fig. 1: Graph-Based Multimodal Fusion Pipeline. Note: The chunk numbers (4, 2, and 1) for each modality depicted in the figure
are for illustrative purposes only and do not represent actual case scenarios.
In contrast, considering the entire speech transcript enables the inclusion of conversational topics and global contexts, which can serve as significant indicators of depression [11], [60], [47].

B. Graph-Based Multimodal Fusion Module

As illustrated in the second part of Figure 1, this module constructs a graph representation to fuse the extracted audio, visual and text features. This is achieved by establishing intra-modal and inter-modal edges that model the intra-modal and inter-modal dynamics. Before defining these edges, we first encode each extracted chunk-level depression-related feature as a node of the graph. For simplicity, we denote the i-th chunk feature in the time dimension of the video, audio and text modalities as D_V^i, D_A^i and D_T^i, where V, A and T represent the video, audio and text modality, respectively. It should be noted that the i-th nodes from different modalities do not necessarily correspond to the same time interval, since the number of chunks varies across modalities.

1) Intra-modal dynamics modelling: To model the intra-modal dynamics, we consider the temporal order of the chunks in the original modality. To encapsulate the inherent temporal inter-relationships among continuous features, undirected edges are established between temporally adjacent node pairs of the same modality. The set of intra-modal edges E_intra can be formulated as:

E_{\mathrm{intra}} = \left\{ (D_*^{i}, D_*^{i+1}) \;\middle|\; 1 \le i < N_* \right\} \quad (1)

where * represents the modality V, A or T and N_* represents the number of nodes defined for modality *. These intra-modal edges serve a dual purpose: (i) they enable seamless access to information from both preceding and succeeding nodes, in contrast to TCN-based methods, which only allow one-way information passing; and (ii) despite the shortcuts between nodes introduced by the inter-modal edges (discussed subsequently), these intra-modal edges preserve the temporal induction inherent in the relationships between nodes. As a result, a comprehensive and detailed understanding of the temporal dynamics within each modality can be achieved.

2) Inter-modal dynamics modelling: To model the inter-modal dynamics among the features extracted from text, audio and video, we also propose to add inter-modal edges that allow nodes from temporally aligned modality pairs (e.g., a pair of audio-visual chunk-level node features) to be interconnected. Specifically, nodes from different modalities are interconnected if they represent equivalent temporal intervals. The inter-modal edges E_inter of a pair of modalities m1 and m2 can be formulated as:

E_{\mathrm{inter}} = \bigcup_{j=1}^{N_{m_2}} \left\{ (D_{m_1}^{i}, D_{m_2}^{j}) \;\middle|\; i \in \left[ 1 + (j-1)\tfrac{N_{m_1}}{N_{m_2}},\; j\tfrac{N_{m_1}}{N_{m_2}} \right] \right\} \quad (2)

where m1 and m2 represent two distinct modalities, while N_{m_1} and N_{m_2} represent the numbers of nodes for modalities m1 and m2, respectively. This formula divides the nodes of the modality with the larger count into segments according to the ratio of N_{m_1} to N_{m_2}, and then links each segment to the corresponding node of the modality with fewer nodes. The design stems from the assumption that features in close temporal proximity across different modalities can offer compensatory or complementary insights [7] for various tasks. Contrary to the methodologies introduced in [15], [51], [42], the proposed multi-modal graph representation does not require manual temporal alignment in the pre-processing phase, which simplifies data preparation. This is because the inter-modal edges ensure that each node can receive and share information with all other nodes within the graph, which allows the information contained in temporally associated nodes of different modalities to propagate to each other wherever they are positioned in the graph.
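Both edge sets can be enumerated directly from the chunk counts. The sketch below implements Eqs. (1) and (2) with 0-based indices; flooring the interval bounds when N_{m_1} is not divisible by N_{m_2} is our assumption, as the paper does not specify the rounding.

```python
import math

def intra_modal_edges(n: int) -> list[tuple[int, int]]:
    """Eq. (1): undirected edges between temporally adjacent chunk nodes of one modality."""
    return [(i, i + 1) for i in range(n - 1)]

def inter_modal_edges(n_m1: int, n_m2: int) -> list[tuple[int, int]]:
    """Eq. (2): link every node of the finer modality m1 to the temporally aligned
    node of the coarser modality m2 (assumes n_m1 >= n_m2)."""
    ratio = n_m1 / n_m2
    edges = []
    for j in range(1, n_m2 + 1):                  # 1-based j, as in the formula
        lo = math.floor((j - 1) * ratio) + 1
        hi = math.floor(j * ratio)
        edges += [(i - 1, j - 1) for i in range(lo, hi + 1)]
    return edges
```

For the chunk counts illustrated in Figure 1 (four video, two audio and one text chunk), inter_modal_edges(4, 2) returns (0, 0), (1, 0), (2, 1), (3, 1), i.e., each audio node is linked to its two temporally aligned video nodes.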
Moreover, the inter-modal edges function as efficient conduits for information exchange between distantly located …

… relationship between features from different modalities: the nodes are classified into different types based on their modality, and the edges are classified into different types based on the types of nodes they link, which results in a heterogeneous graph.
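Such a typed graph maps naturally onto a heterogeneous graph container. The sketch below uses PyTorch Geometric (cited by the paper as [24]) and is an assumed implementation rather than the authors' released code: node features and the Eq. (1)/(2) node pairs are passed in, and reverse edge types are added so that message passing along the undirected edges works in both directions.

```python
import torch
import torch_geometric.transforms as T
from torch_geometric.data import HeteroData

def build_behaviour_graph(feats: dict, intra: dict, inter: dict) -> HeteroData:
    """feats: {'video': (Nv, d) tensor, ...}; intra: {'video': [(i, j), ...], ...};
    inter: {('video', 'audio'): [(i, j), ...], ...} built from Eqs. (1) and (2)."""
    def to_index(pairs):
        return torch.tensor(pairs, dtype=torch.long).t().contiguous()  # (2, E)

    data = HeteroData()
    for mod, x in feats.items():
        data[mod].x = x
    for mod, pairs in intra.items():
        if pairs:
            data[mod, "intra", mod].edge_index = to_index(pairs)
    for (m1, m2), pairs in inter.items():
        if pairs:
            data[m1, "inter", m2].edge_index = to_index(pairs)
    # Add reverse edge types so propagation along undirected edges is bidirectional.
    return T.ToUndirected()(data)
```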
E-DAIC
Modality | Reference             | RMSE | CCC
A+V      | Baseline [49]         | 6.37 | 0.111
A+V      | Sun et al. [61]       | 6.22 | 0.331
A+T      | Kaya et al. [34]      | -    | 0.344
A+V+T    | Rodrigues et al. [50] | 6.11 | 0.403
A+V+T    | Suggu et al. [52]     | 5.36 | 0.457
A+V+T    | Zhao et al. [77]      | 4.11 | -
A+T      | Fan et al. [22]       | 5.91 | 0.430
A+V+T    | Yin et al. [73]       | 5.50 | 0.442
A+V+T    | Fang et al. [23]      | 5.17 | -
A+V+T    | Sun et al. [62]       | -    | 0.583
A+V+T    | Ours                  | 4.80 | 0.563

AFAR-BSFP-DB
Modality | Reference | Acc | F1

The Mean Square Error (MSE) loss function is employed for the regression task on E-DAIC, while Cross-Entropy is used for the classification task on the BSFP dataset. To ensure balanced training on the BSFP dataset, we utilize a weighted random sampling approach, addressing potential data imbalance. Our evaluation on the E-DAIC dataset uses its test set, following the train-test split predefined by the dataset owner. For the AFAR-BSFP dataset, we implement a 5-fold cross-validation approach to evaluate the performance of our methodology.
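A minimal sketch of this training setup is shown below; the batch size is a placeholder, and weighting each sample by inverse class frequency is an assumption about how the weighted random sampling is realised.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, WeightedRandomSampler

mse_loss = nn.MSELoss()              # E-DAIC severity regression
ce_loss = nn.CrossEntropyLoss()      # AFAR-BSFP classification

def balanced_loader(dataset, labels, batch_size: int = 16) -> DataLoader:
    """Oversample minority classes so each batch is approximately class-balanced."""
    labels = torch.as_tensor(labels)
    class_counts = torch.bincount(labels)
    sample_weights = 1.0 / class_counts[labels].float()   # one weight per sample
    sampler = WeightedRandomSampler(sample_weights, num_samples=len(labels), replacement=True)
    return DataLoader(dataset, batch_size=batch_size, sampler=sampler)
```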
potentially complementary information, which in turn provides a richer set of depression-related cues. This also shows that our graph-based fusion method can exploit the subtle inter-relationships between modalities for better prediction.

E-DAIC
Modality | RMSE | CCC
A        | 5.65 | 0.413
V        | 5.83 | 0.324
T        | 5.68 | 0.418
A+V      | 5.23 | 0.552
A+T      | 5.01 | 0.526
V+T      | 5.16 | 0.516
A+V+T    | 4.80 | 0.563

TABLE III: Performance comparison of our multimodal fusion depression severity assessment on E-DAIC across different sets of modalities. The best result is highlighted in bold.

Fig. 3: Predictions of different fusion methods on the E-DAIC dataset. The x-axis denotes the ground truth; the y-axis denotes the predictions.
heterogeneous graphs are pre-equipped with this relational information, facilitating more effective learning. Among the heterogeneous-graph-focused GNNs, the HGT exhibits superior performance across both metrics. This result can be ascribed to its Transformer-based architecture, which is adept at capturing the intricate nuances present among the various modalities. Furthermore, the HAN also shows satisfactory performance. This outcome underscores the versatility and general applicability of our graph representation framework, indicating its robustness across different GNN architectures.
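For reference, a compact HGT-based regressor over the fused graph could look like the sketch below; the two-layer depth, hidden width and mean-pooling readout are our assumptions rather than reported hyper-parameters. It expects a graph whose reverse edge types have been added (e.g., via ToUndirected), so that every node type receives messages.

```python
import torch
from torch import nn
from torch_geometric.nn import HGTConv

class HGTSeverityRegressor(nn.Module):
    """Two HGT layers over the heterogeneous behaviour graph, mean-pooled into
    a single vector and mapped to a severity score."""

    def __init__(self, metadata, hidden: int = 64, heads: int = 2):
        super().__init__()
        self.conv1 = HGTConv(-1, hidden, metadata, heads)     # lazy input dimensions
        self.conv2 = HGTConv(hidden, hidden, metadata, heads)
        self.head = nn.Linear(hidden, 1)

    def forward(self, x_dict, edge_index_dict):
        x_dict = {k: v.relu() for k, v in self.conv1(x_dict, edge_index_dict).items()}
        x_dict = self.conv2(x_dict, edge_index_dict)
        # Average node embeddings within each modality, then across modalities.
        pooled = torch.stack([v.mean(dim=0) for v in x_dict.values()]).mean(dim=0)
        return self.head(pooled)

# Usage: model = HGTSeverityRegressor(data.metadata())
#        score = model(data.x_dict, data.edge_index_dict)
```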
4) Comparison between different chunk sizes: A significant challenge inherent in our methodology is the determination of an appropriate chunk size for the time-series data derived from the various modalities. Given that the text modality necessitates a single-chunk configuration to capture the global conversational context essential for accurate prediction, as previously discussed, our focus is on the chunk-size settings for the video and audio modalities. We conduct a grid search to find the optimal chunk size, where the best setting is selected based on the predictions obtained from the features extracted by the corresponding behaviour encoders.

Fig. 4: Predictions of the corresponding modality encoders with different chunk sizes. The best results are presented.

As illustrated in Figure 4, the empirical results indicate that both the video and audio modalities exhibit optimal performance with a chunk size of 30. However, despite the same chunk size, the temporal coverage of information differs significantly between these two modalities: a single chunk in the video modality encompasses one second of raw video data, whereas in the audio modality it covers 30 seconds of audio data. This difference in temporal coverage can be attributed to the nature of each modality. Visual cues related to depression, such as a frown, can manifest instantaneously and can therefore be effectively captured within short temporal spans. Conversely, vocal indicators of depression, which include a reduced speech rate and a monotonous tone, require a longer time frame for accurate detection. Additionally, the presence of pauses between sentences in the raw audio further necessitates longer chunks: a short temporal chunk might fall entirely on a pause, failing to capture meaningful patterns. Therefore, the determination of the chunk size for each modality reflects a careful consideration of the inherent characteristics of each modality in presenting depression-related cues.

5) Alternative graph structures: In our research, we rigorously evaluated two alternative configurations of the graph structure used for multimodal fusion. The first alternative introduced additional edges between nodes of the same modality that are separated by one intermediate node, effectively creating a 'skip' connection. The second structural variation incorporated self-loops for each node, a design intended to capture explicit self-attention within the graph. However, the performance metrics reveal that neither of these alternative structures outperforms the original graph design. This observation could be attributed to the additional, and potentially erroneous, assumptions introduced by these modifications. Specifically, the creation of 'skip' edges between non-adjacent nodes presupposes a strong correlation between these nodes across the intervening gap; this assumption may not universally apply across all timestamps, potentially leading to inaccuracies in the model's interpretation of the temporal dynamics. Furthermore, the additional self-attention introduced through self-loops, while theoretically promising for emphasizing individual node characteristics, may not effectively contribute to the overall fusion process in the context of multimodal data, where inter-node and inter-modality relationships are more crucial.
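For completeness, the two alternative structures examined above are easy to express on top of the original edge sets; the helper names here are illustrative.

```python
import torch
from torch_geometric.utils import add_self_loops

def skip_edges(n: int, step: int = 2) -> list[tuple[int, int]]:
    """'Skip' variant: extra intra-modal edges between chunk nodes two steps apart."""
    return [(i, i + step) for i in range(n - step)]

# Self-loop variant: append (i, i) pairs to an existing (2, E) edge_index tensor.
edge_index = torch.tensor([[0, 1, 2], [1, 2, 3]])            # chain over four chunk nodes
edge_index_with_loops, _ = add_self_loops(edge_index, num_nodes=4)
```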
V. CONCLUSION AND LIMITATIONS

In this research, we have introduced a graph-based multimodal fusion approach for automatic depression assessment, addressing the challenges of modelling both intra-modal and inter-modal behavioural dynamics. Our approach achieved competitive results on the E-DAIC dataset, providing an RMSE of 4.80 and a CCC value of 0.563, and outperformed the baseline on the AFAR-BSFP dataset. Despite these promising results, our study has certain limitations, notably the utilization of a fixed graph structure, which may not be universally optimal across diverse datasets. Additionally, our investigation was confined to the integration of three behavioural modalities: video, audio and text. Future research directions could focus on dynamic graph construction methods and extend the exploration of our fusion approach to other modalities, including Electroencephalography (EEG), thereby broadening the scope and applicability of our findings in the realm of mental health assessment.

ETHICAL STATEMENT

In this research, we strictly adhere to the highest ethical standards, ensuring that all data were collected from individuals who provided informed consent for their use in research. The datasets are exclusively utilized for the intended research purposes and are not shared beyond the research team. Moreover, all data processing is conducted anonymously, guaranteeing that individual participants cannot be identified, thus upholding the privacy and confidentiality of each participant.
ACKNOWLEDGEMENTS

Haotian Shen undertook this research work as part of his MPhil in ACS degree at the Department of Computer Science and Technology, University of Cambridge. Funding: Siyang Song and Hatice Gunes have been supported by the EPSRC project ARoEQ under grant ref. EP/R030782/1. Open Access: For the purpose of open access, the authors have applied a Creative Commons Attribution (CC BY) license to any Accepted Manuscript version arising. Data Access: This study involves secondary analyses of existing datasets, which are described and cited in the text. Licensing restrictions prevent sharing of the datasets.

REFERENCES

[1] N. I. Abbasi, S. Song, and H. Gunes. Statistical, spectral and graph representations for video-based facial expression recognition in children. In ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1725–1729. IEEE, 2022.
[2] T. Al Hanai, M. M. Ghassemi, and J. R. Glass. Detecting depression with audio/text sequence modeling of interviews. In Interspeech, pages 1716–1720, 2018.
[3] A. Amanat, M. Rizwan, A. R. Javed, M. Abdelhaq, R. Alsaqour, S. Pandya, and M. Uddin. Deep learning for depression detection from textual data. Electronics, 11(5):676, 2022.
[4] S. Amiriparian, M. Gerczuk, S. Ottl, N. Cummins, M. Freitag, S. Pugachevskiy, A. Baird, and B. Schuller. Snore sound classification using image-based deep spectrum features. In Interspeech 2017, pages 3512–3516. ISCA, Aug. 2017.
[5] P. K. Atrey, M. A. Hossain, A. El Saddik, and M. S. Kankanhalli. Multimodal fusion for multimedia analysis: a survey. Multimedia Systems, 16:345–379, 2010.
[6] R. J. Baken and R. F. Orlikoff. Clinical measurement of speech and voice. Cengage Learning, 2000.
[7] T. Baltrušaitis, C. Ahuja, and L.-P. Morency. Multimodal machine learning: A survey and taxonomy. IEEE Transactions on Pattern Analysis and Machine Intelligence, 41(2):423–443, 2018.
[8] T. Baltrusaitis, A. Zadeh, Y. C. Lim, and L.-P. Morency. OpenFace 2.0: Facial behavior analysis toolkit. In 2018 13th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2018), pages 59–66. IEEE, 2018.
[9] H. Bredin, R. Yin, J. M. Coria, G. Gelly, P. Korshunov, M. Lavechin, D. Fustes, H. Titeux, W. Bouaziz, and M.-P. Gill. pyannote.audio: neural building blocks for speaker diarization. In ICASSP 2020, IEEE International Conference on Acoustics, Speech, and Signal Processing, Barcelona, Spain, May 2020.
[10] S. Brody, U. Alon, and E. Yahav. How attentive are graph attention networks? arXiv preprint arXiv:2105.14491, 2021.
[11] W. Bucci and N. Freedman. The language of depression. Bulletin of the Menninger Clinic, 45(4):334, 1981.
[12] S. G. Burdisso, M. L. Errecalde, and M. Montes y Gómez. Using text classification to estimate the depression level of reddit users. Journal of Computer Science & Technology, 21, 2021.
[13] J. C. Catford et al. A practical introduction to phonetics. Clarendon Press Oxford, 1988.
[14] D. Cer, Y. Yang, S.-y. Kong, N. Hua, N. Limtiaco, R. S. John, N. Constant, M. Guajardo-Cespedes, S. Yuan, C. Tar, et al. Universal sentence encoder. arXiv preprint arXiv:1803.11175, 2018.
[15] M. Chen, S. Wang, P. P. Liang, T. Baltrušaitis, A. Zadeh, and L.-P. Morency. Multimodal sentiment analysis with word-level fusion and reinforcement learning. In Proceedings of the 19th ACM International Conference on Multimodal Interaction, ICMI '17, pages 163–171, New York, NY, USA, 2017. Association for Computing Machinery.
[16] J. Cheong, M. Spitale, and H. Gunes. "It's not fair!" – fairness for a small dataset of multi-modal dyadic mental well-being coaching. In IEEE International Conference on Affective Computing and Intelligent Interaction (IEEE ACII'23), pages 1–8, 2023.
[17] R. Chiong, G. S. Budhi, S. Dhakal, and F. Chiong. A textual-based featuring approach for depression detection using machine learning classifiers and social media texts. Computers in Biology and Medicine, 135:104499, 2021.
[18] W. C. de Melo, E. Granger, and A. Hadid. Combining global and local convolutional 3D networks for detecting depression from facial expressions. In 2019 14th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2019), pages 1–8. IEEE, 2019.
[19] W. C. De Melo, E. Granger, and M. B. Lopez. Encoding temporal information for automatic depression recognition from facial analysis. In ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1080–1084. IEEE, 2020.
[20] H. Dibeklioğlu, Z. Hammal, and J. F. Cohn. Dynamic multimodal measurement of depression severity using deep autoencoding. IEEE Journal of Biomedical and Health Informatics, 22(2):525–536, 2017.
[21] M. Duhme, R. Memmesheimer, and D. Paulus. Fusion-GCN: Multimodal action recognition using graph convolutional networks. In DAGM German Conference on Pattern Recognition, pages 265–281. Springer, 2021.
[22] W. Fan, Z. He, X. Xing, B. Cai, and W. Lu. Multi-modality depression detection via multi-scale temporal dilated CNNs. In Proceedings of the 9th International on Audio/Visual Emotion Challenge and Workshop, AVEC '19, pages 73–80, New York, NY, USA, 2019. Association for Computing Machinery.
[23] M. Fang, S. Peng, Y. Liang, C.-C. Hung, and S. Liu. A multimodal fusion model with multi-level attention mechanism for depression detection. Biomedical Signal Processing and Control, 82:104561, 2023.
[24] M. Fey and J. E. Lenssen. Fast graph representation learning with PyTorch Geometric. In ICLR Workshop on Representation Learning on Graphs and Manifolds, 2019.
[25] M. Gavrilescu and N. Vizireanu. Predicting depression, anxiety, and stress levels from videos using the facial action coding system. Sensors, 19(17):3693, 2019.
[26] L. He, J. C.-W. Chan, and Z. Wang. Automatic depression recognition using CNN with attention mechanism from videos. Neurocomputing, 422:165–175, 2021.
[27] L. He, D. Jiang, and H. Sahli. Multimodal depression recognition with dynamic visual and audio cues. In 2015 International Conference on Affective Computing and Intelligent Interaction (ACII), pages 260–266. IEEE, 2015.
[28] L. He, M. Niu, P. Tiwari, P. Marttinen, R. Su, J. Jiang, C. Guo, H. Wang, S. Ding, Z. Wang, et al. Deep learning for depression recognition with audiovisual cues: A review. Information Fusion, 80:56–86, 2022.
[29] J. Hu, Y. Liu, J. Zhao, and Q. Jin. MMGCN: Multimodal fusion via deep graph convolution network for emotion recognition in conversation. arXiv preprint arXiv:2107.06779, 2021.
[30] Z. Hu, Y. Dong, K. Wang, and Y. Sun. Heterogeneous graph transformer. In Proceedings of The Web Conference 2020, pages 2704–2710, 2020.
[31] M. R. Islam, M. A. Kabir, A. Ahmed, A. R. M. Kamal, H. Wang, and A. Ulhaq. Depression detection from social network data using machine learning techniques. Health Information Science and Systems, 6:1–12, 2018.
[32] S. Jaiswal, S. Song, and M. Valstar. Automatic prediction of depression and anxiety from behaviour and personality attributes. In 2019 8th International Conference on Affective Computing and Intelligent Interaction (ACII), pages 1–7. IEEE, 2019.
[33] H. Jiang, B. Hu, Z. Liu, G. Wang, L. Zhang, X. Li, and H. Kang. Detecting depression using an ensemble logistic regression model based on multiple speech features. Computational and Mathematical Methods in Medicine, 2018, 2018.
[34] H. Kaya, D. Fedotov, D. Dresvyanskiy, M. Doyran, D. Mamontov, M. Markitantov, A. A. Akdag Salah, E. Kavcar, A. Karpov, and A. A. Salah. Predicting depression and emotions in the cross-roads of cultures, para-linguistics, and non-linguistics. In Proceedings of the 9th International on Audio/Visual Emotion Challenge and Workshop, pages 27–35, 2019.
[35] D. Keltner and A. M. Kring. Emotion, social function, and psychopathology. Review of General Psychology, 2(3):320–342, 1998.
[36] Y. Lee, R.-M. Ragguett, R. B. Mansur, J. J. Boutilier, J. D. Rosenblat, A. Trevizol, E. Brietzke, K. Lin, Z. Pan, M. Subramaniapillai, et al. Applications of machine learning algorithms to predict therapeutic outcomes in depression: A meta-analysis and systematic review. Journal of Affective Disorders, 241:519–532, 2018.
[37] J. Li, X. Wang, G. Lv, and Z. Zeng. GraphMFT: A graph network based multimodal fusion technique for emotion recognition in conversation. Neurocomputing, page 126427, 2023.
[38] L. Lin, X. Chen, Y. Shen, and L. Zhang. Towards automatic depression detection: A BiLSTM/1D CNN-based model. Applied Sciences, 10(23):8701, 2020.
[39] L.-S. A. Low, N. C. Maddage, M. Lech, L. B. Sheeber, and N. B. Allen. Detection of clinical depression in adolescents' speech during family interactions. IEEE Transactions on Biomedical Engineering, 58(3):574–586, 2010.
[40] X. Ma, H. Yang, Q. Chen, D. Huang, and Y. Wang. DepAudioNet: An efficient deep model for audio based depression classification. In Proceedings of the 6th International Workshop on Audio/Visual Emotion Challenge, pages 35–42, 2016.
[41] K. Milintsevich, K. Sirts, and G. Dias. Towards automatic text-based estimation of depression through symptom prediction. Brain Informatics, 10(1):1–14, 2023.
[42] M. Niu, K. Chen, Q. Chen, and L. Yang. HCAG: A hierarchical context-aware graph attention model for depression detection. In ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 4235–4239. IEEE, 2021.
[43] M. Niu, J. Tao, B. Liu, J. Huang, and Z. Lian. Multimodal spatiotemporal representation for automatic depression level detection. IEEE Transactions on Affective Computing, 2020.
[44] A. Pampouchidou, M. Pediaditis, E. Kazantzaki, S. Sfakianakis, I.-A. Apostolaki, K. Argyraki, D. Manousos, F. Meriaudeau, K. Marias, F. Yang, et al. Automated facial video-based recognition of depression and anxiety symptom severity: cross-corpus validation. Machine Vision and Applications, 31(4):30, 2020.
[45] V. Patel, S. Saxena, C. Lund, G. Thornicroft, F. Baingana, P. Bolton, D. Chisholm, P. Y. Collins, J. L. Cooper, J. Eaton, et al. The Lancet Commission on global mental health and sustainable development. The Lancet, 392(10157):1553–1598, 2018.
[46] D. Ramachandram and G. W. Taylor. Deep multimodal learning: A survey on recent advances and trends. IEEE Signal Processing Magazine, 34(6):96–108, 2017.
[47] N. Ramirez-Esparza, C. Chung, E. Kacewic, and J. Pennebaker. The psychology of word use in depression forums in English and in Spanish: Testing two text analytic approaches. In Proceedings of the International AAAI Conference on Web and Social Media, volume 2, pages 102–108, 2008.
[48] A. Ray, S. Kumar, R. Reddy, P. Mukherjee, and R. Garg. Multi-level attention network using text, audio and video for depression prediction. In Proceedings of the 9th International on Audio/Visual Emotion Challenge and Workshop, pages 81–88, 2019.
[49] F. Ringeval, B. Schuller, M. Valstar, N. Cummins, R. Cowie, L. Tavabi, M. Schmitt, S. Alisamir, S. Amiriparian, E.-M. Messner, et al. AVEC 2019 workshop and challenge: state-of-mind, detecting depression with AI, and cross-cultural affect recognition. In Proceedings of the 9th International on Audio/Visual Emotion Challenge and Workshop, pages 3–12, 2019.
[50] M. Rodrigues Makiuchi, T. Warnita, K. Uto, and K. Shinoda. Multimodal fusion of BERT-CNN and gated CNN representations for depression detection. In Proceedings of the 9th International on Audio/Visual Emotion Challenge and Workshop, pages 55–63, 2019.
[51] M. Rohanian, J. Hough, M. Purver, et al. Detecting depression with word-level multimodal fusion. In Interspeech, pages 1443–1447, 2019.
[52] G. S. Saggu, K. Gupta, K. Arya, and C. R. Rodriguez. DepressNet: A multimodal hierarchical attention mechanism approach for depression detection. Int. J. Eng. Sci., 15(1):24–32, 2022.
[53] V. Sanh, L. Debut, J. Chaumond, and T. Wolf. DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. arXiv preprint arXiv:1910.01108, 2019.
[54] S. Sardari, B. Nakisa, M. N. Rastgoo, and P. Eklund. Audio based depression detection using convolutional autoencoder. Expert Systems with Applications, 189:116076, 2022.
[55] G. Shen, J. Jia, L. Nie, F. Feng, C. Zhang, T. Hu, T.-S. Chua, W. Zhu, et al. Depression detection via harvesting social media: A multimodal dictionary learning solution. In IJCAI, pages 3838–3844, 2017.
[56] Y. Shen, H. Yang, and L. Lin. Automatic depression detection: An emotional audio-textual corpus and a GRU/BiLSTM-based model. In ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 6247–6251. IEEE, 2022.
[57] S. Song, S. Jaiswal, L. Shen, and M. Valstar. Spectral representation of behaviour primitives for depression analysis. IEEE Transactions on Affective Computing, 13(2):829–844, 2020.
[58] S. Song, Y. Luo, T. Tumer, M. Valstar, and H. Gunes. Loss relaxation strategy for noisy facial video-based automatic depression recognition. ACM Transactions on Computing for Healthcare, 2024.
[59] S. Song, L. Shen, and M. Valstar. Human behaviour-based automatic depression analysis using hand-crafted statistics and deep learned spectral features. In 2018 13th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2018), pages 158–165. IEEE, 2018.
[60] S. W. Stirman and J. W. Pennebaker. Word use in the poetry of suicidal and nonsuicidal poets. Psychosomatic Medicine, 63(4):517–522, 2001.
[61] H. Sun, J. Liu, S. Chai, Z. Qiu, L. Lin, X. Huang, and Y. Chen. Multi-modal adaptive fusion transformer network for the estimation of depression level. Sensors, 21(14):4764, 2021.
[62] H. Sun, H. Wang, J. Liu, Y.-W. Chen, and L. Lin. CubeMLP: An MLP-based model for multimodal sentiment analysis and depression estimation. In Proceedings of the 30th ACM International Conference on Multimedia, pages 3722–3729, 2022.
[63] J. Sundberg. Articulatory interpretation of the "singing formant". The Journal of the Acoustical Society of America, 55(4):838–844, 1974.
[64] M. Tlachac and E. Rundensteiner. Screening for depression with retrospectively harvested private versus public text. IEEE Journal of Biomedical and Health Informatics, 24(11):3326–3332, 2020.
[65] E. Toto, M. Tlachac, F. L. Stevens, and E. A. Rundensteiner. Audio-based depression screening using sliding window sub-clip pooling. In 2020 19th IEEE International Conference on Machine Learning and Applications (ICMLA), pages 791–796. IEEE, 2020.
[66] A. Vázquez-Romero and A. Gallardo-Antolín. Automatic detection of depression in speech using ensemble convolutional neural networks. Entropy, 22(6):688, 2020.
[67] P. Velickovic, G. Cucurull, A. Casanova, A. Romero, P. Lio, Y. Bengio, et al. Graph attention networks. stat, 1050(20):10–48550, 2017.
[68] X. Wang, H. Ji, C. Shi, B. Wang, Y. Ye, P. Cui, and P. S. Yu. Heterogeneous graph attention network. In The World Wide Web Conference, pages 2022–2032, 2019.
[69] Y. Wang, J. Ma, B. Hao, P. Hu, X. Wang, J. Mei, and S. Li. Automatic depression detection via facial expressions using multiple instance learning. In 2020 IEEE 17th International Symposium on Biomedical Imaging (ISBI), pages 1933–1936. IEEE, 2020.
[70] W. Xie, L. Liang, Y. Lu, C. Wang, J. Shen, H. Luo, and X. Liu. Interpreting depression from question-wise long-term video recording of SDS evaluation. IEEE Journal of Biomedical and Health Informatics, 26(2):865–875, 2021.
[71] J. Xu, S. Song, K. Kusumam, H. Gunes, and M. Valstar. Two-stage temporal modelling framework for video-based depression recognition using graph representation. arXiv preprint arXiv:2111.15266, 2021.
[72] J. Ye, Y. Yu, Q. Wang, W. Li, H. Liang, Y. Zheng, and G. Fu. Multi-modal depression detection based on emotional audio and evaluation text. Journal of Affective Disorders, 295:904–913, 2021.
[73] S. Yin, C. Liang, H. Ding, and S. Wang. A multi-modal hierarchical recurrent neural network for depression detection. In Proceedings of the 9th International on Audio/Visual Emotion Challenge and Workshop, AVEC '19, pages 65–71, New York, NY, USA, 2019. Association for Computing Machinery.
[74] Y. Yin, F. Meng, J. Su, C. Zhou, Z. Yang, J. Zhou, and J. Luo. A novel graph-based multi-modal fusion encoder for neural machine translation. arXiv preprint arXiv:2007.08742, 2020.
[75] L. Zhang, J. Driscol, X. Chen, and R. Hosseini Ghomi. Evaluating acoustic and linguistic features of detecting depression sub-challenge dataset. In Proceedings of the 9th International on Audio/Visual Emotion Challenge and Workshop, pages 47–53, 2019.
[76] P. Zhang, M. Wu, H. Dinkel, and K. Yu. DEPA: Self-supervised audio embedding for depression detection. In Proceedings of the 29th ACM International Conference on Multimedia, pages 135–143, 2021.
[77] Z. Zhao and K. Wang. Unaligned multimodal sequences for depression assessment from speech. In 2022 44th Annual International Conference of the IEEE Engineering in Medicine & Biology Society (EMBC), pages 3409–3413, 2022.
[78] X. Zhou, K. Jin, Y. Shang, and G. Guo. Visually interpretable representation learning for depression recognition from facial images. IEEE Transactions on Affective Computing, 11(3):542–552, 2018.