Multi-Modal Human Behaviour Graph Representation Learning For Automatic Depression Assessment
Abstract— Automatic depression assessment (ADA) often relies on crucial cues embedded in human verbal and non-verbal behaviours, which exist in the video, audio, and text modalities. Although these modalities are typically presented in time-series form,
current research offers limited exploration of the complex intra-modal temporal dynamics inherent to each modality, failing to extract depression-related cues from a global view. While many methodologies attempt to exploit the multifaceted information encoded across modalities via decision-level or feature-level fusion techniques, they often fall short in effectively representing pairwise inter-modal relationships, which are the key to exploiting the distinct complementary relationship between each modality pair. This paper presents a novel graph-based multimodal fusion approach that can conveniently model intra-modal and inter-modal dynamics using a graph representation. It adopts undirected edges to link not only temporally continuous, pre-extracted features of each modality, but also temporally aligned features across each pair of modalities. This ensures the seamless propagation of global information across the temporal dimension and helps capture pairwise inter-modal dynamics. We conduct experiments on the E-DAIC dataset to demonstrate our approach's effectiveness, achieving an RMSE of 4.80 and a CCC of 0.563, which rival the top-performing method. We also experiment on the AFAR-BSFP dataset to show the generality of our approach. Our code will be made publicly available.

I. INTRODUCTION

Depression is a highly prevalent mental health disorder that exerts a detrimental influence on an individual's feelings and behaviours [36]. Traditional diagnostic methods, primarily reliant on professional interviews, are time-consuming, subjective, and strain limited mental health resources [27], [45]. Recognizing these challenges, recent research has shifted its focus to applying deep learning to automatic depression assessment (ADA). The majority of these ADA studies focus on analyzing video [57], [78], [19], [28], [32], [25], [58], [1], audio [75], [48], [50], and textual data [17], [41], [12] expressed by the target subject, as these modalities can not only be easily recorded but also contain rich depression-related cues.

Since existing ADA approaches frequently attempt to make predictions based on time-series signals (e.g., video, audio and text), a key challenge is how to properly utilize intra-modal temporal dynamics to extract depression-related cues from each modality. However, most of these approaches fail to consider, or deliberately overlook, the temporal relationships within each modality. Some of them [31], [33], [64], [32] eliminate the temporal properties of the raw input by extracting hand-crafted features combined with statistical methods (average, sum, frequency, etc.). Others [59], [78], [18] segment the input into many small chunks, make predictions for each chunk, and then average them to obtain the final prediction. Although several approaches leverage Temporal Convolutional Networks (TCNs) [22], [72] and Recurrent Neural Networks (RNNs) [48], [23], [2] to encode temporal dynamics within each modality, they are limited by one-way induction and long-dependency issues, respectively.

To obtain enhanced depression assessment predictions, researchers have also investigated how to make predictions from multiple modalities. Consequently, it is important to explore the relationships between different modalities (i.e., modelling inter-modal dynamics) in order to optimally combine them for ADA. To achieve this, feature-level fusion methods that concatenate features from different modalities into a single high-dimensional vector [50], [51] have been frequently employed. However, these approaches mistakenly assume that modalities are conditionally independent [46], missing out on capturing important pairwise inter-modal relationships. Alternatively, decision-level fusion strategies mainly combine the predictions of separate uni-modal models, which overlooks the dynamic relationships between modalities [5].

To address the issues discussed above, this paper proposes a novel graph-based multi-modal fusion strategy for the ADA task, which aims to effectively model both intra-modal and inter-modal dynamics. Our approach constructs a multi-modal graph with nodes representing chunk-level depression-relevant features from various modalities. This graph structure includes undirected edges between temporally adjacent nodes within the same modality, allowing each node to consider information from both its past and future states during reasoning. We also establish inter-modal edges between temporally aligned nodes from different modalities, enabling the graph to explicitly capture complex inter-modal relationships. These inter-modal edges also act as shortcuts for information sharing, overcoming the limitations of RNN-style models in efficiently transmitting information across temporally non-adjacent nodes. In summary, the main novelties and contributions of this paper are summarised as follows:
• Our study pioneers the use of graphs to explore and represent the intra-modal and inter-modal dynamics of time-series data in ADA, by modelling the temporal dynamics within the same modality and across different modalities using graph representations.
• With the proposed graph structure, we address the efficiency and long-dependency limitations typically associated with RNN-style models.
• Our methodology explicitly models the pairwise relationship between different modalities within the graph framework, which enables our approach to achieve performance comparable with the SOTA on the E-DAIC dataset.

II. RELATED WORKS

A. Automatic depression assessment approaches

In the domain of ADA, the use of the video, audio, and text modalities has been extensively researched. To capture visual cues, Song et al. [59] employ a histogram-based approach to quantify the average occurrences of human facial primitives, using an MLP for training. In another study [57], they apply a Fourier transform to facial features, achieving a fixed-length spectral representation conducive to their training process. He et al. [26] develop a comprehensive framework that combines local-attention-based and global-attention-based Convolutional Neural Networks to capture facial features at different scales. Pampouchidou et al. [44] focus on dynamic facial expressions, utilizing algorithms including local binary patterns on motion history images. Xie et al. [70] design an end-to-end framework tailored for variable-length videos, integrating a 3D CNN for exploring local temporal patterns and a redundancy-aware self-attention (RAS) scheme for aggregating global features. Melo et al. [19] adopt a two-stream CNN network, separately processing appearance and temporal features, followed by a score fusion method to integrate the predictions from both streams. Wang et al. [69] select key frames from videos, combine adjacent frames within a certain window, and process them through separate LSTM networks, with a global max-pooling layer aggregating the outputs. Finally, Niu et al. [43] segment videos into fixed-length clips, subsequently analyzing them using a spatio-temporal attention (STA) network.

For the audio modality, Ye et al. [72] extract deep features using DeepSpectrum, subsequently integrating these features into a customized Temporal Convolutional Network (TCN), with the final layer employing relational attention classification for output activation. Zhang et al. [76] develop a self-supervised convolutional encoder-decoder network dedicated to extracting features from spectrogram images of audio clips; these features are then processed through a 4-layer, 128-dimensional LSTM network. In a differing approach, Vazquez et al. [66] utilize spectrograms directly as input to their 1D Convolutional Neural Network (CNN), which draws inspiration from DepAudioNet [40]. Likewise, Lin et al. [38] also perform spectrogram extraction from audio data, subsequently channeling these into a 1D-CNN for analysis. Finally, Toto et al. [65] segment raw audio into multiple overlapping sub-clips for feature extraction, train these features using Support Vector Machines (SVMs), and then employ mean pooling to aggregate the outputs for the final prediction.

Regarding the text modality, Chiong et al. [17] extract hand-crafted features using bag-of-words (BoW) and n-gram techniques from a Twitter dataset [55], subsequently utilizing these features in different machine learning classifiers for depression detection. Ray et al. [48] employ the pre-trained Universal Sentence Encoder [14] to derive sentence-level embeddings, which are then padded and processed through a 2-layer Bi-LSTM network. Lin et al. [38] exploit the capabilities of the pre-trained language model ELMo to encode textual data, followed by training with a Bi-LSTM network enhanced with an attention layer. In a similar vein, Shen et al. [56] also utilize ELMo for feature extraction, with the training process leveraging a Bi-LSTM network. Amanat et al. [3] adopt a one-hot encoding technique to quantify the frequency of key depressive words in a pre-cleaned dataset, feeding these features into an LSTM-RNN model for depression assessment. Additionally, Ye et al. [72] employ the Continuous Bag of Words (CBOW) method for text feature extraction, followed by training with a customized transformer model.

For enhanced performance and the utilization of complementary information across different modalities, a variety of multimodal fusion methods have been employed in ADA. Rohanian et al. [51] employ a feature-level fusion strategy that concatenates feature vectors from the video, audio and text modalities at an early stage before feeding the integrated features into a word-level LSTM. In contrast, three other studies [22], [56], [38] implement feature-level fusion at a later stage, after each modality has been processed by its corresponding encoder; those encoders are typically CNN- and LSTM-based networks, and their last layers are concatenated horizontally before being fed into a dense network for prediction. Ringeval et al. [49] utilize a straightforward decision-level fusion technique, averaging the regression outputs of video-based and text-based models to obtain the final depression severity prediction. Conversely, Ye et al. [72] incorporate a learning process into their decision-level fusion approach, combining the predictions of individual modality models with a fully-connected network. Niu et al. [42] implement a novel method involving the concatenation of features extracted from the audio and text modalities of each question-and-answer pair. These concatenated features are treated as vertices in a graph, with edges established between adjacent nodes within a specified context window; this graph is then trained using a customized Graph Attention Network (GAT). While many studies on multimodal fusion for ADA focus predominantly on employing advanced training networks, there remains a lack of research centered on representation learning. Specifically, there is a gap in exploring methodologies for combining features from different modalities more effectively than simple concatenation. Addressing this gap could lead to better performance for ADA systems.

B. Graph-related techniques in multimodal fusion

Recent research has increasingly focused on the utilization of graph-related techniques for multimodal fusion. One study [74] focuses on multimodal neural machine translation, which involves translating sentences from a source to a target language within the context provided by an image.
To establish cross-modal relationships, this study introduces edges between the embeddings of nouns in the sentences and the corresponding object embeddings in the images. In the domain of emotion recognition, Hu et al. [29] combine the Memory Fusion Network (MFN) with a Graph Convolutional Network (GCN) to achieve fusion of multimodal data. Li et al. [37] adopt a graph-centric approach, constructing individual graphs for each modality pair among three distinct modalities; each graph is trained using a Graph Attention Network (GAT), and the outputs are subsequently concatenated and processed through a dense network for the final analysis. Targeting action recognition, Duhme et al. [21] employ a unique fusion strategy for data collected from various sensors: they utilize a GCN to integrate sensor data in both the channel and spatial dimensions, demonstrating the versatility of graph-related techniques in handling complex multimodal datasets.
III. METHODOLOGY

This section introduces a two-module framework for multi-modal depression behaviour graph representation learning. As shown in Figure 1, this framework comprises a Depression-Related Feature Extraction Module and a Graph-Based Multimodal Fusion Module, which are explained separately below. In short, the framework first extracts time-series features using the relevant tools, and these features are then segmented into multiple chunks. The chunks are processed through the respective behaviour encoders to extract deep features. Subsequently, the extracted features are represented as nodes in a graph, with edges formed between them based on predefined rules. This constructed graph is then fed into a GNN that predicts the depression severity.

A. Depression-Related Feature Extraction Module

This module is designed to learn depression-related behaviour features from the raw input data for the subsequent fusion process, as illustrated in the first part of Figure 1. Initially, through the corresponding preprocessing, toolkits and pre-trained models, the raw data of each modality is transformed into a multi-channel time series represented as F_* ∈ R^{t×d}, where * denotes the type of modality, t the time dimension and d the feature dimension. The resulting time series are fed into the appropriate behaviour encoders to extract deep features. The procedure is detailed for each modality below.
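As a concrete illustration of the chunking step, the following minimal sketch splits a modality time series F_* ∈ R^{t×d} into fixed-length chunks. The function name is ours, and discarding the incomplete trailing segment is an assumption, since the paper does not state how remainders are handled.

```python
import numpy as np

def split_into_chunks(feats: np.ndarray, chunk_len: int) -> np.ndarray:
    """Split a (t, d) modality time series into (n_chunks, chunk_len, d) chunks,
    discarding the incomplete trailing segment."""
    n_chunks = feats.shape[0] // chunk_len
    return feats[: n_chunks * chunk_len].reshape(n_chunks, chunk_len, feats.shape[1])
```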
1) Video modality: To summarize complex facial behaviours (facial image sequences) as a set of compact and semantically meaningful representations, we directly employ the widely-used OpenFace 2.0 toolkit [8] to obtain 27 FAU intensities, 6 head poses and 8 gaze directions from each frame. We exclude all frames where the toolkit fails to capture a human face or reports a confidence value below 0.85, then apply min-max normalization to the facial representation time series. We then split the time series into several standardized chunks along the temporal dimension, where each chunk is treated as an independent and discrete sample. The behaviour encoder for extracting depression-related deep features from the video modality draws inspiration from [71] and comprises two main components: the Multi-Scale Behaviour Feature Extraction Module (MB) and the Noise Separation (NS) module. The MB module discerns depression-relevant behaviour primitives across varied scales (from small to large: 4, 8, 16, and 31) by adjusting the filter sizes of its 1D convolutional layers, while the NS module eliminates non-depression-related noise from the features extracted by the MB module.
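A minimal sketch of the frame filtering and normalization just described is given below. It assumes the standard per-frame CSV produced by OpenFace 2.0 (with "success", "confidence", "AU*_r", "pose_*" and "gaze_*" columns); the exact column selection is an assumption rather than the authors' released preprocessing.

```python
import pandas as pd

def load_facial_series(csv_path: str, min_conf: float = 0.85) -> pd.DataFrame:
    """Load an OpenFace 2.0 per-frame CSV, drop unreliable frames, and
    min-max normalise the remaining behaviour-primitive columns."""
    df = pd.read_csv(csv_path)
    df.columns = df.columns.str.strip()  # OpenFace pads some column names with spaces
    df = df[(df["success"] == 1) & (df["confidence"] >= min_conf)]
    cols = [c for c in df.columns if c.startswith(("AU", "pose_", "gaze_"))]
    feats = df[cols]
    return (feats - feats.min()) / (feats.max() - feats.min() + 1e-8)
```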
2) Audio modality: For the audio modality, we use the Hugging Face pre-trained speaker diarization model to separate the interviewee's voice from the raw audio clip for further processing [9]. Each processed audio clip is then filtered with a 4th-order band-pass filter that retains sound between 85 and 3400 Hz (the typical human voice frequency range) [6], [13], [63]. The resulting audio clip is fed into the DeepSpectrum toolkit [4] to obtain 4096-d time-series deep features, which are produced by applying the pre-trained VGG16 model to the audio spectrogram, an image created from 128 mel-frequency bands with a window size of 4 seconds and a hop size of 1 second. Similarly to the video modality, the resulting time series is segmented into chunks for further feature extraction.
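The band-pass step can be reproduced with a standard filter design; the sketch below assumes a Butterworth response applied forward and backward (zero phase), which the paper does not specify beyond the filter order and pass band.

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt

def bandpass_voice(wave: np.ndarray, sr: int,
                   low: float = 85.0, high: float = 3400.0) -> np.ndarray:
    """Apply a 4th-order band-pass filter keeping the 85-3400 Hz voice band."""
    sos = butter(4, [low, high], btype="bandpass", fs=sr, output="sos")
    return sosfiltfilt(sos, wave)
```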
The feature extraction backbone for the audio modality is a CNN-RNN architecture, as it is powerful in managing time-series data. The network commences with multiple CNN blocks, each containing a 1D convolutional layer followed by a batch normalization layer, a Rectified Linear Unit (ReLU) activation and a max-pooling layer. Following the CNN blocks are a 32-d LSTM layer and a fully connected layer with 32 neurons, from which the deep features are obtained.
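A sketch of such a CNN-RNN chunk encoder is shown below. The number of CNN blocks and the intermediate channel width are not specified in the paper, so the values used here (three blocks of 128 channels) are placeholders; only the Conv1d/BatchNorm/ReLU/max-pooling block structure, the 32-d LSTM and the 32-unit fully connected layer follow the description above.

```python
import torch
import torch.nn as nn

class AudioChunkEncoder(nn.Module):
    """CNN-RNN encoder for one chunk of DeepSpectrum features."""

    def __init__(self, in_dim: int = 4096, n_blocks: int = 3, width: int = 128):
        super().__init__()
        blocks, channels = [], in_dim
        for _ in range(n_blocks):
            blocks += [nn.Conv1d(channels, width, kernel_size=3, padding=1),
                       nn.BatchNorm1d(width), nn.ReLU(), nn.MaxPool1d(2)]
            channels = width
        self.cnn = nn.Sequential(*blocks)
        self.lstm = nn.LSTM(width, 32, batch_first=True)
        self.fc = nn.Linear(32, 32)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, chunk_len, in_dim) time-ordered DeepSpectrum features
        z = self.cnn(x.transpose(1, 2)).transpose(1, 2)  # (batch, reduced_len, width)
        _, (h, _) = self.lstm(z)                         # final hidden state of the 32-d LSTM
        return self.fc(h[-1])                            # (batch, 32) chunk-level embedding
```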
3) Text modality: We use the Hugging Face DistilBERT model [53] to extract sentence-level features from the transcript of each interview, resulting in a 768-d representation per sentence. The behaviour encoder of the text modality empirically shares a similar CNN-RNN architecture, with the RNN part being a 32-d GRU layer.
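The sentence-level text features can be obtained as sketched below. The checkpoint name and the mean-pooling of token states are assumptions; the paper only states that DistilBERT yields a 768-d representation per sentence.

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
encoder = AutoModel.from_pretrained("distilbert-base-uncased")

@torch.no_grad()
def sentence_embeddings(sentences: list[str]) -> torch.Tensor:
    """Return one 768-d vector per transcript sentence (mean over non-padding tokens)."""
    batch = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
    hidden = encoder(**batch).last_hidden_state        # (n_sentences, seq_len, 768)
    mask = batch["attention_mask"].unsqueeze(-1)       # (n_sentences, seq_len, 1)
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)
```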
Motivation for using chunk-level features: In our study, the time-series video and audio signals are divided into multiple chunks based on the fact that indications of depression are discernible over short-term intervals regardless of the conversational content [35], [39]. Similar feature processing strategies have also been adopted in [20], [54]; they are particularly advantageous for processing video and audio time series of variable length, and segmenting these into fixed-length chunks both facilitates data augmentation and simplifies the training process. By contrast, for the text modality we employ a different strategy: the text-based behaviour encoder combines all sentence embeddings of the target speech transcript, which generates only one chunk-level representation. This approach stems from the premise that textual data might reveal depressive indicators primarily within specific conversational contexts. For instance, mundane discussions, such as those about an individual's residence, are less likely to provide insights into depressive states when analyzed in isolation.
Fig. 1: Graph-Based Multimodal Fusion Pipeline. Note: The chunk numbers (4, 2, and 1) for each modality depicted in the figure
are for illustrative purposes only and do not represent actual case scenarios.
In contrast, considering the entire speech transcript enables the inclusion of conversational topics and global contexts, which can serve as significant indicators of depression [11], [60], [47].

B. Graph-Based Multimodal Fusion Module

As illustrated in the second part of Figure 1, this module constructs a graph representation to fuse the extracted audio, visual and text features. This is achieved by establishing intra-modal and inter-modal edges that model the intra-modal and inter-modal dynamics. Before defining these edges, we first encode each extracted chunk-level depression-related feature as a node of the graph. For simplicity, we denote the i-th chunk feature in the time dimension of the video, audio and text modalities as D_V^i, D_A^i and D_T^i, where V, A and T represent the video, audio and text modality, respectively. It should be noted that the i-th nodes from different modalities do not necessarily correspond to the same time interval, since the number of chunks varies across modalities.

1) Intra-modal dynamics modelling: To model the intra-modal dynamics, we consider the temporal order of the chunks in the original modality. To encapsulate the inherent temporal inter-relationships among continuous features, undirected edges are established between temporally adjacent node pairs of the same modality. The set of intra-modal edges E_intra can be formulated as:

E_{\mathrm{intra}} = \left\{ (D_*^{i}, D_*^{i+1}) \;\middle|\; 1 \le i < N_* \right\} \quad (1)

where * represents the modality V, A or T and N_* represents the number of nodes defined for modality *. These intra-modal edges serve a dual purpose: (i) they enable seamless access to information from both preceding and succeeding nodes, in contrast to TCN-based methods, which only allow one-way information passing; and (ii) despite the shortcuts between nodes introduced by the inter-modal edges (discussed subsequently), these intra-modal edges preserve the temporal induction inherent in the relationships between nodes. As a result, a comprehensive and detailed understanding of the temporal dynamics within each modality can be achieved.

2) Inter-modal dynamics modelling: To model the inter-modal dynamics among the features extracted from text, audio and video, we also propose to add inter-modal edges that allow nodes from temporally aligned modality pairs (e.g., a pair of audio-visual chunk-level node features) to be interconnected. Specifically, nodes from different modalities are interconnected if they represent equivalent temporal intervals. The inter-modal edges E_inter of a pair of modalities m1 and m2 can be formulated as:

E_{\mathrm{inter}} = \bigcup_{j=1}^{N_{m_2}} \left\{ (D_{m_1}^{i}, D_{m_2}^{j}) \;\middle|\; i \in \left[ 1 + (j-1)\tfrac{N_{m_1}}{N_{m_2}},\; j\tfrac{N_{m_1}}{N_{m_2}} \right] \right\} \quad (2)

where m1 and m2 represent two distinct modalities, while N_{m_1} and N_{m_2} represent the numbers of nodes for modalities m1 and m2, respectively. This formula divides the nodes of the modality with the larger count into segments according to the ratio of N_{m_1} to N_{m_2}, and then links each segment to the corresponding node of the modality with fewer nodes. The design stems from the assumption that features in close temporal proximity across different modalities can offer compensatory or complementary insights [7] for various tasks. Contrary to the methodologies introduced in [15], [51], [42], the proposed multi-modal graph representation does not require manual temporal alignment in the pre-processing phase, which simplifies data preparation. This is because the inter-modal edges ensure that each node can receive and share information with all other nodes within the graph, which allows the information contained in temporally associated nodes of different modalities to propagate to each other wherever they are positioned in the graph.
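Both edge sets can be enumerated directly from the chunk counts. The sketch below implements Eqs. (1) and (2) with 0-based indices; flooring the interval bounds when N_{m_1} is not divisible by N_{m_2} is our assumption, as the paper does not specify the rounding.

```python
import math

def intra_modal_edges(n: int) -> list[tuple[int, int]]:
    """Eq. (1): undirected edges between temporally adjacent chunk nodes of one modality."""
    return [(i, i + 1) for i in range(n - 1)]

def inter_modal_edges(n_m1: int, n_m2: int) -> list[tuple[int, int]]:
    """Eq. (2): link every node of the finer modality m1 to the temporally aligned
    node of the coarser modality m2 (assumes n_m1 >= n_m2)."""
    ratio = n_m1 / n_m2
    edges = []
    for j in range(1, n_m2 + 1):                  # 1-based j, as in the formula
        lo = math.floor((j - 1) * ratio) + 1
        hi = math.floor(j * ratio)
        edges += [(i - 1, j - 1) for i in range(lo, hi + 1)]
    return edges
```

For the chunk counts illustrated in Figure 1 (four video, two audio and one text chunk), inter_modal_edges(4, 2) returns (0, 0), (1, 0), (2, 1), (3, 1), i.e., each audio node is linked to its two temporally aligned video nodes.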
Moreover, the inter-modal edges function as efficient conduits for information exchange between distantly located …

… relationship between features from different modalities: the nodes are classified into different types based on their modality, and the edges are classified into different types based on the types of nodes they link, which results in a heterogeneous graph.
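Such a typed graph maps naturally onto a heterogeneous graph container. The sketch below uses PyTorch Geometric (cited by the paper as [24]) and is an assumed implementation rather than the authors' released code: node features and the Eq. (1)/(2) node pairs are passed in, and reverse edge types are added so that message passing along the undirected edges works in both directions.

```python
import torch
import torch_geometric.transforms as T
from torch_geometric.data import HeteroData

def build_behaviour_graph(feats: dict, intra: dict, inter: dict) -> HeteroData:
    """feats: {'video': (Nv, d) tensor, ...}; intra: {'video': [(i, j), ...], ...};
    inter: {('video', 'audio'): [(i, j), ...], ...} built from Eqs. (1) and (2)."""
    def to_index(pairs):
        return torch.tensor(pairs, dtype=torch.long).t().contiguous()  # (2, E)

    data = HeteroData()
    for mod, x in feats.items():
        data[mod].x = x
    for mod, pairs in intra.items():
        if pairs:
            data[mod, "intra", mod].edge_index = to_index(pairs)
    for (m1, m2), pairs in inter.items():
        if pairs:
            data[m1, "inter", m2].edge_index = to_index(pairs)
    # Add reverse edge types so propagation along undirected edges is bidirectional.
    return T.ToUndirected()(data)
```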
E-DAIC
Modality | Reference             | RMSE | CCC
A+V      | Baseline [49]         | 6.37 | 0.111
A+V      | Sun et al. [61]       | 6.22 | 0.331
A+T      | Kaya et al. [34]      | -    | 0.344
A+V+T    | Rodrigues et al. [50] | 6.11 | 0.403
A+V+T    | Suggu et al. [52]     | 5.36 | 0.457
A+V+T    | Zhao et al. [77]      | 4.11 | -
A+T      | Fan et al. [22]       | 5.91 | 0.430
A+V+T    | Yin et al. [73]       | 5.50 | 0.442
A+V+T    | Fang et al. [23]      | 5.17 | -
A+V+T    | Sun et al. [62]       | -    | 0.583
A+V+T    | Ours                  | 4.80 | 0.563

AFAR-BSFP-DB
Modality | Reference | Acc | F1

The Mean Square Error (MSE) loss function is employed for the regression task on E-DAIC, while Cross-Entropy is used for the classification task on the BSFP dataset. To ensure balanced training on the BSFP dataset, we utilize a weighted random sampling approach, addressing potential data imbalance. Our evaluation on the E-DAIC dataset uses its test set, following the train-test split predefined by the dataset owner. For the AFAR-BSFP dataset, we implement a 5-fold cross-validation approach to evaluate the performance of our methodology.
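A minimal sketch of this training setup is shown below; the batch size is a placeholder, and weighting each sample by inverse class frequency is an assumption about how the weighted random sampling is realised.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, WeightedRandomSampler

mse_loss = nn.MSELoss()              # E-DAIC severity regression
ce_loss = nn.CrossEntropyLoss()      # AFAR-BSFP classification

def balanced_loader(dataset, labels, batch_size: int = 16) -> DataLoader:
    """Oversample minority classes so each batch is approximately class-balanced."""
    labels = torch.as_tensor(labels)
    class_counts = torch.bincount(labels)
    sample_weights = 1.0 / class_counts[labels].float()   # one weight per sample
    sampler = WeightedRandomSampler(sample_weights, num_samples=len(labels), replacement=True)
    return DataLoader(dataset, batch_size=batch_size, sampler=sampler)
```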
potentially complementary information, which in turn provides a richer set of depression-related cues. This also shows that our graph-based fusion method can exploit the subtle inter-relationships between modalities for better prediction.

E-DAIC
Modality | RMSE | CCC
A        | 5.65 | 0.413
V        | 5.83 | 0.324
T        | 5.68 | 0.418
A+V      | 5.23 | 0.552
A+T      | 5.01 | 0.526
V+T      | 5.16 | 0.516
A+V+T    | 4.80 | 0.563

TABLE III: Performance comparison of our multimodal fusion depression severity assessment on E-DAIC across different sets of modalities. The best result is highlighted in bold.

Fig. 3: Predictions of different fusion methods on the E-DAIC dataset. The x-axis denotes the ground truth; the y-axis denotes the predictions.
heterogeneous graphs are pre-equipped with this relational information, facilitating more effective learning. Among the heterogeneous-graph-focused GNNs, the HGT exhibits superior performance across both metrics. This result can be ascribed to its Transformer-based architecture, which is adept at capturing the intricate nuances present among the various modalities. Furthermore, the HAN also shows satisfactory performance. This outcome underscores the versatility and general applicability of our graph representation framework, indicating its robustness across different GNN architectures.
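For reference, a compact HGT-based regressor over the fused graph could look like the sketch below; the two-layer depth, hidden width and mean-pooling readout are our assumptions rather than reported hyper-parameters. It expects a graph whose reverse edge types have been added (e.g., via ToUndirected), so that every node type receives messages.

```python
import torch
from torch import nn
from torch_geometric.nn import HGTConv

class HGTSeverityRegressor(nn.Module):
    """Two HGT layers over the heterogeneous behaviour graph, mean-pooled into
    a single vector and mapped to a severity score."""

    def __init__(self, metadata, hidden: int = 64, heads: int = 2):
        super().__init__()
        self.conv1 = HGTConv(-1, hidden, metadata, heads)     # lazy input dimensions
        self.conv2 = HGTConv(hidden, hidden, metadata, heads)
        self.head = nn.Linear(hidden, 1)

    def forward(self, x_dict, edge_index_dict):
        x_dict = {k: v.relu() for k, v in self.conv1(x_dict, edge_index_dict).items()}
        x_dict = self.conv2(x_dict, edge_index_dict)
        # Average node embeddings within each modality, then across modalities.
        pooled = torch.stack([v.mean(dim=0) for v in x_dict.values()]).mean(dim=0)
        return self.head(pooled)

# Usage: model = HGTSeverityRegressor(data.metadata())
#        score = model(data.x_dict, data.edge_index_dict)
```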
4) Comparison between different chunk sizes: A significant challenge inherent in our methodology is the determination of an appropriate chunk size for the time-series data derived from the various modalities. Given that the text modality necessitates a single-chunk configuration to capture the global conversational context essential for accurate prediction, as previously discussed, our focus is on the chunk-size settings for the video and audio modalities. We conduct a grid search to find the optimal chunk size, where the best setting is selected based on the predictions obtained from the features extracted by the corresponding behaviour encoders.

Fig. 4: Predictions of the corresponding modality encoders with different chunk sizes. The best results are presented.

As illustrated in Figure 4, the empirical results indicate that both the video and audio modalities exhibit optimal performance with a chunk size of 30. However, despite the same chunk size, the temporal coverage of information differs significantly between these two modalities: a single chunk in the video modality encompasses one second of raw video data, whereas in the audio modality it covers 30 seconds of audio data. This difference in temporal coverage can be attributed to the nature of each modality. Visual cues related to depression, such as a frown, can manifest instantaneously and can therefore be effectively captured within short temporal spans. Conversely, vocal indicators of depression, which include a reduced speech rate and a monotonous tone, require a longer time frame for accurate detection. Additionally, the presence of pauses between sentences in the raw audio further necessitates longer chunks: a short temporal chunk might fall entirely on a pause, failing to capture meaningful patterns. Therefore, the determination of the chunk size for each modality reflects a careful consideration of the inherent characteristics of each modality in presenting depression-related cues.

5) Alternative graph structures: In our research, we rigorously evaluated two alternative configurations of the graph structure used for multimodal fusion. The first alternative introduced additional edges between nodes of the same modality that are separated by one intermediate node, effectively creating a 'skip' connection. The second structural variation incorporated self-loops for each node, a design intended to capture explicit self-attention within the graph. However, the performance metrics reveal that neither of these alternative structures outperforms the original graph design. This observation could be attributed to the additional, and potentially erroneous, assumptions introduced by these modifications. Specifically, the creation of 'skip' edges between non-adjacent nodes presupposes a strong correlation between these nodes across the intervening gap; this assumption may not universally apply across all timestamps, potentially leading to inaccuracies in the model's interpretation of the temporal dynamics. Furthermore, the additional self-attention introduced through self-loops, while theoretically promising for emphasizing individual node characteristics, may not effectively contribute to the overall fusion process in the context of multimodal data, where inter-node and inter-modality relationships are more crucial.
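For completeness, the two alternative structures examined above are easy to express on top of the original edge sets; the helper names here are illustrative.

```python
import torch
from torch_geometric.utils import add_self_loops

def skip_edges(n: int, step: int = 2) -> list[tuple[int, int]]:
    """'Skip' variant: extra intra-modal edges between chunk nodes two steps apart."""
    return [(i, i + step) for i in range(n - step)]

# Self-loop variant: append (i, i) pairs to an existing (2, E) edge_index tensor.
edge_index = torch.tensor([[0, 1, 2], [1, 2, 3]])            # chain over four chunk nodes
edge_index_with_loops, _ = add_self_loops(edge_index, num_nodes=4)
```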
V. CONCLUSION AND LIMITATIONS

In this research, we have introduced a graph-based multimodal fusion approach for automatic depression assessment, addressing the challenges of modelling both intra-modal and inter-modal behavioural dynamics. Our approach achieved competitive results on the E-DAIC dataset, providing an RMSE of 4.80 and a CCC value of 0.563, and outperformed the baseline on the AFAR-BSFP dataset. Despite these promising results, our study has certain limitations, notably the utilization of a fixed graph structure, which may not be universally optimal across diverse datasets. Additionally, our investigation was confined to the integration of three behavioural modalities: video, audio and text. Future research directions could focus on dynamic graph construction methods and extend the exploration of our fusion approach to other modalities, including Electroencephalography (EEG), thereby broadening the scope and applicability of our findings in the realm of mental health assessment.

ETHICAL STATEMENT

In this research, we strictly adhere to the highest ethical standards, ensuring that all data were collected from individuals who provided informed consent for their use in research. The datasets are exclusively utilized for the intended research purposes and are not shared beyond the research team. Moreover, all data processing is conducted anonymously, guaranteeing that individual participants cannot be identified, thus upholding the privacy and confidentiality of each participant.
ACKNOWLEDGEMENTS

Haotian Shen undertook this research work as part of his MPhil in ACS degree at the Department of Computer Science and Technology, University of Cambridge. Funding: Siyang Song and Hatice Gunes have been supported by the EPSRC project ARoEQ under grant ref. EP/R030782/1. Open Access: For the purpose of open access, the authors have applied a Creative Commons Attribution (CC BY) license to any Accepted Manuscript version arising. Data Access: This study involves secondary analyses of existing datasets, which are described and cited in the text. Licensing restrictions prevent sharing of the datasets.

REFERENCES

[1] N. I. Abbasi, S. Song, and H. Gunes. Statistical, spectral and graph representations for video-based facial expression recognition in children. In ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1725–1729. IEEE, 2022.
[2] T. Al Hanai, M. M. Ghassemi, and J. R. Glass. Detecting depression with audio/text sequence modeling of interviews. In Interspeech, pages 1716–1720, 2018.
[3] A. Amanat, M. Rizwan, A. R. Javed, M. Abdelhaq, R. Alsaqour, S. Pandya, and M. Uddin. Deep learning for depression detection from textual data. Electronics, 11(5):676, 2022.
[4] S. Amiriparian, M. Gerczuk, S. Ottl, N. Cummins, M. Freitag, S. Pugachevskiy, A. Baird, and B. Schuller. Snore sound classification using image-based deep spectrum features. In Interspeech 2017, pages 3512–3516. ISCA, Aug. 2017.
[5] P. K. Atrey, M. A. Hossain, A. El Saddik, and M. S. Kankanhalli. Multimodal fusion for multimedia analysis: a survey. Multimedia Systems, 16:345–379, 2010.
[6] R. J. Baken and R. F. Orlikoff. Clinical measurement of speech and voice. Cengage Learning, 2000.
[7] T. Baltrušaitis, C. Ahuja, and L.-P. Morency. Multimodal machine learning: A survey and taxonomy. IEEE Transactions on Pattern Analysis and Machine Intelligence, 41(2):423–443, 2018.
[8] T. Baltrusaitis, A. Zadeh, Y. C. Lim, and L.-P. Morency. OpenFace 2.0: Facial behavior analysis toolkit. In 2018 13th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2018), pages 59–66. IEEE, 2018.
[9] H. Bredin, R. Yin, J. M. Coria, G. Gelly, P. Korshunov, M. Lavechin, D. Fustes, H. Titeux, W. Bouaziz, and M.-P. Gill. pyannote.audio: neural building blocks for speaker diarization. In ICASSP 2020, IEEE International Conference on Acoustics, Speech, and Signal Processing, Barcelona, Spain, May 2020.
[10] S. Brody, U. Alon, and E. Yahav. How attentive are graph attention networks? arXiv preprint arXiv:2105.14491, 2021.
[11] W. Bucci and N. Freedman. The language of depression. Bulletin of the Menninger Clinic, 45(4):334, 1981.
[12] S. G. Burdisso, M. L. Errecalde, and M. Montes y Gómez. Using text classification to estimate the depression level of reddit users. Journal of Computer Science & Technology, 21, 2021.
[13] J. C. Catford et al. A practical introduction to phonetics. Clarendon Press Oxford, 1988.
[14] D. Cer, Y. Yang, S.-y. Kong, N. Hua, N. Limtiaco, R. S. John, N. Constant, M. Guajardo-Cespedes, S. Yuan, C. Tar, et al. Universal sentence encoder. arXiv preprint arXiv:1803.11175, 2018.
[15] M. Chen, S. Wang, P. P. Liang, T. Baltrušaitis, A. Zadeh, and L.-P. Morency. Multimodal sentiment analysis with word-level fusion and reinforcement learning. In Proceedings of the 19th ACM International Conference on Multimodal Interaction, ICMI '17, pages 163–171, New York, NY, USA, 2017. Association for Computing Machinery.
[16] J. Cheong, M. Spitale, and H. Gunes. "It's not fair!" – fairness for a small dataset of multi-modal dyadic mental well-being coaching. In IEEE International Conference on Affective Computing and Intelligent Interaction (IEEE ACII'23), pages 1–8, 2023.
[17] R. Chiong, G. S. Budhi, S. Dhakal, and F. Chiong. A textual-based featuring approach for depression detection using machine learning classifiers and social media texts. Computers in Biology and Medicine, 135:104499, 2021.
[18] W. C. de Melo, E. Granger, and A. Hadid. Combining global and local convolutional 3D networks for detecting depression from facial expressions. In 2019 14th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2019), pages 1–8. IEEE, 2019.
[19] W. C. De Melo, E. Granger, and M. B. Lopez. Encoding temporal information for automatic depression recognition from facial analysis. In ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1080–1084. IEEE, 2020.
[20] H. Dibeklioğlu, Z. Hammal, and J. F. Cohn. Dynamic multimodal measurement of depression severity using deep autoencoding. IEEE Journal of Biomedical and Health Informatics, 22(2):525–536, 2017.
[21] M. Duhme, R. Memmesheimer, and D. Paulus. Fusion-GCN: Multimodal action recognition using graph convolutional networks. In DAGM German Conference on Pattern Recognition, pages 265–281. Springer, 2021.
[22] W. Fan, Z. He, X. Xing, B. Cai, and W. Lu. Multi-modality depression detection via multi-scale temporal dilated CNNs. In Proceedings of the 9th International on Audio/Visual Emotion Challenge and Workshop, AVEC '19, pages 73–80, New York, NY, USA, 2019. Association for Computing Machinery.
[23] M. Fang, S. Peng, Y. Liang, C.-C. Hung, and S. Liu. A multimodal fusion model with multi-level attention mechanism for depression detection. Biomedical Signal Processing and Control, 82:104561, 2023.
[24] M. Fey and J. E. Lenssen. Fast graph representation learning with PyTorch Geometric. In ICLR Workshop on Representation Learning on Graphs and Manifolds, 2019.
[25] M. Gavrilescu and N. Vizireanu. Predicting depression, anxiety, and stress levels from videos using the facial action coding system. Sensors, 19(17):3693, 2019.
[26] L. He, J. C.-W. Chan, and Z. Wang. Automatic depression recognition using CNN with attention mechanism from videos. Neurocomputing, 422:165–175, 2021.
[27] L. He, D. Jiang, and H. Sahli. Multimodal depression recognition with dynamic visual and audio cues. In 2015 International Conference on Affective Computing and Intelligent Interaction (ACII), pages 260–266. IEEE, 2015.
[28] L. He, M. Niu, P. Tiwari, P. Marttinen, R. Su, J. Jiang, C. Guo, H. Wang, S. Ding, Z. Wang, et al. Deep learning for depression recognition with audiovisual cues: A review. Information Fusion, 80:56–86, 2022.
[29] J. Hu, Y. Liu, J. Zhao, and Q. Jin. MMGCN: Multimodal fusion via deep graph convolution network for emotion recognition in conversation. arXiv preprint arXiv:2107.06779, 2021.
[30] Z. Hu, Y. Dong, K. Wang, and Y. Sun. Heterogeneous graph transformer. In Proceedings of The Web Conference 2020, pages 2704–2710, 2020.
[31] M. R. Islam, M. A. Kabir, A. Ahmed, A. R. M. Kamal, H. Wang, and A. Ulhaq. Depression detection from social network data using machine learning techniques. Health Information Science and Systems, 6:1–12, 2018.
[32] S. Jaiswal, S. Song, and M. Valstar. Automatic prediction of depression and anxiety from behaviour and personality attributes. In 2019 8th International Conference on Affective Computing and Intelligent Interaction (ACII), pages 1–7. IEEE, 2019.
[33] H. Jiang, B. Hu, Z. Liu, G. Wang, L. Zhang, X. Li, and H. Kang. Detecting depression using an ensemble logistic regression model based on multiple speech features. Computational and Mathematical Methods in Medicine, 2018, 2018.
[34] H. Kaya, D. Fedotov, D. Dresvyanskiy, M. Doyran, D. Mamontov, M. Markitantov, A. A. Akdag Salah, E. Kavcar, A. Karpov, and A. A. Salah. Predicting depression and emotions in the cross-roads of cultures, para-linguistics, and non-linguistics. In Proceedings of the 9th International on Audio/Visual Emotion Challenge and Workshop, pages 27–35, 2019.
[35] D. Keltner and A. M. Kring. Emotion, social function, and psychopathology. Review of General Psychology, 2(3):320–342, 1998.
[36] Y. Lee, R.-M. Ragguett, R. B. Mansur, J. J. Boutilier, J. D. Rosenblat, A. Trevizol, E. Brietzke, K. Lin, Z. Pan, M. Subramaniapillai, et al. Applications of machine learning algorithms to predict therapeutic outcomes in depression: A meta-analysis and systematic review. Journal of Affective Disorders, 241:519–532, 2018.
[37] J. Li, X. Wang, G. Lv, and Z. Zeng. GraphMFT: A graph network based multimodal fusion technique for emotion recognition in conversation. Neurocomputing, page 126427, 2023.
[38] L. Lin, X. Chen, Y. Shen, and L. Zhang. Towards automatic depression detection: A BiLSTM/1D CNN-based model. Applied Sciences, 10(23):8701, 2020.
[39] L.-S. A. Low, N. C. Maddage, M. Lech, L. B. Sheeber, and N. B. Allen. Detection of clinical depression in adolescents' speech during family interactions. IEEE Transactions on Biomedical Engineering, 58(3):574–586, 2010.
[40] X. Ma, H. Yang, Q. Chen, D. Huang, and Y. Wang. DepAudioNet: An efficient deep model for audio based depression classification. In Proceedings of the 6th International Workshop on Audio/Visual Emotion Challenge, pages 35–42, 2016.
[41] K. Milintsevich, K. Sirts, and G. Dias. Towards automatic text-based estimation of depression through symptom prediction. Brain Informatics, 10(1):1–14, 2023.
[42] M. Niu, K. Chen, Q. Chen, and L. Yang. HCAG: A hierarchical context-aware graph attention model for depression detection. In ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 4235–4239. IEEE, 2021.
[43] M. Niu, J. Tao, B. Liu, J. Huang, and Z. Lian. Multimodal spatiotemporal representation for automatic depression level detection. IEEE Transactions on Affective Computing, 2020.
[44] A. Pampouchidou, M. Pediaditis, E. Kazantzaki, S. Sfakianakis, I.-A. Apostolaki, K. Argyraki, D. Manousos, F. Meriaudeau, K. Marias, F. Yang, et al. Automated facial video-based recognition of depression and anxiety symptom severity: cross-corpus validation. Machine Vision and Applications, 31(4):30, 2020.
[45] V. Patel, S. Saxena, C. Lund, G. Thornicroft, F. Baingana, P. Bolton, D. Chisholm, P. Y. Collins, J. L. Cooper, J. Eaton, et al. The Lancet Commission on global mental health and sustainable development. The Lancet, 392(10157):1553–1598, 2018.
[46] D. Ramachandram and G. W. Taylor. Deep multimodal learning: A survey on recent advances and trends. IEEE Signal Processing Magazine, 34(6):96–108, 2017.
[47] N. Ramirez-Esparza, C. Chung, E. Kacewic, and J. Pennebaker. The psychology of word use in depression forums in English and in Spanish: Testing two text analytic approaches. In Proceedings of the International AAAI Conference on Web and Social Media, volume 2, pages 102–108, 2008.
[48] A. Ray, S. Kumar, R. Reddy, P. Mukherjee, and R. Garg. Multi-level attention network using text, audio and video for depression prediction. In Proceedings of the 9th International on Audio/Visual Emotion Challenge and Workshop, pages 81–88, 2019.
[49] F. Ringeval, B. Schuller, M. Valstar, N. Cummins, R. Cowie, L. Tavabi, M. Schmitt, S. Alisamir, S. Amiriparian, E.-M. Messner, et al. AVEC 2019 workshop and challenge: state-of-mind, detecting depression with AI, and cross-cultural affect recognition. In Proceedings of the 9th International on Audio/Visual Emotion Challenge and Workshop, pages 3–12, 2019.
[50] M. Rodrigues Makiuchi, T. Warnita, K. Uto, and K. Shinoda. Multimodal fusion of BERT-CNN and gated CNN representations for depression detection. In Proceedings of the 9th International on Audio/Visual Emotion Challenge and Workshop, pages 55–63, 2019.
[51] M. Rohanian, J. Hough, M. Purver, et al. Detecting depression with word-level multimodal fusion. In Interspeech, pages 1443–1447, 2019.
[52] G. S. Saggu, K. Gupta, K. Arya, and C. R. Rodriguez. DepressNet: A multimodal hierarchical attention mechanism approach for depression detection. Int. J. Eng. Sci., 15(1):24–32, 2022.
[53] V. Sanh, L. Debut, J. Chaumond, and T. Wolf. DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. arXiv preprint arXiv:1910.01108, 2019.
[54] S. Sardari, B. Nakisa, M. N. Rastgoo, and P. Eklund. Audio based depression detection using convolutional autoencoder. Expert Systems with Applications, 189:116076, 2022.
[55] G. Shen, J. Jia, L. Nie, F. Feng, C. Zhang, T. Hu, T.-S. Chua, W. Zhu, et al. Depression detection via harvesting social media: A multimodal dictionary learning solution. In IJCAI, pages 3838–3844, 2017.
[56] Y. Shen, H. Yang, and L. Lin. Automatic depression detection: An emotional audio-textual corpus and a GRU/BiLSTM-based model. In ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 6247–6251. IEEE, 2022.
[57] S. Song, S. Jaiswal, L. Shen, and M. Valstar. Spectral representation of behaviour primitives for depression analysis. IEEE Transactions on Affective Computing, 13(2):829–844, 2020.
[58] S. Song, Y. Luo, T. Tumer, M. Valstar, and H. Gunes. Loss relaxation strategy for noisy facial video-based automatic depression recognition. ACM Transactions on Computing for Healthcare, 2024.
[59] S. Song, L. Shen, and M. Valstar. Human behaviour-based automatic depression analysis using hand-crafted statistics and deep learned spectral features. In 2018 13th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2018), pages 158–165. IEEE, 2018.
[60] S. W. Stirman and J. W. Pennebaker. Word use in the poetry of suicidal and nonsuicidal poets. Psychosomatic Medicine, 63(4):517–522, 2001.
[61] H. Sun, J. Liu, S. Chai, Z. Qiu, L. Lin, X. Huang, and Y. Chen. Multi-modal adaptive fusion transformer network for the estimation of depression level. Sensors, 21(14):4764, 2021.
[62] H. Sun, H. Wang, J. Liu, Y.-W. Chen, and L. Lin. CubeMLP: An MLP-based model for multimodal sentiment analysis and depression estimation. In Proceedings of the 30th ACM International Conference on Multimedia, pages 3722–3729, 2022.
[63] J. Sundberg. Articulatory interpretation of the "singing formant". The Journal of the Acoustical Society of America, 55(4):838–844, 1974.
[64] M. Tlachac and E. Rundensteiner. Screening for depression with retrospectively harvested private versus public text. IEEE Journal of Biomedical and Health Informatics, 24(11):3326–3332, 2020.
[65] E. Toto, M. Tlachac, F. L. Stevens, and E. A. Rundensteiner. Audio-based depression screening using sliding window sub-clip pooling. In 2020 19th IEEE International Conference on Machine Learning and Applications (ICMLA), pages 791–796. IEEE, 2020.
[66] A. Vázquez-Romero and A. Gallardo-Antolín. Automatic detection of depression in speech using ensemble convolutional neural networks. Entropy, 22(6):688, 2020.
[67] P. Velickovic, G. Cucurull, A. Casanova, A. Romero, P. Lio, Y. Bengio, et al. Graph attention networks. stat, 1050(20):10–48550, 2017.
[68] X. Wang, H. Ji, C. Shi, B. Wang, Y. Ye, P. Cui, and P. S. Yu. Heterogeneous graph attention network. In The World Wide Web Conference, pages 2022–2032, 2019.
[69] Y. Wang, J. Ma, B. Hao, P. Hu, X. Wang, J. Mei, and S. Li. Automatic depression detection via facial expressions using multiple instance learning. In 2020 IEEE 17th International Symposium on Biomedical Imaging (ISBI), pages 1933–1936. IEEE, 2020.
[70] W. Xie, L. Liang, Y. Lu, C. Wang, J. Shen, H. Luo, and X. Liu. Interpreting depression from question-wise long-term video recording of SDS evaluation. IEEE Journal of Biomedical and Health Informatics, 26(2):865–875, 2021.
[71] J. Xu, S. Song, K. Kusumam, H. Gunes, and M. Valstar. Two-stage temporal modelling framework for video-based depression recognition using graph representation. arXiv preprint arXiv:2111.15266, 2021.
[72] J. Ye, Y. Yu, Q. Wang, W. Li, H. Liang, Y. Zheng, and G. Fu. Multi-modal depression detection based on emotional audio and evaluation text. Journal of Affective Disorders, 295:904–913, 2021.
[73] S. Yin, C. Liang, H. Ding, and S. Wang. A multi-modal hierarchical recurrent neural network for depression detection. In Proceedings of the 9th International on Audio/Visual Emotion Challenge and Workshop, AVEC '19, pages 65–71, New York, NY, USA, 2019. Association for Computing Machinery.
[74] Y. Yin, F. Meng, J. Su, C. Zhou, Z. Yang, J. Zhou, and J. Luo. A novel graph-based multi-modal fusion encoder for neural machine translation. arXiv preprint arXiv:2007.08742, 2020.
[75] L. Zhang, J. Driscol, X. Chen, and R. Hosseini Ghomi. Evaluating acoustic and linguistic features of detecting depression sub-challenge dataset. In Proceedings of the 9th International on Audio/Visual Emotion Challenge and Workshop, pages 47–53, 2019.
[76] P. Zhang, M. Wu, H. Dinkel, and K. Yu. DEPA: Self-supervised audio embedding for depression detection. In Proceedings of the 29th ACM International Conference on Multimedia, pages 135–143, 2021.
[77] Z. Zhao and K. Wang. Unaligned multimodal sequences for depression assessment from speech. In 2022 44th Annual International Conference of the IEEE Engineering in Medicine & Biology Society (EMBC), pages 3409–3413, 2022.
[78] X. Zhou, K. Jin, Y. Shang, and G. Guo. Visually interpretable representation learning for depression recognition from facial images. IEEE Transactions on Affective Computing, 11(3):542–552, 2018.