Depression Recognition Over Fusion of Visual and Vocal Expression Using Artificial Intelligence
Corresponding Author:
Vinaya Gohokar
School of Electronics and Communication Engineering
Dr Vishwanath Karad MIT World Peace University
Pune, India
Email: [email protected]
1. INTRODUCTION
Around 350 million people worldwide suffer from depression, and the prevalence is rising
every year, according to research on the condition by the World Health Organization (WHO).
Epidemiological research has found the lifetime prevalence of depression to be around 6.9% [1]. Depression is a
psychiatric condition that causes extreme sadness and makes it difficult for sufferers to function normally in
social situations. More gravely, depression can lead to behaviors such as self-harm and suicide [2]. It may
also affect significant life events such as the timing of marriage and academic success. The majority of
persons who receive treatment for depression, according to [3], do not recover from it. The coronavirus
disease 2019 (COVID-19) pandemic resulted in unprecedented measures such as social distancing, widespread
isolation, and states of health emergency. This disrupted daily life and caused a significant increase in
anxiety and depression symptoms. Over time, the prevalence of depression has more than doubled
worldwide. A person's wellbeing depends on accurate diagnosis and treatment. Traditional diagnostic
techniques for depression rely on self-reporting, which can be subjective and unreliable, resulting in incorrect
diagnoses and insufficient care. Using multiple data sources to create an accurate and dependable detection
system could be a way to enhance depression diagnosis and therapy. Currently, the standard clinical
approaches for diagnosing depression are based primarily on the clinical judgments of psychiatrists and
on patient self-report. The diagnostic and statistical manual of mental disorders (DSM-5), the Hamilton
depression scale (HAMD), the Beck depression inventory (BDI), and the patient health questionnaire (PHQ-9)
are currently the most widely used clinical diagnostic tools. Correct diagnosis remains challenging, and these
methods have limitations because the diagnostic procedure does not involve objective measurements [4].
The Beck depression inventory II (BDI-II) [5] is a self-administered survey used to gauge the
severity of depression in adults and adolescents. Speech and facial expressions are used as biomarkers of
an individual's level of depression, which can be related to the BDI-II score as depicted in Table 1.
The survey contains 21 multiple-choice questions, each focusing on a different depressive symptom, such as
mood, guilt, appetite, or sleep. Based on the person's answers to these questions, the BDI-II score is
computed, with each question receiving a number between 0 and 3 that reflects the severity of the symptom.
Higher scores on the BDI-II scale, which ranges from 0 to 63, indicate more severe depressive symptoms.
Mental health practitioners frequently use the BDI-II to diagnose depression, track therapeutic progress,
and assess the efficacy of various treatments.
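As a minimal illustration of how this scoring works, the sketch below sums 21 item ratings into a total score and maps it to a severity band. Table 1 is not reproduced here, so the cut-offs used are the commonly cited BDI-II bands and should be treated as an assumption; the ratings are placeholder values.

def bdi_ii_score(item_ratings):
    # 21 items, each rated 0-3; the total score is their sum (0-63)
    if len(item_ratings) != 21 or any(r not in (0, 1, 2, 3) for r in item_ratings):
        raise ValueError("BDI-II expects 21 item ratings, each between 0 and 3")
    return sum(item_ratings)

def bdi_ii_severity(score):
    # commonly cited BDI-II cut-offs (assumed; the exact bands are those of Table 1)
    if score <= 13:
        return "minimal"
    if score <= 19:
        return "mild"
    if score <= 28:
        return "moderate"
    return "severe"

ratings = [1, 0, 2, 1, 0, 1, 2, 0, 1, 1, 0, 2, 1, 0, 1, 2, 1, 0, 1, 2, 1]
print(bdi_ii_score(ratings), bdi_ii_severity(bdi_ii_score(ratings)))  # 20 moderate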
Numerous data modalities, including facial images, electroencephalogram (EEG) signals, human voice,
behaviour, and text, have frequently been employed in research using artificial intelligence (AI) technology
to identify depression. Recent background research on the identification of depression shows the many
techniques and traits used by researchers, allowing for meaningful comparisons
between them. Table 2 gives a thorough summary of the state of the art. Video data is used as the feature
source in [6], [7]; these works use the AVEC datasets and report accuracies of around 70%.
Table 2. The comparison of the different traits and methods used to evaluate depression
Authors | Features | Dataset | Algorithm | Result | Demerits
Uddin et al. [6] | Video | AVEC2013, AVEC2014 | CNN, BiLSTM | MAE: 7.04, RMSE: 8.93 | Relies solely on video data to estimate depression levels.
Shah et al. [7] | Video | AVEC2014 | VGG16 | Accuracy: 68.75% | Only utilized facial features.
Singh et al. [8] | Video, audio | DAIC-WOZ | PCA, SVM | F1: 0.82 | Does not provide a comparative analysis of its results with other existing methods.
Pampouchidou et al. [9] | Video, audio | Own dataset | LBP, HOG | Accuracy: 70% | A small patient group was engaged in the dataset.
Zhou et al. [10] | Video, audio | D-Vlog | LSTM, CNN | F1: 0.75 | The dataset was gathered from a particular age group and demographic.
Niu et al. [2] | Audio, text | DAIC-WOZ | BLSTM, TCNN | F1: 0.9074 | Potential technical complexities or resource requirements.
Jo and Kwak [11] | Audio, text | DAIC-WOZ | Bi-LSTM | Accuracy: 90% | Text data models perform less well than speech data models.
Rustagi et al. [12] | Facial expression, text | FER2013 | CNN, RNN | Accuracy: 85% | Only text and visual traits were considered.
Shekar and Kumar [13] | Face circumference, audio | Facial expression dataset from Kaggle | CNN | Accuracy: 64% | Only utilized facial features, not considering head posture or eye movement.
Wang et al. [14] | Audio | DAIC-WOZ | CNN, GAN | MAE: 10.21, RMSE: 7.96 | Focuses only on audio features and does not consider other potential sources of information.
Changes in speech patterns, facial expressions, body language, and general behavior are a few of the
symptoms of depression. By including audio and video data in the dataset, researchers and automated systems
can detect nonverbal signs and subtle behavioral markers that are frequently missed in conventional text- or
survey-based assessments. Voice inflection, speech cadence, hesitations, facial expressions, and body language
can all reveal important details about a person's emotional state and aid in the detection of depressive
symptoms. The incorporation of audio and video data not only improves the sensitivity and accuracy of
algorithms for detecting depression, but also offers a more complete picture of a person's wellbeing, aiding
the understanding of the illness and the development of more accurate diagnoses for clinicians and
researchers. Video with accompanying audio is used by the researchers in [8]−[10]. Various algorithms,
including principal component analysis (PCA), support vector machine (SVM), long short-term memory
(LSTM), and convolutional neural networks (CNN), are implemented to improve recognition accuracy.
The authors of [2], [11] have implemented algorithms based on audio and text analysis; even though 90%
accuracy is claimed, these algorithms involve greater technical complexity. Facial expressions and text are
used in [12], [13] to develop the system. Audio features are the only criteria used in [14] to achieve the
recognition. The datasets frequently used in the literature include:
− DAIC-WOZ
− AVEC 2013
− AVEC 2014
− SEMAINE
− D-Vlog
The distress analysis interview corpus-wizard of oz (DAIC-WOZ) is a set of psychiatric interviews
used to evaluate people's mental health. This subset of the DAIC corpus was created using a virtual agent
that can communicate with people and recognize verbal and nonverbal indicators of mental illness. The
collection includes 189 audio and video clips as well as in-depth survey responses. The corpus includes WOZ
interviews that were used to record and analyze samples for different verbal and nonverbal traits. The data
can be utilized to create automated systems for the early detection and prevention of mental health issues and
is a useful resource for researchers studying depression and related disorders. A relatively new area of study
involves determining how people are feeling by analyzing video frames, audio, and text [15]−[18].
The AVEC 2013 dataset [19]−[23], commonly known as the "audio-visual depression corpus,"
contains 150 videos of people recorded with a camera and microphone while they perform depression-
oriented tasks in a human-computer interaction setting. Each recording contains only one person, and the
recordings cover a total of 84 subjects. Eighteen subjects appear in three recordings, 31 in two, and 34 in
one; the duration of a recording varies from 20 to 50 minutes, with an average of 25 minutes, and the total
duration of all clips is about 240 hours. The age of the individuals in the data varies between 18 and 63 years,
with an average of 31.5 years and a standard deviation of 12.3 years.
Only two of the original 14 tasks, chosen because of strong participant agreement, are
included in the AVEC 2014 selection. The result is 300 videos, ranging in length from 6 seconds to 4 minutes
and 8 seconds. The assessment of anxiety and sadness is the focus of the two tasks, which are each
recorded independently. Five pairs of previously unreleased clips were additionally included to replace the
few videos that were deemed unfit for the competition. The AVEC 2014 source videos are the same as
those used in AVEC 2013. The two tasks in this category are "Free Form" and "The North Wind"; in the
"North Wind" task, participants read aloud an excerpt from the German fable "Die Sonne und der Wind"
(the sun and the north wind), while in the "Free Form" task they answer open-ended questions.
The SEMAINE corpus is used to detect anxiety from facial cues. The corpus consists of
dialogues between individuals and a simulated agent and is designed to explore the natural social cues that
appear when interacting with an artificial human or robot. Sensitive artificial listener (SAL) technology was
used for data collection. The corpus contains a total of 95 session files with both audio and video, video only,
and separate audio files that can be used for testing, interpretation, and case studies. The D-Vlog dataset
[1], [24]−[27] is a collection of vlog videos capturing facial expressions. The length and quality of the videos
vary, and the creators come from various social and mental health backgrounds. These videos have been
published on YouTube and are open to the general audience. The dataset, which consists of more than 500
videos, has been used in research to build automatic algorithms that identify depression and anxiety based on
speech patterns.
The proposed work focuses on employing artificial intelligence to recognize depression scales
from a fusion of visual and vocal expressions. A method that combines LSTM for audio analysis with
CNN for visual analysis is developed, which is a novel approach to identifying depression using a
combination of visual and vocal emotions. Useful features and models are identified to increase the precision
of depression evaluation and identification. The influence of poor data quality on model performance is also
addressed. The remainder of the text is structured as follows: the proposed model is thoroughly covered in
section 2, section 3 discusses the results, and the conclusion of our work and future research
directions are presented in section 4.
2. PROPOSED METHOD
Through physiological research, it has been demonstrated that people with depression behave
differently from healthy people in terms of their speech and facial expressions. In order to predict a
person's state of depression, our suggested method makes use of a multimodal representation framework that
can integrate both visual and vocal expressions. As seen in Figure 1, the proposed method consists of
different algorithms for audio- and video-based depression recognition. Input, preprocessing, feature
extraction, training, and classification are all stages in the system's conventional pipeline. The LSTM
algorithm is used to train and classify the audio-based depression identification system, while the CNN
method is used to train and classify the video-based depression recognition system.
The depression state is decided after combining the output probabilities of the two modalities.
The suggested strategy increases the accuracy and robustness of identifying depression by utilizing the
advantages of both visual and auditory modalities. Standard performance measures including the confusion
matrix, precision, recall, F-measure, and accuracy are used to assess how well the suggested technique
performs. Our results demonstrate that the fusion of visual and vocal expressions significantly improves the
accuracy of depression detection compared to using only one modality. The proposed CNN technique for
diagnosing depression includes preprocessing, feature extraction, training, and classification phases.
The video frames are preprocessed by scaling them to a uniform size and normalizing their pixel values
to maintain consistency in the data. The features are extracted using the histogram of oriented
gradients (HOG) approach. In the CNN model, which contains 4 convolutional layers, 4 max pooling layers,
and 2 fully connected layers, the rectified linear unit (ReLU) acts as the activation function. The model is
trained using backpropagation and a binary cross-entropy loss function. Overfitting is avoided by employing
dropout regularization and early stopping techniques.
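A minimal sketch of this preprocessing and HOG feature extraction step is given below, assuming OpenCV and scikit-image are available. The 112×112 aligned face region follows the feature description later in this section; the HOG parameters (orientation bins, cell and block sizes) are assumptions, not values reported in the paper.

import cv2
import numpy as np
from skimage.feature import hog

def preprocess_frame(frame_bgr, size=(112, 112)):
    # convert to grayscale, resize to a uniform size, and normalize to zero mean / unit variance
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    resized = cv2.resize(gray, size)
    normalized = (resized.astype(np.float32) - resized.mean()) / (resized.std() + 1e-8)
    return resized, normalized

def hog_features(gray_face_112):
    # 9 orientation bins with 8x8-pixel cells and 2x2-cell blocks (assumed settings)
    return hog(gray_face_112, orientations=9, pixels_per_cell=(8, 8),
               cells_per_block=(2, 2), feature_vector=True)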
The DAIC-WOZ database is used to diagnose psychological distress disorders such as anxiety,
depression, and post-traumatic stress disorder. Before extracting features from the collected visual data for
depression detection, pre-processing is necessary. In this step, the video frames are resized and normalized to
ensure consistency in the data. Normalization involves adjusting the pixel values so that they have a mean of
zero and a standard deviation of one, which helps to reduce any variation caused by differences in lighting or
camera settings. These pre-processing steps help to ensure that the data is in a consistent format and ready for
feature extraction. A variety of video attributes from the OpenFace framework are retrieved and analyzed
(a sketch of loading these outputs follows below). These include:
− Facial landmarks: 68 points on the face with estimated 2D and 3D coordinates from the video
− HOG features computed on the aligned 112×112 face region, and gaze direction estimates for both eyes
− Head pose: the 3D position and orientation of the head
Additionally, based on FACET software, estimates of emotion and facial action units are produced.
We specifically consider the following emotions: "anger," "contempt," "disgust," "joy," "fear," "neutral,"
"sadness," "surprise," "confusion," and "frustration."
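The sketch below groups per-frame OpenFace output by feature family using pandas. It is illustrative only: the file name is hypothetical, and the column prefixes follow OpenFace 2.x CSV conventions (x_0..x_67 and y_0..y_67 landmarks, gaze_*, pose_*, AU*_r), which may differ between versions; the FACET emotion estimates are produced separately and are not part of this file.

import pandas as pd

df = pd.read_csv("participant_openface.csv")  # hypothetical OpenFace output file
df.columns = df.columns.str.strip()           # OpenFace CSV headers are often space-padded

landmarks = df.filter(regex=r"^(x_|y_|X_|Y_|Z_)\d+$")  # 2D and 3D facial landmark coordinates
gaze = df.filter(regex=r"^gaze_")                       # gaze direction vectors and angles
head_pose = df.filter(regex=r"^pose_")                  # 3D head position and rotation
action_units = df.filter(regex=r"^AU\d+_r$")            # facial action unit intensities

print(landmarks.shape, gaze.shape, head_pose.shape, action_units.shape)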
Docker is used to expedite and simplify the feature extraction process. By encapsulating the feature
extraction algorithms and their dependencies within a Docker container, the consistency and reproducibility
of the process across various machines and environments is assured. This approach streamlines the
deployment and management of feature extraction workflows, resulting in consistent and reliable outcomes.
A training set and a testing set were created from the dataset. The pre-processed video frames are the CNN
model’s input. The output of the CNN model is a binary classification indicating whether the individual is
depressed or not. The CNN model is trained using 4 convolutional layers using ReLU as the activation
Indonesian J Elec Eng & Comp Sci, Vol. 34, No. 3, June 2024: 1753-1759
Indonesian J Elec Eng & Comp Sci ISSN: 2502-4752 1757
function. ReLU, which is applied per pixel and recreates all negative values in the feature map by zero, is an
element-wise process. We utilized a flattening layer, two fully linked layers, and a convolutional layer after
that. The fully connected layer’s purpose is to use these features to categorize the input image into different
classes depending on the training dataset. The CNN model then applies a binary cross-entropy loss function
and backpropagation. The goal of this loss function is to reduce discrepancy between the predicted and actual
classifications, resulting in improved model accuracy. To prevent overfitting, the model is trained with early
stopping and dropout regularization techniques. These approaches help improve the model’s generalization
ability and enhance its performance in processing unseen data.
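A minimal Keras sketch of the described architecture is given below: 4 convolutional layers with ReLU, 4 max-pooling layers, a flattening layer, 2 fully connected layers, dropout regularization, a binary cross-entropy loss, and early stopping. The filter counts, kernel sizes, dropout rate, optimizer, and the 112×112 grayscale input shape are assumptions; only the layer counts and training choices come from the description above.

from tensorflow.keras import layers, models, callbacks

def build_visual_cnn(input_shape=(112, 112, 1)):
    model = models.Sequential()
    model.add(layers.Input(shape=input_shape))
    # four convolutional layers with ReLU, each followed by max pooling
    for filters in (32, 64, 128, 256):
        model.add(layers.Conv2D(filters, (3, 3), padding="same", activation="relu"))
        model.add(layers.MaxPooling2D((2, 2)))
    model.add(layers.Flatten())
    model.add(layers.Dense(128, activation="relu"))   # first fully connected layer
    model.add(layers.Dropout(0.5))                    # dropout regularization
    model.add(layers.Dense(1, activation="sigmoid"))  # second fully connected layer: depressed / not
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    return model

early_stop = callbacks.EarlyStopping(monitor="val_loss", patience=5, restore_best_weights=True)
# model = build_visual_cnn()
# model.fit(x_train, y_train, validation_data=(x_val, y_val),
#           epochs=50, batch_size=32, callbacks=[early_stop])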
3.1. Fusion
The fusion result of the CNN and LSTM models for depression detection is achieved by taking the
average of the output from both models. This approach enables the strengths of both models to be combined,
resulting in a more accurate and robust prediction. The final prediction is based on a threshold of 0.5, where
if the output value is less than 0.5, the result is “not depressed,” and if the output value is greater than 0.5, the
result is “depressed.” This approach provides a binary classification of depression detection that is easily
interpretable and actionable for healthcare professionals. By combining the strengths of both models, we can
achieve higher accuracy in detecting depression, ultimately improving the diagnosis and treatment of this
prevalent mental health disorder.
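The late-fusion rule described above can be summarized in a few lines: the per-subject output probabilities of the video CNN and the audio LSTM are averaged and thresholded at 0.5. The probability values in this sketch are placeholders, not results from the paper.

import numpy as np

cnn_probs = np.array([0.82, 0.35, 0.58, 0.12])   # visual model: P(depressed) per subject (placeholders)
lstm_probs = np.array([0.74, 0.41, 0.46, 0.20])  # vocal model: P(depressed) per subject (placeholders)

fused = (cnn_probs + lstm_probs) / 2.0           # simple average of the two modalities
labels = np.where(fused > 0.5, "depressed", "not depressed")
print(list(zip(fused.round(2), labels)))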
4. CONCLUSION
The proposed system is an innovative method for identifying depression that incorporates verbal and
nonverbal cues. Using CNN for facial expression analysis and LSTM for speech expression processing,
accurate classification of individuals as depressed or not depressed is achieved. The suggested approach is
evaluated on videos of college students presenting frontal faces after being trained using depression features
from the DAIC dataset. The outcomes reveal that the proposed method identifies depression with
high accuracy and dependability. This automated technique is extremely valuable for identifying and treating
depression at an early stage. By offering prompt treatment, it can lessen the effects of depression on people's
lives, such as disturbances to daily routines, academic performance, and social interactions. In terms of
future scope, there are opportunities to improve feature extraction at a more sophisticated level. Investigating
additional audio aspects, such as reaction time, pause frequency, and silence rate, may help us better
comprehend the symptoms of depression.
REFERENCES
[1] K. Mao et al., “Prediction of depression severity based on the prosodic and semantic features with bidirectional LSTM and time
distributed CNN,” IEEE Transactions on Affective Computing, vol. 14, no. 3, pp. 2251–2265, Jul. 2023, doi:
10.1109/TAFFC.2022.3154332.
[2] M. Niu, J. Tao, B. Liu, J. Huang, and Z. Lian, “Multimodal spatiotemporal representation for automatic depression level
detection,” IEEE Transactions on Affective Computing, vol. 14, no. 1, pp. 294–307, Jan. 2023, doi:
10.1109/TAFFC.2020.3031345.
[3] A. Jan, H. Meng, Y. F. B. A. Gaus, and F. Zhang, “Artificial intelligent system for automatic depression level analysis through
visual and vocal expressions,” IEEE Transactions on Cognitive and Developmental Systems, vol. 10, no. 3, pp. 668–680, Sep.
2018, doi: 10.1109/TCDS.2017.2721552.
[4] M. A. Wani, M. A. Elaffendi, K. A. Shakil, A. S. Imran, and A. A. A. El-Latif, “Depression screening in humans with AI and
deep learning techniques,” IEEE Transactions on Computational Social Systems, vol. 10, no. 4, pp. 2074–2089, Aug. 2023, doi:
10.1109/TCSS.2022.3200213.
[5] A. T. Beck, R. A. Steer, and G. Brown, “Beck depression inventory-II,” APA PsycTests, vol. 78, no. 2, pp. 490–498, 1996,
[Online]. Available: https://psycnet.apa.org/doiLanding?doi=10.1037%2Ft00742-000.
[6] M. A. Uddin, J. B. Joolee, and Y. K. Lee, “Depression level prediction using deep spatiotemporal features and multilayer Bi-
LTSM,” IEEE Transactions on Affective Computing, vol. 13, no. 2, pp. 864–870, Apr. 2022, doi: 10.1109/TAFFC.2020.2970418.
[7] A. Shah, S. Mota, and A. Panchal, “Depression detection using visual cues,” vol. IX, no. V, pp. 1–8, 2020.
[8] N. Singh, R. Kapoor, R. Kapoor, and S. Arora, “Automated major depressive disorder (AMDD) detection using audio-visual
feature discriptor and CNN,” in Proceedings - International Conference on Augmented Intelligence and Sustainable Systems,
ICAISS 2022, Nov. 2022, pp. 1456–1459, doi: 10.1109/ICAISS55157.2022.10010907.
[9] A. Pampouchidou et al., “Automated facial video-based recognition of depression and anxiety symptom severity: cross-corpus
validation,” Machine Vision and Applications, vol. 31, no. 4, p. 30, May 2020, doi: 10.1007/s00138-020-01080-7.
[10] L. Zhou, Z. Liu, Z. Shangguan, X. Yuan, Y. Li, and B. Hu, “TAMFN: time-aware attention multimodal fusion network for
depression detection,” IEEE Transactions on Neural Systems and Rehabilitation Engineering, vol. 31, pp. 669–679, 2023, doi:
10.1109/TNSRE.2022.3224135.
[11] A. H. Jo and K. C. Kwak, “Diagnosis of depression based on four-stream model of Bi-LSTM and CNN from audio and text
information,” IEEE Access, vol. 10, pp. 134113–134135, 2022, doi: 10.1109/ACCESS.2022.3231884.
[12] A. Rustagi, C. Manchanda, N. Sharma, and I. Kaushik, “Depression anatomy using combinational deep neural network,”
Advances in Intelligent Systems and Computing, vol. 1166, pp. 19–33, 2021, doi: 10.1007/978-981-15-5148-2_3.
[13] P. H. Shekar and K. S. Kumar, “Emotion recognition and depression detection using deep learning,” International Research
Journal of Engineering and Technology, pp. 3031– 3036, 2020.
[14] Z. Wang, L. Chen, L. Wang, and G. Diao, “Recognition of audio depression based on convolutional neural network and
generative antagonism network model,” IEEE Access, vol. 8, pp. 101181–101191, 2020, doi: 10.1109/ACCESS.2020.2998532.
[15] M. Gheorghe, S. Mihalache, and D. Burileanu, “Using deep neural networks for detecting depression from speech,” in European
Signal Processing Conference, Sep. 2023, pp. 411–415, doi: 10.23919/EUSIPCO58844.2023.10289973.
[16] H. Solieman and E. A. Pustozerov, “The detection of depression using multimodal models based on text and voice quality
features,” in Proceedings of the 2021 IEEE Conference of Russian Young Researchers in Electrical and Electronic Engineering,
ElConRus 2021, Jan. 2021, pp. 1843–1848, doi: 10.1109/ElConRus51938.2021.9396540.
[17] A. Bailey and M. D. Plumbley, “Gender bias in depression detection using audio features,” in European Signal Processing
Conference, Aug. 2021, vol. 2021-August, pp. 596–600, doi: 10.23919/EUSIPCO54536.2021.9615933.
[18] Z. Zhao, Z. Bao, Z. Zhang, N. Cummins, H. Wang, and B. Schuller, “Hierarchical attention transfer networks for depression
assessment from speech,” in ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings,
May 2020, vol. 2020-May, pp. 7159–7163, doi: 10.1109/ICASSP40776.2020.9053207.
[19] S. Prabhudesai, M. Parmar, A. Mhaske, and S. Bhagwat, “Depression detection and analysis using deep learning: study and
comparative analysis,” in Proceedings - 2021 IEEE 10th International Conference on Communication Systems and Network
Technologies, CSNT 2021, Jun. 2021, pp. 570–574, doi: 10.1109/CSNT51715.2021.9509707.
[20] Z. Dai, Q. Li, Y. Shang, and X. Wang, “Depression detection based on facial expression, audio, and gait,” in ITNEC 2023 - IEEE
6th Information Technology, Networking, Electronic and Automation Control Conference, Feb. 2023, pp. 1568–1573, doi:
10.1109/ITNEC56291.2023.10082163.
[21] C. Fan, Z. Lv, S. Pei, and M. Niu, “Csenet: complex squeeze-and-excitation network for speech depression level prediction,” in
ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings, May 2022, vol. 2022-May,
pp. 546–550, doi: 10.1109/ICASSP43922.2022.9746011.
[22] H. Kaya, F. Eyben, A. A. Salah, and B. Schuller, “CCA based feature selection with application to continuous depression
recognition from acoustic speech features,” in ICASSP, IEEE International Conference on Acoustics, Speech and Signal
Processing - Proceedings, May 2014, pp. 3729–3733, doi: 10.1109/ICASSP.2014.6854298.
[23] W. C. De Melo, E. Granger, and A. Hadid, “Combining global and local convolutional 3D networks for detecting depression from
facial expressions,” in Proceedings - 14th IEEE International Conference on Automatic Face and Gesture Recognition, FG 2019,
May 2019, pp. 1–8, doi: 10.1109/FG.2019.8756568.
[24] J. Yoon, C. Kang, S. Kim, and J. Han, “D-vlog: multimodal vlog dataset for depression detection,” Proceedings of the 36th AAAI
Conference on Artificial Intelligence, AAAI 2022, vol. 36, pp. 12226–12234, 2022, doi: 10.1609/aaai.v36i11.21483.
[25] Y. Tao, M. Yang, Y. Wu, K. Lee, A. Kline, and B. Hu, “Depressive semantic awareness from vlog facial and vocal streams via
spatio-temporal transformer,” Digital Communications and Networks, pp. 2352–8648, Mar. 2023, doi:
10.1016/j.dcan.2023.03.007.
[26] W. Zheng, L. Yan, and F. Y. Wang, “Two birds with one stone: knowledge-embedded temporal convolutional transformer for
depression detection and emotion recognition,” IEEE Transactions on Affective Computing, vol. 14, no. 4, pp. 2595–2613, Oct.
2023, doi: 10.1109/TAFFC.2023.3282704.
[27] B. G. Dadiz and C. R. Ruiz, “Detecting depression in videos using uniformed local binary pattern on facial features,” in Lecture
Notes in Electrical Engineering, vol. 481, 2019, pp. 413–422.